# Advanced Analytics & Modelling Basic Python Training Course - Session 1

This is the first of two sessions on the basic usage of Python 3. This session focuses on general usage of the language.

## Course Structure
#### Session 1
**Language overview of Python**
* introduction
* installing and importing packages
* variable declaration - integers, floats, strings, datetime
* lists and dictionaries
* functions
* for loops and while loops
* numpy and pandas with pycba (leading into session 2)

#### Session 2
**Using Pandas at CBA (Case Study)**
* pandas package
    * obtaining data through pycba
    * basic data cleaning
    * data summary
* matplotlib
    * basic visualisation    
* [advanced] plotly

### Session 1

#### 1. Introduction

Python was first released in 1991 with an emphasis on **readability**, i.e. it is succinct and "forces" good coding habits. As a result, it is one of the most used general-purpose languages.

AWB (https://jupyterhub.aiaa.ai.cba/) already has Anaconda installed for you. If you would like to use Python locally, please go to https://www.anaconda.com/download/ and download the **Python 3** version.

Python is an object-oriented language, i.e. just as English is written as {Subject} {Verb} {Object}, Python is written with the same core structure:
{Subject}.{Verb}({Objects})
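
As a small illustration of the {Subject}.{Verb}({Objects}) pattern, the sketch below calls methods on a string and on a list. The variable names are made up for this example.

```python
# {Subject}.{Verb}({Objects}) in practice: the object comes first,
# then the method (the "verb"), then any arguments in brackets.
greeting = "hello world"

shouted = greeting.upper()     # subject: greeting, verb: upper, no objects
words = greeting.split(" ")    # subject: greeting, verb: split, object: " "

numbers = [3, 1, 2]
numbers.append(4)              # subject: numbers, verb: append, object: 4

print(shouted)   # HELLO WORLD
print(words)     # ['hello', 'world']
print(numbers)   # [3, 1, 2, 4]
```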

In [1]:
##### This is test code
import os
from datetime import datetime as dt # dt is now the alias for datetime.datetime

x = os.listdir('./') # functions must be prefixed with their module name (os.listdir), unless imported directly
print(x)

for f in x:
    if os.path.isfile('./' + f): # Strings are concatenated with +, no more paste and paste0 from R
                          # Code blocks are defined by indentation - forces the user to indent properly
        print('{0} is a file'.format(f))
    elif os.path.isdir('./' + f): # elif for else if
        print('{0} is a directory'.format(f))
    else:
        print("I don't know what {0} is".format(f))

['Untitled.ipynb', 'Python_Session1.ipynb', 'sklearn-tutorial.ipynb', '.ipynb_checkpoints']
Untitled.ipynb is a file
Python_Session1.ipynb is a file
sklearn-tutorial.ipynb is a file
.ipynb_checkpoints is a directory


#### 2. Installing and Importing Packages

##### Installing Packages

Packages may be installed globally (when running locally/at home) or per user (on AWB) using pip. Jupyter can run console commands by putting ! in front of the command.

In [2]:
!pip install --user tensorflow

Looking in indexes: https://artifactory.ai.cba/artifactory/api/pypi/pypi/simple, https://artifactory.ai.cba/artifactory/api/pypi/python-zbi/simple
Collecting tensorflow
  Downloading https://artifactory.ai.cba/artifactory/api/pypi/pypi/packages/22/cc/ca70b78087015d21c5f3f93694107f34ebccb3be9624385a911d4b52ecef/tensorflow-1.12.0-cp36-cp36m-manylinux1_x86_64.whl (83.1MB)
    100% |████████████████████████████████| 83.1MB 19.7MB/s 
Collecting tensorboard<1.13.0,>=1.12.0 (from tensorflow)
  Downloading https://artifactory.ai.cba/artifactory/api/pypi/pypi/packages/e0/d0/65fe48383146199f16dbd5999ef226b87bce63ad5cd73c840cf722637969/tensorboard-1.12.0-py3-none-any.whl (3.0MB)
    100% |████████████████████████████████| 3.1MB 43.9MB/s 
Collecting termcolor>=1.1.0 (from tensorflow)
Collecting astor>=0.6.0 (from tensorflow)
  Downloading https://artifactory.ai.cba/artifactory/api/pypi/pypi/packages/35/6b/11530768cac581a12952a2aad00e1526b89d242d0b9f59534ef6e6a1752f/astor-0.

Packages can be precompiled into a wheel file (.whl) and stored on repositories such as PyPI. pip can also install a wheel file from a local path:

In [3]:
!pip install --user ./../h2o-3.22.0.1-py2.py3-none-any.whl

Requirement './../h2o-3.22.0.1-py2.py3-none-any.whl' looks like a filename, but the file does not exist
Looking in indexes: https://artifactory.ai.cba/artifactory/api/pypi/pypi/simple, https://artifactory.ai.cba/artifactory/api/pypi/python-zbi/simple
Processing /home/chonghsi/projects/h2o-3.22.0.1-py2.py3-none-any.whl
Could not install packages due to an EnvironmentError: [Errno 2] No such file or directory: '/home/chonghsi/projects/h2o-3.22.0.1-py2.py3-none-any.whl'


##### Importing Packages

Packages may be imported with the import command.

In [4]:
# Essentially the library() function in R

import sys # Vanilla import command
import pandas as pd # Import with an alias, so now pd works, but pandas doesn't
from datetime import datetime as dt # Importing only a specific part of a package, eg functions/subpackages
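
To make the difference between the three import forms concrete, here is a minimal sketch using the standard-library math module. Note that math stays available under its full name here only because it is also imported plainly on the first line.

```python
import math              # plain import: call functions as math.sqrt
import math as m         # alias: m.sqrt now works as well
from math import sqrt    # import one name directly: call it as sqrt

print(math.sqrt(16))   # 4.0
print(m.sqrt(16))      # 4.0
print(sqrt(16))        # 4.0

# All three names point at the same underlying objects.
print(m is math)            # True
print(sqrt is math.sqrt)    # True
```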

Ad hoc code (your own .py files) can also be imported, provided its directory is in sys.path.
sys.path can be extended for the current session by:

In [5]:
sys.path.insert(0, '/home/chonghsi') # insert at 0th place => overtake everything else's priority

In [6]:
import modelfunctions # .py at the end is not required

ModuleNotFoundError: No module named 'modelfunctions'
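
The ModuleNotFoundError above means no modelfunctions.py was found in any directory on sys.path. The sketch below shows the mechanism working end to end. It writes a throwaway module into a temporary directory, so the module name aam_demo_helpers and its double function are purely hypothetical.

```python
import importlib
import os
import sys
import tempfile

# Create a temporary directory holding a tiny module (hypothetical name).
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "aam_demo_helpers.py"), "w") as f:
    f.write("def double(value):\n    return value * 2\n")

# Without this line the import below would raise ModuleNotFoundError.
sys.path.insert(0, module_dir)
importlib.invalidate_caches()  # make sure the new file is picked up

import aam_demo_helpers  # .py at the end is not required

print(aam_demo_helpers.double(21))  # 42
```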

#### 3. Variable Declaration

In Python, like R, you can simply use **=** for assignment without explicit definitions of type

In [7]:
x = 1 # Note: Python integers have arbitrary precision, so they do not overflow
y = 2.5
z = "a piece of string" # Python strings are immutable sequences of characters

print(x)
print(y)
print(z)

1
2.5
a piece of string
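
Because types are inferred rather than declared, it can help to inspect them with type(). The following is a small sketch with made-up variable names, chosen so they do not overwrite x, y and z from the cell above.

```python
count = 1
ratio = 2.5
label = "a piece of string"

print(type(count))  # <class 'int'>
print(type(ratio))  # <class 'float'>
print(type(label))  # <class 'str'>

# A name can be rebound to a value of a different type.
count = "now a string"
print(type(count))  # <class 'str'>

# Integers grow as needed instead of overflowing.
big = 2 ** 100
print(big)  # 1267650600228229401496703205376

# Strings can be indexed, but not modified in place.
print(label[0])  # a
try:
    label[0] = "A"
except TypeError as error:
    print("strings are immutable:", error)
```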


Simple mathematical operations can be done without any packages, but more complicated scalar operations (e.g. log, exp) require the **math** module, whilst array-wise operations should be done with **NumPy**.

Dates and times are handled with the standard-library modules **datetime** (which provides the date, time, datetime and timedelta classes) and/or **time**.

In [9]:
import math
import datetime

s = x + y
p = x * y
d = x / y

s2 = math.log(x + y)
p2 = math.exp(x * y) # stored under a new name so the product p above is not overwritten

z = "a piece of string" + " another piece of string"

dt = datetime.datetime(2018, 11, 15, 12, 31, 59)
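
A few more built-in operators and datetime helpers are sketched below. The variable names and the example dates are illustrative only.

```python
import datetime

print(7 / 2)    # 3.5  (true division always returns a float)
print(7 // 2)   # 3    (floor division)
print(7 % 2)    # 1    (remainder)
print(2 ** 10)  # 1024 (power)

start = datetime.datetime(2018, 11, 15, 12, 31, 59)
end = start + datetime.timedelta(days=4, hours=2)

# Subtracting two datetimes gives a timedelta.
elapsed = end - start
print(elapsed.total_seconds())  # 352800.0

# strftime formats a datetime as text; strptime parses text into a datetime.
print(start.strftime("%Y-%m-%d"))  # 2018-11-15
parsed = datetime.datetime.strptime("2018-11-30", "%Y-%m-%d")
print(parsed.year, parsed.month, parsed.day)  # 2018 11 30
```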

Augmented assignment (e.g. +=, *=) is supported in Python

In [10]:
x = 1
y = 1

print("Before addition:")
print("x = " + str(x))
print("y = " + str(y))

x += 2
y = y + 2

print("After addition:")
print("x = " + str(x))
print("y = " + str(y))

print("Similarly...")
x *= 3
print(x)
x /= 3
print(x)
x -= 1
print(x)

print("For strings:")
z += " a third piece of string"
print(z)

dt += datetime.timedelta(days = 4)
print(dt)

Before addition:
x = 1
y = 1
After addition:
x = 3
y = 3
Similarly...
9
3.0
2.0
For strings:
a piece of string another piece of string a third piece of string
2018-11-19 12:31:59


[Advanced] Alternatively, in Cython, you can declare variables as C variables, which may be faster for certain operations. Their types must be declared explicitly.

In [11]:
# Python version
def primes_py(nb_primes):
    ###############
    # Pythonic declaration here
    p = [0] * 1000
    ###############
    
    if nb_primes > 1000:
        nb_primes = 1000
    
    len_p = 0
    n = 2
    while len_p < nb_primes:
        # Is n prime?
        for i in p[:len_p]:
            if n % i == 0:
                break

        # If no break occurred in the loop, we have a prime.
        else:
            p[len_p] = n
            len_p += 1
        n += 1

    # Let's return the result in a python list:
    result_as_list  = [prime for prime in p[:len_p]]
    return result_as_list

In [12]:
%load_ext cython

In [13]:
%%cython -a
def primes_cy(int nb_primes):
    ###############
    # Cythonic declaration here
    cdef int n, i, len_p
    cdef int p[1000]
    ###############
    if nb_primes > 1000:
        nb_primes = 1000

    len_p = 0  # The current number of elements in p.
    n = 2
    while len_p < nb_primes:
        # Is n prime?
        for i in p[:len_p]:
            if n % i == 0:
                break

        # If no break occurred in the loop, we have a prime.
        else:
            p[len_p] = n
            len_p += 1
        n += 1

    # Let's return the result in a python list:
    result_as_list  = [prime for prime in p[:len_p]]
    return result_as_list

In [14]:
%timeit primes_py(1000)

56.4 ms ± 2.95 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [15]:
%timeit primes_cy(1000)

1.48 ms ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


Cython is faster than Python in this case, by roughly 38 times (56.4 ms vs 1.48 ms)
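
The timeit command used in the cells above is an IPython magic, so it only works inside Jupyter/IPython. In a plain Python script, a similar measurement can be sketched with the standard-library timeit module. The function sum_of_squares is a made-up workload for this example, and the timing printed will vary between machines.

```python
import timeit

def sum_of_squares(limit):
    """Toy workload: add up the squares of 0 .. limit-1."""
    total = 0
    for value in range(limit):
        total += value * value
    return total

# number=100 runs the call 100 times and returns the total time in seconds.
runs = 100
seconds = timeit.timeit(lambda: sum_of_squares(1000), number=runs)
print("{0:.6f} s per call".format(seconds / runs))
```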

#### 4. Lists and Dictionaries

Lists in Python are defined with square brackets, with elements separated by commas. Objects in a list can be of different types.

Note that list indices in Python start from 0, whereas they start from 1 in R.

Dictionaries in Python are defined with curly brackets, with the structure {key: value}

In [16]:
import math 

list_1 = ["a", "b", "c", 1, math.pi]
dict_1 = {
    "a":"apple",
    "b":"boy",
    "c":"cat",
    1:"one",
    "tomorrow":(datetime.datetime.now() + datetime.timedelta(days = 1)).date()
}
print(list_1)
print(dict_1)

['a', 'b', 'c', 1, 3.141592653589793]
{'a': 'apple', 'b': 'boy', 'c': 'cat', 1: 'one', 'tomorrow': datetime.date(2018, 12, 12)}


Lists can be extended with + (or +=) and another list, or by using list.append(new_object) to add a single object or list.extend(another_list) to add every element of another list

Dictionaries can be extended with the update function, or element by element using dict[new_key] = new_object

In [17]:
list_1 += [datetime.datetime(2018, 11, 30)]
dict_1.update({
    "rudolf":"red nose reindeer", 
    "santa":"Christmas"
})
print(list_1)
print(dict_1)

['a', 'b', 'c', 1, 3.141592653589793, datetime.datetime(2018, 11, 30, 0, 0)]
{'a': 'apple', 'b': 'boy', 'c': 'cat', 1: 'one', 'tomorrow': datetime.date(2018, 12, 12), 'rudolf': 'red nose reindeer', 'santa': 'Christmas'}


In [18]:
list_1.append(5)
dict_1[5] = "five"
print(list_1)
print(dict_1)

['a', 'b', 'c', 1, 3.141592653589793, datetime.datetime(2018, 11, 30, 0, 0), 5]
{'a': 'apple', 'b': 'boy', 'c': 'cat', 1: 'one', 'tomorrow': datetime.date(2018, 12, 12), 'rudolf': 'red nose reindeer', 'santa': 'Christmas', 5: 'five'}
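
The difference between append, extend and + is easy to miss, so here is a short sketch with made-up lists. It also shows that update overwrites keys that already exist.

```python
nested = ["a", "b"]
nested.append(["c", "d"])    # the whole list is added as ONE element
print(nested)                # ['a', 'b', ['c', 'd']]
print(len(nested))           # 3

flat = ["a", "b"]
flat.extend(["c", "d"])      # each element is added separately
print(flat)                  # ['a', 'b', 'c', 'd']

combined = ["a", "b"] + ["c", "d"]   # + builds a new list
print(combined == flat)              # True

# update() adds new keys and overwrites existing ones.
fruit = {"a": "apple"}
fruit.update({"a": "avocado", "b": "banana"})
print(fruit)  # {'a': 'avocado', 'b': 'banana'}
```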


Items in lists can be accessed using the square bracket and the index number 

Again, Python list indices start from 0

Dictionary objects can be accessed using the square bracket and the key

In [19]:
print(list_1[5])
print(dict_1["c"])

2018-11-30 00:00:00
cat
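
Accessing an index or key that does not exist raises an error. The sketch below, using made-up data, shows the two error types and the dictionary get method, which returns a fallback value instead.

```python
animals = ["cat", "dog", "emu"]
sounds = {"cat": "meow", "dog": "woof"}

print(animals[0])    # cat  (first element)
print(animals[-1])   # emu  (negative indices count from the end)

try:
    animals[10]
except IndexError:
    print("index 10 is out of range")

try:
    sounds["emu"]
except KeyError:
    print("'emu' is not a key")

# get() returns None, or a chosen default, when the key is missing.
print(sounds.get("emu"))             # None
print(sounds.get("emu", "unknown"))  # unknown
```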


Lists can also be sliced to retrieve a range of indices at once; the start index is included and the end index is excluded

In [20]:
print(list_1[0:5])
print(list_1[:5])
print(list_1[2:])
print(list_1[2:-1])

['a', 'b', 'c', 1, 3.141592653589793]
['a', 'b', 'c', 1, 3.141592653589793]
['c', 1, 3.141592653589793, datetime.datetime(2018, 11, 30, 0, 0), 5]
['c', 1, 3.141592653589793, datetime.datetime(2018, 11, 30, 0, 0)]
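
Slices also accept negative indices and an optional step, and they work on strings too. A brief sketch with an illustrative list:

```python
values = list(range(10))   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

print(values[2:5])    # [2, 3, 4]        start included, end excluded
print(values[-3:])    # [7, 8, 9]        last three elements
print(values[::2])    # [0, 2, 4, 6, 8]  every second element
print(values[::-1])   # [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]  reversed copy

# Strings are sequences as well, so they can be sliced the same way.
print("a piece of string"[2:7])  # piece
```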


The in operator checks whether an object exists in a list

In [21]:
print("a" in list_1)
print("zebra" in list_1)

True
False
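
The same operator works on dictionaries and strings, with one catch: on a dictionary it checks the keys, not the values. A small sketch with made-up data:

```python
prices = {"apple": 1.5, "pear": 2.0}

print("apple" in prices)            # True   (keys are searched)
print(1.5 in prices)                # False  (values are not searched)
print(1.5 in prices.values())       # True   (search the values explicitly)
print("zebra" not in prices)        # True   (negated form)

# On strings, in checks for a substring.
print("pie" in "a piece of string")  # True
```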


**[Advanced]** A list comprehension builds a new list by running a for loop (optionally with an if filter) inside square brackets over a list/dictionary/iterator

[{operation on x} for x in list_name]

In [22]:
print([type(x) for x in list_1])
print([x + 1 for x in list_1 if type(x) in [int, float]])

[<class 'str'>, <class 'str'>, <class 'str'>, <class 'int'>, <class 'float'>, <class 'datetime.datetime'>, <class 'int'>]
[2, 4.141592653589793, 6]
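
To show what a comprehension replaces, the sketch below writes the same filter-and-transform step as an ordinary loop and as a comprehension, then uses the same syntax for a dictionary and a set. All names are illustrative.

```python
numbers = [1, 2, 3, 4, 5, 6]

# Ordinary loop: square the even numbers.
squares_loop = []
for n in numbers:
    if n % 2 == 0:
        squares_loop.append(n ** 2)

# Equivalent list comprehension.
squares_comp = [n ** 2 for n in numbers if n % 2 == 0]
print(squares_comp)                  # [4, 16, 36]
print(squares_comp == squares_loop)  # True

# Dictionary comprehension: {key: value for ...}
square_by_number = {n: n ** 2 for n in numbers}
print(square_by_number[4])  # 16

# Set comprehension: duplicates are dropped.
remainders = {n % 3 for n in numbers}
print(sorted(remainders))   # [0, 1, 2]
```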


#### 5. Functions

Functions are created with the def keyword, using the structure:

In [23]:
def function_name(parameter1, parameter2, defaultparameter = "default_value"):
    x = parameter1 + parameter2
    return "x is now {0} and the defaultparameter is {1}".format(x, defaultparameter)

In [24]:
function_name(5, 6)

'x is now 11 and the defaultparameter is default_value'
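
A few further function features are sketched below: returning several values, calling with keyword arguments, and accepting a variable number of arguments. The function names are hypothetical and exist only for this example.

```python
def min_max(values):
    """Return the smallest and largest value as a tuple."""
    return min(values), max(values)

def power(base, exponent=2):
    """Raise base to exponent; exponent defaults to 2."""
    return base ** exponent

def total(*numbers, **options):
    """Sum any number of positional arguments; 'scale' is an optional keyword."""
    return sum(numbers) * options.get("scale", 1)

low, high = min_max([3, 9, 1])     # the returned tuple is unpacked into two names
print(low, high)                   # 1 9

print(power(3))                    # 9  (default exponent used)
print(power(exponent=3, base=2))   # 8  (keyword arguments can be in any order)

print(total(1, 2, 3))              # 6
print(total(1, 2, 3, scale=10))    # 60
```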

#### 6. For and While Loops

For loops can iterate over lists or any other iterable, e.g. range(n), which runs from 0 to n-1 (n objects in total); dictionary.items(), which yields (key, value) pairs; and enumerate(list), which yields (i, j) pairs where i is the index of the current iteration (starting from 0) and j is the i-th object from the list

In [29]:
print("List comprehension")
x = [i for i in range(5)]
print(x)

print("====First Loop====")
for i in x:
    print(i)
    
print("====Second Loop====")
for i in range(5):
    print(i)
    
print("====Third Loop====")
for i, j in dict_1.items():
    print("{0}:{1}".format(i, j))
    
print("====Fourth Loop====")
for i, _ in dict_1.items():
    print("{0}".format(i))

List comprehension
[0, 1, 2, 3, 4]
====First Loop====
0
1
2
3
4
====Second Loop====
0
1
2
3
4
====Third Loop====
a:apple
b:boy
c:cat
1:one
tomorrow:2018-12-12
rudolf:red nose reindeer
santa:Christmas
5:five
====Fourth Loop====
a
b
c
1
tomorrow
rudolf
santa
5
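
enumerate is mentioned above but not used in the loops, so here is a short sketch of it together with zip, which walks through two lists in parallel. The lists are made up for illustration.

```python
colours = ["red", "green", "blue"]
codes = ["#ff0000", "#00ff00", "#0000ff"]

# enumerate yields (index, item) pairs, counting from 0 by default.
for position, colour in enumerate(colours):
    print("{0}: {1}".format(position, colour))

pairs = list(enumerate(colours))
print(pairs)  # [(0, 'red'), (1, 'green'), (2, 'blue')]

# start=1 changes the first index.
numbered = list(enumerate(colours, start=1))
print(numbered)  # [(1, 'red'), (2, 'green'), (3, 'blue')]

# zip pairs up items from two lists position by position.
for colour, code in zip(colours, codes):
    print("{0} -> {1}".format(colour, code))

code_by_colour = dict(zip(colours, codes))
print(code_by_colour["green"])  # #00ff00
```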


while loops run for as long as a condition holds, or can be set on an always-true condition (while True) and exited from inside the loop using break

In [30]:
print("First Loop")
i = 0
while i < 10:
    print(i)
    i += 1
    
print("Second Loop") 
i = 0
while True:
    print(i)
    i += 1
    if i >= 10:
        break

First Loop
0
1
2
3
4
5
6
7
8
9
Second Loop
0
1
2
3
4
5
6
7
8
9
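
while loops are most useful when the number of iterations is not known in advance. The sketch below uses a hypothetical helper, halvings_until_below, to illustrate this, and also shows continue, which skips to the next iteration.

```python
def halvings_until_below(value, threshold):
    """Count how many times value must be halved to drop below threshold."""
    steps = 0
    while value >= threshold:
        value /= 2
        steps += 1
    return steps

print(halvings_until_below(100, 1))  # 7

# continue skips the rest of the loop body for the current iteration.
odd_numbers = []
counter = 0
while counter < 10:
    counter += 1
    if counter % 2 == 0:
        continue  # even number: skip the append below
    odd_numbers.append(counter)

print(odd_numbers)  # [1, 3, 5, 7, 9]
```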


#### 7. NumPy and Simple Pandas

NumPy is the numerical package for Python, built for fast number crunching; its core routines are implemented in C rather than pure Python (note Cython reports significant performance uplifts https://notes-on-cython.readthedocs.io/en/latest/std_dev.html)

Pandas is the package built on NumPy to process data

PyCBA is the internal package written to access data from OMNIA and Teradata, with documentation at https://github.ai.cba/aia/aiaa.pycba

In [31]:
import math
import numpy as np

x = [i for i in range(1000)]

def sqrt_arr(x):
    output = []
    for i in x:
        output.append(math.sqrt(i))
    return output

In [32]:
%timeit np.sqrt(x)

48.6 µs ± 551 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [33]:
%timeit [math.sqrt(i) for i in x]

114 µs ± 4.73 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [34]:
%timeit sqrt_arr(x)

178 µs ± 3.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


In [35]:
df = pd.DataFrame()
df[""]

KeyError: ''

#### 8. How to Find Out What the Function Does

When searching, make sure the Python version (eg Python 3 vs Python 2) and the version of the package match

* Google
* Package documentation (eg http://pandas.pydata.org/pandas-docs/version/0.23/)
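
Besides searching online, Python can describe its own objects. The following is a minimal sketch using the standard-library inspect module; add_tax is a made-up function for illustration, and json.dumps is used only as a convenient standard-library example.

```python
import inspect
import json

def add_tax(amount, rate=0.1):
    """Return amount with tax added at the given rate."""
    return amount * (1 + rate)

# The docstring is what help() displays.
print(add_tax.__doc__)

# inspect.signature shows the parameters and their defaults.
signature_text = str(inspect.signature(add_tax))
print(signature_text)  # (amount, rate=0.1)

# The same works for library functions.
dumps_parameters = list(inspect.signature(json.dumps).parameters)
print(dumps_parameters[0])  # obj

# dir() lists the attributes and methods available on an object or type.
public_str_methods = [name for name in dir(str) if not name.startswith("_")]
print("upper" in public_str_methods)  # True
```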

In [36]:
# ?function is equivalent to help
help(pd.read_csv)

Help on function read_csv in module pandas.io.parsers:

read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=None, error_bad_lines=True, warn_bad_lines=True, skipfooter=0, doublequote=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None)
    Read CSV (comma-separated) file into DataFrame
    
    Also supports optionally itera

In [37]:
# Alternatively, given a function, you can examine the source code as:
import inspect
print(inspect.getsource(pd.read_csv))

    def parser_f(filepath_or_buffer,
                 sep=sep,
                 delimiter=None,

                 # Column and Index Locations and Names
                 header='infer',
                 names=None,
                 index_col=None,
                 usecols=None,
                 squeeze=False,
                 prefix=None,
                 mangle_dupe_cols=True,

                 # General Parsing Configuration
                 dtype=None,
                 engine=None,
                 converters=None,
                 true_values=None,
                 false_values=None,
                 skipinitialspace=False,
                 skiprows=None,
                 nrows=None,

                 # NA and Missing Data Handling
                 na_values=None,
                 keep_default_na=True,
                 na_filter=True,
                 verbose=False,
                 skip_blank_lines=True,

                 # Datetime Handling
                 parse_dates=False,
   

In [None]:
import pyodbc
import pandas as pd

connection = pyodbc.connect('Driver={Oracle in OraClient11g_home1};DBQ=pacfin;Uid=uid;Pwd=pw')

# Create a cursor to run queries on the connection
cursor = connection.cursor()

# Example command to print the unique values of the field 'pacfin_group_gear_code'
SQLCommand = ("SELECT distinct pacfin_group_gear_code FROM pacfin_marts.comprehensive_ft")
cursor.execute(SQLCommand)
print(cursor.fetchall()) # print is a function in Python 3, so brackets are required
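
The cell above needs an Oracle driver and valid credentials to run. To try the same connect / cursor / execute / fetchall pattern without a database server, the sketch below swaps in the standard-library sqlite3 module with an in-memory database. The table name landings and its rows are made-up placeholder values, not real data.

```python
import sqlite3

# An in-memory SQLite database stands in for the real server here.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# Build a tiny table with placeholder rows (illustrative values only).
cursor.execute("CREATE TABLE landings (gear_code TEXT)")
cursor.executemany(
    "INSERT INTO landings (gear_code) VALUES (?)",
    [("TWL",), ("POT",), ("TWL",)],
)

# Same pattern as above: execute a query, then fetch every row.
cursor.execute("SELECT DISTINCT gear_code FROM landings ORDER BY gear_code")
rows = cursor.fetchall()
print(rows)  # [('POT',), ('TWL',)]

connection.close()
```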
