In [1]:
import numpy as np
import pandas as pd

# Intro

In this notebook, we go over a few __programming building blocks__ that can be helpful when you use Python for Data Analytics tasks.

# Writing a Python function

__When should we write a Python function?__

Whenever we notice that we need to reuse the same chunks of code many times. If we need the same lines of code in different analyses, we can write a function that encapsulates their logic and then re-apply it easily!

We will see multiple situations in the two ML classes where we could write a function because we re-use the same code logic multiple times with different datasets or with the same dataset but in different circumstances.

__How to define a function?__

https://www.programiz.com/python-programming/function

Syntax of a function definition:

def function_name(parameters):
>"""docstring"""<br>
>Python statements<br>
>return () or print ()

__A function definition consists of the following components__ (according to the previous website):

- Keyword __def__ that marks the start of the function header
- The function name that we want to give to the function
- Parameters through which we pass values to a function. Some functions do not take any parameters
- A colon (:) to mark the end of the function header
- Optional documentation string (docstring) to describe what the function does
- One or more valid Python statements that make up the function body. Statements must be indented (for example, using the Tab key)
- A return statement to send a value back from the function, or a print statement to display it
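As a rough illustration of these components, here is a small annotated function. The name `rectangle_area` and its parameters are made up for this sketch and are not part of the notebook.

```python
def rectangle_area(width, height):  # 'def' keyword, function name, parameters, colon
    """Return the area of a rectangle."""  # optional docstring
    area = width * height  # indented statements form the function body
    return area  # sends the value back to the caller


print(rectangle_area(3, 4))  # 12
```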

__Example__: Define a function that computes the Euclidean distance between two arrays

In [112]:
def my_euclidean_dist_calculator(x,y):
    return np.round (np.sqrt(np.sum(np.square(x-y))), 2)

An alternative to the previous function definition is:

def my_euclidean_dist_calculator(x,y):\
&emsp;out= np.round (np.sqrt(np.sum(np.square(x-y))), 2)\
&emsp;return out

In [6]:
array1= np.array([1,1,1,1])
array2= np.array([1,-1,0,0])

In [7]:
# Calling the newly defined my_euclidean_dist_calculator function

my_euclidean_dist_calculator(array1, array2)

2.45
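To see where 2.45 comes from, the calculation can be broken into steps. This is only a sketch for checking the result; the comparison with `np.linalg.norm` is an addition that the notebook itself does not use.

```python
import numpy as np

array1 = np.array([1, 1, 1, 1])
array2 = np.array([1, -1, 0, 0])

differences = array1 - array2      # element-wise differences: 0, 2, 1, 1
squared = np.square(differences)   # squared differences: 0, 4, 1, 1
total = np.sum(squared)            # 0 + 4 + 1 + 1 = 6
distance = np.round(np.sqrt(total), 2)

# NumPy's built-in vector norm gives the same Euclidean distance
builtin_distance = np.round(np.linalg.norm(array1 - array2), 2)

print(distance)
print(builtin_distance)
```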

When a function returns a value (or values), you can assign the results of the function to an object and then keep doing operations on this object. See next example:

In [8]:
distance_value= my_euclidean_dist_calculator(array1, array2)

if distance_value < 1:
    print ('Short distance')
else:
    print ('Long distance')

Long distance


__Example__: Define a function that obtains a __confusion matrix__ from two arrays: one with the actual values of Y and a second one with the predicted values of Y given by a ML algorithm. Y is a binary dependent (outcome) variable.

For a reminder of what a confusion matrix is, follow this link:

https://docs.google.com/presentation/d/11-hEfhOYDwwyZvy2as3QyTDW0VWrJdmb3SYs7flwatY/edit?usp=sharing

In [12]:
def my_confusion_matrix (y_actual, y_predicted):
    """
    This function computes the confusion matrix based on ...
    more text
    more text
    """
    confusion_table = pd.crosstab(y_actual, y_predicted)
    confusion_table.index.name = 'Actual'
    confusion_table.columns.name = 'Predicted'
    confusion_table['Row Total']= confusion_table.sum(axis=1)
    confusion_table.loc['Column Total'] = confusion_table.sum(axis = 0)
    return (confusion_table)

In [13]:
# The documentation that we create when we write a function can be printed by calling the built-in function help()

help(my_confusion_matrix)

Help on function my_confusion_matrix in module __main__:

my_confusion_matrix(y_actual, y_predicted)
    This function computes the confusion matrix based on ...
    more text
    more text



Let's test the my_confusion_matrix() function

First, generate simulated values for a binary Y variable

In [14]:
np.random.seed(seed= 1)
y_values= np.random.randint(0,2,size=100)
y_values

array([1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1,
       1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1,
       0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1])

Generate predicted values of Y that mostly (about 90% of the time) coincide with the values in y_values

__DO NOT EXPLAIN THE FOLLOWING TWO CODE CHUNKS. NOT IMPORTANT TO UNDERSTAND NOW!!!__

In [15]:
np.random.seed(seed= 1)

threshold= np.random.uniform(low=0, high=1, size= 100)

threshold <0.9

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True, False,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True,  True, False,  True,  True,  True,  True,
        True, False,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True, False,  True,
        True,  True,  True,  True, False,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True, False,  True,  True,
        True, False,  True,  True,  True,  True, False,  True,  True,
        True])

In [16]:
y_predictions= (threshold <0.9) * y_values

y_predictions

array([1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1,
       1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1,
       1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1])

In [17]:
# Calling my_confusion_matrix() function to obtain a confusion matrix with the data in y_values and y_predictions

my_confusion_matrix (y_values, y_predictions)

Predicted      0   1  Row Total
Actual                         
0             45   0         45
1              7  48         55
Column Total  52  48        100


As we did before, we can save the output of the my_confusion_matrix() function in a variable and use this variable for further analyses. For example, we can use it to easily retrieve the individual cells of the confusion matrix. See next:

In [18]:
confusion_matrix1 = my_confusion_matrix (y_values, y_predictions)

Let's get the number of true negatives and true positives from the confusion matrix

In [19]:
# True negative

confusion_matrix1.iloc[0,0]

45

In [20]:
# True positive

confusion_matrix1.iloc[1,1]

48
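Once the cells can be retrieved, summary measures follow from simple arithmetic. The sketch below is an illustration, not part of the original notebook: it rebuilds arrays with the same cell counts as the table above (rather than regenerating the random data), and the variable names `tn`, `fp`, `fn`, and `tp` are assumptions.

```python
import numpy as np
import pandas as pd

# Same cell counts as the confusion matrix above:
# 45 true negatives, 0 false positives, 7 false negatives, 48 true positives
y_actual = np.array([0] * 45 + [1] * 7 + [1] * 48)
y_predicted = np.array([0] * 45 + [0] * 7 + [1] * 48)

table = pd.crosstab(y_actual, y_predicted)

tn = table.iloc[0, 0]  # actual 0, predicted 0
fp = table.iloc[0, 1]  # actual 0, predicted 1
fn = table.iloc[1, 0]  # actual 1, predicted 0
tp = table.iloc[1, 1]  # actual 1, predicted 1

accuracy = (tp + tn) / (tn + fp + fn + tp)  # share of correct predictions
recall = tp / (tp + fn)                     # share of actual 1s that were predicted as 1

print(accuracy)
print(round(recall, 3))
```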

In [21]:
# A different version of the previous function, where instead of returning a value, something gets printed

def my_confusion_matrix_ver2 (y_actual, y_predicted):
    """
    This function computes the confusion matrix
    ...
    ...
    """
    confusion_table = pd.crosstab(y_actual, y_predicted)
    confusion_table.index.name = 'Actual'
    confusion_table.columns.name = 'Predicted'
    confusion_table['Row Total']= confusion_table.sum(axis=1)
    confusion_table.loc['Column Total'] = confusion_table.sum(axis = 0)
    print('     Confusion Matrix')
    print ()
    print (confusion_table)

In [22]:
my_confusion_matrix_ver2 (y_values, y_predictions)

     Confusion Matrix

Predicted      0   1  Row Total
Actual                         
0             45   0         45
1              7  48         55
Column Total  52  48        100
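One practical difference between the two versions: a function that only prints does not hand anything back to the caller, so its result cannot be stored and reused. A minimal sketch with two hypothetical toy functions:

```python
def add_and_return(a, b):
    return a + b  # the value is sent back to the caller


def add_and_print(a, b):
    print(a + b)  # the value is only displayed; the function returns None


total = add_and_return(2, 3)
nothing = add_and_print(2, 3)

print(total)    # 5
print(nothing)  # None
```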


### How to re-use a self-created function in other notebooks?

For example, how can we call and use the functions we wrote above (my_euclidean_dist_calculator and my_confusion_matrix) in other notebooks that we open?

Visit this webpage for a short explanation:

https://problemsolvingwithpython.com/07-Functions-and-Modules/07.05-Calling-Functions-from-Other-Files/


Let's do a demonstration of how to do it!
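One possible version of this demonstration is sketched below. It assumes the function is saved in a file called `my_functions.py` (a hypothetical name); here the file is written to a temporary folder from code so the example is self-contained, but in practice you would create the file by hand and keep it next to your notebook.

```python
import sys
import tempfile
from pathlib import Path

import numpy as np

# Contents of the hypothetical module file my_functions.py
module_source = '''
import numpy as np

def my_euclidean_dist_calculator(x, y):
    """Return the Euclidean distance between x and y, rounded to 2 decimals."""
    return np.round(np.sqrt(np.sum(np.square(x - y))), 2)
'''

folder = Path(tempfile.mkdtemp())
(folder / "my_functions.py").write_text(module_source)

# Make the folder visible to Python's import system
# (not needed when the .py file sits in the same folder as the notebook)
sys.path.insert(0, str(folder))

from my_functions import my_euclidean_dist_calculator

result = my_euclidean_dist_calculator(np.array([1, 1, 1, 1]), np.array([1, -1, 0, 0]))
print(result)
```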

# How to loop through different data structures

We will only review For Loops (While Loops and For Loops are _usually_ interchangeable)

__One of the most basic things you can do with a loop__ is to "visit" each element of a unidimensional data structure (e.g., a unidimensional array, a list) and use each element to trigger some action.

Example 1: Print each value from an array

In [2]:
# Let's create an array of random numbers first

np.random.seed(seed= 1)
array_x= np.random.randint(0,21, 10)

In [3]:
# Let's print each value from array_x

for element in array_x:
    print (element)

5
11
12
8
9
11
5
15
0
16


Remember to indent the body of a for loop. Any code inside the for loop must be indented. You can indent it using one Tab.

Example 2: Print whether each value of an array is even or odd

In the previous loop we used the variable 'element' as the __loop variable__ (the one whose value changes on each iteration of the loop)

However, it is customary to use single letters to name the loop variable. For example, 'i', 'j', 'k', etc.

In [4]:
# Note: Explain the indentation in this loop

for i in array_x:
    if i%2==0:
        print('even')
    else:
        print ('odd')

odd
odd
even
even
odd
odd
odd
odd
even
even


When writing conditional statements (e.g., IF-ELSE structures), the body of these statements must be indented too.

__Another thing you can do with a loop__ is to visit each element of a unidimensional structure and stop the search once something happens (e.g., once you find a number)

Example: Find whether the number 12 is in array_x

__Note__: You do not need a loop to do this (see below for a non-loop alternative), but we are going to use a loop to practice.

In [6]:
# Version 1

for number in array_x:
    print (number) # printing each element is optional. Not needed to satisfy what is being asked.
    if number==12:
        print ('The number was found')
        break

5
11
12
The number was found


In [10]:
# Not so important to understand this version. More important to understand version 1.
# Version 2: It is better than version 1 because it tells us when the number is not found
# Try it with 12 (which is in the array) and then with 20 (which is not)

found = False
for number in array_x:
    print (number)
    if number==20:
        print ('The number was found')
        found = True # remember that a match happened, even if it is the last element
        break
if not found: # this conditional should not be inside the loop. Only needs to be checked after loop is done
    print ('The number was not found')

5
11
12
8
9
11
5
15
0
16
The number was not found
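Python also has a `for ... else` construct that expresses the same idea without a separate counter or flag: the `else` block runs only if the loop finishes without reaching `break`. A small sketch, wrapping the search in a hypothetical helper `search_number` and hard-coding the values of array_x shown above:

```python
import numpy as np

array_x = np.array([5, 11, 12, 8, 9, 11, 5, 15, 0, 16])


def search_number(values, target):
    for number in values:
        if number == target:
            message = 'The number was found'
            break
    else:
        # runs only when the loop ended without hitting 'break'
        message = 'The number was not found'
    return message


print(search_number(array_x, 12))
print(search_number(array_x, 20))
```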


How to check if a number is in an array WITHOUT using a loop

In [14]:
# Check if 12 is in array_x

np.any(array_x==12)

True

In [15]:
# Check if 20 is in array_x

np.any(array_x==20)

False

__EACH OF YOU WORK ON THIS FOR A COUPLE OF MINUTES__: 

__Use the function np.any()__ to write a conditional statement that returns the message 'The number was found' if the number is in the array. Otherwise, return the message 'The number was not found'

In [20]:
# DO IT HERE


__Another thing you can do with a loop__ is to loop through the positions of a unidimensional structure. The previous loops looped through the elements of the array, but you can also loop through the positions. Sometimes you might care about the positions, but not the elements of the array.

Example: write a loop that returns the index at which an element is located in an array.

__Note__: You do not need a loop to do this (see below for a non-loop alternative), but we are going to use a loop to practice.

In [30]:
array_x

array([ 5, 11, 12,  8,  9, 11,  5, 15,  0, 16])

In [62]:
Notfound= True
for position in np.arange(array_x.size): # np.arange(array_x.size) returns this sequence: 0, 1, ... (size -1)
    if array_x[position]==20:
        print ('The number was found at index',':',position)
        Notfound= False
        break
if Notfound:
    print ('The number was not found')

The number was not found


To loop through the positions of a unidimensional structure, one handy function is enumerate(), which returns a tuple for each value of the array (or list). Each tuple contains the index and the value.

In [43]:
for i in enumerate(array_x):
    print(i)

(0, 5)
(1, 11)
(2, 12)
(3, 8)
(4, 9)
(5, 11)
(6, 5)
(7, 15)
(8, 0)
(9, 16)


So, we could loop through the results of enumerate() and, for each position of an array, get the index, the value, or both if we want to. See next:

In [48]:
for index,value in enumerate(array_x):
    print(index,',', value)

0 , 5
1 , 11
2 , 12
3 , 8
4 , 9
5 , 11
6 , 5
7 , 15
8 , 0
9 , 16


Now let's use enumerate() to return the index at which an element is located in an array.

In [78]:
Notfound= True
for index, value in enumerate(array_x):
    if value==16:
        print ('The number was found at index',':', index)
        Notfound= False
        break
if Notfound: # this conditional does not need to be inside the loop
    print ('The number was not found')

The number was found at index : 9


How to get the index of a number in an array WITHOUT using a loop

In [79]:
# Get the index where 16 is at

np.where(array_x==16)

(array([9], dtype=int64),)

In [80]:
np.where(array_x==16)[0]

array([9], dtype=int64)

In [81]:
# Check if 20 is in array_x

np.where(array_x==20)[0]

array([], dtype=int64)
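Because np.where() returns an empty array when there is no match, the size of the result can be used to tell the two cases apart. A minimal sketch, assuming we only care about the first match; the helper name `first_index` is made up for illustration:

```python
import numpy as np

array_x = np.array([5, 11, 12, 8, 9, 11, 5, 15, 0, 16])


def first_index(values, target):
    matches = np.where(values == target)[0]  # array of all matching positions
    if matches.size == 0:                    # empty array: the target is absent
        return None
    return int(matches[0])                   # position of the first match


print(first_index(array_x, 16))  # 9
print(first_index(array_x, 20))  # None
```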

An alternative to using np.where() is to convert the array to a list and apply the .index() method

In [82]:
list(array_x).index(16)

9

In [83]:
list(array_x).index(20)

ValueError: 20 is not in list
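Since .index() raises a ValueError when the value is missing, one way to keep the notebook from stopping is to catch the error with try/except. This is a sketch; the helper name `safe_index` and the choice to return None for a missing value are assumptions.

```python
values = [5, 11, 12, 8, 9, 11, 5, 15, 0, 16]  # same values as array_x, as a list


def safe_index(items, target):
    try:
        return items.index(target)  # position of the first occurrence
    except ValueError:
        return None                 # the target is not in the list


print(safe_index(values, 16))  # 9
print(safe_index(values, 20))  # None
```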

## Other loop examples. OPTIONAL !
## Just go over them if time permits

__Example__: Generate a sample of 5 random numbers from a standard normal (Z) distribution and compute the sample mean. Repeat this process 15 times (= generate 15 samples and compute the mean for each of them)

In [89]:
numberofsamples= 15

# Create an empty array called 'sample_means' to save the mean computed from each sample

sample_means= np.empty(numberofsamples, dtype= float)

In [91]:
for i in np.arange(numberofsamples): # i goes from 0 to 14, that is, 15 numbers (there are 15 samples)
    np.random.seed(i) # Why would it be incorrect to write np.random.seed(1) (using seed=1 for all iterations)?
    sample_of_five= np.random.normal(0,1,size= 5)
    sample_means[i]= np.mean(sample_of_five) # sample_means in position i will store the sample mean for the sample generated on each loop iteration

In [97]:
print (sample_means)
print (np.mean(sample_means))

[ 1.45027975  0.05537124 -0.55247711  0.03615098 -0.03401978  0.47972925
 -0.55016628  0.17520035 -0.88946479 -0.35914921  0.22288346 -0.33655732
 -0.18271867  0.35875734 -0.05443286]
-0.01204091057882513
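The question in the loop's comment can be explored with a short experiment. The sketch below (using only 3 iterations to keep it small, with made-up list names) resets the seed to the same value on every iteration, which makes every sample identical, and compares that with a seed that changes on each iteration.

```python
import numpy as np

same_seed_means = []
for i in range(3):
    np.random.seed(1)  # same seed every time: the generator restarts at the same point
    same_seed_means.append(np.mean(np.random.normal(0, 1, size=5)))

different_seed_means = []
for i in range(3):
    np.random.seed(i)  # a different seed on each iteration
    different_seed_means.append(np.mean(np.random.normal(0, 1, size=5)))

print(len(set(same_seed_means)))       # 1: all three sample means are identical
print(len(set(different_seed_means)))  # number of distinct means with changing seeds
```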


Repeating the previous loop but, instead of using an array, using a list

If you do not want to create an empty array, you can always create an empty list and start appending elements to the empty list inside the loop.

In [98]:
# here, sample_means2 is an empty list

sample_means2= []

In [99]:
for i in np.arange(numberofsamples): 
    np.random.seed(i)
    sample= np.random.normal(0,1,size= 5)
    sample_means2.append(np.mean(sample))

In [100]:
print (sample_means2)
print (np.mean(sample_means2))

[1.4502797455584104, 0.055371240983643745, -0.5524771094180257, 0.036150977227085324, -0.03401978049233899, 0.4797292467947679, -0.5501662819948029, 0.17520035026177536, -0.8894647931139439, -0.3591492077112229, 0.22288346207572268, -0.33655731527333477, -0.1827186702696239, 0.3587573370935722, -0.05443286040406155]
-0.01204091057882513


#### Two nested loops


We can use two nested loops when we need an action A to repeat many times and action A itself requires many repetitive steps.

For example, array 1 and array 2 are two dimensional arrays, both with 10 rows and 3 columns

Compute the distance between each row of array 1 and all rows in array 2.

In [109]:
# Create array1

np.random.seed(1)

array1= np.random.randint(1,100,30).reshape(10,3)

array1

array([[38, 13, 73],
       [10, 76,  6],
       [80, 65, 17],
       [ 2, 77, 72],
       [ 7, 26, 51],
       [21, 19, 85],
       [12, 29, 30],
       [15, 51, 69],
       [88, 88, 95],
       [97, 87, 14]])

In [110]:
# Create array2

np.random.seed(2)

array2= np.random.randint(1,100,30).reshape(10,3)

array2

array([[41, 16, 73],
       [23, 44, 83],
       [76,  8, 35],
       [50, 96, 76],
       [86, 48, 64],
       [32, 91, 21],
       [38, 40, 68],
       [ 5, 43, 52],
       [39, 34, 59],
       [68, 70, 89]])

a) We need to compute the distance from each row in array1 to all rows in array2

b) We need to go through array1 many times (as many times as the number of rows in array1). We can write one loop to take care of this.

c) Then, we can use another loop to go through the rows in array2

In [113]:
distances=[]
for i in np.arange(array1.shape[0]): # array1.shape[0] gives us the number of rows in array1
    for j in np.arange(array2.shape[0]): # array2.shape[0] gives us the number of rows in array2
        distances.append(my_euclidean_dist_calculator(array1[i,:], array2[j,:]))

In [115]:
# There are 10 rows in array1 and 10 rows in array2, so  we should have 10*10= 100 distance values

len (distances)

100
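For larger arrays, the same distances can be obtained without explicit loops by using NumPy broadcasting. The following is a sketch, assuming the rounded Euclidean distance defined earlier is what we want; it uses only the first rows of array1 and array2 shown above to keep the example small, and the names `a`, `b`, and `distance_matrix` are made up for illustration.

```python
import numpy as np

a = np.array([[38, 13, 73], [10, 76, 6]])                # first 2 rows of array1
b = np.array([[41, 16, 73], [23, 44, 83], [76, 8, 35]])  # first 3 rows of array2

# a[:, None, :] has shape (2, 1, 3) and b[None, :, :] has shape (1, 3, 3),
# so their difference has shape (2, 3, 3): one difference vector per (i, j) pair
differences = a[:, None, :] - b[None, :, :]
distance_matrix = np.round(np.sqrt(np.sum(np.square(differences), axis=2)), 2)

# Same values computed with two nested loops, for comparison
loop_distances = []
for i in range(a.shape[0]):
    for j in range(b.shape[0]):
        loop_distances.append(np.round(np.sqrt(np.sum(np.square(a[i, :] - b[j, :]))), 2))

print(distance_matrix.shape)                                 # (2, 3)
print(np.allclose(distance_matrix.ravel(), loop_distances))  # True
```

Entry [i, j] of the resulting matrix is the distance between row i of the first array and row j of the second, which can be easier to read than a flat list.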

In [116]:
distances

[4.24,
 35.86,
 53.97,
 83.92,
 60.08,
 93.94,
 27.46,
 49.3,
 25.26,
 66.37,
 95.13,
 84.39,
 99.1,
 83.07,
 99.62,
 30.56,
 76.97,
 56.83,
 73.58,
 101.43,
 84.01,
 89.7,
 59.91,
 73.09,
 50.34,
 54.74,
 70.64,
 85.64,
 66.38,
 73.16,
 72.41,
 40.63,
 107.73,
 51.78,
 89.22,
 60.8,
 51.78,
 39.56,
 58.2,
 68.51,
 41.71,
 40.05,
 73.08,
 85.87,
 83.03,
 75.83,
 38.03,
 17.15,
 33.94,
 84.27,
 23.52,
 25.16,
 75.14,
 82.77,
 74.21,
 96.96,
 31.92,
 43.83,
 35.0,
 69.47,
 53.47,
 56.17,
 67.54,
 89.72,
 83.62,
 65.76,
 47.34,
 27.0,
 39.94,
 91.09,
 43.78,
 17.58,
 82.01,
 57.44,
 71.24,
 64.75,
 25.51,
 21.28,
 31.06,
 59.75,
 88.75,
 79.4,
 100.72,
 43.23,
 50.65,
 92.85,
 74.38,
 103.74,
 81.32,
 27.57,
 107.97,
 109.94,
 84.4,
 78.32,
 64.36,
 65.5,
 92.77,
 108.83,
 90.54,
 82.19]