# Week 6

### Table of Contents

1. [Iteration](#bullet1)
2. [Practice](#bullet2)
3. [For loop example uses](#bullet3)
4. [Practice](#bullet4)
5. [List comprehensions](#bullet5)
6. [Lambda functions](#bullet6)
7. [Pandas apply with lambda functions](#bullet7)
8. [Nested functions](#bullet8)

In [2]:
## importing necessary libraries
import numpy as np
import pandas as pd 

## 1. Iteration<a class="anchor" id="bullet1"></a>

Iteration means repeating the same operation on different columns or on different datasets.

In [3]:
rng = np.random.default_rng(seed=3252)

a = rng.standard_normal(10)
b = rng.standard_normal(10)
c = rng.standard_normal(10)
d = rng.standard_normal(10)

the_dict = {'a':a,'b':b,'c':c,'d':d}

df = pd.DataFrame(the_dict)

We want to compute the median of each column. You could do this with copy-and-paste:

In [4]:
print(np.median(df['a']))
print(np.median(df['b']))
print(np.median(df['c']))
print(np.median(df['d']))

-0.7109216792484936
-0.1607751802165395
-0.7835127395579766
0.43195930703839136


But to write good code, we want to limit repetition as much as possible.

Remember the __DRY principle: Don't Repeat Yourself.__

In [9]:
### iteration using a for loop
output = []

for i in df.columns:
    output.append(round(np.median(df[i]),2))
    
output

[-0.71, -0.16, -0.78, 0.43]

__Every for loop has three components:__

1) __Output:__ output = []. You must create a container for the results before the loop starts.
* It needs to have the same structure as the final output of your code.
* In Python, an empty list (output = []) is the usual choice, because append() grows it as needed.
* If you know the length in advance, you can also preallocate, for example with np.zeros(len(df.columns)), and fill it by position.


2) __The sequence:__ for i in df.columns (or more generally, for item in iterable). This determines what to loop over:
* Each run of the for loop assigns i to the next value from df.columns, here the column labels 'a', 'b', 'c', and 'd'.

3) __The body:__ output.append(round(np.median(df[i]),2)). This is the code that does the work.
* It's run repeatedly, each time with a different value for i.
* The first iteration will run output.append(round(np.median(df['a']),2)), the second will run output.append(round(np.median(df['b']),2)), and so on.
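
The three components can be seen side by side in a minimal sketch. The DataFrame toy and its values are made up for illustration and are not part of the course data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, chosen so the medians are easy to check by hand
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 60.0]})

medians = []                                    # 1) output: an empty list
for col in toy.columns:                         # 2) sequence: the labels "x", "y"
    medians.append(float(np.median(toy[col])))  # 3) body: runs once per label

print(medians)  # [2.0, 20.0]
```

The list grows by one element per iteration, so it does not need a fixed length up front.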

## 2. Practice<a class="anchor" id="bullet2"></a>

a) Compute the mean of every column in df.

In [7]:
output = []
for i in df.columns:
    output.append(np.mean(df[i]))

output

[-0.5651802647782048,
 -0.07332713554716727,
 -0.6693640830516738,
 0.3935178923880054]

b) Determine the type of data each column holds.

In [7]:
output = []
for i in df.columns:
    data_type = type(df[i].iloc[0])  # type of the first value, selected by position
    output.append(data_type)
    
output

[numpy.float64, numpy.float64, numpy.float64, numpy.float64]

c) Compute the number of unique values in each column of iris.

In [29]:
import numpy as np # for the np.unique function
import pandas as pd # for pd.DataFrame
from sklearn import datasets # to load iris dataset
iris0 = datasets.load_iris()
iris = pd.DataFrame(data=iris0.data, columns=iris0.feature_names)

In [30]:
iris

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


In [27]:
unique_vals = []

# loop over the columns and count the distinct values in each one
for i in iris.columns:
    unique_vals.append(len(np.unique(iris[i])))
            
unique_vals

[35, 23, 43, 22]
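
As a cross-check, pandas also offers a built-in count of distinct values through nunique(). The sketch below compares it with the loop approach on a small made-up DataFrame (toy is an assumption, not the iris data). Note that nunique() ignores missing values by default, so the two approaches can differ when a column contains NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "x" has 3 distinct values, "y" has 1
toy = pd.DataFrame({"x": [1, 1, 2, 3], "y": [5, 5, 5, 5]})

# loop version, same pattern as the solution above
loop_counts = []
for col in toy.columns:
    loop_counts.append(len(np.unique(toy[col])))

# built-in version: one count per column, converted to a plain list
builtin_counts = toy.nunique().tolist()

print(loop_counts)     # [3, 1]
print(builtin_counts)  # [3, 1]
```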

## 3. For loop example uses<a class="anchor" id="bullet3"></a>

1) Modifying an existing object, instead of creating a new object.

In [9]:
df

Unnamed: 0,a,b,c,d
0,-0.757929,0.295862,0.417556,1.167163
1,0.807697,0.019923,-1.726416,1.767645
2,0.433601,-2.352702,-0.585652,0.461624
3,-0.949885,-0.548677,-0.012005,1.457183
4,-1.119631,1.061025,-0.981373,-0.216714
5,0.064981,0.760106,-1.244226,0.287626
6,-0.6124,-0.671701,1.370228,-1.224794
7,-1.144289,1.442454,-1.705773,0.694906
8,-1.710033,-0.398088,-0.13943,0.402294
9,-0.663915,-0.341473,-2.086549,-0.861754


In [31]:
def rescale01(col):
    # rescale a column to the range [0, 1] using its minimum and maximum
    return (col - col.min()) / (col.max() - col.min())

In [32]:
# now we want to apply the rescale function to every column in df
for i in df.columns:
    df[i] = rescale01(df[i])
    
df

Unnamed: 0,a,b,c,d
0,0.37816,0.69788,0.724405,0.799334
1,1.0,0.625172,0.104182,1.0
2,0.851415,0.0,0.43419,0.56356
3,0.301918,0.475349,0.600138,0.896251
4,0.234498,0.899496,0.319713,0.336876
5,0.705006,0.820205,0.243673,0.505414
6,0.435961,0.442933,1.0,0.0
7,0.224704,1.0,0.110154,0.641517
8,0.0,0.515029,0.563276,0.543733
9,0.415501,0.529946,0.0,0.121319
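
To check by hand what rescale01 does, here is a small sketch on a made-up Series. The function is restated so the block runs on its own, and the input values are assumptions chosen to give round results. It also shows one edge case: a constant column has max equal to min, so the division is 0/0 and pandas returns NaN.

```python
import pandas as pd

def rescale01(col):
    # same formula as above: shift by the minimum, divide by the range
    return (col - col.min()) / (col.max() - col.min())

# Hypothetical values: minimum 2, maximum 10, so the range is 8
scaled = rescale01(pd.Series([2.0, 4.0, 6.0, 10.0]))
print(scaled.tolist())  # [0.0, 0.25, 0.5, 1.0]

# Edge case: a constant column yields NaN for every value
constant = rescale01(pd.Series([3.0, 3.0]))
print(constant.isna().all())  # True
```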


## 4. Practice<a class="anchor" id="bullet4"></a>

Write a function that prints the mean of each numeric column in a dataframe. Use the Iris dataset.

In [35]:
import seaborn as sns
print(type(iris0))
print(type(iris))
iris = sns.load_dataset("iris")
print(type(iris))
iris.head()

<class 'sklearn.utils.Bunch'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [36]:
## solution:
iris_means = []
for col in iris.columns:
    if type(iris[col][0]) == np.float64:
        m = np.mean(iris[col])
        iris_means.append(m)
        
iris_means

[5.843333333333335, 3.057333333333334, 3.7580000000000027, 1.199333333333334]
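
Since the exercise asks for a function that prints, one possible version is sketched below. The name print_numeric_means and the toy DataFrame are hypothetical. It relies on select_dtypes(include="number"), which keeps integer as well as float columns.

```python
import pandas as pd

def print_numeric_means(data):
    """Print the mean of every numeric column in `data` (hypothetical helper)."""
    numeric = data.select_dtypes(include="number")  # keeps int and float columns
    for col in numeric.columns:
        print(col, numeric[col].mean())

# Hypothetical toy data with a float, an integer, and a text column
toy = pd.DataFrame({
    "length": [1.0, 2.0, 3.0],
    "count": [1, 2, 6],
    "label": ["a", "b", "c"],
})

print_numeric_means(toy)  # prints one line each for "length" and "count"
numeric_cols = list(toy.select_dtypes(include="number").columns)
```

The same call should work on the iris DataFrame, where it would skip the species column.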

## 5. List comprehensions<a class="anchor" id="bullet5"></a>

A list comprehension is a single concise expression that forms a new list by transforming or filtering the elements of a collection.

In [None]:
# structure of a list comprehension:
[do_something for value in collection if condition]

# equivalent:
output = []
for value in collection:
    if condition:
        output.append(do_something)

In [15]:
# example: 
strings = ["a", "as", "bat", "car", "dove", "python"]
[x.upper() for x in strings if len(x) > 2]

['BAT', 'CAR', 'DOVE', 'PYTHON']

In [None]:
# structure of a dictionary comprehension:
dict_comp = {key_expr: value_expr for value in collection if condition}

In [None]:
# structure of a set comprehension:
set_comp = {expr for value in collection if condition}
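
A concrete sketch of both structures, reusing the strings list from the earlier example. The variable names lengths and unique_lengths are chosen here for illustration.

```python
strings = ["a", "as", "bat", "car", "dove", "python"]

# dict comprehension: map each string longer than 2 characters to its length
lengths = {s: len(s) for s in strings if len(s) > 2}
print(lengths)  # {'bat': 3, 'car': 3, 'dove': 4, 'python': 6}

# set comprehension: the distinct lengths that occur (duplicates collapse)
unique_lengths = {len(s) for s in strings}
print(sorted(unique_lengths))  # [1, 2, 3, 4, 6]
```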

We can also use nested list comprehensions:

In [44]:
# 2-D List of planets
planets = [['Mercury', 'Venus', 'Earth'], ['Mars', 'Jupiter', 'Saturn'], ['Uranus', 'Neptune', 'Pluto']]
print(planets)
# print(planets.shape) would raise an AttributeError: a plain Python list has no .shape attribute

# Nested List comprehension with an if condition
flatten_planets = [planet for sublist in planets for planet in sublist if len(planet) < 6]
print(flatten_planets)


[['Mercury', 'Venus', 'Earth'], ['Mars', 'Jupiter', 'Saturn'], ['Uranus', 'Neptune', 'Pluto']]
['Venus', 'Earth', 'Mars', 'Pluto']


In [38]:
all_data = [["John", "Emily", "Michael", "Mary", "Steven"],
            ["Maria", "Juan", "Javier", "Natalia", "Pilar"]]

In [45]:
result = [name for name_list in all_data for name in name_list if name.count('i') >= 1]

In [46]:
result

['Emily', 'Michael', 'Maria', 'Javier', 'Natalia', 'Pilar']

In [47]:
# equivalent to this for loop:
result2 = []
for names in all_data:
    for name in names:
        if name.count('i') >= 1:
            result2.append(name)
result2

['Emily', 'Michael', 'Maria', 'Javier', 'Natalia', 'Pilar']

## 6. Lambda functions<a class="anchor" id="bullet6"></a>

Lambda functions, also known as anonymous functions, are short functions that don't have to be bound to a name. A lambda is limited to a single expression and has no docstring, so it carries less documentation than a standard function defined with def.

A lambda function consists of:
* __The Keyword__: lambda
* __A bound variable:__ x
* __A body:__ x + 1

In [51]:
add_one = lambda x: x + 1
add_one(2)

3

In [52]:
# The above lambda function is equivalent to writing this:
def add_one(x):
    return x + 1
add_one(2)

3

In [54]:
(lambda x, y: x + y)(2, 3)  # we are passing 2 and 3 as x and y respectively

5
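
Lambdas are most useful when a short function is needed only once, as an argument to another function. The sketch below uses the built-ins sorted and map; the input lists are made up for illustration.

```python
words = ["python", "a", "dove", "as"]

# the lambda is passed directly as the sort key and never gets a name
by_length = sorted(words, key=lambda w: len(w))
print(by_length)  # ['a', 'as', 'dove', 'python']

# map applies the lambda to each element; list() collects the results
squares = list(map(lambda x: x ** 2, [1, 2, 3]))
print(squares)  # [1, 4, 9]
```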

## 7. Pandas apply with lambda functions<a class="anchor" id="bullet7"></a>

DataFrame.apply takes a function and applies it along an axis of the DataFrame. The objects passed to the function are Series: each column, indexed by the DataFrame's index (axis=0, the default), or each row, indexed by the DataFrame's columns (axis=1). Calling apply on a single Series instead applies the function to each value. Read about it here: [pd.DataFrame.apply](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html)
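
The effect of the axis argument is easiest to see on a tiny example. This is a sketch on a made-up DataFrame (toy and its values are assumptions), showing a per-column call, a per-row call, and an element-wise call on a single Series.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

# axis=0 (default): the lambda receives each column as a Series
col_ranges = toy.apply(lambda s: s.max() - s.min())
print(col_ranges.tolist())  # [2, 20]

# axis=1: the lambda receives each row as a Series indexed by column name
row_sums = toy.apply(lambda s: s["x"] + s["y"], axis=1)
print(row_sums.tolist())  # [11, 22, 33]

# Series.apply: the lambda receives one value at a time
doubled = toy["x"].apply(lambda v: v * 2)
print(doubled.tolist())  # [2, 4, 6]
```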

In [55]:
df2 = df.apply(add_one) # add_one as defined above
df2

Unnamed: 0,a,b,c,d
0,1.37816,1.69788,1.724405,1.799334
1,2.0,1.625172,1.104182,2.0
2,1.851415,1.0,1.43419,1.56356
3,1.301918,1.475349,1.600138,1.896251
4,1.234498,1.899496,1.319713,1.336876
5,1.705006,1.820205,1.243673,1.505414
6,1.435961,1.442933,2.0,1.0
7,1.224704,2.0,1.110154,1.641517
8,1.0,1.515029,1.563276,1.543733
9,1.415501,1.529946,1.0,1.121319


In [57]:
df3 = df.apply(lambda x: x + 1) # add_one
df3 # equal to df2 

## in the apply calls above, the function receives each column as a Series,
## so x + 1 adds 1 to every value in that column

Unnamed: 0,a,b,c,d
0,1.37816,1.69788,1.724405,1.799334
1,2.0,1.625172,1.104182,2.0
2,1.851415,1.0,1.43419,1.56356
3,1.301918,1.475349,1.600138,1.896251
4,1.234498,1.899496,1.319713,1.336876
5,1.705006,1.820205,1.243673,1.505414
6,1.435961,1.442933,2.0,1.0
7,1.224704,2.0,1.110154,1.641517
8,1.0,1.515029,1.563276,1.543733
9,1.415501,1.529946,1.0,1.121319


In [58]:
# to apply a lambda function to only one column:
df['a'] = df['a'].apply(lambda x: x-2)
df

Unnamed: 0,a,b,c,d
0,-1.62184,0.69788,0.724405,0.799334
1,-1.0,0.625172,0.104182,1.0
2,-1.148585,0.0,0.43419,0.56356
3,-1.698082,0.475349,0.600138,0.896251
4,-1.765502,0.899496,0.319713,0.336876
5,-1.294994,0.820205,0.243673,0.505414
6,-1.564039,0.442933,1.0,0.0
7,-1.775296,1.0,0.110154,0.641517
8,-2.0,0.515029,0.563276,0.543733
9,-1.584499,0.529946,0.0,0.121319


## 8. Nested functions <a class="anchor" id="bullet8"></a>

In [30]:
rng = np.random.default_rng(seed=3252)

a = rng.standard_normal(10)
b = rng.standard_normal(10)
c = rng.standard_normal(10)
d = rng.standard_normal(10)

the_dict = {'a':a,'b':b,'c':c,'d':d}

df = pd.DataFrame(the_dict)

In [31]:
# column means with a for loop:
output = []

for i in df.columns:
    mean = np.mean(df[i])
    output.append(mean)
    
output

[-0.5651802647782048,
 -0.07332713554716724,
 -0.6693640830516736,
 0.3935178923880054]

In [61]:
# what if we wanted to return multiple things at once? We can return a list object

def col_mean(df):
    
    n = df.shape[0]
    
    output = []
    for i in df.columns:
        output.append(np.mean(df[i]))
    return [n, output]

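In [60]:
# note: the output shown here (and for the col_computations calls below) was produced
# with the rescaled df from sections 3 and 7, not the regenerated df above
col_mean(df)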

[10,
 [-1.5452837802149992,
  0.6006010541858376,
  0.409972979032969,
  0.5408003571487018]]
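
A common alternative to returning a list is to return a tuple and unpack it at the call site. The sketch below is a variation on col_mean, not the course's version; the name col_mean_tuple and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def col_mean_tuple(data):
    """Return the row count and the list of column means as a tuple."""
    n = data.shape[0]
    means = [float(np.mean(data[col])) for col in data.columns]
    return n, means  # the comma builds a tuple

# Hypothetical toy data
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]})

# unpacking gives each returned value its own name
n_rows, means = col_mean_tuple(toy)
print(n_rows)  # 3
print(means)   # [2.0, 20.0]
```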

In [66]:
# Note: a list can hold an entire dataset as one of its elements, so a function can return one too!
x = [iris, 0, 1, 2]
x

[     sepal_length  sepal_width  petal_length  petal_width    species
 0             5.1          3.5           1.4          0.2     setosa
 1             4.9          3.0           1.4          0.2     setosa
 2             4.7          3.2           1.3          0.2     setosa
 3             4.6          3.1           1.5          0.2     setosa
 4             5.0          3.6           1.4          0.2     setosa
 ..            ...          ...           ...          ...        ...
 145           6.7          3.0           5.2          2.3  virginica
 146           6.3          2.5           5.0          1.9  virginica
 147           6.5          3.0           5.2          2.0  virginica
 148           6.2          3.4           5.4          2.3  virginica
 149           5.9          3.0           5.1          1.8  virginica
 
 [150 rows x 5 columns],
 0,
 1,
 2]

You can even pass a function as an argument to another function. First, consider two functions that differ only in the computation they apply to each column:

In [67]:
def col_median(df):
    output = []
    for i in df.columns:
        output.append(np.median(df[i]))
    return output

def col_std(df):
    output = []
    for i in df.columns:
        output.append(np.std(df[i]))
    return output

In [68]:
# But we can do better by following the DRY principle: pass the computation in as an argument

def col_computations(func, df):
    output = []
    for i in df.columns:
        computation = func(df[i])
        output.append(computation)
    return output

In [69]:
col_computations(np.mean, df)

[-1.5452837802149992,
 0.6006010541858376,
 0.409972979032969,
 0.5408003571487018]

In [70]:
col_computations(np.median, df)

[-1.6031698377708468,
 0.5775590423410861,
 0.37695127842558124,
 0.553646538292351]

In [71]:
col_computations(np.std, df)

[0.2926800238042268,
 0.268464901046673,
 0.29908704877691916,
 0.3043002488376972]
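
Because col_computations accepts any function that maps a column to a value, it also works with a lambda. The sketch below restates the function so it runs on its own and uses a made-up DataFrame (toy is an assumption). It also compares the loop result with the built-in DataFrame.median as a sanity check.

```python
import numpy as np
import pandas as pd

def col_computations(func, df):
    # apply func to each column and collect the results in a list
    output = []
    for i in df.columns:
        output.append(func(df[i]))
    return output

# Hypothetical toy data
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 60.0]})

# a lambda computing the range (max minus min) of each column
ranges = col_computations(lambda col: float(col.max() - col.min()), toy)
print(ranges)  # [2.0, 50.0]

# the loop with np.median should agree with the built-in column medians
medians_loop = col_computations(np.median, toy)
medians_builtin = toy.median().tolist()
print(medians_loop == medians_builtin)  # True
```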