# Week 2 lecture notebook

## Outline

[Missing values](#missing-values)

[Decision tree classifier](#decision-tree)

[Apply a mask](#mask)

[Imputation](#imputation)

<a name="missing-values"></a>
## Missing values

In [None]:
import numpy as np
import pandas as pd

In [None]:
df = pd.DataFrame({"feature_1": [0.1, np.nan, np.nan, 0.4],
                   "feature_2": [1.1, 2.2, np.nan, np.nan]
                  })
df

### Check if each value is missing

In [None]:
df.isnull()

### Check if any values in a column or row are `True`


In [None]:
df_booleans = pd.DataFrame({"col_1": [True,True,False],
                            "col_2": [True,False,False]
                           })
df_booleans

- `pandas.DataFrame.any()` checks each column: if at least one value in the column is `True`, it returns `True` for that column.
- If all values in a column are `False`, it returns `False` for that column.

In [None]:
df_booleans.any()

- Setting `axis=0` (the default) gives the same result: it checks whether any item in each column is `True`.

In [None]:
df_booleans.any(axis=0)

- Setting `axis=1` checks whether any item in each **row** is `True`, and if so, returns `True` for that row.
- Similarly, the function returns `False` for a row only when all values in that row are `False`.

In [None]:
df_booleans.any(axis=1)
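
As a minimal sketch of how `isnull()` and `any(axis=1)` might be combined, the example below flags the rows that contain at least one missing value. The dataframe mirrors the one from the start of this section; the names `df_missing` and `rows_with_missing` are illustrative only.

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame({"feature_1": [0.1, np.nan, np.nan, 0.4],
                           "feature_2": [1.1, 2.2, np.nan, np.nan]})

# isnull() gives a dataframe of booleans; any(axis=1) collapses each row
# to True if at least one value in that row is missing.
rows_with_missing = df_missing.isnull().any(axis=1)
print(rows_with_missing.tolist())  # [False, True, True, True]
```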

### Sum booleans

In [None]:
series_booleans = pd.Series([True,True,False])
series_booleans

- When applying `sum` to a series (or list) of booleans, the `sum` function treats `True` as 1 and `False` as zero.

In [None]:
sum(series_booleans)
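
As a rough sketch of how the functions in this section might work together, the example below counts missing values per column and counts the rows that have at least one missing value. It also uses the `.sum()` method, which treats booleans the same way as the built-in `sum`. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame({"feature_1": [0.1, np.nan, np.nan, 0.4],
                           "feature_2": [1.1, 2.2, np.nan, np.nan]})

# Each True (a missing value) counts as 1, so summing gives a count per column.
missing_per_column = df_missing.isnull().sum()
print(missing_per_column.to_dict())  # {'feature_1': 2, 'feature_2': 2}

# Number of rows that contain at least one missing value.
n_rows_with_missing = sum(df_missing.isnull().any(axis=1))
print(n_rows_with_missing)  # 3
```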

You will make use of these functions in this week's assignment!

### This is the end of this practice section.

Please continue on with the lecture videos!

---

<a name="decision-tree"></a>
## Decision tree classifier


In [None]:
import pandas as pd

In [None]:
X = pd.DataFrame({"feature_1":[0,1,2,3]})
y = pd.Series([0,0,1,1])

In [None]:
X

In [None]:
y

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
dt = DecisionTreeClassifier()
dt

In [None]:
dt.fit(X,y)
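
Once the tree is fitted, it can be used to make predictions. The following is a minimal sketch on the same toy data. `X_new` is a hypothetical set of new examples, and `random_state=0` is added here only to make the run repeatable.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

X = pd.DataFrame({"feature_1": [0, 1, 2, 3]})
y = pd.Series([0, 0, 1, 1])

dt = DecisionTreeClassifier(random_state=0)
dt.fit(X, y)

# Hypothetical new examples, using the same column name as the training data.
X_new = pd.DataFrame({"feature_1": [0.5, 2.5]})
predictions = dt.predict(X_new)
print(predictions)  # [0 1]
```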

### Set tree parameters

In [None]:
dt = DecisionTreeClassifier(criterion='entropy',
                            max_depth=10,
                            min_samples_split=2
                           )
dt

### Set parameters using a dictionary

- In Python, we can use a dictionary to set parameters of a function.
- We can define the name of the parameter as the 'key', and the value of that parameter as the 'value' for each key-value pair of the dictionary.

In [None]:
tree_parameters = {'criterion': 'entropy',
                   'max_depth': 10,
                   'min_samples_split': 2
                  }

- We can pass in the dictionary and use `**` to 'unpack' that dictionary's key-value pairs as parameter values for the function.

In [None]:
dt = DecisionTreeClassifier(**tree_parameters)
dt
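
To see that `**` unpacking is equivalent to writing the keyword arguments out by hand, here is a small sketch with a plain Python function. `describe_tree` is a hypothetical helper made up for this illustration and is not part of scikit-learn.

```python
def describe_tree(criterion, max_depth, min_samples_split):
    """Hypothetical helper: format the given settings as a string."""
    return f"{criterion}, depth<={max_depth}, split>={min_samples_split}"

tree_parameters = {'criterion': 'entropy',
                   'max_depth': 10,
                   'min_samples_split': 2}

# These two calls are equivalent: ** maps each key to the parameter
# of the same name.
unpacked = describe_tree(**tree_parameters)
explicit = describe_tree(criterion='entropy', max_depth=10, min_samples_split=2)
print(unpacked == explicit)  # True
```

A key in the dictionary that does not match any parameter name would raise a `TypeError`, so the keys need to match the function's parameter names exactly.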

### This is the end of this practice section.

Please continue on with the lecture videos!

---

<a name="mask"></a>
## Apply a mask

Use a 'mask' (a series of booleans, one per row) to filter the rows of a dataframe.

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame({"feature_1": [0,1,2,3,4]})
df

In [None]:
mask = df["feature_1"] >= 3
mask

In [None]:
df[mask]
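
A mask can also be inverted or passed to `.loc`. This is a minimal sketch on the same toy dataframe; the `~` operator and `.loc` are standard pandas features that go slightly beyond what the cells above show.

```python
import pandas as pd

df = pd.DataFrame({"feature_1": [0, 1, 2, 3, 4]})
mask = df["feature_1"] >= 3

# Rows where the mask is True.
kept = df[mask]
print(kept["feature_1"].tolist())  # [3, 4]

# ~ flips every boolean, selecting the rows the mask excluded.
excluded = df[~mask]
print(excluded["feature_1"].tolist())  # [0, 1, 2]

# .loc accepts the same mask and can select a column at the same time.
selected = df.loc[mask, "feature_1"]
print(selected.tolist())  # [3, 4]
```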

### Combining comparison operators

You'll want to be careful when combining more than one comparison operator, to avoid errors.
- Using the `and` operator on a series will result in a `ValueError`, because `and` needs a single `True` or `False` value, and the truth value of a whole series is ambiguous.

In [None]:
df["feature_1"] >= 2

In [None]:
df["feature_1"] <= 3

In [None]:
# NOTE: This will result in a ValueError
df["feature_1"] >= 2 and df["feature_1"] <= 3
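
If you want to see the error without stopping the notebook, one option is to wrap the expression in `try`/`except`. This is only a sketch for inspecting the message; the exact wording may vary between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({"feature_1": [0, 1, 2, 3, 4]})

error_message = None
try:
    # `and` asks Python for a single True/False from the whole series,
    # which pandas refuses to provide.
    df["feature_1"] >= 2 and df["feature_1"] <= 3
except ValueError as err:
    error_message = str(err)

print(error_message)
```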

### How to combine two comparisons for Series

What we want is to look at the same row of each of the two series, and compare each pair of items, one row at a time. To do this, use:
- the `&` operator instead of `and`
- the `|` operator instead of `or`.
- Also, you'll need to surround each comparison with parentheses `(...)`, because `&` and `|` bind more tightly than comparison operators such as `>=`.

In [None]:
# This will compare the series, one row at a time
(df["feature_1"] >= 2) & (df["feature_1"] <= 3)
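
The combined comparison is itself a mask, so it can be used to filter the dataframe. A minimal sketch, with illustrative variable names, showing both `&` and `|`:

```python
import pandas as pd

df = pd.DataFrame({"feature_1": [0, 1, 2, 3, 4]})

# & keeps rows where both comparisons are True.
between = df[(df["feature_1"] >= 2) & (df["feature_1"] <= 3)]
print(between["feature_1"].tolist())  # [2, 3]

# | keeps rows where at least one comparison is True.
outside = df[(df["feature_1"] < 2) | (df["feature_1"] > 3)]
print(outside["feature_1"].tolist())  # [0, 1, 4]
```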

### This is the end of this practice section.

Please continue on with the lecture videos!

---

<a name="imputation"></a>
## Imputation

We will use the imputation classes provided by scikit-learn. See the scikit-learn [documentation on imputation](https://scikit-learn.org/stable/modules/impute.html#iterative-imputer).

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.DataFrame({"feature_1": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   "feature_2": [0, np.nan, 20, 30, 40, 50, 60, 70, 80, np.nan, 100],
                  })
df

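| | feature_1 | feature_2 |
|---|---|---|
| 0 | 0 | 0.0 |
| 1 | 1 | NaN |
| 2 | 2 | 20.0 |
| 3 | 3 | 30.0 |
| 4 | 4 | 40.0 |
| 5 | 5 | 50.0 |
| 6 | 6 | 60.0 |
| 7 | 7 | 70.0 |
| 8 | 8 | 80.0 |
| 9 | 9 | NaN |
| 10 | 10 | 100.0 |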


### Mean imputation

In [3]:
from sklearn.impute import SimpleImputer

In [4]:
mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
mean_imputer

SimpleImputer(add_indicator=False, copy=True, fill_value=None,
              missing_values=nan, strategy='mean', verbose=0)

In [5]:
mean_imputer.fit(df)

SimpleImputer(add_indicator=False, copy=True, fill_value=None,
              missing_values=nan, strategy='mean', verbose=0)

In [6]:
nparray_imputed_mean = mean_imputer.transform(df)
nparray_imputed_mean

array([[  0.,   0.],
       [  1.,  50.],
       [  2.,  20.],
       [  3.,  30.],
       [  4.,  40.],
       [  5.,  50.],
       [  6.,  60.],
       [  7.,  70.],
       [  8.,  80.],
       [  9.,  50.],
       [ 10., 100.]])

Notice how both missing values are replaced with `50`, the mean of the observed values in `feature_2`.
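
To check where the `50` comes from, the mean can be computed directly with pandas. This is a minimal sketch; `fillna` is used here as a pandas-only stand-in for comparison and is not a description of how `SimpleImputer` works internally.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"feature_1": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   "feature_2": [0, np.nan, 20, 30, 40, 50, 60, 70, 80, np.nan, 100]})

# pandas skips missing values by default when computing the mean:
# 450 / 9 observed values = 50.0
feature_2_mean = df["feature_2"].mean()
print(feature_2_mean)  # 50.0

# Filling each column's gaps with that column's mean gives the same
# values as the mean-imputed array above.
df_filled = df.fillna(df.mean())
print(df_filled["feature_2"].tolist())
```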

### Regression imputation

In [7]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

In [8]:
reg_imputer = IterativeImputer()
reg_imputer

IterativeImputer(add_indicator=False, estimator=None,
                 imputation_order='ascending', initial_strategy='mean',
                 max_iter=10, max_value=None, min_value=None,
                 missing_values=nan, n_nearest_features=None, random_state=None,
                 sample_posterior=False, tol=0.001, verbose=0)

In [9]:
reg_imputer.fit(df)

IterativeImputer(add_indicator=False, estimator=None,
                 imputation_order='ascending', initial_strategy='mean',
                 max_iter=10, max_value=None, min_value=None,
                 missing_values=nan, n_nearest_features=None, random_state=None,
                 sample_posterior=False, tol=0.001, verbose=0)

In [10]:
nparray_imputed_reg = reg_imputer.transform(df)
nparray_imputed_reg

array([[  0.,   0.],
       [  1.,  10.],
       [  2.,  20.],
       [  3.,  30.],
       [  4.,  40.],
       [  5.,  50.],
       [  6.,  60.],
       [  7.,  70.],
       [  8.,  80.],
       [  9.,  90.],
       [ 10., 100.]])

Notice how the missing values are filled in with `10` and `90` when using regression imputation. The imputer models a linear relationship between `feature_1` and `feature_2`, so each missing value is predicted from the other feature.
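
The idea behind those values can be illustrated by fitting a straight line to the rows where `feature_2` is observed and using it to predict the missing entries. This is only a sketch of the concept, assuming ordinary least squares via `np.polyfit`; by default `IterativeImputer` uses a regularized linear model (`BayesianRidge`), so its internals differ from this.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"feature_1": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   "feature_2": [0, np.nan, 20, 30, 40, 50, 60, 70, 80, np.nan, 100]})

# Fit feature_2 = slope * feature_1 + intercept on the fully observed rows.
observed = df.dropna()
slope, intercept = np.polyfit(observed["feature_1"].to_numpy(),
                              observed["feature_2"].to_numpy(),
                              deg=1)

# Predict feature_2 for the rows where it is missing.
missing = df["feature_2"].isnull()
predicted = slope * df.loc[missing, "feature_1"].to_numpy() + intercept
print(predicted)  # approximately 10 and 90
```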

### This is the end of this practice section.

Please continue on with the lecture videos!

---