
This notebook will cover the following concepts:

*   Missing Values
*   Decision Tree Classifier
*   Applying a Mask
*   Imputation

# Missing Values


In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


In [3]:
# Create a df with missing values

df = pd.DataFrame({"feature 1" : [0.1, np.nan, np.nan, 0.4],
                   "feature 2" : [1.1, 2.2, np.nan, np.nan]
                   })
df

| | feature 1 | feature 2 |
|---|---|---|
| 0 | 0.1 | 1.1 |
| 1 | NaN | 2.2 |
| 2 | NaN | NaN |
| 3 | 0.4 | NaN |


Check if a value is missing in the df



In [6]:
df.isnull()

| | feature 1 | feature 2 |
|---|---|---|
| 0 | False | False |
| 1 | True | False |
| 2 | True | True |
| 3 | False | True |


Check if any values in a row are true


In [7]:
# Create a df of booleans

df_booleans = pd.DataFrame({"feature 1" : [True, True, False],
                            "feature 2" : [True, False, False]
                            })
df_booleans

| | feature 1 | feature 2 |
|---|---|---|
| 0 | True | True |
| 1 | True | False |
| 2 | False | False |


*   Calling `df_booleans.any()` checks each column: if at least one value in the column is `True`, it returns `True` for that column.
*   If every value in a column is `False`, it returns `False` for that column.

In [8]:
df_booleans.any()

feature 1    True
feature 2    True
dtype: bool

*   Setting `axis=0` (the default) gives the same result: it checks whether any item in each column is `True`.

In [9]:
df_booleans.any(axis=0)

feature 1    True
feature 2    True
dtype: bool

*   Setting `axis=1` checks whether any item in each ***row*** is `True` and, if so, returns `True` for that row.
*   When ***all*** the values in a row are `False`, it returns `False` for that row.

In [10]:
df_booleans.any(axis=1)

0     True
1     True
2    False
dtype: bool

## Boolean Operations -- Sum


In [11]:
series_booleans= pd.Series([True, True, False])
series_booleans

0     True
1     True
2    False
dtype: bool

*   Summing a boolean Series counts its `True` values, because `True` is treated as 1 and `False` as 0.

In [12]:
sum(series_booleans)

2
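
The pieces above (`isnull()`, `any(axis=1)` and summing booleans) can be combined to count how many rows contain missing data. The following is a minimal sketch that rebuilds the toy DataFrame from the start of this section; the names `df_missing`, `missing_per_column` and `rows_with_missing` are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Same toy data as the first example in this section
df_missing = pd.DataFrame({"feature 1": [0.1, np.nan, np.nan, 0.4],
                           "feature 2": [1.1, 2.2, np.nan, np.nan]})

# Number of missing values in each column
missing_per_column = df_missing.isnull().sum()

# True for every row that has at least one missing value
rows_with_missing = df_missing.isnull().any(axis=1)

# Summing the booleans counts those rows
n_rows_with_missing = rows_with_missing.sum()
fraction_rows_with_missing = n_rows_with_missing / len(df_missing)

print(missing_per_column)
print(n_rows_with_missing, fraction_rows_with_missing)
```

For this data, three of the four rows have at least one missing value, so the fraction is 0.75.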

# Decision Tree Classifier


In [0]:
# Create some toy data

X = pd.DataFrame({"feature 1" : [0, 1, 2, 3]})
y = pd.Series([0, 0, 1, 1])

In [14]:
X

| | feature 1 |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |


In [15]:
y

0    0
1    0
2    1
3    1
dtype: int64

In [0]:
# import the dt classifier from sklearn

from sklearn.tree import DecisionTreeClassifier

In [17]:
# Create a dt classifier instance

dt = DecisionTreeClassifier()
dt

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [20]:
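# Fit the dt classifier. The output below shows criterion='entropy' and max_depth=10
# because this cell was run after the parameters in the next section were set.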
dt.fit(X, y)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                       max_depth=10, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')
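
Once the tree is fitted, it can classify new samples with `predict`. The sketch below repeats the toy data so that it runs on its own; `dt_demo`, `X_new` and the two new sample values are made up for illustration, and `random_state=0` is set only to keep the run reproducible.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Same toy data as above
X = pd.DataFrame({"feature 1": [0, 1, 2, 3]})
y = pd.Series([0, 0, 1, 1])

# random_state is fixed only to make the sketch reproducible
dt_demo = DecisionTreeClassifier(random_state=0)
dt_demo.fit(X, y)

# New (hypothetical) samples, using the same column name as the training data
X_new = pd.DataFrame({"feature 1": [0.5, 2.5]})
predictions = dt_demo.predict(X_new)
print(predictions)
```

The two new samples sit on either side of the gap between the classes in the training data, so the tree should assign 0.5 to class 0 and 2.5 to class 1.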

## Set Tree parameters

In [18]:
dt = DecisionTreeClassifier(criterion='entropy',
                            max_depth=10,
                            min_samples_split=2)
dt

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                       max_depth=10, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

## Set Tree parameters using a dict

We can also define the tree's parameters using a dictionary. We define the name of the parameter as the **key** and the value of the parameter as the **value** for each key-value pair in the dictionary.

In [0]:
tree_parameters = {'criterion' : 'entropy',
                   'max_depth' : 10,
                   'min_samples_split' : 2
                   }

*   We can pass in the dictionary and use `**` to 'unpack' that dictionary's key-value pairs as parameter values for the function.

In [22]:
dt = DecisionTreeClassifier(**tree_parameters)
dt

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                       max_depth=10, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')
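
As a quick check that unpacking a dictionary is equivalent to passing keyword arguments, the two classifiers can be compared through `get_params()`. This is a small sketch; `dt_from_dict` and `dt_from_kwargs` are illustrative names.

```python
from sklearn.tree import DecisionTreeClassifier

tree_parameters = {"criterion": "entropy",
                   "max_depth": 10,
                   "min_samples_split": 2}

# Parameters passed by unpacking the dictionary
dt_from_dict = DecisionTreeClassifier(**tree_parameters)

# The same parameters passed as ordinary keyword arguments
dt_from_kwargs = DecisionTreeClassifier(criterion="entropy",
                                        max_depth=10,
                                        min_samples_split=2)

# get_params() returns every parameter of the estimator as a dictionary
params_match = dt_from_dict.get_params() == dt_from_kwargs.get_params()
print(params_match)
```

Keeping the parameters in a dictionary is convenient when the same settings are reused for several models or varied in a loop.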

# Applying a Mask

We use a 'mask' to filter the data in a df.

In [24]:
# Create dataframe 

df2 = pd.DataFrame({"feature_1" : [0, 1, 2, 3, 4]})
df2

| | feature_1 |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |


In [26]:
# Create a mask

mask = df2["feature_1"] >=3 
mask

0    False
1    False
2    False
3     True
4     True
Name: feature_1, dtype: bool

In [28]:
# Get items using the mask

df2[mask]

| | feature_1 |
|---|---|
| 3 | 3 |
| 4 | 4 |
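
A mask also ties back to the missing-values material: a boolean Series built with `isnull().any(axis=1)` can be inverted with `~` to keep only the complete rows. The sketch below reuses the toy data from the Missing Values section; `df_missing`, `has_missing` and `df_complete` are illustrative names.

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame({"feature 1": [0.1, np.nan, np.nan, 0.4],
                           "feature 2": [1.1, 2.2, np.nan, np.nan]})

# Mask that is True for rows containing at least one missing value
has_missing = df_missing.isnull().any(axis=1)

# ~ inverts the mask, so only complete rows are kept
df_complete = df_missing[~has_missing]
print(df_complete)
```

Only row 0 has no missing values, so it is the only row that remains.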


## Combining Comparison Operators

We need to be careful when combining more than one comparison; otherwise we will get errors.
*   Using `and` between two Series raises a `ValueError`, because `and` needs a single `True`/`False` value and the truth value of a whole Series is ambiguous.

In [29]:
df2["feature_1"] >= 2

0    False
1    False
2     True
3     True
4     True
Name: feature_1, dtype: bool

In [30]:
df2["feature_1"] <= 3

0     True
1     True
2     True
3     True
4    False
Name: feature_1, dtype: bool

In [31]:
# This will result in a ValueError
df2["feature_1"] >= 2 and df2["feature_1"] <=3

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
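
To see the error without stopping the notebook, the expression can be wrapped in `try`/`except`. This is a small sketch; `feature` and `error_message` are illustrative names.

```python
import pandas as pd

feature = pd.Series([0, 1, 2, 3, 4], name="feature_1")

error_message = None
try:
    # `and` asks for the truth value of the left-hand Series,
    # which pandas refuses to compute for a whole Series
    combined = (feature >= 2) and (feature <= 3)
except ValueError as err:
    error_message = str(err)

print(error_message)
```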

## Combining two Logical Operators for a Series

To avoid the error above, we need to compare the two Series element-wise, looking at the same row of each Series one row at a time. To do this we use:
*   The `&` operator instead of `and`
*   The `|` operator instead of `or`
*   Parentheses `()` around every comparison clause, because `&` and `|` bind more tightly than comparison operators such as `>=`

In [32]:
# Compare the series one row at a time

(df2["feature_1"] >= 2) & (df2["feature_1"] <= 3)

0    False
1    False
2     True
3     True
4    False
Name: feature_1, dtype: bool
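
The combined condition can be used directly as a mask. The sketch below filters the same data with `&` and, for contrast, with `|`; `in_range` and `out_of_range` are illustrative names.

```python
import pandas as pd

df2 = pd.DataFrame({"feature_1": [0, 1, 2, 3, 4]})

# Rows where 2 <= feature_1 <= 3
in_range = df2[(df2["feature_1"] >= 2) & (df2["feature_1"] <= 3)]

# Rows where feature_1 is below 1 OR above 3
out_of_range = df2[(df2["feature_1"] < 1) | (df2["feature_1"] > 3)]

print(in_range)
print(out_of_range)
```

The first filter keeps the rows with values 2 and 3; the second keeps the rows with values 0 and 4.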

# Imputation

Imputation refers to techniques for filling in missing data. In these examples we'll use scikit-learn's imputers to impute missing values. For more information, see the scikit-learn [documentation on imputation](https://scikit-learn.org/stable/modules/impute.html#iterative-imputer).

In [33]:
# Create dataframe

df = pd.DataFrame({"feature_1": [0,1,2,3,4,5,6,7,8,9,10],
                   "feature_2": [0,np.nan,20,30,40,50,60,70,80,np.nan,100],
                  })
df

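| | feature_1 | feature_2 |
|---|---|---|
| 0 | 0 | 0.0 |
| 1 | 1 | NaN |
| 2 | 2 | 20.0 |
| 3 | 3 | 30.0 |
| 4 | 4 | 40.0 |
| 5 | 5 | 50.0 |
| 6 | 6 | 60.0 |
| 7 | 7 | 70.0 |
| 8 | 8 | 80.0 |
| 9 | 9 | NaN |
| 10 | 10 | 100.0 |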


## Mean Imputation

Replaces each missing value with the mean of the observed (non-missing) values in the same column.

In [0]:
from sklearn.impute import SimpleImputer

In [35]:
mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
mean_imputer

SimpleImputer(add_indicator=False, copy=True, fill_value=None,
              missing_values=nan, strategy='mean', verbose=0)

In [36]:
mean_imputer.fit(df)

SimpleImputer(add_indicator=False, copy=True, fill_value=None,
              missing_values=nan, strategy='mean', verbose=0)

In [37]:
df

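| | feature_1 | feature_2 |
|---|---|---|
| 0 | 0 | 0.0 |
| 1 | 1 | NaN |
| 2 | 2 | 20.0 |
| 3 | 3 | 30.0 |
| 4 | 4 | 40.0 |
| 5 | 5 | 50.0 |
| 6 | 6 | 60.0 |
| 7 | 7 | 70.0 |
| 8 | 8 | 80.0 |
| 9 | 9 | NaN |
| 10 | 10 | 100.0 |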


In [38]:
nparray_imputed_mean = mean_imputer.transform(df)
nparray_imputed_mean

array([[  0.,   0.],
       [  1.,  50.],
       [  2.,  20.],
       [  3.,  30.],
       [  4.,  40.],
       [  5.,  50.],
       [  6.,  60.],
       [  7.,  70.],
       [  8.,  80.],
       [  9.,  50.],
       [ 10., 100.]])

As we can see, the two missing values in `feature_2` have been replaced with 50, the mean of the nine observed values in that column.
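
Because `transform` returns a NumPy array, the column names are lost. A minimal sketch of wrapping the result back into a DataFrame and checking the imputed value against the column mean follows; `df_imputed_mean` and `feature_2_mean` are illustrative names, and `fit_transform` is used here as a shorthand for calling `fit` and then `transform`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"feature_1": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   "feature_2": [0, np.nan, 20, 30, 40, 50, 60, 70, 80, np.nan, 100]})

mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit_transform learns the column means and fills the gaps in one call
imputed_array = mean_imputer.fit_transform(df)

# Restore the column names and index that the NumPy array does not carry
df_imputed_mean = pd.DataFrame(imputed_array, columns=df.columns, index=df.index)

# pandas skips NaN when computing the mean, so this is the mean of the observed values
feature_2_mean = df["feature_2"].mean()

print(feature_2_mean)
print(df_imputed_mean.loc[[1, 9]])
```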

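## Regression Imputation

Replaces each missing value with a prediction from a regression model fitted on the other features.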



In [0]:
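# IterativeImputer is experimental, so it must be enabled explicitly before it can be imported
from sklearn.experimental import enable_iterative_imputer  # noqa: F401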
from sklearn.impute import IterativeImputer

In [41]:
reg_imputer = IterativeImputer()
reg_imputer

IterativeImputer(add_indicator=False, estimator=None,
                 imputation_order='ascending', initial_strategy='mean',
                 max_iter=10, max_value=None, min_value=None,
                 missing_values=nan, n_nearest_features=None, random_state=None,
                 sample_posterior=False, skip_complete=False, tol=0.001,
                 verbose=0)

In [42]:
reg_imputer.fit(df)

IterativeImputer(add_indicator=False, estimator=None,
                 imputation_order='ascending', initial_strategy='mean',
                 max_iter=10, max_value=None, min_value=None,
                 missing_values=nan, n_nearest_features=None, random_state=None,
                 sample_posterior=False, skip_complete=False, tol=0.001,
                 verbose=0)

In [43]:
df

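| | feature_1 | feature_2 |
|---|---|---|
| 0 | 0 | 0.0 |
| 1 | 1 | NaN |
| 2 | 2 | 20.0 |
| 3 | 3 | 30.0 |
| 4 | 4 | 40.0 |
| 5 | 5 | 50.0 |
| 6 | 6 | 60.0 |
| 7 | 7 | 70.0 |
| 8 | 8 | 80.0 |
| 9 | 9 | NaN |
| 10 | 10 | 100.0 |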


In [44]:
nparray_reg_imputed = reg_imputer.transform(df)
nparray_reg_imputed

array([[  0.,   0.],
       [  1.,  10.],
       [  2.,  20.],
       [  3.,  30.],
       [  4.,  40.],
       [  5.,  50.],
       [  6.,  60.],
       [  7.,  70.],
       [  8.,  80.],
       [  9.,  90.],
       [ 10., 100.]])

Here we can see that the missing values have been replaced with 10 and 90, respectively. The imputation modeled a linear relationship between `feature_1` and `feature_2` (the default estimator, `BayesianRidge`, is a linear model).
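
To make the idea of regression imputation concrete, the sketch below does a simplified version by hand: it fits a plain `LinearRegression` on the rows where `feature_2` is observed and predicts the missing entries. This is not what `IterativeImputer` does internally (it uses a different default estimator and iterates over features); it is only an illustration under that simplifying assumption, and `observed`, `linear_model` and `df_imputed_reg` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({"feature_1": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   "feature_2": [0, np.nan, 20, 30, 40, 50, 60, 70, 80, np.nan, 100]})

# Mask of rows where feature_2 is present
observed = df["feature_2"].notnull()

# Fit feature_2 as a linear function of feature_1 using only the observed rows
linear_model = LinearRegression()
linear_model.fit(df.loc[observed, ["feature_1"]], df.loc[observed, "feature_2"])

# Fill the missing entries with the model's predictions
df_imputed_reg = df.copy()
df_imputed_reg.loc[~observed, "feature_2"] = linear_model.predict(
    df.loc[~observed, ["feature_1"]]
)

print(df_imputed_reg.loc[~observed])
```

Since the observed values lie exactly on the line `feature_2 = 10 * feature_1`, the predictions for rows 1 and 9 come out at approximately 10 and 90, matching the `IterativeImputer` result above.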