<a href="https://colab.research.google.com/github/IndraniMandal/CSC310-S20/blob/master/08_data_manipulation_pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
###### Config #####
import sys, os, platform
if os.path.isdir("ds-assets"):
  !cd ds-assets && git pull
else:
  !git clone https://github.com/IndraniMandal/ds-assets.git
colab = 'google.colab' in sys.modules
system = platform.system() # "Windows", "Linux", "Darwin"
home = "ds-assets/assets/"
sys.path.append(home)

Already up to date.


In [2]:
# notebook level imports
import pandas as pd
from sklearn import tree
from sklearn import metrics

# Dealing with Missing Data


In [3]:
# for the following we need the definitions
COLUMNS = 1
INDEX = 0



* Pandas flags missing values with NaN (not a number).
* In most cases, computations applied to a dataframe that contains NaNs ignore the NaNs.
* However, it is still a good idea to clean up the dataframe.
* Sophisticated procedures exist for dealing with missing data; here we limit ourselves to **dropping the rows or columns that have NaNs**.
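
To make the second point concrete, here is a minimal sketch of how a Pandas reduction treats NaNs. The dataframe `toy_df` and its values are made up for illustration and are not part of the course data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value in the 'Legs' column.
toy_df = pd.DataFrame({"Legs": [4.0, 2.0, np.nan, 4.0]})

# Pandas reductions skip NaNs by default (skipna=True),
# so this is the mean of the three known values: (4 + 2 + 4) / 3.
mean_skipping_nan = toy_df["Legs"].mean()

# With skipna=False the NaN propagates into the result.
mean_with_nan = toy_df["Legs"].mean(skipna=False)

print(mean_skipping_nan)
print(mean_with_nan)
```

The silent skipping is convenient, but it also hides how many values actually went into the result, which is one reason to clean the dataframe first.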


In [4]:
df_missing = pd.read_csv(home+"mammals-missing.csv")
df_missing.index = ['Dog', 'Duck', 'Frog', 'Bat', 'Bar Stool']
df_missing

Unnamed: 0,Legs,Wings,Fur,Feathers,Mammal
Dog,4,no,yes,no,True
Duck,2,yes,no,yes,False
Frog,4,no,no,,False
Bat,4,,yes,no,True
Bar Stool,3,no,no,no,False


**Observation**: Notice the NaN values in the dataframe indicating missing values.

We can use the **isnull** function to detect missing values in the dataframe.

In [5]:
df_missing.isnull()

Unnamed: 0,Legs,Wings,Fur,Feathers,Mammal
Dog,False,False,False,False,False
Duck,False,False,False,False,False
Frog,False,False,False,True,False
Bat,False,True,False,False,False
Bar Stool,False,False,False,False,False


**Observation**: For each missing value we find a **True** in the returned dataframe.

Rather than printing out the dataframe and then searching for the True values, we can use the **sum** function and the fact that Python treats True as 1 to quickly detect missing values.

In [6]:
df_missing.isnull().sum(axis=INDEX)

Unnamed: 0,0
Legs,0
Wings,1
Fur,0
Feathers,1
Mammal,0


In [7]:
df_missing.isnull().sum(axis=COLUMNS)

Unnamed: 0,0
Dog,0
Duck,0
Frog,1
Bat,1
Bar Stool,0


If we don't care where the missing values are and just want to find out whether there are any, we can first sum over the dataframe (the default is to sum along INDEX, giving one count per column) and then sum over the resulting vector (series).

In [8]:
df_missing.isnull().sum().sum()

2

In [9]:
# drop rows that have NaNs
df_missing.dropna(how='any',axis=INDEX)

Unnamed: 0,Legs,Wings,Fur,Feathers,Mammal
Dog,4,no,yes,no,True
Duck,2,yes,no,yes,False
Bar Stool,3,no,no,no,False


In [10]:
# dropping columns that have NaNs
df_missing.dropna(how='any',axis=COLUMNS)

Unnamed: 0,Legs,Fur,Mammal
Dog,4,yes,True
Duck,2,no,False
Frog,4,no,False
Bat,4,yes,True
Bar Stool,3,no,False


**NOTE**: In most data sets we have more rows than columns, so **in most cases you want to delete rows rather than columns** in order to eliminate missing data.
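
A rough sketch of why this matters, using a made-up dataframe (`wide_gap_df`, with 100 rows and 3 columns, is an assumption for illustration only). A single missing cell costs 1% of the rows if we drop the row, but a third of the variables if we drop the column.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100 rows, 3 columns, a single missing cell.
wide_gap_df = pd.DataFrame({
    "a": np.arange(100, dtype=float),
    "b": np.arange(100, dtype=float),
    "c": np.arange(100, dtype=float),
})
wide_gap_df.loc[7, "b"] = np.nan

rows_kept = wide_gap_df.dropna(how="any", axis=0)  # drops 1 of 100 rows
cols_kept = wide_gap_df.dropna(how="any", axis=1)  # drops 1 of 3 columns

print(rows_kept.shape)  # (99, 3)
print(cols_kept.shape)  # (100, 2)
```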

# Converting Categorical Data to Numerical Data




Recall that in sklearn all independent variables have to be numerical.  That means if we have a categorical independent variable we need to convert it.

We accomplish the conversion via **dummy variables** or, more formally, **indicator variables**.

Pandas supports the **get_dummies** function that converts categorical variables in a dataframe into dummy/indicator variables.

Each variable is converted into as many 0/1 dummy/indicator variables as there are different values and the original variable is deleted from the dataset. Columns in the resulting dataframe are each named after a value. The resulting names consist of the original variable name and the value name.  Consider the variable **Fur** in the mammals dataset which has two values: **yes** and **no**.  The resulting indicator variable names are: **Fur_yes** and **Fur_no**.
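
The naming scheme can be checked on a single column. This is a small sketch with a hand-built `fur_df` (a hypothetical stand-in for the **Fur** column). The optional `dtype=int` argument asks get_dummies for 0/1 integers; without it, the indicator columns may come back as True/False, depending on the Pandas version.

```python
import pandas as pd

# Hypothetical stand-in for the Fur column of the mammals data.
fur_df = pd.DataFrame({"Fur": ["yes", "no", "no", "yes", "no"]})

# One indicator column per distinct value; the original column is removed.
fur_dummies = pd.get_dummies(fur_df, dtype=int)

print(list(fur_dummies.columns))      # ['Fur_no', 'Fur_yes']
print(fur_dummies["Fur_yes"].tolist())  # [1, 0, 0, 1, 0]
```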

**IMPORTANT**: Just converting labels into numerical values does not work unless we are dealing with ordinal categorical values. Doing this simple conversion for nominal categorical values will **introduce unwanted/implicit biases** into the data.

For example, if we have a categorical variable with the labels 'big' and 'small' then we can easily replace those two labels with ordinal values such as big=>2 and small=>1, assuming that the original labels also have the relationship big > small.  On the other hand, given a categorical variable with labels 'green' and 'red' it makes no sense to replace those labels with green=>2 and red=>1, which would imply that green > red.  In the latter case the encoding introduces a bias into our data.
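
The two cases can be sketched side by side. The dataframe `sizes_colors` and the mapping `size_order` are hypothetical names made up for this illustration; the point is only that the ordinal column gets integer codes while the nominal column gets indicator variables.

```python
import pandas as pd

# Hypothetical data with one ordinal and one nominal variable.
sizes_colors = pd.DataFrame({
    "size": ["big", "small", "big"],
    "color": ["green", "red", "red"],
})

# Ordinal: the order big > small is meaningful, so integer codes are fine.
size_order = {"small": 1, "big": 2}
encoded = sizes_colors.assign(size=sizes_colors["size"].map(size_order))

# Nominal: there is no order between colors, so use indicator columns.
encoded = pd.get_dummies(encoded, columns=["color"], dtype=int)

print(encoded)
```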

Let's try it using our mammal dataset.

In [12]:
mammal_df = pd.read_csv(home+"mammals.csv")
mammal_df

Unnamed: 0,Legs,Wings,Fur,Feathers,Mammal
0,4,no,yes,no,True
1,2,yes,no,yes,False
2,4,no,no,no,False
3,4,yes,yes,no,True
4,3,no,no,no,False


In [13]:
df_dummies1 = pd.get_dummies(mammal_df)
df_dummies1

Unnamed: 0,Legs,Mammal,Wings_no,Wings_yes,Fur_no,Fur_yes,Feathers_no,Feathers_yes
0,4,True,True,False,False,True,True,False
1,2,False,False,True,True,False,False,True
2,4,False,True,False,True,False,True,False
3,4,True,False,True,False,True,True,False
4,3,False,True,False,True,False,True,False


**Observation**:  Notice that the **Fur** variable has been converted to **Fur_yes** and **Fur_no**.

By default, boolean values are not converted into dummy variables.
If we really need to convert these as well, we can force Pandas to do so by naming the column explicitly.




In [14]:
df_dummies2 = pd.get_dummies(df_dummies1,columns=['Mammal'])
df_dummies2

Unnamed: 0,Legs,Wings_no,Wings_yes,Fur_no,Fur_yes,Feathers_no,Feathers_yes,Mammal_False,Mammal_True
0,4,True,False,False,True,True,False,False,True
1,2,False,True,True,False,False,True,True,False
2,4,True,False,True,False,True,False,True,False
3,4,False,True,False,True,True,False,False,True
4,3,True,False,True,False,True,False,True,False


# Sklearn Needs Numerical Data

**The machine learning algorithms in sklearn only operate on numerical data**.  That means any data that is categorical has to be converted to numerical data.  **This is only true for the independent variables**.  The target variable can be categorical or numeric.
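
As a quick sanity check of the last point, the following sketch fits a tree on numeric features with a string-valued target. The tiny dataset (`X`, `y`) is invented for illustration and is not the tennis data.

```python
import pandas as pd
from sklearn import tree

# Hypothetical numeric features and a categorical (string) target.
X = pd.DataFrame({"humidity_high": [1, 1, 0, 0],
                  "windy_True": [0, 1, 0, 1]})
y = pd.Series(["no", "no", "yes", "yes"], name="play")

clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(X, y)  # works: only the features need to be numeric

print(clf.classes_)    # the string labels are kept as classes
print(clf.predict(X))  # predictions come back as the original labels
```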


Let's try this on our **tennis dataset** and see if we can modify the data in such a way that we can build a decision tree.



In [15]:
tennis_df = pd.read_csv(home+"tennis.csv")
tennis_df.head()

Unnamed: 0,outlook,temp,humidity,windy,play
0,sunny,hot,high,False,no
1,sunny,hot,high,True,no
2,overcast,hot,high,False,yes
3,rainy,mild,high,False,yes
4,rainy,cool,normal,False,yes


Let's try to build a decision tree on this.

In [16]:

features_df = tennis_df.drop(columns=['play'])
target_df = pd.DataFrame(tennis_df[['play']])

dtree = tree.DecisionTreeClassifier()
try:
  dtree.fit(features_df,target_df)
except Exception as e:
  print(e)


could not convert string to float: 'sunny'


Notice that the tree algorithm complains that it cannot convert the categorical label 'sunny' into a number for training purposes. This is an indicator that it needs a numerical variable.

&rarr; We need to introduce dummy variables. We have to be explicit about which columns we want to convert: we don't want to convert the **play** column because that is our target variable.

In [17]:
tennis_dummies_df = pd.get_dummies(tennis_df, columns=['outlook','temp','humidity','windy'])
tennis_dummies_df.head()

Unnamed: 0,play,outlook_overcast,outlook_rainy,outlook_sunny,temp_cool,temp_hot,temp_mild,humidity_high,humidity_normal,windy_False,windy_True
0,no,False,False,True,False,True,False,True,False,True,False
1,no,False,False,True,False,True,False,True,False,False,True
2,yes,True,False,False,False,True,False,True,False,True,False
3,yes,False,True,False,False,False,True,True,False,True,False
4,yes,False,True,False,True,False,False,False,True,True,False


Let's try to build a decision tree on this now that it is in numeric shape suitable for sklearn.

In [18]:

features_df = tennis_dummies_df.drop(columns=['play'])
target_df = tennis_dummies_df[['play']]

dtree = tree.DecisionTreeClassifier(criterion='entropy')
dtree.fit(features_df,target_df)
print(tree.export_text(dtree,
                       feature_names=list(features_df.columns)))

|--- outlook_overcast <= 0.50
|   |--- humidity_high <= 0.50
|   |   |--- windy_False <= 0.50
|   |   |   |--- outlook_rainy <= 0.50
|   |   |   |   |--- class: yes
|   |   |   |--- outlook_rainy >  0.50
|   |   |   |   |--- class: no
|   |   |--- windy_False >  0.50
|   |   |   |--- class: yes
|   |--- humidity_high >  0.50
|   |   |--- outlook_rainy <= 0.50
|   |   |   |--- class: no
|   |   |--- outlook_rainy >  0.50
|   |   |   |--- windy_False <= 0.50
|   |   |   |   |--- class: no
|   |   |   |--- windy_False >  0.50
|   |   |   |   |--- class: yes
|--- outlook_overcast >  0.50
|   |--- class: yes



The tree looks a bit different because we are splitting on 0/1.  But we can see that the outlook variable is still the most predictive variable.

**Note**: (something <= 0.5) means (something == 0) since the values are only 1 and 0.
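
This equivalence is easy to confirm on any 0/1 column. The series `indicator` below is a made-up example, not taken from the tennis data.

```python
import pandas as pd

# Hypothetical 0/1 indicator column.
indicator = pd.Series([0, 1, 1, 0, 1], name="outlook_overcast")

# The split test used by the tree ...
left_branch = indicator <= 0.5
# ... selects exactly the rows where the indicator is 0.
is_zero = indicator == 0

print(left_branch.equals(is_zero))
```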

Let's see if this tree behaves as well as the tree built on the original categorical data.

In [19]:
predict_df = pd.DataFrame(dtree.predict(features_df), columns=['play'])
print("The accuracy of our model is: {}%"
      .format(metrics.accuracy_score(target_df, predict_df)*100))

The accuracy of our model is: 100.0%


**Observation**: Yup, still predicts all the rows correctly, just like the original tree.
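
For reference, the accuracy score is just the fraction of rows where the prediction matches the actual label. A minimal sketch with invented labels (`actual` and `predicted` are hypothetical, chosen so that one of five predictions is wrong):

```python
from sklearn import metrics

# Hypothetical labels: four of the five predictions are correct.
actual = ["no", "no", "yes", "yes", "yes"]
predicted = ["no", "yes", "yes", "yes", "yes"]

acc = metrics.accuracy_score(actual, predicted)

# The same number computed by hand: matches divided by total rows.
manual = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

print(acc, manual)
```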

# Reading

* 3.1 [Pandas](https://jakevdp.github.io/PythonDataScienceHandbook/03.01-introducing-pandas-objects.html)
* 3.2 [Data Indexing and Selection](https://jakevdp.github.io/PythonDataScienceHandbook/03.02-data-indexing-and-selection.html)
* 3.4 [Handling Missing Data](https://jakevdp.github.io/PythonDataScienceHandbook/03.04-missing-values.html)


# Project

See BrightSpace Assignment #2