# Panda Master for Machine Learning

![title](http://oi65.tinypic.com/2nvf248.jpg)

With business data, we often have both numerical and categorical data. In my experience, these are the 8 major challenges:

(1) **Missing numerical data.**

(2) **Imputation mean vs. median.**

(3) **Gaussian vs. skewed distribution.**

(4) **Categorical data missing completely at random (MCAR).**

(5) **Outlier Detection based on Tukey's IQR * 1.5.**

(6) **Cardinality and Rare Values.**

(7) **LabelEncoding and One-Hot Encoding.**

(8) **Normalization vs. Standardization.**

I omit some more advanced, but rarer problems such as mixed categorical data and NaN's in the timestamp.

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_excel("Panda_Master.xlsx")

In [3]:
data.head(1)

Unnamed: 0,Name,Age,Gender,Pre_Test_Score,Post_Test_Score,Country,State,Label
0,Jason,42,Male,101.0,103,USA,CA,0


# Check for NaN's

In [4]:
# Check for NaN's
data[['Name', 'Age', 'Pre_Test_Score', 'Post_Test_Score', 'Country']].isnull().sum()

Name               1
Age                0
Pre_Test_Score     1
Post_Test_Score    0
Country            0
dtype: int64

In [5]:
# We have 2 NaN's
data.isnull().sum(axis=0)

Name               1
Age                0
Gender             0
Pre_Test_Score     1
Post_Test_Score    0
Country            0
State              0
Label              0
dtype: int64

In [6]:
# Select the 'Name' column and show the first 3 rows
data.loc[:,'Name'].head(3)

0    Jason
1    Molly
2     Tina
Name: Name, dtype: object

In [7]:
data.isnull().sum(axis=1)

0    0
1    1
2    0
3    0
4    1
5    0
6    0
7    0
dtype: int64

In [8]:
# let's look at the first NaN
data.iloc[1:2,:]

Unnamed: 0,Name,Age,Gender,Pre_Test_Score,Post_Test_Score,Country,State,Label
1,Molly,52,Female,,191,USA,MI,0


Comment: In this fake dataset I created, we have a NaN in Name and Pre_Test_Score.

Of course, in the real world we don't have the luxury of eyeballing all our data. So we have to inspect it with Pandas.
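For larger tables, counts alone are hard to judge, so it can help to look at the share of missing values per column and to pull out only the affected rows. The following is a minimal sketch, not part of the original notebook; `df` is a small stand-in for `data` that mirrors the sample values used here, and the names `missing_pct` and `rows_with_nan` are my own.

```python
import numpy as np
import pandas as pd

# Small stand-in for the notebook's `data` (same column names, sample values).
df = pd.DataFrame({
    "Name": ["Jason", "Molly", "Tina", "Jake", np.nan, "Heidi", "Susanne", "Luisa"],
    "Pre_Test_Score": [101.0, np.nan, 121.0, 92.0, 98.0, 95.0, 97.0, 101.0],
    "Post_Test_Score": [103, 191, 115, 112, 134, 101, 98, 100],
})

# Share of missing values per column, in percent.
missing_pct = df.isnull().mean() * 100

# Only the rows that contain at least one NaN.
rows_with_nan = df[df.isnull().any(axis=1)]

print(missing_pct.round(1))
print(rows_with_nan)
```

With 8 rows, one missing value corresponds to 12.5% of a column.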

# Dealing with Numerical NaN's
Mean/median imputation consists of replacing all occurrences of missing values (NA) within a variable with the mean or median of that variable.

Important: **mean/median imputation assumes that the data is missing completely at random (MCAR)**. If that's not the case, we have to change our approach and dig deeper (e.g. a NaN might have some bigger reason).

Technically, we should use **mean imputation if the underlying variable has a Gaussian distribution** and median imputation if the variable has a skewed distribution.

In practice, I have not seen a major difference between the two, which is why I stick with Tukey's outlier detection.
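The mean-vs-median rule can be sketched in code by checking the sample skewness first. This is only an illustration under stated assumptions: the cut-off `SKEW_THRESHOLD = 1.0` is a hypothetical rule of thumb I picked for the example, not a value from this notebook, and `scores` is a stand-in for `data['Pre_Test_Score']`.

```python
import numpy as np
import pandas as pd

# Stand-in for data['Pre_Test_Score'] (sample values, one NaN).
scores = pd.Series(
    [101.0, np.nan, 121.0, 92.0, 98.0, 95.0, 97.0, 101.0],
    name="Pre_Test_Score",
)

SKEW_THRESHOLD = 1.0  # hypothetical cut-off, chosen only for this sketch

# Series.skew() ignores NaNs by default.
skewness = scores.skew()

# Strongly skewed -> median; roughly symmetric -> mean.
if abs(skewness) > SKEW_THRESHOLD:
    fill_value = scores.median()
else:
    fill_value = scores.mean()

imputed = scores.fillna(fill_value)
print(skewness, fill_value)
```

On this sample the single high score makes the distribution right-skewed, so the sketch falls back to the median.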

# Outlier Detection
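Detecting outliers is unfortunately more of an art than a science. The famous statistician John Tukey proposed treating any value more than IQR * 1.5 beyond the quartiles as an "outlier". Hence, the upper fence is the 75th percentile + (IQR * 1.5) and the lower fence is the 25th percentile - (IQR * 1.5).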

In [9]:
# Max value
data['Pre_Test_Score'].max()

121.0

In [10]:
# Min value
data['Pre_Test_Score'].min()

92.0

In [11]:
# Calculating outlier values according to Tukey's fences
IQR_2 = data.Pre_Test_Score.quantile(0.75) - data.Pre_Test_Score.quantile(0.25)

Lower_fence_2 = data.Pre_Test_Score.quantile(0.25) - (IQR_2 * 1.5)
Upper_fence_2 = data.Pre_Test_Score.quantile(0.75) + (IQR_2 * 1.5)

In [12]:
Upper_fence_2, Lower_fence_2, IQR_2

(108.5, 88.5, 5.0)

In [13]:
# How many outliers do we have?
data[data['Pre_Test_Score'] > 108.50].apply(lambda x: x.count())

Name               1
Age                1
Gender             1
Pre_Test_Score     1
Post_Test_Score    1
Country            1
State              1
Label              1
dtype: int64

Comment: We have 1 outlier among the 8 rows (7 of which have a Pre_Test_Score).
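The fence calculation can be wrapped in a small helper so that it works for any numeric column and checks both fences at once. This is a sketch of my own; the function name `tukey_fences` and the parameter `k` are not part of the notebook, and `scores` is a stand-in for `data['Pre_Test_Score']`.

```python
import numpy as np
import pandas as pd


def tukey_fences(series, k=1.5):
    """Return (lower_fence, upper_fence) for a numeric Series; NaNs are ignored."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Stand-in for data['Pre_Test_Score'] (sample values, one NaN).
scores = pd.Series([101.0, np.nan, 121.0, 92.0, 98.0, 95.0, 97.0, 101.0])

lower, upper = tukey_fences(scores)

# Comparisons with NaN evaluate to False, so missing values are never flagged.
is_outlier = (scores < lower) | (scores > upper)

print(lower, upper, int(is_outlier.sum()))
```

Using a boolean mask also gives the outlier count directly, without counting every column.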

# Add a New Column for Outliers
Depending on the algorithm used, we need to approach outlier handling differently.

Typically, we have three strategies we can use to handle outliers (Chris Albon): (1) First, we can drop them. (2) Second, we can mark them as outliers and include that marker as a feature. (3) Finally, we can transform the feature to dampen the effect of the outlier.

In general, if the outliers are >5% of the data, I consider outliers as part of the data. In this case, the favored solution on Kaggle is solution (2).
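To make the three strategies concrete, here is a minimal sketch of each on a stand-in for the `Pre_Test_Score` column. The value `108.5` is the upper fence computed above; capping at the fence is just one possible transform for strategy (3) that I chose for illustration, and the names `dropped`, `flagged`, and `capped` are my own.

```python
import numpy as np
import pandas as pd

# Stand-in for the notebook's `data`, reduced to one column.
df = pd.DataFrame(
    {"Pre_Test_Score": [101.0, np.nan, 121.0, 92.0, 98.0, 95.0, 97.0, 101.0]}
)
upper_fence = 108.5  # upper Tukey fence from the calculation above

is_outlier = df["Pre_Test_Score"] > upper_fence

# (1) Drop the outlier rows (rows with NaN are kept, since NaN > x is False).
dropped = df[~is_outlier]

# (2) Keep every row and flag outliers in an extra 0/1 column.
flagged = df.assign(outliers=is_outlier.astype(int))

# (3) Dampen the outlier, here by capping values at the upper fence.
capped = df.assign(Pre_Test_Score=df["Pre_Test_Score"].clip(upper=upper_fence))

print(len(dropped), flagged["outliers"].sum(), capped["Pre_Test_Score"].max())
```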

In [14]:
# Create a new column for outliers
# Use the computed fence (same strict comparison as above) instead of a hard-coded value
data['outliers'] = np.where(data['Pre_Test_Score'] > Upper_fence_2, 1, 0)

In [15]:
data.head(8)

Unnamed: 0,Name,Age,Gender,Pre_Test_Score,Post_Test_Score,Country,State,Label,outliers
0,Jason,42,Male,101.0,103,USA,CA,0,0
1,Molly,52,Female,,191,USA,MI,0,0
2,Tina,35,Female,121.0,115,USA,NY,1,1
3,Jake,24,Male,92.0,112,USA,OR,0,0
4,,22,Male,98.0,134,USA,IL,0,0
5,Heidi,35,Female,95.0,101,Germany,HE,0,0
6,Susanne,34,Female,97.0,98,Germany,BA,0,0
7,Luisa,38,Female,101.0,100,USA,CA,0,0


In [16]:
# numeric_only=True skips the string columns (required in recent pandas versions)
data.groupby('outliers').mean(numeric_only=True)

outliers,Age,Pre_Test_Score,Post_Test_Score,Label
0,35.285714,97.333333,119.857143,0.0
1,35.0,121.0,115.0,1.0


# Cardinality

In [17]:
# Let's have a look at how many distinct labels each categorical column has

for col in data[['Country', 'State']]:
    print(col, ': ', len(data[col].unique()), ' labels')

Country :  2  labels
State :  7  labels


In [18]:
data.groupby('Country').apply(lambda x: x.count())

Country,Name,Age,Gender,Pre_Test_Score,Post_Test_Score,Country,State,Label,outliers
Germany,2,2,2,2,2,2,2,2,2
USA,5,6,6,5,6,6,6,6,6


In [19]:
data['Country'].value_counts()

USA        6
Germany    2
Name: Country, dtype: int64

Comment: In practice, I find treating cardinality (or rare values) not fruitful. Thus, I ignore it. But if you need to squeeze 0.10% out of your model (in an ML competition or a super-tuned model), do not ignore it.
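For the cases where rare values do matter, one common treatment is to group infrequent labels into a single bucket. The following is a minimal sketch, not something this notebook applies; the 20% cut-off `MIN_SHARE` and the bucket name `"Other"` are hypothetical choices for this tiny sample, and `state` is a stand-in for `data['State']`.

```python
import pandas as pd

# Stand-in for data['State'] (sample values).
state = pd.Series(["CA", "MI", "NY", "OR", "IL", "HE", "BA", "CA"], name="State")

MIN_SHARE = 0.20  # hypothetical cut-off: labels below 20% of rows count as rare

# Relative frequency of each label.
share = state.value_counts(normalize=True)
rare_labels = share[share < MIN_SHARE].index

# Keep frequent labels, replace rare ones with a single "Other" bucket.
state_grouped = state.where(~state.isin(rare_labels), "Other")

print(state_grouped.value_counts())
```

This reduces the 7 state labels to 2, which keeps later encodings small.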

In [20]:
# So we have a missing number in Pre_Test_Score
data.isnull().sum(axis=0)

Name               1
Age                0
Gender             0
Pre_Test_Score     1
Post_Test_Score    0
Country            0
State              0
Label              0
outliers           0
dtype: int64

In [21]:
# What's the mean of Pre_Test_Score?
data['Pre_Test_Score'].mean()

100.71428571428571

In [22]:
# Preview mean imputation; inplace=False returns a new Series and leaves data unchanged
data['Pre_Test_Score'].fillna(value=100.71, inplace=False).head()

0    101.00
1    100.71
2    121.00
3     92.00
4     98.00
Name: Pre_Test_Score, dtype: float64

# Dealing with a Missing Name

In [23]:
# Missing name?
data[['Name']].isnull().sum()

Name    1
dtype: int64

In [24]:
# What are the names?
data['Name'].value_counts()

Heidi      1
Molly      1
Jake       1
Jason      1
Luisa      1
Tina       1
Susanne    1
Name: Name, dtype: int64

In [25]:
# Replace the missing name with 'Unknown'
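# Assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas versions
data['Name'] = data['Name'].fillna(value='Unknown')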

# Encoding Categorical Labels
Categorical information is often represented in data as a vector or column of strings (e.g. "California"). The problem is that most machine learning algorithms require inputs to be numerical values.

# Gender Encoding with map()

In [26]:
# Let's encode 'Gender' with pandas' Series.map() method.

In [27]:
Gender_ = data['Gender'].map({'Female': 0, 'Male':1})
Gender_.head(2)

0    1
1    0
Name: Gender, dtype: int64
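One thing to watch with `map()` and a dictionary: any value that is not a key in the dictionary silently becomes NaN. A short sketch of how we might check for that; the lowercase `"male"` entry is a made-up example of dirty data and does not occur in this dataset.

```python
import pandas as pd

# Hypothetical Gender column with one inconsistent spelling ("male").
gender = pd.Series(["Male", "Female", "Female", "Male", "male"], name="Gender")

mapping = {"Female": 0, "Male": 1}
encoded = gender.map(mapping)

# Values missing from the mapping come back as NaN, so list them explicitly.
unmapped = gender[encoded.isnull()].unique()

print(encoded.tolist())
print(unmapped)
```

If `unmapped` is not empty, we either extend the mapping or clean the column first.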

# LabelEncoding vs. One-Hot Encoding
This is a tricky subject and many Machine Learning experts probably won't agree with me.

LabelEncoding: Let's say we have Germany and USA. scikit-learn's LabelEncoder gives each country an integer, starting at 0 in alphabetical order. E.g. Germany = 0 and USA = 1.

One-Hot Encoding: Each class becomes its own feature with 1s when the class appears and 0s otherwise.

The reason for one-hot encoding is to remove the artificial order that LabelEncoder introduces, i.e. USA with a 1 would be rated higher than Germany with a 0 (this is wrong of course).

Unfortunately, one-hot encoding adds problems as well, such as correlation between the new columns and sparsity. For this sample, I stick with LabelEncoder only.
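For comparison, here is what one-hot encoding of `Country` could look like with `pd.get_dummies`. This is only a sketch on a stand-in column and is not used in the rest of the notebook. The `drop_first=True` variant drops one of the columns, which is a common way to reduce the correlation problem.

```python
import pandas as pd

# Stand-in for data['Country'] (sample values).
df = pd.DataFrame(
    {"Country": ["USA", "USA", "USA", "USA", "USA", "Germany", "Germany", "USA"]}
)

# One column per class, 1 where the class appears and 0 otherwise.
one_hot = pd.get_dummies(df["Country"], prefix="Country", dtype=int)

# Dropping the first class removes the redundant (perfectly correlated) column.
reduced = pd.get_dummies(df["Country"], prefix="Country", drop_first=True, dtype=int)

print(one_hot.head(3))
print(list(reduced.columns))
```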

# Country Encoding with LabelEncoder()

In [28]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

In [29]:
le.fit(data['Country'])

LabelEncoder()

In [30]:
Country_ = le.transform(data['Country'])

In [31]:
Country_

array([1, 1, 1, 1, 1, 0, 0, 1])

# Name Encoding with LabelEncoder()

In [32]:
le.fit(data['Name'])

LabelEncoder()

In [33]:
Name_ = le.transform(data['Name'])

In [34]:
Name_

array([2, 4, 6, 1, 7, 0, 5, 3])
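Note that calling `le.fit()` again overwrites the previous mapping, so a single shared encoder can only decode the column it was fitted on last. If we want to translate the numbers back later, one option is to keep a separate encoder per column. A minimal sketch on stand-in data; the `encoders` dictionary and the `encoded` frame are my own names.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in for two categorical columns of the notebook's `data`.
df = pd.DataFrame({
    "Country": ["USA", "USA", "USA", "USA", "USA", "Germany", "Germany", "USA"],
    "State": ["CA", "MI", "NY", "OR", "IL", "HE", "BA", "CA"],
})

encoders = {}  # one fitted LabelEncoder per column
encoded = pd.DataFrame(index=df.index)

for col in ["Country", "State"]:
    encoders[col] = LabelEncoder().fit(df[col])
    encoded[col] = encoders[col].transform(df[col])

# classes_ holds the labels in the order of their integer codes.
country_classes = list(encoders["Country"].classes_)

# Each encoder can still map its own codes back to the original labels.
decoded = encoders["Country"].inverse_transform(encoded["Country"])

print(country_classes)
print(encoded)
```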

# Normalization vs. Standardization
Normalization rescales the data to the range 0 to 1, while standardization rescales the data to a mean of 0 and a standard deviation of 1. If we have outliers in the dataset, **both methods squeeze the remaining values into a narrow band and make the outliers more prominent**.

In other words, **we should never normalize or standardize if we have outliers in the dataset**.

As I've created an additional column for outliers AND I plan to use **two algorithms that are relatively insensitive to outliers** (Logistic Regression and Decision Tree), I do not normalize or standardize.

If you use a **clustering algorithm or a Deep Learning model, you need to normalize**. This gets a bit more complicated.
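To see the outlier effect in numbers, both rescalings can be written out by hand with pandas. This is a sketch on the non-missing sample scores; `108.5` is the upper Tukey fence from above. Note that `Series.std()` uses the sample standard deviation (ddof=1), which is my choice here and differs slightly from scikit-learn's `StandardScaler`, which uses ddof=0.

```python
import pandas as pd

# Non-missing Pre_Test_Score sample values; 121.0 is the outlier.
scores = pd.Series([101.0, 121.0, 92.0, 98.0, 95.0, 97.0, 101.0])

# Normalization (min-max): x' = (x - min) / (max - min), result lies in [0, 1].
normalized = (scores - scores.min()) / (scores.max() - scores.min())

# Standardization (z-score): x' = (x - mean) / std, result has mean 0 and std 1.
standardized = (scores - scores.mean()) / scores.std()

# Largest normalized value among the non-outliers (below the fence of 108.5).
inlier_max = normalized[scores < 108.5].max()

print(normalized.round(2).tolist())
print(standardized.round(2).tolist())
print(round(inlier_max, 2))
```

The single outlier takes the value 1.0 after normalization, while all other scores end up in roughly the lower third of the range.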

# State Encoding with LabelEncoder()

In [35]:
le.fit(data['State'])

LabelEncoder()

In [36]:
State_ = le.transform(data['State'])

In [37]:
State_

array([1, 4, 5, 6, 3, 2, 0, 1])