# Introduction

This notebook is inspired by the amazing [article](https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02) written by Baijayanta Roy on Medium, in which he describes multiple ways of handling categorical features in machine learning problems.

I have tried to capture most of those methods in this notebook for easy reference.

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('./Titanic/train.csv')
df = df[['PassengerId', 'Embarked']].copy()  # keep only one categorical feature for a clearer demonstration
df['Embarked'] = df['Embarked'].fillna('C')  # fill missing ports with 'C'; plain assignment avoids chained in-place modification
df.sample(5)

Unnamed: 0,PassengerId,Embarked
573,574,Q
849,850,C
582,583,S
556,557,C
108,109,S


# Encoding Methods

## One-hot encoding

One-hot encoding creates a dummy (0/1) variable for each level of the categorical feature. There are two ways to achieve this: the <b>get_dummies</b> function from Pandas and <b>OneHotEncoder</b> from Scikit-Learn.

In [2]:
df_embarked = pd.get_dummies(df, prefix = ['Embarked'], columns = ['Embarked'], dtype = int)  # dtype = int gives 0/1 columns; newer pandas versions default to True/False
df_embarked.sample(5)

Unnamed: 0,PassengerId,Embarked_C,Embarked_Q,Embarked_S
498,499,0,0,1
160,161,0,0,1
446,447,0,0,1
579,580,0,0,1
233,234,0,0,1


In [3]:
from sklearn.preprocessing import OneHotEncoder
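ohc = OneHotEncoder()
# fit_transform expects a 2-D input and returns a sparse matrix, hence reshape(-1, 1) and toarray()
ohe = ohc.fit_transform(df.Embarked.values.reshape(-1, 1)).toarray()
# columns follow the sorted order of the levels, so Embarked_1, Embarked_2, Embarked_3 correspond to C, Q, S
df_ohe = pd.DataFrame(ohe, columns = ['Embarked_'+str(i) for i in range(1, 4)])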
df_embarked = pd.concat([df, df_ohe], axis = 1)
df_embarked.sample(5)

Unnamed: 0,PassengerId,Embarked,Embarked_1,Embarked_2,Embarked_3
337,338,C,1.0,0.0,0.0
89,90,S,0.0,0.0,1.0
628,629,S,0.0,0.0,1.0
297,298,S,0.0,0.0,1.0
94,95,S,0.0,0.0,1.0
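
To make explicit what both approaches compute, here is a minimal sketch that builds the same indicator columns by hand with pandas and names them after the actual levels. The `ports` series is a toy stand-in for the Embarked column (an assumption for illustration, not the real Titanic data), and `df_manual` is a hypothetical name.

```python
import pandas as pd

# Toy stand-in for the Embarked column (not the real Titanic data)
ports = pd.Series(['S', 'C', 'S', 'Q'], name='Embarked')

# Sorted levels, so the column order is C, Q, S
levels = sorted(ports.unique())

# One 0/1 indicator column per level
df_manual = pd.DataFrame(
    {f'Embarked_{level}': (ports == level).astype(int) for level in levels}
)
print(df_manual)
```

Each row contains exactly one 1, in the column of the level that row belongs to.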


## Label/Ordinal Encoding

In this encoding, each level is assigned an integer; Scikit-Learn's LabelEncoder uses 0 through N-1 (N being the total number of levels). One major issue with this approach is that it imposes an order on the levels even when no natural order or relationship exists among them.

In [4]:
from sklearn.preprocessing import LabelEncoder
df['Embarked_LabelEncoded'] = LabelEncoder().fit_transform(df['Embarked'])
df.sample(5)

Unnamed: 0,PassengerId,Embarked,Embarked_LabelEncoded
365,366,S,2
404,405,S,2
27,28,S,2
307,308,C,0
333,334,S,2


Here, you can see that the levels are encoded in alphabetical order (C: 0, Q: 1, S: 2). An easy way to control this ordering is to use the <b>map</b> method from pandas, which lets you preserve a natural order. However, the mapping has to be written by hand, which is error-prone.

In [5]:
df['Embarked_LabelEncoded'] = df['Embarked'].map({'C': 2, 'Q': 1, 'S': 0})
df.sample(5)

Unnamed: 0,PassengerId,Embarked,Embarked_LabelEncoded
609,610,S,0
837,838,S,0
282,283,S,0
685,686,C,2
464,465,S,0
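
One possible way to reduce the hand-typing is to state the desired order once as a list and let pandas derive the integer codes from it. This is a minimal sketch on a toy series; the order `['S', 'Q', 'C']` is only an assumption that reproduces the mapping used above (S: 0, Q: 1, C: 2), and the variable names are hypothetical.

```python
import pandas as pd

# Toy stand-in for the Embarked column (not the real Titanic data)
ports = pd.Series(['S', 'C', 'Q', 'S'])

# Desired order, written once: position in the list becomes the code
order = ['S', 'Q', 'C']

# An ordered categorical exposes the integer position of each level via .codes
codes = pd.Categorical(ports, categories=order, ordered=True).codes
print(codes.tolist())
```

Changing the order then means editing one list rather than retyping every integer in a dictionary.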


## Helmert Encoding

In this encoding, the mean of the dependent variable for a level is compared to the mean of the dependent variable over all previous levels.

In [6]:
import category_encoders as ce
encoder = ce.HelmertEncoder(cols = ['Embarked'], drop_invariant = True)  # drop_invariant = True drops the intercept column, which has a constant value
dfh = encoder.fit_transform(df['Embarked'])
dfh.sample(5)

Unnamed: 0,Embarked_0,Embarked_1
805,-1.0,-1.0
806,-1.0,-1.0
313,-1.0,-1.0
617,-1.0,-1.0
417,-1.0,-1.0
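
To see where values such as -1.0 come from, here is a minimal sketch of a Helmert contrast matrix built with NumPy. It is my assumption that this matches the contrast layout used by category_encoders; the function name `helmert_contrast` is hypothetical and not part of any library.

```python
import numpy as np

def helmert_contrast(n_levels):
    """Build an (n_levels x n_levels-1) Helmert contrast matrix (illustrative sketch)."""
    contrast = np.zeros((n_levels, n_levels - 1))
    for j in range(n_levels - 1):
        contrast[: j + 1, j] = -1   # every earlier level gets -1
        contrast[j + 1, j] = j + 1  # the level being compared gets weight j + 1
    return contrast

# Three levels, as in Embarked: one row per level, one column per comparison
print(helmert_contrast(3))
```

Each column sums to zero, and the first level is encoded as a row of -1 values, which is consistent with the sampled rows above if S is the first level the encoder saw.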


## Binary Encoding

Binary encoding converts classes into binary digits. For n classes, there will be about roundup(log<sub>2</sub>n) features; for example, 100 categories would require only 7 features. The steps in this approach are:
1. Categories are first converted into numeric order starting from 1 to n.
2. Then those integers are converted into binary code (e.g. 3 is 011 and 4 is 100).
3. These digits of binary code form separate columns.
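
The three steps above can be sketched in plain Python as follows. This is only an illustration of the idea, not the category_encoders implementation; `binary_encode` is a hypothetical helper, and numbering the levels in order of first appearance is my assumption. Because the codes start at 1, the sketch uses as many bits as the largest code n needs.

```python
def binary_encode(values):
    """Binary-encode a list of categories (illustrative sketch)."""
    # Step 1: assign integer codes 1..n in order of first appearance
    codes = {}
    for value in values:
        if value not in codes:
            codes[value] = len(codes) + 1

    # Number of binary digits needed to write the largest code n
    n_bits = len(codes).bit_length()

    # Steps 2 and 3: write each code in binary, one digit per column
    return [[int(bit) for bit in format(codes[value], f'0{n_bits}b')]
            for value in values]

# S -> 1 -> 01, C -> 2 -> 10, Q -> 3 -> 11
print(binary_encode(['S', 'C', 'S', 'Q']))
```

With 100 distinct categories this sketch produces 7 columns per row, in line with the example above.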

In [7]:
encoder = ce.BinaryEncoder(cols = ['Embarked'])
dfbin = encoder.fit_transform(df['Embarked'])
dfbin.sample(5)

Unnamed: 0,Embarked_0,Embarked_1,Embarked_2
615,0,0,1
385,0,0,1
673,0,0,1
236,0,0,1
823,0,0,1


## Frequency/Mean Encoding

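This method uses the frequency of each class within a variable, which helps the model understand its probability of occurrence. Frequencies can also be weighted depending on the nature of the data. Strictly speaking, the code below is frequency encoding; mean (target) encoding instead replaces each level with the mean of the target variable for that level.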

In [8]:
ef = df['Embarked'].value_counts(normalize = True)
df['Embarked_MeanEncoding'] = df['Embarked'].map(ef)
df.sample(5)

Unnamed: 0,PassengerId,Embarked,Embarked_LabelEncoded,Embarked_MeanEncoding
571,572,S,0,0.722783
849,850,C,2,0.190797
394,395,S,0,0.722783
728,729,S,0,0.722783
440,441,S,0,0.722783


There are many variations of this method. For example, a smoothing value can be applied based on the overall mean (conversion rate) of the target variable, or weights can be assigned manually to each level and multiplied with the proportion values. Again, it all depends on the nature of the data and what you need to achieve with it.
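
As one possible form of the smoothing idea, the sketch below blends each level's target mean with the overall target mean, so that rare levels are pulled towards the overall rate. Everything here is an assumption for illustration: the toy data, the `Survived` target column, the helper name `smoothed_mean_encode`, and the smoothing weight `m`.

```python
import pandas as pd

def smoothed_mean_encode(data, col, target, m=10):
    """Smoothed mean encoding: (n * level_mean + m * overall_mean) / (n + m)."""
    overall_mean = data[target].mean()
    stats = data.groupby(col)[target].agg(['count', 'mean'])
    smoothed = (stats['count'] * stats['mean'] + m * overall_mean) / (stats['count'] + m)
    return data[col].map(smoothed)

# Toy data (not the real Titanic file)
toy = pd.DataFrame({
    'Embarked': ['S', 'S', 'S', 'C', 'C', 'Q'],
    'Survived': [0, 1, 0, 1, 1, 0],
})
toy['Embarked_SmoothedMean'] = smoothed_mean_encode(toy, 'Embarked', 'Survived', m=2)
print(toy)
```

A larger `m` pulls every level closer to the overall mean, while `m = 0` reduces to the plain per-level mean. In a real modelling exercise the encoding should be computed on the training data only, so that the target does not leak into validation rows.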

# Conclusion

There are many other encoding techniques which, to my knowledge, are modified forms of the techniques discussed above. I have not used them in any modelling exercise, so I refrain from discussing them here; interested readers can read about Weight-of-Evidence encoding, Hashing, Backward Difference Encoding, etc.
