In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn import linear_model

In [4]:
titanic = pd.read_csv("Dataset/train.csv")
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


There are 3 ordinary categorical features here:  
Sex, Cabin, Embarked  
And one more:  
Pclass, with values [1, 2, 3]. It is ordinal, in other words an ordered categorical feature.  
If Pclass were a numerical feature, we could say that the difference between Pclass 1 and 2 is the same as the difference between Pclass 2 and 3. But Pclass is ordinal, so we only know the order of the classes, not which difference is bigger.  

## Encoding
One way to feed a categorical feature to a model is to map its unique values to different numbers. This is usually referred to as label encoding.  
This technique works fine with tree-based methods, because these models can split on the feature and extract most of the useful values in the categories on their own.  
Non-tree-based models, however, cannot make use of this as effectively.  
Example:  
Pclass    1      2      3  
target    1      0      1  
When the relationship between the target and the ordinal feature is not linear, a linear model would be confused, while a tree would just make 2 splits to isolate each unique value.  
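
A minimal sketch of the Pclass/target example above, assuming scikit-learn's `LogisticRegression` stands in for a linear model and `DecisionTreeClassifier` for a tree-based one. The toy data simply repeats the three rows of the table; it is not taken from the Titanic file.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Toy data: the Pclass/target table above, repeated 20 times.
X = np.tile([[1], [2], [3]], (20, 1))
y = np.tile([1, 0, 1], 20)

linear = LogisticRegression().fit(X, y)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Accuracy on the same toy data the models were fitted on.
linear_acc = linear.score(X, y)
tree_acc = tree.score(X, y)
print("logistic regression accuracy:", linear_acc)
print("decision tree accuracy:", tree_acc)
```

The linear model's prediction is monotonic in Pclass, so it cannot output 1, 0, 1 for the three classes, while the tree can separate all three values with two splits.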

### Label Encoding:  

1. Alphabetical (sorted).  
sklearn.preprocessing.LabelEncoder  
[S, C, Q] -> [2, 0, 1]  

2. Order of appearance.  
pandas.factorize  
[S, C, Q] -> [0, 1, 2]  
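
The two label encoders can be compared on the three Embarked values with a short sketch like the following. The `ports` series is a hand-made example, not a column from the dataset. Both functions return zero-based integer codes.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

ports = pd.Series(["S", "C", "Q"])

# LabelEncoder sorts the unique values first: C -> 0, Q -> 1, S -> 2.
sorted_codes = LabelEncoder().fit_transform(ports)

# pandas.factorize numbers values in order of first appearance:
# S -> 0, C -> 1, Q -> 2.
appearance_codes, uniques = pd.factorize(ports)

print(sorted_codes)      # [2 0 1]
print(appearance_codes)  # [0 1 2]
```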

In [7]:
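# 3. Frequency Encoding.
# Count the rows per Embarked value (groupby skips missing values).
encoding = titanic.groupby('Embarked').size()
# Convert the counts to fractions of all rows.
encoding = encoding / len(titanic)
# Replace each category with its frequency.
titanic['enc'] = titanic.Embarked.map(encoding)
titanic.head()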

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,enc
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0.722783
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0.188552
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0.722783
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,0.722783
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,0.722783


If the frequency of a category is correlated with the target value, then linear models can exploit this dependency. The encoding preserves information about the distribution of the values.  
However, when 2 categories have the same frequency, they won't be distinguishable in this new feature.  
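
The collision can be seen on a small hand-made column (a hypothetical example, not the Titanic data) in which two categories occur equally often:

```python
import pandas as pd

# Hypothetical toy column: "A" and "B" both appear twice, "C" once.
toy = pd.DataFrame({"cat": ["A", "A", "B", "B", "C"]})

# Frequency encoding: map each category to its share of the rows.
freq = toy["cat"].value_counts(normalize=True)
toy["enc"] = toy["cat"].map(freq)

# Number of distinct categories vs. number of distinct encoded values.
n_categories = toy["cat"].nunique()
n_encoded = toy["enc"].nunique()
print(n_categories, n_encoded)  # 3 2
```

Here "A" and "B" both map to 0.4, so the encoded feature has only two distinct values for three categories.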

### Adapt Categorical Features to Non-Tree-Based Models
One hot encoding:  
Pclass  
1  
2  
3  
Pclass==1 Pclass==2 Pclass==3  
1         0         0  
0         1         0  
0         0         1  

pandas.get_dummies  
sklearn.preprocessing.OneHotEncoder  
One-hot encoded features usually contain lots of zeros, so we can use sparse matrices to store the data more efficiently. Libraries like XGBoost, LightGBM and sklearn support sparse matrices.  
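
A short sketch of both one-hot options on a three-row Pclass column. The small `df` frame is made up for illustration, and the `prefix` and `dtype` arguments are choices made here, not something the notebook prescribes.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"Pclass": [1, 2, 3]})

# Dense one-hot columns with pandas.
dummies = pd.get_dummies(df["Pclass"], prefix="Pclass", dtype=int)

# OneHotEncoder returns a SciPy sparse matrix by default.
onehot = OneHotEncoder().fit_transform(df[["Pclass"]])

print(dummies)
print(sparse.issparse(onehot))
print(onehot.toarray())
```

The sparse result stores only the non-zero entries, which matters once a feature has many unique values.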

## Summary
1. An ordinal feature is a special case of a categorical feature, with values sorted in some meaningful order.  
2. Label encoding replaces the unique values of a categorical feature with numbers.  
3. Frequency encoding maps the values of a categorical feature to their frequencies.  
4. Label encoding and frequency encoding usually work better with tree-based models.  
5. One-hot encoding is often used for non-tree-based models.  
6. Applying one-hot encoding to combinations of categorical features allows non-tree-based models to take interactions between features into account, and can improve them.  
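
Point 6 can be sketched as follows, assuming the interaction is built by concatenating Pclass and Sex into one string feature before one-hot encoding. The toy rows and the `Pclass_Sex` column name are made up for illustration.

```python
import pandas as pd

# Hypothetical toy rows using two Titanic columns.
df = pd.DataFrame({
    "Pclass": [1, 3, 3, 1],
    "Sex": ["female", "male", "female", "male"],
})

# Concatenate the two categorical features into one combined feature.
df["Pclass_Sex"] = df["Pclass"].astype(str) + "_" + df["Sex"]

# One-hot encode the combined feature: one column per (Pclass, Sex) pair.
interactions = pd.get_dummies(df["Pclass_Sex"], dtype=int)
print(interactions)
```

A linear model can then learn a separate weight for each (Pclass, Sex) pair instead of only one weight per individual feature.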