# Categorical Encoding
##### Categorical encoding refers to converting categorical data into a numeric format that an ML model can process.

## Types
### 1. Nominal - categories without a natural order (Male, Female)
### 2. Ordinal - categories with an order or rank (Good, Better, Best)
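The difference between the two types can be sketched with pandas categoricals. This is only an illustration; the variable names `sex` and `rating` are made up for the example and are not part of the Titanic data used below.

```python
import pandas as pd

# Nominal: the categories have no meaningful order
sex = pd.Categorical(["Male", "Female", "Male"])

# Ordinal: we state the order explicitly (Good < Better < Best)
rating = pd.Categorical(
    ["Good", "Best", "Better"],
    categories=["Good", "Better", "Best"],
    ordered=True,
)

print(sex.ordered)          # whether pandas treats the categories as ordered
print(list(rating.codes))   # integer position of each value in the stated order
print(rating.max())         # only meaningful because the categorical is ordered
```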


In [21]:
import pandas as pd
df = pd.read_csv('titanic.csv',usecols=['Sex', 'Embarked', 'Cabin', 'Survived'])
from sklearn.model_selection import train_test_split
from feature_engine.encoding import OneHotEncoder as ohe

## One hot encoding
This method encodes nominal categories as binary vectors: each category gets its own variable that takes the value 1 or 0.
For example, consider a variable "colour" with the values 'Red', 'Orange' and 'Pink'. We can create 3 new variables called "Red", "Orange" and "Pink". Each variable takes the value 1 if the observation is of that colour and 0 otherwise.

One trick to avoid redundant information in OHE is to encode a variable with k categories into k-1 dummy variables. In our colour example, the full encoding represents Red as 1 0 0, Orange as 0 1 0 and Pink as 0 0 1. With k-1 encoding we keep only the "Red" and "Orange" variables: Red becomes 1 0, Orange becomes 0 1, and Pink is captured by both variables being 0 (0 0).
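The colour example can be reproduced with a small sketch in plain pandas. It uses `pd.get_dummies` rather than the feature_engine encoder used in the cells below, and the toy `colour` series is an assumption made for illustration.

```python
import pandas as pd

# Toy data for the colour example (not part of the Titanic dataset)
colour = pd.Series(["Red", "Orange", "Pink", "Red"], name="colour")

# Full one hot encoding: one 0/1 column per category
full = pd.get_dummies(colour, dtype=int)[["Red", "Orange", "Pink"]]

# k-1 encoding: drop the last column; Pink is then the row where all columns are 0
k_minus_1 = full.drop(columns="Pink")

print(full)
print(k_minus_1)
```

Every row of `full` contains exactly one 1, which is why one column can always be derived from the others.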

In [22]:
df1 = df.copy()

In [30]:
X_train, X_test, y_train, y_test = train_test_split(df1[['Sex','Cabin','Embarked']],df1['Survived'],
                                                    test_size=0.3,random_state=0)

In [36]:
# X_train['Sex'].unique() returns 2 values
# Keep only the first letter of each cabin
# Note: only X_train is changed here; X_test still holds the full cabin strings
X_train['Cabin'] = X_train['Cabin'].str[0]
X_train['Cabin'].unique()

array(['E', 'D', nan, 'B', 'C', 'A', 'F', 'G', 'T'], dtype=object)

In [37]:
ohe_enc = ohe(drop_last=True) #drop_last for k-1
ohe_enc.fit(X_train.fillna('Na'))
transformed = ohe_enc.transform(X_train.fillna('Na'))

In [45]:
transformed.head()
#X_train['Cabin'].unique()

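|  | Sex_male | Cabin_E | Cabin_D | Cabin_Na | Cabin_B | Cabin_C | Cabin_A | Cabin_F | Cabin_G | Embarked_S | Embarked_C | Embarked_Q |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 857 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 52 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 386 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 124 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 578 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |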


In [46]:
ohe_enc.variables_

['Sex', 'Cabin', 'Embarked']

In [52]:
transformed_tst = ohe_enc.transform(X_test.fillna('Na'))
transformed_tst.head()

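|  | Sex_male | Cabin_E | Cabin_D | Cabin_Na | Cabin_B | Cabin_C | Cabin_A | Cabin_F | Cabin_G | Embarked_S | Embarked_C | Embarked_Q |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 495 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 648 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 278 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 31 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 255 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |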


## One hot encoding of frequent categories

When a variable has high cardinality, one hot encoding increases the feature space drastically. It is therefore a good idea to create dummy variables only for the most frequent categories.

In [56]:
ohe_enc = ohe(  # OneHotEncoder was imported above under the alias ohe
    top_categories=10,  # create dummies only for the 10 most frequent categories
    variables=['Sex', 'Cabin', 'Embarked'],
    drop_last=False)
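To make the idea concrete, here is a minimal hand-rolled sketch of encoding only the most frequent categories in plain pandas. The helper `one_hot_top_categories` and the toy `cabin` series are hypothetical names for illustration; this is not how feature_engine implements `top_categories` internally.

```python
import pandas as pd

def one_hot_top_categories(series, top_n):
    """Create 0/1 columns only for the top_n most frequent categories."""
    top = series.value_counts().head(top_n).index
    return pd.DataFrame(
        {f"{series.name}_{cat}": (series == cat).astype(int) for cat in top},
        index=series.index,
    )

# Toy data: C appears 3 times, B twice, A and T once each
cabin = pd.Series(["C", "C", "C", "B", "B", "A", "T"], name="Cabin")
encoded = one_hot_top_categories(cabin, top_n=2)
print(encoded)
```

Rows that belong to an infrequent category (here A and T) end up with 0 in every column, so rare categories are effectively grouped together.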

## Mean target guided encoding

This method replaces each category with the average target value of the observations in that category.

For example, take a variable Colour with the categories Red, Yellow and Green. Compute the mean of the target variable for each category, say Red (0.5), Green (0) and Yellow (1), and replace each category with its value.
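The colour example can be worked through by hand with a pandas `groupby`. This is a minimal sketch on made-up data, chosen so that the category means come out as in the example; the `train` frame and its column names are assumptions for illustration.

```python
import pandas as pd

# Toy training data chosen to match the example: Red 0.5, Green 0, Yellow 1
train = pd.DataFrame({
    "colour": ["Red", "Red", "Green", "Green", "Yellow"],
    "target": [1, 0, 0, 0, 1],
})

# Mean of the target within each category, learned from the training data
category_means = train.groupby("colour")["target"].mean()

# Replace each category with its mean target value
train["colour_encoded"] = train["colour"].map(category_means)
print(category_means)
print(train)
```

The means are learned from the training set only and then reused to transform the test set, which is what the `fit` and `transform` calls in the cells below do.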


In [58]:
from feature_engine.encoding import MeanEncoder

In [62]:
df3 = df.copy()
df3.head(4)

|  | Survived | Sex | Cabin | Embarked |
|---|---|---|---|---|
| 0 | 0 | male | NaN | S |
| 1 | 1 | female | C85 | C |
| 2 | 1 | female | NaN | S |
| 3 | 1 | female | C123 | S |


In [65]:
# Take the first letter; astype(str) turns NaN into 'nan', so missing cabins become 'n'
df3['Cabin'] = df3['Cabin'].astype(str).str[0]

In [68]:
df3['Embarked'] = df3['Embarked'].fillna('Missing')  # fill in missing values

In [71]:
df3.head(4)

|  | Survived | Sex | Cabin | Embarked |
|---|---|---|---|---|
| 0 | 0 | male | n | S |
| 1 | 1 | female | C | C |
| 2 | 1 | female | n | S |
| 3 | 1 | female | C | S |


In [79]:
# Number of distinct values in each column
for x in df3.columns:
    print(len(df3[x].unique()))

2
2
9
4


In [81]:
X_train, X_test, y_train, y_test = train_test_split(df3[['Cabin', 'Sex', 'Embarked']],df3['Survived'],
                                                    test_size=0.3,  random_state=0)

In [82]:
mean_enc = MeanEncoder(
    variables=['Cabin', 'Sex', 'Embarked'])

In [83]:
mean_enc.fit(X_train, y_train)

In [84]:
mean_enc.encoder_dict_

{'Cabin': {'A': 0.42857142857142855,
  'B': 0.7741935483870968,
  'C': 0.5714285714285714,
  'D': 0.6923076923076923,
  'E': 0.7407407407407407,
  'F': 0.6666666666666666,
  'G': 0.5,
  'T': 0.0,
  'n': 0.3036093418259023},
 'Sex': {'female': 0.7534883720930232, 'male': 0.19607843137254902},
 'Embarked': {'C': 0.5648148148148148,
  'Missing': 1.0,
  'Q': 0.4107142857142857,
  'S': 0.3413566739606127}}

In [85]:
#Transform with mean encoder 
X_train = mean_enc.transform(X_train)
X_test = mean_enc.transform(X_test)

In [86]:
X_train.head(5)

|  | Cabin | Sex | Embarked |
|---|---|---|---|
| 857 | 0.740741 | 0.196078 | 0.341357 |
| 52 | 0.692308 | 0.753488 | 0.564815 |
| 386 | 0.303609 | 0.196078 | 0.341357 |
| 124 | 0.692308 | 0.196078 | 0.341357 |
| 578 | 0.303609 | 0.753488 | 0.564815 |


## A few more encoding techniques

### 1. Weight of evidence encoding
### 2. Integer encoding
### 3. Probability ratio encoding
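As a rough sketch, the three techniques can be computed by hand on a toy binary target. The formulas below follow the common definitions (probability ratio as P(target=1) / P(target=0) within a category; weight of evidence as the log of the category's share of all positives over its share of all negatives); library implementations may differ in details such as handling of zero counts. The `data` frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: Red has 3 positives and 1 negative, Green has 1 positive and 3 negatives
data = pd.DataFrame({
    "colour": ["Red", "Red", "Red", "Red", "Green", "Green", "Green", "Green"],
    "target": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Integer encoding: an arbitrary integer per category, in order of first appearance
codes, categories = pd.factorize(data["colour"])

# Probability ratio: P(target=1 | category) / P(target=0 | category)
p1 = data.groupby("colour")["target"].mean()
prob_ratio = p1 / (1 - p1)

# Weight of evidence: ln(share of all positives in category / share of all negatives in category)
pos_share = data.loc[data["target"] == 1, "colour"].value_counts(normalize=True)
neg_share = data.loc[data["target"] == 0, "colour"].value_counts(normalize=True)
woe = np.log(pos_share / neg_share)

print(codes)
print(prob_ratio)
print(woe)
```

Both ratios are undefined when a category contains only positives or only negatives, so real data usually needs a guard for that case.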