### Objective

Feature engineering is one of the most important parts of building a model, and it is where you should spend the most time. The objective of this exercise is to demonstrate the different feature encoding methods commonly used in contests. Categorical features are very common in datasets.

So what is feature encoding? It is the process of transforming a categorical variable into a numeric one so that it can be used in a model. Let's start with the basic methods and work up to the advanced ones.


To be covered:
* One Hot Encoding & Label Encoding
* Frequency Encoding
* Target Mean Encoding


In [11]:
## Loading packages
import numpy as np
import pandas as pd


In [16]:
## Loading dataset
## Change the path to files
train = pd.read_csv("/train.csv")
test = pd.read_csv("/test.csv")

## Glimpse through the data
train.head()


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [17]:
train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


#### Before we jump into feature encoding, let's go ahead and remove unwanted variables like Cabin and Ticket.

In [18]:
## Removing unwanted variables
train.drop(labels = ["Cabin", "Ticket"], axis = 1, inplace = True)
test.drop(labels = ["Cabin", "Ticket"], axis = 1, inplace = True)


#### 1. Converting missing values to NaN.
#### 2. Imputation with Median and Mode for "Age" and "Embarked"

In [19]:
## Fill missing values with NaN (read_csv already loads blank fields as NaN, so this is effectively a no-op)
train = train.fillna(np.nan)
test = test.fillna(np.nan)


In [20]:
## Check for Null values
train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Fare             0
Embarked         2
dtype: int64

In [21]:
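## Missing Values Imputation
## Age: median; Embarked: "S", the most frequent port (mode)
## Assigning the result back avoids chained-assignment issues with inplace=True
train["Age"] = train["Age"].fillna(train["Age"].median())
train["Embarked"] = train["Embarked"].fillna("S")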

In [22]:
## Let's create a variable called Title from the Name variable
## The regex captures the first word followed by a period, e.g. "Mr" in "Braund, Mr. Owen Harris"
train["Title"] = train["Name"].str.extract(r"([A-Za-z]+)\.", expand=False)

title_replacements = {"Mlle": "Other", "Major": "Other", "Col": "Other", "Sir": "Other", "Don": "Other", "Mme": "Other",
          "Jonkheer": "Other", "Lady": "Other", "Capt": "Other", "Countess": "Other", "Ms": "Other", "Dona": "Other"}

## Group the rare titles into "Other"
train["Title"] = train["Title"].replace(title_replacements)


In [23]:
train

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked,Title
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.2500,S,Mr
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C,Mrs
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.9250,S,Miss
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1000,S,Mrs
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,8.0500,S,Mr
...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,13.0000,S,Rev
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,30.0000,S,Miss
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,28.0,1,2,23.4500,S,Miss
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,30.0000,C,Mr


### One Hot Encoding & Label Encoding

Let's say a categorical variable contains the classes ‘eggs’, ‘butter’ and ‘milk’.

* **One Hot Encoding** produces three columns, one per class, and marks the presence of a class with a binary flag. The algorithm only sees presence or absence and makes no assumption about how the classes relate to each other.
* **Label Encoding** gives each class a numerical alias, so the encoded feature takes the values 0, 1 and 2. The problem with this approach is that the three classes have no inherent order, yet the algorithm may treat them as ordered, for example 0<1<2 implying ‘eggs’<‘butter’<‘milk’.

Depending on the variable, we should use either one hot encoding or label encoding. Label encoding suits ordinal variables such as Pclass, while one hot encoding suits variables with no natural order such as Embarked.
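
As a minimal sketch of the difference, the toy frame below encodes the same three classes both ways. The column name `item` and its values are made up for illustration and are not part of the Titanic data, and pandas category codes are used here as a simple stand-in for a label encoder.

```python
import pandas as pd

# Hypothetical toy data: one categorical column with three classes
toy = pd.DataFrame({"item": ["eggs", "butter", "milk", "eggs"]})

# One hot encoding: one binary column per class
one_hot = pd.get_dummies(toy["item"], prefix="item").astype(int)

# Label encoding: one integer alias per class
# (category codes follow the alphabetical order of the classes)
label_encoded = toy["item"].astype("category").cat.codes

print(one_hot)
print(label_encoded.tolist())
```

Each row of `one_hot` has exactly one flag set, while `label_encoded` carries an order (butter < eggs < milk) that exists only because of alphabetical sorting.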


#### One Hot Encoding

In [24]:
## subset categorical variables which you want to encode
x = train[['Embarked','Pclass','Title']]

x = pd.get_dummies(x, columns=['Embarked','Pclass','Title'], drop_first=False)
x.head()


Unnamed: 0,Embarked_C,Embarked_Q,Embarked_S,Pclass_1,Pclass_2,Pclass_3,Title_Dr,Title_Master,Title_Miss,Title_Mr,Title_Mrs,Title_Other,Title_Rev
0,0,0,1,0,0,1,0,0,0,1,0,0,0
1,1,0,0,1,0,0,0,0,0,0,1,0,0
2,0,0,1,0,0,1,0,0,1,0,0,0,0
3,0,0,1,1,0,0,0,0,0,0,1,0,0
4,0,0,1,0,0,1,0,0,0,1,0,0,0


#### We see that all the subcategories in each categorical variable have been converted into binary flags. This type of feature encoding is one hot encoding.

#### Label encoding

In [28]:
## subset categorical variables which you want to encode
## .copy() avoids pandas' SettingWithCopyWarning when assigning below
cols = ['Embarked','Pclass','Title']
x = train[cols].copy()

## OrdinalEncoder assigns integer codes to each column's sorted categories
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder()
x[cols] = oe.fit_transform(x[cols])
x


Unnamed: 0,Embarked,Pclass,Title
0,2.0,2.0,3.0
1,0.0,0.0,4.0
2,2.0,2.0,2.0
3,2.0,0.0,4.0
4,2.0,2.0,3.0
...,...,...,...
886,2.0,1.0,6.0
887,2.0,0.0,2.0
888,2.0,2.0,2.0
889,0.0,0.0,3.0


In [35]:
train['Embarked'].unique()
#train['Pclass'].unique()
#train['Title'].unique()

array(['S', 'C', 'Q'], dtype=object)

In [38]:
## subset categorical variables which you want to encode
## .copy() avoids pandas' SettingWithCopyWarning when assigning below
cols = ['Embarked','Pclass','Title']
y = train[cols].copy()

## Pass explicit category lists to control which code each class receives
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder(categories=[['S', 'C', 'Q'],[ 1, 2, 3],['Mr', 'Mrs', 'Miss', 'Master', 'Other', 'Rev', 'Dr']])
y[cols] = oe.fit_transform(y[cols])
y


Unnamed: 0,Embarked,Pclass,Title
0,0.0,2.0,0.0
1,1.0,0.0,1.0
2,0.0,2.0,2.0
3,0.0,0.0,1.0
4,0.0,2.0,0.0
...,...,...,...
886,0.0,1.0,5.0
887,0.0,0.0,2.0
888,0.0,2.0,2.0
889,1.0,0.0,0.0


#### We see that all the subcategories in the categorical variable have been given numbered aliases.  

### Frequency Encoding

* Step 1 : Select a categorical variable you would like to transform.
* Step 2 : Group by the categorical variable and obtain counts of each category.
* Step 3 : Join it back with the train dataset.


In [42]:
## sample train dataset
sample_train = train[['Embarked','Pclass','Title']]

## Frequency Encoding title variable
y = sample_train.groupby(['Title']).size().reset_index()
y.columns = ['Title', 'Freq_Encoded_Title']
y.head()


Unnamed: 0,Title,Freq_Encoded_Title
0,Dr,7
1,Master,40
2,Miss,182
3,Mr,517
4,Mrs,125


In [40]:
sample_train = pd.merge(sample_train,y,on = 'Title',how = 'left')
sample_train.head()


Unnamed: 0,Embarked,Pclass,Title,Freq_Encoded_Title
0,S,3,Mr,517
1,C,1,Mrs,125
2,S,3,Miss,182
3,S,1,Mrs,125
4,S,3,Mr,517


#### We see that each subcategory in the categorical variable has been replaced by the total number of occurrences of that category.
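
The group-by and merge steps can also be collapsed into a single mapping. The sketch below assumes a small made-up frame (`toy`) in place of the train dataset and uses `value_counts` with `map`.

```python
import pandas as pd

# Hypothetical toy data standing in for the Title column
toy = pd.DataFrame({"Title": ["Mr", "Mrs", "Mr", "Miss", "Mr"]})

# Count each category, then map the counts back onto the rows
title_counts = toy["Title"].value_counts()
toy["Freq_Encoded_Title"] = toy["Title"].map(title_counts)

print(toy)
```

When the same counts are mapped onto a test set, categories that never appeared in train come out as NaN; filling them with 0 is one reasonable choice.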

### Mean Encoding
**Survived** is our dependent variable (DV), so let's look at how we can extract features from it. The following steps are used in **mean encoding**:

* Step 1 : Select a categorical variable you would like to transform.
* Step 2 : Group by the categorical variable and obtain the aggregated sum over the "Survived" variable (the total number of 1's for each category in the DV).
* Step 3 : Group by the categorical variable and obtain the aggregated count over the "Survived" variable.
* Step 4 : Divide the step 2 result by the step 3 result and join it back with the train dataset.


In [47]:
sample_train = train[['Title','Survived']]

## Mean encoding
x = sample_train.groupby(['Title'])['Survived'].sum().reset_index()
x = x.rename(columns={"Survived" : "Title_Survived_sum"})

y = sample_train.groupby(['Title'])['Survived'].count().reset_index()
y = y.rename(columns={"Survived" : "Title_Survived_count"})

z = pd.merge(x,y,on = 'Title',how = 'inner')
z['Target_Encoded_over_Title'] = z['Title_Survived_sum']/z['Title_Survived_count']
z.head()
# Equivalent shortcut (sum / count is the mean):
# x = sample_train.groupby(['Title'])['Survived'].mean().reset_index()

Unnamed: 0,Title,Title_Survived_sum,Title_Survived_count,Target_Encoded_over_Title
0,Dr,3,7,0.428571
1,Master,23,40,0.575
2,Miss,127,182,0.697802
3,Mr,81,517,0.156673
4,Mrs,99,125,0.792


#### We see that each subcategory in the categorical variable is represented by the survival probability observed in that category.

In [44]:
## Joining this back with the sample_train dataset

z = z[['Title','Target_Encoded_over_Title']]

sample_train = pd.merge(sample_train,z,on = 'Title',how = 'left')
sample_train.head()


Unnamed: 0,Title,Survived,Target_Encoded_over_Title
0,Mr,0,0.156673
1,Mrs,1,0.792
2,Miss,1,0.697802
3,Mrs,1,0.792
4,Mr,0,0.156673
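
Since sum divided by count is just the mean, the four steps can be sketched in one line with `groupby(...).transform("mean")`, which returns a value for every row and so needs no merge. The frame `toy` below is made-up illustration data, not the Titanic train set.

```python
import pandas as pd

# Hypothetical toy data: a category and a binary target
toy = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mrs", "Mrs", "Mr", "Miss"],
    "Survived": [0, 1, 1, 1, 0, 1],
})

# transform("mean") broadcasts each group's mean back to its rows
toy["Target_Encoded_over_Title"] = toy.groupby("Title")["Survived"].transform("mean")

print(toy)
```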


What if you want to mean encode a categorical variable using a **continuous variable** instead of a **dichotomous/binary variable**? There are two methods for mean encoding with continuous variables:

1. Direct method
2. k-fold method  

#### Direct Method
* Step 1 : Select a categorical variable you would like to transform
* Step 2 : Select a continuous variable.
* Step 3 : Group by the categorical variable and obtain the aggregated mean over the numeric variable.

In [48]:
## Direct Method
## TYPE 1
## Selecting title (categorical) and Fare (numeric) from the train dataset

sample_train = train[['Title','Fare']]

## Mean encoding
x = sample_train.groupby(['Title'])['Fare'].mean().reset_index()
x = x.rename(columns={"Fare" : "Title" +"_Mean_Encoded"})
x.head()


Unnamed: 0,Title,Title_Mean_Encoded
0,Dr,49.168457
1,Master,34.703125
2,Miss,43.797873
3,Mr,24.44156
4,Mrs,45.138533


In [49]:
## Joining this back with the sample_train dataset

sample_train = pd.merge(sample_train,x,on = 'Title',how = 'left')
sample_train.head()


Unnamed: 0,Title,Fare,Title_Mean_Encoded
0,Mr,7.25,24.44156
1,Mrs,71.2833,45.138533
2,Miss,7.925,43.797873
3,Mrs,53.1,45.138533
4,Mr,8.05,24.44156


#### We see that each title is encoded as the mean ticket fare for that title. This is a popular feature encoding technique in Kaggle competitions.

#### But why are these encodings better?

* Mean encoding embodies the target in the encoded value, whereas label encoding has no correlation with the target.
* When a variable has a large number of categories, mean encoding yields a single column and can be a much simpler alternative to one hot encoding.
* A histogram of predictions using label and mean encoding shows that mean encoding tends to group the classes together, whereas the grouping is random in the case of label encoding.

![](https://cdn-images-1.medium.com/max/800/1*qwooYKx8rU6h1VDnUCgsNg.png)


* Even though mean encoding looks like Superman, its kryptonite is overfitting. Because we use the target to encode the training features, information about the target can leak into the features and bias the encoding. We can reduce this by regularizing.

#### Now let's look at how we can reduce this bias.

#### What is k-fold cross validation?

Cross-validation is primarily used in machine learning to estimate how well a model performs on unseen data. Let's say the value of k is 5. You break the train dataset into 5 parts, hold out one part as the test set and fit a model on the other 4 parts. This is repeated so that each part is held out exactly once. (Refer to the image below.)

#### How can we use this for mean encoding?
Since the k-fold strategy holds out some data, each row can be encoded with means computed from the other folds only, which reduces the bias discussed earlier. Applying the direct method over k folds is therefore a more robust way to do mean encoding.


![](https://cdn-images-1.medium.com/max/1600/1*me-aJdjnt3ivwAurYkB7PA.png)
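
A minimal sketch of k-fold mean encoding follows. The helper name `kfold_mean_encode`, its parameters and the toy values are all assumptions made for illustration. Folds are assigned with plain NumPy rather than a library splitter, and categories that do not occur in the other folds fall back to the overall mean, which is one common choice among several.

```python
import numpy as np
import pandas as pd


def kfold_mean_encode(df, cat_col, value_col, k=5, seed=0):
    """Out-of-fold mean encoding: each row is encoded from the other folds only."""
    rng = np.random.default_rng(seed)
    # Randomly assign every row to one of k folds of (nearly) equal size
    fold_ids = rng.permutation(len(df)) % k
    overall_mean = df[value_col].mean()
    encoded = pd.Series(np.nan, index=df.index, dtype=float)

    for fold in range(k):
        held_out = fold_ids == fold
        # Category means computed on the k-1 remaining folds
        fold_means = df.loc[~held_out].groupby(cat_col)[value_col].mean()
        encoded.loc[held_out] = df.loc[held_out, cat_col].map(fold_means).to_numpy()

    # Categories absent from the remaining folds fall back to the overall mean
    return encoded.fillna(overall_mean)


# Hypothetical toy data (values are made up, not taken from the Titanic set)
toy = pd.DataFrame({
    "Title": ["Mr", "Mrs", "Miss", "Mr", "Mrs", "Miss", "Mr", "Mr", "Mrs", "Miss"],
    "Fare": [7.0, 70.0, 8.0, 9.0, 50.0, 11.0, 8.0, 52.0, 30.0, 17.0],
})
toy["Title_KFold_Mean_Encoded"] = kfold_mean_encode(toy, "Title", "Fare", k=5)
print(toy)
```

Unlike the direct method, rows with the same title can now receive slightly different encoded values, because each fold sees a different subset of the data.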