## ***Mean Encoding***
In this method, we replace each category with the mean of the target (output) variable for that category.

This approach is useful when a column has many distinct categories.



In [2]:
import pandas as pd
import numpy as np
data = pd.read_csv('titanic.csv',usecols=['Survived','Cabin'])

In [3]:
data.head()

|   | Survived | Cabin |
|---|----------|-------|
| 0 | 0 | NaN |
| 1 | 1 | C85 |
| 2 | 1 | NaN |
| 3 | 1 | C123 |
| 4 | 0 | NaN |


In [4]:
# First, handle the NaN values in Cabin by replacing them with a new category, 'Missing'
# (assigning the result back avoids relying on an in-place fillna on a column accessor)
data['Cabin'] = data['Cabin'].fillna('Missing')

In [6]:
data.Cabin = data.Cabin.astype(str).str[0]

Before this step, the Cabin column contained many distinct categories, such as Missing, C85, C123, E46 and more.<br>
Here C85 and C123 are different cabin numbers that belong to the **C** category.<br>
Similarly, E46 and E47 are different cabin numbers that belong to the **E** category.

To perform mean encoding on the Cabin column, we keep only the first letter of each value. For example, C85 becomes C, and C is treated as one category. The placeholder Missing becomes M.

In [7]:
data.head()

|   | Survived | Cabin |
|---|----------|-------|
| 0 | 0 | M |
| 1 | 1 | C |
| 2 | 1 | M |
| 3 | 1 | C |
| 4 | 0 | M |


In [8]:
data.Cabin.unique()

array(['M', 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], dtype=object)

In [9]:
data.groupby('Cabin').Survived.mean()

Cabin
A    0.466667
B    0.744681
C    0.593220
D    0.757576
E    0.750000
F    0.615385
G    0.500000
M    0.299854
T    0.000000
Name: Survived, dtype: float64

Here we can see the survival rate for each category. Because Survived is either 0 or 1, the mean of each group is simply the fraction of passengers in that category who survived.
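To make the link between the mean and the survival fraction concrete, here is a minimal sketch on a small made-up DataFrame (the `toy` data is hypothetical and is not taken from `titanic.csv`). It computes the rate by hand as survivors divided by passengers and compares it with the group mean.

```python
import pandas as pd

# Toy data, made up for illustration only (not from the Titanic file)
toy = pd.DataFrame({
    "Cabin": ["C", "C", "C", "M", "M", "M", "M", "E"],
    "Survived": [1, 1, 0, 0, 0, 1, 0, 1],
})

grouped = toy.groupby("Cabin")["Survived"]

survivors = grouped.sum()     # number of 1s in each category
passengers = grouped.count()  # number of rows in each category

rate_by_hand = survivors / passengers  # fraction of 1s per category
rate_by_mean = grouped.mean()          # the value mean encoding uses

print(pd.DataFrame({"by_hand": rate_by_hand, "by_mean": rate_by_mean}))
```

Both columns should agree, since the mean of a 0/1 column is the count of 1s divided by the number of rows.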

In [12]:
# Compute the mean of Survived for each Cabin category and store it as a dict
mean_dict = data.groupby('Cabin').Survived.mean().to_dict()
mean_dict

{'A': 0.4666666666666667,
 'B': 0.7446808510638298,
 'C': 0.5932203389830508,
 'D': 0.7575757575757576,
 'E': 0.75,
 'F': 0.6153846153846154,
 'G': 0.5,
 'M': 0.29985443959243085,
 'T': 0.0}

In [13]:
# To apply mean encoding, replace each category with its mean from mean_dict
data.Cabin = data.Cabin.map(mean_dict)

In [14]:
data.head()

|   | Survived | Cabin |
|---|----------|-------|
| 0 | 0 | 0.299854 |
| 1 | 1 | 0.59322 |
| 2 | 1 | 0.299854 |
| 3 | 1 | 0.59322 |
| 4 | 0 | 0.299854 |
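The group-then-map steps above can be wrapped in a small helper. The sketch below is one possible way to do it, not the only one: the function name `mean_encode` and the `toy` data are hypothetical and chosen for illustration.

```python
import pandas as pd

def mean_encode(df, column, target):
    """Return `column` with each category replaced by the mean of `target`."""
    means = df.groupby(column)[target].mean()  # one mean per category
    return df[column].map(means)               # look up the mean for each row

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "Cabin": ["C", "M", "C", "M", "E"],
    "Survived": [1, 0, 0, 0, 1],
})

toy["Cabin_encoded"] = mean_encode(toy, "Cabin", "Survived")

# groupby(...).transform("mean") should give the same values in one step
same = toy.groupby("Cabin")["Survived"].transform("mean")

print(toy)
```

Keeping the per-category means in a separate object, as `mean_dict` does in the cells above, is handy when the same mapping has to be applied to another DataFrame later.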


### **Advantages and Disadvantage of Mean Encoding**

***Advantages:***
1. It captures information about the target within the encoded values, which yields more predictive features.
2. It creates a monotonic relationship between the variable and the target.

***Disadvantage:***
1. It is prone to overfitting the data.
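One common way to reduce this risk is to compute the category means on the training rows only and then apply that mapping to unseen rows. The sketch below assumes a simple train/test split and uses the overall training mean as a fallback for categories that never appear in training. The `train` and `test` frames are made-up toy data, and the fallback choice is an assumption rather than part of the method described above.

```python
import pandas as pd

# Toy data, made up for illustration only
train = pd.DataFrame({
    "Cabin": ["C", "C", "M", "M", "M", "T"],
    "Survived": [1, 0, 0, 0, 1, 0],
})
test = pd.DataFrame({"Cabin": ["C", "M", "G"]})  # "G" never appears in train

# Learn the encoding from the training rows only
train_means = train.groupby("Cabin")["Survived"].mean()
global_mean = train["Survived"].mean()

# Apply the learned mapping; unseen categories map to NaN,
# which is then filled with the overall training mean (an assumed fallback)
test["Cabin_encoded"] = test["Cabin"].map(train_means).fillna(global_mean)

print(train_means)
print(test)
```

Note that `T` has a single training row here, so its mean of 0.0 reflects one observation rather than a stable rate. Categories with very few rows are where mean encoding is most likely to overfit.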