## ***Target Guided Ordinal Encoding***
In this technique, we order the labels of a categorical variable according to the target.

Alternatively, we can replace each label with the probability that the target is 1 for that label.

In this method, we calculate the mean of the target for each category and then rank the categories by that mean.

Plain ordinal encoding requires a known order of the categories, which ordinal variables have but nominal variables do not; here the order is taken from the target mean instead.
For nominal variables there is also a related technique called Mean Encoding, which uses the mean itself rather than its rank.
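
As a minimal sketch of how the two ideas differ, the toy example below ranks categories by their target mean (target guided ordinal encoding) and, separately, uses the mean itself as the encoded value (mean encoding). The `city` column, the `target` column and their values are made up for illustration and are not part of the Titanic data.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'city':   ['A', 'A', 'B', 'B', 'B', 'C'],
    'target': [1, 0, 1, 1, 0, 0],
})

# Mean of the target for each category
means = toy.groupby('city')['target'].mean()

# Target guided ordinal encoding: rank categories by ascending target mean
ranks = {label: rank for rank, label in enumerate(means.sort_values().index)}
toy['city_ordinal'] = toy['city'].map(ranks)

# Mean encoding: use the target mean itself as the encoded value
toy['city_mean'] = toy['city'].map(means)
```

In both cases the encoded value is derived from the target, so the only difference is whether the rank or the raw mean is kept.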

In [1]:
import pandas as pd
import numpy as np

In [6]:
data = pd.read_csv('C:\\Users\\kc510\\Documents\\Feature_Engineering\\Handling_Categorical_Variables\\titanic.csv',usecols=['Cabin','Survived'])

In [7]:
data.head()

| | Survived | Cabin |
|---|---|---|
| 0 | 0 | NaN |
| 1 | 1 | C85 |
| 2 | 1 | NaN |
| 3 | 1 | C123 |
| 4 | 0 | NaN |


In [8]:
# First, fill the NaN values in Cabin using the "replace NaN with a new category" technique
data['Cabin'] = data['Cabin'].fillna('Missing')

In [9]:
data.head()

| | Survived | Cabin |
|---|---|---|
| 0 | 0 | Missing |
| 1 | 1 | C85 |
| 2 | 1 | Missing |
| 3 | 1 | C123 |
| 4 | 0 | Missing |


Here we can see that there are many categories, such as Missing, C85, C123, E46 and more.<br>
C85 and C123 are different cabin numbers that belong to the **C** category.<br>
Similarly, E46 and E47 are different cabin numbers that belong to the **E** category.

To perform target guided encoding on the Cabin column, we take the first letter of each value and treat it as the category. For example, C85 becomes C and Missing becomes M.

In [17]:
data['Cabin'] = data.Cabin.astype(str).str[0]

In [18]:
data.head()

| | Survived | Cabin |
|---|---|---|
| 0 | 0 | M |
| 1 | 1 | C |
| 2 | 1 | M |
| 3 | 1 | C |
| 4 | 0 | M |


In [19]:
data.Cabin.unique()

array(['M', 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], dtype=object)

In [21]:
data.groupby('Cabin').Survived.mean()

Cabin
A    0.466667
B    0.744681
C    0.593220
D    0.757576
E    0.750000
F    0.615385
G    0.500000
M    0.299854
T    0.000000
Name: Survived, dtype: float64

Here we can see the survival probability for each category. Because Survived is either 0 or 1, the mean of each group is simply the fraction of its rows that are 1.
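
To see why the mean of a 0/1 column can be read as a probability, the small sketch below compares the share of ones with the mean. The values in `survived` are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical 0/1 outcomes for the rows of a single category
survived = pd.Series([1, 0, 1, 1])

# Share of rows equal to 1, computed by counting
share_of_ones = (survived == 1).sum() / len(survived)

# Mean of the same column
mean_value = survived.mean()
```

Both values are the same number, which is why `groupby('Cabin').Survived.mean()` gives the survival rate per category.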

In [25]:
# Sort the categories in ascending order of their mean and get their index
ordinal_labels = data.groupby('Cabin').Survived.mean().sort_values().index
ordinal_labels

Index(['T', 'M', 'A', 'G', 'C', 'F', 'B', 'E', 'D'], dtype='object', name='Cabin')

In [26]:
# To encode, map each label to a number (its rank) according to this order.
ordinal_labels2 = {k: i for i, k in enumerate(ordinal_labels, 0)}  # dictionary with the label as the key and its rank as the value
ordinal_labels2
# D has the highest mean, so it gets the highest rank

{'T': 0, 'M': 1, 'A': 2, 'G': 3, 'C': 4, 'F': 5, 'B': 6, 'E': 7, 'D': 8}

In [27]:
data['Cabin_Ordinal'] = data.Cabin.map(ordinal_labels2)

In [28]:
data.head()

| | Survived | Cabin | Cabin_Ordinal |
|---|---|---|---|
| 0 | 0 | M | 1 |
| 1 | 1 | C | 4 |
| 2 | 1 | M | 1 |
| 3 | 1 | C | 4 |
| 4 | 0 | M | 1 |


In [29]:
# Now we can drop the original Cabin column
data.drop('Cabin',axis=1,inplace=True)
data.head()

| | Survived | Cabin_Ordinal |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 1 | 4 |
| 2 | 1 | 1 |
| 3 | 1 | 4 |
| 4 | 0 | 1 |
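
The steps above (group by category, take the target mean, sort, enumerate, map) could be collected into one small helper. This is only a sketch: the function name `target_guided_ordinal_map` and the `sample` DataFrame are hypothetical and not part of the notebook.

```python
import pandas as pd

def target_guided_ordinal_map(df, column, target):
    """Return a {label: rank} dict, ranking labels by ascending target mean."""
    ordered = df.groupby(column)[target].mean().sort_values().index
    return {label: rank for rank, label in enumerate(ordered)}

# Hypothetical sample resembling the Cabin data after taking the first letter
sample = pd.DataFrame({
    'Cabin':    ['M', 'C', 'M', 'C', 'M', 'T'],
    'Survived': [0, 1, 1, 1, 0, 0],
})

# Build the mapping once, then apply it to the column
mapping = target_guided_ordinal_map(sample, 'Cabin', 'Survived')
sample['Cabin_Ordinal'] = sample['Cabin'].map(mapping)
```

Keeping the mapping as a separate dictionary makes it possible to reuse it on other data. Note that `map` returns NaN for any label that is not a key in the dictionary.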
