# Target Guided Ordinal Encoding
1. Order the labels according to the mean of the target
2. Replace each label with its rank in that ordering (for a 0/1 target, the mean is the share of 1s for that label)


In target guided ordinal encoding, we calculate the mean of the target variable for each category, rank the
categories according to that mean, and replace the categorical variable with the rank values.

In [2]:
import pandas as pd
df=pd.read_csv('titanic.csv', usecols=['Cabin','Survived'])
df.head()

Unnamed: 0,Survived,Cabin
0,0,
1,1,C85
2,1,
3,1,C123
4,0,


In [3]:
df['Cabin'] = df['Cabin'].fillna('Missing') # replacing NaN values with 'Missing' (assignment avoids chained inplace, which newer pandas may not apply to df)
df

Unnamed: 0,Survived,Cabin
0,0,Missing
1,1,C85
2,1,Missing
3,1,C123
4,0,Missing
...,...,...
886,0,Missing
887,1,B42
888,0,Missing
889,1,C148


In [4]:
df['Cabin']=df['Cabin'].astype(str).str[0] # taking the first letter of each string in Cabin

In [5]:
df.head()

Unnamed: 0,Survived,Cabin
0,0,M
1,1,C
2,1,M
3,1,C
4,0,M


In [6]:
df.Cabin.unique()

array(['M', 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], dtype=object)

In [7]:
df.groupby(['Cabin'])['Survived'].mean() #Taking mean of Survived w.r.t. Cabin

Cabin
A    0.466667
B    0.744681
C    0.593220
D    0.757576
E    0.750000
F    0.615385
G    0.500000
M    0.299854
T    0.000000
Name: Survived, dtype: float64

In [8]:
df.groupby(['Cabin'])['Survived'].mean().sort_values().index

Index(['T', 'M', 'A', 'G', 'C', 'F', 'B', 'E', 'D'], dtype='object', name='Cabin')

In [9]:
ordinal_labels=df.groupby(['Cabin'])['Survived'].mean().sort_values().index
ordinal_labels

Index(['T', 'M', 'A', 'G', 'C', 'F', 'B', 'E', 'D'], dtype='object', name='Cabin')

In [10]:
# enumerate() adds a counter to an iterable and returns an enumerate object
# (a lazy iterator of (index, item) pairs), which is why only its repr is shown below.

enumerate(ordinal_labels,0)

<enumerate at 0x2db86a14f00>

In [11]:
# The rank is assigned according to the mean of the target (the lowest mean gets 0, the next lowest gets 1, and so on)

In [13]:
ordinal_labels2 = {k:i for i,k in enumerate(ordinal_labels,0)} # i = rank (counter starting at 0), k = category label
ordinal_labels2

{'T': 0, 'M': 1, 'A': 2, 'G': 3, 'C': 4, 'F': 5, 'B': 6, 'E': 7, 'D': 8}

In [14]:
df['Cabin_ordinal_labels']=df['Cabin'].map(ordinal_labels2) #mapping
df.head()

Unnamed: 0,Survived,Cabin,Cabin_ordinal_labels
0,0,M,1
1,1,C,4
2,1,M,1
3,1,C,4
4,0,M,1
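
The steps above can be collected into one small helper. This is only a sketch: the function name `fit_ordinal_map` and the toy data are made up for illustration and are not part of the notebook or of `titanic.csv`.

```python
import pandas as pd

def fit_ordinal_map(frame, column, target):
    """Rank the categories of `column` by the mean of `target` (lowest mean -> 0)."""
    # Same steps as the cells above: group, take the mean, sort, enumerate.
    ordered = frame.groupby(column)[target].mean().sort_values().index
    return {category: rank for rank, category in enumerate(ordered)}

# Hypothetical toy data (NOT taken from titanic.csv)
toy = pd.DataFrame({
    "Cabin":    ["M", "M", "M", "C", "C", "B"],
    "Survived": [0,   0,   1,   1,   0,   1],
})

mapping = fit_ordinal_map(toy, "Cabin", "Survived")
toy["Cabin_ordinal_labels"] = toy["Cabin"].map(mapping)
print(mapping)
print(toy)
```

In the toy data the means are 1/3 for `M`, 1/2 for `C` and 1 for `B`, so `M` gets rank 0, `C` rank 1 and `B` rank 2.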


# Disadvantages
1. It can lead to target leakage or overfitting.
2. The categories may be distributed differently in the train and test data. Categories with only a few rows
can then take extreme target means (the single `T` row above has a mean of 0), so their ranks are unreliable.
A common remedy is to blend the target mean of each category with the marginal (overall) mean of the target.
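
A minimal sketch of how both points are often handled, under some assumptions: the mapping is learned on the training rows only (to limit leakage), each category mean is blended with the overall mean, and categories that never appeared in training get the placeholder rank -1. The function name `smoothed_ordinal_map`, the `weight` parameter, the -1 placeholder and the toy data are all illustrative choices, not something taken from the notebook.

```python
import pandas as pd

def smoothed_ordinal_map(train, column, target, weight=10):
    """Rank categories by a smoothed target mean computed on `train` only."""
    global_mean = train[target].mean()
    stats = train.groupby(column)[target].agg(["mean", "count"])
    # Blend each category mean with the global mean; categories with few rows
    # are pulled more strongly towards the global mean.
    smoothed = (stats["count"] * stats["mean"] + weight * global_mean) / (
        stats["count"] + weight
    )
    ordered = smoothed.sort_values().index
    return {category: rank for rank, category in enumerate(ordered)}

# Hypothetical train/test data (NOT taken from titanic.csv)
train = pd.DataFrame({
    "Cabin":    ["M", "M", "M", "M", "C", "C", "T"],
    "Survived": [0,   0,   0,   1,   1,   1,   0],
})
test = pd.DataFrame({"Cabin": ["M", "C", "Z"]})  # "Z" never appears in train

mapping = smoothed_ordinal_map(train, "Cabin", "Survived", weight=2)
# Unseen categories map to NaN; here we assume -1 is an acceptable placeholder.
test["Cabin_ordinal_labels"] = test["Cabin"].map(mapping).fillna(-1).astype(int)
print(mapping)
print(test)
```

A larger `weight` trusts the overall mean more and the per-category mean less; what value suits a dataset is something to validate rather than assume.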