## TARGET GUIDED ENCODING

<b>Definition: In target guided encoding, we compute the mean of the target variable for each category, rank the categories by that mean, and then replace each category with its rank.</b>

1. Order the labels according to the target mean and replace each label with its rank<br>
2. Replace each label with the probability of the target being 1 for that category (the per-category target mean)
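
The two variants listed above can be compared on a small made-up table. This is only a sketch: the `toy` frame and its `deck` and `survived` columns are hypothetical and are not the Titanic data loaded below.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic file used in this notebook
toy = pd.DataFrame({
    "deck":     ["A", "A", "B", "B", "B", "C"],
    "survived": [0,   1,   1,   1,   0,   0],
})

# Per-category mean of the target: A = 0.5, B = 2/3, C = 0.0
means = toy.groupby("deck")["survived"].mean()

# Variant 1: rank the categories by target mean (lowest mean gets rank 0)
ranks = {label: rank for rank, label in enumerate(means.sort_values().index)}
toy["deck_rank"] = toy["deck"].map(ranks)

# Variant 2: replace each category with its target mean
toy["deck_mean"] = toy["deck"].map(means)

print(ranks)  # {'C': 0, 'A': 1, 'B': 2}
```

Variant 1 keeps only the order of the categories, while variant 2 also keeps how far apart their target means are.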

In [1]:
import pandas as pd
df=pd.read_csv('titanic_train.csv', usecols=['Cabin','Survived'])
df.head()

Unnamed: 0,Survived,Cabin
0,0,
1,1,C85
2,1,
3,1,C123
4,0,


In [2]:
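# Fill NaN values with a new category named 'Missing'.
# Assigning the result back avoids the chained in-place fillna, which newer pandas versions warn about.
df['Cabin'] = df['Cabin'].fillna('Missing')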

In [3]:
df

Unnamed: 0,Survived,Cabin
0,0,Missing
1,1,C85
2,1,Missing
3,1,C123
4,0,Missing
...,...,...
886,0,Missing
887,1,B42
888,0,Missing
889,1,C148


In [5]:
# Take the first letter of the cabin string
df['Cabin'] = df['Cabin'].astype(str).str[0]

In [6]:
df.head()

Unnamed: 0,Survived,Cabin
0,0,M
1,1,C
2,1,M
3,1,C
4,0,M


In [7]:
# checking the unique categories
df.Cabin.unique()

array(['M', 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], dtype=object)

In [8]:
# Take the mean of Survived for each Cabin category
df.groupby(['Cabin'])['Survived'].mean()

Cabin
A    0.466667
B    0.744681
C    0.593220
D    0.757576
E    0.750000
F    0.615385
G    0.500000
M    0.299854
T    0.000000
Name: Survived, dtype: float64

In [9]:
# Sort the category means in ascending order
df.groupby(['Cabin'])['Survived'].mean().sort_values()

Cabin
T    0.000000
M    0.299854
A    0.466667
G    0.500000
C    0.593220
F    0.615385
B    0.744681
E    0.750000
D    0.757576
Name: Survived, dtype: float64

In [10]:
# Keep only the index (the category labels, ordered by mean)
df.groupby(['Cabin'])['Survived'].mean().sort_values().index

Index(['T', 'M', 'A', 'G', 'C', 'F', 'B', 'E', 'D'], dtype='object', name='Cabin')

In [11]:
ordinal_labels=df.groupby(['Cabin'])['Survived'].mean().sort_values().index
ordinal_labels

Index(['T', 'M', 'A', 'G', 'C', 'F', 'B', 'E', 'D'], dtype='object', name='Cabin')

In [12]:
# The built-in enumerate() function adds a counter to an iterable and
# returns an enumerate object, a lazy iterator of (count, item) pairs.
enumerate(ordinal_labels,0)

<enumerate at 0x22315a4e040>
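
Because the enumerate object is lazy, its contents are not visible in the output above. A minimal sketch that materialises the pairs with `list()`; the labels are copied from the sorted index shown earlier so that the snippet runs without the CSV file.

```python
import pandas as pd

# Category order copied from the sorted-index output above
ordinal_labels = pd.Index(['T', 'M', 'A', 'G', 'C', 'F', 'B', 'E', 'D'], name='Cabin')

# list() consumes the iterator and yields (counter, label) pairs
pairs = list(enumerate(ordinal_labels, 0))
print(pairs[:3])  # [(0, 'T'), (1, 'M'), (2, 'A')]
```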

In [13]:
# The command below assigns ranks according to the means (the lowest mean gets rank 0, and so on)

In [14]:
ordinal_labels2 = {label: count for count, label in enumerate(ordinal_labels, 0)}
ordinal_labels2

{'T': 0, 'M': 1, 'A': 2, 'G': 3, 'C': 4, 'F': 5, 'B': 6, 'E': 7, 'D': 8}

In [15]:
df['Cabin_ordinal_labels']=df['Cabin'].map(ordinal_labels2)
df.head()

Unnamed: 0,Survived,Cabin,Cabin_ordinal_labels
0,0,M,1
1,1,C,4
2,1,M,1
3,1,C,4
4,0,M,1
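
The steps above compute the mapping on the whole DataFrame. As a sketch of how the same steps might be wrapped so the mapping is learned from training rows only and then applied to other rows, consider the following. The helper `fit_ordinal_map`, the tiny `train` and `test` frames, and the `-1` placeholder for unseen categories are all assumptions made for illustration.

```python
import pandas as pd

def fit_ordinal_map(frame, column, target):
    """Return {category: rank}, ranked by ascending target mean."""
    ordered = frame.groupby(column)[target].mean().sort_values().index
    return {label: rank for rank, label in enumerate(ordered)}

# Hypothetical train/test split; the values are made up for illustration
train = pd.DataFrame({"Cabin": ["M", "M", "C", "C", "B"],
                      "Survived": [0, 0, 1, 0, 1]})
test = pd.DataFrame({"Cabin": ["C", "B", "Z"]})  # "Z" never appears in train

# Learn the mapping from the training rows only
mapping = fit_ordinal_map(train, "Cabin", "Survived")
test["Cabin_ordinal_labels"] = test["Cabin"].map(mapping)

# Unseen categories map to NaN; -1 is one possible placeholder (an assumption)
test["Cabin_ordinal_labels"] = test["Cabin_ordinal_labels"].fillna(-1).astype(int)
print(test)
```

Fitting on the training rows alone keeps the test target out of the encoding, which is one common way to limit leakage.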


### Disadvantages
1. It can lead to target leakage or overfitting, because the encoding is computed from the target itself.<br>
2. Categories may be distributed unevenly between the train and test data, so a category with few rows can take an extreme encoded value (category T above, for example, has a mean of exactly 0). To dampen this, the target mean for each category is usually mixed with the marginal (overall) mean of the target.
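
One common way to mix the category mean with the marginal mean is a count-weighted blend. The sketch below is an illustration under stated assumptions, not the method used in this notebook: the `toy` data and the smoothing weight `m` are hypothetical, and `m` would normally be tuned.

```python
import pandas as pd

# Hypothetical data: "T" has a single row, so its raw mean is extreme
toy = pd.DataFrame({
    "Cabin":    ["M", "M", "M", "M", "C", "C", "C", "T"],
    "Survived": [0,   0,   1,   0,   1,   1,   0,   0],
})

m = 2.0  # smoothing weight (hypothetical value)
global_mean = toy["Survived"].mean()  # marginal mean of the target, 0.375 here

# Raw mean and row count for each category
stats = toy.groupby("Cabin")["Survived"].agg(["mean", "count"])

# Weighted blend: categories with few rows are pulled towards global_mean
stats["smoothed"] = (
    stats["count"] * stats["mean"] + m * global_mean
) / (stats["count"] + m)

toy["Cabin_smoothed"] = toy["Cabin"].map(stats["smoothed"])
print(stats)
```

With these numbers the single-row category T moves from a raw mean of 0 to a smoothed value of 0.25, while the larger categories shift less.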