### Target-encoding Categorical Variables

Categorical variables are a challenge for machine learning algorithms. Most algorithms accept only numerical inputs, so the categories have to be converted into numbers before the model can use them.

One-hot encoding creates a very sparse matrix and inflates the number of dimensions the model has to work with, which exposes us to the curse of dimensionality. The problem gets worse when a feature has many categories, most of which contribute little to the prediction.
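To make the dimensionality point concrete, here is a minimal sketch of what one-hot encoding does to a single high-cardinality column. The `zip_code` column and its 1,000 distinct values are made up for illustration and do not come from any dataset used in this notebook.

```python
import pandas as pd

# Hypothetical toy data: one categorical column with 1,000 distinct values.
n_rows = 5_000
df_demo = pd.DataFrame({"zip_code": [f"zip_{i % 1000}" for i in range(n_rows)]})

# One-hot encoding creates one column per distinct category.
one_hot = pd.get_dummies(df_demo["zip_code"])
print(one_hot.shape)  # (5000, 1000)

# Each row has exactly one non-zero entry, so almost every cell is zero.
density = one_hot.to_numpy().mean()
print(f"share of non-zero cells: {density:.4f}")  # 0.0010
```

A single feature has become 1,000 columns, of which 99.9% of the cells are zero. Target encoding keeps the same information in one numeric column.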

Further reading: [Towards Data Science article on target encoding](https://towardsdatascience.com/dealing-with-categorical-variables-by-using-target-encoder-a0f1733a4c69)


## Target Guided Ordinal Encoding
Target guided ordinal encoding is a technique for encoding a categorical variable based on its relationship with the target variable. It is useful when a categorical variable has a large number of unique categories and we still want to use it as a feature in a machine learning model.

In target guided ordinal encoding, each category is replaced by a number derived from the mean or median of the target variable for that category. Categories with a higher average target receive a higher code, so the encoded feature is monotonically related to the per-category target, which can improve the predictive power of the model.
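The cells in this notebook use the per-category mean itself as the code. A common variant, which is where the word "ordinal" comes from, sorts the categories by that statistic and uses the integer rank instead. The sketch below shows this variant with the median; the helper `target_guided_ordinal` and the `toy` data (`colour`, `y`) are hypothetical names introduced only for illustration.

```python
import pandas as pd

def target_guided_ordinal(df, column, target, agg="mean"):
    """Return a {category: rank} dict, ranked by the aggregated target."""
    # Aggregate the target per category ("mean" or "median").
    stats = df.groupby(column)[target].agg(agg)
    # Sort categories from lowest to highest aggregate and number them 0, 1, 2, ...
    ordered = stats.sort_values().index
    return {category: rank for rank, category in enumerate(ordered)}

# Hypothetical toy data.
toy = pd.DataFrame({
    "colour": ["red", "blue", "green", "red", "green", "blue"],
    "y": [10, 40, 25, 14, 35, 50],
})

ranks = target_guided_ordinal(toy, "colour", "y", agg="median")
toy["colour_rank"] = toy["colour"].map(ranks)
print(ranks)  # {'red': 0, 'green': 1, 'blue': 2}
```

The medians are 12 (red), 30 (green) and 45 (blue), so the ranks follow that order. Ranks discard the size of the gaps between categories, while the raw mean or median keeps it; which one works better depends on the model and the data.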

In [14]:
import pandas as pd

# Toy data: 'city' is the categorical feature, 'price' is the target.
df = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price': [200, 150, 300, 250, 180, 320]
})

In [15]:
df

|   | city | price |
|---|------|-------|
| 0 | New York | 200 |
| 1 | London | 150 |
| 2 | Paris | 300 |
| 3 | Tokyo | 250 |
| 4 | New York | 180 |
| 5 | Paris | 320 |


In [16]:
# Mean target value per category, as a {city: mean price} dictionary.
mean_price = df.groupby('city')['price'].mean().to_dict()

In [17]:
mean_price

{'London': 150.0, 'New York': 190.0, 'Paris': 310.0, 'Tokyo': 250.0}

In [18]:
# Replace each city with the mean price of that city.
df['city_encoded'] = df['city'].map(mean_price)

In [19]:
df[['price', 'city_encoded']]

|   | price | city_encoded |
|---|-------|--------------|
| 0 | 200 | 190.0 |
| 1 | 150 | 150.0 |
| 2 | 300 | 310.0 |
| 3 | 250 | 250.0 |
| 4 | 180 | 190.0 |
| 5 | 320 | 310.0 |
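One practical detail of the `map` approach: a category that is missing from the mapping becomes `NaN`. The sketch below reuses the city data as a training set and applies the mapping to new rows, falling back to the overall mean price for an unseen city. The `new_rows` frame and the choice of the global mean as the fallback are assumptions made for illustration, not part of the original example.

```python
import pandas as pd

# Same toy data as the cells above, treated here as the training set.
train = pd.DataFrame({
    "city": ["New York", "London", "Paris", "Tokyo", "New York", "Paris"],
    "price": [200, 150, 300, 250, 180, 320],
})
# Hypothetical new rows; "Berlin" never appears in the training data.
new_rows = pd.DataFrame({"city": ["Paris", "Berlin"]})

# Learn the mapping from the training rows only.
mean_price = train.groupby("city")["price"].mean()
global_mean = train["price"].mean()

# map() leaves unseen categories as NaN; fall back to the overall mean.
new_rows["city_encoded"] = new_rows["city"].map(mean_price).fillna(global_mean)
print(new_rows)
```

Paris keeps its learned value of 310.0, while Berlin receives the overall training mean (about 233.33). Computing the mapping on training rows only also keeps target values from the evaluation data out of the feature.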


In [22]:
import seaborn as sns

# Load the Titanic dataset that ships with seaborn.
titanic_df = sns.load_dataset('titanic')

In [24]:
titanic_df.head()

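|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|----------|--------|-----|-----|-------|-------|------|----------|-------|-----|------------|------|-------------|-------|-------|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |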


In [25]:
titanic_df['class'].unique()

['Third', 'First', 'Second']
Categories (3, object): ['First', 'Second', 'Third']

In [29]:
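# Mean fare per passenger class; 'fare' plays the role of the target here.
# observed=True restricts the groups to categories that actually occur.
mean_class = titanic_df.groupby('class', observed=True)['fare'].mean().to_dict()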

In [30]:
mean_class

{'First': 84.1546875,
 'Second': 20.662183152173913,
 'Third': 13.675550101832993}

In [34]:
titanic_df['class_encoded'] = titanic_df['class'].map(mean_class)

In [35]:
titanic_df['class_encoded']

0      13.675550
1      84.154687
2      13.675550
3      84.154687
4      13.675550
         ...    
886    20.662183
887    84.154687
888    13.675550
889    84.154687
890    13.675550
Name: class_encoded, Length: 891, dtype: category
Categories (3, float64): [84.154687, 20.662183, 13.675550]
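Note that the output above reports `dtype: category`: because `class` is a categorical column, `map` relabels its categories and the result stays categorical rather than becoming a float column. A minimal sketch of casting it to a plain numeric column follows. It uses a small stand-in Series and rounded fares instead of the real dataset, so `passenger_class` and the values in `mean_class` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical stand-in for titanic_df['class']: a small categorical column.
passenger_class = pd.Series(
    pd.Categorical(
        ["Third", "First", "Third", "Second"],
        categories=["First", "Second", "Third"],
    )
)
# Rounded versions of the mean fares computed above.
mean_class = {"First": 84.15, "Second": 20.66, "Third": 13.68}

# Mapping a categorical column relabels its categories.
encoded = passenger_class.map(mean_class)
print(encoded.dtype)

# Cast to float so that models and arithmetic treat the column as numeric.
encoded_float = encoded.astype(float)
print(encoded_float.dtype)  # float64
```

On the real data the equivalent would be `titanic_df['class'].map(mean_class).astype(float)`.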

In [36]:
titanic_df[['fare','class_encoded']]

|   | fare | class_encoded |
|---|------|---------------|
| 0 | 7.2500 | 13.675550 |
| 1 | 71.2833 | 84.154687 |
| 2 | 7.9250 | 13.675550 |
| 3 | 53.1000 | 84.154687 |
| 4 | 8.0500 | 13.675550 |
| ... | ... | ... |
| 886 | 13.0000 | 20.662183 |
| 887 | 30.0000 | 84.154687 |
| 888 | 23.4500 | 13.675550 |
| 889 | 30.0000 | 84.154687 |
