Target Guided Ordinal Encoding

Target guided ordinal encoding is a technique for encoding a categorical variable based on its relationship with the target variable. It is most useful when a categorical variable has a large number of unique categories and we want to use it as a feature in a machine learning model.

In Target Guided Ordinal Encoding, we replace each category in the categorical variable with a numerical value based on the mean or median of the target variable for that category. This creates a monotonic relationship between the categorical variable and the target variable, which can improve the predictive power of our model.

Common Types of Target Guided Encoding:
Mean Encoding / Target Mean Encoding
Replace each category with the mean of the target for that category.
Example:

Category   Target
A          100
A          150
B           50
B           30

Mean of A = 125 → A becomes 125
Mean of B = 40  → B becomes 40

Target Guided Ordinal Encoding
Sort categories based on the mean of the target, then assign ordered labels.
Example:

Category Means:
B: 40
A: 125

→ B = 0, A = 1

Weight of Evidence (WoE) Encoding
Used mostly in credit scoring. Based on the distribution of good/bad outcomes per category (used in binary classification).

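WoE = ln(% of Good / % of Bad)

Here "% of Good" is the share of all good outcomes that fall in the category, and "% of Bad" is the share of all bad outcomes that fall in the same category.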
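
The WoE formula can be sketched in pandas as follows. This is a minimal illustration, not a production implementation: the `grade` and `good` columns and their values are hypothetical, and the sketch assumes "% of Good" means the share of all good outcomes that fall in a category (likewise for bad).

```python
import numpy as np
import pandas as pd

# hypothetical toy data: 1 = good outcome, 0 = bad outcome
data = pd.DataFrame({
    'grade': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'good':  [1, 1, 1, 0, 1, 0, 0, 0],
})

# count good and bad outcomes per category
counts = data.groupby('grade')['good'].agg(good='sum', total='count')
counts['bad'] = counts['total'] - counts['good']

# share of all goods / all bads that fall in each category
pct_good = counts['good'] / counts['good'].sum()
pct_bad = counts['bad'] / counts['bad'].sum()

# natural log of the ratio; a category with zero goods or zero bads
# would give -inf or inf here and needs separate handling
woe = np.log(pct_good / pct_bad)

data['grade_woe'] = data['grade'].map(woe)
```

In this toy data, category A holds 3 of the 4 good outcomes and 1 of the 4 bad ones, so its WoE is ln(0.75 / 0.25) = ln(3); category B is the mirror image with ln(1/3).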

🔹 Why Use It?
✅ Captures predictive power of categories
✅ Can improve model performance
✅ Helps with high-cardinality categorical features (like "zip code")

⚠️ But Watch Out!
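Overfitting: The encoding is computed from the target, so it can leak target information into the features. Fit the mapping on the training data only (or out-of-fold with cross-validation), then apply that mapping to the validation and test data.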

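Often unnecessary for tree-based models (like Random Forest or XGBoost), which can usually work with simple label-encoded categories. Whether a library accepts categorical columns directly depends on the implementation.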
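
To illustrate the overfitting caveat, here is a minimal sketch of fitting the mapping on training rows only. The train/test split and the fallback to the overall training mean for unseen categories are assumptions made for illustration; cross-validated (out-of-fold) encoding is not shown.

```python
import pandas as pd

# hypothetical split: "train" rows carry the target, "test" rows do not
train = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'New York'],
    'price': [200, 150, 300, 180],
})
test = pd.DataFrame({'city': ['Paris', 'Tokyo']})  # Tokyo never appears in train

# learn the mapping from the training rows only
city_means = train.groupby('city')['price'].mean()
global_mean = train['price'].mean()

train['city_encoded'] = train['city'].map(city_means)

# unseen categories map to NaN, so fall back to the overall training mean
test['city_encoded'] = test['city'].map(city_means).fillna(global_mean)
```

The test rows never contribute to the mapping, so their encoded values cannot leak their own targets.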

In [1]:
import pandas as pd

# create a sample dataframe with a categorical variable and a target variable
df = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price': [200, 150, 300, 250, 180, 320]
})

In [3]:
df

       city  price
0  New York    200
1    London    150
2     Paris    300
3     Tokyo    250
4  New York    180
5     Paris    320


In [5]:
# mean price per city, as a {city: mean price} dictionary
mean_price = df.groupby('city')['price'].mean().to_dict()

In [7]:
mean_price

{'London': 150.0, 'New York': 190.0, 'Paris': 310.0, 'Tokyo': 250.0}

In [9]:
# replace each city with the mean price of that city
df['city_encoded'] = df['city'].map(mean_price)

In [11]:
df[['price','city_encoded']]

   price  city_encoded
0    200         190.0
1    150         150.0
2    300         310.0
3    250         250.0
4    180         190.0
5    320         310.0
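
The cells above perform mean encoding. The ordinal variant described earlier goes one step further: sort the categories by their target mean and assign ordered integer labels. A minimal sketch on the same city data follows; the names `city_means`, `city_rank`, and `city_ordinal` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price': [200, 150, 300, 250, 180, 320]
})

# mean price per city, sorted from lowest to highest
city_means = df.groupby('city')['price'].mean().sort_values()

# assign 0, 1, 2, ... in that order
city_rank = {city: rank for rank, city in enumerate(city_means.index)}

df['city_ordinal'] = df['city'].map(city_rank)
```

With these six rows the ordering is London (150), New York (190), Tokyo (250), Paris (310), so the cities receive the labels 0, 1, 2, and 3 respectively.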


In [27]:
import seaborn as sns
df1 = sns.load_dataset('tips')

In [29]:
df1

     total_bill   tip     sex smoker  day    time  size
0         16.99  1.01  Female     No  Sun  Dinner     2
1         10.34  1.66    Male     No  Sun  Dinner     3
2         21.01  3.50    Male     No  Sun  Dinner     3
3         23.68  3.31    Male     No  Sun  Dinner     2
4         24.59  3.61  Female     No  Sun  Dinner     4
...         ...   ...     ...    ...  ...     ...   ...
239       29.03  5.92    Male     No  Sat  Dinner     3
240       27.18  2.00  Female    Yes  Sat  Dinner     2
241       22.67  2.00    Male    Yes  Sat  Dinner     2
242       17.82  1.75    Male     No  Sat  Dinner     2


In [31]:
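# 'time' is a categorical column; passing observed explicitly avoids
# the pandas deprecation warning about the default value
mean_price1 = df1.groupby('time', observed=True)['total_bill'].mean().to_dict()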



In [33]:
mean_price1

{'Lunch': 17.168676470588235, 'Dinner': 20.79715909090909}

In [35]:
# map from df1 (the tips data); df has no 'time' column
df1['time_encoded'] = df1['time'].map(mean_price1)

In [37]:
df1[['total_bill','time_encoded']]

     total_bill  time_encoded
0         16.99     20.797159
1         10.34     20.797159
2         21.01     20.797159
3         23.68     20.797159
4         24.59     20.797159
...         ...           ...
239       29.03     20.797159
240       27.18     20.797159
241       22.67     20.797159
242       17.82     20.797159
