## Target Guided Ordinal Encoding

Target Guided Ordinal Encoding is a technique used to encode categorical variables based on their relationship with the target variable. This is especially useful when you have a categorical variable with **many unique categories** (high cardinality), and you want to use it as a feature in your machine learning model.

### How Does It Work?

- For each category in the categorical variable, calculate the **mean** (or median) of the target variable.
- Replace each category with its corresponding mean (or median) value.
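- Because each category is replaced by its average target value, categories with a higher average target receive a higher encoded value. This gives the encoded feature a **monotonic relationship** with the per-category target mean and gives the model a numeric ordering of the categories.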
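
Some references use the same name for a variant that replaces each category with the *rank* of its target mean instead of the mean itself. The following is a minimal sketch of that variant, not the method used in the rest of this notebook. It reuses the notebook's small `city`/`price` sample, and the names `means`, `ranks`, and `city_rank` are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["New York", "London", "Paris", "Tokyo", "New York", "Paris"],
    "price": [200, 150, 300, 250, 180, 320],
})

# Mean target per category (same first step as the mean-based encoding)
means = df.groupby("city")["price"].mean()

# Rank the categories by their mean: 1 = lowest mean, increasing from there
ranks = means.rank(method="dense").astype(int)

# Replace each category with its rank (illustrative column name)
df["city_rank"] = df["city"].map(ranks)
print(df)
```

The rank keeps only the ordering of the categories, while the mean-based encoding also keeps the size of the gaps between them.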

### Why Use Target Guided Ordinal Encoding?

- **Captures the relationship** between the category and the target variable.
- **Reduces dimensionality**: Unlike One Hot Encoding, it does not create many columns, making it efficient for features with many categories.
- **Can improve model performance**: Most helpful for tree-based models and when the categories differ clearly in their average target value.

### Example

Suppose you have a dataset with a `city` column (categorical) and a `price` column (target):

1. **Calculate the mean price for each city:**
   ```python
   mean_price = df.groupby('city')['price'].mean().to_dict()
   ```
2. **Map each city to its mean price:**
   ```python
   df['city_encoded'] = df['city'].map(mean_price)
   ```

Now, the `city_encoded` column numerically represents the average price for each city.
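
The two steps above can also be collapsed into one call with `groupby(...).transform("mean")`, which broadcasts each group's mean back onto its rows. This is a sketch on the same sample data. Note that it does not leave a reusable mapping, so it only suits cases where no new data has to be encoded later.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["New York", "London", "Paris", "Tokyo", "New York", "Paris"],
    "price": [200, 150, 300, 250, 180, 320],
})

# transform("mean") returns a Series aligned with df's rows,
# holding the mean price of each row's city
df["city_encoded"] = df.groupby("city")["price"].transform("mean")
print(df)
```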

### Another Example: The Tips Dataset

If you want to encode the `time` column (Lunch/Dinner) based on the average tip:
1. **Calculate the mean tip for each time:**
   ```python
   mean_tips = df_tips.groupby('time')['tip'].mean().to_dict()
   ```
2. **Map the mean tip to the time column:**
   ```python
   df_tips['time_encoded'] = df_tips['time'].map(mean_tips)
   ```

Now, `time_encoded` shows the average tip for Lunch and Dinner, helping the model understand which time is associated with higher tips.

---

### Key Points & Observations

- **Use when:** You have a categorical variable with many unique values and want to capture its effect on the target.
- **Avoid data leakage:** Always compute the mapping using only the training data, not the test data.
- **Works best for:** Tree-based models, but can be useful for others as well.
- **Not for all cases:** If the relationship between the category and target is weak, this encoding may not help much.
- **Alternative statistics:** You can use median, count, or other statistics instead of mean, depending on your data.
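
The leakage point can be sketched as follows: the mapping is learned on the training rows only and then applied to the test rows. The train/test split, the unseen city `Berlin`, and the choice to fill unseen categories with the global training mean are all assumptions made for illustration. Other fallbacks are possible.

```python
import pandas as pd

# Hypothetical training and test sets
train = pd.DataFrame({
    "city": ["New York", "London", "Paris", "New York", "Paris"],
    "price": [200, 150, 300, 180, 320],
})
test = pd.DataFrame({"city": ["Paris", "London", "Berlin"]})  # Berlin is unseen

# Learn the mapping from the training data only
mapping = train.groupby("city")["price"].mean()
global_mean = train["price"].mean()  # fallback for unseen categories (an assumption)

train["city_encoded"] = train["city"].map(mapping)
# Categories missing from the mapping become NaN, then get the fallback value
test["city_encoded"] = test["city"].map(mapping).fillna(global_mean)
print(test)
```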

---

**Quick Revision Table**

| Step                | What to do?                                      |
|---------------------|--------------------------------------------------|
| 1. Group by category| Calculate mean/median of target for each category|
| 2. Map to new column| Replace category with calculated value           |
| 3. Use in model     | Use encoded column as a feature                  |
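
The steps in the table can be wrapped in a small helper that also lets you swap the statistic. This is only a sketch. The function name `target_encode` and its `stat` parameter are hypothetical and not part of pandas.

```python
import pandas as pd

def target_encode(df, column, target, stat="mean"):
    """Return (encoded Series, mapping dict) for one categorical column.

    `stat` is any aggregation name accepted by pandas, e.g. "mean",
    "median", or "count".
    """
    mapping = df.groupby(column)[target].agg(stat).to_dict()
    return df[column].map(mapping), mapping

df = pd.DataFrame({
    "city": ["New York", "London", "Paris", "Tokyo", "New York", "Paris"],
    "price": [200, 150, 300, 250, 180, 320],
})

# Steps 1 and 2: build the mapping and the encoded column (median here)
df["city_encoded"], mapping = target_encode(df, "city", "price", stat="median")
print(mapping)
```

Returning the mapping alongside the encoded column makes it possible to reuse the same mapping on new data.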

---

*Use this notebook as a quick reference for Target Guided Ordinal Encoding whenever you need to encode categorical variables based on their relationship with the target!*

In [2]:
import pandas as pd

# create a sample dataframe with a categorical variable and a target variable
df = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price': [200, 150, 300, 250, 180, 320]
})

In [3]:
df

|   | city     | price |
|---|----------|-------|
| 0 | New York | 200   |
| 1 | London   | 150   |
| 2 | Paris    | 300   |
| 3 | Tokyo    | 250   |
| 4 | New York | 180   |
| 5 | Paris    | 320   |


In [11]:
mean_price = df.groupby('city')['price'].mean().to_dict()

In [5]:
mean_price

{'London': 150.0, 'New York': 190.0, 'Paris': 310.0, 'Tokyo': 250.0}

In [6]:
# Map each city to its mean price and store the result in a new column, 'city_encoded'.
# The original 'city' column is left unchanged.
df['city_encoded'] = df['city'].map(mean_price)
df

|   | city     | price | city_encoded |
|---|----------|-------|--------------|
| 0 | New York | 200   | 190.0        |
| 1 | London   | 150   | 150.0        |
| 2 | Paris    | 300   | 310.0        |
| 3 | Tokyo    | 250   | 250.0        |
| 4 | New York | 180   | 190.0        |
| 5 | Paris    | 320   | 310.0        |


In [7]:
df[['price','city_encoded']]

|   | price | city_encoded |
|---|-------|--------------|
| 0 | 200   | 190.0        |
| 1 | 150   | 150.0        |
| 2 | 300   | 310.0        |
| 3 | 250   | 250.0        |
| 4 | 180   | 190.0        |
| 5 | 320   | 310.0        |


In [15]:
import seaborn as sns
df_tips = sns.load_dataset('tips')
df_tips

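|     | total_bill | tip  | sex    | smoker | day | time   | size |
|-----|------------|------|--------|--------|-----|--------|------|
| 0   | 16.99      | 1.01 | Female | No     | Sun | Dinner | 2    |
| 1   | 10.34      | 1.66 | Male   | No     | Sun | Dinner | 3    |
| 2   | 21.01      | 3.50 | Male   | No     | Sun | Dinner | 3    |
| 3   | 23.68      | 3.31 | Male   | No     | Sun | Dinner | 2    |
| 4   | 24.59      | 3.61 | Female | No     | Sun | Dinner | 4    |
| ... | ...        | ...  | ...    | ...    | ... | ...    | ...  |
| 239 | 29.03      | 5.92 | Male   | No     | Sat | Dinner | 3    |
| 240 | 27.18      | 2.00 | Female | Yes    | Sat | Dinner | 2    |
| 241 | 22.67      | 2.00 | Male   | Yes    | Sat | Dinner | 2    |
| 242 | 17.82      | 1.75 | Male   | No     | Sat | Dinner | 2    |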


In [17]:
df_tips['time'].value_counts()

time
Dinner    176
Lunch      68
Name: count, dtype: int64

In [20]:
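# 'time' is a categorical column; passing `observed` explicitly avoids the
# pandas FutureWarning about the changing default for categorical groupers
mean_tips = df_tips.groupby('time', observed=True)['tip'].mean().to_dict()
mean_tips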

{'Lunch': 2.7280882352941176, 'Dinner': 3.102670454545455}

In [25]:
df_tips['time_encoded'] = df_tips['time'].map(mean_tips)
df_tips[['tip', 'time_encoded']].head()

|   | tip  | time_encoded |
|---|------|--------------|
| 0 | 1.01 | 3.10267      |
| 1 | 1.66 | 3.10267      |
| 2 | 3.50 | 3.10267      |
| 3 | 3.31 | 3.10267      |
| 4 | 3.61 | 3.10267      |
