## One-Hot Encoding (Nominal Encoding)

Definition:
One-Hot Encoding transforms a nominal categorical variable (categories without intrinsic order) into multiple binary (0/1) columns, one for each category value. Each row has a 1 in the column of its category and 0 in all others.
Real-World Example:
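In a ride-hailing app, the `payment_method` feature could take the values ["Cash", "Credit Card", "UPI"]. Since payment methods have no ranking, OHE creates one column per value. A ride paid in cash is encoded as:

| payment_cash | payment_credit_card | payment_upi |
|--------------|---------------------|-------------|
| 1            | 0                   | 0           |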
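
As a minimal sketch of the payment example, pandas' `get_dummies` can build the indicator columns directly. The three-row `rides` frame is made-up illustration data, and the resulting column names follow pandas' `prefix_value` convention rather than the lowercase names used in the example.

```python
import pandas as pd

# Hypothetical rides data, one row per payment method
rides = pd.DataFrame({"payment_method": ["Cash", "Credit Card", "UPI"]})

# One indicator column per distinct value; dtype=int gives 0/1 instead of booleans
payment_ohe = pd.get_dummies(rides["payment_method"], prefix="payment", dtype=int)
print(payment_ohe)
```

The first row (a cash ride) has a 1 in the cash column and 0 in the other two.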

When to Use:

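- For nominal categorical features without any inherent order.
- Works well with linear models (drop one column per feature to avoid perfect collinearity with the intercept) and with tree-based models when the number of categories is small.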

Disadvantages:

- Increases the feature space (curse of dimensionality) when the number of categories is high.
- Produces sparse matrices for high-cardinality features.

In [3]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [7]:
df = pd.DataFrame({
  'color': ['red', 'blue', 'green', 'green', 'red', 'red', 'blue']
})
df.head()

|   | color |
|---|-------|
| 0 | red   |
| 1 | blue  |
| 2 | green |
| 3 | green |
| 4 | red   |


In [13]:
encoder = OneHotEncoder()
# fit_transform returns a SciPy sparse matrix; toarray() converts it to a dense NumPy array
encoded_one = encoder.fit_transform(df[['color']]).toarray()

In [23]:
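# Pass a DataFrame with the same column name used during fit to avoid a feature-name warning
encoder.transform(pd.DataFrame({'color': ['red']})).toarray()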



array([[0., 0., 1.]])

In [14]:
encoded_df = pd.DataFrame(encoded_one, columns=encoder.get_feature_names_out())
encoded_df

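|   | color_blue | color_green | color_red |
|---|------------|-------------|-----------|
| 0 | 0.0        | 0.0         | 1.0       |
| 1 | 1.0        | 0.0         | 0.0       |
| 2 | 0.0        | 1.0         | 0.0       |
| 3 | 0.0        | 1.0         | 0.0       |
| 4 | 0.0        | 0.0         | 1.0       |
| 5 | 0.0        | 0.0         | 1.0       |
| 6 | 1.0        | 0.0         | 0.0       |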


In [15]:
pd.concat([df, encoded_df], axis=1)

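|   | color | color_blue | color_green | color_red |
|---|-------|------------|-------------|-----------|
| 0 | red   | 0.0        | 0.0         | 1.0       |
| 1 | blue  | 1.0        | 0.0         | 0.0       |
| 2 | green | 0.0        | 1.0         | 0.0       |
| 3 | green | 0.0        | 1.0         | 0.0       |
| 4 | red   | 0.0        | 0.0         | 1.0       |
| 5 | red   | 0.0        | 0.0         | 1.0       |
| 6 | blue  | 1.0        | 0.0         | 0.0       |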
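
New data often contains categories that were absent during fitting. As a hedged sketch, `OneHotEncoder(handle_unknown="ignore")` encodes such values as a row of zeros instead of raising an error. The `train` and `new_data` frames here are illustrative, not part of the notebook above.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"color": ["red", "blue", "green"]})
new_data = pd.DataFrame({"color": ["red", "yellow"]})  # "yellow" was never seen during fit

safe_encoder = OneHotEncoder(handle_unknown="ignore")
safe_encoder.fit(train[["color"]])

# Known categories are encoded as usual; unknown ones become all zeros
encoded_new = safe_encoder.transform(new_data[["color"]]).toarray()
print(encoded_new)
```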


## Label Encoding
Definition:
Label Encoding assigns a unique integer to each category. Categories are replaced directly with their numeric code.
Real-World Example:
In an e-commerce platform, the shipping_region feature might be:
["North", "South", "East", "West"]
With scikit-learn's `LabelEncoder`, which assigns codes in sorted (alphabetical) order, it becomes:
East → 0, North → 1, South → 2, West → 3
When to Use:

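- When using models that can handle arbitrary numeric codes (e.g., tree-based models).
- If the variable has a natural order, make sure the codes follow it. scikit-learn's `LabelEncoder` assigns codes alphabetically and is designed for target labels, so `OrdinalEncoder` is usually the better fit for ordered features.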

Disadvantages:

- Implies an ordinal relationship where none may exist, which can mislead models such as linear regression.
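
The exact integers depend on the encoder. As a sketch, scikit-learn's `LabelEncoder` sorts the labels before numbering them, so the codes follow alphabetical order rather than order of appearance. The `regions` list reuses the shipping example; the variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

regions = ["North", "South", "East", "West"]

region_encoder = LabelEncoder()
codes = region_encoder.fit_transform(regions)

# classes_ holds the categories in sorted order; a label's position there is its code
print(list(region_encoder.classes_))
print(dict(zip(regions, codes.tolist())))
```

The codes carry no geographic meaning: "West" being 3 and "East" being 0 is purely an artifact of sorting, which is why a linear model can be misled by them.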

In [16]:
df.head()

|   | color |
|---|-------|
| 0 | red   |
| 1 | blue  |
| 2 | green |
| 3 | green |
| 4 | red   |


In [18]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

In [19]:
label_encoder.fit_transform(df['color'])  # LabelEncoder expects a 1-D input, so pass a Series

array([2, 0, 1, 1, 2, 2, 0])

In [24]:
label_encoder.transform(['red'])  # 1-D list, matching the shape used during fit

array([2])
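
Two further behaviours are worth knowing when using `LabelEncoder`: codes can be mapped back with `inverse_transform`, and labels that were not present during `fit` are rejected with a `ValueError`. A small self-contained sketch (the `color_encoder` name and the "purple" label are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

color_encoder = LabelEncoder()
color_encoder.fit(["red", "blue", "green"])

# Map integer codes back to the original labels
decoded = color_encoder.inverse_transform([2, 0, 1])
print(list(decoded))

# A label that was absent during fit cannot be encoded
try:
    color_encoder.transform(["purple"])
    unseen_ok = True
except ValueError as err:
    unseen_ok = False
    print("unseen label rejected:", err)
```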

## Ordinal Encoding

Definition:

Ordinal Encoding maps categories to integers based on a defined order.

Real-World Example:

In a hotel booking dataset, the room_quality feature might be:

["Standard", "Deluxe", "Suite"]

Encoded as:

Standard → 1, Deluxe → 2, Suite → 3

Here, the numbers represent ranking.

When to Use:

- When there is a clear ranking or hierarchy among the categories.

Disadvantages:

- If the assumed order is wrong, the encoding introduces bias.
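
The room-quality example can be sketched with an ordered pandas `Categorical`, where the order is stated explicitly instead of inferred. The `bookings` frame is made-up data, and adding 1 is only there to match the 1-based codes in the example (pandas codes start at 0).

```python
import pandas as pd

quality_order = ["Standard", "Deluxe", "Suite"]  # lowest to highest
bookings = pd.DataFrame({"room_quality": ["Deluxe", "Standard", "Suite", "Standard"]})

# The explicit category list fixes the ranking; codes are positions in that list
ordered = pd.Categorical(bookings["room_quality"], categories=quality_order, ordered=True)
bookings["room_quality_code"] = ordered.codes + 1  # shift 0..2 to 1..3

print(bookings)
```

Values that are not in `quality_order` receive the pandas code -1 (0 after the shift), so misspelled or unexpected categories are easy to spot.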

In [25]:
from sklearn.preprocessing import OrdinalEncoder

In [27]:
df = pd.DataFrame({
  'size': ['small', 'large', 'small', 'medium', 'medium', 'large']
})
df.head()

|   | size   |
|---|--------|
| 0 | small  |
| 1 | large  |
| 2 | small  |
| 3 | medium |
| 4 | medium |


In [28]:
# List the categories from lowest to highest so the integer codes follow that order
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])

In [29]:
encoder.fit_transform(df[['size']])

array([[0.],
       [2.],
       [0.],
       [1.],
       [1.],
       [2.]])

In [31]:
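# Pass a DataFrame with the same column name used during fit to avoid a feature-name warning
encoder.transform(pd.DataFrame({'size': ['large']}))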



array([[2.]])
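
By default, `OrdinalEncoder` raises an error for categories it has not seen. A hedged sketch of one way to handle this, assuming scikit-learn 0.24 or later, is to reserve a sentinel code through `handle_unknown="use_encoded_value"`. The frames and the "extra-large" value are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({"size": ["small", "large", "medium"]})
incoming = pd.DataFrame({"size": ["large", "extra-large"]})  # "extra-large" is unseen

tolerant_encoder = OrdinalEncoder(
    categories=[["small", "medium", "large"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,  # sentinel code for anything outside the listed categories
)
tolerant_encoder.fit(sizes[["size"]])

encoded_incoming = tolerant_encoder.transform(incoming[["size"]])
print(encoded_incoming)
```

The sentinel says nothing about where the unseen value belongs in the ranking, so it is usually worth reviewing such rows rather than feeding -1 to a model that treats the codes as ordered.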

## Target Guided Ordinal Encoding
**Definition:**  
Categories are ordered based on the **mean of the target variable** and then replaced by integers representing that order.

**Real-World Example:**  
In a **loan default prediction** dataset, the `occupation` variable may have categories:

```
["Clerk", "Manager", "Laborer", "Businessman"]
```

If we calculate default rates:

```
Clerk → 0.10  
Manager → 0.05  
Laborer → 0.20  
Businessman → 0.15
```

Ordering by target mean (default probability ascending):

```
Manager (0.05) → 1  
Clerk (0.10) → 2  
Businessman (0.15) → 3  
Laborer (0.20) → 4
```
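
The ordering step above can be sketched in plain Python: sort the categories by their target mean, then number them from 1. The rates are the ones listed in the example; the variable names are illustrative.

```python
# Default rates from the example above
default_rate = {"Clerk": 0.10, "Manager": 0.05, "Laborer": 0.20, "Businessman": 0.15}

# Sort categories by target mean (ascending), then assign ranks starting at 1
ranked = sorted(default_rate, key=default_rate.get)
occupation_rank = {occupation: rank for rank, occupation in enumerate(ranked, start=1)}

print(occupation_rank)
```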

**When to Use:**

- When you have historical data linking categories to target behavior.
- Often used in **credit scoring, churn prediction, fraud detection**.

**Disadvantages:**

- **Data leakage risk** if applied before splitting train-test data.
- Requires large enough data to get stable target means.
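
To limit the leakage risk, the ordering can be learned from the training rows only and then applied to the test rows. The sketch below uses a tiny made-up `loans` frame and a positional split purely for illustration. Assigning 0 to categories unseen in training is an assumption of this sketch, not a standard rule.

```python
import pandas as pd

# Hypothetical loan data; `defaulted` is the binary target
loans = pd.DataFrame({
    "occupation": ["Clerk", "Manager", "Laborer", "Clerk",
                   "Manager", "Laborer", "Businessman", "Clerk"],
    "defaulted": [0, 0, 1, 1, 0, 1, 0, 0],
})

# Positional split for illustration: first 6 rows train, last 2 rows test
train, test = loans.iloc[:6].copy(), loans.iloc[6:].copy()

# Learn the ordering from training rows only
train_means = train.groupby("occupation")["defaulted"].mean().sort_values()
rank_map = {occ: rank for rank, occ in enumerate(train_means.index, start=1)}

train["occupation_rank"] = train["occupation"].map(rank_map)
# Categories absent from training ("Businessman" here) get 0 as an "unknown" code
test["occupation_rank"] = test["occupation"].map(rank_map).fillna(0).astype(int)

print(rank_map)
print(test)
```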


In [33]:
import pandas as pd

df = pd.DataFrame({
  'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
  'price': [200, 150, 300, 250, 180, 320]
})

In [39]:
# Mean of the target (price) for each category
mean_price = df.groupby('city').price.mean().to_dict()
mean_price

{'London': 150.0, 'New York': 190.0, 'Paris': 310.0, 'Tokyo': 250.0}

In [41]:
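# Replace each city with its mean price. This keeps the raw target mean;
# to get integer ranks as in the definition, the means would be ranked first.
df["city_encoded"] = df.city.map(mean_price)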

In [43]:
df[['city', 'city_encoded']]

|   | city     | city_encoded |
|---|----------|--------------|
| 0 | New York | 190.0        |
| 1 | London   | 150.0        |
| 2 | Paris    | 310.0        |
| 3 | Tokyo    | 250.0        |
| 4 | New York | 190.0        |
| 5 | Paris    | 310.0        |
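
The cells above store the mean price itself for each city. To obtain the integer order described in the definition, the means can be ranked first. A minimal sketch on the same data, where the `cities` and `city_rank` names are illustrative:

```python
import pandas as pd

cities = pd.DataFrame({
    "city": ["New York", "London", "Paris", "Tokyo", "New York", "Paris"],
    "price": [200, 150, 300, 250, 180, 320],
})

# Order cities by mean price, then replace each city with its position in that order
city_order = cities.groupby("city")["price"].mean().sort_values().index
city_rank = {city: rank for rank, city in enumerate(city_order, start=1)}
cities["city_rank"] = cities["city"].map(city_rank)

print(cities[["city", "city_rank"]])
```

London has the lowest mean price and gets rank 1, while Paris has the highest and gets rank 4.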
