# One-Hot Encoding — For Unordered Categories

**Objective**: Master **One-Hot Encoding (OHE)** — the **safest** way to encode **nominal** data.

---

## 1. Introduction

### What is One-Hot Encoding?

- Converts each category into a **binary column** (0 or 1)
- No assumed order → **perfect for nominal data**

**Example**:
```
color: red → [1, 0, 0]
       blue → [0, 1, 0]
       green → [0, 0, 1]
```
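The mapping above can be reproduced by hand in a few lines. This is a minimal sketch for intuition only; `one_hot` is a hypothetical helper, not a pandas or scikit-learn function, and the column order is an arbitrary choice.

```python
# Hypothetical helper for illustration only; not part of any library.
def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    return [1 if value == category else 0 for category in categories]

# The column order is a free choice; here it follows the example above.
categories = ['red', 'blue', 'green']

vectors = {color: one_hot(color, categories) for color in categories}
print(vectors)
```

A value that is not in `categories` simply produces a vector of all zeros.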

### Why Use OHE?

- **No false ordinality**
- Works with **most models** that expect numeric input
- Interpretable: `color_red = 1` → clearly means "is red"

### Real-World Example
> **Customer Segment**: `Student`, `Professional`, `Retired` → 3 binary columns

## 2. Creating Sample Dataset

In [1]:
import pandas as pd

df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'green', 'red', 'blue'],
    'size': ['M', 'L', 'S', 'M', 'L', 'S']
})

print("Original Data:")
df

Original Data:

|   | color | size |
|---|-------|------|
| 0 | red | M |
| 1 | blue | L |
| 2 | green | S |
| 3 | green | M |
| 4 | red | L |
| 5 | blue | S |


## 3. Implementing One-Hot Encoding

**Two Ways**:
1. `pd.get_dummies()` → Quick & easy
2. `OneHotEncoder()` → Production-ready (handles unseen data)
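The first option is not demonstrated in the cells that follow, so here is a minimal sketch of it. It rebuilds the same sample DataFrame so that it runs on its own; `dtype=int` is an optional choice made here to get 0/1 values instead of booleans. Note that `pd.get_dummies()` does not remember the categories it saw, so a later dataset with a missing or extra category yields a different set of columns.

```python
import pandas as pd

df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'green', 'red', 'blue'],
    'size': ['M', 'L', 'S', 'M', 'L', 'S']
})

# Encode only the 'color' column; 'size' is passed through unchanged.
# dtype=int gives 0/1 integers instead of True/False.
dummies = pd.get_dummies(df, columns=['color'], dtype=int)
print(dummies)
```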

In [2]:
from sklearn.preprocessing import OneHotEncoder

# Initialize encoder
# sparse_output requires scikit-learn >= 1.2 (older versions use sparse=False)
ohe = OneHotEncoder(sparse_output=False, drop=None)  # drop=None → keep all categories

# Fit and transform
encoded = ohe.fit_transform(df[['color']])

# Convert to DataFrame
encoded_df = pd.DataFrame(encoded, columns=ohe.get_feature_names_out())

print("Encoded Columns:")
encoded_df

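Encoded Columns:

|   | color_blue | color_green | color_red |
|---|------------|-------------|-----------|
| 0 | 0.0 | 0.0 | 1.0 |
| 1 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 1.0 | 0.0 |
| 4 | 0.0 | 0.0 | 1.0 |
| 5 | 1.0 | 0.0 | 0.0 |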


In [3]:
# Combine with original
df_ohe = pd.concat([df, encoded_df], axis=1)
df_ohe

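|   | color | size | color_blue | color_green | color_red |
|---|-------|------|------------|-------------|-----------|
| 0 | red | M | 0.0 | 0.0 | 1.0 |
| 1 | blue | L | 1.0 | 0.0 | 0.0 |
| 2 | green | S | 0.0 | 1.0 | 0.0 |
| 3 | green | M | 0.0 | 1.0 | 0.0 |
| 4 | red | L | 0.0 | 0.0 | 1.0 |
| 5 | blue | S | 1.0 | 0.0 | 0.0 |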


## 4. Handling New / Unseen Categories

In [4]:
# New data with an unseen category
# (a DataFrame with the same column name avoids a feature-name warning)
new_data = pd.DataFrame({'color': ['yellow']})

# With handle_unknown='ignore' → returns all zeros
ohe_safe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
ohe_safe.fit(df[['color']])

print("Transforming unseen 'yellow':")
ohe_safe.transform(new_data)

Transforming unseen 'yellow':

array([[0., 0., 0.]])
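For contrast, the sketch below shows what happens without `handle_unknown='ignore'`: the default setting, `handle_unknown='error'`, makes `transform` raise a `ValueError` on an unseen category. The small `train` and `unseen` frames are made up for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'color': ['red', 'blue', 'green']})
unseen = pd.DataFrame({'color': ['yellow']})

# Default settings: handle_unknown='error'
ohe_strict = OneHotEncoder()
ohe_strict.fit(train)

try:
    ohe_strict.transform(unseen)
    raised = False
except ValueError as exc:
    raised = True
    print("Encoder rejected the unseen category:", exc)
```

Failing loudly can be the better choice when an unexpected category signals a data-quality problem; silently mapping it to all zeros suits cases where new categories are expected at prediction time.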

## 5. Pros and Cons

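| Pros | Cons |
|------|------|
| No false ordering between categories | **High dimensionality**: one new column per category |
| Works with most models that expect numeric input | Mostly-zero output that wastes memory if stored densely |
| Interpretable | **Curse of dimensionality** for high-cardinality features |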
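The memory cost can be kept down by leaving the encoder's output sparse, which is what `OneHotEncoder` does by default. A minimal sketch, using a synthetic ID column invented for this illustration:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Synthetic column with 1,000 distinct IDs (made-up data for illustration)
ids = np.arange(1000).astype(str).reshape(-1, 1)

encoder = OneHotEncoder()  # sparse output is the default
encoded_sparse = encoder.fit_transform(ids)

print(sparse.issparse(encoded_sparse))  # the result is a SciPy sparse matrix
print(encoded_sparse.shape)             # one column per distinct ID
print(encoded_sparse.nnz)               # stored non-zeros: one per row
```

Only the non-zero entries are stored, so the result holds one value per row rather than one per cell of the full matrix.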

**Tip**: Use `drop='first'` to avoid multicollinearity in linear models

---

## 6. Summary

| Category Type | Encoding Recommended | Example |
|---------------|------------------------|--------|
| **Nominal** | **One-Hot** | City, Color, Gender |
| **Ordinal** | Ordinal | Size, Grade, Rating |
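Both rows of the table can be applied to the sample data in one step. The sketch below is one possible arrangement, not the only one: it assumes `size` is ordered `S < M < L`, and the step names `color_ohe` and `size_ord` are arbitrary labels chosen for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'green', 'red', 'blue'],
    'size': ['M', 'L', 'S', 'M', 'L', 'S']
})

preprocess = ColumnTransformer(
    [
        # Nominal column: one binary column per category
        ('color_ohe', OneHotEncoder(handle_unknown='ignore'), ['color']),
        # Ordinal column: assumed order S < M < L maps to 0, 1, 2
        ('size_ord', OrdinalEncoder(categories=[['S', 'M', 'L']]), ['size']),
    ],
    sparse_threshold=0,  # always return a dense array
)

features = preprocess.fit_transform(df)
print(features)
```

Each output row holds the three `color` indicator columns (blue, green, red) followed by the numeric `size` rank.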

**Key Takeaway**:
> **One-Hot Encoding is the safest** choice for **unordered categories**: **use it by default** unless a feature has so many distinct values that the added columns become impractical.
