# **One-Hot Encoding**

- Since most ML models only work with numbers, we need a way to represent categories as numbers while preserving the information they carry.

- **One-Hot Encoding (OHE)** is a technique *used to convert categorical data into a numerical format* that machine learning models can understand.

---

## **How does it work?**

- Imagine we have a dataset with a categorical feature "*color*", which has three possible values: 🔴 Red, 🔵 Blue, and 🟢 Green.

- If we simply assign numbers (🔴Red = 1, 🔵Blue = 2, 🟢Green = 3), the model might misinterpret them as having a natural order or ranking, assuming 🟢Green (3) > 🔵Blue (2) > 🔴Red (1), which is incorrect!
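
A quick sketch of why integer labels mislead: once the categories are numbers, comparisons and differences between them become computable even though they mean nothing for colors. The `label_codes` mapping is just the illustrative assignment from the bullet above, not a recommended encoding.

```python
# Illustrative integer labels taken from the example above
label_codes = {'Red': 1, 'Blue': 2, 'Green': 3}

# A model that treats the codes as numbers sees an ordering...
print(label_codes['Green'] > label_codes['Red'])  # no color actually "outranks" another

# ...and unequal distances between categories
dist_red_blue = abs(label_codes['Red'] - label_codes['Blue'])
dist_red_green = abs(label_codes['Red'] - label_codes['Green'])
print(dist_red_blue, dist_red_green)
```

Here Red looks twice as far from Green as from Blue, which is purely a side effect of the numbering.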

---

☑️ **Solution:** One-Hot Encoding!

Instead of assigning arbitrary numbers, we give each category its own binary column, where:

- ✅ 1 means the category is present
- ❌ 0 means the category is absent

This removes the false order and distance that integer labels would imply!
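
As a minimal sketch of the idea, the rows of an identity matrix can serve as the one-hot vectors for the three colors. The `categories` list and the `one_hot` dictionary are names chosen for this illustration only.

```python
import numpy as np

categories = ['Red', 'Blue', 'Green']

# Each row of the identity matrix is the one-hot vector for one category
identity = np.eye(len(categories), dtype=int)
one_hot = {cat: identity[i] for i, cat in enumerate(categories)}

for cat, vec in one_hot.items():
    print(cat, vec)

# Every pair of distinct categories is the same distance apart,
# so no category is "closer to" or "greater than" another
d_red_blue = np.linalg.norm(one_hot['Red'] - one_hot['Blue'])
d_red_green = np.linalg.norm(one_hot['Red'] - one_hot['Green'])
print(np.isclose(d_red_blue, d_red_green))
```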

---

In [44]:
import pandas as pd
import numpy as np

In [45]:
# Sample dataset with a categorical feature
data = pd.DataFrame({
    'purse': ['Gucci', 'Chanel', 'Hermes', 'Gucci', 'Hermes', 'Hermes', 'Chanel'],
    'price': [1000, 2000, 1500, 900, 1600, 1700, 2200],
    'color': ['Red', 'Blue', 'Green', 'Red', 'Green', 'Green', 'Blue']
})

## Manual Implementation of One-Hot Encoding

In [46]:
# Get unique categories (a set has no guaranteed order, so the column order can differ between runs)
unique_cats = list(set(data['color']))
unique_cats

['Blue', 'Red', 'Green']

In [47]:
# Create a zero matrix (num_samples x num_categories)
one_hot_matrix = np.zeros((len(data), len(unique_cats)))
one_hot_matrix

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

In [48]:
# Fill the matrix
for i, cat in enumerate(data['color']):
    one_hot_matrix[i, unique_cats.index(cat)] = 1

In [49]:
one_hot_matrix

array([[0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [1., 0., 0.]])
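
The loop above can also be written without explicit iteration. The following is one possible sketch, assuming the same color values: `np.unique` with `return_inverse=True` gives each row's category index, and indexing an identity matrix with those indices yields the one-hot rows. Note that `np.unique` sorts the categories, so the column order here is alphabetical and may differ from the set-based order used above.

```python
import numpy as np

colors = np.array(['Red', 'Blue', 'Green', 'Red', 'Green', 'Green', 'Blue'])

# cats: sorted unique categories; codes: index of each row's category within cats
cats, codes = np.unique(colors, return_inverse=True)

# Indexing the identity matrix by the codes picks one one-hot row per sample
one_hot_vec = np.eye(len(cats))[codes]

print(cats)
print(one_hot_vec)
```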

In [50]:
# Combine the one-hot encoded matrix with the original data
encoded_categories = pd.concat([data, pd.DataFrame(one_hot_matrix, columns=unique_cats)], axis=1).drop('color', axis=1)

In [51]:
encoded_categories

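|   | purse | price | Blue | Red | Green |
|---|---|---|---|---|---|
| 0 | Gucci | 1000 | 0.0 | 1.0 | 0.0 |
| 1 | Chanel | 2000 | 1.0 | 0.0 | 0.0 |
| 2 | Hermes | 1500 | 0.0 | 0.0 | 1.0 |
| 3 | Gucci | 900 | 0.0 | 1.0 | 0.0 |
| 4 | Hermes | 1600 | 0.0 | 0.0 | 1.0 |
| 5 | Hermes | 1700 | 0.0 | 0.0 | 1.0 |
| 6 | Chanel | 2200 | 1.0 | 0.0 | 0.0 |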


## One-Hot Encoding using Pandas

In [52]:
# Apply One-Hot Encoding
encoded_categories = pd.get_dummies(data, columns=['color'], dtype=int)

In [53]:
encoded_categories

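|   | purse | price | color_Blue | color_Green | color_Red |
|---|---|---|---|---|---|
| 0 | Gucci | 1000 | 0 | 0 | 1 |
| 1 | Chanel | 2000 | 1 | 0 | 0 |
| 2 | Hermes | 1500 | 0 | 1 | 0 |
| 3 | Gucci | 900 | 0 | 0 | 1 |
| 4 | Hermes | 1600 | 0 | 1 | 0 |
| 5 | Hermes | 1700 | 0 | 1 | 0 |
| 6 | Chanel | 2200 | 1 | 0 | 0 |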
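
One thing to keep in mind: `pd.get_dummies` only creates columns for the categories it actually sees in the data. The sketch below uses a hypothetical `subset` DataFrame in which Green never appears, and declares the full category list with `pd.Categorical` so that a `color_Green` column is still produced. The category list is an assumption made for this example.

```python
import pandas as pd

# Hypothetical subset of the data in which 'Green' never appears
subset = pd.DataFrame({'color': ['Red', 'Blue', 'Red']})

# Declaring the full category list keeps one column per category,
# including categories that are absent from this particular subset
subset['color'] = pd.Categorical(subset['color'], categories=['Blue', 'Green', 'Red'])

dummies = pd.get_dummies(subset, columns=['color'], dtype=int)
print(dummies)
```

This keeps the column layout consistent across different slices of the same dataset.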


## One-Hot Encoding with scikit-learn

In [54]:
from sklearn.preprocessing import OneHotEncoder

In [55]:
# Initialize the encoder
encoder = OneHotEncoder(sparse_output=False).set_output(transform="pandas")
encoder

In [56]:
# Apply One-Hot Encoding
encoded_categories = encoder.fit_transform(data[['color']])
encoded_categories

|   | color_Blue | color_Green | color_Red |
|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 |
| 1 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 0.0 | 1.0 |
| 4 | 0.0 | 1.0 | 0.0 |
| 5 | 0.0 | 1.0 | 0.0 |
| 6 | 1.0 | 0.0 | 0.0 |


In [57]:
# Concatenate the encoded categories with the original data
encoded_categories = pd.concat([data, encoded_categories], axis=1).drop(columns=['color'])

In [58]:
encoded_categories

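|   | purse | price | color_Blue | color_Green | color_Red |
|---|---|---|---|---|---|
| 0 | Gucci | 1000 | 0.0 | 0.0 | 1.0 |
| 1 | Chanel | 2000 | 1.0 | 0.0 | 0.0 |
| 2 | Hermes | 1500 | 0.0 | 1.0 | 0.0 |
| 3 | Gucci | 900 | 0.0 | 0.0 | 1.0 |
| 4 | Hermes | 1600 | 0.0 | 1.0 | 0.0 |
| 5 | Hermes | 1700 | 0.0 | 1.0 | 0.0 |
| 6 | Chanel | 2200 | 1.0 | 0.0 | 0.0 |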
