# Encoding Techniques Notebook

This notebook demonstrates various encoding techniques to handle categorical data:
1. **One-Hot Encoding**
2. **Label Encoding**
3. **Frequency Encoding**
4. **Target Encoding**

Each method addresses the challenge of converting non-numeric categorical values into numerical representations for further processing.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

## Loading the Dataset

We will use a small dataset containing categorical columns to illustrate the different encoding methods.

In [10]:
# Sample dataset
data = {
    "Customer_ID": [1, 2, 3, 4, 5],
    "Region": ["North", "South", "East", "West", "South"],
    "Product": ["A", "B", "A", "C", "B"],
    "Rating": [4.5, 3.0, 4.0, 3.5, 4.0]
}

df = pd.DataFrame(data)

# Display dataset
df

,Customer_ID,Region,Product,Rating
0,1,North,A,4.5
1,2,South,B,3.0
2,3,East,A,4.0
3,4,West,C,3.5
4,5,South,B,4.0


## One-Hot Encoding

One-hot encoding creates binary columns for each category. It is useful for nominal categorical variables without an intrinsic order.

### Example:
For the column **Region**:
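With the columns in the alphabetical order that `pd.get_dummies` produces (East, North, South, West):
- "North" → [0, 1, 0, 0]
- "South" → [0, 0, 1, 0]
- "East" → [1, 0, 0, 0]
- "West" → [0, 0, 0, 1]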


In [None]:
# One-Hot Encoding using pandas
df_one_hot = pd.get_dummies(df, columns=["Region"], prefix="Region")

# Display results
df_one_hot

,Customer_ID,Product,Rating,Region_East,Region_North,Region_South,Region_West
0,1,A,4.5,False,True,False,False
1,2,B,3.0,False,False,True,False
2,3,A,4.0,True,False,False,False
3,4,C,3.5,False,False,False,True
4,5,B,4.0,False,False,True,False


## Label Encoding

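Label encoding assigns a unique integer to each category. It suits ordinal variables whose categories have a natural order, provided the integers follow that order. scikit-learn's `LabelEncoder` numbers categories in sorted (alphabetical) order and is intended for target labels, so its codes reflect a meaningful ranking only when the sorted order happens to match it.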

### Example:
For the column **Product**:
- "A" → 0
- "B" → 1
- "C" → 2


In [11]:
# Label Encoding
df_label = df.copy()
df_label["Product_Encoded"] = LabelEncoder().fit_transform(df_label["Product"])

# Display results
df_label

,Customer_ID,Region,Product,Rating,Product_Encoded
0,1,North,A,4.5,0
1,2,South,B,3.0,1
2,3,East,A,4.0,0
3,4,West,C,3.5,2
4,5,South,B,4.0,1
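
When a column really is ordinal, the order can be stated explicitly rather than left to alphabetical sorting. The sketch below uses a hypothetical `Size` column (not part of the sample dataset) and pandas' ordered `Categorical`; sorting alphabetically would instead rank "Large" below "Medium" and "Small".

```python
import pandas as pd

# Hypothetical ordinal column; "Size" is not part of the sample dataset
sizes = pd.Series(["Medium", "Small", "Large", "Small"])

# Declare the order explicitly instead of relying on alphabetical sorting
size_order = ["Small", "Medium", "Large"]
size_codes = pd.Categorical(sizes, categories=size_order, ordered=True).codes

print(size_codes.tolist())  # [1, 0, 2, 0]
```

Values missing from `size_order` would receive the code `-1`, which makes unexpected categories easy to spot.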


## Frequency Encoding

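Frequency encoding replaces each category with the number of times it occurs in the dataset (or with its relative share). It is useful when how common a category is carries predictive signal. Categories that occur equally often receive the same value and become indistinguishable.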

### Example:
For the column **Region**:
- "South" → 2 (occurs twice)
- "North" → 1 (occurs once)
- "East" → 1 (occurs once)
- "West" → 1 (occurs once)


In [12]:
# Frequency Encoding using value_counts
df_frequency = df.copy()
df_frequency["Region_Frequency"] = df_frequency["Region"].map(df_frequency["Region"].value_counts())

# Display results
df_frequency

,Customer_ID,Region,Product,Rating,Region_Frequency
0,1,North,A,4.5,1
1,2,South,B,3.0,2
2,3,East,A,4.0,1
3,4,West,C,3.5,1
4,5,South,B,4.0,2
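
A common variant uses relative frequencies instead of raw counts, learned from one set of rows and applied to another. The following is a sketch under assumptions: `"Central"` is a hypothetical unseen region, and mapping it to `0.0` is just one reasonable fallback.

```python
import pandas as pd

# Region values from the sample dataset, treated here as training data
train_regions = pd.Series(["North", "South", "East", "West", "South"])

# Relative frequency of each region, learned from the training data only
region_share = train_regions.value_counts(normalize=True)

# "Central" is a hypothetical unseen category; map() yields NaN, which becomes 0.0
new_regions = pd.Series(["South", "East", "Central"])
new_encoded = new_regions.map(region_share).fillna(0.0)

print(new_encoded.tolist())  # [0.4, 0.2, 0.0]
```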


## Target Encoding

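Target encoding replaces each category with the mean of the target variable over the rows in that category. It is useful when categories differ systematically in a numeric outcome. Because the encoding is built from the target itself, the means should be computed on training data only.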

### Example:
For the column **Region** with the target **Rating**:
- Compute the mean rating for each region:
  - "North" → Mean Rating = 4.5
  - "South" → Mean Rating = 3.5
  - "East" → Mean Rating = 4.0
  - "West" → Mean Rating = 3.5
- Replace each region with its mean rating.


In [13]:
# Target Encoding using groupby and mean
df_target = df.copy()
df_target["Region_Target_Encoded"] = df_target["Region"].map(df_target.groupby("Region")["Rating"].mean())

# Display results
df_target

,Customer_ID,Region,Product,Rating,Region_Target_Encoded
0,1,North,A,4.5,4.5
1,2,South,B,3.0,3.5
2,3,East,A,4.0,4.0
3,4,West,C,3.5,3.5
4,5,South,B,4.0,3.5
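
With only one or two rows per region, each category mean is close to a copy of the target. One common mitigation is to shrink each mean toward the global mean. The sketch below assumes a simple additive smoothing formula, `(count * mean + m * global_mean) / (count + m)`, with a hypothetical weight `m = 2`; neither the formula nor the weight comes from the original notebook.

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["North", "South", "East", "West", "South"],
    "Rating": [4.5, 3.0, 4.0, 3.5, 4.0],
})

m = 2  # hypothetical smoothing weight; larger values pull harder toward the global mean
global_mean = df["Rating"].mean()

# Per-region mean and row count
stats = df.groupby("Region")["Rating"].agg(["mean", "count"])

# Blend each region's mean with the global mean, weighted by how many rows back it
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

df["Region_Smoothed"] = df["Region"].map(smoothed)
print(df)
```

Smoothing reduces the influence of rare categories but does not remove leakage by itself; computing `stats` on training rows only, then mapping them onto held-out rows, addresses that separately.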


## Comparison of Encoding Methods

| Encoding Method        | Use Case                                  | Limitations                                   |
|------------------------|-------------------------------------------|-----------------------------------------------|
| **One-Hot Encoding**   | Nominal categorical data                  | One column per category; wide for many categories |
| **Label Encoding**     | Ordinal categorical data                  | Implies an order that may not exist           |
| **Frequency Encoding** | Category frequency is informative         | Categories with equal frequency collide       |
| **Target Encoding**    | Categories correlated with the target     | Risk of data leakage                          |
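
The dimensionality concern in the first row can be checked before encoding by counting distinct values per column. This is a small sketch on the sample data; what counts as "too many" columns is left to the reader.

```python
import pandas as pd

df = pd.DataFrame({
    "Customer_ID": [1, 2, 3, 4, 5],
    "Region": ["North", "South", "East", "West", "South"],
    "Product": ["A", "B", "A", "C", "B"],
    "Rating": [4.5, 3.0, 4.0, 3.5, 4.0],
})

# Number of distinct categories per categorical column
cardinality = df[["Region", "Product"]].nunique()

# One-hot encoding both columns replaces 2 columns with 4 + 3 indicator columns
one_hot_width = pd.get_dummies(df, columns=["Region", "Product"]).shape[1]

print(cardinality.to_dict())  # {'Region': 4, 'Product': 3}
print(one_hot_width)          # 9
```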


## Summary

1. One-hot encoding creates binary features for categories.
2. Label encoding assigns integers to categories.
3. Frequency encoding replaces categories with their frequency.
4. Target encoding replaces each category with the target variable's mean for that category.

Each method has its own benefits and limitations. The choice depends on the dataset and the problem at hand.
