## 5. Encoding Categorical Data
- Advanced Encoding Techniques
- One-Hot Encoding with many categories (high cardinality solutions)
- Ordinal Encoding with custom order
- Target Encoding (mean target per category)
- Frequency Encoding
- Hash Encoding (sklearn’s FeatureHasher)
- When to use which encoding (classification vs regression)


## Ordinal Encoding

**What it is:**  
Ordinal Encoding converts **ordinal categorical data** (categories with a meaningful order) into **integer values** that preserve the order.

**Example:**  
| Education Level | Ordinal Encoded |
|-----------------|----------------|
| High School     | 1              |
| Bachelor        | 2              |
| Master          | 3              |
| PhD             | 4              |

**Use when:**  
- Categories have a **clear ranking or order**  
- Order matters in modeling  
- Numeric representation is required for ML models

**Avoid when:**  
- Categories are **nominal** (no order)  
- The model might interpret the numeric differences as exact distances (e.g., linear regression treats the step from 1 → 2 as equal to the step from 3 → 4)
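The ranking is only preserved if it is passed in explicitly. The sketch below compares `OrdinalEncoder` with and without the `categories` argument; the variable names are illustrative, and the data is the education example from the table above.

```python
from sklearn.preprocessing import OrdinalEncoder

levels = [["High School"], ["Bachelor"], ["Master"], ["PhD"]]

# Default: categories are inferred from the data and sorted,
# so "Bachelor" (alphabetically first) gets code 0.
default_codes = OrdinalEncoder().fit_transform(levels).ravel().tolist()

# Explicit: codes follow the ranking we supply, lowest to highest.
ranked = OrdinalEncoder(categories=[["High School", "Bachelor", "Master", "PhD"]])
ranked_codes = ranked.fit_transform(levels).ravel().tolist()

print(default_codes)  # [1.0, 0.0, 2.0, 3.0]
print(ranked_codes)   # [0.0, 1.0, 2.0, 3.0]
```

With the default, "High School" would rank above "Bachelor", which is why a custom order is needed for ordinal features.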

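**Implementation in sklearn:**  
```python
from sklearn.preprocessing import OrdinalEncoder

# Each row is one sample; the single column holds the education level
data = [['High School'], ['Bachelor'], ['Master'], ['PhD']]

# Pass the categories explicitly, lowest to highest, to fix the order
encoder = OrdinalEncoder(categories=[['High School', 'Bachelor', 'Master', 'PhD']])
encoded = encoder.fit_transform(data)
print(encoded)  # codes start at 0: 0.0, 1.0, 2.0, 3.0 (the table above counts from 1)
```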


## One-Hot Encoding

**What it is:**  
One-Hot Encoding converts **nominal categorical data** (categories with no natural order) into **binary columns**, each representing a category.

**Example:**  
| Color  | Red | Green | Blue |
|--------|-----|-------|------|
| Red    | 1   | 0     | 0    |
| Green  | 0   | 1     | 0    |
| Blue   | 0   | 0     | 1    |

**Use when:**  
- Categories are **nominal** (no order)  
- You want to **avoid numeric assumptions** in models  

**Avoid when:**  
- Dataset has **too many unique categories** → creates many columns (high dimensionality)  

**Dummy Variable Trap:**  
- If all dummy columns are included, they are **linearly dependent** (sum = 1)  
- Solution: **drop one column** to avoid multicollinearity in linear models  

**Implementation in pandas:**  
```python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue']})

# drop_first=True drops the first category in sorted order ('Blue') to avoid the dummy variable trap
df_encoded = pd.get_dummies(df, drop_first=True)
print(df_encoded)
```


## Label Encoding

**What it is:**  
Label Encoding converts **categorical data** into **integer labels**. Each category is assigned a unique number.  

**Example:**  
| Class   | Label |
|---------|-------|
| Cat     | 0     |
| Dog     | 1     |
| Rabbit  | 2     |

**Use when:**  
- Encoding **target variable (`y`)** for classification  
- Categories are **nominal or ordinal**, but for `y` the numeric values are just labels  

**Avoid when:**  
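- Encoding **input features**: `LabelEncoder` is meant for the target, and integer codes on a nominal feature may mislead models (especially linear ones) into assuming an order; use `OrdinalEncoder` or `OneHotEncoder` for features  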

**Implementation in sklearn:**  
```python
from sklearn.preprocessing import LabelEncoder

y = ['Cat', 'Dog', 'Rabbit']
le = LabelEncoder()
y_encoded = le.fit_transform(y)
print(y_encoded)  # Output: [0 1 2]
```
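Because the integers are only labels, it helps to be able to map them back. A short sketch using the fitted encoder's `classes_` attribute and `inverse_transform`; the input list is made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

y = ["Dog", "Cat", "Rabbit", "Cat"]
le = LabelEncoder()
y_encoded = le.fit_transform(y)

# classes_ holds the sorted labels; the position of each label is its code
print(le.classes_.tolist())  # ['Cat', 'Dog', 'Rabbit']
print(y_encoded.tolist())    # [1, 0, 2, 0]

# inverse_transform maps codes (e.g. model predictions) back to labels
decoded = le.inverse_transform([2, 0])
print(decoded.tolist())      # ['Rabbit', 'Cat']
```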


In [25]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [26]:
cd= pd.read_csv("data/customer.csv")

In [27]:
cd.sample(10)

Unnamed: 0,age,gender,review,education,purchased
7,60,Female,Poor,School,Yes
31,22,Female,Poor,School,Yes
12,51,Male,Poor,School,No
34,86,Male,Average,School,No
19,97,Male,Poor,PG,Yes
30,73,Male,Average,UG,No
26,53,Female,Poor,PG,No
45,61,Male,Poor,PG,Yes
43,27,Male,Poor,PG,No
18,19,Male,Good,School,No


In [28]:
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder, OneHotEncoder


# One category list per column, each ordered from lowest to highest
oe = OrdinalEncoder(categories=[["Poor", "Average", "Good"], ["School", "UG", "PG"]])

oe.fit(cd[["review", "education"]])

oe_encoded = pd.DataFrame(oe.transform(cd[["review", "education"]]), columns=["review", "education"])


In [29]:
oe_encoded.head()

Unnamed: 0,review,education
0,1.0,0.0
1,0.0,1.0
2,2.0,2.0
3,2.0,2.0
4,1.0,1.0


In [30]:
# Label encoding
le=LabelEncoder()

le.fit(cd["purchased"])

le_encoded=pd.DataFrame(le.transform(cd["purchased"]), columns=["purchased"])

In [31]:
le_encoded.head()

Unnamed: 0,purchased
0,0
1,0
2,0
3,0
4,0


In [32]:
# One hot encoding
pd.get_dummies(cd, columns=["gender"]).sample(10)

Unnamed: 0,age,review,education,purchased,gender_Female,gender_Male
23,96,Good,School,No,True,False
31,22,Poor,School,Yes,True,False
7,60,Poor,School,Yes,True,False
18,19,Good,School,No,False,True
26,53,Poor,PG,No,True,False
19,97,Poor,PG,Yes,False,True
47,38,Good,PG,Yes,True,False
28,48,Poor,School,No,False,True
12,51,Poor,School,No,False,True
5,31,Average,School,Yes,True,False


In [33]:
# k-1 (dummy) encoding: one-hot encoding with the first category dropped
pd.get_dummies(cd, columns=["gender"], drop_first=True).sample(10)

Unnamed: 0,age,review,education,purchased,gender_Male
17,22,Poor,UG,Yes,False
45,61,Poor,PG,Yes,True
10,98,Good,UG,Yes,False
21,32,Average,PG,No,True
37,94,Average,PG,Yes,True
3,72,Good,PG,No,False
47,38,Good,PG,Yes,False
18,19,Good,School,No,True
35,74,Poor,School,Yes,True
49,25,Good,UG,No,False


In [34]:
cr= pd.read_csv("data/cars.csv")
cr.head()

Unnamed: 0,brand,km_driven,fuel,owner,selling_price
0,Maruti,145500,Diesel,First Owner,450000
1,Skoda,120000,Diesel,Second Owner,370000
2,Honda,140000,Petrol,Third Owner,158000
3,Hyundai,127000,Diesel,First Owner,225000
4,Maruti,120000,Petrol,First Owner,130000


In [35]:
cr["brand"].value_counts()

brand
Maruti           2448
Hyundai          1415
Mahindra          772
Tata              734
Toyota            488
Honda             467
Ford              397
Chevrolet         230
Renault           228
Volkswagen        186
BMW               120
Skoda             105
Nissan             81
Jaguar             71
Volvo              67
Datsun             65
Mercedes-Benz      54
Fiat               47
Audi               40
Lexus              34
Jeep               31
Mitsubishi         14
Land                6
Force               6
Isuzu               5
Ambassador          4
Kia                 4
MG                  3
Daewoo              3
Ashok               1
Opel                1
Peugeot             1
Name: count, dtype: int64

In [36]:
cr.shape

(8128, 5)

In [37]:
pd.get_dummies(cr, columns=["brand"]).shape

(8128, 36)
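One-hot encoding `brand` takes the frame from 5 to 36 columns, which is the high-cardinality cost noted earlier. Frequency encoding, listed in the outline, keeps a single numeric column instead. A minimal sketch on a small made-up frame (`toy` and `brand_freq` are hypothetical names, not part of the notebook):

```python
import pandas as pd

# Hypothetical stand-in for the cars data
toy = pd.DataFrame({"brand": ["Maruti", "Maruti", "Maruti", "Hyundai", "Hyundai", "Opel"]})

# Share of rows per brand, learned from the (training) data
freq = toy["brand"].value_counts(normalize=True)

# Replace each brand with its share: one numeric column instead of one per brand
toy["brand_freq"] = toy["brand"].map(freq)
print(toy)
```

A trade-off of this approach is that two brands with the same frequency receive the same value and can no longer be told apart.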

In [38]:
# using sklearn library
oh=OneHotEncoder()

oh_encoded=oh.fit_transform(cr[["fuel", "owner"]]).toarray()

In [39]:
oh_encoded

array([[0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.],
       ...,
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.]], shape=(8128, 9))

In [40]:
np.hstack((cr[["brand", "km_driven"]].values, oh_encoded))

array([['Maruti', 145500, 0.0, ..., 0.0, 0.0, 0.0],
       ['Skoda', 120000, 0.0, ..., 1.0, 0.0, 0.0],
       ['Honda', 140000, 0.0, ..., 0.0, 0.0, 1.0],
       ...,
       ['Maruti', 120000, 0.0, ..., 0.0, 0.0, 0.0],
       ['Tata', 25000, 0.0, ..., 0.0, 0.0, 0.0],
       ['Tata', 25000, 0.0, ..., 0.0, 0.0, 0.0]],
      shape=(8128, 11), dtype=object)

In [41]:
cr_encoded=np.hstack((cr[["brand", "km_driven"]].values, oh_encoded))
cr_encoded.shape

(8128, 11)
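In the stacked array, `brand` is still a raw string. Target encoding, listed in the outline as "mean target per category", is one way to turn it into a number. The sketch below uses made-up rows and a hypothetical column name `brand_target_enc`; it assumes `selling_price` is the regression target.

```python
import pandas as pd

# Hypothetical rows shaped like the cars data
toy = pd.DataFrame({
    "brand": ["Maruti", "Maruti", "Skoda", "Honda", "Honda"],
    "selling_price": [450000, 130000, 370000, 158000, 142000],
})

# Mean target per category; in practice compute this on the training split only
brand_means = toy.groupby("brand")["selling_price"].mean()

# Replace each brand with the mean selling price observed for it
toy["brand_target_enc"] = toy["brand"].map(brand_means)
print(toy)
```

Computing the means on the full dataset leaks the target into the features, so the mapping should be learned from training rows and then applied to validation and test rows. For classification, the same idea applies with the mean of a 0/1 target.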

In [43]:
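# Brands with at most `threshold` rows will be grouped into one "uncommon" bucket
threshold = 100
counts = cr["brand"].value_counts()

# Index of the rare brand names to replace
rep = counts[counts <= threshold].index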


In [44]:
rep

Index(['Nissan', 'Jaguar', 'Volvo', 'Datsun', 'Mercedes-Benz', 'Fiat', 'Audi',
       'Lexus', 'Jeep', 'Mitsubishi', 'Land', 'Force', 'Isuzu', 'Ambassador',
       'Kia', 'MG', 'Daewoo', 'Ashok', 'Opel', 'Peugeot'],
      dtype='object', name='brand')

In [45]:
pd.get_dummies(cr["brand"].replace(rep, "uncommon"), dtype=int)

Unnamed: 0,BMW,Chevrolet,Ford,Honda,Hyundai,Mahindra,Maruti,Renault,Skoda,Tata,Toyota,Volkswagen,uncommon
0,0,0,0,0,0,0,1,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,1,0,0,0,0
2,0,0,0,1,0,0,0,0,0,0,0,0,0
3,0,0,0,0,1,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8123,0,0,0,0,1,0,0,0,0,0,0,0,0
8124,0,0,0,0,1,0,0,0,0,0,0,0,0
8125,0,0,0,0,0,0,1,0,0,0,0,0,0
8126,0,0,0,0,0,0,0,0,0,1,0,0,0


## ColumnTransformer

**What it is:**  
`ColumnTransformer` allows you to apply **different preprocessing steps** to **different columns** of a dataset in a single pipeline.  
This is especially useful when you have **mixed data types** (numerical + categorical).

---

**Key Features:**  
- Apply **scalers** to numeric columns  
- Apply **encoders** to categorical columns  
- Combine transformations into **one unified dataset**  
- Integrates seamlessly with **scikit-learn pipelines**

---

**Example:**  
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import pandas as pd

# Sample dataset
data = pd.DataFrame({
    'Age': [25, 30, 22],
    'Salary': [50000, 60000, 45000],
    'City': ['Kathmandu', 'Pokhara', 'Lalitpur']
})

# Define ColumnTransformer
ct = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['Age', 'Salary']),     # scale numeric columns
    ('cat', OneHotEncoder(), ['City'])               # encode categorical column
])

# Fit and transform
transformed_data = ct.fit_transform(data)
print(transformed_data)
```
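To illustrate the pipeline integration listed under Key Features, here is a minimal sketch that chains a `ColumnTransformer` with a classifier. The `Bought` column, the extra rows, and the choice of `LogisticRegression` are assumptions made for illustration only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: the Bought column is made up for illustration
data = pd.DataFrame({
    "Age": [25, 30, 22, 41, 35, 28],
    "Salary": [50000, 60000, 45000, 80000, 72000, 52000],
    "City": ["Kathmandu", "Pokhara", "Lalitpur", "Kathmandu", "Pokhara", "Lalitpur"],
    "Bought": [0, 1, 0, 1, 1, 0],
})
X, y = data.drop(columns="Bought"), data["Bought"]

preprocess = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["Age", "Salary"]),               # scale numeric columns
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["City"]),  # encode categorical column
])

# The pipeline fits the preprocessing and the model as one estimator
model = Pipeline(steps=[("preprocess", preprocess), ("clf", LogisticRegression())])
model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Keeping the encoders inside the pipeline means they are fitted only on the data passed to `fit`, so the same transformations are applied consistently at prediction time.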


In [46]:
cd.head()

Unnamed: 0,age,gender,review,education,purchased
0,30,Female,Average,School,No
1,68,Female,Poor,UG,No
2,70,Female,Good,PG,No
3,72,Female,Good,PG,No
4,16,Female,Average,UG,No


In [47]:
from sklearn.compose import ColumnTransformer

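transformer = ColumnTransformer(transformers=[
    # ordinal columns, each with an explicit low-to-high order
    ("tf1", OrdinalEncoder(categories=[["Poor", "Average", "Good"], ["School", "UG", "PG"]]), ["review", "education"]),
    # nominal column: one-hot, first category dropped
    ("tf3", OneHotEncoder(drop="first"), ["gender"])],
    remainder="passthrough")  # age and purchased are passed through unchanged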

In [48]:
transformer.fit_transform(cd)[:10]

array([[1.0, 0.0, 0.0, 30, 'No'],
       [0.0, 1.0, 0.0, 68, 'No'],
       [2.0, 2.0, 0.0, 70, 'No'],
       [2.0, 2.0, 0.0, 72, 'No'],
       [1.0, 1.0, 0.0, 16, 'No'],
       [1.0, 0.0, 0.0, 31, 'Yes'],
       [2.0, 0.0, 1.0, 18, 'No'],
       [0.0, 0.0, 0.0, 60, 'Yes'],
       [1.0, 1.0, 0.0, 65, 'No'],
       [2.0, 1.0, 1.0, 74, 'Yes']], dtype=object)