## One-Hot Encoding

##### One-Hot Encoding (OHE) is a technique used to convert categorical variables into a numerical format so that machine learning models can process them.

##### One-Hot Encoding is used for nominal data: categories with no inherent order (e.g. colors, brands).

In [1]:
color = ["yellow","blue","red"]

In [2]:
# we create a column for each category 

| Color  | yellow | blue | red |
| ------ | ------ | ---- | --- |
| yellow | 1      | 0    | 0   |
| blue   | 0      | 1    | 0   |
| red    | 0      | 0    | 1   |
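
The table above can be reproduced with a few lines of pandas. This is a minimal sketch on the toy color list; the variable names `colors` and `one_hot` are only for illustration.

```python
import pandas as pd

# Toy data matching the color example above
colors = pd.Series(["yellow", "blue", "red"], name="Color")

# One indicator column per category; dtype=int gives 0/1 instead of True/False
one_hot = pd.get_dummies(colors, dtype=int)

# get_dummies orders columns alphabetically, so reorder to match the table
one_hot = one_hot[["yellow", "blue", "red"]]
print(one_hot)
```

Each row has exactly one `1`, in the column of its own category.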


###  Dummy Variable Trap 

In [3]:
# Drop one dummy column

##### Important ML Rule
If a categorical feature has k categories:
Use k − 1 dummy variables

In [5]:
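# Multicollinearity occurs when two or more input features are highly correlated.
# With all k dummy columns, every row sums to 1, so each column is fully determined by the other k - 1.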
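
To see why one dummy column is redundant, the sketch below compares the rank of a design matrix (intercept plus dummies) with and without `drop_first=True`. The small `fuel` series is made-up illustration data, not the cars dataset.

```python
import numpy as np
import pandas as pd

# Made-up sample with k = 4 fuel categories
fuel = pd.Series(["Diesel", "Petrol", "CNG", "LPG", "Diesel", "Petrol"])

full = pd.get_dummies(fuel, dtype=int)                      # k = 4 columns
reduced = pd.get_dummies(fuel, drop_first=True, dtype=int)  # k - 1 = 3 columns

# Every row of the full encoding sums to 1
row_sums = full.sum(axis=1)

# Add an intercept column, as a linear model would
intercept = np.ones((len(fuel), 1))
X_full = np.hstack([intercept, full.to_numpy()])        # 5 columns
X_reduced = np.hstack([intercept, reduced.to_numpy()])  # 4 columns

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(X_full.shape[1], rank_full)        # more columns than rank: redundant column
print(X_reduced.shape[1], rank_reduced)  # columns equal rank: no redundancy
```

With all k dummies the intercept equals the sum of the dummy columns, so the matrix loses full column rank. Dropping one dummy restores it.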

##### Problem : High-cardinality features (too many categories)
##### Solution : 
1. Group Rare Categories : If frequency < threshold → "Other"
2. Frequency / Count Encoding : Replace each category with its occurrence count or percentage
3. Target Encoding : Encode each category using the mean of the target variable [⚠️ Needs careful regularization to avoid leakage]
4. Drop OHE for Trees? Tree-based models handle this better, but sparsity still exists.
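
Options 2 and 3 above can be sketched as follows. The data is made up, and the smoothing formula with weight `m` is one common way to regularize target encoding, not the only one; `m = 2` is an arbitrary illustrative value. In practice both mappings should be learned on the training split only.

```python
import pandas as pd

# Made-up training data (prices are arbitrary numbers)
train = pd.DataFrame({
    "brand": ["Maruti", "Maruti", "Maruti", "Hyundai", "Hyundai", "Skoda"],
    "selling_price": [100, 200, 300, 400, 600, 900],
})

# Frequency encoding: replace each category with its share of the rows
freq = train["brand"].value_counts(normalize=True)
train["brand_freq"] = train["brand"].map(freq)

# Smoothed target encoding: shrink each category mean toward the global mean.
# m is an assumed smoothing weight; larger m means stronger shrinkage.
m = 2
global_mean = train["selling_price"].mean()
stats = train.groupby("brand")["selling_price"].agg(["mean", "count"])
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
train["brand_target"] = train["brand"].map(smoothed)

print(train)
```

A brand seen only once (Skoda here) is pulled strongly toward the global mean, which limits how much a single row's target value leaks into its own feature.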

## Example 

In [6]:
import pandas as pd
import numpy as np

In [7]:
df = pd.read_csv("cars.csv")

In [8]:
df.head()

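|     | brand   | km_driven | fuel   | owner        | selling_price |
| --- | ------- | --------- | ------ | ------------ | ------------- |
| 0   | Maruti  | 145500    | Diesel | First Owner  | 450000        |
| 1   | Skoda   | 120000    | Diesel | Second Owner | 370000        |
| 2   | Honda   | 140000    | Petrol | Third Owner  | 158000        |
| 3   | Hyundai | 127000    | Diesel | First Owner  | 225000        |
| 4   | Maruti  | 120000    | Petrol | First Owner  | 130000        |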


In [10]:
df["owner"].value_counts()

owner
First Owner             5289
Second Owner            2105
Third Owner              555
Fourth & Above Owner     174
Test Drive Car             5
Name: count, dtype: int64

In [11]:
df["brand"].value_counts()

brand
Maruti           2448
Hyundai          1415
Mahindra          772
Tata              734
Toyota            488
Honda             467
Ford              397
Chevrolet         230
Renault           228
Volkswagen        186
BMW               120
Skoda             105
Nissan             81
Jaguar             71
Volvo              67
Datsun             65
Mercedes-Benz      54
Fiat               47
Audi               40
Lexus              34
Jeep               31
Mitsubishi         14
Force               6
Land                6
Isuzu               5
Kia                 4
Ambassador          4
Daewoo              3
MG                  3
Ashok               1
Opel                1
Peugeot             1
Name: count, dtype: int64

In [12]:
df["brand"].nunique()

32

In [13]:
df['fuel'].value_counts()

fuel
Diesel    4402
Petrol    3631
CNG         57
LPG         38
Name: count, dtype: int64

### 1. OneHotEncoding using Pandas

In [20]:
df_encoded = pd.get_dummies(
    df,
    columns=["fuel","owner"],   # only these two
    drop_first=True             
)

In [21]:
df_encoded

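|      | brand   | km_driven | selling_price | fuel_Diesel | fuel_LPG | fuel_Petrol | owner_Fourth & Above Owner | owner_Second Owner | owner_Test Drive Car | owner_Third Owner |
| ---- | ------- | --------- | ------------- | ----------- | -------- | ----------- | -------------------------- | ------------------ | -------------------- | ----------------- |
| 0    | Maruti  | 145500    | 450000        | True        | False    | False       | False                      | False              | False                | False             |
| 1    | Skoda   | 120000    | 370000        | True        | False    | False       | False                      | True               | False                | False             |
| 2    | Honda   | 140000    | 158000        | False       | False    | True        | False                      | False              | False                | True              |
| 3    | Hyundai | 127000    | 225000        | True        | False    | False       | False                      | False              | False                | False             |
| 4    | Maruti  | 120000    | 130000        | False       | False    | True        | False                      | False              | False                | False             |
| ...  | ...     | ...       | ...           | ...         | ...      | ...         | ...                        | ...                | ...                  | ...               |
| 8123 | Hyundai | 110000    | 320000        | False       | False    | True        | False                      | False              | False                | False             |
| 8124 | Hyundai | 119000    | 135000        | True        | False    | False       | True                       | False              | False                | False             |
| 8125 | Maruti  | 120000    | 382000        | True        | False    | False       | False                      | False              | False                | False             |
| 8126 | Tata    | 25000     | 290000        | True        | False    | False       | False                      | False              | False                | False             |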


In [22]:
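# drop_first=True avoids the dummy variable trap by keeping k - 1 columns per feature.
# The dropped (baseline) category is the first one alphabetically: CNG for fuel, First Owner for owner.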

##### Pandas is for data handling, not for production-grade ML pipelines.

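Why Pandas is NOT Preferred in ML Pipelines : 
1. Data Leakage Risk : `get_dummies` is often applied to the full dataset before the train/test split, so the encoding is derived from test data as well.
2. No Pipeline Integration : ML needs [Scaler → Encoder → Model], but `get_dummies` is a plain function, not a fitted transformer, so it
   - cannot be chained into a pipeline
   - cannot be saved as a fitted object
   - cannot be reused safely on new data
3. Inconsistent Columns : Train and test data can produce different columns when a category is missing from one of them.
4. Not Production Safe : A category that first appears at inference time creates a column the model was never trained on.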

### OneHotEncoding using Sklearn

In [23]:
X = df.drop("selling_price", axis=1)
y = df["selling_price"]

In [25]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


In [28]:
df

|      | brand   | km_driven | fuel   | owner                | selling_price |
| ---- | ------- | --------- | ------ | -------------------- | ------------- |
| 0    | Maruti  | 145500    | Diesel | First Owner          | 450000        |
| 1    | Skoda   | 120000    | Diesel | Second Owner         | 370000        |
| 2    | Honda   | 140000    | Petrol | Third Owner          | 158000        |
| 3    | Hyundai | 127000    | Diesel | First Owner          | 225000        |
| 4    | Maruti  | 120000    | Petrol | First Owner          | 130000        |
| ...  | ...     | ...       | ...    | ...                  | ...           |
| 8123 | Hyundai | 110000    | Petrol | First Owner          | 320000        |
| 8124 | Hyundai | 119000    | Diesel | Fourth & Above Owner | 135000        |
| 8125 | Maruti  | 120000    | Diesel | First Owner          | 382000        |
| 8126 | Tata    | 25000     | Diesel | First Owner          | 290000        |


In [29]:
X_train.head()

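|      | brand   | km_driven | fuel   | owner                |
| ---- | ------- | --------- | ------ | -------------------- |
| 6518 | Tata    | 2560      | Petrol | First Owner          |
| 6144 | Honda   | 80000     | Petrol | Second Owner         |
| 6381 | Hyundai | 150000    | Diesel | Fourth & Above Owner |
| 438  | Maruti  | 120000    | Diesel | Second Owner         |
| 5939 | Maruti  | 25000     | Petrol | First Owner          |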


In [30]:
from sklearn.preprocessing import OneHotEncoder

In [47]:
ohe = OneHotEncoder(drop="first", sparse_output=False, dtype=np.int32)

In [48]:
X_train_new = ohe.fit_transform(X_train[["fuel","owner"]])
X_test_new = ohe.transform(X_test[["fuel","owner"]])

In [49]:
X_train_new

array([[0, 0, 1, ..., 0, 0, 0],
       [0, 0, 1, ..., 1, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 1, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 1, 0, 0]])

In [50]:
X_train[["brand","km_driven"]].values

array([['Tata', 2560],
       ['Honda', 80000],
       ['Hyundai', 150000],
       ...,
       ['Hyundai', 35000],
       ['Maruti', 27000],
       ['Maruti', 70000]], dtype=object)

In [51]:
X_train_stacked = np.hstack([
    X_train_new,                          # OHE (fuel, owner)
    X_train[["brand", "km_driven"]].values
])

X_train_stacked

array([[0, 0, 1, ..., 0, 'Tata', 2560],
       [0, 0, 1, ..., 0, 'Honda', 80000],
       [1, 0, 0, ..., 0, 'Hyundai', 150000],
       ...,
       [0, 0, 1, ..., 0, 'Hyundai', 35000],
       [1, 0, 0, ..., 0, 'Maruti', 27000],
       [0, 0, 1, ..., 0, 'Maruti', 70000]], dtype=object)

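##### NOTE : 

The stacked array has dtype=object because brand is still a raw string column, so it cannot be passed to a model as it is.
In real ML pipelines, raw categorical columns (like brand) must be either:
1. encoded (One-Hot / Target / Frequency), or
2. excluded from model training.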

In [52]:
X_train_stacked.shape

(6502, 9)

### OneHotEncoding with Top Categories 

In [64]:
counts = df['brand'].value_counts()

In [65]:
df['brand'].nunique()

32

In [70]:
threshold = 100
counts = df['brand'].value_counts()
rare = counts[counts < threshold].index

brand_top = df['brand'].replace(rare, 'Other')

In [73]:
pd.get_dummies(brand_top, drop_first=True).astype(int)

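|      | Chevrolet | Ford | Honda | Hyundai | Mahindra | Maruti | Other | Renault | Skoda | Tata | Toyota | Volkswagen |
| ---- | --------- | ---- | ----- | ------- | -------- | ------ | ----- | ------- | ----- | ---- | ------ | ---------- |
| 0    | 0         | 0    | 0     | 0       | 0        | 1      | 0     | 0       | 0     | 0    | 0      | 0          |
| 1    | 0         | 0    | 0     | 0       | 0        | 0      | 0     | 0       | 1     | 0    | 0      | 0          |
| 2    | 0         | 0    | 1     | 0       | 0        | 0      | 0     | 0       | 0     | 0    | 0      | 0          |
| 3    | 0         | 0    | 0     | 1       | 0        | 0      | 0     | 0       | 0     | 0    | 0      | 0          |
| 4    | 0         | 0    | 0     | 0       | 0        | 1      | 0     | 0       | 0     | 0    | 0      | 0          |
| ...  | ...       | ...  | ...   | ...     | ...      | ...    | ...   | ...     | ...   | ...  | ...    | ...        |
| 8123 | 0         | 0    | 0     | 1       | 0        | 0      | 0     | 0       | 0     | 0    | 0      | 0          |
| 8124 | 0         | 0    | 0     | 1       | 0        | 0      | 0     | 0       | 0     | 0    | 0      | 0          |
| 8125 | 0         | 0    | 0     | 0       | 0        | 1      | 0     | 0       | 0     | 0    | 0      | 0          |
| 8126 | 0         | 0    | 0     | 0       | 0        | 0      | 0     | 0       | 0     | 1    | 0      | 0          |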


In [75]:
# brand_top is a standalone Series (it was not added to df as a column)
brand_top.unique()

array(['Maruti', 'Skoda', 'Honda', 'Hyundai', 'Toyota', 'Ford', 'Renault',
       'Mahindra', 'Tata', 'Chevrolet', 'Other', 'Volkswagen', 'BMW'],
      dtype=object)

##  One-Hot Encoding : Lecture Conclusion

In this lecture, we studied **One-Hot Encoding**, a core technique for handling
**nominal categorical data** in machine learning.

### What we covered
- Why categorical variables must be converted into numerical form
- Difference between **nominal** and **ordinal** data
- One-Hot Encoding using **pandas** and **sklearn**
- Dummy Variable Trap and the use of `drop_first=True` (pandas) or `drop="first"` (sklearn)
- Multicollinearity and why it matters for linear models
- Handling high-cardinality features using **top categories + threshold**
- Grouping rare categories into an **`Other`** class
- Understanding baseline categories in encoded data

### Key takeaways
- One-Hot Encoding prevents false ordering assumptions
- Most ML models require **numerical input features**
- Rare categories can increase noise and sparsity
- Pandas is useful for **EDA and learning**
- sklearn encoders are preferred for **ML pipelines**

> One-Hot Encoding is simple in concept, but powerful when applied correctly.

---
