# One-Hot Encoding: Practical Walkthrough

This notebook demonstrates **One-Hot Encoding**, a technique used to convert categorical data into numerical form so machine learning models can use it.

---

## 🔹 What is One-Hot Encoding?

One-Hot Encoding converts each category in a column into a new binary column:

Example:

| Fuel | Petrol | Diesel | CNG |
|------|--------|--------|-----|
| Petrol | 1 | 0 | 0 |
| Diesel | 0 | 1 | 0 |
| CNG | 0 | 0 | 1 |

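Why is it needed?

- Most ML models require numeric input and cannot use text categories directly
- Prevents models from assuming an order between categories (Petrol > Diesel ❌)
- Can improve predictions compared with arbitrary integer codes, which a model may misread as magnitudes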
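
To make the ordering problem concrete, here is a minimal sketch on a toy column (hypothetical data, not the cars dataset). It compares plain integer codes with one-hot columns produced by pandas' `get_dummies`, which this notebook covers later.

```python
import pandas as pd

# Toy column (hypothetical values, for illustration only)
fuel = pd.Series(["Petrol", "Diesel", "CNG", "Petrol"], name="fuel")

# Integer codes impose an arbitrary order: CNG=0 < Diesel=1 < Petrol=2
codes = fuel.astype("category").cat.codes

# One-hot encoding gives each category its own 0/1 column instead
one_hot = pd.get_dummies(fuel, dtype=int)

print(codes.tolist())
print(one_hot)
```

Each row of `one_hot` contains exactly one `1`, so no category is treated as "larger" than another.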


In [1]:
import pandas as pd
import numpy as np

## 📦 Importing Libraries

We import:

- **pandas** → data handling and manipulation  
- **numpy** → numerical operations  

These are standard libraries in data science workflows.


In [2]:
df = pd.read_csv('dataset/cars.csv')

In [3]:
df

Unnamed: 0,brand,km_driven,fuel,owner,selling_price
0,Maruti,145500,Diesel,First Owner,450000
1,Skoda,120000,Diesel,Second Owner,370000
2,Honda,140000,Petrol,Third Owner,158000
3,Hyundai,127000,Diesel,First Owner,225000
4,Maruti,120000,Petrol,First Owner,130000
...,...,...,...,...,...
8123,Hyundai,110000,Petrol,First Owner,320000
8124,Hyundai,119000,Diesel,Fourth & Above Owner,135000
8125,Maruti,120000,Diesel,First Owner,382000
8126,Tata,25000,Diesel,First Owner,290000


## 🔍 Exploring Categorical Columns

We use `.value_counts()` to:

- Count frequency of each category
- Understand distribution
- Detect rare categories

This helps decide encoding strategy.
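
As a small illustration of spotting rare categories, the sketch below uses a toy column and an arbitrary 15% cutoff. Both are assumptions for demonstration, not values taken from this dataset.

```python
import pandas as pd

# Hypothetical toy column standing in for a categorical feature
fuel = pd.Series(["Diesel"] * 6 + ["Petrol"] * 3 + ["CNG"], name="fuel")

counts = fuel.value_counts()                # absolute frequencies
shares = fuel.value_counts(normalize=True)  # relative frequencies (sum to 1)

# Flag categories below an illustrative 15% share as "rare"
rare = shares[shares < 0.15].index.tolist()

print(counts)
print(rare)
```

Relative frequencies are often easier to compare across columns than raw counts.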


In [4]:
df['owner'].value_counts()

owner
First Owner             5289
Second Owner            2105
Third Owner              555
Fourth & Above Owner     174
Test Drive Car             5
Name: count, dtype: int64

In [5]:
df['fuel'].value_counts()

fuel
Diesel    4402
Petrol    3631
CNG         57
LPG         38
Name: count, dtype: int64

In [6]:
df['brand'].value_counts()

brand
Maruti           2448
Hyundai          1415
Mahindra          772
Tata              734
Toyota            488
Honda             467
Ford              397
Chevrolet         230
Renault           228
Volkswagen        186
BMW               120
Skoda             105
Nissan             81
Jaguar             71
Volvo              67
Datsun             65
Mercedes-Benz      54
Fiat               47
Audi               40
Lexus              34
Jeep               31
Mitsubishi         14
Land                6
Force               6
Isuzu               5
Ambassador          4
Kia                 4
MG                  3
Daewoo              3
Ashok               1
Opel                1
Peugeot             1
Name: count, dtype: int64

## 🟢 One-Hot Encoding using Pandas

`pd.get_dummies()` converts the selected categorical columns into binary columns.

Parameters used:

- `columns` → columns to encode
- `dtype=int` → outputs 0/1 integers instead of the default type (booleans in pandas 2.x)

Advantage:
- Quick and simple

Limitation:
- It does not remember the categories it has seen, so train and test sets encoded separately can end up with different columns
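
The column-mismatch limitation can be reproduced with two tiny frames. This is a sketch on made-up data; aligning with `reindex` is one possible workaround, not necessarily what a production pipeline should do.

```python
import pandas as pd

# Hypothetical train/test frames with different category sets
train = pd.DataFrame({"fuel": ["Petrol", "Diesel", "CNG"]})
test = pd.DataFrame({"fuel": ["Petrol", "LPG"]})

train_enc = pd.get_dummies(train, columns=["fuel"], dtype=int)
test_enc = pd.get_dummies(test, columns=["fuel"], dtype=int)

print(list(train_enc.columns))  # columns derived from train categories
print(list(test_enc.columns))   # columns derived from test categories

# Align test to the training columns: missing columns are filled with 0,
# and columns unseen during training (fuel_LPG) are dropped
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_aligned)
```

After alignment, the unseen `LPG` row is encoded as all zeros.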


In [7]:
pd.get_dummies(df, columns=['fuel', 'owner'], dtype=int)

Unnamed: 0,brand,km_driven,selling_price,fuel_CNG,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_First Owner,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,0,1,0,0,1,0,0,0,0
1,Skoda,120000,370000,0,1,0,0,0,0,1,0,0
2,Honda,140000,158000,0,0,0,1,0,0,0,0,1
3,Hyundai,127000,225000,0,1,0,0,1,0,0,0,0
4,Maruti,120000,130000,0,0,0,1,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,0,0,0,1,1,0,0,0,0
8124,Hyundai,119000,135000,0,1,0,0,0,1,0,0,0
8125,Maruti,120000,382000,0,1,0,0,1,0,0,0,0
8126,Tata,25000,290000,0,1,0,0,1,0,0,0,0


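## ⚠️ Dummy Variable Trap (k-1 Encoding)

If a column has **k categories**, keeping all k dummy columns makes them perfectly collinear: they always sum to 1, which duplicates the intercept in linear models (multicollinearity).

Solution:
Drop one column and encode with k-1 dummies.

Example:
If only Petrol and Diesel exist, Petrol=0 already implies Diesel=1.

`drop_first=True` removes the first category of each encoded column (here `fuel_CNG` and `owner_First Owner`) to avoid redundancy.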
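
The redundancy can be checked numerically. The sketch below uses toy data and an explicit intercept column (an assumption standing in for what a linear model adds internally), then compares the rank of the design matrix with and without `drop_first=True`.

```python
import numpy as np
import pandas as pd

# Toy column with k = 3 categories (hypothetical data)
fuel = pd.Series(["Petrol", "Diesel", "CNG", "Diesel", "Petrol"])

full = pd.get_dummies(fuel, dtype=int)                      # k columns
reduced = pd.get_dummies(fuel, dtype=int, drop_first=True)  # k - 1 columns

# Add an intercept column of ones, as a linear model would
intercept = np.ones((len(fuel), 1))
X_full = np.hstack([intercept, full.to_numpy()])
X_reduced = np.hstack([intercept, reduced.to_numpy()])

# Rank below the column count means the columns are linearly dependent
print(np.linalg.matrix_rank(X_full), "of", X_full.shape[1], "columns")
print(np.linalg.matrix_rank(X_reduced), "of", X_reduced.shape[1], "columns")
```

With all k dummies the matrix is rank-deficient; with k-1 dummies it has full column rank.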


In [8]:
pd.get_dummies(df, columns=['fuel', 'owner'], dtype=int, drop_first=True)

Unnamed: 0,brand,km_driven,selling_price,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,1,0,0,0,0,0,0
1,Skoda,120000,370000,1,0,0,0,1,0,0
2,Honda,140000,158000,0,0,1,0,0,0,1
3,Hyundai,127000,225000,1,0,0,0,0,0,0
4,Maruti,120000,130000,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,0,0,1,0,0,0,0
8124,Hyundai,119000,135000,1,0,0,1,0,0,0
8125,Maruti,120000,382000,1,0,0,0,0,0,0
8126,Tata,25000,290000,1,0,0,0,0,0,0


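# 🔵 One-Hot Encoding using Scikit-Learn

Encoding with pandas alone is risky in ML pipelines because:

- Train and test sets may end up with different columns
- Column order may change between datasets

Scikit-learn's encoder learns the categories once during `fit()` and reuses them, so the encoding stays consistent.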


## ✂️ Train-Test Split

`train_test_split()` splits data into:

- Training set → model learns from this  
- Testing set → model evaluation

Common ratio:
80% train / 20% test
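
A minimal sketch of an 80/20 split on a ten-row toy frame (hypothetical data), showing the resulting sizes and that features and target stay aligned by index:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with one feature and one target (made-up values)
toy = pd.DataFrame({"km_driven": range(10), "selling_price": range(100, 110)})
X = toy[["km_driven"]]
y = toy["selling_price"]

# test_size=0.2 keeps 20% of the rows for evaluation;
# random_state makes the shuffle reproducible
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=2)

print(X_tr.shape, X_te.shape)
```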


In [9]:
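from sklearn.model_selection import train_test_split
# Features: first four columns (brand, km_driven, fuel, owner); target: last column (selling_price)
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 0:4], df.iloc[:, -1], test_size=0.2, random_state=2)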

In [10]:
X_train

Unnamed: 0,brand,km_driven,fuel,owner
5571,Hyundai,35000,Diesel,First Owner
2038,Jeep,60000,Diesel,First Owner
2957,Hyundai,25000,Petrol,First Owner
7618,Mahindra,130000,Diesel,Second Owner
6684,Hyundai,155000,Diesel,First Owner
...,...,...,...,...
3606,Ford,35000,Diesel,First Owner
5704,Maruti,120000,Petrol,First Owner
6637,Tata,15000,Petrol,First Owner
2575,Maruti,32500,Diesel,Second Owner


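## 🧠 OneHotEncoder (Sklearn)

`OneHotEncoder` is preferred in ML pipelines.

Parameters:

- `drop='first'` → drops the first category of each feature to avoid the dummy variable trap
- `sparse_output=False` → returns a dense NumPy array instead of a sparse matrix (this parameter was named `sparse` before scikit-learn 1.2)
- `dtype=np.int32` → stores 0/1 as 32-bit integers instead of the default 64-bit floats

Why better?
- Consistent encoding: categories learned from the training data are reused on new data
- Works well in pipelines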


In [11]:
from sklearn.preprocessing import OneHotEncoder


In [12]:
ohe = OneHotEncoder(drop='first', sparse_output=False, dtype=np.int32)

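## 🔄 fit_transform()

Two steps happen here:

1. **fit()**  
   Learns the unique categories of each column

2. **transform()**  
   Converts categories into binary columns

`fit_transform()` combines both. Call it on the training set only; use `transform()` on the test set so both share the same columns.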


In [13]:
x_train = ohe.fit_transform(X_train[['fuel','owner']])

In [14]:
x_train

array([[1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       ...,
       [0, 0, 1, ..., 0, 0, 0],
       [1, 0, 0, ..., 1, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]], shape=(6502, 7), dtype=int32)

In [15]:
x_train.shape

(6502, 7)

In [16]:
# Reuse the categories learned from X_train; do not refit on the test set
x_test = ohe.transform(X_test[['fuel','owner']])

In [17]:
x_test

array([[0, 0, 1, ..., 0, 0, 0],
       [1, 0, 0, ..., 1, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       ...,
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 1, ..., 1, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]], shape=(1626, 7), dtype=int32)

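## 🔗 Combining Encoded Data

`np.hstack()` merges arrays horizontally.

We combine:

- The remaining columns (`brand`, `km_driven`)
- Encoded categorical columns

Because `brand` is still a string column here, the stacked array has `object` dtype. `brand` must also be encoded (see the high-cardinality section) before the data is ML-ready.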


In [18]:
a = np.hstack((X_train[['brand', 'km_driven']].values, x_train))

In [19]:
a = pd.DataFrame(a)
a

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,Hyundai,35000,1,0,0,0,0,0,0
1,Jeep,60000,1,0,0,0,0,0,0
2,Hyundai,25000,0,0,1,0,0,0,0
3,Mahindra,130000,1,0,0,0,1,0,0
4,Hyundai,155000,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...
6497,Ford,35000,1,0,0,0,0,0,0
6498,Maruti,120000,0,0,1,0,0,0,0
6499,Tata,15000,0,0,1,0,0,0,0
6500,Maruti,32500,1,0,0,0,1,0,0


# ⭐ Handling High Cardinality (Top-K Encoding)

Some columns have many categories (e.g., car brands).

Too many categories → too many dummy columns → higher risk of overfitting and slower models.

Solution:
Keep only the most frequent categories and group the rest as "uncommon".
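
A literal top-K version of this idea might look like the sketch below. The toy data and the choice `k = 2` are assumptions for illustration; the notebook itself uses a frequency threshold instead.

```python
import pandas as pd

# Toy brand column (hypothetical counts)
brand = pd.Series(
    ["Maruti"] * 4 + ["Hyundai"] * 3 + ["Tata"] * 2 + ["Opel", "Kia"],
    name="brand",
)

k = 2  # illustrative choice of how many categories to keep
top_k = brand.value_counts().nlargest(k).index

# Keep values in the top k, replace everything else with "uncommon"
grouped = brand.where(brand.isin(top_k), "uncommon")
encoded = pd.get_dummies(grouped, dtype=int)

print(encoded.sum())  # rows per resulting column
```

Five original categories collapse into three dummy columns.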


In [20]:
counts = df['brand'].value_counts()

In [21]:
counts

brand
Maruti           2448
Hyundai          1415
Mahindra          772
Tata              734
Toyota            488
Honda             467
Ford              397
Chevrolet         230
Renault           228
Volkswagen        186
BMW               120
Skoda             105
Nissan             81
Jaguar             71
Volvo              67
Datsun             65
Mercedes-Benz      54
Fiat               47
Audi               40
Lexus              34
Jeep               31
Mitsubishi         14
Land                6
Force               6
Isuzu               5
Ambassador          4
Kia                 4
MG                  3
Daewoo              3
Ashok               1
Opel                1
Peugeot             1
Name: count, dtype: int64

In [22]:
df['brand'].nunique()

32

## 🎯 Frequency Thresholding

Steps:

1. Count category frequency  
2. Set threshold  
3. Replace rare categories with "uncommon"

This reduces dimensionality and noise.
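
The three steps above can be sketched on a toy column. The data and the threshold of 2 are assumptions chosen so the result is easy to verify by hand.

```python
import pandas as pd

# Toy brand column (hypothetical counts)
brand = pd.Series(["Maruti"] * 5 + ["Hyundai"] * 3 + ["Opel", "Kia"], name="brand")

counts = brand.value_counts()              # step 1: count category frequency
threshold = 2                              # step 2: set threshold (illustrative)
rare = counts[counts <= threshold].index   # categories at or below the threshold

# Step 3: replace rare categories with "uncommon"
grouped = brand.where(~brand.isin(rare), "uncommon")

print(grouped.value_counts())
```

In a train/test setting, the list of rare categories would normally be derived from the training data only and then applied to both sets.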


In [23]:
threshold = 100
repl = counts[counts <= threshold].index

In [24]:
repl

Index(['Nissan', 'Jaguar', 'Volvo', 'Datsun', 'Mercedes-Benz', 'Fiat', 'Audi',
       'Lexus', 'Jeep', 'Mitsubishi', 'Land', 'Force', 'Isuzu', 'Ambassador',
       'Kia', 'MG', 'Daewoo', 'Ashok', 'Opel', 'Peugeot'],
      dtype='object', name='brand')

In [25]:
pd.get_dummies(df['brand'].replace(repl, 'uncommon'), dtype=int).sample(5)

Unnamed: 0,BMW,Chevrolet,Ford,Honda,Hyundai,Mahindra,Maruti,Renault,Skoda,Tata,Toyota,Volkswagen,uncommon
2273,0,0,0,0,0,0,0,0,0,0,1,0,0
3724,0,0,0,0,1,0,0,0,0,0,0,0,0
6902,0,0,0,0,0,1,0,0,0,0,0,0,0
2568,0,0,0,0,0,0,1,0,0,0,0,0,0
4323,0,0,0,0,0,0,0,0,0,1,0,0,0


# ✅ Key Takeaways

- One-Hot Encoding converts categories into numbers
- Avoid dummy variable trap with k-1 encoding
- Sklearn encoder is safer for ML pipelines
- Top-K encoding helps with high cardinality

---

🚀 You now know practical One-Hot Encoding used in real ML projects!
