# **One-Hot Encoding**

This notebook demonstrates several ways to one-hot encode categorical variables with Pandas and Scikit-Learn, and shows how to handle high-cardinality features by grouping rare categories.


In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

## **Load the Dataset**


In [2]:
df = pd.read_csv('cars.csv')
df.sample(5)

Unnamed: 0,brand,km_driven,fuel,owner,selling_price
2649,Hyundai,50000,Petrol,Third Owner,170000
7442,Honda,30646,Petrol,First Owner,840000
2697,Maruti,25000,Petrol,First Owner,764000
8117,Maruti,50000,Diesel,First Owner,625000
3503,Hyundai,80000,Petrol,First Owner,378000


In [3]:
# Check unique values and their counts in 'owner' column
df['owner'].value_counts()

owner
First Owner             5289
Second Owner            2105
Third Owner              555
Fourth & Above Owner     174
Test Drive Car             5
Name: count, dtype: int64

## **One-Hot Encoding using Pandas**

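One-Hot Encoding replaces a categorical column with one binary indicator column per category, so that models requiring numeric input can use it. In recent Pandas versions, get_dummies returns the indicators as booleans (True/False) by default, as in the output below.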

In [4]:
# One-Hot encode the 'fuel' and 'owner' columns
pd.get_dummies(df, columns=['fuel', 'owner'])

Unnamed: 0,brand,km_driven,selling_price,fuel_CNG,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_First Owner,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,False,True,False,False,True,False,False,False,False
1,Skoda,120000,370000,False,True,False,False,False,False,True,False,False
2,Honda,140000,158000,False,False,False,True,False,False,False,False,True
3,Hyundai,127000,225000,False,True,False,False,True,False,False,False,False
4,Maruti,120000,130000,False,False,False,True,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,False,False,False,True,True,False,False,False,False
8124,Hyundai,119000,135000,False,True,False,False,False,True,False,False,False
8125,Maruti,120000,382000,False,True,False,False,True,False,False,False,False
8126,Tata,25000,290000,False,True,False,False,True,False,False,False,False
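
If integer 0/1 columns are preferred over the booleans shown above, `get_dummies` accepts a `dtype` argument. A minimal sketch on a small made-up frame (the rows are illustrative and not taken from cars.csv):

```python
import pandas as pd

# Toy frame standing in for the notebook's df (values are illustrative only)
toy = pd.DataFrame({
    "fuel": ["Diesel", "Petrol", "CNG", "Petrol"],
    "km_driven": [145500, 120000, 140000, 127000],
})

# dtype=int requests 0/1 integer indicators instead of the default booleans
dummies = pd.get_dummies(toy, columns=["fuel"], dtype=int)
print(dummies)
```

Each row has exactly one indicator set to 1 among the `fuel_*` columns, and the untouched `km_driven` column is carried through unchanged.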


## **K-1 One-Hot Encoding**

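For a column with K categories, the K indicator columns always sum to 1, so any one of them can be inferred from the others. K-1 encoding drops the first indicator of each encoded column, which removes this redundancy (and the multicollinearity it causes in linear models) without losing information: a row whose remaining indicators are all 0 belongs to the dropped category.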

In [5]:
# One-Hot encode with drop_first=True to avoid multicollinearity
pd.get_dummies(df, columns=['fuel', 'owner'], drop_first=True)

Unnamed: 0,brand,km_driven,selling_price,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,True,False,False,False,False,False,False
1,Skoda,120000,370000,True,False,False,False,True,False,False
2,Honda,140000,158000,False,False,True,False,False,False,True
3,Hyundai,127000,225000,True,False,False,False,False,False,False
4,Maruti,120000,130000,False,False,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,False,False,True,False,False,False,False
8124,Hyundai,119000,135000,True,False,False,True,False,False,False
8125,Maruti,120000,382000,True,False,False,False,False,False,False
8126,Tata,25000,290000,True,False,False,False,False,False,False
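
To see why dropping one column loses nothing, the following sketch compares the full and the K-1 encoding on a made-up `owner` series (the values are illustrative; the variable names are not from the notebook):

```python
import pandas as pd

owner = pd.Series(
    ["First Owner", "Second Owner", "Third Owner", "First Owner"],
    name="owner",
)

full = pd.get_dummies(owner, dtype=int)                      # K columns
reduced = pd.get_dummies(owner, drop_first=True, dtype=int)  # K-1 columns

# The dropped column is the first category in sorted order
dropped = [c for c in full.columns if c not in reduced.columns]

# A row of all zeros in the reduced encoding marks the dropped category
is_baseline = reduced.sum(axis=1) == 0

print(dropped)
print(is_baseline.tolist())
```

The all-zero rows in `reduced` line up exactly with the rows where the dropped indicator was 1 in `full`, so the dropped column can always be reconstructed.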


## **One-Hot Encoding using Scikit-Learn**

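Scikit-Learn's OneHotEncoder learns the categories from the training data in fit and applies the same mapping in transform, so the training and test sets always end up with identical columns. This makes it the usual choice inside machine learning pipelines.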

### **Step 1: Train-Test Split**

In [6]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    df.iloc[:, 0:4],   # features: brand, km_driven, fuel, owner
    df.iloc[:, -1],    # target: selling_price
    test_size=0.2,
    random_state=2
)

In [7]:
X_train.head()

Unnamed: 0,brand,km_driven,fuel,owner
5571,Hyundai,35000,Diesel,First Owner
2038,Jeep,60000,Diesel,First Owner
2957,Hyundai,25000,Petrol,First Owner
7618,Mahindra,130000,Diesel,Second Owner
6684,Hyundai,155000,Diesel,First Owner


### **Step 2: Initialize & Fit Encoder**

In [9]:
# drop='first' gives K-1 encoding; sparse_output=False returns a dense NumPy array
ohe = OneHotEncoder(drop='first', sparse_output=False, dtype=np.int32)

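In [10]:
# Learn the categories from the training set and encode it
X_train_new = ohe.fit_transform(X_train[['fuel', 'owner']])

In [11]:
# Encode the test set with the categories learned from the training set
X_test_new = ohe.transform(X_test[['fuel', 'owner']])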

In [12]:
print(f"Encoded training data shape: {X_train_new.shape}")

Encoded training data shape: (6502, 7)


### **Step 3: Combine Encoded Features with the Remaining Features**

In [13]:
# 'brand' is still an unencoded string column, so the stacked array has dtype=object
combined_data = np.hstack((X_train[['brand', 'km_driven']].values, X_train_new))
combined_data

array([['Hyundai', 35000, 1, ..., 0, 0, 0],
       ['Jeep', 60000, 1, ..., 0, 0, 0],
       ['Hyundai', 25000, 0, ..., 0, 0, 0],
       ...,
       ['Tata', 15000, 0, ..., 0, 0, 0],
       ['Maruti', 32500, 1, ..., 1, 0, 0],
       ['Isuzu', 121000, 1, ..., 0, 0, 0]], shape=(6502, 9), dtype=object)

## **One-Hot Encoding with Top Categories**

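When a column has many unique categories, one-hot encoding it creates one column per category. To limit the number of columns, we can keep the frequent categories and group every category that appears at most a threshold number of times (100 here) under a single 'uncommon' label.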

In [17]:
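# Count how often each brand occurs
counts = df['brand'].value_counts()
# Number of distinct brands before grouping
n_brands = df['brand'].nunique()
# Brands with at most `threshold` rows are treated as rare
threshold = 100
repl = counts[counts <= threshold].index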

In [25]:
# Replace the rare brands with 'uncommon', then one-hot encode the result
pd.get_dummies(df['brand'].replace(repl, 'uncommon')).sample(5)

Unnamed: 0,BMW,Chevrolet,Ford,Honda,Hyundai,Mahindra,Maruti,Renault,Skoda,Tata,Toyota,Volkswagen,uncommon
1931,False,False,False,False,True,False,False,False,False,False,False,False,False
3809,False,False,False,False,True,False,False,False,False,False,False,False,False
3118,False,False,False,False,False,False,False,False,False,True,False,False,False
5395,False,False,False,False,False,False,False,True,False,False,False,False,False
6479,False,False,False,False,False,False,False,False,False,False,False,False,True
