
# **🎓 Lecture: One Hot Encoding in Machine Learning (with Seaborn Dataset)**

# **✅ What is One Hot Encoding?**


---


One Hot Encoding is a method used to convert categorical data into a numerical format so that ML algorithms can process it.

Each category becomes a new binary column (0 or 1). In every row, exactly one of these columns is 1 and the rest are 0.
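
As a quick illustration of the definition above, here is a minimal sketch on a made-up `color` column (the data and variable names are assumptions for illustration, not part of the cars dataset). It also checks that each row contains exactly one 1.

```python
import pandas as pd

# Hypothetical toy data, not taken from cars.csv
colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# dtype=int produces 0/1 instead of the default True/False
encoded = pd.get_dummies(colors, columns=["color"], dtype=int)
print(encoded)

# Every row should have exactly one 1 across the new columns
row_sums = encoded.sum(axis=1)
print(row_sums.tolist())
```

The new columns are named `color_blue`, `color_green`, and `color_red`, one per distinct value.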

# **🎯 Why use it?**

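Most ML models can’t process text labels directly; they expect numeric input.

It avoids the false ordering that Label Encoding introduces (for example, 0 < 1 < 2 between unrelated categories).

It works best for non-ordinal (nominal) categorical features (e.g., color, city, gender).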
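
To make the false-ordering point concrete, the sketch below contrasts integer label codes with one hot columns on a made-up `city` column (the city names are assumptions chosen for illustration). Label codes are produced here with pandas categorical codes, which is one of several ways to label-encode.

```python
import pandas as pd

# Hypothetical toy data
cities = pd.Series(["Lahore", "Karachi", "Multan", "Karachi"], name="city")

# Label encoding: categories are numbered alphabetically,
# so a model may read Karachi (0) < Lahore (1) < Multan (2) as an order
label_codes = cities.astype("category").cat.codes
print(label_codes.tolist())

# One hot encoding: one independent 0/1 column per city, no implied order
one_hot = pd.get_dummies(cities, dtype=int)
print(one_hot)
```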

In [1]:
import numpy as np
import pandas as pd


In [3]:
df = pd.read_csv('/content/cars.csv')

In [4]:
df.head()

Unnamed: 0,brand,km_driven,fuel,owner,selling_price
0,Maruti,145500,Diesel,First Owner,450000
1,Skoda,120000,Diesel,Second Owner,370000
2,Honda,140000,Petrol,Third Owner,158000
3,Hyundai,127000,Diesel,First Owner,225000
4,Maruti,120000,Petrol,First Owner,130000


In [6]:
# df['brand'].value_counts()
df['brand'].nunique()

32

In [8]:
# df['fuel'].value_counts()
df['owner'].value_counts()

owner,count
First Owner,5289
Second Owner,2105
Third Owner,555
Fourth & Above Owner,174
Test Drive Car,5


# **1. One Hot Encoding using Pandas**


---



# Problem Note:
We generally use `pd.get_dummies` during data analysis, but not in machine learning projects. The reason is that pandas does not remember which columns it created: the columns depend on the categories present in the data it is given, so repeating the encoding on new data may change the number or position of the columns.
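
The note above can be reproduced with a small sketch. The two frames below are made-up stand-ins for "training" and "new" data (an assumption for illustration). Encoding them separately gives different column sets, and `reindex` is shown as one possible workaround rather than the recommended fix.

```python
import pandas as pd

# Hypothetical toy frames standing in for training data and later data
train = pd.DataFrame({"fuel": ["Diesel", "Petrol", "CNG"]})
new = pd.DataFrame({"fuel": ["Petrol", "LPG"]})  # LPG is unseen; Diesel and CNG are absent

train_enc = pd.get_dummies(train, columns=["fuel"], dtype=int)
new_enc = pd.get_dummies(new, columns=["fuel"], dtype=int)

print(list(train_enc.columns))  # three columns
print(list(new_enc.columns))    # two columns, one of them new

# Workaround: force the training layout; unseen categories become all zeros
new_aligned = new_enc.reindex(columns=train_enc.columns, fill_value=0)
print(new_aligned)
```

An encoder object that is fitted once and then reused, such as sklearn's `OneHotEncoder`, avoids this bookkeeping.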



In [9]:
pd.get_dummies(df,columns=['fuel','owner'])

Unnamed: 0,brand,km_driven,selling_price,fuel_CNG,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_First Owner,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,False,True,False,False,True,False,False,False,False
1,Skoda,120000,370000,False,True,False,False,False,False,True,False,False
2,Honda,140000,158000,False,False,False,True,False,False,False,False,True
3,Hyundai,127000,225000,False,True,False,False,True,False,False,False,False
4,Maruti,120000,130000,False,False,False,True,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,False,False,False,True,True,False,False,False,False
8124,Hyundai,119000,135000,False,True,False,False,False,True,False,False,False
8125,Maruti,120000,382000,False,True,False,False,True,False,False,False,False
8126,Tata,25000,290000,False,True,False,False,True,False,False,False,False


# **2. K-1 One Hot Encoding**

In [10]:
pd.get_dummies(df,columns=['fuel','owner'],drop_first=True)

Unnamed: 0,brand,km_driven,selling_price,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,True,False,False,False,False,False,False
1,Skoda,120000,370000,True,False,False,False,True,False,False
2,Honda,140000,158000,False,False,True,False,False,False,True
3,Hyundai,127000,225000,True,False,False,False,False,False,False
4,Maruti,120000,130000,False,False,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,False,False,True,False,False,False,False
8124,Hyundai,119000,135000,True,False,False,True,False,False,False
8125,Maruti,120000,382000,True,False,False,False,False,False,False
8126,Tata,25000,290000,True,False,False,False,False,False,False
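
In the output above, `fuel_CNG` and `owner_First Owner` are gone: `drop_first=True` removes the first category of each column, and that category is then represented by all zeros in the remaining columns. The sketch below, on a made-up four-row frame (an assumption for illustration), shows that the dropped column carries no extra information because it can be recovered from the other K-1 columns.

```python
import pandas as pd

# Hypothetical toy data with the four fuel types seen above
fuel = pd.DataFrame({"fuel": ["CNG", "Diesel", "LPG", "Petrol"]})

full = pd.get_dummies(fuel, columns=["fuel"], dtype=int)
k_minus_1 = pd.get_dummies(fuel, columns=["fuel"], drop_first=True, dtype=int)
print(k_minus_1)

# The dropped column is 1 exactly where all remaining columns are 0
recovered_cng = 1 - k_minus_1.sum(axis=1)
print(recovered_cng.tolist() == full["fuel_CNG"].tolist())
```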


# **3. One Hot Encoding using Sklearn**

In [12]:
df.head()

Unnamed: 0,brand,km_driven,fuel,owner,selling_price
0,Maruti,145500,Diesel,First Owner,450000
1,Skoda,120000,Diesel,Second Owner,370000
2,Honda,140000,Petrol,Third Owner,158000
3,Hyundai,127000,Diesel,First Owner,225000
4,Maruti,120000,Petrol,First Owner,130000


In [13]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(df.iloc[:,0:4],df.iloc[:,-1],test_size=0.2,random_state=2)


In [15]:
x_train.head()

Unnamed: 0,brand,km_driven,fuel,owner
5571,Hyundai,35000,Diesel,First Owner
2038,Jeep,60000,Diesel,First Owner
2957,Hyundai,25000,Petrol,First Owner
7618,Mahindra,130000,Diesel,Second Owner
6684,Hyundai,155000,Diesel,First Owner


In [16]:
from sklearn.preprocessing import OneHotEncoder

In [42]:
ohe = OneHotEncoder(drop='first',dtype=np.int32)

In [43]:
x_train_new = ohe.fit_transform(x_train[['fuel','owner']]).toarray()

In [44]:
# Reuse the encoder fitted on the training data; do not refit on the test set
x_test_new = ohe.transform(x_test[['fuel','owner']]).toarray()

In [45]:
x_train_new.shape

(6502, 7)

In [47]:
np.hstack((x_train[['brand','km_driven']].values,x_train_new))

array([['Hyundai', 35000, 1, ..., 0, 0, 0],
       ['Jeep', 60000, 1, ..., 0, 0, 0],
       ['Hyundai', 25000, 0, ..., 0, 0, 0],
       ...,
       ['Tata', 15000, 0, ..., 0, 0, 0],
       ['Maruti', 32500, 1, ..., 1, 0, 0],
       ['Isuzu', 121000, 1, ..., 0, 0, 0]], dtype=object)

# **4. One Hot Encoding with top Categories**

In [53]:
counts = df['brand'].value_counts()

In [55]:
df['brand'].nunique()
threshold = 100  # brands with at most this many rows are grouped as 'uncommon'

In [57]:
rep1 = counts[counts <= threshold].index

In [59]:
pd.get_dummies(df['brand'].replace(rep1, 'uncommon')).sample(5)

Unnamed: 0,BMW,Chevrolet,Ford,Honda,Hyundai,Mahindra,Maruti,Renault,Skoda,Tata,Toyota,Volkswagen,uncommon
520,False,False,False,False,False,False,True,False,False,False,False,False,False
6280,False,False,False,False,False,False,True,False,False,False,False,False,False
3603,False,False,False,False,True,False,False,False,False,False,False,False,False
1186,False,False,False,False,False,True,False,False,False,False,False,False,False
4376,False,False,False,False,False,False,False,False,False,True,False,False,False
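
The same idea can be tried on a tiny made-up series where the effect is easy to see. The brand counts and the threshold of 2 below are assumptions for illustration, and `where` with `isin` is used instead of `replace` simply as an alternative way to do the grouping.

```python
import pandas as pd

# Hypothetical toy data: two frequent brands and two rare ones
brands = pd.Series(["Maruti"] * 4 + ["Hyundai"] * 3 + ["Jeep", "Isuzu"], name="brand")

counts = brands.value_counts()
threshold = 2  # assumed cutoff for this toy example
rare = counts[counts <= threshold].index

# Keep frequent brands as they are and relabel rare ones
grouped = brands.where(~brands.isin(rare), "uncommon")
encoded = pd.get_dummies(grouped, dtype=int)
print(encoded)
```

Four distinct brands collapse into three columns, which is the point of the technique when a feature has many categories (32 brands in the cars data).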


# **Practical Part Using Seaborn tips Dataset**


---


Let’s use the popular tips dataset that contains restaurant bill data.

In [60]:
import seaborn as sns
import pandas as pd

# Load the tips dataset
df = sns.load_dataset('tips')
print(df.head())


   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4


In [61]:
# Perform One Hot Encoding on all categorical columns
df_encoded = pd.get_dummies(df, columns=['sex', 'smoker', 'day', 'time'])




In [62]:
# View the new dataframe
print(df_encoded.head())

   total_bill   tip  size  sex_Male  sex_Female  smoker_Yes  smoker_No  \
0       16.99  1.01     2     False        True       False       True   
1       10.34  1.66     3      True       False       False       True   
2       21.01  3.50     3      True       False       False       True   
3       23.68  3.31     2      True       False       False       True   
4       24.59  3.61     4     False        True       False       True   

   day_Thur  day_Fri  day_Sat  day_Sun  time_Lunch  time_Dinner  
0     False    False    False     True       False         True  
1     False    False    False     True       False         True  
2     False    False    False     True       False         True  
3     False    False    False     True       False         True  
4     False    False    False     True       False         True  
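
Note that the columns above are not in alphabetical order (`sex_Male` comes before `sex_Female`, `day_Thur` before `day_Fri`). This suggests the tips columns are stored as pandas categoricals, in which case `get_dummies` follows the declared category order. The sketch below shows that behavior on a made-up categorical series (an assumption for illustration, not loaded from seaborn).

```python
import pandas as pd

# Hypothetical categorical with a declared order and two unused categories
day = pd.Series(
    pd.Categorical(["Sun", "Sat", "Sun"], categories=["Thur", "Fri", "Sat", "Sun"]),
    name="day",
)

encoded = pd.get_dummies(day, prefix="day", dtype=int)
print(encoded)
```

Columns follow the declared order, and categories that never occur (`Thur`, `Fri`) still get an all-zero column, which keeps the layout stable across different samples.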
