
## **One Hot Encoding for Nominal Categorical Data**
- Nominal categorical data has no inherent order, for example gender, branch, or colour.

- One hot encoding suffers from the dummy variable trap: the dummy columns created for a feature always sum to 1, so any one of them can be predicted exactly from the others. This is multicollinearity, and to avoid it we drop the first dummy column of each encoded feature.

- If we do not remove this multicollinearity, linear and logistic regression become unreliable, because the coefficients are no longer uniquely determined and are hard to interpret.
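The dummy variable trap described above can be checked numerically. The sketch below uses a small made-up `fuel` column (the values are illustrative assumptions, not rows from `cars.csv`) and compares the rank of the design matrix with and without the first dummy column.

```python
import numpy as np
import pandas as pd

# Toy column; the values are illustrative, not taken from cars.csv
fuel = pd.Series(["Diesel", "Petrol", "CNG", "Petrol", "Diesel"], name="fuel")

# Full one hot encoding: one column per category
full = pd.get_dummies(fuel).astype(int)

# Every row sums to 1, so the dummies duplicate the intercept column
row_sums = full.sum(axis=1)

# Intercept + all dummies: 4 columns, but they are linearly dependent
X_full = np.column_stack([np.ones(len(full)), full.values])
rank_full = np.linalg.matrix_rank(X_full)

# Dropping the first dummy leaves 3 independent columns
reduced = pd.get_dummies(fuel, drop_first=True).astype(int)
X_reduced = np.column_stack([np.ones(len(reduced)), reduced.values])
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(row_sums.tolist())
print(X_full.shape[1], rank_full)
print(X_reduced.shape[1], rank_reduced)
```

When the rank is smaller than the number of columns, the matrix has no unique least-squares solution, which is the practical symptom of the trap.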

In [None]:
import pandas as pd
import numpy as np


In [None]:
df=pd.read_csv("cars.csv")
df.head()

Unnamed: 0,brand,km_driven,fuel,owner,selling_price
0,Maruti,145500,Diesel,First Owner,450000
1,Skoda,120000,Diesel,Second Owner,370000
2,Honda,140000,Petrol,Third Owner,158000
3,Hyundai,127000,Diesel,First Owner,225000
4,Maruti,120000,Petrol,First Owner,130000


In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test= train_test_split(df.iloc[:,0:4], df.iloc[:,-1], test_size=0.2)

In [None]:
X_train.head()

Unnamed: 0,brand,km_driven,fuel,owner
3167,Tata,35000,Petrol,First Owner
1429,Hyundai,120000,Diesel,Second Owner
5393,Maruti,102000,Diesel,First Owner
3725,Mahindra,50000,Diesel,First Owner
5416,Maruti,150000,Petrol,Third Owner


In [13]:
df.nunique()

Unnamed: 0,0
brand,32
km_driven,921
fuel,4
owner,5
selling_price,677


In [10]:
from sklearn.preprocessing import OneHotEncoder
ohe=OneHotEncoder(drop='first', sparse_output=False)
X_train_new=ohe.fit_transform(X_train[['fuel','owner']])
# reuse the categories learned from X_train; do not refit on the test set
X_test_new=ohe.transform(X_test[['fuel','owner']])

In [12]:
X_train_new.shape

(6502, 7)

In [15]:
X_train[['brand','km_driven']].values

array([['Tata', 35000],
       ['Hyundai', 120000],
       ['Maruti', 102000],
       ...,
       ['Maruti', 82050],
       ['Maruti', 69000],
       ['Ford', 108000]], dtype=object)

Now we will combine X_train_new with X_train[['brand','km_driven']].values using np.hstack. Both are NumPy arrays with the same number of rows, so they can be stacked column-wise to get the final result.

In [19]:
np.hstack((X_train[['brand','km_driven']].values,X_train_new))

array([['Tata', 35000, 0.0, ..., 0.0, 0.0, 0.0],
       ['Hyundai', 120000, 1.0, ..., 1.0, 0.0, 0.0],
       ['Maruti', 102000, 1.0, ..., 0.0, 0.0, 0.0],
       ...,
       ['Maruti', 82050, 1.0, ..., 0.0, 0.0, 0.0],
       ['Maruti', 69000, 0.0, ..., 0.0, 0.0, 0.0],
       ['Ford', 108000, 1.0, ..., 1.0, 0.0, 0.0]], dtype=object)

In [20]:
np.hstack((X_train[['brand','km_driven']].values,X_train_new)).shape

(6502, 9)

## **OneHotEncoding with Top Categories**

In [23]:
counts=df['brand'].value_counts()
type(counts)

In [25]:
counts.head()

brand,count
Maruti,2448
Hyundai,1415
Mahindra,772
Tata,734
Toyota,488


In [27]:
df['brand'].nunique()

32

In [28]:
threshold=100

In [44]:
# brands that appear at most `threshold` times are treated as uncommon
repl=counts[counts<=threshold].index
repl

Index(['Nissan', 'Jaguar', 'Volvo', 'Datsun', 'Mercedes-Benz', 'Fiat', 'Audi',
       'Lexus', 'Jeep', 'Mitsubishi', 'Land', 'Force', 'Isuzu', 'Ambassador',
       'Kia', 'MG', 'Daewoo', 'Ashok', 'Opel', 'Peugeot'],
      dtype='object', name='brand')

In [46]:
df['brand']

Unnamed: 0,brand
0,Maruti
1,Skoda
2,Honda
3,Hyundai
4,Maruti
...,...
8123,Hyundai
8124,Hyundai
8125,Maruti
8126,Tata


In [51]:
df1=pd.get_dummies(df['brand'].replace(repl, 'uncommon')).astype(int)


In [52]:
df1.head()

Unnamed: 0,BMW,Chevrolet,Ford,Honda,Hyundai,Mahindra,Maruti,Renault,Skoda,Tata,Toyota,Volkswagen,uncommon
0,0,0,0,0,0,0,1,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,1,0,0,0,0
2,0,0,0,1,0,0,0,0,0,0,0,0,0
3,0,0,0,0,1,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,1,0,0,0,0,0,0


In [54]:
df1[df1['uncommon']==1]

Unnamed: 0,BMW,Chevrolet,Ford,Honda,Hyundai,Mahindra,Maruti,Renault,Skoda,Tata,Toyota,Volkswagen,uncommon
31,0,0,0,0,0,0,0,0,0,0,0,0,1
38,0,0,0,0,0,0,0,0,0,0,0,0,1
41,0,0,0,0,0,0,0,0,0,0,0,0,1
49,0,0,0,0,0,0,0,0,0,0,0,0,1
51,0,0,0,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8072,0,0,0,0,0,0,0,0,0,0,0,0,1
8090,0,0,0,0,0,0,0,0,0,0,0,0,1
8091,0,0,0,0,0,0,0,0,0,0,0,0,1
8101,0,0,0,0,0,0,0,0,0,0,0,0,1
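The grouping above is computed on the full `df`. When the data is split, the list of frequent brands would normally be learned from the training part only and then reused on the test part. The sketch below illustrates that idea on made-up data; `group_rare`, the toy series, and the threshold of 1 are hypothetical and not part of the notebook.

```python
import pandas as pd

def group_rare(series, keep, other="uncommon"):
    """Replace every value not in `keep` with the `other` label."""
    return series.where(series.isin(keep), other)

# Toy data; brand names and counts are illustrative only
train = pd.Series(["Maruti"] * 3 + ["Hyundai"] * 2 + ["Audi"], name="brand")
test = pd.Series(["Maruti", "Kia", "Hyundai"], name="brand")

threshold = 1
counts = train.value_counts()
# Complement of the notebook's `counts <= threshold` selection
keep = counts[counts > threshold].index

train_dummies = pd.get_dummies(group_rare(train, keep)).astype(int)

# Align test columns with the training columns; missing ones become 0
test_dummies = (
    pd.get_dummies(group_rare(test, keep))
    .reindex(columns=train_dummies.columns, fill_value=0)
    .astype(int)
)
print(train_dummies.columns.tolist())
print(test_dummies)
```

A brand that never appeared in training (here `Kia`) lands in the `uncommon` column instead of producing a new column the model has never seen.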
