- Handle nominal categorical data using one-hot encoding (OHE).  
One-hot encoding transforms a categorical variable (a column of label values with no inherent order) into a binary matrix of 0s and 1s. Each unique category becomes a separate column, and each row gets a 1 in the column for the category it belongs to and a 0 in all the other columns.
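
As a minimal sketch of this idea, assuming a small made-up `fuel` column (the values below are invented for illustration and are not read from the dataset), each unique category becomes its own indicator column:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({"fuel": ["Diesel", "Petrol", "CNG", "Petrol"]})

# dtype=int gives 0/1 columns instead of the default True/False
encoded = pd.get_dummies(toy, columns=["fuel"], dtype=int)
print(encoded)
```

Every row of `encoded` contains exactly one 1, in the column of the category that row belongs to.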

- Dummy Variable Trap.  
The new columns created from a categorical input column are called dummy variables.  
Creating a dummy variable for every category introduces multicollinearity, and this problem is known as the dummy variable trap.  
Multicollinearity refers to a situation in regression analysis where two or more independent variables (predictors) in a model are highly correlated with each other.  
With k categories, the k dummy columns always sum to 1, so any one of them can be computed exactly from the others. After one-hot encoding, one column is therefore generally dropped, leaving k-1 columns. This matters most for models such as Linear Regression (LR), which assume that the input columns are not linearly dependent on each other.  
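
The dependence can be checked numerically. The following is a rough sketch on invented toy data, under the assumption that the model includes an intercept (a column of ones): it compares the rank of the design matrix with and without dropping one dummy column.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({"fuel": ["Diesel", "Petrol", "CNG", "Petrol", "Diesel"]})

full = pd.get_dummies(toy["fuel"], dtype=float)                      # k columns
reduced = pd.get_dummies(toy["fuel"], drop_first=True, dtype=float)  # k-1 columns

# Prepend an intercept column of ones, as a linear model with a bias term would
intercept = np.ones((len(toy), 1))
X_full = np.hstack([intercept, full.to_numpy()])
X_reduced = np.hstack([intercept, reduced.to_numpy()])

# The k dummy columns sum to 1 in every row, duplicating the intercept column
print(full.sum(axis=1).tolist())

# A rank below the number of columns means the columns are linearly dependent
rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)
print(X_full.shape[1], rank_full)        # 4 columns, rank 3
print(X_reduced.shape[1], rank_reduced)  # 3 columns, rank 3
```

With all k dummy columns the matrix is rank-deficient. After dropping one column, the rank equals the number of columns.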

- OHE using most frequent variable.    
If a column has too many categories, one-hot encoding creates a separate column for each category, which increases the dimensionality of the data and slows down processing. To handle this, we can keep the most frequent categories and group the remaining ones into a single 'Other' category that gets one column of its own. This reduces the dimensionality of the dataset.

In [65]:
import numpy as np 
import pandas as pd 

In [66]:
df = pd.read_csv('./DataSets/27_cars.csv')
df.head()

Unnamed: 0,brand,km_driven,fuel,owner,selling_price
0,Maruti,145500,Diesel,First Owner,450000
1,Skoda,120000,Diesel,Second Owner,370000
2,Honda,140000,Petrol,Third Owner,158000
3,Hyundai,127000,Diesel,First Owner,225000
4,Maruti,120000,Petrol,First Owner,130000


In [67]:
df['brand'].value_counts()

brand
Maruti           2448
Hyundai          1415
Mahindra          772
Tata              734
Toyota            488
Honda             467
Ford              397
Chevrolet         230
Renault           228
Volkswagen        186
BMW               120
Skoda             105
Nissan             81
Jaguar             71
Volvo              67
Datsun             65
Mercedes-Benz      54
Fiat               47
Audi               40
Lexus              34
Jeep               31
Mitsubishi         14
Land                6
Force               6
Isuzu               5
Ambassador          4
Kia                 4
MG                  3
Daewoo              3
Ashok               1
Opel                1
Peugeot             1
Name: count, dtype: int64

In [68]:
df['brand'].nunique()

32

In [69]:
# df['fuel'].value_counts()
df['owner'].value_counts()

owner
First Owner             5289
Second Owner            2105
Third Owner              555
Fourth & Above Owner     174
Test Drive Car             5
Name: count, dtype: int64

1. OneHotEncoding using pandas  
We are applying one-hot encoding on just two columns.   

In [70]:
pd.get_dummies(df, columns=['fuel', 'owner'])

Unnamed: 0,brand,km_driven,selling_price,fuel_CNG,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_First Owner,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,False,True,False,False,True,False,False,False,False
1,Skoda,120000,370000,False,True,False,False,False,False,True,False,False
2,Honda,140000,158000,False,False,False,True,False,False,False,False,True
3,Hyundai,127000,225000,False,True,False,False,True,False,False,False,False
4,Maruti,120000,130000,False,False,False,True,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,False,False,False,True,True,False,False,False,False
8124,Hyundai,119000,135000,False,True,False,False,False,True,False,False,False
8125,Maruti,120000,382000,False,True,False,False,True,False,False,False,False
8126,Tata,25000,290000,False,True,False,False,True,False,False,False,False


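2. k-1 OneHotEncoding  
With `drop_first=True`, the first category column of each encoded feature is dropped, leaving k-1 columns for k categories.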

In [71]:
pd.get_dummies(df, columns=['fuel', 'owner'], drop_first=True)

Unnamed: 0,brand,km_driven,selling_price,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,True,False,False,False,False,False,False
1,Skoda,120000,370000,True,False,False,False,True,False,False
2,Honda,140000,158000,False,False,True,False,False,False,True
3,Hyundai,127000,225000,True,False,False,False,False,False,False
4,Maruti,120000,130000,False,False,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,False,False,True,False,False,False,False
8124,Hyundai,119000,135000,True,False,False,True,False,False,False
8125,Maruti,120000,382000,True,False,False,False,False,False,False
8126,Tata,25000,290000,True,False,False,False,False,False,False


3. OneHotEncoding using sklearn

In [72]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:,0:4], df.iloc[:,-1], test_size=0.2, random_state=0)
X_train.head()

Unnamed: 0,brand,km_driven,fuel,owner
3042,Hyundai,60000,LPG,First Owner
1520,Tata,150000,Diesel,Third Owner
2611,Hyundai,110000,Diesel,Second Owner
3544,Mahindra,28000,Diesel,Second Owner
4138,Maruti,15000,Petrol,First Owner


In [73]:
from sklearn.preprocessing import OneHotEncoder

In [74]:
ohe = OneHotEncoder(drop='first')

In [75]:
# fit_transform returns a sparse matrix by default; .toarray() converts it to a dense NumPy array
X_train_new = ohe.fit_transform(X_train[['fuel', 'owner']]).toarray()
X_train_new

array([[0., 1., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 1.],
       [1., 0., 0., ..., 1., 0., 0.],
       ...,
       [0., 0., 1., ..., 1., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.]])

In [76]:
# Reuse the encoder fitted on the training data; do not refit on the test set
X_test_new = ohe.transform(X_test[['fuel', 'owner']]).toarray()
X_test_new

array([[1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [1., 0., 0., ..., 1., 0., 0.]])

In [78]:
X_train_new.shape

(6502, 7)

In [77]:
np.hstack((X_train[['brand', 'km_driven']].values, X_train_new)).shape

(6502, 9)

4. OneHotEncoding with Top categories  
The brand column has many categories (32), so we group the rare ones before encoding it.  
Threshold: brands with 100 or fewer cars are combined into a single 'uncommon' category.  

In [81]:
counts = df['brand'].value_counts()
counts

brand
Maruti           2448
Hyundai          1415
Mahindra          772
Tata              734
Toyota            488
Honda             467
Ford              397
Chevrolet         230
Renault           228
Volkswagen        186
BMW               120
Skoda             105
Nissan             81
Jaguar             71
Volvo              67
Datsun             65
Mercedes-Benz      54
Fiat               47
Audi               40
Lexus              34
Jeep               31
Mitsubishi         14
Land                6
Force               6
Isuzu               5
Ambassador          4
Kia                 4
MG                  3
Daewoo              3
Ashok               1
Opel                1
Peugeot             1
Name: count, dtype: int64

In [83]:
df['brand'].unique()
threshold = 100  # brands with this many cars or fewer are treated as rare

In [85]:
repl = counts[counts <= threshold].index
repl

Index(['Nissan', 'Jaguar', 'Volvo', 'Datsun', 'Mercedes-Benz', 'Fiat', 'Audi',
       'Lexus', 'Jeep', 'Mitsubishi', 'Land', 'Force', 'Isuzu', 'Ambassador',
       'Kia', 'MG', 'Daewoo', 'Ashok', 'Opel', 'Peugeot'],
      dtype='object', name='brand')

In [87]:
pd.get_dummies(df['brand'].replace(repl, 'uncommon')).sample(5)

Unnamed: 0,BMW,Chevrolet,Ford,Honda,Hyundai,Mahindra,Maruti,Renault,Skoda,Tata,Toyota,Volkswagen,uncommon
4415,False,False,False,False,False,False,False,False,False,False,False,False,True
2532,False,False,False,False,False,False,False,False,False,False,False,False,True
5714,False,False,False,False,False,False,True,False,False,False,False,False,False
7129,False,False,False,False,False,False,False,False,False,False,False,False,True
600,False,False,False,False,False,False,False,False,False,False,True,False,False
