## `One Hot Encoding`

- One Hot Encoding (**OHE**) creates a new column for each category of a categorical feature. These columns are known as **Dummy Variables**.
- Generally, after creating the columns for all categories, we remove the first one. So if there are `n` categories, we finally get `n-1` columns.
- This is done to avoid `multicollinearity` between the input features: with all `n` columns present, each one can be derived from the others, which causes problems for `linear` models. This problem is known as the **Dummy Variable Trap**.
- When the number of categories is huge, OHE can lead to the **Curse of Dimensionality**.
- To overcome this, we keep separate columns only for the **most frequent categories** and group all the remaining categories into a single new column.
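The multicollinearity point can be checked numerically. The sketch below uses a small made-up `fuel` series (not the cars dataset) and a hypothetical helper `with_intercept`. It shows that an intercept column plus all `n` dummy columns is rank deficient, while the `n-1` version is not.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration (not taken from cars.csv)
fuel = pd.Series(["Diesel", "Petrol", "CNG", "LPG", "Diesel", "Petrol"], name="fuel")

full = pd.get_dummies(fuel, dtype=int)                      # n = 4 dummy columns
reduced = pd.get_dummies(fuel, drop_first=True, dtype=int)  # n - 1 = 3 dummy columns

def with_intercept(dummies):
    """Hypothetical helper: prepend a column of ones, as a linear model would."""
    return np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])

X_full = with_intercept(full)
X_reduced = with_intercept(reduced)

# The full dummy columns sum to the intercept column, so one column is redundant
rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print("columns:", X_full.shape[1], "rank:", rank_full)
print("columns:", X_reduced.shape[1], "rank:", rank_reduced)
```

When the rank is lower than the number of columns, a linear model without regularization has no unique solution for its coefficients.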

In [1]:
# importing the libraries

import pandas as pd
import numpy as np
import seaborn as sns
sns.set()
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [2]:
# importing the dataset

df = pd.read_csv('datasets/cars.csv')
df.head()

Unnamed: 0,brand,km_driven,fuel,owner,selling_price
0,Maruti,145500,Diesel,First Owner,450000
1,Skoda,120000,Diesel,Second Owner,370000
2,Honda,140000,Petrol,Third Owner,158000
3,Hyundai,127000,Diesel,First Owner,225000
4,Maruti,120000,Petrol,First Owner,130000


In [3]:
# Checking number of different brands we have

df['brand'].nunique()

32

In [4]:
# Checking fuel types and their number of appearances

df['fuel'].value_counts()

Diesel    4402
Petrol    3631
CNG         57
LPG         38
Name: fuel, dtype: int64

In [5]:
# Checking owners categories

df['owner'].value_counts()

First Owner             5289
Second Owner            2105
Third Owner              555
Fourth & Above Owner     174
Test Drive Car             5
Name: owner, dtype: int64

In [6]:
df.shape

(8128, 5)

### 1. OneHotEncoding using Pandas

- We can do **OHE** using the pandas function `get_dummies()`.
- We pass the DataFrame and the list of columns on which we want to perform **One Hot Encoding**.
- Here we encode only two columns, `fuel` and `owner`, because `brand` has 32 categories.

In [7]:
pd.get_dummies(df, columns=['fuel','owner'])

Unnamed: 0,brand,km_driven,selling_price,fuel_CNG,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_First Owner,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,0,1,0,0,1,0,0,0,0
1,Skoda,120000,370000,0,1,0,0,0,0,1,0,0
2,Honda,140000,158000,0,0,0,1,0,0,0,0,1
3,Hyundai,127000,225000,0,1,0,0,1,0,0,0,0
4,Maruti,120000,130000,0,0,0,1,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,0,0,0,1,1,0,0,0,0
8124,Hyundai,119000,135000,0,1,0,0,0,1,0,0,0
8125,Maruti,120000,382000,0,1,0,0,1,0,0,0,0
8126,Tata,25000,290000,0,1,0,0,1,0,0,0,0


**Notes**

- Now we have 12 columns in place of 5, as `fuel` is replaced by 4 columns and `owner` by 5 columns.
- The column names follow the `columnName_categoryName` pattern.

### 2. n-1 OneHotEncoding

- Here we will use the `drop_first` parameter to create `n-1` columns for `n` categories.

In [8]:
pd.get_dummies(df,columns=['fuel','owner'], drop_first=True)

Unnamed: 0,brand,km_driven,selling_price,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,1,0,0,0,0,0,0
1,Skoda,120000,370000,1,0,0,0,1,0,0
2,Honda,140000,158000,0,0,1,0,0,0,1
3,Hyundai,127000,225000,1,0,0,0,0,0,0
4,Maruti,120000,130000,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,0,0,1,0,0,0,0
8124,Hyundai,119000,135000,1,0,0,1,0,0,0
8125,Maruti,120000,382000,1,0,0,0,0,0,0
8126,Tata,25000,290000,1,0,0,0,0,0,0


**Notes**

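- Now it has 10 columns, as one `column_category` column is removed for each encoded column.
- Using `get_dummies()` is not good practice in ML projects, because pandas doesn't remember which categories it encoded or their column positions, so the same encoding cannot be reliably reapplied to new data.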
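A small sketch of that limitation, using made-up `train` and `test` frames rather than the cars dataset: encoding each frame separately yields different columns whenever a category is missing from one of them. Reindexing against the training columns is one possible workaround, shown here only as an illustration.

```python
import pandas as pd

# Toy frames, made up for illustration
train = pd.DataFrame({"fuel": ["Diesel", "Petrol", "CNG", "LPG"]})
test = pd.DataFrame({"fuel": ["Petrol", "Diesel"]})  # CNG and LPG never appear

train_enc = pd.get_dummies(train, columns=["fuel"], dtype=int)
test_enc = pd.get_dummies(test, columns=["fuel"], dtype=int)

# Each call only sees its own data, so the column sets differ
print(list(train_enc.columns))
print(list(test_enc.columns))

# Possible workaround: force the test frame onto the training columns,
# filling categories that did not appear with 0
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
```

An encoder that is fitted once and then reused, such as the scikit-learn one in the next section, avoids this manual alignment.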

### 3. OneHotEncoding using Sklearn

#### Doing train test split

In [9]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(df.iloc[:,0:4],
                                                 df.iloc[:,-1],
                                                 test_size=0.2,
                                                 random_state=42)
X_train.shape, X_test.shape

((6502, 4), (1626, 4))

In [10]:
X_train.head()

Unnamed: 0,brand,km_driven,fuel,owner
6518,Tata,2560,Petrol,First Owner
6144,Honda,80000,Petrol,Second Owner
6381,Hyundai,150000,Diesel,Fourth & Above Owner
438,Maruti,120000,Diesel,Second Owner
5939,Maruti,25000,Petrol,First Owner


#### Now doing the `One Hot Encoding`

- We are not going to apply **OHE** to all the columns.
- So we first need to separate the columns to be encoded, and afterwards attach the encoded columns back to the rest of the data.
- This is tedious; the usual way to avoid it is to use **Column Transformer**.
- But here we will do it manually.
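For reference, the Column Transformer route mentioned above might look roughly like the sketch below. It uses a small made-up frame `X` with only `km_driven`, `fuel` and `owner`, and `sparse_threshold=0` is chosen here so the result is always a dense array. The notebook itself continues with the manual approach.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame, made up for illustration (not taken from cars.csv)
X = pd.DataFrame({
    "km_driven": [145500, 80000, 25000, 120000],
    "fuel": ["Diesel", "Petrol", "Diesel", "CNG"],
    "owner": ["First Owner", "Second Owner", "First Owner", "Third Owner"],
})

ct = ColumnTransformer(
    transformers=[
        # Encode only the categorical columns, dropping the first category of each
        ("ohe", OneHotEncoder(drop="first"), ["fuel", "owner"]),
    ],
    remainder="passthrough",  # keep km_driven unchanged, appended after the dummies
    sparse_threshold=0,       # always return a dense NumPy array
)

encoded = ct.fit_transform(X)
print(encoded.shape)
```

The encoded columns come first (2 for `fuel`, 2 for `owner`), followed by the passthrough column `km_driven`.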

In [11]:
from sklearn.preprocessing import OneHotEncoder

In [12]:
# Creating the OHE object
# drop='first' drops the first category of each feature
# sparse=False returns a dense NumPy array instead of a sparse matrix
# (scikit-learn 1.2+ renames this parameter to sparse_output)
# dtype=np.int32 controls the datatype, so all the encoded values are integers

ohe = OneHotEncoder(drop='first', sparse=False, dtype=np.int32)

In [13]:
# Doing fit transform with the training data and applying OHE on only columns 'fuel' and 'owner'

X_train_new = ohe.fit_transform(X_train[['fuel','owner']])

In [14]:
# Now transforming the test data

X_test_new = ohe.transform(X_test[['fuel','owner']])

In [15]:
X_train_new.shape

(6502, 7)
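The encoded array contains only the dummy columns, so the remaining columns still have to be attached to it. A minimal sketch of that manual step is shown below on small made-up frames that reuse the notebook's names `X_train` and `X_test`. It calls `.toarray()` on the default sparse output instead of passing `sparse=False`, an assumption made here so the snippet does not depend on the scikit-learn version.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frames, made up for illustration (same column layout as the notebook's X_train)
X_train = pd.DataFrame({
    "brand": ["Maruti", "Honda", "Tata", "Maruti"],
    "km_driven": [145500, 80000, 25000, 120000],
    "fuel": ["Diesel", "Petrol", "Diesel", "CNG"],
    "owner": ["First Owner", "Second Owner", "First Owner", "Third Owner"],
})
X_test = pd.DataFrame({
    "brand": ["Honda", "Tata"],
    "km_driven": [60000, 90000],
    "fuel": ["Petrol", "CNG"],
    "owner": ["Third Owner", "First Owner"],
})

ohe = OneHotEncoder(drop="first", dtype=np.int32)

# Fit on the training data only, then reuse the fitted encoder on the test data
train_cat = ohe.fit_transform(X_train[["fuel", "owner"]]).toarray()
test_cat = ohe.transform(X_test[["fuel", "owner"]]).toarray()

# Attach the untouched columns to the encoded ones, side by side
X_train_final = np.hstack((X_train[["brand", "km_driven"]].to_numpy(), train_cat))
X_test_final = np.hstack((X_test[["brand", "km_driven"]].to_numpy(), test_cat))

print(X_train_final.shape, X_test_final.shape)
```

Because `brand` is still a string column, the stacked result is an object array; `brand` would need its own encoding before it can be fed to a model.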

### 4. OneHotEncoding with Top Categories

In [16]:
# Checking how many cars in each category in 'brand' column

counts = df['brand'].value_counts()
counts

Maruti           2448
Hyundai          1415
Mahindra          772
Tata              734
Toyota            488
Honda             467
Ford              397
Chevrolet         230
Renault           228
Volkswagen        186
BMW               120
Skoda             105
Nissan             81
Jaguar             71
Volvo              67
Datsun             65
Mercedes-Benz      54
Fiat               47
Audi               40
Lexus              34
Jeep               31
Mitsubishi         14
Force               6
Land                6
Isuzu               5
Kia                 4
Ambassador          4
Daewoo              3
MG                  3
Ashok               1
Opel                1
Peugeot             1
Name: brand, dtype: int64

In [17]:
# Setting a frequency threshold of 100 cars per brand

threshold = 100

In [18]:
# Selecting the brands that have at most `threshold` cars

repl = counts[counts <= threshold].index
repl

Index(['Nissan', 'Jaguar', 'Volvo', 'Datsun', 'Mercedes-Benz', 'Fiat', 'Audi',
       'Lexus', 'Jeep', 'Mitsubishi', 'Force', 'Land', 'Isuzu', 'Kia',
       'Ambassador', 'Daewoo', 'MG', 'Ashok', 'Opel', 'Peugeot'],
      dtype='object')

In [19]:
# Replacing those rare brands with a single 'uncommon' category, then creating the dummy columns

pd.get_dummies(df['brand'].replace(repl, 'uncommon')).sample(5)

Unnamed: 0,BMW,Chevrolet,Ford,Honda,Hyundai,Mahindra,Maruti,Renault,Skoda,Tata,Toyota,Volkswagen,uncommon
2885,0,0,0,0,0,0,1,0,0,0,0,0,0
5188,1,0,0,0,0,0,0,0,0,0,0,0,0
5448,0,0,0,0,0,0,1,0,0,0,0,0,0
4913,0,0,0,0,0,0,0,0,0,1,0,0,0
1955,0,0,0,0,0,0,1,0,0,0,0,0,0


**Notes:**

- So here, instead of 32 columns, we have 13 columns: 12 for the top brands and one `uncommon` column for all other categories.
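When the data is split into train and test sets, the list of frequent brands would normally be learned from the training data only and then applied to both. The sketch below shows one way to do that on made-up series; `group_rare` is a hypothetical helper and the threshold of 2 is chosen only to fit the toy data.

```python
import pandas as pd

# Toy data, made up for illustration
train_brand = pd.Series(["Maruti"] * 5 + ["Hyundai"] * 3 + ["Kia", "Opel"], name="brand")
test_brand = pd.Series(["Maruti", "Kia", "Peugeot"], name="brand")

threshold = 2  # hypothetical value, small enough for the toy data

# Learn the frequent brands from the training data only
counts = train_brand.value_counts()
frequent = counts[counts > threshold].index

def group_rare(s, frequent, other="uncommon"):
    """Hypothetical helper: keep frequent categories, replace the rest with `other`."""
    return s.where(s.isin(frequent), other)

train_grouped = group_rare(train_brand, frequent)
test_grouped = group_rare(test_brand, frequent)

print(test_grouped.tolist())
```

Brands that never appeared in the training data, like `Peugeot` here, also end up in `uncommon`.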

### Binary Encoding

- As there are too many categories in the `brand` column, we can also use **Binary Encoding** to encode them.
- For this we first need to install the **category_encoders** library using `!pip install category_encoders`.
- Then we import the **BinaryEncoder** class from it.

In [20]:
!pip install category_encoders



In [21]:
# importing the library

from category_encoders.binary import BinaryEncoder

In [22]:
total_categories = len(df['brand'].unique())
print(f"Total number of categories in the column brand is: {total_categories}")

Total number of categories in the column brand is: 32


In [23]:
# Doing with One hot encoding

pd.get_dummies(df, columns=['brand'], drop_first=True)

Unnamed: 0,km_driven,fuel,owner,selling_price,brand_Ashok,brand_Audi,brand_BMW,brand_Chevrolet,brand_Daewoo,brand_Datsun,...,brand_Mitsubishi,brand_Nissan,brand_Opel,brand_Peugeot,brand_Renault,brand_Skoda,brand_Tata,brand_Toyota,brand_Volkswagen,brand_Volvo
0,145500,Diesel,First Owner,450000,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,120000,Diesel,Second Owner,370000,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2,140000,Petrol,Third Owner,158000,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,127000,Diesel,First Owner,225000,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,120000,Petrol,First Owner,130000,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8123,110000,Petrol,First Owner,320000,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8124,119000,Diesel,Fourth & Above Owner,135000,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8125,120000,Diesel,First Owner,382000,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8126,25000,Diesel,First Owner,290000,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0


In [24]:
# Now doing with Binary encoder

be = BinaryEncoder()

In [25]:
# Doing fit transform on the 'brand' column of the full dataset

transformed_df = be.fit_transform(df['brand'])
transformed_df

Unnamed: 0,brand_0,brand_1,brand_2,brand_3,brand_4,brand_5
0,0,0,0,0,0,1
1,0,0,0,0,1,0
2,0,0,0,0,1,1
3,0,0,0,1,0,0
4,0,0,0,0,0,1
...,...,...,...,...,...,...
8123,0,0,0,1,0,0
8124,0,0,0,1,0,0
8125,0,0,0,0,0,1
8126,0,0,1,0,0,1


**Notes:**
- Now we get only 6 columns for `brand`, instead of the 31 dummy columns that **OHE** with `drop_first=True` creates (35 columns in the full DataFrame).
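To see where the 6 columns come from, the idea behind binary encoding can be sketched by hand: give each category an integer code, then write that code in binary, one column per bit. The sketch below is an illustration in plain pandas and NumPy, not the `category_encoders` implementation. It assumes the codes start at 1 in order of first appearance, which appears consistent with the output above.

```python
import numpy as np
import pandas as pd

# First five brands from the dataset preview
brand = pd.Series(["Maruti", "Skoda", "Honda", "Hyundai", "Maruti"], name="brand")

# Integer code per category, in order of first appearance, starting at 1 (assumption)
codes = pd.factorize(brand)[0] + 1

# Number of bits needed to write the largest code in binary
n_bits = int(codes.max()).bit_length()

# Extract each bit, most significant first, into its own column
shifts = np.arange(n_bits - 1, -1, -1)
bits = (codes[:, None] >> shifts) & 1

binary_df = pd.DataFrame(bits, columns=[f"brand_{i}" for i in range(n_bits)])
print(binary_df)

# With 32 categories the largest code is 32, which needs 6 bits
print((32).bit_length())
```

The toy series has only 4 distinct brands, so it needs 3 bit columns; with all 32 brands the same logic gives 6.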