# Dummy Variables and One Hot Encoding

In [1]:
#Importing Library
import pandas as pd
import numpy as np

In [2]:
#Loading CarPrices Dataset
data=pd.read_csv('carprices.csv')

In [3]:
data.head()

Unnamed: 0,Car Model,Mileage,Sell Price($),Age(yrs)
0,BMW X5,69000,18000,6
1,BMW X5,35000,34000,3
2,BMW X5,57000,26100,5
3,BMW X5,22500,40000,2
4,BMW X5,46000,31500,4


In [4]:
#Count the rows in each category of the categorical column, Car Model
data['Car Model'].value_counts()

BMW X5                   5
Audi A5                  4
Mercedez Benz C class    4
Name: Car Model, dtype: int64

### Using Pandas' get_dummies() to handle Categorical Features (Nominal)

<b>Nominal Categorical Feature</b>: A categorical variable with two or more categories that have no intrinsic ordering. For example, gender is a categorical variable with two categories (male and female), and neither category ranks above the other.

<b>Ordinal Categorical Feature</b>: A categorical variable with two or more categories that have a natural order. Examples include socioeconomic status ("low income", "middle income", "high income"), education level ("high school", "BS", "MS", "PhD"), and income level ("less than 50K", "50K-100K", "over 100K").
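
The difference matters for encoding: an ordinal feature can keep its ranking as integer codes, while a nominal feature should not be given one. A minimal sketch of the ordinal case, assuming a made-up `education` column (the values are illustrative and not part of carprices.csv):

```python
import pandas as pd

# Hypothetical education levels, listed from lowest to highest.
levels = ["high school", "BS", "MS", "PhD"]
education = pd.Series(["BS", "PhD", "high school", "MS", "BS"])

# An ordered categorical records the ranking, so the integer codes
# follow the order of `levels` rather than alphabetical order.
ordered = pd.Categorical(education, categories=levels, ordered=True)
codes = pd.Series(ordered.codes, index=education.index)

print(codes.tolist())  # [1, 3, 0, 2, 1]
```

For a nominal column such as Car Model, codes like these would imply an order that does not exist, which is why dummy variables are used instead.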

In [5]:
#Applying get_dummies to our Categorical Feature to create Dummy Variables
pd.get_dummies(data['Car Model'])

Unnamed: 0,Audi A5,BMW X5,Mercedez Benz C class
0,0,1,0
1,0,1,0
2,0,1,0
3,0,1,0
4,0,1,0
5,1,0,0
6,1,0,0
7,1,0,0
8,1,0,0
9,0,0,1


<b>Dummy Variables</b>: If a categorical attribute has n categories, n new binary attributes are created, one per category. These new attributes are called dummy variables. In our case, 3 dummy variables are created.

<b>Dummy Variable Trap</b>: The dummy variable trap is perfect multicollinearity among the dummy variables: any one of them can be predicted exactly from the others. Because each row belongs to exactly one category, the n dummy columns created by get_dummies or One Hot Encoding always sum to 1, so each column equals 1 minus the sum of the rest. Using all n dummy variables in a regression model that also has an intercept therefore leads to the dummy variable trap, and the coefficients are no longer uniquely determined. To avoid it, drop one dummy variable and fit the regression model on the remaining n - 1.
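
The trap can be checked numerically. The sketch below uses a small made-up column (not the full carprices.csv data) and a helper, `with_intercept`, that is hypothetical and only for illustration. It compares the rank of the design matrix with and without the first dummy column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, not the carprices.csv data.
cars = pd.Series(["BMW X5", "Audi A5", "Mercedez Benz C class", "BMW X5", "Audi A5"])

full = pd.get_dummies(cars).astype(int)                      # 3 dummy columns
reduced = pd.get_dummies(cars, drop_first=True).astype(int)  # 2 dummy columns

# Every row of the full dummy set sums to 1, duplicating the intercept column.
row_sums = full.sum(axis=1)

def with_intercept(dummies):
    """Prepend a column of ones, as a regression with an intercept would."""
    ones = np.ones((len(dummies), 1))
    return np.hstack([ones, dummies.to_numpy(dtype=float)])

rank_full = np.linalg.matrix_rank(with_intercept(full))        # matrix has 4 columns
rank_reduced = np.linalg.matrix_rank(with_intercept(reduced))  # matrix has 3 columns
print(rank_full, rank_reduced)
```

Both ranks come out as 3. The full matrix has 4 columns but only rank 3, so one column is redundant, while the reduced matrix has 3 columns and full rank.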

In [6]:
#To exclude one dummy variable, use the drop_first parameter
model=pd.get_dummies(data['Car Model'],drop_first=True)
model

Unnamed: 0,BMW X5,Mercedez Benz C class
0,1,0
1,1,0
2,1,0
3,1,0
4,1,0
5,0,0
6,0,0
7,0,0
8,0,0
9,0,1


In [7]:
#Concatenating the original data with the dummy columns
new_data=pd.concat([data,model],axis=1)
new_data

Unnamed: 0,Car Model,Mileage,Sell Price($),Age(yrs),BMW X5,Mercedez Benz C class
0,BMW X5,69000,18000,6,1,0
1,BMW X5,35000,34000,3,1,0
2,BMW X5,57000,26100,5,1,0
3,BMW X5,22500,40000,2,1,0
4,BMW X5,46000,31500,4,1,0
5,Audi A5,59000,29400,5,0,0
6,Audi A5,52000,32000,5,0,0
7,Audi A5,72000,19300,6,0,0
8,Audi A5,91000,12000,8,0,0
9,Mercedez Benz C class,67000,22000,6,0,1


In [8]:
#Removing original Categorical Feature
new_data.drop('Car Model',axis=1,inplace=True)

In [9]:
#Data after handling Categorical Feature with get_dummies()
new_data

Unnamed: 0,Mileage,Sell Price($),Age(yrs),BMW X5,Mercedez Benz C class
0,69000,18000,6,1,0
1,35000,34000,3,1,0
2,57000,26100,5,1,0
3,22500,40000,2,1,0
4,46000,31500,4,1,0
5,59000,29400,5,0,0
6,52000,32000,5,0,0
7,72000,19300,6,0,0
8,91000,12000,8,0,0
9,67000,22000,6,0,1
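
The encode, concatenate, and drop steps above can also be collapsed into a single call by passing the whole DataFrame to `get_dummies` with the `columns` argument. This is a sketch on a three-row toy frame (rows taken from the outputs above), not a replacement for the cells in this notebook. Note that the generated column names get a `Car Model_` prefix by default:

```python
import pandas as pd

# Toy frame with the same column names as carprices.csv (three sample rows).
toy = pd.DataFrame({
    "Car Model": ["BMW X5", "Audi A5", "Mercedez Benz C class"],
    "Mileage": [69000, 59000, 67000],
    "Sell Price($)": [18000, 29400, 22000],
    "Age(yrs)": [6, 5, 6],
})

# Encode "Car Model", drop the first dummy, and remove the original column in one step.
encoded = pd.get_dummies(toy, columns=["Car Model"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
```

The result keeps Mileage, Sell Price($), and Age(yrs) and appends the two remaining dummy columns. Audi A5 becomes the baseline category because it sorts first.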


### Using One-Hot Encoding to handle Categorical Features (Nominal)

In [10]:
#Older scikit-learn releases (before 0.20) required OneHotEncoder input to be numeric, so string categories are label encoded first. Newer releases also accept string columns directly.
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()

In [11]:
data['model']=le.fit_transform(data['Car Model'])
data['model']

0     1
1     1
2     1
3     1
4     1
5     0
6     0
7     0
8     0
9     2
10    2
11    2
12    2
Name: model, dtype: int32

In [12]:
#Data after Label Encoding Car Model column
data.drop('Car Model',axis=1,inplace=True)
data

Unnamed: 0,Mileage,Sell Price($),Age(yrs),model
0,69000,18000,6,1
1,35000,34000,3,1
2,57000,26100,5,1
3,22500,40000,2,1
4,46000,31500,4,1
5,59000,29400,5,0
6,52000,32000,5,0
7,72000,19300,6,0
8,91000,12000,8,0
9,67000,22000,6,2


In [13]:
#creating one hot encoder object
from sklearn.preprocessing import OneHotEncoder
ohe=OneHotEncoder(handle_unknown='ignore')

In [14]:
#applying one hot encoding on model column
ohe_data=pd.DataFrame(ohe.fit_transform(data[['model']]).toarray())
ohe_data

Unnamed: 0,0,1,2
0,0.0,1.0,0.0
1,0.0,1.0,0.0
2,0.0,1.0,0.0
3,0.0,1.0,0.0
4,0.0,1.0,0.0
5,1.0,0.0,0.0
6,1.0,0.0,0.0
7,1.0,0.0,0.0
8,1.0,0.0,0.0
9,0.0,0.0,1.0


In [15]:
#Concatenating both tables and showing resultant data after One-Hot Encoding
new_data=pd.concat([data,ohe_data],axis=1)
new_data

Unnamed: 0,Mileage,Sell Price($),Age(yrs),model,0,1,2
0,69000,18000,6,1,0.0,1.0,0.0
1,35000,34000,3,1,0.0,1.0,0.0
2,57000,26100,5,1,0.0,1.0,0.0
3,22500,40000,2,1,0.0,1.0,0.0
4,46000,31500,4,1,0.0,1.0,0.0
5,59000,29400,5,0,1.0,0.0,0.0
6,52000,32000,5,0,1.0,0.0,0.0
7,72000,19300,6,0,1.0,0.0,0.0
8,91000,12000,8,0,1.0,0.0,0.0
9,67000,22000,6,2,0.0,0.0,1.0


<b>Conclusion</b>: get_dummies() and One-Hot Encoding are generally used with non-tree-based models such as linear models, KNN, or neural networks. It is important to understand the various options for encoding categorical variables because each approach has its own pros and cons. Encoding is an important step in data science, so keep these ideas in mind when dealing with categorical variables.


<b>References</b> : 
1. https://www.geeksforgeeks.org/ml-one-hot-encoding-of-datasets-in-python/
2. https://towardsdatascience.com/categorical-encoding-using-label-encoding-and-one-hot-encoder-911ef77fb5bd