# Lecture 8: Categorical Variable Encoding

Instructor: Md Shahidullah Kawsar
<br>Data Scientist, IDARE, Houston, TX, USA

#### Objectives:
- Dealing with categorical variables
- Label encoding
- One-hot encoding
- Creating a categorical variable from a numeric variable

#### References:
<br>[1] One-Hot Encoding vs. Label Encoding using Scikit-Learn: https://www.analyticsvidhya.com/blog/2020/03/one-hot-encoding-vs-label-encoding-using-scikit-learn/
<br>[2] Label Encoding: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
<br>[3] One-hot encoding: https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html
<br>[4] Binning numeric values with pandas.cut: https://pandas.pydata.org/docs/reference/api/pandas.cut.html

In [51]:
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder

sns.set_context("talk")

#### Load data

In [52]:
df = pd.read_csv("bmw.csv")

display(df.head(10))

,model,year,price,transmission,mileage,fuelType,tax,mpg,engineSize
0,5 Series,2014,11200,Automatic,67068,Diesel,125,57.6,2.0
1,6 Series,2018,27000,Automatic,14827,Petrol,145,42.8,2.0
2,5 Series,2016,16000,Automatic,62794,Diesel,160,51.4,3.0
3,1 Series,2017,12750,Automatic,26676,Diesel,145,72.4,1.5
4,7 Series,2014,14500,Automatic,39554,Diesel,160,50.4,3.0
5,5 Series,2016,14900,Automatic,35309,Diesel,125,60.1,2.0
6,5 Series,2017,16000,Automatic,38538,Diesel,125,60.1,2.0
7,2 Series,2018,16250,Manual,10401,Petrol,145,52.3,1.5
8,4 Series,2017,14250,Manual,42668,Diesel,30,62.8,2.0
9,5 Series,2016,14250,Automatic,36099,Diesel,20,68.9,2.0


#### Dealing with categorical variables

In [53]:
print(df["model"].unique())
print(len(df["model"].unique()))

print(df["transmission"].unique())
print(len(df["transmission"].unique()))

print(df["fuelType"].unique())
print(len(df["fuelType"].unique()))

[' 5 Series' ' 6 Series' ' 1 Series' ' 7 Series' ' 2 Series' ' 4 Series'
 ' X3' ' 3 Series' ' X5' ' X4' ' i3' ' X1' ' M4' ' X2' ' X6' ' 8 Series'
 ' Z4' ' X7' ' M5' ' i8' ' M2' ' M3' ' M6' ' Z3']
24
['Automatic' 'Manual' 'Semi-Auto']
3
['Diesel' 'Petrol' 'Other' 'Hybrid' 'Electric']
5
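The `model` values printed above carry a leading space (for example `' 5 Series'`). A minimal sketch of stripping that whitespace before encoding, using a small hypothetical frame named `toy` rather than the BMW data:

```python
import pandas as pd

# Hypothetical toy frame mimicking the leading spaces seen in the "model" column.
toy = pd.DataFrame({"model": [" 5 Series", " X3", " 5 Series", " i8"]})

# str.strip() removes leading and trailing whitespace from every value.
toy["model"] = toy["model"].str.strip()

print(toy["model"].unique())
```

Cleaning the strings first keeps the encoded class names and the one-hot column names free of stray spaces.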


#### Label Encoding

In [54]:
# Fit a separate encoder per column so that each mapping stays available
# for inverse_transform later (a shared encoder keeps only its last fit).
le_transmission = LabelEncoder()
df["transmission_"] = le_transmission.fit_transform(df["transmission"])
print(le_transmission.classes_)

le_model = LabelEncoder()
df["model_"] = le_model.fit_transform(df["model"])
print(le_model.classes_)

le_fuel = LabelEncoder()
df["fuelType_"] = le_fuel.fit_transform(df["fuelType"])
print(le_fuel.classes_)

display(df.sample(10))

['Automatic' 'Manual' 'Semi-Auto']
[' 1 Series' ' 2 Series' ' 3 Series' ' 4 Series' ' 5 Series' ' 6 Series'
 ' 7 Series' ' 8 Series' ' M2' ' M3' ' M4' ' M5' ' M6' ' X1' ' X2' ' X3'
 ' X4' ' X5' ' X6' ' X7' ' Z3' ' Z4' ' i3' ' i8']
['Diesel' 'Electric' 'Hybrid' 'Other' 'Petrol']


,model,year,price,transmission,mileage,fuelType,tax,mpg,engineSize,transmission_,model_,fuelType_
5852,5 Series,2017,26000,Semi-Auto,43533,Diesel,150,53.3,3.0,2,4,0
8204,1 Series,2017,12415,Manual,17724,Diesel,145,83.1,1.5,1,0,0
5318,X5,2019,52950,Semi-Auto,3309,Petrol,145,27.2,3.0,2,17,4
8042,4 Series,2017,19600,Automatic,19293,Diesel,145,65.7,2.0,0,3,0
3286,1 Series,2016,18990,Semi-Auto,11538,Petrol,200,39.8,3.0,2,0,4
1288,2 Series,2019,20276,Automatic,4013,Petrol,145,53.3,1.5,0,1,4
2329,4 Series,2018,20991,Manual,11455,Petrol,150,46.3,2.0,1,3,4
8830,X1,2016,17400,Manual,28999,Diesel,125,58.9,2.0,1,13,0
4257,2 Series,2019,21680,Semi-Auto,8994,Petrol,145,47.9,2.0,2,1,4
9153,Z4,2016,15991,Manual,25921,Petrol,205,41.5,2.0,1,21,4
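`LabelEncoder` assigns integer codes in the sorted order of the class names, and a fitted encoder remembers only its most recent `fit`. The sketch below, on a hypothetical toy frame, keeps one encoder per column in a dictionary (the name `encoders` is an assumption for illustration) and decodes the integers back to strings. Note that the scikit-learn documentation describes `LabelEncoder` as intended for target labels; `OrdinalEncoder` is its counterpart for feature columns.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data, not the BMW dataset.
toy = pd.DataFrame({
    "transmission": ["Manual", "Automatic", "Semi-Auto", "Manual"],
    "fuelType": ["Petrol", "Diesel", "Diesel", "Hybrid"],
})

# One fitted encoder per column, stored by column name.
encoders = {}
for col in ["transmission", "fuelType"]:
    encoders[col] = LabelEncoder()
    toy[col + "_"] = encoders[col].fit_transform(toy[col])

# Codes follow the sorted class names: Automatic=0, Manual=1, Semi-Auto=2.
codes = toy["transmission_"].tolist()

# Map the integer codes back to the original strings.
decoded = encoders["transmission"].inverse_transform(toy["transmission_"])
print(codes, list(decoded))
```

Because the codes are ordered integers, a linear model may read them as a ranking (Semi-Auto > Manual > Automatic), which is one reason one-hot encoding is often preferred for unordered categories.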


In [55]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10781 entries, 0 to 10780
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   model          10781 non-null  object 
 1   year           10781 non-null  int64  
 2   price          10781 non-null  int64  
 3   transmission   10781 non-null  object 
 4   mileage        10781 non-null  int64  
 5   fuelType       10781 non-null  object 
 6   tax            10781 non-null  int64  
 7   mpg            10781 non-null  float64
 8   engineSize     10781 non-null  float64
 9   transmission_  10781 non-null  int64  
 10  model_         10781 non-null  int64  
 11  fuelType_      10781 non-null  int64  
dtypes: float64(2), int64(7), object(3)
memory usage: 1010.8+ KB


In [56]:
df["transmission"].value_counts()

Semi-Auto    4666
Automatic    3588
Manual       2527
Name: transmission, dtype: int64

In [57]:
df["fuelType"].value_counts()

Diesel      7027
Petrol      3417
Hybrid       298
Other         36
Electric       3
Name: fuelType, dtype: int64

#### One-hot Encoding

In [58]:
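# One indicator column per transmission type
df_transmission = pd.get_dummies(df["transmission"])
# Drop one category as the baseline: a row with 0 in both remaining columns is "Manual"
df_transmission = df_transmission.drop(columns="Manual")
display(df_transmission.sample(10))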

,Automatic,Semi-Auto
5989,0,1
2891,0,1
9332,1,0
7102,1,0
3650,0,0
4266,0,0
1409,1,0
505,0,1
9112,1,0
6345,0,1
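Dropping one indicator column removes redundancy, since the dropped category is implied when all remaining columns are 0. `pd.get_dummies` can do this directly with `drop_first=True`; a minimal sketch on a hypothetical toy frame follows. The `dtype=int` argument is added here so the result prints as 0/1, because recent pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical toy data, not the BMW dataset.
toy = pd.DataFrame({"transmission": ["Manual", "Automatic", "Semi-Auto", "Manual"]})

# drop_first=True drops the first category in sorted order ("Automatic" here),
# so a row of all zeros now means "Automatic" rather than "Manual".
dummies = pd.get_dummies(toy["transmission"], drop_first=True, dtype=int)
print(dummies)
```

Unlike an explicit `drop`, `drop_first=True` does not let you choose which category becomes the baseline.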


In [59]:
df_fuelType = pd.get_dummies(df["fuelType"])
display(df_fuelType.sample(10))

,Diesel,Electric,Hybrid,Other,Petrol
917,1,0,0,0,0
4115,0,0,0,0,1
3506,1,0,0,0,0
4660,1,0,0,0,0
6027,0,0,0,0,1
9218,0,0,0,0,1
7249,0,0,0,0,1
7939,1,0,0,0,0
4755,1,0,0,0,0
2139,0,0,0,0,1


In [60]:
df_model = pd.get_dummies(df["model"])
display(df_model.sample(10))

,1 Series,2 Series,3 Series,4 Series,5 Series,6 Series,7 Series,8 Series,M2,M3,...,X2,X3,X4,X5,X6,X7,Z3,Z4,i3,i8
10633,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5592,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10773,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5449,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8481,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8746,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2061,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5764,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9317,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1761,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
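`pd.get_dummies` also accepts a whole DataFrame. In that case it encodes the string columns and leaves numeric columns untouched, so any label-encoded copies (such as `transmission_`) stay in the frame next to the new indicator columns. A sketch, on a hypothetical toy frame, of using the `columns` argument to name exactly which columns to encode:

```python
import pandas as pd

# Hypothetical toy data, not the BMW dataset.
toy = pd.DataFrame({
    "price": [11200, 27000, 16250],
    "transmission": ["Automatic", "Automatic", "Manual"],
    "fuelType": ["Diesel", "Petrol", "Petrol"],
})

# Encode only the named columns; new columns are prefixed with the source column name.
encoded = pd.get_dummies(toy, columns=["transmission", "fuelType"], dtype=int)
print(encoded.columns.tolist())
```

The original string columns are replaced by their indicators, while `price` passes through unchanged.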


In [61]:
print(df.shape)
df = pd.get_dummies(df)
# df = pd.get_dummies(df, drop_first=True)

display(df.sample(10))
print(df.shape)

(10781, 12)


,year,price,mileage,tax,mpg,engineSize,transmission_,model_,fuelType_,model_ 1 Series,...,model_ i3,model_ i8,transmission_Automatic,transmission_Manual,transmission_Semi-Auto,fuelType_Diesel,fuelType_Electric,fuelType_Hybrid,fuelType_Other,fuelType_Petrol
646,2019,29998,10171,145,41.5,2.0,2,2,4,0,...,0,0,0,0,1,0,0,0,0,1
9818,2016,17999,27193,160,50.4,3.0,0,4,0,0,...,0,0,1,0,0,1,0,0,0,0
8723,2017,13000,24104,125,53.3,1.5,1,0,4,1,...,0,0,0,1,0,0,0,0,0,1
3323,2019,50990,4729,145,37.7,3.0,2,17,0,0,...,0,0,0,0,1,1,0,0,0,0
2641,2019,30326,3392,145,41.5,2.0,2,2,4,0,...,0,0,0,0,1,0,0,0,0,1
2205,2015,10999,90018,20,70.6,2.0,0,2,0,0,...,0,0,1,0,0,1,0,0,0,0
7404,2007,3495,130000,200,44.8,2.0,1,2,4,0,...,0,0,0,1,0,0,0,0,0,1
9804,2012,5295,95360,30,62.8,2.0,1,0,0,1,...,0,0,0,1,0,1,0,0,0,0
6878,2017,19460,11076,145,56.5,3.0,0,2,0,0,...,0,0,1,0,0,1,0,0,0,0
10395,2017,15600,29019,145,65.7,2.0,0,0,0,1,...,0,0,1,0,0,1,0,0,0,0


(10781, 41)
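The objectives also list creating a categorical variable from a numeric one, for which reference [4] points to `pandas.cut`. A minimal sketch follows; the bin edges, the labels, and the `mileage_group` column name are assumptions chosen only for illustration, not values from the lecture.

```python
import pandas as pd

# Hypothetical toy data, not the BMW dataset.
toy = pd.DataFrame({"mileage": [10401, 26676, 39554, 67068]})

# Illustrative bin edges and labels (assumptions).
bins = [0, 20000, 50000, float("inf")]
labels = ["low", "medium", "high"]

# By default each bin includes its right edge: (0, 20000], (20000, 50000], (50000, inf].
toy["mileage_group"] = pd.cut(toy["mileage"], bins=bins, labels=labels)
print(toy)
```

The resulting column has the pandas `category` dtype and can then be label encoded or one-hot encoded like any other categorical variable.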


In [62]:
# Candidate models: Linear Regression, Decision Tree, Random Forest,
# XGBoost (Extreme Gradient Boosting)
# Linear model form: y = m1*x1 + m2*x2 + c

# Example with 1000 rows:
# 80% training data = 800 rows
# 20% test data = 200 rows # set the actual prices aside for evaluation

# price = c1*model_1series + c2*model_2series + c3*year + c # ML training: fit c1, c2, c3, c
# predicted price = c1*model_1series + c2*model_2series + c3*year + c # ML testing: apply the fitted coefficients

# error = compare(actual price, predicted price)

# 10%

# 14%
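The notes above outline an 80/20 split, fitting a linear model on one-hot features, and comparing actual with predicted prices. A rough sketch of that workflow is given below. It uses synthetic stand-in data (not the BMW dataset), and the choice of mean absolute percentage error as the comparison metric is an assumption, since the notes do not name one.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1000 rows following a known linear price rule.
rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame({
    "year": rng.integers(2010, 2020, size=n),
    "model": rng.choice(["1 Series", "2 Series", "5 Series"], size=n),
})
base = toy["model"].map({"1 Series": 0, "2 Series": 2000, "5 Series": 6000})
toy["price"] = 10000 + 1500 * (toy["year"] - 2010) + base + rng.normal(0, 500, size=n)

# One-hot encode the categorical feature; "1 Series" becomes the baseline.
X = pd.get_dummies(toy[["year", "model"]], columns=["model"], drop_first=True, dtype=int)
y = toy["price"]

# 80% of the rows for training, 20% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Training: fit the coefficients. Testing: apply them to unseen rows.
reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

# Compare actual and predicted prices on the held-out rows.
mape = mean_absolute_percentage_error(y_test, y_pred)
print(f"MAPE on the test set: {mape:.1%}")
```

The same steps apply to the encoded BMW frame by replacing `X` and `y` with its feature columns and the `price` column.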

# Summary of Data Preprocessing for ML with Python
What you have learned from this module:

- Importing data (CSV, XLSX, TXT, etc.) with pandas
- Creating a new DataFrame
- Splitting columns
- Creating a new column in a DataFrame
- Replacing/removing a value in a pandas column
- Removing a column from a DataFrame
- Renaming columns
- Extracting new information from a column
- Creating a column based on a condition or function
- Removing a string from a column
- Checking the unique values in each column
- Performing calculations on DataFrame columns
- Sorting a DataFrame
- Slicing a DataFrame
- Data cleaning
- Visualizing missing values
- Converting strings to datetime
- Removing missing values
- Replacing missing values by: 1. mean, 2. median, 3. constant, 4. interpolation, 5. forward imputation, 6. backward imputation
- Inner join, outer join, left join, right join
- Data filtering
- Data aggregation/grouping
- Pivot tables
- Data visualization: bar plot
- Dealing with categorical variables: 1. Label encoding, 2. One-hot encoding