### Encoding Comparison Analysis
Label Encoder vs. One-Hot Encoder

A typical structured dataset contains several columns, a mix of numerical and categorical variables. Most machine learning algorithms operate on numbers and cannot use text labels directly.

For that reason, categorical columns must be converted to numerical columns before a machine learning algorithm can use them. This process of converting categories to numbers is called categorical encoding.

There are multiple ways of handling categorical variables. In this notebook, we will look at two widely used techniques:

- Label Encoding
- One-Hot Encoding



#### Click the link below to download the data

__[Automobile Data Set](https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/auto.csv)__ 

In [1]:
import pandas as pd
import numpy as np

In [2]:
filename = "https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/auto.csv"
headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
         "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
         "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
         "peak-rpm","city-mpg","highway-mpg","price"]
df = pd.read_csv(filename, names = headers)

In [3]:
df.head()

,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


In [4]:
df.replace("?", np.nan, inplace = True)
df.head(5)

,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


In [5]:
# drop every row that has NaN in the "price" column
df.dropna(subset=["price"], axis=0, inplace=True)

# reset the index, because rows were dropped
df.reset_index(drop=True, inplace=True)

In [6]:
df[['price']] = df[['price']].astype(int)
df.dtypes

symboling              int64
normalized-losses     object
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                  object
stroke                object
compression-ratio    float64
horsepower            object
peak-rpm              object
city-mpg               int64
highway-mpg            int64
price                  int64
dtype: object

### Label Encoding
Label Encoding is a popular encoding technique for handling categorical variables. In this technique, each label is assigned a unique integer based on alphabetical ordering.

In [7]:
# Import label encoder from scikit-learn 
from sklearn import preprocessing
# create label_encoder object 
label_encoder = preprocessing.LabelEncoder()
# Encode labels in column 'body-style'. 
df['lbl_body-style'] = label_encoder.fit_transform(df['body-style']) 
print(df[['lbl_body-style', 'body-style']].head())
print(df['lbl_body-style'].unique())
print(df['body-style'].unique())

   lbl_body-style   body-style
0               0  convertible
1               0  convertible
2               2    hatchback
3               3        sedan
4               3        sedan
[0 2 3 4 1]
['convertible' 'hatchback' 'sedan' 'wagon' 'hardtop']


##### Label encoding uses alphabetical ordering. Hence, the body-style convertible has been encoded as 0, hardtop as 1, hatchback as 2, sedan as 3, and wagon as 4.
The body-style names have no inherent order or rank. When label encoding is performed, however, they are numbered in alphabetical order. Because the column now holds different integers, a model can misread the data as being ordered (0 < 1 < 2), which is not the case at all. One-hot encoding avoids this problem.
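
To check which integer each label received, one option is to inspect the fitted encoder. The sketch below is a minimal illustration on a toy column (the five body styles typed in by hand, not read from the data set); `classes_` and `inverse_transform` are standard `LabelEncoder` members.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column holding the five body styles (typed in by hand for illustration).
body_style = pd.Series(["convertible", "hatchback", "sedan", "wagon", "hardtop"])

encoder = LabelEncoder()
codes = encoder.fit_transform(body_style)

# classes_ holds the sorted labels; a label's position is its integer code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns the integer codes back into the original labels.
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

Keeping the fitted encoder around makes it possible to translate model inputs back into readable category names later.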

### One-Hot Encoding
One-hot encoding takes a column of categorical data and splits it into multiple columns, one per category. Each new column holds 1 in the rows that belong to its category and 0 everywhere else. The column does not need to be label encoded first; `pd.get_dummies` below works directly on the string labels.
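
scikit-learn also ships a `OneHotEncoder` that produces the same kind of 0/1 columns. The following is a minimal sketch on a toy column (hand-typed body styles, not the full data set), shown only as an alternative to the pandas approach used in this notebook.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy column; OneHotEncoder expects 2-D input, hence the reshape to one column.
body_style = np.array(
    ["convertible", "hatchback", "sedan", "wagon", "hardtop"]
).reshape(-1, 1)

encoder = OneHotEncoder()
# The default result is a sparse matrix; toarray() makes it dense for display.
one_hot = encoder.fit_transform(body_style).toarray()

print(encoder.categories_[0])  # column order of the encoded matrix
print(one_hot)
```

Each row contains exactly one 1, in the column that corresponds to its category.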

In [8]:
dummy_variable_1 = pd.get_dummies(df["body-style"])
dummy_variable_1.head()

,convertible,hardtop,hatchback,sedan,wagon
0,1,0,0,0,0
1,1,0,0,0,0
2,0,0,1,0,0
3,0,0,0,1,0
4,0,0,0,1,0


In [9]:
# merge data frame "df" and "dummy_variable_1" 
df = pd.concat([df, dummy_variable_1], axis=1)
print(df[['body-style', 'convertible', 'hardtop', 'hatchback', 'sedan', 'wagon']].head(10))

df.head(10)

    body-style  convertible  hardtop  hatchback  sedan  wagon
0  convertible            1        0          0      0      0
1  convertible            1        0          0      0      0
2    hatchback            0        0          1      0      0
3        sedan            0        0          0      1      0
4        sedan            0        0          0      1      0
5        sedan            0        0          0      1      0
6        sedan            0        0          0      1      0
7        wagon            0        0          0      0      1
8        sedan            0        0          0      1      0
9        sedan            0        0          0      1      0


,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,peak-rpm,city-mpg,highway-mpg,price,lbl_body-style,convertible,hardtop,hatchback,sedan,wagon
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,5000,21,27,13495,0,1,0,0,0,0
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,5000,21,27,16500,0,1,0,0,0,0
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,5000,19,26,16500,2,0,0,1,0,0
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,5500,24,30,13950,3,0,0,0,1,0
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,5500,18,22,17450,3,0,0,0,1,0
5,2,,audi,gas,std,two,sedan,fwd,front,99.8,...,5500,19,25,15250,3,0,0,0,1,0
6,1,158.0,audi,gas,std,four,sedan,fwd,front,105.8,...,5500,19,25,17710,3,0,0,0,1,0
7,1,,audi,gas,std,four,wagon,fwd,front,105.8,...,5500,19,25,18920,4,0,0,0,0,1
8,1,158.0,audi,gas,turbo,four,sedan,fwd,front,105.8,...,5500,17,20,23875,3,0,0,0,1,0
9,2,192.0,bmw,gas,std,two,sedan,rwd,front,101.2,...,5800,23,29,16430,3,0,0,0,1,0


As you can see, 5 new features are added because body-style contains 5 unique values: convertible, hardtop, hatchback, sedan, and wagon. This technique removes the ranking problem, because each category is represented by a binary vector.

Challenges of One-Hot Encoding: one-hot encoding can lead to the dummy variable trap. The dummy columns in each row sum to 1, so the value of any one column can be predicted exactly from the remaining ones, and together with a model's intercept the columns are perfectly collinear.
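
A common way around the dummy variable trap is to drop one of the dummy columns. The sketch below uses a toy column (hand-typed labels) and a hypothetical helper, `design_rank`, to compare the rank of the design matrix with and without `drop_first=True`; it is an illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Toy column with all five body styles (typed in by hand for illustration).
body_style = pd.Series(
    ["convertible", "hatchback", "sedan", "wagon", "hardtop", "sedan"]
)

full = pd.get_dummies(body_style).astype(int)                      # 5 columns
reduced = pd.get_dummies(body_style, drop_first=True).astype(int)  # 4 columns

# Every row of the full encoding sums to 1, duplicating an intercept column.
print(full.sum(axis=1).unique())


def design_rank(dummies):
    """Rank of the design matrix [intercept | dummies] (hypothetical helper)."""
    intercept = np.ones((len(dummies), 1))
    return np.linalg.matrix_rank(np.hstack([intercept, dummies.to_numpy()]))


# Full encoding: 6 columns including the intercept, but the rank is lower.
print(design_rank(full), "of", full.shape[1] + 1)
# Reduced encoding: the rank matches the number of columns.
print(design_rank(reduced), "of", reduced.shape[1] + 1)
```

The dropped category becomes the baseline: its effect is absorbed by the intercept, and the remaining coefficients are read as differences from it.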

In [10]:
df.dtypes

symboling              int64
normalized-losses     object
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                  object
stroke                object
compression-ratio    float64
horsepower            object
peak-rpm              object
city-mpg               int64
highway-mpg            int64
price                  int64
lbl_body-style         int64
convertible            uint8
hardtop                uint8
hatchback              uint8
sedan                  uint8
wagon                  uint8
dtype: object

In [12]:
# Split df into the one-hot and label-encoded feature sets
# (casting while selecting avoids assigning into a slice of df)
X_OneHot = df[['convertible', 'hardtop', 'hatchback', 'sedan', 'wagon']].astype(int)
X_LabelEnc = df[['lbl_body-style']].astype(int)
Y = df[['price']]

In [13]:
from sklearn.linear_model import LinearRegression

lm_OneHot = LinearRegression()
lm_LabelEnc = LinearRegression()

In [18]:
lm_OneHot.fit(X_OneHot,Y)
lm_LabelEnc.fit(X_LabelEnc,Y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [19]:
print(lm_OneHot.intercept_, lm_OneHot.coef_)

[8.57204958e+17] [[-8.57204958e+17 -8.57204958e+17 -8.57204958e+17 -8.57204958e+17
  -8.57204958e+17]]
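
The very large intercept and the equally large negative coefficients above are consistent with the collinearity between the five dummy columns and the intercept: only the sum of the intercept and each coefficient is determined. One way to get interpretable coefficients, sketched here on made-up toy prices (not values from the data set), is to fit without an intercept, in which case each coefficient is the mean price of its category.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up toy data: invented prices, for illustration only.
toy = pd.DataFrame({
    "body-style": ["sedan", "sedan", "wagon", "wagon", "hatchback", "hatchback"],
    "price": [10000.0, 14000.0, 9000.0, 11000.0, 7000.0, 8000.0],
})

X = pd.get_dummies(toy["body-style"]).astype(int)
y = toy["price"]

# Without an intercept, the dummy columns are no longer collinear with it,
# so each column gets a single well-defined coefficient.
model = LinearRegression(fit_intercept=False).fit(X, y)
coefficients = dict(zip(X.columns, model.coef_))
group_means = toy.groupby("body-style")["price"].mean().to_dict()

print(coefficients)
print(group_means)
print(all(np.isclose(coefficients[k], group_means[k]) for k in group_means))
```

Dropping one dummy column and keeping the intercept is an equivalent alternative; the predictions are the same, only the parameterisation differs.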


In [20]:
print(lm_LabelEnc.intercept_, lm_LabelEnc.coef_)

[14961.44945092] [[-670.37707155]]
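
The label-encoded model is a single straight line through the integer codes, so it forces the same price change between every pair of neighbouring codes. As a small worked example, the snippet below plugs the printed intercept and coefficient into that line for each body-style code; the variable names are illustrative only.

```python
# Intercept and slope copied from the printed output of lm_LabelEnc above.
intercept = 14961.44945092
slope = -670.37707155

# Alphabetical codes that LabelEncoder assigned to the five body styles.
codes = {"convertible": 0, "hardtop": 1, "hatchback": 2, "sedan": 3, "wagon": 4}

# Predicted price for each body style: intercept + slope * code.
predictions = {label: intercept + slope * code for label, code in codes.items()}
for label, price in predictions.items():
    print(f"{label:<12} {price:10.2f}")
```

The equal spacing between consecutive predictions comes from the alphabetical numbering, not from any property of the cars themselves.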


In [24]:
Ypred_OneHot = lm_OneHot.predict(X_OneHot)
Ypred_LabelEnc = lm_LabelEnc.predict(X_LabelEnc)

In [31]:
from sklearn.metrics import mean_squared_error
from math import sqrt
# the square root of the MSE is the root mean squared error (RMSE)
rmse_OneHot = sqrt(mean_squared_error(df['price'], Ypred_OneHot))
print('The root mean squared error of price and predicted value with one hot encoding is: ', rmse_OneHot)


The root mean squared error of price and predicted value with one hot encoding is:  7287.929318827049


In [33]:
# the square root of the MSE is the root mean squared error (RMSE)
rmse_LabelEnc = sqrt(mean_squared_error(df['price'], Ypred_LabelEnc))
print('The root mean squared error of price and predicted value with label encoding is: ', rmse_LabelEnc)

The root mean squared error of price and predicted value with label encoding is:  7906.161543493036


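#### The closer the RMSE is to 0, the better the fit. The RMSE of the one-hot model (about 7288) is lower than that of the label-encoded model (about 7906), so one-hot encoding gives the better fit here. Both values are computed on the same rows that were used for fitting, so they measure training error only.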
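
For reference, RMSE is the square root of the mean of the squared differences between actual and predicted values. The sketch below computes it by hand and with scikit-learn; the actual values are the first five prices shown earlier, while the predictions are made-up numbers used only for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# First five prices from the data preview above.
y_true = np.array([13495.0, 16500.0, 16500.0, 13950.0, 17450.0])
# Made-up predictions, for illustration only.
y_pred = np.array([14000.0, 15000.0, 17000.0, 13000.0, 18000.0])

# Manual RMSE: square the errors, average them, take the square root.
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same quantity via scikit-learn's mean_squared_error.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```

Because RMSE is expressed in the same units as the target, the values in this notebook can be read directly as typical price errors.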