# AccelerateAI - Data Science Bootcamp

## Encoding Techniques

We will look at the following encoding methods:

- One Hot Encoding
- Label Encoding
- Ordinal Encoding
- Helmert Encoding
- Binary Encoding

## Import Libraries and Load sample dataset

In [1]:
import pandas as pd
import numpy as np

# Encoders from scikit-learn that we can use
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder

# category_encoders provides many more encoders. Import the package under an alias
# so that its OneHotEncoder and OrdinalEncoder do not shadow the sklearn classes imported above.
# Representative (not exhaustive) list: ce.OneHotEncoder, ce.OrdinalEncoder, ce.BinaryEncoder,
# ce.HelmertEncoder, ce.TargetEncoder, ce.HashingEncoder, ce.WOEEncoder
import category_encoders as ce


import warnings
warnings.filterwarnings('ignore')

In [2]:
data = {'Server_Id':[7493,7494,7495,7496,7497,7498,7499,7500,7501,7502,7503,7504,7505,7506,7507,7508,7509,7510,7512,7513],
       'Chiller_Temp':['Hot','Cold','Very Hot','Warm','Hot','Warm','Warm','Hot','Hot','Hot','Cold','Very Hot','Warm','Hot','Warm','Warm','Hot','Hot','Cold','Very Hot'],
       'Model':['RX','YX','BX','BX','RX','YX','RX','BX','YX','YX','YX','BX','BX','RX','YX','RX','RX','YX','RX','BX'],
       'Target':[1,1,1,0,1,0,1,0,1,1,0,0,1,0,1,1,0,1,1,0]}
df = pd.DataFrame(data)

df.sample(5)

Unnamed: 0,Server_Id,Chiller_Temp,Model,Target
0,7493,Hot,RX,1
8,7501,Hot,YX,1
3,7496,Warm,BX,0
17,7510,Hot,YX,1
10,7503,Cold,YX,0


In [3]:
df.shape

(20, 4)

In [4]:
# Drop the ID column and keep the remaining columns
df = pd.DataFrame(data, columns=['Chiller_Temp','Model','Target'])

df.head(4)

Unnamed: 0,Chiller_Temp,Model,Target
0,Hot,RX,1
1,Cold,YX,1
2,Very Hot,BX,1
3,Warm,BX,0


### Handle Missing values

In [5]:
# Impute each missing value with the previous valid value (forward fill).
# This is a placeholder example in case the data has missing values.
# fillna(method='pad') is deprecated in recent pandas versions; ffill() is the equivalent.

df1 = df.copy()
df1 = df1.ffill()
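
The sample data has no gaps, so the imputation step above leaves it unchanged. As a minimal sketch of what forward filling (the 'pad' option) does, here is a small made-up frame with missing values; the `toy` frame and its contents are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps (not part of the bootcamp dataset)
toy = pd.DataFrame({
    "Chiller_Temp": ["Hot", np.nan, "Cold", np.nan],
    "Model": ["RX", "YX", np.nan, "BX"],
})

# Forward fill: each missing cell takes the last valid value above it in the same column
filled = toy.ffill()
print(filled)
```

A missing value in the first row would stay missing, because there is no earlier value to copy.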

In [6]:
# Keep the first 10 records so the outputs are easier to read.
# Slice df1 (not df) so that the imputation step above is preserved.
df1 = df1.iloc[:10].copy()

df1

Unnamed: 0,Chiller_Temp,Model,Target
0,Hot,RX,1
1,Cold,YX,1
2,Very Hot,BX,1
3,Warm,BX,0
4,Hot,RX,1
5,Warm,YX,0
6,Warm,RX,1
7,Hot,BX,0
8,Hot,YX,1
9,Hot,YX,1


## 1. One-Hot Encoding

### With sample data created above - using pd.get_dummies

In [7]:
# Work on a copy so that df1 stays unchanged
df_OneHot = df1.copy()

df_OneHot

Unnamed: 0,Chiller_Temp,Model,Target
0,Hot,RX,1
1,Cold,YX,1
2,Very Hot,BX,1
3,Warm,BX,0
4,Hot,RX,1
5,Warm,YX,0
6,Warm,RX,1
7,Hot,BX,0
8,Hot,YX,1
9,Hot,YX,1


One-hot encoding applied to a single column (Chiller_Temp):

In [8]:
# dtype=int gives 0/1 columns; recent pandas versions return True/False by default
one_hot_encoded_data = pd.get_dummies(df_OneHot, columns=['Chiller_Temp'], dtype=int)

one_hot_encoded_data

Unnamed: 0,Model,Target,Chiller_Temp_Cold,Chiller_Temp_Hot,Chiller_Temp_Very Hot,Chiller_Temp_Warm
0,RX,1,0,1,0,0
1,YX,1,1,0,0,0
2,BX,1,0,0,1,0
3,BX,0,0,0,0,1
4,RX,1,0,1,0,0
5,YX,0,0,0,0,1
6,RX,1,0,0,0,1
7,BX,0,0,1,0,0
8,YX,1,0,1,0,0
9,YX,1,0,1,0,0


One-hot encoding works the same way for other columns, including Target:

In [9]:
# dtype=int gives 0/1 columns; recent pandas versions return True/False by default
one_hot_encoded_data = pd.get_dummies(df_OneHot, columns=['Model', 'Target'], dtype=int)

one_hot_encoded_data

Unnamed: 0,Chiller_Temp,Model_BX,Model_RX,Model_YX,Target_0,Target_1
0,Hot,0,1,0,0,1
1,Cold,0,0,1,0,1
2,Very Hot,1,0,0,0,1
3,Warm,1,0,0,1,0
4,Hot,0,1,0,0,1
5,Warm,0,0,1,1,0
6,Warm,0,1,0,0,1
7,Hot,1,0,0,1,0
8,Hot,0,0,1,0,1
9,Hot,0,0,1,0,1


### With the same sample data - using sklearn

In [10]:
from sklearn.preprocessing import OneHotEncoder

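# Create the encoder; categories not seen during fit are encoded as all zeros
ohc = OneHotEncoder(handle_unknown='ignore')
# fit_transform expects 2D input and returns a sparse matrix, so reshape and convert to a dense array
ohe = ohc.fit_transform(df_OneHot.Chiller_Temp.values.reshape(-1,1)).toarray()

# Name each output column after the category it represents
dfOneHot1 = pd.DataFrame(ohe, columns=["ChillerTemp_" + str(cat) for cat in ohc.categories_[0]])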

dfh = pd.concat([df_OneHot,dfOneHot1], axis=1)
dfh

Unnamed: 0,Chiller_Temp,Model,Target,ChillerTemp_Cold,ChillerTemp_Hot,ChillerTemp_Very Hot,ChillerTemp_Warm
0,Hot,RX,1,0.0,1.0,0.0,0.0
1,Cold,YX,1,1.0,0.0,0.0,0.0
2,Very Hot,BX,1,0.0,0.0,1.0,0.0
3,Warm,BX,0,0.0,0.0,0.0,1.0
4,Hot,RX,1,0.0,1.0,0.0,0.0
5,Warm,YX,0,0.0,0.0,0.0,1.0
6,Warm,RX,1,0.0,0.0,0.0,1.0
7,Hot,BX,0,0.0,1.0,0.0,0.0
8,Hot,YX,1,0.0,1.0,0.0,0.0
9,Hot,YX,1,0.0,1.0,0.0,0.0
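
The encoder above is created with `handle_unknown='ignore'`. A minimal sketch of what that option does when new data contains a label that was not present during fitting; the `train` and `new` frames and the label 'Freezing' are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data and new data; 'Freezing' never appears in train
train = pd.DataFrame({"Chiller_Temp": ["Hot", "Cold", "Very Hot", "Warm"]})
new = pd.DataFrame({"Chiller_Temp": ["Cold", "Freezing"]})

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train[["Chiller_Temp"]])

# Known labels get a single 1; the unseen label becomes a row of zeros
encoded_new = enc.transform(new[["Chiller_Temp"]]).toarray()
print(enc.categories_[0])
print(encoded_new)
```

Without `handle_unknown='ignore'`, the default behaviour is to raise an error on unseen labels at transform time.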


## 2. Label Encoding

### Using Category Codes Approach

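This approach requires the column to have the 'category' dtype. By default, pandas typically stores a non-numeric column as 'object', so you may have to convert it to 'category' first. The codes follow the sorted order of the categories, which is why Cold becomes 0 and Warm becomes 3 in the output below.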

In [11]:
# Creating initial dataframe
df_category_codes = df1.copy()

df_category_codes.head(4)

Unnamed: 0,Chiller_Temp,Model,Target
0,Hot,RX,1
1,Cold,YX,1
2,Very Hot,BX,1
3,Warm,BX,0


In [12]:
# Converting type of columns to 'category'
df_category_codes['Chiller_Temp'] = df_category_codes['Chiller_Temp'].astype('category')

# Assigning numerical values and storing in another column
df_category_codes['Chiller_Temp_Cat'] = df_category_codes['Chiller_Temp'].cat.codes

df_category_codes

Unnamed: 0,Chiller_Temp,Model,Target,Chiller_Temp_Cat
0,Hot,RX,1,1
1,Cold,YX,1,0
2,Very Hot,BX,1,2
3,Warm,BX,0,3
4,Hot,RX,1,1
5,Warm,YX,0,3
6,Warm,RX,1,3
7,Hot,BX,0,1
8,Hot,YX,1,1
9,Hot,YX,1,1
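
To check which integer each label received with the category-codes approach, the categories can be inspected directly. This is a small sketch on a standalone series; the variable names are illustrative.

```python
import pandas as pd

temps = pd.Series(["Hot", "Cold", "Very Hot", "Warm", "Hot"], dtype="category")

# Categories are stored in sorted order; a code is a position in this list
code_to_label = dict(enumerate(temps.cat.categories))
print(code_to_label)
print(temps.cat.codes.tolist())

# Missing values get the code -1 rather than a category of their own
with_gap = pd.Series(["Hot", None], dtype="category")
print(with_gap.cat.codes.tolist())
```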


### Using Pandas Factorize

In [13]:
df_factorize = df1.copy()

df_factorize.head(4)

Unnamed: 0,Chiller_Temp,Model,Target
0,Hot,RX,1
1,Cold,YX,1
2,Very Hot,BX,1
3,Warm,BX,0


In [14]:
# pd.factorize returns (codes, uniques); the codes are already a 1D array, so no reshape is needed
df_factorize['ChillerTemp_factorize_encode'] = pd.factorize(df_factorize['Chiller_Temp'])[0]

df_factorize

Unnamed: 0,Chiller_Temp,Model,Target,ChillerTemp_factorize_encode
0,Hot,RX,1,0
1,Cold,YX,1,1
2,Very Hot,BX,1,2
3,Warm,BX,0,3
4,Hot,RX,1,0
5,Warm,YX,0,3
6,Warm,RX,1,3
7,Hot,BX,0,0
8,Hot,YX,1,0
9,Hot,YX,1,0
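
Note that the factorize codes above differ from the category codes: Hot is 0 here because it appears first. A short sketch of the two return values of `pd.factorize` and of its `sort` option, using a standalone series for illustration.

```python
import pandas as pd

temps = pd.Series(["Hot", "Cold", "Very Hot", "Warm", "Hot"])

# Default: codes follow the order of first appearance
codes, uniques = pd.factorize(temps)
print(codes.tolist(), list(uniques))

# sort=True: codes follow the sorted order of the labels instead
sorted_codes, sorted_uniques = pd.factorize(temps, sort=True)
print(sorted_codes.tolist(), list(sorted_uniques))
```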


### Using sklearn library

In [15]:
df_LE = df1.copy()

In [16]:
# Creating instance of labelencoder
labelencoder = LabelEncoder()

In [17]:
# Assigning numerical values and storing in another column
df_LE['ChillerTemp_Cat'] = labelencoder.fit_transform(df_LE['Chiller_Temp'])

df_LE

Unnamed: 0,Chiller_Temp,Model,Target,ChillerTemp_Cat
0,Hot,RX,1,1
1,Cold,YX,1,0
2,Very Hot,BX,1,2
3,Warm,BX,0,3
4,Hot,RX,1,1
5,Warm,YX,0,3
6,Warm,RX,1,3
7,Hot,BX,0,1
8,Hot,YX,1,1
9,Hot,YX,1,1
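
A fitted `LabelEncoder` keeps the learned labels, so the mapping can be inspected and reversed. A minimal sketch on a standalone list; the variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["Hot", "Cold", "Very Hot", "Warm", "Hot"]

le = LabelEncoder()
codes = le.fit_transform(labels)

# classes_ holds the sorted labels; each code is an index into it
print(list(le.classes_))
print(codes.tolist())

# inverse_transform maps the integer codes back to the original labels
restored = le.inverse_transform(codes)
print(list(restored))
```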


## 3. Ordinal Encoding

In [18]:
df_OrdinalEnc = df1.copy()

df_OrdinalEnc.head(4)

Unnamed: 0,Chiller_Temp,Model,Target
0,Hot,RX,1
1,Cold,YX,1
2,Very Hot,BX,1
3,Warm,BX,0


In [19]:
ChillerTemp_Dict = {'Cold': 1, 'Warm': 2, 'Hot': 3, 'Very Hot': 4}

df_OrdinalEnc['ChillerTemp_Ordinal'] = df_OrdinalEnc.Chiller_Temp.map(ChillerTemp_Dict)

df_OrdinalEnc

Unnamed: 0,Chiller_Temp,Model,Target,ChillerTemp_Ordinal
0,Hot,RX,1,3
1,Cold,YX,1,1
2,Very Hot,BX,1,4
3,Warm,BX,0,2
4,Hot,RX,1,3
5,Warm,YX,0,2
6,Warm,RX,1,2
7,Hot,BX,0,3
8,Hot,YX,1,3
9,Hot,YX,1,3
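
The dictionary mapping above can also be expressed with sklearn's `OrdinalEncoder` by passing the desired order explicitly. This is a sketch of an alternative, not the notebook's method; the encoder numbers categories from 0, so 1 is added to match the dictionary values.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

temps = pd.DataFrame({"Chiller_Temp": ["Hot", "Cold", "Very Hot", "Warm"]})

# Explicit order from coldest to hottest; without it the order would be alphabetical
order = ["Cold", "Warm", "Hot", "Very Hot"]
enc = OrdinalEncoder(categories=[order])

# fit_transform returns a 2D float array; flatten it and shift to start at 1
ordinal = enc.fit_transform(temps[["Chiller_Temp"]]).ravel().astype(int) + 1
temps["ChillerTemp_Ordinal"] = ordinal
print(temps)
```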


## 4. Helmert Encoding

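In Helmert encoding, the mean of the dependent variable for a level is compared with the mean of the dependent variable over all previous levels. In the output below, the levels follow their order of first appearance in the data (Hot, Cold, Very Hot, Warm).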

In [20]:
df_HelmertEnc = df1.copy()

df_HelmertEnc.head(4)

Unnamed: 0,Chiller_Temp,Model,Target
0,Hot,RX,1
1,Cold,YX,1
2,Very Hot,BX,1
3,Warm,BX,0


In [21]:
import category_encoders as ce

# drop_invariant=True drops columns with zero variance
encoder = ce.HelmertEncoder(cols=['Chiller_Temp'], drop_invariant=True)

dfH = encoder.fit_transform(df_HelmertEnc['Chiller_Temp'])

dfHE = pd.concat([df_HelmertEnc,dfH], axis=1)

dfHE

Unnamed: 0,Chiller_Temp,Model,Target,Chiller_Temp_0,Chiller_Temp_1,Chiller_Temp_2
0,Hot,RX,1,-1.0,-1.0,-1.0
1,Cold,YX,1,1.0,-1.0,-1.0
2,Very Hot,BX,1,0.0,2.0,-1.0
3,Warm,BX,0,0.0,0.0,3.0
4,Hot,RX,1,-1.0,-1.0,-1.0
5,Warm,YX,0,0.0,0.0,3.0
6,Warm,RX,1,0.0,0.0,3.0
7,Hot,BX,0,-1.0,-1.0,-1.0
8,Hot,YX,1,-1.0,-1.0,-1.0
9,Hot,YX,1,-1.0,-1.0,-1.0
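
The numbers in the output above can be reproduced by hand with a small contrast matrix. The sketch below rebuilds them with NumPy under the assumption that levels are numbered in order of first appearance; `helmert_matrix` is a hypothetical helper written for this illustration, not the category_encoders implementation.

```python
import numpy as np
import pandas as pd

def helmert_matrix(k):
    """Contrast matrix for k levels: one row per level, k - 1 columns."""
    m = np.zeros((k, k - 1))
    for j in range(k - 1):
        m[: j + 1, j] = -1   # all previous levels get -1
        m[j + 1, j] = j + 1  # the level being compared gets the count of previous levels
    return m

temps = pd.Series(["Hot", "Cold", "Very Hot", "Warm", "Hot"])
codes, levels = pd.factorize(temps)  # levels in order of first appearance

contrasts = helmert_matrix(len(levels))
encoded = pd.DataFrame(
    contrasts[codes],
    columns=[f"Chiller_Temp_{j}" for j in range(len(levels) - 1)],
)
print(encoded)
```

Each column of the matrix sums to zero, which is what makes it a contrast.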


## 5. Binary Encoding

In [22]:
df_Binary = df1.copy()

df_Binary.head(4)

Unnamed: 0,Chiller_Temp,Model,Target
0,Hot,RX,1
1,Cold,YX,1
2,Very Hot,BX,1
3,Warm,BX,0


In [23]:
import category_encoders as ce

encoder = ce.BinaryEncoder(cols=['Chiller_Temp'])
df_BinEnc = encoder.fit_transform(df_Binary['Chiller_Temp'])

df_Binary = pd.concat([df_Binary,df_BinEnc],axis=1)
df_Binary

Unnamed: 0,Chiller_Temp,Model,Target,Chiller_Temp_0,Chiller_Temp_1,Chiller_Temp_2
0,Hot,RX,1,0,0,1
1,Cold,YX,1,0,1,0
2,Very Hot,BX,1,0,1,1
3,Warm,BX,0,1,0,0
4,Hot,RX,1,0,0,1
5,Warm,YX,0,1,0,0
6,Warm,RX,1,1,0,0
7,Hot,BX,0,0,0,1
8,Hot,YX,1,0,0,1
9,Hot,YX,1,0,0,1
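
The output above is consistent with numbering the labels from 1 in order of first appearance and writing each number in binary (Hot = 1 = 001, Warm = 4 = 100). A minimal sketch of that idea with pandas and NumPy; it is an illustration under that assumption, not the category_encoders implementation.

```python
import numpy as np
import pandas as pd

temps = pd.Series(["Hot", "Cold", "Very Hot", "Warm", "Hot"])

# Step 1: ordinal codes starting at 1, in order of first appearance
codes, levels = pd.factorize(temps)
ordinal = codes + 1

# Step 2: number of bits needed for the largest code (4 needs 3 bits)
n_bits = int(ordinal.max()).bit_length()

# Step 3: extract each bit, most significant first
shifts = np.arange(n_bits - 1, -1, -1)
bits = (ordinal[:, None] >> shifts) & 1

encoded = pd.DataFrame(bits, columns=[f"Chiller_Temp_{i}" for i in range(n_bits)])
print(encoded)
```

Four categories need three columns here, compared with four columns for one-hot encoding; the saving grows with the number of categories.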


## Conclusion

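That covers the five encoding methods listed at the start: one-hot, label, ordinal, Helmert, and binary encoding.

Other encoding methods, such as the target, hashing, and weight-of-evidence encoders available in category_encoders, can be applied in a similar way.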