# Encoding Categorical Features

There are many ways to encode categorical variables as numbers so that an algorithm can use them:<br>
1) One Hot Encoding<br>
2) Label Encoding<br>
3) Ordinal Encoding<br>
4) Helmert Encoding<br>
5) Binary Encoding<br>
6) Frequency Encoding<br>
7) Mean Encoding<br>
8) Weight of Evidence Encoding<br>
9) Probability Ratio Encoding<br>
10) Hashing Encoding<br>
11) Backward Difference Encoding<br>
12) Leave One Out Encoding<br>
13) James-Stein Encoding<br>
14) M-estimator Encoding<br>
15) Thermometer Encoder (To be updated)<br>

In [1]:
# importing the libraries
import pandas as pd
import numpy as np

In [2]:
data = { 'Temperature' : ['Hot','Cold','Very Hot','Warm','Hot','Warm','Warm','Hot','Hot','Cold'],
         'Color' : ['Red','Yellow','Blue','Blue','Red','Yellow','Red','Yellow','Yellow','Yellow'],
         'Target' : [1,1,1,0,1,0,1,0,1,1] }
df = pd.DataFrame(data, columns = ['Temperature','Color','Target'])

In [3]:
df.head()

Unnamed: 0,Temperature,Color,Target
0,Hot,Red,1
1,Cold,Yellow,1
2,Very Hot,Blue,1
3,Warm,Blue,0
4,Hot,Red,1


## One Hot Encoding

<br>
In this method, each category is mapped to a vector of 1s and 0s that denote the presence or absence of that category. The number of vectors equals the number of categories in the feature. If the feature has many categories, this method produces many columns, which can slow down learning significantly.<br>
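
To make the mapping concrete, here is a minimal hand-rolled sketch of one hot encoding with NumPy. The `temperature` list is a hypothetical toy column that mirrors the values used in this notebook, and the alphabetical column order is an assumption chosen for illustration; the library-based approaches are what you would normally use.

```python
import numpy as np

# Hypothetical toy column, mirroring the Temperature values used in this notebook
temperature = ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot']

# The sorted unique categories define the column order of the encoding
categories = sorted(set(temperature))

# One row per sample, one column per category; a 1 marks the sample's category
one_hot = np.zeros((len(temperature), len(categories)), dtype=int)
for row, value in enumerate(temperature):
    one_hot[row, categories.index(value)] = 1

print(categories)
print(one_hot)
```

Each row contains exactly one 1, and the number of columns grows with the number of distinct categories, which is why high-cardinality features become expensive.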


<br>
One hot encoding can be done in two ways:<br>
1) pandas `get_dummies`<br>
2) scikit-learn `OneHotEncoder`<br>
<br>

<br>Note: for regression, drop the first or the last column after encoding, because the N columns always sum to 1 and are therefore perfectly collinear with the intercept. For tree-based learning algorithms, it is good practice to keep all N binary variables.<br>
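
The difference between keeping N columns and dropping one can be sketched with the `drop_first` parameter of pandas `get_dummies`. The `temps` frame below is a hypothetical toy example, and `dtype=int` is set only so that the result shows 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical toy frame with the same categories as the Temperature column
temps = pd.DataFrame({'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot']})

# N columns: one per category, a reasonable choice for tree-based models
full = pd.get_dummies(temps, columns=['Temperature'], prefix='Temp', dtype=int)

# N-1 columns: the first category (Cold) is dropped and becomes the baseline,
# which avoids perfect collinearity in a regression with an intercept
reduced = pd.get_dummies(temps, columns=['Temperature'], prefix='Temp',
                         dtype=int, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

In `reduced`, a row of all zeros stands for the dropped category, so no information is lost.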

#### Using get_dummies

In [4]:
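# Replace the Temperature column with one indicator column per category.
# dtype=int keeps the 0/1 output shown below; recent pandas versions return booleans by default.
df = pd.get_dummies(df, prefix=['Temp'], columns=['Temperature'], dtype=int)
df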

Unnamed: 0,Color,Target,Temp_Cold,Temp_Hot,Temp_Very Hot,Temp_Warm
0,Red,1,0,1,0,0
1,Yellow,1,1,0,0,0
2,Blue,1,0,0,1,0
3,Blue,0,0,0,0,1
4,Red,1,0,1,0,0
5,Yellow,0,0,0,0,1
6,Red,1,0,0,0,1
7,Yellow,0,0,1,0,0
8,Yellow,1,0,1,0,0
9,Yellow,1,1,0,0,0


#### Using the OneHotEncoder

In [5]:
data = { 'Temperature' : ['Hot','Cold','Very Hot','Warm','Hot','Warm','Warm','Hot','Hot','Cold'],
         'Color' : ['Red','Yellow','Blue','Blue','Red','Yellow','Red','Yellow','Yellow','Yellow'],
         'Target' : [1,1,1,0,1,0,1,0,1,1] }
df1 = pd.DataFrame(data, columns = ['Temperature','Color','Target'])

In [6]:
df1.head()

Unnamed: 0,Temperature,Color,Target
0,Hot,Red,1
1,Cold,Yellow,1
2,Very Hot,Blue,1
3,Warm,Blue,0
4,Hot,Red,1


In [7]:
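from sklearn.preprocessing import OneHotEncoder

# Fit the encoder on the Temperature column; the encoder expects a 2-D input
ohc = OneHotEncoder()
ohe = ohc.fit_transform(df1[['Temperature']]).toarray()  # sparse matrix -> dense array

# Name each new column after the category it represents
dfOneHot = pd.DataFrame(ohe, columns=["Temp_" + str(cat) for cat in ohc.categories_[0]])

# Append the encoded columns to the original frame
dfh = pd.concat([df1, dfOneHot], axis=1)
dfh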

Unnamed: 0,Temperature,Color,Target,Temp_Cold,Temp_Hot,Temp_Very Hot,Temp_Warm
0,Hot,Red,1,0.0,1.0,0.0,0.0
1,Cold,Yellow,1,1.0,0.0,0.0,0.0
2,Very Hot,Blue,1,0.0,0.0,1.0,0.0
3,Warm,Blue,0,0.0,0.0,0.0,1.0
4,Hot,Red,1,0.0,1.0,0.0,0.0
5,Warm,Yellow,0,0.0,0.0,0.0,1.0
6,Warm,Red,1,0.0,0.0,0.0,1.0
7,Hot,Yellow,0,0.0,1.0,0.0,0.0
8,Hot,Yellow,1,0.0,1.0,0.0,0.0
9,Cold,Yellow,1,1.0,0.0,0.0,0.0
