# Feature Encoding

Handling categorical (qualitative) variables is an important step in data preprocessing. Many machine learning algorithms cannot handle categorical variables directly, so we must first convert them to numerical values.<br>
The performance of an ML algorithm also depends on how the categorical variables are encoded: the same model can produce different results under different encoding techniques.

Categorical variables can be divided into two categories:<br>
1. Nominal (no particular order)
2. Ordinal (the categories have a natural order)

<img src="images/Categorical_variables.png" width = 800 />

<div class="alert alert-block alert-warning">  
<b>You can use the below cheat-sheet as a guiding tool. </b> 
</div>

<img src="images/Categorical_Encoding.png" width = 1000 />

There are many ways to encode categorical variables. This notebook covers the following:

1. One Hot Encoding
2. Label Encoding
3. Ordinal Encoding
4. Frequency or Count Encoding
5. Binary Encoding

Many other techniques exist beyond these.

In [1]:
import pandas as pd 
import numpy as np

In [2]:
data = {'Temperature':['Hot','Cold','Very Hot','Warm','Hot','Warm','Warm','Hot','Hot','Cold'],
        'Color':['Red','Yellow','Blue','Blue','Red','Yellow','Red','Yellow','Yellow','Yellow'],
        'Target':[1,1,1,0,1,0,1,0,1,1]}
df = pd.DataFrame(data)
df

Unnamed: 0,Temperature,Color,Target
0,Hot,Red,1
1,Cold,Yellow,1
2,Very Hot,Blue,1
3,Warm,Blue,0
4,Hot,Red,1
5,Warm,Yellow,0
6,Warm,Red,1
7,Hot,Yellow,0
8,Hot,Yellow,1
9,Cold,Yellow,1


### 1. One Hot Encoding

One-hot encoding creates a new column (feature) for each category in the categorical variable and fills it with either 1 (the category is present) or 0 (the category is absent). The number of new columns equals the number of categories in the variable. This method can slow down the learning process significantly when the number of categories is very high.

We prefer one-hot encoding for `Nominal` categorical columns.

In [3]:
# Using get_dummies method in pandas
df_ohe = df.copy()
one_hot_1 = pd.get_dummies(df_ohe,prefix = 'Temp' ,columns=['Temperature'],drop_first=False)
one_hot_1.insert(loc=2, column='Temperature', value=df.Temperature.values)
one_hot_1

Unnamed: 0,Color,Target,Temperature,Temp_Cold,Temp_Hot,Temp_Very Hot,Temp_Warm
0,Red,1,Hot,0,1,0,0
1,Yellow,1,Cold,1,0,0,0
2,Blue,1,Very Hot,0,0,1,0
3,Blue,0,Warm,0,0,0,1
4,Red,1,Hot,0,1,0,0
5,Yellow,0,Warm,0,0,0,1
6,Red,1,Warm,0,0,0,1
7,Yellow,0,Hot,0,1,0,0
8,Yellow,1,Hot,0,1,0,0
9,Yellow,1,Cold,1,0,0,0


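1. For regression, we can use N-1 columns (drop the first or last of the new one-hot encoded columns). The N columns always sum to 1, so the dropped column carries no extra information for a linear model with an intercept.
2. For classification, the recommendation is to use all N columns, as most tree-based algorithms build a tree based on all available variables.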
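The difference between the N-column and N-1-column variants can be sketched with the `drop_first` argument of `pd.get_dummies`. The `temps`, `full`, and `reduced` names are illustrative and not part of the notebook above.

```python
import pandas as pd

# Same Temperature values as the toy dataset above
temps = pd.DataFrame({'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot',
                                      'Warm', 'Warm', 'Hot', 'Hot', 'Cold']})

# All N = 4 indicator columns
full = pd.get_dummies(temps, prefix='Temp', columns=['Temperature'], dtype=int)

# N-1 = 3 columns: the first category in sorted order ('Cold') is dropped
reduced = pd.get_dummies(temps, prefix='Temp', columns=['Temperature'],
                         drop_first=True, dtype=int)

print(list(full.columns))     # ['Temp_Cold', 'Temp_Hot', 'Temp_Very Hot', 'Temp_Warm']
print(list(reduced.columns))  # ['Temp_Hot', 'Temp_Very Hot', 'Temp_Warm']
```

In the reduced frame, a row of all zeros stands for the dropped category (`Cold`).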

**Disadvantages:** 
1. Tree-based algorithms often perform poorly on one-hot encoded data, since it produces a sparse matrix in which each indicator column isolates only a small fraction of the rows.
2. When the feature contains many unique values, an equally large number of features is created, which may result in overfitting.

### 2. Label Encoding

1. In this encoding, each label/category is assigned a unique integer value.<br>
2. One major issue with `sklearn.preprocessing.LabelEncoder` is that it assigns the values based on the alphabetical order of the labels.<br>
Ex: Cold < Hot < Very Hot < Warm is encoded as 0 < 1 < 2 < 3
3. We should prefer label encoding for `ordinal` categorical variables, provided the assigned integers follow the actual order of the categories.

In [4]:
# Using sklearn LabelEncoder()
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df_ohe['Temperature_encoded'] = le.fit_transform(df.Temperature)
df_ohe

Unnamed: 0,Temperature,Color,Target,Temperature_encoded
0,Hot,Red,1,1
1,Cold,Yellow,1,0
2,Very Hot,Blue,1,2
3,Warm,Blue,0,3
4,Hot,Red,1,1
5,Warm,Yellow,0,3
6,Warm,Red,1,3
7,Hot,Yellow,0,1
8,Hot,Yellow,1,1
9,Cold,Yellow,1,0


In [5]:
# Using Pandas factorize()
fact = df.copy()
fact['Temperature_factor'] = pd.factorize(df.Temperature)[0]
fact

Unnamed: 0,Temperature,Color,Target,Temperature_factor
0,Hot,Red,1,0
1,Cold,Yellow,1,1
2,Very Hot,Blue,1,2
3,Warm,Blue,0,3
4,Hot,Red,1,0
5,Warm,Yellow,0,3
6,Warm,Red,1,3
7,Hot,Yellow,0,0
8,Hot,Yellow,1,0
9,Cold,Yellow,1,1


**Disadvantages:** 
1. It misrepresents the data by assigning values based on alphabetical order instead of the actual order of the labels.
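One way to avoid the alphabetical ordering, sketched here with pandas rather than `LabelEncoder`, is an ordered categorical dtype. The short sample series and the `temp_dtype` name are assumptions made for illustration.

```python
import pandas as pd

temperature = pd.Series(['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot'])

# Default categorical codes follow sorted (alphabetical) order,
# the same ordering LabelEncoder produces
alphabetical = temperature.astype('category').cat.codes

# Explicit order: Cold < Warm < Hot < Very Hot
temp_dtype = pd.CategoricalDtype(['Cold', 'Warm', 'Hot', 'Very Hot'], ordered=True)
ordered = temperature.astype(temp_dtype).cat.codes

print(alphabetical.tolist())  # [1, 0, 2, 3, 1]
print(ordered.tolist())       # [2, 0, 3, 1, 2]
```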

### 3. Ordinal Encoding

In [6]:
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder()
oe_val = oe.fit_transform(df['Temperature'].values.reshape(-1, 1))
pd.concat([df,pd.DataFrame(oe_val,columns=['Temperature_Oe'])],axis=1)

Unnamed: 0,Temperature,Color,Target,Temperature_Oe
0,Hot,Red,1,1.0
1,Cold,Yellow,1,0.0
2,Very Hot,Blue,1,2.0
3,Warm,Blue,0,3.0
4,Hot,Red,1,1.0
5,Warm,Yellow,0,3.0
6,Warm,Red,1,3.0
7,Hot,Yellow,0,1.0
8,Hot,Yellow,1,1.0
9,Cold,Yellow,1,0.0


In [7]:
# A better approach is to map the categories according to their actual order
# Ex: Cold < Warm < Hot < Very Hot  ->  1 < 2 < 3 < 4
Temp_order = {'Cold' : 1 , 'Warm' : 2 , 'Hot' : 3 , 'Very Hot' : 4}
df['Temperature_Order'] = df.Temperature.map(Temp_order)
df

Unnamed: 0,Temperature,Color,Target,Temperature_Order
0,Hot,Red,1,3
1,Cold,Yellow,1,1
2,Very Hot,Blue,1,4
3,Warm,Blue,0,2
4,Hot,Red,1,3
5,Warm,Yellow,0,2
6,Warm,Red,1,2
7,Hot,Yellow,0,3
8,Hot,Yellow,1,3
9,Cold,Yellow,1,1
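
The manual mapping above can also be expressed with `OrdinalEncoder` itself by passing the intended order through its `categories` parameter. This is a minimal sketch on a four-row sample; note that the resulting codes start at 0 rather than 1.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

temps = pd.DataFrame({'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm']})

# One list of categories per encoded column, given in the intended order
encoder = OrdinalEncoder(categories=[['Cold', 'Warm', 'Hot', 'Very Hot']])
codes = encoder.fit_transform(temps[['Temperature']])

print(codes.ravel().tolist())  # [2.0, 0.0, 3.0, 1.0]
```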


### 4. Frequency or Count Encoding
In frequency encoding, each category in the feature is replaced with its frequency in the dataset.<br>
When the frequency of a category is related to the target variable, this helps the model understand and assign weights in direct or inverse proportion, depending on the nature of the data.


<img src="images/frequency_encoding.png" width = 500  >

Category refers to each of the unique values in a feature. The encoded value is the ratio of the following two quantities:
1. **Frequency(category)** = Number of rows that belong to that category
2. **Size(data)** = Number of rows in the entire dataset

In [8]:
# Using Pandas groupby()
cat_freq = df.groupby('Temperature').size() / len(df)
df['Temp_Freq_Enc'] = df.Temperature.map(cat_freq)
df

Unnamed: 0,Temperature,Color,Target,Temperature_Order,Temp_Freq_Enc
0,Hot,Red,1,3,0.4
1,Cold,Yellow,1,1,0.2
2,Very Hot,Blue,1,4,0.1
3,Warm,Blue,0,2,0.3
4,Hot,Red,1,3,0.4
5,Warm,Yellow,0,2,0.3
6,Warm,Red,1,2,0.3
7,Hot,Yellow,0,3,0.4
8,Hot,Yellow,1,3,0.4
9,Cold,Yellow,1,1,0.2


**Disadvantage**:
1. If two categories have the same frequency then it is hard to distinguish between them.
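
The collision can be seen in a small sketch. The `color` series below is made-up data, chosen so that two categories occur equally often.

```python
import pandas as pd

# 'Red' and 'Blue' both appear twice out of five rows
color = pd.Series(['Red', 'Red', 'Blue', 'Blue', 'Green'])

# Relative frequency of each category, then map it back onto the rows
freq = color.value_counts(normalize=True)
encoded = color.map(freq)

print(encoded.tolist())  # [0.4, 0.4, 0.4, 0.4, 0.2]
```

After encoding, `Red` and `Blue` share the value 0.4, so the model can no longer tell them apart from this feature alone.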

### 5. Binary Encoding

1. A Base-N encoder encodes the categories into arrays of their base-N representation.
2. A base of 1 is equivalent to one-hot encoding (not really base-1, but useful), and a base of 2 is equivalent to binary encoding. Setting N to the number of actual categories is equivalent to vanilla ordinal encoding.

**Feature -> ordinal encoding -> binary code -> digits of the binary code to separate columns.**

<img  src="images/Binary_encoding.png" width = 800/>
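
The three steps of this pipeline can be sketched by hand with pandas and NumPy. This is an illustration only, not how `category_encoders` is implemented; it assumes ordinal codes are assigned in order of first appearance, starting at 1.

```python
import numpy as np
import pandas as pd

temperature = pd.Series(['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot'])

# Step 1: ordinal codes in order of first appearance, starting at 1
codes = pd.factorize(temperature)[0] + 1   # Hot=1, Cold=2, Very Hot=3, Warm=4

# Step 2: number of binary digits needed for the largest code (4 -> '100' -> 3)
n_bits = int(codes.max()).bit_length()

# Step 3: one column per binary digit, most significant digit first
bits = (codes[:, None] >> np.arange(n_bits - 1, -1, -1)) & 1
binary_df = pd.DataFrame(bits, columns=[f'Temperature_{i}' for i in range(n_bits)])

print(binary_df)
```

Four categories fit into three binary columns here; the saving over one-hot encoding grows with the number of categories.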

In [9]:
import category_encoders as ce
be = ce.BinaryEncoder(cols=['Temperature'])
be_df = be.fit_transform(df['Temperature'])
pd.concat([df,be_df],axis=1)

Unnamed: 0,Temperature,Color,Target,Temperature_Order,Temp_Freq_Enc,Temperature_0,Temperature_1,Temperature_2
0,Hot,Red,1,3,0.4,0,0,1
1,Cold,Yellow,1,1,0.2,0,1,0
2,Very Hot,Blue,1,4,0.1,0,1,1
3,Warm,Blue,0,2,0.3,1,0,0
4,Hot,Red,1,3,0.4,0,0,1
5,Warm,Yellow,0,2,0.3,1,0,0
6,Warm,Red,1,2,0.3,1,0,0
7,Hot,Yellow,0,3,0.4,0,0,1
8,Hot,Yellow,1,3,0.4,0,0,1
9,Cold,Yellow,1,1,0.2,0,1,0


`NOTE` : 
<div class="alert alert-block alert-danger"> 
<b>It is essential to understand that, across machine learning models, no single encoding works well in every situation or for every dataset. <br>
Data scientists still need to experiment to find out which works best for their specific case.  </b>
</div>
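
Such an experiment might look like the following sketch, which compares two encoders using cross-validation in scikit-learn. The choice of `LogisticRegression`, the 3-fold split, and the `candidates` and `scores` names are assumptions for illustration; with only ten rows the scores themselves are not meaningful.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy data from this notebook: one categorical feature and a binary target
X = pd.DataFrame({'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot',
                                  'Warm', 'Warm', 'Hot', 'Hot', 'Cold']})
y = [1, 1, 1, 0, 1, 0, 1, 0, 1, 1]

# Fix the category list up front so every fold sees the same encoding
order = [['Cold', 'Warm', 'Hot', 'Very Hot']]
candidates = {
    'one-hot': OneHotEncoder(categories=order, handle_unknown='ignore'),
    'ordinal': OrdinalEncoder(categories=order),
}

# Same model and same folds for each encoder; only the encoding changes
scores = {}
for name, encoder in candidates.items():
    model = make_pipeline(encoder, LogisticRegression())
    scores[name] = cross_val_score(model, X, y, cv=3).mean()

print(scores)
```

Keeping the model and the folds fixed means any difference in the scores comes from the encoding alone.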


# Key Takeaways
- The two most popular encoding techniques are `Label Encoding` and `One-Hot Encoding`.
- There are various other encoding techniques, but not all of them work on every dataset. Data scientists need to experiment to find out which works best for their specific case.
- `Label Encoding` is used for ordinal categorical data.
- `One-Hot Encoding` is used for nominal categorical data.
