# Encodings 101

This notebook showcases different strategies for encoding the categorical variables in your dataset as numeric values.

We will go through the following _Encoding Strategies_:

__Unsupervised -__ _Don’t make use of the target variable to encode categorical variables_
1. Label Encoder
2. One-Hot Encoder
3. Frequency Encoder
4. Binary Encoder
5. Hashing Encoder

__Supervised -__ _Employ the target variable to encode categorical variables._
1. Target Encoder
2. James-Stein Encoder
3. M-Estimate Encoder
4. CatBoost Encoder

__Libraries Used:__

1. [Category Encoders](https://contrib.scikit-learn.org/category_encoders/)
2. [Numpy](https://numpy.org/)
3. [Pandas](https://pandas.pydata.org/)
4. [Plotly](https://plotly.com/python/)

__Let's get started!__

First, let's install the libraries and import them.

In [1]:
%pip install -q category_encoders numpy pandas plotly

Note: you may need to restart the kernel to use updated packages.


In [2]:
import numpy as np
import pandas as pd
import plotly.express as px

# Import Encoders
from category_encoders import (
    OrdinalEncoder, OneHotEncoder, CountEncoder, BinaryEncoder, HashingEncoder,
    JamesSteinEncoder, MEstimateEncoder, CatBoostEncoder,
)

In [3]:
# Set seed to replicate the results shown in this notebook
np.random.seed(42)

### Create the Dataset

In [4]:
categories = ['Ronaldo', 'Messi', 'Neymar']

---
***NOTE***

[__Numpy Choice__](https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html) (_np.random.choice_) - Generates a random sample from a given 1-D array

[__Numpy Randint__](https://numpy.org/doc/stable/reference/random/generated/numpy.random.randint.html) (_np.random.randint_) - Returns random integers from low (inclusive) to high (exclusive)

---

In [5]:
df = pd.DataFrame({
    'player':  np.random.choice(categories, 10),
    'goals': np.random.randint(low=1, high=50, size=10)
})

In [6]:
df

Unnamed: 0,player,goals
0,Neymar,11
1,Ronaldo,11
2,Neymar,24
3,Neymar,36
4,Ronaldo,40
5,Ronaldo,24
6,Neymar,3
7,Messi,22
8,Neymar,2
9,Neymar,24


### Label Encoder

In [7]:
# Encode Player Column
df_encoded = OrdinalEncoder(cols=['player']).fit_transform(df)

In [8]:
# Get original player names
df_encoded['player_original'] = df['player']
# Reorder Elements
df_encoded = df_encoded[['player_original', *df_encoded.columns[:-1]]]

In [9]:
df_encoded

Unnamed: 0,player_original,player,goals
0,Neymar,1,11
1,Ronaldo,2,11
2,Neymar,1,24
3,Neymar,1,36
4,Ronaldo,2,40
5,Ronaldo,2,24
6,Neymar,1,3
7,Messi,3,22
8,Neymar,1,2
9,Neymar,1,24


__Summary__

Neymar gets mapped to 1

Ronaldo gets mapped to 2

Messi gets mapped to 3
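To make the mapping above concrete, here is a minimal plain-Python sketch that assigns integers in order of first appearance. It is only an illustration that happens to reproduce the codes shown in the table; the `players` list is copied from the dataset above, and the helper names are hypothetical rather than part of `category_encoders`.

```python
# Player column copied from the dataset shown earlier in this notebook.
players = ['Neymar', 'Ronaldo', 'Neymar', 'Neymar', 'Ronaldo',
           'Ronaldo', 'Neymar', 'Messi', 'Neymar', 'Neymar']

# Assign the next free integer (starting at 1) the first time a category appears.
mapping = {}
for name in players:
    if name not in mapping:
        mapping[name] = len(mapping) + 1

# Replace every player name with its integer code.
encoded = [mapping[name] for name in players]

print(mapping)
print(encoded)
```

Note that the integers carry no meaning beyond identity: Messi is not "three times" Neymar, so models that read the codes as magnitudes may be misled.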

### One-Hot Encoder

In [10]:
# Encode Player Column
# use_cat_names=True names each new column after its category (e.g. player_Neymar)
df_encoded = OneHotEncoder(cols=['player'], use_cat_names=True).fit_transform(df)

In [11]:
# Get original player names
df_encoded['player_original'] = df['player']
# Reorder Elements
df_encoded = df_encoded[['player_original', *df_encoded.columns[:-1]]]

In [12]:
df_encoded

Unnamed: 0,player_original,player_Neymar,player_Ronaldo,player_Messi,goals
0,Neymar,1,0,0,11
1,Ronaldo,0,1,0,11
2,Neymar,1,0,0,24
3,Neymar,1,0,0,36
4,Ronaldo,0,1,0,40
5,Ronaldo,0,1,0,24
6,Neymar,1,0,0,3
7,Messi,0,0,1,22
8,Neymar,1,0,0,2
9,Neymar,1,0,0,24
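As a rough illustration of what the table above contains, the following plain-Python sketch builds one indicator column per category, in order of first appearance. The variable names are hypothetical and the sketch ignores options the library offers, such as handling of unknown values.

```python
# Player column copied from the dataset shown earlier in this notebook.
players = ['Neymar', 'Ronaldo', 'Neymar', 'Neymar', 'Ronaldo',
           'Ronaldo', 'Neymar', 'Messi', 'Neymar', 'Neymar']

# Unique categories, keeping the order in which they first appear.
categories = list(dict.fromkeys(players))

# One row per observation, one 0/1 indicator per category.
one_hot = [[int(name == cat) for cat in categories] for name in players]

print(['player_' + cat for cat in categories])
for name, row in zip(players, one_hot):
    print(name, row)
```

Each row contains exactly one 1, so the number of columns grows with the number of distinct categories.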


### Frequency Encoder

In [13]:
# Encode Player Column
df_encoded = CountEncoder(cols=['player']).fit_transform(df)

In [14]:
# Get original player names
df_encoded['player_original'] = df['player']
# Reorder Elements
df_encoded = df_encoded[['player_original', *df_encoded.columns[:-1]]]

In [15]:
df_encoded

Unnamed: 0,player_original,player,goals
0,Neymar,6,11
1,Ronaldo,3,11
2,Neymar,6,24
3,Neymar,6,36
4,Ronaldo,3,40
5,Ronaldo,3,24
6,Neymar,6,3
7,Messi,1,22
8,Neymar,6,2
9,Neymar,6,24


In [16]:
# Encode Player Column with Normalization
df_encoded = CountEncoder(cols=['player'], normalize=True).fit_transform(df)

In [17]:
# Get original player names
df_encoded['player_original'] = df['player']
# Reorder Elements
df_encoded = df_encoded[['player_original', *df_encoded.columns[:-1]]]

Normalization simply divides each count by the total number of observations.

For example:

__Neymar__ 

$$\text{Count} = 6$$
$$\text{Total Observations} = 10$$
$$\therefore \text{Normalized Count} = \frac{6}{10} = 0.6$$


In [18]:
df_encoded

Unnamed: 0,player_original,player,goals
0,Neymar,0.6,11
1,Ronaldo,0.3,11
2,Neymar,0.6,24
3,Neymar,0.6,36
4,Ronaldo,0.3,40
5,Ronaldo,0.3,24
6,Neymar,0.6,3
7,Messi,0.1,22
8,Neymar,0.6,2
9,Neymar,0.6,24
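Both the raw and the normalized counts can be checked by hand. The sketch below uses only the standard library (`collections.Counter`); the variable names are illustrative.

```python
from collections import Counter

# Player column copied from the dataset shown earlier in this notebook.
players = ['Neymar', 'Ronaldo', 'Neymar', 'Neymar', 'Ronaldo',
           'Ronaldo', 'Neymar', 'Messi', 'Neymar', 'Neymar']

# How often each category occurs.
counts = Counter(players)

# Raw count encoding: replace each player with its count.
count_encoded = [counts[name] for name in players]

# Normalized encoding: divide each count by the number of observations.
freq_encoded = [counts[name] / len(players) for name in players]

print(dict(counts))
print(freq_encoded)
```

One thing to keep in mind is that two categories with the same count receive the same code, so the encoding is not always reversible.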


### Binary Encoder

In [19]:
df_encoded = BinaryEncoder(cols=['player']).fit_transform(df)

In [20]:
# Get original player names
df_encoded['player_original'] = df['player']
# Reorder Elements
df_encoded = df_encoded[['player_original', *df_encoded.columns[:-1]]]

In [21]:
df_encoded

Unnamed: 0,player_original,player_0,player_1,goals
0,Neymar,0,1,11
1,Ronaldo,1,0,11
2,Neymar,0,1,24
3,Neymar,0,1,36
4,Ronaldo,1,0,40
5,Ronaldo,1,0,24
6,Neymar,0,1,3
7,Messi,1,1,22
8,Neymar,0,1,2
9,Neymar,0,1,24
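The output above is consistent with a two-step recipe: give each category an integer code, then write that code in binary, one bit per column. The sketch below follows that recipe in plain Python under the assumption that codes start at 1 in order of first appearance; it does not model how the library treats unknown or missing values.

```python
# Player column copied from the dataset shown earlier in this notebook.
players = ['Neymar', 'Ronaldo', 'Neymar', 'Neymar', 'Ronaldo',
           'Ronaldo', 'Neymar', 'Messi', 'Neymar', 'Neymar']

# Step 1: ordinal codes in order of first appearance (assumed to start at 1).
mapping = {}
for name in players:
    mapping.setdefault(name, len(mapping) + 1)

# Step 2: number of bits needed to represent the largest code.
n_bits = max(mapping.values()).bit_length()

# Step 3: write each code as a fixed-width list of binary digits.
binary = [[int(bit) for bit in format(mapping[name], f'0{n_bits}b')]
          for name in players]

for name, bits in zip(players, binary):
    print(name, bits)
```

Three categories fit into two columns here, whereas one-hot encoding needs three; the gap widens as the number of categories grows.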


### Hashing Encoder

In [22]:
df_encoded = HashingEncoder(cols=['player']).fit_transform(df)

In [23]:
# Get original player names
df_encoded['player_original'] = df['player']
# Reorder Elements
df_encoded = df_encoded[['player_original', *df_encoded.columns[:-1]]]

In [24]:
df_encoded

Unnamed: 0,player_original,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7,goals
0,Neymar,0,1,0,0,0,0,0,0,11
1,Ronaldo,0,0,0,0,0,0,0,1,11
2,Neymar,0,1,0,0,0,0,0,0,24
3,Neymar,0,1,0,0,0,0,0,0,36
4,Ronaldo,0,0,0,0,0,0,0,1,40
5,Ronaldo,0,0,0,0,0,0,0,1,24
6,Neymar,0,1,0,0,0,0,0,0,3
7,Messi,0,0,1,0,0,0,0,0,22
8,Neymar,0,1,0,0,0,0,0,0,2
9,Neymar,0,1,0,0,0,0,0,0,24


In [25]:
# n_components sets the number of output columns (hash buckets); the default is 8
df_encoded = HashingEncoder(cols=['player'], n_components=3).fit_transform(df)

In [26]:
# Get original player names
df_encoded['player_original'] = df['player']
# Reorder Elements
df_encoded = df_encoded[['player_original', *df_encoded.columns[:-1]]]

In [27]:
df_encoded

Unnamed: 0,player_original,col_0,col_1,col_2,goals
0,Neymar,0,1,0,11
1,Ronaldo,1,0,0,11
2,Neymar,0,1,0,24
3,Neymar,0,1,0,36
4,Ronaldo,1,0,0,40
5,Ronaldo,1,0,0,24
6,Neymar,0,1,0,3
7,Messi,1,0,0,22
8,Neymar,0,1,0,2
9,Neymar,0,1,0,24


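Reminds you of one-hot encoding, doesn't it? Feature hashing is very similar to one-hot encoding, but it lets you control the number of output columns. The price is that different categories can collide: with `n_components=3` above, Ronaldo and Messi both land in `col_0`.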
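The idea behind feature hashing can be sketched in a few lines: hash the category string, take the result modulo the number of output columns, and set that column. The functions `hash_bucket` and `hash_encode` below are hypothetical helpers, and the choice of MD5 with a plain modulo is an assumption made for illustration, so the bucket assignments need not match the encoder output shown above.

```python
import hashlib

def hash_bucket(value, n_components):
    """Map a string to a column index in [0, n_components)."""
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % n_components

def hash_encode(values, n_components):
    """Return one row per value with a 1 in the hashed column."""
    rows = []
    for value in values:
        row = [0] * n_components
        row[hash_bucket(value, n_components)] += 1
        rows.append(row)
    return rows

# Player column copied from the dataset shown earlier in this notebook.
players = ['Neymar', 'Ronaldo', 'Neymar', 'Neymar', 'Ronaldo',
           'Ronaldo', 'Neymar', 'Messi', 'Neymar', 'Neymar']

rows = hash_encode(players, 3)
for name, row in zip(players, rows):
    print(name, row)
```

Because the column index depends only on the string, no fitted vocabulary has to be stored, and unseen categories are encoded without any special handling.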

_Moving onto Supervised Encoding Strategies..._ :)

### Target Encoding

In [28]:
# Reset DataFrame
# Target encoding by hand: replace each player with the mean of goals for that player
df_encoded = pd.DataFrame({"player": df.groupby("player")["goals"].transform("mean")})

In [29]:
df_encoded['goals'] = df['goals']
df_encoded['player_original'] = df['player']

In [30]:
# Reorder Elements
df_encoded = df_encoded[['player_original', *df_encoded.columns[:-1]]]

In [31]:
df_encoded

Unnamed: 0,player_original,player,goals
0,Neymar,16.666667,11
1,Ronaldo,25.0,11
2,Neymar,16.666667,24
3,Neymar,16.666667,36
4,Ronaldo,25.0,40
5,Ronaldo,25.0,24
6,Neymar,16.666667,3
7,Messi,22.0,22
8,Neymar,16.666667,2
9,Neymar,16.666667,24


### James-Stein Encoding

$$JS_i = (1 - B) \cdot \operatorname{mean}(y_i) + B \cdot \operatorname{mean}(y)$$

$$\text{where } B = \frac{\operatorname{var}(y_i)}{\operatorname{var}(y_i) + \operatorname{var}(y)}$$

For feature value $i$, the James-Stein estimator returns a weighted average of:

1. The mean target value for the observed feature value $i$, $\operatorname{mean}(y_i)$
2. The mean target value regardless of the feature value, $\operatorname{mean}(y)$
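To see how the weighting behaves, the following sketch applies the formula above literally to the dataset from this notebook. Using the population variance (`statistics.pvariance`) is an assumption made here for illustration; the library may estimate the variances differently, so these numbers should not be expected to equal the encoder's output.

```python
from statistics import mean, pvariance

# Columns copied from the dataset shown earlier in this notebook.
players = ['Neymar', 'Ronaldo', 'Neymar', 'Neymar', 'Ronaldo',
           'Ronaldo', 'Neymar', 'Messi', 'Neymar', 'Neymar']
goals = [11, 11, 24, 36, 40, 24, 3, 22, 2, 24]

# Statistics of the target regardless of the feature value.
global_mean = mean(goals)
global_var = pvariance(goals)  # assumption: population variance

# Collect the target values observed for each player.
by_player = {}
for name, g in zip(players, goals):
    by_player.setdefault(name, []).append(g)

# Apply JS_i = (1 - B) * mean(y_i) + B * mean(y) per player.
js = {}
for name, ys in by_player.items():
    group_var = pvariance(ys)
    B = group_var / (group_var + global_var)
    js[name] = (1 - B) * mean(ys) + B * global_mean

print(js)
```

Under this reading, a category whose targets vary a lot is pulled towards the global mean, while a category with zero variance (here Messi, with a single observation) keeps its own mean.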

In [35]:
df_encoded = JamesSteinEncoder().fit_transform(X=df, y=df['goals'])

In [36]:
# Get original player names
df_encoded['player_original'] = df['player']
# Reorder Elements
df_encoded = df_encoded[['player_original', *df_encoded.columns[:-1]]]

In [100]:
df_encoded

Unnamed: 0,player_original,player,goals
0,Neymar,16.666667,11
1,Ronaldo,25.0,11
2,Neymar,16.666667,24
3,Neymar,16.666667,36
4,Ronaldo,25.0,40
5,Ronaldo,25.0,24
6,Neymar,16.666667,3
7,Messi,22.0,22
8,Neymar,16.666667,2
9,Neymar,16.666667,24


### M-Estimate Encoding

<img src="./Assets/Images/mest.png" width= 30% height= auto />

In [101]:
df_encoded = MEstimateEncoder().fit_transform(X=df, y=df['goals'])

In [102]:
# Get original player names
df_encoded['player_original'] = df['player']
# Reorder Elements
df_encoded = df_encoded[['player_original', *df_encoded.columns[:-1]]]

In [103]:
df_encoded

Unnamed: 0,player_original,player,goals
0,Neymar,17.1,11
1,Ronaldo,23.675,11
2,Neymar,17.1,24
3,Neymar,17.1,36
4,Ronaldo,23.675,40
5,Ronaldo,23.675,24
6,Neymar,17.1,3
7,Messi,20.85,22
8,Neymar,17.1,2
9,Neymar,17.1,24
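The values above are consistent with the usual M-estimate form, (sum of the category's targets + m × global mean) / (category count + m), with m = 1. The sketch below recomputes them in plain Python; the variable names are illustrative and m = 1 is assumed to be the setting in effect.

```python
# Columns copied from the dataset shown earlier in this notebook.
players = ['Neymar', 'Ronaldo', 'Neymar', 'Neymar', 'Ronaldo',
           'Ronaldo', 'Neymar', 'Messi', 'Neymar', 'Neymar']
goals = [11, 11, 24, 36, 40, 24, 3, 22, 2, 24]

m = 1.0                          # smoothing strength (assumed)
prior = sum(goals) / len(goals)  # global mean of the target

# Accumulate [sum of goals, number of rows] per player.
stats = {}
for name, g in zip(players, goals):
    entry = stats.setdefault(name, [0, 0])
    entry[0] += g
    entry[1] += 1

# Smoothed mean: m "virtual" observations at the global mean are added to each group.
m_estimate = {name: (total + m * prior) / (count + m)
              for name, (total, count) in stats.items()}

print(m_estimate)
```

A larger m pulls every category closer to the global mean, and the effect is strongest for rare categories such as Messi.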


### CatBoost Encoding

<img src="./Assets/Images/catboost.png" width= 20% height= auto />

In [45]:
df_encoded = CatBoostEncoder().fit_transform(X=df, y=df['goals'])

In [46]:
# Get original player names
df_encoded['player_original'] = df['player']
# Reorder Elements
df_encoded = df_encoded[['player_original', *df_encoded.columns[:-1]]]

In [47]:
df_encoded

Unnamed: 0,player_original,player,goals
0,Neymar,19.7,11
1,Ronaldo,19.7,11
2,Neymar,15.35,24
3,Neymar,18.233333,36
4,Ronaldo,15.35,40
5,Ronaldo,23.566667,24
6,Neymar,22.675,3
7,Messi,19.7,22
8,Neymar,18.74,2
9,Neymar,15.95,24
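Unlike the previous encoders, the same player receives different values in different rows here. The output is consistent with an ordered target statistic: each row is encoded using only the rows that come before it, as (sum of earlier targets for that player + a × global mean) / (number of earlier rows for that player + a), with a = 1. The sketch below recomputes the column under that assumption; the variable names are illustrative.

```python
# Columns copied from the dataset shown earlier in this notebook.
players = ['Neymar', 'Ronaldo', 'Neymar', 'Neymar', 'Ronaldo',
           'Ronaldo', 'Neymar', 'Messi', 'Neymar', 'Neymar']
goals = [11, 11, 24, 36, 40, 24, 3, 22, 2, 24]

a = 1.0                          # smoothing strength (assumed)
prior = sum(goals) / len(goals)  # global mean of the target

running = {}   # player -> (sum of goals so far, rows seen so far)
encoded = []
for name, g in zip(players, goals):
    total, count = running.get(name, (0, 0))
    # Encode with earlier rows only, so the row's own target is never used.
    encoded.append((total + a * prior) / (count + a))
    # Only now add the current row to the running statistics.
    running[name] = (total + g, count + 1)

for name, value in zip(players, encoded):
    print(name, round(value, 6))
```

Since a row never sees its own target, this scheme reduces target leakage during training, but it also makes the result depend on row order.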
