# Encodings 101

This notebook showcases different strategies for encoding the categorical variables in your dataset as numeric values.

We will go through the following _Encoding Strategies_:

__Unsupervised -__ _Don’t make use of the target variable to encode categorical variables_
1. Label Encoder
2. One-Hot Encoder
3. Frequency Encoder
4. Binary Encoder
5. Hashing Encoder

__Supervised -__ _Employ the target variable to encode categorical variables._
1. Target Encoder
2. James-Stein Encoder
3. M-Estimate Encoder
4. Weight of Evidence Encoder
5. CatBoost Encoder

__Libraries Used:__

1. [Category Encoders](https://contrib.scikit-learn.org/category_encoders/)
2. [NumPy](https://numpy.org/)
3. [Pandas](https://pandas.pydata.org/)
4. [Plotly](https://plotly.com/python/)

__Let's get started!__

First, let's install the libraries and import them.

In [96]:
%pip install -q category_encoders numpy pandas plotly

Note: you may need to restart the kernel to use updated packages.


In [97]:
import numpy as np
import pandas as pd
import plotly.express as px

# Import Encoders
from category_encoders import OrdinalEncoder, OneHotEncoder, CountEncoder, BinaryEncoder, HashingEncoder, \
                              TargetEncoder, JamesSteinEncoder, MEstimateEncoder, WOEEncoder, CatBoostEncoder

In [98]:
# Set seed to replicate the results shown in this notebook
np.random.seed(42)

### Create the Dataset

In [99]:
categories = ['Ronaldo', 'Messi', 'Neymar']

---
***NOTE***

[__Numpy Choice__](https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html) (_np.random.choice_) - Generates a random sample from a given 1-D array

[__Numpy Randint__](https://numpy.org/doc/stable/reference/random/generated/numpy.random.randint.html) (_np.random.randint_) - Returns random integers from low (inclusive) to high (exclusive)

---
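As a quick illustration of the two functions in the note, here is a minimal sketch. The seed value `0`, the sample size of 5, and the `*_again` variable names are arbitrary choices for this example and are not part of the notebook's dataset.

```python
import numpy as np

categories = ['Ronaldo', 'Messi', 'Neymar']

# Seed the legacy global generator so the draws below are repeatable
np.random.seed(0)
players = np.random.choice(categories, 5)          # 5 draws, with replacement
goals = np.random.randint(low=1, high=50, size=5)  # integers in [1, 50)

# Re-seeding with the same value replays exactly the same draws
np.random.seed(0)
players_again = np.random.choice(categories, 5)
goals_again = np.random.randint(low=1, high=50, size=5)

print(players, goals)
print((players == players_again).all(), (goals == goals_again).all())
```

Because `high` is exclusive, `goals` can never contain 50, and every entry of `players` comes from `categories`.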

In [100]:
df = pd.DataFrame({
    'player':  np.random.choice(categories, 10),
    'goals': np.random.randint(low=1, high=50, size=10)
})

In [101]:
df

| | player | goals |
|---|---|---|
| 0 | Neymar | 11 |
| 1 | Ronaldo | 11 |
| 2 | Neymar | 24 |
| 3 | Neymar | 36 |
| 4 | Ronaldo | 40 |
| 5 | Ronaldo | 24 |
| 6 | Neymar | 3 |
| 7 | Messi | 22 |
| 8 | Neymar | 2 |
| 9 | Neymar | 24 |


### Label Encoder

In [102]:
# Encode Player Column
df_encoded = OrdinalEncoder(cols=['player']).fit_transform(df)

In [103]:
# Get original player names
df_encoded['player_original'] = df['player']
# Reorder columns so that player_original comes first
df_encoded = df_encoded[['player_original', *df_encoded.columns[:-1]]]

In [104]:
df_encoded

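| | player_original | player | goals |
|---|---|---|---|
| 0 | Neymar | 1 | 11 |
| 1 | Ronaldo | 2 | 11 |
| 2 | Neymar | 1 | 24 |
| 3 | Neymar | 1 | 36 |
| 4 | Ronaldo | 2 | 40 |
| 5 | Ronaldo | 2 | 24 |
| 6 | Neymar | 1 | 3 |
| 7 | Messi | 3 | 22 |
| 8 | Neymar | 1 | 2 |
| 9 | Neymar | 1 | 24 |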


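__Summary__

The label encoding here is done with `OrdinalEncoder`, which numbers the categories in order of first appearance, starting at 1:

Neymar gets mapped to 1

Ronaldo gets mapped to 2

Messi gets mapped to 3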
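The mapping above is consistent with numbering the categories in order of first appearance. The sketch below reproduces it with plain pandas. Note that `pd.factorize` is a stand-in chosen for illustration, not what `OrdinalEncoder` runs internally, and the `players` series is simply the `player` column from the table, typed out by hand.

```python
import pandas as pd

# The player column from the dataset above, typed out by hand
players = pd.Series(['Neymar', 'Ronaldo', 'Neymar', 'Neymar', 'Ronaldo',
                     'Ronaldo', 'Neymar', 'Messi', 'Neymar', 'Neymar'])

# factorize numbers categories 0, 1, 2, ... in order of first appearance;
# adding 1 reproduces the 1-based codes shown in the encoded table
codes, uniques = pd.factorize(players)
label_encoded = pd.Series(codes + 1, index=players.index)

# Category -> integer lookup, analogous to the summary above
mapping = {name: i + 1 for i, name in enumerate(uniques)}
print(mapping)
```

The integers carry no meaning beyond identity: Messi is not "three times" Neymar, which is worth keeping in mind before feeding label-encoded columns to models that treat inputs as ordered quantities.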

### One-Hot Encoder

In [105]:
# Encode Player Column
# use_cat_names=True appends the category name to each new column name
df_encoded = OneHotEncoder(cols=['player'], use_cat_names=True).fit_transform(df)

In [106]:
# Get original player names
df_encoded['player_original'] = df['player']
# Reorder columns so that player_original comes first
df_encoded = df_encoded[['player_original', *df_encoded.columns[:-1]]]

In [107]:
df_encoded

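| | player_original | player_Neymar | player_Ronaldo | player_Messi | goals |
|---|---|---|---|---|---|
| 0 | Neymar | 1 | 0 | 0 | 11 |
| 1 | Ronaldo | 0 | 1 | 0 | 11 |
| 2 | Neymar | 1 | 0 | 0 | 24 |
| 3 | Neymar | 1 | 0 | 0 | 36 |
| 4 | Ronaldo | 0 | 1 | 0 | 40 |
| 5 | Ronaldo | 0 | 1 | 0 | 24 |
| 6 | Neymar | 1 | 0 | 0 | 3 |
| 7 | Messi | 0 | 0 | 1 | 22 |
| 8 | Neymar | 1 | 0 | 0 | 2 |
| 9 | Neymar | 1 | 0 | 0 | 24 |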
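For comparison, a similar one-hot layout can be sketched with pandas alone. This is an alternative shown for illustration, not the method used in this notebook: `pd.get_dummies` orders the new columns alphabetically and places them after `goals`, so the column order differs from the table above. The DataFrame is the dataset from earlier, typed out by hand.

```python
import pandas as pd

# The dataset from above, typed out by hand
df = pd.DataFrame({
    'player': ['Neymar', 'Ronaldo', 'Neymar', 'Neymar', 'Ronaldo',
               'Ronaldo', 'Neymar', 'Messi', 'Neymar', 'Neymar'],
    'goals': [11, 11, 24, 36, 40, 24, 3, 22, 2, 24],
})

# One indicator column per category; dtype=int yields 0/1 instead of booleans
one_hot = pd.get_dummies(df, columns=['player'], dtype=int)

# Every row has exactly one indicator set
indicator_cols = ['player_Messi', 'player_Neymar', 'player_Ronaldo']
print(one_hot[indicator_cols].sum(axis=1).unique())
print(one_hot.head())
```

Each category becomes its own column, so the width of the encoded frame grows with the number of distinct categories.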


### Frequency Encoder

In [113]:
# Encode Player Column
df_encoded = CountEncoder(cols=['player']).fit_transform(df)

In [114]:
# Get original player names
df_encoded['player_original'] = df['player']
# Reorder columns so that player_original comes first
df_encoded = df_encoded[['player_original', *df_encoded.columns[:-1]]]

In [115]:
df_encoded

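| | player_original | player | goals |
|---|---|---|---|
| 0 | Neymar | 6 | 11 |
| 1 | Ronaldo | 3 | 11 |
| 2 | Neymar | 6 | 24 |
| 3 | Neymar | 6 | 36 |
| 4 | Ronaldo | 3 | 40 |
| 5 | Ronaldo | 3 | 24 |
| 6 | Neymar | 6 | 3 |
| 7 | Messi | 1 | 22 |
| 8 | Neymar | 6 | 2 |
| 9 | Neymar | 6 | 24 |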
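The counts in the table can be cross-checked with a short pandas-only sketch: count how often each category occurs, then replace every value with its count. This is an illustrative equivalent, not the internals of `CountEncoder`, and the `players` series is the `player` column typed out by hand.

```python
import pandas as pd

# The player column from the dataset above, typed out by hand
players = pd.Series(['Neymar', 'Ronaldo', 'Neymar', 'Neymar', 'Ronaldo',
                     'Ronaldo', 'Neymar', 'Messi', 'Neymar', 'Neymar'],
                    name='player')

# Number of rows per category
counts = players.value_counts()

# Replace each category with the number of times it occurs
count_encoded = players.map(counts)

print(counts.to_dict())
print(count_encoded.tolist())
```

Two categories that happen to occur equally often would receive the same encoded value, which is a known limitation of this strategy.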


In [116]:
# Encode Player Column with Normalization
df_encoded = CountEncoder(cols=['player'], normalize=True).fit_transform(df)

In [117]:
# Get original player names
df_encoded['player_original'] = df['player']
# Reorder columns so that player_original comes first
df_encoded = df_encoded[['player_original', *df_encoded.columns[:-1]]]

Normalization simply divides each count by the total number of observations.

For example, for __Neymar__:

$$\text{Count} = 6$$
$$\text{Total Observations} = 10$$
$$\therefore \text{Normalized Count} = \frac{6}{10} = 0.6$$
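The same arithmetic can be checked for every category at once. The sketch below is a pandas-only illustration rather than the `CountEncoder` implementation; `shares` is a name introduced here, and the `players` series is the `player` column typed out by hand.

```python
import pandas as pd

# The player column from the dataset above, typed out by hand
players = pd.Series(['Neymar', 'Ronaldo', 'Neymar', 'Neymar', 'Ronaldo',
                     'Ronaldo', 'Neymar', 'Messi', 'Neymar', 'Neymar'],
                    name='player')

# Count per category divided by the total number of observations
counts = players.value_counts()
shares = counts / len(players)

# pandas offers the same computation directly
shares_builtin = players.value_counts(normalize=True)

print(shares.to_dict())
print(round(shares.sum(), 10))
```

The normalized values are the relative frequencies of the categories, so they always sum to 1 regardless of dataset size.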


In [118]:
df_encoded

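| | player_original | player | goals |
|---|---|---|---|
| 0 | Neymar | 0.6 | 11 |
| 1 | Ronaldo | 0.3 | 11 |
| 2 | Neymar | 0.6 | 24 |
| 3 | Neymar | 0.6 | 36 |
| 4 | Ronaldo | 0.3 | 40 |
| 5 | Ronaldo | 0.3 | 24 |
| 6 | Neymar | 0.6 | 3 |
| 7 | Messi | 0.1 | 22 |
| 8 | Neymar | 0.6 | 2 |
| 9 | Neymar | 0.6 | 24 |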
