# 0. Introduction 
In this notebook, I will introduce some basic encoding schemes in an easy-to-follow way. The target audience is people who are new to machine learning and want quick access to these techniques. For convenience, I will also show which library (pandas, sklearn, or category_encoders) implements each method.

By the end of this post, I hope that you will have a better idea of how to apply different encoding schemes.

For best results, read each section in this order: data frame -> description -> code.

There are 3 main routes to encode categorical features.

**Classic Encoders:** Ordinal, OneHot, Binary, Frequency, Hashing

**Contrast Encoders:** Helmert, Backward Difference

**Bayesian Encoders:** Target, Leave One Out, Weight Of Evidence, James-Stein, M-estimator

And there are many more! However, once you know how these most common encoding schemes work, you will find it fairly easy to look up the others.

Let's create a random pokemon data set!

In [2]:
%pip install category_encoders

import pandas as pd
import numpy as np
from category_encoders import *


data={'Type':['Fire','Water','Bug', 'Fire', 'Fire','Bug','Water','Bug','Ice'],
      'Height':['Short','Normal','Very short','Tall','Normal','Short','Tall','Very short','Tall'],
      'Stats_total':[495,525,195,580, 525,500,670,405,580],
      'Legendary':[0,0,0,1,0,0,1,0,1]}
df_main=pd.DataFrame(data)
df_main


Unnamed: 0,Type,Height,Stats_total,Legendary
0,Fire,Short,495,0
1,Water,Normal,525,0
2,Bug,Very short,195,0
3,Fire,Tall,580,1
4,Fire,Normal,525,0
5,Bug,Short,500,0
6,Water,Tall,670,1
7,Bug,Very short,405,0
8,Ice,Tall,580,1


We are going to examine the categorical columns of this data set by applying different encoders.

# I. Classic Encoders

As the name suggests, classic encoders are well known and widely used. Their concepts are also pretty straightforward.

# 1) Ordinal Encoding

"Ordinal" means ordered, so this only works on the ordinal feature. 
Most of the time, the unique values in an ordinal column are strings written in a human language. Thus, we need to manually assign a numerical ranking according to their order.

In [2]:
df=df_main.copy()
height_dict ={'Very short':1, 'Short':2, 'Normal':3, 'Tall':4}
df['Ordinal_Height']=df.Height.map(height_dict)
df[['Height','Ordinal_Height']]

Unnamed: 0,Height,Ordinal_Height
0,Short,2
1,Normal,3
2,Very short,1
3,Tall,4
4,Normal,3
5,Short,2
6,Tall,4
7,Very short,1
8,Tall,4
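
For the sklearn version, here is a minimal sketch using `OrdinalEncoder`. The small `heights` frame is a stand-in for `df_main`, and the explicit `order` list is my own assumption about the ranking; without it, sklearn sorts the categories alphabetically.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Stand-in data (hypothetical): one row per height level
heights = pd.DataFrame({'Height': ['Short', 'Normal', 'Very short', 'Tall']})

# Pass the categories in their real-world order, lowest first
order = [['Very short', 'Short', 'Normal', 'Tall']]
oe = OrdinalEncoder(categories=order)

# The encoder is 0-based, so add 1 to match the manual ranking above
codes = oe.fit_transform(heights[['Height']]).ravel().astype(int) + 1
heights['Ordinal_Height'] = codes
```

The result should agree with the `height_dict` mapping used in the cell above.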


# 2) One-hot encoding
One-hot encoding can be explained as a 2-step process:

* Split all the categories in one column into separate columns

* Put a check mark '1' in the column that matches the row's category.

The `get_dummies` function in pandas can do the job.

In [3]:
df=df_main.copy()
df_Height=pd.get_dummies(df[['Height']],prefix='T')

pd.concat([df[['Height']],df_Height],axis=1).head()

Unnamed: 0,Height,T_Normal,T_Short,T_Tall,T_Very short
0,Short,0,1,0,0
1,Normal,1,0,0,0
2,Very short,0,0,0,1
3,Tall,0,0,1,0
4,Normal,1,0,0,0


Sklearn can do a similar thing:  
(*I still prefer using get_dummies since it gives us a nicer label.*)

In [4]:
from sklearn.preprocessing import OneHotEncoder
df=df_main.copy()

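ohe=OneHotEncoder()
encoded=ohe.fit_transform(df[['Height']]).toarray() #fit_transform returns a sparse matrix, so convert it to a dense array
newdata=pd.DataFrame(encoded) #keep the fitted encoder in ohe instead of overwriting it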

dfh=pd.concat([df[['Height']],newdata],axis=1)
dfh.head()

Unnamed: 0,Height,0,1,2,3
0,Short,0.0,1.0,0.0,0.0
1,Normal,1.0,0.0,0.0,0.0
2,Very short,0.0,0.0,0.0,1.0
3,Tall,0.0,0.0,1.0,0.0
4,Normal,1.0,0.0,0.0,0.0


# 3) Binary Encoding

This encoding may not be what its name suggests.

There are 3 steps:
* Going down the column, every time the encoder sees a new category, it assigns the next integer, starting from 1 (so the next new category gets 2, and so on)
* Convert these numbers into binary
* Place each digit of the binary number in a separate column.

Imagine that you have 200 different categories. One-hot encoding will create 200 different columns. Binary encoding, meanwhile, only needs 8 columns (since 200 is 11001000 in base 2).

In the code below, I will add the encounter step column so you can see how it works.

**Note**: People sometimes refer to one-hot encoding as binary encoding.

In [5]:
from category_encoders import BinaryEncoder
df=df_main.copy()

be=BinaryEncoder(cols=['Type'])
newdata=be.fit_transform(df['Type'])

EncounterStep= pd.DataFrame([1,2,3,1,1,3,2,3,4],columns=["EncounterStep"]) #Check for yourself that these numbers are correct
dfh=pd.concat([df[['Type']],EncounterStep,newdata],axis=1)
dfh

Unnamed: 0,Type,EncounterStep,Type_0,Type_1,Type_2
0,Fire,1,0,0,1
1,Water,2,0,1,0
2,Bug,3,0,1,1
3,Fire,1,0,0,1
4,Fire,1,0,0,1
5,Bug,3,0,1,1
6,Water,2,0,1,0
7,Bug,3,0,1,1
8,Ice,4,1,0,0
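
To make the three steps concrete, here is a hand-rolled sketch in plain pandas. It is not how `BinaryEncoder` is implemented internally, just an illustration that happens to reproduce the table above; the `Type_i` column names are copied from that output.

```python
import pandas as pd

types = pd.Series(['Fire', 'Water', 'Bug', 'Fire', 'Fire',
                   'Bug', 'Water', 'Bug', 'Ice'], name='Type')

# Step 1: number the categories in order of first appearance, starting from 1
codes, uniques = pd.factorize(types)
steps = codes + 1

# Step 2: how many binary digits does the largest number need?
n_bits = int(steps.max()).bit_length()

# Step 3: one column per binary digit, most significant digit first
manual = pd.DataFrame(
    {f'Type_{i}': (steps >> (n_bits - 1 - i)) & 1 for i in range(n_bits)}
)

# The same arithmetic explains the 200-category example: 8 digits are enough
bits_for_200 = (200).bit_length()
```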


# 4) Frequency Encoding

Replace each category with its relative frequency (occurrences / total number of rows).

In [6]:
df=df_main.copy()

dfTemp=df.groupby("Type").size()/len(df) #Group by type, count the rows of each type, and divide by the total number of rows
df['Type_freq']=df['Type'].map(dfTemp) #dfTemp is a pandas Series indexed by Type

pd.concat([df[['Type']],df['Type_freq']],axis=1)

Unnamed: 0,Type,Type_freq
0,Fire,0.333333
1,Water,0.222222
2,Bug,0.333333
3,Fire,0.333333
4,Fire,0.333333
5,Bug,0.333333
6,Water,0.222222
7,Bug,0.333333
8,Ice,0.111111
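
An equivalent and slightly shorter sketch uses `value_counts(normalize=True)`, which returns the relative frequencies directly. The `types` Series is a stand-in for `df_main['Type']`.

```python
import pandas as pd

types = pd.Series(['Fire', 'Water', 'Bug', 'Fire', 'Fire',
                   'Bug', 'Water', 'Bug', 'Ice'], name='Type')

# Relative frequency of every category, indexed by category name
freq = types.value_counts(normalize=True)

# Look up each row's category in the frequency table
type_freq = types.map(freq)
```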


# 5) Hashing Encoding

Hashing converts categorical variables to a higher dimensional space of integers. I won't comment on the methodology much here since [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html "reference") explains it very well.

`n_features` is the number of columns you want to add. These new columns distinguish the corresponding category, although different categories can collide when `n_features` is small. Unlike binary encoding, you can set `n_features` to any number. This is like binary encoding on steroids! 

**Advantage**
* Deals with large scale categorical features
* High speed and reduced memory usage

**Disadvantage**
* No inverse-transformation method

In [7]:
from sklearn.feature_extraction import FeatureHasher
df=df_main.copy()

fg = FeatureHasher(n_features=2, input_type='string')
hashed_features = fg.fit_transform(df['Type'])
hashed_features = hashed_features.toarray()

df=pd.concat([df[['Type']], pd.DataFrame(hashed_features)], axis=1)
df

Unnamed: 0,Type,0,1
0,Fire,-1.0,1.0
1,Water,0.0,1.0
2,Bug,-1.0,0.0
3,Fire,-1.0,1.0
4,Fire,-1.0,1.0
5,Bug,-1.0,0.0
6,Water,0.0,1.0
7,Bug,-1.0,0.0
8,Ice,1.0,0.0
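
One thing to watch out for: with `input_type='string'`, `FeatureHasher` treats every sample as an iterable of strings. A bare string is therefore read character by character, and as far as I know recent scikit-learn versions reject bare strings with an error. The sketch below wraps each category in a one-element list so the whole name is hashed as a single token. `n_features=8` is an arbitrary choice for illustration, and the `types` Series is a stand-in for `df_main['Type']`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

types = pd.Series(['Fire', 'Water', 'Bug', 'Fire', 'Fire',
                   'Bug', 'Water', 'Bug', 'Ice'], name='Type')

hasher = FeatureHasher(n_features=8, input_type='string')

# Each sample is a list holding exactly one token: the category name
samples = [[t] for t in types]
hashed = hasher.transform(samples).toarray()

# Every row now has a single +1 or -1 in the column chosen by the hash
hashed_df = pd.concat([types, pd.DataFrame(hashed)], axis=1)
```

Rows with the same category always get the same hashed vector, but two different categories may land in the same column when `n_features` is small.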


# II. Contrast encoders
Contrast coding allows for recentering of categorical variables such that the intercept of a model is not the mean of one level of a category, but instead, the mean of all data points in the data set.

Many people argue that these encodings are not very effective, so I won't talk a lot about them.

# 1) Helmert (reverse) Encoding
Reverse Helmert coding compares each level of a categorical variable to the mean of the previous levels (plain Helmert coding compares it to the mean of the subsequent levels).
More about this [here](https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/#HELMERT).


In [8]:
from category_encoders import HelmertEncoder
df=df_main.sample(5)

he=HelmertEncoder(cols=['Height'])
newcolumn=he.fit_transform(df['Height'])

df=pd.concat([df[['Height']],newcolumn],axis=1)
df



Unnamed: 0,Height,intercept,Height_0,Height_1,Height_2
6,Tall,1,-1.0,-1.0,-1.0
2,Very short,1,1.0,-1.0,-1.0
7,Very short,1,1.0,-1.0,-1.0
4,Normal,1,0.0,2.0,-1.0
0,Short,1,0.0,0.0,3.0
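
The numbers in the table come from a fixed contrast matrix with one row per level. Below is a small sketch that builds this matrix by hand; `reverse_helmert` is a hypothetical helper of mine, not a category_encoders function. As the table suggests, the levels appear to be numbered in order of first appearance (Tall, Very short, Normal, Short in this sample).

```python
import numpy as np

def reverse_helmert(n_levels):
    """Contrast matrix with one row per level and n_levels - 1 columns."""
    m = np.zeros((n_levels, n_levels - 1))
    for j in range(n_levels - 1):
        m[:j + 1, j] = -1      # all previous levels get -1
        m[j + 1, j] = j + 1    # the compared level gets the count of previous levels
    return m

helmert_4 = reverse_helmert(4)
```

Each column sums to zero, which is what makes it a contrast.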


# 2) Backward Difference Encoding

In backward difference coding, the mean of the dependent variable for a level is compared with the mean of the dependent variable for the prior level. [Read more](http://www.statsmodels.org/dev/contrasts.html)


In [9]:
from category_encoders import BackwardDifferenceEncoder
df=df_main.sample(5)

bwde = BackwardDifferenceEncoder()
newcolumns=bwde.fit_transform(df['Type'])

pd.concat([df[['Type']],newcolumns],axis=1)



Unnamed: 0,Type,intercept,Type_0,Type_1,Type_2
8,Ice,1,-0.75,-0.5,-0.25
7,Bug,1,0.25,-0.5,-0.25
2,Bug,1,0.25,-0.5,-0.25
1,Water,1,0.25,0.5,-0.25
3,Fire,1,0.25,0.5,0.75
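
As with Helmert, the values come from a fixed contrast matrix. The sketch below rebuilds it for any number of levels; `backward_difference` is a hypothetical helper written for this illustration, and it reproduces the four distinct rows of the table above.

```python
import numpy as np

def backward_difference(n_levels):
    """Contrast matrix with one row per level and n_levels - 1 columns."""
    k = n_levels
    m = np.empty((k, k - 1))
    for j in range(k - 1):
        m[:j + 1, j] = -(k - j - 1) / k   # levels up to and including level j
        m[j + 1:, j] = (j + 1) / k        # levels after level j
    return m

bd_4 = backward_difference(4)
```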


# III. Bayesian Target Encoders

The general idea of this method is to take the target into account. 

**Advantages:** 

* Requires minimal effort, since it creates only one column no matter how many categories the feature has

* A favorite encoding scheme in Kaggle competitions

**Disadvantages:**

* Only works for supervised learning, since it needs a target. Because the encoding is computed from the target, it is inherently leaky, and performance on unseen data can suffer

* Needs regularization for the previous reason

# 1) Target Encoding
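The basic idea is:
$$TE_i=\frac{\text{total true}(y_i)}{\text{total}(y_i)}\cdot \lambda$$

where $y_i$ is a category, $\text{total true}(y_i)$ is the number of rows in that category whose target is 1, $\text{total}(y_i)$ is the number of rows in that category, and $\lambda$ is a smoothing function (for more, search for additive or Laplace smoothing).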

Let's compare the table without the smoothing function...

In [10]:
df=df_main.copy()

mean_encode=df.groupby("Type")['Legendary'].mean()
df['Type_legendary']=df['Type'].map(mean_encode)

df[['Type','Legendary','Type_legendary']]

Unnamed: 0,Type,Legendary,Type_legendary
0,Fire,0,0.333333
1,Water,0,0.5
2,Bug,0,0.0
3,Fire,1,0.333333
4,Fire,0,0.333333
5,Bug,0,0.0
6,Water,1,0.5
7,Bug,0,0.0
8,Ice,1,1.0


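...and the category_encoders `TargetEncoder` with smoothing.

**Note:** The bigger the `smoothing` coefficient, the stronger our regularization. Older versions of category_encoders used `smoothing=1` by default; the version used here appears to default to `smoothing=10` together with `min_samples_leaf=20`, which is consistent with the numbers in the table below.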

In [11]:
from category_encoders import TargetEncoder
df=df_main.copy()

TE = TargetEncoder(cols=['Type'])
df['Type_legendary']=TE.fit_transform(df['Type'],df['Legendary'])

df[['Type','Legendary','Type_legendary']]

Unnamed: 0,Type,Legendary,Type_legendary
0,Fire,0,0.333333
1,Water,0,0.356975
2,Bug,0,0.281845
3,Fire,1,0.333333
4,Fire,0,0.333333
5,Bug,0,0.281845
6,Water,1,0.356975
7,Bug,0,0.281845
8,Ice,1,0.420072
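
To see where these numbers come from, here is a sketch that blends each category mean with the global mean using a sigmoid weight. The formula and the values `min_samples_leaf=20` and `smoothing=10` are my assumptions about the encoder's defaults; I chose them because they reproduce the table above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Type': ['Fire', 'Water', 'Bug', 'Fire', 'Fire', 'Bug', 'Water', 'Bug', 'Ice'],
    'Legendary': [0, 0, 0, 1, 0, 0, 1, 0, 1],
})

min_samples_leaf, smoothing = 20, 10   # assumed defaults

prior = df['Legendary'].mean()         # global mean of the target
stats = df.groupby('Type')['Legendary'].agg(['count', 'mean'])

# Sigmoid weight: close to 1 for big categories, close to 0 for tiny ones
weight = 1 / (1 + np.exp(-(stats['count'] - min_samples_leaf) / smoothing))

# Blend the category mean with the global mean
encoding = prior * (1 - weight) + stats['mean'] * weight
df['Type_legendary'] = df['Type'].map(encoding)
```

With only 9 rows, every category is far below `min_samples_leaf`, so all values are pulled strongly towards the global mean of 1/3.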


# 2) Leave One Out Encoding
This is very similar to target encoding but excludes the current row’s target when calculating the mean target for a level to reduce the effect of outliers.

Additionally, you can add some (Gaussian) noise to the data to prevent overfitting. Changing the `sigma` parameter from its default (no noise) to a value between 0 and 1 does the trick.


In [12]:
from category_encoders import LeaveOneOutEncoder
df=df_main.copy()

LOOE = LeaveOneOutEncoder(cols=['Type'], sigma=0.2)
df['Type_legendary']=LOOE.fit_transform(df['Type'], df['Legendary'])

df[['Type','Legendary','Type_legendary']]

Unnamed: 0,Type,Legendary,Type_legendary
0,Fire,0,0.555986
1,Water,0,0.85832
2,Bug,0,0.0
3,Fire,1,0.0
4,Fire,0,0.418259
5,Bug,0,0.0
6,Water,1,0.0
7,Bug,0,0.0
8,Ice,1,0.264134
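
Without the noise, the calculation is easy to do by hand: subtract the current row's target from its category total and divide by one less than the category size. The sketch below is my own illustration, not the library code. Falling back to the global mean for categories with a single row is an assumption, though it seems consistent with the Ice row above.

```python
import pandas as pd

df = pd.DataFrame({
    'Type': ['Fire', 'Water', 'Bug', 'Fire', 'Fire', 'Bug', 'Water', 'Bug', 'Ice'],
    'Legendary': [0, 0, 0, 1, 0, 0, 1, 0, 1],
})

grouped = df.groupby('Type')['Legendary']
total = grouped.transform('sum')     # category total, repeated on every row
count = grouped.transform('count')   # category size, repeated on every row

# Mean of the category's targets, excluding the current row
loo = (total - df['Legendary']) / (count - 1)

# A category with one row has nothing left over (0/0 gives NaN): use the global mean
loo = loo.fillna(df['Legendary'].mean())
df['Type_loo'] = loo
```

Multiplying these values by random noise centred on 1 would give numbers like those in the table above.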


# 3) Weight of Evidence Encoding

Weight of Evidence (WoE) is a measure of how much the evidence supports or undermines a hypothesis. The `WOEEncoder` in category_encoders implements an adjusted version, which simply adds a value to the numerator and the denominator:

$$WoE=\bigg[ \ln\bigg( \frac{\text{Distribution of goods}+adj}{\text{Distribution of Bads}+adj}\bigg) \bigg]$$

where $adj$ is an adjustment factor that avoids division by 0.

Advantages:

* Works well with logistic regression, since the WoE transformation has the same logistic scale.
* WoE values can be compared across features, since they are standardized.

Disadvantages: 

* May lose information, because some categories may have the same WoE
* Does not take correlation between features into account
* Prone to overfitting

Note: We can adjust the $adj$ factor by changing `regularization` (by default it is 1). Setting it to 0 gives back the original WoE, and you may then encounter division by 0.

In [13]:
from category_encoders import WOEEncoder

WOEE = WOEEncoder(cols=['Type'],regularization=0.5)
df['Type_legendary']=WOEE.fit_transform(df['Type'], df['Legendary'])

df[['Type','Legendary','Type_legendary']]

Unnamed: 0,Type,Legendary,Type_legendary
0,Fire,0,0.04879
1,Water,0,0.559616
2,Bug,0,-1.386294
3,Fire,1,0.04879
4,Fire,0,0.04879
5,Bug,0,-1.386294
6,Water,1,0.559616
7,Bug,0,-1.386294
8,Ice,1,0.0
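
Here is a sketch of the calculation with `regularization=0.5`. The exact placement of the adjustment (added to the counts, with twice the adjustment in each total) and the rule that a category with a single row gets 0 are my assumptions; I picked them because they reproduce the table above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Type': ['Fire', 'Water', 'Bug', 'Fire', 'Fire', 'Bug', 'Water', 'Bug', 'Ice'],
    'Legendary': [0, 0, 0, 1, 0, 0, 1, 0, 1],
})
regularization = 0.5

stats = df.groupby('Type')['Legendary'].agg(['sum', 'count'])
total_good = df['Legendary'].sum()    # rows with target 1
total_bad = len(df) - total_good      # rows with target 0

# Adjusted share of goods and bads that fall in each category
good = (stats['sum'] + regularization) / (total_good + 2 * regularization)
bad = (stats['count'] - stats['sum'] + regularization) / (total_bad + 2 * regularization)

woe = np.log(good / bad)
woe[stats['count'] < 2] = 0.0         # assumption: single-row categories carry no evidence
df['Type_woe'] = df['Type'].map(woe)
```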


# 4) James-Stein Encoding
This is target encoding, but more robust. It is defined by the formula:
   $$JS_i = (1-B)\cdot \text{mean}(y_i) + B\cdot\text{mean}(y)$$
where $\text{mean}(y)$ is the global mean of the target, $\text{mean}(y_i)$ is the mean of the target within category $i$, and $B$ is the weight. 

The weight $B$ depends on $\sigma (y)$ and $\sigma (y_i)$, which are the variances of the target over the whole data set and within category $i$. However, we do not know what the true variances are, so we have to estimate them. More about this method [here](https://kiwidamien.github.io/james-stein-encoder.html). 

**Note:** The limitation of James-Stein is that it works best only for a feature that has a normal distribution. 

In the category_encoders version, the default `sigma` is $0.05$.

In [14]:
from category_encoders import JamesSteinEncoder

JSE= JamesSteinEncoder(sigma=0.1)
newcolumns=JSE.fit_transform(df['Type'], df['Legendary'])

df['JSE_col']=newcolumns
df[['Type','Legendary','JSE_col']]

Unnamed: 0,Type,Legendary,JSE_col
0,Fire,0,0.333333
1,Water,0,0.462963
2,Bug,0,0.0
3,Fire,1,0.333333
4,Fire,0,0.333333
5,Bug,0,0.0
6,Water,1,0.462963
7,Bug,0,0.0
8,Ice,1,1.0
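
The sketch below estimates $B$ from sample variances and plugs it into the formula. The particular estimate, a variance ratio scaled by $(k-3)/(k-1)$ for $k$ categories, is my assumption about what the encoder does by default; it reproduces the table above, but treat it as an illustration rather than the library's definition.

```python
import pandas as pd

df = pd.DataFrame({
    'Type': ['Fire', 'Water', 'Bug', 'Fire', 'Fire', 'Bug', 'Water', 'Bug', 'Ice'],
    'Legendary': [0, 0, 0, 1, 0, 0, 1, 0, 1],
})

prior = df['Legendary'].mean()        # mean(y)
global_var = df['Legendary'].var()    # sample variance of the whole target

stats = df.groupby('Type')['Legendary'].agg(['mean', 'var'])
cat_var = stats['var'].fillna(0)      # a single-row category has no sample variance
k = len(stats)                        # number of categories

# Assumed weight: noisier categories are shrunk more towards the global mean
B = (cat_var / (cat_var + global_var) * (k - 3) / (k - 1)).clip(0, 1)

js = (1 - B) * stats['mean'] + B * prior
df['JSE_col'] = df['Type'].map(js)
```

Bug and Ice have zero estimated variance, so they are not shrunk at all, which is why they stay at 0 and 1.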


# 5) M-estimator Encoding

M-Estimate Encoder is a simplified version of Target Encoder. The M stands for maximum likelihood-type. It has only one hyper-parameter, $m$, which represents the power of regularization. The higher the value of $m$, the stronger the shrinking. Recommended values for $m$ are in the range of $1$ to $100$. Read more [here](https://en.wikipedia.org/wiki/M-estimator).

**Note:** By default, $m=1$.

In [15]:
from category_encoders import MEstimateEncoder
df=df_main.copy()

MEE=MEstimateEncoder(m=2)
newcolumns = MEE.fit_transform(df['Type'], df['Legendary'])

df['MEE_col']=newcolumns
df[['Type','Legendary','MEE_col']]

Unnamed: 0,Type,Legendary,MEE_col
0,Fire,0,0.333333
1,Water,0,0.416667
2,Bug,0,0.133333
3,Fire,1,0.333333
4,Fire,0,0.333333
5,Bug,0,0.133333
6,Water,1,0.416667
7,Bug,0,0.133333
8,Ice,1,0.555556
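
With $m=2$, the table above can be reproduced by adding $m$ "virtual" rows that carry the global mean to every category. This is a sketch under that assumption, not the library code.

```python
import pandas as pd

df = pd.DataFrame({
    'Type': ['Fire', 'Water', 'Bug', 'Fire', 'Fire', 'Bug', 'Water', 'Bug', 'Ice'],
    'Legendary': [0, 0, 0, 1, 0, 0, 1, 0, 1],
})
m = 2

prior = df['Legendary'].mean()   # global mean of the target
stats = df.groupby('Type')['Legendary'].agg(['sum', 'count'])

# (positives in the category + m * prior) / (rows in the category + m)
m_estimate = (stats['sum'] + prior * m) / (stats['count'] + m)
df['MEE_col'] = df['Type'].map(m_estimate)
```

A larger `m` adds more virtual rows, so every category moves closer to the global mean.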


# IV. Conclusion
There is no single formula for encoding a feature. However, if you understand the 12 encoding techniques I introduced above, you will be able to move fast. Moreover, it is always worth trying all the techniques that are applicable to the feature and deciding which one works best. Try different regularization coefficient values and see if they increase your score. The cheat-sheet below will help you make some initial decisions. 

Have fun playing with encoders!