# Categorical Encodings
> a tutorial on categorical encodings, how to use them, and when

- toc: true
- badges: true
- comments: true
- categories: [jupyter]

## About

Most machine learning algorithms can't use categorical features until they are converted into numerical values. That's where categorical encoding comes into play. There are many ways to convert categorical features, and some work better than others depending on the situation. I'll do my best to clarify how each categorical encoding works and when to use it.

Categorical features can be divided into two broad categories: nominal (there's no order to the categories) and ordinal (the categories have some order).

Examples for Nominal:
* Red, Blue, Black
* Car, Ship, Plane

Examples for Ordinal:
* Excellent, Very Good, Good, Failed
* Tall, medium, short

* The most stable and accurate encoders are target-based encoders with Double Validation: CatBoost Encoder, James-Stein Encoder, M-Estimator Encoder, and Target Encoder.


* Using Single Validation gives a much better outcome than using no validation at all. Double Validation achieves a more stable score, but it costs more time and resources.


* Regularization is a must for target-based encoders.


* **Reference**
    * y, y+ = total number of observations, total number of positive (true) target values
    * n, n+ = number of observations, number of positive target values, for a given value in a categorical column
    * xi, yi = ith value of category and target
    * a = regularization hyperparameter, default is prior(mean value of the target)

In [2]:
#hide
import pandas as pd
import category_encoders as ce
import warnings
warnings.filterwarnings('ignore')

In [2]:
# set to True if your data is a binary classification problem (for WOE and Probability Ratio Encodings)
is_binary_classification = False

# set to True if your data is normally distributed (for James-Stein Encoding)
is_normally_distributed = False

## Data

The [data](https://github.com/mrdbourke/zero-to-mastery-ml/edit/master/data/car-sales.csv) we'll be using is from ZeroToMastery: Machine Learning and Data Science Udemy Course.

In [3]:
data = pd.read_csv('https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/car-sales.csv')
data

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Toyota,White,150043,4,"$4,000.00"
1,Honda,Red,87899,4,"$5,000.00"
2,Toyota,Blue,32549,3,"$7,000.00"
3,BMW,Black,11179,5,"$22,000.00"
4,Nissan,White,213095,4,"$3,500.00"
5,Toyota,Green,99213,4,"$4,500.00"
6,Honda,Blue,45698,4,"$7,500.00"
7,Honda,Blue,54738,4,"$7,000.00"
8,Toyota,White,60000,4,"$6,250.00"
9,Nissan,White,31600,4,"$9,700.00"


In [4]:
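# Convert Price into a number: strip the "$" and "," characters, then parse as float
# (regex=True is explicit so the behaviour is the same across pandas versions)
data['Price'] = data['Price'].str.replace(r'[$,]', '', regex=True).astype(float)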
data.drop('Colour', axis=1, inplace=True)
data

Unnamed: 0,Make,Odometer (KM),Doors,Price
0,Toyota,150043,4,4000.0
1,Honda,87899,4,5000.0
2,Toyota,32549,3,7000.0
3,BMW,11179,5,22000.0
4,Nissan,213095,4,3500.0
5,Toyota,99213,4,4500.0
6,Honda,45698,4,7500.0
7,Honda,54738,4,7000.0
8,Toyota,60000,4,6250.0
9,Nissan,31600,4,9700.0


In [5]:
# Total of 10 data points
data['Make'].value_counts()

Toyota    4
Honda     3
Nissan    2
BMW       1
Name: Make, dtype: int64

## One Hot Encoding

* It uses a vector to denote the absence or presence of each category. The length of the vector equals the number of categories in the feature.


* It creates N columns (N is the number of categories).


* When solving regression problems, it's better to keep N-1 columns so the model has the correct number of degrees of freedom (N-1).


* It's recommended to use N columns for classification, especially when using a tree-based algorithm.


* It's recommended to use N-1 columns for algorithms that look at all the features simultaneously during training (e.g. SVM, NN, clustering algorithms).


* If the number of categories is large, it will slow down training. OHE expands the size of your dataset, which makes it a memory-inefficient encoder.


* There are several strategies to overcome the memory problem with OHE, one of which is working with a sparse rather than a dense data representation.
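
The N versus N-1 choice and the sparse representation can be sketched with plain pandas. This is only an illustration on a small hand-typed sample of the `Make` column, using `pd.get_dummies` rather than the `category_encoders` library used in the rest of this post.

```python
import pandas as pd

# Small hand-typed sample of the Make column (illustrative only)
makes = pd.DataFrame({"Make": ["Toyota", "Honda", "Toyota", "BMW", "Nissan"]})

# N columns: one indicator column per category
full = pd.get_dummies(makes, columns=["Make"])

# N-1 columns: the first category is dropped and becomes the implicit baseline
reduced = pd.get_dummies(makes, columns=["Make"], drop_first=True)

# Sparse storage: only the non-zero entries are kept in memory
sparse = pd.get_dummies(makes, columns=["Make"], sparse=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
print(sparse.dtypes)
```

With `drop_first=True`, a row of all zeros stands for the dropped category, so no information is lost.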

In [6]:
ohe = ce.OneHotEncoder(cols='Make',
                       use_cat_names=True)
one_hot_encoded_data = ohe.fit_transform(data)
one_hot_encoded_data

Unnamed: 0,Make_Toyota,Make_Honda,Make_BMW,Make_Nissan,Odometer (KM),Doors,Price
0,1,0,0,0,150043,4,4000.0
1,0,1,0,0,87899,4,5000.0
2,1,0,0,0,32549,3,7000.0
3,0,0,1,0,11179,5,22000.0
4,0,0,0,1,213095,4,3500.0
5,1,0,0,0,99213,4,4500.0
6,0,1,0,0,45698,4,7500.0
7,0,1,0,0,54738,4,7000.0
8,1,0,0,0,60000,4,6250.0
9,0,0,0,1,31600,4,9700.0


## Sum Encoding 

* Also known as Deviation Encoding or Effect Encoding.

* Sum Encoding is very similar to OHE, and both are commonly used in Linear Regression (LR) types of models.

* The difference between them is the interpretation of the LR coefficients:
    * In an OHE model, the intercept represents the mean for the baseline condition, and the coefficients represent simple effects (the difference between one particular condition and the baseline).
    * In a Sum Encoding model, the intercept represents the grand mean (across all conditions), and the coefficients can be interpreted directly as the main effects.
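
The difference in interpretation can be checked numerically. The sketch below builds both design matrices by hand with NumPy (it does not call `category_encoders`), treats Nissan as the reference level as an arbitrary choice, and fits ordinary least squares on the ten car prices from this post. Note that with unequal group sizes, the "grand mean" recovered by sum coding is the mean of the per-category means, not the mean of all rows.

```python
import numpy as np
import pandas as pd

make = pd.Series(["Toyota", "Honda", "Toyota", "BMW", "Nissan",
                  "Toyota", "Honda", "Honda", "Toyota", "Nissan"])
price = np.array([4000, 5000, 7000, 22000, 3500,
                  4500, 7500, 7000, 6250, 9700], dtype=float)

levels = ["Toyota", "Honda", "BMW", "Nissan"]  # last level is the reference (assumption)

# One-hot with N-1 columns: reference rows are all zeros
treatment = np.column_stack([(make == lvl).astype(float) for lvl in levels[:-1]])

# Sum coding: identical, except reference rows are all -1
sum_coded = treatment.copy()
sum_coded[(make == levels[-1]).to_numpy()] = -1.0

def fit_intercept(design):
    """Least-squares fit with an explicit intercept column; returns the intercept."""
    X = np.column_stack([np.ones(len(design)), design])
    coef, *_ = np.linalg.lstsq(X, price, rcond=None)
    return coef[0]

treatment_intercept = fit_intercept(treatment)
sum_intercept = fit_intercept(sum_coded)
group_means = pd.Series(price).groupby(make.to_numpy()).mean()

print(treatment_intercept, group_means["Nissan"])  # intercept = baseline mean
print(sum_intercept, group_means.mean())           # intercept = mean of group means
```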

In [38]:
# Sum-encode Make; drop the constant intercept column if this library version adds one
sum_encoding = ce.SumEncoder(cols=['Make'])
sum_encoded_data = sum_encoding.fit_transform(data).drop(columns='intercept', errors='ignore')
sum_encoded_data

Unnamed: 0,Make_0,Make_1,Make_2,Odometer (KM),Doors,Price
0,1.0,0.0,0.0,150043,4,4000.0
1,0.0,1.0,0.0,87899,4,5000.0
2,1.0,0.0,0.0,32549,3,7000.0
3,0.0,0.0,1.0,11179,5,22000.0
4,-1.0,-1.0,-1.0,213095,4,3500.0
5,1.0,0.0,0.0,99213,4,4500.0
6,0.0,1.0,0.0,45698,4,7500.0
7,0.0,1.0,0.0,54738,4,7000.0
8,1.0,0.0,0.0,60000,4,6250.0
9,-1.0,-1.0,-1.0,31600,4,9700.0


## Helmert Encoding

* Helmert coding is a third commonly used type of categorical encoding for regression along with OHE and Sum Encoding.


* This type of encoding can be useful in certain situations where levels of the categorical variable are ordered, say, from lowest to highest, or from smallest to largest.


> The mean of the dependent variable for a level is compared to the mean of the dependent variable over all previous levels. [source](https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02)

In [5]:
helmert_encoding = ce.HelmertEncoder(cols='Make',
                                     drop_invariant=True)
helmert_encoded_data = helmert_encoding.fit_transform(data)
helmert_encoded_data

Unnamed: 0,Make_0,Make_1,Make_2,Odometer (KM),Doors,Price
0,-1.0,-1.0,-1.0,150043,4,4000.0
1,1.0,-1.0,-1.0,87899,4,5000.0
2,-1.0,-1.0,-1.0,32549,3,7000.0
3,0.0,2.0,-1.0,11179,5,22000.0
4,0.0,0.0,3.0,213095,4,3500.0
5,-1.0,-1.0,-1.0,99213,4,4500.0
6,1.0,-1.0,-1.0,45698,4,7500.0
7,1.0,-1.0,-1.0,54738,4,7000.0
8,-1.0,-1.0,-1.0,60000,4,6250.0
9,0.0,0.0,3.0,31600,4,9700.0


In [12]:
helmert_encoding.mapping

[{'col': 'Make',
  'mapping':     Make_0  Make_1  Make_2
   1    -1.0    -1.0    -1.0
   2     1.0    -1.0    -1.0
   3     0.0     2.0    -1.0
   4     0.0     0.0     3.0
  -1     0.0     0.0     0.0
  -2     0.0     0.0     0.0}]

## Label Encoding

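Label Encoding assigns an integer to each category. scikit-learn's `LabelEncoder`, used below, numbers the sorted categories from 0 to N-1.

The major issue is that the numbers imply an order that doesn't necessarily exist: a model may read the codes below as Toyota > Nissan > Honda > BMW.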

In [7]:
from sklearn.preprocessing import LabelEncoder

label_encoding = LabelEncoder()
label_encoded_data = data.copy()

label_encoded_data['Make'] = label_encoding.fit_transform(data['Make'])
label_encoded_data

Unnamed: 0,Make,Odometer (KM),Doors,Price
0,3,150043,4,4000.0
1,1,87899,4,5000.0
2,3,32549,3,7000.0
3,0,11179,5,22000.0
4,2,213095,4,3500.0
5,3,99213,4,4500.0
6,1,45698,4,7500.0
7,1,54738,4,7000.0
8,3,60000,4,6250.0
9,2,31600,4,9700.0


In [8]:
# BMW: 0, Honda: 1, ...
label_encoding.classes_

array(['BMW', 'Honda', 'Nissan', 'Toyota'], dtype=object)

## Ordinal Encoding

It works mostly the same way as Label Encoding, but Label Encoding doesn't consider whether the feature is ordinal. With Ordinal Encoding, we specify the order of the categories in a column ourselves.

In [9]:
# there's no order in the Make column; this only demonstrates the code

mapping = [{"col": "Make", 
            "mapping": {"BMW": 1, "Honda": 2, "Nissan": 3, "Toyota": 4}}]

ordinal_encoding = ce.OrdinalEncoder(cols=["Make"], mapping=mapping)
ordinal_encoded_data = ordinal_encoding.fit_transform(data)
ordinal_encoded_data

Unnamed: 0,Make,Odometer (KM),Doors,Price
0,4,150043,4,4000.0
1,2,87899,4,5000.0
2,4,32549,3,7000.0
3,1,11179,5,22000.0
4,3,213095,4,3500.0
5,4,99213,4,4500.0
6,2,45698,4,7500.0
7,2,54738,4,7000.0
8,4,60000,4,6250.0
9,3,31600,4,9700.0


In [10]:
ordinal_encoding.category_mapping

[{'col': 'Make', 'mapping': {'BMW': 1, 'Honda': 2, 'Nissan': 3, 'Toyota': 4}}]
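
For a feature that really is ordinal, such as the "Tall, medium, short" example from the introduction, the order can also be declared with a pandas ordered categorical. This is a small sketch with made-up values, shown as an alternative to the `ce.OrdinalEncoder` mapping above.

```python
import pandas as pd

# Made-up ordinal feature (illustrative only)
heights = pd.Series(["short", "tall", "medium", "short", "tall"])

# Declare the order explicitly, from lowest to highest
order = ["short", "medium", "tall"]
ordered = pd.Categorical(heights, categories=order, ordered=True)

# .codes starts at 0; shift by one to get 1..N like the mapping above
codes = ordered.codes + 1
print(list(codes))
```

Because the order is stored with the data, comparisons such as `ordered < "tall"` also work.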

## Label Encoding + Ordinal Encoding

* Such a transformation should not be used “as is” for several types of models (Linear Models, KNN, Neural Nets, etc.).

* With gradient boosting, it can be used only if the type of the column is specified as “category”:

> df["category_representation"] = df["category_representation"].astype("category")

* If you are working with tabular data and your model is gradient boosting (especially the LightGBM library), LE is the simplest and most memory-efficient way to work with categories (the pandas category dtype consumes much less memory than the object dtype).
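
The memory claim is easy to check. The sketch below repeats the four makes 2,500 times (an arbitrary size chosen for illustration) and compares the footprint of the plain string column with the same column stored as `category`.

```python
import pandas as pd

# 10,000 rows built by repeating the four makes (illustrative size)
makes = pd.Series(["Toyota", "Honda", "BMW", "Nissan"] * 2500, name="Make")
as_category = makes.astype("category")

# deep=True counts the bytes of the strings themselves, not just the pointers
plain_bytes = makes.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

# The integer codes behind the category dtype are effectively a label encoding
codes = as_category.cat.codes

print(plain_bytes, category_bytes)
```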


## Binary Encoding

Binary Encoding first assigns each category an integer index, then writes that index as a binary number and uses every bit as a column. Here we have 4 categories indexed from 1 to 4; the largest index, 4, is 100 in binary, so 3 bits (3 columns) are needed.

In [13]:
# you can also use custom mapping
binary_encoding = ce.BinaryEncoder(cols='Make')
binary_encoded_data = binary_encoding.fit_transform(data)
binary_encoded_data

Unnamed: 0,Make_0,Make_1,Make_2,Odometer (KM),Doors,Price
0,0,0,1,150043,4,4000.0
1,0,1,0,87899,4,5000.0
2,0,0,1,32549,3,7000.0
3,0,1,1,11179,5,22000.0
4,1,0,0,213095,4,3500.0
5,0,0,1,99213,4,4500.0
6,0,1,0,45698,4,7500.0
7,0,1,0,54738,4,7000.0
8,0,0,1,60000,4,6250.0
9,1,0,0,31600,4,9700.0


Mapping:

* Toyota: 001
* Honda: 010
* BMW: 011
* Nissan: 100
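
The mapping above can be reproduced by hand in three steps. This is a minimal sketch, assuming categories are indexed from 1 in order of first appearance, which is what the mapping above suggests; it is not the library's implementation.

```python
import pandas as pd

makes = pd.Series(["Toyota", "Honda", "Toyota", "BMW", "Nissan"])

# Step 1: integer index per category, in order of first appearance, starting at 1
index_of = {cat: i for i, cat in enumerate(pd.unique(makes), start=1)}

# Step 2: number of bits needed to write the largest index
n_bits = max(index_of.values()).bit_length()

# Step 3: one column per bit, most significant bit first
def to_bits(cat):
    return [int(bit) for bit in format(index_of[cat], f"0{n_bits}b")]

binary = pd.DataFrame([to_bits(c) for c in makes],
                      columns=[f"Make_{i}" for i in range(n_bits)])
print(binary)
```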

## Frequency Encoding

* Because the encoding is derived from category counts, the encoded values can differ between test batches of different sizes. You should think about it beforehand and make the preprocessing of the train set as close to that of the test set as possible.


* Nevertheless, Frequency Encoding and RFE are especially efficient when your categorical column has “long tails”, i.e. several frequent values while the remaining ones have only a few examples in the dataset. In such a case, Frequency Encoding captures the similarity between rare categories.

In [34]:
frequency_encoded_data = data.copy()

freq = data.groupby('Make').size()
frequency_encoded_data['Make'] = data['Make'].map(freq)
frequency_encoded_data

Unnamed: 0,Make,Odometer (KM),Doors,Price
0,4,150043,4,4000.0
1,3,87899,4,5000.0
2,4,32549,3,7000.0
3,1,11179,5,22000.0
4,2,213095,4,3500.0
5,4,99213,4,4500.0
6,3,45698,4,7500.0
7,3,54738,4,7000.0
8,4,60000,4,6250.0
9,2,31600,4,9700.0
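
One way to reduce the batch-size problem mentioned above is to encode with relative frequencies learned on the train set only, then reuse that mapping on the test set. The sketch below assumes a hypothetical test batch containing an unseen make ("Kia"), which is mapped to 0.

```python
import pandas as pd

train = pd.Series(["Toyota", "Honda", "Toyota", "BMW", "Nissan",
                   "Toyota", "Honda", "Honda", "Toyota", "Nissan"])
test = pd.Series(["Honda", "Toyota", "Kia"])  # hypothetical batch; "Kia" is unseen

# Share of rows per category, learned on the train set only
relative_freq = train.value_counts(normalize=True)

train_encoded = train.map(relative_freq)
# Reuse the train mapping; unseen categories get frequency 0
test_encoded = test.map(relative_freq).fillna(0.0)

print(test_encoded.tolist())
```

Since the mapping is fixed after fitting, the encoded values no longer depend on how many rows the test batch has.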


## Mean Encoding

* Also known as Target Encoding, it is the go-to categorical encoding in Kaggle competitions.


* It can identify categories that have a similar effect on predicting the target value.


* It doesn't affect the volume of the data, which makes for a faster learning process.

* The encoded category values are calculated according to the following formulas: ![](https://miro.medium.com/max/386/1*CQ0CSJY8yBq0P0L4i2ztwQ.png)


* mdl: min data (samples) in leaf; a: smoothing parameter, representing the power of regularization. Recommended values for mdl and a are in the range of 1 to 100.


* It has a huge disadvantage, target leakage: it uses information about the target. Because of the target leakage, the model overfits the training data, which results in unreliable validation and lower test scores.


* To reduce the effect of target leakage, we may
    1. increase regularization (it’s hard to tune those hyperparameters without reliable validation),
    2. add random noise to the representation of the category in the train dataset (some sort of augmentation), or
    3. use Double Validation.
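
The validation idea can be sketched as out-of-fold encoding: each row is encoded with category means computed on the other folds only, so its own target never leaks into its feature. The helper `out_of_fold_target_encode` below is a hypothetical name written for this post, not part of `category_encoders`, and it covers a single level of validation only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def out_of_fold_target_encode(categories, target, n_splits=5, seed=0):
    """Encode each row with category means computed on the other folds."""
    categories = pd.Series(categories).reset_index(drop=True)
    target = pd.Series(target, dtype=float).reset_index(drop=True)
    encoded = pd.Series(np.nan, index=categories.index)

    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, apply_idx in kfold.split(categories):
        fold_target = target.iloc[fit_idx]
        fold_means = fold_target.groupby(categories.iloc[fit_idx]).mean()
        fold_prior = fold_target.mean()  # fallback for categories unseen in these folds
        encoded.iloc[apply_idx] = (
            categories.iloc[apply_idx].map(fold_means).fillna(fold_prior).to_numpy()
        )
    return encoded

make = pd.Series(["Toyota", "Honda", "Toyota", "BMW", "Nissan",
                  "Toyota", "Honda", "Honda", "Toyota", "Nissan"])
price = pd.Series([4000, 5000, 7000, 22000, 3500,
                   4500, 7500, 7000, 6250, 9700], dtype=float)

encoded = out_of_fold_target_encode(make, price, n_splits=5)
print(encoded)
```

BMW appears only once, so whenever its row is being encoded the other folds contain no BMW and the fallback prior is used instead.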


In [16]:
# Using ce
mean_encoding = ce.TargetEncoder(cols='Make', smoothing=0)
mean_encoded_data = mean_encoding.fit_transform(data.drop('Price', axis=1), data['Price'])
mean_encoded_data['Price'] = data['Price']
mean_encoded_data

Unnamed: 0,Make,Odometer (KM),Doors,Price
0,5437.5,150043,4,4000.0
1,6500.0,87899,4,5000.0
2,5437.5,32549,3,7000.0
3,7645.0,11179,5,22000.0
4,6600.0,213095,4,3500.0
5,5437.5,99213,4,4500.0
6,6500.0,45698,4,7500.0
7,6500.0,54738,4,7000.0
8,5437.5,60000,4,6250.0
9,6600.0,31600,4,9700.0


## M-estimator Encoding

* M-Estimate Encoder is a simplified version of Target Encoder. It has only one hyperparameter, m, which represents the power of regularization.


![](https://miro.medium.com/max/252/0*zZNdDd_6wpQq_-k5.png)


* In different sources, you may find another formula for the M-Estimator: instead of y+ there is n in the denominator. I found that this variant gives similar scores.

In [22]:
m_estimator_encoding = ce.MEstimateEncoder(cols='Make',
                                           m=1)
m_estimator_encoded_data = m_estimator_encoding.fit_transform(data.drop('Price', axis=1), data['Price'])
m_estimator_encoded_data['Price'] = data['Price']
m_estimator_encoded_data

Unnamed: 0,Make,Odometer (KM),Doors,Price
0,5879.0,150043,4,4000.0
1,6786.25,87899,4,5000.0
2,5879.0,32549,3,7000.0
3,14822.5,11179,5,22000.0
4,6948.333333,213095,4,3500.0
5,5879.0,99213,4,4500.0
6,6786.25,45698,4,7500.0
7,6786.25,54738,4,7000.0
8,5879.0,60000,4,6250.0
9,6948.333333,31600,4,9700.0
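
The values in the table above can be reproduced by hand. The sketch below assumes the encoder computes (sum of the target in the category + prior × m) / (number of rows in the category + m), with the prior being the overall mean price; that assumption matches the output above for m=1.

```python
import pandas as pd

make = pd.Series(["Toyota", "Honda", "Toyota", "BMW", "Nissan",
                  "Toyota", "Honda", "Honda", "Toyota", "Nissan"])
price = pd.Series([4000, 5000, 7000, 22000, 3500,
                   4500, 7500, 7000, 6250, 9700], dtype=float)

m = 1.0                 # regularization strength
prior = price.mean()    # overall mean of the target

# Per-category sum and row count of the target
stats = price.groupby(make).agg(["sum", "count"])

# Assumed formula: pull each category mean toward the prior by m "virtual" rows
m_estimate = (stats["sum"] + prior * m) / (stats["count"] + m)

encoded = make.map(m_estimate)
print(m_estimate)
```

The fewer rows a category has, the stronger the pull toward the prior: BMW, with a single row, moves from 22000 to 14822.5.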


## Weight of Evidence and Probability Ratio Encoding

Both are very similar, and both are explained in this [article](https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02).

WOE and Probability Ratio both work with binary classification problems.

I won't go over them in depth because I can't grasp them completely. Even though I understand how the code works and the steps of the algorithm, I don't fully understand how they affect the learning process.

## Weight of Evidence Encoding

* It is a commonly used target-based encoder in credit scoring.


* It is a measure of the “strength” of a grouping for separating good and bad risk.


* It might lead to target leakage and overfitting. To avoid that, a regularization parameter a is introduced, and WoE is calculated in the following way:

![](https://miro.medium.com/max/451/1*B2A6dKhKrMW7kqfZmm7HjQ.png)

In [17]:
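# Weight of Evidence Encoding (needs a binary target, so it is skipped unless the flag is set)
if is_binary_classification:
    woe_encoding = ce.WOEEncoder(cols='Make')
    woe_encoded_data = woe_encoding.fit_transform(data.drop('Price', axis=1), data['Price'])
    woe_encoded_data['Price'] = data['Price']  # keep the target for the comparison loop
    woe_encoded_data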
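
Since the car data has no binary target, here is a small sketch on a made-up binary problem. It assumes the regularized form WoE = ln( ((n+ + a) / (y+ + 2a)) / ((n − n+ + a) / (y − y+ + 2a)) ), written with the symbols from the Reference list at the top of this post; the category names and target values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up binary classification data (illustrative only)
category = pd.Series(["A", "A", "A", "A", "B", "B", "B", "B"])
target = pd.Series([1, 1, 1, 0, 1, 0, 0, 0])

a = 1.0  # regularization parameter

stats = target.groupby(category).agg(["sum", "count"])
positives = stats["sum"]                    # n+ per category
negatives = stats["count"] - stats["sum"]   # n - n+ per category
total_pos = target.sum()                    # y+
total_neg = len(target) - total_pos         # y - y+

# Regularized share of all positives / negatives that fall in each category
pos_share = (positives + a) / (total_pos + 2 * a)
neg_share = (negatives + a) / (total_neg + 2 * a)

woe = np.log(pos_share / neg_share)
print(woe)
```

A positive value means the category holds a larger share of the positives than of the negatives; a negative value means the opposite.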

## Hashing

> With Hashing, the number of dimensions will be far less than the number of dimensions with encoding like One Hot Encoding. This method is advantageous when the cardinality of categorical is very high.

In [18]:
hashing_encoding = ce.HashingEncoder(cols='Make')
hashing_encoded_data = hashing_encoding.fit_transform(data)
hashing_encoded_data

Unnamed: 0,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7,Odometer (KM),Doors,Price
0,0,1,0,0,0,0,0,0,150043,4,4000.0
1,0,0,1,0,0,0,0,0,87899,4,5000.0
2,0,1,0,0,0,0,0,0,32549,3,7000.0
3,1,0,0,0,0,0,0,0,11179,5,22000.0
4,0,0,0,1,0,0,0,0,213095,4,3500.0
5,0,1,0,0,0,0,0,0,99213,4,4500.0
6,0,0,1,0,0,0,0,0,45698,4,7500.0
7,0,0,1,0,0,0,0,0,54738,4,7000.0
8,0,1,0,0,0,0,0,0,60000,4,6250.0
9,0,0,0,1,0,0,0,0,31600,4,9700.0
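
The idea behind the hashing trick can be sketched in a few lines: hash the category string, take the result modulo the number of output columns, and set that column to 1. The helper `hash_bucket` is a hypothetical name, and the choice of MD5 with 8 columns is an assumption made to mirror the shape of the output above; the column assignments are not claimed to match the library's.

```python
import hashlib
import pandas as pd

def hash_bucket(value, n_components=8):
    """Map a category to a column index via a stable hash (MD5 here, as an assumption)."""
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_components

makes = pd.Series(["Toyota", "Honda", "Toyota", "BMW", "Nissan"])
n_components = 8

# Start from all zeros, then switch on one column per row
hashed = pd.DataFrame(0, index=makes.index,
                      columns=[f"col_{i}" for i in range(n_components)])
for row, value in makes.items():
    hashed.loc[row, f"col_{hash_bucket(value, n_components)}"] = 1

print(hashed)
```

The number of columns is fixed up front and does not grow with the number of categories, at the cost of possible collisions (two categories landing in the same column).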


## Backward Difference Encoding

* This technique falls under the contrast coding system for categorical features.
* It creates N-1 columns (N is the number of categories).
* It compares the mean of the dependent variable for a level with the mean of the dependent variable for the prior level.

In [6]:
bd_encoding = ce.BackwardDifferenceEncoder(cols='Make', 
                                           drop_invariant=True)
bd_encoded_data = bd_encoding.fit_transform(data)
bd_encoded_data

Unnamed: 0,Make_0,Make_1,Make_2,Odometer (KM),Doors,Price
0,-0.75,-0.5,-0.25,150043,4,4000.0
1,0.25,-0.5,-0.25,87899,4,5000.0
2,-0.75,-0.5,-0.25,32549,3,7000.0
3,0.25,0.5,-0.25,11179,5,22000.0
4,0.25,0.5,0.75,213095,4,3500.0
5,-0.75,-0.5,-0.25,99213,4,4500.0
6,0.25,-0.5,-0.25,45698,4,7500.0
7,0.25,-0.5,-0.25,54738,4,7000.0
8,-0.75,-0.5,-0.25,60000,4,6250.0
9,0.25,0.5,0.75,31600,4,9700.0


Looks very similar to Helmert Encoding

In [20]:
bd_encoding.mapping

[{'col': 'Make',
  'mapping':     Make_0  Make_1  Make_2
   1   -0.75    -0.5   -0.25
   2    0.25    -0.5   -0.25
   3    0.25     0.5   -0.25
   4    0.25     0.5    0.75
  -1    0.00     0.0    0.00
  -2    0.00     0.0    0.00}]
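
To see what "compared with the prior level" means in practice, the sketch below plugs the coding matrix from the mapping above into an ordinary least-squares fit with NumPy. It assumes the levels 1 to 4 correspond to Toyota, Honda, BMW, Nissan (their order of appearance in the data). Each fitted coefficient then equals the difference between the mean price of a level and that of the level before it.

```python
import numpy as np
import pandas as pd

make = pd.Series(["Toyota", "Honda", "Toyota", "BMW", "Nissan",
                  "Toyota", "Honda", "Honda", "Toyota", "Nissan"])
price = np.array([4000, 5000, 7000, 22000, 3500,
                  4500, 7500, 7000, 6250, 9700], dtype=float)

# Coding matrix copied from the mapping above (level order is an assumption)
coding = {
    "Toyota": [-0.75, -0.5, -0.25],
    "Honda":  [0.25, -0.5, -0.25],
    "BMW":    [0.25, 0.5, -0.25],
    "Nissan": [0.25, 0.5, 0.75],
}

# Design matrix: intercept column followed by the three contrast columns
X = np.column_stack([np.ones(len(make)), np.array([coding[m] for m in make])])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)

group_means = pd.Series(price).groupby(make.to_numpy()).mean()
print(coef[1], group_means["Honda"] - group_means["Toyota"])
print(coef[2], group_means["BMW"] - group_means["Honda"])
print(coef[3], group_means["Nissan"] - group_means["BMW"])
```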

## Leave One Out Encoding

> This is very similar to target encoding but excludes the current row’s target when calculating the mean target for a level to reduce outliers.

In [21]:
loo_encoding = ce.LeaveOneOutEncoder(cols='Make')
loo_encoded_data = loo_encoding.fit_transform(data.drop('Price', axis=1), data['Price'])
loo_encoded_data['Price'] = data['Price']
loo_encoded_data

Unnamed: 0,Make,Odometer (KM),Doors,Price
0,5916.666667,150043,4,4000.0
1,7250.0,87899,4,5000.0
2,4916.666667,32549,3,7000.0
3,7645.0,11179,5,22000.0
4,9700.0,213095,4,3500.0
5,5750.0,99213,4,4500.0
6,6000.0,45698,4,7500.0
7,6250.0,54738,4,7000.0
8,5166.666667,60000,4,6250.0
9,3500.0,31600,4,9700.0
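
The table above can be reproduced with a few pandas operations. This sketch assumes the leave-one-out value is (category sum − own target) / (category count − 1), and that a category with a single row falls back to the overall mean, which is what the BMW row above suggests.

```python
import pandas as pd

make = pd.Series(["Toyota", "Honda", "Toyota", "BMW", "Nissan",
                  "Toyota", "Honda", "Honda", "Toyota", "Nissan"])
price = pd.Series([4000, 5000, 7000, 22000, 3500,
                   4500, 7500, 7000, 6250, 9700], dtype=float)

# Per-row category totals, broadcast back to the original rows
group_sum = price.groupby(make).transform("sum")
group_count = price.groupby(make).transform("count")

# Mean of the category with the current row's own target removed
loo = (group_sum - price) / (group_count - 1)

# Single-row categories have nothing left after removal: use the overall mean
loo = loo.where(group_count > 1, price.mean())
print(loo)
```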


## James-Stein Encoding

* Target-based encoder.

* Returns a weighted average of:
    * The mean target value for the observed feature value.
    * The mean target value over the whole dataset.

* It tends to shrink the category average toward the overall average.

* One major issue: it was defined only for normal distributions.

In [23]:
if is_normally_distributed:
    james_stein_encoding = ce.JamesSteinEncoder(cols=['Make'])
    james_stein_encoded_data = james_stein_encoding.fit_transform(data.drop('Price', axis=1), data['Price'])
    james_stein_encoded_data['Price'] = data['Price']
    james_stein_encoded_data

## Which encoding method to use for your dataset?

It's recommended to follow the chart below to figure out which encoder is best suited to your dataset.

You can also test multiple methods on a sample of your dataset; that should give you a good view of which one works better for your situation.

![](https://miro.medium.com/max/1000/0*NBVi7M3sGyiUSyd5.png)

In [30]:
# Experimenting to see which encoding works best for the dataset used
from sklearn.linear_model import LinearRegression

results = {}

encoded_datasets = {
    "One Hot Encoding": one_hot_encoded_data,
    "Label Encoding": label_encoded_data,
    "Ordinal Encoding": ordinal_encoded_data,
    "Helmert Encoding": helmert_encoded_data,
    "Binary Encoding": binary_encoded_data,
    "Frequency Encoding": frequency_encoded_data,
    "Mean Encoding": mean_encoded_data,
    "Hashing": hashing_encoded_data,
    "Backward Difference Encoding": bd_encoded_data,
    "Leave-One-Out": loo_encoded_data,
    "M-Estimator Encoding": m_estimator_encoded_data,
}

if is_normally_distributed:
    encoded_datasets["James-Stein Encoding"] = james_stein_encoded_data
if is_binary_classification:
    encoded_datasets["Weight of Evidence Encoding"] = woe_encoded_data


for encoding_name, encoded_dataset in encoded_datasets.items():

    # Separate features and target
    X = encoded_dataset.drop('Price', axis=1)
    y = encoded_dataset['Price']

    model = LinearRegression()
    model.fit(X, y)

    # R^2 on the same rows the model was fitted on (no held-out set)
    score = model.score(X, y)

    results[encoding_name] = score

In [31]:
sorted(results.items(), key=lambda x:x[1], reverse=True)

[('One Hot Encoding', 0.9945536929215048),
 ('Helmert Encoding', 0.9945536929215048),
 ('Binary Encoding', 0.9945536929215048),
 ('Hashing', 0.9945536929215048),
 ('Backward Difference Encoding', 0.9945536929215048),
 ('M-Estimator Encoding', 0.9801482890347663),
 ('Frequency Encoding', 0.8508034307880376),
 ('Leave-One-Out', 0.8166612326802211),
 ('Mean Encoding', 0.813764093905115),
 ('Label Encoding', 0.7735336830068426),
 ('Ordinal Encoding', 0.7735336830068424)]

## Notes

### Encoders-wise
* LE and OE are bad with Linear Models, KNN, and Neural Nets.

* OHE is good for tree-based algorithms (and for regression if it produces N-1 columns instead of N).

* Helmert Encoding and Sum Encoding are good for Linear Models.


### Algorithm-wise
* With Regression, SVM, NN, and clustering algorithms, it's better to use an encoder that generates N-1 columns.


## Resources

[Benchmarking Categorical Encoders](https://towardsdatascience.com/benchmarking-categorical-encoders-9c322bd77ee8)

[All About Categorical Variable Encoding](https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02)

To-Read

[Coding Categorical Variables (Research Paper)](http://psych.colorado.edu/~carey/Courses/PSYC5741/handouts/Coding%20Categorical%20Variables%202006-03-03.pdf)