# All about Categorical Variable Encoding

### Convert a categorical variable to number for Machine Learning Model Building

#### Types of Categorical Variables :-
1. Nominal variable :- No ordering is present; every value has equal importance, e.g. places, animals etc.
2. Ordinal variable :- An ordering is present, e.g. reviews (Bad, Average, Good), Education (10th, 12th, Graduation, Masters) etc.

In [1]:
# !pip install category_encoders

In [5]:
import pandas as pd
import numpy as np

### Category encoders library can be used to encode categorical features
#### https://www.kaggle.com/discdiver/category-encoders-examples

In [3]:
import category_encoders as ce

In [35]:
df = pd.DataFrame({
    'color':["a", "c", "a", "a", "b", "b"], 
    'country': ['India','UAE','USA','India','PERU','UAE'],
    'outcome':[1, 2, 0, 0, 0, 1]})

# set up X (features) and y (target)
X = df.drop('outcome', axis = 1)
y = df['outcome']

## 1. One Hot Encoding

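If the categorical variables are fed to a tree-based learning algorithm, it is good practice to encode each of them into N binary variables and not drop the first. For linear models, dropping the first level, as drop_first=True does in the cell below, avoids perfectly collinear columns.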

In [36]:
# if a feature has only 2 unique values, apply one hot encoding directly; it will not affect the dimensionality
ohenc_df = pd.get_dummies(X,drop_first=True)
ohenc_df

Unnamed: 0,color_b,color_c,country_PERU,country_UAE,country_USA
0,0,0,0,0,0
1,0,1,0,1,0
2,0,0,0,0,1
3,0,0,0,0,0
4,1,0,1,0,0
5,1,0,0,1,0
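
For the tree-based case, where all N columns are kept, a dependency-free sketch might look like the following. The helper name `one_hot` is hypothetical, and the column names only mimic the `prefix_value` style of the output above.

```python
def one_hot(values, prefix):
    """Return (column_names, rows) with one 0/1 column per distinct value."""
    categories = sorted(set(values))  # fixed, reproducible column order
    columns = [f"{prefix}_{c}" for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return columns, rows


colors = ["a", "c", "a", "a", "b", "b"]
columns, rows = one_hot(colors, "color")
print(columns)
for row in rows:
    print(row)
```

With pandas, the same effect is obtained by leaving `drop_first` at its default of `False` in `pd.get_dummies`.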


#### One Hot encoding applied in [Street, Utilities, CentralAir] features because they have only 2 values

## 2. Label Encoding
#### Mostly used for a categorical target column because it assigns values automatically

In [43]:
labels = pd.factorize(X['country'])[0].reshape(-1,1)
labels

array([[0],
       [1],
       [2],
       [0],
       [3],
       [1]], dtype=int64)

### OR with sklearn

In [44]:
from sklearn.preprocessing import LabelEncoder
labels = LabelEncoder().fit_transform(X['country'])
labels

array([0, 2, 3, 0, 1, 2])

## 3. Ordinal Encoding

##### 1. This can be used if the feature does not have many unique values.
##### 2. This is very handy for handling values that are not present in the training dataset but that we know can appear in testing, because we assign the order manually

In [45]:
mapedValues = {'c':0, 'd':1,'b':2, 'a':3}
oe_df = df['color'].map(mapedValues)
oe_df.head()

0    3
1    0
2    3
3    3
4    2
Name: color, dtype: int64
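
The point about values missing from the training data can be sketched without pandas. The mapping mirrors the one in the cell above; the helper name `ordinal_encode` and the `-1` placeholder for unmapped values are assumptions made for illustration.

```python
# Explicit order, including 'd', which is absent from the sample data.
ORDER = {"c": 0, "d": 1, "b": 2, "a": 3}


def ordinal_encode(values, mapping, unknown=-1):
    """Map each value through `mapping`; anything unmapped becomes `unknown`."""
    return [mapping.get(v, unknown) for v in values]


train = ["a", "c", "a", "a", "b", "b"]
test = ["d", "a", "z"]  # 'd' was anticipated in the mapping, 'z' was not

train_codes = ordinal_encode(train, ORDER)
test_codes = ordinal_encode(test, ORDER)
print(train_codes)
print(test_codes)
```

With `Series.map`, an unmapped value such as `'z'` would come back as NaN instead, so an explicit placeholder or a check for missing values is worth adding before modelling.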

## 4. Helmert Encoding or Reverse Helmert Coding
In this encoding, the mean of the dependent variable for a level is compared to the mean of the dependent variable over all
previous levels.
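
As a rough sketch of what this comparison means in terms of codes, the function below builds one contrast layout that is consistent with the output shown further down in this section, assuming levels are numbered in order of first appearance. The name `helmert_rows` is hypothetical and this is not the library's implementation.

```python
def helmert_rows(k):
    """Row i holds the k-1 codes for the (i+1)-th level, in order of appearance."""
    rows = []
    for level in range(k):
        row = []
        for col in range(k - 1):
            if level <= col:
                row.append(-1.0)            # one of the "previous" levels
            elif level == col + 1:
                row.append(float(col + 1))  # the level being compared
            else:
                row.append(0.0)             # later levels are not involved
        rows.append(row)
    return rows


for row in helmert_rows(4):
    print(row)
```

A feature with k levels produces k-1 columns, which is where the dimensionality concern comes from.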

#### Disadvantages :-
Curse of Dimensionality

In [49]:
import category_encoders as ce
encoder = ce.HelmertEncoder(cols=['color','country'], drop_invariant=True)
dfh = encoder.fit_transform(df[['color','country']])
dfh

Unnamed: 0,color_0,color_1,country_0,country_1,country_2
0,-1.0,-1.0,-1.0,-1.0,-1.0
1,1.0,-1.0,1.0,-1.0,-1.0
2,-1.0,-1.0,0.0,2.0,-1.0
3,-1.0,-1.0,-1.0,-1.0,-1.0
4,0.0,2.0,0.0,0.0,3.0
5,0.0,2.0,1.0,-1.0,-1.0


## 5. Binary Encoding
Binary encoding converts a category into binary digits. Each binary digit creates one feature column. If there are n unique categories, binary encoding results in only about log₂(n) features. In this example, country has four categories; numbered from 1, the largest code is 4 (100 in binary), so there are three binary encoded features. Compared to One Hot Encoding, this requires fewer feature columns (for 100 categories One Hot Encoding will have 100 features, while Binary encoding needs just seven).

**For Binary encoding, one has to follow the following steps:**
1. The categories are first converted to integers starting from 1 (the order follows the order in which categories appear in the dataset and does not imply any ordinal nature)
2. Then those integers are converted into binary code, so for example 3 becomes 011, 4 becomes 100
3. Then the digits of the binary number form separate columns.
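
The three steps above can be sketched in plain Python. The helper name `binary_encode` is hypothetical; the sketch only illustrates the steps and is not the library's code.

```python
def binary_encode(values):
    """Return one list of 0/1 digits per input value."""
    codes = {}
    for v in values:
        if v not in codes:
            codes[v] = len(codes) + 1  # step 1: integers from 1, by first appearance
    width = max(codes.values()).bit_length()  # digits needed for the largest code
    # steps 2 and 3: zero-padded binary string, one digit per column
    return [[int(bit) for bit in format(codes[v], f"0{width}b")] for v in values]


countries = ["India", "UAE", "USA", "India", "PERU", "UAE"]
encoded = binary_encode(countries)
for country, bits in zip(countries, encoded):
    print(country, bits)
```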

In [51]:
encoder = ce.BinaryEncoder(cols=['country'])
dfbin = encoder.fit_transform(df)
dfbin

Unnamed: 0,color,country_0,country_1,country_2,outcome
0,a,0,0,1,1
1,c,0,1,0,2
2,a,0,1,1,0
3,a,0,0,1,0
4,b,1,0,0,0
5,b,0,1,0,1


## 6. Frequency Encoding
It is a way to utilize the frequency of the categories as labels. In cases where the frequency is somewhat related to the target variable, it helps the model to understand and assign weight in direct or inverse proportion, depending on the nature of the data. There are three steps:
1. Select a categorical variable you would like to transform
2. Group by the categorical variable and obtain counts of each category
3. Join it back with the training dataset

In [60]:
freq = df.groupby('color').size()/len(df)
df['color_freq_encode'] = df['color'].map(freq)
df

Unnamed: 0,color,country,outcome,color_freq_encode
0,a,India,1,0.5
1,c,UAE,2,0.166667
2,a,USA,0,0.5
3,a,India,0,0.5
4,b,PERU,0,0.333333
5,b,UAE,1,0.333333


## 7. Mean Encoding
Mean Encoding or Target Encoding is a very popular encoding approach among Kagglers. There are many variations of this. Mean encoding is similar to label encoding, except that here the labels are correlated directly with the target. For example, in mean target encoding, the label for each category of the feature is the mean value of the target variable on the training data. This encoding method brings out the relation between similar categories, but the connections are bounded within the categories and the target itself.
### Advantages :-
1. It does not affect the volume of the data and helps in faster learning.

### Disadvantages :-
1. Usually, Mean encoding is notorious for over-fitting; thus, a regularization with cross-validation or some other approach is a must on most occasions.
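
One way to apply the cross-validation idea is out-of-fold encoding: each row is encoded with category means computed on the other folds only, so a row never sees its own target. The sketch below is a minimal illustration under stated assumptions. The name `out_of_fold_mean_encode` is hypothetical, folds are assigned by row position, and unseen categories fall back to the mean of the training folds.

```python
def out_of_fold_mean_encode(categories, targets, n_folds=3):
    """Encode each row using target means computed on the other folds only."""
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        # Assumption: rows are assigned to folds by position; real data is usually shuffled first.
        train = [i for i in range(n) if i % n_folds != fold]
        held_out = [i for i in range(n) if i % n_folds == fold]
        sums, counts = {}, {}
        for i in train:
            sums[categories[i]] = sums.get(categories[i], 0.0) + targets[i]
            counts[categories[i]] = counts.get(categories[i], 0) + 1
        train_mean = sum(targets[i] for i in train) / len(train)
        for i in held_out:
            c = categories[i]
            # A category absent from the training folds falls back to their overall mean.
            encoded[i] = sums[c] / counts[c] if c in counts else train_mean
    return encoded


colors = ["a", "c", "a", "a", "b", "b"]
outcome = [1, 2, 0, 0, 0, 1]
oof = out_of_fold_mean_encode(colors, outcome)
print(oof)
```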

**Mean encoding approach is as below:**
1. Select a categorical variable you would like to transform
2. Group by the categorical variable and obtain the aggregated sum over the “Target” variable (for a 0/1 target, the total number of 1’s in each category)
3. Group by the categorical variable and obtain the aggregated count over the “Target” variable
4. Divide the step 2 result by the step 3 result and join it back with the training data.

In [64]:
mean_encode = df.groupby('color')['outcome'].mean()
df.loc[:, 'color_mean_encode'] = df['color'].map(mean_encode)
df

Unnamed: 0,color,country,outcome,color_freq_encode,color_mean_encode
0,a,India,1,0.5,0.333333
1,c,UAE,2,0.166667,2.0
2,a,USA,0,0.5,0.333333
3,a,India,0,0.5,0.333333
4,b,PERU,0,0.333333,0.5
5,b,UAE,1,0.333333,0.5


Mean encoding can embody the target in the label, whereas label encoding does not correlate with the target. In the case of a large number of features, mean encoding could prove to be a much simpler alternative. Mean encoding tends to group the classes, whereas the grouping is random in case of label encoding.

There are many variations of target encoding in practice, such as smoothing. Smoothing can be implemented as below:

In [70]:
# compute the global mean
mean_val = df['outcome'].mean()

# compute the count and the mean of the target for each category
aggr = df.groupby('color')['outcome'].agg(['count','mean'])
counts = aggr['count']
means = aggr['mean']
weight = 100

# Compute the 'smoothed' mean: blend each category mean with the global mean
smooth = (counts*means + weight*mean_val) / (counts+weight)

# Replace each value by the corresponding smoothed mean
df.loc[:,'color_smean_enc'] = df['color'].map(smooth)
df

Unnamed: 0,color,country,outcome,color_freq_encode,color_mean_encode,color_smean_enc
0,a,India,1,0.5,0.333333,0.656958
1,c,UAE,2,0.166667,2.0,0.679868
2,a,USA,0,0.5,0.333333,0.656958
3,a,India,0,0.5,0.333333,0.656958
4,b,PERU,0,0.333333,0.5,0.663399
5,b,UAE,1,0.333333,0.5,0.663399


## 7. Weight of Evidence Encoding
Weight of Evidence (WoE) is a measure of the “strength” of a grouping technique to separate good and bad. This method was developed primarily to build a predictive model to evaluate the risk of loan default in the credit and financial industry. Weight of evidence (WOE) is a measure of how much the evidence supports or undermines a hypothesis.

WoE is well suited for Logistic Regression because the Logit transformation is simply the log of the odds, i.e., ln(P(Goods)/P(Bads)). Therefore, by using WoE-coded predictors in Logistic Regression, the predictors are all prepared and coded to the same scale. The parameters in the linear logistic regression equation can be directly compared.

**The WoE transformation has (at least) three advantages:**
1. It can transform an independent variable so that it establishes a monotonic relationship to the dependent variable. It does more than this — to secure monotonic relationship it would be enough to “recode” it to any ordered measure (for example 1,2,3,4…), but the WoE transformation orders the categories on a “logistic” scale which is natural for Logistic Regression.
2. For variables with too many (sparsely populated) discrete values, these can be grouped into categories (densely populated), and the WoE can be used to express information for the whole category.
3. The (univariate) effect of each category on the dependent variable can be compared across categories and variables because WoE is a standardized value (for example you can compare WoE of married people to WoE of manual workers).

**It also has (at least) three drawbacks:**
1. Loss of information (variation) due to binning to a few categories
2. It is a “univariate” measure, so it does not take into account the correlation between independent variables
3. It is easy to manipulate (over-fit) the effect of variables according to how categories are created
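
A dependency-free sketch of the per-category formula ln(P(Goods)/P(Bads)) is given here for a strictly binary target. The helper name `woe_by_category`, the `eps` guard, and the rule used to reduce the sample `outcome` values to 0/1 are all assumptions made for illustration.

```python
import math


def woe_by_category(categories, targets, eps=1e-6):
    """Per-category ln(P(good) / P(bad)) for a 0/1 target."""
    goods, totals = {}, {}
    for c, t in zip(categories, targets):
        if t not in (0, 1):
            raise ValueError("WoE needs a binary 0/1 target")
        goods[c] = goods.get(c, 0) + t
        totals[c] = totals.get(c, 0) + 1
    woe = {}
    for c in totals:
        good = goods[c] / totals[c]
        bad = 1 - good
        # Guard both ends so neither log(0) nor division by zero can occur.
        woe[c] = math.log(max(good, eps) / max(bad, eps))
    return woe


colors = ["a", "c", "a", "a", "b", "b"]
outcome = [1, 2, 0, 0, 0, 1]
# Assumption: any positive outcome counts as "good".
binary_outcome = [1 if o > 0 else 0 for o in outcome]
woe = woe_by_category(colors, binary_outcome)
print(woe)
```

Credit-scoring texts often define WoE with each category's share of all goods and of all bads instead; that variant differs from the per-category odds used here only by a constant, the log of the overall good/bad ratio.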

In [74]:
# We calculate probability of target = 1 i.e. Good = 1 for each category.
# Note: this assumes a 0/1 target. 'outcome' here also contains a 2, so for
# color 'c' Good is 2.0, Bad is -1.0 and the log below is NaN.
woe_df = df.groupby('color')['outcome'].mean()
woe_df = pd.DataFrame(woe_df)

# Rename the column to 'Good' to keep it consistent with the formula
woe_df = woe_df.rename(columns = {'outcome': 'Good'})

# Calculate 'Bad' probability which is 1-Good probability
woe_df['Bad'] = 1-woe_df.Good

# We need to add a small value to avoid divide by zero in the denominator
woe_df['Bad'] = np.where(woe_df['Bad'] == 0, 0.000001, woe_df['Bad'])
woe_df['WOE'] = np.log(woe_df.Good / woe_df.Bad)
woe_df


color,Good,Bad,WOE
a,0.333333,0.666667,-0.693147
b,0.5,0.5,0.0
c,2.0,-1.0,


In [78]:
# Map WOE back onto the original dataframe
df['color_woe_encode'] = df['color'].map(woe_df['WOE'])
df

Unnamed: 0,color,country,outcome,color_freq_encode,color_mean_encode,color_smean_enc,color_woe_encode
0,a,India,1,0.5,0.333333,0.656958,-0.693147
1,c,UAE,2,0.166667,2.0,0.679868,
2,a,USA,0,0.5,0.333333,0.656958,-0.693147
3,a,India,0,0.5,0.333333,0.656958,-0.693147
4,b,PERU,0,0.333333,0.5,0.663399,0.0
5,b,UAE,1,0.333333,0.5,0.663399,0.0


## 8. Probability Ratio Encoding
Probability Ratio Encoding is similar to Weight Of Evidence (WoE); the only difference is that the plain ratio of good and bad probability is used, without the logarithm. For each label, we calculate the mean of target=1, that is the probability of being 1 ( P(1) ), and also the probability of target=0 ( P(0) ). Then we calculate the ratio P(1)/P(0) and replace the labels by that ratio. We need to add a minimal value to P(0) to avoid divide-by-zero scenarios where, for a particular category, there is no target=0.

In [80]:

# We calculate probability of target = 1 i.e. Good = 1 for each category
pr_df = df.groupby('color')['outcome'].mean()
pr_df = pd.DataFrame(pr_df)

# Rename the column to 'Good' to keep it consistent with the formula
pr_df = pr_df.rename(columns = {'outcome': 'Good'})

# Calculate 'Bad' probability which is 1-Good probability
pr_df['Bad'] = 1-pr_df.Good

# We need to add small value to avoid divide by zero in denominator
pr_df['Bad'] = np.where(pr_df['Bad'] == 0, 0.000001, pr_df['Bad'])
pr_df['PR'] = pr_df.Good / pr_df.Bad
pr_df


color,Good,Bad,PR
a,0.333333,0.666667,0.5
b,0.5,0.5,1.0
c,2.0,-1.0,-2.0


In [83]:
# Map PR back onto the original dataframe
df['color_pr_encode'] = df['color'].map(pr_df['PR'])
df

Unnamed: 0,color,country,outcome,color_freq_encode,color_mean_encode,color_smean_enc,color_woe_encode,color_pr_encode
0,a,India,1,0.5,0.333333,0.656958,-0.693147,0.5
1,c,UAE,2,0.166667,2.0,0.679868,,-2.0
2,a,USA,0,0.5,0.333333,0.656958,-0.693147,0.5
3,a,India,0,0.5,0.333333,0.656958,-0.693147,0.5
4,b,PERU,0,0.333333,0.5,0.663399,0.0,1.0
5,b,UAE,1,0.333333,0.5,0.663399,0.0,1.0


## 9. Hashing
Hashing maps categorical variables into a fixed-size space of integers, where the distance between two vectors of categorical variables is approximately maintained in the transformed numerical space. With Hashing, the number of dimensions will be far less than with an encoding like One Hot Encoding. This method is advantageous when the cardinality of the categorical variable is very high.

It is important to read about how max_process and max_sample work before setting them manually; an inappropriate setting slows down encoding.

Refer for more detail :-
https://contrib.scikit-learn.org/category_encoders/hashing.html

In [94]:
df = pd.DataFrame({
    'color':["a", "c", "a", "a", "b", "b"], 
    'country': ['India','UAE','USA','India','PERU','UAE'],
    'outcome':[1, 2, 0, 0, 0, 1]})
X = df.drop('outcome', axis = 1)
y = df['outcome']

In [95]:
ce_hash = ce.HashingEncoder(cols = ['color'])
ce_hash.fit_transform(X['color'], y)

Unnamed: 0,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7
0,0,1,0,0,0,0,0,0
1,0,0,0,1,0,0,0,0
2,0,1,0,0,0,0,0,0
3,0,1,0,0,0,0,0,0
4,0,0,0,0,0,0,0,1
5,0,0,0,0,0,0,0,1


### Advantages
1. The advantage of this encoder is that it does not maintain a dictionary of observed categories. 
2. Consequently, the encoder does not grow in size and accepts new values during data scoring by design.

### Disadvantages
1. Time consuming process
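
The "no dictionary" property listed under Advantages can be seen in a small sketch of the hashing trick. This is an illustration only: the helper name `hash_encode` and the choice of SHA-256 are assumptions, and the bucket positions will not necessarily match the library output shown above.

```python
import hashlib


def hash_encode(values, n_components=8):
    """One row per value; a single 1 marks the bucket its hash falls into."""
    rows = []
    for v in values:
        digest = hashlib.sha256(str(v).encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % n_components  # same value always gives the same bucket
        row = [0] * n_components
        row[bucket] = 1
        rows.append(row)
    return rows


# 'never_seen' needs no refitting: nothing is stored about earlier categories.
hashed = hash_encode(["a", "c", "a", "a", "b", "b", "never_seen"])
for row in hashed:
    print(row)
```

Because the number of buckets is fixed, two different categories can land in the same bucket; that collision risk is the price paid for the bounded width.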

## 10. Backward Difference Encoding

In backward difference coding, the mean of the dependent variable for a level is compared with the mean of the dependent variable for the prior level. This type of coding may be useful for a nominal or an ordinal variable.
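
As a rough sketch, the codes behind this comparison can be generated as follows. The layout is consistent with the output shown below for three levels, assuming levels are numbered in order of first appearance (a, c, b). The name `backward_difference_rows` is hypothetical and this is not the library's implementation.

```python
def backward_difference_rows(k):
    """Row i holds the k-1 codes for the (i+1)-th level."""
    rows = []
    for level in range(k):
        row = []
        for col in range(k - 1):
            if level <= col:
                row.append(-(k - col - 1) / k)  # levels up to and including the prior one
            else:
                row.append((col + 1) / k)       # the current level and everything after it
        rows.append(row)
    return rows


bd_rows = backward_difference_rows(3)
for row in bd_rows:
    print([round(x, 6) for x in row])
```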

In [97]:
encoder = ce.BackwardDifferenceEncoder(cols=['color'])
bde = encoder.fit_transform(X,y)
bde

Unnamed: 0,intercept,color_0,color_1,country
0,1,-0.666667,-0.333333,India
1,1,0.333333,-0.333333,UAE
2,1,-0.666667,-0.333333,USA
3,1,-0.666667,-0.333333,India
4,1,0.333333,0.666667,PERU
5,1,0.333333,0.666667,UAE


## 11. Leave One Out Encoding
This is very similar to target encoding but excludes the current row’s target when calculating the mean target for a level to reduce the effect of outliers.
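
The idea can be sketched in plain Python. The helper name `leave_one_out_encode` is hypothetical, and the fallback to the global mean for a category with a single row is an assumption chosen because it agrees with the output shown below.

```python
def leave_one_out_encode(categories, targets):
    """Mean target of the other rows in the same category, per row."""
    global_mean = sum(targets) / len(targets)
    sums, counts = {}, {}
    for c, t in zip(categories, targets):
        sums[c] = sums.get(c, 0.0) + t
        counts[c] = counts.get(c, 0) + 1
    encoded = []
    for c, t in zip(categories, targets):
        if counts[c] > 1:
            encoded.append((sums[c] - t) / (counts[c] - 1))  # drop this row's own target
        else:
            encoded.append(global_mean)  # a lone row has no other rows to average
    return encoded


colors = ["a", "c", "a", "a", "b", "b"]
outcome = [1, 2, 0, 0, 0, 1]
loo = leave_one_out_encode(colors, outcome)
print([round(x, 6) for x in loo])
```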

In [6]:
df = pd.DataFrame({
    'color':["a", "c", "a", "a", "b", "b"], 
    'country': ['India','UAE','USA','India','PERU','UAE'],
    'outcome':[1, 2, 0, 0, 0, 1]})
X = df.drop('outcome', axis = 1)
y = df['outcome']

In [8]:
ce_leave = ce.LeaveOneOutEncoder(cols = ['color'])
ce_leave.fit(X, y)        
ce_leave.transform(X, y)       

Unnamed: 0,color,country
0,0.0,India
1,0.666667,UAE
2,0.5,USA
3,0.5,India
4,1.0,PERU
5,0.0,UAE


## 12. James-Stein Encoding
For each feature value, the James-Stein estimator returns a weighted average of:
1. The mean target value for the observed feature value.
2. The mean target value (regardless of the feature value).


The James-Stein encoder shrinks the average toward the overall average. It is a target based encoder. The James-Stein estimator has, however, one practical limitation: it was defined only for normal distributions.

In [104]:
# Build the encoder
encoder = ce.JamesSteinEncoder(cols=['color','country'])

# Encode the frame and view it
color_transform = encoder.fit_transform(X, y)
color_transform

Unnamed: 0,color,country
0,0.333333,0.52381
1,2.0,1.380952
2,0.333333,0.0
3,0.333333,0.52381
4,0.5,0.0
5,0.5,1.380952


## 13. M-estimator Encoding
M-Estimate Encoder is a simplified version of Target Encoder. It has only one hyper-parameter, m, which represents the power of regularization. The higher the value of m, the stronger the shrinking. Recommended values for m are in the range of 1 to 100.
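
A minimal sketch of the usual m-estimate formula, (category sum + m × global mean) / (category count + m), is shown here. The helper name `m_estimate_encode` is hypothetical and this is not the library's code; with m = 1 it happens to reproduce the values in the output below.

```python
def m_estimate_encode(categories, targets, m=1.0):
    """Shrink each category mean toward the global mean with strength m."""
    prior = sum(targets) / len(targets)  # global mean of the target
    sums, counts = {}, {}
    for c, t in zip(categories, targets):
        sums[c] = sums.get(c, 0.0) + t
        counts[c] = counts.get(c, 0) + 1
    return [(sums[c] + m * prior) / (counts[c] + m) for c in categories]


colors = ["a", "c", "a", "a", "b", "b"]
outcome = [1, 2, 0, 0, 0, 1]
m1 = m_estimate_encode(colors, outcome, m=1.0)
m100 = m_estimate_encode(colors, outcome, m=100.0)
print([round(x, 6) for x in m1])
print([round(x, 6) for x in m100])
```

With m = 100 every encoded value sits close to the global mean of about 0.667, which is the stronger shrinking described above.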

In [106]:
encoder = ce.MEstimateEncoder(cols=['color'])
meEnc = encoder.fit_transform(X,y)
meEnc

Unnamed: 0,color,country
0,0.416667,India
1,1.333333,UAE
2,0.416667,USA
3,0.416667,India
4,0.555556,PERU
5,0.555556,UAE


### ----- Many other encoding techniques are available in the category_encoders library -----

# Cheatsheet For Categorical Feature Encoding

<img src='Categorical_feature_encoding.png' />