#### What is Mean Encoding (TS Encoding, Target Encoding)?

When you’re doing supervised learning you often have to deal with categorical variables. That is, variables which don’t have a natural numerical representation. The problem is that most machine learning algorithms require the input data to be numerical. At some point or another a data science pipeline will require converting categorical variables to numerical variables.


There are many ways to do so, but the most popular ones are:

1. **Label encoding**: where you choose an arbitrary number for each category


2. **One-hot encoding**: where you create one binary column per category


3. **Vector representation**: a.k.a. **Entity encoding** (**word2vec**) where you find a low dimensional subspace that fits your data


4. **Optimal binning**: where you rely on tree-learners such as LightGBM or CatBoost


5. **Target encoding**: where you average the target value by category


Each of these methods has its pros and cons, and the right choice usually depends on your data and your requirements. If a variable has a lot of categories, a one-hot encoding scheme will produce many columns, which can cause memory issues. In my experience, relying on LightGBM/CatBoost is the best out-of-the-box method. Label encoding is useless and you should never use it. However, if your categorical variable happens to be ordinal, then you can and should represent it with increasing numbers (for example “cold” becomes 0, “mild” becomes 1, and “hot” becomes 2). word2vec and other such methods are cool and good, but they require some fine-tuning and don’t always work out.
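To make the difference between these schemes concrete, here is a minimal pandas sketch. The `weather` column and its rows are made up for illustration; only the “cold”, “mild”, “hot” ordering comes from the text above.

```python
import pandas as pd

# Hypothetical toy column; only the category names come from the text above.
df = pd.DataFrame({'weather': ['cold', 'hot', 'mild', 'cold', 'hot']})

# Ordinal encoding: spell out the meaningful order (cold < mild < hot)
# and use the position of each category in that order as its code.
order = ['cold', 'mild', 'hot']
df['weather_ordinal'] = pd.Categorical(
    df['weather'], categories=order, ordered=True
).codes

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df['weather'], prefix='weather')

print(df)
print(one_hot)
```

With only three categories the one-hot frame stays small, but it grows by one column for every extra category, which is where the memory issues come from.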

Target encoding (TS encoding, mean encoding) is a fast way to get the most out of your categorical variables with little effort. The idea is quite simple:

Say you have a categorical variable x and a target y.


y can be binary, multinomial, or continuous; it doesn’t matter.


For each distinct (unique) element in x you’re going to compute the average of the corresponding values in y.


Then you’re going to replace each x_i with the corresponding mean. This is rather easy to do in Python with the pandas library, as the worked example later in this article shows.

**In a nutshell, it uses the target variable as the basis to generate the new encoded feature.**

Let’s take a look at an example:

![Table01](./images/table01.png)

In this sample dataset we can see a feature named `Jobs`, another feature named `Age`, and a target variable for a binary classification problem with classes `1` and `0`. The feature `Age` is all set since it’s already numerical, but we still need to encode the feature `Jobs`.

The most obvious approach would be label encoding, where we would convert the values according to a mapping logic — an example is 1 for Doctor, 2 for Teacher, 3 for Engineer, 4 for Waiter and 5 for Driver. Thus the result would be:

![Table02](./images/table02.png)

There’s nothing wrong with this approach really. However, if we look at the distribution of the feature, we see that it is completely random, no correlation whatsoever with the target variable.

![chart01](./images/chart01.png)

Which makes sense, right? There wasn’t any specific logic regarding the mapping we applied, we just gave each job a number and that was it. Now, is there another way we can encode this feature so that it’s not so random and maybe gives us some extra information about the target variable itself?

Let’s try this: for each unique value of the categorical feature, let us encode it based on the ratio of occurrence of the positive class in the target variable (**mean of the targets for that categorical value**). The result would be:

![table03](./images/table03.png)

Why? For example, let’s look at the unique value “Doctor”. It has 4 occurrences in the dataset and 2 of those have the positive label, so the mean encoding for “Doctor” is 0.5. Repeat the process for all unique values of the feature and you get the result.
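Here is a small sketch of that computation in pandas. The table above is only available as an image, so the rows below are hypothetical: only the “Doctor” counts (4 rows, 2 positives) come from the text, and the `Target` column name is my own choice.

```python
import pandas as pd

# Hypothetical rows; only the "Doctor" counts are taken from the text.
df = pd.DataFrame({
    'Jobs': ['Doctor', 'Doctor', 'Doctor', 'Doctor',
             'Teacher', 'Teacher',
             'Waiter', 'Waiter', 'Waiter'],
    'Target': [1, 0, 1, 0, 1, 1, 0, 0, 1],
})

# transform('mean') returns one value per row, aligned with the original
# index, so it can be assigned straight back as a new column.
df['Jobs_encoded'] = df.groupby('Jobs')['Target'].transform('mean')

print(df)
```

Every “Doctor” row ends up with 0.5, which matches the hand computation.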

Now let’s look at the distribution of the feature once again and see the difference.

![chart02](./images/chart02.png)

The target classes seem way more separate — class 1 to the right, class 0 to the left — because there is a correlation between the feature value and the target class. **From a mathematical point of view, mean encoding represents a probability of your target variable, conditional on each value of the feature.** In a way, it embodies the target variable in its encoded value.

To conclude, **mean encoding** solves the encoding task and also creates a feature that is more representative of the target variable, **essentially killing two birds with one stone.**

#### Pros and Cons

However, it’s not all roses. While mean encoding has been shown to increase the quality of a classification model, it doesn’t come without problems, the main one being the usual suspect: **overfitting.**

The fact that we are encoding the feature based on target classes may lead to **data leakage**, rendering the feature biased. To solve this, mean encoding is usually used with some type of regularization. Check [this solution](https://www.kaggle.com/c/mercedes-benz-greener-manufacturing/discussion/36136#201638) on Kaggle as an example, where the author used an averaged cross-validation scheme.

We might as well mention gradient boosted trees (GBT) and how mean encoding is particularly useful with them. One of GBT’s downsides is that it struggles with high-cardinality categorical features, because trees have limited depth.


![graph01](./images/graph01.png)

Now, since mean encoding places the categories on a single numeric scale ordered by the target (and lowers cardinality whenever several categories share the same target mean), it becomes a great tool for reaching a better loss with a shorter tree, thus improving the classification model.

**HINT:** measure the cardinality of the categorical features before and after applying mean encoding
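A quick way to follow the hint is to compare `nunique()` before and after encoding. The data below is hypothetical and deliberately built so that “Doctor” and “Teacher” share the same target mean. Cardinality only drops when such ties exist, so don’t expect a reduction on every dataset.

```python
import pandas as pd

# Hypothetical data: 'Doctor' and 'Teacher' are built to share the same mean.
df = pd.DataFrame({
    'Jobs': ['Doctor', 'Doctor', 'Teacher', 'Teacher',
             'Waiter', 'Waiter', 'Driver', 'Driver'],
    'Target': [1, 0, 0, 1, 0, 0, 1, 1],
})

# Number of distinct values before encoding.
before = df['Jobs'].nunique()

# Mean-encode, then count the distinct encoded values.
encoded = df.groupby('Jobs')['Target'].transform('mean')
after = encoded.nunique()

print(f'cardinality before: {before}, after: {after}')
```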

#### Target Encoding (TS Encoding, Mean Encoding) Done the Right Way


Target encoding is rather easy to do in Python with the pandas library, but doing it the right way can noticeably improve accuracy. Let me illustrate this with an example.

```python
import pandas as pd

df = pd.DataFrame({
    'x_0': ['a'] * 5 + ['b'] * 5,
    'x_1': ['a'] * 9 + ['b'] * 1,
    'y': [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
})
df
```

|   | x_0 | x_1 | y |
|---|-----|-----|---|
| 0 | a | a | 1 |
| 1 | a | a | 1 |
| 2 | a | a | 1 |
| 3 | a | a | 1 |
| 4 | a | a | 0 |
| 5 | b | a | 1 |
| 6 | b | a | 0 |
| 7 | b | a | 0 |
| 8 | b | a | 0 |
| 9 | b | b | 0 |


We can start by computing the means of the `x_0` column.

```python
means = df.groupby('x_0')['y'].mean()
means
```

```
x_0
a    0.8
b    0.2
Name: y, dtype: float64
```

We can then replace each value in x_0 with the matching mean.

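```python
df['x_0'] = df['x_0'].map(means)
df
```

|   | x_0 | x_1 | y |
|---|-----|-----|---|
| 0 | 0.8 | a | 1 |
| 1 | 0.8 | a | 1 |
| 2 | 0.8 | a | 1 |
| 3 | 0.8 | a | 1 |
| 4 | 0.8 | a | 0 |
| 5 | 0.2 | a | 1 |
| 6 | 0.2 | a | 0 |
| 7 | 0.2 | a | 0 |
| 8 | 0.2 | a | 0 |
| 9 | 0.2 | b | 0 |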


We can do the same for `x_1`.

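```python
df['x_1'] = df['x_1'].map(df.groupby('x_1')['y'].mean())
df
```

|   | x_0 | x_1 | y |
|---|-----|----------|---|
| 0 | 0.8 | 0.555556 | 1 |
| 1 | 0.8 | 0.555556 | 1 |
| 2 | 0.8 | 0.555556 | 1 |
| 3 | 0.8 | 0.555556 | 1 |
| 4 | 0.8 | 0.555556 | 0 |
| 5 | 0.2 | 0.555556 | 1 |
| 6 | 0.2 | 0.555556 | 0 |
| 7 | 0.2 | 0.555556 | 0 |
| 8 | 0.2 | 0.555556 | 0 |
| 9 | 0.2 | 0.000000 | 0 |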


Target encoding is good because it picks up values that can explain the target. In this silly example, value `a` of variable `x_0` has an average target value of 0.8.
This can greatly help the machine learning classification algorithms used downstream.

The problem of target encoding has a name: **over-fitting**. Indeed relying on an average value isn’t always a good idea when the number of values used in the average is low. You’ve got to keep in mind that the dataset you’re training on is a sample of a larger set. This means that whatever artifacts you may find in the training set might not hold true when applied to another dataset (i.e. the test set).

In the example, the value `b` of variable `x_1` is replaced with a 0 because it only appears once and the corresponding value of `y` is a 0. In this case we’re over-fitting because we don’t have enough values to be sure that 0 is in fact the mean value of `y` when `x_1` is equal to `b`.
In other words, relying only on each group mean is too reckless.

There are various ways to handle this. A popular way is to use cross-validation: the rows of each fold are encoded with means computed on the remaining folds, so no row contributes to its own encoding. This is what many Kagglers do.
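As a rough sketch of the cross-validation idea, the function below encodes each fold using only the rows outside that fold. The function name, the `n_splits` and `seed` parameters, and the fallback for categories missing from the other folds are my own assumptions, not a reference implementation of what any particular Kaggle solution does.

```python
import numpy as np
import pandas as pd

def out_of_fold_target_encode(df, by, on, n_splits=5, seed=0):
    """Encode df[by] with means of df[on] computed on the other folds only."""
    rng = np.random.default_rng(seed)
    # Shuffle the row positions and cut them into n_splits roughly equal folds.
    folds = np.array_split(rng.permutation(len(df)), n_splits)
    encoded = np.full(len(df), np.nan)

    for fold in folds:
        mask = np.zeros(len(df), dtype=bool)
        mask[fold] = True
        rest = df[~mask]  # every row outside the current fold
        means = rest.groupby(by)[on].mean()
        # Categories absent from the other folds fall back to their overall mean.
        encoded[mask] = df.loc[mask, by].map(means).fillna(rest[on].mean()).to_numpy()

    return pd.Series(encoded, index=df.index)

df = pd.DataFrame({
    'x_0': ['a'] * 5 + ['b'] * 5,
    'y': [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
})
df['x_0_oof'] = out_of_fold_target_encode(df, by='x_0', on='y', n_splits=5)
print(df)
```

Because a row never sees its own target, the encoded values vary a little from row to row within the same category. That noise is the price paid for avoiding the leakage.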


Another approach, which I much prefer, is to use **additive smoothing.** This is supposedly what **IMDB uses to rate its movies.**

The intuition is as follows. Imagine a new movie is posted on IMDB and it receives three ratings. Taking into account the three ratings gives the movie an average of 9.5. This is surprising because most movies tend to hover around 7, and the very good ones rarely go above 8. The point is that these first three ratings are extreme values that can’t be trusted. The trick is to “smooth” the average by including the average rating over all movies. In other words, if there aren’t many ratings we should rely on the global average rating, whereas if there are enough ratings then we can safely rely on the local average.

Mathematically this is equivalent to:

$$\mu = \frac{n \times \bar{x} + m \times w}{n + m}$$

where

- $\mu$ is the mean we’re trying to compute (the one that’s going to replace our categorical values)
- $n$ is the number of values you have
- $\bar{x}$ is your estimated mean
- $m$ is the “weight” you want to assign to the overall mean
- $w$ is the overall mean

In this notation m is the only parameter you have to set. The idea is that the higher m is, the more you’re going to rely on the overall mean w. If m is equal to 0 then you’re simply going to compute the empirical mean, which is:

$$\mu = \frac{n \times \bar{x} + 0 \times w}{n + 0} = \frac{n \times \bar{x}}{n} = \bar{x}$$

In other words you’re not doing any smoothing whatsoever.
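To see the other end of the scale, plug the movie example into the formula. The three ratings, the local average of 9.5, and the overall average of about 7 come from the story above; the weight of 10 is an assumption picked purely for illustration.

$$\mu = \frac{3 \times 9.5 + 10 \times 7}{3 + 10} = \frac{28.5 + 70}{13} = \frac{98.5}{13} \approx 7.58$$

With the same weight but 300 ratings averaging 9.5, the local average dominates instead:

$$\mu = \frac{300 \times 9.5 + 10 \times 7}{300 + 10} = \frac{2850 + 70}{310} = \frac{2920}{310} \approx 9.42$$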

Again this is quite easy to do in Python. First we’re going to write a function that computes a smooth mean. It’s going to take as input a `pandas.DataFrame`, a categorical column name, the name of the target column, and a weight `m`.

```python
def calc_smooth_mean(df, by, on, m):
    # Compute the global mean
    mean = df[on].mean()

    # Compute the number of values and the mean of each group
    agg = df.groupby(by)[on].agg(['count', 'mean'])
    counts = agg['count']
    means = agg['mean']

    # Compute the "smoothed" means
    smooth = (counts * means + m * mean) / (counts + m)

    # Replace each value by the corresponding smoothed mean
    return df[by].map(smooth)
```

Let’s see what this does in the previous example with a weight of, say, 10.

```python
df['x_0'] = calc_smooth_mean(df, by='x_0', on='y', m=10)
df['x_1'] = calc_smooth_mean(df, by='x_1', on='y', m=10)
df
```

|   | x_0 | x_1 | y |
|---|-----|----------|---|
| 0 | 0.6 | 0.526316 | 1 |
| 1 | 0.6 | 0.526316 | 1 |
| 2 | 0.6 | 0.526316 | 1 |
| 3 | 0.6 | 0.526316 | 1 |
| 4 | 0.6 | 0.526316 | 0 |
| 5 | 0.4 | 0.526316 | 1 |
| 6 | 0.4 | 0.526316 | 0 |
| 7 | 0.4 | 0.526316 | 0 |
| 8 | 0.4 | 0.526316 | 0 |
| 9 | 0.4 | 0.454545 | 0 |


It should be quite noticeable that each computed value is much closer to the overall mean of 0.5. This is because a weight of 10 is rather large for a dataset of only 10 values. The value `b` of variable `x_1` has been replaced with 0.454545 instead of the 0 we got earlier. The equation for obtaining it was:

$$\mu_b = \frac{1 \times 0 + 10 \times 0.5}{1 + 10} = \frac{0 + 5}{11} \simeq 0.454545$$

Meanwhile the new value for replacing the value `a` of variable `x_0` was:

$$\mu_a = \frac{5 \times 0.8 + 10 \times 0.5}{5 + 10} = \frac{4 + 5}{15} = 0.6$$


Computing smooth means can be done extremely quickly. What’s more, you only have to choose a single parameter, which is `m`. I find that setting it to something like 300 works well in most cases. It’s quite intuitive really: you’re saying that a category needs more than 300 values before its sample mean outweighs the global mean. There are other ways to do target encoding; you can search for one that is rather popular on Kaggle. However, it produces encoded variables which are very correlated with the output of additive smoothing, at the cost of requiring two parameters.
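One practical detail the toy example hides is that the smoothed means should be learned on the training set and then reused on new data. Here is a hedged sketch of that split. The two function names and the `train`/`test` frames are hypothetical. Sending unseen categories to the global mean is my own choice, though it agrees with the formula when n is 0.

```python
import pandas as pd

def fit_smooth_means(train, by, on, m):
    """Return (per-category smoothed means, global mean) learned from train."""
    global_mean = train[on].mean()
    agg = train.groupby(by)[on].agg(['count', 'mean'])
    smooth = (agg['count'] * agg['mean'] + m * global_mean) / (agg['count'] + m)
    return smooth, global_mean

def apply_smooth_means(df, by, smooth, global_mean):
    """Encode df[by]; categories unseen during fitting get the global mean."""
    return df[by].map(smooth).fillna(global_mean)

train = pd.DataFrame({
    'x_1': ['a'] * 9 + ['b'] * 1,
    'y': [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]
})
test = pd.DataFrame({'x_1': ['a', 'b', 'c']})  # 'c' never appears in train

smooth, global_mean = fit_smooth_means(train, by='x_1', on='y', m=10)
test['x_1_encoded'] = apply_smooth_means(test, 'x_1', smooth, global_mean)
print(test)
```

Keeping the fitted means separate from the frame they were computed on also means the test set’s targets, if you even have them, never touch the encoding.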