# Introduction #

Most of the techniques we've seen in this course have been for numerical features. The technique we'll look at in this lesson, *target encoding*, is instead meant to be used on categorical features. It's a method of encoding categories as numbers, like one-hot or label encoding, with the difference that it makes use of the *target* as well. This makes it what we call a **supervised** feature engineering technique. Like all supervised techniques, it can be very powerful. But it does come with an increased risk of overfitting, so we need to take some care in how we apply it.

# Target Encoding #

Most broadly, a **target encoding** is any kind of encoding that replaces a feature's categories with some number derived from the target.

A simple version is just to apply one of the group aggregations we saw in Lesson 3. The *Automobiles* dataset has a numeric target `'price'` and a categorical feature `'make'`. The following transform encodes the make of each vehicle as the average price of that make -- the category `'nissan'`, for instance, would be replaced with the average price of all Nissans in the data (which happens to be $10415.67).

```
df["make_encoded"] = df.groupby("make")["price"].transform("mean")
```

An encoding like this is best for features with a known set of categories, with none of them very rare.
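To see why, here is a minimal sketch on a made-up table. The makes and prices are invented for illustration and are not taken from *Automobiles*; the names `train`, `new`, and `make_means` are likewise just for this example. It learns the mean price per make from one frame and looks those means up for another.

```python
import pandas as pd

# Toy data, invented for illustration only
train = pd.DataFrame({
    "make": ["nissan", "nissan", "nissan", "audi", "audi", "jaguar"],
    "price": [9000.0, 10000.0, 11000.0, 17000.0, 19000.0, 35000.0],
})
new = pd.DataFrame({"make": ["nissan", "jaguar", "volvo"]})

# Learn the mean price per make, then look up each make in the new data
make_means = train.groupby("make")["price"].mean()
new["make_encoded"] = new["make"].map(make_means)

print(new)
```

Here `'jaguar'` gets its encoding from a single car, and `'volvo'`, which never appears in `train`, gets no encoding at all (a missing value).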

# Smoothing #

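An encoding like this has two weaknesses, though. A category that never appeared in the data used to compute the encoding has no encoded value at all, and a rare category gets an average computed from only a handful of rows, which is unlikely to be representative. *Smoothing* is a solution to both problems.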

The *in-category* estimate is blended with the *overall* estimate according to how frequent the category is: common categories have encodings weighted towards the in-category estimate, while rare categories have encodings weighted towards the overall estimate. In pseudocode:

```
encoding = weight * in_category + (1 - weight) * overall
```
where `weight` is a value between 0 and 1 calculated from the category frequency.

There are a variety of ways to determine the value of `weight`. For an "m-estimate", it is computed as:
```
weight = n / (n + m)
```
where `n` is the total number of times that category occurs in the data. The parameter `m` determines the "shrinkage factor". Larger values of `m` put more weight on the overall estimate.
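The two formulas above can be combined into a few lines of pandas. This is a hand-rolled sketch under stated assumptions, not the implementation of any particular library: the function name `m_estimate_encode` and the toy data are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import pandas as pd

def m_estimate_encode(categories, target, m):
    """Blend each category's mean target with the overall mean (sketch)."""
    overall = target.mean()
    # Per-category mean of the target and number of rows (n)
    stats = target.groupby(categories).agg(["mean", "count"])
    weight = stats["count"] / (stats["count"] + m)  # n / (n + m)
    return weight * stats["mean"] + (1 - weight) * overall

# Toy data, invented for illustration only
df = pd.DataFrame({
    "make": ["a", "a", "a", "a", "b"],
    "price": [10.0, 10.0, 10.0, 10.0, 30.0],
})
encoding = m_estimate_encode(df["make"], df["price"], m=1.0)
print(encoding)
```

With an overall mean of 14, the common category `a` (four rows) stays close to its own mean of 10, while the rare category `b` (one row) is pulled halfway from 30 towards 14.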

<figure style="padding: 1em;">
<img src="https://i.imgur.com/1uVtQEz.png" width="500" alt="">
<figcaption style="text-align: center; font-style: italic"><center>
</center></figcaption>
</figure>

The figure shows <mark>TODO</mark>.

Here's an example. In the *Automobiles* dataset, there are three cars with the make `'chevrolet'`. If you chose `m=2.0`, then the weight would be `3 / (3 + 2) = 0.6`, so the `'chevrolet'` category would be encoded with 60% of the Chevrolet average price plus 40% of the overall average price.

When choosing a value for `m`, consider how noisy you expect the categories to be. Does the price of a vehicle vary a great deal within each make? Would you need a lot of data to get good estimates? If so, it could be better to choose a larger value for `m`; if the average price for each make were relatively stable, a smaller value could be okay.
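To get a feel for the effect of `m`, the sketch below computes the weight for a category with three rows under a few candidate values. The helper `m_weight` and the particular values of `m` are assumptions made for illustration only.

```python
def m_weight(n, m):
    """Weight on the in-category estimate for a category seen n times."""
    return n / (n + m)

n = 3  # e.g. a make that appears three times
weights = {m: m_weight(n, m) for m in (0.5, 2.0, 10.0, 50.0)}

for m, w in weights.items():
    print(f"m={m:>4}: {w:.0%} in-category, {1 - w:.0%} overall")
```

As `m` grows, the weight on the in-category estimate shrinks, so a small category is encoded closer and closer to the overall average.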

# Example - MovieLens1M #

The *MovieLens1M* dataset contains one million movie ratings by users of the MovieLens website, with features describing each user and movie.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

sns.set_style('whitegrid')

df = pd.read_csv("../input/fe-course-data/movielens1m.csv")
df = df.astype(np.uint8, errors='ignore') # reduce memory footprint
print("Number of Unique Zipcodes: {}".format(df["Zipcode"].nunique()))

With over 3000 categories, the `Zipcode` feature makes a good candidate for target encoding, and the size of this dataset (over one million rows) means we can spare some data to create the encoding.

This next cell will create a 25% split for the encoding data.

In [None]:
from sklearn.model_selection import train_test_split

X = df.copy()
y = X.pop('Rating')

# `test_size=0.75` sends 75% of the rows to X_pretrain / y_train,
# leaving the remaining 25% (X_encode / y_encode) for fitting the encoder
X_encode, X_pretrain, y_encode, y_train = train_test_split(X, y, test_size=0.75)

In [None]:
from category_encoders import MEstimateEncoder

# Create the encoder instance. Choose m to control noise.
encoder = MEstimateEncoder(cols=["Zipcode"], m=5.0)

# Fit the encoder on the encoding split.
encoder.fit(X_encode, y_encode)

# Encode the Zipcode column to create the final training data
X_train = encoder.transform(X_pretrain)

Let's compare the encoded values to the target to see how informative our encoding might be.

In [None]:
plt.figure(dpi=90)
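# Plot the ratings as a density so they share a scale with the KDE curve
# (histplot replaces the deprecated distplot; fill replaces shade)
ax = sns.histplot(y, stat="density", discrete=True, label="Rating")
ax = sns.kdeplot(X_train["Zipcode"], bw_adjust=2.0, color='r', fill=True, label="Zipcode", ax=ax)
ax.set_xlabel("Rating")
ax.legend();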

The distribution of the encoded `Zipcode` feature roughly follows the distribution of the actual ratings, meaning that movie-watchers differed enough in their ratings from zipcode to zipcode that our target encoding was able to capture useful information.

# Keep Going #