<center>
    <h1 id='target-encoding' style='color:#7159c1'>⚙️ Target Encoding ⚙️</h1>
    <i>Encoding Categorical Features with the Target</i>
</center>

---

`Target Encoding`, like One-Hot and Ordinal Encoding, is applied to Categorical Features in order to transform them into numbers. However, unlike these two techniques, Target Encoding uses the Target as a parameter to encode the Categorical Features.

Use Cases for Target Encoding:

> **High-Cardinality Features** - `a feature with a large number of categories can be troublesome to encode - a one-hot encoding would generate too many features and alternatives, like a label encoding, might not be appropriate for that feature. A target encoding derives numbers for the categories using the feature's most important property: its relationship with the target`;

> **Domain-Motivated Features** - `from prior experience, you might suspect that a categorical feature should be important even if it scored poorly with a feature metric. A target encoding can help reveal a feature's true informativeness`.
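
To make the high-cardinality case concrete, the sketch below compares how many columns a one-hot encoding and a mean-based target encoding produce. The `zipcode` and `rating` columns and their values are made up for illustration and do not come from any dataset used in this notebook.

```python
import pandas as pd

# Hypothetical toy data: 200 rows, 50 distinct zipcodes, ratings from 1 to 5
toy_df = pd.DataFrame({
    "zipcode": [str(10000 + i % 50) for i in range(200)],
    "rating": [(i % 5) + 1 for i in range(200)],
})

# One-hot encoding: one new column per distinct category
one_hot = pd.get_dummies(toy_df["zipcode"])

# Target (mean) encoding: a single numeric column, whatever the cardinality
target_encoded = toy_df.groupby("zipcode")["rating"].transform("mean")

print(f"one-hot columns: {one_hot.shape[1]}")
print(f"target-encoded columns: {target_encoded.to_frame().shape[1]}")
```

The one-hot width grows with the number of categories, while the target encoding stays at one column.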

<br />

Techniques:

```
- Mean
- Smoothing
```

<h1 id='0-mean' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>0 | Mean</h1>

AKA: Bin Counting, Likelihood Encoding, Impact Encoding and Leave-One-Out Encoding.

Groups the rows by the Categorical Feature, selects the Target and calculates its Mean within each group.

In [1]:
# ---- Mean Encoding ----
import pandas as pd # pip install pandas

autos_df = pd.read_csv('./datasets/autos.csv')
autos_df['make_price_encoded'] = (
    autos_df.groupby('make') # for each 'make'
    ['price']                # select the 'price'
    .transform('mean')       # and calculate price's mean
)

autos_df[['make', 'make_price_encoded']].head()

|   | make | make_price_encoded |
|---|---|---|
| 0 | alfa-romero | 15498.333333 |
| 1 | alfa-romero | 15498.333333 |
| 2 | alfa-romero | 15498.333333 |
| 3 | audi | 17859.166667 |
| 4 | audi | 17859.166667 |
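
Because the cell above depends on `autos.csv`, here is a minimal self-contained sketch of the same `groupby` + `transform` pattern on a hypothetical three-row frame, small enough to check by hand. The makes and prices are invented for the example.

```python
import pandas as pd

# Hypothetical toy data, not taken from autos.csv
toy_autos = pd.DataFrame({
    "make": ["audi", "audi", "bmw"],
    "price": [10.0, 20.0, 30.0],
})

# Every row receives the mean price of its own 'make'
toy_autos["make_price_encoded"] = (
    toy_autos.groupby("make")["price"].transform("mean")
)

print(toy_autos)
```

Both `audi` rows get (10 + 20) / 2 = 15, and the single `bmw` row simply gets its own price, 30.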


<h1 id='1-smoothing' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>1 | Smoothing</h1>

An encoding like the Mean One presents a couple of problems, however.

> **Unknown Categories** - `target encodings create a special risk of overfitting, which means they need to be trained on an independent "encoding" split. When you join the encoding to future splits, Pandas will fill in missing values for any categories not present in the encoding split. These missing values you would have to impute somehow`;

> **Rare Categories** - `when a category only occurs a few times in the dataset, any statistics calculated on its group are unlikely to be very accurate. In the Automobiles dataset, the mercury make only occurs once. The "mean" price we calculated is just the price of that one vehicle, which might not be very representative of any Mercuries we might see in the future. Target encoding rare categories can make overfitting more likely`.
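
The unknown-category problem can be sketched with plain pandas: the encoding is learned on one split and mapped onto another, and a make that never appeared in the encoding split comes back as `NaN`. The frames and the choice of the overall mean as the fill value are assumptions made for this illustration; other imputation strategies are possible.

```python
import pandas as pd

# Hypothetical encoding split and a later split containing an unseen make
encode_df = pd.DataFrame({
    "make": ["audi", "audi", "bmw"],
    "price": [10.0, 20.0, 30.0],
})
new_df = pd.DataFrame({"make": ["audi", "bmw", "volvo"]})

# Learn the per-category means on the encoding split only
category_means = encode_df.groupby("make")["price"].mean()
overall_mean = encode_df["price"].mean()

# Map the learned means onto the new split: 'volvo' has no entry, so it becomes NaN
new_df["make_encoded"] = new_df["make"].map(category_means)
n_missing = int(new_df["make_encoded"].isna().sum())

# One simple imputation choice (an assumption here): fall back to the overall mean
new_df["make_encoded"] = new_df["make_encoded"].fillna(overall_mean)

print(f"unseen categories: {n_missing}")
print(new_df)
```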

<br />

A solution to these problems is to add `SMOOTHING`. The idea is to blend the in-category average with the overall average. Rare categories get less weight on their category average, while missing categories just get the overall average:

$$
\text{encoding} = \text{weight} \cdot \text{in_category} + (1 - \text{weight}) \cdot \text{overall}
$$

Where weight is a value between 0 and 1 calculated from the category frequency. An easy way to determine the value for weight is to compute an m-estimate:

$$
\text{weight} = \frac{n}{n + m}
$$

Where n is the total number of times that category occurs in the data. The parameter m determines the "smoothing factor". Larger values of m put more weight on the overall estimate.
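
As a rough sketch of the m-estimate, the helper below blends each category's mean with the overall mean using `weight = n / (n + m)`, so that the encoding is `weight * in_category + (1 - weight) * overall`. The function name `m_estimate_encoding` and the toy prices are hypothetical, and this is a hand-rolled illustration rather than the implementation of any particular library.

```python
import pandas as pd

def m_estimate_encoding(categories: pd.Series, target: pd.Series, m: float) -> pd.Series:
    """Return one smoothed value per category (hypothetical helper)."""
    # Per-category mean of the target and number of rows n
    stats = target.groupby(categories).agg(["mean", "count"])
    overall = target.mean()
    # weight = n / (n + m): rare categories get a small weight
    weight = stats["count"] / (stats["count"] + m)
    # Blend the in-category mean with the overall mean
    return weight * stats["mean"] + (1 - weight) * overall

# Hypothetical toy data: 'mercury' occurs only once
toy = pd.DataFrame({
    "make": ["audi", "audi", "audi", "audi", "mercury"],
    "price": [10.0, 20.0, 30.0, 40.0, 100.0],
})

light = m_estimate_encoding(toy["make"], toy["price"], m=1.0)
heavy = m_estimate_encoding(toy["make"], toy["price"], m=9.0)

print(light)
print(heavy)
```

Here the overall mean is 40. With `m=1`, the single `mercury` price of 100 is pulled down to 0.5 * 100 + 0.5 * 40 = 70; with `m=9` it moves to 0.1 * 100 + 0.9 * 40 = 46, much closer to the overall mean.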

In [3]:
# ---- Smoothing Encoding ----
import numpy as np  # pip install numpy
import pandas as pd # pip install pandas
import warnings     # standard library, no installation needed
warnings.filterwarnings('ignore')

In [6]:
# ---- Smoothing Encoding ----
movies_df = pd.read_csv('./datasets/movielens1m.csv')
movies_df = movies_df.astype(np.uint8, errors='ignore') # reduce memory footprint
print(f'- Number of Unique Zipcodes: {movies_df.Zipcode.nunique()}')

- Number of Unique Zipcodes: 3439


---

With over 3000 categories, the Zipcode feature makes a good candidate for target encoding, and the size of this dataset (over one million rows) means we can spare some data to create the encoding.

We'll start by creating a 25% split to train the target encoder.

In [10]:
# ---- Smoothing Encoding ----
#
# - setting up Features and Target, then splitting the dataset into an
# encoding split (25%, used to fit the Target Encoder) and a pretrain
# split (the remaining 75%)
#
X = movies_df.copy()
y = X.pop('Rating')

X_encode = X.sample(frac=0.25)
y_encode = y[X_encode.index]
X_pretrain = X.drop(X_encode.index)
y_pretrain = y[X_pretrain.index]

In [16]:
# ---- Smoothing Encoding ----
#
# - importing library, creating and training the encoder
# and encoding the dataset
#
from category_encoders import MEstimateEncoder # pip install category_encoders

encoder = MEstimateEncoder(cols=['Zipcode'], m=5.0) # 'm' is the smoothing factor
encoder.fit(X_encode, y_encode)

X_train = encoder.transform(X_pretrain)
X_train.Zipcode.head()

0    3.728199
1    2.982014
2    3.475534
3    3.626640
4    3.871088
Name: Zipcode, dtype: float64

---

<h1 id='reach-me' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📫 | Reach Me</h1>

> **Email** - [csfelix08@gmail.com](mailto:csfelix08@gmail.com?)

> **Linkedin** - [linkedin.com/in/csfelix/](https://www.linkedin.com/in/csfelix/)

> **GitHub** - [CSFelix](https://github.com/CSFelix)

> **Kaggle** - [DSFelix](https://www.kaggle.com/dsfelix)

> **Portfolio** - [CSFelix.io](https://csfelix.github.io/).