References: <br>
- [Target-encoding Categorical Variables](https://towardsdatascience.com/dealing-with-categorical-variables-by-using-target-encoder-a0f1733a4c69)
- [GitHub](https://github.com/vinyluis/Articles/tree/main/Target%20Encoder)

One-hot encoding creates a sparse matrix and inflates the number of dimensions (Curse of Dimensionality). <br>
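
As a rough illustration of the dimensionality point, the sketch below one-hot encodes a toy column with `pd.get_dummies`. The `toy` frame is made up for this example and only reuses the genre labels from this notebook.

```python
import pandas as pd

# Toy column with five distinct genres (illustrative data, not the notebook's dataset)
toy = pd.DataFrame({"genre": ["Sci Fi", "Drama", "Romance", "Fantasy", "Nonfiction", "Drama"]})

# One-hot encoding creates one new column per distinct category
one_hot = pd.get_dummies(toy["genre"])

print("distinct genres: ", toy["genre"].nunique())
print("one-hot columns: ", one_hot.shape[1])
```

A target encoder would keep a single numeric column here regardless of how many genres exist.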

# Target Encoder
A target encoder encodes categories by replacing them with a measurement of the effect they might have on the target.


For a binary classification problem, calculate the probability `p(t = 1 | x = ci)` <br>
- `t` = target
- `x` = input
- `ci` = i-th category

In Bayesian statistics, this is the posterior probability of `t = 1` given that the input was the category `ci`. <br>
Replace category `ci` with the posterior probability of the target being `1` given that category.

In [1]:
import pandas as pd
import numpy as np

Create synthetic dataset:

In [2]:
seed = 2022

# target
np.random.seed(seed)
target = list(np.random.randint(0, 2, 50))

# inputs
genre = ["Sci Fi", "Drama", "Romance", "Fantasy", "Nonfiction"]
np.random.seed(seed)
genres = [genre[i] for i in np.random.randint(0, len(genre), 50)]

df = pd.DataFrame({"genre" : genres, "target" : target})
df

Unnamed: 0,genre,target
0,Nonfiction,1
1,Sci Fi,0
2,Drama,1
3,Drama,0
4,Sci Fi,1
5,Sci Fi,1
6,Romance,0
7,Sci Fi,1
8,Sci Fi,0
9,Drama,0


For each category, count the occurrences of the targets `0` and `1`. <br>
Then calculate <br>
`encoding = (count of target = 1) / (total occurrences)`

In [3]:
categories = df['genre'].unique()
targets = df['target'].unique()

cat_list = []
for cat in categories:
    # create dictionary
    aux_dict = {}
    # add category to dictionary
    aux_dict['category'] = cat
    
    # filtered df with current category
    aux_df = df[df['genre'] == cat]
    
    counts = aux_df['target'].value_counts()
    # add total number of rows for this category to dictionary
    aux_dict['count'] = sum(counts)
    
    for t in targets:
        # add count of each value of the target to dictionary
        aux_dict['target_' + str(t)] = counts[t]
        
    cat_list.append(aux_dict)

In [4]:
# create df
cat_list = pd.DataFrame(cat_list)
# get posteriors
cat_list['genre_encoded_prob'] = cat_list['target_1'] / cat_list['count']
cat_list

Unnamed: 0,category,count,target_1,target_0,genre_encoded_prob
0,Nonfiction,7,5,2,0.714286
1,Sci Fi,15,8,7,0.533333
2,Drama,9,3,6,0.333333
3,Romance,10,3,7,0.3
4,Fantasy,9,6,3,0.666667


In [5]:
# add prob to df
df = df.join(cat_list.drop(columns = ['count', 'target_1', 'target_0']).set_index('category'), on = 'genre', how = 'left')
df

Unnamed: 0,genre,target,genre_encoded_prob
0,Nonfiction,1,0.714286
1,Sci Fi,0,0.533333
2,Drama,1,0.333333
3,Drama,0,0.333333
4,Sci Fi,1,0.533333
5,Sci Fi,1,0.533333
6,Romance,0,0.3
7,Sci Fi,1,0.533333
8,Sci Fi,0,0.533333
9,Drama,0,0.333333


Since the target takes only the values `0` and `1`, and the value of interest is `1`, this probability equals the mean of the target within each category. <br>

Calculate this mean with a simple aggregation:

In [6]:
stats = df['target'].groupby(df['genre']).agg(['count', 'mean'])
stats

genre,count,mean
Drama,9,0.333333
Fantasy,9,0.666667
Nonfiction,7,0.714286
Romance,10,0.3
Sci Fi,15,0.533333


Replace categories with their encoded values:

In [7]:
df = df.join(stats.drop(columns = 'count'), on = 'genre', how = 'left').rename(columns = {'mean'  : 'genre_encoded_mean'})
df

Unnamed: 0,genre,target,genre_encoded_prob,genre_encoded_mean
0,Nonfiction,1,0.714286,0.714286
1,Sci Fi,0,0.533333,0.533333
2,Drama,1,0.333333,0.333333
3,Drama,0,0.333333,0.333333
4,Sci Fi,1,0.533333,0.533333
5,Sci Fi,1,0.533333,0.533333
6,Romance,0,0.3,0.3
7,Sci Fi,1,0.533333,0.533333
8,Sci Fi,0,0.533333,0.533333
9,Drama,0,0.333333,0.333333
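
The aggregate-then-join steps can also be collapsed into a single call. The sketch below uses `groupby(...).transform("mean")` on a small made-up frame (`toy` is hypothetical, not the notebook's `df`); it is an alternative way to get the same per-row mean encoding.

```python
import pandas as pd

toy = pd.DataFrame({
    "genre":  ["Drama", "Drama", "Drama", "Sci Fi", "Sci Fi"],
    "target": [1, 0, 0, 1, 1],
})

# transform("mean") computes each group's mean and broadcasts it back to that group's rows
toy["genre_encoded"] = toy.groupby("genre")["target"].transform("mean")
print(toy)
```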


## Problems
- Target leakage <br>
The encoding is computed from the target, so the encoded feature carries information about the very variable the model is trying to predict. The model then learns from a feature that contains the target itself.

- The mean is an imperfect summary of the whole distribution <br>
Even if the mean is a good summary, the model is trained on only a fraction of the data. The mean of this fraction may not match the mean of the full population, so the encoding may be inaccurate. If the sample differs enough from the population, the model may overfit the training data.
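
To make the second problem concrete, the simulation below draws many samples of a binary target and compares how much the sample mean varies for a small and a large category. The population rate of `0.5`, the sample sizes, and the helper `sample_means` are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rate = 0.5  # assumed population value of p(t = 1) for one category

def sample_means(n_rows, n_repeats=2000):
    """Mean of the target in n_repeats independent samples of n_rows rows each."""
    draws = rng.binomial(1, true_rate, size=(n_repeats, n_rows))
    return draws.mean(axis=1)

small = sample_means(5)    # category seen only 5 times
large = sample_means(500)  # category seen 500 times

print("spread of the mean with   5 rows:", small.std())
print("spread of the mean with 500 rows:", large.std())
```

The encoding of a rarely seen category can land far from the population rate, which is the situation smoothing is meant to dampen.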

## Target encoder with prior smoothing
Use prior smoothing to reduce these unwanted effects. <br>
Suppose a model predicts the quality of books in an online store. <br>
One book has a score of 9.8 out of 10 from only 5 evaluations, while the mean score across all books is 7. The 9.8 is probably inflated because it is the mean of a small sample. <br>
"Smooth" the score of this book by pulling it toward the mean of the whole population of books.


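There are 5 categories to encode: Nonfiction, Romance, Drama, Sci Fi, and Fantasy. <br>
Use the mean of the target across all categories to smooth the encoding of each category. <br>
This overall mean of the target is the prior probability `p(t = 1)`. <br>
The encoding uses a weight `α` (between 0 and 1) to balance the category's posterior against the prior: <br>
`encoding = α * p(t = 1 | x = ci) + (1 - α) * p(t = 1)` <br>
`α = 1 / (1 + exp(-(n - k) / f))` <br>
- `n` = number of rows in category `ci`
- `k` = `min_samples_leaf`, the count at which `α = 0.5`
- `f` = `smoothing_factor`, which controls how quickly `α` moves from 0 to 1 as `n` grows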
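
As a quick sanity check of the formula, the sketch below applies it to the book example, treating the score as the target. The function names and the values `k = 10` and `f = 5` are hypothetical, picked only so that the effect is visible for small `n`.

```python
import numpy as np

def smoothing_weight(n, k, f):
    """alpha = 1 / (1 + exp(-(n - k) / f)), where n is the number of rows in the category."""
    return 1.0 / (1.0 + np.exp(-(n - k) / f))

def smoothed_encoding(category_mean, prior, n, k, f):
    """Blend the category mean with the prior using the weight alpha."""
    alpha = smoothing_weight(n, k, f)
    return alpha * category_mean + (1.0 - alpha) * prior

# Book example: category mean 9.8, population mean (prior) 7
k, f = 10, 5  # hypothetical settings, not taken from the notebook
for n in (5, 10, 50):
    print(n, round(smoothed_encoding(9.8, 7.0, n, k, f), 3))
```

With few evaluations the result stays close to the prior of 7; as `n` grows it approaches the book's own mean of 9.8.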

### Manual

In [8]:
# f
smoothing_factor = 1.0
# k
min_samples_leaf = 1

prior = df['target'].mean()

# α
smoove = 1 / (1 + np.exp(-(stats['count'] - min_samples_leaf) / smoothing_factor))

# encoding
smoothing = smoove*stats['mean'] + (1 - smoove)*prior

encoded = pd.Series(smoothing, name = 'genre_encoded_smoothing')
encoded

genre
Drama         0.333389
Fantasy       0.666611
Nonfiction    0.713756
Romance       0.300025
Sci Fi        0.533333
Name: genre_encoded_smoothing, dtype: float64

In [9]:
df = df.join(encoded, on = 'genre', how = 'left')
df

Unnamed: 0,genre,target,genre_encoded_prob,genre_encoded_mean,genre_encoded_smoothing
0,Nonfiction,1,0.714286,0.714286,0.713756
1,Sci Fi,0,0.533333,0.533333,0.533333
2,Drama,1,0.333333,0.333333,0.333389
3,Drama,0,0.333333,0.333333,0.333389
4,Sci Fi,1,0.533333,0.533333,0.533333
5,Sci Fi,1,0.533333,0.533333,0.533333
6,Romance,0,0.3,0.3,0.300025
7,Sci Fi,1,0.533333,0.533333,0.533333
8,Sci Fi,0,0.533333,0.533333,0.533333
9,Drama,0,0.333333,0.333333,0.333389
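
One way to limit the leakage described under Problems is to learn the mapping on training rows only and then apply it to new rows. The sketch below is a minimal version of that idea under stated assumptions: the helper names `fit_target_encoding` and `apply_target_encoding`, the toy data, and the choice to give unseen categories the prior are all illustrative, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def fit_target_encoding(categories, target, k=1, f=1.0):
    """Learn a smoothed category -> encoding mapping from training rows only."""
    prior = target.mean()
    stats = target.groupby(categories).agg(["count", "mean"])
    alpha = 1 / (1 + np.exp(-(stats["count"] - k) / f))
    mapping = alpha * stats["mean"] + (1 - alpha) * prior
    return mapping, prior

def apply_target_encoding(categories, mapping, prior):
    """Map categories to their encodings; categories unseen in training get the prior."""
    return categories.map(mapping).fillna(prior)

train = pd.DataFrame({"genre": ["Drama", "Drama", "Sci Fi", "Sci Fi", "Sci Fi"],
                      "target": [0, 1, 1, 1, 0]})
new = pd.DataFrame({"genre": ["Sci Fi", "Drama", "Poetry"]})  # "Poetry" is not in train

mapping, prior = fit_target_encoding(train["genre"], train["target"])
new["genre_encoded"] = apply_target_encoding(new["genre"], mapping, prior)
print(new)
```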


### Sklearn Category Encoders
https://contrib.scikit-learn.org/category_encoders/targetencoder.html

In [10]:
# pip install --upgrade category_encoders

In [11]:
from category_encoders import TargetEncoder

In [12]:
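# Set the smoothing parameters explicitly to match the manual calculation above;
# the library's default values may differ between category_encoders versions.
encoder = TargetEncoder(min_samples_leaf = 1, smoothing = 1.0)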
df['genre_encoded_sklearn'] = encoder.fit_transform(df['genre'], df['target'])
df

Unnamed: 0,genre,target,genre_encoded_prob,genre_encoded_mean,genre_encoded_smoothing,genre_encoded_sklearn
0,Nonfiction,1,0.714286,0.714286,0.713756,0.713756
1,Sci Fi,0,0.533333,0.533333,0.533333,0.533333
2,Drama,1,0.333333,0.333333,0.333389,0.333389
3,Drama,0,0.333333,0.333333,0.333389,0.333389
4,Sci Fi,1,0.533333,0.533333,0.533333,0.533333
5,Sci Fi,1,0.533333,0.533333,0.533333,0.533333
6,Romance,0,0.3,0.3,0.300025,0.300025
7,Sci Fi,1,0.533333,0.533333,0.533333,0.533333
8,Sci Fi,0,0.533333,0.533333,0.533333,0.533333
9,Drama,0,0.333333,0.333333,0.333389,0.333389


# Multiclass Target Encoder

In [13]:
np.random.seed(2022)
target = list(np.random.randint(0, 3, 50))

genre = ["Romance", "Fantasy", "Nonfiction"]
np.random.seed(123)
genres = [genre[i] for i in np.random.randint(0, len(genre), 50)]

df = pd.DataFrame({"genre" : genres, "target" : target})
df

Unnamed: 0,genre,target
0,Nonfiction,1
1,Fantasy,0
2,Nonfiction,1
3,Nonfiction,0
4,Romance,1
5,Nonfiction,1
6,Nonfiction,0
7,Fantasy,0
8,Nonfiction,2
9,Fantasy,0


With more than two classes, encode the feature once for each target class independently. <br>
Calculate the posterior probability of each target class given each category from the conditional counts:

In [14]:
categories = df['genre'].unique()
targets = df['target'].unique()

cat_list = []
for cat in categories:
    aux_dict = {}
    aux_dict['category'] = cat
    
    aux_df = df[df['genre'] == cat]
    
    counts = aux_df['target'].value_counts()
    aux_dict['count'] = sum(counts)
    
    for t in targets:
        aux_dict['target_' + str(t)] = counts[t] if t in counts.keys() else 0 # new: use 0 when a class never occurs for this category
    cat_list.append(aux_dict)

In [15]:
cat_list = pd.DataFrame(cat_list)

for t in targets:
    # p(t | category) = count of target t in the category / total count of the category
    cat_list['genre_prob_target_' + str(t)] = cat_list['target_' + str(t)] / cat_list['count'] 
    
cat_list

Unnamed: 0,category,count,target_1,target_0,target_2,genre_prob_target_1,genre_prob_target_0,genre_prob_target_2
0,Nonfiction,17,6,5,6,0.352941,0.294118,0.352941
1,Fantasy,15,6,7,2,0.4,0.466667,0.133333
2,Romance,18,6,5,7,0.333333,0.277778,0.388889


In [16]:
df = df.join(cat_list.drop(columns = (['count'] + ['target_' + str(t) for t in targets])).set_index('category'), on = 'genre', how = 'left')
df

Unnamed: 0,genre,target,genre_prob_target_1,genre_prob_target_0,genre_prob_target_2
0,Nonfiction,1,0.352941,0.294118,0.352941
1,Fantasy,0,0.4,0.466667,0.133333
2,Nonfiction,1,0.352941,0.294118,0.352941
3,Nonfiction,0,0.352941,0.294118,0.352941
4,Romance,1,0.333333,0.277778,0.388889
5,Nonfiction,1,0.352941,0.294118,0.352941
6,Nonfiction,0,0.352941,0.294118,0.352941
7,Fantasy,0,0.4,0.466667,0.133333
8,Nonfiction,2,0.352941,0.294118,0.352941
9,Fantasy,0,0.4,0.466667,0.133333
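
The same per-class conditional probabilities can be obtained more compactly with `pd.crosstab`. The sketch below runs on a small made-up frame (`toy` is hypothetical) and is meant as an alternative to the explicit loop, not a replacement for it.

```python
import pandas as pd

toy = pd.DataFrame({
    "genre":  ["Fantasy", "Fantasy", "Romance", "Romance", "Romance", "Romance"],
    "target": [0, 1, 2, 2, 0, 1],
})

# normalize="index" divides each row by its total, giving p(t = j | x = ci)
probs = pd.crosstab(toy["genre"], toy["target"], normalize="index")
probs.columns = ["genre_prob_target_" + str(t) for t in probs.columns]

# Attach the per-class probabilities to every row of the original frame
encoded = toy.join(probs, on="genre")
print(encoded)
```

Classes that never occur within a category come out as `0` automatically, so no special case is needed.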


The same code also works when the target is categorical (string labels) rather than numeric:

In [17]:
genre = ["Romance", "Fantasy", "Nonfiction"]
np.random.seed(123)
genres = [genre[i] for i in np.random.randint(0, len(genre), 50)]
target = list(np.random.randint(0, 3, 50))

copy = pd.DataFrame({"genre" : genres, "target" : target})

copy['target'] = copy['target'].replace({0: 'apple', 1: "banana", 2: 'orange'})

categories = copy['genre'].unique()
targets = copy['target'].unique()

cat_list_copy = []
for cat in categories:
    aux_dict = {}
    aux_dict['category'] = cat
    
    aux_df = copy[copy['genre'] == cat]
    
    counts = aux_df['target'].value_counts()
    aux_dict['count'] = sum(counts)
    
    for t in targets:
        aux_dict['target_' + str(t)] = counts[t] if t in counts.keys() else 0 # use 0 when a class never occurs for this category
    cat_list_copy.append(aux_dict)
    
cat_list_copy = pd.DataFrame(cat_list_copy)

for t in targets:
    # p(t | category) = count of target t in the category / total count of the category
    cat_list_copy['genre_prob_target_' + str(t)] = cat_list_copy['target_' + str(t)] / cat_list_copy['count'] 
    
cat_list_copy  

Unnamed: 0,category,count,target_orange,target_banana,target_apple,genre_prob_target_orange,genre_prob_target_banana,genre_prob_target_apple
0,Nonfiction,17,8,2,7,0.470588,0.117647,0.411765
1,Fantasy,15,6,8,1,0.4,0.533333,0.066667
2,Romance,18,7,7,4,0.388889,0.388889,0.222222


## Sklearn Category Encoders

In [18]:
from category_encoders import TargetEncoder

In [19]:
targets = df['target'].unique()

for t in targets:
    target_aux = df['target'].apply(lambda x: 1 if x == t else 0)
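    # explicit smoothing parameters; the library's defaults may differ between versions
    encoder = TargetEncoder(min_samples_leaf = 1, smoothing = 1.0)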
    df['genre_sklearn_target_' + str(t)] = encoder.fit_transform(df['genre'], target_aux)
    
df

Unnamed: 0,genre,target,genre_prob_target_1,genre_prob_target_0,genre_prob_target_2,genre_sklearn_target_1,genre_sklearn_target_0,genre_sklearn_target_2
0,Nonfiction,1,0.352941,0.294118,0.352941,0.352941,0.294118,0.352941
1,Fantasy,0,0.4,0.466667,0.133333,0.4,0.466667,0.133333
2,Nonfiction,1,0.352941,0.294118,0.352941,0.352941,0.294118,0.352941
3,Nonfiction,0,0.352941,0.294118,0.352941,0.352941,0.294118,0.352941
4,Romance,1,0.333333,0.277778,0.388889,0.333333,0.277778,0.388889
5,Nonfiction,1,0.352941,0.294118,0.352941,0.352941,0.294118,0.352941
6,Nonfiction,0,0.352941,0.294118,0.352941,0.352941,0.294118,0.352941
7,Fantasy,0,0.4,0.466667,0.133333,0.4,0.466667,0.133333
8,Nonfiction,2,0.352941,0.294118,0.352941,0.352941,0.294118,0.352941
9,Fantasy,0,0.4,0.466667,0.133333,0.4,0.466667,0.133333


## Manual (Mean)

In [20]:
targets = df['target'].unique()

for t in targets:
    df['target_' + str(t)] = df['target'].apply(lambda x: 1 if x == t else 0)
    stats = df['target_' + str(t)].groupby(df['genre']).agg(['mean'])
    df = df.join(stats, on = 'genre', how = 'left').rename(columns = {'mean'  : 'genre_encoded_mean_target_' + str(t)})
    df = df.drop(columns = ['target_' + str(t)])
    
df

Unnamed: 0,genre,target,genre_prob_target_1,genre_prob_target_0,genre_prob_target_2,genre_sklearn_target_1,genre_sklearn_target_0,genre_sklearn_target_2,genre_encoded_mean_target_1,genre_encoded_mean_target_0,genre_encoded_mean_target_2
0,Nonfiction,1,0.352941,0.294118,0.352941,0.352941,0.294118,0.352941,0.352941,0.294118,0.352941
1,Fantasy,0,0.4,0.466667,0.133333,0.4,0.466667,0.133333,0.4,0.466667,0.133333
2,Nonfiction,1,0.352941,0.294118,0.352941,0.352941,0.294118,0.352941,0.352941,0.294118,0.352941
3,Nonfiction,0,0.352941,0.294118,0.352941,0.352941,0.294118,0.352941,0.352941,0.294118,0.352941
4,Romance,1,0.333333,0.277778,0.388889,0.333333,0.277778,0.388889,0.333333,0.277778,0.388889
5,Nonfiction,1,0.352941,0.294118,0.352941,0.352941,0.294118,0.352941,0.352941,0.294118,0.352941
6,Nonfiction,0,0.352941,0.294118,0.352941,0.352941,0.294118,0.352941,0.352941,0.294118,0.352941
7,Fantasy,0,0.4,0.466667,0.133333,0.4,0.466667,0.133333,0.4,0.466667,0.133333
8,Nonfiction,2,0.352941,0.294118,0.352941,0.352941,0.294118,0.352941,0.352941,0.294118,0.352941
9,Fantasy,0,0.4,0.466667,0.133333,0.4,0.466667,0.133333,0.4,0.466667,0.133333
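
The manual mean encoding can be combined with the prior smoothing formula from the binary case by treating each class as a one-vs-rest binary target. The sketch below does this on a made-up frame; `toy` and the column prefix `genre_smoothed_target_` are illustrative names, and `k` and `f` play the same roles as `min_samples_leaf` and `smoothing_factor` earlier.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "genre":  ["Fantasy", "Fantasy", "Romance", "Romance", "Romance", "Romance"],
    "target": [0, 1, 2, 2, 0, 1],
})
k, f = 1, 1.0  # min_samples_leaf and smoothing_factor

# alpha depends only on the category size, so it is shared by all classes
counts = toy.groupby("genre")["target"].count()
alpha = 1 / (1 + np.exp(-(counts - k) / f))

for t in sorted(toy["target"].unique()):
    is_t = (toy["target"] == t).astype(int)    # one-vs-rest binary target
    prior = is_t.mean()                        # p(t)
    means = is_t.groupby(toy["genre"]).mean()  # p(t | category)
    smoothed = alpha * means + (1 - alpha) * prior
    toy["genre_smoothed_target_" + str(t)] = toy["genre"].map(smoothed)

encoded_cols = [c for c in toy.columns if c.startswith("genre_smoothed_target_")]
print(toy[encoded_cols].sum(axis=1))
```

Because both the per-category probabilities and the priors sum to 1 across classes, the smoothed encodings of each row also sum to 1.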
