# Exercise 2: Categorical Encodings

In this exercise you'll apply more advanced encodings to the categorical variables. The goal is to encode them in a way that gives the classifier more information than label encoding alone. The encodings you will implement are:

- Count Encoding
- Target Encoding
- Leave-one-out Encoding
- CatBoost Encoding
- Feature embedding with SVD 

After each encoding, you'll refit the classifier and check its performance on hold-out data. First, run the next cell to repeat the work you did in the last exercise.

In [30]:
import numpy as np
import pandas as pd
from sklearn import preprocessing, metrics
import lightgbm as lgb

# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.machine_learning.ex2 import *

# Create features from timestamps
click_data = pd.read_csv('../input/train_sample.csv', parse_dates=['click_time'])
click_times = click_data['click_time']
clicks = click_data.assign(day=click_times.dt.day.astype('uint8'),
                           hour=click_times.dt.hour.astype('uint8'),
                           minute=click_times.dt.minute.astype('uint8'),
                           second=click_times.dt.second.astype('uint8'))

# Label encoding for categorical features
cat_features = ['ip', 'app', 'device', 'os', 'channel']
for feature in cat_features:
    label_encoder = preprocessing.LabelEncoder()
    clicks[feature] = label_encoder.fit_transform(clicks[feature])



Here I'll define a couple of functions to help test the new encodings.

In [31]:
def get_data_splits(dataframe, valid_fraction=0.1):
    """ Splits a dataframe into train, validation, and test sets. First, orders by 
        the column 'click_time'. Set the size of the validation and test sets with
        the valid_fraction keyword argument.
    """

    dataframe = dataframe.sort_values('click_time')
    valid_rows = int(len(dataframe) * valid_fraction)
    train = dataframe[:-valid_rows * 2]
    # valid size == test size, last two sections of the data
    valid = dataframe[-valid_rows * 2:-valid_rows]
    test = dataframe[-valid_rows:]
    
    return train, valid, test

def train_model(train, valid, test, feature_cols=None):
    if feature_cols is None:
        feature_cols = train.columns.drop(['click_time', 'attributed_time',
                                           'is_attributed'])
    dtrain = lgb.Dataset(train[feature_cols], label=train['is_attributed'])
    dvalid = lgb.Dataset(valid[feature_cols], label=valid['is_attributed'])
    dtest = lgb.Dataset(test[feature_cols], label=test['is_attributed'])
    
    param = {'num_leaves': 64, 'objective': 'binary', 
             'metric': 'auc', 'seed': 7}
    num_round = 1000
    print("Training model!")
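    # Note: LightGBM 4.0 removed the early_stopping_rounds and verbose_eval
    # arguments; there, pass callbacks=[lgb.early_stopping(20)] instead.
    bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid], 
                    early_stopping_rounds=20, verbose_eval=False)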
    
    valid_pred = bst.predict(valid[feature_cols])
    valid_score = metrics.roc_auc_score(valid['is_attributed'], valid_pred)
    
    test_pred = bst.predict(test[feature_cols])
    test_score = metrics.roc_auc_score(test['is_attributed'], test_pred)
    print(f"Validation AUC score: {valid_score}")
    
    return bst, valid_score, test_score

Run this cell to get a baseline score. If your encodings do better than this, you can keep them.

In [32]:
print("Baseline model")
train, valid, test = get_data_splits(clicks)
_ = train_model(train, valid, test)

Baseline model
Training model!
Validation AUC score: 0.9622743228943659


### 1) Count encodings

Here, encode the categorical features `['ip', 'app', 'device', 'os', 'channel']` using the count of each value in the data set. Then, retrain the model to measure the score with the new encodings.
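To make the idea concrete, here is a minimal sketch of count encoding on a toy column. The DataFrames, the `app` values, and the `app_count` column name are made up for illustration; only the technique (replace each value with how often it appears in the training data) comes from the exercise.

```python
import pandas as pd

# Toy data; the values are hypothetical and only illustrate the mechanics.
train_toy = pd.DataFrame({'app': ['a', 'b', 'a', 'c', 'a', 'b']})
valid_toy = pd.DataFrame({'app': ['a', 'c', 'z']})  # 'z' never appears in training

# Learn the counts from the training rows only.
app_counts = train_toy['app'].value_counts()

# Replace each value with its training count; unseen values get a count of 0.
train_toy['app_count'] = train_toy['app'].map(app_counts)
valid_toy['app_count'] = valid_toy['app'].map(app_counts).fillna(0).astype(int)

print(valid_toy)
```

Note that the counts come from the training rows only, and the validation rows are mapped through them. This mirrors fitting on `train` and transforming `valid` and `test`.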

In [33]:
class CountEncoder:
    def __init__(self):
        self.mapping = {}
        
    def fit(self, df):
        # Store the number of rows for each value of each column
        for col in df.columns:
            self.mapping[col] = df.groupby(col).size()
        
    def transform(self, df):
        out_df = df.copy()
        for col, counts in self.mapping.items():
            # Values not seen during fit get a count of 0
            out_df[col] = df[col].map(counts).fillna(0)
        
        return out_df

In [34]:
cat_features = ['ip', 'app', 'device', 'os', 'channel']
train, valid, test = get_data_splits(clicks)

# Create the count encoder
count_enc = ____

# Learn encoding from the training set
____

# Apply encoding to each set and train a new model
# Add encoded features as new columns
for each in (train, valid, test):
    ____

In [35]:
#%%RM_IF(PROD)%%
cat_features = ['ip', 'app', 'device', 'os', 'channel']
train, valid, test = get_data_splits(clicks)

# Create the count encoder
count_enc = CountEncoder()

# Learn encoding from the training set
count_enc.fit(train[cat_features])

# Apply encoding to each set and train a new model
for each in (train, valid, test):
    encoded = count_enc.transform(each[cat_features])
    for col in encoded:
        each.insert(0, col + '_count', encoded[col])

In [36]:
# Train the model on the encoded datasets
_ = train_model(train, valid, test)

Training model!
Validation AUC score: 0.9653049481211093


Nice, count encoding improved our model's score. Now we can add it to the whole dataset and try more encodings.

In [37]:
encoded = count_enc.transform(clicks[cat_features])
for col in encoded:
    clicks.insert(len(clicks.columns), col + '_count', encoded[col])

### 2) Target encoding

Now you'll try some supervised encodings that use the labels (the targets) to transform categorical features. The first one is target encoding. Create the target encoder from the `category_encoders` library. Then, learn the encodings from the training dataset, apply the encodings to all the datasets and retrain the model.
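As a rough illustration of what target encoding computes, the sketch below replaces each category with the mean of the target for that category in the training data. The toy DataFrames and the `channel_target` column name are hypothetical. `TargetEncoder` in `category_encoders` also smooths each category mean toward the global mean, which this simplified version leaves out.

```python
import pandas as pd

# Toy data; the values are hypothetical.
train_toy = pd.DataFrame({
    'channel': ['x', 'x', 'x', 'y', 'y', 'z'],
    'is_attributed': [1, 0, 1, 0, 0, 1],
})
valid_toy = pd.DataFrame({'channel': ['x', 'y', 'w']})  # 'w' is unseen

# Learn the per-category target means from the training rows only.
global_mean = train_toy['is_attributed'].mean()
channel_means = train_toy.groupby('channel')['is_attributed'].mean()

# Map each category to its training mean; unseen categories fall back
# to the global mean (an assumption made for this sketch).
valid_toy['channel_target'] = (
    valid_toy['channel'].map(channel_means).fillna(global_mean)
)

print(valid_toy)
```

Because the encoding uses the labels, it has to be learned from the training set alone. Otherwise information about the validation and test targets leaks into the features.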

In [38]:
import category_encoders as ce

In [39]:
train, valid, test = get_data_splits(clicks)

# Create the target encoder
tenc = ____

# Learn encoding from the training set
____

# Apply encoding to each set and train a new model
for each in (train, valid, test):
    ____

In [40]:
#%%RM_IF(PROD)%%
cat_features = ['app', 'device', 'os', 'channel']
tenc = ce.TargetEncoder(cols=cat_features)

train, valid, test = get_data_splits(clicks)
tenc.fit(train[cat_features], train['is_attributed'])

for each in (train, valid, test):
    encoded = tenc.transform(each[cat_features])
    for col in encoded:
        each.insert(0, col + '_target', encoded[col])

In [41]:
_ = train_model(train, valid, test)

Training model!
Validation AUC score: 0.9650694152351313


### 3) Remove IP encoding

Try leaving `ip` out of the encoded features and retrain the model with target encoding again. You should find that the score increases and is above the baseline score! Why do you think the score is below baseline when we encode the IP address but above baseline when we don't?

**Answer:** (This is my guess) Target encoding attempts to measure the population mean of the target for each level in a categorical feature. This means that when there is less data per level, the estimated mean will be further from the true mean and will have more variance. There is little data per IP address, so it's likely that the estimates are much noisier than for the other features. Going forward, we'll leave out the IP feature when trying different encodings.
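A small simulation can make the noise argument tangible. This is a sketch under made-up assumptions: every level shares the same hypothetical conversion rate of 0.2, and the only thing that varies is how many rows each level has. For a 0/1 target with rate p, the standard error of the mean over n rows is sqrt(p(1 - p) / n), so levels with few rows should give much more scattered estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rate = 0.2   # hypothetical conversion rate shared by every level
n_levels = 2000   # number of simulated category levels per setting

def estimated_means(rows_per_level):
    """Sample mean of a 0/1 target for each simulated level."""
    targets = rng.binomial(1, true_rate, size=(n_levels, rows_per_level))
    return targets.mean(axis=1)

few_rows = estimated_means(3)      # like a rarely seen IP address
many_rows = estimated_means(1000)  # like a frequently seen app

# Spread of the per-level estimates around the shared true rate.
spread_few = few_rows.std()
spread_many = many_rows.std()
print(f"std of estimates with 3 rows per level:    {spread_few:.3f}")
print(f"std of estimates with 1000 rows per level: {spread_many:.3f}")
```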

### 4) Leave-One-Out Encoding

Try leave-one-out encoding, which might work better: for each row, it computes the target mean for that row's category while leaving out the row's own target, which can reduce overfitting. Again, create the leave-one-out encoder, fit it on the training dataset, and apply the encodings to all the datasets. Then, retrain the model to see if the new encodings improve the score.
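The leave-one-out idea can be sketched by hand with pandas. The toy data and the `os_loo` column name are hypothetical, and this is not the `category_encoders` implementation, just the core arithmetic: subtract the row's own target from the category sum before dividing.

```python
import pandas as pd

# Toy data; the values are hypothetical. Every level appears at least
# twice, so the division below never hits a zero denominator.
train_toy = pd.DataFrame({
    'os': ['a', 'a', 'a', 'b', 'b'],
    'is_attributed': [1, 0, 1, 0, 1],
})

grouped = train_toy.groupby('os')['is_attributed']
level_sum = grouped.transform('sum')      # target sum for the row's level
level_count = grouped.transform('count')  # number of rows with that level

# Mean of the target over the *other* rows with the same level.
train_toy['os_loo'] = (
    (level_sum - train_toy['is_attributed']) / (level_count - 1)
)

print(train_toy)
```

Since a row never sees its own label in its encoding, the encoded feature is a less direct copy of the target than a plain category mean.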

In [42]:
train, valid, test = get_data_splits(clicks)

# Create the leave-one-out encoder. Use random_state=7.
loo_enc = ____

# Learn encoding from the training set
____

# Apply encoding to each set and train a new model
for each in (train, valid, test):
    ____

In [43]:
#%%RM_IF(PROD)%%
cat_features = ['app', 'device', 'os', 'channel']
loo_enc = ce.LeaveOneOutEncoder(cols=cat_features, sigma=1, random_state=7)

train, valid, test = get_data_splits(clicks)
loo_enc.fit(train[cat_features], train['is_attributed'])

for each in (train, valid, test):
    encoded = loo_enc.transform(each[cat_features])
    for col in encoded:
        each.insert(0, col + '_loo', encoded[col])

In [44]:
_ = train_model(train, valid, test)

Training model!
Validation AUC score: 0.9650772281075612


### 5) CatBoost Encoding

The CatBoost encoder is supposed to work well with the LightGBM model. Encode the categorical features with `CatBoostEncoder` and train the model on the encoded data again.
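CatBoost-style encoding is similar to leave-one-out, but it computes the target statistic for each row from the rows that come before it. The sketch below shows that idea by hand. The toy data, the `app_cb` column name, and the smoothing weight of 1 are assumptions for illustration, and the exact formula inside `CatBoostEncoder` may differ.

```python
import pandas as pd

# Toy data in the order the rows are "seen"; the values are hypothetical.
train_toy = pd.DataFrame({
    'app': ['a', 'a', 'b', 'a', 'b'],
    'is_attributed': [1, 0, 1, 1, 0],
})

prior = train_toy['is_attributed'].mean()  # global mean used as the prior
weight = 1.0                               # hypothetical smoothing weight

grouped = train_toy.groupby('app')['is_attributed']
# Sum and count of the target over *earlier* rows with the same level.
prev_sum = grouped.cumsum() - train_toy['is_attributed']
prev_count = grouped.cumcount()

# Blend the running statistic with the prior so the first row of a level
# (which has no history) simply gets the prior.
train_toy['app_cb'] = (prev_sum + prior * weight) / (prev_count + weight)

print(train_toy)
```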

In [45]:
train, valid, test = get_data_splits(clicks)

# Create the CatBoost encoder
cb_enc = ____

# Learn encoding from the training set
____

# Apply encoding to each set and train a new model
for each in (train, valid, test):
    ____

In [46]:
#%%RM_IF(PROD)%%
cat_features = ['app', 'device', 'os', 'channel']
cb_enc = ce.CatBoostEncoder(cols=cat_features, random_state=7)

train, valid, test = get_data_splits(clicks)
# Learn encodings on the train set
cb_enc.fit(train[cat_features], train['is_attributed'])

# Apply encodings to each set
for each in [train, valid, test]:
    encoded = cb_enc.transform(each[cat_features])
    for col in encoded:
        each.insert(0, col + '_cb', encoded[col])

In [47]:
_ = train_model(train, valid, test)

Training model!
Validation AUC score: 0.9651637372380955


In [48]:
encoded = cb_enc.transform(clicks[cat_features])
for col in encoded:
    clicks.insert(len(clicks.columns), col + '_cb', encoded[col])

## Categorical feature embeddings with SVD

Now you'll create embeddings from pairs of columns using SVD to learn from a count matrix.

In [49]:
import itertools
from sklearn.decomposition import TruncatedSVD

### 6) Learn SVD components as embeddings.

Here you'll use SVD to learn embeddings for the categorical features from a matrix of counts. First, create the SVD transformer. Then for each pair of features, create a count matrix and learn the SVD components. Remember you should be learning the embeddings from the train dataset to avoid leakage.
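Here is a minimal sketch of the count-matrix-then-SVD idea on a tiny pair of columns. The toy data is hypothetical, and plain `np.linalg.svd` stands in for `TruncatedSVD` to keep the example small. Both factor the same count matrix, but the signs and scaling of the components may differ between them.

```python
import numpy as np
import pandas as pd

# Toy label-encoded data; the values are hypothetical.
train_toy = pd.DataFrame({
    'app':     [0, 0, 0, 1, 1, 2, 2, 2],
    'channel': [0, 0, 1, 1, 2, 0, 2, 2],
})

# Rows are app values, columns are channel values, cells are
# co-occurrence counts.
pair_matrix = pd.crosstab(train_toy['app'], train_toy['channel'])

# Plain SVD from NumPy stands in for TruncatedSVD here.
_, singular_values, vt = np.linalg.svd(
    pair_matrix.to_numpy(), full_matrices=False
)

n_components = 2
# One row per channel value, one column per component: each channel
# value now has a 2-dimensional embedding.
channel_embedding = pd.DataFrame(vt[:n_components].T,
                                 index=pair_matrix.columns)

print(pair_matrix)
print(channel_embedding)
```

The embedding describes the column variable of the count matrix (here `channel`), based on which `app` values it tends to appear with.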

In [50]:
train, valid, test = get_data_splits(clicks)
cat_features = ['ip', 'app', 'device', 'os', 'channel']

# Create the SVD transformer
svd = ____

# Learn SVD feature vectors and store in svd_components as DataFrames
svd_components = {}
for col1, col2 in itertools.permutations(cat_features, 2):
    ____
    svd_components['_'.join([col2, col1])] = ____

In [51]:
#%%RM_IF(PROD)%%
train, valid, test = get_data_splits(clicks)

# Learn SVD feature vectors
cat_features = ['ip', 'app', 'device', 'os', 'channel']
svd_components = {}
svd = TruncatedSVD(n_components=5)
# Loop through each pair of categorical features
for col1, col2 in itertools.permutations(cat_features, 2):
    # For a pair, create a matrix of co-occurrence counts
    pair_counts = train.groupby([col1, col2])['is_attributed'].count()
    pair_matrix = pair_counts.unstack(fill_value=0)
    
    # Fit the SVD and store the components
    # Note: these components represent column 2
    svd.fit(pair_matrix)
    svd_components['_'.join([col2, col1])] = pd.DataFrame(svd.components_.transpose())

### 7) Encode categorical features with SVD components

With the components learned from the train dataset, encode the categorical features and add the new features to the datasets.
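The lookup step can be sketched with `reindex`: treat the components DataFrame as a table keyed by the label-encoded category value, then pull one row of components for each data row. The component values, the codes, and the row index below are made up for illustration.

```python
import pandas as pd

# Hypothetical components: one row per label-encoded value (0, 1, 2),
# one column per SVD component.
components = pd.DataFrame({0: [0.1, 0.2, 0.3], 1: [1.0, 2.0, 3.0]})

# Label-encoded column to encode; the value 5 has no learned components.
codes = pd.Series([2, 0, 5, 2], index=[10, 11, 12, 13])

# reindex picks the component row for each code, in order, then the
# original row index is restored so the result lines up with the data.
encoded = components.reindex(codes).set_index(codes.index)

# Codes without components come back as NaN; fill them with column means.
filled = encoded.fillna(encoded.mean())

print(encoded)
print(filled)
```

Values that were not in the training data have no row in the components table, which is why the encoded features need their null values filled before training.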

In [52]:
# Apply encodings to each set
for each in [train, valid, test]:
    ____

In [53]:
#%%RM_IF(PROD)%%
svd_encodings = pd.DataFrame(index=clicks.index)
for feature in svd_components:
    # Get the feature column the SVD components are encoding
    col = feature.split('_')[0]

    # Use SVD components to encode the categorical features
    comp_cols = svd_components[feature].reindex(clicks[col]).set_index(clicks.index)
    
    # Add encoded features to the DataFrame
    for component, values in comp_cols.T.iterrows():
        svd_encodings.insert(len(svd_encodings.columns), feature + "_svd_" + str(component), values)

In [54]:
# Fill null values and add to the dataframe
svd_encodings = svd_encodings.fillna(svd_encodings.mean())
clicks = clicks.join(svd_encodings)

Test the encoded data.

In [55]:
train, valid, test = get_data_splits(clicks)
_ = train_model(train, valid, test)

Training model!
Validation AUC score: 0.9651369718983184


Next up, you'll start generating completely new features from the data itself.