# Exercise 2: Categorical Encodings

In this exercise you'll be applying more advanced encodings to the categorical variables. The goal is to encode the categorical variables in a way that provides more information for the classifier model. The encodings you will implement are:

- Count Encoding
- Target Encoding
- Leave-on-out Encoding
- CatBoost Encoding
- Feature embedding with SVD 

After each encoding, you'll refit the classifier and check it's performance on hold-out data. First, run the next cell to repeat the work you did in the last exercise.

In [99]:
import numpy as np
import pandas as pd
from sklearn import preprocessing, metrics
import lightgbm as lgb

# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.machine_learning.ex2 import *

# Create features from timestamps
click_data = pd.read_csv('../input/train_sample.csv', parse_dates=['click_time'])
click_times = click_data['click_time']
clicks = click_data.assign(day=click_times.dt.day.astype('uint8'),
                           hour=click_times.dt.hour.astype('uint8'),
                           minute=click_times.dt.minute.astype('uint8'),
                           second=click_times.dt.second.astype('uint8'))

# Label encoding for categorical features
cat_features = ['ip', 'app', 'device', 'os', 'channel']
for feature in cat_features:
    label_encoder = preprocessing.LabelEncoder()
    clicks[feature] = label_encoder.fit_transform(clicks[feature])



Here I'll define a couple functions to help test the new encodings.

In [2]:
def get_data_splits(dataframe, valid_fraction=0.1):
    """ Splits a dataframe into train, validation, and test sets. First, orders by 
        the column 'click_time'. Set the size of the validation and test sets with
        the valid_fraction keyword argument.
    """

    dataframe = dataframe.sort_values('click_time')
    valid_rows = int(len(dataframe) * valid_fraction)
    train = dataframe[:-valid_rows * 2]
    # valid size == test size, last two sections of the data
    valid = dataframe[-valid_rows * 2:-valid_rows]
    test = dataframe[-valid_rows:]
    
    return train, valid, test

def train_model(train, valid, test):
    """ Trains a logistic classifier with LightGBM on the train and valid datasets.
        Uses the test set to measure the AUC score on hold-out data.
    """
    feature_cols = train.columns.drop(['click_time', 'attributed_time',
                                       'is_attributed'])
    dtrain = lgb.Dataset(train[feature_cols], label=train['is_attributed'])
    dvalid = lgb.Dataset(valid[feature_cols], label=valid['is_attributed'])
    dtest = lgb.Dataset(test[feature_cols], label=test['is_attributed'])
    
    param = {'num_leaves': 64, 'objective': 'binary', 'metric': 'auc'}
    num_round = 1000
    print("Training model!")
    bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid], 
                    early_stopping_rounds=20, verbose_eval=False)
    
    ypred = bst.predict(test[feature_cols])
    score = metrics.roc_auc_score(test['is_attributed'], ypred)
    print(f"Model AUC score: {score}")
    
    return bst

Run this cell to get a baseline score. If your encodings do better than this, you can keep them.

In [3]:
print("Baseline model")
train, valid, test = get_data_splits(clicks)
_ = train_model(train, valid, test)

Baseline model
Training model!
Model AUC score: 0.9727603360058794


### 1) Count encodings

Here, encode the categorical features `['ip', 'app', 'device', 'os', 'channel']` using the count of each value in the data set. Then, retrain the model to measure the score with the new encodings.

In [110]:
class CountEncoder:
    def __init__(self):
        self.mapping = {}
        
    def fit(self, df):
        for col in df.columns:
            self.mapping[col] = df.groupby(col).count().iloc[:, 0]
        
    def transform(self, df):
        out_df = df.copy()
        for col, encoding in self.mapping.items():
            out_df[col] = df[col].map(self.mapping[col]).fillna(0)
        
        return out_df

In [111]:
count_enc = CountEncoder()

train, valid, test = get_data_splits(clicks)
# Learn encoding from the training set
count_enc.fit(train[cat_features])

# Apply encoding to each set and train a new model
for each in (train, valid, test):
    encoded = count_enc.transform(each[cat_features])
    for col in encoded:
        each.insert(0, col + '_count', encoded[col])

bst = train_model(train, valid, test)

Training model!
Model AUC score: 0.9742349038475913


Nice, count encoding improved our model's score. Add it to the whole dataset and try more encodings.

In [113]:
count_enc = CountEncoder()
train, valid, test = get_data_splits(clicks)
# Learn encoding from the training set
count_enc.fit(train[cat_features])
encoded = count_enc.transform(clicks[cat_features])
for col in encoded:
    clicks.insert(len(clicks.columns), col + '_count', encoded[col])

### 2) Target encoding

Now, try target encoding.

In [12]:
import category_encoders as ce

In [17]:
tenc = ce.TargetEncoder(cols=cat_features)

train, valid, test = get_data_splits(clicks)
tenc.fit(train[cat_features], train['is_attributed'])

for each in (train, valid, test):
    encoded = tenc.transform(each[cat_features])
    for col in encoded:
        each.insert(0, col + '_target', encoded[col])

bst = train_model(train, valid, test)

Training model!
Model AUC score: 0.9654157052540729


### X) Leave-One-Out Encoding

In [18]:
loo_enc = ce.LeaveOneOutEncoder(cols=cat_features, sigma=1)

train, valid, test = get_data_splits(clicks)
loo_enc.fit(train[cat_features], train['is_attributed'])

for each in (train, valid, test):
    encoded = loo_enc.transform(each[cat_features])
    for col in encoded:
        each.insert(0, col + '_loo', encoded[col])

bst = train_model(train, valid, test)

Training model!
Model AUC score: 0.9626649445257338


### X) CatBoost Encoding

In [15]:
import category_encoders as ce

cb_enc = ce.CatBoostEncoder(cols=cat_features, sigma=0.6)

train, valid, test = get_data_splits(clicks)
# Learn encodings on the train set
cb_enc.fit(train[cat_features], train['is_attributed'])

# Apply encodings to each set
for each in [train, valid, test]:
    encoded = cb_enc.transform(each[cat_features])
    for col in encoded:
        each.insert(0, col + '_cb', encoded[col])

bst = train_model(train, valid, test)

Training model!
Model AUC score: 0.9652455582007964


### X) Supervised encoding performance

Why do you think these encodings result in lower scores?

**Answer:** Most likely it's because these encodings are using information about the target to create the encodings. This means the encodings are too specific to the training set and don't generalize to the test set.

## Categorical feature embeddings with SVD

Now you'll create embeddings from pairs of columns using SVD to learn from a count matrix.

In [8]:
import itertools
from sklearn.decomposition import TruncatedSVD

## Note to self/Dan
Here I'm creating the count matrix as a dense array in a DataFrame. It's possible to use CountVectorizer from sklearn to create a sparse array. But, CountVectorizer expects the input data to be strings while our data here is all integers. It would be fairly confusing to use CountVectorizer for this.

### X) Learn SVD components

In [103]:
# Again, be sure to learn encodings only from the training set!
train, valid, test = get_data_splits(clicks)

# Learn SVD feature vectors
cat_features = ['ip', 'app', 'device', 'os', 'channel']
svd_components = {}
svd = TruncatedSVD(n_components=5)
# Loop through each pair of categorical features
for col1, col2 in itertools.permutations(cat_features, 2):
    # For a pair, create a sparse matrix with cooccurence counts
    pair_counts = train.groupby([col1, col2])['is_attributed'].count()
    pair_matrix = pair_counts.unstack(fill_value=0)
    
    # Fit the SVD and store the components
    # Note: these components represent column 2
    svd.fit(pair_matrix)
    svd_components['_'.join([col2, col1])] = pd.DataFrame(svd.components_.transpose())

### X) Encode categorical features with SVD components

In [115]:
train, valid, test = get_data_splits(clicks)

# Apply encodings to each set
for each in [train, valid, test]:
    for feature in svd_components:
        # Get the feature column the SVD components are encoding
        col = feature.split('_')[0]
        
        # Use SVD components to encode the categorical features
        comp_cols = svd_components[feature].reindex(each[col]).set_index(each.index)

        # Add encoded features to the DataFrame
        for component, values in comp_cols.T.iterrows():
            each.insert(len(each.columns), feature + "_svd_" + str(component), values)

Test the encoded data.

In [116]:
bst = train_model(train, valid, test)

Training model!
Model AUC score: 0.9746475189597654


You should see a slight boost in the AUC score.

### X) Which encodings work?

Which encodings should we use?