# Exercise 2: Categorical Encodings

In this exercise you'll apply more advanced encodings to the categorical variables. The goal is to encode them in a way that gives the classifier model more information. The encodings you will implement are:

- Count Encoding
- Target Encoding
- Leave-one-out Encoding
- CatBoost Encoding
- Feature embedding with SVD 

After each encoding, you'll refit the classifier and check its performance on hold-out data. First, run the next cell to repeat the work you did in the last exercise.

In [None]:
import numpy as np
import pandas as pd
from sklearn import preprocessing, metrics
import lightgbm as lgb

# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.feature_engineering.ex2 import *

# Create features from timestamps
click_data = pd.read_csv('../input/feature-engineering-data/train_sample.csv', 
                         parse_dates=['click_time'])
click_times = click_data['click_time']
clicks = click_data.assign(day=click_times.dt.day.astype('uint8'),
                           hour=click_times.dt.hour.astype('uint8'),
                           minute=click_times.dt.minute.astype('uint8'),
                           second=click_times.dt.second.astype('uint8'))

# Label encoding for categorical features
cat_features = ['ip', 'app', 'device', 'os', 'channel']
for feature in cat_features:
    label_encoder = preprocessing.LabelEncoder()
    clicks[feature] = label_encoder.fit_transform(clicks[feature])

Here I'll define a couple of functions to help test the new encodings.

In [None]:
def get_data_splits(dataframe, valid_fraction=0.1):
    """ Splits a dataframe into train, validation, and test sets. First, orders by 
        the column 'click_time'. Set the size of the validation and test sets with
        the valid_fraction keyword argument.
    """

    dataframe = dataframe.sort_values('click_time')
    valid_rows = int(len(dataframe) * valid_fraction)
    train = dataframe[:-valid_rows * 2]
    # valid size == test size, last two sections of the data
    valid = dataframe[-valid_rows * 2:-valid_rows]
    test = dataframe[-valid_rows:]
    
    return train, valid, test

def train_model(train, valid, test=None, feature_cols=None):
    if feature_cols is None:
        feature_cols = train.columns.drop(['click_time', 'attributed_time',
                                           'is_attributed'])
    dtrain = lgb.Dataset(train[feature_cols], label=train['is_attributed'])
    dvalid = lgb.Dataset(valid[feature_cols], label=valid['is_attributed'])
    
    param = {'num_leaves': 64, 'objective': 'binary', 
             'metric': 'auc', 'seed': 7}
    num_round = 1000
    print("Training model!")
    bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid], 
                    early_stopping_rounds=20, verbose_eval=False)
    
    valid_pred = bst.predict(valid[feature_cols])
    valid_score = metrics.roc_auc_score(valid['is_attributed'], valid_pred)
    print(f"Validation AUC score: {valid_score}")
    
    if test is not None: 
        test_pred = bst.predict(test[feature_cols])
        test_score = metrics.roc_auc_score(test['is_attributed'], test_pred)
        return bst, valid_score, test_score
    else:
        return bst, valid_score
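
If you want to convince yourself that a time-ordered split behaves as described, here is a minimal sketch on a toy dataframe. The helper name `split_by_time` and the toy data are made up for illustration; the slicing mirrors `get_data_splits` but is not a replacement for it.

```python
import pandas as pd

def split_by_time(df, time_col, valid_fraction=0.1):
    """Illustrative helper: sort by time, then cut off the last two equal slices."""
    df = df.sort_values(time_col)
    n = int(len(df) * valid_fraction)
    return df[:-2 * n], df[-2 * n:-n], df[-n:]

# Toy data: 20 daily timestamps, shuffled so that the sort actually matters
toy = pd.DataFrame({
    'click_time': pd.date_range('2017-11-06', periods=20, freq='D'),
    'is_attributed': [0, 1] * 10,
}).sample(frac=1, random_state=0)

train_toy, valid_toy, test_toy = split_by_time(toy, 'click_time')
print(len(train_toy), len(valid_toy), len(test_toy))  # 16 2 2

# Every training row is earlier than every validation row
print(train_toy['click_time'].max() < valid_toy['click_time'].min())  # True
```

Because the validation and test rows come strictly after the training rows, the scores reflect how the model would do on future clicks.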

Run this cell to get a baseline score. If your encodings do better than this, you can keep them.

In [None]:
print("Baseline model")
train, valid, test = get_data_splits(clicks)
_ = train_model(train, valid)

## 1) Categorical encodings and leakage

These encodings are all based on statistics calculated from the dataset like counts and means. Considering this, what data should you be using to calculate the encodings?

Uncomment the following line after you've decided your answer.

In [None]:
#q_1.solution()

## 2) Count encodings

Here, encode the categorical features `['ip', 'app', 'device', 'os', 'channel']` using the count of each value in the data set. First I'll define a `CountEncoder` class you can use. Something similar to this is being added to the `category_encoders` package soon, but it hasn't been released yet.

In [None]:
class CountEncoder:
    def __init__(self):
        self.mapping = {}
        
    def fit(self, df):
        """ Calculates count encodings for each column in a dataframe. """
        for col in df.columns:
            # value_counts also works when df has only a single column
            self.mapping[col] = df[col].value_counts()
        
    def transform(self, df):
        """ Applies learned encodings to a dataframe. Returned dataframe has the same
            indices and columns as original dataframe. 
        """
        out_df = df.copy()
        for col, encoding in self.mapping.items():
            # Values not seen during fit map to NaN, so give them a count of 0
            out_df[col] = df[col].map(encoding).fillna(0)
        
        return out_df
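
To see what count encoding does to the values themselves, here is a small sketch in plain pandas. The toy dataframes and the `app_count` column are made up for illustration; this is not the `CountEncoder` class, just the same idea on a single column.

```python
import pandas as pd

toy_train = pd.DataFrame({'app': ['a', 'a', 'b', 'a', 'c']})
toy_valid = pd.DataFrame({'app': ['a', 'b', 'z']})  # 'z' never appears in toy_train

# Count how often each value occurs in the training rows only
app_counts = toy_train['app'].value_counts()

# Replace each value with its training count; unseen values get 0
toy_valid['app_count'] = toy_valid['app'].map(app_counts).fillna(0)
print(toy_valid['app_count'].tolist())  # [3.0, 1.0, 0.0]
```

Rare values end up with similar small counts, so the model can treat them alike even though their labels differ.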

Using `CountEncoder`, fit the encoding using the categorical feature columns defined in `cat_features`. Then apply the encodings to the train and validation sets, adding them as new columns with names suffixed `"_count"`.

In [None]:
cat_features = ['ip', 'app', 'device', 'os', 'channel']
train, valid, test = get_data_splits(clicks)

# Create the count encoder
count_enc = ____

# Learn encoding from the training set
____

# Apply encoding to the train and validation sets as new columns
# Make sure to add `_count` as a suffix to the new columns
train = ____
valid = ____

In [None]:
# Uncomment if you need some guidance
#q_2.hint()
#q_2.solution()

In [None]:
# Run this cell to check your work
q_2.check()

In [None]:
#%%RM_IF(PROD)%%
cat_features = ['ip', 'app', 'device', 'os', 'channel']
train, valid, test = get_data_splits(clicks)

# Create the count encoder
count_enc = CountEncoder()

# Learn encoding from the training set
count_enc.fit(train[cat_features])

# Apply encoding to the train and validation sets
train = train.join(count_enc.transform(train[cat_features]).add_suffix('_count'))
valid = valid.join(count_enc.transform(valid[cat_features]).add_suffix('_count'))

q_2.check()

In [None]:
# Train the model on the encoded datasets
_ = train_model(train, valid, test)

Nice, count encoding improved our model's score. Now we can add it to the whole dataset and try more encodings.

In [None]:
encoded = count_enc.transform(clicks[cat_features])
for col in encoded:
    clicks.insert(len(clicks.columns), col + '_count', encoded[col])

## 3) Target encoding

Here you'll try some supervised encodings that use the labels (the targets) to transform categorical features. The first one is target encoding. Create the target encoder from the `category_encoders` library. Then, learn the encodings from the training dataset, apply the encodings to all the datasets and retrain the model.
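
As a rough picture of the idea, target encoding replaces each category value with the average target for that value. The sketch below does this by hand in pandas on made-up data. It is only an illustration: `ce.TargetEncoder` also smooths the per-category mean toward the overall mean, so its numbers will generally differ from these.

```python
import pandas as pd

toy_train = pd.DataFrame({
    'app': ['a', 'a', 'a', 'b', 'b', 'c'],
    'is_attributed': [1, 0, 1, 0, 0, 1],
})
toy_valid = pd.DataFrame({'app': ['a', 'b', 'z']})  # 'z' is unseen in training

# Mean target per category, learned from the training rows only
category_means = toy_train.groupby('app')['is_attributed'].mean()
global_mean = toy_train['is_attributed'].mean()

# Unseen categories fall back to the overall mean (an illustrative choice)
toy_valid['app_target'] = toy_valid['app'].map(category_means).fillna(global_mean)
print(toy_valid)
```

Here `'a'` maps to 2/3, `'b'` to 0, and the unseen `'z'` to the overall mean of 0.5.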

In [None]:
import category_encoders as ce

In [None]:
cat_features = ['ip', 'app', 'device', 'os', 'channel']
train, valid, test = get_data_splits(clicks)

# Create the target encoder
target_enc = ____

# Learn encoding from the training set
____

# Apply encoding to the train and validation sets as new columns
# Make sure to add `_target` as a suffix to the new columns
train = ____
valid = ____

In [None]:
# Uncomment these if you need some guidance
#q_3.hint()
#q_3.solution()

In [None]:
# Run this cell to check your work
q_3.check()

In [None]:
#%%RM_IF(PROD)%%
cat_features = ['ip', 'app', 'device', 'os', 'channel']
target_enc = ce.TargetEncoder(cols=cat_features)

train, valid, test = get_data_splits(clicks)
target_enc.fit(train[cat_features], train['is_attributed'])

train = train.join(target_enc.transform(train[cat_features]).add_suffix('_target'))
valid = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_target'))

q_3.check()

In [None]:
_ = train_model(train, valid)

## 4) Remove IP encoding

Try leaving `ip` out of the encoded features and retrain the model with target encoding again. You should find that the score increases and is above the baseline score! Why do you think the score is below baseline when we encode the IP address but above baseline when we don't?

Uncomment the following line after you've decided your answer.

In [None]:
#q_4.solution()

## 5) Leave-One-Out Encoding

Try leave-one-out encoding, which might work better: each row's encoding is computed with that row's own target left out, which can reduce overfitting. Again, create the leave-one-out encoder, fit it on the training dataset, and apply the encodings to all the datasets. Then, retrain the model to see if the new encodings improve the score.
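
Here is a minimal sketch of the leave-one-out idea in plain pandas, on made-up data. It shows the arithmetic only; `ce.LeaveOneOutEncoder` may handle details such as unseen or single-row categories differently.

```python
import pandas as pd

toy = pd.DataFrame({
    'app': ['a', 'a', 'a', 'b', 'b'],
    'is_attributed': [1, 0, 1, 0, 1],
})

grouped = toy.groupby('app')['is_attributed']
group_sum = grouped.transform('sum')      # total target per category, aligned to rows
group_count = grouped.transform('count')  # rows per category, aligned to rows

# Mean target of the *other* rows in the same category
toy['app_loo'] = (group_sum - toy['is_attributed']) / (group_count - 1)
print(toy['app_loo'].tolist())  # [0.5, 1.0, 0.5, 1.0, 0.0]
```

Note that a category with a single row would divide by zero in this sketch, so a real encoder needs a fallback for that case.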

In [None]:
cat_features = ['app', 'device', 'os', 'channel']
train, valid, test = get_data_splits(clicks)

# Create the leave-one-out encoder. Use random_state=7.
loo_enc = ____

# Learn encoding from the training set
____

# Apply encoding to the train and validation sets as new columns
# Make sure to add `_loo` as a suffix to the new columns
train = ____
valid = ____

In [None]:
# Uncomment these if you need some guidance
#q_5.hint()
#q_5.solution()

In [None]:
# Run this cell to check your work
q_5.check()

In [None]:
#%%RM_IF(PROD)%%
cat_features = ['app', 'device', 'os', 'channel']
train, valid, test = get_data_splits(clicks)

loo_enc = ce.LeaveOneOutEncoder(cols=cat_features, random_state=7)
loo_enc.fit(train[cat_features], train['is_attributed'])

train = train.join(loo_enc.transform(train[cat_features]).add_suffix('_loo'))
valid = valid.join(loo_enc.transform(valid[cat_features]).add_suffix('_loo'))

q_5.check()

In [None]:
_ = train_model(train, valid)

## 6) CatBoost Encoding

The CatBoost encoder is supposed to work well with the LightGBM model. Encode the categorical features with `CatBoostEncoder` and train the model on the encoded data again.
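
CatBoost-style encoding is also target-based but, roughly speaking, each row's value is computed only from the rows that come before it. The sketch below shows that ordered calculation in plain pandas on made-up data. The prior (the overall mean) and its weight of 1 are assumptions chosen for illustration, and `ce.CatBoostEncoder` may differ in its details.

```python
import pandas as pd

toy = pd.DataFrame({
    'app': ['a', 'a', 'b', 'a', 'b'],
    'is_attributed': [1, 0, 1, 1, 0],
})

prior = toy['is_attributed'].mean()  # overall mean, used as a prior (assumption)
grouped = toy.groupby('app')['is_attributed']

# Sum and count of targets from *earlier* rows in the same category
prev_sum = grouped.cumsum() - toy['is_attributed']
prev_count = grouped.cumcount()

# The first row of each category has no history, so it gets the prior
toy['app_cb'] = (prev_sum + prior) / (prev_count + 1)
print(toy)
```

Since a row never sees its own target, this kind of encoding leaks less label information into the training features than a plain per-category mean.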

In [None]:
train, valid, test = get_data_splits(clicks)

# Create the CatBoost encoder
cb_enc = ____

# Learn encoding from the training set
____

# Apply encoding to the train and validation sets as new columns
# Make sure to add `_cb` as a suffix to the new columns
train = ____
valid = ____

In [None]:
# Uncomment these if you need some guidance
#q_6.hint()
#q_6.solution()

In [None]:
# Run this cell to check your work
q_6.check()

In [None]:
#%%RM_IF(PROD)%%
cat_features = ['app', 'device', 'os', 'channel']
train, valid, _ = get_data_splits(clicks)

cb_enc = ce.CatBoostEncoder(cols=cat_features, random_state=7)

# Learn encodings on the train set
cb_enc.fit(train[cat_features], train['is_attributed'])

# Apply encodings to each set
train = train.join(cb_enc.transform(train[cat_features]).add_suffix('_cb'))
valid = valid.join(cb_enc.transform(valid[cat_features]).add_suffix('_cb'))
q_6.check()

In [None]:
_ = train_model(train, valid)

The CatBoost encodings work the best, so we'll keep those.

In [None]:
encoded = cb_enc.transform(clicks[cat_features])
for col in encoded:
    clicks.insert(len(clicks.columns), col + '_cb', encoded[col])

## Categorical feature embeddings with SVD

Now you'll create embeddings from pairs of columns using SVD to learn from a count matrix.

In [None]:
import itertools
from sklearn.decomposition import TruncatedSVD

### 7) Learn SVD components as embeddings

Here you'll use SVD to learn embeddings for the categorical features from a matrix of counts. First, create the SVD transformer with `TruncatedSVD`. Then for each pair of features, create a count matrix and learn the SVD components. Remember you should be learning the embeddings from the train dataset to avoid leakage.
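
If the count matrix is hard to picture, here is a toy sketch with made-up data. It uses `pd.crosstab` and NumPy's full SVD as stand-ins, not the `TruncatedSVD` call the exercise asks for, so treat it as an illustration of the shapes involved and not as the solution.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'app': ['a', 'a', 'a', 'b', 'b', 'c'],
    'channel': ['x', 'x', 'y', 'y', 'z', 'z'],
})

# Rows are app values, columns are channel values, cells are co-occurrence counts
count_matrix = pd.crosstab(toy['app'], toy['channel'])
print(count_matrix)

# Full SVD; the rows of vt are right singular vectors, one entry per channel value
_, singular_values, vt = np.linalg.svd(count_matrix.to_numpy(), full_matrices=False)

# Keep the top 2 components: each channel value now has a 2-number embedding
channel_components = pd.DataFrame(vt[:2], columns=count_matrix.columns)
print(channel_components.shape)  # (2, 3)
```

Each column of `channel_components` is the embedding for one channel value, learned from how often it appears together with each app.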

In [None]:
train, valid, test = get_data_splits(clicks)
cat_features = ['app', 'device', 'os', 'channel']

# Create the SVD transformer with 5 components, set random_state to 7
svd = ____

# Learn SVD feature vectors and store in svd_components as DataFrames
# Make sure you're only using the train set!
svd_components = {}
for col1, col2 in itertools.permutations(cat_features, 2):
    # Create the count matrix
    ____
    
    # Fit the SVD with the count matrix
    ____
    
    # Store the components in the dictionary. 
    svd_components['_'.join([col2, col1])] = ____

In [None]:
# Uncomment these if you need some guidance
#q_7.hint()
#q_7.solution()

In [None]:
# Run this cell to check your work
q_7.check()

In [None]:
#%%RM_IF(PROD)%%
train, valid, test = get_data_splits(clicks)

# Learn SVD feature vectors
cat_features = ['app', 'device', 'os', 'channel']
svd_components = {}
svd = TruncatedSVD(n_components=5, random_state=7)
# Loop through each pair of categorical features
for col1, col2 in itertools.permutations(cat_features, 2):
    # For a pair, create a matrix of co-occurrence counts
    pair_counts = train.groupby([col1, col2])['is_attributed'].count()
    pair_matrix = pair_counts.unstack(fill_value=0)
    
    # Fit the SVD and store the components
    # Note: these components encode column 2
    svd.fit(pair_matrix)
    svd_components['_'.join([col2, col1])] = pd.DataFrame(svd.components_)

q_7.check()

### 8) Encode categorical features with SVD components

With the components learned from the train dataset, encode the categorical features and create a dataframe `svd_encodings`. The columns need to be named with the feature pair, `svd`, and the component index, such as `"os_device_svd_0"`.
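
The encoding step is essentially a table lookup: for each row, fetch the component values belonging to that row's category. Here is a small sketch of that lookup with `reindex`, using made-up component values and a hypothetical `channel_app` feature pair.

```python
import pandas as pd

# Hypothetical components: 2 rows (components) x 3 columns (category values)
components = pd.DataFrame([[0.1, 0.2, 0.3],
                           [0.4, 0.5, 0.6]], columns=['x', 'y', 'z'])

channel = pd.Series(['y', 'x', 'w', 'z'], name='channel')  # 'w' has no learned vector

# Transpose so each category value is a row, then look up one row per click
lookup = components.transpose()
encoded = lookup.reindex(channel).set_index(channel.index)

# Name the columns after the feature pair and the component index
encoded = encoded.add_prefix('channel_app_svd_')

# Categories without a learned vector come back as NaN; fill with the column mean
encoded = encoded.fillna(encoded.mean())
print(encoded)
```

The row for `'w'` ends up with the column means (0.2 and 0.5), which is one simple way to handle values that weren't in the data the components were learned from.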

In [None]:
svd_encodings = pd.DataFrame(index=clicks.index)

for feature in svd_components:
    # Get the feature column the SVD components are encoding
    col = ____
    
    ## Use svd_components to encode the categorical features and join with svd_encodings
    ____
    

In [None]:
# Uncomment these if you need some guidance
#q_8.hint()
#q_8.solution()

In [None]:
# Run this cell to check your work
q_8.check()

In [None]:
#%%RM_IF(PROD)%%
svd_encodings = pd.DataFrame(index=clicks.index)
for feature in svd_components:
    # Get the feature column the SVD components are encoding
    col = feature.split('_')[0]

    ## Use SVD components to encode the categorical features
    # Need to transpose so .reindex works appropriately
    feature_components = svd_components[feature].transpose()
    comp_cols = feature_components.reindex(clicks[col]).set_index(clicks.index)
    
    # Doing this so we know what these features are
    comp_cols = comp_cols.add_prefix(feature + '_svd_')
    
    svd_encodings = svd_encodings.join(comp_cols)

# Fill null values with the mean
svd_encodings = svd_encodings.fillna(svd_encodings.mean())

q_8.check()

Test the encoded data.

In [None]:
train, valid, test = get_data_splits(clicks.join(svd_encodings))
_ = train_model(train, valid)

Next up, you'll start generating completely new features from the data itself.