# Introduction

In this exercise you'll apply more advanced encodings to encode the categorical variables ito improve your classifier model. The encodings you will implement are:

- Count Encoding
- Target Encoding
- Leave-one-out Encoding
- CatBoost Encoding
- Feature embedding with SVD 

You'll refit the classifier after each encoding to check its performance on hold-out data. First, run the next cell to repeat the work you did in the last exercise.

In [1]:
import numpy as np
import pandas as pd
from sklearn import preprocessing, metrics
import lightgbm as lgb

# Set up code checking
# This can take a few seconds, thanks for your patience
from learntools.core import binder
binder.bind(globals())
from learntools.feature_engineering.ex2 import *

clicks = pd.read_parquet('../input/feature-engineering-data/baseline_data.pqt')

This means that in case of installing LightGBM from PyPI via the ``pip install lightgbm`` command, you don't need to install the gcc compiler anymore.
Instead of that, you need to install the OpenMP library, which is required for running LightGBM on the system with the Apple Clang compiler.
You can install the OpenMP library by the following command: ``brew install libomp``.
  labels = getattr(columns, 'labels', None) or [
  return pd.MultiIndex(levels=new_levels, labels=labels, names=columns.names)
  labels, = index.labels


Here I'll define a couple functions to help test the new encodings.

In [2]:
def get_data_splits(dataframe, valid_fraction=0.1):
    """ Splits a dataframe into train, validation, and test sets. First, orders by 
        the column 'click_time'. Set the size of the validation and test sets with
        the valid_fraction keyword argument.
    """

    dataframe = dataframe.sort_values('click_time')
    valid_rows = int(len(dataframe) * valid_fraction)
    train = dataframe[:-valid_rows * 2]
    # valid size == test size, last two sections of the data
    valid = dataframe[-valid_rows * 2:-valid_rows]
    test = dataframe[-valid_rows:]
    
    return train, valid, test

def train_model(train, valid, test=None, feature_cols=None):
    if feature_cols is None:
        feature_cols = train.columns.drop(['click_time', 'attributed_time',
                                           'is_attributed'])
    dtrain = lgb.Dataset(train[feature_cols], label=train['is_attributed'])
    dvalid = lgb.Dataset(valid[feature_cols], label=valid['is_attributed'])
    
    param = {'num_leaves': 64, 'objective': 'binary', 
             'metric': 'auc', 'seed': 7}
    num_round = 1000
    print("Training model!")
    bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid], 
                    early_stopping_rounds=20, verbose_eval=False)
    
    valid_pred = bst.predict(valid[feature_cols])
    valid_score = metrics.roc_auc_score(valid['is_attributed'], valid_pred)
    print(f"Validation AUC score: {valid_score}")
    
    if test is not None: 
        test_pred = bst.predict(test[feature_cols])
        test_score = metrics.roc_auc_score(test['is_attributed'], test_pred)
        return bst, valid_score, test_score
    else:
        return bst, valid_score

Run this cell to get a baseline score. If your encodings do better than this, you can keep them.

In [3]:
print("Baseline model")
train, valid, test = get_data_splits(clicks)
_ = train_model(train, valid)

Baseline model
Training model!
Validation AUC score: 0.9622743228943659


## 1) Categorical encodings and leakage

These encodings are all based on statistics calculated from the dataset like counts and means. Considering this, what data should you be using to calculate the encodings?

Uncomment the following line after you've decided your answer.

In [4]:
# q_1.solution()

## 2) Count encodings

Here, encode the categorical features `['ip', 'app', 'device', 'os', 'channel']` using the count of each value in the data set. Using `CountEncoder` from the `category_encoders` library, fit the encoding using the categorical feature columns defined in `cat_features`. Then apply the encodings to the train and validation sets, adding them as new columns with names suffixed `"_count"`.

In [5]:
import category_encoders as ce

cat_features = ['ip', 'app', 'device', 'os', 'channel']
train, valid, test = get_data_splits(clicks)

# Create the count encoder
count_enc = ____

# Learn encoding from the training set
____

# Apply encoding to the train and validation sets as new columns
# Make sure to add `_count` as a suffix to the new columns
train_encoded = ____
valid_encoded = ____

q_2.check()

<IPython.core.display.Javascript object>

<span style="color:#ccaa33">Check:</span> When you've updated the starter code, `check()` will tell you whether your code is correct. You need to update the code that creates variables `train_encoded`, `valid_encoded`

In [6]:
# Uncomment if you need some guidance
#q_2.hint()
#q_2.solution()

In [7]:
#%%RM_IF(PROD)%%
cat_features = ['ip', 'app', 'device', 'os', 'channel']
train, valid, test = get_data_splits(clicks)

# Create the count encoder
count_enc = ce.CountEncoder(cols=cat_features)

# Learn encoding from the training set
count_enc.fit(train[cat_features])

# Apply encoding to the train and validation sets
train_encoded = train.join(count_enc.transform(train[cat_features]).add_suffix('_count'))
valid_encoded = valid.join(count_enc.transform(valid[cat_features]).add_suffix('_count'))

q_2.assert_check_passed()

<IPython.core.display.Javascript object>

<span style="color:#33cc33">Correct</span>

In [8]:
# Train the model on the encoded datasets
# This can take around 30 seconds to complete
_ = train_model(train_encoded, valid_encoded)

Training model!
Validation AUC score: 0.9653051135205329


Count encoding improved our model's score!

## 3) Why is count encoding effective?
At first glance, it could be surprising that Count Encoding helps make accurate models. 
Why do you think is count encoding is a good idea, or how does it improve the model score?

Uncomment the following line after you've decided your answer.

In [9]:
# q_3.solution()

## 4) Target encoding

Here you'll try some supervised encodings that use the labels (the targets) to transform categorical features. The first one is target encoding. Create the target encoder from the `category_encoders` library. Then, learn the encodings from the training dataset, apply the encodings to all the datasets and retrain the model.

In [10]:
cat_features = ['ip', 'app', 'device', 'os', 'channel']
train, valid, test = get_data_splits(clicks)

# Create the target encoder. You can find this easily by using tab completion.
# Start typing ce. the press Tab to bring up a list of classes and functions.
target_enc = ____

# Learn encoding from the training set. Use the 'is_attributed' column as the target.
____

# Apply encoding to the train and validation sets as new columns
# Make sure to add `_target` as a suffix to the new columns
train_encoded = ____
valid_encoded = ____

q_4.check()

<IPython.core.display.Javascript object>

<span style="color:#ccaa33">Check:</span> When you've updated the starter code, `check()` will tell you whether your code is correct. You need to update the code that creates variables `train_encoded`, `valid_encoded`

In [11]:
# Uncomment these if you need some guidance
#q_4.hint()
#q_4.solution()

In [12]:
#%%RM_IF(PROD)%%
cat_features = ['ip', 'app', 'device', 'os', 'channel']
target_enc = ce.TargetEncoder(cols=cat_features)

train, valid, test = get_data_splits(clicks)
target_enc.fit(train[cat_features], train['is_attributed'])

train_encoded = train.join(target_enc.transform(train[cat_features]).add_suffix('_target'))
valid_encoded = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_target'))

q_4.assert_check_passed()

<IPython.core.display.Javascript object>

<span style="color:#33cc33">Correct</span>

In [13]:
_ = train_model(train_encoded, valid_encoded)

Training model!
Validation AUC score: 0.9540530347873288


## 5) Remove IP encoding

Try leaving `ip` out of the encoded features and retrain the model with target encoding again. You should find that the score increases and is above the baseline score! Why do you think the score is below baseline when we encode the IP address but above baseline when we don't?

Uncomment the following line after you've decided your answer.

In [14]:
# q_5.solution()

## 6) CatBoost Encoding

The CatBoost encoder is supposed to working well with the LightGBM model. Encode the categorical features with `CatBoostEncoder` and train the model on the encoded data again.

In [15]:
train, valid, test = get_data_splits(clicks)

# Create the CatBoost encoder
cb_enc = ____

# Learn encoding from the training set
____

# Apply encoding to the train and validation sets as new columns
# Make sure to add `_cb` as a suffix to the new columns
train_encoded = ____
valid_encoded = ____
q_6.check()

<IPython.core.display.Javascript object>

<span style="color:#ccaa33">Check:</span> When you've updated the starter code, `check()` will tell you whether your code is correct. You need to update the code that creates variables `train_encoded`, `valid_encoded`

In [16]:
# Uncomment these if you need some guidance
#q_6.hint()
#q_6.solution()

In [17]:
#%%RM_IF(PROD)%%
cat_features = ['app', 'device', 'os', 'channel']
train, valid, _ = get_data_splits(clicks)

cb_enc = ce.CatBoostEncoder(cols=cat_features, random_state=7)

# Learn encodings on the train set
cb_enc.fit(train[cat_features], train['is_attributed'])

# Apply encodings to each set
train_encoded = train.join(cb_enc.transform(train[cat_features]).add_suffix('_cb'))
valid_encoded = valid.join(cb_enc.transform(valid[cat_features]).add_suffix('_cb'))

q_6.assert_check_passed()

<IPython.core.display.Javascript object>

<span style="color:#33cc33">Correct</span>

In [18]:
_ = train_model(train, valid)

Training model!
Validation AUC score: 0.9622743228943659


The CatBoost encodings work the best, so we'll keep those.

In [19]:
encoded = cb_enc.transform(clicks[cat_features])
for col in encoded:
    clicks.insert(len(clicks.columns), col + '_cb', encoded[col])

## Categorical feature embeddings with SVD

Now you'll create embeddings from pairs of columns using SVD to learn from a count matrix.

In [20]:
import itertools
from sklearn.decomposition import TruncatedSVD

## 7) Learn SVD components as embeddings.

Here you'll use SVD to learn embeddings for the categorical features from a matrix of counts. First, create the SVD transformer with `TruncatedSVD`. Then for each pair of features, create a count matrix and learn the SVD components. Remember you should be learning the embeddings from the train dataset to avoid leakage.

In [21]:
train, valid, test = get_data_splits(clicks)
cat_features = ['app', 'device', 'os', 'channel']

# Create the SVD transformer
svd = TruncatedSVD(n_components=5, random_state=7)

# Learn SVD feature vectors and store in svd_components as DataFrames
# Make sure you're only using the train set!
svd_components = {}
for col1, col2 in itertools.permutations(cat_features, 2):
    # Create the count matrix
    ____
    
    # Fit the SVD and transform to get the components
    pair_components = ____
    
    # Store the components in the dictionary. 
    svd_components['_'.join([col1, col2])] = pair_components

q_7.check()

<IPython.core.display.Javascript object>

<span style="color:#cc3333">Incorrect:</span> Something wrong with app_device

In [22]:
# Uncomment these if you need some guidance
#q_7.hint()
#q_7.solution()

In [23]:
#%%RM_IF(PROD)%%
train, valid, test = get_data_splits(clicks)

# Learn SVD feature vectors
cat_features = ['app', 'device', 'os', 'channel']
svd_components = {}
svd = TruncatedSVD(n_components=5, random_state=7)
# Loop through each pair of categorical features
for col1, col2 in itertools.permutations(cat_features, 2):
    # For a pair, create a sparse matrix with cooccurence counts
    pair_counts = train.groupby([col1, col2])['is_attributed'].count()
    pair_matrix = pair_counts.unstack(fill_value=0)
    
    pair_components = pd.DataFrame(svd.fit_transform(pair_matrix))
    
    # Fit the SVD and store the components
    svd_components['_'.join([col1, col2])] = pair_components

q_7.assert_check_passed()

<IPython.core.display.Javascript object>

<span style="color:#33cc33">Correct</span>

## 8) Encode categorical features with SVD components

With the components learned from the train dataset, encode the categorical features and create a dataframe `svd_encodings`. The columns need to be named with the feature pair, `svd`, and the component index, such as `"os_device_svd_0"`.

In [27]:
svd_encodings = pd.DataFrame(index=clicks.index)

for feature in svd_components:
    # Get the feature column the SVD components are encoding
    col = feature.split('_')[0]

    ## Use SVD components to encode the categorical features
    feature_components = svd_components[feature]
    comp_cols = ____
    
    # Doing this so we know what these features are
    svd_cols = ____
    svd_encodings = ____

# Fill null values with the mean
# svd_encodings = svd_encodings.fillna(svd_encodings.mean())
q_8.check()

<IPython.core.display.Javascript object>

<span style="color:#ccaa33">Check:</span> When you've updated the starter code, `check()` will tell you whether your code is correct. You need to update the code that creates variables `svd_components`, `svd_encodings`

In [None]:
# Uncomment these if you need some guidance
#q_8.hint()
#q_8.solution()

In [None]:
#%%RM_IF(PROD)%%
svd_encodings = pd.DataFrame(index=clicks.index)
for feature in svd_components:
    # Get the feature column the SVD components are encoding
    col = feature.split('_')[0]

    ## Use SVD components to encode the categorical features
    feature_components = svd_components[feature]
    comp_cols = feature_components.reindex(clicks[col]).set_index(clicks.index)
    
    # Doing this so we know what these features are
    comp_cols = comp_cols.add_prefix(feature + '_svd_')
    
    svd_encodings = svd_encodings.join(comp_cols)

# Fill null values with the mean
svd_encodings = svd_encodings.fillna(svd_encodings.mean())

q_8.assert_check_passed()

Test the encoded data.

In [None]:
train, valid, test = get_data_splits(clicks.join(svd_encodings))
_ = train_model(train, valid)

# Keep Going

Now you are ready to **[generating completely new features](#$NEXT_NOTEBOOK_URL$)** from the data itself.