### Introduction

The more features you have...

1. the more likely you are to overfit to the training and validation sets.
2. the longer it will take to train your model and optimize hyperparameters.
3. slower inference (reasoning predictions relative to features afterwards)

In [9]:
import numpy as np
import pandas as pd
from sklearn import preprocessing, metrics
import lightgbm as lgb

import os

clicks = pd.read_parquet('/Users/fred/Google Drive/University/Machine Learning/Kaggle/Datasets/228068_567626_bundle_archive/baseline_data.pqt')

data_files = ['count_encodings.pqt',
              'catboost_encodings.pqt',
              'interactions.pqt',
              'past_6hr_events.pqt',
              'downloads.pqt',
              'time_deltas.pqt',
              'svd_encodings.pqt']
data_root = '/Users/fred/Google Drive/University/Machine Learning/Kaggle/Datasets/228068_567626_bundle_archive'
for file in data_files:
    features = pd.read_parquet(os.path.join(data_root, file))
    clicks = clicks.join(features)

def get_data_splits(dataframe, valid_fraction=0.1):

    dataframe = dataframe.sort_values('click_time')
    valid_rows = int(len(dataframe) * valid_fraction)
    train = dataframe[:-valid_rows * 2]
    # valid size == test size, last two sections of the data
    valid = dataframe[-valid_rows * 2:-valid_rows]
    test = dataframe[-valid_rows:]
    
    return train, valid, test

def train_model(train, valid, test=None, feature_cols=None):
    if feature_cols is None:
        feature_cols = train.columns.drop(['click_time', 'attributed_time',
                                           'is_attributed'])
    dtrain = lgb.Dataset(train[feature_cols], label=train['is_attributed'])
    dvalid = lgb.Dataset(valid[feature_cols], label=valid['is_attributed'])
    
    param = {'num_leaves': 64, 'objective': 'binary', 
             'metric': 'auc', 'seed': 7}
    num_round = 1000
    print("Training model!")
    bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid], 
                    early_stopping_rounds=20, verbose_eval=False)
    
    valid_pred = bst.predict(valid[feature_cols])
    valid_score = metrics.roc_auc_score(valid['is_attributed'], valid_pred)
    print(f"Validation AUC score: {valid_score}")
    
    if test is not None: 
        test_pred = bst.predict(test[feature_cols])
        test_score = metrics.roc_auc_score(test['is_attributed'], test_pred)
        return bst, valid_score, test_score
    else:
        return bst, valid_score

In [11]:
# Let's look at the baseline score for all the features we've made so far.


train, valid, test = get_data_splits(clicks)
_, baseline_score = train_model(train, valid)

"""
Now we have 91 features we're using for predictions. With all these features, there is a good chance the model is overfitting the data. We might be able to reduce the overfitting by removing some features. Of course, the model's performance might decrease. But at least we'd be making the model smaller and faster without losing much performance.
"""

Training model!
Validation AUC score: 0.9658334271834417


"\nNow we have 91 features we're using for predictions. With all these features, there is a good chance the model is overfitting the data. We might be able to reduce the overfitting by removing some features. Of course, the model's performance might decrease. But at least we'd be making the model smaller and faster without losing much performance.\n"

In [12]:
clicks.head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,day,hour,...,device_channel_svd_0,device_channel_svd_1,device_channel_svd_2,device_channel_svd_3,device_channel_svd_4,os_channel_svd_0,os_channel_svd_1,os_channel_svd_2,os_channel_svd_3,os_channel_svd_4
0,27226,3,1,13,120,2017-11-06 15:13:23,,0,6,15,...,0.998937,-0.026614,0.033651,-0.016794,0.001659,0.632548,-0.05079,-0.045754,0.086897,-0.3227
1,110007,35,1,13,10,2017-11-06 15:41:07,2017-11-07 08:17:19,1,6,15,...,0.998937,-0.026614,0.033651,-0.016794,0.001659,0.632548,-0.05079,-0.045754,0.086897,-0.3227
2,1047,6,1,13,157,2017-11-06 15:42:32,,0,6,15,...,0.998937,-0.026614,0.033651,-0.016794,0.001659,0.632548,-0.05079,-0.045754,0.086897,-0.3227
3,76270,3,1,13,120,2017-11-06 15:56:17,,0,6,15,...,0.998937,-0.026614,0.033651,-0.016794,0.001659,0.632548,-0.05079,-0.045754,0.086897,-0.3227
4,57862,3,1,13,120,2017-11-06 15:57:01,,0,6,15,...,0.998937,-0.026614,0.033651,-0.016794,0.001659,0.632548,-0.05079,-0.045754,0.086897,-0.3227


### Univariate Feature Selection

Using SelectKBest features. The F-value measures the linear dependency between the feature variable and the target. This means the score might underestimate the relation between a feature and the target if the relationship is nonlinear. 


In [15]:
from sklearn.feature_selection import SelectKBest, f_classif
feature_cols = clicks.columns.drop(['click_time', 'attributed_time', 'is_attributed'])
train, valid, test = get_data_splits(clicks)

# Create the selector, keeping 40 features
selector = SelectKBest(f_classif, k=40)

# Use the selector to retrieve the best features
X_new = selector.fit_transform(train[feature_cols], train['is_attributed']) 

# Get back the kept features as a DataFrame with dropped columns as all 0s
selected_features = pd.DataFrame(selector.inverse_transform(X_new), 
                                 index=train.index, 
                                 columns=feature_cols)

# Find the columns that were dropped
dropped_columns = selected_features.columns[selected_features.var() == 0]


In [16]:
_ = train_model(train.drop(dropped_columns, axis=1), 
                valid.drop(dropped_columns, axis=1))

Training model!
Validation AUC score: 0.9625481759576047


### L1 regulatrization

Univariate methods consider only one feature at a time when making a selection decision. Instead, we can make our selection using all of the features by including them in a linear model with L1 regularization. This type of regularization (sometimes called Lasso) penalizes the absolute magnitude of the coefficients, as compared to L2 (Ridge) regression which penalizes the square of the coefficients.

Now try a more powerful approach using L1 regularization. Implement a function select_features_l1 that returns a list of features to keep.

Use a LogisticRegression classifier model with an L1 penalty to select the features. For the model, set:

the random state to 7,
the regularization parameter to 0.1,
and the solver to 'liblinear'.
Fit the model then use SelectFromModel to return a model with the selected features.

The checking code will run your function on a sample from the dataset to provide more immediate feedback.

In [21]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

def select_features_l1(X, y):
    """Return selected features using logistic regression with an L1 penalty."""
    
    logistic = LogisticRegression(C=0.1, penalty="l2", random_state=7).fit(X, y)
    model = SelectFromModel(logistic, prefit=True)

    X_new = model.transform(X)

    # Get back the kept features as a DataFrame with dropped columns as all 0s
    selected_features = pd.DataFrame(model.inverse_transform(X_new), 
                                    index=X.index,
                                    columns=X.columns)

    # Dropped columns have values of all 0s, keep other columns 
    cols_to_keep = selected_features.columns[selected_features.var() != 0]

    return cols_to_keep

In [22]:
n_samples = 10000
X, y = train[feature_cols][:n_samples], train['is_attributed'][:n_samples]
selected = select_features_l1(X, y)

dropped_columns = feature_cols.drop(selected)
_ = train_model(train.drop(dropped_columns, axis=1), 
                valid.drop(dropped_columns, axis=1))

Training model!
Validation AUC score: 0.9642664604988731
