# Feature Selection

In this exercise you'll use some feature selection algorithms to improve your model.

In [1]:
import numpy as np
import pandas as pd
from sklearn import preprocessing, metrics
import lightgbm as lgb

# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.machine_learning.ex2 import *

import os

clicks = pd.read_parquet('../input/baseline_data.pqt')
data_files = ['count_encodings.pqt',
              'catboost_encodings.pqt',
              'interactions.pqt',
              'past_6hr_events.pqt',
              'downloads.pqt',
              'time_deltas.pqt',
              'svd_encodings.pqt']
for file in data_files:
    features = pd.read_parquet(os.path.join('../input', file))
    clicks = clicks.join(features)


In [2]:
def get_data_splits(dataframe, valid_fraction=0.1):

    dataframe = dataframe.sort_values('click_time')
    valid_rows = int(len(dataframe) * valid_fraction)
    train = dataframe[:-valid_rows * 2]
    # valid size == test size, last two sections of the data
    valid = dataframe[-valid_rows * 2:-valid_rows]
    test = dataframe[-valid_rows:]
    
    return train, valid, test

def train_model(train, valid, test, feature_cols=None):
    if feature_cols is None:
        feature_cols = train.columns.drop(['click_time', 'attributed_time',
                                           'is_attributed'])
    dtrain = lgb.Dataset(train[feature_cols], label=train['is_attributed'])
    dvalid = lgb.Dataset(valid[feature_cols], label=valid['is_attributed'])
    dtest = lgb.Dataset(test[feature_cols], label=test['is_attributed'])
    
    param = {'num_leaves': 64, 'objective': 'binary', 
             'metric': 'auc', 'seed': 7}
    num_round = 1000
    print("Training model!")
    bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid], 
                    early_stopping_rounds=20, verbose_eval=False)
    
    valid_pred = bst.predict(valid[feature_cols])
    valid_score = metrics.roc_auc_score(valid['is_attributed'], valid_pred)
    
    test_pred = bst.predict(test[feature_cols])
    test_score = metrics.roc_auc_score(test['is_attributed'], test_pred)
    print(f"Validation AUC score: {valid_score}")
    
    return bst, valid_score, test_score

## Baseline Score

Let's look at the baseline score for all the features we've made so far.

In [3]:
train, valid, test = get_data_splits(clicks)
_, baseline_score, _ = train_model(train, valid, test)

Training model!
Validation AUC score: 0.9659200109124912


## 1) Which data to use for feature selection?

Since many feature selection methods require calculating statistics from the dataset, should you use all the data for feature selection?

**Answer:** Including validation and test data in feature selection is a source of leakage. Perform feature selection on the train set only, then use those results to remove the same features from the validation and test sets.
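As a rough illustration of the answer above, the sketch below fits a selector on the training rows only and then reuses the chosen columns on the validation rows. The synthetic data and the `feat_*` column names are assumptions made for this example, not part of the exercise data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the click data (hypothetical column names)
rng = np.random.default_rng(0)
n_rows = 300
y = rng.integers(0, 2, size=n_rows)
X = pd.DataFrame(rng.normal(size=(n_rows, 6)),
                 columns=[f"feat_{i}" for i in range(6)])
X["feat_0"] += 2.0 * y   # make two columns informative about the label
X["feat_1"] -= 1.5 * y

# Ordered split: first 80% of rows for training, last 20% for validation
split = int(n_rows * 0.8)
X_train, X_valid = X.iloc[:split], X.iloc[split:]
y_train = y[:split]

# Fit the selector on the training rows only
selector = SelectKBest(f_classif, k=2).fit(X_train, y_train)
kept_cols = X_train.columns[selector.get_support()]

# Reuse the same column choice on validation data without refitting
X_train_sel = X_train[kept_cols]
X_valid_sel = X_valid[kept_cols]
print(list(kept_cols))
```

The validation labels are never passed to the selector, so the column choice cannot depend on them.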

Now we have 131 features we're using for predictions. With all these features, there is a good chance the model is overfitting the data. We might be able to reduce the overfitting by removing some features. Of course, the model's performance might decrease. But at least we'd be making the model smaller and faster without losing much performance.

## 2) Univariate Feature Selection

Below, use `SelectKBest` to choose 40 features from the 131 features in the data.

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif

In [None]:
feature_cols = clicks.columns.drop(['click_time', 'attributed_time', 'is_attributed'])
train, valid, test = get_data_splits(clicks)

# Create the selector, keeping 40 features
selector = ____

# Use the selector to retrieve the best features
X_new = ____ 

# Get back the kept features as a DataFrame with dropped columns as all 0s
selected_features = ____

# Find the columns that were dropped
dropped_columns = ____

In [None]:
#%%RM_IF(PROD)%%
feature_cols = clicks.columns.drop(['click_time', 'attributed_time', 'is_attributed'])
train, valid, test = get_data_splits(clicks)

# Do feature selection on the training data only!
selector = SelectKBest(f_classif, k=40)
X_new = selector.fit_transform(train[feature_cols], train['is_attributed'])

# Get back the features we've kept, zero out all other features
selected_features = pd.DataFrame(selector.inverse_transform(X_new), 
                                 index=train.index, 
                                 columns=feature_cols)

# Dropped columns have values of all 0s, so var is 0, drop them
dropped_columns = selected_features.columns[selected_features.var() == 0]

In [None]:
_ = train_model(train.drop(dropped_columns, axis=1), 
                valid.drop(dropped_columns, axis=1),
                test.drop(dropped_columns, axis=1))

Training model!
Validation AUC score: 0.962405543877231
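The solution above finds the dropped columns by checking for zero variance after `inverse_transform`. A possible alternative, sketched here on synthetic data with made-up column names, reads the boolean mask from `selector.get_support()` directly. This avoids building the intermediate DataFrame and does not mislabel a kept column that happens to be constant.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Small synthetic dataset (hypothetical columns a..e)
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=200)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("abcde"))
X["a"] += 2.0 * y   # informative column
X["c"] -= 2.0 * y   # informative column

selector = SelectKBest(f_classif, k=2).fit(X, y)

# get_support() returns True for kept features and False for dropped ones
mask = selector.get_support()
dropped_columns = X.columns[~mask]
print(list(dropped_columns))
```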


## 3) The best value of K

With this method we can choose the best K features, but we still have to choose K ourselves. How would you find the "best" value of K? That is, you want it to be small so you're keeping the best features, but not so small that it's degrading the model's performance.

**Answer:** To find the best value of K, you can fit multiple models with increasing values of K, then choose the smallest K whose validation score is above some threshold or meets some other criterion. A good way to do this is to loop over values of K and record the validation score for each iteration.
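The loop described in the answer might look like the sketch below. It uses synthetic data and a `LogisticRegression` stand-in instead of the LightGBM model so that it runs quickly; the candidate values of K and the `tolerance` threshold are arbitrary choices made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic data: 20 features, only 5 of them informative
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)
X_train, y_train = X[:400], y[:400]
X_valid, y_valid = X[400:], y[400:]

scores = {}
for k in [1, 2, 5, 10, 20]:
    # Select features using the training data only
    selector = SelectKBest(f_classif, k=k).fit(X_train, y_train)
    model = LogisticRegression(max_iter=1000)
    model.fit(selector.transform(X_train), y_train)
    valid_pred = model.predict_proba(selector.transform(X_valid))[:, 1]
    scores[k] = roc_auc_score(y_valid, valid_pred)

# Pick the smallest K whose score is within a tolerance of the best score
best_score = max(scores.values())
tolerance = 0.005  # hypothetical threshold, tune for your problem
best_k = min(k for k, s in scores.items() if s >= best_score - tolerance)
print(scores, best_k)
```

With the real data you would call `train_model` inside the loop and record its validation score instead.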



## 4) Use L1 regularization for feature selection

Now try a more powerful approach using L1 regularization. Use a `LinearSVC` classifier model with an L1 penalty to select the features. First fit the model, then use `SelectFromModel` to return a model with the selected features.

In [None]:
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

In [None]:
feature_cols = clicks.columns.drop(['click_time', 'attributed_time', 'is_attributed'])
train, valid, test = get_data_splits(clicks)

# Create the LinearSVC model with an L1 penalty and regularization parameter C = 0.01
lsvc = ____

# Train the LinearSVC model
____

# Use SelectFromModel to retrieve the best features from the trained SVC model
model = ____ 

# With the new model, get the selected features
X_new = ____

# Get back the kept features as a DataFrame with dropped columns as all 0s
selected_features = ____

In [None]:
#%%RM_IF(PROD)%%

feature_cols = clicks.columns.drop(['click_time', 'attributed_time', 'is_attributed'])
train, valid, test = get_data_splits(clicks)

X, y = train[feature_cols], train['is_attributed']
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
model = SelectFromModel(lsvc, prefit=True)

X_new = model.transform(X)

# Get back the kept features as a DataFrame with dropped columns as all 0s
selected_features = pd.DataFrame(model.inverse_transform(X_new), 
                                 index=train.index, 
                                 columns=feature_cols)

In [None]:
# Dropped columns have values of all 0s, so var is 0, drop them
dropped_columns = selected_features.columns[selected_features.var() == 0]
_ = train_model(train.drop(dropped_columns, axis=1), 
                valid.drop(dropped_columns, axis=1),
                test.drop(dropped_columns, axis=1))

## 5) Feature Selection with Trees

Since we're using a tree-based model, using another tree-based model for feature selection might produce better results. What would you do differently to select the features using a tree-based classifier?

**Answer:** You could use something like `RandomForestClassifier` or `ExtraTreesClassifier` to find feature importances. `SelectFromModel` can use the feature importances to find the best features.
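A minimal sketch of that idea, assuming synthetic data in place of the click features: fit an `ExtraTreesClassifier`, then pass it to `SelectFromModel`, which keeps the features whose importance meets a threshold (the mean importance by default for tree models).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 15 features, only 4 of them informative
X, y = make_classification(n_samples=500, n_features=15, n_informative=4,
                           n_redundant=0, random_state=0)

# Fit the tree ensemble to obtain feature_importances_
forest = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)

# Keep the features whose importance is at least the mean importance
selector = SelectFromModel(forest, prefit=True)
X_new = selector.transform(X)
kept = selector.get_support()
print(X_new.shape, kept.sum())
```

As with the other selectors, the forest should be fit on the training split only.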

## 6) Principal Component Analysis

There are 103 numerical features: the SVD encodings plus the three numerical features we created. Use PCA to reduce these down to 20 features and refit the model. In this first step, create the PCA transformer and fit it to your data.

In [None]:
from sklearn.decomposition import PCA

train, valid, test = get_data_splits(clicks)

# Select the feature columns you'll use to train the PCA transformer
feature_cols = ____

# Create the PCA transformer with 20 components
pca = ____

# Fit PCA to the feature columns
____

In [None]:
#%%RM_IF(PROD)%%
from sklearn.decomposition import PCA

train, valid, test = get_data_splits(clicks)
feature_cols = train.columns[-103:]

pca = PCA(n_components=20)
# PCA is unsupervised, so the target column is not needed
pca.fit(train[feature_cols])
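To see what the fitted transformer does, here is a small sketch on synthetic data (an assumption for illustration, not the click features). It fits PCA on the training rows, applies the same transformer to held-out rows, and checks how much variance the kept components retain via `explained_variance_ratio_`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 10 observed features driven by 3 latent factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 10))
X_train, X_valid = X[:400], X[400:]

# Fit on the training rows only, then reuse the fitted transformer
pca = PCA(n_components=3).fit(X_train)
train_pcs = pca.transform(X_train)
valid_pcs = pca.transform(X_valid)

# Fraction of the training variance captured by the kept components
retained = pca.explained_variance_ratio_.sum()
print(valid_pcs.shape, retained)
```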

## 7) Applying PCA encodings

Implement a function `encode_pcs` that encodes the feature columns of a dataframe using a trained PCA transformer. As input, this function takes a dataframe, a trained PCA transformer, and a list of feature columns to encode. It should return a dataframe with the same index, but with the feature columns replaced by their PCA encodings. Note that the feature columns here must be the same ones used to train the PCA transformer.

In [None]:
def encode_pcs(df, pca, feature_cols):
    """ Returns a new dataframe with the feature columns of a dataframe (defined with the
        feature_cols argument) encoded using a trained PCA transformer
        
        Arguments
        ---------
        df: DataFrame
        pca: Trained PCA transformer
        feature_cols: the feature columns of df that will be encoded
        
        
        Returns
        -------
        DataFrame with PCA encoded features
    """
    ____
    return ____

In [None]:
#%%RM_IF(PROD)%%
def encode_pcs(df, pca, feature_cols):
    encodings = pd.DataFrame(pca.transform(df[feature_cols]),
                             index=df.index).add_prefix('pca_')
    encoded_df = df.drop(feature_cols, axis=1).join(encodings)
    return encoded_df

In [None]:
_ = train_model(encode_pcs(train, pca, feature_cols),
                encode_pcs(valid, pca, feature_cols), 
                encode_pcs(test, pca, feature_cols))

## 8) Feature Selection with Boruta

Finally, you'll use Boruta. Define a function `fit_boruta` that fits the Boruta feature selector to a dataset and returns the rejected features. This function takes a dataframe with the data, a list of feature columns, and the target column (as a string).

Boruta takes a while to run, so to provide more immediate feedback I'll only be testing it on a small sample of the data. Typically, you'd let it run on the entire dataset and just wait. I'll provide a file with the results on the full dataset so you can see the final performance.

In [None]:
from boruta import BorutaPy

In [None]:
def fit_boruta(df, feature_cols, target):
    ____
    return ____

In [None]:
#%%RM_IF(PROD)%%
from sklearn.ensemble import RandomForestClassifier

def fit_boruta(df, feature_cols, target):
    
    X = df[feature_cols].values
    y = df[target].values
    
    # Define a random forest classifier that uses all cores and
    # weights classes inversely to their frequency in y
    rf = RandomForestClassifier(class_weight='balanced', max_depth=5, n_jobs=-1)

    # Define the Boruta feature selection method
    feat_selector = BorutaPy(rf, n_estimators='auto', random_state=1)

    # Fit the Boruta selector
    feat_selector.fit(X, y)

    # Get the rejected columns (feature_cols is a pandas Index, so a boolean mask works)
    rejected_columns = feature_cols[~feat_selector.support_]
    return rejected_columns

feature_cols = clicks.columns.drop(['click_time', 'attributed_time', 'is_attributed'])
train, valid, test = get_data_splits(clicks)

rejected_columns = fit_boruta(train[:5000], feature_cols, 'is_attributed')

In [None]:
## Still need to fit boruta on the entire dataset and save the results.
## I'll load in the smaller dataset here

In [None]:
_ = train_model(train.drop(rejected_columns, axis=1),
                valid.drop(rejected_columns, axis=1),
                test.drop(rejected_columns, axis=1))