## Univariate roc-auc or mse

This procedure works as follows:

- First, it builds one decision tree per feature to predict the target
- Second, it makes predictions with each tree, using only that tree's feature
- Third, it ranks the features according to a machine learning metric (roc-auc for classification, mse for regression)
- Finally, it selects the highest-ranked features
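
The four steps above can be sketched end to end on synthetic data. This is a minimal illustration, not the notebook's actual pipeline: the dataset comes from `make_classification`, and the names `X_tr`, `X_te`, `scores`, `ranking` and `top_k`, as well as `max_depth=3`, are assumptions chosen for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic data (an assumption for illustration, not the notebook's dataset)
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(6)])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scores = {}
for col in X_tr.columns:
    # step 1: one tree per feature
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(X_tr[[col]], y_tr)
    # step 2: predict with that single feature
    proba = tree.predict_proba(X_te[[col]])[:, 1]
    scores[col] = roc_auc_score(y_te, proba)

# step 3: rank by the metric (higher roc-auc is better)
ranking = pd.Series(scores).sort_values(ascending=False)

# step 4: keep the highest ranked features
top_k = 3
selected = ranking.index[:top_k].tolist()
print(ranking)
print(selected)
```

For a regression target the same loop would use `DecisionTreeRegressor` and `mean_squared_error`, and the ranking would be sorted in ascending order instead.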

I will demonstrate how to select features based on univariate roc-auc or univariate mse on a classification problem and on a regression problem. For classification I will use the Paribas claims dataset from Kaggle. For regression, I will use the House Price dataset from Kaggle.

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from datetime import datetime
from ipywidgets import IntProgress
from multiprocessing import Pool, cpu_count

from sklearn.model_selection import train_test_split
from tqdm.notebook import tqdm as tqdm_notebook  # tqdm.tqdm_notebook is deprecated
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.metrics import roc_auc_score, mean_squared_error

### GLOBAL VARIABLES

In [None]:
INPUT_PATH = '../../data/train_test'
OUTPUT_PATH = '../../data/features'
INPUT_FILE_NAME = 'filter_features_correlation_v008'
OUTPUT_FILE_NAME = 'filter_univariate_rocauc_mse_v008'
SEED = 47
CUTOFF = 0.9
TOP_FEATURES = 35

### FUNCTIONS

In [None]:
def reduce_mem_usage(df, verbose=False):
    start_mem = df.memory_usage().sum() / 1024 ** 2
    int_columns = df.select_dtypes(include=["int"]).columns
    float_columns = df.select_dtypes(include=["float"]).columns

    for col in int_columns:
        df[col] = pd.to_numeric(df[col], downcast="integer")

    for col in float_columns:
        df[col] = pd.to_numeric(df[col], downcast="float")

    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print(
            "Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)".format(
                end_mem, 100 * (start_mem - end_mem) / start_mem
            )
        )
    return df

In [None]:
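def run_tree_regressor(feature):
    """Fit a single-feature tree and return its mse on the validation set."""
    # relies on the global X_train, y_train, X_test and y_test
    reg = DecisionTreeRegressor(random_state=SEED)
    reg.fit(X_train[feature].to_frame(), y_train)
    y_scored = reg.predict(X_test[feature].to_frame())

    return mean_squared_error(y_test, y_scored)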

In [None]:
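def run_tree_classification(feature):
    """Fit a single-feature tree and return its roc-auc on the validation set."""
    # relies on the global X_train, y_train, X_test and y_test
    clf = DecisionTreeClassifier(random_state=SEED)
    clf.fit(X_train[feature].to_frame(), y_train)
    y_scored = clf.predict_proba(X_test[feature].to_frame())

    # column 1 holds the predicted probability of the positive class
    return roc_auc_score(y_test, y_scored[:, 1])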

### LOAD DATASET

In [None]:
# load dataset 
X_train = pd.read_pickle(f'{INPUT_PATH}/X_train.pkl').pipe(reduce_mem_usage)
y_train = pd.read_pickle(f'{INPUT_PATH}/Y_train.pkl')
X_test = pd.read_pickle(f'{INPUT_PATH}/X_val.pkl').pipe(reduce_mem_usage)
y_test = pd.read_pickle(f'{INPUT_PATH}/Y_val.pkl')

In [None]:
print("X_train: ", X_train.shape)
print("y_train: ", y_train.shape)
print("X_test: ", X_test.shape)
print("y_test: ", y_test.shape)

In [None]:
features = np.load(f'{OUTPUT_PATH}/{INPUT_FILE_NAME}.npy').tolist()

In [None]:
X_train = X_train[features]
X_test = X_test[features]

In [None]:
print("X_train: ", X_train.shape)
print("y_train: ", y_train.shape)
print("X_test: ", X_test.shape)
print("y_test: ", y_test.shape)

In [None]:
X_train['target'] = y_train.values
X_test['target'] = y_test.values

In [None]:
features_init = X_train.columns.tolist()

### SAMPLE DATASET

In [None]:
X_train = X_train.groupby(['item_id', 'store_id']).apply(lambda x: pd.DataFrame.sample(x, frac=.3, random_state=SEED))

In [None]:
X_test = X_test.groupby(['item_id', 'store_id']).apply(lambda x: pd.DataFrame.sample(x, frac=.3, random_state=SEED))

In [None]:
y_train = X_train.target

In [None]:
y_test = X_test.target

In [None]:
X_train.drop('target', axis=1, inplace=True)

In [None]:
X_test.drop('target', axis=1, inplace=True)

In [None]:
print("X_train: ", X_train.shape)
print("y_train: ", y_train.shape)
print("X_test: ", X_test.shape)
print("y_test: ", y_test.shape)

### SELECT NUMERIC FEATURES

In [None]:
catfeatures = set(['item_id', 'dept_id', 'cat_id', 'store_id', 'state_id', 'month', 'week', 'year'])

In [None]:
# In practice, feature selection should be done after data pre-processing,
# so ideally, all the categorical variables are encoded into numbers,
# and then you can assess how well each of them predicts the target

# here for simplicity I will use only numerical variables
# select numerical columns:

numerics = ['int8', 'int16', 'int32', 'int64', 'float16', 'float32', 'float64']
numerical_vars = list(X_train.select_dtypes(include=numerics).columns)
# .copy() avoids SettingWithCopyWarning when the frames are modified in place later
X_train = X_train[numerical_vars].copy()
# keep the validation set aligned with the train set
X_test = X_test[numerical_vars].copy()
X_train.shape

In [None]:
X_train.replace([np.inf, -np.inf], np.nan, inplace=True)
X_train.fillna(0, inplace=True)
X_test.replace([np.inf, -np.inf], np.nan, inplace=True)
X_test.fillna(0, inplace=True)

In [None]:
print("X_train: ", X_train.shape)
print("y_train: ", y_train.shape)
print("X_test: ", X_test.shape)
print("y_test: ", y_test.shape)

## Classification

In [None]:
# build a tree, make predictions and get the roc-auc
# for each feature of the train set, in parallel across the available cores
features = list(X_train.columns)

tic = datetime.now()
with Pool(cpu_count()) as p:
    roc_values = list(tqdm_notebook(p.imap(run_tree_classification, features), total=len(features)))
toc = datetime.now()
# total_seconds() keeps the full duration; .seconds drops days and fractions
print("Total time", (toc - tic).total_seconds() / 60, "min")

In [None]:
# let's add the variable names and order it for clearer visualisation
roc_values = pd.Series(roc_values)
roc_values.index = X_train.columns
roc_values.sort_values(ascending=False)

In [None]:
# and now let's plot
roc_values.sort_values(ascending=False).plot.bar(figsize=(20, 8))

In [None]:
# a roc auc value of 0.5 indicates random decision
# let's check how many features show a roc-auc value
# higher than random

len(roc_values[roc_values > 0.5])

In [None]:
# sort first, so the slice keeps the best-scoring features rather than the first columns
features_final = roc_values[roc_values > 0.5].sort_values(ascending=False).index[:TOP_FEATURES].tolist()

In [None]:
print(len(features) - len(features_final), "features were removed. The number of final features is", len(features_final))

In [None]:
# saving final features
np.save(f'{OUTPUT_PATH}/{OUTPUT_FILE_NAME}.npy',features_final)

You can of course tune the parameters of the Decision Tree and get better predictions; I leave this to you. But remember that the key here is not to build ultra-predictive Decision Trees, but rather to use them to screen quickly for important features, so I would recommend you don't spend too much time tuning. To get a more reliable measure of the roc-auc per feature, cross-validation with sklearn is straightforward to add.
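
As a rough sketch of that cross-validation idea, `cross_val_score` with `scoring="roc_auc"` can replace the single train/validation split. The synthetic data, the helper name `cv_roc_auc`, the shallow `max_depth=3` and the choice of 3 folds are all assumptions made for this illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# synthetic data (an assumption for illustration, not the notebook's dataset)
X, y = make_classification(
    n_samples=600, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X = pd.DataFrame(X, columns=["a", "b", "c", "d"])


def cv_roc_auc(feature, cv=3):
    """Mean cross-validated roc-auc of a tree trained on one feature (hypothetical helper)."""
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    fold_scores = cross_val_score(tree, X[[feature]], y, scoring="roc_auc", cv=cv)
    return fold_scores.mean()


cv_scores = pd.Series({f: cv_roc_auc(f) for f in X.columns}).sort_values(ascending=False)
print(cv_scores)
```

Averaging over folds makes the per-feature score less sensitive to one particular split, at the cost of fitting `cv` trees per feature instead of one.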

Once again, where we put the cut-off to select features is a bit arbitrary, beyond requiring a roc-auc above 0.5, the value expected from random guessing. It will be up to you.
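
To see why 0.5 is the natural floor, consider a feature that carries no information at all. In the small sketch below (the labels in `y_demo` are made up, and `constant_feature` is a hypothetical all-zeros column), the tree cannot split, so it predicts the same probability for every row and the roc-auc comes out at exactly 0.5.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

# made-up balanced labels and a feature with a single constant value
y_demo = np.array([0, 1, 0, 1, 1, 0, 1, 0])
constant_feature = np.zeros((len(y_demo), 1))

# with nothing to split on, the tree is a single leaf
tree = DecisionTreeClassifier(random_state=0).fit(constant_feature, y_demo)
proba = tree.predict_proba(constant_feature)[:, 1]

auc = roc_auc_score(y_demo, proba)
print(auc)  # 0.5
```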

## Regression

In [None]:
# build a tree, make predictions and get the mse
# for each feature of the train set, in parallel across the available cores
features = list(X_train.columns)

tic = datetime.now()
with Pool(cpu_count()) as p:
    mse_values = list(tqdm_notebook(p.imap(run_tree_regressor, features), total=len(features)))
toc = datetime.now()
# total_seconds() keeps the full duration; .seconds drops days and fractions
print("Total time", (toc - tic).total_seconds() / 60, "min")

In [None]:
# let's add the variable names and order it for clearer visualisation
mse_values = pd.Series(mse_values)
mse_values.index = X_train.columns
mse_values.sort_values(ascending=True).head()

In [None]:
mse_values.sort_values(ascending=True).iloc[:TOP_FEATURES]

In [None]:
mse_values.sort_values(ascending=False).plot.bar(figsize=(20,8))

In [None]:
features_final = mse_values.sort_values(ascending=True).index[:TOP_FEATURES].tolist()

In [None]:
print(len(features) - len(features_final), "features were removed. The number of final features is", len(features_final))

In [None]:
# saving final features
# (this writes to the same file as the classification section and overwrites it)
np.save(f'{OUTPUT_PATH}/{OUTPUT_FILE_NAME}.npy', features_final)

Remember that for regression, the smaller the mse, the better the model performance. The bar plot above is sorted from the largest to the smallest mse, so in this case we need to select from the right to the left.
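
A tiny example makes the direction of the ranking concrete. The feature names and mse values below are invented purely for illustration; the point is only that the ascending sort puts the best features first.

```python
import pandas as pd

# hypothetical per-feature mse values (made up for this example)
mse_demo = pd.Series({"feat_a": 4.2, "feat_b": 1.3, "feat_c": 9.8, "feat_d": 2.7})

# lower mse is better, so sort ascending before taking the top of the list
best_first = mse_demo.sort_values(ascending=True)
top_2 = best_first.index[:2].tolist()
print(top_2)  # ['feat_b', 'feat_d']
```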

For the mse, where to put the cut-off is arbitrary as well. It depends on how many features you would like to end up with.

I do use this method in my projects, particularly when I have an enormous number of features and I need to start reducing the feature space quickly.

You can see an example use case in [my talk at PyData London](https://www.youtube.com/watch?v=UHtAjLYgDQ4).

That is all for this lecture, I hope you enjoyed it and see you in the next one!