In this notebook, we compute feature importances for several tree-based ensemble methods, applied to the problem of predicting photometric redshifts from six photometric bands (u, g, r, i, z, y).

It accompanies Chapter 6 of the book.

Author: Viviana Acquaviva

License: TBD

In [None]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

%matplotlib inline
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_colwidth', 100)


font = {'size'   : 16}
matplotlib.rc('font', **font)
matplotlib.rc('xtick', labelsize=14) 
matplotlib.rc('ytick', labelsize=14) 
#matplotlib.rcParams.update({'figure.autolayout': True})
matplotlib.rcParams['figure.dpi'] = 300

In [None]:
import xgboost as xgb

In [None]:
from sklearn.tree import DecisionTreeRegressor 
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor

In [None]:
sel_features = pd.read_csv('../data/sel_features.csv', sep = '\t')

In [None]:
sel_target = pd.read_csv('../data/sel_target.csv')

In [None]:
sel_features.shape

In [None]:
sel_target.values.ravel() #changes shape to 1d row-like array

### Let's start with Random Forests

In [None]:
model = RandomForestRegressor(max_features=4, n_estimators=200) # no random_state is set, so results change from run to run; pass random_state=<int> to make them reproducible

Once the model has been fit, it exposes the attribute "feature\_importances\_". We first fit the model, then inspect that attribute:

In [None]:
model.fit(sel_features, sel_target.values.ravel()) 

#note: this is not doing any train/test split, but fitting the entire data set 

In [None]:
model.feature_importances_
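
As a sanity check on what this attribute contains, here is a minimal sketch on synthetic data (not the photometric data set). The names `X_toy`, `y_toy`, and `toy_model` are hypothetical and introduced only for illustration. The target depends on the first column alone, so that column should receive most of the importance, and the importances are normalized to sum to 1.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration): three features, only the first one matters
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(500, 3))
y_toy = 3.0 * X_toy[:, 0] + 0.1 * rng.normal(size=500)

toy_model = RandomForestRegressor(n_estimators=50, random_state=0)
toy_model.fit(X_toy, y_toy)

toy_importances = toy_model.feature_importances_

print("Importances:", toy_importances)
print("Sum of importances:", toy_importances.sum())      # normalized to 1, up to floating-point error
print("Most important column:", np.argmax(toy_importances))
```

A check of this kind on data with a known answer can help calibrate how much to trust the ranking on the real features.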

The code below plots the feature importances.

In [None]:
importances = model.feature_importances_
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(sel_features.shape[1]):
    print("%d. feature: %s, %d (%f)" % (f + 1, sel_features.columns[indices[f]], indices[f], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure(figsize=(16,6))
plt.title("Feature importances")
plt.bar(range(sel_features.shape[1]), importances[indices],
       color="r", align="center")
plt.xticks(range(sel_features.shape[1]), sel_features.columns[indices])
plt.xlim([-1, sel_features.shape[1]])
plt.show()

### In this problem, all the features turn out to be fairly important, which makes it hard to diagnose issues from a single ranking.

### Something we can do is to compare with the results of other algorithms.

In [None]:
# Plot the feature importances of three models

plt.figure(figsize=(16,6))

plt.title("Feature importances for various models")

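models = [RandomForestRegressor(max_features=4, n_estimators=200),
          # The base learner is passed positionally: its keyword was renamed from
          # base_estimator to estimator in scikit-learn 1.2, and this form works with both
          AdaBoostRegressor(DecisionTreeRegressor(max_depth=None), n_estimators=100),
          xgb.XGBRegressor(objective ='reg:squarederror', max_depth=6, n_estimators = 500, learning_rate=0.1)]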

model_names = ['Random Forests', 'AdaBoost', 'XGBoost']

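for i, model in enumerate(models):

    model.fit(sel_features, sel_target.values.ravel())

    importances = model.feature_importances_

    # Only needed for the optional printout at the end of the loop
    indices = np.argsort(importances)[::-1]

    # Bars are kept in the original column order (not sorted) so that the models can be
    # compared feature by feature; each model is shifted slightly and drawn semi-transparent
    plt.bar(np.arange(sel_features.shape[1])+0.1*i, importances,
            align="center", width=0.7, alpha = 0.5, label = model_names[i])

#    print('For model', model_names[i], 'features importances are', sel_features.columns[indices].values, importances[indices])

# Ticks, axis limits, and legend only need to be set once, after all bars are drawn
plt.xticks(range(sel_features.shape[1]), sel_features.columns)
plt.xlim([-1, sel_features.shape[1]])
plt.legend(fontsize = 12)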

This is a reminder that feature importance is only indicative, and that the ranking often depends on the algorithm.
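
One situation in which rankings can differ is when two features carry nearly the same information. The sketch below uses synthetic data (an assumption, not the photometric data set) in which column `b` is almost a copy of column `a`; all names are hypothetical. It fits a Random Forest and a Gradient Boosting model and prints how each distributes the importance across the three columns.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Synthetic data (an assumption for illustration)
rng = np.random.RandomState(1)
n = 400
a = rng.normal(size=n)
b = a + 0.1 * rng.normal(size=n)    # nearly a duplicate of a
c = rng.normal(size=n)              # independent feature
X_corr = np.column_stack([a, b, c])
y_corr = a + 0.5 * c + 0.1 * rng.normal(size=n)

toy_models = {
    # max_features=1: each split considers a single randomly chosen feature
    "Random Forests": RandomForestRegressor(n_estimators=50, max_features=1, random_state=1),
    "Gradient Boosting": GradientBoostingRegressor(n_estimators=50, random_state=1),
}

toy_results = {}
for name, toy_model in toy_models.items():
    toy_model.fit(X_corr, y_corr)
    toy_results[name] = toy_model.feature_importances_
    print(name, np.round(toy_results[name], 3))
```

How the credit is divided between `a` and `b` is a property of each algorithm and its settings, not only of the data, which is one reason to compare several models before drawing conclusions.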

### The issue of leakage of test labels - recap

- When we optimize hyperparameters with a grid search, we choose the values that give the best test scores. This differs from what would happen with new data: to be fair, at no point in the training procedure are we allowed to look at the test labels. Therefore, one needs to use <b> nested cross validation </b> to estimate the generalization error, so that no information leaks between the parameter optimization and the cross-validation procedure.
<br>

- Technically, standardizing/normalizing data using the entire learning set introduces leakage between train and test set (the test set "knows" about the mean and standard deviation of the entire data set). The effect is usually not dramatic, but the correct procedure is to standardize within each CV fold (i.e., after splitting into train and test), fitting the transformation on the train set only and then applying the same transformation to the test set. The model then becomes a pipeline.
<br>

- Technically, doing feature selection using the entire learning set introduces leakage between train and test set (the model "picks" features that give the best results on the test set). A possible solution is to pick the "average" best features within a cross-validated model.  
<br>

- Alternatively, one can use unsupervised methods, for example picking the features with the largest variance. This is fine to do on the entire learning set because it doesn't involve labels, but it does not select features that are relevant for a specific supervised problem.
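
The first three points above can be combined in a single sketch. This is a minimal illustration under stated assumptions, not the procedure used in the book: the data are synthetic (`make_regression`), the step names and the parameter grid are hypothetical, and `SelectKBest` with `f_regression` stands in for whichever supervised feature selection one prefers. Scaling and feature selection live inside a `Pipeline`, so they are re-fit on the training portion of every fold. `GridSearchCV` is the inner loop that picks the parameters, and `cross_val_score` is the outer loop that estimates the generalization error.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for illustration)
X_demo, y_demo = make_regression(n_samples=200, n_features=6, n_informative=4,
                                 noise=5.0, random_state=0)

# Every step is fit on the training part of each fold only
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_regression)),
    ("model", RandomForestRegressor(n_estimators=20, random_state=0)),
])

# Hypothetical, deliberately small grid; max_features never exceeds the smallest k
param_grid = {"select__k": [3, 6], "model__max_features": [2, 3]}

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)   # parameter optimization
outer_cv = KFold(n_splits=3, shuffle=True, random_state=1)   # generalization error

search = GridSearchCV(pipe, param_grid, cv=inner_cv)
nested_scores = cross_val_score(search, X_demo, y_demo, cv=outer_cv)

print("Nested CV R^2 scores:", nested_scores)
print("Mean and standard deviation:", nested_scores.mean(), nested_scores.std())
```

The outer scores are computed on data that neither the scaler, the feature selector, nor the grid search has seen, which is what makes them a fair estimate of performance on new data.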