#### Boruta SHAP: A Tool for Feature Selection Every Data Scientist Should Know

link : https://towardsdatascience.com/boruta-shap-an-amazing-tool-for-feature-selection-every-data-scientist-should-know-33a5f01285c0

##### Shadow Features 

In [1]:
import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np

In [3]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

In [6]:
#Fetch the data 
dataset=load_diabetes(as_frame=True)
#Gets the independent variables 
X=dataset['data']
# Gets the dependent variable (the target)
y=dataset['target']
# Splits the dataset
X_train, X_test, y_train, y_test=train_test_split(X, y, 
                                                  test_size=0.2)

In order to use Boruta we need to define an estimator, which will be used to estimate the feature importances. In this case I chose the RandomForestRegressor:

In [7]:
from sklearn.ensemble import RandomForestRegressor
# Defines the estimator used by the Boruta algorithm
estimator=RandomForestRegressor()

Now we can create the BorutaPy object and fit it to the data using the estimator:

In [8]:
from boruta import BorutaPy
# create the BorutaPy object 
boruta=BorutaPy(estimator=estimator, n_estimators='auto',
                max_iter=100)
# Fits Boruta
boruta.fit(np.array(X_train), np.array(y_train))

BorutaPy(estimator=RandomForestRegressor(n_estimators=28,
                                         random_state=RandomState(MT19937) at 0x7FA37E1EDB40),
         n_estimators='auto',
         random_state=RandomState(MT19937) at 0x7FA37E1EDB40)

Finally, we can see which features are important, which are unimportant, and which are uncertain:

In [9]:
# Important features
important =list(X.columns[boruta.support_])
print(f"Features confirmed as important :{important}")

# Tentative features 
tentative=list(X.columns[boruta.support_weak_])
print(f"Unconfirmed features (tentative): {tentative}")

#Unimportant features 
unimportant = list(X.columns[~(boruta.support_ | boruta.support_weak_)])
print(f"Features confirmed as unimportant: {unimportant}")

Features confirmed as important :['bmi', 'bp', 's5']
Unconfirmed features (tentative): ['s6']
Features confirmed as unimportant: ['age', 'sex', 's1', 's2', 's3', 's4']
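
The section title refers to shadow features, the idea Boruta is built on: each real column is compared against a shuffled copy of the data that cannot carry any signal. The block below is a minimal sketch of a single round of that comparison, not the BorutaPy implementation. The names `X_shadow`, `threshold` and `hits`, the fixed seed and the 100 trees are choices made for this illustration. The real algorithm repeats the round many times and applies a statistical test to the accumulated hits before confirming or rejecting a feature.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)  # fixed seed, chosen only for this illustration
X, y = load_diabetes(return_X_y=True, as_frame=True)

# Shadow features: every column is shuffled independently, which keeps its
# distribution but destroys any relationship with the target
X_shadow = pd.DataFrame(
    {f"shadow_{name}": rng.permutation(X[name].to_numpy()) for name in X.columns},
    index=X.index,
)

# Fit the estimator on the real and shadow features together
X_both = pd.concat([X, X_shadow], axis=1)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_both, y)
importances = pd.Series(forest.feature_importances_, index=X_both.columns)

# A real feature scores a "hit" when it beats the best shadow feature
threshold = importances[X_shadow.columns].max()
hits = importances[X.columns] > threshold
print("Hits in this single round:", hits[hits].index.tolist())
```

One round on its own is noisy, which is why the number of iterations (`max_iter` in the BorutaPy call above) matters.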


##### Boruta SHAP Feature Selection

Boruta is a robust method for feature selection, but it relies heavily on the feature importances computed by the estimator, which may be biased or a poor fit for the data.

This is where SHAP [3] joins the team. By using SHAP values as the feature importance measure in Boruta, we get the Boruta SHAP feature selection algorithm. This approach combines the strong additive feature explanations of SHAP with the robustness of the Boruta algorithm, so that only significant variables remain in the set.
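
To make "additive" concrete: the contributions assigned to the features of one observation sum to the difference between the model's prediction for that observation and a baseline prediction. The block below is a brute-force sketch of that property and does not use the shap library or reproduce how it computes values. The linear model, the four columns, the 50-row `background` sample and the helper names are all assumptions made to keep the enumeration small.

```python
from itertools import combinations
from math import factorial

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

# Keep four columns so that enumerating every subset stays cheap
X, y = load_diabetes(return_X_y=True)
X = X[:, [2, 3, 5, 8]]  # bmi, bp, s2, s5
model = LinearRegression().fit(X, y)
background = X[:50]  # reference rows that stand in for "missing" features


def coalition_value(x, subset):
    """Mean prediction when only the features in `subset` take the values of x."""
    idx = np.asarray(subset, dtype=int)
    data = background.copy()
    data[:, idx] = x[idx]
    return model.predict(data).mean()


def shapley_values(x):
    """Exact Shapley values by enumerating all coalitions (small n only)."""
    n = x.shape[0]
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                gain = coalition_value(x, subset + (i,)) - coalition_value(x, subset)
                phi[i] += weight * gain
    return phi


x = X[100]
phi = shapley_values(x)
baseline = model.predict(background).mean()
prediction = model.predict(x.reshape(1, -1))[0]

# Additivity: the contributions account for the whole gap to the baseline
print(np.isclose(phi.sum(), prediction - baseline))
```

Averaging the absolute contributions over many rows gives one importance score per feature, which is the kind of score Boruta SHAP compares against the shadow features.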

First we need to create a BorutaShap object. The default value for importance_measure is “shap” since we want to use SHAP as the feature importance discriminator. We can change the classification parameter to True when the problem is a classification one.

In [15]:
from BorutaShap import BorutaShap

In [16]:
# Creates a BorutaShap Selector for regression 
selector=BorutaShap(importance_measure='shap', classification=False)

Then we fit the BorutaShap selector to the data or to a sample of it. The n_trials parameter sets the number of iterations of the Boruta algorithm, while the sample boolean determines whether the method internally samples the data to speed up the process.

In [17]:
# Fits the selector 
selector.fit(X=X_train, y=y_train, n_trials=100, 
            sample=False, verbose=True)
# n_trials: number of iterations of the Boruta algorithm
# sample: if True, samples the data internally so the fit runs faster

4 attributes confirmed important: ['bp', 's2', 's5', 'bmi']
6 attributes confirmed unimportant: ['age', 's6', 's4', 's1', 's3', 'sex']
0 tentative attributes remains: []


Finally we can see which features will be removed and drop them from our data:

In [18]:
# Display features to be removed 
features_to_remove=selector.features_to_remove
print(features_to_remove)

['age' 'sex' 's1' 's3' 's4' 's6']


In [19]:
# Remove them 
X_train_boruta_shap=X_train.drop(columns=features_to_remove)
X_test_boruta_shap=X_test.drop(columns=features_to_remove)

In [20]:
X_test_boruta_shap

|     | bmi | bp | s2 | s5 |
|-----|-----|----|----|----|
| 309 | 0.001339 | -0.002228 | 0.070084 | 0.026714 |
| 403 | 0.097264 | -0.005671 | -0.023861 | 0.061686 |
| 387 | 0.015350 | -0.074528 | -0.017284 | -0.104365 |
| 257 | -0.055785 | 0.025315 | -0.023547 | -0.005145 |
| 75 | -0.030996 | -0.026328 | -0.001001 | 0.006209 |
| ... | ... | ... | ... | ... |
| 307 | -0.030996 | 0.004658 | 0.035638 | 0.023375 |
| 34 | -0.063330 | -0.057314 | -0.048912 | -0.059473 |
| 222 | -0.025607 | 0.042530 | -0.047660 | 0.001144 |
| 411 | 0.058463 | -0.043542 | -0.072399 | -0.051401 |
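
A natural next step after dropping the columns is to check whether the reduced feature set holds up on the test split. The block below is a sketch of one way to do that and is not part of the original notebook. The removed columns are copied from the run shown above and may differ between runs. The helper `holdout_r2`, the seeds and the 200 trees are assumptions made for this illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Columns reported for removal in the run above; another run may differ
features_to_remove = ["age", "sex", "s1", "s3", "s4", "s6"]
X_train_sel = X_train.drop(columns=features_to_remove)
X_test_sel = X_test.drop(columns=features_to_remove)


def holdout_r2(train, test):
    """Fit a random forest on `train` and return its R^2 on `test`."""
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    return model.fit(train, y_train).score(test, y_test)


score_all = holdout_r2(X_train, X_test)
score_selected = holdout_r2(X_train_sel, X_test_sel)
print(f"R^2 with all 10 features:    {score_all:.3f}")
print(f"R^2 with the 4 selected:     {score_selected:.3f}")
```

A single split gives only a rough comparison. Cross-validation would give a more stable picture of whether the smaller feature set costs any accuracy.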


##### Conclusion 

Feature selection is an important part of our ML pipelines, so it pays to use algorithms that give reliable results.

A downside of this method is its run time, which can become long with many Boruta iterations or when SHAP values are computed over many observations. Beware of the time!

With that in mind, Boruta SHAP is a strong method for selecting the most important features in our machine learning pipelines.