# Powerful Feature Selection with Recursive Feature Elimination (RFE) of Sklearn
## Feature selection based on single model performance
<img src='images/unsplash.jpg'></img>
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://unsplash.com/@victoriano?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText'>Victoriano Izquierdo</a>
        on 
        <a href='https://unsplash.com/s/photos/selection?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText'>Unsplash</a>
    </strong>
</figcaption>

### Setup

In [1]:
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

warnings.filterwarnings("ignore")

import datetime
import time

The basic methods of feature selection are mostly about individual properties of features and how they interact with each other. [*Variance thresholding*](https://towardsdatascience.com/how-to-use-variance-thresholding-for-robust-feature-selection-a4503f2b5c3f?source=your_stories_page-------------------------------------) and [*pairwise feature selection*](https://towardsdatascience.com/how-to-use-pairwise-correlation-for-robust-feature-selection-20a60ef7d10?source=your_stories_page-------------------------------------) are a few examples that remove unnecessary features based on the amount of variance and the correlation between features. However, a more pragmatic approach would select features based on how they affect a particular model's performance. One such technique offered by Sklearn is Recursive Feature Elimination (RFE). It reduces model complexity by removing features one by one until the desired number of features are left.

### The idea behind Recursive Feature Elimination

Consider this subset of [Ansur Male dataset](https://www.kaggle.com/seshadrikolluri/ansur-ii):

In [6]:
ansur = pd.read_csv("data/ansur_male.csv", encoding="latin").select_dtypes(
    include="number"
)
ansur.iloc[:, -7:].head()

Unnamed: 0,wristcircumference,wristheight,SubjectNumericRace,DODRace,Age,Heightin,Weightlbs
0,175,853,1,1,41,71,180
1,167,815,1,1,35,68,160
2,180,831,2,2,42,68,205
3,176,793,1,1,31,66,175
4,188,954,2,2,21,77,213


It records more than 100 different types of body measurements of more than 6000 US Army Personnel. Our goal is to predict the weight in pounds using only the numeric features (there are 93) for simplicity. 

Let's establish a base performance with Random Forest Regressor. We will first build the feature and target arrays and divide them into train and test sets. Then, we will fit the estimator and score its performance using R-squared:

In [52]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Feature, target arrays
X, y = ansur.iloc[:, :-1], ansur.iloc[:, -1]

# Train/test set generation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1121218
)

# Scale train and test sets with StandardScaler
X_train_std = StandardScaler().fit_transform(X_train)
X_test_std = StandardScaler().fit_transform(X_test)

# Fix the dimensions of the target array
y_train = y_train.values.reshape(-1, 1)
y_test = y_test.values.reshape(-1, 1)

# Init, fit, test Lasso Regressor
forest = RandomForestRegressor()
_ = forest.fit(X_train_std, y_train)
forest.score(X_test_std, y_test)

0.9485677708139042

We achieved a really good R-squared of 0.948. We were able to do this using all 98 features which is much more than we might need. All Sklearn estimators have special attributes that show feature weights (or coefficients), either given as `coef_` or `.feature_importances_`. Let's see the computed coefficients for our Random Forest Regressor model:

In [58]:
pd.DataFrame(
    zip(X_train.columns, abs(forest.feature_importances_)),
    columns=["feature", "weight"],
).sort_values("weight").reset_index(drop=True)

Unnamed: 0,feature,weight
0,Heightin,0.000097
1,suprasternaleheight,0.000170
2,crotchheight,0.000180
3,DODRace,0.000182
4,cervicaleheight,0.000212
...,...,...
93,forearmforearmbreadth,0.001464
94,elbowrestheight,0.001519
95,forearmcircumferenceflexed,0.003284
96,bideltoidbreadth,0.005946


To reduce model complexity, always start by removing features with close to 0 weights. Since all weights are multiplied by the values of features, such small weights contribute very little to the overall predictions. Looking at the above weights, we can see that many weights are close to 0.

We could set some low threshold and filter out features based on it. But we have to remember that even removing a single feature forces other coefficients to change. So, we have to eliminate them step-by-step, leaving out lowest weighted feature by sorting the fitted models coefficients. Doing this manually for 98 features would be cumbersome, but thankfully Sklearn provides us with Recursive Feature Elimination - [RFE class](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html) to do the task.

### Sklearn Recursive Feature Elimination Class

RFE is a transformer estimator which means it follows the familiar fit/transform pattern of Sklearn. It is a popular algorithm due to its easy configurable nature and robust performance. As the name suggests, it removes features one at a time based on the weights given by a model of our choice in each iteration. 

Below, you will see an example of RFE using the above Random Forest Regressor model:

In [59]:
from sklearn.feature_selection import RFE

# Init the transformer
rfe = RFE(estimator=RandomForestRegressor(), n_features_to_select=10)

# Fit to the training data
_ = rfe.fit(X_train_std, y_train)

After fitting the estimator, it has a `.support_` attribute that gives a boolean mask with False values for discarded features. We can use it to subset our data:

In [65]:
X_train.loc[:, rfe.support_]

Unnamed: 0,bideltoidbreadth,bizygomaticbreadth,chestcircumference,forearmcircumferenceflexed,forearmforearmbreadth,handcircumference,neckcircumferencebase,span,weightkg,wristheight
2132,477,146,974,307,527,196,421,1756,712,768
3107,518,132,1172,295,616,201,450,1790,917,852
3111,506,138,1146,288,611,208,410,1760,858,845
3975,496,136,1053,341,552,212,448,1816,952,885
2921,445,131,946,269,512,208,387,1737,673,894
...,...,...,...,...,...,...,...,...,...,...
619,512,150,1067,286,602,196,403,1706,750,759
588,523,143,1049,311,608,213,453,1819,878,858
3467,538,137,1076,312,584,205,447,1880,892,930
1442,521,144,1066,316,572,207,423,1807,916,874


Or you can directly call `.transform()` to get a new `numpy` array with the relevant features. Let's use this smaller subset to test Random Forest Regressor once again:

In [68]:
# Init, fit, score
forest = RandomForestRegressor()
_ = forest.fit(rfe.transform(X_train_std), y_train)
forest.score(rfe.transform(X_test_std), y_test)

0.94878355877858

Even after dropping almost 90 features, we got the same score which is very impressive!

### RFE Performance Considerations

Since RFE trains the given model on the full dataset every time it drops a feature, the computation time will be heavy for large datasets with many features like ours. To control this behavior, RFE provides `step` parameter which lets use drop arbitrary number of features in each iteration instead of one:

In [69]:
# Init the transformer
rfe = RFE(estimator=RandomForestRegressor(), n_features_to_select=10, step=10)
_ = rfe.fit(X_train_std, y_train)

In [70]:
X_train.columns[rfe.support_]

Index(['bideltoidbreadth', 'elbowrestheight', 'forearmcircumferenceflexed',
       'forearmforearmbreadth', 'handcircumference', 'handlength',
       'neckcircumference', 'span', 'weightkg', 'wristheight'],
      dtype='object')

### Choosing the number of features to keep automatically

The most important hyperparameters of RFE are *estimator* and *n_features_to_select*. In the last example, we arbitrarily chose 10 features and hoped for the best. However, as RFE can be wrapped around any model, we have to choose the number of relevant features based on their performance.

To achieve this, Sklearn provides a similar `RFECV` class which implements Recursive Feature Elimination with cross-validation and automatically finds the optimal number of features to keep. Below is an example that uses RFECV around a simple Linear Regression. We will be choosing Linear regression because we can guess that body measurements will be linear correlated. Besides, combined with cross-validation, Random Forest Regressor will become more computationally expensive:

In [83]:
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Init, fit
rfecv = RFECV(
    estimator=LinearRegression(),
    min_features_to_select=5,
    step=5,
    n_jobs=-1,
    scoring="r2",
    cv=5,
)

_ = rfecv.fit(X_train_std, y_train)

I provided the default values to `cv` and `scoring` parameters. A new hyperparameter is `min_features_to_select` - you can probably guess what it does from the name. Let's see how many features the estimator computed to keep:

In [84]:
X_train.columns[rfecv.support_]

Index(['bideltoidbreadth', 'shouldercircumference', 'tibialheight',
       'waistheightomphalion', 'weightkg'],
      dtype='object')

`RFECV` tells us to keep only 5 out of 98. Let's train the model only on those 5 and look at its performance:

In [88]:
lr = LinearRegression()
_ = lr.fit(X_train_std, y_train)
print("Trainign R-sqaured:", lr.score(X_train_std, y_train))
print("Testing R-squared:",lr.score(X_test_std, y_test))

Trainign R-sqaured: 0.9380873069456624
Testing R-squared: 0.9565258599872342


Even after dropping 93 features, we still got an impressive score of 0.956. 

### Summary

By reading this tutorial, you learned:
- the idea behind Recursive Feature Elimination is
- how to use the implementation of the algorithm using Sklearn RFE class
- how to decide the number of features to keep automatically using RFECV class

If you want a deeper look at the algorithm, you can read this [post](https://machinelearningmastery.com/rfe-feature-selection-in-python/).

### Further reading on feature selection:
- [How to Use Variance Thresholding For Robust Feature Selection]()
- [How to Use Pairwise Correlation For Robust Feature Selection]()
- [Recursive Feature Elimination (RFE) Sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html)
- [RFECV Sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html)

### You might also be interested:
- Article 1
- Article 2