# 4-part Practical Study Guide To Sklearn Feature Selection
## Give me less than an hour to teach you a robust Feature Selection workflow
![](./images/unsplash.jpg)
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://unsplash.com/@aaronburden?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText'>Aaron Burden</a>
        on 
        <a href='https://unsplash.com/s/photos/study?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText'>Unsplash</a>
    </strong>
</figcaption>

In [168]:
import warnings

warnings.filterwarnings("ignore")

### Introduction

Today, it is common for datasets to have hundreds if not thousands of features. On the surface, this might seem like a good thing — more features give more information about each sample. But more often than not, these additional features don’t provide much value and introduce complexity.

The biggest challenge of Machine Learning is to create models that have robust predictive power by using as few features as possible. But given the massive sizes of today’s datasets, it is easy to lose track of which features are important and which ones aren’t.

That’s why there is an entire skill to be learned in the ML field — feature selection. Feature selection is the process of choosing a subset of the most important features while trying to retain as much information as possible (An excerpt from the [first article](https://towardsdatascience.com/how-to-use-variance-thresholding-for-robust-feature-selection-a4503f2b5c3f) in this series).

As feature selection is such a pressing issue, there are myriad solutions you can *select* from🤦‍♂️🤦‍♂️. To spare you some pain, I will teach you three feature selection techniques that, when used together, can improve a model's performance while cutting its complexity.

In this article, I will give you an overview of these techniques and how to readily use them without wondering too much about the internals. For a deeper understanding, I have written separate posts for each with the nitty-gritty explained. Let's get started!

### Intro to the dataset and the problem statement

We will be working with the [Ansur Male](https://www.kaggle.com/seshadrikolluri/ansur-ii) dataset, which contains more than 100 different body measurements of US Army Personnel. I have been using this dataset extensively throughout this feature selection series, mainly because it contains 98 numeric features - a perfect dataset to teach feature selection.

In [169]:
import pandas as pd

ansur_numeric = pd.read_csv("data/ansur_male.csv", encoding="latin").select_dtypes(
    include="number"
)

ansur_numeric.iloc[:5, -8:]  ## A few of the columns

Unnamed: 0,weightkg,wristcircumference,wristheight,SubjectNumericRace,DODRace,Age,Heightin,Weightlbs
0,815,175,853,1,1,41,71,180
1,726,167,815,1,1,35,68,160
2,929,180,831,2,2,42,68,205
3,794,176,793,1,1,31,66,175
4,946,188,954,2,2,21,77,213


We will be trying to predict the weight in pounds, so it is a regression problem. Let's establish a base performance with simple Linear Regression. LR is a good candidate for this problem, because we can expect body measurements to be linearly correlated:

In [170]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Build feature/target arrays
X, y = ansur_numeric.iloc[:, :-1], ansur_numeric.iloc[:, -1]

# Train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1121218
)

# Init LR, fit/score
lr = LinearRegression()
_ = lr.fit(X_train, y_train)
lr.score(X_test, y_test)

0.9566669936078463

For the base performance, we got an impressive R-squared of 0.957. However, this might be because there is also a weight in kilograms column among the features, giving the algorithm all it needs (we are trying to predict weight in pounds). So, let's try without that feature:

In [171]:
# Drop weightkg (reassign instead of modifying a slice of ansur_numeric in place)
X = X.drop(columns="weightkg")

# Train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1121218
)

# Init LR, fit/score
lr = LinearRegression()
_ = lr.fit(X_train, y_train)
lr.score(X_test, y_test)

0.9468586903272265

Now, we have 0.947, only slightly lower, and the model no longer relies on a feature that is just the target in different units.

### Step I: Variance Thresholding

The first technique targets the individual properties of each feature. The idea behind Variance Thresholding is that features with low variance do not contribute much to overall predictions. These types of features have too few unique values or variances low enough that they carry little information. VT helps us remove them from a dataset using Sklearn.

One concern before applying VT is the scale of features. Variance depends on the units of a feature: multiplying every value by 10 multiplies the variance by 100, so variance grows with the square of the scale. Features measured on different scales therefore cannot be compared safely by their raw variances. So, we need to apply some form of normalization to bring all features to the same scale and then apply VT. Here is the code:

In [172]:
from sklearn.feature_selection import VarianceThreshold

# Normalize data
normalized_df = X / X.mean()

# Init, fit VT
vt = VarianceThreshold(threshold=0.003)
_ = vt.fit(normalized_df)

# Get a boolean mask
mask = vt.get_support()

# Subset the data
X_reduced = X.loc[:, mask]
X_reduced.shape

(4082, 47)

After normalization (here, we are dividing each value by its feature's mean), you should choose a threshold between 0 and 1. Instead of using the `.transform()` method of the VT estimator, we are using `get_support()`, which gives a boolean mask (True values for features that should be kept). Then, it can be used to subset the data while preserving the column names.
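To see why the normalization step matters, here is a minimal sketch on made-up data. The `toy` DataFrame, its column names, and the threshold of 0.001 are all hypothetical and are not part of the Ansur dataset. It stores the same measurement in two units next to a nearly constant column:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)

# Hypothetical toy data: one measurement in two units, plus a near-constant column
height_cm = rng.normal(loc=175, scale=7, size=1_000)
toy = pd.DataFrame(
    {
        "height_cm": height_cm,
        "height_mm": height_cm * 10,  # same information, 10x the scale
        "almost_constant": rng.normal(loc=50, scale=0.5, size=1_000),
    }
)

# Raw variances: multiplying the values by 10 multiplies the variance by 100
raw_var = toy.var()

# After dividing by the mean, both height columns have the same variance
normalized_var = (toy / toy.mean()).var()

# VarianceThreshold keeps features whose variance is above the threshold
vt = VarianceThreshold(threshold=0.001).fit(toy / toy.mean())
kept = toy.columns[vt.get_support()].tolist()

print(raw_var, normalized_var, kept, sep="\n\n")
```

On the raw data, `height_mm` looks a hundred times more variable than `height_cm` even though it holds identical information. After mean normalization the two agree, and only the near-constant column falls below the threshold. Note that dividing by the mean assumes every feature has a non-zero mean, which holds for positive body measurements.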

This may be a simple technique, but it can go a long way in eliminating useless features. For a deeper insight and more explanation of the code, you can head over to this article:

https://towardsdatascience.com/how-to-use-variance-thresholding-for-robust-feature-selection-a4503f2b5c3f

### Step II: Pairwise Correlation

We will further trim our dataset by focusing on the relationships between features. One of the best metrics that show a *linear* connection is Pearson's correlation coefficient (denoted *r*). The logic behind using *r* for feature selection is simple. If the correlation between features A and B is 0.9, a linear function of A explains about 81% (*r*² = 0.9² = 0.81) of the variance in B. In other words, the two features carry mostly the same information, so in a dataset where A is present you can discard B or vice versa.
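The link between *r* and explained variance can be checked numerically. The sketch below uses synthetic arrays `a` and `b` (hypothetical, built so that their correlation is close to 0.9) and compares *r*² with the R-squared of a simple linear fit of `b` on `a`:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Hypothetical features: b is built to have a correlation of about 0.9 with a
a = rng.normal(size=n)
b = 0.9 * a + np.sqrt(1 - 0.9**2) * rng.normal(size=n)

# Pearson's correlation coefficient
r = np.corrcoef(a, b)[0, 1]

# Fit b from a with a straight line and measure the share of variance explained
slope, intercept = np.polyfit(a, b, deg=1)
residuals = b - (slope * a + intercept)
r_squared = 1 - residuals.var() / b.var()

print(f"r = {r:.3f}, r^2 = {r**2:.3f}, R-squared of the fit = {r_squared:.3f}")
```

For a straight-line fit with an intercept, the R-squared of the fit equals *r*², which is why a correlation of 0.9 corresponds to roughly 81% of the variance being shared rather than 90%.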

![](./images/1.png)
<figcaption style="text-align: center;">
    <strong>
        Photo by author
    </strong>
</figcaption>

There isn't an Sklearn estimator that implements feature selection based on correlation. So, we will do it on our own:

In [173]:
import numpy as np


def identify_correlated(df, threshold):
    """
    A function to identify highly correlated features.
    """
    # Compute correlation matrix with absolute values
    matrix = df.corr().abs()

    # Create a boolean mask
    mask = np.triu(np.ones_like(matrix, dtype=bool))

    # Subset the matrix
    reduced_matrix = matrix.mask(mask)

    # Find cols that meet the threshold
    to_drop = [c for c in reduced_matrix.columns if any(reduced_matrix[c] > threshold)]

    return to_drop

This function is a shorthand that returns the names of columns that should be dropped based on a custom correlation threshold. Usually, the threshold will be over 0.80 to be safe. 

In the function, we first create a correlation matrix using `.corr()`. Next, we create a boolean mask to only include correlations that are below the diagonal of the correlation matrix. We use this mask to subset the matrix. Finally, in a list comprehension, we find the names of features that should be dropped and return them.
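As a quick sanity check, the function can be run on a tiny made-up DataFrame where the answer is known in advance. The `toy` data and its column names are hypothetical, and the function is repeated here only to keep the snippet self-contained:

```python
import numpy as np
import pandas as pd


def identify_correlated(df, threshold):
    """Return the names of columns that are highly correlated with a later column."""
    matrix = df.corr().abs()
    # True on and above the diagonal, so only the lower triangle survives
    mask = np.triu(np.ones_like(matrix, dtype=bool))
    reduced_matrix = matrix.mask(mask)
    return [c for c in reduced_matrix.columns if any(reduced_matrix[c] > threshold)]


rng = np.random.default_rng(1)
base = rng.normal(size=500)

toy = pd.DataFrame(
    {
        "a": base,
        "b": base + rng.normal(scale=0.05, size=500),  # near-copy of "a"
        "c": rng.normal(size=500),  # unrelated to the others
    }
)

to_drop = identify_correlated(toy, threshold=0.9)
print(to_drop)
```

Only one column of the correlated pair is flagged, so its partner stays in the dataset. Because each column is compared with the rows below the diagonal, the flagged column is the one that appears earlier in the DataFrame (`"a"` in this toy example).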

There is a lot I didn't explain about the code. Even though this function works perfectly well, I suggest reading my separate article on feature selection based on correlation coefficient. I fully explained the concept of correlation and how it is different from causation. There is a separate section on plotting the perfect correlation matrix as a heatmap and of course, the explanation of the above function.

For our dataset, we will choose a threshold of 0.9:

In [174]:
to_drop = identify_correlated(X_reduced, threshold=0.9)
len(to_drop)

12

The function tells us to drop 12 features:

In [175]:
X_reduced = X_reduced.drop(columns=to_drop)

In [176]:
X_reduced.shape

(4082, 35)

Now, only 35 features remain.

### Step III: Recursive Feature Elimination with Cross Validation (RFECV)

Finally, we will choose the final set of features based on how they affect model performance. Most of the Sklearn models have either `.coef_` (linear models) or `.feature_importances_` (tree-based and ensemble models) attributes that show the importance of each feature. For example, let's fit the Linear Regression model to the current set of features and see the computed coefficients:

In [177]:
# New train/test sets
X_train, X_test, y_train, y_test = train_test_split(X_reduced, y, test_size=0.3)

# Init/fit LR
lr = LinearRegression()
_ = lr.fit(X_train, y_train)

# See the computed coefficients
pd.DataFrame(
    zip(X_train.columns, abs(lr.coef_)),
    columns=["feature", "coefficient"],
).sort_values("coefficient").reset_index(drop=True).head(10)

Unnamed: 0,feature,coefficient
0,subjectid,0.000107
1,SubjectNumericRace,0.000336
2,forearmforearmbreadth,0.007399
3,buttockdepth,0.008216
4,interscyeii,0.009153
5,crotchlengthposterioromphalion,0.011324
6,lateralmalleolusheight,0.014132
7,elbowrestheight,0.015555
8,DODRace,0.020661
9,lowerthighcircumference,0.029212


The above DataFrame shows the features with the smallest coefficients. The smaller the weight or the coefficient of a feature is, the less it contributes to the model's predictive power (this comparison is most meaningful when the features are on similar scales). With this idea in mind, Recursive Feature Elimination repeatedly fits the model and removes the least important features, a fixed number at a time, while cross-validation decides how many features to keep.
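To make the idea concrete, here is a simplified, hand-rolled version of the elimination loop on synthetic data. It is a sketch for illustration only and not how Sklearn implements it internally: the dataset from `make_regression`, the names `remaining` and `history`, and the choice to drop one feature per round are all assumptions of this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: 10 features, only 3 of which actually drive the target
X_toy, y_toy = make_regression(
    n_samples=300, n_features=10, n_informative=3, noise=5.0, random_state=0
)

remaining = list(range(X_toy.shape[1]))  # indices of features still in play
history = []  # (feature indices, mean cross-validated R2) for each round

while True:
    X_sub = X_toy[:, remaining]

    # Score the current feature set with 3-fold cross-validation
    score = cross_val_score(
        LinearRegression(), X_sub, y_toy, cv=3, scoring="r2"
    ).mean()
    history.append((list(remaining), score))

    if len(remaining) == 1:
        break

    # Drop the feature with the smallest absolute coefficient
    coefs = np.abs(LinearRegression().fit(X_sub, y_toy).coef_)
    remaining.pop(int(np.argmin(coefs)))

# Keep the feature set with the best cross-validated score
best_features, best_score = max(history, key=lambda item: item[1])
print(len(best_features), round(best_score, 4))
```

The features generated by `make_regression` share the same scale, so comparing raw coefficients is reasonable here. On real data with mixed units, the coefficients would need to be made comparable first, for example by standardizing the features.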

Sklearn implements this technique under the RFECV class which takes an arbitrary estimator and a number of other arguments:

In [178]:
from sklearn.feature_selection import RFECV

# Init rfecv
rfecv = RFECV(
    estimator=LinearRegression(),
    scoring="r2",
    cv=3,
    n_jobs=-1,
    min_features_to_select=5,
    step=5,
)

# Fit
_ = rfecv.fit(X_reduced, y)
# Get a boolean mask for the features to keep
mask = rfecv.support_

After fitting the estimator to the data, we can get a boolean mask with True values encoding the features that should be kept. We can finally use it to subset the reduced data one last time:

In [179]:
X_final = X_reduced.loc[:, mask].copy()
X_final.shape

(4082, 30)

After applying RFECV, we managed to discard 5 more features. Let's evaluate a final GradientBoostingRegressor model on this feature selected dataset and see its performance:

In [180]:
from sklearn.ensemble import GradientBoostingRegressor

# New train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_final, y, test_size=0.3, random_state=1121218
)

# Init, fit, score GB
gb = GradientBoostingRegressor()
_ = gb.fit(X_train, y_train)
print("Training R2 score: {}".format(gb.score(X_train, y_train)))
print("Testing R2 score: {}".format(gb.score(X_test, y_test)))

Training R2 score: 0.9565125439275437
Testing R2 score: 0.9245605254616158


Even though we got a slight drop in performance, we managed to remove almost 70 features, reducing model complexity significantly.

In a separate article, I further discussed the `.coef_` and `.feature_importances_` attributes as well as extra details of what happens in each elimination round of RFE:

https://towardsdatascience.com/powerful-feature-selection-with-recursive-feature-elimination-rfe-of-sklearn-23efb2cdb54e

### Summary
Feature selection should not be taken lightly. While reducing model complexity, some algorithms can even see an increase in performance due to the lack of distracting features in the dataset. It is also not wise to rely on a single method. Instead, approach the problem from different angles and use various techniques.

Today, we saw how to apply feature selection to a dataset in three stages:
1. Based on the properties of each individual feature using Variance Thresholding.
2. Based on the relationships between features using Pairwise Correlation.
3. Based on how features affect a model's performance.

Using these techniques in succession should give you a solid starting point for most supervised problems you face.
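The three stages can be chained into a single helper. The function below is a minimal sketch rather than a definitive implementation: the name `select_features`, its default thresholds, and the toy data used to exercise it are all hypothetical, and it assumes numeric features with non-zero means.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFECV, VarianceThreshold
from sklearn.linear_model import LinearRegression


def select_features(X, y, var_threshold=0.003, corr_threshold=0.9, min_features=5):
    """Apply variance thresholding, correlation filtering, and RFECV in turn."""
    # Stage 1: drop low-variance features (after mean normalization)
    vt = VarianceThreshold(threshold=var_threshold).fit(X / X.mean())
    X = X.loc[:, vt.get_support()]

    # Stage 2: drop one feature from each highly correlated pair
    corr = X.corr().abs()
    lower = corr.mask(np.triu(np.ones_like(corr, dtype=bool)))
    to_drop = [c for c in lower.columns if (lower[c] > corr_threshold).any()]
    X = X.drop(columns=to_drop)

    # Stage 3: let RFECV pick the final subset by cross-validated R2
    rfecv = RFECV(
        estimator=LinearRegression(),
        scoring="r2",
        cv=3,
        min_features_to_select=min(min_features, X.shape[1]),
    )
    rfecv.fit(X, y)
    return X.loc[:, rfecv.support_]


# Hypothetical toy data to exercise the helper
rng = np.random.default_rng(0)
n = 400
signal = rng.normal(loc=100, scale=15, size=n)
toy = pd.DataFrame(
    {
        "signal": signal,
        "signal_copy": signal * 2 + rng.normal(scale=0.5, size=n),  # redundant
        "other": rng.normal(loc=50, scale=10, size=n),
        "flat": rng.normal(loc=200, scale=0.2, size=n),  # near-constant
    }
)
target = 3 * toy["signal"] + 2 * toy["other"] + rng.normal(scale=1.0, size=n)

selected = select_features(toy, target, min_features=1)
print(selected.columns.tolist())
```

In this toy run, the near-constant column is removed in the first stage, one of the two redundant columns in the second, and RFECV keeps the two remaining features because both are needed to predict the target. The thresholds are knobs to tune per dataset, not recommended values.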

### Further Reading on Feature Selection
- [Official guide to feature selection by Sklearn](https://scikit-learn.org/stable/modules/feature_selection.html)
- [Variance Threshold Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html)
- [RFECV Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html)
- [Deep guide to Correlation Coefficients](https://towardsdev.com/how-to-not-misunderstand-correlation-75ce9b0289e)

### You might be interested...
- [11 Times Faster Hyperparameter Tuning with HalvingGridSearch](https://towardsdatascience.com/11-times-faster-hyperparameter-tuning-with-halvinggridsearch-232ed0160155?source=your_stories_page-------------------------------------)
- [Beginner’s Guide to XGBoost for Classification Problems](https://towardsdatascience.com/beginners-guide-to-xgboost-for-classification-problems-50f75aac5390?source=your_stories_page-------------------------------------)
- [Weekly Awesome Tricks And Best Practices From Kaggle](https://towardsdev.com/tricks-and-best-practices-from-kaggle-794a5914480f?source=your_stories_page-------------------------------------)
- [Intro to Object-Oriented Programming For Data Scientists](https://towardsdev.com/intro-to-object-oriented-programming-for-data-scientists-9308e6b726a2?source=your_stories_page-------------------------------------)