# 4-part Practical Study Guide To Sklearn Feature Selection
## Give me less than an hour to teach you a robust Feature Selection workflow
![](./images/unsplash.jpg)
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://unsplash.com/@aaronburden?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText'>Aaron Burden</a>
        on 
        <a href='https://unsplash.com/s/photos/study?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText'>Unsplash</a>
    </strong>
</figcaption>

### Introduction

Today, it is common for datasets to have hundreds if not thousands of features. On the surface, this might seem like a good thing — more features give more information about each sample. But more often than not, these additional features don’t provide much value and introduce complexity.

One of the biggest challenges in Machine Learning is to build models with robust predictive power while using as few features as possible. But given the massive sizes of today’s datasets, it is easy to lose track of which features are important and which ones aren’t.

That’s why feature selection is a skill of its own in the ML field. Feature selection is the process of choosing a subset of the most important features while trying to retain as much information as possible (an excerpt from the [first article](https://towardsdatascience.com/how-to-use-variance-thresholding-for-robust-feature-selection-a4503f2b5c3f) in this series).

Because feature selection is such a pressing issue, there is also a myriad of solutions you can *select* from🤦‍♂️🤦‍♂️. To spare you some pain, I will teach you 4 feature selection techniques that, when used together, can noticeably improve a model's performance.

In this article, I will give you an overview of these techniques and how to readily use them without wondering too much about the internals. For a deeper understanding, I have written separate posts for each with the nitty-gritty explained. Let's get started!

### Intro to the dataset and the problem statement

We will be working with the [Ansur Male](https://www.kaggle.com/seshadrikolluri/ansur-ii) dataset, which contains more than 100 different body measurements of US Army personnel. I have been using this dataset extensively throughout this feature selection series, mainly because it contains 98 numeric features, which makes it a perfect dataset to teach feature selection.

```python
import pandas as pd

ansur_numeric = pd.read_csv("data/ansur_male.csv", encoding="latin").select_dtypes(
    include="number"
)

ansur_numeric.iloc[:5, -8:]  # A few of the columns
```

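|   | weightkg | wristcircumference | wristheight | SubjectNumericRace | DODRace | Age | Heightin | Weightlbs |
|---|---|---|---|---|---|---|---|---|
| 0 | 815 | 175 | 853 | 1 | 1 | 41 | 71 | 180 |
| 1 | 726 | 167 | 815 | 1 | 1 | 35 | 68 | 160 |
| 2 | 929 | 180 | 831 | 2 | 2 | 42 | 68 | 205 |
| 3 | 794 | 176 | 793 | 1 | 1 | 31 | 66 | 175 |
| 4 | 946 | 188 | 954 | 2 | 2 | 21 | 77 | 213 |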


We will be trying to predict the weight in pounds, so it is a regression problem. Let's establish a base performance with simple Linear Regression. LR is a good candidate for this problem, because we can expect body measurements to be linearly correlated:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Build feature/target arrays
X, y = ansur_numeric.iloc[:, :-1], ansur_numeric.iloc[:, -1]

# Train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Init LR, fit/score
lr = LinearRegression()
_ = lr.fit(X_train, y_train)
lr.score(X_test, y_test)
```

0.9566669936078463

For the base performance, we got an impressive R-squared of about 0.957. However, this might be because the features also include a weight in kilograms column, which gives the algorithm nearly all it needs (we are trying to predict weight in pounds). So, let's try without that feature:

```python
# Train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Init LR, fit/score
lr = LinearRegression()
_ = lr.fit(X_train.drop("weightkg", axis=1), y_train)
lr.score(X_test.drop("weightkg", axis=1), y_test)
```

0.9457535711339792

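Now, we have about 0.946, a small drop in exchange for a slightly simpler model. Keep in mind that the two scores come from different random splits (no `random_state` is set), so the gap is only indicative. We are not dropping the column from `X` right now because we want our feature selectors to figure that out on their own. Let's see if they will!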
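If you want the with/without comparison to be scored on exactly the same rows, one option is to fix the split with `random_state` and reuse it for both models. Here is a minimal sketch on synthetic stand-in data, not the ANSUR dataset. The column names, coefficients and the `fit_and_score` helper are made up for illustration. In this toy case the leaked column is an exact unit conversion of the target, which exaggerates the effect compared to real data:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the ANSUR dataset): two noisy measurements
# plus a column that is just the target expressed in another unit.
rng = np.random.default_rng(0)
n = 500
height = rng.normal(175, 7, n)
waist = rng.normal(90, 10, n)
weight_lbs = 0.9 * height + 1.5 * waist - 110 + rng.normal(0, 8, n)

X_demo = pd.DataFrame(
    {"height": height, "waist": waist, "weight_kg": weight_lbs / 2.2046}
)
y_demo = pd.Series(weight_lbs, name="weight_lbs")

# Fixing random_state means both models are scored on identical rows
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)


def fit_and_score(columns):
    """Fit LinearRegression on the given columns and return the test R-squared."""
    model = LinearRegression().fit(X_tr[columns], y_tr)
    return model.score(X_te[columns], y_te)


score_all = fit_and_score(["height", "waist", "weight_kg"])
score_no_leak = fit_and_score(["height", "waist"])

print(f"with weight_kg:    {score_all:.4f}")
print(f"without weight_kg: {score_no_leak:.4f}")
```

Because both scores come from the same test rows, any difference between them is due to the dropped column rather than to a lucky or unlucky split.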

### Step I: Variance Thresholding

The first technique targets the individual properties of each feature. The idea behind Variance Thresholding (VT) is that features with low variance do not contribute much to overall predictions. These types of features have too few unique values or variances low enough not to matter. Sklearn's `VarianceThreshold` helps us remove them from a dataset.

One concern before applying VT is the scale of features. Variance depends on the units: multiplying a feature's values by a constant multiplies its variance by the square of that constant. Features measured on different scales therefore have variances that cannot be safely compared. So, we need to apply some form of normalization to bring all features to the same scale and then apply VT. Here is the code:

```python
from sklearn.feature_selection import VarianceThreshold

# Normalize data
normalized_df = X / X.mean()

# Init, fit VT
vt = VarianceThreshold(threshold=0.003)
_ = vt.fit(normalized_df)

# Get a boolean mask
mask = vt.get_support()

# Subset the data
X_reduced = X.loc[:, mask]
X_reduced.shape
```

(4082, 48)
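To see why the normalization step matters, here is a small sketch on hypothetical toy data (not taken from the ANSUR dataset). It stores the same measurement in two units and compares the variances before and after dividing by the mean:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the same measurement expressed in two units
rng = np.random.default_rng(0)
height_m = rng.normal(1.75, 0.07, 1000)
toy = pd.DataFrame({"height_m": height_m, "height_mm": height_m * 1000})

# Raw variances differ by a factor of about 1000**2, purely because of the unit
raw_var = toy.var()
ratio = raw_var["height_mm"] / raw_var["height_m"]
print(ratio)

# Dividing each column by its mean removes the unit, so the variances match
normalized_var = (toy / toy.mean()).var()
print(normalized_var)
```

The raw variances are about a million times apart even though both columns carry identical information. After mean normalization they agree, which is what makes a single threshold meaningful across features.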

After normalization (here, we are dividing each value by its feature's mean), you should choose a threshold, typically a small value between 0 and 1. Instead of using the `.transform()` method of the VT estimator, we are using `get_support()`, which gives a boolean mask (True values for features that should be kept). The mask can then be used to subset the data while preserving the column names.
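The difference between `.transform()` and `get_support()` is easy to check on a tiny example. The sketch below uses a made-up three-column frame, and assumes Sklearn's default output settings (no `set_output` configuration):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy frame: "constant" has zero variance and should be dropped
toy = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0],
        "constant": [5.0, 5.0, 5.0, 5.0],
        "b": [10.0, 30.0, 20.0, 40.0],
    }
)

vt_demo = VarianceThreshold(threshold=0.0).fit(toy)

# transform() returns a plain NumPy array by default, so column names are lost
as_array = vt_demo.transform(toy)
print(type(as_array), as_array.shape)

# get_support() returns one boolean per column; indexing with it keeps labels
mask_demo = vt_demo.get_support()
kept = toy.loc[:, mask_demo]
print(list(kept.columns))
```

Keeping the result as a DataFrame makes it much easier to see which features survived, which matters when several selectors are chained one after another.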

This may be a simple technique, but it can go a long way in eliminating useless features. For a deeper insight and more explanation of the code, you can head over to this article:

https://towardsdatascience.com/how-to-use-variance-thresholding-for-robust-feature-selection-a4503f2b5c3f