# Bad-ass Feature Selection by Combining Multiple Models
## Your favorite models choosing features themselves
<img src='images/pexels.jpg'></img>
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://www.pexels.com/@olly?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Andrea Piacquadio</a>
        on 
        <a href='https://www.pexels.com/photo/crop-multiracial-people-joining-hands-together-during-break-in-modern-workplace-3931562/?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Pexels</a>
    </strong>
</figcaption>

### Setup

In [5]:
import time
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

warnings.filterwarnings("ignore")

In [15]:
tips = sns.load_dataset("tips")
tips.head()

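| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |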


In [49]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

X, y = tips.drop("total_bill", axis=1), tips.total_bill
for col in X.columns:
    X[col] = LabelEncoder().fit_transform(X[col])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=1121218, test_size=0.25
)

### Introduction

There are many feature selection methods in Machine Learning. Each one may give different results depending on how you use it, so it is hard to entirely trust a single method. Wouldn't it be cool to have multiple methods cast their own vote on whether we should keep a feature or not? It would be just like the Random Forest algorithm, which combines the predictions of multiple weak learners to form a strong one. Turns out, Sklearn has already given us the tools to build such a feature selector on our own.

Together, using those tools, we will build a feature selector that accepts an arbitrary number of Sklearn models. All these models will vote on which features we should keep, and we will make the decision by gathering the votes across models (democracy?).

### Prerequisite Knowledge to Build the Selector: Weights and Coefficients

Before we move on to building the selector, let's brush up on some of the topics required. Firstly, many Sklearn estimators that yield predictions have either a `.coef_` or a `.feature_importances_` attribute after being fitted to the training data.

The `.coef_` attribute mostly occurs in models from the `sklearn.linear_model` submodule:

In [23]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Init, fit
lr = LinearRegression()
_ = lr.fit(X_train, y_train)

lr.coef_

array([ 0.104,  1.193,  2.085,  0.19 , -2.142,  3.593])

As the name suggests, the above are *coefficients* calculated by fitting the line of best fit for Linear Regression. Other models follow a similar pattern and yield the coefficients of their internal equation:

In [25]:
from sklearn.linear_model import Lasso

lasso = Lasso()
_ = lasso.fit(X_train, y_train)

lasso.coef_

array([ 0.132,  0.   ,  0.   , -0.   , -0.   ,  2.137])

Models in `sklearn.tree` and `sklearn.ensemble` work differently: they compute the *importance* or *weight* of each feature and expose it under the `.feature_importances_` attribute:

In [26]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor()
gb = GradientBoostingRegressor()

for model in [dt, gb]:
    _ = model.fit(X_train, y_train)
    print(model.feature_importances_)

[0.401 0.056 0.064 0.075 0.034 0.371]
[0.515 0.018 0.071 0.017 0.012 0.368]


Unlike the coefficients of linear models, the weights add up to 1:

In [27]:
np.sum(dt.feature_importances_)

1.0

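Regardless of the model, a feature contributes less and less to the overall prediction as the magnitude (absolute value) of its weight or coefficient decreases. This means that we can drop features whose coefficients or weights are close to 0. Keep in mind that the coefficients of linear models depend on the scale of each feature, so they are only comparable when the features are on similar scales.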
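To see how much feature scale can affect the reading of coefficients, here is a minimal sketch on made-up data (the arrays `x_small`, `x_large` and the numbers behind them are assumptions for illustration, not part of the tips dataset). Both features contribute about equally to the target, yet the raw coefficient of the large-scale feature looks negligible until the features are standardized:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two made-up features that matter about equally, on very different scales
x_small = rng.normal(0, 1, 500)
x_large = rng.normal(0, 1000, 500)
X_demo = np.column_stack([x_small, x_large])
y_demo = 3 * x_small + 0.003 * x_large + rng.normal(0, 0.1, 500)

# Coefficients on the raw features
raw_coefs = LinearRegression().fit(X_demo, y_demo).coef_

# Coefficients after putting both features on the same scale
X_scaled = StandardScaler().fit_transform(X_demo)
scaled_coefs = LinearRegression().fit(X_scaled, y_demo).coef_

print(np.abs(raw_coefs))     # the second coefficient looks close to 0
print(np.abs(scaled_coefs))  # after scaling, both are of similar size
```

If your features live on very different scales, standardizing them before comparing coefficients is one way to avoid dropping a useful feature by mistake.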

### Brief Overview of RFECV

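Recursive Feature Elimination (RFE) is a popular feature selection algorithm that repeatedly fits a model and drops the least important features. Its cross-validated variant, `RFECV`, automatically finds the number of features to keep that gives the best cross-validated score for a given model. Below is a simple example: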

In [38]:
from sklearn.feature_selection import RFECV

# Init the estimator
rfecv = RFECV(
    estimator=Lasso(),
    cv=3,
    scoring="r2",
    n_jobs=-1,
    min_features_to_select=2,
)

# Fit
_ = rfecv.fit(X_train, y_train)

After fitting to the training data, the selector has a `.support_` attribute, which holds a boolean mask with True values for the features that should be kept:

In [39]:
rfecv.support_

array([ True, False, False, False, False,  True])

We can then use this mask to subset the original data:

```python
X.loc[:, rfecv.support_]
```
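Besides `.support_`, a fitted `RFECV` exposes a few other handy attributes. The sketch below uses synthetic data from `make_regression` (an assumption made so the example runs on its own) to show `.n_features_`, `.ranking_` and the `.transform()` method, which applies the mask for you:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

# Synthetic data: 10 features, only 3 of which drive the target
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, n_informative=3, noise=5.0, random_state=0
)

selector = RFECV(estimator=LinearRegression(), cv=3, scoring="r2")
_ = selector.fit(X_demo, y_demo)

print(selector.n_features_)  # how many features RFECV decided to keep
print(selector.ranking_)     # 1 marks a kept feature, larger values were dropped earlier

# transform() returns only the selected columns
X_selected = selector.transform(X_demo)
print(np.array_equal(X_selected, X_demo[:, selector.support_]))
```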

The core of our custom feature selector will be this `RFECV` class. I won't go into the details of how it works here, because my previous article focused solely on it. I recommend reading it before continuing:

https://towardsdatascience.com/powerful-feature-selection-with-recursive-feature-elimination-rfe-of-sklearn-23efb2cdb54e?source=your_stories_page-------------------------------------

### Part I: Choosing the models

We will be using the [Ansur Male](https://www.kaggle.com/seshadrikolluri/ansur-ii) dataset, mainly because it contains many numerical features (98) describing body measurements of US Army personnel. The survey covers roughly 6,000 people, and the male subset loaded below has 4,082 rows:

In [122]:
import pandas as pd

ansur = pd.read_csv("data/ansur_male.csv", encoding="latin").select_dtypes(
    include="number"
)
ansur.iloc[:5, -7:].head()

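| | wristcircumference | wristheight | SubjectNumericRace | DODRace | Age | Heightin | Weightlbs |
|---|---|---|---|---|---|---|---|
| 0 | 175 | 853 | 1 | 1 | 41 | 71 | 180 |
| 1 | 167 | 815 | 1 | 1 | 35 | 68 | 160 |
| 2 | 180 | 831 | 2 | 2 | 42 | 68 | 205 |
| 3 | 176 | 793 | 1 | 1 | 31 | 66 | 175 |
| 4 | 188 | 954 | 2 | 2 | 21 | 77 | 213 |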


We will be trying to predict weight in pounds, and to do that well we want to reduce model complexity, i.e. create a model with as much predictive power as possible using as few features as possible. Currently, there are 98 and we will be trying to decrease that number. We will also drop the column that records weight in kilograms, since it is the target expressed in another unit and would leak the answer.

In [123]:
ansur.drop("weightkg", axis=1, inplace=True)

Our first model will be a Lasso regressor, which we will plug into `RFECV`:

In [124]:
%%time

from sklearn.linear_model import Lasso

# Feature/target arrays
X, y = ansur.iloc[:, :-1], ansur.iloc[:, -1]

# Generate train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=1121218, test_size=0.25
)

# Init/fit
rfecv = RFECV(
    estimator=Lasso(), cv=5, scoring="r2", min_features_to_select=1, n_jobs=-1
)

_ = rfecv.fit(X_train, y_train)

lasso_mask = rfecv.support_
lasso_mask

Wall time: 36.8 s


array([False,  True,  True,  True,  True, False,  True, False, False,
        True, False,  True, False,  True, False,  True,  True,  True,
        True,  True,  True,  True, False, False,  True,  True, False,
       False,  True,  True, False, False, False, False, False, False,
        True, False,  True,  True, False,  True, False,  True, False,
        True,  True, False,  True,  True, False, False,  True, False,
        True, False,  True, False, False, False,  True,  True,  True,
        True,  True, False, False,  True,  True, False, False,  True,
        True, False, False,  True,  True, False,  True,  True,  True,
       False, False, False,  True, False, False,  True,  True, False,
       False,  True, False, False, False,  True, False])

We are storing the boolean mask generated from the Lasso in `lasso_mask`, and you are going to see why in a bit.

Next, we will do the same for two more models: Linear Regression and GradientBoostingRegressor:

In [125]:
%%time

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

# Gradient Boosting Regressor
rfecv = RFECV(
    estimator=GradientBoostingRegressor(),
    cv=3,
    scoring="r2",
    n_jobs=-1,
    min_features_to_select=1,
)
_ = rfecv.fit(X_train, y_train)

gb_mask = rfecv.support_

# Simple Linear Regression
rfecv = RFECV(
    estimator=LinearRegression(),
    cv=3,
    scoring="r2",
    n_jobs=-1,
    min_features_to_select=5,
)
_ = rfecv.fit(X_train, y_train)

lr_mask = rfecv.support_

Wall time: 5min 49s


### Part II: Combining the votes

Now, we have the votes as boolean masks in three arrays: `lasso_mask`, `gb_mask` and `lr_mask`. Since True and False behave as 1 and 0 in arithmetic, we can add the three arrays:

In [126]:
votes = np.sum([lasso_mask, gb_mask, lr_mask], axis=0)
votes

array([1, 3, 3, 3, 3, 2, 2, 2, 1, 3, 1, 3, 1, 3, 2, 3, 3, 3, 3, 3, 3, 3,
       1, 1, 3, 3, 1, 2, 3, 2, 2, 2, 2, 1, 0, 1, 3, 1, 3, 2, 2, 2, 2, 2,
       2, 3, 3, 1, 2, 3, 2, 1, 2, 0, 3, 1, 2, 2, 1, 2, 3, 3, 3, 3, 1, 2,
       1, 3, 3, 1, 1, 3, 2, 2, 1, 3, 3, 2, 3, 3, 2, 0, 2, 2, 3, 1, 1, 3,
       3, 0, 1, 3, 1, 0, 1, 3, 2])

The result is an array counting how many of the models chose each feature. Now, we can set a vote threshold to decide whether to keep a feature or not. This threshold depends on how conservative we want to be. We can set a strict threshold and require a feature to have been chosen by all 3 models, or we can use 1 as the threshold and keep any feature that at least one model selected. Here, we go with the strict option:

In [130]:
final_mask = votes == 3
final_mask

array([False,  True,  True,  True,  True, False, False, False, False,
        True, False,  True, False,  True, False,  True,  True,  True,
        True,  True,  True,  True, False, False,  True,  True, False,
       False,  True, False, False, False, False, False, False, False,
        True, False,  True, False, False, False, False, False, False,
        True,  True, False, False,  True, False, False, False, False,
        True, False, False, False, False, False,  True,  True,  True,
        True, False, False, False,  True,  True, False, False,  True,
       False, False, False,  True,  True, False,  True,  True, False,
       False, False, False,  True, False, False,  True,  True, False,
       False,  True, False, False, False,  True, False])

Now, `final_mask` is a boolean array with True values for the features that were chosen by all 3 estimators. We can use it to subset the original data:

In [141]:
X.loc[:, final_mask].shape

(4082, 39)

As you can see, the final votes chose 39 columns to keep out of the 97 candidate features in `X` (the length of the mask). You can use this subset of the dataset to create a less complex model. For example, we will choose a Linear Regression model because we can expect body measurements to be linearly related to weight:

In [138]:
# Create new train/test sets from the feature selected data
X_reduced = X.loc[:, final_mask]

X_train, X_test, y_train, y_test = train_test_split(
    X_reduced, y, random_state=1121218, test_size=0.25
)

# Fit/score
lr = LinearRegression()
_ = lr.fit(X_train, y_train)
lr.score(X_test, y_test)

0.9449234990234734
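If you are unsure which threshold to pick, it can help to check how many features each option keeps. Here is a minimal sketch with three made-up masks over 6 features (the `mask_a`, `mask_b` and `mask_c` arrays are hypothetical stand-ins for masks like `lasso_mask`, `gb_mask` and `lr_mask`):

```python
import numpy as np

# Made-up masks for 6 features from 3 hypothetical selectors
mask_a = np.array([True, True, False, True, False, False])
mask_b = np.array([True, False, False, True, True, False])
mask_c = np.array([True, True, False, False, False, False])

toy_votes = np.sum([mask_a, mask_b, mask_c], axis=0)
print(toy_votes)  # [3 2 0 2 1 0]

# Count the features kept under each threshold
kept_counts = {t: int((toy_votes >= t).sum()) for t in (1, 2, 3)}
print(kept_counts)  # {1: 4, 2: 3, 3: 1}
```

A threshold of 2 here corresponds to a simple majority vote, which sits between the two extremes discussed above.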

### Summary of the Steps

It takes some work to get to the final results, but it is worth it. In the examples, we only used 3 models, but you can include as many as you wish to make the results more robust and trustworthy.

A logical step at this point is to wrap all this code in a function or even a custom Sklearn transformer but custom transformers are a topic for another article. To reinforce the ideas above and give you an outline, let's review the steps we have taken:

1. Choose an arbitrary number of Sklearn estimators that have either `.coef_` or `.feature_importances_` attributes. The more estimators, the more robust the results will be. However, multiple models come at a cost: as `RFECV` uses [cross-validation](https://towardsdatascience.com/how-to-master-the-subtle-art-of-train-test-set-generation-7a8408bcd578) under the hood, training will be computationally expensive for ensemble models and large datasets. Also, make sure to choose estimators that match the type of the problem, i.e. pass only classifiers or only regressors for `RFECV` to work.
2. Plug each chosen model into the `RFECV` class and make sure to save each round's boolean mask (accessed via `.support_`). To speed things up, you can increase the `step` parameter so that more than one feature is dropped in each elimination round.
3. Sum up the masks from all estimators. 
4. Set a threshold for vote count. This threshold depends on how conservative you want to be. Convert the votes array into a boolean mask using this threshold.
5. Subset the original data using the final mask for final model evaluation.
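
The steps above can be sketched as a single function. This is only a rough outline under a few assumptions: the `vote_select` name and its parameters are hypothetical, and the data comes from `make_regression` so the snippet runs on its own:

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.tree import DecisionTreeRegressor


def vote_select(estimators, X, y, min_votes, cv=3, scoring="r2", step=1):
    """Hypothetical helper: run RFECV per estimator and combine the masks."""
    masks = []
    for estimator in estimators:
        # Steps 1-2: fit RFECV for each model and save its boolean mask
        selector = RFECV(estimator=clone(estimator), cv=cv, scoring=scoring, step=step)
        selector.fit(X, y)
        masks.append(selector.support_)
    # Step 3: sum up the masks to get the vote counts
    votes = np.sum(masks, axis=0)
    # Step 4: convert the votes into a final mask using the threshold
    return votes >= min_votes, votes


# Synthetic data: 12 features, 4 of them informative
X_demo, y_demo = make_regression(
    n_samples=300, n_features=12, n_informative=4, noise=5.0, random_state=0
)

estimators = [Lasso(), LinearRegression(), DecisionTreeRegressor(random_state=0)]
final_mask, votes = vote_select(estimators, X_demo, y_demo, min_votes=2)

# Step 5: subset the data with the final mask
X_reduced = X_demo[:, final_mask]
print(votes, X_reduced.shape)
```

Returning the raw `votes` alongside the mask makes it easy to try a different threshold later without refitting every model.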

### Further Reading Related to Feature Selection

- [How to Use Variance Thresholding For Robust Feature Selection](https://towardsdatascience.com/how-to-use-variance-thresholding-for-robust-feature-selection-a4503f2b5c3f?source=your_stories_page-------------------------------------)
- [How to Use Pairwise Correlation For Robust Feature Selection](https://towardsdatascience.com/how-to-use-pairwise-correlation-for-robust-feature-selection-20a60ef7d10?source=your_stories_page-------------------------------------)
- [Powerful Feature Selection With Recursive Feature Elimination](https://towardsdatascience.com/powerful-feature-selection-with-recursive-feature-elimination-rfe-of-sklearn-23efb2cdb54e)
- [RFECV Sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html)
- [Sklearn Official Feature Selection User Guide](https://scikit-learn.org/stable/modules/feature_selection.html#feature-selection)
