# Bad-ass Feature Selection by Combining Multiple Models
## Your favorite models choosing features themselves
<img src='images/pexels.jpg'></img>
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://www.pexels.com/@olly?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Andrea Piacquadio</a>
        on 
        <a href='https://www.pexels.com/photo/crop-multiracial-people-joining-hands-together-during-break-in-modern-workplace-3931562/?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Pexels</a>
    </strong>
</figcaption>

### Setup

In [5]:
import time
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

warnings.filterwarnings("ignore")

In [15]:
tips = sns.load_dataset("tips")
tips.head()

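| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |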


In [49]:
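from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Predict total_bill from the remaining columns
X, y = tips.drop("total_bill", axis=1), tips.total_bill
# Label-encode every column (numeric ones included) so all features are integers
for col in X.columns:
    X[col] = LabelEncoder().fit_transform(X[col])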

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=1121218, test_size=0.25
)

### Introduction

There are many feature selection methods in Machine Learning. Each one may give different results depending on how you use it, so it is hard to fully trust a single method. Wouldn't it be cool to have multiple methods cast their own votes on whether we should keep a feature or not? It would be just like the Random Forest algorithm, which combines the predictions of multiple weak learners to form a strong one. It turns out Sklearn already gives us the tools to build such a feature selector ourselves.

Together, using those tools, we will build a feature selector that accepts an arbitrary number of Sklearn models. Each model will vote on which features we should keep, and we will make the final decision by gathering the votes across all models (democracy?).
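To make the voting idea concrete, here is a rough sketch with made-up votes. The `votes` array and the `min_votes` threshold are hypothetical placeholders, not output from real models; each row stands for one model's keep/drop decision per feature.

```python
import numpy as np

# Hypothetical votes: one row per model, one column per feature
votes = np.array(
    [
        [True, False, True, True],
        [True, False, False, True],
        [True, True, False, True],
    ]
)

# Count how many models want to keep each feature
vote_counts = votes.sum(axis=0)

# Keep a feature if at least `min_votes` models voted for it (assumed rule)
min_votes = 2
keep = vote_counts >= min_votes

print(vote_counts)
print(keep)
```

A simple majority is only one possible rule; a stricter selector could require every model to agree.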

### Prerequisite Knowledge to Build the Selector: Weights and Coefficients

Before we move on to building the selector, let's brush up on some of the required topics. Firstly, many Sklearn estimators that make predictions expose either a `.coef_` or a `.feature_importances_` attribute after being fitted to the training data.

The `.coef_` attribute mostly appears in models from the `sklearn.linear_model` submodule:

In [23]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Init, fit
lr = LinearRegression()
_ = lr.fit(X_train, y_train)

lr.coef_

array([ 0.104,  1.193,  2.085,  0.19 , -2.142,  3.593])

As the name suggests, these are the *coefficients* of the best-fit line found by Linear Regression. Other linear models follow a similar pattern and expose the coefficients of their internal equation:

In [25]:
from sklearn.linear_model import Lasso

lasso = Lasso()
_ = lasso.fit(X_train, y_train)

lasso.coef_

array([ 0.132,  0.   ,  0.   , -0.   , -0.   ,  2.137])

Tree-based models in `sklearn.tree` and `sklearn.ensemble` work differently: they compute the *importance* or *weight* of each feature and store it under the `.feature_importances_` attribute:

In [26]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor()
gb = GradientBoostingRegressor()

for model in [dt, gb]:
    _ = model.fit(X_train, y_train)
    print(model.feature_importances_)

[0.401 0.056 0.064 0.075 0.034 0.371]
[0.515 0.018 0.071 0.017 0.012 0.368]


Unlike the coefficients of linear models, the weights add up to 1:

In [27]:
np.sum(dt.feature_importances_)

1.0

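Regardless of the model, a feature contributes less and less to the overall prediction as the absolute value of its weight or coefficient shrinks. This means that we can drop features whose coefficients or weights are close to 0. Keep in mind that coefficients are only comparable across features when the features are on similar scales.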
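As a minimal sketch of that idea, the snippet below builds a small synthetic dataset (an assumption, not the tips data) where only two columns matter, then keeps the features whose absolute coefficient or importance exceeds a cutoff. The cutoffs `1e-5` and `0.1` are arbitrary values chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso

# Synthetic data (made up): only columns 0 and 2 drive the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = 4.0 * X_demo[:, 0] - 3.0 * X_demo[:, 2] + rng.normal(scale=0.1, size=300)

# Linear model: compare absolute coefficients against a small cutoff
lasso_demo = Lasso(alpha=0.1).fit(X_demo, y_demo)
coef_mask = np.abs(lasso_demo.coef_) > 1e-5

# Tree-based model: importances are non-negative, so no abs() is needed
gb_demo = GradientBoostingRegressor(random_state=0).fit(X_demo, y_demo)
importance_mask = gb_demo.feature_importances_ > 0.1

print(coef_mask)
print(importance_mask)
```

Both masks can be used to subset the columns, for example `X_demo[:, coef_mask]`.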

### Brief Overview of RFECV

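Recursive Feature Elimination (RFE) is a popular feature selection algorithm. It repeatedly fits a model and drops the weakest features based on their coefficients or importances. Its cross-validated version, `RFECV`, automatically finds the number of features to keep that gives the best cross-validated score for a given model. Below is a simple example: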

In [38]:
from sklearn.feature_selection import RFECV

# Init the selector around a Lasso estimator
rfecv = RFECV(
    estimator=Lasso(),
    cv=3,
    scoring="r2",
    n_jobs=-1,
    min_features_to_select=2,
)

# Fit
_ = rfecv.fit(X_train, y_train)

After fitting to the training data, the selector exposes a `.support_` attribute: a boolean mask with `True` for the features that should be kept:

In [39]:
rfecv.support_

array([ True, False, False, False, False,  True])

We can then use this mask to subset the original data:

```python
X.loc[:, rfecv.support_]
```
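For intuition, the elimination loop behind RFE can be sketched by hand. This is a simplified illustration on made-up data, not Sklearn's actual implementation: `X_toy`, `y_toy` and `n_keep` are assumptions, and the loop drops one feature per round based on the smallest absolute coefficient. `RFECV` additionally uses cross-validation to decide how many features to keep, which this sketch leaves out.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Toy data (made up): only columns 0 and 4 drive the target
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 6))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 4] + rng.normal(scale=0.1, size=200)

n_keep = 2  # hypothetical number of features to keep
remaining = list(range(X_toy.shape[1]))

while len(remaining) > n_keep:
    # Refit on the surviving columns only
    model = LinearRegression().fit(X_toy[:, remaining], y_toy)
    # Drop the feature with the smallest absolute coefficient
    weakest = int(np.argmin(np.abs(model.coef_)))
    remaining.pop(weakest)

print(remaining)

# Sklearn's RFE with the same settings, for comparison
rfe = RFE(LinearRegression(), n_features_to_select=n_keep).fit(X_toy, y_toy)
print(np.flatnonzero(rfe.support_))
```

On this toy data, the manual loop and `RFE` should agree on the surviving columns.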

The `RFECV` class will be the core of our custom feature selector. I didn't go into detail on how it works here, but my previous article focused solely on it. I recommend reading it before continuing:

https://towardsdatascience.com/powerful-feature-selection-with-recursive-feature-elimination-rfe-of-sklearn-23efb2cdb54e