In [None]:
import sys
sys.path.append('..')

from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_regression
from sklearn.feature_selection import VarianceThreshold

from common import load_forest_fires

# Feature Selection

## Resources

sklearn docs - [Feature selection](https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection)

Why giving your algorithm ALL THE FEATURES does not always work - Thomas Huijskens - [youtube](https://youtu.be/JsArBz46_3s)

[Automated Feature Engineering and Selection in Python](https://www.youtube.com/watch?v=4-4pKPv9lJ4)

## Why select features

- removes collinear (redundant) features
- reduces noise (and with it overfitting)
- makes models more interpretable
- models train quicker
- models often generalize better

Adding features has an exponential cost!
- curse of dimensionality
- model needs to understand the new feature in the context of every other feature
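
A rough way to see the exponential cost: if every feature is cut into a fixed number of bins, the number of cells the data has to cover is `bins ** n_features`. The sketch below is illustrative only; the 10 bins and the 500 samples are assumed values, not taken from the dataset used later.

```python
# Illustrative assumptions: 10 bins per feature, 500 training samples
bins_per_feature = 10
n_samples = 500


def cells(n_features, bins=bins_per_feature):
    """Number of grid cells when every feature is cut into `bins` intervals."""
    return bins ** n_features


for n_features in (1, 2, 3, 8):
    n_cells = cells(n_features)
    print(f"{n_features} features: {n_cells:>11,} cells, "
          f"{n_samples / n_cells:.6f} samples per cell")
```

With a fixed number of samples, each extra feature leaves the cells ten times emptier, which is one way to read "the model needs to understand the new feature in the context of every other feature".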

What makes a good feature selection algorithm
- remove low information features
- reduce overlap between features

We don't want to rely on univariate methods alone
- two features that are useless alone can be useful together
- lack of correlation != lack of complementarity
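
The classic case of "useless alone, useful together" is an XOR target. The toy arrays below are assumptions chosen for illustration: each feature has zero correlation with the target, and a linear fit on the two features separately explains nothing, yet adding their interaction explains the target completely.

```python
import numpy as np

# Every combination of two binary features; the target is their XOR
x1 = np.array([0.0, 0.0, 1.0, 1.0])
x2 = np.array([0.0, 1.0, 0.0, 1.0])
y = np.array([0.0, 1.0, 1.0, 0.0])

# Univariate view: each feature on its own is uncorrelated with the target
corr_x1 = np.corrcoef(x1, y)[0, 1]
corr_x2 = np.corrcoef(x2, y)[0, 1]


def r_squared(features, target):
    """R^2 of an ordinary least squares fit with an intercept."""
    design = np.column_stack([np.ones(len(target))] + list(features))
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residual = target - design @ coef
    return 1.0 - residual @ residual / np.sum((target - target.mean()) ** 2)


r2_additive = r_squared([x1, x2], y)        # the two features used separately
r2_joint = r_squared([x1, x2, x1 * x2], y)  # plus their interaction term

print(corr_x1, corr_x2, r2_additive, r2_joint)
```

Both correlations and the additive R^2 come out at (numerically) zero, while the fit with the interaction term reaches an R^2 of one. A univariate filter would discard both features.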

## Three categories of feature selection

1. Wrappers
- assess a set of features by the performance of a model trained on it
- a new model is fit for each set of features -> expensive

2. Filters
- only consider statistics of the data (correlation, mutual information, variance thresholding)
- ignore the interaction with the learning algorithm

3. Embedded
- combine ideas from wrappers & filters
- feature selection as part of the model construction process
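
Filters are covered in the cells that follow, so here is a minimal sketch of the other two categories. It assumes synthetic data from `make_regression`, and uses `SequentialFeatureSelector` as one possible wrapper and `Lasso` as one possible embedded method; neither choice comes from the original notes.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data (an assumption for illustration): 8 features, 3 of them informative
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=5.0, random_state=0
)

# Wrapper: greedily add features, scoring every candidate set by cross-validation,
# so many models are fit along the way
wrapper = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="forward", cv=3
).fit(X_demo, y_demo)
print("wrapper keeps: ", wrapper.get_support())

# Embedded: the L1 penalty drives some coefficients to zero while the model is fit
embedded = Lasso(alpha=1.0).fit(X_demo, y_demo)
print("embedded keeps:", embedded.coef_ != 0)
```

The wrapper needs a cross-validated fit for every candidate set, while the embedded method gets its selection from a single fit.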


In [None]:
ds = load_forest_fires()

x = ds.loc[:, ['FFMC', 'DMC', 'DC', 'ISI', 'temp', 'RH', 'wind', 'rain']]
y = ds.loc[:, 'area']

ds.describe()

In [None]:
ds.head()

In [None]:
ds.columns

## Variance selection

Based only on the variance of each feature
- uses no information about the target

In [None]:
# Drop low-variance features; the threshold is in the squared units of each feature
sel = VarianceThreshold(threshold=0.8)
sel.fit(x)

x.columns[sel.get_support()]
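
Because variance depends on units, a single threshold treats features on different scales very differently. The toy matrix below is an assumption made up for illustration: rescaling a column changes whether it survives the same threshold.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy columns (assumed values): a constant, a small-unit feature, a large-unit feature
X_toy = np.array([
    [1.0, 0.1, 100.0],
    [1.0, 0.2, 200.0],
    [1.0, 0.3, 300.0],
    [1.0, 0.4, 400.0],
])

raw_mask = VarianceThreshold(threshold=0.8).fit(X_toy).get_support()

# The same information in different units: multiply the second column by 100
X_rescaled = X_toy * np.array([1.0, 100.0, 1.0])
rescaled_mask = VarianceThreshold(threshold=0.8).fit(X_rescaled).get_support()

print(X_toy.var(axis=0), raw_mask)
print(X_rescaled.var(axis=0), rescaled_mask)
```

The second column is dropped in its original units and kept after rescaling, so it can be worth putting features on a comparable scale, or choosing the threshold per feature, before relying on this filter.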

## Univariate feature selection

Scores each feature in isolation, based on its statistical relationship to the target

[sklearn docs](https://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection)

In [None]:
k = 6
selector = SelectKBest(mutual_info_regression, k=k)
features = selector.fit_transform(x, y)

features

In [None]:
x.columns[selector.get_support()]

In [None]:
for score, f in zip(selector.scores_, x.columns):
    print(score, f)
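
`mutual_info_regression` adds a small amount of random noise to continuous features, so the scores can change from run to run. A minimal sketch of one way to make them reproducible and easier to read, assuming synthetic data and using `functools.partial` to fix `random_state`:

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Synthetic data (assumed for illustration): only the first two columns drive the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = 2.0 * X_demo[:, 0] + np.sin(3.0 * X_demo[:, 1]) + 0.1 * rng.normal(size=300)

# Fix the estimator's random_state so repeated fits give identical scores
score_func = partial(mutual_info_regression, random_state=0)
first = SelectKBest(score_func, k=2).fit(X_demo, y_demo)
second = SelectKBest(score_func, k=2).fit(X_demo, y_demo)

# Rank features from most to least informative
for idx in np.argsort(first.scores_)[::-1]:
    print(f"feature {idx}: {first.scores_[idx]:.3f}")
```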

## SelectFromModel

Requires a model that exposes a weight per feature
- coefficients in linear regression (`coef_`)
- feature importances in tree ensembles (`feature_importances_`)

Keeps the features whose weight is above a threshold

In [None]:
# Fit a tree ensemble, then keep features whose importance is above the mean importance
mdl = ExtraTreesRegressor(n_estimators=50)
mdl.fit(x, y)
model = SelectFromModel(mdl, prefit=True, threshold='mean')
x_new = model.transform(x)
x.columns[model.get_support()]
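
The cell above uses feature importances. For the coefficient route, here is a minimal sketch assuming synthetic, already comparably scaled data and an L1-penalized `Lasso`; the name `selector_l1` and the `alpha` value are illustrative choices, not part of the original notebook.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Synthetic data (assumed for illustration): only the first two columns matter
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 0.1 * rng.normal(size=200)

# With prefit=False (the default) SelectFromModel fits the estimator itself
selector_l1 = SelectFromModel(Lasso(alpha=0.1)).fit(X_demo, y_demo)

print(selector_l1.estimator_.coef_)  # fitted coefficients
print(selector_l1.get_support())     # mask of the features that were kept
```

Coefficient magnitudes depend on feature units, so on real data it is usually safer to standardize the features before selecting on `coef_`.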

## Stability selection

Wraps around a base learner
- the base learner must have a regularization hyperparameter
- runs the learner on many bootstrapped samples for a range of regularization strengths

Algorithm

```python
for each regularization_parameter:
    for each bootstrap:
        bootstrap a dataset (sample rows with replacement)
        fit a model on the dataset
        select features

    save the average stability score (selection frequency) across all bootstraps

select features based on the average across all regularization params
```
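
One possible reading of the algorithm above, as a hedged sketch rather than a reference answer. It assumes a `RandomForestRegressor` as the base learner with `max_depth` as the regularization parameter, and it assumes "select features" means "importance above the mean importance". The function name `stability_selection_sketch` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def stability_selection_sketch(X, y, depths, n_bootstraps=10, n_estimators=10, seed=0):
    """Per-feature selection frequency, averaged over bootstraps and depths."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape

    per_depth_scores = []
    for depth in depths:  # loop over the regularization parameter
        selected_counts = np.zeros(n_features)
        for _ in range(n_bootstraps):
            # Bootstrap: sample rows with replacement
            idx = rng.integers(0, n_samples, size=n_samples)
            model = RandomForestRegressor(
                n_estimators=n_estimators,
                max_depth=depth,
                random_state=int(rng.integers(0, 2**31 - 1)),
            )
            model.fit(X[idx], y[idx])
            # Assumed selection rule: importance above the mean importance
            importances = model.feature_importances_
            selected_counts += importances > importances.mean()
        # Stability score for this depth: fraction of bootstraps selecting each feature
        per_depth_scores.append(selected_counts / n_bootstraps)

    # Average the stability scores across all regularization values
    return np.mean(per_depth_scores, axis=0)


# Synthetic check (assumed data): only the first two of five features matter
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(150, 5))
y_demo = 3.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1] + 0.1 * rng.normal(size=150)

scores = stability_selection_sketch(X_demo, y_demo, depths=[1, 3, None])
print(scores)
```

Features can then be kept when their averaged score clears a cutoff; the cutoff is a further choice this sketch leaves open.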

## Practical

Implement stability selection

In [None]:
import numpy as np

params = {'n_estimators': 100}
# Tree depths to sweep over; None means unlimited depth
depths = np.array(np.arange(1, 20, 2).tolist() + [None,])

# Implement stability_selection_rf_regressor yourself, or uncomment the import below
#from answers import stability_selection_rf_regressor
features = stability_selection_rf_regressor(x, y, params, depths)