In [1]:
# Recursive Feature Elimination
from sklearn import datasets
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

In [2]:
# load the iris datasets
dataset = datasets.load_iris()
# create a base classifier used to evaluate a subset of attributes
model = LogisticRegression()
# create the RFE model and select 3 attributes
rfe = RFE(model, n_features_to_select=3)
rfe = rfe.fit(dataset.data, dataset.target)
# summarize the selection of the attributes
print(rfe.support_)
print(rfe.ranking_)

[False  True  True  True]
[2 1 1 1]


The results above show that the last 3 features are selected in the model (`True` means the feature is selected). The second row shows the ranking of the features (`1` marks a selected feature), which you can use if you want to apply the __filter__ method.
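The boolean mask and ranking are easier to read when paired with the feature names. The following is a minimal sketch of that idea, not part of the original notebook: `max_iter=1000` is an assumption made only to avoid convergence warnings, and `selected_names` and `X_selected` are illustrative variable names. The exact features chosen may differ from the output above depending on your scikit-learn version.

```python
from sklearn import datasets
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

dataset = datasets.load_iris()
# max_iter=1000 is an assumption to help the solver converge on unscaled data
model = LogisticRegression(max_iter=1000)
rfe = RFE(model, n_features_to_select=3).fit(dataset.data, dataset.target)

# pair each feature name with its selection flag and its rank
for name, selected, rank in zip(dataset.feature_names, rfe.support_, rfe.ranking_):
    print(f"{name:20s} selected={selected} rank={rank}")

# names of the selected features only
selected_names = [n for n, keep in zip(dataset.feature_names, rfe.support_) if keep]
print(selected_names)

# reduce the data to the selected columns
X_selected = rfe.transform(dataset.data)
print(X_selected.shape)
```

`rfe.transform` returns only the selected columns, so `X_selected` can be passed directly to whatever model you train next.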

Note that you do not need to check the performance (f1-score) or goodness-of-fit (AUC) of your model here, since `RFE` automatically selects the features for you by repeatedly fitting the model and dropping the weakest feature.

However, we arbitrarily selected __3 features__ out of the original 4, which is not a principled choice. Instead, you should test different numbers of features in the subset. You can use the code below.

In [3]:
# Check how many features this dataset has
print(len(dataset.data[0]))

4


In [4]:
for i in range(1, len(dataset.data[0])+1):
    print(i)
    # create a base classifier used to evaluate a subset of attributes
    model = LogisticRegression()
    # create the RFE model and select i attributes
    rfe = RFE(model, n_features_to_select=i)
    rfe = rfe.fit(dataset.data, dataset.target)
    # summarize the selection of the attributes
    print('Model with the best', i, 'features')
    print(rfe.support_)
    print(rfe.ranking_)

1
Model with the best 1 features
[False False False  True]
[4 2 3 1]
2
Model with the best 2 features
[False  True False  True]
[3 1 2 1]
3
Model with the best 3 features
[False  True  True  True]
[2 1 1 1]
4
Model with the best 4 features
[ True  True  True  True]
[1 1 1 1]


Then you should feed the different subsets of selected features into the __Evaluation code__ provided to check model performance.
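The evaluation code itself is not shown here, so the sketch below uses 5-fold cross-validated accuracy as a stand-in. That metric, the `Pipeline` wrapper, `max_iter=1000`, and the names `scores` and `best_k` are all assumptions for illustration. Putting `RFE` inside the pipeline means the features are re-selected on each training fold, which keeps the held-out fold from influencing the selection.

```python
from sklearn import datasets
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

dataset = datasets.load_iris()
X, y = dataset.data, dataset.target

scores = {}
for k in range(1, X.shape[1] + 1):
    # feature selection and the classifier are fitted together on each fold
    pipe = Pipeline([
        ("select", RFE(LogisticRegression(max_iter=1000), n_features_to_select=k)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    # mean 5-fold cross-validated accuracy for this subset size
    scores[k] = cross_val_score(pipe, X, y, cv=5, scoring="accuracy").mean()
    print(k, "features:", round(scores[k], 3))

# subset size with the highest mean accuracy
best_k = max(scores, key=scores.get)
print("Best number of features:", best_k)
```

Swap `scoring="accuracy"` for the metric your own evaluation code uses, for example `"f1_macro"`.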

`RFE` is like a black box. You can instead use a different method that lets you look into the feature selection process and gives you more control.

Decision tree based classifiers, such as random forest classifier or extra trees classifier, can report feature importance - which you can use as a threshold to filter features. 
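As a rough illustration of what "reporting feature importance" looks like, the sketch below prints the raw importance of each iris feature. `n_estimators=100` and `random_state=0` are assumptions added so the run is repeatable, and `importances` is an illustrative variable name.

```python
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier

dataset = datasets.load_iris()
# random_state is fixed (an assumption) so repeated runs give the same values
model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(dataset.data, dataset.target)

importances = model.feature_importances_
# importances are normalized, so they sum to 1 across all features
for name, value in zip(dataset.feature_names, importances):
    print(f"{name:20s} {value:.3f}")
```

Looking at the raw values before picking a cut-off makes it easier to judge whether a given threshold is reasonable.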

See the code below: we use an extra trees classifier to report feature importance and an arbitrary threshold of `0.4` to filter the features. The idea is that any feature whose importance passes the threshold is selected.

In [5]:
# Feature Importance
from sklearn import datasets
from sklearn import metrics
from sklearn.ensemble import ExtraTreesClassifier
# load the iris datasets
dataset = datasets.load_iris()
# fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(dataset.data, dataset.target)
# flag the attributes whose relative importance exceeds the 0.4 threshold
print(model.feature_importances_ > .4)

[False False False  True]


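The results above show that only the 4th feature passed our threshold, hence it should be selected. Because extra trees are randomized, the importances (and therefore the mask) can change between runs unless you set `random_state`.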

Of course, arbitrarily setting the threshold at `.4` is not principled either. You should also try different values in a for loop.
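Such a loop might look like the sketch below. The candidate thresholds, `n_estimators=100`, `random_state=0`, and the name `selected_by_threshold` are all assumptions chosen for illustration, not recommended values.

```python
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier

dataset = datasets.load_iris()
# random_state is fixed (an assumption) so the importances are repeatable
model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(dataset.data, dataset.target)
importances = model.feature_importances_

selected_by_threshold = {}
# hypothetical candidate thresholds; adjust them to your own data
for threshold in (0.1, 0.2, 0.3, 0.4):
    mask = importances > threshold
    names = [n for n, keep in zip(dataset.feature_names, mask) if keep]
    selected_by_threshold[threshold] = names
    print(threshold, names)
```

Each resulting subset can then be evaluated in the same way as the `RFE` subsets to decide which threshold to keep.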

# References
- [KDnuggets - Step Forward Feature Selection: A Practical Example in Python](https://www.kdnuggets.com/2018/06/step-forward-feature-selection-python.html)
- [MLM - Feature Selection in Python with Scikit-Learn](https://machinelearningmastery.com/feature-selection-in-python-with-scikit-learn/)