In [1]:
# Recursive Feature Elimination
from sklearn import datasets
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
import pandas as pd # to see the dataset as a dataframe we will need pandas

In [2]:
# load the iris datasets
dataset = datasets.load_iris()
# create a base classifier used to evaluate a subset of attributes
model = LogisticRegression(max_iter=1000)
# create the RFE model and select 3 attributes
rfe = RFE(model, n_features_to_select=3)
rfe = rfe.fit(dataset.data, dataset.target)
# summarize the selection of the attributes
print(rfe.support_)
print(rfe.ranking_)

[False  True  True  True]
[2 1 1 1]


**After creating the pandas DataFrame from the arrays, the top features using RFE with `n_features_to_select=3` are sepal width (cm), petal length (cm), and petal width (cm).**

In [3]:
print(dataset) # they are numpy arrays

{'data': array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
     

In [4]:
# https://stackoverflow.com/questions/38105539/how-to-convert-a-scikit-learn-dataset-to-a-pandas-dataset
iris_data = pd.DataFrame(data=dataset.data, columns=dataset.feature_names)  # use the dataset attributes, as in MNIST Mini Project 2
iris_data['target'] = pd.Series(dataset.target)
iris_data.head()


|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


The first row of the RFE output (`rfe.support_`) shows that the last three features are selected (`True` means the feature is selected). The second row (`rfe.ranking_`) ranks the features, where 1 marks a selected feature; you can use this ranking if you want to apply the __filter__ method.
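To make the two arrays easier to read, they can be paired with the column names. The following is a minimal sketch that reuses the same estimator settings as the cell above; the names `selected` and `ranking_by_name` are illustrative and not part of the original notebook.

```python
from sklearn import datasets
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

dataset = datasets.load_iris()
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(dataset.data, dataset.target)

# Names of the features whose support_ entry is True (illustrative variable name)
selected = [name for name, keep in zip(dataset.feature_names, rfe.support_) if keep]

# Rank per feature name: 1 means selected, larger numbers were eliminated earlier
ranking_by_name = {
    name: int(rank) for name, rank in zip(dataset.feature_names, rfe.ranking_)
}

print(selected)
print(ranking_by_name)
```

Printing names instead of positional booleans avoids having to count columns by hand when the dataset has more than a few features.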

Note that you do not need to check the performance (F1-score) or goodness-of-fit (AUC) of your model here, since `RFE` automatically selects the best subset of features of the requested size.

However, we arbitrarily selected __3 features__ from the original 4, which is not a principled choice. Instead, you should test different numbers of features in the subset. You can use the code below.

In [5]:
# Check how many features this dataset has
print(len(dataset.data[0]))

4


In [6]:
for i in range(1, len(dataset.data[0])+1):
    print(i)
    # create a base classifier used to evaluate a subset of attributes
    model = LogisticRegression()
    # create the RFE model and select i attributes
    rfe = RFE(model, n_features_to_select=i)
    rfe = rfe.fit(dataset.data, dataset.target)
    # summarize the selection of the attributes
    print('Model with the best', i, 'features')
    print(rfe.support_)
    print(rfe.ranking_)

1
Model with the best 1 features
[False False  True False]
[4 3 1 2]
2
Model with the best 2 features
[False False  True  True]
[3 2 1 1]
3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Model with the best 3 features
[False  True  True  True]
[2 1 1 1]
4
Model with the best 4 features
[ True  True  True  True]
[1 1 1 1]




**With RFE set to select 1 feature, the top feature is petal length (cm).**

**With RFE set to select 2 features, the top features are petal length (cm) and petal width (cm).**

**With RFE set to select 3 features, the top features are sepal width (cm), petal length (cm), and petal width (cm).**

**With RFE set to select 4 features, all features are selected: sepal length (cm), sepal width (cm), petal length (cm), and petal width (cm).**
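The convergence warnings in the loop output above suggest either raising `max_iter` or scaling the data. A minimal sketch of both, assuming `StandardScaler` is an acceptable preprocessing step for this exercise (the original notebook does not scale the data, and `X_scaled` and `results` are illustrative names):

```python
from sklearn import datasets
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

dataset = datasets.load_iris()

# Standardize each feature to zero mean and unit variance before fitting
X_scaled = StandardScaler().fit_transform(dataset.data)

n_features = dataset.data.shape[1]
results = {}
for i in range(1, n_features + 1):
    # A higher max_iter gives the solver more room to converge
    rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=i)
    rfe.fit(X_scaled, dataset.target)
    results[i] = rfe.support_
    print(i, rfe.support_, rfe.ranking_)
```

Scaling changes the magnitude of the fitted coefficients that `RFE` uses to rank features, so the selected subsets are not guaranteed to match the unscaled run.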


You should then pass each selected subset to the __Evaluation code__ provided to check model performance.
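The Evaluation code itself is not shown in this notebook, so the sketch below is only a stand-in for it: it scores each subset with 5-fold cross-validated macro F1 using `cross_val_score`. The choice of metric, the number of folds, and the name `scores` are assumptions made for illustration.

```python
from sklearn import datasets
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

dataset = datasets.load_iris()
X, y = dataset.data, dataset.target

scores = {}
for i in range(1, X.shape[1] + 1):
    # Select the best i features with RFE, as in the loop above
    rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=i).fit(X, y)
    X_subset = X[:, rfe.support_]  # keep only the selected columns

    # Mean 5-fold cross-validated macro F1 of a fresh classifier on this subset
    cv_scores = cross_val_score(
        LogisticRegression(max_iter=1000), X_subset, y, cv=5, scoring="f1_macro"
    )
    scores[i] = cv_scores.mean()
    print(i, round(scores[i], 3))
```

Because the features are selected on the full dataset before cross-validation, these scores are somewhat optimistic; for a stricter estimate, the selection step would need to run inside each fold.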

`RFE` is something of a black box. You can instead use a method that lets you inspect the feature selection process and gives you more control.

Decision-tree-based classifiers, such as the random forest classifier or the extra trees classifier, report feature importances, which you can compare against a threshold to filter features.

In the code below, we use an extra trees classifier to report feature importances and an arbitrary threshold of `0.4` to filter the features. Any feature whose importance exceeds the threshold is selected.

In [7]:
# Feature Importance
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier
# load the iris datasets
dataset = datasets.load_iris()
# fit an Extra Trees model to the data
model = ExtraTreesClassifier()
model.fit(dataset.data, dataset.target)
# flag the attributes whose relative importance exceeds 0.4
print(model.feature_importances_ > .4)

[False False False  True]


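The results above show that only the 4th feature, petal width (cm), passed our threshold in this run, hence it should be selected. Because `ExtraTreesClassifier` is randomized, the importances (and which features pass the threshold) can differ between runs unless you fix `random_state`.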

Of course, arbitrarily setting the threshold to `.4` is not principled. You should also try different values in a for loop.
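A minimal sketch of such a loop follows. The threshold values and the name `selected_by_threshold` are illustrative assumptions, and `random_state=0` is added only so that repeated runs give the same importances.

```python
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier

dataset = datasets.load_iris()

# random_state is fixed so the importances are reproducible between runs
model = ExtraTreesClassifier(random_state=0)
model.fit(dataset.data, dataset.target)
importances = model.feature_importances_

thresholds = (0.1, 0.2, 0.3, 0.4)  # illustrative values, not from the notebook
selected_by_threshold = {}
for threshold in thresholds:
    mask = importances > threshold
    # Names of the features whose importance exceeds the current threshold
    selected_by_threshold[threshold] = [
        name for name, keep in zip(dataset.feature_names, mask) if keep
    ]
    print(threshold, selected_by_threshold[threshold])
```

Each resulting subset can then be evaluated in the same way as the RFE subsets, which turns the threshold into a tuned parameter rather than a guess.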

# References
- [KDnuggets - Step Forward Feature Selection: A Practical Example in Python](https://www.kdnuggets.com/2018/06/step-forward-feature-selection-python.html)
- [MLM - Feature Selection in Python with Scikit-Learn](https://machinelearningmastery.com/feature-selection-in-python-with-scikit-learn/)

**Using extra trees with an importance score > .4, the run shown above selects only petal width (cm); because the classifier is randomized, petal length (cm) can also pass the threshold in other runs.**