## ___Recursive Feature Elimination (RFE)___

_Recursive feature elimination (RFE) is a feature selection method that fits a model, removes the weakest feature (or features), and repeats until the specified number of features is reached. Features are ranked by the model's `coef_` or `feature_importances_` attribute. By refitting the model after each round and eliminating only a small number of features per iteration, RFE attempts to reduce the dependencies and collinearity that may exist among the features._
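
The elimination loop described above can be sketched by hand. This is a minimal illustration, not scikit-learn's implementation: `manual_rfe` is a hypothetical helper, it removes exactly one feature per round, and it assumes the estimator exposes `feature_importances_` after fitting. The forest uses only 20 trees to keep the example fast.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier


def manual_rfe(estimator, X, y, n_features_to_select):
    """Hypothetical helper: drop the least important feature each round."""
    remaining = list(range(X.shape[1]))  # column indices still in play
    while len(remaining) > n_features_to_select:
        # Refit a fresh copy of the model on the surviving columns only.
        model = clone(estimator).fit(X[:, remaining], y)
        # Position (within `remaining`) of the weakest feature this round.
        weakest = int(np.argmin(model.feature_importances_))
        del remaining[weakest]
    return remaining


X_demo, y_demo = load_breast_cancer(return_X_y=True)
kept = manual_rfe(
    RandomForestClassifier(n_estimators=20, random_state=0),
    X_demo,
    y_demo,
    n_features_to_select=5,
)
print(kept)  # indices of the five surviving columns
```

Because the model is refit after every removal, the importances are recomputed on the reduced feature set each round, which is what distinguishes RFE from a single-pass importance cutoff.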

_RFE requires the number of features to keep to be specified, but it is often not known in advance how many features are useful. To find the optimal number, RFE can be combined with cross-validation, which scores different feature subsets and selects the best-scoring one._
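
scikit-learn provides this combination as `RFECV`. The following is a minimal sketch rather than part of the original notebook: the variable names, the 3-fold split, `step=3`, and the 25-tree forest are assumptions chosen to keep the run short.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold, train_test_split

X_cv, y_cv = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_cv, y_cv, test_size=0.2, random_state=0)

rfecv = RFECV(
    estimator=RandomForestClassifier(n_estimators=25, random_state=0),
    step=3,                          # features removed per round (assumed value)
    cv=StratifiedKFold(n_splits=3),  # folds used to score each subset size
    scoring="accuracy",
    min_features_to_select=1,
)
rfecv.fit(X_tr, y_tr)  # only the training split is used to choose the subset

print("Number of features chosen:", rfecv.n_features_)
print("Selection mask:", rfecv.support_)
```

The number of features is chosen from cross-validated scores on the training split, so the held-out test split stays untouched for the final evaluation.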

_Scikit-learn does most of the heavy lifting: import `RFE` from `sklearn.feature_selection` and pass it an estimator that exposes `coef_` or `feature_importances_` after fitting, along with the number of features to select. As with other scikit-learn objects, call `.fit()` to run the selection._

In [1]:
# importing libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.datasets import load_breast_cancer

In [6]:
data = load_breast_cancer()

In [7]:
X = pd.DataFrame(data.data, columns = data.feature_names)
y = data.target

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 0)

### ___Feature Importance using Random Forest___

In [9]:
selection = SelectFromModel(RandomForestClassifier(random_state=0, n_estimators=100, n_jobs=-1))
selection.fit(X_train, y_train)
selection.get_support()

array([ True, False,  True,  True, False, False,  True,  True, False,
       False, False, False, False,  True, False, False, False, False,
       False, False,  True, False,  True,  True, False, False, False,
        True, False, False])

In [10]:
X_train.columns

Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error', 'fractal dimension error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry', 'worst fractal dimension'],
      dtype='object')

In [11]:
# see the selected features.
selected_features = X_train.columns[(selection.get_support())]
selected_features

Index(['mean radius', 'mean perimeter', 'mean area', 'mean concavity',
       'mean concave points', 'area error', 'worst radius', 'worst perimeter',
       'worst area', 'worst concave points'],
      dtype='object')

In [12]:
selection.estimator_.feature_importances_

array([0.03699612, 0.01561296, 0.06016409, 0.0371452 , 0.0063401 ,
       0.00965994, 0.0798662 , 0.08669071, 0.00474992, 0.00417092,
       0.02407355, 0.00548033, 0.01254423, 0.03880038, 0.00379521,
       0.00435162, 0.00452503, 0.00556905, 0.00610635, 0.00528878,
       0.09556258, 0.01859305, 0.17205401, 0.05065305, 0.00943096,
       0.01565491, 0.02443166, 0.14202709, 0.00964898, 0.01001304])

In [13]:
np.mean(selection.estimator_.feature_importances_)

0.03333333333333334
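
The mean importance matters because, for an estimator without an L1 penalty such as a random forest, `SelectFromModel` uses the mean importance as its default threshold and keeps the features whose importance is at or above it. The sketch below checks this on a separately fitted selector; the names `sfm` and `manual_mask` and the 50-tree forest are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X_sfm, y_sfm = load_breast_cancer(return_X_y=True)

sfm = SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=0))
sfm.fit(X_sfm, y_sfm)

importances = sfm.estimator_.feature_importances_
# Rebuild the selection mask by hand from the mean-importance cutoff.
manual_mask = importances >= importances.mean()

print("Threshold used by SelectFromModel:", sfm.threshold_)
print("Manual mask matches get_support():", np.array_equal(manual_mask, sfm.get_support()))
```

Since the importances of a random forest sum to 1, the mean over 30 features is 1/30, which is the 0.0333 value shown above.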

In [14]:
def run_randomForest(X_train, X_test, y_train, y_test):
    """Train a random forest on the given features and print its test accuracy."""
    clf = RandomForestClassifier(random_state=0, n_estimators=100, n_jobs=-1)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print('Accuracy: ', accuracy_score(y_test, y_pred))

In [15]:
X_train_transform = selection.transform(X_train)
X_test_transform = selection.transform(X_test)

In [16]:
%time run_randomForest(X_train_transform,X_test_transform, y_train, y_test) # with selected Features

Accuracy:  0.9473684210526315
Wall time: 382 ms


### ___Recursive Feature Elimination___

In [17]:
from sklearn.feature_selection import RFE

In [30]:
selection = RFE(RandomForestClassifier(random_state=0, n_estimators=100, n_jobs=-1), n_features_to_select= 15)
selection.fit(X_train, y_train)

RFE(estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                     class_weight=None, criterion='gini',
                                     max_depth=None, max_features='auto',
                                     max_leaf_nodes=None, max_samples=None,
                                     min_impurity_decrease=0.0,
                                     min_impurity_split=None,
                                     min_samples_leaf=1, min_samples_split=2,
                                     min_weight_fraction_leaf=0.0,
                                     n_estimators=100, n_jobs=-1,
                                     oob_score=False, random_state=0, verbose=0,
                                     warm_start=False),
    n_features_to_select=15, step=1, verbose=0)

In [31]:
selection.get_support()

array([ True,  True,  True,  True, False, False,  True,  True, False,
       False, False, False, False,  True, False, False, False, False,
       False, False,  True,  True,  True,  True,  True, False,  True,
        True,  True, False])

In [32]:
selection.estimator_.feature_importances_

array([0.03361926, 0.0216965 , 0.0614782 , 0.02625804, 0.07751976,
       0.13635686, 0.04859556, 0.10333582, 0.02457575, 0.15148042,
       0.13843996, 0.02001461, 0.02672587, 0.11050174, 0.01940163])

In [33]:
# see the selected features.
selected_features = X_train.columns[(selection.get_support())]
selected_features

Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean concavity', 'mean concave points', 'area error', 'worst radius',
       'worst texture', 'worst perimeter', 'worst area', 'worst smoothness',
       'worst concavity', 'worst concave points', 'worst symmetry'],
      dtype='object')
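
After fitting, `estimator_` is the model trained on the surviving features only, which is why the importance array above has 15 entries rather than 30. A short sketch of pairing those values with their column names and of reading the elimination order from `ranking_` follows; it refits a smaller RFE on the full dataset, and the names `rfe_demo`, `kept_importances`, and `ranking` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

data_demo = load_breast_cancer()
X_named = pd.DataFrame(data_demo.data, columns=data_demo.feature_names)

rfe_demo = RFE(
    RandomForestClassifier(n_estimators=20, random_state=0),
    n_features_to_select=15,
)
rfe_demo.fit(X_named, data_demo.target)

# Importances of the final model, labeled with the surviving column names.
kept_importances = pd.Series(
    rfe_demo.estimator_.feature_importances_,
    index=X_named.columns[rfe_demo.support_],
).sort_values(ascending=False)
print(kept_importances)

# ranking_ is 1 for selected features; larger values were eliminated earlier.
ranking = pd.Series(rfe_demo.ranking_, index=X_named.columns).sort_values()
print(ranking.head(20))
```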

In [34]:
X_train_transform = selection.transform(X_train)
X_test_transform = selection.transform(X_test)

In [35]:
%time run_randomForest(X_train_transform,X_test_transform, y_train, y_test) # with selected Features

Accuracy:  0.9736842105263158
Wall time: 369 ms


### ___Recursive Feature Elimination with Gradient Boosted Trees___

In [23]:
from sklearn.ensemble import GradientBoostingClassifier

In [24]:
selection = RFE(GradientBoostingClassifier(random_state=0, n_estimators=100), n_features_to_select= 15)
selection.fit(X_train, y_train)

RFE(estimator=GradientBoostingClassifier(ccp_alpha=0.0,
                                         criterion='friedman_mse', init=None,
                                         learning_rate=0.1, loss='deviance',
                                         max_depth=3, max_features=None,
                                         max_leaf_nodes=None,
                                         min_impurity_decrease=0.0,
                                         min_impurity_split=None,
                                         min_samples_leaf=1,
                                         min_samples_split=2,
                                         min_weight_fraction_leaf=0.0,
                                         n_estimators=100,
                                         n_iter_no_change=None,
                                         presort='deprecated', random_state=0,
                                         subsample=1.0, tol=0.0001,
                                         validation_frac

In [25]:
selection.get_support()

array([False,  True, False, False,  True, False, False,  True,  True,
       False,  True, False,  True,  True, False, False,  True, False,
       False, False,  True,  True,  True,  True, False,  True,  True,
        True, False, False])

In [26]:
# see the selected features.
selected_features = X_train.columns[(selection.get_support())]
selected_features

Index(['mean texture', 'mean smoothness', 'mean concave points',
       'mean symmetry', 'radius error', 'perimeter error', 'area error',
       'concavity error', 'worst radius', 'worst texture', 'worst perimeter',
       'worst area', 'worst compactness', 'worst concavity',
       'worst concave points'],
      dtype='object')

In [27]:
X_train_transform = selection.transform(X_train)
X_test_transform = selection.transform(X_test)

In [28]:
%time run_randomForest(X_train_transform,X_test_transform, y_train, y_test) # with selected Features

Accuracy:  0.9649122807017544
Wall time: 409 ms


### ___Selecting the Best Number of Features (K)___

In [38]:
# Sweep K from 1 to the total number of features: select K features with
# GBT-based RFE, then score a random forest trained on those features.
for i in range(1, len(X_train.columns) + 1):
    selection = RFE(GradientBoostingClassifier(random_state=0, n_estimators=100), n_features_to_select=i)
    selection.fit(X_train, y_train)
    selected_features = X_train.columns[selection.get_support()]
    X_train_transform = selection.transform(X_train)
    X_test_transform = selection.transform(X_test)
    print('Selected No. of Features: ', i)
    run_randomForest(X_train_transform, X_test_transform, y_train, y_test)

Selected No. of Features:  1
Accuracy:  0.8771929824561403
Selected No. of Features:  2
Accuracy:  0.9035087719298246
Selected No. of Features:  3
Accuracy:  0.9649122807017544
Selected No. of Features:  4
Accuracy:  0.9736842105263158
Selected No. of Features:  5
Accuracy:  0.9649122807017544
Selected No. of Features:  6
Accuracy:  0.9912280701754386
Selected No. of Features:  7
Accuracy:  0.9736842105263158
Selected No. of Features:  8
Accuracy:  0.9649122807017544
Selected No. of Features:  9
Accuracy:  0.9736842105263158
Selected No. of Features:  10
Accuracy:  0.956140350877193
Selected No. of Features:  11
Accuracy:  0.956140350877193
Selected No. of Features:  12
Accuracy:  0.9736842105263158
Selected No. of Features:  13
Accuracy:  0.956140350877193
Selected No. of Features:  14
Accuracy:  0.956140350877193
Selected No. of Features:  15
Accuracy:  0.9649122807017544
Selected No. of Features:  16
Accuracy:  0.956140350877193
Selected No. of Features:  17
Accuracy:  0.96491228070

___Select the K with the best accuracy; among the results shown above, that is K = 6 (accuracy 0.9912).___
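
Picking the best K from the sweep can be done programmatically. The sketch below is one possible approach, not part of the original notebook: the accuracies are copied from the output above, the entry for K = 17 is truncated in the source and is left out together with the values that follow it, and ties are broken in favor of the smaller K as an assumed preference for simpler models.

```python
# Accuracies copied from the sweep output above (K = 1..16 only; the
# K = 17 entry is truncated in the source, so it is omitted here).
accuracy_by_k = {
    1: 0.8771929824561403,
    2: 0.9035087719298246,
    3: 0.9649122807017544,
    4: 0.9736842105263158,
    5: 0.9649122807017544,
    6: 0.9912280701754386,
    7: 0.9736842105263158,
    8: 0.9649122807017544,
    9: 0.9736842105263158,
    10: 0.956140350877193,
    11: 0.956140350877193,
    12: 0.9736842105263158,
    13: 0.956140350877193,
    14: 0.956140350877193,
    15: 0.9649122807017544,
    16: 0.956140350877193,
}

best_accuracy = max(accuracy_by_k.values())
# Assumed tie-break: prefer the smallest K that reaches the best accuracy.
best_k = min(k for k, acc in accuracy_by_k.items() if acc == best_accuracy)
print(best_k, best_accuracy)
```

Because K is chosen here from test-set accuracy, the reported score for the winning K is likely optimistic; choosing K with cross-validation on the training data avoids this.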