### Feature Selection using RFE

In this example, I work with three different datasets.

I use cross_val_score for model selection: each dataset is evaluated with every number of top-ranked features, and I then keep the number of features that gets the best score.

In [1]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import cm
%matplotlib notebook

np.random.seed(0)

Function to get the importance ranking of each feature of the dataset.

    1: most important
    
    no_features (the number of features): least important

In [2]:
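def important_features(df,y_all):
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.feature_selection import RFE
    clf = DecisionTreeClassifier(random_state=0)
    # Eliminate one feature per step until only one is left,
    # so that ranking_ gives every feature a distinct rank.
    # Recent scikit-learn versions require n_features_to_select as a keyword.
    selector = RFE(clf, n_features_to_select=1, step=1)
    selector = selector.fit(df, y_all)
    pos=selector.ranking_
    return pos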
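
As a minimal sketch of what `ranking_` looks like, the snippet below runs the same RFE setup on the iris data. It assumes a scikit-learn version where `n_features_to_select` is passed by keyword; the variable names are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Eliminate one feature per step until a single feature remains,
# so every feature receives a distinct rank.
selector = RFE(DecisionTreeClassifier(random_state=0),
               n_features_to_select=1, step=1)
selector.fit(X, y)

ranking = selector.ranking_   # rank 1 marks the last surviving feature
print(ranking)
```

With four features the ranking is a permutation of 1 to 4, one rank per column.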

Function to get the N most important features.

In [3]:
def select_features(no_feat,df,y):
    # Rank every column with RFE (1 = most important).
    pos=important_features(df,y)
    df_pos=pd.DataFrame({'Columna':df.columns,'Pos':pos})
    # Keep the columns ranked 1..no_feat, in rank order.
    new_data=df_pos.sort_values('Pos')['Columna'].tolist()[:no_feat]
    print('Set of selected features : ',new_data)
    # Return the columns in their original dataset order.
    return sorted(new_data)
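
The selection step can be sketched without RFE at all: given a ranking array, take the column whose rank is 1, then 2, and so on. The ranking below is the one implied by the iris output further down in this notebook (column 3 first, column 0 last), and `top_n_by_rank` is a hypothetical helper name.

```python
def top_n_by_rank(columns, ranking, n):
    """Return the n columns with the smallest rank, in rank order."""
    order = sorted(range(len(ranking)), key=lambda j: ranking[j])
    return [columns[j] for j in order[:n]]

columns = [0, 1, 2, 3]
ranking = [4, 3, 2, 1]   # rank 1 = most important

selected = top_n_by_rank(columns, ranking, 3)
print(selected)          # [3, 2, 1]
print(sorted(selected))  # [1, 2, 3]
```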

The **model_selection** function uses cross_val_score to obtain the score of each CV fold.

The **results_all_datas** function returns the mean score for each number of selected features.

In [4]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import auc

def model_selection(df,y):
    clf = DecisionTreeClassifier(random_state=0)
    model=cross_val_score(clf,df,y,cv=10)
    return model

def results_all_datas(df,y):
    results=[]
    index_col=[]
    for no_features in range(1,df.shape[1]+1):
        cols=select_features(no_features,df,y)
        df_new=df[cols]
        r=model_selection(df_new,y)
        results.append(r.mean())
    return results
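
When `cv` is an integer and the estimator is a classifier, `cross_val_score` uses stratified folds for binary or multiclass targets. The sketch below makes that choice explicit with `StratifiedKFold`; it is illustrative only and uses the iris data as a convenient example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# cv=10 on a classifier is shorthand for unshuffled stratified 10-fold CV.
scores_int = cross_val_score(clf, X, y, cv=10)
scores_explicit = cross_val_score(clf, X, y, cv=StratifiedKFold(n_splits=10))

print(scores_int.mean(), scores_explicit.mean())
print(np.allclose(scores_int, scores_explicit))
```

Passing the splitter object explicitly also makes it easy to switch on shuffling or change the number of folds later.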

### DataSet1 

The iris dataset is a classic and very easy multi-class classification dataset.

    - Classes 	        : 3
    - Samples per class : 50  
    - Samples total 	: 150
    - Dimensionality 	: 4
    - Features 	        : real, positive

Import the data

In [5]:
from sklearn.datasets import load_iris
data = load_iris()
df=pd.DataFrame(data=data.data)
y=data.target
df.shape 

(150, 4)

Dataset Description

In [6]:
print(data.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

In [7]:
df.head()

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


Results of model_selection for each set of selected features

In [8]:
results=results_all_datas(df,y)    

Set of selected features :  [3]
Set of selected features :  [3, 2]
Set of selected features :  [3, 2, 1]
Set of selected features :  [3, 2, 1, 0]
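
The printed sets are nested: each one extends the previous set by the next-ranked feature. Because the tree uses a fixed `random_state`, refitting RFE for every size should give the same ranking each time, so one possible shortcut is to rank once and slice. A minimal sketch under that assumption (`scores_by_size` and `ordered_cols` are hypothetical names):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
df = pd.DataFrame(data.data)
y = data.target

# Fit RFE a single time; ranking_ holds 1 for the most important column.
rfe = RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=1, step=1)
ranking = rfe.fit(df, y).ranking_
ordered_cols = df.columns[np.argsort(ranking)]  # columns from best to worst rank

scores_by_size = []
for k in range(1, df.shape[1] + 1):
    cols = sorted(ordered_cols[:k])             # same column order as select_features
    clf = DecisionTreeClassifier(random_state=0)
    scores_by_size.append(cross_val_score(clf, df[cols], y, cv=10).mean())

print(scores_by_size)
```

This fits RFE once instead of once per feature count, which matters more for the wider datasets.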


In [9]:
print('num_features:      Score')
for n in range(len(results)):
    print("          {} : {}".format(n+1,results[n]))


num_features:      Score
          1 : 0.9533333333333334
          2 : 0.9466666666666667
          3 : 0.96
          4 : 0.96


Select the set of features that obtained the best score

In [10]:
maxi=max(results) #best score
print('Best score: ',maxi)
no_feat = results.index(maxi)+1 #number of features with best score
print('Number of features with the best score: ',no_feat)

Best score:  0.96
Number of features with the best score:  3
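
Note that `list.index` returns the first position of the maximum, so when several feature counts tie (here 3 and 4 features both score 0.96), the smaller set is chosen. A quick check with the scores printed above:

```python
# Mean CV scores for 1, 2, 3 and 4 features, copied from the output above.
results = [0.9533333333333334, 0.9466666666666667, 0.96, 0.96]

best = max(results)
first_best = results.index(best) + 1   # first occurrence wins on ties
all_best = [n + 1 for n, r in enumerate(results) if r == best]

print(first_best)  # 3
print(all_best)    # [3, 4]
```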


Train the model with the winning set

In [11]:
df_new=df[select_features(no_feat,df,y)] #new data


clf = DecisionTreeClassifier(random_state=0) #classifier to evaluate
best_score=cross_val_score(clf,df_new,y,cv=10).mean() #mean CV score on the selected features


Set of selected features :  [3, 2, 1]


In [12]:
best_score

0.96

In [13]:
no_feat

3
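
One caveat with this workflow: the ranking is computed on the full dataset before cross-validation, and the same CV scores are used both to choose the number of features and to report the result, so the reported score may be optimistic. A possible variant, sketched here as an assumption rather than as this notebook's method, wraps RFE and the classifier in a `Pipeline` so that the ranking is refit inside every training fold:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    # Feature selection is fit on the training part of each fold only.
    ("rfe", RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=3)),
    ("clf", DecisionTreeClassifier(random_state=0)),
])

fold_scores = cross_val_score(pipe, X, y, cv=10)
print(fold_scores.mean())
```

The value 3 for `n_features_to_select` mirrors the winner found above; choosing it without reusing the evaluation folds would need a nested cross-validation.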

### DataSet2

The breast cancer dataset is a classic and very easy binary classification dataset.

    - Classes 	        : 2
    - Samples per class : 212(M),357(B)
    - Samples total 	: 569
    - Dimensionality    : 30
    - Features 	        : real, positive

Import the data

In [14]:
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
df=pd.DataFrame(data=data.data)
y=data.target
df.shape

(569, 30)

In [15]:
y.shape

(569,)

Results of model_selection for each set of selected features

In [16]:
results=results_all_datas(df,y)    

Set of selected features :  [20]
Set of selected features :  [20, 27]
Set of selected features :  [20, 27, 21]
Set of selected features :  [20, 27, 21, 24]
Set of selected features :  [20, 27, 21, 24, 23]
Set of selected features :  [20, 27, 21, 24, 23, 13]
Set of selected features :  [20, 27, 21, 24, 23, 13, 7]
Set of selected features :  [20, 27, 21, 24, 23, 13, 7, 26]
Set of selected features :  [20, 27, 21, 24, 23, 13, 7, 26, 25]
Set of selected features :  [20, 27, 21, 24, 23, 13, 7, 26, 25, 14]
Set of selected features :  [20, 27, 21, 24, 23, 13, 7, 26, 25, 14, 22]
Set of selected features :  [20, 27, 21, 24, 23, 13, 7, 26, 25, 14, 22, 12]
Set of selected features :  [20, 27, 21, 24, 23, 13, 7, 26, 25, 14, 22, 12, 18]
Set of selected features :  [20, 27, 21, 24, 23, 13, 7, 26, 25, 14, 22, 12, 18, 10]
Set of selected features :  [20, 27, 21, 24, 23, 13, 7, 26, 25, 14, 22, 12, 18, 10, 11]
Set of selected features :  [20, 27, 21, 24, 23, 13, 7, 26, 25, 14, 22, 12, 18, 10, 11, 9]
Set

Select the set of features that obtained the best score

In [17]:
maxi=max(results) #best score
print('Best score: ',maxi)
no_feat = results.index(maxi)+1 #number of features with best score
print('Number of features with the best score: ',no_feat)

Best score:  0.9440983925330568
Number of features with the best score:  6


Train the model with the winning set

In [18]:
df_new=df[select_features(no_feat,df,y)]


clf = DecisionTreeClassifier(random_state=0)
best_score=cross_val_score(clf,df_new,y,cv=10).mean()

Set of selected features :  [20, 27, 21, 24, 23, 13]


In [19]:
best_score

0.9440983925330568

In [20]:
no_feat

6
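
Because the DataFrame was built without column names, the selected features appear as integer positions. Assuming the default column order returned by `load_breast_cancer`, they can be mapped back to names through `feature_names`:

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Column positions of the winning set, copied from the output above.
selected = [20, 27, 21, 24, 23, 13]
names = [str(data.feature_names[i]) for i in selected]

for idx, name in zip(selected, names):
    print(idx, name)
```

Passing `columns=data.feature_names` when building the DataFrame would make the printed sets readable directly.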

### DataSet3

Each datapoint is an 8x8 image of a digit.

    - Classes 	        : 10
    - Samples per class : ~180
    - Samples total 	: 1797
    - Dimensionality 	: 64
    - Features 	        : integers 0-16
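
Since each sample is an 8x8 image flattened row by row into 64 values, a feature index can be converted back to a pixel position with integer division. A small sketch (`pixel_position` is a hypothetical helper):

```python
def pixel_position(feature_index, width=8):
    """Map a flattened feature index to (row, column) in a width x width image."""
    return divmod(feature_index, width)

for idx in (0, 5, 42, 63):
    print(idx, pixel_position(idx))
```

For example, feature 42 corresponds to row 5, column 2 of the image.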

Import the data

In [21]:
from sklearn.datasets import load_digits
digits = load_digits()
df=pd.DataFrame(data=digits.data)
y=digits.target
df.shape

(1797, 64)

Results of model_selection for each set of selected features

In [22]:
results=results_all_datas(df,y)   

Set of selected features :  [42]
Set of selected features :  [42, 5]
Set of selected features :  [42, 5, 21]
Set of selected features :  [42, 5, 21, 36]
Set of selected features :  [42, 5, 21, 36, 20]
Set of selected features :  [42, 5, 21, 36, 20, 27]
Set of selected features :  [42, 5, 21, 36, 20, 27, 43]
Set of selected features :  [42, 5, 21, 36, 20, 27, 43, 60]
Set of selected features :  [42, 5, 21, 36, 20, 27, 43, 60, 33]
Set of selected features :  [42, 5, 21, 36, 20, 27, 43, 60, 33, 29]
Set of selected features :  [42, 5, 21, 36, 20, 27, 43, 60, 33, 29, 28]
Set of selected features :  [42, 5, 21, 36, 20, 27, 43, 60, 33, 29, 28, 26]
Set of selected features :  [42, 5, 21, 36, 20, 27, 43, 60, 33, 29, 28, 26, 34]
Set of selected features :  [42, 5, 21, 36, 20, 27, 43, 60, 33, 29, 28, 26, 34, 44]
Set of selected features :  [42, 5, 21, 36, 20, 27, 43, 60, 33, 29, 28, 26, 34, 44, 37]
Set of selected features :  [42, 5, 21, 36, 20, 27, 43, 60, 33, 29, 28, 26, 34, 44, 37, 54]
Set of 

Set of selected features :  [42, 5, 21, 36, 20, 27, 43, 60, 33, 29, 28, 26, 34, 44, 37, 54, 12, 10, 19, 53, 18, 13, 58, 50, 52, 3, 35, 45, 61, 38, 9, 51, 41, 46, 17, 11, 4, 59, 24, 62, 25, 6, 56, 49, 23, 39, 8, 7, 22, 2, 40, 15, 30, 16, 14, 57, 48, 47, 55]
Set of selected features :  [42, 5, 21, 36, 20, 27, 43, 60, 33, 29, 28, 26, 34, 44, 37, 54, 12, 10, 19, 53, 18, 13, 58, 50, 52, 3, 35, 45, 61, 38, 9, 51, 41, 46, 17, 11, 4, 59, 24, 62, 25, 6, 56, 49, 23, 39, 8, 7, 22, 2, 40, 15, 30, 16, 14, 57, 48, 47, 55, 31]
Set of selected features :  [42, 5, 21, 36, 20, 27, 43, 60, 33, 29, 28, 26, 34, 44, 37, 54, 12, 10, 19, 53, 18, 13, 58, 50, 52, 3, 35, 45, 61, 38, 9, 51, 41, 46, 17, 11, 4, 59, 24, 62, 25, 6, 56, 49, 23, 39, 8, 7, 22, 2, 40, 15, 30, 16, 14, 57, 48, 47, 55, 31, 63]
Set of selected features :  [42, 5, 21, 36, 20, 27, 43, 60, 33, 29, 28, 26, 34, 44, 37, 54, 12, 10, 19, 53, 18, 13, 58, 50, 52, 3, 35, 45, 61, 38, 9, 51, 41, 46, 17, 11, 4, 59, 24, 62, 25, 6, 56, 49, 23, 39, 8, 7, 22,

Select the set of features that obtained the best score

In [23]:
maxi=max(results) #best score
print('Best score: ',maxi)
no_feat = results.index(maxi)+1 #number of features with best score
print('Number of features with the best score: ',no_feat)

Best score:  0.8364550070064775
Number of features with the best score:  61


Train the model with the winning set

In [24]:
df_new=df[select_features(no_feat,df,y)]


clf = DecisionTreeClassifier(random_state=0)
best_score=cross_val_score(clf,df_new,y,cv=10).mean()

Set of selected features :  [42, 5, 21, 36, 20, 27, 43, 60, 33, 29, 28, 26, 34, 44, 37, 54, 12, 10, 19, 53, 18, 13, 58, 50, 52, 3, 35, 45, 61, 38, 9, 51, 41, 46, 17, 11, 4, 59, 24, 62, 25, 6, 56, 49, 23, 39, 8, 7, 22, 2, 40, 15, 30, 16, 14, 57, 48, 47, 55, 31, 63]


In [25]:
best_score

0.8364550070064775

In [26]:
no_feat

61
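
For comparison, scikit-learn also ships `RFECV`, which combines recursive elimination with cross-validation in a single estimator. It is not identical to the procedure in this notebook, so the chosen feature count may differ. A minimal sketch on the iris data, kept small so that it runs quickly:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

selector = RFECV(
    DecisionTreeClassifier(random_state=0),
    step=1,                              # drop one feature per iteration
    cv=StratifiedKFold(n_splits=10),
    scoring="accuracy",
)
selector.fit(X, y)

print(selector.n_features_)   # number of features kept
print(selector.support_)      # boolean mask of the kept columns
```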