# Random Forest Feature Selection Notebook

This notebook illustrates feature selection with Random Forest on the SRBCT dataset. We report the performance of a nearest-neighbor classifier and a naive Bayes classifier, and compare the results with a variable ranking approach based on mutual information.

First, we load the required packages.

In [3]:
from sklearn.model_selection import RepeatedKFold
from sklearn.feature_selection import mutual_info_classif
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, SelectFromModel
import matplotlib.pyplot as plt
import matplotlib.lines as mlines
import numpy as np
import pandas as pd
import sys

## Loading the Data

We load the data from CSV files using pandas. The dataset considered is the 'Simple Round Blue Cell Tumors' (SRBCT) dataset from the reference: Khan, J., Wei, J. S., Ringnér, M., Saal, L. H., Ladanyi, M., Westermann, F., … Meltzer, P. S. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 7(6), 673–679. http://doi.org/10.1038/89044. The dataset is split into a training set of 64 instances and a test set of 20 instances. There are 2308 attributes, each corresponding to the expression level of a gene. There are 4 classes, corresponding to different tumors: neuroblastoma (NB), rhabdomyosarcoma (RMS), non-Hodgkin lymphoma (NHL) and the Ewing family of tumors (EWS).

In [4]:
dataTrain = pd.read_csv('srbct_train.csv')
X_train = dataTrain.values[ :, 0 : (dataTrain.shape[ 1 ] - 1) ].astype(float)
y_train = (dataTrain.values[ :, dataTrain.shape[ 1 ] - 1 ]).astype(int)

dataTest = pd.read_csv('srbct_test.csv')
X_test = dataTest.values[ :, 0 : (dataTest.shape[ 1 ] - 1) ].astype(float)
y_test = (dataTest.values[ :, dataTest.shape[ 1 ] - 1 ]).astype(int)


In [5]:
print(X_train.shape)
print(X_test.shape)

(64, 2308)
(20, 2308)


We merge the train and the test data to get a single data set of 84 instances.

In [6]:
X = np.vstack((X_train, X_test))
y = np.hstack((y_train, y_test))

## Classifiers and Objects to Preprocess the Data

We create a standard scaler to preprocess the data and a KNN classifier with 3 neighbors. We also create a Naive Bayes classifier and a filter based on variable ranking that keeps the 10 features with the highest mutual information.

In [7]:
# We also set the random seed to 0, to guarantee reproducibility.

np.random.seed(0)

filtering = SelectKBest(mutual_info_classif, k = 10)
scaler = StandardScaler()
nb = GaussianNB()
knn = KNeighborsClassifier(n_neighbors=3)
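
To make the variable ranking filter more concrete, the following sketch scores each feature by its estimated mutual information with the class label and keeps the 10 highest-scoring ones. It runs on synthetic data from `make_classification`, which is an assumption standing in for the SRBCT files. The names `X_demo`, `y_demo` and `score_func` are illustrative, and the fixed `random_state` passed through `functools.partial` is added only to make the scores repeatable.

```python
from functools import partial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the SRBCT data (assumption): 84 samples, 4 classes.
X_demo, y_demo = make_classification(
    n_samples=84, n_features=50, n_informative=8, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0)

# Fix the seed of the mutual information estimator so scores are repeatable.
score_func = partial(mutual_info_classif, random_state=0)

# Standardize, then rank the features by mutual information and keep 10.
X_scaled = StandardScaler().fit_transform(X_demo)
selector = SelectKBest(score_func, k=10)
selector.fit(X_scaled, y_demo)

# Indices of the features that survive the filter, and their scores.
selected = np.flatnonzero(selector.get_support())
selected_scores = selector.scores_[selected]
print(selected)
print(selected_scores)
```

Every kept feature has a score at least as high as any discarded one, which is all that a variable ranking filter guarantees. It does not account for redundancy among the selected features.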

## Instantiation of the Feature Selection Method Based on Random Forest

We create the objects that implement the feature selection approach based on Random Forest. This method can be very expensive because it requires building a large ensemble of decision trees. Therefore, we first apply a variable ranking filter that keeps only 20% of the features. After that, Random Forest selects 10 features.

In [8]:
filtering_rf = SelectKBest(mutual_info_classif, k = int(np.round(X.shape[ 1 ] * 0.2)))

# We start with a threshold equal to zero. After fitting, it is raised to the
# 10th largest feature importance so that only 10 features are kept.

rf_selection = SelectFromModel(RandomForestClassifier(n_estimators = 2000,
    random_state = 0), threshold = 0.0)
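
As an alternative to adjusting the threshold by hand after fitting, `SelectFromModel` also accepts a `max_features` argument. The sketch below is not the approach used in this notebook. It shows how `threshold=-np.inf` combined with `max_features=10` keeps the 10 features with the largest importances. The synthetic data and the smaller forest of 200 trees are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the pre-filtered data (assumption).
X_demo, y_demo = make_classification(
    n_samples=84, n_features=40, n_informative=8, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0)

# threshold=-np.inf disables the importance cut-off, so max_features
# alone decides how many features are kept.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=200, random_state=0),
    threshold=-np.inf,
    max_features=10)
selector.fit(X_demo, y_demo)
X_top10 = selector.transform(X_demo)

# Importances of the fitted forest and the indices of the kept features.
importances = selector.estimator_.feature_importances_
kept = np.flatnonzero(selector.get_support())
print(X_top10.shape)
print(kept)
```

With this configuration the number of selected features is fixed at 10 even when importances are tied. A threshold set to the 10th largest importance keeps every feature that reaches that value.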

## Cross Validation Process to Estimate the Generalization Performance

We carry out a 10-fold cross-validation process to estimate the prediction performance of the KNN and Naive Bayes classifiers for each feature selection method.

In [9]:
# This is the number of times the 10-fold cv process will be repeated

n_repeats = 1

In [10]:
rkf = RepeatedKFold(n_splits=10, n_repeats = n_repeats, random_state=0)
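
The short sketch below shows what `RepeatedKFold` produces for a dataset of 84 instances. The placeholder array `X_demo` and the choice of 2 repetitions are assumptions made for illustration only. The total number of partitions is `n_splits * n_repeats`, which is why the result arrays later in the notebook have length `10 * n_repeats`.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

# Placeholder data with the same number of instances as the merged set.
X_demo = np.zeros((84, 5))

# Two repetitions of 10-fold CV (the notebook itself uses n_repeats = 1).
demo_rkf = RepeatedKFold(n_splits=10, n_repeats=2, random_state=0)

# Size of the held-out part in each partition.
test_sizes = [len(test_idx) for _, test_idx in demo_rkf.split(X_demo)]
n_partitions = demo_rkf.get_n_splits(X_demo)

print(n_partitions)
print(test_sizes)
```

Since 84 is not a multiple of 10, the held-out folds contain either 8 or 9 instances, so a single misclassification changes the per-fold error by roughly 0.11 to 0.125.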

We create arrays to store the results, one per classifier and feature selection method.

In [11]:
errors_nb_vr = np.zeros(10 * n_repeats)
errors_knn_vr = np.zeros(10 * n_repeats)
errors_nb_rf = np.zeros(10 * n_repeats)
errors_knn_rf = np.zeros(10 * n_repeats)

We now loop over the data partitions. This will take some time due to the cost of estimating mutual information.

In [12]:
# First, a simple variable ranking filtering approach

split = 0

for train_index, test_index in rkf.split(X, y):

    sys.stdout.write('.')
    sys.stdout.flush()
    
    # First simple variable ranking

    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    # We standardize the data to have zero mean and unit std

    scaler.fit(X_train, y_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)
    
    # We filter the data using variable ranking

    filtering.fit(X_train, y_train)
    X_train_vr = filtering.transform(X_train)
    X_test_vr = filtering.transform(X_test)
    
    # We fit the classifiers and compute the test performance

    nb.fit(X_train_vr, y_train)
    knn.fit(X_train_vr, y_train)

    errors_nb_vr[ split ] = 1.0 - np.mean(nb.predict(X_test_vr) == y_test)
    errors_knn_vr[ split ] = 1.0 - np.mean(knn.predict(X_test_vr) == y_test)

    split += 1


..........

In [13]:
# Now RF after an initial variable ranking filtering approach

np.random.seed(0)
    
split = 0

for train_index, test_index in rkf.split(X, y):

    sys.stdout.write('.')
    sys.stdout.flush()
    
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    # We standardize the data to have zero mean and unit std

    scaler.fit(X_train, y_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)
    
    # We filter the data first using variable ranking
    
    filtering_rf.fit(X_train, y_train)
    X_train_rf = filtering_rf.transform(X_train)
    X_test_rf = filtering_rf.transform(X_test)
    
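    # We filter the data again using RF. After fitting, the threshold is set to
    # the 10th largest importance, so 10 features are kept (more if tied)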
    
    rf_selection.fit(X_train_rf, y_train)
    rf_selection.threshold = -1.0 * np.sort(-1.0 * rf_selection.estimator_.feature_importances_)[ 9 ]
    X_train_rf = rf_selection.transform(X_train_rf)
    X_test_rf = rf_selection.transform(X_test_rf)
    
    # We fit the classifiers and compute the test performance

    nb.fit(X_train_rf, y_train)
    knn.fit(X_train_rf, y_train)

    errors_nb_rf[ split ] = 1.0 - np.mean(nb.predict(X_test_rf) == y_test)
    errors_knn_rf[ split ] = 1.0 - np.mean(knn.predict(X_test_rf) == y_test)
    
    split += 1

..........
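
The loops above fit the scaler and the selectors on the training part of each partition only, which keeps the held-out data from influencing feature selection. The same protocol can be sketched more compactly with a scikit-learn `Pipeline` and `cross_val_score`. This is a possible restructuring, not the notebook's code. It uses synthetic data, a smaller forest, a fixed `k=20` for the first filter and `max_features=10` in place of the manual threshold, all of which are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import (SelectFromModel, SelectKBest,
                                       mutual_info_classif)
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the SRBCT data (assumption).
X_demo, y_demo = make_classification(
    n_samples=84, n_features=100, n_informative=8, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0)

# Each step is refit on the training part of every partition.
rf_nb = Pipeline([
    ("scaler", StandardScaler()),
    ("mi_filter", SelectKBest(mutual_info_classif, k=20)),
    ("rf_filter", SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=0),
        threshold=-np.inf, max_features=10)),
    ("nb", GaussianNB()),
])

cv = RepeatedKFold(n_splits=10, n_repeats=1, random_state=0)

# cross_val_score returns accuracies; convert them to error rates.
fold_errors = 1.0 - cross_val_score(rf_nb, X_demo, y_demo, cv=cv)
print(fold_errors.mean())
```

A second pipeline ending in `KNeighborsClassifier(n_neighbors=3)` would give the corresponding KNN estimate, at the cost of repeating the feature selection steps.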

## Reporting the Results Obtained

We compute the performance of each classifier for each feature selection method.

In [14]:
# First simple variable ranking

print("\nWith Variable Ranking Feature Selection")
print("Mean Error Naive Bayes:%f" % np.mean(errors_nb_vr))
print("\tStd Mean Error Naive Bayes:%f" % (np.std(errors_nb_vr) / np.sqrt(len(errors_nb_vr))))
print("Mean Error KNN:%f" % np.mean(errors_knn_vr))
print("\tStd Mean Error KNN:%f" % (np.std(errors_knn_vr) / np.sqrt(len(errors_knn_vr))))

# Next, the RF approach

print("\nWith Variable ranking and RF Feature Selection")
print("Mean Error Naive Bayes:%f" % np.mean(errors_nb_rf))
print("\tStd Mean Error Naive Bayes:%f" % (np.std(errors_nb_rf) / np.sqrt(len(errors_nb_rf))))
print("Mean Error KNN:%f" % np.mean(errors_knn_rf))
print("\tStd Mean Error KNN:%f" % (np.std(errors_knn_rf) / np.sqrt(len(errors_knn_rf))))



With Variable Ranking Feature Selection
Mean Error Naive Bayes:0.083333
	Std Mean Error Naive Bayes:0.017347
Mean Error KNN:0.036111
	Std Mean Error KNN:0.017480

With Variable ranking and RF Feature Selection
Mean Error Naive Bayes:0.025000
	Std Mean Error Naive Bayes:0.015811
Mean Error KNN:0.034722
	Std Mean Error KNN:0.023011
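
The "Std Mean Error" lines report the standard error of the mean, that is, the standard deviation of the per-fold errors divided by the square root of the number of folds. The helper below, `mean_and_sem`, is a hypothetical name introduced only for this sketch. The per-fold values are illustrative, since the notebook does not print them. They were chosen so that the result matches the 0.025000 and 0.015811 reported for Naive Bayes with RF feature selection.

```python
import numpy as np

def mean_and_sem(errors, ddof=0):
    """Return the mean of `errors` and its standard error.

    ddof=0 matches np.std's default, which is what the notebook uses;
    ddof=1 would give the sample standard deviation instead.
    """
    errors = np.asarray(errors, dtype=float)
    return errors.mean(), errors.std(ddof=ddof) / np.sqrt(errors.size)

# Illustrative per-fold errors (assumption): one mistake in two folds of 8.
fold_errors = np.array([0.0, 0.0, 0.0, 0.0, 0.125, 0.0, 0.0, 0.125, 0.0, 0.0])

mean_err, sem_err = mean_and_sem(fold_errors)
print("Mean error: %f" % mean_err)
print("Standard error of the mean: %f" % sem_err)
```

With only 10 folds the standard errors are of the same order as the differences between methods, so the ranking of the methods should be read with caution. Increasing `n_repeats` would tighten these estimates.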
