# Some preliminary Python code

In [4]:
%matplotlib inline

import warnings; warnings.simplefilter('ignore')

import matplotlib.pyplot as plt
import numpy as np
import sklearn

## Example of a multivariate wrapper using an RBF SVC

For wrapper methods, we train a model on a candidate subset of features, and the score function $J$ is the estimated real risk of the resulting predictor. To select a subset of features, we explore the tree of possible subsets with one of several strategies referred to as "sequential search": sequential forward search, sequential backward search, sequential forward floating search, or sequential backward floating search.
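To make the idea concrete, here is a minimal sketch of plain sequential forward search (without the floating step). The function name `sequential_forward_search`, the `score` callable and the toy weights are illustrative assumptions, not part of mlxtend; `score` is assumed to return a value where higher is better (for example a cross-validated accuracy).

```python
from typing import Callable, List, Tuple


def sequential_forward_search(
    n_features: int,
    k: int,
    score: Callable[[Tuple[int, ...]], float],
) -> Tuple[int, ...]:
    """Greedily grow a feature subset until it contains k features."""
    if not 0 < k <= n_features:
        raise ValueError("k must be between 1 and n_features")
    selected: List[int] = []
    remaining = list(range(n_features))
    while len(selected) < k:
        # Try each remaining feature and keep the one whose addition scores best
        best_feature = max(remaining, key=lambda f: score(tuple(selected + [f])))
        selected.append(best_feature)
        remaining.remove(best_feature)
    return tuple(selected)


# Toy score (hypothetical): each feature contributes a fixed weight
weights = [0.1, 0.5, 0.2, 0.9, 0.3]


def toy_score(subset: Tuple[int, ...]) -> float:
    return sum(weights[i] for i in subset)


subset = sequential_forward_search(len(weights), 3, toy_score)
print(subset)
```

In a real wrapper, `score` would train the model on the candidate columns and return its cross-validated score. The floating variants additionally try to remove previously selected features after each addition.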

### Example: Breast cancer classification dataset

As an example, we will consider the breast cancer classification dataset. This classification dataset has 30 continuous features and 569 samples. We first let the selector pick the best-scoring subset size, and then look for the best subset of 3 features. The score function is the accuracy of an RBF SVC, estimated by 4-fold cross-validation; under the 0-1 loss, the estimated real risk is one minus this accuracy.
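A quick count shows why a greedy search is attractive for this problem. The sketch below compares the number of candidate subsets an exhaustive search would score with the number scored by a plain forward search; the floating variants score some additional subsets, which are not counted here.

```python
from math import comb

n_features, k = 30, 3

# Exhaustive search scores every subset of size k
exhaustive = comb(n_features, k)

# Plain forward search scores 30 candidates, then 29, then 28
greedy = sum(n_features - step for step in range(k))

print("Exhaustive search: {} subsets".format(exhaustive))
print("Forward search   : {} subsets".format(greedy))
```

Each scored subset costs one cross-validation run, so with 4 folds the number of model fits is four times these counts.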

As scikit-learn's own `SequentialFeatureSelector` (available from version 0.24) does not implement the floating variants, we will use the mlxtend [1] package.


[1] http://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/
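For comparison, recent scikit-learn releases (0.24 and later) ship a non-floating `SequentialFeatureSelector`. The sketch below is an aside, not the method used in the rest of this notebook: it runs a plain forward search for 3 features, wrapping the scaler and the RBF SVC in a pipeline so that scaling is refitted inside each fold. The variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# The estimator scored on each candidate subset: scaling followed by an RBF SVC
estimator = make_pipeline(StandardScaler(), SVC(kernel='rbf'))

# Plain forward search (no floating step) for 3 features, scored by 4-fold CV accuracy
sk_selector = SequentialFeatureSelector(estimator,
                                        n_features_to_select=3,
                                        direction='forward',
                                        scoring='accuracy',
                                        cv=4)
sk_selector.fit(X, y)

# Indices of the retained columns
sk_selected = sk_selector.get_support(indices=True)
print("Selected feature indices:", sk_selected)
```

Because this search cannot drop a feature once it has been added, it may return a different subset than the floating search used below.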

In [5]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

data = load_breast_cancer()

n_samples = data['data'].shape[0]
n_features = data['data'].shape[1]

# We hold out 15% of the samples as a test set; stratifying on the target keeps the class
# proportions of the train and test sets close to those of the whole dataset
X_train, X_test, y_train, y_test = train_test_split(data['data'], data['target'], stratify=data['target'], test_size=0.15)


clf = SVC(kernel='rbf')

# We use Sequential Floating Forward Search
selector = SFS(clf, 
           k_features='best', 
           forward=True, 
           floating=True, 
           scoring='accuracy',
           cv=4,
           n_jobs=-1)
selector = selector.fit(X_train, y_train)

# Print out which are the selected features
selected_indices = selector.k_feature_idx_
names_selected_indices = [data['feature_names'][i] for i in selected_indices]
print("The selector selected {} features, which have the indices {}, i.e. {}".format(len(selected_indices), selected_indices, names_selected_indices))

# We reduce the training set to the selected features
X_train_red = selector.transform(X_train)

# And then train our classifier
clf = SVC(kernel='rbf')
clf.fit(X_train_red, y_train)

# Finally, we estimate the accuracy on the test set
y_test_pred = clf.predict(selector.transform(X_test))
test_accuracy = 100 * (y_test_pred == y_test).sum() / y_test.size
print("Accuracy on the test set : {:.2f} %".format(test_accuracy))


# We can also estimate the real risk by cross-validating the whole process
# (scaling, feature selection and training), here with a fixed subset size of 3
clf = make_pipeline(StandardScaler(), 
                    SFS(SVC(kernel='rbf'), 
                        k_features=3, 
                        forward=True, 
                        floating=True, 
                        scoring='accuracy',
                        cv=4,
                        n_jobs=-1),
                    SVC(kernel='rbf'))
scores = cross_val_score(clf, data['data'], data['target'], cv=4)
print("Real risk by cross validation for 3 features : %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

The selector selected 20 features, which have the indices (0, 1, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17, 18, 19, 20, 21, 22), i.e. ['mean radius', 'mean texture', 'mean smoothness', 'mean compactness', 'mean concavity', 'mean concave points', 'mean symmetry', 'mean fractal dimension', 'radius error', 'texture error', 'perimeter error', 'smoothness error', 'compactness error', 'concavity error', 'concave points error', 'symmetry error', 'fractal dimension error', 'worst radius', 'worst texture', 'worst perimeter']
Accuracy on the test set : 90.70 %
Real risk by cross validation for 3 features : 0.96 (+/- 0.02)


In [6]:
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

# Repeat the cross-validated pipeline for subset sizes 1 to 14
# (10-fold outer cross-validation, 4-fold cross-validation inside the selector)
for i in range(1, 15):
    clf = make_pipeline(StandardScaler(),
                        SFS(SVC(kernel='rbf'),
                            k_features=i,
                            forward=True,
                            floating=True,
                            scoring='accuracy',
                            cv=4,
                            n_jobs=-1),
                        SVC(kernel='rbf'))
    scores = cross_val_score(clf, data['data'], data['target'], cv=10)
    print("Real risk by cross validation for {} features : {:.2f} (+/- {:.2f})".format(i, scores.mean(), scores.std() * 2))

Real risk by cross validation for 1 features : 0.90 (+/- 0.07)
Real risk by cross validation for 2 features : 0.94 (+/- 0.05)
Real risk by cross validation for 3 features : 0.95 (+/- 0.04)
Real risk by cross validation for 4 features : 0.95 (+/- 0.04)
Real risk by cross validation for 5 features : 0.96 (+/- 0.04)
Real risk by cross validation for 6 features : 0.96 (+/- 0.05)
Real risk by cross validation for 7 features : 0.96 (+/- 0.05)
Real risk by cross validation for 8 features : 0.97 (+/- 0.06)
Real risk by cross validation for 9 features : 0.96 (+/- 0.04)
Real risk by cross validation for 10 features : 0.96 (+/- 0.04)
Real risk by cross validation for 11 features : 0.96 (+/- 0.04)
Real risk by cross validation for 12 features : 0.96 (+/- 0.05)
Real risk by cross validation for 13 features : 0.96 (+/- 0.06)
Real risk by cross validation for 14 features : 0.97 (+/- 0.05)
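The printed values can be summarised programmatically. The sketch below hard-codes the mean accuracies from the output above as integer percentages and reports the smallest subset size whose mean is within one percentage point of the best. The one-point tolerance is an arbitrary assumption chosen for illustration.

```python
# Mean accuracies (in %) copied from the output above, for 1 to 14 features
mean_accuracy_pct = [90, 94, 95, 95, 96, 96, 96, 97, 96, 96, 96, 96, 96, 97]
tolerance_pct = 1  # assumed tolerance, not taken from the original analysis

best_pct = max(mean_accuracy_pct)
# Subset size that first reaches the best mean accuracy
best_k = mean_accuracy_pct.index(best_pct) + 1
# Smallest subset size within the tolerance of the best mean accuracy
smallest_k = next(k for k, acc in enumerate(mean_accuracy_pct, start=1)
                  if acc >= best_pct - tolerance_pct)

print("Best mean accuracy {} % first reached with {} features".format(best_pct, best_k))
print("Smallest subset within {} point: {} features".format(tolerance_pct, smallest_k))
```

Given that the reported spreads (+/- 0.04 to 0.07) are larger than the differences between the means from 3 features onwards, these differences should be read with caution.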
