# Feature selection lab - WDBC
Lab developed by Gary Marigliano - 07.2018

Now that you are a little more familiar with feature selection, you are going to compare several feature selection techniques on a real-life dataset. The Breast Cancer Wisconsin (Diagnostic) Data Set (WDBC) contains 30 features computed from digitized images. The full details are available [here](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data).

The 1st column is the sample id, the 2nd contains the class ("M" for malignant or "B" for benign), and the last 30 columns contain the features as real numbers.

## Lab goals

* Discover, use and compare several feature selection algorithms on a real-life dataset
* Assess the quality of the selected features given by the algorithms


## TODO in this notebook

* Answer the questions in this notebook (where **TODO student** is written)
* Take a look at the [skfeature](http://featureselection.asu.edu/) Python library. You can/should use some of the feature selection algorithms listed there (the library has already been installed for this project): http://featureselection.asu.edu/html/skfeature.function.html and http://featureselection.asu.edu/tutorial.php

# Prepare the dataset

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline

In [2]:
from sklearn.datasets import load_svmlight_file
from sklearn import preprocessing
import scipy

filename = r"datasets/WDBC/data_WDBC.csv"
df = pd.read_csv(filename, sep=",")
df = df.dropna(axis=1) # remove the last column, which only contains NaN values
df.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [3]:
X = df.drop(['id', 'diagnosis'], axis=1).values

In [4]:
# transform the categorical diagnosis values into numerical values. This is required by many algorithms.
le = preprocessing.LabelEncoder()
le.fit(df["diagnosis"].values)
print("The classes are:",le.classes_)
y = le.transform(df["diagnosis"].values)

print("X contains (n_samples, n_features) =", X.shape)

The classes are: ['B' 'M']
X contains (n_samples, n_features) = (569, 30)


In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

# Extract good features using feature selection algorithms

In this part you will extract good features by trying several feature selection algorithms.

Remember that to extract relevant features you can use filter, wrapper or embedded methods. You can use machine-learning-based techniques, such as reading the synaptic weights of an ANN or getting the feature importances from an ExtraTree, or statistical techniques, such as analyzing the variance of the variables.
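To make the distinction concrete, here is a minimal sketch that contrasts an embedded, ML-based ranking (ExtraTrees importances) with a purely statistical filter (raw variance). It assumes scikit-learn's bundled copy of WDBC (`load_breast_cancer`) as a stand-in for the CSV file used in this notebook; the names `X_demo`, `y_demo` and the choice of 200 trees are illustrative only, and this is not one of the skfeature techniques you are asked to pick.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier

# Assumption: scikit-learn's bundled WDBC copy stands in for data_WDBC.csv.
data = load_breast_cancer()
X_demo, y_demo = data.data, data.target
names = data.feature_names

# Embedded, ML-based: impurity-based importances from an ensemble of random trees.
forest = ExtraTreesClassifier(n_estimators=200, random_state=0)
forest.fit(X_demo, y_demo)
importances = forest.feature_importances_

# Filter, statistical: per-feature variance (ignores the labels, depends on scale).
variances = X_demo.var(axis=0)

# Feature names sorted from most to least "relevant" under each criterion.
top_by_importance = names[np.argsort(importances)[::-1][:5]]
top_by_variance = names[np.argsort(variances)[::-1][:5]]
print("Top 5 by ExtraTrees importance:", list(top_by_importance))
print("Top 5 by raw variance:", list(top_by_variance))
```

The two rankings need not agree: raw variance is dominated by features measured on large scales, whereas the tree importances take the class labels into account.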


**TODO student**

* Pick 3 feature selection techniques from http://featureselection.asu.edu/html/skfeature.function.html
* For each of them:
    * Explain in 20-40 words the idea behind this feature selection technique. You will need to do some research for that. If you don't understand it, pick another one ;-)
    * Indicate the family of this feature selection technique (filter, wrapper, embedded, statistical/ML-based,...)
    * Plot the feature importances (the same way you did in the previous notebook)
    * Choose the N features that you find relevant
    * Justify N and the chosen features
    * Analyze the stability of the selected features (does the FS algorithm always return the same list of features? Prove it)
* Comment on the 3 feature importance plots. Here are some clues about the questions you should ask yourself:
    * Are the features selected by the 3 FS techniques similar?
    * Is the number of features selected by the 3 FS techniques the same?
* Keep the lists of selected features. You will need them later.
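One possible way to quantify the stability asked for above is to run the selector several times and compare the resulting feature sets with the Jaccard similarity (1.0 means identical sets). The sketch below is only an illustration: `select_top_n` is a hypothetical selector built on ExtraTrees with a varying seed, and scikit-learn's bundled WDBC copy is assumed in place of the CSV file. Swap in your own FS technique.

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier


def select_top_n(X, y, n, seed):
    """Hypothetical selector: indices of the n most important features."""
    model = ExtraTreesClassifier(n_estimators=50, random_state=seed)
    model.fit(X, y)
    top = np.argsort(model.feature_importances_)[::-1][:n]
    return frozenset(top.tolist())


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    return len(a & b) / len(a | b)


# Assumption: scikit-learn's bundled WDBC copy stands in for data_WDBC.csv.
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Repeat the selection with different seeds and compare every pair of runs.
runs = [select_top_n(X_demo, y_demo, n=5, seed=s) for s in range(10)]
pairwise = [jaccard(a, b) for a, b in combinations(runs, 2)]
mean_jaccard = float(np.mean(pairwise))
print("Mean pairwise Jaccard similarity:", round(mean_jaccard, 3))
```

A deterministic selector would give a mean of exactly 1.0 here; in that case, resampling the training data between runs is another way to probe stability.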

In [6]:
#TODO student...

# Assess the selected features

Now that you have lists of features (i.e. the 3 lists of N features that your chosen FS techniques gave you), we are going to assess the relevance of these features.

To do that, you are going to create a function that takes a list of features as input and returns one or more score metrics (accuracy, f1-score, sensitivity, specificity,...) for this list. Inside that function, several classifiers will be used to evaluate the performance they can achieve using the selected features. Here is an example of this function:

``` python
def evaluate_features(features):
    score_clf_1 = compute_score_using_classifier_1(features)
    score_clf_2 = compute_score_using_classifier_2(features)
    score_clf_3 = compute_score_using_classifier_3(features)
    # ....
    return find_a_way_to_show_these_scores_nicely(score_clf_1, score_clf_2, score_clf_3)
```

Pay attention to the following points:
* the classifiers you use may not be deterministic, so you may want to run them multiple times and average the scores
* try to choose classifiers that differ in how they use the data. Using 3 tree-based classifiers is not the best idea
* try to choose classifiers that you didn't use to get the lists of features in the first place
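The points above can be sketched as follows. This is one possible shape for such a function, not the expected solution: the signature (passing `X` and `y` explicitly), the three classifiers, the standardization step and the repeated 5-fold cross-validation are all assumptions made for illustration, and scikit-learn's bundled WDBC copy is used in place of the CSV file.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def evaluate_features(X, y, features, n_repeats=3):
    """Return {classifier name: mean cross-validated accuracy} for the given columns."""
    # Three classifiers that use the data differently (distance, margin, linear model).
    classifiers = {
        "KNN": KNeighborsClassifier(n_neighbors=5),
        "SVM": SVC(kernel="rbf"),
        "LogReg": LogisticRegression(max_iter=1000),
    }
    X_sel = X[:, list(features)]
    results = {}
    for name, clf in classifiers.items():
        # Assumption: features are standardized inside each fold.
        model = make_pipeline(StandardScaler(), clf)
        scores = []
        # Repeat with different shuffles to average out split-dependent variation.
        for seed in range(n_repeats):
            cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
            scores.extend(cross_val_score(model, X_sel, y, cv=cv, scoring="accuracy"))
        results[name] = float(np.mean(scores))
    return results


# Assumption: scikit-learn's bundled WDBC copy stands in for data_WDBC.csv.
X_demo, y_demo = load_breast_cancer(return_X_y=True)
scores_subset = evaluate_features(X_demo, y_demo, features=[0, 1, 2])
scores_all = evaluate_features(X_demo, y_demo, features=range(X_demo.shape[1]))
print("First 3 columns:", scores_subset)
print("All features:   ", scores_all)
```

Returning a dictionary keeps the per-classifier scores separate, which makes it straightforward to feed them into a grouped bar chart afterwards.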

**TODO student**

* Write the `evaluate_features()` function with at least 3 classifiers (for example ANN, SVM and KNN)
* Use this function with the lists you got from your FS algorithms
* Use this function with a random list of selected features (same size as the other lists)
* Use this function with all the features
* Make a plot similar to the one just below (see https://matplotlib.org/examples/api/barchart_demo.html)
* Comment on the results. Here are some clues about the questions you should ask yourself:
    * How do the scores of the lists of selected features compare to those of the random list and of all the features?
    * How do the classifiers inside `evaluate_features()` behave? Do they generally prefer one list?

<img src="assets/02-WDBC-perf-plot.png" />

In [7]:
#TODO student...

### Going further (optional)

Now that you have finished this notebook, it can be interesting to go a step further and try the points below:

* Can we get better results (i.e. more relevant features and/or fewer features) if the input data are normalized?
* Compare the execution times of the FS algorithms you used. Given this additional information, do you think you can exclude or prefer some FS techniques *for this particular case*?
* Plot the classifier performance for the K best features, where K is $1, 2, \dots, k-1, k$, and comment on the results
* Anything you find relevant...

Please answer these questions just below, in this same notebook.
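For the "K best features" point, a minimal sketch could look like the following. It assumes an ANOVA F-score ranking (`f_classif`) as a stand-in for whichever FS ranking you obtained, a standardized KNN classifier, and scikit-learn's bundled WDBC copy instead of the CSV file; all of these are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled WDBC copy stands in for data_WDBC.csv.
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Rank features from best to worst by ANOVA F-score (stand-in for your own ranking).
f_scores, _ = f_classif(X_demo, y_demo)
ranking = np.argsort(f_scores)[::-1]

model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# Mean 5-fold accuracy using only the K best-ranked features, for K = 1..k.
accuracy_per_k = []
for k in range(1, X_demo.shape[1] + 1):
    cols = ranking[:k]
    accuracy_per_k.append(float(cross_val_score(model, X_demo[:, cols], y_demo, cv=5).mean()))

best_k = int(np.argmax(accuracy_per_k)) + 1
print("Best K:", best_k, "with accuracy", round(max(accuracy_per_k), 3))
```

Plotting `accuracy_per_k` against K with `plt.plot` gives the requested curve. Note that the ranking here is computed on the full dataset before cross-validation, so the scores are slightly optimistic; computing the ranking on the training split only would avoid that.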