# Feature selection lab - Golub
Lab developed by Gary Marigliano - 07.2018

## Introduction

In this notebook, we are going to use the Golub dataset, which contains 7129 features, 72 samples (38 for training and 34 for testing) and 2 classes (AML and ALL).

The goal is to get lists of relevant features (100 or fewer each) using several feature selection (FS) techniques. The idea is to combine several of these lists into "super" lists that may perform better than any single list. You will need to be creative when merging the lists: take the union or the intersection of the lists, use each feature's importance score as a weight or as a probability of selecting that feature from a list, and so on.

## Rules to build a super list

* The super list must contain at most 100 features
* You are not forced to use all the lists to build the super list
* The super list must not contain duplicate features
* The super list must use features from at least 2 different FS techniques
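
The merge strategies mentioned in the introduction could be sketched as follows. This is only an illustration under assumptions: each FS technique is assumed to return either a ranked list of feature indices (best first) or a dict mapping a feature index to a strictly positive importance score, and the names `merge_intersection`, `merge_union` and `merge_weighted` are made up for this example.

```python
from itertools import zip_longest

import numpy as np

MAX_FEATURES = 100  # rule: a super list holds at most 100 features


def merge_intersection(list_a, list_b):
    """Keep the features present in both lists, in the order of list_a."""
    in_b = set(list_b)
    return [f for f in list_a if f in in_b][:MAX_FEATURES]


def merge_union(list_a, list_b):
    """Interleave two ranked lists, skip duplicates, then apply the size cap."""
    merged, seen = [], set()
    for pair in zip_longest(list_a, list_b):
        for f in pair:
            if f is not None and f not in seen:
                seen.add(f)
                merged.append(f)
    return merged[:MAX_FEATURES]


def merge_weighted(scored_lists, k, seed=0):
    """Draw k distinct features with probability proportional to their score.

    scored_lists: iterable of dicts {feature_index: positive score}.
    Scores are normalised per list so that lists on different scales
    contribute comparably (an assumption of this sketch).
    """
    weights = {}
    for scores in scored_lists:
        total = sum(scores.values())
        for f, s in scores.items():
            weights[f] = weights.get(f, 0.0) + s / total
    features = np.array(list(weights))
    p = np.array(list(weights.values()))
    p = p / p.sum()
    k = min(k, MAX_FEATURES, len(features))
    rng = np.random.default_rng(seed)
    return rng.choice(features, size=k, replace=False, p=p).tolist()
```

Keep in mind that an intersection can end up very small (or even empty) when two techniques disagree, whereas a union needs a cut-off to respect the size limit.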

You can use the feature selection algorithms listed here (the Python library should already be installed for this project): http://featureselection.asu.edu/html/skfeature.function.html and http://featureselection.asu.edu/tutorial.php
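
To give an idea of what a filter-style FS technique produces, here is a minimal sketch of a Fisher score ranking written directly in NumPy. It is a hand-written version of the usual formula (between-class scatter divided by within-class scatter, per feature), not the code of the library linked above, and the names `fisher_scores` and `top_k` are made up for this example.

```python
import numpy as np


def fisher_scores(X, y):
    """Return one Fisher score per column of X (higher = more discriminative)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for label in np.unique(y):
        X_c = X[y == label]
        n_c = X_c.shape[0]
        # Spread of the class mean around the overall mean, weighted by class size.
        between += n_c * (X_c.mean(axis=0) - overall_mean) ** 2
        # Spread of the samples inside the class, weighted by class size.
        within += n_c * X_c.var(axis=0)
    # Guard against constant features (zero within-class variance).
    return between / np.maximum(within, 1e-12)


def top_k(scores, k=100):
    """Indices of the k highest-scoring features, best first."""
    return np.argsort(scores)[::-1][:k]
```

With the data loaded in this notebook, `top_k(fisher_scores(X_train, y_train))` would give one candidate list of 100 feature indices.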


## TODO in this notebook

Answer the questions in this notebook (where **TODO student** is written)

In [1]:
import glob
import pandas as pd
from sklearn import preprocessing

In [2]:
le_classes = preprocessing.LabelEncoder()
le_classes.fit(["AML", "ALL"])

LabelEncoder()

In [3]:
def parse_dataset(filenames, le_classes):
    # Each file holds one sample: turn its VALUE column into a single row
    # indexed by the file name (without the directory part).
    rows = []
    for filename in filenames:
        row = pd.read_csv(filename, sep="\t", usecols=["ID_REF", "VALUE"]).transpose().drop("ID_REF", axis=0)
        row.index = [filename.split("/")[-1]]
        rows.append(row)

    # Concatenate once instead of growing the DataFrame inside the loop.
    df = pd.concat(rows)
    X = df.values

    # The class label (AML or ALL) is made of the 3 characters before ".csv".
    y = [fname[-7:-4] for fname in df.index.values]
    y = le_classes.transform(y)

    return X, y
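
The label extraction used in `parse_dataset()` can be checked in isolation. The file names below are hypothetical (the real ones only need to end in the class name followed by `.csv`), and the encoding is reproduced in plain Python, relying on the fact that `LabelEncoder` stores its classes in sorted order.

```python
# Hypothetical file names, for illustration only.
filenames = ["sample_01_ALL.csv", "sample_02_AML.csv"]

# Same slicing as in parse_dataset(): the 3 characters before ".csv".
labels = [name[-7:-4] for name in filenames]

# LabelEncoder sorts its classes, so ALL is encoded as 0 and AML as 1.
classes = sorted(["AML", "ALL"])
encoded = [classes.index(label) for label in labels]

print(labels, encoded)
```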

In [4]:
train_filenames = glob.glob("./datasets/golub/train/*.csv")
test_filenames = glob.glob("./datasets/golub/test/*.csv")

X_train, y_train = parse_dataset(train_filenames, le_classes)
X_test, y_test = parse_dataset(test_filenames, le_classes)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(38, 7129) (38,) (34, 7129) (34,)


In [5]:
df = pd.read_csv(train_filenames[0], sep="\t", usecols=["ID_REF", "VALUE"]).transpose().drop("VALUE", axis=0)
features_name = df.values[0]
print(features_name)

['AFFX-BioB-5_at' 'AFFX-BioB-M_at' 'AFFX-BioB-3_at' ..., 'L49218_f_at'
 'M71243_f_at' 'Z78285_f_at']


**TODO student**

* Get 3 super lists, each using at least 2 different FS techniques. Here are two examples of how a super list can be built:
    * FS techniques: ExtraTrees + Fisher score; Merge technique: intersection of the lists
    * FS techniques: Recursive Feature Elimination SVM + ANN + MRMR; Merge technique: pick K features with a probability based on each feature's importance score
* For each super list:
    * Get the `evaluate_features()` function you wrote in 02-WDBC
    * Use this function with the super list
    * Use this function with a random list of selected features (same size as the super list)
    * Use this function with all the features
    * Make a plot similar to the one just below (see https://matplotlib.org/examples/api/barchart_demo.html)
    * Comment on the results. Here are some questions you should ask yourself:
        * How do the scores of the selected-feature lists compare with those of the random list and of all the features?
        * How do the classifiers inside `evaluate_features()` behave? Do they generally prefer one list?
        
* Pay attention to the following points:
    * the classifiers you use may not be deterministic, so you may want to run them multiple times and average the scores
    * try to choose classifiers that differ in how they use the data. Using 3 tree-based classifiers is not a good idea
    * try to choose classifiers that you did not use to get the lists of features in the first place

* Comment on your results. Here are some questions you should ask yourself:
    * What kind of problems could you encounter when merging the lists?
    * Do the super lists built from more lists perform better?
    * What about the execution time?
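
The random baseline and the averaging over several runs mentioned above could be organised as in the sketch below. It makes assumptions: `score_fn` is a hypothetical stand-in for your own `evaluate_features()`, whose real signature is not known here, and it is assumed to accept the reduced data, the labels and a seed, and to return a single number.

```python
import numpy as np


def random_feature_list(n_features, size, rng):
    """Pick `size` distinct feature indices uniformly at random."""
    return rng.choice(n_features, size=size, replace=False)


def mean_score(score_fn, X, y, features, n_runs=10):
    """Average score_fn over several runs, with a different seed each time."""
    scores = [score_fn(X[:, features], y, seed) for seed in range(n_runs)]
    return float(np.mean(scores)), float(np.std(scores))
```

A random list with the same size as a super list is then `random_feature_list(X_train.shape[1], len(super_list), np.random.default_rng(0))`, where `super_list` is whatever list you built. Reporting the standard deviation next to the mean helps to judge whether a difference between two lists is larger than the run-to-run noise.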

<img src="assets/02-WDBC-perf-plot.png" />
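
A grouped bar chart in the spirit of the matplotlib demo linked above could be drawn as follows. This is a sketch only: `plot_scores` is a made-up helper, the classifier names are hypothetical, and the numbers are placeholders chosen so that the code runs, not measured results.

```python
import matplotlib.pyplot as plt
import numpy as np


def plot_scores(scores, classifiers, ylabel="score"):
    """Draw one group of bars per classifier and one bar per feature list."""
    x = np.arange(len(classifiers))
    width = 0.8 / len(scores)
    fig, ax = plt.subplots()
    for i, (name, values) in enumerate(scores.items()):
        ax.bar(x + i * width, values, width, label=name)
    # Centre the tick labels under each group of bars.
    ax.set_xticks(x + width * (len(scores) - 1) / 2)
    ax.set_xticklabels(classifiers)
    ax.set_ylabel(ylabel)
    ax.legend()
    return fig, ax


classifiers = ["classifier A", "classifier B", "classifier C"]  # hypothetical names
placeholder_scores = {  # placeholder values, NOT real results
    "super list": [0.3, 0.3, 0.3],
    "random list": [0.2, 0.2, 0.2],
    "all features": [0.1, 0.1, 0.1],
}
fig, ax = plot_scores(placeholder_scores, classifiers)
```

Replace the placeholder values with the averaged scores returned by your own `evaluate_features()` for each list and each classifier.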