# MLBD

## Lab - Feature Selection

## Exercise 1 - Noisy Iris

- Developed by _Gary Marigliano - July 2018_

- Modified by _Shabnam Ataee - March 2020_

## Assistant
Shabnam Ataee

## Introduction

First, read the _ReadMe_ notebook and install the packages required for this lab.

In this exercise, the [famous iris](https://en.wikipedia.org/wiki/Iris_flower_data_set) dataset has been modified by inserting noisy features. The goal is to recover the 4 original features (sepal length/width and petal length/width) using feature selection models.

You can use some of the feature selection algorithms listed here (the Python library should already be installed for this exercise): http://featureselection.asu.edu/html/skfeature.function.html and http://featureselection.asu.edu/tutorial.php

## ToDo in this notebook

Answer the questions in this notebook wherever **_ToDo_** is written.

## Group members

As mentioned in the _ReadMe_ notebook, you will work in groups of 2 or 3 students during this lab. Please specify the first name, last name and email address of each group member here.

In [1]:
# ToDo ...


## Set up the Iris dataset

In [2]:
import numpy as np
from matplotlib import pyplot as plt
import itertools
%matplotlib inline

Below, new noisy features are added to the iris dataset.

In [3]:
from sklearn import datasets

iris = datasets.load_iris()

X = iris.data
y = iris.target

## Add some noisy features in the iris dataset

# Add a feature that is always equal to a constant independently of the output --> useless feature
constant_features = np.array([[12 for _ in range(X.shape[0])]]).transpose()
X = np.append(X, constant_features, axis=1)

# Add random noisy features.
# Each one is a copy of column index 1 (sepal width, stored in first_feat below)
# shifted by a single random offset drawn uniformly from [-k, k), where k is the noise level
noise_levels = np.arange(1, 6, 0.3)
first_feat = X[:, 1]

n_samples = X.shape[0]
for k in noise_levels:
    noise = k*(np.random.rand() * 2 - 1)
    noisy_features = [noise + first_feat[x] for x in range(n_samples)]
    noisy_features = np.array([noisy_features]).transpose()
    X = np.append(X, noisy_features, axis=1)

# Here we can see that the 5th column is always equal to 12. The columns after it are the noisy features.
print(X[:3, :])

[[  5.1          3.5          1.4          0.2         12.           3.27265586
    3.01640062   3.55403767   1.60548721   4.47214325   5.82137418
    2.38411585   4.07326902   0.13133793   6.97527672   2.34652693
    6.35289761   4.53652924   3.7418375    4.55191418   1.42010491
   -0.38458357]
 [  4.9          3.           1.4          0.2         12.           2.77265586
    2.51640062   3.05403767   1.10548721   3.97214325   5.32137418
    1.88411585   3.57326902  -0.36866207   6.47527672   1.84652693
    5.85289761   4.03652924   3.2418375    4.05191418   0.92010491
   -0.88458357]
 [  4.7          3.2          1.3          0.2         12.           2.97265586
    2.71640062   3.25403767   1.30548721   4.17214325   5.52137418
    2.08411585   3.77326902  -0.16866207   6.67527672   2.04652693
    6.05289761   4.23652924   3.4418375    4.25191418   1.12010491
   -0.68458357]]


In [4]:
X.shape

(150, 22)
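
The constant column added above has zero variance, so it cannot help any classifier. As a minimal sketch (not part of the original lab), scikit-learn's `VarianceThreshold` can detect such a column. The names `X_demo`, `constant_col` and `X_reduced` are illustrative, and the example rebuilds a small matrix from the unmodified iris data rather than using the notebook's `X`.

```python
import numpy as np
from sklearn import datasets
from sklearn.feature_selection import VarianceThreshold

# Illustrative data: the 4 iris features plus one constant column (assumption: mirrors the setup above).
X_demo = datasets.load_iris().data
constant_col = np.full((X_demo.shape[0], 1), 12.0)
X_demo = np.hstack([X_demo, constant_col])

# With its default threshold (0.0), VarianceThreshold keeps only features with non-zero variance.
selector = VarianceThreshold()
X_reduced = selector.fit_transform(X_demo)

kept = selector.get_support(indices=True)
print("kept column indices:", kept)
print("shape before:", X_demo.shape, "after:", X_reduced.shape)
```

Note that this filter only removes the constant column. The noisy columns are shifted copies of a real feature, so their variance is non-zero and they would pass through unchanged.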

Next, we select the relevant features. We begin with the "Data Preparation" phase of the data science pipeline.

This means splitting the data into a train set (67%) and a test set (33%).


In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
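
The split above is random, so the score and the feature ranking can change between runs. The sketch below shows one possible way to make the split reproducible and class-balanced, using the `random_state` and `stratify` parameters of `train_test_split`. The seed value 42 and the `_demo` variable names are arbitrary choices for illustration, and the example uses the unmodified iris data.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris_demo = datasets.load_iris()
X_demo, y_demo = iris_demo.data, iris_demo.target

# random_state fixes the shuffle; stratify keeps the class proportions in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

print("train/test sizes:", X_tr.shape[0], X_te.shape[0])
print("test class counts:", np.bincount(y_te))
```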

The example below shows how to train a classifier and obtain the features sorted by decreasing importance.

In [6]:
from sklearn.ensemble import ExtraTreesClassifier

# train
clf = ExtraTreesClassifier(n_jobs=2, n_estimators=10)
clf.fit(X_train, y_train)

# get the score
score = clf.score(X_test, y_test)
print("score {:.3f}".format(score))

# rank the features
importances = clf.feature_importances_

indices = np.argsort(importances)[::-1]

n_features = X_train.shape[1]

# get the features sorted by decreasing importance
feat_importances_sorted = [(indices[f], importances[indices[f]]) for f in range(n_features)]

score 0.860
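
The cell above builds the ranking but does not display it. As a small sketch of how such a list of `(index, importance)` pairs can be inspected, the example below repeats the same steps on the unmodified iris data and prints one line per feature. The names `clf_demo`, `importances_demo` and `ranking` are illustrative, and `random_state=0` is an added assumption to make the output repeatable.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier

iris_demo = datasets.load_iris()
clf_demo = ExtraTreesClassifier(n_estimators=10, random_state=0)
clf_demo.fit(iris_demo.data, iris_demo.target)

# Importances are normalized by scikit-learn so that they sum to 1.
importances_demo = clf_demo.feature_importances_
order = np.argsort(importances_demo)[::-1]  # indices from most to least important
ranking = [(int(i), float(importances_demo[i])) for i in order]

for rank, (idx, imp) in enumerate(ranking, start=1):
    print(f"{rank}. feature {idx} ({iris_demo.feature_names[idx]}): {imp:.3f}")
```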


**_ToDo_**: 
* Draw a feature importance plot using a bar chart (see picture below)
* Answer the following questions:
   * What does this plot represent?
   * How do you compare two features using this plot?
   * How would you choose a "good" number of features?
   * How can you be sure that the features you have selected are relevant? What kind of tasks should you do?
   * How could you prove it?
   * For this modified dataset, is it really useful to reduce the number of features?
   * How easy/hard is it to retrieve the original features?

<img src="pictures/01-noisy-iris-feat-importances-example.png" />

In [7]:
# ToDo ...


## Choose the best _n_ features

Now that we have the features sorted by decreasing importance, your task is to choose the best ones.

In [8]:
# Helper that plots a confusion matrix as an annotated heat map.
# cm: confusion matrix (2D array); classes: tick labels; title: plot title.

def plot_confusion_matrix(cm, classes, title, cmap=plt.cm.Blues):

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar(fraction=0.046, pad=0.04)
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')

**_ToDo_**:

* Choose _n_ features that you find relevant.
* Justify the number _n_ that you have chosen.
* Use a confusion matrix or another relevant score metric to compare the classifier performance between:
    * your selected features and the noisy iris dataset 
    * your selected features and some _n_ random features (take the average score of K runs for random features)
    * your selected features and the worst _n_ features (look at your feature importance plot)
* Answer the following questions:
    * Among the features you have selected, how many are the original ones?
    * Among the features you have selected, is there any useless feature (the one which always contains the same value)?
    
To plot a prettier confusion matrix you can use the following code:

``` python
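from sklearn.metrics import confusion_matrix

# X_test_random is a placeholder: the test set restricted to the columns clf was trained on
y_pred = clf.predict(X_test_random)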
cm = confusion_matrix(y_test, y_pred)
n_classes = len(np.unique(y))
plot_confusion_matrix(cm, classes=range(n_classes), title="Confusion Matrix")
```
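
For the comparison against random features, the list above asks for the average score over K runs. The sketch below shows one possible way to compute it. It is a sketch under stated assumptions, not the expected solution: the helper `mean_random_subset_score` and its parameters `k_runs` and `seed` are hypothetical names, accuracy is used as the metric, and the demo runs on the unmodified iris data.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split


def mean_random_subset_score(X_tr, X_te, y_tr, y_te, n, k_runs=20, seed=0):
    """Mean and std of test accuracy over k_runs models, each trained on n random columns."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(k_runs):
        # Draw n distinct column indices for this run.
        cols = rng.choice(X_tr.shape[1], size=n, replace=False)
        model = ExtraTreesClassifier(n_estimators=10, random_state=0)
        model.fit(X_tr[:, cols], y_tr)
        scores.append(model.score(X_te[:, cols], y_te))
    return float(np.mean(scores)), float(np.std(scores))


# Demo on the unmodified iris data (assumption: the lab's noisy matrix would be used instead).
iris_demo = datasets.load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris_demo.data, iris_demo.target, test_size=0.33, random_state=0
)
mean_score, std_score = mean_random_subset_score(X_tr, X_te, y_tr, y_te, n=2, k_runs=5)
print(f"mean accuracy {mean_score:.3f} (std {std_score:.3f})")
```

Reporting the standard deviation next to the mean helps judge whether a difference between the selected features and the random baseline is larger than the run-to-run variation.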

In [9]:
# ToDo ...


### Going further (optional)

Now that you have finished this notebook, it can be interesting to go one step further and try the following scenarios:

* Can we achieve better results (i.e. more relevant features and/or fewer features) if we normalize the dataset?
* Can we retrieve the same relevant features by applying another feature selection model?
* Plot the classifier performance for the best $K$ features, where $K = 1, 2, \dots, k-1, k$, and comment on the results.
* ...

Please answer the above questions below in this notebook.
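
As a possible starting point for the normalization and alternative-model scenarios, the sketch below standardizes the features with `StandardScaler` and ranks them with a univariate filter, `SelectKBest` using the ANOVA F-test (`f_classif`). The choice of this filter, the value `k=2` and the variable names are assumptions made for illustration. The demo uses the unmodified iris data, not the noisy matrix from this lab.

```python
import numpy as np
from sklearn import datasets
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

iris_demo = datasets.load_iris()

# Standardize each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(iris_demo.data)

# Score every feature independently with the ANOVA F-test and keep the k best.
kbest = SelectKBest(score_func=f_classif, k=2)
kbest.fit(X_scaled, iris_demo.target)

f_ranking = np.argsort(kbest.scores_)[::-1]  # indices from highest to lowest F-score
selected = kbest.get_support(indices=True)

print("F-scores:", np.round(kbest.scores_, 1))
print("ranking (best first):", f_ranking)
print("selected columns:", selected)
```

Comparing this ranking with the one produced by `ExtraTreesClassifier` is one way to check whether two different selection methods agree on the relevant features.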