# Session 4: Model & Feature Selection

## 4.1 Objectives
As you may have noticed in the previous sessions, two practices recur when solving a machine learning problem. The first, introduced in Session 1, is feature selection: some models are very sensitive to irrelevant features, which distort distance measures, for instance. The second, introduced in Session 2, is model selection: most prediction models have one or several **meta-parameters** whose values must be chosen carefully to obtain good performance. This session investigates both procedures in more depth.

In [3]:
import scipy.io
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import itertools

from sklearn.feature_selection import mutual_info_regression

from rbfn import MyRBFN

## 4.2 Creating a dataset
For this session, you have to build an artificial dataset containing 1000 observations. For each observation $x_i=\left(x_{i,1},\dots,x_{i,6}\right)$, the values of the 6 features are randomly chosen in the interval $\left[  0,1 \right]$ (you can use `numpy.random.random_sample` for this). The target is then computed as:
\begin{equation}
	f(x_i) = 2 \sin\left(2 x_{i,1}\right) x_{i,2} + 4 \left(x_{i,3}-0.5\right)^2 + x_{i,4} + \epsilon_i
\end{equation}
where $\epsilon_i$ is a noise component following a normal distribution ${\mathcal N}(0,0.01)$ (you can use `numpy.random.normal` for this). Before going ahead, take a look at the equation above. What can you say about the six features of our problem? Are they all equally useful?

In [22]:
def draw_samples(nb_samples):
    # Generate x and f(x) according to the equation above
    # x.shape = (nb_samples, 6)
    # y.shape = (nb_samples, 1)
    x = np.random.random_sample((nb_samples, 6))
    # np.random.normal takes the standard deviation as its second argument
    noise = np.random.normal(0, 0.01, (nb_samples,))
    # Only the first four features enter the target; features 5 and 6 are unused
    y = 2*np.sin(2*x[:, 0])*x[:, 1] + 4*(x[:, 2] - 0.5)**2 + x[:, 3] + noise
    return x, y[:, np.newaxis]

In [23]:
x_train, y_train = draw_samples(1000)

## 4.2 Feature selection
Feature selection can be performed with a criterion that quantifies how relevant each feature is for predicting the target. We will investigate two criteria: the correlation coefficient (using `numpy.corrcoef`) and the mutual information (using `sklearn.feature_selection.mutual_info_regression`). A simple feature selection strategy consists of selecting the features that achieve a sufficiently high score for a given criterion. Implement this strategy and apply it to your dataset.

**Analyse** the results. Did you expect these results? Are they consistent with the equation above? Are both criteria appropriate for our task?

Before moving to the next step, **build a reduced training set** containing only the features you have selected. Also keep a copy of the complete training set in order to make comparisons and to assess the benefit of feature selection.
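The thresholding strategy described above can be sketched as follows. This is only an illustration on a small synthetic dataset: the helper `select_features` and the threshold value `0.2` are assumptions made for the example, not part of the exercise statement.

```python
import numpy as np

def select_features(scores, threshold):
    """Return the indices of the features whose absolute score reaches the threshold."""
    scores = np.asarray(scores)
    return np.flatnonzero(np.abs(scores) >= threshold)

# Toy data (illustration only): three features, only the first one drives y
rng = np.random.default_rng(0)
x = rng.random((1000, 3))
y = (x[:, 0] + 0.01 * rng.normal(size=1000))[:, np.newaxis]

# Last row of the correlation matrix = correlation of y with each feature
corr = np.corrcoef(x, y=y, rowvar=False)[-1, :-1]

selected = select_features(corr, threshold=0.2)  # threshold chosen arbitrarily
x_reduced = x[:, selected]                       # reduced training set
print(selected, x_reduced.shape)
```

The absolute value is taken because a strongly negative correlation is as informative as a strongly positive one. The same helper can be applied to mutual information scores, which are non-negative.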


### Correlation

In [24]:
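# Compute the correlation between the output (y) and each feature (x_i)
np.corrcoef(x_train, y=y_train, rowvar=False)[-1, :-1]
# Correlation coefficients lie between -1 and 1.
# Near-zero values for features 5 and 6 are expected: they do not appear in f.
# The near-zero value for feature 3 is misleading: f depends on it, but through a
# quadratic term symmetric around 0.5, which a linear correlation cannot detect.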

array([ 0.40829845,  0.60297706, -0.00147822,  0.40036916,  0.02219411,
       -0.05806866])
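The near-zero correlation of the third feature can be reproduced in isolation. The sketch below, which uses a freshly drawn uniform variable rather than the training set above, compares the quadratic term of the target with a shifted variant (the shift to `0.2` is an arbitrary choice for contrast).

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.random(100_000)                 # uniform on [0, 1], like each feature

symmetric = 4 * (u - 0.5) ** 2          # the term used in the target
shifted = 4 * (u - 0.2) ** 2            # same shape, no longer symmetric on [0, 1]

corr_symmetric = np.corrcoef(u, symmetric)[0, 1]
corr_shifted = np.corrcoef(u, shifted)[0, 1]
print(corr_symmetric, corr_shifted)
```

With the parabola centred on the middle of the interval, the increasing and decreasing halves cancel out and the correlation is close to zero, even though the term is a deterministic function of `u`. Once the symmetry is broken, the correlation becomes large.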

### Mutual information

In [25]:
# Compute the mutual information between the output (y) and each feature (x_i)
mutual_info_regression(x_train, y_train[:, 0])
# Mutual information ranges from 0 to +inf.
# It is close to 0 for features 5 and 6, as expected,
# and clearly non-zero for feature 3, unlike the correlation.

array([0.10833977, 0.25583933, 0.12214406, 0.12547215, 0.0165314 ,
       0.01825161])
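To see why mutual information picks up the quadratic dependence, it can help to estimate it by hand. The sketch below uses a crude histogram-based plug-in estimate, which is a different estimator from the nearest-neighbour one used by `mutual_info_regression`, so the values are not comparable with the output above. The function name `binned_mutual_info` and the choice of 16 bins are assumptions made for this example.

```python
import numpy as np

def binned_mutual_info(a, b, bins=16):
    """Plug-in mutual information estimate (in nats) from a 2-D histogram."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p_ab = joint / joint.sum()                 # joint probabilities
    p_a = p_ab.sum(axis=1, keepdims=True)      # marginal of a, shape (bins, 1)
    p_b = p_ab.sum(axis=0, keepdims=True)      # marginal of b, shape (1, bins)
    mask = p_ab > 0                            # skip empty cells (0 * log 0 = 0)
    return float(np.sum(p_ab[mask] * np.log(p_ab[mask] / (p_a @ p_b)[mask])))

rng = np.random.default_rng(0)
n = 20_000
u = rng.random(n)                              # feature that enters the target
irrelevant = rng.random(n)                     # feature that does not
target = 4 * (u - 0.5) ** 2 + rng.normal(0, 0.01, n)

mi_quadratic = binned_mutual_info(u, target)
mi_irrelevant = binned_mutual_info(irrelevant, target)
print(mi_quadratic, mi_irrelevant)
```

The estimate for the irrelevant feature is not exactly zero because of the finite sample size, which is also why the scores of features 5 and 6 above are small but positive.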

## 4.3 Model selection
The RBFNs that you implemented in Session 3 have two meta-parameters: the number of centers and the smoothing factor. These meta-parameters cannot be optimized directly on the training set, since their role is to control model complexity and to prevent under- and overfitting. In the remainder of this session, you will **implement a simple validation procedure** that selects good values of the meta-parameters for a specific training set. To train RBFNs, you can use either your own code or the one we provide on Moodle (`rbfn.py`).
**Divide the dataset** you just created into a **training** set ($70\%$) and a **validation** set ($30\%$). Then, build a grid (of reasonable size) of the values you will test for the meta-parameters. For each pair of values (number of centers, smoothing factor), train an RBFN on the training set and measure its error on the validation set. Use the results to select the appropriate number of centers and smoothing factor for your problem. Finally, train an RBFN on the whole dataset using the chosen meta-parameters.
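One possible way to set up the split and the grid is sketched below. The helper `train_val_split` is hypothetical, and the grid values are simply one reasonable choice. Shuffling before splitting is not strictly needed here, since the samples are drawn independently, but it is a safe habit when the data might be ordered.

```python
import itertools
import numpy as np

def train_val_split(x, y, train_fraction=0.7, seed=0):
    """Shuffle the samples, then split them into training and validation parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    n_train = int(round(train_fraction * len(x)))
    train_idx, val_idx = order[:n_train], order[n_train:]
    return x[train_idx], y[train_idx], x[val_idx], y[val_idx]

# Placeholder data with the same shapes as the dataset of this session
x_demo = np.random.default_rng(1).random((1000, 6))
y_demo = x_demo[:, [3]]
x_tr, y_tr, x_va, y_va = train_val_split(x_demo, y_demo)

# Every (number of centers, smoothing factor) pair to evaluate
centers_grid = (10, 25, 50, 75, 100, 150)
width_grid = (0.1, 0.2, 0.5, 1.0, 5.0, 10.0, 20.0)
grid = list(itertools.product(centers_grid, width_grid))
print(x_tr.shape, x_va.shape, len(grid))
```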

**Build a test set** containing 10000 samples. Measure the error your model makes on these data. Repeat the whole procedure detailed in this section using all the features. How do the results compare to the case where features have been selected? Is this always the case?

In [27]:
# Keep the four selected features and draw an independent test set
x_train_reduced = x_train[:,(0,1,2,3)]

x_test,y_test = draw_samples(10000)
x_test_reduced = x_test[:,(0,1,2,3)]

In [34]:
# Meta-parameter search for the RBFN: train on the training set, compare on the validation set
def model_selection(x_train, y_train, x_val, y_val):
    best_score = np.inf
    best_parameters = (0, 0)
    # Grid over (number of centers, smoothing factor)
    for nb_centers in (10, 25, 50, 75, 100, 150):
        for width_scaling in (0.1, 0.2, 0.5, 1., 5., 10., 20.):
            rbfn = MyRBFN(nb_centers, width_scaling)
            rbfn.fit(x_train, y_train)
            # The score is treated as an error here: lower is better
            score = rbfn.score(x_val, y_val)
            print(score)
            if score < best_score:
                best_score = score
                best_parameters = (nb_centers, width_scaling)
    return best_parameters, best_score

def evaluate(x_train, y_train, best_parameters, x_test, y_test):
    rbfn = MyRBFN(best_parameters[0],best_parameters[1])
    rbfn.fit(x_train,y_train)
    score = rbfn.score(x_test,y_test)
    return score
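Since `rbfn.py` is not reproduced here, the following sketch runs the same validation loop with a small stand-in model. `TinyRBFN` is a hypothetical class written for this example and is not the course's `MyRBFN`: it picks random training points as centers, uses a single shared width, and solves for the output weights by least squares. Its `score` method is assumed to return the mean squared error, so lower is better.

```python
import numpy as np

class TinyRBFN:
    """Hypothetical stand-in with a fit/score interface; NOT the course's MyRBFN."""

    def __init__(self, nb_centers, width_scaling, seed=0):
        self.nb_centers = nb_centers
        self.width_scaling = width_scaling
        self.rng = np.random.default_rng(seed)

    def _design(self, x):
        # Squared distance between every sample and every center
        d2 = ((x[:, None, :] - self.centers[None, :, :]) ** 2).sum(axis=2)
        phi = np.exp(-d2 / (2.0 * self.sigma ** 2))
        return np.hstack([phi, np.ones((len(x), 1))])  # extra bias column

    def fit(self, x, y):
        # Assumption: centers are random training points
        idx = self.rng.choice(len(x), size=self.nb_centers, replace=False)
        self.centers = x[idx]
        # Assumption: shared width = scaling * mean distance between centers
        diffs = self.centers[:, None, :] - self.centers[None, :, :]
        dists = np.sqrt((diffs ** 2).sum(axis=2))
        upper = np.triu_indices(self.nb_centers, k=1)
        self.sigma = self.width_scaling * dists[upper].mean()
        # Output weights by linear least squares
        self.weights, *_ = np.linalg.lstsq(self._design(x), y, rcond=None)
        return self

    def score(self, x, y):
        # Mean squared error (lower is better)
        return float(np.mean((self._design(x) @ self.weights - y) ** 2))

# Toy 2-D regression problem, split 70% / 30%
rng = np.random.default_rng(0)
x = rng.random((400, 2))
y = (np.sin(2 * x[:, 0]) * x[:, 1])[:, np.newaxis]
x_tr, y_tr, x_va, y_va = x[:280], y[:280], x[280:], y[280:]

# Validation error for every pair of meta-parameters
results = {}
for nb_centers in (5, 20):
    for width_scaling in (0.1, 1.0, 5.0):
        model = TinyRBFN(nb_centers, width_scaling).fit(x_tr, y_tr)
        results[(nb_centers, width_scaling)] = model.score(x_va, y_va)

best_parameters = min(results, key=results.get)
best_score = results[best_parameters]
print(best_parameters, best_score)
```

Storing every validation error in a dictionary, rather than tracking only the best one, makes it easy to inspect how the error varies across the grid afterwards.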

In [35]:
# Meta-parameter search using all six features (first 700 samples for training, last 300 for validation)
best_parameters, best_score = model_selection(x_train[:700,:],y_train[:700],x_train[700:,:],y_train[700:])
print(best_parameters, best_score)

216.52046944327023
0.6959565769821802
0.5906822005729948
0.4259114865601797
0.37151829164455175
0.3593842455815681
0.3733597963067306
Stopped at epoch 63
912383.6367337966
Stopped at epoch 69
1.8418115337969712
Stopped at epoch 61
0.5584799248204951
Stopped at epoch 61
0.3941926478114398
Stopped at epoch 62
0.17664338533578453
Stopped at epoch 62
0.18153316217721552
Stopped at epoch 58
0.2227260450783657
Stopped at epoch 35
52268.23964340689
Stopped at epoch 33
1.6900045586698154
Stopped at epoch 32
0.5520080771869882
Stopped at epoch 33
0.3502765803984098
Stopped at epoch 34
0.06184699178797489
Stopped at epoch 36
0.06310700994333103
Stopped at epoch 31
1.3984813551539712
Stopped at epoch 23
72974094.87021206
Stopped at epoch 23
10.03009677174635
Stopped at epoch 24
0.5547088478323865
Stopped at epoch 23
0.2919496101906804
Stopped at epoch 24
0.04591968515118244
Stopped at epoch 21
0.04520538867240998
Stopped at epoch 23
52.09052983509665
Stopped at epoch 17
36329.846446564545
Stopped

In [36]:
evaluate(x_train,y_train,best_parameters,x_test,y_test)

Stopped at epoch 18


0.015328965734096601

#### With feature selection

In [37]:
best_parameters, best_score = model_selection(x_train_reduced[:700,:],y_train[:700],x_train_reduced[700:,:],y_val=y_train[700:])
print(best_parameters, best_score)

729.8473566923778
0.6898799557955133
0.5599581866637927
0.40531627839828427
0.33221450878427783
0.31356940270133926
0.3257529093356801
Stopped at epoch 52
2386584.0500843087
Stopped at epoch 54
1.4580103871790726
Stopped at epoch 54
0.5430594567913277
Stopped at epoch 51
0.3265759307280358
Stopped at epoch 51
0.06876795839020501
Stopped at epoch 52
0.053356367222977916
Stopped at epoch 53
0.3298794111794873
Stopped at epoch 29
1054.257685815239
Stopped at epoch 29
1.234035732011859
Stopped at epoch 28
0.5505084138902425
Stopped at epoch 28
0.30243974784038635
Stopped at epoch 27
0.019357868146828845
Stopped at epoch 28
0.041425542091269654
Stopped at epoch 29
3.4716265799863266
Stopped at epoch 19
35700906.125846855
Stopped at epoch 18
1.020054981149734
Stopped at epoch 19
0.5854118268332432
Stopped at epoch 17
0.2877075067141525
Stopped at epoch 21
0.013299094982684948
Stopped at epoch 18
0.23283061169396113
Stopped at epoch 16
1.1498257610943632
Stopped at epoch 15
72072.10182132508


In [38]:
evaluate(x_train_reduced,y_train,best_parameters,x_test_reduced,y_test)

Stopped at epoch 13


0.011077832567678345
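To put the two test errors side by side, the relative reduction obtained with feature selection can be computed directly from the values reported above. This is a single run, so the figure should be read as indicative only: the data, the centers found during training, and the selected meta-parameters all change from one run to the next.

```python
# Test errors copied from the two evaluation cells above
err_all_features = 0.015328965734096601
err_selected = 0.011077832567678345

relative_reduction = (err_all_features - err_selected) / err_all_features
print(f"Relative reduction of the test error: {relative_reduction:.1%}")
# prints "Relative reduction of the test error: 27.7%"
```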