# [LELEC2870] - Machine Learning

## Practical Session 2 - Feature Selection

Prof. M. Verleysen<br>
Prof. J. Lee<br>

**Teaching assistants :**  
Edouard Couplet : edouard.couplet@uclouvain.be  <br>
Cyril De Bodt: cyril.debodt@uclouvain.be<br>
Dany Rimez: dany.rimez@uclouvain.be<br>
Niels Sayez : niels.sayez@uclouvain.be <br> 
Antoine Vanderschueren : antoine.vanderschueren@uclouvain.be<br>

Last session, you implemented linear regression between the target values and a few features of a dataset.

This second session will focus on the behaviour one should adopt when confronted with a new dataset with a larger number of features. Indeed, preprocessing must be applied to the data received to ensure that machine learning models can be trained on this data with the available computational resources and that these models produce the best performance.

You will first make sure that the data can be ingested by a linear regressor and then apply several feature selection methods you learned about during the lectures in order to train better and better models.

We provide you with a dataset for the topic covered in this session. This dataset can be found on the Moodle page of this course.

<div class="alert alert-success">
    
* **Load** the dataset from disk.

</div>

In [None]:
import pandas as pd

import matplotlib.pyplot as plt

import numpy as np

%matplotlib inline
import seaborn as sns

import sklearn
from sklearn.linear_model import LinearRegression

In [None]:
## Load the dataset here in a pandas dataframe
## Display the information of the dataframe

## 1.  Preprocessing / Feature Engineering


<div class="alert alert-info">

When facing a new dataset, one must pay attention to several aspects regarding the following themes:

* **Sanity of data**: Are all the values in the dataset in a format that can be interpreted by the model we want to train?
    
    
* **Redundancy of data**: Do all features bring new information or is there repetition between features?
    
    
* **Relevance of data** : Does a feature bring information about the target?
    
    
* **Balance of influences** : Will the orders of magnitude of the numerical values of the features influence the training of my models?
    
The initial data may contain flaws with respect to one or more of these themes. Careful inspection of the data is therefore necessary in order to ensure that the models that will be trained do not suffer from biases that can be avoided.

    
</div>  

### 1.a Sanity Verification

The first aspect to check is that the models are fed only with information they can exploit, i.e. that no data entry interferes with proper computation.

#### Missing Values
It is quite common when facing a new dataset to notice missing values (NaN) that will prevent the
proper training of the machine learning models. A common solution to this problem is simply to
drop the data samples containing NaNs. Your first task in this session is to remove those entries
from the dataset.

<div class="alert alert-success">
    
* **Remove** the samples of the dataset with missing information using pandas functions.

</div>
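
As an illustration of the dropping strategy described above, here is a minimal sketch on a hypothetical toy dataframe (the names `toy`, `rooms` and `price` are made up for this example and are not part of the course dataset). It assumes that losing a few rows is acceptable.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only
toy = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0, 4.0],
    "price": [100.0, 150.0, np.nan, 200.0],
})

# Count the missing values per column before deciding what to drop
n_missing = toy.isna().sum()

# Drop every row containing at least one NaN, then renumber the rows
toy_clean = toy.dropna(axis=0, how="any").reset_index(drop=True)

print(n_missing)
print(toy_clean)
```

Checking `n_missing` first is useful: if one column holds most of the NaNs, dropping rows may discard a large share of the data.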

#### Data Format

Some features may be expressed in formats that cannot be processed as they are (mainly categorical features) and therefore must be expressed differently to be ingested by a model. 

**Note:**      A **single** original feature may be replaced by **more than one** new feature.

<div class="alert alert-success">
    
* **Identify** features requiring a new expression
* **Replace** those features with equivalent ones able to be given as input to a model.
    
</div>
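
One possible re-expression of a categorical feature is one-hot encoding, sketched below with `pandas.get_dummies`. The dataframe `toy` and its columns `area` and `location` are hypothetical, and treating every non-numeric column as categorical is an assumption that may not hold for every dataset.

```python
import pandas as pd

# Hypothetical toy data; "location" plays the role of a categorical feature
toy = pd.DataFrame({
    "area": [50, 80, 65, 120],
    "location": ["coast", "inland", "coast", "island"],
})

# Assumption: every non-numeric column is categorical
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Replace each categorical column by one 0/1 column per category
encoded = pd.get_dummies(toy, columns=categorical_cols, dtype=float)

print(encoded.columns.tolist())
```

Here a single original feature becomes three new ones, one per category. Passing `drop_first=True` to `get_dummies` removes one of them, which avoids perfectly collinear columns.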

#### Outlier Removal

Some of the samples in the dataset may contain values of features that are not representative of
the general distribution of the other samples i.e. outliers. Remove those samples from the
dataset.

<div class="alert alert-success">
    
* **Choose** a criterion to find outliers. How would you use that criterion?
* **Remove** the samples being outliers in the distributions of one or more features.
* **Show** box plots of the distributions of each feature, before and after removing outliers.
    
* What other processing could you do instead of removing samples? **Discuss**.

</div>
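
One common criterion, among others, is the interquartile range (IQR) rule: a value is flagged when it lies more than 1.5 IQR below the first quartile or above the third quartile. The sketch below applies it to a hypothetical dataframe `toy`; the factor 1.5 is a convention, not a requirement of this session.

```python
import pandas as pd

# Hypothetical toy data with one obvious outlier in "income"
toy = pd.DataFrame({
    "income": [2.0, 2.5, 3.0, 3.5, 4.0, 50.0],
    "age": [20.0, 25.0, 30.0, 35.0, 40.0, 45.0],
})

# Per-feature quartiles and fences
q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep a row only if all of its features lie inside the fences
inside = ((toy >= lower) & (toy <= upper)).all(axis=1)
toy_no_outliers = toy[inside]

print(toy_no_outliers)
```

Calling `toy.boxplot()` and `toy_no_outliers.boxplot()` is one way to draw the before and after box plots, since the whiskers of a box plot follow the same 1.5 IQR convention by default.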

### 1.b Redundancy verification

Observe the correlation matrix between features. Do you find redundant features? Can you
explain how such features affect a model such as a linear regressor or a KNN regressor? In
both cases, is it important to keep or drop redundant features?


<div class="alert alert-info">
    
**KNN regressor/classifier**
    
The principle of the KNN methods is to compare the given input sample(s) to the training data stored in the model's 'memory'.

For any input sample $\mathbf{s}$, the distance between $\mathbf{s}$ and any training sample in training dataset  $\mathbf{X}$ is computed:
    
\begin{equation}
\mathbf{D} = || \mathbf{s} - \mathbf{X}||
\end{equation}
    
Then, only the k samples with the lowest distances in $\mathbf{D}$, called the 'neighbors' of input $\mathbf{s}$ in $\mathbf{X}$, are considered. The output depends on whether KNN is used for classification or regression: 
    
    

* In KNN classification, the output is a class membership. Input $\mathbf{s}$ is classified by a plurality vote of its neighbors, such that $\mathbf{s}$ is assigned to the class most common among its k nearest neighbors. 
    
    
    
    
* In KNN regression, the output is the target value for $\mathbf{s}$. This value is the average of the target values of the k nearest neighbors.
    
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e7/KnnClassification.svg/1024px-KnnClassification.svg.png" alt="Alternative text" width="300"/>


</div>
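
The KNN regression rule described above can be sketched in a few lines of NumPy. The function `knn_regress` and the arrays `X_toy` and `y_toy` are hypothetical names introduced for this illustration, and the Euclidean norm is assumed as the distance.

```python
import numpy as np

def knn_regress(X_train, y_train, s, k=3):
    """Predict the target of one sample s as the mean target of its k nearest neighbors."""
    # Euclidean distance between s and every training sample (one per row)
    distances = np.linalg.norm(X_train - s, axis=1)
    # Indices of the k smallest distances
    neighbor_idx = np.argsort(distances)[:k]
    # Regression output: average of the neighbors' targets
    return float(y_train[neighbor_idx].mean())

# Hypothetical one-dimensional training set
X_toy = np.array([[0.0], [1.0], [2.0], [10.0]])
y_toy = np.array([0.0, 1.0, 2.0, 10.0])

pred = knn_regress(X_toy, y_toy, np.array([1.1]), k=3)
print(pred)
```

Because the prediction depends only on distances, any feature that dominates the norm also dominates the choice of neighbors.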


<div class="alert alert-success">
    
**Look for redundant information in the dataset**
    
* **Plot** the correlation between features, using the 'heatmap' function from the **seaborn** module.
* Can you tell which features are redundant ?

</div>
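
Besides reading the heatmap visually, strongly correlated pairs can be listed programmatically. The sketch below uses synthetic data in which `a_copy` is built as a near duplicate of `a`; the names and the 0.9 cut-off are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Synthetic data: "a_copy" is almost a rescaled copy of "a", "b" is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": 2.0 * a + 0.01 * rng.normal(size=200),
    "b": rng.normal(size=200),
})

# Pearson correlation matrix between features
corr = toy.corr()
# sns.heatmap(corr, annot=True) would display this matrix as a heatmap

# Keep only the upper triangle so each pair appears once, without the diagonal
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Assumed cut-off: pairs with |correlation| above 0.9 are called redundant
redundant_pairs = pairs[pairs.abs() > 0.9]
print(redundant_pairs)
```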

### 1.c Relevance verification 
Observe the correlation matrix between features and the target. Which features do you think are
more likely to be selected when performing Feature Selection for a linear model? Do you think
your answer will also hold for other regression models? Why?
<div class="alert alert-success">
    
**Find most relevant features**
    
* **Plot** the correlation between each feature and the target, using the 'heatmap' function from the **seaborn** module.
* Can you tell which features are the most relevant for a **linear regression model** ?

</div>

### 1.d Balance of Influences

Observe the modified dataset once again: are all features of the same order of magnitude? Does
this variation affect a linear regressor? And a KNN regressor? How would you solve this?

<div class="alert alert-success">
    
**Make sure features have equal impact on the regression models.**
    
* Can you identify which features may overwhelm others ?
* **Scale** each feature to a common range of values.

</div>
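
A minimal sketch of one scaling option, standardization with scikit-learn's `StandardScaler`, is given below on a hypothetical dataframe `toy`. Min-max scaling with `MinMaxScaler` follows the same pattern when a fixed range such as [0, 1] is preferred.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical features with very different orders of magnitude
toy = pd.DataFrame({
    "rooms": [2.0, 3.0, 4.0, 5.0],
    "income": [20000.0, 35000.0, 50000.0, 65000.0],
})

# Standardize each column to zero mean and unit variance
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(toy), columns=toy.columns, index=toy.index
)

print(scaled)
```

After scaling, both columns take the same values here because they differ only by an affine transformation, so neither can dominate a distance computation.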

### 1.e Train-Test Split

To test out the generalization of your linear regressor, a data separation step is necessary. You'll now split the dataset in two equal parts at random ({X_train, Y_train} and {X_test, Y_test}). This will allow you to build a model on the former and assess its performance on the latter.

<div class="alert alert-success">
  
* **Split** the dataset in training and test subsets.

</div>
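
A random half/half split can be sketched with `train_test_split` from scikit-learn. The variables below carry a `_toy` suffix to make clear that they are hypothetical and unrelated to the session's dataset; fixing `random_state` is an optional choice that makes the split reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical features and target
X_toy = pd.DataFrame({"f1": np.arange(10.0), "f2": np.arange(10.0) ** 2})
Y_toy = pd.Series(np.arange(10.0) * 3.0, name="target")

# test_size=0.5 gives two equal parts; rows are shuffled before splitting
X_train_toy, X_test_toy, Y_train_toy, Y_test_toy = train_test_split(
    X_toy, Y_toy, test_size=0.5, random_state=0
)

print(X_train_toy.shape, X_test_toy.shape)
```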

## 2. Feature Selection

Keeping unnecessary features while training a machine learning model has several disadvantages, such as increasing its complexity and decreasing its generalization ability.

Hence, feature selection is one of the important steps when building a machine learning model. The goal of this step is to find the set of features that yields a model with the best possible accuracy and the lowest possible bias.

<div class="alert alert-success">
  
* **Run** the following cells and observe the RMSE obtained after training on all features. 
    
    We will try to improve this performance with feature selection.

</div>

In [None]:
# Compute the Root Mean Square Error
def compute_rmse(predict, target):
    # Flatten both inputs so that (n,) and (n, 1) arrays can be mixed safely
    predict = np.asarray(predict, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    diff = target - predict
    # RMSE = sqrt(mean of squared errors); np.mean returns a scalar,
    # which avoids converting a 1x1 array to float (deprecated in recent NumPy)
    return float(np.sqrt(np.mean(diff ** 2)))

In [None]:
def fit_predict_with_features(X_train,Y_train, X_test, selected_features):
    X_train_filtered = X_train[selected_features]
    X_test_filtered = X_test[selected_features]
    
    linear_regression_m = LinearRegression()
    linear_regression_m.fit(X_train_filtered,Y_train)
    y_pred = linear_regression_m.predict(X_test_filtered)
    
    return y_pred


In [None]:
print('RMSE Without FS')
Y_pred = fit_predict_with_features(X_train,Y_train, X_test, X_train.columns)
print(compute_rmse(Y_pred, Y_test.to_numpy()))


## 2.1 Filtering

A first common method for feature selection is filtering out the available features that are unnecessary/redundant:

![image.png](attachment:image.png)

This filtering can be based on several metrics or statistical tests such as **Correlation coefficient**, **Mutual information**, **Chi-square tests**...

Today, we ask you to select relevant features using the filtering method. To do so, define a threshold on the correlation between
the features and the target and filter out features whose correlation (in absolute value, since a strong negative correlation is informative too) is lower than this threshold.


<div class="alert alert-success">
  
**Feature selection by filtering**
    
* According to your thoughts from section 1.c, **select** a subset of features based on their correlation coefficient with the target.
* Do you get improved performance compared to the unfiltered dataset?

</div>
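
The filtering principle can be sketched on synthetic data as follows. The columns `strong`, `weak` and `target`, the variable `kept` and the threshold value 0.3 are all assumptions made for this illustration; a suitable threshold for the session's dataset has to be chosen from its own correlation values.

```python
import numpy as np
import pandas as pd

# Synthetic data: the target depends strongly on "strong" and barely on "weak"
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
})
toy["target"] = 3.0 * toy["strong"] + 0.1 * toy["weak"] + 0.5 * rng.normal(size=n)

# Absolute correlation of each feature with the target (the target itself is excluded)
corr_with_target = toy.corr()["target"].drop("target").abs()

# Assumed threshold, chosen arbitrarily for this toy example
threshold = 0.3
kept = corr_with_target[corr_with_target > threshold]

print(kept)
```

The result is a pandas Series indexed by feature name, so its index can be used directly to select columns of a dataframe.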

In [None]:
# keep only the features with high correlation

#################### vvvvvvvv
## QUESTION Define threshold and filter out the features with corr < thresh
# create a variable named relevant_features containing a pandas series with the features to keep.
#
# relevant_features = ??????
#################### ^^^^^

rf = [k for k,v in relevant_features.items() if k != 'median_house_value']
print(rf)

In [None]:
y_pred = fit_predict_with_features(X_train,Y_train, X_test, rf)

print('RMSE after Filter')
print(compute_rmse(y_pred, Y_test.to_numpy()))

## 2.2 Wrapper method

Wrapper methods are iterative methods that select a subset of features to train a model: features are added or removed according to conclusions drawn from previous trainings of the model. 

Stopping criteria for selecting the best subset are usually pre-defined e.g. when the performance of the model decreases or a specific number of features has been achieved. 

![image.png](attachment:image.png)

The main advantage of wrapper methods over filter methods is that they evaluate feature subsets with the model itself, which usually results in better accuracy than filter methods. In return, they are computationally more expensive, and greedy searches such as forward or backward selection are not guaranteed to find the optimal subset.


<div class="alert alert-success">
  
**Feature selection by Wrapping methods**
    
* Apply both **Forward** and **Backward** feature selection using classes from **sklearn.feature_selection**
* **Compare** the performance of models trained on different feature subsets.
    
</div>
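
As a hedged illustration of the wrapper idea, the sketch below runs scikit-learn's `SequentialFeatureSelector` in both directions on synthetic data where only `f0` and `f2` drive the target. The data, the number of features to select, the scoring metric and the number of folds are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data: only f0 and f2 influence the target
rng = np.random.default_rng(0)
n = 200
X_toy = pd.DataFrame(rng.normal(size=(n, 4)), columns=["f0", "f1", "f2", "f3"])
y_toy = 4.0 * X_toy["f0"] - 3.0 * X_toy["f2"] + 0.1 * rng.normal(size=n)

selected = {}
for direction in ("forward", "backward"):
    # forward starts from no feature and adds one at a time,
    # backward starts from all features and removes one at a time
    selector = SequentialFeatureSelector(
        LinearRegression(),
        n_features_to_select=2,
        direction=direction,
        scoring="neg_root_mean_squared_error",
        cv=5,
    )
    selector.fit(X_toy, y_toy)
    # get_support() returns a boolean mask over the input columns
    selected[direction] = X_toy.columns[selector.get_support()].tolist()

print(selected)
```

Each step refits the model once per candidate feature and per cross-validation fold, which is where the extra computational cost of wrapper methods comes from.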

### Forward

In [None]:
#################### 
# QUESTION create new linear regressor + Use SFS in forward mode
# What are the selected features ? Print their names.
#
# sfs = ???
# selected_features_sfs_forward = ???
#################### 

In [None]:
y_pred = fit_predict_with_features(X_train,Y_train, X_test, selected_features_sfs_forward)

print('RMSE with wrapper Forward')
print(compute_rmse(y_pred, Y_test.to_numpy()))

### Backward

In [None]:
# Backward
####################
# QUESTION create new linear regressor + Use SFS in backward mode
# What are the selected features ? Print their names.
#
# sfs = ???
# selected_features_sfs_backward = ???
####################

In [None]:
y_pred = fit_predict_with_features(X_train,Y_train, X_test, selected_features_sfs_backward)

print('RMSE with wrapper Backward')
print(compute_rmse(y_pred, Y_test.to_numpy()))