# How the forward Variable selection works ? 
### A simple example with linear regression model

Forward variable selection is an iterative method in which we start with having no feature in the model. 

We apply the regression model and select the feature which is the most significant among the other ones. We compute the residual between the observed variable and its estimate. This residual becomes in next step the variable to be estimated. The process is repeated until the coefficient start to become non significant. Everyone can set its own criteria to stop the process.

## Illustration:

Let's denote $\Large Y$ the variable to be estimated and $\Large X_{1}, X_{2},X_{3},X_{4}, X_{5}$ covariates.

## 1st step

Estimates of five coefficients $ \Large  \theta_{1}, \theta_{2}, \theta_{3}, \theta_{4}, \theta_{5}$ from:

$\Large Y = \theta_{1}X_{1}$ <br>
$\Large  Y = \theta_{2}X_{2}$ <br> 
$\Large  Y = \theta_{3}X_{3}$ <br> 
$\Large  Y = \theta_{4}X_{4}$ <br> 
$\Large  Y = \theta_{5}X_{5}$ <br>

We apply the simple test (t-student test in this case because we will perform a linear regression with OLS) for each $\Large \theta_{i}$ and retain the variable which is the most significant.

Recall that the t-test is performed as follows:

$\Large H_{0}: \theta_{i} = 0$ <br>
$\Large H_{1}: \theta_{i} \neq 0$


t-student statistics is given by:     $\Large \frac{\hat{\theta}_{i} - \theta_{i}}{\hat{\sigma}_{\theta_{i}}} \sim \tau_{(n-1)} $ <br>

$ \Large n$ are observations, $ \Large n-1$ degre of freedom and $\Large \tau$ the student distribution. $ \Large \hat{\sigma}_{\theta_{i}}$ is the estimated standard deviation. 

Let's suppose that $\Large \theta_{3}$ has the smallest p-value meaning that $ \Large X_{3}$ is the most significant variable.

We will compute the residual as follows:    $ \Large residual = Y - \hat{\theta}_{3}X_{3} $

## 2nd step:

Estimates of fourth coefficients $ \Large  \beta_{1}, \beta_{2}, \beta_{4}, \beta_{5}$ from:

$ \Large  residual = \beta_{1}X_{1}$ <br>
$ \Large  residual = \beta_{2}X_{2}$ <br> 
$ \Large  residual = \beta_{4}X_{4}$ <br> 
$ \Large  residual = \beta_{5}X_{5}$ <br>

We will apply again a simple test (t-student test) for each  $\Large \beta_{i}$  and retain the variable which is the most significant.

If there are many covariates, the iteration could stop when covariate start to be non significant in explaining $\Large Y$. (However, everyone can set its own criteria to stop iterate).

## Application

We're going to work on the python diabetes database. The initial database contains n = 442 patients and p = 10 covariates. The variable Y to be explained is a score corresponding to the evolution of the disease. For fun, a malicious robot contaminated the dataset by adding 200 inappropriate explanatory variables. Then, not content with having already perverted our dataset, he deliberately mixed the variables together at random. Of course, the robot then took care to erase all traces of his rogue act so that we do not know the relevant variables. The new database contains n = 442 patients and p = 210 covariates, denoted X. The last column is the variable to be explained. Will you be able to thwart this prankster's plans and find the relevant variables?

In [96]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from scipy import stats
import math as m
from scipy.stats import t
from sklearn import linear_model
from numpy.linalg import inv

In [97]:
data = pd.read_csv('https://bitbucket.org/portierf/shared_files/downloads/data_dm3.csv', header=None)

# we specify the matrix of covariates x_data and the explained variable y_data
y_data = data[210]
x_data = data.loc[:, :209]

### Méthodology to select variables following the forward selection approach

In [98]:
def forward_selection(x, y, stop = 10):
    n = x.shape[0] 
    p = x.shape[1]
    v = np.ones(n)
    x_aug = np.c_[v,x.to_numpy()]
    
    variables = []
    p_value_list = []
    t_stats_list = []
    t_variable_list =[]
    
    for j in range(p + 1):
        p_value =[]
        residu = []
        t_stats = []
        t_variable = []
        
        for i in range(p + 1):
            if i not in variables:
                X = x_aug[:, i].reshape(-1,1)
                regr = linear_model.LinearRegression(fit_intercept = True).fit(X, y)
                y_pred = regr.predict(X)
                res = y -  y_pred
                residu.append(res)
                sigma2 = res.std()                               # the error standard deviation
                sigma_theta = sigma2 * np.sqrt(inv(X.T@X)[0,0])  # The corrected error standard deviation
                t_statistique = abs(regr.coef_[0]) / sigma_theta
                p_value_i = 2*(1-t.cdf(t_statistique, n-2))
                p_value.append(p_value_i)
                t_stats.append(t_statistique)
                t_variable.append(i)
                
        idx_min = np.argsort(p_value)[0]
        p_value_min = p_value[idx_min]
        res_min = residu[idx_min]
        y = res_min
        t_stats_list.append(t_stats)
        t_variable_list.append(t_variable)
        
        if p_value_min < stop:
            variables.append(idx_min)
            p_value_list.append(p_value_min)
        else:
            return pd.DataFrame({'variable': variables, 'p_value': p_value_list}, columns=['variable', 'p_value']), t_stats_list,t_variable_list
    
    return  pd.DataFrame({'variable': variables, 'p_value': p_value_list}, columns=['variable', 'p_value']), t_stats_list,t_variable_list

In [99]:
df_variables , t_stat, t_variable = forward_selection(x_data, y_data)

### The first 10 relevant variables are presented in table below:

In [106]:
df_variables[0:10]

Unnamed: 0,variable,p_value
0,35,0.0
1,58,0.0
2,78,8.721544e-09
3,165,7.754549e-09
4,133,9.920123e-06
5,121,0.0002174377
6,126,0.00162524
7,14,0.01722812
8,201,0.009570498
9,56,0.03294146
