# 2023/2024 - Ma412 

# Lab 2: Sparse Regression 

The purpose of this notebook is to apply different linear regression algorithms to the Boston House Prices dataset. Along the way, you will become familiar with the notions of regularization and variable selection.

## 1. Linear regression

We consider the data matrix $X\in\mathbb{R}^{l\times(n+1)}$, whose $i$-th row is $(x_i,1)$, and the vector $Y\in\mathbb{R}^{l}$ containing the labels $y_i$. The least squares estimator is the vector $$
\begin{pmatrix}
\hat{\alpha}\\[3mm]
\hat{\beta}
\end{pmatrix}= (X^TX)^{-1}X^TY=\underset{\alpha\in\mathbb{R}^{n},\, \beta\in\mathbb{R}}{\arg\min}\sum_{i=1}^l\left(y_i-(\langle\alpha,x_i\rangle+\beta)\right)^2$$
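
As a sanity check of the closed form, here is a minimal sketch on synthetic data (the data, the seed, and the names `true_alpha` and `true_beta` are assumptions made for illustration and are not part of the lab). It compares the solution of the normal equations with the direct minimizer of the squared error returned by `np.linalg.lstsq`.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the Boston dataset).
rng = np.random.default_rng(0)
l, n = 50, 3
x = rng.normal(size=(l, n))
true_alpha = np.array([2.0, -1.0, 0.5])
true_beta = 4.0
y = x @ true_alpha + true_beta + 0.1 * rng.normal(size=l)

# Append a column of ones so that the last coefficient is the intercept.
X = np.hstack([x, np.ones((l, 1))])

# Closed form (X^T X)^{-1} X^T Y, computed by solving the normal equations.
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Same estimator obtained by minimizing the squared error directly.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

alpha_hat, beta_hat = theta_normal[:-1], theta_normal[-1]
print(alpha_hat, beta_hat)
```

Both routes should agree up to rounding. Applying `lstsq` to $X$ directly avoids forming $X^TX$, which squares the condition number of the problem.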


#### Questions:
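1. Program a $regression(X, Y)$ function that returns the least squares estimator. Use your regression function on the Boston House Prices dataset (originally loaded with
the $datasets.load\_boston()$ function; this function was removed in scikit-learn 1.2, so the code below reads the raw data from its original source instead).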

In [42]:
import pandas as pd
import numpy as np
from sklearn import datasets

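def regression(X, Y):
    # Solve the normal equations (X^T X) theta = X^T Y in the least squares sense.
    # lstsq returns (solution, residuals, rank, singular values); keep the first two.
    # rcond=None selects NumPy's current default cutoff and avoids the FutureWarning.
    teta, residuals = np.linalg.lstsq(X.T @ X, X.T @ Y, rcond=None)[:2]
    return teta, residuals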



In [45]:
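# Load the raw Boston data: each record is split over two consecutive lines.
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# Features: the values of the even rows plus the first 2 values of the odd rows.
xi = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
# Target (house price): third value of the odd rows.
yi = raw_df.values[1::2, 2]
# Append a column of ones so that the last coefficient is the intercept beta.
X = np.hstack([xi, np.ones((xi.shape[0], 1))])
Y = yi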

teta,residuals = regression(X,Y)
alpha = teta[:-1]
beta = teta[-1]
print(alpha)
print(beta)

[-1.08011358e-01  4.64204584e-02  2.05586264e-02  2.68673382e+00
 -1.77666112e+01  3.80986521e+00  6.92224640e-04 -1.47556685e+00
  3.06049479e-01 -1.23345939e-02 -9.52747232e-01  9.31168327e-03
 -5.24758378e-01]
36.459488385075296




2. Compare the vectors $\hat{\alpha}$ and $\hat{\beta}$ returned by your function with the $coef\_$ and $intercept\_$ attributes of a $linear\_model.LinearRegression$ model.
Some useful functions: $dot()$, $transpose()$, $pinv()$.

In [46]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(xi, yi)
print(reg.coef_)
print(alpha)
print(reg.intercept_)
print(beta)
print("R² linear : ",reg.score(xi,yi))
# The system passed to lstsq is square, so NumPy returns an empty residuals array.
print("Residuals (lstsq) : ", residuals)

[-1.08011358e-01  4.64204584e-02  2.05586264e-02  2.68673382e+00
 -1.77666112e+01  3.80986521e+00  6.92224640e-04 -1.47556685e+00
  3.06049479e-01 -1.23345939e-02 -9.52747232e-01  9.31168327e-03
 -5.24758378e-01]
[-1.08011358e-01  4.64204584e-02  2.05586264e-02  2.68673382e+00
 -1.77666112e+01  3.80986521e+00  6.92224640e-04 -1.47556685e+00
  3.06049479e-01 -1.23345939e-02 -9.52747232e-01  9.31168327e-03
 -5.24758378e-01]
36.45948838508985
36.459488385075296
R² linear :  0.7406426641094095
Residuals (lstsq) :  []


3. Write the function $regress(X,\alpha, \beta)$, which returns the vector $\hat{Y}$ of predicted labels
such that $\hat{y}_i=\langle\alpha,x_i\rangle+\beta$.

In [55]:
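def regress(X, alpha, beta):
    # Predicted label for each row x_i: <alpha, x_i> + beta.
    Y = np.zeros(X.shape[0])
    for i in range(Y.shape[0]):
        Y[i] = alpha @ X[i] + beta
    return Y

Y_lst = regress(xi, alpha, beta)
# Question 4: squared Euclidean norm of the residual vector.
err = np.linalg.norm(Y - Y_lst, 2)**2
print(err)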

11078.784577954975


4. Calculate the least squares error $\epsilon=\lVert Y-\hat{Y} \rVert_2^2=\sum_{i=1}^l(y_i-\hat{y}_i)^2$ of the learned regressor on the entire Boston dataset.

## 2. Ridge regression

In some cases, the matrix $X^TX$ is not invertible. To remedy this problem, we add a ridge $\lambda\mathbb{1}$ to this matrix, where $\mathbb{1}\in\mathbb{R}^{(n+1)\times(n+1)}$ is the identity matrix with its last diagonal entry set to zero, so that the intercept $\beta$ is not penalized:

$$\begin{pmatrix}
1 & 0 & \cdots & 0 \\
0 & \ddots & \ddots & \vdots \\
\vdots & \ddots & 1 & 0 \\
0 & \cdots & 0 & 0
\end{pmatrix}$$

This corresponds to a slight modification of the optimization problem that penalizes the size of the
coefficients. The regularized least squares estimator is given by:
$$\begin{pmatrix}
\hat{\alpha}\\[3mm]
\hat{\beta}
\end{pmatrix}= (X^TX+\lambda\mathbb{1})^{-1}X^TY=\underset{\alpha\in\mathbb{R}^{n},\, \beta\in\mathbb{R}}{\arg\min}\sum_{i=1}^l\left(y_i-(\langle\alpha,x_i\rangle+\beta)\right)^2+\lambda\lVert \alpha \rVert_2^2$$
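
The closed form above can be sketched as follows on synthetic data. The function name `ridge_closed_form`, the data, and the value $\lambda = 10$ are assumptions chosen for illustration; the sketch only shows how the modified identity matrix enters the linear system.

```python
import numpy as np

def ridge_closed_form(X, Y, lam):
    """Solve (X^T X + lam * D) theta = X^T Y, where D is the identity
    with its last diagonal entry set to 0 (the intercept is not penalized).
    X is assumed to already contain a final column of ones."""
    D = np.eye(X.shape[1])
    D[-1, -1] = 0.0
    return np.linalg.solve(X.T @ X + lam * D, X.T @ Y)

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(1)
x = rng.normal(size=(40, 4))
Y = x @ np.array([3.0, 0.0, -2.0, 1.0]) + 5.0 + 0.1 * rng.normal(size=40)
X = np.hstack([x, np.ones((40, 1))])

theta_ols = ridge_closed_form(X, Y, 0.0)     # lam = 0 recovers least squares
theta_ridge = ridge_closed_form(X, Y, 10.0)  # lam > 0 shrinks the slopes

# Compare the norm of the slopes alpha (intercept excluded).
print(np.linalg.norm(theta_ols[:-1]), np.linalg.norm(theta_ridge[:-1]))
```

With $\lambda = 0$ the function returns the ordinary least squares estimator, and the norm of $\hat{\alpha}$ decreases as $\lambda$ grows.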

#### Questions:
1. Program a $ridge\_regression(X, Y,\lambda)$ function that returns the regularized least squares estimator. Compare again the vectors $\hat{\alpha}$ and $\hat{\beta}$ obtained for the parameter
$\lambda = 1$ on the Boston dataset with the $coef\_$ and $intercept\_$ attributes of a $linear\_model.Ridge$ regressor.

In [None]:
# ============================================================
# Your code here ...
# ============================================================

2. Plot the evolution of the coefficients of the vector $\hat{\alpha}$ as a function of the regularization parameter
$\lambda$ for values between $10^{-3}$ and $10^{3}$. Which variables seem to best explain the house prices in Boston?

In [None]:
# ============================================================
# Your code here ...
# ============================================================
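
One possible way to compute a regularization path is sketched below on synthetic data. The helper names `ridge_path` and `plot_path`, the data, and the grid of 50 values are assumptions for illustration. The intercept is handled by centering the data, which is equivalent to leaving it unpenalized.

```python
import numpy as np

def ridge_path(x, y, lambdas):
    """Return an array of shape (len(lambdas), n) of ridge slopes alpha,
    one row per value of lambda. Centering x and y removes the intercept,
    which is equivalent to not penalizing it."""
    xc = x - x.mean(axis=0)
    yc = y - y.mean()
    n = x.shape[1]
    gram = xc.T @ xc
    rhs = xc.T @ yc
    return np.array([np.linalg.solve(gram + lam * np.eye(n), rhs) for lam in lambdas])

def plot_path(lambdas, path):
    """Plot one curve per coefficient on a logarithmic lambda axis."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting
    plt.semilogx(lambdas, path)
    plt.xlabel("lambda")
    plt.ylabel("coefficient value")
    plt.show()

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(2)
x = rng.normal(size=(100, 5))
y = x @ np.array([4.0, 0.0, -3.0, 0.5, 0.0]) + 1.0 + rng.normal(size=100)

lambdas = np.logspace(-3, 3, 50)
path = ridge_path(x, y, lambdas)
print(path.shape)  # (50, 5)
```

Calling `plot_path(lambdas, path)` draws the curves. When the features have very different scales, standardizing them first makes the coefficient magnitudes easier to compare.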

3. Find, by some appropriate means, the best value of the parameter $\lambda$. Then train a regressor with this value on the entire Boston dataset and compute
its least squares error on this same sample.

In [None]:
# ============================================================
# Your code here ...
# ============================================================
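
K-fold cross-validation is one common way to choose $\lambda$; the lab does not prescribe a method, so the sketch below is only one option among others. The helper names `ridge_fit` and `cv_error`, the number of folds, the grid, and the synthetic data are all assumptions for illustration.

```python
import numpy as np

def ridge_fit(x, y, lam):
    """Ridge slopes and intercept; the intercept is left unpenalized."""
    mu_x, mu_y = x.mean(axis=0), y.mean()
    xc = x - mu_x
    alpha = np.linalg.solve(xc.T @ xc + lam * np.eye(x.shape[1]), xc.T @ (y - mu_y))
    return alpha, mu_y - mu_x @ alpha

def cv_error(x, y, lam, k=5, seed=0):
    """Mean squared validation error of ridge over k shuffled folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        alpha, beta = ridge_fit(x[train], y[train], lam)
        errors.append(np.mean((y[val] - (x[val] @ alpha + beta)) ** 2))
    return float(np.mean(errors))

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(3)
x = rng.normal(size=(60, 8))
y = x[:, 0] * 2.0 - x[:, 1] + rng.normal(size=60)

# Evaluate every candidate lambda and keep the one with the lowest error.
lambdas = np.logspace(-3, 3, 13)
scores = [cv_error(x, y, lam) for lam in lambdas]
best_lambda = lambdas[int(np.argmin(scores))]
print(best_lambda)
```

The same seed is used for every $\lambda$, so all candidates are scored on identical folds. Note that the error measured on the training sample alone always favors the smallest $\lambda$, which is why held-out data is used here.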

## 3. LASSO regression

In this regularization, the coefficient vector is penalized with the $l_1$ norm instead of the Euclidean norm $l_2$. For $\alpha\in\mathbb{R}^{n}$, $\lVert \alpha \rVert_1=\sum_{i=1}^n|\alpha_i|$. The solutions are then sparse (parsimonious): some coefficients are exactly zero. The optimization problem is given by:
$$\min_{\alpha\in\mathbb{R}^{n},\, \beta\in\mathbb{R}}\sum_{i=1}^l\left(y_i-(\langle\alpha,x_i\rangle+\beta)\right)^2+\lambda\lVert \alpha \rVert_1$$
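
To illustrate why the $l_1$ penalty produces exact zeros, here is a minimal sketch of cyclic coordinate descent with soft-thresholding for the objective above. This is an illustrative solver, not the implementation used by scikit-learn; the names `soft_threshold` and `lasso_cd`, the iteration count, and the synthetic data are assumptions. Note also that $linear\_model.Lasso$ divides the squared error by $2l$, so the $\lambda$ used here corresponds to its parameter `alpha` $=\lambda/(2l)$.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(x, y, lam, n_iter=200):
    """Cyclic coordinate descent for
    sum_i (y_i - (<alpha, x_i> + beta))^2 + lam * ||alpha||_1."""
    mu_x, mu_y = x.mean(axis=0), y.mean()
    xc, yc = x - mu_x, y - mu_y          # centering removes the intercept
    col_sq = (xc ** 2).sum(axis=0)       # squared norms of the columns
    alpha = np.zeros(x.shape[1])
    residual = yc.copy()                 # residual = yc - xc @ alpha
    for _ in range(n_iter):
        for j in range(x.shape[1]):
            residual += xc[:, j] * alpha[j]          # remove coordinate j
            rho = xc[:, j] @ residual
            alpha[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
            residual -= xc[:, j] * alpha[j]          # put it back, updated
    return alpha, mu_y - mu_x @ alpha

# Synthetic data with only two informative variables (an assumption).
rng = np.random.default_rng(4)
x = rng.normal(size=(80, 6))
y = x @ np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0]) + 1.0 + 0.5 * rng.normal(size=80)

xc, yc = x - x.mean(axis=0), y - y.mean()
lam_max = 2.0 * np.max(np.abs(xc.T @ yc))   # smallest lam that zeroes every slope

alpha_small, _ = lasso_cd(x, y, 1e-3)
alpha_mid, _ = lasso_cd(x, y, 0.2 * lam_max)
alpha_big, _ = lasso_cd(x, y, lam_max)
print(np.count_nonzero(alpha_small), np.count_nonzero(alpha_mid), np.count_nonzero(alpha_big))
```

Each coordinate update sets $\alpha_j$ to exactly zero whenever its correlation with the current residual is at most $\lambda/2$ in absolute value, which is the mechanism behind variable selection.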

#### Questions:
1. Using the $linear\_model.Lasso$ class, plot the evolution of the coefficients of the vector $\hat{\alpha}$ as a function of the parameter $\lambda$. Which variables seem to best explain the
house prices in Boston? Are they the same as those found in the previous exercise? How do the other variables behave when the value of $\lambda$ increases?

In [None]:
# ============================================================
# Your code here ...
# ============================================================

2. Find, by some appropriate means, the best value of the parameter $\lambda$. Then train a regressor with this value on the entire Boston dataset and compute
its least squares error on this same sample.

In [None]:
# ============================================================
# Your code here ...
# ============================================================

## 4. Elastic Net Regression

In Elastic Net regularization, we add both the $l_1$ and $l_2$ penalty terms to obtain the final loss function. Referring to the course, apply Elastic Net regularization to train your model, then compute the predictions and the mean squared error.
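
In the notation of the previous sections, one common way to write this loss is the following (the symbols $\lambda_1$ and $\lambda_2$ are introduced here for illustration, and the formulation given in the course may differ):

$$\min_{\alpha\in\mathbb{R}^{n},\, \beta\in\mathbb{R}}\sum_{i=1}^l\left(y_i-(\langle\alpha,x_i\rangle+\beta)\right)^2+\lambda_1\lVert \alpha \rVert_1+\lambda_2\lVert \alpha \rVert_2^2$$

The $linear\_model.ElasticNet$ class uses a different parametrization. Writing $a$ for its `alpha` parameter and $\rho$ for its `l1_ratio` parameter, it minimizes

$$\frac{1}{2l}\sum_{i=1}^l\left(y_i-(\langle\alpha,x_i\rangle+\beta)\right)^2+a\,\rho\,\lVert \alpha \rVert_1+\frac{a\,(1-\rho)}{2}\lVert \alpha \rVert_2^2$$

so the two forms coincide for $\lambda_1 = 2\,l\,a\,\rho$ and $\lambda_2 = l\,a\,(1-\rho)$.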

In [None]:
# ============================================================
# Your code here ...
# ============================================================

From the above analysis, what conclusions can you draw about the different regularization methods?

## 5. Your Turn

The purpose now is to test these approaches on other datasets. You may choose one from the UCI machine learning repository http://archive.ics.uci.edu/.
Download a dataset, and try to determine the optimal set of parameters to use to model it! 

In [None]:
# ============================================================
# Your code here ...
# ============================================================