# Linear Regression

<img src="files/figures/LinReg.jpg" width="450px"/>

Image taken from: 
https://www.statstest.com/multiple-linear-regression/

<div class="alert alert-block alert-info">
    
Consider the following **training set** composed of $N$ observations:

$$
S_{\rm train} = \left\{ \left( \boldsymbol{x_1}, y_1 \right), \dots, \left( \boldsymbol{x_N}, y_N \right) \right\}.
$$

We define the feature matrix $\boldsymbol{X}$ and target vector $\boldsymbol{y}$ as follows:

$$
\boldsymbol{X} =
\begin{pmatrix}
1 & \boldsymbol{x_1}^T \\
\vdots & \vdots \\
1 & \boldsymbol{x_N}^T 
\end{pmatrix}
=
\begin{pmatrix}
1 & x_{11} & \cdots & x_{1p} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_{N1} &\cdots & x_{Np}
\end{pmatrix}
\text{ and }
\boldsymbol{y} =
\begin{pmatrix}
y_1 \\
\vdots \\
y_N 
\end{pmatrix}
$$


The solution of **linear regression (LR)** is the vector
$$\boldsymbol{\hat \beta} = (\hat \beta_0, \dots, \hat \beta_p)$$ which minimizes the **residual sum of squares (RSS)**, i.e. the sum of squared distances between predictions and targets:

$$
\mathrm{RSS}(\boldsymbol{\beta})
:= \sum_{i=1}^N \big( (1, \boldsymbol{x_i}^T) \, \boldsymbol{\beta} - y_i \big)^2 
= \| \boldsymbol{X} \boldsymbol{\beta} - \boldsymbol{y} \|^2
$$

Provided $\boldsymbol{X}^T \boldsymbol{X}$ is invertible, we have (cf. course):

$$
\boldsymbol{\hat{\beta}} 
= \underset{\boldsymbol{\beta}}{\arg \min} \left\| \boldsymbol{X} \boldsymbol{\beta} - \boldsymbol{y} \right\|^2 
= (\boldsymbol{X}^T \boldsymbol{X})^{-1} \boldsymbol{X}^T \boldsymbol{y}
$$
    
</div>
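
As a reminder of where the closed form comes from, here is a short sketch of the standard derivation. It assumes that $\boldsymbol{X}^T \boldsymbol{X}$ is invertible. Expanding the squared norm gives

$$
\mathrm{RSS}(\boldsymbol{\beta})
= (\boldsymbol{X} \boldsymbol{\beta} - \boldsymbol{y})^T (\boldsymbol{X} \boldsymbol{\beta} - \boldsymbol{y})
= \boldsymbol{\beta}^T \boldsymbol{X}^T \boldsymbol{X} \boldsymbol{\beta}
- 2 \, \boldsymbol{y}^T \boldsymbol{X} \boldsymbol{\beta}
+ \boldsymbol{y}^T \boldsymbol{y}
$$

Setting the gradient with respect to $\boldsymbol{\beta}$ to zero yields the normal equations:

$$
\nabla_{\boldsymbol{\beta}} \mathrm{RSS}(\boldsymbol{\beta})
= 2 \, \boldsymbol{X}^T (\boldsymbol{X} \boldsymbol{\beta} - \boldsymbol{y}) = \boldsymbol{0}
\iff
\boldsymbol{X}^T \boldsymbol{X} \boldsymbol{\beta} = \boldsymbol{X}^T \boldsymbol{y}
$$

The Hessian $2 \, \boldsymbol{X}^T \boldsymbol{X}$ is positive semi-definite, so the RSS is convex and this stationary point is a global minimum. Multiplying both sides by $(\boldsymbol{X}^T \boldsymbol{X})^{-1}$ gives the expression for $\boldsymbol{\hat \beta}$.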

## Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

import seaborn as sns
from matplotlib import pyplot as plt

%matplotlib inline
sns.set_theme()

## Data

- Download the **Bottle Database** (CSV file) from the **California Cooperative Oceanic Fisheries Investigations (CalCOFI)** portal:<br>
download: https://www.kaggle.com/datasets/sohier/calcofi<br>
    info: https://calcofi.org/data/oceanographic-data/bottle-database/
- Import the data and inspect it with `pandas`.
- Select only the following columns of the dataset:<br>
``columns = ["T_degC", "O2Sat", "O2ml_L", "STheta", "Salnty"]``
- Remove rows that contain missing values:<br>
`data = data[data[columns].notnull().all(axis=1)]`

In [None]:
# Load data
DATA_PATH = "put your data path here..."
data = pd.read_csv(DATA_PATH, delimiter=',', low_memory=False)
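
The selection and cleaning steps listed above can be sketched on a tiny stand-in table. The DataFrame `toy`, its values, and the names `subset` and `clean` are made up for illustration; only the column names come from the list above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the bottle data (ASSUMPTION: all values are invented).
toy = pd.DataFrame({
    "T_degC": [10.5, 10.4, np.nan, 9.8],
    "O2Sat": [92.8, 91.0, 90.1, 88.0],
    "O2ml_L": [5.9, 5.8, 5.7, 5.6],
    "STheta": [25.6, 25.7, 25.7, 25.8],
    "Salnty": [33.44, np.nan, 33.42, 33.43],
    "Depthm": [0, 8, 10, 19],  # extra column that gets dropped
})

columns = ["T_degC", "O2Sat", "O2ml_L", "STheta", "Salnty"]

# Keep only the columns of interest.
subset = toy[columns]

# Keep only the rows without any missing value in those columns.
clean = subset[subset[columns].notnull().all(axis=1)]

print(clean)
```

On this toy table, rows 1 and 2 each contain a missing value and are removed. `subset.dropna()` should give the same result.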

## Linear Regression (LR)

The **feature variables** are `"O2Sat", "O2ml_L", "STheta", "Salnty"`.

The **target variable** is `"T_degC"`.

We want to predict the **target** using the **features**.

- Create the feature matrix $\boldsymbol{X}$ (2D array) and the target vector $\boldsymbol{y}$ (1D array).<br>
Don't forget to add a column of $1$'s to your feature matrix $\boldsymbol{X}$.
- Shuffle the data and split it into train and test sets:<br>
(80% train / 20% test, use `train_test_split(...)`)
- **Implement in `numpy` the LR solution $\boldsymbol{\hat \beta}$ using the train dataset.**<br>
(use `np.linalg.pinv()` for matrix inversion)
- Compute the predictions $\boldsymbol{\hat y}$ on the train and test sets.
- Plot the true values $\boldsymbol{y}$ vs predictions $\boldsymbol{\hat y}$ for the train and test sets: if the predictions are good, the points should lie close to the diagonal (why?).
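
The steps above can be sketched end to end on synthetic data. This is one possible implementation, not a reference solution. The data, the coefficients `beta_true`, the noise level, and the variable names are all assumptions chosen for illustration, so that the recovered coefficients can be checked against known ones.

```python
import numpy as np
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned data (ASSUMPTION: invented values).
rng = np.random.default_rng(0)
N, p = 500, 4
features = rng.normal(size=(N, p))
beta_true = np.array([2.0, -1.0, 0.5, 3.0, 1.5])  # intercept first
y = beta_true[0] + features @ beta_true[1:] + rng.normal(scale=0.1, size=N)

# Feature matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(N), features])

# Shuffled 80% / 20% split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0
)

# Closed-form solution, with the pseudo-inverse standing in for the inverse.
beta_hat = np.linalg.pinv(X_train.T @ X_train) @ X_train.T @ y_train

# Predictions on both sets.
y_hat_train = X_train @ beta_hat
y_hat_test = X_test @ beta_hat

# True values vs predictions, with the diagonal drawn for reference.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
panels = [
    (axes[0], y_train, y_hat_train, "train"),
    (axes[1], y_test, y_hat_test, "test"),
]
for ax, y_true, y_hat, title in panels:
    ax.scatter(y_true, y_hat, s=5, alpha=0.5)
    lims = [y_true.min(), y_true.max()]
    ax.plot(lims, lims, color="black", linewidth=1)
    ax.set_xlabel("true y")
    ax.set_ylabel("predicted y")
    ax.set_title(title)
```

A perfect prediction satisfies $\hat y_i = y_i$, so each point $(y_i, \hat y_i)$ then falls exactly on the diagonal; the spread around that line shows the size of the residuals.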

**Conclusion:** Normally, we would do a more thorough data analysis (investigating feature correlations, etc.) and then use a dedicated ML library to implement the model.

Here, at least once in your life, you have implemented **linear regression** yourself.
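
To illustrate the "dedicated ML library" route, the sketch below compares the hand-written solution with `LinearRegression` from scikit-learn. The synthetic data and variable names are assumptions made for illustration; scikit-learn is used here only because the notebook already imports from it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (ASSUMPTION: invented values, 3 features).
rng = np.random.default_rng(1)
features = rng.normal(size=(200, 3))
y = 1.0 + features @ np.array([0.5, -2.0, 1.0]) + rng.normal(scale=0.1, size=200)

# Hand-written solution with an explicit column of ones.
X = np.column_stack([np.ones(len(features)), features])
beta_hat = np.linalg.pinv(X.T @ X) @ X.T @ y

# Library solution: the intercept is fitted by the model itself,
# so no column of ones is added to the features.
model = LinearRegression().fit(features, y)
beta_sklearn = np.concatenate([[model.intercept_], model.coef_])

print(beta_hat)
print(beta_sklearn)
```

Both approaches solve the same least-squares problem, so the two coefficient vectors should agree up to numerical precision.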