# 1.2 - Multivariate Linear Regression

:::{grid} 1 1 2 2
```{card} [Open in Google Colab](https://colab.research.google.com/github/PilotLeoYan/inside-deep-learning/blob/main/content/1-linear-regression/1-2-multivariate-linear-regression.ipynb)
```{image} ../figures/colab_logo.png
:align: center
```
```{card} [Open in Jupyter NBViewer](https://nbviewer.org/github/PilotLeoYan/inside-deep-learning/blob/main/content/1-linear-regression/1-2-simple-multivariate-regression.ipynb)
```{image} ../figures/jupyter_logo.png
:align: center
```
:::

If you already understand Simple Linear Regression, 
then we can make things a little more complicated. 
Multivariate Linear Regression considers inputs with
multiple *features*. This will help us develop a dense layer in
the next part.

```{image} ../figures/multivariate-perceptron.png
:width: 300
:class: hidden dark:block
```

```{image} ../figures/multivariate-perceptron-light.png
:width: 300
:class: dark:hidden
```

The goal of multivariate linear regression is the same as in simple linear regression:
to estimate $f(\cdot)$ with a linear approximation $\hat{f}(\cdot)$, where

$$
\mathbf{y} = f\left( \mathbf{X} \right) + \epsilon
$$

Note that the input data $\mathbf{X}$ is now a *matrix*.

**Purpose of this Notebook:**

1. Create a dataset for a multivariate linear regression task
2. Create our own Perceptron class from scratch
3. Compute the gradients for gradient descent from scratch
4. Train our Perceptron
5. Compare our Perceptron to the one prebuilt by PyTorch

# Setup

In [1]:
print('Start package installation...')

Start package installation...


In [2]:
%%capture
%pip install torch
%pip install scikit-learn

In [3]:
print('Packages installed successfully!')

Packages installed successfully!


In [4]:
import torch
from torch import nn

from platform import python_version
python_version(), torch.__version__

('3.14.0', '2.9.0+cu126')

In [5]:
device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'
device

'cpu'

In [6]:
torch.set_default_dtype(torch.float64)

In [7]:
def add_to_class(Class):
    """Register functions as methods in created class."""
    def wrapper(obj): setattr(Class, obj.__name__, obj)
    return wrapper

# Dataset

## create dataset

The dataset $\mathcal{D}$ consists of the input data $\mathbf{X}$ and
the target data $\mathbf{y}$

$$
\mathcal{D} = \left\{(\mathbf{x}_{1}^{\top}, y_{1}), \cdots,
(\mathbf{x}_{m}^{\top}, y_{m}) \right\}
$$

The input data $\mathbf{X} \in \mathbb{R}^{m \times n}$ can be represented as a matrix

$$
\begin{align}
\mathbf{X} &= \begin{bmatrix}
    x_{11} & \cdots & x_{1n} \\
    \vdots & \ddots & \vdots \\
    x_{m1} & \cdots & x_{mn}
\end{bmatrix} \\
&= \begin{bmatrix}
    \mathbf{x}_{1}^{\top} \\
    \vdots \\
    \mathbf{x}_{m}^{\top}
\end{bmatrix}
\end{align}
$$

where $m$ is the number of samples, $n$ is the number of *features*, and
$\mathbf{x}_{i}^{\top} = \begin{bmatrix} x_{i1} & \cdots & x_{in} \end{bmatrix} \in \mathbb{R}^{1 \times n}$.

The target data $\mathbf{y} \in \mathbb{R}^{m}$ is unchanged

$$
\mathbf{y} = \begin{bmatrix}
	y_{1} \\ \vdots \\ y_{m}
\end{bmatrix}
$$
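
To make these shapes concrete, here is a minimal sketch with made-up values ($m = 3$, $n = 2$). The names `X_toy` and `y_toy` are illustrative only and are not used elsewhere in this notebook.

```python
import torch

# Toy dataset: m = 3 samples, n = 2 features (values are made up).
X_toy = torch.tensor([[1.0, 2.0],
                      [3.0, 4.0],
                      [5.0, 6.0]])
y_toy = torch.tensor([10.0, 20.0, 30.0])

m, n = X_toy.shape

# Row i of X is the feature vector x_i^T of sample i (0-based index here).
x_2 = X_toy[1]

print(m, n)         # 3 2
print(x_2.shape)    # torch.Size([2])
print(y_toy.shape)  # torch.Size([3])
```

Each row of the matrix is one sample and each column is one feature, so indexing a row returns a vector with $n$ entries.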

In [8]:
from sklearn.datasets import make_regression
import random


M: int = 10_100 # number of samples
N: int = 4 # number of features

X, Y = make_regression(
    n_samples=M, 
    n_features=N, 
    n_targets=1,
    n_informative=N - 1, # one feature is uninformative: it does not contribute to the target
    bias=random.random(), # random true bias
    noise=1
)

print(X.shape)
print(Y.shape)

(10100, 4)
(10100,)
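
As a side check on what `n_informative` does, the sketch below asks `make_regression` to also return the ground-truth coefficients via `coef=True`. The sample count and `random_state` are arbitrary choices made here for reproducibility; they are not the values used in this notebook.

```python
from sklearn.datasets import make_regression

# coef=True additionally returns the coefficients used to build the target.
X_demo, y_demo, coef = make_regression(
    n_samples=50,
    n_features=4,
    n_informative=3,
    coef=True,
    random_state=0,  # fixed seed, chosen here only for reproducibility
)

# Features that are not informative get a coefficient of exactly zero.
n_zero = int((coef == 0).sum())

print(coef.shape)  # (4,)
print(n_zero)      # 1
```

With four features and three informative ones, exactly one coefficient is zero, so a well-trained model should learn a weight close to zero for that feature.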


## split dataset

In [9]:
X_train = torch.tensor(X[:100], device=device)
Y_train = torch.tensor(Y[:100], device=device)
X_train.shape, Y_train.shape

(torch.Size([100, 4]), torch.Size([100]))

In [10]:
X_test = torch.tensor(X[100:], device=device)
Y_test = torch.tensor(Y[100:], device=device)
X_test.shape, Y_test.shape

(torch.Size([10000, 4]), torch.Size([10000]))

## delete raw dataset

In [11]:
del X
del Y

# Scratch multivariate perceptron

## weight and bias

Our model $\hat{\mathbf{y}}(\cdot)$ still has two trainable parameters $b, \mathbf{w}$.
But note that the weight is now a vector

$$
\mathbf{w} \in \mathbb{R}^{n}
$$

and $b \in \mathbb{R}$.

In [12]:
class MultiLinearRegression:
    def __init__(self, n_features: int):
        self.b = torch.randn(1, device=device)
        self.w = torch.randn(n_features, device=device)

    def copy_params(self, torch_layer: nn.modules.linear.Linear):
        """
        Copy the parameters from an `nn.Linear` layer into this model.

        Args:
            torch_layer: PyTorch layer from which to copy the parameters.
        """
        self.b.copy_(torch_layer.bias.detach().clone())
        self.w.copy_(torch_layer.weight[0,:].detach().clone())

## weighted sum

$$
\begin{align}
\hat{\mathbf{y}}: \mathbb{R}^{m \times n} &\to \mathbb{R}^{m} \\
\mathbf{X} &\mapsto \hat{\mathbf{y}}(\mathbf{X}) = b + \mathbf{Xw}
\end{align}
$$

For a single prediction

$$
\begin{align}
\hat{y}_{i} &= b + \sum_{j=1}^{n} x_{ij} w_{j}\\
&= b + \mathbf{x}_{i}^{\top} \mathbf{w}
\end{align}
$$

This element-wise form will be useful when deriving the gradients.
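
The two forms above can be checked against each other numerically. This is a minimal sketch on random data; the sizes and the seed are arbitrary assumptions.

```python
import torch

torch.manual_seed(0)
m, n = 5, 3
X = torch.randn(m, n, dtype=torch.float64)
w = torch.randn(n, dtype=torch.float64)
b = torch.randn(1, dtype=torch.float64)

# Vectorized form: y_hat = b + Xw, with shape (m,).
y_hat = torch.matmul(X, w) + b

# Element-wise form: y_hat_i = b + sum_j x_ij * w_j.
y_hat_loop = torch.stack([
    b[0] + sum(X[i, j] * w[j] for j in range(n))
    for i in range(m)
])

print(y_hat.shape)
print(torch.allclose(y_hat, y_hat_loop))
```

The bias has shape `(1,)` and is broadcast across all $m$ predictions, which is why a single scalar parameter is enough.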

In [13]:
@add_to_class(MultiLinearRegression)
def predict(self, x: torch.Tensor) -> torch.Tensor:
    """
    Predict the output for input x.

    Args:
        x: Input tensor of shape (n_samples, n_features).

    Returns:
        y_pred: Predicted output tensor of shape (n_samples,).
    """
    return torch.matmul(x, self.w) + self.b

## MSE

MSE is unchanged.

$$
\begin{align}
L: \mathbb{R}^{m} &\to \mathbb{R}^{+} \\
\hat{\mathbf{y}} &\mapsto L(\hat{\mathbf{y}}), \;
\hat{\mathbf{y}} \in \mathbb{R}^{m}
\end{align}
$$

$$
L (\hat{\mathbf{y}}) = 
\frac{1}{m} \sum_{i=1}^{m} \left(
	\hat{y}_{i} - y_{i}
\right)^{2}
$$

In [14]:
@add_to_class(MultiLinearRegression)
def mse_loss(self, y_true: torch.Tensor, y_pred: torch.Tensor):
    """
    MSE loss function between target y_true and y_pred.

    Args:
        y_true: Target tensor of shape (n_samples,).
        y_pred: Predicted tensor of shape (n_samples,).

    Returns:
        loss: MSE loss between predictions and true values.
    """
    return ((y_pred - y_true)**2).mean().item()

@add_to_class(MultiLinearRegression)
def evaluate(self, x: torch.Tensor, y_true: torch.Tensor):
    """
    Evaluate the model on input x and target y_true using MSE.

    Args:
        x: Input tensor of shape (n_samples, n_features).
        y_true: Target tensor of shape (n_samples,).

    Returns:
        loss: MSE loss between predictions and true values.
    """
    y_pred = self.predict(x)
    return self.mse_loss(y_true, y_pred)

## gradients

Let's follow the same strategy as before:

+ First, determine the derivatives to be computed
+ Then, ascertain the shape of each derivative
+ Finally, compute the derivatives

⭐️ We are using *Einstein notation*, which implies summation over repeated indices. For example

$$
a_{i} b_{i} \equiv \sum_{i} a_{i} b_{i}
$$
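
The same convention is available in PyTorch through `torch.einsum`. The sketch below uses made-up values to show a repeated index being summed away while a free index survives.

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])

# a_i b_i: the repeated index i is summed, giving a scalar (dot product).
dot = torch.einsum('i,i->', a, b)

X = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])
w = torch.tensor([10.0, 100.0])

# x_pk w_k: k is repeated and summed, p is free, giving a vector indexed by p.
Xw = torch.einsum('pk,k->p', X, w)

print(dot)  # tensor(32.)
print(Xw)   # tensor([210., 430., 650.])
```

Here `a` and `b` are generic vectors from the formula above, not the model's bias.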

Derivative of MSE with respect to the bias

$$
\frac{\partial L}{\partial b} = 
\frac{\partial L}{\partial \hat{y}_{p}} 
\frac{\partial \hat{y}_{p}}{\partial b}
$$

and derivative of MSE with respect to the weight

$$
\frac{\partial L}{\partial w_{q}} = 
\frac{\partial L}{\partial \hat{y}_{p}} 
\frac{\partial \hat{y}_{p}}{\partial w_{q}}
$$

where the shape of each derivative is

$$
\frac{\partial L}{\partial b} \in \mathbb{R},
\frac{\partial L}{\partial \mathbf{w}} \in \mathbb{R}^{n},
\frac{\partial L}{\partial \hat{\mathbf{y}}} \in \mathbb{R}^{m},
\frac{\partial \hat{\mathbf{y}}}{\partial b} \in \mathbb{R}^{m},
\frac{\partial \hat{\mathbf{y}}}{\partial \mathbf{w}} \in \mathbb{R}^{m \times n}
$$

### MSE derivative

Derivative of MSE with respect to the predicted data is

$$
\begin{align}
\frac{\partial L}{\partial \hat{y}_{p}} &= 
\frac{\partial}{\partial \hat{y}_{p}} \left( \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}_{i} - y_{i} \right)^{2} \right) \\
&= \frac{1}{m} \sum_{i=1}^{m} \frac{\partial}{\partial \hat{y}_{p}} \left( \left( \hat{y}_{i} - y_{i} \right)^{2} \right) \\
&= \frac{2}{m} \sum_{i=1}^{m} \left( \hat{y}_{i} - y_{i} \right) \frac{\partial \hat{y}_{i}}{\partial \hat{y}_{p}} \\
&= \frac{2}{m} \sum_{i=1}^{m} \left( \hat{y}_{i} - y_{i} \right) \delta_{ip} \\
&=\frac{2}{m} \sum_{i=1}^{m} \left[ \hat{\mathbf{y}} - \mathbf{y} \right]_{i} \delta_{ip} \\
&= \frac{2}{m} \left[ \hat{\mathbf{y}} - \mathbf{y} \right]_{p} \\
&= \frac{2}{m} \left( \hat{y}_{p} - y_{p} \right)
\end{align}
$$
for $p = 1, \ldots, m$.

The vectorized form is

$$
\frac{\partial L}{\partial \hat{\mathbf{y}}} = 
\frac{2}{m} \left( \hat{\mathbf{y}} - \mathbf{y} \right)
$$
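
This result can be sanity-checked with autograd. The following is a minimal sketch on random vectors; the size and the seed are arbitrary assumptions.

```python
import torch

torch.manual_seed(0)
m = 6
y = torch.randn(m, dtype=torch.float64)
y_hat = torch.randn(m, dtype=torch.float64, requires_grad=True)

# MSE as defined above, differentiated by autograd.
loss = ((y_hat - y) ** 2).mean()
loss.backward()

# Closed-form derivative: (2/m) (y_hat - y).
manual = 2.0 / m * (y_hat.detach() - y)

print(torch.allclose(y_hat.grad, manual))
```

The gradient has one entry per prediction, matching the shape $\mathbb{R}^{m}$ stated earlier.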

### weighted sum derivative

#### with respect to bias

$$
\begin{align}
\frac{\partial \hat{y}_{p}}{\partial b} &= \frac{\partial}{\partial b} \left( b + \mathbf{x}_{p}^{\top} \mathbf{w} \right) \\
&= 1
\end{align}
$$
for $p = 1, \ldots, m$. 

The vectorized form is

$$
\frac{\partial \hat{\mathbf{y}}}{\partial b} = \mathbf{1}
$$

where $\mathbf{1} \in \mathbb{R}^{m}$.

#### with respect to weight

$$
\begin{align}
\frac{\partial \hat{y}_{p}}{\partial w_{q}} &= \frac{\partial}{\partial w_{q}} \left( b + \mathbf{x}_{p}^{\top} \mathbf{w} \right) \\
&= \frac{\partial}{\partial w_{q}} \left(\mathbf{x}_{p}^{\top} \mathbf{w} \right) \\
&= \frac{\partial}{\partial w_{q}} \left( x_{p1}w_{1} + \ldots + x_{pq}w_{q} + \ldots + x_{pn}w_{n} \right) \\
&= \frac{\partial}{\partial w_{q}} \left( x_{pk} w_{k} \right) \\
&= x_{pk} \delta_{kq} \\
&= x_{pq}
\end{align}
$$
for $p = 1, \ldots, m$, and $q = 1, \ldots, n$.

Vectorizing for all $q = 1, \ldots, n$

$$
\frac{\partial \hat{y}_{p}}{\partial \mathbf{w}} = 
\mathbf{x}_{p}^{\top} \in \mathbb{R}^{1 \times n}
$$

Vectorizing for all $p = 1, \ldots, m$

$$
\frac{\partial \hat{\mathbf{y}}}{\partial \mathbf{w}} = 
\mathbf{X} \in \mathbb{R}^{m \times n}
$$
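
Both Jacobians of the weighted sum can be verified with `torch.autograd.functional.jacobian`. This is a sketch on random data; the sizes and the seed are arbitrary assumptions.

```python
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)
m, n = 4, 3
X = torch.randn(m, n, dtype=torch.float64)
w = torch.randn(n, dtype=torch.float64)
b = torch.randn(1, dtype=torch.float64)

# Jacobian of y_hat = b + Xw with respect to w: expected to equal X, shape (m, n).
J_w = jacobian(lambda w_: X @ w_ + b, w)

# Jacobian with respect to b: expected to be a column of ones, shape (m, 1).
J_b = jacobian(lambda b_: X @ w + b_, b)

print(J_w.shape, torch.allclose(J_w, X))
print(J_b.shape, torch.allclose(J_b, torch.ones_like(J_b)))
```

The trailing dimension of size 1 in `J_b` appears only because the bias is stored as a tensor of shape `(1,)`.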

### full chain rule

Derivative of MSE with respect to the bias

$$
\begin{align}
\frac{\partial L}{\partial b} &= 
{\color{Cyan} \frac{\partial L}{\partial \hat{y}_{p}}}
{\color{Orange} \frac{\partial \hat{y}_{p}}{\partial b}} \\
&= {\color{Cyan} \frac{2}{m} \left( \hat{y}_{p} - y_{p} \right)}
{\color{Orange} 1_{p}} \\
&= \frac{2}{m} \left< \hat{\mathbf{y}} - \mathbf{y}, \mathbf{1} \right> \\
&= \frac{2}{m} \left( \hat{\mathbf{y}} - \mathbf{y} \right)^{\top} \mathbf{1}
\end{align}
$$

Derivative of MSE with respect to the weight

$$
\begin{align}
\frac{\partial L}{\partial w_{q}} &= 
{\color{Cyan} \frac{\partial L}{\partial \hat{y}_p}}
{\color{Magenta} \frac{\partial \hat{y}_{p}}{\partial w_{q}}} \\
&= {\color{Cyan} \frac{2}{m} \left(\hat{y}_{p} - y_{p} \right)} {\color{Magenta} x_{pq}} \\
&= \frac{2}{m} \left< \hat{\mathbf{y}} - \mathbf{y}, \mathbf{x}_{:,q} \right> \\
&= \frac{2}{m} \left( \mathbf{x}_{:,q} \right)^{\top} \left( \hat{\mathbf{y}} - \mathbf{y} \right)
\end{align}
$$
for $q = 1, \ldots, n$, where $\mathbf{x}_{:,q} = \begin{bmatrix} x_{1q} & \cdots & x_{mq} \end{bmatrix}^{\top} \in \mathbb{R}^{m \times 1}$. 

The vectorized form is

$$
\begin{align}
\frac{\partial L}{\partial \mathbf{w}} &= \frac{2}{m}
\mathbf{X}^{\top} \left( \hat{\mathbf{y}} - \mathbf{y} \right)
\end{align}
$$

### final gradients

$$
\nabla_{b}L = 
\frac{2}{m} \left( \hat{\mathbf{y}} - \mathbf{y} \right)^{\top} \mathbf{1}
$$

$$
\nabla_{\mathbf{w}} L =
\frac{2}{m} \mathbf{X}^{\top} \left( \hat{\mathbf{y}} - \mathbf{y} \right)
$$
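
Before implementing the update, the two closed-form gradients can be compared with what autograd computes for the same loss. This is a minimal sketch on random data; the sizes and the seed are arbitrary assumptions.

```python
import torch

torch.manual_seed(0)
m, n = 8, 4
X = torch.randn(m, n, dtype=torch.float64)
y = torch.randn(m, dtype=torch.float64)
w = torch.randn(n, dtype=torch.float64, requires_grad=True)
b = torch.randn(1, dtype=torch.float64, requires_grad=True)

# Forward pass and MSE, differentiated by autograd.
y_hat = X @ w + b
loss = ((y_hat - y) ** 2).mean()
loss.backward()

# Closed-form gradients derived above.
residual = (y_hat - y).detach()
grad_b = 2.0 / m * residual.sum()    # (2/m) (y_hat - y)^T 1
grad_w = 2.0 / m * (X.T @ residual)  # (2/m) X^T (y_hat - y)

print(torch.allclose(b.grad, grad_b))
print(torch.allclose(w.grad, grad_w))
```

Multiplying by the vector of ones is the same as summing the residuals, which is why `residual.sum()` is used for the bias gradient.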

## parameters update

$$
\begin{align}
b &\leftarrow b -\eta \nabla_{b}L \\ &=
b -\eta \left(
    \frac{2}{m} (\hat{\mathbf{y}} - \mathbf{y})^{\top} \mathbf{1}
\right)
\end{align}
$$

$$
\begin{align}
\mathbf{w} &\leftarrow \mathbf{w} -\eta \nabla_{\mathbf{w}}L \\ &=
\mathbf{w} -\eta \left(
    \frac{2}{m} \mathbf{X}^{\top} (\hat{\mathbf{y}} - \mathbf{y}) 
\right)
\end{align} 
$$

where $\eta \in \mathbb{R}^{+}$ is called the *learning rate*.
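
As a small worked example of the update rule, the sketch below applies a single step to a noise-free toy problem and checks that the loss decreases. The data, the zero initialization, and the learning rate are assumptions chosen for illustration.

```python
import torch

# Toy data generated from y = 1 + 2*x1 - 3*x2 (no noise).
X = torch.tensor([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0],
                  [2.0, 1.0]], dtype=torch.float64)
y = 1.0 + X @ torch.tensor([2.0, -3.0], dtype=torch.float64)

w = torch.zeros(2, dtype=torch.float64)
b = torch.zeros(1, dtype=torch.float64)
lr = 0.1


def mse(w: torch.Tensor, b: torch.Tensor) -> float:
    """MSE of the linear model on the toy data."""
    return ((X @ w + b - y) ** 2).mean().item()


loss_before = mse(w, b)  # 4.25 for this data and zero initialization

# One gradient descent step using the closed-form gradients.
delta = 2 * (X @ w + b - y) / len(y)
b = b - lr * delta.sum()
w = w - lr * (X.T @ delta)

loss_after = mse(w, b)
print(loss_before, loss_after)
```

A single step does not reach the minimum, but it moves both parameters in the direction that lowers the loss.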

In [15]:
@add_to_class(MultiLinearRegression)
def update(self, x: torch.Tensor, y_true: torch.Tensor, 
           y_pred: torch.Tensor, lr: float):
    """
    Update the model parameters.

    Args:
       x: Input tensor of shape (n_samples, n_features).
       y_true: Target tensor of shape (n_samples,).
       y_pred: Predicted output tensor of shape (n_samples,).
       lr: Learning rate. 
    """
    delta = 2 * (y_pred - y_true) / len(y_true)
    self.b -= lr * delta.sum()
    self.w -= lr * torch.matmul(x.T, delta)

## gradient descent

In [16]:
@add_to_class(MultiLinearRegression)
def fit(self, x: torch.Tensor, y: torch.Tensor, 
        epochs: int, lr: float, batch_size: int, 
        x_valid: torch.Tensor, y_valid: torch.Tensor):
    """
    Fit the model using gradient descent.
    
    Args:
        x: Input tensor of shape (n_samples, n_features).
        y: Target tensor of shape (n_samples,).
        epochs: Number of epochs to fit.
        lr: Learning rate.
        batch_size: Number of samples per mini-batch.
        x_valid: Input tensor of shape (n_valid_samples, n_features).
        y_valid: Target tensor of shape (n_valid_samples,).
    """
    for epoch in range(epochs):
        loss = []
        for batch in range(0, len(y), batch_size):
            end_batch = batch + batch_size

            y_pred = self.predict(x[batch:end_batch])

            loss.append(self.mse_loss(
                y[batch:end_batch],
                y_pred
            ))

            self.update(
                x[batch:end_batch], 
                y[batch:end_batch], 
                y_pred, 
                lr
            )

        loss = round(sum(loss) / len(loss), 4)
        loss_v = round(self.evaluate(x_valid, y_valid), 4)
        print(f'epoch: {epoch} - MSE: {loss} - MSE_v: {loss_v}')
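
The slicing used in the loop above can be inspected in isolation. The helper `batch_slices` is hypothetical and exists only for this sketch; the numbers mirror a training set of 100 samples split with a batch size of `100 // 3`.

```python
def batch_slices(n_samples: int, batch_size: int) -> list[tuple[int, int]]:
    """Return the (start, stop) index pairs covered by the mini-batch loop."""
    return [
        # Slicing past the end is clipped by Python, so stop is capped here.
        (start, min(start + batch_size, n_samples))
        for start in range(0, n_samples, batch_size)
    ]


slices = batch_slices(100, 100 // 3)
sizes = [stop - start for start, stop in slices]

print(slices)  # [(0, 33), (33, 66), (66, 99), (99, 100)]
print(sizes)   # [33, 33, 33, 1]
```

When the batch size does not divide the number of samples, the last mini-batch is smaller. Because the epoch loss is an unweighted mean of the per-batch losses, that last batch counts as much as the full ones in the printed value.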

# Scratch vs Torch.nn

## Torch.nn model

In [17]:
class TorchLinearRegression(nn.Module):
    def __init__(self, n_features):
        super(TorchLinearRegression, self).__init__()
        self.layer = nn.Linear(n_features, 1, device=device)
        self.loss = nn.MSELoss()

    def forward(self, x):
        return self.layer(x)
    
    def evaluate(self, x, y):
        self.eval()
        with torch.no_grad():
            y_pred = self.forward(x)
            return self.loss(y_pred, y).item()
    
    def fit(self, x, y, epochs, lr, batch_size, x_valid, y_valid):
        optimizer = torch.optim.SGD(self.parameters(), lr=lr)
        for epoch in range(epochs):
            loss_t = [] # train loss
            for batch in range(0, len(y), batch_size):
                end_batch = batch + batch_size

                y_pred = self.forward(x[batch:end_batch])
                loss = self.loss(y_pred, y[batch:end_batch])
                loss_t.append(loss.item())

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            loss_t = round(sum(loss_t) / len(loss_t), 4)
            loss_v = round(self.evaluate(x_valid, y_valid), 4)
            print(f'epoch: {epoch} - MSE: {loss_t} - MSE_v: {loss_v}')
        optimizer.zero_grad()

In [18]:
torch_model = TorchLinearRegression(N)

## scratch model

In [19]:
model = MultiLinearRegression(N)

## evals

We will use a *metric* to compare our model with the PyTorch model.

### import modified MAPE

We will use a modified version of *MAPE* as the metric

$$
\text{MAPE}(\mathbf{y}, \hat{\mathbf{y}}) =
\frac{1}{m} \sum^{m}_{i=1} \mathcal{L} (y_{i}, \hat{y}_{i})
$$

where

$$
\mathcal{L} (y_{i}, \hat{y}_{i}) = \begin{cases}
    \left| \frac{y_{i} - \hat{y}_{i}}{y_{i}} \right|
    & \text{if } y_{i} \neq 0 \\
    \left| \hat{y}_{i} \right| & \text{if } y_{i} = 0
\end{cases}
$$
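
For readers who want to see the metric in code, here is a hypothetical re-implementation. It is not the repository's `torch_mape`, whose source is not shown here, and it assumes the second case applies when the reference value $y_{i}$ is zero.

```python
import torch


def modified_mape(y_true, y_pred) -> float:
    """Hypothetical sketch of the modified MAPE; not the repository's torch_mape."""
    y_true = torch.as_tensor(y_true, dtype=torch.float64)
    y_pred = torch.as_tensor(y_pred, dtype=torch.float64)

    zero = y_true == 0
    # Replace zeros in the denominator so the division is always defined.
    safe = torch.where(zero, torch.ones_like(y_true), y_true)

    per_sample = torch.where(
        zero,
        y_pred.abs(),                     # case y_i = 0
        ((y_true - y_pred) / safe).abs()  # case y_i != 0
    )
    return per_sample.mean().item()


score = modified_mape(
    torch.tensor([2.0, 0.0, 4.0]),
    torch.tensor([1.0, 0.5, 4.0]),
)
print(score)  # (0.5 + 0.5 + 0.0) / 3
```

A value of zero means the two inputs agree exactly, which is how the metric is used in the comparisons that follow.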

In [20]:
# This cell imports torch_mape 
# if you are running this notebook locally 
# or from Google Colab.

import os
import sys

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

try:
    from tools.torch_metrics import torch_mape as mape
    print('mape imported locally.')
except ModuleNotFoundError:
    import subprocess

    repo_url = 'https://raw.githubusercontent.com/PilotLeoYan/inside-deep-learning/main/content/tools/torch_metrics.py'
    local_file = 'torch_metrics.py'
    
    subprocess.run(['wget', repo_url, '-O', local_file], check=True)
    try:
        from torch_metrics import torch_mape as mape # type: ignore
        print('mape imported from GitHub.')
    except Exception as e:
        print(e)

mape imported locally.


### predictions

Let's compare the predictions of our model and PyTorch's using modified MAPE.

In [21]:
mape(
    model.predict(X_test),
    torch_model.forward(X_test).squeeze(-1)
)

23.432626662762388

They differ considerably because each model has its own parameters 
initialized randomly and independently of the other model.

### copy parameters

We copy the values of the PyTorch model parameters to our model.

In [22]:
model.copy_params(torch_model.layer)

### predictions after copying parameters

We measure the difference between the predictions of both models again.

In [23]:
mape(
    model.predict(X_test),
    torch_model.forward(X_test).squeeze(-1)
)

0.0

We can see that their predictions are now identical.

### loss

In [24]:
mape(
    model.evaluate(X_test, Y_test),
    torch_model.evaluate(X_test, Y_test.unsqueeze(-1))
)

0.0

### training

We are going to train both models using the same hyperparameter values. 
If our model is well designed, then starting from the same parameters 
it should arrive at the same parameter values as the PyTorch model after training.

In [25]:
LR: float = 0.01 # learning rate
EPOCHS: int = 16 # number of epochs
BATCH: int = len(X_train) // 3 # mini-batch size

In [26]:
torch_model.fit(
    X_train, 
    Y_train.unsqueeze(-1),
    EPOCHS, LR, BATCH,
    X_test,
    Y_test.unsqueeze(-1)
)

epoch: 0 - MSE: 3701.5284 - MSE_v: 3197.1711
epoch: 1 - MSE: 2913.2982 - MSE_v: 2740.885
epoch: 2 - MSE: 2324.5076 - MSE_v: 2376.62
epoch: 3 - MSE: 1881.3441 - MSE_v: 2081.6647
epoch: 4 - MSE: 1544.8211 - MSE_v: 1839.3778
epoch: 5 - MSE: 1286.6667 - MSE_v: 1637.5224
epoch: 6 - MSE: 1086.3535 - MSE_v: 1467.0603
epoch: 7 - MSE: 928.9557 - MSE_v: 1321.2811
epoch: 8 - MSE: 803.5995 - MSE_v: 1195.1708
epoch: 9 - MSE: 702.3454 - MSE_v: 1084.955
epoch: 10 - MSE: 619.3794 - MSE_v: 987.7678
epoch: 11 - MSE: 550.4292 - MSE_v: 901.4109
epoch: 12 - MSE: 492.3422 - MSE_v: 824.1796
epoch: 13 - MSE: 442.7796 - MSE_v: 754.7352
epoch: 14 - MSE: 399.9957 - MSE_v: 692.0129
epoch: 15 - MSE: 362.6781 - MSE_v: 635.1533


In [27]:
model.fit(
    X_train, Y_train,
    EPOCHS, LR, BATCH,
    X_test, Y_test
)

epoch: 0 - MSE: 3701.5284 - MSE_v: 3197.1711
epoch: 1 - MSE: 2913.2982 - MSE_v: 2740.885
epoch: 2 - MSE: 2324.5076 - MSE_v: 2376.62
epoch: 3 - MSE: 1881.3441 - MSE_v: 2081.6647
epoch: 4 - MSE: 1544.8211 - MSE_v: 1839.3778
epoch: 5 - MSE: 1286.6667 - MSE_v: 1637.5224
epoch: 6 - MSE: 1086.3535 - MSE_v: 1467.0603
epoch: 7 - MSE: 928.9557 - MSE_v: 1321.2811
epoch: 8 - MSE: 803.5995 - MSE_v: 1195.1708
epoch: 9 - MSE: 702.3454 - MSE_v: 1084.955
epoch: 10 - MSE: 619.3794 - MSE_v: 987.7678
epoch: 11 - MSE: 550.4292 - MSE_v: 901.4109
epoch: 12 - MSE: 492.3422 - MSE_v: 824.1796
epoch: 13 - MSE: 442.7796 - MSE_v: 754.7352
epoch: 14 - MSE: 399.9957 - MSE_v: 692.0129
epoch: 15 - MSE: 362.6781 - MSE_v: 635.1533


### predictions after training

In [28]:
mape(
    model.predict(X_test),
    torch_model.forward(X_test).squeeze(-1)
)

3.4907112647892055e-16
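
A difference of this size is consistent with float64 round-off rather than with a mistake in the gradients. As a rough check, the sketch below compares the value printed above with the float64 machine epsilon reported by `torch.finfo`.

```python
import torch

# Spacing between 1.0 and the next representable float64 number.
eps = torch.finfo(torch.float64).eps

# Value copied from the output above.
reported = 3.4907112647892055e-16

print(eps)
print(reported / eps)
```

The ratio is below 2, so the two models agree to within about two units of machine precision in relative terms.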

### bias

We directly measure the difference between the bias values of both models.

In [29]:
mape(
    model.b.clone(),
    torch_model.layer.bias.detach()
)

1.9354088981671021e-16

### weight

And measure the difference between the weight values of both models.

In [30]:
mape(
    model.w.clone(),
    torch_model.layer.weight.detach().squeeze(0)
)

0.0

All right, our implementation matches PyTorch's up to floating-point round-off. 
Now we can finally tackle the multioutput case in the next notebook.