<h1>1.2 - Multivariate Linear Regression</h1>

<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/PilotLeoYan/inside-deep-learning/blob/main/1-linear-regression/1-2-multivariate-linear-regression.ipynb">
    <img src="../images/colab_logo.png" />Open in Google Colab</a>
  </td>
</table>

We now increase the complexity: 
instead of a single output, 
the perceptron will have multiple outputs. 
The word "multivariable" usually refers to a model that receives multiple inputs; 
here we use "multivariate" to mean that the perceptron has multiple outputs.

<div style="text-align: center; background-color: black">
<img src="../images/multivariate-perceptron.png" alt="One multivariate perceptron" width="300">
</div>

We can think of the multivariate perceptron as a layer of several simple perceptrons, 
where the output of each simple perceptron corresponds to one output feature.

<div style="text-align: center; background-color: black">
<img src="../images/multivariate-perceptron-as-layer.png" alt="One layer of simple perceptron" width="400">
</div>
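
To make the "layer of simple perceptrons" picture concrete, here is a minimal sketch. The names `simple_perceptron`, `W`, and `b` are illustrative and not part of this notebook; the sketch assumes each simple perceptron computes a weighted sum plus a bias.

```python
import torch

torch.manual_seed(0)
m, n, n_out = 4, 6, 3            # samples, input features, output features
X = torch.randn(m, n, dtype=torch.float64)
W = torch.randn(n, n_out, dtype=torch.float64)   # column j: weights of perceptron j
b = torch.randn(n_out, dtype=torch.float64)      # entry j: bias of perceptron j

def simple_perceptron(x, w, bias):
    """Hypothetical single-output perceptron: one weighted sum per sample."""
    return x @ w + bias          # shape (m,)

# Run the n_out simple perceptrons one by one and stack their outputs as columns.
stacked = torch.stack(
    [simple_perceptron(X, W[:, j], b[j]) for j in range(n_out)], dim=1
)

# The multivariate perceptron computes all outputs with one matrix product.
layer = X @ W + b

print(stacked.shape)                    # torch.Size([4, 3])
print(torch.allclose(stacked, layer))   # True
```

Each column of `W` plays the role of one simple perceptron, which is why the whole layer reduces to a single matrix product.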

In [1]:
import torch
from torch import nn

from platform import python_version
python_version(), torch.__version__

('3.12.6', '2.5.1+cu124')

In [2]:
device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'
device

'cuda'

In [3]:
torch.set_default_dtype(torch.float64)

In [4]:
def add_to_class(Class):  
    """Register functions as methods in created class."""
    def wrapper(obj): setattr(Class, obj.__name__, obj)
    return wrapper
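
As a rough illustration of how this helper behaves, the sketch below registers a function on a small class. `Counter` and `increment` are hypothetical names used only for this example.

```python
def add_to_class(Class):
    """Register functions as methods in created class."""
    def wrapper(obj): setattr(Class, obj.__name__, obj)
    return wrapper

class Counter:                      # hypothetical class, for illustration only
    def __init__(self):
        self.value = 0

@add_to_class(Counter)
def increment(self, step: int = 1) -> int:
    """Becomes Counter.increment once the decorator runs."""
    self.value += step
    return self.value

c = Counter()
print(c.increment())      # 1
print(c.increment(5))     # 6

# wrapper returns None, so the module-level name is rebound to None;
# the function is reachable only through the class.
print(increment is None)  # True
```

This lets a class be built up across several notebook cells, one method per cell.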

# Dataset

## create dataset

$$
\begin{align*}
\mathbf{X} &\in \mathbb{R}^{m \times n} \\
\mathbf{Y} &\in \mathbb{R}^{m \times n_{1}}
\end{align*}
$$
where $n_{1}$ is the number of output features.

$$
\mathbf{X} = \begin{bmatrix}
    x_{11} & x_{12} & \cdots & x_{1n} \\
    x_{21} & x_{22} & \cdots & x_{2n} \\
    \vdots & \vdots & \ddots & \vdots \\
    x_{m1} & x_{m2} & \cdots & x_{mn}
\end{bmatrix}
$$

$$
\mathbf{Y} = \begin{bmatrix}
    y_{11} & y_{12} & \cdots & y_{1n_{1}} \\
    y_{21} & y_{22} & \cdots & y_{2n_{1}} \\
    \vdots & \vdots & \ddots & \vdots \\
    y_{m1} & y_{m2} & \cdots & y_{mn_{1}} 
\end{bmatrix}
$$

In [5]:
from sklearn.datasets import make_regression
import random

M: int = 10_100 # number of samples
N: int = 6 # number of input features
NO: int = 3 # number of output features

X, Y = make_regression(
    n_samples=M, 
    n_features=N, 
    n_targets=NO, 
    n_informative=N - 1,
    bias=random.random(),
    noise=1
)

print(X.shape)
print(Y.shape)

(10100, 6)
(10100, 3)


## split dataset

In [6]:
X_train = torch.tensor(X[:100], device=device)
Y_train = torch.tensor(Y[:100], device=device)
X_train.shape, Y_train.shape

(torch.Size([100, 6]), torch.Size([100, 3]))

In [7]:
X_valid = torch.tensor(X[100:], device=device)
Y_valid = torch.tensor(Y[100:], device=device)
X_valid.shape, Y_valid.shape

(torch.Size([10000, 6]), torch.Size([10000, 3]))

## delete raw dataset

In [8]:
del X
del Y

# Scratch model

## weights and bias

Trainable parameters:

$$
\begin{align*}
\mathbf{W} &\in \mathbb{R}^{n \times n_{1}} \\
\mathbf{b} &\in \mathbb{R}^{n_{1}}
\end{align*}
$$

$$
\mathbf{W} = \begin{bmatrix}
    w_{11} & w_{12} & \cdots & w_{1n_{1}} \\
    w_{21} & w_{22} & \cdots & w_{2n_{1}} \\
    \vdots & \vdots & \ddots & \vdots \\
    w_{n1} & w_{n2} & \cdots & w_{nn_{1}}
\end{bmatrix}
$$

$$
\mathbf{b} = \begin{bmatrix}
    b_{1} \\
    b_{2} \\
    \vdots \\
    b_{n_{1}}
\end{bmatrix}
$$

In [9]:
class LinearRegression:
    def __init__(self, n_features: int, out_features: int):
        self.w = torch.randn(n_features, out_features, device=device)
        self.b = torch.randn(out_features, device=device)

    def copy_params(self, torch_layer: torch.nn.modules.linear.Linear):
        """
        Copy the parameters from a torch.nn.Linear layer into this model.

        Args:
            torch_layer: PyTorch module from which to copy the parameters.
        """
        self.b.copy_(torch_layer.bias.detach().clone())
        self.w.copy_(torch_layer.weight.T.detach().clone())
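
The transpose in `copy_params` is needed because `nn.Linear` stores its weight with shape `(out_features, in_features)`, while the scratch model uses `(n_features, out_features)`. A small check, using arbitrary sizes chosen only for illustration:

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(in_features=6, out_features=3, dtype=torch.float64)
x = torch.randn(5, 6, dtype=torch.float64)

print(layer.weight.shape)   # torch.Size([3, 6])  -> (out_features, in_features)
print(layer.bias.shape)     # torch.Size([3])

# Transposing the weight gives the (n, n_1) layout used by the scratch model.
w = layer.weight.detach().T
b = layer.bias.detach()
print(torch.allclose(x @ w + b, layer(x)))   # True
```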

## weighted sum

$$
\begin{align*}
\mathbf{\hat{Y}}(\mathbf{X}) &= \mathbf{X}\mathbf{W} + \mathbf{b} \\
\mathbf{\hat{Y}} &: \mathbb{R}^{m \times n} \rightarrow 
\mathbb{R}^{m \times n_{1}}
\end{align*}
$$

where
$$
\hat{y}_{ij} =
\mathbf{x}_{i}^\top
\mathbf{w}_{:,j}
+ b_{j}
$$
for all $i = 1, \ldots, m$ and $j = 1, \ldots, n_{1}$.
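
The element-wise formula and the matrix form can be compared directly. The sketch below uses small random tensors (sizes are arbitrary) and an explicit double loop; note that $\mathbf{b}$ has shape $(n_{1},)$ and is broadcast over the $m$ rows of $\mathbf{X}\mathbf{W}$.

```python
import torch

torch.manual_seed(0)
m, n, n_out = 5, 6, 3
X = torch.randn(m, n, dtype=torch.float64)
W = torch.randn(n, n_out, dtype=torch.float64)
b = torch.randn(n_out, dtype=torch.float64)

# Matrix form: b is broadcast over the m rows of X @ W.
Y_hat = X @ W + b

# Element form: y_hat[i, j] = x_i . w_{:, j} + b_j
Y_loop = torch.empty(m, n_out, dtype=torch.float64)
for i in range(m):
    for j in range(n_out):
        Y_loop[i, j] = (X[i] * W[:, j]).sum() + b[j]

print(Y_hat.shape)                      # torch.Size([5, 3])
print(torch.allclose(Y_hat, Y_loop))    # True
```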

In [10]:
@add_to_class(LinearRegression)
def predict(self, x: torch.Tensor) -> torch.Tensor:
    """
    Predict the output for input x

    Args:
        x: Input tensor of shape (n_samples, n_features).

    Returns:
        y_pred: Predicted output tensor of shape (n_samples, out_features).
    """
    return torch.matmul(x, self.w) + self.b

## MSE

Mean Squared Error

$$
\begin{align*}
L(\mathbf{\hat{Y}}) &= \frac{1}{mn_{1}} 
\sum_{i=1}^{m} \sum_{j=1}^{n_{1}}(
    \hat{y}_{ij} - y_{ij})^{2} \\
L &: \mathbb{R}^{m \times n_{1}} \rightarrow \mathbb{R}
\end{align*}
$$

Vectorized form

$$
L(\mathbf{\hat{Y}}) = \frac{1}{mn_{1}} \text{sum} \left(
    \left(
        \mathbf{\hat{Y} - Y}
    \right)^2
\right)
$$
where ${\mathbf{A}}^2$ denotes the element-wise square, that is, ${\mathbf{A}}^2 = \mathbf{A} \odot \mathbf{A}$. <br>
**Note**: $\odot$ is the element-wise product, also called the Hadamard product.
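
As a quick sanity check, the double-sum form, the vectorized form, and PyTorch's built-in `mse_loss` should agree. The tensors below are random placeholders, not the notebook's data.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
m, n_out = 4, 3
Y_hat = torch.randn(m, n_out, dtype=torch.float64)
Y = torch.randn(m, n_out, dtype=torch.float64)

# Double-sum form
total = 0.0
for i in range(m):
    for j in range(n_out):
        total += (Y_hat[i, j] - Y[i, j]).item() ** 2
loss_sum = total / (m * n_out)

# Vectorized form with the Hadamard product
diff = Y_hat - Y
loss_vec = ((diff * diff).sum() / (m * n_out)).item()

# Reference: default reduction='mean' averages over all m * n_out elements
loss_ref = F.mse_loss(Y_hat, Y).item()

print(abs(loss_sum - loss_vec) < 1e-12)  # True
print(abs(loss_vec - loss_ref) < 1e-12)  # True
```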

In [11]:
@add_to_class(LinearRegression)
def mse_loss(self, y_true: torch.Tensor, y_pred: torch.Tensor):
    """
    MSE loss function between target y_true and y_pred.

    Args:
        y_true: Target tensor of shape (n_samples, out_features).
        y_pred: Predicted tensor of shape (n_samples, out_features).

    Returns:
        loss: MSE loss between predictions and true values.
    """
    return ((y_pred - y_true)**2).mean().item()

@add_to_class(LinearRegression)
def evaluate(self, x: torch.Tensor, y_true: torch.Tensor):
    """
    Evaluate the model on input x and target y_true using MSE.

    Args:
        x: Input tensor of shape (n_samples, n_features).
        y_true: Target tensor of shape (n_samples, out_features).

    Returns:
        loss: MSE loss between predictions and true values.
    """
    y_pred = self.predict(x)
    return self.mse_loss(y_true, y_pred)

## compute gradients

There are two ways to compute the gradients:
1. Compute each derivative individually and then combine them using Einstein summation.
2. Compute an initial derivative and pass it backwards as an argument.

Method 2 is the most common 
because it is easier to visualize and cheaper to compute, 
whereas method 1 has to build and contract much larger derivative tensors. 
We prefer method 2, but we will also implement method 1 for comparison.

### MSE derivative

$$
\begin{align*}
\frac{\partial L}{\partial \hat{y}_{pq}} &=
\frac{1}{mn_{1}} \sum_{i=1}^{m} \sum_{j=1}^{n_{1}}
\frac{\partial}{\partial \hat{y}_{pq}} 
\left(
    (\hat{y}_{ij} - y_{ij})^2
\right) \\
&= \frac{2}{mn_{1}} \sum_{i=1}^{m} \sum_{j=1}^{n_{1}}
(\hat{y}_{ij} - y_{ij})
\frac{\partial \hat{y}_{ij}}{\partial \hat{y}_{pq}}
\end{align*}
$$
for all $p = 1, \ldots, m$ and $q = 1, \ldots, n_{1}$.

$$
\frac{\partial \hat{y}_{ij}}{\partial \hat{y}_{pq}} =
\begin{cases}
    1 & \text{if } i=p, j=q \\
    0 & \text{otherwise}
\end{cases}
$$

then
$$
\begin{align*}
\frac{\partial L}{\partial \hat{y}_{pq}} &=
\frac{2}{mn_{1}} \sum_{i=1}^{m} \sum_{j=1}^{n_{1}}
(\hat{y}_{ij} - y_{ij})
\frac{\partial \hat{y}_{ij}}{\partial \hat{y}_{pq}} \\
&= \frac{2}{mn_{1}} (\hat{y}_{pq} - y_{pq})
\end{align*}
$$

therefore
$$
\frac{\partial L}{\partial \mathbf{\hat{Y}}} =
\frac{2}{mn_{1}} \left(
    \mathbf{\hat{Y}} - \mathbf{Y}
\right)
$$
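
This result can be checked against autograd. The sketch below treats $\mathbf{\hat{Y}}$ as a free variable with random values (an assumption made only for the check) and compares its gradient with the closed form.

```python
import torch

torch.manual_seed(0)
m, n_out = 4, 3
Y_hat = torch.randn(m, n_out, dtype=torch.float64, requires_grad=True)
Y = torch.randn(m, n_out, dtype=torch.float64)

# Autograd gradient of the MSE with respect to Y_hat
loss = ((Y_hat - Y) ** 2).mean()
loss.backward()

# Closed form derived above: 2 / (m * n_1) * (Y_hat - Y)
analytic = 2 * (Y_hat.detach() - Y) / (m * n_out)

print(torch.allclose(Y_hat.grad, analytic))  # True
```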

### weighted sum derivative

#### with respect to bias

$$
\begin{align*}
\frac{\partial L}{\partial b_{p}} &=
\frac{1}{mn_{1}} \sum_{i=1}^{m} \sum_{j=1}^{n_{1}}
\frac{\partial}{\partial b_{p}} 
\left(
    (\hat{y}_{ij} - y_{ij})^2
\right) \\
&= \frac{2}{mn_{1}} \sum_{i=1}^{m} \sum_{j=1}^{n_{1}}
(\hat{y}_{ij} - y_{ij})
\frac{\partial \hat{y}_{ij}}{\partial b_{p}} \\
&= \sum_{i=1}^{m} \sum_{j=1}^{n_{1}}
\frac{\partial L}{\partial \hat{y}_{ij}}
\frac{\partial \hat{y}_{ij}}{\partial b_{p}}
\end{align*}
$$
for all $p = 1, \ldots, n_{1}$.

$$
\frac{\partial \hat{y}_{ij}}{\partial b_{p}} =
\begin{cases}
    1 & \text{if } j=p \\
    0 & \text{otherwise}
\end{cases}
$$

then
$$
\begin{align*}
\frac{\partial L}{\partial b_{p}} &=
\sum_{i=1}^{m} \sum_{j=1}^{n_{1}}
\frac{\partial L}{\partial \hat{y}_{ij}}
\frac{\partial \hat{y}_{ij}}{\partial b_{p}} \\
&= \sum_{i=1}^{m}
\frac{\partial L}{\partial \hat{y}_{ip}}
\end{align*}
$$
 

therefore
$$
\frac{\partial L}{\partial \mathbf{b}} =
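\mathbf{1}^\top \frac{\partial L}{\partial \mathbf{\hat{Y}}}
$$
where $\mathbf{1} \in \mathbb{R}^{m}$ is a vector of ones, so the product sums $\frac{\partial L}{\partial \mathbf{\hat{Y}}}$ over its $m$ rows.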
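
A minimal check of the bias gradient, assuming random data and parameters: multiplying by a vector of ones is the same as summing $\frac{\partial L}{\partial \mathbf{\hat{Y}}}$ over the sample axis, and both match autograd.

```python
import torch

torch.manual_seed(0)
m, n, n_out = 5, 6, 3
X = torch.randn(m, n, dtype=torch.float64)
Y = torch.randn(m, n_out, dtype=torch.float64)
W = torch.randn(n, n_out, dtype=torch.float64)
b = torch.randn(n_out, dtype=torch.float64, requires_grad=True)

Y_hat = X @ W + b
loss = ((Y_hat - Y) ** 2).mean()
loss.backward()                          # fills b.grad

dL_dY = 2 * (Y_hat.detach() - Y) / (m * n_out)
ones = torch.ones(m, dtype=torch.float64)
grad_b = ones @ dL_dY                    # shape (n_out,)

print(torch.allclose(grad_b, dL_dY.sum(dim=0)))  # True
print(torch.allclose(grad_b, b.grad))            # True
```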

#### with respect to weight

$$
\begin{align*}
\frac{\partial L}{\partial w_{pq}} &=
\frac{1}{mn_{1}} \sum_{i=1}^{m} \sum_{j=1}^{n_{1}}
\frac{\partial}{\partial w_{pq}} 
\left(
    (\hat{y}_{ij} - y_{ij})^2
\right) \\
&= \frac{2}{mn_{1}} \sum_{i=1}^{m} \sum_{j=1}^{n_{1}}
(\hat{y}_{ij} - y_{ij})
\frac{\partial \hat{y}_{ij}}{\partial w_{pq}} \\
&= \sum_{i=1}^{m} \sum_{j=1}^{n_{1}}
\frac{\partial L}{\partial \hat{y}_{ij}}
\frac{\partial \hat{y}_{ij}}{\partial w_{pq}}
\end{align*}
$$
for all $p = 1, \ldots, n$ and $q = 1, \ldots, n_{1}$.

$$
\frac{\partial \hat{y}_{ij}}{\partial w_{pq}} =
\begin{cases}
    x_{ip} & \text{if } j=q \\
    0 & \text{otherwise}
\end{cases}
$$

then
$$
\begin{align*}
\frac{\partial L}{\partial w_{pq}} &=
\sum_{i=1}^{m} \sum_{j=1}^{n_{1}}
\frac{\partial L}{\partial \hat{y}_{ij}}
\frac{\partial \hat{y}_{ij}}{\partial w_{pq}} \\
&= \sum_{i=1}^{m} x_{ip}
\frac{\partial L}{\partial \hat{y}_{iq}} \\
&= (\mathbf{x}_{:,p})^\top
\frac{\partial L}{\partial \mathbf{\hat{y}}_{:,q}} \\
&= (\mathbf{X}^\top)_{p,:}
\frac{\partial L}{\partial \mathbf{\hat{y}}_{:,q}}
\end{align*}
$$

therefore
$$
\frac{\partial L}{\partial \mathbf{W}} =
\mathbf{X}^\top
\frac{\partial L}{\partial \mathbf{\hat{Y}}}
$$
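
The same kind of check works for the weight gradient. The sketch below, again on random placeholder tensors, compares one entry computed from the sum over samples, the matrix form $\mathbf{X}^\top \frac{\partial L}{\partial \mathbf{\hat{Y}}}$, and autograd. The indices `p` and `q` are arbitrary.

```python
import torch

torch.manual_seed(0)
m, n, n_out = 5, 6, 3
X = torch.randn(m, n, dtype=torch.float64)
Y = torch.randn(m, n_out, dtype=torch.float64)
W = torch.randn(n, n_out, dtype=torch.float64, requires_grad=True)
b = torch.randn(n_out, dtype=torch.float64)

Y_hat = X @ W + b
loss = ((Y_hat - Y) ** 2).mean()
loss.backward()                          # fills W.grad

dL_dY = 2 * (Y_hat.detach() - Y) / (m * n_out)
grad_W = X.T @ dL_dY                     # shape (n, n_out)

# One entry from the sum over samples: sum_i x_ip * dL/dy_hat_iq
p, q = 2, 1
entry = sum(X[i, p] * dL_dY[i, q] for i in range(m))

print(torch.isclose(entry, grad_W[p, q]).item())  # True
print(torch.allclose(grad_W, W.grad))             # True
```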

### gradients

$$
\begin{align*}
\nabla_{\mathbf{b}}L =
\frac{\partial L}{\partial \mathbf{b}} &=
{\color{Cyan} {\frac{\partial L}{\partial \mathbf{\hat{Y}}}}}
{\color{Orange} {\frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{b}}}} \\
&= {\color{Cyan} {\frac{2}{mn_{1}}}}
{\color{Orange} {\mathbf{1}^\top}}
{\color{Cyan} {\left(\mathbf{\hat{Y}} - \mathbf{Y} \right)}}
\end{align*}
$$

and

$$
\begin{align*}
\nabla_{\mathbf{W}}L =
\frac{\partial L}{\partial \mathbf{W}} &=
{\color{Cyan} {\frac{\partial L}{\partial \mathbf{\hat{Y}}}}}
{\color{Magenta} {\frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{W}}}} \\
&= {\color{Cyan} {\frac{2}{mn_{1}}}}
{\color{Magenta} {\mathbf{X}^\top}}
{\color{Cyan} {\left(\mathbf{\hat{Y}} - \mathbf{Y} \right)}}
\end{align*}
$$

## Parameters update

In [12]:
@add_to_class(LinearRegression)
def update(self, x: torch.Tensor, y_true: torch.Tensor, y_pred: torch.Tensor, lr: float):
    """
    Update the model parameters.

    Args:
       x: Input tensor of shape (n_samples, n_features).
       y_true: Target tensor of shape (n_samples, out_features).
       y_pred: Predicted output tensor of shape (n_samples, out_features).
       lr: Learning rate. 
    """
    delta = 2 * (y_pred - y_true) / y_true.numel()
    self.b -= lr * delta.sum(axis=0)
    self.w -= lr * torch.matmul(x.T, delta)

## fit (train)

In [13]:
@add_to_class(LinearRegression)
def fit(self, x_train: torch.Tensor, y_train: torch.Tensor, 
        epochs: int, lr: float, batch_size: int, 
        x_valid: torch.Tensor, y_valid: torch.Tensor):
    """
    Fit the model using gradient descent.
    
    Args:
        x_train: Input tensor of shape (n_samples, n_features).
        y_train: Target tensor of shape (n_samples, out_features).
        epochs: Number of epochs to fit.
        lr: Learning rate.
        batch_size: Number of samples per batch.
        x_valid: Input tensor of shape (n_valid_samples, n_features).
        y_valid: Target tensor of shape (n_valid_samples, out_features).
    """
    for epoch in range(epochs):
        loss = []
        for batch in range(0, len(y_train), batch_size):
            end_batch = batch + batch_size

            y_pred = self.predict(x_train[batch:end_batch])

            loss.append(self.mse_loss(
                y_train[batch:end_batch], 
                y_pred
            ))

            self.update(
                x_train[batch:end_batch], 
                y_train[batch:end_batch], 
                y_pred, 
                lr
            )

        loss = round(sum(loss) / len(loss), 4)
        loss_v = round(self.evaluate(x_valid, y_valid), 4)
        print(f'epoch: {epoch} - MSE: {loss} - MSE_v: {loss_v}')

# Scratch vs Torch.nn

## Torch.nn model

In [14]:
class TorchLinearRegression(nn.Module):
    def __init__(self, n_features, n_out_features):
        super(TorchLinearRegression, self).__init__()
        self.layer = nn.Linear(n_features, n_out_features, device=device)
        self.loss = nn.MSELoss()

    def forward(self, x):
        return self.layer(x)
    
    def evaluate(self, x, y):
        self.eval()
        with torch.no_grad():
            y_pred = self.forward(x)
            return self.loss(y_pred, y).item()
    
    def fit(self, x, y, epochs, lr, batch_size, x_valid, y_valid):
        optimizer = torch.optim.SGD(self.parameters(), lr=lr)
        for epoch in range(epochs):
            loss_t = []
            for batch in range(0, len(y), batch_size):
                end_batch = batch + batch_size

                y_pred = self.forward(x[batch:end_batch])
                loss = self.loss(y_pred, y[batch:end_batch])
                loss_t.append(loss.item())

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            loss_t = round(sum(loss_t) / len(loss_t), 4)
            loss_v = round(self.evaluate(x_valid, y_valid), 4)
            print(f'epoch: {epoch} - MSE: {loss_t} - MSE_v: {loss_v}')

In [15]:
torch_model = TorchLinearRegression(N, NO)

## scratch model

In [16]:
model = LinearRegression(N, NO)

## evals

### MAPE modified

In [17]:
import os
import sys

# Add the module path if running locally
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

try:
    # Try importing the module normally (for local execution)
    from tools.torch_metrics import torch_mape as mape
except ModuleNotFoundError:
    # If the module is not found, assume the code is running in Google Colab
    import subprocess

    repo_url = "https://raw.githubusercontent.com/PilotLeoYan/inside-deep-learning/main/tools/torch_metrics.py"
    local_file = "torch_metrics.py"

    # Download the missing file from GitHub
    subprocess.run(["wget", repo_url, "-O", local_file], check=True)

    # Import the module after downloading it
    import torch_metrics
    from torch_metrics import torch_mape as mape

### predict

In [18]:
mape(
    model.predict(X_valid),
    torch_model.forward(X_valid)
)

1801.3082740656382

### copy parameters

In [19]:
model.copy_params(torch_model.layer)
parameters = (model.b.clone(), model.w.clone())

### predict after copy parameters

In [20]:
mape(
    model.predict(X_valid),
    torch_model.forward(X_valid)
)

0.0

### loss

In [21]:
mape(
    model.evaluate(X_valid, Y_valid),
    torch_model.evaluate(X_valid, Y_valid)
)

0.0

### train

In [22]:
LR = 0.01 # learning rate
EPOCHS = 16 # number of epochs
BATCH = len(X_train) // 3 # batch size
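
Since `X_train` holds 100 samples, `BATCH` is 33, and the slicing used in `fit` produces four batches rather than three. The short sketch below reproduces only the index arithmetic; `n_train`, `starts`, and `sizes` are illustrative names.

```python
n_train = 100               # number of training samples (X_train has 100 rows)
batch_size = n_train // 3   # same expression as BATCH

# Same start indices as `range(0, len(y_train), batch_size)` in fit
starts = list(range(0, n_train, batch_size))
# Slices past the end are clipped, so the last batch can be shorter
sizes = [min(start + batch_size, n_train) - start for start in starts]

print(batch_size)  # 33
print(starts)      # [0, 33, 66, 99]
print(sizes)       # [33, 33, 33, 1]
```

The last batch contains a single sample, and because `fit` averages the per-batch losses without weighting, that one sample counts as much as a full batch in the reported training MSE.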

In [23]:
torch_model.fit(
    X_train, Y_train, 
    EPOCHS, LR, BATCH, 
    X_valid, Y_valid
)

epoch: 0 - MSE: 23450.5454 - MSE_v: 17270.5571
epoch: 1 - MSE: 21600.057 - MSE_v: 16147.6228
epoch: 2 - MSE: 19916.0383 - MSE_v: 15111.7659
epoch: 3 - MSE: 18382.2784 - MSE_v: 14155.2469
epoch: 4 - MSE: 16984.2129 - MSE_v: 13271.0748
epoch: 5 - MSE: 15708.7527 - MSE_v: 12452.9314
epoch: 6 - MSE: 14544.1318 - MSE_v: 11695.1036
epoch: 7 - MSE: 13479.7698 - MSE_v: 10992.4219
epoch: 8 - MSE: 12506.1498 - MSE_v: 10340.2066
epoch: 9 - MSE: 11614.7088 - MSE_v: 9734.2181
epoch: 10 - MSE: 10797.739 - MSE_v: 9170.6135
epoch: 11 - MSE: 10048.3004 - MSE_v: 8645.9068
epoch: 12 - MSE: 9360.1413 - MSE_v: 8156.9337
epoch: 13 - MSE: 8727.6283 - MSE_v: 7700.8198
epoch: 14 - MSE: 8145.683 - MSE_v: 7274.9523
epoch: 15 - MSE: 7609.7249 - MSE_v: 6876.9539


In [24]:
model.fit(
    X_train, Y_train, 
    EPOCHS, LR, BATCH, 
    X_valid, Y_valid
)

epoch: 0 - MSE: 23450.5454 - MSE_v: 17270.5571
epoch: 1 - MSE: 21600.057 - MSE_v: 16147.6228
epoch: 2 - MSE: 19916.0383 - MSE_v: 15111.7659
epoch: 3 - MSE: 18382.2784 - MSE_v: 14155.2469
epoch: 4 - MSE: 16984.2129 - MSE_v: 13271.0748
epoch: 5 - MSE: 15708.7527 - MSE_v: 12452.9314
epoch: 6 - MSE: 14544.1318 - MSE_v: 11695.1036
epoch: 7 - MSE: 13479.7698 - MSE_v: 10992.4219
epoch: 8 - MSE: 12506.1498 - MSE_v: 10340.2066
epoch: 9 - MSE: 11614.7088 - MSE_v: 9734.2181
epoch: 10 - MSE: 10797.739 - MSE_v: 9170.6135
epoch: 11 - MSE: 10048.3004 - MSE_v: 8645.9068
epoch: 12 - MSE: 9360.1413 - MSE_v: 8156.9337
epoch: 13 - MSE: 8727.6283 - MSE_v: 7700.8198
epoch: 14 - MSE: 8145.683 - MSE_v: 7274.9523
epoch: 15 - MSE: 7609.7249 - MSE_v: 6876.9539


### predict after training

In [25]:
mape(
    model.predict(X_valid),
    torch_model.forward(X_valid)
)

5.278887730502764e-14

### weight 

In [26]:
mape(
    model.w.clone(),
    torch_model.layer.weight.detach().T
)

4.22480178152855e-15

### bias

In [27]:
mape(
    model.b.clone(),
    torch_model.layer.bias.detach()
)

1.3582561077298844e-14

# Compute gradient with einsum

$$
\begin{align*}
\frac{\partial L}{\partial \mathbf{W}} &=
\frac{\partial L}{\partial \mathbf{\hat{Y}}}
\frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{W}} \\
\frac{\partial L}{\partial \mathbf{b}} &=
\frac{\partial L}{\partial \mathbf{\hat{Y}}}
\frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{b}}
\end{align*}
$$


where their shapes are

$$
\begin{align*}
\frac{\partial L}{\partial \mathbf{W}} &\in \mathbb{R}^{n \times n_{1}} \\
\frac{\partial L}{\partial \mathbf{b}} &\in \mathbb{R}^{n_{1}} \\
\frac{\partial L}{\partial \mathbf{\hat{Y}}} &\in \mathbb{R}^{m \times n_{1}} \\
\frac{\partial \mathbf{\hat{Y}}}
{\partial \mathbf{W}} &\in \mathbb{R}^{(m \times n_{1}) \times (n \times n_{1})} \\
\frac{\partial \mathbf{\hat{Y}}}
{\partial \mathbf{b}} &\in \mathbb{R}^{(m \times n_{1}) \times (n_{1})}
\end{align*}
$$

**Note**: $\frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{W}}$
has four axes, which is why this method requires more computation and memory.
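
To give a feel for the extra cost, the sketch below counts elements for one training batch, using the sizes from this notebook (batch size 33, 6 input features, 3 output features). It only does arithmetic and allocates nothing.

```python
m, n, n_out = 33, 6, 3   # batch size, input features, output features

jacobian_elems = m * n_out * n * n_out   # dY_hat/dW, four axes
x_elems = m * n                          # X, which is all the matrix form needs

print(jacobian_elems)             # 1782
print(x_elems)                    # 198
print(jacobian_elems // x_elems)  # 9, i.e. n_out ** 2
```

The four-axis tensor is $n_{1}^{2}$ times larger than $\mathbf{X}$, and most of its entries are zero.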

weighted sum derivative with respect to bias

$$
\frac{\partial \hat{y}_{ij}}{\partial b_{p}} = 
\begin{cases}
    1 & \text{if } j=p \\ 
    0 & \text{if } j\neq p 
\end{cases}
$$
for all $i = 1, \ldots, m$ and $j, p = 1, \ldots, n_{1}$

weighted sum derivative with respect to weight

$$
\frac{\partial \hat{y}_{ij}}{\partial w_{pq}} = 
\begin{cases}
    x_{ip} & \text{if } j=q \\ 
    0 & \text{if } j\neq q 
\end{cases}
$$
for all $i = 1, \ldots, m$,<br>
$j, q = 1, \ldots, n_{1}$ and <br>
$p = 1, \ldots, n$.

Vectorized form

$$
\frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{W}} = 
\mathbb{I} \otimes \mathbf{X}
$$
where $\otimes$ is the Kronecker product and $\mathbb{I} \in \mathbb{R}^{n_{1} \times n_{1}}$ is the identity matrix.
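
One way to build this derivative as a four-axis tensor is sketched below. It fills the tensor from the case definition, builds it again with `torch.kron` on reshaped operands, and compares both with the Jacobian returned by autograd. The tensor names and sizes are assumptions made for this example.

```python
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)
m, n, n_out = 4, 6, 3
X = torch.randn(m, n, dtype=torch.float64)
W = torch.randn(n, n_out, dtype=torch.float64)
b = torch.randn(n_out, dtype=torch.float64)

# Definition: J[i, j, p, q] = x_ip if j == q else 0
J_def = torch.zeros(m, n_out, n, n_out, dtype=torch.float64)
for j in range(n_out):
    J_def[:, j, :, j] = X

# Kronecker product of X and the identity, each reshaped to four axes
I = torch.eye(n_out, dtype=torch.float64)
J_kron = torch.kron(X.reshape(m, 1, n, 1), I.reshape(1, n_out, 1, n_out))

# Autograd reference: Jacobian of Y_hat = X W + b with respect to W
J_auto = jacobian(lambda w: X @ w + b, W)

print(J_kron.shape)                    # torch.Size([4, 3, 6, 3])
print(torch.allclose(J_def, J_kron))   # True
print(torch.allclose(J_def, J_auto))   # True
```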

Therefore, using **Einstein summation**:

$$
\begin{align*}
{\color{Magenta} {\frac{\partial L}{\partial \mathbf{b}}}} &=
{\color{Orange} {\frac{\partial L}{\partial \mathbf{\hat{Y}}}}}
{\color{Cyan} {\frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{b}}}} \\
&\in \mathbb{R}^{
    {\color{Orange} {(m \times n_{1})}} \times 
    {\color{Cyan} {(m \times n_{1} \times n_{1})}}} \\
&\in \mathbb{R}^{\color{Magenta} {n_{1}}}
\end{align*} 
$$

and

$$
\begin{align*}
{\color{Magenta} {\frac{\partial L}{\partial \mathbf{W}}}} &=
{\color{Orange} {\frac{\partial L}{\partial \mathbf{\hat{Y}}}}}
{\color{Cyan} {\frac{\partial \mathbf{\hat{Y}}}{\partial \mathbf{W}}}} \\
&\in \mathbb{R}^{
    {\color{Orange} {(m \times n_{1})}} \times 
    {\color{Cyan} {(m \times n_{1} \times n \times n_{1})}}} \\
&\in \mathbb{R}^{\color{Magenta} {n \times n_{1}}}
\end{align*}
$$
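
The two contractions can be written with `torch.einsum` and compared with the matrix forms derived earlier. In this sketch `delta` is a random stand-in for $\frac{\partial L}{\partial \mathbf{\hat{Y}}}$, and the index letters are chosen to match the formulas above rather than any particular implementation.

```python
import torch

torch.manual_seed(0)
m, n, n_out = 4, 6, 3
X = torch.randn(m, n, dtype=torch.float64)
delta = torch.randn(m, n_out, dtype=torch.float64)   # stands in for dL/dY_hat

I = torch.eye(n_out, dtype=torch.float64)

# dY_hat/dW[i, j, p, q] = x_ip * (j == q)
dY_dW = torch.einsum('ip,jq->ijpq', X, I)
# dY_hat/db[i, j, p] = (j == p), the same identity for every sample i
dY_db = I.expand(m, n_out, n_out)

# Contract over the sample axis i and the output axis j
grad_W = torch.einsum('ij,ijpq->pq', delta, dY_dW)
grad_b = torch.einsum('ij,ijp->p', delta, dY_db)

print(grad_W.shape, grad_b.shape)                # torch.Size([6, 3]) torch.Size([3])
print(torch.allclose(grad_W, X.T @ delta))       # True
print(torch.allclose(grad_b, delta.sum(dim=0)))  # True
```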

## Model

In [28]:
class EinsumLinearRegression(LinearRegression):
    def update(self, x: torch.Tensor, y_true: torch.Tensor, y_pred: torch.Tensor, lr: float):
        """
        Update the model parameters.

        Args:
            x: Input tensor of shape (n_samples, n_features).
            y_true: Target tensor of shape (n_samples, out_features).
            y_pred: Predicted output tensor of shape (n_samples, out_features).
            lr: Learning rate. 
        """
        delta = 2 * (y_pred - y_true) / y_true.numel()
        # d L / d b
        self.b -= lr * delta.sum(axis=0)
        # d L / d W
        identity = torch.eye(y_true.shape[-1], device=device)
        w_der = torch.kron(
            x.unsqueeze(1).unsqueeze(3),
            identity.unsqueeze(0).unsqueeze(2)
        )
        self.w -= lr * torch.einsum('pq,pqij->ij', delta, w_der)

In [29]:
einsum_model = EinsumLinearRegression(N, NO)
einsum_model.b.copy_(parameters[0])
einsum_model.w.copy_(parameters[1])

tensor([[-0.4061, -0.1745, -0.1816],
        [ 0.3046, -0.1626,  0.3205],
        [ 0.0931,  0.1911, -0.3605],
        [ 0.1585,  0.0218,  0.3361],
        [ 0.2438, -0.1821,  0.4021],
        [-0.1901,  0.0661,  0.0360]], device='cuda:0')

In [30]:
einsum_model.fit(
    X_train, Y_train, 
    EPOCHS, LR, BATCH, 
    X_valid, Y_valid
)

epoch: 0 - MSE: 23450.5454 - MSE_v: 17270.5571
epoch: 1 - MSE: 21600.057 - MSE_v: 16147.6228
epoch: 2 - MSE: 19916.0383 - MSE_v: 15111.7659
epoch: 3 - MSE: 18382.2784 - MSE_v: 14155.2469
epoch: 4 - MSE: 16984.2129 - MSE_v: 13271.0748
epoch: 5 - MSE: 15708.7527 - MSE_v: 12452.9314
epoch: 6 - MSE: 14544.1318 - MSE_v: 11695.1036
epoch: 7 - MSE: 13479.7698 - MSE_v: 10992.4219
epoch: 8 - MSE: 12506.1498 - MSE_v: 10340.2066
epoch: 9 - MSE: 11614.7088 - MSE_v: 9734.2181
epoch: 10 - MSE: 10797.739 - MSE_v: 9170.6135
epoch: 11 - MSE: 10048.3004 - MSE_v: 8645.9068
epoch: 12 - MSE: 9360.1413 - MSE_v: 8156.9337
epoch: 13 - MSE: 8727.6283 - MSE_v: 7700.8198
epoch: 14 - MSE: 8145.683 - MSE_v: 7274.9523
epoch: 15 - MSE: 7609.7249 - MSE_v: 6876.9539


In [31]:
mape(
    einsum_model.w.clone(),
    torch_model.layer.weight.detach().T
)

1.0018713882983006e-14

In [32]:
mape(
    einsum_model.b.clone(),
    torch_model.layer.bias.detach()
)

1.3582561077298844e-14