# 1.1 - Simple Linear Regression

:::{grid} 1 1 2 2
```{card} [Open in Google Colab](https://colab.research.google.com/github/PilotLeoYan/inside-deep-learning/blob/main/content/1-linear-regression/1-1-simple-linear-regression.ipynb)
```{image} ../figures/colab_logo.png
:align: center
```
```{card} [Open in Jupyter NBViewer](https://nbviewer.org/github/PilotLeoYan/inside-deep-learning/blob/main/content/1-linear-regression/1-1-simple-linear-regression.ipynb)
```{image} ../figures/jupyter_logo.png
:align: center
```
:::

A good starting point before getting into deep learning is the perceptron, 
the basic unit from which artificial neural networks (ANNs) are built.
We can then use multiple perceptrons in parallel to form a dense layer. 
By using multiple dense layers, we can build a deep neural network (DNN).

```{image} ../figures/simple-perceptron.png
:width: 300
:class: hidden dark:block
```

```{image} ../figures/simple-perceptron-light.png
:width: 300
:class: dark:hidden
```

The objective of simple linear regression is to predict 
the target data $\mathbf{y}$ based on 
the input data $\mathbf{x}$

$$
\mathbf{y} = f(\mathbf{x}) + \epsilon
$$

where $f(\cdot)$ is the true but unknown function,
and $\epsilon$ is intrinsic noise independent of $\mathbf{x}$.
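
To make the relation $\mathbf{y} = f(\mathbf{x}) + \epsilon$ concrete, here is a minimal sketch in plain Python. The linear function `f`, its coefficients, and the noise scale are hypothetical choices made only for illustration; in a real task $f$ is unknown.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def f(x: float) -> float:
    """Hypothetical 'true' function; in practice it is unknown."""
    return 2.0 * x + 0.5

# Inputs drawn at random from [-1, 1].
xs = [random.uniform(-1.0, 1.0) for _ in range(5)]
# The noise epsilon is drawn independently of x.
ys = [f(x) + random.gauss(0.0, 0.1) for x in xs]

for x, y in zip(xs, ys):
    print(f"x = {x:+.3f}   f(x) = {f(x):+.3f}   y = {y:+.3f}")
```

Each observed `y` sits close to `f(x)` but not exactly on it; the gap is the noise term.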

**Purpose of this Notebook**:

1. Create a dataset for a simple linear regression task
2. Create our own Perceptron class from scratch
3. Compute the gradients for gradient descent from scratch
4. Train our Perceptron
5. Compare our Perceptron to the one prebuilt by PyTorch

# Setup

In [1]:
print('Start package installation...')

Start package installation...


In [2]:
%%capture
%pip install torch
%pip install scikit-learn

In [3]:
print('Packages installed successfully!')

Packages installed successfully!


In [4]:
import torch
from torch import nn

from platform import python_version
python_version(), torch.__version__

('3.14.0', '2.9.0+cu126')

In [5]:
device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'
device

'cuda'

In [6]:
torch.set_default_dtype(torch.float64)

In [7]:
def add_to_class(Class):  
    """Register functions as methods in created class."""
    def wrapper(obj): setattr(Class, obj.__name__, obj)
    return wrapper

# Dataset

## create dataset

For our supervised task, we have a *dataset* denoted

$$
\mathcal{D} = \left\{
    (x_{1}, y_{1}), \cdots, (x_{m}, y_{m})
\right\}
$$

where $m$ is the number of samples in our dataset.

We assume that $x_i$ predicts $y_i$, and that the samples
$(x_{1}, y_{1}), \cdots, (x_{m}, y_{m})$ are 
*independent and identically distributed* (iid assumption).
Independent means that any two samples 
$(x_i, y_{i}), (x_{j}, y_{j}), \; i \neq j$
do not statistically depend on each other,
and identically distributed means that all $(x_{i}, y_{i})$
are drawn from the same unknown distribution.

The input data $x_{i}$ can be represented as a vector

$$
\mathbf{x} = \begin{bmatrix}
    x_{1} \\ \vdots \\ x_{m}
\end{bmatrix} \in \mathbb{R}^{m}
$$

and the target data $y_{i}$ can also be represented as a vector

$$
\mathbf{y} = \begin{bmatrix}
    y_{1} \\ \vdots \\ y_{m}
\end{bmatrix} \in \mathbb{R}^{m}
$$

In [8]:
from sklearn.datasets import make_regression
import random

M: int = 10_100 # number of samples

X, Y = make_regression(
    n_samples=M, 
    n_features=1, 
    n_targets=1,
    bias=random.random(), # random true bias
    noise=1
)

X = X.squeeze() # remove the axis of length 1

print(X.shape)
print(Y.shape)

(10100,)
(10100,)


## split dataset

We are going to split the dataset $\mathcal{D}$ into two sets,
the *training dataset* $\mathcal{D}_{\text{train}}$ 
and *test dataset* $\mathcal{D}_{\text{test}}$.

+ The training dataset is used to train and calibrate our models
+ The test dataset is used to evaluate our trained models

**Remark**: $\mathcal{D}_{\text{train}}$ and
$\mathcal{D}_{\text{test}}$ are disjoint, 
$\mathcal{D}_{\text{train}} \cap \mathcal{D}_{\text{test}} = \varnothing$.

Let's refer to $\mathbf{x}_{\text{train}}, \mathbf{y}_{\text{train}}$ as the
training input and target data respectively, and to $\mathbf{x}_{\text{test}}, \mathbf{y}_{\text{test}}$ 
as the test input and target data respectively.

In [9]:
X_train = torch.tensor(X[:100], device=device)
Y_train = torch.tensor(Y[:100], device=device)
X_train.shape, Y_train.shape

(torch.Size([100]), torch.Size([100]))

In [10]:
X_valid = torch.tensor(X[100:], device=device)
Y_valid = torch.tensor(Y[100:], device=device)
X_valid.shape, Y_valid.shape

(torch.Size([10000]), torch.Size([10000]))

We left more examples in the test set (stored in `X_valid` and `Y_valid` in the code) 
so that the comparison between models is more reliable.
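
The disjointness of the two sets can be checked on the sample indices alone. The sketch below is plain Python and only mirrors the slicing used above (`[:100]` and `[100:]`); the names `n_train`, `train_idx` and `test_idx` are introduced here for illustration.

```python
m = 10_100      # total number of samples, as in this notebook
n_train = 100   # size of the training split

indices = list(range(m))
train_idx = indices[:n_train]  # first 100 samples
test_idx = indices[n_train:]   # remaining 10000 samples

print(len(train_idx), len(test_idx))        # 100 10000
print(set(train_idx).isdisjoint(test_idx))  # True
```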

## delete raw dataset 

In [11]:
del X
del Y

# Scratch model

## weight and bias

We selected $\hat{y}(\cdot)$ to approximate the true function $f(\cdot)$

$$
\hat{y}(x) = b + xw
$$

where our model $\hat{y}$ has two *trainable parameters* 
$b, w \in \mathbb{R}$, called the *bias* and the *weight* respectively.

In [12]:
class SimpleLinearRegression:
    def __init__(self):
        self.w = torch.randn(1, device=device)
        self.b = torch.randn(1, device=device)

    def copy_params(self, torch_layer: nn.modules.linear.Linear):
        """
        Copy the parameters from a module.linear to this model.

        Args:
            torch_layer: Pytorch module from which to copy the parameters.
        """
        self.b.copy_(torch_layer.bias.detach().clone())
        self.w.copy_(torch_layer.weight[0,:].detach().clone())

## weighted sum

We will refer to $\hat{y}(\cdot)$ simply as the *weighted sum* function $\hat{\mathbf{y}}$

$$
\begin{align}
\hat{\mathbf{y}}: \mathbb{R}^{m} &\to \mathbb{R}^{m} \\
\mathbf{x} &\mapsto \hat{\mathbf{y}}(\mathbf{x}), \;
\mathbf{x} \in \mathbb{R}^{m}
\end{align}
$$

**Note**: we write $\hat{\mathbf{y}}$ in **bold** because, 
given a vector $\mathbf{x}$, $\hat{\mathbf{y}}$ is a vector too.

Given an input $\mathbf{x}$ (not necessarily the training dataset)

$$
\begin{align}
\hat{\mathbf{y}} &= b + w \mathbf{x} \\
&= b + w \begin{bmatrix}
    x_{1} \\ \vdots \\ x_{m}
\end{bmatrix} \\
&= \begin{bmatrix}
    b + wx_{1} \\ \vdots \\ b + wx_{m}
\end{bmatrix}
\end{align}
$$

**Note**: we will call $\hat{\mathbf{y}}$ the *predicted output data*.
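
As a small numeric illustration of the expansion above, the following plain-Python sketch applies $b + w x_{i}$ to every element of a short input vector. The function name `weighted_sum` and the values of `w`, `b` and `x` are arbitrary choices for this example; the PyTorch version in the next cell obtains the same element-wise result through broadcasting.

```python
def weighted_sum(x: list[float], w: float, b: float) -> list[float]:
    """Apply y_hat_i = b + w * x_i to every element of x."""
    return [b + w * x_i for x_i in x]

x = [0.0, 1.0, 2.0]
y_hat = weighted_sum(x, w=3.0, b=1.0)
print(y_hat)  # [1.0, 4.0, 7.0]
```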

In [13]:
@add_to_class(SimpleLinearRegression)
def predict(self, x: torch.Tensor) -> torch.Tensor:
    """
    Predict the output for input x.

    Args:
        x: Input tensor of shape (n_samples,).

    Returns:
        y_pred: Predicted output tensor of shape (n_samples,).
    """
    return self.b + self.w * x

## MSE

We need a loss function. We will use the mean squared error (MSE), 
denoted $L$

$$
\begin{align}
L: \mathbb{R}^{m} &\to \mathbb{R} \\
\hat{\mathbf{y}} &\mapsto L(\hat{\mathbf{y}}), \;
\hat{\mathbf{y}} \in \mathbb{R}^{m}
\end{align}
$$

This will help us fit our trainable parameters.

MSE is defined as

$$
L(\hat{\mathbf{y}}) = \frac{1}{m} \sum_{i=1}^{m}
\left( \hat{y}_{i} - y_{i} \right)^{2}
$$

or using a vectorized form

$$
L(\hat{\mathbf{y}}) = \frac{1}{m} 
\left\| \hat{\mathbf{y}} - \mathbf{y} \right\|_{2}^2
$$

**Note**: $\|\cdot\|_{2}$ is the *Euclidean norm*,
also called the $\ell_{2}$ norm (L2 norm).
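
The two forms of the MSE can be checked against each other numerically. This is a minimal plain-Python sketch; the helper names `mse_sum` and `mse_norm` and the sample values are made up for the example.

```python
import math

def mse_sum(y_hat, y):
    """MSE as the mean of the squared element-wise differences."""
    m = len(y)
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / m

def mse_norm(y_hat, y):
    """MSE as the squared Euclidean norm of the difference, divided by m."""
    m = len(y)
    diff = [p - t for p, t in zip(y_hat, y)]
    l2 = math.sqrt(sum(d * d for d in diff))
    return l2 ** 2 / m

y_hat = [1.0, 4.0, 7.0]
y = [1.5, 3.0, 7.0]
print(mse_sum(y_hat, y), mse_norm(y_hat, y))
```

Both calls should print the same value up to rounding, here $(0.25 + 1 + 0) / 3$.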

In [14]:
@add_to_class(SimpleLinearRegression)
def mse_loss(self, y_true: torch.Tensor, y_pred: torch.Tensor):
    """
    MSE loss function between target y_true and y_pred.

    Args:
        y_true: Target tensor of shape (n_samples,).
        y_pred: Predicted tensor of shape (n_samples,).

    Returns:
        loss: MSE loss between predictions and true values.
    """
    return ((y_pred - y_true)**2).mean().item()

@add_to_class(SimpleLinearRegression)
def evaluate(self, x: torch.Tensor, y_true: torch.Tensor):
    """
    Evaluate the model on input x and target y_true using MSE.

    Args:
        x: Input tensor of shape (n_samples,).
        y_true: Target tensor of shape (n_samples,).

    Returns:
        loss: MSE loss between predictions and true values.
    """
    y_pred = self.predict(x)
    return self.mse_loss(y_true, y_pred)

## computing gradients

To make adjustments to our model, it is necessary to compute derivatives. 
+ First, we must determine the derivatives to be computed
+ Then, we must ascertain the size of each derivative
+ Finally, we can compute the derivatives

Using the chain rule, we can determine the derivatives

$$
\frac{\partial L}{\partial b} =
\frac{\partial L}{\partial \hat{\mathbf{y}}}
\frac{\partial \hat{\mathbf{y}}}{\partial b}
$$

$$
\frac{\partial L}{\partial w} =
\frac{\partial L}{\partial \hat{\mathbf{y}}}
\frac{\partial \hat{\mathbf{y}}}{\partial w}
$$

Now, we can determine the size of each derivative

$$
\frac{\partial L}{\partial b} \in \mathbb{R},
\frac{\partial L}{\partial w} \in \mathbb{R},
\frac{\partial L}{\partial \hat{\mathbf{y}}} \in \mathbb{R}^{m},
\frac{\partial \hat{\mathbf{y}}}{\partial b} \in \mathbb{R}^{m},
\frac{\partial \hat{\mathbf{y}}}{\partial w} \in \mathbb{R}^{m}
$$

### MSE derivative

The derivative of MSE with respect to $\mathbf{\hat{y}}$ is

$$
\frac{\partial L}{\partial \hat{\mathbf{y}}} = \begin{bmatrix}
    \frac{\partial L}{\partial \hat{y}_{1}} \\ \vdots \\
    \frac{\partial L}{\partial \hat{y}_{m}} 
\end{bmatrix}
$$

where

$$
\begin{align}
\frac{\partial L}{\partial \hat{y}_{p}} &=
\frac{\partial}{\partial \hat{y}_{p}} \left(
    \frac{1}{m} \sum_{i=1}^{m} \left(
    \hat{y}_{i} - y_{i} \right)^{2}
\right) \\
&= \frac{1}{m} \sum_{i=1}^{m} \frac{\partial}{\partial \hat{y}_{p}} \left(
    \left(\hat{y}_{i} - y_{i} \right)^{2}
\right) \\
&= \frac{2}{m} (\hat{y}_{p} - y_{p})
\end{align}
$$

for all $p = 1, \ldots, m$.

**Note**:

$$
\begin{align}
\sum_{i=1}^{m} \frac{\partial}{\partial \hat{y}_{p}} \left(
    \left(\hat{y}_{i} - y_{i} \right)^{2}
\right) &=
\frac{\partial}{\partial \hat{y}_{p}} \left(
    (\hat{y}_{1} - y_{1})^{2} + \ldots + (\hat{y}_{p} - y_{p})^{2}
    + \ldots + (\hat{y}_{m} - y_{m})^{2}
\right) \\
&= 0 + \ldots + \frac{\partial}{\partial \hat{y}_{p}} \left(
    (\hat{y}_{p} - y_{p})^{2}
\right) + \ldots + 0 \\
&= 2 (\hat{y}_{p} - y_{p})
\end{align}
$$

Therefore

$$
\begin{align}
\frac{\partial L}{\partial \hat{\mathbf{y}}} &= \begin{bmatrix}
    \frac{\partial L}{\partial \hat{y}_{1}} \\ \vdots \\
    \frac{\partial L}{\partial \hat{y}_{m}} 
\end{bmatrix} \\
&= \frac{2}{m} \begin{bmatrix}
    \hat{y}_{1} - y_{1} \\ \vdots \\
    \hat{y}_{m} - y_{m}
\end{bmatrix} \\
&= \frac{2}{m} \left(
    \mathbf{\hat{y}} - \mathbf{y}
\right)
\end{align}
$$
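
One way to gain confidence in this result is to compare it with a central finite-difference approximation. The sketch below is plain Python; the helper names and the step size `h` are assumptions made for the example, not part of the notebook's model.

```python
def mse(y_hat, y):
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(y)

def mse_grad(y_hat, y):
    """Analytic derivative: (2 / m) * (y_hat - y)."""
    m = len(y)
    return [2.0 * (p - t) / m for p, t in zip(y_hat, y)]

def numerical_grad(y_hat, y, h=1e-6):
    """Central difference of the MSE with respect to each y_hat_p."""
    grad = []
    for p in range(len(y_hat)):
        plus = list(y_hat)
        minus = list(y_hat)
        plus[p] += h
        minus[p] -= h
        grad.append((mse(plus, y) - mse(minus, y)) / (2 * h))
    return grad

y_hat = [1.0, 4.0, 7.0]
y = [1.5, 3.0, 7.0]
analytic = mse_grad(y_hat, y)
numeric = numerical_grad(y_hat, y)
print(analytic)
print(numeric)
```

The two lists should agree to several decimal places.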

### weighted sum derivative

#### with respect to bias

The derivative of the weighted sum with respect to the bias is

$$
\frac{\partial \hat{\mathbf{y}}}{\partial b} =
\begin{bmatrix}
    \frac{\partial \hat{y}_{1}}{\partial b} \\
    \vdots \\
    \frac{\partial \hat{y}_{m}}{\partial b}
\end{bmatrix}
$$

where

$$
\begin{align}
\frac{\partial \hat{y}_{p}}{\partial b} &= 
\frac{\partial}{\partial b} \left( 
    b + w x_{p}
\right) \\
&= 1
\end{align}
$$

for all $p = 1, \ldots, m$.

Therefore

$$
\begin{align}
\frac{\partial \hat{\mathbf{y}}}{\partial b} &=
\begin{bmatrix}
    \frac{\partial \hat{y}_{1}}{\partial b} \\
    \vdots \\
    \frac{\partial \hat{y}_{m}}{\partial b}
\end{bmatrix} \\
&= \begin{bmatrix}
    1 \\ \vdots \\ 1
\end{bmatrix} \\
&= \mathbf{1}
\end{align}
$$

**Remark**: $\frac{\partial \hat{\mathbf{y}}}{\partial b} = \mathbf{1} \in \mathbb{R}^{m}$.

#### with respect to weight

The derivative of the weighted sum with respect to the weight is

$$
\frac{\partial \hat{\mathbf{y}}}{\partial w} =
\begin{bmatrix}
    \frac{\partial \hat{y}_{1}}{\partial w} \\
    \vdots \\
    \frac{\partial \hat{y}_{m}}{\partial w}
\end{bmatrix}
$$

where

$$
\begin{align}
\frac{\partial \hat{y}_{p}}{\partial w} &=
\frac{\partial}{\partial w} \left(
    b + w x_{p}
\right) \\
&= 0 + \frac{\partial}{\partial w} \left(
    w x_{p}
\right) \\
&= x_{p}
\end{align}
$$

for all $p = 1, \ldots, m$.

**Note**: Remember the $i$-th predicted output $\hat{y}_{i}$ is
based on $x_{i}$, $\hat{y}_{i} = b + w x_{i}$.

Therefore

$$
\begin{align}
\frac{\partial \hat{\mathbf{y}}}{\partial w} &=
\begin{bmatrix}
    \frac{\partial \hat{y}_{1}}{\partial w} \\
    \vdots \\
    \frac{\partial \hat{y}_{m}}{\partial w}
\end{bmatrix} \\
&= \begin{bmatrix}
    x_{1} \\ \vdots \\ x_{m}
\end{bmatrix} \\
&= \mathbf{x}
\end{align}
$$
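
Both weighted sum derivatives, $\mathbf{1}$ for the bias and $\mathbf{x}$ for the weight, can be verified with finite differences. The following plain-Python sketch uses hypothetical helper names (`predict`, `d_pred`) and arbitrary parameter values.

```python
def predict(x, w, b):
    """Weighted sum b + w * x_i for every sample."""
    return [b + w * x_i for x_i in x]

def d_pred(x, w, b, wrt, h=1e-6):
    """Central finite difference of each prediction w.r.t. 'w' or 'b'."""
    dw, db = (h, 0.0) if wrt == "w" else (0.0, h)
    plus = predict(x, w + dw, b + db)
    minus = predict(x, w - dw, b - db)
    return [(p - q) / (2 * h) for p, q in zip(plus, minus)]

x = [0.5, -1.0, 2.0]
d_b = d_pred(x, w=3.0, b=1.0, wrt="b")  # expected: approximately a vector of ones
d_w = d_pred(x, w=3.0, b=1.0, wrt="w")  # expected: approximately x itself
print(d_b)
print(d_w)
```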

### gradients

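Now that we have computed all the derivatives, 
we can find the gradients by composing them.
Since both factors in each product are vectors in $\mathbb{R}^{m}$,
each product is an inner product, so each gradient is a scalar.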

$$
\begin{align*}
\nabla_{b}L =
\frac{\partial L}{\partial b} &=
{\color{Cyan} {\frac{\partial L}{\partial \hat{\mathbf{y}}}}}
{\color{Orange} {\frac{\partial \hat{\mathbf{y}}}{\partial b}}} \\
&= {\color{Cyan} {\frac{2}{m} \left(\hat{\mathbf{y}} - \mathbf{y} \right)}}
{\color{Orange} {\mathbf{1}}}
\end{align*}
$$

$$
\begin{align*}
\nabla_{w}L =
\frac{\partial L}{\partial w} &=
{\color{Cyan} {\frac{\partial L}{\partial \hat{\mathbf{y}}}}}
{\color{Magenta} {\frac{\partial \hat{\mathbf{y}}}{\partial w}}} \\
&= {\color{Cyan} {\frac{2}{m} \left(\hat{\mathbf{y}} - \mathbf{y} \right)}}
{\color{Magenta} {\mathbf{x}}}
\end{align*}
$$
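
Read as inner products, the two gradients are $\frac{2}{m} \sum_{i} (\hat{y}_{i} - y_{i})$ and $\frac{2}{m} \sum_{i} (\hat{y}_{i} - y_{i}) x_{i}$. The plain-Python sketch below compares them with finite differences of the loss taken directly with respect to $b$ and $w$; the data and parameter values are made up for the check.

```python
def loss(x, y, w, b):
    """MSE of the predictions b + w * x_i against the targets y_i."""
    m = len(y)
    return sum((b + w * x_i - y_i) ** 2 for x_i, y_i in zip(x, y)) / m

def gradients(x, y, w, b):
    """Analytic gradients of the MSE with respect to b and w."""
    m = len(y)
    delta = [2.0 * (b + w * x_i - y_i) / m for x_i, y_i in zip(x, y)]
    grad_b = sum(delta)                                    # inner product with 1
    grad_w = sum(d_i * x_i for d_i, x_i in zip(delta, x))  # inner product with x
    return grad_b, grad_w

x = [0.5, -1.0, 2.0]
y = [1.0, 0.0, 3.0]
w, b, h = 0.3, -0.2, 1e-6

grad_b, grad_w = gradients(x, y, w, b)
num_b = (loss(x, y, w, b + h) - loss(x, y, w, b - h)) / (2 * h)
num_w = (loss(x, y, w + h, b) - loss(x, y, w - h, b)) / (2 * h)
print(grad_b, num_b)
print(grad_w, num_w)
```

Each printed pair should match to several decimal places.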

## parameters update

Now, let's update the trainable parameters using **gradient descent** (GD) as follows

$$
\begin{align}
b &\leftarrow b -\eta \nabla_{b}L \\ &=
b -\eta \left(
    \frac{2}{m} (\hat{\mathbf{y}} - \mathbf{y}) \mathbf{1}
\right)
\end{align}
$$

$$
\begin{align}
w &\leftarrow w -\eta \nabla_{w}L \\ &=
w -\eta \left(
    \frac{2}{m} (\hat{\mathbf{y}} - \mathbf{y}) \mathbf{x}
\right)
\end{align} 
$$

where $\eta \in \mathbb{R}^{+}$ is called the *learning rate*.
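
To see the update rule in action, here is a single gradient descent step on a tiny made-up dataset, written in plain Python. The function names `gd_step` and `mse`, the data, and the learning rate are assumptions for the example; the step uses the whole dataset at once.

```python
def gd_step(x, y, w, b, lr):
    """One full-batch gradient descent update of (w, b)."""
    m = len(y)
    delta = [2.0 * (b + w * x_i - y_i) / m for x_i, y_i in zip(x, y)]
    grad_b = sum(delta)
    grad_w = sum(d_i * x_i for d_i, x_i in zip(delta, x))
    return w - lr * grad_w, b - lr * grad_b

def mse(x, y, w, b):
    return sum((b + w * x_i - y_i) ** 2 for x_i, y_i in zip(x, y)) / len(y)

x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
w, b = 0.0, 0.0

loss_before = mse(x, y, w, b)
w, b = gd_step(x, y, w, b, lr=0.1)
loss_after = mse(x, y, w, b)

print(f"w = {w:.4f}, b = {b:.4f}")
print(f"loss: {loss_before:.4f} -> {loss_after:.4f}")
```

With these values the gradients are $\nabla_{b}L = -8$ and $\nabla_{w}L = -56/3$, so the step moves both parameters upward and the loss decreases.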

In [15]:
@add_to_class(SimpleLinearRegression)
def update(self, x: torch.Tensor, y_true: torch.Tensor, 
           y_pred: torch.Tensor, lr: float):
    """
    Update the model parameters.

    Args:
       x: Input tensor of shape (n_samples,).
       y_true: Target tensor of shape (n_samples,).
       y_pred: Predicted output tensor of shape (n_samples,).
       lr: Learning rate. 
    """
    delta = 2 * (y_pred - y_true) / len(y_true)
    self.b -= lr * delta.sum()
    self.w -= lr * torch.matmul(delta, x)

## gradient descent

We will use *mini-batch gradient descent* (mini-batch GD) to adjust the parameters of our model

$$
\begin{array}{l}
\textbf{Algorithm: mini-batch Gradient Descent} \\
\textbf{for } t = 1 \text{ to } T \textbf{ do} \\
\quad i \leftarrow 1 \\
\quad j \leftarrow \mathcal{B} \\
\quad \textbf{while } i \leq m \textbf{ do} \\
\quad \quad \mathbf{\theta} \leftarrow 
\text{update}(\mathbf{x}_{\text{train } i:j}, 
\mathbf{y}_{\text{train } i:j}; \mathbf{\theta}) \\
\quad \quad i \leftarrow i + \mathcal{B} \\
\quad \quad j \leftarrow \min(j + \mathcal{B}, m) \\
\quad \textbf{end while} \\
\textbf{end for}
\end{array}
$$

where:
+ $T$ is the number of epochs
+ $m$ is the number of training samples
+ $\theta$ denotes the model's trainable parameters, in our case $w$ and $b$
+ $\mathcal{B}$ is the number of samples per mini-batch
+ $\mathbf{x}_{\text{train } i:j}$ and $\mathbf{y}_{\text{train } i:j}$ 
are the $i$-th to $j$-th training samples

**Note**: $\eta, T, \mathcal{B}$ are called *hyperparameters*, because
they are adjusted by the developer rather than the model.

To learn more about the types of gradient descent, see 
[gradient descents](./gradient-descents.ipynb).
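
The way one epoch is cut into mini-batches can be sketched with zero-based, end-exclusive slices, as Python uses them. The helper `batch_slices` is hypothetical and only enumerates the slice boundaries; it does not train anything.

```python
def batch_slices(m: int, batch_size: int) -> list[tuple[int, int]]:
    """Return (start, end) pairs that cover the samples 0..m-1 in order."""
    return [
        (start, min(start + batch_size, m))
        for start in range(0, m, batch_size)
    ]

print(batch_slices(100, 33))  # [(0, 33), (33, 66), (66, 99), (99, 100)]
```

When the number of samples is not a multiple of the batch size, the last mini-batch is smaller than the others. Python and PyTorch slicing clamp an out-of-range end index, so an expression such as `x[batch:batch + batch_size]` yields the same batches without an explicit `min`.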

In [16]:
@add_to_class(SimpleLinearRegression)
def fit(self, x: torch.Tensor, y: torch.Tensor, 
        epochs: int, lr: float, batch_size: int, 
        x_valid: torch.Tensor, y_valid: torch.Tensor):
    """
    Fit the model using gradient descent.
    
    Args:
        x: Input tensor of shape (n_samples,).
        y: Target tensor of shape (n_samples,).
        epochs: Number of epochs to fit.
        lr: learning rate.
        batch_size: Number of samples per mini-batch.
        x_valid: Input tensor of shape (n_valid_samples,).
        y_valid: Target tensor of shape (n_valid_samples,).
    """
    for epoch in range(epochs):
        loss = []
        for batch in range(0, len(y), batch_size):
            end_batch = batch + batch_size

            y_pred = self.predict(x[batch:end_batch])

            loss.append(self.mse_loss(
                y[batch:end_batch],
                y_pred
            ))

            self.update(
                x[batch:end_batch], 
                y[batch:end_batch], 
                y_pred, 
                lr
            )

        loss = round(sum(loss) / len(loss), 4)
        loss_v = round(self.evaluate(x_valid, y_valid), 4)
        print(f'epoch: {epoch} - MSE: {loss} - MSE_v: {loss_v}')

# Scratch vs Torch.nn

We will be implementing a model created with PyTorch's pre-built classes for linear regression. 
This will allow us to compare our model from scratch with the PyTorch model.

## Torch.nn model

In [17]:
class TorchLinearRegression(nn.Module):
    def __init__(self, n_features):
        super(TorchLinearRegression, self).__init__()
        self.layer = nn.Linear(n_features, 1, device=device)
        self.loss = nn.MSELoss()

    def forward(self, x):
        return self.layer(x)
    
    def evaluate(self, x, y):
        self.eval()
        with torch.no_grad():
            y_pred = self.forward(x)
            return self.loss(y_pred, y).item()
    
    def fit(self, x, y, epochs, lr, batch_size, x_valid, y_valid):
        optimizer = torch.optim.SGD(self.parameters(), lr=lr)
        for epoch in range(epochs):
            loss_t = [] # train loss
            for batch in range(0, len(y), batch_size):
                end_batch = batch + batch_size

                y_pred = self.forward(x[batch:end_batch])
                loss = self.loss(y_pred, y[batch:end_batch])
                loss_t.append(loss.item())

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            loss_t = round(sum(loss_t) / len(loss_t), 4)
            loss_v = round(self.evaluate(x_valid, y_valid), 4)
            print(f'epoch: {epoch} - MSE: {loss_t} - MSE_v: {loss_v}')
        optimizer.zero_grad()

In [18]:
torch_model = TorchLinearRegression(1)

## scratch model

In [19]:
model = SimpleLinearRegression()

## evals

We will use a *metric* to compare our model with the PyTorch model.

### import modified MAPE

We use a modification of the *mean absolute percentage error* (MAPE) as a metric

$$
\text{MAPE}(\mathbf{y}, \hat{\mathbf{y}}) =
\frac{1}{m} \sum^{m}_{i=1} \mathcal{L} (y_{i}, \hat{y}_{i})
$$

where

$$
\mathcal{L} (y_{i}, \hat{y}_{i}) = \begin{cases}
    \left| \frac{y_{i} - \hat{y}_{i}}{y_{i}} \right|
    & \text{if } y_{i} \neq 0 \\
    \left| \hat{y}_{i} \right| & \text{if } y_{i} = 0
\end{cases}
$$
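
A plain-Python reading of this formula is sketched below. It assumes the second case applies whenever $y_{i} = 0$, and the function name `modified_mape` is made up for the example; the `torch_mape` function imported in the next cell may be implemented differently.

```python
def modified_mape(y, y_hat):
    """Mean of |(y_i - y_hat_i) / y_i|, falling back to |y_hat_i| when y_i is 0."""
    total = 0.0
    for y_i, p_i in zip(y, y_hat):
        if y_i != 0:
            total += abs((y_i - p_i) / y_i)
        else:
            # Assumption: the fallback case is triggered by a zero target.
            total += abs(p_i)
    return total / len(y)

# Per-sample terms: |(2 - 1) / 2| = 0.5, |0.5| = 0.5, |(4 - 4) / 4| = 0.0
print(modified_mape([2.0, 0.0, 4.0], [1.0, 0.5, 4.0]))
```

The fallback keeps the metric finite when a target value is exactly zero, where the usual relative error is undefined.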

In [20]:
# This cell imports torch_mape 
# if you are running this notebook locally 
# or from Google Colab.

import os
import sys

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

try:
    from tools.torch_metrics import torch_mape as mape
    print('mape imported locally.')
except ModuleNotFoundError:
    import subprocess

    repo_url = 'https://raw.githubusercontent.com/PilotLeoYan/inside-deep-learning/main/content/tools/torch_metrics.py'
    local_file = 'torch_metrics.py'
    
    subprocess.run(['wget', repo_url, '-O', local_file], check=True)
    try:
        from torch_metrics import torch_mape as mape # type: ignore
        print('mape imported from GitHub.')
    except Exception as e:
        print(e)

mape imported locally.


### predict

Let's compare the predictions of our model and PyTorch's using modified MAPE

In [21]:
mape(
    model.predict(X_valid),
    torch_model.forward(X_valid.unsqueeze(-1)).squeeze(-1)
)

20.991354499247294

They differ considerably because each model's parameters are 
initialized randomly and independently of the other model's.

### copy parameters

We copy the values of the PyTorch model parameters to our model. 

In [22]:
model.copy_params(torch_model.layer)

### predict after copy parameters

We measure the difference between the predictions of both models again

In [23]:
mape(
    model.predict(X_valid),
    torch_model.forward(X_valid.unsqueeze(-1)).squeeze(-1)
)

1.968012456432162e-16

We can see that their predictions now agree up to floating-point precision.

### loss

In [24]:
mape(
    model.evaluate(X_valid, Y_valid),
    torch_model.evaluate(X_valid.unsqueeze(-1), Y_valid.unsqueeze(-1))
)

0.0

### training

We are going to train both models using the same hyperparameters. 
If our model is well designed, then starting from the same parameters, 
it should arrive at the same parameters as the PyTorch model after training.

In [25]:
LR = 0.01 # learning rate
EPOCHS = 16 # number of epochs
BATCH = len(X_train) // 3 # number of samples per mini-batch

In [26]:
torch_model.fit(
    X_train.unsqueeze(-1), 
    Y_train.unsqueeze(-1),
    EPOCHS, LR, BATCH,
    X_valid.unsqueeze(-1),
    Y_valid.unsqueeze(-1)
)

epoch: 0 - MSE: 727.0301 - MSE_v: 1094.5279
epoch: 1 - MSE: 652.7683 - MSE_v: 990.9578
epoch: 2 - MSE: 586.7811 - MSE_v: 897.8346
epoch: 3 - MSE: 528.0545 - MSE_v: 814.0122
epoch: 4 - MSE: 475.71 - MSE_v: 738.483
epoch: 5 - MSE: 428.9855 - MSE_v: 670.3594
epoch: 6 - MSE: 387.2188 - MSE_v: 608.858
epoch: 7 - MSE: 349.8329 - MSE_v: 553.2864
epoch: 8 - MSE: 316.3249 - MSE_v: 503.0315
epoch: 9 - MSE: 286.2552 - MSE_v: 457.5494
epoch: 10 - MSE: 259.2391 - MSE_v: 416.3571
epoch: 11 - MSE: 234.9392 - MSE_v: 379.0245
epoch: 12 - MSE: 213.0589 - MSE_v: 345.1686
epoch: 13 - MSE: 193.3376 - MSE_v: 314.4474
epoch: 14 - MSE: 175.5452 - MSE_v: 286.5551
epoch: 15 - MSE: 159.4786 - MSE_v: 261.2183


In [27]:
model.fit(
    X_train, Y_train,
    EPOCHS, LR, BATCH,
    X_valid, Y_valid
)

epoch: 0 - MSE: 727.0301 - MSE_v: 1094.5279
epoch: 1 - MSE: 652.7683 - MSE_v: 990.9578
epoch: 2 - MSE: 586.7811 - MSE_v: 897.8346
epoch: 3 - MSE: 528.0545 - MSE_v: 814.0122
epoch: 4 - MSE: 475.71 - MSE_v: 738.483
epoch: 5 - MSE: 428.9855 - MSE_v: 670.3594
epoch: 6 - MSE: 387.2188 - MSE_v: 608.858
epoch: 7 - MSE: 349.8329 - MSE_v: 553.2864
epoch: 8 - MSE: 316.3249 - MSE_v: 503.0315
epoch: 9 - MSE: 286.2552 - MSE_v: 457.5494
epoch: 10 - MSE: 259.2391 - MSE_v: 416.3571
epoch: 11 - MSE: 234.9392 - MSE_v: 379.0245
epoch: 12 - MSE: 213.0589 - MSE_v: 345.1686
epoch: 13 - MSE: 193.3376 - MSE_v: 314.4474
epoch: 14 - MSE: 175.5452 - MSE_v: 286.5551
epoch: 15 - MSE: 159.4786 - MSE_v: 261.2183


### predict after training

We will measure the difference between the predictions of both models after training them.

In [28]:
mape(
    model.predict(X_valid),
    torch_model.forward(X_valid.unsqueeze(-1)).squeeze(-1)
)

7.095967135198154e-17

### bias

We directly measure the difference between the bias values of both models

In [29]:
mape(
    model.b.clone(),
    torch_model.layer.bias.detach()
)

0.0

### weight 

We also measure the difference between the weight values of both models.

In [30]:
mape(
    model.w.clone(),
    torch_model.layer.weight.detach().squeeze(0)
)

0.0