<h1>3.1 - MLP</h1>

Our first step towards deep learning 
is the multilayer perceptron (MLP). 
It is built by stacking several dense layers, 
each feeding its output to the next, to obtain a deep neural network (DNN).

<div style="text-align: center; background-color: black">
<img src="../images/mlp.png" alt="deep neural network" width="400">
</div>

In [1]:
import torch
from torch import nn

from platform import python_version
python_version(), torch.__version__

('3.12.6', '2.5.1+cu124')

In [2]:
device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'
device

'cuda'

In [3]:
torch.set_default_dtype(torch.float64)

In [4]:
def add_to_class(Class):  
    """Register functions as methods in created class."""
    def wrapper(obj):
        setattr(Class, obj.__name__, obj)
    return wrapper
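
As a minimal sketch of how this decorator behaves, the example below attaches a function to a toy class after the class has been defined. The `Greeter` class and the `greet` function are hypothetical names used only for illustration. Because `wrapper` returns nothing, the decorated name is bound to `None` at module level; the function lives on only as a method of the class.

```python
def add_to_class(Class):
    """Register functions as methods in created class."""
    def wrapper(obj):
        setattr(Class, obj.__name__, obj)
    return wrapper


class Greeter:  # hypothetical class, for illustration only
    def __init__(self, name):
        self.name = name


@add_to_class(Greeter)
def greet(self):
    return f"hello, {self.name}"


# The function is now available as a method on every instance.
message = Greeter("mlp").greet()

# wrapper() has no return value, so the module-level name is None.
module_level_name = greet
```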

# Dataset

## create dataset

$$
\mathbf{X} \in \mathbb{R}^{m \times n} \\
\mathbf{Y} \in \mathbb{R}^{m \times n_{o}}
$$

In [5]:
from sklearn.datasets import make_regression
import random

M: int = 10_100 # number of samples
N: int = 6 # number of input features
NO: int = 3 # number of output features

X, Y = make_regression(
    n_samples=M, 
    n_features=N, 
    n_targets=NO, 
    n_informative=N - 1,
    bias=random.random(),
    noise=1
)

print(X.shape)
print(Y.shape)

(10100, 6)
(10100, 3)


## split dataset into train and valid

In [6]:
x_train = torch.tensor(X[:1000], device=device)
x_valid = torch.tensor(X[1000:], device=device)
x_train.shape, x_valid.shape

(torch.Size([1000, 6]), torch.Size([9100, 6]))

In [7]:
y_train = torch.tensor(Y[:1000], device=device)
y_valid = torch.tensor(Y[1000:], device=device)
y_train.shape, y_valid.shape

(torch.Size([1000, 3]), torch.Size([9100, 3]))

## delete raw dataset

In [8]:
del X
del Y

# Model and layers

In [9]:
class Layer:
    is_trainable: bool = False
    pass


class Activation:
    pass


class Losses:
    pass

## initialization

### scratch model

The model is the container for all our layers. Its constructor passes a dummy batch through every layer once, so that each layer can size its own parameters.

In [10]:
class Model:
    def __init__(self, layers: list[Layer], loss_f: Losses = None):
        self.layers = layers[1:] # do not save the input layer
        self.loss_f = MSE() if loss_f is None else loss_f

        # initialize all parameters
        out = layers[0].construct()
        for layer in self.layers:
            out = layer.construct(out)

    def copy_parameters(self, parameters) -> None:
        params = list(parameters())
        for layer in self.layers:
            if layer.is_trainable:
                layer.set_params(params.pop(0), params.pop(0))

### layers

#### dense

Dense, or fully connected, layer.

$$
\begin{align*}
\mathbf{W}^{(k)} &\in 
\mathbb{R}^{n_{k-1} \times n_{k}} \\
\mathbf{b}^{(k)} &\in 
\mathbb{R}^{n_{k}}
\end{align*}
$$
for all $k = 1, ..., l$, where $l$ is the number of layers.
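
To make these shapes concrete, here is a small sketch that chains a few layers and records the shape of every parameter and output. The layer widths and the batch size are assumptions chosen for illustration, not values fixed by the definition above.

```python
import torch

# Assumed widths: n_0 = 6 input features, then 32, 32 and 3 units.
widths = [6, 32, 32, 3]
m = 10  # number of samples in a dummy batch

a = torch.randn(m, widths[0])
shapes = []
for n_prev, n_k in zip(widths[:-1], widths[1:]):
    w = torch.randn(n_prev, n_k)  # W^(k) has shape (n_{k-1}, n_k)
    b = torch.randn(n_k)          # b^(k) has shape (n_k,)
    a = a @ w + b                 # b is broadcast over the m rows
    shapes.append((tuple(w.shape), tuple(b.shape), tuple(a.shape)))

for k, (w_shape, b_shape, a_shape) in enumerate(shapes, start=1):
    print(f"layer {k}: W {w_shape}, b {b_shape}, output {a_shape}")
```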

In [11]:
class Dense(Layer):
    def __init__(self, units: int, act_f: Activation = None):
        self.units = units
        self.act_f = act_f if act_f is not None else Linear()
        self.is_trainable = True

    def set_params(self, w: torch.Tensor, b: torch.Tensor) -> None:
        self.w.copy_(w.T.detach().clone())
        self.b.copy_(b.detach().clone())

    def construct(self, x: torch.Tensor) -> torch.Tensor:
        """
        Initialize the parameters.
        self.w := tensor (n_features, units).
        self.b := tensor (units).
        
        Args:
            x: input tensor of shape (m_samples, n_features).
        
        Return:
            z: out tensor of shape (m_samples, units).
        """
        n_features = x.shape[-1]
        self.w = torch.randn(n_features, self.units, device=device)
        self.b = torch.randn(self.units, device=device)
        return self.forward(x)
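
`set_params` transposes the incoming weight because `nn.Linear` stores its weight with shape `(out_features, in_features)`, while the scratch layer uses `(n_features, units)`. The sketch below checks that layout on a single layer; the sizes and the seed are arbitrary choices for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(6, 32).double()  # weight is stored as (out_features, in_features)
x = torch.randn(4, 6, dtype=torch.float64)

# Transposing gives the (n_features, units) layout used by the scratch Dense layer.
w = layer.weight.T.detach().clone()
b = layer.bias.detach().clone()

weight_shape = tuple(layer.weight.shape)
scratch_shape = tuple(w.shape)
same_output = torch.allclose(x @ w + b, layer(x))
```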

#### activation functions

For any activation function

$$
\mathbf{A}^{(k)} : \mathbb{R}^{m \times n_{k}} \rightarrow
\mathbb{R}^{m \times n_{k}}
$$
for all $k = 1, ..., l$.

In [12]:
class Linear(Activation):
    pass


class RelU(Activation):
    pass


class Sigmoid(Activation):
    pass


class Tanh(Activation):
    pass


class Softmax(Activation):
    pass

## forward propagation

$$
\begin{array}{l}
\textbf{Algorithm 1: Forward propagation} \\
\mathbf{A}^{(0)} := \mathbf{X} \\
\textbf{for } k = 1 \text{ to } l \textbf{ do}\\
\quad \mathbf{Z}^{(k)} = 
\mathbf{A}^{(k-1)} \mathbf{W}^{(k)} + \mathbf{b}^{(k)} \\
\quad \mathbf{A}^{(k)} = 
f(\mathbf{Z}^{(k)}) \\
\textbf{end for}
\end{array}
$$
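
Algorithm 1 can be traced by hand on a tiny network. The sketch below uses two layers with hand-picked parameters (all values are assumptions chosen so the arithmetic is easy to follow): a ReLU hidden layer followed by a linear output layer.

```python
import torch

x = torch.tensor([[1.0, -2.0]])   # A^(0), shape (1, 2)

w1 = torch.tensor([[1.0, 0.0],
                   [0.0, 1.0]])   # W^(1), shape (2, 2)
b1 = torch.tensor([0.0, 0.5])     # b^(1)
w2 = torch.tensor([[2.0],
                   [3.0]])        # W^(2), shape (2, 1)
b2 = torch.tensor([1.0])          # b^(2)

# Each entry holds (W, b, f) for one layer; the last layer is linear.
params = [(w1, b1, torch.relu), (w2, b2, lambda z: z)]

a = x
for w, b, f in params:
    z = a @ w + b   # Z^(k) = A^(k-1) W^(k) + b^(k)
    a = f(z)        # A^(k) = f(Z^(k))

# Z^(1) = [1, -1.5], A^(1) = [1, 0], Z^(2) = 1*2 + 0*3 + 1 = 3
print(a)
```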

### model

In [13]:
@add_to_class(Model)
def predict(self, x: torch.Tensor) -> torch.Tensor:
    """
    Forward propagation.
    
    Args:
        x: tensor of shape (m_samples, n_input_features).
        
    Return:
        y_pred: tensor of shape (m_samples, n_out_features).
    """
    out = x
    for layer in self.layers:
        out = layer.forward(out)
    return out

@add_to_class(Model)
def __forward__(self, x: torch.Tensor) -> torch.Tensor:
    out = x
    for layer in self.layers:
        out = layer.__forward__(out)
    return out

### layers

#### dense

Weighted sum

$$
\mathbf{Z}^{(k)}(\mathbf{A}^{(k-1)}) = 
\mathbf{A}^{(k-1)} \mathbf{W}^{(k)} + \mathbf{b}^{(k)} \\
\mathbf{Z}^{(k)} : \mathbb{R}^{m \times n_{k-1}} \rightarrow
\mathbb{R}^{m \times n_{k}}
$$

In [14]:
@add_to_class(Dense)
def forward(self, x: torch.Tensor) -> torch.Tensor:
    """
    Compute weighted sum Z = XW+b and activation function A = f(Z).
    
    Args:
        x: input tensor of shape (m_samples, n_features).
        
    Return:
        a: out tensor of shape (m_samples, units).
    """
    return self.act_f(torch.matmul(x, self.w) + self.b)

@add_to_class(Dense)
def __forward__(self, x: torch.Tensor) -> torch.Tensor:
    """Forward propagation for training step."""
    self.input = x.clone()
    self.a = self.forward(x)
    return self.a

#### activation functions

##### Linear

$$
\text{Linear}^{(k)}(\mathbf{Z}^{(k)}) = 
\mathbf{Z}^{(k)}
$$

In [15]:
@add_to_class(Linear)
def __call__(self, z: torch.Tensor) -> torch.Tensor:
    return z

##### ReLU

$$
\text{ReLU}^{(k)}(\mathbf{Z}^{(k)}) = 
\max(\mathbf{Z}^{(k)}, 0)
$$

In [16]:
@add_to_class(RelU)
def __call__(self, z: torch.Tensor) -> torch.Tensor:
    #return torch.relu(z)
    return torch.max(z, torch.zeros_like(z))

##### Sigmoid

$$
\text{Sigmoid}^{(k)}(\mathbf{Z}^{(k)}) = 
\frac{1}{1 + \exp(-\mathbf{Z}^{(k)})}
$$

In [17]:
@add_to_class(Sigmoid)
def __call__(self, z: torch.Tensor) -> torch.Tensor:
    #return torch.sigmoid(z)
    return 1 / (1 + torch.exp(-z))

##### Tanh

$$
\tanh^{(k)}(\mathbf{Z}^{(k)}) = 
\frac{1 - \exp(-2 \mathbf{Z}^{(k)})}
{1 + \exp(-2 \mathbf{Z}^{(k)})}
$$

In [18]:
@add_to_class(Tanh)
def __call__(self, z: torch.Tensor) -> torch.Tensor:
    #return torch.tanh(z)
    exp = torch.exp(-2 * z)
    return (1 - exp) / (1 + exp)
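
The hand-written sigmoid and tanh agree with the PyTorch built-ins on moderate inputs, but the textbook tanh formula is not numerically safe everywhere. The sketch below, which redefines both formulas as plain functions so that it runs on its own, compares them against `torch.sigmoid` and `torch.tanh` and then probes one extreme input. The test values are assumptions chosen for illustration.

```python
import torch


def sigmoid_scratch(z):
    return 1 / (1 + torch.exp(-z))


def tanh_scratch(z):
    exp = torch.exp(-2 * z)
    return (1 - exp) / (1 + exp)


z = torch.linspace(-5, 5, 11, dtype=torch.float64)
sigmoid_ok = torch.allclose(sigmoid_scratch(z), torch.sigmoid(z))
tanh_ok = torch.allclose(tanh_scratch(z), torch.tanh(z))

# For a strongly negative input exp(-2z) overflows to inf,
# and the ratio (1 - inf) / (1 + inf) is nan.
z_extreme = torch.tensor([-400.0], dtype=torch.float64)
tanh_extreme = tanh_scratch(z_extreme)
tanh_builtin = torch.tanh(z_extreme)
```

For the value ranges seen in this notebook the difference should not matter, but it is one reason the commented-out `torch.tanh(z)` is usually the safer choice in practice.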

##### Softmax

$$
\text{Softmax}^{(k)} (\mathbf{Z}^{(k)}) =
\begin{bmatrix}
    \sigma(\mathbf{z}_{1,:}) \\
    \sigma(\mathbf{z}_{2,:}) \\
    \vdots \\
    \sigma(\mathbf{z}_{m,:})
\end{bmatrix}
$$

In [19]:
@add_to_class(Softmax)
def __call__(self, z: torch.Tensor) -> torch.Tensor:
    exp = torch.exp(z - torch.max(z, dim=1, keepdims=True)[0])
    return exp / exp.sum(1, keepdims=True)
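
Subtracting the row-wise maximum before exponentiating does not change the result, since the common factor cancels in the ratio, but it keeps `exp` from overflowing. A minimal sketch comparing this version with a naive one follows; `softmax_naive` and the input values are assumptions introduced only for the comparison.

```python
import torch


def softmax_scratch(z):
    # shift each row so its largest entry is 0 before exponentiating
    exp = torch.exp(z - torch.max(z, dim=1, keepdim=True)[0])
    return exp / exp.sum(1, keepdim=True)


def softmax_naive(z):
    exp = torch.exp(z)
    return exp / exp.sum(1, keepdim=True)


z = torch.tensor([[1000.0, 1001.0, 1002.0],
                  [-1.0, 0.0, 1.0]], dtype=torch.float64)

stable = softmax_scratch(z)
naive = softmax_naive(z)  # first row overflows: inf / inf gives nan

matches_torch = torch.allclose(stable, torch.softmax(z, dim=1))
row_sums = stable.sum(1)
```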

#### input layer

This layer only creates a random dummy batch, 
which is passed through the network once so that every layer can initialize its parameters. 
This way we do not have to specify the dimensions of each parameter manually.

In [20]:
class InputLayer(Layer):
    def __init__(self, n_input_features: int):
        self.m = 10
        self.n = n_input_features

    def construct(self) -> torch.Tensor:
        return torch.randn(self.m, self.n, device=device)

## evaluation

### loss function

$$
\text{MSE}(\mathbf{A}^{(l)}) = 
\frac{1}{m n_{o}} 
\sum_{i=1}^{m} \sum_{j=1}^{n_{o}} \left(
    (a^{(l)}_{ij} - y_{ij})^2
\right)
$$

where $\mathbf{A}^{(l)}$ is the activation of the last layer of the model and
$n_{o}$ is the number of output features of the model.

In [21]:
class MSE(Losses):
    def loss(self, y_pred: torch.Tensor, y_true: torch.Tensor) -> float:
        return ((y_pred - y_true)**2).mean().item()

    def __call__(self, y_pred: torch.Tensor, y_true: torch.Tensor) -> float:
        return self.loss(y_pred, y_true)
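
As a quick worked example of the formula, the sketch below evaluates the MSE on a 2 x 2 case by hand and compares it with `nn.MSELoss`, whose default reduction also averages over all elements. The numbers are assumptions picked to keep the arithmetic simple.

```python
import torch
from torch import nn

y_pred = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
y_true = torch.tensor([[0.0, 2.0], [3.0, 2.0]])

# squared errors are 1, 0, 0, 4; their sum 5 is divided by m * n_o = 4
mse_manual = ((y_pred - y_true) ** 2).mean().item()
mse_torch = nn.MSELoss()(y_pred, y_true).item()

print(mse_manual, mse_torch)
```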

### model

In [22]:
@add_to_class(Model)
def evaluate(self, x: torch.Tensor, y: torch.Tensor) -> float:
    """
    Evaluate the model on input x against target y.
    
    Args:
        x: tensor (m_samples, n_input_features).
        y: target tensor (m_samples, n_out_features).
        
    Return:
        loss: error between y_pred and target y.
    """
    y_pred = self.predict(x)
    return self.loss_f(y_pred, y)

## backpropagation

We need to calculate the derivatives/gradients 
of each parameter in the model using **backpropagation**
and update each parameter using **gradient descent** (gd).

The main idea of backpropagation is to calculate these derivatives with the chain rule

$$
\frac{\partial L}{\partial \theta^{(l)}} = 
{\color{Lime} \frac{\partial L}
{\partial \mathbf{A}^{(l)}}}
{\color{Cyan} \frac{\partial \mathbf{A}^{(l)}}
{\partial \mathbf{Z}^{(l)}}}
{\color{Orange} \frac{\partial \mathbf{Z}^{(l)}}
{\partial \theta^{(l)}} }
$$

$$
\frac{\partial L}{\partial \theta^{(l-1)}} = 
{\color{Lime} \frac{\partial L}
{\partial \mathbf{A}^{(l)}}}
{\color{Cyan} \frac{\partial \mathbf{A}^{(l)}}
{\partial \mathbf{Z}^{(l)}}}
{\color{Magenta} \frac{\partial \mathbf{Z}^{(l)}}
{\partial \mathbf{A}^{(l-1)}}}
{\color{Cyan} \frac{\partial \mathbf{A}^{(l-1)}}
{\partial \mathbf{Z}^{(l-1)}}}
{\color{Orange} \frac{\partial \mathbf{Z}^{(l-1)}}
{\partial \theta^{(l-1)}}}
$$

$$
\frac{\partial L}{\partial \theta^{(k)}} = 
{\color{Lime} \frac{\partial L}
{\partial \mathbf{A}^{(l)}}}
{\color{Cyan} \frac{\partial \mathbf{A}^{(l)}}
{\partial \mathbf{Z}^{(l)}}}
{\color{Magenta} \frac{\partial \mathbf{Z}^{(l)}}
{\partial \mathbf{A}^{(l-1)}}}
\cdots
{\color{Cyan} \frac{\partial \mathbf{A}^{(k)}}
{\partial \mathbf{Z}^{(k)}}}
{\color{Orange} \frac{\partial \mathbf{Z}^{(k)}}
{\partial \theta^{(k)}}}
$$

where $\theta^{(k)} = (\mathbf{b}^{(k)}, \mathbf{W}^{(k)})$.

This looks like many different derivatives, 
but most of the factors repeat from one layer to the next.
We only need to know 4 derivatives

$$
{\color{Lime} \frac{\partial L}
{\partial \mathbf{A}^{(l)}}}, 
{\color{Cyan} \frac{\partial \mathbf{A}^{(k)}}
{\partial \mathbf{Z}^{(k)}}},
{\color{Magenta} \frac{\partial \mathbf{Z}^{(k)}}
{\partial \mathbf{A}^{(k-1)}}},
{\color{Orange} \frac{\partial \mathbf{Z}^{(k)}}
{\partial \theta^{(k)}}}
$$

With these 4 derivatives we can compute 
$\nabla_{\theta^{(k)}} L$ for all 
$k = l, ..., 1$.

$$
\begin{array}{l}
\textbf{Algorithm 2: Backpropagation} \\
\mathbf{\Delta} := \nabla_{\mathbf{A}^{(l)}}L \\
\textbf{for } k = l, l-1, ..., 1 \textbf{ do}\\
\quad \mathbf{\Delta} := \mathbf{\Delta} 
\nabla_{\mathbf{Z}^{(k)}} \mathbf{A}^{(k)} \\
\quad \nabla_{\theta^{(k)}}L = \mathbf{\Delta}
\nabla_{\theta^{(k)}} \mathbf{Z}^{(k)} \\
\quad \mathbf{\Delta} := \mathbf{\Delta}
\nabla_{\mathbf{A}^{(k-1)}} \mathbf{Z}^{(k)} \\
\textbf{end for}
\end{array}
$$
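
Algorithm 2 can be checked against autograd on a small network. The following is a sketch under stated assumptions: a two-layer network with a tanh hidden layer, a linear output layer and an MSE loss, with arbitrary sizes and seed. The manual gradients follow the loop above and are compared with the ones PyTorch computes.

```python
import torch

torch.manual_seed(0)
dtype = torch.float64
m, n0, n1, n2 = 5, 4, 3, 2  # assumed sizes

x = torch.randn(m, n0, dtype=dtype)
y = torch.randn(m, n2, dtype=dtype)
w1 = torch.randn(n0, n1, dtype=dtype, requires_grad=True)
b1 = torch.randn(n1, dtype=dtype, requires_grad=True)
w2 = torch.randn(n1, n2, dtype=dtype, requires_grad=True)
b2 = torch.randn(n2, dtype=dtype, requires_grad=True)

# forward pass: tanh hidden layer, linear output layer
a1 = torch.tanh(x @ w1 + b1)
a2 = a1 @ w2 + b2
loss = ((a2 - y) ** 2).mean()
loss.backward()  # reference gradients from autograd

with torch.no_grad():
    # Delta := dL/dA^(l) for the MSE loss
    delta = 2 * (a2 - y) / y.numel()
    # layer 2: linear activation, so dA/dZ leaves delta unchanged
    grad_b2 = delta.sum(dim=0)
    grad_w2 = a1.T @ delta
    delta = delta @ w2.T  # dL/dA^(1)
    # layer 1: tanh activation, dA/dZ = 1 - A^2 element-wise
    delta = delta * (1 - a1 ** 2)
    grad_b1 = delta.sum(dim=0)
    grad_w1 = x.T @ delta

checks = [
    torch.allclose(grad_w1, w1.grad),
    torch.allclose(grad_b1, b1.grad),
    torch.allclose(grad_w2, w2.grad),
    torch.allclose(grad_b2, b2.grad),
]
```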

### model

In [23]:
@add_to_class(Model)    
def update(self, y_pred: torch.Tensor, y_true: torch.Tensor, lr: float) -> None:
    delta = self.loss_f.backward(y_pred, y_true)
    for layer in reversed(self.layers):
        delta = layer.backward(delta, lr)

### loss function

In [24]:
@add_to_class(MSE)
def backward(self, y_pred: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
    return 2 * (y_pred - y_true) / y_true.numel()
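
This expression follows from differentiating the MSE element by element:

$$
\frac{\partial L}{\partial a^{(l)}_{ij}} = 
\frac{2}{m n_{o}} \left( a^{(l)}_{ij} - y_{ij} \right)
$$

A minimal check against autograd, with assumed shapes and random data:

```python
import torch

torch.manual_seed(0)
y_pred = torch.randn(4, 3, dtype=torch.float64, requires_grad=True)
y_true = torch.randn(4, 3, dtype=torch.float64)

loss = ((y_pred - y_true) ** 2).mean()
loss.backward()  # autograd fills y_pred.grad with dL/dA

# y_true.numel() equals m * n_o, the normalizer in the MSE definition
manual = 2 * (y_pred.detach() - y_true) / y_true.numel()
matches = torch.allclose(manual, y_pred.grad)
```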

### layers

#### activation functions

For more information about the derivatives of these activation functions, 
see [gradients and activation functions](gradients-and-activation-functions.ipynb).

##### Linear

In [25]:
@add_to_class(Linear)
def backward(self, delta, a):
    return delta

##### ReLU

In [26]:
@add_to_class(RelU)
def backward(self, delta, a):
    return delta * (1 * (a > 0))

##### Sigmoid

In [27]:
@add_to_class(Sigmoid)
def backward(self, delta, a):
    return delta * (a * (1 - a))

##### Tanh

In [28]:
@add_to_class(Tanh)
def backward(self, delta, a):
    return delta * (1 - a**2)

##### Softmax

In [29]:
@add_to_class(Softmax)
def backward(self, delta, a):
    return a * (delta - (delta * a).sum(axis=1, keepdims=True))
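
The softmax rule is the least obvious of the five, because every output in a row depends on every input in that row. Writing $\delta = \partial L / \partial \mathbf{a}$ for one row, the Jacobian $\partial a_i / \partial z_j = a_i (\mathbb{1}_{i=j} - a_j)$ gives $\partial L / \partial z_j = a_j (\delta_j - \sum_i \delta_i a_i)$, which is the expression above. The sketch below compares it with autograd on random data; the shapes and the seed are assumptions.

```python
import torch

torch.manual_seed(0)
z = torch.randn(4, 5, dtype=torch.float64, requires_grad=True)
delta = torch.randn(4, 5, dtype=torch.float64)  # stands in for dL/dA

a = torch.softmax(z, dim=1)
a.backward(delta)  # autograd propagates dL/dA to dL/dZ

a_val = a.detach()
manual = a_val * (delta - (delta * a_val).sum(dim=1, keepdim=True))
matches = torch.allclose(manual, z.grad)
```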

#### dense

##### with respect to bias

$$
\begin{align*}
\frac{\partial L}{\partial \mathbf{b}^{(k)}} &= 
\frac{\partial L}{\partial \mathbf{Z}^{(k)}}
{\color{Orange} \frac{\partial \mathbf{Z}^{(k)}}
{\partial \mathbf{b}^{(k)}}} \\
&= {\color{Orange} \mathbf{1}^\top}
\frac{\partial L}{\partial \mathbf{Z}^{(k)}}
\end{align*}
$$
where $\mathbf{1} \in \mathbb{R}^{m}$ is a vector of ones, so the product sums $\frac{\partial L}{\partial \mathbf{Z}^{(k)}}$ over the $m$ samples.

##### with respect to weight

$$
\begin{align*}
\frac{\partial L}{\partial \mathbf{W}^{(k)}} &=
\frac{\partial L}{\partial \mathbf{Z}^{(k)}}
{\color{Orange} \frac{\partial \mathbf{Z}^{(k)}}
{\partial \mathbf{W}^{(k)}}} \\
&= {\color{Orange} \left( \mathbf{A}^{(k-1)} \right)^\top}
\frac{\partial L}{\partial \mathbf{Z}^{(k)}}
\end{align*}
$$

##### with respect to input

$$
\begin{align*}
\frac{\partial L}{\partial \mathbf{A}^{(k-1)}} &=
\frac{\partial L}{\partial \mathbf{Z}^{(k)}}
{\color{Magenta} \frac{\partial \mathbf{Z}^{(k)}}
{\partial \mathbf{A}^{(k-1)}}} \\
&= \frac{\partial L}{\partial \mathbf{Z}^{(k)}}
{\color{Magenta} \left( \mathbf{W}^{(k)} \right)^\top}
\end{align*}
$$

##### gradient descent

$$
\mathbf{W}^{(k)} := \mathbf{W}^{(k)} -\eta 
\nabla_{\mathbf{W}^{(k)}}L \\
\mathbf{b}^{(k)} := \mathbf{b}^{(k)} -\eta 
\nabla_{\mathbf{b}^{(k)}}L 
$$

In [30]:
@add_to_class(Dense)
def backward(self, delta, lr: float) -> torch.Tensor:
    # activation function derivative
    delta = self.act_f.backward(delta, self.a)
    # bias derivative and update
    self.b -= lr * torch.sum(delta, axis=0)
    # weight derivative (the weight is updated only after the input derivative is computed)
    w_der = torch.matmul(self.input.T, delta)
    # input derivative, computed with the weights used in the forward pass
    delta = torch.matmul(delta, self.w.T)
    # weight update
    self.w -= lr * w_der
    return delta
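
The three matrix expressions used in `Dense.backward` can be verified against autograd for a single layer. This sketch uses assumed shapes, a random upstream gradient and an arbitrary learning rate; it also illustrates why the input derivative is computed before the weights are updated.

```python
import torch

torch.manual_seed(0)
dtype = torch.float64
a_prev = torch.randn(5, 4, dtype=dtype, requires_grad=True)  # A^(k-1)
w = torch.randn(4, 3, dtype=dtype, requires_grad=True)       # W^(k)
b = torch.randn(3, dtype=dtype, requires_grad=True)          # b^(k)
delta = torch.randn(5, 3, dtype=dtype)                       # stands in for dL/dZ^(k)

z = a_prev @ w + b
z.backward(delta)  # reference gradients from autograd

with torch.no_grad():
    grad_b = delta.sum(dim=0)  # sum of dL/dZ over the samples
    grad_w = a_prev.T @ delta  # (A^(k-1))^T dL/dZ
    grad_a = delta @ w.T       # dL/dZ (W^(k))^T

    # Updating W first would propagate delta through the wrong weights.
    lr = 0.1
    w_updated = w - lr * grad_w
    grad_a_after_update = delta @ w_updated.T

bias_ok = torch.allclose(grad_b, b.grad)
weight_ok = torch.allclose(grad_w, w.grad)
input_ok = torch.allclose(grad_a, a_prev.grad)
order_matters = not torch.allclose(grad_a_after_update, a_prev.grad)
```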

## train

In [31]:
@add_to_class(Model)    
def fit(self, x_train: torch.Tensor, y_train: torch.Tensor, 
        epochs: int, lr: float, batch_size: int, 
        x_valid: torch.Tensor, y_valid: torch.Tensor):
    """
    Fit the model using mini-batch gradient descent.

    Args:
        x_train: Input tensor of shape (n_samples, n_in_features).
        y_train: Target tensor of shape (n_samples, n_out_features).
        epochs: Number of epochs to train.
        lr: Learning rate.
        batch_size: Number of samples per batch.
        x_valid: Input tensor of shape (n_valid_samples, n_in_features).
        y_valid: Target tensor of shape (n_valid_samples, n_out_features).
    """
    for epoch in range(epochs):
        loss_t = [] # train loss
        for batch in range(0, len(y_train), batch_size):
            end_batch = batch + batch_size

            y_pred = self.__forward__(x_train[batch:end_batch])
            loss_t.append(self.loss_f(y_pred, y_train[batch:end_batch]))

            self.update(y_pred, y_train[batch:end_batch], lr)
            
        loss_t = sum(loss_t) / len(loss_t)
        loss_v = self.evaluate(x_valid, y_valid) # valid loss
        print('Epoch: {} - L: {:.4f} - L_v {:.4f}'.format(epoch, loss_t, loss_v))
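
The batching loop relies on slicing past the end of a tensor being safe, so the last batch may be smaller than the others. The sketch below reproduces only the index arithmetic, assuming 1000 training samples and a batch size of `1000 // 3`, the values this notebook uses for training.

```python
n_samples = 1000
batch_size = n_samples // 3  # 333

batch_sizes = []
for start in range(0, n_samples, batch_size):
    end = start + batch_size
    # slicing past the end is allowed: the final slice is simply shorter
    batch_sizes.append(len(range(n_samples)[start:end]))

print(batch_sizes)
```

With these values the epoch has four batches and the last one holds a single sample. Since the reported training loss is the plain average of the per-batch losses, that one-sample batch weighs as much as each full batch in the printed value.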

# Torch Sequential

In [32]:
class TorchSequential(nn.Module):
    def __init__(self, layers: list[nn.Module], loss_fn=None):
        super(TorchSequential, self).__init__()
        self.layers = nn.ModuleList(layers)
        for layer in self.layers:
            layer.to(device)
        self.loss_fn = loss_fn if loss_fn is not None else nn.MSELoss()
        self.eval()

    def forward(self, x):
        out = x.clone()
        for l in self.layers:
            out = l(out)
        return out

    def evaluate(self, x, y):
        self.eval()
        with torch.no_grad():
            y_pred = self(x)
            return self.loss_fn(y_pred, y).item()
        
    def fit(self, x: torch.Tensor, y: torch.Tensor, 
            epochs: int, lr: float, batch_size: int, 
            x_valid: torch.Tensor, y_valid: torch.Tensor):
        optimizer = torch.optim.SGD(self.parameters(), lr=lr, momentum=0.0)
        for epoch in range(epochs):
            loss_t = []
            for batch in range(0, len(y), batch_size):
                end_batch = batch + batch_size
                optimizer.zero_grad()

                y_pred = self(x[batch:end_batch])
                loss = self.loss_fn(y_pred, y[batch:end_batch])
                loss_t.append(loss.item())

                loss.backward()
                optimizer.step()
            loss_t = sum(loss_t) / len(loss_t)
            loss_v = self.evaluate(x_valid, y_valid)
            print('Epoch: {} - L: {:.4f} - L_v {:.4f}'.format(epoch, loss_t, loss_v))

In [33]:
torch_model = TorchSequential([
    nn.Linear(N, 32), nn.Tanh(),
    nn.Linear(32, 32), nn.Softmax(dim=1),
    nn.Linear(32, 32), nn.Sigmoid(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, NO)
])

# Scratch vs Sequential

## scratch model

In [34]:
model = Model([
    InputLayer(N),
    Dense(32, Tanh()),
    Dense(32, Softmax()),
    Dense(32, Sigmoid()),
    Dense(32, RelU()),
    Dense(NO, Linear())
])

## evals

### mape

In [35]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from tools.torch_metrics import torch_mape as mape

### predict

In [36]:
mape(
    model.predict(x_valid),
    torch_model(x_valid)
)

149.68578272216754

### copy parameters

In [37]:
model.copy_parameters(torch_model.parameters)

### predict after copy parameters

In [38]:
mape(
    model.predict(x_valid),
    torch_model(x_valid)
)

3.531087395236535e-17

### loss

In [39]:
mape(
    model.evaluate(x_valid, y_valid),
    torch_model.evaluate(x_valid, y_valid)
)

0.0

### train

In [40]:
LR: float = 0.01
EPOCHS: int = 32
BATCH_SIZE: int = len(y_train) // 3

In [41]:
torch_model.fit(
    x_train, y_train.double(), 
    EPOCHS, LR, BATCH_SIZE, 
    x_valid, y_valid.double()
)

Epoch: 0 - L: 10032.5215 - L_v 8184.6287
Epoch: 1 - L: 9824.7270 - L_v 14973.4147
Epoch: 2 - L: 11173.8396 - L_v 8124.3235
Epoch: 3 - L: 9916.6439 - L_v 8131.1648
Epoch: 4 - L: 9864.6388 - L_v 8139.2015
Epoch: 5 - L: 9815.3556 - L_v 8148.3304
Epoch: 6 - L: 9768.6520 - L_v 8158.4546
Epoch: 7 - L: 9724.3934 - L_v 8169.4834
Epoch: 8 - L: 9682.4521 - L_v 8181.3320
Epoch: 9 - L: 9642.7069 - L_v 8193.9207
Epoch: 10 - L: 9605.0433 - L_v 8207.1752
Epoch: 11 - L: 9569.3525 - L_v 8221.0258
Epoch: 12 - L: 9535.5313 - L_v 8235.4076
Epoch: 13 - L: 9503.4822 - L_v 8250.2598
Epoch: 14 - L: 9473.1125 - L_v 8265.5257
Epoch: 15 - L: 9444.3344 - L_v 8281.1523
Epoch: 16 - L: 9417.0649 - L_v 8297.0904
Epoch: 17 - L: 9391.2250 - L_v 8313.2940
Epoch: 18 - L: 9366.7400 - L_v 8329.7203
Epoch: 19 - L: 9343.5392 - L_v 8346.3296
Epoch: 20 - L: 9321.5553 - L_v 8363.0849
Epoch: 21 - L: 9300.7248 - L_v 8379.9519
Epoch: 22 - L: 9280.9873 - L_v 8396.8987
Epoch: 23 - L: 9262.2857 - L_v 8413.8959
Epoch: 24 - L: 9244.565

In [42]:
model.fit(
    x_train, y_train, 
    EPOCHS, LR, BATCH_SIZE, 
    x_valid, y_valid
)

Epoch: 0 - L: 10032.5215 - L_v 8184.6287
Epoch: 1 - L: 9824.7270 - L_v 14973.4147
Epoch: 2 - L: 11173.8396 - L_v 8124.3235
Epoch: 3 - L: 9916.6439 - L_v 8131.1648
Epoch: 4 - L: 9864.6388 - L_v 8139.2015
Epoch: 5 - L: 9815.3556 - L_v 8148.3304
Epoch: 6 - L: 9768.6520 - L_v 8158.4546
Epoch: 7 - L: 9724.3934 - L_v 8169.4834
Epoch: 8 - L: 9682.4521 - L_v 8181.3320
Epoch: 9 - L: 9642.7069 - L_v 8193.9207
Epoch: 10 - L: 9605.0433 - L_v 8207.1752
Epoch: 11 - L: 9569.3525 - L_v 8221.0258
Epoch: 12 - L: 9535.5313 - L_v 8235.4076
Epoch: 13 - L: 9503.4822 - L_v 8250.2598
Epoch: 14 - L: 9473.1125 - L_v 8265.5257
Epoch: 15 - L: 9444.3344 - L_v 8281.1523
Epoch: 16 - L: 9417.0649 - L_v 8297.0904
Epoch: 17 - L: 9391.2250 - L_v 8313.2940
Epoch: 18 - L: 9366.7400 - L_v 8329.7203
Epoch: 19 - L: 9343.5392 - L_v 8346.3296
Epoch: 20 - L: 9321.5553 - L_v 8363.0849
Epoch: 21 - L: 9300.7248 - L_v 8379.9519
Epoch: 22 - L: 9280.9873 - L_v 8396.8987
Epoch: 23 - L: 9262.2857 - L_v 8413.8959
Epoch: 24 - L: 9244.565

Both models overfit during training: from epoch 3 onwards the training loss keeps falling 
while the validation loss rises. 
The goal of this notebook, however, is not to build good predictors on synthetic data 
but to understand their inner workings.

### predict after train

In [43]:
mape(
    model.predict(x_valid),
    torch_model(x_valid)
)

8.484146499694081e-17

### bias

In [44]:
filtered_layers = filter(
    lambda x: isinstance(x, nn.modules.linear.Linear), 
    torch_model.layers
)

for i, layer in enumerate(filtered_layers):
    print(f'scratch layer #{i} - torch layer #{i}')
    print(mape(model.layers[i].b, layer.bias))

scratch layer #0 - torch layer #0
1.698737091616012e-16
scratch layer #1 - torch layer #1
1.9898901268066587e-16
scratch layer #2 - torch layer #2
4.772048197527631e-16
scratch layer #3 - torch layer #3
3.4053826550299973e-16
scratch layer #4 - torch layer #4
8.484146499694081e-17


### weights

In [45]:
filtered_layers = filter(
    lambda x: isinstance(x, nn.modules.linear.Linear), 
    torch_model.layers
)

for i, layer in enumerate(filtered_layers):
    print(f'scratch layer #{i} - torch layer #{i}')
    print(mape(model.layers[i].w, layer.weight.T))

scratch layer #0 - torch layer #0
1.3672344606530367e-17
scratch layer #1 - torch layer #1
8.719104854173081e-17
scratch layer #2 - torch layer #2
8.961859691300975e-16
scratch layer #3 - torch layer #3
4.504781594348675e-16
scratch layer #4 - torch layer #4
6.820167367298755e-16
