# 1.3 - Weight Decay

<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/PilotLeoYan/inside-deep-learning/blob/main/1-linear-regression/1-3-weight-decay.ipynb">
    <img src="../images/colab_logo.png" width="32">Open in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://nbviewer.org/github/PilotLeoYan/inside-deep-learning/blob/main/1-linear-regression/1-3-weight-decay.ipynb">
    <img src="../images/jupyter_logo.png" width="32">Open in Jupyter NBViewer</a>
  </td>
</table>

Let's continue from our multivariate linear regression 
and incorporate $\ell_2$ regularization (also written $L_{2}$) into our model.

**Purpose of this Notebook**:

1. Incorporate $\ell_2$ regularization into our Perceptron from scratch
2. Train our Perceptron
3. Compare our Perceptron to the one prebuilt by PyTorch

In [1]:
import torch
from torch import nn

from platform import python_version
python_version(), torch.__version__

('3.13.5', '2.7.1+cu128')

In [2]:
device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'
device

'cuda'

In [3]:
torch.set_default_dtype(torch.float64)

In [4]:
def add_to_class(Class):  
    """Register functions as methods in created class."""
    def wrapper(obj): setattr(Class, obj.__name__, obj)
    return wrapper
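
As a minimal sketch of how this decorator behaves, the example below attaches a method to a class after the class has been defined. The `Counter` class and the `increment` method are hypothetical and exist only for this illustration.

```python
def add_to_class(Class):
    """Register functions as methods in created class."""
    def wrapper(obj): setattr(Class, obj.__name__, obj)
    return wrapper

class Counter:  # hypothetical class, used only for this demonstration
    def __init__(self):
        self.value = 0

@add_to_class(Counter)
def increment(self, step: int = 1):
    """Method defined outside the class body and attached afterwards."""
    self.value += step
    return self.value

c = Counter()
result = c.increment(2)
print(result)

# The wrapper returns nothing, so the module-level name is rebound to None.
# The function is reachable only through the class.
print(increment is None)
```

This is why `update` can be defined in a later cell, next to its derivation, and still be called from `fit`.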

# Dataset

## create dataset

$$
\begin{align*}
\mathbf{X} &\in \mathbb{R}^{m \times n} \\
\mathbf{Y} &\in \mathbb{R}^{m \times n_{1}}
\end{align*}
$$

In [5]:
from sklearn.datasets import make_regression
import random

M: int = 10_100 # number of samples
N: int = 6 # number of input features
NO: int = 3 # number of output features

X, Y = make_regression(
    n_samples=M, 
    n_features=N, 
    n_targets=NO, 
    n_informative=N - 1,
    bias=random.random(),
    noise=1
)

print(X.shape)
print(Y.shape)

(10100, 6)
(10100, 3)


## split dataset

In [6]:
X_train = torch.tensor(X[:100], device=device)
Y_train = torch.tensor(Y[:100], device=device)
X_train.shape, Y_train.shape

(torch.Size([100, 6]), torch.Size([100, 3]))

In [7]:
X_valid = torch.tensor(X[100:], device=device)
Y_valid = torch.tensor(Y[100:], device=device)
X_valid.shape, Y_valid.shape

(torch.Size([10000, 6]), torch.Size([10000, 3]))

## delete raw dataset

In [8]:
del X
del Y

# Scratch model

The only thing we are going to modify is the way in which the model weights are updated. 
The rest, such as parameter initialization and model training, remain unchanged.

## Linear Regression model

In [9]:
class LinearRegression:
    def __init__(self, n_features: int, out_features: int, lambd: float):
        self.w = torch.randn(n_features, out_features, device=device)
        self.b = torch.randn(out_features, device=device)
        self.lambd = lambd

    def copy_params(self, torch_layer: torch.nn.modules.linear.Linear):
        """
        Copy the parameters from an `nn.Linear` layer to this model.

        Args:
            torch_layer: PyTorch module from which to copy the parameters.
        """
        self.b.copy_(torch_layer.bias.detach().clone())
        self.w.copy_(torch_layer.weight.T.detach().clone())

    def predict(self, x: torch.Tensor) -> torch.Tensor:
        """
        Predict the output for input x

        Args:
            x: Input tensor of shape (n_samples, n_features).

        Returns:
            y_pred: Predicted output tensor of shape (n_samples, out_features).
        """
        return torch.matmul(x, self.w) + self.b

    def mse_loss(self, y_true: torch.Tensor, y_pred: torch.Tensor):
        """
        MSE loss function between target y_true and y_pred.

        Args:
            y_true: Target tensor of shape (n_samples, out_features).
            y_pred: Predicted tensor of shape (n_samples, out_features).

        Returns:
            loss: MSE loss between predictions and true values.
        """
        return ((y_pred - y_true)**2).mean().item()

    def evaluate(self, x: torch.Tensor, y_true: torch.Tensor):
        """
        Evaluate the model on input x and target y_true using MSE.

        Args:
            x: Input tensor of shape (n_samples, n_features).
            y_true: Target tensor of shape (n_samples, out_features).

        Returns:
            loss: MSE loss between predictions and true values.
        """
        y_pred = self.predict(x)
        return self.mse_loss(y_true, y_pred)

    def fit(self, x_train: torch.Tensor, y_train: torch.Tensor, 
        epochs: int, lr: float, batch_size: int, 
        x_valid: torch.Tensor, y_valid: torch.Tensor):
        """
        Fit the model using gradient descent.
        
        Args:
            x_train: Input tensor of shape (n_samples, n_features).
            y_train: Target tensor of shape (n_samples, out_features).
            epochs: Number of epochs to fit.
            lr: Learning rate.
            batch_size: Number of samples per batch.
            x_valid: Input tensor of shape (n_valid_samples, n_features).
            y_valid: Target tensor of shape (n_valid_samples, out_features).
        """
        for epoch in range(epochs):
            loss = []
            for batch in range(0, len(y_train), batch_size):
                end_batch = batch + batch_size

                y_pred = self.predict(x_train[batch:end_batch])

                loss.append(self.mse_loss(
                    y_train[batch:end_batch], 
                    y_pred
                ))

                self.update(
                    x_train[batch:end_batch], 
                    y_train[batch:end_batch], 
                    y_pred, 
                    lr
                )

            loss = round(sum(loss) / len(loss), 4)
            loss_v = round(self.evaluate(x_valid, y_valid), 4)
            print(f'epoch: {epoch} - MSE: {loss} - MSE_v: {loss_v}')

## Parameters update

### objective function

Now, instead of training the model with the gradient of the loss function alone, 
we are going to use the gradient of the objective function $J$. 
Typically, the objective function has the following form.

$$
J(\hat{\mathbf{Y}}, \mathbf{\theta}) = 
L(\hat{\mathbf{Y}}) + \text{regularization}
$$
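where $\mathbf{\theta}$ is an arbitrary model parameter (here it will be the weights $\mathbf{W}$).

**Note**: Do not use the objective function to evaluate the model. 
Evaluate with the loss $L$ alone, since the regularization term does not measure prediction error.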
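
The distinction between loss and objective can be sketched as follows, assuming an MSE loss and the $\ell_2$ penalty defined in the next subsection. The helper names `mse` and `objective` are illustrative and are not part of the notebook's model.

```python
import torch

torch.set_default_dtype(torch.float64)
torch.manual_seed(0)

def mse(y_true: torch.Tensor, y_pred: torch.Tensor) -> torch.Tensor:
    """Plain loss L: the quantity to report when evaluating the model."""
    return ((y_pred - y_true) ** 2).mean()

def objective(y_true, y_pred, w, lambd):
    """Objective J = L + penalty: used only to drive the parameter update."""
    return mse(y_true, y_pred) + 0.5 * lambd * (w ** 2).sum()

# Random stand-in data (assumption: shapes chosen only for illustration).
x = torch.randn(8, 3)
w = torch.randn(3, 2)
y_true = torch.randn(8, 2)
y_pred = x @ w

loss = mse(y_true, y_pred)
j = objective(y_true, y_pred, w, lambd=0.01)

# J is never smaller than L because the penalty is non-negative.
print(loss.item(), j.item())
```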

### L2 regularization

As a weight decay technique, we will use $\ell_2$ regularization, also written $L_{2}$.

$$
\ell_2(\mathbf{\theta}) = 
\frac{\lambda}{2} \left\| \mathbf{\theta} \right\|^{2}_{2}
$$
where commonly $\mathbf{\theta} \in \mathbb{R}^{n}$.

**Note**: $\lambda \in \mathbb{R}$ is called a *hyperparameter* 
because it is set by the developer (you), not learned by the model.

However, our weights form a matrix $\mathbf{W} \in \mathbb{R}^{n \times n_{1}}$, so we use the equivalent form that sums the squares of all its entries.

$$
\begin{align*}
\ell_2(\mathbf{W}) &= \frac{\lambda}{2} \sum_{i=1}^{n} \sum_{j=1}^{n_{1}} w_{ij}^{2} \\
&= \frac{\lambda}{2} \text{sum} \left( \mathbf{W}^{2} \right) 
\end{align*}
$$
where $\mathbf{W}^{2}$ is the element-wise square, that is, $\mathbf{W}^{2} = \mathbf{W} \odot \mathbf{W}$.
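
As a quick numerical check, the sketch below compares the element-wise sum with two norm-based forms of the same penalty. The value of `lambd` and the shape of `W` are assumptions chosen to mirror the notebook's setup.

```python
import torch

torch.set_default_dtype(torch.float64)
torch.manual_seed(0)

lambd = 0.01
W = torch.randn(6, 3)  # shape (n, n_1), as in the notebook

# Form used in the derivation: sum of element-wise squares.
penalty_sum = 0.5 * lambd * (W ** 2).sum()

# Squared 2-norm of the flattened matrix.
penalty_flat = 0.5 * lambd * torch.linalg.vector_norm(W.flatten(), ord=2) ** 2

# Squared Frobenius norm of the matrix.
penalty_fro = 0.5 * lambd * torch.linalg.matrix_norm(W, ord='fro') ** 2

print(penalty_sum.item(), penalty_flat.item(), penalty_fro.item())
```

All three values should agree up to floating-point error, which is why the vector definition carries over to matrices unchanged.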

### objective function derivative

$$
\frac{\partial J}{\partial w_{pq}} =
\frac{\partial L}{\partial w_{pq}} +
\frac{\partial \ell_2}{\partial w_{pq}}
$$

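$$
\begin{align*}
\frac{\partial \ell_2}{\partial w_{pq}} &=
\frac{\lambda}{2} \sum_{i=1}^{n} \sum_{j=1}^{n_{1}}
\frac{\partial}{\partial w_{pq}} \left( 
    w_{ij}^{2}
\right) \\
&= \lambda \sum_{i=1}^{n} \sum_{j=1}^{n_{1}}
w_{ij} \frac{\partial w_{ij}}{\partial w_{pq}} \\
&= \lambda w_{pq}
\end{align*}
$$

The last step holds because only the term with $i=p$ and $j=q$ survives: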
$$
\frac{\partial w_{ij}}{\partial w_{pq}} = \begin{cases}
    1 & \text{if } i=p, j=q \\
    0 & \text{otherwise}
\end{cases}
$$

Since this holds for all $p = 1, \ldots, n$ and $q = 1, \ldots, n_{1}$, we can write it in matrix form:
$$
\frac{\partial \ell_2}{\partial \mathbf{W}} =
\lambda \mathbf{W}
$$
**Remark**: $\nabla_{\mathbf{W}}\ell_2 \in \mathbb{R}^{n \times n_{1}}$.

$$
\begin{align*}
\frac{\partial J}{\partial \mathbf{W}} &=
{\color{Orange} {\frac{\partial L}{\partial \mathbf{W}}}} +
{\color{Cyan} {\frac{\partial \ell_2}{\partial \mathbf{W}}}} \\
&= {\color{Orange} {\nabla_{\mathbf{W}}L}} + 
{\color{Cyan} {\lambda \mathbf{W}}}
\end{align*}
$$
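
The result can be checked against automatic differentiation. The sketch below assumes an MSE loss averaged over all elements, as in `mse_loss`, and uses random data whose shapes are chosen only for illustration.

```python
import torch

torch.set_default_dtype(torch.float64)
torch.manual_seed(0)

lambd = 0.01
x = torch.randn(10, 6)
y_true = torch.randn(10, 3)
W = torch.randn(6, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

# Objective J = MSE + (lambda / 2) * sum(W^2), differentiated by autograd.
y_pred = x @ W + b
loss = ((y_pred - y_true) ** 2).mean()
penalty = 0.5 * lambd * (W ** 2).sum()
(loss + penalty).backward()

# Closed-form gradients: loss gradient plus lambda * W for the weights,
# loss gradient only for the bias (it does not appear in the penalty).
delta = 2 * (y_pred - y_true).detach() / y_true.numel()
grad_w_manual = x.T @ delta + lambd * W.detach()
grad_b_manual = delta.sum(dim=0)

print(torch.allclose(W.grad, grad_w_manual))
print(torch.allclose(b.grad, grad_b_manual))
```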

In [10]:
@add_to_class(LinearRegression)
def update(self, x: torch.Tensor, y_true: torch.Tensor, y_pred: torch.Tensor, lr: float):
    """
    Update the model parameters with L2 regularization.

    Args:
       x: Input tensor of shape (n_samples, n_features).
       y_true: Target tensor of shape (n_samples, out_features).
       y_pred: Predicted output tensor of shape (n_samples, out_features).
       lr: Learning rate. 
    """
    delta = 2 * (y_pred - y_true) / y_true.numel()
    self.b -= lr * delta.sum(axis=0)
    self.w -= lr * (torch.matmul(x.T, delta) + self.lambd * self.w) # L2 regularization
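
The name "weight decay" comes from an equivalent way of writing this update: the weights are first shrunk by a factor of $1 - \text{lr} \cdot \lambda$ and then moved along the plain loss gradient. A minimal sketch, using a random tensor as a stand-in for the loss gradient:

```python
import torch

torch.set_default_dtype(torch.float64)
torch.manual_seed(0)

lr, lambd = 0.01, 0.01
w = torch.randn(6, 3)
grad_loss = torch.randn(6, 3)  # stand-in for torch.matmul(x.T, delta)

# Form used in `update`: step along the gradient of the objective J.
w_objective = w - lr * (grad_loss + lambd * w)

# Equivalent "decay" form: shrink the weights, then take the plain loss step.
w_decay = (1 - lr * lambd) * w - lr * grad_loss

print(torch.allclose(w_objective, w_decay))
```

The equivalence holds for plain gradient descent as used here; it is not claimed for optimizers with momentum or adaptive step sizes.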

# Scratch vs Torch.nn

## Torch.nn model

In [11]:
class TorchLinearRegression(nn.Module):
    def __init__(self, n_features, n_out_features):
        super(TorchLinearRegression, self).__init__()
        self.layer = nn.Linear(n_features, n_out_features, device=device)
        self.loss = nn.MSELoss()

    def forward(self, x):
        return self.layer(x)
    
    def evaluate(self, x, y):
        self.eval()
        with torch.no_grad():
            y_pred = self.forward(x)
            return self.loss(y_pred, y).item()
    
    def fit(self, x, y, epochs, lr, batch_size, x_valid, y_valid, weight_decay):
        
        optimizer = torch.optim.SGD([
            {'params': self.layer.weight, 'weight_decay': weight_decay},
            {'params': self.layer.bias} # no weight_decay here: the bias uses the default (0), as in the scratch model
        ], lr=lr)

        for epoch in range(epochs):
            loss_t = []
            for batch in range(0, len(y), batch_size):
                end_batch = batch + batch_size

                y_pred = self.forward(x[batch:end_batch])
                loss = self.loss(y_pred, y[batch:end_batch])
                loss_t.append(loss.item())

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            loss_t = round(sum(loss_t) / len(loss_t), 4)
            loss_v = round(self.evaluate(x_valid, y_valid), 4)
            print(f'epoch: {epoch} - MSE: {loss_t} - MSE_v: {loss_v}')
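
To see what the two parameter groups do, the sketch below builds the same kind of optimizer on a standalone `nn.Linear` layer, inspects the effective `weight_decay` of each group, and compares one SGD step with the closed-form update. The layer sizes and the random data are assumptions made for the example.

```python
import torch
from torch import nn

torch.set_default_dtype(torch.float64)
torch.manual_seed(0)

lr, weight_decay = 0.01, 0.01
layer = nn.Linear(6, 3)
optimizer = torch.optim.SGD([
    {'params': layer.weight, 'weight_decay': weight_decay},
    {'params': layer.bias},  # no key given: falls back to the optimizer default
], lr=lr)

# Inspect the effective weight decay of each group.
for name, group in zip(('weight', 'bias'), optimizer.param_groups):
    print(name, group['weight_decay'])

# One SGD step on random data, compared with the closed-form update.
x, y = torch.randn(4, 6), torch.randn(4, 3)
w0, b0 = layer.weight.detach().clone(), layer.bias.detach().clone()
loss = nn.functional.mse_loss(layer(x), y)
optimizer.zero_grad()
loss.backward()
grad_w, grad_b = layer.weight.grad.clone(), layer.bias.grad.clone()
optimizer.step()

expected_w = w0 - lr * (grad_w + weight_decay * w0)  # decayed
expected_b = b0 - lr * grad_b                        # not decayed
print(torch.allclose(layer.weight.detach(), expected_w))
print(torch.allclose(layer.bias.detach(), expected_b))
```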

In [12]:
torch_model = TorchLinearRegression(N, NO)

## scratch model

In [13]:
LAMBD: float = 0.01

model = LinearRegression(N, NO, LAMBD)
model.lambd

0.01

## evals

### import MAPE modified

In [14]:
import os
import sys
import requests
import importlib.util
from collections.abc import Callable

def import_mape(module_path: str = '..') -> Callable:
    """
    Tries to import the 'torch_mape' function from a local project structure.
    If it fails (ModuleNotFoundError), it assumes a cloud environment (like Colab),
    downloads the module from GitHub, imports it, and returns the function.

    Args:
        module_path (str): The relative path to the project's root directory
                           for the local search.

    Returns:
        The imported 'torch_mape' function, or None if it fails.
    """
    GITHUB_RAW_URL = 'https://raw.githubusercontent.com/PilotLeoYan/inside-deep-learning/main/tools/torch_metrics.py'
    MODULE_NAME = 'torch_metrics'
    LOCAL_FILE_NAME = f'{MODULE_NAME}.py'

    try:
        # Attempt 1: Standard import (if the package is installed or in PYTHONPATH)
        from tools.torch_metrics import torch_mape
        print("✅ Module 'tools.torch_metrics' successfully imported from the environment.")
        return torch_mape

    except ModuleNotFoundError:
        # Attempt 2: Search in the specified local path (original behavior)
        # This is useful for local development without installing the package.
        project_path = os.path.abspath(module_path)
        added_to_path = project_path not in sys.path
        if added_to_path:
            sys.path.insert(0, project_path)

        try:
            from tools.torch_metrics import torch_mape
            print("✅ Local module 'tools.torch_metrics' imported after adjusting the path.")
            return torch_mape
        except ModuleNotFoundError:
            # If both local attempts fail, proceed with the download
            print("⚠️ Local module not found. Proceeding to download from GitHub...")
        finally:
            # Remove the path only if this function added it, to avoid side effects
            if added_to_path:
                sys.path.remove(project_path)

    # Download and dynamic loading logic
    if not os.path.exists(LOCAL_FILE_NAME):
        try:
            print(f"⬇️  Downloading '{LOCAL_FILE_NAME}' from GitHub...")
            response = requests.get(GITHUB_RAW_URL)
            response.raise_for_status()  # This will raise an error if the HTTP request failed
            with open(LOCAL_FILE_NAME, "w", encoding="utf-8") as f:
                f.write(response.text)
            print("👍 Download complete.")
        except requests.exceptions.RequestException as e:
            print(f"❌ Error downloading the file: {e}")
            return None

    # Dynamically load the module using importlib
    spec = importlib.util.spec_from_file_location(MODULE_NAME, LOCAL_FILE_NAME)
    dynamic_module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(dynamic_module)
    
    print(f"✅ Module '{MODULE_NAME}' successfully loaded from the downloaded file.")
    return dynamic_module.torch_mape

In [15]:
mape = import_mape()

✅ Local module 'tools.torch_metrics' imported after adjusting the path.


### predict

In [16]:
mape(
    model.predict(X_valid),
    torch_model.forward(X_valid)
)

3941.280306302811

### copy parameters

In [17]:
model.copy_params(torch_model.layer)
parameters = (model.b.clone(), model.w.clone())

### predict after copy parameters

In [18]:
mape(
    model.predict(X_valid),
    torch_model.forward(X_valid)
)

0.0

### loss

In [19]:
mape(
    model.evaluate(X_valid, Y_valid),
    torch_model.evaluate(X_valid, Y_valid)
)

0.0

### train

In [20]:
LR = 0.01 # learning rate
EPOCHS = 16 # number of epochs
BATCH = len(X_train) // 3 # batch size

In [21]:
torch_model.fit(
    X_train, Y_train, 
    EPOCHS, LR, BATCH, 
    X_valid, Y_valid,
    LAMBD
)

epoch: 0 - MSE: 11924.5458 - MSE_v: 8943.0544
epoch: 1 - MSE: 10508.8767 - MSE_v: 8383.9955
epoch: 2 - MSE: 9339.9285 - MSE_v: 7887.3242
epoch: 3 - MSE: 8368.6366 - MSE_v: 7442.5281
epoch: 4 - MSE: 7556.0473 - MSE_v: 7041.1584
epoch: 5 - MSE: 6871.2145 - MSE_v: 6676.4069
epoch: 6 - MSE: 6289.535 - MSE_v: 6342.7709
epoch: 7 - MSE: 5791.431 - MSE_v: 6035.7873
epoch: 8 - MSE: 5361.3075 - MSE_v: 5751.8221
epoch: 9 - MSE: 4986.7263 - MSE_v: 5487.9032
epoch: 10 - MSE: 4657.7531 - MSE_v: 5241.5874
epoch: 11 - MSE: 4366.4399 - MSE_v: 5010.8556
epoch: 12 - MSE: 4106.4148 - MSE_v: 4794.0288
epoch: 13 - MSE: 3872.5586 - MSE_v: 4589.7018
epoch: 14 - MSE: 3660.7467 - MSE_v: 4396.6898
epoch: 15 - MSE: 3467.6462 - MSE_v: 4213.987


In [22]:
model.fit(
    X_train, Y_train, 
    EPOCHS, LR, BATCH, 
    X_valid, Y_valid
)

epoch: 0 - MSE: 11924.5458 - MSE_v: 8943.0544
epoch: 1 - MSE: 10508.8767 - MSE_v: 8383.9955
epoch: 2 - MSE: 9339.9285 - MSE_v: 7887.3242
epoch: 3 - MSE: 8368.6366 - MSE_v: 7442.5281
epoch: 4 - MSE: 7556.0473 - MSE_v: 7041.1584
epoch: 5 - MSE: 6871.2145 - MSE_v: 6676.4069
epoch: 6 - MSE: 6289.535 - MSE_v: 6342.7709
epoch: 7 - MSE: 5791.431 - MSE_v: 6035.7873
epoch: 8 - MSE: 5361.3075 - MSE_v: 5751.8221
epoch: 9 - MSE: 4986.7263 - MSE_v: 5487.9032
epoch: 10 - MSE: 4657.7531 - MSE_v: 5241.5874
epoch: 11 - MSE: 4366.4399 - MSE_v: 5010.8556
epoch: 12 - MSE: 4106.4148 - MSE_v: 4794.0288
epoch: 13 - MSE: 3872.5586 - MSE_v: 4589.7018
epoch: 14 - MSE: 3660.7467 - MSE_v: 4396.6898
epoch: 15 - MSE: 3467.6462 - MSE_v: 4213.987


### predict after training

In [23]:
mape(
    model.predict(X_valid),
    torch_model.forward(X_valid)
)

5.961535470546574e-14

### weight 

In [24]:
mape(
    model.w.clone(),
    torch_model.layer.weight.detach().T
)

1.5086733575246558e-14

### bias

In [25]:
mape(
    model.b.clone(),
    torch_model.layer.bias.detach()
)

4.972528614591374e-15