### Implementing a Regression MLP

PyTorch provides a helpful `nn.Sequential` module that chains multiple modules: when you call this module with some inputs, it feeds these inputs to the first module, then feeds the output of the first module to the second module, and so on.

Most neural networks contain stacks of modules, and in fact many neural networks are just one big stack of modules: this makes the `nn.Sequential` module one of the most useful modules in PyTorch.

The MLP we want to build is just that: a simple stack of modules, with two hidden layers and one output layer. So let's build it using the `nn.Sequential` module:


In [None]:
import torch
import torch.nn as nn

torch.manual_seed(42)

model = nn.Sequential(
    nn.Linear(n_features, 50),
    nn.ReLU(),
    nn.Linear(50, 40),
    nn.ReLU(),
    nn.Linear(40, 1)
)


### Layer-by-layer explanation

- **First layer**  
  The first layer must have the right number of inputs for our data: `n_features` (equal to 8 in our case).  
  The number of outputs is a tunable hyperparameter; here we choose 50.

- **ReLU activation**  
  `nn.ReLU` implements the ReLU activation function.  
  It has no parameters and applies the function elementwise.

- **Second hidden layer**  
  The second layer takes 50 inputs (matching the previous layer's output) and outputs 40 features.  
  Hidden layers do not need to have the same width, as long as dimensions match.

- **Output layer**  
  The output layer must match the dimensionality of the targets.  
  Since our targets are scalar values, we use a single output neuron.


### Training the model


In [None]:
learning_rate = 0.1
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
mse = nn.MSELoss()

train_bgd(model, optimizer, mse, X_train, y_train, n_epochs)



That's it: you've trained your first neural network with PyTorch!

However, we are still using **batch gradient descent**, which computes gradients over the entire training set at each iteration. This does not scale well to large datasets or models, so we will later switch to **mini-batch gradient descent**.
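
The `train_bgd()` helper called above is not defined in this section. Below is a minimal sketch of what such a batch gradient descent loop might look like. The function body, the returned loss history, and the synthetic data are assumptions made for illustration, not necessarily the original helper.

```python
import torch
import torch.nn as nn

def train_bgd(model, optimizer, criterion, X_train, y_train, n_epochs):
    """Hypothetical batch gradient descent loop: one update per epoch on the full training set."""
    losses = []
    for epoch in range(n_epochs):
        y_pred = model(X_train)            # forward pass on the whole training set
        loss = criterion(y_pred, y_train)
        loss.backward()                    # gradients computed over all training instances
        optimizer.step()
        optimizer.zero_grad()
        losses.append(loss.item())
    return losses

# Synthetic stand-in data (assumption): 64 instances, 8 features, linear target
torch.manual_seed(42)
X_demo = torch.randn(64, 8)
y_demo = X_demo.sum(dim=1, keepdim=True)

demo_model = nn.Linear(8, 1)
demo_optimizer = torch.optim.SGD(demo_model.parameters(), lr=0.05)
demo_losses = train_bgd(demo_model, demo_optimizer, nn.MSELoss(), X_demo, y_demo, n_epochs=20)
print(f"first loss: {demo_losses[0]:.3f}, last loss: {demo_losses[-1]:.3f}")
```

Because every step uses the full training set, the cost of one update grows linearly with the dataset size, which is the scaling problem mentioned above.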


### The Vanishing / Exploding Gradients Problems

During backpropagation, gradients flow from the output layer back toward the input layer. Unfortunately, these gradients often become:

- **Very small** → vanishing gradients  
- **Very large** → exploding gradients  

When gradients vanish, lower layers learn extremely slowly or not at all.  
When gradients explode, training becomes unstable and diverges.


These problems were among the main reasons deep neural networks were mostly abandoned in the early 2000s.

A 2010 paper by Xavier Glorot and Yoshua Bengio showed that poor **weight initialization** combined with **sigmoid activations** was a major cause of unstable gradients.


### Why sigmoid causes trouble

- Sigmoid saturates at 0 and 1
- Its derivative approaches 0 for large |z|
- Gradients shrink exponentially as they propagate backward

This leaves almost no learning signal for lower layers.
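
To put rough numbers on the saturation argument, the sketch below evaluates the sigmoid derivative at a few points and multiplies the best-case derivative across several layers. It deliberately ignores the weight matrices, which also scale the gradient, so treat it as an illustration of the activation's contribution only. The layer count of 10 is an arbitrary choice.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: s * (1 - s), which peaks at 0.25 when z = 0
    s = sigmoid(z)
    return s * (1.0 - s)

for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z = {z:5.1f}  sigmoid'(z) = {sigmoid_grad(z):.6f}")

# Even in the best case (z = 0 everywhere), each layer scales the gradient by at most 0.25
n_layers = 10
best_case_factor = sigmoid_grad(0.0) ** n_layers
print(f"best-case gradient factor through {n_layers} sigmoid layers: {best_case_factor:.2e}")
```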


### Proper weight initialization

To keep signals stable, we want:

- Forward activations to keep the same variance
- Backward gradients to keep the same variance

Glorot and Bengio proposed scaling the initial weights according to each layer's:

- **fan-in**: number of inputs
- **fan-out**: number of outputs
- **fan-avg**: their average, (fan-in + fan-out) / 2


### Common initialization strategies

| Initialization | Activation functions | Variance |
|----------------|---------------------|----------|
| Xavier (Glorot) | tanh, sigmoid | 1 / fan_avg |
| He (Kaiming) | ReLU, GELU, Swish, Mish | 2 / fan_in |
| LeCun | SELU | 1 / fan_in |
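
As a quick numerical check of the table, the sketch below computes the target variance for each scheme and converts it into the limit of an equivalent uniform distribution. The function names are made up for this example, and the 40-input, 10-output layer shape is only an illustrative choice.

```python
import math

def init_variance(scheme, fan_in, fan_out):
    """Target weight variance for the schemes listed in the table."""
    fan_avg = (fan_in + fan_out) / 2
    if scheme == "glorot":
        return 1 / fan_avg
    if scheme == "he":
        return 2 / fan_in
    if scheme == "lecun":
        return 1 / fan_in
    raise ValueError(f"unknown scheme: {scheme}")

def uniform_limit(variance):
    # U(-r, r) has variance r**2 / 3, so r = sqrt(3 * variance)
    return math.sqrt(3 * variance)

fan_in, fan_out = 40, 10   # illustrative layer shape
for scheme in ("glorot", "he", "lecun"):
    var = init_variance(scheme, fan_in, fan_out)
    print(f"{scheme:>6}: std = {math.sqrt(var):.4f}, uniform limit = {uniform_limit(var):.4f}")
```

For He initialization the uniform limit works out to sqrt(6 / fan_in), which is where the factor of sqrt(6) in manual initialization recipes comes from.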


### Manual initialization (not recommended)


In [None]:
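layer = nn.Linear(40, 10)
# nn.Linear's default init draws weights from U(-1/sqrt(fan_in), 1/sqrt(fan_in));
# scaling by sqrt(6) turns that into He (Kaiming) uniform: U(-sqrt(6/fan_in), sqrt(6/fan_in))
layer.weight.data *= 6 ** 0.5
torch.zero_(layer.bias.data)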


### Recommended: use torch.nn.init


In [None]:
nn.init.kaiming_uniform_(layer.weight)
nn.init.zeros_(layer.bias)


### Applying initialization to all layers


In [None]:
def use_he_init(module):
    if isinstance(module, nn.Linear):
        nn.init.kaiming_uniform_(module.weight)
        nn.init.zeros_(module.bias)

model = nn.Sequential(
    nn.Linear(50, 40),
    nn.ReLU(),
    nn.Linear(40, 1),
    nn.ReLU()
)

model.apply(use_he_init)


### ReLU and its problems

ReLU is fast and effective, but it suffers from **dying ReLUs**:
neurons can permanently output zero if their inputs become negative for all samples.


### Leaky ReLU

Leaky ReLU allows a small slope for negative values:

LeakyReLU_α(z) = max(αz, z)

This prevents neurons from dying completely.


In [None]:
alpha = 0.2

model = nn.Sequential(
    nn.Linear(50, 40),
    nn.LeakyReLU(negative_slope=alpha)
)

nn.init.kaiming_uniform_(
    model[0].weight,
    a=alpha,  # negative slope of the Leaky ReLU, used to compute the gain
    nonlinearity="leaky_relu"
)


### ELU

ELU:
- Produces negative outputs
- Has nonzero gradients everywhere
- Is smooth at z = 0

This often speeds up training, at the cost of extra computation.


### SELU

SELU enables **self-normalizing networks**, but only if:
- Inputs are standardized
- LeCun normal initialization is used
- No batch norm or dropout is applied


### Modern activation functions

- **GELU** – smooth, non-monotonic, widely used in transformers
- **Swish / SiLU** – z · sigmoid(z)
- **SwiGLU** – gated Swish variant (common in transformers)
- **Mish** – smooth, GELU-like
- **ReLU²** – square of ReLU, simple but powerful
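
To make the shapes of these functions concrete, here is a plain-Python sketch of the scalar versions of GELU, Swish, Mish, and squared ReLU, using their standard definitions. The helper names and the sample points are chosen for this example only. SwiGLU is left out because it operates on pairs of inputs rather than on a single scalar.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gelu(z):
    # z * Phi(z), where Phi is the standard normal CDF
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def swish(z, beta=1.0):
    # Swish with beta = 1 is also called SiLU
    return z * sigmoid(beta * z)

def mish(z):
    # z * tanh(softplus(z))
    return z * math.tanh(math.log1p(math.exp(z)))

def relu2(z):
    # Square of ReLU
    return max(0.0, z) ** 2

for z in (-3.0, -1.0, -0.5, 0.0, 1.0, 3.0):
    print(f"z = {z:4.1f}  GELU = {gelu(z):7.4f}  Swish = {swish(z):7.4f}  "
          f"Mish = {mish(z):7.4f}  ReLU2 = {relu2(z):7.4f}")
```

The printed values show the non-monotonic dip of GELU, Swish, and Mish: they are slightly negative for moderately negative inputs and return toward zero as z becomes very negative.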


In [None]:
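import torch
import torch.nn.functional as F

z = torch.randn(32, 8)  # example pre-activations (illustrative shape)
beta = 1.0              # Swish scaling hyperparameter (illustrative value)

# ReLU²: square of ReLU
y = F.relu(z).square()

# SwiGLU: split the features into two halves and use one half to gate the other
z1, z2 = z.chunk(2, dim=-1)
y = F.silu(beta * z1) * z2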


### Batch Normalization (BN)

Batch norm normalizes layer inputs using batch statistics, then learns:
- A scale parameter γ
- A shift parameter β

BN reduces vanishing gradients, allows larger learning rates, and acts as a regularizer.
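
The normalize-then-scale-and-shift operation can be sketched in plain Python for a single feature. The function name, the sample values, and the choice of γ = 2 and β = 0.5 are illustrative assumptions. A real BN layer also tracks running statistics for use at inference time, which this sketch omits.

```python
import math

def batch_norm_1d(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a batch, then scale by gamma and shift by beta."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n      # biased variance over the batch
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

feature_values = [1.0, 2.0, 3.0, 4.0]                  # one feature across a batch of 4 instances
outputs = batch_norm_1d(feature_values, gamma=2.0, beta=0.5)
out_mean = sum(outputs) / len(outputs)
out_std = math.sqrt(sum((y - out_mean) ** 2 for y in outputs) / len(outputs))
print(f"output mean = {out_mean:.4f}, output std = {out_std:.4f}")
```

After the transformation, the feature's mean is β and its standard deviation is approximately γ, regardless of the scale of the inputs.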


In [None]:
model = nn.Sequential(
    nn.Flatten(),
    nn.BatchNorm1d(28 * 28),
    nn.Linear(28 * 28, 300),
    nn.ReLU(),
    nn.BatchNorm1d(300),
    nn.Linear(300, 100),
    nn.ReLU(),
    nn.BatchNorm1d(100),
    nn.Linear(100, 10)
)


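⚠️ Always remember to set the right mode, because BN uses the current batch's statistics in training mode but its running averages in evaluation mode: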

- `model.train()` during training  
- `model.eval()` during evaluation


### Layer Normalization (LN)

LN normalizes across feature dimensions instead of the batch dimension.

Advantages:
- Same behavior during training and inference
- Works well with RNNs and transformers


In [None]:
inputs = torch.randn(32, 3, 100, 200)

layer_norm = nn.LayerNorm([3, 100, 200])
outputs = layer_norm(inputs)


### Gradient Clipping

Gradient clipping prevents exploding gradients by limiting their magnitude.


In [None]:
for epoch in range(n_epochs):
    for X_batch, y_batch in train_loader:
        y_pred = model(X_batch)
        loss = loss_fn(y_pred, y_batch)

        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        optimizer.step()
        optimizer.zero_grad()
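
To make explicit what norm-based clipping does to a gradient vector, here is a plain-Python sketch of the rescaling rule. The function name and the example vectors are assumptions for illustration, and the small `eps` term is only there to avoid division by zero.

```python
import math

def clip_by_norm(grads, max_norm, eps=1e-6):
    """Rescale the whole gradient vector if its l2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))   # never scale gradients up
    return [g * scale for g in grads], total_norm

# A gradient with norm 5 is shrunk to norm 1; its direction is preserved
clipped, norm_before = clip_by_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)

# A gradient that is already small enough is left untouched
unchanged, _ = clip_by_norm([0.3, 0.4], max_norm=1.0)
print(unchanged)
```

Because all components are multiplied by the same factor, clipping by norm keeps the update direction, unlike clipping each component independently.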


## Reusing Pretrained Layers (Transfer Learning)

Training a very large deep neural network from scratch is usually not ideal. Instead, you should first try to find an existing model trained on a similar task and reuse most of its layers. This technique is called **transfer learning**.

Transfer learning:
- Speeds up training
- Requires much less labeled data
- Often leads to better generalization

Typically:
- Lower layers learn generic features (edges, textures, shapes)
- Upper layers learn task-specific patterns


### What to Reuse

When reusing a pretrained model:
- Replace the **output layer** (it likely has the wrong number of outputs)
- Reuse **lower hidden layers**
- Upper layers may or may not be reused depending on task similarity

> The more similar the tasks are, the more layers you should reuse.


### Best Practices

1. Freeze reused layers initially  
2. Train only the new output layer  
3. Gradually unfreeze top layers  
4. Reduce learning rate when unfreezing  
5. More data → more layers can be unfrozen


In [None]:
import torch
import torch.nn as nn


### Example: Original Model (Model A)

Assume Model A was trained on an 8-class Fashion-MNIST-like dataset.


In [None]:
torch.manual_seed(42)

model_A = nn.Sequential(
    nn.Flatten(),
    nn.Linear(1 * 28 * 28, 100),
    nn.ReLU(),
    nn.Linear(100, 100),
    nn.ReLU(),
    nn.Linear(100, 100),
    nn.ReLU(),
    nn.Linear(100, 8)
)

# model_A is assumed to be trained or loaded with pretrained weights


### Reusing Model A for a New Binary Classification Task (Model B)

We remove the output layer and add a new one with a single output.


In [None]:
import copy

reused_layers = copy.deepcopy(model_A[:-1])

model_B_on_A = nn.Sequential(
    *reused_layers,
    nn.Linear(100, 1)  # binary classification
)


### Freezing Reused Layers

This prevents large gradients from damaging pretrained weights early in training.


In [None]:
for layer in model_B_on_A[:-1]:
    for param in layer.parameters():
        param.requires_grad = False


### Loss Function for Binary Classification

We use `BCEWithLogitsLoss`, which combines a sigmoid layer and binary cross-entropy.


In [None]:
loss_fn = nn.BCEWithLogitsLoss()


### After Initial Training

Once the new output layer has stabilized:
- Unfreeze reused layers
- Reduce learning rate
- Fine-tune the entire model


In [None]:
for param in model_B_on_A.parameters():
    param.requires_grad = True


### Important Reality Check

Transfer learning:
- Works **extremely well** for CNNs and Transformers
- Often **does not help much** for small dense networks
- Results can vary wildly with random seeds and dataset splits

Be cautious of overly positive results: they may be cherry-picked.


## Unsupervised Pretraining

If you have:
- Little labeled data
- Plenty of unlabeled data
- No similar pretrained model

You can:
1. Train an unsupervised model (e.g., autoencoder)
2. Reuse lower layers
3. Fine-tune using labeled data


### Historical Note

Early deep learning relied heavily on **greedy layer-wise pretraining** using RBMs.

Today:
- Entire unsupervised models are trained in one shot
- Autoencoders and diffusion models are preferred


## Pretraining on an Auxiliary Task

Another strategy is **self-supervised learning**:
- Automatically generate labels
- Train on a related task
- Reuse learned representations

Example:
- Masked-word prediction for NLP
- Same idea behind modern language models


### Legal & Ethical Warning

Scraping images or personal data:
- May violate copyright law
- Often violates privacy laws
- Requires explicit consent in many countries


### Key Takeaway

If you lack labeled data:
1. Try transfer learning
2. Try unsupervised pretraining
3. Try self-supervised auxiliary tasks

These techniques power modern deep learning.


# Reusing Pretrained Layers (Transfer Learning)

Training a very large deep neural network (DNN) from scratch is usually not a good idea if a similar pretrained model already exists. Instead, you can reuse most of the layers of an existing model and only retrain the top layers. This technique is called **transfer learning**.

Transfer learning significantly speeds up training and requires far less labeled data.


## Why Transfer Learning Works

Suppose you have a DNN trained to classify images into 100 categories (animals, plants, vehicles, etc.), and you now want to classify **specific types of vehicles**. These tasks overlap, so the lower layers of the original network, which detect edges, textures, and shapes, are still useful.

Only the top layers, which learn task-specific patterns, usually need to be replaced.


## Important Notes

- If the new task uses images of a different size, you must resize them to match the original model's input.
- Transfer learning works best when the new task has **similar low-level features**.
- Models trained on natural photos usually do **not** transfer well to medical or satellite images.


## Which Layers Should Be Reused?

- The **output layer** should almost always be replaced.
- Lower hidden layers are more reusable than upper layers.
- The more similar the tasks, the more layers you should reuse.


## Practical Strategy

1. Reuse lower layers from the pretrained model.
2. Freeze the reused layers initially.
3. Train the new output layer.
4. Gradually unfreeze top layers and fine-tune with a smaller learning rate.


# Transfer Learning with PyTorch – Example

Assume a model (Model A) was trained on Fashion MNIST with **8 classes**.
We now want to build a **binary classifier** (T-shirt vs Pullover) using only **20 labeled images**.


In [None]:
import torch
import torch.nn as nn

torch.manual_seed(42)

model_A = nn.Sequential(
    nn.Flatten(),
    nn.Linear(1 * 28 * 28, 100),
    nn.ReLU(),
    nn.Linear(100, 100),
    nn.ReLU(),
    nn.Linear(100, 100),
    nn.ReLU(),
    nn.Linear(100, 8)  # output layer for 8 classes
)

# Assume model_A is already trained or pretrained


## Reusing All Layers Except the Output Layer

We copy all layers except the last one and add a new output layer suitable for binary classification.


In [None]:
import copy

device = "cuda" if torch.cuda.is_available() else "cpu"

torch.manual_seed(42)

reused_layers = copy.deepcopy(model_A[:-1])

model_B_on_A = nn.Sequential(
    *reused_layers,
    nn.Linear(100, 1)  # binary classification output
).to(device)


## Freezing the Reused Layers

To prevent large gradients from destroying pretrained weights, we freeze all reused layers at first.


In [None]:
for layer in model_B_on_A[:-1]:
    for param in layer.parameters():
        param.requires_grad = False


## Loss Function and Metrics

Since this is a **binary classification task**, we use `BCEWithLogitsLoss`.


In [None]:
import torchmetrics

loss_fn = nn.BCEWithLogitsLoss()
accuracy = torchmetrics.Accuracy(task="binary").to(device)


## Fine-Tuning

After a few epochs:
- Unfreeze the reused layers
- Reduce the learning rate
- Continue training to fine-tune the model


In [None]:
for param in model_B_on_A.parameters():
    param.requires_grad = True

# Reduce learning rate in optimizer before continuing training


## Results and Caveats

With transfer learning, test accuracy improved from **71.6% → 92.5%**.

⚠️ However, this result depended heavily on:
- Random seed
- Class selection
- Hyperparameter tuning

This highlights the danger of **p-hacking**: reporting only the best results.


## When Transfer Learning Works Best

Transfer learning is most effective with:
- **Deep CNNs**
- **Transformer architectures**

It is much less effective with small dense networks.


# Unsupervised Pretraining

When no similar pretrained model exists, you can:
1. Collect large amounts of **unlabeled data**
2. Train an **unsupervised model** (e.g., autoencoder)
3. Reuse lower layers and fine-tune using labeled data


## Historical Note

Unsupervised pretraining (e.g., RBMs) was crucial to the revival of deep learning in 2006.
Today, autoencoders and diffusion models are more common.


# Pretraining on an Auxiliary Task

If labeled data is scarce, train a model on a related task with easy-to-obtain labels, then reuse its lower layers.

Example:
- Train a model to determine whether two face images show the same person
- Reuse its layers to build a face classifier with limited data


## Self-Supervised Learning (NLP Example)

Mask words in text and train a model to predict them:
> "What ___ you saying?"

The model learns language structure and can later be fine-tuned for downstream tasks.


# Faster Optimizers

Training very large deep neural networks can be painfully slow. So far, we have seen several techniques to speed up training and improve convergence:

- Better weight initialization
- Better activation functions
- Batch normalization or layer normalization
- Transfer learning

Another major speed boost comes from using **faster optimization algorithms** than plain gradient descent.


## Overview of Optimizers Covered

In this section, we cover the most popular optimizers:

- Momentum
- Nesterov Accelerated Gradient (NAG)
- AdaGrad
- RMSProp
- Adam and its variants (AdaMax, NAdam, AdamW)


# Momentum Optimization

Imagine a bowling ball rolling down a slope. At first it moves slowly, but it gradually accelerates as it builds momentum. This intuition inspired **momentum optimization**, proposed by Boris Polyak in 1964.

Regular gradient descent never builds speed: it only reacts to the current gradient. Momentum, instead, accumulates past gradients to build velocity.


## Gradient Descent Recap

Standard gradient descent updates parameters as:

\[
\theta \leftarrow \theta - \eta \nabla_\theta J(\theta)
\]

If gradients are small, learning becomes very slow.


## Momentum Update Rule

Momentum introduces a velocity vector **m**:

- Gradients act as acceleration
- Parameters are updated using accumulated momentum
- A momentum coefficient **β** controls friction (0 means high friction, 1 means no friction)

Typical value: **β = 0.9**
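
Written out, one common formulation of the update is shown below. The sign convention, in which **m** already includes the minus sign of the gradient step, is an assumption chosen so that the parameters are updated by simply adding the momentum vector:

\[
\begin{aligned}
m &\leftarrow \beta m - \eta \nabla_\theta J(\theta) \\
\theta &\leftarrow \theta + m
\end{aligned}
\]

With β = 0, the velocity carries no memory and this reduces to the standard gradient descent update.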


## Why Momentum Is Faster

If gradients remain constant, momentum reaches a terminal velocity:

\[
\text{velocity} = \frac{\eta}{1 - \beta} \nabla J
\]

For β = 0.9, the terminal velocity is 10 times the size of a plain gradient descent step, so updates can become roughly **10× faster** than with standard gradient descent.
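
The terminal velocity can be checked numerically. The sketch below assumes the update m ← βm − η∇J with a constant gradient; the function name, the gradient value of 2.0, and the 200 iterations are arbitrary choices for illustration.

```python
def momentum_velocity(grad, lr=0.05, beta=0.9, n_steps=200):
    """Iterate m <- beta * m - lr * grad with a constant gradient."""
    m = 0.0
    for _ in range(n_steps):
        m = beta * m - lr * grad
    return m

grad = 2.0
lr, beta = 0.05, 0.9
plain_step = -lr * grad                       # step taken by regular gradient descent
momentum_step = momentum_velocity(grad, lr, beta)
print(f"plain step = {plain_step:.4f}, momentum step after 200 iterations = {momentum_step:.4f}")
print(f"ratio = {momentum_step / plain_step:.2f}")
```

The ratio approaches 1 / (1 − β), which is 10 for β = 0.9.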


## Momentum Helps With:

- Escaping plateaus
- Moving faster through narrow valleys
- Reducing training time in deep networks


### PyTorch: Momentum Optimizer


In [None]:
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.05,
    momentum=0.9
)


## Drawback of Momentum

Momentum introduces an extra hyperparameter (β).  
Fortunately, **β = 0.9** works well in most cases.


# Nesterov Accelerated Gradient (NAG)

Nesterov momentum is a small but powerful improvement over standard momentum, proposed by Yurii Nesterov in 1983.


## Key Idea

Instead of computing the gradient at the current position θ, NAG computes it slightly **ahead**:

\[
\theta + \beta m
\]

This allows the optimizer to correct its trajectory earlier.
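
In equation form, and assuming the convention in which the momentum vector m is added to the parameters, the only change from regular momentum is the point at which the gradient is evaluated:

\[
\begin{aligned}
m &\leftarrow \beta m - \eta \nabla_\theta J(\theta + \beta m) \\
\theta &\leftarrow \theta + m
\end{aligned}
\]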


## Why NAG Is Better

- More accurate gradient direction
- Reduced oscillations
- Faster convergence than regular momentum


### PyTorch: Nesterov Momentum


In [None]:
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.05,
    momentum=0.9,
    nesterov=True
)


# AdaGrad

AdaGrad adapts the learning rate for each parameter individually.  
It is especially useful for problems with **uneven curvature**.


## How AdaGrad Works

- Accumulates squared gradients over time
- Scales updates by the inverse square root of accumulated gradients
- Steep dimensions slow down faster than shallow ones


## Benefit of AdaGrad

- Automatically adapts learning rates
- Requires less tuning of η
- Corrects direction early toward the optimum


## Major Limitation

AdaGrad keeps accumulating gradients forever, causing learning rates to shrink too much.

➡️ Often **stops training too early** in deep neural networks.


### PyTorch: AdaGrad (Generally Not Recommended for DNNs)


In [None]:
optimizer = torch.optim.Adagrad(
    model.parameters(),
    lr=0.05
)


# RMSProp

RMSProp fixes AdaGrad's main issue by using **exponentially decaying averages** instead of accumulating gradients forever.


## RMSProp Key Idea

- Keeps a moving average of squared gradients
- Recent gradients matter more than old ones
- Prevents learning rate from shrinking to zero
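
The difference between accumulating forever and using a decaying average shows up clearly with a constant gradient. The sketch below is a simplified scalar model of both rules, not the exact PyTorch implementations (for instance, it places the smoothing term inside the square root). The function name and the values are assumptions for illustration.

```python
import math

def effective_steps(grad, lr, n_steps, decay=None, eps=1e-10):
    """Step size lr * |g| / sqrt(s + eps) for a constant gradient.

    decay=None accumulates squared gradients forever (AdaGrad-style);
    otherwise s is an exponentially decaying average (RMSProp-style).
    """
    s = 0.0
    steps = []
    for _ in range(n_steps):
        if decay is None:
            s = s + grad ** 2
        else:
            s = decay * s + (1 - decay) * grad ** 2
        steps.append(lr * abs(grad) / math.sqrt(s + eps))
    return steps

lr = 0.05
ada_steps = effective_steps(grad=2.0, lr=lr, n_steps=100)
rms_steps = effective_steps(grad=2.0, lr=lr, n_steps=100, decay=0.9)
print(f"AdaGrad-style: first step = {ada_steps[0]:.4f}, 100th step = {ada_steps[-1]:.4f}")
print(f"RMSProp-style: first step = {rms_steps[0]:.4f}, 100th step = {rms_steps[-1]:.4f}")
```

The AdaGrad-style step keeps shrinking in proportion to 1 / sqrt(t), while the RMSProp-style step settles near the base learning rate.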


## Typical Hyperparameter

- Decay rate α = **0.9**


### PyTorch: RMSProp


In [None]:
optimizer = torch.optim.RMSprop(
    model.parameters(),
    lr=0.05,
    alpha=0.9
)


## RMSProp Summary

- Much better than AdaGrad
- Was the go-to optimizer before Adam
- Still competitive on some tasks


# Adam (Adaptive Moment Estimation)

Adam combines:

- **Momentum** (first moment of gradients)
- **RMSProp** (second moment of gradients)


## What Adam Tracks

- Exponentially decaying average of gradients (mean)
- Exponentially decaying average of squared gradients (variance)
- Bias correction during early training


## Default Hyperparameters

- β₁ = 0.9 (momentum)
- β₂ = 0.999 (scaling)
- ε = 1e-8
- Learning rate η = **0.001**
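
To show how the two moving averages and the bias correction fit together, here is a sketch of one Adam update for a single scalar parameter, using the defaults listed above. The function name and the example gradient are assumptions for illustration; it follows the standard formulation of Adam rather than any particular library's internals.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (t is the 1-based step count)."""
    m = beta1 * m + (1 - beta1) * grad            # decaying average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2       # decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)                  # bias correction: m and v start at 0
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta0 = 1.0
theta1, m1, v1 = adam_step(theta0, grad=5.0, m=0.0, v=0.0, t=1)
print(f"first update size: {theta0 - theta1:.6f}")
```

On the first step the bias-corrected ratio is essentially grad / |grad|, so the update size is close to the learning rate itself, whatever the magnitude of the gradient.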


### PyTorch: Adam Optimizer


In [None]:
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=0.001,
    betas=(0.9, 0.999)
)


## Why Adam Is Popular

- Fast convergence
- Minimal tuning required
- Works well on many problems


# Adam Variants


## AdaMax

- Replaces ℓ2 norm with ℓ∞ norm
- Can be more stable than Adam
- Often slightly worse performance overall


In [None]:
optimizer = torch.optim.Adamax(
    model.parameters(),
    lr=0.001
)


## NAdam

- Adam + Nesterov momentum
- Often converges faster than Adam


In [None]:
optimizer = torch.optim.NAdam(
    model.parameters(),
    lr=0.001
)


## AdamW

AdamW fixes how weight decay is applied in Adam.

- Properly decouples weight decay from gradient updates
- Often generalizes better than Adam


In [None]:
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=0.001,
    weight_decay=1e-4
)


# Important Warning About Adaptive Optimizers

Adaptive optimizers (Adam, RMSProp, etc.) often converge fast but may **generalize poorly** on some datasets.

If your model overfits or underperforms:
➡️ Try **SGD with Nesterov momentum**


# Second-Order Optimization (Brief Note)

Second-order methods use Hessians (curvature information), but:

- Require O(n²) memory
- Too slow for large neural networks

Approximate methods like **Shampoo** exist but are not built into PyTorch.


# Training Sparse Models

All optimizers discussed so far produce **dense models**.

To get sparse models:
- Prune small weights
- Remove entire neurons or channels
- Use ℓ1 regularization


### PyTorch: Weight Pruning Example


In [None]:
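import torch.nn.utils.prune as prune

# Zero out the 30% of weights with the smallest magnitude.
# Assumes `model` has an nn.Linear submodule named `linear`;
# for an nn.Sequential, index the layer instead (e.g. model[0]).
prune.l1_unstructured(
    model.linear,
    name="weight",
    amount=0.3
)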


# Optimizer Comparison Summary

- SGD: slow but reliable
- Momentum / NAG: fast and strong generalization
- Adam / RMSProp: very fast, less tuning
- AdamW: best Adam variant for generalization


# Learning Rate Scheduling

Finding a good learning rate is very important.

- If the learning rate is too high, training will diverge.
- If it is too low, training will be very slow and may get stuck.
- With a constant learning rate, training may improve quickly at first but fail to converge well.

A common strategy is to start with a higher learning rate and reduce it during training.
PyTorch provides several learning rate schedulers in `torch.optim.lr_scheduler`.


## Exponential Scheduling

Exponential scheduling multiplies the learning rate by a constant factor `gamma`
after each epoch:

η_t = η_0 · gamma^t

Typically:
- gamma < 1
- Common values: 0.9, 0.95, 0.99
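
The formula is easy to tabulate directly. The sketch below uses an initial learning rate of 0.05 and `gamma = 0.9` as example values; the function name is made up for this illustration.

```python
def exponential_lr(initial_lr, gamma, epoch):
    """Learning rate after `epoch` epochs of exponential decay."""
    return initial_lr * gamma ** epoch

for epoch in (0, 1, 5, 10, 20):
    print(f"epoch {epoch:2d}: lr = {exponential_lr(0.05, 0.9, epoch):.5f}")
```

With `gamma = 0.9`, the learning rate drops to roughly a third of its initial value after 10 epochs, so smaller values of `gamma` are only practical for short training runs.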


In [None]:
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(8, 50),
    nn.ReLU(),
    nn.Linear(50, 40),
    nn.ReLU(),
    nn.Linear(40, 1)
)

optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

scheduler = torch.optim.lr_scheduler.ExponentialLR(
    optimizer, gamma=0.9
)


In [None]:
for epoch in range(n_epochs):
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()
        loss = mse(model(X_batch), y_batch)
        loss.backward()
        optimizer.step()

    scheduler.step()


## Cosine Annealing

Cosine annealing gradually decreases the learning rate from a maximum value
to a minimum value using a cosine curve.

This keeps the learning rate high for most of training and allows fine-tuning
near the end.
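
The shape of the curve follows from the commonly used closed form for cosine annealing, sketched below in plain Python. The function name is an assumption, and the values (maximum 0.05, minimum 0.001, 20 epochs) are illustrative.

```python
import math

def cosine_annealing_lr(eta_max, eta_min, t, t_max):
    """Closed-form cosine annealing from eta_max (t = 0) down to eta_min (t = t_max)."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / t_max))

for t in (0, 5, 10, 15, 20):
    print(f"epoch {t:2d}: lr = {cosine_annealing_lr(0.05, 0.001, t, 20):.5f}")
```

The decrease is slow at the start, fastest around the midpoint, and slow again as the learning rate approaches its minimum.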


In [None]:
cosine_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer,
    T_max=20,
    eta_min=0.001
)


## Performance Scheduling (ReduceLROnPlateau)

Performance scheduling adjusts the learning rate based on a validation metric.
If the metric stops improving, the learning rate is reduced.

Common parameters:
- mode: "min" or "max"
- patience: number of epochs to wait
- factor: multiplicative drop in learning rate


In [None]:
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer,
    mode="max",
    patience=2,
    factor=0.1
)


In [None]:
from torchmetrics import Accuracy

metric = Accuracy(task="multiclass", num_classes=10).to(device)

for epoch in range(n_epochs):
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(X_batch), y_batch)  # classification loss, e.g. nn.CrossEntropyLoss()
        loss.backward()
        optimizer.step()

    val_acc = evaluate_tm(model, valid_loader, metric).item()
    scheduler.step(val_acc)


## Learning Rate Warm-Up

Warm-up starts training with a very small learning rate and gradually increases it.
This helps stabilize training early on, especially with large batch sizes
or sensitive models.


In [None]:
warmup_scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer,
    start_factor=0.1,
    end_factor=1.0,
    total_iters=3
)


In [None]:
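for epoch in range(n_epochs):
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()
        loss = mse(model(X_batch), y_batch)
        loss.backward()
        optimizer.step()

    # Step the schedulers after the optimizer: warm up for the first 3 epochs,
    # then hand over to the performance scheduler
    val_metric = evaluate_tm(model, valid_loader, metric).item()
    if epoch < 3:
        warmup_scheduler.step()
    else:
        scheduler.step(val_metric)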


## Cosine Annealing with Warm Restarts

This schedule repeatedly applies cosine annealing.
The learning rate periodically jumps back up, helping escape local minima.

Each cycle can be longer than the previous one.
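
The restart points follow directly from the length of the first cycle and the growth factor. The helper below is a plain-Python sketch with a made-up name; it only computes when the restarts happen, not the learning rate itself.

```python
def restart_epochs(t_0, t_mult, n_cycles):
    """Epochs at which the learning rate jumps back up, for cycles of length t_0, t_0 * t_mult, ..."""
    epochs, current, length = [], 0, t_0
    for _ in range(n_cycles):
        current += length        # end of the current cycle
        epochs.append(current)
        length *= t_mult         # the next cycle is t_mult times longer
    return epochs

print(restart_epochs(t_0=2, t_mult=2, n_cycles=4))
```

With a first cycle of 2 epochs and a multiplier of 2, the cycles last 2, 4, 8, and 16 epochs, so the restarts become progressively rarer as training goes on.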


In [None]:
cosine_repeat_scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer,
    T_0=2,
    T_mult=2,
    eta_min=0.001
)


## 1cycle Scheduling

1cycle scheduling:
- Warms up the learning rate
- Gradually cools it down
- Often converges faster and better

It is implemented in PyTorch as `OneCycleLR`.


In [None]:
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.1,
    steps_per_epoch=len(train_loader),
    epochs=n_epochs
)
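
Since `OneCycleLR` is configured with `steps_per_epoch`, its schedule is defined per batch, so `scheduler.step()` belongs inside the inner loop. The following self-contained sketch shows that placement. The synthetic data, the tiny linear model, and all `demo_` names are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(42)
# Synthetic stand-in data and model (assumptions for illustration)
demo_loader = DataLoader(TensorDataset(torch.randn(64, 8), torch.randn(64, 1)), batch_size=16)
demo_model = nn.Linear(8, 1)
demo_optimizer = torch.optim.SGD(demo_model.parameters(), lr=0.05)
demo_epochs = 5
demo_scheduler = torch.optim.lr_scheduler.OneCycleLR(
    demo_optimizer, max_lr=0.1, steps_per_epoch=len(demo_loader), epochs=demo_epochs
)
demo_loss_fn = nn.MSELoss()

lr_history = []
for epoch in range(demo_epochs):
    for X_batch, y_batch in demo_loader:
        demo_optimizer.zero_grad()
        loss = demo_loss_fn(demo_model(X_batch), y_batch)
        loss.backward()
        demo_optimizer.step()
        demo_scheduler.step()          # once per batch, not once per epoch
        lr_history.append(demo_scheduler.get_last_lr()[0])

print(f"steps: {len(lr_history)}, peak lr: {max(lr_history):.4f}, final lr: {lr_history[-1]:.2e}")
```

Stepping the scheduler only once per epoch would cover just a small fraction of the planned schedule, so the learning rate would never complete its warm-up and cool-down.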


## Summary

- Always reduce the learning rate near the end of training
- Use warm-up if training is unstable at the start
- Use ReduceLROnPlateau if progress stalls
- 1cycle is a strong default choice


# Avoiding Overfitting Through Regularization

> “With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.”
> – John von Neumann (via Enrico Fermi)

Deep neural networks often have tens of thousands to billions of parameters.
This flexibility allows them to fit complex datasets, but it also makes them
highly prone to overfitting.

Regularization techniques help constrain the model so it generalizes better.
In this section we cover:
- ℓ1 and ℓ2 regularization
- Dropout
- Monte Carlo (MC) Dropout
- Max-norm regularization


## ℓ1 and ℓ2 Regularization

ℓ2 regularization discourages large weights and is equivalent to weight decay
when using SGD (with or without momentum).

ℓ1 regularization encourages sparsity by driving many weights to zero.
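
To see why ℓ2 regularization and weight decay coincide for plain SGD, consider a penalty of (λ/2)‖w‖², whose gradient is λw. Here λ denotes the regularization strength, a symbol introduced for this short derivation. Adding that gradient to the usual update gives:

\[
w \leftarrow w - \eta \left( \nabla_w J(w) + \lambda w \right) = (1 - \eta \lambda)\, w - \eta \nabla_w J(w)
\]

Each step therefore shrinks the weights by a factor of (1 − ηλ) before applying the regular gradient update. Note the factor of ½ in the penalty: a penalty written as λ‖w‖² without it has gradient 2λw, so it corresponds to a decay coefficient of 2λ.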


In [None]:
# ℓ2 regularization using weight decay (SGD)
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.05,
    weight_decay=1e-4
)


When using Adam, you should use AdamW instead of Adam to get proper weight decay.


In [None]:
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=0.001,
    weight_decay=1e-4
)


### Manual ℓ2 Regularization (Selective Parameters)

Weight decay applies to all parameters by default, including biases and
batch-norm parameters. Sometimes you want to exclude those.


In [None]:
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

params_to_regularize = [
    param for name, param in model.named_parameters()
    if "bias" not in name and "bn" not in name
]

for epoch in range(n_epochs):
    for X_batch, y_batch in train_loader:
        optimizer.zero_grad()
        y_pred = model(X_batch)
        main_loss = loss_fn(y_pred, y_batch)

        l2_loss = sum(param.pow(2).sum() for param in params_to_regularize)
        loss = main_loss + 1e-4 * l2_loss

        loss.backward()
        optimizer.step()


### Parameter Groups for Selective Weight Decay

Parameter groups allow different hyperparameters for different parts of the model.


In [None]:
params_bias_and_bn = [
    param for name, param in model.named_parameters()
    if "bias" in name or "bn" in name
]

optimizer = torch.optim.SGD(
    [
        {"params": params_to_regularize, "weight_decay": 1e-4},
        {"params": params_bias_and_bn}
    ],
    lr=0.05
)


### ℓ1 Regularization

PyTorch does not provide built-in ℓ1 regularization, so it must be added manually.


In [None]:
l1_loss = sum(param.abs().sum() for param in params_to_regularize)
loss = main_loss + 1e-4 * l1_loss


## Dropout

Dropout randomly disables neurons during training.
Each neuron has probability `p` of being dropped at each step.

Typical dropout rates:
- 20–30% for recurrent networks
- 40–50% for convolutional networks

Dropout is only active during training.
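
The mechanism can be sketched in a few lines of plain Python. This version uses the "inverted dropout" convention, in which surviving activations are scaled up by 1 / (1 − p) during training so that nothing needs to change at evaluation time. The function name and the example values are assumptions for illustration.

```python
import random

def dropout(activations, p, training=True, rng=random):
    """Inverted dropout: zero each activation with probability p, rescale the rest by 1 / (1 - p)."""
    if not training or p == 0.0:
        return list(activations)          # dropout is a no-op outside of training
    keep_prob = 1.0 - p
    return [a / keep_prob if rng.random() >= p else 0.0 for a in activations]

random.seed(42)
activations = [1.0] * 10
train_out = dropout(activations, p=0.2)
eval_out = dropout(activations, p=0.2, training=False)
print(train_out)
print(eval_out)
```

The rescaling keeps the expected value of each activation the same in both modes, which is why the layer can simply be switched off for evaluation.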


In [None]:
import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),
    nn.Dropout(p=0.2), nn.Linear(28 * 28, 100), nn.ReLU(),
    nn.Dropout(p=0.2), nn.Linear(100, 100), nn.ReLU(),
    nn.Dropout(p=0.2), nn.Linear(100, 100), nn.ReLU(),
    nn.Dropout(p=0.2), nn.Linear(100, 10)
).to(device)


⚠️ Warning

Since dropout is only active during training, comparing the training loss with the
validation loss can be misleading: a model may be overfitting and yet show similar
training and validation losses.

Always evaluate the training loss with dropout disabled (that is, in `model.eval()` mode).


If the model overfits, increase the dropout rate.
If it underfits, decrease the dropout rate.

Often, applying dropout only to the top hidden layers works best.


### Alpha Dropout

For self-normalizing networks using SELU activation,
use AlphaDropout instead of standard dropout.


In [None]:
nn.AlphaDropout(p=0.1)


## Monte Carlo (MC) Dropout

MC dropout keeps dropout active during inference.
Multiple stochastic predictions are averaged to improve accuracy
and estimate uncertainty.


In [None]:
model.eval()
for module in model.modules():
    if isinstance(module, nn.Dropout):
        module.train()


In [None]:
X_new = X_new.to(device)

torch.manual_seed(42)
with torch.no_grad():
    X_new_repeated = X_new.repeat_interleave(100, dim=0)
    y_logits_all = model(X_new_repeated).reshape(3, 100, 10)
    y_probas_all = torch.softmax(y_logits_all, dim=-1)
    y_probas = y_probas_all.mean(dim=1)


Average probabilities across Monte Carlo samples:


In [None]:
y_probas.round(decimals=2)


Standard deviation of predicted probabilities gives uncertainty estimates.


In [None]:
y_std = y_probas_all.std(dim=1)
y_std.round(decimals=2)


⚠️ Important

Do NOT average logits before applying softmax.
Always average probabilities to correctly reflect uncertainty.
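
A small numerical example shows that the two orders of operations give different answers. The two logit vectors below are made-up values standing in for two stochastic forward passes on the same instance.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability
    exps = [math.exp(l - max(logits)) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Two hypothetical stochastic forward passes for one instance with two classes
logit_samples = [[4.0, 0.0], [0.0, 0.0]]

probas = [softmax(sample) for sample in logit_samples]
mean_of_probas = [sum(col) / len(col) for col in zip(*probas)]

mean_logits = [sum(col) / len(col) for col in zip(*logit_samples)]
softmax_of_mean_logits = softmax(mean_logits)

print("mean of probabilities :", [round(p, 3) for p in mean_of_probas])
print("softmax of mean logits:", [round(p, 3) for p in softmax_of_mean_logits])
```

Averaging the logits first makes the model look more confident than the individual predictions justify, because the pass in which the model was undecided gets washed out before the softmax is applied.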


### Custom MC Dropout Layer

If training from scratch, use a dedicated MC Dropout module.


In [None]:
import torch.nn.functional as F

class McDropout(nn.Dropout):
    def forward(self, input):
        return F.dropout(input, self.p, training=True)


## Max-Norm Regularization

Max-norm regularization constrains the ℓ2 norm of incoming weights for each neuron:

‖w‖₂ ≤ r

Instead of adding a loss term, weights are rescaled after each update.
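
The training snippet in this section calls an `apply_max_norm()` helper that is not defined here. A minimal sketch of such a helper is given below. Its signature, the `epsilon` safeguard, and the default `dim=1` are assumptions for illustration rather than the original implementation, and the demo layer is synthetic.

```python
import torch
import torch.nn as nn

def apply_max_norm(model, max_norm=2.0, epsilon=1e-8, dim=1):
    """Hypothetical helper: rescale weights in place so each l2 norm along `dim` is at most max_norm."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if "bias" not in name:
                actual_norm = param.norm(p=2, dim=dim, keepdim=True)
                target_norm = torch.clamp(actual_norm, 0, max_norm)
                param *= target_norm / (epsilon + actual_norm)   # no-op when the norm is already small

torch.manual_seed(42)
demo_layer = nn.Linear(50, 40)
with torch.no_grad():
    demo_layer.weight *= 100          # inflate the weights so the constraint kicks in
apply_max_norm(demo_layer, max_norm=2.0)
row_norms = demo_layer.weight.norm(p=2, dim=1)
print(f"largest incoming-weight norm after clipping: {row_norms.max().item():.4f}")
```

For an `nn.Linear` layer the weight matrix has one row per output neuron, so taking the norm along `dim=1` constrains each neuron's incoming weights.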


In [None]:
# Call after optimizer.step()
optimizer.step()
apply_max_norm(model, max_norm=2.0)


TIP

For convolutional layers, use:
dim = [1, 2, 3]

This constrains each convolutional kernel instead of each neuron.


# Practical Guidelines

In this chapter we have covered a wide range of techniques, and you may be
wondering which ones you should use. This depends on the task, and there is no
clear consensus yet, but the configuration below works well in most cases
without requiring much hyperparameter tuning.

⚠️ These are **guidelines**, not hard rules.

---

## Recommended Default Configuration

| Hyperparameter            | Default value                                  |
|---------------------------|-----------------------------------------------|
| Kernel initializer        | He initialization                              |
| Activation function       | ReLU if shallow; Swish if deep                 |
| Normalization             | None if shallow; batch-norm or layer-norm if deep |
| Regularization            | Early stopping; weight decay if needed         |
| Optimizer                 | Nesterov accelerated gradients or AdamW        |
| Learning rate schedule    | Performance scheduling or 1cycle               |

---

## Pretraining Guidelines

You should also try to:
- Reuse parts of a **pretrained neural network** if one exists for a similar task
- Use **unsupervised pretraining** if you have a lot of unlabeled data
- Use **pretraining on an auxiliary task** if you have labeled data for a related task

---

## Important Exceptions

### Sparse Models
If you need a sparse model:
- Use **ℓ1 regularization**
- Optionally prune small weights after training (e.g. `torch.nn.utils.prune.l1_unstructured()`)

⚠️ Note: This breaks self-normalization, so avoid SELU-based architectures.

---

### Low-Latency Models
If inference speed is critical:
- Use fewer layers
- Use fast activations such as:
  - `nn.ReLU`
  - `nn.LeakyReLU`
  - `nn.Hardswish`
- Fold batch-norm layers into the previous layers after training
- Prefer sparse models
- Reduce numerical precision:
  - FP16
  - INT8

Appendix B covers:
- Reduced precision models
- Mixed precision training
- Quantization

---

### Risk-Sensitive Applications
If uncertainty matters more than latency:
- Use **Monte Carlo Dropout**
- Gain:
  - Better predictive performance
  - Reliable probability estimates
  - Uncertainty estimates

---

## Chapter Wrap-Up

Over the last three chapters, we have learned:
- What artificial neural networks are
- How to build and train them using Scikit-Learn and PyTorch
- Practical techniques to train deep and complex networks

In the next chapter, all of this comes together as we dive into one of the most
important applications of deep learning:

👉 **Computer Vision**
