# Week 4: Training and Hyperparameter Optimization for Neural Networks


Welcome to Week 4! Last week, we built the foundational components of our machine learning pipeline: preparing the dataset, creating a simple model structure, and setting up training and evaluation for a basic neural network. This week, we’ll take a deeper look at the **training process itself**: the engine that allows our model to actually learn.

## Learning objectives

* Understand the role of **loss functions** and why they are central to guiding model learning.
* Explore different **optimization algorithms** (e.g., SGD, Adam) and how they update model parameters.
* Learn how to track and interpret **training vs. validation loss** curves.
* Gain hands-on experience adjusting **hyperparameters** (learning rate, batch size, epochs) and seeing their effect on performance.


In [None]:
!pip -q install nbimporter torch

In [None]:
import numpy as np
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.nn.functional as F

Instead of keeping all our helper functions in this notebook, we will place them in `utils.py` and import them when needed. Take a minute to review the functions and classes in `utils.py`, and note the familiar ones from Week 3.

In [None]:
from utils import train_model, evaluate_model, plot_loss, get_generic_model, get_pseudo_predictions, get_pseudo_targets, plot_pseudo_data

## Section 1: Introduction to the Training Process

In this section, you will be introduced to the core mechanics of training a neural network. This includes the concepts of epochs, loss landscapes, and the goals of optimization. Training a neural network involves minimizing a loss function, which can be visualized as navigating a landscape with peaks and valleys. A global minimum represents the best possible performance, while local minima represent suboptimal solutions where training can get stuck.

![minima](minima.png)

Recall the training loop from Week 3. This is what the actual function looks like.

In [None]:
def train_model(model, train_loader, val_loader, criterion=None, optimizer=None, learning_rate=1e-3, num_epochs=30, seed=42):
    """
    Function to train the model and record loss history.
    """
    # Set the default loss function
    if criterion is None:
        criterion = nn.MSELoss()
    
    # Set the default optimizer
    if optimizer is None:
        optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

    # Store loss history for plotting
    history = {'train_loss': [], 'val_loss': []}
    
    print("Starting training and validation...")
    for epoch in range(num_epochs):
        
        # Training Phase
        model.train()  # Set the model to training mode
        running_train_loss = 0.0
        
        for inputs, targets in train_loader:
            optimizer.zero_grad()               # Clear previous gradients
            outputs = model(inputs)             # Forward pass (get predictions)
            loss = criterion(outputs, targets)  # Calculate loss
            loss.backward()                     # Backward pass (compute gradients)
            optimizer.step()                    # Update weights with gradients
            
            running_train_loss += loss.item() * inputs.size(0)
            
        epoch_train_loss = running_train_loss / len(train_loader.dataset)
        history['train_loss'].append(epoch_train_loss)
        
        # Validation Phase
        model.eval()  # Set the model to evaluation mode
        running_val_loss = 0.0
        
        with torch.no_grad():  # Turns off gradient tracking during evaluation
            
            for inputs, targets in val_loader:
                
                outputs = model(inputs)
                loss = criterion(outputs, targets)
                running_val_loss += loss.item() * inputs.size(0)
                
        epoch_val_loss = running_val_loss / len(val_loader.dataset)
        history['val_loss'].append(epoch_val_loss)
        
        print(f"Epoch {epoch+1}/{num_epochs} | "
              f"Train Loss: {epoch_train_loss:.4f} | "
              f"Val Loss: {epoch_val_loss:.4f}")
        
    print("Finished Training.")
    return history

### Key Components of the `train_model()` Function

Let us walk through some key components of the `train_model()` function. 

- `criterion` (Loss): The `criterion` is our loss function. It measures how far the model's predictions are from the ground truth values. The goal of training is to minimize this loss. By default, this function uses the **Mean Squared Error (MSE)** loss.

- `optimizer`: The `optimizer` updates the model's weights to reduce the loss. By default, this function uses the **Adam** optimizer, a widely used adaptive method that performs well across many tasks with default settings. Think of it as the “coach” that decides how big or small the corrections should be after each mistake.

    * It looks at the **gradients** (calculated during backpropagation) and nudges the parameters in the direction that should reduce error.
    * Examples: **SGD**, **Adam**, **RMSProp**.

- `history`: The `history` dictionary stores the training and validation loss after each pass through the dataset. This enables us to plot learning curves, monitor progress, and diagnose issues such as overfitting. Its structure also works directly with the `plot_loss()` function from `utils.py`.

- Training and Validation loops: The loops repeatedly train and evaluate the model over multiple full passes through the training data. Before each phase, we set the model mode with `model.train()` and `model.eval()`, which ensures that certain special layers (like dropout or batch normalization) behave correctly during training and evaluation.
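
To make the `optimizer` bullet more concrete, here is a minimal sketch that fits the same toy regression problem with three optimizers from `torch.optim`. The toy data (`y = 2x + 1`), the helper name `fit_toy_model`, the learning rate, and the step count are all assumptions chosen for illustration; none of them come from `utils.py`.

```python
import torch
import torch.nn as nn

# Toy data (assumed for illustration): y = 2x + 1 on 20 evenly spaced points
x = torch.linspace(-1, 1, 20).unsqueeze(1)
y = 2 * x + 1


def fit_toy_model(optimizer_class, lr=0.05, steps=200, seed=0):
    """Hypothetical helper: train a single-input linear model, return (initial_loss, final_loss)."""
    torch.manual_seed(seed)              # Same starting weights for every optimizer
    model = nn.Linear(1, 1)
    criterion = nn.MSELoss()
    optimizer = optimizer_class(model.parameters(), lr=lr)

    initial_loss = criterion(model(x), y).item()
    for _ in range(steps):
        optimizer.zero_grad()            # Clear previous gradients
        loss = criterion(model(x), y)    # Forward pass + loss
        loss.backward()                  # Compute gradients
        optimizer.step()                 # Update weights
    final_loss = criterion(model(x), y).item()
    return initial_loss, final_loss


results = {}
for optimizer_class in (torch.optim.SGD, torch.optim.Adam, torch.optim.RMSprop):
    name = optimizer_class.__name__
    results[name] = fit_toy_model(optimizer_class)
    initial, final = results[name]
    print(f"{name:8s} | initial loss: {initial:.4f} | final loss: {final:.6f}")
```

Because `train_model()` accepts an `optimizer` argument, any of these could be passed in instead of the default, for example `optimizer=torch.optim.SGD(model.parameters(), lr=0.01)`.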

Below, we will elaborate on some of these key concepts.

### The Training Phase

The **<span style="background-color: #AFEEEE">**training phase**</span>** is a repetitive process where a neural network learns from data. Each full pass through all the training examples is called an **<span style="background-color: #AFEEEE">**epoch**</span>**. During an epoch, the model makes predictions, measures error, and updates its parameters to reduce the loss. Over many epochs, the model gradually improves, much like practicing a skill: you make a guess, see the error, adjust, and try again.

Within each epoch, we process the training data in **batches**, performing the following steps for each batch:

1. **Forward Pass**: Use the current model parameters to make predictions.
2. **Calculate Loss**: Measure how far the current predictions are from the ground truth values.
3. **Backward Pass**: Call `loss.backward()` to determine how each weight contributed to the error. The resulting gradients indicate how to adjust the weights to reduce the error.
4. **Weight Update**: Call `optimizer.step()` to adjust the weights based on the gradients.

### Backpropagation

The backward pass is carried out by the **<span style="background-color: #AFEEEE">**backpropagation**</span>** algorithm (the term is sometimes used loosely for the whole forward-and-backward sequence described above). It is an efficient way to compute the gradients for all the weights in the neural network in one backward pass from the outputs to the inputs. Behind the scenes, the algorithm applies the chain rule from calculus, and the resulting gradients indicate how each weight should change to reduce the loss.

You don't need to worry about the math.  The key idea to remember is that backpropagation tells the model how to adjust its weights to improve its predictions. If you're interested in learning more, take a look at [this resource](https://www.youtube.com/watch?v=Ilg3gGewQ5U&embeds_referring_euri=https%3A%2F%2Fchatgpt.com%2F&source_ve_path=MjM4NTE).
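
For readers who want to see one gradient computed, the sketch below uses a "model" with a single weight and compares the gradient PyTorch produces with the same value worked out by hand. The numbers (`w = 2`, `x = 3`, `target = 12`) are made up for illustration.

```python
import torch

# One-weight "model": prediction = w * x (values chosen for illustration)
w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(3.0)
target = torch.tensor(12.0)

prediction = w * x                  # Forward pass: 2 * 3 = 6
loss = (prediction - target) ** 2   # Squared error: (6 - 12)^2 = 36
loss.backward()                     # Backward pass: fills in w.grad

# By hand: d(loss)/dw = 2 * (prediction - target) * x = 2 * (-6) * 3 = -36
manual_grad = 2 * (prediction.item() - target.item()) * x.item()

print(f"Loss: {loss.item():.1f}")
print(f"Autograd gradient: {w.grad.item():.1f}")
print(f"Manual gradient:   {manual_grad:.1f}")
```

The gradient is negative, which means that increasing `w` lowers the loss. An optimizer step of the form `w - learning_rate * gradient` therefore moves `w` upward, toward the value 4 that would make the prediction exact.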

### The Validation Phase

The **<span style="background-color: #AFEEEE">**validation phase**</span>** is used to evaluate how well the neural network generalizes to unseen data. Unlike training, this phase does **not** update the model’s weights. Instead, it measures performance on a separate **validation set** after each epoch to track how learning is progressing.

During validation, the following steps occur:

1. **Forward Pass:** The model makes predictions on the validation data using the current weights.
2. **Calculate Loss:** The loss is computed to estimate how far these predictions are from the true labels, just like in training — but without computing gradients.
3. **Performance Metrics:** Optionally, accuracy, precision, recall, or other metrics are calculated to assess model quality. (The `train_model()` function above records only the loss.)
4. **No Backpropagation:** The model’s parameters are **not** updated — gradients are disabled to save memory and ensure the evaluation reflects the model’s current state.

By comparing **training** and **validation** loss over epochs, we can see whether the model is learning effectively, underfitting, or overfitting.
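
The two switches used in the validation phase, `model.eval()` and `torch.no_grad()`, can be seen in isolation in the small sketch below. It uses a standalone `nn.Dropout` layer and a standalone `nn.Linear` layer rather than the course model, so the layer sizes and inputs are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)   # Randomly zeroes inputs, but only in training mode
inputs = torch.ones(8)

dropout.train()               # Training mode: dropout is active
train_output = dropout(inputs)

dropout.eval()                # Evaluation mode: dropout passes inputs through unchanged
eval_output = dropout(inputs)

# In training mode each entry is 0 (dropped) or 2 (kept and rescaled by 1 / (1 - p))
print(f"train mode: {train_output}")
print(f"eval mode:  {eval_output}")

# torch.no_grad() stops PyTorch from recording operations for backpropagation
layer = nn.Linear(8, 1)
with torch.no_grad():
    no_grad_output = layer(inputs)
grad_output = layer(inputs)

print(f"inside no_grad:  requires_grad={no_grad_output.requires_grad}")
print(f"outside no_grad: requires_grad={grad_output.requires_grad}")
```

Note that the two switches do different jobs: `eval()` changes how layers such as dropout behave, while `no_grad()` only turns off gradient tracking. This is why the validation loop in `train_model()` uses both.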

### Loss Function

In machine learning, a **<span style="background-color: #AFEEEE">**loss function**</span>** measures how wrong a model’s predictions are by comparing them to the true values. It outputs a single number—the **loss**—which indicates the model’s prediction error. A lower loss means better performance.

During training, the model adjusts its parameters to **minimize this loss**. It uses **backpropagation** to compute how much each parameter contributed to the loss, and an **optimizer** (like SGD or Adam) updates the parameters to reduce the error in future predictions.

The loss function is critical because it **defines the goal of learning**—guiding the model on what "improvement" means. Without it, the model wouldn’t know whether it’s getting better or worse.

Different tasks require different loss functions. For example:

* **Mean Squared Error (MSE)** is commonly used for regression problems.
* **Cross-Entropy Loss** is used for classification tasks.

#### Bullseye Metaphor

Imagine the true answer is the **center of a bullseye target**.

* When the model’s prediction is **right on the bullseye**, the loss is **zero** — perfect accuracy.
* If the prediction **lands close to the center**, the loss is **small**, meaning the model did well but not perfectly.
* If the prediction **hits far from the bullseye**, the loss is **large**, showing a bigger mistake.
* During training, the model tries to “aim its arrow” closer and closer to the bullseye to reduce the loss.

The loss function measures the **distance from the bullseye**, guiding the model to improve its aim with each try. 

### Implementing a Loss Function using `criterion()`

In `PyTorch`, the term `criterion` typically refers to the **loss function object**. It's defined before training starts and used to calculate the loss during each training step. 

For example:

> `import torch.nn as nn`
> 
> `criterion = nn.CrossEntropyLoss()`  # For classification

You then use it like this during training:

> `loss = criterion(predictions, targets)`

Here, `criterion()` takes the model’s predictions and the true labels, and returns the loss value. This value is then used in backpropagation:

> `loss.backward()`
> 
> `optimizer.step()`

So, the `criterion()` is the functional tool that connects **predictions**, **ground truth**, and the **learning objective**.
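
Putting those lines together, here is a small self-contained sketch of `nn.CrossEntropyLoss` in use. The logits and targets are made-up values; the point is the expected input format (raw scores per class, plus one integer class index per sample) and a by-hand check of the returned number.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

# Raw model outputs (logits) for 2 samples and 3 classes; values are made up
logits = torch.tensor([[2.0, 0.5, 0.1],
                       [0.2, 0.3, 1.5]], requires_grad=True)
# True class index for each sample (integers, not one-hot vectors)
targets = torch.tensor([0, 2])

loss = criterion(logits, targets)
loss.backward()   # Gradients with respect to the logits are now in logits.grad

# Same value computed by hand: mean negative log-probability of the true class
log_probs = torch.log_softmax(logits.detach(), dim=1)
manual_loss = -(log_probs[0, 0] + log_probs[1, 2]) / 2

print(f"criterion loss: {loss.item():.4f}")
print(f"manual loss:    {manual_loss.item():.4f}")
```

In a real training loop the gradients would flow back into the model's weights rather than into a hand-written `logits` tensor, and `optimizer.step()` would then apply them.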

### Underfitting vs. Overfitting

As the model learns, it’s important to make sure it generalizes well—not just memorizing training data. We touched on underfitting and overfitting in previous modules such as HMB201 Week 7. Here is a quick refresher:

* **Underfitting** happens when a model is unable to capture the important patterns in the dataset, leading to poor performance on both training and test data.
  *Example:* A straight line trying to fit a curved pattern (see the leftmost graph in the image below).

* **Overfitting** happens when a model performs well on the training data but poorly on new, unseen data such as the validation and test sets.
  *Example:* A wiggly curve that fits every training point exactly but fails on test data (see the rightmost graph in the image below).

![fit examples](fit_examples.png)

Image retrieved from [here](https://medium.com/greyatom/what-is-underfitting-and-overfitting-in-machine-learning-and-how-to-deal-with-it-6803a989c76).

The loss function and `criterion()` help monitor these issues:

* A very low training loss but high validation loss indicates **overfitting**.
* A high loss on both training and validation sets suggests **underfitting**.
* A low training loss AND a low validation loss, with both close to each other, suggests a **well-fitted** model.
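
These three rules can be written as a rough check in code. The sketch below is only a heuristic: the function name `diagnose_fit` and the thresholds `high_loss` and `max_gap` are assumptions invented for illustration, and what counts as a "high" loss or a "large" gap depends entirely on the task and the loss function.

```python
def diagnose_fit(train_loss, val_loss, high_loss=0.5, max_gap=0.1):
    """Hypothetical rule of thumb mapping final losses to a fit label.

    high_loss and max_gap are arbitrary illustrative thresholds.
    """
    if train_loss > high_loss and val_loss > high_loss:
        return "underfitting"       # High loss on both sets
    if val_loss - train_loss > max_gap:
        return "overfitting"        # Low training loss, much higher validation loss
    return "well fitted"            # Both low and close together


print(diagnose_fit(train_loss=0.02, val_loss=0.60))
print(diagnose_fit(train_loss=0.90, val_loss=0.95))
print(diagnose_fit(train_loss=0.05, val_loss=0.07))
```

In practice you would look at the whole loss curve rather than a single pair of numbers, but the same comparisons apply.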

Let's try an exercise to visualize the targets, the predictions, and the loss computed by the criterion.

In [None]:
# Example: Mean Squared Error Loss
criterion = nn.MSELoss()

# Fake "model outputs" (predictions)
predictions = get_pseudo_predictions()

# Fake "ground truth" targets
targets = get_pseudo_targets()

# Calculate loss
loss_value = criterion(predictions, targets)

print(f"Predictions:\n{predictions}")
print(f"Targets:\n{targets}")
print(f"Loss: {loss_value.item():.4f}")


In [None]:
plot_pseudo_data(predictions, targets, loss_value)

This will give you:

- Blue dots = model predictions

- Green dots = actual targets

- Red dashed lines = error for each point

**Formula:**

$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^n (\text{prediction}_i - \text{target}_i)^2
$$

So in this case:

* Prediction: 0.2, Target: 0.0 → Error: 0.2 → Squared: 0.04
* Prediction: 0.8, Target: 1.0 → Error: -0.2 → Squared: 0.04
* Prediction: 0.5, Target: 0.6 → Error: -0.1 → Squared: 0.01
* **Mean** = (0.04 + 0.04 + 0.01) / 3 = 0.03

That `0.03` is what `criterion` returns — the **average squared difference between your predictions and the truth**.
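
The arithmetic can be checked directly. The sketch below hard-codes the three prediction/target pairs from the worked example (it does not call `get_pseudo_predictions()` or `get_pseudo_targets()`, whose exact outputs are not shown here) and compares `nn.MSELoss` with the same calculation written out step by step.

```python
import torch
import torch.nn as nn

# Values copied from the worked example above
predictions = torch.tensor([0.2, 0.8, 0.5])
targets = torch.tensor([0.0, 1.0, 0.6])

criterion = nn.MSELoss()
loss_from_criterion = criterion(predictions, targets)

# The same calculation written out step by step
errors = predictions - targets          # prediction - target for each point
squared_errors = errors ** 2            # Squaring removes the sign
loss_by_hand = squared_errors.mean()    # Average over the n = 3 points

print(f"Errors:         {errors}")
print(f"Squared errors: {squared_errors}")
print(f"criterion:      {loss_from_criterion.item():.4f}")
print(f"by hand:        {loss_by_hand.item():.4f}")
```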

Note: the criterion can be set to other loss functions. Here are some common choices:

- Regression → `MSELoss`, `L1Loss`, `HuberLoss`.

- Binary classification → `BCEWithLogitsLoss`.

- Multi-class classification → `CrossEntropyLoss`.


---
**Q*1: Choose 2 different `criterion` loss functions and run them with the pseudo predictions and targets. Print the losses.**
> Hint: Refer to the documentation [here](https://docs.pytorch.org/docs/stable/nn.html#loss-functions).

<span style="background-color: #FFD700">**Write your code below**</span>


In [None]:
# YOUR CODE HERE

criterion = ...

# Calculate loss


criterion = ...

# Calculate loss


### Visualizing the Loss Surface

In machine learning, a **loss surface** is a visual or mathematical representation of how the **loss function** (which measures prediction error) changes based on the model's parameters.

Definitions:

* The **loss function** quantifies how well a model's predictions match the actual values (lower is better).
* The **loss surface** is the shape created when plotting the loss values against different combinations of model parameters.

Real models have thousands or millions of parameters, so their loss surfaces cannot be drawn directly. The 2D and 3D examples below use one or two parameters to build intuition.

### 2D Case: Local vs Global Minimum

A **model with one parameter** (e.g., $w$) will produce a **2D graph** of the loss surface.

* **X-axis**: The single model parameter (e.g., $w$)
* **Y-axis**: Loss value $\mathcal{L}(w)$

In a 2D scenario, we can visualize the **loss landscape** as a curve:

![2D loss surface](2D_loss_surface.png)

Image retrieved from [here](https://medium.com/aimonks/navigating-the-peaks-and-valleys-of-optimization-global-minimum-vs-25c05de6f69a).


The **red dashed line** is the **global minimum** — the best loss value.

The **green dashed line** is a **local minimum** — better than nearby points, but not optimal.
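
To see how training can end up in a local minimum, the sketch below runs plain gradient descent on a one-parameter loss with two valleys. The loss function `w**4 - 3*w**2 + w` is an assumption chosen for illustration (it is not the curve in the figure), as are the learning rate and the two starting points.

```python
def loss(w):
    """Illustrative 1D loss with two valleys (assumed, not taken from the figure)."""
    return w**4 - 3 * w**2 + w


def gradient(w):
    """Derivative of the loss above with respect to w."""
    return 4 * w**3 - 6 * w + 1


def gradient_descent(w_start, learning_rate=0.01, steps=500):
    """Repeatedly step against the gradient, starting from w_start."""
    w = w_start
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
    return w


w_from_left = gradient_descent(w_start=-2.0)   # Settles in the deeper (global) valley
w_from_right = gradient_descent(w_start=2.0)   # Settles in the shallower (local) valley

print(f"Start at -2.0 -> w = {w_from_left:.3f}, loss = {loss(w_from_left):.3f}")
print(f"Start at  2.0 -> w = {w_from_right:.3f}, loss = {loss(w_from_right):.3f}")
```

Both runs follow the same rule, yet the starting point alone decides which valley they reach. This is one reason why weight initialization and optimizer choice can change the final result.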

---

### 3D Case: A More Realistic Loss Landscape

For a model with **two parameters** (say $w_1$ and $w_2$), the loss surface can be visualized in 3D:

* **X-axis**: Parameter 1 ($w_1$)
* **Y-axis**: Parameter 2 ($w_2$)
* **Z-axis**: Loss value $\mathcal{L}(w)$

This results in a surface — possibly a bowl, ridge, or complex landscape — depending on the model and data.



![loss function visualization](descent_2D_sphere.gif)

Image retrieved from [here](https://egallic.fr/Enseignement/ML/ECB/book/gradient-descent.html).


The GIF shows a point rolling down along a smooth 3D bowl-shaped loss surface, illustrating how an optimization algorithm like gradient descent moves model parameters step-by-step to minimize loss and find the best solution.

In training, your optimizer "walks" across this surface trying to find a valley (minimum loss).
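
The size of each step on that walk is set by the learning rate. The sketch below runs gradient descent on a simple two-parameter bowl, `w1**2 + w2**2`, with three learning rates. The bowl, the starting point, and the specific learning rates are assumptions picked so that the three behaviours (too slow, reasonable, diverging) are easy to see.

```python
def bowl_loss(w1, w2):
    """Bowl-shaped loss with its minimum at (0, 0); assumed for illustration."""
    return w1**2 + w2**2


def descend(learning_rate, steps=20, start=(3.0, -2.0)):
    """Run gradient descent on the bowl and return the final loss."""
    w1, w2 = start
    for _ in range(steps):
        grad_w1, grad_w2 = 2 * w1, 2 * w2      # Partial derivatives of the loss
        w1 -= learning_rate * grad_w1
        w2 -= learning_rate * grad_w2
    return bowl_loss(w1, w2)


initial_loss = bowl_loss(3.0, -2.0)
final_losses = {lr: descend(lr) for lr in (0.01, 0.1, 1.1)}

print(f"Initial loss: {initial_loss:.4f}")
for lr, final_loss in final_losses.items():
    print(f"learning rate {lr:<4} -> loss after 20 steps: {final_loss:.4f}")
```

With the smallest learning rate the loss falls only slowly, with the middle one it gets close to zero, and with the largest one every step overshoots the minimum so the loss grows instead of shrinking.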


### Summary

| **Concept**          | **Description**                                                                                                         |
| -------------------- | ----------------------------------------------------------------------------------------------------------------------- |
| **Epoch**            | One full pass through the training data                                                                                 |
| **Loss**             | A measure of prediction error                                                                                           |
| **Local Minimum**    | A point where loss is low relative to nearby points                                                                     |
| **Global Minimum**   | The lowest possible point on the entire loss surface                                                                    |
| **Loss Surface**     | A multi-dimensional shape representing how loss changes with different model parameters                                 |
| **Loss Curve**       | Graph showing how loss changes over epochs during training                                                              |
| **Optimizer**        | Algorithm that adjusts model parameters to minimize loss; includes SGD, Adam, RMSProp, etc.                             |
| **Gradient Descent** | An optimization algorithm that updates model parameters in the opposite direction of the gradient                       |
| **Learning Rate**    | Step size used in gradient descent; too high may overshoot minima, too low slows training                               |
| **Underfitting**     | The model is too simple or hasn’t trained enough; performs poorly on all data                                           |
| **Overfitting**      | The model learns the training data too well, including noise; poor on new data                                          |
| **Well-Fitting**     | The model performs well on both training and validation data; generalizes effectively                                   |
| **Validation Loss**  | Loss measured on data not used for training; indicates generalization                                                   |
| **Training Loss**    | Loss measured on data used to train the model                                                                           |


## Section 2: Visualizing Training Performance

Visualizing loss and accuracy across epochs helps identify training behavior, such as overfitting, underfitting, and convergence. It gives immediate insight into whether training is working as expected.

### Exercises:

1. **Plot Training Curves**

   * Use `matplotlib` to plot loss and accuracy per epoch.
   * Add labels, grid, and legend for clarity.

2. **Analyze Curve Behavior**

   * Identify points where loss plateaus or spikes.
   * Discuss what these indicate about the training process.

3. **Bonus Task:**

   * Add validation accuracy and loss to the same plots.
   * Interpret gaps between training and validation curves.

---


In Week 3, you were introduced to the loss graph. Here is an example:

![loss graph](example_training_loss_over_epochs.png)

Image retrieved from [here](https://www.geeksforgeeks.org/deep-learning/training-and-validation-loss-in-deep-learning/).

The loss values depend on the `criterion` you select. You can plot these graphs using `plot_loss(history)` from `utils.py`.
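
If you want to build a similar plot yourself, a minimal `matplotlib` sketch might look like the following. The function name `plot_history_sketch` and the example numbers are hypothetical; this is not the implementation of `plot_loss()` in `utils.py`, only an illustration that works with a `history` dictionary of the same shape.

```python
import matplotlib.pyplot as plt


def plot_history_sketch(history):
    """Hypothetical stand-in for plot_loss(): draw both loss curves on one set of axes."""
    epochs = range(1, len(history["train_loss"]) + 1)
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot(epochs, history["train_loss"], label="Training loss")
    ax.plot(epochs, history["val_loss"], label="Validation loss")
    ax.set_xlabel("Epoch")
    ax.set_ylabel("Loss")
    ax.set_title("Training vs. validation loss")
    ax.grid(True)
    ax.legend()
    return fig, ax


# Made-up history in the same format that train_model() returns
example_history = {
    "train_loss": [0.90, 0.60, 0.40, 0.30, 0.25],
    "val_loss": [0.95, 0.70, 0.55, 0.50, 0.52],
}
fig, ax = plot_history_sketch(example_history)
```

In a notebook the figure is displayed automatically after the cell runs; in a script you would call `plt.show()` or `fig.savefig(...)`.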

Here are example training loss curves:

![learning curve scenarios](learning_curves_scenarios.jpg)

1. **Plateau in Loss** (top left) – The loss stops improving, possibly due to a local minimum or a low learning rate.
2. **Spikes in Loss** (top right) – Irregular jumps in the loss, often caused by a high learning rate or noisy batches.
   > Note 1: small spikes = normal, large persistent spikes = check your setup.
   > 
   > Note 2: not necessarily a bad thing (depends on your model and training setup).
3. **Healthy Convergence** (bottom left) – A smooth, steady decrease in loss, showing effective learning.
4. **Underfitting** (bottom right) – Loss decreases only slightly, indicating the model is too simple or undertrained.

⚠️ **Important Note:**
Not every loss curve pattern automatically means something is *bad*. For example, plateaus may just mean convergence, small spikes can be normal with noisy data, and even underfitting might be acceptable if your goal is simplicity. What matters most is interpreting these patterns in the **context of your dataset, model, and training objectives**. Loss curves are clues, not verdicts.


### Comparing Train vs Validation

* If **training loss continues to decrease** but **validation loss increases**, it indicates **overfitting**.
* A **large gap between train and val accuracy** also suggests overfitting — the model isn’t generalizing well.
* **Small and stable gaps** between training and validation suggest the model is learning general patterns effectively.


---
**Q*2: For each of the images below (left, middle, right), make an educated prediction of whether the graph depicts an underfitted, overfitted, or well-fitted model. Give your reasoning for each prediction.**

![fit](fit.jpg)

<span style="background-color: #FFD700">**Write your answer below**</span>

Answer here:
* Left: 
* Middle: 
* Right: 
---

### Why use multiple epochs?

In machine learning, **an epoch** is **one complete pass through the entire training dataset** by the learning algorithm.

Training a model with only one pass (one epoch) is usually **not enough**. The model needs to **see the data multiple times** to learn the patterns well.

Think of it like studying:

* **One epoch** = reading the textbook once.
* **Multiple epochs** = going over it several times to reinforce learning.

### How many epochs to use?

* Too **few** epochs → Underfitting (model hasn’t learned enough)
* Too **many** epochs → Overfitting (model memorizes training data, doesn't generalize)

> Common practice: Monitor **validation accuracy/loss** and stop training when performance plateaus (often using **early stopping**).
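
The early-stopping idea can be sketched without any training code by applying it to a list of validation losses. The function name `find_stopping_epoch`, the `patience` and `min_delta` parameters, and the loss values below are all assumptions for illustration; `train_model()` in `utils.py` does not implement early stopping as shown in this notebook.

```python
def find_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once the validation loss has failed to improve by more
    than min_delta for `patience` epochs in a row.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0

    for epoch, val_loss in enumerate(val_losses, start=1):
        if val_loss < best_loss - min_delta:
            best_loss = val_loss                 # New best: remember it and reset the counter
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch         # Patience used up: stop here
    return len(val_losses), best_epoch           # Never triggered: ran all epochs


# Made-up validation losses: improve until epoch 4, then get worse
example_val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]
stop_epoch, best_epoch = find_stopping_epoch(example_val_losses, patience=3)
print(f"Stop after epoch {stop_epoch}; best validation loss was at epoch {best_epoch}")
```

In a real training loop you would typically also save the model weights at the best epoch, so that the version with the lowest validation loss can be restored after stopping.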


---
**Q*3: Train a generic model with varying epochs. Use 3 different epoch settings, call `plot_loss()` for each run, and test each model using `evaluate_model()`.**
> Hint: Try using a for loop to reduce code repetition.

In [None]:
# IMPORTANT: Call get_generic_model() each time you start training.
# This resets the model and optimizer configurations.
# If you do not reset them, the model weights and optimizer state will carry over from previous runs.
model, train_loader, val_loader, test_loader, criterion, optimizer = get_generic_model()

<span style="background-color: #FFD700">**Write your code below**</span>


In [None]:
# YOUR CODE HERE

num_epochs = [...]

...


## **Graded Exercise: (6 marks)** 

**GQ*1: Why do we plot training and validation loss over epochs? (1 mark)**

A. To measure how long training takes

B. To track how well the model learns and generalizes

C. To compare different datasets

D. To check memory usage

<span style="background-color: #FFD700">**Write your answer below**</span>

Answer here:

---

**GQ*2: If training loss decreases but validation loss increases, what is happening? (1 mark)**

A. Underfitting

B. Overfitting

C. Good generalization

D. Data leak

<span style="background-color: #FFD700">**Write your answer below**</span>

Answer here:

---

**GQ*3: If both training and validation losses are high and flat, what does that mean? (1 mark)**

A. The model is overfitting

B. The model is underfitting

C. The model is well-trained

D. The learning rate is too low


<span style="background-color: #FFD700">**Write your answer below**</span>

Answer here:

---

**GQ*4: The training loss fluctuates wildly or suddenly increases. What’s the most likely cause? (1 mark)**

A. Too few epochs

B. Learning rate too high

C. Too much regularization

D. Dropout not applied


<span style="background-color: #FFD700">**Write your answer below**</span>

Answer here:

---

**GQ*5: After 10 epochs, both training and validation losses are still decreasing. What should you do? (1 mark)**

A. Stop training 

B. Decrease the learning rate

C. Train for more epochs

D. Reduce model size

<span style="background-color: #FFD700">**Write your answer below**</span>

Answer here:

---

**GQ*6: If the training and validation losses decrease and then level off close together, this suggests: (1 mark)**

A. Underfitting

B. Overfitting

C. Good convergence

D. Poor data quality

<span style="background-color: #FFD700">**Write your answer below**</span>

Answer here:

---

## Conclusion

In this module, you explored the training process in more depth and learned how to interpret loss graphs. These skills are essential for understanding your model’s learning behavior, diagnosing issues like underfitting or overfitting, and identifying opportunities for improvement. You also learned how to evaluate model performance effectively—an important step in building reliable and accurate machine learning systems.