# **Lab 4**

TI3155TU Deep Learning (2024 - 2025)

Adapted by Elena Congeduti from TU Delft CS4240 Deep Learning course


# Instructions
**For this lab, we recommend working on Google Colab as it provides direct support for the TensorBoard library. To do this, select the 'Open in Colab' option from the notebook's homepage menu.**

Alternatively, you can work locally. In this case, you will need to set up your own virtual environment. Check the Lab Instructions in [Learning Material](https://brightspace.tudelft.nl/d2l/le/content/682797/Home?itemIdentifier=D2L.LE.Content.ContentObject.ModuleCO-3812764) on Brightspace for detailed information on the virtual environment configuration.

These labs include programming exercises and insight questions. Follow the instructions in the notebook. Fill in the text blocks to answer the questions and write your own code to solve the programming tasks within the designated part of the code blocks:

```python
#############################################################################
#                           START OF YOUR CODE                              #
#############################################################################


#############################################################################
#                            END OF YOUR CODE                               #
#############################################################################
```

Solutions will be shared the week after the lab is published. Note that these labs are designed for practice and are therefore **ungraded**.



In [None]:
# Setup
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from torch.utils.data import TensorDataset, DataLoader
from torch.utils.tensorboard import SummaryWriter
from torchvision import datasets, transforms

#Only if you run it on Kaggle
#!pip install torchsummary

import math
from torchsummary import summary
from tqdm import tqdm
import numpy as np
import matplotlib.pyplot as plt

# Additional Setup for MNIST-1D
!git clone https://github.com/greydanus/mnist1d
!mv mnist1d/mnist1d/* mnist1d/

import mnist1d
from mnist1d.data import get_templates, get_dataset_args, get_dataset
from mnist1d.utils import set_seed, plot_signals, ObjectView, from_pickle

# Additional Setup to use Tensorboard
!pip install -q tensorflow

%load_ext tensorboard

# 1 Exponentially Weighted Moving Average (EWMA)

Exponentially weighted moving average can be used to smooth out noisy data. It's a very versatile tool that has applications not only in machine learning, but also signal processing or finance.

The formula for EWMA is very simple. Note that it is recursive, meaning that the current iteration uses the outcome of the previous iteration:

$$S^t=\rho S^{t-1} + (1-\rho)y^t$$

The $\rho$ parameter determines the strength of the smoothing, or how much of the output value will be determined by the average of previous values and how much by the current value.
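To build some intuition for $\rho$, the recursion can be unrolled into an explicit weighted sum of past values. The short sketch below assumes $S^0=0$ and an arbitrary step count `t`; the variable names are illustrative and not part of the lab code.

```python
import numpy as np

rho = 0.95   # smoothing strength, same symbol as in the formula above
t = 50       # hypothetical number of observations seen so far

# Unrolling S^t = rho * S^(t-1) + (1 - rho) * y^t with S^0 = 0 gives
# S^t = sum_k w_k * y^k, where w_k = (1 - rho) * rho^(t - k) for k = 1..t.
k = np.arange(1, t + 1)
weights = (1 - rho) * rho ** (t - k)

# The newest observation gets the largest weight; older ones decay geometrically.
print("weight of y^t: ", weights[-1])
print("weight of y^1: ", weights[0])

# The weights sum to 1 - rho^t, which is noticeably below 1 for small t.
print("sum of weights:", weights.sum())
```

A larger $\rho$ makes the weights decay more slowly, so more of the past contributes to the average and the curve becomes smoother.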



****
**Question 1.1:** Where have we already seen this type of recursive formula?


<font color='green'> Write your answer here
</font>
****


EWMA is far from the only way to smooth a series of (multi-dimensional) data points, but it is very flexible and memory-efficient: it can take an arbitrary number of input values and only needs to store the previously calculated average.

EWMA forms the basis of all optimizers that you'll implement today.

In this first exercise you will implement EWMA to smooth out a series of noisy data. Run the cell below to create the noisy data. Keep the data length, wavelength and noise level at their preassigned values for now. Once you have implemented your own version of EWMA, you can come back and see what changes if you adjust these parameters.

In [None]:
## Generate some noisy data
n = 400 # Length of the data
wl = 2 # Wavelength of underlying data
noise_level = 2 # Strength of noise

x = np.linspace(0,wl*math.pi,n)
data_clean = np.sin(x+2)+ noise_level
data = data_clean + (np.random.random(n)-0.5)*noise_level

plt.scatter(x, data, s=3)
plt.plot(x, data_clean, 'orange', linewidth=3)

****
**Task 1.2:** Implement EWMA update as `s_cur` and add the bias correction to `s_cur_bc`.
****
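The bias correction is not written out in this section. Assuming it is the same correction that appears in the Adam formulas of Section 2.4, it divides the running average by $1-\rho^t$, with $t$ counted from 1:

$$\hat{S}^t=\frac{S^t}{1-\rho^t}$$

The reason is the zero initialization: for a constant signal $y^t=y$ and $S^0=0$, the recursion gives

$$S^t=(1-\rho^t)\,y$$

so the uncorrected average starts far below $y$ and the division restores it. Note that `enumerate` in the cell below starts counting at 0, so $t$ corresponds to `i + 1`.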

In [None]:
rho = 0.95 # Rho value for smoothing

s_prev = 0 # Initial EWMA value

# Empty arrays to hold the smoothed data
ewma, ewma_bias_corr = np.empty(0), np.empty(0)

for i,y in enumerate(data):

    # Variables to store smoothed data point
    s_cur = 0
    s_cur_bc = 0

    #############################################################################
    #                            START OF YOUR CODE                             #
    #############################################################################

    #############################################################################
    #                            END OF YOUR CODE                               #
    #############################################################################

    # Append new smoothed value to array
    ewma = np.append(ewma,s_cur)
    ewma_bias_corr = np.append(ewma_bias_corr,s_cur_bc)

    s_prev = s_cur

plt.scatter(x, data, s=3)
plt.plot(x, ewma, 'r--', linewidth=3)
plt.plot(x, ewma_bias_corr, 'g--', linewidth=3)
plt.plot(x, data_clean, 'orange', linewidth=3)

****
**Question 1.3:** What do the curves in this plot represent? What can we conclude about the effect of EWMA?


<font color='green'> Write your answer here </font>

****

# 2 Optimization Algorithms

Now that you have seen what an EWMA looks like in 1D we're ready to use it to build some optimizers for a 2D case and look at some nice visualizations.

You'll get the opportunity to explore three different optimizers: Momentum, RMSProp and Adam. Many other optimizers are available, but these three should give you a good intuition of what an optimizer does and a solid base for understanding the others.

We'll use a quadratic function (inspired by [this](https://xavierbourretsicotte.github.io/Intro_optimization.html) tutorial) as a toy example to show the effects of different optimizers. We'll use the derivative of that function to calculate our gradients.

Note that this is only a toy example and not a complete representation of stochastic gradient descent. Since we're using a function and its derivative, we do know the exact distribution of our data. In a real-world example we would not know it and would use batches of our training examples to approximate the gradient.
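To make the difference between an exact gradient and a batch-based approximation concrete, here is a minimal sketch on a made-up one-parameter regression problem. The data, the loss and the function `grad_mse` are assumptions chosen for illustration and are not used elsewhere in this lab.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical regression data: y = 3 * x + noise (not part of the lab).
x = rng.normal(size=1000)
y = 3.0 * x + 0.1 * rng.normal(size=1000)

def grad_mse(w, xb, yb):
    """Gradient of the loss 0.5 * mean((w * x - y)^2) with respect to w."""
    return np.mean((w * xb - yb) * xb)

w = 0.0
full_grad = grad_mse(w, x, y)   # uses every sample

# Each mini-batch gives a noisy estimate of the full gradient.
batch_size = 100
batch_grads = np.array([
    grad_mse(w, x[i:i + batch_size], y[i:i + batch_size])
    for i in range(0, len(x), batch_size)
])

print("full gradient:       ", full_grad)
print("mini-batch gradients:", np.round(batch_grads, 3))
print("mean over batches:   ", batch_grads.mean())
```

The individual batch gradients scatter around the full gradient, and because the batches here have equal size and cover the data exactly once, their mean matches it. This scatter is the "stochastic" part that the toy function in this section does not have.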

In the following exercises you will be asked to implement three optimizers and see how they compare.

In the next cells we define the quadratic function and its derivative, which we use to calculate the gradients, as well as two helper functions that help us visualize the function and the training.

Using the ```setup_figure``` function we can plot the surface and contours of the quadratic functions we defined. You can always rerun this cell to reset your plot.


In [None]:
def f(x,y):
    '''A simple quadratic function'''
    return .01*x**2 + .1*y**2

def Grad_f(x,y):
    '''Gradients of function f'''
    g1 = 2*.01*x
    g2 = 2*.1*y
    return np.array([g1,g2])



In [None]:
%matplotlib inline

def setup_figure(f):
    '''Creates a Surface and a contour plot of the function f'''
    x = np.linspace(-3, 3, 250)
    y = np.linspace(-3, 3, 250)
    X, Y = np.meshgrid(x, y)
    Z = f(X, Y)

    fig,(ax1,ax2) = plt.subplots(1,2,figsize = (16,8))

    # Surface plot
    ax1 = plt.subplot(121, projection='3d')
    ax1.plot_surface(X, Y, Z, rstride=5, cstride=5, cmap='jet', alpha=.4, edgecolor='none')
    ax1.set_title('f(x,y) = .01x^2 + .1y^2')

    ax1.view_init(65, 340)
    ax1.set_xlabel('x')
    ax1.set_ylabel('y')

    # Contour plot
    ax2 = plt.subplot(122)
    ax2.contour(X, Y, Z, 50, cmap='jet')
    ax2.set_title('Gradient Descent')

    return fig,(ax1,ax2)

def add_line(iter_x,iter_y,fig,ax1,ax2,color='r'):
    '''Adds lines to the provided figure'''

    # Angles needed for quiver plot
    anglesx = iter_x[1:] - iter_x[:-1]
    anglesy = iter_y[1:] - iter_y[:-1]

    ax1.plot(iter_x, iter_y, f(iter_x,iter_y), color=color, marker='*', alpha=.4)

    ax2.scatter(iter_x,iter_y,color = color, marker='*')
    ax2.quiver(iter_x[:-1], iter_y[:-1], anglesx, anglesy, scale_units='xy',
               angles='xy', scale=1, color=color, alpha=.3)
    return fig, (ax1, ax2)

# Get figure and axis and plot to show
fig, axs = setup_figure(f)

****
**Question 2.1:** In our context, what do $x$, $y$ and $f$ represent?

<font color='green'> Write your answer here </font>

****

Next we define another helper function for our gradient descent. Only the first three arguments of the test_optimizer function are important to you, the rest you can ignore.

This function is written in a way that it accepts a Python function as the  ```optimizer``` argument. In the following exercises you will define those functions to implement the Momentum, RMSProp and Adam optimizer.

For the first iterations keep the learning rate and other optimizer parameters at their predefined values. Once you have implemented all optimizers feel free to explore how different parameter values influence different optimizers.

In [None]:
def test_optimizer(optimizer, rhos, learning_rate=0.00125,
                   starting_point=(-2,-2), nMax=10000, epsilon=0.0001,
                   Grad = Grad_f):
    """
    Tester function for optimization algorithms. Performs gradient descent using
    the provided Gradient until error < epsilon is reached.

    Args:
        optimizer: Optimization algorithm.
        rhos: Hyperparameters used in some optimization algorithms.
        learning_rate: Optimization step size.
        starting_point: Initialization point of optimization parameters.
        nMax: Maximum number of iterations to perform.
        epsilon: Stop optimization if error < epsilon.
        Grad: Gradient of quadratic function.
    """

    # Starting points
    x, y = starting_point

    # Initialization
    i = 0
    iter_x, iter_y, iter_count = np.empty(0), np.empty(0), np.empty(0)
    error = 10
    X = np.array([x,y])

    # Cache for previous values of optimizers
    prev_vals = np.zeros_like(rhos)

    # Looping as long as error is greater than epsilon
    while np.linalg.norm(error) > epsilon and i < nMax:
        i += 1
        iter_x = np.append(iter_x, x)
        iter_y = np.append(iter_y, y)
        iter_count = np.append(iter_count, i)

        X_prev = X

        # Perform optimization step
        X, prev_vals = optimizer(X, rhos, learning_rate, prev_vals, i, Grad=Grad)

        # Calculate error
        error = X - X_prev
        x,y = X[0], X[1]

    return iter_x, iter_y


## 2.1 Stochastic Gradient Descent (SGD)

The simplest form of an optimizer is SGD. It always follows the negative gradient. Although this can lead to good results, it requires careful tuning of the learning rate to make sure the steps are large enough to reach a minimum in a reasonable time, yet small enough not to overshoot the target.
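For the toy function used in this lab, the effect of the learning rate can be worked out by hand, because each gradient step simply rescales the two coordinates. The sketch below is an illustration under that assumption; `curvature` and `step_factors` are hypothetical names, and the learning rates are arbitrary examples.

```python
import numpy as np

# For f(x, y) = .01*x**2 + .1*y**2 the gradient is (0.02*x, 0.2*y), so one
# plain gradient step multiplies x by (1 - 0.02*lr) and y by (1 - 0.2*lr).
curvature = np.array([0.02, 0.2])

def step_factors(lr):
    """Per-coordinate factor applied to (x, y) by one gradient descent step."""
    return 1 - lr * curvature

for lr in (0.1, 7.5, 11.0):   # hypothetical learning rates
    factors = step_factors(lr)
    if np.all(np.abs(factors) < 1):
        status = "converges"
    else:
        status = "diverges"
    print(f"lr={lr}: factors={factors}, {status}")
```

A negative factor means that the coordinate flips sign at every step, i.e. the iterate overshoots the minimum. Once the magnitude of a factor exceeds 1, the iterates move away from the minimum instead of towards it.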

The following cell shows how the gradient update is implemented in gradient descent.

In [None]:
def vanilla_gd(X, rhos, learning_rate, prev_value, index, Grad=Grad_f):
    """
    Vanilla gradient descent optimization step.

    Args:
        X: Current parameter values (point where the gradient is evaluated).
        rhos: Not used.
        learning_rate: Optimization step size.
        prev_value: Not used.
        index: Not used.
        Grad: Gradient of quadratic function.
    """
    gradient = Grad(*X)

    X = X - learning_rate * gradient

    return X, 0


****
**Question 2.2:** Try to change the learning rate. Do you get different SGD trajectories for larger or smaller values?
****

In [None]:
# Optimization settings
lr = 0.1
rho = None # rho is not used in gradient descent

# Run optimization
x_gd, y_gd = test_optimizer(vanilla_gd, rho, lr)

# Reset the image
plt.ioff()
fig, axs = setup_figure(f)

# Plot optimization trajectory
add_line(x_gd, y_gd, fig, *axs, color='r')

# Show SGD trajectory
fig

## 2.2 Momentum

Momentum is simply EWMA applied to gradient descent. It smooths the gradient updates by incorporating earlier gradient steps. This is especially handy when SGD overshoots the minimum because the learning rate is too high.
When overshooting, the gradient points in (nearly) opposite directions after each update; averaging over multiple gradients cancels this out.

The formula for a gradient update with momentum, where $\epsilon$ is the learning rate and $\nabla_{\theta}$ is the gradient with respect to the parameters $\theta$, is:

$$v_i=\rho v_{i-1}+(1-\rho)\nabla_{\theta}$$

$$\theta^{\prime}=\theta-\epsilon v_i$$
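To see why the averaging cancels an oscillation, it helps to look at the unrolled form of $v_i$ for two made-up gradient histories. The sketch below assumes $v_0=0$; the sequences `steady` and `flipping` are hypothetical and are not produced by the lab code.

```python
import numpy as np

rho = 0.9   # momentum coefficient, as in the formula above
t = 200     # hypothetical number of past gradient steps

# Unrolled form of v_t with v_0 = 0: a weighted sum of past gradients,
# with weight (1 - rho) * rho^(t - k) on the gradient from step k.
k = np.arange(1, t + 1)
weights = (1 - rho) * rho ** (t - k)

# Two hypothetical 1D gradient histories:
steady = np.ones(t)        # always points in the same direction
flipping = (-1.0) ** k     # changes sign at every step (overshooting)

v_steady = weights @ steady
v_flipping = weights @ flipping

print("averaged steady gradient:  ", v_steady)
print("averaged flipping gradient:", v_flipping)
```

The steady gradient passes through almost unchanged, while the sign-flipping one shrinks to a magnitude of about $(1-\rho)/(1+\rho)$, so momentum keeps moving along consistent directions and damps the back-and-forth ones.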




****
**Task 2.3:** Complete the Momentum gradient update with the update step for  `v` and `X`.
****

In [None]:
def momentum(X, rho, learning_rate, prev_value, index, Grad=Grad_f):
    """
    Gradient descent with momentum optimization step.

    Args:
        X: Current parameter values (point where the gradient is evaluated).
        rho: Optimization hyperparameter - see formula above.
        learning_rate: Optimization step size.
        prev_value: Momentum parameter from previous iteration.
        index: Not used.
        Grad: Gradient of quadratic function.
    """
    gradient = Grad(*X) # Gradient of current values
    v = 0               # Momentum parameter
    v_prev = prev_value # Momentum parameter from previous iteration

    #############################################################################
    #                           START OF YOUR CODE                              #
    #############################################################################

    #############################################################################
    #                            END OF YOUR CODE                               #
    #############################################################################

    return X, v

# Optimization settings
lr = 0.1
rho = 0.9

# Run optimization
res_mom = test_optimizer(momentum, rho, lr)

# Plot optimization trajectory
add_line(*res_mom, fig, *axs, color='b')

# Show figure
fig

## 2.3 RMSProp

RMSProp also keeps an EWMA over previous gradients. However, unlike momentum, it does not use this average as the update direction but to scale the learning rate of each update. By doing this it is able to take larger steps towards the beginning of the learning process and smaller steps towards the end.

The formula for a gradient update with RMSProp is given below, where $\nabla^2_{\theta}$ denotes the element-wise square of the gradient (not a second derivative) and $\delta$ is a small constant that prevents division by zero:

$$r_i=\rho r_{i-1}+(1-\rho)\nabla^2_{\theta}$$

$$\theta^{\prime}=\theta-\epsilon \frac{\nabla_\theta}{\sqrt{r_i+\delta}}$$
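As a small numeric illustration of this scaling, the sketch below evaluates only the very first update, assuming $r_0=0$. The helper `first_rmsprop_step` is a hypothetical name, its default values mirror the settings used in this lab, and the example gradients are arbitrary.

```python
import numpy as np

def first_rmsprop_step(gradient, learning_rate=0.1, rho=0.999, delta=1e-5):
    """Size of the very first RMSProp update, assuming r_0 = 0."""
    r = (1 - rho) * gradient ** 2   # r_1 from the formula above
    return learning_rate * gradient / np.sqrt(r + delta)

# Two hypothetical gradients that differ by a factor of 100.
for g in (10.0, 1000.0):
    print(f"gradient={g}: step={first_rmsprop_step(g)}")
```

Plain gradient descent with the same learning rate would move by 1 and 100 for these two gradients. The RMSProp step is close to `learning_rate / sqrt(1 - rho)` in both cases (about 3.16 with these settings), because the gradient magnitude cancels out whenever $\delta$ is negligible.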


****
**Task 2.4:** Complete the RMSProp gradient update with the update step for `r` and `X`.
****

In [None]:
def RMSprop(X, rho, learning_rate, prev_value, index, Grad=Grad_f):
    """
    RMSprop optimization step.

    Args:
        X: Current parameter values (point where the gradient is evaluated).
        rho: Optimization hyperparameter - see formula above.
        learning_rate: Optimization step size.
        prev_value: RMSProp parameter from previous iteration.
        index: Not used.
        Grad: Gradient of quadratic function.
    """
    delta = 1e-5        # Tiny amount to prevent division by 0
    gradient = Grad(*X) # Gradient of current values
    r = 0               # RMSProp parameter
    r_prev = prev_value # RMSProp parameter from previous iteration

    #############################################################################
    #                           START OF YOUR CODE                              #
    #############################################################################

    #############################################################################
    #                            END OF YOUR CODE                               #
    #############################################################################

    return X, r

# Optimization settings
lr = 0.1
rho = 0.999

# Run optimization
res_rmsprp = test_optimizer(RMSprop,rho,lr)

# Plot optimization trajectory
add_line(*res_rmsprp,fig,*axs,color='y')

# Show figure
fig

## 2.4 Adaptive Moment Estimation (Adam)

The idea behind the last optimizer, Adam, is simple. It combines both Momentum and RMSProp in a single optimizer.

The formula for a gradient update with Adam is:

$$v_i=\rho_1 v_{i-1}+(1-\rho_1)\nabla_{\theta}$$

$$\hat{v_i}=\frac{v_i}{1-\rho^i_1}$$

$$r_i=\rho_2 r_{i-1}+(1-\rho_2)\nabla^2_{\theta}$$

$$\hat{r_i}=\frac{r_i}{1-\rho^i_2}$$

$$\theta^{\prime}=\theta-\epsilon \frac{\hat{v_i}}{\sqrt{\hat{r_i}+\delta}}$$
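A short worked example may help to see what the bias correction does. Assuming $v_0=r_0=0$ and $\delta$ negligible compared to $\nabla^2_{\theta}$ (both assumptions made for illustration), the first step ($i=1$) gives:

$$v_1=(1-\rho_1)\nabla_{\theta},\qquad \hat{v_1}=\frac{v_1}{1-\rho_1}=\nabla_{\theta}$$

$$r_1=(1-\rho_2)\nabla^2_{\theta},\qquad \hat{r_1}=\frac{r_1}{1-\rho_2}=\nabla^2_{\theta}$$

$$\theta^{\prime}=\theta-\epsilon\frac{\nabla_{\theta}}{\sqrt{\nabla^2_{\theta}+\delta}}\approx\theta-\epsilon\,\operatorname{sign}(\nabla_{\theta})$$

So the first update has a size of roughly $\epsilon$ per parameter. Without the correction, the step would instead be scaled by $\frac{1-\rho_1}{\sqrt{1-\rho_2}}$, which is about $3.16$ for $\rho_1=0.9$ and $\rho_2=0.999$.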


****
**Task 2.5:** Complete the Adam gradient update with the update step for `v`, `v_bc`, `r`, `r_bc` and `X`.
****

In [None]:
def adam(X, rhos, learning_rate, prev_values, index, Grad=Grad_f):
    """
    Adam optimization step.

    Args:
        X: Current parameter values (point where the gradient is evaluated).
        rhos: Optimization hyperparameters (rho_1, rho_2) - see formula above.
        learning_rate: Optimization step size.
        prev_values: Adam parameters (v, r) from previous iteration.
        index: Optimization step counter.
        Grad: Gradient of quadratic function.
    """

    delta = 1e-5                 # Tiny amount to prevent division by zero
    gradient = Grad(*X)          # Gradient of current values
    rho_v, rho_r = rhos          # Rho values for momentum & rmsProp part of Adam
    v_prev, r_prev = prev_values # Adam parameters from previous iterations

    v = r = 0                    # Adam parameters for momentum & rmsProp
    v_bc = r_bc = 0              # Bias corrected adam parameters

    #############################################################################
    #                           START OF YOUR CODE                              #
    #############################################################################

    #############################################################################
    #                            END OF YOUR CODE                               #
    #############################################################################

    return X,(v,r)

# Optimization settings
lr = 0.1
rhos = (0.9, 0.999)

# Run optimization
res_adam = test_optimizer(adam,rhos,lr)

# Plot optimization trajectory
add_line(*res_adam, fig, *axs, color='g')

# Show figure
fig

Now that you have implemented gradient descent with momentum, RMSprop and Adam you can start to play around with the optimizer hyperparameters.


****
**Question 2.6:** What effect does the learning rate have? What is the effect of the $\rho$ parameter?
****

# 3 Comparing optimizers in PyTorch

It is time to test the momentum, RMSProp and Adam optimizers. We will investigate how the optimizer choice influences the convergence speed and final performance on the [MNIST-1D](https://github.com/greydanus/mnist1d) classification task. In this variation of the MNIST dataset, the digits are represented by 1D 40-dimensional tensors.

You will use the training loop and model definition from the previous labs - let us start by setting them up.

In [None]:
###############################
### Set up MNIST-1D dataset ###
###############################
#Set random seed
torch.manual_seed(42)

# Set the batch size for training & testing
b_size = 100

# Load data
data = get_dataset(get_dataset_args(), path='./mnist1d_data.pkl',
                   download=False, regenerate=True)

# Convert 1D MNIST data to pytorch tensors
tensors_train = torch.Tensor(data['x']), torch.Tensor(data['y']).long()
tensors_test = torch.Tensor(data['x_test']), torch.Tensor(data['y_test']).long()

# Create dataloaders from the training and test set for easier iteration over the data
train_loader = DataLoader(TensorDataset(*tensors_train), batch_size=b_size)
test_loader = DataLoader(TensorDataset(*tensors_test), batch_size=b_size)

##########################################
### Define training and test functions ###
##########################################

def train(train_loader, net, optimizer, criterion):
    """
    Trains network for one epoch in batches.

    Args:
        train_loader: Data loader for training set.
        net: Neural network model.
        optimizer: Optimizer (e.g. SGD).
        criterion: Loss function (e.g. cross-entropy loss).
    """

    avg_loss = 0
    correct = 0
    total = 0

    # iterate through batches
    for i, data in enumerate(train_loader):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # keep track of loss and accuracy
        avg_loss += loss.item()  # store a plain float, not a tensor tied to the graph
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    return avg_loss/len(train_loader), 100 * correct / total

def test(test_loader, net, criterion):
    """
    Evaluates network in batches.

    Args:
        test_loader: Data loader for test set.
        net: Neural network model.
        criterion: Loss function (e.g. cross-entropy loss).
    """

    avg_loss = 0
    correct = 0
    total = 0

    # Use torch.no_grad to skip gradient calculation, not needed for evaluation
    with torch.no_grad():
        # iterate through batches
        for data in test_loader:
            # get the inputs; data is a list of [inputs, labels]
            inputs, labels = data

            # forward pass
            outputs = net(inputs)
            loss = criterion(outputs, labels)

            # keep track of loss and accuracy
            avg_loss += loss.item()  # store a plain float instead of a tensor
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    return avg_loss/len(test_loader), 100 * correct / total

####################
### Define model ###
####################

class FCNet(nn.Module):
    """
    Simple fully connected neural network with residual connections in PyTorch.
    Layers are defined in __init__ and forward pass implemented in forward.
    """

    def __init__(self):
        super(FCNet, self).__init__()

        self.fc1 = nn.Linear(40, 200)
        self.fc2 = nn.Linear(200, 200)
        self.fc3 = nn.Linear(200, 200)
        self.fc4 = nn.Linear(200, 200)
        self.fc5 = nn.Linear(200, 200)
        self.fc6 = nn.Linear(200, 10)

    def forward(self, x):
        h = F.relu(self.fc1(x))
        h = h + F.relu(self.fc2(h))
        h = h + F.relu(self.fc3(h))
        h = h + F.relu(self.fc4(h))
        h = h + F.relu(self.fc5(h))
        return self.fc6(h)

# Print network architecture using torchsummary
summary(FCNet(), (40,), device='cpu')

For this part, please import your notebook in Google Colab or download it to work locally.

We will use [TensorBoard](https://pytorch.org/tutorials/intermediate/tensorboard_tutorial.html) to visualize our results. TensorBoard is a visualization toolkit that provides a graphical interface for monitoring and analyzing deep learning models. The key object for storing data is the `SummaryWriter`. While training your model, you use methods of the `SummaryWriter` to log the data that you want to visualize in TensorBoard, for instance `add_scalar` for scalar values (e.g. loss and accuracy). Let us have a look at a concrete example.



In [None]:
#Set random seed
torch.manual_seed(42)

# Create a writer to write to Tensorboard
writer = SummaryWriter()

# Create instance of Network
net = FCNet()

# Create loss function and optimizer
criterion = nn.CrossEntropyLoss()

optimizer = optim.SGD(net.parameters(), lr=5e-2)

# Set the number of epochs for training
epochs = 100

for epoch in tqdm(range(epochs)):  # loop over the dataset multiple times
    # Train on data
    train_loss, train_acc = train(train_loader, net, optimizer, criterion)

    # Test on data
    test_loss, test_acc = test(test_loader, net, criterion)

    # Log metrics to Tensorboard
    writer.add_scalars("Loss", {'Train': train_loss, 'Test':test_loss}, epoch)
    writer.add_scalars('Accuracy', {'Train': train_acc,'Test':test_acc} , epoch)

print('Finished Training')

#Close the writer
writer.flush()
writer.close()

In [None]:
# Open Tensorboard and results from the log directory runs/
%tensorboard --logdir runs/

# For local users only: uncomment the last line, run this cell once and wait for
# it to time out, run this cell a second time and you should see the board.
# %tensorboard --logdir runs/ --host localhost

****
**Question 3.1:** Can you explain the training curves and the accuracy progressions?


<font color='green'> Write your answer here </font>
****

All popular optimization algorithms are readily available in the `torch.optim` package of PyTorch. Have a look at the documentation [here](https://pytorch.org/docs/stable/optim.html) and find out how to use momentum, RMSProp and Adam.


****
**Task 3.2:** Run the training loop below using the `torch.optim` classes for the different optimizers: SGD with momentum, RMSProp and Adam. Then also try different settings.
****

In [None]:
# Create a writer to write to Tensorboard
writer = SummaryWriter()

# Create instance of Network
net = FCNet()

# Create loss function and optimizer
criterion = nn.CrossEntropyLoss()

#############################################################################
#                          START OF YOUR CODE                               #
#############################################################################

#############################################################################
#                            END OF YOUR CODE                               #
#############################################################################

# Set the number of epochs for training
epochs = 100

for epoch in tqdm(range(epochs)):  # loop over the dataset multiple times
    # Train on data
    train_loss, train_acc = train(train_loader, net, optimizer, criterion)

    # Test on data
    test_loss, test_acc = test(test_loader, net, criterion)

    # Write metrics to Tensorboard
    writer.add_scalars("Loss", {'Train': train_loss, 'Test':test_loss}, epoch)
    writer.add_scalars('Accuracy', {'Train': train_acc,'Test':test_acc} , epoch)

print('Finished Training')

writer.flush()
writer.close()

Now we again use TensorBoard to visualize our results. Go back to the TensorBoard cell above and hit refresh in the top right corner of TensorBoard to display the results of all subsequent runs.

****
**Question 3.3:** Which optimizer converges faster? Did they all reach the same final accuracy? Did you have to use different learning rates for different optimizers?
****

**That's all for this lab, see you in the next one!**

**Feedback Form:** please fill in the following form to provide feedback https://forms.office.com/e/KH7vWKMkQ8