<a href="https://colab.research.google.com/github/BeritKO/codelabs/blob/main/Bootcamp_2026_Prac_2_PyTorch_and_MLP_training_(JK).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **MUST 2026 Bootcamp: Introduction to PyTorch and network training**
Presented by Johan Kruger (thanks to Ruan van der Spoel for the 2025 example)

**What we'll cover**

* A basic introduction to PyTorch
* How to create and train a neural network
* How to tweak some hyperparameters
* Using the MNIST dataset for classification
* Some fun challenges

**What is PyTorch?**

PyTorch is a deep learning framework that makes it easy to build, train, and experiment with neural networks. It’s flexible, fast, and works great with GPUs.
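
As a first taste, here is a minimal sketch of the two ideas everything else builds on: tensors (multi-dimensional arrays) and autograd (automatic gradient computation). The variable names are purely illustrative and are not used later in the practical.

```python
import torch

# A 2x3 tensor of floats; tensors behave much like NumPy arrays
x = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])
print(x.shape)        # torch.Size([2, 3])
print((x * 2).sum())  # tensor(42.)

# Autograd: track operations on w so that gradients can be computed
w = torch.tensor(3.0, requires_grad=True)
y = w ** 2            # y = w^2
y.backward()          # computes dy/dw = 2w
print(w.grad)         # tensor(6.)
```

The same mechanism is what lets PyTorch compute the gradients of a loss with respect to every weight in a network.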

**Setting up the environment**
* Go to your copy of the notebook.
* Select Runtime->Change Runtime Type.
* Select 'T4 GPU' as a hardware accelerator if it is available.
* If a GPU option isn't available, it just means your training will be a little slower.
* Now we are set up! Let's get into it.

**Imports**

Now we need to import all the packages that we'll be using.

We'll be importing the following:
* "torch" the main PyTorch library.
* "torch.nn" the preconfigured PyTorch neural network building blocks that we need.
* "torchvision" which consists of popular datasets, model architectures, and common image transformations for computer vision.
* "torchvision.transforms" the image transforms referred to above.
* "matplotlib.pyplot" a Python package for MATLAB-style plotting.
* "numpy" a library for fast numerical computation, with support for multi-dimensional arrays and mathematical functions.
* "copy" a standard-library module for making shallow and deep copies of objects.




In [None]:
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import numpy as np
import copy

# Dataset

We will be using the MNIST dataset for this practical. MNIST consists of 70,000 grayscale images of handwritten digits (0-9), each 28x28 pixels in size, along with their corresponding labels.

Let's import the MNIST dataset from the torchvision library. We will then load a train, validation, and test set in PyTorch, in three steps:
1. Grab the actual train and test dataset from the torchvision library. This downloads the dataset if it is not already stored locally, and sets up a transform that converts each image to a Torch tensor.
2. Split the train set into a train and validation set.
3. Use a DataLoader to load these data sets for training/validation/testing.

We split the training set into a train and validation set to evaluate the model's performance on unseen data during training.

In [None]:
# The size of the mini-batches that the data will be split up into
batch_size = 128
# The size of the validation set, which comes from the training set samples
validation_set_size = 10000

# MNIST dataset
train_dataset = torchvision.datasets.MNIST(root = '../data/', # Where to store the dataset locally, or find it if it has already been downloaded
                                          train = True, # This means we are grabbing the train set of MNIST
                                          transform = transforms.ToTensor(), # Convert each image to a Torch tensor with pixel values scaled to [0, 1]
                                          download = True) # Download the dataset if it is not already present in root

test_dataset = torchvision.datasets.MNIST(root = '../data/',
                                         train = False, # This means we are grabbing the test set of MNIST
                                         transform = transforms.ToTensor())

# Create a validation dataset from the training set
validation_dataset = copy.deepcopy(train_dataset)

# Split the training set
validation_dataset.data = train_dataset.data[0:validation_set_size] # Split the data (the images)
train_dataset.data = train_dataset.data[validation_set_size:]
validation_dataset.targets = train_dataset.targets[0:validation_set_size] # Split the targets (the labels)
train_dataset.targets = train_dataset.targets[validation_set_size:]

# Data loaders
train_loader = torch.utils.data.DataLoader(dataset = train_dataset, # Specify the data set for the data loader
                                           batch_size = batch_size, # Set the batch size
                                           shuffle = True) # Shuffle the training examples

val_loader = torch.utils.data.DataLoader(dataset = validation_dataset,
                                           batch_size = batch_size,
                                           shuffle = False) #No need to shuffle the validation set, we only use it for validation!

test_loader = torch.utils.data.DataLoader(dataset = test_dataset,
                                          batch_size = batch_size,
                                          shuffle = False) #No need to shuffle the test set, we only use it for evaluation!



**MNIST**

MNIST has 70k samples in total. The train set has 60k, and the test set has 10k. We have split the train set into 50k samples for training, and 10k for validation.

Let's check that our sizes are correct:

In [None]:
print('Training:\t', train_dataset.data.shape[0])
print('Validation:\t', validation_dataset.data.shape[0])
print('Evaluation:\t', test_dataset.data.shape[0])
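
To make the split-then-load pipeline concrete without downloading anything, here is a sketch on random stand-in data. Note that it uses torch.utils.data.random_split, which picks the validation samples at random, whereas the cell above simply takes the first 10,000 training samples. The dataset, sizes, and variable names here are assumptions for illustration only.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Hypothetical stand-in for MNIST: 600 random "images" of shape 1x28x28 with labels 0-9
fake_images = torch.rand(600, 1, 28, 28)
fake_labels = torch.randint(0, 10, (600,))
fake_dataset = TensorDataset(fake_images, fake_labels)

# Randomly split into 500 training and 100 validation samples (seeded for repeatability)
generator = torch.Generator().manual_seed(0)
train_part, val_part = random_split(fake_dataset, [500, 100], generator=generator)
print(len(train_part), len(val_part))  # 500 100

# Wrap the training part in a DataLoader and inspect the batches it yields
loader = DataLoader(train_part, batch_size=128, shuffle=True)
batch_sizes = []
for images, labels in loader:
    batch_sizes.append(images.shape[0])  # images: [batch, 1, 28, 28], labels: [batch]
print(batch_sizes)  # [128, 128, 128, 116]
```

Notice that the last batch is smaller than the rest whenever the dataset size is not a multiple of the batch size.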


**What does the data look like?**

To get an idea of what the MNIST data we are working with looks like, we can plot a few samples from the training set.

For the title, we can use the target (label) associated with the sample.

In [None]:
n = 6
for i in range(1, n + 1):
    plt.subplot(2,3,i)
    plt.axis('off')
    plt.imshow(train_dataset.data[i].reshape((28, 28)), cmap="gray")
    plt.title('Target: ' + str(int(train_dataset.targets[i])));
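
The plotting cell above reads the raw images through the dataset's data attribute, which bypasses the transform. The sketch below shows, on a random stand-in image, roughly what the ToTensor transform does to such an image before it reaches the model. It reproduces the effect by hand rather than calling torchvision, so treat it as an approximation of the transform and not its implementation.

```python
import torch

# Hypothetical stand-in for one raw MNIST image: 28x28 unsigned 8-bit pixels (0-255)
raw_image = torch.randint(0, 256, (28, 28), dtype=torch.uint8)

# Roughly what transforms.ToTensor() produces for such an image:
# a float tensor scaled to [0, 1] with a leading channel dimension
scaled_image = raw_image.float().div(255).unsqueeze(0)

print(raw_image.dtype, raw_image.shape)        # torch.uint8 torch.Size([28, 28])
print(scaled_image.dtype, scaled_image.shape)  # torch.float32 torch.Size([1, 28, 28])
```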

# Setting up an architecture
* With our datasets prepped and ready to go, we can now build an actual neural network.
* In PyTorch, it's best to build the network inside a Python class; we then create an object of this class, which serves as our model.
* We'll set up a simple model which takes the 28x28 MNIST images as input, passes them through three hidden layers, and outputs 10 raw class scores (logits), one per digit. The softmax that turns these scores into a probability distribution is applied later, inside the loss function.
* We'll use layers with a width of 200
* As activation function, we'll use the Rectified Linear Unit (ReLU)

In [None]:
num_classes = 10 # The number of MNIST classes (the number of possible outputs)
input_dim = 784 # 28x28 = 784 input features
layer_width = 200 # The number of nodes/neurons per hidden layer

class Model(nn.Module):
  def __init__(self):
    super(Model, self).__init__()
    # The Linear layer, also called a fully connected or dense layer, performs the linear transformation y=xW+b (the bias b is left out when bias=False)
    self.hidden_layer_1 = nn.Sequential(nn.Linear(input_dim, layer_width, bias=True), nn.ReLU()) # 784 in, 200 out, each output is then ReLU'd.
    self.hidden_layer_2 = nn.Sequential(nn.Linear(layer_width, layer_width, bias=False), nn.ReLU()) # 200 in, 200 out, each output is then ReLU'd.
    self.hidden_layer_3 = nn.Sequential(nn.Linear(layer_width, layer_width, bias=False), nn.ReLU()) # 200 in, 200 out, each output is then ReLU'd.
    self.output_layer = nn.Linear(layer_width, num_classes, bias=False) # 200 in, 10 out, no activation function applied.

  def forward(self, x, **kwargs):
    x = x.reshape(x.size(0), -1) # Flattens each 1x28x28 image in the batch into a 784-length vector
    out = self.hidden_layer_1(x) # Pass inputs to the first hidden layer
    out = self.hidden_layer_2(out) # Pass to hidden layer 2
    out = self.hidden_layer_3(out) # Pass to hidden layer 3
    out = self.output_layer(out) # Pass to output layer
    return out
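
A quick way to sanity-check an architecture like this is to push a dummy batch through it and count its parameters. The sketch below rebuilds the same layer stack with nn.Sequential so that it runs on its own; the name sketch_model and the dummy batch are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Equivalent stack written with nn.Sequential, for a quick standalone check
sketch_model = nn.Sequential(
    nn.Flatten(),                       # [batch, 1, 28, 28] -> [batch, 784]
    nn.Linear(784, 200, bias=True), nn.ReLU(),
    nn.Linear(200, 200, bias=False), nn.ReLU(),
    nn.Linear(200, 200, bias=False), nn.ReLU(),
    nn.Linear(200, 10, bias=False),     # raw class scores (logits)
)

dummy_batch = torch.rand(5, 1, 28, 28)  # five random "images"
logits = sketch_model(dummy_batch)
print(logits.shape)                     # torch.Size([5, 10])

# Count the trainable parameters: 784*200 + 200 + 200*200 + 200*200 + 200*10
num_params = sum(p.numel() for p in sketch_model.parameters() if p.requires_grad)
print(num_params)                       # 239000
```

Most of the parameters sit in the first layer, because it connects all 784 input pixels to 200 neurons.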

**Create object of Model**

Now we can create an object of the model, and also assign it to a specific device (CPU or GPU)

In [None]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu') # If the GPU is available, select it. Otherwise, select CPU
print('You are using: ' + str(device))

In [None]:
neuralNet = Model().to(device) # Create an object of the model, and move it to the appropriate device

**Hyperparameters**

We can now specify the hyperparameters our network will be using:

In [None]:
max_epochs = 10 # The maximum number of epochs to train the network for
learning_rate = 0.01 # The step-size multiplier for each optimisation step
model_path_torch = './model.pth' # Where to save the model
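
To see what "step-size multiplier" means, here is a worked single update on a one-parameter problem. It uses plain SGD because its update rule is easy to verify by hand; the practical itself uses Adam, which rescales the step using running statistics of the gradient. The toy loss and variable names are assumptions for illustration.

```python
import torch

step_size = 0.01
w = torch.tensor(1.0, requires_grad=True)
sgd = torch.optim.SGD([w], lr=step_size)

loss = w ** 2
loss.backward()   # dloss/dw = 2 * w = 2.0
sgd.step()        # w <- w - step_size * grad = 1.0 - 0.01 * 2.0
print(w.item())   # approximately 0.98
```

A larger learning rate takes bigger steps and can overshoot the minimum; a smaller one is more stable but needs more epochs.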



**Select optimiser and loss function**

* With the model created, we can now select an optimiser and the loss function we wish to use.
* We'll use cross-entropy loss and the Adam optimiser to start with.
* NOTE: We don't apply softmax to the last layer of the model, because PyTorch's nn.CrossEntropyLoss already applies a log-softmax to its inputs internally.

In [None]:
criterion = nn.CrossEntropyLoss() # Select cross-entropy loss as cost function
optimiser = torch.optim.Adam(neuralNet.parameters(), lr=learning_rate) # Select Adam as optimiser, and pass it the model parameters and the learning rate
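
The note about softmax can be checked directly. The sketch below compares nn.CrossEntropyLoss on raw logits with the explicit two-step computation (log-softmax followed by negative log-likelihood). The logits and labels are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)            # raw model outputs for 4 samples, 10 classes
targets = torch.tensor([3, 0, 9, 1])   # made-up class labels

# Cross-entropy applied directly to the logits
ce_loss = nn.CrossEntropyLoss()(logits, targets)

# The same value computed in two explicit steps: log-softmax, then negative log-likelihood
manual_loss = F.nll_loss(F.log_softmax(logits, dim=1), targets)

print(torch.allclose(ce_loss, manual_loss))  # True

# If probabilities are needed for inspection, apply softmax separately
probabilities = F.softmax(logits, dim=1)
print(probabilities.sum(dim=1))              # each row sums to 1
```

Adding a softmax layer to the model as well would apply the normalisation twice, which distorts the loss and usually slows learning.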

# Train and test the model
* First we'll write a function to evaluate our model, meaning that we will find out what its accuracy and loss are for a specific dataset.
* Then we'll write a training loop that trains the model for a set number of epochs. After each epoch, the model is evaluated on the train and validation set.
* Finally, we'll test the model on the test set, once training has been completed.

**Evaluation function**

In [None]:
def evaluatePerformance(data_set_loader, model, loss_func):
  per_batch_loss = [] # Empty list to store the loss of each batch
  model.eval() # Put the model in evaluation mode. Dropout is switched off and batchnorm layers use their running statistics instead of batch statistics
  with torch.no_grad(): # Disable gradient tracking for everything inside this block. This saves memory and speeds up the computation
    correct = 0
    total = 0
    for images, labels in data_set_loader: # Iterate through the data loader
      # Move each image and target label to the appropriate device (if using GPU, this would move the batch of images to the VRAM)
      images = images.to(device)
      labels = labels.to(device)

      # Pass the images through the model
      outputs = model(images)
      # Take the class with the highest score as the prediction for each sample
      _, predicted = torch.max(outputs, 1)
      # Keep track of the total number of samples processed. This is necessary for computing accuracy.
      total += labels.size(0)
      # See how many of the predictions match the actual label
      correct += (predicted == labels).sum().item()
      # Calculate the loss between the predictions and true values
      loss = loss_func(outputs, labels)
      # Add this batch's loss, scaled by the batch size, to the list
      per_batch_loss.append(float(loss) * labels.size(0))
      # Instead of storing just the mean loss per batch, we multiply the loss by the batch size.
      # This ensures that when we compute the epoch loss, we properly weight each batch's contribution.
      # Otherwise, batches of different sizes (which can happen if the last batch is smaller) would not be accounted for correctly.

  # Get the average loss for the epoch
  epoch_loss = sum(per_batch_loss) / total

  # Calculate the accuracy
  accuracy = round(100 * (correct/total), 4)

  # Return the results
  return accuracy, epoch_loss
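
The reason for weighting each batch loss by its batch size is easiest to see with numbers. The sketch below uses made-up per-batch losses where the last batch is much smaller than the others.

```python
# Hypothetical per-batch mean losses and batch sizes (the last batch is smaller)
batch_mean_losses = [0.50, 0.50, 2.00]
batch_sizes = [128, 128, 16]

# Naive: average the batch means, which over-weights the small last batch
naive_loss = sum(batch_mean_losses) / len(batch_mean_losses)

# Weighted: scale each batch mean by its size, then divide by the total sample count
total = sum(batch_sizes)
weighted_loss = sum(l * n for l, n in zip(batch_mean_losses, batch_sizes)) / total

print(naive_loss)     # 1.0
print(weighted_loss)  # about 0.588
```

The weighted value is the true per-sample average, which is what the evaluation function returns.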

**Training loop**


In [None]:
def trainModel(model, num_epochs):
  best_valid_accuracy = 0
  best_epoch = 0
  best_epoch_train_accuracy = 0
  total_step = len(train_loader)

  # Loop through all epochs.
  for epoch in range(num_epochs):
    # Put the model in training mode (dropout is active and batchnorm layers use batch statistics)
    model.train()
    print(f'Optimising epoch {epoch+1}...')
    # Iterate over the batches
    for i, (images, labels) in enumerate(train_loader):
      # Move each image and target label to the appropriate device (if using GPU, this would move the batch of images to the VRAM)
      images = images.to(device)
      labels = labels.to(device)

      # Reset the gradient of each parameter to zero
      optimiser.zero_grad()

      # Run the forward pass

      # Pass the samples through the model
      outputs = model(images)
      # Calculate the loss
      loss = criterion(outputs, labels)

      # Backward pass and optimise

      # Backpropagate, calculate the gradients
      loss.backward()
      # Update parameters using gradients with optimiser
      optimiser.step()

    # Test the model on the training set
    train_accuracy, train_loss = evaluatePerformance(train_loader, model, criterion)
    print(f'Train Accuracy: {train_accuracy}%')
    print(f'Train Loss: {train_loss}')
    # Test the model on the validation set
    validation_accuracy, validation_loss = evaluatePerformance(val_loader, model, criterion)
    print(f'Validation Accuracy: {validation_accuracy}%')
    print(f'Validation Loss: {validation_loss}')
    print('\n')

    # Save the model at this epoch if it has performed better
    if (validation_accuracy>best_valid_accuracy):
      best_valid_accuracy = validation_accuracy
      best_epoch = epoch+1
      best_epoch_train_accuracy = train_accuracy
      # Save the model state to the specified path
      torch.save(model.state_dict(), model_path_torch)
  # After training, test the best model on the test set

  # Load best model
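  model.load_state_dict(torch.load(model_path_torch, map_location=device, weights_only=True)) # A state dict holds only tensors, so the safer weights_only=True is sufficient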
  # Test the model on the test set
  test_accuracy, test_loss = evaluatePerformance(test_loader, model, criterion)

  print(f'Best model at epoch {best_epoch}')
  print(f'Train Accuracy: {best_epoch_train_accuracy}%')
  print(f'Validation Accuracy: {best_valid_accuracy}%')
  print(f'Evaluation Accuracy: {test_accuracy}%')
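
Stripped of the bookkeeping, the core of the training loop is five steps repeated for every batch. The sketch below runs those steps on a tiny made-up classification problem so that it finishes in a fraction of a second on a CPU; the toy data, model, and names are assumptions and not part of the practical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy problem: 200 points with 4 features; the label is 1 when the features sum to > 0
features = torch.randn(200, 4)
labels = (features.sum(dim=1) > 0).long()

toy_model = nn.Linear(4, 2)
toy_criterion = nn.CrossEntropyLoss()
toy_optimiser = torch.optim.Adam(toy_model.parameters(), lr=0.05)

losses = []
toy_model.train()
for step in range(100):
    toy_optimiser.zero_grad()                  # 1. clear old gradients
    outputs = toy_model(features)              # 2. forward pass
    loss = toy_criterion(outputs, labels)      # 3. compute the loss
    loss.backward()                            # 4. backpropagate
    toy_optimiser.step()                       # 5. update the parameters
    losses.append(loss.item())

print(losses[0], losses[-1])  # the loss should be clearly lower at the end
```

Forgetting step 1 is a common mistake: gradients accumulate across calls to backward() unless they are reset.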

**Train the model**

In [None]:
trainModel(neuralNet, max_epochs)
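
Once training has finished, the saved checkpoint can be loaded into a fresh model of the same architecture, for example in a later session. The sketch below shows the save and load round trip on a small stand-in model and a temporary file; the path and model are hypothetical.

```python
import os
import tempfile

import torch
import torch.nn as nn

trained = nn.Linear(4, 2)   # stand-in for a trained model
fresh = nn.Linear(4, 2)     # same architecture, different random weights

with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint_path = os.path.join(tmp_dir, 'model.pth')  # hypothetical path
    torch.save(trained.state_dict(), checkpoint_path)
    # weights_only=True restricts loading to tensors and plain containers
    state = torch.load(checkpoint_path, map_location='cpu', weights_only=True)
    fresh.load_state_dict(state)

fresh.eval()  # switch to evaluation mode before running inference
x = torch.rand(3, 4)
with torch.no_grad():
    same = torch.allclose(trained(x), fresh(x))
print(same)  # True
```

The state dict stores only the parameters, so the class that defines the architecture must be available when loading.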

# Challenges
1. Replace the ReLU activation function in the hidden layers with a different activation function e.g., Sigmoid or Tanh.
2. Add a batch norm layer to each hidden layer.
3. Add weight decay to the model (this is done via the optimiser). Compare the following weight decay values: 0.01, 0.0001, 0.000001.
4. Change the optimiser to SGD.
5. Add momentum to SGD and compare it to Adam.
6. Add an additional hidden layer. You can choose any size!
7. How does the accuracy look if you train the model on only 100 training samples?
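
If you get stuck, the sketch below shows some building blocks that may help with challenges 2 to 5 and 7. It is a starting point under stated assumptions, not a model answer: the placement of batch norm before the activation is one common choice among several, and the momentum and weight decay values are placeholders to tune.

```python
import torch
import torch.nn as nn
from torch.utils.data import Subset, TensorDataset

# Challenge 2 (hint): a hidden block with batch norm between the linear layer and the activation
hidden_block = nn.Sequential(
    nn.Linear(784, 200),
    nn.BatchNorm1d(200),   # normalises each of the 200 features over the batch
    nn.ReLU(),
)

# Challenges 3-5 (hint): SGD with momentum and weight decay; the values are placeholders to tune
sgd_optimiser = torch.optim.SGD(hidden_block.parameters(), lr=0.01,
                                momentum=0.9, weight_decay=1e-4)

# Challenge 7 (hint): keep only the first 100 samples of a dataset (random stand-in data here)
stand_in_dataset = TensorDataset(torch.rand(1000, 784), torch.randint(0, 10, (1000,)))
small_dataset = Subset(stand_in_dataset, range(100))

out = hidden_block(torch.rand(32, 784))
print(out.shape, len(small_dataset))  # torch.Size([32, 200]) 100
```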

BONUS: Play with all hyperparameters (model size, optimiser, learning rate, batch size, etc.) and see who can get the best evaluation accuracy within 20 epochs.
