# CSE5ML: Machine Learning
# Assessment 2: Image Classification with Neural Networks

## Completion Requirements

- **Working code and written report**
- **Due date:** 11.59 pm (AEST), Sunday 14 September 2025 (Week 7)
- **Weighting:** 30%
- **Length:** Working code and 1000-word report (+/– 10%)
- **SILOs:** Implement a neural network with different learning algorithms for time-series forecasting with real-world data from industry (SILO 4).

## Purpose

The purpose of this assessment is to develop hands-on experience with neural networks for image classification – a key application of machine learning used across industries such as health care, autonomous systems and digital security to interpret and act on visual data.

## Task Details

This assessment aims to consolidate your knowledge and practical skills to build neural networks (NNs) for supervised learning. The task is formulated as a multi-class classification problem for handwritten images, and the goal is to model the relationship between the images’ content, network structure and labels. You need to provide:

- **Working code** (part 1)
- **A written report** of 1000 words on the method and results (part 2).

### Instructions

The MNIST database is a dataset of handwritten digits (from 0 to 9). The digits have been size-normalised and centred in a fixed-size image (28 × 28 pixels). Note that `mnist.load_data()` returns pixel values as integers from 0 to 255; divide by 255 if you want values from 0 to 1. You can use the following code with TensorFlow in Python to download the data.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.datasets import mnist

(X_train, Y_train), (X_test, Y_test) = mnist.load_data()
```

Every MNIST data point has two parts: an image of a handwritten digit and a corresponding label. We will call the images 𝑥 and the labels 𝑦. Both the training set and test set contain 𝑥 and 𝑦.
Each image is 28 pixels by 28 pixels.
As mentioned, the corresponding labels in MNIST are numbers between 0 and 9, describing which digit a given image represents. In this assessment, we regard the labels as one-hot vectors; that is, 0 in most dimensions, and 1 in a single dimension. In this case, the digit 𝑛 will be represented as a vector that is 1 in the 𝑛-th dimension (counting from 0). For example, 3 would be [0,0,0,1,0,0,0,0,0,0].
The assessment aims to build NNs for classifying handwritten digits in the MNIST database, train them on the training set and test them on the test set. Since the main objective of this assessment is for you to understand the relationship between input, model and output, you are not expected to achieve very high accuracy in model performance; instead, for each task, you should be able to identify how changes to the network structure improve model performance.
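
The one-hot representation described above can be sketched in a few lines of NumPy. This is only an illustration: the helper name `to_one_hot` and the example labels are assumptions made for this sketch, and Keras offers `keras.utils.to_categorical` for the same purpose.

```python
import numpy as np

def to_one_hot(labels: np.ndarray, num_classes: int = 10) -> np.ndarray:
    """Convert integer class labels into one-hot row vectors (hypothetical helper)."""
    # Row i of the identity matrix is the one-hot vector for class i,
    # so indexing the identity matrix with the labels builds all vectors at once.
    return np.eye(num_classes, dtype=np.float32)[labels]

# Made-up labels for illustration only.
example_labels = np.array([3, 0, 9])
one_hot = to_one_hot(example_labels)
print(one_hot[0])  # the one-hot vector for the digit 3
```

Each row contains exactly one 1, at the position given by the digit.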

There are two parts to this assessment:

### Part 1
Part 1 comprises three main tasks:

**Task 1**

Build a neural network without convolutional layers to do the classification task (hint: you will need to use dense layers). Then change the model structure (i.e. the number of dense layers, the number of neurons in the dense layers or the activation functions) to improve network performance.

**Task 2**

Build a neural network that uses convolutional layers (you can decide which other layer types to include in your network). Then change the number of convolutional layers, the number of filters or the activation functions in the convolutional layers to improve network performance.

**Task 3**

Change the type of optimiser or learning rate that you applied in the previous tasks and see how these changes can influence model performance. (You can keep the final network structure you applied in task 2 and try at least one different optimiser setting.)

Please read the following comments and requirements very carefully before starting the assessment:

1.	The assessment is based on the content of labs and Weeks 1–3.
2.	In Week 1 we talked about the use of training set, validation set and test set in machine learning. In this assessment, you are asked to train the NN on the training set and test the NN on the test set, without any given validation set. (If you want to monitor the training process, you can also try what we did in Week 3: you can consider the validation set is the same as the test set in this assessment.)
3.	In the assessment, the performance of an NN is measured by its prediction accuracy in classifying images from the test set (i.e. number of the correctly predicted images/number of the images in the test set).
4.	Since MNIST is a greyscale image dataset, the shape of the dataset is (dataset_length, 28, 28). To fit it into a Conv2D layer, the input must comply with the required format: (batch_size, image_height, image_width, image_channels). Although batch_size can be decided later when you train the network, you still need to specify the number of image channels. You can reshape the dataset into (dataset_length, 28, 28, 1) or add one more dimension at the end with np.newaxis.
5.	You are expected to show at least two models for each of tasks 1 and 2: the model you start with, and the model that you identified as having better accuracy. For task 3, you need to show which optimiser and/or learning rate you applied.
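
Items 3 and 4 above can be illustrated with a small NumPy sketch. The arrays below are made-up stand-ins rather than the real MNIST data, and `prediction_accuracy` is a hypothetical helper name chosen for this example.

```python
import numpy as np

# Stand-in for the MNIST images: 5 blank greyscale "images" of 28 x 28 pixels.
images = np.zeros((5, 28, 28), dtype=np.float32)

# Option 1: reshape explicitly to (dataset_length, 28, 28, 1).
reshaped = images.reshape(-1, 28, 28, 1)

# Option 2: append a channel axis at the end with np.newaxis.
with_axis = images[..., np.newaxis]

print(reshaped.shape, with_axis.shape)

def prediction_accuracy(predicted: np.ndarray, actual: np.ndarray) -> float:
    """Number of correctly predicted images divided by the number of images."""
    return float(np.mean(predicted == actual))

# Made-up predictions and labels: four of the five digits match.
predicted_digits = np.array([7, 2, 1, 0, 4])
true_digits = np.array([7, 2, 1, 0, 9])
accuracy = prediction_accuracy(predicted_digits, true_digits)
print(accuracy)
```

Both reshaping options produce the same array shape; which one to use is a matter of preference.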

### Part 2

Your report must at least contain the following content:
1.	Your name and student number.
2.	Architectures of the NNs, with figures for tasks 1 and 2.
3.	Description of the optimiser and learning rate you applied in the final model of task 2 and the optimiser or change of learning rate you used in task 3.
4.	Experiments and performance, with parameter settings.
5.	Discussion of the improvement/deterioration in the NN’s performance after changing the architecture and parameter settings for each task, and findings from comparing the results of all three tasks.
6.	The ranking of all NNs’ performances from all three tasks.

### Assessment criteria

This assessment will measure your ability to:

**Part 1:**

•	describe the two models, experiment settings and compare the results for task 1 (25%)
•	describe the two models, experiment settings and compare the results for task 2 (25%)
•	describe the two optimisers or learning rates, experiment settings and compare the results for task 3 (35%)

**Part 2:**

•	demonstrate correct code quality (10%)
•	research extensively and demonstrate depth of thinking; produce a well-structured report (5%).
Refer to the marking guide for marking and feedback information.

### Submission details

The submitted assessment consists of (1) a report (in PDF format) of no less than 1000 words and (2) all code for modelling, training and testing the NN with TensorFlow in Python (you can put all your code in one file, or use a separate code file for each task).
**If you use ChatGPT or other generative AI tools, you must cite them and clearly indicate your original contributions.**

In keeping with La Trobe University policy, all assignments are to be submitted in Moodle via Turnitin.
To be accepted, your assessment submission **must** generate a similarity score (you are responsible for checking this). Submitting in Word or PDF format is the best way to do this. If your submission does not generate a similarity score, it cannot be checked for plagiarism and therefore **will not be marked.**

Last modified: Tuesday, 12 August 2025, 8:58 PM




# Part 1, task 1 & 3

### Import relevant dependencies 

In [None]:
# Imports the MNIST dataset from Keras, a classic collection of 70,000 grayscale images of handwritten digits (0-9).
from keras.datasets import mnist 

# Imports TensorFlow, the core open-source library from Google for building and training machine learning models. We use the alias 'tf' by convention.
import tensorflow as tf

# Imports the Adam and SGD optimisers. An optimiser is an algorithm that adjusts the model's internal parameters (weights) to minimise the error. Adam is a popular, efficient choice; SGD (stochastic gradient descent) is the classic baseline.
from tensorflow.keras.optimizers import Adam, SGD

# Imports specific performance metrics. Metrics are used to evaluate how well the model is performing.
# Precision: Measures the accuracy of positive predictions.
# Recall: Measures the model's ability to find all the actual positive instances.
# Accuracy: Measures the overall fraction of correct predictions.
from tensorflow.keras.metrics import Precision, Recall, Accuracy

# Imports the History callback object. A 'callback' is an object whose methods are called at different stages of training. The History object automatically records the metrics and loss values from each epoch.
# It is imported from the public tensorflow.keras API rather than the private tensorflow.python package.
from tensorflow.keras.callbacks import History

# Imports the ModelCheckpoint callback. This callback saves the model to a file during training, typically only when its performance on a validation metric improves.
from tensorflow.keras.callbacks import ModelCheckpoint

# Imports the Pandas library, a powerful tool for data manipulation and analysis. It's mainly used for working with structured data in tables called DataFrames. 'pd' is the standard alias.
import pandas as pd

# Imports the Sequential model type from Keras. This is the simplest way to build a model, by creating a linear stack of layers.
from keras.models import Sequential

# Imports different types of layers, which are the fundamental building blocks of a neural network.
# Dense: A standard, fully-connected layer where each neuron is connected to every neuron in the previous layer.
# Input: A special layer used to define the shape and data type of the model's input.
# Flatten: A layer that transforms a multi-dimensional input (like a 2D image) into a one-dimensional vector.
# Normalization: A preprocessing layer that standardises input data (mean of 0, standard deviation of 1), which helps the model train faster.
# Conv2D: A convolutional layer that slides small learnable filters across the image to detect local patterns.
# MaxPooling2D: A layer that downsamples feature maps by keeping the maximum value in each window.
# Dropout: A regularisation layer that randomly sets a fraction of its inputs to zero during training.
from tensorflow.keras.layers import Dense, Input, Flatten, Normalization, Conv2D, MaxPooling2D, Dropout

# Imports a utility function from scikit-learn, a popular library for traditional machine learning. train_test_split is used to split a single dataset into separate training and testing sets.
from sklearn.model_selection import train_test_split

# Imports the pyplot interface from Matplotlib, which is the most widely used library for creating plots and visualisations in Python. 'plt' is the standard alias.
import matplotlib.pyplot as plt

# Imports the NumPy library, which is the foundation for numerical computing in Python. It provides support for large, multi-dimensional arrays and a wide range of mathematical functions. 'np' is the standard alias.
import numpy as np

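# Imports a data scaling tool from scikit-learn. MinMaxScaler scales all data features to a specific range, usually 0 to 1.
from sklearn.preprocessing import MinMaxScaler

# Imports evaluation utilities from scikit-learn. confusion_matrix tabulates predicted versus true classes, and classification_report summarises precision, recall and F1-score per class.
from sklearn.metrics import confusion_matrix, classification_report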

# Imports tools for 'type hinting' from Python's typing module. Type hints make code more readable and can be used by external tools to check for errors.
# Tuple: Used to hint that a variable or function return is a tuple (an ordered, immutable collection of elements).
from typing import Tuple

# Imports a specific type hint from NumPy's typing module.
# NDArray: Used to hint that a variable is a NumPy n-dimensional array, which is more descriptive than a generic type.
from numpy.typing import NDArray

# Imports the 'os' module. This library provides a way for Python to interact with the computer's underlying operating system.
# We use it for tasks like reading file names from a folder (os.listdir()) and constructing file paths that work correctly on any system, like Windows, Mac, or Linux (os.path.join()).
import os 

# Imports the 're' module, which stands for Regular Expression. This is Python's library for advanced pattern matching in strings.
# We use it to find and extract specific pieces of text from a string, like pulling the accuracy score out of a complex filename (e.g., finding '0.9935' in 'model_acc-0.9935.keras').
import re

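# Imports Seaborn, a statistical visualisation library built on top of Matplotlib. 'sns' is the standard alias.
import seaborn as sns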

### Define the training set features (X_train) and target variable (Y_train) as well as the test set features (X_test) and target variable (Y_test)

In [None]:
# Load the MNIST dataset, which is a large database of handwritten digits.
# The function returns two tuples: one for training data and one for testing data.
# Recalling, a Tuple is a collection of objects that are ordered and immutable.
(X_train, Y_train), (X_test, Y_test) = mnist.load_data()

# First we convert the data to float32, which helps with numerical stability. float32 provides sufficient precision while also being memory efficient.
# Most modern CPUs and GPUs are optimised for float32 operations, making computations faster.
X_train = X_train.astype("float32")
X_test = X_test.astype("float32")

# Then we add a channel dimension to both the training and test data, which is required for CNNs.
X_train = X_train[..., tf.newaxis] # Add the channel dimension
X_test = X_test[..., tf.newaxis] # Add the channel dimension
print(f"New shape for X_train for CNNs: {X_train.shape}")
print(f"New shape for X_test for CNNs: {X_test.shape}")

# Declare the types of the loaded data for clarity.
X_train: NDArray[np.float32]
Y_train: NDArray[np.uint8]
X_test: NDArray[np.float32]
Y_test: NDArray[np.uint8]

# We set the line width to a large value to avoid line breaks when printing the array.
with np.printoptions(linewidth=10000):
    # Print the shapes of the datasets to understand their dimensions.
    print("Shape of X_train:\t", X_train.shape)
    print("Shape of X_test:\t", X_test.shape)
    print("Shape of Y_train:\t", Y_train.shape)
    print("Shape of Y_test:\t", Y_test.shape)
    print(f"X_train data type: {X_train.dtype}")
    print(f"X_test data type: {X_test.dtype}")
    print(f"Y_train data type: {Y_train.dtype}")
    print(f"Y_test data type: {Y_test.dtype}")

    # Inspect a single data sample to see what it looks like.
    n: int = 5678
    print(f"\nX_train data {n}-th element (a 28x28 pixel image):\n", np.squeeze(X_train[n]))
    print("\nAnd its corresponding label:\t", Y_train[n])

    # Note on input shapes for a neural network:
    # TensorFlow/Keras expects input arrays whose first axis indexes the samples, while the remaining axes hold the features. The general format for the input shape is: (batch_size, feature_1, feature_2, ...)
    # For a Dense (fully connected) network: each 28x28 image must be flattened into a single 1D array of 784 pixels. The input to the first Dense layer then has shape (None, 784), where None represents a variable batch size.
    # Rather than reshaping the data manually, we can place a tf.keras.layers.Flatten layer at the start of our sequential model.
    # This layer flattens each sample automatically.
    # For a Convolutional Neural Network (CNN): We must add a channel dimension. Since the images are grayscale, there is only one channel. We would reshape the data to (number_of_images, 28, 28, 1). The input shape for the first layer (typically a Conv2D layer) would be (28, 28, 1). The batch size is handled automatically by Keras.
    # Scaling can also be performed in the model using a tf.keras.layers.Rescaling or keras.layers.Normalization layer as the first layer in our sequential model.
    # The advantage of using these layers is that they integrate seamlessly into the model architecture, ensuring that the data is preprocessed consistently during both training and inference.
    # This approach also simplifies the code by reducing the need for separate preprocessing steps outside the model definition.
    # It also ensures that inference data is processed in the same way as training data, which is crucial for maintaining model performance.

    # Analyze the distribution of the digits in the training set.
    # `np.unique` finds the unique digit labels and `return_counts=True` counts their occurrences.
    dataset_distribution: Tuple[np.ndarray, np.ndarray] = np.unique(Y_train, return_counts=True)
    digits: np.ndarray = dataset_distribution[0]
    counts: np.ndarray = dataset_distribution[1]
    
    print("\n--- Dataset Distribution ---")
    print("Digits:\t\t\t", digits)
    print("Count per digit:\t", counts)
    
    # Calculate basic statistics on the distribution.
    avg: float = np.mean(counts)
    print(f"Average sample size:\t {avg:.2f}")
    
    max_count: np.int64 = np.max(counts)
    min_count: np.int64 = np.min(counts)
    print(f"Maximum sample size:\t {max_count}")
    print(f"Minimum sample size:\t {min_count}")

# Create a bar chart from the counts and digits to visualize the distribution.
plt.bar(digits, counts, color='blue', edgecolor='black')

# Set the title and labels for clarity.
plt.title('Count of Each Digit in Training Set')
plt.xlabel('Digits')
plt.ylabel('Count')

# Set x-ticks to be at the center of each bar and label them with the digit.
plt.xticks(digits)

# Add a grid for better readability.
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Display the plot.
plt.show()

## Dataset Analysis

The content and size of the training and testing datasets align with the description on the Kaggle MNIST dataset page, Hojjat, F. (2017). MNIST: The Most Famous Dataset in the World. Kaggle. Retrieved August 28, 2025, from https://www.kaggle.com/datasets/hojjatk/mnist-dataset. The plot of digit distribution shows a fairly homogeneous representation across all classes (digits 0 through 9). While the digit '1' is slightly oversampled and the digit '5' is slightly undersampled, the class imbalance is not significant enough to warrant further action for this assessment.

If the distribution were significantly imbalanced and we needed to make it more homogeneous, we would use a technique called **resampling**. Resampling adjusts the distribution of the training data to make it more balanced. There are two primary types:

- **Oversampling** involves duplicating samples from the underrepresented classes to increase their frequency.

- **Undersampling** involves removing samples from the overrepresented classes to reduce their frequency.
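
As a minimal sketch of the oversampling idea (not used in this notebook, since MNIST is already well balanced), the following NumPy function duplicates randomly chosen samples from the smaller classes until every class matches the largest one. The function name `random_oversample` and the toy data are assumptions made for illustration; resampling is normally applied to the training set only.

```python
import numpy as np

def random_oversample(features, labels, seed=0):
    """Duplicate random samples until every class is as large as the biggest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    selected = []
    for cls, count in zip(classes, counts):
        cls_indices = np.flatnonzero(labels == cls)
        # Draw the missing samples with replacement from the existing ones.
        extra = rng.choice(cls_indices, size=target - count, replace=True)
        selected.append(np.concatenate([cls_indices, extra]))
    # Shuffle so that the duplicated samples are not grouped by class.
    order = rng.permutation(np.concatenate(selected))
    return features[order], labels[order]

# Toy data: class 0 has four samples, class 1 has two, class 2 has one.
toy_labels = np.array([0, 0, 0, 0, 1, 1, 2])
toy_features = np.arange(len(toy_labels)).reshape(-1, 1)
balanced_features, balanced_labels = random_oversample(toy_features, toy_labels)
print(np.unique(balanced_labels, return_counts=True))
```

Undersampling would follow the same pattern, but would draw `counts.min()` samples per class without replacement instead.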

## Part 1, Task 1: Creating a simple Multilayer Perceptron (MLP) neural network

The code below defines our base model.

To experiment with different architectures or tune its hyperparameters, we simply copy this entire cell and make our changes.

We need to make sure to give each new model a unique name. This ensures that when the ModelCheckpoint callback saves the best-performing version during training, the filename will be clear and identifiable.
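
Because the validation accuracy is embedded in each checkpoint filename, the best saved file can be identified afterwards with a regular expression. The sketch below assumes filenames of the form `best_model_epoch-07_val_acc-0.9790.keras`; the filenames listed are hypothetical and `parse_checkpoint` is an illustrative helper, not part of Keras.

```python
import re

# Hypothetical filenames following the checkpoint naming pattern.
checkpoint_files = [
    "best_model_epoch-03_val_acc-0.9712.keras",
    "best_model_epoch-07_val_acc-0.9790.keras",
    "best_model_epoch-12_val_acc-0.9768.keras",
]

# Capture the epoch number and the validation accuracy from the filename.
pattern = re.compile(r"epoch-(\d+)_val_acc-(\d+\.\d+)\.keras$")

def parse_checkpoint(filename):
    """Return (epoch, validation accuracy) parsed from a filename, or None if it does not match."""
    match = pattern.search(filename)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))

parsed = {name: parse_checkpoint(name) for name in checkpoint_files}
# Pick the file with the highest validation accuracy.
best_file = max(
    (name for name in parsed if parsed[name] is not None),
    key=lambda name: parsed[name][1],
)
print(best_file, parsed[best_file])
```

In practice the list of filenames would come from `os.listdir()` on the model's checkpoint folder.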

In [None]:
# --- Set Seeds for Reproducibility ---

# This sets the global random seed for all TensorFlow operations.
# It ensures that things like model weight initialisation are the same every time.
# `tf.random.set_seed()` is the modern way to do this in TensorFlow 2.
tf.random.set_seed(1)

# This sets the random seed for all NumPy operations.
# This is important if we are creating our data using NumPy or using any
# NumPy functions that involve randomness.
np.random.seed(23)

In [None]:
# THIS CELL IS NO LONGER REQUIRED, BUT LEFT HERE FOR FUTURE PROJECTS WHERE A SINGLE MODEL IS ENOUGH, AND TO DEMONSTRATE THE BASIC CODE BASE.
# THE CELLS BELOW SPLIT UP THIS CODE IN A MODEL CREATION PART, A RUN EXPERIMENT PART, AND HYPERPARAMETER TUNING PART, ALLOWING FOR MUCH MORE FLEXIBILITY. 

# # --- 1. Define the MLP BASE Model Architecture ---
# # We are building a Sequential model, which is a simple, linear stack of layers.
# mnist_model_mlp: Sequential = Sequential([
#     # This is our preprocessing layer for Z-score scaling (normalisation with mean 0 and std. dev 1).
#     # It learns the mean and standard deviation from the training data and applies it.
#     # The input_shape must match a single sample from our data, which is a 28x28 image.
#     # We define the normalisation and flattening inside the model so that
#     # they are applied consistently during both training and inference.
#     Input(shape=(28, 28, 1)),
#     Normalization(),

#     # The Flatten layer converts the 2D image (28x28) into a 1D vector (784 elements).
#     # This is a necessary step to feed the data into the Dense (fully-connected) layers.
#     Flatten(),

#     # These are our hidden layers. We use the 'relu' activation function to help
#     # the model learn complex, non-linear patterns in the pixel data.
#     Dense(units=128, activation='relu'),
#     Dense(units=256, activation='relu'),
#     Dense(units=64, activation='relu'),

#     # This is our output layer. It must have 10 neurons (one for each digit 0-9).
#     # The 'softmax' activation converts the output into a probability distribution,
#     # showing the model's confidence for each digit.
#     Dense(units=10, activation='softmax')
# ], name="MLP_Base")

# # We can print a summary of the model's architecture to see the layers and parameter counts.
# print("\n--- Model Architecture ---")
# mnist_model_mlp.summary()

# # for n in range(len(mnist_model_mlp.layers)):
# #     print(f"MNIST model layer[{n}: {mnist_model_mlp.layers[n]}]")

# # --- 2. Adapt the Normalisation Layer ---
# # Before training, we must let the Normalization layer calculate the mean and
# # variance of our training data. The .adapt() method does this for us.
# print("Adapting the normalisation layer to the training data...")
# mnist_model_mlp.layers[0].adapt(X_train)
# print("Adaptation complete.")


# # --- 3. Configure and Compile the Model ---
# # We configure the optimiser and the list of metrics we want to track.
# adam_optimizer: Adam = Adam(learning_rate=0.001)
# metrics_list: list = ['accuracy'] #, Precision(), Recall()] Keras' Precision and Recall metrics expect binary or one-hot targets, so they are not used with the integer labels here.

# # The compile step brings everything together and prepares the model for training.
# mnist_model_mlp.compile(
#     optimizer=adam_optimizer,
#     loss='sparse_categorical_crossentropy', # Best for integer labels in multi-class classification, such as our MNIST labels (i.e. 0, 1, 2, ..., 9). Keras' sparse_categorical_crossentropy handles the one-hot encoding for us internally. 
#     metrics=metrics_list
# ) # See Chollet's book, Section 8.1, and Goodfellow et al., Deep Learning, Sections 5.5 and 6.2.2.2

# # Define the filepath for the saved model.
# # The placeholders {epoch:02d} and {val_accuracy:.4f} will be automatically filled in.
# folder_name = 'MLP Models'
# filepath = f'{folder_name}/best_{mnist_model_mlp.name}_epoch-{{epoch:02d}}_val_acc-{{val_accuracy:.4f}}.keras'

# # Create a ModelCheckpoint callback so we can save the best model during training. Note: every epoch the model is evaluated
# # on validation accuracy. If the validation accuracy improves, the model will be saved. If not, it will not be saved.
# model_checkpoint_callback = ModelCheckpoint(
#     filepath=filepath,
#     monitor='val_accuracy',      # Monitor the validation accuracy
#     mode='max',                  # The direction of improvement (higher is better for accuracy)
#     save_best_only=True,         # Only save the model if `val_accuracy` has improved
#     verbose=1                    # Print a message when the model is saved
# )

# # --- 4. Train the Model ---
# # This is where the learning happens. The .fit() method trains the model on our data, and returns a History object.
# # The 'history' object will store the loss and metric values for each epoch.
# history_mlp: History = mnist_model_mlp.fit(
#     X_train,
#     Y_train,
#     epochs=5,
#     validation_split=0.1, # We hold back 10% of the training data to validate performance.
#     batch_size=64,
#     verbose=1, # We set verbose=1 to see the training progress bar.
#     callbacks=[model_checkpoint_callback], 
# )


In [None]:
def create_mlp_model() -> Sequential:
    """
    Defines and returns the (base) MLP model architecture.
    """
    # Note: we can change the model architecture here. However, it is more prudent to save the model parameters first, and then change it. 
    model = Sequential([
        # We declare the input shape with an explicit Input layer. Sequential.layers does not include it, so model.layers[0] is the Normalization layer.
        Input(shape=(28, 28, 1)),
        Normalization(),
        Flatten(),
        Dense(units=128, activation='relu'),
        Dense(units=256, activation='relu'),
        Dense(units=64, activation='relu'),
        Dense(units=10, activation='softmax')
    ])
    return model

In [None]:
def run_experiment(model_creation_func, hyperparameters, parent_folder, X_train, Y_train) -> History:
    """
    Runs a full training experiment for a given model architecture and hyperparameter set.
    """
    # Create a fresh instance of the model for this experiment. We do this by calling the model creation function, which is
    # passed as an argument to this function.
    model = model_creation_func()
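    # Some Keras versions expose `name` as a read-only property, so we fall back to the private attribute if needed.
    model_name = hyperparameters.get('model_name', 'unnamed_model')
    try:
        model.name = model_name
    except AttributeError:
        model._name = model_name

    print(f"\n--- Starting Experiment: {model_name} ---")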

    # We can print a summary of the model's architecture to see the layers and parameter counts.
    print("\n--- Model Architecture ---")
    model.summary() # This includes the name of the model as well.

    # And we can also print the hyperparameters of the model.
    print("\n--- Hyperparameters ---")
    for key, value in hyperparameters.items():
        print(f"{key:<20}: {value}")

    # --- Adapt the Normalisation Layer ---
    print("\nAdapting the normalisation layer...")
    model.layers[0].adapt(X_train)
    print("Adaptation complete.\n")

    # --- Configure the Optimiser ---
    optimiser_name = hyperparameters.get('optimiser', 'adam').lower() # if not provided, default to Adam
    learning_rate = hyperparameters.get('learning_rate', 0.001) # if not provided, default to 0.001

    # We use an optimiser object rather than a string name to allow for more complex configurations in the future,
    # such as learning rate schedules, gradient clipping and weight decay.
    # These are not used in this workbook, but they make the code more robust for future (personal) projects.
    if optimiser_name == 'adam':
        optimiser = Adam(learning_rate=learning_rate)
    elif optimiser_name == 'sgd':
        optimiser = SGD(learning_rate=learning_rate)
    else:
        optimiser = optimiser_name

    # --- Configure the ModelCheckpoint Callback ---
    model_specific_folder = os.path.join(parent_folder, model.name)
    filepath = os.path.join(model_specific_folder, 'best_model_epoch-{epoch:02d}_val_acc-{val_accuracy:.4f}.keras')
    checkpoint = ModelCheckpoint(
        filepath=filepath,
        monitor='val_accuracy',
        mode='max',
        save_best_only=True,
        verbose=1
    )
    
    # --- Compile the Model ---
    model.compile(
        optimizer=optimiser,
        loss=hyperparameters.get('loss_function', 'sparse_categorical_crossentropy'),
        metrics=['accuracy']
    )
    
    # --- Train the Model ---
    history = model.fit(
        X_train,
        Y_train,
        epochs=hyperparameters.get('epochs', 10),
        batch_size=hyperparameters.get('batch_size', 64),
        validation_split=0.1,
        callbacks=[checkpoint],
        verbose=1
    )


    # The history.history dictionary contains a list of validation accuracies for each epoch.
    val_accuracies = history.history['val_accuracy']

    # Use Python's built-in max() function to find the highest value in that list.
    best_validation_accuracy = max(val_accuracies)
    
    # Use the .index() method to find which epoch this occurred at.
    # We add 1 because list indices are 0-based, but epochs are typically 1-based.
    best_epoch = val_accuracies.index(best_validation_accuracy) + 1
    # We subtract the 1 again from the best_epoch to obtain the associated training accuracy. 
    associated_train_acc = history.history['accuracy'][best_epoch - 1]

    print("\n--- Peak Performance Summary ---")
    print(f"{'Best validation accuracy:':<35} {best_validation_accuracy:.4f}")
    print(f"{'Associated training accuracy:':<35} {associated_train_acc:.4f}")
    print(f"{'Occurred at epoch:':<35} {best_epoch}")


    return history

In [None]:
# --- Define Hyperparameter Sets ---

# Experiment 1: Our baseline run
mlp_exp_1_config = {
    "model_name": "MLP_Baseline_Run",
    "optimiser": "Adam",
    "learning_rate": 0.01,
    "epochs": 70,
    "batch_size": 64
}

# Experiment 2: The same setup with a lower learning rate (epochs and batch size are unchanged)
mlp_exp_2_config = {
    "model_name": "MLP_Slow_Learn_Run",
    "optimiser": "Adam",
    "learning_rate": 0.001,
    "epochs": 70,
    "batch_size": 64
}

# Experiment 3: SGD instead of Adam, with the baseline learning rate
mlp_exp_3_config = {
    "model_name": "MLP_SGD_Learn_Run",
    "optimiser": "SGD",
    "learning_rate": 0.01,
    "epochs": 70,
    "batch_size": 64
}

# Experiment 4: SGD with the lower learning rate
mlp_exp_4_config = {
    "model_name": "MLP_SGD_Slow_Learn_Run",
    "optimiser": "SGD",
    "learning_rate": 0.001,
    "epochs": 70,
    "batch_size": 64
}

In [None]:
# --- Run the first experiment ---

mlp_histories = []
mlp_history_1 = run_experiment(
    model_creation_func=create_mlp_model, 
    hyperparameters=mlp_exp_1_config,
    parent_folder='MLP_Models',
    X_train=X_train,
    Y_train=Y_train
)
mlp_histories.append(mlp_history_1)

# --- To run the second experiment, we just call it again with the other config :-)
mlp_history_2 = run_experiment(
    model_creation_func=create_mlp_model, 
    hyperparameters=mlp_exp_2_config,
    parent_folder='MLP_Models',
    X_train=X_train,
    Y_train=Y_train
)
mlp_histories.append(mlp_history_2)

mlp_history_3 = run_experiment(
    model_creation_func=create_mlp_model, 
    hyperparameters=mlp_exp_3_config,
    parent_folder='MLP_Models',
    X_train=X_train,
    Y_train=Y_train
)
mlp_histories.append(mlp_history_3)


mlp_history_4 = run_experiment(
    model_creation_func=create_mlp_model, 
    hyperparameters=mlp_exp_4_config,
    parent_folder='MLP_Models',
    X_train=X_train,
    Y_train=Y_train
)
mlp_histories.append(mlp_history_4)

# Notes on Sparse Categorical Loss vs. Categorical Loss
# Understanding Cross-Entropy Loss

At its heart, **cross-entropy** is a concept from information theory that measures how different two probability distributions are. In the context of training a neural network for classification, we use it to measure the "distance" between the model's predicted probability distribution and the true probability distribution of the labels. The goal of training is to minimise this distance, effectively making the model's predictions more accurate (Goodfellow et al., 2016).

---
### Categorical Cross-Entropy (for One-Hot Labels)

You use this loss function when your labels are explicitly **one-hot encoded** (e.g., the digit `3` is represented as `[0, 0, 0, 1, 0, 0, 0, 0, 0, 0]`). The formula for a single sample is:

$$L = -\sum_{i=0}^{C-1} y_i \log(\hat{y}_i)$$

-   $L$ is the final loss value for the sample.
-   $C$ is the total number of classes (e.g., 10 for MNIST).
-   $y_i$ is the ground truth (it is `1` for the correct class and `0` for all others).
-   $\hat{y}_i$ is the model's predicted probability for class $i$.

Because the `y` vector is almost all zeros, the summation simplifies to just the negative logarithm of the probability the model assigned to the single correct class. For a label of `3`, the loss simply becomes $L = -\log(\hat{y}_3)$.

---
### Sparse Categorical Cross-Entropy (for Integer Labels)

This is a more computationally and memory-efficient version used when your labels are simple **integers** (e.g., `3`). It arrives at the exact same mathematical result but skips the need for the one-hot encoded vector.

The formula is a direct implementation of the simplified logic:

$$L = -\log(\hat{y}_c)$$

-   $L$ is the final loss value for the sample.
-   $c$ is the integer representing the correct class (e.g., `c = 3`).
-   $\hat{y}_c$ is the model's predicted probability for that correct class $c$.

As Chollet (2021) explains, both formulas compute the exact same value. The choice is purely a practical one based on the format of your labels, not a mathematical one that affects the model's learning.
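
The equivalence of the two formulas can be checked numerically. The snippet below is a minimal sketch in plain Python rather than the Keras implementation. The helper names and the predicted probabilities are made up for illustration.

```python
import math


def categorical_cross_entropy(y_true_one_hot, y_pred):
    """L = -sum_i y_i * log(y_hat_i), for a one-hot label vector."""
    return -sum(y * math.log(p) for y, p in zip(y_true_one_hot, y_pred))


def sparse_categorical_cross_entropy(true_class, y_pred):
    """L = -log(y_hat_c), for an integer class label c."""
    return -math.log(y_pred[true_class])


# Made-up softmax output for one image; the ten probabilities sum to 1.
y_pred = [0.01, 0.01, 0.02, 0.85, 0.01, 0.02, 0.02, 0.02, 0.02, 0.02]

true_class = 3
one_hot = [1.0 if i == true_class else 0.0 for i in range(10)]

loss_categorical = categorical_cross_entropy(one_hot, y_pred)
loss_sparse = sparse_categorical_cross_entropy(true_class, y_pred)

print(f"categorical: {loss_categorical:.4f}")
print(f"sparse:      {loss_sparse:.4f}")
```

Both calls return $-\log(0.85) \approx 0.1625$. Only the label format differs.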

---
**References**

Chollet, F. (2021). *Deep learning with Python* (2nd ed.). Shelter Island, NY: Manning Publications.

Goodfellow, I., Bengio, Y., & Courville, A. (2016). *Deep learning*. Cambridge, MA: MIT Press.

## Part 1, Task 2: Creating a simple Convolutional Neural Network (CNN)

The code below defines our base model.

To experiment with different architectures or tune its hyperparameters, we simply copy this entire cell and make our changes.

We need to make sure to give each new model a unique name. This ensures that when the ModelCheckpoint callback saves the best-performing version during training, the filename will be clear and identifiable.
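
To see why a unique model name keeps the saved files identifiable, the sketch below builds the checkpoint filename template the same way as the cell that follows. It then fills the placeholders with `str.format`, which imitates what the callback does when it saves a model. The epoch and accuracy values are invented for illustration.

```python
model_name = "CNN_Base"      # illustrative; any unique model name works
folder_name = "CNN_Models"

# Doubled braces survive the f-string as literal braces, so the result is a
# template whose placeholders are filled in later, once per saved checkpoint.
filepath_template = (
    f"{folder_name}/best_{model_name}_epoch-{{epoch:02d}}_val_acc-{{val_accuracy:.4f}}.keras"
)
print(filepath_template)

# Imitate the substitution for a hypothetical epoch 7 with val_accuracy 0.99133.
example_path = filepath_template.format(epoch=7, val_accuracy=0.99133)
print(example_path)
```

The second line printed is `CNN_Models/best_CNN_Base_epoch-07_val_acc-0.9913.keras`. If two experiments shared a model name, their checkpoints could only be told apart by epoch and accuracy.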

In [None]:
# # --- 1. Define the CNN BASE model Architecture ---
# # We are building a Sequential model, which is a simple, linear stack of layers.
# mnist_model_cnn: Sequential = Sequential([
#     # This is our preprocessing layer for Z-score scaling (standardisation to mean 0 and std. dev 1).
#     # It learns the mean and variance from the training data and applies them.
#     # The input shape must match a single sample from our data: a 28x28 image with 1 channel.
#     # We define the normalisation inside the model so that
#     # it is applied consistently during both training and inference.

#     Input(shape=(28, 28, 1)),   # Important note: We must now include the number of channels as well, because the core operation
#     # of a convolution is to operate on volumes of data. The convolution slides a small filter (kernel) across the input image.
#     # This filter itself has a depth and must match the depth (channels) of the input. It processes all channels simultaneously 
#     # to produce a single output value for that position, effectively combining spatial and channel-wise information.

#     Normalization(),
    
#     # --- Convolutional Block 1 ---
#     # Conv2D layers act as feature detectors. They scan the image with filters (kernels)
#     # to find patterns like edges, curves, etc.
#     # 32 filters: The number of features to learn.
#     # (3, 3) kernel_size: The size of the scanning window.
#     Conv2D(filters=32, kernel_size=(3, 3), activation='relu'),
#     # Conv2D(filters=32, kernel_size=(3, 3), padding='same', strides=1, activation='relu'),

#     # MaxPooling2D downsamples the feature map, reducing dimensionality and making
#     # the model more robust to variations in the position of features.
#     MaxPooling2D(pool_size=(2, 2)),

#     # --- Convolutional Block 2 ---
#     # We add another Conv2D layer to learn more complex patterns from the features
#     # detected by the first layer. It's common to increase the number of filters.
#     Conv2D(filters=64, kernel_size=(3, 3), activation='relu'),
#     MaxPooling2D(pool_size=(2, 2)),


#     # The Flatten layer converts the 2D feature maps from the final pooling layer
#     # into a 1D vector, preparing the data for the final classification layers.
#     Flatten(),

#     # --- Classification Head ---
#     # A Dropout layer is added for regularisation. It randomly sets a fraction of
#     # input units to 0 at each update during training, which helps prevent overfitting.
#     Dropout(0.5), # (50% of the neurons will be dropped)
#     # This Dense layer interprets the features extracted by the convolutional layers.
#     Dense(units=128, activation='relu'),
    
#     # This is our output layer. It remains the same, with 10 neurons for digits 0-9
#     # and 'softmax' to provide a probability distribution.
#     Dense(units=10, activation='softmax')
# ], name="CNN_Base")

# # We can print a summary of the model's architecture to see the layers and parameter counts.
# print("\n--- Model Architecture ---")
# mnist_model_cnn.summary()

# # for n in range(len(mnist_model_cnn.layers)):
# #     print(f"MNIST model layer[{n}: {mnist_model_cnn.layers[n]}]")

# # --- 2. Adapt the Normalisation Layer ---
# # Before training, we must let the Normalization layer calculate the mean and
# # variance of our training data. The .adapt() method does this for us.
# print("Adapting the normalisation layer to the training data...")
# mnist_model_cnn.layers[0].adapt(X_train) 
# print("Adaptation complete.")


# # --- 3. Configure and Compile the Model ---
# # We configure the optimiser and the list of metrics we want to track.
# adam_optimizer: Adam = Adam(learning_rate=0.001)
# metrics_list: list = ['accuracy']  # Precision() and Recall() are omitted: the Keras versions expect binary or one-hot labels, not the integer labels used here.

# # The compile step brings everything together and prepares the model for training.
# mnist_model_cnn.compile(
#     optimizer=adam_optimizer,
#     loss='sparse_categorical_crossentropy', # Best for integer labels in multi-class classification.
#     metrics=metrics_list
# )

# # Define the filepath for the saved model.
# # The placeholders {epoch:02d} and {val_accuracy:.4f} will be automatically filled in.
# folder_name = 'CNN_Models'
# filepath = f'{folder_name}/best_{mnist_model_cnn.name}_epoch-{{epoch:02d}}_val_acc-{{val_accuracy:.4f}}.keras'

# # Create a ModelCheckpoint callback so we can save the best model during training. Note: every epoch the model is evaluated
# # on validation accuracy. If the validation accuracy improves, the model will be saved. If not, it will not be saved.
# model_checkpoint_callback = ModelCheckpoint(
#     filepath=filepath,
#     monitor='val_accuracy',      # Monitor the validation accuracy
#     mode='max',                  # Higher is better for accuracy. 'min' would keep the lowest value instead, which suits a loss.
#     save_best_only=True,         # Only save the model if `val_accuracy` has improved
#     verbose=1                    # Print a message when the model is saved
# )

# # --- 4. Train the Model ---
# # This is where the learning happens. The .fit() method trains the model on our data.
# # The 'history' object will store the loss and metric values for each epoch.
# history_cnn: History = mnist_model_cnn.fit(
#     X_train,
#     Y_train,
#     epochs=5,
#     validation_split=0.1, # We hold back 10% of the training data to validate performance.
#     batch_size=64,
#     verbose=1, # We set verbose=1 to see the training progress bar.
#     callbacks=[model_checkpoint_callback],  # We add the callback here in the fit method so that we can save the best model during training.
# )

In [None]:
def create_cnn_model() -> Sequential:
    """
    Defines and returns the base CNN model architecture.
    """
    model = Sequential([
        # Preprocessing layers
        Normalization(input_shape=(28, 28, 1)),
        
        # --- Convolutional Block 1 ---
        Conv2D(filters=32, kernel_size=(3, 3), activation='relu'),
        MaxPooling2D(pool_size=(2, 2)),

        # --- Convolutional Block 2 ---
        Conv2D(filters=64, kernel_size=(3, 3), activation='relu'),
        MaxPooling2D(pool_size=(2, 2)),
        
        # --- Classification Head ---
        Flatten(),
        Dropout(0.5),
        Dense(units=128, activation='relu'),
        Dense(units=10, activation='softmax')
    ])
    return model
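
The shapes and trainable parameter counts of this architecture can be worked out by hand. The sketch below assumes the Keras defaults used above: unpadded (`'valid'`) 3x3 convolutions with stride 1, and 2x2 max pooling with stride 2. The helper functions are illustrative and not part of Keras. The `Normalization` layer is left out because its statistics are not trainable.

```python
def conv_out(size: int, kernel: int, stride: int = 1) -> int:
    """Output width/height of an unpadded ('valid') convolution."""
    return (size - kernel) // stride + 1


def pool_out(size: int, pool: int) -> int:
    """Output width/height of max pooling with stride equal to the pool size."""
    return size // pool


size = 28
size = conv_out(size, 3)      # 26 after Conv2D(32)
size = pool_out(size, 2)      # 13 after MaxPooling2D
size = conv_out(size, 3)      # 11 after Conv2D(64)
size = pool_out(size, 2)      # 5 after MaxPooling2D
flattened = size * size * 64  # 5 * 5 * 64 = 1600 features after Flatten

# Parameters = (weights per output unit + 1 bias) * number of output units.
conv1_params = (3 * 3 * 1 + 1) * 32    # 320
conv2_params = (3 * 3 * 32 + 1) * 64   # 18,496
dense1_params = (flattened + 1) * 128  # 204,928
dense2_params = (128 + 1) * 10         # 1,290
total_trainable = conv1_params + conv2_params + dense1_params + dense2_params

print(f"Flattened features: {flattened}")
print(f"Trainable parameters: {total_trainable:,}")
```

Most of the 225,034 trainable parameters sit in the first `Dense` layer, not in the convolutional blocks. These numbers can be compared against the output of `model.summary()`.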

In [None]:
# --- Define Hyperparameter Set for the Base CNN ---
cnn_exp_1_config = {
    "model_name": "CNN_Base",
    "optimiser": "Adam",
    "learning_rate": 0.01,
    "epochs": 70,
    "batch_size": 64
}
cnn_exp_2_config = {
    "model_name": "CNN_Slow_Learn",
    "optimiser": "Adam",
    "learning_rate": 0.001,
    "epochs": 70,
    "batch_size": 64
}

cnn_exp_3_config = {
    "model_name": "CNN_SGD",
    "optimiser": "SGD",
    "learning_rate": 0.01,
    "epochs": 70,
    "batch_size": 64
}

cnn_exp_4_config = {
    "model_name": "CNN_SGD_Slow_Learn",
    "optimiser": "SGD",
    "learning_rate": 0.001,
    "epochs": 70,
    "batch_size": 64
}
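
The four dictionaries above differ only in optimiser and learning rate, so the same grid could also be generated programmatically. This is a sketch of one possible alternative, not the approach used in this notebook. The naming scheme is hypothetical and differs from the hand-written model names.

```python
from itertools import product

optimisers = ["Adam", "SGD"]
learning_rates = [0.01, 0.001]

configs = [
    {
        # Hypothetical naming scheme; the dot is replaced so the name is filename-friendly.
        "model_name": f"CNN_{optimiser}_lr_{learning_rate:g}".replace(".", "p"),
        "optimiser": optimiser,
        "learning_rate": learning_rate,
        "epochs": 70,
        "batch_size": 64,
    }
    for optimiser, learning_rate in product(optimisers, learning_rates)
]

for config in configs:
    print(config["model_name"], config["optimiser"], config["learning_rate"])
```

Each generated dictionary has the same keys as the hand-written ones, so it could be passed as `hyperparameters` in the same way.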

In [None]:
# --- Run the CNN experiment ---
cnn_histories = []
cnn_history_1 = run_experiment(
    model_creation_func=create_cnn_model, 
    hyperparameters=cnn_exp_1_config,
    parent_folder='CNN_Models',
    X_train=X_train, 
    Y_train=Y_train
)
cnn_histories.append(cnn_history_1)

cnn_history_2 = run_experiment(
    model_creation_func=create_cnn_model, 
    hyperparameters=cnn_exp_2_config,
    parent_folder='CNN_Models',
    X_train=X_train, 
    Y_train=Y_train
)
cnn_histories.append(cnn_history_2)

cnn_history_3 = run_experiment(
    model_creation_func=create_cnn_model, 
    hyperparameters=cnn_exp_3_config,
    parent_folder='CNN_Models',
    X_train=X_train, 
    Y_train=Y_train
)
cnn_histories.append(cnn_history_3)

cnn_history_4 = run_experiment(
    model_creation_func=create_cnn_model, 
    hyperparameters=cnn_exp_4_config,
    parent_folder='CNN_Models',
    X_train=X_train, 
    Y_train=Y_train
)
cnn_histories.append(cnn_history_4)


In [None]:
# --- Define the Plotting Function ---
def plot_training_history(history: History):
    # summarize history for accuracy
    plt.plot(history.history['accuracy']) # the train accuracy
    plt.plot(history.history['val_accuracy'])
    plt.title(f'{history.model.name} model accuracy')
    plt.ylabel('accuracy')
    plt.xlabel('epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.show()

    # summarize history for loss
    plt.plot(history.history['loss']) # the train loss
    plt.plot(history.history['val_loss'])
    plt.title(f'{history.model.name} model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 'validation'], loc='upper right')
    plt.show()

### Plot the results of every epoch

In [None]:
def print_training_histories(histories):
    for history in histories:
        plot_training_history(history)
        print("-"*100)

print_training_histories(mlp_histories)
print_training_histories(cnn_histories)
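
Alongside the plots, the same per-epoch values can be summarised numerically. The sketch below works on a plain dictionary shaped like `history.history`. The helper `best_epoch` and the example values are made up for illustration.

```python
def best_epoch(history_dict: dict, metric: str = "val_accuracy") -> tuple:
    """Return the 1-based epoch with the highest value of `metric`, and that value."""
    values = history_dict[metric]
    best_index = max(range(len(values)), key=values.__getitem__)
    return best_index + 1, values[best_index]


# Invented values in the same layout as `history.history`.
example_history = {
    "accuracy": [0.91, 0.96, 0.98, 0.99],
    "val_accuracy": [0.95, 0.97, 0.985, 0.981],
}

epoch, value = best_epoch(example_history)
final_gap = example_history["accuracy"][-1] - example_history["val_accuracy"][-1]

print(f"Best val_accuracy {value:.4f} at epoch {epoch}")
print(f"Final train/validation gap: {final_gap:.4f}")
```

In this example, training accuracy keeps rising while validation accuracy peaks at epoch 3 and then falls. A widening gap between the two is the usual sign of overfitting to look for in the plots.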

## Testing the models on the held-out test set
We evaluate each model on the test data, which it has never seen before, to estimate its real-world accuracy. We expect the test accuracy to stay close to the validation accuracy, because the MNIST images are very clean and simple:
- They are small (only 28 x 28 pixels).
- The digits are centred and normalised in size.
- The background is a solid colour with no distracting noise.
  
Because of this simplicity, the patterns that differentiate one digit from another (e.g., a "1" is a vertical line, an "8" is two loops) are very strong and easy for our model to learn.

First, we define a function that browses a folder of saved models, finds the file with the highest validation accuracy in its name, loads it and evaluates it on the held-out `X_test` and `Y_test`.

A function is convenient because we will use it on different models with different hyperparameters, which avoids repetition.
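
The filename-parsing step can be tried on its own before touching the file system. The sketch below applies the same regular expression to a few made-up filenames.

```python
import re

# Captures the number between "val_acc-" and the ".keras" extension.
pattern = re.compile(r"val_acc-([\d.]+)\.keras")

# Invented filenames in the format produced by the checkpoint callback.
filenames = [
    "best_CNN_Base_epoch-03_val_acc-0.9871.keras",
    "best_CNN_Base_epoch-12_val_acc-0.9913.keras",
    "best_CNN_SGD_epoch-40_val_acc-0.9890.keras",
    "training_log.txt",  # does not match the pattern and is skipped
]

scored = [
    (float(match.group(1)), name)
    for name in filenames
    if (match := pattern.search(name))
]
best_accuracy, best_file = max(scored)

print(f"Best file: {best_file} (val_accuracy={best_accuracy})")
```

Because the accuracy in the filename is rounded to four decimal places, two checkpoints can tie. Which one wins then depends on how ties are broken, not on any real difference in quality.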

In [None]:
def find_load_and_analyse_best_model(
    parent_folder: str,  # Directory whose subfolders hold the saved models
    x_test_data: NDArray[np.float32], 
    y_test_data: NDArray[np.int_]
) -> Tuple[tf.keras.Model | None, float | None, float | None]:
    """
    Recursively searches through all subfolders in a parent directory to find the
    single best Keras model, then loads and analyses it.
    """
    best_model_path = None  # Full path to the best model found so far
    best_val_accuracy = -1.0

    pattern = re.compile(r"val_acc-([\d.]+)\.keras")

    if not os.path.isdir(parent_folder):
        print(f"Error: Parent directory not found at '{parent_folder}'")
        return None, None, None

    # --- Use os.walk() to search through all subdirectories ---
    # os.walk() traverses the directory tree top-down.
    for dirpath, _, filenames in os.walk(parent_folder):
        for filename in filenames:
            match = pattern.search(filename)
            if match:
                val_accuracy = float(match.group(1))
                if val_accuracy > best_val_accuracy:
                    best_val_accuracy = val_accuracy
                    # Construct and store the full path to this new best model
                    best_model_path = os.path.join(dirpath, filename)
    
    # Load and analyse the best model, if one was found.
    if best_model_path:
        print(f"Found and loading best model across all experiments: {best_model_path}")
        
        loaded_model = tf.keras.models.load_model(best_model_path)
        
        # --- Print Compiled Hyperparameters ---
        print("\n--- Key Hyperparameters ---")
        # Gets the configuration of the model's optimiser.
        optimiser_config = loaded_model.optimizer.get_config()
        optimiser_name = optimiser_config['name']
        learning_rate = optimiser_config['learning_rate']
        
        # Gets the name of the loss function the model was compiled with.
        loss_function = loaded_model.loss
        
        print(f"{'Optimiser:':<20} {optimiser_name}")
        print(f"{'Learning Rate:':<20} {learning_rate}")
        print(f"{'Loss Function:':<20} {loss_function}")
        
        # Prints a summary table of the model's architecture.
        print("\n--- Best Model Summary (Architecture) ---")
        loaded_model.summary()

        # Evaluates the loaded model's performance on the unseen test data.
        print("\n--- Evaluating model performance on the test set ---")
        loss, accuracy = loaded_model.evaluate(x_test_data, y_test_data, verbose=1)
        
        # Prints the final evaluation results, formatted to 4 decimal places.
        print(f"\nTest Set Loss: {loss:.4f}")
        print(f"Test Set Accuracy: {accuracy:.4f}")

        # --- Generate Detailed Performance Analysis ---
        print("\n--- Detailed Analysis ---")
        
        # Use the model to predict the class for each image in the test set.
        y_pred_probabilities = loaded_model.predict(x_test_data)
        # The model outputs probabilities; we use np.argmax to find the class with the highest probability.
        y_pred = np.argmax(y_pred_probabilities, axis=1)

        # Generate and print a text report showing precision, recall, and f1-score for each digit.
        print("\n--- Classification Report ---")
        report = classification_report(y_test_data, y_pred, target_names=[str(i) for i in range(10)])
        print(report)

        # Generate and plot a confusion matrix to visualise which digits are being confused.
        print("\n--- Confusion Matrix ---")
        cm = confusion_matrix(y_test_data, y_pred)
        plt.figure(figsize=(10, 8))
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
        plt.xlabel('Predicted Label')
        plt.ylabel('Actual Label')
        plt.title(f'Confusion Matrix for {loaded_model.name}')
        plt.show()
        
        # Returns the loaded model object and its performance metrics for potential further use.
        return loaded_model, accuracy, loss
    else:
        # If no model files matching the pattern were found, print a message and return nothing.
        print(f"No model files found in any subfolders of '{parent_folder}'.")
        return None, None, None
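
The detailed analysis in the function relies on two steps: turning predicted probabilities into class labels with an argmax, and tallying a confusion matrix. The sketch below reproduces both in plain Python for a made-up three-class problem. The function above uses `np.argmax` and scikit-learn's `confusion_matrix` instead.

```python
# Invented predicted probabilities for four samples and three classes.
probabilities = [
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
]
y_true = [1, 0, 2, 1]

# Argmax: the predicted class is the index of the highest probability.
y_pred = [max(range(len(row)), key=row.__getitem__) for row in probabilities]

# Confusion matrix: rows are actual classes, columns are predicted classes.
num_classes = 3
cm = [[0] * num_classes for _ in range(num_classes)]
for actual, predicted in zip(y_true, y_pred):
    cm[actual][predicted] += 1

# Accuracy is the sum of the diagonal divided by the number of samples.
accuracy = sum(cm[i][i] for i in range(num_classes)) / len(y_true)

print("Predicted labels:", y_pred)
for row in cm:
    print(row)
print(f"Accuracy: {accuracy:.2f}")
```

The off-diagonal entry in row 1, column 0 records the one sample of class 1 that was predicted as class 0. The heatmap is read the same way, with actual labels on the vertical axis and predicted labels on the horizontal axis.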

### Testing the MLP model

In [None]:
model_folder="MLP_Models"
# To capture the output, assign it to variables
best_model, test_acc, test_loss = find_load_and_analyse_best_model(
    parent_folder=model_folder,
    x_test_data=X_test,
    y_test_data=Y_test
)



### Testing the CNN model

In [None]:
model_folder='CNN_Models'
# To capture the output, assign it to variables
best_model, test_acc, test_loss = find_load_and_analyse_best_model(
    parent_folder=model_folder,
    x_test_data=X_test,
    y_test_data=Y_test
)




# Future to-dos (not part of this assessment)
- Implement KerasTuner to automatically train and test models across a range of hyperparameters, optimisers and loss functions.

We first need to install it: `uv pip install keras-tuner`
import keras_tuner

def build_model(hp):
    """This is our hypermodel, which defines the search space."""
    
    model = Sequential(name="Tuned_MLP")
    model.add(Input(shape=(28, 28)))
    model.add(Normalization())
    model.add(Flatten())
    
    # --- Define Hyperparameters to Tune ---
    # Tune the number of units in the first Dense layer
    hp_units = hp.Int('units', min_value=32, max_value=512, step=32)
    model.add(Dense(units=hp_units, activation='relu'))
    
    # Tune the learning rate
    hp_learning_rate = hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])
    
    # Add the output layer
    model.add(Dense(10, activation='softmax'))

    # --- Compile the model inside the function ---
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=hp_learning_rate),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    
    return model

# --- Set up the Tuner ---
# We'll use RandomSearch, which randomly tries combinations.
tuner = keras_tuner.RandomSearch(
    build_model,
    objective='val_accuracy',
    max_trials=10,  # The total number of model variations to test
    executions_per_trial=2, # The number of times to train each model variation
    directory='tuning_results',
    project_name='MNIST_Tuning'
)

# --- Start the Search ---
# This is like model.fit(), but it runs the whole tuning process.
tuner.search(X_train, Y_train, epochs=10, validation_split=0.1)

# --- Get the Best Model ---
best_model = tuner.get_best_models(num_models=1)[0]
best_hyperparameters = tuner.get_best_hyperparameters(num_trials=1)[0]

print("\n--- Best Hyperparameters Found ---")
print(best_hyperparameters.values)

print("\n--- Evaluating the Best Model Found by the Tuner ---")
best_model.evaluate(X_test, Y_test)
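
A rough budget for the search above can be estimated before running it. The sketch below assumes that `hp.Int` includes both `min_value` and `max_value`, and that every trial runs for the full number of epochs. The variable names are illustrative.

```python
# Search space defined in build_model (assuming inclusive bounds for hp.Int).
units_options = list(range(32, 512 + 1, 32))  # 32, 64, ..., 512
learning_rate_options = [1e-2, 1e-3, 1e-4]
search_space_size = len(units_options) * len(learning_rate_options)

# Tuner settings from the RandomSearch call and tuner.search().
max_trials = 10
executions_per_trial = 2
epochs = 10

training_runs = max_trials * executions_per_trial
total_epochs = training_runs * epochs
coverage = max_trials / search_space_size

print(f"Possible combinations: {search_space_size}")
print(f"Training runs: {training_runs} ({total_epochs} epochs in total)")
print(f"Share of the search space explored: at most {coverage:.0%}")
```

Under these assumptions, the tuner trains 20 models and tries at most 10 of the 48 possible combinations. Raising `max_trials` widens coverage, while raising `executions_per_trial` only reduces noise in each estimate.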