In [None]:
pip install tensorflow matplotlib numpy


Note: you may need to restart the kernel to use updated packages.


# --------------------------------------------------
# **Loading and preprocessing the MNIST dataset**
# --------------------------------------------------

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

data = np.load("/Users/trippy/Library/CloudStorage/OneDrive-montclair.edu/Semester 2/Machine Learning/Assignments/Code/mnist.npz")

# Images and labels
x_train, y_train = data["x_train"], data["y_train"]
x_test, y_test = data["x_test"], data["y_test"]

# Converting pixel values from [0, 255] to [0, 1]
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

# Adding channel dimension so each image is (28, 28, 1) instead of (28, 28)
x_train = x_train[..., tf.newaxis]
x_test = x_test[..., tf.newaxis]

print("Train images:", x_train.shape)
print("Test  images:", x_test.shape)
print("Train labels:", y_train.shape)
print("Test  labels:", y_test.shape)

Train images: (60000, 28, 28, 1)
Test  images: (10000, 28, 28, 1)
Train labels: (60000,)
Test  labels: (10000,)


In the above cell, I loaded the MNIST dataset from a local mnist.npz file instead of downloading it from the internet. Training and test images (x_train, x_test) and labels (y_train, y_test) are extracted using NumPy.

The images are then converted from integer pixel values in the range [0, 255] to floating-point values in [0, 1] by dividing by 255. Finally, I add a channel dimension so that each image has shape (28, 28, 1). The convolutional layers require this channel axis, and the perceptron models accept the same shape because their first layer flattens each image to 784 values.

The printed shapes confirm that there are 60,000 training images and 10,000 test images, each with the correct dimensions and corresponding labels.
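
As a quick sanity check of the two preprocessing steps, the sketch below applies them to a small random array instead of the real data. The array `fake_images` is a hypothetical stand-in for `x_train`, and `np.newaxis` is used in place of `tf.newaxis` (both are aliases for `None`), so TensorFlow is not needed to run it.

```python
import numpy as np

# Hypothetical stand-in for the real data: 4 random "images" with uint8 pixels.
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(4, 28, 28), dtype=np.uint8)

# Same two steps as in the cell above: scale to [0, 1], then add a channel axis.
scaled = fake_images.astype("float32") / 255.0
with_channel = scaled[..., np.newaxis]

print(with_channel.shape, with_channel.dtype)
print(float(with_channel.min()), float(with_channel.max()))
```

The shape gains a trailing 1 and every value stays inside [0, 1], which is what the printed shapes above show for the full dataset.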


# --------------------------------------------------

# **Defining three-layer and five-layer perceptron architectures**
This cell defines two helper functions, build_mlp_3 and build_mlp_5, which construct Multi-Layer Perceptron (MLP) models with different depths.

# --------------------------------------------------

In [None]:


def build_mlp_3(activation="relu"):
    """
    Three-layer perceptron for MNIST:
    - Input → Flatten(28x28x1) to 784
    - Hidden layer 1: 256 units
    - Hidden layer 2: 128 units
    - Output layer: 10 units (one per digit class) with softmax
    """
    return keras.Sequential([
        layers.Flatten(input_shape=(28, 28, 1)),
        layers.Dense(256, activation=activation),
        layers.Dense(128, activation=activation),
        layers.Dense(10, activation="softmax"),
    ])

def build_mlp_5(activation="relu"):
    """
    Five-layer perceptron for MNIST:
    - More hidden layers and units than the 3-layer version
      so we can see the effect of increased depth.
    """
    return keras.Sequential([
        layers.Flatten(input_shape=(28, 28, 1)),
        layers.Dense(512, activation=activation),
        layers.Dense(256, activation=activation),
        layers.Dense(128, activation=activation),
        layers.Dense(64, activation=activation),
        layers.Dense(10, activation="softmax"),
    ])

The two builders differ only in the number and width of their hidden layers:

build_mlp_3(activation=...) creates a three-layer perceptron:

	Flatten input (28×28×1) → 784,
	Hidden layer with 256 units,
	Hidden layer with 128 units,
	Output layer with 10 units and softmax activation for digit classification.

build_mlp_5(activation=...) creates a deeper five-layer perceptron:
	Flatten input,
	Hidden layers with 512, 256, 128, and 64 units,
	Output layer with 10 units and softmax.

Both functions accept a choice of activation function ("relu" or "sigmoid") for all hidden layers. These builders allow me to easily change the depth and activation of the perceptron for Parts A.1 to A.4.
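
To get a feel for how much larger the five-layer model is, the trainable parameters of both perceptrons can be counted by hand. This is a minimal sketch in plain Python; the helper `dense_param_count` is my own illustration and not part of Keras. It only assumes that each Dense layer has one weight per input-output pair plus one bias per output unit.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )

# Layer widths taken from build_mlp_3 and build_mlp_5 (784 = 28 * 28 flattened pixels).
mlp_3_params = dense_param_count([784, 256, 128, 10])
mlp_5_params = dense_param_count([784, 512, 256, 128, 64, 10])

print(mlp_3_params)  # 235146
print(mlp_5_params)  # 575050
```

The five-layer model has roughly 2.4 times as many parameters, most of them in its first 784-to-512 layer. Calling `model.summary()` on the built models should report the same totals.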

# --------------------------------------------------
# **Preparation – Training and evaluation helper for all models (Parts A and B)**
# --------------------------------------------------

In [None]:
import time

def run_model(model, optimizer, epochs=5):
    """
    Compiles the given model, trains it on MNIST,
    and returns (test_accuracy, running_time_in_seconds).
    """

    # Setting loss and metric for classification
    model.compile(
        optimizer=optimizer,
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )

    start = time.time()

    # Train on training data, keeping 10% for validation
    model.fit(
        x_train, y_train,
        epochs=epochs,
        batch_size=128,
        validation_split=0.1,
        verbose=1,
    )

    # Evaluate on test set
    test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)

    end = time.time()
    elapsed = end - start

    return test_acc, elapsed

For any given model and optimizer, the function:
	1.	Compiles the model with:
	•	Loss: sparse_categorical_crossentropy (for integer digit labels 0–9),
	•	Metric: accuracy.
	2.	Trains the model on the training set for a chosen number of epochs (batch size 128), with 10% of the training data used as validation.
	3.	Evaluates the trained model on the test set to compute the final test accuracy.
	4.	Measures total running time (training + evaluation) using time.time().

It returns a pair: (test_accuracy, running_time_in_seconds).
This function is reused across all perceptron and CNN experiments so that the results are comparable in Parts A and B.
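
Two details of run_model can be checked with a few lines of plain Python: how many gradient steps one epoch contains, and what the loss measures for a single image. The sketch below is an illustration only. The probability vector `probs` is made up, and the function `sparse_categorical_crossentropy` is a hand-written single-sample version, not the Keras implementation.

```python
import math

# validation_split=0.1 holds out the last 10% of the 60,000 training samples.
n_train = 60_000
n_val = n_train // 10
n_fit = n_train - n_val
steps_per_epoch = math.ceil(n_fit / 128)  # batch_size=128
print(n_fit, steps_per_epoch)  # 54000 422

def sparse_categorical_crossentropy(probs, label):
    """Negative log-probability assigned to the true integer label."""
    return -math.log(probs[label])

# Made-up softmax output for one image: most of the mass is on class 2.
probs = [0.05, 0.05, 0.70, 0.05, 0.05, 0.02, 0.02, 0.02, 0.02, 0.02]
loss_if_label_is_2 = sparse_categorical_crossentropy(probs, 2)
loss_if_label_is_5 = sparse_categorical_crossentropy(probs, 5)
print(round(loss_if_label_is_2, 3), round(loss_if_label_is_5, 3))
```

The 422 steps match the `422/422` counter in the training logs of the experiments. The loss is small when the model puts high probability on the true digit and large when it does not, and because the labels are plain integers, no one-hot encoding is needed.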

# --------------------------------------------------
# ***PART A – MULTI-LAYER PERCEPTRON***
# --------------------------------------------------

# --------------------------------------------------
**A.1 – Three-layer perceptron with ReLU activation and Adam optimizer**

In this experiment (Part A.1), I train a three-layer perceptron using:
	•	Hidden layers: 256 and 128 units,
	•	ReLU activation in the hidden layers,
	•	Adam optimizer,
	•	5 training epochs.

In [None]:
mlp_relu_adam = build_mlp_3(activation="relu")

acc_relu_adam, time_relu_adam = run_model(
    mlp_relu_adam,
    optimizer="adam",
    epochs=5,
)

print("Three-layer MLP (ReLU + Adam) – test accuracy:", acc_relu_adam)
print("Three-layer MLP (ReLU + Adam) – running time (s):", time_relu_adam)

Epoch 1/5
422/422 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.9152 - loss: 0.2923 - val_accuracy: 0.9670 - val_loss: 0.1157
Epoch 2/5
422/422 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.9678 - loss: 0.1090 - val_accuracy: 0.9730 - val_loss: 0.0843
Epoch 3/5
422/422 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.9789 - loss: 0.0702 - val_accuracy: 0.9765 - val_loss: 0.0784
Epoch 4/5
422/422 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.9849 - loss: 0.0496 - val_accuracy: 0.9788 - val_loss: 0.0726
Epoch 5/5
422/422 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.9882 - loss: 0.0371 - val_accuracy: 0.9807 - val_loss: 0.0721
Three-layer MLP (ReLU + Adam) – test accuracy: 0.9787999987602234
Three-layer MLP (ReLU + Adam) – running time (s): 3.0660860538482666


Test accuracy ≈ 0.9788 (97.88%)
Running time ≈ 3.07 seconds

This shows that a relatively shallow MLP with ReLU and Adam already performs very well on MNIST. It serves as a strong baseline for comparing other activation functions and optimizers in later parts of Part A.

# --------------------------------------------------
**A.2 – Three-layer perceptron with Sigmoid activation and Adam optimizer**

In Part A.2, I keep the same three-layer perceptron architecture but change the hidden-layer activation function from ReLU to Sigmoid, while still using the Adam optimizer and 5 training epochs.

In [None]:
mlp_sigmoid_adam = build_mlp_3(activation="sigmoid")

acc_sigmoid_adam, time_sigmoid_adam = run_model(
    mlp_sigmoid_adam,
    optimizer="adam",
    epochs=5,
)

print("Three-layer MLP (Sigmoid + Adam) – test accuracy:", acc_sigmoid_adam)
print("Three-layer MLP (Sigmoid + Adam) – running time (s):", time_sigmoid_adam)

Epoch 1/5
422/422 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.8403 - loss: 0.6299 - val_accuracy: 0.9348 - val_loss: 0.2397
Epoch 2/5
422/422 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.9289 - loss: 0.2443 - val_accuracy: 0.9507 - val_loss: 0.1713
Epoch 3/5
422/422 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.9450 - loss: 0.1851 - val_accuracy: 0.9620 - val_loss: 0.1367
Epoch 4/5
422/422 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.9573 - loss: 0.1465 - val_accuracy: 0.9682 - val_loss: 0.1157
Epoch 5/5
422/422 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.9656 - loss: 0.1191 - val_accuracy: 0.9707 - val_loss: 0.1007
Three-layer MLP (Sigmoid + Adam) – test accuracy: 0.963699996471405
Three-layer MLP (Sigmoid + Adam) – running time (s): 3.3703339099884033


Test accuracy ≈ 0.9637 (96.37%)
Running time ≈ 3.37 seconds

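Compared to the ReLU network in A.1, the Sigmoid network is less accurate (96.37% vs 97.88%). It was also about 0.3 s slower, but a gap that small from a single run may be timing noise rather than a real difference.

This suggests that ReLU is a better activation function than Sigmoid for this task, likely because ReLU does not saturate for positive inputs, whereas the Sigmoid gradient is at most 0.25 and shrinks towards zero for large inputs, which slows learning in the early epochs.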
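
The saturation argument can be illustrated numerically. The sketch below compares the derivative of the Sigmoid with that of ReLU at a few input values. The function names are my own, and this is an illustration of the general mechanism, not a measurement taken from the trained networks.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s), which peaks at 0.25 for z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

for z in (0.0, 2.0, 5.0):
    print(z, round(sigmoid_grad(z), 4), relu_grad(z))

# With two sigmoid hidden layers, the activation derivatives alone scale the
# backpropagated signal by at most 0.25 * 0.25.
print(sigmoid_grad(0.0) ** 2)  # 0.0625
```

For positive inputs the ReLU derivative stays at 1, while the Sigmoid derivative never exceeds 0.25 and drops quickly as the input grows, so gradients reaching the first hidden layer tend to be smaller in the Sigmoid network.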

# --------------------------------------------------
**A.3 – Three-layer perceptron with ReLU activation and SGD optimizer**

In Part A.3, I return to the three-layer perceptron with ReLU activation, but change the optimizer from Adam to SGD (Stochastic Gradient Descent) with a learning rate of 0.01. The model is trained for 5 epochs.

In [None]:
mlp_relu_sgd = build_mlp_3(activation="relu")

acc_relu_sgd, time_relu_sgd = run_model(
    mlp_relu_sgd,
    optimizer=keras.optimizers.SGD(learning_rate=0.01),
    epochs=5,
)

print("Three-layer MLP (ReLU + SGD) – test accuracy:", acc_relu_sgd)
print("Three-layer MLP (ReLU + SGD) – running time (s):", time_relu_sgd)

Epoch 1/5
422/422 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.7165 - loss: 1.2302 - val_accuracy: 0.8823 - val_loss: 0.5597
Epoch 2/5
422/422 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.8719 - loss: 0.5048 - val_accuracy: 0.9117 - val_loss: 0.3569
Epoch 3/5
422/422 ━━━━━━━━━━━━━━━━━━━━ 0s 944us/step - accuracy: 0.8926 - loss: 0.3914 - val_accuracy: 0.9183 - val_loss: 0.3003
Epoch 4/5
422/422 ━━━━━━━━━━━━━━━━━━━━ 0s 991us/step - accuracy: 0.9041 - loss: 0.3439 - val_accuracy: 0.9247 - val_loss: 0.2687
Epoch 5/5
422/422 ━━━━━━━━━━━━━━━━━━━━ 0s 946us/step - accuracy: 0.9106 - loss: 0.3149 - val_accuracy: 0.9297 - val_loss: 0.2488
Three-layer MLP (ReLU + SGD) – test accuracy: 0.919700026512146
Three-layer MLP (ReLU + SGD) – running time (s): 2.4648451805114746


Test accuracy ≈ 0.9197 (91.97%)
Running time ≈ 2.46 seconds

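Although SGD is slightly faster than Adam (2.46 s vs 3.07 s), the accuracy drops significantly, from about 97.88% with Adam to about 91.97% with SGD. The SGD validation accuracy was still rising at epoch 5, so plain SGD with a learning rate of 0.01 and no momentum had not converged within this budget. Within 5 epochs, Adam is therefore the much more effective optimizer for this architecture and dataset.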
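
One way to see why Adam can move faster early in training is to compare the two update rules on a toy problem. The sketch below is an assumption-laden illustration: the one-dimensional loss f(w) = 0.5 * c * w², the curvature `c`, and the step count are made up, and the Adam hyperparameters are the Keras defaults (learning rate 0.001, beta1 0.9, beta2 0.999, epsilon 1e-7). It does not reproduce the MNIST experiment.

```python
import math

def sgd_steps(w, grad_fn, lr=0.01, n_steps=100):
    """Plain gradient descent: the step is proportional to the raw gradient."""
    for _ in range(n_steps):
        w -= lr * grad_fn(w)
    return w

def adam_steps(w, grad_fn, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7, n_steps=100):
    """Adam: the step is the bias-corrected mean gradient divided by its RMS."""
    m = v = 0.0
    for t in range(1, n_steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Toy loss with a very flat direction (small curvature c), chosen for illustration.
c = 0.01

def grad(w):
    return c * w

w_sgd = sgd_steps(1.0, grad)
w_adam = adam_steps(1.0, grad)
print(round(w_sgd, 4), round(w_adam, 4))
```

Starting from w = 1, SGD barely moves because its step shrinks with the small gradient, whereas Adam normalizes the gradient and advances by roughly its learning rate per step. On real networks the picture is more complicated, but this per-parameter rescaling is one plausible reason for the gap observed here.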

# --------------------------------------------------
**A.4 – Five-layer perceptron using the best activation function and optimizer**

In Part A.4, I build a five-layer perceptron using the best activation function and optimizer found in Parts A.1 to A.3, which are ReLU (activation) and Adam (optimizer).

In [None]:
best_activation = "relu"
best_optimizer_name = "adam"

mlp_5_best = build_mlp_5(activation=best_activation)

if best_optimizer_name == "adam":
    chosen_optimizer = "adam"
else:
    chosen_optimizer = keras.optimizers.SGD(learning_rate=0.01)

acc_mlp5, time_mlp5 = run_model(
    mlp_5_best,
    optimizer=chosen_optimizer,
    epochs=5,
)

print("Five-layer MLP (best config) – test accuracy:", acc_mlp5)
print("Five-layer MLP (best config) – running time (s):", time_mlp5)

Epoch 1/5
422/422 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.9212 - loss: 0.2610 - val_accuracy: 0.9637 - val_loss: 0.1123
Epoch 2/5
422/422 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9716 - loss: 0.0947 - val_accuracy: 0.9760 - val_loss: 0.0856
Epoch 3/5
422/422 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9809 - loss: 0.0611 - val_accuracy: 0.9762 - val_loss: 0.0850
Epoch 4/5
422/422 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9867 - loss: 0.0427 - val_accuracy: 0.9783 - val_loss: 0.0801
Epoch 5/5
422/422 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9883 - loss: 0.0358 - val_accuracy: 0.9798 - val_loss: 0.0743
Five-layer MLP (best config) – test accuracy: 0.9794999957084656
Five-layer MLP (best config) – running time (s): 5.846661806106567


This deeper network has hidden layers with 512, 256, 128, and 64 units before the final 10-class output layer.

Training for 5 epochs results in:
Test accuracy ≈ 0.9795 (97.95%)
Running time ≈ 5.85 seconds

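Compared to the three-layer ReLU + Adam model from A.1 (97.88% accuracy, ~3.07 s), the five-layer version achieves a marginally higher accuracy (97.95%) but takes almost twice as long to train. The 0.07-point gap corresponds to about 7 of the 10,000 test images, which is small enough that a different random initialization could reverse it. Increasing depth therefore brings at most a small gain here, at a clearly higher computational cost.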
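
To judge how meaningful the difference between the two accuracies is, it helps to convert it into test images and compare it with the sampling uncertainty of a single accuracy estimate. The sketch below uses the rounded accuracies reported above and a simple binomial standard error, which is an approximation I am assuming for illustration; it ignores run-to-run variation from random initialization.

```python
import math

n_test = 10_000
acc_3_layer = 0.9788  # A.1, three-layer MLP
acc_5_layer = 0.9795  # A.4, five-layer MLP

# How many additional test images the deeper model classified correctly.
extra_correct = round((acc_5_layer - acc_3_layer) * n_test)

# Binomial standard error of one accuracy estimate on 10,000 images.
std_err = math.sqrt(acc_3_layer * (1 - acc_3_layer) / n_test)

print(extra_correct)      # 7
print(round(std_err, 4))  # 0.0014
```

The gap of 0.0007 is about half of one standard error (0.0014), so on this evidence alone the two models are hard to tell apart in accuracy.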


# --------------------------------------------------
**A.5 – Summary of observations for perceptron models**

This cell summarizes the results of all perceptron experiments in Part A:

Three-layer MLP (ReLU + Adam, A.1):    Accuracy = 97.88%, time = 3.07 s
Three-layer MLP (Sigmoid + Adam, A.2): Accuracy = 96.37%, time = 3.37 s
Three-layer MLP (ReLU + SGD, A.3):     Accuracy = 91.97%, time = 2.46 s
Five-layer MLP (ReLU + Adam, A.4):     Accuracy = 97.95%, time = 5.85 s

In [None]:
# Labels are written so that the printed lines match the recorded output below
print("Three-layer MLP – ReLU + Adam   : accuracy =", acc_relu_adam, ", time (s) =", time_relu_adam)
print("Three-layer MLP – Sigmoid + Adam: accuracy =", acc_sigmoid_adam, ", time (s) =", time_sigmoid_adam)
print("Three-layer MLP – ReLU + SGD    : accuracy =", acc_relu_sgd, ", time (s) =", time_relu_sgd)
print("Five-layer MLP – best config    : accuracy =", acc_mlp5, ", time (s) =", time_mlp5)

Three-layer MLP – ReLU + Adam   : accuracy = 0.9787999987602234 , time (s) = 3.0660860538482666
Three-layer MLP – Sigmoid + Adam: accuracy = 0.963699996471405 , time (s) = 3.3703339099884033
Three-layer MLP – ReLU + SGD    : accuracy = 0.919700026512146 , time (s) = 2.4648451805114746
Five-layer MLP – best config    : accuracy = 0.9794999957084656 , time (s) = 5.846661806106567


From these results, the main observations are:
	•	Activation functions: ReLU outperforms Sigmoid on this dataset (97.88% vs 96.37%).
	•	Optimizers: Adam is much more effective than SGD within 5 epochs (97.88% vs 91.97%).
	•	Model depth: Increasing depth from three to five layers yields only a marginal accuracy improvement (97.95% vs 97.88%) while roughly doubling training time.

These points directly address the impact of activation choice, optimizer, and depth on the performance of perceptron models.

# --------------------------------------------------
# ***PART B – CONVOLUTIONAL NEURAL NETWORKS***
# --------------------------------------------------

Defining CNN architectures:

Simple 4-layer CNN and LeNet-5

Simple 4-layer CNN (build_simple_cnn):
	•	Conv2D(32 filters, 3x3) + MaxPooling2D(2x2)
	•	Conv2D(64 filters, 3x3) + MaxPooling2D(2x2)
	•	Flatten → Dense(64) → Dense(10 with softmax)

LeNet-5 style CNN (build_lenet5):
	•	Conv2D(6 filters, 5x5) + AveragePooling2D(2x2)
	•	Conv2D(16 filters, 5x5) + AveragePooling2D(2x2)
	•	Flatten → Dense(120) → Dense(84) → Dense(10 with softmax)

LeNet-5 is a classic architecture originally proposed for handwritten digit recognition, making it a natural and efficient choice for the MNIST dataset. The version used here is "LeNet-5 style": it uses ReLU activations and "same" padding in the first convolution so that it works directly on 28x28 inputs.

These architectures are then trained and evaluated in the following Part B experiments.

In [None]:
from tensorflow.keras import layers, models

def build_simple_cnn():
    """
    Simple 4-layer CNN for MNIST:
    - Conv(32) + MaxPool
    - Conv(64) + MaxPool
    - Dense(64) + Dense(10)
    """
    model = models.Sequential([
        layers.Conv2D(32, kernel_size=3, activation="relu", input_shape=(28, 28, 1)),
        layers.MaxPooling2D(pool_size=2),
        layers.Conv2D(64, kernel_size=3, activation="relu"),
        layers.MaxPooling2D(pool_size=2),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(10, activation="softmax"),
    ])
    return model

def build_lenet5():
    """
    LeNet-5 style CNN for MNIST.
    Original LeNet-5 was proposed for handwritten digit recognition,
    so it fits this assignment very naturally.
    """
    model = models.Sequential([
        # C1: 6 feature maps, 5x5 kernel, same padding - output 28x28
        layers.Conv2D(6, kernel_size=5, activation="relu", padding="same", input_shape=(28, 28, 1)),
        # S2: average pooling → 14x14
        layers.AveragePooling2D(pool_size=2),
        # C3: 16 feature maps, 5x5 kernel, valid padding - 10x10
        layers.Conv2D(16, kernel_size=5, activation="relu"),
        # S4: average pooling → 5x5
        layers.AveragePooling2D(pool_size=2),
        # Flatten - 16 * 5 * 5 = 400 features
        layers.Flatten(),
        # F5 and F6 fully connected layers
        layers.Dense(120, activation="relu"),
        layers.Dense(84, activation="relu"),
        # Output layer for 10 digit classes
        layers.Dense(10, activation="softmax"),
    ])
    return model
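
The feature-map sizes noted in the comments above, and the resulting parameter counts, can be verified with a little arithmetic. The sketch below is plain Python with helper names of my own choosing. It assumes stride-1 convolutions and 2x2 pooling with stride 2 that rounds down, which are the Keras defaults for the layers used here.

```python
def conv_out(size, kernel, padding="valid"):
    """Spatial size after a stride-1 convolution."""
    return size if padding == "same" else size - kernel + 1

def pool_out(size, pool=2):
    """Spatial size after non-overlapping pooling (rounds down)."""
    return size // pool

def conv_params(kernel, c_in, c_out):
    return kernel * kernel * c_in * c_out + c_out

def dense_params(n_in, n_out):
    return n_in * n_out + n_out

# Simple CNN: 28 -> 26 -> 13 -> 11 -> 5, then 64 channels are flattened.
s = pool_out(conv_out(28, 3))
s = pool_out(conv_out(s, 3))
simple_flat = s * s * 64
simple_params = (conv_params(3, 1, 32) + conv_params(3, 32, 64)
                 + dense_params(simple_flat, 64) + dense_params(64, 10))

# LeNet-5 style: 28 -> 28 (same padding) -> 14 -> 10 -> 5, then 16 channels.
l = pool_out(conv_out(28, 5, padding="same"))
l = pool_out(conv_out(l, 5))
lenet_flat = l * l * 16
lenet_params = (conv_params(5, 1, 6) + conv_params(5, 6, 16)
                + dense_params(lenet_flat, 120) + dense_params(120, 84)
                + dense_params(84, 10))

print(simple_flat, simple_params)  # 1600 121930
print(lenet_flat, lenet_params)    # 400 61706
```

Both CNNs have far fewer parameters than the perceptrons from Part A, because convolutional filters are shared across all image positions. Most of the simple CNN's parameters sit in its first Dense layer after flattening.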

# --------------------------------------------------
**B.1 - Simple 4-layer CNN with Adam optimizer**

In Part B.1, I train the simple 4-layer CNN (two convolution + pooling blocks followed by two dense layers) using the Adam optimizer for 5 training epochs.

In [None]:

cnn_simple = build_simple_cnn()

acc_cnn_simple, time_cnn_simple = run_model(
    cnn_simple,
    optimizer="adam",
    epochs=5,
)

print("Simple CNN (4-layer) – test accuracy:", acc_cnn_simple)
print("Simple CNN (4-layer) – running time (s):", time_cnn_simple)

Epoch 1/5
422/422 ━━━━━━━━━━━━━━━━━━━━ 5s 10ms/step - accuracy: 0.9254 - loss: 0.2556 - val_accuracy: 0.9800 - val_loss: 0.0675
Epoch 2/5
422/422 ━━━━━━━━━━━━━━━━━━━━ 4s 10ms/step - accuracy: 0.9781 - loss: 0.0715 - val_accuracy: 0.9852 - val_loss: 0.0549
Epoch 3/5
422/422 ━━━━━━━━━━━━━━━━━━━━ 4s 10ms/step - accuracy: 0.9841 - loss: 0.0511 - val_accuracy: 0.9878 - val_loss: 0.0451
Epoch 4/5
422/422 ━━━━━━━━━━━━━━━━━━━━ 4s 10ms/step - accuracy: 0.9870 - loss: 0.0408 - val_accuracy: 0.9885 - val_loss: 0.0423
Epoch 5/5
422/422 ━━━━━━━━━━━━━━━━━━━━ 4s 10ms/step - accuracy: 0.9894 - loss: 0.0341 - val_accuracy: 0.9892 - val_loss: 0.0377
Simple CNN (4-layer) – test accuracy: 0.9872000217437744
Simple CNN (4-layer) – running time (s): 21.555065155029297


Test accuracy ≈ 0.9872 (98.72%)
Running time ≈ 21.56 seconds

This CNN achieves higher accuracy than all MLP models from Part A, demonstrating the advantage of convolutional layers in capturing local spatial patterns in image data. However, it also takes more time to train compared to the perceptron models.

# --------------------------------------------------
**B.2 - LeNet-5 CNN (chosen architecture) with Adam optimizer**

In Part B.2, I train the LeNet-5 style CNN using the Adam optimizer for 5 epochs. This architecture is chosen from the list {LeNet-5, AlexNet, VGG-16, GoogLeNet, ResNet-18} provided in the assignment, because it was originally proposed for handwritten digit recognition and is well-suited to the MNIST dataset.

In [None]:
cnn_lenet5 = build_lenet5()

acc_lenet5, time_lenet5 = run_model(
    cnn_lenet5,
    optimizer="adam",
    epochs=5,
)

print("LeNet-5 CNN – test accuracy:", acc_lenet5)
print("LeNet-5 CNN – running time (s):", time_lenet5)

Epoch 1/5
422/422 ━━━━━━━━━━━━━━━━━━━━ 3s 6ms/step - accuracy: 0.8849 - loss: 0.3962 - val_accuracy: 0.9635 - val_loss: 0.1302
Epoch 2/5
422/422 ━━━━━━━━━━━━━━━━━━━━ 2s 6ms/step - accuracy: 0.9642 - loss: 0.1175 - val_accuracy: 0.9790 - val_loss: 0.0739
Epoch 3/5
422/422 ━━━━━━━━━━━━━━━━━━━━ 3s 7ms/step - accuracy: 0.9764 - loss: 0.0777 - val_accuracy: 0.9825 - val_loss: 0.0588
Epoch 4/5
422/422 ━━━━━━━━━━━━━━━━━━━━ 3s 6ms/step - accuracy: 0.9821 - loss: 0.0584 - val_accuracy: 0.9850 - val_loss: 0.0497
Epoch 5/5
422/422 ━━━━━━━━━━━━━━━━━━━━ 3s 6ms/step - accuracy: 0.9848 - loss: 0.0475 - val_accuracy: 0.9850 - val_loss: 0.0520
LeNet-5 CNN – test accuracy: 0.98580002784729
LeNet-5 CNN – running time (s): 13.691288232803345


Test accuracy ≈ 0.9858 (98.58%)
Running time ≈ 13.69 seconds

LeNet-5 performs slightly worse in accuracy than the simple 4-layer CNN (98.58% vs 98.72%), but it trains faster (≈13.69 s vs 21.56 s). Both CNNs clearly outperform the perceptron models from Part A, confirming the strength of convolutional architectures on image classification tasks like MNIST.
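
A rough way to relate these running times to the architectures is to count multiply-accumulate operations (MACs) for one forward pass on one image. The sketch below is a back-of-the-envelope estimate under my own simplifying assumptions: it ignores pooling, activations, biases, and the backward pass, and wall-clock time also depends on how efficiently each layer type runs on the hardware.

```python
def conv_macs(out_size, kernel, c_in, c_out):
    """MACs of a convolution producing an out_size x out_size map with c_out channels."""
    return out_size * out_size * c_out * kernel * kernel * c_in

def dense_macs(sizes):
    """MACs of a stack of fully connected layers with the given widths."""
    return sum(n_in * n_out for n_in, n_out in zip(sizes, sizes[1:]))

macs = {
    "Three-layer MLP": dense_macs([784, 256, 128, 10]),
    "Five-layer MLP": dense_macs([784, 512, 256, 128, 64, 10]),
    "LeNet-5 style CNN": (conv_macs(28, 5, 1, 6) + conv_macs(10, 5, 6, 16)
                          + dense_macs([400, 120, 84, 10])),
    "Simple 4-layer CNN": (conv_macs(26, 3, 1, 32) + conv_macs(11, 3, 32, 64)
                           + dense_macs([1600, 64, 10])),
}

for name, count in macs.items():
    print(f"{name:20s} {count:>10,d}")
```

By this estimate the simple CNN needs about six times as many MACs per image as the LeNet-5 style network, mostly in its second convolution, which is consistent in direction with its longer training time. The estimate does not explain every timing, however: the LeNet-5 style network needs fewer MACs than the five-layer MLP yet trained more slowly, so per-operation efficiency clearly matters as well.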

# --------------------------------------------------
# ***PART C – PERFORMANCE COMPARISON AND ANALYSIS***
# --------------------------------------------------


# --------------------------------------------------
**C.1 - Performance comparison among perceptrons, CNNs, and best classical model (Assignment 1)**

In Part C.1, I compare three categories of models:
	1.	The best perceptron (MLP) from Part A,
	2.	The best CNN from Part B,
	3.	The best classical machine learning model from Assignment 1.

In [None]:
# Comparing MLPs, CNNs, and Assignment 1 model

print("MLP results:")
print("Three-layer MLP – ReLU + Adam   :", acc_relu_adam,    ", time (s):", time_relu_adam)
print("Three-layer MLP – Sigmoid + Adam:", acc_sigmoid_adam, ", time (s):", time_sigmoid_adam)
print("Three-layer MLP – ReLU + SGD    :", acc_relu_sgd,     ", time (s):", time_relu_sgd)
print("Five-layer MLP – best config    :", acc_mlp5,         ", time (s):", time_mlp5)

print("\nCNN results:")
print("Simple CNN (4-layer)            :", acc_cnn_simple,   ", time (s):", time_cnn_simple)
print("LeNet-5 CNN                     :", acc_lenet5,       ", time (s):", time_lenet5)


# Best test accuracy from Assignment 1 (SVM with RBF kernel), entered manually
best_assignment1_accuracy = 0.9792

print("\nBest model from Assignment 1 – accuracy:", best_assignment1_accuracy)

MLP results:
Three-layer MLP – ReLU + Adam   : 0.9787999987602234 , time (s): 3.0660860538482666
Three-layer MLP – Sigmoid + Adam: 0.963699996471405 , time (s): 3.3703339099884033
Three-layer MLP – ReLU + SGD    : 0.919700026512146 , time (s): 2.4648451805114746
Five-layer MLP – best config    : 0.9794999957084656 , time (s): 5.846661806106567

CNN results:
Simple CNN (4-layer)            : 0.9872000217437744 , time (s): 21.555065155029297
LeNet-5 CNN                     : 0.98580002784729 , time (s): 13.691288232803345

Best model from Assignment 1 – accuracy: 0.9792



From the experiments:
	•	Best MLP (Part A): five-layer perceptron with ReLU activation and Adam optimizer (A.4), accuracy ≈ 97.95%, time ≈ 5.85 s
	•	Best CNN (Part B): simple 4-layer CNN (B.1), accuracy ≈ 98.72%, time ≈ 21.56 s
	•	LeNet-5 CNN (Part B.2): accuracy ≈ 98.58%, time ≈ 13.69 s
	•	Best classical model (Assignment 1): SVM with RBF kernel, accuracy ≈ 97.92%

From these results, I observe that:
	•	The best classical model (SVM with RBF) and the best MLP have very similar accuracy (around 97.9%), showing that classical methods and MLPs can both perform strongly on MNIST.
	•	Both CNNs (simple 4-layer CNN and LeNet-5) achieve higher accuracy than the MLP and SVM, demonstrating that convolutional architectures are more effective at leveraging the spatial structure in image data.
	•	The simple 4-layer CNN has the highest test accuracy overall (≈98.72%), while LeNet-5 provides a slightly lower accuracy but better training time compared to the 4-layer CNN.

Overall, the comparison shows that CNNs are the best-performing family of models for MNIST in this assignment, followed by MLPs and classical models like SVM. CNNs offer superior accuracy at the cost of increased training time and model complexity.