## Fadhla Mohamed Mutua
## SM3201434

### 1. Introduction

This experiment aims to explore and compare the performance of two neural network architectures (Fully Connected Neural Network - FCNN and Convolutional Neural Network - CNN) on the KMNIST dataset. The focus is on tuning hyperparameters to improve model performance.

We will train each model using different optimizers (Adam, SGD, RMSprop) and evaluate their performance using accuracy and loss metrics. The best-performing model will be selected based on validation accuracy, and its confusion matrix will be analyzed.

### 2. Data Exploration

We use the KMNIST (Kuzushiji-MNIST) dataset, which contains 28x28 pixel grayscale images of handwritten cursive Japanese (hiragana) characters in 10 classes. The dataset is divided into a training set (60,000 images) and a test set (10,000 images).

#### 2.1 Data Preprocessing

The images are converted into PyTorch tensors with the `ToTensor()` transformation, which also scales pixel values to the range [0, 1]. The training set is then split into training and validation subsets using `random_split()`, with a fixed seed of 0 for reproducibility. The data is loaded in batches of 64, and the training data is reshuffled every epoch so that the model does not learn from the order of the samples.
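
The effect of a fixed seed and of the batch size can be illustrated with a small standard-library sketch. It is a stand-in for the idea behind `random_split()`, not its actual implementation, and the 20% validation fraction is a hypothetical value, since the split ratio is not stated here.

```python
import math
import random

def seeded_split(n_items, val_fraction, seed):
    """Shuffle indices 0..n_items-1 with a fixed seed and split them in two."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # same seed -> same permutation
    n_val = round(n_items * val_fraction)
    return indices[n_val:], indices[:n_val]

# 60,000 images, seed 0 and batch size 64 come from the text above;
# val_fraction=0.2 is an assumption made only for this sketch.
train_idx, val_idx = seeded_split(60_000, val_fraction=0.2, seed=0)
batches_per_epoch = math.ceil(len(train_idx) / 64)  # last batch may be partial

print(len(train_idx), len(val_idx), batches_per_epoch)  # 48000 12000 750
```

Calling `seeded_split` twice with the same seed returns identical subsets, which is what makes validation results comparable across runs.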

#### 2.2 Model Architectures

`Note: For both architectures, the number of training epochs was set to 10 to accommodate the slower convergence of the SGD optimizer and of the FCNN architecture (two hidden layers).`

##### 2.2.1 FCNN

The `FCNN` consists of three fully connected layers with `ReLU` activation functions and dropout regularization. The model architecture is as follows: Input layer (784), Hidden layer 1 (128 neurons), Hidden layer 2 (64 neurons), Output layer (10 neurons for 10 classes).
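
As a rough size check, the number of trainable parameters implied by these layer widths can be computed by hand. The sketch below assumes every fully connected layer has a bias term; the helper name `dense_params` is ours. The second call uses the 256 → 128 hidden sizes listed in the results tables.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

def mlp_params(layer_sizes):
    """Per-layer parameter counts for consecutive fully connected layers."""
    return [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

# 784 = 28 * 28 flattened input pixels; 10 output classes.
per_layer = mlp_params([784, 128, 64, 10])
print(per_layer, sum(per_layer))  # [100480, 8256, 650] 109386

# Same calculation for the hidden sizes shown in the results tables.
print(sum(mlp_params([784, 256, 128, 10])))  # 235146
```

Dropout layers and `ReLU` activations add no parameters, so they do not appear in the count.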

##### 2.2.2 CNN

The `CNN` model uses two convolutional layers followed by max-pooling layers. After flattening, two fully connected layers are used. The architecture is: `Conv2D` (1, 32, 3x3), `MaxPool2D` (2x2), `Conv2D` (32, 64, 3x3), `MaxPool2D` (2x2), Fully connected layers (128 neurons).
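
The size of the flattened feature vector that feeds the first fully connected layer follows from the layer list above. The padding of the convolutions is not stated, so the sketch below works out both common cases (no padding and padding of 1) and assumes stride 1 for the convolutions and stride 2 for the 2x2 pooling.

```python
def conv_out(size, kernel, padding=0, stride=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel=2):
    """Output size of max pooling with stride equal to the kernel size."""
    return size // kernel

def flattened_features(padding, input_size=28, out_channels=64):
    """Features after Conv(3x3) -> Pool(2x2) -> Conv(3x3) -> Pool(2x2)."""
    size = pool_out(conv_out(input_size, 3, padding))
    size = pool_out(conv_out(size, 3, padding))
    return out_channels * size * size

print(flattened_features(padding=0))  # 28 -> 26 -> 13 -> 11 -> 5, so 64*5*5 = 1600
print(flattened_features(padding=1))  # 28 -> 28 -> 14 -> 14 -> 7, so 64*7*7 = 3136
```

Whichever value applies must match the input width of the first fully connected layer, otherwise the forward pass fails with a shape error.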

#### 2.3 Training 

Three optimizers are tested: `Adam`, `SGD`, and `RMSprop`. The learning rate for `Adam`, obtained through grid search, is set to 0.001; for `SGD` it is 0.1, and for `RMSprop` it is 0.001. The batch size is set to 64 and the dropout rate to 0.2. Weight decay is set to $10^{-5}$ for all optimizers except `SGD`, for which it is set to 0.
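
These settings can be collected in one place so that each run is fully described by its optimizer name. The dictionary layout and the helper `run_config` are our own illustration; `lr` and `weight_decay` are chosen as key names because they match the keyword arguments of `torch.optim.Adam`, `torch.optim.SGD` and `torch.optim.RMSprop`.

```python
# Per-optimizer values from the text above.
OPTIMIZER_SETTINGS = {
    "Adam": {"lr": 1e-3, "weight_decay": 1e-5},
    "SGD": {"lr": 0.1, "weight_decay": 0.0},
    "RMSprop": {"lr": 1e-3, "weight_decay": 1e-5},
}

# Values shared by every run.
SHARED_SETTINGS = {"batch_size": 64, "dropout": 0.2, "epochs": 10}

def run_config(optimizer_name):
    """Merge shared and optimizer-specific settings into one flat dict."""
    if optimizer_name not in OPTIMIZER_SETTINGS:
        raise ValueError(f"unknown optimizer: {optimizer_name}")
    return {
        "optimizer": optimizer_name,
        **SHARED_SETTINGS,
        **OPTIMIZER_SETTINGS[optimizer_name],
    }

for name in OPTIMIZER_SETTINGS:
    print(run_config(name))
```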

Each model is trained for 10 epochs. The training subset is used for model training, while the validation set is used to evaluate performance during training.

------------------------------------------------------

### 3. Results
#### 3.1 CNN Summary

<div style="font-size: 15px;">

| Optimizer | Convergence Speed | Overfitting Tendency    | Notes                                      |
|-----------|-------------------|----------------------|-----------------------------------------------|
| SGD       | Slower            | Moderate             | Gradual learning with stable generalization.   |
| Adam      | Fast              | High                 | Quick convergence; tends to overfit after ~4 epochs. |
| RMSprop   | Fast              | High                 | Quick convergence; tends to overfit after ~4 epochs.  |

</div>

##### 3.1.1 CNN results
<div style="font-size: 15px;">

| Epoch | Model Type | Hidden Layers | Optimizer | Learning Rate | Batch Size | Dropout | Weight Decay | Train Loss | Val Loss | Train Accuracy (%) | Val Accuracy (%) | Notes              |
|-------|------------|---------------|-----------|---------------|------------|---------|--------------|------------|----------|---------------------|-------------------|--------------------|
| 1     | CNN        | 256 → 128     | Adam      | 0.001         | 64         | 0.2     | 0.00001      | 0.346233   | 0.123885 | 89.54               | 96.10             | Optimizer test run |
| 2     | CNN        | 256 → 128     | Adam      | 0.001         | 64         | 0.2     | 0.00001      | 0.112289   | 0.092516 | 96.66               | 97.15             | Optimizer test run |
| 3     | CNN        | 256 → 128     | Adam      | 0.001         | 64         | 0.2     | 0.00001      | 0.071199   | 0.068479 | 97.79               | 97.99             | Optimizer test run |
| 4     | CNN        | 256 → 128     | Adam      | 0.001         | 64         | 0.2     | 0.00001      | 0.051989   | 0.059173 | 98.36               | 98.24             | Optimizer test run |
| 5     | CNN        | 256 → 128     | Adam      | 0.001         | 64         | 0.2     | 0.00001      | 0.042251   | 0.057165 | 98.60               | 98.32             | Optimizer test run |
| 6     | CNN        | 256 → 128     | Adam      | 0.001         | 64         | 0.2     | 0.00001      | 0.030576   | 0.052000 | 98.98               | 98.49             | Optimizer test run |
| 7     | CNN        | 256 → 128     | Adam      | 0.001         | 64         | 0.2     | 0.00001      | 0.025114   | 0.058259 | 99.16               | 98.42             | Optimizer test run |
| 8     | CNN        | 256 → 128     | Adam      | 0.001         | 64         | 0.2     | 0.00001      | 0.020003   | 0.058715 | 99.34               | 98.47             | Optimizer test run |
| 9     | CNN        | 256 → 128     | Adam      | 0.001         | 64         | 0.2     | 0.00001      | 0.020919   | 0.050999 | 99.29               | 98.56             | Optimizer test run |
| 10    | CNN        | 256 → 128     | Adam      | 0.001         | 64         | 0.2     | 0.00001      | 0.015988   | 0.058847 | 99.46               | 98.58             | Optimizer test run |
| 1     | CNN        | 256 → 128     | SGD       | 0.100         | 64         | 0.2     | 0.00000      | 0.323950   | 0.106598 | 89.66               | 96.63             | Optimizer test run |
| 2     | CNN        | 256 → 128     | SGD       | 0.100         | 64         | 0.2     | 0.00000      | 0.108676   | 0.078619 | 96.79               | 97.79             | Optimizer test run |
| 3     | CNN        | 256 → 128     | SGD       | 0.100         | 64         | 0.2     | 0.00000      | 0.083326   | 0.081923 | 97.42               | 97.43             | Optimizer test run |
| 4     | CNN        | 256 → 128     | SGD       | 0.100         | 64         | 0.2     | 0.00000      | 0.065329   | 0.088826 | 98.00               | 97.60             | Optimizer test run |
| 5     | CNN        | 256 → 128     | SGD       | 0.100         | 64         | 0.2     | 0.00000      | 0.061676   | 0.086476 | 98.12               | 97.67             | Optimizer test run |
| 6     | CNN        | 256 → 128     | SGD       | 0.100         | 64         | 0.2     | 0.00000      | 0.057780   | 0.070900 | 98.26               | 97.95             | Optimizer test run |
| 7     | CNN        | 256 → 128     | SGD       | 0.100         | 64         | 0.2     | 0.00000      | 0.050113   | 0.066999 | 98.52               | 98.38             | Optimizer test run |
| 8     | CNN        | 256 → 128     | SGD       | 0.100         | 64         | 0.2     | 0.00000      | 0.045445   | 0.059361 | 98.74               | 98.57             | Optimizer test run |
| 9     | CNN        | 256 → 128     | SGD       | 0.100         | 64         | 0.2     | 0.00000      | 0.059032   | 0.110443 | 98.40               | 97.71             | Optimizer test run |
| 10    | CNN        | 256 → 128     | SGD       | 0.100         | 64         | 0.2     | 0.00000      | 0.049820   | 0.082476 | 98.59               | 98.15             | Optimizer test run |
| 1     | CNN        | 256 → 128     | RMSprop   | 0.001         | 64         | 0.2     | 0.00001      | 0.354701   | 0.123143 | 89.07               | 96.29             | Optimizer test run |
| 2     | CNN        | 256 → 128     | RMSprop   | 0.001         | 64         | 0.2     | 0.00001      | 0.119414   | 0.071515 | 96.36               | 97.82             | Optimizer test run |
| 3     | CNN        | 256 → 128     | RMSprop   | 0.001         | 64         | 0.2     | 0.00001      | 0.078580   | 0.060124 | 97.62               | 98.12             | Optimizer test run |
| 4     | CNN        | 256 → 128     | RMSprop   | 0.001         | 64         | 0.2     | 0.00001      | 0.059117   | 0.052534 | 98.18               | 98.38             | Optimizer test run |
| 5     | CNN        | 256 → 128     | RMSprop   | 0.001         | 64         | 0.2     | 0.00001      | 0.044739   | 0.048090 | 98.61               | 98.48             | Optimizer test run |
| 6     | CNN        | 256 → 128     | RMSprop   | 0.001         | 64         | 0.2     | 0.00001      | 0.035218   | 0.052307 | 98.81               | 98.46             | Optimizer test run |
| 7     | CNN        | 256 → 128     | RMSprop   | 0.001         | 64         | 0.2     | 0.00001      | 0.028625   | 0.051062 | 99.06               | 98.43             | Optimizer test run |
| 8     | CNN        | 256 → 128     | RMSprop   | 0.001         | 64         | 0.2     | 0.00001      | 0.024825   | 0.053835 | 99.21               | 98.47             | Optimizer test run |
| 9     | CNN        | 256 → 128     | RMSprop   | 0.001         | 64         | 0.2     | 0.00001      | 0.021879   | 0.049814 | 99.30               | 98.66             | Optimizer test run |
| 10    | CNN        | 256 → 128     | RMSprop   | 0.001         | 64         | 0.2     | 0.00001      | 0.019064   | 0.050724 | 99.35               | 98.62             | Optimizer test run |

</div>

- All models tend to underfit at epochs 1–2, which is expected as the models have just started generalizing.
- Optimal performance generally occurs around epochs 3–4.
- SGD shows more stable validation performance and less overfitting in later epochs.
- Adam and RMSprop achieve high training accuracy quickly but overfit earlier.
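
One way to turn these observations into a stopping rule is patience-based early stopping on the validation loss. Early stopping was not used in this experiment; the sketch below only replays the validation losses from the CNN/Adam rows of the table, and the patience values are hypothetical.

```python
def early_stop(val_losses, patience):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses)

# Validation loss per epoch, CNN with Adam (from the table above).
cnn_adam_val_loss = [0.123885, 0.092516, 0.068479, 0.059173, 0.057165,
                     0.052000, 0.058259, 0.058715, 0.050999, 0.058847]

print(early_stop(cnn_adam_val_loss, patience=2))  # (6, 8)
print(early_stop(cnn_adam_val_loss, patience=3))  # (9, 10)
```

With a patience of 2, training would stop at epoch 8 and keep the epoch 6 weights; a patience of 3 lets the run reach the slightly lower loss at epoch 9, which shows how sensitive the rule is to this setting when the validation loss fluctuates.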

#### 3.2 FCNN summary

| Optimizer | Convergence Speed | Overfitting Tendency    | Notes                                      |
|-----------|-------------------|----------------------|-----------------------------------------------|
| SGD       | Slower            | Moderate             | Gradual learning with stable generalization at around 7+ epochs.   |
| Adam      | Fast              | High                 | Quick convergence; tends to overfit after ~6 epochs. |
| RMSprop   | Fast              | High                 | Quick convergence; tends to overfit after ~6 epochs.  |


##### 3.2.1 FCNN results
<div style="font-size: 15px;">

| Epoch | Model Type | Hidden Layers | Optimizer | Learning Rate | Batch Size | Dropout | Weight Decay | Train Loss | Val Loss | Train Accuracy (%) | Val Accuracy (%) | Notes              |
|-------|------------|----------------|-----------|----------------|-------------|---------|---------------|-------------|-----------|---------------------|-------------------|---------------------|
| 1     | FCNN       | 256 → 128      | Adam      | 0.001          | 64          | 0.2     | 0.00001       | 0.603669    | 0.320558  | 81.59               | 90.25             | Optimizer test run |
| 2     | FCNN       | 256 → 128      | Adam      | 0.001          | 64          | 0.2     | 0.00001       | 0.313351    | 0.242666  | 90.47               | 92.60             | Optimizer test run |
| 3     | FCNN       | 256 → 128      | Adam      | 0.001          | 64          | 0.2     | 0.00001       | 0.241655    | 0.206248  | 92.53               | 93.97             | Optimizer test run |
| 4     | FCNN       | 256 → 128      | Adam      | 0.001          | 64          | 0.2     | 0.00001       | 0.199400    | 0.186593  | 93.80               | 94.33             | Optimizer test run |
| 5     | FCNN       | 256 → 128      | Adam      | 0.001          | 64          | 0.2     | 0.00001       | 0.173871    | 0.175666  | 94.61               | 94.85             | Optimizer test run |
| 6     | FCNN       | 256 → 128      | Adam      | 0.001          | 64          | 0.2     | 0.00001       | 0.150559    | 0.169167  | 95.28               | 95.13             | Optimizer test run |
| 7     | FCNN       | 256 → 128      | Adam      | 0.001          | 64          | 0.2     | 0.00001       | 0.135818    | 0.165706  | 95.71               | 95.24             | Optimizer test run |
| 8     | FCNN       | 256 → 128      | Adam      | 0.001          | 64          | 0.2     | 0.00001       | 0.124829    | 0.168856  | 95.98               | 95.20             | Optimizer test run |
| 9     | FCNN       | 256 → 128      | Adam      | 0.001          | 64          | 0.2     | 0.00001       | 0.117159    | 0.162457  | 96.35               | 95.39             | Optimizer test run |
| 10    | FCNN       | 256 → 128      | Adam      | 0.001          | 64          | 0.2     | 0.00001       | 0.108685    | 0.164668  | 96.43               | 95.35             | Optimizer test run |
| 1     | FCNN       | 256 → 128      | SGD       | 0.100          | 64          | 0.2     | 0.00000       | 0.582876    | 0.315444  | 81.70               | 90.41             | Optimizer test run |
| 2     | FCNN       | 256 → 128      | SGD       | 0.100          | 64          | 0.2     | 0.00000       | 0.343576    | 0.230780  | 89.59               | 93.11             | Optimizer test run |
| 3     | FCNN       | 256 → 128      | SGD       | 0.100          | 64          | 0.2     | 0.00000       | 0.288211    | 0.230458  | 91.35               | 93.18             | Optimizer test run |
| 4     | FCNN       | 256 → 128      | SGD       | 0.100          | 64          | 0.2     | 0.00000       | 0.258612    | 0.238517  | 92.11               | 93.36             | Optimizer test run |
| 5     | FCNN       | 256 → 128      | SGD       | 0.100          | 64          | 0.2     | 0.00000       | 0.237265    | 0.207680  | 92.99               | 93.90             | Optimizer test run |
| 6     | FCNN       | 256 → 128      | SGD       | 0.100          | 64          | 0.2     | 0.00000       | 0.228997    | 0.217669  | 93.21               | 93.76             | Optimizer test run |
| 7     | FCNN       | 256 → 128      | SGD       | 0.100          | 64          | 0.2     | 0.00000       | 0.209951    | 0.215322  | 93.78               | 94.04             | Optimizer test run |
| 8     | FCNN       | 256 → 128      | SGD       | 0.100          | 64          | 0.2     | 0.00000       | 0.199019    | 0.198590  | 94.06               | 94.41             | Optimizer test run |
| 9     | FCNN       | 256 → 128      | SGD       | 0.100          | 64          | 0.2     | 0.00000       | 0.188269    | 0.217437  | 94.44               | 93.97             | Optimizer test run |
| 10    | FCNN       | 256 → 128      | SGD       | 0.100          | 64          | 0.2     | 0.00000       | 0.188893    | 0.201386  | 94.44               | 94.50             | Optimizer test run |
| 1     | FCNN       | 256 → 128      | RMSprop   | 0.001          | 64          | 0.2     | 0.00001       | 0.498353    | 0.284981  | 84.53               | 91.52             | Optimizer test run |
| 2     | FCNN       | 256 → 128      | RMSprop   | 0.001          | 64          | 0.2     | 0.00001       | 0.286033    | 0.212626  | 91.27               | 93.78             | Optimizer test run |
| 3     | FCNN       | 256 → 128      | RMSprop   | 0.001          | 64          | 0.2     | 0.00001       | 0.224582    | 0.181572  | 93.15               | 94.59             | Optimizer test run |
| 4     | FCNN       | 256 → 128      | RMSprop   | 0.001          | 64          | 0.2     | 0.00001       | 0.188475    | 0.187888  | 94.17               | 94.39             | Optimizer test run |
| 5     | FCNN       | 256 → 128      | RMSprop   | 0.001          | 64          | 0.2     | 0.00001       | 0.164078    | 0.168633  | 94.95               | 95.13             | Optimizer test run |
| 6     | FCNN       | 256 → 128      | RMSprop   | 0.001          | 64          | 0.2     | 0.00001       | 0.148354    | 0.161313  | 95.42               | 95.30             | Optimizer test run |
| 7     | FCNN       | 256 → 128      | RMSprop   | 0.001          | 64          | 0.2     | 0.00001       | 0.133011    | 0.155040  | 95.80               | 95.31             | Optimizer test run |
| 8     | FCNN       | 256 → 128      | RMSprop   | 0.001          | 64          | 0.2     | 0.00001       | 0.120738    | 0.154486  | 96.25               | 95.37             | Optimizer test run |
| 9     | FCNN       | 256 → 128      | RMSprop   | 0.001          | 64          | 0.2     | 0.00001       | 0.114880    | 0.148084  | 96.24               | 95.86             | Optimizer test run |
| 10    | FCNN       | 256 → 128      | RMSprop   | 0.001          | 64          | 0.2     | 0.00001       | 0.104249    | 0.153006  | 96.65               | 95.88             | Optimizer test run |

</div>

- All models tend to underfit at epochs 1–4, which is expected as the models have just started generalizing.
- Optimal performance generally occurs around epochs 5–6.
- SGD shows more stable validation performance and less overfitting in later epochs.
- Adam and RMSprop achieve high training accuracy quickly but overfit earlier.
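
The point at which each optimizer moves from underfitting toward overfitting can be read off the table as the first epoch where training accuracy exceeds validation accuracy. This crossover is only one possible indicator, chosen here for illustration; the accuracies are copied from the FCNN rows above.

```python
def first_crossover(train_acc, val_acc):
    """First 1-indexed epoch where train accuracy exceeds val accuracy."""
    for epoch, (train, val) in enumerate(zip(train_acc, val_acc), start=1):
        if train > val:
            return epoch
    return None  # training accuracy never overtook validation accuracy

# (train accuracy, validation accuracy) per epoch, in percent.
fcnn_runs = {
    "Adam": ([81.59, 90.47, 92.53, 93.80, 94.61, 95.28, 95.71, 95.98, 96.35, 96.43],
             [90.25, 92.60, 93.97, 94.33, 94.85, 95.13, 95.24, 95.20, 95.39, 95.35]),
    "SGD": ([81.70, 89.59, 91.35, 92.11, 92.99, 93.21, 93.78, 94.06, 94.44, 94.44],
            [90.41, 93.11, 93.18, 93.36, 93.90, 93.76, 94.04, 94.41, 93.97, 94.50]),
    "RMSprop": ([84.53, 91.27, 93.15, 94.17, 94.95, 95.42, 95.80, 96.25, 96.24, 96.65],
                [91.52, 93.78, 94.59, 94.39, 95.13, 95.30, 95.31, 95.37, 95.86, 95.88]),
}

crossovers = {name: first_crossover(train, val)
              for name, (train, val) in fcnn_runs.items()}
print(crossovers)  # {'Adam': 6, 'SGD': 9, 'RMSprop': 6}
```

The result agrees with the summary table: Adam and RMSprop cross over at epoch 6, while SGD does so only at epoch 9.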

----------------------------------------------------------------------------------

### 4. Results for Adam model
#### 4.1 CNN Accuracy and Loss Graph

![CNN Accuracy and Loss](CNN_Accuracy_and_Loss.png)

Looking at the graph, we observe that during the early epochs (1-2), the model’s accuracy is relatively low, and the loss is high. This is expected as the model is just beginning to generalize, and its learning process is still in its early stages. At this point, the train accuracy is noticeably lower than the validation accuracy, and the train loss is higher than the validation loss. This suggests underfitting, as the model has not yet learned to properly fit the data.

In the mid-epochs (3-4), both the training and validation accuracies have increased significantly, and are nearly in alignment with each other. Additionally, the training and validation losses have dropped and are now comparable, indicating that the model is achieving a good balance between fitting the data and generalizing. This period typically reflects the model’s optimal learning phase, where it captures patterns in the data without overfitting.

However, in the later epochs (5+), we notice a shift towards overfitting. Despite maintaining high accuracy and low loss on the training set, the model starts to show signs of memorizing the training data rather than generalizing well. This is indicated by the train accuracy being higher than the validation accuracy and the train loss being lower than the validation loss. As the epochs progress, particularly at the 7th, 8th, and 10th epochs, we see a rise in validation loss, signaling that the model is losing its ability to generalize and is instead overfitting to the training data.

#### 4.2 FCNN Accuracy and Loss Graph

![FCNN Accuracy and Loss](FCNN_Accuracy_and_Loss.png)

Looking at the graph, we observe that during the early epochs (1-4), the model’s accuracy is relatively low, and the loss is high. This is expected as the model is just beginning to generalize, and its learning process is still in its early stages. At this point, the train accuracy is noticeably lower than the validation accuracy, and the train loss is higher than the validation loss. This suggests underfitting, as the model has not yet learned to properly fit the data.

In the mid-epochs (5-6), both the training and validation accuracies have increased significantly, and are nearly in alignment with each other. Additionally, the training and validation losses have dropped and are now comparable, indicating that the model is achieving a good balance between fitting the data and generalizing. This period typically reflects the model’s optimal learning phase, where it captures patterns in the data without overfitting.

However, in the later epochs (7+), we notice a shift towards overfitting. Despite maintaining high accuracy and low loss on the training set, the model starts to show signs of memorizing the training data rather than generalizing well. This is indicated by the train accuracy being higher than the validation accuracy and the train loss being lower than the validation loss. As the epochs progress, particularly at the 8th epoch, we see a rise in validation loss, signaling that the model is losing its ability to generalize and is instead overfitting to the training data.

#### 4.3 CNN Confusion Matrix

![Confusion CNN](Confusion_CNN.png)

We note a test accuracy of `96.07%`.

#### 4.4 FCNN Confusion Matrix

![Confusion FCNN](Confusion_FCNN.png)

We note a test accuracy of `89.62%`.


### 5. Discussion
- The train set is used to teach the model by adjusting its parameters. The validation set evaluates the model during training, helping monitor generalization and tune hyperparameters. It also detects overfitting or underfitting. The test set is used after training to provide an unbiased measure of the model's performance on unseen data.

- Accuracy, while a common metric for classification tasks, has limitations when applied to datasets like KMNIST. Although it provides an overall performance snapshot, it might not capture the model’s struggles with visually similar characters.

- In cases where certain characters are harder to distinguish, accuracy can be misleading, as a model might perform well on simpler classes but fail on more complex ones. For a better evaluation, additional per-class metrics such as precision and recall are needed to assess how well the model handles all classes, especially the more challenging ones.

- Thus, while accuracy is useful, it shouldn't be the sole metric when evaluating models on KMNIST-like datasets.
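
To make this concrete, per-class precision, recall and F1 can be derived directly from a confusion matrix. The sketch below uses a made-up 3-class matrix, not the results of this experiment, to show how a high overall accuracy can hide one weak class.

```python
def accuracy(cm):
    """Overall accuracy; cm[i][j] counts true class i predicted as class j."""
    correct = sum(cm[k][k] for k in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

def per_class_metrics(cm):
    """Precision, recall and F1 for every class of a square confusion matrix."""
    n = len(cm)
    metrics = []
    for k in range(n):
        true_pos = cm[k][k]
        actual = sum(cm[k])                          # row sum: samples of class k
        predicted = sum(cm[i][k] for i in range(n))  # column sum: predictions of k
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append({"precision": precision, "recall": recall, "f1": f1})
    return metrics

# Hypothetical counts: 100 samples per class, class 2 is often confused.
toy_cm = [
    [98, 1, 1],
    [2, 96, 2],
    [10, 10, 80],
]

print(f"accuracy: {accuracy(toy_cm):.3f}")
for label, m in enumerate(per_class_metrics(toy_cm)):
    print(label, {name: round(value, 3) for name, value in m.items()})
```

In this toy case the overall accuracy is above 91%, yet class 2 has a recall of only 0.80, which is the kind of gap that accuracy alone does not reveal.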