# Laboratory 3: Getting started with PyTorch

In this laboratory we will begin working with PyTorch to implement and train complex, nonlinear models for supervised learning problems. You will notice many similarities between NumPy and PyTorch -- this is deliberate, but it can cause some confusion, and for many operations we will have to convert back and forth between NumPy arrays and PyTorch tensors.
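As a minimal sketch of the conversion mentioned above (the array contents are made up for illustration), `torch.from_numpy` and `Tensor.numpy()` share memory with their source, while `torch.tensor` makes an independent copy:

```python
import numpy as np
import torch

# NumPy -> PyTorch: from_numpy shares memory with the source array.
a = np.arange(6, dtype=np.float32).reshape(2, 3)
t = torch.from_numpy(a)

# An in-place change to the array is visible through the tensor.
a[0, 0] = 100.0

# PyTorch -> NumPy: .numpy() also shares memory (CPU tensors only).
b = t.numpy()

# torch.tensor() makes an independent copy instead.
c = torch.tensor(a)
a[0, 1] = -1.0   # seen by t and b, but not by the copy c

print(t[0, 0].item(), t[0, 1].item(), c[0, 1].item(), b.shape)
```

The shared memory is convenient but easy to forget: modifying one object silently modifies the other.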

## Part 0: First steps

**Important**: You **must** install PyTorch in your Anaconda environment for this laboratory. The easiest way to do this is to install the CPU version of PyTorch like this:

```
conda activate FML
conda install -c pytorch pytorch torchvision
```

**Note**: If you have an Nvidia GPU on your computer you can also install the GPU-enabled version of PyTorch, which will **greatly** improve performance for more complex models and larger datasets. However, it can be very hard to get the versions of all the required libraries to match correctly. During the laboratory we can look at it together if you are interested.
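If you do try the GPU build, a quick check along the following lines can tell you whether PyTorch actually sees the GPU. This is only a sketch; it falls back to the CPU, so it also runs on a CPU-only installation.

```python
import torch

# Use the GPU if PyTorch can see one, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tensors (and models) are moved to the chosen device with .to().
x = torch.randn(3, 3).to(device)
print("Selected device:", device, "| tensor lives on:", x.device)
```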

After installing PyTorch, use the next cell to verify that the installation is working. If it prints a 3x3 tensor, we're good to go.

In [None]:
# We're still going to need numpy and matplotlib.
import numpy as np
import matplotlib.pyplot as plt
import random


import warnings
warnings.filterwarnings("ignore")

# Verify that pytorch is working.
import torch

# If this works, things should be OK.
print(torch.randn((3, 3)))

## Part 1: Dataset preparation

We will work with the venerable MNIST dataset of handwritten digits in this laboratory. The `torchvision` library provides classes for a bunch of standard datasets, including MNIST. These classes automatically download and prepare the dataset for use.

In [None]:
# Download and load the MNIST dataset.
from torchvision.datasets import MNIST
import torchvision

# Load the MNIST training and test splits.
ds_train = MNIST(root='./data', download=True, train=True)
ds_test  = MNIST(root='./data', download=True, train=False)

In [None]:
display(ds_train)
display(ds_test)

**Analysis:** The dataset comes pre-split into train and test sets, with 60,000 datapoints in the train set and 10,000 in the test set. This split is predefined by the dataset, so in the code you only need to specify `train=True` or `train=False`.

### Exercise 1.1: Exploratory data analysis

Spend some time inspecting the `ds_train` and `ds_test` data structures in order to get a feel for the data. What is the format? How big are the images? How many are there? What about the range of pixel values? Where are the labels for images?

Remember that one of the best ways to explore is to *visualize*.

In [None]:
# Print some information about the dataset
print("Number of training examples:", len(ds_train))
print("Number of test examples:", len(ds_test))
print("Image size:", ds_train[0][0].size)

# Plot a few sample images from the training set
fig, axes = plt.subplots(2, 5, figsize=(10, 4))
for i in range(2):
    for j in range(5):
        image, label = ds_train[i * 5 + j]
        axes[i, j].imshow(image, cmap='summer') #summer
        axes[i, j].set_title(f"Label: {label}")
        axes[i, j].axis('off')

plt.show()

In [None]:
# Pick 100 random indices from the training set
random_indices = np.random.permutation(ds_train.data.shape[0])[:100]

plt.figure(figsize=(10, 10))

# Visualize the 100 randomly selected images from the training set
for i, index in enumerate(random_indices):
    # Subplot organization: 10 rows, 10 columns, i+1 refers to the current subplot index
    plt.subplot(10, 10, i+1)

    # Display the image with the 'summer' colormap
    plt.imshow(ds_train.data[index], cmap="summer")

    # Set the title with the corresponding label (.item() turns the 0-d tensor into a plain int)
    plt.title(f"Label: {ds_train.targets[index].item()}")

    # Turn off axis ticks for cleaner visualization
    plt.axis('off')

# Adjust layout for better spacing
plt.tight_layout()

# Display the visualization
plt.show()




**Analysis:** The MNIST database (Modified National Institute of Standards and Technology database) is a large collection of handwritten digits. It has a training set of 60,000 examples and a test set of 10,000 examples. It is drawn from the larger NIST Special Database 3 (digits written by employees of the United States Census Bureau) and Special Database 1 (digits written by high school students), which contain monochrome images of handwritten digits. The digits have been size-normalized and centered in a fixed-size image.


https://paperswithcode.com/dataset/mnist


### Exercise 1.2: Dataset conversion and normalization

+ **Datatype Conversion**:
The first thing we need to do is convert all data tensors to `torch.float32` -- this is fundamental as it is extremely inconvenient to work with `uint8` data. Using 32-bit floating point numbers is a compromise between precision and space efficiency.
The `torch.Tensor` class has a very useful method `to()` for performing datatype and device (e.g. to GPU) conversions. Check out the [documentation here](https://pytorch.org/docs/stable/generated/torch.Tensor.to.html#torch-tensor-to).

+ **Normalization**:
Next, we need to correct the inconvenient range of [0, 255] for the pixel values. You should *subtract* the mean intensity value and divide by the standard deviation in order to *standardize* our data. **Important**: Think *very carefully* about *which* split you should use to compute the pixel statistics for standardization.

+ **Reshaping**: Is the data in an appropriate format (i.e. shape) for training the models we know? Think about whether (and how) to fix this if needed.

**What to do**: In the cell below you should perform this sequence of preprocessing operations on the `ds_train.data` and `ds_test.data` tensors.

In [None]:
import torch

# Convert the training and test data to float32 tensors
Xs_train = ds_train.data.to(torch.float32)
ys_train = ds_train.targets  # Training targets (labels)
Xs_test = ds_test.data.to(torch.float32)
ys_test = ds_test.targets  # Test targets (labels)

# Compute the mean and standard deviation on the training set only
mean_px = Xs_train.mean()
std_px = Xs_train.std()

# Standardization: subtract the mean and divide by the standard deviation
Xs_train = (Xs_train - mean_px) / std_px
Xs_test = (Xs_test - mean_px) / std_px  # Same training-set statistics applied to the test data

# Check the statistics after standardization
Xs_train.mean(), Xs_train.std(), Xs_test.mean(), Xs_test.std()

In [None]:
Xs_train.flatten(1).shape

In [None]:
Xs_test.flatten(1).shape

In [None]:
plt.imshow(Xs_train[0], cmap='gray')

**Analysis:** The data in `ds_train.data` and `ds_test.data` is converted to `float32`. This is important because many machine learning operations require data in floating-point format.
The mean (`mean_px`) and standard deviation (`std_px`) are computed on the training data (`Xs_train`) only, and the same values are then used to standardize both the training and the test set, so that no information from the test set leaks into preprocessing.

### Exercise 1.3: Subsampling the MNIST dataset.

MNIST is kind of big, and thus inconvenient to work with unless using the GPU. For this laboratory we will use a smaller subset of the dataset for training to keep memory and computation times low.

Modify `ds_train` to use only a subset of, say, 10000 images sampled from the original data. Make sure to select the correct corresponding targets.
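One possible way to keep images and targets aligned is to draw a single set of random indices and use it on both tensors. The sketch below uses small synthetic tensors (not MNIST) and a seeded `torch.Generator`, which is an optional addition that makes the subset reproducible.

```python
import torch

# Synthetic stand-ins for the image and label tensors (sample i has label i).
X = torch.arange(50, dtype=torch.float32).reshape(50, 1)
y = torch.arange(50)

# A seeded generator makes the same subset come out on every run.
g = torch.Generator().manual_seed(0)
idx = torch.randperm(len(X), generator=g)[:10]

# Index both tensors with the SAME indices so the labels stay aligned.
X_sub, y_sub = X[idx], y[idx]
print(X_sub.shape, y_sub.shape)
```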


In [None]:
# Your code here.
train_size = 10000
I = np.random.permutation(range(len(Xs_train)))[:train_size]
Xs_train_i = Xs_train[I]
ys_train_i = ys_train[I]

Xs_train_i = Xs_train_i.flatten(start_dim=1)
Xs_test_i = Xs_test.flatten(start_dim=1)

Xs_train_i.shape, Xs_test_i.shape

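**Analysis:** The training set is reduced to 10,000 randomly sampled images, which keeps memory use and training time low. The images are also flattened into 784-dimensional vectors.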

## Part 2: Establishing a stable baseline

In this exercise you will establish a reliable baseline using a classical approach. This is an important step in our methodology in order to judge whether our Deep MLP is performing well or not.

### Exercise 2.1: Establish the stable baseline

Train and test your stable baseline to estimate the best achievable accuracy using classical models.

**Tip**: Don't do any extensive cross-validation of your baseline (for now). Just fit a simple model (e.g. a linear SVM) and record the accuracy.



In [None]:
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Maximum number of iterations for LinearSVC
max_iter = 1000

# Train the model
svc = LinearSVC(max_iter=max_iter)
svc.fit(Xs_train_i, ys_train_i)

# Predict on the test data
preds = svc.predict(Xs_test_i)

# Compute the accuracy
accuracy = accuracy_score(ys_test, preds)
print("Accuracy: ", accuracy)

# Classification report
print("Classification report: ")
print(classification_report(ys_test, preds))

# Unnormalized confusion matrix
cm = confusion_matrix(ys_test, preds)

# Plot the unnormalized confusion matrix
plt.figure(figsize=(10, 10))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.title('Confusion Matrix (Not Normalized)')
plt.show()

# Normalize the confusion matrix along the rows (each row sums to 1)
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

# Plot the normalized confusion matrix
plt.figure(figsize=(10, 10))
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.title('Confusion Matrix (Normalized)')
plt.show()

**Analysis:** The LinearSVC model reaches an overall accuracy of 87.4%, which is a reasonable result but leaves room for improvement, especially for certain classes.

From the classification report, it appears that the model performs well for some classes, such as 0 and 1, which achieve precision and recall above 92%. These figures indicate that the model can clearly distinguish these categories. However, there are problematic classes: 5 (F1-score: 0.81) and 8 (F1-score: 0.79) record the lowest F1-scores. These values suggest that the model struggles to differentiate these digits from others, likely due to overlaps in the data features.

Looking at the confusion matrix, it is evident that class 5 is frequently confused with class 3 and class 8, while samples from class 8 are often predicted as belonging to classes 5 or 9. These errors indicate that the model has difficulty correctly separating visually or numerically similar classes.

Moving to the normalized version of the confusion matrix, we see that classes like 0, 1, 6, and 7 have a very high recall, confirming that the model is particularly reliable at recognizing these categories. In contrast, classes 2, 5, and 8 have lower recall, indicating greater confusion.
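The quantities discussed above can be read directly off a confusion matrix. The following sketch uses a small, made-up 3-class matrix (not the MNIST results) to show how per-class recall, per-class precision, and the most frequent confusion are obtained with NumPy:

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted).
cm = np.array([[50,  2,  3],
               [ 4, 40,  6],
               [ 1,  9, 35]])

# Recall per class: correct predictions divided by the row sums (true counts).
recall = np.diag(cm) / cm.sum(axis=1)

# Precision per class: correct predictions divided by the column sums.
precision = np.diag(cm) / cm.sum(axis=0)

# The largest off-diagonal entry is the most frequent confusion.
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
true_cls, pred_cls = np.unravel_index(off_diag.argmax(), off_diag.shape)

print("Recall:", recall.round(3))
print("Precision:", precision.round(3))
print(f"Most frequent error: true class {true_cls} predicted as {pred_cls}")
```

The diagonal of the row-normalized confusion matrix is exactly the per-class recall, which is why the normalized plot makes weak classes easy to spot.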

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns

# Flatten the images to 2D (n_samples, n_features), as SVC expects.
# The original tensors are left untouched so the cell can be re-run safely.
Xs_train_2d = Xs_train_i.reshape(Xs_train_i.shape[0], -1)
Xs_test_2d = Xs_test.reshape(Xs_test.shape[0], -1)

# Instantiate the Support Vector Classifier (default RBF kernel)
svc = SVC()

# Fit the model on the training data
svc.fit(Xs_train_2d, ys_train_i)

# Predict the labels for the test data (only once: SVC prediction is slow)
predictions = svc.predict(Xs_test_2d)

# Calculate accuracy
accuracy = accuracy_score(ys_test.numpy(), predictions)
print("Accuracy:", accuracy)

# Display classification report (true labels first, then predictions)
print("Classification Report:")
print(classification_report(ys_test, predictions))

# Confusion matrix
cm = confusion_matrix(ys_test, predictions)

# Visualize the confusion matrix (tick labels default to the class indices 0-9)
plt.figure(figsize=(10, 10))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()

# Normalize the confusion matrix along the rows (each row sums to 1)
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

# Plot the normalized confusion matrix
plt.figure(figsize=(10, 10))
sns.heatmap(cm_normalized, annot=True, fmt='.2f', cmap='Blues')
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.title('Confusion Matrix (Normalized)')
plt.show()

**Analysis:** The SVC model with the default kernel (RBF) reaches an overall accuracy of 96.34%, a very high value that reflects the model’s ability to correctly distinguish most classes. This result is significantly better than that of the linear model (LinearSVC), suggesting that the choice of a more expressive, nonlinear model had a positive impact.

From the classification report, it is evident that the model performs excellently for many classes, particularly for classes 0, 1, and 6, which achieve precision and recall above 97%. This indicates that the model recognizes these digits with great accuracy. However, some classes, such as 8 and 9, show slightly lower F1-scores (0.95 each), indicating that the model has some difficulty distinguishing them.

Analyzing the confusion matrix, we see that the main errors occur in classes 5, 8, and 9. For example, class 5 is sometimes confused with class 3 or 8, while class 9 is occasionally predicted as 4 or 7. These errors reflect some visual overlap between the digits, especially for samples that are written similarly or have common characteristics.

The normalized confusion matrix provides a clearer view of the proportions of errors. Classes such as 0, 1, 6, and 7 show very high recall, approaching 99%, demonstrating that the model recognizes them almost perfectly. In contrast, classes 5, 8, and 9 have slightly lower recall, around 93-95%, highlighting that the model encounters some difficulties in more ambiguous situations.

These results indicate the strength of the SVC model, which, thanks to the nonlinear kernel, is capable of capturing complex relationships between the features of the data. However, for more difficult classes like 5 and 9, it might be useful to explore alternative approaches.

## Part 3: Training some deep models (finally)

Now we will finally train some deep models (Multilayer Perceptrons, to be precise). Since the dataset is a bit too large to use batch gradient descent, we will first need to set up a `torch.utils.data.DataLoader` for our training data. A `DataLoader` breaks the dataset up into a sequence of *batches* that will be used for training. In order to use this, we will first have to use `torch.utils.data.TensorDataset` on `ds_train.data` and `ds_train.targets` to make a new torch `dataset` for use in the dataloader.

### Exercise 3.1: Creating the DataLoader

Create a `DataLoader` for `ds_train`, using a `batch_size` of about 16 or 32 to start. Once you have your `DataLoader`, experiment with it using `next(iter(dl_train))` to see what it returns. The PyTorch `DataLoader` is a Python iterable: calling `iter()` on it returns an iterator over the batches.

**EXTREMELY IMPORTANT**: Make sure you use `shuffle=True` in the constructor of your dataloader.
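To get a feel for what a `DataLoader` returns, here is a small sketch on synthetic data (100 random samples, not MNIST). It shows the shape of one batch and how the dataset is split into batches:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in data: 100 flattened "images" with random labels.
X = torch.randn(100, 784)
y = torch.randint(0, 10, (100,))
dl = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

# iter() creates a fresh iterator; next() pulls the first batch from it.
xs, ys = next(iter(dl))
print(xs.shape, ys.shape)

# 100 samples with 32 per batch -> 4 batches; the last one holds the remaining 4.
sizes = [len(batch_x) for batch_x, _ in dl]
print(len(dl), sizes)
```

With `shuffle=True` the samples are reshuffled every time the loader is iterated, so each epoch sees the batches in a different composition.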

In [None]:
# Your code here.
from torch.utils.data import DataLoader, TensorDataset

batch_size = 32

ds = TensorDataset(Xs_train_i, ys_train_i)
dl_train = DataLoader(ds, batch_size=batch_size, shuffle=True)

len(dl_train)

batch = next(iter(dl_train))
print("Feature shape:", batch[0].shape)
print("Target shape:", batch[1].shape)

# Pytorch example
#https://github.com/pytorch/examples


### Some support code (NOT an exercise).

Here is some support code that you can use to train a model for a **single** epoch. The function returns the mean loss over all iterations. You will use it in the next exercise to train and monitor training.

In [None]:
# Train a model for a single epoch. You should pass it a model, a dataloader,
# and an optimizer. Returns the mean loss over the entire epoch.
def train_epoch(model, dl, optimizer):
    model.train()
    losses = []
    for (xs, ys) in dl:
        optimizer.zero_grad()
        output = model(xs)
        loss = torch.nn.functional.nll_loss(output, ys) #compute the negative log likelihood loss
        loss.backward() #compute the gradient 
        optimizer.step() #tell the optimizer to perform a gradient step
        losses.append(loss.item())
    model.eval()
    return np.mean(losses)

**Analysis:** `model.train()` and `model.eval()`:
`model.train()` sets the model to training mode, and `model.eval()` sets it to evaluation mode. This is important because certain layers (e.g., dropout) behave differently during training and evaluation. In training mode, dropout is active, but in evaluation mode, it's turned off.

optimizer.zero_grad():
Before computing the gradients for the parameters, the optimizer's zero_grad() method is called to clear the previously calculated gradients. This is necessary because PyTorch accumulates gradients by default.

Forward Pass (output = model(xs)):
The input batch (xs) is passed through the model to obtain predictions (output). This is the forward pass.

Loss Computation (loss = torch.nn.functional.nll_loss(output, ys)):
The negative log likelihood loss is computed using the model's predictions (output) and the ground truth labels (ys).

Backward Pass (loss.backward()):
Gradients are computed for all model parameters with respect to the loss using the backward() method.

Gradient Descent Step (optimizer.step()):
The optimizer's step() method is called to perform a step of gradient descent, updating the model parameters based on the computed gradients.

Record the Loss (losses.append(loss.item())):
The loss for the current iteration is recorded in the losses list.

Return the Mean Loss (return np.mean(losses)):
The function returns the mean loss over all iterations in the epoch.
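The reason for calling `zero_grad()` in every iteration can be seen in a tiny experiment. The sketch below uses a hypothetical two-element parameter `w` and the loss `sum(w^2)`, whose gradient is `2w`:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

# First backward pass: gradient of sum(w^2) is 2w = [2, 4].
(w ** 2).sum().backward()
first = w.grad.clone()

# Second backward pass WITHOUT zero_grad(): gradients add up to [4, 8].
(w ** 2).sum().backward()
accumulated = w.grad.clone()

# Clearing the gradients and recomputing gives the single-pass gradient again.
opt.zero_grad()
(w ** 2).sum().backward()
opt.step()  # w <- w - 0.1 * [2, 4] = [0.8, 1.6]

print(first, accumulated, w.detach())
```

Without the `zero_grad()` call, each step would use the sum of all gradients computed so far rather than the gradient of the current batch.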

### Exercise 3.2: Defining a 1-layer neural network

Define a simple model that uses a **single** `torch.nn.Linear` layer followed by a `torch.nn.Softmax` to predict  the output probabilities for the ten classes.

In [None]:
# Define a fresh model.
import torch.nn as nn

model = torch.nn.Sequential(
    # Your code here.
    nn.Linear(784, 10),
    nn.LogSoftmax(dim=1)   # Specify dim=1 to apply LogSoftmax along the second dimension
)

**Analysis:** `torch.nn.Linear`: This is a linear transformation layer in PyTorch. It represents an affine transformation, which is a linear transformation with a bias term. Mathematically, it performs the operation y = xA^T + b, where x is the input tensor, A is the weight matrix, b is the bias vector, and y is the output tensor.

`torch.nn.LogSoftmax`: This layer applies the logarithm of the softmax function to the output of the linear layer. It converts raw scores (logits) into log probabilities, which are numerically more stable for classification tasks. The dim=1 argument specifies that the operation is performed along the second dimension (class scores for each input sample in a batch).
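The model ends in `LogSoftmax` rather than `Softmax` because `torch.nn.functional.nll_loss`, used in `train_epoch`, expects log-probabilities. The sketch below (random logits, made-up targets) checks that this combination is equivalent to cross-entropy on the raw logits, and illustrates the stability argument with a deliberately extreme input:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)            # raw scores: 4 samples, 10 classes
targets = torch.tensor([3, 0, 9, 1])   # made-up class labels

log_probs = F.log_softmax(logits, dim=1)
row_sums = log_probs.exp().sum(dim=1)  # each row of probabilities sums to 1

# NLL on log-probabilities equals cross-entropy on the raw logits.
loss_nll = F.nll_loss(log_probs, targets)
loss_ce = F.cross_entropy(logits, targets)

# Stability: taking log() of a softmax output hits log(0) = -inf when a
# probability underflows, while log_softmax stays finite.
big = torch.tensor([[1000.0, 0.0]])
naive = torch.log(torch.softmax(big, dim=1))
stable = F.log_softmax(big, dim=1)

print(loss_nll.item(), loss_ce.item(), naive, stable)
```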


https://pytorch.org/docs/

https://pytorch.org/vision/stable/

### Exercise 3.2: Training our model

Instantiate a `torch.optim.SGD` optimizer using `model.parameters()` and the learning rate (**tip**: make the learning rate a variable you can easily change). Then run `train_epoch` for a set number of epochs (e.g. 100, make this a variable too). Is your model learning? How can you tell?

In [None]:
# Your code here.
from tqdm import tqdm

model = torch.nn.Sequential(
    # Your code here.
    nn.Linear(784, 10),
    nn.LogSoftmax(dim=1)   
)

epochs = 100
lr = 1e-2   # Learning rate
losses = []
opt = torch.optim.SGD(model.parameters(),lr=lr)

# Training loop: train for one epoch and record its mean loss
# (train_epoch is called exactly once per epoch)
for epoch in tqdm(range(epochs)):
    loss = train_epoch(model, dl_train, opt)
    losses.append(loss)
    
plt.plot(losses)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Training Loss')
plt.show()


print("Classification report: ")
print(classification_report(ys_test, model(Xs_test_i).argmax(dim=1)))

**Analysis:** The loss curve and the classification report provide a clear overview of the model’s performance and the training process.

`Loss Curve`

The training loss curve shows the typical behavior of a model that is learning correctly. Initially, the loss decreases rapidly, indicating that the model is learning the fundamental features from the data. As the epochs progress, the rate of decrease slows down and the curve flattens, indicating that the model is approaching convergence. This is a good sign, as it suggests that the model has been trained sufficiently. Note that the training loss alone cannot reveal overfitting; that requires comparing against performance on held-out data such as the test set.

`Classification Report`

Looking at the results, the overall accuracy is 91%, which is a good value.
Strengths: Classes 0, 1, 6, and 7 show very high precision and recall, close to 95%-96%. This indicates that the model can effectively distinguish these categories.

Classes 3, 5, and 8, however, have slightly lower F1-scores (between 85% and 88%). This may suggest that the model finds it more difficult to distinguish these classes, likely due to less obvious features or overlap in the input data.

The global averages (macro avg and weighted avg) confirm that the model maintains solid performance overall, with no significant imbalance between the classes.

### Exercise 3.3: Evaluating our model

Write some code to plot the loss curve for your training run and evaluate the performance of your model on the test data. Play with the hyperparameters (e.g. learning rate) to try to get the best performance on the test set. Can you beat the stable baseline?

In [None]:
import torch
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader, TensorDataset

# Compute the accuracy of the model on (X, y)
def compute_accuracy(model, X, y):
    predictions = torch.argmax(model(X), dim=1)
    return (predictions == y).float().mean().item()

# Train the model for num_epochs and return the per-epoch training loss
def train_with_hyperparameters(model, optimizer, data_loader, num_epochs):
    train_loss_curve = []
    for epoch in range(num_epochs):
        train_loss = train_epoch(model, data_loader, optimizer)
        train_loss_curve.append(train_loss)
    return train_loss_curve

# Main function: train and evaluate the model for one hyperparameter combination
def train_and_evaluate(model, learning_rate, batch_size, num_epochs):
    # Set up the optimizer and data loader for these hyperparameters
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
    train_loader = DataLoader(
        TensorDataset(Xs_train_2d, ys_train_i),
        batch_size=batch_size,
        shuffle=True
    )

    # Training
    train_loss_curve = train_with_hyperparameters(model, optimizer, train_loader, num_epochs)

    # Evaluation on the test set
    test_accuracy = compute_accuracy(model, Xs_test_2d, ys_test)

    return train_loss_curve, test_accuracy

# Plot the training loss curve
def plot_loss_curve(train_loss_curve, learning_rate, batch_size):
    plt.figure(figsize=(8, 6))
    plt.plot(train_loss_curve, label=f"LR: {learning_rate}, Batch: {batch_size}")
    plt.xlabel("Epoch")
    plt.ylabel("Training Loss")
    plt.title("Training Loss Curve")
    plt.legend()
    plt.grid(True)
    plt.show()

def reset_parameters(model):
    for layer in model.children():
        if hasattr(layer, 'reset_parameters'):
            layer.reset_parameters()

# Hyperparameter grid
learning_rates = [0.0001, 0.001, 0.01]
batch_sizes = [16, 64, 256]
num_epochs = 100

# Baselines for comparison
accuracy_score_linearSVC = 0.88  # Accuracy obtained with LinearSVC
accuracy_score_nl_SVC = 0.96  # Assumed baseline accuracy (non-linear SVC)

# Search for the best model
best_accuracy = 0.0
best_hyperparameters = {}
best_loss_curve = []

for lr in learning_rates:
    for batch_size in batch_sizes:
        print(f"Training with LR: {lr}, Batch Size: {batch_size}")
        model.apply(reset_parameters)  # Reset the model weights for each new combination
        loss_curve, test_accuracy = train_and_evaluate(model, lr, batch_size, num_epochs)

        if test_accuracy > best_accuracy:
            best_accuracy = test_accuracy
            best_hyperparameters = {'learning_rate': lr, 'batch_size': batch_size}
            best_loss_curve = loss_curve

        print(f"Test Accuracy: {test_accuracy:.4f}")

# Report the best hyperparameters and plot the loss curve of the best model
print("\nBest Hyperparameters:")
print(f"Learning Rate: {best_hyperparameters['learning_rate']}, Batch Size: {best_hyperparameters['batch_size']}")
print(f"Best Test Accuracy: {best_accuracy:.4f}")

if(best_accuracy > accuracy_score_nl_SVC):
    print("The model is better than the Non-Linear SVC baseline and the Linear SVC baseline")
else:
    if(best_accuracy > accuracy_score_linearSVC):
        print("The model is better than the Linear SVC baseline")
    else:
        print("The model is worse than both baselines")

plot_loss_curve(best_loss_curve, best_hyperparameters['learning_rate'], best_hyperparameters['batch_size'])

**Analysis:** Higher learning rates (0.001 and 0.01) tend to lead to better accuracy than 0.0001. This is expected: with a learning rate that is too low, convergence is slow and the model may not get close to the optimum within the 100 training epochs.

The batch size impacts performance. With smaller batches (e.g., 16), there is a slight tendency toward better generalization. This is probably due to the increased variability introduced by smaller batches, which helps to better explore the solution space.

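The model also performs well with a higher learning rate combined with larger batches. Note that the batch size controls how many samples contribute to each gradient update, not the size of the dataset, so this result by itself says little about overfitting.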

## Going Deeper

Now we will go (at least one layer) deeper to see if we can significantly improve on the baseline.

### Exercise 3.4: A 2-layer MLP
Define a new model with one hidden layer. Use the code you wrote above to train and evaluate this new model. Can you beat the baseline? You might need to train in two stages using different learning rates.

**Things to think about**:

+ It might be hard to beat (or even equal) the baseline with deeper networks. Why?
+ Is there something else we should be monitoring while training, especially for deep networks?
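The two-stage training suggested in this exercise can also be expressed with a learning-rate scheduler instead of two separate optimizers. The sketch below is one possible alternative, not the approach required by the exercise; it uses `torch.optim.lr_scheduler.MultiStepLR` on a placeholder linear model, and the rates and milestone are example values.

```python
import torch

model = torch.nn.Linear(784, 10)   # placeholder model for illustration
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# Multiply the learning rate by 0.01 once epoch 50 is reached (0.1 -> 0.001).
sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[50], gamma=0.01)

lrs = []
for epoch in range(60):
    lrs.append(opt.param_groups[0]["lr"])
    # ... one epoch of training would go here ...
    opt.step()     # optimizer step comes before the scheduler step
    sched.step()

print(lrs[49], lrs[50])
```

A scheduler keeps the optimizer state in one object, which matters for optimizers that track momentum or other running statistics.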

In [None]:
import torch
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import matplotlib.pyplot as plt

# Define a 2-layer MLP
class MLP2Layer(torch.nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(MLP2Layer, self).__init__()
        self.fc1 = torch.nn.Linear(input_size, hidden_size)
        self.relu = torch.nn.ReLU()
        self.fc2 = torch.nn.Linear(hidden_size, output_size)
        self.log_softmax = torch.nn.LogSoftmax(dim=1)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.log_softmax(x)
        return x

# Function to plot training curves
def plot_training_curves(loss_curve, accuracy_curve, title):
    plt.figure(figsize=(12, 5))

    # Plot Training Loss
    plt.subplot(1, 2, 1)
    plt.plot(loss_curve, label='Training Loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.title(f'Training Loss Curve ({title})')
    plt.legend()

    # Plot Training Accuracy
    plt.subplot(1, 2, 2)
    plt.plot(accuracy_curve, label='Training Accuracy')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.title(f'Training Accuracy Curve ({title})')
    plt.legend()

    plt.tight_layout()
    plt.show()

# Function to train the model for one epoch
def train_epoch(model, dataloader, optimizer, criterion):
    model.train()
    total_loss = 0
    for batch_x, batch_y in dataloader:
        optimizer.zero_grad()
        outputs = model(batch_x)
        loss = criterion(outputs, batch_y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(dataloader)

# Function to evaluate the model accuracy
def evaluate_accuracy(model, dataloader):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for batch_x, batch_y in dataloader:
            outputs = model(batch_x)
            predictions = torch.argmax(outputs, dim=1)
            correct += (predictions == batch_y).sum().item()
            total += batch_y.size(0)
    return correct / total

# Function to train and evaluate the model
def train_and_evaluate(model, optimizer, criterion, dataloader, num_epochs, title):
    train_loss_curve = []
    train_accuracy_curve = []

    for epoch in range(num_epochs):
        # Train the model for one epoch
        train_loss = train_epoch(model, dataloader, optimizer, criterion)
        train_loss_curve.append(train_loss)

        # Evaluate training accuracy
        train_accuracy = evaluate_accuracy(model, dataloader)
        train_accuracy_curve.append(train_accuracy)

    # Plot the training curves
    plot_training_curves(train_loss_curve, train_accuracy_curve, title)

    return model

# Instantiate the model
input_size = 784
hidden_size = 128
output_size = 10
model_2layer = MLP2Layer(input_size, hidden_size, output_size)

# Set hyperparameters
learning_rate_stage1 = 0.1
learning_rate_stage2 = 0.001
batch_size = 32
num_epochs_stage1 = 50
num_epochs_stage2 = 450

# Instantiate the optimizer and loss function
criterion = torch.nn.NLLLoss()
optimizer_stage1 = optim.SGD(model_2layer.parameters(), lr=learning_rate_stage1)

# Create DataLoaders
dl_train_2layer = DataLoader(TensorDataset(Xs_train_2d, ys_train_i), batch_size=batch_size, shuffle=True)
dl_test_2layer = DataLoader(TensorDataset(Xs_test_2d, ys_test), batch_size=batch_size, shuffle=False)

# Stage 1 Training
model_2layer = train_and_evaluate(model_2layer, optimizer_stage1, criterion, dl_train_2layer, num_epochs_stage1, "Stage 1")

# Stage 2 Training
optimizer_stage2 = optim.SGD(model_2layer.parameters(), lr=learning_rate_stage2)
model_2layer = train_and_evaluate(model_2layer, optimizer_stage2, criterion, dl_train_2layer, num_epochs_stage2, "Stage 2")

# Evaluate on the test set
test_accuracy = evaluate_accuracy(model_2layer, dl_test_2layer)
print(f"Test Accuracy with 2-layer MLP: {test_accuracy:.4f}")

`Stage 1`

**Training Loss Curve**

The loss decreases rapidly during the first epochs, indicating that the model is effectively learning patterns from the data. After about 20 epochs, the loss stabilizes near 0, suggesting that the model has reached a good level of convergence during Stage 1.

The convergence speed is high due to the relatively high learning rate (0.1), and the loss stabilizes without visible oscillations, so training is stable. Whether the model overfits cannot be judged from the training loss alone.

**Training Accuracy Curve**

The accuracy increases rapidly during the first 10 epochs, exceeding 99%. After this threshold, the accuracy stabilizes near 100% for the rest of the stage.

This is a sign that the model is highly capable of fitting the training data.

`Stage 2`

**Training Loss Curve**

During Stage 2, the loss continues to decrease, although very gradually, from around 0.00064 to 0.00058. The decrease is linear and without significant oscillations, which suggests a stable fine-tuning phase.

The lower learning rate (0.001) allows the model to refine the parameters without making drastic changes; this further improvement is typical in fine-tuning phases.


**Training Accuracy Curve**

The accuracy remains fixed at 100% throughout Stage 2, indicating that the model no longer improves on the training set.

Stage 2 does not bring improvements in training accuracy, but this is expected: the purpose of fine-tuning is to improve generalization and stability, not necessarily to increase accuracy on the training data.
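Since training accuracy saturates at 100%, generalization has to be judged on data the model has not been trained on. One common option is to hold out part of the training data as a validation set and track its loss next to the training loss. The sketch below does this on synthetic data with a small placeholder network; the split sizes, layer sizes, and names such as `mean_loss` are illustrative assumptions, not part of the lab code.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

torch.manual_seed(0)
# Synthetic stand-in data; in the lab this would be the training subset.
X = torch.randn(200, 784)
y = torch.randint(0, 10, (200,))

# Hold out 20% of the training data for validation.
train_ds, val_ds = random_split(TensorDataset(X, y), [160, 40])
dl_tr = DataLoader(train_ds, batch_size=32, shuffle=True)
dl_val = DataLoader(val_ds, batch_size=32)

model = torch.nn.Sequential(
    torch.nn.Linear(784, 32), torch.nn.ReLU(),
    torch.nn.Linear(32, 10), torch.nn.LogSoftmax(dim=1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.NLLLoss()

def mean_loss(dl):
    """Average per-sample loss over a dataloader, without tracking gradients."""
    model.eval()
    with torch.no_grad():
        total = sum(loss_fn(model(xb), yb).item() * len(yb) for xb, yb in dl)
    return total / len(dl.dataset)

history = []  # one (train_loss, val_loss) pair per epoch
for epoch in range(5):
    model.train()
    for xb, yb in dl_tr:
        opt.zero_grad()
        loss_fn(model(xb), yb).backward()
        opt.step()
    history.append((mean_loss(dl_tr), mean_loss(dl_val)))

for tr, va in history:
    print(f"train loss {tr:.3f} | validation loss {va:.3f}")
```

A training loss that keeps falling while the validation loss rises is the usual sign of overfitting.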