
# Dropout

Let's think briefly about what we
expect from a good predictive model.
We want it to perform well on unseen data.
Classical generalization theory
suggests that to close the gap between
train and test performance,
we should aim for a simple model.
Simplicity can come in the form
of a small number of dimensions.
We explored this when discussing the
monomial basis functions of linear models.
Additionally, as we saw when discussing weight decay
($\ell_2$ regularization),
the (inverse) norm of the parameters also
represents a useful measure of simplicity.
Another useful notion of simplicity is smoothness,
i.e., that the function should not be sensitive
to small changes to its inputs.
For instance, when we classify images,
we would expect that adding some random noise
to the pixels should be mostly harmless.

In 1995, Christopher Bishop formalized
this idea when he proved that training with input noise
is equivalent to Tikhonov regularization.
This work drew a clear mathematical connection
between the requirement that a function be smooth (and thus simple),
and the requirement that it be resilient
to perturbations in the input.

Then, :citet:`Srivastava.Hinton.Krizhevsky.ea.2014`
developed a clever idea for how to apply Bishop's idea
to the internal layers of a network, too.
Their idea, called *dropout*, involves
injecting noise while computing
each internal layer during forward propagation,
and it has become a standard technique
for training neural networks.
The method is called *dropout* because we literally
*drop out* some neurons during training.
Throughout training, on each iteration,
standard dropout consists of zeroing out
some fraction of the nodes in each layer
before calculating the subsequent layer.

To be clear, we are imposing
our own narrative with the link to Bishop.
The original paper on dropout
offers intuition through a surprising
analogy to sexual reproduction.
The authors argue that neural network overfitting
is characterized by a state in which
each layer relies on a specific
pattern of activations in the previous layer,
calling this condition *co-adaptation*.
Dropout, they claim, breaks up co-adaptation
just as sexual reproduction is argued to
break up co-adapted genes.
While the justification of this theory is certainly up for debate,
the dropout technique itself has proved enduring,
and various forms of dropout are implemented
in most deep learning libraries.


The key challenge is how to inject this noise.
One idea is to inject it in an *unbiased* manner
so that the expected value of each layer---while fixing
the others---equals the value it would have taken absent noise.
In Bishop's work, he added Gaussian noise
to the inputs to a linear model.
At each training iteration, he added noise
sampled from a distribution with mean zero
$\epsilon \sim \mathcal{N}(0,\sigma^2)$ to the input $\mathbf{x}$,
yielding a perturbed point $\mathbf{x}' = \mathbf{x} + \epsilon$.
In expectation, $E[\mathbf{x}'] = \mathbf{x}$.
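
The unbiasedness claim, and the link to Tikhonov regularization mentioned earlier, can be checked numerically. The sketch below is only illustrative: the weights `w`, target `y`, and noise scale `sigma` are hypothetical values chosen for the example, and the model is assumed to be linear with squared loss. Under those assumptions the average loss on noisy inputs should approach the clean loss plus $\sigma^2 \|\mathbf{w}\|^2$.

```python
import torch

torch.manual_seed(0)

sigma = 0.5                            # assumed noise standard deviation
x = torch.tensor([0.2, -0.1, 0.4])     # a single input point
w = torch.tensor([1.0, -2.0, 0.5])     # hypothetical linear-model weights
y = torch.tensor(1.0)                  # hypothetical target

# Draw many perturbed copies x' = x + eps with eps ~ N(0, sigma^2 I)
num_samples = 200_000
eps = sigma * torch.randn(num_samples, x.shape[0])
x_noisy = x + eps

# Unbiasedness: the sample mean of x' should be close to x
mean_x_noisy = x_noisy.mean(dim=0)

# Squared loss on the clean input and averaged over noisy inputs
clean_loss = (w @ x - y) ** 2
noisy_loss = ((x_noisy @ w - y) ** 2).mean()

# Clean loss plus an L2 (Tikhonov) penalty on the weights
penalised_loss = clean_loss + sigma ** 2 * (w ** 2).sum()

print('mean of x\':', mean_x_noisy)
print('noisy loss:', noisy_loss.item())
print('clean loss + sigma^2 * ||w||^2:', penalised_loss.item())
```

The two printed losses agree only up to Monte Carlo error, so small differences between runs with other seeds are expected.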

In standard dropout regularization,
one zeros out some fraction of the nodes in each layer
and then *debiases* each layer by normalizing
by the fraction of nodes that were retained (not dropped out).
In other words,
with *dropout probability* $p$,
each intermediate activation $h$ is replaced by
a random variable $h'$ as follows:

$$
\begin{aligned}
h' =
\begin{cases}
    0 & \textrm{ with probability } p \\
    \frac{h}{1-p} & \textrm{ otherwise}
\end{cases}
\end{aligned}
$$

By design, the expectation remains unchanged, i.e., $E[h'] = h$.
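
To see why, average over the two cases in the definition of $h'$. The same calculation also gives the variance that dropout introduces, which is shown here only as a supplementary worked step:

$$
\begin{aligned}
E[h'] &= p \cdot 0 + (1-p) \cdot \frac{h}{1-p} = h, \\
\textrm{Var}[h'] &= E[h'^2] - E[h']^2 = (1-p) \cdot \frac{h^2}{(1-p)^2} - h^2 = \frac{p}{1-p} h^2.
\end{aligned}
$$

The variance grows as $p$ approaches $1$, which is one way to read the rescaling: it keeps the mean fixed at the cost of noisier activations.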


In [3]:
import torch
from torch import nn

## Dropout in Practice

Recall the MLP with a hidden layer and five hidden units
from :numref:`fig_mlp`.
When we apply dropout to a hidden layer,
zeroing out each hidden unit with probability $p$,
the result can be viewed as a network
containing only a subset of the original neurons.
In :numref:`fig_dropout2`, $h_2$ and $h_5$ are removed.
Consequently, the calculation of the outputs
no longer depends on $h_2$ or $h_5$
and their respective gradients also vanish
when performing backpropagation.
In this way, the calculation of the output layer
cannot be overly dependent on any
one element of $h_1, \ldots, h_5$.

![MLP before and after dropout.](http://d2l.ai/_images/dropout2.svg)
:label:`fig_dropout2`
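
The vanishing gradients can be observed directly with autograd. The following is a minimal sketch, assuming a fixed mask that drops $h_2$ and $h_5$ as in the figure; the output weights `w` and the dropout probability `p` are hypothetical values used only for illustration.

```python
import torch

# Five hidden activations, tracked by autograd
h = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0], requires_grad=True)
p = 0.4  # assumed dropout probability

# Fixed mask for illustration: h_2 and h_5 are dropped
mask = torch.tensor([1.0, 0.0, 1.0, 1.0, 0.0])
h_dropped = mask * h / (1.0 - p)

# Hypothetical output-layer weights producing a single scalar output
w = torch.tensor([0.5, 1.0, 2.0, 0.1, 3.0])
output = (w * h_dropped).sum()
output.backward()

# Gradients for the dropped units are zero; the rest are w / (1 - p)
print(h.grad)
```

Because the dropped units contribute nothing to `output`, no gradient signal flows back through them on this iteration.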

Typically, we disable dropout at test time.
Given a trained model and a new example,
we do not drop out any nodes
and thus do not need to normalize.
However, there are some exceptions:
some researchers use dropout at test time as a heuristic
for estimating the *uncertainty* of neural network predictions:
if the predictions agree across many different dropout outputs,
then we might say that the network is more confident.
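
This heuristic can be sketched as follows. The small untrained network and random input below are placeholders, not the model from this section; the point is only to show the mechanics of keeping dropout active and aggregating several stochastic forward passes.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Placeholder network with one dropout layer (an assumption for illustration)
net = nn.Sequential(nn.Linear(4, 16), nn.ReLU(),
                    nn.Dropout(0.5), nn.Linear(16, 3))
x = torch.randn(1, 4)  # a single made-up example

# train() keeps dropout active; no_grad() avoids building the autograd graph
net.train()
with torch.no_grad():
    samples = torch.stack([net(x).softmax(dim=-1) for _ in range(100)])

# Mean prediction and per-class spread across the stochastic passes
mean_probs = samples.mean(dim=0)
std_probs = samples.std(dim=0)
print('mean class probabilities:', mean_probs)
print('standard deviation:', std_probs)
```

A small standard deviation means the dropout samples agree. Note that calling `train()` also changes the behavior of layers such as batch normalization, so a real model may need dropout enabled selectively.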

## Implementation from Scratch

To implement the dropout function for a single layer,
we must draw as many samples
from a Bernoulli (binary) random variable
as our layer has dimensions,
where the random variable takes value $1$ (keep)
with probability $1-p$ and $0$ (drop) with probability $p$.
One easy way to implement this is to first draw samples
from the uniform distribution $U[0, 1]$.
Then we can keep those nodes for which the corresponding
sample is greater than $p$, dropping the rest.

In the following code, we (**implement a `dropout_layer` function
that drops out the elements in the tensor input `X`
with probability `dropout`**),
rescaling the remainder as described above:
dividing the survivors by `1.0-dropout`.


In [4]:
def dropout_layer(X, dropout):
    assert 0 <= dropout <= 1
    if dropout == 1: return torch.zeros_like(X)
    mask = (torch.rand(X.shape) > dropout).float()
    return mask * X / (1.0 - dropout)

We can [**test out the `dropout_layer` function on a few examples**].
In the following lines of code,
we pass our input `X` through the dropout operation,
with probabilities 0, 0.5, and 1, respectively.


In [5]:
X = torch.arange(16, dtype = torch.float32).reshape((2, 8))
print('dropout_p = 0:', dropout_layer(X, 0))
print('dropout_p = 0.5:', dropout_layer(X, 0.5))
print('dropout_p = 1:', dropout_layer(X, 1))

dropout_p = 0: tensor([[ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11., 12., 13., 14., 15.]])
dropout_p = 0.5: tensor([[ 0.,  2.,  0.,  0.,  8.,  0.,  0.,  0.],
        [ 0., 18.,  0.,  0.,  0., 26.,  0., 30.]])
dropout_p = 1: tensor([[0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.]])
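
A single draw with probability 0.5 looks quite different from `X`, but the rescaling should make the average over many draws approach `X`. The sketch below repeats the `dropout_layer` definition so that it runs on its own; the number of trials is an arbitrary choice.

```python
import torch

def dropout_layer(X, dropout):
    assert 0 <= dropout <= 1
    if dropout == 1:
        return torch.zeros_like(X)
    mask = (torch.rand(X.shape) > dropout).float()
    return mask * X / (1.0 - dropout)

torch.manual_seed(0)
X = torch.arange(16, dtype=torch.float32).reshape((2, 8))

# Average many independent dropout draws of the same input
num_trials = 20_000
avg = torch.stack([dropout_layer(X, 0.5) for _ in range(num_trials)]).mean(dim=0)

# The largest deviation from X should shrink as num_trials grows
max_abs_err = (avg - X).abs().max()
print('average over draws:', avg)
print('largest absolute deviation from X:', max_abs_err.item())
```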


### Defining the Model

The model below applies dropout to the output
of each hidden layer (following the activation function).
We can set dropout probabilities for each layer separately.
A common choice is to set
a lower dropout probability closer to the input layer.
We ensure that dropout is only active during training.


In [6]:
class DropoutMLPScratch(torch.nn.Module):
    def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
                 dropout_1, dropout_2, lr):
        super(DropoutMLPScratch, self).__init__()
        self.lin1 = nn.LazyLinear(num_hiddens_1)
        self.lin2 = nn.LazyLinear(num_hiddens_2)
        self.lin3 = nn.LazyLinear(num_outputs)
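        self.relu = nn.ReLU()
        # Store the dropout probabilities and learning rate used in forward() and training
        self.dropout_1 = dropout_1
        self.dropout_2 = dropout_2
        self.lr = lr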

    def forward(self, X):
        H1 = self.relu(self.lin1(X.reshape((X.shape[0], -1))))
        if self.training:
            H1 = dropout_layer(H1, self.dropout_1)
        H2 = self.relu(self.lin2(H1))
        if self.training:
            H2 = dropout_layer(H2, self.dropout_2)
        return self.lin3(H2)

### Training

Write your code to train the provided network on the FashionMNIST dataset for 10 epochs, similar to the training of MLPs described earlier in the lectures.

https://pytorch.org/vision/0.19/generated/torchvision.datasets.FashionMNIST.html


In [7]:
hparams = {'num_outputs':10, 'num_hiddens_1':256, 'num_hiddens_2':256,
           'dropout_1':0.5, 'dropout_2':0.5, 'lr':0.1}
model = DropoutMLPScratch(**hparams)

#write your training and testing code here
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
from sklearn.metrics import accuracy_score

# Dropout function
def dropout_layer(X, dropout):
    assert 0 <= dropout <= 1
    if dropout == 1:
        return torch.zeros_like(X)
    mask = (torch.rand(X.shape) > dropout).float()
    return mask * X / (1.0 - dropout)

# Define DropoutMLPScratch model
class DropoutMLPScratch(nn.Module):
    def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
                 dropout_1, dropout_2, lr):
        super(DropoutMLPScratch, self).__init__()
        self.lin1 = nn.LazyLinear(num_hiddens_1)
        self.lin2 = nn.LazyLinear(num_hiddens_2)
        self.lin3 = nn.LazyLinear(num_outputs)
        self.relu = nn.ReLU()
        self.dropout_1 = dropout_1
        self.dropout_2 = dropout_2
        self.lr = lr

    def forward(self, X):
        H1 = self.relu(self.lin1(X.reshape((X.shape[0], -1))))
        if self.training:
            H1 = dropout_layer(H1, self.dropout_1)
        H2 = self.relu(self.lin2(H1))
        if self.training:
            H2 = dropout_layer(H2, self.dropout_2)
        return self.lin3(H2)

# Hyperparameters
hparams = {
    'num_outputs': 10,       # 10 classes for FashionMNIST
    'num_hiddens_1': 256,
    'num_hiddens_2': 256,
    'dropout_1': 0.5,
    'dropout_2': 0.5,
    'lr': 0.1
}
model = DropoutMLPScratch(**hparams)

# Load FashionMNIST dataset
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
train_dataset = datasets.FashionMNIST(root='data', train=True, transform=transform, download=True)
test_dataset = datasets.FashionMNIST(root='data', train=False, transform=transform, download=True)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64)

# Loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=model.lr)

# Training function
def train(model, train_loader, epochs=10):
    model.train()
    for epoch in range(epochs):
        total_loss = 0
        correct = 0
        total = 0
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()
            outputs = model(X_batch)
            loss = criterion(outputs, y_batch)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

            _, predicted = torch.max(outputs.data, 1)
            total += y_batch.size(0)
            correct += (predicted == y_batch).sum().item()

        print(f'Epoch {epoch+1}, Loss: {total_loss / len(train_loader):.4f}, '
              f'Accuracy: {correct / total:.4f}')

# Evaluation function
def evaluate(model, test_loader):
    model.eval()
    all_preds = []
    all_labels = []
    with torch.no_grad():
        for X_batch, y_batch in test_loader:
            outputs = model(X_batch)
            _, preds = torch.max(outputs, 1)
            all_preds.extend(preds.numpy())
            all_labels.extend(y_batch.numpy())
    return accuracy_score(all_labels, all_preds)

# Train the model for 10 epochs
train(model, train_loader, epochs=10)

# Test the model
test_accuracy = evaluate(model, test_loader)
print(f'Test Accuracy: {test_accuracy:.4f}')


Epoch 1, Loss: 0.6977, Accuracy: 0.7460
Epoch 2, Loss: 0.5111, Accuracy: 0.8157
Epoch 3, Loss: 0.4646, Accuracy: 0.8324
Epoch 4, Loss: 0.4413, Accuracy: 0.8402
Epoch 5, Loss: 0.4221, Accuracy: 0.8462
Epoch 6, Loss: 0.4062, Accuracy: 0.8535
Epoch 7, Loss: 0.3946, Accuracy: 0.8570
Epoch 8, Loss: 0.3868, Accuracy: 0.8601
Epoch 9, Loss: 0.3738, Accuracy: 0.8648
Epoch 10, Loss: 0.3677, Accuracy: 0.8667
Test Accuracy: 0.8621


## Higher Level Implementation

With high-level APIs, all we need to do is add a `Dropout` layer
after each fully connected hidden layer (following its activation),
passing in the dropout probability
as the only argument to its constructor.
During training, the `Dropout` layer will randomly
drop out outputs of the previous layer
(or equivalently, the inputs to the subsequent layer)
according to the specified dropout probability.
When not in training mode,
the `Dropout` layer simply passes the data through unchanged.
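
The difference between the two modes can be seen on a standalone `nn.Dropout` layer. This is a small sketch with an all-ones input chosen so that the rescaling is easy to read off.

```python
import torch
from torch import nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(2, 8)

# Training mode: each entry is zeroed with probability p,
# and survivors are scaled by 1 / (1 - p) = 2
drop.train()
y_train = drop(x)

# Evaluation mode: the layer acts as the identity
drop.eval()
y_eval = drop(x)

print('train mode:', y_train)
print('eval mode:', y_eval)
```

Calling `model.train()` or `model.eval()` on a parent module switches every `Dropout` layer inside it in the same way.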


In [8]:
class DropoutMLP(torch.nn.Module):
    def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2,
                 dropout_1, dropout_2, lr):
        super(DropoutMLP, self).__init__()
        self.net = nn.Sequential(
            nn.Flatten(), nn.LazyLinear(num_hiddens_1), nn.ReLU(),
            nn.Dropout(dropout_1), nn.LazyLinear(num_hiddens_2), nn.ReLU(),
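            nn.Dropout(dropout_2), nn.LazyLinear(num_outputs))
        self.lr = lr  # kept so the optimizer can read the learning rate

    def forward(self, X):
        # Flatten the input and pass it through the stacked layers
        return self.net(X)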

Next, write your code to train the given model on the FashionMNIST Dataset for 10 epochs.


In [9]:
hparams = {'num_outputs':10, 'num_hiddens_1':256, 'num_hiddens_2':256,
           'dropout_1':0.5, 'dropout_2':0.5, 'lr':0.1}
model = DropoutMLP(**hparams)

#write your training and testing code here

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Define DropoutMLP model
class DropoutMLP(nn.Module):
    def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2, dropout_1, dropout_2, lr):
        super(DropoutMLP, self).__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(num_hiddens_1), nn.ReLU(),
            nn.Dropout(dropout_1),
            nn.LazyLinear(num_hiddens_2), nn.ReLU(),
            nn.Dropout(dropout_2),
            nn.LazyLinear(num_outputs)
        )
        self.lr = lr

    def forward(self, X):
        return self.net(X)

# Hyperparameters
hparams = {
    'num_outputs': 10,  # 10 classes for FashionMNIST
    'num_hiddens_1': 256,
    'num_hiddens_2': 256,
    'dropout_1': 0.5,
    'dropout_2': 0.5,
    'lr': 0.1
}
model = DropoutMLP(**hparams)

# Load FashionMNIST dataset
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])

train_dataset = datasets.FashionMNIST(root='data', train=True, transform=transform, download=True)
test_dataset = datasets.FashionMNIST(root='data', train=False, transform=transform, download=True)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64)

# Loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=hparams['lr'])

# Training function
def train(model, train_loader, epochs=10):
    model.train()
    for epoch in range(epochs):
        total_loss = 0
        correct = 0
        total = 0
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()
            outputs = model(X_batch)
            loss = criterion(outputs, y_batch)
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            _, predicted = torch.max(outputs, 1)
            total += y_batch.size(0)
            correct += (predicted == y_batch).sum().item()

        print(f'Epoch {epoch+1}, Loss: {total_loss / len(train_loader):.4f}, '
              f'Accuracy: {correct / total:.4f}')

# Evaluation function
def evaluate(model, test_loader):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for X_batch, y_batch in test_loader:
            outputs = model(X_batch)
            _, predicted = torch.max(outputs, 1)
            total += y_batch.size(0)
            correct += (predicted == y_batch).sum().item()

    accuracy = correct / total
    print(f'Test Accuracy: {accuracy:.4f}')
    return accuracy

# Train the model for 10 epochs
train(model, train_loader, epochs=10)

# Evaluate the model on the test set
test_accuracy = evaluate(model, test_loader)


Epoch 1, Loss: 0.6913, Accuracy: 0.7480
Epoch 2, Loss: 0.5038, Accuracy: 0.8193
Epoch 3, Loss: 0.4648, Accuracy: 0.8328
Epoch 4, Loss: 0.4407, Accuracy: 0.8397
Epoch 5, Loss: 0.4204, Accuracy: 0.8482
Epoch 6, Loss: 0.4100, Accuracy: 0.8517
Epoch 7, Loss: 0.3925, Accuracy: 0.8583
Epoch 8, Loss: 0.3838, Accuracy: 0.8606
Epoch 9, Loss: 0.3761, Accuracy: 0.8648
Epoch 10, Loss: 0.3672, Accuracy: 0.8670
Test Accuracy: 0.8691


## Summary

Beyond controlling the number of dimensions and the size of the weight vector, dropout is yet another tool for avoiding overfitting. Often these tools are used jointly.
Note that dropout is
used only during training:
it replaces an activation $h$ with a random variable with expected value $h$.


## Exercises

1. What happens if you change the dropout probabilities for the first and second layers? In particular, what happens if you switch the ones for both layers? Design an experiment to answer these questions, describe your results quantitatively, and summarize the qualitative takeaways.
1. Increase the number of epochs to 50 and compare the results.
1. Why is dropout not typically used at test time?



In [10]:
# Exercise 1: What happens if you change the dropout probabilities for the first and second layers? In particular, what happens if you switch the ones for both layers? Design an experiment to answer these questions, describe your results quantitatively, and summarize the qualitative takeaways.

def train_and_evaluate(dropout_1, dropout_2, epochs=10):
    # Initialize the model with specified dropout values
    model = DropoutMLP(
        num_outputs=10, num_hiddens_1=256, num_hiddens_2=256,
        dropout_1=dropout_1, dropout_2=dropout_2, lr=0.1
    )

    # Loss function and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.1)

    # Training
    model.train()
    for epoch in range(epochs):
        total_loss = 0
        correct = 0
        total = 0
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()
            outputs = model(X_batch)
            loss = criterion(outputs, y_batch)
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            _, predicted = torch.max(outputs, 1)
            correct += (predicted == y_batch).sum().item()
            total += y_batch.size(0)

        train_accuracy = correct / total
        print(f'Epoch {epoch+1}, Loss: {total_loss / len(train_loader):.4f}, '
              f'Training Accuracy: {train_accuracy:.4f}')

    # Evaluation
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for X_batch, y_batch in test_loader:
            outputs = model(X_batch)
            _, predicted = torch.max(outputs, 1)
            correct += (predicted == y_batch).sum().item()
            total += y_batch.size(0)

    test_accuracy = correct / total
    print(f'Test Accuracy: {test_accuracy:.4f}')
    return test_accuracy

# Run experiments with different dropout configurations
print("Experiment 1: Dropout 1 = 0.2, Dropout 2 = 0.5")
test_accuracy_1 = train_and_evaluate(0.2, 0.5)

print("\nExperiment 2: Dropout 1 = 0.5, Dropout 2 = 0.2")
test_accuracy_2 = train_and_evaluate(0.5, 0.2)


Experiment 1: Dropout 1 = 0.2, Dropout 2 = 0.5
Epoch 1, Loss: 0.6337, Training Accuracy: 0.7690
Epoch 2, Loss: 0.4537, Training Accuracy: 0.8364
Epoch 3, Loss: 0.4081, Training Accuracy: 0.8519
Epoch 4, Loss: 0.3833, Training Accuracy: 0.8613
Epoch 5, Loss: 0.3599, Training Accuracy: 0.8690
Epoch 6, Loss: 0.3487, Training Accuracy: 0.8736
Epoch 7, Loss: 0.3315, Training Accuracy: 0.8791
Epoch 8, Loss: 0.3229, Training Accuracy: 0.8827
Epoch 9, Loss: 0.3112, Training Accuracy: 0.8868
Epoch 10, Loss: 0.3016, Training Accuracy: 0.8888
Test Accuracy: 0.8663

Experiment 2: Dropout 1 = 0.5, Dropout 2 = 0.2
Epoch 1, Loss: 0.6577, Training Accuracy: 0.7587
Epoch 2, Loss: 0.4800, Training Accuracy: 0.8251
Epoch 3, Loss: 0.4395, Training Accuracy: 0.8399
Epoch 4, Loss: 0.4158, Training Accuracy: 0.8505
Epoch 5, Loss: 0.3975, Training Accuracy: 0.8557
Epoch 6, Loss: 0.3806, Training Accuracy: 0.8617
Epoch 7, Loss: 0.3723, Training Accuracy: 0.8636
Epoch 8, Loss: 0.3611, Training Accuracy: 0.8679


In [11]:
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Define DropoutMLP model
class DropoutMLP(nn.Module):
    def __init__(self, num_outputs, num_hiddens_1, num_hiddens_2, dropout_1, dropout_2, lr):
        super(DropoutMLP, self).__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(num_hiddens_1), nn.ReLU(),
            nn.Dropout(dropout_1),
            nn.LazyLinear(num_hiddens_2), nn.ReLU(),
            nn.Dropout(dropout_2),
            nn.LazyLinear(num_outputs)
        )
        self.lr = lr

    def forward(self, X):
        return self.net(X)

# Hyperparameters
hparams = {
    'num_outputs': 10,  # 10 classes for FashionMNIST
    'num_hiddens_1': 256,
    'num_hiddens_2': 256,
    'dropout_1': 0.5,
    'dropout_2': 0.5,
    'lr': 0.1
}
model = DropoutMLP(**hparams)

# Load FashionMNIST dataset
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
train_dataset = datasets.FashionMNIST(root='data', train=True, transform=transform, download=True)
test_dataset = datasets.FashionMNIST(root='data', train=False, transform=transform, download=True)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64)

# Loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=hparams['lr'])

# Training function
def train(model, train_loader, epochs=50):
    print(f"{'=' * 20} Training Starts {'=' * 20}")
    model.train()
    for epoch in range(epochs):
        total_loss = 0
        correct = 0
        total = 0
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()
            outputs = model(X_batch)
            loss = criterion(outputs, y_batch)
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
            _, predicted = torch.max(outputs, 1)
            correct += (predicted == y_batch).sum().item()
            total += y_batch.size(0)

        train_accuracy = correct / total
        print(f"[Epoch {epoch + 1:02d}/{epochs}] "
              f"Loss: {total_loss / len(train_loader):.4f} | "
              f"Training Accuracy: {train_accuracy * 100:.2f}%")
    print(f"{'=' * 20} Training Complete {'=' * 20}")

# Evaluation function
def evaluate(model, test_loader):
    print(f"\n{'=' * 20} Evaluation Starts {'=' * 20}")
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for X_batch, y_batch in test_loader:
            outputs = model(X_batch)
            _, predicted = torch.max(outputs, 1)
            correct += (predicted == y_batch).sum().item()
            total += y_batch.size(0)

    accuracy = correct / total
    print(f"Test Accuracy: {accuracy * 100:.2f}%")
    print(f"{'=' * 20} Evaluation Complete {'=' * 20}")
    return accuracy

# Train the model for 50 epochs
print("Training for 50 epochs...\n")
train(model, train_loader, epochs=50)

# Evaluate the model on the test set
print("\nEvaluating model after 50 epochs...")
test_accuracy_50 = evaluate(model, test_loader)


Training for 50 epochs...

[Epoch 01/50] Loss: 0.6913 | Training Accuracy: 74.78%
[Epoch 02/50] Loss: 0.5070 | Training Accuracy: 81.73%
[Epoch 03/50] Loss: 0.4636 | Training Accuracy: 83.38%
[Epoch 04/50] Loss: 0.4386 | Training Accuracy: 84.37%
[Epoch 05/50] Loss: 0.4209 | Training Accuracy: 84.92%
[Epoch 06/50] Loss: 0.4066 | Training Accuracy: 85.38%
[Epoch 07/50] Loss: 0.3927 | Training Accuracy: 85.84%
[Epoch 08/50] Loss: 0.3815 | Training Accuracy: 86.33%
[Epoch 09/50] Loss: 0.3758 | Training Accuracy: 86.30%
[Epoch 10/50] Loss: 0.3666 | Training Accuracy: 86.73%
[Epoch 11/50] Loss: 0.3598 | Training Accuracy: 86.97%
[Epoch 12/50] Loss: 0.3543 | Training Accuracy: 87.11%
[Epoch 13/50] Loss: 0.3448 | Training Accuracy: 87.39%
[Epoch 14/50] Loss: 0.3426 | Training Accuracy: 87.38%
[Epoch 15/50] Loss: 0.3376 | Training Accuracy: 87.64%
[Epoch 16/50] Loss: 0.3333 | Training Accuracy: 87.78%
[Epoch 17/50] Loss: 0.3267 | Training Accuracy: 88.07%
[Epoch 18/50] Loss: 0.3256 | Training 