# LAB-2: Building a feedforward neural network on the Fashion-MNIST dataset

To learn the material properly, I worked through this lab with a guided learning model from Gemini rather than copying the code straight from an LLM.

Now let's proceed with importing the libraries. We import each library only when it is needed, so that it is clear which one is used where.

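## Preprocessing of the Fashion-MNIST dataset

First we load the dataset with `torchvision.datasets.FashionMNIST`. Each image is converted to a tensor with `transforms.ToTensor()`, which also scales the pixel values from [0, 255] to [0, 1]. This rescaling is the only normalisation applied in this notebook.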

Why is normalisation important?  
* Normalization keeps input ranges consistent
* This makes gradient updates stable and faster
* It prevents some features (pixels) from dominating learning just because of scale

In [7]:
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

In [3]:
transform = transforms.Compose([
    transforms.ToTensor()
])


In [4]:
train_dataset= datasets.FashionMNIST(
    root="data",
    train=True,
    download=True,
    transform=transform
)

100%|██████████| 26.4M/26.4M [00:02<00:00, 10.7MB/s]
100%|██████████| 29.5k/29.5k [00:00<00:00, 206kB/s]
100%|██████████| 4.42M/4.42M [00:01<00:00, 3.38MB/s]
100%|██████████| 5.15k/5.15k [00:00<00:00, 10.1MB/s]


In [5]:
train_dataset

Dataset FashionMNIST
    Number of datapoints: 60000
    Root location: data
    Split: Train
    StandardTransform
Transform: Compose(
               ToTensor()
           )

In [8]:
train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=True
)


In [12]:
test_dataset = datasets.FashionMNIST(
    root="data",
    train=False,
    transform=transform,
    download=True
)

test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)


Next, we inspect one batch to check that the previous steps worked as intended.

In [9]:
images, labels = next(iter(train_loader))

print(images.shape)
print(labels.shape)


torch.Size([32, 1, 28, 28])
torch.Size([32])


The batch has the expected shape: 32 images, each of size [1, 28, 28].
This is not the shape our network expects, because an `nn.Linear` layer takes each sample as a 1-D vector.  
We therefore flatten each image into a vector of 1 × 28 × 28 = 784 values before the first layer.
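
The flattening step can be sketched on a dummy batch as follows. The tensor `batch` is a hypothetical placeholder with the same shape as one Fashion-MNIST batch; `torch.flatten` is shown only as an equivalent alternative to `view`.

```python
import torch

# Dummy batch with the same shape as one Fashion-MNIST batch: [N, C, H, W].
batch = torch.zeros(32, 1, 28, 28)

# Keep the batch dimension and merge the rest: 1 * 28 * 28 = 784.
flat_view = batch.view(batch.size(0), -1)

# Equivalent built-in: flatten everything from dimension 1 onwards.
flat_alt = torch.flatten(batch, start_dim=1)

print(flat_view.shape)
print(flat_alt.shape)
```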

In [10]:
import torch.nn as nn
class MLP(nn.Module):
  def __init__(self):
    super().__init__()
    self.fc1=nn.Linear(784,392)
    self.fc2=nn.Linear(392,196)
    self.fc3=nn.Linear(196,49)
    self.fc4=nn.Linear(49,10)


  def forward(self, x):
    x=x.view(x.size(0),-1)
    x=torch.relu(self.fc1(x))
    x=torch.relu(self.fc2(x))
    x=torch.relu(self.fc3(x))
    x=self.fc4(x)
    return x

Now we test an instance of the model on a single batch to check the output shape.

In [11]:
model = MLP()
images, labels = next(iter(train_loader))
outputs = model(images)
print(outputs.shape)

torch.Size([32, 10])


Looks good

## Loss Function and Optimizer

CrossEntropyLoss is used for multi-class classification problems, where the model predicts one out of several discrete classes. It takes the raw outputs (logits) directly: internally it applies LogSoftmax to turn them into log-probabilities and then computes the negative log-likelihood loss, so the model itself should not apply Softmax.
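
A small check of this behaviour, as a sketch on made-up inputs: the random `logits` and the `targets` below are hypothetical values, not data from this lab. The two loss values should agree up to floating-point error.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical raw model outputs for 4 samples and 10 classes.
logits = torch.randn(4, 10)
targets = torch.tensor([0, 3, 9, 1])

# Built-in loss applied directly to the logits.
ce = nn.CrossEntropyLoss()(logits, targets)

# The same computation spelled out: log-softmax, then negative log-likelihood.
manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

print(ce.item(), manual.item())
```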

The Adam optimizer is chosen because it combines the benefits of momentum and adaptive learning rates. It scales the update of each parameter individually, which often leads to faster convergence and more stable training. A learning rate of 0.001 is Adam's default in PyTorch and a common starting point that balances training speed and stability.
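
To make "adaptive learning rates" concrete, here is a toy sketch with a hypothetical two-parameter tensor `w` whose gradients differ by four orders of magnitude. On the very first Adam step, the bias-corrected update reduces to roughly `lr * g / (|g| + eps)`, so both parameters move by about the learning rate regardless of gradient size.

```python
import torch
import torch.optim as optim

# Two toy parameters (hypothetical, not part of the lab model).
w = torch.tensor([1.0, 1.0], requires_grad=True)
optimizer = optim.Adam([w], lr=0.001)

# The gradient is 100 for w[0] and 0.01 for w[1].
loss = 100.0 * w[0] + 0.01 * w[1]
loss.backward()

before = w.detach().clone()
optimizer.step()
step = before - w.detach()

print("gradients:", w.grad)
print("step taken per parameter:", step)
```

Plain SGD with the same learning rate would instead move the two parameters by 0.1 and 0.00001 respectively.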

In [14]:
import torch.optim as optim
model = MLP()

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)


## Training Loop (Loss + Accuracy Monitoring)

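* The model is set to training mode using model.train(), so that layers such as dropout and batch normalization use their training behaviour. This MLP contains neither, so the call is mainly good practice. Gradient tracking is enabled by default and is not controlled by this call.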
* Before computing gradients for the current batch, previously accumulated gradients are cleared. This prevents incorrect gradient accumulation across batches.
* During the forward pass, input images are propagated through the neural network to produce output logits. The loss function then measures the discrepancy between predicted outputs and true class labels, quantifying how well the model is performing.
* The backward pass computes gradients of the loss with respect to all trainable parameters using backpropagation. The optimizer then updates the weights based on these gradients to minimize the loss.
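
Why the gradients must be cleared can be seen in a toy sketch with a single hypothetical parameter `w`. Here `w.grad.zero_()` stands in for `optimizer.zero_grad()`, since no optimizer is needed to show the effect.

```python
import torch

w = torch.tensor([2.0], requires_grad=True)

# First backward pass: d(3w)/dw = 3.
loss = (3.0 * w).sum()
loss.backward()
first_grad = w.grad.clone()

# Second backward pass without clearing: the new gradient is added on top.
loss = (3.0 * w).sum()
loss.backward()
accumulated = w.grad.clone()

# Clearing first gives the correct gradient for the current batch again.
w.grad.zero_()
loss = (3.0 * w).sum()
loss.backward()
after_reset = w.grad.clone()

print(first_grad.item(), accumulated.item(), after_reset.item())
```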

In [15]:
num_epochs = 5

for epoch in range(num_epochs):
    model.train()
    running_loss = 0
    correct = 0
    total = 0

    for images, labels in train_loader:
        optimizer.zero_grad()

        outputs = model(images)
        loss = criterion(outputs, labels)

        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    accuracy = 100 * correct / total
    print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {running_loss:.4f}, Accuracy: {accuracy:.2f}%")


Epoch [1/5], Loss: 971.8550, Accuracy: 80.97%
Epoch [2/5], Loss: 695.1672, Accuracy: 86.44%
Epoch [3/5], Loss: 621.0059, Accuracy: 87.77%
Epoch [4/5], Loss: 573.7068, Accuracy: 88.72%
Epoch [5/5], Loss: 542.8310, Accuracy: 89.25%


The training loss decreases in every epoch, indicating that the neural network is learning meaningful patterns from the data. Note that the printed loss is the sum of the per-batch losses over an epoch, not an average. Over the same epochs, training accuracy improves from approximately 81% to 89%, demonstrating better classification performance as training progresses.

This trend suggests that the optimizer and learning rate are reasonable choices for this model, since training converges without visible instability or divergence.
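
Because the loop adds up `loss.item()` over all batches, the printed values are easier to read after dividing by the number of batches. The sketch below does this for the five values reported above, assuming 60000 training images and a batch size of 32 as configured earlier.

```python
import math

train_size = 60000   # number of training images reported by the dataset
batch_size = 32
num_batches = math.ceil(train_size / batch_size)

# Summed losses copied from the training output above.
summed_losses = [971.8550, 695.1672, 621.0059, 573.7068, 542.8310]
avg_losses = [s / num_batches for s in summed_losses]

for epoch, avg in enumerate(avg_losses, start=1):
    print(f"Epoch {epoch}: average loss per batch = {avg:.4f}")
```

Inside the training loop itself, the same figure could be obtained with `running_loss / len(train_loader)`.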

## Evaluation on Test Data

In [16]:
model.eval()
correct = 0
total = 0

with torch.no_grad():  # gradients are not needed for evaluation
    for images, labels in test_loader:
        outputs = model(images)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

test_accuracy = 100 * correct / total
print(f"Test Accuracy: {test_accuracy:.2f}%")


Test Accuracy: 87.70%


The model achieves a test accuracy of 87.70%, slightly below the 89.25% training accuracy of the final epoch. The gap of about 1.5 percentage points suggests reasonable generalization with only mild overfitting: the model appears to have learned features that transfer to unseen data rather than simply memorizing the training set.

**Role of ReLU and Linear Output**

* ReLU activation introduces non-linearity, enabling the network to learn complex patterns.
* Without ReLU, multiple linear layers would behave like a single linear transformation.
* The output layer is linear because CrossEntropyLoss expects raw logits and applies LogSoftmax internally.
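
The claim that stacked linear layers collapse into one can be checked numerically. This is a sketch with small hypothetical layer sizes (8 → 5 → 3), not the sizes used in the lab model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

fc1 = nn.Linear(8, 5)
fc2 = nn.Linear(5, 3)
x = torch.randn(4, 8)

with torch.no_grad():
    # Two linear layers applied back to back, no activation in between.
    stacked = fc2(fc1(x))

    # One equivalent linear map: W = W2 @ W1, b = W2 @ b1 + b2.
    W = fc2.weight @ fc1.weight
    b = fc2.weight @ fc1.bias + fc2.bias
    single = x @ W.T + b

    # With ReLU in between, the single map above no longer applies in general.
    with_relu = fc2(torch.relu(fc1(x)))

print("stacked == single map:", torch.allclose(stacked, single, atol=1e-5))
print("with ReLU == single map:", torch.allclose(with_relu, single, atol=1e-5))
```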

**Forward Pass**

* Input data flows through each layer of the network.
* Linear transformations and activations compute predicted outputs.
* The final output represents raw class scores (logits).

**Backward Pass**

* The loss function computes error between predictions and true labels.
* Backpropagation calculates gradients of loss with respect to weights
* Gradients indicate how much each parameter contributed to the error.

**Gradient Updates**

* The optimizer (Adam/SGD) updates weights using computed gradients.
* Learning rate controls the step size of weight updates.
* This process minimizes the loss function over time.

**Training Loss and Accuracy**

* Loss measures how incorrect the predictions are.
* Accuracy measures how many predictions are correct.
* Monitoring both helps to detect underfitting and overfitting early.

**Evaluation on Test Data**

* The model is tested on unseen data to measure generalization.
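* model.eval() switches layers such as dropout and batch normalization to inference behaviour (dropout is turned off, batch normalization uses its running statistics).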
* No gradients are computed during evaluation for efficiency.
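
The effect of these two calls can be seen on a toy model. The `toy` network below includes a dropout layer purely for illustration; it is an assumption of this sketch, since the lab's MLP has no dropout.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy model with dropout, used only for this illustration.
toy = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.randn(2, 4)

# Evaluation mode: dropout is inactive, so repeated passes give identical output.
toy.eval()
with torch.no_grad():
    out_a = toy(x)
    out_b = toy(x)

# Training mode: dropout is active and outputs are part of the autograd graph.
toy.train()
out_train = toy(x)

print("eval outputs identical:", torch.equal(out_a, out_b))
print("eval output tracks gradients:", out_a.requires_grad)
print("train output tracks gradients:", out_train.requires_grad)
```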

**Impact of Hyperparameters**  

**Learning Rate**.  
Too high → unstable training.  
Too low → slow convergence.  

**Batch Size**.  
Small batch → noisy but generalizable updates.  
Large batch → stable but memory intensive.  

**Number of Epochs**.  
Too few → underfitting.  
Too many → overfitting.  
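
The learning-rate behaviour can be illustrated on a one-dimensional toy problem. This sketch minimises the hypothetical function f(w) = w² with plain gradient descent (not Adam, and not the lab model), where each step multiplies w by (1 - 2·lr).

```python
def gradient_descent(lr, steps=20, w0=1.0):
    """Minimise f(w) = w**2 with plain gradient descent; the gradient is 2*w."""
    w = w0
    for _ in range(steps):
        w = w - lr * 2.0 * w
    return w


# Too low, reasonable, and too high learning rates for this toy problem.
results = {lr: gradient_descent(lr) for lr in (0.001, 0.1, 1.1)}

for lr, w in results.items():
    print(f"lr={lr}: w after 20 steps = {w:.4f}")
```

With lr=0.001 the value barely moves from its starting point, with lr=0.1 it approaches the minimum at 0, and with lr=1.1 it grows in magnitude at every step.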

#### ADV task 1: Experimenting with multiple hidden layers and discussing the trade-offs

Step 1: I define a flexible MLP model so that I don't have to rewrite the code every time I want to change the number of hidden layers.

In [18]:
class FlexibleMLP(nn.Module):
    def __init__(self, hidden_layers):
        super().__init__()

        layers = []
        input_size = 784

        for hidden_size in hidden_layers:
            layers.append(nn.Linear(input_size, hidden_size))
            layers.append(nn.ReLU())
            input_size = hidden_size

        layers.append(nn.Linear(input_size, 10))

        self.network = nn.Sequential(*layers)

    def forward(self, x):
        x = x.view(x.size(0), -1)
        return self.network(x)


In [19]:
def train_and_evaluate(hidden_layers, epochs=5):
    model = FlexibleMLP(hidden_layers)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    for epoch in range(epochs):
        model.train()
        running_loss = 0

        for images, labels in train_loader:
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()

    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in test_loader:
            outputs = model(images)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

    accuracy = 100 * correct / total
    return running_loss, accuracy


In [20]:
configs = {
    "1 Hidden Layer": [128],
    "3 Hidden Layers": [392, 196, 49],
    "5 Hidden Layers": [512, 256, 128, 64, 32]
}

results = {}

for name, layers in configs.items():
    loss, acc = train_and_evaluate(layers)
    results[name] = (loss, acc)
    print(f"{name} → Test Accuracy: {acc:.2f}%")


1 Hidden Layer → Test Accuracy: 86.04%
3 Hidden Layers → Test Accuracy: 87.66%
5 Hidden Layers → Test Accuracy: 85.92%


**Interpretation**

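Increasing depth initially improves learning capacity, but beyond a point deeper networks show diminishing returns and may overfit or become harder to train because of the added model complexity.

In simple terms:  
1 hidden layer - lowest capacity, possibly underfitting.  
3 hidden layers - best test accuracy in this run.  
5 hidden layers - no further gain, possibly slight overfitting.

These labels are tentative. Only test accuracy was recorded here, each configuration was trained once for 5 epochs without a fixed random seed, and the differences are under 2 percentage points, so training accuracy and repeated runs would be needed to confirm underfitting or overfitting. Within this experiment, the 3-hidden-layer architecture achieves the best test accuracy.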
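
One concrete side of the trade-off is model size. The sketch below counts the weights and biases of the three configurations used above; `count_parameters` is a hypothetical helper written for this illustration, and it assumes the same layout as `FlexibleMLP` (784 inputs, 10 outputs, one bias per output unit).

```python
def count_parameters(hidden_layers, input_size=784, num_classes=10):
    """Count weights and biases of a fully connected network."""
    sizes = [input_size, *hidden_layers, num_classes]
    # Each Linear layer has n_in * n_out weights plus n_out biases.
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


configs = {
    "1 Hidden Layer": [128],
    "3 Hidden Layers": [392, 196, 49],
    "5 Hidden Layers": [512, 256, 128, 64, 32],
}

param_counts = {name: count_parameters(layers) for name, layers in configs.items()}

for name, count in param_counts.items():
    print(f"{name}: {count:,} parameters")
```

The 5-layer network has the most parameters of the three, so its lower test accuracy is not explained by a lack of capacity.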