# **2: Your First Model - A Multilayer Perceptron (MLP)**

## **Data Loading and Visualization**

### The MNIST Dataset

`MNIST` is a popular dataset in the machine learning community, consisting of `70,000` grayscale images of handwritten digits `(0-9)`. Each image is `28x28` pixels, and the task is to classify each image into one of the 10 digit classes. The dataset is split into a training set of `60,000` images and a test set of `10,000` images.

**torchvision.datasets** 

PyTorch provides a convenient way to load the MNIST dataset through the `torchvision.datasets` module. You can use the `MNIST` class to download and load the dataset.

**DataLoaders** 

To efficiently load and batch the data during training, we use `DataLoader` from `torch.utils.data`. A `DataLoader` takes a dataset and provides an iterable over the dataset with support for automatic batching, shuffling, and parallel data loading.
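
To make the batching behaviour concrete without downloading anything, here is a minimal sketch on a synthetic dataset. The `TensorDataset` of ten fake samples and the names `toy_dataset` and `toy_loader` are illustrative assumptions, not part of the MNIST pipeline.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data: 10 samples with 3 features each, plus integer labels.
features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
labels = torch.arange(10)
toy_dataset = TensorDataset(features, labels)

# batch_size=4 over 10 samples gives batches of 4, 4, and 2.
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=False)

batch_sizes = [xb.shape[0] for xb, yb in toy_loader]
print(batch_sizes)
```

The last batch is smaller because 10 is not a multiple of 4; passing `drop_last=True` to `DataLoader` would discard it instead.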


**transforms.ToTensor()**

`transforms.ToTensor()` converts a PIL image or a NumPy array into a PyTorch tensor. For 8-bit images such as MNIST, it also scales the pixel values from the integer range `[0, 255]` to floats in `[0.0, 1.0]`, which generally helps training converge. The resulting tensor has shape `(C, H, W)`, where `C` is the number of channels (1 for grayscale images), `H` is the height, and `W` is the width. For MNIST, each image therefore becomes a tensor of shape `(1, 28, 28)`.

In [2]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Define transforms
transform = transforms.ToTensor()

# Download and create datasets
train_dataset = datasets.MNIST(
    root='./data',
    train=True,
    download=True,
    transform=transform
)

test_dataset = datasets.MNIST(
    root='./data',
    train=False,
    download=True,
    transform=transform
)

# Create DataLoaders
batch_size = 64
train_dataloader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    shuffle=True
)

test_dataloader = DataLoader(
    test_dataset,
    batch_size=batch_size,
    shuffle=False
)

print(f"Training samples: {len(train_dataset)}")
print(f"Test samples: {len(test_dataset)}")
print(f"Batch size: {batch_size}")

Training samples: 60000
Test samples: 10000
Batch size: 64





## **Inspecting the Data Shape**

Understanding tensor shapes is critical for debugging neural networks, because shape mismatches are among the most common sources of errors in deep learning code. In MNIST, each image is a tensor of shape `(1, 28, 28)`: `1` channel (the images are grayscale), a height of `28`, and a width of `28`. A batch of labels is a 1D tensor of shape `(N,)`, where `N` is the number of samples in the batch. For example, a batch of 32 images has an image tensor of shape `(32, 1, 28, 28)` and a label tensor of shape `(32,)`.

Let's think about what we expect:

- Each image should have a shape of `(1, 28, 28)`.
- Each label should be a single integer representing the digit class (0-9).
- When we load a batch of images, we expect the image tensor to have a shape of `(batch_size, 1, 28, 28)` and the label tensor to have a shape of `(batch_size,)`.
- If we see a shape that doesn't match these expectations, it could indicate an issue with how the data is being loaded or processed. For example, if the image tensor has a shape of `(28, 28)` instead of `(1, 28, 28)`, it means that the channel dimension is missing, which could lead to errors when feeding the data into a neural network.


Applying this to our setup:

1. **Batch size:** We set `batch_size=64`, so each batch contains 64 images.
2. **Image dimensions:** MNIST images are grayscale (1 channel) and are 28x28 pixels.
3. **Shape convention:** PyTorch uses the format `(batch_size, channels, height, width)`.

Therefore, the shape of a single batch should be `(64, 1, 28, 28)`.

The labels will be a 1D tensor of shape `(64,)` containing the digit labels (0-9).
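
One caveat to the expectation above: 60,000 is not a multiple of 64, so the final batch of each epoch is smaller than the rest. A quick arithmetic check, assuming the `DataLoader` default of `drop_last=False`:

```python
import math

num_train_samples = 60_000  # size of the MNIST training set
batch_size = 64

# Number of batches per epoch, counting the final partial batch.
num_batches = math.ceil(num_train_samples / batch_size)
# Samples left over for that final batch.
last_batch_size = num_train_samples - (num_batches - 1) * batch_size

print(num_batches)      # 938
print(last_batch_size)  # 32
```

So an epoch consists of 938 batches, and the last one has shape `(32, 1, 28, 28)` rather than `(64, 1, 28, 28)`.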

In [3]:
# Get one batch of data
X, y = next(iter(train_dataloader))


print("Image tensor shape:", X.shape)
print("Label tensor shape:", y.shape)
print(f"\nExpected image shape: (64, 1, 28, 28)")
print(f"Expected label shape: (64,)")
print(f"\nLabels in this batch: {y[:10].tolist()}...")  # Show first 10 labels
print(f"Unique labels in y: {torch.unique(y)}")  # Digits from 0-9; a single batch may not contain all ten

Image tensor shape: torch.Size([64, 1, 28, 28])
Label tensor shape: torch.Size([64])

Expected image shape: (64, 1, 28, 28)
Expected label shape: (64,)

Labels in this batch: [5, 0, 8, 4, 2, 3, 8, 4, 6, 1]...
Unique labels in y: tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])


## **Building the MLP Model**

A Multilayer Perceptron (MLP) is a feedforward neural network made up of several layers of neurons, where every neuron in one layer is connected to every neuron in the next. MLPs can be used for classification tasks such as recognizing the MNIST digits.

### **nn.Module - The Base Class**

In PyTorch, all neural network models should inherit from `nn.Module`. This base class takes care of bookkeeping such as registering parameters, moving the model between devices, and switching between training and evaluation modes.

When you subclass `nn.Module`, you define two essential methods:

- `__init__(self):` This method initializes the layers of the model. You define the architecture of your neural network here by creating instances of layers (e.g., `nn.Linear`, `nn.ReLU`, etc.) and assigning them as attributes of the class.

- `forward(self, x):` This method defines the forward pass of the model. It specifies how the input data `x` flows through the layers defined in the `__init__` method to produce the output. The `forward` method is called when you pass data through the model (e.g., `model(input_data)`), and it should return the output of the model.
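
The relationship between the two methods can be seen in a deliberately tiny module. This is only a sketch; `TinyNet` and its layer sizes are hypothetical and unrelated to the MNIST model built in this section.

```python
import torch
import torch.nn as nn


class TinyNet(nn.Module):  # hypothetical example model
    def __init__(self):
        super().__init__()
        # Layers assigned as attributes are registered as submodules.
        self.linear = nn.Linear(4, 2)

    def forward(self, x):
        return self.linear(x)


tiny = TinyNet()
sample = torch.randn(3, 4)  # batch of 3 samples, 4 features each

out = tiny(sample)  # calling the module runs forward() via nn.Module.__call__
param_names = [name for name, _ in tiny.named_parameters()]

print(out.shape)
print(param_names)
```

Calling `tiny(sample)` rather than `tiny.forward(sample)` is the usual convention, since `__call__` also runs any registered hooks.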



### **The Layers We'll Use**

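- `nn.Flatten():` Flattens each image tensor of shape `(1, 28, 28)` into a vector of length `784` while leaving the batch dimension untouched, so a batch of shape `(N, 1, 28, 28)` becomes `(N, 784)`. This is necessary because a fully connected layer treats each sample as a flat vector of features.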

- `nn.Linear(in_features, out_features):` A fully connected layer that applies a linear transformation to the input data. The `in_features` parameter specifies the size of each input sample, and the `out_features` parameter specifies the size of each output sample. For example, `nn.Linear(784, 128)` creates a layer that takes an input of size `784` and produces an output of size `128`.
  
- `nn.ReLU():` A non-linear activation function that introduces non-linearity into the model. It stands for "Rectified Linear Unit" and is defined as `ReLU(x) = max(0, x)`. This means that it outputs the input directly if it is positive; otherwise, it outputs zero. This helps the model learn complex patterns in the data.

- `nn.Sequential(*layers):` A container module that chains layers together. You pass the layers as positional arguments, and it creates a single module that applies each one in order. For example, `nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))` creates a model that first flattens the input, then applies a linear transformation to 128 features, applies the ReLU activation, and finally applies another linear transformation to produce `10` output features (one for each digit class).
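
To see how each layer changes the tensor shape, the sketch below pushes a random batch through the layers one at a time. The tensor `dummy_batch` is a hypothetical stand-in for real MNIST images, and the layer weights are freshly initialized, so only the shapes are meaningful.

```python
import torch
import torch.nn as nn

dummy_batch = torch.rand(64, 1, 28, 28)  # random stand-in for a batch of MNIST images

flat = nn.Flatten()(dummy_batch)         # (64, 1, 28, 28) -> (64, 784)
hidden = nn.Linear(784, 128)(flat)       # (64, 784) -> (64, 128)
activated = nn.ReLU()(hidden)            # same shape, negative values clamped to 0
logits = nn.Linear(128, 10)(activated)   # (64, 128) -> (64, 10)

for name, t in [("flat", flat), ("hidden", hidden),
                ("activated", activated), ("logits", logits)]:
    print(name, tuple(t.shape))
```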

In [4]:
class SimpleMLP(nn.Module):
    def __init__(self):
        super(SimpleMLP, self).__init__()
        # Flatten 28x28 image to 784
        self.flatten = nn.Flatten()
        # First layer: 784 -> 128
        self.linear1 = nn.Linear(784, 128)
        self.relu1 = nn.ReLU()
        # Second layer: 128 -> 64
        self.linear2 = nn.Linear(128, 64)
        self.relu2 = nn.ReLU()
        # Output layer: 64 -> 10 (one for each digit 0-9)
        self.linear3 = nn.Linear(64, 10)
    
    def forward(self, x):
        x = self.flatten(x)
        x = self.linear1(x)
        x = self.relu1(x)
        x = self.linear2(x)
        x = self.relu2(x)
        x = self.linear3(x)
        return x

# Instantiate the model
model = SimpleMLP()
print(model)

SimpleMLP(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear1): Linear(in_features=784, out_features=128, bias=True)
  (relu1): ReLU()
  (linear2): Linear(in_features=128, out_features=64, bias=True)
  (relu2): ReLU()
  (linear3): Linear(in_features=64, out_features=10, bias=True)
)
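
As a sanity check on the printed architecture, the number of trainable parameters can be worked out by hand: each `nn.Linear(in_features, out_features)` with `bias=True` holds `in_features * out_features` weights plus `out_features` biases. The helper name `linear_params` is hypothetical.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


layer_sizes = [(784, 128), (128, 64), (64, 10)]
per_layer = [linear_params(i, o) for i, o in layer_sizes]
total = sum(per_layer)

print(per_layer)  # [100480, 8256, 650]
print(total)      # 109386
```

The same total should come out of `sum(p.numel() for p in model.parameters())`. Almost all of the parameters sit in the first layer, because it connects all 784 input pixels to the 128 hidden units.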


The same architecture can be written more compactly by wrapping the layers in `nn.Sequential`:

In [5]:
class SequentialMLP(nn.Module):
    def __init__(self):
        super(SequentialMLP, self).__init__()
        self.model = nn.Sequential(
            nn.Flatten(),          # Flatten 28x28 to 784
            nn.Linear(784, 128),  # First layer
            nn.ReLU(),            # Activation
            nn.Linear(128, 64),   # Second layer
            nn.ReLU(),            # Activation
            nn.Linear(64, 10)     # Output layer
        )
    
    def forward(self, x):
        return self.model(x)

# Instantiate the Sequential model
sequential_model = SequentialMLP()
print(sequential_model)

SequentialMLP(
  (model): Sequential(
    (0): Flatten(start_dim=1, end_dim=-1)
    (1): Linear(in_features=784, out_features=128, bias=True)
    (2): ReLU()
    (3): Linear(in_features=128, out_features=64, bias=True)
    (4): ReLU()
    (5): Linear(in_features=64, out_features=10, bias=True)
  )
)


## **The Training Essentials**

To train a neural network, you need to define three key components:

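1. **Loss Function:** The loss function measures how well the model's predictions match the true labels. For classification tasks, a common loss function is `nn.CrossEntropyLoss()`, which combines `nn.LogSoftmax()` and `nn.NLLLoss()` in a single class. It therefore expects raw, unnormalized scores (logits) from the model, which is why our MLP applies no softmax after its last layer.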
   
2. **Optimizer:** The optimizer updates the model's parameters based on the computed gradients. A common choice is `torch.optim.SGD` (Stochastic Gradient Descent) or `torch.optim.Adam`, which is an adaptive learning rate optimization algorithm.
   
3. **Training Loop:** The training loop iterates over the dataset for a specified number of epochs. Within each epoch, it processes the data batch by batch:

    - Feeding a batch of data to the model (forward pass)
    - Calculating the loss
    - Performing backpropagation to compute the gradients
    - Updating the model's parameters with the optimizer
    - Repeating the process for multiple epochs until the model converges or reaches a satisfactory level of performance
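
The statement in item 1 that `nn.CrossEntropyLoss()` combines `nn.LogSoftmax()` and `nn.NLLLoss()` can be checked numerically. The sketch below uses random logits and made-up targets (both hypothetical) rather than real model outputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 10)           # hypothetical raw model outputs: 4 samples, 10 classes
targets = torch.tensor([3, 0, 9, 1])  # hypothetical true class indices

# One-step version: takes raw logits directly.
ce = nn.CrossEntropyLoss()(logits, targets)

# Two-step version: log-softmax over the class dimension, then negative log-likelihood.
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)

print(ce.item(), nll.item())
print(torch.allclose(ce, nll))
```

Both routes give the same scalar loss, averaged over the batch by default.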

In [6]:
# Instantiate loss function
loss_fn = nn.CrossEntropyLoss()

# Instantiate optimizer
# lr = learning rate (how big of steps to take)
learning_rate = 1e-3
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

print("Loss function:", loss_fn)
print("Optimizer:", optimizer)

Loss function: CrossEntropyLoss()
Optimizer: Adam (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    capturable: False
    decoupled_weight_decay: False
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: None
    lr: 0.001
    maximize: False
    weight_decay: 0
)


## **The Training Loop Explained** 

In the training loop, we iterate over the training data for a specified number of epochs. For each batch of data, we perform the following steps:

1. **Forward Pass:** We pass the input data `X` through the model to get the predictions `pred`. This is done by calling `model(X)`, which internally calls the `forward` method of the model.

2. **Calculate Loss:** We compute the loss by comparing the model's predictions `pred` with the true labels `y` using the loss function defined earlier: `loss_fn(pred, y)`.

3. **Backward Pass:** We call `loss.backward()` to compute the gradients of the loss with respect to the model's parameters. This populates the `.grad` attributes of the parameters with the computed gradients.

4. **Update Parameters:** We call `optimizer.step()` to update the model's parameters based on the computed gradients. This step modifies the parameters in the direction that reduces the loss.

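5. **Zero Gradients:** We call `optimizer.zero_grad()` to clear the gradients of the model's parameters. This is important because PyTorch accumulates gradients across `backward()` calls by default, so stale gradients from this batch would otherwise be added to those of the next one. Many training loops make this call at the start of each iteration instead; either position works as long as the gradients are cleared before the next `loss.backward()`.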
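
The accumulation behaviour that makes step 5 necessary is easy to reproduce on a single tensor. This is a minimal sketch with a hypothetical two-element parameter `w` and a toy loss, not the MNIST model.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)  # hypothetical parameter

loss = (w ** 2).sum()   # d(loss)/dw = 2 * w = [2, 4]
loss.backward()
first_grad = w.grad.clone()

loss = (w ** 2).sum()   # same loss again, without clearing w.grad
loss.backward()
accumulated_grad = w.grad.clone()

# Clear the gradient manually; optimizer.zero_grad() does this
# for every parameter the optimizer manages.
w.grad = None

print(first_grad)        # tensor([2., 4.])
print(accumulated_grad)  # tensor([4., 8.])
```

The second `backward()` call adds to the existing gradient instead of replacing it, which is why the gradients must be cleared once per iteration.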

In [9]:
def train(dataloader, model, loss_fn, optimizer, epochs: int):
    """
    Train the model for a specified number of epochs.
    
    Args:
        dataloader: DataLoader providing batches of training data
        model: The neural network model
        loss_fn: Loss function
        optimizer: Optimizer for updating weights
        epochs: Number of training epochs
    """
    model.train()  # Set model to training mode
    
    for epoch in range(epochs):
        total_loss = 0.0
        num_batches = 0
        
        for batch_idx, (X, y) in enumerate(dataloader):
            # Step 1: Forward pass
            pred = model(X)
            
            # Step 2: Calculate loss
            loss = loss_fn(pred, y)
            
            # Step 3: Backpropagation
            loss.backward()
            
            # Step 4: Update weights
            optimizer.step()
            
            # Step 5: Zero gradients
            optimizer.zero_grad()
            
            total_loss += loss.item()
            num_batches += 1
            
            # Print progress every 100 batches
            if (batch_idx + 1) % 100 == 0:
                avg_loss = total_loss / num_batches
                print(f'Epoch {epoch + 1}/{epochs}, Batch {batch_idx + 1}/{len(dataloader)}, Loss: {avg_loss:.4f}')
        
        # Print average loss for the epoch
        avg_loss = total_loss / num_batches
        print(f'Epoch {epoch + 1}/{epochs} completed. Average Loss: {avg_loss:.4f}\n')

# Train the model
epochs = 5
print("Starting training...\n")
train(train_dataloader, model, loss_fn, optimizer, epochs=epochs)
print("Training completed!")

Starting training...

Epoch 1/5, Batch 100/938, Loss: 0.0464
Epoch 1/5, Batch 200/938, Loss: 0.0418
Epoch 1/5, Batch 300/938, Loss: 0.0414
Epoch 1/5, Batch 400/938, Loss: 0.0417
Epoch 1/5, Batch 500/938, Loss: 0.0427
Epoch 1/5, Batch 600/938, Loss: 0.0430
Epoch 1/5, Batch 700/938, Loss: 0.0435
Epoch 1/5, Batch 800/938, Loss: 0.0439
Epoch 1/5, Batch 900/938, Loss: 0.0450
Epoch 1/5 completed. Average Loss: 0.0455

Epoch 2/5, Batch 100/938, Loss: 0.0331
Epoch 2/5, Batch 200/938, Loss: 0.0320
Epoch 2/5, Batch 300/938, Loss: 0.0340
Epoch 2/5, Batch 400/938, Loss: 0.0338
Epoch 2/5, Batch 500/938, Loss: 0.0331
Epoch 2/5, Batch 600/938, Loss: 0.0332
Epoch 2/5, Batch 700/938, Loss: 0.0324
Epoch 2/5, Batch 800/938, Loss: 0.0327
Epoch 2/5, Batch 900/938, Loss: 0.0347
Epoch 2/5 completed. Average Loss: 0.0353

Epoch 3/5, Batch 100/938, Loss: 0.0245
Epoch 3/5, Batch 200/938, Loss: 0.0242
Epoch 3/5, Batch 300/938, Loss: 0.0251
Epoch 3/5, Batch 400/938, Loss: 0.0272
Epoch 3/5, Batch 500/938, Loss: 0.