In [14]:
import torch.nn as nn
import torch
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import random_split, DataLoader

### Define the network architecture:

This part of the code defines a simple feed-forward neural network using PyTorch's nn.Sequential container. This network will be used to classify MNIST images, which are grayscale images of handwritten digits (0-9).

1. nn.Linear(28*28, 64): This is the first layer of the network, a linear (also known as fully connected) layer. It takes as input a flattened MNIST image. Since each image is 28x28 pixels, this results in 784 input features. The output size is 64, meaning this layer consists of 64 neurons.

2. nn.ReLU(): This is the activation function for the first layer. It introduces non-linearity into the model, allowing the network to learn more complex patterns. ReLU stands for Rectified Linear Unit and it operates element-wise on the output of the previous layer, effectively clipping negative values to zero.

3. nn.Linear(64, 64): This is the second layer of the network, another linear layer. It takes as input the output of the first layer (and its activation function), which has a size of 64. The output size is also 64.

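4. nn.ReLU(): This is the activation function for the second layer.

5. nn.Dropout(0.1): This layer is active only in training mode. It randomly sets 10% of the incoming activations to zero and rescales the rest, which acts as a regularizer against overfitting. It has no learnable parameters and does not change the shape of its input.

6. nn.Linear(64, 10): This is the final layer of the network. It takes as input the output of the second layer (after its activation function and dropout), which has a size of 64. The output size is 10, corresponding to the 10 possible classes (digits 0-9) of the MNIST dataset.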

The output of this network will be a 10-dimensional vector for each input image, where each dimension corresponds to the predicted score for each class. The class with the highest score is the model's prediction.
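As a quick sanity check on the shapes described above, the following sketch rebuilds the same layer stack under a hypothetical name (`net`) and feeds it a batch of random values standing in for flattened images. The batch size of 4 and the random input are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

# Same layer stack as the notebook's model, under an illustrative name
net = nn.Sequential(
    nn.Linear(28 * 28, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(64, 10),
)

batch = torch.rand(4, 28 * 28)       # 4 fake flattened images
scores = net(batch)                  # one row of 10 class scores per image
predictions = scores.argmax(dim=1)   # index of the highest score per row

# Weights plus biases of the three linear layers
n_params = sum(p.numel() for p in net.parameters())
print(scores.shape, predictions.shape, n_params)
```

The parameter count follows directly from the layer sizes: (784 * 64 + 64) + (64 * 64 + 64) + (64 * 10 + 10) = 55,050.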

In [15]:
# Define the network
model = nn.Sequential(
    # MNIST images are 28x28, so 784 input features
    # 64 is the number of hidden units
    nn.Linear(28*28, 64), # 28*28 = 784
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Dropout(0.1), # randomly zero 10% of activations during training to reduce overfitting
    nn.Linear(64, 10) # 10 classes
)
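The nn.Dropout(0.1) layer in the model above behaves differently depending on whether the module is in training or evaluation mode. The sketch below isolates that behaviour on a standalone layer; the name `drop` and the all-ones input are assumptions chosen to make the effect easy to see.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.1)
x = torch.ones(1000)

drop.train()
train_out = drop(x)   # each entry is either 0 or rescaled to 1 / (1 - 0.1)

drop.eval()
eval_out = drop(x)    # dropout is a no-op in evaluation mode

print(train_out.unique(), torch.equal(eval_out, x))
```

This is why a model containing dropout is usually switched with `model.train()` before training and `model.eval()` before validation.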

### Define my Optimizer:

The optimizer is responsible for updating the model's parameters in the direction that minimizes the loss function. Here, Stochastic Gradient Descent (SGD) is used with a learning rate of 0.01. The model.parameters() method is used to fetch the parameters of the model to be optimized.
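To make the update rule concrete, here is a minimal pure-Python sketch of plain SGD (no momentum or weight decay, which matches the defaults of `optim.SGD`). The helper `sgd_step` and the toy objective f(w) = (w - 3)^2 are illustrative assumptions, not part of the notebook.

```python
def sgd_step(params, grads, lr=0.01):
    """Return updated parameters after one plain SGD step: p <- p - lr * g."""
    return [p - lr * g for p, g in zip(params, grads)]

# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w = [0.0]
for _ in range(1000):
    grad = [2 * (w[0] - 3)]
    w = sgd_step(w, grad, lr=0.01)

print(w[0])  # close to the minimizer w = 3
```

Each step moves the parameter a small distance against the gradient, with the learning rate controlling the step size.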

In [16]:
# Define my optimizer
params = model.parameters()
optimizer = optim.SGD(params, lr=0.01)

### Define my Loss Function

The loss function measures the difference between the predicted value (output of the model) and the target value (the true value). Here, the cross entropy loss function is used, which is commonly used to train classification models.
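nn.CrossEntropyLoss takes raw logits and integer class labels, applies a log-softmax internally, and by default averages the result over the batch. The sketch below computes the same quantity for a single example in plain Python; the function name `cross_entropy` and the example logits are assumptions for illustration.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy for one example: -log(softmax(logits)[target])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

confident_right = cross_entropy([5.0, 0.0, 0.0], target=0)  # small loss
confident_wrong = cross_entropy([5.0, 0.0, 0.0], target=1)  # large loss
uniform = cross_entropy([0.0] * 10, target=3)               # equals ln(10)

print(confident_right, confident_wrong, uniform)
```

A model that scores all ten digits equally therefore has a loss of ln(10), roughly 2.30, which is a useful reference point when reading loss values.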

In [17]:
# Define my loss function
loss = nn.CrossEntropyLoss()

### Dataset Train, Test Split and Loaders

The MNIST dataset is split into three parts: a training set (used to fit the model), a validation set (used to evaluate the model during training), and a test set (used to evaluate the model after training). MNIST ships with 60,000 training examples and 10,000 test examples. Here the 60,000 training examples are randomly split into 55,000 for training and 5,000 for validation; the test set is not loaded in this notebook. The validation set is used during training to compute and report metrics such as loss or accuracy (the loop below reports the loss). The dataset is provided by the torchvision package, which downloads it and exposes it as a PyTorch dataset.
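The following sketch shows how `random_split` partitions a dataset, using a small stand-in dataset instead of MNIST so that it runs without a download. The fake tensors, the 90/10 split sizes, the fixed seed, and `shuffle=True` are all assumptions for illustration and differ from the notebook's own settings.

```python
import torch
from torch.utils.data import TensorDataset, random_split, DataLoader

# Stand-in for MNIST: 100 blank "images" with labels 0-9
fake_images = torch.zeros(100, 1, 28, 28)
fake_labels = torch.arange(100) % 10
dataset = TensorDataset(fake_images, fake_labels)

# A seeded generator makes the split reproducible across runs
gen = torch.Generator().manual_seed(0)
train_part, val_part = random_split(dataset, [90, 10], generator=gen)

train_loader = DataLoader(train_part, batch_size=32, shuffle=True)
val_loader = DataLoader(val_part, batch_size=32)

# 90 examples in batches of 32 give 3 batches (the last one is smaller)
print(len(train_part), len(val_part), len(train_loader))
```

The two parts never share an example, so the validation loss is measured on data the model has not been trained on.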

In [18]:
# train, val split
train_data = datasets.MNIST(root='data', train=True, download=True, transform=transforms.ToTensor())
train, val = random_split(train_data, [55000, 5000])
train_loader = DataLoader(train, batch_size=32)
val_loader = DataLoader(val, batch_size=32)

### Creating the 5-Step PyTorch Training Loop

The training loop is the heart of the training process. It consists of five steps:

1. Forward Pass: The model makes a prediction based on the input data x (flattened images). The output l (logits) is the raw, unnormalized scores for each class.

2. Compute the Objective Function: The loss function J is computed using the predicted output l and the true labels y. This measures the difference between the model's predictions and the true values.

3. Clean the Gradients: Before computing the gradients, we need to set the existing gradients to zero. This is because PyTorch accumulates gradients, and we don't want to mix up gradients between mini batches.

4. Compute the Partial Derivative with respect to the Parameters: The backward() function computes the gradient of the loss function J with respect to the model parameters. This is used in the next step for the gradient descent update.

5. Step in the Opposite Direction: The step() function updates the model parameters in the opposite direction of the gradients to minimize the loss function. This is done by the optimizer, which in this case is Stochastic Gradient Descent (SGD).

The loop is run for a specified number of epochs, where an epoch is one complete pass through the entire training dataset. In this case, the number of epochs is set to 5.
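Step 3 exists because `backward()` adds to any gradient already stored on a parameter. The sketch below shows this on a single scalar; the tensor `w` and the function w^2 are illustrative assumptions.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

# First backward pass: d(w^2)/dw = 2w = 4
(w ** 2).backward()
first = w.grad.item()

# Second backward pass without zeroing: the new gradient is added to the old one
(w ** 2).backward()
accumulated = w.grad.item()

# Zero the stored gradient, then recompute
w.grad.zero_()
(w ** 2).backward()
fresh = w.grad.item()

print(first, accumulated, fresh)  # 4.0 8.0 4.0
```

Without the zeroing step, each mini-batch update would use the sum of all gradients computed so far rather than the gradient of the current batch.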

In [20]:
# Create the training and validation loops
epochs = 5
for epoch in range(epochs):
    losses = []
    for batch in train_loader:
        x, y = batch

        # Flatten the image
        # x: b x 1 x 28 x 28 -> b x 784
        b = x.size(0)
        x = x.view(b, -1) # -1 means infer this dimension

        # 1. Forward pass
        l = model(x) # l: logits

        # 2. Compute the objective function
        J = loss(l, y)

        # 3. Cleaning the gradients
        model.zero_grad()
        # optimizer.zero_grad()
        # params.grad.zero_()

        # 4. Compute the partial derivatives of J w.r.t parameters
        J.backward()
        # params.grad.add_(dJ/dparams)

        # 5. Step in the opposite direction of the gradient
        optimizer.step()
        # with torch.no_grad(): params = params - lr * params.grad

        losses.append(J.item())

    print(f'Epoch {epoch + 1}, training loss: {torch.tensor(losses).mean():.2f}')

    # Evaluate the model on the validation set
    model.eval()  # disable dropout while evaluating
    losses = []
    for batch in val_loader:
        x, y = batch

        # x: b x 1 x 28 x 28 -> b x 784
        b = x.size(0)
        x = x.view(b, -1)

        # No gradients are needed for evaluation
        with torch.no_grad():
            # 1. Forward pass
            l = model(x) # l: logits

            # 2. Compute the objective function
            J = loss(l, y)

        losses.append(J.item())

    print(f'Epoch {epoch + 1}, validation loss: {torch.tensor(losses).mean():.2f}')
    model.train()  # re-enable dropout for the next training epoch


Epoch 1, training loss: 0.23
Epoch 1, validation loss: 0.25
Epoch 2, training loss: 0.21
Epoch 2, validation loss: 0.24
Epoch 3, training loss: 0.20
Epoch 3, validation loss: 0.22
Epoch 4, training loss: 0.18
Epoch 4, validation loss: 0.21
Epoch 5, training loss: 0.17
Epoch 5, validation loss: 0.20
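
The loops above report only the loss. If accuracy is also of interest, one possible helper is sketched below; the function name `batch_accuracy` and the hand-made logits are assumptions for illustration, not output from the trained model.

```python
import torch

def batch_accuracy(logits, targets):
    """Fraction of rows whose highest-scoring class matches the target."""
    preds = logits.argmax(dim=1)
    return (preds == targets).float().mean().item()

logits = torch.tensor([[2.0, 0.1, 0.3],
                       [0.2, 1.5, 0.1],
                       [0.9, 0.8, 0.1],
                       [0.1, 0.2, 3.0]])
targets = torch.tensor([0, 1, 1, 2])

print(batch_accuracy(logits, targets))  # 3 of 4 correct -> 0.75
```

Inside the validation loop, such a helper could be called on `l` and `y` for each batch and the results averaged in the same way as the losses.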
