In [None]:
%load_ext autoreload
%autoreload 2 

# Introduction

**What you will learn today**: This lab serves as an introduction to PyTorch. We will learn the different steps required in training a deep learning model with modern libraries, such as PyTorch. The idea is to make the model consistent with the structure of previous labs. In other words, we will simply call the **fit** function (like Scikit-learn!) and the model will train. 

So, which are these steps?

* Preliminaries:
    * load the train and test datasets, `train_dataset` and `test_dataset` (MNIST in our case)
    * turn the datasets into "dataloaders": `train_dataloader` and `test_dataloader`
    * define your `model` architecture
    * define your `optimizer`, e.g. SGD


* Training: Now we have all the building blocks and we need to make our model "learn". In most cases, the training follows a specific "recipe". Specifically, we feed the `model` the whole `train_dataset` using batches that come from the `train_dataloader`. We repeat this a certain number of times, called `epochs`. Each epoch consists of `batches`. So what do we do for each batch?
    * zero out the gradients held by the optimizer, so that gradients from the previous batch do not accumulate into the current one
    * compute the output of the model $f(\cdot)$ for our current data: $x\mapsto f(x)$
    * compute the loss: $\mathcal{L}(f(x), y)$ where $y$ denotes the ground truth
    * run the `backpropagation` algorithm to compute the gradients, then let the optimizer apply the update rule
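
The per-batch recipe above can be sketched as follows. This is only an illustration under stated assumptions: it uses a small synthetic dataset rather than MNIST, a single linear layer as the model, and cross-entropy as the loss. The names `toy_model`, `toy_optimizer` and `toy_dataloader` are hypothetical and not part of the lab.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in for a dataset: 256 samples, 20 features, 3 classes.
# The labels depend on the features, so there is something to learn.
features = torch.randn(256, 20)
labels = features[:, :3].argmax(dim=1)
toy_dataloader = DataLoader(TensorDataset(features, labels), batch_size=64, shuffle=True)

toy_model = nn.Linear(20, 3)
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)

epoch_losses = []
for epoch in range(5):
    running_loss = 0.0
    for x, y in toy_dataloader:
        toy_optimizer.zero_grad()          # clear gradients left over from the previous batch
        output = toy_model(x)              # forward pass
        loss = F.cross_entropy(output, y)  # loss against the ground truth
        loss.backward()                    # backpropagation: compute the gradients
        toy_optimizer.step()               # update rule: adjust the parameters
        running_loss += loss.item()
    # average of the per-batch losses for this epoch
    epoch_losses.append(running_loss / len(toy_dataloader))

print(epoch_losses)
```

The order of the five calls inside the inner loop is the part that carries over to any model; everything else here is scaffolding for the toy data.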



# Getting the preliminaries out of the way

In [None]:
# first we load all the necessary libraries
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms

We now load the datasets. We are going to work with MNIST and our goal is to classify digits. This is a popular dataset and PyTorch offers it out-of-the-box, making our life easy! We simply need to call the corresponding method.

In [None]:
# The data are given as PIL images. We need to convert our data to a type 
# that is readable by a Neural Network. Thus, we use the ToTensor() "transform" 
transform = transforms.Compose([
    torchvision.transforms.ToTensor(),
    # torchvision.transforms.Normalize((0.1307,), (0.3081,))
])

# load the train dataset
train_dataset =  # INSERT YOUR CODE HERE

# load the test dataset
test_dataset = # INSERT YOUR CODE HERE

In [None]:
# define the hyperparameters
BATCH_SIZE = 1024
TEST_BATCH_SIZE = 2048
LEARNING_RATE = 0.01

# find out which device is available
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(DEVICE)

However, we do not feed the whole dataset to the model at once; for large datasets it would not fit in memory. Instead, we perform *stochastic* gradient descent, i.e. at each step we feed the model a small part of the data, called a batch. In order to do so, we use PyTorch DataLoaders.

In [None]:
# Construct the dataloader for the training dataset.
# Here we shuffle the data to promote stochasticity.
train_dataloader = # INSERT YOUR CODE HERE


# Construct the dataloader for the testing dataset.
test_dataloader = # INSERT YOUR CODE HERE
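
To get a feel for what a dataloader yields, here is a minimal sketch on synthetic tensors. It assumes random "images" with MNIST's shape instead of the real dataset, and the names `fake_dataset` and `fake_dataloader` are hypothetical.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for MNIST: 100 grayscale "images" of size 28x28 with random labels.
fake_images = torch.rand(100, 1, 28, 28)
fake_labels = torch.randint(0, 10, (100,))
fake_dataset = TensorDataset(fake_images, fake_labels)

# shuffle=True draws the samples in a new random order every epoch.
fake_dataloader = DataLoader(fake_dataset, batch_size=32, shuffle=True)

batch_shapes = [tuple(images.shape) for images, _ in fake_dataloader]
print(len(fake_dataloader))  # number of batches per epoch
print(batch_shapes)          # the last batch is smaller: 100 = 3 * 32 + 4
```

Note that `len()` of a dataloader counts batches, while `len()` of its `.dataset` counts samples.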

Now, let's visualize some samples.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

images = next(iter(train_dataloader))[0][:10]
grid = torchvision.utils.make_grid(images, nrow=5, padding=10)

def show(img):
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1,2,0)), interpolation='nearest')

show(grid)

Now, we are ready to define our model. We will start with a simple model, a MultiLayer Perceptron (MLP) with 2 layers.

In [None]:
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class Net(nn.Module):
    def __init__(self):
        # define the different modules of the network
        super(Net, self).__init__()
        # input layer has 28x28=784 features 
        self.fc1 = nn.Linear(784, 50)
        # the output layer has 10 neurons. i.e. the number of output classes.
        self.fc2 = nn.Linear(50, 10)
        # we also define the non-linearity 
        self.relu = nn.ReLU()



    def forward(self, x):
        # ***************************************************
        # INSERT YOUR CODE HERE
        # You should (a) reshape the input to a size that is readable
        # by the MLP and (b) pass the input x successively 
        # through the layers.
        # ***************************************************
        return x
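
As a hint for the reshaping step, the sketch below shows how a batch of image tensors can be flattened before it reaches a linear layer. It is one possible approach, not necessarily the intended solution; the layers `hidden_layer` and `output_layer` are hypothetical and deliberately named differently from the ones in `Net`.

```python
import torch
import torch.nn as nn

# A batch of 8 fake grayscale images with MNIST's shape: (batch, channels, height, width).
batch = torch.rand(8, 1, 28, 28)

# Flatten everything except the batch dimension: (8, 1, 28, 28) -> (8, 784).
flat = batch.view(batch.size(0), -1)

# Hypothetical layers with the same sizes as in the lab's MLP.
hidden_layer = nn.Linear(784, 50)
output_layer = nn.Linear(50, 10)

# linear -> non-linearity -> linear
logits = output_layer(torch.relu(hidden_layer(flat)))
print(flat.shape, logits.shape)
```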


In [None]:
# initialize the model
model = Net()

# move model to device
model = model.to(DEVICE)

# define the optimizer
optimizer = optim.SGD(model.parameters(), lr=LEARNING_RATE)

We now define:
* the `fit` function that performs the training part
* the `predict` function that takes as input the test dataloader and prints the performance metrics (e.g. accuracy)

In [None]:
def predict(model, test_dataloader, device):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_dataloader:
            # move data and target to the device. 
            # Model and data must be on the same device
            # INSERT YOUR CODE HERE

            # do the forward pass
            output = # INSERT YOUR CODE HERE

            # compute the loss
            loss = # INSERT YOUR CODE HERE

            # compute the running test loss
            test_loss += loss.item()
            # ... and how many samples were correctly classified
            pred = output.data.max(1, keepdim=True)[1]
            correct += pred.eq(target.data.view_as(pred)).sum()

    test_loss /= len(test_dataloader.dataset)
    accuracy = 100. * correct / len(test_dataloader.dataset)

    print(f'Test set: Avg. loss: {test_loss:.4f}, Accuracy: {correct}/{len(test_dataloader.dataset)} ({accuracy:.0f}%)')
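
The way `predict` counts correct predictions can be checked by hand on a tiny example. The sketch below uses made-up logits for three samples and two classes, and `argmax` in place of `max(...)[1]`, which returns the same indices.

```python
import torch

# Hand-written logits for 3 samples and 2 classes.
output = torch.tensor([[0.1, 0.9],
                       [0.8, 0.2],
                       [0.3, 0.7]])
target = torch.tensor([1, 0, 0])

pred = output.argmax(dim=1, keepdim=True)             # shape (3, 1): predicted class per sample
correct = pred.eq(target.view_as(pred)).sum().item()  # number of matches as a Python int
accuracy = 100.0 * correct / len(target)
print(correct, accuracy)
```

The first two samples are classified correctly and the third is not, so two out of three predictions match.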


We perform a "sanity check". Our model is at the moment initialized randomly and we have 10 classes (each class has approximately the same number of samples). This means that we should get chance-level performance, i.e. roughly 10% accuracy.
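
A similar sanity check exists for the loss, assuming cross-entropy is used (the lab does not fix the loss function at this point). A model that assigns equal probability to all 10 classes has a cross-entropy of $\ln 10 \approx 2.30$ per sample, so an untrained model should report a per-sample loss in that neighbourhood. A minimal sketch:

```python
import math
import torch
import torch.nn.functional as F

# All-zero logits give every one of the 10 classes probability 1/10.
uniform_logits = torch.zeros(5, 10)
targets = torch.tensor([0, 3, 5, 7, 9])

# Default reduction is the mean over the batch.
uniform_loss = F.cross_entropy(uniform_logits, targets).item()
print(uniform_loss, math.log(10))
```

A randomly initialized network is not exactly uniform, so its loss will only be close to this value, not equal to it.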

In [None]:
predict(model=model, test_dataloader=test_dataloader, device=DEVICE)

In [None]:
def train_epoch(model, train_dataloader, optimizer, device=None):
    '''
    This function implements the core components of any neural network training regimen.
    In our stochastic setting our code follows a very specific "path". First, we load
    a single batch and zero the gradients. Then we perform the forward pass, compute the loss,
    run the backward pass to compute the gradients and update the parameters. And ... repeat!
    '''

    running_loss = 0.0

    model.train()
    for batch_idx, (data, target) in enumerate(train_dataloader):
        # move data and target to device
        # INSERT YOUR CODE HERE

        # zero the parameter gradients
        # INSERT YOUR CODE HERE

        # do the forward pass
        output = # INSERT YOUR CODE HERE

        # compute the loss
        loss = # INSERT YOUR CODE HERE

        # compute the gradients
        loss.backward()

        # update the parameters using the computed gradients
        optimizer.step()
        
        # accumulate the running loss
        running_loss += loss.item()
    
    return running_loss / len(train_dataloader.dataset)


def fit(model, train_dataloader, optimizer, epochs, device):
    '''
    the fit method simply calls the train_epoch() method for a 
    specified number of epochs.
    '''

    # keep track of the losses in order to visualize them later
    losses = []
    for epoch in range(epochs):
        running_loss = train_epoch(
            model=model, 
            train_dataloader=train_dataloader, 
            optimizer=optimizer, 
            device=device
        )
        print(f"Epoch {epoch}: Loss={running_loss}")
        losses.append(running_loss)

    return losses

In [None]:
losses = fit(
    model=model, 
    train_dataloader=train_dataloader,
    optimizer=optimizer,
    epochs=10,
    device=DEVICE)

Let's visualize the loss progression.

In [None]:
plt.plot(losses)

plt.xlabel('Epoch')
plt.ylabel("Loss")
plt.title("Loss progression across epochs")

In [None]:
predict(model=model, test_dataloader=test_dataloader, device=DEVICE)

The results are not very good. There are some major problems. We see from the plot above that the loss keeps dropping and does not "plateau". This indicates that we can run the optimization for a few more epochs and improve the performance. Another point is that our learning rate may be too low, or that vanilla SGD is not the best choice of optimizer here. In the next section we will see that simply changing the optimizer (from SGD to Adam) yields very different results!

# Putting everything together

So far we have created a model, its optimizer, functions for training and predicting. Our end goal is to have something more "object oriented". In other words, we want a model with the ease and clarity of scikit-learn: simply call model.fit() and training runs.

We throw everything we have written above into a single class (+some extra functionality). After all, every model has a different forward function while the remaining structure stays the same.

In [None]:
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from tqdm import tqdm

class BasicModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Here we define the model modules
        

    def forward(self, x):
        # defines the forward function of the model. 
        raise NotImplementedError


    def fit(self, train_dataloader, optimizer, epochs, device, plot_loss=True):
        losses = []
        for epoch in range(epochs):
            running_loss = self.train_epoch(
                train_dataloader=train_dataloader, 
                optimizer=optimizer, 
                epoch_idx=epoch,
                device=device)
            
            losses.append(running_loss)

        if plot_loss:
            self.plot_loss_progression(losses=losses)

    def plot_loss_progression(self, losses):
        plt.plot(losses)
        plt.xlabel('Epoch')
        plt.ylabel("Loss")
        plt.title("Loss progression across epochs")

    def train_epoch(self, train_dataloader, optimizer, epoch_idx, device):
        running_loss = 0.0

        self.train()
        tk0 = tqdm(train_dataloader, total=len(train_dataloader), desc=f"Epoch {epoch_idx}")
        for batch_idx, (data, target) in enumerate(tk0):
            
            # ***************************************************
            # INSERT YOUR CODE HERE
            # Copy paste from before.
            # ***************************************************   

            # print statistics
            running_loss += loss.item()
            avg_loss = running_loss / (batch_idx + 1)
            tk0.set_postfix(loss=avg_loss, stage="train")

        
        return running_loss / len(train_dataloader.dataset)


    def predict(self, test_dataloader, device):
        self.eval()
        test_loss = 0
        correct = 0
        with torch.no_grad():
            for data, target in test_dataloader:
                # ***************************************************
                # INSERT YOUR CODE HERE
                # Copy paste from before.
                # ***************************************************   
                test_loss += loss.item()
                pred = output.data.max(1, keepdim=True)[1]
                correct += pred.eq(target.data.view_as(pred)).sum()

        test_loss /= len(test_dataloader.dataset)
        accuracy = 100. * correct / len(test_dataloader.dataset)

        print(f'Test set: Avg. loss: {test_loss:.4f}, Accuracy: {correct}/{len(test_dataloader.dataset)} ({accuracy:.0f}%)')

Notice that the class above is not actually a model, but it defines all the relevant functions. Now we create the model by simply **inheriting** from the class above and defining two functions: `__init__` and `forward`.

In [None]:
# We create a model identical to the previous one.

class MLP(BasicModel): # inherit the BasicModel class

    def __init__(self):
        super().__init__()
        # input layer has 28x28=784 features 
        self.fc1 = nn.Linear(784, 50)
        # the output layer has 10 neurons. i.e. the number of output classes.
        self.fc2 = nn.Linear(50, 10)
        # we also define the non-linearity 
        self.relu = nn.ReLU()


    def forward(self, x):
        # ***************************************************
        # INSERT YOUR CODE HERE
        # Copy paste from before.
        # ***************************************************   
        return x

        

In [None]:
# initialize model and define the optimizer. 
mlp = MLP().to(DEVICE)

# Instead of SGD we will use a more sophisticated one called Adam.
optimizer = optim.Adam(mlp.parameters(), lr=LEARNING_RATE)

# train the mlp
mlp.fit(
    train_dataloader=train_dataloader, 
    optimizer=optimizer,
    epochs=10,
    device=DEVICE)

In [None]:
mlp.predict(test_dataloader=test_dataloader, device=DEVICE)

The results are much better using Adam (in the same number of epochs). This shows the importance of selecting the correct optimizer.

## CNN

Notice that the MLP does not take into account the nature of images: close pixels convey local information that is important. Using an MLP, we do not have the notion of the "pixel neighbourhood". We, therefore, neglect important information with an MLP. There are however models better suited for vision problems, such as Convolutional Neural Networks or CNNs.

With the code structure we have created, we can simply define a CNN and test its performance quickly.

In [None]:
class CNN(BasicModel): 
    def __init__(self):
        super().__init__()

        # We use a Sequential, i.e. the input passes through each of
        # the modules below, one-by-one
        self.conv = nn.Sequential(         
            nn.Conv2d(
                in_channels=1,              
                out_channels=16,            
                kernel_size=5,              
                stride=1,                   
                padding=2,                  
            ),                              
            nn.ReLU(),                      
            nn.MaxPool2d(kernel_size=2), 
            nn.Conv2d(
                in_channels=16, 
                out_channels=32, 
                kernel_size=5, 
                stride=1, 
                padding=2),     
            nn.ReLU(),                      
            nn.MaxPool2d(2),    
        )
              
        # fully connected layer, output 10 classes
        self.out = nn.Linear(32 * 7 * 7, 10)    
        
    def forward(self, x):
        # ***************************************************
        # INSERT YOUR CODE HERE
        # Pass x through the convolutional block, flatten the result
        # to (batch, 32 * 7 * 7) and apply the output layer.
        # *************************************************** 
        return x   
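
The input size `32 * 7 * 7` of the output layer follows from the shapes inside the convolutional block: a 5x5 convolution with padding 2 and stride 1 keeps the spatial size, and each max-pooling with kernel size 2 halves it. The sketch below rebuilds the same stack under a hypothetical name, `conv_stack`, and checks the shapes on a dummy batch.

```python
import torch
import torch.nn as nn

conv_stack = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),   # 28x28 -> 28x28, 16 channels
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),                             # 28x28 -> 14x14
    nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),  # 14x14 -> 14x14, 32 channels
    nn.ReLU(),
    nn.MaxPool2d(2),                                         # 14x14 -> 7x7
)

# Dummy batch of 4 images with MNIST's shape.
dummy = torch.zeros(4, 1, 28, 28)
features = conv_stack(dummy)

# Flatten all but the batch dimension before a linear layer.
flat = features.view(features.size(0), -1)
print(features.shape, flat.shape)
```

Passing a dummy tensor through a block like this is a quick way to find the right `in_features` for the following linear layer when the architecture changes, e.g. for images of a different size.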
        

In [None]:
# initialize model and define the optimizer. As with the MLP, we use Adam instead of SGD.
cnn = CNN().to(DEVICE)

optimizer = optim.Adam(cnn.parameters(), lr=LEARNING_RATE)

# train the cnn
cnn.fit(
    train_dataloader=train_dataloader, 
    optimizer=optimizer,
    epochs=10,
    device=DEVICE)

In [None]:
cnn.predict(test_dataloader=test_dataloader, device=DEVICE)

The CNN outperforms the MLP, which is to be expected in our case.

# Exercise: play with CIFAR10

MNIST is a fairly simple dataset. What happens with more challenging datasets?