# Introduction to Deep Learning with PyTorch


![AI_venn.png](https://drive.google.com/uc?export=view&id=1pThL5BY428RjG15vBI4dWdvf6fUtyXbw)


## Artificial Intelligence
AI is a very broad term and includes anything that enables computers to mimic human intelligence. On an elementary level, AI can be a predefined rule that enables a machine to react to specific situations in predetermined ways, or in simple terms a set of if-else rules. When we are talking about Artificial Intelligence it's worthwhile to concentrate on two important subfields.

## Machine Learning
Machine learning is a subset of Artificial Intelligence where a **series of algorithms analyze data** and **learn from it to make informed decisions based on learned insights**.
Much of the mathematical foundation of machine learning is not new and dates back to the 1960s. However, only relatively recently has enough computational power become available to leverage it. The rapid progress of machine learning in recent years has led to such methods being adopted in virtually every industry sector.

ML incorporates many classical algorithms for learning how to perform tasks from examples, such as clustering, regression and classification. There are generally four types of machine learning:
* **Supervised** learning, where the model learns from labelled examples.
* **Unsupervised** learning, where data has no labels, i.e. the system does not know the correct answer. A popular example is clustering problems, like identifying groups of customers based on their attributes for marketing campaigns.
* **Semi-supervised** learning is a combination of the two above, i.e. it uses both labelled and unlabelled data for training. It can be useful when only part of the data is labelled because labelling all of it would be too costly.
* **Reinforcement** learning, where the model learns by itself which actions yield the highest rewards through trial and error.

## Deep Learning
Deep learning is a subset of Machine Learning based on the use of **neural networks**, even though the term is often used interchangeably with Machine Learning. The term **deep** denotes the use of a neural network with 3 or more layers, including the **input** and **output** layers, i.e. one or more **hidden layers**.

![neural_network](https://drive.google.com/uc?export=view&id=1LbAhOtBInXdTzV7AMPEMAUqIT1ybH9hC)

Neural networks are not a new concept, but they have made large strides in the past decade due to the vast amount of data available and, most importantly, the use of GPUs and other accelerators (like TPUs).

# Introduction to PyTorch
![pytorch](https://drive.google.com/uc?export=view&id=1GP0ENtTIW_raHLoYzErL-mbSzZgkmzv1)

<https://pytorch.org/>


PyTorch is an optimized tensor library for deep learning using GPUs and CPUs. It was developed and published by Meta AI (Facebook) and has been amongst the most popular frameworks for the development of neural networks in recent years.

**What is PyTorch?**

PyTorch is a Python-based scientific computing package serving two broad purposes:

* A replacement for NumPy to use the power of GPUs and other accelerators.
* An automatic differentiation library that is useful to implement neural networks.

PyTorch uses a specialised data structure, similar to numpy's ndarrays, called **tensors**, to encode the inputs and outputs of a model as well as the model's parameters. A very significant difference compared to numpy's arrays is the ability of tensors to run on GPUs or other accelerators to accelerate computing.

In [None]:
# Import the pytorch and numpy libraries
import torch
import numpy as np

Check whether CUDA is available and print out the GPU model you are using.

In [None]:
print(f"Is CUDA available in the session? {torch.cuda.is_available()}")
print(f"\nHow many CUDA enabled devices are available? {torch.cuda.device_count()}")
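# Only query the device name if a CUDA device exists, otherwise this call raises an error
if torch.cuda.is_available():
    print(f"\nWhich CUDA device is available? {torch.cuda.get_device_name(0)}")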

## PyTorch tensors tutorial

There are various ways to initialise a tensor.

**Directly from data**

In [None]:
# 2x2 list
data = [[1, 2], [3, 4]]
data_tensor = torch.tensor(data)

**From a numpy array**

In [None]:
# Convert the list to a numpy array
data_numpy = np.array(data)
data_np_tensor = torch.tensor(data_numpy)
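
As a side note, *torch.tensor* always copies the data it is given. The sketch below is an illustration only (the variable names are made up for this example) and assumes we also want a tensor that shares memory with the numpy array, which is what *torch.from_numpy* provides.

```python
import numpy as np
import torch

arr = np.array([[1, 2], [3, 4]])

copied = torch.tensor(arr)      # independent copy of the data
shared = torch.from_numpy(arr)  # shares memory with the numpy array

# Modify the numpy array in place
arr[0, 0] = 100

print(copied[0, 0].item())  # the copy is unaffected
print(shared[0, 0].item())  # the shared tensor sees the change
```

Sharing memory avoids a copy, but it also means that in-place changes on one side silently show up on the other.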

**From another tensor**

We can create a new tensor from an existing one, which will retain the properties of the argument tensor (shape, datatype), unless we explicitly override any of them.

In [None]:
# Create a new tensor with the same properties as data_tensor (shape and datatype), filled with 1s.
x_ones = torch.ones_like(data_tensor)

# Create a new tensor from data_tensor, filled with random numbers and change the datatype to float.
x_rand = torch.rand_like(data_tensor, dtype=torch.float)

print(f"Argument tensor: \n{data_tensor}\n")
print(f"New tensor filled with 1s: \n{x_ones}\n")
print(f"New tensor filled with random numbers: \n{x_rand}\n")

**We can also initialise the tensor by defining its shape**

The *shape* argument is a *tuple* of tensor dimensions.


In [None]:
# Define the shape
shape = (2, 3,)   # <- rows x columns

# Tensor with random numbers
tensor_random = torch.rand(shape)
# Tensor filled with zeros
tensor_zeros = torch.zeros(shape)

# Print the tensors
print(f"Filled with random numbers: \n{tensor_random}\n")
print(f"Filled with zeros: \n{tensor_zeros}\n")

**Tensor operations**

There are over a hundred tensor operations, which you can find here <https://pytorch.org/docs/stable/torch.html>.

To take advantage of the GPU acceleration of tensors, we should move them to the GPU (if available):

In [None]:
# Check if the GPU is available and assign the proper name to the device variable
device = "cuda" if torch.cuda.is_available() else "cpu"

# Assign the device to the tensor (by default it's on cpu)
print(f"Default tensor device assignment: {data_tensor.device}")
data_tensor = data_tensor.to(device)
print(f"New device: {data_tensor.device}")

**Indexing** a tensor works just like numpy arrays

In [None]:
# Define a new tensor of shape 4x4 filled with ones
tensor = torch.ones((4, 4), dtype=torch.int64)

# Assign the number 2 to all the members of the second column
tensor[:, 1] = 2
print(tensor)

We can also concatenate tensors using the **torch.cat** function along a predefined axis, denoted by **dim**.

In [None]:
new_tensor = torch.cat([tensor, tensor, tensor], dim=1)  # concatenate along the columns (x axis)
new_tensor2 = torch.cat([tensor, tensor], dim=0)         # concatenate along the rows (y axis)

print(f"Along x axis:\n{new_tensor}\n")
print(f"Along y axis:\n{new_tensor2}")

## The **autograd** engine

One of the most important components of PyTorch is *torch.autograd*, the automatic differentiation engine that drives neural network training.

Training a Neural Network happens in two stages: 

**Forward Propagation:** In forward prop, the NN makes its best guess about the correct output. It runs the input data through each of its functions to make this guess.

**Backward Propagation:** In backprop, the NN adjusts its parameters proportionate to the error in its guess. It does this by traversing backwards from the output, collecting the derivatives of the error with respect to the parameters of the functions (gradients), and optimizing the parameters using gradient descent.

### Differentiation in Autograd

In this example we will look at how *autograd* collects gradients. To start with let's create two tensors, *a* and *b* with *requires_grad=True* (this enables autograd to track all the operations).

In [None]:
a = torch.tensor([5., 7.], requires_grad=True)
b = torch.tensor([8., 4.], requires_grad=True)

Create another tensor *Q* from *a* and *b*.
$$
Q = 5a^{3} - b^{2}
$$

In [None]:
Q = 5*a**3 - b**2

Assume that *a* and *b* are parameters of a neural network, and *Q* is the error. While training a neural network, we want gradients of the error w.r.t. parameters, i.e.
$$
\frac{\partial Q}{\partial a}=15a^{2}
$$

$$
\frac{\partial Q}{\partial b}=-2b
$$

*.backward()* calculates these gradients and stores them in the respective tensor *.grad* attribute.

To do so, we need to pass a *gradient* argument to *Q.backward()* because *Q* is a vector rather than a scalar. *gradient* is a tensor of the same shape as *Q*, and it represents the gradient of Q w.r.t. itself, i.e.
$$
\frac{dQ}{dQ}=1
$$
Equivalently, we can also aggregate *Q* into a scalar and call backward implicitly, like *Q.sum().backward()*.

In [None]:
external_grad = torch.tensor([1., 1.])
Q.backward(gradient=external_grad)

Gradients are now stored in *a.grad* and *b.grad*.

In [None]:
print(f"a.grad: {a.grad}\nb.grad: {b.grad}\n")

# Manually check if this is correct
print(15*a**2 == a.grad)
print(-2*b == b.grad)
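
The scalar alternative mentioned above, *Q.sum().backward()*, can be sketched as follows. The tensors are recreated here so that the example runs on its own and the gradients start from a clean state.

```python
import torch

a = torch.tensor([5., 7.], requires_grad=True)
b = torch.tensor([8., 4.], requires_grad=True)

Q = 5 * a**3 - b**2

# Reduce Q to a scalar so that backward() needs no gradient argument
Q.sum().backward()

print(a.grad)  # should match 15 * a**2
print(b.grad)  # should match -2 * b
```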

------
## Let's make a neural network!

The building blocks of neural networks in PyTorch are contained in the **torch.nn** package.

*nn* depends on *autograd* to define models and differentiate them. An *nn.Module* contains layers, and a method *forward(input)* that returns the *output*. The *input* is the first layer of the neural network and the *output* the last.

A typical training procedure for a neural network in PyTorch is as follows:
* The neural network that has some learnable parameters (or weights) is defined.
* We iterate over a dataset of inputs.
* The input is processed through the network.
* Loss is computed (the difference between the model's output and ground truth)
* Gradients are propagated back into the network's parameters
* The weights of the network are updated, typically using a simple update rule $ weight=weight - learning\_rate * gradient $
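
The steps above can be sketched on a toy problem. This is an illustration only: the synthetic data, the single linear layer, the learning rate and the number of steps are all assumptions chosen for brevity, and the update rule is applied by hand rather than through an optimizer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic data (assumption): y = 3x + 1
X = torch.linspace(-1, 1, 100).unsqueeze(1)
y = 3 * X + 1

model = nn.Linear(1, 1)          # a network with learnable parameters
loss_fn = nn.MSELoss()
learning_rate = 0.1

losses = []
for step in range(200):          # iterate over the data (full batch here)
    pred = model(X)              # process the input through the network
    loss = loss_fn(pred, y)      # compute the loss
    model.zero_grad()            # clear gradients from the previous step
    loss.backward()              # propagate gradients back to the parameters
    with torch.no_grad():        # weight = weight - learning_rate * gradient
        for param in model.parameters():
            param -= learning_rate * param.grad
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.6f}")
print(f"learned weight: {model.weight.item():.3f}, bias: {model.bias.item():.3f}")
```

The learned weight and bias should end up close to 3 and 1, the values used to generate the data.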

### Import the necessary packages

In [None]:
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset
from torchvision import datasets
from torchvision.transforms import ToTensor
import matplotlib.pyplot as plt

## Fashion-MNIST dataset

In this example we will use the FashionMNIST dataset (<https://github.com/zalandoresearch/fashion-mnist>). 

![fashionMNIST](https://github.com/zalandoresearch/fashion-mnist/blob/master/doc/img/fashion-mnist-sprite.png?raw=true)

It consists of a training set of 60,000 examples and a test set of 10,000 examples, from Zalando (an e-commerce platform). Each example is a 28x28 grayscale image with an associated label from one of 10 classes.

| Label | Description |
| --- | --- |
| 0 | T-shirt/top |
| 1 | Trouser |
| 2 | Pullover |
| 3 | Dress |
| 4 | Coat |
| 5 | Sandal |
| 6 | Shirt |
| 7 | Sneaker |
| 8 | Bag |
| 9 | Ankle boot |

<https://doi.org/10.48550/arXiv.1708.07747>

The Fashion-MNIST dataset is available through the **datasets** module in **torchvision**.

To load the dataset we define the following parameters:
* **root** is the path to where the train/test data are stored
* **train** specifies training or test dataset
* **download=True** specifies whether to download the data or not (if not available in root)
* **transform** and **target_transform** specify the feature and label transformations.

In [None]:
# First get the current working directory - We will use the same directory to store the data
wd = os.getcwd()

# Download the training and test datasets
training_data = datasets.FashionMNIST(
    root=wd,
    train=True,
    download=True,
    transform=ToTensor()  # <- convert each image to a float tensor with values scaled to [0, 1]
)

test_data = datasets.FashionMNIST(
    root=wd,
    train=False,
    download=True,
    transform=ToTensor()
)

## Visualise the Dataset

We can index **Datasets** manually like a list, *training_data[index]*, and then use matplotlib to visualise a number of examples from the training set.

In [None]:
# Define a dictionary with the labels
labels = {
    0: "T-Shirt",
    1: "Trouser",
    2: "Pullover",
    3: "Dress",
    4: "Coat",
    5: "Sandal",
    6: "Shirt",
    7: "Sneaker",
    8: "Bag",
    9: "Ankle Boot",
}

# Visualise 5 of them using matplotlib
figure = plt.figure(figsize=(25, 5))
cols, rows = 5, 1
for i in range(1, cols*rows+1):
  # Choose a random index from the training_data dataset
  sample_idx = torch.randint(len(training_data), size=(1, )).item()
  # Get the image data and the label for this example
  img, label = training_data[sample_idx]
  figure.add_subplot(rows, cols, i)
  plt.title(labels[label])
  plt.axis("off")
  plt.imshow(img.squeeze(), cmap="gray")

plt.show()

To use our own data with PyTorch we can create a custom **Dataset** class, which must implement three functions:
* \__init__ : Runs once when instantiating the Dataset object.
* \__len__ : Returns the number of examples in the dataset.
* \__getitem__ : Loads and returns a sample from the dataset specified by a given idx.

In [None]:
import pandas as pd
from torchvision.io import read_image


class CustomImageDataset(Dataset):

  def __init__(self, annotations_file, img_dir, transform=None, target_transform=None):
    self.img_labels = pd.read_csv(annotations_file)  # Labels are stored in a .csv which we read here using pandas
    self.img_dir = img_dir  # The directory where the images are stored
    self.transform = transform
    self.target_transform = target_transform

  def __len__(self):
    return len(self.img_labels)

  def __getitem__(self, idx):
    img_path = os.path.join(self.img_dir, self.img_labels.iloc[idx, 0])  # The first column of the pandas dataframe holds the filename for the images
    image = read_image(img_path)  # Use the torchvision function read_image to load the image
    label = self.img_labels.iloc[idx, 1]  # The second column of the dataframe holds the labels
    # If we pass a transform argument to the function it applies it here.
    if self.transform:
      image = self.transform(image)
    if self.target_transform:
      label = self.target_transform(label)
    return image, label
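
Since the class above needs image files and a .csv file on disk, here is a smaller sketch of the same three methods that keeps everything in memory. The class name *InMemoryDataset* and the random stand-in data are hypothetical and only serve as an illustration.

```python
import torch
from torch.utils.data import Dataset


class InMemoryDataset(Dataset):  # hypothetical name, for illustration only

    def __init__(self, features, labels):
        # Runs once: store the tensors that hold the data
        self.features = features
        self.labels = labels

    def __len__(self):
        # Number of examples in the dataset
        return len(self.labels)

    def __getitem__(self, idx):
        # Return one (feature, label) pair
        return self.features[idx], self.labels[idx]


# Random stand-in data (assumption): 10 "images" of shape 1x28x28 with labels 0-9
features = torch.rand(10, 1, 28, 28)
labels = torch.arange(10)

dataset = InMemoryDataset(features, labels)
image, label = dataset[3]
print(len(dataset), image.shape, label.item())
```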

## DataLoader module

We used the **Dataset** module to retrieve our dataset's features and labels, but we now need to use the **DataLoader** module to create the appropriate objects to pass to the neural network. **DataLoader** is an easy-to-use API that abstracts away many of the behind-the-scenes operations needed to use the data with neural networks (like creating "minibatches", shuffling the data at each epoch to limit overfitting, etc.).

A **DataLoader** is an iterable, so we can loop over it directly to go through all the mini-batches. To fetch a single mini-batch we can use **next(iter())**, i.e.
train_features, train_labels = next(iter(train_dataloader))

This will return a mini-batch of size **batch_size**. If **shuffle** is enabled, the data are reshuffled every time a new iterator is created, i.e. at the start of each epoch.

In [None]:
from torch.utils.data import DataLoader


# Training data object (we enable shuffling of the data at each epoch and define the batch size)
train_dataloader = DataLoader(training_data, batch_size=64, shuffle=True)  
# Test data object
test_dataloader = DataLoader(test_data, batch_size=64, shuffle=True)
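
To make the mini-batch behaviour concrete without downloading anything, the sketch below wraps random stand-in tensors in a *TensorDataset*. The data, the names and the batch size of 32 are assumptions made for this illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Random stand-in data (assumption): 100 samples of shape 1x28x28
features = torch.rand(100, 1, 28, 28)
labels = torch.randint(0, 10, (100,))
toy_dataset = TensorDataset(features, labels)

toy_dataloader = DataLoader(toy_dataset, batch_size=32, shuffle=True)

# Fetch a single mini-batch
batch_features, batch_labels = next(iter(toy_dataloader))
print(batch_features.shape, batch_labels.shape)

# Loop over all the mini-batches of one epoch; the last one may be smaller
batch_sizes = [len(y) for _, y in toy_dataloader]
print(batch_sizes)
```

With 100 samples and a batch size of 32, the last mini-batch only holds the 4 remaining samples; passing *drop_last=True* to the **DataLoader** would discard it instead.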

## Define the neural network model

As we mentioned above, in deep learning we use neural networks with 1 or more hidden layers. Let's build a fully-connected feed-forward neural network to tackle our classification problem. We will start simple, with two hidden layers, each containing 512 units.

PyTorch provides **torch.nn**, which contains all the building blocks we need to build our neural network. To define our neural network we need to define a **class** which subclasses **nn.Module**, and initialise the network's layers in \__init__. Every **nn.Module** subclass implements the operations on the input data in the **forward** method.

In [None]:
from torch import nn


class NeuralNetwork(nn.Module):
  
  def __init__(self):
    super().__init__()  # <- This allows us to access methods and properties of a parent class (i.e. nn.Module)
    # Define the layers of the neural network here
    self.flatten = nn.Flatten()  # Arrange the data in a vector to feed to the neural network
    # Define a Sequence of layers (2 layers with 512 units each with a ReLU activation function)
    self.linear_relu_stack = nn.Sequential(
        nn.Linear(in_features=28*28, out_features=512),  # First argument in the Linear layer is the input size and second argument is the output size
        nn.ReLU(),
        nn.Linear(in_features=512, out_features=512),
        nn.ReLU(),
        nn.Linear(in_features=512, out_features=10)
    )
    
  # Next we need to define the forward() method, which implements the data flow through the layers
  # X is the input data
  def forward(self, X):
    x = self.flatten(X)
    logits = self.linear_relu_stack(x)
    return logits

Next, we instantiate the neural network.

In [None]:
model = NeuralNetwork()
print(model)
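
To see what such a network returns, the sketch below pushes a random batch through an equivalent *nn.Sequential* stack (used here only to keep the example self-contained) and converts the logits into class probabilities with a softmax. The random input is a stand-in for real images.

```python
import torch
import torch.nn as nn

# Same layer layout as the NeuralNetwork class, written as a single Sequential
net = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 512),
    nn.ReLU(),
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
)

X = torch.rand(3, 1, 28, 28)          # random stand-in for 3 grayscale images
logits = net(X)                       # raw, unnormalised scores
probs = torch.softmax(logits, dim=1)  # probabilities over the 10 classes
predicted_class = probs.argmax(dim=1)

print(logits.shape)      # one score per class for each image
print(probs.sum(dim=1))  # each row sums to 1 (up to rounding)
print(predicted_class)
```

Since the network is untrained, the predicted classes are essentially arbitrary at this point.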

Before we start training the model, we need to define a few **hyperparameters**, that is adjustable parameters that control the optimisation process and can impact the final performance and convergence rate of the model (more here <https://pytorch.org/tutorials/beginner/hyperparameter_tuning_tutorial.html>).

In this example we will define the following:
* **Number of Epochs** - the number of times to iterate over the dataset.
* **Batch Size** - the number of data samples propagated through the network before the parameters are updated (usually limited by the amount of VRAM available).
* **Learning Rate** - How much to update the model's parameters at each batch/epoch. This is a trade-off between learning speed and accuracy: higher values of the learning rate can lead to faster convergence, but may achieve lower performance or lead to unpredictable behaviour during training (e.g. exploding gradients).

### Loss function

We also need to define the loss function, which measures how far the model's output is from the target value; it is what we are trying to minimise during training. In this case we will use the **Cross Entropy Loss** function, a common loss used for training classification models. It normalises the logits (model output) with a softmax and computes the prediction error.
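
As a small check of what the cross entropy loss computes, the sketch below compares *nn.CrossEntropyLoss* with a manual version built from a log-softmax. The logits and targets are random stand-ins, not output from the model in this notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)           # stand-in model output: 4 samples, 10 classes
targets = torch.tensor([0, 3, 9, 1])  # stand-in class labels

loss_builtin = nn.CrossEntropyLoss()(logits, targets)

# Manual version: log-softmax, then the mean negative log-probability of the true class
log_probs = F.log_softmax(logits, dim=1)
loss_manual = -log_probs[torch.arange(4), targets].mean()

print(loss_builtin.item(), loss_manual.item())
```

Both values should agree, which is why the network itself does not need a softmax layer at the end when this loss is used.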

### Optimizer

The optimiser defines how the model parameters are adjusted at each training step to reduce the model error. In this case we will use the **Stochastic Gradient Descent** algorithm. Another popular algorithm is **Adam** and its variations (more here <https://pytorch.org/docs/stable/optim.html>).

When initialising the optimizer, we need to register our model's parameters that need to be trained and pass the learning rate to it.

In [None]:
# Hyperparameters
learning_rate = 1e-3
batch_size=64
epochs=5

# Loss function
loss_fn = nn.CrossEntropyLoss()

# Optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
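
To illustrate what a single optimizer step does, the sketch below compares plain SGD (no momentum or weight decay) with the manual update rule on a toy parameter. The parameter *w* and the quadratic loss are assumptions made for this example.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)  # toy parameter
learning_rate = 0.1
optimizer = torch.optim.SGD([w], lr=learning_rate)

loss = (w ** 2).sum()  # toy loss whose gradient is 2 * w
optimizer.zero_grad()
loss.backward()

# Manual update rule, computed before the optimizer changes w
expected = w.detach() - learning_rate * w.grad

optimizer.step()       # update performed by the optimizer

print(w.detach(), expected)
```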

Finally, we define the **train_loop** which loops over the optimization procedures and **test_loop** which evaluates the model's performance against the test data.

Since the **dataloader** runs exclusively on cpu (in parallel with the multiprocessing backend) we need to move each batch to the GPU if we are using the GPU. 

In [None]:
def train_loop(dataloader, model, loss_fn, optimizer, device="cpu"):
    # Get the total number of samples in the dataset
    size = len(dataloader.dataset)

    # Put the model in training mode (enables dropout and batch normalisation updates)
    model.train()

    # Loop through the mini-batches and perform the training procedures
    for batch, (X, y) in enumerate(dataloader):  # <- enumerate() is a great function to put in your arsenal if you haven't already!!

        # Move the batch to the same device as the model
        X, y = X.to(device), y.to(device)

        # Compute prediction and loss
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation: reset the gradients, compute new ones and update the parameters
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % 200 == 0:
            loss, current = loss.item(), (batch+1)*len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")


def test_loop(dataloader, model, loss_fn, device="cpu"):

    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    test_loss, correct = 0, 0

    # Put the model in evaluation mode (disables dropout, batch normalisation uses its running statistics)
    model.eval()

    # We use torch.no_grad() here, as we don't need to compute gradients
    # when we only evaluate the model
    with torch.no_grad():
        for X, y in dataloader:

            # Move the batch to the same device as the model
            X, y = X.to(device), y.to(device)

            # Calculate the model's prediction
            pred = model(X)
            # Get the loss of this batch and add it to the total
            test_loss += loss_fn(pred, y).item()
            # Get the number of correct predictions from the batch and add it to the total
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()

    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

Next we loop through the train_loop and test_loop for the number of epochs we defined to optimize the model.

In [None]:
%%time

for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_dataloader, model, loss_fn, optimizer)
    test_loop(test_dataloader, model, loss_fn)
    
print("Done!!")

In [None]:
%%time

model_gpu = NeuralNetwork()
model_gpu.to("cuda")

# Since we defined a new model we need to instantiate a new optimiser
optimizer = torch.optim.SGD(model_gpu.parameters(), lr=learning_rate)

for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_dataloader, model_gpu, loss_fn, optimizer, 
               device="cuda")
    test_loop(test_dataloader, model_gpu, loss_fn,
              device="cuda")
    
print("Done!!")

## Convolutional Neural Networks

Now let's try a *Convolutional Neural Network*, which generally outperforms fully connected feed-forward networks on computer vision tasks. A convolution is a mathematical operation used to extract features from an image and is defined by an image kernel (a small matrix).

![conv](https://miro.medium.com/v2/resize:fit:640/1*ZCjPUFrB6eHPRi4eyP6aaA.gif)

### Padding

To avoid shrinking the image, one can use **padding**, that is, adding zero-valued pixels around the image.

![padding](https://miro.medium.com/v2/resize:fit:640/format:webp/1*5rLRx19ot0QggMn9teY14Q.png)

### Stride

Sometimes it is useful to move the convolution kernel in steps larger than 1 pixel, to reduce the computational cost. The step size is called the **stride**.

![stride](https://miro.medium.com/v2/resize:fit:720/format:webp/1*y3Ydr1oCHRfOegWxZITIOA.png)

### Convolution layer size

If an $ n \times n $ matrix is convolved with an $ f \times f $ kernel with padding *p* and stride *s*, the output is of size
$$
\left( \left\lfloor \frac{n+2p-f}{s} \right\rfloor +1 \right) \times \left( \left\lfloor \frac{n+2p-f}{s} \right\rfloor +1 \right)
$$
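
As a quick sanity check, the output side length floor((n + 2p - f) / s) + 1 can be compared against what *nn.Conv2d* actually produces. The helper *conv_output_size* is a hypothetical name introduced for this sketch, and the kernel, padding and stride combinations are arbitrary examples.

```python
import torch
import torch.nn as nn


def conv_output_size(n, f, p=0, s=1):
    """Output side length for an n x n input, f x f kernel, padding p and stride s."""
    return (n + 2 * p - f) // s + 1


x = torch.rand(1, 1, 28, 28)  # one single-channel 28x28 image

results = []
for f, p, s in [(3, 1, 1), (3, 0, 1), (5, 2, 2), (3, 0, 2)]:
    conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=f, padding=p, stride=s)
    actual = conv(x).shape[-1]
    predicted = conv_output_size(28, f, p, s)
    results.append((predicted, actual))
    print(f"f={f}, p={p}, s={s}: formula={predicted}, actual={actual}")
```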

### Pooling

In pooling layers we progressively reduce the spatial size of the representation to reduce the network complexity and computational cost. The most popular pooling layers are the **Max Pooling** and the **Average Pooling** layers.
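
A minimal sketch of the two pooling types on a hand-written 4x4 input is shown below. The input values are chosen arbitrarily so that each 2x2 window is easy to check by eye.

```python
import torch
import torch.nn as nn

x = torch.tensor([[1., 2., 5., 6.],
                  [3., 4., 7., 8.],
                  [9., 10., 13., 14.],
                  [11., 12., 15., 16.]]).reshape(1, 1, 4, 4)  # (batch, channel, height, width)

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)

max_out = max_pool(x).squeeze()  # maximum of each 2x2 window
avg_out = avg_pool(x).squeeze()  # mean of each 2x2 window

print(max_out)
print(avg_out)
```

Both layers halve the height and width here, turning the 4x4 input into a 2x2 output.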

<sub> More at <https://medium.com/analytics-vidhya/convolution-padding-stride-and-pooling-in-cnn-13dc1f3ada26> </sub>

### Batch Normalisation layer

**Batch Normalisation** is a method used to make training a neural network more stable and efficient by normalising the layer's inputs by re-centering and re-scaling them. 
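
The re-centering and re-scaling can be observed directly. The sketch below feeds deliberately shifted and scaled random activations (an assumption made for this illustration) through *nn.BatchNorm2d* in training mode and inspects the per-channel statistics of the output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(16, 3, 8, 8) * 5 + 10  # stand-in activations with mean ~10 and std ~5

bn = nn.BatchNorm2d(num_features=3)
bn.train()                             # use the statistics of the current batch
out = bn(x)

# Per-channel statistics over the batch and spatial dimensions
channel_mean = out.mean(dim=(0, 2, 3))
channel_std = out.std(dim=(0, 2, 3), unbiased=False)

print(channel_mean)  # close to 0
print(channel_std)   # close to 1
```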

### Dropout

**Dropout** randomly zeroes some of the elements of the input tensor with probability *p* during training, using samples from a Bernoulli distribution. The elements to drop are chosen independently on every forward call, and the remaining elements are scaled by a factor of 1/(1-p). Dropout is disabled when the model is in evaluation mode.

This has proven to be an effective technique for regularization and for preventing the co-adaptation of neurons, as described in the paper *Improving neural networks by preventing co-adaptation of feature detectors*.
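
The difference between training and evaluation mode can be sketched with *nn.Dropout* on a tensor of ones. The tensor size and the seed are arbitrary choices for this illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.25)
x = torch.ones(1000)

drop.train()          # dropout is active in training mode
out_train = drop(x)

drop.eval()           # dropout does nothing in evaluation mode
out_eval = drop(x)

zeroed_fraction = (out_train == 0).float().mean().item()
print(f"fraction zeroed in train mode: {zeroed_fraction:.2f}")          # roughly p
print(f"surviving values in train mode: {out_train.max().item():.4f}")  # scaled by 1 / (1 - p)
print(f"unchanged in eval mode: {torch.equal(out_eval, x)}")
```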

**NOTE**: Be careful when using both batch normalisation and dropout together, it might lead to unexpected results. 

## CNN architecture

The CNN comprises:
* A sequential block with a convolution layer with a kernel size of 3\*3, padding=1 and stride=1, followed by a *Batch normalization* layer, a ReLU activation function and a *Max pooling layer* with kernel size=2 and stride=2.
* A 2nd sequential block with a convolution layer with a kernel size of 3\*3, padding=0 and stride=1, followed by a *Batch normalization* layer, a ReLU activation function and a *Max pooling layer* with kernel size=2 and stride=2.
* The output is then flattened and passed to two fully connected layers, the first with 600 units, followed by a dropout with p=0.25, and the 2nd with 120 units, and finally to the output layer with 10 units (i.e. 10 classes).
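
To see where the input size of the first fully connected layer comes from, the sketch below traces the shape of a single random 28x28 image through two blocks built as described above. The blocks are rebuilt here only so that the example runs on its own.

```python
import torch
import torch.nn as nn

x = torch.rand(1, 1, 28, 28)  # one grayscale 28x28 image (random stand-in)

layer1 = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1),  # 28x28 -> 28x28
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),       # 28x28 -> 14x14
)
layer2 = nn.Sequential(
    nn.Conv2d(32, 64, kernel_size=3),            # 14x14 -> 12x12
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),       # 12x12 -> 6x6
)

out1 = layer1(x)
out2 = layer2(out1)
flat = out2.flatten(start_dim=1)  # flatten everything except the batch dimension

print(out1.shape)
print(out2.shape)
print(flat.shape)
```

The flattened vector has 64\*6\*6 = 2304 entries, which is the number of input features the first fully connected layer must accept.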



In [None]:
class ConvNet(nn.Module):
    
    def __init__(self):
        
        super(ConvNet, self).__init__()
        
        self.layer1 = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1),
            nn.BatchNorm2d(num_features=32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        
        self.layer2 = nn.Sequential(
            nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        
        self.fc1 = nn.Linear(in_features=64*6*6, out_features=600)
        self.drop = nn.Dropout(0.25)
        self.fc2 = nn.Linear(in_features=600, out_features=120)
        self.fc3 = nn.Linear(in_features=120, out_features=10)
        
    def forward(self, X):
        
        out = self.layer1(X)
        out = self.layer2(out)
        out = out.view(out.size(0), -1)
        out = self.fc1(out)
        out = self.drop(out)
        out = self.fc2(out)
        out = self.fc3(out)
        
        return out
        

Instantiate the model and train it.

In [None]:
%%time

model_conv = ConvNet()

# Hyperparameters
learning_rate = 1e-3
epochs = 5

# Optimizer
optimizer = torch.optim.SGD(model_conv.parameters(), lr=learning_rate)

for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_dataloader, model_conv, loss_fn, optimizer, 
               device="cpu")
    test_loop(test_dataloader, model_conv, loss_fn,
              device="cpu")
    
print("Done!!")

Now let's try it on the GPU.

In [None]:
%%time

device = "cuda"

model_conv_gpu = ConvNet()
model_conv_gpu.to(device)

# The convolutional neural network is much larger than the fully-connected one, so
# we might need to decrease the batch_size in order to be able to train it
batch_conv = 64
train_dataloader = DataLoader(training_data, batch_size=batch_conv, shuffle=True)  
test_dataloader = DataLoader(test_data, batch_size=batch_conv, shuffle=True)

# Hyperparameters
learning_rate = 1e-3
epochs = 5

# Optimizer
optimizer = torch.optim.SGD(model_conv_gpu.parameters(), lr=learning_rate)

for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_dataloader, model_conv_gpu, loss_fn, optimizer, 
               device=device)
    test_loop(test_dataloader, model_conv_gpu, loss_fn,
              device=device)
    
print("Done!!")

## Exercises

1. Increase the batch size of the train and test sets. How does this affect the training time and the accuracy achieved in 5 epochs? Can you think of what limits the batch size we can set for our training procedures?

In [None]:
%%time

batch_size=64

# Instantiate the model and send it to the GPU
model_conv_gpu = ConvNet()
model_conv_gpu.to("cuda")

# Train and test data loader objects
train_dataloader = DataLoader(training_data, batch_size=batch_size, shuffle=True)  
test_dataloader = DataLoader(test_data, batch_size=batch_size, shuffle=True)

# Hyperparameters
learning_rate = 1e-3
epochs = 5

# Optimizer
optimizer = torch.optim.SGD(model_conv_gpu.parameters(), lr=learning_rate)

for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_dataloader, model_conv_gpu, loss_fn, optimizer, 
               device=device)
    test_loop(test_dataloader, model_conv_gpu, loss_fn,
              device=device)

2. Increase or decrease the learning rate. How does this affect training? How do you think this affects the performance of the model with respect to the number of epochs?

In [None]:
%%time

batch_size=64

# Instantiate the model and send it to the GPU
model_conv_gpu = ConvNet()
model_conv_gpu.to("cuda")

# Train and test data loader objects
train_dataloader = DataLoader(training_data, batch_size=batch_size, shuffle=True)  
test_dataloader = DataLoader(test_data, batch_size=batch_size, shuffle=True)

# Hyperparameters
learning_rate = 1e-3
epochs = 5

# Optimizer
optimizer = torch.optim.SGD(model_conv_gpu.parameters(), lr=learning_rate)

for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_dataloader, model_conv_gpu, loss_fn, optimizer, 
               device=device)
    test_loop(test_dataloader, model_conv_gpu, loss_fn,
              device=device)

3. Add another convolution layer to the network and run the training procedures, e.g. a third convolution layer which takes as input the output of the 2nd (64 channels) and outputs 128 channels, followed by a MaxPool layer with kernel_size=2 and stride=2.

**Remember** the following when calculating the input and output sizes:

If an $ n \times n $ matrix is convolved with an $ f \times f $ kernel with padding *p* and stride *s*, the output is of size
$$
\left( \left\lfloor \frac{n+2p-f}{s} \right\rfloor +1 \right) \times \left( \left\lfloor \frac{n+2p-f}{s} \right\rfloor +1 \right)
$$

> How do you think adding complexity to the architecture of the neural network affects training?

In [None]:
class ConvNet2(nn.Module):
    
    def __init__(self):
        
        super(ConvNet2, self).__init__()
        
        self.layer1 = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1),
            nn.BatchNorm2d(num_features=32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        
        self.layer2 = nn.Sequential(
            nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        
        self.fc1 = nn.Linear(in_features=64*6*6, out_features=600)
        self.drop = nn.Dropout(0.25)
        self.fc2 = nn.Linear(in_features=600, out_features=120)
        self.fc3 = nn.Linear(in_features=120, out_features=10)
        
    def forward(self, X):
        
        out = self.layer1(X)
        out = self.layer2(out)
        out = out.view(out.size(0), -1)
        out = self.fc1(out)
        out = self.drop(out)
        out = self.fc2(out)
        out = self.fc3(out)
        
        return out


batch_size = 64

# Instantiate the model and send it to the GPU
model_conv2 = ConvNet2()
model_conv2.to("cuda")

# Train and test data loader objects
train_dataloader = DataLoader(training_data, batch_size=batch_size, shuffle=True)  
test_dataloader = DataLoader(test_data, batch_size=batch_size, shuffle=True)

# Hyperparameters
learning_rate = 1e-3
epochs = 5

# Optimizer
optimizer = torch.optim.SGD(model_conv2.parameters(), lr=learning_rate)

for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_dataloader, model_conv2, loss_fn, optimizer, 
               device=device)
    test_loop(test_dataloader, model_conv2, loss_fn,
              device=device)