## Deep Learning Tutorial with Residual Nets (aka ResNets) in PyTorch! <br> No equations, just code and comments!

If you are, like me, fairly new to Machine Learning (ML) in general and Deep Learning in particular, then the enormous landscape of neural network architectures can be pretty overwhelming. I guess all newbies know what I am talking about: CNNs, RNNs, GRUs, LSTMs, U-Nets, Dense Nets, Residual Nets, Highway Nets, Transformers, the entire Sesame Street, etc.


- Don't worry! We will not cover all of them in this little `PyTorch` tutorial. 
- Moreover, we will not use any equations! We will just use simple Python and PyTorch code.


### Why ResNets?

We will cover ResNets only. Why? 

- Firstly, because I believe that understanding `ResNets` will not only help to demystify the magic of neural networks, it will also lead to a more visual understanding of linear transformations and matrix-matrix operations in high-dimensional space. You will see that every architecture that appears complex at first sight is nothing more than applied linear algebra.
    
    
- Secondly, ResNets do require you to be familiar with `Convolutional Neural Networks (CNNs)`. Again, don't worry! I don't expect you to be an expert in developing CNNs. If you have an intuitive understanding of convolutions and know how they work (e.g., sliding a Gaussian kernel ($3 \times 3$, $5 \times 5$, etc.) over an image and combining it with an equally sized window at each position), you will grasp the nature of CNNs along the way. If you have never heard of convolutions before, please have a look at them first before you move on. There is no point in skipping the small but indispensable bits. As ever, start with Wikipedia: [Convolutions](https://en.wikipedia.org/wiki/Convolution)
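To make the sliding-window picture concrete, here is a minimal sketch that applies a $3 \times 3$ kernel to a tiny image with `F.conv2d`. The 5x5 toy image and the averaging kernel are assumptions chosen for illustration (a Gaussian kernel would only change the weights), and none of this is part of the tutorial's pipeline.

```python
import torch
import torch.nn.functional as F

# toy 5x5 "image" with values 0..24; shape (batch=1, channels=1, height=5, width=5)
image = torch.arange(25, dtype=torch.float32).view(1, 1, 5, 5)

# 3x3 averaging kernel (assumption for illustration): every weight is 1/9
kernel = torch.full((1, 1, 3, 3), 1.0 / 9.0)

# slide the kernel over the image; without padding the output shrinks to 3x3
blurred = F.conv2d(image, kernel)

print(blurred.shape)        # torch.Size([1, 1, 3, 3])
print(blurred[0, 0, 0, 0])  # mean of the top-left 3x3 window, i.e. 6.0
```

Each output value is the weighted sum of one window of the image, which is all a convolutional layer does before its weights are learned.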

### Let's demystify ResNets and break it down!

The first time I heard about Residual Nets (aka ResNets), I thought that this was something I would maybe, if I was lucky, understand at some point in the distant future. "ResNets must be beyond my understanding of maths and programming": those were my thoughts. Then, however, I immersed myself in Convolutional Neural Networks and their big brother, Residual Networks. To be frank, ResNets are actually not that complex. Everyone who knows some Linear Algebra and understands the nature of linear transformations with vectors and matrices will quickly grasp the mechanisms of ResNets. Let's dive right into it!

## Install PyTorch

If you haven't installed `PyTorch` on your machine, then install the latest version of PyTorch with the usual pip command <br> `pip install torch`.

### Download MNIST

Download the famous MNIST data set for Deep Learning from [here](http://deeplearning.net/data/mnist/). Create a `data/mnist` subfolder (this is the `PATH` used below) and store the data set there. You don't have to unzip it. We will do this automatically via Python.


In [1]:
import pickle
import gzip
import torch 
import numpy as np
import torch.nn as nn
import torch.nn.functional as F

from torch import optim
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader

In [2]:
PATH = './data/mnist/'
FILENAME = 'mnist.pkl.gz'

Let's write a little function that will both unzip the data and load it into memory.
Both `gzip` and `pickle` are part of Python's standard library, so there is nothing to install before you execute the cell below.

Since the MNIST data is stored as a gzipped `pickle` file, we have to open it in both `read` and `binary` mode. Hence, instead of the usual `open` command, we will `gzip.open` the data (which automatically unzips the data set for us). The `r` stands for read mode and the `b` stands for binary mode. You'll be reading a binary file. The data is already split into `train`, `dev` and `test` sets; we only use the first two here.

In [3]:
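def get_files(path, filename):
    # use the function arguments rather than the global PATH / FILENAME
    with gzip.open(path + filename, "rb") as file:
        # the third element of the pickled tuple is not used in this tutorial
        ((x_train, y_train), (x_val, y_val), _) = pickle.load(file, encoding='latin-1')
    return x_train, y_train, x_val, y_val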

**Tensors**: Since we are developing our neural network in PyTorch, we have to deal with `tensors` instead of `arrays`. Conceptually, there is little difference between the two: both are n-dimensional containers of numbers, so a vector or a matrix can be stored as either. The differences are practical. Tensors can be moved to the `GPU` and can track gradients, whereas `NumPy` array operations are executed on the `CPU`. Hence, tensors are more convenient for Deep Learning, where we have to perform many matrix-matrix operations and often want the GPU to execute them rapidly.
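As a small illustration of the relationship between arrays and tensors, the sketch below converts a NumPy array into a tensor, moves it to the GPU only if one is available, and checks that a matrix product gives the same numbers in both libraries. The array contents are made up for the example.

```python
import numpy as np
import torch

# a NumPy array living in CPU memory
array = np.ones((2, 3), dtype=np.float32)

# torch.tensor copies the data into a new tensor with the same shape and dtype
tensor = torch.tensor(array)

# pick the GPU if one is available, otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tensor = tensor.to(device)

# the same matrix product, once in PyTorch and once in NumPy
product = tensor @ tensor.T    # shape (2, 2), every entry is 3.0
reference = array @ array.T

print(product.cpu().numpy())
print(np.allclose(product.cpu().numpy(), reference))  # True
```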

To convert the data into `tensors`, we will use Python's handy built-in `map` function, which applies a particular function (in our case `torch.tensor`) to every element of the iterable passed as its second argument.

In [4]:
def tensor_map(x_train,y_train,x_val,y_val): return map(torch.tensor,(x_train,y_train,x_val,y_val))
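One detail worth knowing is that `map` is lazy: it returns an iterator, and the conversion only happens once the result is unpacked. The following sketch shows this with small made-up arrays standing in for the MNIST data (the shapes and labels are assumptions for illustration).

```python
import numpy as np
import torch

# hypothetical stand-ins for the MNIST arrays: 4 flattened images and 4 labels
x = np.zeros((4, 784), dtype=np.float32)
y = np.array([3, 1, 4, 1])

# map is lazy: nothing is converted until we iterate over (or unpack) the result
lazy = map(torch.tensor, (x, y))

# unpacking consumes the iterator and calls torch.tensor once per element
x_t, y_t = lazy

print(type(x_t), x_t.shape)  # a torch.Tensor of shape (4, 784)
print(y_t.dtype)             # integer labels stay integers
print(list(lazy))            # []: the iterator is exhausted after unpacking
```

This is why the result of `tensor_map` should be unpacked right away rather than stored and reused.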

As mentioned at the beginning of the tutorial, `ResNets` are an advancement of `CNNs` and thus just a slightly more complicated version of Convolutional Neural Networks. 

Convolutional `2D` layers expect the data to be passed in `4D`: `Batch size, Channels (1 for grey-scale, 3 for RGB), Height, Width`. Therefore, we will write a little `preprocessing` function that can be applied to our mini batch input data before performing the convolution. In our case, `channel = 1` as MNIST images are grey-scale.

In [5]:
def preprocess(x):
    return x.view(-1, 1, 28, 28)
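To see what `preprocess` does to a mini batch, here is a short sketch on random data. It assumes that each image arrives as a flattened vector of 784 pixels; the batch of 64 random images is a made-up stand-in.

```python
import torch

def preprocess(x):
    # same reshaping as in the tutorial: (N, 784) -> (N, 1, 28, 28)
    return x.view(-1, 1, 28, 28)

# hypothetical mini batch of 64 flattened MNIST-sized images
xb = torch.rand(64, 784)
out = preprocess(xb)

print(out.shape)  # torch.Size([64, 1, 28, 28])

# view only reinterprets the layout: row 0 of image 0 is the first 28 pixels
print(torch.equal(out[0, 0, 0], xb[0, :28]))  # True
```

The `-1` lets PyTorch infer the batch size, so the same function works for the smaller last batch as well.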

**Refactor as much as possible!**

In deep ResNets we need many convolutional layers. That's why we will write a `conv` function that wraps `nn.Conv2d` with a fixed $3 \times 3$ kernel and a stride of 2, so that we don't have to repeat these arguments for every layer.

In [6]:
def conv(in_size, out_size, pad=1): 
    return nn.Conv2d(in_size, out_size, kernel_size=3, stride=2, padding=pad)
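The spatial size that comes out of such a layer follows from simple integer arithmetic. The helper below is a hypothetical function (not part of the tutorial) that computes the output length along one dimension, assuming a dilation of 1.

```python
def conv_out_size(size, kernel_size=3, stride=2, padding=1):
    # output length of a convolution along one spatial dimension (dilation 1)
    return (size + 2 * padding - kernel_size) // stride + 1

print(conv_out_size(28, padding=1))   # 14: the default padding halves a 28x28 image
print(conv_out_size(28, padding=15))  # 28: a padding of 15 keeps the size despite stride 2
```

The second case is the one that matters for the ResBlocks further down: they are created with `pad=15`, which keeps the feature maps at 28x28 so that input and output can be added.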

**Deep deep in neural (or rather vector) space! Anyways, it's getting exciting!**

The code cell below is maybe the most important part in this tutorial on `ResNets`. 

**Inheriting classes from PyTorch**

To develop your own neural networks in `PyTorch`, you have to write your own Python `classes`. `Classes` are Python's approach to Object-Oriented Programming (aka OOP). We have to inherit from PyTorch's `nn.Module` class, which is the base class for all neural network architectures in PyTorch. It keeps track of the layers and their parameters, so that PyTorch's automatic differentiation can run `backpropagation` for us, among other neat features. With Python's `super()` we call the `__init__` method of the caring `parent` nn.Module, which provides us with all its features without having to implement them ourselves. One can think of this as a newborn child called `ResBlock` inheriting and using all the `methods` from the already developed parent called `nn.Module`.
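Before looking at the real block, a minimal sketch of the inheritance pattern may help. `TinyNet` is a hypothetical class made up for this illustration; it only shows what a subclass gets from `nn.Module` for free.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):  # hypothetical example class, not part of the tutorial
    def __init__(self):
        super().__init__()             # sets up nn.Module's internal bookkeeping
        self.linear = nn.Linear(4, 2)  # registered automatically as a submodule

    def forward(self, x):
        return self.linear(x)

net = TinyNet()

# inherited from nn.Module: parameters() finds the weight and bias of self.linear
print([tuple(p.shape) for p in net.parameters()])  # [(2, 4), (2,)]

# inherited from nn.Module: calling net(x) runs forward(x)
print(net(torch.rand(3, 4)).shape)                 # torch.Size([3, 2])
```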


**Residual Block**

A `ResBlock` consists of a `ConvBlock` plus a `skip connection`. 

**ConvBlock**: Two convolutions `conv` (look at the conv function in the cell above), each followed by a `batch norm` layer and a `non-linearity`. If you don't know what batch normalization does and why it is quite useful in neural networks, have a look [here](https://en.wikipedia.org/wiki/Batch_normalization). Spoiler: It tends to make the training of neural networks more stable and has proven to lead to better results in practice. The latter is the reason why we apply a `batch norm` layer on top of each `conv` layer. The non-linearity is a rectified linear unit (aka `relu`), which is already implemented for us in PyTorch's `functional` module.


**Skip connection**: After each conv block, we add the block's input, which was not passed through the convolutions, to the block's output: `x + self.convblock(x)`. That way, we keep information from the input that might get lost through the convolutions and the applied non-linearities. It's called a `skip connection` because the input skips the conv block on its way to the output (all convolutions and non-linearities are bypassed). Note that the addition only works if `x` and `self.convblock(x)` have compatible shapes. Experiments have shown that this leads to more robust and accurate networks.

In [7]:
class ResBlock(nn.Module):
    
    def __init__(self, in_size:int, hidden_size:int, out_size:int, pad:int):
        super().__init__()
        self.conv1 = conv(in_size, hidden_size, pad)
        self.conv2 = conv(hidden_size, out_size, pad)
        self.batchnorm1 = nn.BatchNorm2d(hidden_size)
        self.batchnorm2 = nn.BatchNorm2d(out_size)
    
    def convblock(self, x):
        x = F.relu(self.batchnorm1(self.conv1(x)))
        x = F.relu(self.batchnorm2(self.conv2(x)))
        return x
    
    def forward(self, x): return x + self.convblock(x) # skip connection
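The addition in `forward` relies on the input and the conv block output having compatible shapes. The sketch below uses plain tensors as stand-ins (an assumption for illustration) for a 1-channel input and a 16-channel conv block output, which is the situation in the first block of the ResNet defined later.

```python
import torch

# stand-ins (assumptions) for a block input and the conv block output
x = torch.ones(2, 1, 28, 28)            # batch of 2, 1 channel
conv_out = torch.zeros(2, 16, 28, 28)   # same batch, 16 channels, same height/width

# broadcasting copies the single input channel onto each of the 16 output channels
y = x + conv_out
print(y.shape)  # torch.Size([2, 16, 28, 28])

# with mismatched channel counts (neither of them 1) the addition fails
try:
    torch.ones(2, 8, 28, 28) + conv_out
except RuntimeError as err:
    print("shape mismatch:", type(err).__name__)
```

So the skip connection works out of the box when the channel counts match or when the input has a single channel; other combinations would need an extra layer on the skip path.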

**ResNet** 

Here we create our ResNet by stacking two of the `ResBlock`s implemented in the cell above, followed by one more `conv` and `batch norm` layer. At the end we perform `max-pooling` but make it adaptive instead of fixed. Adaptive max-pooling is a neat trick that allows us to pass inputs of any size (we only need to specify the size of the output).

In [8]:
class ResNet(nn.Module):
    
    def __init__(self, n_classes=10):
        super().__init__()
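        # pad=15 (with stride 2 and a 3x3 kernel) keeps the 28x28 feature map size,
        # which the skip connection needs to add input and output
        self.res1 = ResBlock(1, 8, 16, 15)
        self.res2 = ResBlock(16, 32, 16, 15)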
        self.conv = conv(16, n_classes)
        self.batchnorm = nn.BatchNorm2d(n_classes)
        self.maxpool = nn.AdaptiveMaxPool2d(1)
        
    def forward(self, x):
        x = preprocess(x)
        x = self.res1(x)
        x = self.res2(x) 
        x = self.maxpool(self.batchnorm(self.conv(x)))
        return x.view(x.size(0), -1)
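The adaptive pooling step can be checked in isolation. In this sketch the two random feature maps are made-up inputs with different spatial sizes; both are reduced to one value per channel.

```python
import torch
import torch.nn as nn

pool = nn.AdaptiveMaxPool2d(1)  # always produce a 1x1 output per channel

# two hypothetical feature maps with different spatial sizes but 10 channels each
small = torch.rand(4, 10, 14, 14)
large = torch.rand(4, 10, 50, 37)

print(pool(small).shape)  # torch.Size([4, 10, 1, 1])
print(pool(large).shape)  # torch.Size([4, 10, 1, 1])

# with output size 1 the result is simply the maximum over height and width
print(torch.equal(pool(small).flatten(1), small.amax(dim=(2, 3))))  # True
```

After flattening, each example is left with one score per class, which is exactly what the final `x.view(x.size(0), -1)` produces.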

**Loss computation and batches**

We will write a handy function that takes in our `ResNet`, a specified `loss function`, `mini batches` (input and target data), an optional `optimizer` (only needed for training but not for evaluation mode) and an optional `learning rate scheduler`. 

Thanks to PyTorch's automatic gradient computation we don't have to write backprop manually (that would indeed be a hassle). We just call `loss.backward()`, tell our optimizer to perform a step according to our `learning rate`, and zero the gradients after each iteration.

In [9]:
def loss_batch(model, loss_func, xb, yb, opt=None, scheduler=None):
    out = model(xb)  # single forward pass, reused for loss and accuracy
    loss = loss_func(out, yb)
    acc = accuracy(out, yb)
    if opt is not None:
        loss.backward()
        opt.step()
        # step the scheduler after the optimizer (here: once per mini batch)
        if scheduler is not None:
            scheduler.step()
        opt.zero_grad()
    return acc.item(), loss.item(), len(xb)
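Why the gradients have to be zeroed after each iteration becomes clear with a single made-up parameter: PyTorch adds new gradients on top of the old ones instead of overwriting them. The sketch resets the gradient by hand, which is what `opt.zero_grad()` does for all parameters of the model.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

# first backward pass: d(3w)/dw = 3
(3 * w).backward()
print(w.grad)  # tensor(3.)

# without zeroing, the second backward pass is added on top: 3 + 3 = 6
(3 * w).backward()
print(w.grad)  # tensor(6.)

# resetting the gradient before the next backward pass gives 3 again
w.grad = None
(3 * w).backward()
print(w.grad)  # tensor(3.)
```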

In [11]:
def accuracy(out, yb):
    # (preds == yb) is a boolean tensor and mean() is not defined for booleans,
    # thus values have to be converted into floats first
    preds = torch.argmax(out, dim=1)
    return (preds == yb).float().mean()
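A tiny worked example shows what `accuracy` computes. The scores and labels below are invented for illustration: four examples, three classes, and one wrong prediction.

```python
import torch

def accuracy(out, yb):
    preds = torch.argmax(out, dim=1)
    return (preds == yb).float().mean()

# hypothetical scores for 4 examples and 3 classes
out = torch.tensor([[2.0, 0.1, 0.3],
                    [0.2, 1.5, 0.1],
                    [0.9, 0.8, 0.7],
                    [0.1, 0.2, 3.0]])
yb = torch.tensor([0, 1, 2, 2])

print(torch.argmax(out, dim=1))  # tensor([0, 1, 0, 2])
print(accuracy(out, yb))         # tensor(0.7500): 3 of 4 predictions are correct
```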

**Let's get the model!**

In [10]:
def get_model():
    model = ResNet()
    # lr is read from the global hyperparameters defined further below
    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    return model, optimizer

**Let's get the data!**

`DataLoader` is an amazing PyTorch feature that will automatically split our train and dev data sets into <br> `math.ceil(x_train.shape[0]/batch size)` mini batches according to a given size (e.g., 64, 128) and optionally shuffle the data (only necessary for training). That is important for training, as layers such as `nn.Conv2d` expect their inputs in the form of tensor mini batches. However, we first have to create train and val data sets of tensors with `TensorDataset`. We will do all of the above in a single function.

In [12]:
def get_data_batches(x_train, y_train, x_val, y_val, bs):
    train_ds = TensorDataset(x_train, y_train)
    val_ds = TensorDataset(x_val, y_val)
    # a DataLoader is an iterable that yields (xb, yb) mini batches
    return (
        DataLoader(train_ds, batch_size=bs, shuffle=True),
        DataLoader(val_ds, batch_size=bs * 2),
    )
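The number of mini batches can be verified on a small made-up data set. In this sketch the 100 random examples and the batch size of 32 are assumptions chosen so that the last batch comes out smaller than the others.

```python
import math
import torch
from torch.utils.data import DataLoader, TensorDataset

# hypothetical tiny data set: 100 examples with 784 features and integer labels
n, bs = 100, 32
ds = TensorDataset(torch.rand(n, 784), torch.randint(0, 10, (n,)))
dl = DataLoader(ds, batch_size=bs, shuffle=True)

print(len(dl), math.ceil(n / bs))  # 4 4

# the last mini batch is smaller because 100 is not a multiple of 32
sizes = [len(xb) for xb, yb in dl]
print(sizes)  # [32, 32, 32, 4]
```

Shuffling changes which examples end up in which batch, but not the batch sizes.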

**Let's fit our model!**

`model.train()` and `model.eval()` switch layers such as `batch norm` between their training and evaluation behavior; they do not switch gradient computation on or off. PyTorch tracks gradients by default. On the dev set, however, we don't want gradient computation. Thus, we have to tell PyTorch with `with torch.no_grad()` that no gradients should be tracked when testing our model on the dev set.
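The difference between the model's mode and gradient tracking can be seen on a small stand-in model. The two-layer `nn.Sequential` below is an assumption made for this sketch, not the tutorial's ResNet.

```python
import torch
import torch.nn as nn

# small stand-in model (assumption) with a batch norm layer
model = nn.Sequential(nn.Linear(4, 3), nn.BatchNorm1d(3))
x = torch.rand(8, 4)

model.train()
print(model.training)             # True: batch norm uses batch statistics
print(model(x).requires_grad)     # True: a graph for backprop is recorded

model.eval()
print(model.training)             # False: batch norm uses running statistics
print(model(x).requires_grad)     # True: eval() alone does not switch off autograd

with torch.no_grad():
    print(model(x).requires_grad) # False: no graph is recorded inside this block
```

This is why evaluation code usually combines both: `model.eval()` for the layer behavior and `torch.no_grad()` to save memory and computation.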

In [13]:
def fit(epochs, model, loss_func, opt, train_dl, valid_dl, scheduler=None):
    for epoch in range(epochs):
        model.train()
        # iterate over the mini batches yielded by the data loader
        for xb, yb in train_dl:
            loss_batch(model, loss_func, xb, yb, opt, scheduler)

        model.eval()
        # no gradient computation for evaluation mode
        with torch.no_grad():
            accs, losses, nums = zip(
                *[loss_batch(model, loss_func, xb, yb) for xb, yb in valid_dl]
            )
        
        #NOTE: important to multiply with batch size and sum over values 
        #      to account for varying batch sizes
        val_loss = np.sum(np.multiply(losses, nums)) / np.sum(nums)
        val_acc = np.sum(np.multiply(accs, nums)) / np.sum(nums)

        print("Epoch:", epoch+1)
        print("Loss: ", val_loss)
        print("Accuracy: ", val_acc)
        print()
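The note about multiplying with the batch size is easiest to appreciate with numbers. The per-batch losses and batch sizes below are invented for illustration; the last batch is deliberately smaller than the others.

```python
import numpy as np

# hypothetical per-batch mean losses and batch sizes (last batch is smaller)
losses = [0.20, 0.10, 0.40]
nums = [128, 128, 16]

weighted = np.sum(np.multiply(losses, nums)) / np.sum(nums)
unweighted = np.mean(losses)

print(weighted)    # about 0.165: every example counts equally
print(unweighted)  # about 0.233: the small last batch is over-weighted
```

Weighting each batch by its size gives the mean loss per example, which is the quantity we actually want to report.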

**Hyperparameters**

Define batch size `bs`, learning rate `lr`, number of epochs `n_epochs` and loss function `loss_func`.

In [14]:
bs = 64  # or 128
lr = 0.01
n_epochs = 5
loss_func = F.cross_entropy

In [15]:
# get data set
x_train, y_train, x_val, y_val = get_files(PATH, FILENAME)

In [16]:
# map tensor function to all inputs (X) and targets (Y) to create tensor data sets
x_train, y_train, x_val, y_val = tensor_map(x_train, y_train, x_val, y_val)

In [17]:
# get math.ceil(x_train.shape[0]/batch size) train and val mini batches of size bs
train_dl, val_dl = get_data_batches(x_train, y_train, x_val, y_val, bs)

In [18]:
# get model and optimizer
model, opt = get_model()

In [19]:
# train
fit(n_epochs, model, loss_func, opt, train_dl, val_dl)

Epoch: 0
Loss:  0.18226656237840652
Accuracy:  0.944
Epoch: 1
Loss:  0.1558196894288063
Accuracy:  0.953
Epoch: 2
Loss:  0.11653886579871178
Accuracy:  0.9653
Epoch: 3
Loss:  0.10886165248155594
Accuracy:  0.9664
Epoch: 4
Loss:  0.10072250712513924
Accuracy:  0.9693
