In this notebook we'll study a not-so-new technique, or architecture if you prefer, for training neural networks, especially deep ones. It's called residual connections, and it was introduced in
this [paper](https://arxiv.org/abs/1512.03385). The main motivation behind it was the question of why deeper networks get worse results than shallower ones. If you're thinking it's due to overfitting: I'm talking about worse *train* loss, so the reason must be something else. I haven't yet seen a formal argument for why this happens. Vanishing gradients are the usual suspect, although the paper itself argues they are unlikely to be the cause, since its plain networks already use batch normalization.

In a few words, a residual connection lets the network *skip* one or more layers. The implementation is very simple: just add the input of a block (or of a group of blocks) to its output. In theory, the network can then learn whether a block is useful or not, which opens up the possibility of stacking lots and lots of layers.
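
To make the idea concrete, here is a minimal sketch of a skip connection around an arbitrary module. `SkipSketch` is a hypothetical name used only for this illustration, and the zeroed linear layer is an assumption chosen to show the extreme case where the wrapped block contributes nothing.

```python
import torch
from torch import nn


class SkipSketch(nn.Module):
    """Hypothetical wrapper: adds the input of `inner` to its output."""

    def __init__(self, inner):
        super().__init__()
        self.inner = inner

    def forward(self, x):
        # x passes through untouched; inner only has to learn the residual
        return x + self.inner(x)


inner = nn.Linear(4, 4)
# Zero the parameters so that inner(x) == 0 for every input
nn.init.zeros_(inner.weight)
nn.init.zeros_(inner.bias)

block = SkipSketch(inner)
x = torch.randn(2, 4)
out = block(x)

# With a zero residual the whole block behaves as the identity
identity_recovered = torch.equal(out, x)
```

When the inner block outputs zeros, the wrapped block is just the identity, which is the sense in which the network can "skip" a layer it does not need.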

Additionally, we'll see how a network that takes an alternative path through different convolutions compares to plain nets and ResNets.

In [1]:
import numpy as np 
import pandas as pd 
import torch
from torch import nn
from torch.optim import Adam
import torchvision
import torchvision.transforms as transforms

We'll work with the CIFAR10 dataset. I stole this import from the [PyTorch tutorials](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html).

In [2]:
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

bs = 128

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=bs,
                                          shuffle=True, num_workers=4)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=bs,
                                         shuffle=False, num_workers=4)

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz



Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified


We start by defining the classical conv/batchnorm/ReLU block.

In [3]:
class Block(nn.Module):
    """Conv -> BatchNorm -> (optional) ReLU."""
    def __init__(self,c_in,c_out,fs,p=0,rl = True):
        super().__init__()
        # fs: filter (kernel) size, p: padding, rl: whether to apply the final ReLU
        self.c_in, self.c_out, self.fs, self.rl = c_in, c_out, fs, rl
        self.conv = nn.Conv2d(self.c_in,self.c_out,self.fs,padding=p)
        self.norm, self.relu = nn.BatchNorm2d(self.c_out), nn.ReLU()
        
    def forward(self,x): 
        out = self.norm(self.conv(x))
        return self.relu(out) if self.rl else out

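In this cell we implement a basic ResBlock. To keep the height and width of the feature maps unchanged, so that the input can be added to the output, we need to pad the convolutions (a padding of 1 for a filter size of 3).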

In [4]:
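class ResBlock(nn.Module):
    def __init__(self,nc,fs):
        super().__init__()
        self.nc, self.fs = nc, fs
        # p=1 keeps height and width unchanged when fs=3, so x and y can be added
        self.a = Block(self.nc,self.nc,self.fs,p=1)           # conv + batchnorm + ReLU
        self.b = Block(self.nc,self.nc,self.fs,rl=False,p=1)  # conv + batchnorm, no ReLU
        self.relu = nn.ReLU()
        
    def forward(self,x):
        # b runs first and a second (the paper's basic block uses the opposite order)
        y = self.a(self.b(x))
        # skip connection: add the block's input to its output
        return self.relu(x+y)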
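
The `ResBlock` above evaluates `self.a(self.b(x))`, so the ReLU inside the block comes after the second convolution. For comparison, here is a sketch of the ordering used by the basic block in the paper: conv/batchnorm/ReLU, then conv/batchnorm, then the addition, then a final ReLU. `PaperOrderResBlock` is a hypothetical name, and the `fs // 2` padding is an assumption that only preserves the spatial size for odd filter sizes. It is not the block used for the results in this notebook.

```python
import torch
from torch import nn


class PaperOrderResBlock(nn.Module):
    """Hypothetical variant: conv-BN-ReLU, conv-BN, add input, ReLU."""

    def __init__(self, nc, fs=3):
        super().__init__()
        pad = fs // 2  # keeps height and width unchanged for odd fs
        self.first = nn.Sequential(
            nn.Conv2d(nc, nc, fs, padding=pad), nn.BatchNorm2d(nc), nn.ReLU()
        )
        self.second = nn.Sequential(
            nn.Conv2d(nc, nc, fs, padding=pad), nn.BatchNorm2d(nc)
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        # No ReLU between the second batchnorm and the addition
        y = self.second(self.first(x))
        return self.relu(x + y)


x = torch.randn(2, 16, 26, 26)
out = PaperOrderResBlock(16)(x)
```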

In this one, two different outputs are added. One is the result of two consecutive blocks with the same filter size, and the other is the output of a single block with a bigger kernel. You have to choose the filter sizes carefully so that both outputs have the same dimensions.

In [5]:
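class AltBlock(nn.Module):
    def __init__(self,nc,fs1,fs2):
        super().__init__()
        self.nc, self.fs1, self.fs2 = nc, fs1, fs2
        # first path: two unpadded blocks with filter size fs1
        self.a = Block(self.nc,self.nc,self.fs1)
        self.b = Block(self.nc,self.nc,self.fs1, rl = False)
        # second path: a single unpadded block with the bigger filter size fs2
        self.c = Block(self.nc,self.nc,self.fs2, rl = False)
        self.relu = nn.ReLU()
        
    def forward(self,x):
        y = self.a(self.b(x))
        z = self.c(x)
        # both paths must produce the same height and width to be added
        return self.relu(y+z)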
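
The size matching can be checked with a little arithmetic. For a stride-1 convolution, the output length along one axis is `size + 2 * padding - kernel + 1`. The sketch below applies that formula to both paths; `conv_out` is a hypothetical helper, and the input size of 26 is just an example value.

```python
def conv_out(size, kernel, padding=0):
    """Output length of a stride-1 convolution along one spatial axis."""
    return size + 2 * padding - kernel + 1


size = 26

# First path: two unpadded convolutions with kernel 3 (each removes 2 pixels)
two_small = conv_out(conv_out(size, 3), 3)

# Second path: one unpadded convolution with kernel 5 (removes 4 pixels)
one_big = conv_out(size, 5)

# A kernel of 7 on the second path would not match the first path
mismatch = conv_out(size, 7)

# For reference: kernel 3 with padding 1 leaves the size unchanged
same_padded = conv_out(size, 3, padding=1)
```

Two unpadded convolutions of size `fs1` remove `2 * (fs1 - 1)` pixels and one of size `fs2` removes `fs2 - 1`, so the paths line up when `fs2 = 2 * fs1 - 1`, which is why 3 pairs with 5.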

Next we build our nets. There will be four of them:
* The first one is just a regular net. After an initial block with a filter size of five, every block has a filter size of three, and the number of channels doubles every three blocks.
* The second one is a shorter net with bigger filter sizes.
* The third one alternates regular blocks with ResBlocks.
* The fourth one alternates regular blocks with AltBlocks.

In [6]:
net0 = nn.Sequential(Block(3,6,5),
                     Block(6,16,3),
                     Block(16,16,3),
                     Block(16,16,3),
                     Block(16,32,3),
                     Block(32,32,3),
                     Block(32,32,3),
                     Block(32,64,3),
                     Block(64,64,3),
                     Block(64,64,3),
                     nn.AdaptiveAvgPool2d(1),
                     nn.Flatten(),
                     nn.Linear(64,10)
                     )

In [7]:
net1 = nn.Sequential(Block(3,6,5),
                     Block(6,16,3),
                     Block(16,16,5),
                     Block(16,32,3),
                     Block(32,32,5),
                     Block(32,64,3),
                     Block(64,64,5),
                     nn.AdaptiveAvgPool2d(1),
                     nn.Flatten(),
                     nn.Linear(64,10)
                     )

In [8]:
net2 = nn.Sequential(Block(3,6,5),
                     Block(6,16,3),
                     ResBlock(16,3),
                     Block(16,32,3),
                     ResBlock(32,3),
                     Block(32,64,3),
                     ResBlock(64,3),
                     nn.AdaptiveAvgPool2d(1),
                     nn.Flatten(),
                     nn.Linear(64,10)
                     )

In [9]:
net3 = nn.Sequential(Block(3,6,5),
                     Block(6,16,3),
                     AltBlock(16,3,5),
                     Block(16,32,3),
                     AltBlock(32,3,5),
                     Block(32,64,3),
                     AltBlock(64,3,5),
                     nn.AdaptiveAvgPool2d(1),
                     nn.Flatten(),
                     nn.Linear(64,10))

In [10]:
models = [net0, net1, net2, net3]
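
Because most blocks are unpadded, each network shrinks the 32x32 CIFAR10 images by a different amount before the average pooling. The sketch below traces only the convolutions on the main path of each net. The `(kernel, padding)` lists are a hand transcription of the definitions above, and `conv_out` and `trace` are hypothetical helpers.

```python
def conv_out(size, kernel, padding=0):
    """Output length of a stride-1 convolution along one spatial axis."""
    return size + 2 * padding - kernel + 1


def trace(size, layers):
    """Apply a list of (kernel, padding) convolutions in order."""
    for kernel, padding in layers:
        size = conv_out(size, kernel, padding)
    return size


res = [(3, 1), (3, 1)]  # ResBlock(nc, 3): two padded 3x3 convolutions
alt = [(3, 0), (3, 0)]  # AltBlock(nc, 3, 5): main path, two unpadded 3x3

nets = {
    "net0": [(5, 0)] + [(3, 0)] * 9,
    "net1": [(5, 0), (3, 0), (5, 0), (3, 0), (5, 0), (3, 0), (5, 0)],
    "net2": [(5, 0), (3, 0)] + res + [(3, 0)] + res + [(3, 0)] + res,
    "net3": [(5, 0), (3, 0)] + alt + [(3, 0)] + alt + [(3, 0)] + alt,
}

# Height (and width) of the feature maps that reach AdaptiveAvgPool2d
final_sizes = {name: trace(32, layers) for name, layers in nets.items()}
```

Under these assumptions net0, net1 and net3 reach the pooling layer with 10x10 feature maps, while net2 keeps 22x22 maps because its ResBlocks are padded.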

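Let's write our training loop. For each epoch it records the mean train loss, the mean loss on the test set (used here as validation), and the test accuracy.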

In [11]:
def train_loop(model, optimizer, criterion, epochs):
    """Train `model`; return one [train_loss, valid_loss, accuracy] row per epoch."""
    
    metrics = []
    
    for _ in range(epochs):
        # training pass over the train set
        current = 0
        model.train()
        for img, lab in trainloader:
            img, lab = img.float().cuda(), lab.cuda()
            optimizer.zero_grad()
            out = model(img)
            loss = criterion(out, lab)
            loss.backward()
            optimizer.step()
            current += loss.item()
            
        train_loss = current / len(trainloader)  # mean loss per batch
            
        # evaluation pass over the test set, without tracking gradients
        with torch.no_grad():
            current, acc = 0, 0
            model.eval()
            for img, lab in testloader:
                img, lab = img.float().cuda(), lab.cuda()
                out = model(img)
                loss = criterion(out, lab)
                current += loss.item() 
                # softmax preserves ordering, so the arg max of the logits is the predicted class
                pred = out.argmax(-1)
                acc += (pred == lab).sum().item()
            
            valid_loss = current / len(testloader)
            accuracy = 100 * acc / len(testset)  # percentage of correct predictions
            
        metrics.append([train_loss,valid_loss,accuracy])
        
    return np.array(metrics)

And finally, we collect our results in a nice data frame.

In [15]:
def get_results(models,epochs, lr):
    
    # two-level column index: one (model, metric) pair per column
    names = [f'net{i}' for i in range(len(models))]
    index = pd.MultiIndex.from_product([names, ['train_loss','valid_loss','accuracy']], names=['model', 'metric'])
    results = pd.DataFrame(index = range(epochs),columns = index)
    
    for name,model in zip(names,models): 
        # each call fills the three metric columns of one model
        results[name] = train_loop(
                                   model = model.cuda(),
                                   optimizer = Adam(model.parameters(),lr=lr),
                                   criterion = nn.CrossEntropyLoss(),
                                   epochs = epochs
                                  )

    return results
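
The two-level column index is what makes the data frame convenient to slice afterwards. Here is a small sketch of that layout on its own, with placeholder zeros standing in for the arrays that `train_loop` returns (the zeros and the two-epoch length are assumptions for illustration, not real results):

```python
import numpy as np
import pandas as pd

model_names = ["net0", "net1", "net2", "net3"]
metric_names = ["train_loss", "valid_loss", "accuracy"]

# One column per (model, metric) pair, grouped by model
columns = pd.MultiIndex.from_product(
    [model_names, metric_names], names=["model", "metric"]
)

epochs = 2
# Placeholder data: one (epochs, 3) array of zeros per model
per_model = [np.zeros((epochs, len(metric_names))) for _ in model_names]
results = pd.DataFrame(np.hstack(per_model), columns=columns)

# All three metrics of a single model
net2_metrics = results["net2"]

# A single metric across all models
accuracy_all = results.xs("accuracy", axis=1, level="metric")
```

Selecting by the first level gives one model's metrics, and `xs` on the `metric` level gives one metric for every model, which is handy for plotting or comparing the nets side by side.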

Just for simplicity, I'll use the same learning rate for all the networks, but you can surely find a better value for each one.

In [16]:
get_results(models = models, epochs = 7, lr = 3e-3)

| epoch | net0 train_loss | net0 valid_loss | net0 accuracy (%) | net1 train_loss | net1 valid_loss | net1 accuracy (%) | net2 train_loss | net2 valid_loss | net2 accuracy (%) | net3 train_loss | net3 valid_loss | net3 accuracy (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.815294 | 0.877153 | 68.95 | 0.706641 | 0.803599 | 72.0 | 0.701039 | 0.810354 | 72.0 | 0.706289 | 0.81758 | 71.92 |
| 1 | 0.778261 | 0.870926 | 69.27 | 0.66952 | 0.811954 | 72.29 | 0.668315 | 0.751802 | 73.47 | 0.667333 | 0.792607 | 72.33 |
| 2 | 0.754965 | 0.860525 | 69.85 | 0.649788 | 0.796883 | 72.5 | 0.646536 | 0.739079 | 74.42 | 0.639201 | 0.803139 | 72.36 |
| 3 | 0.731481 | 0.887357 | 69.49 | 0.629868 | 0.782971 | 72.99 | 0.627556 | 0.724185 | 74.59 | 0.617718 | 0.791666 | 72.85 |
| 4 | 0.709845 | 0.830125 | 71.51 | 0.610913 | 0.815773 | 72.32 | 0.609816 | 0.721938 | 74.89 | 0.597628 | 0.801043 | 73.02 |
| 5 | 0.688633 | 0.831742 | 70.94 | 0.594478 | 0.772093 | 73.22 | 0.591058 | 0.747659 | 74.64 | 0.578132 | 0.757841 | 73.64 |
| 6 | 0.671347 | 0.85727 | 70.58 | 0.57778 | 0.767924 | 73.69 | 0.576838 | 0.781387 | 73.05 | 0.558304 | 0.772843 | 74.09 |


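As you can see, net2 and net3 got slightly better results than the first two: net2 reaches the best accuracy of the run (74.89% at epoch 4), and net3 ends with the lowest train loss (0.558) and the best final accuracy (74.09%). The differences are small, net1 is close behind, and these are just little networks. The original paper shows the huge improvement you can get from this technique on deep networks.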