# I) Summary

- We are going to have an in-depth review of [Deep Residual Learning for Image Recognition](https://arxiv.org/pdf/1512.03385.pdf) and [Study of Residual Networks for Image Recognition](https://arxiv.org/pdf/1805.00325.pdf) paper which introduces the ResNet architecture.
- It's important to understand that the main problem here is the difficulty to optimize a deep network rather than its lack of ability to learn features.
    - Feature learning (or representation learning) is the ability to find a transformation that maps raw data into a representation that is more suitable for a machine learning task (e.g classification).

##  Problem

- Intuitively, the more layers we have, the better the accuracy will be.
- So if we take a shallow network that performs well and copy its layers and stack them to make the model deeper, we can expect the deep network to perform comparably good or better than its counterpart.
- Surprisingly, as we go deeper, accuracy increases up to a saturation point and then begins to degrade.
- Unexpectedly, such degradation is not caused by overfitting and making the network even deeper leads to a high training error.
- Here is an example on CIFAR-10.

<div style="text-align:center">
    <img src="https://cdn.discordapp.com/attachments/676833120053493770/705017447169654814/unknown.png">
    <figcaption> Figure: Trained on CIFAR-10</figcaption>
</div>

<br>

- Thus, the deep network performs worse than the shallow network.
- One possible explanation could be that the deep network suffered from the vanishing gradient problem.
- However, it can mostly be fixed with batch normalization and normalized initializations.
- A second explanation could be that the deep network wasn't able to learn the identity function.
    - Indeed, it could at least perform exactly like the shallow network by just "learning nothing" (remember the deep network was built by copying and stacking layers of the shallow network).
    - But the fact that he wasn't able to perform exactly like the shallow network means he has trouble to learn nothing! (learn the identity function).
- This suggest a new problem: **Is learning better networks as easy as stacking more layers ?**

## Solution

The solution to this problem is to use a **Residual module** so that adding more layer will not cause any performance degradation.

A residual module is composed of:
- a sequence of convolutions, batch normalization and ReLU activations.
- a residual connection $x$.

<div style="text-align:center">
    <img src="https://cdn.discordapp.com/attachments/676833120053493770/699941791360745512/unknown.png">
    <figcaption > Figure: Residual module</figcaption>
</div>
<br>

- We then combine through addition the residual connection with the sequence.
- Suppose $H(x) = F(x) + x$. If the deep network wants to learn the idendity function, it just has to use the residual connection and thus, set $F(x)$ to 0 !
- It is always easier for a sequence of layer to fit to a zero than an identity function, so the proposed structure is easier to train and ensure that a deeper network will be at least comparably good or better than its counterpart (**neutral-or-better characteristic**).
- The residual connection is also called **skip connection** because they give a chance for the information to skip the function located within the residual module.
- **Skip connection** provides a clear path for gradients to back propagate to early layers of the network. This makes the learning process faster by avoiding vanishing gradient problem.
- However, the trade of is that residual networks are more prone to overfitting.
- It seems that residual modules are more powerful for very deep networks and could even hurt the performance for very shallow networks if employed improperly.
- When several residual modules are stacked, residual networks can be thought of as a complex combinations or ensemble of many shallower networks.

<div style="text-align:center">
    <img src="https://cdn.discordapp.com/attachments/676833120053493770/702786810186825778/unknown.png">
    <figcaption > Figure: Residual module</figcaption>
</div>
<br>

## Architecture

There are several types of ResNet-X (with X, the number of layers).

<div style="text-align: center">
    <img src="https://cdn.discordapp.com/attachments/676833120053493770/703518172124545084/unknown.png">
</div>
<br>

- For ResNet-50/101/152, they used a bottleneck architecture because they are cheaper in term of operations.

<div style="text-align: center">
    <img src="https://cdn.discordapp.com/attachments/676833120053493770/703528990316691466/unknown.png">
</div>
<br>

- We are going to implement ResNet on CIFAR-10 which architecture is slighty different from the ImageNet one (probably due to its input image size).
- Here is the ResNet-50 architecture on Imagenet:

<div style="text-align: center">
    <img src="https://cdn.discordapp.com/attachments/676833120053493770/704607455979634688/unknown.png">
</div>
<br>

<div style="text-align: center">
    <img src="https://cdn.discordapp.com/attachments/676833120053493770/704624182289367040/unknown.png">
</div>
<br>

<div style="text-align: center">
    <img src="https://cdn.discordapp.com/attachments/676833120053493770/704624228334305382/unknown.png" width="50%">
</div>
<br>

- We use identity shortcuts when input and output channel dimensions are the same.
- Otherwise, we have 2 options:
    - A) Use identity shortcuts with zero padding to increase channel dimension.
    - B) Use 1x1 convolution to increase channel dimension (projection shortcut).
- When input and output spatial dimensions don't match, we use one of the 2 above options with stride 2.
- Since we are going to implemenet ResNet-50 on CIFAR-10, the architecture will be slightly different (ResNet-56):
    - No maxpooling (probably due to small input size).
    - We will use option A)

<div style="text-align: center">
    <img src="https://cdn.discordapp.com/attachments/676833120053493770/704999521608007680/unknown.png">
</div>
<br>

<div style="text-align: center">
    <img src="https://cdn.discordapp.com/attachments/676833120053493770/704626011412627536/unknown.png" width="50%">
</div>
<br>

# II) Implementation

In [0]:
import os
import shutil
from collections import OrderedDict
from IPython.display import clear_output

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import transforms, datasets
from torchsummary import summary
from torch.utils.data import Dataset, DataLoader, random_split

## a) Loading dataset / Preprocessing

In [0]:
def load_cifar():
    transform = transforms.Compose([transforms.ToTensor(),
                                    transforms.Normalize(mean=[0.5], std=[0.5])])
            
    train_dataset = datasets.CIFAR10('./data', train=True, download=True, transform=transform)
    test_dataset = datasets.CIFAR10('./data', train=False, download=True, transform=transform)

    #Clear downloading message.
    clear_output()
    
    # Split dataset into training set and validation set.
    train_dataset, val_dataset = random_split(train_dataset, (45000, 5000))
    
    print("Image Shape: {}".format(train_dataset[0][0].numpy().shape), end = '\n\n')
    print("Training Set:   {} samples".format(len(train_dataset)))
    print("Validation Set:   {} samples".format(len(val_dataset)))
    print("Test Set:       {} samples".format(len(test_dataset)))
    
    if torch.cuda.is_available():
        BATCH_SIZE = 2048
    else:
        BATCH_SIZE = 32

    # Create iterator.
    train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=10000, shuffle=True)
    
    # Delete the data/ folder.
    shutil.rmtree('./data')
    
    return train_loader, val_loader, test_loader

In [29]:
train_loader, val_loader, test_loader = load_cifar()

Image Shape: (3, 32, 32)

Training Set:   45000 samples
Validation Set:   5000 samples
Test Set:       10000 samples


## b) Architecture build

In [0]:
class LambdaLayer(nn.Module):
    
    def __init__(self, lambd):
        super(LambdaLayer, self).__init__()
        self.lambd = lambd
    
    def forward(self, x):
        return self.lambd(x)

class ConvBlock(nn.Module):
    
    def __init__(self, in_channels, out_channels, stride=1, option='A'):
        super(ConvBlock, self).__init__()
        
        self.features = nn.Sequential(OrderedDict([
            ('conv1', nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1, bias=False)),
            ('bn1', nn.BatchNorm2d(out_channels)),
            ('act1', nn.ReLU()),
            ('conv2', nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1, bias=False)),
            ('bn2', nn.BatchNorm2d(out_channels))
        ]))

        self.shortcut = nn.Sequential()
        
        if stride != 1 or in_channels != out_channels:
            if option == 'A':
                pad = out_channels//4
                self.shortcut = LambdaLayer(lambda x:
                            F.pad(x[:, :, ::2, ::2], (0,0, 0,0, pad,pad, 0,0)))
            if option == 'B':
                self.shortcut = nn.Sequential(OrderedDict([
                    ('s_conv1', nn.Conv2d(in_channels, 2*out_channels, kernel_size=1, stride=stride, padding=0, bias=False)),
                    ('s_bn1', nn.BatchNorm2d(2*out_channels))
                ]))
        
    def forward(self, x):
        out = self.features(x)
        out += self.shortcut(x)
        out = F.relu(out)
        return out

In [0]:
class ResNet(nn.Module):
    """
        ResNet architecture for CIFAR-10.
    """
    def __init__(self, block_type, num_blocks):
        super(ResNet, self).__init__()
        
        self.in_channels = 16
        
        self.conv0 = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn0 = nn.BatchNorm2d(16)
        self.block1 = self.__build_layer(block_type, 16, num_blocks[0], mismatch_stride=1)
        self.block2 = self.__build_layer(block_type, 32, num_blocks[1], mismatch_stride=2)
        self.block3 = self.__build_layer(block_type, 64, num_blocks[2], mismatch_stride=2)
        self.avgpool = nn.AdaptiveAvgPool2d((1,1))
        self.linear = nn.Linear(64, 10)
    
    def __build_layer(self, block_type, out_channels, num_blocks, mismatch_stride):
        strides = [mismatch_stride] + [1]*(num_blocks-1)
        layers = []
        for stride in strides:
            layers.append(block_type(self.in_channels, out_channels, stride))
            self.in_channels = out_channels
        return nn.Sequential(*layers)
    
    def forward(self, x):
        out = F.relu(self.bn0(self.conv0(x)))
        out = self.block1(out)
        out = self.block2(out)        
        out = self.block3(out)
        out = self.avgpool(out)
        out = torch.flatten(out, 1)
        out = self.linear(out)
        return out

In [0]:
def ResNet56():
    return ResNet(block_type=ConvBlock, num_blocks=[9,9,9])

In [33]:
model = ResNet56()
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
summary(model, (3, 32, 32))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1           [-1, 16, 32, 32]             432
       BatchNorm2d-2           [-1, 16, 32, 32]              32
            Conv2d-3           [-1, 16, 32, 32]           2,304
       BatchNorm2d-4           [-1, 16, 32, 32]              32
              ReLU-5           [-1, 16, 32, 32]               0
            Conv2d-6           [-1, 16, 32, 32]           2,304
       BatchNorm2d-7           [-1, 16, 32, 32]              32
         ConvBlock-8           [-1, 16, 32, 32]               0
            Conv2d-9           [-1, 16, 32, 32]           2,304
      BatchNorm2d-10           [-1, 16, 32, 32]              32
             ReLU-11           [-1, 16, 32, 32]               0
           Conv2d-12           [-1, 16, 32, 32]           2,304
      BatchNorm2d-13           [-1, 16, 32, 32]              32
        ConvBlock-14           [-1, 16,

In [0]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

## c) Training the model

In [0]:
def train_model():
    EPOCHS = 15
    nb_examples = 45000
    nb_val_examples = 5000
    train_costs, val_costs = [], []
    
    #Training phase.
    
    for epoch in range(EPOCHS):

        train_loss = 0
        correct_train = 0
        
        for inputs, labels in train_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            # Zero the parameter gradients.
            optimizer.zero_grad()
            
            # Forward pass.
            prediction = model(inputs)
            
            # Compute the loss.
            loss = criterion(prediction, labels)
          
            # Backward pass.
            loss.backward()
            
            # Optimize.
            optimizer.step()
            
            # Compute training accuracy.
            _, predicted = torch.max(prediction.data, 1)
            correct_train += (predicted == labels).float().sum().item()
            
            # Compute batch loss.
            train_loss += (loss.data.item() * inputs.shape[0])


        train_loss /= nb_examples
        train_costs.append(train_loss)
        train_acc =  correct_train / nb_examples

        val_loss = 0
        correct_val = 0
  
        with torch.no_grad():
            for inputs, labels in val_loader:
                inputs, labels = inputs.to(device), labels.to(device)

                # Forward pass.
                prediction = model(inputs)

                # Compute the loss.
                loss = criterion(prediction, labels)

                # Compute training accuracy.
                _, predicted = torch.max(prediction.data, 1)
                correct_val += (predicted == labels).float().sum().item()

            # Compute batch loss.
            val_loss += (loss.data.item() * inputs.shape[0])

            val_loss /= nb_val_examples
            val_costs.append(val_loss)
            val_acc =  correct_val / nb_val_examples
        
        info = "[Epoch {}/{}]: train-loss = {:0.6f} | train-acc = {:0.3f} | val-loss = {:0.6f} | val-acc = {:0.3f}"
        print(info.format(epoch+1, EPOCHS, train_loss, train_acc, val_loss, val_acc))
        torch.save(model.state_dict(), 'save_weights/checkpoint_gpu_{}'.format(epoch + 1)) 
                                                                
    torch.save(model.state_dict(), 'save_weights/resnet-56_weights_gpu')  
        
    return train_costs, val_costs

In [36]:
!nvidia-smi

Wed Apr 29 10:56:13 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   61C    P0    37W / 250W |  16275MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
+-------

In [37]:
train_costs, val_costs = train_model()

[Epoch 1/15]: train-loss = 2.418560 | train-acc = 0.186 | val-loss = 0.345510 | val-acc = 0.260
[Epoch 2/15]: train-loss = 1.814710 | train-acc = 0.318 | val-loss = 0.301081 | val-acc = 0.370
[Epoch 3/15]: train-loss = 1.610937 | train-acc = 0.394 | val-loss = 0.274469 | val-acc = 0.431
[Epoch 4/15]: train-loss = 1.430526 | train-acc = 0.465 | val-loss = 0.237520 | val-acc = 0.504
[Epoch 5/15]: train-loss = 1.315914 | train-acc = 0.514 | val-loss = 0.223509 | val-acc = 0.553
[Epoch 6/15]: train-loss = 1.204232 | train-acc = 0.560 | val-loss = 0.200776 | val-acc = 0.586
[Epoch 7/15]: train-loss = 1.133270 | train-acc = 0.589 | val-loss = 0.196912 | val-acc = 0.612
[Epoch 8/15]: train-loss = 1.051415 | train-acc = 0.621 | val-loss = 0.191916 | val-acc = 0.633
[Epoch 9/15]: train-loss = 0.966809 | train-acc = 0.653 | val-loss = 0.166661 | val-acc = 0.670
[Epoch 10/15]: train-loss = 0.902538 | train-acc = 0.678 | val-loss = 0.162810 | val-acc = 0.690
[Epoch 11/15]: train-loss = 0.828471 | 

In [38]:
#Restore the model.
model = ResNet56()
model.load_state_dict(torch.load('save_weights/resnet-56_weights_gpu'))

<All keys matched successfully>

In [40]:
nb_test_examples = 10000
correct = 0 

model.eval().cuda()

with  torch.no_grad():
    for inputs, labels in test_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        # Make predictions.
        prediction = model(inputs)

        # Retrieve predictions indexes.
        _, predicted_class = torch.max(prediction.data, 1)

        # Compute number of correct predictions.
        correct += (predicted_class == labels).float().sum().item()

test_accuracy = correct / nb_test_examples
print('Test accuracy: {}'.format(test_accuracy))

Test accuracy: 0.7394
