# Instructions

For this assignment you will use PyTorch instead of EDF to implement and train neural networks. The experiments in this assignment will take a long time to run without a GPU, but you can run the notebook remotely on Google Colab and have access to GPUs for free -- in this case you don't have to worry about installing PyTorch as it is available by default in Google Colab's environment.

If you will be running the experiments on your own machine, you should install PyTorch -- there are multiple tutorials online and it is especially easy if you're using Anaconda. Check https://pytorch.org/tutorials/ for some PyTorch tutorials -- this assignment assumes that you know the basics, like defining models with multiple modules and coding up functions to train models with PyTorch optimizers.

To use Google Colab, you should access https://colab.research.google.com/ and upload this notebook to your workspace. To use a GPU, go to Edit -> Notebook settings and select GPU as the accelerator.

Unlike previous assignments, in this one you will have to do some writing instead of just coding. Try to keep your answers short and precise, and you are encouraged to write equations if needed (you can do that using markdown cells). You can also use code as part of your answers (plotting, printing, etc.). Blue text indicates questions or things that you should discuss/comment on, and there will be red "ANSWER (BEGIN)" and "ANSWER (END)" markdown cells to indicate that you should add cells with your writeup between these two. **Make sure not to redefine variables or functions in your writeup, which can change the behavior of the next cells.**

Finally, you might have to make minor changes to the provided code due to differences in Python/PyTorch versions. You can post on Piazza if there's a major, non-trivial change that you had to make (so other students can be aware of it and how to proceed), but for minor changes you should just apply them and keep working on the assignment.

In [None]:
import torch, math, copy
import numpy as np
from torchvision import datasets, transforms
import torch.nn as nn
import torch.nn.init as init
import torch.nn.functional as F
from functools import cache

# From Shallow to Deep Neural Networks

The main goal of this assignment is to develop a better understanding of how the depth of a network interacts with its trainability and performance.

In the previous assignment you likely observed difficulties in training sigmoid and ReLU networks with over ~8 layers, which is typically associated with 'vanishing' or 'exploding' gradients. As you will see, some of the biggest achievements in deep learning have been the development of techniques that enable deeper networks to be successfully trained, and without them deep networks are notoriously difficult to train successfully.

You will be working with the MNIST dataset, which will be downloaded and loaded in the cell below.

In [None]:
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])

train_dataset = datasets.MNIST("data", train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=256, shuffle=True)

test_dataset = datasets.MNIST("data", train=False, download=True, transform=transform)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=256, shuffle=False)

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to data/MNIST/raw/train-images-idx3-ubyte.gz


100%|██████████| 9912422/9912422 [00:00<00:00, 15244750.14it/s]


Extracting data/MNIST/raw/train-images-idx3-ubyte.gz to data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to data/MNIST/raw/train-labels-idx1-ubyte.gz


100%|██████████| 28881/28881 [00:00<00:00, 4368083.58it/s]


Extracting data/MNIST/raw/train-labels-idx1-ubyte.gz to data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to data/MNIST/raw/t10k-images-idx3-ubyte.gz


100%|██████████| 1648877/1648877 [00:00<00:00, 5996122.22it/s]


Extracting data/MNIST/raw/t10k-images-idx3-ubyte.gz to data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to data/MNIST/raw/t10k-labels-idx1-ubyte.gz


100%|██████████| 4542/4542 [00:00<00:00, 3933621.47it/s]

Extracting data/MNIST/raw/t10k-labels-idx1-ubyte.gz to data/MNIST/raw






Fill in the missing code below. In both train_epoch and test, total_correct should be the total number of correctly classified samples, while total_samples should be the total number of samples that have been iterated over.
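As a small illustration of the bookkeeping described above, the sketch below computes the error rate for a single batch. The logits and targets are made-up values, and the `demo_` names are hypothetical so that they do not clash with anything defined in the notebook.

```python
import torch

# Made-up logits for a batch of 4 samples and 3 classes (illustration only).
demo_output = torch.tensor([[2.0, 0.1, -1.0],
                            [0.3, 0.2, 1.5],
                            [0.0, 3.0, 0.5],
                            [1.0, 0.9, 0.8]])
demo_target = torch.tensor([0, 2, 1, 2])

# The predicted class is the index of the largest logit in each row.
demo_pred = demo_output.argmax(dim=1)

# Count correct predictions and samples seen, as train_epoch/test should do
# cumulatively over all batches.
demo_correct = (demo_pred == demo_target).sum().item()
demo_samples = demo_target.size(0)
demo_error = 1 - demo_correct / demo_samples
print(demo_correct, demo_samples, demo_error)
```

Here three of the four predictions match the targets, so the error for this batch is 0.25.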

In [None]:
def train(epochs, model, criterion, optimizer, train_loader, test_loader):
    for epoch in range(epochs):
        train_err = train_epoch(model, criterion, optimizer, train_loader)
        test_err = test(model, test_loader)
        print('Epoch {:03d}/{:03d}, Train Error {:.2f}% || Test Error {:.2f}%'.format(epoch, epochs, train_err*100, test_err*100))
    return train_err, test_err

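def train_epoch(model, criterion, optimizer, loader):
    total_correct = 0.
    total_samples = 0.
    # test() leaves the model in eval mode; switch back to training mode so
    # that BatchNorm layers use batch statistics while parameters are updated
    model.train()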

    for batch_idx, (data, target) in enumerate(loader):
        if torch.cuda.is_available():
            data, target = data.cuda(), target.cuda()

        # insert code to feed the data to the model and collect its output
        output = model(data)

        # insert code to compute the loss from output and the true target
        loss = criterion(output, target)

        # insert code to update total_correct and total_samples
        # total_correct: total number of correctly classified samples
        # total_samples: total number of samples seen so far
        _, pred = torch.max(output.data, 1)
        total_correct += (pred == target).sum().item()
        total_samples += target.size(0)

        # insert code to update the parameters using optimizer
        # be careful in this part as an incorrect implementation will affect
        # all your experiments and have a significant impact on your grade!
        # in particular, note that pytorch does --not-- automatically
        # clear the parameter's gradients: check tutorials to see
        # how this can be done with a single method call.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return 1 - total_correct/total_samples

def test(model, loader):
    total_correct = 0.
    total_samples = 0.
    model.eval()

    with torch.no_grad():
        for batch_idx, (data, target) in enumerate(loader):
            if torch.cuda.is_available():
                data, target = data.cuda(), target.cuda()

            # insert code to feed the data to the model and collect its output
            output = model(data)

            # insert code to update total_correct and total_samples
            # total_correct: total number of correctly classified samples
            # total_samples: total number of samples seen so far
            _, pred = torch.max(output.data, 1)
            total_correct += (pred == target).sum().item()
            total_samples += target.size(0)

    return 1 - total_correct/total_samples

### CNN with Tanh activations

Next, you should implement a baseline model so you can check how increasing the number of layers makes a network considerably harder to train when no additional techniques, such as residual connections and normalization layers, are used.

Finish the implementation of CNNtanh below, carefully following the specifications:

The model should have exactly 'k' many convolutional layers, followed by a linear (fully-connected) layer that actually outputs the logits for each of the 10 MNIST classes.

The network should consist of 3 stages, each with k/3 many convolutional layers (you can assume k is divisible by 3). Each conv layer should have a 3x3 kernel, a stride of 1 and a padding of 1 pixel (such that the output of the convolution has the same height and width as its input).

It should also have an average pooling layer at the end of each stage, with a 2x2 window (hence halving the spatial dimensions), and the number of channels should double from one stage to the other (starting with 4 in the first stage). Moreover, a Tanh activation should follow each convolution layer.

When k=3, for example, the network should be:

1. Stage 1 (1x28x28 input, 4x14x14 output):
    1. Conv layer with 1 input channel and 4 output channels, 3x3 kernel, stride=padding=1
    2. Tanh activation
    3. Average Pool with 2x2 kernel and stride 2
2. Stage 2 (4x14x14 input, 8x7x7 output):
    1. Conv layer with 4 input channels and 8 output channels, 3x3 kernel, stride=padding=1
    2. Tanh activation
    3. Average Pool with 2x2 kernel and stride 2
3. Stage 3 (8x7x7 input, 16x3x3 output):
    1. Conv layer with 8 input channels and 16 output channels, 3x3 kernel, stride=padding=1
    2. Tanh activation
    3. Average Pool with 2x2 kernel and stride 2
4. Fully-connected layer with 16 * 3 * 3=144 input dimension and 10 output dimension

Note that the model should not have any activation after the fully-connected layer: the PyTorch loss module that will be adopted takes logits as input and not class probabilities.

In contrast to the network exemplified above with k=3, when k=6 it should have two conv layers per stage instead of one (each one with a tanh activation following it).
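The shapes listed in the specification above can be checked with a few lines of plain Python. This is only a sketch of the arithmetic: it assumes that the 3x3 convolutions with stride=padding=1 preserve height and width, and that 2x2 average pooling with stride 2 halves them, rounding down. The function name `stage_shapes` is hypothetical.

```python
def stage_shapes(height=28, width=28, channels=(4, 8, 16)):
    """Return the (C, H, W) output shape of each of the three stages."""
    shapes = []
    for c in channels:
        # The conv layers keep H and W unchanged, regardless of how many
        # there are per stage; only the pooling at the end of the stage
        # changes the spatial size (floor division: 7 -> 3).
        height, width = height // 2, width // 2
        shapes.append((c, height, width))
    return shapes

demo_shapes = stage_shapes()
c, h, w = demo_shapes[-1]
fc_inputs = c * h * w
print(demo_shapes, fc_inputs)
```

Because the convolutions preserve the spatial size, the input dimension of the fully-connected layer is 144 for every value of k.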

Lastly, do not change the code block with a for loop at the end of `__init__`: its purpose is to randomly initialize the weights of the conv layers by sampling from a Gaussian with zero mean and a standard deviation of 0.05 (the biases are set to zero).

In [None]:
class CNNtanh(nn.Module):
    def __init__(self, k):
        super(CNNtanh, self).__init__()

        layers_per_stage = k // 3
        self.stages = nn.ModuleList()

        out_channels = [4, 8, 16]
        in_channels = 1
        for i in range(3):
            stage = nn.ModuleList()
            for _ in range(layers_per_stage):
                stage.append(nn.Conv2d(in_channels, out_channels[i], kernel_size=3, stride=1, padding=1))
                stage.append(nn.Tanh())
                in_channels = out_channels[i]
            stage.append(nn.AvgPool2d(kernel_size=2, stride=2))
            self.stages.append(stage)

        self.fully_connected = nn.Linear(out_channels[-1] * 3 * 3, 10)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                m.weight.data.normal_(0, 0.05)
                m.bias.data.zero_()

    def forward(self, input):

        # write code here to define how the output u is computed
        # from the input and the model's layers
        # for example, u = self.conv(input) defines u
        # to be simply the output of self.conv given 'input'

        L = input
        for stage in self.stages:
            for layer in stage:
                L = layer(L)

        L = L.view(L.size(0), -1)
        return self.fully_connected(L)


The line below just instantiates the PyTorch Cross Entropy loss, whose inputs should be logits: this is why the CNN should not have an activation after the last (fully-connected) layer.

In [None]:
criterion = torch.nn.CrossEntropyLoss()
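
To make the "logits, not probabilities" point concrete, the sketch below checks that `CrossEntropyLoss` applied to raw logits matches a log-softmax followed by the negative log-likelihood loss. The tensors are random placeholders and the `demo_` names are hypothetical.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
demo_logits = torch.randn(5, 10)           # 5 samples, 10 classes
demo_labels = torch.randint(0, 10, (5,))   # random class indices

# CrossEntropyLoss takes raw logits and applies log-softmax internally.
loss_from_logits = torch.nn.CrossEntropyLoss()(demo_logits, demo_labels)

# Equivalent two-step computation.
loss_two_step = F.nll_loss(F.log_softmax(demo_logits, dim=1), demo_labels)
print(loss_from_logits.item(), loss_two_step.item())
```

Adding a softmax at the end of the model would therefore normalize the outputs twice.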

Now, you should train CNNtanh with different values for k: your goal is to find the largest value for k such that the network achieves less than 20% error (either train or test) in 3 epochs. You should also choose an appropriate learning rate (but do not change the optimizer or the momentum settings!).

Note that CNNs can easily achieve under 2% test error on MNIST, but we're choosing 20% as a threshold since you will be training each network for only 3 epochs.

Remember to use values for k that are divisible by 3. When submitted, your notebook should have the training log of a network with two consecutive values for k (for example, 6 and 9) such that the network is 'trainable' with the smaller one but not 'trainable' with the larger one. It is fine for the training log to include runs with more than two values of k.
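
One way to locate the two consecutive values of k is a binary search over the multiples of 3. The sketch below shows only the search logic, with a hypothetical predicate standing in for "train for 3 epochs and check the error". It assumes that trainability is monotone in k, which holds only approximately in practice because every training run is stochastic.

```python
def largest_passing_k(upper, passes):
    """Largest multiple of 3 in [3, upper] for which passes(k) is True.

    Returns 0 if no value passes. Assumes passes(k) is True up to some
    depth and False for every larger depth.
    """
    lo, hi = 1, upper // 3
    while lo <= hi:
        mid = (lo + hi) // 2
        if passes(3 * mid):
            lo = mid + 1   # trainable: try deeper networks
        else:
            hi = mid - 1   # not trainable: try shallower networks
    return 3 * hi

# Hypothetical predicate: pretend networks with up to 9 layers are trainable.
demo_boundary = largest_passing_k(24, lambda k: k <= 9)
print(demo_boundary, demo_boundary + 3)
```

The returned value and the next multiple of 3 are the "trainable" and "not trainable" depths, respectively.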

In [None]:
# Binary search over k in {3, 6, ..., upper} for the largest k whose test
# error after 3 epochs is below 20%. This assumes that trainability is
# monotone in k, which holds only approximately since each run is stochastic.

def search(lr, upper, model_factory):
    max_err = 0.2
    l, r = 1, (upper // 3)
    trained = set()  # values of k whose training log has already been printed
    print("Searching for lowest upper bound of trainability")
    while l <= r:
        mid = (l + r) // 2
        layers = 3 * mid
        model = model_factory(layers)
        if torch.cuda.is_available():
            model = model.cuda()
        optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
        print("Training CNN with {} layers".format(layers))
        _, test_err = train(3, model, criterion, optimizer, train_loader, test_loader)
        trained.add(layers)
        if test_err >= max_err:
            print("  Error too high, looking lower")
            r = mid - 1
        else:
            print("  Error under threshold, looking higher")
            l = mid + 1
    # Train the two consecutive values of k if they have not been trained yet
    print(f"With lr={lr}, values of k are {3*r}, {3*r+3}")
    for layers in [3*r, 3*r + 3]:
        # k = 0 is not a valid network, so skip it when no value of k passed
        if layers > 0 and layers not in trained:
            model = model_factory(layers)
            if torch.cuda.is_available():
                model = model.cuda()
            optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
            print("Training CNN with {} layers".format(layers))
            _, test_err = train(3, model, criterion, optimizer, train_loader, test_loader)
    return 3*r

In [None]:
search(lr=0.01, upper=24, model_factory=CNNtanh);

Searching for lowest upper bound of trainability
Training CNN with 12 layers
Epoch 000/003, Train Error 89.04% || Test Error 88.65%
Epoch 001/003, Train Error 88.76% || Test Error 88.65%
Epoch 002/003, Train Error 88.76% || Test Error 88.65%
  Error too high, looking lower
Training CNN with 6 layers
Epoch 000/003, Train Error 88.91% || Test Error 88.65%
Epoch 001/003, Train Error 88.76% || Test Error 88.65%
Epoch 002/003, Train Error 47.04% || Test Error 21.92%
  Error too high, looking lower
Training CNN with 3 layers
Epoch 000/003, Train Error 70.22% || Test Error 32.00%
Epoch 001/003, Train Error 22.74% || Test Error 16.76%
Epoch 002/003, Train Error 15.67% || Test Error 13.24%
  Error under threshold, looking higher
With lr=0.01, values of k are 3, 6


### Better Initialization

Next, we will change the initialization of the conv layers and see how it affects the trainability of deep networks. Instead of sampling from a Gaussian with a standard deviation of 0.05, you should sample from a Gaussian with standard deviation $\sigma = \sqrt{\frac{1}{k^2 \cdot C_{out}}}$ or $\sigma = \sqrt{\frac{1}{k^2 \cdot C_{in}}}$, where $k$ here denotes the kernel size ($k=3$ for 3x3 convolutions), not the number of layers, $C_{in}$ is the number of input channels, and $C_{out}$ is the number of output channels.

The model below should be exactly like CNNtanh except for the standard deviation of the normal distribution used to initialize the conv layers.

The paper 'Understanding the difficulty of training deep feedforward neural networks' by Glorot and Bengio provides some intuition behind such a choice for $\sigma$.
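
As a rough worked example of why this choice of $\sigma$ matters, the sketch below compares it with the naive value of 0.05. It relies on a simplifying assumption that is not part of the assignment: inputs are treated as i.i.d. with zero mean and the nonlinearity, padding and pooling are ignored, so one conv layer multiplies the standard deviation of its input by $\sigma \sqrt{k^2 C_{in}}$. The helper names are hypothetical.

```python
import math

def fan_in_sigma(c_in, kernel_size=3):
    """sigma = sqrt(1 / (k^2 * C_in)) for a k x k convolution."""
    return math.sqrt(1.0 / (kernel_size ** 2 * c_in))

def linear_gain(sigma, c_in, kernel_size=3):
    """Approximate factor by which one conv layer scales the input std
    (linear approximation; the activation function is ignored)."""
    return sigma * math.sqrt(kernel_size ** 2 * c_in)

for c_in in (1, 4, 8, 16):
    print(c_in,
          round(fan_in_sigma(c_in), 4),        # proposed sigma
          round(linear_gain(0.05, c_in), 3))   # per-layer gain with sigma = 0.05
```

Under this approximation the fan-in choice gives a gain of 1 per layer, while the naive value of 0.05 gives a gain of 0.6 with 16 input channels, so a signal passing through 10 such layers would shrink by a factor of roughly $0.6^{10} \approx 0.006$.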

In [None]:
class CNNtanh_newinit(nn.Module):
    def __init__(self, k):
        super(CNNtanh_newinit, self).__init__()

        layers_per_stage = k // 3
        self.stages = nn.ModuleList()

        out_channels = [4, 8, 16]
        in_channels = 1
        for i in range(3):
            stage = nn.ModuleList()
            for _ in range(layers_per_stage):
                stage.append(nn.Conv2d(in_channels, out_channels[i], kernel_size=3, stride=1, padding=1))
                stage.append(nn.Tanh())
                in_channels = out_channels[i]
            stage.append(nn.AvgPool2d(kernel_size=2, stride=2))
            self.stages.append(stage)

        self.fully_connected = nn.Linear(out_channels[-1] * 3 * 3, 10)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                sigma = np.sqrt(1 / (9 * m.in_channels))
                m.weight.data.normal_(0, sigma)
                m.bias.data.zero_()

    def forward(self, input):
        L = input
        for stage in self.stages:
            for layer in stage:
                L = layer(L)

        L = L.view(L.size(0), -1)
        return self.fully_connected(L)

Repeat the procedure of finding the maximum number of layers such that the network is still trainable, this time with CNNtanh_newinit.

In [None]:
search(lr=0.01, upper=120, model_factory=CNNtanh_newinit);

Searching for lowest upper bound of trainability
Training CNN with 60 layers
Epoch 000/003, Train Error 60.63% || Test Error 12.72%
Epoch 001/003, Train Error 7.25% || Test Error 3.81%
Epoch 002/003, Train Error 3.95% || Test Error 2.73%
  Error under threshold, looking higher
Training CNN with 90 layers
Epoch 000/003, Train Error 88.98% || Test Error 88.65%
Epoch 001/003, Train Error 88.76% || Test Error 88.65%
Epoch 002/003, Train Error 66.84% || Test Error 30.07%
  Error too high, looking lower
Training CNN with 75 layers
Epoch 000/003, Train Error 59.16% || Test Error 18.90%
Epoch 001/003, Train Error 10.69% || Test Error 6.18%
Epoch 002/003, Train Error 6.60% || Test Error 4.04%
  Error under threshold, looking higher
Training CNN with 81 layers
Epoch 000/003, Train Error 73.46% || Test Error 31.82%
Epoch 001/003, Train Error 24.31% || Test Error 10.68%
Epoch 002/003, Train Error 7.78% || Test Error 5.97%
  Error under threshold, looking higher
Training CNN with 84 layers
Epoch 00

84

### CNN with ELU activations

In this section you should replace the Tanh activations of the previous network with Exponential Linear Units (ELUs). Complete CNNelu below, which should be exactly like CNNtanh_newinit except for ELU activations instead of Tanh (ELUs are readily available in PyTorch; check its documentation for more details).
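
For reference, the ELU with parameter $\alpha$ is $x$ for $x > 0$ and $\alpha (e^{x} - 1)$ otherwise. The sketch below writes this out by hand and compares it with PyTorch's `F.elu`, using its default $\alpha = 1$. The helper `elu_manual` is a hypothetical name used only for this check.

```python
import torch
import torch.nn.functional as F

def elu_manual(x, alpha=1.0):
    # Identity for positive inputs, saturating exponential for the rest.
    return torch.where(x > 0, x, alpha * (torch.exp(x) - 1))

demo_x = torch.linspace(-3, 3, 7)
print(elu_manual(demo_x))
print(F.elu(demo_x))
```

Unlike Tanh, the ELU does not saturate for positive inputs, where its derivative is exactly 1.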

In [None]:
class CNNelu(nn.Module):
    def __init__(self, k):
        super(CNNelu, self).__init__()

        layers_per_stage = k // 3
        self.stages = nn.ModuleList()

        out_channels = [4, 8, 16]
        in_channels = 1
        for i in range(3):
            stage = nn.ModuleList()
            for _ in range(layers_per_stage):
                stage.append(nn.Conv2d(in_channels, out_channels[i], kernel_size=3, stride=1, padding=1))
                stage.append(nn.ELU())
                in_channels = out_channels[i]
            stage.append(nn.AvgPool2d(kernel_size=2, stride=2))
            self.stages.append(stage)

        self.fully_connected = nn.Linear(out_channels[-1] * 3 * 3, 10)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                sigma = np.sqrt(1 / (9 * m.in_channels))
                m.weight.data.normal_(0, sigma)
                m.bias.data.zero_()

    def forward(self, input):
        L = input
        for stage in self.stages:
            for layer in stage:
                L = layer(L)

        L = L.view(L.size(0), -1)
        return self.fully_connected(L)

Repeat the procedure of finding the maximum number of layers such that the network is still trainable, this time with CNNelu.

In [None]:
search(lr=0.01, upper=150, model_factory=CNNelu);

Searching for lowest upper bound of trainability
Training CNN with 75 layers
Epoch 000/003, Train Error 62.27% || Test Error 14.74%
Epoch 001/003, Train Error 9.80% || Test Error 7.90%
Epoch 002/003, Train Error 5.42% || Test Error 3.75%
  Error under threshold, looking higher
Training CNN with 114 layers
Epoch 000/003, Train Error 88.87% || Test Error 88.65%
Epoch 001/003, Train Error 88.76% || Test Error 88.65%
Epoch 002/003, Train Error 88.76% || Test Error 88.65%
  Error too high, looking lower
Training CNN with 93 layers
Epoch 000/003, Train Error 88.98% || Test Error 88.65%
Epoch 001/003, Train Error 60.32% || Test Error 17.00%
Epoch 002/003, Train Error 11.69% || Test Error 5.16%
  Error under threshold, looking higher
Training CNN with 102 layers
Epoch 000/003, Train Error 88.85% || Test Error 88.65%
Epoch 001/003, Train Error 88.76% || Test Error 88.65%
Epoch 002/003, Train Error 79.12% || Test Error 52.32%
  Error too high, looking lower
Training CNN with 96 layers
Epoch 000/

93

### CNN with Batch Normalization

Next, you will check how batch normalization can make deep networks easier to train. Implement the network below, which should be exactly like CNNelu except for additional BatchNorm2d layers after each convolution (before the ELU activation).

Note that BatchNorm2d modules require the number of channels as argument -- see the PyTorch documentation for more details.
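
The sketch below illustrates what the channel argument refers to: in training mode, `BatchNorm2d` normalizes each channel separately, using statistics computed over the batch and spatial dimensions. The input is random placeholder data with a deliberately shifted mean and scale, and the `demo_` names are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
demo_bn = nn.BatchNorm2d(4)   # 4 = number of channels of the incoming feature map
demo_bn.train()               # use batch statistics, as during training

# Placeholder activations with mean about 5 and std about 3 in every channel.
demo_feats = 5.0 + 3.0 * torch.randn(256, 4, 14, 14)

with torch.no_grad():
    demo_normed = demo_bn(demo_feats)

# Statistics per channel, taken over the batch, height and width dimensions.
per_channel_mean = demo_normed.mean(dim=(0, 2, 3))
per_channel_rms = demo_normed.pow(2).mean(dim=(0, 2, 3)).sqrt()
print(per_channel_mean, per_channel_rms)
```

With the default affine parameters (scale 1, shift 0), every channel comes out with mean close to 0 and standard deviation close to 1.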

In [None]:
class CNNeluBN(nn.Module):
    def __init__(self, k):
        super(CNNeluBN, self).__init__()

        # write code here to instantiate layers
        # for example, self.conv = nn.Conv2d(1, 4, 3, 1, 1)
        # creates a conv layer with 1 input channel, 4 output
        # channels, a 3x3 kernel, and stride=padding=1

        layers_per_stage = k // 3
        self.stages = nn.ModuleList()

        out_channels = [4, 8, 16]
        in_channels = 1
        for i in range(3):
            stage = nn.ModuleList()
            for _ in range(layers_per_stage):
                stage.append(nn.Conv2d(in_channels, out_channels[i], kernel_size=3, stride=1, padding=1))
                stage.append(nn.BatchNorm2d(out_channels[i]))
                stage.append(nn.ELU())
                in_channels = out_channels[i]
            stage.append(nn.AvgPool2d(kernel_size=2, stride=2))
            self.stages.append(stage)

        self.fully_connected = nn.Linear(out_channels[-1] * 3 * 3, 10)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                sigma = np.sqrt(1 / (9 * m.in_channels))
                m.weight.data.normal_(0, sigma)
                m.bias.data.zero_()

    def forward(self, input):
        L = input
        for stage in self.stages:
            for layer in stage:
                L = layer(L)

        L = L.view(L.size(0), -1)
        return self.fully_connected(L)

Repeat the procedure of finding the maximum number of layers such that the network is still trainable, this time with CNNeluBN.

In [None]:
search(lr=0.01, upper=300, model_factory=CNNeluBN);

Searching for lowest upper bound of trainability
Training CNN with 150 layers
Epoch 000/003, Train Error 44.15% || Test Error 24.05%
Epoch 001/003, Train Error 37.10% || Test Error 22.35%
Epoch 002/003, Train Error 17.86% || Test Error 11.17%
  Error under threshold, looking higher
Training CNN with 225 layers
Epoch 000/003, Train Error 89.27% || Test Error 89.06%
Epoch 001/003, Train Error 89.22% || Test Error 89.57%
Epoch 002/003, Train Error 86.63% || Test Error 83.53%
  Error too high, looking lower
Training CNN with 186 layers
Epoch 000/003, Train Error 80.13% || Test Error 56.50%
Epoch 001/003, Train Error 51.95% || Test Error 39.56%
Epoch 002/003, Train Error 34.12% || Test Error 35.61%
  Error too high, looking lower
Training CNN with 168 layers
Epoch 000/003, Train Error 57.80% || Test Error 37.98%
Epoch 001/003, Train Error 87.25% || Test Error 89.90%
Epoch 002/003, Train Error 89.40% || Test Error 90.26%
  Error too high, looking lower
Training CNN with 159 layers
Epoch 000/

156

### Residual Networks

Finally, you will experiment with adding residual connections to a CNN.

To implement the model below, you should add a 'skip connection' to 'Conv->BatchNorm->ELU' blocks whenever the shapes of the block's input and output are the same: this will be the case for every such block except for the first one in each stage, as it changes the number of channels.

More specifically, you should change $u = ELU(BatchNorm(Conv(x)))$ to $u = ELU(BatchNorm(Conv(x))) + x$, where $x$ and $u$ denote the block's input and output, respectively.

You should take your CNNeluBN implementation and add skip-connections as described above.

Note that there are key differences between the resulting model and the actual ResNet proposed by He et al. in 'Deep Residual Learning for Image Recognition', for example the use of ELU activations instead of ReLU and the exact position of skip-connections.
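
One way to see what the skip connection contributes is to zero out the convolution of a block: the residual branch then outputs zero, and the block passes both its input and the gradient through unchanged. The sketch below checks this with a hypothetical `DemoResBlock`, named so that it does not clash with the classes you implement in the next cell. It illustrates the formula above and is not the required implementation.

```python
import torch
import torch.nn as nn

class DemoResBlock(nn.Module):
    """u = ELU(BatchNorm(Conv(x))) + x, with equal input/output channels."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.bn = nn.BatchNorm2d(channels)
        self.elu = nn.ELU()

    def forward(self, x):
        return self.elu(self.bn(self.conv(x))) + x

torch.manual_seed(0)
demo_block = DemoResBlock(4)
# Zero the conv so that the residual branch contributes nothing.
nn.init.zeros_(demo_block.conv.weight)
nn.init.zeros_(demo_block.conv.bias)

demo_in = torch.randn(2, 4, 14, 14, requires_grad=True)
demo_res = demo_block(demo_in)
demo_res.sum().backward()

# The output equals the input, and the gradient w.r.t. the input is all ones.
print(torch.allclose(demo_res.detach(), demo_in.detach()))
print(torch.allclose(demo_in.grad, torch.ones_like(demo_in)))
```

The skip path thus carries the gradient to earlier layers even when the convolutional branch contributes nothing, which is the intuition for why residual networks remain trainable at much larger depths.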

In [None]:
class ResBlock(nn.Module):
    def __init__(self, ins, outs, kernel_size=3, stride=1, padding=1):
        super(ResBlock, self).__init__()
        self.conv = nn.Conv2d(ins, outs, kernel_size, stride, padding)
        self.bn = nn.BatchNorm2d(outs)
        self.elu = nn.ELU()

    def forward(self, x):
        out = self.conv(x)
        out = self.bn(out)
        out = self.elu(out)
        out = out + x
        return out

class ResNet(nn.Module):
    def __init__(self, k):
        super(ResNet, self).__init__()

        layers_per_stage = k // 3
        self.stages = nn.ModuleList()

        out_channels = [4, 8, 16]
        in_channels = 1
        for i in range(3):
            stage = nn.ModuleList()
            for l in range(layers_per_stage):
                if l == 0:
                    stage.append(nn.Conv2d(in_channels, out_channels[i], kernel_size=3, stride=1, padding=1))
                    stage.append(nn.BatchNorm2d(out_channels[i]))
                    stage.append(nn.ELU())
                    in_channels = out_channels[i]
                else:
                    stage.append(ResBlock(in_channels, out_channels[i]))
            stage.append(nn.AvgPool2d(kernel_size=2, stride=2))
            self.stages.append(stage)

        self.fully_connected = nn.Linear(out_channels[-1] * 3 * 3, 10)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                sigma = np.sqrt(1 / (9 * m.in_channels))
                m.weight.data.normal_(0, sigma)
                m.bias.data.zero_()

    def forward(self, input):
        L = input
        for stage in self.stages:
            for layer in stage:
                L = layer(L)

        L = L.view(L.size(0), -1)  # Flatten the tensor
        return self.fully_connected(L)


Repeat the procedure of finding the maximum number of layers such that the network is still trainable, this time with the 'ResNet' model.

In [None]:
search(lr=0.01, upper=450, model_factory=ResNet);

Searching for lowest upper bound of trainability
Training CNN with 225 layers
Epoch 000/003, Train Error 46.69% || Test Error 18.80%
Epoch 001/003, Train Error 11.45% || Test Error 7.58%
Epoch 002/003, Train Error 7.31% || Test Error 6.55%
  Error under threshold, looking higher
Training CNN with 339 layers
Epoch 000/003, Train Error 39.00% || Test Error 12.73%
Epoch 001/003, Train Error 9.99% || Test Error 7.58%
Epoch 002/003, Train Error 6.75% || Test Error 6.15%
  Error under threshold, looking higher
Training CNN with 396 layers
Epoch 000/003, Train Error 76.47% || Test Error 26.85%
Epoch 001/003, Train Error 15.24% || Test Error 10.50%
Epoch 002/003, Train Error 9.48% || Test Error 7.41%
  Error under threshold, looking higher
Training CNN with 423 layers
Epoch 000/003, Train Error 58.84% || Test Error 19.90%
Epoch 001/003, Train Error 15.52% || Test Error 11.28%
Epoch 002/003, Train Error 10.95% || Test Error 7.70%
  Error under threshold, looking higher
Training CNN with 438 lay

447

**<font color='blue'>
    Summarize your results and observations regarding the experiments above. What was the maximum number of layers for each of the five models such that training remained successful? Briefly discuss why you think each modification helped/harmed the trainability of deep models.
</font>**

**<font color='red'> --------------------------------------------------------------------- ANSWER (BEGIN) ---------------------------------------------------------------------
</font>**

<ol>
<li> Tanh - Max trainable with k=3
<li> Tanh with better init - Max trainable with k=84
<li> ELU - Max trainable with k=93
<li> ELU with batch normalization - Max trainable with k=156
<li> ELU ResNet - Max trainable with k=447
</ol>
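We get substantial increases in trainable depth from better initialization, batch normalization, and residual connections, while switching from Tanh to ELU gives only a modest gain (k=84 to k=93). Better initialization probably helps by keeping the scale of activations and gradients roughly constant from layer to layer, which reduces the risk of vanishing/exploding gradients and of the CNN getting stuck early in training. Batch normalization also counteracts vanishing and exploding gradients, since it re-normalizes the activations after every convolution. Adding residual connections gives gradients a more direct path from later layers to earlier ones, which also helps with the vanishing gradient problem.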

**<font color='red'> ---------------------------------------------------------------------- ANSWER (END) ----------------------------------------------------------------------
</font>**

### Interactions: Batch Norm and Initialization

Intuitively, batch norm should make the model more robust to changes in the magnitude of the network's weights: informally, scaling up all the elements of a conv layer's filters by a factor of 10 would not affect the network's output as long as there is a batch norm layer following such convolution, as the normalization would undo the scaling.
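
The intuition above can be checked numerically on a single conv layer followed by batch norm. The sketch below uses random placeholder inputs and hypothetical `demo_` names. The two outputs agree only approximately, because batch norm adds a small constant `eps` to the variance before dividing.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
demo_conv = nn.Conv2d(1, 4, kernel_size=3, stride=1, padding=1)
demo_norm = nn.BatchNorm2d(4)      # training mode: normalizes with batch statistics
demo_images = torch.randn(32, 1, 28, 28)

with torch.no_grad():
    before = demo_norm(demo_conv(demo_images))
    demo_conv.weight.mul_(10.0)    # scale every filter element by 10
    after = demo_norm(demo_conv(demo_images))

max_diff = (before - after).abs().max().item()
print(max_diff)
```

The conv bias is left unscaled here. This does not matter, because a per-channel constant is removed when batch norm subtracts the channel mean.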

To check how this intuition translates to practical settings, you should change the original 'CNNtanh' model so that it incorporates batch norm layers (like you have done when modifying 'CNNelu' into 'CNNeluBN').

The model below should adopt the naive initialization procedure of sampling from a Gaussian with a standard deviation of 0.05, not the more sophisticated one that you implemented previously.

In [None]:
class CNNtanhBN_oldinit(nn.Module):
    def __init__(self, k):
        super(CNNtanhBN_oldinit, self).__init__()

        layers_per_stage = k // 3
        self.stages = nn.ModuleList()

        out_channels = [4, 8, 16]
        in_channels = 1
        for i in range(3):
            stage = nn.ModuleList()
            for _ in range(layers_per_stage):
                stage.append(nn.Conv2d(in_channels, out_channels[i], kernel_size=3, stride=1, padding=1))
                stage.append(nn.BatchNorm2d(out_channels[i]))
                stage.append(nn.Tanh())
                in_channels = out_channels[i]
            stage.append(nn.AvgPool2d(kernel_size=2, stride=2))
            self.stages.append(stage)

        self.fully_connected = nn.Linear(out_channels[-1] * 3 * 3, 10)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                m.weight.data.normal_(0, 0.05)
                m.bias.data.zero_()

    def forward(self, input):
        L = input
        for stage in self.stages:
            for layer in stage:
                L = layer(L)

        L = L.view(L.size(0), -1)
        return self.fully_connected(L)


Repeat the procedure of finding the maximum number of layers such that the network is still trainable, this time with CNNtanhBN_oldinit.

In [None]:
search(lr=0.01, upper=120, model_factory=CNNtanhBN_oldinit);

Searching for lowest upper bound of trainability
Training CNN with 60 layers
Epoch 000/003, Train Error 36.16% || Test Error 8.57%
Epoch 001/003, Train Error 19.40% || Test Error 7.89%
Epoch 002/003, Train Error 7.21% || Test Error 6.38%
  Error under threshold, looking higher
Training CNN with 90 layers
Epoch 000/003, Train Error 55.93% || Test Error 42.03%
Epoch 001/003, Train Error 33.71% || Test Error 21.04%
Epoch 002/003, Train Error 17.49% || Test Error 14.40%
  Error under threshold, looking higher
Training CNN with 105 layers
Epoch 000/003, Train Error 88.98% || Test Error 88.65%
Epoch 001/003, Train Error 88.84% || Test Error 88.65%
Epoch 002/003, Train Error 88.78% || Test Error 88.65%
  Error too high, looking lower
Training CNN with 96 layers
Epoch 000/003, Train Error 60.54% || Test Error 42.21%
Epoch 001/003, Train Error 32.31% || Test Error 24.10%
Epoch 002/003, Train Error 18.75% || Test Error 14.46%
  Error under threshold, looking higher
Training CNN with 99 layers
Ep

**<font color='blue'>
    Compare CNNtanh (model with naive initialization and no batch norm), CNNtanh_newinit (model with better initialization and no batch norm), and CNNtanhBN_oldinit (model with naive initialization and batch norm) in terms of how deep each could be while remaining trainable, and discuss your thoughts on how batch norm interacts with the way parameters are initialized.
</font>**

**<font color='red'> --------------------------------------------------------------------- ANSWER (BEGIN) ---------------------------------------------------------------------
</font>**

<li> CNNtanh (naive initialization, no batch norm) - max trainable depth k = 3
<li> CNNtanh_newinit (better initialization, no batch norm) - max trainable depth k = 84
<li> CNNtanhBN_oldinit (naive initialization, batch norm) - max trainable depth k = 96

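Batch norm seems to largely remove the sensitivity to parameter initialization: even with the naive initialization (a fixed standard deviation of 0.05), the network with batch norm trains at a greater depth (k = 96) than the better-initialized network without it (k = 84). Each batch norm layer rescales its input to zero mean and unit variance, so the scale of the activations no longer depends on the scale of the preceding weights, and we no longer have to be super careful about picking the best possible initialization.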
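
A small experiment can make this concrete. The sketch below is a simplified illustration, not one of the assignment's models: `output_std` is a hypothetical helper, the inputs are random stand-ins for feature maps, and the channel count, image size, and depth are arbitrary choices. It pushes a batch through a stack of conv + tanh layers initialized with std 0.05 and reports the scale of the final activations with and without batch normalization.

```python
import torch
import torch.nn.functional as F


def output_std(depth, weight_std, use_bn, channels=4, seed=0):
    """Std of the activations after `depth` conv -> (batch norm) -> tanh layers."""
    # Local generator, so the notebook's global random state is left untouched.
    gen = torch.Generator().manual_seed(seed)
    x = torch.randn(64, channels, 8, 8, generator=gen)  # stand-in for a batch of feature maps
    for _ in range(depth):
        weight = weight_std * torch.randn(channels, channels, 3, 3, generator=gen)
        x = F.conv2d(x, weight, padding=1)  # zero bias, as in the naive initialization
        if use_bn:
            # training=True normalizes with the statistics of the current batch
            x = F.batch_norm(x, None, None, training=True)
        x = torch.tanh(x)
    return x.std().item()


naive_plain_std = output_std(depth=30, weight_std=0.05, use_bn=False)
naive_bn_std = output_std(depth=30, weight_std=0.05, use_bn=True)
print(f"no batch norm: {naive_plain_std:.2e}   with batch norm: {naive_bn_std:.2e}")
```

Under these assumptions the un-normalized signal should shrink toward zero with depth, because each layer scales it down, while the batch-normalized stack should keep the activations at a usable scale regardless of the weight std.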

**<font color='red'> ---------------------------------------------------------------------- ANSWER (END) ----------------------------------------------------------------------
</font>**

### Interactions: Batch Norm and Residual Connections

Lastly, implement and train a CNN with residual connections but without batch normalization layers -- the goal here is to check how residuals interact with normalization.

The model below should be exactly like ResNet, except that it should not have batch norm layers.

In [None]:
class ResBlock_noBN(nn.Module):
    def __init__(self, ins, outs, kernel_size=3, stride=1, padding=1):
        super(ResBlock_noBN, self).__init__()
        self.conv = nn.Conv2d(ins, outs, kernel_size, stride, padding)
        self.elu = nn.ELU()

    def forward(self, x):
        out = self.conv(x)
        out = self.elu(out)
        # Skip connection: needs ins == outs and a shape-preserving convolution
        out = out + x
        return out

class ResNet_noBN(nn.Module):
    def __init__(self, k):
        super(ResNet_noBN, self).__init__()
        layers_per_stage = k // 3
        self.stages = nn.ModuleList()

        out_channels = [4, 8, 16]
        in_channels = 1
        for i in range(3):
            stage = nn.ModuleList()
            for l in range(layers_per_stage):
                if l == 0:
                    stage.append(nn.Conv2d(in_channels, out_channels[i], kernel_size=3, stride=1, padding=1))
                    stage.append(nn.ELU())
                    in_channels = out_channels[i]
                else:
                    stage.append(ResBlock_noBN(in_channels, out_channels[i]))
            stage.append(nn.AvgPool2d(kernel_size=2, stride=2))
            self.stages.append(stage)

        self.fully_connected = nn.Linear(out_channels[-1] * 3 * 3, 10)

        # Fan-in scaled initialization: std = sqrt(1 / fan_in), with fan_in = 3 * 3 * in_channels
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                sigma = np.sqrt(1 / (9 * m.in_channels))
                m.weight.data.normal_(0, sigma)
                m.bias.data.zero_()

    def forward(self, input):
        L = input
        for stage in self.stages:
            for layer in stage:
                L = layer(L)

        L = L.view(L.size(0), -1)
        return self.fully_connected(L)
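
The `out_channels[-1] * 3 * 3` input size of the final linear layer follows from the three pooling stages, since the 3x3 convolutions with stride 1 and padding 1 keep the spatial size unchanged. The check below is a sketch that assumes 28x28 inputs (the input size is not stated in this part of the notebook); `pooled_sizes` is a hypothetical helper name.

```python
import torch
import torch.nn as nn


def pooled_sizes(input_size=28, num_stages=3):
    """Spatial size after each 2x2 average pooling stage (assumed 28x28 input)."""
    pool = nn.AvgPool2d(kernel_size=2, stride=2)
    feature_map = torch.zeros(1, 1, input_size, input_size)
    sizes = [input_size]
    for _ in range(num_stages):
        feature_map = pool(feature_map)  # odd sizes are floored: 7 -> 3
        sizes.append(feature_map.shape[-1])
    return sizes


print(pooled_sizes())  # [28, 14, 7, 3]
```

With 16 channels in the last stage, this gives 16 * 3 * 3 = 144 features going into the classifier.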

Repeat the procedure of finding the maximum number of layers such that the network is still trainable, this time with ResNet_noBN.

In [None]:
search(lr=0.01, upper=24, model_factory=ResNet_noBN);

Searching for lowest upper bound of trainability
Training CNN with 12 layers
Epoch 000/003, Train Error 10.30% || Test Error 2.65%
Epoch 001/003, Train Error 2.44% || Test Error 2.15%
Epoch 002/003, Train Error 1.80% || Test Error 1.59%
  Error under threshold, looking higher
Training CNN with 18 layers
Epoch 000/003, Train Error 90.15% || Test Error 90.20%
Epoch 001/003, Train Error 90.13% || Test Error 90.20%
Epoch 002/003, Train Error 90.13% || Test Error 90.20%
  Error too high, looking lower
Training CNN with 15 layers
Epoch 000/003, Train Error 90.11% || Test Error 90.20%
Epoch 001/003, Train Error 90.13% || Test Error 90.20%
Epoch 002/003, Train Error 90.13% || Test Error 90.20%
  Error too high, looking lower
With lr=0.01, values of k are 12, 15


**<font color='blue'>
    Compare ResNet and ResNet_noBN in terms of how deep each could be while being trainable, and discuss your thoughts on how batch norm interacts with residual connections.
</font>**

**<font color='red'> --------------------------------------------------------------------- ANSWER (BEGIN) ---------------------------------------------------------------------
</font>**

<li> ResNet_noBN - trainable to only k = 12
<li> ResNet - trainable to k = 447

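There is a huge difference in trainable depth: the residual network without batch norm (k = 12) is almost as bad as the naive CNN with no residual connections. Residual connections alone are therefore not enough; batch norm seems necessary for CNNs of great depth, making sure that the activations and gradients do not explode or vanish.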
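
One plausible mechanism is that, without normalization, every block adds a branch whose scale is proportional to its input, so the activation scale compounds with depth. The sketch below is a simplified illustration under stated assumptions: `residual_output_std` is a hypothetical helper, the inputs are random, and the batch norm placement (conv, then batch norm, then ELU inside the branch) is an assumption that may not match the notebook's ResNet block.

```python
import torch
import torch.nn.functional as F


def residual_output_std(num_blocks, use_bn, channels=16, seed=0):
    """Std of the activations after `num_blocks` blocks of x + elu([bn](conv(x)))."""
    # Local generator, so the notebook's global random state is left untouched.
    gen = torch.Generator().manual_seed(seed)
    x = torch.randn(64, channels, 8, 8, generator=gen)
    sigma = (1.0 / (9 * channels)) ** 0.5  # same fan-in scaled init as ResNet_noBN
    for _ in range(num_blocks):
        weight = sigma * torch.randn(channels, channels, 3, 3, generator=gen)
        branch = F.conv2d(x, weight, padding=1)
        if use_bn:
            # Assumed placement: normalize the branch before the nonlinearity
            branch = F.batch_norm(branch, None, None, training=True)
        x = x + F.elu(branch)
    return x.std().item()


res_plain_std = residual_output_std(num_blocks=50, use_bn=False)
res_bn_std = residual_output_std(num_blocks=50, use_bn=True)
print(f"no batch norm: {res_plain_std:.2e}   with batch norm: {res_bn_std:.2e}")
```

In this setup the normalized branch contributes a bounded amount per block, so the batch-normalized stack can grow at most linearly with depth, while the plain residual stack should grow much faster.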

**<font color='red'> ---------------------------------------------------------------------- ANSWER (END) ----------------------------------------------------------------------
</font>**