<a href="https://colab.research.google.com/github/Hgherzog/TTIC-DeepLearning/blob/main/PS2_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Instructions

For this assignment you will use PyTorch instead of EDF to implement and train neural networks. The experiments in this assignment will take a long time to run without a GPU, but you can run the notebook remotely on Google Colab and have access to GPUs for free -- in this case you don't have to worry about installing PyTorch as it is available by default in Google Colab's environment.

In case you will be running the experiments in your own machine, you should install PyTorch -- there are multiple tutorials online and it is especially easy if you're using Anaconda. Check https://pytorch.org/tutorials/ for some PyTorch tutorials -- this assignment assumes that you know the basics like defining models with multiple modules and coding up functions to train models with PyTorch optimizers. To 

To use Google Colab, you should access https://colab.research.google.com/ and upload this notebook to your workspace. To use a GPU, go to Edit -> Notebook settings and select GPU as the accelerator.

Unlike previous assignments, in this one you will have to do some writing instead of just coding. Try to keep your answers short and precise, and you are encouraged to write equations if needed (you can do that using markdown cells). You can also use code as part of your answers (like plotting and printing, etc). Blue text indicates questions or things that you should discuss/comment, and there will red "ANSWER (BEGIN)" and "ANSWER (END)" markdown cells to indicate that you should add cells with your writeup between these two. **Make sure not to redefine variables or functions in your writeup, which can change the behavior of the next cells.**

Finally, you might have to do minor changes to the provided code due to differences in python/pytorch versions. You can post on piazza if there's a major, non-trivial change that you had to do (so other students can be aware of it and how to proceed), but for minor changes you should just apply them and keep working on the assignment.

In [None]:
import torch, math, copy
import numpy as np
from torchvision import datasets, transforms
import torch.nn as nn
import torch.nn.init as init
import torch.nn.functional as F

# From Shallow to Deep Neural Networks

The main goal of this assignment is to develop a better understanding of how the depth of a network interacts with its trainability and performance.

In the previous assignment you likely observed difficulties in training sigmoid and ReLU networks with over ~8 layers, which is typically associated with 'vanishing' or 'exploding' gradients. As you will see, some of the biggest achievements in deep learning have been the development of techniques that enable deeper networks to be successfully trained, and without them deep networks are notoriously difficult to train successfully.

You will be working with the MNIST dataset, which will be downloaded and loaded in the cell below.

In [None]:
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])

train_dataset = datasets.MNIST("data", train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=256, shuffle=True)

test_dataset = datasets.MNIST("data", train=False, download=True, transform=transform)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=256, shuffle=False)

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to data/MNIST/raw/train-images-idx3-ubyte.gz


  0%|          | 0/9912422 [00:00<?, ?it/s]

Extracting data/MNIST/raw/train-images-idx3-ubyte.gz to data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to data/MNIST/raw/train-labels-idx1-ubyte.gz


  0%|          | 0/28881 [00:00<?, ?it/s]

Extracting data/MNIST/raw/train-labels-idx1-ubyte.gz to data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to data/MNIST/raw/t10k-images-idx3-ubyte.gz


  0%|          | 0/1648877 [00:00<?, ?it/s]

Extracting data/MNIST/raw/t10k-images-idx3-ubyte.gz to data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to data/MNIST/raw/t10k-labels-idx1-ubyte.gz


  0%|          | 0/4542 [00:00<?, ?it/s]

Extracting data/MNIST/raw/t10k-labels-idx1-ubyte.gz to data/MNIST/raw



Fill the missing code below. In both train_epoch and test, total_correct should be the total number of correctly classified samples, while total_samples should be the total number of samples that have been iterated over.

In [None]:
def train(epochs, model, criterion, optimizer, train_loader, test_loader):
    for epoch in range(epochs):
        train_err = train_epoch(model, criterion, optimizer, train_loader)
        test_err = test(model, test_loader)
        print('Epoch {:03d}/{:03d}, Train Error {:.2f}% || Test Error {:.2f}%'.format(epoch, epochs, train_err*100, test_err*100))
    return train_err, test_err
    
def train_epoch(model, criterion, optimizer, loader):
    total_correct = 0.
    total_samples = 0.
    
    for batch_idx, (data, target) in enumerate(loader):
        if torch.cuda.is_available():
            data, target = data.cuda(), target.cuda()
        len_t = len(target)
        # insert code to feed the data to the model and collect its output 
        #assume output is a probability distribution over labels
        output = model(data)
        # insert code to compute the loss from output and the true target
        loss = criterion(output, target)

        # insert code to update total_correct and total_samples
        # total_correct: total number of correctly classified samples
        # total_samples: total number of samples seen so far
        for i in range(len_t):
          
          total_correct +=  1 if torch.argmax(output[i]) == target[i] else 0
          total_samples += 1

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return 1 - total_correct/total_samples
    
def test(model, loader):
    total_correct = 0.
    total_samples = 0.
    #.eval() turns off gradient computation and batch normalization etc duing traing
    model.eval()
    
    #Makes sure the gradients aren't being kept during the testing process
    with torch.no_grad():
        for batch_idx, (data, target) in enumerate(loader):
            if torch.cuda.is_available():
              #.cuda() allows the gpu to be used for faster paralellized computation
                data, target = data.cuda(), target.cuda()

            # insert code to feed the data to the model and collect its output 
            output = model(data)

            # insert code to update total_correct and total_samples
            # total_correct: total number of correctly classified samples
            # total_samples: total number of samples seen so far
            for i, val in enumerate(target):
              total_correct += 1 if torch.argmax(output[i]) == val else 0
              total_samples += 1 

    return 1 - total_correct/total_samples

### CNN with Tanh activations

Next, you should implement a baseline model so you can check how increasing the number of layers can make a network considerably harder to train, given that no additional methods such as residual connections and normalization layers are adopted.

Finish the implementation of CNNtanh below, carefully following the specifications:

The model should have exactly 'k' many convolutional layers, followed by a linear (fully-connected) layer that actually outputs the logits for each of the 10 MNIST classes.

The network should consist of 3 stages, each with k/3 many convolutional layers (you can assume k is divisible by 3). Each conv layer should have a 3x3 kernel, a stride of 1 and a padding of 1 pixel (such that the output of the convolution has the same height and width as its input).

It should also have an average pooling layer at the end of each stage, with a 2x2 window (hence halving the spatial dimensions), and the number of channels should double from one stage to the other (starting with 4 in the first stage). Moreover, a Tanh activation should follow each convolution layer.

When k=3, for example, the network should be:

1. Stage 1 (1x28x28 input, 4x14x14 output):
    1. Conv layer with 1 input channel and 4 output channels, 3x3 kernel, stride=padding=1
    2. Tanh activation
    3. Average Pool with 2x2 kernel and stride 2
2. Stage 2 (4x14x14 input, 8x7x7 output):
    1. Conv layer with 4 input channels and 8 output channels, 3x3 kernel, stride=padding=1
    2. Tanh activation
    3. Average Pool with 2x2 kernel and stride 2
3. Stage 3 (8x7x7 input, 16x3x3 output):
    1. Conv layer with 8 input channels and 16 output channels, 3x3 kernel, stride=padding=1
    2. Tanh activation
    3. Average Pool with 2x2 kernel and stride 2
4. Fully-connected layer with 16 * 3 * 3=144 input dimension and 10 output dimension

Note that the model should not have any activation after the fully-connected layer: the PyTorch loss module that will be adopted takes logits as input and not class probabilities.

In contrast to the network exemplified above with k=3, when k=6 it should have two conv layers per stage instead of one (each one with a tanh activation following it).

Lastly, do not change the code block with a for loop in the end of init: its purpose to randomly initialize the parameters of the conv layers by sampling from a Gaussian with zero mean and 0.05 deviation.

In [None]:
class CNNtanh(nn.Module):
    def __init__(self, k):
        #super allows objects of class
        super(CNNtanh, self).__init__()
        if k % 3 != 0:
          return ValueError("k not divisible by 3")
        # write code here to instantiate layers
        # for example, self.conv = nn.Conv2d(1, 4, 3, 1, 1)
        # creates a conv layer with 1 input channel, 4 output
        # channels, a 3x3 kernel, and stride=padding=1
        self.k = k
        self.layperstage= k // 3
        # Initialize module list
        self.translayer = nn.ModuleList()
        self.layer = nn.ModuleList()
        input_c = 1
        for i in range(3):
          output_c = 4 * pow(2, i)
          self.translayer.append(nn.Conv2d(input_c, output_c, 3 , 1, 1))
          self.layer.append(nn.Conv2d(output_c, output_c, 3, 1, 1))
          input_c=output_c
        
        self.act=nn.Tanh()
        self.avgpool = nn.AvgPool2d(2,2)
        self.flatten = nn.Flatten()
        self.linear = nn.Linear(144, 10)
                                     

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                m.weight.data.normal_(0, 0.05)
                m.bias.data.zero_()
        
    def forward(self, input):

        # write code here to define how the output u is computed
        # from the input and the model's layers
        # for example, u = self.conv(input) defines u
        # to be simply the output of self.conv given 'input'
        u = input
        for i in range(3):
          for j in range(self.layperstage):
            if j ==0:
              u = self.translayer[i](u)
            else:
              u = self.layer[i](u)
            u = self.act(u)
          u = self.avgpool(u)
        u = self.flatten(u)
        u = self.linear(u)
        return u

The line below just instantiates the PyTorch Cross Entropy loss, whose inputs should be logits: hence the reason that the CNN should not have an activation after last (feedforward) layer.

In [None]:
criterion = torch.nn.CrossEntropyLoss()

Now, you should train CNNtanh with different values for k: your goal is to find the largest value for k such that the network achieves less than 20% error (either train or test) in 3 epochs. You should also choose an appropriate learning rate (but do not change the optimizer or the momentum settings!).

Note that CNNs can easily achieve under 2% test error on MNIST, but we're choosing 20% as a threshold since you will be training each network for only 3 epochs.

Remember to use values for k that are divisible by 3. When submitted, your notebook should have the training log of a network with two consecutive values for k (for example, 6 and 9) such that the network is 'trainable' with the smaller one but not 'trainable' with the larger one. It is fine for the training log to include runs with more than two values of k.

In [None]:
k_vals = [9, 12, 15]
lr = 0.15
for k in k_vals:
  print("Training Tanh CNN with {} layers".format(k))
  model = CNNtanh(k).cuda()
  optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
  train_errs, test_errs = train(3, model, criterion, optimizer, train_loader, test_loader)

Training Tanh CNN with 9 layers
Epoch 000/003, Train Error 89.14% || Test Error 88.65%
Epoch 001/003, Train Error 45.37% || Test Error 4.73%
Epoch 002/003, Train Error 3.73% || Test Error 2.84%
Training Tanh CNN with 12 layers
Epoch 000/003, Train Error 89.12% || Test Error 88.65%
Epoch 001/003, Train Error 89.20% || Test Error 88.65%
Epoch 002/003, Train Error 89.20% || Test Error 88.65%
Training Tanh CNN with 15 layers
Epoch 000/003, Train Error 89.28% || Test Error 89.72%
Epoch 001/003, Train Error 89.18% || Test Error 88.65%
Epoch 002/003, Train Error 89.00% || Test Error 88.65%


### Better Initialization

Next, we will change the initialization of the conv layers and see how it affects the trainability of deep networks. Instead of sampling from a Gaussian with a deviation of 0.05, you should sample from a Gaussian with a deviation $\sigma = \sqrt{\frac{1}{k^2 \cdot C_{out}}}$ or $\sigma = \sqrt{\frac{1}{k^2 \cdot C_{in}}}$, where $k$ is the kernel size ($k=3$ for 3x3 convolutions), $C_{in}$ is the number of input channels, and $C_{out}$ the number of output channels.

The model below should be exactly like CNNtanh except for the standard deviation of the normal distribution used to initialize the conv layers.

The paper 'Understanding the difficulty of training deep feedforward neural networks' by Glorot and Bengio provides some intuition behind such a choice for $\sigma$.

In [None]:
class CNNtanh_newinit(nn.Module):
    def __init__(self, k):
        super(CNNtanh_newinit, self).__init__()

        # write code here to instantiate layers
        # for example, self.conv = nn.Conv2d(1, 4, 3, 1, 1)
        # creates a conv layer with 1 input channel, 4 output
        # channels, a 3x3 kernel, and stride=padding=1
        self.k = k
        self.layperstage= k // 3
        # Initialize module list
        self.translayer = nn.ModuleList()
        self.layer = nn.ModuleList()
        input_c = 1
        for i in range(3):
          output_c = 4 * pow(2, i)
          self.translayer.append(nn.Conv2d(input_c, output_c, 3 , 1, 1))
          self.layer.append(nn.Conv2d(output_c, output_c, 3, 1, 1))
          input_c=output_c
        
        self.act=nn.Tanh()
        self.avgpool = nn.AvgPool2d(2,2)
        self.flatten = nn.Flatten()
        self.linear = nn.Linear(144, 10)
                                     
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                # insert code to compute sigma
                sigma = math.sqrt(1 / m.in_channels * 9)
                m.weight.data.normal_(0, sigma)
                m.bias.data.zero_()
        
    def forward(self, input):

        # write code here to define how the output u is computed
        # from the input and the model's layers
        # for example, u = self.conv(input) defines u
        # to be simply the output of self.conv given 'input'
        u = input
        for i in range(3):
          for j in range(self.layperstage):
            if j ==0:
              u = self.translayer[i](u)
            else:
              u = self.layer[i](u)
            u = self.act(u)
          u = self.avgpool(u)
        u = self.flatten(u)
        u = self.linear(u)
        return u

Repeat the procedure of finding the maximum number of layers such that the network is still trainable, this time with CNNtanhinit.

In [None]:
k_vals = [9,12,15]
lr = 1

for k in k_vals:
  print("\nTraining Tanh CNN + new init with {} layers".format(k))
  model = CNNtanh_newinit(k).cuda()
  optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
  train_errs, test_errs = train(3, model, criterion, optimizer, train_loader, test_loader)


Training Tanh CNN + new init with 9 layers
Epoch 000/003, Train Error 63.15% || Test Error 48.65%
Epoch 001/003, Train Error 46.58% || Test Error 40.10%
Epoch 002/003, Train Error 43.42% || Test Error 40.50%

Training Tanh CNN + new init with 12 layers
Epoch 000/003, Train Error 83.09% || Test Error 79.64%
Epoch 001/003, Train Error 76.16% || Test Error 72.25%
Epoch 002/003, Train Error 72.50% || Test Error 67.58%

Training Tanh CNN + new init with 15 layers
Epoch 000/003, Train Error 89.63% || Test Error 90.18%
Epoch 001/003, Train Error 89.81% || Test Error 90.42%
Epoch 002/003, Train Error 90.07% || Test Error 90.18%


### CNN with ELU activations

In this section you should replace the Tanh activations of the previous network for Exponential Linear Units (ELUs). Complete CNNelu below, which should be exactly like CNNtanhinit except for ELU activations instead of Tanh (ELUs are readily available in PyTorch, check its documentation for more details).

In [None]:
class CNNelu(nn.Module):
    def __init__(self, k):
        super(CNNelu, self).__init__()

        # write code here to instantiate layers
        # for example, self.conv = nn.Conv2d(1, 4, 3, 1, 1)
        # creates a conv layer with 1 input channel, 4 output
        # channels, a 3x3 kernel, and stride=padding=1
        self.k = k
        self.layperstage= k // 3
        # Initialize module list
        self.translayer = nn.ModuleList()
        self.layer = nn.ModuleList()
        input_c = 1
        for i in range(3):
          output_c = 4 * pow(2, i)
          self.translayer.append(nn.Conv2d(input_c, output_c, 3 , 1, 1))
          self.layer.append(nn.Conv2d(output_c, output_c, 3, 1, 1))
          input_c=output_c
        
        self.act=nn.ELU()
        self.avgpool = nn.AvgPool2d(2,2)
        self.flatten = nn.Flatten()
        self.linear = nn.Linear(144, 10)
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                # insert code to compute sigma
                sigma = math.sqrt(1 / m.in_channels * 9)
                m.weight.data.normal_(0, sigma)
                m.bias.data.zero_()
        
    def forward(self, input):
        u = input
        for i in range(3):
          for j in range(self.layperstage):
            if j ==0:
              u = self.translayer[i](u)
            else:
              u = self.layer[i](u)
            u = self.act(u)
          u = self.avgpool(u)
        u = self.flatten(u)
        u = self.linear(u)
        # write code here to define how the output u is computed
        # from the input and the model's layers
        # for example, u = self.conv(input) defines u
        # to be simply the output of self.conv given 'input'

        return u

Repeat the procedure of finding the maximum number of layers such that the network is still trainable, this time with CNNelu.

In [None]:
k_vals = [3,6,9]
lr = .0001

for k in k_vals:
  print("\nTraining ELU CNN, with {} layers".format(k))
  model = CNNelu(k).cuda()
  optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
  train_errs, test_errs = train(3, model, criterion, optimizer, train_loader, test_loader)


Training ELU CNN, with 3 layers
Epoch 000/003, Train Error 29.61% || Test Error 16.40%
Epoch 001/003, Train Error 14.94% || Test Error 13.35%
Epoch 002/003, Train Error 12.91% || Test Error 12.00%

Training ELU CNN, with 6 layers
Epoch 000/003, Train Error 90.25% || Test Error 90.18%
Epoch 001/003, Train Error 90.26% || Test Error 90.18%
Epoch 002/003, Train Error 90.26% || Test Error 90.18%

Training ELU CNN, with 9 layers
Epoch 000/003, Train Error 90.15% || Test Error 90.20%
Epoch 001/003, Train Error 90.13% || Test Error 90.20%
Epoch 002/003, Train Error 90.13% || Test Error 90.20%


### CNN with Batch Normalization

Next, you will check how batch normalization can make deep networks easier to train. Implement the network below, which should be exactly like CNNelu except for additional BatchNorm2d layers after each convolution (before the ELU activation).

Note that BatchNorm2d modules require the number of channels as argument -- see the PyTorch documentation for more details.

In [None]:
class CNNeluBN(nn.Module):
    def __init__(self, k):
        super(CNNeluBN, self).__init__()

        # write code here to instantiate layers
        # for example, self.conv = nn.Conv2d(1, 4, 3, 1, 1)
        # creates a conv layer with 1 input channel, 4 output
        # channels, a 3x3 kernel, and stride=padding=1
        self.k = k
        self.layperstage= k // 3
        # Initialize module list
        self.translayer = nn.ModuleList()
        self.layer = nn.ModuleList()
        self.batchnorm = nn.ModuleList()
        input_c = 1
        for i in range(3):
          output_c = 4 * pow(2, i)
          self.translayer.append(nn.Conv2d(input_c, output_c, 3 , 1, 1))
          self.layer.append(nn.Conv2d(output_c, output_c, 3, 1, 1))
          self.batchnorm.append(nn.BatchNorm2d(output_c))
          input_c=output_c
        
        self.act=nn.ELU()
        self.avgpool = nn.AvgPool2d(2,2)
        self.flatten = nn.Flatten()
        self.linear = nn.Linear(144, 10)
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                # insert code to compute sigma
                sigma = math.sqrt(1 / m.in_channels * 9)
                m.weight.data.normal_(0, sigma)
                m.bias.data.zero_()
        
    def forward(self, input):
        
        # write code here to define how the output u is computed
        # from the input and the model's layers
        # for example, u = self.conv(input) defines u
        # to be simply the output of self.conv given 'input'
        u = input
        for i in range(3):
          for j in range(self.layperstage):
            if j ==0:
              u = self.translayer[i](u)
            else:
              u = self.layer[i](u)
            u = self.batchnorm[i](u)
            u = self.act(u)
          u = self.avgpool(u)
        u = self.flatten(u)
        u = self.linear(u)
        return u

Repeat the procedure of finding the maximum number of layers such that the network is still trainable, this time with CNNeluBN.

In [None]:
#I first tried all values div by 3 up to 60
k_vals = [66, 63]
lr = .01

for k in k_vals:
  print("\nTraining ELU CNN + BN with {} layers".format(k))
  model = CNNeluBN(k).cuda()
  optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
  train_errs, test_errs = train(3, model, criterion, optimizer, train_loader, test_loader)


Training ELU CNN + BN with 66 layers
Epoch 000/003, Train Error 76.88% || Test Error 87.01%
Epoch 001/003, Train Error 58.42% || Test Error 41.43%
Epoch 002/003, Train Error 38.23% || Test Error 31.14%

Training ELU CNN + BN with 63 layers
Epoch 000/003, Train Error 73.54% || Test Error 85.37%
Epoch 001/003, Train Error 45.25% || Test Error 32.91%
Epoch 002/003, Train Error 25.95% || Test Error 23.06%


### Residual Networks

Finally, you experiment adding residual connections to a CNN.

To implement the model below, you should add a 'skip connection' to 'Conv->BatchNorm->ELU' blocks whenever the shape of the block's input and output are the same: this will be the case for every such block except for the first ones in each stage, as they double the number of channels.

More specifically, you should change $u = ELU(BatchNorm(Conv(x)))$ to $u = ELU(BatchNorm(Conv(x))) + x$, where $x$ and $u$ denote the block's input and output, respectively.

You should take your CNNeluBN implementation and add skip-connections as described above.

Note that there are key differences between the resulting model and the actual ResNet proposed by He et al. in 'Deep Residual Learning for Image Recognition', for example the use of ELU activations instead of ReLU and the exact position of skip-connections.

In [None]:
class ResNet(nn.Module):
    def __init__(self, k):
        super(ResNet, self).__init__()

        # write code here to instantiate layers
        # for example, self.conv = nn.Conv2d(1, 4, 3, 1, 1)
        # creates a conv layer with 1 input channel, 4 output
        # channels, a 3x3 kernel, and stride=padding=1
        self.k = k
        self.layperstage= k // 3
        # Initialize module list
        self.translayer = nn.ModuleList()
        self.layer = nn.ModuleList()
        self.batchnorm = nn.ModuleList()
        input_c = 1
        for i in range(3):
          output_c = 4 * pow(2, i)
          self.translayer.append(nn.Conv2d(input_c, output_c, 3 , 1, 1))
          self.layer.append(nn.Conv2d(output_c, output_c, 3, 1, 1))
          self.batchnorm.append(nn.BatchNorm2d(output_c))
          input_c=output_c
        
        self.act=nn.ELU()
        self.avgpool = nn.AvgPool2d(2,2)
        self.flatten = nn.Flatten()
        self.linear = nn.Linear(144, 10)
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                # insert code to compute sigma
                sigma = math.sqrt(1 / m.in_channels * 9)
                m.weight.data.normal_(0, sigma)
                m.bias.data.zero_()

        
    def forward(self, input):
        
        # write code here to define how the output u is computed
        # from the input and the model's layers
        # for example, u = self.conv(input) defines u
        # to be simply the output of self.conv given 'input'
        u = input
        for i in range(3):
          for j in range(self.layperstage):
            x = u
            if j ==0:
              u = self.translayer[i](u)
              u = self.batchnorm[i](u)
              u = self.act(u)
            else:
              u = self.layer[i](u)
              u = self.batchnorm[i](u)
              u = self.act(u) + x
          u = self.avgpool(u)
        u = self.flatten(u)
        u = self.linear(u)
        return u

Repeat the procedure of finding the maximum number of layers such that the network is still trainable, this time with the 'ResNet' model.

In [None]:
k_vals = [30, 27, 24]
lr = .01

for k in k_vals:
  print("\nTraining ResNet with {} layers".format(k))
  model = ResNet(k).cuda()
  optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
  train_errs, test_errs = train(3, model, criterion, optimizer, train_loader, test_loader)


Training ResNet with 30 layers
Epoch 000/003, Train Error 13.42% || Test Error 89.90%
Epoch 001/003, Train Error 90.01% || Test Error 90.20%
Epoch 002/003, Train Error 90.13% || Test Error 90.20%

Training ResNet with 27 layers
Epoch 000/003, Train Error 11.86% || Test Error 88.65%
Epoch 001/003, Train Error 89.38% || Test Error 89.68%
Epoch 002/003, Train Error 70.90% || Test Error 13.94%

Training ResNet with 24 layers
Epoch 000/003, Train Error 12.40% || Test Error 88.26%
Epoch 001/003, Train Error 9.97% || Test Error 4.55%
Epoch 002/003, Train Error 4.62% || Test Error 3.98%


**<font color='blue'>
    Summarize your results and observations regarding the experiments above. What was the maximum number of layers for each of the five models such that training remained successful? Briefly discuss why you think each modification helped/harmed the trainability of deep models.
</font>**

**<font color='red'> --------------------------------------------------------------------- ANSWER (BEGIN) ---------------------------------------------------------------------
</font>**

1. CNNTanh: 
The maximum number of layers that remained successful was 9 layers. This was the control for the experiment. 

2. CNN-gaussian init: I also could train it for 9 layers, and the initialization doesn't seem to have made much of a difference.
3. CNNelu: I could only train it for 3 layers, I think that the presence of large negative values may have caused saturation.
4. CNN-batch normalization: I was able to train the network for 63 layers, the increase in number of layers is because the batch normalization at every layer ensures that different features are at the same scale for every layer, which gives more meaningful gradients. 
5. ResNet: The max number of layers I trained it for was 27. The residual connection reduced the number of layers possible compared to 4. I think the reduced sucess is due to the fact that the residual connection passes an input that is not batch normalized with the rest of the features, which may cause that connection to have undue influence on that layer.

**<font color='red'> ---------------------------------------------------------------------- ANSWER (END) ----------------------------------------------------------------------
</font>**

### Interactions: Batch Norm and Initialization

Intuitively, batch norm should make the model more robust to changes in the magnitude of the network's weights: informally, scaling up all the elements of a conv layer's filters by a factor of 10 would not affect the network's output as long as there is a batch norm layer following such convolution, as the normalization would undo the scaling.

To check how this intuition translates to practical settings, you should change the original 'CNNtanh' model so that it incorporates batch norm layers (like you have done when modifying 'CNNelu' into 'CNNeluBN').

The model below should adopt the naive initialization procedure of sampling from a Gaussian with a deviation of 0.05, not the more sophisticated one that you implemented previously

In [None]:
class CNNtanhBN_oldinit(nn.Module):
    def __init__(self, k):
        super(CNNtanhBN_oldinit, self).__init__()

        # write code here to instantiate layers
        # for example, self.conv = nn.Conv2d(1, 4, 3, 1, 1)
        # creates a conv layer with 1 input channel, 4 output
        # channels, a 3x3 kernel, and stride=padding=1
        self.k = k
        self.layperstage= k // 3
        # Initialize module list
        self.translayer = nn.ModuleList()
        self.layer = nn.ModuleList()
        self.batchnorm = nn.ModuleList()
        input_c = 1
        for i in range(3):
          output_c = 4 * pow(2, i)
          self.translayer.append(nn.Conv2d(input_c, output_c, 3 , 1, 1))
          self.layer.append(nn.Conv2d(output_c, output_c, 3, 1, 1))
          self.batchnorm.append(nn.BatchNorm2d(output_c))
          input_c=output_c
        
        self.act=nn.Tanh()
        self.avgpool = nn.AvgPool2d(2,2)
        self.flatten = nn.Flatten()
        self.linear = nn.Linear(144, 10)
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                m.weight.data.normal_(0, 0.05)
                m.bias.data.zero_()
        
    def forward(self, input):

        # write code here to define how the output u is computed
        # from the input and the model's layers
        # for example, u = self.conv(input) defines u
        # to be simply the output of self.conv given 'input'
        u = input
        for i in range(3):
          for j in range(self.layperstage):
            if j ==0:
              u = self.translayer[i](u)
            else:
              u = self.layer[i](u)
            u = self.batchnorm[i](u)
            u = self.act(u)
          u = self.avgpool(u)
        u = self.flatten(u)
        u = self.linear(u)
        return u

Repeat the procedure of finding the maximum number of layers such that the network is still trainable, this time with CNNtanhBN_oldinit.

In [None]:
k_vals = [54, 51, 48]
lr = 0.01

for k in k_vals:
  print("\nTraining Tanh CNN + BN + naive init with {} layers".format(k))
  model = CNNtanhBN_oldinit(k).cuda()
  optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
  train_errs, test_errs = train(3, model, criterion, optimizer, train_loader, test_loader)


Training Tanh CNN + BN + naive init with 54 layers
Epoch 000/003, Train Error 65.35% || Test Error 72.55%
Epoch 001/003, Train Error 38.38% || Test Error 23.91%
Epoch 002/003, Train Error 21.75% || Test Error 20.52%

Training Tanh CNN + BN + naive init with 51 layers
Epoch 000/003, Train Error 47.47% || Test Error 87.09%
Epoch 001/003, Train Error 89.01% || Test Error 88.65%
Epoch 002/003, Train Error 89.37% || Test Error 88.65%

Training Tanh CNN + BN + naive init with 48 layers
Epoch 000/003, Train Error 49.13% || Test Error 41.64%
Epoch 001/003, Train Error 23.17% || Test Error 14.26%
Epoch 002/003, Train Error 15.33% || Test Error 14.14%


**<font color='blue'>
    Compare CNNtanh (model with naive initialization and no batch norm), CNNtanh_newinit (model with better initialization and no batch norm), and CNNtanhBN_oldinit (model with naive initialization and batch norm), in terms of how deep each could be while being trainable, and discuss your thoughts one how batch norm interacts with the way parameters are initialized.
</font>**

**<font color='red'> --------------------------------------------------------------------- ANSWER (BEGIN) ---------------------------------------------------------------------
</font>**

The first two are able to train about 9 layers deep, while the batch normalized version is trainable up to around 48 layer deep. Batch normalization normalizes the parameters at each input so the training becomes less dependent on the intialization.

**<font color='red'> ---------------------------------------------------------------------- ANSWER (END) ----------------------------------------------------------------------
</font>**

### Interactions: Batch Norm and Residual Connections

Lastly, implement and train a CNN with residual connections but without batch normalization layers -- the goal here is to check how residuals interact with normalization.

The model below should be exactly like ResNet, except that it should not have batch norm layers.

In [None]:
class ResNet_noBN(nn.Module):
    def __init__(self, k):
        super(ResNet_noBN, self).__init__()

        # write code here to instantiate layers
        # for example, self.conv = nn.Conv2d(1, 4, 3, 1, 1)
        # creates a conv layer with 1 input channel, 4 output
        # channels, a 3x3 kernel, and stride=padding=1
        self.k = k
        self.layperstage= k // 3
        # Initialize module list
        self.translayer = nn.ModuleList()
        self.layer = nn.ModuleList()
        input_c = 1
        for i in range(3):
          output_c = 4 * pow(2, i)
          self.translayer.append(nn.Conv2d(input_c, output_c, 3 , 1, 1))
          self.layer.append(nn.Conv2d(output_c, output_c, 3, 1, 1))
          input_c=output_c
        
        self.act=nn.ELU()
        self.avgpool = nn.AvgPool2d(2,2)
        self.flatten = nn.Flatten()
        self.linear = nn.Linear(144, 10)
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                # insert code to compute sigma
                sigma = math.sqrt(1 / m.in_channels * 9)
                m.weight.data.normal_(0, sigma)
                m.bias.data.zero_()
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                # insert code to compute sigma
                sigma = math.sqrt(1 / m.in_channels * 9)
                m.weight.data.normal_(0, sigma)
                m.bias.data.zero_()
        
    def forward(self, input):

        # write code here to define how the output u is computed
        # from the input and the model's layers
        # for example, u = self.conv(input) defines u
        # to be simply the output of self.conv given 'input'
        u = input
        for i in range(3):
          for j in range(self.layperstage):
            x = u
            if j ==0:
              u = self.translayer[i](u)
              u = self.act(u)
            else:
              u = self.layer[i](u)
              u = self.act(u) + x
          u = self.avgpool(u)
        u = self.flatten(u)
        u = self.linear(u)
        return u

Repeat the procedure of finding the maximum number of layers such that the network is still trainable, this time with ResNet_noBN.

In [None]:
k_vals= [6, 3]
lr = 0.01

for k in k_vals:
  print("\nTraining ResNet w/o BN with {} layers".format(k))
  model = ResNet_noBN(k).cuda()
  optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
  train_errs, test_errs = train(3, model, criterion, optimizer, train_loader, test_loader)


Training ResNet w/o BN with 6 layers
Epoch 000/003, Train Error 90.15% || Test Error 90.20%
Epoch 001/003, Train Error 90.13% || Test Error 90.20%
Epoch 002/003, Train Error 90.13% || Test Error 90.20%

Training ResNet w/o BN with 3 layers
Epoch 000/003, Train Error 40.14% || Test Error 13.52%
Epoch 001/003, Train Error 11.90% || Test Error 10.57%
Epoch 002/003, Train Error 8.91% || Test Error 6.99%


**<font color='blue'>
    Compare ResNet and ResNet_noBN in terms of how deep each could be while being trainable, and discuss your thoughts one how batch norm interacts with residual connections.
</font>**

**<font color='red'> --------------------------------------------------------------------- ANSWER (BEGIN) ---------------------------------------------------------------------
</font>**

Resnet with no Batch normalization made the network untrainable for more than 3 layers, where as adding the batch normalization enables training for up to ~27 layers. Batch normalization allows resnets to be trainable for deeper networks because it prevents internal covariate shift which can cause it to be hard to train


**<font color='red'> ---------------------------------------------------------------------- ANSWER (END) ----------------------------------------------------------------------
</font>**

### (Optional) Multiple Loss Heads

In this optional section, your goal is to incorporate the idea of having multiple loss heads throughout the network, distributed across its depth.

For the CNNelu_multihead model below, you should take the CNNelu model that you implemented previously and add two additional classification heads, connected to the outputs of stages 1 and 2.

More specifically, the outputs of stages 1 and 2, with shapes 4x14x14 and 8x7x7, should be connected to new fully-connected layers that map them to a 10-dimensional vector (logits for the 10 MNIST classes). The network should output three logit vectors (the original one at the end of the network plus the two new ones) instead of just one, and the loss should be computed as the average of the cross entropies between the true target and each of the three predictions.

Note that you will likely have to change the implementation of train_epoch() and test() to accomodate the fact that this model will output three logit vectors instead of one.

In [None]:
class CNNelu_multihead(nn.Module):
    def __init__(self, k):
        super(CNNelu_multihead, self).__init__()

        # write code here to instantiate layers
        # for example, self.conv = nn.Conv2d(1, 4, 3, 1, 1)
        # creates a conv layer with 1 input channel, 4 output
        # channels, a 3x3 kernel, and stride=padding=1

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                # insert code to compute sigma
                sigma = 
                m.weight.data.normal_(0, sigma)
                m.bias.data.zero_()
        
    def forward(self, input):
        # write code here to define how the output u is computed
        # from the input and the model's layers
        # for example, u = self.conv(input) defines u
        # to be simply the output of self.conv given 'input'

        return u1, u2, u3

Repeat the procedure of finding the maximum number of layers such that the network is still trainable, this time with CNNelu_multihead.

In [None]:
k = 
lr = 

print("\nTraining ELU CNN + multiloss with {} layers".format(k))
model = CNNelu_multihead(k).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
train_errs, test_errs = train(3, model, criterion, optimizer, train_loader, test_loader)

**<font color='blue'>
    Did the adoption of multiple loss heads help train deeper models? How did it compare to the adoption of batch normalization, in terms of how deeper each of the two approaches enabled the network to be while staying trainable?
</font>**

**<font color='red'> --------------------------------------------------------------------- ANSWER (BEGIN) ---------------------------------------------------------------------
</font>**

**<font color='red'> ---------------------------------------------------------------------- ANSWER (END) ----------------------------------------------------------------------
</font>**