# CIFAR-10 CNN Experiments

**|| Jonty Sinai ||** 11-04-2019

So far we've managed to train a LeNet-5 CNN to get 98.68% test accuracy on MNIST. This is a good result, which is to be expected because LeNet-5's architecture was designed specifically to do well at the MNIST task. It's difficult to say exactly why this architecture works well on the MNIST dataset, only that experimental design most likely converged in that direction, particularly under the computing power constraints of 1998.

However, when we applied LeNet-5 to CIFAR-10 the results were not great: a mediocre 51.64% test accuracy, only slightly better than the [feedforward neural network](neural_network.ipynb) trained earlier.

In this notebook we will explore other CNN architectures to push the limits of how well we can do on CIFAR-10 under the following constraints:

* only three epochs of training allowed,
* on a CPU,
* in less than ~10 minutes.

This notebook is purely experimental and objects may be overwritten, but as long as the notebook is followed sequentially, things should compute. This is a learning experience for me and I'll try my best to reason through my intuition for each design choice taken in each experiment.

Later I'll implement an experiment manager to take advantage of the configurable CNN architecture. That will allow for an automatic hyperparameter search under the constraints.

<a id="toc"></a>

### List of Experiments:

1. [Add ReLU and MaxPool](#exp1)
2. [Increase Hidden Channel Sizes](#exp2)
3. [Reduce Convolution Parameters](#exp3)
4. [Increase Fully Connected Parameters](#exp4)
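5. [Increase Number of Convolutional Layers](#exp5)
6. [More Convolutional Layers with More Fully Connected Parameters](#exp6)
7. [Even More Fully Connected Parameters](#exp7)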

In [59]:
%matplotlib inline

import os
import re
import random

from collections import OrderedDict

import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import torchvision
from torchvision import transforms

import matplotlib.pyplot as plt


HOME = os.environ['AI_HOME']
ROOT = os.path.join(HOME, 'artificial_neural_networks')
DATA = os.path.join(ROOT, 'data')
MNIST = os.path.join(DATA, 'mnist')
CIFAR10 = os.path.join(DATA, 'cifar10')

random.seed(1901)
np.random.seed(1901)
torch.manual_seed(1901)

<torch._C.Generator at 0x11dd348f0>

## CNN Module

Once again we'll initialise the CNN module, which takes an OrderedDict describing the architecture.

In [60]:
class Flatten(nn.Module):
    # ref: https://discuss.pytorch.org/t/flatten-layer-of-pytorch-build-by-sequential-container/5983/3
    def forward(self, x):
        x = x.view(x.size()[0], -1)
        return x
    

In [61]:
class ConvNet(nn.Module):
    
    def __init__(self, arch_dict: OrderedDict):
        """
        Args:
            arch_dict (OrderedDict) : Specifies the CNN architecture where
                key, value pairs correspond to layer_type, layer_params.
                Layer parameters are specified as a tuple of integers or
                they can be None.
                
                The supported layer types with their parameters are:
                
                    Conv2d : (in_channels, out_channels, kernel_size, stride, padding)
                    AvgPool2d : (kernel_size, stride, padding)
                    MaxPool2d : (kernel_size, stride, padding)
                    Flatten : None
                    Linear : (input_size, output_size)
                    ReLU : None
                    Sigmoid : None
                    Tanh : None
                    
                If a layer type is used more than once, then each key should be
                post-fixed with an underscore followed by an alphanumeric
                index.
                
                Eg: "Conv2d_1", "Conv2d_2", "Tanh_1a", "Tanh_1b"
                    
        """
        super().__init__()
        
        self.layers = nn.ModuleList()
        
        # make sure arch_dict is an OrderedDict
        # for activation layers, use None for layer_params
        for layer_type, layer_params in arch_dict.items():
            
            layer_type = re.sub(r"_[\d\w]+", "", layer_type) # remove number/letter post-fixing of layer types
            
            if layer_type == "Conv2d":
                in_channels, out_channels, kernel_size, stride, padding = layer_params
                self.layers.append(
                    nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding))
            elif layer_type == "AvgPool2d":
                kernel_size, stride, padding = layer_params
                self.layers.append(
                    nn.AvgPool2d(kernel_size, stride, padding))
            elif layer_type == "MaxPool2d":
                kernel_size, stride, padding = layer_params
                self.layers.append(
                    nn.MaxPool2d(kernel_size, stride, padding))
            elif layer_type == "Flatten":
                self.layers.append(
                    Flatten())
            elif layer_type == "Linear":
                input_size, output_size = layer_params
                self.layers.append(
                    nn.Linear(input_size, output_size))
            elif layer_type == "ReLU":
                self.layers.append(
                    nn.ReLU())
            elif layer_type == "Sigmoid":
                self.layers.append(
                    nn.Sigmoid())
            elif layer_type == "Tanh":
                self.layers.append(
                    nn.Tanh())
            else:
                raise ValueError(f"Unsupported layer type: {layer_type}")
                
    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x
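
The key handling in `ConvNet.__init__` can be checked in isolation. Dictionary keys must be unique, which is why repeated layer types carry a suffix, and the constructor strips that suffix with a regular expression before dispatching. The following is a minimal sketch of that one step; `base_layer_type` is a hypothetical helper name used only for illustration and is not part of the notebook.

```python
import re


def base_layer_type(key: str) -> str:
    """Strip the "_<index>" suffix from an architecture key.

    Uses the same pattern as ConvNet.__init__; the function name itself
    is hypothetical and exists only for this illustration.
    """
    return re.sub(r"_[\d\w]+", "", key)


# Keys with a suffix map back to the plain layer type; keys without one are unchanged.
for key in ["Conv2d_1", "MaxPool2d_2", "Tanh_1a", "Flatten"]:
    print(key, "->", base_layer_type(key))
```

Because only the suffix is removed, a misspelt base name is passed through unchanged and ends up in the `ValueError` branch of the constructor.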


## Train and Evaluate Functions

In [62]:
def train(model, training_data, optimiser, loss_function, num_epochs):
    
    for epoch in range(num_epochs):
        print(f"Epoch: {epoch + 1} " + "="*80 + ">")
        
        total_loss = 0.0
        for batch_idx, batch in enumerate(training_data):
            images, labels = batch
            
            # zero accumulated gradients
            optimiser.zero_grad()
            
            # forward pass
            output = model(images)
            # backward pass
            loss = loss_function(output, labels)
            loss.backward()
            optimiser.step()
            
            total_loss += loss.item()
            # print progress
            
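            if (batch_idx + 1) % 1000 == 0:    # print every 1000 mini-batches
                # NOTE: the "6000" in the message below is hard-coded; with batch_size=10
                # the 50,000 CIFAR-10 training images give 5000 mini-batches per epoch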
                print("[%4d/6000] loss: %.3f" %
                      (batch_idx + 1, total_loss / 1000))
                total_loss = 0.0
                
    print("Finished Training " + "="*71 + ">")

In [63]:
def evaluate(model, test_data):
    correct = 0
    total = 0
    with torch.no_grad():
        for data in test_data:
            images, truth = data
            output = model(images)
            _, predicted = torch.max(output.data, 1)
            total += truth.size(0)
            correct += (predicted == truth).sum().item()

    print('Test accuracy on %d test images: %.4f %%' % (total, 100 * correct / total))

## CIFAR-10 Training and Test Data

In [64]:
cifar10_transforms = transforms.Compose([
                    transforms.ToTensor(),
                    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]  # note that we normalise by rank-1 tensors
                )
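
To make the effect of the normalisation concrete: `ToTensor` scales pixel values to the range [0, 1], and `Normalize` then computes `(x - mean) / std` per channel. With a mean and standard deviation of 0.5, that maps the inputs to the range [-1, 1]. The sketch below reproduces the arithmetic in plain Python for a single value; `normalise` is a hypothetical stand-in and not a torchvision function.

```python
def normalise(value: float, mean: float = 0.5, std: float = 0.5) -> float:
    """Apply the per-channel formula (x - mean) / std to one pixel value.

    Hypothetical helper for illustration; the notebook itself uses
    transforms.Normalize on whole image tensors.
    """
    return (value - mean) / std


# The darkest, middle and brightest possible pixel values after ToTensor.
for pixel in (0.0, 0.5, 1.0):
    print(pixel, "->", normalise(pixel))
```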

In [65]:
# training set
cifar10_trainset = torchvision.datasets.CIFAR10(root=CIFAR10, train=True, download=True, transform=cifar10_transforms)
cifar10_trainloader = torch.utils.data.DataLoader(cifar10_trainset, batch_size=10, shuffle=True, num_workers=0)

Files already downloaded and verified


In [66]:
# test set
cifar10_testset = torchvision.datasets.CIFAR10(root=CIFAR10, train=False, download=True, transform=cifar10_transforms)
cifar10_testloader = torch.utils.data.DataLoader(cifar10_testset, batch_size=10, shuffle=False, num_workers=0)

Files already downloaded and verified


In [67]:
cifar10_classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

In [68]:
(8 - 2) / 2 + 1

4.0
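
The scratch calculation above is an instance of the usual output-size formula for convolution and pooling layers, `floor((n + 2p - k) / s) + 1`, where `n` is the input size, `k` the kernel size, `s` the stride and `p` the padding. A small helper makes it easier to check the shape comments in the experiments that follow. This is a sketch assuming square inputs, no dilation, and floor rounding (the `ceil_mode=False` shown in the model printouts); `conv_out_size` is a hypothetical name.

```python
def conv_out_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a conv or pooling layer along one dimension.

    Assumes no dilation and floor rounding. Hypothetical helper, not part
    of the notebook's ConvNet module.
    """
    return (size + 2 * padding - kernel) // stride + 1


print(conv_out_size(8, kernel=2, stride=2))               # the scratch cell above
print(conv_out_size(32, kernel=3, stride=1, padding=1))   # "same" padding keeps 32
print(conv_out_size(32, kernel=3, stride=2, padding=0))   # stride 2 roughly halves it
```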

<a id="exp1"></a>
## Experiment 1: ReLU and MaxPool

- **Last Experiment:** LeNet-5
- **Last Test Accuracy:** 51.64
- **This Test Accuracy:** 63.39

The first change we'll make to LeNet-5 is to replace AvgPool with MaxPool and Tanh with ReLU, both of which are known to perform better than what they replace. This increases test accuracy by almost 12 percentage points, from 51.64% to 63.39%.

In [69]:
cnn_arch = OrderedDict()

# 1: Convolutional Layer
cnn_arch["Conv2d_1"] = (3, 10, 3, 1, 1)  # Conv layer: 32x32x3 -> 32x32x10, kernel=3x3, stride=1, padding=1
cnn_arch["ReLU_1a"] = None  # followed by relu nonlinear activation
cnn_arch["MaxPool2d_1"] = (2, 2, 0)  # followed by 2x2 MaxPool, stride = 2, padding=0
cnn_arch["ReLU_1b"] = None  # followed by relu nonlinear activation
# 2: Convolutional Layer
cnn_arch["Conv2d_2"] = (10, 16, 5, 1, 0)  # Conv layer: 16x16x10 -> 12x12x16, kernel=5x5, stride=1, padding=0
cnn_arch["ReLU_2a"] = None  # followed by relu nonlinear activation
cnn_arch["MaxPool2d_2"] = (2, 2, 0)  # followed by 2x2 MaxPool, stride = 2, padding=0
cnn_arch["ReLU_2b"] = None  # followed by relu nonlinear activation
# 3: Flatten
cnn_arch["Flatten_3"] = None # flatten 6x6x16 -> 576
# 4: Fully Connected Layer
cnn_arch["Linear_4"] = (576, 120) # FC layer: 576 input units -> 120 output units
cnn_arch["ReLU_4"] = None  # followed by relu nonlinear activation
# 5: Fully Connected Layer
cnn_arch["Linear_5"] = (120, 84) # FC layer: 120 input units -> 84 output units
cnn_arch["ReLU_5"] = None  # followed by relu nonlinear activation
# 6: Fully Connected Output Layer
cnn_arch["Linear_6"] = (84, 10) # FC layer: 84 input units -> 10 output units

cnn_arch

OrderedDict([('Conv2d_1', (3, 10, 3, 1, 1)),
             ('ReLU_1a', None),
             ('MaxPool2d_1', (2, 2, 0)),
             ('ReLU_1b', None),
             ('Conv2d_2', (10, 16, 5, 1, 0)),
             ('ReLU_2a', None),
             ('MaxPool2d_2', (2, 2, 0)),
             ('ReLU_2b', None),
             ('Flatten_3', None),
             ('Linear_4', (576, 120)),
             ('ReLU_4', None),
             ('Linear_5', (120, 84)),
             ('ReLU_5', None),
             ('Linear_6', (84, 10))])

In [70]:
cnn = ConvNet(cnn_arch)

print(cnn)

ConvNet(
  (layers): ModuleList(
    (0): Conv2d(3, 10, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): ReLU()
    (4): Conv2d(10, 16, kernel_size=(5, 5), stride=(1, 1))
    (5): ReLU()
    (6): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (7): ReLU()
    (8): Flatten()
    (9): Linear(in_features=576, out_features=120, bias=True)
    (10): ReLU()
    (11): Linear(in_features=120, out_features=84, bias=True)
    (12): ReLU()
    (13): Linear(in_features=84, out_features=10, bias=True)
  )
)


In [71]:
x = torch.randn(10, 3, 32, 32)

y = cnn(x)
print(y.size())

torch.Size([10, 10])


In [72]:
cross_entropy = nn.CrossEntropyLoss()
adam = optim.Adam(cnn.parameters(), lr=0.001)

In [73]:
train(cnn, cifar10_trainloader, adam, cross_entropy, num_epochs=3)

[1000/6000] loss: 1.824
[2000/6000] loss: 1.545
[3000/6000] loss: 1.443
[4000/6000] loss: 1.378
[5000/6000] loss: 1.338
[1000/6000] loss: 1.249
[2000/6000] loss: 1.209
[3000/6000] loss: 1.193
[4000/6000] loss: 1.160
[5000/6000] loss: 1.139
[1000/6000] loss: 1.083
[2000/6000] loss: 1.080
[3000/6000] loss: 1.071
[4000/6000] loss: 1.046
[5000/6000] loss: 1.036


In [74]:
evaluate(cnn, cifar10_testloader)

Test accuracy on 10000 test images: 63.3900 %


[back to top](#toc)

<a id="exp2"></a>
## Experiment 2: Increase Hidden Channel Sizes

- **Last Experiment:** ReLU, MaxPool
- **Last Test Accuracy:** 63.39
- **This Test Accuracy:** 63.81

The next set of changes is inspired by AlexNet, which achieved state of the art on ImageNet (a significantly harder task than CIFAR-10), shown below:

<img src="./assets/alex_net.png" width="800">

source: [Andrew Ng, Coursera](https://www.coursera.org/learn/convolutional-neural-networks/home/welcome)

We won't be able to train a neural network as deep or with as many parameters as AlexNet, but we can try to come up with a similar architecture on a smaller scale. Some design patterns to note:

- AlexNet quickly reduces the resolution of the input images but widens the number of channels.
- The number of channels is increased while the resolution is decreased progressively through the network.
- Convolution kernel sizes start off larger, then become smaller.
- "Same" padding is used between successive convolutional layers.

We'll implement some of these ideas progressively through this notebook, starting with larger hidden channel sizes.

In [75]:
cnn_arch = OrderedDict()

# 1: Convolutional Layer
cnn_arch["Conv2d_1"] = (3, 18, 3, 1, 0)  # Conv layer: 32x32x3 -> 30x30x18, kernel=3x3, stride=1, padding=0
cnn_arch["ReLU_1a"] = None  # followed by relu nonlinear activation
cnn_arch["MaxPool2d_1"] = (2, 2, 0)  # followed by 2x2 MaxPool, stride = 2, padding=0
cnn_arch["ReLU_1b"] = None  # followed by relu nonlinear activation
# 2: Convolutional Layer
cnn_arch["Conv2d_2"] = (18, 32, 5, 1, 0)  # Conv layer: 15x15x18 -> 11x11x32, kernel=5x5, stride=1, padding=0
cnn_arch["ReLU_2a"] = None  # followed by relu nonlinear activation
cnn_arch["MaxPool2d_2"] = (2, 2, 0)  # followed by 2x2 MaxPool, stride = 2, padding=0
cnn_arch["ReLU_2b"] = None  # followed by relu nonlinear activation
# 3: Flatten
cnn_arch["Flatten_3"] = None # flatten 5x5x32 -> 800
# 4: Fully Connected Layer
cnn_arch["Linear_4"] = (800, 256) # FC layer: 800 input units -> 256 output units
cnn_arch["ReLU_4"] = None  # followed by relu nonlinear activation
# 5: Fully Connected Layer
cnn_arch["Linear_5"] = (256, 128) # FC layer: 256 input units -> 128 output units
cnn_arch["ReLU_5"] = None  # followed by relu nonlinear activation
# 6: Fully Connected Output Layer
cnn_arch["Linear_6"] = (128, 10) # FC layer: 128 input units -> 10 output units

cnn_arch

OrderedDict([('Conv2d_1', (3, 18, 3, 1, 0)),
             ('ReLU_1a', None),
             ('MaxPool2d_1', (2, 2, 0)),
             ('ReLU_1b', None),
             ('Conv2d_2', (18, 32, 5, 1, 0)),
             ('ReLU_2a', None),
             ('MaxPool2d_2', (2, 2, 0)),
             ('ReLU_2b', None),
             ('Flatten_3', None),
             ('Linear_4', (800, 256)),
             ('ReLU_4', None),
             ('Linear_5', (256, 128)),
             ('ReLU_5', None),
             ('Linear_6', (128, 10))])

In [76]:
cnn = ConvNet(cnn_arch)

print(cnn)

ConvNet(
  (layers): ModuleList(
    (0): Conv2d(3, 18, kernel_size=(3, 3), stride=(1, 1))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): ReLU()
    (4): Conv2d(18, 32, kernel_size=(5, 5), stride=(1, 1))
    (5): ReLU()
    (6): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (7): ReLU()
    (8): Flatten()
    (9): Linear(in_features=800, out_features=256, bias=True)
    (10): ReLU()
    (11): Linear(in_features=256, out_features=128, bias=True)
    (12): ReLU()
    (13): Linear(in_features=128, out_features=10, bias=True)
  )
)


In [77]:
x = torch.randn(10, 3, 32, 32)

y = cnn(x)
print(y.size())

torch.Size([10, 10])


In [78]:
cross_entropy = nn.CrossEntropyLoss()
adam = optim.Adam(cnn.parameters(), lr=0.001)

In [79]:
train(cnn, cifar10_trainloader, adam, cross_entropy, num_epochs=3)

[1000/6000] loss: 1.800
[2000/6000] loss: 1.477
[3000/6000] loss: 1.386
[4000/6000] loss: 1.321
[5000/6000] loss: 1.265
[1000/6000] loss: 1.179
[2000/6000] loss: 1.151
[3000/6000] loss: 1.110
[4000/6000] loss: 1.105
[5000/6000] loss: 1.086
[1000/6000] loss: 1.002
[2000/6000] loss: 0.975
[3000/6000] loss: 0.987
[4000/6000] loss: 0.989
[5000/6000] loss: 0.957


In [80]:
evaluate(cnn, cifar10_testloader)

Test accuracy on 10000 test images: 63.8100 %


[back to top](#toc)

<a id="exp3"></a>
## Experiment 3: Reduce Number of Convolutional Parameters

- **Last Experiment:** Increase Hidden Channel Sizes
- **Last Test Accuracy:** 63.81
- **This Test Accuracy:** 65.19

Here we take two steps to reduce the number of convolutional parameters, with only a slight increase in the number of fully connected parameters:

- Reduce the number of hidden channels
- Reduce the size of the second hidden kernel
- To keep the size of the flat layer roughly the same we increase the stride by 1 for the first convolutional layer.
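
The claim about parameter counts can be checked by hand. A convolutional layer with a `k x k` kernel has `(k * k * in_channels + 1) * out_channels` parameters (the `+ 1` is the bias), and a fully connected layer has `(inputs + 1) * outputs`. The sketch below applies these formulas to the layer tuples of Experiments 2 and 3 as they appear in the architecture cells; `conv_params` and `linear_params` are hypothetical helper names.

```python
def conv_params(in_channels: int, out_channels: int, kernel: int) -> int:
    """Weights plus biases of a Conv2d layer with a square kernel."""
    return (kernel * kernel * in_channels + 1) * out_channels


def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a Linear layer."""
    return (n_in + 1) * n_out


# Experiment 2: Conv2d (3, 18, 3x3) and (18, 32, 5x5); Linear 800 -> 256 -> 128 -> 10
exp2_conv = conv_params(3, 18, 3) + conv_params(18, 32, 5)
exp2_fc = linear_params(800, 256) + linear_params(256, 128) + linear_params(128, 10)

# Experiment 3: Conv2d (3, 18, 3x3) and (18, 24, 3x3); Linear 864 -> 256 -> 128 -> 10
exp3_conv = conv_params(3, 18, 3) + conv_params(18, 24, 3)
exp3_fc = linear_params(864, 256) + linear_params(256, 128) + linear_params(128, 10)

print("conv parameters:", exp2_conv, "->", exp3_conv)  # 14936 -> 4416
print("fc parameters:  ", exp2_fc, "->", exp3_fc)      # 239242 -> 255626
```

The convolutional parameter count drops to under a third of its previous value, while the fully connected count grows by roughly 7%.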

In [81]:
cnn_arch = OrderedDict()

# 1: Convolutional Layer
cnn_arch["Conv2d_1"] = (3, 18, 3, 2, 0)  # Conv layer: 32x32x3 -> 15x15x18, kernel=3x3, stride=2, padding=0
cnn_arch["ReLU_1a"] = None  # followed by relu nonlinear activation
cnn_arch["MaxPool2d_1"] = (2, 1, 0)  # followed by 2x2 MaxPool, stride = 1, padding=0
cnn_arch["ReLU_1b"] = None  # followed by relu nonlinear activation
# 2: Convolutional Layer
cnn_arch["Conv2d_2"] = (18, 24, 3, 1, 0)  # Conv layer: 14x14x18 -> 12x12x24, kernel=3x3, stride=1, padding=0
cnn_arch["ReLU_2a"] = None  # followed by relu nonlinear activation
cnn_arch["MaxPool2d_2"] = (2, 2, 0)  # followed by 2x2 MaxPool, stride = 2, padding=0
cnn_arch["ReLU_2b"] = None  # followed by relu nonlinear activation
# 3: Flatten
cnn_arch["Flatten_3"] = None # flatten 6x6x24 -> 864
# 4: Fully Connected Layer
cnn_arch["Linear_4"] = (864, 256) # FC layer: 864 input units -> 256 output units
cnn_arch["ReLU_4"] = None  # followed by relu nonlinear activation
# 5: Fully Connected Layer
cnn_arch["Linear_5"] = (256, 128) # FC layer: 256 input units -> 128 output units
cnn_arch["ReLU_5"] = None  # followed by relu nonlinear activation
# 6: Fully Connected Output Layer
cnn_arch["Linear_6"] = (128, 10) # FC layer: 128 input units -> 10 output units

cnn_arch

OrderedDict([('Conv2d_1', (3, 18, 3, 2, 0)),
             ('ReLU_1a', None),
             ('MaxPool2d_1', (2, 1, 0)),
             ('ReLU_1b', None),
             ('Conv2d_2', (18, 24, 3, 1, 0)),
             ('ReLU_2a', None),
             ('MaxPool2d_2', (2, 2, 0)),
             ('ReLU_2b', None),
             ('Flatten_3', None),
             ('Linear_4', (864, 256)),
             ('ReLU_4', None),
             ('Linear_5', (256, 128)),
             ('ReLU_5', None),
             ('Linear_6', (128, 10))])

In [82]:
cnn = ConvNet(cnn_arch)

print(cnn)

ConvNet(
  (layers): ModuleList(
    (0): Conv2d(3, 18, kernel_size=(3, 3), stride=(2, 2))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=2, stride=1, padding=0, dilation=1, ceil_mode=False)
    (3): ReLU()
    (4): Conv2d(18, 24, kernel_size=(3, 3), stride=(1, 1))
    (5): ReLU()
    (6): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (7): ReLU()
    (8): Flatten()
    (9): Linear(in_features=864, out_features=256, bias=True)
    (10): ReLU()
    (11): Linear(in_features=256, out_features=128, bias=True)
    (12): ReLU()
    (13): Linear(in_features=128, out_features=10, bias=True)
  )
)


In [83]:
x = torch.randn(10, 3, 32, 32)

y = cnn(x)
print(y.size())

torch.Size([10, 10])


In [84]:
cross_entropy = nn.CrossEntropyLoss()
adam = optim.Adam(cnn.parameters(), lr=0.001)

In [85]:
train(cnn, cifar10_trainloader, adam, cross_entropy, num_epochs=3)

[1000/6000] loss: 1.794
[2000/6000] loss: 1.476
[3000/6000] loss: 1.367
[4000/6000] loss: 1.295
[5000/6000] loss: 1.236
[1000/6000] loss: 1.146
[2000/6000] loss: 1.133
[3000/6000] loss: 1.101
[4000/6000] loss: 1.076
[5000/6000] loss: 1.065
[1000/6000] loss: 0.974
[2000/6000] loss: 0.985
[3000/6000] loss: 0.969
[4000/6000] loss: 0.966
[5000/6000] loss: 0.920


In [86]:
evaluate(cnn, cifar10_testloader)

Test accuracy on 10000 test images: 65.1900 %


[back to top](#toc)

<a id="exp4"></a>
## Experiment 4: Increase Number of Fully Connected Parameters

- **Last Experiment:** Reduce Number of Convolutional Parameters
- **Last Test Accuracy:** 65.19
- **This Test Accuracy:** 67.18

In this experiment I wanted to see what would happen if I increased the number of fully connected parameters (inspired by the conventional thinking with fully connected networks). To increase the size of the flat layer I reduced the stride in the first convolutional layer to 1 while adding padding of 1, so that the resolution is preserved through the first layer.
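
To see how the stride and padding change feeds through to the flat layer, the spatial size can be traced layer by layer. The sketch below does this for the convolution and pooling settings of Experiments 3 and 4, assuming square inputs and floor rounding; `out_size` and `flat_size` are hypothetical helpers, and each tuple is `(kernel_size, stride, padding)` as used in the architecture cells.

```python
def out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output size of one conv or pooling layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def flat_size(size: int, layers, channels: int) -> int:
    """Trace a square input through conv/pool layers and flatten the result.

    `layers` holds (kernel_size, stride, padding) tuples in network order;
    `channels` is the channel count of the last convolutional layer.
    """
    for kernel, stride, padding in layers:
        size = out_size(size, kernel, stride, padding)
    return size * size * channels


# (kernel, stride, padding) for Conv2d_1, MaxPool2d_1, Conv2d_2, MaxPool2d_2
exp3_layers = [(3, 2, 0), (2, 1, 0), (3, 1, 0), (2, 2, 0)]  # 32 -> 15 -> 14 -> 12 -> 6
exp4_layers = [(3, 1, 1), (2, 2, 0), (3, 1, 0), (2, 2, 0)]  # 32 -> 32 -> 16 -> 14 -> 7

print("Exp 3 flat size:", flat_size(32, exp3_layers, channels=24))  # 864
print("Exp 4 flat size:", flat_size(32, exp4_layers, channels=24))  # 1176
```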

In [87]:
cnn_arch = OrderedDict()

# 1: Convolutional Layer
cnn_arch["Conv2d_1"] = (3, 18, 3, 1, 1)  # Conv layer: 32x32x3 -> 32x32x18, kernel=3x3, stride=1, padding=1
cnn_arch["ReLU_1a"] = None  # followed by relu nonlinear activation
cnn_arch["MaxPool2d_1"] = (2, 2, 0)  # followed by 2x2 MaxPool, stride = 2, padding=0
cnn_arch["ReLU_1b"] = None  # followed by relu nonlinear activation
# 2: Convolutional Layer
cnn_arch["Conv2d_2"] = (18, 24, 3, 1, 0)  # Conv layer: 16x16x18 -> 14x14x24, kernel=3x3, stride=1, padding=0
cnn_arch["ReLU_2a"] = None  # followed by relu nonlinear activation
cnn_arch["MaxPool2d_2"] = (2, 2, 0)  # followed by 2x2 MaxPool, stride = 2, padding=0
cnn_arch["ReLU_2b"] = None  # followed by relu nonlinear activation
# 3: Flatten
cnn_arch["Flatten_3"] = None # flatten 7x7x24 -> 1176
# 4: Fully Connected Layer
cnn_arch["Linear_4"] = (1176, 512) # FC layer: 1176 input units -> 512 output units
cnn_arch["ReLU_4"] = None  # followed by relu nonlinear activation
# 5: Fully Connected Layer
cnn_arch["Linear_5"] = (512, 256) # FC layer: 512 input units -> 256 output units
cnn_arch["ReLU_5"] = None  # followed by relu nonlinear activation
# 6: Fully Connected Output Layer
cnn_arch["Linear_6"] = (256, 10) # FC layer: 256 input units -> 10 output units

cnn_arch

OrderedDict([('Conv2d_1', (3, 18, 3, 1, 1)),
             ('ReLU_1a', None),
             ('MaxPool2d_1', (2, 2, 0)),
             ('ReLU_1b', None),
             ('Conv2d_2', (18, 24, 3, 1, 0)),
             ('ReLU_2a', None),
             ('MaxPool2d_2', (2, 2, 0)),
             ('ReLU_2b', None),
             ('Flatten_3', None),
             ('Linear_4', (1176, 512)),
             ('ReLU_4', None),
             ('Linear_5', (512, 256)),
             ('ReLU_5', None),
             ('Linear_6', (256, 10))])

In [88]:
cnn = ConvNet(cnn_arch)

print(cnn)

ConvNet(
  (layers): ModuleList(
    (0): Conv2d(3, 18, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): ReLU()
    (4): Conv2d(18, 24, kernel_size=(3, 3), stride=(1, 1))
    (5): ReLU()
    (6): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (7): ReLU()
    (8): Flatten()
    (9): Linear(in_features=1176, out_features=512, bias=True)
    (10): ReLU()
    (11): Linear(in_features=512, out_features=256, bias=True)
    (12): ReLU()
    (13): Linear(in_features=256, out_features=10, bias=True)
  )
)


In [89]:
x = torch.randn(10, 3, 32, 32)

y = cnn(x)
print(y.size())

torch.Size([10, 10])


In [90]:
cross_entropy = nn.CrossEntropyLoss()
adam = optim.Adam(cnn.parameters(), lr=0.001)

In [91]:
train(cnn, cifar10_trainloader, adam, cross_entropy, num_epochs=3)

[1000/6000] loss: 1.727
[2000/6000] loss: 1.383
[3000/6000] loss: 1.271
[4000/6000] loss: 1.180
[5000/6000] loss: 1.117
[1000/6000] loss: 1.000
[2000/6000] loss: 0.974
[3000/6000] loss: 0.961
[4000/6000] loss: 0.931
[5000/6000] loss: 0.920
[1000/6000] loss: 0.777
[2000/6000] loss: 0.787
[3000/6000] loss: 0.806
[4000/6000] loss: 0.796
[5000/6000] loss: 0.795


In [92]:
evaluate(cnn, cifar10_testloader)

Test accuracy on 10000 test images: 67.1800 %


[back to top](#toc)

<a id="exp5"></a>
## Experiment 5: Increase Number of Convolutional Layers

- **Last Experiment:** Increase Number of Fully Connected Parameters
- **Last Test Accuracy:** 67.18
- **This Test Accuracy:** 68.51

In this experiment I wanted to see what would happen if I increased the number of convolutional layers. 

In the run recorded below, test accuracy improves slightly over Exp 4 (from 67.18% to 68.51%), even though the dense layers here are smaller than in Exp 4. The conventional thinking is that the convolutional layers should increase the fidelity of our feature extraction, so perhaps we can play around with those parameters a bit more in the next experiment. Exp 5 is also a clear improvement over Exp 3.

In [93]:
cnn_arch = OrderedDict()

# 1: Convolutional Layer
cnn_arch["Conv2d_1"] = (3, 18, 3, 1, 1)  # Conv layer: 32x32x3 -> 32x32x18, kernel=3x3, stride=1, padding=1
cnn_arch["ReLU_1a"] = None  # followed by relu nonlinear activation
cnn_arch["MaxPool2d_1"] = (2, 2, 0)  # followed by 2x2 MaxPool, stride = 2, padding=0
cnn_arch["ReLU_1b"] = None  # followed by relu nonlinear activation
# 2: Convolutional Layer
cnn_arch["Conv2d_2"] = (18, 24, 3, 1, 1)  # Conv layer: 16x16x18 -> 16x16x24, kernel=3x3, stride=1, padding=1
cnn_arch["ReLU_2a"] = None  # followed by relu nonlinear activation
cnn_arch["MaxPool2d_2"] = (2, 2, 0)  # followed by 2x2 MaxPool, stride = 2, padding=0
cnn_arch["ReLU_2b"] = None  # followed by relu nonlinear activation
# 3: Convolutional Layer
cnn_arch["Conv2d_3"] = (24, 32, 3, 1, 1)  # Conv layer: 8x8x24 -> 8x8x32, kernel=3x3, stride=1, padding=1
cnn_arch["ReLU_3a"] = None  # followed by relu nonlinear activation
cnn_arch["MaxPool2d_3"] = (2, 2, 0)  # followed by 2x2 MaxPool, stride = 2, padding=0
cnn_arch["ReLU_3b"] = None  # followed by relu nonlinear activation
# 3: Flatten
cnn_arch["Flatten_3"] = None # flatten 4x4x32 -> 512
# 4: Fully Connected Layer
cnn_arch["Linear_4"] = (512, 256) # FC layer: 512 input units -> 256 output units
cnn_arch["ReLU_4"] = None  # followed by relu nonlinear activation
# 5: Fully Connected Layer
cnn_arch["Linear_5"] = (256, 128) # FC layer: 256 input units -> 128 output units
cnn_arch["ReLU_5"] = None  # followed by relu nonlinear activation
# # 6: Fully Connected Output Layer
# cnn_arch["Linear_6"] = (128, 128) # FC layer: 256 input units -> 128 output units
# cnn_arch["ReLU_5"] = None  # followed by tanh nonlinear activation
# 7: Fully Connected Output Layer
cnn_arch["Linear_7"] = (128, 10) # FC layer: 128 input units -> 10 output units

cnn_arch

OrderedDict([('Conv2d_1', (3, 18, 3, 1, 1)),
             ('ReLU_1a', None),
             ('MaxPool2d_1', (2, 2, 0)),
             ('ReLU_1b', None),
             ('Conv2d_2', (18, 24, 3, 1, 1)),
             ('ReLU_2a', None),
             ('MaxPool2d_2', (2, 2, 0)),
             ('ReLU_2b', None),
             ('Conv2d_3', (24, 32, 3, 1, 1)),
             ('ReLU_3a', None),
             ('MaxPool2d_3', (2, 2, 0)),
             ('ReLU_3b', None),
             ('Flatten_3', None),
             ('Linear_4', (512, 256)),
             ('ReLU_4', None),
             ('Linear_5', (256, 128)),
             ('ReLU_5', None),
             ('Linear_7', (128, 10))])

In [94]:
cnn = ConvNet(cnn_arch)

print(cnn)

ConvNet(
  (layers): ModuleList(
    (0): Conv2d(3, 18, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): ReLU()
    (4): Conv2d(18, 24, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (5): ReLU()
    (6): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (7): ReLU()
    (8): Conv2d(24, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU()
    (10): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (11): ReLU()
    (12): Flatten()
    (13): Linear(in_features=512, out_features=256, bias=True)
    (14): ReLU()
    (15): Linear(in_features=256, out_features=128, bias=True)
    (16): ReLU()
    (17): Linear(in_features=128, out_features=10, bias=True)
  )
)


In [95]:
x = torch.randn(10, 3, 32, 32)

y = cnn(x)
print(y.size())

torch.Size([10, 10])


In [96]:
cross_entropy = nn.CrossEntropyLoss()
adam = optim.Adam(cnn.parameters(), lr=0.001)

In [97]:
train(cnn, cifar10_trainloader, adam, cross_entropy, num_epochs=3)

[1000/6000] loss: 1.790
[2000/6000] loss: 1.503
[3000/6000] loss: 1.380
[4000/6000] loss: 1.267
[5000/6000] loss: 1.163
[1000/6000] loss: 1.089
[2000/6000] loss: 1.051
[3000/6000] loss: 1.021
[4000/6000] loss: 0.998
[5000/6000] loss: 0.965
[1000/6000] loss: 0.872
[2000/6000] loss: 0.870
[3000/6000] loss: 0.869
[4000/6000] loss: 0.868
[5000/6000] loss: 0.844


In [98]:
evaluate(cnn, cifar10_testloader)

Test accuracy on 10000 test images: 68.5100 %


[back to top](#toc)

<a id="exp6"></a>
## Experiment 6: More Convolutional Layers with More Fully Connected Parameters

- **Last Experiment:** More Convolutional Layers
- **Last Test Accuracy:** 68.51%
- **This Test Accuracy:** 69.07%

In [Exp 4](#exp4) an increase in the number of fully connected parameters increased our accuracy w.r.t. [Exp 3](#exp3), and in [Exp 5](#exp5) an increase in the number of convolutional layers increased it again (68.51% in the recorded run). Here we combine the two changes for an overall improvement over both.

In [99]:
cnn_arch = OrderedDict()

# 1: Convolutional Layer
cnn_arch["Conv2d_1"] = (3, 24, 3, 1, 1)  # Conv layer: 32x32x3 -> 32x32x24, kernel=3x3, stride=1, padding=1
cnn_arch["ReLU_1a"] = None  # followed by relu nonlinear activation
cnn_arch["MaxPool2d_1"] = (2, 2, 0)  # followed by 2x2 MaxPool, stride = 2, padding=0
cnn_arch["ReLU_1b"] = None  # followed by relu nonlinear activation
# 2: Convolutional Layer
cnn_arch["Conv2d_2"] = (24, 32, 3, 1, 1)  # Conv layer: 16x16x24 -> 16x16x32, kernel=3x3, stride=1, padding=1
cnn_arch["ReLU_2a"] = None  # followed by relu nonlinear activation
cnn_arch["MaxPool2d_2"] = (2, 2, 0)  # followed by 2x2 MaxPool, stride = 2, padding=0
cnn_arch["ReLU_2b"] = None  # followed by relu nonlinear activation
# 3: Convolutional Layer
cnn_arch["Conv2d_3"] = (32, 64, 3, 1, 1)  # Conv layer: 8x8x32 -> 8x8x64, kernel=3x3, stride=1, padding=1
cnn_arch["ReLU_3a"] = None  # followed by relu nonlinear activation
cnn_arch["MaxPool2d_3"] = (2, 2, 0)  # followed by 2x2 MaxPool, stride = 2, padding=0
cnn_arch["ReLU_3b"] = None  # followed by relu nonlinear activation
# 3: Flatten
cnn_arch["Flatten_3"] = None # flatten 4x4x64 -> 1024
# 4: Fully Connected Layer
cnn_arch["Linear_4"] = (1024, 512) # FC layer: 1024 input units -> 512 output units
cnn_arch["ReLU_4"] = None  # followed by relu nonlinear activation
# 5: Fully Connected Layer
cnn_arch["Linear_5"] = (512, 256) # FC layer: 512 input units -> 256 output units
cnn_arch["ReLU_5"] = None  # followed by relu nonlinear activation
# # 6: Fully Connected Output Layer
# cnn_arch["Linear_6"] = (128, 128) # FC layer: 256 input units -> 128 output units
# cnn_arch["ReLU_5"] = None  # followed by tanh nonlinear activation
# 7: Fully Connected Output Layer
cnn_arch["Linear_7"] = (256, 10) # FC layer: 256 input units -> 10 output units

cnn_arch

OrderedDict([('Conv2d_1', (3, 24, 3, 1, 1)),
             ('ReLU_1a', None),
             ('MaxPool2d_1', (2, 2, 0)),
             ('ReLU_1b', None),
             ('Conv2d_2', (24, 32, 3, 1, 1)),
             ('ReLU_2a', None),
             ('MaxPool2d_2', (2, 2, 0)),
             ('ReLU_2b', None),
             ('Conv2d_3', (32, 64, 3, 1, 1)),
             ('ReLU_3a', None),
             ('MaxPool2d_3', (2, 2, 0)),
             ('ReLU_3b', None),
             ('Flatten_3', None),
             ('Linear_4', (1024, 512)),
             ('ReLU_4', None),
             ('Linear_5', (512, 256)),
             ('ReLU_5', None),
             ('Linear_7', (256, 10))])

In [100]:
cnn = ConvNet(cnn_arch)

print(cnn)

ConvNet(
  (layers): ModuleList(
    (0): Conv2d(3, 24, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): ReLU()
    (4): Conv2d(24, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (5): ReLU()
    (6): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (7): ReLU()
    (8): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU()
    (10): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (11): ReLU()
    (12): Flatten()
    (13): Linear(in_features=1024, out_features=512, bias=True)
    (14): ReLU()
    (15): Linear(in_features=512, out_features=256, bias=True)
    (16): ReLU()
    (17): Linear(in_features=256, out_features=10, bias=True)
  )
)


In [101]:
x = torch.randn(10, 3, 32, 32)

y = cnn(x)
print(y.size())

torch.Size([10, 10])
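
The `1024` input units of `Linear_4` follow from the standard output-size formula for convolutions and pooling, `floor((n + 2*padding - kernel) / stride) + 1`. Below is a minimal sketch that traces the spatial size through the architecture above without running the network. It assumes the tuple conventions visible in the printed model, `(in, out, kernel, stride, padding)` for `Conv2d` and `(kernel, stride, padding)` for `MaxPool2d`. The helpers `out_size` and `trace_shapes` are hypothetical names and are not part of the notebook's `ConvNet` class.

```python
from collections import OrderedDict


def out_size(n, kernel, stride, padding):
    """Spatial size after a conv or pooling layer (square input, no dilation)."""
    return (n + 2 * padding - kernel) // stride + 1


def trace_shapes(arch, size=32):
    """Return (layer name, channels, height, width) after each conv/pool layer."""
    channels = None
    shapes = []
    for name, params in arch.items():
        kind = name.split("_")[0]
        if kind == "Conv2d":
            _, channels, kernel, stride, padding = params
        elif kind == "MaxPool2d":
            kernel, stride, padding = params
        else:
            continue  # activations and linear layers do not change the spatial size
        size = out_size(size, kernel, stride, padding)
        shapes.append((name, channels, size, size))
    return shapes


# Conv and pooling layers of the architecture printed above
arch = OrderedDict([
    ("Conv2d_1", (3, 24, 3, 1, 1)), ("MaxPool2d_1", (2, 2, 0)),
    ("Conv2d_2", (24, 32, 3, 1, 1)), ("MaxPool2d_2", (2, 2, 0)),
    ("Conv2d_3", (32, 64, 3, 1, 1)), ("MaxPool2d_3", (2, 2, 0)),
])

shapes = trace_shapes(arch)
for name, c, h, w in shapes:
    print(f"{name}: {h}x{w}x{c}")

_, c, h, w = shapes[-1]
flat_features = c * h * w
print("flattened features:", flat_features)  # 64 * 4 * 4 = 1024
```

Each 3x3 convolution with padding 1 keeps the spatial size, and each 2x2 pool halves it, so the feature map goes 32 -> 16 -> 8 -> 4 before flattening.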


In [102]:
cross_entropy = nn.CrossEntropyLoss()
adam = optim.Adam(cnn.parameters(), lr=0.001)

In [103]:
train(cnn, cifar10_trainloader, adam, cross_entropy, num_epochs=3)

[1000/6000] loss: 1.769
[2000/6000] loss: 1.440
[3000/6000] loss: 1.313
[4000/6000] loss: 1.196
[5000/6000] loss: 1.126
[1000/6000] loss: 1.034
[2000/6000] loss: 0.993
[3000/6000] loss: 0.963
[4000/6000] loss: 0.931
[5000/6000] loss: 0.938
[1000/6000] loss: 0.802
[2000/6000] loss: 0.816
[3000/6000] loss: 0.822
[4000/6000] loss: 0.805
[5000/6000] loss: 0.807


In [104]:
evaluate(cnn, cifar10_testloader)

Test accuracy on 10000 test images: 69.0700 %


[back to top](#toc)

<a id="exp7"></a>
## Experiment 7: Even More Fully Connected Parameters

- **Last Experiment:** More Convolutional Layers with More Fully Connected Parameters
- **Last Test Accuracy:** 69.07%
- **This Test Accuracy:** 67.64%
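
To make "even more fully connected parameters" concrete, here is a rough sketch that tallies trainable parameters for the previous architecture and this one directly from the layer tuples. It assumes every `Conv2d` and `Linear` layer has a bias term (PyTorch's default) and reuses the notebook's tuple conventions. `count_params` is a hypothetical helper, not part of the notebook.

```python
def count_params(arch):
    """Return (conv_params, fc_params) for an architecture dict, biases included."""
    conv = fc = 0
    for name, params in arch.items():
        kind = name.split("_")[0]
        if kind == "Conv2d":
            c_in, c_out, k = params[:3]
            conv += c_in * c_out * k * k + c_out  # weights + biases
        elif kind == "Linear":
            n_in, n_out = params
            fc += n_in * n_out + n_out  # weights + biases
    return conv, fc


# Only layers with parameters are listed; ReLU, MaxPool2d and Flatten have none.
previous_arch = {
    "Conv2d_1": (3, 24, 3, 1, 1),
    "Conv2d_2": (24, 32, 3, 1, 1),
    "Conv2d_3": (32, 64, 3, 1, 1),
    "Linear_4": (1024, 512),
    "Linear_5": (512, 256),
    "Linear_7": (256, 10),
}
this_arch = {
    "Conv2d_1": (3, 12, 3, 1, 1),
    "Conv2d_2": (12, 24, 3, 1, 1),
    "Linear_4": (1536, 512),
    "Linear_5": (512, 256),
    "Linear_6": (256, 10),
}

prev_conv, prev_fc = count_params(previous_arch)
this_conv, this_fc = count_params(this_arch)
print(f"previous: conv={prev_conv:,} fc={prev_fc:,}")  # conv=26,112 fc=658,698
print(f"this:     conv={this_conv:,} fc={this_fc:,}")  # conv=2,952 fc=920,842
```

Under these assumptions, dropping the third convolutional block and halving the first layer's channels shrinks the convolutional part by roughly 9x, while the larger flattened input grows the fully connected part by about 40%.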


In [105]:
cnn_arch = OrderedDict()

# 1: Convolutional Layer
cnn_arch["Conv2d_1"] = (3, 12, 3, 1, 1)  # Conv layer: 32x32x3 -> 32x32x12, kernel=3x3, stride=1, padding=1
cnn_arch["ReLU_1a"] = None  # followed by relu nonlinear activation
cnn_arch["MaxPool2d_1"] = (2, 2, 0)  # followed by 2x2 MaxPool, stride = 2, padding=0
cnn_arch["ReLU_1b"] = None  # followed by relu nonlinear activation
# 2: Convolutional Layer
cnn_arch["Conv2d_2"] = (12, 24, 3, 1, 1)  # Conv layer: 16x16x12 -> 16x16x24, kernel=3x3, stride=1, padding=1
cnn_arch["ReLU_2a"] = None  # followed by relu nonlinear activation
cnn_arch["MaxPool2d_2"] = (2, 2, 0)  # followed by 2x2 MaxPool, stride = 2, padding=0
cnn_arch["ReLU_2b"] = None  # followed by relu nonlinear activation
# 3: Flatten
cnn_arch["Flatten_3"] = None # flatten 8x8x24 -> 1536
# 4: Fully Connected Layer
cnn_arch["Linear_4"] = (1536, 512) # FC layer: 1536 input units -> 512 output units
cnn_arch["ReLU_4"] = None  # followed by relu nonlinear activation
# 5: Fully Connected Layer
cnn_arch["Linear_5"] = (512, 256) # FC layer: 512 input units -> 256 output units
cnn_arch["ReLU_5"] = None  # followed by relu nonlinear activation
# 6: Fully Connected Output Layer
cnn_arch["Linear_6"] = (256, 10) # FC layer: 256 input units -> 10 output units

cnn_arch

OrderedDict([('Conv2d_1', (3, 12, 3, 1, 1)),
             ('ReLU_1a', None),
             ('MaxPool2d_1', (2, 2, 0)),
             ('ReLU_1b', None),
             ('Conv2d_2', (12, 24, 3, 1, 1)),
             ('ReLU_2a', None),
             ('MaxPool2d_2', (2, 2, 0)),
             ('ReLU_2b', None),
             ('Flatten_3', None),
             ('Linear_4', (1536, 512)),
             ('ReLU_4', None),
             ('Linear_5', (512, 256)),
             ('ReLU_5', None),
             ('Linear_6', (256, 10))])

In [106]:
cnn = ConvNet(cnn_arch)

print(cnn)

ConvNet(
  (layers): ModuleList(
    (0): Conv2d(3, 12, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): ReLU()
    (4): Conv2d(12, 24, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (5): ReLU()
    (6): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (7): ReLU()
    (8): Flatten()
    (9): Linear(in_features=1536, out_features=512, bias=True)
    (10): ReLU()
    (11): Linear(in_features=512, out_features=256, bias=True)
    (12): ReLU()
    (13): Linear(in_features=256, out_features=10, bias=True)
  )
)


In [107]:
x = torch.randn(10, 3, 32, 32)

y = cnn(x)
print(y.size())

torch.Size([10, 10])


In [108]:
cross_entropy = nn.CrossEntropyLoss()
adam = optim.Adam(cnn.parameters(), lr=0.001)

In [109]:
train(cnn, cifar10_trainloader, adam, cross_entropy, num_epochs=3)

[1000/6000] loss: 1.716
[2000/6000] loss: 1.400
[3000/6000] loss: 1.266
[4000/6000] loss: 1.193
[5000/6000] loss: 1.140
[1000/6000] loss: 1.004
[2000/6000] loss: 1.015
[3000/6000] loss: 0.973
[4000/6000] loss: 0.957
[5000/6000] loss: 0.941
[1000/6000] loss: 0.800
[2000/6000] loss: 0.816
[3000/6000] loss: 0.800
[4000/6000] loss: 0.802
[5000/6000] loss: 0.822


In [110]:
evaluate(cnn, cifar10_testloader)

Test accuracy on 10000 test images: 67.6400 %


[back to top](#toc)