# Convolutional Neural Network in PyTorch

**|| Jonty Sinai ||** 10-04-2019

So far I've implemented a simple [one layer neural network](mlp.ipynb) and code for an [n-layer fully connected neural network](neural_network.ipynb) for simple image classification tasks, either MNIST or CIFAR10. With these simple architectures it was possible to get over 97% test accuracy on MNIST (with only a few minutes of training on a CPU) and roughly 51% test accuracy on CIFAR-10 (still way better than a random guess which has 10% accuracy). 

The main design flaw with these neural networks is that every pixel in the image maps through a unique weight pathway to the final output class probabilities. Put another way, every weight pathway through the network is associated with a unique pixel location in the source image. The problem with this approach is that groups of pixels tend to form similar patterns (e.g. edges, circles, textures), and these patterns tend to be repeated across an image. The basic feed-forward architecture of a fully connected network takes no advantage of repeated motifs within an image. This is where convolutional neural networks come in.

The idea is to "look" at only a subset of the image and systematically cover the image by _striding_ the selected subset. The way we do this is with _convolution_ kernels (matrices smaller than the image) which _convolve_ the selected pixels into a small patch of transformed pixels. The learnable weights of the neural network are the weights of the convolution kernels. The advantage of this is that a small number of weights can be shared across the whole image. For RGB colour images the kernel has a separate set of weights for each colour channel. Convolution kernels are often coupled with _pooling_ layers which summarise a patch of transformed pixels into one hidden unit, usually by taking the average or max value. The overall effect is similar to _compression_, only the compression tends to be lossy because of the pooling layers.

<img src="./assets/convolution_kernel.gif" width="543">

source: [Rob Robinson, Imperial College](https://mlnotebook.github.io/post/CNN1/)
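
To make the sliding-window idea concrete, here is a minimal pure-Python sketch of a single-channel convolution with no padding. The function name `convolve2d` and the toy image and kernel are my own illustrative choices, and the sketch assumes a square image and a square kernel. Like `nn.Conv2d`, it computes a cross-correlation (the kernel is not flipped).

```python
def convolve2d(image, kernel, stride=1):
    """Slide a square kernel over a square image (lists of lists), no padding."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    output = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            # weighted sum of the k x k patch whose top-left corner is
            # at (i * stride, j * stride); the same weights are reused everywhere
            total = 0
            for a in range(k):
                for b in range(k):
                    total += image[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(total)
        output.append(row)
    return output


# 4x4 toy image: dark left half (0), bright right half (1)
image = [[0, 0, 1, 1] for _ in range(4)]

# 2x2 kernel that responds to a left-to-right increase in brightness
kernel = [[-1, 1],
          [-1, 1]]

feature_map = convolve2d(image, kernel)
print(feature_map)  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The four kernel weights are the only parameters, yet they detect the vertical edge wherever it appears in the image.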

Convolution kernels can be thought of as encoding essential information about the image into a compressed latent representation. With each convolution layer, each pixel in each latent representation covers a wider area of the source image. After several convolution layers, the latent representation can then be passed through a fully connected layer with softmax activation to calculate a distribution over the class labels. By this point, each class probability should reference the entire source image.
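
The claim that each latent pixel covers a wider area with every layer can be checked with a small calculation. The sketch below is an illustration under my own naming (`receptive_field` is a hypothetical helper) and assumes each layer is described only by its kernel size and stride, ignoring padding and dilation.

```python
def receptive_field(layers):
    """Receptive field size after each layer.

    layers: list of (kernel_size, stride) pairs, ordered from input to output.
    """
    rf = 1    # a raw input pixel sees only itself
    jump = 1  # distance in input pixels between neighbouring units of the current layer
    sizes = []
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
        sizes.append(rf)
    return sizes


# three stacked 3x3 convolutions with stride 1
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # [3, 5, 7]

# 3x3 convolution, 2x2 pooling with stride 2, then another 3x3 convolution
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # [3, 4, 8]
```

Striding (for example in a pooling layer) makes the receptive field grow faster in all later layers.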

An example of a convolutional neural network architecture with fully connected output layers is shown below:

<img src="./assets/cnn_arch.png" width="1000">

source: [Denny Britz, Wild ML](http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/)

Notice how the architecture is flexible enough to accommodate any number of channels in the hidden layers.

In [1]:
%matplotlib inline

import os
import re  # we'll use this later to process layer type keys in an OrderedDict 
import random

from collections import OrderedDict

import numpy as np

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import torchvision
from torchvision import transforms

import matplotlib.pyplot as plt


HOME = os.environ['AI_HOME']
ROOT = os.path.join(HOME, 'artificial_neural_networks')
DATA = os.path.join(ROOT, 'data')
MNIST = os.path.join(DATA, 'mnist')
CIFAR10 = os.path.join(DATA, 'cifar10')

random.seed(1901)
np.random.seed(1901)
torch.manual_seed(1901)

<torch._C.Generator at 0x10fd998f0>

## CNN Architecture

CNN architectures vary widely, and a lot of research and experimentation goes into finding the right one. One of the first CNN architectures to achieve high accuracy on MNIST was Yann LeCun's [LeNet-5](http://yann.lecun.com/exdb/lenet/) from 1998. The architecture is shown below:

<img src="./assets/le_net_5.png" width="1000">

source: [Andrew Ng, Coursera](https://www.coursera.org/learn/convolutional-neural-networks/home/welcome)

We'll implement this architecture as part of a more generalisable convolutional neural network module.

In [2]:
class Flatten(nn.Module):
    # ref: https://discuss.pytorch.org/t/flatten-layer-of-pytorch-build-by-sequential-container/5983/3
    def forward(self, x):
        x = x.view(x.size()[0], -1)
        return x

In [3]:
class ConvNet(nn.Module):
    
    def __init__(self, arch_dict: OrderedDict):
        """
        Args:
            arch_dict (OrderedDict) : Specifies the CNN architecture where
                key, value pairs correspond to layer_type, layer_params.
                Layer parameters are specified as a tuple of integers or
                they can be None.
                
                The supported layer types with their parameters are:
                
                    Conv2d : (in_channels, out_channels, kernel_size, stride, padding)
                    AvgPool2d : (kernel_size, stride, padding)
                    MaxPool2d : (kernel_size, stride, padding)
                    Flatten : None
                    Linear : (input_size, output_size)
                    ReLU : None
                    Sigmoid : None
                    Tanh : None
                    
                If a layer type is used more than once, then each key should be
                post-fixed with an underscore followed by an alphanumeric
                index.
                
                Eg: "Conv2d_1", "Conv2d_2", "Tanh_1a", "Tanh_1b"
                    
        """
        super().__init__()
        
        self.layers = nn.ModuleList()
        
        # make sure arch_dict is an OrderedDict
        # for activation layers, use None for layer_params
        for layer_type, layer_params in arch_dict.items():
            
            layer_type = re.sub(r"_[\d\w]+", "", layer_type) # remove number/letter post-fixing of layer types
            
            if layer_type == "Conv2d":
                in_channels, out_channels, kernel_size, stride, padding = layer_params
                self.layers.append(
                    nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding))
            elif layer_type == "AvgPool2d":
                kernel_size, stride, padding = layer_params
                self.layers.append(
                    nn.AvgPool2d(kernel_size, stride, padding))
            elif layer_type == "MaxPool2d":
                kernel_size, stride, padding = layer_params
                self.layers.append(
                    nn.MaxPool2d(kernel_size, stride, padding))
            elif layer_type == "Flatten":
                self.layers.append(
                    Flatten())
            elif layer_type == "Linear":
                input_size, output_size = layer_params
                self.layers.append(
                    nn.Linear(input_size, output_size))
            elif layer_type == "ReLU":
                self.layers.append(
                    nn.ReLU())
            elif layer_type == "Sigmoid":
                self.layers.append(
                    nn.Sigmoid())
            elif layer_type == "Tanh":
                self.layers.append(
                    nn.Tanh())
            else:
                raise ValueError(f"Unsupported layer type: {layer_type}")
                
    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x
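
Since dictionary keys must be unique, the suffix convention is what allows the same layer type to appear several times in one architecture. As a quick sanity check of the key handling, the sketch below applies the same regular expression used in `ConvNet.__init__` to a few example keys (the helper name `base_layer_type` is mine, for illustration only).

```python
import re


def base_layer_type(key):
    # strip an underscore followed by letters/digits, e.g. "Conv2d_1" -> "Conv2d"
    return re.sub(r"_[\d\w]+", "", key)


keys = ["Conv2d_1", "Tanh_1a", "AvgPool2d_2", "Flatten"]
print([base_layer_type(k) for k in keys])
# ['Conv2d', 'Tanh', 'AvgPool2d', 'Flatten']
```

Note that the pattern removes everything from the first underscore onwards, so layer type names themselves must not contain underscores.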


The LeNet-5 architecture can be specified as follows using an `OrderedDict`.

> Note the original LeNet-5 architecture uses no padding in the first layer and expects 32x32 resolution input images. However, MNIST images are 28x28, so we will have to use a padding of 2 to keep the fully connected layers consistent with the LeNet-5 architecture.
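
The padding choice can be verified with the standard output size formula for convolution and pooling layers, `floor((n + 2 * padding - kernel_size) / stride) + 1`. The sketch below is a small check under my own naming (`conv_output_size` is a hypothetical helper) and assumes square inputs and no dilation.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (no dilation)."""
    return (size + 2 * padding - kernel_size) // stride + 1


# MNIST: 28x28 input, padding of 2 in the first convolution
size = 28
size = conv_output_size(size, kernel_size=5, stride=1, padding=2)  # 28
size = conv_output_size(size, kernel_size=2, stride=2)             # 14
size = conv_output_size(size, kernel_size=5, stride=1)             # 10
size = conv_output_size(size, kernel_size=2, stride=2)             # 5
flattened = size * size * 16
print(size, flattened)  # 5 400

# a 32x32 input with no padding also gives 28x28 after the first convolution
print(conv_output_size(32, kernel_size=5, stride=1, padding=0))  # 28
```

Either way the first convolution produces 28x28 feature maps, so the flattened size feeding the first fully connected layer is 5 x 5 x 16 = 400.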

In [4]:
lenet5_arch = OrderedDict()

# 1: Convolutional Layer
lenet5_arch["Conv2d_1"] = (1, 6, 5, 1, 2)  # Conv layer: 28x28x1 -> 28x28x6, kernel=5x5, stride=1, padding=2
lenet5_arch["Tanh_1a"] = None  # followed by tanh nonlinear activation
lenet5_arch["AvgPool2d_1"] = (2, 2, 0)  # followed by 2x2 AvgPool, stride = 2, padding=0
lenet5_arch["Tanh_1b"] = None  # followed by tanh nonlinear activation
# 2: Convolutional Layer
lenet5_arch["Conv2d_2"] = (6, 16, 5, 1, 0)  # Conv layer: 14x14x6 -> 10x10x16, kernel=5x5, stride=1, padding=0
lenet5_arch["Tanh_2a"] = None  # followed by tanh nonlinear activation
lenet5_arch["AvgPool2d_2"] = (2, 2, 0)  # followed by 2x2 AvgPool, stride = 2, padding=0
lenet5_arch["Tanh_2b"] = None  # followed by tanh nonlinear activation
# 3: Flatten
lenet5_arch["Flatten_3"] = None # flatten 5x5x16 -> 400
# 4: Fully Connected Layer
lenet5_arch["Linear_4"] = (400, 120) # FC layer: 400 input units -> 120 output units
lenet5_arch["Tanh_4"] = None  # followed by tanh nonlinear activation
# 5: Fully Connected Layer
lenet5_arch["Linear_5"] = (120, 84) # FC layer: 120 input units -> 84 output units
lenet5_arch["Tanh_5"] = None  # followed by tanh nonlinear activation
# 6: Fully Connected Output Layer
lenet5_arch["Linear_6"] = (84, 10) # FC layer: 84 input units -> 10 output units

lenet5_arch

OrderedDict([('Conv2d_1', (1, 6, 5, 1, 2)),
             ('Tanh_1a', None),
             ('AvgPool2d_1', (2, 2, 0)),
             ('Tanh_1b', None),
             ('Conv2d_2', (6, 16, 5, 1, 0)),
             ('Tanh_2a', None),
             ('AvgPool2d_2', (2, 2, 0)),
             ('Tanh_2b', None),
             ('Flatten_3', None),
             ('Linear_4', (400, 120)),
             ('Tanh_4', None),
             ('Linear_5', (120, 84)),
             ('Tanh_5', None),
             ('Linear_6', (84, 10))])

In [5]:
lenet5_mnist = ConvNet(lenet5_arch)

print(lenet5_mnist)

ConvNet(
  (layers): ModuleList(
    (0): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (1): Tanh()
    (2): AvgPool2d(kernel_size=2, stride=2, padding=0)
    (3): Tanh()
    (4): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
    (5): Tanh()
    (6): AvgPool2d(kernel_size=2, stride=2, padding=0)
    (7): Tanh()
    (8): Flatten()
    (9): Linear(in_features=400, out_features=120, bias=True)
    (10): Tanh()
    (11): Linear(in_features=120, out_features=84, bias=True)
    (12): Tanh()
    (13): Linear(in_features=84, out_features=10, bias=True)
  )
)
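
The weight sharing discussed in the introduction shows up in the parameter count. The sketch below counts weights and biases directly from the layer tuples used above; `count_parameters` and `weighted_layers` are hypothetical names for this illustration, and only the `Conv2d` and `Linear` entries are included since the other layers have no parameters.

```python
from collections import OrderedDict


def count_parameters(arch):
    """Count weights and biases for Conv2d and Linear entries of an arch dict."""
    total = 0
    for key, params in arch.items():
        if key.startswith("Conv2d"):
            in_channels, out_channels, kernel_size, _stride, _padding = params
            # one kernel_size x kernel_size slice per (input, output) channel pair, plus biases
            total += in_channels * out_channels * kernel_size ** 2 + out_channels
        elif key.startswith("Linear"):
            in_features, out_features = params
            total += in_features * out_features + out_features
    return total


conv_layers = OrderedDict([("Conv2d_1", (1, 6, 5, 1, 2)),
                           ("Conv2d_2", (6, 16, 5, 1, 0))])
weighted_layers = OrderedDict(conv_layers)
weighted_layers["Linear_4"] = (400, 120)
weighted_layers["Linear_5"] = (120, 84)
weighted_layers["Linear_6"] = (84, 10)

print(count_parameters(conv_layers))      # 2572
print(count_parameters(weighted_layers))  # 61706
```

The two convolutional layers account for only 2,572 of the 61,706 parameters; the fully connected layers make up the rest.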


## Test Forward Pass

In [6]:
x = torch.randn(10, 1, 28, 28)  # one batch of 10 greyscale (1 channel) 28x28 images

y = lenet5_mnist(x)
print(y.size())

torch.Size([10, 10])


## Train and Evaluate Functions

Notice that since we're using convolutions, we don't need to specify the input size parameter to unroll the images.

In [7]:
def train(model, training_data, optimiser, loss_function, num_epochs):
    
    for epoch in range(num_epochs):
        print(f"Epoch: {epoch + 1} " + "="*80 + ">")
        
        total_loss = 0.0
        for batch_idx, batch in enumerate(training_data):
            images, labels = batch
            
            # zero accumulated gradients
            optimiser.zero_grad()
            
            # forward pass
            output = model(images)
            # backward pass
            loss = loss_function(output, labels)
            loss.backward()
            optimiser.step()
            
            total_loss += loss.item()
            # print progress
            
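            if (batch_idx + 1) % 1000 == 0:    # print every 1000 mini-batches
                # note: the 6000 is hard-coded for MNIST (60000 images / batch size 10);
                # CIFAR-10 has 50000 training images, so only 5000 mini-batches per epoch
                print("[%4d/6000] loss: %.3f" %
                      (batch_idx + 1, total_loss / 1000))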
                total_loss = 0.0
                
    print("Finished Training " + "="*71 + ">")

In [8]:
def evaluate(model, test_data):
    correct = 0
    total = 0
    with torch.no_grad():
        for data in test_data:
            images, truth = data
            output = model(images)
            _, predicted = torch.max(output.data, 1)
            total += truth.size(0)
            correct += (predicted == truth).sum().item()

    print('Test accuracy on %d test images: %.4f %%' % (total, 100 * correct / total))
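
For readers less familiar with `torch.max(output, 1)`: it returns the largest score in each row together with its index, and the index is taken as the predicted class. A minimal pure-Python sketch of the same accuracy calculation, using made-up scores and a hypothetical `accuracy` helper, looks like this:

```python
def accuracy(scores, labels):
    """Percentage of rows whose highest-scoring class matches the label."""
    # index of the largest score in each row, i.e. the predicted class
    predicted = [max(range(len(row)), key=row.__getitem__) for row in scores]
    correct = sum(int(p == t) for p, t in zip(predicted, labels))
    return 100 * correct / len(labels)


# made-up scores for a batch of 3 images and 3 classes
scores = [[0.1, 2.0, -1.0],
          [1.5, 0.2, 0.3],
          [0.0, 0.1, 0.9]]
labels = [1, 0, 1]

print(round(accuracy(scores, labels), 2))  # 66.67
```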

## MNIST

Define dataloaders for [MNIST](http://yann.lecun.com/exdb/mnist/).

In [9]:
mnist_transforms = transforms.Compose([
                    transforms.ToTensor(),
                    transforms.Normalize((0.1307,), (0.3081,))]  # note that we normalise by rank-1 tensors
                )
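
As a reminder of what the transform does: `ToTensor` scales pixel values to the range [0, 1], and `Normalize` then computes `(pixel - mean) / std` per channel. The sketch below reproduces that arithmetic in plain Python (the `normalise` helper is my own, for illustration) using the MNIST values above and the (0.5, 0.5) values used for CIFAR-10 later on.

```python
def normalise(pixel, mean, std):
    """Per-pixel arithmetic of a normalisation transform."""
    return (pixel - mean) / std


# MNIST: black (0.0) and white (1.0) pixels after normalisation
print(round(normalise(0.0, 0.1307, 0.3081), 4))  # -0.4242
print(round(normalise(1.0, 0.1307, 0.3081), 4))  # 2.8215

# mean 0.5 and std 0.5 map the range [0, 1] onto [-1, 1]
print(normalise(0.0, 0.5, 0.5), normalise(1.0, 0.5, 0.5))  # -1.0 1.0
```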

In [10]:
# training set
mnist_trainset = torchvision.datasets.MNIST(root=MNIST, train=True, download=True, transform=mnist_transforms)
mnist_trainloader = torch.utils.data.DataLoader(mnist_trainset, batch_size=10, shuffle=True, num_workers=2)

In [11]:
# test set
mnist_testset = torchvision.datasets.MNIST(root=MNIST, train=False, download=True, transform=mnist_transforms)
mnist_testloader = torch.utils.data.DataLoader(mnist_testset, batch_size=10, shuffle=False, num_workers=2)

In [12]:
mnist_classes = tuple(f"{n}" for n in range(10))
print(mnist_classes)

('0', '1', '2', '3', '4', '5', '6', '7', '8', '9')


Now train LeNet-5 for MNIST

In [13]:
cross_entropy_mnist = nn.CrossEntropyLoss()
adam_mnist = optim.Adam(lenet5_mnist.parameters(), lr=0.001)
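
Note that the network's final layer outputs raw scores with no softmax; `nn.CrossEntropyLoss` applies a log-softmax internally before computing the negative log-likelihood. The sketch below shows that calculation for a single example in plain Python. The function name `cross_entropy` and the example scores are illustrative assumptions, not part of the notebook.

```python
import math


def cross_entropy(logits, label):
    """Negative log of the softmax probability assigned to the true label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


# an untrained model that scores all 10 classes equally has loss ln(10)
print(round(cross_entropy([0.0] * 10, 3), 4))  # 2.3026

# a confident, correct prediction has a loss close to zero
print(round(cross_entropy([5.0, 0.0, 0.0], 0), 4))  # 0.0134
```

The value ln(10) ≈ 2.30 is a useful reference point when reading the training logs: a loss below it means the model is doing better than a uniform guess over ten classes.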

In [14]:
train(lenet5_mnist, mnist_trainloader, adam_mnist, cross_entropy_mnist, num_epochs=3)

[1000/6000] loss: 0.398
[2000/6000] loss: 0.156
[3000/6000] loss: 0.113
[4000/6000] loss: 0.101
[5000/6000] loss: 0.091
[6000/6000] loss: 0.083
[1000/6000] loss: 0.064
[2000/6000] loss: 0.068
[3000/6000] loss: 0.062
[4000/6000] loss: 0.069
[5000/6000] loss: 0.057
[6000/6000] loss: 0.064
[1000/6000] loss: 0.043
[2000/6000] loss: 0.050
[3000/6000] loss: 0.055
[4000/6000] loss: 0.047
[5000/6000] loss: 0.054
[6000/6000] loss: 0.055


For reference, the simple MLP achieved a training loss of 0.083 and the 3-layer network achieved a training loss of 0.081 after three epochs. LeNet-5 massively improves the results, beating both shortly into the second epoch and achieving a final training loss of 0.055 after only a few minutes of training on a CPU. This is close to the state of the art in 1998; I only used a computer for the first time in 2000.

And now let's evaluate

In [15]:
evaluate(lenet5_mnist, mnist_testloader)

Test accuracy on 10000 test images: 98.6800 %


Hello world again.

## CIFAR-10

Let's see how LeNet-5 performs when trained on CIFAR-10, which has so far been a tricky dataset for our simple fully connected neural networks.

In [16]:
cifar10_transforms = transforms.Compose([
                    transforms.ToTensor(),
                    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]  # note that we normalise by rank-1 tensors
                )

In [17]:
cifar10_trainset = torchvision.datasets.CIFAR10(root=CIFAR10, train=True, download=True, transform=cifar10_transforms)
cifar10_trainloader = torch.utils.data.DataLoader(cifar10_trainset, batch_size=10, shuffle=True, num_workers=2)

Files already downloaded and verified


In [18]:
cifar10_testset = torchvision.datasets.CIFAR10(root=CIFAR10, train=False, download=True, transform=cifar10_transforms)
cifar10_testloader = torch.utils.data.DataLoader(cifar10_testset, batch_size=10, shuffle=False, num_workers=2)

Files already downloaded and verified


In [19]:
cifar10_classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

Initialise model. CIFAR-10 images are 32x32 so this time we will use no padding in the first layer. Notice that we also have to increase the number of input channels to 3.

In [20]:
lenet5_arch_cifar10 = OrderedDict()

# 1: Convolutional Layer
lenet5_arch_cifar10["Conv2d_1"] = (3, 6, 5, 1, 0)  # Conv layer: 32x32x3 -> 28x28x6, kernel=5x5, stride=1, padding=0
lenet5_arch_cifar10["Tanh_1a"] = None  # followed by tanh nonlinear activation
lenet5_arch_cifar10["AvgPool2d_1"] = (2, 2, 0)  # followed by 2x2 AvgPool, stride = 2, padding=0
lenet5_arch_cifar10["Tanh_1b"] = None  # followed by tanh nonlinear activation
# 2: Convolutional Layer
lenet5_arch_cifar10["Conv2d_2"] = (6, 16, 5, 1, 0)  # Conv layer: 14x14x6 -> 10x10x16, kernel=5x5, stride=1, padding=0
lenet5_arch_cifar10["Tanh_2a"] = None  # followed by tanh nonlinear activation
lenet5_arch_cifar10["AvgPool2d_2"] = (2, 2, 0)  # followed by 2x2 AvgPool, stride = 2, padding=0
lenet5_arch_cifar10["Tanh_2b"] = None  # followed by tanh nonlinear activation
# 3: Flatten
lenet5_arch_cifar10["Flatten_3"] = None # flatten 5x5x16 -> 400
# 4: Fully Connected Layer
lenet5_arch_cifar10["Linear_4"] = (400, 120) # FC layer: 400 input units -> 120 output units
lenet5_arch_cifar10["Tanh_4"] = None  # followed by tanh nonlinear activation
# 5: Fully Connected Layer
lenet5_arch_cifar10["Linear_5"] = (120, 84) # FC layer: 120 input units -> 84 output units
lenet5_arch_cifar10["Tanh_5"] = None  # followed by tanh nonlinear activation
# 6: Fully Connected Output Layer
lenet5_arch_cifar10["Linear_6"] = (84, 10) # FC layer: 84 input units -> 10 output units

lenet5_arch_cifar10

OrderedDict([('Conv2d_1', (3, 6, 5, 1, 0)),
             ('Tanh_1a', None),
             ('AvgPool2d_1', (2, 2, 0)),
             ('Tanh_1b', None),
             ('Conv2d_2', (6, 16, 5, 1, 0)),
             ('Tanh_2a', None),
             ('AvgPool2d_2', (2, 2, 0)),
             ('Tanh_2b', None),
             ('Flatten_3', None),
             ('Linear_4', (400, 120)),
             ('Tanh_4', None),
             ('Linear_5', (120, 84)),
             ('Tanh_5', None),
             ('Linear_6', (84, 10))])

In [50]:
lenet5_cifar10 = ConvNet(lenet5_arch_cifar10)

print(lenet5_cifar10)

ConvNet(
  (layers): ModuleList(
    (0): Conv2d(3, 6, kernel_size=(5, 5), stride=(1, 1))
    (1): Tanh()
    (2): AvgPool2d(kernel_size=2, stride=2, padding=0)
    (3): Tanh()
    (4): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
    (5): Tanh()
    (6): AvgPool2d(kernel_size=2, stride=2, padding=0)
    (7): Tanh()
    (8): Flatten()
    (9): Linear(in_features=400, out_features=120, bias=True)
    (10): Tanh()
    (11): Linear(in_features=120, out_features=84, bias=True)
    (12): Tanh()
    (13): Linear(in_features=84, out_features=10, bias=True)
  )
)


Now train

In [51]:
cross_entropy_cifar10 = nn.CrossEntropyLoss()
adam_cifar10 = optim.Adam(lenet5_cifar10.parameters(), lr=0.001)

In [52]:
train(lenet5_cifar10, cifar10_trainloader, adam_cifar10, cross_entropy_cifar10, num_epochs=3)

[1000/6000] loss: 1.912
[2000/6000] loss: 1.730
[3000/6000] loss: 1.640
[4000/6000] loss: 1.582
[5000/6000] loss: 1.503
[1000/6000] loss: 1.441
[2000/6000] loss: 1.441
[3000/6000] loss: 1.432
[4000/6000] loss: 1.405
[5000/6000] loss: 1.398
[1000/6000] loss: 1.332
[2000/6000] loss: 1.334
[3000/6000] loss: 1.325
[4000/6000] loss: 1.318
[5000/6000] loss: 1.318


And then evaluate

In [53]:
evaluate(lenet5_cifar10, cifar10_testloader)

Test accuracy on 10000 test images: 51.6400 %


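LeNet-5 doesn't do as well as hoped on CIFAR-10: at 51.64% test accuracy it is roughly on par with the fully connected networks from the earlier notebooks. However, using the flexible CNN architecture module defined above, we'll explore other architectures on CIFAR-10 in the future.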