# Principles of CNN

Much of the material here is drawn from the Deep Learning book (Ian Goodfellow, Yoshua Bengio, and Aaron Courville).

Why do CNNs perform better than MLPs (multilayer perceptrons) in various modalities? In this notebook, you will learn the principles behind CNNs through examples and experiments. The three features that most clearly distinguish a CNN from an MLP are as follows:

1. **sparse interactions** : An MLP computes the interaction between every input unit and every output unit through a full matrix multiplication. A CNN instead has sparse interactions, because its kernels are much smaller than the input image, so each output unit depends only on a small local window of the input. This greatly reduces computation and memory requirements and improves statistical efficiency. This property is also called sparse connectivity or sparse weights.


2. **parameter sharing** : Parameter sharing means using the same parameters more than once within a model. In an MLP, each weight is used exactly once when computing the output of a layer. In a CNN, the same kernel is applied at every spatial position, so one set of weights is reused across the whole input. This reduces the memory needed to store parameters.


3. **translational equivariance** : Parameter sharing in the convolution operation makes a convolution layer equivariant to translations of its input. A function is equivariant to an operation when applying that operation to the input changes the output in the same way. More formally, a function $f$ is equivariant to a transformation $g$ if $f(g(x)) = g(f(x))$. In the case of convolution, $g$ is a translation of the input $x$.

    While convolution is equivariant to translation, it is not equivariant to other transformations such as rotation, scaling, or warping. Techniques such as data augmentation are therefore used during training to make CNNs robust to these transformations. However, this will not be covered in this notebook.
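
To put rough numbers on the first two properties, the sketch below counts the weights of a single layer that maps a 3x32x32 image to 100 feature maps of size 32x32, once as a fully connected layer and once as a 5x5 convolution. The layer sizes are illustrative assumptions chosen to match CIFAR-10 images; biases are ignored.

```python
# Illustrative sizes (assumptions): a CIFAR-10-sized input and 100 output maps.
in_channels, height, width = 3, 32, 32
out_channels, kernel = 100, 5

n_in = in_channels * height * width      # 3072 input units
n_out = out_channels * height * width    # 102400 output units

# Fully connected: every output unit has its own weight for every input unit.
dense_weights = n_in * n_out

# Convolution: each output unit sees only a kernel x kernel window of every
# input channel (sparse interactions), and all spatial positions reuse the
# same kernel (parameter sharing).
inputs_per_output_conv = in_channels * kernel * kernel
conv_weights = out_channels * inputs_per_output_conv

print(f"dense weights: {dense_weights:,}")   # 314,572,800
print(f"conv weights:  {conv_weights:,}")    # 7,500
print(f"inputs per output unit: dense={n_in}, conv={inputs_per_output_conv}")
```

Sparse interactions shrink the number of inputs each output unit reads from 3072 to 75, and parameter sharing means those 75 weights per output channel are stored once rather than once per pixel.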



In [1]:
# As usual, a bit of setup

import IPython
from IPython.display import display, HTML

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torch.utils.data import sampler

import torchvision.datasets as dset
import torchvision.transforms as T

import matplotlib.pyplot as plt
import numpy as np
import random 

seed = 7
torch.manual_seed(seed)
random.seed(seed)
np.random.seed(seed)

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

## MLP vs CNN

In this problem, we will learn why the properties of CNNs lead to better performance on vision tasks by comparing a CNN with an MLP.

The number of parameters is a common rule of thumb for comparing the expressiveness of neural networks. We therefore compare an MLP and a CNN with a similar number of parameters and see how their performance differs.

First, we load the CIFAR-10 dataset. This might take a couple of minutes the first time you do it, but the files should stay cached after that.

PyTorch provides convenient tools to automate the process of downloading 
common datasets, processing the data, and splitting into minibatches.

In [21]:
NUM_TRAIN = 49000

# The torchvision.transforms package provides tools for preprocessing data
# and for performing data augmentation; here we set up a transform to
# preprocess the data by subtracting the mean RGB value and dividing by the
# standard deviation of each RGB value; we've hardcoded the mean and std.
# If we want to add data augmentations, torchvision also offers different 
# transformations that we can compose in here, though we would need to be sure
# to have two sets of transformations: one with data augmentation for the 
# training loaders, and one without for the test and validation sets.
transform = T.Compose([
                T.ToTensor(),
                T.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
            ])

# We set up a Dataset object for each split (train / val / test); Datasets load
# training examples one at a time, so we wrap each Dataset in a DataLoader which
# iterates through the Dataset and forms minibatches. We divide the CIFAR-10
# training set into train and val sets by passing a Sampler object to the
# DataLoader telling how it should sample from the underlying Dataset.
# Note: download=False assumes the dataset already exists under the given root
# directory; pass download=True to let torchvision fetch it first.
cifar10_train = dset.CIFAR10('./../../cifar-10/', train=True, download=False,
                             transform=transform)
loader_train = DataLoader(cifar10_train, batch_size=64, 
                          sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN)))

cifar10_val = dset.CIFAR10('./../../cifar-10/', train=True, download=False,
                           transform=transform)
loader_val = DataLoader(cifar10_val, batch_size=64, 
                        sampler=sampler.SubsetRandomSampler(range(NUM_TRAIN, 50000)))

cifar10_test = dset.CIFAR10('./../../cifar-10/', train=False, download=False, 
                            transform=transform)
loader_test = DataLoader(cifar10_test, batch_size=64)
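
The train / val split above relies on two loaders sharing one Dataset while their samplers draw from disjoint index ranges. The sketch below shows the same mechanism on a small random dataset; `toy_dataset`, its size, and the batch size are hypothetical stand-ins, not part of the notebook.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data import sampler

# Hypothetical stand-in for CIFAR-10: 100 random "images" with integer labels.
num_total, num_train = 100, 90
toy_dataset = TensorDataset(torch.randn(num_total, 3, 32, 32),
                            torch.randint(0, 10, (num_total,)))

# The same Dataset backs both loaders; each sampler restricts its loader to a
# disjoint index range, which is what produces the train / val split.
toy_loader_train = DataLoader(
    toy_dataset, batch_size=16,
    sampler=sampler.SubsetRandomSampler(range(num_train)))
toy_loader_val = DataLoader(
    toy_dataset, batch_size=16,
    sampler=sampler.SubsetRandomSampler(range(num_train, num_total)))

# Count how many examples each loader yields in one pass.
n_train_seen = sum(x.shape[0] for x, _ in toy_loader_train)
n_val_seen = sum(x.shape[0] for x, _ in toy_loader_val)
print(n_train_seen, n_val_seen)
```

Because the sampler shuffles only within its own index range, validation examples never leak into training batches.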

You can **use a GPU by setting the `USE_GPU` flag to True below**. Using a GPU is recommended but not required for this assignment. If your machine does not have CUDA enabled, `torch.cuda.is_available()` returns False and the notebook falls back to CPU mode.

The global variables `dtype` and `device` will control the data types throughout this assignment.

## Colab Users

If you are using Colab, you need to manually switch to a GPU device. You can do this by clicking `Runtime -> Change runtime type` and selecting `GPU` under `Hardware Accelerator`. Note that you have to rerun the cells from the top since the kernel gets restarted upon switching runtimes.

In [4]:
USE_GPU = True

dtype = torch.float32 # we will be using float throughout this tutorial

if USE_GPU and torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

# Constant to control how frequently we print train loss
print_every = 500

print('using device:', device)

using device: cuda


### Training Helpers

In [3]:
def check_valid_accuracy(loader, model):
    # print('Checking accuracy on validation set')
    if not loader.dataset.train:
        print('Checking accuracy on test set')   
    num_correct = 0
    num_samples = 0
    model.eval()  # set model to evaluation mode
    with torch.no_grad():
        for x, y in loader:
            x = x.to(device=device, dtype=dtype)  # move to device, e.g. GPU
            y = y.to(device=device, dtype=torch.long)
            scores = model(x)
            _, preds = scores.max(1)
            num_correct += (preds == y).sum()
            num_samples += preds.size(0)
        acc = float(num_correct) / num_samples
        # print('Got %d / %d correct (%.2f)' % (num_correct, num_samples, 100 * acc))
    return acc

def train_model(model, optimizer, epochs=1):
    """
    Train a model on CIFAR-10 using the PyTorch Module API and print the
    training loss and validation accuracy periodically during training.
    
    Inputs:
    - model: A PyTorch Module giving the model to train.
    - optimizer: An Optimizer object we will use to train the model
    - epochs: (Optional) A Python integer giving the number of epochs to train for
    
    Returns: A list of validation accuracies, one per epoch.
    """
    model = model.to(device=device)  # move the model parameters to CPU/GPU
    val_accs = []
    for e in range(epochs):
        for t, (x, y) in enumerate(loader_train):
            model.train()  # put model to training mode
            x = x.to(device=device, dtype=dtype)  # move to device, e.g. GPU
            y = y.to(device=device, dtype=torch.long)

            scores = model(x)
            loss = F.cross_entropy(scores, y)

            # Zero out all of the gradients for the variables which the optimizer
            # will update.
            optimizer.zero_grad()

            # This is the backwards pass: compute the gradient of the loss with
            # respect to each trainable parameter of the model.
            loss.backward()

            # Actually update the parameters of the model using the gradients
            # computed by the backwards pass.
            optimizer.step()

            if t % print_every == 0:
                acc = check_valid_accuracy(loader_val, model)
                print(f"Epoch [{e}/{epochs}]: Iter {t}, loss = {round(loss.item(), 4)}, validation accuracy = {100*acc}%")
        val_accs.append(check_valid_accuracy(loader_val, model))
    return val_accs

### Define 3 Layer MLP and CNN

In [5]:
class ThreeLayerConvNet(nn.Module):
    def __init__(self, in_channel, channel_1, channel_2, num_classes):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channel, channel_1, 5, stride=1, padding=2)
        self.conv2 = nn.Conv2d(channel_1, channel_2, 3, stride=1, padding=1)
        self.classifier = nn.Linear(channel_2 * 32 * 32, num_classes)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = x.flatten(start_dim=1)
        x = self.classifier(x)

        return x

class ThreeLayerMLP(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        x = x.flatten(start_dim=1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        x = F.relu(x)
        x = self.classifier(x)

        return x
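
The classifier in `ThreeLayerConvNet` expects `channel_2 * 32 * 32` inputs because both convolutions preserve the 32x32 spatial size. A minimal sketch of that arithmetic, using the standard output-size formula for a convolution with dilation 1 (the helper `conv_output_size` is a name introduced here for illustration):

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

# conv1: 5x5 kernel, stride 1, padding 2; conv2: 3x3 kernel, stride 1, padding 1
after_conv1 = conv_output_size(32, kernel=5, stride=1, padding=2)
after_conv2 = conv_output_size(after_conv1, kernel=3, stride=1, padding=1)

channel_2 = 100  # value used in the parameter-count cell of this notebook
classifier_in_features = channel_2 * after_conv2 * after_conv2

print(after_conv1, after_conv2, classifier_in_features)  # 32 32 102400
```

Changing the kernel size without adjusting the padding would change the spatial size, and the `nn.Linear` input dimension would have to change with it.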

### Count the number of parameters

Note that with the given architecture hyperparameters, the MLP has slightly more parameters than the CNN.

In [6]:
def count_parameters(model):
    num_params = sum(p.numel() for p in model.parameters())
    return num_params

input_size = 3 * 32 * 32
in_channel = 3
channel_1 = 100
channel_2 = 100
num_classes = 10
hidden_size = 350

mlp_model = ThreeLayerMLP(input_size, hidden_size, num_classes)
cnn_model = ThreeLayerConvNet(in_channel, channel_1, channel_2, num_classes)

num_params_mlp = count_parameters(mlp_model)
num_params_cnn = count_parameters(cnn_model)

print(f"#params of MLP model : {num_params_mlp}")
print(f"#params of CNN model : {num_params_cnn}")


#params of MLP model : 1201910
#params of CNN model : 1121710
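
The two totals above can be reproduced by hand from the layer shapes. The sketch below does so with two small helper functions (`linear_params` and `conv2d_params` are names introduced here for illustration); each layer contributes its weights plus one bias per output unit or channel.

```python
def linear_params(n_in, n_out):
    # weight matrix (n_in x n_out) plus one bias per output unit
    return n_in * n_out + n_out

def conv2d_params(c_in, c_out, kernel):
    # one kernel x kernel filter per (input, output) channel pair,
    # plus one bias per output channel
    return c_in * c_out * kernel * kernel + c_out

mlp_layers = [
    linear_params(3 * 32 * 32, 350),   # fc1
    linear_params(350, 350),           # fc2
    linear_params(350, 10),            # classifier
]
cnn_layers = [
    conv2d_params(3, 100, 5),          # conv1
    conv2d_params(100, 100, 3),        # conv2
    linear_params(100 * 32 * 32, 10),  # classifier
]

print(mlp_layers, sum(mlp_layers))  # [1075550, 122850, 3510] 1201910
print(cnn_layers, sum(cnn_layers))  # [7600, 90100, 1024010] 1121710
```

Most of the CNN's parameters sit in its final linear classifier; the two convolution layers together hold fewer than 100,000.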


In [9]:
from torchsummary import summary

## Training the model
Now we will train both the MLP and the CNN.

In [11]:
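format_input_size = (3, 32, 32)

# Move both models to the selected device (GPU if available, otherwise CPU)
mlp_model = mlp_model.to(device)
cnn_model = cnn_model.to(device)

# torchsummary prints a per-layer table of output shapes and parameter counts
summary(mlp_model, input_size=format_input_size)
summary(cnn_model, input_size=format_input_size)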

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Linear-1                  [-1, 350]       1,075,550
            Linear-2                  [-1, 350]         122,850
            Linear-3                   [-1, 10]           3,510
Total params: 1,201,910
Trainable params: 1,201,910
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.01
Forward/backward pass size (MB): 0.01
Params size (MB): 4.58
Estimated Total Size (MB): 4.60
----------------------------------------------------------------
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1          [-1, 100, 32, 32]           7,600
            Conv2d-2          [-1, 100, 32, 32]          90,100
            Linear-3                   [-1, 10]       1,024,010
Total params: 1,121,710
Trainable par

In [12]:
learning_rate = 3e-3

mlp_optimizer = optim.SGD(mlp_model.parameters(), lr=learning_rate)
cnn_optimizer = optim.SGD(cnn_model.parameters(), lr=learning_rate)

total_epochs = 2
# total_epochs = 5

print("Start MLP training...")
mlp_accuracy = train_model(mlp_model, mlp_optimizer, total_epochs)
print("Start CNN training...")
cnn_accuracy = train_model(cnn_model, cnn_optimizer, total_epochs)

Start MLP training...
Epoch [0/2]: Iter 0, loss = 2.3459, validation accuracy = 12.2%
Epoch [0/2]: Iter 500, loss = 2.0613, validation accuracy = 32.300000000000004%
Epoch [1/2]: Iter 0, loss = 1.7054, validation accuracy = 34.599999999999994%
Epoch [1/2]: Iter 500, loss = 1.8353, validation accuracy = 38.0%
Start CNN training...
Epoch [0/2]: Iter 0, loss = 2.3155, validation accuracy = 11.700000000000001%
Epoch [0/2]: Iter 500, loss = 1.5784, validation accuracy = 46.9%
Epoch [1/2]: Iter 0, loss = 1.4729, validation accuracy = 49.2%
Epoch [1/2]: Iter 500, loss = 1.449, validation accuracy = 53.2%


Q. What are the final accuracies of the MLP and CNN models? Why are they different?

A.


## Translational Equivariance

In this problem, we will check that a CNN is translationally equivariant and an MLP is not.
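
The definition $f(g(x)) = g(f(x))$ can also be checked numerically. The sketch below is a minimal version under one loud assumption: the convolution uses circular padding, so that a wrap-around shift (`torch.roll`) commutes with it exactly. The networks used later in this notebook use zero padding, for which the equality holds only away from the image border. The layer sizes and the 8x8 input are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Assumption: circular padding, so a shift that wraps around the border
# commutes exactly with the convolution.
conv = nn.Conv2d(1, 4, kernel_size=3, padding=1, padding_mode='circular')
fc = nn.Linear(8 * 8, 8 * 8)

def f_conv(x):
    return conv(x)

def f_mlp(x):
    # Flatten, apply a dense layer, and reshape back to an image.
    return fc(x.flatten(start_dim=1)).reshape(x.shape)

def g(t):
    # Translate by 2 rows and 3 columns, wrapping around the border.
    return torch.roll(t, shifts=(2, 3), dims=(2, 3))

x = torch.randn(1, 1, 8, 8)
with torch.no_grad():
    conv_is_equivariant = torch.allclose(f_conv(g(x)), g(f_conv(x)), atol=1e-5)
    mlp_is_equivariant = torch.allclose(f_mlp(g(x)), g(f_mlp(x)), atol=1e-5)

print("conv equivariant:", conv_is_equivariant)
print("mlp equivariant: ", mlp_is_equivariant)
```

Both layers are randomly initialized and untrained, so any difference between them here comes from the architecture alone.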

In [13]:
## Some helpers
def torch_to_numpy(tensor):
    tensor = tensor.cpu().detach().numpy()
    return tensor

def preprocess_mnist_data(data):
    # padding tuples: (padding_left, padding_right, padding_top, padding_bottom)
    # data1 = F.pad(data, (0, 28, 0, 28), mode='constant', value=0)
    # data2 = F.pad(data, (28, 0, 0, 28), mode='constant', value=0)
    # data3 = F.pad(data, (0, 28, 28, 0), mode='constant', value=0)
    # data4 = F.pad(data, (28, 0, 28, 0), mode='constant', value=0)
    # data = torch.cat((data1, data2, data3, data4), dim=0)

    padded_data_list = []

    for i in range(0, 28, 4):
        for j in range(0, 28, 4):
            padded_data_list.append(F.pad(data, (i, 28-i, j, 28-j), mode='constant', value=0))
    
    padded_data = torch.stack(padded_data_list, dim=0)

    return padded_data

class CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 20, 3, 1, padding=1)
        self.conv2 = nn.Conv2d(20, 40, 3, 1, padding=1)
        self.conv3 = nn.Conv2d(40, 1, 3, 1, padding=1)
        # self.conv4 = nn.Conv2d(40, 1, 3, 1, padding=1)
    def forward(self, x):
        x = F.relu(self.conv1(x))   
        x = F.relu(self.conv2(x)) 
        x = F.relu(self.conv3(x)) 
        # x = F.relu(self.conv4(x)) 
        return x

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(56*56, 100)
        self.fc2 = nn.Linear(100, 100)
        self.fc3 = nn.Linear(100, 56*56)
        
    def forward(self, x):
        bs = x.shape[0]
        x = x.flatten(start_dim=1)
        x = F.relu(self.fc1(x))   
        x = F.relu(self.fc2(x)) 
        x = F.relu(self.fc3(x)) 
        x = x.reshape((bs, 1, 56, 56))

        return x
        

### Generate the sample data

In [14]:
mnist_train = dset.MNIST('./deeplearning/datasets', train=True, download=True)
sample_index = 12
mnist_sample = mnist_train[sample_index][0]
mnist_sample = T.functional.pil_to_tensor(mnist_sample)

mnist_sample = preprocess_mnist_data(mnist_sample)
print(mnist_sample.shape)

torch.Size([49, 1, 56, 56])
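
The shape above follows from the padding scheme: each axis uses the offsets 0, 4, ..., 24 (7 values), giving 7 x 7 = 49 translated copies, and every copy is padded by 28 pixels in total per axis, giving a 56x56 canvas. The sketch below repeats the same padding on an all-ones stand-in image (an assumption, used instead of a real MNIST digit so the example runs without downloading data).

```python
import torch
import torch.nn.functional as F

digit = torch.ones(1, 28, 28)      # stand-in for one MNIST image (assumption)
offsets = list(range(0, 28, 4))    # 0, 4, ..., 24 -> 7 offsets per axis

# F.pad takes (left, right, top, bottom) for the last two dimensions, so
# (dx, 28 - dx, dy, 28 - dy) moves the digit dx pixels right and dy pixels down.
copies = [F.pad(digit, (dx, 28 - dx, dy, 28 - dy), mode='constant', value=0)
          for dx in offsets for dy in offsets]
batch = torch.stack(copies, dim=0)
print(batch.shape)

# Every copy keeps all 28 * 28 pixels; only its position on the canvas changes.
pixel_sums = batch.sum(dim=(1, 2, 3))
print(pixel_sums.unique())
```

Index 1 of the batch corresponds to `dx = 0, dy = 4`, so consecutive slider positions in the visualization move the digit downward first and to the right every seventh step.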


### Visualize the output of MLP and CNN with different translations

In [56]:
# !pip3 install ipywidgets

In [15]:
from ipywidgets import interactive, widgets, Layout

In [16]:
cnn_model = CNN().to(device)
mlp_model = MLP().to(device)

mnist_sample = mnist_sample.to(device)
# Convert to float32
mnist_sample = mnist_sample.float()

cnn_output = torch_to_numpy(cnn_model(mnist_sample))
mlp_output = torch_to_numpy(mlp_model(mnist_sample))
mnist_sample = torch_to_numpy(mnist_sample)

In [17]:
fig = plt.figure(figsize=(5, 5))

# Main update function for interactive plot
def update_images(i):
    fig.clear()
    f, axarr = plt.subplots(1,3, figsize=(15, 5))
    
    # Show the images
    axarr[0].imshow(mnist_sample[i, 0, :, :])
    axarr[1].imshow(cnn_output[i, 0, :, :])
    axarr[2].imshow(mlp_output[i, 0, :, :])

    # Set the titles
    axarr[0].set_title('Input Image')
    axarr[1].set_title('CNN Output')
    axarr[2].set_title('MLP Output')

    # Hide the axes on all three panels (plt.axis only affects the last one)
    for ax in axarr:
        ax.axis('off')

# Create interactive plot
ip = interactive(update_images, i=widgets.IntSlider(min=0, max=48, step=1, value=0))
ip


Q. What do you observe? Why is it different from the case of MLP?

A.

Q. Note that even though the CNN is not trained, its feature maps are not only translationally equivariant but also capture fairly good features. Why is that?

A.

Q. What would happen if we froze the CNN backbone and trained only the final layer? Why?

A.
