# Welcome to CS 5242 **Homework 4**

ASSIGNMENT DEADLINE ⏰ : **19 Sept 2022** 

In this assignment, we have three parts:

1. Implement some functions in CNNs from scratch *(3 Points)*
2. Implement a CNN and train for CIFAR10 using PyTorch *(5 Points)*
3. Discussion (parameters and FLOPs for AlexNet) *(2 Points)*

Colab is a hosted Jupyter notebook service that requires no setup and provides free access to computing resources, including GPUs. This semester, we will use Colab to run our experiments.

> In this assignment, we **need a GPU** to train the CNN model. You may need to **choose GPU in Runtime -> Change runtime type -> Hardware accelerator**.

### **Grades Policy**

This homework is worth 10 points. Late submissions lose 15% per day, and a submission made 7 days after the deadline receives a score of 0.

### **Cautions**

**DO NOT** use external libraries like PyTorch or TensorFlow in your from-scratch implementation (Part 1).

**DO NOT** copy the code from the internet, e.g. GitHub.

---

### **Contact**

Please feel free to contact us if you have any questions about this homework or need any further information.

Slack (Recommended): Shenggan Cheng

TA Email: shenggan@comp.nus.edu.sg

> If you have not joined the Slack group, you can click [here](https://join.slack.com/t/cs5242ay20222-oiw1784/shared_invite/zt-1eiv24k1t-0J9EI7vz3uQmAHa68qU0aw)

## Setup

Start by running the cell below to set up all required software.

In [1]:
!pip install numpy matplotlib torch

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Import the necessary libraries and fix the random seeds for Python, NumPy, and PyTorch.

In [2]:
import math
import random

import numpy as np
import torch
import torchvision
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision.transforms as transforms
import matplotlib.pyplot as plt

random.seed(0)
np.random.seed(0)
torch.manual_seed(0)

<torch._C.Generator at 0x7f3cba2381d0>

Now let's set up the GPU environment. Colab provides a free GPU. Do as follows:

- Runtime -> Change Runtime Type -> select `GPU` in Hardware accelerator
- Click `connect` on the top-right

After connecting to a GPU, you can check its status using the `nvidia-smi` command.

In [3]:
!nvidia-smi

torch.cuda.is_available()

Tue Sep 20 15:04:29 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   74C    P8    12W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

True

Everything is ready. You can move on, and ***Good Luck !*** 😃

## Implement functions in CNNs from scratch

In this section, you need to implement some functions commonly used in CNNs, including convolution, pooling, etc. 

We will compare the results of your implementation with those of PyTorch. For a correct implementation, the error between the two should be very small.

NOTE: 

1. Implement these functions from scratch, **without** using any neural network libraries. Using linear algebra libraries in Python is OK.

2. Performance is not part of the grading. You only need to pay attention to the correctness of your implementation.

### Step 1
Given an input of 32x32 pixels with 3 channels, create a torch tensor with `torch.randn()`.

In [4]:
batch_size = 2
x = torch.randn(batch_size, 3, 32, 32)

### Step 2

For each of the following functions, get the output tensor "torch_xxx_out" with x as input:

In [5]:
torch_max_pool = nn.MaxPool2d(kernel_size=2,
                              stride=1,
                              padding=0,
                              dilation=1,
                              return_indices=False,
                              ceil_mode=False)
torch_avg_pool = nn.AvgPool2d(kernel_size=2,
                              stride=1,
                              padding=0,
                              ceil_mode=False,
                              count_include_pad=True,
                              divisor_override=None)
torch_conv = nn.Conv2d(in_channels=3,
                       out_channels=6,
                       kernel_size=3,
                       stride=1,
                       padding=0,
                       dilation=1,
                       groups=1,
                       bias=True,
                       padding_mode='zeros')
torch_norm = nn.BatchNorm2d(3)

In [6]:
torch_sigmoid_out = torch.sigmoid(x, out=None)
tmp_tensor = torch.randint(3, (batch_size,))
torch_cross_entropy_out = F.cross_entropy(x[::, ::, 0, 0], tmp_tensor)

In [7]:
torch_max_pool_out = torch_max_pool(x)
torch_avg_pool_out = torch_avg_pool(x)
torch_conv_out = torch_conv(x)
torch_norm_out = torch_norm(x)

### Step 3

Implement these functions from scratch, without using any neural network libraries. Using linear algebra libraries in Python is OK. Output your tensors as "my_xxx_out".

In [8]:
def my_max_pool(x, kernel_size, stride, padding):
    """
    Args:
        x: torch tensor with size (N, C_in, H_in, W_in),
        kernel_size: size of the window to take a max over, 
        stride: stride of the window,
        padding: implicit negative-infinity padding to be added on both sides
            (as in nn.MaxPool2d, so padded cells never win the max),
        
    Return:
        y: torch tensor of size (N, C_out, H_out, W_out).
    """

    y = None

    # === Complete the code (0.5')
    x_numpy = x.numpy()
    b, c, h_in, w_in = x_numpy.shape
    
    h_out = math.floor((h_in + 2*padding - kernel_size)/stride + 1)
    w_out = math.floor((w_in + 2*padding - kernel_size)/stride + 1)
    output = np.zeros((b, c, h_out, w_out))

    # Pad with -inf rather than 0: a zero pad would win over all-negative windows.
    padded_x = np.full((b, c, h_in + 2*padding, w_in + 2*padding), -np.inf)
    padded_x[:, :, padding:padding + h_in, padding:padding + w_in] = x_numpy
  
    for i in range(h_out):
      for j in range(w_out):
        h_start = i * stride
        h_end = h_start + kernel_size
        w_start = j * stride
        w_end = w_start + kernel_size

        x_slice = padded_x[:, :, h_start:h_end, w_start:w_end]
        output[:, :, i, j] = np.amax(x_slice, axis = (2, 3))

    y = torch.from_numpy(output)
    # === Complete the code
    return y
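
The output-size arithmetic used in the pooling and convolution functions can be checked in isolation. The following is a minimal sketch; the helper name `window_output_size` is hypothetical and not part of the assignment, and it assumes floor rounding (`ceil_mode=False`) and a dilation of 1.

```python
def window_output_size(size_in, kernel_size, stride=1, padding=0):
    """Spatial output size of a sliding window (floor mode, dilation 1)."""
    return (size_in + 2 * padding - kernel_size) // stride + 1

# Pooling settings used in this notebook: 32x32 input, kernel 2, stride 1, no padding.
print(window_output_size(32, kernel_size=2, stride=1, padding=0))  # 31
# Convolution settings used in this notebook: kernel 3, stride 1, no padding.
print(window_output_size(32, kernel_size=3, stride=1, padding=0))  # 30
```

Integer floor division gives the same result as `math.floor(... / stride + 1)` for integer inputs and avoids the detour through floating point.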

In [9]:
def my_avg_pool(x, kernel_size, stride, padding):
    """
    Args:
        x: torch tensor with size (N, C_in, H_in, W_in),
        kernel_size: size of the window, 
        stride: stride of the window,
        padding: implicit zero padding to be added on both sides,
        
    Return:
        y: torch tensor of size (N, C_out, H_out, W_out).
    """

    y = None
    # === Complete the code (0.5')
    x_numpy = x.numpy()
    b, c, h_in, w_in = x_numpy.shape
    
    h_out = math.floor((h_in + 2*padding - kernel_size)/stride + 1)
    w_out = math.floor((w_in + 2*padding - kernel_size)/stride + 1)
    output = np.zeros((b, c, h_out, w_out))

    # Need to pad
    padded_x = np.zeros((b, c, h_in + 2*padding, w_in + 2*padding))
    padded_x[:, :, padding:(x_numpy.shape[2]+padding), padding:(x_numpy.shape[3]+padding)] = x_numpy
  
    for i in range(h_out):
      for j in range(w_out):
        h_start = i * stride
        h_end = h_start + kernel_size
        w_start = j * stride
        w_end = w_start + kernel_size

        x_slice = padded_x[:, :, h_start:h_end, w_start:w_end]
        output[:, :, i, j] = np.mean(x_slice, axis = (2, 3))

    y = torch.from_numpy(output)
    # === Complete the code
    return y

In [10]:
def my_conv(x, in_channels, out_channels, kernel_size, stride, padding, weight, bias):
    """
    Args:
        x: torch tensor with size (N, C_in, H_in, W_in),
        in_channels: number of channels in the input image, it is C_in;
        out_channels: number of channels produced by the convolution;
        kernel_size: size of the convolving kernel, 
        stride: stride of the convolution,
        padding: implicit zero padding to be added on both sides of each dimension,
        weight: torch tensor with size (C_out, C_in, kernel_size, kernel_size),
        bias: torch tensor with size (C_out,),
        
    Return:
        y: torch tensor of size (N, C_out, H_out, W_out)
    """

    y = None
    # === Complete the code (0.5')
    x_numpy = x.numpy()
    n, c_in, h_in, w_in = x_numpy.shape
    
    weight_numpy = weight.detach().numpy()
    n_w, c_w, h_w, w_w = weight_numpy.shape
    bias_numpy = bias.detach().numpy()
    
    h_out = math.floor((h_in + 2*padding - kernel_size)/stride + 1)
    w_out = math.floor((w_in + 2*padding - kernel_size)/stride + 1)
    output = np.zeros((n, out_channels, h_out, w_out))

    # Need to pad
    padded_x = np.zeros((n, c_in, h_in + 2*padding, w_in + 2*padding))
    padded_x[:, :, padding:(x_numpy.shape[2]+padding), padding:(x_numpy.shape[3]+padding)] = x_numpy

    for k in range(n):
      for out_c in range(out_channels):
        for i in range(h_out):
          for j in range(w_out):
            h_start = i * stride
            h_end = h_start + kernel_size
            w_start = j * stride
            w_end = w_start + kernel_size

            result = 0
            for in_c in range(in_channels):
              x_slice = padded_x[k, in_c, h_start:h_end, w_start:w_end]
              a = np.multiply(x_slice, weight_numpy[out_c, in_c])
              result += np.sum(a)

            output[k, out_c, i, j] = result + bias_numpy[out_c]

    y = torch.from_numpy(output)
    # === Complete the code
    return y
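
The nested loops above are easy to verify but slow. As an optional illustration, the same computation can be expressed with window views and a single `einsum`. This is a sketch under assumptions: the function name `conv2d_vectorized` is hypothetical, it works on NumPy arrays rather than torch tensors, it assumes a square kernel with no dilation or groups, and it needs NumPy 1.20 or newer for `sliding_window_view`.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_vectorized(x, weight, bias, stride=1, padding=0):
    """x: (N, C_in, H, W); weight: (C_out, C_in, K, K); bias: (C_out,)."""
    k = weight.shape[2]
    # Zero-pad only the two spatial axes.
    x = np.pad(x, ((0, 0), (0, 0), (padding, padding), (padding, padding)))
    # One K x K patch per spatial position: (N, C_in, H - K + 1, W - K + 1, K, K).
    windows = sliding_window_view(x, (k, k), axis=(2, 3))
    # Keep every stride-th window along both spatial axes.
    windows = windows[:, :, ::stride, ::stride]
    # Sum over input channels and kernel positions for every output channel.
    out = np.einsum('nchwij,ocij->nohw', windows, weight)
    return out + bias[None, :, None, None]

rng = np.random.default_rng(0)
x_demo = rng.standard_normal((2, 3, 32, 32))
w_demo = rng.standard_normal((6, 3, 3, 3))
b_demo = rng.standard_normal(6)
print(conv2d_vectorized(x_demo, w_demo, b_demo).shape)  # (2, 6, 30, 30)
```

Since performance is not graded, the loop version is perfectly acceptable for this homework; the vectorized form is mainly useful as a second implementation to cross-check against.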

In [11]:
def my_batchnorm(x, num_features, eps):
    """
    Args:
        x: torch tensor with size (N, C, H, W),
        num_features: number of features in the input tensor, it is C;
        eps: a value added to the denominator for numerical stability. Default: 1e-5
        
    Return:
        y: torch tensor of size (N, C, H, W)
    """

    y = torch.empty_like(x)
    # === Complete the code (0.5')
    x_numpy = x.numpy()
    N, C, H, W = x_numpy.shape
     
    output = y.numpy()

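    for in_c in range(num_features):
      x_slice = x_numpy[:, in_c]
      # Per-channel statistics over (N, H, W), computed once per channel.
      # var() is the biased variance, which batch norm uses in training mode.
      mean = x_slice.mean()
      var = x_slice.var()

      output[:, in_c] = (x_slice - mean) / (var + eps)**0.5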

    y = torch.from_numpy(output)
    # === Complete the code
    return y
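
For reference, the per-channel normalization can also be written without any explicit loop. This is a minimal NumPy sketch, not the required solution: the name `batchnorm2d_train` is hypothetical, and it assumes training-mode behavior with the default affine parameters of a freshly created `nn.BatchNorm2d` (scale 1, shift 0), so no learnable terms appear.

```python
import numpy as np

def batchnorm2d_train(x, eps=1e-5):
    """Normalize each channel of x (N, C, H, W) with its own batch statistics."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)   # biased variance (ddof=0)
    return (x - mean) / np.sqrt(var + eps)

x_demo = np.random.default_rng(0).standard_normal((2, 3, 32, 32))
y_demo = batchnorm2d_train(x_demo)
print(y_demo.mean(axis=(0, 2, 3)))  # per-channel means, close to 0
print(y_demo.var(axis=(0, 2, 3)))   # per-channel variances, close to 1
```

`keepdims=True` keeps the statistics in shape (1, C, 1, 1), so they broadcast against the input without manual reshaping.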

In [12]:
def my_sigmoid(x):
    """
    Args:
        x: torch tensor with any size

    Return:
        y: the logistic sigmoid function of x
    """
    y = None
    # === Complete the code (0.5')
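    # Convert explicitly instead of relying on NumPy's implicit handling of torch tensors.
    y = torch.from_numpy(1 / (1 + np.exp(-x.numpy())))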
    # === Complete the code
    return y
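
The direct formula is fine for the standard-normal inputs used here. For inputs with large magnitude, `exp` of a large positive number overflows and can raise warnings. A common workaround, sketched below under the assumption of NumPy inputs (the name `stable_sigmoid` is hypothetical), is to only ever exponentiate non-positive values.

```python
import numpy as np

def stable_sigmoid(x):
    """Logistic sigmoid that never exponentiates a positive number."""
    x = np.asarray(x, dtype=np.float64)
    e = np.exp(-np.abs(x))  # always in [0, 1], so it cannot overflow
    # x >= 0: 1 / (1 + exp(-x));  x < 0: exp(x) / (1 + exp(x))
    return np.where(x >= 0, 1 / (1 + e), e / (1 + e))

print(stable_sigmoid(np.array([-1000.0, 0.0, 1000.0])))
```

Both branches are algebraically equal to the logistic function; they differ only in which one is numerically safe for a given sign of `x`.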

In [13]:
def my_cross_entropy(p, y):
    """
    Args:
        p: torch tensor with size of (N, C),
        y (int): torch tensor with size of (N), the values range from 0 to C-1

    Return:
        loss: the cross_entropy of predicted values p and target y.
    """
    loss = None
    # === Complete the code (0.5')
    N, C = p.shape
    log_loss = torch.zeros(N, C)

    for i in range(N):
        p_sum = 0
        for j in range(C):
            p_sum += np.exp(p[i, j])
        
        for j in range(C):
            log_loss[i, j] = np.log(np.exp(p[i, j]) / p_sum)

    loss = 0
    for i in range(N):
        loss -= log_loss[i, y[i]]
        
    loss /= N
    # === Complete the code
    return loss
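
The loop version above exponentiates the raw scores, which can overflow when a score is large. A standard remedy is to subtract each row's maximum before exponentiating, which leaves the softmax unchanged. The sketch below illustrates this under assumptions: it takes NumPy arrays (or nested lists) instead of torch tensors, and the name `cross_entropy_stable` is hypothetical.

```python
import numpy as np

def cross_entropy_stable(logits, targets):
    """Mean cross entropy of logits (N, C) against integer targets (N,)."""
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets)
    # Shift so the largest entry in each row is 0; exp() then stays in (0, 1].
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the target class in every row and average.
    return -log_probs[np.arange(len(targets)), targets].mean()

# Two classes with equal scores: the loss is log(2) regardless of the target.
print(cross_entropy_stable([[0.0, 0.0], [0.0, 0.0]], [0, 1]))
```

The shift cancels out in the ratio `exp(p_j) / sum(exp(p))`, so the result matches the unshifted formula whenever the latter does not overflow.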

In [14]:
my_max_pool_out = my_max_pool(x, kernel_size=2, stride=1, padding=0)
my_avg_pool_out = my_avg_pool(x, kernel_size=2, stride=1, padding=0)
my_conv_out = my_conv(x,
                      in_channels=3,
                      out_channels=6,
                      kernel_size=3,
                      stride=1,
                      padding=0,
                      weight=torch_conv.weight,
                      bias=torch_conv.bias)
my_norm_out = my_batchnorm(x, num_features=3, eps=1e-5)

In [15]:
my_sigmoid_out = my_sigmoid(x)
my_cross_entropy_out = my_cross_entropy(x[::, ::, 0, 0], tmp_tensor)

### Step 4

Compare and show that "torch_xxx_out" and "my_xxx_out" are equal up to small numerical errors.

In [16]:
print(F.mse_loss(my_max_pool_out, torch_max_pool_out))
print(F.mse_loss(my_avg_pool_out, torch_avg_pool_out))
print(F.mse_loss(my_conv_out, torch_conv_out))
print(F.mse_loss(my_norm_out, torch_norm_out))

tensor(0., dtype=torch.float64)
tensor(4.2491e-16, dtype=torch.float64)
tensor(3.1710e-15, dtype=torch.float64, grad_fn=<MseLossBackward0>)
tensor(2.9662e-15, grad_fn=<MseLossBackward0>)


In [17]:
print(F.mse_loss(my_sigmoid_out, torch_sigmoid_out))
print(F.mse_loss(my_cross_entropy_out, torch_cross_entropy_out))

tensor(3.6253e-16)
tensor(0.)
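
A mean squared error averages over all elements, so a single badly wrong element can hide behind many correct ones. A stricter complementary check is the largest absolute difference against a tolerance. The helper below is a hypothetical sketch on NumPy arrays (a torch tensor would first be converted, for example with `.detach().numpy()`); the tolerance `1e-6` is an assumed value, not one prescribed by the assignment.

```python
import numpy as np

def report_match(name, mine, reference, atol=1e-6):
    """Print the largest absolute difference and whether it is within atol."""
    mine = np.asarray(mine, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    max_err = np.abs(mine - reference).max()
    ok = bool(np.allclose(mine, reference, rtol=0.0, atol=atol))
    print(f'{name}: max abs error = {max_err:.3e}, match = {ok}')
    return ok

a = np.array([1.0, 2.0, 3.0])
report_match('tiny error', a, a + 1e-9)
report_match('large error', a, a + 1e-3)
```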


## Train CNNs on CIFAR-10 dataset

Implement a CNN and train it on CIFAR-10 using these definitions:

1. cA-B = Conv2d with input A channels, output B channels - kernel size 3x3, stride (1,1), padding with zeros to keep image size constant, followed by ReLU;

2. mp = maxpool2d kernel size 2x2, stride (2,2);

3. bn = batchnorm2d with affine=False (i.e. batch norm without learnable parameters);

4. fcA-B = nn.Linear with input A nodes, output B nodes;

5. aap = adaptive average pooling.

Use these definitions to build the architecture c3-16 -> c16-16 -> mp -> c16-32 -> c32-32 -> mp -> c32-64 -> c64-64 -> mp -> c64-128 -> c128-128 -> aap -> flatten -> fc128-10 -> cross entropy loss. Adjust the learning rate, batch size, and other hyperparameters to reach a test accuracy **> 75%**.
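
Before writing the model, it can help to trace how the feature-map size changes through this architecture and how many trainable parameters it has. The sketch below does this in plain Python. It assumes 32x32 CIFAR-10 inputs, convolutions with bias, and an adaptive average pool that reduces the final map to 1x1 so that `fc128-10` receives 128 features; the helper name `conv_params` is hypothetical.

```python
def conv_params(c_in, c_out, k=3):
    """Weights plus biases of a k x k convolution."""
    return c_in * c_out * k * k + c_out

layers = ['c3-16', 'c16-16', 'mp', 'c16-32', 'c32-32', 'mp',
          'c32-64', 'c64-64', 'mp', 'c64-128', 'c128-128']

size = 32   # CIFAR-10 images are 32x32
total = 0
for name in layers:
    if name == 'mp':
        size //= 2  # 2x2 max pool with stride 2 halves height and width
    else:
        c_in, c_out = map(int, name[1:].split('-'))
        total += conv_params(c_in, c_out)  # padding 1 keeps height and width
    print(f'{name:>9}: {size}x{size}')

total += 128 * 10 + 10  # fc128-10: weights plus biases
print('trainable parameters:', total)
```

The three pooling steps reduce the map from 32x32 to 4x4 before the adaptive average pool, which is why the fully connected layer only needs 128 inputs.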

In [18]:
# === Complete the code (1')
num_epoch = 20 # TODO: please define the number of epochs here.
batch_size = 128 # TODO: please fill the batch size here.
# === Complete the code

In [19]:
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainset = torchvision.datasets.CIFAR10(root='./data',
                                        train=True,
                                        download=True,
                                        transform=transform)
trainloader = torch.utils.data.DataLoader(trainset,
                                          batch_size=batch_size,
                                          shuffle=True,
                                          num_workers=1)

testset = torchvision.datasets.CIFAR10(root='./data',
                                       train=False,
                                       download=True,
                                       transform=transform)
testloader = torch.utils.data.DataLoader(testset,
                                         batch_size=batch_size,
                                         shuffle=False,
                                         num_workers=1)

Files already downloaded and verified
Files already downloaded and verified


In [20]:
# Creating a CNN model
class CNN(nn.Module):
    
    def __init__(self, num_classes):
        super(CNN, self).__init__()
       
        # === Complete the code (1.5')
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=(3, 3), stride=(1, 1), padding=1)
        self.conv2 = nn.Conv2d(in_channels=16, out_channels=16, kernel_size=(3, 3), stride=(1, 1), padding=1)
        self.conv3 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=(3, 3), stride=(1, 1), padding=1)
        self.conv4 = nn.Conv2d(in_channels=32, out_channels=32, kernel_size=(3, 3), stride=(1, 1), padding=1)
        self.conv5 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=(3, 3), stride=(1, 1), padding=1)
        self.conv6 = nn.Conv2d(in_channels=64, out_channels=64, kernel_size=(3, 3), stride=(1, 1), padding=1)
        self.conv7 = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=(3, 3), stride=(1, 1), padding=1)
        self.conv8 = nn.Conv2d(in_channels=128, out_channels=128, kernel_size=(3, 3), stride=(1, 1), padding=1)

        self.relu = nn.ReLU()
        self.max_pool = nn.MaxPool2d(kernel_size=(2, 2), stride=(2, 2))
        self.aap = nn.AdaptiveAvgPool2d((1, 1))
        self.flatten = nn.Flatten()
        self.linear = nn.Linear(in_features=128, out_features=num_classes)
        # === Complete the code
        
    def forward(self, x):
        # === Complete the code (1.5')
        # c3-16 -> c16-16 -> mp -> c16-32 -> c32-32 
        # -> mp -> c32-64 -> c64-64 -> mp -> 
        # c64-128 -> c128-128 -> aap -> 
        # flatten -> fc128-10 -> cross entropy loss

        out = self.relu(self.conv1(x))
        out = self.relu(self.conv2(out))
        out = self.max_pool(out)

        out = self.relu(self.conv3(out))
        out = self.relu(self.conv4(out))
        out = self.max_pool(out)

        out = self.relu(self.conv5(out))
        out = self.relu(self.conv6(out))
        out = self.max_pool(out)
    
        out = self.relu(self.conv7(out))
        out = self.relu(self.conv8(out))
        
        out = self.aap(out)
        out = self.flatten(out)
        out = self.linear(out)
        # === Complete the code
        return out

In [21]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = CNN(10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)

In [22]:
for epoch in range(num_epoch):

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):

        inputs, labels = data[0].to(device), data[1].to(device)
        # === Complete the code (1')
        # Call the module itself rather than .forward() so that hooks are run.
        predicted = model(inputs)
        loss = criterion(predicted, labels)
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # === Complete the code

        running_loss += loss.item()
        if (i + 1) % 128 == 0:
            print('epoch {:3d} | {:5d} batches loss: {:.4f}'.format(epoch, i + 1, running_loss/128))
            running_loss = 0.0

print('Finished Training')

epoch   0 |   128 batches loss: 2.1159
epoch   0 |   256 batches loss: 1.8479
epoch   0 |   384 batches loss: 1.6763
epoch   1 |   128 batches loss: 1.5885
epoch   1 |   256 batches loss: 1.5203
epoch   1 |   384 batches loss: 1.4534
epoch   2 |   128 batches loss: 1.4011
epoch   2 |   256 batches loss: 1.3756
epoch   2 |   384 batches loss: 1.3155
epoch   3 |   128 batches loss: 1.2670
epoch   3 |   256 batches loss: 1.2347
epoch   3 |   384 batches loss: 1.2074
epoch   4 |   128 batches loss: 1.1632
epoch   4 |   256 batches loss: 1.1489
epoch   4 |   384 batches loss: 1.1201
epoch   5 |   128 batches loss: 1.0721
epoch   5 |   256 batches loss: 1.0472
epoch   5 |   384 batches loss: 1.0434
epoch   6 |   128 batches loss: 0.9925
epoch   6 |   256 batches loss: 0.9726
epoch   6 |   384 batches loss: 0.9831
epoch   7 |   128 batches loss: 0.9392
epoch   7 |   256 batches loss: 0.9513
epoch   7 |   384 batches loss: 0.9230
epoch   8 |   128 batches loss: 0.8638
epoch   8 |   256 batches

In [23]:
dataiter = iter(testloader)
images, labels = next(dataiter)

In [24]:
correct = 0
total = 0
with torch.no_grad():
    for data in testloader:
        images, labels = data[0].to(device), data[1].to(device)
        outputs = model(images)
        # The predicted class is the index of the largest score.
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f'Accuracy of the network on the 10000 test images: {100 * correct // total} %')

Accuracy of the network on the 10000 test images: 76 %


## Discussion (2 points)

Calculate the parameters and FLOPs (floating-point operations) of **AlexNet**, and analyse the ratio of the number of parameters to the amount of computation for different layers in AlexNet.

Hint:

1. You can refer to https://pytorch.org/vision/stable/_modules/torchvision/models/alexnet.html for the architecture of AlexNet.
2. You only need to make estimates and do not need to perform rigorous calculations (e.g. only consider the FLOPs of the convolution and FC layers in the AlexNet model).
3. Because multiply-accumulate (MAC) operations are performed in hardware, you can simply count only the multiplications when calculating FLOPs.

## Convolutional layers
**Note: Ignoring bias for each layer for simplicity.**
`E.g.: [(3*64*11*11) + 64] ~= (3*64*11*11)`

```
nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2) 
nn.ReLU(inplace=True),
nn.MaxPool2d(kernel_size=3, stride=2),                 
nn.Conv2d(64, 192, kernel_size=5, padding=2),         
nn.ReLU(inplace=True),
nn.MaxPool2d(kernel_size=3, stride=2),
nn.Conv2d(192, 384, kernel_size=3, padding=1),         
nn.ReLU(inplace=True),
nn.Conv2d(384, 256, kernel_size=3, padding=1),         
nn.ReLU(inplace=True),
nn.Conv2d(256, 256, kernel_size=3, padding=1),        
nn.ReLU(inplace=True),
nn.MaxPool2d(kernel_size=3, stride=2),
```

### No. of parameters
= Input channels * Output channels * Kernel height * Kernel width
```
Layer 1: 3*64*11*11 = 23232
Layer 2: 64*192*5*5 = 307200
Layer 3: 192*384*3*3 = 663552
Layer 4: 384*256*3*3 = 884736
Layer 5: 256*256*3*3 = 589824

Total Parameters = 2,468,544
```

### No. of FLOPs
= No. of parameters * Output width * Output height, where Output size = floor((Input size + 2*Padding - Kernel size)/Stride) + 1
```
Input Size of Image: 224*224
Layer 1: output size floor((224+2*2-11)/4)+1 = 55
         23232*55*55 = 70276800

MaxPool (kernel 3, stride 2): (55-3)/2+1 = 27
Layer 2: output size (27+2*2-5)/1+1 = 27
         307200*27*27 = 223948800

MaxPool (kernel 3, stride 2): (27-3)/2+1 = 13
Layer 3: output size (13+2*1-3)/1+1 = 13
         663552*13*13 = 112140288

Layer 4: output size 13
         884736*13*13 = 149520384

Layer 5: output size 13
         589824*13*13 = 99680256
MaxPool (kernel 3, stride 2): (13-3)/2+1 = 6, which matches the 256*6*6 input of the first dense layer

Total FLOPs = 655,566,528
```


## Dense layers
**Note: No. of parameters = No. of FLOPs**
```
nn.Dropout(p=dropout),
nn.Linear(256 * 6 * 6, 4096),    
nn.ReLU(inplace=True),
nn.Dropout(p=dropout),
nn.Linear(4096, 4096),          
nn.ReLU(inplace=True),
nn.Linear(4096, num_classes),  
```

### No. of parameters/FLOPs
= Input features * Output features
```
Layer 1: 256*6*6*4096 = 37748736 
Layer 2: 4096*4096 = 16777216
Layer 3: 4096*1000 = 4096000

Total Parameters/FLOPs = 58,621,952
```

- Total no. of parameters = 2,468,544 + 58,621,952 = 61,090,496
- Total no. of FLOPs = 655,566,528 + 58,621,952 = 714,188,480


## Ratio = Parameters/FLOPs

```
Layer 1 = 1/(55*55) ~= 0.0003
Layer 2 = 1/(27*27) ~= 0.0014
Layer 3 = 1/(13*13) ~= 0.0059
Layer 4 = 1/(13*13) ~= 0.0059
Layer 5 = 1/(13*13) ~= 0.0059
Layer 6 = 1
Layer 7 = 1
Layer 8 = 1
```


### Analysis
1. No. of parameters of dense layers >> no. of parameters of convolutional layers: the dense layers hold about 58.6M of the 61.1M parameters (roughly 96%).
2. No. of FLOPs per convolutional layer >> no. of FLOPs per linear layer: the convolutional layers account for about 656M of the 714M FLOPs (roughly 92%).
3. The ratio of a convolutional layer is 1/(output height * output width), because every weight is reused at each output position. It therefore increases as the feature maps shrink (55 -> 27 -> 13) and stays constant for layers 3 to 5, which share the same 13*13 output size.
4. The ratio of the linear layers = 1, which means that each parameter is used exactly once per forward pass. Without weight sharing, dense layers need far more parameters per FLOP than convolutional layers.
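
The estimate can be reproduced with a few lines of plain Python. This is a sketch under assumptions rather than a rigorous FLOP counter: it follows the torchvision layer list quoted above, uses the standard output-size formula floor((size + 2*padding - kernel)/stride) + 1, assumes a 224x224 input and 1000 classes, ignores biases, and counts one multiply-accumulate per weight per output position. The helper name `out_size` is hypothetical.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1

# (kind, c_in, c_out, kernel, stride, padding)
features = [
    ('conv', 3, 64, 11, 4, 2), ('pool', 0, 0, 3, 2, 0),
    ('conv', 64, 192, 5, 1, 2), ('pool', 0, 0, 3, 2, 0),
    ('conv', 192, 384, 3, 1, 1),
    ('conv', 384, 256, 3, 1, 1),
    ('conv', 256, 256, 3, 1, 1), ('pool', 0, 0, 3, 2, 0),
]
classifier = [(256 * 6 * 6, 4096), (4096, 4096), (4096, 1000)]

size, total_params, total_macs = 224, 0, 0
for kind, c_in, c_out, k, s, p in features:
    size = out_size(size, k, s, p)
    if kind == 'conv':
        params = c_in * c_out * k * k   # bias ignored
        macs = params * size * size     # each weight is used once per output position
        total_params += params
        total_macs += macs
        print(f'conv {c_in:>3}->{c_out:<3} out {size:>2}x{size:<2} '
              f'params {params:>10,} MACs {macs:>12,} ratio {params / macs:.4f}')

for n_in, n_out in classifier:
    params = n_in * n_out               # one MAC per weight, so ratio = 1
    total_params += params
    total_macs += params
    print(f'fc {n_in:>5}->{n_out:<5} params {params:>10,} MACs {params:>12,} ratio 1')

print(f'total params {total_params:,} | total MACs {total_macs:,}')
```

After the last pooling layer `size` is 6, which is consistent with the `256 * 6 * 6` input of the first linear layer.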