<h1 align=center><b>CSC14120 – PARALLEL PROGRAMMING</b></h1>

<h2 align=center>FINAL PROJECT</h2>
<p style="font-size:32px;text-align:center">Parallelize Convolutional Layer in the LeNet-5 Architecture using CUDA</p>

**Student information**

Student name         | Student ID 
---------------------|-------------
Nguyễn Quang Gia Bảo | 20120040
Huỳnh Tuấn Nam       | 20120136
Trần Hoàng Anh Phi   | 20120158

### __Table of Contents__

1. [Introduction](#introduction)
   1. [Objectives](#objectives)
   2. [Dataset](#dataset)
   3. [Motivation](#motivation)
2. [Background](#background)
   1. [Convolutional Neural Networks (CNNs)](#cnns)
   2. [LeNet-5](#lenet-5)
3. [Starter project](#starter-project)
4. [Implementation](#implementation)
   1. [Convolutional Layer](#convolutional-layer)
   2. [Pooling Layer](#pooling-layer)
   3. [Fully Connected Layer](#fully-connected-layer)
   4. [LeNet-5 Architecture](#lenet-5-architecture)
   5. [Parallelization](#parallelization)
      1. [Basic kernel](#basickernel)
      2. [Optimized kernel 1](#kernel1)
      3. [Optimized kernel 2](#kernel2)
      4. [Optimized kernel 3](#kernel3)
   6. [Save and load model](#save-load)
5. [Usage](#usage)
   1. [Prerequisites](#prerequisites)
   2. [Training](#training)
   3. [Testing](#testing)
      1. [CPU](#cpu)
      2. [Basic kernel](#basic-kernel)
      3. [Optimized kernel 1](#optimize-kernel-1)
      4. [Optimized kernel 2](#optimized-kernel-2)
      5. [Optimized kernel 3](#optimized-kernel-3)
6. [Results](#results)
   1. [Training stage](#training-stage)
   2. [Testing stage](#testing-stage)
7. [Conclusion](#conclusion)
8. [Reflection](#reflection)
9. [References](#references)

__Workload distribution:__

Name | Student ID | Contribution | Percentage
-----|------------|--------------|------------
Nguyễn Quang Gia Bảo | 20120040 | Planning, read documents, implement forward-pass for CPU | 100%
Huỳnh Tuấn Nam | 20120136 | Planning, read documents, implement training phase, write report | 100%
Trần Hoàng Anh Phi | 20120158 | Implement training phase, basic kernel, optimized kernel 1, optimized kernel 2, optimized kernel 3, write report | 100%

# 1. Introduction <a id="introduction"></a>

## 1.1    Objectives <a id="objectives"></a>

__Objective__: Implement and optimize the forward-pass of a convolutional layer using CUDA. 
- Convolutional layers are the primary building blocks of convolutional neural networks (CNNs), which are used in many machine learning tasks.
- In general, CNNs work well on tasks where the data/input features have some level of spatial relationship.

## 1.2    Dataset <a id="dataset"></a>

- __Fashion-MNIST__ is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a $28 \times 28$ grayscale image, associated with a label from 10 classes. Zalando intends Fashion-MNIST to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits.

- Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255. The training and test data sets have 785 columns. The first column consists of the class labels (listed below) and represents the article of clothing. The rest of the columns contain the pixel-values of the associated image.
    - To locate a pixel on the image, suppose that we have decomposed the pixel index x as x = i * 28 + j, where i and j are integers between 0 and 27. The pixel is located on row i and column j of a 28 x 28 matrix.
    - For example, pixel31 (31 = 1 * 28 + 3) is the pixel in the fourth column from the left and the second row from the top.


- __Labels__: Each training and test example is assigned to one of the following labels:
<center>

Label | Description
------|------------
0 | T-shirt/top
1 | Trouser
2 | Pullover
3 | Dress
4 | Coat
5 | Sandal
6 | Shirt
7 | Sneaker
8 | Bag
9 | Ankle boot

</center>
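
To make the pixel indexing from the dataset description concrete, here is a minimal sketch in Python. The helper name `pixel_position` is ours and is not part of the dataset tooling; it only applies the decomposition x = i * 28 + j.

```python
# Map a flat pixel index x (0..783) to (row, column) in a 28 x 28 image.
WIDTH = 28  # image width in pixels, taken from the dataset description


def pixel_position(x, width=WIDTH):
    """Return (i, j) such that x == i * width + j."""
    return divmod(x, width)


row, col = pixel_position(31)
print(row, col)  # zero-based row and column of pixel31
```

With zero-based indices, row 1 is the second row from the top and column 3 is the fourth column from the left.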

## 1.3    Motivation <a id="motivation"></a>

Parallelizing the convolutional layer in the LeNet-5 architecture using CUDA offers several benefits:

1. **Speedup**: The convolutional layers are the most computationally intensive layers in the LeNet-5 architecture. By parallelizing their forward pass, we can significantly reduce the time the network spends in these layers.

2. **Scalability**: The parallelized convolutional layer can be seamlessly integrated into other convolutional neural networks, such as AlexNet, VGGNet, GoogLeNet, ResNet, YOLO, and many more. This means that the optimizations made for the LeNet-5 architecture can be applied to a wide range of deep learning models, enhancing their efficiency and scalability.

By leveraging the power of CUDA and parallel computing, we can unlock the full potential of convolutional neural networks, enabling faster and more efficient training of deep learning models.

# 2. Background <a id="background"></a>

## 2.1    Convolutional Neural Networks (CNNs)<a id="cnns"></a>

- Stanford [cheatsheet](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-convolutional-neural-networks#overview)
  
- Video CNNs: ["How Convolutional Neural Networks work"](https://www.youtube.com/watch?v=FmpDIaiMIeA)

## 2.2    LeNet-5 <a id="lenet-5"></a>

<center>

![](https://www.datasciencecentral.com/wp-content/uploads/2021/10/1lvvWF48t7cyRWqct13eU0w.jpeg)

</center>

- A Complete Guide: [here](https://www.kaggle.com/code/blurredmachine/lenet-architecture-a-complete-guide/notebook)

# 3. Starter project <a id="starter-project"></a>

We will use the open source [mini-dnn-cpp](https://github.com/iamhankai/mini-dnn-cpp) as the starter project. It is well-designed for training convolutional neural networks on CPUs. The project is written in C++ and uses the [Eigen 3.3.4](http://bitbucket.org/eigen/eigen/get/3.3.4.tar.bz2) library for matrix operations. 

# 4. Implementation <a id="implementation"></a>

The original starter project uses Eigen 3.3.4, which is not compatible with CUDA. Therefore, we upgraded the Eigen library to version 3.4.0, which is compatible with CUDA. The project is organized as follows:

```bash
.
├── build/                                  Build directory
├── checkpoint/                             Checkpoint directory
├── data/                                   Data directory
│   ├── fashion-mnist
│   └── mnist
├── Eigen/                                  Eigen library
├── src/                                    Source directory
│   ├── layer/                              Layer directory
│   │     ├── gpu/                          GPU directory
│   │     │   ├── conv_cpu.cc               Convolutional layer CPU implementation
│   │     │   ├── conv_kernel1.cu           Convolutional layer GPU optimized v1 implementation
│   │     │   ├── conv_kernel2.cu           Convolutional layer GPU optimized v2 implementation
│   │     │   ├── conv_kernel3.cu           Convolutional layer GPU optimized v3 implementation
│   │     │   ├── conv_kernel.cu            Convolutional layer GPU basic implementation
│   │     │   ├── gpu_interface.h           GPU interface
│   │     │   ├── utils.cu                  
│   │     │   └── utils.h                   GPU utils
│   │     ├── ave_pooling.cc                
│   │     ├── ave_pooling.h                 Average pooling layer CPU implementation
│   │     ├── conv.cc
│   │     ├── conv.h                        Convolutional layer CPU 
│   │     ├── conv_gpu.cc
│   │     ├── conv_gpu.h                    Convolutional layer GPU 
│   │     ├── fully_connected.cc
│   │     ├── fully_connected.h             Fully connected layer CPU implementation
│   │     ├── max_pooling.cc
│   │     ├── max_pooling.h                 Max pooling layer CPU implementation
│   │     ├── relu.cc
│   │     ├── relu.h                        ReLU activation function CPU implementation
│   │     ├── sigmoid.cc
│   │     ├── sigmoid.h                     Sigmoid activation function CPU implementation
│   │     ├── softmax.cc
│   │     └── softmax.h                     Softmax activation function CPU implementation
│   ├── loss/                               Loss directory
│   ├── optimizer/                          Optimizer directory
│   ├── CMakeLists.txt
│   ├── layer.h                             Layer interface
│   ├── loss.h                              Loss interface
│   ├── mnist.cc                            
│   ├── mnist.h                             MNIST/Fashion-MNIST dataset loader
│   ├── network.cc          
│   ├── network.h                           Network interface   
│   ├── optimizer.h                         Optimizer interface
│   └── utils.h                             Utils: accuracy, one hot encoding, ...
├── third_party/
├── CMakeLists.txt
├── LICENSE
├── Makefile
├── readme.md
├── report.ipynb
├── test.cc                                 Testing script
├── train.cc                                Training script
```

## 4.1    Convolutional layer<a id="convolutional-layer"></a>

The convolution layer (CONV) uses filters that perform convolution operations as it is scanning the input $\mathbb{I}$ with respect to its dimensions. Its hyperparameters include the filter size $\mathbb{F}$ and stride $\mathbb{S}$. The resulting output $\mathbb{O}$ is called feature map or activation map.
<center>

![](https://stanford.edu/~shervine/teaching/cs-230/illustrations/architecture-cnn-en.jpeg?3b7fccd728e29dc619e1bd8022bf71cf)

</center> 

The sequential implementation of the convolutional layer uses the im2col method to convert the input image into a matrix and then performs a matrix multiplication with the filter matrix to get the output image. The im2col algorithm is a technique that converts an image into a matrix, such that each column of the matrix corresponds to a small patch of the image. This makes it easier to perform convolution operations using matrix multiplication, which can be faster and more efficient than looping over the image pixels. The im2col method works as follows:
- Convert input image of size O(HWC) to a patches matrix of size O(HW(K^2)C) 
- Convert filter into format __kernel height__ * __kernel width__ * __kernel channel__
- Multiply the modified input and filter matrix using GEMM matrix multiplication to get the output. This is a single call.

To illustrate the im2col algorithm with a greyscale example, let's assume we have a 4x4 image with one channel, and we want to apply a 2x2 filter with a stride of 1 and no padding. The im2col algorithm would produce a 4x9 matrix, where each column represents a 2x2 patch of the image, as shown below:

$$
\begin{bmatrix}
a & b & c & d \\
e & f & g & h \\
i & j & k & l \\
m & n & o & p \\
\end{bmatrix}
\rightarrow
\begin{bmatrix}
a & b & c & e & f & g & i & j & k \\
b & c & d & f & g & h & j & k & l \\
e & f & g & i & j & k & m & n & o \\
f & g & h & j & k & l & n & o & p \\
\end{bmatrix}
$$

Now, we can multiply this matrix by a vectorized version of the filter, and reshape the result into a 3x3 output image. For example, if the filter is:

$$
\begin{bmatrix}
1 & 2 \\
3 & 4 \\
\end{bmatrix}
$$

Then the vectorized filter is:

$$
\begin{bmatrix}
1 \\
2 \\
3 \\
4 \\
\end{bmatrix}
$$

And the output image is:

$$
\begin{bmatrix}
a + 2b + 3e + 4f & b + 2c + 3f + 4g & c + 2d + 3g + 4h \\
e + 2f + 3i + 4j & f + 2g + 3j + 4k & g + 2h + 3k + 4l \\
i + 2j + 3m + 4n & j + 2k + 3n + 4o & k + 2l + 3o + 4p \\
\end{bmatrix}
$$

This is equivalent to sliding the filter over the image and computing the dot product at each position, but it can be done more efficiently using matrix operations. 
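
The worked example above can be reproduced with a small pure-Python sketch. The function names `im2col` and `conv_via_im2col` are ours, not the starter project's API, and the letters a to p are replaced by the numbers 1 to 16 so the result can be checked by hand.

```python
def im2col(image, k):
    """Patch matrix of a 2D image for a k x k filter (stride 1, no padding).

    Each column holds one flattened k x k patch, so the result has
    k * k rows and (H - k + 1) * (W - k + 1) columns.
    """
    h, w = len(image), len(image[0])
    out_h, out_w = h - k + 1, w - k + 1
    cols = [[0] * (out_h * out_w) for _ in range(k * k)]
    for r in range(out_h):
        for c in range(out_w):
            for dy in range(k):
                for dx in range(k):
                    cols[dy * k + dx][r * out_w + c] = image[r + dy][c + dx]
    return cols


def conv_via_im2col(image, kernel):
    """Convolve by multiplying the flattened kernel with the patch matrix."""
    k = len(kernel)
    cols = im2col(image, k)
    flat_kernel = [v for row in kernel for v in row]
    out_w = len(image[0]) - k + 1
    flat = [sum(f * cols[i][j] for i, f in enumerate(flat_kernel))
            for j in range(len(cols[0]))]
    # Reshape the 1 x 9 result row into a 3 x 3 output image.
    return [flat[i:i + out_w] for i in range(0, len(flat), out_w)]


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
kernel = [[1, 2], [3, 4]]
patches = im2col(image, 2)
result = conv_via_im2col(image, kernel)
print(result)
```

The top-left output is a + 2b + 3e + 4f = 1 + 4 + 15 + 24 = 44, which matches the first entry of the output matrix above.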

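For reference, the loops below are the direct (loop-based) convolution that im2col followed by a matrix multiplication is equivalent to, assuming a stride of 1 and no padding:

```c++
#include <vector>

// Direct convolution with stride 1 and no padding (row-major flat arrays).
// input:   [C][H][W]
// kernels: [M][K][K][C]
// output:  [M][H_out][W_out], where H_out = H - K + 1 and W_out = W - K + 1
std::vector<float> conv_direct(const std::vector<float>& input,
                               const std::vector<float>& kernels,
                               int C, int H, int W, int M, int K) {
  const int H_out = H - K + 1, W_out = W - K + 1;
  std::vector<float> output(M * H_out * W_out);
  for (int h = 0; h < H_out; ++h)
    for (int w = 0; w < W_out; ++w)
      for (int o = 0; o < M; ++o) {
        float sum = 0.0f;
        for (int y = 0; y < K; ++y)        // kernel row
          for (int x = 0; x < K; ++x)      // kernel column
            for (int i = 0; i < C; ++i)    // input channel
              sum += input[(i * H + h + y) * W + (w + x)] *
                     kernels[((o * K + y) * K + x) * C + i];
        output[(o * H_out + h) * W_out + w] = sum;
      }
  return output;
}
```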

## 4.2    Pooling layer<a id="pooling-layer"></a>

The pooling layer is a downsampling operation, typically applied after a convolution layer, which provides some spatial invariance. There has been a lot of research on the pooling layer, and the most common pooling methods are max pooling and average pooling. In this project, we will implement both of them.
<p align="center">
    <img src="https://www.mdpi.com/remotesensing/remotesensing-13-04712/article_deploy/html/images/remotesensing-13-04712-g005.png" width="600" height="400">
</p>

<center>

Type | Max pooling | Average pooling
-----|-------------|----------------
Purpose | Each pooling operation selects the maximum value of the current view | Each pooling operation averages the values of the current view
Illustration | ![](https://stanford.edu/~shervine/teaching/cs-230/illustrations/max-pooling-a.png?711b14799d07f9306864695e2713ae07) | ![](https://stanford.edu/~shervine/teaching/cs-230/illustrations/average-pooling-a.png?58f9ab6d61248c3ec8d526ef65763d2f)
Comments | - Preserves detected features <br> - Most commonly used | - Downsamples features map <br> - Used in LeNet

</center>

In our implementation, max pooling is used. Max pooling selects the maximum element from the region of the feature map covered by the filter, so the output is a feature map containing the most prominent features of the previous feature map.

Because the pooling layer does not have any trainable parameters, the backward pass is very simple: we only need to propagate the gradient from the output feature map back to the input feature map.


In particular, max and average pooling are special kinds of pooling where the maximum and average value is taken, respectively.
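
As a small illustration of the two pooling types, the following sketch applies a 2 x 2 window with stride 2 to a 4 x 4 feature map. The function `pool2d` and the sample values are ours and are only meant to show the operation, not the project's `MaxPooling` or `AvePooling` classes.

```python
def pool2d(fmap, size, stride, mode="max"):
    """Apply max or average pooling to a 2D feature map (no padding)."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for r in range(0, h - size + 1, stride):
        row = []
        for c in range(0, w - size + 1, stride):
            window = [fmap[r + dy][c + dx]
                      for dy in range(size) for dx in range(size)]
            row.append(max(window) if mode == "max"
                       else sum(window) / len(window))
        out.append(row)
    return out


fmap = [[1, 3, 2, 4],
        [5, 6, 7, 8],
        [3, 2, 1, 0],
        [1, 2, 3, 4]]
max_out = pool2d(fmap, 2, 2, "max")
avg_out = pool2d(fmap, 2, 2, "avg")
print(max_out)
print(avg_out)
```

Both variants halve the height and width here; they differ only in how each window is reduced to one value.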

## 4.3    Fully connected layer<a id="fully-connected-layer"></a>

The fully connected layer (FC) operates on a flattened input where each input is connected to all neurons. It is also called a dense layer, and it is the most common layer in neural networks. The fully connected layer is implemented as a matrix multiplication between the input $\mathbb{I}$ and the weights $\mathbb{W}$, followed by the addition of the bias $\mathbb{B}$.
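
A minimal sketch of this computation for a single flattened input is shown below. The function name and the tiny weight matrix are illustrative assumptions; the project itself performs this step with Eigen matrices.

```python
def fully_connected(x, weights, bias):
    """Compute y[j] = sum_i x[i] * weights[i][j] + bias[j]."""
    return [sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j]
            for j in range(len(bias))]


x = [1.0, 2.0, 3.0]                           # flattened input, 3 values
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # 3 inputs x 2 neurons
b = [0.5, -0.5]
y = fully_connected(x, W, b)
print(y)
```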

## 4.4    LeNet-5 architecture<a id="lenet-5-architecture"></a>

Instead of using the original LeNet-5 architecture, we will use a modified version of it. The original LeNet-5 architecture is shown below:
![](https://d2l.ai/_images/lenet.svg)
And the modified version is shown below:

__First Layer: Convolutional Layer (CONV1):__

- Parameters: Input (N) = 28, Padding (P) = 0, Filter (F) = 5 x 5, Stride (S) = 1
- Conv Operation: ((N + 2P - F) / S) + 1 = ((28 + 0 - 5) / 1) + 1 = 24, giving a 24 x 24 feature map
- We will apply 6 filters / kernels so we will get a 24 x 24 x 6 dimensional output

__Second Layer: Max Pooling Layer (POOL1):__

- Parameters: Input (N) = 24, Filter (F) = 2 x 2, Stride (S) = 2
- Max Pooling Operation: ((N + 2P - F) / S) + 1 = ((24 - 2) / 2) + 1 = 12, giving a 12 x 12 feature map
- We will have a 12 x 12 x 6 dimensional output at the end of this pooling

__Third Layer: Convolutional Layer (CONV2):__

- Parameters: Input (N) = 12, Filter (F) = 5 x 5, Stride (S) = 1
- Conv Operation: ((N + 2P - F) / S) + 1 = ((12 - 5) / 1) + 1 = 8, giving an 8 x 8 feature map
- We will apply 16 filters / kernels so we will get an 8 x 8 x 16 dimensional output

__Fourth Layer: Max Pooling Layer (POOL2):__

- Parameters: Input (N) = 8, Filter (F) = 2 x 2, Stride (S) = 2
- Max Pooling Operation: ((N + 2P - F) / S) + 1 = ((8 - 2) / 2) + 1 = 4, giving a 4 x 4 feature map
- We will have a 4 x 4 x 16 dimensional output at the end of this pooling

__Fifth Layer: Fully Connected layer (FC1):__

- Parameters: W: 256 * 120, b: 120 (the flattened 4 x 4 x 16 input has 256 values)
- We will have an output of 120 x 1 dimension

__Sixth Layer: Fully Connected layer (FC2):__

- Parameters: W: 120 * 84, b: 84
- We will have an output of 84 x 1 dimension

__Seventh Layer: Output layer (Softmax):__

- Parameters: W: 84 * 10, b: 10
- We will get an output of 10 x 1 dimension

```python
model.add(Conv2D(filters=6, kernel_size=(5,5), activation='relu', input_shape=(28, 28, 1)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(filters=16, kernel_size=(5,5), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(120, activation='relu'))
model.add(Dense(84, activation='relu'))
model.add(Dense(10, activation='softmax'))
```
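
The output-size formula ((N + 2P - F) / S) + 1 can be checked with a short sketch. The helper `out_size` is ours. The chain below assumes no padding, which is what the Keras-style listing above implies because it passes no padding argument; the `padded` line shows that a padding of 2 would instead keep the 28 x 28 size.

```python
def out_size(n, f, s=1, p=0):
    """Spatial output size of a conv or pooling layer: (n + 2p - f) // s + 1."""
    return (n + 2 * p - f) // s + 1


padded = out_size(28, 5, s=1, p=2)   # 5 x 5 filter with padding 2

conv1 = out_size(28, 5)              # CONV1: 5 x 5 filter, stride 1, no padding
pool1 = out_size(conv1, 2, s=2)      # POOL1: 2 x 2 window, stride 2
conv2 = out_size(pool1, 5)           # CONV2: 5 x 5 filter, stride 1, no padding
pool2 = out_size(conv2, 2, s=2)      # POOL2: 2 x 2 window, stride 2
flat = 16 * pool2 * pool2            # 16 channels flattened for FC1
print(padded, conv1, pool1, conv2, pool2, flat)
```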

## 4.5    Parallelization <a id="parallelization"></a>

According to the teacher's instructions, we will only parallelize the __forward-pass__ of the convolutional layer using CUDA. 


### 4.5.1    Basic kernel<a id="basickernel"></a>

To parallelize the convolutional layer, we use CUDA to implement a kernel function that performs the convolution operation in parallel. The kernel function is launched by the CPU and executed on the GPU. First, each thread calculates its indices into the output feature map, the input feature map, and the filter. Then, it loops over each channel of the input feature map and each element of the filter to compute the convolution. Finally, it stores the result in the output feature map.
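
The idea of giving each thread one output element can be emulated sequentially in Python. This is only a sketch under our own assumptions: the names `conv_thread` and `launch`, the `[M][C][K][K]` filter layout, and the one-thread-per-output mapping are illustrative and may differ from the actual grid and block layout in `conv_kernel.cu`.

```python
def conv_thread(inp, kernels, m, h, w):
    """Work of one hypothetical GPU thread: compute output[m][h][w]."""
    channels, k = len(inp), len(kernels[0][0])
    acc = 0.0
    for c in range(channels):       # loop over input channels
        for p in range(k):          # loop over filter rows
            for q in range(k):      # loop over filter columns
                acc += inp[c][h + p][w + q] * kernels[m][c][p][q]
    return acc


def launch(inp, kernels):
    """Sequential stand-in for a kernel launch: one call per output element."""
    m_out, k = len(kernels), len(kernels[0][0])
    h_out, w_out = len(inp[0]) - k + 1, len(inp[0][0]) - k + 1
    return [[[conv_thread(inp, kernels, m, h, w) for w in range(w_out)]
             for h in range(h_out)] for m in range(m_out)]


inp = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]   # C = 1, 3 x 3 input
kernels = [[[[1, 0], [0, 1]]]]              # M = 1, C = 1, 2 x 2 filter
out = launch(inp, kernels)
print(out)
```

On the GPU, the calls made inside `launch` would run concurrently, because no output element depends on another.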

### 4.5.2    Optimized kernel 1<a id="kernel1"></a>

In this kernel, we will use shared memory to store the input feature map. This will reduce the number of global memory accesses, which is the main bottleneck of the convolutional layer.
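
A rough count shows why staging the input in shared memory can reduce global memory traffic. The sketch below is an idealized model of our own: it ignores hardware caches, and the block width `TILE = 16` is an assumed value, not one taken from the project's kernels.

```python
def global_loads_per_block(tile, k, channels, use_shared):
    """Input elements read from global memory by one tile x tile block."""
    if use_shared:
        halo = tile + k - 1             # tile plus the border the filter needs
        return channels * halo * halo   # each input element is loaded once
    # Without shared memory, every thread reads its own k x k window.
    return channels * tile * tile * k * k


TILE, K, C = 16, 5, 6   # TILE is assumed; K and C match the second conv layer
naive = global_loads_per_block(TILE, K, C, use_shared=False)
tiled = global_loads_per_block(TILE, K, C, use_shared=True)
print(naive, tiled, naive / tiled)
```

The model counts only loads; the copy into shared memory and the synchronization it requires have a cost of their own, so the measured benefit can be much smaller.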

### 4.5.3    Optimized kernel 2<a id="kernel2"></a>

In this kernel, we additionally use a shared-memory tile of the input to perform the convolution:

- Declare a shared memory variable that is used to store a tile of the input data. The size of the tile is determined by the number of threads in a block and the size of the shared memory.

- Copy the input data from global memory to shared memory. This is done by each thread in a block copying a portion of the input data to the shared memory variable.

- Compute the dot product of the shared memory variable and the convolutional filters, which are stored in constant memory. This is done by each thread in a block computing the dot product of the portion of the input data stored in shared memory and the corresponding portion of the convolutional filters stored in constant memory.

- Sum the dot products computed by all threads in a block to obtain the output feature maps for the block.

- Write the output feature maps from shared memory to global memory.

### 4.5.4    Optimized kernel 3<a id="kernel3"></a>

In this kernel, we use a combination of shared memory and constant memory to store the input feature map and the filter.
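
Whether the filters fit in constant memory can be checked with simple arithmetic. The sketch below assumes 32-bit floats and the 64 KB of constant memory that CUDA devices provide; the helper `filter_bytes` is ours.

```python
CONSTANT_MEMORY_BYTES = 64 * 1024  # total constant memory on CUDA devices


def filter_bytes(m, c, k, bytes_per_value=4):
    """Size of M filters with C channels and K x K weights each."""
    return m * c * k * k * bytes_per_value


conv1_bytes = filter_bytes(6, 1, 5)    # first conv layer: 6 filters, 1 channel
conv2_bytes = filter_bytes(16, 6, 5)   # second conv layer: 16 filters, 6 channels
print(conv1_bytes, conv2_bytes)
print(conv1_bytes + conv2_bytes <= CONSTANT_MEMORY_BYTES)
```

Both filter banks of this network are small enough to be kept in constant memory at the same time.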

## 4.6    Save and load model<a id="save-load"></a>

In order to test the model, we need to save and load the weights of the model. We implemented the `save_parameters` and `load_parameters` methods in the `Network` class to save and load the weights of the model with binary files. 
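
The general idea of storing weights in a binary file can be sketched as follows. This is a hypothetical format (a flat run of 32-bit floats) chosen for illustration; it is not the layout written by the project's `save_parameters`, and the function names `save_weights` and `load_weights` are ours.

```python
import io
from array import array


def save_weights(stream, weights):
    """Write a flat list of weights as raw 32-bit floats (hypothetical format)."""
    stream.write(array("f", weights).tobytes())


def load_weights(stream):
    """Read back all 32-bit floats from the stream."""
    values = array("f")
    values.frombytes(stream.read())
    return list(values)


buffer = io.BytesIO()                  # stands in for a file opened in "wb" mode
save_weights(buffer, [0.5, -1.25, 3.0])
buffer.seek(0)
restored = load_weights(buffer)
print(restored)
```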

# 5. Usage <a id="usage"></a>

## 5.1 Prerequisites <a id="prerequisites"></a>

- Download and unzip [FASHION-MNIST](https://www.kaggle.com/datasets/zalando-research/fashionmnist) dataset in `mini-dnn-cpp/data/fashion-mnist/`.

- Download and unzip [Eigen 3.4.0](https://gitlab.com/libeigen/eigen/-/releases/3.4.0), then place folder __Eigen__ in `mini-dnn-cpp/`.

## 5.2 Training <a id="training"></a>

```bash
make setup
make train
make train_model
```

## 5.3 Testing <a id="testing"></a>

Please run the following command after each test to clean up the temporary files:

```bash
make clean
```

### 5.3.1 CPU <a id="cpu"></a>

```bash
make clean
make setup
make cpu
make test
make run
```

```
To make your changes take effect please reactivate your environment
rm -f src/layer/*.o
rm test.o
rm test
rm: cannot remove 'test': No such file or directory
make: *** [Makefile:86: clean] Error 1
To make your changes take effect please reactivate your environment
make network.o
make[1]: Entering directory '/mnt/net/i2x256-ai03/hotel/phit/personal/ParallelProgramming/mini-dnn-cpp'
nvcc --compile src/network.cc -o src/network.o -I./ -L/usr/local/cuda/lib64 -lcudart
make[1]: Leaving directory '/mnt/net/i2x256-ai03/hotel/phit/personal/ParallelProgramming/mini-dnn-cpp'
make mnist.o
make[1]: Entering directory '/mnt/net/i2x256-ai03/hotel/phit/personal/ParallelProgramming/mini-dnn-cpp'
nvcc --compile src/mnist.cc -o src/mnist.o  -I./ -L/usr/local/cuda/lib64 -lcudart
make[1]: Leaving directory '/mnt/net/i2x256-ai03/hotel/phit/personal/ParallelProgramming/mini-dnn-cpp'
make layer
make[1]: Entering directory '/mnt/net/i2x256-ai03/hotel/phit/personal/ParallelProgramming/mini-dnn-cpp'
nvcc --comp
```

### 5.3.2 Basic kernel <a id="basic-kernel"></a>

```bash
make clean
make setup
make gpu_basic
make test
make run
```


### 5.3.3 Optimized kernel 1 <a id="optimize-kernel-1"></a>

```bash
make clean
make setup
make gpu_v1
make test
make run
```


- We can see that the optimized kernel 1 is slower than the basic kernel. 

### 5.3.4 Optimized kernel 2 <a id="optimized-kernel-2"></a>

```bash
make clean
make setup
make gpu_v2
make test
make run
```


Optimized kernel 2 is slightly faster than the basic kernel for the C3 layer. However, it is slower than the basic kernel for the C1 layer.

### 5.3.5 Optimized kernel 3 <a id="optimized-kernel-3"></a>

```python
from numba import cuda
major, minor = cuda.get_current_device().compute_capability
print(f'GPU compute capability: {major}.{minor}')
```

```
GPU compute capability: 8.6
```


Because we could not use the half-precision floating-point format with the GPU's compute capability of 8.6, we targeted compute capability 7.5 to test optimized kernel 3.

```bash
make clean
make setup
make gpu_v3
make test
make run
```


- From the above findings, we can see that optimized kernel 3 has the fastest kernel time for the C3 convolutional layer. However, its total layer time is slower than that of the other kernels. Optimized kernel 3 relies on shared memory, which is fast but can hold only a small amount of data, so the data has to be divided into smaller parts and copied to shared memory, which takes extra time and slows down the overall layer.

- On the other hand, optimized kernel 3 reaches a slightly higher accuracy of 0.862, compared with 0.8619 for the other versions. A likely reason is that half-precision arithmetic introduces rounding errors, which lead to slightly different results. Normally, a less precise number format such as half-precision floating-point would be expected to lower accuracy, but in this case it gave a marginally better result.
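
The size of the rounding error introduced by half precision can be inspected with Python's `struct` module, whose `"e"` format is IEEE 754 half precision. The helper `to_half` and the sample values are ours; this only illustrates the number format, not the kernel's arithmetic.

```python
import struct


def to_half(x):
    """Round x to the nearest half-precision value and return it as a float."""
    return struct.unpack("e", struct.pack("e", x))[0]


for value in (0.1, 1.0, 1000.1):
    rounded = to_half(value)
    print(value, rounded, abs(value - rounded))
```

Values that are exactly representable, such as 1.0, are unchanged, while others are shifted by an amount that grows with their magnitude.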

# 6. Results <a id="results"></a>

## 6.1 Training Stage <a id="training-stage"></a>

```c++
  Network dnn;
  Layer *conv1 = new Conv(1, 28, 28, 6, 5, 5);
  Layer *pool1 = new MaxPooling(6, 24, 24, 2, 2, 2);
  Layer *conv2 = new Conv(6, 12, 12, 16, 5, 5);
  Layer *pool2 = new MaxPooling(16, 8, 8, 2, 2, 2);
  Layer* fc3 = new FullyConnected(pool2->output_dim(), 120);
  Layer* fc4 = new FullyConnected(120, 84);
  Layer* fc5 = new FullyConnected(84, 10);

  Layer* relu1 = new ReLU;
  Layer* relu2 = new ReLU;
  Layer* relu3 = new ReLU;
  Layer* relu4 = new ReLU;
  Layer* relu5 = new ReLU;
  Layer* softmax = new Softmax;
  dnn.add_layer(conv1);
  dnn.add_layer(relu1);
  dnn.add_layer(pool1);
  dnn.add_layer(conv2);
  dnn.add_layer(relu2);
  dnn.add_layer(pool2);
  dnn.add_layer(fc3);
  dnn.add_layer(relu3);
  dnn.add_layer(fc4);
  dnn.add_layer(relu4);
  dnn.add_layer(fc5);
  dnn.add_layer(softmax);
```

Default hyperparameters:

```c++
  int batch_size = 128;
  float learning_rate = 0.001;
```

1. Train on Fashion-MNIST with 10 epochs
```
weight-bad-1.bin
10-th epoch, test acc: 0.6136
```

2. Train on Fashion-MNIST with 10 epochs, without the ReLU activation at the (n - 1)-th layer (just before the softmax layer). ([We found that using the ReLU activation function at the penultimate layer is not good for the model, so we decided to remove it](https://stats.stackexchange.com/questions/163695/non-linearity-before-final-softmax-layer-in-a-convolutional-neural-network))
```
weight-1.bin
10-th epoch, test acc: 0.8619
```

<!-- 
3. Train on Fashion-MNIST with 10 epochs with kernel size = 2
```
weight-bad-2.bin
10-th epoch, test acc: 0.625
```

4. Train on Fashion-MNIST with 10 epochs with kernel size = 2 and dont use Relu at the penultimate layer
```
weight-2.bin
10-th epoch, test acc: 0.8766
```  

5. Train on Fashion-MNIST with 15 epochs with kernel size = 2 and dont use Relu at the penultimate layer
```
weight-3.bin
15-th epoch, test acc: 0.8862
``` -->

## 6.2 Testing Stage <a id="testing-stage"></a>

For the testing stage, we only use the `weight-1.bin` checkpoint and compare the CPU and CUDA implementations of the convolutional layer.

Kernel | Convolution layer | Kernel time | Layer time | Accuracy 
-------|-------------------|-------------|------------|---------
CPU | (C1) |            | 8500.75 ms | 0.8619
Basic kernel | (C1) | 1.37126 ms | 73.3622 ms | 0.8619
Kernel 1 | (C1) | 1.77152 ms | 74.2185 ms | 0.8619
Kernel 2 | (C1) | 1.39878 ms | 76.6955 ms | 0.8619
Kernel 3 | (C1) | 1.17472 ms | 197.065 ms | 0.862

Kernel | Convolution layer | Kernel time | Layer time | Accuracy 
-------|-------------------|-------------|------------|---------
CPU    | (C3) |            | 14941.2 ms | 0.8619
Basic kernel | (C3) | 3.58912 ms | 34.251 ms | 0.8619
Kernel 1 | (C3) | 4.92134 ms | 33.8345 ms | 0.8619
Kernel 2 | (C3) | 3.3065 ms | 33.7417 ms | 0.8619
Kernel 3 | (C3) | 1.1623 ms | 91.6598 ms | 0.862

We can see that the layer time is nearly the same for the basic kernel and optimized kernels 1 and 2 (optimized kernel 3 is the exception, with a noticeably longer layer time), while the kernel time differs from kernel to kernel. The kernel time is the time it takes to execute the kernel function on the GPU, while the layer time is the time it takes to execute the entire layer, including the kernel function and the data transfer between the CPU and GPU.
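
The speedup of each GPU version over the CPU can be derived from the layer times in the two tables above. The sketch below copies those measurements into a dictionary; the helper `speedups` is ours.

```python
# Layer times in milliseconds, copied from the two tables above.
layer_time_ms = {
    "C1": {"CPU": 8500.75, "Basic kernel": 73.3622, "Kernel 1": 74.2185,
           "Kernel 2": 76.6955, "Kernel 3": 197.065},
    "C3": {"CPU": 14941.2, "Basic kernel": 34.251, "Kernel 1": 33.8345,
           "Kernel 2": 33.7417, "Kernel 3": 91.6598},
}


def speedups(times):
    """Speedup of every GPU version relative to the CPU layer time."""
    cpu = times["CPU"]
    return {name: cpu / t for name, t in times.items() if name != "CPU"}


for layer, times in layer_time_ms.items():
    for name, factor in speedups(times).items():
        print(f"{layer} {name}: {factor:.1f}x")
```

Even the basic kernel is more than 100 times faster than the CPU layer for C1 and more than 400 times faster for C3, so the differences between the GPU versions are small compared with the gain from moving to the GPU at all.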

# 7. Conclusion <a id="conclusion"></a>

In this project, we have explored various aspects of training and testing a deep neural network using the Fashion-MNIST dataset. We started by defining the architecture of the network, which consisted of convolutional layers, pooling layers, and fully connected layers. We implemented different kernel functions to parallelize the convolutional layer using CUDA, and compared their performance in terms of kernel time and layer time.

During the training stage, we experimented with different hyperparameters and observed the impact on the accuracy of the model. We also tested different variations of the network architecture, such as removing the ReLU activation function at the penultimate layer. We saved and loaded the weights of the trained model for testing purposes.

In the testing stage, we compared the performance of the different kernel functions in terms of kernel time, layer time, and accuracy. We observed that optimized kernel 3, which combines several optimization techniques, had the fastest kernel time but a slower layer time due to the data transfer overhead. We also applied multiple techniques to improve the performance of the convolutional layer, such as shared memory, constant memory, loop unrolling, tile-based matrix multiplication, and the half-precision floating-point format.

# 8. Reflection <a id="reflection"></a>

The main challenge of this project was to understand CUDA programming techniques, how the CPU and GPU interact, and how to parallelize the convolutional layer. We had to read a lot of documentation and tutorials to get familiar with the codebase and the CUDA programming model. Throughout the project, we have expanded our knowledge of CUDA programming and parallel computing, which will be very useful in our future projects.

# 9. References <a id="references"></a>

- Fashion-MNIST Dataset: https://www.kaggle.com/datasets/zalando-research/fashionmnist
- Starter project (mini-dnn-cpp): https://github.com/iamhankai/mini-dnn-cpp
- Eigen library (3.4.0): https://gitlab.com/libeigen/eigen/-/releases/3.4.0
- Course slides, demos, and project description (CSC14120 – PARALLEL PROGRAMMING)