# Set up system

This section also runs Mini-CNN as a smoke test to confirm that it builds and runs.

Check CUDA compiler version

In [2]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0


## Compile executable file

Set up the build for `demo.exe`. If `demo.exe` is already in the correct location, skip the cells below until you reach the cell "RUN THIS".

In [5]:
%cd /content/drive/MyDrive/University/Parallel Computing/Personal/mini-dnn-cpp-master
%ls

/content/drive/MyDrive/University/Parallel Computing/Personal/mini-dnn-cpp-master
build/          ConvExperiment.cc  demo_Fashion_MNIST.cc  report.ipynb  testImplement.cc
CMakeLists.txt  data/              LICENSE                src/          testImplement.h
config.h        demo.cc            readme.md              test/         third_party/


In [6]:
%rm -r build
%mkdir build
%cd build
%ls

/content/drive/MyDrive/University/Parallel Computing/Personal/mini-dnn-cpp-master/build


In [8]:
!cmake ..

  Compatibility with CMake < 3.5 will be removed from a future version of
  CMake.

  Update the VERSION argument <min> value or use a ...<max> suffix to tell
  CMake that the project does not need compatibility with older versions.

-- The CXX compiler identification is GNU 11.4.0
-- The CUDA compiler identification is NVIDIA 12.2.140
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Configuring done (6.8s)
-- Generating done (0.3s)
-- Build files have been written to: /content/drive/MyDrive/University/Parallel Computing/Personal/mini-dnn-cpp-master/build


In [9]:
!make

[ -5%] Building CXX object src/CMakeFiles/MiniDNNLib.dir/mnist.cc.o
[  0%] Building CXX object src/CMakeFiles/MiniDNNLib.dir/network.cc.o
[  5%] Building CXX object src/CMakeFiles/MiniDNNLib.dir/layer/ave_pooling.cc.o
[ 10%] Building CXX object src/CMakeFiles/MiniDNNLib.dir/layer/conv.cc.o
[ 15%] Building CUDA object src/CMakeFiles/MiniDNNLib.dir/layer/cuda_utilities.cu.o
[ 21%] Building CXX object src/CMakeFiles/MiniDNNLib.dir/layer/fully_connected.cc.o
[ 26%] Building CXX object src/CMakeFiles/MiniDNNLib.dir/layer/max_pooling.cc.o
[ 31%] Building CXX object src/CMakeFiles/MiniDNNLib.dir/layer/relu.cc.o
[ 36%] Building CXX object src/CMakeFiles/MiniDNNLib.dir/layer/sigmoid.cc.o
[ 42%] Building CXX object src/CMakeFiles/MiniDNNLib.dir/layer/softmax.cc.o
[ 47%] Building CXX object src/CMakeFiles/MiniDNNLib.dir/loss/cross_entropy_loss.cc.o
[ 52%] Building CXX object src/CMakeFiles/Mini

## RUN THIS

Run `./demo.exe` (built as `./demo` in this environment) to evaluate LeNet-5 on the test set. `demo.exe` can run several versions of the convolutional layer on the same test set (currently 3 versions, indexed from 0 to 2), selected by passing the following arguments:

- First argument: the index of the first version you want to execute.
- Second argument: the index of the last version that `demo.exe` will execute.
- Third argument: if `0`, `demo.exe` executes only the first version. If `1`, `demo.exe` executes all versions from the first one to the last one.

The implemented versions and their indices are listed below (the first and second arguments accept these numbers):

- `0` - The sequential version: the forward pass of the convolutional layer runs the input unrolling stage sequentially.
- `1` - The first parallel version: the forward pass of the convolutional layer parallelizes the input unrolling stage.
- `2` - The second parallel version: the forward pass of the convolutional layer parallelizes the matrix multiplication between the input features and the layer's weights.
- Any other number - The original implementation of the convolutional layer, provided as is by the authors of the `mini-dnn-cpp` project.
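
To make the argument semantics concrete, here is a minimal sketch of how the three arguments could map to the list of versions that get executed. The helper name `versions_to_run` is hypothetical and is not taken from `demo.cc`; it only restates the rules described above.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not from demo.cc): returns the version indices that a
// call such as `./demo <first> <last> <run_all>` would execute, following the
// argument description above.
std::vector<int> versions_to_run(int first, int last, int run_all) {
    assert(run_all == 0 || run_all == 1);  // only 0 and 1 are documented
    std::vector<int> versions;
    if (run_all == 0) {
        versions.push_back(first);         // third argument 0: first version only
        return versions;
    }
    for (int v = first; v <= last; ++v) {  // third argument 1: first..last inclusive
        versions.push_back(v);
    }
    return versions;
}
```

Under this reading, the call `./demo 0 2 1` used in the next cell corresponds to `versions_to_run(0, 2, 1)`, which yields versions 0, 1 and 2.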


In [10]:
!ls
!./demo 0 2 1

CMakeCache.txt	CMakeFiles  cmake_install.cmake  demo  Makefile  src
../data/fashion-mnist/
mnist train number: 60000
mnist test number: 10000
Parameters loaded

Current version: 0

Test case 0 passed
Test case 1 passed
Test case 2 passed
Test case 3 passed
Test case 4 passed
Test case 5 passed
Test case 6 passed
Test case 7 passed
Test case 8 passed
Test case 9 passed
Test cases passed
Test case 0 passed
Test case 1 passed
Test case 2 passed
Test case 3 passed
Test case 4 passed
Test case 5 passed
Test case 6 passed
Test case 7 passed
Test case 8 passed
Test case 9 passed
Test cases passed


Test time: 75486.2
Test acc: 0.8297

------------------------------------------

Parameters loaded

Current version: 1

Test case 0 passed
Test case 1 passed
Test case 2 passed
Test case 3 passed
Test case 4 passed
Test case 5 passed
Test case 6 passed
Test case 7 passed
Test case 8 passed
Test case 9 passed
Test cases passed
Test case 0 passed
Test case 1 passed
Test case 2 passed
Test case 3 pass

# Solution Description

## Source specification

- Matrix library name: Eigen
- Starting codebase: the `mini-dnn-cpp` project (provided in the project description).

## Convolution Layer using GEMM

## Basic idea

### Input layout

A mini-batch of input images (input features) $X$ is a tensor of shape $(N, C, H, W)$, where:

* $N$: number of samples in a mini-batch
* $C$: number of input feature maps.
* $H$: height of each input feature map (or simply just the height of each input image in pixels).
* $W$: width of each input feature map (or simply just the width of each input image in pixels).
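
As an illustration of this layout, the sketch below computes the flat offset of element $X[n][c][h][w]$. It assumes a contiguous row-major $(N, C, H, W)$ buffer; the actual storage order used by `mini-dnn-cpp` and Eigen may differ, and the function name `offset_nchw` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Flat offset of X[n][c][h][w], ASSUMING a contiguous row-major (N, C, H, W)
// buffer. This is an illustrative assumption, not the project's actual layout.
std::size_t offset_nchw(std::size_t n, std::size_t c, std::size_t h, std::size_t w,
                        std::size_t C, std::size_t H, std::size_t W) {
    assert(c < C && h < H && w < W);
    return ((n * C + c) * H + h) * W + w;
}
```

With $C = 1$ and $H = W = 28$, for example, sample $n = 1$ starts at offset $784$.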

### Output layout

The output features $Y$ produced by applying a CNN layer to $X$ form a tensor of shape $(N, M, H_\text{out}, W_\text{out})$, where:

* $N$: number of samples in a mini-batch
* $M$: number of output feature maps of a CNN layer.
* $H_\text{out}$: height of each output feature map (with stride $1$ and no padding, $H_\text{out} = H - K + 1$).
* $W_\text{out}$: width of each output feature map (with stride $1$ and no padding, $W_\text{out} = W - K + 1$).

### Filter-bank layout

The tensor $W$ contains all filter matrices used by one CNN layer (or simply just the weight matrix of a CNN layer). It has shape $(M, C, K, K)$, where:

* $K$: the height and width of each filter matrix (or kernel matrix).

When an input image has $C$ input feature maps (or $C$ color channels) and the CNN layer produces $M$ output feature maps from them, we need $C \times M$ kernel matrices, each of size $K \times K$.
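
The sketch below counts the weights in such a filter bank and locates one weight in memory. It assumes contiguous row-major $(M, C, K, K)$ storage, which is an assumption made for illustration; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Total number of weights in a filter bank of shape (M, C, K, K).
std::size_t filter_bank_size(std::size_t M, std::size_t C, std::size_t K) {
    return M * C * K * K;
}

// Flat offset of W[m][c][p][q], ASSUMING contiguous row-major storage.
// Under this assumption, the weights of output map m form one row of length
// C*K*K, so the bank can also be read as an M x (C*K*K) matrix.
std::size_t filter_offset(std::size_t m, std::size_t c, std::size_t p, std::size_t q,
                          std::size_t C, std::size_t K) {
    assert(c < C && p < K && q < K);
    return ((m * C + c) * K + p) * K + q;
}
```

For illustrative values $M = 6$, $C = 1$, $K = 5$, the bank holds $6$ kernels and $150$ weights in total.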

### The Unrolled-X

We unroll the tensor $X$. As a result, all elements required to compute every output feature map of a single input image are available for a single matrix multiplication between $W$ and $X_\text{unroll}$. For now, just know that $X_\text{unroll}$ is derived from $X$ and has shape $(C, K, K, H_\text{out}, W_\text{out})$, flattened into a matrix with $C \times K \times K$ rows and $H_\text{out} \times W_\text{out}$ columns, where:

* $(C, K, K)$: the "height" of $X_\text{unroll}$. It is the number of elements in $X$ that we need to compute one output feature map element (that is, $C \times K \times K$).
* $(H_\text{out}, W_\text{out})$: the "width" of $X_\text{unroll}$. It is the number of elements in an output feature map, that is, $H_\text{out} \times W_\text{out}$.
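
The unrolling and the multiplication that follows it can be sketched in plain C++ as shown below. This is a minimal single-sample sketch under stated assumptions: row-major `std::vector<float>` buffers, stride $1$, padding $0$, and the weights viewed as an $M \times (C \cdot K \cdot K)$ matrix. It does not use Eigen and is not the project's implementation; `unroll_sample` and `matmul` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unroll one sample x of shape (C, H, W), stored row-major, into a row-major
// matrix with C*K*K rows and H_out*W_out columns (stride 1, padding 0).
std::vector<float> unroll_sample(const std::vector<float>& x, int C, int H, int W, int K) {
    const int H_out = H - K + 1;
    const int W_out = W - K + 1;
    assert(H_out > 0 && W_out > 0);
    assert(x.size() == static_cast<std::size_t>(C) * H * W);
    const int n_cols = H_out * W_out;
    std::vector<float> x_unroll(static_cast<std::size_t>(C) * K * K * n_cols);
    for (int c = 0; c < C; ++c)
        for (int p = 0; p < K; ++p)
            for (int q = 0; q < K; ++q) {
                const int row = (c * K + p) * K + q;  // one row per kernel position
                for (int h = 0; h < H_out; ++h)
                    for (int w = 0; w < W_out; ++w)   // one column per output pixel
                        x_unroll[row * n_cols + h * W_out + w] =
                            x[(c * H + h + p) * W + (w + q)];
            }
    return x_unroll;
}

// Naive row-major matrix product: (rows x inner) * (inner x cols).
std::vector<float> matmul(const std::vector<float>& a, const std::vector<float>& b,
                          int rows, int inner, int cols) {
    assert(a.size() == static_cast<std::size_t>(rows) * inner);
    assert(b.size() == static_cast<std::size_t>(inner) * cols);
    std::vector<float> out(static_cast<std::size_t>(rows) * cols, 0.0f);
    for (int i = 0; i < rows; ++i)
        for (int k = 0; k < inner; ++k)
            for (int j = 0; j < cols; ++j)
                out[i * cols + j] += a[i * inner + k] * b[k * cols + j];
    return out;
}
```

For a single $3 \times 3$ input holding the values $1$ to $9$, $K = 2$ and one all-ones kernel, `matmul(weights, unroll_sample(x, 1, 3, 3, 2), 1, 4, 4)` gives the $2 \times 2$ output $12, 16, 24, 28$, which matches sliding the kernel over the input directly.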

### Project assumptions

- The kernel always moves with a stride of $1$, so $H_\text{out} = H - K + 1$ and $W_\text{out} = W - K + 1$.
- Padding is always $0$. The kernel is only applied where it fits entirely inside the input, so the outermost rows and columns (two on each side for a $5 \times 5$ kernel) act as ghost cells. As a result, the output feature maps are always smaller than the input ones.
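
Under these two assumptions the output size along each axis follows directly from the input size and $K$. The sketch below is illustrative only; `conv_out_size` is a hypothetical name and does not come from the project.

```cpp
#include <cassert>

// Output extent along one axis for stride 1 and padding 0.
int conv_out_size(int in_size, int K) {
    assert(K >= 1 && in_size >= K);  // the kernel must fit inside the input
    return in_size - K + 1;
}
```

For a $28 \times 28$ Fashion-MNIST image and an assumed kernel size $K = 5$, each output feature map is $24 \times 24$.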

## Version 0: Naive implementation (Sequential Version)

Note:
- `transpose()` is a lazy operation in Eigen: it does not reorder the underlying storage. Calling `transpose().data()` to retrieve a row-major 1D array from a Matrix is therefore pointless.
- When initialize a new