# **Fully Connected Neural Network: A `CUDA` and `C++` Implementation**

## **Prepare workspace**

In [1]:
from google.colab import drive
drive.mount("/content/drive")
%cd /content/drive/MyDrive/Project

Mounted at /content/drive
/content/drive/MyDrive/Project


## **Extract `.gz` data (if needed)**

In [None]:
# Extract the MNIST archives from `.gz`.
# This cell only needs to run once.
!pip install patool
import patoolib
patoolib.extract_archive("mnist/t10k-images-idx3-ubyte.gz", outdir="mnist")
patoolib.extract_archive("mnist/t10k-labels-idx1-ubyte.gz", outdir="mnist")
patoolib.extract_archive("mnist/train-images-idx3-ubyte.gz", outdir="mnist")
patoolib.extract_archive("mnist/train-labels-idx1-ubyte.gz", outdir="mnist")

Collecting patool
  Downloading patool-3.1.0-py2.py3-none-any.whl.metadata (4.3 kB)
Downloading patool-3.1.0-py2.py3-none-any.whl (98 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 98.4/98.4 kB 2.0 MB/s eta 0:00:00
Installing collected packages: patool
Successfully installed patool-3.1.0


INFO patool: Extracting mnist/t10k-images-idx3-ubyte.gz ...
INFO patool: running /usr/bin/7z e -omnist -- mnist/t10k-images-idx3-ubyte.gz
INFO patool: ... mnist/t10k-images-idx3-ubyte.gz extracted to `mnist'.
INFO patool: Extracting mnist/t10k-labels-idx1-ubyte.gz ...
INFO patool: running /usr/bin/7z e -omnist -- mnist/t10k-labels-idx1-ubyte.gz
INFO patool: ... mnist/t10k-labels-idx1-ubyte.gz extracted to `mnist'.
INFO patool: Extracting mnist/train-images-idx3-ubyte.gz ...
INFO patool: running /usr/bin/7z e -omnist -- mni

'mnist'
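
A quick sanity check on the extracted files can catch a failed or partial extraction before compiling anything. The sketch below parses the header of an IDX file, the format MNIST is distributed in: a big-endian magic number (2051 for images, 2049 for labels) followed by the item count and, for images, the row and column counts. The helper names `read_idx_header` and `check_file` are illustrative and not part of the project, and the extracted file names are assumed to be the archive names without the `.gz` suffix.

```python
import struct

# Magic numbers defined by the IDX format used for MNIST.
IMAGE_MAGIC = 2051  # idx3-ubyte: images
LABEL_MAGIC = 2049  # idx1-ubyte: labels


def read_idx_header(raw: bytes) -> dict:
    """Parse the big-endian header at the start of an IDX file's bytes."""
    magic, count = struct.unpack(">II", raw[:8])
    if magic == IMAGE_MAGIC:
        rows, cols = struct.unpack(">II", raw[8:16])
        return {"kind": "images", "count": count, "size": rows * cols}
    if magic == LABEL_MAGIC:
        return {"kind": "labels", "count": count}
    raise ValueError(f"unexpected IDX magic number: {magic}")


def check_file(path: str) -> dict:
    """Hypothetical helper: read just enough of a file to parse its header."""
    with open(path, "rb") as f:
        return read_idx_header(f.read(16))
```

For example, `check_file("mnist/t10k-images-idx3-ubyte")` (assumed file name) should report 10000 images of size 784 if the extraction succeeded, which matches the counts the program prints when it loads the test set.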

## **Edit `Makefile`**

In [None]:
%%writefile Makefile

# Compilers
CXX := g++
CXX_FLAGS := -std=c++17 -ggdb
NVCC := nvcc

# Folders
BIN := bin
SRC := src
INCLUDE := include

EXECUTABLE := nn_main

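# These targets are commands, not files
.PHONY: all run clean

all: $(BIN)/$(EXECUTABLE)

run: clean all
	clear
	./$(BIN)/$(EXECUTABLE)

# Build every CUDA and C++ source in a single nvcc invocation
$(BIN)/$(EXECUTABLE): $(SRC)/*.cu $(SRC)/*.cpp
	@mkdir -p $(BIN)
	$(NVCC) -I $(INCLUDE) $^ -o $@

# -f keeps clean quiet when bin/ is already empty
clean:
	rm -f $(BIN)/*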

Overwriting Makefile


## **Compile and run**

In [57]:
# Compile
!make

nvcc -I include src/main.cu src/nn.cu src/utils_device.cu src/data.cpp src/utils_host.cpp -o bin/nn_main


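### **Run with different configurations**
> To run the program:
> `./bin/nn_main <#-neurons> <#-epochs> <learning-rate> <mode>` \
> Set `mode` to `0` for the baseline GPU kernels, or to `1` to enable the optimized GPU path (tiled matmul in the forward pass, fp16 in the backward pass).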
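
When sweeping several configurations, it can be convenient to build the command from Python rather than retyping it in each cell. The following is a minimal sketch, assuming the binary is the `bin/nn_main` that the Makefile produces and that it takes the four positional arguments in the order given above. The helpers `build_command` and `run` are hypothetical and not part of the project.

```python
import subprocess


def build_command(neurons: int, epochs: int, learning_rate: float,
                  optimized: bool, binary: str = "./bin/nn_main") -> list:
    """Assemble the argument list in the order the usage line gives."""
    mode = "1" if optimized else "0"
    return [binary, str(neurons), str(epochs), str(learning_rate), mode]


def run(neurons: int, epochs: int, learning_rate: float, optimized: bool) -> str:
    """Run the binary and return its stdout; raises if it exits non-zero."""
    cmd = build_command(neurons, epochs, learning_rate, optimized)
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

Calling `run(20, 4, 0.5, optimized=False)` would then correspond to `./bin/nn_main 20 4 0.5 0`.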

In [58]:
!echo "BASELINE GPU TEST..."
!./bin/nn_main 20 4 0.5 0

BASELINE GPU TEST...
-- # neurons: 20
-- # epochs: 4
-- learning rate: 0.5
-- optimize GPU (tiled matmul in FW, fp16 in BW): 0
Train Images: 60000 with size 784
Train Labels: 60000 labels loaded
Test Images: 10000 with size 784
Test Labels: 10000 labels loaded

- layer 0 forward time: 4379.844238 ms
- layer 1 forward time: 88.424767 ms
- layer 2 forward time: 55.104832 ms
> FORWARD TIME CPU: 4523.625000 ms

- layer 0 forward time: 53.783936 ms
- layer 1 forward time: 5.358496 ms
- layer 2 forward time: 2.895040 ms
> FORWARD TIME GPU: 66.104355 ms

-- Mean error CPU - GPU: 1.23281e-05

Training on CPU...
- layer 0 forward time: 3048.427734 ms
- layer 1 forward time: 88.785919 ms
- layer 2 forward time: 55.749695 ms
>>> Epoch 1 CEE loss: 13.4511
- layer 0 forward time: 3676.557617 ms
- layer 1 forward time: 140.046204 ms
- layer 2 forward time: 85.972450 ms
>>> Epoch 2 CEE loss: 13.3549
- layer 0 forward time: 3045.989502 ms
- layer 1 forward time: 87.990623 ms
- layer 2 forward time: 53
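
The CPU and GPU totals in the output above can be turned into a speedup figure without copying numbers by hand. This sketch assumes the log keeps the `> FORWARD TIME <device>: <ms> ms` line format shown in the output; the function names are illustrative, and the sample string simply repeats the two totals from the baseline run.

```python
import re

# Matches the summary lines, e.g. "> FORWARD TIME CPU: 4523.625000 ms".
TOTAL_RE = re.compile(r"> FORWARD TIME (CPU|GPU): ([0-9.]+) ms")


def forward_totals(log: str) -> dict:
    """Return the first reported total forward time per device, in ms."""
    totals = {}
    for device, ms in TOTAL_RE.findall(log):
        totals.setdefault(device, float(ms))
    return totals


def speedup(log: str) -> float:
    """CPU forward time divided by GPU forward time."""
    totals = forward_totals(log)
    return totals["CPU"] / totals["GPU"]


sample = """\
> FORWARD TIME CPU: 4523.625000 ms
> FORWARD TIME GPU: 66.104355 ms
"""
print(f"GPU forward pass speedup: {speedup(sample):.1f}x")
```

On the baseline run this works out to a speedup of roughly 68x for the forward pass. A single run is noisy, so averaging over several runs would give a more reliable figure.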

In [59]:
!echo "OPTIMIZED GPU TEST..."
!./bin/nn_main 20 4 0.5 1

OPTIMIZED GPU TEST...
-- # neurons: 20
-- # epochs: 4
-- learning rate: 0.5
-- optimize GPU (tiled matmul in FW, fp16 in BW): 1
Train Images: 60000 with size 784
Train Labels: 60000 labels loaded
Test Images: 10000 with size 784
Test Labels: 10000 labels loaded

- layer 0 forward time: 4251.382812 ms
- layer 1 forward time: 98.322815 ms
- layer 2 forward time: 55.277599 ms
> FORWARD TIME CPU: 4405.264648 ms

- layer 0 forward time: 51.730656 ms
- layer 1 forward time: 5.418880 ms
- layer 2 forward time: 2.607744 ms
> FORWARD TIME GPU: 63.684673 ms

-- Mean error CPU - GPU: 8.18262e-06

Training on CPU...
- layer 0 forward time: 3013.743408 ms
- layer 1 forward time: 88.179619 ms
- layer 2 forward time: 54.590782 ms
>>> Epoch 1 CEE loss: 13.4511
- layer 0 forward time: 4375.395996 ms
- layer 1 forward time: 89.246338 ms
- layer 2 forward time: 64.704514 ms
>>> Epoch 2 CEE loss: 13.3549
- layer 0 forward time: 3020.679199 ms
- layer 1 forward time: 88.342880 ms
- layer 2 forward time: 53

In [61]:
!echo "CPU-GPU Dual train test..."
!./bin/nn_main 20 20 0.5 0

CPU-GPU Dual train test...
-- # neurons: 20
-- # epochs: 20
-- learning rate: 0.5
-- optimize GPU (tiled matmul in FW, fp16 in BW): 0
Train Images: 60000 with size 784
Train Labels: 60000 labels loaded
Test Images: 10000 with size 784
Test Labels: 10000 labels loaded

- layer 0 forward time: 3319.694092 ms
- layer 1 forward time: 141.525375 ms
- layer 2 forward time: 86.575104 ms
> FORWARD TIME CPU: 3548.024902 ms

- layer 0 forward time: 60.542976 ms
- layer 1 forward time: 6.197664 ms
- layer 2 forward time: 3.332960 ms
> FORWARD TIME GPU: 75.262115 ms

-- Mean error CPU - GPU: 2.104e-05

Training on CPU...
- layer 0 forward time: 3903.070068 ms
- layer 1 forward time: 88.365059 ms
- layer 2 forward time: 56.622082 ms
>>> Epoch 1 CEE loss: 13.4511
- layer 0 forward time: 3025.088135 ms
- layer 1 forward time: 88.295166 ms
- layer 2 forward time: 54.308353 ms
>>> Epoch 2 CEE loss: 13.3549
- layer 0 forward time: 3011.146484 ms
- layer 1 forward time: 97.058495 ms
- layer 2 forward tim

In [60]:
!echo "CPU-GPU Dual train test..."
!./bin/nn_main 20 20 0.5 1

CPU-GPU Dual train test...
-- # neurons: 20
-- # epochs: 20
-- learning rate: 0.5
-- optimize GPU (tiled matmul in FW, fp16 in BW): 1
Train Images: 60000 with size 784
Train Labels: 60000 labels loaded
Test Images: 10000 with size 784
Test Labels: 10000 labels loaded

- layer 0 forward time: 3066.692871 ms
- layer 1 forward time: 88.674301 ms
- layer 2 forward time: 57.085377 ms
> FORWARD TIME CPU: 3212.644043 ms

- layer 0 forward time: 51.936928 ms
- layer 1 forward time: 5.931936 ms
- layer 2 forward time: 2.697984 ms
> FORWARD TIME GPU: 65.409950 ms

-- Mean error CPU - GPU: 1.17953e-05

Training on CPU...
- layer 0 forward time: 3116.512207 ms
- layer 1 forward time: 88.229507 ms
- layer 2 forward time: 57.950657 ms
>>> Epoch 1 CEE loss: 13.4511
- layer 0 forward time: 3035.551758 ms
- layer 1 forward time: 90.678146 ms
- layer 2 forward time: 54.158878 ms
>>> Epoch 2 CEE loss: 13.3549
- layer 0 forward time: 4352.062500 ms
- layer 1 forward time: 88.648735 ms
- layer 2 forward ti