# **Fully Connected Neural Network: A `CUDA` and `C++` Implementation**

## **Prepare workspace**

In [23]:
from google.colab import drive
drive.mount("/content/drive")
%cd /content/drive/MyDrive/ParaProgram/Project

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/ParaProgram/Project


## **Extract `.gz` data (if needed)**

In [None]:
# Extract data from `.gz`
# Only need to run once!
!pip install patool
import patoolib
patoolib.extract_archive("mnist/t10k-images-idx3-ubyte.gz", outdir="mnist")
patoolib.extract_archive("mnist/t10k-labels-idx1-ubyte.gz", outdir="mnist")
patoolib.extract_archive("mnist/train-images-idx3-ubyte.gz", outdir="mnist")
patoolib.extract_archive("mnist/train-labels-idx1-ubyte.gz", outdir="mnist")

Collecting patool
  Downloading patool-3.1.0-py2.py3-none-any.whl.metadata (4.3 kB)
Downloading patool-3.1.0-py2.py3-none-any.whl (98 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 98.4/98.4 kB 2.0 MB/s eta 0:00:00
Installing collected packages: patool
Successfully installed patool-3.1.0


INFO patool: Extracting mnist/t10k-images-idx3-ubyte.gz ...
INFO patool: running /usr/bin/7z e -omnist -- mnist/t10k-images-idx3-ubyte.gz
INFO patool: ... mnist/t10k-images-idx3-ubyte.gz extracted to `mnist'.
INFO patool: Extracting mnist/t10k-labels-idx1-ubyte.gz ...
INFO patool: running /usr/bin/7z e -omnist -- mnist/t10k-labels-idx1-ubyte.gz
INFO patool: ... mnist/t10k-labels-idx1-ubyte.gz extracted to `mnist'.
INFO patool: Extracting mnist/train-images-idx3-ubyte.gz ...
INFO patool: running /usr/bin/7z e -omnist -- mni

'mnist'
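Before compiling, it can be worth confirming that the extracted files are raw IDX files. An IDX header starts with two zero bytes, a data-type byte (`0x08` for unsigned bytes), a dimension-count byte, and then one big-endian 32-bit size per dimension. The helper `parse_idx_header` below is a hypothetical sketch, not part of the project's sources; for the MNIST training images it should report 60000 × 28 × 28, matching the "60000 with size 784" printed by the program.

```python
import struct

def parse_idx_header(data: bytes):
    """Parse the header of an IDX file (hypothetical helper).

    Returns (dtype_code, dims), e.g. (0x08, [60000, 28, 28]) for the
    MNIST training images. Raises ValueError on malformed input.
    """
    if len(data) < 4 or data[0] != 0 or data[1] != 0:
        raise ValueError("not an IDX header: first two bytes must be zero")
    dtype_code, ndim = data[2], data[3]
    end = 4 + 4 * ndim
    if len(data) < end:
        raise ValueError("truncated IDX header")
    # Each dimension size is a big-endian unsigned 32-bit integer.
    dims = list(struct.unpack(">" + "I" * ndim, data[4:end]))
    return dtype_code, dims

# Synthetic header with the same layout as train-images-idx3-ubyte.
header = struct.pack(">BBBBIII", 0, 0, 0x08, 3, 60000, 28, 28)
print(parse_idx_header(header))  # (8, [60000, 28, 28])
```

To check a real file, the first 16 bytes are enough for both the image and label files, e.g. `parse_idx_header(open("mnist/train-images-idx3-ubyte", "rb").read(16))`.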

## **Edit `Makefile`**

In [None]:
%%writefile Makefile

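# Compilers
# (CXX and CXX_FLAGS are currently unused: nvcc compiles every source file below)
CXX := g++
CXX_FLAGS := -std=c++17 -ggdb
NVCC := nvcc

# Folders
BIN := bin
SRC := src
INCLUDE := include

EXECUTABLE := nn_main

.PHONY: all run clean

all: $(BIN)/$(EXECUTABLE)

run: clean all
	clear
	./$(BIN)/$(EXECUTABLE)

# Build the executable from all CUDA and C++ sources in a single nvcc call
$(BIN)/$(EXECUTABLE): $(SRC)/*.cu $(SRC)/*.cpp
	@mkdir -p $(BIN)
	$(NVCC) -I $(INCLUDE) $^ -o $@

clean:
	rm -f $(BIN)/*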

Overwriting Makefile


## **Compile and run**

In [24]:
# Compile
!make

nvcc -I include src/main.cu src/nn.cu src/utils_device.cu src/data.cpp src/utils_host.cpp -o bin/nn_main


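### **Run with different configurations**
> To run the program:
> `./bin/nn_main <#-neurons> <#-epochs> <learning-rate> <mode>`
>
> Every run first benchmarks the forward pass on CPU, GPU and optimized GPU. Judging by the flags printed in the runs below, `<mode>` then selects the implementation used for training:
> - `1`: CPU (`use GPU: 0`, `optimize GPU: 0`)
> - `2`: GPU (`use GPU: 1`, `optimize GPU: 0`)
> - `3`: optimized GPU (`use GPU: 1`, `optimize GPU: 1`)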
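To sweep several configurations from the notebook, a small wrapper around `subprocess` can build and run the command. This is a minimal sketch: `MODES`, `nn_main_command` and `run_config` are hypothetical names, and the mode labels are inferred from the `use GPU` / `optimize GPU` flags the program logs, not read from its source.

```python
import subprocess

# Mode codes inferred from the logged "use GPU" / "optimize GPU" flags (assumption).
MODES = {1: "CPU", 2: "GPU", 3: "GPU (optimized)"}

def nn_main_command(neurons: int, epochs: int, lr: float, mode: int,
                    exe: str = "./bin/nn_main") -> list[str]:
    """Build the argument list for one run of nn_main."""
    if mode not in MODES:
        raise ValueError(f"unknown mode {mode}; expected one of {sorted(MODES)}")
    return [exe, str(neurons), str(epochs), str(lr), str(mode)]

def run_config(neurons: int, epochs: int, lr: float, mode: int) -> str:
    """Run nn_main with the given configuration and return its stdout."""
    cmd = nn_main_command(neurons, epochs, lr, mode)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout
```

For example, `for mode in MODES: print(MODES[mode], run_config(50, 3, 0.5, mode))` would reproduce the three cells below in one loop.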

In [28]:
!echo "CPU TEST..."
!./bin/nn_main 50 3 0.5 1

CPU TEST...
-- # neurons: 50
-- # epochs: 3
-- learning rate: 0.5
Train Images: 60000 with size 784
Train Labels: 60000 labels loaded
Test Images: 10000 with size 784
Test Labels: 10000 labels loaded

- layer 0 forward time: 9010.190430 ms
- layer 1 forward time: 468.251221 ms
- layer 2 forward time: 109.666527 ms
FORWARD TIME CPU: 9588.397461 ms

- layer 0 forward time: 68.087517 ms
- layer 1 forward time: 10.192416 ms
- layer 2 forward time: 3.583424 ms
FORWARD TIME GPU: 86.173508 ms

- layer 0 forward time: 62.400162 ms
- layer 1 forward time: 10.114080 ms
- layer 2 forward time: 3.528352 ms
FORWARD TIME GPU (OPTIMIZED): 80.326302 ms

-- Mean error CPU - GPU: 6.3041e-08
-- Mean error CPU - GPU (optimized): 6.3041e-08

Train start...
-- number of epochs: 3
-- use GPU: 0
-- optimize GPU: 0
- layer 0 forward time: 7680.871094 ms
- layer 1 forward time: 476.981415 ms
- layer 2 forward time: 107.284767 ms
>>> Epoch 1 CEE loss: 12.4924
- layer 0 forward time: 8983.592773 ms
- layer 1 forw
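Each "layer forward time" above times one fully connected layer. As a reference for what such a layer computes, here is a minimal pure-Python sketch of y = σ(Wx + b), plus a mean absolute difference of the kind the "Mean error CPU - GPU" lines report. The sigmoid activation, the row-major weight layout, and reading "mean error" as a mean absolute difference are assumptions; the project's `src/nn.cu` may differ.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function (assumed activation)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # avoids overflow of exp(-z) for very negative z
    return e / (1.0 + e)

def dense_forward(x, W, b):
    """One fully connected layer: y[j] = sigmoid(sum_i W[j][i] * x[i] + b[j]).

    W holds len(b) rows of len(x) weights (row-major, assumed layout).
    """
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bias)
            for row, bias in zip(W, b)]

def mean_abs_error(a, b):
    """Mean absolute element-wise difference, e.g. CPU vs GPU layer outputs."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)

y = dense_forward([0.0, 1.0], [[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0])
print(y, mean_abs_error(y, y))
```

Assuming a 784 → 50 → 50 → 10 layout (784-pixel inputs and 50 neurons as reported, 10 MNIST classes), layer 0 performs 784 × 50 = 39,200 multiply-adds per image against 2,500 for layer 1 and 500 for layer 2, which is consistent with layer 0 dominating the CPU forward time.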

In [29]:
!echo "GPU TEST..."
!./bin/nn_main 50 3 0.5 2

GPU TEST...
-- # neurons: 50
-- # epochs: 3
-- learning rate: 0.5
Train Images: 60000 with size 784
Train Labels: 60000 labels loaded
Test Images: 10000 with size 784
Test Labels: 10000 labels loaded

- layer 0 forward time: 7583.410156 ms
- layer 1 forward time: 468.371429 ms
- layer 2 forward time: 113.567940 ms
FORWARD TIME CPU: 8165.673828 ms

- layer 0 forward time: 68.895805 ms
- layer 1 forward time: 12.988352 ms
- layer 2 forward time: 3.976992 ms
FORWARD TIME GPU: 91.524445 ms

- layer 0 forward time: 65.687523 ms
- layer 1 forward time: 9.670432 ms
- layer 2 forward time: 3.618688 ms
FORWARD TIME GPU (OPTIMIZED): 84.322403 ms

-- Mean error CPU - GPU: 4.13167e-08
-- Mean error CPU - GPU (optimized): 4.13167e-08

Train start...
-- number of epochs: 3
-- use GPU: 1
-- optimize GPU: 0
- layer 0 forward time: 62.750015 ms
- layer 1 forward time: 12.616256 ms
- layer 2 forward time: 4.257408 ms
bw GPU
1
0
>>> Epoch 1 CEE loss: 13.8827
- layer 0 forward time: 71.292580 ms
- layer 1
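The training loop reports a "CEE loss" (cross-entropy error) per epoch. Below is a minimal sketch of softmax followed by cross-entropy against an integer class label, averaged over a batch. Whether the program applies softmax in its output layer, and whether it averages or sums the loss over the 60,000 training images, is not visible in this output, so both are assumptions here.

```python
import math

def softmax(z):
    """Numerically stable softmax: subtract the max logit before exponentiating."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label, eps=1e-12):
    """Cross-entropy for one sample with a one-hot target at index `label`."""
    return -math.log(max(probs[label], eps))  # eps guards against log(0)

def mean_cee(batch_logits, labels):
    """Average cross-entropy over a batch of (logits, label) pairs."""
    losses = [cross_entropy(softmax(z), y) for z, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)

print(mean_cee([[2.0, 1.0, 0.1], [0.0, 0.0, 0.0]], [0, 2]))
```

A model that guesses uniformly over the 10 MNIST classes scores ln 10 ≈ 2.30 per sample, which is a useful reference point when interpreting the logged loss values.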

In [30]:
!echo "OPTIMIZED GPU TEST..."
!./bin/nn_main 50 3 0.5 3

OPTIMIZED GPU TEST...
-- # neurons: 50
-- # epochs: 3
-- learning rate: 0.5
Train Images: 60000 with size 784
Train Labels: 60000 labels loaded
Test Images: 10000 with size 784
Test Labels: 10000 labels loaded

- layer 0 forward time: 7633.858398 ms
- layer 1 forward time: 473.356415 ms
- layer 2 forward time: 114.972672 ms
FORWARD TIME CPU: 8222.489258 ms

- layer 0 forward time: 68.588318 ms
- layer 1 forward time: 9.579968 ms
- layer 2 forward time: 3.555968 ms
FORWARD TIME GPU: 87.120094 ms

- layer 0 forward time: 62.548862 ms
- layer 1 forward time: 10.235680 ms
- layer 2 forward time: 4.146112 ms
FORWARD TIME GPU (OPTIMIZED): 81.554276 ms

-- Mean error CPU - GPU: 5.38019e-08
-- Mean error CPU - GPU (optimized): 5.40838e-08

Train start...
-- number of epochs: 3
-- use GPU: 1
-- optimize GPU: 1
- layer 0 forward time: 63.105728 ms
- layer 1 forward time: 11.451776 ms
- layer 2 forward time: 3.552768 ms
bw GPU
1
0
>>> Epoch 1 CEE loss: 12.6426
- layer 0 forward time: 58.929920 ms
- layer 1
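To compare the implementations, the notebook can pull the `FORWARD TIME` totals out of the program's output and compute speedups relative to the CPU. This is a minimal sketch that assumes the log lines keep exactly the format shown above; `forward_times` and `speedups` are hypothetical helper names.

```python
import re

# Matches lines such as "FORWARD TIME GPU (OPTIMIZED): 80.326302 ms".
FORWARD_RE = re.compile(r"^FORWARD TIME (.+?): ([0-9.]+) ms$", re.MULTILINE)

def forward_times(log: str) -> dict[str, float]:
    """Map each implementation label (CPU, GPU, ...) to its total forward time in ms."""
    return {label: float(ms) for label, ms in FORWARD_RE.findall(log)}

def speedups(times: dict[str, float], baseline: str = "CPU") -> dict[str, float]:
    """Speedup of every other implementation relative to the baseline (t_base / t)."""
    base = times[baseline]
    return {label: base / t for label, t in times.items() if label != baseline}

# Totals copied from the last run above.
log = """FORWARD TIME CPU: 8222.489258 ms
FORWARD TIME GPU: 87.120094 ms
FORWARD TIME GPU (OPTIMIZED): 81.554276 ms"""

for label, s in speedups(forward_times(log)).items():
    print(f"{label}: {s:.1f}x")
```

For that run the forward pass is roughly 94× faster on the GPU and roughly 101× faster with the optimized GPU path, which takes about 6% less time than the plain GPU version. The CPU total varies between about 8166 ms and 9588 ms across the three runs, so single measurements should be compared with care.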