# **Fully Connected Neural Network: A `CUDA` and `C++` Implementation**

## **Prepare workspace**

In [1]:
from google.colab import drive
drive.mount("/content/drive")
%cd /content/drive/MyDrive/Project

Mounted at /content/drive
/content/drive/MyDrive/Project


## **Extract `.gz` data (if needed)**

In [None]:
# Extract data from `.gz`
# Only need to run once!
!pip install patool
import patoolib
patoolib.extract_archive("mnist/t10k-images-idx3-ubyte.gz", outdir="mnist")
patoolib.extract_archive("mnist/t10k-labels-idx1-ubyte.gz", outdir="mnist")
patoolib.extract_archive("mnist/train-images-idx3-ubyte.gz", outdir="mnist")
patoolib.extract_archive("mnist/train-labels-idx1-ubyte.gz", outdir="mnist")

Collecting patool
  Downloading patool-3.1.0-py2.py3-none-any.whl.metadata (4.3 kB)
Downloading patool-3.1.0-py2.py3-none-any.whl (98 kB)
Installing collected packages: patool
Successfully installed patool-3.1.0


INFO patool: Extracting mnist/t10k-images-idx3-ubyte.gz ...
INFO patool: running /usr/bin/7z e -omnist -- mnist/t10k-images-idx3-ubyte.gz
INFO patool: ... mnist/t10k-images-idx3-ubyte.gz extracted to `mnist'.
INFO patool: Extracting mnist/t10k-labels-idx1-ubyte.gz ...
INFO patool: running /usr/bin/7z e -omnist -- mnist/t10k-labels-idx1-ubyte.gz
INFO patool: ... mnist/t10k-labels-idx1-ubyte.gz extracted to `mnist'.
INFO patool: Extracting mnist/train-images-idx3-ubyte.gz ...
INFO patool: running /usr/bin/7z e -omnist -- mni

'mnist'
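
If installing `patool` (which shells out to `7z`, as the log shows) is not desirable, the same extraction can be done with Python's standard library alone. The following is a minimal sketch, not what this notebook ran: the helper name `gunzip_file` is hypothetical, and it assumes the four archives sit in `mnist/` as in the cell above.

```python
import gzip
import shutil
from pathlib import Path

MNIST_DIR = Path("mnist")  # same folder the cell above extracts into
ARCHIVES = [
    "t10k-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz",
    "train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz",
]

def gunzip_file(src: Path, dst: Path) -> Path:
    """Decompress the single .gz file `src` into `dst` and return `dst`."""
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)  # streams in chunks instead of reading it all
    return dst

for name in ARCHIVES:
    src = MNIST_DIR / name
    if src.exists():  # skip quietly when an archive is absent
        out = gunzip_file(src, src.with_suffix(""))  # drops the trailing ".gz"
        print(f"extracted {src} -> {out}")
```

Like the `patool` cell, this only needs to run once; rerunning it simply overwrites the extracted files.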

## **Edit `Makefile`**

In [None]:
%%writefile Makefile

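# Compilers
# Note: CXX and CXX_FLAGS are not referenced by any rule below;
# nvcc compiles both the .cu and the .cpp sources.
CXX := g++
CXX_FLAGS := -std=c++17 -ggdb
NVCC := nvcc

# Folders
BIN := bin
SRC := src
INCLUDE := include

EXECUTABLE := nn_main

# These targets do not name files, so their recipes should always run.
.PHONY: all run clean

all: $(BIN)/$(EXECUTABLE)

run: clean all
	clear
	./$(BIN)/$(EXECUTABLE)

$(BIN)/$(EXECUTABLE): $(SRC)/*.cu $(SRC)/*.cpp
	@mkdir -p $(BIN)
	$(NVCC) -I $(INCLUDE) $^ -o $@

clean:
	rm -f $(BIN)/*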

Overwriting Makefile


## **Compile and run**

In [51]:
# Compile
!make

nvcc -I include src/main.cu src/nn.cu src/utils_device.cu src/data.cpp src/utils_host.cpp -o bin/nn_main


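### **Run with different configurations**
> To run the program:
> `./bin/nn_main <#-neurons> <#-epochs> <learning-rate> <mode>` \
> Set `mode` to `0` for the baseline GPU kernels, or to `1` to enable the GPU optimizations (tiled matmul in the forward pass, fp16 in the backward pass).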
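
To sweep several configurations from the notebook without retyping the command, the positional arguments can be assembled in Python. This is a sketch under assumptions: `build_command` is a hypothetical helper, the argument order is taken from the usage line above, and the binary path is the one the Makefile produces.

```python
import subprocess
from pathlib import Path

BINARY = "./bin/nn_main"  # $(BIN)/$(EXECUTABLE) from the Makefile

def build_command(neurons: int, epochs: int, lr: float, mode: int) -> list[str]:
    """Return argv for one run: <#-neurons> <#-epochs> <learning-rate> <mode>."""
    if mode not in (0, 1):
        raise ValueError("mode must be 0 (baseline GPU) or 1 (optimized GPU)")
    if neurons <= 0 or epochs <= 0 or lr <= 0:
        raise ValueError("neurons, epochs and lr must be positive")
    return [BINARY, str(neurons), str(epochs), str(lr), str(mode)]

# Baseline first, then optimized, with otherwise identical settings.
for mode in (0, 1):
    cmd = build_command(20, 4, 0.5, mode)
    print(" ".join(cmd))
    if Path(BINARY).exists():  # only launch when the binary has been built
        subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` makes a non-zero exit code raise instead of being silently ignored.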

In [54]:
!echo "BASELINE GPU TEST..."
!./bin/nn_main 20 4 0.5 0

BASELINE GPU TEST...
-- # neurons: 20
-- # epochs: 4
-- learning rate: 0.5
-- optimize GPU (tiled matmul in FW, fp16 in BW): 0
Train Images: 60000 with size 784
Train Labels: 60000 labels loaded
Test Images: 10000 with size 784
Test Labels: 10000 labels loaded

- layer 0 forward time: 2980.574951 ms
- layer 1 forward time: 79.021629 ms
- layer 2 forward time: 51.605698 ms
> FORWARD TIME CPU: 3111.387939 ms

- layer 0 forward time: 54.005630 ms
- layer 1 forward time: 5.348384 ms
- layer 2 forward time: 2.537472 ms
> FORWARD TIME GPU: 65.901375 ms

-- Mean error CPU - GPU: 1.72985e-05

Training on CPU...
- layer 0 forward time: 2976.766846 ms
- layer 1 forward time: 79.888062 ms
- layer 2 forward time: 50.814976 ms
>>> Epoch 1 CEE loss: 13.4511
- layer 0 forward time: 3080.267578 ms
- layer 1 forward time: 78.784317 ms
- layer 2 forward time: 50.777794 ms
>>> Epoch 2 CEE loss: 13.3069
- layer 0 forward time: 3015.235840 ms
- layer 1 forward time: 131.769440 ms
- layer 2 forward time: 79
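
The two `FORWARD TIME` totals in this log are enough to estimate the GPU speedup of the forward pass, which comes out at roughly 47x for this run. A small sketch that pulls the totals out of the program output follows; the helper name `forward_times` and the regular expression are assumptions based on the log format shown above.

```python
import re

# The two summary lines copied from the baseline run above.
LOG = """\
> FORWARD TIME CPU: 3111.387939 ms
> FORWARD TIME GPU: 65.901375 ms
"""

# Matches e.g. "> FORWARD TIME CPU: 3111.387939 ms"; captures device and value.
PATTERN = re.compile(r"^> FORWARD TIME (CPU|GPU): ([0-9.]+) ms$", re.MULTILINE)

def forward_times(log: str) -> dict[str, float]:
    """Map device name ("CPU" or "GPU") to total forward time in milliseconds."""
    return {device: float(ms) for device, ms in PATTERN.findall(log)}

times = forward_times(LOG)
speedup = times["CPU"] / times["GPU"]
print(f"forward speedup: {speedup:.1f}x")
```

The same function can be applied to the full captured output of a run, since per-layer lines do not match the pattern.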

In [55]:
!echo "OPTIMIZED GPU TEST..."
!./bin/nn_main 20 4 0.5 1

OPTIMIZED GPU TEST...
-- # neurons: 20
-- # epochs: 4
-- learning rate: 0.5
-- optimize GPU (tiled matmul in FW, fp16 in BW): 1
Train Images: 60000 with size 784
Train Labels: 60000 labels loaded
Test Images: 10000 with size 784
Test Labels: 10000 labels loaded

- layer 0 forward time: 2989.494385 ms
- layer 1 forward time: 79.177536 ms
- layer 2 forward time: 50.835457 ms
> FORWARD TIME CPU: 3119.703369 ms

- layer 0 forward time: 52.343487 ms
- layer 1 forward time: 5.411840 ms
- layer 2 forward time: 2.627552 ms
> FORWARD TIME GPU: 64.315872 ms

-- Mean error CPU - GPU: 1.71073e-05

Training on CPU...
- layer 0 forward time: 2990.043213 ms
- layer 1 forward time: 78.734238 ms
- layer 2 forward time: 50.972897 ms
>>> Epoch 1 CEE loss: 13.4511
- layer 0 forward time: 3535.662842 ms
- layer 1 forward time: 78.855133 ms
- layer 2 forward time: 51.082207 ms
>>> Epoch 2 CEE loss: 13.3069
- layer 0 forward time: 2982.452393 ms
- layer 1 forward time: 79.105888 ms
- layer 2 forward time: 48
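
The log line `optimize GPU (tiled matmul in FW, fp16 in BW)` names tiling as the forward-pass optimization, but the kernel itself is not shown in this notebook. As a rough illustration of the idea only, the sketch below performs a blocked matrix multiply on the CPU in plain Python. The names `matmul_naive` and `matmul_tiled` and the tile size are hypothetical; in a CUDA kernel, each output tile would typically map to a thread block, with the input tiles staged in shared memory.

```python
def matmul_naive(A, B):
    """Reference product of an n x k and a k x m matrix (lists of rows)."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def matmul_tiled(A, B, tile=2):
    """Same product, computed tile by tile over the output and the inner dimension."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):            # tile row of the output
        for j0 in range(0, m, tile):        # tile column of the output
            for p0 in range(0, k, tile):    # slice of the inner dimension
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = C[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += A[i][p] * B[p][j]
                        C[i][j] = acc       # partial sum accumulated across slices
    return C

A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
assert matmul_tiled(A, B) == matmul_naive(A, B)
print(matmul_tiled(A, B))
```

The `min(...)` bounds handle matrix sizes that are not multiples of the tile size, which is the same edge case a GPU kernel has to guard against.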