# **Fully Connected Neural Network: A `CUDA` and `C++` Implementation**

## **Prepare workspace**

In [26]:
from google.colab import drive
drive.mount("/content/drive")
%cd /content/drive/MyDrive/CUDA

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/CUDA


## **Extract `.gz` data (if needed)**

In [None]:
# Extract the MNIST archives from `.gz`.
# This only needs to run once.
!pip install patool
import patoolib
patoolib.extract_archive("mnist/t10k-images-idx3-ubyte.gz", outdir="mnist")
patoolib.extract_archive("mnist/t10k-labels-idx1-ubyte.gz", outdir="mnist")
patoolib.extract_archive("mnist/train-images-idx3-ubyte.gz", outdir="mnist")
patoolib.extract_archive("mnist/train-labels-idx1-ubyte.gz", outdir="mnist")

Collecting patool
  Downloading patool-3.1.0-py2.py3-none-any.whl.metadata (4.3 kB)
Downloading patool-3.1.0-py2.py3-none-any.whl (98 kB)
Installing collected packages: patool
Successfully installed patool-3.1.0


INFO patool: Extracting mnist/t10k-images-idx3-ubyte.gz ...
INFO patool: running /usr/bin/7z e -omnist -- mnist/t10k-images-idx3-ubyte.gz
INFO patool: ... mnist/t10k-images-idx3-ubyte.gz extracted to `mnist'.
INFO patool: Extracting mnist/t10k-labels-idx1-ubyte.gz ...
INFO patool: running /usr/bin/7z e -omnist -- mnist/t10k-labels-idx1-ubyte.gz
INFO patool: ... mnist/t10k-labels-idx1-ubyte.gz extracted to `mnist'.
INFO patool: Extracting mnist/train-images-idx3-ubyte.gz ...
INFO patool: running /usr/bin/7z e -omnist -- mni

'mnist'
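
The archives can also be unpacked with only the Python standard library, and the result sanity-checked by reading the IDX header. The following is a minimal sketch, assuming the standard MNIST IDX layout (a big-endian 32-bit magic number followed by one 32-bit size per dimension). `gunzip_file` and `read_idx_header` are hypothetical helper names and are not part of this project.

```python
import gzip
import shutil
import struct
from pathlib import Path


def gunzip_file(gz_path, out_dir):
    """Decompress `gz_path` into `out_dir`, dropping the .gz suffix."""
    gz_path = Path(gz_path)
    out_path = Path(out_dir) / gz_path.with_suffix("").name
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path


def read_idx_header(path):
    """Return (magic, dims) read from the header of an IDX file.

    The lowest byte of the magic number is the number of dimensions;
    each dimension size follows as a big-endian unsigned 32-bit integer.
    """
    with open(path, "rb") as f:
        (magic,) = struct.unpack(">I", f.read(4))
        ndim = magic & 0xFF
        dims = struct.unpack(">" + "I" * ndim, f.read(4 * ndim))
    return magic, dims
```

For an extracted MNIST training image file, the header should report magic number 2051 and dimensions `(60000, 28, 28)`, which is consistent with the "60000 with size 784" line the program prints at startup.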

## **Edit `Makefile`**

In [None]:
%%writefile Makefile

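# Compilers
# Note: CXX and CXX_FLAGS are defined but not referenced by any rule below;
# nvcc compiles both the .cu and the .cpp sources.
CXX := g++
CXX_FLAGS := -std=c++17 -ggdb
NVCC := nvcc

# Folders
BIN := bin
SRC := src
INCLUDE := include

EXECUTABLE := nn_main

# These targets do not name files
.PHONY: all run clean

all: $(BIN)/$(EXECUTABLE)

run: clean all
	clear
	./$(BIN)/$(EXECUTABLE)

# Build everything in one nvcc invocation; create bin/ first if it is missing
$(BIN)/$(EXECUTABLE): $(SRC)/*.cu $(SRC)/*.cpp
	@mkdir -p $(BIN)
	$(NVCC) -I $(INCLUDE) $^ -o $@

clean:
	rm -f $(BIN)/*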

Overwriting Makefile


## **Compile and run**

In [36]:
# Compile
!make

nvcc -I include  src/main.cu src/nn.cu src/utils_device.cu src/data.cpp src/utils_host.cpp -o bin/nn_main


### **Run with different configurations**

In [37]:
# Run the program
# Usage: ./bin/nn_main <#-neurons> <#-epochs> <learning-rate> <mode>
# Modes used below: 1 = CPU, 2 = GPU, 3 = GPU (optimized)

!echo "Train CPU..."
!./bin/nn_main 20 5 0.5 1

Train CPU...
-- # neurons: 20
-- # epochs: 5
-- learning rate: 0.5
Train Images: 60000 with size 784
Train Labels: 60000 labels loaded
Test Images: 10000 with size 784
Test Labels: 10000 labels loaded


Train start...
-- number of epochs: 5
- layer 0 forward time: 3553.003418 ms
- layer 1 forward time: 86.360382 ms
- layer 2 forward time: 53.420033 ms
>>> Epoch 1 CEE loss: 13.1673
- layer 0 forward time: 3063.780273 ms
- layer 1 forward time: 87.075935 ms
- layer 2 forward time: 49.783329 ms
>>> Epoch 2 CEE loss: 13.6883
- layer 0 forward time: 3113.592285 ms
- layer 1 forward time: 86.288383 ms
- layer 2 forward time: 51.869377 ms
>>> Epoch 3 CEE loss: 7.95351
- layer 0 forward time: 4610.627441 ms
- layer 1 forward time: 92.516350 ms
- layer 2 forward time: 50.974720 ms
>>> Epoch 4 CEE loss: 2.30289
- layer 0 forward time: 3035.546631 ms
- layer 1 forward time: 93.261856 ms
- layer 2 forward time: 55.039200 ms
>>> Epoch 5 CEE loss: 2.30289
TRAIN TIME: 42665.164062 ms
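
The CPU loss settles at 2.30289 in the last two epochs, which is close to ln(10) ≈ 2.3026. The sketch below shows why that value is notable. It assumes the reported CEE loss is the mean natural-log cross-entropy over the 10 MNIST classes, which the log itself does not state. Under that assumption, a model that outputs the same probability for every class scores exactly ln(10), whatever the labels are.

```python
import math


def cross_entropy(probs, label):
    """Natural-log cross-entropy of one prediction against an integer label."""
    return -math.log(probs[label])


num_classes = 10
uniform = [1.0 / num_classes] * num_classes

# Every label gives the same loss under a uniform prediction.
losses = [cross_entropy(uniform, label) for label in range(num_classes)]
baseline = sum(losses) / len(losses)

print(f"uniform baseline: {baseline:.5f}")
print(f"gap to logged 2.30289: {abs(baseline - 2.30289):.5f}")
```

If the assumption holds, a plateau this close to the uniform baseline would suggest the network has stopped discriminating between classes. One possible cause is a learning rate that is too large (0.5 in this run), so a smaller value may be worth trying.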



In [38]:
!echo "Train GPU..."
!./bin/nn_main 20 5 0.5 2

Train GPU...
-- # neurons: 20
-- # epochs: 5
-- learning rate: 0.5
Train Images: 60000 with size 784
Train Labels: 60000 labels loaded
Test Images: 10000 with size 784
Test Labels: 10000 labels loaded


Train start...
-- number of epochs: 5
- layer 0 forward time: 53.656673 ms
- layer 1 forward time: 5.918336 ms
- layer 2 forward time: 2.482048 ms
>>> Epoch 1 CEE loss: 12.6463
- layer 0 forward time: 52.788097 ms
- layer 1 forward time: 5.268384 ms
- layer 2 forward time: 2.423200 ms
>>> Epoch 2 CEE loss: 12.6444
- layer 0 forward time: 56.349407 ms
- layer 1 forward time: 5.297632 ms
- layer 2 forward time: 2.515200 ms
>>> Epoch 3 CEE loss: 12.6435
- layer 0 forward time: 53.461151 ms
- layer 1 forward time: 5.473344 ms
- layer 2 forward time: 2.476128 ms
>>> Epoch 4 CEE loss: 12.642
- layer 0 forward time: 53.454239 ms
- layer 1 forward time: 5.293856 ms
- layer 2 forward time: 2.434112 ms
>>> Epoch 5 CEE loss: 12.6394
TRAIN TIME: 2415.871826 ms
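
To compare runs without reading the logs by eye, the per-layer timings can be extracted with a regular expression. This is a minimal sketch that assumes the log format shown above. `mean_forward_times` is a hypothetical helper, and the sample text is copied from the first two epochs of the GPU run.

```python
import re
from collections import defaultdict

# Matches lines such as "- layer 0 forward time: 53.656673 ms"
LAYER_RE = re.compile(r"- layer (\d+) forward time: ([\d.]+) ms")


def mean_forward_times(log_text):
    """Return {layer index: mean forward time in ms} over all epochs in the log."""
    times = defaultdict(list)
    for layer, ms in LAYER_RE.findall(log_text):
        times[int(layer)].append(float(ms))
    return {layer: sum(v) / len(v) for layer, v in sorted(times.items())}


sample_log = """\
- layer 0 forward time: 53.656673 ms
- layer 1 forward time: 5.918336 ms
- layer 2 forward time: 2.482048 ms
>>> Epoch 1 CEE loss: 12.6463
- layer 0 forward time: 52.788097 ms
- layer 1 forward time: 5.268384 ms
- layer 2 forward time: 2.423200 ms
>>> Epoch 2 CEE loss: 12.6444
"""

for layer, ms in mean_forward_times(sample_log).items():
    print(f"layer {layer}: {ms:.3f} ms")
```

In a notebook, the program output can be captured for this helper with IPython's shell assignment, for example `log = !./bin/nn_main 20 5 0.5 2` followed by `mean_forward_times("\n".join(log))`.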



In [39]:
!echo "Train GPU (optimized)..."
!./bin/nn_main 20 5 0.5 3

Train GPU (optimized)...
-- # neurons: 20
-- # epochs: 5
-- learning rate: 0.5
Train Images: 60000 with size 784
Train Labels: 60000 labels loaded
Test Images: 10000 with size 784
Test Labels: 10000 labels loaded


Train start...
-- number of epochs: 5
- layer 0 forward time: 51.783810 ms
- layer 1 forward time: 6.392832 ms
- layer 2 forward time: 2.517792 ms
>>> Epoch 1 CEE loss: 11.0525
- layer 0 forward time: 50.025791 ms
- layer 1 forward time: 5.309888 ms
- layer 2 forward time: 2.428160 ms
>>> Epoch 2 CEE loss: 11.0519
- layer 0 forward time: 49.696129 ms
- layer 1 forward time: 5.263360 ms
- layer 2 forward time: 2.393344 ms
>>> Epoch 3 CEE loss: 11.0512
- layer 0 forward time: 49.642689 ms
- layer 1 forward time: 5.250528 ms
- layer 2 forward time: 2.391968 ms
>>> Epoch 4 CEE loss: 11.0503
- layer 0 forward time: 48.230175 ms
- layer 1 forward time: 5.297088 ms
- layer 2 forward time: 2.352736 ms
>>> Epoch 5 CEE loss: 11.0503
TRAIN TIME: 2443.056152 ms
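
The three `TRAIN TIME` lines can be summarized as speedups over the CPU run. The sketch below uses the values logged above. The mode labels are taken from the `echo` lines, and the dictionary layout is only for illustration.

```python
# Total training times in milliseconds, copied from the three logs above.
train_time_ms = {
    "CPU": 42665.164062,
    "GPU": 2415.871826,
    "GPU (optimized)": 2443.056152,
}

cpu_ms = train_time_ms["CPU"]
# Speedup of each mode relative to the CPU run.
speedup = {mode: cpu_ms / ms for mode, ms in train_time_ms.items()}

for mode, ms in train_time_ms.items():
    print(f"{mode:<16} {ms:>10.1f} ms  {speedup[mode]:5.2f}x")
```

In this single run the optimized mode is about 1% slower end to end than the plain GPU mode, even though its layer 0 forward pass is slightly faster. A gap that small is plausibly run-to-run noise, so repeating each configuration several times would give a firmer comparison.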

