# QBART: A general QNN inference Accelerator

*Welcome to QBART, the Quantized, Bitserial, AcceleRaTor!*

<img src="logo.png" width="400" height="400">

In this MVP implementation, the QBART team has prepared the following:
- Three layers run on the FPGA: Thresholding, Fully Connected, and Convolution.
- All the other layers run on the Cortex A9s through this notebook: pooling, minimax, and ??
- We use little to no BRAM on the FPGA: most I/O goes directly to DRAM, and there is no custom memory hierarchy for the FPGA, so memory performance is suboptimal.
- We use the GTSRB-benchmark as the default in testing.
- QBART can scale across several PYNQs via Ethernet, yielding a roughly linear speedup (speedup ≈ number of PYNQs).
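
The "bitserial" in the name refers to computing a quantized matrix product one bit position at a time. The following is a minimal numpy sketch of that general idea for unsigned operands. It is an illustration only, not QBART's FPGA implementation, and `bitserial_matmul` is a hypothetical name.

```python
import numpy as np

def bitserial_matmul(lhs, rhs, lhs_bits, rhs_bits):
    """Multiply two unsigned-integer matrices one bit plane at a time."""
    lhs = np.asarray(lhs, dtype=np.int64)
    rhs = np.asarray(rhs, dtype=np.int64)
    result = np.zeros((lhs.shape[0], rhs.shape[1]), dtype=np.int64)
    for i in range(lhs_bits):
        lhs_plane = (lhs >> i) & 1              # bit i of every LHS element
        for j in range(rhs_bits):
            rhs_plane = (rhs >> j) & 1          # bit j of every RHS element
            # A product of two binary matrices, weighted by 2**(i + j).
            result += np.dot(lhs_plane, rhs_plane) << (i + j)
    return result

weights = np.array([[1, 2], [3, 0]])        # every value fits in 2 bits
activations = np.array([[1, 0], [1, 1]])    # every value fits in 1 bit
product = bitserial_matmul(weights, activations, lhs_bits=2, rhs_bits=1)
assert np.array_equal(product, np.dot(weights, activations))
```

The inner product of two binary matrices needs only AND and popcount operations in hardware, which is why low-precision networks map well to this scheme. The cost grows with `lhs_bits * rhs_bits`.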

Alright, let's get to it!

## Requirements:
- A trained QNN that is constructed with layers.py in the QNN folder, then pickled with python2 to a pickle file.
- This must be placed on the PYNQ, and you must edit the QNN path below so that QBART can find and work on it.
- Image(s) must be placed in a separate folder, and you must set the image path accordingly.
- You must also manually fill in the config section below, and set up static IPs for your additional PYNQs if you want to use distributed computing.
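
For distributed runs, the image batch has to be divided among the configured servers. How `classification_client` does this is not shown in this notebook, so the following is only a sketch of one plausible scheme. `split_contiguous` is a hypothetical helper, and the second server address is an example value.

```python
def split_contiguous(items, n_parts):
    """Split items into n_parts contiguous chunks of near-equal size."""
    base, extra = divmod(len(items), n_parts)
    parts, start = [], 0
    for k in range(n_parts):
        size = base + (1 if k < extra else 0)   # first chunks take the remainder
        parts.append(items[start:start + size])
        start += size
    return parts

# Example values only: one local server and one remote PYNQ with a static IP.
servers = [("localhost", 64646), ("192.168.1.4", 64646)]
image_names = ["img0", "img1", "img2", "img3", "img4"]
batches = split_contiguous(image_names, len(servers))

# Concatenating the per-server batches restores the original order.
flattened = [name for batch in batches for name in batch]
assert flattened == image_names
```

Contiguous chunks are used here so that flattening the per-server results preserves image order, which matters when the result list is later compared element by element against a reference run.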

Alright, with the requirements done, we do the following:
1. Run all image classifications on QBART, and time it.
2. Run all image classifications on a pure, correct CPU implementation, and time it.
3. Check if both QBART and the CPU implementation agree. If both implementations agree on all image classifications, we know that the QBART implementation is correct.
4. Present the results to the user.
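
The four steps above amount to timing two implementations on the same input and comparing their outputs. A minimal sketch of that pattern follows. `timed`, `compare_runs`, and `fake_classifier` are hypothetical stand-ins, not functions from `qbart_helper`.

```python
from time import time

def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed_seconds)."""
    start = time()
    result = fn(*args)
    return result, time() - start

def compare_runs(accelerated_fn, reference_fn, images):
    """Run both implementations on the same images and report agreement."""
    fast_result, fast_seconds = timed(accelerated_fn, images)
    ref_result, ref_seconds = timed(reference_fn, images)
    return {
        "agree": fast_result == ref_result,
        "accelerated_seconds": fast_seconds,
        "reference_seconds": ref_seconds,
    }

def fake_classifier(images):
    """Stand-in classifier: returns (filename, class_index) pairs."""
    return [(name, len(name) % 3) for name in images]

report = compare_runs(fake_classifier, fake_classifier, ["cat.jpg", "husky.jpg"])
print(report["agree"])
```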

# Step 1: Running all image classifications on QBART

In [13]:
# Open source libraries
from time import time
import copy
import os
import sys
from imagenet_classes import *

# Custom functions for the project
from qbart_helper import *
from client import classification_client

# Provided by course instructor (github: Maltanar)
from QNN import *

###########################################################################################################
### USER INPUT SECTION, USER MUST SUBMIT VALUES OR "None" WHERE APPLICABLE
###########################################################################################################

"""
############## 
# MNIST Config
##############
qnn_path = "/home/kris/Development/QBART_user_files/mnist-w1a2.pickle"
image_dir = "/home/kris/Development/QBART_user_files/mnist_images"
image_limit = 10
image_channels = "grayscale"
image_data_layout = "Crc"

qbart_data_layout = "Crc"

qnn_trained_channels = "grayscale"
qnn_trained_imsize_col = 28
qnn_trained_imsize_row = 28
qnn_class = "MNIST"

image_classes = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
"""

"""
############## 
# GTSRB Config
##############
qnn_path = "gtsrb-w1a1.pickle"         # Path to the pickled QNN, relative to where the notebook resides.
image_dir = "Images"                   # Image directory, relative to where the notebook resides.
image_limit = 10                      # Max number of images to classify; set float("Inf") to classify all.
image_channels = "RGB"                 # Must be specified in order if the image is not a .jpg or .ppm
image_data_layout = "rcC"              # Must be specified, r = row, c = column, C = Channel

qbart_data_layout = "Crc"              # Qbart assumes data to be in column major form.

qnn_trained_channels = "BGR"           # The color channel ordering that the qnn is trained to.
qnn_trained_imsize_col = 32            # The expected column size of input images to the qnn.
qnn_trained_imsize_row = 32            # The expected row size of input images to the qnn.
qnn_class = "GTSRB"

# Either specify image classes to get an easily readable name, or specify None to just get a category #.
image_classes = ['20 Km/h', '30 Km/h', '50 Km/h', '60 Km/h', '70 Km/h', '80 Km/h', 'End 80 Km/h', '100 Km/h', '120 Km/h', 'No overtaking', 'No overtaking for large trucks', 'Priority crossroad', 'Priority road', 'Give way', 'Stop', 'No vehicles', 'Prohibited for vehicles with a permitted gross weight over 3.5t including their trailers, and for tractors except passenger cars and buses', 'No entry for vehicular traffic', 'Danger Ahead', 'Bend to left', 'Bend to right', 'Double bend (first to left)', 'Uneven road', 'Road slippery when wet or dirty', 'Road narrows (right)', 'Road works', 'Traffic signals', 'Pedestrians in road ahead', 'Children crossing ahead', 'Bicycles prohibited', 'Risk of snow or ice', 'Wild animals', 'End of all speed and overtaking restrictions', 'Turn right ahead', 'Turn left ahead', 'Ahead only', 'Ahead or right only', 'Ahead or left only', 'Pass by on right', 'Pass by on left', 'Roundabout', 'End of no-overtaking zone', 'End of no-overtaking zone for vehicles with a permitted gross weight over 3.5t including their trailers, and for tractors except passenger cars and buses']

"""

############## 
# ImageNet Config
##############
# TODO: I think load_images and everything is set. Now we must just set path to qnn and images, and then try to run the darn thing.
# Some work with the result array is required as well. Use qnn_class to set a custom classifying translation there.
qnn_path = "/home/kris/Development/QBART_user_files/alexnet-hwgq.pickle"
image_dir = "/home/kris/Development/QBART_user_files/imagenet_minortest_images"
image_limit = 5
image_channels = "RGB"
image_data_layout = "rcC"

qbart_data_layout = "Crc"

qnn_trained_channels = "BGR"
qnn_trained_imsize_col = 227
qnn_trained_imsize_row = 227

qnn_class = "imagenet"

image_classes = imagenet_classes


# Cluster config
aase = '192.168.1.7'
bjorg = '192.168.1.4'
gunn = '192.168.1.2'
solfrid = '192.168.1.5'
qbart_port = 64646

# At least one server (localhost or remote) must be running, or we can't run.
server_list = [('localhost', qbart_port)] 


###########################################################################################################
###########################################################################################################

###########################################################################################################
### MAIN METHOD, SHOULD BE KEPT RELATIVELY SIMPLE, DETAILS STORED AWAY IN HELPER FUNCTIONS
###########################################################################################################
print("QBART Notebook now running")
print("Loading images")
images = load_images(image_dir, image_limit, qnn_trained_imsize_col, qnn_trained_imsize_row, qbart_data_layout, qnn_trained_channels, qnn_class)
print("Loading QNN")
qnn = load_qnn(qnn_path)

print("Starting timer for classification")
qbart_starttime = time()
# We send the images to the processing server (currently localhost, can later be localhost and others (each with
# its separate thread here in main or in classification client.))
print("Sending images and qnn to server(s) for classification.")
qbart_classifications = classification_client(qnn, copy.copy(images), server_list)
print("All results received.")
qbart_classifications = [j for i in qbart_classifications for j in i] # We flatten the list we receive. A bit messy.
qbart_endtime = time()
print("Timer stopped.")


    
###########################################################################################################
###########################################################################################################

['', '/usr/lib/python2.7', '/usr/lib/python2.7/plat-x86_64-linux-gnu', '/usr/lib/python2.7/lib-tk', '/usr/lib/python2.7/lib-old', '/usr/lib/python2.7/lib-dynload', '/home/kris/.local/lib/python2.7/site-packages', '/usr/local/lib/python2.7/dist-packages', '/usr/lib/python2.7/dist-packages', '/usr/local/lib/python2.7/dist-packages/IPython/extensions', '/home/kris/.ipython']
QBART Notebook now running
Loading images
Loading QNN
Starting timer for classification
Sending images and qnn to server(s) for classification.
Successfully connected to localhost
Connected servers are now receiving and working on images.
All results received.
Timer stopped.
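
The layout strings in the configuration above use `r` for row, `c` for column, and `C` for channel. As a minimal sketch of what converting an image from `rcC` to `Crc` and from RGB to BGR involves (this is not the code of `load_images`, whose implementation is not shown here; `convert_layout` and `rgb_to_bgr` are hypothetical helper names):

```python
import numpy as np

def convert_layout(image, src_layout, dst_layout):
    """Reorder the axes of image, e.g. from 'rcC' to 'Crc'."""
    axes = [src_layout.index(axis) for axis in dst_layout]
    return np.transpose(image, axes)

def rgb_to_bgr(image_crc):
    """Reverse the channel order of a channel-first ('Crc') image."""
    return image_crc[::-1, :, :]

# A tiny 2-row, 3-column, 3-channel image in 'rcC' layout.
image_rcC = np.arange(2 * 3 * 3).reshape(2, 3, 3)
image_crc = convert_layout(image_rcC, "rcC", "Crc")
image_bgr = rgb_to_bgr(image_crc)

assert image_crc.shape == (3, 2, 3)                 # channels, rows, columns
assert image_crc[1, 0, 2] == image_rcC[0, 2, 1]     # same pixel, same channel
assert np.array_equal(image_bgr[0], image_crc[2])   # first channel is now B
```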


# Step 2: Running all image classifications on a CPU implementation 

This step uses the code from qnn-inference-examples (GTSRB only), with some modifications to batch-process images instead of classifying them one at a time.

In [14]:
import cPickle as pickle
from PIL import Image
import numpy as np
#from QNN import *
from QNN.layers import *
#from qbart_helper import *

print("Starting CPU implement run on notebook.")
print("Loading qnn pickle")
# Load the qnn pickle string
qnn_unpickled = pickle.loads(qnn)

# Tutorial code galore
print("Starting timer.")
tutorial_start = time()
qnn_classifications = []

print("Classifying..")
for image in images:
    qnn_classifications.append((image[0], np.argmax(predict(qnn_unpickled, image[1]))))
    
tutorial_stop = time()
tutorial_time_total = tutorial_stop - tutorial_start
print("We finished classifying. Clock stopped.")

Starting CPU implement run on notebook.
Loading qnn pickle
Starting timer.
Classifying..
We finished classifying. Clock stopped.


# Step 3: Simple implementation correctness testing

QBART is a QNN inference accelerator, so we do not test whether the images are classified correctly, only whether the inference is executed correctly. The actual classification accuracy is determined by how the QNN was trained. We therefore test for correctness by checking that the list of pure CPU classifications is equal to the list of QBART classifications.

In [15]:
if qnn_classifications == qbart_classifications:
    print("The classification lists are identical, therefore qbart works correctly.")
else:
    print("There is a mismatch between the pure cpu classification and the qbart classification. There is an error somewhere.")

The classification lists are identical, therefore qbart works correctly.
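
When the two lists do not match, a single equality check does not say which images disagree. A small sketch of a more detailed comparison, assuming both lists hold `(filename, class_index)` pairs in the same order as in the cells above (`find_mismatches` is a hypothetical helper, and the class indices are made-up values):

```python
def find_mismatches(reference, candidate):
    """Return (filename, reference_class, candidate_class) per disagreement."""
    if len(reference) != len(candidate):
        raise ValueError("Result lists differ in length")
    mismatches = []
    for (ref_name, ref_class), (cand_name, cand_class) in zip(reference, candidate):
        if ref_name != cand_name or ref_class != cand_class:
            mismatches.append((ref_name, ref_class, cand_class))
    return mismatches

# Made-up results for illustration.
cpu_results = [("cat.jpg", 1), ("husky.jpg", 2)]
fpga_results = [("cat.jpg", 1), ("husky.jpg", 5)]
mismatches = find_mismatches(cpu_results, fpga_results)
print(mismatches)
```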


# Step 4: Presentation of classification results

In [16]:
print("Time used by QBART classification: " + str(qbart_endtime-qbart_starttime))
print("Time used by tutorial classification: " + str(tutorial_time_total)) 

# Since everything is a-ok, we print the results and also write it to a file for easy usage elsewhere.
print("Printing classifications and writing to results.txt")


results_file = open("results.csv","wb")

# Here we print the image file name alongside its classification (a number if image_classes is not specified)
# The result is also saved as a results.csv file.
for i in range(len(qbart_classifications)):
    if image_classes is not None:
        print(qbart_classifications[i][0], image_classes[qbart_classifications[i][1]])
        results_file.write(str(qbart_classifications[i][0]) + "," + str(image_classes[qbart_classifications[i][1]])+ os.linesep)
    else:
        print(qbart_classifications[i][0], qbart_classifications[i][1])
        results_file.write(str(qbart_classifications[i][0]) + "," + str(qbart_classifications[i][1])+ os.linesep)
results_file.close()

Time used by QBART classification: 7.89345216751
Time used by tutorial classification: 7.16517496109
Printing classifications and writing to results.txt
('cat.jpg', 'tiger cat')
('grouse.jpg', 'black grouse')
('husky.jpg', 'Eskimo dog, husky')
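
Some class names, such as `Eskimo dog, husky`, contain a comma, so joining fields with `","` by hand produces rows with an inconsistent number of columns. A sketch of writing the same rows with the standard `csv` module, which quotes such fields, is shown below. It targets Python 3, and `write_results` is a hypothetical helper.

```python
import csv
import io

def write_results(rows, out_file, class_names=None):
    """Write (filename, class_index) rows as CSV, using names if given."""
    writer = csv.writer(out_file)
    for filename, class_index in rows:
        if class_names is not None:
            label = class_names[class_index]
        else:
            label = class_index
        writer.writerow([filename, label])

# An in-memory buffer stands in for the results file.
buffer = io.StringIO()
write_results([("cat.jpg", 0), ("husky.jpg", 1)], buffer,
              class_names=["tiger cat", "Eskimo dog, husky"])
csv_text = buffer.getvalue()

# Reading the text back yields exactly two fields per row.
parsed = list(csv.reader(io.StringIO(csv_text)))
assert parsed == [["cat.jpg", "tiger cat"], ["husky.jpg", "Eskimo dog, husky"]]
```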


# Step 5: Benchmarking GEMMBitserial component on the FPGA

The end-to-end runtimes of a single QBART and the bare CPU implementation do not differ much, even though FPGA layer execution is substantially faster than the numpy methods on the CPU.

A major bottleneck is, sadly, the number of DRAM reads and writes. In our current implementation, to support generality, we have to write data from CPU to RAM, and then from RAM to the FPGA component before execution. When the results are ready, we have to write data from FPGA to RAM, and from RAM to CPU. This adds a substantial data-transfer overhead to the total execution time.
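
The effect of this overhead can be illustrated with simple arithmetic. In the sketch below, all timings are hypothetical numbers chosen for illustration, not measurements from QBART, and `end_to_end_speedup` is a hypothetical helper.

```python
def end_to_end_speedup(cpu_seconds, fpga_seconds, transfer_seconds):
    """Speedup of the FPGA path over the CPU once transfers are included."""
    return cpu_seconds / (fpga_seconds + sum(transfer_seconds))

# Hypothetical timings for one layer, in seconds.
cpu_seconds = 8.0                    # layer computed with numpy on the CPU
fpga_seconds = 1.0                   # same layer computed on the FPGA
transfers = [2.0, 1.5, 1.5, 2.0]     # CPU->RAM, RAM->FPGA, FPGA->RAM, RAM->CPU

kernel_only = cpu_seconds / fpga_seconds
with_transfers = end_to_end_speedup(cpu_seconds, fpga_seconds, transfers)
print(kernel_only, with_transfers)
```

With these made-up numbers, an 8x faster kernel gives no end-to-end gain, because the four transfers together take as long as the time saved on compute.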

To show how much potential there is in the FPGA components, we will now run a small benchmark comparing CPU and FPGA execution time.

In [1]:
sys.path.append("PATH TO CFFI_RUN HERE PL0X")

from cffi_run import test_fc_benchmarks
from bitserial_performance_model import FC_time

# First we time a run of a fully connected layer all the way from this notebook.
# TODO: Load QNN pickle, go straight to the thresholding layer, and execute it with an Activation matrix that
# has the right size. If the QNN is constant, then the activation matrix size will be too.


# Before benchmarking the component, we calculate how many cycles it will take given our performance model.
# FC_time expects FREQ WORD_SIZE COLS LHS_ROWS RHS_ROWS
performance_model_predicted_time = FC_time()

# Then we benchmark a fully connected (FC) layer with similar parameters; the reported time is the time on the FPGA only.
