#### HARDWARE ACCELERATED MACHINE LEARNING LIBRARY
# Classifying Handwritten Digits with Gaussian Naive Bayes



In this notebook, a Gaussian Naive Bayes application runs locally using our hardware-accelerated machine learning library (mllib_accel). You can choose between an accelerated execution, which uses both software and hardware, and a non-accelerated one, which uses only the CPU cores.

When the accelerated option is chosen, the accelerator library is invoked; it uses Xilinx's built-in modules and classes to drive the Programmable Logic. The Gaussian Naive Bayes overlay is built around a custom accelerator (NB_training_kernel and NB_prediction_kernel) that receives data from Python, processes it, and returns the results through an AXI4-Stream Accelerator Adapter.


## Data Set Introduction

The data are taken from the famous MNIST dataset.

The original data file contains gray-scale images of hand-drawn digits, from zero through nine. Each image is 28 pixels high and 28 pixels wide, for a total of 784 pixels. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255, inclusive.
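
To make the layout concrete, the sketch below maps between a (row, column) position in a 28x28 image and its position in a flattened vector of 784 values. Row-major ordering is an assumption here (it is the usual convention for MNIST exports, but the source does not state it), and the helper names are illustrative only.

```python
# Illustrative helpers; row-major ordering is an assumption, not confirmed by the source.
HEIGHT, WIDTH = 28, 28
NUM_PIXELS = HEIGHT * WIDTH  # 784 pixel values per image


def flat_index(row, col):
    """Return the position of pixel (row, col) in the flattened vector."""
    if not (0 <= row < HEIGHT and 0 <= col < WIDTH):
        raise ValueError("pixel out of bounds")
    return row * WIDTH + col


def to_rows(pixels):
    """Split a flat list of 784 pixel values into 28 rows of 28 values."""
    if len(pixels) != NUM_PIXELS:
        raise ValueError("expected {} pixels".format(NUM_PIXELS))
    return [pixels[r * WIDTH:(r + 1) * WIDTH] for r in range(HEIGHT)]
```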

In this example, the data have already been preprocessed (normalized) using feature standardization (Z-score scaling).
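
Z-score scaling replaces each feature value x with (x - mean) / std, where the mean and standard deviation are computed per feature over the data set. A minimal sketch follows; the function name is illustrative, and both the use of the population standard deviation and the mapping of constant features (std = 0) to 0.0 are assumptions, since the source does not say how the preprocessing handled them.

```python
from statistics import mean, pstdev


def standardize_column(values):
    """Z-score scale one feature column: (x - mean) / std."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        # Constant feature: no spread to scale by, so map everything to 0.0 (assumption)
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Example: raw pixel values of one pixel position across four images
column = [0, 0, 128, 255]
scaled = standardize_column(column)
```

After scaling, the column has zero mean and unit standard deviation, which is why the pixel columns in the data files no longer lie in the 0 to 255 range.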

The data sets (train, test) have 785 columns. The first column, called "label", is the digit that was drawn by the user. The rest of the columns contain the (rescaled) pixel-values of the associated image.


## Environment Initialization

In [1]:
import os
import sys
import cffi

spark_home = os.environ.get("SPARK_HOME")
if spark_home is None:
    raise EnvironmentError("SPARK_HOME is not set")

# Add the Spark python sub-directory to the path
sys.path.insert(0, os.path.join(spark_home, "python"))

# Add py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, "python", "lib", "py4j-0.10.7-src.zip"))

# Initialize PySpark to predefine the SparkContext variable 'sc'
filename = os.path.join(spark_home, "python", "pyspark", "shell.py")
exec(open(filename).read())

## 1. Reading data

In [2]:
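dataset = "MNIST"
decision = 1
_accel_ = 1  # default mode; overridden by the prompt in section 2

# The same 2k-sample file is used for both training and testing
train_file = "inputs/" + dataset + "_2k.dat"
test_file = "inputs/" + dataset + "_2k.dat"

# Read numClasses and numFeatures from the dataset description file;
# lines starting with '#' are skipped as comments
with open(dataset, 'r') as f:
    for line in f:
        if line[0] != '#':
            parameters = line.split(',')
            numClasses = int(parameters[0])
            numFeatures = int(parameters[1])

print("* NaiveBayes Application *")
print(" # train file:               {:s}".format(train_file))
print(" # test file:                {:s}".format(test_file))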

* NaiveBayes Application *
 # train file:               inputs/MNIST_2k.dat
 # test file:                inputs/MNIST_2k.dat


## 2. SW-only vs HW accelerated

In [3]:
_accel_ = int(input("Select mode (0: SW-only, 1: HW accelerated): "))

Select mode (0: SW-only, 1: HW accelerated): 1


## 3. Data extraction

In [4]:
from pyspark.mllib.regression import LabeledPoint

def parsePoint(line):
    """
    Parse a line of text into an MLlib LabeledPoint object.
    """

    data = [float(s) for s in line.split(',')]

    return LabeledPoint(data[0], data[1:])

trainSet = []
with open(train_file, 'r') as f:
    for line in f:
        trainSet.append(parsePoint(line))

testSet = []
with open(test_file, 'r') as f:
    for line in f:
        testSet.append(parsePoint(line))

## 4. Train Model NB

In [5]:
from pyspark.mllib_accel.classificationNB_Pynq import Naivebayes
from time import time


NB = Naivebayes(numClasses, numFeatures, decision)

start = time()

# Train a Naive Bayes model given a dataset of (label, features) pairs.
NB.train(trainSet, _accel_)

end = time()

if _accel_:
    print("! Time running Naive Bayes train in hardware: {:.3f} sec".format(end - start))
else:
    print("! Time running Naive Bayes train in software: {:.3f} sec".format(end - start))

# Save each trained statistic to its own file
stats = ["Means", "Variances", "Priors"]
for i, name in enumerate(stats):
    NB.save("outputs/trainPack" + name + ".txt", i)


* NaiveBayes Training *
     # numBuffers:               1
     # numClasses:               10
     # numFeatures:              784
     # Accelerated:              True
! Time running Naive Bayes train in hardware: 0.617 sec
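
For reference, the three statistics saved by the training cell (means, variances, priors) are the quantities that Gaussian Naive Bayes training estimates per class. The pure-Python sketch below shows that computation on (label, features) pairs. It is not the mllib_accel implementation: the function name, the use of the population variance, and the small smoothing term `eps` are assumptions made for illustration.

```python
from collections import defaultdict


def train_gaussian_nb(samples, eps=1e-9):
    """Estimate per-class means, variances and priors.

    samples: iterable of (label, features) pairs.
    eps: small value added to each variance to avoid division by zero (assumption).
    Returns three dicts keyed by label.
    """
    by_class = defaultdict(list)
    for label, features in samples:
        by_class[label].append(features)

    total = sum(len(rows) for rows in by_class.values())
    means, variances, priors = {}, {}, {}
    for label, rows in by_class.items():
        n = len(rows)
        num_features = len(rows[0])
        mu = [sum(r[j] for r in rows) / n for j in range(num_features)]
        var = [sum((r[j] - mu[j]) ** 2 for r in rows) / n + eps
               for j in range(num_features)]
        means[label] = mu
        variances[label] = var
        priors[label] = n / total  # fraction of training samples in this class
    return means, variances, priors


# Toy data: two samples of class 0, one sample of class 1
toy = [(0, [1.0, 2.0]), (0, [3.0, 4.0]), (1, [10.0, 10.0])]
means, variances, priors = train_gaussian_nb(toy)
```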


## 5. Prediction with NB 

In [6]:
start = time()
NB.test(testSet, _accel_)
end = time()

if _accel_:
    print("! Time running Naive Bayes test in hardware: {:.3f} sec".format(end - start))
else:
    print("! Time running Naive Bayes test in software: {:.3f} sec".format(end - start))

* NaiveBayes Testing *
     # accuracy:                 0.754 (1509/2000)
     # true:                     1509
     # false:                    491
! Time running Naive Bayes test in hardware: 2.698 sec
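
Conceptually, Gaussian Naive Bayes prediction picks the class with the highest log-posterior: the log of the class prior plus the sum of the per-feature Gaussian log-likelihoods. The sketch below illustrates this on a tiny hand-written model. The function name and the model values are hypothetical, and this is not the code path used by mllib_accel or the prediction kernel.

```python
import math


def predict_gaussian_nb(features, means, variances, priors):
    """Return the label whose log-posterior is highest for the given features."""
    best_label, best_score = None, -math.inf
    for label in priors:
        score = math.log(priors[label])
        for x, mu, var in zip(features, means[label], variances[label]):
            # Log of the Gaussian density N(x; mu, var)
            score += -0.5 * math.log(2.0 * math.pi * var) - (x - mu) ** 2 / (2.0 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical two-class model with two features per sample
means = {0: [0.0, 0.0], 1: [5.0, 5.0]}
variances = {0: [1.0, 1.0], 1: [1.0, 1.0]}
priors = {0: 0.5, 1: 0.5}

label_near_zero = predict_gaussian_nb([0.2, -0.1], means, variances, priors)
label_near_five = predict_gaussian_nb([4.5, 5.5], means, variances, priors)
```

Working in log space avoids numerical underflow, which matters when 784 per-pixel likelihoods are multiplied together.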


## Performance Metrics
We present execution time measurements on different target devices.


| Target | Function | Time (sec) |
| --- | --- | --- |
| PYNQ SW-only | Training | 65 |
| PYNQ SW-only | Prediction | 843 |
| Intel Core i5 | Training | 4.5 |
| Intel Core i5 | Prediction | 66 |
| PYNQ HW accelerated | Training | 0.63 |
| PYNQ HW accelerated | Prediction | 2.7 |
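
The speedup of the accelerated run over each baseline follows directly from the table as baseline time divided by accelerated time. The short sketch below does that arithmetic; the dictionary layout and the function name are illustrative, and the timings are copied from the table.

```python
# Execution times in seconds, copied from the table above
timings = {
    "PYNQ SW-only": {"Training": 65.0, "Prediction": 843.0},
    "Intel Core i5": {"Training": 4.5, "Prediction": 66.0},
    "PYNQ HW accelerated": {"Training": 0.63, "Prediction": 2.7},
}


def speedup(baseline, function, accelerated="PYNQ HW accelerated"):
    """Return how many times faster the accelerated target is than the baseline."""
    return timings[baseline][function] / timings[accelerated][function]


for baseline in ("PYNQ SW-only", "Intel Core i5"):
    for function in ("Training", "Prediction"):
        print("{:<14s} {:<10s} {:6.1f}x".format(
            baseline, function, speedup(baseline, function)))
```

With these figures, the accelerated PYNQ run is roughly 103x (training) and 312x (prediction) faster than PYNQ SW-only, and roughly 7x and 24x faster than the Intel Core i5.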
