#### FPGA-ACCELERATION OF MACHINE LEARNING IN CLOUD COMPUTING, A CASE STUDY USING LOGISTIC REGRESSION

<img style="float: left;" src="data/sample.png">

# Classifying Handwritten Digits with <br /> Logistic Regression <br />

___

This notebook demonstrates how to interface with our hardware ML library from Python and explains how communication between Python and the hardware accelerator is performed. The Logistic Regression (LR) overlay is built around an accelerator (LR_gradients_kernel) that receives data from Python, processes it, and returns the results through an AXI4-Stream Accelerator Adapter.

## 1. Accelerator API

First, we download the LR overlay onto the device and map the DMAs of the hardware accelerator to DMA objects, using Xilinx’s built-in modules and classes. We then allocate the corresponding buffers for the DMAs and, after writing the input data (data1, data2, weights) to them, transfer them to the accelerator. We also send the size of the data chunk, along with the necessary commands, to the Accelerator Adapter. The accelerator returns the computed gradients.

```python      
from pynq import MMIO, Overlay, PL
from pynq.mllib_accel import DMA

DMA_TO_DEV = 0    # DMA sends data to PL.
DMA_FROM_DEV = 1  # DMA receives data from PL.

class LR_Accel:
    """
    Python class for the LR Accelerator.
    """
    
    def __init__(self, chunkSize, numClasses, numFeatures):
        self.numClasses = numClasses
        self.numFeatures = numFeatures
           
        # -------------------------
        #   Download Overlay.
        # -------------------------    

        ol = Overlay("LogisticRegression.bit")
        ol.download()  
        
        # -------------------------
        #   Physical address of the Accelerator Adapter IP.
        # -------------------------

        ADDR_Accelerator_Adapter_BASE = int(PL.ip_dict["SEG_LR_gradients_kernel_accel_0_if_Reg"][0], 16)
        ADDR_Accelerator_Adapter_RANGE = int(PL.ip_dict["SEG_LR_gradients_kernel_accel_0_if_Reg"][1], 16)

        # -------------------------
        #    Initialize new MMIO object. 
        # -------------------------

        self.bus = MMIO(ADDR_Accelerator_Adapter_BASE, ADDR_Accelerator_Adapter_RANGE)

        # -------------------------
        #   Physical addresses of the DMA IPs.
        # -------------------------

        ADDR_DMA0_BASE = int(PL.ip_dict["SEG_dm_0_Reg"][0], 16)
        ADDR_DMA1_BASE = int(PL.ip_dict["SEG_dm_1_Reg"][0], 16)
        ADDR_DMA2_BASE = int(PL.ip_dict["SEG_dm_2_Reg"][0], 16)
        ADDR_DMA3_BASE = int(PL.ip_dict["SEG_dm_3_Reg"][0], 16)

        # -------------------------
        #    Initialize new DMA objects. 
        # -------------------------

        self.dma0 = DMA(ADDR_DMA0_BASE, direction = DMA_TO_DEV)    # data1 DMA.
        self.dma1 = DMA(ADDR_DMA1_BASE, direction = DMA_TO_DEV)    # data2 DMA.
        self.dma2 = DMA(ADDR_DMA2_BASE, direction = DMA_TO_DEV)    # weights DMA.
        self.dma3 = DMA(ADDR_DMA3_BASE, direction = DMA_FROM_DEV)  # gradients DMA.
        
        # -------------------------
        #    Allocate physically contiguous memory buffers.
        # -------------------------

        self.dma0.create_buf(int(chunkSize / 2) * (self.numClasses + (1 + self.numFeatures)) * 4, 1)
        self.dma1.create_buf(int(chunkSize / 2) * (self.numClasses + (1 + self.numFeatures)) * 4, 1)
        self.dma2.create_buf((self.numClasses * (1 + self.numFeatures)) * 4, 1)
        self.dma3.create_buf((self.numClasses * (1 + self.numFeatures)) * 4, 1)

        # -------------------------
        #    Get CFFI pointers to objects' internal buffers.
        # -------------------------

        self.data1_buf = self.dma0.get_buf(32, data_type = "float")
        self.data2_buf = self.dma1.get_buf(32, data_type = "float")
        self.weights_buf = self.dma2.get_buf(32, data_type = "float")
        self.gradients_buf = self.dma3.get_buf(32, data_type = "float")

    def gradients_kernel(self, data, weights):
        chunkSize = int(len(data) / (self.numClasses + (1 + self.numFeatures)))
        
        for i in range (0, int(len(data) / 2)):
            self.data1_buf[i] = float(data[i])
            self.data2_buf[i] = float(data[int(len(data) / 2) + i])
        for kj in range (0, self.numClasses * (1 + self.numFeatures)):
            self.weights_buf[kj] = float(weights[kj])

        # -------------------------
        #   Write data to MMIO.
        # -------------------------

        CMD = 0x0028            # Command.
        ISCALAR0_DATA = 0x0080  # Input Scalar-0 Write Data FIFO.

        self.bus.write(ISCALAR0_DATA, int(chunkSize))
        self.bus.write(CMD, 0x00010001)
        self.bus.write(CMD, 0x00020000)
        self.bus.write(CMD, 0x00000107)

        # -------------------------
        #   Transfer data using DMAs (Non-blocking).
        #   Block while DMAs are busy.
        # -------------------------

        self.dma0.transfer(int(len(data) / 2) * 4, direction = DMA_TO_DEV)
        self.dma1.transfer(int(len(data) / 2) * 4, direction = DMA_TO_DEV)
        self.dma2.transfer((self.numClasses * (1 + self.numFeatures)) * 4, direction = DMA_TO_DEV)

        self.dma0.wait()
        self.dma1.wait()
        self.dma2.wait()

        self.dma3.transfer((self.numClasses * (1 + self.numFeatures)) * 4, direction = DMA_FROM_DEV)

        self.dma3.wait()

        gradients = []
        for kj in range (0, self.numClasses * (1 + self.numFeatures)):
            gradients.append(float(self.gradients_buf[kj]))

        return gradients
    
    def __del__(self):

        # -------------------------
        #   Destructors for DMA objects.
        # -------------------------

        self.dma0.__del__()
        self.dma1.__del__()
        self.dma2.__del__()
        self.dma3.__del__()

```
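
To make the buffer sizing in `__init__` concrete, the sketch below recomputes the byte counts for the parameter values used later in this notebook (chunkSize = 4000, numClasses = 10, numFeatures = 784). The helper `buffer_sizes` is hypothetical and not part of the library; it only mirrors the arithmetic passed to `create_buf`, assuming each value is a 4-byte float.

```python
FLOAT_BYTES = 4  # each value is stored as a 32-bit float


def buffer_sizes(chunk_size, num_classes, num_features):
    """Mirror the byte counts passed to create_buf in LR_Accel.__init__ (hypothetical helper)."""
    # One sample occupies numClasses + (1 + numFeatures) floats in the data buffers.
    values_per_sample = num_classes + (1 + num_features)
    # The chunk is split in half between the data1 and data2 DMAs.
    data_bytes = (chunk_size // 2) * values_per_sample * FLOAT_BYTES
    # Weights and gradients share the same flat shape.
    weight_bytes = num_classes * (1 + num_features) * FLOAT_BYTES
    return {
        "data1": data_bytes,
        "data2": data_bytes,
        "weights": weight_bytes,
        "gradients": weight_bytes,
    }


sizes = buffer_sizes(4000, 10, 784)
print(sizes)
```

With these values, each data buffer holds about 6.4 MB and the weights and gradients buffers about 31 KB each, which shows that the input data dominate the transfer volume.
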

## 2. Data Set Introduction

The data are taken from the famous <a href="http://yann.lecun.com/exdb/mnist/">MNIST</a> dataset. 

The original data file contains gray-scale images of hand-drawn digits, from zero through nine. Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255, inclusive.

In this example, the data have already been preprocessed and normalized using feature standardization (<a href="https://en.wikipedia.org/wiki/Standard_score">Z-score scaling</a>).
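
As an illustration of Z-score scaling, the following minimal sketch standardizes one feature column (one pixel position across several images). How the provided files were actually preprocessed is not stated here, so the use of the population standard deviation and the handling of constant columns are assumptions, and `standardize` is a hypothetical helper.

```python
import math


def standardize(column):
    """Z-score scale one feature column: subtract the mean, divide by the standard deviation."""
    mean = sum(column) / len(column)
    variance = sum((x - mean) ** 2 for x in column) / len(column)  # population variance (assumption)
    std = math.sqrt(variance)
    if std == 0.0:
        # Constant column (for example an always-blank border pixel): map every value to 0.
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]


pixels = [0, 0, 128, 255]  # raw values of one pixel across four images
scaled = standardize(pixels)
print(scaled)
```

After scaling, the column has zero mean and unit variance, so pixels with very different raw ranges contribute on a comparable scale during training.
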

### Extract data sets

```python
import os

print(os.popen("tar -zxvf data/datasets_MNIST_small.tar.gz -C data/").read().strip("\n"))
```

    train.dat
    test.dat


The data sets (train, test) have 785 columns. The first column, called "label", is the digit that was drawn by the user. The rest of the columns contain the (rescaled) pixel-values of the associated image.
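
A row with this layout could be split into its label and features roughly as follows. The field delimiter of `train.dat` and `test.dat` is not stated in this notebook, so the comma used here is an assumption, and `parse_sample` is a hypothetical helper rather than a library function.

```python
def parse_sample(line, num_features=784, delimiter=","):
    """Split one data row into (label, features). The delimiter is an assumption."""
    fields = line.strip().split(delimiter)
    if len(fields) != 1 + num_features:
        raise ValueError("expected %d columns, got %d" % (1 + num_features, len(fields)))
    label = int(float(fields[0]))              # first column: the digit that was drawn
    features = [float(v) for v in fields[1:]]  # remaining columns: rescaled pixel-values
    return label, features


# Synthetic row for illustration only: label 7 followed by 784 zero-valued pixels.
example = ",".join(["7"] + ["0.0"] * 784)
label, features = parse_sample(example)
print(label, len(features))
```
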

## 3. Logistic Regression Application

This example shows how our Logistic Regression library is called to train an LR model on the training set and then test its accuracy. If \_accel\_ = 1, the hardware accelerator computes the gradients in each iteration through the driver introduced above (LR_Accel).
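
For intuition about the work that is offloaded, here is a software-only sketch of a gradient computation over one chunk. The exact math of LR_gradients_kernel is not described in this notebook, so the one-vs-rest sigmoid formulation is an assumption, and `gradients_sw` is a hypothetical name. Only the flat weight layout, indexed as `k * (1 + numFeatures) + j`, follows the code in this notebook.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gradients_sw(samples, weights, num_classes, num_features):
    """Accumulate one-vs-rest logistic gradients over a chunk (assumed formulation).

    samples: list of (label, features) pairs.
    weights: flat list of length num_classes * (1 + num_features).
    """
    stride = 1 + num_features
    grads = [0.0] * (num_classes * stride)
    for label, features in samples:
        x = [1.0] + list(features)  # prepend the bias input
        for k in range(num_classes):
            base = k * stride
            z = sum(weights[base + j] * x[j] for j in range(stride))
            error = sigmoid(z) - (1.0 if label == k else 0.0)
            for j in range(stride):
                grads[base + j] += error * x[j]
    return grads


# Tiny check: 2 classes, 1 feature, zero weights, one sample of class 1.
demo = gradients_sw([(1, [2.0])], [0.0] * 4, num_classes=2, num_features=1)
print(demo)
```

The nested loops perform on the order of chunkSize × numClasses × (1 + numFeatures) multiply-accumulate operations per chunk, which is the kind of regular, data-parallel workload that suits an FPGA kernel.
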

### Read data & parameters

The size of the training set and the number of iterations are intentionally kept small to avoid long execution times in SW-only mode.

```python
# Data sets
train_file = "data/train.dat"
test_file = "data/test.dat"

print("* LogisticRegression Application *")
print(" # train file:               " + train_file)
print(" # test file:                " + test_file)

# Read train file (the with block closes the file automatically)
trainFile = []
with open(train_file, "r") as lines:
    for line in lines:
        trainFile.append(line.strip("\n"))

# Read test file
testFile = []
with open(test_file, "r") as lines:
    for line in lines:
        testFile.append(line.strip("\n"))
```

    * LogisticRegression Application *
     # train file:               data/train.dat
     # test file:                data/test.dat


```python
chunkSize = 4000
alpha = 0.25    # learning rate
iterations = 3  # number of iterations
print("Select mode: (0: SW-only, 1: HW accelerated)")
_accel_ = int(input())
```

    Select mode: (0: SW-only, 1: HW accelerated)
    1


### Instantiate a Logistic Regression model

```python
from pynq.mllib_accel.classification import LogisticRegression

numClasses = 10
numFeatures = 784
LR = LogisticRegression(numClasses, numFeatures)
```

### Train the LR model    

```python
weights = LR.train(trainFile, chunkSize, alpha, iterations, _accel_)

# Write weights file: one comma-separated row per class
with open("data/weights.out", "w") as weights_file:
    for k in range(0, numClasses):
        row = weights[k * (1 + numFeatures):(k + 1) * (1 + numFeatures)]
        weights_file.write(",".join(str(round(w, 5)) for w in row) + "\n")
```

    * LogisticRegression Training *
     # numSamples:               12000
     # chunkSize:                4000
     # numClasses:               10
     # numFeatures:              784
     # alpha:                    0.25
     # iterations:               3
    ! Time running training in hardware: 154.647 sec


### Test the LR model 

```python
LR.test(testFile)
```

    * LogisticRegression Testing *
     # accuracy:                 0.8155(1631/2000)
     # true:                     1631
     # false:                    369
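
The accuracy figure is the fraction of test samples whose predicted digit matches the label. How `LR.test` predicts internally is not shown in this notebook; the sketch below assumes the predicted class is the one with the highest linear score, and `predict` and `accuracy` are hypothetical helpers that reuse the flat weight layout from this notebook.

```python
def predict(features, weights, num_classes, num_features):
    """Return the class with the highest linear score (assumed decision rule)."""
    stride = 1 + num_features
    x = [1.0] + list(features)  # prepend the bias input
    scores = [
        sum(weights[k * stride + j] * x[j] for j in range(stride))
        for k in range(num_classes)
    ]
    return max(range(num_classes), key=lambda k: scores[k])


def accuracy(samples, weights, num_classes, num_features):
    """Fraction of (label, features) pairs that are classified correctly."""
    correct = sum(
        1 for label, features in samples
        if predict(features, weights, num_classes, num_features) == label
    )
    return correct / len(samples)


# Toy model: 2 classes, 1 feature; class 1 wins when the feature is positive.
toy_weights = [0.0, -1.0, 0.0, 1.0]
toy_samples = [(1, [2.0]), (0, [-3.0]), (0, [1.0])]
toy_accuracy = accuracy(toy_samples, toy_weights, num_classes=2, num_features=1)
print(toy_accuracy)
```

In the run above, 1631 of the 2000 test samples were classified correctly, which gives the reported accuracy of 0.8155.
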


## 4. Performance

We present training time measurements on different target devices.

Target | Training time
:--- | ---:
PYNQ SW-only | 2648.151 sec
Intel Core i5-5200U @ 2.20GHz | 235.819 sec
PYNQ HW accelerated | 154.647 sec
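
The relative speedups follow directly from the measurements above. This short sketch divides each training time by the hardware-accelerated time; the timings are copied from the table, and the variable names are illustrative only.

```python
# Training times in seconds, copied from the measurements above.
times_sec = {
    "PYNQ SW-only": 2648.151,
    "Intel Core i5-5200U @ 2.20GHz": 235.819,
    "PYNQ HW accelerated": 154.647,
}

baseline = times_sec["PYNQ HW accelerated"]
speedups = {target: t / baseline for target, t in times_sec.items()}

for target, speedup in speedups.items():
    print("%s: %.2fx the HW accelerated time" % (target, speedup))
```

The accelerated run is roughly 17 times faster than the SW-only run on the same board and roughly 1.5 times faster than the Intel Core i5 run.
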