Hardware Acceleration for Machine Learning

An image identifier implemented on an FPGA to achieve fast, real-time multiple-object detection.

TEAM MEMBERS: Zuxiong Tan, Samyak Jain, Chenxi Li

Project Goals:

Find a state-of-the-art multiple-object detection model
Measure its inference performance on GPU
Deploy the model on the FPGA DPU and achieve real-time detection
Measure the inference performance on FPGA
Compare the GPU and FPGA results

Make a roofline plot
Calculate memory bandwidths for the DL program on GPU and FPGA
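
The roofline goal can be sketched as follows. This is a minimal illustration, assuming the standard roofline bound (attainable performance is the smaller of peak compute and memory bandwidth times arithmetic intensity). The function names and the device numbers are placeholders chosen for illustration, not measurements from this project.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gbs: float, intensity: float) -> float:
    """Roofline bound in GFLOP/s for a kernel with the given
    arithmetic intensity (FLOPs per byte moved to or from memory)."""
    return min(peak_gflops, bandwidth_gbs * intensity)


def ridge_point(peak_gflops: float, bandwidth_gbs: float) -> float:
    """Arithmetic intensity at which a kernel stops being memory-bound."""
    return peak_gflops / bandwidth_gbs


# Placeholder device numbers (assumptions, NOT values measured in this project).
PEAK_GFLOPS = 1000.0
BANDWIDTH_GBS = 100.0

# Sample the roofline at a few arithmetic intensities.
roofline = [
    (ai, attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, ai))
    for ai in (1.0, 10.0, 100.0)
]

for ai, perf in roofline:
    print(f"intensity {ai:6.1f} FLOP/byte -> at most {perf:7.1f} GFLOP/s")
```

Replacing the placeholders with the measured peak throughput and memory bandwidth of the GPU and the FPGA would give one roofline per device, onto which the measured YOLO performance can be plotted.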

What is the DPU

The Xilinx® Deep Learning Processor Unit (DPU) is a programmable engine optimized for convolutional neural networks. The unit includes a high performance scheduler module, a hybrid computing array module, an instruction fetch unit module, and a global memory pool module. The DPU uses a specialized instruction set, which allows for the efficient implementation of many convolutional neural networks. Some examples of convolutional neural networks which have been deployed include VGG, ResNet, GoogLeNet, YOLO, SSD, MobileNet, FPN, and many others.
The DPU IP can be implemented in the programmable logic (PL) of the selected Zynq®-7000 SoC or Zynq UltraScale+™ MPSoC devices with direct connections to the processing system (PS). The DPU requires instructions to implement a neural network and accessible memory locations for input images as well as temporary and output data. A program running on the application processing unit (APU) is also required to service interrupts and coordinate data transfers.
DPU overview: https://www.xilinx.com/products/design-tools/ai-inference/ai-developer-hub.html#edge

DPU Development Flow (Using DNNDK)

The DPU requires a device driver which is included in the Xilinx Deep Neural Network Development Kit (DNNDK) toolchain.
The DNNDK User Guide (UG1327) describes how to use the DPU with the DNNDK tools. The basic development flow is shown in the following figure. First, use Vivado to generate the bitstream. Then, download the bitstream to the target board and install the DPU driver. For instructions on how to install the DPU driver and dependent libraries, refer to the DNNDK User Guide (UG1327).
DNNDK User Guide: https://www.xilinx.com/support/documentation/user_guides/ug1327-dnndk-user-guide.pdf

Similar Products:

NVIDIA Deep Learning Accelerator (NVDLA):

This is a free and open architecture that promotes a standard way to design deep learning inference accelerators. NVDLA is scalable, highly configurable, and designed to simplify integration and portability. The hardware supports a wide range of IoT devices.
NVDLA overview: http://nvdla.org

Google's Tensor Processing Unit (TPU):

The TPU is tailored to machine learning applications, allowing the chip to be more tolerant of reduced computational precision, which means it requires fewer transistors per operation. TPUs power many applications at Google, including RankBrain, used to improve the relevancy of search results, and Street View, used to improve the accuracy and quality of Google's maps and navigation.
TPU overview: https://cloud.google.com/blog/products/gcp/quantifying-the-performance-of-the-tpu-our-first-machine-learning-chip

Sprint 1

Run YOLO on GPU
Compare YOLO's performance on GPU with its performance on CPU
Obtain an FPGA
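
A GPU-versus-CPU comparison of this kind needs a repeatable way to time one detection. The sketch below is one possible harness, not the code used in this project: `time_inference` and `fake_detector` are hypothetical names, and `fake_detector` is a stand-in for a real YOLO call.

```python
import time
from statistics import mean


def time_inference(run_once, repeats: int = 5, warmup: int = 1) -> float:
    """Return the mean wall-clock latency in seconds of run_once()."""
    # Warm-up runs absorb one-time costs (loading weights, cache fills).
    for _ in range(warmup):
        run_once()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - start)
    return mean(samples)


def fake_detector() -> None:
    """Hypothetical stand-in for a single-image YOLO detection."""
    sum(i * i for i in range(10_000))


avg_seconds = time_inference(fake_detector)
print(f"average latency: {avg_seconds:.6f} s")
```

Running the same harness around the GPU build and the CPU build of the detector gives two latencies that can be compared directly.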

Sprint 2 (Lots of work on reverse-engineering darknet YOLO)

Refactor the YOLO code obtained from https://pjreddie.com/darknet/yolo/
Rewrite YOLO with the DNNDK API
Look into different methods to run the given C code on an FPGA
- Use the OpenCL framework to run the code on an Intel FPGA. This can be done with the Intel FPGA SDK for OpenCL
- Convert the code into HDL to run on a Xilinx FPGA
  - Implement the DPU in Vivado and run some simulation tests

Results from Sprint 1

Time taken to detect objects in a single image

Prediction on GPU (BU SCC): 0.925530 seconds
Prediction on CPU (single core, Intel Core i5): 19.457083 seconds
GPU Spec:
- Tesla P100 PCIe 16GB
- Width: 64 bits
- Clock: 33MHz
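
As a worked example, the two Sprint 1 timings above can be turned into a speedup and a throughput figure. The constants are the measured values reported in this section; the variable names are only illustrative.

```python
# Single-image prediction times reported for Sprint 1 (seconds).
GPU_SECONDS = 0.925530
CPU_SECONDS = 19.457083

# Speedup of the GPU over the single-core CPU run.
speedup = CPU_SECONDS / GPU_SECONDS

# Throughput in images per second for each device.
gpu_fps = 1.0 / GPU_SECONDS
cpu_fps = 1.0 / CPU_SECONDS

print(f"GPU speedup over CPU: {speedup:.2f}x")
print(f"GPU throughput: {gpu_fps:.3f} images/s")
print(f"CPU throughput: {cpu_fps:.3f} images/s")
```

For these two measurements the GPU is roughly 21 times faster than the single CPU core, at a little over one image per second.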

Sprint 3

Achieved object detection using an FPGA-based hardware accelerator
Compared the performance and power efficiency of the FPGA, GPU, and CPU

System Diagram

The diagram above shows the system design, which uses the YOLOv2 model with Darknet-19. In this design, the CPU acts as the co-processor and the FPGA accelerates the computation. The acceleration card is a Xilinx Alveo U200 running Xilinx ML Suite, and we developed the design on AWS (Amazon Web Services).

Performance

According to the graph, the GPU runs 15.5 times faster than the CPU, and the FPGA runs 4.9 times faster than the CPU.

Power efficiency

Power efficiency is defined as speed divided by power. By this measure, the GPU is 5.89 times more efficient than the CPU, and the FPGA is 52.6 times more efficient than the CPU.
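
Because efficiency is speed divided by power, the speed and efficiency ratios reported above also imply how much power each accelerator draws relative to the CPU. The sketch below only rearranges the reported ratios; it assumes both sets of ratios come from the same runs, and the names are illustrative.

```python
# Ratios relative to the CPU, as reported in the Performance and
# Power efficiency sections.
SPEEDUP_VS_CPU = {"GPU": 15.5, "FPGA": 4.9}
EFFICIENCY_VS_CPU = {"GPU": 5.89, "FPGA": 52.6}


def implied_relative_power(device: str) -> float:
    """Power relative to the CPU implied by efficiency = speed / power.

    Rearranged: relative power = relative speed / relative efficiency.
    """
    return SPEEDUP_VS_CPU[device] / EFFICIENCY_VS_CPU[device]


relative_power = {dev: implied_relative_power(dev) for dev in SPEEDUP_VS_CPU}

for dev, ratio in relative_power.items():
    print(f"{dev}: about {ratio:.3f}x the CPU's power")
```

Under that assumption, the GPU draws about 2.6 times the CPU's power, while the FPGA draws less than a tenth of it, which is where its efficiency advantage comes from.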

User Stories:

Navigation for Robots
Surveillance
Self-driving cars that use the YOLOv2 algorithm
