

# Tutorial: Neural Network Accelerator Co-Design with FINN

Xilinx Research Labs, TU Delft, Heidelberg University

2021-03-24

## **Tutorial Agenda**



- Introduction to FINN (~30m)
- A tour of the repositories and the flow (~10m)
- FINN community update (~15m)
- Break + Q&A (~10m)
- Hands-on part (~1.5h)
  - Training a quantized MLP with Brevitas
  - Exporting the MLP and verification
  - From MLP to custom hardware with the FINN compiler





Part of recording from the FINN FPGA'21 tutorial https://www.youtube.com/watch?v=zw2aG4PhzmA

## Introduction to FINN

Michaela Blott, Distinguished Engineer, Xilinx Research Labs

## A Tour of the Repositories

Yaman Umuroglu, Senior Research Scientist, Xilinx Research Labs








## FINN: The Beginning (FPGA'17)

## FINN: A Framework for Fast, Scalable Binarized Neural Network Inference

Yaman Umuroglu\*†, Nicholas J. Fraser\*‡, Giulio Gambardella\*, Michaela Blott\*,
Philip Leong‡, Magnus Jahre† and Kees Vissers\*
\*Xilinx Research Labs; †Norwegian University of Science and Technology; ‡University of Sydney





## FINN – Project Mission



#### Mission

 Tools and platforms for creation of high throughput, ultra-low latency DNN compute architectures

#### End-to-end flow

 Users can easily create specialized hardware architectures on an FPGA and benefit from custom architectures and custom precision

#### Open source

Transparency and flexibility to adapt to end-users' applications



## Two Key Techniques for Customization in FINN

## Streaming Dataflow Architectures with FPGAs & FINN



## **Custom Precision Few-bit weights & activations**





## **Customized Dataflow Processing versus More Generic Architectures**

## Matrix of Processing Engines (MPE) (Vitis AI, ASICs, GPUs):



- Customized for typical DNN operations
  - for example multiply accumulate
- Lower throughput (~10KRps)
- Flexibility for ASICs
- Applications: CV, Speech



Customized dataflow processing:

- Customized/adapted for specific DNN topologies
- Streaming interfaces
- Specialization -> higher efficiency
- Lower latency (no intermediate buffering)
- Higher throughput (~100MRps)
- Flexibility through reconfiguration
- Applications: TBD, networking, material science, particle physics – smaller DNNs

#### **Dataflow Processing:**

#### Scaling to Meet Performance & Resource Requirements



1. Scale performance & resources to meet the application requirements
2. If resources allow, we can completely unfold to create a circuit that runs inference at clock speed



# Customizing Arithmetic to Minimum Precision Required

- Reducing precision shrinks hardware cost and scales performance
  - Instantiate n times more compute within the same fabric, thereby scaling performance n-fold
  - $8b/8b \rightarrow 1b/1b$, RTL => 70x

$C = \text{size of accumulator} \times \text{size of weight} \times \text{size of activation}$

- Potential to reduce memory footprint
  - NN model can stay on-chip => no memory bottlenecks
- Inherently saves power

| Precision | Model size [MB] (ResNet50) |
|-----------|----------------------------|
| 1b        | 3.2                        |
| 8b        | 25.5                       |
| 32b       | 102.5                      |
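The relationship between weight precision and model size listed above can be sketched in a few lines of Python. This is a back-of-the-envelope estimate, not a measurement: the parameter count `RESNET50_PARAMS` is an assumed round figure (the slides do not state it), and the function ignores activations and any packing overhead.

```python
def model_size_mb(num_params: int, bits_per_weight: int) -> float:
    """Weight storage in megabytes (1 MB = 10**6 bytes), ignoring packing overhead."""
    return num_params * bits_per_weight / 8 / 1e6


# Assumed parameter count for ResNet50 (illustrative round figure, not from the slides).
RESNET50_PARAMS = 25_600_000

for bits in (1, 8, 32):
    size = model_size_mb(RESNET50_PARAMS, bits)
    print(f"{bits:>2}b weights: {size:.1f} MB")
```

With this assumed parameter count the estimates land close to the slide's figures, which suggests that the listed sizes are driven almost entirely by the number of bits per weight.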


© Copyright 2020 Xilinx

## **Granularity of Customizing Arithmetic**









Dataflow architectures can exploit custom arithmetic to a greater degree



## Results



#### Few-bit DNNs + FPGA Dataflow: Showcases

Low-Power, Real-Time Image Classification

- CIFAR-10 CNV on PYNQ-Z1
- 3 kFPS @ 2.5 W
- 1 ms latency

Single and multi-node ImageNet Classification (on XACC)

- ResNet-50, MobileNet on Alveo U280 & U250
- MNv1: 5.9 kFPS, 2.2 ms (2x U280)
- RN50: 3.1 kFPS, 1.7 ms (1x U250)

We'll tell you more about this later...



## Deep Network Intrusion Detection System (NIDS)




#### **NIDS** Results

#### Matrix of **Processing Engines**

#### **Dataflow Architecture** with 2b arithmetic





Same DNN, but trained for reduced precision, with Brevitas







[Table: resource comparison covering compute (KLUTs, DSPs\*), memory (BRAM, URAM\*\*) and clock]



- Low resource footprint (especially memory)
- Low clock rate

More than 1000x performance improvement over Vitis AI with fewer resources, at 100 Gbps line rate (150 MRps), through dataflow processing and reduced precision.

\*DSPs: 8b or 16b multiply-accumulates

\*\*BRAM: 36 kbit, URAM: 288 kbit embedded SRAM blocks

## **The FINN Framework**



## FINN Framework: From DNN to FPGA Deployment



#### Brevitas: training in PyTorch, algorithmic optimizations

- Train or even learn reduced precision DNNs
- Library of standard layers
- Pretrained examples

#### FINN compiler: specializations of hardware architecture

- Perform optimizations
- Map to Vivado HLS
- Create DNN hardware IP

#### Deployment

- Embeds the DNN IP into an infrastructure design
- Generates Python run-time
- Enables integration with your application
- Works on embedded and Alveo platforms
  - Including XACC


#### **Brevitas:**

A PyTorch Library for Quantization-Aware Training



GitHub: https://github.com/Xilinx/brevitas

## **FINN Compiler**

#### Transform DNN into Custom Dataflow Architecture



Input is an ONNX description of the quantized DNN

- FINN uses an ONNX-based intermediate representation (IR)
- FINN is a Python library of graph transformations
- A synthesizable description of each layer is produced (in HLS)
- After synthesis, each layer becomes an IP block
  - AXI stream inputs and outputs



Output is the stitched DNN accelerator IP



#### **FINN Flows**

## Every Step is an ONNX Graph Transformation



Optimization, lowering, code generation... are all transformations
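The idea that every compiler step is a graph transformation can be illustrated with a toy example. The sketch below is not FINN's API: the graph is a plain list of op names standing in for an ONNX graph, and `remove_identity`, `fuse_mul_mul` and `transform` are hypothetical names. What it shows is the pattern of a transformation that returns the modified graph plus a flag saying whether anything changed, re-applied until the graph is stable.

```python
from typing import Callable, List, Tuple

Graph = List[str]  # toy stand-in for an ONNX graph: a linear chain of op names
Transformation = Callable[[Graph], Tuple[Graph, bool]]


def remove_identity(graph: Graph) -> Tuple[Graph, bool]:
    """Remove the first 'Identity' node and report whether the graph changed."""
    for i, op in enumerate(graph):
        if op == "Identity":
            return graph[:i] + graph[i + 1:], True
    return graph, False


def fuse_mul_mul(graph: Graph) -> Tuple[Graph, bool]:
    """Fuse the first pair of adjacent 'Mul' nodes into a single 'Mul'."""
    for i in range(len(graph) - 1):
        if graph[i] == "Mul" and graph[i + 1] == "Mul":
            return graph[:i] + ["Mul"] + graph[i + 2:], True
    return graph, False


def transform(graph: Graph, transformation: Transformation) -> Graph:
    """Re-apply one transformation until it reports no further change."""
    changed = True
    while changed:
        graph, changed = transformation(graph)
    return graph


graph = ["Mul", "Identity", "Mul", "MatMul", "Identity", "Mul"]
for step in (remove_identity, fuse_mul_mul):
    graph = transform(graph, step)
print(graph)
```

Keeping each transformation small and composable is what lets optimization, lowering and code generation share the same machinery.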



## FINN Compiler for Hardware Generation In 3 Steps



1. Import, streamlining transformations, conversion to HLS
2. Adjust folding to suit performance/resource requirements
3. Generate IP, and stitched IP design



#### FINN Compiler: Import, Optimization & HLS Generation



```
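// streams connecting consecutive layers
hls::stream<ap_int<185>> in;
hls::stream<ap_int<100>> inter0, inter1, ...;

...
// one library call per layer, chained through the streams
StreamingFCLayer<BINARY, BINARY, ..>(in, inter0, ...);
StreamingFCLayer<BINARY, BINARY, ..>(inter0, inter1, ...);
...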
```



- Generate calls to a pre-optimized Vivado HLS C++ library
- Support arbitrary-precision datatypes via templates
- Synthesizable to RTL



## The FINN HLS Library



- An optimized, templated Vivado HLS C++ library of 10+ common DNN layers
- Key component: MVTU (Matrix Vector Threshold Unit)
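What a matrix-vector-threshold operation computes can be sketched in plain Python. This is a functional model only, not the HLS implementation: the function names are illustrative, and the convention that the output is the count of thresholds met or exceeded (`>=`) is an assumption made for this sketch.

```python
from typing import List


def matvec(weights: List[List[int]], x: List[int]) -> List[int]:
    """Integer matrix-vector product: one accumulator per output row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def multithreshold(acc: int, thresholds: List[int]) -> int:
    """Quantized activation: count how many thresholds the accumulator reaches."""
    return sum(1 for t in thresholds if acc >= t)


def mvtu(weights: List[List[int]], thresholds: List[List[int]], x: List[int]) -> List[int]:
    """Matrix-vector product followed by a per-row threshold comparison."""
    accs = matvec(weights, x)
    return [multithreshold(a, th) for a, th in zip(accs, thresholds)]


weights = [[1, -1, 1], [-1, -1, 1]]      # two output rows, three inputs
thresholds = [[0, 2, 4], [1, 2, 3]]      # three thresholds per row -> 2-bit output
print(mvtu(weights, thresholds, [1, 2, 3]))
```

With three thresholds per row, each output takes a value from 0 to 3, so the activation fits in two bits without any floating-point arithmetic.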



#### FINN Compiler: Adjusting Performance/Resources



1. Import, streamlining transformations, conversion to HLS
2. Adjust folding to suit performance/resource requirements
3. Generate IP, and stitched IP design
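The folding trade-off in step 2 can be estimated with simple arithmetic. The sketch below assumes a fully connected layer whose parallelism is set by two parameters, `pe` (output rows computed in parallel) and `simd` (inputs consumed per row per cycle); these are the names FINN uses for its folding parameters, but the cycle formula and the helper functions here are a simplified illustration rather than the compiler's own estimator.

```python
from typing import Sequence


def fc_layer_cycles(rows: int, cols: int, pe: int, simd: int) -> int:
    """Estimated cycles per input vector for a rows x cols matrix-vector layer."""
    if rows % pe != 0 or cols % simd != 0:
        raise ValueError("pe must divide rows and simd must divide cols")
    return (rows // pe) * (cols // simd)


def pipeline_rate(cycles_per_layer: Sequence[int], clock_hz: float) -> float:
    """A streaming pipeline runs at the pace of its slowest layer."""
    return clock_hz / max(cycles_per_layer)


layers = [(64, 64), (64, 64), (8, 64)]  # (rows, cols) of a hypothetical small MLP
folded = [fc_layer_cycles(r, c, pe=8, simd=8) for r, c in layers]
unfolded = [fc_layer_cycles(r, c, pe=r, simd=c) for r, c in layers]

print("folded cycles:", folded, "->", pipeline_rate(folded, 100e6), "inferences/s")
print("unfolded cycles:", unfolded, "->", pipeline_rate(unfolded, 100e6), "inferences/s")
```

Since the slowest layer sets the pace, extra parallelism spent on layers that are already faster than the bottleneck costs resources without raising throughput.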





#### **FINN Compiler: IP Generation Flow**



1. Import, streamlining transformations, conversion to HLS
2. Adjust folding to suit performance/resource requirements
3. Generate IP, and stitched IP design

- Stream-in, stream-out FPGA IP block
  - Easy "bump-in-the-wire" integration into streaming systems
  - Simple data movement, fully deterministic







## Deployment with PYNQ for Python Productivity

```
import numpy as np
from finn_examples import models

# instantiate the accelerator
accel = models.cnv_w2a2_cifar10()
# generate an empty numpy array to use as input
dummy_in = np.empty(accel.ishape_normal, dtype=np.uint8)
# perform inference and get output
dummy_out = accel.execute(dummy_in)
```



- Use PYNQ-provided Python abstractions and drivers
- User provides a NumPy array in, calls the driver, gets a NumPy array out
  - Internally uses the PYNQ DMA driver to write/read NumPy arrays to/from the I/O streams



https://github.com/Xilinx/PYNQ https://github.com/Xilinx/finn-examples



## Infrastructure for Experimentation & Collaboration

Xilinx academic compute clusters (XACC)

- 4 centres world-wide
- Free to use
- Enabling research community
- Flexibility, shared hardware, networked FPGAs
- Not only for FINN



#### **FINN Status**

- Many example designs available at github/finn
  - Increasing application & feature & platform support
- Ongoing development (3 researchers + community)

- Training material
  - Tutorials (more coming!)
  - University classes with FINN @ Stanford, Charlotte, NTNU
    - EPFL and Technion in preparation







Looking to build up community, applications and functionality

If you're interested in collaborating, please get in touch!



## A Tour of the Repositories

Yaman Umuroglu, Senior Research Scientist, Xilinx Research Labs



#### Overview of the FINN software stack







#### Tour of the FINN software stack







## finn-examples: prebuilt dataflow accelerators

- Dataflow accelerators for MNIST, CIFAR-10, ImageNet
  - Bitfiles for PYNQ boards and Alveo U250
- Jupyter notebook example to run each accelerator
  - Based on PYNQ Python driver

```
# on your PYNQ board or Alveo U250
pip3 install finn-examples
pynq get-notebooks --from-package finn-examples -p .
```

- Scripts to rebuild the examples
- More examples on the way
  - ResNet-50 toolflow (bitfiles already on Xilinx/ResNet50-PYNQ)
  - speech recognition & keyword spotting
  - <your cool dataflow accelerator example here!>














## finn: dataflow compiler

- ONNX -> bitfile (or IPI design) build automation
- Docker environment with all dependencies
- Large library of graph transformations
  - Streamlining to remove floating point scaling factors
  - Lowering to finn-hlslib ops
  - Stitching generated IPs in Vivado IPI
- Custom ops corresponding to finn-hlslib
  - Including code generation and verification
- Jupyter notebook tutorials (basic, advanced, end2end)
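The streamlining bullet above mentions removing floating point scaling factors. One way to picture this is that a positive scale applied before a threshold activation can be folded into the thresholds, leaving only integer comparisons. The sketch below is a simplified illustration under that assumption; `absorb_scale` and the other names are hypothetical and do not correspond to FINN's transformation classes.

```python
import math
from typing import List


def multithreshold(x: float, thresholds: List[float]) -> int:
    """Quantized activation: count how many thresholds x reaches."""
    return sum(1 for t in thresholds if x >= t)


def scaled_activation(acc: int, scale: float, thresholds: List[float]) -> int:
    """Original form: floating-point scale followed by thresholding."""
    return multithreshold(scale * acc, thresholds)


def absorb_scale(scale: float, thresholds: List[float]) -> List[int]:
    """Fold a positive scale into the thresholds.

    For an integer accumulator, scale * acc >= t is equivalent to
    acc >= ceil(t / scale) when scale > 0, so the thresholds become integers.
    """
    if scale <= 0:
        raise ValueError("this sketch only handles positive scales")
    return [math.ceil(t / scale) for t in thresholds]


scale = 0.25
thresholds = [0.5, 1.0, 1.75]
int_thresholds = absorb_scale(scale, thresholds)

# Both forms agree for every integer accumulator value in the tested range.
for acc in range(-10, 11):
    assert scaled_activation(acc, scale, thresholds) == multithreshold(acc, int_thresholds)
print(int_thresholds)
```

After this rewrite the layer needs no multiplier for the scale at all, which is the kind of saving streamlining aims for before lowering to hardware ops.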
















## brevitas: quantization-aware training in PyTorch

- Train NNs with quantized weights and activations
- Quantized versions of many PyTorch layers
  - e.g. `brevitas.nn.QuantConv2d` instead of `torch.nn.Conv2d`
- Flexible quantization schemes
  - Mixed precision with fixed or learnable bitwidths
- Example pretrained models + training scripts
  - image classification, speech-to-text, text-to-speech
  - <your quantized DNN model contribution here!>
- Different ONNX export flows for different backends in progress
  - FINN, Xilinx DPU, standard ONNX














#### finn-hlslib: library of Vivado HLS components

- 10+ common DNN layer types in Vivado HLS
  - <your HLS layer contribution here!>



- Input, weight, output datatypes (precision)
- Parallelism along different axes
- Mapping to FPGA resources (LUT or DSP, LUTRAM or BRAM...)

- Easily composable components
  - AXI stream-in, AXI stream-out








FINN project landing page: https://xilinx.github.io/finn



#### finn-base: ONNX compiler infrastructure

- Infrastructure for manipulating + verifying custom ONNX graphs
  - Not tied to FINN's HLS op implementations or lowering flow
  - Useful for exploring DNN compilation without full FINN
- Three key parts
  - Wrapper around ONNX protobuf with helper functions
  - Defining + executing (for verification) custom ops
  - Defining + applying graph transformations
- Various other utilities
  - e.g. execute Verilog as part of ONNX custom op (with pyverilator)



Documentation: Read the Docs (readthedocs.io)



### Putting it all together: a FINN end-to-end flow





## FINN Community Update

Zaid Al-Ars, Associate Professor, TU Delft



### Call for community building

1300+ stars on GitHub across repos, 600+ citations across papers

There are many other users + use-cases we don't know about -- we want to hear from you!

We welcome partners to work together to build and extend FINN!

We enable community efforts and provide support






### Various types of engagement with FINN

| Type of engagement             | Examples                                                                                                                                    |
|--------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------|
| Contribution to FINN framework | University of Heidelberg; Delft University of Technology                                                                                    |
| Using FINN in research         | Research organizations: ESA, Fraunhofer, CERN, hls4ml; various other FPGA NN papers use code from FINN; LUTNet [FCCM'19], ReBNet [FCCM'18] |
| Using FINN in industry         | Companies: Daimler and Thales; ongoing evaluation by networking companies                                                                   |
| Using FINN in education        | University classes in Stanford, UNC Charlotte, NTNU, et al.                                                                                 |



#### Two examples from collaborators

- Contribution to FINN framework: LogicNets integration into FINN (Jon Ander Lezeta)
- Using FINN in research: Enabling support of QuartzNet in FINN (Mirza Mrahorović & Jakoba Petri-König)





# LogicNets integration into FINN

Jon Ander Lezeta



#### LogicNets

[Umuroglu et al. "LogicNets: Co-Designed Neural Networks and Circuits for Extreme-Throughput Applications", FPL'20]



- **PyTorch:** Specialized DNN topology (with high sparsity + activation quantization)
- **FPGA** (after conversion): Fully-spatial circuit implementation
  - One full sample every clock
  - Low logic depth, high $F_{clk}$
  - e.g. intrusion detection at 450M+ samples/second



### LogicNets

- **PyTorch:** Neuron Equivalent (NEQ)
  - Total input: 6 bits
  - Total output: 1 bit
- **FPGA** (after conversion by enumerating inputs): Hardware Building Block (HBB)
  - 6:1 LUT
  - Total input: 6 bits
  - Total output: 1 bit
  - Hardware cost: 1 x LUT6
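The conversion by input enumeration described above can be sketched as follows: because the neuron sees only 6 input bits, all 64 input combinations can be evaluated once and stored as a truth table, which is exactly what a 6-input LUT holds. The neuron below, its weights and its sign-style activation are made up for illustration and are not taken from the LogicNets paper.

```python
from itertools import product
from typing import List, Sequence


def neuron(bits: Sequence[int], weights: Sequence[float], bias: float) -> int:
    """Toy neuron: weighted sum of binary inputs followed by a 1-bit activation."""
    acc = sum(w * b for w, b in zip(weights, bits)) + bias
    return 1 if acc >= 0 else 0


def to_truth_table(weights: Sequence[float], bias: float) -> List[int]:
    """Enumerate every input combination; the first input is the most significant bit."""
    n = len(weights)
    return [neuron(bits, weights, bias) for bits in product((0, 1), repeat=n)]


def lut_lookup(table: Sequence[int], bits: Sequence[int]) -> int:
    """Evaluate the neuron by table lookup, as a hardware LUT would."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return table[index]


weights = [0.7, -1.2, 0.4, 0.9, -0.3, 0.5]  # hypothetical trained weights
bias = -0.6
table = to_truth_table(weights, bias)

# The lookup table reproduces the neuron for all 64 possible inputs.
assert all(lut_lookup(table, bits) == neuron(bits, weights, bias)
           for bits in product((0, 1), repeat=6))
print(len(table), "entries")
```

The floating-point weights vanish after conversion, so the hardware cost depends only on the number of input and output bits, not on how complex the neuron's function is.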



#### **Limitation of current LogicNets**

- LogicNets
  - Good approach for sparse layers
  - Compute quantized neurons with complex functions in LUTs
- Image Classification
  - First and last layers sensitive to extreme quantization and pruning
  - Can cause a noticeable accuracy drop
- Solution: FINN + LogicNets
  - Implement a LogicNets backend in FINN to mix FINN and LogicNets topologies





#### **Benefits of LogicNets in FINN**











### **Enabling support of QuartzNet in FINN**

Mirza Mrahorović








### Call for community building

- For questions and collaboration ideas, please get in contact!
- Gitter channel for communication
  - https://gitter.im/xilinx-finn/community
- Zaid Al-Ars: z.al-ars@tudelft.nl
- Jakoba Petri-König: J.Petri-Koenig@tudelft.nl
- Yaman Umuroglu: yamanu@xilinx.com



## **Q&A** session





# Thank You

