



# Bridge of Life U Education

## **FINN Verification**

Lecturer: Hua-Yang Weng

Date: 2022/08/28

https://github.com/Xilinx/finn/blob/main/notebooks/end2end\_example/bnn-pynq/tfc\_end2end\_verification.ipynb

https://github.com/Xilinx/finn-hlslib/tree/master/tb





## From AI to Gate Textbook






#### RTL simulation

The last part of the verification chapter is RTL simulation. It corresponds to the "co-simulation" step of a classical HLS flow, so it is performed after cpp simulation has passed and after C synthesis.

The tool we use is Verilator, an open-source tool that converts synthesizable Verilog code into C++. Users can then run RTL simulation and trace waveforms to debug the Verilog code. Since we are working in FINN, we use its Python binding, PyVerilator, to check the RTL code generated by the HLS tool.

There are two ways to perform RTL simulation: simulate the model node-by-node, as we did previously, or simulate the full stitched IP. Both methods use a class called PrepareRTLSim() to generate the simulation files and to add an "rtlsim\_so" attribute, which points to the generated simulation files, to each node. Because only fpgadataflow nodes have RTL hardware descriptions, PrepareRTLSim() requires that the input model in both cases contain only fpgadataflow nodes.

As with cppsim, both methods still run execute\_onnx() on the full graph model (the parent model), with the child graph referenced from it, to perform the final RTL simulation.

#### **Emulation of model node-by-node**

As mentioned earlier, PrepareRTLSim() requires the input model to contain only fpgadataflow nodes. The code below first takes the child fpgadataflow model and performs HLS synthesis via Vivado\_hls. The resulting model is then given the "rtlsim" execution mode, again using the SetExecMode() class. After that, PrepareRTLSim() generates the corresponding simulation files.

```
from finn.core.modelwrapper import ModelWrapper
from finn.transformation.general import GiveUniqueNodeNames
from finn.transformation.fpgadataflow.set_exec_mode import SetExecMode
from finn.transformation.fpgadataflow.prepare_rtlsim import PrepareRTLSim
from finn.transformation.fpgadataflow.prepare_ip import PrepareIP
from finn.transformation.fpgadataflow.hlssynth_ip import HLSSynthIP

# build_dir is defined earlier in the notebook
test_fpga_part = "xc7z020clg400-1"
target_clk_ns = 10

child_model = ModelWrapper(build_dir + "/tfc_w1_a1_set_folding_factors.onnx")
child_model = child_model.transform(GiveUniqueNodeNames())
child_model = child_model.transform(PrepareIP(test_fpga_part, target_clk_ns))
child_model = child_model.transform(HLSSynthIP())
child_model = child_model.transform(SetExecMode("rtlsim"))
child_model = child_model.transform(PrepareRTLSim())
```






- Introduction
- Python Simulation
- Cpp Simulation
- RTL Simulation
- HLS Testbench










# Introduction (1/3)

- Python Simulation: (Behavior-Level)
  - Verification executed within the FINN compiler.
  - Focuses on the Python language.
  - For code executed in Python.
- Cpp Simulation: (Behavior-Level)
  - For code executed in C/C++.
  - Enables us to check the HLS code for the hardware.
- RTL Simulation: (Register-Transfer-Level)
  - Performs cycle-accurate tests and verifies the final hardware HDL implementation.





# Introduction (2/3): Golden

- Calculated directly from the Brevitas model
  - By running some example data from the MNIST dataset





# Introduction (3/3)

[Figure: model graph showing the child node; visible tensor label: Reshape\_0\_out0]










# Python Simulation (1/3)

Functionality check for operations that do not carry the fpgadataflow backend attribute.

```
from finn.custom_op.general.xnorpopcount import xnorpopcountmatmul
showSrc(xnorpopcountmatmul)

# Output of showSrc:
def xnorpopcountmatmul(inp0, inp1):
    """Simulates XNOR-popcount matrix multiplication as a regular bipolar
    matrix multiplication followed by some post processing."""
    # extract the operand shapes
    # (M, K0) = inp0.shape
    # (K1, N) = inp1.shape
    K0 = inp0.shape[-1]
    K1 = inp1.shape[0]
    # make sure shapes are compatible with matmul
    assert K0 == K1, "Matrix shapes are not compatible with matmul."
    K = K0
    # convert binary inputs to bipolar
    inp0_bipolar = 2.0 * inp0 - 1.0
    inp1_bipolar = 2.0 * inp1 - 1.0
    # call regular numpy matrix multiplication
    out = np.matmul(inp0_bipolar, inp1_bipolar)
    # XNOR-popcount does not produce the regular dot product result --
    # it returns the number of +1s after XNOR. let P be the number of +1s
    # and N be the number of -1s. XNOR-popcount returns P, whereas the
    # regular dot product result from numpy is P-N, so we need to apply
    # some correction.
    # out = P-N
    # K = P+N
    # out + K = 2P, so P = (out + K)/2
    return (out + K) * 0.5
```

- Example of the execution function of an XNOR-popcount node.
- It describes the node's behavior in Python and calculates the node's result.
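To make the correction step concrete, the sketch below compares the bipolar-matmul formulation with a direct "XNOR, then count the ones" computation on small random binary matrices. The helper names `xnorpopcount_direct` and `xnorpopcount_via_bipolar` are illustrative only and are not part of FINN.

```python
import numpy as np


def xnorpopcount_direct(a, b):
    """Count, for each output element, the positions where the bits agree.

    a has shape (M, K) and b has shape (K, N), with entries in {0, 1}.
    XNOR of two bits is 1 exactly when the bits are equal.
    """
    agree = a[:, :, None] == b[None, :, :]  # shape (M, K, N)
    return agree.sum(axis=1).astype(float)


def xnorpopcount_via_bipolar(a, b):
    """Same result via a bipolar matmul and the (out + K) / 2 correction."""
    k = a.shape[-1]
    out = np.matmul(2.0 * a - 1.0, 2.0 * b - 1.0)  # equals P - N
    return (out + k) * 0.5                         # recovers P


rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=(3, 8))
b = rng.integers(0, 2, size=(8, 4))

direct = xnorpopcount_direct(a, b)
via_bipolar = xnorpopcount_via_bipolar(a, b)
print(np.array_equal(direct, via_bipolar))
```

Both functions return the same (3, 4) matrix, which is why the Python execution function can rely on an ordinary numpy matmul.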





# Python Simulation (2/3)

- Standard ONNX node simulation
  - Uses onnxruntime (an open-source tool).
  - Runs on the host side.
- execute\_onnx()
  - FINN API that simulates the model node-by-node using onnxruntime.
  - The results are stored in a context dictionary.
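The idea of node-by-node execution with a context dictionary can be illustrated with a toy executor. This is only a sketch of the concept, not how execute\_onnx() is implemented; the graph format, the tensor name `scaled`, and the function `execute_toy_graph` are hypothetical.

```python
import numpy as np

# Hypothetical stand-in for a graph: each node names its input tensors,
# its output tensor, and a Python function that computes the output.
toy_graph = [
    {"inputs": ["global_in"], "output": "scaled", "fn": lambda x: 2.0 * x},
    {"inputs": ["scaled"], "output": "global_out", "fn": lambda x: x + 1.0},
]


def execute_toy_graph(graph, input_dict):
    """Run the nodes in order, storing every tensor in a context dict."""
    context = dict(input_dict)
    for node in graph:
        args = [context[name] for name in node["inputs"]]
        context[node["output"]] = node["fn"](*args)
    return context


context = execute_toy_graph(toy_graph, {"global_in": np.array([1.0, 2.0])})
print(context["global_out"])
```

Because every intermediate tensor stays in the dictionary, the output of any single node can be inspected after the run, which is what makes this style of simulation convenient for debugging.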





# Python Simulation (3/3)

```
import numpy as np
import onnx.numpy_helper as nph
from finn.core.modelwrapper import ModelWrapper

# input_tensor and output_golden come from the golden-reference step
input_dict = {"global_in": nph.to_array(input_tensor)}

model_for_sim = ModelWrapper(build_dir+"/tfc_w1a1_ready_for_hls_conversion.onnx")

import finn.core.onnx_exec as oxe
output_dict = oxe.execute_onnx(model_for_sim, input_dict)
output_pysim = output_dict[list(output_dict.keys())[0]]

if np.isclose(output_pysim, output_golden, atol=1e-3).all():
    print("Results are the same!")
else:
    print("The results are not the same!")
```

Results are the same!

The result is compared with the theoretical "golden" value for verification.
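Since the same comparison against the golden output recurs for every simulation mode, it can be wrapped in a small helper. The sketch below assumes the outputs are plain numpy arrays; the name `matches_golden` and the sample values are made up for illustration.

```python
import numpy as np


def matches_golden(output, golden, atol=1e-3):
    """Return True if output equals golden within an absolute tolerance."""
    output = np.asarray(output)
    golden = np.asarray(golden)
    if output.shape != golden.shape:
        # A shape mismatch is treated as a failure, not broadcast away.
        return False
    return bool(np.isclose(output, golden, atol=atol).all())


golden = np.array([[0.1, 0.9, 0.0]])
close = golden + 5e-4  # inside the tolerance
far = golden + 5e-2    # outside the tolerance

print(matches_golden(close, golden), matches_golden(far, golden))
```

Checking the shape explicitly avoids a false "same" result when numpy would otherwise broadcast two differently shaped arrays.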










# Cpp Simulation (1/6)

Simulates the nodes marked as HLS custom ops by their backend attribute.

- A C++ executable tests the C++ HLS functions:
  - 1. Input data from Brevitas is stored in an .npy file.
  - 2. The executable reads the input.
  - 3. It streams the data to the HLS function (finn-hlslib).
  - 4. It writes the result to a new .npy file.
- The resulting .npy file can be read back into FINN.
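The file-based hand-off in the steps above can be mimicked in plain Python. In this sketch `fake_hls_kernel` is a hypothetical placeholder for the compiled C++ executable, and the file names are invented; only the .npy round trip itself reflects the flow described.

```python
import os
import tempfile

import numpy as np


def fake_hls_kernel(x):
    """Placeholder for the compiled C++ executable (hypothetical)."""
    return x.sum(axis=-1, keepdims=True)


with tempfile.TemporaryDirectory() as work_dir:
    in_path = os.path.join(work_dir, "input_0.npy")
    out_path = os.path.join(work_dir, "output.npy")

    # 1. Store the input data in an .npy file.
    np.save(in_path, np.ones((1, 784), dtype=np.float32))

    # 2.-4. Read the input, process it, and write the result to a new file.
    np.save(out_path, fake_hls_kernel(np.load(in_path)))

    # The resulting file is read back on the Python side.
    result = np.load(out_path)

print(result.shape)
```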





# Cpp Simulation (2/6)

- Flow:
  - 1. PrepareCppSim()
  - 2. CompileCppSim()
  - 3. SetExecMode()
  - 4. execute\_onnx()

- **1. PrepareCppSim:** generates the C++ code for the corresponding hls layer
- **2. CompileCppSim:** Compiles the C++ code and stores the path to the executable

```
from finn.transformation.fpgadataflow.prepare_cppsim import PrepareCppSim
from finn.transformation.fpgadataflow.compile_cppsim import CompileCppSim
from finn.transformation.general import GiveUniqueNodeNames

model_for_cppsim = model_for_cppsim.transform(GiveUniqueNodeNames())
model_for_cppsim = model_for_cppsim.transform(PrepareCppSim())
model_for_cppsim = model_for_cppsim.transform(CompileCppSim())
```





# Cpp Simulation (3/6)



The following node attributes have been added:

- code\_gen\_dir\_cppsim:
   Directory where the files for the simulation using C++ are stored
- executable\_path: specifies the path to the executable





# Cpp Simulation (4/6)

Directory "StreamingFCLayer\_Batch\_0\_ajlmloxf"

```
from finn.custom_op.registry import getCustomOp

fc0 = model_for_cppsim.graph.node[1]
fc0w = getCustomOp(fc0)
code_gen_dir = fc0w.get_nodeattr("code_gen_dir_cppsim")
!ls {code_gen_dir}
# compile.sh  execute_StreamingFCLayer_Batch.cpp  memblock_0.dat
# node_model (executable file)  thresh.h  weights.npy

# Shape of the stored weights:
# >>> w = np.load("./weights.npy")
# >>> w.shape
# (1, 64, 784)

# Inside compile.sh:
# g++ -o
# /home/yuoto/multimediaIC/FINN/practice/code/build_docker/code_gen_cppsim_StreamingFCLayer_Batch_0_ajlmloxf/node_model
# /home/yuoto/multimediaIC/FINN/practice/code/build_docker/code_gen_cppsim_StreamingFCLayer_Batch_0_ajlmloxf/*.cpp /workspace/cnpy/cnpy.cpp
# -I/workspace/finn/src/finn/qnn-data/cpp
# -I/workspace/cnpy/
# -I/workspace/finn-hlslib
# -I/home/yuoto/YuotoSSD/Xilinx/Vivado/2020.1/include --std=c++11 -O3 -lz
```





# Cpp Simulation (5/6)

Inside "execute\_StreamingFCLayer\_Batch.cpp"

```
#define AP_INT_MAX_W 784
#include "cnpy.h"
#include "npy2apintstream.hpp"
#include <vector>
#include "bnn-library.h"

// includes for network parameters
#include "weights.hpp"
#include "activations.hpp"
#include "mvau.hpp"
#include "thresh.h"
```

Include HLS functions

```
npy2apintstream<ap_uint<49>, ap_uint<1>, 1, float>("/home/yuoto/multimediaIC/FINN/practice/code/build_d
npy2apintstream<ap_uint<784>, ap_uint<1>, 1, float>("/home/yuoto/multimediaIC/FINN/practice/code/build
apintstream2npy<ap_uint<16>, ap_uint<1>, 1, float>(out, {1, 4, 16}, "/home/yuoto/multimediaIC/FINN/prac
```





# Cpp Simulation (6/6)

- 3. SetExecMode()
  - Sets the execution mode
  - E.g. cpp simulation -> "cppsim".

```
from finn.transformation.fpgadataflow.set_exec_mode import SetExecMode

model_for_cppsim = model_for_cppsim.transform(SetExecMode("cppsim"))
model_for_cppsim.save(build_dir+"/tfc_w1_a1_for_cppsim.onnx")
```

- 4. execute\_onnx()
  - Need to integrate the child model in the parent model first.

```
parent_model = ModelWrapper(build_dir+"/tfc_w1_a1_dataflow_parent.onnx")
sdp_node = parent_model.graph.node[2]
child_model = build_dir + "/tfc_w1_a1_for_cppsim.onnx"
getCustomOp(sdp_node).set_nodeattr("model", child_model)
output_dict = oxe.execute_onnx(parent_model, input_dict)
output_cppsim = output_dict[list(output_dict.keys())[0]]

if np.isclose(output_cppsim, output_golden, atol=1e-3).all():
    print("Results are the same!")
else:
    print("The results are not the same!")
```
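The snippets on these slides pick the dataflow-partition node with a hard-coded index such as `graph.node[2]`, which depends on the node order of the particular parent graph. One possible alternative is to search by operator type. The sketch below uses stand-in node objects so that it runs without FINN; the helper `find_nodes_by_op_type`, the toy node list, and the assumption that the partition node's op type is "StreamingDataflowPartition" should all be checked against the actual parent model.

```python
from types import SimpleNamespace


def find_nodes_by_op_type(nodes, op_type):
    """Return the indices of all nodes whose op_type matches."""
    return [i for i, node in enumerate(nodes) if node.op_type == op_type]


# Stand-in for parent_model.graph.node (hypothetical node sequence).
toy_nodes = [
    SimpleNamespace(op_type="Reshape"),
    SimpleNamespace(op_type="StreamingDataflowPartition"),
    SimpleNamespace(op_type="Mul"),
]

sdp_indices = find_nodes_by_op_type(toy_nodes, "StreamingDataflowPartition")
print(sdp_indices)
```

With a real model, the same function could be applied to `parent_model.graph.node`, and the resulting index used in place of the literal one.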










# RTL Simulation (1/6)

Performed after the IP blocks have been generated from the corresponding HLS layers.

- RTL co-simulation using PyVerilator
- Two ways to run rtlsim:
  - 1. Node-by-node
  - 2. Executed as a whole
    - Requires all nodes to be HLS nodes





# RTL Simulation (2/6): 1. node-by-node

- PrepareRTLSim()
  - Applied to the child model.
  - The execution mode is first set to "rtlsim" with SetExecMode().
  - A new node attribute "rtlsim\_so" is created.

```
from finn.transformation.fpgadataflow.prepare_rtlsim import PrepareRTLSim
from finn.transformation.fpgadataflow.prepare_ip import PrepareIP
from finn.transformation.fpgadataflow.hlssynth_ip import HLSSynthIP

test_fpga_part = "xc7z020clg400-1"
target_clk_ns = 10

child_model = ModelWrapper(build_dir + "/tfc_w1_a1_set_folding_factors.onnx")
child_model = child_model.transform(GiveUniqueNodeNames())
child_model = child_model.transform(PrepareIP(test_fpga_part, target_clk_ns))
child_model = child_model.transform(HLSSynthIP())
child_model = child_model.transform(SetExecMode("rtlsim"))
child_model = child_model.transform(PrepareRTLSim())
child_model.save(build_dir + "/tfc_w1_a1_dataflow_child.onnx")
```





# RTL Simulation (3/6): 1. node-by-node





# RTL Simulation (4/6): 1. node-by-node

- Merge the child model into the parent model.
- Then execute with execute\_onnx().

```
# parent model
model_for_rtlsim = ModelWrapper(build_dir + "/tfc_w1_a1_dataflow_parent.onnx")
#showInNetron(build_dir + "/tfc_w1_a1_dataflow_parent.onnx")
# reference child model

sdp_node = getCustomOp(model_for_rtlsim.graph.node[1])

sdp_node.set_nodeattr("model", build_dir + "/tfc_w1_a1_dataflow_child.onnx")
model_for_rtlsim = model_for_rtlsim.transform(SetExecMode("rtlsim"))
```

```
output_dict = oxe.execute_onnx(model_for_rtlsim, input_dict)
output_rtlsim = output_dict[list(output_dict.keys())[0]]

if np.isclose(output_rtlsim, output_golden, atol=1e-3).all():
    print("Results are the same!")
else:
    print("The results are not the same!")
```





# RTL Simulation (6/6): 2. Executed as a whole

- Simulate the whole (stitched) child model at once.
  - Merged into the parent model
  - Executed through the parent model
```
from finn.transformation.fpgadataflow.insert_dwc import InsertDWC
from finn.transformation.fpgadataflow.insert_fifo import InsertFIFO
from finn.transformation.fpgadataflow.create_stitched_ip import CreateStitchedIP

child_model = ModelWrapper(build_dir + "/tfc_w1_a1_dataflow_child.onnx")
child_model = child_model.transform(InsertDWC())
child_model = child_model.transform(InsertFIFO())
child_model = child_model.transform(GiveUniqueNodeNames())
child_model = child_model.transform(PrepareIP(test_fpga_part, target_clk_ns))
child_model = child_model.transform(HLSSynthIP())
child_model = child_model.transform(CreateStitchedIP(test_fpga_part, target_clk_ns))
child_model = child_model.transform(PrepareRTLSim())
child_model.set_metadata_prop("exec_mode", "rtlsim")
child_model.save(build_dir + "/tfc_w1_a1_dataflow_child.onnx")

# parent model
model_for_rtlsim = ModelWrapper(build_dir + "/tfc_w1_a1_dataflow_parent.onnx")
# reference child model
sdp_node = getCustomOp(model_for_rtlsim.graph.node[2])
sdp_node.set_nodeattr("model", build_dir + "/tfc_w1_a1_dataflow_child.onnx")
```

```
output_dict = oxe.execute_onnx(model_for_rtlsim, input_dict)
output_rtlsim = output_dict[list(output_dict.keys())[0]]

if np.isclose(output_rtlsim, output_golden, atol=1e-3).all():
    print("Results are the same!")
else:
    print("The results are not the same!")
```










#### **HLS Testbench**

For the HLS library itself or other custom HLS functions

Vivado\_hls or Vitis\_hls can be used just as in a regular HLS development flow.

- Testbenches for finn-hlslib:
  - https://github.com/Xilinx/finn-hlslib/tree/master/tb








### **HLS Testbench**

```
int main()
{
    static ap_uint<INPUT_PRECISION> IMAGE[MAX_IMAGES][IFMDim1*IFMDim1][IFM_Channels1];
    static ap_uint<ACTIVATION_PRECISION> TEST[MAX_IMAGES][OFMDim1][OFMDim1][OFM_Channels1];
    stream<ap_uint<IFM_Channels1*INPUT_PRECISION> > input_stream("input_stream");
    stream<ap_uint<OFM_Channels1*ACTIVATION_PRECISION> > output_stream("output_stream");
    unsigned int counter = 0;
    for (unsigned int n_image = 0; n_image < MAX_IMAGES; n_image++) {
        for (unsigned int oy = 0; oy < IFMDim1; oy++) {
            for (unsigned int ox = 0; ox < IFMDim1; ox++) {
                ap_uint<INPUT_PRECISION*IFM_Channels1> input_channel = 0;
                for (unsigned int channel = 0; channel < IFM_Channels1; channel++) {
                    ap_uint<INPUT_PRECISION> input = (ap_uint<INPUT_PRECISION>)(counter);
                    IMAGE[n_image][oy*IFMDim1+ox][channel] = input;
                    // shift earlier channels down, then place the new one in the top bits
                    input_channel = input_channel >> INPUT_PRECISION;
                    input_channel(IFM_Channels1*INPUT_PRECISION-1,(IFM_Channels1-1)*INPUT_PRECISION) = input;
                    counter++;
                }
                input_stream.write(input_channel);
            }
        }
    }
    // (excerpt; main() continues)
```
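The inner loop of the testbench packs all channels of one pixel into a single wide word by shifting right and writing each new channel into the top bits. The Python sketch below models that bit manipulation with ordinary integers; `pack_channels` and `unpack_channels` are illustrative names, and the 8-bit precision is an arbitrary choice for the example.

```python
def pack_channels(values, precision):
    """Pack values into one word using the shift-and-insert scheme."""
    width = len(values) * precision
    mask = (1 << precision) - 1
    word = 0
    for value in values:
        word >>= precision                             # make room at the top
        word |= (value & mask) << (width - precision)  # insert into top bits
    return word


def unpack_channels(word, n_channels, precision):
    """Recover the channel values, with channel 0 in the lowest bits."""
    mask = (1 << precision) - 1
    return [(word >> (i * precision)) & mask for i in range(n_channels)]


packed = pack_channels([1, 2, 3], precision=8)
unpacked = unpack_channels(packed, n_channels=3, precision=8)
print(hex(packed), unpacked)
```

After the last channel has been inserted, the first channel sits in the least significant bits of the word and the last channel in the most significant bits.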





```
static ap_uint<WIDTH> W1[OFM_Channels1][KERNEL_DIM][KERNEL_DIM][IFM_Channels1];
constexpr int TX = (IFM_Channels1*KERNEL_DIM*KERNEL_DIM) / SIMD1;
constexpr int TY = OFM_Channels1 / PE1;
unsigned int kx=0;
unsigned int ky=0;
unsigned int chan_count=0;
unsigned int out_chan_count=0;
for (unsigned int oy = 0; oy < TY; oy++) {
    for (unsigned int pe = 0; pe < PE1; pe++) {
        for (unsigned int ox = 0; ox < TX; ox++) {
            for (unsigned int simd = 0; simd < SIMD1; simd++) {
                W1[out_chan_count][kx][ky][chan_count] = PARAM::weights.weights(oy*TX + ox)[pe][simd];
                chan_count++;
                if (chan_count==IFM_Channels1){
                    chan_count=0;
                    kx++;
                    if (kx==KERNEL_DIM){
                        kx=0;
                        ky++;
                        if (ky==KERNEL_DIM){
                            ky=0;
                            out_chan_count++;
                            if (out_chan_count==OFM_Channels1){
                                out_chan_count=0;
                            }
                        }
                    }
                }
            }
        }
    }
}

// reference convolution on the unpacked image and weights
conv<MAX_IMAGES, IFMDim1, OFMDim1, IFM_Channels1, OFM_Channels1, KERNEL_DIM, 1, ap_uint<INPUT_PRECISION> >(IMAGE, W1, TEST);
```
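The cascade of counters in the weight-initialization loop is a mixed-radix counter: the channel index advances fastest, then kx, then ky, then the output channel. The sketch below replays the cascade in Python and checks it against a closed-form index computation. The dimension values are small illustrative numbers, not the ones used in the finn-hlslib testbench, and both function names are hypothetical.

```python
def cascade_indices(n_steps, ifm_ch, kernel_dim, ofm_ch):
    """Replay the counter cascade and record the index used at each step."""
    kx = ky = chan = out_chan = 0
    indices = []
    for _ in range(n_steps):
        indices.append((out_chan, kx, ky, chan))
        chan += 1
        if chan == ifm_ch:
            chan = 0
            kx += 1
            if kx == kernel_dim:
                kx = 0
                ky += 1
                if ky == kernel_dim:
                    ky = 0
                    out_chan += 1
                    if out_chan == ofm_ch:
                        out_chan = 0
    return indices


def closed_form_index(i, ifm_ch, kernel_dim, ofm_ch):
    """Same index, computed directly from the flat step number i."""
    chan = i % ifm_ch
    kx = (i // ifm_ch) % kernel_dim
    ky = (i // (ifm_ch * kernel_dim)) % kernel_dim
    out_chan = (i // (ifm_ch * kernel_dim * kernel_dim)) % ofm_ch
    return (out_chan, kx, ky, chan)


IFM_CH, KERNEL_DIM, OFM_CH = 2, 3, 4  # illustrative sizes only
total = IFM_CH * KERNEL_DIM * KERNEL_DIM * OFM_CH

replayed = cascade_indices(total, IFM_CH, KERNEL_DIM, OFM_CH)
expected = [closed_form_index(i, IFM_CH, KERNEL_DIM, OFM_CH) for i in range(total)]
print(replayed == expected)
```

Each position of the weight array is visited exactly once over the full loop, so the folded PE/SIMD weight layout is unrolled into a plain 4-D array without gaps or overlaps.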



# Debugging with Vitis\_hls

- After C synthesis (FINN hardware build flow):
  - A vitis\_hls project file is left in the directory.
  - Open it directly and add a testbench from GitHub.
- See the textbook for details.

