# System Check and Configuration

In [1]:
#CUDA Check
import torch
torch.cuda.is_available()

True

In [2]:
# Path Check
!pwd

/home/ba/Xilinx/finn-0v10-dev/finn/notebooks/batuhan


In [3]:
folding_flag = 'custom'

# Outline and Introduction

1. [Introduction to `build_dataflow` Tool](#intro_build_dataflow)
2. [Understanding the Build Configuration: `DataflowBuildConfig`](#underst_build_conf)  
    2.1. [Output Products](#output_prod)  
    2.2. [Configuring the Board and FPGA Part](#config_fpga)  
    2.3. [Configuring the Performance](#config_perf)
3. [Launch a Build: Only Estimate Reports](#build_estimate_report)
4. [Launch a Build: Stitched IP, out-of-context synth and rtlsim Performance](#build_ip_synth_rtlsim)
5. [(Optional) Launch a Build: PYNQ Bitfile and Driver](#build_bitfile_driver)
6. [(Optional) Run on PYNQ board](#run_on_pynq)

## Understanding the Build Configuration: `DataflowBuildConfig` <a id="underst_build_conf"></a>

The build configuration is specified by an instance of `finn.builder.build_dataflow_config.DataflowBuildConfig`. The configuration is a Python [`dataclass`](https://docs.python.org/3/library/dataclasses.html) which can be serialized into or de-serialized from JSON files for persistence, although we'll just set it up in Python here.
The configuration has many options for customizing different aspects of the build; we'll only cover a few of them in this notebook. You can read the details on all the config options in [the FINN API documentation](https://finn-dev.readthedocs.io/en/latest/source_code/finn.builder.html#finn.builder.build_dataflow_config.DataflowBuildConfig).
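
To illustrate what serializing a dataclass-based configuration to and from JSON looks like, here is a minimal sketch that uses only the standard library. `MiniBuildConfig` is a hypothetical stand-in that borrows a few field names discussed in this notebook; it is not FINN's `DataflowBuildConfig`, whose own serialization helpers may differ.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class MiniBuildConfig:
    """Hypothetical stand-in for a build configuration (not the FINN class)."""

    output_dir: str
    synth_clk_period_ns: float
    fpga_part: Optional[str] = None
    target_fps: Optional[int] = None


def config_to_json(cfg: MiniBuildConfig) -> str:
    # asdict() turns the dataclass into a plain dict that json can serialize
    return json.dumps(asdict(cfg), indent=2)


def config_from_json(text: str) -> MiniBuildConfig:
    # The JSON keys match the dataclass field names, so ** unpacking works
    return MiniBuildConfig(**json.loads(text))


cfg = MiniBuildConfig(
    output_dir="output_estimates_only",
    synth_clk_period_ns=10.0,
    fpga_part="xc7z020clg400-2",
)
restored = config_from_json(config_to_json(cfg))
print(restored == cfg)
```

Storing the configuration next to the build outputs in this way makes it easy to see later which settings produced a given set of reports.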

Let's go over some of the members of the `DataflowBuildConfig`:

### Output Products <a id="output_prod"></a>

The build can produce many different outputs, and some of them can take a long time (e.g. bitfile synthesis for a large network). When you first start working on generating a new accelerator and exploring the different performance options, you may not want to go all the way to a bitfile. Thus, in the beginning you may just select the estimate reports as the output products. Gradually, you can generate the output products from later stages until you are happy enough with the design to build the full accelerator integrated into a shell.

The output products are controlled by:

* `generate_outputs`: list of output products (of type [`finn.builder.build_dataflow_config.DataflowOutputType`](https://finn-dev.readthedocs.io/en/latest/source_code/finn.builder.html#finn.builder.build_dataflow_config.DataflowOutputType)) that will be generated by the build. Some available options are:
    - `ESTIMATE_REPORTS` : report expected resources and performance per layer and for the whole network without any synthesis
    - `STITCHED_IP` : create a stream-in stream-out IP design that can be integrated into other Vivado IPI or RTL designs
    - `RTLSIM_PERFORMANCE` : use PyVerilator to do a performance/latency test of the `STITCHED_IP` design
    - `OOC_SYNTH` : run out-of-context synthesis (just the accelerator itself, without any system surrounding it) on the `STITCHED_IP` design to get post-synthesis FPGA resources and achievable clock frequency
    - `BITFILE` : integrate the accelerator into a shell to produce a standalone bitfile
    - `PYNQ_DRIVER` : generate a PYNQ Python driver that can be used to launch the accelerator
    - `DEPLOYMENT_PACKAGE` : create a folder with the `BITFILE` and `PYNQ_DRIVER` outputs, ready to be copied to the target FPGA platform.
* `output_dir`: the directory where all the generated build outputs above will be written into.
* `steps`: list of predefined (or custom) build steps FINN will go through. Use `build_dataflow_config.estimate_only_dataflow_steps` to execute only the steps needed for estimation (without any synthesis), and the `build_dataflow_config.default_build_dataflow_steps` otherwise (which is the default value). You can find the list of default steps [here](https://finn.readthedocs.io/en/latest/source_code/finn.builder.html#finn.builder.build_dataflow_config.default_build_dataflow_steps) in the documentation.

### Configuring the Board and FPGA Part <a id="config_fpga"></a>

* `fpga_part`: Xilinx FPGA part to be used for synthesis, can be left unspecified to be inferred from `board` below, or specified explicitly for e.g. out-of-context synthesis.
* `board`: target Xilinx Zynq or Alveo board for generating accelerators integrated into a shell. See the `pynq_part_map` and `alveo_part_map` dicts in [this file](https://github.com/Xilinx/finn-base/blob/dev/src/finn/util/basic.py#L41) for a list of possible boards.
* `shell_flow_type`: the target [shell flow type](https://finn-dev.readthedocs.io/en/latest/source_code/finn.builder.html#finn.builder.build_dataflow_config.ShellFlowType), only needed for generating full bitfiles where the FINN design is integrated into a shell (so only needed if `BITFILE` is selected) 

### Configuring the Performance <a id="config_perf"></a>

You can configure the performance (and correspondingly, the FPGA resource footprint) of the generated dataflow accelerator in two ways:

1) (basic) Set a target performance and let the compiler figure out the per-node parallelization settings.

2) (advanced) Specify a separate .json as `folding_config_file` that lists the degree of parallelization (as well as other hardware options) for each layer.

This notebook selects between the two with `folding_flag`: `'auto'` uses the basic approach and `'custom'` uses the advanced one. For the basic approach you need to set up:

* `target_fps`: target inference performance in frames per second. Note that the target may not be achievable due to specific layer constraints, or due to resource limitations of the FPGA. 
* `synth_clk_period_ns`: target clock period (in nanoseconds) for Vivado synthesis, e.g. `synth_clk_period_ns=5.0` will target a 200 MHz clock. Note that the target clock period may not be achievable depending on the FPGA part and design complexity.
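
The relation between these two settings is simple arithmetic, sketched below. The clock period and frequency are reciprocals, and in a pipelined dataflow design a frame rate translates into a cycle budget that the slowest layer must stay within. The function names are illustrative and not part of the FINN API.

```python
def clk_period_ns_to_mhz(period_ns: float) -> float:
    """Convert a clock period in nanoseconds to a frequency in MHz."""
    return 1000.0 / period_ns


def cycles_budget_per_frame(target_fps: float, period_ns: float) -> float:
    """Clock cycles available per frame when one frame must finish every 1/target_fps seconds."""
    frame_time_ns = 1e9 / target_fps
    return frame_time_ns / period_ns


print(clk_period_ns_to_mhz(5.0))            # 200.0
print(cycles_budget_per_frame(1000, 10.0))  # 100000.0
```

With `target_fps = 1000` and a 10 ns clock, no layer should need more than 100000 cycles per frame for the target to be met.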

# Launch a Build: Only Estimate Reports <a id="build_estimate_report"></a>

First, we'll launch a build that only generates the estimate reports, which does not require any synthesis. Note two things below: how the `generate_outputs` only contains `ESTIMATE_REPORTS`, but also how the `steps` uses a value of `estimate_only_dataflow_steps`. This skips steps like HLS synthesis to provide a quick estimate from analytical models.

In [4]:
import finn.builder.build_dataflow as build
import finn.builder.build_dataflow_config as build_cfg
import os
import shutil

In [5]:
model_dir = "/home/ba/Xilinx/finn-0v10/finn/notebooks/batuhan"
print(model_dir)

/home/ba/Xilinx/finn-0v10/finn/notebooks/batuhan


In [6]:
model_file = model_dir+"/QONNX_CNV.onnx"

In [7]:
from finn.util.visualization import showInNetron
showInNetron(model_file)

Serving '/home/ba/Xilinx/finn-0v10/finn/notebooks/batuhan/QONNX_CNV.onnx' at http://0.0.0.0:8081


In [8]:
estimates_output_dir = "output_estimates_only"

# Delete previous run results if they exist
if os.path.exists(estimates_output_dir):
    shutil.rmtree(estimates_output_dir)
    print("Previous run results deleted!")

Previous run results deleted!


In [9]:
if folding_flag == 'auto':
    print("Folding Mode is Auto")

    cfg_estimates = build.DataflowBuildConfig(
        output_dir          = estimates_output_dir,
        mvau_wwidth_max     = 80,
        target_fps          = 1000,
        synth_clk_period_ns = 10.0,
        fpga_part           = "xc7z020clg400-2",
        steps               = build_cfg.estimate_only_dataflow_steps,
        generate_outputs=[
            build_cfg.DataflowOutputType.ESTIMATE_REPORTS,
        ]
    )

elif folding_flag == 'custom':
    print("Folding Mode is Custom")

    build_dir = os.environ['FINN_ROOT'] + "/notebooks/batuhan"    
    import json    
    with open(build_dir+"/custom_folding_config.json", 'r') as json_file:
        folding_config = json.load(json_file)    
    print(json.dumps(folding_config, indent=1))

    from finn.util.pytorch import ToTensor
    from qonnx.transformation.merge_onnx_models import MergeONNXModels
    from qonnx.core.modelwrapper import ModelWrapper
    from qonnx.core.datatype import DataType
    
    def custom_step_add_pre_proc(model: ModelWrapper, cfg: build.DataflowBuildConfig):
        ishape = model.get_tensor_shape(model.graph.input[0].name)
        # preprocessing: torchvision's ToTensor divides uint8 inputs by 255
        preproc = ToTensor()
        export_qonnx(preproc, torch.randn(ishape), "preproc.onnx", opset_version=11)
        preproc_model = ModelWrapper("preproc.onnx")
        # set input finn datatype to UINT8
        preproc_model.set_tensor_datatype(preproc_model.graph.input[0].name, DataType["UINT8"])
        # merge pre-processing onnx model with cnv model (passed as input argument)
        model = model.transform(MergeONNXModels(preproc_model))
        return model

    from qonnx.transformation.insert_topk import InsertTopK
    
    def custom_step_add_post_proc(model: ModelWrapper, cfg: build.DataflowBuildConfig):
        model = model.transform(InsertTopK(k=1))
        return model

    import torch
    from finn.util.test import get_test_model_trained
    from brevitas.export import export_qonnx
    from qonnx.util.cleanup import cleanup as qonnx_cleanup

    # Build flow with custom folding configuration
    # folding_config_file = "folding_config_all_lutram.json"

    build_steps = [
        custom_step_add_pre_proc,
        custom_step_add_post_proc,
        "step_qonnx_to_finn",
        "step_tidy_up",
        "step_streamline",
        "step_convert_to_hw",
        "step_create_dataflow_partition",
        "step_specialize_layers",
        "step_apply_folding_config",
        "step_minimize_bit_width",
        "step_generate_estimate_reports",
    ]
    
    cfg_estimates = build.DataflowBuildConfig(
        output_dir          = estimates_output_dir,
        mvau_wwidth_max     = 80,
        synth_clk_period_ns = 10.0,
        fpga_part           = "xc7z020clg400-2",
        steps               = build_steps,
        folding_config_file = "custom_folding_config.json",
        generate_outputs=[
            build_cfg.DataflowOutputType.ESTIMATE_REPORTS,
        ]
    )

# Launch the build outside the if/elif so that it runs in both folding modes
build.build_dataflow_cfg(model_file, cfg_estimates)

Folding Mode is Custom
{
 "Defaults": {},
 "ConvolutionInputGenerator_rtl_0": {
  "SIMD": 1,
  "parallel_window": 0,
  "ram_style": "distributed"
 },
 "MVAU_hls_0": {
  "PE": 1,
  "SIMD": 27,
  "ram_style": "auto",
  "resType": "auto",
  "mem_mode": "internal_decoupled",
  "runtime_writeable_weights": 0
 },
 "ConvolutionInputGenerator_rtl_1": {
  "SIMD": 8,
  "parallel_window": 0,
  "ram_style": "distributed"
 },
 "MVAU_hls_1": {
  "PE": 8,
  "SIMD": 72,
  "ram_style": "auto",
  "resType": "auto",
  "mem_mode": "internal_decoupled",
  "runtime_writeable_weights": 0
 },
 "StreamingMaxPool_hls_0": {
  "PE": 1
 },
 "ConvolutionInputGenerator_rtl_2": {
  "SIMD": 2,
  "parallel_window": 0,
  "ram_style": "distributed"
 },
 "MVAU_hls_2": {
  "PE": 2,
  "SIMD": 72,
  "ram_style": "auto",
  "resType": "auto",
  "mem_mode": "internal_decoupled",
  "runtime_writeable_weights": 0
 },
 "ConvolutionInputGenerator_rtl_3": {
  "SIMD": 2,
  "parallel_window": 0,
  "ram_style": "distributed"
 },
 "MVAU

In [10]:
assert os.path.exists(estimates_output_dir + "/report/estimate_network_performance.json")

We'll now examine the generated outputs from this build. If we look under the outputs directory, we'll find a subfolder with the generated estimate reports.

In [11]:
! ls {estimates_output_dir}

build_dataflow.log   report				     time_per_step.json
intermediate_models  template_specialize_layers_config.json


In [12]:
! ls {estimates_output_dir}/report

estimate_layer_config_alternatives.json  estimate_network_performance.json
estimate_layer_cycles.json		 op_and_param_counts.json
estimate_layer_resources.json


We see that various reports have been generated as .json files. Before examining them, note that the output directory also records how the model was processed by the FINN compiler: the intermediate models, one per build step, are stored in the folder `intermediate_models`. After listing those, we'll look at the contents of `estimate_network_performance.json`, which holds the analytical estimates for performance and latency.

Source: 4_advanced_builder_settings.ipynb

In [13]:
!ls -t -r {estimates_output_dir}/intermediate_models

custom_step_add_pre_proc.onnx	dataflow_parent.onnx
custom_step_add_post_proc.onnx	step_create_dataflow_partition.onnx
step_qonnx_to_finn.onnx		step_specialize_layers.onnx
step_tidy_up.onnx		step_apply_folding_config.onnx
step_streamline.onnx		step_minimize_bit_width.onnx
step_convert_to_hw.onnx		step_generate_estimate_reports.onnx
supported_op_partitions


In [14]:
! cat {estimates_output_dir}/report/estimate_network_performance.json

{
  "critical_path_cycles": 733224,
  "max_cycles": 147456,
  "max_cycles_node_name": "MVAU_hls_5",
  "estimated_throughput_fps": 678.1684027777778,
  "estimated_latency_ns": 7332240.0
}

Since all of these reports are .json files, we can easily load them into Python for further processing. This can be useful if you are building your own design automation tools on top of FINN. Let's define a helper function and look at the `estimate_layer_cycles.json` report.

In [15]:
import json
def read_json_dict(filename):
    with open(filename, "r") as f:
        ret = json.load(f)
    return ret

In [16]:
read_json_dict(estimates_output_dir + "/report/estimate_layer_cycles.json")

{'Thresholding_rtl_0': 3072,
 'ConvolutionInputGenerator_rtl_0': 24501,
 'MVAU_hls_0': 57600,
 'ConvolutionInputGenerator_rtl_1': 56952,
 'MVAU_hls_1': 50176,
 'StreamingMaxPool_hls_0': 980,
 'ConvolutionInputGenerator_rtl_2': 42464,
 'MVAU_hls_2': 73728,
 'ConvolutionInputGenerator_rtl_3': 59328,
 'MVAU_hls_3': 51200,
 'StreamingMaxPool_hls_1': 125,
 'ConvolutionInputGenerator_rtl_4': 12032,
 'MVAU_hls_4': 82944,
 'MVAU_hls_5': 147456,
 'MVAU_hls_6': 65536,
 'MVAU_hls_7': 5120,
 'LabelSelect_hls_0': 10}

Here, we can see the estimated number of clock cycles each layer will take. Recall that all of these layers will be running in parallel, and the slowest layer (here `MVAU_hls_5` with 147456 cycles) will determine the overall throughput of the entire neural network. When `target_fps` is set, FINN attempts to parallelize each layer such that they all take a similar number of cycles, and fewer than the number of cycles that would be required to meet `target_fps`; with a custom folding configuration, the cycle counts follow from the PE/SIMD values in the folding file. Additionally, by summing up all layer cycle estimates one can obtain an estimate for the overall latency of the whole network (the `critical_path_cycles` value reported above).
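
As a cross-check, the numbers in `estimate_network_performance.json` can be reproduced from the per-layer cycle counts. The sketch below assumes the 10 ns clock period configured above (`synth_clk_period_ns = 10.0`); the formulas are inferred from the reported values rather than taken from the FINN source.

```python
# Per-layer cycle estimates copied from estimate_layer_cycles.json above
layer_cycles = {
    "Thresholding_rtl_0": 3072,
    "ConvolutionInputGenerator_rtl_0": 24501,
    "MVAU_hls_0": 57600,
    "ConvolutionInputGenerator_rtl_1": 56952,
    "MVAU_hls_1": 50176,
    "StreamingMaxPool_hls_0": 980,
    "ConvolutionInputGenerator_rtl_2": 42464,
    "MVAU_hls_2": 73728,
    "ConvolutionInputGenerator_rtl_3": 59328,
    "MVAU_hls_3": 51200,
    "StreamingMaxPool_hls_1": 125,
    "ConvolutionInputGenerator_rtl_4": 12032,
    "MVAU_hls_4": 82944,
    "MVAU_hls_5": 147456,
    "MVAU_hls_6": 65536,
    "MVAU_hls_7": 5120,
    "LabelSelect_hls_0": 10,
}
clk_period_ns = 10.0  # assumed: matches synth_clk_period_ns in the build config

# The slowest layer bounds the steady-state throughput of the pipeline
bottleneck = max(layer_cycles, key=layer_cycles.get)
max_cycles = layer_cycles[bottleneck]
throughput_fps = 1e9 / (max_cycles * clk_period_ns)

# Summing all layers gives the latency estimate for a single frame
critical_path_cycles = sum(layer_cycles.values())
latency_ns = critical_path_cycles * clk_period_ns

print(bottleneck, max_cycles)    # MVAU_hls_5 147456
print(critical_path_cycles)      # 733224
print(round(throughput_fps, 2))  # 678.17
print(latency_ns)                # 7332240.0
```

Raising the parallelism (PE/SIMD) of the bottleneck layer is therefore the first lever to pull when the estimated throughput is too low.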

Finally, we can see the layer-by-layer resource estimates in the `estimate_layer_resources.json` report:

In [17]:
read_json_dict(estimates_output_dir + "/report/estimate_layer_resources.json")

{'Thresholding_rtl_0': {'BRAM_18K': 0,
  'BRAM_efficiency': 1,
  'LUT': 128.0,
  'URAM': 0,
  'URAM_efficiency': 1,
  'DSP': 0},
 'ConvolutionInputGenerator_rtl_0': {'BRAM_18K': 0,
  'BRAM_efficiency': 1,
  'LUT': 348,
  'URAM': 0,
  'URAM_efficiency': 1,
  'DSP': 0},
 'MVAU_hls_0': {'BRAM_18K': 2,
  'BRAM_efficiency': 0.09375,
  'LUT': 1788,
  'URAM': 0,
  'URAM_efficiency': 1,
  'DSP': 0},
 'ConvolutionInputGenerator_rtl_1': {'BRAM_18K': 0,
  'BRAM_efficiency': 1,
  'LUT': 412,
  'URAM': 0,
  'URAM_efficiency': 1,
  'DSP': 0},
 'MVAU_hls_1': {'BRAM_18K': 16,
  'BRAM_efficiency': 0.125,
  'LUT': 4172,
  'URAM': 0,
  'URAM_efficiency': 1,
  'DSP': 0},
 'StreamingMaxPool_hls_0': {'BRAM_18K': 0,
  'BRAM_efficiency': 1,
  'LUT': 0,
  'URAM': 0,
  'URAM_efficiency': 1,
  'DSP': 0},
 'ConvolutionInputGenerator_rtl_2': {'BRAM_18K': 0,
  'BRAM_efficiency': 1,
  'LUT': 354,
  'URAM': 0,
  'URAM_efficiency': 1,
  'DSP': 0},
 'MVAU_hls_2': {'BRAM_18K': 4,
  'BRAM_efficiency': 1.0,
  'LUT': 1268,

This particular report is useful to determine whether the current configuration will fit into a particular FPGA. 

**Note that the analytical models tend to over-estimate how many resources are needed, since they can't capture the effects of various synthesis optimizations.**

Pynq-Z2 (XC7Z020) hardware resources: 140 BRAM_36K blocks (equivalent to 280 BRAM_18K, 4.9 Mb in total), 53,200 LUTs, 220 DSPs.
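
A quick way to judge the fit is to sum the per-layer estimates and divide by the device budget. The sketch below is illustrative: it uses only the first five layers of the report shown above (so the totals are partial), the helper names are not FINN functions, and the budget assumes that each of the XC7Z020's 140 36 Kb block RAMs can be used as two 18 Kb blocks.

```python
from collections import Counter

# First five entries of estimate_layer_resources.json (partial, for illustration)
layer_resources = {
    "Thresholding_rtl_0": {"BRAM_18K": 0, "LUT": 128.0, "URAM": 0, "DSP": 0},
    "ConvolutionInputGenerator_rtl_0": {"BRAM_18K": 0, "LUT": 348, "URAM": 0, "DSP": 0},
    "MVAU_hls_0": {"BRAM_18K": 2, "LUT": 1788, "URAM": 0, "DSP": 0},
    "ConvolutionInputGenerator_rtl_1": {"BRAM_18K": 0, "LUT": 412, "URAM": 0, "DSP": 0},
    "MVAU_hls_1": {"BRAM_18K": 16, "LUT": 4172, "URAM": 0, "DSP": 0},
}

# Assumed device budget for the XC7Z020 on the Pynq-Z2
budget = {"LUT": 53200, "BRAM_18K": 280, "DSP": 220}


def total_resources(report, keys=("LUT", "BRAM_18K", "URAM", "DSP")):
    """Sum the selected resource types over all layers of a report."""
    totals = Counter()
    for res in report.values():
        for key in keys:
            totals[key] += res.get(key, 0)
    return dict(totals)


def utilization(totals, device_budget):
    """Fraction of each budgeted resource that the design would occupy."""
    return {key: totals.get(key, 0) / device_budget[key] for key in device_budget}


totals = total_resources(layer_resources)
util = utilization(totals, budget)
for key, frac in util.items():
    print(f"{key}: {totals[key]} used, {frac:.1%} of budget")
```

In practice you would pass the dictionary returned by `read_json_dict` for the full report instead of the hand-copied subset.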

# Folding

Source: https://github.com/Xilinx/finn/blob/e3087ad9fbabcc35f21164d415ababec4f462e9f/notebooks/advanced/3_folding.ipynb#L44

My notebook: http://127.0.0.1:8888/notebooks/batuhan/3_folding_BA.ipynb

In [18]:
my_folding_config_file = "custom_folding_config.json"

# Launch a Build: Stitched IP, out-of-context synth and rtlsim Performance <a id="build_ip_synth_rtlsim"></a>



Once we have a configuration that gives satisfactory estimates, we can move on to generating the accelerator. We can do this in different ways depending on how we want to integrate the accelerator into a larger system. For instance, if we have a larger streaming system built in Vivado or if we'd like to re-use this generated accelerator as an IP component in other projects, the `STITCHED_IP` output product is a good choice. We can also use the `OOC_SYNTH` output product to get post-synthesis resource and clock frequency numbers for our accelerator.

<font color="red">**Note:** These next builds take a while to complete (about 10 minutes in the original tutorial, and likely longer here because bitfile synthesis is also requested), since multiple calls to Vivado and a call to RTL simulation are involved. While the build is running, you can examine the generated files:

* Once `step_hw_codegen [11/19]` below is completed, you can view the generated code under its own folder for each layer, e.g. `/tmp/finn_dev_ba/code_gen_ipgen_MVAU_hls_XXXXXX`
    
* Once `step_create_stitched_ip [14/19]` below is completed, you can view the generated stitched IP in Vivado under `output_ipstitch_ooc_rtlsim/stitched_ip`
</font> 

In [19]:
import finn.builder.build_dataflow as build
import finn.builder.build_dataflow_config as build_cfg
import os
import shutil

model_file = model_dir+"/QONNX_CNV.onnx"

rtlsim_output_dir = "output_ipstitch_ooc_rtlsim"

# Delete previous run results if they exist
if os.path.exists(rtlsim_output_dir):
    shutil.rmtree(rtlsim_output_dir)
    print("Previous run results deleted!")

#custom_step_add_pre_proc,
#custom_step_add_post_proc,

build_steps = [
    "step_qonnx_to_finn",
    "step_tidy_up",
    "step_streamline",
    "step_convert_to_hw",
    "step_create_dataflow_partition",
    "step_specialize_layers",
    "step_target_fps_parallelization", # not necessary due to custom folding configuration
    "step_apply_folding_config",
    "step_minimize_bit_width",
    "step_generate_estimate_reports",
    "step_hw_codegen",
    "step_hw_ipgen",
    "step_set_fifo_depths",
    "step_create_stitched_ip",
    "step_measure_rtlsim_performance",
    "step_out_of_context_synthesis",
    "step_synthesize_bitfile",
    "step_make_pynq_driver",
    "step_deployment_package",
]

cfg_stitched_ip = build.DataflowBuildConfig(
    output_dir          = rtlsim_output_dir,
    mvau_wwidth_max     = 80,
    #target_fps          = 1000, # not used any more due to custom folding
    synth_clk_period_ns = 10.0,
    folding_config_file = my_folding_config_file, # new added for custom folding
    #steps               = build_steps, # for custom folding
    steps               = build_cfg.default_build_dataflow_steps, # default
    shell_flow_type     = build_cfg.ShellFlowType.VIVADO_ZYNQ, # new added
    board               = "Pynq-Z2",
    fpga_part           = "xc7z020clg400-2",
    default_swg_exception = True,
    generate_outputs=[
        build_cfg.DataflowOutputType.STITCHED_IP,
        build_cfg.DataflowOutputType.RTLSIM_PERFORMANCE,
        build_cfg.DataflowOutputType.OOC_SYNTH,
        build_cfg.DataflowOutputType.BITFILE, # shell_flow_type is required
        build_cfg.DataflowOutputType.PYNQ_DRIVER,
        build_cfg.DataflowOutputType.DEPLOYMENT_PACKAGE,
    ]
)

Previous run results deleted!


In [20]:
%%time
build.build_dataflow_cfg(model_file, cfg_stitched_ip)

Building dataflow accelerator from /home/ba/Xilinx/finn-0v10/finn/notebooks/batuhan/QONNX_CNV.onnx
Intermediate outputs will be generated in /tmp/finn_dev_ba
Final outputs will be generated in output_ipstitch_ooc_rtlsim
Build log is at output_ipstitch_ooc_rtlsim/build_dataflow.log
Running step: step_qonnx_to_finn [1/19]
Running step: step_tidy_up [2/19]
Running step: step_streamline [3/19]
Running step: step_convert_to_hw [4/19]
Running step: step_create_dataflow_partition [5/19]
Running step: step_specialize_layers [6/19]
Running step: step_target_fps_parallelization [7/19]
Running step: step_apply_folding_config [8/19]
Running step: step_minimize_bit_width [9/19]
Running step: step_generate_estimate_reports [10/19]
Running step: step_hw_codegen [11/19]
Running step: step_hw_ipgen [12/19]
Running step: step_set_fifo_depths [13/19]
Running step: step_create_stitched_ip [14/19]
Running step: step_measure_rtlsim_performance [15/19]
Running step: step_out_of_context_synthesis [16/19]
Runn

0

In [21]:
assert os.path.exists(rtlsim_output_dir + "/report/ooc_synth_and_timing.json")
assert os.path.exists(rtlsim_output_dir + "/report/rtlsim_performance.json")
assert os.path.exists(rtlsim_output_dir + "/final_hw_config.json")

Note that the default set of build steps always includes steps such as `step_synthesize_bitfile`, whether or not a bitfile was requested. A step whose output product is not selected does nothing. In this build, `BITFILE`, `PYNQ_DRIVER` and `DEPLOYMENT_PACKAGE` are all listed in `generate_outputs`, so the corresponding steps do real work.

Among the output products, we will find the accelerator exported as a stitched IP block design:

In [22]:
! ls {rtlsim_output_dir}/stitched_ip

all_verilog_srcs.txt		       finn_vivado_stitch_proj.xpr
data				       ip
finn_vivado_stitch_proj.cache	       make_project.sh
finn_vivado_stitch_proj.gen	       make_project.tcl
finn_vivado_stitch_proj.hw	       vivado.jou
finn_vivado_stitch_proj.ip_user_files  vivado.log
finn_vivado_stitch_proj.srcs


We also have a few reports generated by these output products, different from the ones generated by `ESTIMATE_REPORTS`.

In [23]:
! ls {rtlsim_output_dir}/report

estimate_layer_resources_hls.json  post_synth_resources.json
ooc_synth_and_timing.json	   post_synth_resources.xml
post_route_timing.rpt		   rtlsim_performance.json


In `ooc_synth_and_timing.json` we can find the post-synthesis resource usage and the maximum clock frequency estimate for the accelerator. Note that the clock frequency estimate here tends to be optimistic, since out-of-context synthesis is less constrained.

In [24]:
! cat {rtlsim_output_dir}/report/ooc_synth_and_timing.json

{
  "vivado_proj_folder": "/tmp/finn_dev_ba/synth_out_of_context_b9i76j_c/results_finn_design_wrapper",
  "LUT": 13644.0,
  "LUTRAM": 1756.0,
  "FF": 16955.0,
  "DSP": 0.0,
  "BRAM": 98.0,
  "BRAM_18K": 8.0,
  "BRAM_36K": 94.0,
  "URAM": 0.0,
  "Carry": 518.0,
  "WNS": 1.442,
  "Delay": 1.442,
  "vivado_version": 2022.1,
  "vivado_build_no": 3526262.0,
  "": 0,
  "fmax_mhz": 116.84973124561813,
  "estimated_throughput_fps": 792.4379560385345
}
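
The `fmax_mhz` and `estimated_throughput_fps` values are consistent with a simple timing calculation, sketched below. The relation is inferred from the numbers in this report (a 10 ns target period, the reported `WNS`, and the 147456-cycle bottleneck from the estimate reports), not taken from the FINN documentation.

```python
def fmax_mhz(clk_period_ns: float, wns_ns: float) -> float:
    """Maximum clock frequency implied by the worst negative slack.

    A positive WNS means the critical path is shorter than the target
    period, so the design could be clocked faster than requested.
    """
    return 1000.0 / (clk_period_ns - wns_ns)


def throughput_fps(freq_mhz: float, max_cycles: int) -> float:
    """Frames per second when the slowest layer needs max_cycles per frame."""
    return freq_mhz * 1e6 / max_cycles


fmax = fmax_mhz(clk_period_ns=10.0, wns_ns=1.442)
fps = throughput_fps(fmax, max_cycles=147456)
print(round(fmax, 2), round(fps, 2))
```

A negative `WNS` would push the result below the requested 100 MHz, which signals that the target clock period was not met.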

In `rtlsim_performance.json` we can find the steady-state throughput and latency for the accelerator, as obtained by rtlsim. If the DRAM bandwidth numbers reported here are below what the hardware platform is capable of (i.e. the accelerator is not memory-bound), you can expect the same steady-state throughput (excluding any software/driver overheads) in real hardware.

In [25]:
! cat {rtlsim_output_dir}/report/rtlsim_performance.json

{
  "N_IN_TXNS": 3072,
  "N_OUT_TXNS": 10,
  "cycles": 366924,
  "N": 1,
  "latency_cycles": 366924,
  "runtime[ms]": 3.6692400000000003,
  "throughput[images/s]": 272.53600200586493,
  "fclk[mhz]": 100.0,
  "stable_throughput[images/s]": 272.53600200586493
}
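
The runtime and throughput fields follow directly from the simulated cycle count and the clock frequency, as the sketch below shows. The helper name is illustrative. Note that with `N = 1` the reported throughput is just the inverse of the single-frame latency, which is why it is lower than the pipelined estimate from the earlier reports.

```python
def rtlsim_summary(cycles: int, fclk_mhz: float, n_frames: int = 1):
    """Derive runtime (ms) and throughput (frames/s) from an rtlsim cycle count."""
    cycles_per_ms = fclk_mhz * 1e3
    runtime_ms = cycles / cycles_per_ms
    throughput = n_frames / (runtime_ms / 1e3)
    return runtime_ms, throughput


runtime_ms, throughput = rtlsim_summary(cycles=366924, fclk_mhz=100.0, n_frames=1)
print(round(runtime_ms, 5), round(throughput, 2))
```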

Finally, let's have a look at `final_hw_config.json`. This is the node-by-node hardware configuration determined by the FINN compiler, including FIFO depths, parallelization settings (PE/SIMD) and others. If you want to optimize your build further (the "advanced" method we mentioned under "Configuring the performance"), you can use this .json file as the `folding_config_file` for a new run to use it as a starting point for further exploration and optimizations.
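
One possible workflow for that exploration is sketched below: load the generated configuration, override selected attributes of individual nodes, and save the result under a new name to pass as `folding_config_file`. The helper `derive_folding_config` and the output file name are hypothetical, the sample dictionary is a small excerpt of the file, and the new `PE` value is purely illustrative, since legal PE/SIMD values depend on the layer's dimensions.

```python
import json
import os
import tempfile


def derive_folding_config(src_path, dst_path, overrides):
    """Copy a hardware config JSON, applying per-node attribute overrides."""
    with open(src_path, "r") as f:
        cfg = json.load(f)
    for node, attrs in overrides.items():
        if node not in cfg:
            raise KeyError(f"node {node!r} not found in {src_path}")
        cfg[node].update(attrs)
    with open(dst_path, "w") as f:
        json.dump(cfg, f, indent=2)
    return cfg


# Self-contained demo using an excerpt of final_hw_config.json
sample = {
    "Defaults": {},
    "MVAU_hls_0": {"PE": 1, "SIMD": 27, "ram_style": "auto", "resType": "auto"},
}
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "final_hw_config.json")
dst = os.path.join(workdir, "explore_folding_config.json")  # hypothetical name
with open(src, "w") as f:
    json.dump(sample, f)

new_cfg = derive_folding_config(src, dst, {"MVAU_hls_0": {"PE": 2}})
print(new_cfg["MVAU_hls_0"])
```

Keeping each variant in its own file makes it straightforward to compare the estimate reports of different folding choices.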

In [26]:
! cat {rtlsim_output_dir}/final_hw_config.json

{
  "Defaults": {},
  "StreamingFIFO_rtl_0": {
    "ram_style": "auto",
    "depth": 3072,
    "impl_style": "rtl",
    "inFIFODepths": [
      0
    ],
    "outFIFODepths": [
      0
    ]
  },
  "ConvolutionInputGenerator_rtl_0": {
    "SIMD": 1,
    "parallel_window": 0,
    "ram_style": "distributed",
    "inFIFODepths": [
      3072
    ],
    "outFIFODepths": [
      3644
    ]
  },
  "StreamingFIFO_rtl_1": {
    "ram_style": "auto",
    "depth": 3644,
    "impl_style": "vivado",
    "inFIFODepths": [
      0
    ],
    "outFIFODepths": [
      0
    ]
  },
  "StreamingDataWidthConverter_rtl_0": {
    "inFIFODepths": [
      3644
    ],
    "outFIFODepths": [
      900
    ]
  },
  "StreamingFIFO_rtl_2": {
    "ram_style": "auto",
    "depth": 900,
    "impl_style": "vivado",
    "inFIFODepths": [
      0
    ],
    "outFIFODepths": [
      0
    ]
  },
  "MVAU_hls_0": {
    "PE": 1,
    "SIMD": 27,
    "ram_style": "auto",
    "resType": "auto",
    "mem_mode": "internal_decoupl

# PYNQ Implementation

In [27]:
#! cp unsw_nb15_binarized.npz {final_output_dir}/deploy/driver

In [28]:
#! cp validate-unsw-nb15.py {final_output_dir}/deploy/driver

In [29]:
#! ls {final_output_dir}/deploy/driver

In [30]:
#from shutil import make_archive
#make_archive('deploy-on-pynq', 'zip', final_output_dir+"/deploy")

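The cells above are left commented out; they and the commands below are carried over from the FINN cybersecurity (UNSW-NB15) example, so the file names need adapting for this CNV build. After creating the archive with `make_archive`, you can download the zipfile (**File -> Open**, mark the checkbox next to the `deploy-on-pynq.zip` and select Download from the toolbar), then copy it to your PYNQ board (for instance via `scp` or `rsync`). Then, run the following commands **on the PYNQ board** to extract the archive and run the validation: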

```shell
unzip deploy-on-pynq.zip -d finn-cybsec-mlp-demo
cd finn-cybsec-mlp-demo/driver
sudo python3.6 -m pip install bitstring
sudo python3.6 validate-unsw-nb15.py --batchsize 1000
```

You should see `Final accuracy: 91.868293` at the end. You may have noticed that the validation doesn't *quite* run at 1M inferences per second. This is because of the Python packing/unpacking and data movement overheads. To see this in more detail, the generated driver includes a benchmarking mode that shows the runtime breakdown:

```shell
sudo python3.6 driver.py --exec_mode throughput_test --bitfile ../bitfile/finn-accel.bit --batchsize 1000
cat nw_metrics.txt
```

```
{'runtime[ms]': 1.0602474212646484,
 'throughput[images/s]': 943176.0737575893,
 'DRAM_in_bandwidth[Mb/s]': 70.7382055318192,
 'DRAM_out_bandwidth[Mb/s]': 0.9431760737575894,
 'fclk[mhz]': 100.0,
 'batch_size': 1000,
 'fold_input[ms]': 9.679794311523438e-05,
 'pack_input[ms]': 0.060115814208984375,
 'copy_input_data_to_device[ms]': 0.002428770065307617,
 'copy_output_data_from_device[ms]': 0.0005249977111816406,
 'unpack_output[ms]': 0.3773000240325928,
 'unfold_output[ms]': 6.818771362304688e-05}
```

Here, the various `pack_input/unpack_output` calls show the overhead of packing/unpacking the inputs/outputs to convert from numpy arrays to the bit-contiguous data representation our accelerator expects. The `copy_input_data_to_device` and `copy_output_data_from_device` indicate the cost of moving the data between the CPU and accelerator memories. These overheads can dominate the execution time when running with small batch sizes.

Finally, we can see that `throughput[images/s]`, which is the pure hardware throughput without any software and data movement overheads, is close to 1M inferences per second.
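
The breakdown can be quantified with a few lines of Python, as sketched below using the metrics printed above. Treating the software stages as running one after another, in addition to the hardware runtime, is a simplifying assumption of this sketch, so the end-to-end figure is only a rough indication.

```python
# Values copied from the nw_metrics.txt output shown above
metrics = {
    "runtime[ms]": 1.0602474212646484,
    "batch_size": 1000,
    "fold_input[ms]": 9.679794311523438e-05,
    "pack_input[ms]": 0.060115814208984375,
    "copy_input_data_to_device[ms]": 0.002428770065307617,
    "copy_output_data_from_device[ms]": 0.0005249977111816406,
    "unpack_output[ms]": 0.3773000240325928,
    "unfold_output[ms]": 6.818771362304688e-05,
}

overhead_keys = [
    "fold_input[ms]",
    "pack_input[ms]",
    "copy_input_data_to_device[ms]",
    "copy_output_data_from_device[ms]",
    "unpack_output[ms]",
    "unfold_output[ms]",
]

# Pure hardware throughput: batch size divided by accelerator runtime
hw_throughput = metrics["batch_size"] / (metrics["runtime[ms]"] / 1e3)

# Software overhead, assuming the stages run sequentially (simplification)
overhead_ms = sum(metrics[k] for k in overhead_keys)
largest_overhead = max(overhead_keys, key=lambda k: metrics[k])
end_to_end = metrics["batch_size"] / ((metrics["runtime[ms]"] + overhead_ms) / 1e3)

print(f"hardware throughput: {hw_throughput:.0f} images/s")
print(f"software overhead:   {overhead_ms:.3f} ms, dominated by {largest_overhead}")
print(f"rough end-to-end:    {end_to_end:.0f} images/s")
```

Because the overheads are largely fixed per call, increasing the batch size is the usual way to bring the end-to-end figure closer to the hardware throughput.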