# Hardware implementation
This document walks through a minimal reference implementation using a single-partition network. It assumes that the hardware has already been generated by following the end-to-end tutorial. The single-layer model was chosen because it compiles in an acceptable time.

This tutorial targets Vivado 2019.1, the bundled SDK (from Vivado 2019.2 onwards, Vitis replaces the Vivado SDK), and the Zedboard. It should nevertheless apply to any Zynq board, provided that the optimizer platform file has been selected or edited so that the chip resources are not exceeded.

## Part 1: Project setup
Launch Vivado 2019.1 using the `vivado` command or the appropriate shortcut.
Create a new project using the quick start pane.

![project menu](hardware-tutorial-assets/figures/project_menu.png)

Name the project "single_layer_tutorial" and select an appropriate project location.

![project name](hardware-tutorial-assets/figures/project_name.png)

Choose "RTL Project".

![rtl project](hardware-tutorial-assets/figures/RTL_project.png)

Skip the Add Sources and Add Constraints windows by pressing the `Next` button. Our sources will be added later, and the constraints are determined by setting the fabric clock speed.


Select your development board by opening the `Boards` tab and searching for Zedboard. Make sure that the correct board revision is selected. Here we are going with the ZC702 REV 1.0.

![board](hardware-tutorial-assets/figures/board.png)

After highlighting the board, complete the project creation by pressing the `Finish` button.


## Part 2: IP import

The previous end-to-end tutorial generates the hardware packaged as a Vivado IP block, which lets us abstract away the inner workings of the block and use it within the IP integrator. This block must be imported before continuing.

Locate the `partition_0` folder at `/tutorial/1_simple_end_to_end/`.

To import a Vivado IP block, left-click the `IP catalog` button within the flow navigator. An IP catalog tab should appear.

![main window](hardware-tutorial-assets/figures/main_window.png)

To add the repository, right-click within the IP Catalog window, left-click the `Add Repository` button in the pop-up menu, and select the `partition_0` directory generated by the end-to-end tutorial.

![add repository](hardware-tutorial-assets/figures/add_repository.png)

The IP module should now be listed as successfully imported.

![repo added](hardware-tutorial-assets/figures/repo_added.png)



## Part 3: Block Design

Open the IP integrator by left-clicking the `Create Block Design` button within the Flow Navigator.

![create block design](hardware-tutorial-assets/figures/create_block_design.png)

We accept the default settings and an empty diagram tab is shown.

![diagram area](hardware-tutorial-assets/figures/diagram_area.png)

We left-click the add IP 'plus' icon in the top bar of the Diagram tab, then search for and add the following IP blocks:

* 1x Fpgaconvnet_ip
* 1x ZYNQ7 Processing System
* 2x AXI Interconnect

![blocks](hardware-tutorial-assets/figures/blocks.png)


We run block automation by clicking the `Run Block Automation` button in the top diagram banner. This matches the ZYNQ7 ports to the physical memory interface of the development board.

![block automation dialog](hardware-tutorial-assets/figures/block_automation_dialog.png)

Next we set up the ZYNQ system.

![zynq block](hardware-tutorial-assets/figures/zynq_block.png)

The fpgaconvnet IP streams its weights, feature maps and outputs to and from off-chip memory. We therefore enable an HP AXI memory port on the ZYNQ Processing System (PS), so that the IP can share the DDR controller within the PS.
This example has a relatively small number of weights, which fit within the BRAM and are initialized with the bitstream. It would therefore be acceptable to leave the weights reloading (wr) port unconnected. For more complex networks, weights may have to be reloaded at runtime, both within a given partition and between partitions.

The IP also features an interrupt line, which may be used to trigger action by the PS once the processing for a given partition is complete.

While neither the wr port nor the interrupt line is used by this design, their setup is shown for the sake of completeness.

To set up the ZYNQ HP memory port, we double-click the ZYNQ block visible in the previous figure. We select the `PS-PL Configuration` pane in the `Page Navigator` on the left, select the `HP Slave AXI Interface` menu and select the `S AXI HP0` Interface.

![AXI HP](hardware-tutorial-assets/figures/AXI_HP.png)


To set up the ZYNQ system interrupts, we once again double-click the ZYNQ block, select the `Interrupts` tab, enable the `Fabric Interrupts` option, open the `PL-PS Interrupt Ports` menu and enable the `IRQ_F2P` port.

![ZYNQ interrupts](hardware-tutorial-assets/figures/ZYNQ_interrupts.png)

We also adjust the PL clock frequency by double-clicking the ZYNQ block, opening the `Clock Configuration` tab, expanding the `PL Fabric Clocks` drop-down, and setting the `FCLK_CLK0` value to approximately 100 MHz. Although the HLS compiler provides clock speed estimates before implementation, they can be unreliable. It is therefore recommended to run the design through the full implementation flow to estimate the achievable PL frequency, and then to re-run the implementation flow with that frequency set. A 100 MHz clock rate should be achievable.

![clock frequency](hardware-tutorial-assets/figures/clock_frequency.png)
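
The "implement, read the timing report, re-run" loop described above relies on a simple relation between the requested clock period and the worst negative slack (WNS) reported in Vivado's timing summary. The following is a rough sketch of that arithmetic. The function name is hypothetical, and the result is only a first-order estimate, since placement and routing change whenever the clock constraint changes.

```python
def achievable_mhz(target_mhz, wns_ns):
    """Estimate the highest usable clock frequency from one timing report.

    target_mhz: the FCLK frequency the design was implemented with.
    wns_ns: worst negative slack in nanoseconds (negative means timing failed).
    """
    period_ns = 1000.0 / target_mhz
    # The critical path needs the requested period minus the remaining slack.
    min_period_ns = period_ns - wns_ns
    if min_period_ns <= 0:
        raise ValueError("slack is larger than the clock period")
    return 1000.0 / min_period_ns


print(achievable_mhz(100.0, 2.0))   # timing met with 2 ns to spare
print(achievable_mhz(100.0, -2.5))  # timing missed by 2.5 ns
```

With a 100 MHz target, a slack of +2 ns suggests roughly 125 MHz may be reachable, while a slack of -2.5 ns suggests backing off to roughly 80 MHz.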

Next we set up the `axi_interconnect_0` by double-clicking the IP, setting the number of `slave interfaces` to 3, and the number of `master interfaces` to 1.

![interconnect](hardware-tutorial-assets/figures/interconnect.png)

Next we set up the `axi_interconnect_1` by double-clicking the IP, setting the number of `slave interfaces` to 1, and the number of `master interfaces` to 1. Note that an interconnect is necessary even when routing directly between a single AXI master port and a single AXI slave port.

![interconnect 1](hardware-tutorial-assets/figures/interconnect_1.png)

We connect the three AXI master ports of the fpgaconvnet IP block to `axi_interconnect_0`, and from there to the `S_AXI_HP0` port of the ZYNQ7 system.

We connect the `M_AXI_GP0` port on the ZYNQ7 system to the `S00_AXI` port of `axi_interconnect_1`, and the corresponding `M00_AXI` port to the `s_axi_ctrl` port on the fpgaconvnet IP block.

The interrupt line of the fpgaconvnet block can be connected to the `IRQ_F2P` port on the ZYNQ7 system.

![axi routed](hardware-tutorial-assets/figures/axi_routed.png)


Now, the control lines can be routed automatically using Connection automation. We do this by left-clicking the `Run Connection Automation` button in the tab banner.

![connection automation](hardware-tutorial-assets/figures/connection_automation.png)

We select all available automations and press `OK`.

![all automation](hardware-tutorial-assets/figures/all_automation.png)

We clean the design up by pressing the `regenerate layout` and `optimize routing` buttons on the top toolbar.

![regenerate layout](hardware-tutorial-assets/figures/regenerate_layout.png)
![optimize routing](hardware-tutorial-assets/figures/optimize_routing.png)

The final block design can be seen below.
![final block](hardware-tutorial-assets/figures/final_block.png)

Next we switch to the `Address Editor` tab, right-click in the empty area, and run `Auto Assign Addresses`.

![auto address](hardware-tutorial-assets/figures/auto_address.png)

We now go back to the `Diagram` tab and `validate the design` to ensure that all critical ports are connected.

![validate design](hardware-tutorial-assets/figures/validate_design.png)

The following message should be presented.

![validation successful](hardware-tutorial-assets/figures/validation_successful.png)

The final step is to generate an HDL wrapper for our design. We select the `sources` tab in the top left of the area, right-click the `design_1` entry, and press the `create hdl wrapper` button.

![hdl wrapper](hardware-tutorial-assets/figures/hdl_wrapper.png)

We let Vivado manage the wrapper and press `OK`.

![vivado wrapper manage](hardware-tutorial-assets/figures/vivado_wrapper_manage.png)



## Part 4: Implementation

To run the whole implementation flow, simply click the `Generate Bitstream` button in the bottom left of the working area.

![generate bitstream](hardware-tutorial-assets/figures/generate_bitstream.png)

This should bring up the `Launch Runs` window. Run with the default settings.

![launch runs](hardware-tutorial-assets/figures/launch_runs.png)

The run should appear in the `Design Runs` tab near the bottom of the working area.

![design runs](hardware-tutorial-assets/figures/design_runs.png)

If the implementation fails, make sure that the **Vivado Y2K22 patch** has been installed correctly.

Once the design runs are complete, all the fields next to the design rows will be populated by checkmarks and a window will pop up. This window can be closed. To expose our project to the SDK, we export the hardware by selecting `File -> Export -> Export Hardware`.

![export hardware](hardware-tutorial-assets/figures/export_hardware.png)

Make sure to select the `include bitstream` option, so that the bitstream is also accessible from the SDK.



## Part 5: PS Software

Launch the SDK from within Vivado by selecting `File -> Launch SDK`.

![launch sdk](hardware-tutorial-assets/figures/launch_sdk.png)

We accept the default settings.

![launch sdk menu](hardware-tutorial-assets/figures/launch_sdk_menu.png)

You should now see the main screen of the SDK.

![sdk main](hardware-tutorial-assets/figures/sdk_main.png)

We now create a project by selecting `File -> New -> Application Project`.

![application project](hardware-tutorial-assets/figures/application_project.png)

We then name the project and set the language to `C++`. Project setup is completed by pressing the `Finish` button.

![new sdk project](hardware-tutorial-assets/figures/new_sdk_project.png)

Increase the stack size and heap size to `0x8000` (32 KiB) in `/PS_tutorial_sw/src/lscript.ld`.

![increase stack and heap size](hardware-tutorial-assets/figures/increase_size.png)

Next paste the contents of the included reference PS code from `/hardware-tutorial-assets/ps-code/main.cc` into the `main.cc` file. Afterwards we click `Project -> Build All` (or Ctrl + B) to check there are no issues with the build chain and code, as well as build the project.

![build full](hardware-tutorial-assets/figures/build_full.png)

We can now program the FPGA with the bitstream generated in Vivado by clicking the `Program FPGA` button in the top navigation bar and then running the program operation with default settings.

![program fpga](hardware-tutorial-assets/figures/program_fpga.png)

![program menu](hardware-tutorial-assets/figures/program_menu.png)

We need three cables connected to the Zedboard: one for power, one for programming, and one for UART communication.

![connection](hardware-tutorial-assets/figures/connection.png)

To launch the program, we press the `Run System Debugger` button in the upper navigation bar and accept the default options.

![launch](hardware-tutorial-assets/figures/launch.png)

![run as menu](hardware-tutorial-assets/figures/run_as_menu.png)

Inspect the commented code in `main.cc` for more details.


## Part 6: Host Software
The host program `main.py` loads an input from the MNIST dataset, uses the hardware design to compute the output, and compares it with a software reference.

The host software relies on external libraries, so it is recommended to use Anaconda to set up a separate environment.

```bash
conda create -n single_layer_tutorial python=3.8 -y
conda activate single_layer_tutorial
conda install pyserial numpy matplotlib -y
conda install -c conda-forge onnxruntime
```

Before the program is run, the MNIST test set (ca. 2MB) must be fetched by running the script in `/hardware-tutorial-assets/host-code/MNIST/get_mnist.sh`. If access using `wget` is not permitted, download `t10k-images-idx3-ubyte` and `t10k-labels-idx1-ubyte` and put them in `/hardware-tutorial-assets/host-code/MNIST`. 
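
The two files named above use the IDX format: a short big-endian header (magic number 2051 for images, 2049 for labels) followed by raw unsigned bytes. The sketch below shows how such files can be decoded with the standard library only. The function names are hypothetical and this is not the loader used by `main.py` or `tutorial_library.py`.

```python
import struct


def parse_idx_images(data):
    """Parse the bytes of an IDX3 image file into (images, rows, cols)."""
    # Header: magic number, image count, rows, columns (big-endian uint32).
    magic, count, rows, cols = struct.unpack(">IIII", data[:16])
    if magic != 2051:
        raise ValueError(f"not an IDX3 image file (magic {magic})")
    size = rows * cols
    if len(data) < 16 + count * size:
        raise ValueError("file is shorter than its header claims")
    # Each image is a flat, row-major list of pixel values in 0..255.
    images = [list(data[16 + i * size:16 + (i + 1) * size]) for i in range(count)]
    return images, rows, cols


def parse_idx_labels(data):
    """Parse the bytes of an IDX1 label file into a list of class labels."""
    magic, count = struct.unpack(">II", data[:8])
    if magic != 2049:
        raise ValueError(f"not an IDX1 label file (magic {magic})")
    return list(data[8:8 + count])
```

To use it, read a file in binary mode and pass its contents in, for example `parse_idx_images(open("t10k-images-idx3-ubyte", "rb").read())`, adjusting the path to wherever the files were placed.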

Please inspect `main.py` and `tutorial_library.py` for further details.

## A note on the structure of inputs and outputs

fpgaConvNet uses the order of [Batch Size][Height][Width][Channels] when flattening the multi-dimensional inputs into an area of memory.
For example, the 2×3 image:

    [11][12][13]
    [21][22][23]
                            
with 3 channels (R, G, B) would be mapped into memory in the following manner:

    [R11][G11][B11][R12][G12][B12][R13][G13][B13][R21][G21][B21][R22][G22][B22][R23][G23][B23]
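
The same ordering can be reproduced in a few lines of Python. This is only an illustrative sketch: `nhwc_offset` and `flatten_nhwc` are hypothetical helper names, and each element is represented by a label such as `R11` rather than a real pixel value.

```python
def nhwc_offset(b, h, w, c, height, width, channels):
    """Flat index of element [b][h][w][c] in [Batch][Height][Width][Channels] order."""
    return ((b * height + h) * width + w) * channels + c


def flatten_nhwc(image):
    """Flatten a nested [height][width][channels] list, channels varying fastest."""
    return [value for row in image for pixel in row for value in pixel]


# The 2x3 RGB image from the text, each element named after channel, row, column.
image = [[[f"{ch}{r}{c}" for ch in "RGB"] for c in (1, 2, 3)] for r in (1, 2)]
flat = flatten_nhwc(image)
print(flat[:6])  # ['R11', 'G11', 'B11', 'R12', 'G12', 'B12']

# Blue channel of the first pixel in the second row.
print(flat[nhwc_offset(0, 1, 0, 2, 2, 3, 3)])  # B21
```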
    
The internal number representation is a signed 16-bit fixed-point format with 8 fractional bits.
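
With 8 fractional bits in a signed 16-bit word, values range from -128 to 127.99609375 in steps of 1/256. The conversion can be sketched as follows. The helper names are hypothetical, and the choice to round to nearest and to saturate out-of-range values is an assumption made for this example; the hardware and the tutorial's host code may truncate or wrap instead.

```python
FRAC_BITS = 8
WORD_BITS = 16


def to_fixed(value):
    """Quantize a float and return the raw 16-bit pattern as an unsigned int."""
    scaled = int(round(value * (1 << FRAC_BITS)))
    # Assumption: saturate to the representable range [-32768, 32767].
    lowest = -(1 << (WORD_BITS - 1))
    highest = (1 << (WORD_BITS - 1)) - 1
    scaled = max(lowest, min(highest, scaled))
    return scaled & 0xFFFF


def from_fixed(raw):
    """Interpret a raw 16-bit pattern as a signed fixed-point value."""
    raw &= 0xFFFF
    if raw & 0x8000:  # sign bit set: undo the two's-complement wrap
        raw -= 1 << WORD_BITS
    return raw / (1 << FRAC_BITS)


print(hex(to_fixed(1.5)))   # 0x180
print(hex(to_fixed(-0.5)))  # 0xff80
print(from_fixed(0xFF80))   # -0.5
```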

The basic unit of a memory transaction for fpgaConvNet is a 64-bit word, which may contain up to four 16-bit weights or features.
Check the optimiser output file to determine the coarse in factor for the first layer input stream and the coarse out factor for the final layer.
The coarse factor indicates how many 16-bit fields within the 64-bit word are populated.

For this single-layer example, the coarse factors can be found by opening `single_layer_opt.json`. The coarse in factor of the first (input) layer is 1, meaning that only the two least significant bytes of each 64-bit word are occupied by a feature. The remaining bytes should be set to 0.

![coarse in](hardware-tutorial-assets/figures/coarse_in.png)
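
The input word layout can be sketched as below. The text only fixes the position of the first field (the least significant 16 bits); placing any further fields at successively higher 16-bit positions is an assumption of this example, and `pack_words` is a hypothetical helper rather than part of the tutorial code.

```python
def pack_words(values, coarse):
    """Pack raw 16-bit values into 64-bit words, `coarse` values per word."""
    if not 1 <= coarse <= 4:
        raise ValueError("coarse factor must be between 1 and 4")
    words = []
    for start in range(0, len(values), coarse):
        word = 0
        for lane, value in enumerate(values[start:start + coarse]):
            # Assumption: lane 0 sits in the least significant 16 bits.
            word |= (value & 0xFFFF) << (16 * lane)
        words.append(word)
    return words


# With a coarse in factor of 1, every feature gets its own zero-padded word.
for word in pack_words([0x0180, 0xFF80], coarse=1):
    print(f"{word:016x}")
# 0000000000000180
# 000000000000ff80
```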

Similarly, the coarse out factor of the final layer is 4, meaning that each 64-bit word on the output contains four 16-bit features. In this design, they are therefore split before transmission to the host.

![coarse out](hardware-tutorial-assets/figures/coarse_out.png)   
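
Splitting an output word back into its features can be sketched as follows. As before, the assumption that the first feature occupies the least significant 16 bits is ours, and `unpack_word` is a hypothetical helper; the reference PS code in `main.cc` is the authoritative version.

```python
def unpack_word(word, coarse):
    """Return the first `coarse` 16-bit fields of a 64-bit word as signed ints."""
    fields = []
    for lane in range(coarse):
        raw = (word >> (16 * lane)) & 0xFFFF
        # Reinterpret the 16-bit pattern as a two's-complement value.
        fields.append(raw - 0x10000 if raw & 0x8000 else raw)
    return fields


word = 0xFF80_0180_0000_7FFF
print(unpack_word(word, 4))  # [32767, 0, 384, -128]
```

Dividing each field by 256 recovers the real values, here 127.99609375, 0.0, 1.5 and -0.5.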

Unlike fpgaConvNet, the four-dimensional input tensor of an ONNX model typically uses the order [Batch Size][Channels][Height][Width]. The tensor must therefore be transposed before the hardware output can be compared with the software reference to inspect the error rate. This is done in `main.py`.
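
A minimal NumPy sketch of this reordering is shown below. The function names are hypothetical and the tensor is filled with dummy values; refer to `main.py` for the conversion actually used.

```python
import numpy as np


def nchw_to_nhwc(x):
    """Reorder [Batch][Channels][Height][Width] to [Batch][Height][Width][Channels]."""
    return np.transpose(x, (0, 2, 3, 1))


def nhwc_to_nchw(x):
    """Inverse reordering, back to the layout ONNX models typically use."""
    return np.transpose(x, (0, 3, 1, 2))


# Dummy ONNX-style tensor: batch of 2, 3 channels, 2x2 spatial size.
onnx_order = np.arange(2 * 3 * 2 * 2).reshape(2, 3, 2, 2)
hw_order = nchw_to_nhwc(onnx_order)
print(hw_order.shape)  # (2, 2, 2, 3)

# Flattening now yields the channels of the first pixel back to back.
print(hw_order.ravel()[:3])  # [0 4 8]
```

Once both results share a layout, an element-wise difference such as `np.abs(a - b).max()` gives a quick measure of the quantization error.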

## Part 7: Sample session
### Install the drivers
If the drivers haven't been installed, refer to the [Digilent installation guide](https://digilent.com/reference/programmable-logic/guides/installing-vivado-and-vitis#install_cable_drivers_linux_only) to install the drivers and add the user to the `dialout` group.

### Locate the device
The descriptor of the UART module has to be determined before the host software is run. It can change with reboots. 

List the device files with

```bash
ls /dev/ttyACM*
``` 
or 
```bash
ls /dev/ttyUSB*
```
Locate the exact device, then change its permissions with
```bash
sudo chmod 777 /dev/ttyACM0
``` 

![com port](hardware-tutorial-assets/figures/com_port.png)
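
The same lookup can be scripted, which is convenient because the descriptor may change between reboots. The sketch below uses only the standard library and assumes a Linux host where the board enumerates as `ttyACM*` or `ttyUSB*`; `find_serial_devices` is a hypothetical helper and not part of the tutorial's host code.

```python
import glob
import os


def find_serial_devices(dev_dir="/dev"):
    """Return candidate UART device files, sorted by path."""
    found = []
    for pattern in ("ttyACM*", "ttyUSB*"):
        found.extend(glob.glob(os.path.join(dev_dir, pattern)))
    return sorted(found)


for path in find_serial_devices():
    print(path)
```

If more than one device is listed, unplugging the UART cable and running the script again shows which entry belongs to the board.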

On Windows, open Device Manager and inspect the serial ports to find the descriptor, which has the form COM_.

### Demo
Ensure that the `single_layer_tutorial` environment is active and run the following command from the directory containing `main.py`, substituting the appropriate UART descriptor. An optional final argument specifies the index of the image within the MNIST test set to use as input; if it is omitted, the first image is used.

Run the PS project first, then close the SDK window, and only then launch the host software from the command line.

![run](hardware-tutorial-assets/figures/run.png)

Run the python script:

```bash
python3 main.py /dev/ttyACM0 10
```
    
The program should finish in less than a minute. The output should look similar to the following.


![sample_output](hardware-tutorial-assets/figures/sample_output.png)

Congratulations! You have just completed your first design with fpgaConvNet. Please continue with the following tutorials to learn more about how to use fpgaConvNet.