# Tutorial project for Kria KV260
FK, 22.1.24

This tutorial project shows the typical steps needed to program an HLS IP core (kernel). It uses the `vadd_extensible_platform` Vitis project. Further documentation on this project and the design flow is available on the Hochschule Pforzheim Git server: https://gitlab.hs-pforzheim.de

The programmable logic binary of this tutorial contains a kernel `vadd`, which adds two input vectors element by element and writes the sums to a result vector. All vectors are stored in main memory and are accessed by the kernel via AXI master interfaces. The size of the vectors can be programmed via a register in the AXI slave interface of the kernel.

This tutorial uses PYNQ, a Python library from Xilinx that lets you write application software for Zynq and Zynq UltraScale+ devices in Jupyter notebooks. Here we use the Kria KV260 board. The Jupyter notebook runs on the Kria target and can be accessed from a web browser (`<ip_address>:9090/lab`), so you first need to find out the IP address of the board in your network. You can also access the board via ssh (`ssh ubuntu@<ip_address>`). The notebooks are stored on the Kria target under `/home/root/jupyter_notebooks`, which is the root directory you see when you open Jupyter in the web browser. We suggest creating a project directory under this root directory; you can do this from an ssh session on the board (some Linux command-line knowledge is needed). Then copy this Jupyter notebook from your Vitis system project to that directory (e.g. via sftp). You need `root` privileges for these steps, so precede every command with `sudo`, and be careful what you are doing!

Information on PYNQ can be found at http://www.pynq.io. The PYNQ documentation is available at https://pynq.readthedocs.io/en/latest/getting_started.html

Each of the following steps is preceded by a markdown cell that explains it. The first cell loads some libraries, most importantly the PYNQ library. The numpy library is used for the vectors. The notebook also uses some Python modules from a local directory (see next cell). Make sure that the path to this directory matches your system.


In [None]:
# Import libraries
import numpy as np
import time
from pynq import allocate, Overlay
# Import modules from utilities directory
import sys
sys.path.append("/home/root/jupyter_notebooks/projects/utils/")
import perftimer as pt
import hls_ip as ip

# Define project path
project_path = "/home/ubuntu/projects/vadd/"
# Define bitfile name
bitfile = project_path + "vadd_hw.xclbin"
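
Because both paths in the cell above are system-specific, a quick existence check before loading anything can save a confusing error later. The following is a minimal sketch using only the Python standard library; the helper name `missing_paths` is hypothetical and is not part of PYNQ or of the utilities directory.

```python
from pathlib import Path


def missing_paths(*paths):
    """Return the subset of the given paths that do not exist on this system."""
    return [str(p) for p in paths if not Path(p).exists()]


# Paths as used in the cell above; adjust them to your own setup.
utils_dir = "/home/root/jupyter_notebooks/projects/utils/"
bitfile = "/home/ubuntu/projects/vadd/vadd_hw.xclbin"

for path in missing_paths(utils_dir, bitfile):
    print("Path not found:", path)
```

If nothing is printed, both the utilities directory and the bitfile were found.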

In the next cell the size of the vectors is set and the so-called overlay is loaded into the programmable logic (PL) of the Kria KV260. The overlay is the bitfile which is programmed into the PL. In a design based on the extensible platform, the binary file `binary_container_1.xclbin` generated in Vitis can be used to program the PL. Note that the `bitfile` variable defined above points to `vadd_hw.xclbin`, so adjust the file name to match your binary. The PYNQ `Overlay` class loads the binary, and the resulting object `vadd_design` is used to work with the hardware in the PL. The `help` function prints information on the overlay. Most important here are the IP blocks and their names, because they are used in the following cells. For this demo there should be an IP block called `krnl_vadd_1`.


In [4]:
# set size of input vectors and output vectors
size = 100
# Load overlay
vadd_design = Overlay(bitfile)
# Get some information on the overlay
help(vadd_design)

Help on Overlay in module pynq.overlay:

<pynq.overlay.Overlay object>
    Default documentation for overlay /home/ubuntu/projects/hw/vadd/vadd_hw.xclbin. The following
    attributes are available on this overlay:
    
    IP Blocks
    ----------
    krnl_vadd_1          : pynq.overlay.DefaultIP
    
    Hierarchies
    -----------
    None
    
    Interrupts
    ----------
    None
    
    GPIO Outputs
    ------------
    None
    
    Memories
    ------------
    HP2                  : Memory
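
Instead of reading the `help` output by eye, the presence of the expected IP block can also be checked in code. The sketch below assumes the dictionary of IP blocks that PYNQ exposes as `Overlay.ip_dict`, keyed by IP name; the helper `require_ip` is hypothetical, and a plain dictionary stands in for `vadd_design.ip_dict` so that the example also runs without the board.

```python
def require_ip(ip_dict, name):
    """Return the entry for `name`, or raise an error listing the available IP blocks."""
    if name not in ip_dict:
        available = ", ".join(sorted(ip_dict)) or "none"
        raise KeyError(f"IP block '{name}' not found; available: {available}")
    return ip_dict[name]


# Stand-in for vadd_design.ip_dict (assumption: keys are the IP block names).
example_ip_dict = {"krnl_vadd_1": {"type": "placeholder"}}

entry = require_ip(example_ip_dict, "krnl_vadd_1")
print("Found IP block: krnl_vadd_1")
```

On the board the call would be `require_ip(vadd_design.ip_dict, "krnl_vadd_1")`.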



In the next cell we create an object for the IP core (or "kernel") and get information on the register map of the AXI slave interface of the kernel. The register names will be used subsequently. There should always be a CTRL register which contains the block control interface of the kernel. Each block control signal is represented by a control bit in the register (AP_START etc.). The other registers represent the arguments of the C/C++ code of the IP core. In this example there should be the registers `in1`, `in2`, `out_r` and `size`. The first three registers are programmed with the start addresses of the vectors in memory and are used by the AXI master interfaces to access the data in main memory. The `size` register is programmed with the transfer size (= vector size).

In [5]:
# Create IP object ("driver").
vadd_ip = vadd_design.krnl_vadd_1
# Show register map of ip
vadd_ip.register_map

RegisterMap {
  CTRL = Register(AP_START=0, AP_DONE=0, AP_IDLE=1, AP_READY=0, AUTO_RESTART=0, AP_CONTINUE=0),
  in1 = Register(value=0),
  in2 = Register(value=0),
  out_r = Register(value=0),
  size = Register(value=0)
}
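
To make the relation between the CTRL register and its control bits concrete, the following sketch decodes a raw register value into the named fields shown above. The bit positions are an assumption based on the usual layout of HLS-generated block-control registers; check them against the driver header generated for your kernel. The names `CTRL_BITS` and `decode_ctrl` are hypothetical.

```python
# Assumed bit positions of the HLS block-control register (CTRL).
CTRL_BITS = {
    "AP_START": 0,
    "AP_DONE": 1,
    "AP_IDLE": 2,
    "AP_READY": 3,
    "AP_CONTINUE": 4,
    "AUTO_RESTART": 7,
}


def decode_ctrl(value):
    """Split a raw CTRL register value into its named control bits."""
    return {name: (value >> bit) & 1 for name, bit in CTRL_BITS.items()}


# 0x04 has only bit 2 set, which corresponds to an idle kernel.
print(decode_ctrl(0x04))
```

PYNQ already performs this decoding when it displays `register_map`; the sketch only illustrates what the fields mean.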

In the next cell we allocate numpy vectors for the input buffers, the output buffer and a buffer for the reference data. Since we do a vector addition, the reference data is generated by an addition here. The function `allocate` allocates buffers using a numpy-like syntax and returns buffer objects which can be used like numpy arrays. The memory is allocated somewhere in virtual memory; the physical address of each buffer can be retrieved through its `.device_address` property. For documentation, see: https://pynq.readthedocs.io/en/latest/pynq_libraries/allocate.html
If you need help on `numpy`, you can find it here: https://numpy.org

In [6]:
# allocate buffers
inbuf1 = allocate(shape=(size,), dtype=np.uint32)
inbuf2 = allocate(shape=(size,), dtype=np.uint32)
outbuf = allocate(shape=(size,), dtype=np.uint32)
refbuf = allocate(shape=(size,), dtype=np.uint32)

# Fill buffers for input, output
# Generate the reference buffer by doing the vector addition
for i in range(size):
    inbuf1[i] = i
    inbuf2[i] = i+3
    outbuf[i] = 0
    refbuf[i] = inbuf1[i] + inbuf2[i]
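
Since the buffers behave like numpy arrays, the element-wise loop can also be written with slice assignments. The sketch below uses plain numpy arrays as stand-ins for the PYNQ buffers so that it runs off-board; on the target the arrays would come from `allocate` as in the cell above.

```python
import numpy as np

size = 100

# Plain numpy arrays stand in for the PYNQ buffers in this sketch.
inbuf1 = np.zeros(size, dtype=np.uint32)
inbuf2 = np.zeros(size, dtype=np.uint32)
outbuf = np.ones(size, dtype=np.uint32)
refbuf = np.zeros(size, dtype=np.uint32)

# Slice assignment writes into the existing buffers instead of rebinding the names.
inbuf1[:] = np.arange(size, dtype=np.uint32)
inbuf2[:] = np.arange(size, dtype=np.uint32) + 3
outbuf[:] = 0
refbuf[:] = inbuf1 + inbuf2
```

The `[:]` matters: writing `inbuf1 = np.arange(size)` would bind the name to a new ordinary array, and the allocated buffer whose address is passed to the kernel would stay unchanged.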


In the next cell the kernel registers are programmed. The registers `in1`, `in2` and `out_r` are programmed with the start addresses of the data buffers in main memory, using the `.device_address` property of the buffers. Finally the `size` register is programmed with the transfer size.

In [7]:
# Load buffer addresses into registers
vadd_ip.register_map.in1 = inbuf1.device_address
vadd_ip.register_map.in2 = inbuf2.device_address
vadd_ip.register_map.out_r = outbuf.device_address

# Set size register
vadd_ip.register_map.size = size

In the next cell the IP core is run and the execution time of the kernel is measured. The time measurement is done with a timer class from the module `perftimer` (see Python script `perftimer.py`). The IP core is run by the function `run_ip` from the module `hls_ip` (see Python script `hls_ip.py`). The result should now be in the output buffer `outbuf` in main memory. Note that this time measurement may differ from the execution time measured by the C/C++ application!

In [8]:
# Initialize a timer
hw_timer = pt.Timer("Execution time HW:", True)
# Start timer
hw_timer.start()
# Run the IP
ip.run_ip(vadd_ip)
# Stop timer
hw_timer.stop()

Execution time HW: elapsed time: 521.966 us


521.966
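
The script `hls_ip.py` is not reproduced in this notebook. As a rough illustration of what a function like `run_ip` may do, the sketch below sets AP_START and then polls AP_DONE through the register map fields shown earlier. Everything here is an assumption: `run_ip_sketch`, the timeout parameter and the `Fake*` classes are hypothetical, and the fake IP only imitates a kernel that finishes on the second poll so that the example runs without hardware.

```python
import time


class FakeCtrl:
    """Stand-in for register_map.CTRL; reports done on the second poll."""

    def __init__(self):
        self.AP_START = 0
        self._polls = 0

    @property
    def AP_DONE(self):
        self._polls += 1
        return 1 if self.AP_START and self._polls >= 2 else 0


class FakeRegisterMap:
    def __init__(self):
        self.CTRL = FakeCtrl()


class FakeIP:
    def __init__(self):
        self.register_map = FakeRegisterMap()


def run_ip_sketch(ip, timeout_s=1.0):
    """Start the kernel and busy-wait until AP_DONE is set or the timeout expires."""
    ip.register_map.CTRL.AP_START = 1
    deadline = time.monotonic() + timeout_s
    while not ip.register_map.CTRL.AP_DONE:
        if time.monotonic() > deadline:
            raise TimeoutError("kernel did not signal AP_DONE in time")


fake_ip = FakeIP()
run_ip_sketch(fake_ip)
print("polls until done:", fake_ip.register_map.CTRL._polls)
```

Busy-waiting keeps the sketch short; the time spent in the Python polling loop is one reason why a measurement taken around such a call can differ from the kernel's actual run time.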

Finally we compare the output vector generated by the kernel with the reference vector and print a final verdict.

In [9]:
# Check if output vector equals the reference vector 
if (outbuf == refbuf).all():
    print("Test succeeded!")
else:
    print("Test failed!")

Test succeeded!
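
When the comparison fails, it helps to know where the vectors differ. The following sketch lists the first few mismatching elements; the helper `report_mismatches` is hypothetical, and small numpy arrays with one deliberately wrong element stand in for `outbuf` and `refbuf`.

```python
import numpy as np


def report_mismatches(out, ref, limit=5):
    """Return up to `limit` (index, got, expected) tuples where out differs from ref."""
    bad = np.flatnonzero(np.asarray(out) != np.asarray(ref))
    return [(int(i), int(out[i]), int(ref[i])) for i in bad[:limit]]


# Stand-in data with one deliberately wrong element.
ref = np.arange(10, dtype=np.uint32)
out = ref.copy()
out[4] = 99

for index, got, expected in report_mismatches(out, ref):
    print(f"index {index}: got {got}, expected {expected}")
```

In the notebook the call would be `report_mismatches(outbuf, refbuf)` in the `else` branch of the check above.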
