# Introduction to PYNQ for Alveo

In this notebook, we introduce the basics of PYNQ for Alveo. More specifically, we will briefly explore how to:
1. program the device
2. allocate buffers
3. send data to and receive data from the FPGA
4. call an accelerator

That is all you need to get started!

## Example 1: Vector Addition

In this first example, we will use the vector addition kernel included in the [hello world](https://github.com/Xilinx/Vitis_Accel_Examples/tree/63bae10d581df40cf9402ed71ea825476751305d/hello_world) application of the [Vitis Accel Examples' Repository](https://github.com/Xilinx/Vitis_Accel_Examples/tree/63bae10d581df40cf9402ed71ea825476751305d).

### Overlay download

First, let's import `pynq`, download the overlay, and assign the vadd kernel IP to a variable called `vadd`. As you can see below, the API is exactly the same as for Zynq devices.

In [1]:
import pynq
ol = pynq.Overlay("intro.xclbin")

vadd = ol.vadd_1

### Buffer allocation

Let's first take a look at the signature of the vadd kernel. To do so, we use the `.signature` property. The accelerator takes two input vectors, one output vector, and the vectors' size as arguments.

In [2]:
vadd.signature

<Signature (in1:'unsigned int const *', in2:'unsigned int const *', out_r:'unsigned int*', size:'int')>
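
The printed value has the form of a standard `inspect.Signature`. As a hardware-free sketch, assuming that is what `.signature` returns, the snippet below rebuilds an equivalent signature by hand and uses `bind` to check an argument list against it. The placeholder strings stand in for real buffers and are purely illustrative.

```python
import inspect

# Rebuild the signature printed above by hand (no FPGA required).
# Assumption: `.signature` is an ordinary inspect.Signature whose
# annotations are the C types of the kernel arguments.
kind = inspect.Parameter.POSITIONAL_OR_KEYWORD
sig = inspect.Signature([
    inspect.Parameter("in1", kind, annotation="unsigned int const *"),
    inspect.Parameter("in2", kind, annotation="unsigned int const *"),
    inspect.Parameter("out_r", kind, annotation="unsigned int*"),
    inspect.Parameter("size", kind, annotation="int"),
])

arg_names = list(sig.parameters)

# bind() raises TypeError if the argument count does not match.
bound = sig.bind("in1_buffer", "in2_buffer", "out_buffer", 1024 * 1024)

try:
    sig.bind("in1_buffer", "in2_buffer")  # two arguments missing
    missing_detected = False
except TypeError:
    missing_detected = True
```

Checking the argument order this way can catch a swapped or missing argument before anything is sent to the card.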

Buffer allocation is carried out by [`pynq.allocate`](https://pynq.readthedocs.io/en/v2.5/pynq_libraries/allocate.html), which returns a buffer with the same interface as a [`numpy.ndarray`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html). In this case we create three 1024x1024 arrays: two inputs and one output. Since the kernel uses unsigned integers, we specify `u4` as the data type when allocating.

In [3]:
size = 1024*1024
in1_vadd = pynq.allocate((1024, 1024), 'u4')
in2_vadd = pynq.allocate((1024, 1024), 'u4')
out = pynq.allocate((1024, 1024), 'u4')
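
To get a feel for the sizes involved, here is a small sketch that uses a plain NumPy array as a stand-in for one allocated buffer (an assumption made so the snippet runs without a card). It shows how the element count passed as `size` relates to the memory footprint of the three buffers.

```python
import numpy as np

# Plain NumPy stand-in for one buffer; on hardware this would be
# pynq.allocate((1024, 1024), 'u4') as in the cell above.
buf = np.zeros((1024, 1024), dtype="u4")

n_elements = buf.size                      # what the kernel receives as `size`
bytes_per_buffer = buf.nbytes              # 4 bytes per 'u4' element
total_mib = 3 * bytes_per_buffer / 2**20   # two inputs plus one output

print(n_elements, bytes_per_buffer, total_mib)
```

Each buffer holds 1,048,576 elements in 4 MiB, so the three buffers together occupy 12 MiB.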

We can use numpy to easily initialize one of the input arrays with random data, with numbers in the range [0, 100). We instead set all the elements of the second input array to a fixed value so we can see at a glance whether the addition was successful.

In [4]:
import numpy as np
in1_vadd[:] = np.random.randint(low=0, high=100, size=(1024, 1024), dtype='u4')
in2_vadd[:] = 200

### Run the kernel

Before we can start the kernel we need to make sure that the buffers are flushed to the FPGA card. We do this by calling `.flush()` on each of our input arrays.

To start the accelerator, we can use the `.call()` function and pass the kernel arguments. The function takes care of correctly setting the `register_map` of the IP and sending the start signal. We pass the arguments to `.call()` in the order given by the `.signature` we previously inspected.

Once the kernel has completed, we can `.invalidate()` the output buffer to ensure that data from the FPGA is transferred back to the host memory.
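
The role of `.flush()` and `.invalidate()` can be pictured with a toy model. The sketch below is not how PYNQ implements buffers; `ToyBuffer` and `toy_vadd` are hypothetical names, and the model simply assumes that each buffer has a host copy and a device copy that must be synchronized explicitly.

```python
class ToyBuffer:
    """Hypothetical stand-in for a buffer with separate host and device copies."""

    def __init__(self, n):
        self.host = [0] * n      # what Python code reads and writes
        self.device = [0] * n    # what the kernel reads and writes

    def flush(self):
        # Host -> device: make host writes visible to the kernel.
        self.device = list(self.host)

    def invalidate(self):
        # Device -> host: make kernel writes visible to Python.
        self.host = list(self.device)


def toy_vadd(a, b, out):
    # The "kernel" only ever touches the device copies.
    out.device = [x + y for x, y in zip(a.device, b.device)]


a, b, out = ToyBuffer(4), ToyBuffer(4), ToyBuffer(4)
a.host[:] = [1, 2, 3, 4]
b.host[:] = [200] * 4

a.flush()
b.flush()
toy_vadd(a, b, out)
stale = list(out.host)   # still zeros: the result has not been copied back
out.invalidate()
fresh = list(out.host)   # [201, 202, 203, 204]
```

In this model, reading the output before `invalidate()` returns stale data, which is why the output buffer is invalidated after the kernel call.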

We use the `%%timeit` magic to measure the average execution time. The magic automatically chooses how many loops to run in order to get a reliable average.

In [5]:
%%timeit
in1_vadd.flush()
in2_vadd.flush()

vadd.call(in1_vadd, in2_vadd, out, size)

out.invalidate()

17.5 ms ± 20.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
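
Outside a notebook, a similar measurement can be taken with the standard `timeit` module. The sketch below roughly mirrors the shape of the report above (several runs, each timing a batch of loops). The `workload` function is a placeholder, and the loop count is fixed by hand rather than chosen automatically.

```python
import timeit


def workload():
    # Placeholder for flush / call / invalidate; here just a cheap sum.
    return sum(range(1000))


# Several runs, each timing a batch of loops, summarised per loop.
repeats, loops = 7, 100
batch_times = timeit.repeat(workload, repeat=repeats, number=loops)
per_loop = [t / loops for t in batch_times]
mean_per_loop = sum(per_loop) / len(per_loop)

print(f"{mean_per_loop * 1e6:.1f} µs per loop "
      f"(mean of {repeats} runs, {loops} loops each)")
```

Note that the timed cell includes the flushes and the invalidation, so the reported figure covers data movement as well as kernel execution.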


Finally, let's compare the FPGA results with software, using [`numpy.array_equal`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.array_equal.html).

In [6]:
np.array_equal(out, in1_vadd + in2_vadd)

True
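
The same check can be rehearsed on the host alone. The following sketch uses small NumPy arrays (4x4 rather than 1024x1024, chosen only for readability) to show the expected value range and one way to locate mismatches if a comparison ever returns `False`. The corrupted element is simulated.

```python
import numpy as np

in1 = np.random.randint(low=0, high=100, size=(4, 4), dtype="u4")
in2 = np.full((4, 4), 200, dtype="u4")

expected = in1 + in2              # software reference for vadd

# With in1 in [0, 100) and in2 fixed at 200, every sum lies in [200, 300).
in_range = bool(((expected >= 200) & (expected < 300)).all())

# If a comparison ever fails, argwhere lists the mismatching indices.
corrupted = expected.copy()
corrupted[1, 2] += 1              # simulate a single wrong element
mismatches = np.argwhere(corrupted != expected)
```

This is why the second input was set to a fixed value: any output element outside [200, 300) is immediately suspicious.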

## Example 2: Vector Addition and Vector Multiplication

In this second example, alongside the previously introduced vector addition kernel, we will use the vector multiplication kernel included in the [SLR assign](https://github.com/Xilinx/Vitis_Accel_Examples/tree/63bae10d581df40cf9402ed71ea825476751305d/sys_opt/slr_assign) application of the Vitis Accel Examples.

We assign the vector multiplication kernel IP included in the overlay to a variable called `vmult`, and print the `.signature` similarly to what we have done for `vadd`.

In [7]:
vmult = ol.vmult_1
vmult.signature

<Signature (A:'int*', B:'int*', C:'int*', n_elements:'int')>

For this example, we will take the result of `vadd` and feed it to `vmult`. Therefore, we will reuse the previously allocated buffers, so we only need to allocate two buffers that will be used as input for the `vmult` kernel.

In [8]:
in1_vmult = pynq.allocate((1024, 1024), 'u4')
in2_vmult = pynq.allocate((1024, 1024), 'u4')

The `in2_vmult` buffer will be used to store the output of `vadd`, so we only need to initialize the two input buffers for `vadd`, `in1_vadd` and `in2_vadd`, and the other input buffer for `vmult`, that is, `in1_vmult`. We set these buffers' elements to random integers in the range [0, 1000).

In [9]:
in1_vadd[:] = np.random.randint(1000, size=(1024, 1024), dtype='u4')
in2_vadd[:] = np.random.randint(1000, size=(1024, 1024), dtype='u4')
in1_vmult[:] = np.random.randint(1000, size=(1024, 1024), dtype='u4')

As in the previous example, we have to `.flush()` the input buffers and, after executing the kernels using `.call()`, `.invalidate()` the output buffer to transfer data back to the host memory. However, since `in2_vmult` is used only as an exchange buffer between `vadd` and `vmult`, and the host never needs to see its data, we don't need to flush or invalidate it.

Again, we use the `%%timeit` magic to get an average of the execution time.

In [10]:
%%timeit
in1_vadd.flush()
in2_vadd.flush()
in1_vmult.flush()

vadd.call(in1_vadd, in2_vadd, in2_vmult, size)
vmult.call(in1_vmult, in2_vmult, out, size)

out.invalidate()

32.2 ms ± 51.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
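
A back-of-the-envelope count shows what keeping the exchange buffer on the card saves. The sketch assumes that every `.flush()` or `.invalidate()` moves one whole 4 MiB buffer. The round-trip variant is a hypothetical alternative, not something the notebook runs.

```python
BYTES_PER_BUFFER = 1024 * 1024 * 4   # one 1024x1024 'u4' buffer
MIB = 2**20

# As in the cell above: flush three inputs, invalidate one output.
transfers_shared = 3 + 1

# Hypothetical alternative: copy the intermediate result back to the host
# after vadd and flush it again before vmult (two extra transfers).
transfers_roundtrip = transfers_shared + 2

moved_shared = transfers_shared * BYTES_PER_BUFFER / MIB
moved_roundtrip = transfers_roundtrip * BYTES_PER_BUFFER / MIB
saved = moved_roundtrip - moved_shared

print(moved_shared, moved_roundtrip, saved)
```

Under these assumptions, the cell moves 16 MiB per iteration instead of 24 MiB.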


Finally, we compare the result with software to check that the kernels executed correctly.

In [11]:
np.array_equal(out, (in1_vadd + in2_vadd) * in1_vmult)

True
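
This comparison is only meaningful if the values cannot overflow: `u4` arithmetic in NumPy wraps around modulo 2^32, and the `vmult` signature declares `int` pointers. A quick worst-case check, using only the input range stated above, suggests both limits are respected.

```python
max_input = 999                      # randint(1000, ...) draws from [0, 1000)
max_sum = max_input + max_input      # largest possible vadd output
max_product = max_sum * max_input    # largest possible vmult output

UINT32_MAX = 2**32 - 1               # limit of the 'u4' buffers
INT32_MAX = 2**31 - 1                # limit of a 32-bit signed 'int'

fits_unsigned = max_product <= UINT32_MAX
fits_signed = max_product <= INT32_MAX

print(max_sum, max_product, fits_unsigned, fits_signed)
```

The largest possible result is 1,996,002, well below both limits, so signed and unsigned interpretations agree for these inputs. With a much wider input range this would no longer hold.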

## Cleaning up

Finally, we have to deallocate the buffers and free the FPGA context using `Overlay.free`. For the buffers, we have to use the [`%xdel`](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-xdel) magic to also remove any reference to these buffers in Jupyter/IPython. IPython holds on to references of cell outputs so a standard `del` isn’t sufficient to remove all references to the array and hence trigger the memory to be freed.

In [12]:
%xdel in1_vadd
%xdel in2_vadd
%xdel in1_vmult
%xdel in2_vmult
%xdel out
ol.free()
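
The reason a plain `del` may not be enough can be reproduced without IPython. In the sketch below, a dictionary plays the role of the output history and `FakeBuffer` is a hypothetical stand-in for an allocated buffer. A weak reference lets us observe when the object is actually released.

```python
import gc
import weakref


class FakeBuffer:
    """Hypothetical stand-in for an allocated buffer."""


buf = FakeBuffer()
watcher = weakref.ref(buf)       # lets us observe when the object is freed

output_cache = {6: buf}          # plays the role of IPython's output history

del buf                          # plain `del` removes only one name
gc.collect()
alive_after_del = watcher() is not None

output_cache.clear()             # drop the remaining reference as well
gc.collect()
alive_after_clear = watcher() is not None
```

The object survives the `del` and is released only once the cached reference is gone, which is the situation `%xdel` is meant to handle.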

Copyright (C) 2020 Xilinx, Inc