# Sum of squares (variable)

This example computes the sum of squares of an incoming data stream. The data is streamed to and from the IP using an AXI DMA component.

First, the correct bitstream (.bit) and its associated hardware description (.hwh) are loaded onto the FPGA:

In [1]:
from pynq import Overlay, MMIO
from pynq.lib import AxiGPIO

overlay = Overlay("/home/xilinx/overlays/sum_of_squares_variable_rst.bit")
overlay.download()

The components in the design and all associated metadata can be found in the `ip_dict`.

In [2]:
list(overlay.ip_dict.keys())

['axi_dma_0', 'axi_gpio_0', 'sum_of_squares_varia_0', 'processing_system7_0']

# Example with small array

Next, the input stream is allocated and populated with random data.

In [3]:
from pynq import allocate
import numpy as np

A = allocate(shape=(39,), dtype=np.uint32)

A[:] = np.random.randint(39, size=(39,))

The IP core is reset through AXI GPIO, and the reset is active low. We therefore need to set the GPIO value to `1` first. Otherwise, the IP core is held in reset permanently.

In [4]:
overlay.axi_gpio_0.channel1[0].on()
overlay.axi_gpio_0.channel1.read()

1

Next, we write the length of the stream (minus one) to the IP component at offset `0x18`.

In [5]:
overlay.sum_of_squares_varia_0.mmio.write(0x18, A.size - 1)

Then we start the `sum_of_squares_varia_0` IP by writing the start bit (`0x01`) to offset `0x0` of its register space.

In [6]:
overlay.sum_of_squares_varia_0.mmio.write(0x0, 0x01)
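
The value `0x01` written here, and the value `0x81` that appears in commented-out code later in this notebook, can be read as bit flags. The sketch below decodes them under the assumption that the IP was generated with Vivado HLS and uses its usual AXI4-Lite control register layout (bit 0 start, bit 1 done, bit 2 idle, bit 3 ready, bit 7 auto-restart). The notebook does not confirm this layout, and the constant names and `decode_control` are illustrative only.

```python
# Assumed bit layout of the control register at offset 0x0.
# This is the usual Vivado HLS convention; it is NOT confirmed for this IP.
AP_START = 1 << 0
AP_DONE = 1 << 1
AP_IDLE = 1 << 2
AP_READY = 1 << 3
AUTO_RESTART = 1 << 7

# Hypothetical lookup table, ordered from the lowest bit to the highest.
_FLAGS = {
    "AP_START": AP_START,
    "AP_DONE": AP_DONE,
    "AP_IDLE": AP_IDLE,
    "AP_READY": AP_READY,
    "AUTO_RESTART": AUTO_RESTART,
}


def decode_control(value):
    """Return the names of the assumed control bits that are set in `value`."""
    return [name for name, mask in _FLAGS.items() if value & mask]


print(decode_control(0x01))  # ['AP_START']
print(decode_control(0x81))  # ['AP_START', 'AUTO_RESTART']
```

If the assumption holds, `0x81` would start the core and re-arm it automatically after each run, which would fit sending several full-length chunks back to back.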

Stream the data to the IP and wait until the transfer completes.

In [7]:
overlay.axi_dma_0.sendchannel.transfer(A)
overlay.axi_dma_0.sendchannel.wait()

Indeed, the result read back from the IP at offset `0x10` matches the sum of squares computed in software.

In [8]:
print(overlay.sum_of_squares_varia_0.mmio.read(0x10))
print(sum(A**2))

17438
17438
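
The steps above (write the length, start the IP, stream, read back) can be bundled into one function. The following is a minimal sketch, not part of the original design: `sum_of_squares_hw`, `FakeMMIO` and `FakeSendChannel` are hypothetical names. The two fake classes only mimic the `write`/`read` and `transfer`/`wait` calls used in this notebook so that the function can be dry-run without a board. They model the IP as a simple software accumulator, which is an assumption.

```python
import numpy as np


def sum_of_squares_hw(ip_mmio, send_channel, buffer):
    """Stream `buffer` through the IP once and return the result register."""
    ip_mmio.write(0x18, buffer.size - 1)  # stream length minus one
    ip_mmio.write(0x0, 0x01)              # start bit
    send_channel.transfer(buffer)
    send_channel.wait()
    return ip_mmio.read(0x10)             # accumulated sum of squares


class FakeMMIO:
    """Hypothetical stand-in for the IP's register space (dry-run only)."""

    def __init__(self):
        self.regs = {}

    def write(self, offset, value):
        self.regs[offset] = value

    def read(self, offset):
        return self.regs.get(offset, 0)


class FakeSendChannel:
    """Hypothetical stand-in for the DMA send channel (dry-run only)."""

    def __init__(self, mmio):
        self.mmio = mmio

    def transfer(self, buffer):
        # Software model of the IP: add the squares of the streamed values
        # to whatever is already stored in the result register.
        length = self.mmio.read(0x18) + 1
        data = np.asarray(buffer[:length], dtype=np.uint64)
        self.mmio.write(0x10, self.mmio.read(0x10) + int(np.sum(data * data)))

    def wait(self):
        pass  # nothing to wait for in the software model


mmio = FakeMMIO()
data = np.arange(10, dtype=np.uint32)
result = sum_of_squares_hw(mmio, FakeSendChannel(mmio), data)
print(result)  # 285
```

On the board, the same function would be called with `overlay.sum_of_squares_varia_0.mmio`, `overlay.axi_dma_0.sendchannel` and the allocated buffer `A` instead of the fakes.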


# Example with large array

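We now allocate a larger input stream of 300 values. Instead of random data, it is filled with the deterministic sequence 0 to 299 (the random initialization is commented out).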

In [28]:
from pynq import allocate
import numpy as np

A = allocate(shape=(300,), dtype=np.uint32)

# A[:] = np.random.randint(2500, size=(2500,))
A[:] = np.arange(300)

We first reset the IP core by driving the active-low reset to `0` and then back to `1`:

In [29]:
overlay.axi_gpio_0.channel1[0].off()
overlay.axi_gpio_0.channel1[0].on()
overlay.axi_gpio_0.channel1.read()

1

The IP core can handle at most 256 values at once, so the array has to be split into chunks that are sent one by one. This happens implicitly when using DMA. Because the IP core uses a static accumulator, the result only has to be read back at the very end.

In [30]:
STREAM_LEN = 256
iterations = A.size // STREAM_LEN
remainder = A.size % STREAM_LEN

print(iterations, remainder)

# if iterations:
#     overlay.sum_of_squares_varia_0.mmio.write(0x18, STREAM_LEN - 1)
#     overlay.sum_of_squares_varia_0.mmio.write(0x0, 0x81)
#     overlay.axi_dma_0.sendchannel.transfer(A[:iterations * STREAM_LEN])
#     overlay.axi_dma_0.sendchannel.wait()

1 44


In [31]:
if remainder:
    overlay.sum_of_squares_varia_0.mmio.write(0x18, remainder - 1)
    overlay.sum_of_squares_varia_0.mmio.write(0x0, 0x01)
    # print(overlay.sum_of_squares_varia_0.mmio.read(0x0))
    overlay.axi_dma_0.sendchannel.transfer(A[iterations * STREAM_LEN:])
    overlay.axi_dma_0.sendchannel.wait()
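
The chunking described above can be sketched in plain Python without touching the hardware. `chunk_bounds` is a hypothetical helper that only computes slice boundaries. The loop after it is a software model of the static accumulator, fed one chunk at a time, which is one possible way to drive the IP and not necessarily how the design is meant to be used.

```python
import numpy as np


def chunk_bounds(total, stream_len=256):
    """Yield (start, stop) pairs covering `total` items in chunks of at most `stream_len`."""
    for start in range(0, total, stream_len):
        yield start, min(start + stream_len, total)


bounds = list(chunk_bounds(300))
print(bounds)  # [(0, 256), (256, 300)]

# Software model of the accumulator: add up the squares chunk by chunk.
A_sw = np.arange(300, dtype=np.uint64)
acc = 0
for start, stop in bounds:
    chunk = A_sw[start:stop]
    acc += int(np.sum(chunk * chunk))

print(acc)  # 8955050
```

On the board, each `(start, stop)` pair would correspond to one length write of `stop - start - 1`, one start, and one DMA transfer of `A[start:stop]`.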

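We again compare the result with the software implementation. Because the code that streams the full 256-value chunk is commented out above, only the remaining 44 values reach the IP in this run, so the two numbers differ: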

In [32]:
print(overlay.sum_of_squares_varia_0.mmio.read(0x10))
print(sum(A**2))

3395370
8955050


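The difference between the two numbers equals the software sum over the first 256 values, which were not streamed:

In [13]: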
print(sum(A[:256]**2))

5559680
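
Since the input is `np.arange(300)`, the printed values can also be cross-checked without the board using the closed form for the sum of the first squares, $\sum_{i=0}^{n-1} i^2 = \frac{(n-1)\,n\,(2n-1)}{6}$. The helper name `sum_squares_below` is illustrative.

```python
def sum_squares_below(n):
    """Sum of i**2 for i in range(n), using the closed form (n-1)n(2n-1)/6."""
    return (n - 1) * n * (2 * n - 1) // 6


total = sum_squares_below(300)        # all 300 values
first_chunk = sum_squares_below(256)  # indices 0..255
tail = total - first_chunk            # indices 256..299

print(total, first_chunk, tail)  # 8955050 5559680 3395370
```

The last number equals the value read from the IP, which is consistent with the hardware having accumulated only the values at indices 256 to 299.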
