# Matrix multiplication (variable size)

This example shows how to use a variable-size (up to 64x64) floating-point matrix multiplication in hardware. The data is streamed to and from the IP using an AXI DMA component.

First, the bitstream (.bit) and its associated hardware description (.hwh) are loaded onto the FPGA:

In [1]:
from pynq import Overlay, MMIO

overlay = Overlay("/home/xilinx/overlays/mmul_v2_64.bit")

The components in the design and all associated metadata can be found in the `ip_dict`.

In [2]:
list(overlay.ip_dict.keys())

['axi_dma_0', 'mmul_v2_0', 'processing_system7_0']

Next, the input matrices `A` and `B` have to be allocated and populated with test data (here, each row holds its 1-based row index).

In [3]:
from pynq import allocate
import numpy as np

L, M, N = 4, 5, 6

A = allocate(shape=(L,M), dtype="u4")
B = allocate(shape=(M,N), dtype="u4")

A[:] = np.mgrid[1:L+1, 1:M+1][0]
B[:] = np.mgrid[1:M+1, 1:N+1][0]

# 1. Python (PYNQ)

## 1.a Memory allocation

To minimize the streaming latency, the `mmul_v2_0` component actually calculates `A@B.T=C.T`. Therefore, we have to re-allocate `B` in transposed form (`BT`) and treat the result buffer `C` as being transposed (`CT`).

In [4]:
BT = allocate(shape=B.shape[::-1], dtype=B.dtype)
BT[:]=B.T

CT = allocate(shape=(A.shape[0], B.shape[1]), dtype="u4")
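
To see why a transposed copy of `B` helps with streaming, the following is a minimal software sketch. It assumes that each output element is the dot product of one row of `A` with one row of `BT`, so both operands are consumed strictly row by row. The function `mmul_bt` and the `*_ref` names are hypothetical, plain Python lists stand in for the PYNQ buffers, and nothing here is taken from the actual IP implementation.

```python
def mmul_bt(a, bt):
    """Multiply a (L x M) by a matrix given as its transpose bt (N x M).

    Each output element is the dot product of a row of `a` and a row of
    `bt`, so neither operand is ever read column-wise.
    """
    assert all(len(row) == len(bt[0]) for row in a), "inner dimensions differ"
    return [[sum(x * y for x, y in zip(row_a, row_bt)) for row_bt in bt]
            for row_a in a]

L, M, N = 4, 5, 6
A_ref = [[i] * M for i in range(1, L + 1)]   # same values as A in the notebook
B_ref = [[k] * N for k in range(1, M + 1)]   # same values as B in the notebook
BT_ref = [list(col) for col in zip(*B_ref)]  # N x M transpose of B_ref

C_ref = mmul_bt(A_ref, BT_ref)
print(C_ref[0])  # [15, 15, 15, 15, 15, 15]
```

The result has the same values as `A@B` for this test data, which makes it usable as a reference when checking the hardware output.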

Let's compare the memory layout of `B` and `BT`:

In [5]:
import subprocess
cmd = f"hexdump -C -s {B.physical_address} /dev/mem | head"
print(B)
print(subprocess.check_output(cmd, shell=True).decode("utf-8"))

[[1 1 1 1 1 1]
 [2 2 2 2 2 2]
 [3 3 3 3 3 3]
 [4 4 4 4 4 4]
 [5 5 5 5 5 5]]
16869000  01 00 00 00 01 00 00 00  01 00 00 00 01 00 00 00  |................|
16869010  01 00 00 00 01 00 00 00  02 00 00 00 02 00 00 00  |................|
16869020  02 00 00 00 02 00 00 00  02 00 00 00 02 00 00 00  |................|
16869030  03 00 00 00 03 00 00 00  03 00 00 00 03 00 00 00  |................|
16869040  03 00 00 00 03 00 00 00  04 00 00 00 04 00 00 00  |................|
16869050  04 00 00 00 04 00 00 00  04 00 00 00 04 00 00 00  |................|
16869060  05 00 00 00 05 00 00 00  05 00 00 00 05 00 00 00  |................|
16869070  05 00 00 00 05 00 00 00  00 00 00 00 00 00 00 00  |................|
16869080  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*



In [6]:
import subprocess
cmd = f"hexdump -C -s {BT.physical_address} /dev/mem | head"
print(BT)
print(subprocess.check_output(cmd, shell=True).decode("utf-8"))

[[1 2 3 4 5]
 [1 2 3 4 5]
 [1 2 3 4 5]
 [1 2 3 4 5]
 [1 2 3 4 5]
 [1 2 3 4 5]]
1686a000  01 00 00 00 02 00 00 00  03 00 00 00 04 00 00 00  |................|
1686a010  05 00 00 00 01 00 00 00  02 00 00 00 03 00 00 00  |................|
1686a020  04 00 00 00 05 00 00 00  01 00 00 00 02 00 00 00  |................|
1686a030  03 00 00 00 04 00 00 00  05 00 00 00 01 00 00 00  |................|
1686a040  02 00 00 00 03 00 00 00  04 00 00 00 05 00 00 00  |................|
1686a050  01 00 00 00 02 00 00 00  03 00 00 00 04 00 00 00  |................|
1686a060  05 00 00 00 01 00 00 00  02 00 00 00 03 00 00 00  |................|
1686a070  04 00 00 00 05 00 00 00  00 00 00 00 00 00 00 00  |................|
1686a080  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*



As we can see, `BT` stores the elements of each column of `B` contiguously, so its memory layout is suitable for streaming with DMA. Otherwise, we would have to send each element of `B` as a separate DMA transfer.
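
The two hexdumps can be reproduced without hardware. The sketch below packs both matrices as little-endian 32-bit words in row-major order, which is the layout visible in the dumps above. The helper names `row_major_bytes` and `first_words` and the `*_list` variables are hypothetical and only serve this illustration.

```python
import struct

M, N = 5, 6
B_list = [[k] * N for k in range(1, M + 1)]    # row k holds the value k, as in B
BT_list = [list(col) for col in zip(*B_list)]  # N rows of [1, 2, 3, 4, 5]

def row_major_bytes(matrix):
    """Pack a matrix as little-endian uint32 words in row-major (C) order."""
    flat = [value for row in matrix for value in row]
    return struct.pack(f"<{len(flat)}I", *flat)

def first_words(raw, count):
    """Decode the first `count` uint32 words of a byte string."""
    return list(struct.unpack_from(f"<{count}I", raw))

print(first_words(row_major_bytes(B_list), 8))   # [1, 1, 1, 1, 1, 1, 2, 2]
print(first_words(row_major_bytes(BT_list), 8))  # [1, 2, 3, 4, 5, 1, 2, 3]
```

In the packed form of `BT_list`, every run of five words is one complete column of `B`, which is what a single linear DMA transfer delivers.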

## 1.b Hardware execution

Now we can set the `mmul_v2` parameters and start the component using MMIO.

In [12]:
overlay.mmul_v2_0.mmio.write(0x10, L)    # rows of A
overlay.mmul_v2_0.mmio.write(0x18, M)    # columns of A = rows of B
overlay.mmul_v2_0.mmio.write(0x20, N)    # columns of B
overlay.mmul_v2_0.mmio.write(0x0, 0x01)  # start the component
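
Since the IP only supports matrices up to 64x64, it can be worth validating the dimensions before writing them. The following is a sketch only: the register offsets are copied from the cell above, while the constant names, the helper `configure_mmul`, and the assumption that every dimension must lie in 1..64 are hypothetical. A dictionary stands in for the register file so that the example runs without hardware.

```python
# Register offsets taken from the MMIO writes above; the names are hypothetical.
CTRL, DIM_L, DIM_M, DIM_N = 0x00, 0x10, 0x18, 0x20
MAX_DIM = 64  # assumption: "up to 64x64" limits each dimension to 64

def configure_mmul(write, L, M, N):
    """Validate the dimensions, program them, and start the component.

    `write` is any callable taking (offset, value),
    for example overlay.mmul_v2_0.mmio.write.
    """
    for name, dim in (("L", L), ("M", M), ("N", N)):
        if not 1 <= dim <= MAX_DIM:
            raise ValueError(f"{name}={dim} is outside 1..{MAX_DIM}")
    write(DIM_L, L)
    write(DIM_M, M)
    write(DIM_N, N)
    write(CTRL, 0x01)  # same start write as in the cell above

# Dry run against a dictionary standing in for the register file.
fake_regs = {}
configure_mmul(fake_regs.__setitem__, 4, 5, 6)
print(fake_regs)
```

On the board, the same helper could be called as `configure_mmul(overlay.mmul_v2_0.mmio.write, L, M, N)`.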

Stream `A` and `BT` to the IP and wait until the result has been streamed back into `CT`.

In [13]:
overlay.axi_dma_0.sendchannel.transfer(A)
overlay.axi_dma_0.sendchannel.wait()
overlay.axi_dma_0.sendchannel.transfer(BT)
overlay.axi_dma_0.sendchannel.wait()
overlay.axi_dma_0.recvchannel.transfer(CT)
overlay.axi_dma_0.recvchannel.wait()

Now we can check whether the hardware result differs from the regular software version (using `@`):

In [14]:
print(A@B)

[[15 15 15 15 15 15]
 [30 30 30 30 30 30]
 [45 45 45 45 45 45]
 [60 60 60 60 60 60]]


In [16]:
print(CT.T)

[[25 63 29 51]
 [ 0  0  0  0]
 [51 25 63 29]
 [ 0  0  0  0]
 [29 51 25 63]
 [ 0  0  0  0]]


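In a correct run, both results are identical. The output recorded above does not match, however: `CT.T` has shape 6x4 while `A@B` has shape 4x6, and the values differ. Since the cell numbers jump from 6 to 12, this recording is likely stale, and the comparison should be repeated after re-running the cells in order.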

# 2. Python (MMIO)

We can also do a stripped-down version of the PYNQ implementation. Since the IP cores are memory mapped, we can use the Unix `mmap` call to read and write the device registers directly.

In [None]:
def mmap(base_addr, length):
    import mmap, os
    euid = os.geteuid()
    if euid != 0:
        raise EnvironmentError('Root permissions required.')

    # Align the base address down to the start of its page
    virt_base = base_addr & ~(mmap.PAGESIZE - 1)

    # Offset of the requested address within the mapped page
    virt_offset = base_addr - virt_base

    # Open /dev/mem and map the region; the mapping stays valid after close
    mmap_file = os.open('/dev/mem', os.O_RDWR | os.O_SYNC)
    mem = mmap.mmap(mmap_file, length + virt_offset, mmap.MAP_SHARED, mmap.PROT_READ | mmap.PROT_WRITE, offset=virt_base)
    os.close(mmap_file)

    # View the mapped registers as an array of 32-bit words
    array = np.frombuffer(mem, np.uint32, length >> 2, virt_offset)

    return array

def write(array, offset, data):
    assert not offset % 4, "Unaligned write: offset must be multiple of 4!"
    i = offset >> 2
    array[i] = np.uint32(data)  # We assume the data is int
    
def read(array, offset):
    assert not offset % 4, "Unaligned read: offset must be multiple of 4!"
    i = offset >> 2
    return int(array[i]) & 0xffffffff
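
The page-alignment arithmetic and the 32-bit accesses can be tried without root permissions or hardware. In the sketch below, a zero-filled temporary file stands in for `/dev/mem`, and `struct` is used instead of NumPy for the word accesses. The helper `page_align` and the address `reg_addr` are hypothetical and chosen only for illustration.

```python
import mmap
import struct
import tempfile

def page_align(addr, page_size=mmap.PAGESIZE):
    """Split an address into a page-aligned base and the offset within that page."""
    base = addr & ~(page_size - 1)
    return base, addr - base

# A zero-filled temporary file stands in for /dev/mem (no root, no hardware).
with tempfile.TemporaryFile() as f:
    f.write(bytes(2 * mmap.PAGESIZE))
    f.flush()

    reg_addr = mmap.PAGESIZE + 0x10  # hypothetical "register" address
    base, offset = page_align(reg_addr)

    # mmap offsets must be page-aligned, hence mapping from `base`.
    with mmap.mmap(f.fileno(), offset + 4, offset=base) as mem:
        struct.pack_into("<I", mem, offset, 0xCAFE)       # 32-bit write
        value = struct.unpack_from("<I", mem, offset)[0]  # 32-bit read back

print(hex(base), hex(offset), hex(value))
```

The same split into `base` and `offset` is what `virt_base` and `virt_offset` express in the `mmap` helper above.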

In [None]:
mmul_array = mmap(overlay.mmul_v2_0.mmio.base_addr, overlay.mmul_v2_0.mmio.length)
dma_send_array = mmap(overlay.axi_dma_0.sendchannel._mmio.base_addr, overlay.axi_dma_0.sendchannel._mmio.length)
dma_recv_array = mmap(overlay.axi_dma_0.recvchannel._mmio.base_addr, overlay.axi_dma_0.recvchannel._mmio.length)

In [None]:
write(mmul_array, 0x10, L)
write(mmul_array, 0x18, M)
write(mmul_array, 0x20, N)
write(mmul_array, 0x00, 0x01)

In [None]:
CT[:] = 0
print(CT)

In [None]:
write(dma_send_array, 0x00, 0x0001)  # Start sendchannel
while read(dma_send_array, 0x04) & 0x01:  # Wait until the channel is no longer halted
    pass

write(dma_recv_array, 0x30, 0x0001)  # Start recvchannel
while read(dma_recv_array, 0x34) & 0x01:  # Wait until the channel is no longer halted
    pass

# Send A
A.flush()
write(dma_send_array, 0x18, A.physical_address & 0xffffffff)
write(dma_send_array, 0x1c, (A.physical_address >> 32) & 0xffffffff)
write(dma_send_array, 0x28, A.nbytes)
while not read(dma_send_array, 0x04) & 0x02:  # Wait for idle
    pass

# Send BT (the transposed copy of B allocated in section 1.a)
BT.flush()
write(dma_send_array, 0x18, BT.physical_address & 0xffffffff)
write(dma_send_array, 0x1c, (BT.physical_address >> 32) & 0xffffffff)
write(dma_send_array, 0x28, BT.nbytes)
while not read(dma_send_array, 0x04) & 0x02:  # Wait for idle
    pass

# Receive CT (the result buffer allocated in section 1.a)
write(dma_recv_array, 0x48, CT.physical_address & 0xffffffff)
write(dma_recv_array, 0x4c, (CT.physical_address >> 32) & 0xffffffff)
write(dma_recv_array, 0x58, CT.nbytes)
while not read(dma_recv_array, 0x34) & 0x02:  # Wait for idle
    pass
CT.invalidate()  # Make sure the CPU does not read stale cached data

In [None]:
print(CT)
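
The polling loops above spin forever if the DMA never reaches the expected state, for example when the IP was not started. A bounded wait is one possible safeguard. The sketch below assumes the status bits used in the loops above (bit 0 for halted, bit 1 for idle). The constant names and the helper `wait_for` are hypothetical, and a fake status source replaces the real register read so that the example runs anywhere.

```python
import time

# Status bits as used by the polling loops above; the names are hypothetical.
HALTED = 0x01
IDLE = 0x02

def wait_for(read_status, mask, expected, timeout_s=1.0, poll_s=0.001):
    """Poll `read_status()` until (status & mask) == expected.

    Raises TimeoutError instead of spinning forever if the condition
    is not met within `timeout_s` seconds.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if read_status() & mask == expected:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"status & {mask:#x} never became {expected:#x}")
        time.sleep(poll_s)

# Dry run: a fake status source that reports "idle" on the third read.
fake_status = iter([0x00, 0x00, IDLE])
wait_for(lambda: next(fake_status), IDLE, IDLE)
print("idle reached")
```

With the helpers from this section, waiting for the send channel to become idle could then be written as `wait_for(lambda: read(dma_send_array, 0x04), IDLE, IDLE)`.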