# MMIO example: DMA loopback

This notebook illustrates the use of MMIO to communicate with the programmable logic (PL) of the PYNQ-Z2 board. This can be done at different levels of abstraction and speed: (i) using PYNQ, (ii) using MMIO in Python, (iii) using MMIO in C++ with pybind11 and (iv) using MMIO in C with CFFI. The example block design consists of a DMA component in a loopback configuration, which means that the DMA can offload a ``memcpy`` from the CPU to the PL.

First, we load the bitstream, allocate the source and destination buffers ``A`` and ``B`` and fill the source buffer with random data:

In [2]:
from pynq import Overlay, allocate
import numpy as np

overlay = Overlay("/home/xilinx/overlays/dma_loopback_1.bit")
A = allocate(shape=(2**5,), dtype="u4")
B = allocate(shape=(2**5,), dtype="u4")
A[:] = np.random.randint(100000, size=A.shape)
print(A)

[21189 87074 76249 52193   913 32379 59394 58368  9113 59281  8695 78441
 52855 12167  2119 56823 33859 18304 87992 10914 66148 24318  2076 65445
 57213 96738 88649 38616 92120 38474 55263 61291]


# (i) PYNQ

First we illustrate the ``memcpy`` functionality in pure PYNQ. We clear the destination buffer ``B`` to make sure the copy takes place:

In [3]:
B[:] = 0
print(B)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


We see that the data is successfully copied into ``B`` when the DMA transfers are done:

In [4]:
overlay.axi_dma_0.sendchannel.transfer(A)
overlay.axi_dma_0.recvchannel.transfer(B)
overlay.axi_dma_0.recvchannel.wait()
print(B)

[21189 87074 76249 52193   913 32379 59394 58368  9113 59281  8695 78441
 52855 12167  2119 56823 33859 18304 87992 10914 66148 24318  2076 65445
 57213 96738 88649 38616 92120 38474 55263 61291]


# (ii) MMIO in Python

We can speed up the PYNQ implementation by stripping off some of its checks and generic functionality and accessing the DMA registers directly through `/dev/mem`. For more information on mmap'ing `/dev/mem`, see [this StackExchange thread](https://unix.stackexchange.com/questions/167948/how-does-mmaping-dev-mem-work-despite-being-from-unprivileged-mode).

In [5]:
def mmap(base_addr, length):
    import mmap, os
    # Align the base address with the pages
    virt_base = base_addr & ~(mmap.PAGESIZE - 1)

    # Calculate the offset of the base address w.r.t. the page-aligned address
    virt_offset = base_addr - virt_base

    # Open file and mmap
    mmap_file = os.open('/dev/mem', os.O_RDWR | os.O_SYNC)
    mem = mmap.mmap(mmap_file, length + virt_offset, mmap.MAP_SHARED, mmap.PROT_READ | mmap.PROT_WRITE, offset=virt_base)
    os.close(mmap_file)
    array = np.frombuffer(mem, np.uint32, length >> 2, virt_offset)
    
    return array

def write(array, offset, data):
    assert not offset % 4, "Unaligned write: offset must be multiple of 4!"
    i = offset >> 2
    array[i] = np.uint32(data)  # We assume the data is uint32
    
def read(array, offset):
    assert not offset % 4, "Unaligned read: offset must be multiple of 4!"
    i = offset >> 2
    return int(array[i]) & 0xffffffff
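
To make the page-alignment arithmetic in ``mmap`` concrete, the following sketch reproduces only the address computation, without touching `/dev/mem`. The address `0x40400010` is a made-up, deliberately unaligned value, the page size is fixed to 4096 bytes for the example, and the helper name `page_align` is hypothetical:

```python
import mmap

def page_align(base_addr, page_size=mmap.PAGESIZE):
    """Split an address into (page-aligned base, offset into that page)."""
    virt_base = base_addr & ~(page_size - 1)
    virt_offset = base_addr - virt_base
    return virt_base, virt_offset

# Hypothetical, deliberately unaligned address (not taken from the overlay)
virt_base, virt_offset = page_align(0x40400010, page_size=4096)
print(hex(virt_base), hex(virt_offset))  # 0x40400000 0x10
```

The mapping then has to cover `length + virt_offset` bytes, which is why the ``mmap`` function above adds the offset to the requested length.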

Again, we clear the ``B`` buffer to make sure the ``memcpy`` is successful:

In [6]:
B[:] = 0
print(B)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


We ``mmap`` the send and receive channels of the DMA component separately:

In [7]:
dma_send_vaddr = mmap(overlay.axi_dma_0.sendchannel._mmio.base_addr, overlay.axi_dma_0.sendchannel._mmio.length)
dma_recv_vaddr = mmap(overlay.axi_dma_0.recvchannel._mmio.base_addr, overlay.axi_dma_0.recvchannel._mmio.length)

We start both channels, write the buffer addresses and lengths to the DMA registers, and see that the data is successfully copied into ``B`` when the DMA transfers are done:

In [8]:
write(dma_send_vaddr, 0x00, 0x0001)  # Start sendchannel (run/stop bit)
while (read(dma_send_vaddr, 0x04) & 0x01) != 0x00:  # Wait until the halted bit clears
    pass

write(dma_recv_vaddr, 0x30, 0x0001)  # Start recvchannel (run/stop bit)
while (read(dma_recv_vaddr, 0x34) & 0x01) != 0x00:  # Wait until the halted bit clears
    pass

# Send A
A.flush()  # Make sure the data is in memory before the DMA reads it
write(dma_send_vaddr, 0x18, A.physical_address & 0xffffffff)          # Source address, low word
write(dma_send_vaddr, 0x1c, (A.physical_address >> 32) & 0xffffffff)  # Source address, high word
write(dma_send_vaddr, 0x28, A.nbytes)  # Writing the length starts the transfer
while (read(dma_send_vaddr, 0x04) & 0x02) != 0x02:  # Wait for idle
    pass

# Receive B
write(dma_recv_vaddr, 0x48, B.physical_address & 0xffffffff)          # Destination address, low word
write(dma_recv_vaddr, 0x4c, (B.physical_address >> 32) & 0xffffffff)  # Destination address, high word
write(dma_recv_vaddr, 0x58, B.nbytes)  # Writing the length starts the transfer
while (read(dma_recv_vaddr, 0x34) & 0x02) != 0x02:  # Wait for idle
    pass

print(B)

[21189 87074 76249 52193   913 32379 59394 58368  9113 59281  8695 78441
 52855 12167  2119 56823 33859 18304 87992 10914 66148 24318  2076 65445
 57213 96738 88649 38616 92120 38474 55263 61291]
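
The busy-wait loops above spin forever if the DMA never reaches the expected state. A minimal sketch of a bounded wait follows. The helper name `wait_for_bit`, its `timeout` parameter and the plain NumPy array that stands in for the mapped register space are illustrative assumptions; the offsets and masks are the ones used in the cell above.

```python
import time
import numpy as np

DMASR_HALTED = 0x01  # status bit 0: channel is halted
DMASR_IDLE = 0x02    # status bit 1: channel is idle

def wait_for_bit(regs, offset, mask, expect_set, timeout=1.0):
    """Poll the 32-bit register at byte `offset` until `mask` is set or cleared."""
    deadline = time.monotonic() + timeout
    while True:
        is_set = bool(int(regs[offset >> 2]) & mask)
        if is_set == expect_set:
            return
        if time.monotonic() > deadline:
            raise TimeoutError(f"register 0x{offset:02x} stuck (mask 0x{mask:x})")

# Stand-in for a mapped register space; on the board this would be
# the array returned by the mmap() helper defined earlier.
fake_regs = np.zeros(32, dtype=np.uint32)
fake_regs[0x04 >> 2] = DMASR_IDLE  # pretend the send channel is running and idle

wait_for_bit(fake_regs, 0x04, DMASR_HALTED, expect_set=False)  # channel started
wait_for_bit(fake_regs, 0x04, DMASR_IDLE, expect_set=True)     # transfer finished
```

With real registers, a `TimeoutError` here would point to a channel that never started or a transfer that never completed, instead of a hung notebook kernel.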


In [9]:
write(dma_send_vaddr, 0x18, A.physical_address & 0xffffffff)
write(dma_send_vaddr, 0x1c, (A.physical_address >> 32) & 0xffffffff)
print(A.physical_address)
print(read(dma_send_vaddr, 0x18))

377786368
377786368
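
The two writes above split the physical address of the buffer into a low and a high 32-bit word. The address printed above fits in 32 bits, so its high word is zero. The sketch below shows the split for that value and for an invented address above 4 GiB; the helper name `split_address` and the second address are assumptions made purely for illustration.

```python
def split_address(addr):
    """Return the (low, high) 32-bit words of a physical address."""
    return addr & 0xffffffff, (addr >> 32) & 0xffffffff

low, high = split_address(377786368)      # value printed above
print(hex(low), hex(high))                # 0x16849000 0x0

low, high = split_address(0x8_1684_9000)  # hypothetical address above 4 GiB
print(hex(low), hex(high))                # 0x16849000 0x8
```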


# (iii) C++ (`pybind11`, `mmap`)

We can now write a C++ extension using ``pybind11`` that does the ``mmap``'ing directly in C++ and to which we pass the address of the DMA and the physical addresses of the buffers. First we clear the destination buffer and verify that the ``A`` and ``B`` buffers can be accessed through their physical addresses by inspecting a dump of the physical memory:

In [9]:
B[:] = 0
print(B)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


In [10]:
import subprocess
cmd = f"hexdump -C -s {A.physical_address} /dev/mem | head"
print(subprocess.check_output(cmd, shell=True).decode("utf-8"))

1684b000  0e 1c 00 00 7f 8a 00 00  6e 12 01 00 e3 56 01 00  |........n....V..|
1684b010  10 5f 00 00 64 d0 00 00  36 70 01 00 5d 83 00 00  |._..d...6p..]...|
1684b020  08 09 01 00 98 ca 00 00  9c 2e 00 00 e9 5c 00 00  |.............\..|
1684b030  ee f5 00 00 7f 00 00 00  30 71 01 00 5f 10 00 00  |........0q.._...|
1684b040  9c 50 00 00 6f db 00 00  fc 56 00 00 13 2c 00 00  |.P..o....V...,..|
1684b050  e5 11 00 00 92 30 00 00  96 ba 00 00 fa 50 01 00  |.....0.......P..|
1684b060  b1 32 00 00 12 08 01 00  02 5f 00 00 e9 49 01 00  |.2......._...I..|
1684b070  7c 2d 01 00 b9 5b 01 00  74 23 01 00 54 60 00 00  ||-...[..t#..T`..|
1684b080  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
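
The dump is little-endian: each group of four bytes is one ``uint32`` element with the least significant byte first. As a sketch, the first row of the dump above can be decoded with NumPy to check that the bytes are plausible buffer contents, that is, values below 100000 as generated by `np.random.randint`. The byte string is copied from the output above.

```python
import numpy as np

# First 16 bytes of the hexdump of A shown above
row = bytes.fromhex("0e 1c 00 00 7f 8a 00 00 6e 12 01 00 e3 56 01 00")

# Interpret them as little-endian unsigned 32-bit integers
values = np.frombuffer(row, dtype="<u4")
print(values.tolist())  # [7182, 35455, 70254, 87779]
```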



In [11]:
import subprocess
cmd = f"hexdump -C -s {B.physical_address} /dev/mem | head"
print(subprocess.check_output(cmd, shell=True).decode("utf-8"))

1684c000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
1684d000  17 00 00 00 00 00 00 00  e0 f4 97 b3 00 00 00 00  |................|
1684d010  e0 81 86 b6 f8 81 86 b6  00 00 00 00 46 62 01 00  |............Fb..|
1684d020  00 00 19 00 0a 00 00 00  ff ff ff ff 00 00 00 00  |................|
1684d030  00 00 00 00 00 00 00 00  60 7f 98 b3 00 00 00 00  |........`.......|
1684d040  00 7f 98 b3 00 00 00 00  00 00 00 00 00 00 00 00  |................|
1684d050  88 b5 98 b3 00 00 00 00  e0 e1 c5 b3 90 33 96 b6  |.............3..|
1684d060  0a 00 10 00 00 01 00 00  40 28 98 b3 00 00 00 00  |........@(......|
1684d070  d8 0f 85 b6 00 80 86 b6  00 00 00 00 47 62 01 00  |............Gb..|



In [13]:
%%pybind11 mmap_cpp_dma_2_x

#include <unistd.h>
#include <fcntl.h>
#include <termios.h>
#include <sys/mman.h>

#define MM2S_DMACR 0x00
#define MM2S_DMACR_RS 0x00000001
#define MM2S_DMACR_Reset 0x00000004
#define MM2S_DMASR 0x04
#define MM2S_DMASR_Halted 0x00000001
#define MM2S_DMASR_Idle 0x00000002
#define MM2S_SA 0x18
#define MM2S_SA_MSB 0x1c
#define MM2S_LENGTH 0x28

#define S2MM_DMACR 0x30
#define S2MM_DMACR_RS 0x00000001
#define S2MM_DMACR_Reset 0x00000004
#define S2MM_DMASR 0x34
#define S2MM_DMASR_Halted 0x00000001
#define S2MM_DMASR_Idle 0x00000002
#define S2MM_DA 0x48
#define S2MM_DA_MSB 0x4c
#define S2MM_LENGTH 0x58

#define dma_get(x) DMA_VADDR[(x) >> 2]
#define dma_set(x, y) (DMA_VADDR[(x) >> 2] = (y))

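void transfer(unsigned int dma_addr, unsigned int A_addr, unsigned int B_addr, unsigned int dma_length)
{
    // Map the DMA registers (mmap works in whole pages, so we map 64 KiB)
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    volatile unsigned int *DMA_VADDR = (volatile unsigned int *) mmap(NULL, 65536, PROT_READ | PROT_WRITE, MAP_SHARED, fd, dma_addr);

    // Reset both channels
    dma_set(S2MM_DMACR, S2MM_DMACR_Reset);
    dma_set(MM2S_DMACR, MM2S_DMACR_Reset);

    // Start both channels
    dma_set(S2MM_DMACR, S2MM_DMACR_RS);
    dma_set(MM2S_DMACR, MM2S_DMACR_RS);

    // Source address and length; writing the length starts the transfer
    dma_set(MM2S_SA, A_addr);
    dma_set(MM2S_LENGTH, dma_length);

    // Destination address and length; writing the length starts the transfer
    dma_set(S2MM_DA, B_addr);
    dma_set(S2MM_LENGTH, dma_length);

    // Busy-wait until both channels are idle
    while(!(dma_get(MM2S_DMASR) & MM2S_DMASR_Idle));
    while(!(dma_get(S2MM_DMASR) & S2MM_DMASR_Idle));

    // Release the mapping and the file descriptor
    munmap((void *) DMA_VADDR, 65536);
    close(fd);
}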

We now clear the ``B`` buffer and run the actual C++ extension that starts the DMA transfer. We see that the data is successfully copied into ``B`` when the DMA transfers are done:

In [14]:
B[:] = 0
print(B)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


In [15]:
import mmap_cpp_dma_2_x as dma

dma.transfer(overlay.axi_dma_0.sendchannel._mmio.base_addr, A.physical_address, B.physical_address, A.nbytes)

print(B)

[21189 87074 76249 52193   913 32379 59394 58368  9113 59281  8695 78441
 52855 12167  2119 56823 33859 18304 87992 10914 66148 24318  2076 65445
 57213 96738 88649 38616 92120 38474 55263 61291]
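
To quantify the speedup between the variants, each transfer can be wrapped in a small timing helper. The sketch below is one way to measure it and is not part of the original experiments: `time_call` is a hypothetical helper, and `np.copyto` on ordinary NumPy arrays stands in for the DMA transfer so that the snippet also runs without the board.

```python
import time
import numpy as np

def time_call(fn, repeats=100):
    """Return the best wall-clock time in seconds of `fn()` over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

# Stand-ins for the buffers; on the board these would be A and B
src = np.arange(32, dtype="u4")
dst = np.zeros_like(src)

elapsed = time_call(lambda: np.copyto(dst, src))
print(f"best of 100 runs: {elapsed * 1e6:.1f} us")
```

On the board, the lambda would instead wrap one of the transfers shown in this notebook, for example the `dma.transfer(...)` call above with the same arguments.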


# (iv) C (``CFFI``, ``mmap``)

In [15]:
from cffi import FFI
ffibuilder = FFI()

ffibuilder.cdef("void transfer(unsigned int, unsigned int, unsigned int, unsigned int);")

ffibuilder.set_source("_example",
r"""
#include <unistd.h>
#include <fcntl.h>
#include <termios.h>
#include <sys/mman.h>

#define MM2S_DMACR 0x00
#define MM2S_DMACR_RS 0x00000001
#define MM2S_DMACR_Reset 0x00000004
#define MM2S_DMASR 0x04
#define MM2S_DMASR_Halted 0x00000001
#define MM2S_DMASR_Idle 0x00000002
#define MM2S_SA 0x18
#define MM2S_SA_MSB 0x1c
#define MM2S_LENGTH 0x28

#define S2MM_DMACR 0x30
#define S2MM_DMACR_RS 0x00000001
#define S2MM_DMACR_Reset 0x00000004
#define S2MM_DMASR 0x34
#define S2MM_DMASR_Halted 0x00000001
#define S2MM_DMASR_Idle 0x00000002
#define S2MM_DA 0x48
#define S2MM_DA_MSB 0x4c
#define S2MM_LENGTH 0x58

#define dma_get(x) DMA_VADDR[(x) >> 2]
#define dma_set(x, y) (DMA_VADDR[(x) >> 2] = (y))

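void transfer(unsigned int dma_addr, unsigned int A_addr, unsigned int B_addr, unsigned int dma_length)
{
    /* Map the DMA registers (mmap works in whole pages, so we map 64 KiB) */
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    volatile unsigned int *DMA_VADDR = (volatile unsigned int *) mmap(NULL, 65536, PROT_READ | PROT_WRITE, MAP_SHARED, fd, dma_addr);

    /* Reset both channels */
    dma_set(S2MM_DMACR, S2MM_DMACR_Reset);
    dma_set(MM2S_DMACR, MM2S_DMACR_Reset);

    /* Start both channels */
    dma_set(S2MM_DMACR, S2MM_DMACR_RS);
    dma_set(MM2S_DMACR, MM2S_DMACR_RS);

    /* Source address and length; writing the length starts the transfer */
    dma_set(MM2S_SA, A_addr);
    dma_set(MM2S_LENGTH, dma_length);

    /* Destination address and length; writing the length starts the transfer */
    dma_set(S2MM_DA, B_addr);
    dma_set(S2MM_LENGTH, dma_length);

    /* Busy-wait until both channels are idle */
    while(!(dma_get(MM2S_DMASR) & MM2S_DMASR_Idle));
    while(!(dma_get(S2MM_DMASR) & S2MM_DMASR_Idle));

    /* Release the mapping and the file descriptor */
    munmap((void *) DMA_VADDR, 65536);
    close(fd);
}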
""")

if __name__ == "__main__":
    ffibuilder.compile(verbose=True)

generating ./_example.c
the current directory is '/home/xilinx/jupyter_notebooks/thesis'
running build_ext
building '_example' extension
arm-linux-gnueabihf-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fdebug-prefix-map=/build/python3.7-2QTFw6/python3.7-3.7.0~b3=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/home/xilinx/perf_env/include -I/usr/include/python3.7m -c _example.c -o ./_example.o
arm-linux-gnueabihf-gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fdebug-prefix-map=/build/python3.7-2QTFw6/python3.7-3.7.0~b3=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 ./_example.o -o ./_example.cpython-37m-arm-linux-gnueabihf.so


In [16]:
B[:] = 0
print(B)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
