
Question: Copy framebuffer /dev/fb0 to main memory #2

Open
martin19 opened this issue Apr 22, 2021 · 9 comments

@martin19

martin19 commented Apr 22, 2021

Hi,

I'd like to copy the framebuffer contents (/dev/fb0) to main memory on raspbian using memcpy_dma_config.
Could you please give me a hint on if this library is appropriate for the task and how to set it up?
I've figured there are different methods of copying (vcsm, mailbox,...) and different addresses.

I've tried to set it up but I could not figure out how to use the api correctly. Do I need to allocate using
rpimemmgr_alloc_vcsm ? Which addresses do I provide at which places?

Kind regards,
Martin

@Terminus-IMRC
Contributor

Thank you for the interest!

There are two types of memory address here: the virtual address and the bus address, which must not be confused.
The virtual address is a pointer used by a userland process on the CPU.
The bus address is a 32-bit integer used by the other blocks, including DMA.
So you need to pass the bus address to the DMA calls, and use the virtual address to edit the contents in userland.

Yes: because the framebuffer is essentially a memory area allocated by the firmware or the kernel, this library can be used to copy it.
The bcm2708_fb driver stores the bus address of that area in struct fb_fix_screeninfo, which can be obtained through the FBIOGET_FSCREENINFO ioctl.
The size of the actual image can be obtained through the FBIOGET_VSCREENINFO ioctl.

An example program that inverts the colors in the framebuffer with DMA copies is here:


#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#include <fcntl.h>
#include <linux/fb.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/types.h>

#include <rpicopy.h>
#include <rpimemmgr.h>

int main(void) {
    int err;
    int fd;
    uint8_t *image;
    uint32_t image_bus;
    struct fb_fix_screeninfo fscr;
    struct fb_var_screeninfo vscr;
    struct rpimemmgr memmgr;

    fd = open("/dev/fb0", O_RDONLY);
    if (fd == -1) {
        perror("open(/dev/fb0)");
        exit(EXIT_FAILURE);
    }

    err = ioctl(fd, FBIOGET_FSCREENINFO, &fscr);
    if (err) {
        perror("ioctl(FBIOGET_FSCREENINFO)");
        exit(EXIT_FAILURE);
    }

    printf("ID: %.16s\n", fscr.id);
    printf("smem_start: %#010lx\n", fscr.smem_start);
    printf("smem_len: %u\n", fscr.smem_len);
    printf("line_length: %u\n", fscr.line_length);

    err = ioctl(fd, FBIOGET_VSCREENINFO, &vscr);
    if (err) {
        perror("ioctl(FBIOGET_VSCREENINFO)");
        exit(EXIT_FAILURE);
    }

    printf("xres: %u\n", vscr.xres);
    printf("yres: %u\n", vscr.yres);

    err = rpimemmgr_init(&memmgr);
    if (err) {
        printf("error: rpimemmgr_init: %d\n", err);
        exit(EXIT_FAILURE);
    }

    err = rpimemmgr_alloc_vcsm(fscr.smem_len, 4096, VCSM_CACHE_TYPE_NONE,
                               (void **)&image, &image_bus, &memmgr);
    if (err) {
        printf("error: rpimemmgr_alloc_vcsm: %d\n", err);
        exit(EXIT_FAILURE);
    }

    err = rpicopy_init();
    if (err) {
        printf("error: rpicopy_init: %d\n", err);
        exit(EXIT_FAILURE);
    }

    /* Capture the framebuffer. */
    (void) memcpy_dma(image_bus, fscr.smem_start, fscr.smem_len);

    /* Invert the colors. */
    for (uint32_t i = 0; i < fscr.smem_len; ++i) image[i] = ~image[i];

    /* Copy the image back to framebuffer. */
    (void) memcpy_dma(fscr.smem_start, image_bus, fscr.smem_len);

    err = rpicopy_finalize();
    if (err) {
        printf("error: rpicopy_finalize: %d\n", err);
        exit(EXIT_FAILURE);
    }

    err = rpimemmgr_free_by_usraddr(image, &memmgr);
    if (err) {
        printf("error: rpimemmgr_free_by_usraddr: %d\n", err);
        exit(EXIT_FAILURE);
    }

    err = rpimemmgr_finalize(&memmgr);
    if (err) {
        printf("error: rpimemmgr_finalize: %d\n", err);
        exit(EXIT_FAILURE);
    }

    return 0;
}

As another option, you can use the FBIODMACOPY ioctl that the bcm2708_fb driver supports.
Because the driver restricts the memory range, it cannot be used to copy to the framebuffer.
Here is an example of outputting the framebuffer into a PNG image with the libstb-dev package:


#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#include <fcntl.h>
#include <linux/fb.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#include <rpimemmgr.h>

#define STB_IMAGE_WRITE_IMPLEMENTATION
#include <stb/stb_image_write.h>

int main(void) {
    int err;

    const int fd = open("/dev/fb0", O_RDONLY);
    if (fd == -1) {
        perror("open(/dev/fb0)");
        exit(EXIT_FAILURE);
    }

    struct fb_fix_screeninfo fscr;
    err = ioctl(fd, FBIOGET_FSCREENINFO, &fscr);
    if (err) {
        perror("ioctl(FBIOGET_FSCREENINFO)");
        exit(EXIT_FAILURE);
    }

    printf("ID: %.16s\n", fscr.id);
    printf("smem_start: %#010lx\n", fscr.smem_start);
    printf("smem_len: %u\n", fscr.smem_len);
    printf("line_length: %u\n", fscr.line_length);

    struct fb_var_screeninfo vscr;
    err = ioctl(fd, FBIOGET_VSCREENINFO, &vscr);
    if (err) {
        perror("ioctl(FBIOGET_VSCREENINFO)");
        exit(EXIT_FAILURE);
    }

    printf("xres: %u\n", vscr.xres);
    printf("yres: %u\n", vscr.yres);

    struct rpimemmgr memmgr;
    err = rpimemmgr_init(&memmgr);
    if (err) {
        printf("error: rpimemmgr_init: %d\n", err);
        exit(EXIT_FAILURE);
    }

    uint8_t *image;
    uint32_t image_bus;
    err = rpimemmgr_alloc_vcsm(fscr.smem_len, 4096, VCSM_CACHE_TYPE_NONE,
                               (void **)&image, &image_bus, &memmgr);
    if (err) {
        printf("error: rpimemmgr_alloc_vcsm: %d\n", err);
        exit(EXIT_FAILURE);
    }

    /* Capture the framebuffer. */
    const struct fb_dmacopy dma = {
        .dst = image,
        .src = fscr.smem_start,
        .length = fscr.smem_len,
    };
    err = ioctl(fd, FBIODMACOPY, &dma);
    if (err) {
        perror("ioctl(FBIODMACOPY)");
        exit(EXIT_FAILURE);
    }

    stbi_write_png("out.png", vscr.xres, vscr.yres, 4, image, fscr.line_length);

    err = rpimemmgr_free_by_usraddr(image, &memmgr);
    if (err) {
        printf("error: rpimemmgr_free_by_usraddr: %d\n", err);
        exit(EXIT_FAILURE);
    }

    err = rpimemmgr_finalize(&memmgr);
    if (err) {
        printf("error: rpimemmgr_finalize: %d\n", err);
        exit(EXIT_FAILURE);
    }

    return 0;
}

Please note that the framebuffer is placed onto a DispmanX layer, so some contents (e.g. omxplayer output) cannot be captured through it.
To capture those, Adafruit's tftcp uses the vc_dispmanx_snapshot and vc_dispmanx_resource_read_data APIs; the work is essentially done by the VPU, so it is not a heavy task for the CPU.

In summary, if you want to write to the framebuffer, then take the first option.
If you want to capture the DispmanX contents, then take the third option.
Otherwise, take the second option.

@martin19
Author

Hi,
thank you for your thorough explanations and example!
Please let me explain my application requirements a bit more: I'd like to pass the framebuffer contents (the X11 output) into an MMAL chain for compression, as fast as possible.

I've tried several approaches to get the best performance:

  1. using Xvfb (which creates an in-RAM framebuffer) and passing this framebuffer directly in an MMAL buffer header. The performance is very good here; however, Xvfb does not support hardware acceleration, so the X11 server itself is not really fast. Thus I'm looking to copy the accelerated X11 output, which is in /dev/fb0
  2. copying from a memory-mapped framebuffer (mmap for /dev/fb0) to an MMAL buffer header using memcpy
    -> the memcpy is the bottleneck here; the MMAL chain can process data much faster than it can be copied into the MMAL buffer header
  3. using vc_dispmanx_snapshot -> performance is better than with memcpy, but the MMAL chain could still process data faster than it can be grabbed with vc_dispmanx_snapshot

I think the following could work, what do you think?
MMAL supports mapping its input buffers from GPU-space to user-space through setting a parameter on the input port:
mmal_port_parameter_set_boolean(port, MMAL_PARAMETER_ZERO_COPY, MMAL_TRUE);
By using librpicopy I get fast rates (like 1080p@80Hz) in my tests. Can I copy with librpicopy from /dev/fb0 to this mapped input buffer so that MMAL consumes it right after the DMA transfer has finished? The DMA would, in theory, copy from one GPU buffer to another.
Essentially this could all run on the VideoCore after the setup; user-space would not need to be involved at all. However, I don't know whether MMAL can be configured to consume data from GPU space without mapping it to user-space first.

@Terminus-IMRC
Contributor

Thank you for the explanation!

I'm surprised that librpicopy is the fastest option.
librpicopy uses as many DMA channels as possible for copying, which I think is why it achieves such high performance.
(However, please note that some channels conflict with drivers depending on your device tree configuration, because the current channel selection is optimistic!)

Yes, I think your idea is feasible.
I heard four years ago that there is no mechanism to use a DispmanX resource as MMAL input as-is, so we are required to copy the content manually.
There is a Mailbox call that can obtain the mem_handle of a DispmanX resource, whose bus address can then be obtained through the "lock memory" Mailbox call.
Enabling ZERO_COPY on an input port makes its buffers be allocated with VCSM, whose bus address can be obtained with vcsm_vc_hdl_from_ptr (see https://www.raspberrypi.org/forums/viewtopic.php?t=167652 ).
Now that we have the two bus addresses, all that remains to do is:

  1. copy the DispmanX content into the buffer with librpicopy,
  2. set MMAL_BUFFER_HEADER_FLAG_EOS to the buffer flag,
  3. notify the MMAL component that the buffer is ready by mmal_port_send_buffer (my example code may help).

I'll try the above to see if it really works.

@martin19
Author

I've managed to set up the DMA transfer with librpicopy. I'm now capturing the X11 desktop by copying the framebuffer contents and DMA'ing them to the MMAL input buffer for H.264 compression. I think I've found exactly the forum post you're referencing and implemented it, after a lot of searching!

You're right: it seems there are many things interfering with the DMA transfers. With the default settings (channels {1, 4, 5, 6, 0}) I couldn't even log in to the desktop while capturing. After a lot of fiddling I've found a configuration which mostly works (channels 5 & 6 with burst = 1). The capturing runs at around 1080p25fps. htop tells me one ARM core is at 99% during the DMA transfer, which I don't really understand, because DMA is supposed to work without the CPU, right? I'm still trying to improve this but haven't managed to get better than this.

I'd still like to find a zero-copy path for this setup: both the framebuffer and the MMAL input are on the VideoCore, so maybe it is possible to pass the framebuffer address to MMAL? There is an allocator function in MMAL_PORT_PRIVATE_T which allocates shared memory for the zero-copy path. Could this perhaps be used as an entry point to return the framebuffer memory directly instead of allocating shared memory?

@Terminus-IMRC
Contributor

Good!
The reason behind the high CPU usage seems to be that the current code busy-waits for DMA completion.
I should improve this library by modifying it to consult /sys/class/dma to list the available DMA channels, and by adding another API to support an asynchronous DMA copy & wait.

I realized that the snapshot achieves 359 frames/sec at 1366x768 on my Raspberry Pi 3 (182 frames/sec even with a copy to the CPU side) with this code:

#include <cassert>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

#include <bcm_host.h>

int main(void) {
    const std::size_t count = 100;

    bcm_host_init();

    const DISPMANX_DISPLAY_HANDLE_T display = vc_dispmanx_display_open(0);
    DISPMANX_MODEINFO_T modeinfo;
    assert(vc_dispmanx_display_get_info(display, &modeinfo) == 0);

    std::cout << "width: " << modeinfo.width << std::endl;
    std::cout << "height: " << modeinfo.height << std::endl;

    std::uint32_t native_image_handle = 0; // out-parameter, unused here
    const DISPMANX_RESOURCE_HANDLE_T resource = vc_dispmanx_resource_create(
        VC_IMAGE_RGBA32, modeinfo.width, modeinfo.height, &native_image_handle);
    VC_RECT_T rect;
    vc_dispmanx_rect_set(&rect, 0, 0, modeinfo.width, modeinfo.height);
    std::vector<std::uint32_t> data;
    data.resize(modeinfo.width * modeinfo.height);

    const auto start = std::chrono::steady_clock::now();

    for (auto i(count); i != 0; --i) {
        assert(vc_dispmanx_snapshot(display, resource, DISPMANX_NO_ROTATE) ==
               0);
        // dst_pitch is in bytes: 4 bytes per RGBA32 pixel
        assert(vc_dispmanx_resource_read_data(resource, &rect, data.data(),
                                              modeinfo.width * 4) == 0);
    }

    const auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> t = end - start;

    std::cout << t.count() << " seconds" << std::endl;
    std::cout << count / t.count() << " frame/second" << std::endl;

    assert(vc_dispmanx_resource_delete(resource) == 0);
    assert(vc_dispmanx_display_close(display) == 0);
    bcm_host_deinit();
    return 0;
}

How about the performance on your Pi?

I did not notice the pf_payload_alloc/free internal APIs, which look good to me!
I guessed that some code seemed to assume that the memory is allocated with vc_shm (because it calls mmal_vc_shm_lock/unlock), but it turns out that a request to lock/unlock a non-vc_shm address is simply ignored.

@martin19
Author

Hi, I wonder how you achieve these transfer rates with DispmanX:

This is the output of your benchmark:

width: 1920
height: 1080
2.63577 seconds
37.9395 frame/second

width: 1366
height: 768
14.9739 seconds
66.783 frame/second

My pi is "Pi 3 model B V1.2".
In total I'm getting around 15 fps at 1080p for the dispmanx->userspace->mmal(h264) path. The DMA approach is about twice as fast; however, it feels less "stable" while playing with different X11 applications. I've quickly added some usleep(500) in the busy wait, which helps as a quick workaround for the 99% CPU usage.

I'll see if it is possible to inject the framebuffer's address into MMAL through the private API mentioned. I'm still trying to figure out what kind of address is returned by pf_payload_alloc and how to get such an address for the DispmanX resource/framebuffer.

@Terminus-IMRC
Contributor

Thank you for the report.
I once suspected that force_turbo=1 in my config.txt boosted the performance, but it turned out it did not.
Essentially my config.txt contains only hdmi_group=2 and hdmi_mode=81, and no relevant options are added to cmdline.txt.
What is the output of the command vcgencmd get_throttled? It should be throttled=0x0.
Did you try to capture the frame without X? (I'm failing to launch Xorg on my Pi 3...)

@martin19
Author

Wow, you're right: it seems my Pi is throttled.

vcgencmd get_throttled: the first number indicates whether it has ever been throttled. 0x50000 means throttled at least once due to low voltage, and 0x50005 means currently throttled due to low voltage.

I'll try benchmarking again :)

@martin19
Author

martin19 commented May 3, 2021

If X11 is not running I'm getting better readout rates, but unfortunately this will not help, as I want to read out the X11 desktop. I've examined several other ways of capturing X11 in the meantime (https://www.raspberrypi.org/forums/viewtopic.php?f=67&t=310179); what looks most promising to me is the EXT_image_dma_buf_import extension, which can read the desktop into a GL texture really quickly. Exporting should work as well. Thanks for your help!
