cog + fdo backend on iMX8 crashes after some time due to too many open file descriptors #192

Closed
gvalcaza opened this issue Jul 8, 2020 · 7 comments

@gvalcaza

gvalcaza commented Jul 8, 2020

Hi all, I'm currently testing the meta-webkit layer on the Yocto 3.0 (zeus) branch by compiling XWayland images for an i.MX8-based board. I'm using the cog launcher and the wpebackend-fdo backend, and I can browse pages perfectly fine out of the box without tweaking anything.

The problem is that after a minute of browsing (or even of sitting idle on a relatively simple webpage), the browser freezes and the following messages from Wayland appear in the serial console:

error marshalling arguments for create_buffer: dup failed: Too many open files
Error marshalling request: Too many open files
Failed to get the memory usage

I believe it's only the image that is frozen, because I can still hear audio when the freezing happens and I can navigate through my browsing history or reload the page via cogctl. Doing a Ctrl+C after this happens (or before) results in hundreds of these messages coming from the Linux kernel:

[ 4074.890551] VFS: Close: file count is 0
[ 4074.894405] VFS: Close: file count is 0
[ 4074.898253] VFS: Close: file count is 0
[ 4074.902103] VFS: Close: file count is 0
[ 4074.905951] VFS: Close: file count is 0
[ 4074.909797] VFS: Close: file count is 0
[ 4074.913647] VFS: Close: file count is 0
[ 4074.917510] VFS: Close: file count is 0

Using lsof to view the file descriptors opened by the cog process, I can see that the total number of open "dmabuf" descriptors increases until it reaches 1024, which is when the crash happens:

~# lsof | grep '^1449'
1449 /usr/bin/cog /dev/ttymxc0
1449 /usr/bin/cog /dev/ttymxc0
1449 /usr/bin/cog /dev/ttymxc0
1449 /usr/bin/cog /dev/urandom
1449 /usr/bin/cog /dev/urandom
1449 /usr/bin/cog anon_inode:[eventfd]
1449 /usr/bin/cog /proc/1449/statm
1449 /usr/bin/cog anon_inode:[eventfd]
1449 /usr/bin/cog anon_inode:[eventfd]
1449 /usr/bin/cog anon_inode:[eventfd]
1449 /usr/bin/cog socket:[19777]
1449 /usr/bin/cog anon_inode:[eventfd]
1449 /usr/bin/cog socket:[19778]
1449 /usr/bin/cog /dev/galcore
1449 /usr/bin/cog anon_inode:[eventpoll]
1449 /usr/bin/cog socket:[19780]
1449 /usr/bin/cog socket:[19781]
1449 /usr/bin/cog socket:[19788]
1449 /usr/bin/cog socket:[19783]
1449 /usr/bin/cog socket:[19786]
1449 /usr/bin/cog socket:[19788]
1449 /usr/bin/cog anon_inode:[eventfd]
1449 /usr/bin/cog /dmabuf:
1449 /usr/bin/cog anon_inode:[eventfd]
1449 /usr/bin/cog /dmabuf:
1449 /usr/bin/cog /dmabuf:
1449 /usr/bin/cog /memfd:WebKitSharedMemory (deleted)
1449 /usr/bin/cog /dmabuf:
1449 /usr/bin/cog /dmabuf:
1449 /usr/bin/cog /dmabuf:
1449 /usr/bin/cog /dmabuf:
1449 /usr/bin/cog /dmabuf:
1449 /usr/bin/cog /dmabuf:
1449 /usr/bin/cog /dmabuf:
1449 /usr/bin/cog /dmabuf:
1449 /usr/bin/cog /dmabuf:
1449 /usr/bin/cog /dmabuf:
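
The same tally can be taken without lsof by reading the symlinks under /proc/<pid>/fd. What follows is only a minimal sketch, assuming a Linux procfs and permission to inspect the target process (same user or root). The helper names classify and count_fd_targets are made up for illustration and are not part of cog or WPE.

```python
import os
from collections import Counter

def classify(target):
    """Map a /proc/<pid>/fd link target to a coarse category."""
    if target.startswith("/dmabuf"):
        return "dmabuf"
    if target.startswith("socket:"):
        return "socket"
    # Everything else (device nodes, eventfd, memfd, ...) is kept verbatim.
    return target

def count_fd_targets(pid="self"):
    """Count the open descriptors of a process, grouped by classify()."""
    fd_dir = f"/proc/{pid}/fd"
    counts = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        counts[classify(target)] += 1
    return counts
```

Calling count_fd_targets(1449) on the board should then report the "dmabuf" category as the one that keeps growing, if the lsof output above is representative.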

This growth of file descriptors only happens when there is movement in the browser, e.g. when the text prompt blinks on the Google homepage or if there's an animation. Webpages that show a still image don't cause increments, but as soon as you force the image to change (by clicking or hovering over something), the count increases again.
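
One way to check that the growth tracks screen updates is to sample the descriptor count at a fixed interval, first with a static page and then with something animating. This is a rough sketch under the same Linux procfs assumption; sample_fd_count and watch are hypothetical helper names, and the "dmabuf" substring match is based only on the lsof output shown earlier.

```python
import os
import time

def sample_fd_count(pid="self", needle="dmabuf"):
    """Count descriptors of `pid` whose link target contains `needle`."""
    fd_dir = f"/proc/{pid}/fd"
    total = 0
    for name in os.listdir(fd_dir):
        try:
            if needle in os.readlink(os.path.join(fd_dir, name)):
                total += 1
        except OSError:
            # Descriptor vanished while we were iterating; ignore it.
            continue
    return total

def watch(pid, needle="dmabuf", samples=5, interval=1.0):
    """Return `samples` counts taken `interval` seconds apart."""
    counts = []
    for i in range(samples):
        counts.append(sample_fd_count(pid, needle))
        if i + 1 < samples:
            time.sleep(interval)
    return counts
```

A flat series from watch(1449) on a still page and a steadily rising one during an animation would match the behaviour described here.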

I assume the dmabuf descriptors are being opened every time Wayland has to redraw the browser, but I don't know why all of them remain open, clogging up the file descriptors and causing the browser to crash. Is this something that has been observed before? Could it be something I'm configuring wrong in my Yocto build (although I'm following the default configuration instructions with no additional changes)? This doesn't happen with other Weston applications on my system.

@gvalcaza
Author

gvalcaza commented Jul 9, 2020

I think it's worth mentioning that I'm using the proprietary Vivante stack. I say this because I've looked into the issue further, and all of those file descriptors are created due to constant "create_buffer" requests coming from the Vivante userspace libraries. Since the source code isn't available, I have no clue as to which custom Wayland protocol the libraries are using, or the circumstances under which the request is triggered to begin with...

It's strange because out of all the applications I have on my system, "cog" is the only one that triggers create_buffer requests whenever there's movement in the app (other apps trigger some requests when launched, but stop after a dozen at most, even when there's movement). Maybe it's due to how cog/the FDO backend interacts with wayland?

But I digress... This issue is probably too specific and the obvious alternative would be to use the etnaviv stack, although I'm not sure that's an option in my case. I'll leave this issue open for now for visibility, but feel free to close it.

@philn
Member

philn commented Jul 9, 2020

It's OK to discuss here for the time being. I think @zdobersek previously mentioned a leak in the vivante driver. Perhaps he can chime in :)

@aperezdc
Member

aperezdc commented Jul 9, 2020

We have seen this happen on i.MX6 as well. The file descriptor leak happens when calling eglCreateWaylandBufferFromImageWL(), which Cog's FDO platform module uses to create a wl_buffer from an EGLImage that can be attached directly as the contents of the Wayland surface. It happens specifically with the proprietary Vivante driver, so it is harder to investigate further, but comparing strace output for eglCreateWaylandBufferFromImageWL() shows more calls to close() with Mesa than with the Vivante driver. So yes, this totally looks like an issue in the Vivante driver, as Mesa closes file descriptors correctly.
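
To illustrate the failure mode only (this is not the Vivante driver's code, which is not available), the sketch below duplicates a descriptor repeatedly without closing the copies until the kernel refuses with EMFILE, the errno whose message reads "Too many open files". The function name dups_until_failure and the lowered limit of 64 are assumptions chosen to keep the demonstration short; the 1024 ceiling reported above matches the common default soft limit for open files on Linux.

```python
import errno
import os
import resource

def dups_until_failure(limit=64):
    """Leak dup()ed descriptors under a lowered soft limit; return (count, errno)."""
    source = os.open(os.devnull, os.O_RDONLY)
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Never ask for more than the hard limit allows.
    new_soft = limit if hard == resource.RLIM_INFINITY else min(limit, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    leaked = []
    try:
        while True:
            # Each copy is kept open on purpose: this is the leak.
            leaked.append(os.dup(source))
    except OSError as exc:
        failure = exc.errno
    finally:
        # Clean up so the caller's process is left as it was found.
        for fd in leaked:
            os.close(fd)
        os.close(source)
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
    return len(leaked), failure
```

A driver that closes each duplicated descriptor once the buffer has been handed over would never reach the limit, which is consistent with the extra close() calls seen with Mesa.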

An alternative that avoids eglCreateWaylandBufferFromImageWL() is to use the EGLImage as a texture and paint a couple of triangles using GLES. Of course this involves making the GPU do a bit of unneeded work, but luckily GPUs are pretty good at pushing pixels around and in most cases the overhead should not be noticeable. I have implemented such a rendering mode in this commit of one of my WIP branches. Feel free to try the branch out, but be aware that some functionality available in master and stable releases is still missing.

@gvalcaza
Author

gvalcaza commented Jul 10, 2020

Thanks for the quick feedback and explanation, @aperezdc! I compiled cog from your branch and that seems to do the trick. As you mentioned, some features are missing, but at least I can browse pages for longer periods of time without interruptions, which is my main concern here :)

Are there any plans to incorporate this alternative rendering method into the master branch in the near future? Or is it going to be left in your WIP branch because of its GPU overhead?

@zdobersek
Contributor

@petegriffin

Good to see that this issue now has a fix in the NXP forum post :)

@aperezdc
Member

aperezdc commented Nov 2, 2020

I suppose we can close this now that it is clear where the issue was and that it was outside of the meta-webkit scope. Thanks everybody for your comments!
