crash in eglFini from libglvnd libEGL #103

Closed
kwizart opened this issue Oct 24, 2016 · 18 comments

Comments

@kwizart
Contributor

kwizart commented Oct 24, 2016

This bug was initially reported as https://bugzilla.rpmfusion.org/show_bug.cgi?id=4303

Some EGL applications crash with a segmentation fault when using the current libglvnd libEGL.
Reproduced with gthumb on Fedora 24:
run gthumb, then exit.

Starting program: /usr/bin/gthumb
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7fffeb998700 (LWP 18735)]
[New Thread 0x7fffeab25700 (LWP 18736)]
[New Thread 0x7fffdd5c6700 (LWP 18738)]
[New Thread 0x7fffc5a3e700 (LWP 18739)]
[New Thread 0x7fffbf6a4700 (LWP 18741)]

Thread 1 "gthumb" received signal SIGSEGV, Segmentation fault.
0x00007fffe89d4d00 in ?? ()
(gdb) bt
#0 0x00007fffe89d4d00 in ()
#1 0x00007ffff366ed89 in __eglFini () at /usr/lib64/libglvnd/libEGL.so.1
#2 0x00007ffff7de94aa in _dl_fini () at /lib64/ld-linux-x86-64.so.2
#3 0x00007ffff45ae1e8 in __run_exit_handlers () at /lib64/libc.so.6
#4 0x00007ffff45ae235 in () at /lib64/libc.so.6
#5 0x00007ffff4595738 in __libc_start_main () at /lib64/libc.so.6
#6 0x0000555555588fb9 in _start ()

The libglvnd package was built with USE_ATTRIBUTE_CONSTRUCTOR enabled.

@kwizart
Contributor Author

kwizart commented Oct 24, 2016

Our current workaround is to remove libglvnd/libEGL.so.1*.
That makes applications fall back to mesa-libEGL (even on top of the nvidia driver),
which seems to work around the problem.

@leigh123linux

Here's the backtrace for the gthumb crash with extra debug info:

    (gdb) bt
    #0  0x00007fffea009ed0 in  ()
    #1  0x00007ffff366ed89 in __eglFini () at libegl.c:1283
    #2  0x00007ffff7de94aa in _dl_fini () at dl-fini.c:235
    #3  0x00007ffff45ae1e8 in __run_exit_handlers (status=0, listp=0x7ffff49315d8 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true) at exit.c:82
    #4  0x00007ffff45ae235 in __GI_exit (status=<optimized out>) at exit.c:104
    #5  0x00007ffff4595738 in __libc_start_main (main=0x555555588ea0 <main>, argc=1, argv=0x7fffffffdfc8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffdfb8) at ../csu/libc-start.c:323
    #6  0x0000555555588fb9 in _start ()
    (gdb)

@kbrenneman
Collaborator

Hm. I'll need to reproduce this locally to be sure, but it might be some problem related to loading both EGL and GLX in the same process.

@ht990332

kbrenneman, possibly.
Try "gtk3-demo --run=glarea".
Then close the test and finally gtk3-demo window.

@leigh123linux

@Hussamt
"gtk3-demo --run=glarea" works here so not a good example.

@kbrenneman
Running totem or gthumb and then closing it should be enough to reproduce the issue.
It reliably reproduces here with nvidia 367.57, 370.28 and 375.10 and the latest libglvnd git.

@ht990332

@leigh123linux ok, thank you.
I am using Arch Linux's libglvnd, which is version 0.1.1 and doesn't contain libEGL.so, and I am getting this crash.
Maybe there is also a packaging issue on Arch Linux?

@kbrenneman
Collaborator

Ah, I've found the problem. It's a combination of having both EGL and GLX loaded, and the way entrypoint patching works.

Both __eglFini and __glxFini start by calling into __glDispatchCheckMultithreaded. If the entrypoints are currently patched, then __glDispatchCheckMultithreaded will call the thread attached callback in the vendor.

When GLX unloads, it unloads the vendor libraries, so that callback is no longer valid. But, simply clearing the current context doesn't unpatch the entrypoints anymore, because patching and unpatching on every MakeCurrent tended to cause a major performance drop on some badly-behaved programs.

So, when libGLX unloads, it leaves dangling pointers to the entrypoint patching callbacks, which __eglFini ends up calling.

I think I can fix this particular case just by adding an extra function to force libGLdispatch to unpatch everything. But, it would still run into the same problem if a different thread still had a current context.
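
The failure mode described above can be modelled in a few lines of C. This is only a sketch under stated assumptions, not libGLdispatch's actual code: `patch_callbacks`, `dispatch_patch`, `dispatch_lose_current`, `dispatch_check_multithreaded` and `dispatch_force_unpatch` are hypothetical names standing in for the real entrypoint-patching machinery, and a counter replaces the real vendor callback so the stale call can be observed instead of crashing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the callbacks a vendor registers when it
 * patches the entrypoints. */
typedef struct {
    void (*thread_attach)(void);
} patch_callbacks;

/* Model of the dispatch layer's state: the callbacks currently patched in. */
static const patch_callbacks *current_patch = NULL;

/* Counts calls into the "vendor" so the behaviour can be observed.
 * In this model the vendor code is never really unloaded, so the call
 * stays safe; in the real crash the pointer targets unmapped memory. */
static int vendor_calls = 0;
static void vendor_thread_attach(void) { vendor_calls++; }
static const patch_callbacks vendor_callbacks = { vendor_thread_attach };

/* A vendor patches the entrypoints when its context becomes current. */
static void dispatch_patch(const patch_callbacks *cb)
{
    assert(cb != NULL && cb->thread_attach != NULL);
    current_patch = cb;
}

/* Losing the current context deliberately leaves the patch in place,
 * because unpatching on every MakeCurrent was too slow. */
static void dispatch_lose_current(void)
{
    /* intentionally empty: current_patch is not cleared here */
}

/* Called at the start of each fini function in this model. */
static void dispatch_check_multithreaded(void)
{
    if (current_patch != NULL) {
        /* Dangling in the real scenario if the vendor is already gone. */
        current_patch->thread_attach();
    }
}

/* The proposed fix: drop the patch before the vendor library is unloaded. */
static void dispatch_force_unpatch(void)
{
    current_patch = NULL;
}

static bool dispatch_is_patched(void)
{
    return current_patch != NULL;
}
```

In this model, a fini path would call `dispatch_force_unpatch()` before releasing the vendor, so a later `dispatch_check_multithreaded()` has nothing left to call.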

kbrenneman added a commit to kbrenneman/libglvnd that referenced this issue Oct 24, 2016
Added a new function to libGLdispatch, __glDispatchForceUnpatch, which forces
it to unpatch the OpenGL entrypoints before libEGL or libGLX can unload the
vendor library that patched them.

If a vendor patches the OpenGL entrypoints, libGLdispatch doesn't unpatch them
when that vendor's context is no longer current, because that adds too much
overhead to repeated MakeCurrent+LoseCurrent calls. But, that also means that
the patch callbacks end up being dangling pointers after the vendor library is
unloaded.

This mainly shows up at process termination when a process loads both libEGL
and libGLX, because __glxFini and __eglFini will both call the vendor's
threadAttach callback.

Fixes NVIDIA#103

@kbrenneman
Collaborator

Okay, I think just unconditionally unpatching the OpenGL entrypoints before unloading the vendor libraries is enough to fix this.

The only case I can think of where it would break is if another thread was trying to call an OpenGL function while it was being rewritten. But that would mean the other thread is trying to call an OpenGL function while the vendor library is being unloaded, so it's going to fall apart no matter what we do in libGLdispatch.

@leigh123linux

@kbrenneman I have just tested your commit and it fixes the crash, thank you.

@leigh123linux

leigh123linux commented Oct 25, 2016

@kbrenneman Your fix works for GNOME apps but still seems to fail with KDE apps:

Application: ksmserver-logout-greeter (ksmserver-logout-greeter), signal: Segmentation fault
Using host libthread_db library "/lib64/libthread_db.so.1".
[KCrash Handler]
#6  0x00007fecda561dd0 in pthread_mutex_lock () from /lib64/libpthread.so.0
#7  0x00007fecbf0e0dcc in ?? () from /usr/lib64/nvidia/libGLX_nvidia.so.0
#8  0x00007fecbf0b80e8 in ?? () from /usr/lib64/nvidia/libGLX_nvidia.so.0
#9  0x00007fecc67b3d69 in __eglFini () from /usr/lib64/libglvnd/libEGL.so.1
#10 0x00007fece35a711a in _dl_fini () from /lib64/ld-linux-x86-64.so.2
#11 0x00007fecdb722420 in __run_exit_handlers () from /lib64/libc.so.6
#12 0x00007fecdb72247a in exit () from /lib64/libc.so.6
#13 0x00007fecdb708408 in __libc_start_main () from /lib64/libc.so.6
#14 0x0000561ce8fecbfa in _start ()

I have managed to reproduce the issue by running systemsettings5, navigating to the 'desktop effects' tab, and then closing the window.

@kbrenneman
Collaborator

@leigh123linux - To clarify, you're seeing the same crash in KDE without #105 as well? Or did #105 introduce the crash?

@leigh123linux

@kbrenneman KDE had the issue before your commit, and #105 doesn't fix it.

@kbrenneman
Collaborator

Okay. In that case, I'll check in the change to fix Gnome and see what the problem is in KDE. It might be something unrelated.

@kbrenneman
Collaborator

Okay, I think I've found the problem with KDE.

DSO finalizers just get run in the reverse order of their constructors, including DSOs that were loaded with dlopen().

The sequence that we're getting in this case is that it calls the driver's _fini callback, then calls __eglFini and then after that would call __glXFini. So, __eglFini tries to call the driver's thread attached callback after the driver has gone through all of its cleanup.

So, I need to rearrange things so that it doesn't try to call into the vendor at all from any of the _fini functions.
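
One way to picture that rearrangement is a finalizer that only touches state owned by the library itself. The following is a minimal sketch under assumptions, not the change that went into libglvnd: `vendor_entry`, `runtime_check_thread`, `library_fini` and `example_vendor` are hypothetical names, and `__attribute__((destructor))` is the GCC/Clang extension for running a function at unload or exit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record the library keeps for a loaded vendor. */
typedef struct {
    void (*thread_attach)(void);  /* callback into vendor code */
    bool released;                /* set once the library has let go of it */
} vendor_entry;

/* Counter standing in for real vendor work, so calls can be observed. */
static int attach_calls = 0;
static void example_thread_attach(void) { attach_calls++; }

static vendor_entry example_vendor = { example_thread_attach, false };

/* Normal runtime path: the vendor is still initialised and may be called. */
static void runtime_check_thread(vendor_entry *v)
{
    if (v->thread_attach != NULL) {
        v->thread_attach();
    }
}

/* Finalizer path: never calls into the vendor, because the vendor's own
 * finalizer may already have run. It only forgets the callback. */
static void library_fini(vendor_entry *v)
{
    assert(v != NULL);
    v->thread_attach = NULL;
    v->released = true;
}

/* GCC/Clang extension: runs when the object is unloaded or at exit. */
__attribute__((destructor))
static void on_unload(void)
{
    library_fini(&example_vendor);
}
```

The design choice here is that teardown is safe regardless of the order in which the dynamic linker runs finalizers, since `library_fini` makes no assumption about whether the vendor is still usable.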

@cubanismo

DSO finalizers just get run in the reverse order of their constructors, including DSOs that were
loaded with dlopen().

Is that true? It seems counter to comments in glibc's source:

  /* Lots of fun ahead.  We have to call the destructors for all still
     loaded objects, in all namespaces.  The problem is that the ELF
     specification now demands that dependencies between the modules
     are taken into account.  I.e., the destructor for a module is
     called before the ones for any of its dependencies.

     To make things more complicated, we cannot simply use the reverse
     order of the constructors.  Since the user might have loaded objects
     using `dlopen' there are possibly several other modules with its
     dependencies to be taken into account.  Therefore we have to start
     determining the order of the modules once again from the beginning.  */

@kbrenneman
Collaborator

Oh, I just found what our problem is. We ran into exactly the same problem in GLX, fixed in commit b7d7542, but I never made the corresponding fix for EGL.

@leigh123linux

@kbrenneman

The latest commit fixes the KDE/Plasma apps crash, thanks again.

@fafryd1125

Related: https://bugs.archlinux.org/task/51527
