[BISECTED] SIGSEGV (-11) on gfxrecon::decode::FileProcessor::~FileProcessor #328

Closed
tanty opened this issue Feb 21, 2020 · 7 comments
Labels: bug (Something isn't working)

tanty commented Feb 21, 2020

After:

commit 36f122feef73cf1be1cf65a35f064126982fdefb (refs/bisect/bad)
Author: Dustin Graves <dustin@lunarg.com>
Date:   Sat Apr 27 12:35:23 2019 -0600

    Add multi instance/device replay support
    
    Add replay support for captures including more than one instance or
    device. A dispatch table is now created and initialized for each
    created instance and device. Replaces volk with the dispatch table code
    used by the capture layer.
    
    This change addresses issues with replay and multiple instances when
    each instance enables different extensions. With a single set of
    instance function pointers, there were two options for handling function
    loading, which were both insufficient to handle this case:
    - Instance functions are initialized for the first instance created,
      where extension functions enabled by a second instance would not be
      loaded.
    - Instance functions are loaded after each instance creation, where
      extension functions loaded for the first instance could be overwritten
      as NULL by a second instance that did not enable the extension.
    
    Adding per-instance dispatch tables ensures that each instance has
    access to the extension functions it has enabled.
    
    Change-Id: I3ea5190494f5f482cd2003bef9b3bdd09fceebf1
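
For context, the per-instance dispatch-table approach the commit describes can be sketched as follows; the struct, map, and function names here are hypothetical illustrations, not GFXReconstruct's actual code:

#include <vulkan/vulkan.h>
#include <unordered_map>

// Hypothetical sketch of a per-instance dispatch table: each VkInstance
// owns its own set of function pointers, so extension functions enabled
// by one instance can never be overwritten with NULL by another instance
// that did not enable the extension.
struct InstanceDispatchTable
{
    PFN_vkDestroyInstance   DestroyInstance   = nullptr;
    PFN_vkDestroySurfaceKHR DestroySurfaceKHR = nullptr; // NULL if VK_KHR_surface was not enabled
    // ... one member per instance-level entry point ...
};

static std::unordered_map<VkInstance, InstanceDispatchTable> instance_tables;

void LoadInstanceTable(VkInstance instance)
{
    InstanceDispatchTable table;
    table.DestroyInstance = reinterpret_cast<PFN_vkDestroyInstance>(
        vkGetInstanceProcAddr(instance, "vkDestroyInstance"));
    table.DestroySurfaceKHR = reinterpret_cast<PFN_vkDestroySurfaceKHR>(
        vkGetInstanceProcAddr(instance, "vkDestroySurfaceKHR"));
    instance_tables[instance] = table; // one table per created instance
}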

I get flaky runs with gfxrecon-replay.

Every now and then, the execution crashes with SIGSEGV in gfxrecon::decode::FileProcessor::~FileProcessor.

From my quick investigation, it seems that (at least) the destruction of the parameter_buffer_ member ends with a corrupted double-linked list. I was able to confirm this by manually adding:

FileProcessor::~FileProcessor()
{
    // Explicitly release parameter_buffer_'s heap storage at the top of
    // the destructor.
    parameter_buffer_.clear();
    parameter_buffer_.shrink_to_fit();

...

and observing that the SIGSEGV now happens inside the shrink_to_fit() call.

Platform: Debian GNU/Linux Buster x86_64.

The trace is compressed with LZ4.

Reproduced with anv and radv Mesa Vulkan drivers from locally built tag mesa-19.3.3.

BT from gdb: gdb.txt

Trace file: vkcube.gfxr.zip

This is a run example from the source directory:

$ VK_ICD_FILENAMES="<path_to>/intel_icd.x86_64.json" ./_build/tools/replay/gfxrecon-replay <path_to>/vkcube.gfxr
35.984419 fps, 0.277898 seconds, 10 frames, 1 loop, framerange 1-10
corrupted double-linked list
Aborted (core dumped)
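
The "corrupted double-linked list" abort is glibc's heap-consistency check firing: an earlier out-of-bounds write has clobbered malloc's chunk metadata, and the damage only surfaces at a later, unrelated free(). A minimal, deliberately buggy sketch of that pattern (not GFXReconstruct code):

#include <cstring>
#include <vector>

int main()
{
    std::vector<char> buffer(64);

    // Deliberate bug: writing past the end of the allocation can clobber
    // glibc malloc's chunk metadata for a neighboring block.
    std::memset(buffer.data(), 0, 128);

    // Nothing fails at the write itself; glibc typically detects the
    // damage much later, when the storage is released (here, or in the
    // vector's destructor), aborting with "corrupted double-linked list"
    // or crashing with SIGSEGV.
    buffer.clear();
    buffer.shrink_to_fit();

    return 0;
}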

tanty commented Feb 21, 2020

FWIW, using the commit just before the one introducing the regression, I also get flaky SIGPIPEs (-13), although I cannot reproduce them as consistently.

tanty changed the title from "SIGSEGV (-11) on gfxrecon::decode::FileProcessor::~FileProcessor" to "[BISECTED] SIGSEGV (-11) on gfxrecon::decode::FileProcessor::~FileProcessor" on Feb 21, 2020
dustin-lunarg self-assigned this on Feb 21, 2020
dustin-lunarg added the bug label on Feb 21, 2020
dustin-lunarg (Contributor) commented

Apologies for taking so long to respond. I did look into this when it was first submitted, but neglected to update the issue.

At the time, I was unable to reproduce the problem with builds from both the dev and master branches. I made the modifications to the FileProcessor destructor and tested with the provided vkcube.gfxr file on a mesa-19.3.3 build. Testing was done with Fedora 31 and Debian Buster.

I just tried again on Fedora 32 with the radv driver from the mesa 20.0.7 package shipped with the system, using the latest source from master and a new capture of vkcube, and still cannot reproduce the issue.

If you are able to test again with the latest source from the master branch, could you let me know if you are still experiencing the problem?

lunarpapillo (Contributor) commented

Observed during SDK 1.2.141.0 testing, using packages on Ubuntu 16.04 with a Radeon R9 285/380 and the mesa-vulkan-drivers driver. gfxrecon-replay of a vkcube trace (captured normally with the capture layer; the capture was terminated with ^C) seems to segfault on exit:

$ gfxrecon-replay gfxrecon_capture_20200602T180110.gfxr 
WARNING: radv is not a conformant vulkan implementation, testing use only.
[gfxrecon] WARNING - Incomplete block at end of file
58.316757 fps, 3.086591 seconds, 180 frames, 1 loop, framerange 1-180
Segmentation fault (core dumped)

The stack trace shows:

#0  0x00007fd754c763c6 in malloc_consolidate (av=av@entry=0x7fd754fbcb20 <main_arena>)
    at malloc.c:4183
#1  0x00007fd754c78678 in _int_free (av=0x7fd754fbcb20 <main_arena>, p=<optimized out>, 
    have_lock=0) at malloc.c:4075
#2  0x00007fd754c7c53c in __GI___libc_free (mem=<optimized out>) at malloc.c:2968
#3  0x0000000000556590 in gfxrecon::decode::FileProcessor::~FileProcessor() ()
#4  0x0000000000550d32 in main ()

dustin-lunarg (Contributor) commented

As indicated in the previous comment, we are now able to reproduce this issue. The problem happens when the capture file does not contain a call to vkDestroySurfaceKHR. Replay does not currently attempt to destroy Vulkan resources that are still active on exit. For surfaces, when the capture file does not include a vkDestroySurfaceKHR call, the window associated with the surface will be destroyed on exit while the surface is still active. The SIGSEGV then seems to happen after a call to xcb_disconnect.

For this case, Valgrind produces multiple messages like the following:

==13906== Invalid write of size 4
==13906==    at 0x6B38C6C: __pthread_mutex_cond_lock (pthread_mutex_lock.c:159)
==13906==    by 0x6B3A3EF: pthread_cond_wait@@GLIBC_2.3.2 (pthread_cond_wait.S:259)
==13906==    by 0x6517EB8: ??? (in /usr/lib/x86_64-linux-gnu/libxcb.so.1.1.0)
==13906==    by 0x65199A8: xcb_wait_for_special_event (in /usr/lib/x86_64-linux-gnu/libxcb.so.1.1.0)
==13906==    by 0x8DCB476: ??? (in /usr/lib/x86_64-linux-gnu/libvulkan_radeon.so)
==13906==    by 0x6B346B9: start_thread (pthread_create.c:333)
==13906==    by 0x5E4B41C: clone (clone.S:109)
==13906==  Address 0x61282c0 is 32 bytes inside a block of size 21,152 free'd
==13906==    at 0x4C2EDEB: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==13906==    by 0x553560: gfxrecon::application::XcbApplication::~XcbApplication() (xcb_application.cpp:35)
==13906==    by 0x553588: gfxrecon::application::XcbApplication::~XcbApplication() (xcb_application.cpp:37)
==13906==    by 0x550F74: operator() (unique_ptr.h:76)
==13906==    by 0x550F74: ~unique_ptr (unique_ptr.h:236)
==13906==    by 0x550F74: main (desktop_main.cpp:83)

This message seems to be triggered by the call to xcb_disconnect in XcbApplication::~XcbApplication(), which is most likely what corrupts memory. FileProcessor::~FileProcessor(), where the crash happens, is invoked after XcbApplication::~XcbApplication(), so its frees are simply the first operations to touch the already-corrupted heap. Temporarily removing the call to xcb_disconnect stopped the errors from being reported by Valgrind and seemed to prevent the crash.

Change #389 ensures that surfaces are destroyed before their associated windows. This eliminates the errors reported by Valgrind without removing the call to xcb_disconnect and seems to prevent the crash. Support for cleaning up all active Vulkan resources on exit will be added in the future.
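
A minimal sketch of the ordering the change enforces; the helper function and its parameters are hypothetical, but the Vulkan and XCB calls are the standard ones:

#include <vulkan/vulkan.h>
#include <xcb/xcb.h>

// Hypothetical cleanup helper illustrating the required order: the Vulkan
// surface must be destroyed while the XCB window and connection that back
// it are still alive.
void DestroySurfaceThenWindow(VkInstance instance, VkSurfaceKHR surface,
                              xcb_connection_t* connection, xcb_window_t window)
{
    // 1. Destroy the surface first; the driver (e.g. radv's presentation
    //    thread in the Valgrind trace above) may still use the connection.
    vkDestroySurfaceKHR(instance, surface, nullptr);

    // 2. Only then tear down the window system objects.
    xcb_destroy_window(connection, window);
    xcb_disconnect(connection);
}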

dustin-lunarg (Contributor) commented

#389 was merged into the dev branch.

tanty commented Jun 12, 2020

Thanks a lot, @lunarpapillo and @dustin-lunarg !

dustin-lunarg (Contributor) commented

No problem! Thank you for the detailed bug report!
