[BISECTED] SIGSEGV (-11) on gfxrecon::decode::FileProcessor::~FileProcessor #328

Closed
tanty opened this issue Feb 21, 2020 · 7 comments
Labels: bug (Something isn't working)

tanty commented Feb 21, 2020

After:

commit 36f122feef73cf1be1cf65a35f064126982fdefb (refs/bisect/bad)
Author: Dustin Graves <dustin@lunarg.com>
Date:   Sat Apr 27 12:35:23 2019 -0600

    Add multi instance/device replay support
    
    Add replay support for captures including more than one instance or
    device. A dispatch table is now created and initialized for each
    created instance and device. Replaces volk with the dispatch table code
    used by the capture layer.
    
    This change addresses issues with replay and multiple instances when
    each instance enables different extensions. With a single set of
    instance function pointers, there were two options for handling function
    loading, which were both insufficient to handle this case:
    - Instance functions are initialized for the first instance created,
      where extension functions enabled by a second instance would not be
      loaded.
    - Instance functions are loaded after each instance creation, where
      extension functions loaded for the first instance could be overwritten
      as NULL by a second instance that did not enable the extension.
    
    Adding per-instance dispatch tables ensures that each instance has
    access to the extension functions it has enabled.
    
    Change-Id: I3ea5190494f5f482cd2003bef9b3bdd09fceebf1
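
For context, the per-instance dispatch-table approach the commit describes can be sketched as follows; the struct, map, and function names here are hypothetical illustrations, not GFXReconstruct's actual code:

#include <vulkan/vulkan.h>
#include <unordered_map>

// Hypothetical sketch of a per-instance dispatch table: each VkInstance
// owns its own set of function pointers, so extension functions enabled
// by one instance can never be overwritten with NULL by another instance
// that did not enable the extension.
struct InstanceDispatchTable
{
    PFN_vkDestroyInstance   DestroyInstance   = nullptr;
    PFN_vkDestroySurfaceKHR DestroySurfaceKHR = nullptr; // NULL if VK_KHR_surface was not enabled
    // ... one member per instance-level entry point ...
};

static std::unordered_map<VkInstance, InstanceDispatchTable> instance_tables;

void LoadInstanceTable(VkInstance instance)
{
    InstanceDispatchTable table;
    table.DestroyInstance = reinterpret_cast<PFN_vkDestroyInstance>(
        vkGetInstanceProcAddr(instance, "vkDestroyInstance"));
    table.DestroySurfaceKHR = reinterpret_cast<PFN_vkDestroySurfaceKHR>(
        vkGetInstanceProcAddr(instance, "vkDestroySurfaceKHR"));
    instance_tables[instance] = table; // one table per created instance
}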

I get flaky runs with gfxrecon-replay.

Every now and then, the execution crashes with SIGSEGV in gfxrecon::decode::FileProcessor::~FileProcessor.

From my quick investigation, it seems that (at least) the destruction of the parameter_buffer_ member ends with a corrupted double-linked list. I was able to confirm this by manually adding:

FileProcessor::~FileProcessor()
{
    // Explicitly release parameter_buffer_'s heap storage at the top of
    // the destructor.
    parameter_buffer_.clear();
    parameter_buffer_.shrink_to_fit();

...

and observing that the SIGSEGV now happens inside the shrink_to_fit() call.

Platform: Debian GNU/Linux Buster x86_64.

The trace is compressed with LZ4.

Reproduced with anv and radv Mesa Vulkan drivers from locally built tag mesa-19.3.3.

BT from gdb: gdb.txt

Trace file: vkcube.gfxr.zip

This is a run example from the source directory:

$ VK_ICD_FILENAMES="<path_to>/intel_icd.x86_64.json" ./_build/tools/replay/gfxrecon-replay <path_to>/vkcube.gfxr
35.984419 fps, 0.277898 seconds, 10 frames, 1 loop, framerange 1-10
corrupted double-linked list
Aborted (core dumped)
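
The "corrupted double-linked list" abort is glibc's heap-consistency check firing: an earlier out-of-bounds write has clobbered malloc's chunk metadata, and the damage only surfaces at a later, unrelated free(). A minimal, deliberately buggy sketch of that pattern (not GFXReconstruct code):

#include <cstring>
#include <vector>

int main()
{
    std::vector<char> buffer(64);

    // Deliberate bug: writing past the end of the allocation can clobber
    // glibc malloc's chunk metadata for a neighboring block.
    std::memset(buffer.data(), 0, 128);

    // Nothing fails at the write itself; glibc typically detects the
    // damage much later, when the storage is released (here, or in the
    // vector's destructor), aborting with "corrupted double-linked list"
    // or crashing with SIGSEGV.
    buffer.clear();
    buffer.shrink_to_fit();

    return 0;
}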

tanty commented Feb 21, 2020

FWIW, using the commit just before the one introducing the regression, I also get flaky SIGPIPEs (-13), although I cannot reproduce them as consistently.

tanty changed the title from "SIGSEGV (-11) on gfxrecon::decode::FileProcessor::~FileProcessor" to "[BISECTED] SIGSEGV (-11) on gfxrecon::decode::FileProcessor::~FileProcessor" on Feb 21, 2020
dustin-lunarg self-assigned this on Feb 21, 2020
dustin-lunarg added the bug label on Feb 21, 2020
dustin-lunarg (Contributor) commented

Apologies for taking so long to respond. I did look into this when it was first submitted, but neglected to update the issue.

At the time, I was unable to reproduce the problem with builds from both the dev and master branches. I made the modifications to the FileProcessor destructor and tested with the provided vkcube.gfxr file on a mesa-19.3.3 build. Testing was done with Fedora 31 and Debian Buster.

I just tried again on Fedora 32 with the radv driver from the mesa 20.0.7 package shipped with the system, using the latest source from master and a new capture of vkcube, and still cannot reproduce the issue.

If you are able to test again with the latest source from the master branch, could you let me know if you are still experiencing the problem?

lunarpapillo (Contributor) commented

Observed during SDK 1.2.141.0 testing, using packages on Ubuntu 16.04 with a Radeon R9 285/380 and the mesa-vulkan-drivers driver. gfxrecon-replay of a vkcube trace (captured normally with the capture layer; the capture was terminated with ^C) seems to segfault on exit:

$ gfxrecon-replay gfxrecon_capture_20200602T180110.gfxr 
WARNING: radv is not a conformant vulkan implementation, testing use only.
[gfxrecon] WARNING - Incomplete block at end of file
58.316757 fps, 3.086591 seconds, 180 frames, 1 loop, framerange 1-180
Segmentation fault (core dumped)

The stack trace shows:

#0  0x00007fd754c763c6 in malloc_consolidate (av=av@entry=0x7fd754fbcb20 <main_arena>)
    at malloc.c:4183
#1  0x00007fd754c78678 in _int_free (av=0x7fd754fbcb20 <main_arena>, p=<optimized out>, 
    have_lock=0) at malloc.c:4075
#2  0x00007fd754c7c53c in __GI___libc_free (mem=<optimized out>) at malloc.c:2968
#3  0x0000000000556590 in gfxrecon::decode::FileProcessor::~FileProcessor() ()
#4  0x0000000000550d32 in main ()

dustin-lunarg (Contributor) commented

As indicated in the previous comment, we are now able to reproduce this issue. The problem happens when the capture file does not contain a call to vkDestroySurfaceKHR. Replay does not currently attempt to destroy Vulkan resources that are still active on exit. For surfaces, when the capture file does not include a vkDestroySurfaceKHR call, the window associated with the surface will be destroyed on exit while the surface is still active. The SIGSEGV then seems to happen after a call to xcb_disconnect.

For this case, Valgrind produces multiple messages like the following:

==13906== Invalid write of size 4
==13906==    at 0x6B38C6C: __pthread_mutex_cond_lock (pthread_mutex_lock.c:159)
==13906==    by 0x6B3A3EF: pthread_cond_wait@@GLIBC_2.3.2 (pthread_cond_wait.S:259)
==13906==    by 0x6517EB8: ??? (in /usr/lib/x86_64-linux-gnu/libxcb.so.1.1.0)
==13906==    by 0x65199A8: xcb_wait_for_special_event (in /usr/lib/x86_64-linux-gnu/libxcb.so.1.1.0)
==13906==    by 0x8DCB476: ??? (in /usr/lib/x86_64-linux-gnu/libvulkan_radeon.so)
==13906==    by 0x6B346B9: start_thread (pthread_create.c:333)
==13906==    by 0x5E4B41C: clone (clone.S:109)
==13906==  Address 0x61282c0 is 32 bytes inside a block of size 21,152 free'd
==13906==    at 0x4C2EDEB: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==13906==    by 0x553560: gfxrecon::application::XcbApplication::~XcbApplication() (xcb_application.cpp:35)
==13906==    by 0x553588: gfxrecon::application::XcbApplication::~XcbApplication() (xcb_application.cpp:37)
==13906==    by 0x550F74: operator() (unique_ptr.h:76)
==13906==    by 0x550F74: ~unique_ptr (unique_ptr.h:236)
==13906==    by 0x550F74: main (desktop_main.cpp:83)

This message seems to be triggered by the call to xcb_disconnect in XcbApplication::~XcbApplication(), which is most likely what corrupts memory. FileProcessor::~FileProcessor(), where the crash happens, is invoked after XcbApplication::~XcbApplication(), so its frees are simply the first operations to touch the already-corrupted heap. Temporarily removing the call to xcb_disconnect stopped the errors from being reported by Valgrind and seemed to prevent the crash.

Change #389 ensures that surfaces are destroyed before their associated windows. This eliminates the errors reported by Valgrind without removing the call to xcb_disconnect and seems to prevent the crash. Support for cleaning up all active Vulkan resources on exit will be added in the future.
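
A minimal sketch of the ordering the change enforces; the helper function and its parameters are hypothetical, but the Vulkan and XCB calls are the standard ones:

#include <vulkan/vulkan.h>
#include <xcb/xcb.h>

// Hypothetical cleanup helper illustrating the required order: the Vulkan
// surface must be destroyed while the XCB window and connection that back
// it are still alive.
void DestroySurfaceThenWindow(VkInstance instance, VkSurfaceKHR surface,
                              xcb_connection_t* connection, xcb_window_t window)
{
    // 1. Destroy the surface first; the driver (e.g. radv's presentation
    //    thread in the Valgrind trace above) may still use the connection.
    vkDestroySurfaceKHR(instance, surface, nullptr);

    // 2. Only then tear down the window system objects.
    xcb_destroy_window(connection, window);
    xcb_disconnect(connection);
}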

dustin-lunarg (Contributor) commented

#389 was merged into the dev branch.

tanty commented Jun 12, 2020

Thanks a lot, @lunarpapillo and @dustin-lunarg !

dustin-lunarg (Contributor) commented

No problem! Thank you for the detailed bug report!
