Intermittent corrupted double-linked list SIGABRT at process exit on RDNA 4 (gfx1201) with MIGraphX 6.4.2 + ORT 1.24.2 #4792

@maherr

Description

Summary

A long-running batch of 232 sequential ORT + MIGraphX inference jobs (one fresh subprocess per job, WeSpeaker ResNet34 embedding model on 16 kHz mono WAVs) produces a heap-corruption SIGABRT in 1 of 232 runs on an RX 9070 (gfx1201, RDNA 4) under ROCm 6.4.4 / MIGraphX rocm-6.4.2 / ORT 1.24.2 / glibc 2.42 / libgcc 15.2.1.

The abort signature is glibc's corrupted double-linked list, printed from malloc_printerr inside unlink_chunk. The backtrace shows the free happening in libgcc_s.so.1!btree_destroy called from release_registered_frames during _dl_fini — i.e., the FDE (DWARF unwind) btree teardown at process exit. All application work (main() in our Rust wrapper) completed successfully before the abort, and the inference JSON was generated correctly.

The single-threaded-at-crash state strongly suggests a deferred heap corruption originating during MIGraphX / MIGraphX EP / HIP runtime teardown, which is only detected later, when libgcc's shutdown free() path coalesces against the damaged chunk.

Upfront caveat on platform

We recognize gfx1201 / RDNA 4 is not officially supported in the MIGraphX 6.4 branch. We're running MIGraphX at the rocm-6.4.2 tag with 6 small local patches to enable building for gfx1201 on Fedora 43 (TF protobuf ABI, MLIR stubs, hipcc arch-list entry, C API unlink). None of our patches touch allocator, teardown, or std::atexit/destructor code paths. Full patch set available on request. If the right answer is "wait for ROCm 7.x official gfx1201 support," that's fine — we're filing primarily because we have a fresh coredump with a clean backtrace and wanted to make sure it's on the record, in case it helps identify a latent issue that's independent of arch support.

Signature

Using migraphx mode, ahc_threshold_mode=adaptive
corrupted double-linked list
<SIGABRT, rc=-6>

Backtrace (from systemd-coredump)

Thread 1 (the only thread at crash time):
#0  __pthread_kill_implementation            (libc.so.6 + 0x743cc)
#1  raise                                    (libc.so.6 + 0x1a15e)
#2  abort                                    (libc.so.6 + 0x16d0)
#3  __libc_message_impl.cold                 (libc.so.6 + 0x273b)
#4  malloc_printerr                          (libc.so.6 + 0x7e665)   <-- "corrupted double-linked list" prints here
#5  unlink_chunk.isra.0                      (libc.so.6 + 0x7f034)   <-- linkage check fails
#6  malloc_consolidate                       (libc.so.6 + 0x7f0e5)
#7  _int_free_maybe_consolidate.part.0       (libc.so.6 + 0x80290)
#8  _int_free_chunk                          (libc.so.6 + 0x80801)
#9  btree_destroy                            (libgcc_s.so.1 + 0xb7c) <-- FDE btree teardown
#10 release_registered_frames                (libgcc_s.so.1 + 0xbb4) <-- libgcc's __attribute__((destructor))
#11 _dl_call_fini                            (ld-linux + 0x12d2)
#12 _dl_fini                                 (ld-linux + 0x536e)
#13 __run_exit_handlers                      (libc.so.6 + 0x1c9e1)
#14 exit                                     (libc.so.6 + 0x1cabe)
#15 __libc_start_call_main                   (libc.so.6 + 0x35bc)

Loaded DSOs at crash time (relevant subset)

  • libonnxruntime.so.1.24.2
  • libonnxruntime_providers_migraphx.so
  • libonnxruntime_providers_shared.so
  • libmigraphx.so.2012000
  • libmigraphx_c.so.3
  • libmigraphx_onnx.so.2012000
  • libmigraphx_tf_stub.so (our local stub replacing the TF reader)
  • libamdhip64.so.6
  • libhsa-runtime64.so.1
  • libamd_comgr.so.3

Reproduction rate

Mode                                                     Crashes / Runs   Rate
232-file sequential benchmark (VoxConverse test split)   1 / 232          ~0.43%
Isolated rerun of the crashing file                      0 / 10           0.0%
Stress with MALLOC_CHECK_=3 MALLOC_PERTURB_=42           0 / 65           0.0%
Combined                                                 1 / 307          ~0.33%

The isolated reruns and the MALLOC_CHECK_ stress pass rule out a deterministic per-file trigger. The crashing file is structurally unremarkable (171 s, 16 kHz mono, 2 speakers, 24 reference segments). This is almost certainly a rare race in the session-teardown code path.
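For reference, the isolated-rerun and stress passes amount to a loop like the following sketch. The `rerun` helper and its `cmd` parameter are illustrative, not the report's harness verbatim; the env knobs are the ones from the stress row of the table above.

```python
import os
import subprocess

def rerun(cmd, n=10, stress=False):
    """Rerun one input n times and count SIGABRT exits.

    subprocess reports a signal-killed child as a negative returncode,
    so rc == -6 means SIGABRT. `cmd` is whatever invokes the crashing
    file (hypothetical placeholder, substitute your own invocation).
    """
    env = dict(os.environ)
    if stress:
        # glibc debug-malloc knobs, as in the stress row above:
        # MALLOC_CHECK_=3 aborts on any detected inconsistency,
        # MALLOC_PERTURB_ poisons freed memory to surface stale reads.
        env["MALLOC_CHECK_"] = "3"
        env["MALLOC_PERTURB_"] = "42"
    aborts = 0
    for _ in range(n):
        if subprocess.run(cmd, env=env).returncode == -6:
            aborts += 1
    return aborts
```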

Environment

Layer         Version
OS            Fedora 43 (kernel 6.19.11-200.fc43.x86_64)
glibc         2.42-10.fc43
libgcc        15.2.1-7.fc43
ROCm          6.4.4 (Fedora packages: rocm-core, libamdhip64, libhsa-runtime64, libamd_comgr)
MIGraphX      rocm-6.4.2 + 6 local RDNA 4 enablement patches
ONNX Runtime  1.24.2 + 2 local RDNA 4 enablement patches (fp4x2 fallback, bf16 skip)
GPU           AMD Radeon RX 9070 (RDNA 4, gfx1201, Navi 48), 16 GB
CPU           Ryzen 7 5800X3D, 64 GB RAM
Model         wespeaker-voxceleb-resnet34 (ONNX, precompiled .mxr cached)

Best-available reproducer

A 232-file serial sweep driven by a Python subprocess harness — one fresh speakrs-cli --gpu <wav> per file. Expected rate ~1 crash per ~1 hour of sweep on this hardware. Happy to share the harness and the VoxConverse inputs if useful.
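The sweep boils down to the following sketch. The `speakrs-cli --gpu` invocation is taken from this report; the `cmd` parameter is a generalization (an assumption, not part of the actual harness) so any per-file command can be swept the same way.

```python
import subprocess
import sys

def sweep(wavs, cmd=("speakrs-cli", "--gpu")):
    """One fresh subprocess per file, mirroring the 232-file serial benchmark.

    Returns (crashes, total), where crashes is a list of (file, returncode)
    pairs for signal-killed runs; rc == -6 is the SIGABRT signature above.
    """
    wavs = [str(w) for w in wavs]
    crashes = []
    for wav in wavs:
        proc = subprocess.run([*cmd, wav], capture_output=True, text=True)
        if proc.returncode < 0:  # negative returncode => killed by a signal
            print(f"CRASH rc={proc.returncode}: {wav}", file=sys.stderr)
            crashes.append((wav, proc.returncode))
    return crashes, len(wavs)
```

Running one fresh process per file is what makes the crash attributable to teardown: every inference starts from a clean address space, so any corruption must be created and detected within a single session's lifetime.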

Request

Is there a known teardown ordering issue at exit on RDNA 4 that could manifest as deferred heap corruption? The single-thread-at-crash state suggests it's not a compute-time race between workers but something in destructor ordering between MIGraphX's compiled-kernel cleanup, its JIT/comgr state, and HIP's async-queue teardown.

Happy to:

  • Run a debug-malloc or ASan build of MIGraphX if someone can point at the suspected scope.
  • Share the coredump, full DSO list (gdb info sharedlibrary output), and the 6 local RDNA 4 patches we're carrying.
  • Re-test against a newer MIGraphX tag if there's a candidate fix.

Cross-filing a companion issue against microsoft/onnxruntime since attribution between MIGraphX-proper and ORT's MIGraphX EP is unclear from the backtrace alone.
