Intermittent corrupted double-linked list SIGABRT at process exit on RDNA 4 (gfx1201) with MIGraphX 6.4.2 + ORT 1.24.2 #4792

@maherr

Description

Summary

A long-running batch of 232 sequential ORT + MIGraphX inference jobs (one fresh subprocess per job, WeSpeaker ResNet34 embedding model on 16 kHz mono WAVs) produces a heap-corruption SIGABRT in 1 of 232 runs on an RX 9070 (gfx1201, RDNA 4) under ROCm 6.4.4 / MIGraphX rocm-6.4.2 / ORT 1.24.2 / glibc 2.42 / libgcc 15.2.1.

The abort signature is glibc's corrupted double-linked list, printed from malloc_printerr inside unlink_chunk. The backtrace shows the free happening in libgcc_s.so.1!btree_destroy called from release_registered_frames during _dl_fini — i.e., the FDE (DWARF unwind) btree teardown at process exit. All application work (main() in our Rust wrapper) completed successfully before the abort, and the inference JSON was generated correctly.

The single-threaded-at-crash state strongly suggests a deferred heap corruption originating during MIGraphX / MIGraphX EP / HIP runtime teardown, which is only detected later, when libgcc's shutdown free() path coalesces against the damaged chunk.

Upfront caveat on platform

We recognize gfx1201 / RDNA 4 is not officially supported in the MIGraphX 6.4 branch. We're running MIGraphX at the rocm-6.4.2 tag with 6 small local patches to enable building for gfx1201 on Fedora 43 (TF protobuf ABI, MLIR stubs, hipcc arch-list entry, C API unlink). None of our patches touch allocator, teardown, or std::atexit/destructor code paths. Full patch set available on request. If the right answer is "wait for ROCm 7.x official gfx1201 support," that's fine — we're filing primarily because we have a fresh coredump with a clean backtrace and wanted to make sure it's on the record, in case it helps identify a latent issue that's independent of arch support.

Signature

Using migraphx mode, ahc_threshold_mode=adaptive
corrupted double-linked list
<SIGABRT, rc=-6>

Backtrace (from systemd-coredump)

Thread 1 (the only thread at crash time):
#0  __pthread_kill_implementation            (libc.so.6 + 0x743cc)
#1  raise                                    (libc.so.6 + 0x1a15e)
#2  abort                                    (libc.so.6 + 0x16d0)
#3  __libc_message_impl.cold                 (libc.so.6 + 0x273b)
#4  malloc_printerr                          (libc.so.6 + 0x7e665)   <-- "corrupted double-linked list" prints here
#5  unlink_chunk.isra.0                      (libc.so.6 + 0x7f034)   <-- linkage check fails
#6  malloc_consolidate                       (libc.so.6 + 0x7f0e5)
#7  _int_free_maybe_consolidate.part.0       (libc.so.6 + 0x80290)
#8  _int_free_chunk                          (libc.so.6 + 0x80801)
#9  btree_destroy                            (libgcc_s.so.1 + 0xb7c) <-- FDE btree teardown
#10 release_registered_frames                (libgcc_s.so.1 + 0xbb4) <-- libgcc's __attribute__((destructor))
#11 _dl_call_fini                            (ld-linux + 0x12d2)
#12 _dl_fini                                 (ld-linux + 0x536e)
#13 __run_exit_handlers                      (libc.so.6 + 0x1c9e1)
#14 exit                                     (libc.so.6 + 0x1cabe)
#15 __libc_start_call_main                   (libc.so.6 + 0x35bc)

Loaded DSOs at crash time (relevant subset)

  • libonnxruntime.so.1.24.2
  • libonnxruntime_providers_migraphx.so
  • libonnxruntime_providers_shared.so
  • libmigraphx.so.2012000
  • libmigraphx_c.so.3
  • libmigraphx_onnx.so.2012000
  • libmigraphx_tf_stub.so (our local stub replacing the TF reader)
  • libamdhip64.so.6
  • libhsa-runtime64.so.1
  • libamd_comgr.so.3

Reproduction rate

Mode                                                     Crashes / Runs   Rate
232-file sequential benchmark (VoxConverse test split)   1 / 232          ~0.43%
Isolated rerun of the crashing file                      0 / 10           0.0%
Stress with MALLOC_CHECK_=3 MALLOC_PERTURB_=42           0 / 65           0.0%
Combined                                                 1 / 307          ~0.33%

The isolated reruns and the MALLOC_CHECK_ stress pass rule out a deterministic per-file trigger. The crashing file is structurally unremarkable (171 s, 16 kHz mono, 2 speakers, 24 reference segments). This is almost certainly a rare race in the session-teardown code path.
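For reference, the isolated-rerun and stress passes amount to a loop like the following sketch. The `rerun` helper and its `cmd` parameter are illustrative, not the report's harness verbatim; the env knobs are the ones from the stress row of the table above.

```python
import os
import subprocess

def rerun(cmd, n=10, stress=False):
    """Rerun one input n times and count SIGABRT exits.

    subprocess reports a signal-killed child as a negative returncode,
    so rc == -6 means SIGABRT. `cmd` is whatever invokes the crashing
    file (hypothetical placeholder, substitute your own invocation).
    """
    env = dict(os.environ)
    if stress:
        # glibc debug-malloc knobs, as in the stress row above:
        # MALLOC_CHECK_=3 aborts on any detected inconsistency,
        # MALLOC_PERTURB_ poisons freed memory to surface stale reads.
        env["MALLOC_CHECK_"] = "3"
        env["MALLOC_PERTURB_"] = "42"
    aborts = 0
    for _ in range(n):
        if subprocess.run(cmd, env=env).returncode == -6:
            aborts += 1
    return aborts
```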

Environment

Layer         Version
OS            Fedora 43 (kernel 6.19.11-200.fc43.x86_64)
glibc         2.42-10.fc43
libgcc        15.2.1-7.fc43
ROCm          6.4.4 (Fedora packages: rocm-core, libamdhip64, libhsa-runtime64, libamd_comgr)
MIGraphX      rocm-6.4.2 + 6 local RDNA 4 enablement patches
ONNX Runtime  1.24.2 + 2 local RDNA 4 enablement patches (fp4x2 fallback, bf16 skip)
GPU           AMD Radeon RX 9070 (RDNA 4, gfx1201, Navi 48), 16 GB
CPU           Ryzen 7 5800X3D, 64 GB RAM
Model         wespeaker-voxceleb-resnet34 (ONNX, precompiled .mxr cached)

Best-available reproducer

A 232-file serial sweep driven by a Python subprocess harness — one fresh speakrs-cli --gpu <wav> per file. Expected rate ~1 crash per ~1 hour of sweep on this hardware. Happy to share the harness and the VoxConverse inputs if useful.
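The sweep boils down to the following sketch. The `speakrs-cli --gpu` invocation is taken from this report; the `cmd` parameter is a generalization (an assumption, not part of the actual harness) so any per-file command can be swept the same way.

```python
import subprocess
import sys

def sweep(wavs, cmd=("speakrs-cli", "--gpu")):
    """One fresh subprocess per file, mirroring the 232-file serial benchmark.

    Returns (crashes, total), where crashes is a list of (file, returncode)
    pairs for signal-killed runs; rc == -6 is the SIGABRT signature above.
    """
    wavs = [str(w) for w in wavs]
    crashes = []
    for wav in wavs:
        proc = subprocess.run([*cmd, wav], capture_output=True, text=True)
        if proc.returncode < 0:  # negative returncode => killed by a signal
            print(f"CRASH rc={proc.returncode}: {wav}", file=sys.stderr)
            crashes.append((wav, proc.returncode))
    return crashes, len(wavs)
```

Running one fresh process per file is what makes the crash attributable to teardown: every inference starts from a clean address space, so any corruption must be created and detected within a single session's lifetime.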

Request

Is there a known teardown ordering issue at exit on RDNA 4 that could manifest as deferred heap corruption? The single-thread-at-crash state suggests it's not a compute-time race between workers but something in destructor ordering between MIGraphX's compiled-kernel cleanup, its JIT/comgr state, and HIP's async-queue teardown.

Happy to:

  • Run a debug-malloc or ASan build of MIGraphX if someone can point at the suspected scope.
  • Share the coredump, full DSO list (gdb info sharedlibrary output), and the 6 local RDNA 4 patches we're carrying.
  • Re-test against a newer MIGraphX tag if there's a candidate fix.

Cross-filing a companion issue against microsoft/onnxruntime since attribution between MIGraphX-proper and ORT's MIGraphX EP is unclear from the backtrace alone.
