
#89 - Make Manager::shutdown() idempotent to prevent post-finalize log crash#90

Merged
RamyaGuru merged 2 commits into
mainfrom
fix-manager-shutdown-idempotency
May 20, 2026

Collaborator

@RamyaGuru RamyaGuru commented May 20, 2026

Fixes #89. `Manager::shutdown()` was invoked twice in the bench lifecycle, and the `DAQIRI_LOG_INFO` at the top of the second call crashed because spdlog's default logger had already been destroyed during `__cxa_finalize`.

Adds a 1-line idempotency guard (`if (!initialized_) return;`) at the top of `RdmaMgr::shutdown()`, `DpdkMgr::shutdown()`, and `SocketMgr::shutdown()`. All three managers share the same log-first, body-second pattern, so the same fix applies to each.
`DpdkMgr`'s existing `num_init` reference-counted body is preserved: the guard only activates after the body has cleared `initialized_` on the final shutdown.
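The guard described above can be sketched as follows. This is a minimal illustration, not the DAQIRI source: `GuardedMgr`, `teardown_count()`, and `run_double_shutdown()` are hypothetical names, and `std::cout` stands in for `DAQIRI_LOG_INFO`.

```cpp
#include <cassert>
#include <iostream>

// Hypothetical stand-in for a manager backend; not the real DAQIRI class.
class GuardedMgr {
 public:
  void initialize() { initialized_ = true; }

  void shutdown() {
    // Idempotency guard: placed before the first log statement so a repeat
    // call never touches the logger.
    if (!initialized_) { return; }
    std::cout << "GuardedMgr: shutting down\n";  // stands in for DAQIRI_LOG_INFO
    ++teardown_count_;                            // real cleanup would go here
    initialized_ = false;                         // arms the guard for later calls
  }

  ~GuardedMgr() { shutdown(); }  // second call site, as in the destructor path

  int teardown_count() const { return teardown_count_; }

 private:
  bool initialized_ = false;
  int teardown_count_ = 0;
};

// Calls shutdown() twice and reports how many times the body actually ran.
int run_double_shutdown() {
  GuardedMgr mgr;
  mgr.initialize();
  mgr.shutdown();  // explicit call: runs the body
  mgr.shutdown();  // repeat call: returns before logging
  return mgr.teardown_count();
}
```

The important detail is placement: the check sits above the log call, so the repeat invocation returns without dereferencing anything that may already be gone.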

Test plan

  • A/B verified on a clean fix-branch base (off `origin/main`):
    • With fix → `daqiri_bench_rdma --mode both` exits 0.
    • Without fix (reset to `origin/main`) → bench segfaults during `__cxa_finalize` (exit 139). Confirms the fix is necessary on plain main, not just under the perf-report branch's bench changes.
  • DPDK / Socket benches still shut down cleanly (no regression — `num_init` ref counting in DpdkMgr preserved).

Discovered while tackling PR #72.

…g crash

Every Manager backend's shutdown() begins with a DAQIRI_LOG_INFO call,
and is invoked twice during the typical bench lifecycle:
1. Explicitly from main() via daqiri::shutdown().
2. Again from the manager's destructor (or, for SocketMgr in RoCE mode,
   a destructor cascade into RdmaMgr::shutdown()) during C++
   __cxa_finalize.

By the time the destructor cascade fires, spdlog's default logger -- a
function-local static created lazily on the first DAQIRI_LOG_INFO -- has
already been destroyed. The DAQIRI_LOG_INFO at the top of the second
shutdown() call then crashes inside spdlog::sink_it_.
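The ordering problem comes from the C++ rule that objects are destroyed in reverse order of construction. The sketch below models it with block-scope objects rather than real statics, so the sequence can be observed while the program is still running. `ScopedLogger`, `ScopedMgr`, `events()`, and `run_scope()` are hypothetical names, and the logger is only a flag, not spdlog.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Records events in order so the destruction sequence can be inspected.
std::vector<std::string>& events() {
  static std::vector<std::string> log;
  return log;
}

// Hypothetical logger stand-in that only tracks whether it is still alive.
struct ScopedLogger {
  static bool& alive() {
    static bool flag = false;
    return flag;
  }
  ScopedLogger() {
    alive() = true;
    events().push_back("logger constructed");
  }
  ~ScopedLogger() {
    alive() = false;
    events().push_back("logger destroyed");
  }
};

// Hypothetical manager whose destructor would like to log.
struct ScopedMgr {
  ~ScopedMgr() {
    events().push_back(ScopedLogger::alive() ? "mgr dtor: logger alive"
                                             : "mgr dtor: logger gone");
  }
};

void run_scope() {
  ScopedMgr mgr;        // constructed first
  ScopedLogger logger;  // constructed later, like a lazily created logger
}  // destroyed in reverse order: logger first, then mgr
```

After `run_scope()` returns, the recorded sequence ends with the manager's destructor observing that the logger is already gone. The real crash is that same situation, with a log call where this sketch only checks a flag.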

Repro on DGX Spark: daqiri_bench_rdma --mode both against
examples/daqiri_bench_rdma_tx_rx_spark.yaml segfaults immediately after
the legitimate shutdown completes. Backtrace shows
__cxa_finalize -> ~SocketMgr -> SocketMgr::shutdown ->
RdmaMgr::shutdown -> daqiri::log_formatted_message ->
spdlog::logger::log_ -> spdlog::logger::sink_it_ -> SIGSEGV.

Fix: short-circuit shutdown() on subsequent calls by returning early
when initialized_ is false. Applied symmetrically to RdmaMgr, DpdkMgr,
and SocketMgr -- the log-first body-second pattern is identical in all
three. DpdkMgr's existing num_init reference-counted body is preserved;
the guard only activates after the body has cleared initialized_ in
the final shutdown.
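A rough sketch of how an early-return guard can coexist with a reference-counted body. The member names echo the text above, but the body is an assumption for illustration and does not reproduce DpdkMgr's actual teardown. `RefCountedMgr` and its accessors are hypothetical.

```cpp
#include <cassert>

// Hypothetical ref-counted manager; the body is an assumption, not DpdkMgr.
class RefCountedMgr {
 public:
  void initialize() {
    ++num_init_;
    initialized_ = true;
  }

  void shutdown() {
    if (!initialized_) { return; }    // no-op once fully torn down
    if (--num_init_ > 0) { return; }  // other users remain; keep resources
    ++teardown_count_;                // real resource release would go here
    initialized_ = false;             // only the final shutdown arms the guard
  }

  bool initialized() const { return initialized_; }
  int teardown_count() const { return teardown_count_; }

 private:
  bool initialized_ = false;
  int num_init_ = 0;
  int teardown_count_ = 0;
};
```

With two `initialize()` calls, the first `shutdown()` only decrements the count, the second releases resources and clears `initialized_`, and any further call returns at the guard.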

Verified by repeated daqiri_bench_rdma --mode both runs from a
bash-parent shell. Pre-fix: SIGSEGV 100% reproducible. Post-fix: clean
exit 0, all destructor markers run in order.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rgurunathan <rgurunathan@nvidia.com>
@RamyaGuru RamyaGuru marked this pull request as ready for review May 20, 2026 20:54

greptile-apps Bot commented May 20, 2026

Greptile Summary

This PR adds a one-line idempotency guard to DpdkMgr::shutdown(), RdmaMgr::shutdown(), and SocketMgr::shutdown() to prevent a segfault that occurred when __cxa_finalize invoked the destructor chain a second time after spdlog's default logger had already been torn down.

  • DpdkMgr / RdmaMgr receive a simple if (!initialized_) { return; } guard, consistent with each manager's existing practice of clearing initialized_ at the end of the shutdown body.
  • SocketMgr uses a dual-flag guard (if (!initialized_ && !running_.load()) { return; }) to preserve the init-failure cleanup path: initialize() sets running_=true before setup and initialized_=true only on success, so a throw mid-setup calls shutdown() with initialized_=false, running_=true — a state that must fall through to join any partially-spawned threads, avoiding std::terminate in the destructor cascade.
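The dual-flag behaviour can be illustrated with a small sketch. It assumes a single worker thread and a `fail_setup` switch that simulates a throw during endpoint setup. `DemoSocketMgr`, `fail_setup`, `thread_joinable()`, and `teardown_count()` are hypothetical and not part of the DAQIRI code.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <stdexcept>
#include <thread>

// Hypothetical manager that spawns a thread before initialization completes.
class DemoSocketMgr {
 public:
  // fail_setup simulates an exception thrown after the worker thread started.
  bool initialize(bool fail_setup) {
    initialized_ = false;
    running_.store(true);
    try {
      io_thread_ = std::thread([this] {
        while (running_.load()) {
          std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
      });
      if (fail_setup) { throw std::runtime_error("setup failed"); }
      initialized_ = true;
      return true;
    } catch (const std::exception&) {
      shutdown();  // entered with initialized_=false, running_=true
      return false;
    }
  }

  void shutdown() {
    // Return early only when both flags are cleared; the init-failure state
    // (running_=true) must fall through so the thread gets joined.
    if (!initialized_ && !running_.load()) { return; }
    running_.store(false);
    if (io_thread_.joinable()) { io_thread_.join(); }
    ++teardown_count_;
    initialized_ = false;
  }

  ~DemoSocketMgr() { shutdown(); }

  bool thread_joinable() const { return io_thread_.joinable(); }
  int teardown_count() const { return teardown_count_; }

 private:
  bool initialized_ = false;
  std::atomic<bool> running_{false};
  std::thread io_thread_;
  int teardown_count_ = 0;
};
```

With a plain `if (!initialized_) { return; }` guard, the catch-block call would skip the join and the destructor of the still-joinable `std::thread` would call `std::terminate`. Checking both flags avoids that while keeping repeat calls after a full shutdown as no-ops.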

Confidence Score: 5/5

Safe to merge — three small, well-isolated guards that each fire only after the corresponding manager has fully torn itself down.

The change is minimal: one guard per manager, each guarding on the flag that the shutdown body already clears. The SocketMgr variant correctly uses a dual-flag check to preserve the init-failure cleanup path, as confirmed by reading initialize(). The ref-counted DpdkMgr body is untouched. The previous outside-diff concern about a dead secondary guard has been resolved by this commit's tightened condition and removal of the old guard position.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/managers/dpdk/daqiri_dpdk_mgr.cpp | Adds if (!initialized_) { return; } guard before the first log call; initialized_ is only cleared when the ref-count num_init reaches zero, so the guard does not interfere with normal ref-counted shutdown sequences. |
| src/managers/rdma/daqiri_rdma_mgr.cpp | Adds if (!initialized_) { return; } guard at the top; initialized_ is cleared at the end of the shutdown body, so a post-finalize re-entry exits immediately without touching the destroyed spdlog logger. |
| src/managers/socket/daqiri_socket_mgr.cpp | Guard moved from below the RoCE branch to the top of the function, and strengthened to if (!initialized_ && !running_.load()) { return; } so that the init-failure cleanup path (running_=true, initialized_=false) still falls through and joins any partially-spawned threads. |

Sequence Diagram

sequenceDiagram
    participant App
    participant Manager
    participant spdlog

    App->>Manager: explicit shutdown()
    Note over Manager: guard: initialized_=true → pass
    Manager->>spdlog: DAQIRI_LOG_INFO(...)
    Manager->>Manager: cleanup body
    Manager->>Manager: "initialized_ = false"

    Note over spdlog: __cxa_finalize destroys default logger

    Manager->>Manager: ~Manager() → shutdown()
    Note over Manager: guard: initialized_=false → return immediately
    Note over spdlog: (no log call — crash prevented)

Reviews (2): Last reviewed commit: "#89 - Tighten SocketMgr::shutdown() guar..."

…eanup

Greptile review of the original idempotency commit flagged the secondary
`if (!initialized_ && !running_.load()) { return; }` in SocketMgr::shutdown()
as having a dead `!initialized_` clause, since the new top-of-function guard
`if (!initialized_) { return; }` already covers that state. Investigation
surfaced the deeper concern: the top guard was too aggressive.
SocketMgr::initialize() sets initialized_=false and running_=true before
running setup, then sets initialized_=true on success. If setup_tcp_endpoint
or setup_udp_endpoint throws after spawning an accept_thread or io_thread,
the catch-block shutdown() call entered with initialized_=false and
running_=true. Under the original top guard the cleanup body was skipped,
leaving the worker threads joinable on the EndpointState — the destructor
cascade would then std::terminate on an unjoined std::thread.

Tighten the top guard to require both flags cleared. The guard still
fires on the post-shutdown re-entry from __cxa_finalize (both flags are
cleared at the end of the body), while the init-failure cleanup path
(running_=true) falls through and joins its threads. The pre-existing
secondary check is now fully redundant and has been removed.

DpdkMgr and RdmaMgr keep the simpler `if (!initialized_) { return; }` —
neither has an init-failure shutdown() caller, so the asymmetry is
intentional and isolated to the manager whose initialize() partially
spawns threads before setting initialized_=true.

Verified manually with both the existing DPDK / socket-udp / socket-tcp
normal-shutdown smokes and a new 2-endpoint UDP init-failure repro
(malformed remote IP on endpoint 2 → parse_ipv4_addr throws after
endpoint 1's io_thread is spawned): rc=1, no SIGSEGV / SIGABRT /
"terminate called" in stderr.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: rgurunathan <rgurunathan@nvidia.com>
@RamyaGuru RamyaGuru merged commit 261d62f into main May 20, 2026
1 check passed
@RamyaGuru RamyaGuru deleted the fix-manager-shutdown-idempotency branch May 20, 2026 21:35

Development

Successfully merging this pull request may close these issues.

[BUG] RDMA bench shutdown segfaults from double-call of Manager::shutdown() after logger finalized
