#89 - Make Manager::shutdown() idempotent to prevent post-finalize log crash#90
Conversation
…g crash Every Manager backend's shutdown() begins with a DAQIRI_LOG_INFO call, and is invoked twice during the typical bench lifecycle: 1. Explicitly from main() via daqiri::shutdown(). 2. Again from the manager's destructor (or, for SocketMgr in RoCE mode, a destructor cascade into RdmaMgr::shutdown()) during C++ __cxa_finalize. By the time the destructor cascade fires, spdlog's default logger -- a function-local static created lazily on the first DAQIRI_LOG_INFO -- has already been destroyed. The DAQIRI_LOG_INFO at the top of the second shutdown() call then crashes inside spdlog::sink_it_. Repro on DGX Spark: daqiri_bench_rdma --mode both against examples/daqiri_bench_rdma_tx_rx_spark.yaml segfaults immediately after the legitimate shutdown completes. Backtrace shows __cxa_finalize -> ~SocketMgr -> SocketMgr::shutdown -> RdmaMgr::shutdown -> daqiri::log_formatted_message -> spdlog::logger::log_ -> spdlog::logger::sink_it_ -> SIGSEGV. Fix: short-circuit shutdown() on subsequent calls by returning early when initialized_ is false. Applied symmetrically to RdmaMgr, DpdkMgr, and SocketMgr -- the log-first body-second pattern is identical in all three. DpdkMgr's existing num_init reference-counted body is preserved; the guard only activates after the body has cleared initialized_ in the final shutdown. Verified by repeated daqiri_bench_rdma --mode both runs from a bash-parent shell. Pre-fix: SIGSEGV 100% reproducible. Post-fix: clean exit 0, all destructor markers run in order. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: rgurunathan <rgurunathan@nvidia.com>
|
| Filename | Overview |
|---|---|
| src/managers/dpdk/daqiri_dpdk_mgr.cpp | Adds if (!initialized_) { return; } guard before the first log call; initialized_ is only cleared when the ref-count num_init reaches zero, so the guard does not interfere with normal ref-counted shutdown sequences. |
| src/managers/rdma/daqiri_rdma_mgr.cpp | Adds if (!initialized_) { return; } guard at the top; initialized_ is cleared at the end of the shutdown body, so a post-finalize re-entry exits immediately without touching the destroyed spdlog logger. |
| src/managers/socket/daqiri_socket_mgr.cpp | Guard moved from below the RoCE branch to the top of the function, and strengthened to if (!initialized_ && !running_.load()) { return; } so that the init-failure cleanup path (running_=true, initialized_=false) still falls through and joins any partially-spawned threads. |
Sequence Diagram
sequenceDiagram
participant App
participant Manager
participant spdlog
App->>Manager: explicit shutdown()
Note over Manager: guard: initialized_=true → pass
Manager->>spdlog: DAQIRI_LOG_INFO(...)
Manager->>Manager: cleanup body
Manager->>Manager: "initialized_ = false"
Note over spdlog: __cxa_finalize destroys default logger
Manager->>Manager: ~Manager() → shutdown()
Note over Manager: guard: initialized_=false → return immediately
Note over spdlog: (no log call — crash prevented)
Reviews (2): Last reviewed commit: "#89 - Tighten SocketMgr::shutdown() guar..." | Re-trigger Greptile
…eanup
Greptile review of the original idempotency commit flagged the secondary
`if (!initialized_ && !running_.load()) { return; }` in SocketMgr::shutdown()
as having a dead `!initialized_` clause, since the new top-of-function guard
`if (!initialized_) { return; }` already covers that state. Investigation
surfaced the deeper concern: the top guard was too aggressive.
SocketMgr::initialize() sets initialized_=false and running_=true before
running setup, then sets initialized_=true on success. If setup_tcp_endpoint
or setup_udp_endpoint throws after spawning an accept_thread or io_thread,
the catch-block shutdown() call entered with initialized_=false and
running_=true. Under the original top guard the cleanup body was skipped,
leaving the worker threads joinable on the EndpointState — the destructor
cascade would then std::terminate on an unjoined std::thread.
Tighten the top guard to require both flags cleared. The post-shutdown
re-entry from __cxa_finalize still fires (both flags cleared at the end of
the body) while the init-failure cleanup path (running_=true) falls through
and joins its threads. The pre-existing secondary check is now fully
redundant and removed.
DpdkMgr and RdmaMgr keep the simpler `if (!initialized_) { return; }` —
neither has an init-failure shutdown() caller, so the asymmetry is
intentional and isolated to the manager whose initialize() partially
spawns threads before setting initialized_=true.
Verified manually with both the existing DPDK / socket-udp / socket-tcp
normal-shutdown smokes and a new 2-endpoint UDP init-failure repro
(malformed remote IP on endpoint 2 → parse_ipv4_addr throws after
endpoint 1's io_thread is spawned): rc=1, no SIGSEGV / SIGABRT /
"terminate called" in stderr.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: rgurunathan <rgurunathan@nvidia.com>
Fixes #89. `Manager::shutdown()` was being invoked twice in the bench lifecycle, and the second call's top `DAQIRI_LOG_INFO` crashed because spdlog's default logger had already been destroyed by `__cxa_finalize`.
Adds a 1-line idempotency guard (`if (!initialized_) return;`) at the top of `RdmaMgr::shutdown()`, `DpdkMgr::shutdown()`, and `SocketMgr::shutdown()`. Same log-first body-second pattern in all three managers, same fix.
`DpdkMgr`'s existing `num_init` reference-counted body is preserved — the guard only activates after the body has cleared `initialized_` on the final shutdown.
Test plan
Discovered while tackling PR #72.