Fix collective mismatch hangs in multi-rank error paths#270
Fix collective mismatch hangs in multi-rank error paths#270nileshnegi merged 2 commits intocandidatefrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR prevents multi-rank hangs caused by rank-local failures returning early before subsequent collectives, ensuring all ranks observe the failure and exit together (notably in NIC QP setup and POD fabric-handle exchange). It also includes build-system/toolchain updates that improve ROCm/HIP (+ AMD-SMI) feature detection and configuration.
Changes:
- Synchronize error handling across ranks in NIC QP transition setup and POD fabric-handle export/import to avoid collective mismatch hangs.
- Fix
BroadcastTfrResultnon-root path to receive all CU pairs (not just the first element). - Update CMake/Makefile/toolchain logic for ROCm path detection and compile-time API probes (HIP fabric + AMD-SMI fabric APIs), plus minor docs/changelog tweaks.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| toolchain-linux.cmake | Refactors ROCm/compiler auto-detection and default build-type flags in a Linux toolchain file. |
| src/header/TransferBench.hpp | Adds rank-synchronized error propagation around collectives; fixes CU broadcast loop; updates AMD-SMI BDF usage. |
| Makefile | Updates HIP compiler fallback selection and switches POD/AMD-SMI enablement to compile-time API probes. |
| CMakeLists.txt | Raises CMake minimum, improves ROCm/GPU target handling, and adds compile-time probes for POD/AMD-SMI support. |
| CHANGELOG.md | Normalizes preset naming/formatting entries. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Three places used ERR_CHECK inside rank-specific guards preceding
(or within loops containing) subsequent Broadcast collectives. A
failure on one rank would cause it to early-return before the
collective, leaving all other ranks blocked indefinitely.
- ExchangeMemory (POD path): capture hipSetDevice + hipMemExport-
ToShareableHandle errors into exportErr and broadcast it after
the handle broadcast, so all ranks see the failure and return
together. Apply the same treatment to the import-side sequence
(hipSetDevice, hipMemImport, hipMemAddressReserve, hipMemMap,
hipMemSetAccess), broadcasting importErr before returning.
- NIC executor QP setup: replace ERR_CHECK inside the per-QP loop
with captured ErrType values broadcast from srcMemRank and
dstMemRank respectively, so a TransitionQpToRtr/Rts failure on
any rank does not leave others blocked on the next iteration's
Broadcast when qpCount > 1.
- BroadcastTfrResult: fix non-root path broadcasting only the
first element of perIterCUs[i] regardless of setSize; replace
the if-guard with a loop matching the root's per-element sends.
Fix segfault and improve error diagnostics in multi-rank error paths
- CheckMultiNodeConfigConsistency: null out vector members in DataOptions
(fillPattern, fillCompress) and GfxOptions (cuMask, prefXccTable) before
sizeof-broadcast. Broadcasting a struct containing std::vector transmits
rank 0's heap pointers to other ranks; on scope exit the receiving rank
calls free(remote_ptr) causing a segfault. These fields are permitted to
differ across ranks and are not compared by the consistency check.
- NIC QP transition: replace bare ErrType broadcast with a POD
QpTransitionResult {errType, rtrFailed} struct so the failing stage
(RTR vs RTS) is preserved and reported accurately. Previously all
failures reported "to RTS" regardless of which transition failed.
Adds static_assert(is_trivially_copyable) to enforce broadcast safety.
Co-authored-by: Claude <claude@anthropic.com>
8cf0bc4 to
7a6defe
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Fix CUDA build: hip fabric macros (hipMemExportToShareableHandle, hipMemImportFromShareableHandle, hipMemAddressReserve, hipMemMap, hipMemSetAccess, hipMemUnmap, hipMemRelease, hipMemAddressFree) now cast CUresult to cudaError_t so callers can use a single hipError_t variable across runtime and driver API calls. - Fabric handle export/import error messages now include the specific failing call (e.g. "HIP Error in hipMemAddressReserve during fabric handle import") to make multi-rank failures diagnosable without re-running with extra logging. - Fix typo: "sharable" -> "shareable" in comment (matches HIP API name). Co-authored-by: Claude <claude@anthropic.com>
- Initial pod communication support (#235) - cuda + MNNVL update & pod presets (#241) - Increase CQ size for high qps (#244) - fix hang when NVML is present but fabricmanager isnt (#246) - Adding nica2a preset (#248) - Adding HBM read bandwidth preset (#250) - Pod Ring preset (#251) - gfxsweep preset (#254) (#256) - Adding Batched DMA support (hipMemcpyBatchAsync), and bmasweep preset (#255) - Adding a wallclock consistency detection preset (#258) - Adding smoketest preset for simple correctness tests (#266) - Help / envvars / presets presets (#267) - Modernize CMake build (#268) - Replace version-based pod/amd-smi detection with compile-time API probes (#269) - Fix collective mismatch hangs in multi-rank error paths (#270) - Fix SHOW_ITERATIONS table truncation with multiple transfers per executor (#271) - Reformat a2asweep output to match gfxsweep style (#272) - Gfx sweep update (#274) - Increasing flush frequency in smoketest (#275) - Adding new experimental copy-only GFX kernel, gfxsweep update (#277) - Fixes for cuMem compilation and invalid device ordinal (#278) - Simplifying socket connect, allow for using host address (#279) - Updating podring to run on single node without need to force single pod (#280) - Adding SHOW_PERCENTILES to show extra per-iteration statistics (#281) --------- Co-authored-by: AtlantaPepsi <timhu102@gmail.com> Co-authored-by: Pak Nin Lui <pak.lui@amd.com> Co-authored-by: pierreantoineH <PierreAntoine.Harraud@amd.com> Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com> Co-authored-by: Claude <claude@anthropic.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Initial pod communication support (#235) - cuda + MNNVL update & pod presets (#241) - Increase CQ size for high qps (#244) - fix hang when NVML is present but fabricmanager isnt (#246) - Adding nica2a preset (#248) - Adding HBM read bandwidth preset (#250) - Pod Ring preset (#251) - gfxsweep preset (#254) (#256) - Adding Batched DMA support (hipMemcpyBatchAsync), and bmasweep preset (#255) - Adding a wallclock consistency detection preset (#258) - Adding smoketest preset for simple correctness tests (#266) - Help / envvars / presets presets (#267) - Modernize CMake build (#268) - Replace version-based pod/amd-smi detection with compile-time API probes (#269) - Fix collective mismatch hangs in multi-rank error paths (#270) - Fix SHOW_ITERATIONS table truncation with multiple transfers per executor (#271) - Reformat a2asweep output to match gfxsweep style (#272) - Gfx sweep update (#274) - Increasing flush frequency in smoketest (#275) - Adding new experimental copy-only GFX kernel, gfxsweep update (#277) - Fixes for cuMem compilation and invalid device ordinal (#278) - Simplifying socket connect, allow for using host address (#279) - Updating podring to run on single node without need to force single pod (#280) - Adding SHOW_PERCENTILES to show extra per-iteration statistics (#281) --------- Co-authored-by: Tim <43156029+AtlantaPepsi@users.noreply.github.com> Co-authored-by: Pak Nin Lui <pak.lui@amd.com> Co-authored-by: pierreantoineH <PierreAntoine.Harraud@amd.com> Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com> Co-authored-by: Claude <claude@anthropic.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Motivation
This PR builds on top of #269, so this PR will be rebased after the other is merged.
Three places used ERR_CHECK inside rank-specific guards preceding (or within loops containing) subsequent Broadcast collectives. A failure on one rank would cause it to early-return before the collective, leaving all other ranks blocked indefinitely.
Also, fixes segfault and improves error diagnostics in multi-rank error paths
Technical Details
exportErrand broadcast it after the handle broadcast, so all ranks see the failure and return together. Apply the same treatment to the import-side sequence (hipSetDevice, hipMemImport, hipMemAddressReserve, hipMemMap, hipMemSetAccess), broadcastingimportErrbefore returning.QpTransitionResult {errType, rtrFailed}struct so the failing stage (RTR vs RTS) is preserved and reported accurately. Previously all failures reported "to RTS" regardless of which transition failed. Addsstatic_assert(is_trivially_copyable)to enforce broadcast safety.Test Plan
Test Result
Submission Checklist