Skip to content

Fix collective mismatch hangs in multi-rank error paths#270

Merged
nileshnegi merged 2 commits intocandidatefrom
users/nileshnegi/fix-mpi-based-errors
Apr 27, 2026
Merged

Fix collective mismatch hangs in multi-rank error paths#270
nileshnegi merged 2 commits intocandidatefrom
users/nileshnegi/fix-mpi-based-errors

Conversation

@nileshnegi
Copy link
Copy Markdown
Collaborator

@nileshnegi nileshnegi commented Apr 26, 2026

Motivation

This PR builds on top of #269, so this PR will be rebased after the other is merged.

Three places used ERR_CHECK inside rank-specific guards preceding (or within loops containing) subsequent Broadcast collectives. A failure on one rank would cause it to early-return before the collective, leaving all other ranks blocked indefinitely.

Also, fixes segfault and improves error diagnostics in multi-rank error paths

Technical Details

  • ExchangeMemory (POD path): capture hipSetDevice + hipMemExportToShareableHandle errors into exportErr and broadcast it after the handle broadcast, so all ranks see the failure and return together. Apply the same treatment to the import-side sequence (hipSetDevice, hipMemImport, hipMemAddressReserve, hipMemMap, hipMemSetAccess), broadcasting importErr before returning.
  • BroadcastTfrResult: fix non-root path broadcasting only the first element of perIterCUs[i] regardless of setSize; replace the if-guard with a loop matching the root's per-element sends.
  • NIC executor QP setup: replace ERR_CHECK inside the per-QP loop with captured ErrType values broadcast from srcMemRank and dstMemRank, respectively, so a TransitionQpToRtr/Rts failure on any rank does not leave others blocked on the next iteration's Broadcast when qpCount > 1.
  • CheckMultiNodeConfigConsistency: null out vector members in DataOptions (fillPattern, fillCompress) and GfxOptions (cuMask, prefXccTable) before sizeof-broadcast. Broadcasting a struct containing std::vector transmits rank 0's heap pointers to other ranks; on scope exit the receiving rank calls free(remote_ptr) causing a segfault. These fields are permitted to differ across ranks and are not compared by the consistency check.
  • NIC QP transition: replace bare ErrType broadcast with a POD QpTransitionResult {errType, rtrFailed} struct so the failing stage (RTR vs RTS) is preserved and reported accurately. Previously all failures reported "to RTS" regardless of which transition failed. Adds static_assert(is_trivially_copyable) to enforce broadcast safety.

Test Plan

Test Result

Submission Checklist

Copilot AI review requested due to automatic review settings April 26, 2026 16:38
@nileshnegi nileshnegi requested review from a team as code owners April 26, 2026 16:38
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR prevents multi-rank hangs caused by rank-local failures returning early before subsequent collectives, ensuring all ranks observe the failure and exit together (notably in NIC QP setup and POD fabric-handle exchange). It also includes build-system/toolchain updates that improve ROCm/HIP (+ AMD-SMI) feature detection and configuration.

Changes:

  • Synchronize error handling across ranks in NIC QP transition setup and POD fabric-handle export/import to avoid collective mismatch hangs.
  • Fix BroadcastTfrResult non-root path to receive all CU pairs (not just the first element).
  • Update CMake/Makefile/toolchain logic for ROCm path detection and compile-time API probes (HIP fabric + AMD-SMI fabric APIs), plus minor docs/changelog tweaks.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
toolchain-linux.cmake Refactors ROCm/compiler auto-detection and default build-type flags in a Linux toolchain file.
src/header/TransferBench.hpp Adds rank-synchronized error propagation around collectives; fixes CU broadcast loop; updates AMD-SMI BDF usage.
Makefile Updates HIP compiler fallback selection and switches POD/AMD-SMI enablement to compile-time API probes.
CMakeLists.txt Raises CMake minimum, improves ROCm/GPU target handling, and adds compile-time probes for POD/AMD-SMI support.
CHANGELOG.md Normalizes preset naming/formatting entries.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/header/TransferBench.hpp Outdated
Comment thread src/header/TransferBench.hpp Outdated
Comment thread src/header/TransferBench.hpp Outdated
Comment thread CMakeLists.txt Outdated
Three places used ERR_CHECK inside rank-specific guards preceding
(or within loops containing) subsequent Broadcast collectives. A
failure on one rank would cause it to early-return before the
collective, leaving all other ranks blocked indefinitely.

- ExchangeMemory (POD path): capture hipSetDevice + hipMemExport-
  ToShareableHandle errors into exportErr and broadcast it after
  the handle broadcast, so all ranks see the failure and return
  together. Apply the same treatment to the import-side sequence
  (hipSetDevice, hipMemImport, hipMemAddressReserve, hipMemMap,
  hipMemSetAccess), broadcasting importErr before returning.

- NIC executor QP setup: replace ERR_CHECK inside the per-QP loop
  with captured ErrType values broadcast from srcMemRank and
  dstMemRank respectively, so a TransitionQpToRtr/Rts failure on
  any rank does not leave others blocked on the next iteration's
  Broadcast when qpCount > 1.

- BroadcastTfrResult: fix non-root path broadcasting only the
  first element of perIterCUs[i] regardless of setSize; replace
  the if-guard with a loop matching the root's per-element sends.

Fix segfault and improve error diagnostics in multi-rank error paths

- CheckMultiNodeConfigConsistency: null out vector members in DataOptions
  (fillPattern, fillCompress) and GfxOptions (cuMask, prefXccTable) before
  sizeof-broadcast. Broadcasting a struct containing std::vector transmits
  rank 0's heap pointers to other ranks; on scope exit the receiving rank
  calls free(remote_ptr) causing a segfault. These fields are permitted to
  differ across ranks and are not compared by the consistency check.

- NIC QP transition: replace bare ErrType broadcast with a POD
  QpTransitionResult {errType, rtrFailed} struct so the failing stage
  (RTR vs RTS) is preserved and reported accurately. Previously all
  failures reported "to RTS" regardless of which transition failed.
  Adds static_assert(is_trivially_copyable) to enforce broadcast safety.

Co-authored-by: Claude <claude@anthropic.com>
@nileshnegi nileshnegi force-pushed the users/nileshnegi/fix-mpi-based-errors branch from 8cf0bc4 to 7a6defe Compare April 27, 2026 01:55
@nileshnegi nileshnegi requested a review from Copilot April 27, 2026 01:57
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/header/TransferBench.hpp Outdated
Comment thread src/header/TransferBench.hpp Outdated
Comment thread src/header/TransferBench.hpp
Comment thread src/header/TransferBench.hpp
Comment thread src/header/TransferBench.hpp
- Fix CUDA build: hip fabric macros (hipMemExportToShareableHandle,
  hipMemImportFromShareableHandle, hipMemAddressReserve, hipMemMap,
  hipMemSetAccess, hipMemUnmap, hipMemRelease, hipMemAddressFree) now
  cast CUresult to cudaError_t so callers can use a single hipError_t
  variable across runtime and driver API calls.

- Fabric handle export/import error messages now include the specific
  failing call (e.g. "HIP Error in hipMemAddressReserve during fabric
  handle import") to make multi-rank failures diagnosable without
  re-running with extra logging.

- Fix typo: "sharable" -> "shareable" in comment (matches HIP API name).

Co-authored-by: Claude <claude@anthropic.com>
@nileshnegi nileshnegi merged commit 4adc3a9 into candidate Apr 27, 2026
4 checks passed
@nileshnegi nileshnegi deleted the users/nileshnegi/fix-mpi-based-errors branch April 27, 2026 02:29
@nileshnegi nileshnegi mentioned this pull request Apr 27, 2026
1 task
nileshnegi added a commit that referenced this pull request May 2, 2026
- Initial pod communication support (#235)
- cuda + MNNVL update & pod presets (#241)
- Increase CQ size for high qps (#244)
- fix hang when NVML is present but fabricmanager isnt (#246)
- Adding nica2a preset  (#248)
- Adding HBM read bandwidth preset (#250)
- Pod Ring preset (#251)
- gfxsweep preset (#254) (#256)
- Adding Batched DMA support (hipMemcpyBatchAsync), and bmasweep preset (#255)
- Adding a wallclock consistency detection preset (#258)
- Adding smoketest preset for simple correctness tests (#266)
- Help / envvars / presets presets (#267)
- Modernize CMake build (#268)
- Replace version-based pod/amd-smi detection with compile-time API probes (#269)
- Fix collective mismatch hangs in multi-rank error paths (#270)
- Fix SHOW_ITERATIONS table truncation with multiple transfers per executor (#271)
- Reformat a2asweep output to match gfxsweep style (#272)
- Gfx sweep update (#274)
- Increasing flush frequency in smoketest (#275)
- Adding new experimental copy-only GFX kernel, gfxsweep update (#277)
- Fixes for cuMem compilation and invalid device ordinal (#278)
- Simplifying socket connect, allow for using host address (#279)
- Updating podring to run on single node without need to force single pod (#280)
- Adding SHOW_PERCENTILES to show extra per-iteration statistics (#281)

---------

Co-authored-by: AtlantaPepsi <timhu102@gmail.com>
Co-authored-by: Pak Nin Lui <pak.lui@amd.com>
Co-authored-by: pierreantoineH <PierreAntoine.Harraud@amd.com>
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
Co-authored-by: Claude <claude@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
nileshnegi added a commit that referenced this pull request May 2, 2026
- Initial pod communication support (#235)
- cuda + MNNVL update & pod presets (#241)
- Increase CQ size for high qps (#244)
- fix hang when NVML is present but fabricmanager isnt (#246)
- Adding nica2a preset  (#248)
- Adding HBM read bandwidth preset (#250)
- Pod Ring preset (#251)
- gfxsweep preset (#254) (#256)
- Adding Batched DMA support (hipMemcpyBatchAsync), and bmasweep preset (#255)
- Adding a wallclock consistency detection preset (#258)
- Adding smoketest preset for simple correctness tests (#266)
- Help / envvars / presets presets (#267)
- Modernize CMake build (#268)
- Replace version-based pod/amd-smi detection with compile-time API probes (#269)
- Fix collective mismatch hangs in multi-rank error paths (#270)
- Fix SHOW_ITERATIONS table truncation with multiple transfers per executor (#271)
- Reformat a2asweep output to match gfxsweep style (#272)
- Gfx sweep update (#274)
- Increasing flush frequency in smoketest (#275)
- Adding new experimental copy-only GFX kernel, gfxsweep update (#277)
- Fixes for cuMem compilation and invalid device ordinal (#278)
- Simplifying socket connect, allow for using host address (#279)
- Updating podring to run on single node without need to force single pod (#280)
- Adding SHOW_PERCENTILES to show extra per-iteration statistics (#281)

---------

Co-authored-by: Tim <43156029+AtlantaPepsi@users.noreply.github.com>
Co-authored-by: Pak Nin Lui <pak.lui@amd.com>
Co-authored-by: pierreantoineH <PierreAntoine.Harraud@amd.com>
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
Co-authored-by: Claude <claude@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants