Fix collective mismatch hangs in multi-rank error paths by nileshnegi · Pull Request #270 · ROCm/TransferBench

nileshnegi · 2026-04-26T16:38:11Z

Motivation

This PR builds on top of #269, so this PR will be rebased after the other is merged.

Three places used ERR_CHECK inside rank-specific guards preceding (or within loops containing) subsequent Broadcast collectives. A failure on one rank would cause it to early-return before the collective, leaving all other ranks blocked indefinitely.

Also, fixes segfault and improves error diagnostics in multi-rank error paths

Technical Details

ExchangeMemory (POD path): capture hipSetDevice + hipMemExportToShareableHandle errors into exportErr and broadcast it after the handle broadcast, so all ranks see the failure and return together. Apply the same treatment to the import-side sequence (hipSetDevice, hipMemImport, hipMemAddressReserve, hipMemMap, hipMemSetAccess), broadcasting importErr before returning.
BroadcastTfrResult: fix non-root path broadcasting only the first element of perIterCUs[i] regardless of setSize; replace the if-guard with a loop matching the root's per-element sends.
NIC executor QP setup: replace ERR_CHECK inside the per-QP loop with captured ErrType values broadcast from srcMemRank and dstMemRank, respectively, so a TransitionQpToRtr/Rts failure on any rank does not leave others blocked on the next iteration's Broadcast when qpCount > 1.
CheckMultiNodeConfigConsistency: null out vector members in DataOptions (fillPattern, fillCompress) and GfxOptions (cuMask, prefXccTable) before sizeof-broadcast. Broadcasting a struct containing std::vector transmits rank 0's heap pointers to other ranks; on scope exit the receiving rank calls free(remote_ptr) causing a segfault. These fields are permitted to differ across ranks and are not compared by the consistency check.
NIC QP transition: replace bare ErrType broadcast with a POD QpTransitionResult {errType, rtrFailed} struct so the failing stage (RTR vs RTS) is preserved and reported accurately. Previously all failures reported "to RTS" regardless of which transition failed. Adds static_assert(is_trivially_copyable) to enforce broadcast safety.

Test Plan

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Copilot

Pull request overview

This PR prevents multi-rank hangs caused by rank-local failures returning early before subsequent collectives, ensuring all ranks observe the failure and exit together (notably in NIC QP setup and POD fabric-handle exchange). It also includes build-system/toolchain updates that improve ROCm/HIP (+ AMD-SMI) feature detection and configuration.

Changes:

Synchronize error handling across ranks in NIC QP transition setup and POD fabric-handle export/import to avoid collective mismatch hangs.
Fix BroadcastTfrResult non-root path to receive all CU pairs (not just the first element).
Update CMake/Makefile/toolchain logic for ROCm path detection and compile-time API probes (HIP fabric + AMD-SMI fabric APIs), plus minor docs/changelog tweaks.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
toolchain-linux.cmake	Refactors ROCm/compiler auto-detection and default build-type flags in a Linux toolchain file.
src/header/TransferBench.hpp	Adds rank-synchronized error propagation around collectives; fixes CU broadcast loop; updates AMD-SMI BDF usage.
Makefile	Updates HIP compiler fallback selection and switches POD/AMD-SMI enablement to compile-time API probes.
CMakeLists.txt	Raises CMake minimum, improves ROCm/GPU target handling, and adds compile-time probes for POD/AMD-SMI support.
CHANGELOG.md	Normalizes preset naming/formatting entries.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Three places used ERR_CHECK inside rank-specific guards preceding (or within loops containing) subsequent Broadcast collectives. A failure on one rank would cause it to early-return before the collective, leaving all other ranks blocked indefinitely. - ExchangeMemory (POD path): capture hipSetDevice + hipMemExport- ToShareableHandle errors into exportErr and broadcast it after the handle broadcast, so all ranks see the failure and return together. Apply the same treatment to the import-side sequence (hipSetDevice, hipMemImport, hipMemAddressReserve, hipMemMap, hipMemSetAccess), broadcasting importErr before returning. - NIC executor QP setup: replace ERR_CHECK inside the per-QP loop with captured ErrType values broadcast from srcMemRank and dstMemRank respectively, so a TransitionQpToRtr/Rts failure on any rank does not leave others blocked on the next iteration's Broadcast when qpCount > 1. - BroadcastTfrResult: fix non-root path broadcasting only the first element of perIterCUs[i] regardless of setSize; replace the if-guard with a loop matching the root's per-element sends. Fix segfault and improve error diagnostics in multi-rank error paths - CheckMultiNodeConfigConsistency: null out vector members in DataOptions (fillPattern, fillCompress) and GfxOptions (cuMask, prefXccTable) before sizeof-broadcast. Broadcasting a struct containing std::vector transmits rank 0's heap pointers to other ranks; on scope exit the receiving rank calls free(remote_ptr) causing a segfault. These fields are permitted to differ across ranks and are not compared by the consistency check. - NIC QP transition: replace bare ErrType broadcast with a POD QpTransitionResult {errType, rtrFailed} struct so the failing stage (RTR vs RTS) is preserved and reported accurately. Previously all failures reported "to RTS" regardless of which transition failed. Adds static_assert(is_trivially_copyable) to enforce broadcast safety. Co-authored-by: Claude <claude@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 5 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- Fix CUDA build: hip fabric macros (hipMemExportToShareableHandle, hipMemImportFromShareableHandle, hipMemAddressReserve, hipMemMap, hipMemSetAccess, hipMemUnmap, hipMemRelease, hipMemAddressFree) now cast CUresult to cudaError_t so callers can use a single hipError_t variable across runtime and driver API calls. - Fabric handle export/import error messages now include the specific failing call (e.g. "HIP Error in hipMemAddressReserve during fabric handle import") to make multi-rank failures diagnosable without re-running with extra logging. - Fix typo: "sharable" -> "shareable" in comment (matches HIP API name). Co-authored-by: Claude <claude@anthropic.com>

- Initial pod communication support (#235) - cuda + MNNVL update & pod presets (#241) - Increase CQ size for high qps (#244) - fix hang when NVML is present but fabricmanager isnt (#246) - Adding nica2a preset (#248) - Adding HBM read bandwidth preset (#250) - Pod Ring preset (#251) - gfxsweep preset (#254) (#256) - Adding Batched DMA support (hipMemcpyBatchAsync), and bmasweep preset (#255) - Adding a wallclock consistency detection preset (#258) - Adding smoketest preset for simple correctness tests (#266) - Help / envvars / presets presets (#267) - Modernize CMake build (#268) - Replace version-based pod/amd-smi detection with compile-time API probes (#269) - Fix collective mismatch hangs in multi-rank error paths (#270) - Fix SHOW_ITERATIONS table truncation with multiple transfers per executor (#271) - Reformat a2asweep output to match gfxsweep style (#272) - Gfx sweep update (#274) - Increasing flush frequency in smoketest (#275) - Adding new experimental copy-only GFX kernel, gfxsweep update (#277) - Fixes for cuMem compilation and invalid device ordinal (#278) - Simplifying socket connect, allow for using host address (#279) - Updating podring to run on single node without need to force single pod (#280) - Adding SHOW_PERCENTILES to show extra per-iteration statistics (#281) --------- Co-authored-by: AtlantaPepsi <timhu102@gmail.com> Co-authored-by: Pak Nin Lui <pak.lui@amd.com> Co-authored-by: pierreantoineH <PierreAntoine.Harraud@amd.com> Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com> Co-authored-by: Claude <claude@anthropic.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

- Initial pod communication support (#235) - cuda + MNNVL update & pod presets (#241) - Increase CQ size for high qps (#244) - fix hang when NVML is present but fabricmanager isnt (#246) - Adding nica2a preset (#248) - Adding HBM read bandwidth preset (#250) - Pod Ring preset (#251) - gfxsweep preset (#254) (#256) - Adding Batched DMA support (hipMemcpyBatchAsync), and bmasweep preset (#255) - Adding a wallclock consistency detection preset (#258) - Adding smoketest preset for simple correctness tests (#266) - Help / envvars / presets presets (#267) - Modernize CMake build (#268) - Replace version-based pod/amd-smi detection with compile-time API probes (#269) - Fix collective mismatch hangs in multi-rank error paths (#270) - Fix SHOW_ITERATIONS table truncation with multiple transfers per executor (#271) - Reformat a2asweep output to match gfxsweep style (#272) - Gfx sweep update (#274) - Increasing flush frequency in smoketest (#275) - Adding new experimental copy-only GFX kernel, gfxsweep update (#277) - Fixes for cuMem compilation and invalid device ordinal (#278) - Simplifying socket connect, allow for using host address (#279) - Updating podring to run on single node without need to force single pod (#280) - Adding SHOW_PERCENTILES to show extra per-iteration statistics (#281) --------- Co-authored-by: Tim <43156029+AtlantaPepsi@users.noreply.github.com> Co-authored-by: Pak Nin Lui <pak.lui@amd.com> Co-authored-by: pierreantoineH <PierreAntoine.Harraud@amd.com> Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com> Co-authored-by: Claude <claude@anthropic.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings April 26, 2026 16:38

nileshnegi requested review from a team as code owners April 26, 2026 16:38

Copilot started reviewing on behalf of nileshnegi April 26, 2026 16:39 View session

Copilot AI reviewed Apr 26, 2026

View reviewed changes

Comment thread src/header/TransferBench.hpp Outdated

Comment thread src/header/TransferBench.hpp Outdated

Comment thread src/header/TransferBench.hpp Outdated

Comment thread CMakeLists.txt Outdated

gilbertlee-amd approved these changes Apr 26, 2026

View reviewed changes

nileshnegi force-pushed the users/nileshnegi/fix-mpi-based-errors branch from 8cf0bc4 to 7a6defe Compare April 27, 2026 01:55

nileshnegi requested a review from Copilot April 27, 2026 01:57

Copilot started reviewing on behalf of nileshnegi April 27, 2026 01:59 View session

Copilot AI reviewed Apr 27, 2026

View reviewed changes

Comment thread src/header/TransferBench.hpp Outdated

Comment thread src/header/TransferBench.hpp Outdated

Comment thread src/header/TransferBench.hpp

Comment thread src/header/TransferBench.hpp

Comment thread src/header/TransferBench.hpp

nileshnegi merged commit 4adc3a9 into candidate Apr 27, 2026
4 checks passed

nileshnegi deleted the users/nileshnegi/fix-mpi-based-errors branch April 27, 2026 02:29

nileshnegi mentioned this pull request Apr 27, 2026

TransferBench v1.67.0 #273

Open

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix collective mismatch hangs in multi-rank error paths#270

Fix collective mismatch hangs in multi-rank error paths#270
nileshnegi merged 2 commits intocandidatefrom
users/nileshnegi/fix-mpi-based-errors

nileshnegi commented Apr 26, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

nileshnegi commented Apr 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

nileshnegi commented Apr 26, 2026 •

edited

Loading