Fixes for cuMem compilation and invalid device ordinal by AtlantaPepsi · Pull Request #278 · ROCm/TransferBench

AtlantaPepsi · 2026-04-29T21:04:37Z

Motivation

cuMem symbols and definitions are currently guarded by pod communication enablement. It's not a long term solution as these two are not always coupled and usage might diverge in future. The separation of these two also fixes existing linking error with cuMemcpyAsync or CUResult in absence of POD_COMM_ENABLED.

Technical Details

Introduced a separate CUMEM_ENABLED macro for build process as well as header. POD_COMM_ENABLED will have to depend on cuMem enablement as well.
Minor adjustment for error reporting for cuda runtime calls
Eliminated unnecessary hipSetDevice: Previously in cuda + MNNVL update & pod presets #241 multiple hipSetDevice are added throughout cuMem allocation and release to make sure context is always initialized. Not all of them were needed, and unconditional invocation also caused error for non GPU executors.

Test Plan

Tested all combination of Makefile flags and made sure compilation/linking succeeded.
Previously on CI machines with more CPU NUMA nodes than GPU devices, certain sweeping presets such as p2p would fail, which is fixed now.

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Copilot

Pull request overview

This PR decouples CUDA driver API (cuMem / libcuda) enablement from pod-communication support, and removes unnecessary hipSetDevice calls that could trigger “invalid device ordinal” on non-GPU executors.

Changes:

Introduce CUMEM_ENABLED as a separate build/header macro and make CUDA pod-comm depend on it.
Adjust error reporting strings for CUDA runtime vs HIP runtime errors.
Remove/guard some hipSetDevice calls to avoid invalid device selection in CPU-only contexts.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
`src/header/TransferBench.hpp`	Guards CUDA driver API usage behind `CUMEM_ENABLED`, tweaks error messages, and reduces `hipSetDevice` usage to avoid invalid device ordinal.
`Makefile`	Adds `DISABLE_CUMEM` / `CUMEM_ENABLED` logic and makes CUDA pod-comm conditional on cuMem availability.

Comments suppressed due to low confidence (1)

src/header/TransferBench.hpp:5090

When CUMEM_ENABLED is not defined (e.g., TransferBenchCuda with DISABLE_CUMEM=1), this code falls back to hipMemcpyAsync(..., memcpyKind, ...), but memcpyKind is only declared under the HIP_PLATFORM_AMD && HIP_VERSION_MAJOR>=6 block. Under NVCC builds that block is not active, so memcpyKind is undefined and the CUDA build will fail to compile. Define an appropriate memcpy kind for the CUDA fallback path (e.g., device-to-device or default) and/or restructure the preprocessor guards so the fallback does not reference an undeclared variable.

                                  resources.numBytes, stream));
#else
          ERR_CHECK(hipMemcpyAsync(resources.dstMem[dstIdx], resources.srcMem[0], resources.numBytes,
                                   memcpyKind, stream));
#endif

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>

Copilot

Pull request overview

This PR decouples CUDA driver API (cuMem/cuMemcpyAsync/CUresult) enablement from pod-communication enablement, and removes/guards several hipSetDevice calls that could trigger “invalid device ordinal” on non-GPU executors.

Changes:

Add a standalone CUMEM_ENABLED build macro (and link -lcuda) for TransferBenchCuda, with CUDA pod-comm gated on cuMem availability.
Switch the DMA executor’s CUDA driver copy path to be controlled by CUMEM_ENABLED instead of __NVCC__.
Adjust error reporting strings for CUDA runtime vs HIP, and avoid hipSetDevice in CPU-memory initialization paths.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
`src/header/TransferBench.hpp`	Uses `CUMEM_ENABLED` to gate CUDA driver APIs, tweaks error messages, and reduces unconditional `hipSetDevice` usage to avoid invalid ordinals.
`Makefile`	Introduces `DISABLE_CUMEM` / `CUMEM_ENABLED` and makes CUDA pod-comm depend on cuMem enablement; moves `-lcuda` linkage to the cuMem feature gate.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+#if defined(CUMEM_ENABLED)
          ERR_CHECK(cuMemcpyAsync((CUdeviceptr)resources.dstMem[dstIdx],
                                  (CUdeviceptr)resources.srcMem[0],
                                  resources.numBytes, stream));


- Initial pod communication support (#235) - cuda + MNNVL update & pod presets (#241) - Increase CQ size for high qps (#244) - fix hang when NVML is present but fabricmanager isnt (#246) - Adding nica2a preset (#248) - Adding HBM read bandwidth preset (#250) - Pod Ring preset (#251) - gfxsweep preset (#254) (#256) - Adding Batched DMA support (hipMemcpyBatchAsync), and bmasweep preset (#255) - Adding a wallclock consistency detection preset (#258) - Adding smoketest preset for simple correctness tests (#266) - Help / envvars / presets presets (#267) - Modernize CMake build (#268) - Replace version-based pod/amd-smi detection with compile-time API probes (#269) - Fix collective mismatch hangs in multi-rank error paths (#270) - Fix SHOW_ITERATIONS table truncation with multiple transfers per executor (#271) - Reformat a2asweep output to match gfxsweep style (#272) - Gfx sweep update (#274) - Increasing flush frequency in smoketest (#275) - Adding new experimental copy-only GFX kernel, gfxsweep update (#277) - Fixes for cuMem compilation and invalid device ordinal (#278) - Simplifying socket connect, allow for using host address (#279) - Updating podring to run on single node without need to force single pod (#280) - Adding SHOW_PERCENTILES to show extra per-iteration statistics (#281) --------- Co-authored-by: AtlantaPepsi <timhu102@gmail.com> Co-authored-by: Pak Nin Lui <pak.lui@amd.com> Co-authored-by: pierreantoineH <PierreAntoine.Harraud@amd.com> Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com> Co-authored-by: Claude <claude@anthropic.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

- Initial pod communication support (#235) - cuda + MNNVL update & pod presets (#241) - Increase CQ size for high qps (#244) - fix hang when NVML is present but fabricmanager isnt (#246) - Adding nica2a preset (#248) - Adding HBM read bandwidth preset (#250) - Pod Ring preset (#251) - gfxsweep preset (#254) (#256) - Adding Batched DMA support (hipMemcpyBatchAsync), and bmasweep preset (#255) - Adding a wallclock consistency detection preset (#258) - Adding smoketest preset for simple correctness tests (#266) - Help / envvars / presets presets (#267) - Modernize CMake build (#268) - Replace version-based pod/amd-smi detection with compile-time API probes (#269) - Fix collective mismatch hangs in multi-rank error paths (#270) - Fix SHOW_ITERATIONS table truncation with multiple transfers per executor (#271) - Reformat a2asweep output to match gfxsweep style (#272) - Gfx sweep update (#274) - Increasing flush frequency in smoketest (#275) - Adding new experimental copy-only GFX kernel, gfxsweep update (#277) - Fixes for cuMem compilation and invalid device ordinal (#278) - Simplifying socket connect, allow for using host address (#279) - Updating podring to run on single node without need to force single pod (#280) - Adding SHOW_PERCENTILES to show extra per-iteration statistics (#281) --------- Co-authored-by: Tim <43156029+AtlantaPepsi@users.noreply.github.com> Co-authored-by: Pak Nin Lui <pak.lui@amd.com> Co-authored-by: pierreantoineH <PierreAntoine.Harraud@amd.com> Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com> Co-authored-by: Claude <claude@anthropic.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

AtlantaPepsi added 2 commits April 29, 2026 19:10

Removing unnecessary device set

429a84a

separation of cuMem and compilation fixes

480c582

Copilot AI review requested due to automatic review settings April 29, 2026 21:04

AtlantaPepsi requested a review from a team as a code owner April 29, 2026 21:04

Copilot started reviewing on behalf of AtlantaPepsi April 29, 2026 21:06 View session

Copilot AI reviewed Apr 29, 2026

View reviewed changes

nileshnegi reviewed Apr 29, 2026

View reviewed changes

Comment thread src/header/TransferBench.hpp Outdated

nileshnegi approved these changes Apr 29, 2026

View reviewed changes

Update src/header/TransferBench.hpp

cdb40f7

Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>

Copilot AI review requested due to automatic review settings April 29, 2026 22:13

AtlantaPepsi merged commit 350e4e5 into ROCm:candidate Apr 29, 2026
1 of 2 checks passed

Copilot started reviewing on behalf of AtlantaPepsi April 29, 2026 22:14 View session

Copilot AI reviewed Apr 29, 2026

View reviewed changes

Comment thread src/header/TransferBench.hpp

Comment on lines +5083 to 5086

#if defined(CUMEM_ENABLED)

ERR_CHECK(cuMemcpyAsync((CUdeviceptr)resources.dstMem[dstIdx],

(CUdeviceptr)resources.srcMem[0],

resources.numBytes, stream));

alex-breslow-amd reviewed Apr 29, 2026

View reviewed changes

Comment thread src/header/TransferBench.hpp

alex-breslow-amd reviewed Apr 29, 2026

View reviewed changes

Comment thread src/header/TransferBench.hpp

alex-breslow-amd reviewed Apr 29, 2026

View reviewed changes

Comment thread src/header/TransferBench.hpp

nileshnegi mentioned this pull request May 2, 2026

TransferBench v1.67.0 #273

Open

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixes for cuMem compilation and invalid device ordinal#278

Fixes for cuMem compilation and invalid device ordinal#278
AtlantaPepsi merged 3 commits intoROCm:candidatefrom
AtlantaPepsi:devSetFix

AtlantaPepsi commented Apr 29, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

AtlantaPepsi commented Apr 29, 2026

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants