
Fixes for cuMem compilation and invalid device ordinal#278

Merged
AtlantaPepsi merged 3 commits into ROCm:candidate from AtlantaPepsi:devSetFix
Apr 29, 2026

Conversation

@AtlantaPepsi
Contributor

Motivation

cuMem symbols and definitions are currently guarded by pod-communication enablement. This is not a long-term solution, since the two features are not always coupled and their usage may diverge in the future. Separating them also fixes an existing linking error with cuMemcpyAsync and CUresult when POD_COMM_ENABLED is not defined.

Technical Details

  • Introduced a separate CUMEM_ENABLED macro for both the build process and the header; POD_COMM_ENABLED now depends on cuMem enablement as well (a rough sketch of the resulting guard layout follows this list).
  • Minor adjustment to error reporting for CUDA runtime calls.
  • Eliminated unnecessary hipSetDevice calls: cuda + MNNVL update & pod presets #241 previously added multiple hipSetDevice calls throughout cuMem allocation and release to ensure the context was always initialized. Not all of them were needed, and unconditional invocation also caused errors for non-GPU executors.
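For context, a minimal sketch of how the split might be laid out in the header, assuming a plain #include of cuda.h and an #error diagnostic; only the CUMEM_ENABLED, POD_COMM_ENABLED, and DISABLE_CUMEM names come from this PR:

// Sketch only: CUDA driver API declarations are pulled in when cuMem support
// is enabled, independently of pod communication.
#if defined(CUMEM_ENABLED)
#include <cuda.h>   // CUresult, CUdeviceptr, cuMemcpyAsync, ...
#endif

// Pod communication now requires cuMem support rather than the other way around.
#if defined(POD_COMM_ENABLED) && !defined(CUMEM_ENABLED)
#error "POD_COMM_ENABLED requires CUMEM_ENABLED (build without DISABLE_CUMEM=1)"
#endif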

Test Plan

Tested all combinations of Makefile flags and verified that compilation/linking succeeded.
Previously, on CI machines with more CPU NUMA nodes than GPU devices, certain sweeping presets such as p2p would fail; this is now fixed.

Test Result

Submission Checklist

Copilot AI review requested due to automatic review settings April 29, 2026 21:04
@AtlantaPepsi AtlantaPepsi requested a review from a team as a code owner April 29, 2026 21:04
Contributor

Copilot AI left a comment


Pull request overview

This PR decouples CUDA driver API (cuMem / libcuda) enablement from pod-communication support, and removes unnecessary hipSetDevice calls that could trigger “invalid device ordinal” on non-GPU executors.

Changes:

  • Introduce CUMEM_ENABLED as a separate build/header macro and make CUDA pod-comm depend on it.
  • Adjust error reporting strings for CUDA runtime vs HIP runtime errors (a rough sketch follows this list).
  • Remove/guard some hipSetDevice calls to avoid invalid device selection in CPU-only contexts.
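For context, the error-reporting tweak might look roughly like this; the helper name, message format, and __NVCC__ check are assumptions for illustration, not the actual TransferBench code:

#include <cstdio>
#include <hip/hip_runtime.h>   // hipError_t, hipGetErrorString

// Illustrative only: choose the runtime name used in error messages so CUDA
// builds no longer report failures as HIP errors.
#if defined(__NVCC__)
static constexpr const char* kRuntimeName = "CUDA runtime";
#else
static constexpr const char* kRuntimeName = "HIP runtime";
#endif

static void ReportRuntimeError(hipError_t err, int line)
{
  std::printf("[ERROR] %s error '%s' at line %d\n",
              kRuntimeName, hipGetErrorString(err), line);
}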

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File | Description
src/header/TransferBench.hpp | Guards CUDA driver API usage behind CUMEM_ENABLED, tweaks error messages, and reduces hipSetDevice usage to avoid invalid device ordinal.
Makefile | Adds DISABLE_CUMEM / CUMEM_ENABLED logic and makes CUDA pod-comm conditional on cuMem availability.
Comments suppressed due to low confidence (1)

src/header/TransferBench.hpp:5090

  • When CUMEM_ENABLED is not defined (e.g., TransferBenchCuda with DISABLE_CUMEM=1), this code falls back to hipMemcpyAsync(..., memcpyKind, ...), but memcpyKind is only declared under the HIP_PLATFORM_AMD && HIP_VERSION_MAJOR>=6 block. Under NVCC builds that block is not active, so memcpyKind is undefined and the CUDA build will fail to compile. Define an appropriate memcpy kind for the CUDA fallback path (e.g., device-to-device or default) and/or restructure the preprocessor guards so the fallback does not reference an undeclared variable. One possible restructuring is sketched after the snippet below.
                                  resources.numBytes, stream));
#else
          ERR_CHECK(hipMemcpyAsync(resources.dstMem[dstIdx], resources.srcMem[0], resources.numBytes,
                                   memcpyKind, stream));
#endif
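
One possible restructuring along the lines the comment suggests, sketched against the snippet above; the guard names mirror the comment and hipMemcpyDefault stands in for an explicitly computed direction, so treat this as an assumption rather than the final fix:

#if defined(CUMEM_ENABLED)
          // Driver API path: direction is inferred from the unified CUdeviceptr values.
          ERR_CHECK(cuMemcpyAsync((CUdeviceptr)resources.dstMem[dstIdx],
                                  (CUdeviceptr)resources.srcMem[0],
                                  resources.numBytes, stream));
#elif defined(__HIP_PLATFORM_AMD__) && HIP_VERSION_MAJOR >= 6
          // HIP >= 6 path: memcpyKind is declared in this configuration.
          ERR_CHECK(hipMemcpyAsync(resources.dstMem[dstIdx], resources.srcMem[0],
                                   resources.numBytes, memcpyKind, stream));
#else
          // CUDA build without cuMem: memcpyKind is not declared here, so let the
          // runtime infer the direction instead of referencing an undefined symbol.
          ERR_CHECK(hipMemcpyAsync(resources.dstMem[dstIdx], resources.srcMem[0],
                                   resources.numBytes, hipMemcpyDefault, stream));
#endif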


Comment thread src/header/TransferBench.hpp Outdated
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
Copilot AI review requested due to automatic review settings April 29, 2026 22:13
@AtlantaPepsi AtlantaPepsi merged commit 350e4e5 into ROCm:candidate Apr 29, 2026
1 of 2 checks passed
Contributor

Copilot AI left a comment


Pull request overview

This PR decouples CUDA driver API (cuMem/cuMemcpyAsync/CUresult) enablement from pod-communication enablement, and removes/guards several hipSetDevice calls that could trigger “invalid device ordinal” on non-GPU executors.

Changes:

  • Add a standalone CUMEM_ENABLED build macro (and link -lcuda) for TransferBenchCuda, with CUDA pod-comm gated on cuMem availability.
  • Switch the DMA executor’s CUDA driver copy path to be controlled by CUMEM_ENABLED instead of __NVCC__.
  • Adjust error reporting strings for CUDA runtime vs HIP, and avoid hipSetDevice in CPU-memory initialization paths (see the sketch after this list).
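
As an illustration of the hipSetDevice change, a host-memory path can skip device selection entirely. Everything here except hipSetDevice itself (the helper name, the isGpuExecutor flag, and the libnuma call) is a hypothetical stand-in for the actual TransferBench code:

#include <numa.h>              // numa_alloc_onnode (libnuma), illustration only
#include <hip/hip_runtime.h>

// Hypothetical allocation helper: only GPU-backed memory selects a device.
// Previously an unconditional hipSetDevice(index) could receive a CPU NUMA-node
// index larger than the GPU count and fail with "invalid device ordinal".
static hipError_t AllocExecutorMemory(bool isGpuExecutor, int index,
                                      size_t numBytes, void** ptr)
{
  if (isGpuExecutor) {
    hipError_t err = hipSetDevice(index);   // valid GPU ordinal expected here
    if (err != hipSuccess) return err;
    return hipMalloc(ptr, numBytes);
  }
  *ptr = numa_alloc_onnode(numBytes, index); // host memory on the given NUMA node
  return (*ptr != nullptr) ? hipSuccess : hipErrorOutOfMemory;
}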

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File | Description
src/header/TransferBench.hpp | Uses CUMEM_ENABLED to gate CUDA driver APIs, tweaks error messages, and reduces unconditional hipSetDevice usage to avoid invalid ordinals.
Makefile | Introduces DISABLE_CUMEM / CUMEM_ENABLED and makes CUDA pod-comm depend on cuMem enablement; moves -lcuda linkage to the cuMem feature gate.


Comment on lines +5083 to 5086
#if defined(CUMEM_ENABLED)
          ERR_CHECK(cuMemcpyAsync((CUdeviceptr)resources.dstMem[dstIdx],
                                  (CUdeviceptr)resources.srcMem[0],
                                  resources.numBytes, stream));
nileshnegi added a commit that referenced this pull request May 2, 2026
- Initial pod communication support (#235)
- cuda + MNNVL update & pod presets (#241)
- Increase CQ size for high qps (#244)
- fix hang when NVML is present but fabricmanager isnt (#246)
- Adding nica2a preset  (#248)
- Adding HBM read bandwidth preset (#250)
- Pod Ring preset (#251)
- gfxsweep preset (#254) (#256)
- Adding Batched DMA support (hipMemcpyBatchAsync), and bmasweep preset (#255)
- Adding a wallclock consistency detection preset (#258)
- Adding smoketest preset for simple correctness tests (#266)
- Help / envvars / presets presets (#267)
- Modernize CMake build (#268)
- Replace version-based pod/amd-smi detection with compile-time API probes (#269)
- Fix collective mismatch hangs in multi-rank error paths (#270)
- Fix SHOW_ITERATIONS table truncation with multiple transfers per executor (#271)
- Reformat a2asweep output to match gfxsweep style (#272)
- Gfx sweep update (#274)
- Increasing flush frequency in smoketest (#275)
- Adding new experimental copy-only GFX kernel, gfxsweep update (#277)
- Fixes for cuMem compilation and invalid device ordinal (#278)
- Simplifying socket connect, allow for using host address (#279)
- Updating podring to run on single node without need to force single pod (#280)
- Adding SHOW_PERCENTILES to show extra per-iteration statistics (#281)

---------

Co-authored-by: AtlantaPepsi <timhu102@gmail.com>
Co-authored-by: Pak Nin Lui <pak.lui@amd.com>
Co-authored-by: pierreantoineH <PierreAntoine.Harraud@amd.com>
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
Co-authored-by: Claude <claude@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@nileshnegi nileshnegi mentioned this pull request May 2, 2026
1 task
