Fixes for cuMem compilation and invalid device ordinal#278
Fixes for cuMem compilation and invalid device ordinal#278AtlantaPepsi merged 3 commits intoROCm:candidatefrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR decouples CUDA driver API (cuMem / libcuda) enablement from pod-communication support, and removes unnecessary hipSetDevice calls that could trigger “invalid device ordinal” on non-GPU executors.
Changes:
- Introduce
CUMEM_ENABLEDas a separate build/header macro and make CUDA pod-comm depend on it. - Adjust error reporting strings for CUDA runtime vs HIP runtime errors.
- Remove/guard some
hipSetDevicecalls to avoid invalid device selection in CPU-only contexts.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
src/header/TransferBench.hpp |
Guards CUDA driver API usage behind CUMEM_ENABLED, tweaks error messages, and reduces hipSetDevice usage to avoid invalid device ordinal. |
Makefile |
Adds DISABLE_CUMEM / CUMEM_ENABLED logic and makes CUDA pod-comm conditional on cuMem availability. |
Comments suppressed due to low confidence (1)
src/header/TransferBench.hpp:5090
- When CUMEM_ENABLED is not defined (e.g., TransferBenchCuda with DISABLE_CUMEM=1), this code falls back to hipMemcpyAsync(..., memcpyKind, ...), but memcpyKind is only declared under the HIP_PLATFORM_AMD && HIP_VERSION_MAJOR>=6 block. Under NVCC builds that block is not active, so memcpyKind is undefined and the CUDA build will fail to compile. Define an appropriate memcpy kind for the CUDA fallback path (e.g., device-to-device or default) and/or restructure the preprocessor guards so the fallback does not reference an undeclared variable.
resources.numBytes, stream));
#else
ERR_CHECK(hipMemcpyAsync(resources.dstMem[dstIdx], resources.srcMem[0], resources.numBytes,
memcpyKind, stream));
#endif
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
There was a problem hiding this comment.
Pull request overview
This PR decouples CUDA driver API (cuMem/cuMemcpyAsync/CUresult) enablement from pod-communication enablement, and removes/guards several hipSetDevice calls that could trigger “invalid device ordinal” on non-GPU executors.
Changes:
- Add a standalone
CUMEM_ENABLEDbuild macro (and link-lcuda) forTransferBenchCuda, with CUDA pod-comm gated on cuMem availability. - Switch the DMA executor’s CUDA driver copy path to be controlled by
CUMEM_ENABLEDinstead of__NVCC__. - Adjust error reporting strings for CUDA runtime vs HIP, and avoid
hipSetDevicein CPU-memory initialization paths.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
src/header/TransferBench.hpp |
Uses CUMEM_ENABLED to gate CUDA driver APIs, tweaks error messages, and reduces unconditional hipSetDevice usage to avoid invalid ordinals. |
Makefile |
Introduces DISABLE_CUMEM / CUMEM_ENABLED and makes CUDA pod-comm depend on cuMem enablement; moves -lcuda linkage to the cuMem feature gate. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| #if defined(CUMEM_ENABLED) | ||
| ERR_CHECK(cuMemcpyAsync((CUdeviceptr)resources.dstMem[dstIdx], | ||
| (CUdeviceptr)resources.srcMem[0], | ||
| resources.numBytes, stream)); |
- Initial pod communication support (#235) - cuda + MNNVL update & pod presets (#241) - Increase CQ size for high qps (#244) - fix hang when NVML is present but fabricmanager isnt (#246) - Adding nica2a preset (#248) - Adding HBM read bandwidth preset (#250) - Pod Ring preset (#251) - gfxsweep preset (#254) (#256) - Adding Batched DMA support (hipMemcpyBatchAsync), and bmasweep preset (#255) - Adding a wallclock consistency detection preset (#258) - Adding smoketest preset for simple correctness tests (#266) - Help / envvars / presets presets (#267) - Modernize CMake build (#268) - Replace version-based pod/amd-smi detection with compile-time API probes (#269) - Fix collective mismatch hangs in multi-rank error paths (#270) - Fix SHOW_ITERATIONS table truncation with multiple transfers per executor (#271) - Reformat a2asweep output to match gfxsweep style (#272) - Gfx sweep update (#274) - Increasing flush frequency in smoketest (#275) - Adding new experimental copy-only GFX kernel, gfxsweep update (#277) - Fixes for cuMem compilation and invalid device ordinal (#278) - Simplifying socket connect, allow for using host address (#279) - Updating podring to run on single node without need to force single pod (#280) - Adding SHOW_PERCENTILES to show extra per-iteration statistics (#281) --------- Co-authored-by: AtlantaPepsi <timhu102@gmail.com> Co-authored-by: Pak Nin Lui <pak.lui@amd.com> Co-authored-by: pierreantoineH <PierreAntoine.Harraud@amd.com> Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com> Co-authored-by: Claude <claude@anthropic.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Initial pod communication support (#235) - cuda + MNNVL update & pod presets (#241) - Increase CQ size for high qps (#244) - fix hang when NVML is present but fabricmanager isnt (#246) - Adding nica2a preset (#248) - Adding HBM read bandwidth preset (#250) - Pod Ring preset (#251) - gfxsweep preset (#254) (#256) - Adding Batched DMA support (hipMemcpyBatchAsync), and bmasweep preset (#255) - Adding a wallclock consistency detection preset (#258) - Adding smoketest preset for simple correctness tests (#266) - Help / envvars / presets presets (#267) - Modernize CMake build (#268) - Replace version-based pod/amd-smi detection with compile-time API probes (#269) - Fix collective mismatch hangs in multi-rank error paths (#270) - Fix SHOW_ITERATIONS table truncation with multiple transfers per executor (#271) - Reformat a2asweep output to match gfxsweep style (#272) - Gfx sweep update (#274) - Increasing flush frequency in smoketest (#275) - Adding new experimental copy-only GFX kernel, gfxsweep update (#277) - Fixes for cuMem compilation and invalid device ordinal (#278) - Simplifying socket connect, allow for using host address (#279) - Updating podring to run on single node without need to force single pod (#280) - Adding SHOW_PERCENTILES to show extra per-iteration statistics (#281) --------- Co-authored-by: Tim <43156029+AtlantaPepsi@users.noreply.github.com> Co-authored-by: Pak Nin Lui <pak.lui@amd.com> Co-authored-by: pierreantoineH <PierreAntoine.Harraud@amd.com> Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com> Co-authored-by: Claude <claude@anthropic.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Motivation
cuMem symbols and definitions are currently guarded by pod communication enablement. It's not a long term solution as these two are not always coupled and usage might diverge in future. The separation of these two also fixes existing linking error with
cuMemcpyAsyncorCUResultin absence ofPOD_COMM_ENABLED.Technical Details
CUMEM_ENABLEDmacro for build process as well as header.POD_COMM_ENABLEDwill have to depend on cuMem enablement as well.hipSetDevice: Previously in cuda + MNNVL update & pod presets #241 multiplehipSetDeviceare added throughout cuMem allocation and release to make sure context is always initialized. Not all of them were needed, and unconditional invocation also caused error for non GPU executors.Test Plan
Tested all combination of Makefile flags and made sure compilation/linking succeeded.
Previously on CI machines with more CPU NUMA nodes than GPU devices, certain sweeping presets such as p2p would fail, which is fixed now.
Test Result
Submission Checklist