
Adding Batched DMA support (hipMemcpyBatchAsync), and bmasweep preset #255

Merged
gilbertlee-amd merged 6 commits into ROCm:candidate from gilbertlee-amd:BmaExecutor
Apr 14, 2026

@gilbertlee-amd
Collaborator

Motivation

This adds a new Executor (B) based on the hipMemcpyBatchAsync call introduced in HIP 7.0.
For this Executor, the number of SubExecutors sets how many batches a single Transfer is broken into.
This allows performance to be compared against standard hipMemcpyAsync.
A new bmasweep preset is also introduced to compare the two approaches.
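
As an illustration of breaking a Transfer into pieces, here is a minimal sketch that splits a byte count into contiguous chunks, one per SubExecutor. The `Chunk` struct and `SplitTransfer` function are hypothetical names invented for this example, and the even-split policy is an assumption; the actual splitting logic in TransferBench.hpp may differ. No HIP call is made here; the sketch only computes the offsets and sizes that a batched copy could be built from.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical descriptor for one piece of a Transfer (not a TransferBench type).
struct Chunk {
  std::size_t offset;  // byte offset from the start of the Transfer
  std::size_t bytes;   // number of bytes in this piece
};

// Split numBytes into numSubExecs contiguous pieces whose sizes differ by at
// most one byte. The even-split policy is an assumption made for illustration.
std::vector<Chunk> SplitTransfer(std::size_t numBytes, std::size_t numSubExecs) {
  assert(numSubExecs > 0);
  std::vector<Chunk> chunks;
  std::size_t const base  = numBytes / numSubExecs;
  std::size_t const extra = numBytes % numSubExecs;  // first `extra` pieces get one more byte
  std::size_t offset = 0;
  for (std::size_t i = 0; i < numSubExecs; i++) {
    std::size_t const bytes = base + (i < extra ? 1 : 0);
    chunks.push_back({offset, bytes});
    offset += bytes;
  }
  return chunks;
}
```

For example, `SplitTransfer(4096, 4)` yields four 1024-byte pieces at offsets 0, 1024, 2048 and 3072.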

Technical Details

This change also enables more than one destination when using the DMA and Batched DMA (BMA) Executors.
When multiple destinations are provided, the DMA Executor performs the copies one after another, while the Batched DMA Executor issues them all as separate batches (respecting the number of SubExecutors).
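
One possible reading of the multi-destination behavior is sketched below. The `CopyOp` struct and both planning functions are hypothetical and exist only for this example; in particular, the assumption that the batched path produces one entry per destination per SubExecutor is an interpretation of the description above, not something confirmed by the code in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical copy descriptor (illustrative only, not a TransferBench type).
struct CopyOp {
  int         dstIndex;  // which destination buffer this copy targets
  std::size_t offset;    // byte offset within source and destination
  std::size_t bytes;     // bytes copied by this operation
};

// DMA-style plan: one full-size copy per destination, issued one after another.
std::vector<CopyOp> PlanSequentialDma(std::size_t numBytes, int numDsts) {
  assert(numDsts > 0);
  std::vector<CopyOp> ops;
  for (int d = 0; d < numDsts; d++)
    ops.push_back({d, 0, numBytes});
  return ops;
}

// Batched-style plan (assumed): each destination is split into numSubExecs
// pieces, and all pieces are gathered into one list for a batched submission.
std::vector<CopyOp> PlanBatchedDma(std::size_t numBytes, int numDsts, int numSubExecs) {
  assert(numDsts > 0 && numSubExecs > 0);
  std::vector<CopyOp> ops;
  std::size_t const n = static_cast<std::size_t>(numSubExecs);
  for (int d = 0; d < numDsts; d++) {
    for (std::size_t i = 0; i < n; i++) {
      std::size_t const begin = numBytes * i / n;        // start of piece i
      std::size_t const end   = numBytes * (i + 1) / n;  // one past the end of piece i
      ops.push_back({d, begin, end - begin});
    }
  }
  return ops;
}
```

Under these assumptions, 3 destinations with 4 SubExecutors give 3 sequential copies on the DMA path and 12 entries on the batched path.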

Test Result

Here are results from bmasweep on an MI355X:

[BMA Sweep Related]
EXE_INDEX            =            0 : Executing on GPU 0
LOCAL_COPY           =            0 : Excluding local copy to GPU 0
GPU_MEM_TYPE         =            0 : Using default GPU (0=default, 1=fine-grained, 2=uncached, 3=managed)
NUM_GPU_DEVICES      =            8 : Using 8 GPUs
NUM_SUB_EXECS        =            4 : 1,2,4,8

Performing 7 simultaneous DMA Transfers from GPU 0 other GPUs
Executing: ...................
┌------------┬--------┬---------------------------------------┐
│      Bytes │    DMA │ BMA (1)   BMA (2)   BMA (4)   BMA (8) │
├------------┼--------┼---------------------------------------┤
│       4096 │   0.68 │    0.67      0.35      0.18      0.09 │
│       8192 │   1.33 │    1.34      0.70      0.36      0.18 │
│      16384 │   2.61 │    2.68      1.41      0.72      0.35 │
│      32768 │   5.42 │    5.36      2.98      1.49      0.76 │
│      65536 │  10.52 │   10.54      5.53      3.01      1.52 │
│     131072 │  20.18 │   19.84     11.18      5.63      3.04 │
│     262144 │  33.90 │   32.19     21.52     11.21      5.67 │
│     524288 │  44.86 │   43.72     34.54     21.98     11.40 │
│    1048576 │  50.01 │   49.64     45.43     35.59     21.95 │
│    2097152 │  47.99 │   48.07     50.96     46.41     35.78 │
│    4194304 │  53.64 │   53.66     48.59     51.27     46.71 │
│    8388608 │  57.09 │   57.07     53.65     48.10     51.54 │
│   16777216 │  59.11 │   59.11     57.22     53.66     48.16 │
│   33554432 │  60.10 │   59.97     59.11     57.10     53.74 │
│   67108864 │  57.39 │   60.65     60.11     59.10     57.18 │
│  134217728 │  60.95 │   60.82     60.67     60.13     59.13 │
│  268435456 │  61.10 │   61.10     60.83     60.64     60.07 │
│  536870912 │  61.19 │   59.33     61.10     60.95     60.54 │
│ 1073741824 │  61.22 │   61.22     61.18     60.41     60.73 │
└------------┴--------┴---------------------------------------┘
Reported numbers are all GB/s, normalized for per Transfer for 7 Transfers
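
As a rough way to read the table above, the following sketch finds the smallest size at which a BMA column reaches a chosen fraction of the DMA column. The `Row` struct, the function names, and the 95% threshold are all assumptions introduced for this example; the numbers are copied from the DMA and BMA (8) columns for sizes 1 MiB through 64 MiB.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One table row reduced to the DMA column and a single BMA column.
struct Row {
  std::size_t bytes;
  double      dmaGBps;
  double      bmaGBps;
};

// Returns the first size whose BMA bandwidth is at least `fraction` of the DMA
// bandwidth, or 0 if no row qualifies. Rows are expected in ascending size order.
std::size_t FirstSizeReaching(std::vector<Row> const& rows, double fraction) {
  assert(fraction > 0.0);
  for (Row const& r : rows)
    if (r.bmaGBps >= fraction * r.dmaGBps) return r.bytes;
  return 0;
}

// DMA and BMA (8) values copied from the table, 1 MiB through 64 MiB.
std::vector<Row> Bma8Rows() {
  return {
    { 1048576, 50.01, 21.95},
    { 2097152, 47.99, 35.78},
    { 4194304, 53.64, 46.71},
    { 8388608, 57.09, 51.54},
    {16777216, 59.11, 48.16},
    {33554432, 60.10, 53.74},
    {67108864, 57.39, 57.18},
  };
}
```

With these rows, `FirstSizeReaching(Bma8Rows(), 0.95)` returns 67108864, meaning BMA (8) first comes within 5% of DMA at 64 MiB in this run.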

@gilbertlee-amd gilbertlee-amd requested review from a team as code owners April 10, 2026 22:17
@nileshnegi nileshnegi requested a review from Copilot April 10, 2026 22:19
@gilbertlee-amd gilbertlee-amd review requested due to automatic review settings April 10, 2026 22:20
@nileshnegi nileshnegi requested a review from Copilot April 10, 2026 23:41
Contributor

Copilot AI left a comment


Pull request overview

This PR introduces a new GPU batched DMA executor (B, backed by hipMemcpyBatchAsync in HIP/ROCm 7.0+) and a new bmasweep preset to compare standard DMA vs batched DMA, while also extending the DMA path to support multiple destination buffers.

Changes:

  • Add EXE_GPU_BDMA (“B”) executor support gated by HIP/ROCm 7.0+ (hipMemcpyBatchAsync).
  • Allow DMA (and BMA) transfers to specify multiple destinations and execute the copies accordingly.
  • Add a bmasweep preset and update docs/changelog to expose the new executor.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| src/header/TransferBench.hpp | Adds BDMA executor, multi-destination DMA support, BDMA execution path, and related validation/topology updates. |
| src/client/Utilities.hpp | Adds string mapping for the new executor type. |
| src/client/Presets/Presets.hpp | Registers the new bmasweep preset. |
| src/client/Presets/BmaSweep.hpp | New preset to benchmark DMA vs batched DMA across multiple destinations. |
| examples/example.cfg | Documents the new executor in the example config. |
| CHANGELOG.md | Notes the new executor/preset and related DMA behavior changes. |

Comments suppressed due to low confidence (1)

src/header/TransferBench.hpp:5834

  • Wildcard expansion for executor subindices treats EXE_GPU_BDMA like GFX/DMA and iterates over GetNumExecutorSubIndices(). Since BDMA reports 0 subindices, this branch currently generates no transfers when exeSubIndices is -2, instead of recursing once with subindex -1. BDMA should likely be handled like CPU here (set -1 and recurse once).
      case EXE_GPU_GFX: case EXE_GPU_DMA: case EXE_GPU_BDMA:
      {
        // Iterate over all available subindices
        ExeDevice exeDevice = {wc.exe.exeType, wc.exe.exeIndices[0], wc.exe.exeRanks[0], 0};
        int numSubIndices = GetNumExecutorSubIndices(exeDevice);
        for (int x = 0; x < numSubIndices; x++) {
          wc.exe.exeSubIndices = {x};
          result |= RecursiveWildcardTransferExpansion(wc, baseRankIndex, numBytes, numSubExecs, transfers);
        }
        wc.exe.exeSubIndices = {-1};
        return result;
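
The behavior described in this comment can be modeled with a small stand-alone sketch. Everything here is a simplified stand-in: the enum, `NumSubIndices`, and `ExpandSubIndexWildcard` are hypothetical and do not reproduce the TransferBench definitions, and the subindex counts for GFX and DMA are arbitrary. The sketch only contrasts looping over zero subindices (no expansions) with the suggested CPU-style handling (a single expansion with subindex -1).

```cpp
#include <cassert>
#include <vector>

// Simplified stand-ins for illustration; not the TransferBench definitions.
enum ExeType { EXE_CPU, EXE_GPU_GFX, EXE_GPU_DMA, EXE_GPU_BDMA };

// Assumed per the review comment: BDMA reports zero subindices.
int NumSubIndices(ExeType type) {
  switch (type) {
    case EXE_GPU_GFX: return 8;  // arbitrary illustrative value
    case EXE_GPU_DMA: return 4;  // arbitrary illustrative value
    default:          return 0;  // CPU and BDMA in this model
  }
}

// Expands a wildcard subindex into the concrete subindices to recurse on.
// When bdmaLikeCpu is true, BDMA takes the CPU-style path: a single -1 entry.
std::vector<int> ExpandSubIndexWildcard(ExeType type, bool bdmaLikeCpu) {
  std::vector<int> expanded;
  bool const singleShot = (type == EXE_CPU) || (type == EXE_GPU_BDMA && bdmaLikeCpu);
  if (singleShot) {
    expanded.push_back(-1);  // recurse once with "no subindex"
    return expanded;
  }
  int const n = NumSubIndices(type);
  assert(n >= 0);
  for (int x = 0; x < n; x++) expanded.push_back(x);  // empty when n == 0
  return expanded;
}
```

In this model, `ExpandSubIndexWildcard(EXE_GPU_BDMA, false)` is empty, so no transfers would be generated, while passing `true` yields the single entry -1.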


Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.



@alex-breslow-amd alex-breslow-amd self-requested a review April 14, 2026 23:01

@alex-breslow-amd alex-breslow-amd left a comment


LGTM

@gilbertlee-amd gilbertlee-amd merged commit 2dba07f into ROCm:candidate Apr 14, 2026
4 checks passed
@gilbertlee-amd gilbertlee-amd deleted the BmaExecutor branch April 14, 2026 23:51
@nileshnegi nileshnegi mentioned this pull request Apr 27, 2026
1 task
nileshnegi added a commit that referenced this pull request May 2, 2026
- Initial pod communication support (#235)
- cuda + MNNVL update & pod presets (#241)
- Increase CQ size for high qps (#244)
- fix hang when NVML is present but fabricmanager isnt (#246)
- Adding nica2a preset  (#248)
- Adding HBM read bandwidth preset (#250)
- Pod Ring preset (#251)
- gfxsweep preset (#254) (#256)
- Adding Batched DMA support (hipMemcpyBatchAsync), and bmasweep preset (#255)
- Adding a wallclock consistency detection preset (#258)
- Adding smoketest preset for simple correctness tests (#266)
- Help / envvars / presets presets (#267)
- Modernize CMake build (#268)
- Replace version-based pod/amd-smi detection with compile-time API probes (#269)
- Fix collective mismatch hangs in multi-rank error paths (#270)
- Fix SHOW_ITERATIONS table truncation with multiple transfers per executor (#271)
- Reformat a2asweep output to match gfxsweep style (#272)
- Gfx sweep update (#274)
- Increasing flush frequency in smoketest (#275)
- Adding new experimental copy-only GFX kernel, gfxsweep update (#277)
- Fixes for cuMem compilation and invalid device ordinal (#278)
- Simplifying socket connect, allow for using host address (#279)
- Updating podring to run on single node without need to force single pod (#280)
- Adding SHOW_PERCENTILES to show extra per-iteration statistics (#281)

---------

Co-authored-by: AtlantaPepsi <timhu102@gmail.com>
Co-authored-by: Pak Nin Lui <pak.lui@amd.com>
Co-authored-by: pierreantoineH <PierreAntoine.Harraud@amd.com>
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
Co-authored-by: Claude <claude@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>