Adding Batched DMA support (hipMemcpyBatchAsync), and bmasweep preset by gilbertlee-amd · Pull Request #255 · ROCm/TransferBench

gilbertlee-amd · 2026-04-10T22:17:40Z

Motivation

This adds a new Executor (B) based on the hipMemcpyBatchAsync call introduced in HIP 7.0.
The new Executor supports SubExecutors, which here set how many batches a single Transfer is broken into.
This allows its performance to be compared against standard hipMemcpyAsync.
A new bmasweep preset is also introduced to compare the two.
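As a rough illustration of what "broken into batches" can mean, the sketch below splits a Transfer of `numBytes` into `numSubExecs` contiguous slices. `BatchEntry` and `SplitIntoBatches` are hypothetical names used only for this example, and the near-equal split with remainder handling is an assumption; the partitioning in TransferBench itself may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One entry of a batched copy: a contiguous slice of the Transfer.
// (Hypothetical type, for illustration only.)
struct BatchEntry {
  std::size_t offset;  // byte offset into the source and destination buffers
  std::size_t size;    // number of bytes covered by this entry
};

// Split numBytes into numSubExecs near-equal contiguous slices.
// Assumption: the real code may align slices or handle remainders differently.
std::vector<BatchEntry> SplitIntoBatches(std::size_t numBytes, int numSubExecs) {
  assert(numSubExecs > 0);
  std::vector<BatchEntry> entries;
  std::size_t const n     = static_cast<std::size_t>(numSubExecs);
  std::size_t const base  = numBytes / n;
  std::size_t const extra = numBytes % n;  // the first `extra` slices get one more byte
  std::size_t offset = 0;
  for (std::size_t i = 0; i < n; i++) {
    std::size_t const size = base + (i < extra ? 1 : 0);
    if (size == 0) continue;  // skip empty slices when numBytes < numSubExecs
    entries.push_back({offset, size});
    offset += size;
  }
  return entries;
}
```

Each entry's offset and size would then fill one slot of the destination, source, and size arrays that a batched copy call consumes.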

Technical Details

This change also enables more than one destination when using the DMA and Batched DMA (BMA) Executors.
When multiple destinations are provided, DMA performs the copies one after another, while Batched DMA issues them all as separate batches (respecting the number of SubExecutors).
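The difference between the two Executors with multiple destinations can be sketched as a copy plan. This is only a model under stated assumptions: `PlannedCopy`, `PlanCopies`, and the `step` field are hypothetical, and the sketch assumes that BMA splits every destination into at most `numSubExecs` slices that are all submitted together, which is one possible reading of the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Executor { DMA, BMA };

// One planned copy. Copies that share a `step` would be submitted together.
// (Hypothetical type, for illustration only.)
struct PlannedCopy {
  int         dstIndex;  // which destination buffer is written
  std::size_t offset;    // byte offset within the Transfer
  std::size_t size;      // bytes copied
  int         step;      // submission step
};

std::vector<PlannedCopy> PlanCopies(Executor exe, int numDsts,
                                    std::size_t numBytes, int numSubExecs) {
  assert(numDsts > 0 && numSubExecs > 0);
  std::vector<PlannedCopy> plan;
  if (exe == Executor::DMA) {
    // DMA: one full-size copy per destination, one after another.
    for (int d = 0; d < numDsts; d++) plan.push_back({d, 0, numBytes, d});
    return plan;
  }
  // BMA (assumption): each destination is cut into at most numSubExecs
  // slices, and all slices for all destinations go into the same step.
  std::size_t const n     = static_cast<std::size_t>(numSubExecs);
  std::size_t const chunk = (numBytes + n - 1) / n;  // ceiling division
  for (int d = 0; d < numDsts; d++) {
    for (std::size_t offset = 0; offset < numBytes; offset += chunk) {
      std::size_t const left = numBytes - offset;
      plan.push_back({d, offset, left < chunk ? left : chunk, 0});
    }
  }
  return plan;
}
```

With two destinations, 4096 bytes, and 4 SubExecutors, this model yields two sequential copies for DMA and eight 1024-byte entries in a single step for BMA.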

Test Result

Here are results from bmasweep on an MI355X:

[BMA Sweep Related]
EXE_INDEX            =            0 : Executing on GPU 0
LOCAL_COPY           =            0 : Excluding local copy to GPU 0
GPU_MEM_TYPE         =            0 : Using default GPU (0=default, 1=fine-grained, 2=uncached, 3=managed)
NUM_GPU_DEVICES      =            8 : Using 8 GPUs
NUM_SUB_EXECS        =            4 : 1,2,4,8

Performing 7 simultaneous DMA Transfers from GPU 0 other GPUs
Executing: ...................
┌------------┬--------┬---------------------------------------┐
│      Bytes │    DMA │ BMA (1)   BMA (2)   BMA (4)   BMA (8) │
├------------┼--------┼---------------------------------------┤
│       4096 │   0.68 │    0.67      0.35      0.18      0.09 │
│       8192 │   1.33 │    1.34      0.70      0.36      0.18 │
│      16384 │   2.61 │    2.68      1.41      0.72      0.35 │
│      32768 │   5.42 │    5.36      2.98      1.49      0.76 │
│      65536 │  10.52 │   10.54      5.53      3.01      1.52 │
│     131072 │  20.18 │   19.84     11.18      5.63      3.04 │
│     262144 │  33.90 │   32.19     21.52     11.21      5.67 │
│     524288 │  44.86 │   43.72     34.54     21.98     11.40 │
│    1048576 │  50.01 │   49.64     45.43     35.59     21.95 │
│    2097152 │  47.99 │   48.07     50.96     46.41     35.78 │
│    4194304 │  53.64 │   53.66     48.59     51.27     46.71 │
│    8388608 │  57.09 │   57.07     53.65     48.10     51.54 │
│   16777216 │  59.11 │   59.11     57.22     53.66     48.16 │
│   33554432 │  60.10 │   59.97     59.11     57.10     53.74 │
│   67108864 │  57.39 │   60.65     60.11     59.10     57.18 │
│  134217728 │  60.95 │   60.82     60.67     60.13     59.13 │
│  268435456 │  61.10 │   61.10     60.83     60.64     60.07 │
│  536870912 │  61.19 │   59.33     61.10     60.95     60.54 │
│ 1073741824 │  61.22 │   61.22     61.18     60.41     60.73 │
└------------┴--------┴---------------------------------------┘
Reported numbers are all GB/s, normalized for per Transfer for 7 Transfers
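One pattern that appears in these numbers (an observation from the table, not a claim made by the author): BMA (k) at a given size tends to land close to BMA (1) at 1/k of that size, which suggests that the per-batch slice size is what drives throughput at small sizes. The sketch below checks this on the first four rows, copied from the table; `Row`, `Lookup`, and `RelativeGap` are hypothetical helpers written for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// First four rows of the table above (GB/s).
struct Row { std::size_t bytes; double dma, bma1, bma2, bma4, bma8; };

constexpr Row kRows[] = {
  { 4096, 0.68, 0.67, 0.35, 0.18, 0.09},
  { 8192, 1.33, 1.34, 0.70, 0.36, 0.18},
  {16384, 2.61, 2.68, 1.41, 0.72, 0.35},
  {32768, 5.42, 5.36, 2.98, 1.49, 0.76},
};

// Bandwidth for `bytes` split into `batches` (1, 2, 4, or 8); -1 if not in kRows.
double Lookup(std::size_t bytes, int batches) {
  for (Row const& r : kRows) {
    if (r.bytes != bytes) continue;
    switch (batches) {
      case 1: return r.bma1;
      case 2: return r.bma2;
      case 4: return r.bma4;
      case 8: return r.bma8;
      default: return -1.0;
    }
  }
  return -1.0;
}

// Relative difference between BMA(batches) at sliceBytes * batches
// and BMA(1) at sliceBytes, i.e. at the same per-batch slice size.
double RelativeGap(std::size_t sliceBytes, int batches) {
  double const single = Lookup(sliceBytes, 1);
  double const split  = Lookup(sliceBytes * static_cast<std::size_t>(batches), batches);
  assert(single > 0.0 && split > 0.0);
  return std::fabs(split - single) / single;
}
```

For a 4096-byte slice, the gaps for 2, 4, and 8 batches all stay below 15% on these rows. The same reading fits larger rows too: BMA (1) at 1048576 is 49.64, close to BMA (2) at 2097152 (50.96), BMA (4) at 4194304 (51.27), and BMA (8) at 8388608 (51.54).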

Copilot

Pull request overview

This PR introduces a new GPU batched DMA executor (B, backed by hipMemcpyBatchAsync in HIP/ROCm 7.0+) and a new bmasweep preset to compare standard DMA vs batched DMA, while also extending the DMA path to support multiple destination buffers.

Changes:

- Add EXE_GPU_BDMA (“B”) executor support gated by HIP/ROCm 7.0+ (hipMemcpyBatchAsync).
- Allow DMA (and BMA) transfers to specify multiple destinations and execute the copies accordingly.
- Add a bmasweep preset and update docs/changelog to expose the new executor.
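A version gate of the kind described in the first item could look roughly like the sketch below. `HIP_VERSION_MAJOR` is the macro provided by `<hip/hip_version.h>`; `BATCHED_DMA_ENABLED`, `IsExecutorAvailable`, and the simplified `ExeType` enum are hypothetical and are not taken from the PR, whose actual gating may be written differently.

```cpp
// Pull in the HIP version macros when the header is available.
#if defined(__has_include)
#  if __has_include(<hip/hip_version.h>)
#    include <hip/hip_version.h>
#  endif
#endif

// Hypothetical feature macro: 1 only when building against HIP 7.0 or newer.
#if defined(HIP_VERSION_MAJOR) && (HIP_VERSION_MAJOR >= 7)
#  define BATCHED_DMA_ENABLED 1
#else
#  define BATCHED_DMA_ENABLED 0
#endif

// Simplified stand-in for the executor types discussed in this PR.
enum ExeType { EXE_CPU, EXE_GPU_GFX, EXE_GPU_DMA, EXE_GPU_BDMA };

// Every executor is available except BDMA on toolchains older than HIP 7.0.
bool IsExecutorAvailable(ExeType exeType) {
  return exeType != EXE_GPU_BDMA || BATCHED_DMA_ENABLED != 0;
}
```

Built without HIP headers, the sketch simply reports the batched executor as unavailable, so the rest of a program can compile unchanged.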

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 9 comments.

Show a summary per file

| File | Description |
| --- | --- |
| src/header/TransferBench.hpp | Adds BDMA executor, multi-destination DMA support, BDMA execution path, and related validation/topology updates. |
| src/client/Utilities.hpp | Adds string mapping for the new executor type. |
| src/client/Presets/Presets.hpp | Registers the new `bmasweep` preset. |
| src/client/Presets/BmaSweep.hpp | New preset to benchmark DMA vs batched DMA across multiple destinations. |
| examples/example.cfg | Documents the new executor in the example config. |
| CHANGELOG.md | Notes the new executor/preset and related DMA behavior changes. |

Comments suppressed due to low confidence (1)

src/header/TransferBench.hpp:5834

Wildcard expansion for executor subindices treats EXE_GPU_BDMA like GFX/DMA and iterates over GetNumExecutorSubIndices(). Since BDMA reports 0 subindices, this branch currently generates no transfers when exeSubIndices is -2, instead of recursing once with subindex -1. BDMA should likely be handled like CPU here (set -1 and recurse once).

      case EXE_GPU_GFX: case EXE_GPU_DMA: case EXE_GPU_BDMA:
      {
        // Iterate over all available subindices
        ExeDevice exeDevice = {wc.exe.exeType, wc.exe.exeIndices[0], wc.exe.exeRanks[0], 0};
        int numSubIndices = GetNumExecutorSubIndices(exeDevice);
        for (int x = 0; x < numSubIndices; x++) {
          wc.exe.exeSubIndices = {x};
          result |= RecursiveWildcardTransferExpansion(wc, baseRankIndex, numBytes, numSubExecs, transfers);
        }
        wc.exe.exeSubIndices = {-1};
        return result;
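The behavior the comment asks for can be modeled in isolation. The sketch below is not the TransferBench code: `NumSubIndices` is a stand-in for GetNumExecutorSubIndices() with made-up counts (only the 0 for BDMA comes from the comment), and `ExpandSubIndexWildcard` is a hypothetical helper that returns the subindex values the expansion would recurse with.

```cpp
#include <vector>

// Simplified stand-in for the executor types named in the review comment.
enum ExeType { EXE_CPU, EXE_GPU_GFX, EXE_GPU_DMA, EXE_GPU_BDMA };

// Stand-in for GetNumExecutorSubIndices(). The GFX and DMA counts are made up;
// BDMA reporting 0 subindices is what the review comment describes.
int NumSubIndices(ExeType exeType) {
  switch (exeType) {
    case EXE_GPU_GFX: return 8;
    case EXE_GPU_DMA: return 4;
    default:          return 0;
  }
}

// Subindex values a wildcard would expand to, following the suggestion:
// GFX and DMA iterate over their subindices, CPU and BDMA recurse once with -1.
std::vector<int> ExpandSubIndexWildcard(ExeType exeType) {
  std::vector<int> subIndices;
  switch (exeType) {
    case EXE_GPU_GFX: case EXE_GPU_DMA:
      for (int x = 0; x < NumSubIndices(exeType); x++) subIndices.push_back(x);
      break;
    case EXE_CPU: case EXE_GPU_BDMA:
      subIndices.push_back(-1);  // exactly one expansion, never zero
      break;
  }
  return subIndices;
}
```

Grouping BDMA with GFX and DMA in this model would produce an empty list, which corresponds to the "generates no transfers" case the comment warns about.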

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

alex-breslow-amd

LGTM

- Initial pod communication support (#235)
- cuda + MNNVL update & pod presets (#241)
- Increase CQ size for high qps (#244)
- fix hang when NVML is present but fabricmanager isnt (#246)
- Adding nica2a preset (#248)
- Adding HBM read bandwidth preset (#250)
- Pod Ring preset (#251)
- gfxsweep preset (#254) (#256)
- Adding Batched DMA support (hipMemcpyBatchAsync), and bmasweep preset (#255)
- Adding a wallclock consistency detection preset (#258)
- Adding smoketest preset for simple correctness tests (#266)
- Help / envvars / presets presets (#267)
- Modernize CMake build (#268)
- Replace version-based pod/amd-smi detection with compile-time API probes (#269)
- Fix collective mismatch hangs in multi-rank error paths (#270)
- Fix SHOW_ITERATIONS table truncation with multiple transfers per executor (#271)
- Reformat a2asweep output to match gfxsweep style (#272)
- Gfx sweep update (#274)
- Increasing flush frequency in smoketest (#275)
- Adding new experimental copy-only GFX kernel, gfxsweep update (#277)
- Fixes for cuMem compilation and invalid device ordinal (#278)
- Simplifying socket connect, allow for using host address (#279)
- Updating podring to run on single node without need to force single pod (#280)
- Adding SHOW_PERCENTILES to show extra per-iteration statistics (#281)

---------

Co-authored-by: AtlantaPepsi <timhu102@gmail.com>
Co-authored-by: Pak Nin Lui <pak.lui@amd.com>
Co-authored-by: pierreantoineH <PierreAntoine.Harraud@amd.com>
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
Co-authored-by: Claude <claude@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

gilbertlee-amd requested review from a team as code owners April 10, 2026 22:17

nileshnegi requested a review from Copilot April 10, 2026 22:19

gilbertlee-amd review requested due to automatic review settings April 10, 2026 22:20

Copilot started reviewing on behalf of nileshnegi April 10, 2026 22:21

nileshnegi requested a review from Copilot April 10, 2026 23:41

Copilot started reviewing on behalf of nileshnegi April 10, 2026 23:43

Copilot AI reviewed Apr 10, 2026

nileshnegi requested a review from Copilot April 11, 2026 01:32

Copilot started reviewing on behalf of nileshnegi April 11, 2026 01:34

Copilot AI reviewed Apr 11, 2026

Comment thread src/header/TransferBench.hpp Outdated

Comment thread src/header/TransferBench.hpp

Comment thread examples/example.cfg

Comment thread CHANGELOG.md

gilbertlee-amd added 3 commits April 11, 2026 16:01

Adding Batched DMA support (hipMemcpyBatchAsync), and bmasweep preset

e100737

Minor fixes to Batched DMA support

2372321

Fixing up typos / bugs

8bab3a2

gilbertlee-amd force-pushed the BmaExecutor branch from 8bd9fe4 to 8bab3a2 Compare April 11, 2026 21:02

nileshnegi approved these changes Apr 12, 2026

gilbertlee-amd added 3 commits April 13, 2026 23:49

Adding support for cudaMemcpyBatchAsync

e55d2a5

Accounting for CUDA 12.8 vs CUDA 13.0 cudaMemcpyBatchAsync differences

bec6e90

Fixing gfx906 compile issue

0e07bd0

alex-breslow-amd self-requested a review April 14, 2026 23:01

alex-breslow-amd approved these changes Apr 14, 2026

alex-breslow-amd reviewed Apr 14, 2026

Comment thread src/header/TransferBench.hpp

gilbertlee-amd merged commit 2dba07f into ROCm:candidate Apr 14, 2026
4 checks passed

gilbertlee-amd deleted the BmaExecutor branch April 14, 2026 23:51

nileshnegi mentioned this pull request Apr 27, 2026

TransferBench v1.67.0 #273

Open

1 task

gilbertlee-amd merged 6 commits into ROCm:candidate from gilbertlee-amd:BmaExecutor
