Adding Batched DMA support (hipMemcpyBatchAsync), and bmasweep preset #255
Merged
gilbertlee-amd merged 6 commits into ROCm:candidate on Apr 14, 2026
Conversation
Contributor
Pull request overview
This PR introduces a new GPU batched DMA executor (B, backed by hipMemcpyBatchAsync in HIP/ROCm 7.0+) and a new bmasweep preset to compare standard DMA against batched DMA, while also extending the DMA path to support multiple destination buffers.
Changes:
- Add EXE_GPU_BDMA ("B") executor support gated by HIP/ROCm 7.0+ (hipMemcpyBatchAsync).
- Allow DMA (and BMA) transfers to specify multiple destinations and execute the copies accordingly.
- Add a bmasweep preset and update docs/changelog to expose the new executor.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| src/header/TransferBench.hpp | Adds BDMA executor, multi-destination DMA support, BDMA execution path, and related validation/topology updates. |
| src/client/Utilities.hpp | Adds string mapping for the new executor type. |
| src/client/Presets/Presets.hpp | Registers the new bmasweep preset. |
| src/client/Presets/BmaSweep.hpp | New preset to benchmark DMA vs batched DMA across multiple destinations. |
| examples/example.cfg | Documents the new executor in the example config. |
| CHANGELOG.md | Notes the new executor/preset and related DMA behavior changes. |
Comments suppressed due to low confidence (1)
src/header/TransferBench.hpp:5834
- Wildcard expansion for executor subindices treats EXE_GPU_BDMA like GFX/DMA and iterates over GetNumExecutorSubIndices(). Since BDMA reports 0 subindices, this branch currently generates no transfers when exeSubIndices is -2, instead of recursing once with subindex -1. BDMA should likely be handled like CPU here (set -1 and recurse once).
case EXE_GPU_GFX: case EXE_GPU_DMA: case EXE_GPU_BDMA:
{
  // Iterate over all available subindices
  ExeDevice exeDevice = {wc.exe.exeType, wc.exe.exeIndices[0], wc.exe.exeRanks[0], 0};
  int numSubIndices = GetNumExecutorSubIndices(exeDevice);
  for (int x = 0; x < numSubIndices; x++) {
    wc.exe.exeSubIndices = {x};
    result |= RecursiveWildcardTransferExpansion(wc, baseRankIndex, numBytes, numSubExecs, transfers);
  }
  wc.exe.exeSubIndices = {-1};
  return result;
}
Contributor
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
8bd9fe4 to 8bab3a2
nileshnegi approved these changes on Apr 12, 2026
nileshnegi added a commit that referenced this pull request on May 2, 2026
- Initial pod communication support (#235)
- cuda + MNNVL update & pod presets (#241)
- Increase CQ size for high qps (#244)
- fix hang when NVML is present but fabricmanager isnt (#246)
- Adding nica2a preset (#248)
- Adding HBM read bandwidth preset (#250)
- Pod Ring preset (#251)
- gfxsweep preset (#254) (#256)
- Adding Batched DMA support (hipMemcpyBatchAsync), and bmasweep preset (#255)
- Adding a wallclock consistency detection preset (#258)
- Adding smoketest preset for simple correctness tests (#266)
- Help / envvars / presets presets (#267)
- Modernize CMake build (#268)
- Replace version-based pod/amd-smi detection with compile-time API probes (#269)
- Fix collective mismatch hangs in multi-rank error paths (#270)
- Fix SHOW_ITERATIONS table truncation with multiple transfers per executor (#271)
- Reformat a2asweep output to match gfxsweep style (#272)
- Gfx sweep update (#274)
- Increasing flush frequency in smoketest (#275)
- Adding new experimental copy-only GFX kernel, gfxsweep update (#277)
- Fixes for cuMem compilation and invalid device ordinal (#278)
- Simplifying socket connect, allow for using host address (#279)
- Updating podring to run on single node without need to force single pod (#280)
- Adding SHOW_PERCENTILES to show extra per-iteration statistics (#281)

Co-authored-by: AtlantaPepsi <timhu102@gmail.com>
Co-authored-by: Pak Nin Lui <pak.lui@amd.com>
Co-authored-by: pierreantoineH <PierreAntoine.Harraud@amd.com>
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
Co-authored-by: Claude <claude@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Motivation
This adds a new Executor (B) based on the hipMemcpyBatchAsync call introduced in HIP 7.0. The new Executor supports SubExecutors, which control how many batches a single Transfer is broken into.
This allows comparing performance against standard hipMemcpyAsync. A new bmasweep preset is also introduced to compare the two versions.
Technical Details
This change also enables more than one destination when using the DMA and Batched DMA (BMA) Executors.
When multiple destinations are provided, DMA performs the copies one after another, while Batched DMA issues them all as separate batches (respecting the number of SubExecutors).
Test Results
Here are results from bmasweep on an MI355X: