Pod Ring preset by AtlantaPepsi · Pull Request #251 · ROCm/TransferBench

AtlantaPepsi · 2026-03-28T02:22:59Z

Motivation

We need an intra-pod ring preset, similar to the nicrings preset, to simulate communication patterns that RCCL may use.

Technical Details

Similar to the poda2a preset, all detectable devices can optionally be reordered according to a user-specified stride. The reordered devices are then divided into subgroups of a user-specified size, and each subgroup forms a ring.
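
As a rough illustration of the reorder-then-group step described above, the following Python sketch builds rings from a stride and a group size. It is not the preset's C++ code. `stride_reorder` is a hypothetical rule, chosen only because it reproduces the two example outputs shown under Test Result, and the actual `Utils::StrideGenerate` may behave differently. All function names here are made up for the example.

```python
def stride_reorder(num_devices, stride):
    """Hypothetical reordering rule (an assumption, not TransferBench code).

    Position i of the new order holds device (i % stride) * step + i // stride,
    where step = num_devices // stride. Stride 1 keeps the natural order.
    """
    if stride <= 0 or num_devices % stride != 0:
        raise ValueError("stride must be positive and divide the device count")
    step = num_devices // stride
    return [(i % stride) * step + i // stride for i in range(num_devices)]


def build_rings(order, group_size):
    """Cut the ordered devices into groups; each group becomes a ring of (src, dst) pairs."""
    if group_size <= 0 or len(order) % group_size != 0:
        raise ValueError("group size must be positive and divide the device count")
    groups = [order[i:i + group_size] for i in range(0, len(order), group_size)]
    # Each device sends to its successor; the last device wraps around to the first.
    return [[(g[k], g[(k + 1) % group_size]) for k in range(group_size)] for g in groups]


def label(device, gpus_per_rank):
    """Format a flat device index as R<rank>:G<gpu>, as in the benchmark output."""
    rank, gpu = divmod(device, gpus_per_rank)
    return f"R{rank}:G{gpu}"


def describe(num_ranks, gpus_per_rank, stride, group_size):
    """Return one 'Ring N: ...' line per ring for the given configuration."""
    order = stride_reorder(num_ranks * gpus_per_rank, stride)
    lines = []
    for idx, ring in enumerate(build_rings(order, group_size)):
        hops = [label(src, gpus_per_rank) for src, _ in ring]
        hops.append(hops[0])  # close the ring for display
        lines.append(f"Ring {idx}: " + " -> ".join(hops))
    return lines


# 2 ranks with 4 GPUs each, group size 4, for the two strides used in the examples.
for stride in (1, 4):
    print(f"STRIDE={stride}")
    for line in describe(2, 4, stride, 4):
        print("  " + line)
```

With 2 ranks of 4 GPUs and a group size of 4, stride 1 yields one ring per rank, while stride 4 yields rings that alternate between ranks.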

Test Plan

Test Result

Example: 2 nodes, each with 4 GPUs

Stride = 1 and Group Size = 4 -> all 2 x 4 = 8 devices stay in natural order and are cut into 2 subgroups

[PodRing Related]
MEM_TYPE             =            0 : Using default GPU GPU memory (0=default, 1=fine-grained, 2=uncached, 3=managed)
NUM_GPU_DEVICES      =            4 : Using 4 GPUs
NUM_QUEUE_PAIRS      =            0 : Using 0 queue pairs for NIC transfers
NUM_SUB_EXEC         =            8 : Using 8 subexecutors/CUs per Transfer
USE_DMA_EXEC         =            0 : Using GFX executor
USE_REMOTE_READ      =            0 : Using SRC as executor
STRIDE               =            1 : Reordering devices by taking 1 steps
GROUP_SIZE           =            4 : Dividing all devices into ring groups of 4

GPU-GFX IntraPod Ring benchmark:
==============================
[268435456 bytes per Transfer] [GFX:8] [MemType:default GPU] [NIC QueuePairs:0] [#Ranks:2]
2 ring(s) of 4 devices:
  Ring 0: R0:G0 -> R0:G1 -> R0:G2 -> R0:G3 -> R0:G0
  Ring 1: R1:G0 -> R1:G1 -> R1:G2 -> R1:G3 -> R1:G0


--- Pod Ring Group 0 ---
┌------------┬------------┬----------┐
│  Src   Src │  Dst   Dst │ GFX BW   │
│ Rank   GPU │ Rank   GPU │ (GB/s)   │
├------------┼------------┼----------┤
│    0     0 │    0     1 │ 106.53   │
│    0     1 │    0     2 │ 105.44   │
│    0     2 │    0     3 │ 108.15   │
│    0     3 │    0     0 │ 110.00   │
├------------┼------------┼----------┤
│        MAX │            │ 110.00   │
│        AVG │            │ 107.53   │
│        MIN │            │ 105.44   │
└------------┴------------┴----------┘
Aggregate bandwidth (CPU Timed):  197.714 GB/s

--- Pod Ring Group 1 ---
┌------------┬------------┬----------┐
│  Src   Src │  Dst   Dst │ GFX BW   │
│ Rank   GPU │ Rank   GPU │ (GB/s)   │
├------------┼------------┼----------┤
│    1     0 │    1     1 │ 105.30   │
│    1     1 │    1     2 │ 104.57   │
│    1     2 │    1     3 │ 106.82   │
│    1     3 │    1     0 │ 106.71   │
├------------┼------------┼----------┤
│        MAX │            │ 106.82   │
│        AVG │            │ 105.85   │
│        MIN │            │ 104.57   │
└------------┴------------┴----------┘
Aggregate bandwidth (CPU Timed):  197.387 GB/s

Stride = 4 and Group Size = 4 -> all 2 x 4 = 8 devices are reordered and cut into 2 subgroups

[PodRing Related]
MEM_TYPE             =            0 : Using default GPU GPU memory (0=default, 1=fine-grained, 2=uncached, 3=managed)
NUM_GPU_DEVICES      =            4 : Using 4 GPUs
NUM_QUEUE_PAIRS      =            0 : Using 0 queue pairs for NIC transfers
NUM_SUB_EXEC         =            8 : Using 8 subexecutors/CUs per Transfer
USE_DMA_EXEC         =            0 : Using GFX executor
USE_REMOTE_READ      =            0 : Using SRC as executor
STRIDE               =            4 : Reordering devices by taking 4 steps
GROUP_SIZE           =            4 : Dividing all devices into ring groups of 4

GPU-GFX IntraPod Ring benchmark:
==============================
[268435456 bytes per Transfer] [GFX:8] [MemType:default GPU] [NIC QueuePairs:0] [#Ranks:2]
2 ring(s) of 4 devices:
  Ring 0: R0:G0 -> R0:G2 -> R1:G0 -> R1:G2 -> R0:G0
  Ring 1: R0:G1 -> R0:G3 -> R1:G1 -> R1:G3 -> R0:G1


--- Pod Ring Group 0 ---
┌------------┬------------┬----------┐
│  Src   Src │  Dst   Dst │ GFX BW   │
│ Rank   GPU │ Rank   GPU │ (GB/s)   │
├------------┼------------┼----------┤
│    0     0 │    0     2 │ 110.72   │
│    0     2 │    1     0 │ 111.23   │
│    1     0 │    1     2 │ 109.95   │
│    1     2 │    0     0 │ 110.07   │
├------------┼------------┼----------┤
│        MAX │            │ 111.23   │
│        AVG │            │ 110.49   │
│        MIN │            │ 109.95   │
└------------┴------------┴----------┘
Aggregate bandwidth (CPU Timed):  410.284 GB/s

--- Pod Ring Group 1 ---
┌------------┬------------┬----------┐
│  Src   Src │  Dst   Dst │ GFX BW   │
│ Rank   GPU │ Rank   GPU │ (GB/s)   │
├------------┼------------┼----------┤
│    0     1 │    0     3 │ 104.38   │
│    0     3 │    1     1 │ 104.27   │
│    1     1 │    1     3 │ 103.70   │
│    1     3 │    0     1 │ 103.47   │
├------------┼------------┼----------┤
│        MAX │            │ 104.38   │
│        AVG │            │ 103.96   │
│        MIN │            │ 103.47   │
└------------┴------------┴----------┘
Aggregate bandwidth (CPU Timed):  385.956 GB/s
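
The MAX / AVG / MIN rows in each table can be checked by hand from the per-link bandwidths. A minimal Python sketch, assuming those rows are plain statistics over the four GFX BW values (the `summarize` helper is hypothetical and not part of TransferBench):

```python
from statistics import mean


def summarize(bandwidths):
    """Return max, arithmetic mean, and min of per-link bandwidths (GB/s)."""
    return {"max": max(bandwidths), "avg": mean(bandwidths), "min": min(bandwidths)}


# GFX BW column of Pod Ring Group 0 from the Stride = 4 run above.
group0 = [110.72, 111.23, 109.95, 110.07]
stats = summarize(group0)
print(f"MAX {stats['max']:.2f}  AVG {stats['avg']:.2f}  MIN {stats['min']:.2f}")
# prints: MAX 111.23  AVG 110.49  MIN 109.95
```

The "Aggregate bandwidth (CPU Timed)" line is reported separately and does not equal the sum of these rows in either run.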

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Copilot

Pull request overview

Adds a new “podring” preset and centralizes several scheduling helper utilities so they can be reused across presets.

Changes:

- Added a new PodRingPreset to run intra-pod ring transfers (optionally with NIC queue-pair transfers) and print per-group summaries.
- Moved common helper routines (StrideGenerate, RoundRobinSchedule, CombinationSchedule) into TransferBench::Utils.
- Updated existing presets to call the new Utils:: helper implementations.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.


File	Description
src/client/Utilities.hpp	Adds shared scheduling / indexing helpers used by multiple presets.
src/client/Presets/Presets.hpp	Registers the new `podring` preset and includes its header.
src/client/Presets/PodRing.hpp	New preset implementing ring transfers within pod subgroups.
src/client/Presets/PodPeerToPeer.hpp	Switches round-robin scheduling call to `Utils::RoundRobinSchedule`.
src/client/Presets/PodAllToAll.hpp	Removes local stride helper and uses `Utils::StrideGenerate`.
src/client/Presets/NicPeerToPeer.hpp	Removes local scheduling helpers and uses `Utils::RoundRobinSchedule` / `Utils::CombinationSchedule`.


Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.


Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.


Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.


Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.


- Initial pod communication support (#235)
- cuda + MNNVL update & pod presets (#241)
- Increase CQ size for high qps (#244)
- fix hang when NVML is present but fabricmanager isnt (#246)
- Adding nica2a preset (#248)
- Adding HBM read bandwidth preset (#250)
- Pod Ring preset (#251)
- gfxsweep preset (#254) (#256)
- Adding Batched DMA support (hipMemcpyBatchAsync), and bmasweep preset (#255)
- Adding a wallclock consistency detection preset (#258)
- Adding smoketest preset for simple correctness tests (#266)
- Help / envvars / presets presets (#267)
- Modernize CMake build (#268)
- Replace version-based pod/amd-smi detection with compile-time API probes (#269)
- Fix collective mismatch hangs in multi-rank error paths (#270)
- Fix SHOW_ITERATIONS table truncation with multiple transfers per executor (#271)
- Reformat a2asweep output to match gfxsweep style (#272)
- Gfx sweep update (#274)
- Increasing flush frequency in smoketest (#275)
- Adding new experimental copy-only GFX kernel, gfxsweep update (#277)
- Fixes for cuMem compilation and invalid device ordinal (#278)
- Simplifying socket connect, allow for using host address (#279)
- Updating podring to run on single node without need to force single pod (#280)
- Adding SHOW_PERCENTILES to show extra per-iteration statistics (#281)

---------

Co-authored-by: AtlantaPepsi <timhu102@gmail.com>
Co-authored-by: Pak Nin Lui <pak.lui@amd.com>
Co-authored-by: pierreantoineH <PierreAntoine.Harraud@amd.com>
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
Co-authored-by: Claude <claude@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>


AtlantaPepsi requested a review from a team as a code owner March 28, 2026 02:22

nileshnegi requested a review from Copilot March 28, 2026 18:16

Copilot started reviewing on behalf of nileshnegi March 28, 2026 18:17

Copilot AI reviewed Mar 28, 2026


Copilot AI review requested due to automatic review settings April 28, 2026 15:36

Copilot AI reviewed Apr 28, 2026


Comment thread src/client/Presets/PodRing.hpp Outdated

Comment thread src/client/Presets/PodRing.hpp Outdated

Comment thread src/client/Presets/PodRing.hpp Outdated

Comment thread src/client/Presets/PodRing.hpp Outdated

adjusting grouping logic; lifting helper functions to Utilities

963d581

AtlantaPepsi force-pushed the podring branch from 3b1088f to f90212f Compare April 28, 2026 16:14

nileshnegi requested a review from Copilot April 28, 2026 16:18

Copilot started reviewing on behalf of nileshnegi April 28, 2026 16:19

Copilot AI reviewed Apr 28, 2026


Comment thread src/client/Utilities.hpp

Comment thread src/client/Utilities.hpp Outdated

Comment thread src/client/Utilities.hpp Outdated

addition of pod loop and minor fixes

6886db2

AtlantaPepsi force-pushed the podring branch from f90212f to 6886db2 Compare April 28, 2026 17:34

AtlantaPepsi requested a review from Copilot April 28, 2026 17:34

Copilot started reviewing on behalf of AtlantaPepsi April 28, 2026 17:35

Copilot AI reviewed Apr 28, 2026


Comment thread src/client/Utilities.hpp

Comment thread src/client/Presets/PodRing.hpp

Comment thread src/client/Presets/PodRing.hpp

Comment thread src/client/Presets/PodRing.hpp Outdated

AtlantaPepsi added 2 commits April 28, 2026 18:57

adjusting sizing checks

86e6eff

rolling back to single pod

92cc265

Copilot AI review requested due to automatic review settings April 28, 2026 19:52

Copilot started reviewing on behalf of AtlantaPepsi April 28, 2026 19:54

Copilot AI reviewed Apr 28, 2026


nileshnegi approved these changes Apr 28, 2026


AtlantaPepsi merged commit 0621e90 into ROCm:candidate Apr 28, 2026
7 of 8 checks passed

nileshnegi mentioned this pull request May 2, 2026

TransferBench v1.67.0 #273

Open

1 task

3 participants
