Adding SHOW_PERCENTILES to show extra per-iteration statistics by gilbertlee-amd · Pull Request #281 · ROCm/TransferBench

gilbertlee-amd · 2026-05-02T04:43:54Z

Motivation

Add the ability to get additional statistics about per-iteration duration through the new env var SHOW_PERCENTILES, which takes comma-separated percentages. For example, SHOW_PERCENTILES=50,90,99 reports the durations that 50%, 90%, and 99% of Transfer iterations are faster than.
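As an illustration of the comma-separated format, here is a minimal sketch of how such a value could be parsed. `ParsePercentiles` is a hypothetical helper and is not taken from the TransferBench source; accepting fractional values and silently dropping invalid tokens are assumptions, and the actual client may behave differently.

```cpp
#include <cstddef>
#include <exception>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the TransferBench source).
// Splits a string such as "50,90,99" into numeric percentages.
// Assumption: tokens that are not numbers, or that fall outside (0, 100],
// are skipped rather than reported as errors.
std::vector<double> ParsePercentiles(const std::string& value) {
  std::vector<double> result;
  std::stringstream stream(value);
  std::string token;
  while (std::getline(stream, token, ',')) {
    try {
      std::size_t used = 0;
      double pct = std::stod(token, &used);
      // Require the whole token to be consumed so "90x" is rejected.
      if (used == token.size() && pct > 0.0 && pct <= 100.0) {
        result.push_back(pct);
      }
    } catch (const std::exception&) {
      // std::stod throws on empty or non-numeric tokens; ignore them.
    }
  }
  return result;
}
```

A caller could read the variable with `std::getenv("SHOW_PERCENTILES")` and pass the result to this helper when it is non-null.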

Technical Details

This is purely a client-side change. When SHOW_PERCENTILES is enabled, per-iteration timing is recorded, and the results are then sorted to compute the statistics.
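The sort-then-select step can be sketched as below. The PR text does not say which percentile definition is used, so the nearest-rank rule here is an assumption, and `PercentileNearestRank` is a hypothetical name rather than a function from the TransferBench source.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch (not from the TransferBench source).
// Nearest-rank percentile: returns the smallest recorded duration such that
// at least pct percent of the samples are less than or equal to it.
// The vector is taken by value so the caller's iteration order is preserved.
double PercentileNearestRank(std::vector<double> durationsMs, double pct) {
  if (durationsMs.empty()) return 0.0;
  std::sort(durationsMs.begin(), durationsMs.end());
  std::size_t rank = static_cast<std::size_t>(
      std::ceil(pct / 100.0 * static_cast<double>(durationsMs.size())));
  // Clamp to a valid 1-based rank.
  if (rank < 1) rank = 1;
  if (rank > durationsMs.size()) rank = durationsMs.size();
  return durationsMs[rank - 1];
}
```

Because durations are sorted in ascending order, a higher percentile selects a slower iteration, which is why the p90 and p99 rows in the example output show longer durations and lower bandwidth than p50.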

Example output

Test 1:

-------------------┬--------------┬------------┬-------------------┬--------------------
  Executor: CPU 00 │  12.058 GB/s │  89.050 ms │  1073741824 bytes │  12.091 GB/s (sum)
-------------------┼--------------┼------------┼-------------------┼--------------------
     Transfer 0    │  12.091 GB/s │  88.802 ms │  1073741824 bytes │ C0 -> C0:1 -> C1
               p50 │  12.090 GB/s │  88.814 ms │                   │
               p90 │  11.720 GB/s │  91.617 ms │                   │
               p99 │  11.702 GB/s │  91.757 ms │                   │
-------------------┼--------------┼------------┼-------------------┼--------------------
   Aggregate (CPU) │  12.014 GB/s │  89.374 ms │  1073741824 bytes │ Overhead 0.325 ms
-------------------┴--------------┴------------┴-------------------┴--------------------

- Initial pod communication support (#235)
- cuda + MNNVL update & pod presets (#241)
- Increase CQ size for high qps (#244)
- fix hang when NVML is present but fabricmanager isnt (#246)
- Adding nica2a preset (#248)
- Adding HBM read bandwidth preset (#250)
- Pod Ring preset (#251)
- gfxsweep preset (#254) (#256)
- Adding Batched DMA support (hipMemcpyBatchAsync), and bmasweep preset (#255)
- Adding a wallclock consistency detection preset (#258)
- Adding smoketest preset for simple correctness tests (#266)
- Help / envvars / presets presets (#267)
- Modernize CMake build (#268)
- Replace version-based pod/amd-smi detection with compile-time API probes (#269)
- Fix collective mismatch hangs in multi-rank error paths (#270)
- Fix SHOW_ITERATIONS table truncation with multiple transfers per executor (#271)
- Reformat a2asweep output to match gfxsweep style (#272)
- Gfx sweep update (#274)
- Increasing flush frequency in smoketest (#275)
- Adding new experimental copy-only GFX kernel, gfxsweep update (#277)
- Fixes for cuMem compilation and invalid device ordinal (#278)
- Simplifying socket connect, allow for using host address (#279)
- Updating podring to run on single node without need to force single pod (#280)
- Adding SHOW_PERCENTILES to show extra per-iteration statistics (#281)

---------

Co-authored-by: AtlantaPepsi <timhu102@gmail.com>
Co-authored-by: Pak Nin Lui <pak.lui@amd.com>
Co-authored-by: pierreantoineH <PierreAntoine.Harraud@amd.com>
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
Co-authored-by: Claude <claude@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Adding SHOW_PERCENTILES to show extra per-iteration statistics

41167a8

gilbertlee-amd requested review from a team as code owners May 2, 2026 04:43

AtlantaPepsi approved these changes May 2, 2026

gilbertlee-amd merged commit d36cc23 into ROCm:candidate May 2, 2026
3 of 4 checks passed

gilbertlee-amd deleted the AddPercentiles branch May 2, 2026 04:55

nileshnegi mentioned this pull request May 2, 2026

TransferBench v1.67.0 #273

Open

1 task

gilbertlee-amd merged 1 commit into ROCm:candidate from gilbertlee-amd:AddPercentiles
