Adding SHOW_PERCENTILES to show extra per-iteration statistics #281

Merged
gilbertlee-amd merged 1 commit into ROCm:candidate from gilbertlee-amd:AddPercentiles
May 2, 2026

Conversation

@gilbertlee-amd
Collaborator

Motivation

Add the ability to report additional statistics on per-iteration duration through the new environment variable SHOW_PERCENTILES, which takes a comma-separated list of percentages. For example, SHOW_PERCENTILES=50,90,99 reports, for each Transfer, the durations that 50%, 90%, and 99% of iterations are faster than.
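
As an illustration of the comma-separated format, here is a minimal sketch of how such a value might be parsed. `ParsePercentiles` is a hypothetical helper written for this example, not code from the PR, and its handling of malformed or out-of-range tokens (they are skipped) is an assumption.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not from the PR): parse a comma-separated list of
// percentages such as "50,90,99". Tokens that are not plain numbers or that
// fall outside [0, 100] are skipped (an assumption made for this sketch).
std::vector<double> ParsePercentiles(const char* value) {
  std::vector<double> result;
  if (value == nullptr) return result;  // variable not set: feature disabled

  std::stringstream stream(value);
  std::string token;
  while (std::getline(stream, token, ',')) {
    try {
      std::size_t used = 0;
      const double pct = std::stod(token, &used);
      // Require the whole token to be consumed so "90x" is rejected
      if (used == token.size() && pct >= 0.0 && pct <= 100.0) {
        result.push_back(pct);
      }
    } catch (const std::logic_error&) {
      // std::stod throws invalid_argument / out_of_range; ignore the token
    }
  }
  return result;
}
```

A caller could pass `std::getenv("SHOW_PERCENTILES")` (from `<cstdlib>`) directly, since the helper treats a null pointer as "not set".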

Technical Details

This is purely a client-side change. When SHOW_PERCENTILES is set, per-iteration timings are recorded and then sorted to compute the requested percentiles.

Example output

Test 1:

-------------------┬--------------┬------------┬-------------------┬--------------------
  Executor: CPU 00 │  12.058 GB/s │  89.050 ms │  1073741824 bytes │  12.091 GB/s (sum)
-------------------┼--------------┼------------┼-------------------┼--------------------
     Transfer 0    │  12.091 GB/s │  88.802 ms │  1073741824 bytes │ C0 -> C0:1 -> C1
               p50 │  12.090 GB/s │  88.814 ms │                   │
               p90 │  11.720 GB/s │  91.617 ms │                   │
               p99 │  11.702 GB/s │  91.757 ms │                   │
-------------------┼--------------┼------------┼-------------------┼--------------------
   Aggregate (CPU) │  12.014 GB/s │  89.374 ms │  1073741824 bytes │ Overhead 0.325 ms
-------------------┴--------------┴------------┴-------------------┴--------------------
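
The sort-then-index computation described under Technical Details can be sketched roughly as follows. The PR text does not say which percentile definition is used, so this sketch assumes the nearest-rank method; `PercentileMsec` and `BandwidthGBs` are hypothetical names chosen for this example. The bandwidth conversion assumes 1 GB = 1e9 bytes, which is consistent with the example table (1073741824 bytes in 88.814 ms is about 12.090 GB/s).

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the PR): nearest-rank percentile.
// Returns the smallest recorded duration such that at least pct% of the
// iterations took that long or less. Takes the vector by value so the
// caller's per-iteration ordering is left untouched.
double PercentileMsec(std::vector<double> durationsMsec, double pct) {
  if (durationsMsec.empty()) return 0.0;
  std::sort(durationsMsec.begin(), durationsMsec.end());

  const double count = static_cast<double>(durationsMsec.size());
  // 1-based rank, clamped to [1, N] so pct = 0 and pct = 100 stay in range
  std::size_t rank = static_cast<std::size_t>(std::ceil(pct * count / 100.0));
  rank = std::max<std::size_t>(rank, 1);
  rank = std::min<std::size_t>(rank, durationsMsec.size());
  return durationsMsec[rank - 1];
}

// Convert a byte count and a duration in milliseconds to GB/s (1 GB = 1e9 B)
double BandwidthGBs(std::size_t numBytes, double durationMsec) {
  return (static_cast<double>(numBytes) / 1.0e9) / (durationMsec / 1.0e3);
}
```

Because a higher percentile selects a longer duration, the corresponding bandwidth is lower, which matches the p50 > p90 > p99 ordering of the GB/s column in the example output.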

gilbertlee-amd requested review from a team as code owners May 2, 2026 04:43
gilbertlee-amd merged commit d36cc23 into ROCm:candidate May 2, 2026
3 of 4 checks passed
gilbertlee-amd deleted the AddPercentiles branch May 2, 2026 04:55
nileshnegi added a commit that referenced this pull request May 2, 2026
- Initial pod communication support (#235)
- cuda + MNNVL update & pod presets (#241)
- Increase CQ size for high qps (#244)
- fix hang when NVML is present but fabricmanager isnt (#246)
- Adding nica2a preset  (#248)
- Adding HBM read bandwidth preset (#250)
- Pod Ring preset (#251)
- gfxsweep preset (#254) (#256)
- Adding Batched DMA support (hipMemcpyBatchAsync), and bmasweep preset (#255)
- Adding a wallclock consistency detection preset (#258)
- Adding smoketest preset for simple correctness tests (#266)
- Help / envvars / presets presets (#267)
- Modernize CMake build (#268)
- Replace version-based pod/amd-smi detection with compile-time API probes (#269)
- Fix collective mismatch hangs in multi-rank error paths (#270)
- Fix SHOW_ITERATIONS table truncation with multiple transfers per executor (#271)
- Reformat a2asweep output to match gfxsweep style (#272)
- Gfx sweep update (#274)
- Increasing flush frequency in smoketest (#275)
- Adding new experimental copy-only GFX kernel, gfxsweep update (#277)
- Fixes for cuMem compilation and invalid device ordinal (#278)
- Simplifying socket connect, allow for using host address (#279)
- Updating podring to run on single node without need to force single pod (#280)
- Adding SHOW_PERCENTILES to show extra per-iteration statistics (#281)

---------

Co-authored-by: AtlantaPepsi <timhu102@gmail.com>
Co-authored-by: Pak Nin Lui <pak.lui@amd.com>
Co-authored-by: pierreantoineH <PierreAntoine.Harraud@amd.com>
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
Co-authored-by: Claude <claude@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
nileshnegi mentioned this pull request May 2, 2026
1 task
nileshnegi added a commit that referenced this pull request May 2, 2026
- Initial pod communication support (#235)
- cuda + MNNVL update & pod presets (#241)
- Increase CQ size for high qps (#244)
- fix hang when NVML is present but fabricmanager isnt (#246)
- Adding nica2a preset  (#248)
- Adding HBM read bandwidth preset (#250)
- Pod Ring preset (#251)
- gfxsweep preset (#254) (#256)
- Adding Batched DMA support (hipMemcpyBatchAsync), and bmasweep preset (#255)
- Adding a wallclock consistency detection preset (#258)
- Adding smoketest preset for simple correctness tests (#266)
- Help / envvars / presets presets (#267)
- Modernize CMake build (#268)
- Replace version-based pod/amd-smi detection with compile-time API probes (#269)
- Fix collective mismatch hangs in multi-rank error paths (#270)
- Fix SHOW_ITERATIONS table truncation with multiple transfers per executor (#271)
- Reformat a2asweep output to match gfxsweep style (#272)
- Gfx sweep update (#274)
- Increasing flush frequency in smoketest (#275)
- Adding new experimental copy-only GFX kernel, gfxsweep update (#277)
- Fixes for cuMem compilation and invalid device ordinal (#278)
- Simplifying socket connect, allow for using host address (#279)
- Updating podring to run on single node without need to force single pod (#280)
- Adding SHOW_PERCENTILES to show extra per-iteration statistics (#281)

---------

Co-authored-by: Tim <43156029+AtlantaPepsi@users.noreply.github.com>
Co-authored-by: Pak Nin Lui <pak.lui@amd.com>
Co-authored-by: pierreantoineH <PierreAntoine.Harraud@amd.com>
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
Co-authored-by: Claude <claude@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>