Adding a wallclock consistency detection preset #258
Merged
gilbertlee-amd merged 3 commits into ROCm:candidate on Apr 19, 2026
Pull request overview
Adds a new preset entry intended to measure AMD GPU wallclock consistency across XCCs, and records the feature in the changelog.
Changes:
- Registers a new `"wallclock"` preset in the preset dispatcher.
- Adjusts macro cleanup behavior for `GetXccId` in `TransferBench.hpp`.
- Documents the new preset in `CHANGELOG.md`.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/header/TransferBench.hpp | Stops undefining GetXccId at the end of the header (macro now leaks past the header boundary). |
| src/client/Presets/Presets.hpp | Adds include and preset map entry for the new wallclock preset. |
| CHANGELOG.md | Adds a bullet noting the new wallclock preset. |
alex-breslow-amd approved these changes on Apr 18, 2026
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
mustafabar reviewed on Apr 21, 2026
nileshnegi added a commit that referenced this pull request on May 2, 2026

- Initial pod communication support (#235)
- cuda + MNNVL update & pod presets (#241)
- Increase CQ size for high qps (#244)
- fix hang when NVML is present but fabricmanager isnt (#246)
- Adding nica2a preset (#248)
- Adding HBM read bandwidth preset (#250)
- Pod Ring preset (#251)
- gfxsweep preset (#254) (#256)
- Adding Batched DMA support (hipMemcpyBatchAsync), and bmasweep preset (#255)
- Adding a wallclock consistency detection preset (#258)
- Adding smoketest preset for simple correctness tests (#266)
- Help / envvars / presets presets (#267)
- Modernize CMake build (#268)
- Replace version-based pod/amd-smi detection with compile-time API probes (#269)
- Fix collective mismatch hangs in multi-rank error paths (#270)
- Fix SHOW_ITERATIONS table truncation with multiple transfers per executor (#271)
- Reformat a2asweep output to match gfxsweep style (#272)
- Gfx sweep update (#274)
- Increasing flush frequency in smoketest (#275)
- Adding new experimental copy-only GFX kernel, gfxsweep update (#277)
- Fixes for cuMem compilation and invalid device ordinal (#278)
- Simplifying socket connect, allow for using host address (#279)
- Updating podring to run on single node without need to force single pod (#280)
- Adding SHOW_PERCENTILES to show extra per-iteration statistics (#281)

---------

Co-authored-by: AtlantaPepsi <timhu102@gmail.com>
Co-authored-by: Pak Nin Lui <pak.lui@amd.com>
Co-authored-by: pierreantoineH <PierreAntoine.Harraud@amd.com>
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
Co-authored-by: Claude <claude@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Motivation
TransferBench uses GFX wall clock timestamps to time individual Transfers within a kernel, which may require comparing timestamps across multiple threadblocks that work together on the same Transfer. On AMD hardware, each XCC has its own wall clock counter, and these counters may be slightly out of sync with one another.
This new wallclock preset executes a simple kernel that tries to capture timestamps from the various XCCs at the same moment in time, then compares the differences between them. The preset is multi-node capable, allowing for convenient checking across a cluster of nodes.
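To make the comparison step concrete, here is a minimal host-side sketch of how one set of per-XCC timestamps could be reduced to offsets and a spread. It is not the code from this PR: the `SkewSummary` struct and `SummarizeSkew` function are hypothetical names, and the choice of XCC 0 as the reference is an assumption made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical summary of one capture: not a TransferBench type.
struct SkewSummary {
  std::vector<int64_t> offsetFromFirst;  // ticks relative to the first XCC's timestamp
  uint64_t spread = 0;                   // largest timestamp minus smallest timestamp
};

// Reduces one timestamp per XCC (captured at nominally the same moment)
// to signed offsets from XCC 0 and the overall spread, both in clock ticks.
inline SkewSummary SummarizeSkew(const std::vector<uint64_t>& timestamps) {
  assert(!timestamps.empty());
  SkewSummary summary;
  auto minMax = std::minmax_element(timestamps.begin(), timestamps.end());
  summary.spread = *minMax.second - *minMax.first;
  for (uint64_t t : timestamps) {
    // Assumes tick values fit in a signed 64-bit integer.
    summary.offsetFromFirst.push_back(static_cast<int64_t>(t) -
                                      static_cast<int64_t>(timestamps[0]));
  }
  return summary;
}
```

For example, `SummarizeSkew({1000, 1004, 998})` yields a spread of 6 ticks and offsets of 0, 4, and -2. A perfectly synchronized set of counters would give a spread close to the capture jitter alone.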
Technical Details
The kernel launches one threadblock per XCC, plus an extra threadblock that issues a "go" command telling the other threadblocks when to capture a timestamp. This assumes that threadblocks are assigned to XCCs in round-robin order, one at a time. The timestamps are collected, then processed on the host, and the results are printed.
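The round-robin assumption can be stated as a small host-side check. The sketch below is illustrative only and is not taken from this PR: `ExpectedXcc` and `LayoutMatchesRoundRobin` are hypothetical helpers, and the observed XCC IDs are assumed to be reported by each capture threadblock (for instance through something like the `GetXccId` macro mentioned in the review).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// XCC that a threadblock is expected to land on if blocks are handed out
// to XCCs one at a time in round-robin order (the assumption stated above).
inline int ExpectedXcc(int blockIdx, int numXccs) {
  assert(numXccs > 0 && blockIdx >= 0);
  return blockIdx % numXccs;
}

// Returns true when the XCC ID observed by each of the first numXccs
// threadblocks matches the round-robin expectation, so that every XCC
// contributes exactly one timestamp.
inline bool LayoutMatchesRoundRobin(const std::vector<int>& observedXcc, int numXccs) {
  if (observedXcc.size() != static_cast<std::size_t>(numXccs)) return false;
  for (int block = 0; block < numXccs; ++block) {
    if (observedXcc[block] != ExpectedXcc(block, numXccs)) return false;
  }
  return true;
}
```

With 8 XCCs, blocks 0 through 7 would map to XCCs 0 through 7, and a ninth block (index 8) would wrap back to XCC 0. Whether the extra "go" threadblock is the last block launched is an assumption here; the PR text does not say. If the observed layout fails this check, the per-XCC comparison would be mixing timestamps from the same counter.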
Test Result
Example output (MI355X)
Additional information, as well as raw timestamp values, can be shown by setting SHOW_ITERATIONS.
Example showing raw timestamps (SHOW_ITERATIONS=2):