Modifying the gfxsweep preset #256

Merged
gilbertlee-amd merged 5 commits into ROCm:candidate from gilbertlee-amd:gfxSweepUpdate
Apr 17, 2026

@gilbertlee-amd gilbertlee-amd commented Apr 15, 2026

Motivation

Tweak the gfxsweep preset so that its output is easier to import into spreadsheets and it is easier to track which set of options yields the best results.
Also allow the use of multiple Transfers and wildcards.

Technical Details

Rewrote parts of the preset.

Test Result

Here's example output:

[GFX Sweep Related]
BLOCKSIZES           =            4 : 256,512,768,1024
GFX_TRANSFER         = R0G0->R0G0->R0G0 : GFX Transfer to sweep (see config file format)
NUM_SUB_EXECS        =            5 : 4,8,16,32,64
TEMPORAL_MODES       =            1 : 0
UNROLLS              =            4 : 1,2,4,8
WAVE_ORDERS          =            1 : 0
WORDSIZES            =            1 : 4
VERBOSE              =            0 : Display summary only

GFX sweep: (1048576 bytes per Transfer). All values are CPU-timed GB/s
=======================================================================================
Transfer     0: (G0->G0->G0)
=======================================================================================
 WvO   WSz   TpM   BlkS   UnR   SE 004  SE 008  SE 016  SE 032  SE 064
  0     4     0     256     1     5.38    6.00    7.79    6.76    4.22
  0     4     0     256     2    11.81    9.23    9.50    6.84    4.64
  0     4     0     256     4    12.05   11.29    8.98    6.44    4.58
  0     4     0     256     8    12.01   11.25    9.52    6.48    4.59
  0     4     0     512     1    11.75   10.65    9.33    6.80    4.63
  0     4     0     512     2    12.27   10.95    9.70    6.66    4.59
  0     4     0     512     4    11.76   11.26    8.54    6.62    4.58
  0     4     0     512     8    10.66   11.44    9.46    7.01    4.51
  0     4     0     768     1    10.09   11.37    8.69    6.91    4.65
  0     4     0     768     2    12.07   11.38    9.03    6.81    4.63
  0     4     0     768     4    12.10   11.30    8.97    6.40    4.66
  0     4     0     768     8    12.33   11.28    9.38    7.13    4.52
  0     4     0    1024     1    12.07   10.96    9.37    7.05    4.62
  0     4     0    1024     2    12.82   10.94    9.34    6.81    4.68
  0     4     0    1024     4    12.11   10.72    9.61    6.28    4.65
  0     4     0    1024     8    11.58   10.82    9.17    6.73    4.46
=======================================================================================
Highest bandwidth found:   12.82 GB/s (CPU-timed)
          WaveOrder    :       0
          WordSize     :       4
          Temporal Mode:       0
          BlockSize    :    1024
          Unroll       :       2
          NumSubExec   :       4
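
Since the stated goal is spreadsheet import, here is a minimal Python sketch of how the table above could be converted to CSV and cross-checked against the "Highest bandwidth found" summary. It is not part of this PR: the helper names (parse_gfxsweep_table, to_csv, best_config) and the "SE_004"-style column names are hypothetical, and the parser assumes the exact column layout shown in the example output.

```python
import csv
import io

# Three data rows copied from the example output above.
SAMPLE = """\
 WvO   WSz   TpM   BlkS   UnR   SE 004  SE 008  SE 016  SE 032  SE 064
  0     4     0     256     1     5.38    6.00    7.79    6.76    4.22
  0     4     0     256     2    11.81    9.23    9.50    6.84    4.64
  0     4     0    1024     2    12.82   10.94    9.34    6.81    4.68
"""

# Assumption: the table always starts with these five parameter columns.
PARAM_COLS = ["WvO", "WSz", "TpM", "BlkS", "UnR"]


def parse_gfxsweep_table(text):
    """Return (header, rows) parsed from gfxsweep table text (hypothetical helper)."""
    header, rows = None, []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "WvO":
            # "SE 004" splits into two tokens; rejoin each pair as "SE_004".
            se_tokens = tokens[len(PARAM_COLS):]
            header = PARAM_COLS + [f"SE_{n}" for n in se_tokens[1::2]]
        elif header and len(tokens) == len(header):
            try:
                params = [int(t) for t in tokens[:len(PARAM_COLS)]]
                bandwidths = [float(t) for t in tokens[len(PARAM_COLS):]]
            except ValueError:
                continue  # separator or summary line, not a data row
            rows.append(params + bandwidths)
    return header, rows


def to_csv(header, rows):
    """Serialize the parsed table as CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def best_config(header, rows):
    """Return (bandwidth, parameter dict, SE column name) for the highest cell."""
    n = len(PARAM_COLS)
    best = None
    for row in rows:
        for col, bw in zip(header[n:], row[n:]):
            if best is None or bw > best[0]:
                best = (bw, dict(zip(PARAM_COLS, row[:n])), col)
    return best


header, rows = parse_gfxsweep_table(SAMPLE)
print(to_csv(header, rows))
print(best_config(header, rows))
```

Separator lines and the summary block are skipped because they do not have the same token count as the header, so the full preset output can be passed in unchanged.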


@gilbertlee-amd gilbertlee-amd requested a review from a team as a code owner April 15, 2026 22:43

Copilot AI left a comment

Pull request overview

Updates the gfxsweep preset to produce more spreadsheet-friendly tabular output and to support sweeping across a configurable set of GFX kernel parameters while tracking the best-performing combination.

Changes:

  • Reworked gfxsweep output into a single table with per-configuration rows and per-NUM_SUB_EXECS columns, plus a “best bandwidth” summary.
  • Switched transfer selection to an env-var-driven string parsed via ParseTransfers; the preset now iterates over multiple transfers when wildcard expansion produces more than one.
  • Simplified/updated the set of printed “related” environment variables for this preset.
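
The table layout described in the list above (one row per parameter combination, one column per NUM_SUB_EXECS value, plus a running best) can be sketched in Python as follows. This is an illustration only, not the preset's C++ code: the parameter values come from the "[GFX Sweep Related]" listing in the PR description, the nesting order is inferred from the example output, and fake_measure is a made-up stand-in for actually running a Transfer.

```python
import itertools

# Values taken from the "[GFX Sweep Related]" listing in the PR description.
# Insertion order matters: the last key varies fastest, as in the example table.
SWEEP = {
    "WaveOrder": [0],
    "WordSize": [4],
    "TemporalMode": [0],
    "BlockSize": [256, 512, 768, 1024],
    "Unroll": [1, 2, 4, 8],
}
NUM_SUB_EXECS = [4, 8, 16, 32, 64]


def fake_measure(config, num_sub_execs):
    """Hypothetical stand-in for a timed Transfer; returns a made-up GB/s value."""
    return config["BlockSize"] / 256 + config["Unroll"] / 8 - num_sub_execs / 64


def sweep(measure):
    """Build one row per configuration and track the best (bandwidth, config, SE)."""
    names = list(SWEEP)
    table, best = [], None
    for values in itertools.product(*(SWEEP[n] for n in names)):
        config = dict(zip(names, values))
        row = [measure(config, se) for se in NUM_SUB_EXECS]
        table.append((config, row))
        for se, bw in zip(NUM_SUB_EXECS, row):
            if best is None or bw > best[0]:
                best = (bw, config, se)
    return table, best


table, best = sweep(fake_measure)
print(len(table), "rows x", len(NUM_SUB_EXECS), "columns")
print("best:", best)
```

With the listed values the sweep yields 4 block sizes times 4 unroll factors, which is the 16 rows seen in the example output, each with 5 sub-executor columns.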

Comment thread src/client/Presets/GfxSweep.hpp Outdated
Comment thread src/client/Presets/GfxSweep.hpp Outdated
Comment thread src/client/Presets/GfxSweep.hpp Outdated
Comment thread src/client/Presets/GfxSweep.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@gilbertlee-amd gilbertlee-amd merged commit 2aa036c into ROCm:candidate Apr 17, 2026
4 checks passed
@gilbertlee-amd gilbertlee-amd deleted the gfxSweepUpdate branch April 17, 2026 16:06
@nileshnegi nileshnegi mentioned this pull request Apr 27, 2026
1 task
nileshnegi added a commit that referenced this pull request May 2, 2026
- Initial pod communication support (#235)
- cuda + MNNVL update & pod presets (#241)
- Increase CQ size for high qps (#244)
- fix hang when NVML is present but fabricmanager isnt (#246)
- Adding nica2a preset  (#248)
- Adding HBM read bandwidth preset (#250)
- Pod Ring preset (#251)
- gfxsweep preset (#254) (#256)
- Adding Batched DMA support (hipMemcpyBatchAsync), and bmasweep preset (#255)
- Adding a wallclock consistency detection preset (#258)
- Adding smoketest preset for simple correctness tests (#266)
- Help / envvars / presets presets (#267)
- Modernize CMake build (#268)
- Replace version-based pod/amd-smi detection with compile-time API probes (#269)
- Fix collective mismatch hangs in multi-rank error paths (#270)
- Fix SHOW_ITERATIONS table truncation with multiple transfers per executor (#271)
- Reformat a2asweep output to match gfxsweep style (#272)
- Gfx sweep update (#274)
- Increasing flush frequency in smoketest (#275)
- Adding new experimental copy-only GFX kernel, gfxsweep update (#277)
- Fixes for cuMem compilation and invalid device ordinal (#278)
- Simplifying socket connect, allow for using host address (#279)
- Updating podring to run on single node without need to force single pod (#280)
- Adding SHOW_PERCENTILES to show extra per-iteration statistics (#281)

---------

Co-authored-by: AtlantaPepsi <timhu102@gmail.com>
Co-authored-by: Pak Nin Lui <pak.lui@amd.com>
Co-authored-by: pierreantoineH <PierreAntoine.Harraud@amd.com>
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
Co-authored-by: Claude <claude@anthropic.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>