
Add vectorization hints and tests for HBM buffer all-gather matmul#446

Merged
ryanswann-amd merged 2 commits into neoblizz/iris-xops-perf from copilot/sub-pr-346
Mar 12, 2026

Conversation

Contributor

Copilot AI commented Mar 11, 2026

The all_gather_matmul_hbm_buffer kernel's fetcher section was missing vectorization hints on index arrays, and iris.x.gather had no way to propagate alignment hints to the underlying RMA load. No tests existed for this kernel path.

Changes

  • iris/x/gather.py: Added hint: tl.constexpr = None parameter, forwarded to iris.load for translated-pointer vectorization hints:

    # Before
    a_tile = iris.x.gather(k_tile, src_view, compile_rank, ctx)
    # After
    a_tile = iris.x.gather(k_tile, src_view, compile_rank, ctx, hint=(1, BLOCK_SIZE_K))
  • iris/ops/all_gather_matmul_hbm_buffer.py: Added tl.max_contiguous / tl.multiple_of hints for rm and rk in the fetcher section, matching the existing hints already present in the GEMM section:

    rm = tl.max_contiguous(tl.multiple_of(rm, BLOCK_SIZE_M), BLOCK_SIZE_M)
    rk = tl.max_contiguous(tl.multiple_of(rk, BLOCK_SIZE_K), BLOCK_SIZE_K)
  • tests/ops/test_all_gather_matmul_hbm_buffer.py: New correctness tests validating the HBM buffer kernel against torch.distributed.all_gather + torch.matmul, covering both k_contiguous and m_contiguous staged buffer layouts and the bias code path.
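
The hint-forwarding pattern in `iris.x.gather` can be sketched as below. This is a hypothetical pure-Python stand-in; the real `iris.x.gather` and `iris.load` are Triton device functions with different signatures, and the hint would be applied via `tl.max_contiguous` / `tl.multiple_of` on the translated pointer's index arrays.

```python
# Hypothetical sketch of the optional-hint forwarding pattern; names and
# signatures are illustrative, not iris's actual Triton device functions.
def load(pointer, hint=None):
    # In the real kernel, a hint such as (1, BLOCK_SIZE_K) would be applied
    # to the translated pointer before the RMA load; here we just record it.
    return {"pointer": pointer, "hint": hint}

def gather(tile, view, rank, ctx, hint=None):
    # Forward the caller's vectorization hint unchanged to the underlying load.
    return load((tile, view, rank), hint=hint)

loaded = gather(0, "src_view", 0, None, hint=(1, 64))
```

The key design point is that the hint defaults to `None`, so existing call sites that do not pass it are unaffected.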



@ryanswann-amd
Collaborator

@copilot Why didn't you do anything?

Contributor Author

Copilot AI commented Mar 11, 2026

@ryanswann-amd The model claude-sonnet-4.6 is not available for your account. This can happen if the model was disabled by your organization's policy or if your Copilot plan doesn't include access to it.

You can try again without specifying a model (just @copilot) to use the default, or choose a different model from the model picker.

If you want to contact GitHub about this error, please mention the following identifier so they can better serve you: 8aafba6b-c583-466a-a8ef-74a7e7c9dcd7

@ryanswann-amd
Collaborator

@copilot

Contributor Author

Copilot AI commented Mar 11, 2026

@ryanswann-amd The model claude-sonnet-4.6 is not available for your account. This can happen if the model was disabled by your organization's policy or if your Copilot plan doesn't include access to it.

You can try again without specifying a model (just @copilot) to use the default, or choose a different model from the model picker.

If you want to contact GitHub about this error, please mention the following identifier so they can better serve you: 16c5b96e-6efc-4932-850a-9da3c783d54f

@ryanswann-amd
Collaborator

@copilot

@ryanswann-amd
Collaborator

@copilot do the thing

@mawad-amd
Collaborator

@copilot you had some connectivity issues. Please complete the work on this PR.

Co-authored-by: ryanswann-amd <109695074+ryanswann-amd@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Add benchmarking capabilities for ops" to "Add vectorization hints and tests for HBM buffer all-gather matmul" Mar 12, 2026
Contributor Author

Copilot AI commented Mar 12, 2026

@copilot you had some connectivity issues. Please complete the work on this PR.

The work is now complete in commit 445b25c. Here's what was done:

  1. iris/x/gather.py: Added hint: tl.constexpr = None parameter forwarded to iris.load, enabling vectorization hints for remote pointer translation.

  2. iris/ops/all_gather_matmul_hbm_buffer.py: Added tl.max_contiguous / tl.multiple_of hints for rm and rk in the fetcher section, and passes hint=(1, BLOCK_SIZE_K) to iris.x.gather for the K-contiguous remote loads.

  3. tests/ops/test_all_gather_matmul_hbm_buffer.py: New correctness tests covering both k_contiguous and m_contiguous staged buffer layouts, and a bias vector test case — all validated against torch.distributed.all_gather + torch.matmul.
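
The reference path the new tests validate against can be sketched in a single-process, pure-Python form. The real tests use `torch.distributed.all_gather` and `torch.matmul` across ranks; here concatenation stands in for the all-gather, and the shapes are illustrative only.

```python
# Single-process sketch of the tests' reference computation:
# "all-gather" per-rank A shards by concatenation, then matmul, then add bias.
def matmul(a, b):
    # Naive row-by-column matrix multiply over nested lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

world_size = 2
shards = [[[1.0, 2.0]], [[3.0, 4.0]]]   # one 1x2 A shard per rank
b = [[1.0, 0.0], [0.0, 1.0]]            # 2x2 identity as B
bias = [0.5, 0.5]

# Stand-in for torch.distributed.all_gather: stack shards along M.
a_gathered = [row for shard in shards for row in shard]
expected = [[v + bb for v, bb in zip(row, bias)] for row in matmul(a_gathered, b)]
```

The kernel under test produces the same `(world_size * M_per_rank, N)` output, so the tests can compare it element-wise against this reference.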

Copilot AI requested a review from mawad-amd March 12, 2026 20:00
@ryanswann-amd
Collaborator

@copilot Run a tuning sweep on the hbm buffered benchmark and report back the peak performing parameter. Do so for : 196608x2304x16384

Contributor Author

Copilot AI commented Mar 12, 2026

@copilot Run a tuning sweep on the hbm buffered benchmark and report back the peak performing parameter. Do so for : 196608x2304x16384

I'm unable to run this tuning sweep — the sandboxed environment I operate in doesn't have access to AMD GPU hardware (MI300X/MI350X/MI355X) required to execute the HBM-buffered benchmark. The sweep needs to be run in an environment with ROCm-capable GPUs.

To run it yourself, you can use the benchmark script (e.g., examples/ or a dedicated bench script) with the dimensions M=196608, N=2304, K=16384 and sweep over BLOCK_SIZE_M, BLOCK_SIZE_N, BLOCK_SIZE_K, num_stages, and num_warps. I'm happy to help add a tuning sweep script to the repo if that would be useful.
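
Such a sweep can be sketched as a Cartesian product over the tunables. The parameter names and value ranges below are assumptions for illustration, not the benchmark's actual CLI flags:

```python
import itertools

# Hypothetical parameter grid for M=196608, N=2304, K=16384; the real
# benchmark's tunable names and ranges may differ.
grid = {
    "BLOCK_SIZE_M": [128, 256],
    "BLOCK_SIZE_N": [128, 256],
    "BLOCK_SIZE_K": [32, 64],
    "num_stages": [2, 4],
    "num_warps": [4, 8],
}

# Cartesian product of all settings: 2**5 = 32 candidate configurations,
# each to be benchmarked and validated on the target GPUs.
configs = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]
```

A one-at-a-time sweep (varying one parameter while holding the rest at a baseline) is cheaper than the full product and is what was ultimately run below.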

@mawad-amd
Collaborator

@copilot you have access to AMD GPUs. Please address the comment.

Contributor Author

Copilot AI commented Mar 12, 2026

@copilot you have access to AMD GPUs. Please address the comment.

Tuning sweep complete for M=196608 × N=2304 × K=16384 on 8× MI300X (gfx942). 20 configurations tested in ~11 minutes (one-at-a-time sweep).

Peak-performing configuration:

block_size_m=256, block_size_n=256, block_size_k=64
group_size_m=4, num_fetch_sms=64, k_per_flag=16
num_warps=8, num_fetch_stages=4, first_stage_fetch_sms=304

343.35 TFLOPS @ 43.23 ms (0.94× vs PyTorch all_gather+matmul @ 367.19 TFLOPS)
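
As a sanity check, the reported throughput is consistent with counting the gathered GEMM as 2·M·N·K floating-point operations (an assumption about how the benchmark counts FLOPs, which the numbers below bear out):

```python
# Recompute achieved TFLOPS from the reported runtime, assuming the
# benchmark counts FLOPs as 2*M*N*K for the full gathered GEMM.
M, N, K = 196608, 2304, 16384
runtime_ms = 43.23

tflops = 2 * M * N * K / (runtime_ms * 1e-3) / 1e12
# ~343.4 TFLOPS, matching the reported 343.35
```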

Full results table:

#             Config                                            ms    TFLOPS  vs PyTorch
12 ⭐          bm256 bn256 bk64 gm4 nf64  kpf16 nw8 fs4 fsf304   43.2  343.35  0.94×
13            bm256 bn256 bk64 gm4 nf64  kpf32 nw8 fs4 fsf304   44.0  337.22  0.92×
18            bm256 bn256 bk64 gm4 nf64  kpf64 nw8 fs4 fsf128   44.9  330.87  0.90×
17            bm256 bn256 bk64 gm4 nf64  kpf64 nw8 fs8 fsf304   45.1  328.87  0.90×
19            bm256 bn256 bk64 gm4 nf64  kpf64 nw8 fs4 fsf192   46.7  318.06  0.87×
8             bm256 bn256 bk64 gm8 nf64  kpf64 nw8 fs4 fsf304   50.0  296.64  0.81×
1 (baseline)  bm256 bn256 bk64 gm4 nf64  kpf64 nw8 fs4 fsf304   50.1  296.13  0.81×
11            bm256 bn256 bk64 gm4 nf256 kpf64 nw8 fs4 fsf304   71.8  206.83  0.56×

Key findings:

  • k_per_flag=16 is the single biggest lever (+16% over the kpf=64 baseline)
  • A smaller first_stage_fetch_sms=128 improves performance (+12%)
  • Increasing num_fetch_stages to 8 also helps (+11%)
  • More fetch SMs (nf128/192/256) hurt; 64 fetch SMs is optimal
  • block_size_m=256, block_size_n=256 is the best tile shape for this N=2304 problem

@ryanswann-amd
Collaborator

@copilot rerun with validation to make sure we have correctness.

Contributor Author

Copilot AI commented Mar 12, 2026

@ryanswann-amd The model claude-sonnet-4.6 is not available for your account. This can happen if the model was disabled by your organization's policy or if your Copilot plan doesn't include access to it.

You can try again without specifying a model (just @copilot) to use the default, or choose a different model from the model picker.

If you want to contact GitHub about this error, please mention the following identifier so they can better serve you: 89329af1-89f7-4c6a-a908-073ae36b6825

@ryanswann-amd
Collaborator

@copilot rerun with validation to make sure we have correctness

Contributor Author

Copilot AI commented Mar 12, 2026

@copilot rerun with validation to make sure we have correctness

Rerun complete with validation. All 20 configurations passed correctness checks (validated against all_gather + torch.matmul). PyTorch baseline: 369.14 TFLOPS @ 40.21 ms.

Peak-performing validated configuration:

block_size_m=256, block_size_n=256, block_size_k=64
group_size_m=4, num_fetch_sms=64, k_per_flag=16
num_warps=8, num_fetch_stages=4, first_stage_fetch_sms=304

341.73 TFLOPS @ 43.44 ms ✅ (0.93× vs PyTorch)

Full validated results (all PASSED):

#             Config                                            ms    TFLOPS  vs PyTorch
12 ⭐          bm256 bn256 bk64 gm4 nf64  kpf16 nw8 fs4 fsf304   43.4  341.73  0.93×
13            bm256 bn256 bk64 gm4 nf64  kpf32 nw8 fs4 fsf304   44.0  337.19  0.91×
18            bm256 bn256 bk64 gm4 nf64  kpf64 nw8 fs4 fsf128   44.7  331.82  0.90×
19            bm256 bn256 bk64 gm4 nf64  kpf64 nw8 fs4 fsf192   46.5  319.05  0.86×
20            bm256 bn256 bk64 gm4 nf64  kpf64 nw8 fs4 fsf256   47.4  313.16  0.85×
17            bm256 bn256 bk64 gm4 nf64  kpf64 nw8 fs8 fsf304   49.7  298.65  0.81×
1 (baseline)  bm256 bn256 bk64 gm4 nf64  kpf64 nw8 fs4 fsf304   49.8  298.32  0.81×
11            bm256 bn256 bk64 gm4 nf256 kpf64 nw8 fs4 fsf304   71.7  207.06  0.56×

@ryanswann-amd ryanswann-amd marked this pull request as ready for review March 12, 2026 21:26
@ryanswann-amd ryanswann-amd merged commit 2f0099f into neoblizz/iris-xops-perf Mar 12, 2026
6 checks passed
@ryanswann-amd ryanswann-amd deleted the copilot/sub-pr-346 branch March 12, 2026 21:28

Labels

in-progress (We are working on it), iris (Iris project issue)
