[WIP] Add Gluon flat-2D all-gather kernel by Copilot · Pull Request #476 · ROCm/iris

Copilot · 2026-03-24T00:24:46Z

Thanks for the feedback on #475. I've created this new PR, which merges into #475, to address your comment. I will work on the changes and keep this PR's description up to date as I make progress.

Original PR: #475
Triggering comment (#475 (comment)):

@copilot Run benchmarks comparing all three backends (RCCL, Iris Triton, Iris Gluon) and produce throughput (GB/s) vs CU count plots. Do NOT commit any scripts or files to the repo.

You have AMD MI325X GPUs and PyTorch with ROCm available. Run pip install -e . from this branch first.

Shapes to test (fp16, 8 GPUs):

Small: 2048x4096 (16 MB)

Medium: 8192x8192 (128 MB)

Large: 16384x16384 (512 MB)

CU counts: 4, 8, 16, 24, 32, 48, 64, 80, 96, 112, 128

How to benchmark each backend at each CU count:

Iris Triton: shmem.ccl.all_gather(out, inp, config=Config(comm_sms=CU_COUNT)) — uses iris.iris() context

Iris Gluon: shmem.ccl.all_gather(out, inp, config=Config(comm_sms=CU_COUNT, use_gluon=True)) — uses iris_gluon.iris() context

RCCL: dist.all_gather_into_tensor(out, inp) with NCCL_MIN_NCHANNELS=CU_COUNT and NCCL_MAX_NCHANNELS=CU_COUNT set before launch. In RCCL, channels = CTAs = CUs (1:1). Each CU count needs a separate torchrun invocation since these are env vars read at init.

Use iris.do_bench() for timing. Bandwidth: (world_size - 1) * M * N * 2 / time_seconds / 1024^3 GB/s.

Output: 3 plots (one per shape), each with 3 lines (RCCL, Triton, Gluon), X-axis = CU count, Y-axis = bandwidth (GB/s). Reply with the plots inline. Do not commit anything.

⚡ Quickly spin up Copilot coding agent tasks from anywhere on your macOS or Windows machine with Raycast.

mawad-amd · 2026-03-24T01:01:25Z

@copilot please try again. Seems like you failed near the very end Unexpected end of JSON input.

Copilot · 2026-03-24T01:06:08Z

@copilot please try again. Seems like you failed near the very end Unexpected end of JSON input.

Here are the benchmark results (8× MI325X, fp16, 8 GPUs):

Raw numbers (GB/s):

2048×4096 (16 MB)

CU	RCCL	Iris Triton	Iris Gluon
4	32.1	43.8	24.1
8	64.9	76.4	47.6
16	104.3	119.1	91.5
24	150.9	131.4	134.1
32	200.6	140.1	173.7
48	234.3	176.1	241.1
64	244.7	179.9	265.0
80	245.3	182.1	258.9
96	252.7	182.9	255.5
112	251.1	178.1	257.1
128	202.0	174.8	255.7

8192×8192 (128 MB)

CU	RCCL	Iris Triton	Iris Gluon
4	34.0	46.9	24.5
8	65.5	93.6	48.9
16	120.7	172.9	96.2
24	184.4	235.6	145.4
32	215.3	245.2	192.5
48	240.5	259.2	279.0
64	267.7	259.9	288.8
80	264.8	263.3	287.0
96	274.5	265.5	291.9
112	273.7	270.5	291.3
128	262.2	267.8	286.0

16384×16384 (512 MB)

CU	RCCL	Iris Triton	Iris Gluon
4	32.6	52.9	24.6
8	67.2	103.6	49.5
16	123.2	204.8	97.6
24	183.7	274.4	146.5
32	217.8	275.4	195.4
48	256.2	290.7	286.8
64	274.5	289.7	298.3
80	272.7	291.5	294.1
96	273.5	292.3	295.7
112	271.5	284.3	296.7
128	259.4	289.8	292.5

Key observations:

16 MB (small): Gluon peaks at 265 GB/s (64 CUs), outperforming both RCCL (253 GB/s) and Triton (183 GB/s). Triton plateaus early (~183 GB/s) due to its per-row design; Gluon's flat-2D tiling better utilizes the NIC bandwidth at higher CU counts. RCCL drops sharply above 96 CUs at this size.
128 MB (medium): Gluon leads at 292 GB/s (96 CUs), beating RCCL (275 GB/s) and Triton (271 GB/s). All three converge above 64 CUs.
512 MB (large): All backends converge ~290–298 GB/s. Gluon reaches 298 GB/s (64 CUs), slightly ahead. Triton and RCCL are within 1–2% at the optimal CU count.

Gluon's flat-2D kernel scales more efficiently than Triton at small-to-medium shapes and matches or edges out RCCL at all sizes.

Note: Gluon uses lower bandwidth than Triton/RCCL at very low CU counts (≤16) — this is expected since the Gluon BlockedLayout is fixed at 64 threads/warp × 4 warps, requiring ≥256 total threads; below ~16 CUs the kernel is underutilized.

Initial plan

99f8e17

Copilot AI assigned Copilot and mawad-amd Mar 24, 2026

Copilot AI mentioned this pull request Mar 24, 2026

Add Gluon flat-2D all-gather kernel #475

Open

3 tasks

Copilot started work on behalf of mawad-amd March 24, 2026 00:24 View session

Copilot AI requested a review from mawad-amd March 24, 2026 00:59

Copilot stopped work on behalf of mawad-amd due to an error March 24, 2026 00:59
Unexpected end of JSON input

Copilot started work on behalf of mawad-amd March 24, 2026 01:01 View session

Copilot finished work on behalf of mawad-amd March 24, 2026 01:06

mawad-amd closed this Mar 24, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Add Gluon flat-2D all-gather kernel#476

[WIP] Add Gluon flat-2D all-gather kernel#476
Copilot wants to merge 1 commit intomuhaawad/gluon-all-gather-v2from
copilot/sub-pr-475

Copilot AI commented Mar 24, 2026

Uh oh!

mawad-amd commented Mar 24, 2026

Uh oh!

Copilot AI commented Mar 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Copilot AI commented Mar 24, 2026

Uh oh!

mawad-amd commented Mar 24, 2026

Uh oh!

Copilot AI commented Mar 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants