[Codegen][GPU] Lower gpu.subgroup_reduce to DPP intrinsics on AMD GPUs #20468

Merged

Muzammiluddin-Syed-ECE
Contributor

@Muzammiluddin-Syed-ECE Muzammiluddin-Syed-ECE commented Apr 3, 2025

When performing cross-lane reductions using subgroup_reduce ops across contiguous lanes on AMD GPUs, lower to Data Parallel Primitives (DPP) ops when possible. This reduces latency on applicable devices.
See related #20007
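
To make "cross-lane reductions across contiguous lanes" concrete, here is a minimal Python model of what a clustered subgroup reduction computes: every lane ends up holding the reduction of its own cluster. The function name `clustered_subgroup_reduce` and the XOR-butterfly exchange are assumptions chosen for illustration. The sketch models only the result and the lane-to-lane data movement, not the DPP instruction sequence this PR emits.

```python
from operator import add
from typing import Callable, List, Sequence


def clustered_subgroup_reduce(
    lanes: Sequence[int],
    cluster_size: int,
    combine: Callable[[int, int], int] = add,
) -> List[int]:
    """Reduce each cluster of contiguous lanes; every lane gets its cluster's result.

    Hypothetical model: `combine` is assumed to be commutative and associative.
    """
    n = len(lanes)
    if cluster_size <= 0 or cluster_size & (cluster_size - 1):
        raise ValueError("cluster_size must be a power of two")
    if n % cluster_size:
        raise ValueError("subgroup size must be a multiple of cluster_size")

    values = list(lanes)
    stride = 1
    while stride < cluster_size:
        # Each lane combines its value with its XOR partner `stride` lanes away.
        # Because clusters are aligned and stride < cluster_size, the partner
        # always lies in the same cluster.
        values = [combine(values[i], values[i ^ stride]) for i in range(n)]
        stride *= 2
    return values
```

For example, with eight lanes holding 0 through 7 and a cluster size of 4, lanes 0 to 3 all end with 6 and lanes 4 to 7 all end with 22. Each doubling of the stride corresponds to one cross-lane exchange step, which is the kind of step a lane-to-lane data path can serve without going through shared memory.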

@Muzammiluddin-Syed-ECE Muzammiluddin-Syed-ECE force-pushed the muzasyed/sub branch 3 times, most recently from e69e013 to 1969b6b Compare May 1, 2025 19:30
@Muzammiluddin-Syed-ECE Muzammiluddin-Syed-ECE force-pushed the muzasyed/sub branch 2 times, most recently from 1f63b8e to e553482 Compare May 5, 2025 22:30
Contributor

@krzysz00 krzysz00 left a comment

Code-wise, this looks fine to me

However

  1. Can we get a test that just checks that DPP shows up in a case where we expect it to?
  2. Can we get perf numbers? I figure SDXL UNet with/without this patch might be enlightening.

@Muzammiluddin-Syed-ECE Muzammiluddin-Syed-ECE marked this pull request as ready for review May 6, 2025 20:17
@Muzammiluddin-Syed-ECE Muzammiluddin-Syed-ECE force-pushed the muzasyed/sub branch 3 times, most recently from 8acf341 to 60dc379 Compare May 15, 2025 01:12
@Muzammiluddin-Syed-ECE Muzammiluddin-Syed-ECE force-pushed the muzasyed/sub branch 2 times, most recently from 6234f1b to 6184e2f Compare May 21, 2025 03:42
@Muzammiluddin-Syed-ECE
Contributor Author

Deactivated this change on the SPIRV pipeline because of this issue: #20872

@Muzammiluddin-Syed-ECE
Contributor Author

Muzammiluddin-Syed-ECE commented May 29, 2025

A commit has been merged upstream to fix CI failures from this PR: llvm/llvm-project@893ef7f
After cherry-picking that commit, the failures were resolved locally (see Gist).

After the next integrate, this will be mergeable.

Signed-off-by: Muzammiluddin Syed <muzasyed@amd.com>
…blems

Signed-off-by: Muzammiluddin Syed <muzasyed@amd.com>
…onToGPU to allow pass to make decisions based on backend target

Signed-off-by: Muzammiluddin Syed <muzasyed@amd.com>
Member

@kuhar kuhar left a comment

Just some nits

Contributor

@krzysz00 krzysz00 left a comment

Looks good to me

Comment on lines 449 to 450
// SPIRV doesn't support clustered reduction, so if possible, avoid adding
// problematic attribute until it is supported.
Member

I can see cluster sizes in the spec, e.g.: https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#OpGroupNonUniformIAdd

What is missing?

Contributor Author

There is a lowering missing in GPUToSPIRV for subgroup reduce ops when a cluster size attribute is specified.

https://github.com/llvm/llvm-project/blob/0a25b5022831c7465790cf99655afdcd0f91e34d/mlir/lib/Conversion/GPUToSPIRV/GPUToSPIRV.cpp#L592-L594

Member

I'd update the comment to say that the lowering is missing, not that SPIR-V doesn't support it.

Contributor Author

@Muzammiluddin-Syed-ECE Muzammiluddin-Syed-ECE Jun 4, 2025

To provide more context, support for subgroup reduce lowering varies across the three GPU-related paths we support:

- Path A) ROCDL
- Path B) NVVM
- Path C) SPIRV

A) subgroup_reduce is fully supported.

B) & C) NVVM and SPIRV can't lower clustered subgroup_reduce ops, but they do support full-warp reductions.

The issue is that the backends have varying levels of support for subgroup_reduce, but VectorReductionToGPUPass touches all three. So, some murky decisions were made to account for this:

  1. We do not preserve subgroup reductions that would produce clustering because of SPIRV's lack of support. An example in this pass is the warp reduction function, which first reduces within warps and then across warps. We choose to lower the reduction across warps to gpu.shuffles because reductions across warps require clustered reductions at the moment.

The fix would be to add a proper lowering for subgroup reduce in the clustered case.

  2. We introduced a flag forROCDL to GPU passes that ideally should not need to treat NVVM and ROCDL differently. But because NVVM lacks lowering support for clustered subgroup reduce, it was necessary.

This lack of support is easy to fix: we just need to create a non-gpu.shuffle lowering in ExpandGPUOps for NVVM, like we did for AMDGPU, and then add support for lowering subgroup reduce in the clustered case.

Ideally, at some point we could do this cleanup in a follow-up PR and undo these two decisions.

edit: I guess I should create an issue for this: #21006
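
To illustrate why the across-warp step in point 1 needs a clustered reduction, here is a small Python model of the two-stage scheme described above. It is a sketch under assumptions: `two_stage_reduce` is a hypothetical name, the reduction is a plain sum, and the layout (one partial result per warp, placed in the first lanes of a single warp) is an assumed reading of the comment rather than the pass's actual code.

```python
from typing import List, Sequence, Tuple


def two_stage_reduce(
    values: Sequence[int], subgroup_size: int
) -> Tuple[List[int], int, int]:
    """Sum `values` by reducing within warps first, then across warps.

    Returns (per-warp partial sums, cluster size needed by stage 2, total).
    Hypothetical model for illustration only.
    """
    if subgroup_size <= 0 or len(values) % subgroup_size:
        raise ValueError("workgroup size must be a multiple of subgroup_size")

    # Stage 1: every warp performs a full-width subgroup reduction.
    partials = [
        sum(values[start:start + subgroup_size])
        for start in range(0, len(values), subgroup_size)
    ]
    num_warps = len(partials)
    if num_warps > subgroup_size:
        raise ValueError("this sketch assumes the partials fit in one warp")

    # Stage 2: the partials occupy only the first `num_warps` lanes of one
    # warp, so the reduction spans a cluster of `num_warps` contiguous lanes
    # instead of the full warp.
    cluster_size = num_warps
    total = sum(partials[:cluster_size])
    return partials, cluster_size, total
```

With 256 values and a subgroup size of 64, stage 2 only has four lanes of meaningful data, so it is a cluster-size-4 reduction. A backend without a clustered subgroup_reduce lowering has to express that step some other way, such as the gpu.shuffle sequence mentioned above.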

Contributor Author

> I'd update the comment to say that the lowering is missing, not that SPIR-V doesn't support it.

done

Contributor Author

OK, it turns out I did create an issue for SPIRV, #20872, and it already has a PR open: llvm/llvm-project#141402

Signed-off-by: Muzammiluddin Syed <muzasyed@amd.com>
@Muzammiluddin-Syed-ECE Muzammiluddin-Syed-ECE changed the title from "[AMDGPU] Implement gpu.subgroup_reduce with DPP intrinsics on AMD GPUs" to "[Codegen][GPU] Lower gpu.subgroup_reduce to DPP intrinsics on AMD GPUs" Jun 4, 2025
@Muzammiluddin-Syed-ECE Muzammiluddin-Syed-ECE enabled auto-merge (squash) June 4, 2025 16:19
@Muzammiluddin-Syed-ECE Muzammiluddin-Syed-ECE merged commit 0c342e0 into iree-org:main Jun 4, 2025
43 checks passed
Muzammiluddin-Syed-ECE added a commit to Muzammiluddin-Syed-ECE/iree that referenced this pull request Jun 10, 2025
3 participants