[Codegen][GPU] Lower gpu.subgroup_reduce to DPP intrinsics on AMD GPUs #20468
Conversation
Force-pushed 0154cf1 to 6d4c462
Force-pushed 3be73fd to 0588b97
Force-pushed d7689df to 6e33136
Force-pushed e69e013 to 1969b6b
Force-pushed 1f63b8e to e553482
Code-wise, this looks fine to me. However:
- Can we get a test that just checks that DPP shows up in a case where we expect it to?
- Can we get perf numbers? I figure SDXL UNet with/without this patch might be enlightening.
Force-pushed 8acf341 to 60dc379
Force-pushed 6234f1b to 6184e2f
Deactivated this change on the SPIRV pipeline because of this issue: #20872
Force-pushed a463098 to c820e8f
Force-pushed c820e8f to e765dca
A commit has been merged upstream to fix CI failures from this PR: llvm/llvm-project@893ef7f. After the next integrate this will be mergeable.
Signed-off-by: Muzammiluddin Syed <muzasyed@amd.com>
…blems Signed-off-by: Muzammiluddin Syed <muzasyed@amd.com>
…onToGPU to allow pass to make decisions based on backend target Signed-off-by: Muzammiluddin Syed <muzasyed@amd.com>
Force-pushed e765dca to 907b55f
Just some nits
compiler/src/iree/compiler/Codegen/LLVMGPU/test/ROCDL/pipeline_vector_distribute_gfx950.mlir (outdated, resolved)
compiler/src/iree/compiler/Codegen/LLVMGPU/ROCDLLowerExecutableTarget.cpp (outdated, resolved)
compiler/src/iree/compiler/Codegen/LLVMGPU/test/ROCDL/pipeline_vector_distribute_gfx1100.mlir (outdated, resolved)
Force-pushed 48faabb to f611789
Looks good to me
// SPIRV doesn't support clustered reduction, so if possible, avoid adding
// problematic attribute until it is supported.
I can see cluster sizes in the spec, e.g.: https://registry.khronos.org/SPIR-V/specs/unified1/SPIRV.html#OpGroupNonUniformIAdd
What is missing?
There is a lowering missing in GPUToSPIRV for subgroup reduce ops when a cluster size attribute is specified.
I'd update the comment to say that the lowering is missing, not that SPIR-V doesn't support it.
To provide more context, there is varying support for subgroup reduce lowering in the three GPU-related paths we support:
- Path A) ROCDL
- Path B) NVVM
- Path C) SPIRV
A) subgroup_reduce is fully supported.
B) & C) NVVM & SPIRV can't lower clustered subgroup_reduce ops, but they do support full-warp reductions.
So the issue is that the other backends have varying levels of support for subgroup_reduce, but VectorReductionToGPUPass touches all three. Some murky decisions were made to account for this:
- We do not preserve subgroup reductions that would produce clustering, because of SPIRV's lack of support. An example in this pass is the warp reduction function, which first reduces within warps and then across warps. We choose to lower the cross-warp reduction to gpu.shuffles, because reductions across warps require clustered reductions at the moment. The fix would be to add a proper lowering for subgroup reduce in the clustered case.
- We introduced a forROCDL flag to GPU passes that ideally should not need to treat NVVM and ROCDL differently, but it was necessary because NVVM lacks clustered subgroup reduce lowering support. This is easy to fix: we just need to create a non-gpu.shuffle lowering in ExpandGPUOps for NVVM, like we did for AMDGPU, and then add support for lowering subgroup reduce in the clustered case.
Ideally, at some point we could do this cleanup in a follow-up PR and undo these two decisions.
edit: I guess I should create an issue for this: #21006
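To make the "clustered" distinction above concrete: a clustered subgroup reduce partitions the lanes of a subgroup into contiguous groups of cluster_size and reduces within each group, with every lane receiving its cluster's result; a full-warp reduce is the special case where the cluster spans the whole subgroup. A hypothetical Python model of these semantics (not IREE or MLIR code, just an illustration):

```python
import functools

def clustered_subgroup_reduce(lane_values, cluster_size, op=lambda a, b: a + b):
    """Model of a clustered subgroup reduction.

    Lanes are split into contiguous clusters of `cluster_size`;
    each lane receives the reduction of its own cluster.
    """
    result = []
    for start in range(0, len(lane_values), cluster_size):
        cluster = lane_values[start:start + cluster_size]
        reduced = functools.reduce(op, cluster)
        result.extend([reduced] * len(cluster))
    return result

# Example: 8 lanes, cluster size 4 -> two independent reductions.
print(clustered_subgroup_reduce(list(range(8)), 4))
```

With cluster_size equal to the subgroup width, this degenerates to the full-warp reduction that all three backends already handle.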
> I'd update the comment to say that the lowering is missing, not that SPIR-V doesn't support it.
done
Ok, turns out I did create an issue for SPIRV, #20872, and it already has a PR open: llvm/llvm-project#141402
Force-pushed f611789 to b7a554a
… AMD GPUs (iree-org#20468)" This reverts commit 0c342e0.
When performing cross-lane reductions using subgroup_reduce ops across contiguous lanes on AMD GPUs, lower to Data Parallel Primitives (DPP) ops when possible. This reduces latency on applicable devices.
See related #20007
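The cross-lane reduction this patch targets is typically lowered as a log2(width) butterfly of lane exchanges; the patch's contribution is letting the contiguous-lane exchanges use DPP permutes instead of gpu.shuffle on AMD hardware. A hypothetical Python model of the butterfly pattern itself (not IREE code; the DPP selection is hardware-specific and not modeled here):

```python
def subgroup_reduce_butterfly(lane_values, op=lambda a, b: a + b):
    """Model of a butterfly (xor-shuffle) subgroup reduction.

    At step k, every lane combines its value with the value held by
    lane (i XOR 2**k). After log2(n) steps, every lane holds the full
    reduction. On AMD GPUs, the small-offset exchanges in this pattern
    can be served by DPP cross-lane permutes instead of full shuffles.
    Assumes a power-of-two lane count.
    """
    vals = list(lane_values)
    n = len(vals)
    offset = 1
    while offset < n:
        vals = [op(vals[i], vals[i ^ offset]) for i in range(n)]
        offset *= 2
    return vals

# Example: 8 lanes; every lane ends up with the sum of all lanes.
print(subgroup_reduce_butterfly(list(range(8))))
```

The latency win in the PR description comes from replacing the generic cross-lane shuffles in this pattern with cheaper DPP operations where the exchange distance permits it.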