
Wmma support for grouped convolution bwd weight #2947

Merged

illsilin merged 78 commits into develop from streamhpc/conv_bwd_weight_wmma on Dec 17, 2025

Conversation

@EnricoDeg (Contributor)

Proposed changes

Summary:

  • Modify the gridwise implementation to work with convolution (grid descriptors are no longer created internally but are passed in from the device level)
  • Add device-level implementations: DeviceGroupedConvBwdWeight_Wmma_CShuffleV3, DeviceGroupedConvBwdWeightTwoStage_Wmma_CShuffleV3, and DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3
  • Add a device implementation of batched GEMM with multiple Ds (needed for the explicit-GEMM path of conv bwd weight)
  • Adapt the existing explicit-GEMM device implementation to work with both the XDL and WMMA implementations of batched GEMM with multiple Ds
  • Add occupancy-based split-K support for the one-stage and two-stage implementations of grouped conv bwd weight
  • Create instances
  • Add examples
  • Remove old instances (they don't support splitk)
  • Add tests for bwd weight scale

The implementations are based on CShuffleV3, but the functionality matches the XDL versions.
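The occupancy-based split-K selection mentioned above can be sketched roughly as follows. This is a hypothetical illustration, not the actual CK logic: the function name, parameters, and the power-of-two growth heuristic are all assumptions; the real implementation would query device occupancy at runtime.

```cpp
#include <cassert>

// Hypothetical sketch of occupancy-based split-K selection: grow the
// split-K factor until the launched grid has enough workgroups to
// saturate the device, while keeping the K dimension evenly divisible.
inline int PickSplitK(int grid_blocks,    // workgroups at split_k = 1
                      int num_cus,        // compute units on the device
                      int blocks_per_cu,  // occupancy per compute unit
                      int k_size,         // reduction (K) dimension
                      int max_split_k = 32)
{
    const int target = num_cus * blocks_per_cu; // blocks needed to fill the GPU
    int split_k      = 1;
    while(split_k < max_split_k && grid_blocks * split_k < target &&
          k_size % (split_k * 2) == 0)
    {
        split_k *= 2;
    }
    return split_k;
}
```

The idea is that small convolution problems launch too few workgroups to occupy the GPU, so splitting the reduction dimension recovers parallelism at the cost of an extra accumulation pass.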

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation that enables the maintainers to understand the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

…/conv_bwd_weight_wmma'

Convolution bwd weight device implementation

See merge request amd/ai/composable_kernel!38
 - rdna3 compilation error
 - gridwise layouts (need to be correct to ensure that CheckValidity()
   works correctly)
…re/conv_bwd_weight_wmma'

Grouped conv: Instances and example bwd weight

See merge request amd/ai/composable_kernel!47
Device implementation of explicit gemm for grouped conv bwd weight

See merge request amd/ai/composable_kernel!52
krithalith and others added 17 commits December 15, 2025 08:56
…le V3 instances. CShuffleBlockTransferScalarPerVector adapted to 4, and mergegroups fixed to 1 for now. No more special instance lists.
…duplications. Also removing stride1pad0 support for NHWGC since we can use explicit for those cases.
… layout / datatype support as before the instance selection process.
… NHWGC. They are never faster and support is already carried by CShuffleV3 and Explicit.
…fwd declarations, cmakelists entries. Also merge the "wmma" and "wmma v3" instance list files, which are both v3.
…NHWGCxGKYXC and F16 or BF16 (no mixed in-out types).
…tance_selection

WIP: Grouped convolution bwd weight wmma v3 instance selection
@bartekxk bartekxk requested review from Copilot and removed request for aska-0096 December 16, 2025 08:50
Copilot AI left a comment

Pull request overview

Copilot reviewed 82 out of 83 changed files in this pull request and generated no new comments.



@bartekxk (Contributor) left a comment

lgtm, minor comments


    GridwiseGemm::template Run<HasMainKBlockLoop, EGlobalMemoryDataOperation, TailNum>(
-       p_shared, splitk_batch_offset, karg, epilogue_args, k_id);
+       p_shared, splitk_batch_offset, karg, epilogue_args, 0, k_id);
Contributor
Please add in a /*...*/ comment what the 0 means

Contributor (Author)
Done
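The idiom the reviewer asked for is annotating positional literals with the parameter name in a `/*name=*/` comment. A minimal sketch (the function below is a hypothetical stand-in for the real `Run()` argument list, whose signature is not shown in this excerpt):

```cpp
#include <cassert>

// Hypothetical stand-in for a call with several positional arguments.
// The /*name=*/ idiom makes a bare literal like 0 self-documenting at
// the call site.
inline int Combine(int batch_offset, int reduce_idx, int k_id)
{
    return batch_offset * 100 + reduce_idx * 10 + k_id;
}

// Annotated call site: the reader can see which parameter the 0 binds to
// without looking up the declaration.
inline int Example()
{
    return Combine(/*batch_offset=*/1, /*reduce_idx=*/0, /*k_id=*/2);
}
```

clang-format recognizes this comment style and keeps the annotation attached to its argument.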

//################################################| Spatial| | | | | | | | Operation| Operation| Operation| Specialization| | | | | | | | | | Lengths_AK0_M_AK1| ArrangeOrder| | | PerVector| PerVector_AK1| | Lengths_BK0_N_BK1| ArrangeOrder| | | PerVector| PerVector_BK1| | PerShuffle| PerShuffle| MBlock_MPerBlock| _NPerBlock| Sched| Ver| |
//################################################| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | NBlock_NPerBlock| | | | |
DeviceGroupedConvBwdWeightTwoStage_Wmma_CShuffleV3< NDimSpatial, ALayout, BLayout, ELayout, F16, F16, F16, F32, PassThrough, PassThrough, PassThrough, ConvSpec, 32, 16, 16, 32, 8, 16, 16, 1, 1, S<4, 8, 1>, S<2, 0, 1>, S<1, 0, 2>, 1, 1, 4, 0, S<4, 8, 1>, S<2, 0, 1>, S<1, 0, 2>, 1, 1, 4, 0, 1, 1, S<1, 4, 1, 8>, 1, Scheduler, PipelineVersion, 1>
// DeviceGroupedConvBwdWeightTwoStage_Wmma_CShuffleV3< NDimSpatial, ALayout, BLayout, ELayout, F16, F16, F16, F32, PassThrough, PassThrough, PassThrough, ConvSpec, 128, 128, 128, 32, 8, 16, 16, 8, 2, S<4, 32, 1>, S<2, 0, 1>, S<1, 0, 2>, 1, 4, 8, 0, S<4, 32, 1>, S<2, 0, 1>, S<1, 0, 2>, 1, 4, 8, 0, 1, 1, S<1, 16, 1, 8>, 4, Scheduler, PipelineVersion, 1>,
Contributor
Why are most of them disabled?

Contributor
This is also described in the instance selection document I sent, but in short: for NHWGCxGKCYX we also have explicit GEMM, and perf tests showed that the TwoStage implementation was never faster or needed for support on this layout. Therefore we only use a single generic instance for now. Two-stage becomes more relevant for other layouts, but we are not adding those right now.
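The layout-driven selection described here can be sketched as a simple dispatch. Names are illustrative only, not CK's actual instance-selection API:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch of the dispatch described above: for the
// NHWGC/GKCYX layout combination, explicit GEMM is preferred and
// TwoStage contributes only a single generic fallback instance.
inline std::vector<std::string> SelectInstances(const std::string& in_layout,
                                                const std::string& wei_layout)
{
    if(in_layout == "NHWGC" && wei_layout == "GKCYX")
    {
        // Perf tests showed TwoStage was never faster here, so explicit
        // GEMM comes first and TwoStage keeps one generic instance.
        return {"ExplicitGemm", "TwoStageGeneric"};
    }
    // Other layouts would rely more heavily on tuned TwoStage
    // instances, which are not added in this PR.
    return {"TwoStageGeneric"};
}
```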

Contributor
Ok, I understand. For the future, it is better to add a comment in the code, because nobody will read documents shared in a private chat regarding a public repo.

@EnricoDeg EnricoDeg marked this pull request as ready for review December 17, 2025 09:53
@illsilin illsilin merged commit 87dd073 into develop Dec 17, 2025
25 of 27 checks passed
@illsilin illsilin deleted the streamhpc/conv_bwd_weight_wmma branch December 17, 2025 23:59
SreecharanGundaboluAMD added a commit that referenced this pull request Dec 21, 2025
This reverts commit 87dd073.

Note: Resolved merge conflicts on a best-effort basis.

5 participants