Wmma support for grouped convolution bwd weight #2947
Conversation
…/conv_bwd_weight_wmma' Convolution bwd weight device implementation See merge request amd/ai/composable_kernel!38
- rdna3 compilation error - gridwise layouts (need to be correct to ensure that CheckValidity() works correctly)
…re/conv_bwd_weight_wmma' Grouped conv: Instances and example bwd weight See merge request amd/ai/composable_kernel!47
Based on batched gemm multiple D
Device implementation of explicit gemm for grouped conv bwd weight See merge request amd/ai/composable_kernel!52
…le V3 instances. CShuffleBlockTransferScalarPerVector adapted to 4, and merge groups fixed to 1 for now. No more special instance lists.
…duplications. Also removing stride1pad0 support for NHWGC since we can use explicit for those cases.
… layout / datatype support as before the instance selection process.
…ning. Keep generic instances for support.
…for f16 scale / bilinear.
… NHWGC. They are never faster and support is already carried by CShuffleV3 and Explicit.
…fwd declarations, cmakelists entries. Also merge the "wmma" and "wmma v3" instance list files, which are both v3.
…ve custom ckProfiler target.
…NHWGCxGKYXC and F16 or BF16 (no mixed in-out types).
…tance_selection WIP: Grouped convolution bwd weight wmma v3 instance selection
Pull request overview
Copilot reviewed 82 out of 83 changed files in this pull request and generated no new comments.
GridwiseGemm::template Run<HasMainKBlockLoop, EGlobalMemoryDataOperation, TailNum>(
-    p_shared, splitk_batch_offset, karg, epilogue_args, k_id);
+    p_shared, splitk_batch_offset, karg, epilogue_args, 0, k_id);
Please add a /* */ comment explaining what the 0 means.
//################################################| Spatial| | | | | | | | Operation| Operation| Operation| Specialization| | | | | | | | | | Lengths_AK0_M_AK1| ArrangeOrder| | | PerVector| PerVector_AK1| | Lengths_BK0_N_BK1| ArrangeOrder| | | PerVector| PerVector_BK1| | PerShuffle| PerShuffle| MBlock_MPerBlock| _NPerBlock| Sched| Ver| | |
//################################################| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | NBlock_NPerBlock| | | | | |
DeviceGroupedConvBwdWeightTwoStage_Wmma_CShuffleV3< NDimSpatial, ALayout, BLayout, ELayout, F16, F16, F16, F32, PassThrough, PassThrough, PassThrough, ConvSpec, 32, 16, 16, 32, 8, 16, 16, 1, 1, S<4, 8, 1>, S<2, 0, 1>, S<1, 0, 2>, 1, 1, 4, 0, S<4, 8, 1>, S<2, 0, 1>, S<1, 0, 2>, 1, 1, 4, 0, 1, 1, S<1, 4, 1, 8>, 1, Scheduler, PipelineVersion, 1>
// DeviceGroupedConvBwdWeightTwoStage_Wmma_CShuffleV3< NDimSpatial, ALayout, BLayout, ELayout, F16, F16, F16, F32, PassThrough, PassThrough, PassThrough, ConvSpec, 128, 128, 128, 32, 8, 16, 16, 8, 2, S<4, 32, 1>, S<2, 0, 1>, S<1, 0, 2>, 1, 4, 8, 0, S<4, 32, 1>, S<2, 0, 1>, S<1, 0, 2>, 1, 4, 8, 0, 1, 1, S<1, 16, 1, 8>, 4, Scheduler, PipelineVersion, 1>,
Why are most of them disabled?
This is also described in the instance selection document I sent, but basically: for NHWGCxGKCYX we also have explicit GEMM, and perf tests showed that the TwoStage implementation was never faster or needed for support on this layout. Therefore we only use a single generic instance for now. TwoStage becomes more relevant for other layouts, but we are not adding those right now.
Ok, I understand. For the future, it would be better to add a comment in the code, since no one reading this public repo will see documents shared in a private chat.
This reverts commit 87dd073. Note: merge conflicts were resolved on a best-effort basis.
Proposed changes
Summary:
Adds `DeviceGroupedConvBwdWeight_Wmma_CShuffleV3`, `DeviceGroupedConvBwdWeightTwoStage_Wmma_CShuffleV3`, and `DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3`. The implementations are based on CShuffleV3, but the functionality is the same as the xdl versions.
Checklist
Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.
- [ ] clang-format on all changed files

Discussion
If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered