
[Perf][Bwd-weights]Lds re-layout to avoid ds read/write bank conflict and balance ds ops with address calculations #190

Merged
asroy merged 25 commits into develop from wrw_conv_impr on May 20, 2022

Conversation

@shaojiewang
Contributor

  1. Re-layout LDS for both the output (gradient) and input (activation) tensors.
  2. Find a way to balance ds ops with address calculations.

@rosenrodt
Contributor

Do I understand it correctly? In this PR, backward data adopts a K0_MN_4 layout for the underlying FP16 NT gridwise GEMM, with 4 extra elements of LDS padding for every 128 bytes? I am curious about the perf difference versus the similar approach in #98, which uses a K0_MN_2 layout and no extra LDS padding.

@shaojiewang
Contributor Author

> Do I understand it correctly? In this PR, backward data adopts a K0_MN_4 layout for the underlying FP16 NT gridwise GEMM, with 4 extra elements of LDS padding for every 128 bytes? I am curious about the perf difference versus the similar approach in #98, which uses a K0_MN_2 layout and no extra LDS padding.

Not totally. This PR is specifically for bwd-weights and adopts k0_mn_8, with an extra 8 bytes of padding per every 128 bytes. It is similar to NT GEMM. With a shorter K1Value, the compiler needs more ds reads because ds_read2_b32 is used. I'm working on reproducing the approach from #98 in convolution.
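To make the padding scheme concrete, here is a small host-side sketch (my own illustration, not CK code; it assumes the 32-bank, 4-byte-per-bank LDS of gfx9-class GPUs, and the constants `kBankRowBytes`/`kPadBytes` and helper names are hypothetical) of how 8 bytes of padding per 128-byte row changes the bank mapping:

```cpp
#include <cstddef>

// Hypothetical illustration (not CK code) of the padding scheme:
// gfx9-class LDS has 32 banks x 4 bytes = 128 bytes per bank row,
// and this PR inserts 8 extra bytes after every 128-byte chunk.
constexpr std::size_t kBankRowBytes = 128; // 32 banks * 4 bytes each
constexpr std::size_t kPadBytes     = 8;   // extra bytes per row

// Map a logical byte offset to its padded LDS byte offset.
constexpr std::size_t padded_offset(std::size_t logical)
{
    return logical + (logical / kBankRowBytes) * kPadBytes;
}

// Bank index (4-byte banks, 32 of them) of a physical byte offset.
constexpr std::size_t bank_of(std::size_t physical)
{
    return (physical / 4) % 32;
}
```

Without padding, byte offsets 0, 128, 256, ... all fall in bank 0, so threads striding by a whole row serialize on the same bank; with the 8-byte pad, each successive row is shifted by two banks, which is the conflict-avoidance effect described above.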

@shaojiewang shaojiewang marked this pull request as ready for review April 26, 2022 08:19
@shaojiewang shaojiewang requested a review from asroy April 26, 2022 08:21
@shaojiewang shaojiewang changed the title [WIP][Perf][Bwd-weights]Lds re-layout to avoid ds read/write bank conflict and balance ds ops with address calculations [Perf][Bwd-weights]Lds re-layout to avoid ds read/write bank conflict and balance ds ops with address calculations Apr 26, 2022

@shaojiewang
Contributor Author

CI has passed with ROCm 5.1.

Comment thread include/ck/tensor_description/merge_transform_for_wrw.hpp Outdated
Comment thread include/ck/tensor_description/merge_transform_for_wrw.hpp Outdated
Comment thread include/ck/tensor_description/merge_transform_for_wrw.hpp Outdated
Comment thread include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdlops_v2r4r2.hpp Outdated
@asroy
Contributor

asroy commented May 4, 2022

PR #210 will use the regular gridwise GEMM to do batched GEMM and split-K GEMM.

I think you can use the same idea in this PR. You can refactor gridwise GEMM v2r4r2 so that it looks like a regular GEMM without split-K. Also, after #210 is merged, only conv-bwd-weight will use gridwise GEMM v2r4r2, so you can refactor it without worrying about breaking other code.

@shaojiewang
Contributor Author

> PR #210 will use the regular gridwise GEMM to do batched GEMM and split-K GEMM.
>
> I think you can use the same idea in this PR. You can refactor gridwise GEMM v2r4r2 so that it looks like a regular GEMM without split-K. Also, after #210 is merged, only conv-bwd-weight will use gridwise GEMM v2r4r2, so you can refactor it without worrying about breaking other code.

Hi @asroy, I do not fully understand this comment. Do you mean that I should use an instance of GridwiseGemmPipeline_v1 instead of implementing a Run function to do the pipeline inside GridwiseGemm_bk0mk1_bk0nk1_mn_xdlops_v2r4r2?

…s tensor params a struct templete. 3. remove useless code
@asroy
Contributor

asroy commented May 5, 2022

> Hi @asroy, I do not fully understand this comment. Do you mean that I should use an instance of GridwiseGemmPipeline_v1 instead of implementing a Run function to do the pipeline inside GridwiseGemm_bk0mk1_bk0nk1_mn_xdlops_v2r4r2?

I mean that currently GridwiseGemm_bk0mk1_bk0nk1_mn_xdlops_v2r4r2 is dedicated to batched GEMM. You can refactor it and remove the batch dimension so it becomes a regular GEMM. You can use the same trick as in PR #210.

Doing that allows us to use a single implementation of gridwise GEMM for both regular and batched GEMM.
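As a rough host-side sketch of that unification idea (my own illustration with hypothetical helper names, not the actual CK device code): a single batched implementation can serve the regular case as the `batch == 1` special case, so only one loop nest needs maintaining.

```cpp
#include <vector>

// Batched GEMM over row-major buffers laid out as batch x M x K,
// batch x K x N, batch x M x N. (Illustrative only, not CK code.)
void batched_gemm(const std::vector<float>& a, const std::vector<float>& b,
                  std::vector<float>& c, int batch, int M, int N, int K)
{
    for(int g = 0; g < batch; ++g)
        for(int m = 0; m < M; ++m)
            for(int n = 0; n < N; ++n)
            {
                float acc = 0.f;
                for(int k = 0; k < K; ++k)
                    acc += a[(g * M + m) * K + k] * b[(g * K + k) * N + n];
                c[(g * M + m) * N + n] = acc;
            }
}

// Regular GEMM is just the batch == 1 special case of the above.
void gemm(const std::vector<float>& a, const std::vector<float>& b,
          std::vector<float>& c, int M, int N, int K)
{
    batched_gemm(a, b, c, /*batch=*/1, M, N, K);
}
```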

@asroy
Contributor

asroy commented May 5, 2022

@shaojiewang Please ignore my previous comment about unifying normal GEMM and batched GEMM in GridwiseGemm_bk0mk1_bk0nk1_mn_xdlops_v2r4r2. The unification in PR #210 cannot be applied to convolution. We need to figure out another way to unify them in the future.

@shaojiewang
Contributor Author

> @shaojiewang Please ignore my previous comment about unifying normal GEMM and batched GEMM in GridwiseGemm_bk0mk1_bk0nk1_mn_xdlops_v2r4r2. The unification in PR #210 cannot be applied to convolution. We need to figure out another way to unify them in the future.

Yes, thanks for the explanation. I agree. I will find a way to unify them.

@zjing14 zjing14 requested a review from asroy May 17, 2022 18:56
Contributor

@asroy asroy left a comment


[Future] Please put the GEMM pipeline in a gridwise pipeline class; you can reuse an existing one, or write a new one if needed.

Comment on lines +774 to +818
```cpp
// preload data into LDS
{
    a_blockwise_copy.RunRead(a_b_k0_m_k1_grid_desc, a_grid_buf);
    b_blockwise_copy.RunRead(b_b_k0_n_k1_grid_desc, b_grid_buf);

    a_blockwise_copy.RunWrite(a_b_k0_m_k1_block_desc, a_block_buf);
    b_blockwise_copy.RunWrite(b_b_k0_n_k1_block_desc, b_block_buf);
}

// Initialize C
c_thread_buf.Clear();

// main body
if constexpr(HasMainKBlockLoop)
{
    index_t k0_block_data_begin = 0;

    do
    {
        a_blockwise_copy.MoveSrcSliceWindow(a_b_k0_m_k1_grid_desc, a_block_slice_copy_step);
        b_blockwise_copy.MoveSrcSliceWindow(b_b_k0_n_k1_grid_desc, b_block_slice_copy_step);

        a_blockwise_copy.RunRead(a_b_k0_m_k1_grid_desc, a_grid_buf);

        block_sync_lds();

        b_blockwise_copy.RunRead(b_b_k0_n_k1_grid_desc, b_grid_buf);

        blockwise_gemm.Run(a_block_buf, b_block_buf, c_thread_buf);

        block_sync_lds();

        a_blockwise_copy.RunWrite(a_b_k0_m_k1_block_desc, a_block_buf);
        b_blockwise_copy.RunWrite(b_b_k0_n_k1_block_desc, b_block_buf);

        k0_block_data_begin += K0PerBlock;
    } while(k0_block_data_begin < (K0 - K0PerBlock));
}

// tail
{
    block_sync_lds();

    blockwise_gemm.Run(a_block_buf, b_block_buf, c_thread_buf);
}
```

This could be put into a gridwise pipeline
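A rough sketch of what that factoring could look like (hypothetical names, not the actual CK `GridwiseGemmPipeline_v1` interface): the gridwise GEMM supplies the copy and math steps as callables, and the pipeline class owns the preload / main-loop / tail control flow shown in the snippet above.

```cpp
// Hypothetical pipeline class (not CK's real interface): it captures
// the preload -> main-loop -> tail structure once, so each gridwise
// GEMM only provides the per-step callables instead of re-writing
// the loop and the synchronization points.
struct GridwiseGemmPipelineSketch
{
    template <typename Preload, typename LoadNext, typename Math, typename Sync>
    static void Run(int num_k_loops, Preload preload, LoadNext load_next,
                    Math math, Sync sync)
    {
        preload(); // global -> LDS copy of the first K-tile

        for(int i = 0; i < num_k_loops - 1; ++i)
        {
            load_next(); // issue global reads for tile i+1
            sync();      // make LDS writes of tile i visible
            math();      // GEMM on tile i while tile i+1 is in flight
            sync();      // safe to overwrite LDS with tile i+1
        }

        sync();
        math(); // tail: GEMM on the last tile
    }
};
```

With `num_k_loops` tiles, `math` runs once per tile and `load_next` once per tile after the first, mirroring the double-buffered overlap in the snippet above.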

@asroy asroy merged commit b9b9c3b into develop May 20, 2022
@junliume junliume deleted the wrw_conv_impr branch October 21, 2023 06:10