conv3d w/out 4GB limitation #60
Conversation
Need a ctest like this #58

@asroy Should we keep the conv3d separate from conv2d?
…ct to 2G limitation
…conv3d; formatting and cleanup
@asroy I've resolved the conflicts. The performance dropped a lot though.

Plan to review code with @asroy tomorrow

@asroy The performance degradation was caused by using an if-statement in the dynamic buffer to avoid invalid reads, so I reverted to the current state; the performance is now about 99% of conv2d when the same problem is solved. The branches in the ISA are caused by

Here's the buffer_load issue https://ontrack-internal.amd.com/browse/SWDEV-319513
DoMagicDivision(int32_t dividend_i32, uint32_t multiplier, uint32_t shift)
{
    uint32_t dividend_u32 = bit_cast<uint32_t>(dividend_i32);
    uint32_t tmp = static_cast<unsigned long long>(dividend_u32) * multiplier >> 32;
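For context, the multiply-shift ("magic number") division that this snippet relies on can be sketched as follows. This is my own illustration of the technique, not CK's API: `make_magic`/`magic_divide` and the 31-bit dividend restriction are assumptions chosen so a 64-bit product cannot overflow.

```cpp
#include <cassert>
#include <cstdint>

struct MagicDiv
{
    uint64_t multiplier;
    uint32_t shift;
};

// Granlund-Montgomery style magic numbers for 31-bit dividends:
// pick s = ceil(log2(d)) and m = floor(2^(31+s) / d) + 1, then
// n / d == (n * m) >> (31 + s) for every 0 <= n < 2^31.
MagicDiv make_magic(uint32_t divisor)
{
    uint32_t s = 0;
    while((uint64_t{1} << s) < divisor)
        ++s; // s = ceil(log2(divisor))
    uint64_t m = ((uint64_t{1} << (31 + s)) / divisor) + 1;
    return {m, s};
}

uint32_t magic_divide(uint32_t dividend, MagicDiv md)
{
    // dividend < 2^31 and multiplier <= 2^32, so the product fits in 64 bits
    return static_cast<uint32_t>((dividend * md.multiplier) >> (31 + md.shift));
}
```

Precomputing the pair once per divisor turns every division in the index math into a multiply and a shift, which is the whole point on GPUs where integer division is expensive.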
    typename ConvDilations,
    typename InLeftPads,
    typename InRightPads>
void host_conv3d_ndhwc_kzyxc_ndhwk(const Tensor<TIn>& in,
This needs to be replaced by ReferenceConvFwd
https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/3dbb8b4078e5f3bdf5e49aa7549a9c069cd14502/reference_operation/include/reference_conv_fwd.hpp#L38
once branch aosewski/conv_nd is merged
It looks weird to me that host::ReferenceConvFwd inherits from device::BaseOperator.
Yes, host::ReferenceConvFwd is an afterthought. We need to re-organize their dependency. You can create an issue and assign it to me.
@@ -0,0 +1,106 @@
#ifndef NAIVE_CONV_FWD_HPP
This needs to be wrapped inside a Device operator class.
Could you elaborate on this?
You can write a class DeviceConvolutionNaive, which will call this kernel underneath.
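To illustrate the suggestion, here is a rough sketch of the shape such a wrapper could take, loosely following the Argument/Invoker convention of CK device operators. Everything besides the name DeviceConvolutionNaive is hypothetical, and the host stand-in below replaces the real HIP kernel launch so the sketch stays self-contained.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch only: the real class would launch naive_conv3d_fwd via
// launch_and_time_kernel(); member and parameter names are invented.
struct DeviceConvolutionNaive
{
    struct Argument
    {
        const float* p_in;
        const float* p_wei;
        float* p_out;
        int64_t out_element_count;
    };

    struct Invoker
    {
        // returns elapsed time in ms, mirroring launch_and_time_kernel()
        float Run(const Argument& arg) const
        {
            // placeholder for the GPU launch of naive_conv3d_fwd; the host
            // stand-in just zero-fills the output so the wrapper is exercisable
            for(int64_t i = 0; i < arg.out_element_count; ++i)
                arg.p_out[i] = 0.f;
            return 0.f;
        }
    };

    static Argument
    MakeArgument(const float* p_in, const float* p_wei, float* p_out, int64_t out_element_count)
    {
        return {p_in, p_wei, p_out, out_element_count};
    }

    static Invoker MakeInvoker() { return {}; }
};
```

Wrapping the raw kernel this way gives it the same MakeArgument/MakeInvoker surface as the other device operators, so callers don't need to know the launch details.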
    index_t GemmK1Value>
__host__ __device__ constexpr auto
transform_forward_convolution3d_into_gemm_v4r4r4_nhwc_kyxc_nhwk_pad_split_batch(
    // const TensorDescriptor<In...>& in_grid_desc_n_di_hi_wi_c,

    make_pass_through_transform(N),
    make_pass_through_transform(K1)),
    make_tuple(Sequence<0>{}, Sequence<1>{}, Sequence<2>{}, Sequence<3>{}),
    make_tuple(Sequence<0>{}, Sequence<1>{}, Sequence<2>{}));

    make_pass_through_transform(M),
    make_pass_through_transform(N)),
    make_tuple(Sequence<0>{}, Sequence<1>{}, Sequence<2>{}),
    make_tuple(Sequence<0>{}, Sequence<1>{}));

    typename CThreadTransferSrcDstAccessOrder,
    index_t CThreadTransferSrcDstVectorDim,
    index_t CThreadTransferDstScalarPerVector>
struct GridwiseBatchedGemm_bk0mk1_k0nk1_bmn_xdlops_v2r3
Inconsistent file name and struct name: v2r3 vs v2r3r3.
}

__host__ __device__ static constexpr auto
MakeAGridDescriptor_K0_M_K1(const AGridDesc_B_K0_M_K1& a_grid_desc_b_k0_m_k1, const int bb)
int and long long are not allowed in index calculation.
Please use index_t and long_index_t only.
template <long_index_t N>
using LongNumber = integral_constant<long_index_t, N>;
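To illustrate the reviewer's point, here is a minimal sketch showing why index math has to widen to the 64-bit type before multiplying. The typedef names follow the diff; flatten_offset and its strides are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// typedefs in the spirit of ck's: 32-bit index_t, 64-bit long_index_t
using index_t      = int32_t;
using long_index_t = int64_t;

// Hypothetical offset helper: cast to long_index_t BEFORE the multiply,
// otherwise n * stride_n is evaluated in 32 bits and overflows for
// tensors larger than 2^31 elements.
long_index_t flatten_offset(index_t n, index_t c, index_t stride_n)
{
    return static_cast<long_index_t>(n) * stride_n + c;
}
```

Sticking to index_t/long_index_t (rather than raw int/long long) also keeps the widening points explicit and greppable.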
template <typename Index0,
template <typename Index0,
Index0 X,
typename Index1,
Index1 Y>
__host__ __device__ constexpr auto operator+(integral_constant<Index0, X>,
integral_constant<Index1, Y>)
{
constexpr auto Z = X + Y;
return integral_constant<decltype(Z), Z>{};
}
That's indeed better. I moved the operators into integral_constant.hpp, leaving number.hpp almost empty. Should we also move the using Number alias along with them?
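For context, a self-contained sketch of the mixed-width addition the operator above enables. The integral_constant here is a simplified stand-in for CK's, not the real header, and the result type is spelled decltype(X + Y) so the usual arithmetic promotion decides it.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

using index_t      = int32_t;
using long_index_t = int64_t;

// simplified stand-in for ck's integral_constant
template <typename T, T v>
struct integral_constant
{
    static constexpr T value = v;
    using value_type         = T;
};

template <index_t N>
using Number = integral_constant<index_t, N>;

template <long_index_t N>
using LongNumber = integral_constant<long_index_t, N>;

// same idea as the operator in the thread: adding a Number and a LongNumber
// promotes to the wider type, so the sum is a LongNumber
template <typename Index0, Index0 X, typename Index1, Index1 Y>
constexpr auto operator+(integral_constant<Index0, X>, integral_constant<Index1, Y>)
{
    return integral_constant<decltype(X + Y), X + Y>{};
}
```

Keeping such operators next to integral_constant (as done in this PR) means Number and LongNumber arithmetic both pick them up without extra includes.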
float ave_time = launch_and_time_kernel(naive_conv3d_fwd,
                                        nrepeat,
                                        dim3(256),
Any reason to use a constant grid size of 256?
No specific reason. The implementation doesn't rely on block size or grid size, so I hard-coded them.
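If a derived grid size were ever wanted instead, a minimal sketch could look like this. The helper name, the one-thread-per-output-element mapping, and the 4096-block cap are all my own illustrative choices, not CK's.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Sketch: derive the grid size from the amount of work instead of a
// hard-coded dim3(256); block_size and max_blocks are illustrative values.
constexpr uint32_t block_size = 256;

uint32_t calc_grid_size(uint64_t total_work_items, uint32_t max_blocks = 4096)
{
    // one thread per work item, rounded up to whole blocks
    uint64_t blocks = (total_work_items + block_size - 1) / block_size;
    blocks          = std::max<uint64_t>(blocks, 1); // at least one block
    return static_cast<uint32_t>(std::min<uint64_t>(blocks, max_blocks));
}
```

Since the naive kernel is grid-stride over the work anyway, a hard-coded size is functionally fine; a derived size only matters for occupancy on small problems.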
in_left_pads[2]  = std::stoi(argv[21]);
in_right_pads[0] = std::stoi(argv[22]);
in_right_pads[1] = std::stoi(argv[23]);
in_right_pads[2] = std::stoi(argv[24]);
You can use getopt_long() to help parse and validate the 25 arguments.
Good to know. But since the other examples don't use it, I'll leave it as is. We can change them all together in the future if needed.
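For reference, a minimal getopt_long() sketch of the suggestion. The option names here are invented for illustration; the example currently takes 25 positional arguments, and mapping all of them would just extend the table.

```cpp
#include <cassert>
#include <cstdlib>
#include <getopt.h> // getopt_long is a GNU/POSIX.2 extension, available on Linux

// Hypothetical parser for two of the convolution parameters; returns 0 on
// success, -1 on an unrecognized option.
int parse_conv_args(int argc, char** argv, int* n, int* k)
{
    static const option long_opts[] = {{"batch", required_argument, nullptr, 'n'},
                                       {"out-channels", required_argument, nullptr, 'k'},
                                       {nullptr, 0, nullptr, 0}};
    optind = 1; // reset global getopt state so the parser is re-entrant
    int opt;
    while((opt = getopt_long(argc, argv, "n:k:", long_opts, nullptr)) != -1)
    {
        switch(opt)
        {
        case 'n': *n = std::atoi(optarg); break;
        case 'k': *k = std::atoi(optarg); break;
        default: return -1;
        }
    }
    return 0;
}
```

Named options would make the 25-argument command lines self-describing (--batch 4 instead of a bare 4 in position 2), at the cost of diverging from the other examples.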
See #94 instead.
3D convolution:

- Number64.
- The performance of conv3D is about 99% of conv2D if the same problem is solved.
- c_thread_buf initialization to GridwiseGemm_konk1_mn_xdlops_v2r3.
- DoMagicDivision, which was missing.
- DynamicBuffer::Run when CK_USE_AMD_BUFFER_ADDRESSING=1.
- buffer_load (amd_buffer_load_invalid_element_return_zero, to be precise) is problematic in 3D convolution.
- BlockwiseGemmXdlops_k0mk1_k0nk1_m0n0m1n1m2m3m4n2_v1::MakeCGridDescriptor_M0_N0_M1_N1_M2_M3_M4_N2