
[SWDEV-281541][MSRCHA-100] Implementation of Dynamic Generic Reduction #1108

Merged (61 commits) on Aug 27, 2021

Conversation

@qianfengz (Contributor) commented Aug 22, 2021:

This PR satisfies the request SWDEV-281541.

For generic reduction (miopenReduceTensor), dynamic means the input tensor specifics (lengths and strides of all dimensions) are passed to the kernels as runtime parameters. In comparison, the generic reduction implementation before this PR is called static: the input tensor specifics are passed to the kernels as compile-time constants, so every change in tensor specifics produces a different kernel binary and requires a compilation pass to generate it. Dynamic generic reduction is therefore expected to improve the performance of MIOpen applications in which the input tensor specifics vary frequently.
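The distinction can be sketched in host-side C++ (illustrative only; the PR's real kernels take the lengths and strides of all dimensions, not a single length):

```cpp
#include <cstddef>

// Static flavor: the length is a template (compile-time) constant, so each
// distinct shape instantiates, and must compile, a separate function
// (a separate "kernel binary").
template <std::size_t Len>
long reduce_static(const int* p)
{
    long acc = 0;
    for(std::size_t i = 0; i < Len; ++i)
        acc += p[i];
    return acc;
}

// Dynamic flavor: the length is a runtime argument, so one compiled function
// serves every shape without recompilation.
long reduce_dynamic(const int* p, std::size_t len)
{
    long acc = 0;
    for(std::size_t i = 0; i < len; ++i)
        acc += p[i];
    return acc;
}
```

The static flavor lets the compiler fully unroll and specialize, which is why it can be faster per call but costly when shapes change often.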

To test
#>bin/test_reduce_test --all
#>bin/test_reduce_test --all --half
#>bin/test_reduce_test --all --double

To test and use static generic reduction, set the environment variable MIOPEN_DISABLE_DYNAMIC_REDUCTION to disable dynamic generic reduction:
#>MIOPEN_DISABLE_DYNAMIC_REDUCTION=1 bin/test_reduce_test --all
#>MIOPEN_DISABLE_DYNAMIC_REDUCTION=1 bin/test_reduce_test --all --half
#>MIOPEN_DISABLE_DYNAMIC_REDUCTION=1 bin/test_reduce_test --all --double

Performance Data (comparing kernel execution times between dynamic and static reduction)
Reduction Perf

Chao Liu added 30 commits July 30, 2021 17:31
git-subtree-dir: src/composable_kernel
git-subtree-split: f6edda6119ebbb237dfa6270797b34f960d7b190
5781adf5c Update develop (#5) (#6)
97e6d514f Merge pull request #4 from ROCmSoftwarePlatform/separate_online_compile
7b1ec41e5 refactor
49c33aaea refactor
54b3e73d1 rename

git-subtree-dir: src/composable_kernel
git-subtree-split: 5781adf5cf4ac753e2e36da7385791775b744bf7

@junliume junliume changed the title [SWDEV-281541] Implementation of Dynamic Generic Reduction [SWDEV-281541][MSRCHA-100] Implementation of Dynamic Generic Reduction Aug 27, 2021
@junliume junliume merged commit c07307c into develop Aug 27, 2021
@junliume junliume deleted the dynamic_reduction_pr branch August 27, 2021 01:05
make_pad_transform(toReduceLen, 0, srcPad2)),
make_tuple(Sequence<0>{}, Sequence<1>{}),
make_tuple(Sequence<0>{}, Sequence<1>{}));
if(hipThreadIdx_x == 0)
Contributor:

Please replace all occurrences of hipThreadIdx_x in this PR with get_thread_local_1d_id().

Contributor Author:

Yes


std::string algo_name = "dynamic_generic_reduction";

std::string param = " -std=c++17 ";
Contributor:

Not needed; get_ck_common_compiler_flag already contains " -std=c++17 ".

Contributor Author:

Yes, just found that

return (outs.str());
};

static std::string get_definition_string_from_type_enums(miopenDataType_t TSrc,
Contributor:

Is MIOpen's miopenDataType_t consistent with DataTypeEnum_t? If not, we need a converter here

Contributor Author:

With static reduction, I assume they could be inconsistent. But for dynamic reduction, I assume DataTypeEnum_t is just a kernel-layer duplication of miopenDataType_t.

Contributor Author:

One more point: if we don't assume miopenDataType_t is the same as DataTypeEnum_t, then DataTypeEnum_t becomes useless in the kernel layer. If we convert miopenDataType_t to some invariant form (such as the character 'D' for double) and pass that to the kernel, we don't need DataTypeEnum_t, since we can get the types directly from the characters.

Contributor (@atamazov), Aug 27, 2021:

> But for dynamic reduction, I assume DataTypeEnum_t is just a kernel-layer duplication of miopenDataType_t.

🔴 So for dynamic reduction we must programmatically guarantee that both enums are consistent. Example:
https://github.com/ROCmSoftwarePlatform/MIOpen/blob/adc2035614310fb55ec9400c2a16f94b33a1d896/src/find_controls.cpp#L207-L224

Contributor (@asroy), Aug 27, 2021:

It's dangerous to assume the values of two enum types are consistent; we need to assume they will be different.

We need a converter that converts miopen::miopenDataType_t into ck::DataTypeEnum_t, without any assumption about their values.
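A minimal sketch of such a converter (the enumerator names below are illustrative stand-ins, not the actual MIOpen/CK definitions; the point is that an explicit switch stays correct regardless of the numeric values either enum uses):

```cpp
#include <stdexcept>

// Illustrative stand-ins for the two enums; the real definitions live in the
// MIOpen and composable_kernel headers, with their own enumerators.
enum miopenDataType_t_demo { demoHalf, demoFloat, demoDouble };
enum class DataTypeEnum_t_demo { Half = 10, Float = 11, Double = 12 };

// Explicit, exhaustive mapping: no assumption that numeric values match.
DataTypeEnum_t_demo mapDataType(miopenDataType_t_demo t)
{
    switch(t)
    {
    case demoHalf: return DataTypeEnum_t_demo::Half;
    case demoFloat: return DataTypeEnum_t_demo::Float;
    case demoDouble: return DataTypeEnum_t_demo::Double;
    }
    throw std::runtime_error("unsupported data type");
}
```

The deliberately different numeric values on the right-hand side show that the mapping keeps working even when the enums disagree numerically.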

Contributor Author (@qianfengz):

I will use the same converter used by static reduction to convert from miopen::miopenDataType_t to the data types used by the dynamic reduction kernel. In that case, the dynamic reduction kernel does not need ck::DataTypeEnum_t.

Contributor (@asroy), Aug 31, 2021:

Please write a converter between miopen::miopenDataType_t and ck::DataTypeEnum_t for dynamic kernel.

Contributor (@asroy), Aug 31, 2021:

Please keep these in mind:

  1. Design all the host logic with dynamic kernel in mind, NOT static kernel
  2. If the logic of static kernel and dynamic kernel are different, DO NOT mix the logic in the same function, put them in different functions.
  3. If static kernel can reuse the host logic designed for dynamic kernel, it's OK to reuse it. If not, write a separate logic in separate function

The reason is that we want to keep iterating on the implementation of the dynamic kernels (both kernels and solvers) and then fully retire the static kernel, so we don't want to mix their logic together. Everything about the refactor should make their logic more separate from each other, not more integrated.

Contributor Author:

Should we use ck::DataTypeEnum_t? I see no reason to, since we can pass the types using a consistent representation, like 'D' for double and 'F' for float, and convert that representation to types directly. We don't operate on the type enum in the kernel, so it is not needed.
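The character-based scheme described here can be sketched as follows (names are illustrative; a real mapping would cover every supported type):

```cpp
#include <type_traits>

// Each type is encoded as a single character in the kernel's compile
// definitions ('F' float, 'D' double, ...); the kernel recovers the C++
// type via template specialization, with no enum needed at all.
template <char Ch> struct TypeFromChar; // primary template left undefined
template <> struct TypeFromChar<'F'> { using type = float; };
template <> struct TypeFromChar<'D'> { using type = double; };

static_assert(std::is_same<TypeFromChar<'D'>::type, double>::value,
              "'D' must decode to double");
```

An unsupported character fails at compile time (the primary template is undefined), which is arguably safer than a runtime enum mismatch.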

src/reducetensor.cpp (thread resolved)
ReductionMethod_t GetReductionMethod_2(std::size_t invariantLength,
std::size_t toReduceLength) const
{
(void)invariantLength;
Contributor:

Why is this necessary?

Contributor Author:

Yes, this argument can be removed

Contributor Author:

It should be removed; only the last argument is needed.

Contributor:

Resolved.

src/reducetensor.cpp (further threads resolved)
// synchronize among all threads in this warp
__all(1);

for(index_t stride = warpSize / 2; stride > 0; stride /= 2)
Contributor:

[Performance] Is it worth using >> 1 instead of dividing by 2?

Contributor Author:

Should be, good!
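The halving-stride pattern under discussion, written with the shift (a host-side sketch of the in-warp tree reduction; `n` is assumed to be a power of two):

```cpp
#include <cstddef>

// Tree reduction: at each step, elements `stride` apart are combined, and
// the stride halves via `>>= 1` (equivalent to / 2 for unsigned values).
int tree_reduce_sum(int* buf, std::size_t n)
{
    for(std::size_t stride = n / 2; stride > 0; stride >>= 1)
        for(std::size_t i = 0; i < stride; ++i)
            buf[i] += buf[i + stride]; // pairwise combine, like warp lanes
    return buf[0];
}
```

For unsigned operands a modern compiler typically emits identical code for `/ 2` and `>> 1`, so this is largely a readability and intent choice.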


__syncthreads();

for(index_t stride = warpSize / 2; stride > 0; stride /= 2)
Contributor:

Ditto

src/reducetensor.cpp (thread resolved)

param += " -DCK_PARAM_REDUCE_OP=" + std::to_string(detail::GetReduceTensorOpId(reduceOp));
param += " -DCK_PARAM_NAN_PROPAGATE=" +
std::to_string(nanPropaOpt == MIOPEN_PROPAGATE_NAN ? 1 : 0);
Contributor:

[Recommendation] You can avoid conversion of integers to strings:

param += " -DCK_PARAM_NAN_PROPAGATE=" + (nanPropaOpt == MIOPEN_PROPAGATE_NAN ? "1" : "0");

Contributor:

[Notice] I am wondering how long we'll use string manipulations instead of KernelBuildParameters class...

Contributor (@asroy), Aug 27, 2021:

@atamazov Could you point to an example of using the KernelBuildParameters class?

Contributor (@atamazov), Aug 27, 2021:

Contributor Author:

> [Recommendation] You can avoid conversion of integers to strings:
>
> param += " -DCK_PARAM_NAN_PROPAGATE=" + (nanPropaOpt == MIOPEN_PROPAGATE_NAN ? "1" : "0");

Yes, good.

Contributor Author:

> [Recommendation] You can avoid conversion of integers to strings:
>
> param += " -DCK_PARAM_NAN_PROPAGATE=" + (nanPropaOpt == MIOPEN_PROPAGATE_NAN ? "1" : "0");

Yes, good.

I found this causes a compiler issue; I could only make it work by splitting it into two lines:
param += " -DCK_PARAM_NAN_PROPAGATE=";
param += (nanPropaOpt == MIOPEN_PROPAGATE_NAN ? "1" : "0");
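The compiler issue is expected: in the suggested one-liner, both operands of `+` are raw `const char*` (the string literal decays to a pointer), and adding two pointers is ill-formed. The two-line `+=` form works because `std::string::operator+=` accepts a `const char*`. A sketch, modeling the option as a bool:

```cpp
#include <string>

// Hypothetical helper; `propagate_nan` stands in for the
// `nanPropaOpt == MIOPEN_PROPAGATE_NAN` check in the PR.
std::string nan_flag(bool propagate_nan)
{
    std::string param;
    // param += " -DCK_PARAM_NAN_PROPAGATE=" + (propagate_nan ? "1" : "0");
    //   ^ would not compile: const char* + const char* has no operator+.
    param += " -DCK_PARAM_NAN_PROPAGATE=";
    param += propagate_nan ? "1" : "0";
    return param;
}
```

Wrapping either operand in `std::string(...)` would also compile, at the cost of one extra temporary.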

Contributor:

Resolved.

[Performance][Quality] @DrizztDoUrden @junliume @qianfengz String manipulations increase technical debt and affect host-side performance a bit. From now on we must block new Solvers that use string manipulations instead of KernelBuildParameters.

make_pad_transform(toReduceLen, 0, srcPad)),
make_tuple(Sequence<0>{}, Sequence<1>{}),
make_tuple(Sequence<0>{}, Sequence<1>{}));
if(hipThreadIdx_x == 0)
Contributor:

if (get_block_1d_id() == 0 && get_thread_local_1d_id() == 0)

Contributor Author:

Agree

Contributor (@atamazov):

@asroy Umbrella ticket #1120 updated with your review.

void* __restrict__ ws_global)
{
(void)GridSize;
(void)BlkGroupSize;
Contributor (@asroy), Aug 27, 2021:

remove unused argument, please

Contributor Author:

This would require many changes on the host side, since currently the host side can use the same kernel-launching code for different kernels. I will only consider doing this after splitting the host code into separate files for static and dynamic reduction.

make_pad_transform(toReduceLen, 0, srcPad)),
make_tuple(Sequence<0>{}, Sequence<1>{}),
make_tuple(Sequence<0>{}, Sequence<1>{}));
if(hipThreadIdx_x == 0)
Contributor:

if(get_block_1d_id() == 0 && get_thread_local_1d_id() == 0)

Contributor Author:

Agree

void* __restrict__ indices_global)
{
(void)BlkGroupSize;
(void)ws_buf2_bytes_offset;
Contributor:

Remove the unused arguments.


if(hipThreadIdx_x == 0)
*static_cast<decltype(dst1dDesc)*>(p_dst1dDesc) = dst1dDesc;
};
Contributor (@asroy), Aug 27, 2021:

Put src2dDesc and dst1dDesc in a tuple, so they are packed together in memory and need only a single pointer:

const auto desc_tuple = make_tuple(src2dDesc, dst1dDesc);

*static_cast<decltype(desc_tuple)*>(p_desc_tuple) = desc_tuple;

Contributor Author:

This brings little benefit, because we don't have to use a single pointer for the two descriptors. Doing it this way could make the tuple of reference descriptors complicated, due to the various combinations of src2d and dst1d descriptor padding.

Contributor:

The benefit is that src2dDesc and dst1dDesc will likely be packed in the same cache line. Please change that.
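The packing being requested can be sketched like this (the descriptor structs are illustrative stand-ins for the CK tensor descriptors):

```cpp
#include <tuple>

// Stand-ins for the real tensor descriptors.
struct Src2dDescDemo { int len0, len1; };
struct Dst1dDescDemo { int len0; };

using DescTuple = std::tuple<Src2dDescDemo, Dst1dDescDemo>;

// One pointer, one contiguous write: both descriptors land adjacent in the
// workspace, so a kernel reading them back likely touches one cache line.
void write_descs(void* ws, Src2dDescDemo s, Dst1dDescDemo d)
{
    *static_cast<DescTuple*>(ws) = std::make_tuple(s, d);
}
```

The kernel side would read the tuple back through a single pointer of the same type, instead of two separate descriptor pointers.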

Contributor Author:

I will measure the performance of combining the reads/writes of the two descriptors, even though it makes the code look less clean.

Contributor (@atamazov):

@asroy It is worth maintaining #1120, what do you think?

Contributor (@asroy) commented Aug 27, 2021:

> @asroy It is worth maintaining #1120, what do you think?

Yes, we can add a comment here and reference it in #1120.

Contributor (@atamazov):

@asroy

> Yes, we can add a comment here and reference it in #1120

Can you do that please? Or I can do that for you again, no problem.

const void* __restrict__ p_src_global,
float beta,
void* __restrict__ p_dst_global,
void* __restrict__ ws_global,
Contributor:

Where is the CONSTANT keyword for the tensor descriptor?
https://github.com/ROCmSoftwarePlatform/MIOpen/blob/1df9a07991727f1ac76d9507e963f4fe047eb4b8/src/composable_kernel/composable_kernel/src/kernel_wrapper/convolution_forward_implicit_gemm_v6r1_dlops_nchw_kcyx_nkhw.cpp#L240

I recall we have talked several times about the need to add the CONSTANT keyword for pointers to tensor descriptors. If you encountered issues when using the keyword, we should talk until the issue is resolved instead of silently dropping it.

Contributor Author:

No issue found, but I also found no benefit in using it. So here, do you think I should use CONSTANT with ws_global? The local p_src2d_descriptor and p_dst1d_descriptor point to some offset from ws_global.

{
using dataType = T;

__device__ static T GetZeroVal() { return std::numeric_limits<T>::max(); };
Contributor:

Contributor Author:

OK, I just noticed NumericLimits, thanks.

};

template <>
__device__ half_t Max<half_t>::GetZeroVal()
Contributor:

I think other developers would confuse "zero" here with "numerical zero".

Please change the function names to GetReductionZeroValue().

Contributor Author:

Agree
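For reference, the renamed "reduction zero" is the operator's identity element: for MIN it is the largest representable value, for MAX the lowest, and for ADD it is 0. A sketch (class and function names are stand-ins mirroring the suggested GetReductionZeroValue, not the PR's exact classes):

```cpp
#include <limits>

// Identity element ("reduction zero") for a MIN reduction: the largest
// representable value, so combining it with any input yields the input.
template <typename T>
struct MinOpDemo
{
    static T GetReductionZeroValue() { return std::numeric_limits<T>::max(); }
    static void Apply(T& a, T b) { if(b < a) a = b; }
};
```

Seeding the accumulator with this value lets the reduction loop treat every element uniformly, with no special case for the first one.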


__device__ inline constexpr void operator()(T& a, T b) const { a = a + b; }

static constexpr bool indexable = false;
Contributor:

indexable does not sound like a valid property of a reduction-op type.

Please remove the indexable reduction-op classes.

The host needs to decide whether an index is needed as output, and pass that info to the kernel as a compile-time parameter.

Contributor Author:

indexable is a property of the reduction operator: ADD is not indexable, MIN is indexable. But it is the host that determines whether to output indices for the reduction result.
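The position described here (indexable as a static trait of the op, with the host making the output decision) can be sketched as follows, with illustrative names:

```cpp
// `indexable` as a compile-time trait of each reduction op: ADD cannot
// meaningfully report "where" its result came from, MIN can.
struct AddDemo { static constexpr bool indexable = false; };
struct MinDemo { static constexpr bool indexable = true;  };

// Host-side decision: emit indices only if the op supports them AND the
// caller actually requested them.
template <typename Op>
constexpr bool need_indices(bool user_wants_indices)
{
    return Op::indexable && user_wants_indices;
}
```

The trait alone never forces index output; it only gates what the host is allowed to request.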

ltqin pushed a commit that referenced this pull request Oct 28, 2021
646fcc268 Merge pull request #47 from ROCmSoftwarePlatform/develop
6014185ac [Bug Fix] GridwiseGemm_bk0mk1_bk0nk1_mn_xdlops_v2r4 loop issue (#44)
3e9113707 Merge pull request #46 from ROCmSoftwarePlatform/miopen_downstream_all
211dae822 Merge branch 'develop' into miopen_downstream_all
5890e3007 [Composable Kernel] update develop branch code to ck_upstream
d5297abae fix bug in gridwise gemm xdlops v2r3 (#45)
38a90b6ed Merge pull request #43 from ROCmSoftwarePlatform/develop
c3018794b bug fix (#39)
fd49ff808 add nchw atomic , nhwc and nhwc atomic method   for backward weight (#30)
b2dc55f82 [MIOpen Downstream] Fix Reduction Kernel (#34)
b3e8d57d5 Tweak GEMM kernel (#38)
846f462bd Add VectorType support into StaticBuffer (#27)
dfb80c4e3 [Enhancements] Several bugfixes and refactoring of dynamic generic reduction  (#1156)
8557901d0 Merge pull request #1165 from ROCmSoftwarePlatform/develop
f305bebdc Merge pull request #31 from ROCmSoftwarePlatform/miopen_downstream-dynamic_reduction_pr
b725e3fc8 Merge remote-tracking branch 'origin/develop' into miopen_downstream-dynamic_reduction_pr
88833bd9a Merge pull request #32 from ROCmSoftwarePlatform/develop
df0d68106 :Merge remote-tracking branch 'origin/develop' into CK_upstream
f3acd2510 Add  a version of Merge transform that use integerdivision and mod (#25)
19613902b GEMM driver and kernel (#29)
627d8ef35 Backward weight v4r4r2 with xdlops (#18)
10bb81106 Misc fixes (#24)
9e80cdceb [SWDEV-281541][MSRCHA-100] Implementation of Dynamic Generic Reduction  (#1108)
a7a758d8c GlobalAtomicAdd for fp32/int32 (#23)
9d3f634a3 Xdlops refactor fix (#22)
c6f26bb48 magic division use __umulhi() (#19)
6fe3627a9 Composable kernel init integration v3 (#1097)
a2ad6d353 refactor dynamic xdlops iGemm (#13)
ba6f79a75 Added host_conv_wrw for verification (#15)

git-subtree-dir: src/composable_kernel
git-subtree-split: 646fcc268ede841a16cdaafb68aa64803d8390e1
@@ -22,6 +22,9 @@ using remove_reference_t = typename std::remove_reference<T>::type;
template <typename T>
using remove_cv_t = typename std::remove_cv<T>::type;

template <typename T>
using remove_cvref_t = remove_cv_t<std::remove_reference_t<T>>;