Gemm + softmax (gemm + reduce_max + broadcast sub + exp + reduce_sum + broadcast div) #178
rocking5566 wants to merge 33 commits into
Conversation
[Why] Prepare to add reduceSum
2d as 1d version
This PR is doing GEMM and reduction separately.
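For reference, the softmax stage named in the title decomposes as below. This is a plain C++ host sketch for illustration only, not the CK device instances this PR wires up:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// softmax over each length-N row of the M x N gemm output c
void softmax_rows(std::vector<float>& c, int M, int N)
{
    for(int m = 0; m < M; ++m)
    {
        // reduce_max over the row (for numerical stability)
        float row_max = c[m * N];
        for(int n = 1; n < N; ++n)
            row_max = std::max(row_max, c[m * N + n]);

        // broadcast sub + exp + reduce_sum
        float row_sum = 0.f;
        for(int n = 0; n < N; ++n)
        {
            c[m * N + n] = std::exp(c[m * N + n] - row_max);
            row_sum += c[m * N + n];
        }

        // broadcast div
        for(int n = 0; n < N; ++n)
            c[m * N + n] /= row_sum;
    }
}
```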
[Why] std::numeric_limits<_Float16>::lowest() will return zero
[Why] Prevent error propagation
Also, the file name
My PR #182 and PR #192 also have similar elementwise binary/unary kernels defined. But I think your implementation is more generic. After merging your PR, I will change my Batch-Norm forward code to use your kernel.
Your suggestion is great! There are also other binary and ternary operations in deep learning, e.g. concatenation. In addition, this was discussed with @asroy before.
[Why] Prevent loss of precision
[Why] Similar to acc datatype, it increases precision
Let memory coalesce between blocks
```cpp
          typename ElementwiseFunctor,
          index_t ThreadPerBlock,
          index_t ScalarPerVector>
struct DeviceBinaryElementwise_2D : public DeviceBinaryElementwise<ElementwiseFunctor>
```
DeviceBinaryElementwise_ND
You could make this Device Operation support N-D tensors (N = 1~5).
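Something along these lines; the NumDim parameter and the _ND name are illustrative sketches, not existing CK types:

```cpp
template <typename ADataType,
          typename BDataType,
          typename CDataType,
          typename ElementwiseFunctor,
          index_t NumDim, // 1 ~ 5
          index_t ThreadPerBlock,
          index_t ScalarPerVector>
struct DeviceBinaryElementwise_ND : public DeviceBinaryElementwise<ElementwiseFunctor>
{
    // shapes/strides become length-NumDim arrays instead of fixed 2-D lengths,
    // and the kernel merges all dimensions into one 1-D iteration space
};
```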
I will add this task to the backlog
Please open a JIRA task ticket and a GitHub issue, and refer to this comment in both tickets.
2. Use DeviceGemm_Xdl_CShuffle instead of deprecated DeviceGemmXdl_C_Shuffle
force-pushed from 7281715 to a41f548
[Why] F16 issue for host reduction has been fixed in c1ef731
After PR #209 gets merged, please fix the issues in this PR before we merge it.
```cpp
// do reduce max
auto reduce_max                 = DeviceReduceMaxInstance{};
auto reduce_max_workaspace_size = reduce_max.GetWorkspaceSizeInBytes(c_m_n_shape, reduceDims);
```
Spelling
```diff
- auto reduce_max_workaspace_size = reduce_max.GetWorkspaceSizeInBytes(c_m_n_shape, reduceDims);
+ auto reduce_max_workspace_size = reduce_max.GetWorkspaceSizeInBytes(c_m_n_shape, reduceDims);
```
```cpp
// m * n
const auto m0 = pArg->c_grid_desc_m0_.GetLength(I0);

if(m0 % BlockTileSize != 0)
```
I think requiring the merged length to be completely divisible by BlockTileSize is too strong a restriction. You should pad the tensor and relax the restriction.
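For example, something like the following in the argument setup; a minimal sketch, the helper names are illustrative:

```cpp
#include <cstdint>

using index_t = int32_t; // assumption for this sketch

// round m0 up to a multiple of BlockTileSize instead of rejecting it;
// the kernel must then mask or skip the padded tail, e.g. by applying
// make_right_pad_transform(m0, pad) when building the grid descriptor
index_t padded_length(index_t m0, index_t BlockTileSize)
{
    const index_t pad = (BlockTileSize - m0 % BlockTileSize) % BlockTileSize;
    return m0 + pad;
}
```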
```cpp
std::unique_ptr<BaseArgument> MakeArgumentPointer(const void* p_a,
                                                  const void* p_b,
                                                  void* p_c,
                                                  const std::vector<int>& shape,
```
Don't use reference types for the arguments here. Since MakeArgumentPointer() is an API, we cannot assume the user always passes lvalues.
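For example, take the vector by value instead (sketch only; the surrounding signature is assumed unchanged):

```cpp
std::unique_ptr<BaseArgument> MakeArgumentPointer(const void* p_a,
                                                  const void* p_b,
                                                  void* p_c,
                                                  std::vector<int> shape // by value, not const&
                                                  /* , remaining arguments unchanged */);
```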
```cpp
MakeArgumentPointer(const void* p_a,
                    const void* p_b,
                    void* p_c,
                    const std::vector<int>& shape_a,
```
Also, don't use reference types when declaring the arguments, as this is an API. We cannot always assume the user will pass addressable values.
```cpp
template <typename ADataType,
          typename BDataType,
          typename CDataType,
          typename ElementwiseFunctor,
```
Explicitly rename the ElementwiseFunctor type to a BinaryOperator type, since the kernel called here uses a binary operator. Also, the base class DeviceElementwise should be renamed to indicate its usage, since using a unary operator leads to a different API (e.g. p_a, p_b as in/out data) than using a binary operator (e.g. p_a, p_b, p_c as in/out data).
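Roughly the distinction in question; these declarations are hypothetical and only show the differing argument lists:

```cpp
// unary: one input, one output
struct DeviceUnaryElementwise : public BaseOperator
{
    virtual std::unique_ptr<BaseArgument>
    MakeArgumentPointer(const void* p_a, void* p_b /* , ... */) = 0;
};

// binary: two inputs, one output
struct DeviceBinaryElementwise : public BaseOperator
{
    virtual std::unique_ptr<BaseArgument>
    MakeArgumentPointer(const void* p_a, const void* p_b, void* p_c /* , ... */) = 0;
};
```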
```cpp
{
    dst = src1 - src2;
    // FIXME - use float16 exponential
    float dst_f32 = static_cast<float>(dst);
```
To simplify, I suggest defining dst, src1, src2 as AccDataType here, assuming operator() works on the VGPRs storing the converted values. ThreadwiseTransfer() can do the conversion automatically when the data is loaded from device memory into the static buffer.
An expression like dst = src1 - src2 will lead to implicit loss of precision. Remember, always do + - * / in AccDataType.
Also, use ck::type_convert for type conversion, since static_cast<>() does not work, at least when ck::bhalf_t is involved.
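A minimal sketch of both points, assuming AccDataType = float and that ThreadwiseTransfer has already converted the inputs:

```cpp
using AccDataType = float; // assumption for this sketch

struct SubtractFunctor
{
    // arithmetic stays in AccDataType; no implicit narrowing to half_t
    __host__ __device__ constexpr void
    operator()(AccDataType& dst, const AccDataType& src1, const AccDataType& src2) const
    {
        dst = src1 - src2;
    }
};

// at the memory boundary, convert explicitly, e.g.
//   CDataType out = ck::type_convert<CDataType>(acc);
// since static_cast<> does not handle ck::bhalf_t
```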
```cpp
__host__ __device__ constexpr void
operator()(CDataType& dst, const CDataType& src1, const CDataType& src2) const
{
    dst = src1 / src2;
```
The same as above. It is horrible if the division is done in half_t.
```cpp
using DeviceReduceSumInstance =
    ck::tensor_operation::device::DeviceReduceBlockWise<CDataType,
                                                        CDataType,
```
```diff
@@ -0,0 +1,150 @@
+#pragma once
+
+#include "cluster_descriptor.hpp"
```
#include "cluster_descriptor.hpp" is not needed since you don't use make_cluster_descriptor(). Also, data_type.hpp is not needed. On the other hand, several other headers are needed even though they are included indirectly, e.g. tensor_descriptor_helper.hpp and get_id.hpp.
You don't have to change this; there are lots of similar issues in other C.K. codes.
```cpp
// CAUTION - host reduce_max will call numeric_limits<ck::half_t>::lowest()
// However, numeric_limits<ck::half_t>::lowest() will return zero. So, used half_float::half instead
using HostReduceDataType = half_float::half;
```
Remove using half_float::half, since the Host_Reduction can now support ck::half_t. Check PR #195.
```cpp
    ComputeDataType Bm = static_cast<ComputeDataType>(B(m));
    functor(Cmn, Amn, Bm);
}
C(m, n) = static_cast<ComputeDataType>(Cmn);
```
Use ck::type_convert<ComputeDataType>(), or else conversion from bhalf_t will not work
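I.e. something like this sketch of the substitution, keeping the types from the hunk above:

```cpp
// ck::type_convert round-trips bhalf_t correctly; static_cast does not
ComputeDataType Bm = ck::type_convert<ComputeDataType>(B(m));
functor(Cmn, Amn, Bm);
C(m, n) = ck::type_convert<ComputeDataType>(Cmn);
```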
Closing this PR. A new PR will be created.