Gemm + softmax (gemm + reduce_max + broadcast sub + exp + reduce_sum + broadcast div) #178
rocking5566 wants to merge 33 commits into
Conversation
[Why] Prepare to add reduceSum
2d as 1d version
This PR is doing GEMM and reduction separately.
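For reference, the softmax stage named in the title decomposes as below. This is a plain C++ host sketch for illustration only, not the CK device instances this PR wires up:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// softmax over each length-N row of the M x N gemm output c
void softmax_rows(std::vector<float>& c, int M, int N)
{
    for(int m = 0; m < M; ++m)
    {
        // reduce_max over the row (for numerical stability)
        float row_max = c[m * N];
        for(int n = 1; n < N; ++n)
            row_max = std::max(row_max, c[m * N + n]);

        // broadcast sub + exp + reduce_sum
        float row_sum = 0.f;
        for(int n = 0; n < N; ++n)
        {
            c[m * N + n] = std::exp(c[m * N + n] - row_max);
            row_sum += c[m * N + n];
        }

        // broadcast div
        for(int n = 0; n < N; ++n)
            c[m * N + n] /= row_sum;
    }
}
```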
[Why] std::numeric_limits<_Float16>::lowest() will return zero
[Why] Prevent error propagation
Also, the file name
My PR #182 and PR #192 also have similar elementwise binary/unary kernels defined. But I think your implementation is more generic. After merging your PR, I will change my Batch-Norm forward code to use your kernel.
Your suggestion is great! There are also other binary and ternary operations in deep learning, e.g. concatenation. In addition, this was discussed with @asroy before.
[Why] Prevent loss of precision
[Why] Similar to acc datatype, it increases precision
Let memory coalesce between blocks
```cpp
          typename ElementwiseFunctor,
          index_t ThreadPerBlock,
          index_t ScalarPerVector>
struct DeviceBinaryElementwise_2D : public DeviceBinaryElementwise<ElementwiseFunctor>
```
DeviceBinaryElementwise_ND
You could make this Device Operation support N-D tensors (N = 1~5).
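Something along these lines; the NumDim parameter and the _ND name are illustrative sketches, not existing CK types:

```cpp
template <typename ADataType,
          typename BDataType,
          typename CDataType,
          typename ElementwiseFunctor,
          index_t NumDim, // 1 ~ 5
          index_t ThreadPerBlock,
          index_t ScalarPerVector>
struct DeviceBinaryElementwise_ND : public DeviceBinaryElementwise<ElementwiseFunctor>
{
    // shapes/strides become length-NumDim arrays instead of fixed 2-D lengths,
    // and the kernel merges all dimensions into one 1-D iteration space
};
```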
I will add this task to the backlog
Please open a JIRA task ticket and a GitHub issue, and refer to this comment in both tickets.
2. Use DeviceGemm_Xdl_CShuffle instead of deprecated DeviceGemmXdl_C_Shuffle
force-pushed from 7281715 to a41f548
[Why] F16 issue for host reduction has been fixed in c1ef731
After PR #209 gets merged, please fix the issues in this PR before we merge it.
```cpp
// do reduce max
auto reduce_max                 = DeviceReduceMaxInstance{};
auto reduce_max_workaspace_size = reduce_max.GetWorkspaceSizeInBytes(c_m_n_shape, reduceDims);
```
Spelling
```diff
- auto reduce_max_workaspace_size = reduce_max.GetWorkspaceSizeInBytes(c_m_n_shape, reduceDims);
+ auto reduce_max_workspace_size = reduce_max.GetWorkspaceSizeInBytes(c_m_n_shape, reduceDims);
```
```cpp
// m * n
const auto m0 = pArg->c_grid_desc_m0_.GetLength(I0);

if(m0 % BlockTileSize != 0)
```
I think requiring the merged length to be completely divisible by BlockTileSize is too strong a restriction. You should pad the tensor and relax the restriction.
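For example, something like the following in the argument setup; a minimal sketch, the helper names are illustrative:

```cpp
#include <cstdint>

using index_t = int32_t; // assumption for this sketch

// round m0 up to a multiple of BlockTileSize instead of rejecting it;
// the kernel must then mask or skip the padded tail, e.g. by applying
// make_right_pad_transform(m0, pad) when building the grid descriptor
index_t padded_length(index_t m0, index_t BlockTileSize)
{
    const index_t pad = (BlockTileSize - m0 % BlockTileSize) % BlockTileSize;
    return m0 + pad;
}
```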
```cpp
std::unique_ptr<BaseArgument> MakeArgumentPointer(const void* p_a,
                                                  const void* p_b,
                                                  void* p_c,
                                                  const std::vector<int>& shape,
```
Don't use reference types for the arguments here. Since MakeArgumentPointer() is an API, we cannot assume the user always passes lvalues.
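For example, take the vector by value instead (sketch only; the surrounding signature is assumed unchanged):

```cpp
std::unique_ptr<BaseArgument> MakeArgumentPointer(const void* p_a,
                                                  const void* p_b,
                                                  void* p_c,
                                                  std::vector<int> shape // by value, not const&
                                                  /* , remaining arguments unchanged */);
```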
```cpp
MakeArgumentPointer(const void* p_a,
                    const void* p_b,
                    void* p_c,
                    const std::vector<int>& shape_a,
```
Also, don't use reference types when declaring the arguments, as this is an API. We cannot always assume the user will pass addressable values.
```cpp
template <typename ADataType,
          typename BDataType,
          typename CDataType,
          typename ElementwiseFunctor,
```
Explicitly rename the ElementwiseFunctor type to a BinaryOperator type, since the kernel called here uses a binary operator. Also, the base class DeviceElementwise should be renamed to indicate its usage, since using a unary operator leads to a different API (e.g. p_a, p_b as in/out data) than using a binary operator (e.g. p_a, p_b, p_c as in/out data).
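Roughly the distinction in question; these declarations are hypothetical and only show the differing argument lists:

```cpp
// unary: one input, one output
struct DeviceUnaryElementwise : public BaseOperator
{
    virtual std::unique_ptr<BaseArgument>
    MakeArgumentPointer(const void* p_a, void* p_b /* , ... */) = 0;
};

// binary: two inputs, one output
struct DeviceBinaryElementwise : public BaseOperator
{
    virtual std::unique_ptr<BaseArgument>
    MakeArgumentPointer(const void* p_a, const void* p_b, void* p_c /* , ... */) = 0;
};
```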
```cpp
{
    dst = src1 - src2;
    // FIXME - use float16 exponential
    float dst_f32 = static_cast<float>(dst);
```
To simplify, I suggest defining dst, src1, src2 as AccDataType here, assuming operator() works on the VGPRs storing the converted values. ThreadwiseTransfer() can do the conversion automatically when the data is loaded from device memory into the static buffer.
An expression like dst = src1 - src2 will lead to implicit loss of precision. Remember, always do + - * / in AccDataType.
Also, use ck::type_convert for type conversion, since static_cast<>() does not work, at least when ck::bhalf_t is involved.
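A minimal sketch of both points, assuming AccDataType = float and that ThreadwiseTransfer has already converted the inputs:

```cpp
using AccDataType = float; // assumption for this sketch

struct SubtractFunctor
{
    // arithmetic stays in AccDataType; no implicit narrowing to half_t
    __host__ __device__ constexpr void
    operator()(AccDataType& dst, const AccDataType& src1, const AccDataType& src2) const
    {
        dst = src1 - src2;
    }
};

// at the memory boundary, convert explicitly, e.g.
//   CDataType out = ck::type_convert<CDataType>(acc);
// since static_cast<> does not handle ck::bhalf_t
```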
```cpp
__host__ __device__ constexpr void
operator()(CDataType& dst, const CDataType& src1, const CDataType& src2) const
{
    dst = src1 / src2;
```
The same as above. It is horrible if the division is done in half_t.
```cpp
using DeviceReduceSumInstance =
    ck::tensor_operation::device::DeviceReduceBlockWise<CDataType,
                                                        CDataType,
```
```diff
@@ -0,0 +1,150 @@
+#pragma once
+
+#include "cluster_descriptor.hpp"
```
#include "cluster_descriptor.hpp" is not needed since you don't use make_cluster_descriptor(). Also, data_type.hpp is not needed. On the other hand, several other headers are needed even though they are included indirectly, e.g. tensor_descriptor_helper.hpp and get_id.hpp.
You don't have to change this; there are lots of similar issues in other C.K. codes.
```cpp
// CAUTION - host reduce_max will call numeric_limits<ck::half_t>::lowest()
// However, numeric_limits<ck::half_t>::lowest() will return zero. So, used half_float::half instead
using HostReduceDataType = half_float::half;
```
Remove using half_float::half, since the Host_Reduction can now support ck::half_t. Check PR #195.
```cpp
    ComputeDataType Bm = static_cast<ComputeDataType>(B(m));
    functor(Cmn, Amn, Bm);
}
C(m, n) = static_cast<ComputeDataType>(Cmn);
```
Use ck::type_convert<ComputeDataType>(), or else conversion from bhalf_t will not work
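I.e. something like this sketch of the substitution, keeping the types from the hunk above:

```cpp
// ck::type_convert round-trips bhalf_t correctly; static_cast does not
ComputeDataType Bm = ck::type_convert<ComputeDataType>(B(m));
functor(Cmn, Amn, Bm);
C(m, n) = ck::type_convert<ComputeDataType>(Cmn);
```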
Closing this PR. A new PR will be created.