-
Notifications
You must be signed in to change notification settings - Fork 223
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SWDEV-281541][MSRCHA-100] Implementation of Dynamic Generic Reduction #1108
Conversation
git-subtree-dir: src/composable_kernel git-subtree-split: f6edda6119ebbb237dfa6270797b34f960d7b190
…le_kernel_init_integration_v3
…init_integration_v3
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
make_pad_transform(toReduceLen, 0, srcPad2)), | ||
make_tuple(Sequence<0>{}, Sequence<1>{}), | ||
make_tuple(Sequence<0>{}, Sequence<1>{})); | ||
if(hipThreadIdx_x == 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please do a string replacement of all hipThreadIdx_x
in this PR to get_thread_local_1d_id()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
|
||
std::string algo_name = "dynamic_generic_reduction"; | ||
|
||
std::string param = " -std=c++17 "; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not needed get_ck_common_compiler_flag already contain " -std=c++17 "
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, just found that
return (outs.str()); | ||
}; | ||
|
||
static std::string get_definition_string_from_type_enums(miopenDataType_t TSrc, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is MIOpen's miopenDataType_t
consistent with DataTypeEnum_t
? If not, we need a converter here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With static reduction, I assume they could be in-consistent. But for dynamic reduction, I assume DataTypeEnum_t is just a kernel layer duplication of miopenDataType_t.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also one point, if we don't assume miopenDataType_t be same as DataTypeEnum_t, DataTypeEnum_t will be useless in the kernel layer. Because if we convert the miopenDataType_t to some invariant form (like a characters 'D' for double) and pass them to kernel, we don't need DataTypeEnum_t since we can get the types directly from the characters
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But for dynamic reduction, I assume DataTypeEnum_t is just a kernel layer duplication of miopenDataType_t.
🔴 So for dymatic reduction we must programmatically guarantee that both enums are consistent. Example:
https://github.com/ROCmSoftwarePlatform/MIOpen/blob/adc2035614310fb55ec9400c2a16f94b33a1d896/src/find_controls.cpp#L207-L224
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's dangerous to assume the value of two enum types are consistent, we need to assume they will be different.
We need a converter that convert miopen::miopenDataType_t
into ck::DataTypeEnum_t
, without any assumption of their value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will use the same converter used by static reduction to convert from miopen::miopenDataType_t to the data types used by the dynamic reduction kernel. If doing so, Dynamic reduction kernel does not need ck::DataTypeEnum_t.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please write a converter between miopen::miopenDataType_t
and ck::DataTypeEnum_t
for dynamic kernel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please keep these in mind:
- Design all the host logic with dynamic kernel in mind, NOT static kernel
- If the logic of static kernel and dynamic kernel are different, DO NOT mix the logic in the same function, put them in different functions.
- If static kernel can reuse the host logic designed for dynamic kernel, it's OK to reuse it. If not, write a separate logic in separate function
The reason is that we want to keep iterating on the implementation of dynamic kernels (both kernels and solvers), and then fully retire static kernel. So we don't want to mix their the logic of them together. Everything about the refactor should be making their logic more separate from each other instead of more integrated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use ck::DataTypeEnum_t ? I found no reason to use this, since we can pass the types using a consistent representation, like D
for double, F
for float, and convert these consistent representation to types directly. We don't work on Type Enum in the kernel, so no need.
ReductionMethod_t GetReductionMethod_2(std::size_t invariantLength, | ||
std::size_t toReduceLength) const | ||
{ | ||
(void)invariantLength; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this is necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this argument can be removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be removed, only last argument is needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved.
// synchronize among all threads in this warp | ||
__all(1); | ||
|
||
for(index_t stride = warpSize / 2; stride > 0; stride /= 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Performance] Does it worth to use >> 1
instead of dividing by 2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be, good!
|
||
__syncthreads(); | ||
|
||
for(index_t stride = warpSize / 2; stride > 0; stride /= 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto
|
||
param += " -DCK_PARAM_REDUCE_OP=" + std::to_string(detail::GetReduceTensorOpId(reduceOp)); | ||
param += " -DCK_PARAM_NAN_PROPAGATE=" + | ||
std::to_string(nanPropaOpt == MIOPEN_PROPAGATE_NAN ? 1 : 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Recommendation] You can avoid conversion of integers to strings:
param += " -DCK_PARAM_NAN_PROPAGATE=" + (nanPropaOpt == MIOPEN_PROPAGATE_NAN ? "1" : "0");
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Notice] I am wondering how long we'll use string manipulations instead of KernelBuildParameters
class...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@atamazov Could you point to example of using KernelBuildParameters
class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@asroy Thanks for asking ;) https://github.com/ROCmSoftwarePlatform/MIOpen/blob/c701af3ba83f8fb99d7efab30ee0b22de57f2f42/src/solver/activ/fwd_1.cpp#L170-L240
Visit https://github.com/ROCmSoftwarePlatform/MIOpen/search?q=KernelBuildParameters&type=code for more examples.
@DrizztDoUrden can provide additional advice if necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Recommendation] You can avoid conversion of integers to strings:
param += " -DCK_PARAM_NAN_PROPAGATE=" + (nanPropaOpt == MIOPEN_PROPAGATE_NAN ? "1" : "0");
Yes, good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Recommendation] You can avoid conversion of integers to strings:
param += " -DCK_PARAM_NAN_PROPAGATE=" + (nanPropaOpt == MIOPEN_PROPAGATE_NAN ? "1" : "0");
Yes, good
I found this will cause compiler issue, and I can only do so by splitting to two lines as
param += " -DCK_PARAM_NAN_PROPAGATE=";
param += (nanPropaOpt == MIOPEN_PROPAGATE_NAN ? "1" : "0");
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved.
[Performance][Quality] @DrizztDoUrden @junliume @qianfengz String manipulations increase technical debt and affect host-side performance a bit. From now on we must block new Solvers that use string manipulations instead of KernelBuildParameters
.
make_pad_transform(toReduceLen, 0, srcPad)), | ||
make_tuple(Sequence<0>{}, Sequence<1>{}), | ||
make_tuple(Sequence<0>{}, Sequence<1>{})); | ||
if(hipThreadIdx_x == 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (get_block_1d_id() == 0 && get_thread_local_1d_id() == 0)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree
...ernel/src/kernel_wrapper/gridwise_generic_reduction_first_call_blockwise_reduce_all_dims.cpp
Show resolved
Hide resolved
void* __restrict__ ws_global) | ||
{ | ||
(void)GridSize; | ||
(void)BlkGroupSize; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove unused argument, please
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will require lots of change in the host sides, since so far the host side can use the same kernel launching for different kernels. I will only consider to do so after considering to split the host codes into separate files for static and dynamic reduction.
make_pad_transform(toReduceLen, 0, srcPad)), | ||
make_tuple(Sequence<0>{}, Sequence<1>{}), | ||
make_tuple(Sequence<0>{}, Sequence<1>{})); | ||
if(hipThreadIdx_x == 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if(get_block_1d_id() == 0 && get_thread_local_1d_id() == 0)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree
void* __restrict__ indices_global) | ||
{ | ||
(void)BlkGroupSize; | ||
(void)ws_buf2_bytes_offset; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove unused arguements
|
||
if(hipThreadIdx_x == 0) | ||
*static_cast<decltype(dst1dDesc)*>(p_dst1dDesc) = dst1dDesc; | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put src2dDesc and dst1dDesc in a tuple, so they are packed inside memory, and need only a single pointer
const auto desc_tuple = make_tuple(src2dDesc, dst1Desc);
*static_cast<decltype(desc_tuple)*>(p_desc_tuple) = desc_tuple;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This brings little benefit, cause we don't have to use single pointer for the two descriptors. Using in this way could make the tuple for the reference descriptors complicated due to the various combinations of src2d and dst1d descriptor padding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The benefit is src2dDesc and dst1dDesc will be likely packed in same cacheline. Please change that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will measure the performance by combining the r/w of the two descriptors, even though it makes codes looks not good
const void* __restrict__ p_src_global, | ||
float beta, | ||
void* __restrict__ p_dst_global, | ||
void* __restrict__ ws_global, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is CONSTANT
keyword for tensor descriptor?
https://github.com/ROCmSoftwarePlatform/MIOpen/blob/1df9a07991727f1ac76d9507e963f4fe047eb4b8/src/composable_kernel/composable_kernel/src/kernel_wrapper/convolution_forward_implicit_gemm_v6r1_dlops_nchw_kcyx_nkhw.cpp#L240
I recall we have talked several times about the need to add CONSTANT
keyword for pointer to tensor descriptor. If you have encountered issues when using the keyword, we should talk until the issue is resolved instead of silently dropping it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not issue found, but also found no benefit to use this. So here, do you think should I use CONSTANT with ws_global, the local p_src2d_descriptor and p_dst1d_descriptor are pointing to some offset from ws_global
{ | ||
using dataType = T; | ||
|
||
__device__ static T GetZeroVal() { return std::numeric_limits<T>::max(); }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please replace all std::numeric_limits
in the code with ck:: NumericLimits
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, just notice NmericLimits, thanks
}; | ||
|
||
template <> | ||
__device__ half_t Max<half_t>::GetZeroVal() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think other developers would confuse "zero" here with "numerical zero".
Please change the function names toGetReductionZeroValue()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree
|
||
__device__ inline constexpr void operator()(T& a, T b) const { a = a + b; } | ||
|
||
static constexpr bool indexable = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indexable
does not sound like a valid property of reduction-op type.
Please remove indexable
reductions-op classes.
Host needs to decide if index is needed as output, and pass the info to kernel as compile-time parameter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indexable
is a property of the reduction operator, like ADD is not indexable, MIN is indexable, but it is the host to determine whether to output indices for the reduction result
646fcc268 Merge pull request #47 from ROCmSoftwarePlatform/develop 6014185ac [Bug Fix] GridwiseGemm_bk0mk1_bk0nk1_mn_xdlops_v2r4 loop issue (#44) 3e9113707 Merge pull request #46 from ROCmSoftwarePlatform/miopen_downstream_all 211dae822 Merge branch 'develop' into miopen_downstream_all 5890e3007 [Composable Kernel] update develop branch code to ck_upstream d5297abae fix bug in gridwise gemm xdlops v2r3 (#45) 38a90b6ed Merge pull request #43 from ROCmSoftwarePlatform/develop c3018794b bug fix (#39) fd49ff808 add nchw atomic , nhwc and nhwc atomic method for backward weight (#30) b2dc55f82 [MIOpen Downstream] Fix Reduction Kernel (#34) b3e8d57d5 Tweak GEMM kernel (#38) 846f462bd Add VectorType support into StaticBuffer (#27) dfb80c4e3 [Enhancements] Several bugfixes and refactoring of dynamic generic reduction (#1156) 8557901d0 Merge pull request #1165 from ROCmSoftwarePlatform/develop f305bebdc Merge pull request #31 from ROCmSoftwarePlatform/miopen_downstream-dynamic_reduction_pr b725e3fc8 Merge remote-tracking branch 'origin/develop' into miopen_downstream-dynamic_reduction_pr 88833bd9a Merge pull request #32 from ROCmSoftwarePlatform/develop df0d68106 :Merge remote-tracking branch 'origin/develop' into CK_upstream f3acd2510 Add a version of Merge transform that use integerdivision and mod (#25) 19613902b GEMM driver and kernel (#29) 627d8ef35 Backward weight v4r4r2 with xdlops (#18) 10bb81106 Misc fixes (#24) 9e80cdceb [SWDEV-281541][MSRCHA-100] Implementation of Dynamic Generic Reduction (#1108) a7a758d8c GlobalAtomicAdd for fp32/int32 (#23) 9d3f634a3 Xdlops refactor fix (#22) c6f26bb48 magic division use __umulhi() (#19) 6fe3627a9 Composable kernel init integration v3 (#1097) a2ad6d353 refactor dynamic xdlops iGemm (#13) ba6f79a75 Added host_conv_wrw for verification (#15) git-subtree-dir: src/composable_kernel git-subtree-split: 646fcc268ede841a16cdaafb68aa64803d8390e1
@@ -22,6 +22,9 @@ using remove_reference_t = typename std::remove_reference<T>::type; | |||
template <typename T> | |||
using remove_cv_t = typename std::remove_cv<T>::type; | |||
|
|||
template <typename T> | |||
using remove_cvref_t = remove_cv_t<std::remove_reference_t<T>>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
⚓
This is P.R is for satisfying the request SWDEV-281541.
For generic reduction (
miopenReduceTensor
), dynamic means the input tensor specifics (lengths and strides of all dimensions) are passed to the kernels as runtime parameters. In comparison, the generic reduction implementation before this P.R is called static, by which the input tensor specifics are passed to the kernels as compiler constants, which will lead to different kernel binaries and the needing of the compiling process to generate them when the tensor specifics change. So dynamic generic reduction is supposed to improve the performance of the MIOpen applications when the input tensor specifics varies frequently.To test
#>bin/test_reduce_test --all
#>bin/test_reduce_test --all --half
#>bin/test_reduce_test --all --doulbe
To test and use static dynamic generic reduction, use the environment variable MIOPEN_DISABLE_DYNAMIC_REDUCTION to disable dynamic generic reduction
#>MIOPEN_DISABLE_DYNAMIC_REDUCTION=1 bin/test_reduce_test --all
#>MIOPEN_DISABLE_DYNAMIC_REDUCTION=1 bin/test_reduce_test --all --half
#>MIOPEN_DISABLE_DYNAMIC_REDUCTION=1 bin/test_reduce_test --all --double
Performance Data (comparing kernel execution times between dynamic and static reduction)
Reduction Perf