
[SWDEV-281541][MSRCHA-100] Implementation of Dynamic Generic Reduction #1108

Merged (61 commits) on Aug 27, 2021

Conversation

@qianfengz (Contributor) commented Aug 22, 2021:

This PR satisfies the request SWDEV-281541.

For generic reduction (miopenReduceTensor), dynamic means the input tensor specifics (lengths and strides of all dimensions) are passed to the kernels as runtime parameters. In comparison, the generic reduction implementation before this PR is called static: the input tensor specifics are passed to the kernels as compile-time constants, so every change in tensor specifics produces a different kernel binary and requires a compilation pass to generate it. Dynamic generic reduction is therefore expected to improve the performance of MIOpen applications in which the input tensor specifics vary frequently.
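The distinction can be sketched in host-side C++ (illustrative only; the PR's real kernels take the lengths and strides of all dimensions, not a single length):

```cpp
#include <cstddef>

// Static flavor: the length is a template (compile-time) constant, so each
// distinct shape instantiates, and must compile, a separate function
// (a separate "kernel binary").
template <std::size_t Len>
long reduce_static(const int* p)
{
    long acc = 0;
    for(std::size_t i = 0; i < Len; ++i)
        acc += p[i];
    return acc;
}

// Dynamic flavor: the length is a runtime argument, so one compiled function
// serves every shape without recompilation.
long reduce_dynamic(const int* p, std::size_t len)
{
    long acc = 0;
    for(std::size_t i = 0; i < len; ++i)
        acc += p[i];
    return acc;
}
```

The static flavor lets the compiler fully unroll and specialize, which is why it can be faster per call but costly when shapes change often.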

To test
#>bin/test_reduce_test --all
#>bin/test_reduce_test --all --half
#>bin/test_reduce_test --all --double

To test and use static generic reduction, set the environment variable MIOPEN_DISABLE_DYNAMIC_REDUCTION to disable dynamic generic reduction:
#>MIOPEN_DISABLE_DYNAMIC_REDUCTION=1 bin/test_reduce_test --all
#>MIOPEN_DISABLE_DYNAMIC_REDUCTION=1 bin/test_reduce_test --all --half
#>MIOPEN_DISABLE_DYNAMIC_REDUCTION=1 bin/test_reduce_test --all --double

Performance Data (comparing kernel execution times between dynamic and static reduction)
Reduction Perf

Chao Liu added 30 commits July 30, 2021 17:31
git-subtree-dir: src/composable_kernel
git-subtree-split: f6edda6119ebbb237dfa6270797b34f960d7b190
5781adf5c Update develop (#5) (#6)
97e6d514f Merge pull request #4 from ROCmSoftwarePlatform/separate_online_compile
7b1ec41e5 refactor
49c33aaea refactor
54b3e73d1 rename

git-subtree-dir: src/composable_kernel
git-subtree-split: 5781adf5cf4ac753e2e36da7385791775b744bf7

@junliume junliume changed the title [SWDEV-281541] Implementation of Dynamic Generic Reduction [SWDEV-281541][MSRCHA-100] Implementation of Dynamic Generic Reduction Aug 27, 2021
@junliume junliume merged commit c07307c into develop Aug 27, 2021
@junliume junliume deleted the dynamic_reduction_pr branch August 27, 2021 01:05
make_pad_transform(toReduceLen, 0, srcPad2)),
make_tuple(Sequence<0>{}, Sequence<1>{}),
make_tuple(Sequence<0>{}, Sequence<1>{}));
if(hipThreadIdx_x == 0)
Contributor:

Please replace all occurrences of hipThreadIdx_x in this PR with get_thread_local_1d_id().

Contributor Author:

Yes


std::string algo_name = "dynamic_generic_reduction";

std::string param = " -std=c++17 ";
Contributor:

Not needed; get_ck_common_compiler_flag already contains " -std=c++17 ".

Contributor Author:

Yes, just found that

return (outs.str());
};

static std::string get_definition_string_from_type_enums(miopenDataType_t TSrc,
Contributor:

Is MIOpen's miopenDataType_t consistent with DataTypeEnum_t? If not, we need a converter here

Contributor Author:

With static reduction, I assume they could be inconsistent. But for dynamic reduction, I assume DataTypeEnum_t is just a kernel-layer duplication of miopenDataType_t.

Contributor Author:

One more point: if we don't assume miopenDataType_t is the same as DataTypeEnum_t, then DataTypeEnum_t becomes useless in the kernel layer. If we convert miopenDataType_t to some invariant form (such as the character 'D' for double) and pass that to the kernel, we don't need DataTypeEnum_t, since we can get the types directly from the characters.

Contributor (@atamazov), Aug 27, 2021:

> But for dynamic reduction, I assume DataTypeEnum_t is just a kernel-layer duplication of miopenDataType_t.

🔴 So for dynamic reduction we must programmatically guarantee that both enums are consistent. Example:
https://github.com/ROCmSoftwarePlatform/MIOpen/blob/adc2035614310fb55ec9400c2a16f94b33a1d896/src/find_controls.cpp#L207-L224

Contributor (@asroy), Aug 27, 2021:

It's dangerous to assume the values of two enum types are consistent; we need to assume they will be different.

We need a converter that converts miopen::miopenDataType_t into ck::DataTypeEnum_t, without any assumption about their values.
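A minimal sketch of such a converter (the enumerator names below are illustrative stand-ins, not the actual MIOpen/CK definitions; the point is that an explicit switch stays correct regardless of the numeric values either enum uses):

```cpp
#include <stdexcept>

// Illustrative stand-ins for the two enums; the real definitions live in the
// MIOpen and composable_kernel headers, with their own enumerators.
enum miopenDataType_t_demo { demoHalf, demoFloat, demoDouble };
enum class DataTypeEnum_t_demo { Half = 10, Float = 11, Double = 12 };

// Explicit, exhaustive mapping: no assumption that numeric values match.
DataTypeEnum_t_demo mapDataType(miopenDataType_t_demo t)
{
    switch(t)
    {
    case demoHalf: return DataTypeEnum_t_demo::Half;
    case demoFloat: return DataTypeEnum_t_demo::Float;
    case demoDouble: return DataTypeEnum_t_demo::Double;
    }
    throw std::runtime_error("unsupported data type");
}
```

The deliberately different numeric values on the right-hand side show that the mapping keeps working even when the enums disagree numerically.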

Contributor Author (@qianfengz):

I will use the same converter used by static reduction to convert from miopen::miopenDataType_t to the data types used by the dynamic reduction kernel. In that case, the dynamic reduction kernel does not need ck::DataTypeEnum_t.

Contributor (@asroy), Aug 31, 2021:

Please write a converter between miopen::miopenDataType_t and ck::DataTypeEnum_t for dynamic kernel.

Contributor (@asroy), Aug 31, 2021:

Please keep these in mind:

  1. Design all the host logic with dynamic kernel in mind, NOT static kernel
  2. If the logic of static kernel and dynamic kernel are different, DO NOT mix the logic in the same function, put them in different functions.
  3. If static kernel can reuse the host logic designed for dynamic kernel, it's OK to reuse it. If not, write a separate logic in separate function

The reason is that we want to keep iterating on the implementation of the dynamic kernels (both kernels and solvers) and then fully retire the static kernel, so we don't want to mix their logic together. Everything about the refactor should make their logic more separate from each other, not more integrated.

Contributor Author:

Should we use ck::DataTypeEnum_t? I see no reason to, since we can pass the types using a consistent representation, like 'D' for double and 'F' for float, and convert that representation to types directly. We don't operate on the type enum in the kernel, so it is not needed.
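The character-based scheme described here can be sketched as follows (names are illustrative; a real mapping would cover every supported type):

```cpp
#include <type_traits>

// Each type is encoded as a single character in the kernel's compile
// definitions ('F' float, 'D' double, ...); the kernel recovers the C++
// type via template specialization, with no enum needed at all.
template <char Ch> struct TypeFromChar; // primary template left undefined
template <> struct TypeFromChar<'F'> { using type = float; };
template <> struct TypeFromChar<'D'> { using type = double; };

static_assert(std::is_same<TypeFromChar<'D'>::type, double>::value,
              "'D' must decode to double");
```

An unsupported character fails at compile time (the primary template is undefined), which is arguably safer than a runtime enum mismatch.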

src/reducetensor.cpp (thread resolved)
ReductionMethod_t GetReductionMethod_2(std::size_t invariantLength,
std::size_t toReduceLength) const
{
(void)invariantLength;
Contributor:

Why is this necessary?

Contributor Author:

Yes, this argument can be removed

Contributor Author:

It should be removed; only the last argument is needed.

Contributor:

Resolved.

src/reducetensor.cpp (further threads resolved)
// synchronize among all threads in this warp
__all(1);

for(index_t stride = warpSize / 2; stride > 0; stride /= 2)
Contributor:

[Performance] Is it worth using >> 1 instead of dividing by 2?

Contributor Author:

Should be, good!
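The halving-stride pattern under discussion, written with the shift (a host-side sketch of the in-warp tree reduction; `n` is assumed to be a power of two):

```cpp
#include <cstddef>

// Tree reduction: at each step, elements `stride` apart are combined, and
// the stride halves via `>>= 1` (equivalent to / 2 for unsigned values).
int tree_reduce_sum(int* buf, std::size_t n)
{
    for(std::size_t stride = n / 2; stride > 0; stride >>= 1)
        for(std::size_t i = 0; i < stride; ++i)
            buf[i] += buf[i + stride]; // pairwise combine, like warp lanes
    return buf[0];
}
```

For unsigned operands a modern compiler typically emits identical code for `/ 2` and `>> 1`, so this is largely a readability and intent choice.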


__syncthreads();

for(index_t stride = warpSize / 2; stride > 0; stride /= 2)
Contributor:

Ditto

src/reducetensor.cpp (thread resolved)

param += " -DCK_PARAM_REDUCE_OP=" + std::to_string(detail::GetReduceTensorOpId(reduceOp));
param += " -DCK_PARAM_NAN_PROPAGATE=" +
std::to_string(nanPropaOpt == MIOPEN_PROPAGATE_NAN ? 1 : 0);
Contributor:

[Recommendation] You can avoid conversion of integers to strings:

param += " -DCK_PARAM_NAN_PROPAGATE=" + (nanPropaOpt == MIOPEN_PROPAGATE_NAN ? "1" : "0");

Contributor:

[Notice] I am wondering how long we'll use string manipulations instead of KernelBuildParameters class...

Contributor (@asroy), Aug 27, 2021:

@atamazov Could you point to an example of using the KernelBuildParameters class?

Contributor (@atamazov), Aug 27, 2021:

Contributor Author:

> [Recommendation] You can avoid conversion of integers to strings:
>
> param += " -DCK_PARAM_NAN_PROPAGATE=" + (nanPropaOpt == MIOPEN_PROPAGATE_NAN ? "1" : "0");

Yes, good.

Contributor Author:

> [Recommendation] You can avoid conversion of integers to strings:
>
> param += " -DCK_PARAM_NAN_PROPAGATE=" + (nanPropaOpt == MIOPEN_PROPAGATE_NAN ? "1" : "0");

Yes, good.

I found this causes a compiler issue; I could only make it work by splitting it into two lines:
param += " -DCK_PARAM_NAN_PROPAGATE=";
param += (nanPropaOpt == MIOPEN_PROPAGATE_NAN ? "1" : "0");
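The compiler issue is expected: in the suggested one-liner, both operands of `+` are raw `const char*` (the string literal decays to a pointer), and adding two pointers is ill-formed. The two-line `+=` form works because `std::string::operator+=` accepts a `const char*`. A sketch, modeling the option as a bool:

```cpp
#include <string>

// Hypothetical helper; `propagate_nan` stands in for the
// `nanPropaOpt == MIOPEN_PROPAGATE_NAN` check in the PR.
std::string nan_flag(bool propagate_nan)
{
    std::string param;
    // param += " -DCK_PARAM_NAN_PROPAGATE=" + (propagate_nan ? "1" : "0");
    //   ^ would not compile: const char* + const char* has no operator+.
    param += " -DCK_PARAM_NAN_PROPAGATE=";
    param += propagate_nan ? "1" : "0";
    return param;
}
```

Wrapping either operand in `std::string(...)` would also compile, at the cost of one extra temporary.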

Contributor:

Resolved.

[Performance][Quality] @DrizztDoUrden @junliume @qianfengz String manipulations increase technical debt and affect host-side performance a bit. From now on we must block new Solvers that use string manipulations instead of KernelBuildParameters.

make_pad_transform(toReduceLen, 0, srcPad)),
make_tuple(Sequence<0>{}, Sequence<1>{}),
make_tuple(Sequence<0>{}, Sequence<1>{}));
if(hipThreadIdx_x == 0)
Contributor:

if (get_block_1d_id() == 0 && get_thread_local_1d_id() == 0)

Contributor Author:

Agree

Contributor (@atamazov):

@asroy Umbrella ticket #1120 updated with your review.

void* __restrict__ ws_global)
{
(void)GridSize;
(void)BlkGroupSize;
Contributor (@asroy), Aug 27, 2021:

remove unused argument, please

Contributor Author:

This would require many changes on the host side, since currently the host side can use the same kernel-launching code for different kernels. I will only consider doing this after splitting the host code into separate files for static and dynamic reduction.

make_pad_transform(toReduceLen, 0, srcPad)),
make_tuple(Sequence<0>{}, Sequence<1>{}),
make_tuple(Sequence<0>{}, Sequence<1>{}));
if(hipThreadIdx_x == 0)
Contributor:

if(get_block_1d_id() == 0 && get_thread_local_1d_id() == 0)

Contributor Author:

Agree

void* __restrict__ indices_global)
{
(void)BlkGroupSize;
(void)ws_buf2_bytes_offset;
Contributor:

Remove the unused arguments.


if(hipThreadIdx_x == 0)
*static_cast<decltype(dst1dDesc)*>(p_dst1dDesc) = dst1dDesc;
};
Contributor (@asroy), Aug 27, 2021:

Put src2dDesc and dst1dDesc in a tuple, so they are packed together in memory and need only a single pointer:

const auto desc_tuple = make_tuple(src2dDesc, dst1dDesc);

*static_cast<decltype(desc_tuple)*>(p_desc_tuple) = desc_tuple;

Contributor Author:

This brings little benefit, because we don't have to use a single pointer for the two descriptors. Doing it this way could make the tuple of reference descriptors complicated, due to the various combinations of src2d and dst1d descriptor padding.

Contributor:

The benefit is that src2dDesc and dst1dDesc will likely be packed in the same cache line. Please change that.
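The packing being requested can be sketched like this (the descriptor structs are illustrative stand-ins for the CK tensor descriptors):

```cpp
#include <tuple>

// Stand-ins for the real tensor descriptors.
struct Src2dDescDemo { int len0, len1; };
struct Dst1dDescDemo { int len0; };

using DescTuple = std::tuple<Src2dDescDemo, Dst1dDescDemo>;

// One pointer, one contiguous write: both descriptors land adjacent in the
// workspace, so a kernel reading them back likely touches one cache line.
void write_descs(void* ws, Src2dDescDemo s, Dst1dDescDemo d)
{
    *static_cast<DescTuple*>(ws) = std::make_tuple(s, d);
}
```

The kernel side would read the tuple back through a single pointer of the same type, instead of two separate descriptor pointers.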

Contributor Author:

I will measure the performance of combining the reads/writes of the two descriptors, even though it makes the code look less clean.

Contributor (@atamazov):

@asroy It is worth maintaining #1120, what do you think?

Contributor (@asroy) commented Aug 27, 2021:

> @asroy It is worth maintaining #1120, what do you think?

Yes, we can add a comment here and reference it in #1120.

Contributor (@atamazov):

@asroy

> Yes, we can add a comment here and reference it in #1120

Can you do that please? Or I can do that for you again, no problem.

const void* __restrict__ p_src_global,
float beta,
void* __restrict__ p_dst_global,
void* __restrict__ ws_global,
Contributor:

Where is the CONSTANT keyword for the tensor descriptor?
https://github.com/ROCmSoftwarePlatform/MIOpen/blob/1df9a07991727f1ac76d9507e963f4fe047eb4b8/src/composable_kernel/composable_kernel/src/kernel_wrapper/convolution_forward_implicit_gemm_v6r1_dlops_nchw_kcyx_nkhw.cpp#L240

I recall we have talked several times about the need to add the CONSTANT keyword for pointers to tensor descriptors. If you encountered issues when using the keyword, we should talk until the issue is resolved instead of silently dropping it.

Contributor Author:

No issue found, but I also found no benefit in using it. So here, do you think I should use CONSTANT with ws_global? The local p_src2d_descriptor and p_dst1d_descriptor point to some offset from ws_global.

{
using dataType = T;

__device__ static T GetZeroVal() { return std::numeric_limits<T>::max(); };
Contributor:

Contributor Author:

OK, I just noticed NumericLimits, thanks.

};

template <>
__device__ half_t Max<half_t>::GetZeroVal()
Contributor:

I think other developers would confuse "zero" here with "numerical zero".

Please change the function names to GetReductionZeroValue().

Contributor Author:

Agree
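For reference, the renamed "reduction zero" is the operator's identity element: for MIN it is the largest representable value, for MAX the lowest, and for ADD it is 0. A sketch (class and function names are stand-ins mirroring the suggested GetReductionZeroValue, not the PR's exact classes):

```cpp
#include <limits>

// Identity element ("reduction zero") for a MIN reduction: the largest
// representable value, so combining it with any input yields the input.
template <typename T>
struct MinOpDemo
{
    static T GetReductionZeroValue() { return std::numeric_limits<T>::max(); }
    static void Apply(T& a, T b) { if(b < a) a = b; }
};
```

Seeding the accumulator with this value lets the reduction loop treat every element uniformly, with no special case for the first one.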


__device__ inline constexpr void operator()(T& a, T b) const { a = a + b; }

static constexpr bool indexable = false;
Contributor:

indexable does not sound like a valid property of a reduction-op type.

Please remove the indexable reduction-op classes.

The host needs to decide whether an index is needed as output, and pass that info to the kernel as a compile-time parameter.

Contributor Author:

indexable is a property of the reduction operator: ADD is not indexable, MIN is indexable. But it is the host that determines whether to output indices for the reduction result.
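The position described here (indexable as a static trait of the op, with the host making the output decision) can be sketched as follows, with illustrative names:

```cpp
// `indexable` as a compile-time trait of each reduction op: ADD cannot
// meaningfully report "where" its result came from, MIN can.
struct AddDemo { static constexpr bool indexable = false; };
struct MinDemo { static constexpr bool indexable = true;  };

// Host-side decision: emit indices only if the op supports them AND the
// caller actually requested them.
template <typename Op>
constexpr bool need_indices(bool user_wants_indices)
{
    return Op::indexable && user_wants_indices;
}
```

The trait alone never forces index output; it only gates what the host is allowed to request.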

ltqin pushed a commit that referenced this pull request Oct 28, 2021
646fcc268 Merge pull request #47 from ROCmSoftwarePlatform/develop
6014185ac [Bug Fix] GridwiseGemm_bk0mk1_bk0nk1_mn_xdlops_v2r4 loop issue (#44)
3e9113707 Merge pull request #46 from ROCmSoftwarePlatform/miopen_downstream_all
211dae822 Merge branch 'develop' into miopen_downstream_all
5890e3007 [Composable Kernel] update develop branch code to ck_upstream
d5297abae fix bug in gridwise gemm xdlops v2r3 (#45)
38a90b6ed Merge pull request #43 from ROCmSoftwarePlatform/develop
c3018794b bug fix (#39)
fd49ff808 add nchw atomic , nhwc and nhwc atomic method   for backward weight (#30)
b2dc55f82 [MIOpen Downstream] Fix Reduction Kernel (#34)
b3e8d57d5 Tweak GEMM kernel (#38)
846f462bd Add VectorType support into StaticBuffer (#27)
dfb80c4e3 [Enhancements] Several bugfixes and refactoring of dynamic generic reduction  (#1156)
8557901d0 Merge pull request #1165 from ROCmSoftwarePlatform/develop
f305bebdc Merge pull request #31 from ROCmSoftwarePlatform/miopen_downstream-dynamic_reduction_pr
b725e3fc8 Merge remote-tracking branch 'origin/develop' into miopen_downstream-dynamic_reduction_pr
88833bd9a Merge pull request #32 from ROCmSoftwarePlatform/develop
df0d68106 :Merge remote-tracking branch 'origin/develop' into CK_upstream
f3acd2510 Add  a version of Merge transform that use integerdivision and mod (#25)
19613902b GEMM driver and kernel (#29)
627d8ef35 Backward weight v4r4r2 with xdlops (#18)
10bb81106 Misc fixes (#24)
9e80cdceb [SWDEV-281541][MSRCHA-100] Implementation of Dynamic Generic Reduction  (#1108)
a7a758d8c GlobalAtomicAdd for fp32/int32 (#23)
9d3f634a3 Xdlops refactor fix (#22)
c6f26bb48 magic division use __umulhi() (#19)
6fe3627a9 Composable kernel init integration v3 (#1097)
a2ad6d353 refactor dynamic xdlops iGemm (#13)
ba6f79a75 Added host_conv_wrw for verification (#15)

git-subtree-dir: src/composable_kernel
git-subtree-split: 646fcc268ede841a16cdaafb68aa64803d8390e1
@@ -22,6 +22,9 @@ using remove_reference_t = typename std::remove_reference<T>::type;
template <typename T>
using remove_cv_t = typename std::remove_cv<T>::type;

template <typename T>
using remove_cvref_t = remove_cv_t<std::remove_reference_t<T>>;