
H2D data transfer optimization for concat kernel #49040

Merged

Conversation

JamesLim-sy
Contributor

@JamesLim-sy JamesLim-sy commented Dec 13, 2022

PR types

Function optimization

PR changes

OPs

Describe

  1. Currently, the addresses of the input tensors are transferred from host to device memory through the cudaMemcpyAsync runtime API, which incurs a significant H2D transfer cost. In most cases, this transfer can be replaced by simply passing a struct-type argument to the global (kernel) function. There are many concat op cases in the alpha_fold model during training, and they can be divided into the following groups:

    1. Concat with 2 inputs: 59.26%
    2. Concat with 3 inputs: 7.09%
    3. Concat with 4 inputs: 20.62%
    4. Concat with 6 inputs: 8.68%
    5. Concat with 8 inputs: 4.35%
    6. Concat with same-shape inputs: 89.89%
    7. Concat with different-shape inputs: 10.11% (only in the 2-input and 3-input cases)

    Almost 23.14% of the concat cases in alpha_fold need cudaMemcpyAsync, so there is a good chance of using a struct type to reduce the CPU cost of this data transfer in the concat op (a minimal sketch of the idea is shown after this list).

  2. Many concat cases work well with an int32_t index type, and making all cases use int64_t would hurt CUDA kernel performance when computing thread indices with int64_t. So the int32_t cases are separated from the int64_t cases.
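A minimal, self-contained sketch of the idea (illustrative names only, not the PR's exact code): the input pointers are stored in a fixed-size struct that is passed to the kernel by value as a launch argument, so no cudaMemcpyAsync staging of the pointer array is needed, and the kernel is templated on IndexT so the launcher can pick int32_t when the flattened output fits in 32 bits and int64_t otherwise.

#include <cuda_runtime.h>
#include <vector>

// Fixed-capacity wrapper passed by value to the kernel; holds device pointers.
template <typename T, int Size>
struct PointerWrapper {
  const T* ins_addr[Size];
  __device__ inline const T* operator[](int i) const { return ins_addr[i]; }
};

// Concat of equally shaped inputs (each out_row x in_col) along the column axis.
template <typename T, typename IndexT, typename WrapperT>
__global__ void ConcatSameShapeKernel(WrapperT ins,
                                      IndexT in_col,
                                      IndexT out_row,
                                      IndexT out_col,
                                      T* out) {
  for (IndexT tid_x = blockIdx.x * blockDim.x + threadIdx.x; tid_x < out_col;
       tid_x += gridDim.x * blockDim.x) {
    IndexT split = tid_x / in_col;              // which input this column maps to
    IndexT in_offset = tid_x - split * in_col;  // column index inside that input
    const T* input_ptr = ins[split];
    for (IndexT tid_y = blockIdx.y * blockDim.y + threadIdx.y; tid_y < out_row;
         tid_y += gridDim.y * blockDim.y) {
      out[tid_y * out_col + tid_x] = input_ptr[tid_y * in_col + in_offset];
    }
  }
}

// Launch sketch: the wrapper is filled on the host and passed as a kernel argument.
// Assumes inputs.size() <= Size; larger input counts would fall back to a
// device-memory pointer array.
template <typename T, typename IndexT, int Size>
void LaunchConcatSameShape(const std::vector<const T*>& inputs,
                           IndexT in_col,
                           IndexT out_row,
                           T* out,
                           cudaStream_t stream) {
  PointerWrapper<T, Size> wrapper;
  for (size_t i = 0; i < inputs.size(); ++i) wrapper.ins_addr[i] = inputs[i];
  IndexT out_col = static_cast<IndexT>(inputs.size()) * in_col;
  dim3 block(32, 8);
  unsigned grid_y = static_cast<unsigned>((out_row + block.y - 1) / block.y);
  if (grid_y > 65535u) grid_y = 65535u;  // grid-stride loop covers the rest
  dim3 grid(static_cast<unsigned>((out_col + block.x - 1) / block.x), grid_y);
  ConcatSameShapeKernel<T, IndexT, PointerWrapper<T, Size>>
      <<<grid, block, 0, stream>>>(wrapper, in_col, out_row, out_col, out);
}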

@paddle-bot

paddle-bot bot commented Dec 13, 2022

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@JamesLim-sy JamesLim-sy force-pushed the memcpy_optimization_for_concat_op branch from c5af993 to 10cdde2 on December 14, 2022 12:25
@@ -19,111 +19,155 @@ limitations under the License. */
namespace phi {
namespace funcs {

template <typename T, int Size>
struct PointerWarpper {
Contributor

Is it a typo? Is it a PointerWrapper?

Contributor Author

Oh... it is a typo; it will be fixed in the next commit.

const IndexT out_cols,
T* output_data) {
CUDA_KERNEL_LOOP_TYPE(tid_x, out_cols, IndexT) {
IndexT split = tid_x * 1.0 / fixed_in_col;
Contributor

If possible, could 1.0/fixed_in_col also be passed by kernel args?

Contributor Author

Good one, will be fixed in next commit.

Contributor

@Xreki Xreki left a comment

I would still suggest extracting the PointerArray logic into a separate component so that concat and stack can share it. In stack, PointerArray does not need to be tightly coupled with DivMod; they can be split apart and passed as separate parameters.

PointerWarpper(const phi::GPUContext& ctx,
const std::vector<phi::DenseTensor>& ins,
const int64_t& in_num,
const T** inputs_data) {
Contributor

Aren't ctx and inputs_data unused here? I assume these parameters are kept to support the other case; would it be better to define another constructor with a different parameter list?

Contributor Author

This is a compromise so that the macros later on can use a uniform signature.

PointerWarpper() {}
PointerWarpper(const phi::GPUContext& ctx,
const std::vector<phi::DenseTensor>& ins,
const int64_t& in_num,
Contributor

There is no need to pass in_num as a parameter either; it can be obtained directly via ins.size().

Contributor Author

Fixed as suggested.

const std::vector<phi::DenseTensor>& ins,
const int64_t& in_num,
const T** inputs_data) {
for (auto i = 0; i < in_num; ++i) {
Contributor

Please add a check that Size >= ins.size().

Contributor Author

The code later on already makes this guarantee before using it here, so adding another check does not seem necessary:

    if (limit_num < 8) {
      IMPL_CONCAT_WITH_WARPPER(8);
    } else if (limit_num < 16) {
      IMPL_CONCAT_WITH_WARPPER(16);
    } else if (limit_num < 32) {
      IMPL_CONCAT_WITH_WARPPER(32);
    } else if (limit_num < 64) {
      IMPL_CONCAT_WITH_WARPPER(64);
    } else if (limit_num <= 128) {
      IMPL_CONCAT_WITH_WARPPER(128);
    } else {
      IMPL_CONCAT_WITH_WARPPER(0);
    }
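For reference, if the suggested capacity check were added inside the wrapper constructor anyway, it could look like the following sketch (assuming Paddle's usual PADDLE_ENFORCE_GE macro and phi::errors helpers; the message text is illustrative):

// Hypothetical guard, not part of the PR: reject input counts that exceed
// the fixed pointer-array capacity.
PADDLE_ENFORCE_GE(
    Size,
    static_cast<int>(ins.size()),
    phi::errors::InvalidArgument(
        "The fixed pointer-array capacity (%d) must be no smaller than the "
        "number of concat inputs (%d).",
        Size,
        static_cast<int>(ins.size())));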

PointerWarpper(const phi::GPUContext& ctx,
const std::vector<phi::DenseTensor>& ins,
const int64_t& in_num,
const T** inputs_data) {
Contributor

inputs_data -> pre_alloced_host_ptr

Contributor Author

Fixed as suggested. The suggested name is extremely professional; I'm impressed.

for (auto i = 0; i < in_num; ++i) {
inputs_data[i] = ins[i].data<T>();
}
paddle::memory::allocation::AllocationPtr tmp_dev_ins_data;
Contributor

As discussed, change this variable into a class member; the same applies in stack.

Contributor Author

Fixed as suggested; the stack computation will also be covered in the next commit.

template <typename T, typename IndexT, int Size>
struct PointerAndColWarpper {
public:
IndexT col_data[Size];
Contributor

col_data -> col_length; don't name every variable data.

Contributor Author

You're right. Fixed as suggested, and I will change this coding habit going forward.
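For context, a hedged sketch of how a pointer-plus-column-offsets wrapper of this kind typically drives the different-shape concat path (PointerAndColWrapper, col_length, and ConcatDiffShapeKernel below are illustrative names, not the PR's exact code):

// Wrapper for inputs with different column widths: besides the device
// pointers, it carries the running column offsets (prefix sums), so each
// thread can locate which input its output column belongs to.
template <typename T, typename IndexT, int Size>
struct PointerAndColWrapper {
  const T* ins_addr[Size];
  IndexT col_length[Size + 1];  // col_length[0] = 0, col_length[i+1] = sum of widths 0..i
  __device__ inline const T* operator[](int i) const { return ins_addr[i]; }
};

template <typename T, typename IndexT, typename WrapperT>
__global__ void ConcatDiffShapeKernel(WrapperT ins,
                                      int num_ins,
                                      IndexT out_row,
                                      IndexT out_col,
                                      T* out) {
  for (IndexT tid_x = blockIdx.x * blockDim.x + threadIdx.x; tid_x < out_col;
       tid_x += gridDim.x * blockDim.x) {
    // Walk the column offsets to find the input containing column tid_x.
    int split = 0;
    while (split + 1 < num_ins && ins.col_length[split + 1] <= tid_x) ++split;
    IndexT in_col = ins.col_length[split + 1] - ins.col_length[split];
    IndexT in_offset = tid_x - ins.col_length[split];
    const T* input_ptr = ins[split];
    for (IndexT tid_y = blockIdx.y * blockDim.y + threadIdx.y; tid_y < out_row;
         tid_y += gridDim.y * blockDim.y) {
      out[tid_y * out_col + tid_x] = input_ptr[tid_y * in_col + in_offset];
    }
  }
}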

const T* input_ptr = inputs_data[split];
int64_t tid_y = blockIdx.y * blockDim.y + threadIdx.y;
template <typename T, typename IndexT, typename WarpperT>
__global__ void ConcatKernel(WarpperT ins_data,
Contributor

Same as above: this function should be ConcatTensorWithSameShape, and WarpperT should be PointerWarpper.

const T** inputs_data = inputs_data_vec.data();
int64_t* inputs_col = inputs_col_vec.data();
IndexT inputs_col_num = in_num + 1;
std::vector<const T*> inputs_data_vec(in_num, 0);
Contributor

Use nullptr as the initial value.

Contributor Author

It was originally written as std::vector<const T*> inputs_data_vec(in_num, nullptr); it got changed by mistake during repeated revisions.
Fixed.

IndexT inputs_col_num = in_num + 1;
std::vector<const T*> inputs_data_vec(in_num, 0);
std::vector<IndexT> inputs_col_vec(inputs_col_num, 0);
const T** inputs_data = inputs_data_vec.data();
Contributor

inputs_data_vec is no longer needed in many cases, right? This part of the CPU overhead could be optimized.

<<<grid_dims, block_dims, 0, context.stream()>>>( \
ptr_array, in_col, out_row, out_col, output->data<T>());

if (limit_num < 32) {
Contributor

There is a pattern you can refer to, in paddle/phi/backends/gpu/cuda/cuda_device_function.h:

#define CUDA_LAUNCH_KERNEL_BASE(dim, ...)  \
  case (dim): {                            \
    constexpr auto kPowerOfTwoDim = (dim); \
    __VA_ARGS__;                           \
  } break

#define CUDA_LAUNCH_KERNEL_HELPER(...)          \
  CUDA_LAUNCH_KERNEL_BASE(1024, ##__VA_ARGS__); \
  CUDA_LAUNCH_KERNEL_BASE(512, ##__VA_ARGS__);  \
  CUDA_LAUNCH_KERNEL_BASE(256, ##__VA_ARGS__);  \                                                                         
  CUDA_LAUNCH_KERNEL_BASE(128, ##__VA_ARGS__);  \
  CUDA_LAUNCH_KERNEL_BASE(64, ##__VA_ARGS__);   \
  CUDA_LAUNCH_KERNEL_BASE(32, ##__VA_ARGS__);

The call site can then be written as:

switch (platform::RoundToPowerOfTwo(beam_size * num_seqs * 32)) {
        CUDA_LAUNCH_KERNEL_HELPER(
            BeamSearchKernel<kPowerOfTwoDim, kMaxThreadsPerSeq, kMaxSeqs>
            <<<1, num_seqs * kMaxThreadsPerSeq, 0, context.stream()>>>(
                selected_ids_data,
                selected_scores_data,
                parent_idx_data,
                selected_offsets,
                pre_ids_data,
                pre_scores_data,
                ids_data,
                scores_data,
                seq_offsets,
                static_cast<int>(num_seqs),
                static_cast<int>(seq_width),
                static_cast<int>(beam_size),
                end_id,
                is_accumulated,
                num_used_threads));
      }

Contributor Author

Good one! Fixed as suggested.
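A self-contained toy version of this switch-plus-constexpr dispatch pattern, applied to selecting a fixed pointer-array capacity at compile time (FixedArray, DemoKernel, LAUNCH_CASE, and RoundUpPow2 below are illustrative, not Paddle code):

#include <cstdio>
#include <vector>

// Fixed-capacity pointer array whose size is chosen at compile time.
template <int Size>
struct FixedArray {
  const float* ptrs[Size];
};

template <int Size>
__global__ void DemoKernel(FixedArray<Size> arr, int n) {
  if (blockIdx.x == 0 && threadIdx.x == 0) {
    printf("capacity=%d inputs=%d first=%p\n", Size, n, (void*)arr.ptrs[0]);
  }
}

// Each case introduces a constexpr capacity and re-expands the launch code.
#define LAUNCH_CASE(dim, ...)          \
  case (dim): {                        \
    constexpr int kArraySize = (dim);  \
    __VA_ARGS__;                       \
  } break

#define LAUNCH_HELPER(...)             \
  LAUNCH_CASE(64, ##__VA_ARGS__);      \
  LAUNCH_CASE(32, ##__VA_ARGS__);      \
  LAUNCH_CASE(16, ##__VA_ARGS__);      \
  LAUNCH_CASE(8, ##__VA_ARGS__);

// Round n up to the next power of two, but not below min_val.
int RoundUpPow2(int n, int min_val) {
  int v = min_val;
  while (v < n) v <<= 1;
  return v;
}

void Dispatch(const std::vector<const float*>& inputs, cudaStream_t stream) {
  const int n = static_cast<int>(inputs.size());
  switch (RoundUpPow2(n, 8)) {
    LAUNCH_HELPER(
        FixedArray<kArraySize> arr;
        for (int i = 0; i < n; ++i) arr.ptrs[i] = inputs[i];
        DemoKernel<kArraySize><<<1, 32, 0, stream>>>(arr, n));
    default:
      break;  // too many inputs: fall back to a device-memory pointer array
  }
}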

PointerWrapper(const phi::GPUContext& ctx,
const std::vector<phi::DenseTensor>& ins,
const T** pre_alloced_host_ptr,
const AllocatT& tmp_dev_ins_ptr = nullptr) {
Contributor

I still think it is better for tmp_dev_ins_ptr to be a class member; the Allocation and the data held by this class belong together.

Contributor Author

paddle::memory::AllocationPtr tmp_dev_ins_ptr is itself a std::unique_ptr, so passing it in from outside better fits the nature of that pointer.

Contributor Author

One problem with making paddle::memory::AllocationPtr a member variable is that std::unique_ptr is not copyable when the whole struct is copied from the CPU to the GPU, which leads to a compilation error. So it is more appropriate to pass the paddle::memory::AllocationPtr variable in from outside. Below is a more compact way of writing it, passing a non-const reference; however, Paddle forbids non-const references as function parameters. If an exemption is possible, passing a reference here would be best; otherwise we still have to switch to the pattern of passing the std::unique_ptr in from outside:

template <typename T>
struct PointerToPointer {
 public:
  T** ins_addr{nullptr};
  __device__ inline const T* operator[](int i) const { return ins_addr[i]; }

  PointerToPointer() {}
  PointerToPointer(const phi::GPUContext& ctx,
                   const std::vector<phi::DenseTensor>& ins,
                   const T** pre_alloced_host_ptr,
                   paddle::memory::AllocationPtr& dev_ins_ptr) {
    auto in_num = ins.size();
    for (auto i = 0; i < in_num; ++i) {
      pre_alloced_host_ptr[i] = ins[i].data<T>();
    }
    dev_ins_ptr = paddle::memory::Alloc(
        ctx.GetPlace(),
        in_num * sizeof(T*),
        phi::Stream(reinterpret_cast<phi::StreamId>(ctx.stream())));
    auto* restored = phi::backends::gpu::RestoreHostMemIfCapturingCUDAGraph(
        pre_alloced_host_ptr, in_num);
    paddle::memory::Copy(ctx.GetPlace(),
                         dev_ins_ptr->ptr(),
                         phi::CPUPlace(),
                         restored,
                         in_num * sizeof(T*),
                         ctx.stream());
    ins_addr = reinterpret_cast<T**>(dev_ins_ptr->ptr());
  }
};


namespace phi {
namespace funcs {

using AllocatT = paddle::memory::AllocationPtr;
Contributor

AllocatT -> AllocationPtr. Don't use using for unnecessary abbreviations; don't lose the characteristics of the type.

Contributor Author

Fixed as suggested.

};

template <typename T, typename IndexT, typename WrapperT>
__global__ void ConcatKernel_(WrapperT ins_datas,
Contributor

I don't see the changes for the function name and the WrapperT comment?

// Imple int32_t type concat kernel.
switch (phi::backends::gpu::RoundToNextHighPowOfTwo(limit_num, 4)) {
IMPL_CONCATE_CUDA_KERNEL_HELPER(
IMPL_CONCAT_CUDA_KERNEL_CASE,
Contributor

With this extra layer of wrapping, the code really is getting harder and harder to understand...

#endif
}
Contributor

Don't let a single function exceed 100 lines.

Contributor Author

Fixed as suggested.

this->SetDivden(dividen);
IndexT num,
IndexT divisor,
const paddle::memory::AllocationPtr& ins_gpu_ptr) {
Contributor

Define the allocation inside the class.

Contributor Author

One problem with making paddle::memory::AllocationPtr a member variable is that std::unique_ptr is not copyable when the whole struct is copied from the CPU to the GPU, which leads to a compilation error. So it is more appropriate to pass the paddle::memory::AllocationPtr variable in from outside. Below is a more compact way of writing it, passing a pointer:

template <typename T>
struct PointerToPointer {
 public:
  T** ins_addr{nullptr};
  __device__ inline const T* operator[](int i) const { return ins_addr[i]; }

  PointerToPointer() {}
  PointerToPointer(const phi::GPUContext& ctx,
                   const std::vector<phi::DenseTensor>& ins,
                   const T** pre_alloced_host_ptr,
                   paddle::memory::AllocationPtr* dev_ins_ptr) {
    auto in_num = ins.size();
    for (auto i = 0; i < in_num; ++i) {
      pre_alloced_host_ptr[i] = ins[i].data<T>();
    }
    *dev_ins_ptr = paddle::memory::Alloc(
        ctx.GetPlace(),
        in_num * sizeof(T*),
        phi::Stream(reinterpret_cast<phi::StreamId>(ctx.stream())));
    auto* restored = phi::backends::gpu::RestoreHostMemIfCapturingCUDAGraph(
        pre_alloced_host_ptr, in_num);
    paddle::memory::Copy(ctx.GetPlace(),
                         (*dev_ins_ptr)->ptr(),
                         phi::CPUPlace(),
                         restored,
                         in_num * sizeof(T*),
                         ctx.stream());
    ins_addr = reinterpret_cast<T**>((*dev_ins_ptr)->ptr());
  }
};

}
}

#undef IMPL_STACK_CUDA_KERNEL_HELPER
Contributor

What I actually wanted was fewer macros and more reusable components. Your current approach really isn't great.

@JamesLim-sy JamesLim-sy changed the title from "Optimization of H2D data transfer operation in concat ops" to "H2D data transfer optimization for concat kernel" on Dec 29, 2022
Contributor

@Xreki Xreki left a comment

LGTM

@Xreki Xreki merged commit 0de94cd into PaddlePaddle:develop Jan 3, 2023
@MARD1NO MARD1NO mentioned this pull request Jan 5, 2023