
H2D data transfer optimization for concat kernel #49040

Merged

Conversation

JamesLim-sy
Contributor

@JamesLim-sy JamesLim-sy commented Dec 13, 2022

PR types

Function optimization

PR changes

OPs

Describe

  1. Currently, the addresses of the input tensors are transferred from host to device memory through the cudaMemcpyAsync runtime API, which incurs a significant H2D transfer cost. In most cases, this transfer can be replaced by simply passing a struct-type argument to the global (kernel) function. There are many concat op cases in the alpha_fold model during training, and they can be divided into the following groups:

    1. Concat with 2 inputs: 59.26%
    2. Concat with 3 inputs: 7.09%
    3. Concat with 4 inputs: 20.62%
    4. Concat with 6 inputs: 8.68%
    5. Concat with 8 inputs: 4.35%
    6. Concat with same-shape inputs: 89.89%
    7. Concat with different-shape inputs: 10.11% (only in the 2-input and 3-input cases)

    Almost 23.14% of the concat cases in alpha_fold need cudaMemcpyAsync, so there is a good chance of using a struct type to reduce the CPU cost of this data transfer in the concat op (a minimal sketch of the idea is shown after this list).

  2. Many concat cases work well with an int32_t index type, and making all cases use int64_t would hurt CUDA kernel performance when computing thread indices with int64_t. So the int32_t cases are separated from the int64_t cases.
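A minimal, self-contained sketch of the idea (illustrative names only, not the PR's exact code): the input pointers are stored in a fixed-size struct that is passed to the kernel by value as a launch argument, so no cudaMemcpyAsync staging of the pointer array is needed, and the kernel is templated on IndexT so the launcher can pick int32_t when the flattened output fits in 32 bits and int64_t otherwise.

#include <cuda_runtime.h>
#include <vector>

// Fixed-capacity wrapper passed by value to the kernel; holds device pointers.
template <typename T, int Size>
struct PointerWrapper {
  const T* ins_addr[Size];
  __device__ inline const T* operator[](int i) const { return ins_addr[i]; }
};

// Concat of equally shaped inputs (each out_row x in_col) along the column axis.
template <typename T, typename IndexT, typename WrapperT>
__global__ void ConcatSameShapeKernel(WrapperT ins,
                                      IndexT in_col,
                                      IndexT out_row,
                                      IndexT out_col,
                                      T* out) {
  for (IndexT tid_x = blockIdx.x * blockDim.x + threadIdx.x; tid_x < out_col;
       tid_x += gridDim.x * blockDim.x) {
    IndexT split = tid_x / in_col;              // which input this column maps to
    IndexT in_offset = tid_x - split * in_col;  // column index inside that input
    const T* input_ptr = ins[split];
    for (IndexT tid_y = blockIdx.y * blockDim.y + threadIdx.y; tid_y < out_row;
         tid_y += gridDim.y * blockDim.y) {
      out[tid_y * out_col + tid_x] = input_ptr[tid_y * in_col + in_offset];
    }
  }
}

// Launch sketch: the wrapper is filled on the host and passed as a kernel argument.
// Assumes inputs.size() <= Size; larger input counts would fall back to a
// device-memory pointer array.
template <typename T, typename IndexT, int Size>
void LaunchConcatSameShape(const std::vector<const T*>& inputs,
                           IndexT in_col,
                           IndexT out_row,
                           T* out,
                           cudaStream_t stream) {
  PointerWrapper<T, Size> wrapper;
  for (size_t i = 0; i < inputs.size(); ++i) wrapper.ins_addr[i] = inputs[i];
  IndexT out_col = static_cast<IndexT>(inputs.size()) * in_col;
  dim3 block(32, 8);
  unsigned grid_y = static_cast<unsigned>((out_row + block.y - 1) / block.y);
  if (grid_y > 65535u) grid_y = 65535u;  // grid-stride loop covers the rest
  dim3 grid(static_cast<unsigned>((out_col + block.x - 1) / block.x), grid_y);
  ConcatSameShapeKernel<T, IndexT, PointerWrapper<T, Size>>
      <<<grid, block, 0, stream>>>(wrapper, in_col, out_row, out_col, out);
}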

@paddle-bot

paddle-bot bot commented Dec 13, 2022

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@JamesLim-sy JamesLim-sy force-pushed the memcpy_optimization_for_concat_op branch from c5af993 to 10cdde2 on December 14, 2022 12:25
@@ -19,111 +19,155 @@ limitations under the License. */
namespace phi {
namespace funcs {

template <typename T, int Size>
struct PointerWarpper {
Contributor

Is it a typo? Is it a PointerWrapper?

Contributor Author

Oh... it is a typo; it will be fixed in the next commit.

const IndexT out_cols,
T* output_data) {
CUDA_KERNEL_LOOP_TYPE(tid_x, out_cols, IndexT) {
IndexT split = tid_x * 1.0 / fixed_in_col;
Contributor

If possible, could 1.0/fixed_in_col also be passed by kernel args?

Contributor Author

Good one, will be fixed in next commit.

Contributor

@Xreki Xreki left a comment

I would still suggest extracting the PointerArray logic into a separate component so that concat and stack can share it. In stack, PointerArray does not need to be tightly coupled with DivMod; they can be split apart and passed as separate parameters.

PointerWarpper(const phi::GPUContext& ctx,
const std::vector<phi::DenseTensor>& ins,
const int64_t& in_num,
const T** inputs_data) {
Contributor

Aren't ctx and inputs_data unused here? I assume these parameters are kept to support the other case; would it be better to define another constructor with a different parameter list?

Contributor Author

This is a compromise so that the macros later on can use a uniform signature.

PointerWarpper() {}
PointerWarpper(const phi::GPUContext& ctx,
const std::vector<phi::DenseTensor>& ins,
const int64_t& in_num,
Contributor

There is no need to pass in_num as a parameter either; it can be obtained directly via ins.size().

Contributor Author

Fixed as suggested.

const std::vector<phi::DenseTensor>& ins,
const int64_t& in_num,
const T** inputs_data) {
for (auto i = 0; i < in_num; ++i) {
Contributor

Please add a check that Size >= ins.size().

Contributor Author

The code later on already makes this guarantee before using it here, so adding another check does not seem necessary:

    if (limit_num < 8) {
      IMPL_CONCAT_WITH_WARPPER(8);
    } else if (limit_num < 16) {
      IMPL_CONCAT_WITH_WARPPER(16);
    } else if (limit_num < 32) {
      IMPL_CONCAT_WITH_WARPPER(32);
    } else if (limit_num < 64) {
      IMPL_CONCAT_WITH_WARPPER(64);
    } else if (limit_num <= 128) {
      IMPL_CONCAT_WITH_WARPPER(128);
    } else {
      IMPL_CONCAT_WITH_WARPPER(0);
    }
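For reference, if the suggested capacity check were added inside the wrapper constructor anyway, it could look like the following sketch (assuming Paddle's usual PADDLE_ENFORCE_GE macro and phi::errors helpers; the message text is illustrative):

// Hypothetical guard, not part of the PR: reject input counts that exceed
// the fixed pointer-array capacity.
PADDLE_ENFORCE_GE(
    Size,
    static_cast<int>(ins.size()),
    phi::errors::InvalidArgument(
        "The fixed pointer-array capacity (%d) must be no smaller than the "
        "number of concat inputs (%d).",
        Size,
        static_cast<int>(ins.size())));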

PointerWarpper(const phi::GPUContext& ctx,
const std::vector<phi::DenseTensor>& ins,
const int64_t& in_num,
const T** inputs_data) {
Contributor

inputs_data -> pre_alloced_host_ptr

Contributor Author

Fixed as suggested. The suggested name is extremely professional; I'm impressed.

for (auto i = 0; i < in_num; ++i) {
inputs_data[i] = ins[i].data<T>();
}
paddle::memory::allocation::AllocationPtr tmp_dev_ins_data;
Contributor

As discussed, change this variable into a class member; the same applies in stack.

Contributor Author

Fixed as suggested; the stack computation will also be covered in the next commit.

template <typename T, typename IndexT, int Size>
struct PointerAndColWarpper {
public:
IndexT col_data[Size];
Contributor

col_data -> col_length; don't name every variable data.

Contributor Author

You're right. Fixed as suggested, and I will change this coding habit going forward.
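For context, a hedged sketch of how a pointer-plus-column-offsets wrapper of this kind typically drives the different-shape concat path (PointerAndColWrapper, col_length, and ConcatDiffShapeKernel below are illustrative names, not the PR's exact code):

// Wrapper for inputs with different column widths: besides the device
// pointers, it carries the running column offsets (prefix sums), so each
// thread can locate which input its output column belongs to.
template <typename T, typename IndexT, int Size>
struct PointerAndColWrapper {
  const T* ins_addr[Size];
  IndexT col_length[Size + 1];  // col_length[0] = 0, col_length[i+1] = sum of widths 0..i
  __device__ inline const T* operator[](int i) const { return ins_addr[i]; }
};

template <typename T, typename IndexT, typename WrapperT>
__global__ void ConcatDiffShapeKernel(WrapperT ins,
                                      int num_ins,
                                      IndexT out_row,
                                      IndexT out_col,
                                      T* out) {
  for (IndexT tid_x = blockIdx.x * blockDim.x + threadIdx.x; tid_x < out_col;
       tid_x += gridDim.x * blockDim.x) {
    // Walk the column offsets to find the input containing column tid_x.
    int split = 0;
    while (split + 1 < num_ins && ins.col_length[split + 1] <= tid_x) ++split;
    IndexT in_col = ins.col_length[split + 1] - ins.col_length[split];
    IndexT in_offset = tid_x - ins.col_length[split];
    const T* input_ptr = ins[split];
    for (IndexT tid_y = blockIdx.y * blockDim.y + threadIdx.y; tid_y < out_row;
         tid_y += gridDim.y * blockDim.y) {
      out[tid_y * out_col + tid_x] = input_ptr[tid_y * in_col + in_offset];
    }
  }
}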

const T* input_ptr = inputs_data[split];
int64_t tid_y = blockIdx.y * blockDim.y + threadIdx.y;
template <typename T, typename IndexT, typename WarpperT>
__global__ void ConcatKernel(WarpperT ins_data,
Contributor

Same as above: this function should be ConcatTensorWithSameShape, and WarpperT should be PointerWarpper.

const T** inputs_data = inputs_data_vec.data();
int64_t* inputs_col = inputs_col_vec.data();
IndexT inputs_col_num = in_num + 1;
std::vector<const T*> inputs_data_vec(in_num, 0);
Contributor

Use nullptr as the initial value.

Contributor Author

It was originally written as std::vector<const T*> inputs_data_vec(in_num, nullptr); it got changed by mistake during repeated revisions.
Fixed.

IndexT inputs_col_num = in_num + 1;
std::vector<const T*> inputs_data_vec(in_num, 0);
std::vector<IndexT> inputs_col_vec(inputs_col_num, 0);
const T** inputs_data = inputs_data_vec.data();
Contributor

inputs_data_vec is no longer needed in many cases, right? This part of the CPU overhead could be optimized.

<<<grid_dims, block_dims, 0, context.stream()>>>( \
ptr_array, in_col, out_row, out_col, output->data<T>());

if (limit_num < 32) {
Contributor

There is a pattern you can refer to, in paddle/phi/backends/gpu/cuda/cuda_device_function.h:

#define CUDA_LAUNCH_KERNEL_BASE(dim, ...)  \
  case (dim): {                            \
    constexpr auto kPowerOfTwoDim = (dim); \
    __VA_ARGS__;                           \
  } break

#define CUDA_LAUNCH_KERNEL_HELPER(...)          \
  CUDA_LAUNCH_KERNEL_BASE(1024, ##__VA_ARGS__); \
  CUDA_LAUNCH_KERNEL_BASE(512, ##__VA_ARGS__);  \
  CUDA_LAUNCH_KERNEL_BASE(256, ##__VA_ARGS__);  \                                                                         
  CUDA_LAUNCH_KERNEL_BASE(128, ##__VA_ARGS__);  \
  CUDA_LAUNCH_KERNEL_BASE(64, ##__VA_ARGS__);   \
  CUDA_LAUNCH_KERNEL_BASE(32, ##__VA_ARGS__);

The call site can then be written as:

switch (platform::RoundToPowerOfTwo(beam_size * num_seqs * 32)) {
        CUDA_LAUNCH_KERNEL_HELPER(
            BeamSearchKernel<kPowerOfTwoDim, kMaxThreadsPerSeq, kMaxSeqs>
            <<<1, num_seqs * kMaxThreadsPerSeq, 0, context.stream()>>>(
                selected_ids_data,
                selected_scores_data,
                parent_idx_data,
                selected_offsets,
                pre_ids_data,
                pre_scores_data,
                ids_data,
                scores_data,
                seq_offsets,
                static_cast<int>(num_seqs),
                static_cast<int>(seq_width),
                static_cast<int>(beam_size),
                end_id,
                is_accumulated,
                num_used_threads));
      }

Contributor Author

Good one! Fixed as suggested.
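A self-contained toy version of this switch-plus-constexpr dispatch pattern, applied to selecting a fixed pointer-array capacity at compile time (FixedArray, DemoKernel, LAUNCH_CASE, and RoundUpPow2 below are illustrative, not Paddle code):

#include <cstdio>
#include <vector>

// Fixed-capacity pointer array whose size is chosen at compile time.
template <int Size>
struct FixedArray {
  const float* ptrs[Size];
};

template <int Size>
__global__ void DemoKernel(FixedArray<Size> arr, int n) {
  if (blockIdx.x == 0 && threadIdx.x == 0) {
    printf("capacity=%d inputs=%d first=%p\n", Size, n, (void*)arr.ptrs[0]);
  }
}

// Each case introduces a constexpr capacity and re-expands the launch code.
#define LAUNCH_CASE(dim, ...)          \
  case (dim): {                        \
    constexpr int kArraySize = (dim);  \
    __VA_ARGS__;                       \
  } break

#define LAUNCH_HELPER(...)             \
  LAUNCH_CASE(64, ##__VA_ARGS__);      \
  LAUNCH_CASE(32, ##__VA_ARGS__);      \
  LAUNCH_CASE(16, ##__VA_ARGS__);      \
  LAUNCH_CASE(8, ##__VA_ARGS__);

// Round n up to the next power of two, but not below min_val.
int RoundUpPow2(int n, int min_val) {
  int v = min_val;
  while (v < n) v <<= 1;
  return v;
}

void Dispatch(const std::vector<const float*>& inputs, cudaStream_t stream) {
  const int n = static_cast<int>(inputs.size());
  switch (RoundUpPow2(n, 8)) {
    LAUNCH_HELPER(
        FixedArray<kArraySize> arr;
        for (int i = 0; i < n; ++i) arr.ptrs[i] = inputs[i];
        DemoKernel<kArraySize><<<1, 32, 0, stream>>>(arr, n));
    default:
      break;  // too many inputs: fall back to a device-memory pointer array
  }
}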

PointerWrapper(const phi::GPUContext& ctx,
const std::vector<phi::DenseTensor>& ins,
const T** pre_alloced_host_ptr,
const AllocatT& tmp_dev_ins_ptr = nullptr) {
Contributor

I still think it is better for tmp_dev_ins_ptr to be a class member; the Allocation and the data held by this class belong together.

Contributor Author

paddle::memory::AllocationPtr tmp_dev_ins_ptr is itself a std::unique_ptr, so passing it in from outside better fits the nature of that pointer.

Contributor Author

One problem with making paddle::memory::AllocationPtr a member variable is that std::unique_ptr is not copyable when the whole struct is copied from the CPU to the GPU, which leads to a compilation error. So it is more appropriate to pass the paddle::memory::AllocationPtr variable in from outside. Below is a more compact way of writing it, passing a non-const reference; however, Paddle forbids non-const references as function parameters. If an exemption is possible, passing a reference here would be best; otherwise we still have to switch to the pattern of passing the std::unique_ptr in from outside:

template <typename T>
struct PointerToPointer {
 public:
  T** ins_addr{nullptr};
  __device__ inline const T* operator[](int i) const { return ins_addr[i]; }

  PointerToPointer() {}
  PointerToPointer(const phi::GPUContext& ctx,
                   const std::vector<phi::DenseTensor>& ins,
                   const T** pre_alloced_host_ptr,
                   paddle::memory::AllocationPtr& dev_ins_ptr) {
    auto in_num = ins.size();
    for (auto i = 0; i < in_num; ++i) {
      pre_alloced_host_ptr[i] = ins[i].data<T>();
    }
    dev_ins_ptr = paddle::memory::Alloc(
        ctx.GetPlace(),
        in_num * sizeof(T*),
        phi::Stream(reinterpret_cast<phi::StreamId>(ctx.stream())));
    auto* restored = phi::backends::gpu::RestoreHostMemIfCapturingCUDAGraph(
        pre_alloced_host_ptr, in_num);
    paddle::memory::Copy(ctx.GetPlace(),
                         dev_ins_ptr->ptr(),
                         phi::CPUPlace(),
                         restored,
                         in_num * sizeof(T*),
                         ctx.stream());
    ins_addr = reinterpret_cast<T**>(dev_ins_ptr->ptr());
  }
};


namespace phi {
namespace funcs {

using AllocatT = paddle::memory::AllocationPtr;
Contributor

AllocatT -> AllocationPtr. Don't use using for unnecessary abbreviations; don't lose the characteristics of the type.

Contributor Author

Fixed as suggested.

};

template <typename T, typename IndexT, typename WrapperT>
__global__ void ConcatKernel_(WrapperT ins_datas,
Contributor

I don't see the changes for the function name and the WrapperT comment?

// Imple int32_t type concat kernel.
switch (phi::backends::gpu::RoundToNextHighPowOfTwo(limit_num, 4)) {
IMPL_CONCATE_CUDA_KERNEL_HELPER(
IMPL_CONCAT_CUDA_KERNEL_CASE,
Contributor

With this extra layer of wrapping, the code really is getting harder and harder to understand...

#endif
}
Contributor

Don't let a single function exceed 100 lines.

Contributor Author

Fixed as suggested.

this->SetDivden(dividen);
IndexT num,
IndexT divisor,
const paddle::memory::AllocationPtr& ins_gpu_ptr) {
Contributor

Define the allocation inside the class.

Contributor Author

One problem with making paddle::memory::AllocationPtr a member variable is that std::unique_ptr is not copyable when the whole struct is copied from the CPU to the GPU, which leads to a compilation error. So it is more appropriate to pass the paddle::memory::AllocationPtr variable in from outside. Below is a more compact way of writing it, passing a pointer:

template <typename T>
struct PointerToPointer {
 public:
  T** ins_addr{nullptr};
  __device__ inline const T* operator[](int i) const { return ins_addr[i]; }

  PointerToPointer() {}
  PointerToPointer(const phi::GPUContext& ctx,
                   const std::vector<phi::DenseTensor>& ins,
                   const T** pre_alloced_host_ptr,
                   paddle::memory::AllocationPtr* dev_ins_ptr) {
    auto in_num = ins.size();
    for (auto i = 0; i < in_num; ++i) {
      pre_alloced_host_ptr[i] = ins[i].data<T>();
    }
    *dev_ins_ptr = paddle::memory::Alloc(
        ctx.GetPlace(),
        in_num * sizeof(T*),
        phi::Stream(reinterpret_cast<phi::StreamId>(ctx.stream())));
    auto* restored = phi::backends::gpu::RestoreHostMemIfCapturingCUDAGraph(
        pre_alloced_host_ptr, in_num);
    paddle::memory::Copy(ctx.GetPlace(),
                         (*dev_ins_ptr)->ptr(),
                         phi::CPUPlace(),
                         restored,
                         in_num * sizeof(T*),
                         ctx.stream());
    ins_addr = reinterpret_cast<T**>((*dev_ins_ptr)->ptr());
  }
};

}
}

#undef IMPL_STACK_CUDA_KERNEL_HELPER
Contributor

What I actually wanted was fewer macros and more reusable components. Your current approach really isn't great.

@JamesLim-sy JamesLim-sy changed the title from "Optimization of H2D data transfer operation in concat ops" to "H2D data transfer optimization for concat kernel" on Dec 29, 2022
Contributor

@Xreki Xreki left a comment

LGTM

@Xreki Xreki merged commit 0de94cd into PaddlePaddle:develop Jan 3, 2023
@MARD1NO MARD1NO mentioned this pull request Jan 5, 2023