H2D data transfer optimization for concat kernel #49040
Conversation
Your PR was submitted successfully. Thank you for your contribution to the open-source project!
c5af993 to 10cdde2
@@ -19,111 +19,155 @@ limitations under the License. */
namespace phi {
namespace funcs {

template <typename T, int Size>
struct PointerWarpper {
Is it a typo? Should it be PointerWrapper?
Oh... it is a typo; it will be fixed in the next commit.
const IndexT out_cols,
T* output_data) {
CUDA_KERNEL_LOOP_TYPE(tid_x, out_cols, IndexT) {
IndexT split = tid_x * 1.0 / fixed_in_col;
If possible, could 1.0 / fixed_in_col also be passed as a kernel argument?
Good point, it will be fixed in the next commit.
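A minimal sketch of what the reviewer is suggesting (kernel and parameter names here are illustrative, not the PR's final code, and it assumes the headers already used in this file): the host computes the reciprocal once and passes it as a kernel argument, so each thread multiplies instead of dividing.

// Hypothetical kernel fragment: inv_fixed_in_col is computed on the host as
// 1.0f / fixed_in_col and replaces the per-thread divide
// `tid_x * 1.0 / fixed_in_col` with a multiply.
template <typename IndexT>
__global__ void SplitIndexSketch(const IndexT out_cols,
                                 const float inv_fixed_in_col,
                                 IndexT* split_out) {
  CUDA_KERNEL_LOOP_TYPE(tid_x, out_cols, IndexT) {
    split_out[tid_x] = static_cast<IndexT>(tid_x * inv_fixed_in_col);
  }
}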
I'd still suggest extracting the PointerArray logic into a standalone component so that concat and stack can share it. In stack, PointerArray does not need to be tightly coupled with DivMod; they can be split apart and passed as separate parameters.
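A rough sketch of the extraction being suggested, with hypothetical names that are not the PR's final code (assuming the headers of the surrounding file): a standalone fixed-capacity pointer wrapper that both the concat and stack paths could take as a kernel argument by value, with DivMod kept outside and passed separately where stack needs it.

// Hypothetical shared component: holds up to Size device data pointers and
// is passed to kernels by value, so the pointer list travels with the kernel
// launch parameters instead of through a separate cudaMemcpyAsync.
template <typename T, int Size>
struct PointerArraySketch {
  const T* data[Size];

  PointerArraySketch() {}
  explicit PointerArraySketch(const std::vector<phi::DenseTensor>& ins) {
    for (size_t i = 0; i < ins.size(); ++i) {
      data[i] = ins[i].data<T>();
    }
  }
  __device__ inline const T* operator[](int i) const { return data[i]; }
};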
PointerWarpper(const phi::GPUContext& ctx,
               const std::vector<phi::DenseTensor>& ins,
               const int64_t& in_num,
               const T** inputs_data) {
Are ctx and inputs_data unused here? Presumably these parameters are kept to support another case; would it be better to define a separate constructor with a different parameter list?
This is a compromise so that the macros later on can call all wrappers uniformly.
PointerWarpper() {}
PointerWarpper(const phi::GPUContext& ctx,
               const std::vector<phi::DenseTensor>& ins,
               const int64_t& in_num,
There is no need to pass in_num as a parameter either; it can be obtained directly via ins.size().
Fixed as suggested.
const std::vector<phi::DenseTensor>& ins,
const int64_t& in_num,
const T** inputs_data) {
for (auto i = 0; i < in_num; ++i) {
Please add a check that Size >= ins.size().
The call sites shown below already make this guarantee before this constructor is reached, so adding another check here seems unnecessary.
if (limit_num < 8) {
  IMPL_CONCAT_WITH_WARPPER(8);
} else if (limit_num < 16) {
  IMPL_CONCAT_WITH_WARPPER(16);
} else if (limit_num < 32) {
  IMPL_CONCAT_WITH_WARPPER(32);
} else if (limit_num < 64) {
  IMPL_CONCAT_WITH_WARPPER(64);
} else if (limit_num <= 128) {
  IMPL_CONCAT_WITH_WARPPER(128);
} else {
  IMPL_CONCAT_WITH_WARPPER(0);
}
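For reference, the check being requested would amount to a single guard at the top of the constructor; a minimal sketch, assuming PADDLE_ENFORCE_GE and phi::errors are usable in this file (the dispatch above is what the author relies on instead):

// Hypothetical guard (not in the PR): Size is the template capacity,
// ins.size() the runtime input count; the dispatch above already selects a
// wrapper with sufficient capacity.
PADDLE_ENFORCE_GE(
    Size,
    static_cast<int>(ins.size()),
    phi::errors::InvalidArgument(
        "The capacity of the pointer wrapper must not be less than the "
        "number of input tensors, but received %d < %d.",
        Size,
        static_cast<int>(ins.size())));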
PointerWarpper(const phi::GPUContext& ctx,
               const std::vector<phi::DenseTensor>& ins,
               const int64_t& in_num,
               const T** inputs_data) {
inputs_data -> pre_alloced_host_ptr
Fixed as suggested. The suggested name is impressively professional, much respect.
for (auto i = 0; i < in_num; ++i) {
  inputs_data[i] = ins[i].data<T>();
}
paddle::memory::allocation::AllocationPtr tmp_dev_ins_data;
As discussed, change this variable into a class member; the same applies in stack.
Fixed as suggested; the stack computation will also be covered in the next commit.
template <typename T, typename IndexT, int Size>
struct PointerAndColWarpper {
 public:
  IndexT col_data[Size];
col_data -> col_length; don't name every variable data.
You're right. Fixed as suggested, and I'll change this coding habit going forward as well.
const T* input_ptr = inputs_data[split];
int64_t tid_y = blockIdx.y * blockDim.y + threadIdx.y;
template <typename T, typename IndexT, typename WarpperT>
__global__ void ConcatKernel(WarpperT ins_data,
Same as above: should this function be ConcatTensorWithSameShape? Should WarpperT be PointerWarpper?
const T** inputs_data = inputs_data_vec.data();
int64_t* inputs_col = inputs_col_vec.data();
IndexT inputs_col_num = in_num + 1;
std::vector<const T*> inputs_data_vec(in_num, 0);
Use nullptr as the initial value.
It was originally written as std::vector<const T*> inputs_data_vec(in_num, nullptr); it got changed by mistake during repeated revisions. Fixed now.
IndexT inputs_col_num = in_num + 1;
std::vector<const T*> inputs_data_vec(in_num, 0);
std::vector<IndexT> inputs_col_vec(inputs_col_num, 0);
const T** inputs_data = inputs_data_vec.data();
inputs_data_vec is no longer needed in many cases, is it? This part of the CPU overhead could be optimized.
<<<grid_dims, block_dims, 0, context.stream()>>>( \
    ptr_array, in_col, out_row, out_col, output->data<T>());

if (limit_num < 32) {
There is a pattern you can refer to, in paddle/phi/backends/gpu/cuda/cuda_device_function.h:
#define CUDA_LAUNCH_KERNEL_BASE(dim, ...)  \
  case (dim): {                            \
    constexpr auto kPowerOfTwoDim = (dim); \
    __VA_ARGS__;                           \
  } break

#define CUDA_LAUNCH_KERNEL_HELPER(...)          \
  CUDA_LAUNCH_KERNEL_BASE(1024, ##__VA_ARGS__); \
  CUDA_LAUNCH_KERNEL_BASE(512, ##__VA_ARGS__);  \
  CUDA_LAUNCH_KERNEL_BASE(256, ##__VA_ARGS__);  \
  CUDA_LAUNCH_KERNEL_BASE(128, ##__VA_ARGS__);  \
  CUDA_LAUNCH_KERNEL_BASE(64, ##__VA_ARGS__);   \
  CUDA_LAUNCH_KERNEL_BASE(32, ##__VA_ARGS__);
The call site can then be written as:
switch (platform::RoundToPowerOfTwo(beam_size * num_seqs * 32)) {
  CUDA_LAUNCH_KERNEL_HELPER(
      BeamSearchKernel<kPowerOfTwoDim, kMaxThreadsPerSeq, kMaxSeqs>
      <<<1, num_seqs * kMaxThreadsPerSeq, 0, context.stream()>>>(
          selected_ids_data,
          selected_scores_data,
          parent_idx_data,
          selected_offsets,
          pre_ids_data,
          pre_scores_data,
          ids_data,
          scores_data,
          seq_offsets,
          static_cast<int>(num_seqs),
          static_cast<int>(seq_width),
          static_cast<int>(beam_size),
          end_id,
          is_accumulated,
          num_used_threads));
}
Good one! Fixed as suggested.
PointerWrapper(const phi::GPUContext& ctx,
               const std::vector<phi::DenseTensor>& ins,
               const T** pre_alloced_host_ptr,
               const AllocatT& tmp_dev_ins_ptr = nullptr) {
I still think it's better to make tmp_dev_ins_ptr a class member; this Allocation and the data held in this class belong together.
paddle::memory::AllocationPtr tmp_dev_ins_ptr is itself a std::unique_ptr, so passing it in from outside better matches the nature of this pointer.
One problem with making paddle::memory::AllocationPtr a member variable is that when the whole struct is copied from the CPU to the GPU, std::unique_ptr is not copyable, and this causes a compilation error. So it is more appropriate to pass the paddle::memory::AllocationPtr variable in from outside. Below is a fairly compact way to write it, passing a non-const reference; however, Paddle forbids non-const references as function parameters. If an exemption is possible, passing a reference here would be best; otherwise we still have to switch to the pattern of passing in the std::unique_ptr from outside:
template <typename T>
struct PointerToPointer {
 public:
  T** ins_addr{nullptr};
  __device__ inline const T* operator[](int i) const { return ins_addr[i]; }

  PointerToPointer() {}
  PointerToPointer(const phi::GPUContext& ctx,
                   const std::vector<phi::DenseTensor>& ins,
                   const T** pre_alloced_host_ptr,
                   paddle::memory::AllocationPtr& dev_ins_ptr) {
    auto in_num = ins.size();
    for (auto i = 0; i < in_num; ++i) {
      pre_alloced_host_ptr[i] = ins[i].data<T>();
    }
    dev_ins_ptr = paddle::memory::Alloc(
        ctx.GetPlace(),
        in_num * sizeof(T*),
        phi::Stream(reinterpret_cast<phi::StreamId>(ctx.stream())));
    auto* restored = phi::backends::gpu::RestoreHostMemIfCapturingCUDAGraph(
        pre_alloced_host_ptr, in_num);
    paddle::memory::Copy(ctx.GetPlace(),
                         dev_ins_ptr->ptr(),
                         phi::CPUPlace(),
                         restored,
                         in_num * sizeof(T*),
                         ctx.stream());
    ins_addr = reinterpret_cast<T**>(dev_ins_ptr->ptr());
  }
};
namespace phi {
namespace funcs {

using AllocatT = paddle::memory::AllocationPtr;
AllocatT -> AllocationPtr. Don't use using for unnecessary abbreviations; don't lose the characteristics of the type.
Fixed as suggested.
};

template <typename T, typename IndexT, typename WrapperT>
__global__ void ConcatKernel_(WrapperT ins_datas,
I don't see a comment addressing the function name and WrapperT?
// Imple int32_t type concat kernel.
switch (phi::backends::gpu::RoundToNextHighPowOfTwo(limit_num, 4)) {
  IMPL_CONCATE_CUDA_KERNEL_HELPER(
      IMPL_CONCAT_CUDA_KERNEL_CASE,
With this extra layer of wrapping, the code really is getting harder and harder to understand...
#endif
}
Don't let a single function exceed 100 lines.
Fixed as suggested.
this->SetDivden(dividen);
IndexT num,
IndexT divisor,
const paddle::memory::AllocationPtr& ins_gpu_ptr) {
Define the allocation inside the class.
One problem with making paddle::memory::AllocationPtr a member variable is that when the whole struct is copied from the CPU to the GPU, std::unique_ptr is not copyable, and this causes a compilation error. So it is more appropriate to pass the paddle::memory::AllocationPtr variable in from outside. Below is a fairly compact way to write it, passing a pointer:
template <typename T>
struct PointerToPointer {
 public:
  T** ins_addr{nullptr};
  __device__ inline const T* operator[](int i) const { return ins_addr[i]; }

  PointerToPointer() {}
  PointerToPointer(const phi::GPUContext& ctx,
                   const std::vector<phi::DenseTensor>& ins,
                   const T** pre_alloced_host_ptr,
                   paddle::memory::AllocationPtr* dev_ins_ptr) {
    auto in_num = ins.size();
    for (auto i = 0; i < in_num; ++i) {
      pre_alloced_host_ptr[i] = ins[i].data<T>();
    }
    *dev_ins_ptr = paddle::memory::Alloc(
        ctx.GetPlace(),
        in_num * sizeof(T*),
        phi::Stream(reinterpret_cast<phi::StreamId>(ctx.stream())));
    auto* restored = phi::backends::gpu::RestoreHostMemIfCapturingCUDAGraph(
        pre_alloced_host_ptr, in_num);
    paddle::memory::Copy(ctx.GetPlace(),
                         (*dev_ins_ptr)->ptr(),
                         phi::CPUPlace(),
                         restored,
                         in_num * sizeof(T*),
                         ctx.stream());
    ins_addr = reinterpret_cast<T**>((*dev_ins_ptr)->ptr());
  }
};
}
}

#undef IMPL_STACK_CUDA_KERNEL_HELPER
What I'm really hoping for is fewer macros and more reusable components. Your current way of writing this really isn't great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
PR types
Function optimization
PR changes
OPs
Describe
Currently, the input tensors' data addresses are transferred from host to device memory through the cudaMemcpyAsync runtime API, which incurs a large H2D data transfer cost. In most cases this transfer can be replaced by simply passing a structure type as one parameter to the global kernel function. There are a bunch of concat op cases in the alpha_fold model during training, and they can be divided into the groups below:
Almost 23.14% of the concat op cases in alpha_fold need cuMemcpyAsync, so there is a good chance to use a structure type to decrease the CPU-side data transfer cost in the concat op.
Many concat cases work well with an int32_t index type, and making all cases use an int64_t index type would hurt CUDA kernel performance when calculating thread indices with int64_t. So the int32_t cases are handled separately from the int64_t cases.
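A stripped-down, self-contained sketch of the general technique described above, independent of the Paddle wrappers (all names here are illustrative): the input pointers are packed into a small struct that is passed to the kernel by value, so they travel with the kernel launch parameters instead of through a separate cudaMemcpyAsync of a pointer array.

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Illustrative fixed-capacity pointer pack; passed to the kernel by value,
// so no separate H2D copy of the pointer list is required.
template <typename T, int kCapacity>
struct PointerPack {
  const T* ptrs[kCapacity];
};

template <typename T, int kCapacity>
__global__ void GatherFirstElements(PointerPack<T, kCapacity> pack,
                                    int num_inputs,
                                    T* out) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < num_inputs) {
    out[i] = pack.ptrs[i][0];  // read through the packed device pointers
  }
}

int main() {
  constexpr int kNum = 4;
  PointerPack<float, 8> pack{};
  std::vector<float*> dev_inputs(kNum);
  for (int i = 0; i < kNum; ++i) {
    cudaMalloc(&dev_inputs[i], sizeof(float));
    float v = static_cast<float>(i);
    cudaMemcpy(dev_inputs[i], &v, sizeof(float), cudaMemcpyHostToDevice);
    pack.ptrs[i] = dev_inputs[i];  // host fills the struct; no extra H2D copy
  }
  float* dev_out = nullptr;
  cudaMalloc(&dev_out, kNum * sizeof(float));
  GatherFirstElements<<<1, 32>>>(pack, kNum, dev_out);
  float host_out[kNum];
  cudaMemcpy(host_out, dev_out, sizeof(host_out), cudaMemcpyDeviceToHost);
  for (int i = 0; i < kNum; ++i) printf("%f\n", host_out[i]);
  for (int i = 0; i < kNum; ++i) cudaFree(dev_inputs[i]);
  cudaFree(dev_out);
  return 0;
}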