Add concat optimization #49540

Merged — merged 4 commits into PaddlePaddle:develop on Jan 9, 2023

Conversation

@MARD1NO (Contributor) commented Jan 4, 2023

PR types

Performance optimization

PR changes

OPs

Describe

Optimize the Concat kernel for multiple inputs using vectorized operations.

Dispatch on IndexType so different index types can be used, saving registers:

With int64 indexing the kernel uses 40 registers and occupancy is capped at 50%.
With int32 indexing it uses 19 registers and occupancy is not limited.
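
A minimal sketch of what such an index-type dispatch can look like (names are illustrative, not the PR's actual helpers): choose 32-bit indexing whenever the element count fits, and fall back to 64-bit only when it does not.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical dispatcher: instantiates the caller's functor with either
// int32_t or int64_t depending on whether 32-bit indexing is safe.
template <typename Func>
void DispatchIndexType(int64_t numel, Func&& func) {
  if (numel <= std::numeric_limits<int32_t>::max()) {
    func(int32_t{0});  // common case: fewer registers, higher occupancy
  } else {
    func(int64_t{0});  // rare case: tensors with more than 2^31 - 1 elements
  }
}

// Usage with a generic lambda (C++14):
//   DispatchIndexType(out_numel, [&](auto tag) {
//     using IndexT = decltype(tag);
//     LaunchConcatKernel<T, IndexT>(/* ... */);
//   });
```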

The test case comes from a TCIR segment; the equivalent code is:

```python
import paddle

x1 = paddle.ones((1, 90, 518400), dtype=paddle.float16)
x2 = paddle.ones((1, 32, 518400), dtype=paddle.float16)
x3 = paddle.ones((1, 32, 518400), dtype=paddle.float16)
x4 = paddle.ones((1, 32, 518400), dtype=paddle.float16)
x5 = paddle.ones((1, 32, 518400), dtype=paddle.float16)
out = paddle.concat([x1, x2, x3, x4, x5], axis=1)
```
| Paddle Before (us) | Paddle After (us) | Paddle PR49310 After (us) | OneFlow (us) | PyTorch (us) |
| --- | --- | --- | --- | --- |
| 645.28 | 360 | 320 | 124+41+41+41+41 = 288 | 519 |

The reason for the slower speed is #49040. That PR specialized some structs to save H2D copy time. However, accessing an array inside the kernel with a subscript that is only known at run time forces the kernel to first store the struct in local memory (which is effectively global memory), generating extra memory accesses.

With the previous const T** access pattern, no local-memory reads or writes occur. Saving the H2D copy time and avoiding local-memory traffic are mutually exclusive; only one can be chosen.
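
A hedged sketch of the two access patterns being compared (simplified, with hypothetical names); the point is where the pointer table lives, not the concat logic itself:

```cpp
// Pattern A (what #49040 introduced): pointers passed by value inside a
// kernel argument struct. No separate H2D copy is needed, but indexing
// ins.addrs[] with a runtime subscript spills the whole struct to
// thread-local memory, which is physically backed by global memory.
template <typename T, int N>
struct PointerArray {
  const T* addrs[N];
};

template <typename T, int N>
__global__ void ConcatByValue(PointerArray<T, N> ins, T* out) {
  // A runtime subscript like ins.addrs[runtime_idx] forces 'ins'
  // into local memory.
}

// Pattern B (the previous scheme): a pointer table in global memory,
// uploaded beforehand with an explicit H2D copy. ins[runtime_idx] is
// then a plain global-memory load and nothing spills to local memory.
template <typename T>
__global__ void ConcatByPointer(const T** ins, T* out) {
  // const T* src = ins[runtime_idx];  // ordinary global-memory load
}
```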

@paddle-bot (bot) commented Jan 4, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@@ -5,6 +5,8 @@ add_subdirectory(detail)

math_library(deformable_conv_functor DEPS dense_tensor)
math_library(concat_and_split_functor DEPS dense_tensor)
target_compile_definitions(concat_and_split_functor
MARD1NO (Contributor, Author) commented:

This part deals with differences between MSVC compiler versions in handling aligned storage; see the related issue ceres-solver/ceres-solver#481:

> Use #define _ENABLE_EXTENDED_ALIGNED_STORAGE
> This would make the alignment work correctly but has the drawback that the compiled library is no longer ABI compatible with code compiled with an older version of MSVC.
> Use #define _DISABLE_EXTENDED_ALIGNED_STORAGE
> This would use the wrong alignment, but mixing the code with code compiled from a different version would work.
  1. Enabling _ENABLE_EXTENDED_ALIGNED_STORAGE uses the correct aligned type, but is not ABI compatible with older MSVC.
  2. Enabling _DISABLE_EXTENDED_ALIGNED_STORAGE uses the wrong aligned type, but guarantees compatibility.

In summary, we keep the correct aligned type and enable _ENABLE_EXTENDED_ALIGNED_STORAGE.
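
For illustration only (not the PR's code), the MSVC behavior can be checked with a static_assert on an over-aligned std::aligned_storage type; on affected MSVC versions this either fails or the compiler demands one of the two macros above:

```cpp
#include <type_traits>

// Requests 16-byte alignment, which exceeds alignof(max_align_t) on MSVC.
using Vec16 = std::aligned_storage<16, 16>::type;

// Holds with _ENABLE_EXTENDED_ALIGNED_STORAGE; older MSVC (or
// _DISABLE_EXTENDED_ALIGNED_STORAGE) yields the weaker pre-fix alignment.
static_assert(alignof(Vec16) == 16,
              "extended aligned storage is required for correct alignment");
```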

int block_rows = kThreadsPerBlock / block_cols;
*block_dims = dim3(block_cols, block_rows, 1);

constexpr int waves = 1;
MARD1NO (Contributor, Author) commented:

Keep using a single wave: local memory is private to each thread, so launching more threads would only increase the amount of local-memory reads and writes.
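
A rough sketch of how a wave-based grid size can be derived (illustrative only; real occupancy is also limited by registers and shared memory, which this ignores):

```cpp
#include <cuda_runtime.h>

// One "wave" = every SM running its maximum number of resident blocks
// exactly once. With waves == 1, each thread's private local-memory
// buffer is touched as few times as possible.
inline int GridSizeForWaves(const cudaDeviceProp& prop,
                            int threads_per_block, int waves) {
  int blocks_per_sm = prop.maxThreadsPerMultiProcessor / threads_per_block;
  return prop.multiProcessorCount * blocks_per_sm * waves;
}
```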

template <typename T, int Size>
struct PointerWrapper {
public:
const T* ins_addr[Size];
__device__ inline const T* operator[](int i) const { return ins_addr[i]; }
const void* ins_addr[Size];
MARD1NO (Contributor, Author) commented:

Changed to void* to erase the type T; subsequent accesses go through aligned_storage. A single template parameter MovSize replaces the T + VecSize pair, reducing the number of template specializations.
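
A hedged sketch of the idea (hypothetical names): once T is erased to void*, the copy loop only needs the chunk width in bytes, so every (T, VecSize) pair with the same sizeof(T) * VecSize shares one instantiation — e.g. float x 4 and half x 8 both become MovSize == 16.

```cpp
#include <type_traits>

template <int Size>
struct ErasedPointerWrapper {
  const void* ins_addr[Size];
  __device__ const void* operator[](int i) const { return ins_addr[i]; }
};

// One instantiation per MovSize instead of one per (T, VecSize) pair.
template <int MovSize, int Size>
__global__ void CopyChunks(ErasedPointerWrapper<Size> ins, void* out,
                           int in_idx, int num_chunks) {
  using VecT = typename std::aligned_storage<MovSize, MovSize>::type;
  const VecT* src = reinterpret_cast<const VecT*>(ins[in_idx]);
  VecT* dst = reinterpret_cast<VecT*>(out);
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < num_chunks;
       i += blockDim.x * gridDim.x) {
    dst[i] = src[i];  // one MovSize-byte move, independent of T
  }
}
```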

#undef IMPL_COMPLEX_CONCAT_CUDA_KERNEL_CASE
}

template <typename T, typename IndexT>
MARD1NO (Contributor, Author) commented:

The specialized template is dispatched according to the computed mov size.
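
A minimal sketch (illustrative names) of mapping the runtime mov_size onto the compile-time MovSize specializations:

```cpp
#include <type_traits>

template <typename Launcher>
void DispatchMovSize(int mov_size, Launcher&& launch) {
  switch (mov_size) {
    case 16: launch(std::integral_constant<int, 16>{}); break;
    case 8:  launch(std::integral_constant<int, 8>{});  break;
    case 4:  launch(std::integral_constant<int, 4>{});  break;
    case 2:  launch(std::integral_constant<int, 2>{});  break;
    default: launch(std::integral_constant<int, 1>{});  break;
  }
}

// Usage:
//   DispatchMovSize(mov_size, [&](auto tag) {
//     constexpr int kMovSize = decltype(tag)::value;
//     CopyChunks<kMovSize, kMaxInputs><<<grid, block>>>(/* ... */);
//   });
```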

inputs_col[idx] /= dispatch_vec_size;
}
const IndexT mov_size = sizeof(T) * dispatch_vec_size;
if (has_same_shape) {
MARD1NO (Contributor, Author) commented:

If has_same_shape is true, every input has the same in_cols, so inputs_col[1] is passed directly (inputs_col[0] is 0).
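
An illustration of the prefix-sum layout this relies on (assumed shapes and values; not the PR's code): with n same-shape inputs of c columns each, inputs_col is {0, c, 2c, ..., n*c}, so inputs_col[1] is the shared per-input column width.

```cpp
#include <cassert>
#include <vector>

int main() {
  const int n = 5;            // number of inputs
  const long long c = 1024;   // columns per input (same-shape case)
  std::vector<long long> inputs_col(n + 1, 0);
  for (int i = 1; i <= n; ++i) inputs_col[i] = inputs_col[i - 1] + c;
  assert(inputs_col[0] == 0);  // leading zero
  assert(inputs_col[1] == c);  // the value passed in the same-shape path
  return 0;
}
```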

@MARD1NO marked this pull request as ready for review January 5, 2023 03:54
Comment on lines 161 to 162
using VecT = typename std::aligned_storage<MovSize, MovSize>::type;
VecT* dst = reinterpret_cast<VecT*>(output);

MARD1NO (Contributor, Author) commented on Jan 6, 2023:

The std::byte mentioned there was only introduced in C++17, so this has been rewritten another way: an alignas struct is used instead, with char in place of std::byte.

```cpp
template <int MovSize>
struct alignas(MovSize) Packed {
  // Empty constructor so the buffer is not default-initialized.
  __device__ Packed() {
    // do nothing
  }
  // The anonymous union suppresses implicit initialization of buf;
  // char replaces the C++17-only std::byte.
  union {
    char buf[MovSize];
  };
};
```

The corresponding CMake flag setting has also been removed.

@qingqing01 requested a review from Xreki January 9, 2023 03:15
@@ -20,27 +20,52 @@ limitations under the License. */
namespace phi {
namespace funcs {

static inline void GetBlockDims(const phi::GPUContext& context,
A reviewer (Contributor) commented:

Is this function identical to what it was before the change?

MARD1NO (Contributor, Author) replied:

No difference; it was just moved earlier in the file.

@qingqing01 merged commit 1a0b366 into PaddlePaddle:develop Jan 9, 2023