Add concat optimization #49540

Merged — merged 4 commits into PaddlePaddle:develop on Jan 9, 2023

Conversation

@MARD1NO (Contributor) commented Jan 4, 2023

PR types

Performance optimization

PR changes

OPs

Describe

Optimize the Concat kernel for multiple inputs using vectorized operations.

Dispatch on IndexType so different index types can be used, saving registers:

With int64 indexing the kernel uses 40 registers and occupancy is capped at 50%.
With int32 indexing it uses 19 registers and occupancy is not limited.
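
A minimal sketch of what such an index-type dispatch can look like (names are illustrative, not the PR's actual helpers): choose 32-bit indexing whenever the element count fits, and fall back to 64-bit only when it does not.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical dispatcher: instantiates the caller's functor with either
// int32_t or int64_t depending on whether 32-bit indexing is safe.
template <typename Func>
void DispatchIndexType(int64_t numel, Func&& func) {
  if (numel <= std::numeric_limits<int32_t>::max()) {
    func(int32_t{0});  // common case: fewer registers, higher occupancy
  } else {
    func(int64_t{0});  // rare case: tensors with more than 2^31 - 1 elements
  }
}

// Usage with a generic lambda (C++14):
//   DispatchIndexType(out_numel, [&](auto tag) {
//     using IndexT = decltype(tag);
//     LaunchConcatKernel<T, IndexT>(/* ... */);
//   });
```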

The test case comes from a TCIR segment; the equivalent code is:

```python
import paddle

x1 = paddle.ones((1, 90, 518400), dtype=paddle.float16)
x2 = paddle.ones((1, 32, 518400), dtype=paddle.float16)
x3 = paddle.ones((1, 32, 518400), dtype=paddle.float16)
x4 = paddle.ones((1, 32, 518400), dtype=paddle.float16)
x5 = paddle.ones((1, 32, 518400), dtype=paddle.float16)
out = paddle.concat([x1, x2, x3, x4, x5], axis=1)
```
| Paddle Before (us) | Paddle After (us) | Paddle PR49310 After (us) | OneFlow (us) | PyTorch (us) |
| --- | --- | --- | --- | --- |
| 645.28 | 360 | 320 | 124+41+41+41+41 = 288 | 519 |

The reason for the slower speed is #49040. That PR specialized some structs to save H2D copy time. However, accessing an array inside the kernel with a subscript that is only known at run time forces the kernel to first store the struct in local memory (which is effectively global memory), generating extra memory accesses.

With the previous const T** access pattern, no local-memory reads or writes occur. Saving the H2D copy time and avoiding local-memory traffic are mutually exclusive; only one can be chosen.
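
A hedged sketch of the two access patterns being compared (simplified, with hypothetical names); the point is where the pointer table lives, not the concat logic itself:

```cpp
// Pattern A (what #49040 introduced): pointers passed by value inside a
// kernel argument struct. No separate H2D copy is needed, but indexing
// ins.addrs[] with a runtime subscript spills the whole struct to
// thread-local memory, which is physically backed by global memory.
template <typename T, int N>
struct PointerArray {
  const T* addrs[N];
};

template <typename T, int N>
__global__ void ConcatByValue(PointerArray<T, N> ins, T* out) {
  // A runtime subscript like ins.addrs[runtime_idx] forces 'ins'
  // into local memory.
}

// Pattern B (the previous scheme): a pointer table in global memory,
// uploaded beforehand with an explicit H2D copy. ins[runtime_idx] is
// then a plain global-memory load and nothing spills to local memory.
template <typename T>
__global__ void ConcatByPointer(const T** ins, T* out) {
  // const T* src = ins[runtime_idx];  // ordinary global-memory load
}
```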

@paddle-bot (bot) commented Jan 4, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@@ -5,6 +5,8 @@ add_subdirectory(detail)

math_library(deformable_conv_functor DEPS dense_tensor)
math_library(concat_and_split_functor DEPS dense_tensor)
target_compile_definitions(concat_and_split_functor
MARD1NO (Contributor, Author) commented:

This part deals with differences between MSVC compiler versions in handling aligned storage; see the related issue ceres-solver/ceres-solver#481:

> Use #define _ENABLE_EXTENDED_ALIGNED_STORAGE
> This would make the alignment work correctly but has the drawback that the compiled library is no longer ABI compatible with code compiled with an older version of MSVC.
> Use #define _DISABLE_EXTENDED_ALIGNED_STORAGE
> This would use the wrong alignment, but mixing the code with code compiled from a different version would work.
  1. Enabling _ENABLE_EXTENDED_ALIGNED_STORAGE uses the correct aligned type, but is not ABI compatible with older MSVC.
  2. Enabling _DISABLE_EXTENDED_ALIGNED_STORAGE uses the wrong aligned type, but guarantees compatibility.

In summary, we keep the correct aligned type and enable _ENABLE_EXTENDED_ALIGNED_STORAGE.
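
For illustration only (not the PR's code), the MSVC behavior can be checked with a static_assert on an over-aligned std::aligned_storage type; on affected MSVC versions this either fails or the compiler demands one of the two macros above:

```cpp
#include <type_traits>

// Requests 16-byte alignment, which exceeds alignof(max_align_t) on MSVC.
using Vec16 = std::aligned_storage<16, 16>::type;

// Holds with _ENABLE_EXTENDED_ALIGNED_STORAGE; older MSVC (or
// _DISABLE_EXTENDED_ALIGNED_STORAGE) yields the weaker pre-fix alignment.
static_assert(alignof(Vec16) == 16,
              "extended aligned storage is required for correct alignment");
```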

int block_rows = kThreadsPerBlock / block_cols;
*block_dims = dim3(block_cols, block_rows, 1);

constexpr int waves = 1;
MARD1NO (Contributor, Author) commented:

Keep using a single wave: local memory is private to each thread, so launching more threads would only increase the amount of local-memory reads and writes.
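
A rough sketch of how a wave-based grid size can be derived (illustrative only; real occupancy is also limited by registers and shared memory, which this ignores):

```cpp
#include <cuda_runtime.h>

// One "wave" = every SM running its maximum number of resident blocks
// exactly once. With waves == 1, each thread's private local-memory
// buffer is touched as few times as possible.
inline int GridSizeForWaves(const cudaDeviceProp& prop,
                            int threads_per_block, int waves) {
  int blocks_per_sm = prop.maxThreadsPerMultiProcessor / threads_per_block;
  return prop.multiProcessorCount * blocks_per_sm * waves;
}
```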

template <typename T, int Size>
struct PointerWrapper {
public:
const T* ins_addr[Size];
__device__ inline const T* operator[](int i) const { return ins_addr[i]; }
const void* ins_addr[Size];
MARD1NO (Contributor, Author) commented:

Changed to void* to erase the type T; subsequent accesses go through aligned_storage. A single template parameter MovSize replaces the T + VecSize pair, reducing the number of template specializations.
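
A hedged sketch of the idea (hypothetical names): once T is erased to void*, the copy loop only needs the chunk width in bytes, so every (T, VecSize) pair with the same sizeof(T) * VecSize shares one instantiation — e.g. float x 4 and half x 8 both become MovSize == 16.

```cpp
#include <type_traits>

template <int Size>
struct ErasedPointerWrapper {
  const void* ins_addr[Size];
  __device__ const void* operator[](int i) const { return ins_addr[i]; }
};

// One instantiation per MovSize instead of one per (T, VecSize) pair.
template <int MovSize, int Size>
__global__ void CopyChunks(ErasedPointerWrapper<Size> ins, void* out,
                           int in_idx, int num_chunks) {
  using VecT = typename std::aligned_storage<MovSize, MovSize>::type;
  const VecT* src = reinterpret_cast<const VecT*>(ins[in_idx]);
  VecT* dst = reinterpret_cast<VecT*>(out);
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < num_chunks;
       i += blockDim.x * gridDim.x) {
    dst[i] = src[i];  // one MovSize-byte move, independent of T
  }
}
```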

#undef IMPL_COMPLEX_CONCAT_CUDA_KERNEL_CASE
}

template <typename T, typename IndexT>
MARD1NO (Contributor, Author) commented:

The specialized template is dispatched according to the computed mov size.
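
A minimal sketch (illustrative names) of mapping the runtime mov_size onto the compile-time MovSize specializations:

```cpp
#include <type_traits>

template <typename Launcher>
void DispatchMovSize(int mov_size, Launcher&& launch) {
  switch (mov_size) {
    case 16: launch(std::integral_constant<int, 16>{}); break;
    case 8:  launch(std::integral_constant<int, 8>{});  break;
    case 4:  launch(std::integral_constant<int, 4>{});  break;
    case 2:  launch(std::integral_constant<int, 2>{});  break;
    default: launch(std::integral_constant<int, 1>{});  break;
  }
}

// Usage:
//   DispatchMovSize(mov_size, [&](auto tag) {
//     constexpr int kMovSize = decltype(tag)::value;
//     CopyChunks<kMovSize, kMaxInputs><<<grid, block>>>(/* ... */);
//   });
```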

inputs_col[idx] /= dispatch_vec_size;
}
const IndexT mov_size = sizeof(T) * dispatch_vec_size;
if (has_same_shape) {
MARD1NO (Contributor, Author) commented:

If has_same_shape is true, every input has the same in_cols, so inputs_col[1] is passed directly (inputs_col[0] is 0).
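
An illustration of the prefix-sum layout this relies on (assumed shapes and values; not the PR's code): with n same-shape inputs of c columns each, inputs_col is {0, c, 2c, ..., n*c}, so inputs_col[1] is the shared per-input column width.

```cpp
#include <cassert>
#include <vector>

int main() {
  const int n = 5;            // number of inputs
  const long long c = 1024;   // columns per input (same-shape case)
  std::vector<long long> inputs_col(n + 1, 0);
  for (int i = 1; i <= n; ++i) inputs_col[i] = inputs_col[i - 1] + c;
  assert(inputs_col[0] == 0);  // leading zero
  assert(inputs_col[1] == c);  // the value passed in the same-shape path
  return 0;
}
```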

@MARD1NO marked this pull request as ready for review January 5, 2023 03:54
Comment on lines 161 to 162
using VecT = typename std::aligned_storage<MovSize, MovSize>::type;
VecT* dst = reinterpret_cast<VecT*>(output);

MARD1NO (Contributor, Author) commented on Jan 6, 2023:

The std::byte mentioned there was only introduced in C++17, so this has been rewritten another way: an alignas struct is used instead, with char in place of std::byte.

```cpp
template <int MovSize>
struct alignas(MovSize) Packed {
  // Empty constructor so the buffer is not default-initialized.
  __device__ Packed() {
    // do nothing
  }
  // The anonymous union suppresses implicit initialization of buf;
  // char replaces the C++17-only std::byte.
  union {
    char buf[MovSize];
  };
};
```

The corresponding CMake flag setting has also been removed.

@qingqing01 requested a review from Xreki January 9, 2023 03:15
@@ -20,27 +20,52 @@ limitations under the License. */
namespace phi {
namespace funcs {

static inline void GetBlockDims(const phi::GPUContext& context,
A reviewer (Contributor) commented:

Is this function identical to what it was before the change?

MARD1NO (Contributor, Author) replied:

No difference; it was just moved earlier in the file.

@qingqing01 merged commit 1a0b366 into PaddlePaddle:develop Jan 9, 2023