New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add concat optimization #49540
Add concat optimization #49540
Conversation
你的PR提交成功,感谢你对开源项目的贡献! |
@@ -5,6 +5,8 @@ add_subdirectory(detail) | |||
|
|||
math_library(deformable_conv_functor DEPS dense_tensor) | |||
math_library(concat_and_split_functor DEPS dense_tensor) | |||
target_compile_definitions(concat_and_split_functor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这部分是不同版本的MSVC编译器对aligned storage的区别,相关Issue可以看 ceres-solver/ceres-solver#481
Use #define _ENABLE_EXTENDED_ALIGNED_STORAGE
This would make the alignment work correctly but has the drawback, that the compiled library is no longer ABI compatible to code compiled with an older version of MSVC.
Use #define _DISABLE_EXTENDED_ALIGNED_STORAGE
This would use the wrong alignment but mixing the code with code compiled from a different version would work.
- 开启 _ENABLE_EXTENDED_ALIGNED_STORAGE 就会使用正确的align type,但是不兼容旧MSVC
- 开启 _DISABLE_EXTENDED_ALIGNED_STORAGE,则会使用错误的aligntype,但是能保证兼容
综上还是保证正确的align type,开启 _ENABLE_EXTENDED_ALIGNED_STORAGE
int block_rows = kThreadsPerBlock / block_cols; | ||
*block_dims = dim3(block_cols, block_rows, 1); | ||
|
||
constexpr int waves = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
保持使用1个wave,因为local memory属于线程私有的,使用过多线程反而会导致local memory的读写量增加
template <typename T, int Size> | ||
struct PointerWrapper { | ||
public: | ||
const T* ins_addr[Size]; | ||
__device__ inline const T* operator[](int i) const { return ins_addr[i]; } | ||
const void* ins_addr[Size]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
改成void*擦除类型T,后续用aligned_storage访问,使用一个模板参数MovSize代替了T+VecSize,减少模板特化个数
#undef IMPL_COMPLEX_CONCAT_CUDA_KERNEL_CASE | ||
} | ||
|
||
template <typename T, typename IndexT> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
根据计算的movsize进行派发特化的模板
inputs_col[idx] /= dispatch_vec_size; | ||
} | ||
const IndexT mov_size = sizeof(T) * dispatch_vec_size; | ||
if (has_same_shape) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
如果是same_shape,那么所有输入的in_cols都是一样的,这里直接传inputs_col[1],(inputs_col[0]是0)
using VecT = typename std::aligned_storage<MovSize, MovSize>::type; | ||
VecT* dst = reinterpret_cast<VecT*>(output); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
最新c++标准看上去准备废弃aligned_storage,https://stackoverflow.com/questions/71828288/why-is-stdaligned-storage-to-be-deprecated-in-c23-and-what-to-use-instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
里面提到的std::byte只有c++17才引入,现在换成了另外一种写法,用alignas结构体代替,用char代替
template <int MovSize>
struct alignas(MovSize) Packed {
__device__ Packed() {
// do nothing
}
union {
char buf[MovSize];
};
};
也去除了对应的CMake Flag设置
@@ -20,27 +20,52 @@ limitations under the License. */ | |||
namespace phi { | |||
namespace funcs { | |||
|
|||
static inline void GetBlockDims(const phi::GPUContext& context, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这个函数是不是和改动前没区别?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
没区别,只是放到了前面
PR types
Performance optimization
PR changes
OPs
Describe
使用向量化操作优化多个输入的Concat Kernel
使用IndexType Dispatch不同类型的索引,以节省寄存器
int64下用的寄存器40,occupancy被限制到了50%
int32下用的寄存器19,occupancy不受限制
TestCase来自TCIR的一段,等价代码为:
速度下降的原因是 #49040 该PR通过特化一部分结构体来节省了 H2D 拷贝耗时。但是在Kernel里访问数组用运行时才能确定的下标,会导致Kernel先将该结构体存到local memory(其实算Global Memory)中,进而产生了更多的内存访问。
而如果是按照之前的以 const T** 访问,则不会发生local memory的读写。节省H2D拷贝时间和避免local memory读写,二者只能选其一。