Use integer promotion for warp_reduce #6819
We can leverage integer promotion to use the `__reduce_meow_sync` instructions
| inline constexpr bool | ||
| can_use_reduce_add_sync<T, ::cuda::std::plus<>, ::cuda::std::void_t<decltype(__reduce_add_sync(0xFFFFFFFF, T{}))>> = | ||
| ::cuda::std::is_integral_v<T> && sizeof(T) <= sizeof(unsigned); |
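The promotion idea in the description can be sketched on the host as follows. This is only an illustration under stated assumptions: `promoted_t`, `simulated_reduce_add`, and `reduce_add_promoted` are hypothetical names that do not come from the PR, `int` and `unsigned` are assumed to be 32 bits wide, and the loop merely models the arithmetic of a 32-lane add reduction rather than calling `__reduce_add_sync`.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical helper (not from the PR): the 32-bit type a narrower integral
// T is widened to, keeping its signedness.
template <class T>
using promoted_t = std::conditional_t<std::is_signed_v<T>, int, unsigned>;

// Host-side stand-in for a warp-wide add over 32 lanes. It accumulates in
// unsigned so that wraparound is well defined, then converts back to P
// (modular conversion; guaranteed since C++20, universal in practice before).
template <class P>
P simulated_reduce_add(const std::array<P, 32>& lanes)
{
  unsigned acc = 0;
  for (P v : lanes)
  {
    acc += static_cast<unsigned>(v);
  }
  return static_cast<P>(acc);
}

// Widen each lane to the promoted type, reduce, then narrow the result to T.
template <class T>
T reduce_add_promoted(const std::array<T, 32>& input)
{
  static_assert(std::is_integral_v<T> && sizeof(T) <= sizeof(unsigned),
                "only integral types of at most 32 bits are promoted");
  using P = promoted_t<T>;
  std::array<P, 32> wide{};
  for (std::size_t i = 0; i < input.size(); ++i)
  {
    wide[i] = static_cast<P>(input[i]);
  }
  return static_cast<T>(simulated_reduce_add(wide));
}
```

Narrowing the 32-bit sum back to `T` gives the same value as summing in `T` with wraparound, which is why the final `static_cast<T>` is enough.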
Q: What is the `decltype(__reduce_add_sync(0xFFFFFFFF, T{}))` actually good for? We already know that `T` can only be an integral type of at most 32 bits, so we don't need to test invocability.
I believe that is meant for compiler / toolkit combinations where we cannot rely solely on SM_PROVIDES_SM_80
Or better said, there are compilers that have partial SM80 support but where `__reduce_min_sync` and friends might not be implemented
decltype(__reduce_add_sync) is a historical way to handle this function. The common NV_IF_TARGET works perfectly fine with all compilers
SFINAE here is very verbose and adds compilation complexity
> Or better said, there are compilers where `__reduce_min_sync` and friends might not be implemented but that have partial SM80 support
But I don't think SFINAE would help with this: the function is forward declared in ${CTK_INCLUDE}/crt/sm_80_rt.h, which is always included when compiling for arch 80+
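The point about forward declarations can be illustrated with a small host-side sketch. The names `fake_reduce_add_sync`, `can_call_fake_reduce`, and `opaque` are hypothetical stand-ins, not the real intrinsic or the PR's trait. The sketch only shows that a `void_t` detection succeeds for any function that is declared, whether or not a definition exists.

```cpp
#include <type_traits>
#include <utility>

// Hypothetical stand-in for an intrinsic: declared here, defined nowhere.
unsigned fake_reduce_add_sync(unsigned mask, unsigned value);

// A type with no conversion to unsigned, used as the negative case.
struct opaque
{};

// Primary template: the call expression is not well formed for T.
template <class T, class = void>
inline constexpr bool can_call_fake_reduce = false;

// Chosen whenever the call type-checks. The call sits in an unevaluated
// context, so the missing definition is never noticed.
template <class T>
inline constexpr bool can_call_fake_reduce<
  T,
  std::void_t<decltype(fake_reduce_add_sync(0xFFFFFFFFu, std::declval<T>()))>> = true;

static_assert(can_call_fake_reduce<unsigned char>, "declaration alone is enough");
static_assert(!can_call_fake_reduce<opaque>, "no viable conversion for opaque");
```

Under that assumption, the detection idiom answers "is there a matching declaration?", which is a different question from "is the instruction available for this target?".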
| return static_cast<T>(__reduce_and_sync(member_mask, static_cast<PromotedT>(input))); | ||
| } | ||
| else if constexpr (detail::can_use_reduce_or_sync<T, ReductionOp>) | ||
| { | ||
| return __reduce_or_sync(member_mask, input); | ||
| return static_cast<T>(__reduce_or_sync(member_mask, static_cast<PromotedT>(input))); | ||
| } | ||
| else if constexpr (detail::can_use_reduce_xor_sync<T, ReductionOp>) | ||
| { | ||
| return __reduce_xor_sync(member_mask, input); | ||
| return static_cast<T>(__reduce_xor_sync(member_mask, static_cast<PromotedT>(input))); |
We could use the builtins for bitwise operations even for 64 and 128 bit types, maybe it could work also for min/max
the optimizations that you are proposing + many others are part of #4312
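For context, the suggestion about wider types could look roughly like the sketch below for a bitwise OR. It is not taken from this PR or from #4312. `simulated_reduce_or` and `reduce_or_64` are hypothetical names, and the loop is a host stand-in for a 32-bit warp-wide OR over 32 lanes.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Host-side stand-in for a 32-bit warp-wide OR over 32 lanes.
inline std::uint32_t simulated_reduce_or(const std::array<std::uint32_t, 32>& lanes)
{
  std::uint32_t acc = 0;
  for (std::uint32_t v : lanes)
  {
    acc |= v;
  }
  return acc;
}

// Reduce a 64-bit value by reducing its low and high 32-bit halves separately.
// This works because each output bit of OR depends only on the same bit
// position of the inputs.
inline std::uint64_t reduce_or_64(const std::array<std::uint64_t, 32>& input)
{
  std::array<std::uint32_t, 32> lo{};
  std::array<std::uint32_t, 32> hi{};
  for (std::size_t i = 0; i < input.size(); ++i)
  {
    lo[i] = static_cast<std::uint32_t>(input[i]);
    hi[i] = static_cast<std::uint32_t>(input[i] >> 32);
  }
  return (static_cast<std::uint64_t>(simulated_reduce_or(hi)) << 32) | simulated_reduce_or(lo);
}
```

The same split applies to AND and XOR. Min and max are not bit-position independent, so they would need more than two independent half reductions.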
updated the description with the performance results. TLDR: looks good on SM86, SM90, SM120
miscco left a comment
I do not like the change for the return value.
We are demoting compile time information into run-time information, which might be detrimental
| else if constexpr (is_cuda_std_bit_and_v<ReductionOp, T> && ::cuda::std::is_unsigned_v<T>) | ||
| { | ||
| return static_cast<T>(__reduce_and_sync(member_mask, static_cast<PromotedT>(input))); | ||
| } | ||
| else if constexpr (is_cuda_std_bit_or_v<ReductionOp, T> && ::cuda::std::is_unsigned_v<T>) | ||
| { | ||
| return static_cast<T>(__reduce_or_sync(member_mask, static_cast<PromotedT>(input))); | ||
| } | ||
| else if constexpr (is_cuda_std_bit_xor_v<ReductionOp, T> && ::cuda::std::is_unsigned_v<T>) | ||
| { | ||
| return static_cast<T>(__reduce_xor_sync(member_mask, static_cast<PromotedT>(input))); | ||
| } | ||
| else |
This is incorrect, now we are doing nothing in the bitwise cases if the type is signed.
Please revert to the previous formulation
do you expect users to perform bitwise operations on signed integers? 🤨
anyway, removed the constraints
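As a rough check that signed inputs are safe for the bitwise paths, the host-side sketch below compares an XOR fold done through `unsigned` with one done directly in `T`. The function names are hypothetical and the loops only model a 32-lane reduction. The narrowing conversions are modular (guaranteed since C++20, and the behavior of mainstream compilers before that).

```cpp
#include <array>
#include <cstdint>
#include <type_traits>

// XOR all lanes after widening each one to unsigned, then narrow back to T.
// Sign extension sets the upper bits consistently, and the narrowing cast
// discards them again.
template <class T>
T reduce_xor_via_unsigned(const std::array<T, 32>& input)
{
  static_assert(std::is_integral_v<T> && sizeof(T) <= sizeof(unsigned),
                "only integral types of at most 32 bits are handled");
  unsigned acc = 0;
  for (T v : input)
  {
    acc ^= static_cast<unsigned>(v);
  }
  return static_cast<T>(acc);
}

// Reference: XOR all lanes directly in T.
template <class T>
T reduce_xor_direct(const std::array<T, 32>& input)
{
  T acc = 0;
  for (T v : input)
  {
    acc = static_cast<T>(acc ^ v);
  }
  return acc;
}
```

Both functions agree for signed and unsigned `T`, because the low bits of the widened result never depend on the sign-extended upper bits.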
Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
🥳 CI Workflow Results: 🟩 Finished in 6h 51m: Pass: 100%/95 | Total: 5d 07h | Max: 5h 53m | Hits: 62%/91430
We can leverage integer promotion to use the `__reduce_meow_sync` instructions.
With that we get a lot of shuffle instructions turned into reduce instructions.

`WarpReduce` implementation #6814 (from: @fbusato)
Updating the performance results
[0] NVIDIA RTX A6000 (SM 8.6)
SUM
MIN
[0] NVIDIA H100 80GB HBM3 (SM 9.0)
MIN
SUM
[0] NVIDIA GeForce RTX 5080 (SM120)
SUM
MIN