
Use integer promotion for warp_reduce #6819

Merged
fbusato merged 22 commits into NVIDIA:main from miscco:warp_reduce_promotion
Dec 11, 2025

Conversation


@miscco miscco commented Dec 1, 2025

We can leverage integer promotion to use the `__reduce_*_sync` instructions.

With that, a lot of shuffle instructions are turned into reduce instructions.

(screenshot: SASS comparison, "Reduce MAX")

(from: @fbusato)

Updating the performance results

[0] NVIDIA RTX A6000 (SM 8.6)

SUM

| T{ct} | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|
| I8 | 43.135 us | 0.81% | 12.141 us | 3.01% | -30.994 us | -71.85% | FAST |
| I16 | 42.388 us | 1.09% | 12.969 us | 3.53% | -29.419 us | -69.40% | FAST |
| I32 | 11.899 us | 4.19% | 11.894 us | 4.23% | -0.005 us | -0.04% | SAME |
| I128 | 814.372 us | 0.18% | 814.035 us | 0.23% | -0.337 us | -0.04% | SAME |
| F16 | 43.142 us | 0.82% | 43.136 us | 0.80% | -0.006 us | -0.01% | SAME |
| BF16 | 44.129 us | 0.76% | 44.112 us | 0.78% | -0.017 us | -0.04% | SAME |
| F32 | 41.760 us | 1.04% | 41.840 us | 0.91% | 0.080 us | 0.19% | SAME |
| F64 | 328.289 us | 0.14% | 328.334 us | 0.16% | 0.044 us | 0.01% | SAME |
| C16 | 37.590 us | 1.22% | 37.564 us | 1.25% | -0.026 us | -0.07% | SAME |
| CB16 | 38.623 us | 1.08% | 38.508 us | 1.24% | -0.115 us | -0.30% | SAME |
| C32 | 72.180 us | 0.70% | 72.317 us | 0.68% | 0.137 us | 0.19% | SAME |
| C64 | 652.939 us | 0.12% | 652.566 us | 0.07% | -0.372 us | -0.06% | SAME |

MIN

| T{ct} | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|
| I8 | 59.180 us | 0.73% | 12.145 us | 2.98% | -47.035 us | -79.48% | FAST |
| I16 | 60.003 us | 0.79% | 12.980 us | 3.51% | -47.022 us | -78.37% | FAST |
| I32 | 11.883 us | 4.26% | 11.872 us | 4.24% | -0.010 us | -0.09% | SAME |
| I128 | 814.882 us | 0.23% | 815.139 us | 0.17% | 0.257 us | 0.03% | SAME |
| F16 | 43.408 us | 1.14% | 43.522 us | 1.19% | 0.114 us | 0.26% | SAME |
| BF16 | 44.275 us | 0.93% | 44.398 us | 1.04% | 0.122 us | 0.28% | SAME |
| F32 | 37.678 us | 1.10% | 37.685 us | 1.08% | 0.007 us | 0.02% | SAME |
| F64 | 331.726 us | 0.14% | 331.767 us | 0.13% | 0.041 us | 0.01% | SAME |

[0] NVIDIA H100 80GB HBM3 (SM 9.0)

MIN

| T{ct} | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|
| I8 | 75.502 us | 0.21% | 13.311 us | 0.89% | -62.191 us | -82.37% | FAST |
| I16 | 77.241 us | 1.11% | 14.461 us | 1.31% | -62.779 us | -81.28% | FAST |
| I32 | 12.817 us | 1.12% | 12.603 us | 0.80% | -0.214 us | -1.67% | FAST |
| I64 | 5.325 us | 3.61% | 5.119 us | 3.48% | -0.206 us | -3.87% | FAST |
| I128 | 1.007 ms | 0.22% | 1.006 ms | 0.24% | -0.985 us | -0.10% | SAME |
| F16 | 47.180 us | 6.53% | 44.684 us | 3.13% | -2.496 us | -5.29% | FAST |
| BF16 | 48.785 us | 5.31% | 49.097 us | 1.59% | 0.312 us | 0.64% | SAME |
| F32 | 38.557 us | 1.33% | 37.889 us | 1.50% | -0.668 us | -1.73% | FAST |
| F64 | 116.027 us | 5.85% | 114.194 us | 7.15% | -1.832 us | -1.58% | SAME |

SUM

| T{ct} | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|
| I8 | 45.257 us | 0.36% | 13.411 us | 0.99% | -31.846 us | -70.37% | FAST |
| I16 | 35.208 us | 3.05% | 14.623 us | 1.32% | -20.585 us | -58.47% | FAST |
| I32 | 12.898 us | 1.27% | 13.078 us | 1.96% | 0.180 us | 1.39% | SLOW |
| I64 | 57.591 us | 5.01% | 57.843 us | 5.95% | 0.252 us | 0.44% | SAME |
| I128 | 1.017 ms | 1.06% | 1.014 ms | 0.96% | -2.766 us | -0.27% | SAME |
| F16 | 42.920 us | 5.00% | 42.266 us | 4.85% | -0.654 us | -1.52% | SAME |
| BF16 | 39.856 us | 5.99% | 39.882 us | 4.47% | 0.025 us | 0.06% | SAME |
| F32 | 31.366 us | 2.13% | 31.290 us | 3.24% | -0.076 us | -0.24% | SAME |
| F64 | 69.920 us | 5.91% | 68.952 us | 3.96% | -0.968 us | -1.38% | SAME |
| C16 | 31.042 us | 4.22% | 30.883 us | 2.85% | -0.159 us | -0.51% | SAME |
| CB16 | 30.705 us | 4.74% | 29.805 us | 4.42% | -0.899 us | -2.93% | SAME |
| C32 | 58.815 us | 4.85% | 59.348 us | 4.62% | 0.533 us | 0.91% | SAME |
| C64 | 109.400 us | 4.38% | 104.894 us | 7.11% | -4.507 us | -4.12% | SAME |

[0] NVIDIA GeForce RTX 5080 (SM 12.0)

SUM

| T{ct} | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|
| I8 | 32.385 us | 0.50% | 9.860 us | 1.67% | -22.526 us | -69.56% | FAST |
| I16 | 33.266 us | 0.29% | 10.738 us | 0.95% | -22.528 us | -67.72% | FAST |
| I32 | 9.860 us | 1.64% | 9.863 us | 1.68% | 0.003 us | 0.04% | SAME |
| I64 | 51.701 us | 0.19% | 51.699 us | 0.19% | -0.002 us | -0.00% | SAME |
| I128 | 149.695 us | 0.64% | 91.797 us | 0.23% | -57.898 us | -38.68% | FAST |
| F16 | 33.259 us | 0.32% | 33.254 us | 0.16% | -0.004 us | -0.01% | SAME |
| BF16 | 32.769 us | 0.12% | 32.768 us | 0.10% | -0.001 us | -0.00% | SAME |
| F32 | 31.228 us | 0.37% | 31.226 us | 0.30% | -0.002 us | -0.01% | SAME |
| F64 | 210.893 us | 0.07% | 210.891 us | 0.07% | -0.002 us | -0.00% | SAME |
| C16 | 27.126 us | 0.40% | 27.125 us | 0.37% | -0.002 us | -0.01% | SAME |
| CB16 | 26.579 us | 0.48% | 26.584 us | 0.46% | 0.005 us | 0.02% | SAME |
| C32 | 49.673 us | 0.50% | 49.666 us | 0.45% | -0.007 us | -0.01% | SAME |
| C64 | 426.750 us | 0.24% | 426.737 us | 0.24% | -0.013 us | -0.00% | SAME |

MIN

| T{ct} | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|
| I8 | 42.638 us | 0.40% | 9.860 us | 1.64% | -32.778 us | -76.87% | FAST |
| I16 | 33.265 us | 0.36% | 10.739 us | 0.90% | -22.526 us | -67.72% | FAST |
| I32 | 9.861 us | 1.63% | 9.860 us | 1.66% | -0.000 us | -0.00% | SAME |
| I64 | 49.806 us | 1.08% | 49.818 us | 1.11% | 0.012 us | 0.02% | SAME |
| I128 | 148.673 us | 0.78% | 93.831 us | 0.19% | -54.843 us | -36.89% | FAST |
| F16 | 33.260 us | 0.37% | 33.258 us | 0.35% | -0.002 us | -0.01% | SAME |
| BF16 | 32.768 us | 0.08% | 32.768 us | 0.10% | 0.000 us | 0.00% | SAME |
| F32 | 29.170 us | 0.37% | 29.171 us | 0.36% | 0.001 us | 0.00% | SAME |
| F64 | 211.330 us | 0.39% | 211.098 us | 0.29% | -0.232 us | -0.11% | SAME |

@miscco miscco requested a review from a team as a code owner December 1, 2025 10:12
@miscco miscco requested a review from gevtushenko December 1, 2025 10:12
@github-project-automation github-project-automation bot moved this to Todo in CCCL Dec 1, 2025
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Dec 1, 2025
Comment on lines 52 to 54
```cpp
inline constexpr bool can_use_reduce_add_sync<
    T, ::cuda::std::plus<>,
    ::cuda::std::void_t<decltype(__reduce_add_sync(0xFFFFFFFF, T{}))>> =
  ::cuda::std::is_integral_v<T> && sizeof(T) <= sizeof(unsigned);
```
@davebayer (Contributor) commented Dec 1, 2025:

Q: What is the `decltype(__reduce_add_sync(0xFFFFFFFF, T{}))` actually good for? We know it can only be at most a 32-bit integral type; we don't need to test invocability.

@miscco (Contributor, Author) replied:

I believe that is meant for compiler/toolkit combinations where we cannot rely solely on `SM_PROVIDES_SM_80`.

@miscco (Contributor, Author) replied:

Or, better said, there are compilers where `__reduce_min_sync` and friends might not be implemented but that have partial SM80 support.

Contributor:

`decltype(__reduce_add_sync)` is a historical way to handle this function. The common `NV_IF_TARGET` approach works perfectly fine with all compilers.

@fbusato (Contributor) replied Dec 1, 2025:

SFINAE here is very verbose and adds compilation complexity

Contributor:

> Or, better said, there are compilers where `__reduce_min_sync` and friends might not be implemented but that have partial SM80 support.

But I don't think SFINAE would help with this; the function is forward-declared in `${CTK_INCLUDE}/crt/sm_80_rt.h`, which is always included when compiling for arch 80+.


@fbusato fbusato requested a review from a team as a code owner December 2, 2025 00:35

Comment on lines 563 to 571
```diff
   return static_cast<T>(__reduce_and_sync(member_mask, static_cast<PromotedT>(input)));
 }
 else if constexpr (detail::can_use_reduce_or_sync<T, ReductionOp>)
 {
-  return __reduce_or_sync(member_mask, input);
+  return static_cast<T>(__reduce_or_sync(member_mask, static_cast<PromotedT>(input)));
 }
 else if constexpr (detail::can_use_reduce_xor_sync<T, ReductionOp>)
 {
-  return __reduce_xor_sync(member_mask, input);
+  return static_cast<T>(__reduce_xor_sync(member_mask, static_cast<PromotedT>(input)));
```
Contributor:

We could use the builtins for bitwise operations even for 64- and 128-bit types; maybe it could also work for min/max.

Contributor:

The optimizations you are proposing, plus many others, are part of #4312.


fbusato commented Dec 6, 2025

Updated the description with the performance results. TL;DR: looks good on SM86, SM90, and SM120.


@miscco (Contributor, Author) left a review:

I do not like the change to the return value.

We are demoting compile-time information to run-time information, which might be detrimental.

Comment on lines 465 to 477
```cpp
else if constexpr (is_cuda_std_bit_and_v<ReductionOp, T> && ::cuda::std::is_unsigned_v<T>)
{
  return static_cast<T>(__reduce_and_sync(member_mask, static_cast<PromotedT>(input)));
}
else if constexpr (is_cuda_std_bit_or_v<ReductionOp, T> && ::cuda::std::is_unsigned_v<T>)
{
  return static_cast<T>(__reduce_or_sync(member_mask, static_cast<PromotedT>(input)));
}
else if constexpr (is_cuda_std_bit_xor_v<ReductionOp, T> && ::cuda::std::is_unsigned_v<T>)
{
  return static_cast<T>(__reduce_xor_sync(member_mask, static_cast<PromotedT>(input)));
}
else
```
else
@miscco (Contributor, Author):

This is incorrect: now we do nothing in the bitwise cases if the type is signed.

Please revert to the previous formulation.

Contributor:

do you expect users to perform bitwise operations on signed integers? 🤨

Contributor:

anyway, removed the constraints


fbusato and others added 3 commits December 10, 2025 09:07

@fbusato fbusato enabled auto-merge (squash) December 11, 2025 00:32
@github-actions

🥳 CI Workflow Results

🟩 Finished in 6h 51m: Pass: 100%/95 | Total: 5d 07h | Max: 5h 53m | Hits: 62%/91430

See results here.

@fbusato fbusato merged commit a04ffc4 into NVIDIA:main Dec 11, 2025
210 of 213 checks passed
@github-project-automation github-project-automation bot moved this from In Review to Done in CCCL Dec 11, 2025
@miscco miscco deleted the warp_reduce_promotion branch December 11, 2025 08:53
