
Use integer promotion for warp_reduce #6819

Merged
fbusato merged 22 commits into NVIDIA:main from miscco:warp_reduce_promotion
Dec 11, 2025

Conversation


@miscco miscco commented Dec 1, 2025

We can leverage integer promotion to use the `__reduce_*_sync` instructions.

With that, a lot of shuffle instructions are turned into reduce instructions.

(screenshot: SASS comparison, "Reduce MAX")

(from: @fbusato)

Updating the performance results

[0] NVIDIA RTX A6000 (SM 8.6)

SUM

| T{ct} | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|
| I8 | 43.135 us | 0.81% | 12.141 us | 3.01% | -30.994 us | -71.85% | FAST |
| I16 | 42.388 us | 1.09% | 12.969 us | 3.53% | -29.419 us | -69.40% | FAST |
| I32 | 11.899 us | 4.19% | 11.894 us | 4.23% | -0.005 us | -0.04% | SAME |
| I128 | 814.372 us | 0.18% | 814.035 us | 0.23% | -0.337 us | -0.04% | SAME |
| F16 | 43.142 us | 0.82% | 43.136 us | 0.80% | -0.006 us | -0.01% | SAME |
| BF16 | 44.129 us | 0.76% | 44.112 us | 0.78% | -0.017 us | -0.04% | SAME |
| F32 | 41.760 us | 1.04% | 41.840 us | 0.91% | 0.080 us | 0.19% | SAME |
| F64 | 328.289 us | 0.14% | 328.334 us | 0.16% | 0.044 us | 0.01% | SAME |
| C16 | 37.590 us | 1.22% | 37.564 us | 1.25% | -0.026 us | -0.07% | SAME |
| CB16 | 38.623 us | 1.08% | 38.508 us | 1.24% | -0.115 us | -0.30% | SAME |
| C32 | 72.180 us | 0.70% | 72.317 us | 0.68% | 0.137 us | 0.19% | SAME |
| C64 | 652.939 us | 0.12% | 652.566 us | 0.07% | -0.372 us | -0.06% | SAME |

MIN

| T{ct} | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|
| I8 | 59.180 us | 0.73% | 12.145 us | 2.98% | -47.035 us | -79.48% | FAST |
| I16 | 60.003 us | 0.79% | 12.980 us | 3.51% | -47.022 us | -78.37% | FAST |
| I32 | 11.883 us | 4.26% | 11.872 us | 4.24% | -0.010 us | -0.09% | SAME |
| I128 | 814.882 us | 0.23% | 815.139 us | 0.17% | 0.257 us | 0.03% | SAME |
| F16 | 43.408 us | 1.14% | 43.522 us | 1.19% | 0.114 us | 0.26% | SAME |
| BF16 | 44.275 us | 0.93% | 44.398 us | 1.04% | 0.122 us | 0.28% | SAME |
| F32 | 37.678 us | 1.10% | 37.685 us | 1.08% | 0.007 us | 0.02% | SAME |
| F64 | 331.726 us | 0.14% | 331.767 us | 0.13% | 0.041 us | 0.01% | SAME |

[0] NVIDIA H100 80GB HBM3 (SM 9.0)

MIN

| T{ct} | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|
| I8 | 75.502 us | 0.21% | 13.311 us | 0.89% | -62.191 us | -82.37% | FAST |
| I16 | 77.241 us | 1.11% | 14.461 us | 1.31% | -62.779 us | -81.28% | FAST |
| I32 | 12.817 us | 1.12% | 12.603 us | 0.80% | -0.214 us | -1.67% | FAST |
| I64 | 5.325 us | 3.61% | 5.119 us | 3.48% | -0.206 us | -3.87% | FAST |
| I128 | 1.007 ms | 0.22% | 1.006 ms | 0.24% | -0.985 us | -0.10% | SAME |
| F16 | 47.180 us | 6.53% | 44.684 us | 3.13% | -2.496 us | -5.29% | FAST |
| BF16 | 48.785 us | 5.31% | 49.097 us | 1.59% | 0.312 us | 0.64% | SAME |
| F32 | 38.557 us | 1.33% | 37.889 us | 1.50% | -0.668 us | -1.73% | FAST |
| F64 | 116.027 us | 5.85% | 114.194 us | 7.15% | -1.832 us | -1.58% | SAME |

SUM

| T{ct} | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|
| I8 | 45.257 us | 0.36% | 13.411 us | 0.99% | -31.846 us | -70.37% | FAST |
| I16 | 35.208 us | 3.05% | 14.623 us | 1.32% | -20.585 us | -58.47% | FAST |
| I32 | 12.898 us | 1.27% | 13.078 us | 1.96% | 0.180 us | 1.39% | SLOW |
| I64 | 57.591 us | 5.01% | 57.843 us | 5.95% | 0.252 us | 0.44% | SAME |
| I128 | 1.017 ms | 1.06% | 1.014 ms | 0.96% | -2.766 us | -0.27% | SAME |
| F16 | 42.920 us | 5.00% | 42.266 us | 4.85% | -0.654 us | -1.52% | SAME |
| BF16 | 39.856 us | 5.99% | 39.882 us | 4.47% | 0.025 us | 0.06% | SAME |
| F32 | 31.366 us | 2.13% | 31.290 us | 3.24% | -0.076 us | -0.24% | SAME |
| F64 | 69.920 us | 5.91% | 68.952 us | 3.96% | -0.968 us | -1.38% | SAME |
| C16 | 31.042 us | 4.22% | 30.883 us | 2.85% | -0.159 us | -0.51% | SAME |
| CB16 | 30.705 us | 4.74% | 29.805 us | 4.42% | -0.899 us | -2.93% | SAME |
| C32 | 58.815 us | 4.85% | 59.348 us | 4.62% | 0.533 us | 0.91% | SAME |
| C64 | 109.400 us | 4.38% | 104.894 us | 7.11% | -4.507 us | -4.12% | SAME |

[0] NVIDIA GeForce RTX 5080 (SM 12.0)

SUM

| T{ct} | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|
| I8 | 32.385 us | 0.50% | 9.860 us | 1.67% | -22.526 us | -69.56% | FAST |
| I16 | 33.266 us | 0.29% | 10.738 us | 0.95% | -22.528 us | -67.72% | FAST |
| I32 | 9.860 us | 1.64% | 9.863 us | 1.68% | 0.003 us | 0.04% | SAME |
| I64 | 51.701 us | 0.19% | 51.699 us | 0.19% | -0.002 us | -0.00% | SAME |
| I128 | 149.695 us | 0.64% | 91.797 us | 0.23% | -57.898 us | -38.68% | FAST |
| F16 | 33.259 us | 0.32% | 33.254 us | 0.16% | -0.004 us | -0.01% | SAME |
| BF16 | 32.769 us | 0.12% | 32.768 us | 0.10% | -0.001 us | -0.00% | SAME |
| F32 | 31.228 us | 0.37% | 31.226 us | 0.30% | -0.002 us | -0.01% | SAME |
| F64 | 210.893 us | 0.07% | 210.891 us | 0.07% | -0.002 us | -0.00% | SAME |
| C16 | 27.126 us | 0.40% | 27.125 us | 0.37% | -0.002 us | -0.01% | SAME |
| CB16 | 26.579 us | 0.48% | 26.584 us | 0.46% | 0.005 us | 0.02% | SAME |
| C32 | 49.673 us | 0.50% | 49.666 us | 0.45% | -0.007 us | -0.01% | SAME |
| C64 | 426.750 us | 0.24% | 426.737 us | 0.24% | -0.013 us | -0.00% | SAME |

MIN

| T{ct} | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|
| I8 | 42.638 us | 0.40% | 9.860 us | 1.64% | -32.778 us | -76.87% | FAST |
| I16 | 33.265 us | 0.36% | 10.739 us | 0.90% | -22.526 us | -67.72% | FAST |
| I32 | 9.861 us | 1.63% | 9.860 us | 1.66% | -0.000 us | -0.00% | SAME |
| I64 | 49.806 us | 1.08% | 49.818 us | 1.11% | 0.012 us | 0.02% | SAME |
| I128 | 148.673 us | 0.78% | 93.831 us | 0.19% | -54.843 us | -36.89% | FAST |
| F16 | 33.260 us | 0.37% | 33.258 us | 0.35% | -0.002 us | -0.01% | SAME |
| BF16 | 32.768 us | 0.08% | 32.768 us | 0.10% | 0.000 us | 0.00% | SAME |
| F32 | 29.170 us | 0.37% | 29.171 us | 0.36% | 0.001 us | 0.00% | SAME |
| F64 | 211.330 us | 0.39% | 211.098 us | 0.29% | -0.232 us | -0.11% | SAME |

@miscco miscco requested a review from a team as a code owner December 1, 2025 10:12
@miscco miscco requested a review from gevtushenko December 1, 2025 10:12
@github-project-automation github-project-automation bot moved this to Todo in CCCL Dec 1, 2025
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Dec 1, 2025
Comment on lines 52 to 54
```cpp
inline constexpr bool can_use_reduce_add_sync<
    T, ::cuda::std::plus<>,
    ::cuda::std::void_t<decltype(__reduce_add_sync(0xFFFFFFFF, T{}))>> =
  ::cuda::std::is_integral_v<T> && sizeof(T) <= sizeof(unsigned);
```
@davebayer (Contributor) commented Dec 1, 2025:

Q: What is the `decltype(__reduce_add_sync(0xFFFFFFFF, T{}))` actually good for? We know it can only be at most a 32-bit integral type; we don't need to test invocability.

@miscco (Contributor, Author) replied:

I believe that is meant for compiler/toolkit combinations where we cannot rely solely on `SM_PROVIDES_SM_80`.

@miscco (Contributor, Author) replied:

Or, better said, there are compilers where `__reduce_min_sync` and friends might not be implemented but that have partial SM80 support.

Contributor:

`decltype(__reduce_add_sync)` is a historical way to handle this function. The common `NV_IF_TARGET` approach works perfectly fine with all compilers.

@fbusato (Contributor) replied Dec 1, 2025:

SFINAE here is very verbose and adds compilation complexity

Contributor:

> Or, better said, there are compilers where `__reduce_min_sync` and friends might not be implemented but that have partial SM80 support.

But I don't think SFINAE would help with this; the function is forward-declared in `${CTK_INCLUDE}/crt/sm_80_rt.h`, which is always included when compiling for arch 80+.


@fbusato fbusato requested a review from a team as a code owner December 2, 2025 00:35

Comment on lines 563 to 571
```diff
   return static_cast<T>(__reduce_and_sync(member_mask, static_cast<PromotedT>(input)));
 }
 else if constexpr (detail::can_use_reduce_or_sync<T, ReductionOp>)
 {
-  return __reduce_or_sync(member_mask, input);
+  return static_cast<T>(__reduce_or_sync(member_mask, static_cast<PromotedT>(input)));
 }
 else if constexpr (detail::can_use_reduce_xor_sync<T, ReductionOp>)
 {
-  return __reduce_xor_sync(member_mask, input);
+  return static_cast<T>(__reduce_xor_sync(member_mask, static_cast<PromotedT>(input)));
```
Contributor:

We could use the builtins for bitwise operations even for 64- and 128-bit types; maybe it could also work for min/max.

Contributor:

The optimizations you are proposing, plus many others, are part of #4312.


fbusato commented Dec 6, 2025

Updated the description with the performance results. TL;DR: looks good on SM86, SM90, and SM120.


@miscco (Contributor, Author) left a review:

I do not like the change to the return value.

We are demoting compile-time information to run-time information, which might be detrimental.

Comment on lines 465 to 477
```cpp
else if constexpr (is_cuda_std_bit_and_v<ReductionOp, T> && ::cuda::std::is_unsigned_v<T>)
{
  return static_cast<T>(__reduce_and_sync(member_mask, static_cast<PromotedT>(input)));
}
else if constexpr (is_cuda_std_bit_or_v<ReductionOp, T> && ::cuda::std::is_unsigned_v<T>)
{
  return static_cast<T>(__reduce_or_sync(member_mask, static_cast<PromotedT>(input)));
}
else if constexpr (is_cuda_std_bit_xor_v<ReductionOp, T> && ::cuda::std::is_unsigned_v<T>)
{
  return static_cast<T>(__reduce_xor_sync(member_mask, static_cast<PromotedT>(input)));
}
else
```
else
@miscco (Contributor, Author):

This is incorrect: now we do nothing in the bitwise cases if the type is signed.

Please revert to the previous formulation.

Contributor:

do you expect users to perform bitwise operations on signed integers? 🤨

Contributor:

anyway, removed the constraints


fbusato and others added 3 commits December 10, 2025 09:07

@fbusato fbusato enabled auto-merge (squash) December 11, 2025 00:32
@github-actions

🥳 CI Workflow Results

🟩 Finished in 6h 51m: Pass: 100%/95 | Total: 5d 07h | Max: 5h 53m | Hits: 62%/91430

See results here.

@fbusato fbusato merged commit a04ffc4 into NVIDIA:main Dec 11, 2025
210 of 213 checks passed
@github-project-automation github-project-automation bot moved this from In Review to Done in CCCL Dec 11, 2025
@miscco miscco deleted the warp_reduce_promotion branch December 11, 2025 08:53
