Implement the new tuning API for DeviceReduce #6544
bernhardmgruber merged 44 commits into NVIDIA:main

Conversation
Found an issue with the way the accumulator type is specified in the benchmarks, which explains the regressions I currently observe when using the public tuning API: #6576
miscco left a comment:

I like the direction this is going.
griwes left a comment:

As said in one of the comments below, I do like the overall structure of this. That said, lack of pattern matching = pain.
c/parallel/src/reduce.cu (outdated):
```cpp
// convert type information to CUB arch_policies
using namespace cub::detail::reduce;

auto at = accum_type::other;
if (accum_t.type == CCCL_FLOAT32)
{
  at = accum_type::float32;
}
if (accum_t.type == CCCL_FLOAT64)
{
  at = accum_type::double32;
}

auto ot = op_type::unknown;
switch (op.type)
{
  case CCCL_PLUS:
    ot = op_type::plus;
    break;
  case CCCL_MINIMUM:
  case CCCL_MAXIMUM:
    ot = op_type::min_or_max;
    break;
  default:
    break;
}

using cub::detail::RuntimeReduceAgentPolicy;
auto reduce_policy = RuntimeReduceAgentPolicy::from_json(runtime_policy, "ReducePolicy");
auto st_policy     = RuntimeReduceAgentPolicy::from_json(runtime_policy, "SingleTilePolicy");
auto os            = offset_size::_8; // sizeof(uint64_t)
```
This should be centralized. Not just for c.parallel (so that we can avoid re-stating this over and over again in mimicry of the CUB classify calls), but also for CUB itself so that c.parallel can just do this per category (op_type, accum_type) instead of doing it per algorithm.
```cpp
  using MaxPolicy = Policy1000;
};

struct arch_policies // equivalent to the policy_hub, holds policies for a bunch of CUDA architectures
```
This is an internal type, but one that still materializes when users invoke the algorithms, right? I wonder if this should turn into a template and its data members should be turned into an environment returning those values by queries, because as is, any change to the layout would be an ABI break...
A very appealing aspect of the current design is that tuning information is expressed very simply as structs with data members, so I would love if we could keep that.
Regarding ABI breaks, we do allow those at every release. This is pointed out in our README:

> Symbols in the thrust:: and cub:: namespaces may break ABI at any time without warning.
I agree; however, it'd be nice to have the ABI break manifest as a linker error instead of being entirely silent.
Doesn't this already happen automatically, since each CCCL release will have all CUB and Thrust entities in a different inline namespace? Like, now the type is called cub::_V_300300_SM120::detail::reduce. With the next release, the 300300 changes to 300400. What more is needed?
@bernhardmgruber can we see a comparison in compile time between this approach and the new one for the DeviceReduce tests? I want to see if there is any impact (for better or worse) on compile time with the new tuning machinery.
🥳 CI Workflow Results: 🟩 Finished in 12h 31m. Pass: 100%/98 | Total: 5d 07h | Max: 5h 13m | Hits: 33%/97705
```cpp
using dispatch_reduce_t =
  DispatchReduce<arg_index_input_iterator_t,
                 accumulating_transform_out_it_t,
                 PerPartitionOffsetT,
                 ReductionOpT,
                 empty_problem_init_t,
                 per_partition_accum_t,
                 ::cuda::std::identity,
                 PolicyChainT>;
```
Critical: PolicyChainT was passed here to dispatch_reduce_t and is later picked up by dispatch_reduce_t::Dispatch. The replacement call reduce::dispatch<per_partition_accum_t> no longer carries forward this information.
Part of #6368, which was design-approved yesterday. The goal is to merge refactorings like the one here continuously, but avoid any public exposure of the tuning APIs for now. We can turn them live once we have completed the rewrite.
Quick benchmark of cub.bench.reduce.sum.base on my RTX 5090, since the SASS diff would not cover regressions in host code. LGTM.

cub.bench.reduce.sum.base and cub.bench.transform_reduce.sum.base on sm120, running 3 times (excluding 1 warmup):

before: (results not preserved in this export)

after: (results not preserved in this export)

Fixes: #6565