Refactor warpspeed scan tuning by bernhardmgruber · Pull Request #8145 · NVIDIA/cccl

bernhardmgruber · 2026-03-24T10:14:47Z

These are some refactorings following the restructuring to support CCCL.C and the new tuning API: #7565

No SASS changes for cub.bench.scan.exclusive.sum.base on SM75;80;86;90;100;120

bernhardmgruber · 2026-03-24T10:21:14Z

cub/cub/device/dispatch/tuning/tuning_scan.cuh

-#if __cccl_ptx_isa >= 860
-    struct WarpspeedPolicy
-    {


Note: I am entirely removing warpspeed scan from the old policy hub, since we have the new tuning API now and we did not ship warpspeed scan to any release yet. So this is not a breaking change.

bernhardmgruber · 2026-03-24T10:22:17Z

cub/cub/device/dispatch/tuning/tuning_scan.cuh

 struct policy_selector_from_types
 {
-  static constexpr int input_value_size       = int{sizeof(InputValueT)};
-  static constexpr int input_value_alignment  = int{alignof(InputValueT)};
-  static constexpr int output_value_size      = int{sizeof(OutputValueT)};
-  static constexpr int output_value_alignment = int{alignof(OutputValueT)};
-  static constexpr int accum_size             = int{sizeof(AccumT)};
-  static constexpr int accum_alignment        = int{alignof(AccumT)};
-  static constexpr type_t input_type          = classify_type<InputValueT>;


Note: It's not the policy selector's job to handle the type erasure required for CCCL.C, that's what we have the kernel source for.

Doh! This was the missing piece to being able to straightforwardly make things easily constexpr, since values coming in as arguments can't do that. Wish I noticed it earlier 😅
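A minimal sketch of the idea in this thread, assuming a selector that stores only plain values and leaves deriving them from types to the caller. All names here (`scan_policy_sketch`, `policy_selector_sketch`, `make_selector_from_types`) and the thresholds are hypothetical illustrations, not the actual CUB tuning code:

```cpp
#include <cstddef>

// Hypothetical tuning result; a real scan policy carries more fields.
struct scan_policy_sketch
{
  int block_threads;
  int items_per_thread;
};

// Hypothetical selector that stores plain values only, so it can be built
// from C++ types at compile time or from sizes supplied at run time.
struct policy_selector_sketch
{
  int input_value_size;
  int accum_size;

  // `arch` is a plain integer here (e.g. 900 for SM90); thresholds are illustrative.
  constexpr scan_policy_sketch operator()(int arch) const
  {
    return arch >= 900 ? scan_policy_sketch{256, accum_size <= 4 ? 8 : 4}
                       : scan_policy_sketch{128, accum_size <= 4 ? 8 : 4};
  }
};

// Typed entry point: the caller derives the values from the types once.
template <class InputValueT, class AccumT>
constexpr policy_selector_sketch make_selector_from_types()
{
  return policy_selector_sketch{int{sizeof(InputValueT)}, int{sizeof(AccumT)}};
}

// The typed path is usable in constant expressions.
static_assert(make_selector_from_types<int, int>()(900).block_threads == 256, "typed path is constexpr");
static_assert(make_selector_from_types<char, char>()(800).items_per_thread == 8, "typed path is constexpr");
```

Because the selector holds values rather than template parameters, the same `operator()` also works for an instance filled in at run time, e.g. `policy_selector_sketch{8, 8}(750)`.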

bernhardmgruber · 2026-03-24T10:23:24Z

cub/cub/device/dispatch/dispatch_scan.cuh

-    using policy_selector_t = detail::scan::policy_selector_from_types<
-      detail::it_value_t<InputIteratorT>,
-      detail::it_value_t<OutputIteratorT>,
-      AccumT,
-      OffsetT,
-      ScanOpT>;


Note: This was incorrect, since it ignored the user provided policy hub.

bernhardmgruber · 2026-03-24T10:24:38Z

cub/test/catch2_test_device_scan_env.cu

-  REQUIRE(cudaSuccess == cudaGetDeviceProperties(&device_props, current_device));
-
-  const auto target_block_size =
-    selector_t{}(cuda::to_arch_id(cuda::compute_capability{device_props.major, device_props.minor})).block_threads;


Note: It could be argued that we should not use a detail function in the unit tests, but we will probably expose ptx_arch_id, or the compute capability version, in the public API when we go public with the tuning API. So this objection would be temporary.

cub/cub/device/dispatch/kernels/kernel_scan_warpspeed.cuh

cub/cub/device/dispatch/kernels/scan_warpspeed_policy.cuh

fbusato · 2026-03-24T17:21:56Z

cub/cub/device/dispatch/tuning/tuning_scan.cuh

  };

-  using MaxPolicy = Policy1200;
+  using MaxPolicy = Policy1000;


is this change expected?

cub/cub/device/dispatch/tuning/tuning_scan.cuh

fbusato · 2026-03-24T17:25:55Z

cub/cub/device/dispatch/tuning/tuning_scan.cuh

    // 1);

-    warpspeed_policy.tile_size = warpspeed_policy.items_per_thread * squad_reduce_thread_count;
+    if (arch >= ::cuda::arch_id::sm_120 && operation_t == op_kind_t::other && is_arithmetic_type(input_type))


Shouldn't is_arithmetic_type be fully qualified?

we don't fully qualify in CUB yet.

fbusato · 2026-03-24T17:29:05Z

cub/cub/device/dispatch/dispatch_scan.cuh

+      static_cast<int>(kernel_src.InputSize()),
+      static_cast<int>(kernel_src.InputAlign()),
+      static_cast<int>(kernel_src.OutputSize()),
+      static_cast<int>(kernel_src.OutputAlign()),
+      static_cast<int>(kernel_src.AccumSize()),
+      static_cast<int>(kernel_src.AccumAlign()));


my understanding is that everything here is at compile-time

Suggested change

-      static_cast<int>(kernel_src.InputSize()),
-      static_cast<int>(kernel_src.InputAlign()),
-      static_cast<int>(kernel_src.OutputSize()),
-      static_cast<int>(kernel_src.OutputAlign()),
-      static_cast<int>(kernel_src.AccumSize()),
-      static_cast<int>(kernel_src.AccumAlign()));
+      int{kernel_src.InputSize()},
+      int{kernel_src.InputAlign()},
+      int{kernel_src.OutputSize()},
+      int{kernel_src.OutputAlign()},
+      int{kernel_src.AccumSize()},
+      int{kernel_src.AccumAlign()});

It's only constexpr when called through the CUB API. It's just const when called through CCCL.C.
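To illustrate why the suggested `int{...}` form only works on the compile-time path, here is a small sketch. It assumes the accessors return `std::size_t`, which this thread does not state; the two kernel source types are hypothetical stand-ins, and only the accessor name `InputSize()` is taken from the diff above. Brace initialization from `std::size_t` to `int` is a narrowing conversion unless the operand is a constant expression whose value fits:

```cpp
#include <cstddef>

// Hypothetical kernel source for the typed C++ path: the size is a constant expression.
template <class InputT>
struct typed_kernel_source_sketch
{
  static constexpr std::size_t InputSize()
  {
    return sizeof(InputT);
  }
};

// Hypothetical kernel source for the type-erased path: the size is only known at run time.
struct erased_kernel_source_sketch
{
  std::size_t input_size;

  std::size_t InputSize() const
  {
    return input_size;
  }
};

// int{...} compiles: the operand is a constant expression and its value fits in int.
template <class InputT>
constexpr int input_size_typed()
{
  return int{typed_kernel_source_sketch<InputT>::InputSize()};
}

// int{src.InputSize()} would be ill-formed here (narrowing from a runtime std::size_t),
// so an explicit cast is needed.
inline int input_size_erased(const erased_kernel_source_sketch& src)
{
  return static_cast<int>(src.InputSize());
}

static_assert(input_size_typed<char>() == 1, "sizeof(char) is 1 by definition");
```

Code that has to serve both paths with one spelling therefore ends up with `static_cast<int>`.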

fbusato · 2026-03-24T17:29:52Z

cub/cub/device/dispatch/tuning/tuning_scan.cuh

+// TODO(bgruber): put this somewhere else
 constexpr _CCCL_HOST_DEVICE bool is_arithmetic_type(type_t type)
 {
  switch (type)


Question: do we really need this kind of dispatch instead of using a template type + cuda::std utilities?

Unfortunately, yes. We need to be able to compile the entire dispatch and tuning without any types when coming from Python via CCCL.C.
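A rough sketch of what such a value-based classification can look like, so that the same predicate serves both typed C++ callers and callers that only have a runtime tag. The enum `type_tag` and its enumerators, `is_arithmetic_tag`, and `classify` are hypothetical; the real `type_t` enumerators are not shown in this thread:

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical runtime tag describing a value type.
enum class type_tag
{
  int32,
  float32,
  float64,
  other
};

// Predicate on the tag value: needs no C++ type, so it also works when the
// tag arrives at run time (e.g. from a C or Python front end).
constexpr bool is_arithmetic_tag(type_tag tag)
{
  switch (tag)
  {
    case type_tag::int32:
    case type_tag::float32:
    case type_tag::float64:
      return true;
    default:
      return false;
  }
}

// Typed callers map their C++ type to a tag once and then reuse the same predicate.
template <class T>
constexpr type_tag classify()
{
  return std::is_same<T, std::int32_t>::value ? type_tag::int32
       : std::is_same<T, float>::value        ? type_tag::float32
       : std::is_same<T, double>::value       ? type_tag::float64
                                              : type_tag::other;
}

static_assert(is_arithmetic_tag(classify<float>()), "float maps to an arithmetic tag");
static_assert(!is_arithmetic_tag(classify<void*>()), "pointers map to type_tag::other");
```

The trade-off is that every supported type must be listed in the enum by hand, which a trait like `cuda::std::is_arithmetic` would not require.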

github-actions · 2026-03-25T19:42:35Z

🥳 CI Workflow Results

🟩 Finished in 1h 32m: Pass: 100%/255 | Total: 8d 11h | Max: 1h 25m | Hits: 71%/161009


griwes

I love the unification of the divergent constexpr/nonconstexpr paths.

griwes · 2026-03-26T03:47:14Z

cub/cub/device/dispatch/tuning/tuning_scan.cuh

 struct policy_selector_from_types
 {
-  static constexpr int input_value_size       = int{sizeof(InputValueT)};
-  static constexpr int input_value_alignment  = int{alignof(InputValueT)};
-  static constexpr int output_value_size      = int{sizeof(OutputValueT)};
-  static constexpr int output_value_alignment = int{alignof(OutputValueT)};
-  static constexpr int accum_size             = int{sizeof(AccumT)};
-  static constexpr int accum_alignment        = int{alignof(AccumT)};
-  static constexpr type_t input_type          = classify_type<InputValueT>;


Doh! This was the missing piece to being able to straightforwardly make things easily constexpr, since values coming in as arguments can't do that. Wish I noticed it earlier 😅

bernhardmgruber requested review from a team as code owners March 24, 2026 10:14

bernhardmgruber requested a review from oleksandr-pavlyk March 24, 2026 10:14

github-project-automation bot added this to CCCL Mar 24, 2026

bernhardmgruber requested a review from fbusato March 24, 2026 10:14

github-project-automation bot moved this to Todo in CCCL Mar 24, 2026

cccl-authenticator-app bot moved this from Todo to In Review in CCCL Mar 24, 2026

bernhardmgruber mentioned this pull request Mar 24, 2026

Let scan tuning policy choose warpspeed or not #8158

Draft

2 tasks

fbusato requested changes Mar 24, 2026

github-project-automation bot moved this from In Review to In Progress in CCCL Mar 24, 2026

bernhardmgruber force-pushed the scan_refactor branch 2 times, most recently from 029feca to b46a654 Compare March 25, 2026 07:48

miscco approved these changes Mar 25, 2026

bernhardmgruber added 8 commits March 25, 2026 19:07

Use ptx_arch_id

f2a950c

Refactor warpspeed scan tuning

df072c5

Move some stuff from policy_selector to kernel source

cc526b4

Update benchmark polict selector

af04c06

Work around clang

9fc4c3e

Fix warning

4b23889

Fix warning

db1c4d7

Review

db404fb

bernhardmgruber force-pushed the scan_refactor branch from b46a654 to db404fb Compare March 25, 2026 18:07

bernhardmgruber requested a review from fbusato March 25, 2026 21:38

griwes approved these changes Mar 26, 2026
