Skip to content

Refactor warpspeed scan tuning#8145

Open
bernhardmgruber wants to merge 8 commits intoNVIDIA:mainfrom
bernhardmgruber:scan_refactor
Open

Refactor warpspeed scan tuning#8145
bernhardmgruber wants to merge 8 commits intoNVIDIA:mainfrom
bernhardmgruber:scan_refactor

Conversation

@bernhardmgruber
Copy link
Copy Markdown
Contributor

@bernhardmgruber bernhardmgruber commented Mar 24, 2026

These are some refactorings following the restructuring to support CCCL.C and the new tuning API: #7565

  • No SASS changes for cub.bench.scan.exclusive.sum.base on SM75;80;86;90;100;120

@bernhardmgruber bernhardmgruber requested review from a team as code owners March 24, 2026 10:14
@bernhardmgruber bernhardmgruber requested a review from fbusato March 24, 2026 10:14
@github-project-automation github-project-automation bot moved this to Todo in CCCL Mar 24, 2026
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Mar 24, 2026
Comment on lines -642 to -644
#if __cccl_ptx_isa >= 860
struct WarpspeedPolicy
{
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: I am entirely removing warpspeed scan from the old policy hub, since we have the new tuning API now and we did not ship warpspeed scan to any release yet. So this is not a breaking change.

Comment on lines 1309 to -1317
struct policy_selector_from_types
{
static constexpr int input_value_size = int{sizeof(InputValueT)};
static constexpr int input_value_alignment = int{alignof(InputValueT)};
static constexpr int output_value_size = int{sizeof(OutputValueT)};
static constexpr int output_value_alignment = int{alignof(OutputValueT)};
static constexpr int accum_size = int{sizeof(AccumT)};
static constexpr int accum_alignment = int{alignof(AccumT)};
static constexpr type_t input_type = classify_type<InputValueT>;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: It's not the policy selector's job to handle the type erasure required for CCCL.C, that's what we have the kernel source for.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doh! This was the missing piece to being able to straightforwardly make things easily constexpr, since values coming in as arguments can't do that. Wish I noticed it earlier 😅

Comment on lines -854 to -859
using policy_selector_t = detail::scan::policy_selector_from_types<
detail::it_value_t<InputIteratorT>,
detail::it_value_t<OutputIteratorT>,
AccumT,
OffsetT,
ScanOpT>;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: This was incorrect, since it ignored the user provided policy hub.

REQUIRE(cudaSuccess == cudaGetDeviceProperties(&device_props, current_device));

const auto target_block_size =
selector_t{}(cuda::to_arch_id(cuda::compute_capability{device_props.major, device_props.minor})).block_threads;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: It could be argued that we should not use a detail function in the unit tests, but we will probably expose ptx_arch_id, or the compute capability version, in the public API when we go public with the tuning API. So this objection would be temporary.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

};

using MaxPolicy = Policy1200;
using MaxPolicy = Policy1000;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this change expected?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

// 1);

warpspeed_policy.tile_size = warpspeed_policy.items_per_thread * squad_reduce_thread_count;
if (arch >= ::cuda::arch_id::sm_120 && operation_t == op_kind_t::other && is_arithmetic_type(input_type))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should not is_arithmetic_type be fully qualified?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't fully qualify in CUB yet.

Comment on lines +555 to +560
static_cast<int>(kernel_src.InputSize()),
static_cast<int>(kernel_src.InputAlign()),
static_cast<int>(kernel_src.OutputSize()),
static_cast<int>(kernel_src.OutputAlign()),
static_cast<int>(kernel_src.AccumSize()),
static_cast<int>(kernel_src.AccumAlign()));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my understanding is that everything here is at compile-time

Suggested change
static_cast<int>(kernel_src.InputSize()),
static_cast<int>(kernel_src.InputAlign()),
static_cast<int>(kernel_src.OutputSize()),
static_cast<int>(kernel_src.OutputAlign()),
static_cast<int>(kernel_src.AccumSize()),
static_cast<int>(kernel_src.AccumAlign()));
int{kernel_src.InputSize()},
int{kernel_src.InputAlign()},
int{kernel_src.OutputSize()},
int{kernel_src.OutputAlign()},
int{kernel_src.AccumSize()},
int{kernel_src.AccumAlign())};

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's only constexpr when called through the CUB API. It's just const when called through CCCL.C.

// TODO(bgruber): put this somewhere else
constexpr _CCCL_HOST_DEVICE bool is_arithmetic_type(type_t type)
{
switch (type)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question. Do we really need this kind of dispatch instead of using a template type + cuda::std utilities?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, yes. We need to be able to compile the entire dispatch and tuning without any types when coming from Python via CCCL.C.

@github-project-automation github-project-automation bot moved this from In Review to In Progress in CCCL Mar 24, 2026
@bernhardmgruber bernhardmgruber force-pushed the scan_refactor branch 2 times, most recently from 029feca to b46a654 Compare March 25, 2026 07:48
@github-actions

This comment has been minimized.

@github-actions
Copy link
Copy Markdown
Contributor

🥳 CI Workflow Results

🟩 Finished in 1h 32m: Pass: 100%/255 | Total: 8d 11h | Max: 1h 25m | Hits: 71%/161009

See results here.

@bernhardmgruber bernhardmgruber requested a review from fbusato March 25, 2026 21:38
Copy link
Copy Markdown
Contributor

@griwes griwes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love the unification of the divergent constexpr/nonconstexpr paths.

Comment on lines 1309 to -1317
struct policy_selector_from_types
{
static constexpr int input_value_size = int{sizeof(InputValueT)};
static constexpr int input_value_alignment = int{alignof(InputValueT)};
static constexpr int output_value_size = int{sizeof(OutputValueT)};
static constexpr int output_value_alignment = int{alignof(OutputValueT)};
static constexpr int accum_size = int{sizeof(AccumT)};
static constexpr int accum_alignment = int{alignof(AccumT)};
static constexpr type_t input_type = classify_type<InputValueT>;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doh! This was the missing piece to being able to straightforwardly make things easily constexpr, since values coming in as arguments can't do that. Wish I noticed it earlier 😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

4 participants