Refactor warpspeed scan 1/2 by bernhardmgruber · Pull Request #9168 · NVIDIA/cccl

bernhardmgruber · 2026-05-28T21:36:48Z

This is the first part of a few cleanup commits that should improve the readability of the warpspeed scan implementation.

AI generated summary:

- Replaced the large kernelBody free function with an agent_warpspeed_scan struct that holds the kernel parameters as member data
- Broke the monolithic function body into focused member functions:
  - load_next_tile_index: loads the next tile index into shared memory (scheduler squad)
  - load_current_tile: bulk-loads the current tile from global to shared memory (load squad)
  - lookback: performs the decoupled lookback for prefix propagation (lookback squad)
  - reduce_tile: loads the tile from smem, reduces across threads/warps/squad, and stores the aggregates (reduce squad)
  - scan_and_store_tile: performs the inclusive/exclusive scan and stores the results back to global memory (scan-store squad)
  - run: main loop orchestrating the pipeline across squads
- Eliminated the scan_and_store lambda (which couldn't capture structured bindings in C++17) in favor of a proper template member function with IsLastTile as a template parameter
- smem_resource.cuh: Renamed popStage() to nextStage() for clarity
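The shape of this refactoring can be illustrated with a minimal, host-only sketch. It is not the CUB implementation: `tile_scan_agent`, `inclusive_scan_tiled`, and the sequential scan loop are hypothetical stand-ins, and there are no squads, barriers, or shared memory. It only shows parameters held as member data and a member function templated on `IsLastTile` in place of a lambda.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, host-only analogue of an "agent" struct: what used to be
// function parameters of one large free function becomes member data.
struct tile_scan_agent
{
  const int* in;
  int* out;
  std::size_t num_items;
  std::size_t tile_size; // assumed to be > 0

  // IsLastTile is a compile-time flag, so the partial-tile handling is only
  // compiled into the last-tile instantiation.
  template <bool IsLastTile>
  int scan_and_store_tile(std::size_t tile_begin, int prefix) const
  {
    std::size_t count = tile_size;
    if constexpr (IsLastTile)
    {
      count = num_items - tile_begin; // the last tile may be partial
    }
    for (std::size_t i = 0; i < count; ++i)
    {
      prefix += in[tile_begin + i];
      out[tile_begin + i] = prefix;
    }
    return prefix; // running prefix carried into the next tile
  }

  // Main loop: all full tiles first, then the last (possibly partial) tile.
  void run() const
  {
    if (num_items == 0)
    {
      return;
    }
    const std::size_t num_tiles = (num_items + tile_size - 1) / tile_size;
    int prefix = 0;
    for (std::size_t t = 0; t + 1 < num_tiles; ++t)
    {
      prefix = scan_and_store_tile<false>(t * tile_size, prefix);
    }
    scan_and_store_tile<true>((num_tiles - 1) * tile_size, prefix);
  }
};

// Convenience wrapper used to exercise the sketch.
inline std::vector<int> inclusive_scan_tiled(const std::vector<int>& input, std::size_t tile_size)
{
  std::vector<int> output(input.size());
  tile_scan_agent{input.data(), output.data(), input.size(), tile_size}.run();
  return output;
}
```

Because a member function can read the agent's data directly, nothing has to be captured, which sidesteps the C++17 restriction on capturing structured bindings in lambdas.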

I tried pedantically to not cause any SASS changes, but extracting the reduce_tile function somehow let the compiler elide an ISETP.GE.U64.AND P3, PT, R14, 0xf80, PT ; instruction, so the kernel is now exactly one instruction shorter. A few register names changed as well, but the instructions are otherwise identical.

coderabbitai · 2026-05-28T21:40:35Z

Actionable comments posted: 0

coderabbitai · 2026-05-28T21:40:38Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 77750c59-508e-4aca-923b-0cb0ff10efc9

📥 Commits

Reviewing files that changed from the base of the PR and between 874d40c and 700f422.

📒 Files selected for processing (1)

cub/cub/device/dispatch/kernels/kernel_scan_warpspeed.cuh

🚧 Files skipped from review as they are similar to previous changes (1)

cub/cub/device/dispatch/kernels/kernel_scan_warpspeed.cuh

📝 Walkthrough

Summary by CodeRabbit

Refactor
- Enhanced const-correctness across resource management and synchronization method signatures
- Restructured scan kernel implementation to improve code organization through decomposition of complex logic into specialized helper methods
- Updated comparison operators to use value-based parameters instead of references, enabling more flexible semantics

Walkthrough

Refactors the warpspeed scan kernel by extracting inline kernelBody into an agent_warpspeed_scan struct, updating related abstractions (SmemResource, Squad, SquadDesc), and rewriting the tiling loop to use nextStage() for stage advancement while preserving barrier synchronization and scan semantics.

Changes

Warpspeed scan kernel refactoring

| Layer / File(s) | Summary |
| --- | --- |
| Warpspeed abstraction layer API updates: `cub/cub/detail/warpspeed/resource/smem_resource.cuh`, `cub/cub/detail/warpspeed/squad/squad.cuh`, `cub/cub/detail/warpspeed/squad/squad_desc.cuh` | SmemResource stage accessor renamed from `popStage()` to `nextStage()`. Squad::syncThreads() now `const`-qualified with explicit barrier-index computation. SquadDesc comparison operators changed from const-reference parameters to by-value parameters. |
| Agent struct definition and dispatch orchestration: `cub/cub/device/dispatch/kernels/kernel_scan_warpspeed.cuh` | Introduces `agent_warpspeed_scan` struct encapsulating cached registers, scan parameters, and shared-memory resources. Former inline `kernelBody` logic factored into helper methods (`load_next_tile_index`, `load_current_tile`, `lookback`, `reduce_tile`, `scan_and_store_tile`). `dispatch_squad()` rewrites tiling loop to advance stages via `nextStage()` while preserving phase synchronization, lookback behavior, partial-tile handling, and TMA-backed bulk store logic. Dispatch wiring updated to instantiate and call `agent_warpspeed_scan::dispatch_squad()` instead of `kernelBody()`. |

Possibly related PRs

NVIDIA/cccl#9128: Modifies warpspeed scan resource allocation; directly connected via shared-memory stage handling changes in this PR.

Suggested reviewers

NaderAlAwar
shwina

Comment `@coderabbitai help` to get the list of available commands and usage tips.

coderabbitai · 2026-05-28T22:07:25Z

Actionable comments posted: 0

bernhardmgruber · 2026-05-28T22:32:46Z

-  [[nodiscard]] _CCCL_HOST_DEVICE_API friend constexpr bool
-  operator==(const SquadDesc& lhs, const SquadDesc& rhs) noexcept


Taking the parameters by value is needed to avoid ODR-use of the squad descriptions, which are static constexpr members of a class. With const-reference parameters we get the error that those members are not accessible in __device__ code.
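A minimal host-only sketch of the by-value signature follows. The names `squad_desc_sketch`, `squads_sketch`, and `is_load_squad` are hypothetical and not the actual `SquadDesc` API, and the device-side error mentioned above cannot be reproduced in plain C++. The sketch only shows that by-value comparison operators remain usable in constant expressions with static constexpr members.

```cpp
// Hypothetical stand-in for a squad description; not the actual SquadDesc.
struct squad_desc_sketch
{
  int index;
  int warp_count;

  // Parameters are taken by value instead of by const reference, so the
  // operator itself never holds a reference to the caller's objects.
  friend constexpr bool operator==(squad_desc_sketch lhs, squad_desc_sketch rhs) noexcept
  {
    return lhs.index == rhs.index && lhs.warp_count == rhs.warp_count;
  }

  friend constexpr bool operator!=(squad_desc_sketch lhs, squad_desc_sketch rhs) noexcept
  {
    return !(lhs == rhs);
  }
};

// Squad descriptions held as static constexpr members of a class, as
// described in the comment above (the concrete values are made up).
struct squads_sketch
{
  static constexpr squad_desc_sketch load{0, 1};
  static constexpr squad_desc_sketch scan{1, 4};
};

constexpr bool is_load_squad(squad_desc_sketch s) noexcept
{
  return s == squads_sketch::load;
}

// The comparison can be evaluated entirely at compile time.
static_assert(is_load_squad(squads_sketch::load), "load squad compares equal to itself");
static_assert(squads_sketch::load != squads_sketch::scan, "distinct squads compare unequal");
```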

davebayer · 2026-05-29T05:49:17Z

-    int barrierIdx = (int) this->mSquadIdx + 1;
+    const int barrierIdx = this->mSquadIdx + 1;

    __barrier_sync_count(barrierIdx, this->threadCount());


Unrelated to the PR: why do we use the unaligned version of __barrier_sync? From what I see in the main loop, we always synchronize the warps at the same line of code, so we should be able to use the aligned version, which generates slightly less code.

See the comparison: https://godbolt.org/z/vPGbT9W7c

Ah, I'm probably wrong: the aligned version does not just require the threads that are part of the barrier to call the instruction uniformly, but rather the whole CTA :(

bernhardmgruber · 2026-05-29T08:29:47Z

/ok to test 700f422

miscco · 2026-05-29T08:50:56Z

+    bool is_first_tile,
+    bool is_last_tile, // TODO(bgruber): should we dispatch on is_last_tile outside this function and compile it twice?


I believe in the long run we should have an enumeration like

enum class good_name{ __full_tile, __first_tile, __last_tile, __only_tile, };
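One way such an enumeration could look is sketched below. The names `tile_kind`, `classify_tile`, `is_first_tile`, and `is_last_tile` are hypothetical placeholders (the comment above leaves the name open), and unreserved identifiers are used so the sketch compiles as ordinary user code.

```cpp
// Hypothetical enumeration replacing the two booleans is_first_tile / is_last_tile.
enum class tile_kind
{
  full,  // neither first nor last
  first, // first tile, more tiles follow
  last,  // last tile, preceded by other tiles
  only   // single tile that is both first and last
};

// Classifies a tile by its index; assumes 0 <= tile_idx < num_tiles.
constexpr tile_kind classify_tile(int tile_idx, int num_tiles) noexcept
{
  const bool first = tile_idx == 0;
  const bool last  = tile_idx == num_tiles - 1;
  if (first && last)
  {
    return tile_kind::only;
  }
  if (first)
  {
    return tile_kind::first;
  }
  if (last)
  {
    return tile_kind::last;
  }
  return tile_kind::full;
}

// Helpers that recover the original boolean views from the enumeration.
constexpr bool is_first_tile(tile_kind k) noexcept
{
  return k == tile_kind::first || k == tile_kind::only;
}

constexpr bool is_last_tile(tile_kind k) noexcept
{
  return k == tile_kind::last || k == tile_kind::only;
}
```

A single enumerator makes the four cases explicit at the call site, whereas two adjacent `bool` parameters are easy to swap by accident.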

miscco · 2026-05-29T08:51:50Z

+    PhaseInOutT& phaseInOutRW,
+    PhaseSumT& phaseSumThreadAndWarpW,
+    int valid_items,
+    bool is_first_tile,


Should those rather be template arguments?

This is exactly what the next line implies:

// TODO(bgruber): should we dispatch on is_last_tile outside this function and compile it twice?

I believe we may want to test this, but not in this PR.
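The idea of dispatching on `is_last_tile` outside the function and compiling it twice could be sketched as follows. This is a host-only illustration with hypothetical names (`items_to_process`, `dispatch_bool`, `items_to_process_runtime`), not code from the PR: a runtime `bool` selects one of two instantiations of a function template.

```cpp
#include <type_traits>

// Compiled twice: once per value of IsLastTile. Only the last-tile
// instantiation contains the partial-tile clamp.
template <bool IsLastTile>
int items_to_process(int tile_size, int valid_items)
{
  if constexpr (IsLastTile)
  {
    return valid_items < tile_size ? valid_items : tile_size;
  }
  else
  {
    return tile_size; // full tiles never need the valid_items check
  }
}

// Turns a runtime bool into a compile-time constant by calling f with
// std::true_type or std::false_type. Both calls must return the same type.
template <typename F>
decltype(auto) dispatch_bool(bool value, F&& f)
{
  return value ? f(std::true_type{}) : f(std::false_type{});
}

// The single runtime branch lives here, outside the templated function.
inline int items_to_process_runtime(bool is_last_tile, int tile_size, int valid_items)
{
  return dispatch_bool(is_last_tile, [&](auto is_last) {
    return items_to_process<decltype(is_last)::value>(tile_size, valid_items);
  });
}
```

Whether the extra instantiation pays off in the real kernel is exactly the open question raised above; the sketch only shows the mechanism.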

github-actions · 2026-05-29T10:23:11Z

🥳 CI Workflow Results

🟩 Finished in 1h 51m: Pass: 100%/285 | Total: 3d 07h | Max: 41m 02s | Hits: 100%/196423

See results here.

bernhardmgruber added 13 commits May 28, 2026 23:23

Re-enable warpspeed on SM120

befb22d

Refactor warpspeed scan

987be20

no SASS changes

No SASS changes

d603feb

No SASS changes

5d962d1

No SASS changes

fc1958d

Just register name changes, no new instructions

e2f3760

No SASS changes

baaae20

SASS changes:

ad8fb39

kernel gets one instruction shorter: ISETP.GE.U64.AND P3, PT, R14, 0xf80, PT ; plus a few registers have different names now

No SASS changes

efbf825

Comment

e732c5c

Name

617f8ed

Rename

6b63747

const

60553ef

bernhardmgruber requested a review from a team as a code owner May 28, 2026 21:36

github-project-automation Bot added this to CCCL May 28, 2026

bernhardmgruber requested a review from NaderAlAwar May 28, 2026 21:36

github-project-automation Bot moved this to Todo in CCCL May 28, 2026

cccl-authenticator-app Bot moved this from Todo to In Review in CCCL May 28, 2026

Fix

874d40c

bernhardmgruber changed the title ~~Refactor warpspeed scan 1~~ Refactor warpspeed scan 1/2 May 28, 2026

bernhardmgruber mentioned this pull request May 28, 2026

Refactor warpspeed scan 2/2 #9169

Open

1 task

Revert "Re-enable warpspeed on SM120"

6d392f9

This reverts commit befb22d.

bernhardmgruber commented May 28, 2026

davebayer reviewed May 29, 2026

Update cub/cub/device/dispatch/kernels/kernel_scan_warpspeed.cuh

700f422

Co-authored-by: David Bayer <48736217+davebayer@users.noreply.github.com>

miscco reviewed May 29, 2026

miscco approved these changes May 29, 2026

bernhardmgruber merged commit 64a42f1 into NVIDIA:main May 29, 2026
308 of 309 checks passed

bernhardmgruber deleted the ref_scan_part1 branch May 29, 2026 10:36
