Skip to content

Refactor warpspeed scan 1/2#9168

Merged
bernhardmgruber merged 16 commits into
NVIDIA:mainfrom
bernhardmgruber:ref_scan_part1
May 29, 2026
Merged

Refactor warpspeed scan 1/2#9168
bernhardmgruber merged 16 commits into
NVIDIA:mainfrom
bernhardmgruber:ref_scan_part1

Conversation

@bernhardmgruber
Copy link
Copy Markdown
Contributor

@bernhardmgruber bernhardmgruber commented May 28, 2026

This is the first part of a few cleanup commits that should improve the readability of the warpspeed scan implementation.

AI generated summary:

  • Replaced the large kernelBody free function with an agent_warpspeed_scan struct that holds the kernel parameters as member data
  • Broke the monolithic function body into focused member functions:
    • load_next_tile_index — loads the next tile index into shared memory (scheduler squad)
    • load_current_tile — bulk-loads the current tile from global to shared memory (load squad)
    • lookback — performs the decoupled lookback for prefix propagation (lookback squad)
    • reduce_tile — loads tile from smem, reduces across threads/warps/squad, stores aggregates (reduce squad)
    • scan_and_store_tile — performs the inclusive/exclusive scan and stores results back to global memory (scan-store squad)
    • run — main loop orchestrating the pipeline across squads
  • Eliminated the scan_and_store lambda (which couldn't capture structured bindings in C++17) in favor of a proper template member function with IsLastTile as a template parameter
  • smem_resource.cuh: Renamed popStage() to nextStage() for clarity

I tried pedantically to not cause any SASS changes, but extracting the reduce_tile function somehow let the compiler elide an ISETP.GE.U64.AND P3, PT, R14, 0xf80, PT ; instruction, so the kernel is now exactly one instruction shorter. A few register names changed as well, but the instructions are otherwise identical.

@bernhardmgruber bernhardmgruber requested a review from a team as a code owner May 28, 2026 21:36
@github-project-automation github-project-automation Bot moved this to Todo in CCCL May 28, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL May 28, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 28, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 77750c59-508e-4aca-923b-0cb0ff10efc9

📥 Commits

Reviewing files that changed from the base of the PR and between 874d40c and 700f422.

📒 Files selected for processing (1)
  • cub/cub/device/dispatch/kernels/kernel_scan_warpspeed.cuh
🚧 Files skipped from review as they are similar to previous changes (1)
  • cub/cub/device/dispatch/kernels/kernel_scan_warpspeed.cuh

📝 Walkthrough

Summary by CodeRabbit

  • Refactor
    • Enhanced const-correctness across resource management and synchronization method signatures
    • Restructured scan kernel implementation to improve code organization through decomposition of complex logic into specialized helper methods
    • Updated comparison operators to use value-based parameters instead of references, enabling more flexible semantics

Walkthrough

Refactors the warpspeed scan kernel by extracting inline kernelBody into an agent_warpspeed_scan struct, updating related abstractions (SmemResource, Squad, SquadDesc), and rewriting the tiling loop to use nextStage() for stage advancement while preserving barrier synchronization and scan semantics.

Changes

Warpspeed scan kernel refactoring

Layer / File(s) Summary
Warpspeed abstraction layer API updates
cub/cub/detail/warpspeed/resource/smem_resource.cuh, cub/cub/detail/warpspeed/squad/squad.cuh, cub/cub/detail/warpspeed/squad/squad_desc.cuh
SmemResource stage accessor renamed from popStage() to nextStage(). Squad::syncThreads() now const-qualified with explicit barrier-index computation. SquadDesc comparison operators changed from const-reference parameters to by-value parameters.
Agent struct definition and dispatch orchestration
cub/cub/device/dispatch/kernels/kernel_scan_warpspeed.cuh
Introduces agent_warpspeed_scan struct encapsulating cached registers, scan parameters, and shared-memory resources. Former inline kernelBody logic factored into helper methods (load_next_tile_index, load_current_tile, lookback, reduce_tile, scan_and_store_tile). dispatch_squad() rewrites tiling loop to advance stages via nextStage() while preserving phase synchronization, lookback behavior, partial-tile handling, and TMA-backed bulk store logic. Dispatch wiring updated to instantiate and call agent_warpspeed_scan::dispatch_squad() instead of kernelBody().

Possibly related PRs

  • NVIDIA/cccl#9128: Modifies warpspeed scan resource allocation; directly connected via shared-memory stage handling changes in this PR.

Suggested reviewers

  • NaderAlAwar
  • shwina

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

@bernhardmgruber bernhardmgruber changed the title Refactor warpspeed scan 1 Refactor warpspeed scan 1/2 May 28, 2026
Comment on lines -50 to -51
[[nodiscard]] _CCCL_HOST_DEVICE_API friend constexpr bool
operator==(const SquadDesc& lhs, const SquadDesc& rhs) noexcept
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is needed to avoid ODR use of squad descriptions which are static constexpr members of a class, since otherwise we get the error that those are not accessible in __device__ code.

@github-actions

This comment has been minimized.

int barrierIdx = (int) this->mSquadIdx + 1;
const int barrierIdx = this->mSquadIdx + 1;

__barrier_sync_count(barrierIdx, this->threadCount());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to the PR: Why do we use unaligned version of the __barrier_sync? From what I see in the main loop, we always synchronize the warps on the same line in the code, so we should be able to use the aligned version, which produces a bit less code.

See the comparison: https://godbolt.org/z/vPGbT9W7c

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I'm probably wrong, it's not that all threads that are part of the barrier must call the instruction uniformly, but rather the whole CTA :(

Comment thread cub/cub/device/dispatch/kernels/kernel_scan_warpspeed.cuh Outdated
Co-authored-by: David Bayer <48736217+davebayer@users.noreply.github.com>
@bernhardmgruber
Copy link
Copy Markdown
Contributor Author

/ok to test 700f422

Comment on lines +342 to +343
bool is_first_tile,
bool is_last_tile, // TODO(bgruber): should we dispatch on is_last_tile outside this function and compile it twice?
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe in the long run we should have an enumeration like

enum class good_name{
  __full_tile,
  __first_tile,
  __last_tile,
  __only_tile,
};

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly!

PhaseInOutT& phaseInOutRW,
PhaseSumT& phaseSumThreadAndWarpW,
int valid_items,
bool is_first_tile,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should those rather be template arguments?

Copy link
Copy Markdown
Contributor Author

@bernhardmgruber bernhardmgruber May 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is exactly what the next line implies:

// TODO(bgruber): should we dispatch on is_last_tile outside this function and compile it twice?

I believe we may want to test this, but not in this PR.

@github-actions
Copy link
Copy Markdown
Contributor

🥳 CI Workflow Results

🟩 Finished in 1h 51m: Pass: 100%/285 | Total: 3d 07h | Max: 41m 02s | Hits: 100%/196423

See results here.

@bernhardmgruber bernhardmgruber merged commit 64a42f1 into NVIDIA:main May 29, 2026
308 of 309 checks passed
@bernhardmgruber bernhardmgruber deleted the ref_scan_part1 branch May 29, 2026 10:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

3 participants