[cudax] Initial `cudax::coop::reduce` prototype by davebayer · Pull Request #9154 · NVIDIA/cccl

davebayer · 2026-05-28T09:18:16Z

This PR implements cudax::coop::reduce function prototype for cudax::this_(thread|warp|block). The function dispatches to cub:: equivalents for now and allocates all of the shared memory itself.

coderabbitai · 2026-05-28T09:29:25Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

important:

Walkthrough

Adds cuda::experimental::coop::reduce for cooperative reductions across thread, warp, and block scopes using CUB primitives. Each scope returns cuda::std::optional<T>: thread always returns the reduced value; warp and block return the reduced value only for the root rank. Public API enforces static group sizes and dispatches to corresponding internal implementations. Tests cover integral and floating-point types with multiple reduction operators.

Changes

Cooperative Group Reduction

Layer / File(s)	Summary
Cooperative reduction core implementation `cudax/include/cuda/experimental/__coop/reduce.cuh`	Internal `__reduce_impl` overloads dispatch to CUB reduction primitives (`WarpReduce`, `BlockReduce`) for `this_thread`, `this_warp`, and `this_block`. Public `reduce` wrapper validates static group size and group membership, returning `optional<T>` with root-only engagement for warp and block scopes.
Public header and test registration `cudax/include/cuda/experimental/coop.cuh`, `cudax/test/CMakeLists.txt`	New public `coop.cuh` wrapper re-exports `reduce` functionality. CMakeLists registers three Catch2 test targets for `this_thread`, `this_warp`, and `this_block` scopes.
Thread-scoped reduction tests `cudax/test/coop/reduce/this_thread.cu`	Device kernel stages 1–16 runtime-selected items into per-thread local array, invokes `this_thread` reduction, and writes result to output. Host tests generate seeded inputs, compute reference via `cuda::std::accumulate` with identity elements, and verify against kernel output using exact equality (integral) or relative tolerance (floating-point).
Warp-scoped reduction tests `cudax/test/coop/reduce/this_warp.cu`	Device kernel loads 1–4 items per 32-lane warp, invokes `this_warp` reduction, asserts only root has a value, and writes root result to output. Host tests compute reference over `num_items * 32` elements, run kernel, and verify using exact (integral) or relative-tolerance (floating-point) comparison.
Block-scoped reduction tests `cudax/test/coop/reduce/this_block.cu`	Device kernel loads 1–4 items per thread using block-relative rank/count, invokes `this_block` reduction, asserts root-rank presence, and conditionally writes result. Host tests generate seeded inputs, compute reference via `cuda::std::accumulate` over `num_items * block_size`, run kernel, and verify using exact (integral) or relative-tolerance (floating-point) matching.
CUB launcher checks `cub/cub/detail/launcher/*.cuh`	Adds runtime assertion helpers to launcher factories that gate enabling dependent launches (PDL) on device/PTX compute capability >= 9.0.
Dispatch sites: dependent-launch wiring `cub/cub/device/dispatch/*.cuh`	Replaces unconditional dependent_launch=true usage at multiple dispatch sites with capability predicates (`ptx_version >= 900` or `cc >= 9.0`) for init and main kernel launches across find, histogram, merge_sort, scan, and transform paths.

Possibly related PRs

NVIDIA/cccl#9180: Changes the compile-time macro enabling group features (like is_root_rank availability) that this PR's cooperative reduce implementation depends on.

Suggested reviewers

griwes
miscco

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (8)

cudax/include/cuda/experimental/__coop/reduce.cuh (1)

1-99: ⚡ Quick win

suggestion: Add a brief “unstable API / subject to change without notice” note in this header to satisfy CUDA Experimental API documentation requirements.

As per coding guidelines, "cudax/include/**/*.{h,hpp,cuh}: All CUDA Experimental APIs must be documented as unstable and subject to change without notice."

cudax/include/cuda/experimental/coop.cuh (1)

1-26: ⚡ Quick win

suggestion: Add an unstable-API note in this public CUDA Experimental header so consumers are explicitly warned that the surface may change without notice.

As per coding guidelines, "cudax/include/**/*.{h,hpp,cuh}: All CUDA Experimental APIs must be documented as unstable and subject to change without notice."

cudax/test/coop/reduce/this_thread.cu (1)

138-138: ⚡ Quick win

suggestion: swap the REQUIRE_THAT actual/expected sides for relative-match correctness and clearer failures.

Use REQUIRE_THAT(test_results, Catch::Matchers::WithinRel(expected_data, T{0.05})) so tolerance is anchored to the reference value and diagnostics read as actual vs expected.

cudax/test/coop/reduce/this_warp.cu (2)

31-60: ⚡ Quick win

important: remove the duplicated license and include block.

The file repeats the full prologue and headers verbatim, which adds maintenance noise and can hide future include drift.

As per coding guidelines, remove unused/duplicate code artifacts.

174-174: ⚡ Quick win

suggestion: flip REQUIRE_THAT argument order in the floating-point check.

Use actual result as the checked value and reference as matcher target to preserve expected relative-tolerance semantics and error output.

cudax/test/coop/reduce/this_block.cu (3)

31-60: ⚡ Quick win

important: remove the duplicated file prologue/includes.

The second copy is redundant and should be dropped to keep this test source maintainable.

As per coding guidelines, remove unused/duplicate code artifacts.

197-207: ⚡ Quick win

important: extend block-scope test coverage to include num_items 2 and 3.

Current block tests only validate item counts 1 and 4, leaving intermediate per-thread item paths untested.

Also applies to: 236-237, 263-264

175-175: ⚡ Quick win

suggestion: use actual-result-first ordering in REQUIRE_THAT for floating checks.

This keeps relative matching anchored to the reference and improves failure readability.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1921fa4b-a5d1-45b0-b57f-eaaa367e3367

📥 Commits

Reviewing files that changed from the base of the PR and between 5e3f881 and b2ac78b.

📒 Files selected for processing (7)

cub/cub/warp/warp_reduce.cuh
cudax/include/cuda/experimental/__coop/reduce.cuh
cudax/include/cuda/experimental/coop.cuh
cudax/test/CMakeLists.txt
cudax/test/coop/reduce/this_block.cu
cudax/test/coop/reduce/this_thread.cu
cudax/test/coop/reduce/this_warp.cu

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (4)

cudax/test/coop/reduce/this_thread.cu (2)

11-17: ⚡ Quick win

suggestion: add a direct include for cuda::std::numeric_limits instead of relying on transitive includes.

As per coding guidelines, "Files must include all headers related to symbols they use" and "avoid transitive includes and only include the most precise headers needed."

Also applies to: 225-225, 245-245

59-105: ⚡ Quick win

suggestion: operator_to_std/operator_to_std_t are unused in this test and should be removed to keep the test surface minimal.

As per coding guidelines, "Remove unused code, variables, functions, types, template parameters, and headers".

cudax/test/coop/reduce/this_warp.cu (1)

11-17: ⚡ Quick win

suggestion: add a direct include for cuda::std::numeric_limits to avoid transitive include dependency.

As per coding guidelines, "Files must include all headers related to symbols they use" and "avoid transitive includes and only include the most precise headers needed."

Also applies to: 144-144, 165-165

cudax/test/coop/reduce/this_block.cu (1)

11-17: ⚡ Quick win

suggestion: add a direct include for cuda::std::numeric_limits instead of relying on transitive includes.

As per coding guidelines, "Files must include all headers related to symbols they use" and "avoid transitive includes and only include the most precise headers needed."

Also applies to: 148-148, 174-174

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f8e9175d-5759-4110-af71-613709b62b09

📥 Commits

Reviewing files that changed from the base of the PR and between 0efc515 and 4e8ecd1.

📒 Files selected for processing (8)

cub/cub/warp/warp_reduce.cuh
cudax/cmake/cudaxBuildCompilerTargets.cmake
cudax/include/cuda/experimental/__coop/reduce.cuh
cudax/include/cuda/experimental/coop.cuh
cudax/test/CMakeLists.txt
cudax/test/coop/reduce/this_block.cu
cudax/test/coop/reduce/this_thread.cu
cudax/test/coop/reduce/this_warp.cu

🚧 Files skipped from review as they are similar to previous changes (5)

cudax/include/cuda/experimental/coop.cuh
cub/cub/warp/warp_reduce.cuh
cudax/test/CMakeLists.txt
cudax/include/cuda/experimental/__coop/reduce.cuh
cudax/cmake/cudaxBuildCompilerTargets.cmake

miscco · 2026-05-29T08:24:32Z

+  const auto __result = _WarpReduce{__scratch}.Reduce(__thread_data, __red_fn);
+  return (gpu_thread.is_root_rank(__group)) ? ::cuda::std::optional{__result} : ::cuda::std::nullopt;


I am confused by this line. Is this because the value is only valid in the leader thread? Should we broadcast it rather than diverging further?

Otherwise, why do we even compute it if its not desired

This is because only the root rank has the correct value. CUB does the same thing, but returns garbage for non-root ranks. I think returning optional is the better way.

We've already discussed this before and we agreed that we would start like this and add an API that would also broadcast the result in the future.

coderabbitai

Actionable comments posted: 1

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7934d393-277b-45fe-93b6-0348f06eef46

📥 Commits

Reviewing files that changed from the base of the PR and between 4e8ecd1 and 885df76.

📒 Files selected for processing (6)

cudax/include/cuda/experimental/__coop/reduce.cuh
cudax/include/cuda/experimental/coop.cuh
cudax/test/CMakeLists.txt
cudax/test/coop/reduce/this_block.cu
cudax/test/coop/reduce/this_thread.cu
cudax/test/coop/reduce/this_warp.cu

✅ Files skipped from review due to trivial changes (1)

cudax/include/cuda/experimental/coop.cuh

🚧 Files skipped from review as they are similar to previous changes (3)

cudax/include/cuda/experimental/__coop/reduce.cuh
cudax/test/coop/reduce/this_block.cu
cudax/test/coop/reduce/this_warp.cu

github-actions · 2026-05-29T15:19:36Z

🥳 CI Workflow Results

🟩 Finished in 33m 24s: Pass: 100%/55 | Total: 4h 20m | Max: 33m 17s | Hits: 97%/34333

See results here.

davebayer requested review from a team as code owners May 28, 2026 09:18

github-project-automation Bot added this to CCCL May 28, 2026

davebayer requested review from andralex and gevtushenko May 28, 2026 09:18

github-project-automation Bot moved this to Todo in CCCL May 28, 2026

cccl-authenticator-app Bot moved this from Todo to In Review in CCCL May 28, 2026

davebayer linked an issue May 28, 2026 that may be closed by this pull request

[FEA]: Initial cudax::coop::reduce prototype #9156

Open

coderabbitai Bot reviewed May 28, 2026

View reviewed changes

Comment thread cudax/include/cuda/experimental/__coop/reduce.cuh

Comment thread cudax/include/cuda/experimental/coop.cuh

Comment thread cudax/test/coop/reduce/this_block.cu

Comment thread cudax/test/coop/reduce/this_thread.cu

Comment thread cudax/test/coop/reduce/this_warp.cu

davebayer requested a review from a team as a code owner May 28, 2026 09:38

davebayer requested a review from griwes May 28, 2026 09:38

This comment has been minimized.

Sign in to view

davebayer force-pushed the cudax_coop_reduce branch from 0efc515 to 4e8ecd1 Compare May 28, 2026 13:28

coderabbitai Bot reviewed May 28, 2026

View reviewed changes

This comment has been minimized.

Sign in to view

miscco reviewed May 29, 2026

View reviewed changes

davebayer force-pushed the cudax_coop_reduce branch from 4e8ecd1 to 885df76 Compare May 29, 2026 10:52

davebayer requested a review from miscco May 29, 2026 10:59

coderabbitai Bot reviewed May 29, 2026

View reviewed changes

Comment thread cudax/test/coop/reduce/this_thread.cu

This comment has been minimized.

Sign in to view

[cudax] Implement cudax::coop::reduce prototype

37dac27

davebayer force-pushed the cudax_coop_reduce branch from d774f1f to 37dac27 Compare May 29, 2026 14:39

		const auto __result = _WarpReduce{__scratch}.Reduce(__thread_data, __red_fn);
		return (gpu_thread.is_root_rank(__group)) ? ::cuda::std::optional{__result} : ::cuda::std::nullopt;

Conversation

davebayer commented May 28, 2026

Uh oh!

coderabbitai Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Possibly related PRs

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

This comment has been minimized.

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

This comment has been minimized.

Uh oh!

Uh oh!

miscco May 29, 2026

Choose a reason for hiding this comment

Uh oh!

davebayer May 29, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

This comment has been minimized.

github-actions Bot commented May 29, 2026

🥳 CI Workflow Results

🟩 Finished in 33m 24s: Pass: 100%/55 | Total: 4h 20m | Max: 33m 17s | Hits: 97%/34333

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented May 28, 2026 •

edited

Loading