Skip to content

[cudax] Initial cudax::coop::reduce prototype#9154

Open
davebayer wants to merge 1 commit into
NVIDIA:mainfrom
davebayer:cudax_coop_reduce
Open

[cudax] Initial cudax::coop::reduce prototype#9154
davebayer wants to merge 1 commit into
NVIDIA:mainfrom
davebayer:cudax_coop_reduce

Conversation

@davebayer
Copy link
Copy Markdown
Contributor

This PR implements cudax::coop::reduce function prototype for cudax::this_(thread|warp|block). The function dispatches to cub:: equivalents for now and allocates all of the shared memory itself.

@davebayer davebayer requested review from a team as code owners May 28, 2026 09:18
@davebayer davebayer requested review from andralex and gevtushenko May 28, 2026 09:18
@github-project-automation github-project-automation Bot moved this to Todo in CCCL May 28, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL May 28, 2026
@davebayer davebayer linked an issue May 28, 2026 that may be closed by this pull request
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 28, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

important:

Walkthrough

Adds cuda::experimental::coop::reduce for cooperative reductions across thread, warp, and block scopes using CUB primitives. Each scope returns cuda::std::optional<T>: thread always returns the reduced value; warp and block return the reduced value only for the root rank. Public API enforces static group sizes and dispatches to corresponding internal implementations. Tests cover integral and floating-point types with multiple reduction operators.

Changes

Cooperative Group Reduction

Layer / File(s) Summary
Cooperative reduction core implementation
cudax/include/cuda/experimental/__coop/reduce.cuh
Internal __reduce_impl overloads dispatch to CUB reduction primitives (WarpReduce, BlockReduce) for this_thread, this_warp, and this_block. Public reduce wrapper validates static group size and group membership, returning optional<T> with root-only engagement for warp and block scopes.
Public header and test registration
cudax/include/cuda/experimental/coop.cuh, cudax/test/CMakeLists.txt
New public coop.cuh wrapper re-exports reduce functionality. CMakeLists registers three Catch2 test targets for this_thread, this_warp, and this_block scopes.
Thread-scoped reduction tests
cudax/test/coop/reduce/this_thread.cu
Device kernel stages 1–16 runtime-selected items into per-thread local array, invokes this_thread reduction, and writes result to output. Host tests generate seeded inputs, compute reference via cuda::std::accumulate with identity elements, and verify against kernel output using exact equality (integral) or relative tolerance (floating-point).
Warp-scoped reduction tests
cudax/test/coop/reduce/this_warp.cu
Device kernel loads 1–4 items per 32-lane warp, invokes this_warp reduction, asserts only root has a value, and writes root result to output. Host tests compute reference over num_items * 32 elements, run kernel, and verify using exact (integral) or relative-tolerance (floating-point) comparison.
Block-scoped reduction tests
cudax/test/coop/reduce/this_block.cu
Device kernel loads 1–4 items per thread using block-relative rank/count, invokes this_block reduction, asserts root-rank presence, and conditionally writes result. Host tests generate seeded inputs, compute reference via cuda::std::accumulate over num_items * block_size, run kernel, and verify using exact (integral) or relative-tolerance (floating-point) matching.
CUB launcher checks
cub/cub/detail/launcher/*.cuh
Adds runtime assertion helpers to launcher factories that gate enabling dependent launches (PDL) on device/PTX compute capability >= 9.0.
Dispatch sites: dependent-launch wiring
cub/cub/device/dispatch/*.cuh
Replaces unconditional dependent_launch=true usage at multiple dispatch sites with capability predicates (ptx_version >= 900 or cc >= 9.0) for init and main kernel launches across find, histogram, merge_sort, scan, and transform paths.

Possibly related PRs

  • NVIDIA/cccl#9180: Changes the compile-time macro enabling group features (like is_root_rank availability) that this PR's cooperative reduce implementation depends on.

Suggested reviewers

  • griwes
  • miscco

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (8)
cudax/include/cuda/experimental/__coop/reduce.cuh (1)

1-99: ⚡ Quick win

suggestion: Add a brief “unstable API / subject to change without notice” note in this header to satisfy CUDA Experimental API documentation requirements.

As per coding guidelines, "cudax/include/**/*.{h,hpp,cuh}: All CUDA Experimental APIs must be documented as unstable and subject to change without notice."

cudax/include/cuda/experimental/coop.cuh (1)

1-26: ⚡ Quick win

suggestion: Add an unstable-API note in this public CUDA Experimental header so consumers are explicitly warned that the surface may change without notice.

As per coding guidelines, "cudax/include/**/*.{h,hpp,cuh}: All CUDA Experimental APIs must be documented as unstable and subject to change without notice."

cudax/test/coop/reduce/this_thread.cu (1)

138-138: ⚡ Quick win

suggestion: swap the REQUIRE_THAT actual/expected sides for relative-match correctness and clearer failures.

Use REQUIRE_THAT(test_results, Catch::Matchers::WithinRel(expected_data, T{0.05})) so tolerance is anchored to the reference value and diagnostics read as actual vs expected.

cudax/test/coop/reduce/this_warp.cu (2)

31-60: ⚡ Quick win

important: remove the duplicated license and include block.

The file repeats the full prologue and headers verbatim, which adds maintenance noise and can hide future include drift.

As per coding guidelines, remove unused/duplicate code artifacts.


174-174: ⚡ Quick win

suggestion: flip REQUIRE_THAT argument order in the floating-point check.

Use actual result as the checked value and reference as matcher target to preserve expected relative-tolerance semantics and error output.

cudax/test/coop/reduce/this_block.cu (3)

31-60: ⚡ Quick win

important: remove the duplicated file prologue/includes.

The second copy is redundant and should be dropped to keep this test source maintainable.

As per coding guidelines, remove unused/duplicate code artifacts.


197-207: ⚡ Quick win

important: extend block-scope test coverage to include num_items 2 and 3.

Current block tests only validate item counts 1 and 4, leaving intermediate per-thread item paths untested.

Also applies to: 236-237, 263-264


175-175: ⚡ Quick win

suggestion: use actual-result-first ordering in REQUIRE_THAT for floating checks.

This keeps relative matching anchored to the reference and improves failure readability.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1921fa4b-a5d1-45b0-b57f-eaaa367e3367

📥 Commits

Reviewing files that changed from the base of the PR and between 5e3f881 and b2ac78b.

📒 Files selected for processing (7)
  • cub/cub/warp/warp_reduce.cuh
  • cudax/include/cuda/experimental/__coop/reduce.cuh
  • cudax/include/cuda/experimental/coop.cuh
  • cudax/test/CMakeLists.txt
  • cudax/test/coop/reduce/this_block.cu
  • cudax/test/coop/reduce/this_thread.cu
  • cudax/test/coop/reduce/this_warp.cu

Comment thread cudax/include/cuda/experimental/__coop/reduce.cuh
Comment thread cudax/include/cuda/experimental/coop.cuh
Comment thread cudax/test/coop/reduce/this_block.cu
Comment thread cudax/test/coop/reduce/this_thread.cu
Comment thread cudax/test/coop/reduce/this_warp.cu
@davebayer davebayer requested a review from a team as a code owner May 28, 2026 09:38
@davebayer davebayer requested a review from griwes May 28, 2026 09:38
@github-actions

This comment has been minimized.

@davebayer davebayer force-pushed the cudax_coop_reduce branch from 0efc515 to 4e8ecd1 Compare May 28, 2026 13:28
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
cudax/test/coop/reduce/this_thread.cu (2)

11-17: ⚡ Quick win

suggestion: add a direct include for cuda::std::numeric_limits instead of relying on transitive includes.

As per coding guidelines, "Files must include all headers related to symbols they use" and "avoid transitive includes and only include the most precise headers needed."

Also applies to: 225-225, 245-245


59-105: ⚡ Quick win

suggestion: operator_to_std/operator_to_std_t are unused in this test and should be removed to keep the test surface minimal.

As per coding guidelines, "Remove unused code, variables, functions, types, template parameters, and headers".

cudax/test/coop/reduce/this_warp.cu (1)

11-17: ⚡ Quick win

suggestion: add a direct include for cuda::std::numeric_limits to avoid transitive include dependency.

As per coding guidelines, "Files must include all headers related to symbols they use" and "avoid transitive includes and only include the most precise headers needed."

Also applies to: 144-144, 165-165

cudax/test/coop/reduce/this_block.cu (1)

11-17: ⚡ Quick win

suggestion: add a direct include for cuda::std::numeric_limits instead of relying on transitive includes.

As per coding guidelines, "Files must include all headers related to symbols they use" and "avoid transitive includes and only include the most precise headers needed."

Also applies to: 148-148, 174-174


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f8e9175d-5759-4110-af71-613709b62b09

📥 Commits

Reviewing files that changed from the base of the PR and between 0efc515 and 4e8ecd1.

📒 Files selected for processing (8)
  • cub/cub/warp/warp_reduce.cuh
  • cudax/cmake/cudaxBuildCompilerTargets.cmake
  • cudax/include/cuda/experimental/__coop/reduce.cuh
  • cudax/include/cuda/experimental/coop.cuh
  • cudax/test/CMakeLists.txt
  • cudax/test/coop/reduce/this_block.cu
  • cudax/test/coop/reduce/this_thread.cu
  • cudax/test/coop/reduce/this_warp.cu
🚧 Files skipped from review as they are similar to previous changes (5)
  • cudax/include/cuda/experimental/coop.cuh
  • cub/cub/warp/warp_reduce.cuh
  • cudax/test/CMakeLists.txt
  • cudax/include/cuda/experimental/__coop/reduce.cuh
  • cudax/cmake/cudaxBuildCompilerTargets.cmake

@github-actions

This comment has been minimized.

Comment thread cub/cub/warp/warp_reduce.cuh
Comment thread cudax/cmake/cudaxBuildCompilerTargets.cmake Outdated
Comment on lines +54 to +55
const auto __result = _WarpReduce{__scratch}.Reduce(__thread_data, __red_fn);
return (gpu_thread.is_root_rank(__group)) ? ::cuda::std::optional{__result} : ::cuda::std::nullopt;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am confused by this line. Is this because the value is only valid in the leader thread? Should we broadcast it rather than diverging further?

Otherwise, why do we even compute it if its not desired

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because only the root rank has the correct value. CUB does the same thing, but returns garbage for non-root ranks. I think returning optional is the better way.

We've already discussed this before and we agreed that we would start like this and add an API that would also broadcast the result in the future.

Comment thread cudax/test/coop/reduce/this_thread.cu Outdated
Comment thread cudax/test/coop/reduce/this_thread.cu Outdated
@davebayer davebayer force-pushed the cudax_coop_reduce branch from 4e8ecd1 to 885df76 Compare May 29, 2026 10:52
@davebayer davebayer requested a review from miscco May 29, 2026 10:59
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7934d393-277b-45fe-93b6-0348f06eef46

📥 Commits

Reviewing files that changed from the base of the PR and between 4e8ecd1 and 885df76.

📒 Files selected for processing (6)
  • cudax/include/cuda/experimental/__coop/reduce.cuh
  • cudax/include/cuda/experimental/coop.cuh
  • cudax/test/CMakeLists.txt
  • cudax/test/coop/reduce/this_block.cu
  • cudax/test/coop/reduce/this_thread.cu
  • cudax/test/coop/reduce/this_warp.cu
✅ Files skipped from review due to trivial changes (1)
  • cudax/include/cuda/experimental/coop.cuh
🚧 Files skipped from review as they are similar to previous changes (3)
  • cudax/include/cuda/experimental/__coop/reduce.cuh
  • cudax/test/coop/reduce/this_block.cu
  • cudax/test/coop/reduce/this_warp.cu

Comment thread cudax/test/coop/reduce/this_thread.cu
@github-actions

This comment has been minimized.

@davebayer davebayer force-pushed the cudax_coop_reduce branch from d774f1f to 37dac27 Compare May 29, 2026 14:39
@github-actions
Copy link
Copy Markdown
Contributor

🥳 CI Workflow Results

🟩 Finished in 33m 24s: Pass: 100%/55 | Total: 4h 20m | Max: 33m 17s | Hits: 97%/34333

See results here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

[FEA]: Initial cudax::coop::reduce prototype

2 participants