Implement the new tuning API for DeviceSegmentedReduce by bernhardmgruber · Pull Request #7334 · NVIDIA/cccl

bernhardmgruber · 2026-01-23T11:14:50Z

I started with this prompt to Cursor+codex:

Let's look at commit f5ddc3c, it introduces a refactoring to a new tuning API design for DispatchReduce.
Please apply this rewrite to DispatchSegmentedReduce as well.
Do not introduce any API breaking changes into the DispatchSegmentedReduce class.
Run the tests and fix any errors.

Everything except CCCL.C worked, but the result was not particularly clean. I would say it got 85% of the way there. I am still positively surprised.

I added a lot of refactoring and cleanup on top, and I debugged some segfaults in CCCL.C myself.

CUB tests pass
CCCL.C tests pass
No SASS diff for cub.bench.segmented_reduce.sum.base on SM75/SM100

copy-pr-bot · 2026-01-23T11:14:54Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

bernhardmgruber · 2026-01-23T11:32:14Z

/ok to test 074fd13

bernhardmgruber · 2026-01-23T13:07:41Z

/ok to test 5863ec6

miscco · 2026-01-26T07:02:00Z

cub/cub/device/dispatch/kernels/kernel_segmented_reduce.cuh

+CUB_DETAIL_KERNEL_ATTRIBUTES __launch_bounds__(int(
+  PolicySelector{}(::cuda::arch_id{CUB_PTX_ARCH / 10})
+    .segmented_reduce.block_threads)) void DeviceSegmentedReduceKernel(InputIteratorT d_in,


That is some cursed formatting
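
For readers unpacking the expression in the diff above: the argument to `__launch_bounds__` must be a compile-time constant, so the policy selector has to be callable in a constant expression and return a policy object from which `segmented_reduce.block_threads` is read. The following is a minimal host-only C++ sketch of that pattern. Every type name, the helper `arch_from_ptx`, and the tuning values are hypothetical stand-ins chosen for illustration; they are not the actual CUB or libcu++ definitions.

```cpp
// Hypothetical stand-ins for illustration only; NOT the real CUB/libcu++ types.
enum class arch_id : int { sm_75 = 75, sm_100 = 100 };

struct segmented_reduce_policy
{
  int block_threads;
  int items_per_thread;
};

struct policy
{
  segmented_reduce_policy segmented_reduce;
};

// A stateless functor that maps an architecture to a policy at compile time.
// The tuning values below are made up.
struct policy_selector
{
  constexpr policy operator()(arch_id arch) const
  {
    return static_cast<int>(arch) >= 100 ? policy{{512, 16}} : policy{{256, 20}};
  }
};

// Assumed convention: a PTX arch value such as 750 corresponds to arch id 75.
constexpr arch_id arch_from_ptx(int ptx_arch)
{
  return static_cast<arch_id>(ptx_arch / 10);
}

// Reads the block size in a constant expression, which is what an attribute
// like __launch_bounds__ would need as its argument.
template <typename PolicySelector>
constexpr int block_threads_for(arch_id arch)
{
  return PolicySelector{}(arch).segmented_reduce.block_threads;
}

static_assert(block_threads_for<policy_selector>(arch_id::sm_75) == 256, "");
static_assert(block_threads_for<policy_selector>(arch_from_ptx(750)) == 256, "");
static_assert(block_threads_for<policy_selector>(arch_id::sm_100) == 512, "");
```

Wrapping the lookup in a small `constexpr` helper like `block_threads_for` is one way to keep the attribute argument short, which may also sidestep the awkward line breaks the formatter produced here.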

cub/cub/device/dispatch/dispatch_segmented_reduce.cuh

github-actions · 2026-01-28T13:50:31Z

🥳 CI Workflow Results

🟩 Finished in 5h 18m: Pass: 100%/93 | Total: 4d 07h | Max: 5h 15m | Hits: 72%/92292

github-project-automation bot added this to CCCL Jan 23, 2026

github-project-automation bot moved this to Todo in CCCL Jan 23, 2026

cccl-authenticator-app bot moved this from Todo to In Progress in CCCL Jan 23, 2026

bernhardmgruber mentioned this pull request Jan 23, 2026

[EPIC] Finalize public API design for tuning CUB algorithms #7165

Open

30 tasks

bernhardmgruber force-pushed the tuning_segmented_reduce branch from 4c33932 to 074fd13 on January 23, 2026 11:31

bernhardmgruber marked this pull request as ready for review January 23, 2026 14:05

bernhardmgruber requested review from a team as code owners January 23, 2026 14:05

bernhardmgruber requested review from gevtushenko and wmaxey January 23, 2026 14:05

cccl-authenticator-app bot moved this from In Progress to In Review in CCCL Jan 23, 2026

miscco reviewed Jan 26, 2026

bernhardmgruber mentioned this pull request Jan 26, 2026

Implement the new tuning API for deterministic (rfa) reduce dispatch #7346

Open

4 tasks

bernhardmgruber force-pushed the tuning_segmented_reduce branch from 93a5e32 to df71698 on January 27, 2026 22:26

bernhardmgruber added 9 commits January 28, 2026 08:47

Drop AccumSize()

d086b70

First draft

dd9fca9

Fixes

1cde284

Fix policy hub test

f1c1494

Drop headers again

f13cac8

CCCL.C fixes

55438d9

fix test

f0eea31

type traits header

0553f47

qualify std::stringstream

7d2dcdc

bernhardmgruber force-pushed the tuning_segmented_reduce branch from df71698 to 7d2dcdc on January 28, 2026 07:47

bernhardmgruber added 2 commits January 28, 2026 09:28

Fix formatting

967f17c

Move policy_selector_from_hub

386df0e

NaderAlAwar approved these changes Jan 29, 2026

bernhardmgruber merged commit ce3629e into NVIDIA:main Jan 29, 2026
110 of 111 checks passed

bernhardmgruber deleted the tuning_segmented_reduce branch January 29, 2026 13:46

github-project-automation bot moved this from In Review to Done in CCCL Jan 29, 2026

jrhemstad mentioned this pull request Feb 3, 2026

Update cccl.c to use new CUB tuning API #7453

Closed

4 tasks

bernhardmgruber merged 11 commits into NVIDIA:main from bernhardmgruber:tuning_segmented_reduce

3 participants
