Implement the new tuning API for DeviceSegmentedReduce #7334

Merged
bernhardmgruber merged 11 commits into NVIDIA:main from bernhardmgruber:tuning_segmented_reduce
Jan 29, 2026
@bernhardmgruber bernhardmgruber commented Jan 23, 2026

I started with this prompt to Cursor+codex:

Let's look at commit f5ddc3c, it introduces a refactoring to a new tuning API design for DispatchReduce.
Please apply this rewrite to DispatchSegmentedReduce as well.
Do not introduce any API breaking changes into the DispatchSegmentedReduce class.
Run the tests and fix any errors.

Everything except CCCL.C worked, but the result was not particularly clean. I would say it got 85% of the way there. I am still positively surprised.

I added a lot of refactoring and cleanup on top, and I debugged some segfaults in CCCL.C myself.

  • CUB tests pass
  • CCCL.C tests pass
  • No SASS diff for cub.bench.segmented_reduce.sum.base on SM75/SM100

copy-pr-bot bot commented Jan 23, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@bernhardmgruber
Contributor Author

/ok to test 074fd13

@bernhardmgruber
Contributor Author

/ok to test 5863ec6

@bernhardmgruber bernhardmgruber marked this pull request as ready for review January 23, 2026 14:05
@bernhardmgruber bernhardmgruber requested review from a team as code owners January 23, 2026 14:05
@cccl-authenticator-app cccl-authenticator-app bot moved this from In Progress to In Review in CCCL Jan 23, 2026
Comment on lines 108 to 110
CUB_DETAIL_KERNEL_ATTRIBUTES __launch_bounds__(int(
PolicySelector{}(::cuda::arch_id{CUB_PTX_ARCH / 10})
.segmented_reduce.block_threads)) void DeviceSegmentedReduceKernel(InputIteratorT d_in,

That is some cursed formatting
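
For readers unfamiliar with the pattern in the snippet under review: the kernel's launch bounds come from calling a stateless `PolicySelector` with an architecture id at compile time and reading `segmented_reduce.block_threads` from the result. A minimal host-only C++ sketch of that idea follows. Everything in it is an assumption made for illustration: `arch_id`, `policy`, `segmented_reduce_policy`, `example_policy_selector`, `block_threads_for`, the `items_per_thread` field, and all tuning values are hypothetical stand-ins, not the real CUB or libcu++ definitions.

```cpp
// Hypothetical stand-ins for illustration only; NOT the real CUB/libcu++ types.
enum class arch_id : int {};

struct segmented_reduce_policy
{
  int block_threads;
  int items_per_thread; // assumed field, not taken from the PR
};

struct policy
{
  segmented_reduce_policy segmented_reduce;
};

// A stateless selector: a constexpr call operator maps an architecture id
// to a plain policy value. The thresholds and numbers are made up.
struct example_policy_selector
{
  constexpr policy operator()(arch_id arch) const
  {
    const int sm = static_cast<int>(arch);
    if (sm >= 100)
    {
      return policy{{512, 8}};
    }
    if (sm >= 80)
    {
      return policy{{256, 16}};
    }
    return policy{{128, 16}};
  }
};

// Mirrors the shape of the expression inside __launch_bounds__ in the
// reviewed snippet: divide the PTX arch value by 10, ask the selector for
// the policy, and read the block size, all in a constant expression.
template <typename PolicySelector>
constexpr int block_threads_for(int ptx_arch)
{
  return PolicySelector{}(static_cast<arch_id>(ptx_arch / 10)).segmented_reduce.block_threads;
}

// Compile-time checks: the result is usable wherever a constant is required.
static_assert(block_threads_for<example_policy_selector>(750) == 128, "SM75 tuning");
static_assert(block_threads_for<example_policy_selector>(1000) == 512, "SM100 tuning");
```

Because the selector returns an ordinary value from a `constexpr` call, the same object could in principle also be queried at run time on the host, which is one plausible reason for preferring it over a chain of nested policy types.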

@bernhardmgruber bernhardmgruber force-pushed the tuning_segmented_reduce branch from df71698 to 7d2dcdc Compare January 28, 2026 07:47
@github-actions
Contributor

🥳 CI Workflow Results

🟩 Finished in 5h 18m: Pass: 100%/93 | Total: 4d 07h | Max: 5h 15m | Hits: 72%/92292

See results here.

@bernhardmgruber bernhardmgruber merged commit ce3629e into NVIDIA:main Jan 29, 2026
110 of 111 checks passed
@bernhardmgruber bernhardmgruber deleted the tuning_segmented_reduce branch January 29, 2026 13:46
@github-project-automation github-project-automation bot moved this from In Review to Done in CCCL Jan 29, 2026