Allow public tuning of cub::DeviceMergeSort #8600

Merged
bernhardmgruber merged 21 commits into NVIDIA:main from bernhardmgruber:merge_sort_tune_prod on Jun 3, 2026
Conversation

@bernhardmgruber (Contributor) commented Apr 21, 2026

Preceding work:

Fixes: #8574

New public entities for cub::DeviceMergeSort

| Entity | Data Members |
|---|---|
| cub::MergeSortPolicy | threads_per_block, items_per_thread, load_algorithm, load_modifier, store_algorithm |
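
As a rough illustration of what an aggregate with these five members could look like, here is a minimal stand-alone sketch. It does not include any CUB header: the `sketch` namespace, the three enumerations, and all member types are assumptions made for this example, and only the member names come from the table above.

```cpp
#include <type_traits>

namespace sketch {

// Stand-in enumerations. The real CUB load/store enumerations are not shown
// on this page, so these names and values are purely illustrative.
enum class LoadAlgorithm { direct, warp_transpose };
enum class LoadModifier { default_load, read_only_cache };
enum class StoreAlgorithm { direct, warp_transpose };

// Hypothetical mirror of cub::MergeSortPolicy with the five listed members.
// The member types are assumptions; only the member names come from the PR.
struct MergeSortPolicy {
  int threads_per_block;
  int items_per_thread;
  LoadAlgorithm load_algorithm;
  LoadModifier load_modifier;
  StoreAlgorithm store_algorithm;

  // Member-wise equality, written out by hand so the sketch compiles as C++17.
  friend constexpr bool operator==(const MergeSortPolicy& a, const MergeSortPolicy& b) {
    return a.threads_per_block == b.threads_per_block
        && a.items_per_thread == b.items_per_thread
        && a.load_algorithm == b.load_algorithm
        && a.load_modifier == b.load_modifier
        && a.store_algorithm == b.store_algorithm;
  }
  friend constexpr bool operator!=(const MergeSortPolicy& a, const MergeSortPolicy& b) {
    return !(a == b);
  }
};

// Having no user-declared constructors keeps the type an aggregate.
static_assert(std::is_aggregate_v<MergeSortPolicy>, "sketch policy should be an aggregate");
static_assert(std::is_default_constructible_v<MergeSortPolicy>, "sketch policy should be default constructible");
static_assert(std::is_copy_constructible_v<MergeSortPolicy>, "sketch policy should be copyable");

// Aggregate initialization in member order; the numbers are arbitrary examples.
constexpr MergeSortPolicy example_policy{256, 11, LoadAlgorithm::warp_transpose,
                                         LoadModifier::read_only_cache,
                                         StoreAlgorithm::warp_transpose};

} // namespace sketch
```

The values 256 and 11 are placeholders, not tuned defaults from the library.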

@copy-pr-bot (Bot, Contributor) commented Apr 21, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Progress in CCCL Apr 21, 2026
@bernhardmgruber bernhardmgruber force-pushed the merge_sort_tune_prod branch 5 times, most recently from c268a82 to 18fb743 Compare April 27, 2026 14:18
@bernhardmgruber bernhardmgruber marked this pull request as ready for review April 27, 2026 14:18
@bernhardmgruber bernhardmgruber requested review from a team as code owners April 27, 2026 14:18
@cccl-authenticator-app cccl-authenticator-app Bot moved this from In Progress to In Review in CCCL Apr 27, 2026
Comment thread cub/cub/device/device_merge_sort.cuh
Comment thread cub/cub/device/dispatch/tuning/tuning_merge_sort.cuh Outdated
@bernhardmgruber bernhardmgruber force-pushed the merge_sort_tune_prod branch from c16b518 to 9f648f3 Compare May 5, 2026 16:04
@bernhardmgruber bernhardmgruber force-pushed the merge_sort_tune_prod branch from 9f648f3 to dd2410e Compare May 5, 2026 22:48
@bernhardmgruber bernhardmgruber force-pushed the merge_sort_tune_prod branch from dd2410e to 9bacddc Compare May 6, 2026 11:23
@bernhardmgruber bernhardmgruber marked this pull request as draft May 6, 2026 19:22
@coderabbitai (Bot, Contributor) commented Jun 2, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d2aaf747-4776-4001-ad6b-fa3dcdb23f60

📥 Commits

Reviewing files that changed from the base of the PR and between 576a30c and 9811c8f.

📒 Files selected for processing (10)
  • cub/benchmarks/bench/merge_sort/keys.cu
  • cub/benchmarks/bench/merge_sort/pairs.cu
  • cub/cub/agent/agent_merge_sort.cuh
  • cub/cub/device/device_merge_sort.cuh
  • cub/cub/device/dispatch/dispatch_merge_sort.cuh
  • cub/cub/device/dispatch/kernels/kernel_merge_sort.cuh
  • cub/cub/device/dispatch/tuning/tuning_merge_sort.cuh
  • cub/test/catch2_test_device_merge_sort_custom_policy_hub.cu
  • cub/test/catch2_test_device_merge_sort_env.cu
  • cub/test/catch2_test_device_merge_sort_env_api.cu
✅ Files skipped from review due to trivial changes (2)
  • cub/cub/device/device_merge_sort.cuh
  • cub/test/catch2_test_device_merge_sort_custom_policy_hub.cu
🚧 Files skipped from review as they are similar to previous changes (6)
  • cub/benchmarks/bench/merge_sort/pairs.cu
  • cub/benchmarks/bench/merge_sort/keys.cu
  • cub/cub/agent/agent_merge_sort.cuh
  • cub/test/catch2_test_device_merge_sort_env_api.cu
  • cub/cub/device/dispatch/kernels/kernel_merge_sort.cuh
  • cub/cub/device/dispatch/tuning/tuning_merge_sort.cuh

Note: CodeRabbit is enabled on this repository as a convenience for maintainers
and contributors. Use your best judgment when considering its review comments and
suggestions — a suggested change may be inadequate, unnecessary, or safe to ignore.
Contributors are not expected to address every comment. Human reviews are what
ultimately matter for merging.

  • New public tuning API

    • Introduces a public MergeSortPolicy type (cub::MergeSortPolicy) usable by environment-based DeviceMergeSort tuning. MergeSortPolicy is an aggregate with members:
      • threads_per_block
      • items_per_thread
      • load_algorithm
      • load_modifier
      • store_algorithm
    • Replaces internal/detail policy types in many places so policy selectors and tuning machinery now return/consume cub::MergeSortPolicy.
  • Documentation and examples

    • Adds a Tuning subsection to DeviceMergeSort docs showing how to supply a custom policy selector and examples pulled from tests (cub/test/catch2_test_device_merge_sort_env_api.cu).
  • Tests

    • Adds compile-time tests validating MergeSortPolicy traits (semiregular, aggregate), aggregate/designated initialization, and equality.
    • Adds an env-based runtime test demonstrating tuning via cuda::execution::tune with a custom MergeSortPolicySelector and validating a tuned SortPairs call.
    • Minor test fixes/renames included in commits.
  • Implementation changes (notable)

    • Numerous files updated to use MergeSortPolicy instead of the removed/renamed internal merge_sort_policy, including agent, kernel, dispatch, and tuning headers.
    • Device/dispatch kernels and helpers compute ITEMS_PER_TILE as BLOCK_THREADS * ITEMS_PER_THREAD (policy members) rather than via the removed items_per_tile() helper.
    • DispatchMergeSort is annotated deprecated (CCCL_DEPRECATED_BECAUSE("Please use DeviceMergeSort")) and its Invoke signature and internal policy handling updated to use MergeSortPolicy; runtime assertions (_CCCL_ASSERT) added to validate threads_per_block and items_per_thread where applicable.
    • Benchmarks updated to return cub::MergeSortPolicy from policy selectors.
  • Behavioral and API notes

    • Publicly-exposed environment-aware DeviceMergeSort APIs can now be tuned with user-supplied policy selectors returning cub::MergeSortPolicy.
    • The internal items_per_tile() accessor was removed; callers now derive tile sizing from policy members directly.
    • Assertions/validation for policy values were added to guard tuned inputs.
  • Related / workflow

    • Dependent PRs listed to merge before or soon after this change: #8623 (ideally before), #8652, #8835, #8836.
    • PR converted to draft to avoid accidental merging while awaiting approval of #7671.
    • This change claims to fix issue #8574.
  • Impact

    • Public API surface expanded (cub::MergeSortPolicy, policy-selector return types).
    • Tests and docs updated to demonstrate/validate the new public tuning capability.
    • Review effort: medium — changes touch device kernels, dispatch, tuning machinery, tests, and docs; reviewers should confirm policy semantics, compile-time/device compatibility, and that deprecation/compatibility notes for DispatchMergeSort are adequate.
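
The summary mentions assertions that validate `threads_per_block` and `items_per_thread` but does not state the exact conditions. The sketch below assumes the simplest plausible check, that both values are positive, and uses the standard `assert` as a stand-in for the library's internal macro. The `validation_sketch` namespace, the `Policy` struct, and both helper functions are hypothetical.

```cpp
#include <cassert>

namespace validation_sketch {

// Hypothetical stand-in with only the two members the checks concern.
struct Policy {
  int threads_per_block;
  int items_per_thread;
};

// Assumed sanity condition: both sizes must be positive.
// The actual conditions checked in the PR are not shown on this page.
constexpr bool has_valid_sizes(const Policy& policy) {
  return policy.threads_per_block > 0 && policy.items_per_thread > 0;
}

// Guards a tuned policy before deriving the tile size from it.
// Plain assert stands in for the library's internal assertion macro.
inline int checked_tile_size(const Policy& policy) {
  assert(has_valid_sizes(policy) && "policy sizes must be positive");
  return policy.threads_per_block * policy.items_per_thread;
}

} // namespace validation_sketch
```

Keeping the predicate `constexpr` lets the same condition be checked at compile time for policies that are constant expressions.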

Walkthrough

Exports public cub::MergeSortPolicy and updates selectors, dispatch, kernels, agents, docs, and tests to use it; sizing now uses policy.threads_per_block * policy.items_per_thread and benchmarks/tests demonstrate env-based tuning with MergeSortPolicy.
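
The sizing rule described in the walkthrough can be sketched in a few lines. The `sizing_sketch` namespace, the two-member `Policy` struct, and the `tile_size` helper are hypothetical stand-ins for illustration; the real dispatch code reads the members of cub::MergeSortPolicy.

```cpp
namespace sizing_sketch {

// Hypothetical stand-in holding only the two members that tile sizing reads.
struct Policy {
  int threads_per_block;
  int items_per_thread;
};

// Items handled by one thread block per tile: threads times items per thread.
constexpr int tile_size(const Policy& policy) {
  return policy.threads_per_block * policy.items_per_thread;
}

// Example values are arbitrary: 256 threads x 11 items per thread = 2816 items.
static_assert(tile_size(Policy{256, 11}) == 2816, "256 threads x 11 items = 2816");

} // namespace sizing_sketch
```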

Changes

MergeSortPolicy Public API and Dispatch Chain

| Layer / File(s) | Summary |
|---|---|
| MergeSortPolicy type definition and policy selection: cub/cub/device/dispatch/tuning/tuning_merge_sort.cuh | Defines public cub::MergeSortPolicy and updates policy_selector, policy_selector_from_types, and policy_selector_from_hub to return MergeSortPolicy. |
| Dispatch code using MergeSortPolicy fields directly: cub/cub/device/dispatch/dispatch_merge_sort.cuh | DispatchMergeSort::Invoke and detail::merge_sort::dispatch compute tile_size as policy.threads_per_block * policy.items_per_thread; policy getter returns MergeSortPolicy; Invoke template parameter is now an unnamed default; DispatchMergeSort marked deprecated. |
| Kernel and agent implementations using MergeSortPolicy: cub/cub/device/dispatch/kernels/kernel_merge_sort.cuh, cub/cub/agent/agent_merge_sort.cuh | Kernel-local active_policy and helper policy members use MergeSortPolicy. AgentBlockSort and AgentMerge compute BLOCK_THREADS, ITEMS_PER_THREAD, ITEMS_PER_TILE from policy fields and size shared buffers using ITEMS_PER_TILE + 1. |
| Public API documentation and benchmark examples: cub/cub/device/device_merge_sort.cuh, cub/benchmarks/bench/merge_sort/keys.cu, cub/benchmarks/bench/merge_sort/pairs.cu | Adds a Doxygen Tuning section describing env-based tuning and shows benchmark policy_selector examples returning cub::MergeSortPolicy. |
| Type trait and API usage tests: cub/test/catch2_test_device_merge_sort_env.cu, cub/test/catch2_test_device_merge_sort_env_api.cu, cub/test/catch2_test_device_merge_sort_custom_policy_hub.cu | Updates tuning selector tests to return cub::MergeSortPolicy, adds compile-time trait checks for MergeSortPolicy, adds _CCCL_STD_VER >= 2020 example using cuda::execution::tune with a MergeSortPolicySelector, and suppresses deprecated-API warnings in one test. |
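
To make the agent row more concrete, the following sketch derives the three compile-time constants from policy values and sizes a buffer with ITEMS_PER_TILE + 1 elements. The `agent_sketch` namespace, the `BlockSortStorageSketch` template, and its `keys_shared` member are hypothetical and are not the real agent types. The summary does not say why the extra element is there, so the sketch only reproduces the arithmetic.

```cpp
namespace agent_sketch {

// Hypothetical storage layout; ThreadsPerBlock and ItemsPerThread stand in for
// the threads_per_block and items_per_thread members of a policy.
template <int ThreadsPerBlock, int ItemsPerThread, typename KeyT>
struct BlockSortStorageSketch {
  static constexpr int BLOCK_THREADS    = ThreadsPerBlock;
  static constexpr int ITEMS_PER_THREAD = ItemsPerThread;
  static constexpr int ITEMS_PER_TILE   = BLOCK_THREADS * ITEMS_PER_THREAD;

  // One element more than a full tile, as described in the summary.
  KeyT keys_shared[ITEMS_PER_TILE + 1];
};

// 128 threads x 4 items = 512 items per tile, so the buffer holds 513 keys.
static_assert(BlockSortStorageSketch<128, 4, int>::ITEMS_PER_TILE == 512, "tile of 512 items");
static_assert(sizeof(BlockSortStorageSketch<128, 4, int>) == 513 * sizeof(int), "513 keys stored");

} // namespace agent_sketch
```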

Assessment against linked issues

| Objective | Addressed | Explanation |
|---|---|---|
| Productize the cub::DeviceMergeSort tuning API [#8574] | | |

Suggested reviewers

  • griwes

@coderabbitai (Bot, Contributor) left a comment

Actionable comments posted: 3


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a3136543-ac94-4d25-82ad-22987888fe5f

📥 Commits

Reviewing files that changed from the base of the PR and between 9a8f7f6 and 9f35c3f.

📒 Files selected for processing (9)
  • cub/benchmarks/bench/merge_sort/keys.cu
  • cub/benchmarks/bench/merge_sort/pairs.cu
  • cub/cub/agent/agent_merge_sort.cuh
  • cub/cub/device/device_merge_sort.cuh
  • cub/cub/device/dispatch/dispatch_merge_sort.cuh
  • cub/cub/device/dispatch/kernels/kernel_merge_sort.cuh
  • cub/cub/device/dispatch/tuning/tuning_merge_sort.cuh
  • cub/test/catch2_test_device_merge_sort_env.cu
  • cub/test/catch2_test_device_merge_sort_env_api.cu

Comment thread cub/cub/device/dispatch/dispatch_merge_sort.cuh
Comment thread cub/cub/device/dispatch/tuning/tuning_merge_sort.cuh
Comment thread cub/test/catch2_test_device_merge_sort_env_api.cu Outdated
@github-actions (Bot, Contributor) commented Jun 3, 2026

🥳 CI Workflow Results

🟩 Finished in 2h 12m: Pass: 100%/284 | Total: 11d 05h | Max: 2h 10m | Hits: 20%/929605

See results here.

@bernhardmgruber bernhardmgruber merged commit 5a9ea63 into NVIDIA:main Jun 3, 2026
304 checks passed
@bernhardmgruber bernhardmgruber deleted the merge_sort_tune_prod branch June 3, 2026 11:02
@github-project-automation github-project-automation Bot moved this from In Review to Done in CCCL Jun 3, 2026
Development

Successfully merging this pull request may close these issues.

Productize the cub::DeviceMergeSort tuning API

2 participants