Implement the new tuning API for DeviceRadixSort #6767

Merged
bernhardmgruber merged 7 commits into NVIDIA:main from bernhardmgruber:tuning_radix_sort
Jan 22, 2026

@bernhardmgruber bernhardmgruber commented Nov 24, 2025

PR is not fully ready yet, but ready to take a first round of review.

  • cub.test.device.radix_sort_keys.lid_0.key_bits_16 passes
  • CCCL.C tests pass
  • Retain the if constexpr on the onesweep algorithm in the dispatcher
  • implement policy_selector_from_hub
  • Drop _policy
  • Merge before: Rename new tuning API policies and fix MSVC warning #7103
  • Merge before: Refactor cub::DeviceRadixSort #7282
  • tests for supplying a custom policy hub
  • No SASS changes for cub.bench.radix_sort.keys.base for SMs 70;80;90;100
  • No SASS changes for cub.bench.radix_sort.pairs.base for SMs 70;80;90;100

I cannot check the SASS of the benchmarks for architectures below SM70, because nvbench_helper.cu fails to compile there with:

/home/coder/cccl/lib/cmake/cub/../../../cub/cub/agent/agent_find.cuh(177): error #20305-D: atomic operations' argument of memory order is supported on architecture sm_70 or above. Fall back to use membar.
          __nv_atomic_load(found_pos_ptr, &temp_storage.global_result, __NV_ATOMIC_RELAXED, __NV_THREAD_SCOPE_DEVICE);
                           ^

But that is somewhat OK, since we don't officially support anything below SM75 anyway.

Fixes: #6676

@github-project-automation github-project-automation bot moved this to Todo in CCCL Nov 24, 2025
@bernhardmgruber bernhardmgruber changed the title Implement the new tuning API for DeviceRadixSort Implement the new tuning API for DeviceRadixSort Nov 24, 2025
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Progress in CCCL Nov 24, 2025

@miscco miscco left a comment

I love how much this cleans everything up

@bernhardmgruber

/ok to test 56c437d

@bernhardmgruber bernhardmgruber force-pushed the tuning_radix_sort branch 3 times, most recently from 8ad87dc to 800b936 Compare January 11, 2026 19:03
@bernhardmgruber

/ok to test 800b936

@copy-pr-bot

copy-pr-bot bot commented Jan 12, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

@bernhardmgruber

/ok to test 7e722a2

@bernhardmgruber

/ok to test 6cefe64

@bernhardmgruber

/ok to test 74089a2

@bernhardmgruber bernhardmgruber force-pushed the tuning_radix_sort branch 2 times, most recently from c09c7cc to 212d86d Compare January 13, 2026 15:54
@bernhardmgruber bernhardmgruber force-pushed the tuning_radix_sort branch 2 times, most recently from 04994b7 to f358513 Compare January 17, 2026 00:51

@bernhardmgruber

bernhardmgruber commented Jan 18, 2026

There are SASS differences for cub.bench.radix_sort.pairs.base on SM70 in several instantiations of DeviceRadixSortDownsweepKernel. Other SMs are fine.

@bernhardmgruber

> There are SASS differences for cub.bench.radix_sort.pairs.base on SM70 in several instantiations of DeviceRadixSortDownsweepKernel. Other SMs are fine.

Fixed.

Comment on lines +1 to +2
// SPDX-FileCopyrightText: Copyright (c) 2011-2023, NVIDIA CORPORATION. All rights reserved.
// SPDX-License-Identifier: BSD-3
Contributor

I believe this is the wrong license?

Contributor Author

I pulled most of the code out of the benchmark files keys.cu and pairs.cu, so I retained their license.

Comment on lines +81 to +82
__launch_bounds__(int(ALT_DIGIT_BITS ? PolicySelector{}(::cuda::arch_id{CUB_PTX_ARCH / 10}).alt_upsweep.block_threads
: PolicySelector{}(::cuda::arch_id{CUB_PTX_ARCH / 10}).upsweep.block_threads))
Contributor

I am strongly wondering why this is not part of the PolicySelector class? Is there any reason we have to pass this individually?

Contributor Author

Because the kernel is instantiated twice, where ALT_DIGIT_BITS is once true and once false. The logic here picks the corresponding tuning based on which kernel instantiation we have.
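
To illustrate the reply above, here is a minimal, host-only C++ sketch of the selection pattern. All names in it (`sweep_policy`, `radix_sort_policy`, `policy_selector`, `upsweep_block_threads`) and all tuning values are hypothetical stand-ins chosen for illustration; they are not the CUB definitions. The point is only that a `bool` template parameter can pick one of two sub-policies at compile time, so each of the two kernel instantiations ends up with its own launch bound.

```cpp
// Hypothetical stand-ins for the tuning types; NOT the actual CUB types.
struct sweep_policy
{
  int block_threads;
  int items_per_thread;
};

struct radix_sort_policy
{
  sweep_policy upsweep;     // tuning used for the regular digit width
  sweep_policy alt_upsweep; // tuning used for the alternate digit width
};

// Hypothetical selector: maps an architecture id to a policy at compile time.
// The numbers are made up for this sketch.
struct policy_selector
{
  constexpr radix_sort_policy operator()(int arch_id) const
  {
    return arch_id >= 90 ? radix_sort_policy{{256, 16}, {128, 8}}
                         : radix_sort_policy{{128, 16}, {64, 8}};
  }
};

// Mirrors the shape of the expression in the diff: the bool template
// parameter decides which sub-policy supplies the block size.
template <typename PolicySelector, bool AltDigitBits>
constexpr int upsweep_block_threads(int arch_id)
{
  return AltDigitBits ? PolicySelector{}(arch_id).alt_upsweep.block_threads
                      : PolicySelector{}(arch_id).upsweep.block_threads;
}

// Two instantiations, two different compile-time block sizes.
static_assert(upsweep_block_threads<policy_selector, false>(90) == 256, "regular tuning");
static_assert(upsweep_block_threads<policy_selector, true>(90) == 128, "alternate tuning");
```

In a real kernel the result of such an expression would feed `__launch_bounds__`, which requires a compile-time constant; that is why the whole selection has to be `constexpr`.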

@NVIDIA NVIDIA deleted a comment from copy-pr-bot bot Jan 21, 2026
@bernhardmgruber bernhardmgruber enabled auto-merge (squash) January 21, 2026 16:08
@github-actions

🥳 CI Workflow Results

🟩 Finished in 1d 04h: Pass: 100%/93 | Total: 5d 04h | Max: 4h 54m | Hits: 46%/91677

@bernhardmgruber bernhardmgruber merged commit eebd133 into NVIDIA:main Jan 22, 2026
202 of 211 checks passed
@github-project-automation github-project-automation bot moved this from In Review to Done in CCCL Jan 22, 2026
@bernhardmgruber bernhardmgruber deleted the tuning_radix_sort branch January 23, 2026 07:36