Implement parallel cuda::std::sort #8621

Merged
miscco merged 2 commits into NVIDIA:main from miscco:parallel_sort
May 12, 2026

@miscco miscco commented Apr 22, 2026

This implements the sort algorithm for the CUDA backend.

It provides tests and benchmarks similar to Thrust's, plus some boilerplate for libcu++.

The functionality is not yet publicly available and is implemented in a private internal header.

Fixes #7376
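
For orientation, here is a minimal host-only sketch of the two call shapes of the standard `std::sort` that this algorithm mirrors: one using the default ordering and one taking a caller-supplied predicate. The standard parallel overloads additionally take an execution policy as their first argument. That form, and the `cuda::std` entry point itself, are left out on purpose, because the header added in this PR is private and its interface is not shown in this thread. The helper names `sort_default` and `sort_with_predicate` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical helper: sorts a copy of the input with the default ordering (operator<).
std::vector<int> sort_default(std::vector<int> v) {
  std::sort(v.begin(), v.end());
  assert(std::is_sorted(v.begin(), v.end()));
  return v;
}

// Hypothetical helper: sorts a copy of the input with a caller-supplied predicate,
// here std::greater, which yields descending order.
std::vector<int> sort_with_predicate(std::vector<int> v) {
  std::sort(v.begin(), v.end(), std::greater<int>{});
  assert(std::is_sorted(v.begin(), v.end(), std::greater<int>{}));
  return v;
}
```

Calling `sort_default({3, 1, 2})` yields `{1, 2, 3}`, and `sort_with_predicate({3, 1, 2})` yields `{3, 2, 1}`.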

@miscco miscco requested review from a team as code owners April 22, 2026 13:46
@miscco miscco requested a review from oleksandr-pavlyk April 22, 2026 13:46
@github-project-automation github-project-automation Bot moved this to Todo in CCCL Apr 22, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Apr 22, 2026

Comment thread libcudacxx/include/cuda/std/__memory/pointer_traits.h
Comment thread libcudacxx/include/cuda/std/__pstl/cuda/sort.h Outdated
@bernhardmgruber
Contributor

The benchmark looks a bit negative, but the slowdowns are tiny. I am still wondering a bit whether we missed something. Maybe we should repeat the benchmark on a different machine.

@miscco miscco requested a review from a team as a code owner April 24, 2026 13:39
@miscco miscco force-pushed the parallel_sort branch 2 times, most recently from ebb66b8 to b349875 Compare April 28, 2026 14:00
Comment thread thrust/thrust/system/cuda/detail/sort.h
Comment thread cub/cub/device/device_radix_sort.cuh Outdated

This implements the sort algorithm for the CUDA backend.

* std::sort, see https://en.cppreference.com/w/cpp/algorithm/sort.html

It provides tests and benchmarks similar to Thrust's, plus some boilerplate for libcu++.

The functionality is not yet publicly available and is implemented in a private internal header.

Fixes NVIDIA#7376
Comment thread cub/cub/device/device_radix_sort.cuh
Comment thread thrust/testing/cuda/sort.cu
@miscco
Contributor Author

miscco commented May 12, 2026

I re-evaluated on Blackwell:

['thrust_sort.json', 'pstl_sort.json']
# base

## [0] NVIDIA RTX PRO 6000 Blackwell Workstation Edition

|  T   |  Elements  |  Entropy  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|------|------------|-----------|------------|-------------|------------|-------------|------------|---------|----------|
|  I8  |    2^16    |     1     |  22.607 us |       4.24% |  22.571 us |       2.78% |  -0.036 us |  -0.16% |   SAME   |
|  I8  |    2^20    |     1     |  25.535 us |       5.35% |  26.550 us |       2.75% |   1.015 us |   3.97% |   SLOW   |
|  I8  |    2^24    |     1     |  86.938 us |       1.39% |  89.644 us |       1.10% |   2.706 us |   3.11% |   SLOW   |
|  I8  |    2^28    |     1     |   1.216 ms |       0.18% |   1.223 ms |       0.16% |   6.960 us |   0.57% |   SLOW   |
|  I8  |    2^16    |   0.201   |  22.452 us |       3.50% |  22.725 us |       4.41% |   0.273 us |   1.22% |   SAME   |
|  I8  |    2^20    |   0.201   |  24.060 us |       4.20% |  25.156 us |       4.92% |   1.096 us |   4.56% |   SLOW   |
|  I8  |    2^24    |   0.201   |  84.586 us |       1.48% |  87.429 us |       1.60% |   2.843 us |   3.36% |   SLOW   |
|  I8  |    2^28    |   0.201   |   1.219 ms |       0.58% |   1.226 ms |       0.16% |   6.706 us |   0.55% |   SLOW   |
| I16  |    2^16    |     1     |  28.044 us |       4.11% |  28.294 us |       2.49% |   0.250 us |   0.89% |   SAME   |
| I16  |    2^20    |     1     |  34.937 us |       2.65% |  35.308 us |       1.88% |   0.371 us |   1.06% |   SAME   |
| I16  |    2^24    |     1     | 150.464 us |       0.86% | 151.752 us |       0.77% |   1.288 us |   0.86% |   SLOW   |
| I16  |    2^28    |     1     |   1.951 ms |       0.35% |   1.975 ms |       0.48% |  24.316 us |   1.25% |   SLOW   |
| I16  |    2^16    |   0.201   |  27.493 us |       4.12% |  27.792 us |       2.80% |   0.298 us |   1.09% |   SAME   |
| I16  |    2^20    |   0.201   |  31.890 us |       3.20% |  32.395 us |       2.71% |   0.505 us |   1.58% |   SAME   |
| I16  |    2^24    |   0.201   | 130.327 us |       0.77% | 131.303 us |       0.80% |   0.976 us |   0.75% |   SAME   |
| I16  |    2^28    |   0.201   |   1.981 ms |       0.13% |   1.989 ms |       0.13% |   8.202 us |   0.41% |   SLOW   |
| I32  |    2^16    |     1     |  42.317 us |       1.79% |  42.660 us |       2.42% |   0.343 us |   0.81% |   SAME   |
| I32  |    2^20    |     1     |  58.572 us |       1.19% |  59.254 us |       1.06% |   0.681 us |   1.16% |   SLOW   |
| I32  |    2^24    |     1     | 309.775 us |       0.38% | 312.215 us |       0.36% |   2.440 us |   0.79% |   SLOW   |
| I32  |    2^28    |     1     |   6.841 ms |       0.07% |   6.847 ms |       0.07% |   5.874 us |   0.09% |   SLOW   |
| I32  |    2^16    |   0.201   |  41.666 us |       1.70% |  42.253 us |       2.73% |   0.587 us |   1.41% |   SAME   |
| I32  |    2^20    |   0.201   |  51.945 us |       2.36% |  52.548 us |       2.00% |   0.603 us |   1.16% |   SAME   |
| I32  |    2^24    |   0.201   | 271.176 us |       0.36% | 273.037 us |       0.38% |   1.860 us |   0.69% |   SLOW   |
| I32  |    2^28    |   0.201   |   6.798 ms |       0.13% |   6.804 ms |       0.11% |   5.957 us |   0.09% |   SAME   |
| I64  |    2^16    |     1     |  69.310 us |       2.03% |  69.898 us |       1.05% |   0.588 us |   0.85% |   SAME   |
| I64  |    2^20    |     1     | 108.659 us |       1.28% | 109.246 us |       0.71% |   0.587 us |   0.54% |   SAME   |
| I64  |    2^24    |     1     |   1.488 ms |       0.19% |   1.484 ms |       0.15% |  -4.460 us |  -0.30% |   FAST   |
| I64  |    2^28    |     1     |  25.449 ms |       0.09% |  25.457 ms |       0.03% |   8.102 us |   0.03% |   SLOW   |
| I64  |    2^16    |   0.201   |  66.083 us |       1.86% |  66.768 us |       1.60% |   0.685 us |   1.04% |   SAME   |
| I64  |    2^20    |   0.201   | 101.069 us |       1.03% | 101.670 us |       0.66% |   0.602 us |   0.60% |   SAME   |
| I64  |    2^24    |   0.201   |   1.574 ms |       0.11% |   1.576 ms |       0.12% |   2.507 us |   0.16% |   SLOW   |
| I64  |    2^28    |   0.201   |  25.412 ms |       0.03% |  25.432 ms |       0.02% |  20.878 us |   0.08% |   SLOW   |
| I128 |    2^16    |     1     | 109.491 us |       1.38% | 110.220 us |       0.91% |   0.729 us |   0.67% |   SAME   |
| I128 |    2^20    |     1     | 247.203 us |       0.43% | 248.617 us |       0.39% |   1.414 us |   0.57% |   SLOW   |
| I128 |    2^24    |     1     |   6.127 ms |       0.07% |   6.124 ms |       0.06% |  -2.286 us |  -0.04% |   SAME   |
| I128 |    2^28    |     1     | 100.288 ms |       0.01% | 100.307 ms |       0.01% |  19.100 us |   0.02% |   SLOW   |
| I128 |    2^16    |   0.201   | 103.644 us |       0.87% | 104.662 us |       0.81% |   1.018 us |   0.98% |   SLOW   |
| I128 |    2^20    |   0.201   | 239.595 us |       0.40% | 241.232 us |       0.37% |   1.637 us |   0.68% |   SLOW   |
| I128 |    2^24    |   0.201   |   6.157 ms |       0.07% |   6.159 ms |       0.06% |   2.396 us |   0.04% |   SAME   |
| I128 |    2^28    |   0.201   |  99.330 ms |       0.01% |  99.272 ms |       0.01% | -57.574 us |  -0.06% |   FAST   |
| F32  |    2^16    |     1     |  42.369 us |       2.05% |  43.028 us |       2.31% |   0.659 us |   1.56% |   SAME   |
| F32  |    2^20    |     1     |  59.386 us |       1.64% |  60.291 us |       1.11% |   0.905 us |   1.52% |   SLOW   |
| F32  |    2^24    |     1     | 339.452 us |       0.40% | 340.858 us |       0.26% |   1.406 us |   0.41% |   SLOW   |
| F32  |    2^28    |     1     |   7.030 ms |       0.16% |   7.026 ms |       0.11% |  -4.677 us |  -0.07% |   SAME   |
| F32  |    2^16    |   0.201   |  40.731 us |       1.90% |  41.680 us |       2.63% |   0.948 us |   2.33% |   SLOW   |
| F32  |    2^20    |   0.201   |  54.109 us |       1.85% |  55.116 us |       2.14% |   1.007 us |   1.86% |   SLOW   |
| F32  |    2^24    |   0.201   | 302.259 us |       0.34% | 303.758 us |       0.40% |   1.499 us |   0.50% |   SLOW   |
| F32  |    2^28    |   0.201   |   6.903 ms |       0.06% |   6.902 ms |       0.10% |  -1.049 us |  -0.02% |   SAME   |
| F64  |    2^16    |     1     |  70.710 us |       1.35% |  71.603 us |       2.83% |   0.894 us |   1.26% |   SAME   |
| F64  |    2^20    |     1     | 109.874 us |       0.70% | 110.832 us |       0.79% |   0.958 us |   0.87% |   SLOW   |
| F64  |    2^24    |     1     |   1.478 ms |       0.14% |   1.472 ms |       0.17% |  -6.638 us |  -0.45% |   FAST   |
| F64  |    2^28    |     1     |  26.599 ms |       0.04% |  26.633 ms |       0.03% |  34.056 us |   0.13% |   SLOW   |
| F64  |    2^16    |   0.201   |  69.317 us |       1.24% |  70.636 us |       1.63% |   1.319 us |   1.90% |   SLOW   |
| F64  |    2^20    |   0.201   | 103.788 us |       0.71% | 104.977 us |       0.80% |   1.189 us |   1.15% |   SLOW   |
| F64  |    2^24    |   0.201   |   1.576 ms |       0.14% |   1.576 ms |       0.15% |  -0.037 us |  -0.00% |   SAME   |
| F64  |    2^28    |   0.201   |  25.739 ms |       0.02% |  25.738 ms |       0.02% |  -1.160 us |  -0.00% |   SAME   |

# with_predicate

## [0] NVIDIA RTX PRO 6000 Blackwell Workstation Edition

|  T   |  Elements  |  Entropy  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|------|------------|-----------|------------|-------------|------------|-------------|------------|---------|----------|
|  I8  |    2^16    |     1     |  32.676 us |       2.50% |  33.116 us |       2.80% |   0.441 us |   1.35% |   SAME   |
|  I8  |    2^20    |     1     |  76.569 us |       0.68% |  77.141 us |       0.70% |   0.572 us |   0.75% |   SLOW   |
|  I8  |    2^24    |     1     | 397.458 us |       0.50% | 398.535 us |       0.52% |   1.077 us |   0.27% |   SAME   |
|  I8  |    2^28    |     1     |   7.975 ms |       0.23% |   8.015 ms |       0.24% |  39.424 us |   0.49% |   SLOW   |
|  I8  |    2^16    |   0.201   |  31.805 us |       2.00% |  32.313 us |       3.27% |   0.508 us |   1.60% |   SAME   |
|  I8  |    2^20    |   0.201   |  74.753 us |       0.77% |  75.371 us |       1.21% |   0.618 us |   0.83% |   SLOW   |
|  I8  |    2^24    |   0.201   | 396.979 us |       0.49% | 397.752 us |       0.53% |   0.772 us |   0.19% |   SAME   |
|  I8  |    2^28    |   0.201   |   7.942 ms |       0.20% |   7.981 ms |       0.21% |  38.748 us |   0.49% |   SLOW   |
| I16  |    2^16    |     1     |  34.222 us |       2.58% |  34.825 us |       2.47% |   0.602 us |   1.76% |   SAME   |
| I16  |    2^20    |     1     |  78.926 us |       1.55% |  79.486 us |       1.68% |   0.561 us |   0.71% |   SAME   |
| I16  |    2^24    |     1     | 399.217 us |       0.81% | 406.097 us |       7.81% |   6.880 us |   1.72% |   SLOW   |
| I16  |    2^28    |     1     |  14.122 ms |       0.11% |  14.142 ms |       0.22% |  19.868 us |   0.14% |   SLOW   |
| I16  |    2^16    |   0.201   |  33.821 us |       1.56% |  34.476 us |       1.52% |   0.655 us |   1.94% |   SLOW   |
| I16  |    2^20    |   0.201   |  78.864 us |       1.65% |  79.827 us |       1.68% |   0.963 us |   1.22% |   SAME   |
| I16  |    2^24    |   0.201   | 390.449 us |       0.60% | 395.891 us |       1.02% |   5.442 us |   1.39% |   SLOW   |
| I16  |    2^28    |   0.201   |  14.089 ms |       0.10% |  14.107 ms |       0.18% |  17.938 us |   0.13% |   SLOW   |
| I32  |    2^16    |     1     |  33.602 us |       1.66% |  34.395 us |       1.42% |   0.793 us |   2.36% |   SLOW   |
| I32  |    2^20    |     1     |  83.763 us |       0.84% |  84.078 us |       0.70% |   0.314 us |   0.38% |   SAME   |
| I32  |    2^24    |     1     | 540.429 us |       0.37% | 542.771 us |       1.48% |   2.342 us |   0.43% |   SLOW   |
| I32  |    2^28    |     1     |  26.143 ms |       0.05% |  26.153 ms |       0.06% |   9.072 us |   0.03% |   SAME   |
| I32  |    2^16    |   0.201   |  33.310 us |       1.95% |  33.986 us |       2.41% |   0.676 us |   2.03% |   SLOW   |
| I32  |    2^20    |   0.201   |  81.204 us |       0.64% |  82.193 us |       0.98% |   0.989 us |   1.22% |   SLOW   |
| I32  |    2^24    |   0.201   | 556.023 us |       0.44% | 557.285 us |       0.50% |   1.262 us |   0.23% |   SAME   |
| I32  |    2^28    |   0.201   |  26.144 ms |       0.06% |  26.162 ms |       0.06% |  18.180 us |   0.07% |   SLOW   |
| I64  |    2^16    |     1     |  38.438 us |       0.99% |  39.301 us |       2.15% |   0.862 us |   2.24% |   SLOW   |
| I64  |    2^20    |     1     | 100.603 us |       2.06% | 101.776 us |       2.09% |   1.173 us |   1.17% |   SAME   |
| I64  |    2^24    |     1     |   2.686 ms |       0.09% |   2.689 ms |       0.11% |   2.673 us |   0.10% |   SLOW   |
| I64  |    2^28    |     1     |  57.319 ms |       0.04% |  57.350 ms |       0.04% |  31.231 us |   0.05% |   SLOW   |
| I64  |    2^16    |   0.201   |  38.588 us |       1.22% |  39.566 us |       1.58% |   0.978 us |   2.53% |   SLOW   |
| I64  |    2^20    |   0.201   | 100.140 us |       2.05% | 100.933 us |       2.15% |   0.793 us |   0.79% |   SAME   |
| I64  |    2^24    |   0.201   |   2.685 ms |       0.09% |   2.688 ms |       1.29% |   3.426 us |   0.13% |   SLOW   |
| I64  |    2^28    |   0.201   |  57.308 ms |       0.04% |  57.340 ms |       0.04% |  31.794 us |   0.06% |   SLOW   |
| I128 |    2^16    |     1     |  40.601 us |       1.23% |  41.249 us |       1.06% |   0.647 us |   1.59% |   SLOW   |
| I128 |    2^20    |     1     | 146.193 us |       0.54% | 147.176 us |       0.68% |   0.984 us |   0.67% |   SLOW   |
| I128 |    2^24    |     1     |   5.735 ms |       0.08% |   5.737 ms |       0.07% |   1.688 us |   0.03% |   SAME   |
| I128 |    2^28    |     1     | 119.009 ms |       0.03% | 119.071 ms |       0.03% |  62.212 us |   0.05% |   SLOW   |
| I128 |    2^16    |   0.201   |  41.776 us |       1.12% |  42.762 us |       3.68% |   0.986 us |   2.36% |   SLOW   |
| I128 |    2^20    |   0.201   | 146.139 us |       0.56% | 147.163 us |       0.71% |   1.024 us |   0.70% |   SLOW   |
| I128 |    2^24    |   0.201   |   5.734 ms |       0.08% |   5.734 ms |       0.08% |   0.469 us |   0.01% |   SAME   |
| I128 |    2^28    |   0.201   | 118.975 ms |       0.03% | 119.045 ms |       0.03% |  69.785 us |   0.06% |   SLOW   |
| F32  |    2^16    |     1     |  33.631 us |       1.27% |  34.499 us |       1.71% |   0.868 us |   2.58% |   SLOW   |
| F32  |    2^20    |     1     |  83.276 us |       0.52% |  84.346 us |       0.85% |   1.071 us |   1.29% |   SLOW   |
| F32  |    2^24    |     1     | 541.899 us |       0.34% | 543.406 us |       0.41% |   1.507 us |   0.28% |   SAME   |
| F32  |    2^28    |     1     |  26.130 ms |       0.05% |  26.118 ms |       0.05% | -12.110 us |  -0.05% |   SAME   |
| F32  |    2^16    |   0.201   |  33.128 us |       1.05% |  34.000 us |       2.73% |   0.872 us |   2.63% |   SLOW   |
| F32  |    2^20    |   0.201   |  81.558 us |       0.80% |  82.460 us |       0.81% |   0.902 us |   1.11% |   SLOW   |
| F32  |    2^24    |   0.201   | 558.632 us |       0.40% | 559.986 us |       0.41% |   1.354 us |   0.24% |   SAME   |
| F32  |    2^28    |   0.201   |  26.128 ms |       0.06% |  26.116 ms |       0.06% | -11.864 us |  -0.05% |   SAME   |
| F64  |    2^16    |     1     |  42.750 us |       1.08% |  43.513 us |       0.99% |   0.763 us |   1.79% |   SLOW   |
| F64  |    2^20    |     1     | 113.864 us |       1.53% | 114.964 us |       1.80% |   1.100 us |   0.97% |   SAME   |
| F64  |    2^24    |     1     |   2.738 ms |       0.10% |   2.739 ms |       0.10% |   0.882 us |   0.03% |   SAME   |
| F64  |    2^28    |     1     |  58.106 ms |       0.04% |  58.140 ms |       0.04% |  34.494 us |   0.06% |   SLOW   |
| F64  |    2^16    |   0.201   |  41.966 us |       1.37% |  42.978 us |       2.02% |   1.012 us |   2.41% |   SLOW   |
| F64  |    2^20    |   0.201   | 113.604 us |       1.63% | 114.703 us |       1.59% |   1.099 us |   0.97% |   SAME   |
| F64  |    2^24    |   0.201   |   2.736 ms |       0.09% |   2.738 ms |       0.10% |   2.570 us |   0.09% |   SLOW   |
| F64  |    2^28    |   0.201   |  58.101 ms |       0.04% |  58.112 ms |       0.04% |  11.206 us |   0.02% |   SAME   |
| C32  |    2^16    |     1     |  37.408 us |       1.49% |  38.620 us |       1.85% |   1.212 us |   3.24% |   SLOW   |
| C32  |    2^20    |     1     | 100.506 us |       2.08% | 101.264 us |       1.98% |   0.758 us |   0.75% |   SAME   |
| C32  |    2^24    |     1     |   2.679 ms |       0.10% |   2.680 ms |       0.09% |   0.834 us |   0.03% |   SAME   |
| C32  |    2^28    |     1     |  57.247 ms |       0.04% |  57.263 ms |       0.04% |  15.995 us |   0.03% |   SAME   |
| C32  |    2^16    |   0.201   |  38.359 us |       2.74% |  39.284 us |       2.60% |   0.926 us |   2.41% |   SAME   |
| C32  |    2^20    |   0.201   | 100.125 us |       1.94% | 101.225 us |       2.45% |   1.100 us |   1.10% |   SAME   |
| C32  |    2^24    |   0.201   |   2.678 ms |       0.10% |   2.681 ms |       0.11% |   2.847 us |   0.11% |   SLOW   |
| C32  |    2^28    |   0.201   |  57.241 ms |       0.04% |  57.249 ms |       0.04% |   8.605 us |   0.02% |   SAME   |

# Summary

- Total Matches: 120
  - Pass    (diff <= min_noise): 52
  - Unknown (infinite noise):    0
  - Failure (diff > min_noise):  68
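
The Status column and the Pass/Failure counts can be reproduced from the other columns. The sketch below assumes the rule implied by the legend above: a row is SAME when the absolute relative difference is at most the smaller of the two noise values, and otherwise FAST or SLOW depending on the sign. The `Row` type and the function names are hypothetical, and the actual comparison script may differ in details such as rounding.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <string>

// One table row. Both times must use the same unit; noise values are
// fractions, so 2.78% is stored as 0.0278.
struct Row {
  double ref_time;
  double ref_noise;
  double cmp_time;
  double cmp_noise;
};

// Relative difference of the compared run against the reference (the %Diff column / 100).
double frac_diff(const Row& r) {
  assert(r.ref_time > 0.0);
  return (r.cmp_time - r.ref_time) / r.ref_time;
}

// Assumed classification: within the smaller noise band counts as SAME (a "Pass"),
// anything outside it is FAST or SLOW (a "Failure" in the summary).
std::string status(const Row& r) {
  const double d         = frac_diff(r);
  const double min_noise = std::min(r.ref_noise, r.cmp_noise);
  if (std::abs(d) <= min_noise) {
    return "SAME";
  }
  return d < 0.0 ? "FAST" : "SLOW";
}
```

For example, the I8 / 2^20 / entropy 1 row of the base table (25.535 us vs. 26.550 us, noise 5.35% and 2.75%) has a relative difference of about 3.97%, which exceeds the 2.75% minimum noise and is therefore reported as SLOW. Under this rule a "Failure" only means the difference is larger than the measurement noise, not that it is large in absolute terms.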

@bernhardmgruber
Contributor

Please create a tracking issue to investigate the small performance difference to Thrust.

@github-actions
Contributor

🥳 CI Workflow Results

🟩 Finished in 1h 27m: Pass: 100%/395 | Total: 4d 13h | Max: 1h 27m | Hits: 92%/667828

@miscco miscco merged commit 057cd14 into NVIDIA:main May 12, 2026
417 checks passed
@miscco miscco deleted the parallel_sort branch May 12, 2026 11:18
Development

Successfully merging this pull request may close these issues.

[FEA]: Implement CUDA backend for parallel cuda::std::sort