
Implement parallel cuda::std::copy#7513

Open
miscco wants to merge 2 commits intoNVIDIA:mainfrom
miscco:parallel_copy

Conversation

@miscco
Contributor

@miscco miscco commented Feb 5, 2026

This implements the copy{_n} algorithms for the cuda backend.

* std::copy see https://en.cppreference.com/w/cpp/algorithm/copy.html
* std::copy_n see https://en.cppreference.com/w/cpp/algorithm/copy_n.html

It provides tests and benchmarks similar to the Thrust ones, plus some boilerplate for libcu++.

The functionality is not publicly available yet; it is implemented in a private internal header.

Fixes #7366

@miscco miscco requested review from a team as code owners February 5, 2026 15:14
@miscco miscco requested a review from shwina February 5, 2026 15:14
@miscco miscco requested a review from pciolkosz February 5, 2026 15:14
@github-project-automation github-project-automation bot moved this to Todo in CCCL Feb 5, 2026
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Feb 5, 2026
```cpp
}
catch (const ::cuda::cuda_error& __err)
{
  if (__err.status() == cudaErrorMemoryAllocation)
```
Contributor
Just out of curiosity, why do we want to translate that exception here?

Contributor Author

That is the standard requirement: when a parallel algorithm fails to allocate memory it should either throw a bad_alloc or terminate. As an implementation extension we also throw the cuda error for other failures.

@jrhemstad
Collaborator

suggestion: copy is a tricky algorithm in a heterogeneous setting because people often reach for this to do host <-> device copies, but that doesn't really work in the general case. We may want to consider just tabling cuda::std::copy for now.

@bernhardmgruber
Contributor

> suggestion: copy is a tricky algorithm in a heterogeneous setting because people often reach for this to do host <-> device copies, but that doesn't really work in the general case. We may want to consider just tabling cuda::std::copy for now.

Since we decided not to automatically derive the memory space from iterators, like Thrust does, the PSTL's CUDA backend must assume that all memory is device-accessible. Therefore, we cannot implement heterogeneous copies with cuda::std::copy.

@miscco miscco requested a review from a team as a code owner February 6, 2026 09:58
@miscco miscco requested a review from jrhemstad February 6, 2026 09:58
@miscco
Contributor Author

miscco commented Feb 6, 2026

Benchmarks look fine. There is some minor slowdown for small integers, but that could just be noise on my machine:

['thrust_copy.json', 'pstl_copy.json']
# base

## [0] NVIDIA RTX A6000

|  T{ct}  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|---------|------------|------------|-------------|------------|-------------|-----------|---------|----------|
|   I8    |    2^16    |   7.332 us |      10.33% |   7.580 us |       7.29% |  0.248 us |   3.38% |   SAME   |
|   I8    |    2^20    |  10.170 us |       4.86% |  10.682 us |       4.42% |  0.512 us |   5.03% |   SLOW   |
|   I8    |    2^24    |  56.656 us |       1.07% |  57.211 us |       1.22% |  0.555 us |   0.98% |   SAME   |
|   I8    |    2^28    | 798.906 us |       2.20% | 799.576 us |       2.09% |  0.670 us |   0.08% |   SAME   |
|   I16   |    2^16    |   7.403 us |       5.97% |   7.643 us |       6.74% |  0.240 us |   3.24% |   SAME   |
|   I16   |    2^20    |  12.526 us |       8.15% |  12.733 us |       4.24% |  0.208 us |   1.66% |   SAME   |
|   I16   |    2^24    | 105.781 us |       0.53% | 106.030 us |       0.55% |  0.249 us |   0.24% |   SAME   |
|   I16   |    2^28    |   1.583 ms |       1.52% |   1.585 ms |       1.57% |  2.215 us |   0.14% |   SAME   |
|   I32   |    2^16    |   7.837 us |       6.29% |   7.929 us |       9.52% |  0.092 us |   1.17% |   SAME   |
|   I32   |    2^20    |  19.515 us |       4.31% |  19.496 us |       4.03% | -0.019 us |  -0.10% |   SAME   |
|   I32   |    2^24    | 203.917 us |       0.36% | 204.015 us |       0.33% |  0.098 us |   0.05% |   SAME   |
|   I32   |    2^28    |   3.153 ms |       1.14% |   3.156 ms |       1.00% |  3.056 us |   0.10% |   SAME   |
|   I64   |    2^16    |   8.447 us |       8.03% |   8.472 us |       5.80% |  0.025 us |   0.29% |   SAME   |
|   I64   |    2^20    |  32.420 us |       2.24% |  32.112 us |       2.46% | -0.308 us |  -0.95% |   SAME   |
|   I64   |    2^24    | 400.385 us |       0.19% | 400.074 us |       0.21% | -0.311 us |  -0.08% |   SAME   |
|   I64   |    2^28    |   6.295 ms |       0.71% |   6.298 ms |       0.74% |  3.472 us |   0.06% |   SAME   |
|  I128   |    2^16    |   9.972 us |       5.11% |   9.992 us |       7.75% |  0.020 us |   0.21% |   SAME   |
|  I128   |    2^20    |  59.238 us |       1.06% |  58.894 us |       1.11% | -0.344 us |  -0.58% |   SAME   |
|  I128   |    2^24    | 797.891 us |       2.00% | 798.496 us |       2.92% |  0.605 us |   0.08% |   SAME   |
|  I128   |    2^28    |  12.560 ms |       0.38% |  12.571 ms |       0.40% | 10.892 us |   0.09% |   SAME   |
|   F32   |    2^16    |   7.790 us |      14.99% |   7.968 us |       8.01% |  0.178 us |   2.28% |   SAME   |
|   F32   |    2^20    |  19.850 us |       2.65% |  19.919 us |      14.07% |  0.069 us |   0.35% |   SAME   |
|   F32   |    2^24    | 216.945 us |       0.36% | 217.196 us |       0.42% |  0.251 us |   0.12% |   SAME   |
|   F32   |    2^28    |   3.165 ms |       0.99% |   3.172 ms |       1.14% |  7.033 us |   0.22% |   SAME   |
|   F64   |    2^16    |   8.401 us |       4.70% |   8.535 us |       5.75% |  0.133 us |   1.59% |   SAME   |
|   F64   |    2^20    |  32.487 us |       2.47% |  32.655 us |       2.27% |  0.168 us |   0.52% |   SAME   |
|   F64   |    2^24    | 400.459 us |       0.20% | 400.170 us |       0.22% | -0.289 us |  -0.07% |   SAME   |
|   F64   |    2^28    |   6.296 ms |       0.79% |   6.297 ms |       0.65% |  0.761 us |   0.01% |   SAME   |

# base

## [0] NVIDIA RTX A6000

|  T{ct}  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|---------|------------|------------|-------------|------------|-------------|-----------|---------|----------|
|   I8    |    2^16    |   6.899 us |       6.12% |   6.885 us |       6.63% | -0.014 us |  -0.20% |   SAME   |
|   I8    |    2^20    |   7.519 us |      10.83% |   7.598 us |       9.31% |  0.079 us |   1.06% |   SAME   |
|   I8    |    2^24    |  31.556 us |       2.73% |  31.539 us |       3.06% | -0.018 us |  -0.06% |   SAME   |
|   I8    |    2^28    | 385.844 us |       2.21% | 385.990 us |       2.26% |  0.146 us |   0.04% |   SAME   |
|   I16   |    2^16    |   6.866 us |       8.20% |   6.835 us |       7.66% | -0.030 us |  -0.44% |   SAME   |
|   I16   |    2^20    |   9.599 us |       5.68% |   9.456 us |       5.64% | -0.143 us |  -1.49% |   SAME   |
|   I16   |    2^24    |  53.897 us |       1.60% |  53.880 us |       1.14% | -0.017 us |  -0.03% |   SAME   |
|   I16   |    2^28    | 758.174 us |       0.06% | 758.069 us |       0.07% | -0.105 us |  -0.01% |   SAME   |
|   I32   |    2^16    |   6.910 us |       9.75% |   6.870 us |       7.49% | -0.040 us |  -0.58% |   SAME   |
|   I32   |    2^20    |  12.884 us |       8.17% |  12.680 us |       3.80% | -0.204 us |  -1.59% |   SAME   |
|   I32   |    2^24    | 100.965 us |       0.60% | 100.917 us |       0.90% | -0.048 us |  -0.05% |   SAME   |
|   I32   |    2^28    |   1.509 ms |       0.06% |   1.509 ms |       0.06% | -0.033 us |  -0.00% |   SAME   |
|   I64   |    2^16    |   7.589 us |       6.56% |   7.590 us |       9.39% |  0.001 us |   0.02% |   SAME   |
|   I64   |    2^20    |  19.832 us |       3.32% |  19.389 us |       5.28% | -0.443 us |  -2.23% |   SAME   |
|   I64   |    2^24    | 197.970 us |       0.38% | 198.219 us |       0.37% |  0.249 us |   0.13% |   SAME   |
|   I64   |    2^28    |   3.055 ms |       0.05% |   3.062 ms |       0.07% |  7.505 us |   0.25% |   SLOW   |

['thrust_copy_n.json', 'pstl_copy_n.json']
# base

## [0] NVIDIA RTX A6000

|  T{ct}  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|---------|------------|------------|-------------|------------|-------------|-----------|---------|----------|
|   I8    |    2^16    |   7.413 us |       5.76% |   7.674 us |      13.61% |  0.261 us |   3.52% |   SAME   |
|   I8    |    2^20    |  10.289 us |       8.28% |  10.717 us |       4.94% |  0.428 us |   4.16% |   SAME   |
|   I8    |    2^24    |  56.802 us |       0.93% |  57.329 us |       1.02% |  0.527 us |   0.93% |   SLOW   |
|   I8    |    2^28    | 798.862 us |       2.20% | 800.209 us |       2.15% |  1.347 us |   0.17% |   SAME   |
|   I16   |    2^16    |   7.514 us |      13.57% |   7.725 us |      12.69% |  0.211 us |   2.81% |   SAME   |
|   I16   |    2^20    |  12.555 us |       4.30% |  12.900 us |       5.10% |  0.345 us |   2.75% |   SAME   |
|   I16   |    2^24    | 105.865 us |       0.95% | 106.073 us |       0.59% |  0.208 us |   0.20% |   SAME   |
|   I16   |    2^28    |   1.585 ms |       1.79% |   1.586 ms |       1.69% |  0.968 us |   0.06% |   SAME   |
|   I32   |    2^16    |   7.903 us |       5.92% |   8.063 us |       9.65% |  0.159 us |   2.02% |   SAME   |
|   I32   |    2^20    |  19.574 us |       2.72% |  19.632 us |       4.13% |  0.058 us |   0.30% |   SAME   |
|   I32   |    2^24    | 204.055 us |       0.30% | 204.192 us |       0.41% |  0.137 us |   0.07% |   SAME   |
|   I32   |    2^28    |   3.152 ms |       1.05% |   3.157 ms |       1.28% |  4.564 us |   0.14% |   SAME   |
|   I64   |    2^16    |   8.519 us |      11.33% |   8.656 us |       9.07% |  0.137 us |   1.60% |   SAME   |
|   I64   |    2^20    |  32.479 us |       2.26% |  32.648 us |       3.18% |  0.170 us |   0.52% |   SAME   |
|   I64   |    2^24    | 400.366 us |       0.18% | 400.317 us |       0.40% | -0.049 us |  -0.01% |   SAME   |
|   I64   |    2^28    |   6.295 ms |       0.73% |   6.297 ms |       0.71% |  2.193 us |   0.03% |   SAME   |
|  I128   |    2^16    |  10.073 us |       4.98% |  10.012 us |       4.93% | -0.061 us |  -0.60% |   SAME   |
|  I128   |    2^20    |  59.291 us |       1.20% |  58.821 us |       1.17% | -0.470 us |  -0.79% |   SAME   |
|  I128   |    2^24    | 794.612 us |       1.23% | 797.338 us |       1.74% |  2.726 us |   0.34% |   SAME   |
|  I128   |    2^28    |  12.560 ms |       0.37% |  12.572 ms |       0.35% | 11.273 us |   0.09% |   SAME   |
|   F32   |    2^16    |   7.789 us |       7.04% |   7.977 us |       6.43% |  0.188 us |   2.41% |   SAME   |
|   F32   |    2^20    |  20.177 us |       4.91% |  19.721 us |       2.85% | -0.456 us |  -2.26% |   SAME   |
|   F32   |    2^24    | 217.039 us |       0.38% | 217.283 us |       0.43% |  0.244 us |   0.11% |   SAME   |
|   F32   |    2^28    |   3.157 ms |       1.15% |   3.171 ms |       1.11% | 14.575 us |   0.46% |   SAME   |
|   F64   |    2^16    |   8.502 us |       8.58% |   8.605 us |      11.74% |  0.103 us |   1.21% |   SAME   |
|   F64   |    2^20    |  32.442 us |       2.26% |  32.645 us |       1.91% |  0.202 us |   0.62% |   SAME   |
|   F64   |    2^24    | 400.635 us |       0.32% | 400.252 us |       0.23% | -0.383 us |  -0.10% |   SAME   |
|   F64   |    2^28    |   6.294 ms |       0.68% |   6.298 ms |       0.71% |  3.594 us |   0.06% |   SAME   |

# base

## [0] NVIDIA RTX A6000

|  T{ct}  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|---------|------------|------------|-------------|------------|-------------|-----------|---------|----------|
|   I8    |    2^16    |   6.875 us |       6.61% |   6.963 us |       9.47% |  0.088 us |   1.29% |   SAME   |
|   I8    |    2^20    |   7.358 us |       5.43% |   7.729 us |      14.10% |  0.371 us |   5.04% |   SAME   |
|   I8    |    2^24    |  31.397 us |       1.74% |  31.524 us |       1.84% |  0.127 us |   0.40% |   SAME   |
|   I8    |    2^28    | 385.523 us |       2.12% | 386.422 us |       2.35% |  0.898 us |   0.23% |   SAME   |
|   I16   |    2^16    |   6.847 us |       7.08% |   6.890 us |       7.97% |  0.043 us |   0.62% |   SAME   |
|   I16   |    2^20    |   9.512 us |       6.18% |   9.599 us |       5.57% |  0.087 us |   0.91% |   SAME   |
|   I16   |    2^24    |  53.861 us |       1.24% |  53.952 us |       1.03% |  0.091 us |   0.17% |   SAME   |
|   I16   |    2^28    | 758.217 us |       0.07% | 758.201 us |       0.08% | -0.015 us |  -0.00% |   SAME   |
|   I32   |    2^16    |   6.918 us |      11.25% |   6.917 us |       6.80% | -0.002 us |  -0.02% |   SAME   |
|   I32   |    2^20    |  12.813 us |       3.99% |  12.869 us |       5.81% |  0.056 us |   0.44% |   SAME   |
|   I32   |    2^24    | 100.863 us |       0.51% | 100.892 us |       0.53% |  0.029 us |   0.03% |   SAME   |
|   I32   |    2^28    |   1.509 ms |       0.04% |   1.509 ms |       0.04% | -0.087 us |  -0.01% |   SAME   |
|   I64   |    2^16    |   7.511 us |       6.84% |   7.537 us |      12.71% |  0.026 us |   0.35% |   SAME   |
|   I64   |    2^20    |  19.724 us |       2.88% |  19.394 us |       3.04% | -0.330 us |  -1.67% |   SAME   |
|   I64   |    2^24    | 197.895 us |       0.34% | 198.372 us |       0.47% |  0.477 us |   0.24% |   SAME   |
|   I64   |    2^28    |   3.055 ms |       0.06% |   3.062 ms |       0.06% |  7.501 us |   0.25% |   SLOW   |

Comment on lines +106 to +111

```cpp
{ // CUB requires a 32 or 64 bit offset type, so cast here
  using _OffsetType = ::cub::detail::choose_signed_offset_t<iter_difference_t<_InputIterator>>;
  return __par_impl(
    __policy,
    ::cuda::std::move(__first),
    static_cast<_OffsetType>(__count),
```
Contributor

Q: I don't understand why we need to cast the iterator difference type. CUB should be able to handle any offset type at the public API, otherwise we have a bug. Why is this needed?

Contributor Author

CUB has a static assert that it only accepts 32- and 64-bit signed integer types.

That might not always be the case; for example, counting_iterator has the payload type as its difference type if that is a signed integer.

Contributor

Ah, I get it now. You are calling a random internal dispatch function and not the public cub::DeviceTransform API. Why is this necessary? Why can't we just call the public API (which handles offset types accordingly)?

Contributor Author

Oh, that was just something Thrust did, so I wasn't sure whether it's still necessary.

@github-actions
Contributor

github-actions bot commented Feb 6, 2026

😬 CI Workflow Results

🟥 Finished in 1h 03m: Pass: 94%/101 | Total: 16h 54m | Max: 38m 44s | Hits: 98%/250260

See results here.


Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

[FEA]: Implement CUDA backend for parallel cuda::std::copy

4 participants