
Implement parallel cuda::std::copy#7513

Open
miscco wants to merge 2 commits intoNVIDIA:mainfrom
miscco:parallel_copy

Conversation

@miscco
Contributor

@miscco miscco commented Feb 5, 2026

This implements the copy{_n} algorithms for the cuda backend.

* std::copy see https://en.cppreference.com/w/cpp/algorithm/copy.html
* std::copy_n see https://en.cppreference.com/w/cpp/algorithm/copy_n.html

It provides tests and benchmarks similar to the Thrust ones, plus some boilerplate for libcu++.

The functionality is not publicly available yet; it is implemented in a private internal header.

Fixes #7366

@miscco miscco requested review from a team as code owners February 5, 2026 15:14
@miscco miscco requested a review from shwina February 5, 2026 15:14
@miscco miscco requested a review from pciolkosz February 5, 2026 15:14
@github-project-automation github-project-automation bot moved this to Todo in CCCL Feb 5, 2026
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Feb 5, 2026
```cpp
}
catch (const ::cuda::cuda_error& __err)
{
  if (__err.status() == cudaErrorMemoryAllocation)
```
Contributor
Just out of curiosity, why do we want to translate that exception here?

Contributor Author

That is the standard requirement: when a parallel algorithm fails to allocate memory it should either throw a bad_alloc or terminate. As an implementation extension we also throw the cuda error for other failures.

@jrhemstad
Collaborator

suggestion: copy is a tricky algorithm in a heterogeneous setting because people often reach for this to do host <-> device copies, but that doesn't really work in the general case. We may want to consider just tabling cuda::std::copy for now.

@bernhardmgruber
Contributor

> suggestion: copy is a tricky algorithm in a heterogeneous setting because people often reach for this to do host <-> device copies, but that doesn't really work in the general case. We may want to consider just tabling cuda::std::copy for now.

Since we decided not to automatically derive the memory space from iterators, like Thrust does, the PSTL's CUDA backend must assume that all memory is device-accessible. Therefore, we cannot implement heterogeneous copies with cuda::std::copy.

@miscco miscco requested a review from a team as a code owner February 6, 2026 09:58
@miscco miscco requested a review from jrhemstad February 6, 2026 09:58
@miscco
Contributor Author

miscco commented Feb 6, 2026

Benchmarks look fine. There is some minor slowdown for small integers, but that could just be noise on my machine:

['thrust_copy.json', 'pstl_copy.json']
# base

## [0] NVIDIA RTX A6000

|  T{ct}  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|---------|------------|------------|-------------|------------|-------------|-----------|---------|----------|
|   I8    |    2^16    |   7.332 us |      10.33% |   7.580 us |       7.29% |  0.248 us |   3.38% |   SAME   |
|   I8    |    2^20    |  10.170 us |       4.86% |  10.682 us |       4.42% |  0.512 us |   5.03% |   SLOW   |
|   I8    |    2^24    |  56.656 us |       1.07% |  57.211 us |       1.22% |  0.555 us |   0.98% |   SAME   |
|   I8    |    2^28    | 798.906 us |       2.20% | 799.576 us |       2.09% |  0.670 us |   0.08% |   SAME   |
|   I16   |    2^16    |   7.403 us |       5.97% |   7.643 us |       6.74% |  0.240 us |   3.24% |   SAME   |
|   I16   |    2^20    |  12.526 us |       8.15% |  12.733 us |       4.24% |  0.208 us |   1.66% |   SAME   |
|   I16   |    2^24    | 105.781 us |       0.53% | 106.030 us |       0.55% |  0.249 us |   0.24% |   SAME   |
|   I16   |    2^28    |   1.583 ms |       1.52% |   1.585 ms |       1.57% |  2.215 us |   0.14% |   SAME   |
|   I32   |    2^16    |   7.837 us |       6.29% |   7.929 us |       9.52% |  0.092 us |   1.17% |   SAME   |
|   I32   |    2^20    |  19.515 us |       4.31% |  19.496 us |       4.03% | -0.019 us |  -0.10% |   SAME   |
|   I32   |    2^24    | 203.917 us |       0.36% | 204.015 us |       0.33% |  0.098 us |   0.05% |   SAME   |
|   I32   |    2^28    |   3.153 ms |       1.14% |   3.156 ms |       1.00% |  3.056 us |   0.10% |   SAME   |
|   I64   |    2^16    |   8.447 us |       8.03% |   8.472 us |       5.80% |  0.025 us |   0.29% |   SAME   |
|   I64   |    2^20    |  32.420 us |       2.24% |  32.112 us |       2.46% | -0.308 us |  -0.95% |   SAME   |
|   I64   |    2^24    | 400.385 us |       0.19% | 400.074 us |       0.21% | -0.311 us |  -0.08% |   SAME   |
|   I64   |    2^28    |   6.295 ms |       0.71% |   6.298 ms |       0.74% |  3.472 us |   0.06% |   SAME   |
|  I128   |    2^16    |   9.972 us |       5.11% |   9.992 us |       7.75% |  0.020 us |   0.21% |   SAME   |
|  I128   |    2^20    |  59.238 us |       1.06% |  58.894 us |       1.11% | -0.344 us |  -0.58% |   SAME   |
|  I128   |    2^24    | 797.891 us |       2.00% | 798.496 us |       2.92% |  0.605 us |   0.08% |   SAME   |
|  I128   |    2^28    |  12.560 ms |       0.38% |  12.571 ms |       0.40% | 10.892 us |   0.09% |   SAME   |
|   F32   |    2^16    |   7.790 us |      14.99% |   7.968 us |       8.01% |  0.178 us |   2.28% |   SAME   |
|   F32   |    2^20    |  19.850 us |       2.65% |  19.919 us |      14.07% |  0.069 us |   0.35% |   SAME   |
|   F32   |    2^24    | 216.945 us |       0.36% | 217.196 us |       0.42% |  0.251 us |   0.12% |   SAME   |
|   F32   |    2^28    |   3.165 ms |       0.99% |   3.172 ms |       1.14% |  7.033 us |   0.22% |   SAME   |
|   F64   |    2^16    |   8.401 us |       4.70% |   8.535 us |       5.75% |  0.133 us |   1.59% |   SAME   |
|   F64   |    2^20    |  32.487 us |       2.47% |  32.655 us |       2.27% |  0.168 us |   0.52% |   SAME   |
|   F64   |    2^24    | 400.459 us |       0.20% | 400.170 us |       0.22% | -0.289 us |  -0.07% |   SAME   |
|   F64   |    2^28    |   6.296 ms |       0.79% |   6.297 ms |       0.65% |  0.761 us |   0.01% |   SAME   |

# base

## [0] NVIDIA RTX A6000

|  T{ct}  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|---------|------------|------------|-------------|------------|-------------|-----------|---------|----------|
|   I8    |    2^16    |   6.899 us |       6.12% |   6.885 us |       6.63% | -0.014 us |  -0.20% |   SAME   |
|   I8    |    2^20    |   7.519 us |      10.83% |   7.598 us |       9.31% |  0.079 us |   1.06% |   SAME   |
|   I8    |    2^24    |  31.556 us |       2.73% |  31.539 us |       3.06% | -0.018 us |  -0.06% |   SAME   |
|   I8    |    2^28    | 385.844 us |       2.21% | 385.990 us |       2.26% |  0.146 us |   0.04% |   SAME   |
|   I16   |    2^16    |   6.866 us |       8.20% |   6.835 us |       7.66% | -0.030 us |  -0.44% |   SAME   |
|   I16   |    2^20    |   9.599 us |       5.68% |   9.456 us |       5.64% | -0.143 us |  -1.49% |   SAME   |
|   I16   |    2^24    |  53.897 us |       1.60% |  53.880 us |       1.14% | -0.017 us |  -0.03% |   SAME   |
|   I16   |    2^28    | 758.174 us |       0.06% | 758.069 us |       0.07% | -0.105 us |  -0.01% |   SAME   |
|   I32   |    2^16    |   6.910 us |       9.75% |   6.870 us |       7.49% | -0.040 us |  -0.58% |   SAME   |
|   I32   |    2^20    |  12.884 us |       8.17% |  12.680 us |       3.80% | -0.204 us |  -1.59% |   SAME   |
|   I32   |    2^24    | 100.965 us |       0.60% | 100.917 us |       0.90% | -0.048 us |  -0.05% |   SAME   |
|   I32   |    2^28    |   1.509 ms |       0.06% |   1.509 ms |       0.06% | -0.033 us |  -0.00% |   SAME   |
|   I64   |    2^16    |   7.589 us |       6.56% |   7.590 us |       9.39% |  0.001 us |   0.02% |   SAME   |
|   I64   |    2^20    |  19.832 us |       3.32% |  19.389 us |       5.28% | -0.443 us |  -2.23% |   SAME   |
|   I64   |    2^24    | 197.970 us |       0.38% | 198.219 us |       0.37% |  0.249 us |   0.13% |   SAME   |
|   I64   |    2^28    |   3.055 ms |       0.05% |   3.062 ms |       0.07% |  7.505 us |   0.25% |   SLOW   |

['thrust_copy_n.json', 'pstl_copy_n.json']
# base

## [0] NVIDIA RTX A6000

|  T{ct}  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|---------|------------|------------|-------------|------------|-------------|-----------|---------|----------|
|   I8    |    2^16    |   7.413 us |       5.76% |   7.674 us |      13.61% |  0.261 us |   3.52% |   SAME   |
|   I8    |    2^20    |  10.289 us |       8.28% |  10.717 us |       4.94% |  0.428 us |   4.16% |   SAME   |
|   I8    |    2^24    |  56.802 us |       0.93% |  57.329 us |       1.02% |  0.527 us |   0.93% |   SLOW   |
|   I8    |    2^28    | 798.862 us |       2.20% | 800.209 us |       2.15% |  1.347 us |   0.17% |   SAME   |
|   I16   |    2^16    |   7.514 us |      13.57% |   7.725 us |      12.69% |  0.211 us |   2.81% |   SAME   |
|   I16   |    2^20    |  12.555 us |       4.30% |  12.900 us |       5.10% |  0.345 us |   2.75% |   SAME   |
|   I16   |    2^24    | 105.865 us |       0.95% | 106.073 us |       0.59% |  0.208 us |   0.20% |   SAME   |
|   I16   |    2^28    |   1.585 ms |       1.79% |   1.586 ms |       1.69% |  0.968 us |   0.06% |   SAME   |
|   I32   |    2^16    |   7.903 us |       5.92% |   8.063 us |       9.65% |  0.159 us |   2.02% |   SAME   |
|   I32   |    2^20    |  19.574 us |       2.72% |  19.632 us |       4.13% |  0.058 us |   0.30% |   SAME   |
|   I32   |    2^24    | 204.055 us |       0.30% | 204.192 us |       0.41% |  0.137 us |   0.07% |   SAME   |
|   I32   |    2^28    |   3.152 ms |       1.05% |   3.157 ms |       1.28% |  4.564 us |   0.14% |   SAME   |
|   I64   |    2^16    |   8.519 us |      11.33% |   8.656 us |       9.07% |  0.137 us |   1.60% |   SAME   |
|   I64   |    2^20    |  32.479 us |       2.26% |  32.648 us |       3.18% |  0.170 us |   0.52% |   SAME   |
|   I64   |    2^24    | 400.366 us |       0.18% | 400.317 us |       0.40% | -0.049 us |  -0.01% |   SAME   |
|   I64   |    2^28    |   6.295 ms |       0.73% |   6.297 ms |       0.71% |  2.193 us |   0.03% |   SAME   |
|  I128   |    2^16    |  10.073 us |       4.98% |  10.012 us |       4.93% | -0.061 us |  -0.60% |   SAME   |
|  I128   |    2^20    |  59.291 us |       1.20% |  58.821 us |       1.17% | -0.470 us |  -0.79% |   SAME   |
|  I128   |    2^24    | 794.612 us |       1.23% | 797.338 us |       1.74% |  2.726 us |   0.34% |   SAME   |
|  I128   |    2^28    |  12.560 ms |       0.37% |  12.572 ms |       0.35% | 11.273 us |   0.09% |   SAME   |
|   F32   |    2^16    |   7.789 us |       7.04% |   7.977 us |       6.43% |  0.188 us |   2.41% |   SAME   |
|   F32   |    2^20    |  20.177 us |       4.91% |  19.721 us |       2.85% | -0.456 us |  -2.26% |   SAME   |
|   F32   |    2^24    | 217.039 us |       0.38% | 217.283 us |       0.43% |  0.244 us |   0.11% |   SAME   |
|   F32   |    2^28    |   3.157 ms |       1.15% |   3.171 ms |       1.11% | 14.575 us |   0.46% |   SAME   |
|   F64   |    2^16    |   8.502 us |       8.58% |   8.605 us |      11.74% |  0.103 us |   1.21% |   SAME   |
|   F64   |    2^20    |  32.442 us |       2.26% |  32.645 us |       1.91% |  0.202 us |   0.62% |   SAME   |
|   F64   |    2^24    | 400.635 us |       0.32% | 400.252 us |       0.23% | -0.383 us |  -0.10% |   SAME   |
|   F64   |    2^28    |   6.294 ms |       0.68% |   6.298 ms |       0.71% |  3.594 us |   0.06% |   SAME   |

# base

## [0] NVIDIA RTX A6000

|  T{ct}  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|---------|------------|------------|-------------|------------|-------------|-----------|---------|----------|
|   I8    |    2^16    |   6.875 us |       6.61% |   6.963 us |       9.47% |  0.088 us |   1.29% |   SAME   |
|   I8    |    2^20    |   7.358 us |       5.43% |   7.729 us |      14.10% |  0.371 us |   5.04% |   SAME   |
|   I8    |    2^24    |  31.397 us |       1.74% |  31.524 us |       1.84% |  0.127 us |   0.40% |   SAME   |
|   I8    |    2^28    | 385.523 us |       2.12% | 386.422 us |       2.35% |  0.898 us |   0.23% |   SAME   |
|   I16   |    2^16    |   6.847 us |       7.08% |   6.890 us |       7.97% |  0.043 us |   0.62% |   SAME   |
|   I16   |    2^20    |   9.512 us |       6.18% |   9.599 us |       5.57% |  0.087 us |   0.91% |   SAME   |
|   I16   |    2^24    |  53.861 us |       1.24% |  53.952 us |       1.03% |  0.091 us |   0.17% |   SAME   |
|   I16   |    2^28    | 758.217 us |       0.07% | 758.201 us |       0.08% | -0.015 us |  -0.00% |   SAME   |
|   I32   |    2^16    |   6.918 us |      11.25% |   6.917 us |       6.80% | -0.002 us |  -0.02% |   SAME   |
|   I32   |    2^20    |  12.813 us |       3.99% |  12.869 us |       5.81% |  0.056 us |   0.44% |   SAME   |
|   I32   |    2^24    | 100.863 us |       0.51% | 100.892 us |       0.53% |  0.029 us |   0.03% |   SAME   |
|   I32   |    2^28    |   1.509 ms |       0.04% |   1.509 ms |       0.04% | -0.087 us |  -0.01% |   SAME   |
|   I64   |    2^16    |   7.511 us |       6.84% |   7.537 us |      12.71% |  0.026 us |   0.35% |   SAME   |
|   I64   |    2^20    |  19.724 us |       2.88% |  19.394 us |       3.04% | -0.330 us |  -1.67% |   SAME   |
|   I64   |    2^24    | 197.895 us |       0.34% | 198.372 us |       0.47% |  0.477 us |   0.24% |   SAME   |
|   I64   |    2^28    |   3.055 ms |       0.06% |   3.062 ms |       0.06% |  7.501 us |   0.25% |   SLOW   |

Comment on lines +106 to +111

```cpp
{ // CUB requires a 32 or 64 bit offset type, so cast here
  using _OffsetType = ::cub::detail::choose_signed_offset_t<iter_difference_t<_InputIterator>>;
  return __par_impl(
    __policy,
    ::cuda::std::move(__first),
    static_cast<_OffsetType>(__count),
```
Contributor

Q: I don't understand why we need to cast the iterator difference type. CUB should be able to handle any offset type at the public API, otherwise we have a bug. Why is this needed?

Contributor Author

CUB has a static assert that it only accepts 32- and 64-bit signed integer types.

That might not always be the case; for example, counting_iterator has the payload type as its difference type if that is a signed integer.

Contributor

Ah, I get it now. You are calling a random internal dispatch function and not the public cub::DeviceTransform API. Why is this necessary? Why can't we just call the public API (which handles offset types accordingly)?

Contributor Author

Oh, that was just something Thrust did, so I wasn't sure whether it's still necessary.

@github-actions
Contributor

github-actions bot commented Feb 6, 2026

😬 CI Workflow Results

🟥 Finished in 1h 03m: Pass: 94%/101 | Total: 16h 54m | Max: 38m 44s | Hits: 98%/250260

See results here.


Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

[FEA]: Implement CUDA backend for parallel cuda::std::copy

4 participants