
Implement parallel cuda::std::replace_copy #7410

Merged: miscco merged 1 commit into NVIDIA:main from miscco:parallel_replace_copy on Feb 5, 2026


@miscco
Contributor

@miscco miscco commented Jan 29, 2026

This implements the replace_copy{_if} algorithms for the cuda backend.

It provides tests and benchmarks similar to Thrust and some boilerplate for libcu++

The functionality is not publicly available yet and is implemented in a private internal header.

Fixes #7409
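For readers unfamiliar with the algorithms, here is a minimal serial sketch of their semantics using the host-side `std::replace_copy` and `std::replace_copy_if`. It illustrates only what the algorithms compute; it does not use or show the parallel CUDA backend from this PR, and the function names `replace_copy_demo` and `replace_copy_if_demo` are made up for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Serial reference for the semantics only; not the PR's parallel implementation.
inline std::vector<int> replace_copy_demo()
{
  const std::vector<int> input{1, 2, 3, 2, 5};
  std::vector<int> output(input.size());
  // Copy input to output, writing 9 wherever the source element equals 2.
  std::replace_copy(input.begin(), input.end(), output.begin(), 2, 9);
  // The input range is left untouched; only the output sees the replacement.
  assert(input[1] == 2 && output[1] == 9);
  return output;
}

inline std::vector<int> replace_copy_if_demo()
{
  const std::vector<int> input{1, 2, 3, 2, 5};
  std::vector<int> output(input.size());
  // Copy input to output, writing 0 wherever the predicate holds (odd values here).
  std::replace_copy_if(input.begin(), input.end(), output.begin(),
                       [](int x) { return x % 2 != 0; }, 0);
  assert(output[0] == 0 && output[1] == 2);
  return output;
}
```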

@miscco miscco requested review from a team as code owners January 29, 2026 09:29
@github-project-automation github-project-automation bot moved this to Todo in CCCL Jan 29, 2026
@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Jan 29, 2026
@miscco miscco force-pushed the parallel_replace_copy branch from 6ba6554 to 7e2c03e Compare January 29, 2026 10:04

@bernhardmgruber
Contributor

This PR states to add replace_copy but it also includes transform. Can you please scope down the PR to only include the algorithm it prescribes? Thx!

@miscco
Contributor Author

miscco commented Feb 2, 2026

> This PR states to add replace_copy but it also includes transform. Can you please scope down the PR to only include the algorithm it prescribes? Thx!

replace_copy is implemented in terms of transform so this contains the transform commit.

A kingdom for stacked PRs
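As a rough illustration of why `replace_copy` can be layered on `transform`, the sketch below expresses it as a per-element selection passed to the host-side `std::transform`. The names `replace_select` and `replace_copy_via_transform` are hypothetical and chosen for this example; the PR's internal header may be structured differently.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical selector: yields new_value for elements equal to old_value,
// and the element itself otherwise.
template <class T>
struct replace_select
{
  T old_value;
  T new_value;

  constexpr T operator()(const T& value) const
  {
    return value == old_value ? new_value : value;
  }
};

// replace_copy written as a transform with the selector above (serial sketch).
template <class InputIt, class OutputIt, class T>
OutputIt replace_copy_via_transform(InputIt first, InputIt last, OutputIt result,
                                    const T& old_value, const T& new_value)
{
  return std::transform(first, last, result, replace_select<T>{old_value, new_value});
}

inline void replace_copy_via_transform_demo()
{
  const std::vector<int> in{1, 2, 3, 2};
  std::vector<int> out(in.size());
  replace_copy_via_transform(in.begin(), in.end(), out.begin(), 2, 7);
  assert(out[1] == 7 && out[2] == 3);
}
```

Because the selection is a pure element-wise function, any parallel `transform` can apply it without coordination between elements, which is presumably why the transform commit is a prerequisite here.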

@miscco
Contributor Author

miscco commented Feb 3, 2026

Benchmarks against Thrust look great:

['thrust_replace_copy.json', 'pstl_replace_copy.json']
# base

## [0] NVIDIA RTX A6000

|  T{ct}  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |        Diff |   %Diff |  Status  |
|---------|------------|------------|-------------|------------|-------------|-------------|---------|----------|
|   I8    |    2^16    |   7.621 us |       7.41% |   7.549 us |       6.78% |   -0.072 us |  -0.94% |   SAME   |
|   I8    |    2^20    |  12.047 us |       4.55% |  10.737 us |       4.85% |   -1.309 us | -10.87% |   FAST   |
|   I8    |    2^24    |  64.882 us |       1.67% |  57.305 us |       2.00% |   -7.577 us | -11.68% |   FAST   |
|   I8    |    2^28    | 892.615 us |       2.26% | 800.015 us |       2.14% |  -92.599 us | -10.37% |   FAST   |
|   I16   |    2^16    |   7.759 us |       6.95% |   7.626 us |       9.75% |   -0.133 us |  -1.71% |   SAME   |
|   I16   |    2^20    |  14.349 us |       3.27% |  12.751 us |       4.14% |   -1.599 us | -11.14% |   FAST   |
|   I16   |    2^24    | 118.931 us |       1.44% | 106.084 us |       0.66% |  -12.847 us | -10.80% |   FAST   |
|   I16   |    2^28    |   1.758 ms |       1.25% |   1.585 ms |       1.57% | -173.108 us |  -9.85% |   FAST   |
|   I32   |    2^16    |   8.069 us |       8.56% |   7.889 us |       6.94% |   -0.180 us |  -2.23% |   SAME   |
|   I32   |    2^20    |  20.643 us |       4.99% |  19.459 us |       3.21% |   -1.184 us |  -5.73% |   FAST   |
|   I32   |    2^24    | 211.407 us |       0.44% | 204.111 us |       0.38% |   -7.295 us |  -3.45% |   FAST   |
|   I32   |    2^28    |   3.267 ms |       1.12% |   3.156 ms |       1.04% | -111.015 us |  -3.40% |   FAST   |
|   I64   |    2^16    |   8.585 us |       9.08% |   8.457 us |       9.38% |   -0.128 us |  -1.49% |   SAME   |
|   I64   |    2^20    |  33.476 us |       2.56% |  32.146 us |       2.59% |   -1.330 us |  -3.97% |   FAST   |
|   I64   |    2^24    | 420.794 us |       0.44% | 400.107 us |       0.21% |  -20.688 us |  -4.92% |   FAST   |
|   I64   |    2^28    |   6.566 ms |       0.69% |   6.296 ms |       0.61% | -269.285 us |  -4.10% |   FAST   |
|  I128   |    2^16    |   9.955 us |      10.79% |   9.886 us |       5.31% |   -0.069 us |  -0.69% |   SAME   |
|  I128   |    2^20    |  59.414 us |       1.03% |  58.701 us |       0.97% |   -0.713 us |  -1.20% |   FAST   |
|  I128   |    2^24    | 795.327 us |       0.78% | 795.122 us |       1.21% |   -0.205 us |  -0.03% |   SAME   |
|  I128   |    2^28    |  12.582 ms |       0.40% |  12.572 ms |       0.44% |  -10.219 us |  -0.08% |   SAME   |
|   F32   |    2^16    |   7.909 us |       7.24% |   7.869 us |       7.21% |   -0.040 us |  -0.50% |   SAME   |
|   F32   |    2^20    |  20.866 us |       2.68% |  19.716 us |       4.48% |   -1.150 us |  -5.51% |   FAST   |
|   F32   |    2^24    | 225.280 us |       0.56% | 217.118 us |       0.41% |   -8.161 us |  -3.62% |   FAST   |
|   F32   |    2^28    |   3.275 ms |       1.22% |   3.158 ms |       0.93% | -117.862 us |  -3.60% |   FAST   |
|   F64   |    2^16    |   8.581 us |       5.69% |   8.968 us |       5.29% |    0.386 us |   4.50% |   SAME   |
|   F64   |    2^20    |  33.614 us |       2.59% |  32.552 us |       2.31% |   -1.062 us |  -3.16% |   FAST   |
|   F64   |    2^24    | 421.692 us |       0.49% | 400.678 us |       0.23% |  -21.013 us |  -4.98% |   FAST   |
|   F64   |    2^28    |   6.583 ms |       0.74% |   6.298 ms |       0.63% | -284.740 us |  -4.33% |   FAST   |

['thrust_replace_copy_if.json', 'pstl_replace_copy_if.json']
# base

## [0] NVIDIA RTX A6000

|  T{ct}  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |        Diff |   %Diff |  Status  |
|---------|------------|------------|-------------|------------|-------------|-------------|---------|----------|
|   I8    |    2^16    |   7.613 us |       7.38% |   7.817 us |      11.19% |    0.205 us |   2.69% |   SAME   |
|   I8    |    2^20    |  12.072 us |       4.54% |  11.028 us |       8.05% |   -1.044 us |  -8.65% |   FAST   |
|   I8    |    2^24    |  64.870 us |       1.55% |  57.533 us |       1.33% |   -7.337 us | -11.31% |   FAST   |
|   I8    |    2^28    | 892.830 us |       2.06% | 801.210 us |       2.31% |  -91.620 us | -10.26% |   FAST   |
|   I16   |    2^16    |   7.816 us |       8.21% |   7.678 us |      12.78% |   -0.139 us |  -1.78% |   SAME   |
|   I16   |    2^20    |  14.344 us |       4.00% |  12.961 us |       4.39% |   -1.383 us |  -9.64% |   FAST   |
|   I16   |    2^24    | 119.073 us |       1.48% | 106.172 us |       0.93% |  -12.901 us | -10.83% |   FAST   |
|   I16   |    2^28    |   1.759 ms |       1.41% |   1.586 ms |       1.73% | -172.766 us |  -9.82% |   FAST   |
|   I32   |    2^16    |   8.001 us |       5.63% |   7.963 us |      13.36% |   -0.039 us |  -0.48% |   SAME   |
|   I32   |    2^20    |  20.461 us |       2.79% |  19.519 us |       4.98% |   -0.942 us |  -4.60% |   FAST   |
|   I32   |    2^24    | 211.631 us |       0.48% | 204.201 us |       0.37% |   -7.429 us |  -3.51% |   FAST   |
|   I32   |    2^28    |   3.266 ms |       1.03% |   3.153 ms |       0.97% | -112.977 us |  -3.46% |   FAST   |
|   I64   |    2^16    |   8.590 us |       6.38% |   8.569 us |      13.37% |   -0.021 us |  -0.24% |   SAME   |
|   I64   |    2^20    |  33.558 us |       2.54% |  32.236 us |       2.42% |   -1.322 us |  -3.94% |   FAST   |
|   I64   |    2^24    | 420.769 us |       0.41% | 400.197 us |       0.23% |  -20.572 us |  -4.89% |   FAST   |
|   I64   |    2^28    |   6.566 ms |       0.63% |   6.296 ms |       0.60% | -269.934 us |  -4.11% |   FAST   |
|  I128   |    2^16    |  10.017 us |       8.64% |   9.996 us |       9.36% |   -0.021 us |  -0.21% |   SAME   |
|  I128   |    2^20    |  59.435 us |       1.00% |  58.837 us |       1.13% |   -0.598 us |  -1.01% |   FAST   |
|  I128   |    2^24    | 796.817 us |       1.37% | 796.416 us |       1.55% |   -0.401 us |  -0.05% |   SAME   |
|  I128   |    2^28    |  12.582 ms |       0.39% |  12.571 ms |       0.39% |  -10.603 us |  -0.08% |   SAME   |
|   F32   |    2^16    |   7.882 us |       8.69% |   7.900 us |       7.07% |    0.018 us |   0.23% |   SAME   |
|   F32   |    2^20    |  20.810 us |       3.23% |  19.704 us |       2.75% |   -1.106 us |  -5.32% |   FAST   |
|   F32   |    2^24    | 225.382 us |       0.53% | 217.085 us |       0.39% |   -8.297 us |  -3.68% |   FAST   |
|   F32   |    2^28    |   3.274 ms |       0.96% |   3.158 ms |       0.94% | -116.037 us |  -3.54% |   FAST   |
|   F64   |    2^16    |   8.556 us |      11.57% |   8.846 us |       6.26% |    0.291 us |   3.40% |   SAME   |
|   F64   |    2^20    |  33.404 us |       2.42% |  32.427 us |       2.22% |   -0.977 us |  -2.92% |   FAST   |
|   F64   |    2^24    | 420.743 us |       0.48% | 400.602 us |       0.22% |  -20.141 us |  -4.79% |   FAST   |
|   F64   |    2^28    |   6.566 ms |       0.65% |   6.298 ms |       0.68% | -267.288 us |  -4.07% |   FAST   |
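For reading the tables: the Diff and %Diff columns are consistent with `Cmp - Ref` and `(Cmp - Ref) / Ref`, so negative values mean the compared run was faster. The sketch below reproduces one row under the assumption that the first listed file (Thrust) is the reference and the second (PSTL) is the compared run; `percent_diff` is a helper name invented for this example.

```cpp
#include <cassert>
#include <cmath>

// Relative change of the compared time against the reference time, in percent.
// Negative values mean the compared run was faster than the reference.
inline double percent_diff(double ref_time, double cmp_time)
{
  return (cmp_time - ref_time) / ref_time * 100.0;
}

inline void percent_diff_demo()
{
  // I8, 2^20 row of the replace_copy table: Ref 12.047 us, Cmp 10.737 us.
  const double diff = percent_diff(12.047, 10.737);
  // Agrees with the table's -10.87% up to rounding of the printed times.
  assert(std::fabs(diff - (-10.87)) < 0.05);
}
```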

@miscco miscco force-pushed the parallel_replace_copy branch from 7e2c03e to 5ee4289 Compare February 3, 2026 12:07

@miscco miscco force-pushed the parallel_replace_copy branch from 5ee4289 to d832910 Compare February 3, 2026 13:54

This implements the `replace_copy{_if}` algorithms for the cuda backend.

* std::replace_copy see https://en.cppreference.com/w/cpp/algorithm/replace_copy.html
* std::replace_copy_if see https://en.cppreference.com/w/cpp/algorithm/replace_copy.html

It provides tests and benchmarks similar to Thrust and some boilerplate for libcu++

The functionality is not publicly available yet and is implemented in a private internal header.

Fixes NVIDIA#7409
@miscco miscco force-pushed the parallel_replace_copy branch from d832910 to ff09673 Compare February 4, 2026 16:51
@github-actions
Contributor

github-actions bot commented Feb 4, 2026

🥳 CI Workflow Results

🟩 Finished in 1h 29m: Pass: 100%/95 | Total: 17h 19m | Max: 1h 05m | Hits: 98%/248496

See results here.

@miscco miscco enabled auto-merge (squash) February 5, 2026 09:40
Comment on lines +52 to +56
    _CCCL_HOST_API constexpr __replace_copy_select(const _Tp& __old_value,
                                                   const _Tp& __new_value) noexcept(is_nothrow_copy_constructible_v<_Tp>)
        : __old_value_(__old_value)
        , __new_value_(__new_value)
    {}
Contributor

Suggestion: just use aggregate init and drop the ctor.

Contributor Author

This does not work in all supported CTK versions.
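To make the two options in this exchange concrete, here is a simplified sketch contrasting a functor with a user-provided constructor against the aggregate form the reviewer suggested. The type names `select_with_ctor` and `select_aggregate` and the `operator()` bodies are assumptions for illustration; only the member names mirror the reviewed snippet. The thread does not say why aggregate initialization fails on some CTK versions, so the sketch does not try to reproduce that.

```cpp
#include <cassert>

// Variant with a user-provided constructor, mirroring the shape of the reviewed code.
template <class T>
struct select_with_ctor
{
  T old_value_;
  T new_value_;

  constexpr select_with_ctor(const T& old_value, const T& new_value)
      : old_value_(old_value)
      , new_value_(new_value)
  {}

  constexpr T operator()(const T& value) const
  {
    return value == old_value_ ? new_value_ : value;
  }
};

// Variant without a constructor: the type is an aggregate and is brace-initialized.
template <class T>
struct select_aggregate
{
  T old_value_;
  T new_value_;

  constexpr T operator()(const T& value) const
  {
    return value == old_value_ ? new_value_ : value;
  }
};

inline void select_demo()
{
  constexpr select_with_ctor<int> with_ctor(2, 9);
  constexpr select_aggregate<int> aggregate{2, 9};
  static_assert(with_ctor(2) == 9 && aggregate(2) == 9, "matching elements are replaced");
  assert(with_ctor(3) == 3 && aggregate(3) == 3);
}
```

On a conforming host compiler both forms behave the same; the constructor version trades a few lines of boilerplate for not depending on aggregate initialization being handled correctly by every toolchain.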

Comment on lines +52 to +56
    _CCCL_HOST_API constexpr __replace_copy_if_select(_UnaryPred __pred, const _Tp& __new_value) noexcept(
        is_nothrow_move_constructible_v<_UnaryPred> && is_nothrow_copy_constructible_v<_Tp>)
        : __pred_(__pred)
        , __new_value_(__new_value)
    {}
Contributor

Same here.

Contributor Author

Ditto, this does not work everywhere.
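For context on the predicate variant discussed here, the following host-side sketch shows how a selector holding a predicate and a replacement value could drive `replace_copy_if` through `std::transform`. The name `replace_if_select` and its `operator()` are hypothetical; only the stored predicate and new value correspond to the reviewed constructor.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical selector: yields new_value_ where the predicate holds,
// and the element itself otherwise.
template <class UnaryPred, class T>
struct replace_if_select
{
  UnaryPred pred_;
  T new_value_;

  replace_if_select(UnaryPred pred, const T& new_value)
      : pred_(std::move(pred))
      , new_value_(new_value)
  {}

  T operator()(const T& value) const
  {
    return pred_(value) ? new_value_ : value;
  }
};

inline void replace_if_select_demo()
{
  const std::vector<int> in{1, -2, 3, -4};
  std::vector<int> out(in.size());
  auto is_negative = [](int x) { return x < 0; };
  // Serial stand-in for the parallel transform used by the backend.
  std::transform(in.begin(), in.end(), out.begin(),
                 replace_if_select<decltype(is_negative), int>(is_negative, 0));
  assert((out == std::vector<int>{1, 0, 3, 0}));
}
```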

@miscco miscco merged commit 921d037 into NVIDIA:main Feb 5, 2026
113 checks passed
@miscco miscco deleted the parallel_replace_copy branch February 5, 2026 10:36
@github-project-automation github-project-automation bot moved this from In Review to Done in CCCL Feb 5, 2026
Linked issue closed by this pull request: [FEA]: Implement CUDA backend for parallel cuda::std::replace_copy