Skip to content

[libcu++] Use stream's context in PSTL#9219

Merged
davebayer merged 6 commits into
NVIDIA:mainfrom
davebayer:ensure_current_context_in_pstl
Jun 4, 2026
Merged

[libcu++] Use stream's context in PSTL#9219
davebayer merged 6 commits into
NVIDIA:mainfrom
davebayer:ensure_current_context_in_pstl

Conversation

@davebayer
Copy link
Copy Markdown
Contributor

Fixes #9212.

@davebayer davebayer requested a review from a team as a code owner June 2, 2026 19:41
@davebayer davebayer requested a review from pciolkosz June 2, 2026 19:41
@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jun 2, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 2, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 2, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7ae9932a-1a73-4795-b5c4-49ddc00ad03e

📥 Commits

Reviewing files that changed from the base of the PR and between e1e1182 and 2659805.

📒 Files selected for processing (1)
  • libcudacxx/include/cuda/std/__pstl/cuda/temporary_storage.h
🚧 Files skipped from review as they are similar to previous changes (1)
  • libcudacxx/include/cuda/std/__pstl/cuda/temporary_storage.h

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Added a CUDA backend helper to ensure the correct execution context is established for parallel algorithms.
  • Bug Fixes

    • Improved CUDA execution-context and stream handling across many parallel algorithms so kernels, copies, and synchronizations consistently use the correct GPU context and stream.
    • Fixed temporary-storage fallback to select the current device at runtime, improving memory allocation correctness when no explicit memory pool is provided.

important:

Walkthrough

Acquire policy stream and call __pstl_ensure_current_ctx_for(__policy) at __par_impl entry across CUDA PSTL dispatchers; remove later stream re-declarations so kernels, temporary storage and async memcpys use the policy's device/stream.

Changes

PSTL stream device context enforcement across CUDA algorithms

Layer / File(s) Summary
Core context utility and temporary-storage fallback
libcudacxx/include/cuda/std/__pstl/cuda/ensure_current_context.h, libcudacxx/include/cuda/std/__pstl/cuda/temporary_storage.h
Introduce __pstl_ensure_current_ctx_for(policy) which builds a context from policy stream when available or queries current device; temporary_storage fallback now queries current device instead of hardcoding device 0.
Scan algorithms
exclusive_scan.h, inclusive_scan.h
Scan dispatchers capture stream and call __pstl_ensure_current_ctx_for at entry; removed later stream re-fetches so CUB scans and sync reuse the early stream.
Reduce / Transform family
reduce.h, transform.h, transform_reduce.h
Reduce and transform dispatchers move stream/context setup to function entry; later in-body stream redeclarations removed; CUB kernels and host-device memcpys use the upfront stream.
Element search
find_if.h, max_element.h, min_element.h
Find/max/min capture const stream and ensure context before sizing and reduction; device-to-host copies and iterator computation use the early stream.
Copy / Generate / Adjacent / ForEach
copy_n.h, copy_if.h, generate_n.h, adjacent_difference.h, for_each_n.h
Copy/generate and utility algorithms acquire stream and ensure context before offset/return declarations; async copy-back and transforms use the early stream.
Partitioning
partition.h, partition_copy.h, stable_partition.h
Partition variants establish stream and context upfront; temporary-storage allocation and kernel launches reuse the initial stream.
Filter / Unique
remove_if.h, unique.h
remove_if and unique initialize stream/context early and remove later stream declarations; offsets/returns computed after context established.
Shift / Rotate
shift_left.h, shift_right.h, rotate.h, rotate_copy.h
Shift and rotate functions compute stream and ensure context before building iterator predicates and computing counts; later __stream redeclarations removed.
Merge / Sort
merge.h, sort.h
Merge and sort acquire const stream and ensure context before count computations and CUB launches; avoid post-launch stream lookups.

Assessment against linked issues

Objective Addressed Explanation
Ensure PSTL algorithms launch kernels on stream's device regardless of current device state [#9212]
Guard device-related operations with context [#9212]

Possibly related PRs

  • NVIDIA/cccl#9214: Overlaps with stream-acquisition/sites and stream-selection fallback changes in related dispatch code.

Suggested reviewers

  • gevtushenko
  • griwes
  • bernhardmgruber

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
libcudacxx/include/cuda/std/__pstl/cuda/sort.h (1)

150-151: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

important: __merge_sort_impl was not given the context guard, so the merge-sort path still launches on the process-current device instead of the stream's device — exactly the bug this PR fixes. __radix_sort_impl got __ctx (lines 90-91) but this path did not. Also __stream should be const here (it isn't mutated; .get()/.sync() are const).

-    const auto __count = ::cuda::std::distance(__first, __last);
-    auto __stream      = ::cuda::__call_or(::cuda::get_stream, ::cuda::stream_ref{::cudaStream_t{}}, __policy);
+    const auto __stream = ::cuda::__call_or(::cuda::get_stream, ::cuda::stream_ref{::cudaStream_t{}}, __policy);
+    const auto __ctx    = ::cuda::std::execution::__pstl_ensure_current_ctx_for(__policy);
+
+    const auto __count = ::cuda::std::distance(__first, __last);
🧹 Nitpick comments (1)
libcudacxx/include/cuda/std/__pstl/cuda/adjacent_difference.h (1)

78-78: 💤 Low value

suggestion: Error string and static_assert (Line 126) reference merge, but this is adjacent_difference. Misleading during debugging; rename to the correct algorithm.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: cea57cf8-1a55-4f7c-92e3-7a3fc64b7b47

📥 Commits

Reviewing files that changed from the base of the PR and between 2a82ae1 and 88396e9.

📒 Files selected for processing (26)
  • libcudacxx/include/cuda/std/__pstl/cuda/adjacent_difference.h
  • libcudacxx/include/cuda/std/__pstl/cuda/common.h
  • libcudacxx/include/cuda/std/__pstl/cuda/copy_if.h
  • libcudacxx/include/cuda/std/__pstl/cuda/copy_n.h
  • libcudacxx/include/cuda/std/__pstl/cuda/exclusive_scan.h
  • libcudacxx/include/cuda/std/__pstl/cuda/find_if.h
  • libcudacxx/include/cuda/std/__pstl/cuda/for_each_n.h
  • libcudacxx/include/cuda/std/__pstl/cuda/generate_n.h
  • libcudacxx/include/cuda/std/__pstl/cuda/inclusive_scan.h
  • libcudacxx/include/cuda/std/__pstl/cuda/max_element.h
  • libcudacxx/include/cuda/std/__pstl/cuda/merge.h
  • libcudacxx/include/cuda/std/__pstl/cuda/min_element.h
  • libcudacxx/include/cuda/std/__pstl/cuda/partition.h
  • libcudacxx/include/cuda/std/__pstl/cuda/partition_copy.h
  • libcudacxx/include/cuda/std/__pstl/cuda/reduce.h
  • libcudacxx/include/cuda/std/__pstl/cuda/remove_if.h
  • libcudacxx/include/cuda/std/__pstl/cuda/rotate.h
  • libcudacxx/include/cuda/std/__pstl/cuda/rotate_copy.h
  • libcudacxx/include/cuda/std/__pstl/cuda/shift_left.h
  • libcudacxx/include/cuda/std/__pstl/cuda/shift_right.h
  • libcudacxx/include/cuda/std/__pstl/cuda/sort.h
  • libcudacxx/include/cuda/std/__pstl/cuda/stable_partition.h
  • libcudacxx/include/cuda/std/__pstl/cuda/temporary_storage.h
  • libcudacxx/include/cuda/std/__pstl/cuda/transform.h
  • libcudacxx/include/cuda/std/__pstl/cuda/transform_reduce.h
  • libcudacxx/include/cuda/std/__pstl/cuda/unique.h

Comment thread libcudacxx/include/cuda/std/__pstl/cuda/ensure_current_context.h
Comment thread libcudacxx/include/cuda/std/__pstl/cuda/generate_n.h
_BinaryOp __binary_op)
{
auto __count = ::cuda::std::distance(__first, __last);
const auto __stream = ::cuda::__call_or(::cuda::get_stream, ::cuda::stream_ref{cudaStream_t{}}, __policy);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concern: Using the null stream as a default seems like a deviation from our policy of not using the null stream.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there is a better candidate to be honest, I think its either no default or null as default

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, however, I believe we had a discussion in the past about this and decided to go with null stream. I don't remember the exact reason. @miscco do you remember?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had a discussion about this, but the problem is that I need to follow what CUB does, otherwise if no stream is provided CUB will internally use a different stream

else
{
// If no stream was specified, we use the device 0.
return __ensure_current_context{device_ref{0}};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if no stream is specified and we are using the NULL stream, we should use whatever is the current device

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought so, but inside the temp memory allocation helper, we always allocate memory on device 0. I don't know what's correct then.

If we always used the current device, it would be easier to implement

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am 100% with @pciolkosz here that we should aim to use the current device.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return ::cuda::device_default_memory_pool(0);

This is a bug I think. We should also create the memory pool on the current device if no memory resource or stream was provided by the user. @davebayer please update this in this PR as well.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that is a bug

{ // no stream no memory resource, assume device 0
{
// If no stream was specified, we use the device 0.
return ::cuda::device_default_memory_pool(0);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, I see its pre-existing, I think this might also want to use the current device

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I'm not sure about that. @miscco what do you think?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the current device here. I think using device 0 is a bug here.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we should use the current device

@github-actions

This comment has been minimized.

Comment thread libcudacxx/include/cuda/std/__pstl/cuda/ensure_current_context.h
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
libcudacxx/include/cuda/std/__pstl/cuda/sort.h (1)

147-163: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

important: __merge_sort_impl still launches DeviceMergeSort::SortKeys without __pstl_ensure_current_ctx_for(__policy), so the merge-sort branch can still run under the wrong device context when policy stream device differs from current device. Add the same context guard at function entry in this branch (next to stream acquisition) before the CUB launch.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 87b5b756-2a28-4090-935a-e599872e30f2

📥 Commits

Reviewing files that changed from the base of the PR and between 88396e9 and eb92bc9.

📒 Files selected for processing (25)
  • libcudacxx/include/cuda/std/__pstl/cuda/adjacent_difference.h
  • libcudacxx/include/cuda/std/__pstl/cuda/copy_if.h
  • libcudacxx/include/cuda/std/__pstl/cuda/copy_n.h
  • libcudacxx/include/cuda/std/__pstl/cuda/ensure_current_context.h
  • libcudacxx/include/cuda/std/__pstl/cuda/exclusive_scan.h
  • libcudacxx/include/cuda/std/__pstl/cuda/find_if.h
  • libcudacxx/include/cuda/std/__pstl/cuda/for_each_n.h
  • libcudacxx/include/cuda/std/__pstl/cuda/generate_n.h
  • libcudacxx/include/cuda/std/__pstl/cuda/inclusive_scan.h
  • libcudacxx/include/cuda/std/__pstl/cuda/max_element.h
  • libcudacxx/include/cuda/std/__pstl/cuda/merge.h
  • libcudacxx/include/cuda/std/__pstl/cuda/min_element.h
  • libcudacxx/include/cuda/std/__pstl/cuda/partition.h
  • libcudacxx/include/cuda/std/__pstl/cuda/partition_copy.h
  • libcudacxx/include/cuda/std/__pstl/cuda/reduce.h
  • libcudacxx/include/cuda/std/__pstl/cuda/remove_if.h
  • libcudacxx/include/cuda/std/__pstl/cuda/rotate.h
  • libcudacxx/include/cuda/std/__pstl/cuda/rotate_copy.h
  • libcudacxx/include/cuda/std/__pstl/cuda/shift_left.h
  • libcudacxx/include/cuda/std/__pstl/cuda/shift_right.h
  • libcudacxx/include/cuda/std/__pstl/cuda/sort.h
  • libcudacxx/include/cuda/std/__pstl/cuda/stable_partition.h
  • libcudacxx/include/cuda/std/__pstl/cuda/transform.h
  • libcudacxx/include/cuda/std/__pstl/cuda/transform_reduce.h
  • libcudacxx/include/cuda/std/__pstl/cuda/unique.h

Comment thread libcudacxx/include/cuda/std/__pstl/cuda/ensure_current_context.h Outdated
Comment thread libcudacxx/include/cuda/std/__pstl/cuda/ensure_current_context.h Outdated
Comment thread libcudacxx/include/cuda/std/__pstl/cuda/ensure_current_context.h Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 140074fb-cd22-4205-8c0d-a8c100b88313

📥 Commits

Reviewing files that changed from the base of the PR and between eb92bc9 and e1e1182.

📒 Files selected for processing (2)
  • libcudacxx/include/cuda/std/__pstl/cuda/ensure_current_context.h
  • libcudacxx/include/cuda/std/__pstl/cuda/temporary_storage.h

Comment thread libcudacxx/include/cuda/std/__pstl/cuda/temporary_storage.h
@bernhardmgruber
Copy link
Copy Markdown
Contributor

Can you still run a benchmark for one of the algorithms and measure whether there is a performance impact? Otherwise the PR LGTM.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 3, 2026

🥳 CI Workflow Results

🟩 Finished in 1h 17m: Pass: 100%/115 | Total: 1d 17h | Max: 57m 04s | Hits: 75%/446956

See results here.

@davebayer
Copy link
Copy Markdown
Contributor Author

Tested on for_each:

['baseline.json', 'modified.json']

base

[0] NVIDIA RTX PRO 6000 Blackwell Workstation Edition

T{ct} Elements Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I8 2^16 5.429 us 85.53% 5.555 us 91.68% 0.126 us 2.32% SAME
I8 2^20 7.494 us 43.40% 7.540 us 50.66% 0.046 us 0.61% SAME
I8 2^24 35.958 us 14.01% 35.948 us 17.47% -0.009 us -0.03% SAME
I8 2^28 398.470 us 5.20% 398.127 us 4.73% -0.343 us -0.09% SAME
I16 2^16 5.600 us 63.58% 5.559 us 58.64% -0.041 us -0.73% SAME
I16 2^20 8.219 us 49.70% 8.295 us 47.59% 0.076 us 0.92% SAME
I16 2^24 45.351 us 16.02% 45.392 us 11.92% 0.041 us 0.09% SAME
I16 2^28 726.343 us 2.68% 726.432 us 2.73% 0.089 us 0.01% SAME
I32 2^16 5.680 us 56.38% 5.672 us 68.60% -0.008 us -0.14% SAME
I32 2^20 9.403 us 40.60% 8.919 us 40.96% -0.484 us -5.15% SAME
I32 2^24 89.735 us 8.78% 89.872 us 9.96% 0.137 us 0.15% SAME
I32 2^28 1.458 ms 1.98% 1.458 ms 1.95% 0.219 us 0.02% SAME
I64 2^16 5.918 us 62.06% 5.920 us 54.92% 0.001 us 0.02% SAME
I64 2^20 13.063 us 28.37% 13.195 us 32.64% 0.132 us 1.01% SAME
I64 2^24 186.034 us 7.14% 186.178 us 7.24% 0.144 us 0.08% SAME
I64 2^28 2.939 ms 2.04% 2.938 ms 1.87% -0.917 us -0.03% SAME
I128 2^16 6.898 us 44.92% 6.665 us 53.42% -0.233 us -3.38% SAME
I128 2^20 23.837 us 19.45% 23.817 us 22.02% -0.020 us -0.09% SAME
I128 2^24 368.747 us 5.44% 368.419 us 5.06% -0.327 us -0.09% SAME
I128 2^28 5.883 ms 1.62% 5.882 ms 1.65% -1.306 us -0.02% SAME
F32 2^16 5.675 us 62.22% 5.677 us 58.59% 0.001 us 0.02% SAME
F32 2^20 9.397 us 41.73% 9.369 us 37.31% -0.029 us -0.30% SAME
F32 2^24 89.833 us 9.55% 89.771 us 10.39% -0.062 us -0.07% SAME
F32 2^28 1.457 ms 1.80% 1.457 ms 1.80% 0.307 us 0.02% SAME
F64 2^16 5.896 us 52.99% 5.983 us 56.32% 0.087 us 1.48% SAME
F64 2^20 13.332 us 27.25% 13.602 us 33.44% 0.270 us 2.02% SAME
F64 2^24 186.167 us 6.62% 185.990 us 6.28% -0.177 us -0.10% SAME
F64 2^28 2.938 ms 1.99% 2.938 ms 2.04% -0.106 us -0.00% SAME

Summary

  • Total Matches: 28
    • Pass (diff <= min_noise): 28
    • Unknown (infinite noise): 0
    • Failure (diff > min_noise): 0

@davebayer davebayer merged commit 2f7cb8b into NVIDIA:main Jun 4, 2026
136 checks passed
@github-project-automation github-project-automation Bot moved this from In Review to Done in CCCL Jun 4, 2026
@davebayer davebayer deleted the ensure_current_context_in_pstl branch June 4, 2026 11:39
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 4, 2026

Backport failed for branch/3.4.x, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin branch/3.4.x
git worktree add -d .worktree/backport-9219-to-branch/3.4.x origin/branch/3.4.x
cd .worktree/backport-9219-to-branch/3.4.x
git switch --create backport-9219-to-branch/3.4.x
git cherry-pick -x 2f7cb8b5a363da93ebef37e93ff9c4eab70edce4

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

[BUG]: PSTL algorithms don't launch on stream's device

5 participants