[libcu++] Use stream's context in PSTL by davebayer · Pull Request #9219 · NVIDIA/cccl

davebayer · 2026-06-02T19:41:30Z

Fixes #9212.

coderabbitai · 2026-06-02T19:48:03Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7ae9932a-1a73-4795-b5c4-49ddc00ad03e

📥 Commits

Reviewing files that changed from the base of the PR and between e1e1182 and 2659805.

📒 Files selected for processing (1)

libcudacxx/include/cuda/std/__pstl/cuda/temporary_storage.h

🚧 Files skipped from review as they are similar to previous changes (1)

libcudacxx/include/cuda/std/__pstl/cuda/temporary_storage.h

📝 Walkthrough

Summary by CodeRabbit

New Features
- Added a CUDA backend helper to ensure the correct execution context is established for parallel algorithms.
Bug Fixes
- Improved CUDA execution-context and stream handling across many parallel algorithms so kernels, copies, and synchronizations consistently use the correct GPU context and stream.
- Fixed temporary-storage fallback to select the current device at runtime, improving memory allocation correctness when no explicit memory pool is provided.

important:

Walkthrough

Acquire policy stream and call __pstl_ensure_current_ctx_for(__policy) at __par_impl entry across CUDA PSTL dispatchers; remove later stream re-declarations so kernels, temporary storage and async memcpys use the policy's device/stream.

Changes

PSTL stream device context enforcement across CUDA algorithms

Layer / File(s)	Summary
Core context utility and temporary-storage fallback `libcudacxx/include/cuda/std/__pstl/cuda/ensure_current_context.h`, `libcudacxx/include/cuda/std/__pstl/cuda/temporary_storage.h`	Introduce `__pstl_ensure_current_ctx_for(policy)` which builds a context from policy stream when available or queries current device; temporary_storage fallback now queries current device instead of hardcoding device 0.
Scan algorithms `exclusive_scan.h`, `inclusive_scan.h`	Scan dispatchers capture stream and call `__pstl_ensure_current_ctx_for` at entry; removed later stream re-fetches so CUB scans and sync reuse the early stream.
Reduce / Transform family `reduce.h`, `transform.h`, `transform_reduce.h`	Reduce and transform dispatchers move stream/context setup to function entry; later in-body stream redeclarations removed; CUB kernels and host-device memcpys use the upfront stream.
Element search `find_if.h`, `max_element.h`, `min_element.h`	Find/max/min capture const stream and ensure context before sizing and reduction; device-to-host copies and iterator computation use the early stream.
Copy / Generate / Adjacent / ForEach `copy_n.h`, `copy_if.h`, `generate_n.h`, `adjacent_difference.h`, `for_each_n.h`	Copy/generate and utility algorithms acquire stream and ensure context before offset/return declarations; async copy-back and transforms use the early stream.
Partitioning `partition.h`, `partition_copy.h`, `stable_partition.h`	Partition variants establish stream and context upfront; temporary-storage allocation and kernel launches reuse the initial stream.
Filter / Unique `remove_if.h`, `unique.h`	remove_if and unique initialize stream/context early and remove later stream declarations; offsets/returns computed after context established.
Shift / Rotate `shift_left.h`, `shift_right.h`, `rotate.h`, `rotate_copy.h`	Shift and rotate functions compute stream and ensure context before building iterator predicates and computing counts; later __stream redeclarations removed.
Merge / Sort `merge.h`, `sort.h`	Merge and sort acquire const stream and ensure context before count computations and CUB launches; avoid post-launch stream lookups.

Assessment against linked issues

Objective	Addressed	Explanation
Ensure PSTL algorithms launch kernels on stream's device regardless of current device state [`#9212`]	✅
Guard device-related operations with context [`#9212`]	✅

Possibly related PRs

NVIDIA/cccl#9214: Overlaps with stream-acquisition/sites and stream-selection fallback changes in related dispatch code.

Suggested reviewers

gevtushenko
griwes
bernhardmgruber

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

libcudacxx/include/cuda/std/__pstl/cuda/sort.h (1)
150-151: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

important: __merge_sort_impl was not given the context guard, so the merge-sort path still launches on the process-current device instead of the stream's device — exactly the bug this PR fixes. __radix_sort_impl got __ctx (lines 90-91) but this path did not. Also __stream should be const here (it isn't mutated; .get()/.sync() are const).
-    const auto __count = ::cuda::std::distance(__first, __last);
-    auto __stream      = ::cuda::__call_or(::cuda::get_stream, ::cuda::stream_ref{::cudaStream_t{}}, __policy);
+    const auto __stream = ::cuda::__call_or(::cuda::get_stream, ::cuda::stream_ref{::cudaStream_t{}}, __policy);
+    const auto __ctx    = ::cuda::std::execution::__pstl_ensure_current_ctx_for(__policy);
+
+    const auto __count = ::cuda::std::distance(__first, __last);

🧹 Nitpick comments (1)

libcudacxx/include/cuda/std/__pstl/cuda/adjacent_difference.h (1)

78-78: 💤 Low value

suggestion: Error string and static_assert (Line 126) reference merge, but this is adjacent_difference. Misleading during debugging; rename to the correct algorithm.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: cea57cf8-1a55-4f7c-92e3-7a3fc64b7b47

📥 Commits

Reviewing files that changed from the base of the PR and between 2a82ae1 and 88396e9.

📒 Files selected for processing (26)

libcudacxx/include/cuda/std/__pstl/cuda/adjacent_difference.h
libcudacxx/include/cuda/std/__pstl/cuda/common.h
libcudacxx/include/cuda/std/__pstl/cuda/copy_if.h
libcudacxx/include/cuda/std/__pstl/cuda/copy_n.h
libcudacxx/include/cuda/std/__pstl/cuda/exclusive_scan.h
libcudacxx/include/cuda/std/__pstl/cuda/find_if.h
libcudacxx/include/cuda/std/__pstl/cuda/for_each_n.h
libcudacxx/include/cuda/std/__pstl/cuda/generate_n.h
libcudacxx/include/cuda/std/__pstl/cuda/inclusive_scan.h
libcudacxx/include/cuda/std/__pstl/cuda/max_element.h
libcudacxx/include/cuda/std/__pstl/cuda/merge.h
libcudacxx/include/cuda/std/__pstl/cuda/min_element.h
libcudacxx/include/cuda/std/__pstl/cuda/partition.h
libcudacxx/include/cuda/std/__pstl/cuda/partition_copy.h
libcudacxx/include/cuda/std/__pstl/cuda/reduce.h
libcudacxx/include/cuda/std/__pstl/cuda/remove_if.h
libcudacxx/include/cuda/std/__pstl/cuda/rotate.h
libcudacxx/include/cuda/std/__pstl/cuda/rotate_copy.h
libcudacxx/include/cuda/std/__pstl/cuda/shift_left.h
libcudacxx/include/cuda/std/__pstl/cuda/shift_right.h
libcudacxx/include/cuda/std/__pstl/cuda/sort.h
libcudacxx/include/cuda/std/__pstl/cuda/stable_partition.h
libcudacxx/include/cuda/std/__pstl/cuda/temporary_storage.h
libcudacxx/include/cuda/std/__pstl/cuda/transform.h
libcudacxx/include/cuda/std/__pstl/cuda/transform_reduce.h
libcudacxx/include/cuda/std/__pstl/cuda/unique.h

jrhemstad · 2026-06-02T19:49:28Z

    _BinaryOp __binary_op)
  {
-    auto __count = ::cuda::std::distance(__first, __last);
+    const auto __stream = ::cuda::__call_or(::cuda::get_stream, ::cuda::stream_ref{cudaStream_t{}}, __policy);


Concern: Using the null stream as a default seems like a deviation from our policy of not using the null stream.

I don't think there is a better candidate to be honest, I think its either no default or null as default

I agree, however, I believe we had a discussion in the past about this and decided to go with null stream. I don't remember the exact reason. @miscco do you remember?

We had a discussion about this, but the problem is that I need to follow what CUB does, otherwise if no stream is provided CUB will internally use a different stream

pciolkosz · 2026-06-02T19:50:16Z

+  else
+  {
+    // If no stream was specified, we use the device 0.
+    return __ensure_current_context{device_ref{0}};


I think if no stream is specified and we are using the NULL stream, we should use whatever is the current device

I thought so, but inside the temp memory allocation helper, we always allocate memory on device 0. I don't know what's correct then.

If we always used the current device, it would be easier to implement

I am 100% with @pciolkosz here that we should aim to use the current device.

cccl/libcudacxx/include/cuda/std/__pstl/cuda/temporary_storage.h

Line 144 in cfe7e26

return ::cuda::device_default_memory_pool(0);

This is a bug I think. We should also create the memory pool on the current device if no memory resource or stream was provided by the user. @davebayer please update this in this PR as well.

yeah that is a bug

pciolkosz · 2026-06-02T19:51:45Z

-    { // no stream no memory resource, assume device 0
+    {
+      // If no stream was specified, we use the device 0.
      return ::cuda::device_default_memory_pool(0);


Ahh, I see its pre-existing, I think this might also want to use the current device

Actually, I'm not sure about that. @miscco what do you think?

Use the current device here. I think using device 0 is a bug here.

Yeah we should use the current device

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

libcudacxx/include/cuda/std/__pstl/cuda/sort.h (1)

147-163: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

important: __merge_sort_impl still launches DeviceMergeSort::SortKeys without __pstl_ensure_current_ctx_for(__policy), so the merge-sort branch can still run under the wrong device context when policy stream device differs from current device. Add the same context guard at function entry in this branch (next to stream acquisition) before the CUB launch.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 87b5b756-2a28-4090-935a-e599872e30f2

📥 Commits

Reviewing files that changed from the base of the PR and between 88396e9 and eb92bc9.

📒 Files selected for processing (25)

libcudacxx/include/cuda/std/__pstl/cuda/adjacent_difference.h
libcudacxx/include/cuda/std/__pstl/cuda/copy_if.h
libcudacxx/include/cuda/std/__pstl/cuda/copy_n.h
libcudacxx/include/cuda/std/__pstl/cuda/ensure_current_context.h
libcudacxx/include/cuda/std/__pstl/cuda/exclusive_scan.h
libcudacxx/include/cuda/std/__pstl/cuda/find_if.h
libcudacxx/include/cuda/std/__pstl/cuda/for_each_n.h
libcudacxx/include/cuda/std/__pstl/cuda/generate_n.h
libcudacxx/include/cuda/std/__pstl/cuda/inclusive_scan.h
libcudacxx/include/cuda/std/__pstl/cuda/max_element.h
libcudacxx/include/cuda/std/__pstl/cuda/merge.h
libcudacxx/include/cuda/std/__pstl/cuda/min_element.h
libcudacxx/include/cuda/std/__pstl/cuda/partition.h
libcudacxx/include/cuda/std/__pstl/cuda/partition_copy.h
libcudacxx/include/cuda/std/__pstl/cuda/reduce.h
libcudacxx/include/cuda/std/__pstl/cuda/remove_if.h
libcudacxx/include/cuda/std/__pstl/cuda/rotate.h
libcudacxx/include/cuda/std/__pstl/cuda/rotate_copy.h
libcudacxx/include/cuda/std/__pstl/cuda/shift_left.h
libcudacxx/include/cuda/std/__pstl/cuda/shift_right.h
libcudacxx/include/cuda/std/__pstl/cuda/sort.h
libcudacxx/include/cuda/std/__pstl/cuda/stable_partition.h
libcudacxx/include/cuda/std/__pstl/cuda/transform.h
libcudacxx/include/cuda/std/__pstl/cuda/transform_reduce.h
libcudacxx/include/cuda/std/__pstl/cuda/unique.h

coderabbitai

Actionable comments posted: 1

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 140074fb-cd22-4205-8c0d-a8c100b88313

📥 Commits

Reviewing files that changed from the base of the PR and between eb92bc9 and e1e1182.

📒 Files selected for processing (2)

libcudacxx/include/cuda/std/__pstl/cuda/ensure_current_context.h
libcudacxx/include/cuda/std/__pstl/cuda/temporary_storage.h

bernhardmgruber · 2026-06-03T12:52:55Z

Can you still run a benchmark for one of the algorithms and measure whether there is a performance impact? Otherwise the PR LGTM.

github-actions · 2026-06-03T14:07:19Z

🥳 CI Workflow Results

🟩 Finished in 1h 17m: Pass: 100%/115 | Total: 1d 17h | Max: 57m 04s | Hits: 75%/446956

See results here.

davebayer · 2026-06-04T11:26:00Z

Tested on for_each:

['baseline.json', 'modified.json']

base

[0] NVIDIA RTX PRO 6000 Blackwell Workstation Edition

T{ct}	Elements	Ref Time	Ref Noise	Cmp Time	Cmp Noise	Diff	%Diff	Status
I8	2^16	5.429 us	85.53%	5.555 us	91.68%	0.126 us	2.32%	SAME
I8	2^20	7.494 us	43.40%	7.540 us	50.66%	0.046 us	0.61%	SAME
I8	2^24	35.958 us	14.01%	35.948 us	17.47%	-0.009 us	-0.03%	SAME
I8	2^28	398.470 us	5.20%	398.127 us	4.73%	-0.343 us	-0.09%	SAME
I16	2^16	5.600 us	63.58%	5.559 us	58.64%	-0.041 us	-0.73%	SAME
I16	2^20	8.219 us	49.70%	8.295 us	47.59%	0.076 us	0.92%	SAME
I16	2^24	45.351 us	16.02%	45.392 us	11.92%	0.041 us	0.09%	SAME
I16	2^28	726.343 us	2.68%	726.432 us	2.73%	0.089 us	0.01%	SAME
I32	2^16	5.680 us	56.38%	5.672 us	68.60%	-0.008 us	-0.14%	SAME
I32	2^20	9.403 us	40.60%	8.919 us	40.96%	-0.484 us	-5.15%	SAME
I32	2^24	89.735 us	8.78%	89.872 us	9.96%	0.137 us	0.15%	SAME
I32	2^28	1.458 ms	1.98%	1.458 ms	1.95%	0.219 us	0.02%	SAME
I64	2^16	5.918 us	62.06%	5.920 us	54.92%	0.001 us	0.02%	SAME
I64	2^20	13.063 us	28.37%	13.195 us	32.64%	0.132 us	1.01%	SAME
I64	2^24	186.034 us	7.14%	186.178 us	7.24%	0.144 us	0.08%	SAME
I64	2^28	2.939 ms	2.04%	2.938 ms	1.87%	-0.917 us	-0.03%	SAME
I128	2^16	6.898 us	44.92%	6.665 us	53.42%	-0.233 us	-3.38%	SAME
I128	2^20	23.837 us	19.45%	23.817 us	22.02%	-0.020 us	-0.09%	SAME
I128	2^24	368.747 us	5.44%	368.419 us	5.06%	-0.327 us	-0.09%	SAME
I128	2^28	5.883 ms	1.62%	5.882 ms	1.65%	-1.306 us	-0.02%	SAME
F32	2^16	5.675 us	62.22%	5.677 us	58.59%	0.001 us	0.02%	SAME
F32	2^20	9.397 us	41.73%	9.369 us	37.31%	-0.029 us	-0.30%	SAME
F32	2^24	89.833 us	9.55%	89.771 us	10.39%	-0.062 us	-0.07%	SAME
F32	2^28	1.457 ms	1.80%	1.457 ms	1.80%	0.307 us	0.02%	SAME
F64	2^16	5.896 us	52.99%	5.983 us	56.32%	0.087 us	1.48%	SAME
F64	2^20	13.332 us	27.25%	13.602 us	33.44%	0.270 us	2.02%	SAME
F64	2^24	186.167 us	6.62%	185.990 us	6.28%	-0.177 us	-0.10%	SAME
F64	2^28	2.938 ms	1.99%	2.938 ms	2.04%	-0.106 us	-0.00%	SAME

Summary

Total Matches: 28
- Pass (diff <= min_noise): 28
- Unknown (infinite noise): 0
- Failure (diff > min_noise): 0

github-actions · 2026-06-04T11:39:27Z

Backport failed for branch/3.4.x, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin branch/3.4.x
git worktree add -d .worktree/backport-9219-to-branch/3.4.x origin/branch/3.4.x
cd .worktree/backport-9219-to-branch/3.4.x
git switch --create backport-9219-to-branch/3.4.x
git cherry-pick -x 2f7cb8b5a363da93ebef37e93ff9c4eab70edce4

[libcu++] Use stream's context in PSTL

88396e9

davebayer requested a review from a team as a code owner June 2, 2026 19:41

davebayer requested a review from pciolkosz June 2, 2026 19:41

github-project-automation Bot added this to CCCL Jun 2, 2026

github-project-automation Bot moved this to Todo in CCCL Jun 2, 2026

cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 2, 2026

davebayer added the backport branch/3.4.x label Jun 2, 2026

coderabbitai Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread libcudacxx/include/cuda/std/__pstl/cuda/ensure_current_context.h

Comment thread libcudacxx/include/cuda/std/__pstl/cuda/generate_n.h

jrhemstad reviewed Jun 2, 2026

View reviewed changes

pciolkosz reviewed Jun 2, 2026

View reviewed changes

This comment has been minimized.

Sign in to view

miscco reviewed Jun 3, 2026

View reviewed changes

Comment thread libcudacxx/include/cuda/std/__pstl/cuda/ensure_current_context.h

miscco added 3 commits June 3, 2026 13:59

Address review comments

4cc5242

Actually use the right name

da926b5

Morning coffee

eb92bc9

coderabbitai Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread libcudacxx/include/cuda/std/__pstl/cuda/ensure_current_context.h Outdated

bernhardmgruber reviewed Jun 3, 2026

View reviewed changes

Comment thread libcudacxx/include/cuda/std/__pstl/cuda/ensure_current_context.h Outdated

Comment thread libcudacxx/include/cuda/std/__pstl/cuda/ensure_current_context.h Outdated

fixes

e1e1182

coderabbitai Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread libcudacxx/include/cuda/std/__pstl/cuda/temporary_storage.h

fix

2659805

miscco approved these changes Jun 4, 2026

View reviewed changes

davebayer merged commit 2f7cb8b into NVIDIA:main Jun 4, 2026
136 checks passed

github-project-automation Bot moved this from In Review to Done in CCCL Jun 4, 2026

davebayer deleted the ensure_current_context_in_pstl branch June 4, 2026 11:39

davebayer mentioned this pull request Jun 4, 2026

[Backport 3.4] Backport PSTL fixes #9256

Open

Conversation

davebayer commented Jun 2, 2026

Uh oh!

coderabbitai Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Walkthrough

Changes

Assessment against linked issues

Possibly related PRs

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

This comment has been minimized.

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

bernhardmgruber commented Jun 3, 2026

Uh oh!

github-actions Bot commented Jun 3, 2026

🥳 CI Workflow Results

🟩 Finished in 1h 17m: Pass: 100%/115 | Total: 1d 17h | Max: 57m 04s | Hits: 75%/446956

Uh oh!

davebayer commented Jun 4, 2026

base

[0] NVIDIA RTX PRO 6000 Blackwell Workstation Edition

Summary

Uh oh!

Uh oh!

github-actions Bot commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

coderabbitai Bot commented Jun 2, 2026 •

edited

Loading