[Backport 3.4] Backport PSTL fixes #9256

Open
davebayer wants to merge 4 commits into NVIDIA:branch/3.4.x from davebayer:backport_pstl_fixes

davebayer and others added 4 commits June 4, 2026 13:45
* [libcu++] Use stream's context in PSTL

* Address review comments

* Actually use the right name

* Morning coffee

* fixes

* fix

---------

Co-authored-by: Michael Schellenberger Costa <miscco@nvidia.com>
(cherry picked from commit 2f7cb8b)

coderabbitai Bot commented Jun 4, 2026

📝 Walkthrough

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Fixed CUDA execution context handling in parallel algorithms so that the correct device context is consistently established for kernel operations.
    • Improved CUDA stream management and selection reliability across multiple PSTL implementations.
  • Tests

    • Updated CUDA execution policy tests to reflect improved stream handling behavior.
  • Chores

    • Added compiler compatibility flag for NVCC 12.0 with GCC host compilers.

Walkthrough

This PR introduces the __pstl_ensure_current_ctx_for utility, which enforces the correct CUDA execution context in PSTL operations. It updates 20+ CUDA algorithm backends to acquire the stream and establish the context early, eliminates duplicate stream re-fetching, replaces cudaStreamPerThread defaults with cudaStream_t{}, queries the current device instead of hardcoding device 0, and refactors exception handling.
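The early-initialization pattern described above can be pictured with a small host-only sketch. Everything here is a stand-in: `fake_rt`, `stream_ref`, `policy`, `ensure_current_device`, and `run_algorithm` are hypothetical names invented for illustration, and a plain integer "current device" replaces the real CUDA context. It is not the code from ensure_current_context.h; it only shows the RAII shape (pick the stream's device if the policy carries a stream, otherwise the current device, and restore the previous state on scope exit).

```cpp
#include <cassert>

// Stand-ins for the CUDA runtime's per-thread "current device" state.
// Hypothetical: real code would go through the CUDA runtime or driver API.
namespace fake_rt {
inline int& current_device() {
  thread_local int device = 0;
  return device;
}
inline int get_device() { return current_device(); }
inline void set_device(int device) { current_device() = device; }
} // namespace fake_rt

// Hypothetical stream handle that remembers the device it belongs to.
struct stream_ref { int device; };

// Hypothetical execution policy: the stream pointer is null when none was attached.
struct policy { const stream_ref* stream; };

// RAII guard: makes `target` current on construction and restores the
// previously current device on destruction.
class ensure_current_device {
public:
  explicit ensure_current_device(int target) : previous_(fake_rt::get_device()) {
    fake_rt::set_device(target);
  }
  ~ensure_current_device() { fake_rt::set_device(previous_); }
  ensure_current_device(const ensure_current_device&) = delete;
  ensure_current_device& operator=(const ensure_current_device&) = delete;

private:
  int previous_;
};

// Use the stream's device if the policy carries a stream, else the current device.
inline int target_device_for(const policy& p) {
  return p.stream != nullptr ? p.stream->device : fake_rt::get_device();
}

// Shape of an algorithm body: resolve the stream once, set up the guard
// first, then do the work. Returns the device the "work" observed.
inline int run_algorithm(const policy& p) {
  const stream_ref* stream = p.stream; // acquired once, reused below
  ensure_current_device guard(target_device_for(p));
  (void) stream; // a real backend would launch on this stream
  return fake_rt::get_device();
}

inline void demo() {
  fake_rt::set_device(0);
  const stream_ref on_device_2{2};
  assert(run_algorithm(policy{&on_device_2}) == 2); // work ran with device 2 current
  assert(fake_rt::get_device() == 0);               // previous device restored
  assert(run_algorithm(policy{nullptr}) == 0);      // no stream: current device is used
}
```

Constructing the guard before any other work is what keeps later steps (temporary storage, launches) from seeing a different current device than the one the stream belongs to.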

Changes

CUDA Context and Stream Management

| Layer / File(s) | Summary |
|---|---|
| New ensure_current_context utility and header: libcudacxx/include/cuda/std/__pstl/cuda/ensure_current_context.h | Introduces __pstl_ensure_current_ctx_for template that returns an RAII __ensure_current_context object; selects stream-derived context if policy provides one via get_stream, otherwise queries current device via cudaGetDevice and wraps in device reference. |
| Temporary storage device and memory pool management: libcudacxx/include/cuda/std/__pstl/cuda/temporary_storage.h | Queries current device dynamically instead of using hardcoded device 0 for default memory pool; removes noexcept from __get_memory_resource_or; changes default stream from cudaStreamPerThread to cudaStream_t{}. |
| PSTL algorithm implementations, stream and context setup: libcudacxx/include/cuda/std/__pstl/cuda/{adjacent_difference,copy_if,copy_n,exclusive_scan,find_if,for_each_n,generate_n,inclusive_scan,max_element,merge,min_element,partition,partition_copy,reduce,remove_if,rotate,rotate_copy,shift_left,shift_right,transform,transform_reduce,unique,stable_partition}.h | Systematically includes ensure_current_context.h, moves stream acquisition early in __par_impl, calls __pstl_ensure_current_ctx_for(__policy) to establish context, and eliminates redundant later stream re-acquisition; all backends now follow consistent early-initialization pattern. |
| Exception handling refactoring: libcudacxx/include/cuda/std/__pstl/cuda/{sort,copy_if}.h | sort.h replaces C++ try/catch with _CCCL_TRY/_CCCL_CATCH macros, mapping cudaErrorMemoryAllocation to std::bad_alloc and rethrowing other errors; copy_if.h uses _CCCL_RETHROW instead of raw throw for non-allocation CUDA errors. |
| Test and build config updates: libcudacxx/test/libcudacxx/cuda/execution/execution_policy/{get_stream,get_memory_resource}.pass.cpp, libcudacxx/test/utils/libcudacxx/test/config.py | Execution policy tests change baseline stream from cudaStreamPerThread to cudaStream_t{}; build config adds nvcc 12.0 + gcc warning suppression via -Xcompiler -Wno-attributes. |
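The exception-handling row can be illustrated with a host-only sketch of the mapping it describes: an allocation failure becomes std::bad_alloc, and any other error is rethrown unchanged. The names `fake_error`, `fake_cuda_error`, `translate_allocation_failure`, and `outcome_of` are hypothetical stand-ins, and the sketch uses plain try/catch rather than the _CCCL_TRY/_CCCL_CATCH/_CCCL_RETHROW macros, whose definitions are not shown in this PR summary.

```cpp
#include <cassert>
#include <new>
#include <stdexcept>

// Hypothetical stand-in for a CUDA error code; only the distinction between
// "allocation failure" and "anything else" matters for this sketch.
enum class fake_error { success, invalid_value, memory_allocation };

// Hypothetical exception type carrying the error code.
struct fake_cuda_error : std::runtime_error {
  fake_error code;
  explicit fake_cuda_error(fake_error c)
      : std::runtime_error("fake CUDA error"), code(c) {}
};

// Runs `work`; an allocation failure is translated to std::bad_alloc,
// every other error propagates unchanged.
template <class Work>
auto translate_allocation_failure(Work&& work) -> decltype(work()) {
  try {
    return work();
  } catch (const fake_cuda_error& e) {
    if (e.code == fake_error::memory_allocation) {
      throw std::bad_alloc();
    }
    throw; // rethrow the original exception object
  }
}

// Usage helper: returns 0 if nothing was thrown, 1 for std::bad_alloc,
// 2 if the original fake_cuda_error escaped.
inline int outcome_of(fake_error injected) {
  try {
    translate_allocation_failure([injected]() -> int {
      if (injected != fake_error::success) {
        throw fake_cuda_error(injected);
      }
      return 0;
    });
    return 0;
  } catch (const std::bad_alloc&) {
    return 1;
  } catch (const fake_cuda_error&) {
    return 2;
  }
}

inline void demo() {
  assert(outcome_of(fake_error::success) == 0);
  assert(outcome_of(fake_error::memory_allocation) == 1);
  assert(outcome_of(fake_error::invalid_value) == 2);
}
```

Using a bare `throw;` in the fallback branch preserves the original exception's dynamic type, which is the behavior the summary attributes to the rethrow path.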

Possibly related PRs

  • NVIDIA/cccl#9220: Exception handling macro standardization in PSTL backends.
  • NVIDIA/cccl#9214: Stream selection changes from cudaStreamPerThread to cudaStream_t{} across PSTL CUDA backends.
  • NVIDIA/cccl#9219: Use stream's context in PSTL algorithm implementations with matching ensure_current_context integration.

Suggested labels

backport branch/3.4.x

Suggested reviewers

  • fbusato
  • pciolkosz

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Infer (1.2.0)
libcudacxx/test/libcudacxx/cuda/execution/execution_policy/get_memory_resource.pass.cpp

libcudacxx/test/libcudacxx/cuda/execution/execution_policy/get_memory_resource.pass.cpp:17:10: fatal error: 'cuda/functional' file not found
17 | #include <cuda/functional>
| ^~~~~~~~~~~~~~~~~
1 error generated.
Error: the following clang command did not run successfully:
/opt/infer-linux-x86_64-v1.2.0/lib/infer/facebook-clang-plugins/clang/install/bin/clang-18
@/tmp/coderabbit-infer/ab8cc8da130a2faab6ee636c996f62d9621bc16b-592ecba06aa173d3/tmp/clang_command_.tmp.3b1ee5.txt
++Contents of '/tmp/coderabbit-infer/ab8cc8da130a2faab6ee636c996f62d9621bc16b-592ecba06aa173d3/tmp/clang_command_.tmp.3b1ee5.txt':
"-cc1" "-load"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../../facebook-clang-plugins/libtooling/build/FacebookClangPlugin.dylib"
"-add-plugin" "BiniouASTExporter" "-plugin-arg-BiniouASTExporter" "-"
"-plugin-arg-BiniouASTExporter" "PREPEND_CURRENT_DIR=1"
"-plugin-arg-BiniouASTExporter" "MAX_STRING_SIZE=65535" "-cc1" "-triple"

... [truncated 1214 characters] ...

l/include" "-internal-isystem"
"/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/include"
"-internal-externc-isystem" "/usr/include/x86_64-linux-gnu"
"-internal-externc-isystem" "/include" "-internal-externc-isystem"
"/usr/include" "-Wno-ignored-optimization-argument" "-Wno-everything"
"-fdeprecated-macro" "-ferror-limit" "19" "-fgnuc-version=4.2.1"
"-fskip-odr-check-in-gmf" "-fcxx-exceptions" "-fexceptions"
"-D__GCC_HAVE_DWARF2_CFI_ASM=1" "-o"
"/tmp/coderabbit-infer/592ecba06aa173d3/file.o" "-x" "c++"
"libcudacxx/test/libcudacxx/cuda/execution/execution_policy/get_memory_resource.pass.cpp"
"-O0" "-fno-builtin" "-include"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../lib/clang_wrappers/global_defines.h"
"-Wno-everything"

libcudacxx/test/libcudacxx/cuda/execution/execution_policy/get_stream.pass.cpp

libcudacxx/test/libcudacxx/cuda/execution/execution_policy/get_stream.pass.cpp:17:10: fatal error: 'cuda/functional' file not found
17 | #include <cuda/functional>
| ^~~~~~~~~~~~~~~~~
1 error generated.
Error: the following clang command did not run successfully:
/opt/infer-linux-x86_64-v1.2.0/lib/infer/facebook-clang-plugins/clang/install/bin/clang-18
@/tmp/coderabbit-infer/ab8cc8da130a2faab6ee636c996f62d9621bc16b-dcd2ecffd9fe1481/tmp/clang_command_.tmp.f26b08.txt
++Contents of '/tmp/coderabbit-infer/ab8cc8da130a2faab6ee636c996f62d9621bc16b-dcd2ecffd9fe1481/tmp/clang_command_.tmp.f26b08.txt':
"-cc1" "-load"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../../facebook-clang-plugins/libtooling/build/FacebookClangPlugin.dylib"
"-add-plugin" "BiniouASTExporter" "-plugin-arg-BiniouASTExporter" "-"
"-plugin-arg-BiniouASTExporter" "PREPEND_CURRENT_DIR=1"
"-plugin-arg-BiniouASTExporter" "MAX_STRING_SIZE=65535" "-cc1" "-triple"
"x86_64-

... [truncated 1187 characters] ...

/usr/local/include" "-internal-isystem"
"/usr/lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/include"
"-internal-externc-isystem" "/usr/include/x86_64-linux-gnu"
"-internal-externc-isystem" "/include" "-internal-externc-isystem"
"/usr/include" "-Wno-ignored-optimization-argument" "-Wno-everything"
"-fdeprecated-macro" "-ferror-limit" "19" "-fgnuc-version=4.2.1"
"-fskip-odr-check-in-gmf" "-fcxx-exceptions" "-fexceptions"
"-D__GCC_HAVE_DWARF2_CFI_ASM=1" "-o"
"/tmp/coderabbit-infer/dcd2ecffd9fe1481/file.o" "-x" "c++"
"libcudacxx/test/libcudacxx/cuda/execution/execution_policy/get_stream.pass.cpp"
"-O0" "-fno-builtin" "-include"
"/opt/infer-linux-x86_64-v1.2.0/lib/infer/infer/bin/../lib/clang_wrappers/global_defines.h"
"-Wno-everything"



🧹 Nitpick comments (1)
libcudacxx/include/cuda/std/__pstl/cuda/ensure_current_context.h (1)

39-41: ⚡ Quick win

suggestion: Fully qualify get_stream_t and get_stream from the global namespace.

Line 39 and Line 41 rely on unqualified lookup inside cuda::std::execution; switch to ::cuda::get_stream_t and ::cuda::get_stream to match project rules and avoid accidental shadowing.
As per coding guidelines, "All calls to free functions must be fully qualified starting from the global namespace, e.g., ::cuda::ceil_div."
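A minimal illustration of why the guideline asks for global-namespace qualification, using hypothetical namespaces `lib` and `lib::detail` rather than the actual cuda namespaces: an unqualified call resolves to the nearest enclosing declaration, so a same-named function in an inner namespace silently wins.

```cpp
#include <cassert>

namespace lib {
inline int get_value() { return 1; } // the function the author intends to call

namespace detail {
inline int get_value() { return 2; } // nearer declaration hides lib::get_value

// Unqualified lookup stops at the first scope that declares the name.
inline int unqualified() { return get_value(); }

// Qualification from the global namespace is immune to shadowing.
inline int qualified() { return ::lib::get_value(); }
} // namespace detail
} // namespace lib

inline void demo() {
  assert(lib::detail::unqualified() == 2); // picked the shadowing overload
  assert(lib::detail::qualified() == 1);   // picked the intended function
}
```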


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 94b27f12-f257-407e-a313-1fd766494a86

📥 Commits

Reviewing files that changed from the base of the PR and between 576e227 and ab8cc8d.

📒 Files selected for processing (29)
  • libcudacxx/include/cuda/std/__pstl/cuda/adjacent_difference.h
  • libcudacxx/include/cuda/std/__pstl/cuda/copy_if.h
  • libcudacxx/include/cuda/std/__pstl/cuda/copy_n.h
  • libcudacxx/include/cuda/std/__pstl/cuda/ensure_current_context.h
  • libcudacxx/include/cuda/std/__pstl/cuda/exclusive_scan.h
  • libcudacxx/include/cuda/std/__pstl/cuda/find_if.h
  • libcudacxx/include/cuda/std/__pstl/cuda/for_each_n.h
  • libcudacxx/include/cuda/std/__pstl/cuda/generate_n.h
  • libcudacxx/include/cuda/std/__pstl/cuda/inclusive_scan.h
  • libcudacxx/include/cuda/std/__pstl/cuda/max_element.h
  • libcudacxx/include/cuda/std/__pstl/cuda/merge.h
  • libcudacxx/include/cuda/std/__pstl/cuda/min_element.h
  • libcudacxx/include/cuda/std/__pstl/cuda/partition.h
  • libcudacxx/include/cuda/std/__pstl/cuda/partition_copy.h
  • libcudacxx/include/cuda/std/__pstl/cuda/reduce.h
  • libcudacxx/include/cuda/std/__pstl/cuda/remove_if.h
  • libcudacxx/include/cuda/std/__pstl/cuda/rotate.h
  • libcudacxx/include/cuda/std/__pstl/cuda/rotate_copy.h
  • libcudacxx/include/cuda/std/__pstl/cuda/shift_left.h
  • libcudacxx/include/cuda/std/__pstl/cuda/shift_right.h
  • libcudacxx/include/cuda/std/__pstl/cuda/sort.h
  • libcudacxx/include/cuda/std/__pstl/cuda/stable_partition.h
  • libcudacxx/include/cuda/std/__pstl/cuda/temporary_storage.h
  • libcudacxx/include/cuda/std/__pstl/cuda/transform.h
  • libcudacxx/include/cuda/std/__pstl/cuda/transform_reduce.h
  • libcudacxx/include/cuda/std/__pstl/cuda/unique.h
  • libcudacxx/test/libcudacxx/cuda/execution/execution_policy/get_memory_resource.pass.cpp
  • libcudacxx/test/libcudacxx/cuda/execution/execution_policy/get_stream.pass.cpp
  • libcudacxx/test/utils/libcudacxx/test/config.py


github-actions Bot commented Jun 4, 2026

🥳 CI Workflow Results

🟩 Finished in 1h 28m: Pass: 100%/113 | Total: 2d 02h | Max: 1h 04m | Hits: 75%/439700

See results here.

Projects

Status: In Review

2 participants