Skip to content

[STF] Properly destroy CUDA streams and do not try to initialize CUDA while capturing#8919

Merged
caugonnet merged 18 commits into
NVIDIA:mainfrom
caugonnet:stream-ctx-capture-safe
May 26, 2026
Merged

[STF] Properly destroy CUDA streams and do not try to initialize CUDA while capturing#8919
caugonnet merged 18 commits into
NVIDIA:mainfrom
caugonnet:stream-ctx-capture-safe

Conversation

@caugonnet
Copy link
Copy Markdown
Contributor

@caugonnet caugonnet commented May 12, 2026

Destroy pool-owned streams with the stream pool and initialize the CUDA runtime only once so consecutive stream_ctx instances on a caller stream serialize without explicit synchronization.

Description

closes

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Destroy pool-owned streams with the stream pool and initialize the CUDA runtime only once so consecutive stream_ctx instances on a caller stream serialize without explicit synchronization.
@caugonnet caugonnet self-assigned this May 12, 2026
@caugonnet caugonnet added the stf Sequential Task Flow programming model label May 12, 2026
@copy-pr-bot
Copy link
Copy Markdown
Contributor

copy-pr-bot Bot commented May 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-project-automation github-project-automation Bot moved this to Todo in CCCL May 12, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Progress in CCCL May 12, 2026
caugonnet added 2 commits May 12, 2026 12:15
Describe the runtime initialization invariant without relying on implementation history.
Skip CUDA runtime initialization when constructing a stream_ctx from an already-capturing user stream; the stream itself implies CUDA is initialized, and normal contexts still initialize before issuing work.
@caugonnet caugonnet changed the title [STF] Chain back-to-back stream contexts [STF] Properly destroy CUDA streams and do not try to initialize CUDA while capturing May 12, 2026
@caugonnet caugonnet marked this pull request as ready for review May 12, 2026 11:38
@caugonnet caugonnet requested review from a team as code owners May 12, 2026 11:38
@caugonnet caugonnet requested review from alliepiper and andralex May 12, 2026 11:38
@cccl-authenticator-app cccl-authenticator-app Bot moved this from In Progress to In Review in CCCL May 12, 2026
@caugonnet
Copy link
Copy Markdown
Contributor Author

/ok to test 7059d3b

@github-actions

This comment has been minimized.

@caugonnet caugonnet enabled auto-merge (squash) May 15, 2026 06:33
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 15, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

important:

Walkthrough

Runtime initialization is now conditional when a user stream is in capture; stream_ctx detects capture and forwards an initialize flag to backend impl. stream_pool records external ownership and avoids destroying user streams. A new STf test exercises back-to-back stream_ctx ordering on a caller stream.

Changes

Stream Capture and Ownership Management

Layer / File(s) Summary
Backend context initialization control
cudax/include/cuda/experimental/__stf/internal/backend_ctx.cuh
Backend impl ctor gains bool initialize_cuda_runtime = true and <mutex> include; cudaFree(0) and its CHECK run only when the flag is true.
Stream context capture detection and forwarding
cudax/include/cuda/experimental/__stf/stream/stream_ctx.cuh
Add private is_capturing(cudaStream_t) helper; constructor derives initialize_cuda_runtime = !is_capturing(user_stream), updates capture validation, and forwards flag to nested impl/base.
Stream pool ownership tracking
cudax/include/cuda/experimental/__places/stream_pool.cuh
stream_pool::impl adds bool externally_owned and a destructor that only destroys lazily-created streams when the pool owns them; externally-owned pools leave user streams untouched.
Stream context lifetime and ordering test
cudax/test/stf/CMakeLists.txt, cudax/test/stf/local_stf/stream_ctx_lifetime_btb.cu
Register new STf CUDA test and implement three scenarios (no shared handle/no sync, no shared handle/explicit sync, shared async_resources_handle) with a repeat harness validating final buffer contents.

important:

Suggested labels

  • places

important:

Suggested reviewers

  • alliepiper
  • bernhardmgruber

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (3)
cudax/test/stf/local_stf/stream_ctx_lifetime_btb.cu (1)

66-66: ⚡ Quick win

suggestion: add a compile-time guard for the partitioning assumption (N % CHAIN_COUNT == 0, and optionally N >= CHAIN_COUNT) so this test cannot silently skip tail elements if constants change later.

cudax/include/cuda/experimental/__places/stream_pool.cuh (1)

171-186: 💤 Low value

suggestion: The destructor silently discards cudaStreamDestroy errors. While destructors must not throw, failures during stream cleanup (e.g., cudaErrorInvalidValue, cudaErrorCudartUnloading) may indicate resource leaks or runtime shutdown races. Consider logging the error or documenting why it's safe to ignore.

Additionally, mark the destructor noexcept to make the no-throw contract explicit.

Proposed addition
-    ~impl()
+    ~impl() noexcept
     {
       if (externally_owned)
       {
         return;
       }
 
       for (auto& ds : payload)
       {
         if (ds.stream != nullptr)
         {
-          [[maybe_unused]] cudaError_t err = cudaStreamDestroy(ds.stream);
+          cudaError_t err = cudaStreamDestroy(ds.stream);
+          // Errors during shutdown (e.g., cudaErrorCudartUnloading) are benign.
+          // Log or assert in debug builds if needed.
+          (void)err;
           ds.stream                        = nullptr;
         }
       }
     }
cudax/include/cuda/experimental/__stf/stream/stream_ctx.cuh (1)

180-186: 💤 Low value

suggestion: is_capturing does not explicitly handle user_stream == nullptr. While cudaStreamIsCapturing accepts a NULL stream and returns cudaSuccess with cudaStreamCaptureStatusNone, the intent should be documented. If a null stream is never expected here, consider adding an assertion; otherwise, document that null streams are treated as not-capturing.

Proposed clarification
   [[nodiscard]] static bool is_capturing(cudaStream_t user_stream)
   {
+    // Null stream (legacy default stream) is never capturing.
     cudaStreamCaptureStatus capture_status = cudaStreamCaptureStatusNone;
     cuda_safe_call(cudaStreamIsCapturing(user_stream, &capture_status));
     return capture_status != cudaStreamCaptureStatusNone;
   }

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0e8be004-fe22-4a85-b2d5-61d8fc3b76fb

📥 Commits

Reviewing files that changed from the base of the PR and between db93bd1 and 5f8163d.

📒 Files selected for processing (5)
  • cudax/include/cuda/experimental/__places/stream_pool.cuh
  • cudax/include/cuda/experimental/__stf/internal/backend_ctx.cuh
  • cudax/include/cuda/experimental/__stf/stream/stream_ctx.cuh
  • cudax/test/stf/CMakeLists.txt
  • cudax/test/stf/local_stf/stream_ctx_lifetime_btb.cu

Comment thread cudax/include/cuda/experimental/__stf/internal/backend_ctx.cuh
@caugonnet
Copy link
Copy Markdown
Contributor Author

/ok to test 543742a

@github-actions

This comment has been minimized.

Comment thread cudax/include/cuda/experimental/__places/stream_pool.cuh Outdated
Comment thread cudax/test/stf/local_stf/stream_ctx_lifetime_btb.cu Outdated
Comment thread cudax/test/stf/local_stf/stream_ctx_lifetime_btb.cu Outdated
Comment thread cudax/test/stf/local_stf/stream_ctx_lifetime_btb.cu Outdated
Copy link
Copy Markdown
Contributor

@andralex andralex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm % nits

caugonnet and others added 3 commits May 21, 2026 11:21
Co-authored-by: Andrei Alexandrescu <andrei@erdani.com>
Co-authored-by: Andrei Alexandrescu <andrei@erdani.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 809dabdd-1172-4444-833d-ec721f461569

📥 Commits

Reviewing files that changed from the base of the PR and between 5f8163d and a6b3a10.

📒 Files selected for processing (4)
  • cudax/include/cuda/experimental/__stf/internal/backend_ctx.cuh
  • cudax/include/cuda/experimental/__stf/stream/stream_ctx.cuh
  • cudax/test/stf/CMakeLists.txt
  • cudax/test/stf/local_stf/stream_ctx_lifetime_btb.cu

Comment thread cudax/test/stf/local_stf/stream_ctx_lifetime_btb.cu Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cudax/include/cuda/experimental/__places/stream_pool.cuh (1)

167-182: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

important: Destructor must be marked noexcept.

The destructor ignores CUDA errors and doesn't throw, so per coding guidelines it must use noexcept.

Proposed fix
-    ~impl()
+    ~impl() noexcept
     {

As per coding guidelines: All functions that don't throw exceptions must use noexcept.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 01f4dc4e-26ea-4e5f-903f-ace5e4f0d3fd

📥 Commits

Reviewing files that changed from the base of the PR and between a6b3a10 and 9144eb2.

📒 Files selected for processing (1)
  • cudax/include/cuda/experimental/__places/stream_pool.cuh

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cudax/include/cuda/experimental/__places/stream_pool.cuh (1)

167-178: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

important: Align destructor declaration and CUDA free-function call with cudax header rules. At Line 167, ~impl() noexcept is missing a _CCCL_* API annotation, and at Line 178, cudaStreamDestroy should be called as ::cudaStreamDestroy(...) to satisfy the global qualification rule.

As per coding guidelines: "All functions must be marked with _CCCL_HOST_API, _CCCL_DEVICE_API, or _CCCL_API" and "All calls to free functions must be fully qualified from the global namespace."


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8afb961b-1b67-42f1-af61-908fd897af82

📥 Commits

Reviewing files that changed from the base of the PR and between bab1068 and 4311afe.

📒 Files selected for processing (1)
  • cudax/include/cuda/experimental/__places/stream_pool.cuh

@caugonnet
Copy link
Copy Markdown
Contributor Author

/ok to test ae25afe

@github-actions

This comment has been minimized.

@caugonnet
Copy link
Copy Markdown
Contributor Author

/ok to test 9de92be

@caugonnet
Copy link
Copy Markdown
Contributor Author

Need to be investigated

  The following tests FAILED:
  	208 - cudax.test.stf.threads.axpy-threads-graph (Subprocess aborted)

@github-actions
Copy link
Copy Markdown
Contributor

🥳 CI Workflow Results

🟩 Finished in 33m 32s: Pass: 100%/55 | Total: 4h 50m | Max: 33m 32s | Hits: 100%/32746

See results here.

@caugonnet caugonnet merged commit 9722f26 into NVIDIA:main May 26, 2026
75 of 76 checks passed
@github-project-automation github-project-automation Bot moved this from In Review to Done in CCCL May 26, 2026
@andralex andralex deleted the stream-ctx-capture-safe branch May 26, 2026 22:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

stf Sequential Task Flow programming model

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

3 participants