[STF] Add re-launchable popped graphs to stackable_ctx by caugonnet · Pull Request #9178 · NVIDIA/cccl

caugonnet · 2026-05-29T08:29:29Z

Splits graph_ctx_node finalization into phases so a popped nested graph can be instantiated once and launched many times before the matching epilogue runs. Adds three public surfaces on stackable_ctx:

* pop_prologue() / pop_epilogue() returning a launchable_graph_handle that exposes exec(), stream(), graph(), and launch();
* launchable_graph_scope, an RAII guard that pairs push() with a lazy pop_prologue() and runs pop_epilogue() in its destructor;
* pop_prologue_shared() returning a copyable/storable launchable_graph whose destructor runs pop_epilogue() when the last copy dies.
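
The "last copy runs the epilogue" behaviour of the shared variant can be modelled without CUDA. The following is a minimal standalone sketch, not the CCCL implementation: `counting_ctx` and `shared_launchable` are hypothetical names, and using `std::shared_ptr` with a custom deleter is an assumption about one way to get this behaviour.

```cpp
#include <memory>
#include <vector>

// Hypothetical stand-in for stackable_ctx; it only counts calls.
struct counting_ctx
{
  int launches  = 0;
  int epilogues = 0;

  void pop_epilogue() { ++epilogues; }
};

// Copyable handle model: all copies share one control block, and the custom
// deleter runs the epilogue exactly once, when the last copy is destroyed.
// The context itself is not owned, so the deleter never deletes it.
class shared_launchable
{
public:
  explicit shared_launchable(counting_ctx& ctx)
      : state_(&ctx, [](counting_ctx* c) { c->pop_epilogue(); })
  {}

  void launch() const { ++state_->launches; }

private:
  std::shared_ptr<counting_ctx> state_;
};
```

Copies can be stored in a container and launched independently; in this model the epilogue count stays at zero until every copy has gone out of scope.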

The non-nested finalize path now flows through prepare_graph -> ensure_instantiated -> launch_once -> finalize_after_launch; the existing nested-graph behavior is preserved verbatim in finalize_nested(). push() / pop() guard against being called while a pop_prologue is still pending its matching pop_epilogue.
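
The phase split and the push() guard described above can be sketched as a plain C++ model. The phase names come from the description, but the bodies are placeholders, `phase_node` and `phase_ctx` are hypothetical, and the guard throws here only so that the sketch is testable (the PR describes the real guards as aborting).

```cpp
#include <stdexcept>

// Hypothetical model of a popped graph node with phased finalization.
struct phase_node
{
  bool prepared          = false;
  bool instantiated      = false;
  bool finalized         = false;
  int  instantiate_count = 0;
  int  launch_count      = 0;

  void prepare_graph() { prepared = true; }

  // Instantiation is lazy and happens at most once.
  void ensure_instantiated()
  {
    if (!instantiated)
    {
      ++instantiate_count;
      instantiated = true;
    }
  }

  // Each launch reuses the already instantiated graph.
  void launch_once()
  {
    ensure_instantiated();
    ++launch_count;
  }

  void finalize_after_launch() { finalized = true; }
};

// Hypothetical context model: tracks whether an epilogue is still pending.
struct phase_ctx
{
  phase_node node;
  bool pending_epilogue = false;

  void push()
  {
    if (pending_epilogue)
    {
      throw std::logic_error("push() while pop_epilogue() is pending");
    }
  }

  phase_node& pop_prologue()
  {
    node.prepare_graph();
    pending_epilogue = true;
    return node;
  }

  void pop_epilogue()
  {
    node.finalize_after_launch();
    pending_epilogue = false;
  }
};
```

Launching three times between `pop_prologue()` and `pop_epilogue()` leaves `instantiate_count` at 1 in this model, which is the property the re-launchable API is built around.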

Coverage lives in the stackable_ctx.cuh inline UNITTESTs: repeated launch, manual cudaGraphLaunch via exec()/stream(), zero-launch, handle invalidation, RAII scope, shared basic/copies/container/manual epilogue, and a CTK-12.4 pop_prologue + repeat_graph_scope test.

Description

closes

Checklist

New or existing tests cover these changes.
The documentation is up to date with these changes.


copy-pr-bot · 2026-05-29T08:29:33Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

caugonnet · 2026-05-29T08:30:12Z

/ok to test 43486c3

caugonnet · 2026-05-31T08:16:50Z

/ok to test edbe4ad

coderabbitai · 2026-05-31T08:17:40Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.


Walkthrough

This PR adds a two-phase re-launchable pop workflow: pop_prologue() returns launchable handles for deferred/lazy execution, and pop_epilogue() finalizes and invalidates outstanding handles. Graph finalization is split into prepare/instantiate/sync/launch/finalize phases, with push/pop sequencing guards and shared/copyable RAII handle wrappers.

Changes

Re-launchable graph pop API and implementation

| Layer / File(s) | Summary |
| --- | --- |
| Graph finalization refactor `cudax/include/cuda/experimental/__stf/stackable/stackable_ctx_impl.cuh` | Splits `graph_ctx_node::finalize()` into phased helpers (`prepare_graph`, `ensure_instantiated`, `ensure_prereqs_synced`, `launch_once`, `finalize_after_launch`) and extracts `finalize_nested()`; adds per-node flags and cached `exec_graph_` state used by deferred launches. |
| Two-phase pop machinery `cudax/include/cuda/experimental/__stf/stackable/stackable_ctx_impl.cuh` | Implements `pop_prologue_impl()`/`pop_epilogue_impl()`, `pop_prologue_result`, and helpers (`launch_prepared_graph`, `prepare_handle_for_exec`, `prepare_handle_for_graph`) enabling lazy instantiation and strict sequencing for launchable handles. |
| Pending-epilogue verification helpers `cudax/include/cuda/experimental/__stf/stackable/stackable_ctx_impl.cuh` | Adds helpers that validate a handle's pending token and node offset before accessing the prepared node. |
| Pending-epilogue state `cudax/include/cuda/experimental/__stf/stackable/stackable_ctx_impl.cuh` | Adds `pending_epilogue_token_` and `pending_epilogue_node_offset_` to atomically track and invalidate outstanding launch handles when `pop_epilogue()` runs. |
| Push/pop sequencing guards `cudax/include/cuda/experimental/__stf/stackable/stackable_ctx_impl.cuh` | Adds guards that abort when `push()` is called while an epilogue is pending and when `pop()` is invoked during the prologue/epilogue window. |
| Public API and handle declarations `cudax/include/cuda/experimental/__stf/stackable/stackable_ctx.cuh`, `.../stackable_ctx_impl.cuh` | Adds forward decl for `launchable_graph_handle`, public `stackable_ctx::pop_prologue()` (returns `launchable_graph_handle`) and `stackable_ctx::pop_epilogue()` declarations, and friend/forwarders so handles drive prepared-graph execution and lazy prereq syncing. |
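
The token-based invalidation summarized in the table can be illustrated with a small single-threaded model. This is a sketch under stated assumptions: `token_ctx` and `token_handle` are hypothetical, the real `pending_epilogue_token_` logic is not shown on this page, and the node-offset check is left out.

```cpp
#include <cstdint>

// Hypothetical context model: one pending token at a time, 0 means "none".
struct token_ctx
{
  std::uint64_t next_token    = 1;
  std::uint64_t pending_token = 0;

  // Hands out a fresh token and marks an epilogue as pending.
  std::uint64_t pop_prologue()
  {
    pending_token = next_token++;
    return pending_token;
  }

  // Clearing the pending token invalidates every outstanding handle.
  void pop_epilogue() { pending_token = 0; }

  bool is_live(std::uint64_t token) const
  {
    return token != 0 && token == pending_token;
  }
};

// Hypothetical handle model: valid only while its token is the pending one.
struct token_handle
{
  const token_ctx* ctx = nullptr;
  std::uint64_t token  = 0;

  explicit operator bool() const { return ctx != nullptr && ctx->is_live(token); }
};
```

Because tokens are never reused in this model, a stale handle from an earlier prologue stays invalid even after a later `pop_prologue()` makes a new handle live.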

Suggested reviewers:

andralex
oleksandr-pavlyk
NaderAlAwar

Comment `@coderabbitai help` to get the list of available commands and usage tips.

coderabbitai

🧹 Nitpick comments (1)

cudax/include/cuda/experimental/__stf/stackable/stackable_ctx_impl.cuh (1)

801-818: 💤 Low value

suggestion: launched_ is set on line 689 but never read. Remove or use it.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5cb2f811-0bb5-4423-b7f6-236b7bd3fc9d

📥 Commits

Reviewing files that changed from the base of the PR and between fb8629d and edbe4ad.

📒 Files selected for processing (2)

cudax/include/cuda/experimental/__stf/stackable/stackable_ctx.cuh
cudax/include/cuda/experimental/__stf/stackable/stackable_ctx_impl.cuh

Dep-A ordering is already tracked by synced_; launched_ was set in launch_once() but never read.

caugonnet · 2026-05-31T19:52:11Z

/ok to test 4b21ca3

Address review follow-ups on the re-launchable popped graphs:

* Fix docs that claimed pop_prologue() eagerly instantiates the cudaGraphExec_t. Instantiation is lazy (first exec()/launch()); graph() consumers never instantiate. Drop the stale prepare_launch() references.
* Route launchable_graph_handle through thin private stackable_ctx wrappers (launch_prepared_graph / prepare_handle_for_exec / prepare_handle_for_graph) instead of reaching into pimpl directly, mirroring the pop_epilogue() surface.
* Replace the ad-hoc validate_/check_ helpers and the impl-side fprintf+abort misuse guards with _CCCL_VERIFY, which stays enabled in release builds (unlike _CCCL_ASSERT). Genuine internal invariants remain _CCCL_ASSERT.
* Add a unit test that embeds handle.graph() as a child graph node via cudaGraphAddChildGraphNode, orders dep-A through an event on handle.stream(), and documents the pop_epilogue() ordering caveat.
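
The distinction between an always-on misuse check and a debug-only invariant can be sketched in portable C++. `MODEL_VERIFY` and `guard_model` below are hypothetical; the actual definitions of `_CCCL_VERIFY` and `_CCCL_ASSERT` are not shown on this page, and the model throws instead of aborting so it can be exercised in a test.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical always-on check: active regardless of NDEBUG.
#define MODEL_VERIFY(cond, msg)    \
  do                               \
  {                                \
    if (!(cond))                   \
    {                              \
      throw std::logic_error(msg); \
    }                              \
  } while (false)

// Hypothetical model of a context that guards pop() against misuse.
struct guard_model
{
  int  depth            = 1;
  bool epilogue_pending = false;

  void pop()
  {
    // User-facing misuse: checked in release builds too.
    MODEL_VERIFY(!epilogue_pending, "pop() called while pop_epilogue() is pending");
    // Internal invariant: assert() compiles to nothing when NDEBUG is defined.
    assert(depth > 0);
    --depth;
  }
};
```

The design point is that API misuse by callers should fail loudly in every build type, while checks on internal bookkeeping can be compiled out once the code is trusted.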

andralex · 2026-06-02T21:07:01Z

+  auto r = pimpl->pop_prologue_impl();
+
+  launchable_graph_handle h;
+  h.token_          = r.token;


Suggested change

-  h.token_ = r.token;
+  launchable_graph_handle h = {};

so we initialize whatever fields we might add in the future. No efficiency impact, compiler will eliminate the dead assignments.
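
To make the point about `= {}` concrete, here is a minimal sketch with a hypothetical aggregate. `handle_fields` and its members are illustrative only; apart from `token_`, the real fields of `launchable_graph_handle` are not shown on this page.

```cpp
// Hypothetical aggregate with no default member initializers.
struct handle_fields
{
  void*         ctx;
  unsigned long token;
  int           offset;
};

handle_fields make_handle(unsigned long token)
{
  // Empty-brace initialization zeroes every member, including any member
  // added later, before the explicit assignments run.
  handle_fields h = {};
  h.token         = token;
  return h;
}
```

With a plain `handle_fields h;` the members that are never assigned would be left indeterminate, which is the hazard the suggestion guards against.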

andralex · 2026-06-02T21:08:18Z

+class stackable_ctx::launchable_graph_scope
+{
+public:
+  using context_type = stackable_ctx;


there's no usefulness to this

andralex · 2026-06-02T21:15:23Z

+  using context_type = stackable_ctx;
+
+  explicit launchable_graph_scope(stackable_ctx& ctx,
+                                  const ::cuda::std::source_location& loc = ::cuda::std::source_location::current())


Suggested change

-  const ::cuda::std::source_location& loc = ::cuda::std::source_location::current())
+  ::cuda::std::source_location loc = ::cuda::std::source_location::current())

it's cheap

andralex · 2026-06-02T21:17:38Z

+  {
+    if (released_)
+    {
+      return;
+    }
+    released_ = true;
+
+    if (!prepared_)
+    {
+      // No one ever called launch()/exec()/stream()/graph(): we still ran push()
+      // in the constructor, so we must match it with a prologue+epilogue
+      // pair to tear the node down cleanly. finalize_after_launch handles
+      // the no-launch case correctly.
+      handle_   = ctx_.pop_prologue();
+      prepared_ = true;
+    }
+    ctx_.pop_epilogue();


something unclear here... if it wasn't prepared and we're to release it, why prepare it? Also released_ needs to be set at the very end for exception safety.

Suggested change

   {
     if (released_)
     {
       return;
     }
-    released_ = true;

     if (!prepared_)
     {
       // No one ever called launch()/exec()/stream()/graph(): we still ran push()
       // in the constructor, so we must match it with a prologue+epilogue
       // pair to tear the node down cleanly. finalize_after_launch handles
       // the no-launch case correctly.
       handle_   = ctx_.pop_prologue();
       prepared_ = true;
     }
     ctx_.pop_epilogue();
+    released_ = true;
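
The exception-safety argument for setting the flag last can be shown with a small model. This is a sketch, not the PR's code: `flaky_ctx` and `scope_model` are hypothetical, and the epilogue is made to throw artificially.

```cpp
#include <stdexcept>

// Hypothetical context whose epilogue fails a configurable number of times.
struct flaky_ctx
{
  int failures_left = 0;
  int epilogues     = 0;

  void pop_epilogue()
  {
    if (failures_left > 0)
    {
      --failures_left;
      throw std::runtime_error("epilogue failed");
    }
    ++epilogues;
  }
};

// Scope model: the released flag is set only after the epilogue succeeded.
class scope_model
{
public:
  explicit scope_model(flaky_ctx& ctx)
      : ctx_(ctx)
  {}

  void release()
  {
    if (released_)
    {
      return;
    }
    ctx_.pop_epilogue(); // may throw; released_ stays false in that case
    released_ = true;
  }

  bool released() const { return released_; }

private:
  flaky_ctx& ctx_;
  bool released_ = false;
};
```

If the flag were set before the call, a failed epilogue would leave the scope marked as released and a retry would silently do nothing.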

andralex · 2026-06-02T21:18:50Z

+  bool prepared_ = false;
+  bool released_ = false;


Maybe enum class state = { empty, prepared, released };?

Or I suspect "released" is the same as "empty" so really there's only two states?

andralex · 2026-06-02T21:22:20Z

+  void ensure_prepared_()
+  {
+    if (!prepared_)
+    {
+      handle_   = ctx_.pop_prologue();
+      prepared_ = true;
+    }
+  }


I suspect the "prepared" state is entirely encoded by handle_ being non-null, which simplifies things. E.g., here:

Suggested change

   void ensure_prepared_()
   {
-    if (!prepared_)
+    if (!handle_)
     {
       handle_   = ctx_.pop_prologue();
-      prepared_ = true;
     }
   }

Move import/adoption logic into the stackable_logical_data shell, drop the nested impl layer, and simplify launchable_graph_scope handle validity.

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

cudax/include/cuda/experimental/__stf/stackable/stackable_ctx.cuh (1)

1046-1070: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

important: Don't use handle_ validity as the prepared-state bit. ensure_prepared_() keys off if (!handle_), but launchable_graph_handle::operator bool() flips to false after a manual ctx.pop_epilogue(). In that sequence, release()/the destructor will call pop_prologue() again on a scope whose pop was already epilogued. Keep a separate prepared_ flag, or at least short-circuit the destructor path when the handle has already been invalidated, and add a regression test for that manual-epilogue case.
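
The manual-epilogue sequence described in this comment can be reproduced in a small model. This is one possible shape of the fix, not necessarily what the PR merged: `manual_ctx` and `scope_with_flag` are hypothetical, and the `pending` boolean stands in for the real token check.

```cpp
// Hypothetical context that counts prologue/epilogue calls.
struct manual_ctx
{
  int  prologues = 0;
  int  epilogues = 0;
  bool pending   = false;

  void pop_prologue()
  {
    ++prologues;
    pending = true;
  }

  void pop_epilogue()
  {
    ++epilogues;
    pending = false;
  }
};

// Scope model that tracks "prepared" separately from handle validity.
class scope_with_flag
{
public:
  explicit scope_with_flag(manual_ctx& ctx)
      : ctx_(ctx)
  {}

  void ensure_prepared()
  {
    if (!prepared_)
    {
      ctx_.pop_prologue();
      prepared_ = true;
    }
  }

  void release()
  {
    if (released_)
    {
      return;
    }
    ensure_prepared(); // no second prologue: prepared_ survives a manual epilogue
    if (ctx_.pending)
    {
      ctx_.pop_epilogue(); // skipped when the user already ran it by hand
    }
    released_ = true;
  }

private:
  manual_ctx& ctx_;
  bool prepared_ = false;
  bool released_ = false;
};
```

With the prepared state keyed off handle validity instead, the manual `pop_epilogue()` would make the scope look unprepared and `release()` would run a second prologue.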

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 118cb460-b7ca-4dfe-b7bf-7771128d9211

📥 Commits

Reviewing files that changed from the base of the PR and between d9db2ae and 7805cef.

📒 Files selected for processing (1)

cudax/include/cuda/experimental/__stf/stackable/stackable_ctx.cuh

andralex · 2026-06-03T01:21:27Z

/ok to test 441f79d

Consolidate pending-graph validation in stackable_ctx::impl, route launch_once through ensure_prereqs_synced, and finish stackable_logical_data style cleanup (offset checks, user-facing docs, source_location by value).

coderabbitai

Actionable comments posted: 1

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8b4c1e8c-2dd1-4fb9-ab94-15c633b9e047

📥 Commits

Reviewing files that changed from the base of the PR and between 7805cef and 5bcefd3.

📒 Files selected for processing (2)

cudax/include/cuda/experimental/__stf/stackable/stackable_ctx.cuh
cudax/include/cuda/experimental/__stf/stackable/stackable_ctx_impl.cuh

🚧 Files skipped from review as they are similar to previous changes (1)

cudax/include/cuda/experimental/__stf/stackable/stackable_ctx_impl.cuh

Pass the caller's source_location into repeat_graph_scope_guard so push_while attributes the user's call site, not the guard constructor.

andralex · 2026-06-03T04:41:46Z

/ok to test bf75f3c

andralex

good stuff

github-actions · 2026-06-03T05:28:11Z

🥳 CI Workflow Results

🟩 Finished in 44m 47s: Pass: 100%/55 | Total: 17h 34m | Max: 44m 42s | Hits: 24%/109188

See results here.

A main->stf_c_api merge appended a second identical copy of the 10 re-launchable popped-graph UNITTESTs that already live on main (added in NVIDIA#9178), causing each UNITTEST to be registered and run twice. Restore the single canonical copy by dropping the duplicate block.

caugonnet self-assigned this May 29, 2026

caugonnet added the stf Sequential Task Flow programming model label May 29, 2026

github-project-automation Bot added this to CCCL May 29, 2026

github-project-automation Bot moved this to Todo in CCCL May 29, 2026

cccl-authenticator-app Bot moved this from Todo to In Progress in CCCL May 29, 2026

Merge branch 'main' into stf_launchable_graphs

98f2669

caugonnet marked this pull request as ready for review May 31, 2026 08:16

caugonnet requested a review from a team as a code owner May 31, 2026 08:16

caugonnet requested a review from andralex May 31, 2026 08:16

Merge branch 'main' into stf_launchable_graphs

edbe4ad

cccl-authenticator-app Bot moved this from In Progress to In Review in CCCL May 31, 2026

coderabbitai Bot reviewed May 31, 2026

caugonnet and others added 2 commits May 31, 2026 21:49

Remove unused launched_ flag from graph_ctx_node.

7178034

Dep-A ordering is already tracked by synced_; launched_ was set in launch_once() but never read.

Merge branch 'main' into stf_launchable_graphs

4b21ca3

caugonnet and others added 2 commits June 1, 2026 16:03

Merge branch 'main' into stf_launchable_graphs

52e0b13

andralex reviewed Jun 2, 2026

andralex added 2 commits June 2, 2026 17:25

Merge branch 'main' into pr-9178-stf-launchable-graphs

955172e

[STF] Flatten stackable_logical_data and polish launchable graph scope

7805cef

Move import/adoption logic into the stackable_logical_data shell, drop the nested impl layer, and simplify launchable_graph_scope handle validity.

coderabbitai Bot reviewed Jun 2, 2026

Comment thread cudax/include/cuda/experimental/__stf/stackable/stackable_ctx.cuh

andralex added 2 commits June 2, 2026 18:46

Merge branch 'main' into stf_launchable_graphs

e5551a6

Merge branch 'main' into stf_launchable_graphs

441f79d

coderabbitai Bot reviewed Jun 3, 2026

Comment thread cudax/include/cuda/experimental/__stf/stackable/stackable_ctx.cuh Outdated

[STF] Forward source_location through repeat_graph_scope

bf75f3c

Pass the caller's source_location into repeat_graph_scope_guard so push_while attributes the user's call site, not the guard constructor.

andralex enabled auto-merge (squash) June 3, 2026 04:41

andralex approved these changes Jun 3, 2026

andralex merged commit 7fc1933 into NVIDIA:main Jun 3, 2026
77 checks passed

github-project-automation Bot moved this from In Review to Done in CCCL Jun 3, 2026

andralex deleted the stf_launchable_graphs branch June 3, 2026 05:31

caugonnet mentioned this pull request Jun 3, 2026

[STF] Add tests for re-launchable popped graphs in stackable_ctx #9226

Closed

2 tasks

This was referenced Jun 3, 2026

[STF] Add C bindings for stackable contexts #9233

Merged

cudax/stf: migrate stackable/ from cuda_safe_call to cuda_try #9165

Merged
