
[STF] Add re-launchable popped graphs to stackable_ctx #9178

Merged: andralex merged 13 commits into NVIDIA:main from caugonnet:stf_launchable_graphs on Jun 3, 2026
Conversation

@caugonnet

@caugonnet caugonnet commented May 29, 2026


Splits graph_ctx_node finalization into phases so a popped nested graph can be instantiated once and launched many times before the matching epilogue runs. Adds three public surfaces on stackable_ctx:

  • pop_prologue(), which returns a launchable_graph_handle exposing exec(), stream(), graph(), and launch(), together with the matching pop_epilogue();
  • launchable_graph_scope, an RAII guard that pairs push() with a lazy pop_prologue() and runs pop_epilogue() in its destructor;
  • pop_prologue_shared() returning a copyable/storable launchable_graph whose destructor runs pop_epilogue() when the last copy dies.

The non-nested finalize path now flows through prepare_graph -> ensure_instantiated -> launch_once -> finalize_after_launch; the existing nested-graph behavior is preserved verbatim in finalize_nested(). push() / pop() guard against being called while a pop_prologue is still pending its matching pop_epilogue.

Coverage lives in the stackable_ctx.cuh inline UNITTESTs: repeated launch, manual cudaGraphLaunch via exec()/stream(), zero-launch, handle invalidation, RAII scope, shared basic/copies/container/manual epilogue, and a CTK-12.4 pop_prologue + repeat_graph_scope test.
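
The two-phase pop described above can be sketched with a toy model in plain C++. This is not the CCCL implementation and it uses no CUDA: toy_ctx, its nested handle, and the counters are hypothetical names chosen for illustration. It only mirrors the contract stated in this description: pop_prologue() hands out a handle, the first launch instantiates lazily, later launches reuse that instance, and pop_epilogue() invalidates the handle.

```cpp
// Toy model only: not the CCCL implementation. All names are hypothetical.
#include <cassert>
#include <cstdint>

class toy_ctx
{
public:
  // Handle returned by pop_prologue(); valid until the matching pop_epilogue().
  class handle
  {
  public:
    handle() = default;

    // True while the epilogue that matches this handle has not run yet.
    explicit operator bool() const
    {
      return ctx_ != nullptr && token_ == ctx_->pending_token_;
    }

    // Launch the prepared graph; instantiation happens on the first launch only.
    void launch()
    {
      assert(static_cast<bool>(*this) && "handle used after pop_epilogue()");
      ctx_->ensure_instantiated();
      ++ctx_->launches_;
    }

  private:
    friend class toy_ctx;
    toy_ctx* ctx_        = nullptr;
    std::uint64_t token_ = 0;
  };

  void push()
  {
    assert(pending_token_ == 0 && "push() while an epilogue is pending");
    ++depth_;
  }

  handle pop_prologue()
  {
    assert(depth_ > 0 && pending_token_ == 0);
    pending_token_ = ++next_token_; // token 0 means "nothing pending"
    handle h;
    h.ctx_   = this;
    h.token_ = pending_token_;
    return h;
  }

  void pop_epilogue()
  {
    assert(pending_token_ != 0 && "pop_epilogue() without pop_prologue()");
    pending_token_ = 0; // invalidates every outstanding handle
    instantiated_  = false;
    --depth_;
  }

  int instantiations() const { return instantiations_; }
  int launches() const { return launches_; }

private:
  void ensure_instantiated()
  {
    if (!instantiated_)
    {
      instantiated_ = true;
      ++instantiations_;
    }
  }

  int depth_                   = 0;
  std::uint64_t next_token_    = 0;
  std::uint64_t pending_token_ = 0;
  bool instantiated_           = false;
  int instantiations_          = 0;
  int launches_                = 0;
};

// Returns true if the toy behaves as described: one instantiation, three
// launches, and an invalid handle once the epilogue has run.
inline bool demo_two_phase_pop()
{
  toy_ctx ctx;
  ctx.push();
  auto h = ctx.pop_prologue();
  for (int i = 0; i < 3; ++i)
  {
    h.launch();
  }
  const bool valid_before = static_cast<bool>(h);
  ctx.pop_epilogue();
  return valid_before && !h && ctx.instantiations() == 1 && ctx.launches() == 3;
}
```

Calling demo_two_phase_pop() from a main() exercises the whole sequence. The token comparison is one simple way to invalidate all copies of a handle at once; how the real handle tracks validity is not shown in this description.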

Description

closes

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@caugonnet caugonnet self-assigned this May 29, 2026
@caugonnet caugonnet added the stf Sequential Task Flow programming model label May 29, 2026
@copy-pr-bot

copy-pr-bot Bot commented May 29, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-project-automation github-project-automation Bot moved this to Todo in CCCL May 29, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Progress in CCCL May 29, 2026
@caugonnet


/ok to test 43486c3

@caugonnet caugonnet marked this pull request as ready for review May 31, 2026 08:16
@caugonnet caugonnet requested a review from a team as a code owner May 31, 2026 08:16
@caugonnet caugonnet requested a review from andralex May 31, 2026 08:16
@cccl-authenticator-app cccl-authenticator-app Bot moved this from In Progress to In Review in CCCL May 31, 2026
@caugonnet


/ok to test edbe4ad

@coderabbitai

coderabbitai Bot commented May 31, 2026



Walkthrough

This PR adds a two-phase re-launchable pop workflow: pop_prologue() returns launchable handles for deferred/lazy execution, and pop_epilogue() finalizes and invalidates outstanding handles. Graph finalization is split into prepare/instantiate/sync/launch/finalize phases, with push/pop sequencing guards and shared/copyable RAII handle wrappers.

Changes

Re-launchable graph pop API and implementation

| Layer | File(s) | Summary |
|---|---|---|
| Graph finalization refactor | cudax/include/cuda/experimental/__stf/stackable/stackable_ctx_impl.cuh | Splits graph_ctx_node::finalize() into phased helpers (prepare_graph, ensure_instantiated, ensure_prereqs_synced, launch_once, finalize_after_launch) and extracts finalize_nested(); adds per-node flags and cached exec_graph_ state used by deferred launches. |
| Two-phase pop machinery | cudax/include/cuda/experimental/__stf/stackable/stackable_ctx_impl.cuh | Implements pop_prologue_impl()/pop_epilogue_impl(), pop_prologue_result, and helpers (launch_prepared_graph, prepare_handle_for_exec, prepare_handle_for_graph) enabling lazy instantiation and strict sequencing for launchable handles. |
| Pending-epilogue verification helpers | cudax/include/cuda/experimental/__stf/stackable/stackable_ctx_impl.cuh | Adds helpers that validate a handle's pending token and node offset before accessing the prepared node. |
| Pending-epilogue state | cudax/include/cuda/experimental/__stf/stackable/stackable_ctx_impl.cuh | Adds pending_epilogue_token_ and pending_epilogue_node_offset_ to atomically track and invalidate outstanding launch handles when pop_epilogue() runs. |
| Push/pop sequencing guards | cudax/include/cuda/experimental/__stf/stackable/stackable_ctx_impl.cuh | Adds guards that abort when push() is called while an epilogue is pending and when pop() is invoked during the prologue/epilogue window. |
| Public API and handle declarations | cudax/include/cuda/experimental/__stf/stackable/stackable_ctx.cuh, .../stackable_ctx_impl.cuh | Adds forward decl for launchable_graph_handle, public stackable_ctx::pop_prologue() (returns launchable_graph_handle) and stackable_ctx::pop_epilogue() declarations, and friend/forwarders so handles drive prepared-graph execution and lazy prereq syncing. |

Suggested reviewers:

  • andralex
  • oleksandr-pavlyk
  • NaderAlAwar


@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
cudax/include/cuda/experimental/__stf/stackable/stackable_ctx_impl.cuh (1)

801-818: 💤 Low value

suggestion: launched_ is set on line 689 but never read. Remove or use it.



📥 Commits

Reviewing files that changed from the base of the PR and between fb8629d and edbe4ad.

📒 Files selected for processing (2)
  • cudax/include/cuda/experimental/__stf/stackable/stackable_ctx.cuh
  • cudax/include/cuda/experimental/__stf/stackable/stackable_ctx_impl.cuh

caugonnet and others added 2 commits May 31, 2026 21:49
Dep-A ordering is already tracked by synced_; launched_ was set in
launch_once() but never read.
@caugonnet


/ok to test 4b21ca3

caugonnet and others added 2 commits June 1, 2026 16:03
Address review follow-ups on the re-launchable popped graphs:

* Fix docs that claimed pop_prologue() eagerly instantiates the
  cudaGraphExec_t. Instantiation is lazy (first exec()/launch()); graph()
  consumers never instantiate. Drop the stale prepare_launch() references.
* Route launchable_graph_handle through thin private stackable_ctx
  wrappers (launch_prepared_graph / prepare_handle_for_exec /
  prepare_handle_for_graph) instead of reaching into pimpl directly,
  mirroring the pop_epilogue() surface.
* Replace the ad-hoc validate_/check_ helpers and the impl-side
  fprintf+abort misuse guards with _CCCL_VERIFY, which stays enabled in
  release builds (unlike _CCCL_ASSERT). Genuine internal invariants remain
  _CCCL_ASSERT.
* Add a unit test that embeds handle.graph() as a child graph node via
  cudaGraphAddChildGraphNode, orders dep-A through an event on
  handle.stream(), and documents the pop_epilogue() ordering caveat.
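
The split between misuse guards that must fire in release builds and internal invariants that may be debug-only can be sketched as follows. TOY_VERIFY and toy_stack are hypothetical stand-ins, not _CCCL_VERIFY or the real context. As an assumption made purely so the behavior can be observed in a test, the toy check throws instead of aborting.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for an always-on check: unlike assert(), it is not
// compiled out when NDEBUG is defined. It throws instead of aborting only so
// that the failure can be observed by a caller.
#define TOY_VERIFY(cond, msg)      \
  do                               \
  {                                \
    if (!(cond))                   \
    {                              \
      throw std::logic_error(msg); \
    }                              \
  } while (false)

struct toy_stack
{
  int depth             = 0;
  bool epilogue_pending = false;

  void push()
  {
    // User-facing misuse: must be diagnosed in release builds too.
    TOY_VERIFY(!epilogue_pending, "push() called while a pop_epilogue() is pending");
    ++depth;
  }

  void pop_prologue()
  {
    TOY_VERIFY(depth > 0, "pop_prologue() without a matching push()");
    TOY_VERIFY(!epilogue_pending, "pop_prologue() called twice");
    epilogue_pending = true;
  }

  void pop_epilogue()
  {
    TOY_VERIFY(epilogue_pending, "pop_epilogue() without pop_prologue()");
    epilogue_pending = false;
    --depth;
    // Internal invariant: a debug-only check is enough here.
    assert(depth >= 0);
  }
};

// Returns true if push() is rejected while an epilogue is still pending.
inline bool push_during_pending_epilogue_is_rejected()
{
  toy_stack s;
  s.push();
  s.pop_prologue();
  try
  {
    s.push();
  }
  catch (const std::logic_error&)
  {
    return true;
  }
  return false;
}
```

The point of the sketch is the placement of the two kinds of checks, not the failure mechanism: conditions a caller can violate through the public API use the always-on form, while conditions that only a bug in the implementation could violate use assert().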
auto r = pimpl->pop_prologue_impl();

launchable_graph_handle h;
h.token_ = r.token;


Suggested change
- h.token_ = r.token;
+ launchable_graph_handle h = {};

Value-initialize the handle so that any fields we might add in the future are initialized too. There is no efficiency impact: the compiler will eliminate the dead assignments.
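
A minimal sketch of the value-initialization point, using a hypothetical toy_handle whose members are assumptions and do not reflect the real launchable_graph_handle layout. The members deliberately have no default member initializers, so the `= {}` is what guarantees they start out zeroed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical handle layout; the real launchable_graph_handle may differ.
struct toy_handle
{
  std::uint64_t token;
  int node_offset;
  void* ctx;
};

inline toy_handle make_handle(std::uint64_t token)
{
  toy_handle h = {}; // value-initialization: every member starts zeroed/null
  h.token      = token; // then overwrite only what is known at this point
  return h;
}
```

With a plain `toy_handle h;` the untouched members would hold indeterminate values, and a field added later would silently join them.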

class stackable_ctx::launchable_graph_scope
{
public:
using context_type = stackable_ctx;


This alias serves no purpose.

using context_type = stackable_ctx;

explicit launchable_graph_scope(stackable_ctx& ctx,
const ::cuda::std::source_location& loc = ::cuda::std::source_location::current())


Suggested change
- const ::cuda::std::source_location& loc = ::cuda::std::source_location::current())
+ ::cuda::std::source_location loc = ::cuda::std::source_location::current())

Pass it by value; it's cheap to copy.

Comment on lines +1253 to +1269
{
  if (released_)
  {
    return;
  }
  released_ = true;

  if (!prepared_)
  {
    // No one ever called launch()/exec()/stream()/graph(): we still ran push()
    // in the constructor, so we must match it with a prologue+epilogue
    // pair to tear the node down cleanly. finalize_after_launch handles
    // the no-launch case correctly.
    handle_   = ctx_.pop_prologue();
    prepared_ = true;
  }
  ctx_.pop_epilogue();


Something is unclear here: if the scope was never prepared and we are about to release it, why prepare it? Also, released_ needs to be set at the very end for exception safety.

Suggested change
- {
-   if (released_)
-   {
-     return;
-   }
-   released_ = true;
-   if (!prepared_)
-   {
-     // No one ever called launch()/exec()/stream()/graph(): we still ran push()
-     // in the constructor, so we must match it with a prologue+epilogue
-     // pair to tear the node down cleanly. finalize_after_launch handles
-     // the no-launch case correctly.
-     handle_ = ctx_.pop_prologue();
-     prepared_ = true;
-   }
-   ctx_.pop_epilogue();
+ {
+   if (released_)
+   {
+     return;
+   }
+   if (!prepared_)
+   {
+     // No one ever called launch()/exec()/stream()/graph(): we still ran push()
+     // in the constructor, so we must match it with a prologue+epilogue
+     // pair to tear the node down cleanly. finalize_after_launch handles
+     // the no-launch case correctly.
+     handle_ = ctx_.pop_prologue();
+     prepared_ = true;
+   }
+   ctx_.pop_epilogue();
+   released_ = true;
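
The ordering argument in this suggestion can be illustrated with a toy scope whose teardown fails once. flaky_ctx and toy_scope are hypothetical, and whether the real pop_epilogue() can throw is an assumption made only for the sake of the sketch. If the flag were set before the call, the failed first attempt would mark the scope as released and the retry would become a no-op.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical context whose epilogue fails once, then succeeds.
struct flaky_ctx
{
  int failures_left  = 1;
  int epilogues_done = 0;

  void pop_epilogue()
  {
    if (failures_left > 0)
    {
      --failures_left;
      throw std::runtime_error("transient failure");
    }
    ++epilogues_done;
  }
};

class toy_scope
{
public:
  explicit toy_scope(flaky_ctx& ctx)
      : ctx_(ctx)
  {}

  void release()
  {
    if (released_)
    {
      return;
    }
    ctx_.pop_epilogue(); // may throw
    released_ = true;    // only reached when the teardown succeeded
  }

  bool released() const { return released_; }

private:
  flaky_ctx& ctx_;
  bool released_ = false;
};

// Returns true if a failed release() leaves the scope retryable.
inline bool retry_after_failure_succeeds()
{
  flaky_ctx ctx;
  toy_scope scope(ctx);
  try
  {
    scope.release(); // first attempt throws inside pop_epilogue()
  }
  catch (const std::runtime_error&)
  {}
  const bool still_pending = !scope.released();
  scope.release(); // second attempt completes the teardown
  return still_pending && scope.released() && ctx.epilogues_done == 1;
}
```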

Comment on lines +1284 to +1285
bool prepared_ = false;
bool released_ = false;


Maybe enum class state { empty, prepared, released };?


Or I suspect "released" is the same as "empty" so really there's only two states?

Comment on lines +1273 to +1280
void ensure_prepared_()
{
  if (!prepared_)
  {
    handle_   = ctx_.pop_prologue();
    prepared_ = true;
  }
}


I suspect the "prepared" state is entirely encoded by handle_ being non-null, which simplifies things. E.g., here:

Suggested change
- void ensure_prepared_()
- {
-   if (!prepared_)
-   {
-     handle_ = ctx_.pop_prologue();
-     prepared_ = true;
-   }
- }
+ void ensure_prepared_()
+ {
+   if (!handle_)
+   {
+     handle_ = ctx_.pop_prologue();
+   }
+ }

andralex added 2 commits June 2, 2026 17:25
Move import/adoption logic into the stackable_logical_data shell, drop the
nested impl layer, and simplify launchable_graph_scope handle validity.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cudax/include/cuda/experimental/__stf/stackable/stackable_ctx.cuh (1)

1046-1070: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

important: Don't use handle_ validity as the prepared-state bit. ensure_prepared_() keys off if (!handle_), but launchable_graph_handle::operator bool() flips to false after a manual ctx.pop_epilogue(). In that sequence, release()/the destructor will call pop_prologue() again on a scope whose pop was already epilogued. Keep a separate prepared_ flag, or at least short-circuit the destructor path when the handle has already been invalidated, and add a regression test for that manual-epilogue case.
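
The sequence flagged here can be reproduced with a toy model. counting_ctx and guarded_scope are hypothetical, and the sketch shows one possible shape of the remedy named in the comment (a separate prepared_ flag plus a short-circuit when the handle is already invalid), not necessarily what the PR ended up doing.

```cpp
#include <cassert>

// Toy context: counts prologue/epilogue calls and tracks a pending epilogue.
struct counting_ctx
{
  int prologues = 0;
  int epilogues = 0;
  bool pending  = false;

  struct handle
  {
    const counting_ctx* ctx = nullptr;
    int id                  = 0;

    // Valid only while the epilogue matching this handle is still pending.
    explicit operator bool() const
    {
      return ctx != nullptr && ctx->pending && ctx->prologues == id;
    }
  };

  handle pop_prologue()
  {
    pending = true;
    ++prologues;
    return handle{this, prologues};
  }

  void pop_epilogue()
  {
    pending = false;
    ++epilogues;
  }
};

class guarded_scope
{
public:
  explicit guarded_scope(counting_ctx& ctx)
      : ctx_(ctx)
  {}

  ~guarded_scope() { release(); }

  counting_ctx::handle& get()
  {
    ensure_prepared();
    return handle_;
  }

  void release()
  {
    if (released_)
    {
      return;
    }
    ensure_prepared();
    if (handle_) // skip when the user already ran the epilogue manually
    {
      ctx_.pop_epilogue();
    }
    released_ = true;
  }

private:
  void ensure_prepared()
  {
    // Keyed on a dedicated flag, not on handle validity, so a manual
    // pop_epilogue() cannot trigger a second pop_prologue().
    if (!prepared_)
    {
      handle_   = ctx_.pop_prologue();
      prepared_ = true;
    }
  }

  counting_ctx& ctx_;
  counting_ctx::handle handle_;
  bool prepared_ = false;
  bool released_ = false;
};

// Returns true if a manual epilogue does not cause a second prologue/epilogue.
inline bool manual_epilogue_does_not_reprepare()
{
  counting_ctx ctx;
  {
    guarded_scope scope(ctx);
    (void) scope.get(); // lazy pop_prologue()
    ctx.pop_epilogue(); // manual epilogue invalidates the handle
  } // destructor must not call pop_prologue() or pop_epilogue() again
  return ctx.prologues == 1 && ctx.epilogues == 1;
}
```

Replacing the `!prepared_` test with `!handle_` in ensure_prepared() makes the same sequence end with two prologues and two epilogues, which is the double pop the comment warns about.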



📥 Commits

Reviewing files that changed from the base of the PR and between d9db2ae and 7805cef.

📒 Files selected for processing (1)
  • cudax/include/cuda/experimental/__stf/stackable/stackable_ctx.cuh

@andralex

andralex commented Jun 3, 2026


/ok to test 441f79d

Consolidate pending-graph validation in stackable_ctx::impl, route launch_once
through ensure_prereqs_synced, and finish stackable_logical_data style cleanup
(offset checks, user-facing docs, source_location by value).

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1



📥 Commits

Reviewing files that changed from the base of the PR and between 7805cef and 5bcefd3.

📒 Files selected for processing (2)
  • cudax/include/cuda/experimental/__stf/stackable/stackable_ctx.cuh
  • cudax/include/cuda/experimental/__stf/stackable/stackable_ctx_impl.cuh
🚧 Files skipped from review as they are similar to previous changes (1)
  • cudax/include/cuda/experimental/__stf/stackable/stackable_ctx_impl.cuh

Pass the caller's source_location into repeat_graph_scope_guard so
push_while attributes the user's call site, not the guard constructor.
@andralex andralex enabled auto-merge (squash) June 3, 2026 04:41
@andralex

andralex commented Jun 3, 2026


/ok to test bf75f3c

@andralex andralex left a comment


good stuff

@github-actions

github-actions Bot commented Jun 3, 2026


🥳 CI Workflow Results

🟩 Finished in 44m 47s: Pass: 100%/55 | Total: 17h 34m | Max: 44m 42s | Hits: 24%/109188

See results here.

@andralex andralex merged commit 7fc1933 into NVIDIA:main Jun 3, 2026
77 checks passed
@github-project-automation github-project-automation Bot moved this from In Review to Done in CCCL Jun 3, 2026
@andralex andralex deleted the stf_launchable_graphs branch June 3, 2026 05:31
caugonnet added a commit to caugonnet/cccl that referenced this pull request Jun 3, 2026
A main->stf_c_api merge appended a second identical copy of the 10
re-launchable popped-graph UNITTESTs that already live on main (added in
NVIDIA#9178), causing each UNITTEST to be registered and run twice. Restore the
single canonical copy by dropping the duplicate block.
Labels

stf Sequential Task Flow programming model

2 participants