[STF] Add re-launchable popped graphs to stackable_ctx#9178
Conversation
Splits graph_ctx_node finalization into phases so a popped nested graph
can be instantiated once and launched many times before the matching
epilogue runs. Adds three public surfaces on stackable_ctx:
* pop_prologue() / pop_epilogue() returning a launchable_graph_handle
that exposes exec(), stream(), graph(), and launch();
* launchable_graph_scope, an RAII guard that pairs push() with a
lazy pop_prologue() and runs pop_epilogue() in its destructor;
* pop_prologue_shared() returning a copyable/storable launchable_graph
whose destructor runs pop_epilogue() when the last copy dies.
The non-nested finalize path now flows through prepare_graph ->
ensure_instantiated -> launch_once -> finalize_after_launch; the
existing nested-graph behavior is preserved verbatim in
finalize_nested(). push() / pop() guard against being called while a
pop_prologue is still pending its matching pop_epilogue.
Coverage lives in the stackable_ctx.cuh inline UNITTESTs: repeated
launch, manual cudaGraphLaunch via exec()/stream(), zero-launch,
handle invalidation, RAII scope, shared basic/copies/container/manual
epilogue, and a CTK-12.4 pop_prologue + repeat_graph_scope test.
|
/ok to test 43486c3 |
This comment has been minimized.
This comment has been minimized.
|
/ok to test edbe4ad |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
suggestion: WalkthroughThis PR adds a two-phase re-launchable pop workflow: pop_prologue() returns launchable handles for deferred/lazy execution, and pop_epilogue() finalizes and invalidates outstanding handles. Graph finalization is split into prepare/instantiate/sync/launch/finalize phases, with push/pop sequencing guards and shared/copyable RAII handle wrappers. ChangesRe-launchable graph pop API and implementation
Suggested reviewers:
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
cudax/include/cuda/experimental/__stf/stackable/stackable_ctx_impl.cuh (1)
801-818: 💤 Low valuesuggestion:
launched_is set on line 689 but never read. Remove or use it.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 5cb2f811-0bb5-4423-b7f6-236b7bd3fc9d
📒 Files selected for processing (2)
cudax/include/cuda/experimental/__stf/stackable/stackable_ctx.cuhcudax/include/cuda/experimental/__stf/stackable/stackable_ctx_impl.cuh
This comment has been minimized.
This comment has been minimized.
Dep-A ordering is already tracked by synced_; launched_ was set in launch_once() but never read.
|
/ok to test 4b21ca3 |
This comment has been minimized.
This comment has been minimized.
Address review follow-ups on the re-launchable popped graphs: * Fix docs that claimed pop_prologue() eagerly instantiates the cudaGraphExec_t. Instantiation is lazy (first exec()/launch()); graph() consumers never instantiate. Drop the stale prepare_launch() references. * Route launchable_graph_handle through thin private stackable_ctx wrappers (launch_prepared_graph / prepare_handle_for_exec / prepare_handle_for_graph) instead of reaching into pimpl directly, mirroring the pop_epilogue() surface. * Replace the ad-hoc validate_/check_ helpers and the impl-side fprintf+abort misuse guards with _CCCL_VERIFY, which stays enabled in release builds (unlike _CCCL_ASSERT). Genuine internal invariants remain _CCCL_ASSERT. * Add a unit test that embeds handle.graph() as a child graph node via cudaGraphAddChildGraphNode, orders dep-A through an event on handle.stream(), and documents the pop_epilogue() ordering caveat.
| auto r = pimpl->pop_prologue_impl(); | ||
|
|
||
| launchable_graph_handle h; | ||
| h.token_ = r.token; |
There was a problem hiding this comment.
| h.token_ = r.token; | |
| launchable_graph_handle h = {}; |
so we initialize whatever fields we might add in the future. No efficiency impact, compiler will eliminate the dead assignments.
| class stackable_ctx::launchable_graph_scope | ||
| { | ||
| public: | ||
| using context_type = stackable_ctx; |
There was a problem hiding this comment.
there's no usefulness to this
| using context_type = stackable_ctx; | ||
|
|
||
| explicit launchable_graph_scope(stackable_ctx& ctx, | ||
| const ::cuda::std::source_location& loc = ::cuda::std::source_location::current()) |
There was a problem hiding this comment.
| const ::cuda::std::source_location& loc = ::cuda::std::source_location::current()) | |
| ::cuda::std::source_location loc = ::cuda::std::source_location::current()) |
it's cheap
| { | ||
| if (released_) | ||
| { | ||
| return; | ||
| } | ||
| released_ = true; | ||
|
|
||
| if (!prepared_) | ||
| { | ||
| // No one ever called launch()/exec()/stream()/graph(): we still ran push() | ||
| // in the constructor, so we must match it with a prologue+epilogue | ||
| // pair to tear the node down cleanly. finalize_after_launch handles | ||
| // the no-launch case correctly. | ||
| handle_ = ctx_.pop_prologue(); | ||
| prepared_ = true; | ||
| } | ||
| ctx_.pop_epilogue(); |
There was a problem hiding this comment.
something unclear here... if it wasn't prepared and we're to release it, why prepare it? Also released_ needs to be set at the very end for exception safety.
| { | |
| if (released_) | |
| { | |
| return; | |
| } | |
| released_ = true; | |
| if (!prepared_) | |
| { | |
| // No one ever called launch()/exec()/stream()/graph(): we still ran push() | |
| // in the constructor, so we must match it with a prologue+epilogue | |
| // pair to tear the node down cleanly. finalize_after_launch handles | |
| // the no-launch case correctly. | |
| handle_ = ctx_.pop_prologue(); | |
| prepared_ = true; | |
| } | |
| ctx_.pop_epilogue(); | |
| { | |
| if (released_) | |
| { | |
| return; | |
| } | |
| if (!prepared_) | |
| { | |
| // No one ever called launch()/exec()/stream()/graph(): we still ran push() | |
| // in the constructor, so we must match it with a prologue+epilogue | |
| // pair to tear the node down cleanly. finalize_after_launch handles | |
| // the no-launch case correctly. | |
| handle_ = ctx_.pop_prologue(); | |
| prepared_ = true; | |
| } | |
| ctx_.pop_epilogue(); | |
| released_ = true; |
| bool prepared_ = false; | ||
| bool released_ = false; |
There was a problem hiding this comment.
Maybe enum class state = { empty, prepared, released };?
There was a problem hiding this comment.
Or I suspect "released" is the same as "empty" so really there's only two states?
| void ensure_prepared_() | ||
| { | ||
| if (!prepared_) | ||
| { | ||
| handle_ = ctx_.pop_prologue(); | ||
| prepared_ = true; | ||
| } | ||
| } |
There was a problem hiding this comment.
I suspect the "prepared" state is entirely encoded by handle_ being non-null, which simplifies things. E.g.. here:
| void ensure_prepared_() | |
| { | |
| if (!prepared_) | |
| { | |
| handle_ = ctx_.pop_prologue(); | |
| prepared_ = true; | |
| } | |
| } | |
| void ensure_prepared_() | |
| { | |
| if (!handle_ ) | |
| { | |
| handle_ = ctx_.pop_prologue(); | |
| } | |
| } |
Move import/adoption logic into the stackable_logical_data shell, drop the nested impl layer, and simplify launchable_graph_scope handle validity.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cudax/include/cuda/experimental/__stf/stackable/stackable_ctx.cuh (1)
1046-1070:⚠️ Potential issue | 🟠 Major | ⚡ Quick winimportant: Don't use
handle_validity as the prepared-state bit.ensure_prepared_()keys offif (!handle_), butlaunchable_graph_handle::operator bool()flips to false after a manualctx.pop_epilogue(). In that sequence,release()/the destructor will callpop_prologue()again on a scope whose pop was already epilogued. Keep a separateprepared_flag, or at least short-circuit the destructor path when the handle has already been invalidated, and add a regression test for that manual-epilogue case.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 118cb460-b7ca-4dfe-b7bf-7771128d9211
📒 Files selected for processing (1)
cudax/include/cuda/experimental/__stf/stackable/stackable_ctx.cuh
|
/ok to test 441f79d |
This comment has been minimized.
This comment has been minimized.
Consolidate pending-graph validation in stackable_ctx::impl, route launch_once through ensure_prereqs_synced, and finish stackable_logical_data style cleanup (offset checks, user-facing docs, source_location by value).
There was a problem hiding this comment.
Actionable comments posted: 1
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 8b4c1e8c-2dd1-4fb9-ab94-15c633b9e047
📒 Files selected for processing (2)
cudax/include/cuda/experimental/__stf/stackable/stackable_ctx.cuhcudax/include/cuda/experimental/__stf/stackable/stackable_ctx_impl.cuh
🚧 Files skipped from review as they are similar to previous changes (1)
- cudax/include/cuda/experimental/__stf/stackable/stackable_ctx_impl.cuh
Pass the caller's source_location into repeat_graph_scope_guard so push_while attributes the user's call site, not the guard constructor.
|
/ok to test bf75f3c |
🥳 CI Workflow Results🟩 Finished in 44m 47s: Pass: 100%/55 | Total: 17h 34m | Max: 44m 42s | Hits: 24%/109188See results here. |
A main->stf_c_api merge appended a second identical copy of the 10 re-launchable popped-graph UNITTESTs that already live on main (added in NVIDIA#9178), causing each UNITTEST to be registered and run twice. Restore the single canonical copy by dropping the duplicate block.
Splits graph_ctx_node finalization into phases so a popped nested graph can be instantiated once and launched many times before the matching epilogue runs. Adds three public surfaces on stackable_ctx:
The non-nested finalize path now flows through prepare_graph -> ensure_instantiated -> launch_once -> finalize_after_launch; the existing nested-graph behavior is preserved verbatim in finalize_nested(). push() / pop() guard against being called while a pop_prologue is still pending its matching pop_epilogue.
Coverage lives in the stackable_ctx.cuh inline UNITTESTs: repeated launch, manual cudaGraphLaunch via exec()/stream(), zero-launch, handle invalidation, RAII scope, shared basic/copies/container/manual epilogue, and a CTK-12.4 pop_prologue + repeat_graph_scope test.
Description
closes
Checklist