[STF] Add C bindings for stackable contexts#9233
Conversation
Extends the experimental STF C API with the stackable-context layer: - stackable context create/finalize/fence and push_graph/pop, plus launchable graphs (launch/exec/stream/graph) and a shared-ownership flavor for replayable / embeddable graphs. - while- and repeat-loop scopes built on CUDA conditional graph nodes, with a built-in scalar comparison condition helper (CUDA 12.4+). - stackable logical data and tokens, and stackable host_launch deps. Adds test_stackable.cu, test_stackable_token_push.cu, and stackable coverage in test_host_launch.cu. Extracted from the python-bindings PR to keep that change reviewable.
|
/ok to test 26e1bf9 |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
suggestion: Walkthroughsuggestion: Adds a StackableContext C API and implementation enabling nested STF graph scopes with two-phase pop (prologue/epilogue), relaunchable launchable-graphs (shared and non-shared), CUDA 12.4+ while/repeat scopes, stack-aware logical-data/tokens, task and host-launch bindings, and accompanying tests. ChangesStackable Context API
Possibly related PRs
Suggested reviewers
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (5)
c/experimental/stf/include/cccl/c/experimental/stf/stf.h (1)
1676-1685: 💤 Low valuesuggestion: The
stf_while_scope_handleandstf_repeat_scope_handletypedefs are declared outside theCUDART_VERSION >= 12040guard (lines 1676-1680), while the functions using them are inside (starting at line 1685). On older CUDA, these types exist but are unusable. Consider moving the typedefs inside the guard for consistency, or add a comment explaining why they're intentionally exposed unconditionally (e.g., for FFI struct compatibility).c/experimental/stf/test/test_stackable_token_push.cu (4)
33-33: 💤 Low valuesuggestion: Remove unused
<cmath>include. No math functions are called in this file.As per coding guidelines: Remove unused headers.
79-79: 💤 Low valuesuggestion: Use
constexprinstead ofconstforrelaunchN. This variable is a compile-time constant and should be markedconstexprper the coding guidelines.- const int relaunchN = 4; + constexpr int relaunchN = 4;As per coding guidelines: All variables that can be evaluated at compile-time must use
constexpr.
131-131: 💤 Low valuesuggestion: Use
constexprinstead ofconstforrelaunchN. This variable is a compile-time constant and should be markedconstexprper the coding guidelines.- const int relaunchN = 4; + constexpr int relaunchN = 4;As per coding guidelines: All variables that can be evaluated at compile-time must use
constexpr.
182-183: 💤 Low valuesuggestion: Use
constexprinstead ofconstforNandrelaunchN. These variables are compile-time constants and should be markedconstexprper the coding guidelines.- const size_t N = 8; - const int relaunchN = 4; + constexpr size_t N = 8; + constexpr int relaunchN = 4;As per coding guidelines: All variables that can be evaluated at compile-time must use
constexpr.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c8848a75-324b-45c1-ae60-060a40d953b4
📒 Files selected for processing (5)
c/experimental/stf/include/cccl/c/experimental/stf/stf.hc/experimental/stf/src/stf.cuc/experimental/stf/test/test_host_launch.cuc/experimental/stf/test/test_stackable.cuc/experimental/stf/test/test_stackable_token_push.cu
Address review feedback: verify the cudaMallocHost allocation for host_dep succeeds before using the pointer, so allocation failures fail the test instead of leading to UB / false passes.
This comment has been minimized.
This comment has been minimized.
|
/ok to test e042f4f |
This comment has been minimized.
This comment has been minimized.
- test_host_launch: enable capture on the manual task inside the nested graph scope so it uses the graph's capture stream, fixing a race between the fill kernel and the host verifier (flaky passed == false). - stf_stackable_push_repeat: reject count == 0 to avoid unsigned underflow / non-terminating loops, matching the documented contract. - stf.h: document the lazy dep-A sync performed by the graph() accessors (stf_launchable_graph_graph / _shared_graph), which order the support stream behind the nested context's freeze events. - test_stackable: check cudaMallocHost results via REQUIRE to avoid UB on allocation failure.
|
@NaderAlAwar thanks these were really insightful comments ... |
|
/ok to test a2bd359 |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
c/experimental/stf/include/cccl/c/experimental/stf/stf.h (1)
1770-1776:⚠️ Potential issue | 🟠 Major | ⚡ Quick winimportant: mirror the outer-stream lifetime caveat on the shared graph accessor.
The last
stf_launchable_graph_shared_free()implicitly runsstf_stackable_pop_epilogue(), so callers that embed the returnedcudaGraph_tinto an outer graph still need to keep the final shared handle alive until that outer launch stream has completed.stf_launchable_graph_shared_graph()documents the lazy dep-A sync, but not this matching teardown constraint, which leaves the same unfreeze-vs-execution race undocumented on the shared path.Also applies to: 1840-1846
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a66f72ce-e208-4587-bc4d-d47f92bb7f8e
📒 Files selected for processing (4)
c/experimental/stf/include/cccl/c/experimental/stf/stf.hc/experimental/stf/src/stf.cuc/experimental/stf/test/test_host_launch.cuc/experimental/stf/test/test_stackable.cu
🚧 Files skipped from review as they are similar to previous changes (2)
- c/experimental/stf/test/test_stackable.cu
- c/experimental/stf/src/stf.cu
🥳 CI Workflow Results🟩 Finished in 34m 15s: Pass: 100%/7 | Total: 51m 58s | Max: 34m 10s | Hits: 83%/948See results here. |
The shared launchable-graph path runs pop_epilogue() implicitly when the last reference is released via stf_launchable_graph_shared_free(). That epilogue unfreezes the pushed data and only synchronizes the support stream (dep A), not the user's outer launch stream. Mirror the outer-stream lifetime caveat already documented on the single-handle stf_launchable_graph_graph() onto stf_launchable_graph_shared_graph() and stf_launchable_graph_shared_free(), and add a worked example showing the safe ordering (sync the launch stream before freeing the last ref).
stackable_ctx::finalize() asserts the stack is already back at the root and only finalizes the root context. The previous doc wrongly implied it blocks on tasks in still-open scopes. Reword to state that all nested scopes must be popped first and that finalize only waits for pending root-level work and releases root resources.
The pop prologue no longer instantiates a cudaGraphExec_t: it only pops pushed data and finalizes the child graph, with instantiation happening lazily on the first launch()/exec(). Update the pop_prologue / launch / exec docs to match that lazy-instantiation contract. Also remove the internal "dep A" label from the public C header (it was only ever defined in the C++ layer) and describe the synchronization in plain terms: ordering the support stream behind the nested context's freeze/get events.
…sk_get_graph In a graph context, stf_task_get_custream() only returns a valid stream once stf_task_enable_capture() has been called. Previously it silently returned a null stream, so kernels launched on it ran on the NULL stream outside the STF graph (leaving the task's graph node empty) and only appeared correct because finalize() globally synchronizes. Now stf_task_get_custream() asserts the stream is non-null with a message pointing at the two supported options. For the expert/explicit path, add stf_task_get_graph(), which returns the task's child cudaGraph_t so callers can add graph nodes directly instead of capturing a stream (mutually exclusive with capture). Backed by a new context::unified_task::get_graph() accessor. Fix the stackable tests that launched kernels via stf_task_get_custream() inside a graph scope without enabling capture (test_stackable.cu and test_stackable_token_push.cu); they were silently running outside the graph. Built with the cccl-c-stf preset and ran the stackable, stackable_token_push, and host_launch C tests (all pass).
stf_task_get_graph() returns a raw cudaGraph_t for explicit node insertion. In C++ this handle can be adopted via cuda::experimental::graph_builder::from_native_handle(), but cuda.core's Python GraphBuilder cannot adopt an existing cudaGraph_t (it is created from a device/stream and builds via capture). Document that this entry point is a C/C++ affordance that may not be backed by a Python interface, and that Python users should prefer the capture path.
…in release The previous _CCCL_ASSERT guard in stf_task_get_custream() is gated on CCCL_ENABLE_HOST_ASSERTIONS and compiles to a no-op in default release builds (e.g. the cccl-c-stf preset), so the "launching outside the graph" misuse would still pass silently there. Switch to _CCCL_VERIFY, which is unconditionally enabled (even under NDEBUG) and is the macro reserved for critical always-on checks. Verified empirically: with the guard a task that uses stf_task_get_custream() in a graph scope without stf_task_enable_capture() now aborts (SIGABRT) in the release build instead of silently corrupting results.
The else branch in unified_task::get_graph() is an unconditional misuse path (a stream-context task has no graph). Use _CCCL_VERIFY so it aborts in release builds too, instead of _CCCL_ASSERT which compiles out and silently returns a null graph.
|
/ok to test df56c2d |
| using task_t = ::std::decay_t<decltype(self)>; | ||
| if constexpr (::std::is_same_v<task_t, graph_task<Deps...>>) | ||
| { | ||
| return self.get_graph(); |
There was a problem hiding this comment.
Important: stf_task_get_graph() is documented as mutually exclusive with the capture path, but this wrapper does not enforce that. If a C caller calls stf_task_enable_capture(t) and then stf_task_get_graph(t), graph_task::get_graph() creates a child graph, while end_uncleared() takes the capture/done_nodes path and never embeds that child graph, so nodes inserted through the returned graph are silently dropped
The explicit-graph path (graph_task::get_graph()) is documented as mutually exclusive with stream capture, but nothing enforced it. With capture enabled, end_uncleared() takes the done_nodes path and never embeds the child graph, so nodes added through get_graph() were silently dropped. Add a _CCCL_VERIFY guard in get_graph() so this misuse aborts loudly (in release too) instead of corrupting results. Also add test_task_get_graph.cu exercising the explicit-graph path (adding a kernel node directly into the task's child graph, no capture).
…eHost The stackable nested-graph-scope example in stf.h launched work through stf_task_get_custream() without stf_task_enable_capture(), which now aborts. Add the enable_capture() call to the example and a note on stf_stackable_task_create() about the capture requirement (or the stf_task_get_graph() explicit path) inside a graph scope. Also check the cudaFreeHost result in test_task_get_graph.cu.
Wrap the previously unchecked cudaMallocHost (test_host_launch.cu) and cudaFreeHost calls across the stackable / host_launch C tests with REQUIRE(... == cudaSuccess), consistent with the rest of the suite, so allocation/free failures fail the test instead of being ignored.
Replace the unique_ptr/cudaFreeHost deleter (which ignored the free result) on the host-pinned test with the explicit checked alloc/free pattern used elsewhere, so a cudaFreeHost failure fails the test.
|
Many thanks @NaderAlAwar ! |
|
/ok to test 726bad1 |
|
/ok to test 6c441e6 |
Description
Extracted from #5315 (STF Python bindings) to keep that PR reviewable. This PR contains only the stackable-context additions to the experimental STF C API.
Adds C bindings for the stackable-context layer:
push_graph/pop, plus launchable graphs (launch/exec/stream/graph) and a shared-ownership flavor for replayable / embeddable graphs.host_launchdeps.Tests
test_stackable.cu,test_stackable_token_push.cu, and stackable coverage added totest_host_launch.cu.Checklist