[STF] Add C bindings for stackable contexts by caugonnet · Pull Request #9233 · NVIDIA/cccl

caugonnet · 2026-06-03T09:21:42Z

Description

Extracted from #5315 (STF Python bindings) to keep that PR reviewable. This PR contains only the stackable-context additions to the experimental STF C API.

Adds C bindings for the stackable-context layer:

stackable context create/finalize/fence and push_graph/pop, plus launchable graphs (launch/exec/stream/graph) and a shared-ownership flavor for replayable / embeddable graphs.
while- and repeat-loop scopes built on CUDA conditional graph nodes, with a built-in scalar comparison condition helper (CUDA 12.4+).
stackable logical data and tokens, and stackable host_launch deps.

Tests

test_stackable.cu, test_stackable_token_push.cu, and stackable coverage added to test_host_launch.cu.

Checklist

New or existing tests cover these changes.
The documentation is up to date with these changes (Doxygen comments in the header).

Extends the experimental STF C API with the stackable-context layer:

- stackable context create/finalize/fence and push_graph/pop, plus launchable graphs (launch/exec/stream/graph) and a shared-ownership flavor for replayable / embeddable graphs.
- while- and repeat-loop scopes built on CUDA conditional graph nodes, with a built-in scalar comparison condition helper (CUDA 12.4+).
- stackable logical data and tokens, and stackable host_launch deps.

Adds test_stackable.cu, test_stackable_token_push.cu, and stackable coverage in test_host_launch.cu. Extracted from the python-bindings PR to keep that change reviewable.

copy-pr-bot · 2026-06-03T09:21:46Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

caugonnet · 2026-06-03T09:26:29Z

/ok to test 26e1bf9

coderabbitai · 2026-06-03T09:27:47Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

suggestion: Adds a StackableContext C API and implementation enabling nested STF graph scopes with two-phase pop (prologue/epilogue), relaunchable launchable-graphs (shared and non-shared), CUDA 12.4+ while/repeat scopes, stack-aware logical-data/tokens, task and host-launch bindings, and accompanying tests.

Changes

Stackable Context API

| Layer / File(s) | Summary |
| --- | --- |
| Core implementation utilities `c/experimental/stf/src/stf.cu` | Adds `<memory>` include and internal helpers for smart-pointer-backed shared launchable handles, stackable logical-data/token discriminator, and device scalar-condition kernel template. |
| C header declarations for stackable APIs `c/experimental/stf/include/cccl/c/experimental/stf/stf.h` | Appends `StackableContext` header group: context lifecycle (`create`/`finalize`/`fence`), `push_graph`/`pop`, non-shared and shared launchable-graph prologue/epilogue and accessors, CUDA 12.4+ while/repeat scope types and `while_cond_scalar`, logical-data/token constructors and modifiers, task and host-launch stackable APIs, and moves `extern "C"` close to file end. |
| C ABI externs implementation `c/experimental/stf/src/stf.cu` | Implements `extern "C"` entry points for stackable ctx push/pop/prologue/epilogue, launchable graph operations (non-shared + shared dup/free/valid/launch), while/repeat scope management and scalar-condition updater, stackable logical-data/token create/set/push/destroy, and stackable task + host-launch dependency binding and submission with scope validation. |
| Stackable integration tests `c/experimental/stf/test/test_stackable.cu` | Tests for nested `push_graph`/`pop`, `pop_prologue` relaunch and zero-launch semantics, accessor guarantees (`graph`/`exec`/`stream`), child-graph embedding/instantiate, shared-handle dup/free semantics, nested scopes, token+fence sequencing, and repeat/while regression sweep (toolkit-gated). |
| Host-launch tests `c/experimental/stf/test/test_host_launch.cu` | Two tests validating stackable `host_launch` behavior with pinned host memory: one standard flow and one inside a nested child-graph scope (capture-enabled). |
| Token regression tests `c/experimental/stf/test/test_stackable_token_push.cu` | Regression and sanity tests exercising token-based dependencies in `push_graph`/`pop_prologue` and `pop_prologue_shared`, repeatedly launching to validate no hard abort and exercising a real-data workaround. |

Possibly related PRs

NVIDIA/cccl#9178: Introduces the underlying C++ stackable_ctx::pop_prologue/pop_epilogue and launchable_graph_handle/pop_prologue_shared infrastructure that this PR exposes as a public C API.

Suggested reviewers

NaderAlAwar
andralex
oleksandr-pavlyk

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (5)

c/experimental/stf/include/cccl/c/experimental/stf/stf.h (1)

1676-1685: 💤 Low value

suggestion: The stf_while_scope_handle and stf_repeat_scope_handle typedefs are declared outside the CUDART_VERSION >= 12040 guard (lines 1676-1680), while the functions using them are inside (starting at line 1685). On older CUDA, these types exist but are unusable. Consider moving the typedefs inside the guard for consistency, or add a comment explaining why they're intentionally exposed unconditionally (e.g., for FFI struct compatibility).
c/experimental/stf/test/test_stackable_token_push.cu (4)
33-33: 💤 Low value

suggestion: Remove unused <cmath> include. No math functions are called in this file.

As per coding guidelines: Remove unused headers.

79-79: 💤 Low value

suggestion: Use constexpr instead of const for relaunchN. This variable is a compile-time constant and should be marked constexpr per the coding guidelines.
-  const int relaunchN = 4;
+  constexpr int relaunchN = 4;
As per coding guidelines: All variables that can be evaluated at compile-time must use constexpr.

131-131: 💤 Low value

suggestion: Use constexpr instead of const for relaunchN. This variable is a compile-time constant and should be marked constexpr per the coding guidelines.
-  const int relaunchN = 4;
+  constexpr int relaunchN = 4;
As per coding guidelines: All variables that can be evaluated at compile-time must use constexpr.

182-183: 💤 Low value

suggestion: Use constexpr instead of const for N and relaunchN. These variables are compile-time constants and should be marked constexpr per the coding guidelines.
-  const size_t N      = 8;
-  const int relaunchN = 4;
+  constexpr size_t N      = 8;
+  constexpr int relaunchN = 4;
As per coding guidelines: All variables that can be evaluated at compile-time must use constexpr.
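
To illustrate the constexpr nitpicks above, here is a minimal host-only sketch. The helper name `relaunch_buffer_size` is hypothetical and not part of the test file; only `N` and `relaunchN` come from the suggestion. The practical difference is that `constexpr` makes the compiler reject a non-constant initializer, while a plain `const` would silently accept a runtime value.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper; N and relaunchN mirror the names used in the suggestion.
std::size_t relaunch_buffer_size()
{
  constexpr std::size_t N = 8;
  constexpr int relaunchN = 4;

  // Both values are guaranteed constant expressions, so they can feed a
  // static_assert and an array bound.
  static_assert(relaunchN > 0, "relaunch count must be positive");
  std::array<int, N * relaunchN> slots{};
  return slots.size();
}
```

Calling `relaunch_buffer_size()` yields the product of the two constants; changing either initializer to a runtime value would fail to compile rather than change behavior quietly.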

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between cfe7e26 and 97c6092.

📒 Files selected for processing (5)

c/experimental/stf/include/cccl/c/experimental/stf/stf.h
c/experimental/stf/src/stf.cu
c/experimental/stf/test/test_host_launch.cu
c/experimental/stf/test/test_stackable.cu
c/experimental/stf/test/test_stackable_token_push.cu

Address review feedback: verify the cudaMallocHost allocation for host_dep succeeds before using the pointer, so allocation failures fail the test instead of leading to UB / false passes.

caugonnet · 2026-06-03T12:10:02Z

/ok to test e042f4f

- test_host_launch: enable capture on the manual task inside the nested graph scope so it uses the graph's capture stream, fixing a race between the fill kernel and the host verifier (flaky passed == false).
- stf_stackable_push_repeat: reject count == 0 to avoid unsigned underflow / non-terminating loops, matching the documented contract.
- stf.h: document the lazy dep-A sync performed by the graph() accessors (stf_launchable_graph_graph / _shared_graph), which order the support stream behind the nested context's freeze events.
- test_stackable: check cudaMallocHost results via REQUIRE to avoid UB on allocation failure.
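
The count == 0 rejection can be motivated with a small host-only sketch. Both function names are hypothetical, and the use of `std::size_t` for the count is an assumption, since the real stf_stackable_push_repeat signature is not shown in this thread. The point is only that an unsigned "decrement, then test" counter wraps around when it starts at zero.

```cpp
#include <cstddef>

// Hypothetical stand-in for the argument check; not the actual C API code.
bool repeat_count_is_valid(std::size_t count)
{
  return count != 0;
}

// Iterations left after the first pass of a "decrement, then test" loop.
// For count == 0 the unsigned subtraction wraps to the largest value,
// which is what makes the loop effectively non-terminating.
std::size_t remaining_after_first_pass(std::size_t count)
{
  return count - 1;
}
```

Validating up front, as the commit does, keeps the wraparound from ever reaching the loop counter.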

caugonnet · 2026-06-08T18:46:14Z

@NaderAlAwar thanks these were really insightful comments ...

caugonnet · 2026-06-08T18:53:01Z

/ok to test a2bd359

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

c/experimental/stf/include/cccl/c/experimental/stf/stf.h (1)

1770-1776: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

important: mirror the outer-stream lifetime caveat on the shared graph accessor.

The last stf_launchable_graph_shared_free() implicitly runs stf_stackable_pop_epilogue(), so callers that embed the returned cudaGraph_t into an outer graph still need to keep the final shared handle alive until that outer launch stream has completed. stf_launchable_graph_shared_graph() documents the lazy dep-A sync, but not this matching teardown constraint, which leaves the same unfreeze-vs-execution race undocumented on the shared path.

Also applies to: 1840-1846

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between debf868 and a2bd359.

📒 Files selected for processing (4)

c/experimental/stf/include/cccl/c/experimental/stf/stf.h
c/experimental/stf/src/stf.cu
c/experimental/stf/test/test_host_launch.cu
c/experimental/stf/test/test_stackable.cu

🚧 Files skipped from review as they are similar to previous changes (2)

c/experimental/stf/test/test_stackable.cu
c/experimental/stf/src/stf.cu

github-actions · 2026-06-08T19:28:53Z

🥳 CI Workflow Results

🟩 Finished in 34m 15s: Pass: 100%/7 | Total: 51m 58s | Max: 34m 10s | Hits: 83%/948

See results here.

The shared launchable-graph path runs pop_epilogue() implicitly when the last reference is released via stf_launchable_graph_shared_free(). That epilogue unfreezes the pushed data and only synchronizes the support stream (dep A), not the user's outer launch stream. Mirror the outer-stream lifetime caveat already documented on the single-handle stf_launchable_graph_graph() onto stf_launchable_graph_shared_graph() and stf_launchable_graph_shared_free(), and add a worked example showing the safe ordering (sync the launch stream before freeing the last ref).
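
The ordering constraint described above can be modeled without CUDA. The following is a sketch under stated assumptions: `scope_state`, `make_handle`, and `safe_ordering` are hypothetical names, and a `std::shared_ptr` with a custom deleter stands in for the shared launchable handle whose last release runs the epilogue. It is not the STF implementation.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical model of the state behind a shared launchable-graph handle.
struct scope_state
{
  std::vector<std::string>* events;
};

using shared_handle = std::shared_ptr<scope_state>;

// The custom deleter plays the role of the implicit epilogue: it runs exactly
// once, when the last reference goes away.
shared_handle make_handle(std::vector<std::string>& events)
{
  return shared_handle(new scope_state{&events}, [](scope_state* s) {
    s->events->push_back("epilogue");
    delete s;
  });
}

// Safe ordering: synchronize the outer launch stream before dropping the last ref.
std::vector<std::string> safe_ordering()
{
  std::vector<std::string> events;
  shared_handle first  = make_handle(events);
  shared_handle second = first; // models a "dup" of the handle

  events.push_back("launch outer graph");
  first.reset(); // not the last reference, so no epilogue yet
  events.push_back("sync launch stream");
  second.reset(); // last reference: the epilogue runs here
  return events;
}
```

In this model the epilogue is always the final event, which is the property the documented caveat asks callers to preserve on the real shared path.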

stackable_ctx::finalize() asserts the stack is already back at the root and only finalizes the root context. The previous doc wrongly implied it blocks on tasks in still-open scopes. Reword to state that all nested scopes must be popped first and that finalize only waits for pending root-level work and releases root resources.

The pop prologue no longer instantiates a cudaGraphExec_t: it only pops pushed data and finalizes the child graph, with instantiation happening lazily on the first launch()/exec(). Update the pop_prologue / launch / exec docs to match that lazy-instantiation contract. Also remove the internal "dep A" label from the public C header (it was only ever defined in the C++ layer) and describe the synchronization in plain terms: ordering the support stream behind the nested context's freeze/get events.
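
A lazy-instantiation contract of this kind can be sketched in plain C++. The type `lazy_graph` and the helper `instantiations_after` are hypothetical, and an `int` stands in for a cudaGraphExec_t; this only illustrates "instantiate on first launch()/exec()", not the actual STF code.

```cpp
#include <optional>

// Hypothetical model: the executable is created on first use, not at pop time.
struct lazy_graph
{
  int instantiate_calls = 0;
  std::optional<int> exec_handle; // stands in for a cudaGraphExec_t

  // Returns the executable, instantiating it only if it does not exist yet.
  int& exec()
  {
    if (!exec_handle)
    {
      ++instantiate_calls;
      exec_handle = 1;
    }
    return *exec_handle;
  }

  void launch()
  {
    (void) exec(); // a real launch would submit the executable to a stream
  }
};

// Counts how many instantiations happen for a given number of launches.
int instantiations_after(int launches)
{
  lazy_graph g;
  for (int i = 0; i < launches; ++i)
  {
    g.launch();
  }
  return g.instantiate_calls;
}
```

With this shape, a scope that is popped but never launched pays no instantiation cost, and repeated launches reuse the same executable.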

…sk_get_graph

In a graph context, stf_task_get_custream() only returns a valid stream once stf_task_enable_capture() has been called. Previously it silently returned a null stream, so kernels launched on it ran on the NULL stream outside the STF graph (leaving the task's graph node empty) and only appeared correct because finalize() globally synchronizes. Now stf_task_get_custream() asserts the stream is non-null with a message pointing at the two supported options.

For the expert/explicit path, add stf_task_get_graph(), which returns the task's child cudaGraph_t so callers can add graph nodes directly instead of capturing a stream (mutually exclusive with capture). Backed by a new context::unified_task::get_graph() accessor.

Fix the stackable tests that launched kernels via stf_task_get_custream() inside a graph scope without enabling capture (test_stackable.cu and test_stackable_token_push.cu); they were silently running outside the graph.

Built with the cccl-c-stf preset and ran the stackable, stackable_token_push, and host_launch C tests (all pass).

stf_task_get_graph() returns a raw cudaGraph_t for explicit node insertion. In C++ this handle can be adopted via cuda::experimental::graph_builder::from_native_handle(), but cuda.core's Python GraphBuilder cannot adopt an existing cudaGraph_t (it is created from a device/stream and builds via capture). Document that this entry point is a C/C++ affordance that may not be backed by a Python interface, and that Python users should prefer the capture path.

…in release

The previous _CCCL_ASSERT guard in stf_task_get_custream() is gated on CCCL_ENABLE_HOST_ASSERTIONS and compiles to a no-op in default release builds (e.g. the cccl-c-stf preset), so the "launching outside the graph" misuse would still pass silently there. Switch to _CCCL_VERIFY, which is unconditionally enabled (even under NDEBUG) and is the macro reserved for critical always-on checks.

Verified empirically: with the guard a task that uses stf_task_get_custream() in a graph scope without stf_task_enable_capture() now aborts (SIGABRT) in the release build instead of silently corrupting results.
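
The difference between an opt-in assertion and an always-on check can be sketched as follows. `SKETCH_ASSERT`, `SKETCH_VERIFY`, and `SKETCH_ENABLE_HOST_ASSERTIONS` are hypothetical stand-ins written for this example; they are not the real _CCCL_ASSERT / _CCCL_VERIFY definitions, which are not shown in this thread.

```cpp
#include <cstdio>
#include <cstdlib>

// Always-on check: evaluates the condition and aborts on failure in any build.
#define SKETCH_VERIFY(cond, msg)                        \
  do                                                    \
  {                                                     \
    if (!(cond))                                        \
    {                                                   \
      std::fprintf(stderr, "verify failed: %s\n", msg); \
      std::abort();                                     \
    }                                                   \
  } while (0)

// Opt-in check: compiles to nothing unless the (hypothetical) switch is defined.
#ifdef SKETCH_ENABLE_HOST_ASSERTIONS
#  define SKETCH_ASSERT(cond, msg) SKETCH_VERIFY(cond, msg)
#else
#  define SKETCH_ASSERT(cond, msg) ((void) 0)
#endif

static int g_conditions_evaluated = 0;

// Records that a guard condition was actually evaluated at run time.
static bool counted(bool value)
{
  ++g_conditions_evaluated;
  return value;
}

// Number of guard evaluations when the guard is an opt-in assertion.
int evaluations_with_assert()
{
  g_conditions_evaluated = 0;
  SKETCH_ASSERT(counted(true), "stream must be non-null");
  return g_conditions_evaluated;
}

// Number of guard evaluations when the guard is an always-on check.
int evaluations_with_verify()
{
  g_conditions_evaluated = 0;
  SKETCH_VERIFY(counted(true), "stream must be non-null");
  return g_conditions_evaluated;
}
```

When the opt-in switch is not defined, the assertion's condition is never evaluated at all, which mirrors why the misuse could pass silently in a default release build.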

The else branch in unified_task::get_graph() is an unconditional misuse path (a stream-context task has no graph). Use _CCCL_VERIFY so it aborts in release builds too, instead of _CCCL_ASSERT which compiles out and silently returns a null graph.

andralex

lgtm

andralex · 2026-06-09T13:54:29Z

/ok to test df56c2d

NaderAlAwar · 2026-06-09T14:01:12Z

+        using task_t = ::std::decay_t<decltype(self)>;
+        if constexpr (::std::is_same_v<task_t, graph_task<Deps...>>)
+        {
+          return self.get_graph();


Important: stf_task_get_graph() is documented as mutually exclusive with the capture path, but this wrapper does not enforce that. If a C caller calls stf_task_enable_capture(t) and then stf_task_get_graph(t), graph_task::get_graph() creates a child graph, while end_uncleared() takes the capture/done_nodes path and never embeds that child graph, so nodes inserted through the returned graph are silently dropped.

indeed, done

The explicit-graph path (graph_task::get_graph()) is documented as mutually exclusive with stream capture, but nothing enforced it. With capture enabled, end_uncleared() takes the done_nodes path and never embeds the child graph, so nodes added through get_graph() were silently dropped. Add a _CCCL_VERIFY guard in get_graph() so this misuse aborts loudly (in release too) instead of corrupting results. Also add test_task_get_graph.cu exercising the explicit-graph path (adding a kernel node directly into the task's child graph, no capture).
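
The shape of such a mutual-exclusion guard can be sketched in host-only C++. `task_sketch` and both helper functions are hypothetical, and the sketch throws `std::logic_error` where the real guard aborts through _CCCL_VERIFY, purely so the misuse path can be observed without terminating the process.

```cpp
#include <stdexcept>

// Hypothetical model of a task with two mutually exclusive submission paths.
class task_sketch
{
  bool capture_enabled_ = false;

public:
  void enable_capture()
  {
    capture_enabled_ = true;
  }

  // Explicit-graph path. The sketch throws instead of aborting (assumption made
  // for testability); the real guard terminates the process.
  int get_graph()
  {
    if (capture_enabled_)
    {
      throw std::logic_error("get_graph() is mutually exclusive with capture");
    }
    return 1; // stands in for the task's child cudaGraph_t
  }
};

// Misuse: capture first, then ask for the explicit graph.
bool get_graph_after_capture_is_rejected()
{
  task_sketch t;
  t.enable_capture();
  try
  {
    (void) t.get_graph();
  }
  catch (const std::logic_error&)
  {
    return true;
  }
  return false;
}

// Supported use: explicit graph without capture.
bool get_graph_without_capture_succeeds()
{
  task_sketch t;
  return t.get_graph() == 1;
}
```

Failing loudly at the accessor is preferable to the previous behavior because the dropped nodes otherwise produce wrong results with no diagnostic.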

…eHost

The stackable nested-graph-scope example in stf.h launched work through stf_task_get_custream() without stf_task_enable_capture(), which now aborts. Add the enable_capture() call to the example and a note on stf_stackable_task_create() about the capture requirement (or the stf_task_get_graph() explicit path) inside a graph scope. Also check the cudaFreeHost result in test_task_get_graph.cu.

Wrap the previously unchecked cudaMallocHost (test_host_launch.cu) and cudaFreeHost calls across the stackable / host_launch C tests with REQUIRE(... == cudaSuccess), consistent with the rest of the suite, so allocation/free failures fail the test instead of being ignored.

Replace the unique_ptr/cudaFreeHost deleter (which ignored the free result) on the host-pinned test with the explicit checked alloc/free pattern used elsewhere, so a cudaFreeHost failure fails the test.
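
Why a smart-pointer deleter hides a failed free can be shown with a host-only sketch. Everything here is hypothetical: `fake_free_host` stands in for cudaFreeHost and is rigged to report failure, and `std::malloc` stands in for the pinned allocation.

```cpp
#include <cstdlib>
#include <memory>

enum sketch_status
{
  sketch_success = 0,
  sketch_failure = 1
};

// Hypothetical stand-in for cudaFreeHost that always reports a failure.
sketch_status fake_free_host(void* p)
{
  std::free(p);
  return sketch_failure;
}

// Deleter pattern: the status has nowhere to go, so it is discarded.
struct ignoring_deleter
{
  void operator()(void* p) const
  {
    (void) fake_free_host(p);
  }
};

// The deleter runs at scope exit, but the caller never learns about the failure.
bool deleter_path_reports_failure()
{
  {
    std::unique_ptr<void, ignoring_deleter> buffer(std::malloc(16));
  }
  return false;
}

// Explicit pattern: the status is returned to the caller and can be checked.
bool explicit_path_reports_failure()
{
  void* buffer = std::malloc(16);
  return fake_free_host(buffer) != sketch_success;
}
```

In the tests, the explicit form is what allows a REQUIRE(... == cudaSuccess) style check to turn a failed free into a test failure.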

caugonnet · 2026-06-09T14:29:32Z

Many thanks @NaderAlAwar !

caugonnet · 2026-06-09T14:29:50Z

/ok to test 726bad1

caugonnet · 2026-06-09T14:47:02Z

/ok to test 6c441e6

caugonnet requested a review from a team as a code owner June 3, 2026 09:21

caugonnet requested a review from rwgk June 3, 2026 09:21

github-project-automation Bot added this to CCCL Jun 3, 2026

github-project-automation Bot moved this to Todo in CCCL Jun 3, 2026

cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 3, 2026

caugonnet self-assigned this Jun 3, 2026

caugonnet added the stf Sequential Task Flow programming model label Jun 3, 2026

Merge branch 'main' into stf_c_api_stackable

26e1bf9

coderabbitai Bot reviewed Jun 3, 2026

Comment thread c/experimental/stf/test/test_stackable_token_push.cu Outdated

[STF] Check cudaMallocHost result in test_stackable_token_push

debf868

Merge branch 'main' into stf_c_api_stackable

e042f4f

Merge branch 'main' into stf_c_api_stackable

afa69ec

NaderAlAwar reviewed Jun 8, 2026

Comment thread c/experimental/stf/test/test_host_launch.cu

Comment thread c/experimental/stf/include/cccl/c/experimental/stf/stf.h Outdated

Comment thread c/experimental/stf/src/stf.cu

Comment thread c/experimental/stf/test/test_stackable.cu Outdated

Merge branch 'main' into stf_c_api_stackable

a2bd359

caugonnet enabled auto-merge (squash) June 8, 2026 18:53

coderabbitai Bot reviewed Jun 8, 2026

caugonnet and others added 2 commits June 8, 2026 21:48

Merge branch 'main' into stf_c_api_stackable

ee14a67

NaderAlAwar reviewed Jun 9, 2026

Comment thread c/experimental/stf/src/stf.cu

Comment thread c/experimental/stf/include/cccl/c/experimental/stf/stf.h Outdated

Comment thread c/experimental/stf/include/cccl/c/experimental/stf/stf.h Outdated

caugonnet and others added 4 commits June 9, 2026 06:57

Merge branch 'main' into stf_c_api_stackable

900c7c2

caugonnet requested a review from a team as a code owner June 9, 2026 07:26

caugonnet requested a review from andralex June 9, 2026 07:26

caugonnet and others added 4 commits June 9, 2026 09:35

Merge branch 'main' into stf_c_api_stackable

7807856

andralex approved these changes Jun 9, 2026

Merge branch 'main' into stf_c_api_stackable

df56c2d

NaderAlAwar reviewed Jun 9, 2026

caugonnet added 4 commits June 9, 2026 16:16

[STF] Check cudaFreeHost result in test_logical_data_with_place

726bad1

NaderAlAwar approved these changes Jun 9, 2026

Merge branch 'main' into stf_c_api_stackable

6c441e6
