Skip to content

[STF] Add C bindings for stackable contexts#9233

Open
caugonnet wants to merge 23 commits into
NVIDIA:mainfrom
caugonnet:stf_c_api_stackable
Open

[STF] Add C bindings for stackable contexts#9233
caugonnet wants to merge 23 commits into
NVIDIA:mainfrom
caugonnet:stf_c_api_stackable

Conversation

@caugonnet

@caugonnet caugonnet commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Description

Extracted from #5315 (STF Python bindings) to keep that PR reviewable. This PR contains only the stackable-context additions to the experimental STF C API.

Adds C bindings for the stackable-context layer:

  • stackable context create/finalize/fence and push_graph/pop, plus launchable graphs (launch/exec/stream/graph) and a shared-ownership flavor for replayable / embeddable graphs.
  • while- and repeat-loop scopes built on CUDA conditional graph nodes, with a built-in scalar comparison condition helper (CUDA 12.4+).
  • stackable logical data and tokens, and stackable host_launch deps.

Tests

  • test_stackable.cu, test_stackable_token_push.cu, and stackable coverage added to test_host_launch.cu.

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes (Doxygen comments in the header).

Extends the experimental STF C API with the stackable-context layer:

- stackable context create/finalize/fence and push_graph/pop, plus
  launchable graphs (launch/exec/stream/graph) and a shared-ownership
  flavor for replayable / embeddable graphs.
- while- and repeat-loop scopes built on CUDA conditional graph nodes,
  with a built-in scalar comparison condition helper (CUDA 12.4+).
- stackable logical data and tokens, and stackable host_launch deps.

Adds test_stackable.cu, test_stackable_token_push.cu, and stackable
coverage in test_host_launch.cu. Extracted from the python-bindings PR
to keep that change reviewable.
@caugonnet caugonnet requested a review from a team as a code owner June 3, 2026 09:21
@caugonnet caugonnet requested a review from rwgk June 3, 2026 09:21
@copy-pr-bot

copy-pr-bot Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jun 3, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 3, 2026
@caugonnet caugonnet self-assigned this Jun 3, 2026
@caugonnet caugonnet added the stf Sequential Task Flow programming model label Jun 3, 2026
@caugonnet

Copy link
Copy Markdown
Contributor Author

/ok to test 26e1bf9

@coderabbitai

coderabbitai Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

suggestion:

Walkthrough

suggestion: Adds a StackableContext C API and implementation enabling nested STF graph scopes with two-phase pop (prologue/epilogue), relaunchable launchable-graphs (shared and non-shared), CUDA 12.4+ while/repeat scopes, stack-aware logical-data/tokens, task and host-launch bindings, and accompanying tests.

Changes

Stackable Context API

Layer / File(s) Summary
Core implementation utilities
c/experimental/stf/src/stf.cu
Adds <memory> include and internal helpers for smart-pointer-backed shared launchable handles, stackable logical-data/token discriminator, and device scalar-condition kernel template.
C header declarations for stackable APIs
c/experimental/stf/include/cccl/c/experimental/stf/stf.h
Appends StackableContext header group: context lifecycle (create/finalize/fence), push_graph/pop, non-shared and shared launchable-graph prologue/epilogue and accessors, CUDA 12.4+ while/repeat scope types and while_cond_scalar, logical-data/token constructors and modifiers, task and host-launch stackable APIs, and moves extern "C" close to file end.
C ABI externs implementation
c/experimental/stf/src/stf.cu
Implements extern "C" entry points for stackable ctx push/pop/prologue/epilogue, launchable graph operations (non-shared + shared dup/free/valid/launch), while/repeat scope management and scalar-condition updater, stackable logical-data/token create/set/push/destroy, and stackable task + host-launch dependency binding and submission with scope validation.
Stackable integration tests
c/experimental/stf/test/test_stackable.cu
Tests for nested push_graph/pop, pop_prologue relaunch and zero-launch semantics, accessor guarantees (graph/exec/stream), child-graph embedding/instantiate, shared-handle dup/free semantics, nested scopes, token+fence sequencing, and repeat/while regression sweep (toolkit-gated).
Host-launch tests
c/experimental/stf/test/test_host_launch.cu
Two tests validating stackable host_launch behavior with pinned host memory: one standard flow and one inside a nested child-graph scope (capture-enabled).
Token regression tests
c/experimental/stf/test/test_stackable_token_push.cu
Regression and sanity tests exercising token-based dependencies in push_graph/pop_prologue and pop_prologue_shared, repeatedly launching to validate no hard abort and exercising a real-data workaround.

Possibly related PRs

  • NVIDIA/cccl#9178: Introduces the underlying C++ stackable_ctx::pop_prologue/pop_epilogue and launchable_graph_handle/pop_prologue_shared infrastructure that this PR exposes as a public C API.

Suggested reviewers

  • NaderAlAwar
  • andralex
  • oleksandr-pavlyk

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (5)
c/experimental/stf/include/cccl/c/experimental/stf/stf.h (1)

1676-1685: 💤 Low value

suggestion: The stf_while_scope_handle and stf_repeat_scope_handle typedefs are declared outside the CUDART_VERSION >= 12040 guard (lines 1676-1680), while the functions using them are inside (starting at line 1685). On older CUDA, these types exist but are unusable. Consider moving the typedefs inside the guard for consistency, or add a comment explaining why they're intentionally exposed unconditionally (e.g., for FFI struct compatibility).

c/experimental/stf/test/test_stackable_token_push.cu (4)

33-33: 💤 Low value

suggestion: Remove unused <cmath> include. No math functions are called in this file.

As per coding guidelines: Remove unused headers.


79-79: 💤 Low value

suggestion: Use constexpr instead of const for relaunchN. This variable is a compile-time constant and should be marked constexpr per the coding guidelines.

-  const int relaunchN = 4;
+  constexpr int relaunchN = 4;

As per coding guidelines: All variables that can be evaluated at compile-time must use constexpr.


131-131: 💤 Low value

suggestion: Use constexpr instead of const for relaunchN. This variable is a compile-time constant and should be marked constexpr per the coding guidelines.

-  const int relaunchN = 4;
+  constexpr int relaunchN = 4;

As per coding guidelines: All variables that can be evaluated at compile-time must use constexpr.


182-183: 💤 Low value

suggestion: Use constexpr instead of const for N and relaunchN. These variables are compile-time constants and should be marked constexpr per the coding guidelines.

-  const size_t N      = 8;
-  const int relaunchN = 4;
+  constexpr size_t N      = 8;
+  constexpr int relaunchN = 4;

As per coding guidelines: All variables that can be evaluated at compile-time must use constexpr.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c8848a75-324b-45c1-ae60-060a40d953b4

📥 Commits

Reviewing files that changed from the base of the PR and between cfe7e26 and 97c6092.

📒 Files selected for processing (5)
  • c/experimental/stf/include/cccl/c/experimental/stf/stf.h
  • c/experimental/stf/src/stf.cu
  • c/experimental/stf/test/test_host_launch.cu
  • c/experimental/stf/test/test_stackable.cu
  • c/experimental/stf/test/test_stackable_token_push.cu

Comment thread c/experimental/stf/test/test_stackable_token_push.cu Outdated
Address review feedback: verify the cudaMallocHost allocation for
host_dep succeeds before using the pointer, so allocation failures
fail the test instead of leading to UB / false passes.
@github-actions

This comment has been minimized.

@caugonnet

Copy link
Copy Markdown
Contributor Author

/ok to test e042f4f

@github-actions

This comment has been minimized.

Comment thread c/experimental/stf/test/test_host_launch.cu
Comment thread c/experimental/stf/include/cccl/c/experimental/stf/stf.h Outdated
Comment thread c/experimental/stf/src/stf.cu
Comment thread c/experimental/stf/test/test_stackable.cu Outdated
- test_host_launch: enable capture on the manual task inside the nested
  graph scope so it uses the graph's capture stream, fixing a race
  between the fill kernel and the host verifier (flaky passed == false).
- stf_stackable_push_repeat: reject count == 0 to avoid unsigned
  underflow / non-terminating loops, matching the documented contract.
- stf.h: document the lazy dep-A sync performed by the graph() accessors
  (stf_launchable_graph_graph / _shared_graph), which order the support
  stream behind the nested context's freeze events.
- test_stackable: check cudaMallocHost results via REQUIRE to avoid UB
  on allocation failure.
@caugonnet

Copy link
Copy Markdown
Contributor Author

@NaderAlAwar thanks these were really insightful comments ...

@caugonnet

Copy link
Copy Markdown
Contributor Author

/ok to test a2bd359

@caugonnet caugonnet enabled auto-merge (squash) June 8, 2026 18:53

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
c/experimental/stf/include/cccl/c/experimental/stf/stf.h (1)

1770-1776: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

important: mirror the outer-stream lifetime caveat on the shared graph accessor.

The last stf_launchable_graph_shared_free() implicitly runs stf_stackable_pop_epilogue(), so callers that embed the returned cudaGraph_t into an outer graph still need to keep the final shared handle alive until that outer launch stream has completed. stf_launchable_graph_shared_graph() documents the lazy dep-A sync, but not this matching teardown constraint, which leaves the same unfreeze-vs-execution race undocumented on the shared path.

Also applies to: 1840-1846


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a66f72ce-e208-4587-bc4d-d47f92bb7f8e

📥 Commits

Reviewing files that changed from the base of the PR and between debf868 and a2bd359.

📒 Files selected for processing (4)
  • c/experimental/stf/include/cccl/c/experimental/stf/stf.h
  • c/experimental/stf/src/stf.cu
  • c/experimental/stf/test/test_host_launch.cu
  • c/experimental/stf/test/test_stackable.cu
🚧 Files skipped from review as they are similar to previous changes (2)
  • c/experimental/stf/test/test_stackable.cu
  • c/experimental/stf/src/stf.cu

@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

🥳 CI Workflow Results

🟩 Finished in 34m 15s: Pass: 100%/7 | Total: 51m 58s | Max: 34m 10s | Hits: 83%/948

See results here.

caugonnet and others added 2 commits June 8, 2026 21:48
The shared launchable-graph path runs pop_epilogue() implicitly when the
last reference is released via stf_launchable_graph_shared_free(). That
epilogue unfreezes the pushed data and only synchronizes the support
stream (dep A), not the user's outer launch stream. Mirror the
outer-stream lifetime caveat already documented on the single-handle
stf_launchable_graph_graph() onto stf_launchable_graph_shared_graph()
and stf_launchable_graph_shared_free(), and add a worked example showing
the safe ordering (sync the launch stream before freeing the last ref).
Comment thread c/experimental/stf/src/stf.cu
Comment thread c/experimental/stf/include/cccl/c/experimental/stf/stf.h Outdated
Comment thread c/experimental/stf/include/cccl/c/experimental/stf/stf.h Outdated
caugonnet and others added 4 commits June 9, 2026 06:57
stackable_ctx::finalize() asserts the stack is already back at the root
and only finalizes the root context. The previous doc wrongly implied it
blocks on tasks in still-open scopes. Reword to state that all nested
scopes must be popped first and that finalize only waits for pending
root-level work and releases root resources.
The pop prologue no longer instantiates a cudaGraphExec_t: it only pops
pushed data and finalizes the child graph, with instantiation happening
lazily on the first launch()/exec(). Update the pop_prologue / launch /
exec docs to match that lazy-instantiation contract.

Also remove the internal "dep A" label from the public C header (it was
only ever defined in the C++ layer) and describe the synchronization in
plain terms: ordering the support stream behind the nested context's
freeze/get events.
…sk_get_graph

In a graph context, stf_task_get_custream() only returns a valid stream
once stf_task_enable_capture() has been called. Previously it silently
returned a null stream, so kernels launched on it ran on the NULL stream
outside the STF graph (leaving the task's graph node empty) and only
appeared correct because finalize() globally synchronizes. Now
stf_task_get_custream() asserts the stream is non-null with a message
pointing at the two supported options.

For the expert/explicit path, add stf_task_get_graph(), which returns the
task's child cudaGraph_t so callers can add graph nodes directly instead
of capturing a stream (mutually exclusive with capture). Backed by a new
context::unified_task::get_graph() accessor.

Fix the stackable tests that launched kernels via stf_task_get_custream()
inside a graph scope without enabling capture (test_stackable.cu and
test_stackable_token_push.cu); they were silently running outside the
graph. Built with the cccl-c-stf preset and ran the stackable,
stackable_token_push, and host_launch C tests (all pass).
@caugonnet caugonnet requested a review from a team as a code owner June 9, 2026 07:26
@caugonnet caugonnet requested a review from andralex June 9, 2026 07:26
caugonnet and others added 4 commits June 9, 2026 09:35
stf_task_get_graph() returns a raw cudaGraph_t for explicit node
insertion. In C++ this handle can be adopted via
cuda::experimental::graph_builder::from_native_handle(), but cuda.core's
Python GraphBuilder cannot adopt an existing cudaGraph_t (it is created
from a device/stream and builds via capture). Document that this entry
point is a C/C++ affordance that may not be backed by a Python interface,
and that Python users should prefer the capture path.
…in release

The previous _CCCL_ASSERT guard in stf_task_get_custream() is gated on
CCCL_ENABLE_HOST_ASSERTIONS and compiles to a no-op in default release
builds (e.g. the cccl-c-stf preset), so the "launching outside the graph"
misuse would still pass silently there. Switch to _CCCL_VERIFY, which is
unconditionally enabled (even under NDEBUG) and is the macro reserved for
critical always-on checks. Verified empirically: with the guard a task
that uses stf_task_get_custream() in a graph scope without
stf_task_enable_capture() now aborts (SIGABRT) in the release build
instead of silently corrupting results.
The else branch in unified_task::get_graph() is an unconditional misuse
path (a stream-context task has no graph). Use _CCCL_VERIFY so it aborts
in release builds too, instead of _CCCL_ASSERT which compiles out and
silently returns a null graph.

@andralex andralex left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@andralex

andralex commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

/ok to test df56c2d

using task_t = ::std::decay_t<decltype(self)>;
if constexpr (::std::is_same_v<task_t, graph_task<Deps...>>)
{
return self.get_graph();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important: stf_task_get_graph() is documented as mutually exclusive with the capture path, but this wrapper does not enforce that. If a C caller calls stf_task_enable_capture(t) and then stf_task_get_graph(t), graph_task::get_graph() creates a child graph, while end_uncleared() takes the capture/done_nodes path and never embeds that child graph, so nodes inserted through the returned graph are silently dropped

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed, done

Comment thread c/experimental/stf/include/cccl/c/experimental/stf/stf.h
caugonnet added 4 commits June 9, 2026 16:16
The explicit-graph path (graph_task::get_graph()) is documented as
mutually exclusive with stream capture, but nothing enforced it. With
capture enabled, end_uncleared() takes the done_nodes path and never
embeds the child graph, so nodes added through get_graph() were silently
dropped. Add a _CCCL_VERIFY guard in get_graph() so this misuse aborts
loudly (in release too) instead of corrupting results.

Also add test_task_get_graph.cu exercising the explicit-graph path
(adding a kernel node directly into the task's child graph, no capture).
…eHost

The stackable nested-graph-scope example in stf.h launched work through
stf_task_get_custream() without stf_task_enable_capture(), which now
aborts. Add the enable_capture() call to the example and a note on
stf_stackable_task_create() about the capture requirement (or the
stf_task_get_graph() explicit path) inside a graph scope.

Also check the cudaFreeHost result in test_task_get_graph.cu.
Wrap the previously unchecked cudaMallocHost (test_host_launch.cu) and
cudaFreeHost calls across the stackable / host_launch C tests with
REQUIRE(... == cudaSuccess), consistent with the rest of the suite, so
allocation/free failures fail the test instead of being ignored.
Replace the unique_ptr/cudaFreeHost deleter (which ignored the free
result) on the host-pinned test with the explicit checked alloc/free
pattern used elsewhere, so a cudaFreeHost failure fails the test.
@caugonnet

Copy link
Copy Markdown
Contributor Author

Many thanks @NaderAlAwar !

@caugonnet

Copy link
Copy Markdown
Contributor Author

/ok to test 726bad1

@caugonnet

Copy link
Copy Markdown
Contributor Author

/ok to test 6c441e6

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

stf Sequential Task Flow programming model

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

3 participants