Add CPU callbacks for stream capture & Cythonize GraphBuilder#1814

Open
Andy-Jost wants to merge 4 commits into NVIDIA:main from Andy-Jost:cythonize-graph-builder

Conversation

@Andy-Jost
Contributor

Summary

  • Implements #1328 (CUDA graph phase N - CPU callbacks & user objects): adds CPU callbacks during stream capture via GraphBuilder.callback(), mirroring the existing GraphDef.callback() API
  • Cythonizes GraphBuilder: moves the implementation from pure Python in _graph/__init__.py to Cython in _graph/_graph_builder.pyx
  • Extracts shared callback infrastructure into _graph/_utils.pyx to avoid circular imports

Changes

  • _graph/_utils.pyx / _graph/_utils.pxd (new): Shared callback infrastructure — _attach_user_object, _attach_host_callback_to_graph, _py_host_trampoline/_py_host_destructor, and _is_py_host_trampoline helper
  • _graph/_graph_builder.pyx: Converted from .py to .pyx; added callback() method using cuLaunchHostFunc; centralized version caching via get_driver_version()/get_binding_version() (removed per-module _lazy_init)
  • _graph/_graphdef.pyx: Refactored GraphNode.callback() to use the shared _attach_host_callback_to_graph helper; removed duplicated callback infrastructure

Test Coverage

  • New tests in tests/graph/test_basic.py: Python callable callback, ctypes CFuncPtr callback with user_data, and user_data rejection for Python callables
  • All existing explicit graph callback and lifetime tests continue to pass
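
The three new test cases above correspond to a simple dispatch on the callback type. The following is a minimal sketch of that contract in plain Python, not the PR's implementation: `classify_callback` is a hypothetical helper, and the exception types it raises are assumptions, since the description only says that `user_data` is rejected for Python callables.

```python
import ctypes


def classify_callback(fn, user_data=None):
    """Hypothetical helper: decide which callback mode applies to ``fn``."""
    # ctypes function pointers are also callable, so test for them first.
    # ctypes._CFuncPtr is the common base class of all CFUNCTYPE instances.
    if isinstance(fn, ctypes._CFuncPtr):
        return "ctypes"
    if callable(fn):
        if user_data is not None:
            # Assumption: the real code may raise a different exception type.
            raise ValueError("user_data is only supported for ctypes function pointers")
        return "python"
    raise TypeError("fn must be a callable or a ctypes function pointer")


HOST_FN = ctypes.CFUNCTYPE(None, ctypes.c_void_p)
c_fn = HOST_FN(lambda user_data: None)

mode_python = classify_callback(lambda: None)
mode_ctypes = classify_callback(c_fn, user_data=b"payload")

try:
    classify_callback(lambda: None, user_data=b"payload")
    rejected = False
except ValueError:
    rejected = True
```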

Related Work

Made with Cursor

Move the GraphBuilder/Graph/GraphCompleteOptions/GraphDebugPrintOptions
implementation out of _graph/__init__.py into _graph/_graph_builder.pyx
so it is compiled by Cython. A thin __init__.py re-exports the public
names so all existing import sites continue to work unchanged.
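
The re-export pattern described above can be sketched with a throwaway package. This is only an illustration: the package name `demo_graph` and the empty classes are hypothetical, and plain `.py` files stand in for the compiled `.pyx` module.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk: demo_graph/_graph_builder.py holds the
# implementation, and demo_graph/__init__.py only re-exports the public names.
pkg_root = Path(tempfile.mkdtemp())
pkg = pkg_root / "demo_graph"
pkg.mkdir()
(pkg / "_graph_builder.py").write_text(
    "class GraphBuilder:\n"
    "    pass\n"
    "\n"
    "class Graph:\n"
    "    pass\n"
)
(pkg / "__init__.py").write_text(
    "from ._graph_builder import Graph, GraphBuilder\n"
    "\n"
    "__all__ = ['Graph', 'GraphBuilder']\n"
)

sys.path.insert(0, str(pkg_root))
demo_graph = importlib.import_module("demo_graph")
impl = importlib.import_module("demo_graph._graph_builder")

# Import sites that use the package name resolve to the same class objects
# that now live in the implementation module.
same_class = demo_graph.GraphBuilder is impl.GraphBuilder
```

Because the package-level name is the same object as the one in the implementation module, `isinstance` checks and pickling by qualified name behave the same from either import path.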

Cython compatibility adjustments:
- Remove `from __future__ import annotations` (unsupported by Cython)
- Remove TYPE_CHECKING guard; quote annotations that reference Stream
  (circular import), forward-reference GraphBuilder/Graph, or use
  X | None union syntax
- Update _graphdef.pyx lazy imports to point directly at _graph_builder
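
As a rough illustration of the annotation changes listed above, the sketch below quotes every annotation that cannot be evaluated at definition time. It runs as plain Python; the method names and signatures are simplified stand-ins, not the real GraphBuilder API.

```python
class GraphBuilder:
    # Forward reference: the name GraphBuilder is not bound yet while the
    # class body is executing, so the return annotation is quoted.
    def split(self, count: int) -> "tuple[GraphBuilder, ...]":
        return tuple(GraphBuilder() for _ in range(count))

    # "Stream" is defined later in this file (standing in for a type that
    # would otherwise need a circular import), and the X | None union is
    # quoted as well so it is never evaluated at definition time.
    def attach(self, stream: "Stream | None" = None) -> "GraphBuilder":
        self.stream = stream
        return self


class Stream:
    pass


# Quoted annotations are stored as plain strings.
raw = GraphBuilder.attach.__annotations__
branches = GraphBuilder().split(2)
```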

No build_hooks.py changes needed — the build system auto-discovers .pyx
files via glob.

Ref: NVIDIA#1076
Made-with: Cursor

Replace the per-module _lazy_init / _inited / _driver_ver / _py_major_minor
pattern in _graph_builder.pyx with direct calls to centralized cached
functions in cuda_utils:

- Add get_driver_version() with @functools.cache alongside get_binding_version
- Switch get_binding_version from @functools.lru_cache to @functools.cache
  (cleaner for nullary functions)
- Fix split() to return tuple(result) — Cython enforces return type
  annotations unlike pure Python
- Fix _cond_with_params annotation from -> GraphBuilder to -> tuple
  to match actual return value
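
A minimal sketch of the caching pattern this commit describes, assuming a nullary version query: `_query_driver_version` and its return value are hypothetical stand-ins for the real driver call.

```python
import functools

call_count = 0


def _query_driver_version() -> int:
    """Hypothetical stand-in for the real (expensive) driver query."""
    global call_count
    call_count += 1
    return 12090  # placeholder value, not a real query result


@functools.cache  # unbounded cache; for a nullary function it stores one value
def get_driver_version() -> int:
    return _query_driver_version()


first = get_driver_version()
second = get_driver_version()  # served from the cache, no second query
```

With this in place, call sites can call `get_driver_version()` directly wherever they need it, which is what lets the per-module `_lazy_init` state go away.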

Made-with: Cursor
Andy-Jost added this to the cuda.core v1.0.0 milestone Mar 24, 2026
Andy-Jost added the P0 (High priority - Must do!), feature (New feature or request), and cuda.core (Everything related to the cuda.core module) labels Mar 24, 2026
Andy-Jost self-assigned this Mar 24, 2026
Andy-Jost requested review from cpcloud, leofang, mdboom, rparolin, and rwgk, and removed the requests for mdboom, rparolin, and rwgk, on March 24, 2026 at 20:37
Contributor Author

This file was moved to _graph/_graph_builder.pyx and replaced with a thin re-exporter

Contributor Author

Moved from _graph/__init__.py. Changes to this file:

  1. Replaced explicit _lazy_init with direct calls to cached functions get_binding_version and get_driver_version
  2. Added GraphBuilder.callback

Comment on lines +690 to +735
    def callback(self, fn, *, user_data=None):
        """Add a host callback to the graph during stream capture.

        The callback runs on the host CPU when the graph reaches this point
        in execution. Two modes are supported:

        - **Python callable**: Pass any callable. The GIL is acquired
          automatically. The callable must take no arguments; use closures
          or ``functools.partial`` to bind state.
        - **ctypes function pointer**: Pass a ``ctypes.CFUNCTYPE`` instance.
          The function receives a single ``void*`` argument (the
          ``user_data``). The caller must keep the ctypes wrapper alive
          for the lifetime of the graph.

        .. warning::

            Callbacks must not call CUDA API functions. Doing so may
            deadlock or corrupt driver state.

        Parameters
        ----------
        fn : callable or ctypes function pointer
            The callback function.
        user_data : int or bytes-like, optional
            Only for ctypes function pointers. If ``int``, passed as a raw
            pointer (caller manages lifetime). If bytes-like, the data is
            copied and its lifetime is tied to the graph.
        """
        cdef Stream stream = <Stream>self._mnff.stream
        cdef cydriver.CUstream c_stream = as_cu(stream._h_stream)
        cdef cydriver.CUstreamCaptureStatus capture_status
        cdef cydriver.CUgraph c_graph = NULL

        with nogil:
            HANDLE_RETURN(cydriver.cuStreamGetCaptureInfo(
                c_stream, &capture_status, NULL, &c_graph, NULL, NULL, NULL))

        if capture_status != cydriver.CU_STREAM_CAPTURE_STATUS_ACTIVE:
            raise RuntimeError("Cannot add callback when graph is not being built")

        cdef cydriver.CUhostFn c_fn
        cdef void* c_user_data = NULL
        _attach_host_callback_to_graph(c_graph, fn, user_data, &c_fn, &c_user_data)

        with nogil:
            HANDLE_RETURN(cydriver.cuLaunchHostFunc(c_stream, c_fn, c_user_data))
Contributor Author

New code
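
The two callback modes in the docstring above can be tried out without a GPU. The sketch below only shows how each kind of callable would be prepared and what it receives; the callbacks are invoked directly as a stand-in for the driver, and nothing here calls GraphBuilder.callback() itself. The names `record`, `HOST_FN`, and `payload` are illustrative.

```python
import ctypes
import functools

# Mode 1: a Python callable that takes no arguments. State is bound up front
# with functools.partial, as the docstring suggests.
events = []


def record(tag):
    events.append(tag)


py_callback = functools.partial(record, "step-1")

# Mode 2: a ctypes function pointer with the C signature void fn(void*).
HOST_FN = ctypes.CFUNCTYPE(None, ctypes.c_void_p)
seen = []


@HOST_FN
def c_callback(user_data):
    # For a c_void_p parameter, ctypes passes the address as a Python int
    # (or None for NULL). Reinterpret it as the int the caller stored there.
    seen.append(ctypes.c_int.from_address(user_data).value)


# The "int" form of user_data: a raw address whose lifetime the caller
# manages. Both payload and c_callback must stay alive while in use.
payload = ctypes.c_int(42)
address = ctypes.addressof(payload)

# Stand-in for the driver invoking the callbacks at graph execution time.
py_callback()
c_callback(address)
```

Keeping `c_callback` and `payload` referenced at module scope mirrors the docstring's requirement that the caller keep the ctypes wrapper, and any raw-pointer data, alive for the lifetime of the graph.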

Implements NVIDIA#1328: host callbacks during stream capture via
cuLaunchHostFunc, mirroring the existing GraphDef.callback API.

Extracts shared callback infrastructure (_attach_user_object,
_attach_host_callback_to_graph, trampoline/destructor) into a new
_graph/_utils.pyx module to avoid circular imports between
_graph_builder and _graphdef.

Made-with: Cursor
Andy-Jost force-pushed the cythonize-graph-builder branch from 39e5c57 to edbc361 on March 24, 2026 at 20:46
Contributor

cpcloud left a comment

GraphBuilder.callback() in cuda_core/cuda/core/_graph/_graph_builder.pyx now calls cydriver.cuStreamGetCaptureInfo(...) directly. That Cython symbol is only generated when the bindings headers expose cuStreamGetCaptureInfo_v3 (cuda_bindings/cuda/bindings/cydriver.pxd.in), but CI still rebuilds cuda_core against the previous supported CUDA major (12.9.1).

That matches the PR's current Linux build failures in the second Build cuda.core wheel phase. Please switch this path back to the existing Python wrapper (driver.cuStreamGetCaptureInfo(...)) or otherwise gate/fallback the direct C call so the callback implementation still builds against the CUDA 12 compatibility configuration.

@Andy-Jost
Contributor Author

> GraphBuilder.callback() in cuda_core/cuda/core/_graph/_graph_builder.pyx now calls cydriver.cuStreamGetCaptureInfo(...) directly. That Cython symbol is only generated when the bindings headers expose cuStreamGetCaptureInfo_v3 (cuda_bindings/cuda/bindings/cydriver.pxd.in), but CI still rebuilds cuda_core against the previous supported CUDA major (12.9.1).
>
> That matches the PR's current Linux build failures in the second Build cuda.core wheel phase. Please switch this path back to the existing Python wrapper (driver.cuStreamGetCaptureInfo(...)) or otherwise gate/fallback the direct C call so the callback implementation still builds against the CUDA 12 compatibility configuration.

Thanks, @cpcloud. This is fixed with edbc361

Development

Successfully merging this pull request may close these issues.

CUDA graph phase N - CPU callbacks & user objects

2 participants