Conversation

@Andy-Jost (Contributor)

Description

Implements GraphMemoryResource for memory operations through the graph memory allocator. Allocations from this object succeed only while graph capture is active. Conversely, allocations from DeviceMemoryResource now raise an exception when graph capture is active.

A new test module is added.

This change also simplifies and extends the logic for accepting arbitrary stream parameters as objects implementing __cuda_stream__. Support for that protocol was added in several places, allowing GraphBuilder to be used anywhere a stream is expected, including memory resource and buffer methods.
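
A minimal sketch of what this enables (empty_kernel is a placeholder kernel from the test suite; the begin/end calls follow the patterns quoted in the review below):

from cuda.core.experimental import Device, LaunchConfig, launch

device = Device()
device.set_current()
gb = device.create_graph_builder().begin_building()
# GraphBuilder implements __cuda_stream__, so it can stand in for a Stream
# in kernel launches, memory resource methods, and buffer methods.
launch(gb, LaunchConfig(grid=1, block=1), empty_kernel)
gb.end_building().complete()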

closes #963

@Andy-Jost added this to the cuda.core beta 9 milestone Nov 12, 2025
@Andy-Jost self-assigned this Nov 12, 2025
@Andy-Jost added the enhancement, P0, and cuda.core labels Nov 12, 2025
@copy-pr-bot (bot) commented Nov 12, 2025:

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


from cuda.bindings cimport cydriver
from cuda.core.experimental._memory._buffer cimport MemoryResource
from cuda.core.experimental._memory._device_memory_resource cimport DeviceMemoryResource
@Andy-Jost (Contributor Author):

I'll remove the cruft here.

if _settable:
    def fset(GraphMemoryResourceAttributes self, uint64_t value):
        if value != 0:
            raise AttributeError(f"Attribute {stub.__name__!r} may only be set to zero (got {value}).")
@Andy-Jost (Contributor Author):

The driver checks for this condition in cuDeviceSetGraphMemAttribute and issues a log message: "High watermark can only be reset to 0"

It's a shame we cannot access that message programmatically for use in the Python error.

@Andy-Jost (Contributor Author):

Good news: CUDA 13 adds functions for error log management. It looks like cuLogsRegisterCallback might help here.

    ctx = self._get_current_context()
    return Event._init(self._id, ctx, options, True)

def allocate(self, size, stream: Stream | None = None) -> Buffer:
@Andy-Jost (Contributor Author):

I updated Stream arguments to IsStreamT throughout.

Collaborator:

I was confused by this change. Why did we do that?

@Andy-Jost (Contributor Author):

The GraphBuilder object contains a stream and can be used in place of one. In test_graph.py you can find many statements like this: launch(gb, LaunchConfig(grid=1, block=1), empty_kernel).

This was only partly implemented. For instance, launch contained logic to convert arguments to streams, but nothing in the memory module did.

This change extends the support. In general, we should uniformly use IsStreamT for any API function that accepts a stream-like argument and then call Stream._init(arg) to convert it.
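
A sketch of that convention (the signature and body mirror the allocate diffs quoted later in this review):

def allocate(self, size, stream: Optional[IsStreamT] = None) -> Buffer:
    # Normalize anything implementing __cuda_stream__ (Stream, GraphBuilder, ...)
    # to a Stream up front; fall back to the default stream when omitted.
    stream = Stream._init(stream) if stream is not None else default_stream()
    ...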

raise ValueError(
    f"stream must either be a Stream object or support __cuda_stream__ (got {type(stream)})"
) from None
stream = Stream._init(stream)
@Andy-Jost (Contributor Author):

The canonical way to invoke the __cuda_stream__ protocol now is to call Stream._init. It will either succeed in creating a Stream object or raise an exception.
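
For example (sketch):

stream = Stream._init(gb)    # GraphBuilder -> Stream, via __cuda_stream__
Stream._init(42)             # raises: int does not support __cuda_stream__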

...


cdef cydriver.CUstream _try_to_get_stream_ptr(obj: IsStreamT) except*:
@Andy-Jost (Contributor Author):

This code block was moved below without changes.

Comment on lines +247 to +254
gb = device.create_graph_builder().begin_building(mode=mode)
with pytest.raises(
    RuntimeError,
    match=r"DeviceMemoryResource cannot perform memory operations on a capturing "
    r"stream \(consider using GraphMemoryResource\)\.",
):
    dmr.allocate(1, stream=gb)
gb.end_building().complete()
@Andy-Jost (Contributor Author):

This section illustrates a drawback of not using with contexts. Ignore the fact that the error is caught here (that's just for testing). If an exception is thrown during graph capture, control can easily escape without making a call to gb.end_building. That leaves the surrounding code in an unexpected state (capturing on).
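
One way to make that exception-safe (a sketch, not part of this PR; it assumes only the begin_building/end_building calls used above):

from contextlib import contextmanager

@contextmanager
def building(gb):
    gb.begin_building()
    try:
        yield gb
    finally:
        # Capture is turned off even if the body raises.
        gb.end_building()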

if stream is None:
    raise ValueError("stream cannot be None, stream must either be a Stream object or support __cuda_stream__")
try:
    stream_handle = stream.handle
@Andy-Jost (Contributor Author):

This appears to be a (now fixed) bug. Any object with a handle attribute would use that preferentially over the stream protocol, even if the handle was not for a stream.
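
A contrived illustration of the hazard (hypothetical class; the (version, address) tuple follows the __cuda_stream__ protocol):

class EventLike:
    def __init__(self, event_handle, stream_handle):
        self.handle = event_handle          # NOT a stream handle
        self._stream_handle = stream_handle
    def __cuda_stream__(self):
        return (0, self._stream_handle)

# The old code read .handle first and silently picked up event_handle;
# routing through Stream._init uses __cuda_stream__ and gets the right handle.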

@Andy-Jost (Contributor Author):

/ok to test 13e3dfb

@github-actions (bot)

dst : :obj:`~_memory.Buffer`
    Source buffer to copy data from
stream : Stream
stream : IsStreamT
Collaborator:

IsStreamT is still a Stream type?

@Andy-Jost (Contributor Author):

IsStreamT identifies any type that supports conversion to Stream via __cuda_stream__, including Stream itself.
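
For instance (sketch; mr and handle are placeholders, and the (version, address) return value follows the __cuda_stream__ protocol):

class MyStream:
    def __init__(self, handle):
        self._handle = handle
    def __cuda_stream__(self):
        # Protocol version 0, then the raw stream address.
        return (0, self._handle)

buf = mr.allocate(1024, stream=MyStream(handle))  # satisfies IsStreamT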

    return self._ipc_data._alloc_handle

def allocate(self, size_t size, stream: Stream = None) -> Buffer:
def allocate(self, size_t size, stream: Optional[IsStreamT] = None) -> Buffer:
Collaborator:

Does this allocate member function need to take an alignment parameter?

is used.
"""
DMR_deallocate(self, <uintptr_t>ptr, size, <Stream>stream)
stream = Stream._init(stream) if stream is not None else default_stream()
@rparolin (Collaborator) commented Nov 13, 2025:

Is there a way to capture the user error of neglecting to pass the stream that the memory allocation came from? I'm thinking of a debug assert that verifies the allocated memory address came from the provided stream. Another potential option: if the user neglects to provide a stream, we do a lookup to determine where the address came from. I'm not sure that is possible given the current lower-level API.

@Andy-Jost (Contributor Author):

Since the stream only defines an ordering, it is legal to allocate on one stream and deallocate on another.

That said, I think the overall memory management scheme could be clarified and potentially improved.

"a capturing stream (consider using GraphMemoryResource).")


cdef inline Buffer DMR_allocate(DeviceMemoryResource self, size_t size, Stream stream):
Collaborator:

What does inline do here?

@Andy-Jost (Contributor Author) commented Nov 13, 2025:

Cython will mark the function CYTHON_INLINE in the generated C++.

#ifndef CYTHON_INLINE
  #if defined(__clang__)
    #define CYTHON_INLINE __inline__ __attribute__ ((__unused__))
  #else
    #define CYTHON_INLINE inline
  #endif
#endif

"""

def __new__(cls, device_id: int | Device):
    cdef int c_device_id = getattr(device_id, 'device_id', device_id)
Collaborator:

This is where I wish we had function overloading in the language...

@Andy-Jost (Contributor Author):

Yes. It can also be done with decorators and/or pybind11. The main problem with general overloading for us is that the signature matching is done at runtime in Python and we are very sensitive to performance.

In many simple cases, arguments are orthogonal and we just want to reduce a set of types to a certain type. That's what we have here ({Device, int} -> Device) and elsewhere in this change with stream conversions (IsStreamT -> Stream). For these situations, I think the best solution is to make a standard conversion for each type: to convert "any supported object," s, to a Stream, use Stream._init(s). Similarly, we should probably update code like this to Device._init(d) or Device(d). We could also consider adding IsDeviceT and __cuda_device__ for symmetry.

(I would actually prefer to remove the _init methods in favor of constructors, but I understand why we have those and it's a different discussion, anyway.)
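
A sketch of the {Device, int} -> Device reduction described above (as_device is a hypothetical helper; the getattr trick matches the __new__ excerpt):

def as_device(device_id) -> Device:
    # A Device already carries .device_id; a bare int is used as the ordinal.
    ordinal = getattr(device_id, "device_id", device_id)
    return Device(ordinal)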

@rparolin (Collaborator):

Comment: I'd recommend chunking large PRs into smaller reviewable pieces in the future. IMHO, it makes them more approachable and consumable for reviewers.

raise NotImplementedError("WIP: https://github.com/NVIDIA/cuda-python/issues/189")

def create_stream(self, obj: IsStreamT | None = None, options: StreamOptions | None = None) -> Stream:
def create_stream(self, obj: Optional[IsStreamT] = None, options: StreamOptions | None = None) -> Stream:
Member:

Q: We seem to be introducing inconsistent syntax preference here, any reason for this change? The old typing and the new one are equivalent. Also, we still keep the typing for options the same.

I think if ruff were able to lint Cython code, this would have been flagged.

Member:

IIRC in "modern" (in relative terms, since Python typing changes so rapidly) typing using | is preferred over Optional. We should get a typing expert to help review (I am not one) 🙂

    return Event._init(self._id, ctx, options, True)

def allocate(self, size, stream: Stream | None = None) -> Buffer:
def allocate(self, size, stream: Optional[IsStreamT] = None) -> Buffer:
Member:

ditto

Suggested change
def allocate(self, size, stream: Optional[IsStreamT] = None) -> Buffer:
def allocate(self, size, stream: IsStreamT | None = None) -> Buffer:


Labels

cuda.core: Everything related to the cuda.core module
enhancement: Any code-related improvements
P0: High priority - Must do!


Development

Successfully merging this pull request may close these issues.

CUDA graph phase 2 - memory nodes
