Conversation

@Andy-Jost (Contributor)

Description

Implements GraphMemoryResource for memory operations through the graph memory allocator. Allocations from this object succeed only while graph capture is active. Conversely, allocations from DeviceMemoryResource now raise an exception when graph capture is active.

A new test module is added.

This change also simplifies and extends the logic for accepting arbitrary stream parameters as objects implementing __cuda_stream__. Support for that protocol was added in several places, allowing GraphBuilder to be used anywhere a stream is expected, including memory resource and buffer methods.
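
A minimal sketch of what this enables (empty_kernel is a placeholder kernel from the test suite; the begin/end calls follow the patterns quoted in the review below):

from cuda.core.experimental import Device, LaunchConfig, launch

device = Device()
device.set_current()
gb = device.create_graph_builder().begin_building()
# GraphBuilder implements __cuda_stream__, so it can stand in for a Stream
# in kernel launches, memory resource methods, and buffer methods.
launch(gb, LaunchConfig(grid=1, block=1), empty_kernel)
gb.end_building().complete()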

closes #963

@Andy-Jost added this to the cuda.core beta 9 milestone Nov 12, 2025
@Andy-Jost self-assigned this Nov 12, 2025
@Andy-Jost added the enhancement, P0, and cuda.core labels Nov 12, 2025
@copy-pr-bot (bot) commented Nov 12, 2025:

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


from cuda.bindings cimport cydriver
from cuda.core.experimental._memory._buffer cimport MemoryResource
from cuda.core.experimental._memory._device_memory_resource cimport DeviceMemoryResource
@Andy-Jost (Contributor Author):

I'll remove the cruft here.

if _settable:
    def fset(GraphMemoryResourceAttributes self, uint64_t value):
        if value != 0:
            raise AttributeError(f"Attribute {stub.__name__!r} may only be set to zero (got {value}).")
@Andy-Jost (Contributor Author):

The driver checks for this condition in cuDeviceSetGraphMemAttribute and issues a log message: "High watermark can only be reset to 0"

It's a shame we cannot access that message programmatically for use in the Python error.

@Andy-Jost (Contributor Author):

Good news: CUDA 13 adds functions for error log management. It looks like cuLogsRegisterCallback might help here.

    ctx = self._get_current_context()
    return Event._init(self._id, ctx, options, True)

def allocate(self, size, stream: Stream | None = None) -> Buffer:
@Andy-Jost (Contributor Author):

I updated Stream arguments to IsStreamT throughout.

Collaborator:

I was confused by this change. Why did we do that?

@Andy-Jost (Contributor Author):

The GraphBuilder object contains a stream and can be used in place of one. In test_graph.py you can find many statements like this: launch(gb, LaunchConfig(grid=1, block=1), empty_kernel).

This was only partly implemented. For instance, launch contained logic to convert arguments to streams, but nothing in the memory module did.

This change extends the support. In general, we should uniformly use IsStreamT for any API function that accepts a stream-like argument and then call Stream._init(arg) to convert it.
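
A sketch of that convention (the signature and body mirror the allocate diffs quoted later in this review):

def allocate(self, size, stream: Optional[IsStreamT] = None) -> Buffer:
    # Normalize anything implementing __cuda_stream__ (Stream, GraphBuilder, ...)
    # to a Stream up front; fall back to the default stream when omitted.
    stream = Stream._init(stream) if stream is not None else default_stream()
    ...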

raise ValueError(
    f"stream must either be a Stream object or support __cuda_stream__ (got {type(stream)})"
) from None
stream = Stream._init(stream)
@Andy-Jost (Contributor Author):

The canonical way to invoke the __cuda_stream__ protocol now is to call Stream._init. It will either succeed in creating a Stream object or raise an exception.
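
For example (sketch):

stream = Stream._init(gb)    # GraphBuilder -> Stream, via __cuda_stream__
Stream._init(42)             # raises: int does not support __cuda_stream__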

...


cdef cydriver.CUstream _try_to_get_stream_ptr(obj: IsStreamT) except*:
@Andy-Jost (Contributor Author):

This code block was moved below without changes.

Comment on lines +247 to +254
gb = device.create_graph_builder().begin_building(mode=mode)
with pytest.raises(
    RuntimeError,
    match=r"DeviceMemoryResource cannot perform memory operations on a capturing "
    r"stream \(consider using GraphMemoryResource\)\.",
):
    dmr.allocate(1, stream=gb)
gb.end_building().complete()
@Andy-Jost (Contributor Author):

This section illustrates a drawback of not using with contexts. Ignore the fact that the error is caught here (that's just for testing). If an exception is thrown during graph capture, control can easily escape without making a call to gb.end_building. That leaves the surrounding code in an unexpected state (capturing on).
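
One way to make that exception-safe (a sketch, not part of this PR; it assumes only the begin_building/end_building calls used above):

from contextlib import contextmanager

@contextmanager
def building(gb):
    gb.begin_building()
    try:
        yield gb
    finally:
        # Capture is turned off even if the body raises.
        gb.end_building()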

if stream is None:
    raise ValueError("stream cannot be None, stream must either be a Stream object or support __cuda_stream__")
try:
    stream_handle = stream.handle
@Andy-Jost (Contributor Author):

This appears to be a (now fixed) bug. Any object with a handle attribute would use that preferentially over the stream protocol, even if the handle was not for a stream.
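
A contrived illustration of the hazard (hypothetical class; the (version, address) tuple follows the __cuda_stream__ protocol):

class EventLike:
    def __init__(self, event_handle, stream_handle):
        self.handle = event_handle          # NOT a stream handle
        self._stream_handle = stream_handle
    def __cuda_stream__(self):
        return (0, self._stream_handle)

# The old code read .handle first and silently picked up event_handle;
# routing through Stream._init uses __cuda_stream__ and gets the right handle.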

@Andy-Jost (Contributor Author):

/ok to test 13e3dfb

@github-actions (bot)

dst : :obj:`~_memory.Buffer`
    Source buffer to copy data from
stream : Stream
stream : IsStreamT
Collaborator:

IsStreamT is still a Stream type?

@Andy-Jost (Contributor Author):

IsStreamT identifies any type that supports conversion to Stream via __cuda_stream__, including Stream itself.
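
For instance (sketch; mr and handle are placeholders, and the (version, address) return value follows the __cuda_stream__ protocol):

class MyStream:
    def __init__(self, handle):
        self._handle = handle
    def __cuda_stream__(self):
        # Protocol version 0, then the raw stream address.
        return (0, self._handle)

buf = mr.allocate(1024, stream=MyStream(handle))  # satisfies IsStreamT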

    return self._ipc_data._alloc_handle

def allocate(self, size_t size, stream: Stream = None) -> Buffer:
def allocate(self, size_t size, stream: Optional[IsStreamT] = None) -> Buffer:
Collaborator:

Does this allocate member function need to take an alignment parameter?

is used.
"""
DMR_deallocate(self, <uintptr_t>ptr, size, <Stream>stream)
stream = Stream._init(stream) if stream is not None else default_stream()
@rparolin (Collaborator) commented Nov 13, 2025:

Is there a way to capture the user error of neglecting to pass the stream that the memory allocation came from? I'm thinking of a debug assert that verifies the allocated memory address came from the provided stream. Another potential option: if the user neglects to provide a stream, we do a lookup to determine where the address came from. I'm not sure that is possible given the current lower-level API.

@Andy-Jost (Contributor Author):

Since the stream only defines an ordering, it is legal to allocate on one stream and deallocate on another.

That said, I think the overall memory management scheme could be clarified and potentially improved.

"a capturing stream (consider using GraphMemoryResource).")


cdef inline Buffer DMR_allocate(DeviceMemoryResource self, size_t size, Stream stream):
Collaborator:

What does inline do here?

@Andy-Jost (Contributor Author) commented Nov 13, 2025:

Cython will mark the function CYTHON_INLINE in the generated C++.

#ifndef CYTHON_INLINE
  #if defined(__clang__)
    #define CYTHON_INLINE __inline__ __attribute__ ((__unused__))
  #else
    #define CYTHON_INLINE inline
  #endif
#endif

"""

def __new__(cls, device_id: int | Device):
    cdef int c_device_id = getattr(device_id, 'device_id', device_id)
Collaborator:

This is where I wish we had function overloading in the language...

@Andy-Jost (Contributor Author):

Yes. It can also be done with decorators and/or pybind11. The main problem with general overloading for us is that the signature matching is done at runtime in Python and we are very sensitive to performance.

In many simple cases, arguments are orthogonal and we just want to reduce a set of types to a certain type. That's what we have here ({Device, int} -> Device) and elsewhere in this change with stream conversions (IsStreamT -> Stream). For these situations, I think the best solution is to make a standard conversion for each type: to convert "any supported object," s, to a Stream, use Stream._init(s). Similarly, we should probably update code like this to Device._init(d) or Device(d). We could also consider adding IsDeviceT and __cuda_device__ for symmetry.

(I would actually prefer to remove the _init methods in favor of constructors, but I understand why we have those and it's a different discussion, anyway.)
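
A sketch of the {Device, int} -> Device reduction described above (as_device is a hypothetical helper; the getattr trick matches the __new__ excerpt):

def as_device(device_id) -> Device:
    # A Device already carries .device_id; a bare int is used as the ordinal.
    ordinal = getattr(device_id, "device_id", device_id)
    return Device(ordinal)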

@rparolin (Collaborator):

Comment: I'd recommend chunking large PRs into smaller reviewable pieces in the future. IMHO, it makes them more approachable and consumable for reviewers.

raise NotImplementedError("WIP: https://github.com/NVIDIA/cuda-python/issues/189")

def create_stream(self, obj: IsStreamT | None = None, options: StreamOptions | None = None) -> Stream:
def create_stream(self, obj: Optional[IsStreamT] = None, options: StreamOptions | None = None) -> Stream:
Member:

Q: We seem to be introducing inconsistent syntax preference here, any reason for this change? The old typing and the new one are equivalent. Also, we still keep the typing for options the same.

I think if ruff were able to lint Cython code, this would have been flagged.

Member:

IIRC in "modern" (in relative terms, since Python typing changes so rapidly) typing using | is preferred over Optional. We should get a typing expert to help review (I am not one) 🙂

    return Event._init(self._id, ctx, options, True)

def allocate(self, size, stream: Stream | None = None) -> Buffer:
def allocate(self, size, stream: Optional[IsStreamT] = None) -> Buffer:
Member:

ditto

Suggested change
def allocate(self, size, stream: Optional[IsStreamT] = None) -> Buffer:
def allocate(self, size, stream: IsStreamT | None = None) -> Buffer:


Labels

cuda.core: Everything related to the cuda.core module
enhancement: Any code-related improvements
P0: High priority - Must do!


Development

Successfully merging this pull request may close these issues.

CUDA graph phase 2 - memory nodes
