Merged
283 changes: 153 additions & 130 deletions cuda_core/cuda/core/experimental/_memoryview.pyx
@@ -88,20 +88,155 @@ cdef class StridedMemoryView:
cdef DLTensor *dl_tensor

# Memoized properties
cdef tuple _shape
cdef tuple _strides
cdef bint _strides_init # Has the strides tuple been init'ed?
cdef object _dtype

def __init__(self, obj=None, stream_ptr=None):
if obj is not None:
# populate self's attributes
if check_has_dlpack(obj):
view_as_dlpack(obj, stream_ptr, self)
else:
view_as_cai(obj, stream_ptr, self)
Comment on lines -96 to -102
Member


@cpcloud This is a breaking change. Can we please restore the old behavior and add a deprecation warning? We need a release note, too.

If we want to push for alternative constructors, we should eventually raise in __init__ following what we do in other objects.

Contributor Author

@cpcloud Nov 10, 2025


Can you point me to downstream breakages so I can understand the impact? Across all of GitHub the only use of StridedMemoryView is in a comment:

https://github.com/search?q=StridedMemoryView+NOT+is%3Afork+NOT+repo%3ANVIDIA%2Fcuda-python+NOT+owner%3APopcornFX&type=code

I'd strongly prefer not to restore the behavior without evidence that this change causes problems.

Happy to add the release note.

Member


We have real users that cannot be found this way. Examples include issues filed by people outside of the team:

We have way too many invisible users at this point (check out the download counts from e.g. pypistats). We have also been slammed by our PM for breaking too often and too much (though I disagreed in that case; in any case I was the one arguing and taking the time to sort things out, so please back me up).

We need to make sure major breaking changes like this have a deprecation period, even before GA. In particular, in this case it is absolutely fine to add extra constructors without breaking anything. I am not sure why we must leap so aggressively.

Collaborator


+1 to @leofang's guidance. Breaking an existing public-facing API should go through our team's standard deprecation flow, where we notify users at least a release before we remove the code.

Contributor


Apologies for approving while misunderstanding the backward-compat guarantees of the experimental namespace.

I think this is all a reasonable policy to have, but challenging to follow as long as we are releasing from main. If we had a maintenance branch, we could put the "backward-compatible-with-deprecation-warnings" version of this on the maintenance branch, and the aspirational "where-we-want-to-be" version on main. As it stands, we need to do the first now, put a reminder in a calendar to do the second, and hope that review gets it merged in the expected window.

Member


Sorry, I've meant to follow up here and offline with Phillip, but I got distracted (as always) by other things including the pixi stuff that Phillip started playing around with 😛

I did not mean that this PR is bad. I apologize if my earlier comments gave this impression. The alternative constructors are totally fine. They give users finer control if they already have prior knowledge on what protocols a foreign object supports. I meant that we just need to preserve the default constructor's behavior for a little longer, at least for another minor release. We could just revert the change on __init__() and leave everything else as-is, though now that we have an urgent (TBD) patch release in preparation, it is good that it's reverted for now.

We're dealing with an interesting situation. On the one hand, we're still beta/experimental software, and we do allow breaking changes in each minor release. On the other hand, we are perhaps too successful in telling our CUDA Python story: we have an unexpectedly high number of users, and this does increasingly keep me up at night. If we break too often, downstream projects end up pinning too tightly (example), resulting in NVIDIA software that cannot be installed in the same environment. This hurts our storytelling.

I suggest that we clearly highlight all PRs that contain breaking changes with the "breaking" label and a rel-note entry under the Breaking Change section. With such marks we can make informed decisions and evaluate all breaking changes on a case-by-case basis. There is also no need to seek a short turnaround in merging breaking PRs. We still want to design cuda.core properly, in a future-looking fashion, to avoid as many breaks as possible; it has been one of the core principles since Day 1, and for that I am fine taking a bit more time to think through the design.

Contributor Author


I didn't interpret your comments as saying this PR is bad; I just haven't had a chance to re-PR the changes while preserving the current constructor behavior. I will PR that today.

cdef:
tuple _shape
tuple _strides
# a `None` value for _strides has a defined meaning in DLPack and
# the CUDA Array Interface: C order, contiguous.
#
# this flag helps prevent unnecessary recomputation of _strides
bint _strides_init
object _dtype
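As the comment above notes, `strides=None` in DLPack and the CUDA Array Interface denotes a C-order contiguous layout. For a given shape and itemsize, the equivalent explicit byte strides can be computed as follows (a standalone sketch; the helper name is ours, not part of cuda.core):

```python
def c_contiguous_strides(shape, itemsize):
    # Walk the shape right-to-left, accumulating the byte step per axis;
    # the innermost axis advances by one element (itemsize bytes).
    strides = []
    step = itemsize
    for extent in reversed(shape):
        strides.append(step)
        step *= extent
    return tuple(reversed(strides))


# e.g. a (64, 4) uint8 array laid out in C order has strides (4, 1)
assert c_contiguous_strides((64, 4), 1) == (4, 1)
```

This is exactly the layout consumers may assume when an exporter reports `strides=None`, which is why the view can defer materializing the tuple until it is requested.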

def __init__(
self,
*,
ptr: intptr_t,
device_id: int,
is_device_accessible: bint,
readonly: bint,
metadata: object,
exporting_obj: object,
dl_tensor: intptr_t = 0,
) -> None:
self.ptr = ptr
self.device_id = device_id
self.is_device_accessible = is_device_accessible
self.readonly = readonly
self.metadata = metadata
self.exporting_obj = exporting_obj
self.dl_tensor = <DLTensor*>dl_tensor
self._shape = None
self._strides = None
self._strides_init = False
self._dtype = None

@classmethod
def from_dlpack(cls, obj: object, stream_ptr: int | None = None) -> StridedMemoryView:
cdef int dldevice, device_id
cdef bint is_device_accessible, is_readonly
is_device_accessible = False
dldevice, device_id = obj.__dlpack_device__()
if dldevice == _kDLCPU:
assert device_id == 0
device_id = -1
if stream_ptr is None:
raise BufferError("stream=None is ambiguous with view()")
elif stream_ptr == -1:
stream_ptr = None
elif dldevice == _kDLCUDA:
assert device_id >= 0
is_device_accessible = True
# no need to check other stream values, it's a pass-through
if stream_ptr is None:
raise BufferError("stream=None is ambiguous with view()")
elif dldevice in (_kDLCUDAHost, _kDLCUDAManaged):
is_device_accessible = True
# just do a pass-through without any checks, as pinned/managed memory can be
# accessed on both host and device
else:
raise BufferError("device not supported")

cdef object capsule
try:
capsule = obj.__dlpack__(
stream=int(stream_ptr) if stream_ptr else None,
max_version=(DLPACK_MAJOR_VERSION, DLPACK_MINOR_VERSION))
except TypeError:
capsule = obj.__dlpack__(
stream=int(stream_ptr) if stream_ptr else None)

cdef void* data = NULL
cdef DLTensor* dl_tensor
cdef DLManagedTensorVersioned* dlm_tensor_ver
cdef DLManagedTensor* dlm_tensor
cdef const char *used_name
if cpython.PyCapsule_IsValid(
capsule, DLPACK_VERSIONED_TENSOR_UNUSED_NAME):
data = cpython.PyCapsule_GetPointer(
capsule, DLPACK_VERSIONED_TENSOR_UNUSED_NAME)
dlm_tensor_ver = <DLManagedTensorVersioned*>data
dl_tensor = &dlm_tensor_ver.dl_tensor
is_readonly = (dlm_tensor_ver.flags & DLPACK_FLAG_BITMASK_READ_ONLY) != 0
used_name = DLPACK_VERSIONED_TENSOR_USED_NAME
else:
assert cpython.PyCapsule_IsValid(
capsule, DLPACK_TENSOR_UNUSED_NAME)
data = cpython.PyCapsule_GetPointer(
capsule, DLPACK_TENSOR_UNUSED_NAME)
dlm_tensor = <DLManagedTensor*>data
dl_tensor = &dlm_tensor.dl_tensor
is_readonly = False
used_name = DLPACK_TENSOR_USED_NAME

cpython.PyCapsule_SetName(capsule, used_name)

return cls(
ptr=<intptr_t>dl_tensor.data,
device_id=int(device_id),
is_device_accessible=is_device_accessible,
readonly=is_readonly,
metadata=capsule,
exporting_obj=obj,
dl_tensor=<intptr_t>dl_tensor,
)
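The `max_version` negotiation above follows the DLPack protocol: newer exporters return a versioned capsule, while older ones raise `TypeError` on the unknown keyword, so the consumer retries without it. The same pattern can be exercised against NumPy, which exports `__dlpack__`; whether the capsule is versioned or not depends on the installed NumPy version:

```python
import numpy as np

arr = np.arange(4)
try:
    # Ask for a versioned capsule (DLPack >= 1.0); exporters that predate
    # the keyword raise TypeError, mirroring the fallback above.
    capsule = arr.__dlpack__(max_version=(1, 0))
except TypeError:
    capsule = arr.__dlpack__()

# Either way we hold a PyCapsule wrapping a DLManagedTensor(Versioned).
assert type(capsule).__name__ == "PyCapsule"
```

The capsule name ("dltensor" vs. "dltensor_versioned") tells the consumer which struct layout to cast to, which is what the `PyCapsule_IsValid` checks above distinguish.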

@classmethod
def from_cuda_array_interface(cls, obj: object, stream_ptr: int | None = None) -> StridedMemoryView:
cdef dict cai_data = obj.__cuda_array_interface__
if cai_data["version"] < 3:
raise BufferError("only CUDA Array Interface v3 or above is supported")
if cai_data.get("mask") is not None:
raise BufferError("mask is not supported")
if stream_ptr is None:
raise BufferError("stream=None is ambiguous with view()")

cdef intptr_t producer_s, consumer_s
stream_ptr = int(stream_ptr)
if stream_ptr != -1:
stream = cai_data.get("stream")
if stream is not None:
producer_s = <intptr_t>(stream)
consumer_s = <intptr_t>(stream_ptr)
assert producer_s > 0
# establish stream order
if producer_s != consumer_s:
e = handle_return(driver.cuEventCreate(
driver.CUevent_flags.CU_EVENT_DISABLE_TIMING))
handle_return(driver.cuEventRecord(e, producer_s))
handle_return(driver.cuStreamWaitEvent(consumer_s, e, 0))
handle_return(driver.cuEventDestroy(e))

cdef intptr_t ptr = int(cai_data["data"][0])
return cls(
ptr=ptr,
device_id=handle_return(
driver.cuPointerGetAttribute(
driver.CUpointer_attribute.CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL,
ptr
)
),
is_device_accessible=True,
readonly=cai_data["data"][1],
metadata=cai_data,
exporting_obj=obj,
)

@classmethod
def from_any_interface(cls, obj: object, stream_ptr: int | None = None) -> StridedMemoryView:
if check_has_dlpack(obj):
return cls.from_dlpack(obj, stream_ptr)
return cls.from_cuda_array_interface(obj, stream_ptr)
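`from_any_interface` prefers DLPack when an object exports both protocols, matching what `check_has_dlpack` tests for. A simplified, pure-Python sketch of that dispatch (the function name is ours, for illustration):

```python
import numpy as np

def preferred_protocol(obj):
    # DLPack requires both __dlpack__ and __dlpack_device__; it wins
    # over __cuda_array_interface__ when both protocols are present.
    if hasattr(obj, "__dlpack__") and hasattr(obj, "__dlpack_device__"):
        return "dlpack"
    if hasattr(obj, "__cuda_array_interface__"):
        return "cuda_array_interface"
    raise BufferError("object supports neither DLPack nor CAI")

# NumPy arrays export DLPack (on CPU), so they take the first branch.
assert preferred_protocol(np.zeros(3)) == "dlpack"
```

This is why the test below for the CAI path must call `from_cuda_array_interface` directly: objects like CuPy arrays export both protocols, and the generic entry point would never reach the CAI branch for them.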

def __dealloc__(self):
if self.dl_tensor == NULL:
@@ -121,7 +256,7 @@
dlm_tensor.deleter(dlm_tensor)

@property
def shape(self) -> tuple[int]:
def shape(self) -> tuple[int, ...]:
if self._shape is None:
if self.exporting_obj is not None:
if self.dl_tensor != NULL:
@@ -136,7 +271,7 @@
return self._shape

@property
def strides(self) -> Optional[tuple[int]]:
def strides(self) -> Optional[tuple[int, ...]]:
cdef int itemsize
if self._strides_init is False:
if self.exporting_obj is not None:
@@ -206,8 +341,7 @@ cdef bint check_has_dlpack(obj) except*:


cdef class _StridedMemoryViewProxy:

cdef:
cdef readonly:
object obj
bint has_dlpack

@@ -217,82 +351,11 @@ cdef class _StridedMemoryViewProxy:

cpdef StridedMemoryView view(self, stream_ptr=None):
if self.has_dlpack:
return view_as_dlpack(self.obj, stream_ptr)
return StridedMemoryView.from_dlpack(self.obj, stream_ptr)
else:
return view_as_cai(self.obj, stream_ptr)
return StridedMemoryView.from_cuda_array_interface(self.obj, stream_ptr)


cdef StridedMemoryView view_as_dlpack(obj, stream_ptr, view=None):
cdef int dldevice, device_id
cdef bint is_device_accessible, is_readonly
is_device_accessible = False
dldevice, device_id = obj.__dlpack_device__()
if dldevice == _kDLCPU:
assert device_id == 0
device_id = -1
if stream_ptr is None:
raise BufferError("stream=None is ambiguous with view()")
elif stream_ptr == -1:
stream_ptr = None
elif dldevice == _kDLCUDA:
assert device_id >= 0
is_device_accessible = True
# no need to check other stream values, it's a pass-through
if stream_ptr is None:
raise BufferError("stream=None is ambiguous with view()")
elif dldevice in (_kDLCUDAHost, _kDLCUDAManaged):
is_device_accessible = True
# just do a pass-through without any checks, as pinned/managed memory can be
# accessed on both host and device
else:
raise BufferError("device not supported")

cdef object capsule
try:
capsule = obj.__dlpack__(
stream=int(stream_ptr) if stream_ptr else None,
max_version=(DLPACK_MAJOR_VERSION, DLPACK_MINOR_VERSION))
except TypeError:
capsule = obj.__dlpack__(
stream=int(stream_ptr) if stream_ptr else None)

cdef void* data = NULL
cdef DLTensor* dl_tensor
cdef DLManagedTensorVersioned* dlm_tensor_ver
cdef DLManagedTensor* dlm_tensor
cdef const char *used_name
if cpython.PyCapsule_IsValid(
capsule, DLPACK_VERSIONED_TENSOR_UNUSED_NAME):
data = cpython.PyCapsule_GetPointer(
capsule, DLPACK_VERSIONED_TENSOR_UNUSED_NAME)
dlm_tensor_ver = <DLManagedTensorVersioned*>data
dl_tensor = &dlm_tensor_ver.dl_tensor
is_readonly = bool((dlm_tensor_ver.flags & DLPACK_FLAG_BITMASK_READ_ONLY) != 0)
used_name = DLPACK_VERSIONED_TENSOR_USED_NAME
elif cpython.PyCapsule_IsValid(
capsule, DLPACK_TENSOR_UNUSED_NAME):
data = cpython.PyCapsule_GetPointer(
capsule, DLPACK_TENSOR_UNUSED_NAME)
dlm_tensor = <DLManagedTensor*>data
dl_tensor = &dlm_tensor.dl_tensor
is_readonly = False
used_name = DLPACK_TENSOR_USED_NAME
else:
assert False

cpython.PyCapsule_SetName(capsule, used_name)

cdef StridedMemoryView buf = StridedMemoryView() if view is None else view
buf.dl_tensor = dl_tensor
buf.metadata = capsule
buf.ptr = <intptr_t>(dl_tensor.data)
buf.device_id = device_id
buf.is_device_accessible = is_device_accessible
buf.readonly = is_readonly
buf.exporting_obj = obj

return buf


cdef object dtype_dlpack_to_numpy(DLDataType* dtype):
cdef int bits = dtype.bits
@@ -354,46 +417,6 @@ cdef object dtype_dlpack_to_numpy(DLDataType* dtype):
return numpy.dtype(np_dtype)


# Also generate for Python so we can test this code path
cpdef StridedMemoryView view_as_cai(obj, stream_ptr, view=None):
cdef dict cai_data = obj.__cuda_array_interface__
if cai_data["version"] < 3:
raise BufferError("only CUDA Array Interface v3 or above is supported")
if cai_data.get("mask") is not None:
raise BufferError("mask is not supported")
if stream_ptr is None:
raise BufferError("stream=None is ambiguous with view()")

cdef StridedMemoryView buf = StridedMemoryView() if view is None else view
buf.exporting_obj = obj
buf.metadata = cai_data
buf.dl_tensor = NULL
buf.ptr, buf.readonly = cai_data["data"]
buf.is_device_accessible = True
buf.device_id = handle_return(
driver.cuPointerGetAttribute(
driver.CUpointer_attribute.CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL,
buf.ptr))

cdef intptr_t producer_s, consumer_s
stream_ptr = int(stream_ptr)
if stream_ptr != -1:
stream = cai_data.get("stream")
if stream is not None:
producer_s = <intptr_t>(stream)
consumer_s = <intptr_t>(stream_ptr)
assert producer_s > 0
# establish stream order
if producer_s != consumer_s:
e = handle_return(driver.cuEventCreate(
driver.CUevent_flags.CU_EVENT_DISABLE_TIMING))
handle_return(driver.cuEventRecord(e, producer_s))
handle_return(driver.cuStreamWaitEvent(consumer_s, e, 0))
handle_return(driver.cuEventDestroy(e))

return buf


def args_viewable_as_strided_memory(tuple arg_indices):
"""
Decorator to create proxy objects to :obj:`StridedMemoryView` for the
4 changes: 2 additions & 2 deletions cuda_core/tests/test_memory.py
@@ -609,15 +609,15 @@ def test_strided_memory_view_leak():
arr = np.zeros(1048576, dtype=np.uint8)
before = sys.getrefcount(arr)
for idx in range(10):
StridedMemoryView(arr, stream_ptr=-1)
StridedMemoryView.from_any_interface(arr, stream_ptr=-1)
after = sys.getrefcount(arr)
assert before == after


def test_strided_memory_view_refcnt():
# Use Fortran ordering so strides is used
a = np.zeros((64, 4), dtype=np.uint8, order="F")
av = StridedMemoryView(a, stream_ptr=-1)
av = StridedMemoryView.from_any_interface(a, stream_ptr=-1)
# segfaults if refcnt is wrong
assert av.shape[0] == 64
assert sys.getrefcount(av.shape) >= 2
7 changes: 3 additions & 4 deletions cuda_core/tests/test_utils.py
@@ -14,7 +14,6 @@
import numpy as np
import pytest
from cuda.core.experimental import Device
from cuda.core.experimental._memoryview import view_as_cai
from cuda.core.experimental.utils import StridedMemoryView, args_viewable_as_strided_memory


@@ -78,7 +77,7 @@ def my_func(arr):

def test_strided_memory_view_cpu(self, in_arr):
# stream_ptr=-1 means "the consumer does not care"
view = StridedMemoryView(in_arr, stream_ptr=-1)
view = StridedMemoryView.from_any_interface(in_arr, stream_ptr=-1)
self._check_view(view, in_arr)

def _check_view(self, view, in_arr):
@@ -147,7 +146,7 @@ def test_strided_memory_view_cpu(self, in_arr, use_stream):
# This is the consumer stream
s = dev.create_stream() if use_stream else None

view = StridedMemoryView(in_arr, stream_ptr=s.handle if s else -1)
view = StridedMemoryView.from_any_interface(in_arr, stream_ptr=s.handle if s else -1)
self._check_view(view, in_arr, dev)

def _check_view(self, view, in_arr, dev):
@@ -179,7 +178,7 @@ def test_cuda_array_interface_gpu(self, in_arr, use_stream):
# The usual path in `StridedMemoryView` prefers the DLPack interface
# over __cuda_array_interface__, so we call `from_cuda_array_interface`
# directly here so we can test the CAI code path.
view = view_as_cai(in_arr, stream_ptr=s.handle if s else -1)
view = StridedMemoryView.from_cuda_array_interface(in_arr, stream_ptr=s.handle if s else -1)
self._check_view(view, in_arr, dev)

def _check_view(self, view, in_arr, dev):