Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
27 changes: 13 additions & 14 deletions cuda_core/cuda/core/experimental/_memoryview.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -122,16 +122,17 @@ cdef class StridedMemoryView:

@property
def shape(self) -> tuple[int]:
if self._shape is None and self.exporting_obj is not None:
if self.dl_tensor != NULL:
self._shape = cuda_utils.carray_int64_t_to_tuple(
self.dl_tensor.shape,
self.dl_tensor.ndim
)
if self._shape is None:
if self.exporting_obj is not None:
if self.dl_tensor != NULL:
self._shape = cuda_utils.carray_int64_t_to_tuple(
self.dl_tensor.shape,
self.dl_tensor.ndim
)
else:
self._shape = self.metadata["shape"]
else:
self._shape = self.metadata["shape"]
else:
self._shape = ()
self._shape = ()
return self._shape

@property
Expand All @@ -146,14 +147,12 @@ cdef class StridedMemoryView:
self.dl_tensor.ndim
)
else:
# This is a Python interface anyway, so not much point
# to using the optimization in cuda_utils.carray_int64_t_to_tuple
strides = self.metadata.get("strides")
if strides is not None:
itemsize = self.dtype.itemsize
self._strides = cpython.PyTuple_New(len(strides))
for i in range(len(strides)):
cpython.PyTuple_SET_ITEM(
self._strides, i, strides[i] // itemsize
)
self._strides = tuple(x // itemsize for x in strides)
self._strides_init = True
return self._strides

Expand Down
13 changes: 11 additions & 2 deletions cuda_core/cuda/core/experimental/_utils/cuda_utils.pxd
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@
# SPDX-License-Identifier: Apache-2.0

cimport cpython
from cpython.object cimport PyObject
from libc.stdint cimport int64_t

from cuda.bindings cimport cydriver
Expand Down Expand Up @@ -32,9 +33,17 @@ cpdef int _check_nvrtc_error(error) except?-1
cpdef check_or_create_options(type cls, options, str options_description=*, bint keep_none=*)


# Create low-level externs so Cython won't "helpfully" handle reference counting
# for us. Prefixing with an underscore to distinguish it from the definition in
# cpython.long.
cdef extern from "Python.h":
PyObject *_PyLong_FromLongLong "PyLong_FromLongLong" (long long val) except NULL
void _PyTuple_SET_ITEM "PyTuple_SET_ITEM" (object p, Py_ssize_t pos, PyObject *o)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is wild that this is the way to handle the problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could have been handled by adding an explicit incref to counteract the implicit decref that Cython generates, but this is the only way I could find to avoid the unnecessary nonsense.

Cython and performance optimization are at odds.

This fixes the problem at hand, but it caused me to look more closely at the reference counting the Cython generates, and there is a /lot/ of unnecessary work it does. I don't know how much of that the C compiler can see through and optimize away, but I doubt it's everything. Cython really needs its own reduction pass, probably. This will become more of a problem for free-threaded builds where reference counting is more expensive.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what was the code that Cython generated without this cdef extern hack?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the code it generated before:

    __pyx_t_1 = PyLong_FromLongLong((__pyx_v_ptr[__pyx_v_i])); if (unlikely(!__pyx_t_1)) __PYX_ERR(1, 39, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_1);
    PyTuple_SET_ITEM(__pyx_v_result, __pyx_v_i, __pyx_t_1);
    __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;  // <--- wrong!

That DECREF is incorrect because PyTuple_SET_ITEM steals a reference. @stiepan's suggestion in #1188 was to explicitly INCREF that value, which is totally correct, but would churn the refcount unnecessarily. My fix here basically uses raw PyObject * instead to avoid Cython being "helpful".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mdboom I see. IIUC this will generate the equivalent code with less boilerplate and without unnecessary churns (untested):

item = cpython.PyLong_FromLongLong(ptr[i])
cpython.Py_INCREF(item)
cpython.PyTuple_SET_ITEM(result, i, <cpython.PyObject*>(item))

This is because in Cython the 3rd arg is typed as object, for which Cython handles refcount before/after the function call. Casting it to PyObject* bypasses Cython's mechanism and gives us full control over refcount.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I tried this. The cast doesn't work because Cython still wants you to cast back into an object to pass to its wrapper of PyTuple_SET_ITEM:

          cpython.PyTuple_SET_ITEM(result, i, <cpython.PyObject *>cpython.PyLong_FromLongLong(ptr[i]))
                                                  ^
      ------------------------------------------------------------

      ./cuda/core/experimental/_utils/cuda_utils.pxd:39:44: Casting temporary Python object to non-numeric non-Python type

It also causes unnecessary reference count churn. (We incref the item only because Cython is implicitly and unnecessarily decref'ing it later). The C compiler generally can't see through and optimize away that kind of thing because it can't reason that the reference count won't ever go to zero and that freeing the object is always a no-op.

As I look at this stuff more, I think a better approach is probably to drop all the way down to C for these performance-critical sections. The whole cimport cpython tree of things is full of rough edges because it doesn't model the behavior of the Python/C API completely (it doesn't understand steals, for example). To the point where maybe it's a good guideline to "avoid cimport cpython". For example, maybe we should do this (only for these sort of localized hot path optimizations of course). It's a lot more verbose due to the error checking, but it's a lot easier to reason about what it's doing and where the performance optimizations are:

cdef extern from *:
    """
    static inline PyObject* carray_int64_to_tuple(uint64_t *var, Py_ssize_t size) {
        PyObject* result = PyTuple_New(size);
        if (result == NULL) {
            return NULL;
        }
        PyObject* item;
        for (Py_ssize_t i = 0; i < size; i++) {
            item = PyLong_FromLongLong(var[i]);
            if (item == NULL) {
                Py_DECREF(result);
                return NULL;
            }
            PyTuple_SET_ITEM(result, i, item);
        }
        return result;
    }
    """
    PyObject *carray_int64_to_tuple(uintptr_t var, Py_ssize_t size)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cast doesn't work because Cython still wants you to cast back into an object to pass to its wrapper of PyTuple_SET_ITEM

Ah right. This makes me sad...



cdef inline tuple carray_int64_t_to_tuple(int64_t *ptr, int length):
# Construct shape and strides tuples using the Python/C API for speed
result = cpython.PyTuple_New(length)
cdef tuple result = cpython.PyTuple_New(length)
for i in range(length):
cpython.PyTuple_SET_ITEM(result, i, cpython.PyLong_FromLongLong(ptr[i]))
_PyTuple_SET_ITEM(result, i, _PyLong_FromLongLong(ptr[i]))
return result
29 changes: 29 additions & 0 deletions cuda_core/docs/source/release/0.4.X-notes.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
.. SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
.. SPDX-License-Identifier: Apache-2.0

.. currentmodule:: cuda.core.experimental

``cuda.core`` 0.4.X Release Notes
=================================


Highlights
----------


Breaking Changes
----------------


New features
------------


New examples
------------


Fixes and enhancements
----------------------

- Fixed a segmentation fault when accessing :class:`StridedMemoryView` ``shape`` and ``strides`` members.
13 changes: 13 additions & 0 deletions cuda_core/tests/test_memory.py
Original file line number Diff line number Diff line change
Expand Up @@ -612,3 +612,16 @@ def test_strided_memory_view_leak():
StridedMemoryView(arr, stream_ptr=-1)
after = sys.getrefcount(arr)
assert before == after


def test_strided_memory_view_refcnt():
# Use Fortran ordering so strides is used
a = np.zeros((64, 4), dtype=np.uint8, order="F")
av = StridedMemoryView(a, stream_ptr=-1)
# segfaults if refcnt is wrong
assert av.shape[0] == 64
assert sys.getrefcount(av.shape) >= 2

assert av.strides[0] == 1
assert av.strides[1] == 64
assert sys.getrefcount(av.strides) >= 2
Loading