Conversation

Contributor

@mdboom mdboom commented Oct 24, 2025

Rather than adding extra reference counts to counteract Cython's unnecessary reference counts, this fix adds "non-magical" externs to the CPython API so that the reference counting is correct without churn.

While writing a test to investigate that bug, I also discovered a bug in the caching logic for shape.

Lastly, for the case where strides comes from CAI rather than DLPack, I did not port this fix; since that is already a "slow" Python API path, I just reverted it to a tuple comprehension.
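
The CAI-path conversion referred to here can be sketched roughly as follows (names are illustrative, not cuda.core's actual API):

```python
# Rough sketch of the CAI strides path: the CUDA Array Interface reports
# strides in bytes, while StridedMemoryView exposes element-count strides,
# so each entry is divided by the dtype item size. The helper name is
# hypothetical; it is not part of cuda.core.
def strides_in_counts(strides_in_bytes, itemsize):
    """Convert byte-based strides to element-count strides."""
    return tuple(s // itemsize for s in strides_in_bytes)
```

For example, a C-contiguous (64, 4) float64 array has byte strides (32, 8), which convert to element strides (4, 1).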

@mdboom mdboom self-assigned this Oct 24, 2025
@mdboom mdboom added P0 High priority - Must do! cuda.core Everything related to the cuda.core module labels Oct 24, 2025
Contributor

copy-pr-bot bot commented Oct 24, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@mdboom mdboom requested a review from leofang October 24, 2025 16:51
Contributor Author

mdboom commented Oct 24, 2025

/ok to test

Contributor

copy-pr-bot bot commented Oct 24, 2025

/ok to test

@mdboom, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

Contributor

@greptile-apps greptile-apps bot left a comment


Greptile Summary

This PR fixes a segmentation fault (issue #1186) that occurred when accessing the shape and strides properties of StridedMemoryView in the cuda.core package. The root cause was incorrect reference counting in Cython-generated code when constructing Python tuples from C arrays. The fix introduces low-level CPython API externs (_PyLong_FromLongLong, _PyTuple_SET_ITEM) in cuda_utils.pxd to bypass Cython's automatic reference counting, which was causing objects to be prematurely deallocated.

Additionally, the PR corrects a caching logic bug where _shape would incorrectly cache an empty tuple even when exporting_obj could be populated later. For the CAI (CUDA Array Interface) code path, where strides come from Python metadata rather than DLPack, the fix simplifies the implementation by using a tuple comprehension instead of manual CPython API calls, since that path is already slower.

The changes are localized to the memory view implementation in cuda_core and include a comprehensive regression test that validates both correct values and proper reference counting.
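
The caching bug called out above can be modeled with a small sketch (an assumed shape of the fix, not cuda.core's actual code): the computed tuple is cached only once the exporting object exists, so an early empty default is never frozen in place.

```python
class ShapeCache:
    """Toy model of the corrected caching rule; not cuda.core's implementation."""

    def __init__(self):
        self._shape = None         # cached tuple, or None if not yet cached
        self.exporting_obj = None  # may be populated after construction

    @property
    def shape(self):
        if self._shape is None:
            if self.exporting_obj is None:
                # Nothing to compute from yet: return a default WITHOUT
                # caching it, so a later exporter can still fill the cache.
                return ()
            self._shape = tuple(self.exporting_obj)
        return self._shape
```

The buggy variant would assign the empty tuple to `self._shape` on the first access, so a later `exporting_obj` could never be reflected.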

Important Files Changed

Filename | Score | Overview
cuda_core/cuda/core/experimental/_utils/cuda_utils.pxd | 1/5 | Adds CPython API externs to fix reference counting, but with potential type mixing issues
cuda_core/cuda/core/experimental/_memoryview.pyx | 0/5 | Fixes segfault and caching logic but introduces undefined variable itemsize at line 154
cuda_core/tests/test_memory.py | 5/5 | Adds regression test for segfault with reference count validation
cuda_core/docs/source/release/0.4.X-notes.rst | 5/5 | Documents the segfault fix in standard release notes format

Confidence score: 0/5

  • This PR contains critical bugs that will cause immediate runtime failures and cannot be safely merged without fixes
  • Score reflects undefined variable itemsize at line 154 in _memoryview.pyx (will raise NameError), and potential reference counting issues from mixing object and PyObject* types in the CPython API externs in cuda_utils.pxd
  • Pay extremely close attention to cuda_core/cuda/core/experimental/_memoryview.pyx line 154 where itemsize is referenced but never defined, and to cuda_core/cuda/core/experimental/_utils/cuda_utils.pxd lines 39-41 where the extern declarations may not correctly achieve the intended reference counting behavior

Sequence Diagram

sequenceDiagram
    participant User
    participant StridedMemoryView
    participant cuda_utils
    participant CPython_API as CPython API

    User->>StridedMemoryView: Create StridedMemoryView(obj, stream_ptr)
    activate StridedMemoryView
    StridedMemoryView->>StridedMemoryView: Store dl_tensor or metadata
    deactivate StridedMemoryView

    User->>StridedMemoryView: Access .shape property
    activate StridedMemoryView
    alt shape not cached
        alt Using DLPack (dl_tensor != NULL)
            StridedMemoryView->>cuda_utils: carray_int64_t_to_tuple(dl_tensor.shape, ndim)
            activate cuda_utils
            cuda_utils->>CPython_API: PyTuple_New(length)
            activate CPython_API
            CPython_API-->>cuda_utils: new tuple
            deactivate CPython_API
            loop For each dimension
                cuda_utils->>CPython_API: _PyLong_FromLongLong(ptr[i])
                activate CPython_API
                CPython_API-->>cuda_utils: PyObject*
                deactivate CPython_API
                cuda_utils->>CPython_API: _PyTuple_SET_ITEM(result, i, PyObject*)
                Note right of cuda_utils: No extra INCREF (steals reference)
            end
            cuda_utils-->>StridedMemoryView: tuple with correct refcount
            deactivate cuda_utils
        else Using CAI
            StridedMemoryView->>StridedMemoryView: Get shape from metadata dict
        end
        StridedMemoryView->>StridedMemoryView: Cache in self._shape
    end
    StridedMemoryView-->>User: Return shape tuple
    deactivate StridedMemoryView

    User->>StridedMemoryView: Access .strides property
    activate StridedMemoryView
    alt strides not initialized
        alt Using DLPack with strides
            StridedMemoryView->>cuda_utils: carray_int64_t_to_tuple(dl_tensor.strides, ndim)
            activate cuda_utils
            Note right of cuda_utils: Same safe conversion process
            cuda_utils-->>StridedMemoryView: tuple with correct refcount
            deactivate cuda_utils
        else Using CAI
            StridedMemoryView->>StridedMemoryView: Compute strides from metadata
        end
        StridedMemoryView->>StridedMemoryView: Cache in self._strides
        StridedMemoryView->>StridedMemoryView: Set _strides_init = True
    end
    StridedMemoryView-->>User: Return strides tuple or None
    deactivate StridedMemoryView

    User->>StridedMemoryView: Delete reference
    activate StridedMemoryView
    StridedMemoryView->>StridedMemoryView: __dealloc__()
    Note right of StridedMemoryView: Proper cleanup without segfault
    deactivate StridedMemoryView

4 files reviewed, 2 comments


# cpython.long.
cdef extern from "Python.h":
    PyObject *_PyLong_FromLongLong "PyLong_FromLongLong" (long long val) except NULL
    void _PyTuple_SET_ITEM "PyTuple_SET_ITEM" (object p, Py_ssize_t pos, PyObject *o)
Contributor


It is wild that this is the way to handle the problem.

Contributor Author


It could have been handled by adding an explicit incref to counteract the implicit decref that Cython generates, but this is the only way I could find to avoid the unnecessary nonsense.

Cython and performance optimization are at odds.

This fixes the problem at hand, but it caused me to look more closely at the reference counting that Cython generates, and there is a /lot/ of unnecessary work in it. I don't know how much of that the C compiler can see through and optimize away, but I doubt it's everything. Cython probably needs its own refcount-reduction pass. This will become more of a problem on free-threaded builds, where reference counting is more expensive.

Member


So what was the code that Cython generated without this cdef extern hack?

Contributor Author


This is the code it generated before:

    __pyx_t_1 = PyLong_FromLongLong((__pyx_v_ptr[__pyx_v_i])); if (unlikely(!__pyx_t_1)) __PYX_ERR(1, 39, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_1);
    PyTuple_SET_ITEM(__pyx_v_result, __pyx_v_i, __pyx_t_1);
    __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;  // <--- wrong!

That DECREF is incorrect because PyTuple_SET_ITEM steals a reference. @stiepan's suggestion in #1188 was to explicitly INCREF that value, which is totally correct, but would churn the refcount unnecessarily. My fix here basically uses raw PyObject * instead to avoid Cython being "helpful".

Member


@mdboom I see. IIUC this will generate the equivalent code with less boilerplate and without unnecessary churns (untested):

item = cpython.PyLong_FromLongLong(ptr[i])
cpython.Py_INCREF(item)
cpython.PyTuple_SET_ITEM(result, i, <cpython.PyObject*>(item))

This is because in Cython the 3rd arg is typed as object, for which Cython handles refcount before/after the function call. Casting it to PyObject* bypasses Cython's mechanism and gives us full control over refcount.

Contributor Author


Yeah, I tried this. The cast doesn't work because Cython still wants you to cast back into an object to pass to its wrapper of PyTuple_SET_ITEM:

          cpython.PyTuple_SET_ITEM(result, i, <cpython.PyObject *>cpython.PyLong_FromLongLong(ptr[i]))
                                                  ^
      ------------------------------------------------------------

      ./cuda/core/experimental/_utils/cuda_utils.pxd:39:44: Casting temporary Python object to non-numeric non-Python type

It also causes unnecessary reference count churn. (We incref the item only because Cython is implicitly and unnecessarily decref'ing it later). The C compiler generally can't see through and optimize away that kind of thing because it can't reason that the reference count won't ever go to zero and that freeing the object is always a no-op.

As I look at this stuff more, I think a better approach is probably to drop all the way down to C for these performance-critical sections. The whole cimport cpython tree of things is full of rough edges because it doesn't model the behavior of the Python/C API completely (it doesn't understand steals, for example). To the point where maybe it's a good guideline to "avoid cimport cpython". For example, maybe we should do this (only for this sort of localized hot-path optimization, of course). It's a lot more verbose due to the error checking, but it's a lot easier to reason about what it's doing and where the performance optimizations are:

cdef extern from *:
    """
    #include <stdint.h>

    static inline PyObject* carray_int64_to_tuple(int64_t *var, Py_ssize_t size) {
        PyObject* result = PyTuple_New(size);
        if (result == NULL) {
            return NULL;
        }
        PyObject* item;
        for (Py_ssize_t i = 0; i < size; i++) {
            item = PyLong_FromLongLong(var[i]);
            if (item == NULL) {
                Py_DECREF(result);
                return NULL;
            }
            PyTuple_SET_ITEM(result, i, item);
        }
        return result;
    }
    """
    PyObject *carray_int64_to_tuple(int64_t *var, Py_ssize_t size)

Member


The cast doesn't work because Cython still wants you to cast back into an object to pass to its wrapper of PyTuple_SET_ITEM

Ah right. This makes me sad...

cpcloud previously approved these changes Oct 24, 2025
Contributor

@cpcloud cpcloud left a comment


🚢

Contributor

@greptile-apps greptile-apps bot left a comment


Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The latest update addresses a critical runtime bug by adding the missing itemsize definition in the CAI (CUDA Array Interface) code path of the strides property getter. Previously, when strides were retrieved from CAI metadata, the code attempted to divide by itemsize to convert byte-based strides to count-based strides, but itemsize was never defined, causing a NameError at runtime.

The fix adds itemsize = self.dtype.itemsize before the tuple comprehension, mirroring the pattern already established in the DLPack code path earlier in the same method. This ensures proper stride conversion for both protocols.

Important Files Changed

Filename | Score | Overview
cuda_core/cuda/core/experimental/_memoryview.pyx | 5/5 | Added missing itemsize variable definition before its use in strides tuple comprehension for CAI path

Confidence score: 5/5

  • This PR is safe to merge with minimal risk
  • Score reflects a trivial one-line fix that addresses a clear NameError bug with no side effects or breaking changes
  • No files require special attention

1 file reviewed, no comments


cpcloud previously approved these changes Oct 24, 2025
Contributor Author

mdboom commented Oct 24, 2025

/ok to test

Contributor

copy-pr-bot bot commented Oct 24, 2025

/ok to test

@mdboom, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

Contributor

@greptile-apps greptile-apps bot left a comment


Greptile Summary

This review covers only the changes made since the last review, not the entire PR. In this latest update, the test array size in test_strided_memory_view_refcnt was reduced from (11171, 4) to (64, 4), shortening test execution time while preserving the critical coverage. The test validates that accessing the shape and strides properties of StridedMemoryView objects with Fortran ordering doesn't trigger segmentation faults due to incorrect reference counting.

The reduction in array size doesn't impact the test's ability to catch the bug, since it is the Fortran ordering, not the array size, that triggers the strides code path being validated. This change aligns with the good practice of using minimal test data that still exercises the bug condition.
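
The refcount-validation idea behind the test can be illustrated with a small sketch (this is not the actual cuda_core test, and FakeView is a stand-in, not cuda.core's class):

```python
import sys

# Illustrative sketch of the regression test's approach: capture an
# object's refcount, exercise a property repeatedly, and check that the
# count is unchanged afterwards. A reference that is stolen and then
# spuriously DECREF'd, as in the original bug, shows up as a drifting
# count (or a crash).
def refcount_stable(obj, accessor, repeats=10):
    before = sys.getrefcount(obj)
    for _ in range(repeats):
        accessor(obj)
    return sys.getrefcount(obj) == before


class FakeView:
    """Stand-in for StridedMemoryView with a well-behaved property."""
    @property
    def shape(self):
        return (64, 4)
```

On standard CPython, `refcount_stable(FakeView(), lambda v: v.shape)` should hold; the real test exercises `StridedMemoryView.shape` and `.strides` on a Fortran-ordered device array instead.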

Important Files Changed

Filename | Score | Overview
cuda_core/tests/test_memory.py | 5/5 | Reduced test array dimensions from (11171, 4) to (64, 4) for efficiency without sacrificing coverage

Confidence score: 5/5

  • This PR change is safe to merge with minimal risk
  • Score reflects a simple, well-justified optimization to test data that improves test suite performance without reducing test coverage
  • No files require special attention

1 file reviewed, no comments


Contributor Author

mdboom commented Oct 24, 2025

/ok to test 4811649


@rparolin rparolin added this to the cuda.core beta 8 milestone Oct 24, 2025
@cpcloud cpcloud merged commit c404452 into NVIDIA:main Oct 24, 2025
64 checks passed
@github-actions

Doc Preview CI
Preview removed because the pull request was closed or merged.

@leofang leofang removed this from the cuda.core beta 9 milestone Oct 28, 2025
@leofang leofang added this to the cuda.core beta 8 milestone Oct 28, 2025