- 
                Notifications
    You must be signed in to change notification settings 
- Fork 217
Fix #1186: Fix segmentation fault when accessing StridedMemoryView #1190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…iew shape and strides
| /ok to test | 
| 
 @mdboom, there was an error processing your request:  See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/ | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
This PR fixes a segmentation fault (issue #1186) that occurred when accessing the shape and strides properties of StridedMemoryView in the cuda.core package. The root cause was incorrect reference counting in Cython-generated code when constructing Python tuples from C arrays. The fix introduces low-level CPython API externs (_PyLong_FromLongLong, _PyTuple_SET_ITEM) in cuda_utils.pxd to bypass Cython's automatic reference counting, which was causing objects to be prematurely deallocated. Additionally, the PR corrects a caching logic bug where _shape would incorrectly cache an empty tuple even when exporting_obj could be populated later. For the CAI (CUDA Array Interface) code path where strides come from Python metadata rather than DLPack, the fix simplifies the implementation by using a tuple comprehension instead of manual CPython API calls, since that path is already slower. The changes are localized to the memory view implementation in cuda_core and include a comprehensive regression test that validates both correct values and proper reference counting.
Important Files Changed
| Filename | Score | Overview | 
|---|---|---|
| cuda_core/cuda/core/experimental/_utils/cuda_utils.pxd | 1/5 | Adds CPython API externs to fix reference counting but with potential type mixing issues | 
| cuda_core/cuda/core/experimental/_memoryview.pyx | 0/5 | Fixes segfault and caching logic but introduces undefined variable itemsizeat line 154 | 
| cuda_core/tests/test_memory.py | 5/5 | Adds regression test for segfault with reference count validation | 
| cuda_core/docs/source/release/0.4.X-notes.rst | 5/5 | Documents the segfault fix in standard release notes format | 
Confidence score: 0/5
- This PR contains critical bugs that will cause immediate runtime failures and cannot be safely merged without fixes
- Score reflects undefined variable itemsizeat line 154 in_memoryview.pyx(will raise NameError), and potential reference counting issues from mixingobjectandPyObject*types in the CPython API externs incuda_utils.pxd
- Pay extremely close attention to cuda_core/cuda/core/experimental/_memoryview.pyxline 154 whereitemsizeis referenced but never defined, and tocuda_core/cuda/core/experimental/_utils/cuda_utils.pxdlines 39-41 where the extern declarations may not correctly achieve the intended reference counting behavior
Sequence Diagram
sequenceDiagram
    participant User
    participant StridedMemoryView
    participant cuda_utils
    participant CPython_API as CPython API
    User->>StridedMemoryView: Create StridedMemoryView(obj, stream_ptr)
    activate StridedMemoryView
    StridedMemoryView->>StridedMemoryView: Store dl_tensor or metadata
    deactivate StridedMemoryView
    User->>StridedMemoryView: Access .shape property
    activate StridedMemoryView
    alt shape not cached
        alt Using DLPack (dl_tensor != NULL)
            StridedMemoryView->>cuda_utils: carray_int64_t_to_tuple(dl_tensor.shape, ndim)
            activate cuda_utils
            cuda_utils->>CPython_API: PyTuple_New(length)
            activate CPython_API
            CPython_API-->>cuda_utils: new tuple
            deactivate CPython_API
            loop For each dimension
                cuda_utils->>CPython_API: _PyLong_FromLongLong(ptr[i])
                activate CPython_API
                CPython_API-->>cuda_utils: PyObject*
                deactivate CPython_API
                cuda_utils->>CPython_API: _PyTuple_SET_ITEM(result, i, PyObject*)
                Note right of cuda_utils: No extra INCREF (steals reference)
            end
            cuda_utils-->>StridedMemoryView: tuple with correct refcount
            deactivate cuda_utils
        else Using CAI
            StridedMemoryView->>StridedMemoryView: Get shape from metadata dict
        end
        StridedMemoryView->>StridedMemoryView: Cache in self._shape
    end
    StridedMemoryView-->>User: Return shape tuple
    deactivate StridedMemoryView
    User->>StridedMemoryView: Access .strides property
    activate StridedMemoryView
    alt strides not initialized
        alt Using DLPack with strides
            StridedMemoryView->>cuda_utils: carray_int64_t_to_tuple(dl_tensor.strides, ndim)
            activate cuda_utils
            Note right of cuda_utils: Same safe conversion process
            cuda_utils-->>StridedMemoryView: tuple with correct refcount
            deactivate cuda_utils
        else Using CAI
            StridedMemoryView->>StridedMemoryView: Compute strides from metadata
        end
        StridedMemoryView->>StridedMemoryView: Cache in self._strides
        StridedMemoryView->>StridedMemoryView: Set _strides_init = True
    end
    StridedMemoryView-->>User: Return strides tuple or None
    deactivate StridedMemoryView
    User->>StridedMemoryView: Delete reference
    activate StridedMemoryView
    StridedMemoryView->>StridedMemoryView: __dealloc__()
    Note right of StridedMemoryView: Proper cleanup without segfault
    deactivate StridedMemoryView
    4 files reviewed, 2 comments
| # cpython.long. | ||
| cdef extern from "Python.h": | ||
| PyObject *_PyLong_FromLongLong "PyLong_FromLongLong" (long long val) except NULL | ||
| void _PyTuple_SET_ITEM "PyTuple_SET_ITEM" (object p, Py_ssize_t pos, PyObject *o) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is wild that this is the way to handle the problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could have been handled by adding an explicit incref to counteract the implicit decref that Cython generates, but this is the only way I could find to avoid the unnecessary nonsense.
Cython and performance optimization are at odds.
This fixes the problem at hand, but it caused me to look more closely at the reference counting the Cython generates, and there is a /lot/ of unnecessary work it does. I don't know how much of that the C compiler can see through and optimize away, but I doubt it's everything. Cython really needs its own reduction pass, probably. This will become more of a problem for free-threaded builds where reference counting is more expensive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So what was the code that Cython generated without this cdef extern hack?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the code it generated before:
    __pyx_t_1 = PyLong_FromLongLong((__pyx_v_ptr[__pyx_v_i])); if (unlikely(!__pyx_t_1)) __PYX_ERR(1, 39, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_1);
    PyTuple_SET_ITEM(__pyx_v_result, __pyx_v_i, __pyx_t_1);
    __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;  // <--- wrong!
That DECREF is incorrect because PyTuple_SET_ITEM steals a reference.  @stiepan's suggestion in #1188 was to explicitly INCREF that value, which is totally correct, but would churn the refcount unnecessarily.  My fix here basically uses raw PyObject * instead to avoid Cython being "helpful".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mdboom I see. IIUC this will generate the equivalent code with less boilerplate and without unnecessary churns (untested):
item = cpython.PyLong_FromLongLong(ptr[i])
cpython.Py_INCREF(item)
cpython.PyTuple_SET_ITEM(result, i, <cpython.PyObject*>(item))This is because in Cython the 3rd arg is typed as object, for which Cython handles refcount before/after the function call. Casting it to PyObject* bypasses Cython's mechanism and gives us full control over refcount.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I tried this.  The cast doesn't work because Cython still wants you to cast back into an object to pass to its wrapper of PyTuple_SET_ITEM:
          cpython.PyTuple_SET_ITEM(result, i, <cpython.PyObject *>cpython.PyLong_FromLongLong(ptr[i]))
                                                  ^
      ------------------------------------------------------------
      ./cuda/core/experimental/_utils/cuda_utils.pxd:39:44: Casting temporary Python object to non-numeric non-Python type
It also causes unnecessary reference count churn. (We incref the item only because Cython is implicitly and unnecessarily decref'ing it later). The C compiler generally can't see through and optimize away that kind of thing because it can't reason that the reference count won't ever go to zero and that freeing the object is always a no-op.
As I look at this stuff more, I think a better approach is probably to drop all the way down to C for these performance-critical sections.  The whole cimport cpython tree of things is full of rough edges because it doesn't model the behavior of the Python/C API completely (it doesn't understand steals, for example).  To the point where maybe it's a good guideline to "avoid cimport cpython".  For example, maybe we should do this (only for these sort of localized hot path optimizations of course).  It's a lot more verbose due to the error checking, but it's a lot easier to reason about what it's doing and where the performance optimizations are:
cdef extern from *:
    """
    static inline PyObject* carray_int64_to_tuple(uint64_t *var, Py_ssize_t size) {
        PyObject* result = PyTuple_New(size);
        if (result == NULL) {
            return NULL;
        }
        PyObject* item;
        for (Py_ssize_t i = 0; i < size; i++) {
            item = PyLong_FromLongLong(var[i]);
            if (item == NULL) {
                Py_DECREF(result);
                return NULL;
            }
            PyTuple_SET_ITEM(result, i, item);
        }
        return result;
    }
    """
    PyObject *carray_int64_to_tuple(uintptr_t var, Py_ssize_t size)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cast doesn't work because Cython still wants you to cast back into an
objectto pass to its wrapper ofPyTuple_SET_ITEM
Ah right. This makes me sad...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚢
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
This review covers only the changes made since the last review, not the entire PR. The latest update addresses a critical runtime bug by adding the missing itemsize definition in the CAI (CUDA Array Interface) code path of the strides property getter. Previously, when strides were retrieved from CAI metadata, the code attempted to divide by itemsize to convert byte-based strides to count-based strides, but itemsize was never defined, causing a NameError at runtime. The fix adds itemsize = self.dtype.itemsize before the tuple comprehension, mirroring the pattern already established in the DLPack code path earlier in the same method. This ensures proper stride conversion for both protocols and resolves a segmentation fault issue.
Important Files Changed
| Filename | Score | Overview | 
|---|---|---|
| cuda_core/cuda/core/experimental/_memoryview.pyx | 5/5 | Added missing itemsizevariable definition before its use in strides tuple comprehension for CAI path | 
Confidence score: 5/5
- This PR is safe to merge with minimal risk
- Score reflects a trivial one-line fix that addresses a clear NameError bug with no side effects or breaking changes
- No files require special attention
1 file reviewed, no comments
| /ok to test | 
| 
 @mdboom, there was an error processing your request:  See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/ | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
This review covers only the changes made since the last review, not the entire PR. In this latest update, the test array size in test_strided_memory_view_refcnt was reduced from (11171, 4) to (64, 4). This change optimizes test execution time while preserving the critical test coverage. The test validates that accessing shape and strides properties on StridedMemoryView objects with Fortran ordering doesn't trigger segmentation faults due to incorrect reference counting. The reduction in array size doesn't impact the test's ability to catch the bug, as the Fortran ordering (not array size) is what triggers the specific strides code path being validated. This change aligns with good testing practices of using minimal test data that still exercises the bug condition.
Important Files Changed
| Filename | Score | Overview | 
|---|---|---|
| cuda_core/tests/test_memory.py | 5/5 | Reduced test array dimensions from (11171,4) to (64,4) for efficiency without sacrificing coverage | 
Confidence score: 5/5
- This PR change is safe to merge with minimal risk
- Score reflects a simple, well-justified optimization to test data that improves test suite performance without reducing test coverage
- No files require special attention
1 file reviewed, no comments
| /ok to test 4811649 | 
      
        
              This comment has been minimized.
        
        
      
    
  This comment has been minimized.
| 
 | 
To avoid adding unnecessary reference counts to counteract Cython's unnecessary reference counts, this fix adds "non-magical" externs to the CPython API to make the reference counting correct.
In the process of writing a test investigating that bug, I also discovered a bug in the caching logic for
shape.Lastly, for the case where strides comes from
CAIrather thanDLPack, rather than porting this fix there, since that's already a "slow" Python API path, I just reverted that to a "tuple" comprehension.