Conversation

Contributor

@mdboom mdboom commented Oct 24, 2025

Rather than adding extra reference counts to counteract Cython's unnecessary reference counts, this fix adds "non-magical" externs to the CPython API so that the reference counting is correct without churn.

While writing a test to investigate that bug, I also discovered a bug in the caching logic for shape.

Lastly, for the case where strides comes from CAI rather than DLPack, I did not port this fix; since that is already a "slow" Python API path, I just reverted it to a tuple comprehension.
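
The CAI-path conversion referred to here can be sketched roughly as follows (names are illustrative, not cuda.core's actual API):

```python
# Rough sketch of the CAI strides path: the CUDA Array Interface reports
# strides in bytes, while StridedMemoryView exposes element-count strides,
# so each entry is divided by the dtype item size. The helper name is
# hypothetical; it is not part of cuda.core.
def strides_in_counts(strides_in_bytes, itemsize):
    """Convert byte-based strides to element-count strides."""
    return tuple(s // itemsize for s in strides_in_bytes)
```

For example, a C-contiguous (64, 4) float64 array has byte strides (32, 8), which convert to element strides (4, 1).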

@mdboom mdboom self-assigned this Oct 24, 2025
@mdboom mdboom added P0 High priority - Must do! cuda.core Everything related to the cuda.core module labels Oct 24, 2025
Contributor

copy-pr-bot bot commented Oct 24, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@mdboom mdboom requested a review from leofang October 24, 2025 16:51
Contributor Author

mdboom commented Oct 24, 2025

/ok to test

Contributor

copy-pr-bot bot commented Oct 24, 2025

/ok to test

@mdboom, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

Contributor

@greptile-apps greptile-apps bot left a comment


Greptile Summary

This PR fixes a segmentation fault (issue #1186) that occurred when accessing the shape and strides properties of StridedMemoryView in the cuda.core package. The root cause was incorrect reference counting in Cython-generated code when constructing Python tuples from C arrays. The fix introduces low-level CPython API externs (_PyLong_FromLongLong, _PyTuple_SET_ITEM) in cuda_utils.pxd to bypass Cython's automatic reference counting, which was causing objects to be prematurely deallocated.

Additionally, the PR corrects a caching logic bug where _shape would incorrectly cache an empty tuple even when exporting_obj could be populated later. For the CAI (CUDA Array Interface) code path, where strides come from Python metadata rather than DLPack, the fix simplifies the implementation by using a tuple comprehension instead of manual CPython API calls, since that path is already slower.

The changes are localized to the memory view implementation in cuda_core and include a comprehensive regression test that validates both correct values and proper reference counting.
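
The caching bug called out above can be modeled with a small sketch (an assumed shape of the fix, not cuda.core's actual code): the computed tuple is cached only once the exporting object exists, so an early empty default is never frozen in place.

```python
class ShapeCache:
    """Toy model of the corrected caching rule; not cuda.core's implementation."""

    def __init__(self):
        self._shape = None         # cached tuple, or None if not yet cached
        self.exporting_obj = None  # may be populated after construction

    @property
    def shape(self):
        if self._shape is None:
            if self.exporting_obj is None:
                # Nothing to compute from yet: return a default WITHOUT
                # caching it, so a later exporter can still fill the cache.
                return ()
            self._shape = tuple(self.exporting_obj)
        return self._shape
```

The buggy variant would assign the empty tuple to `self._shape` on the first access, so a later `exporting_obj` could never be reflected.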

Important Files Changed

Filename | Score | Overview
cuda_core/cuda/core/experimental/_utils/cuda_utils.pxd | 1/5 | Adds CPython API externs to fix reference counting, but with potential type mixing issues
cuda_core/cuda/core/experimental/_memoryview.pyx | 0/5 | Fixes segfault and caching logic but introduces undefined variable itemsize at line 154
cuda_core/tests/test_memory.py | 5/5 | Adds regression test for segfault with reference count validation
cuda_core/docs/source/release/0.4.X-notes.rst | 5/5 | Documents the segfault fix in standard release notes format

Confidence score: 0/5

  • This PR contains critical bugs that will cause immediate runtime failures and cannot be safely merged without fixes
  • Score reflects undefined variable itemsize at line 154 in _memoryview.pyx (will raise NameError), and potential reference counting issues from mixing object and PyObject* types in the CPython API externs in cuda_utils.pxd
  • Pay extremely close attention to cuda_core/cuda/core/experimental/_memoryview.pyx line 154 where itemsize is referenced but never defined, and to cuda_core/cuda/core/experimental/_utils/cuda_utils.pxd lines 39-41 where the extern declarations may not correctly achieve the intended reference counting behavior

Sequence Diagram

sequenceDiagram
    participant User
    participant StridedMemoryView
    participant cuda_utils
    participant CPython_API as CPython API

    User->>StridedMemoryView: Create StridedMemoryView(obj, stream_ptr)
    activate StridedMemoryView
    StridedMemoryView->>StridedMemoryView: Store dl_tensor or metadata
    deactivate StridedMemoryView

    User->>StridedMemoryView: Access .shape property
    activate StridedMemoryView
    alt shape not cached
        alt Using DLPack (dl_tensor != NULL)
            StridedMemoryView->>cuda_utils: carray_int64_t_to_tuple(dl_tensor.shape, ndim)
            activate cuda_utils
            cuda_utils->>CPython_API: PyTuple_New(length)
            activate CPython_API
            CPython_API-->>cuda_utils: new tuple
            deactivate CPython_API
            loop For each dimension
                cuda_utils->>CPython_API: _PyLong_FromLongLong(ptr[i])
                activate CPython_API
                CPython_API-->>cuda_utils: PyObject*
                deactivate CPython_API
                cuda_utils->>CPython_API: _PyTuple_SET_ITEM(result, i, PyObject*)
                Note right of cuda_utils: No extra INCREF (steals reference)
            end
            cuda_utils-->>StridedMemoryView: tuple with correct refcount
            deactivate cuda_utils
        else Using CAI
            StridedMemoryView->>StridedMemoryView: Get shape from metadata dict
        end
        StridedMemoryView->>StridedMemoryView: Cache in self._shape
    end
    StridedMemoryView-->>User: Return shape tuple
    deactivate StridedMemoryView

    User->>StridedMemoryView: Access .strides property
    activate StridedMemoryView
    alt strides not initialized
        alt Using DLPack with strides
            StridedMemoryView->>cuda_utils: carray_int64_t_to_tuple(dl_tensor.strides, ndim)
            activate cuda_utils
            Note right of cuda_utils: Same safe conversion process
            cuda_utils-->>StridedMemoryView: tuple with correct refcount
            deactivate cuda_utils
        else Using CAI
            StridedMemoryView->>StridedMemoryView: Compute strides from metadata
        end
        StridedMemoryView->>StridedMemoryView: Cache in self._strides
        StridedMemoryView->>StridedMemoryView: Set _strides_init = True
    end
    StridedMemoryView-->>User: Return strides tuple or None
    deactivate StridedMemoryView

    User->>StridedMemoryView: Delete reference
    activate StridedMemoryView
    StridedMemoryView->>StridedMemoryView: __dealloc__()
    Note right of StridedMemoryView: Proper cleanup without segfault
    deactivate StridedMemoryView

4 files reviewed, 2 comments


# cpython.long.
cdef extern from "Python.h":
    PyObject *_PyLong_FromLongLong "PyLong_FromLongLong" (long long val) except NULL
    void _PyTuple_SET_ITEM "PyTuple_SET_ITEM" (object p, Py_ssize_t pos, PyObject *o)
Contributor


It is wild that this is the way to handle the problem.

Contributor Author


It could have been handled by adding an explicit incref to counteract the implicit decref that Cython generates, but this is the only way I could find to avoid the unnecessary nonsense.

Cython and performance optimization are at odds.

This fixes the problem at hand, but it caused me to look more closely at the reference counting that Cython generates, and there is a /lot/ of unnecessary work in it. I don't know how much of that the C compiler can see through and optimize away, but I doubt it's everything. Cython probably needs its own refcount-reduction pass. This will become more of a problem on free-threaded builds, where reference counting is more expensive.

Member


So what was the code that Cython generated without this cdef extern hack?

Contributor Author


This is the code it generated before:

    __pyx_t_1 = PyLong_FromLongLong((__pyx_v_ptr[__pyx_v_i])); if (unlikely(!__pyx_t_1)) __PYX_ERR(1, 39, __pyx_L1_error)
    __Pyx_GOTREF(__pyx_t_1);
    PyTuple_SET_ITEM(__pyx_v_result, __pyx_v_i, __pyx_t_1);
    __Pyx_DECREF(__pyx_t_1); __pyx_t_1 = 0;  // <--- wrong!

That DECREF is incorrect because PyTuple_SET_ITEM steals a reference. @stiepan's suggestion in #1188 was to explicitly INCREF that value, which is totally correct, but would churn the refcount unnecessarily. My fix here basically uses raw PyObject * instead to avoid Cython being "helpful".

Member


@mdboom I see. IIUC this will generate the equivalent code with less boilerplate and without unnecessary churns (untested):

item = cpython.PyLong_FromLongLong(ptr[i])
cpython.Py_INCREF(item)
cpython.PyTuple_SET_ITEM(result, i, <cpython.PyObject*>(item))

This is because in Cython the 3rd arg is typed as object, for which Cython handles refcount before/after the function call. Casting it to PyObject* bypasses Cython's mechanism and gives us full control over refcount.

Contributor Author


Yeah, I tried this. The cast doesn't work because Cython still wants you to cast back into an object to pass to its wrapper of PyTuple_SET_ITEM:

          cpython.PyTuple_SET_ITEM(result, i, <cpython.PyObject *>cpython.PyLong_FromLongLong(ptr[i]))
                                                  ^
      ------------------------------------------------------------

      ./cuda/core/experimental/_utils/cuda_utils.pxd:39:44: Casting temporary Python object to non-numeric non-Python type

It also causes unnecessary reference count churn. (We incref the item only because Cython is implicitly and unnecessarily decref'ing it later). The C compiler generally can't see through and optimize away that kind of thing because it can't reason that the reference count won't ever go to zero and that freeing the object is always a no-op.

As I look at this stuff more, I think a better approach is probably to drop all the way down to C for these performance-critical sections. The whole cimport cpython tree of things is full of rough edges because it doesn't model the behavior of the Python/C API completely (it doesn't understand steals, for example). To the point where maybe it's a good guideline to "avoid cimport cpython". For example, maybe we should do this (only for this sort of localized hot-path optimization, of course). It's a lot more verbose due to the error checking, but it's a lot easier to reason about what it's doing and where the performance optimizations are:

cdef extern from *:
    """
    #include <stdint.h>

    static inline PyObject* carray_int64_to_tuple(int64_t *var, Py_ssize_t size) {
        PyObject* result = PyTuple_New(size);
        if (result == NULL) {
            return NULL;
        }
        PyObject* item;
        for (Py_ssize_t i = 0; i < size; i++) {
            item = PyLong_FromLongLong(var[i]);
            if (item == NULL) {
                Py_DECREF(result);
                return NULL;
            }
            PyTuple_SET_ITEM(result, i, item);
        }
        return result;
    }
    """
    PyObject *carray_int64_to_tuple(int64_t *var, Py_ssize_t size)

Member


The cast doesn't work because Cython still wants you to cast back into an object to pass to its wrapper of PyTuple_SET_ITEM

Ah right. This makes me sad...

cpcloud previously approved these changes Oct 24, 2025
Contributor

@cpcloud cpcloud left a comment


🚢

Contributor

@greptile-apps greptile-apps bot left a comment


Greptile Summary

This review covers only the changes made since the last review, not the entire PR. The latest update addresses a critical runtime bug by adding the missing itemsize definition in the CAI (CUDA Array Interface) code path of the strides property getter. Previously, when strides were retrieved from CAI metadata, the code attempted to divide by itemsize to convert byte-based strides to count-based strides, but itemsize was never defined, causing a NameError at runtime.

The fix adds itemsize = self.dtype.itemsize before the tuple comprehension, mirroring the pattern already established in the DLPack code path earlier in the same method. This ensures proper stride conversion for both protocols.

Important Files Changed

Filename | Score | Overview
cuda_core/cuda/core/experimental/_memoryview.pyx | 5/5 | Added missing itemsize variable definition before its use in strides tuple comprehension for CAI path

Confidence score: 5/5

  • This PR is safe to merge with minimal risk
  • Score reflects a trivial one-line fix that addresses a clear NameError bug with no side effects or breaking changes
  • No files require special attention

1 file reviewed, no comments


cpcloud previously approved these changes Oct 24, 2025
Contributor Author

mdboom commented Oct 24, 2025

/ok to test

Contributor

copy-pr-bot bot commented Oct 24, 2025

/ok to test

@mdboom, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

Contributor

@greptile-apps greptile-apps bot left a comment


Greptile Summary

This review covers only the changes made since the last review, not the entire PR. In this latest update, the test array size in test_strided_memory_view_refcnt was reduced from (11171, 4) to (64, 4), shortening test execution time while preserving the critical coverage. The test validates that accessing the shape and strides properties of StridedMemoryView objects with Fortran ordering doesn't trigger segmentation faults due to incorrect reference counting.

The reduction in array size doesn't impact the test's ability to catch the bug, since it is the Fortran ordering, not the array size, that triggers the strides code path being validated. This change aligns with the good practice of using minimal test data that still exercises the bug condition.
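
The refcount-validation idea behind the test can be illustrated with a small sketch (this is not the actual cuda_core test, and FakeView is a stand-in, not cuda.core's class):

```python
import sys

# Illustrative sketch of the regression test's approach: capture an
# object's refcount, exercise a property repeatedly, and check that the
# count is unchanged afterwards. A reference that is stolen and then
# spuriously DECREF'd, as in the original bug, shows up as a drifting
# count (or a crash).
def refcount_stable(obj, accessor, repeats=10):
    before = sys.getrefcount(obj)
    for _ in range(repeats):
        accessor(obj)
    return sys.getrefcount(obj) == before


class FakeView:
    """Stand-in for StridedMemoryView with a well-behaved property."""
    @property
    def shape(self):
        return (64, 4)
```

On standard CPython, `refcount_stable(FakeView(), lambda v: v.shape)` should hold; the real test exercises `StridedMemoryView.shape` and `.strides` on a Fortran-ordered device array instead.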

Important Files Changed

Filename | Score | Overview
cuda_core/tests/test_memory.py | 5/5 | Reduced test array dimensions from (11171, 4) to (64, 4) for efficiency without sacrificing coverage

Confidence score: 5/5

  • This PR change is safe to merge with minimal risk
  • Score reflects a simple, well-justified optimization to test data that improves test suite performance without reducing test coverage
  • No files require special attention

1 file reviewed, no comments


Contributor Author

mdboom commented Oct 24, 2025

/ok to test 4811649


@rparolin rparolin added this to the cuda.core beta 8 milestone Oct 24, 2025
@cpcloud cpcloud merged commit c404452 into NVIDIA:main Oct 24, 2025
64 checks passed
@github-actions

Doc Preview CI
Preview removed because the pull request was closed or merged.

@leofang leofang removed this from the cuda.core beta 9 milestone Oct 28, 2025
@leofang leofang added this to the cuda.core beta 8 milestone Oct 28, 2025