
Conversation

Andy-Jost
Contributor

@Andy-Jost Andy-Jost commented Oct 9, 2025

Errors occurring during Buffer.close are not raised. This change adds tests demonstrating the issue. See #1118.

@Andy-Jost Andy-Jost self-assigned this Oct 9, 2025
Contributor

copy-pr-bot bot commented Oct 9, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Andy-Jost
Contributor Author

/ok to test 845fbd4


github-actions bot commented Oct 9, 2025

@Andy-Jost Andy-Jost force-pushed the ipc_suppressed_errors branch from 845fbd4 to adfb7e5 Compare October 9, 2025 22:08
@Andy-Jost
Contributor Author

/ok to test 0986f5e

@Andy-Jost Andy-Jost force-pushed the ipc_suppressed_errors branch from 0986f5e to 78a4815 Compare October 9, 2025 22:17
@Andy-Jost
Contributor Author

/ok to test 1d5248e

@Andy-Jost Andy-Jost added the test (Improvements or additions to tests) and cuda.core (Everything related to the cuda.core module) labels Oct 9, 2025
@Andy-Jost Andy-Jost force-pushed the ipc_suppressed_errors branch from 1d5248e to bbdbbcd Compare October 9, 2025 22:33
@Andy-Jost Andy-Jost changed the title from "Add skipped tests demonstrating that errors in Buffer.close are not raised" to "Add (failing) tests demonstrating that errors in Buffer.close are not raised" Oct 9, 2025
@Andy-Jost Andy-Jost force-pushed the ipc_suppressed_errors branch from d916308 to b7f8c2a Compare October 9, 2025 22:35
@Andy-Jost Andy-Jost force-pushed the ipc_suppressed_errors branch from 6e9c283 to fea6f8a Compare October 9, 2025 22:36
mr.close()


@pytest.mark.xfail
Collaborator


Will this work here?

@pytest.mark.xfail(reason="Issue #1118", strict=True)

The important part is strict=True.

Member

@leofang leofang left a comment


HUGE 👎 in testing this.

Comment on lines +153 to +158
def test_error_in_close_memory_resource(ipc_memory_resource):
    """Test that errors when closing a memory resource are raised."""
    mr = ipc_memory_resource
    driver.cuMemPoolDestroy(mr.handle)
    with pytest.raises(CUDAError, match=".*CUDA_ERROR_INVALID_VALUE.*"):
        mr.close()
Member


This is illegal and I disagree we need to test this. This is Python and we can't possibly guard against all kinds of bizarre ways of trying to mutate the state of our Python objects behind our back. In particular, as noted in both #1074 (comment) and offline discussion, errors like CUDA_ERROR_INVALID_VALUE are due to multiple frees. I thought we've moved on?

This test is just another instance of the same class of errors: We free the handle of an object through a direct C API call, bypassing our safeguard mechanism (under the hood we do check if the handle is already null before freeing, and then after free we set the handle to null to avoid double free), so our destructor kicks in again, either through an explicit close() call or implicitly when going out of scope.
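The safeguard described here can be sketched roughly as follows. This is a hypothetical stand-in class, not cuda.core's actual implementation; `_driver_free` is an assumed placeholder for the real driver call:

```python
# Hypothetical sketch of the null-handle safeguard: check before freeing,
# null the handle after freeing, so close() is safe to call twice.
class GuardedBuffer:
    def __init__(self, handle):
        self._handle = handle
        self.free_calls = 0  # instrumentation for this sketch only

    def _driver_free(self, handle):
        # A real implementation would call into the CUDA driver here.
        self.free_calls += 1

    def close(self):
        if self._handle is None:  # already freed: close() is a no-op
            return
        self._driver_free(self._handle)
        self._handle = None       # prevent a second free

buf = GuardedBuffer(handle=0x1234)
buf.close()
buf.close()  # second call hits the null check and returns quietly
print(buf.free_calls)  # → 1
```

Note that freeing the handle through a direct C API call bypasses this guard precisely because `_handle` stays non-null, which is why the destructor then attempts a second free.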

Contributor

@cpcloud cpcloud Oct 10, 2025


I agree that this test is asserting that different levels of APIs offer the same guarantees, which would make developing layers of APIs really really difficult.

It seems roughly analogous to calling into the Python C API through ctypes, and expecting Python to somehow know you didn't mean to cause a segmentation violation:

❯ python3.13 -q
>>> x = 1
>>> import ctypes
>>> ctypes.pythonapi.Py_DecRef(x)
zsh: segmentation fault (core dumped)  python3.13 -q

Contributor

@cpcloud cpcloud Oct 10, 2025


I would guess there's probably another way to write this test such that the behavior is buggy without crossing into the C abyss of naked bindings.

Would it be enough to just call close() twice? That seems like something we should perhaps be robust to if we're not already:

❯ python -q
>>> f = open('/tmp/x', 'w')
>>> f.close()
>>> f.close()

Comment on lines +167 to +169
    driver.cuMemFree(buffer.handle)
    with pytest.raises(CUDAError, match=".*CUDA_ERROR_INVALID_VALUE.*"):
        buffer.close()
Member


ditto

Comment on lines +181 to +186
        try:
            driver.cuMemPoolDestroy(self.mr.handle)
        except Exception:  # noqa: S110
            pass
        else:
            self.mr.close()
Member


ditto

@leofang
Member

leofang commented Oct 10, 2025

FWIW cccl-runtime, the C++ standard library, and any high-level framework or library have the same issue. Giving you access to a container's underlying handle does not mean you can free it through free() or delete behind the container's back. This is UB, and by testing it we would be guaranteeing certain behavior (whatever it is).



4 participants