
Conversation

brandon-b-miller
Contributor
Closes #803

I verified the fixes locally by applying them on top of the 12.9.2 tag, since I'm on CUDA 12. Because _memory.pyx did not exist at that tag, the changes there are "blind" for now. However, the change to the stream object reliably gets me past the original error.

copy-pr-bot bot commented Sep 16, 2025

Auto-sync is disabled for ready-for-review pull requests in this repository. Workflows must be run manually.

@leofang
Member

leofang commented Sep 16, 2025

Thanks, Brandon! Let's please not touch the .close() method. It is an explicit call, so there is no chance of hitting the shutdown issue there. The is_shutting_down call should only be added to __del__, which is where all the issues happen.
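For context: the is_shutting_down referenced above is an atexit-flag helper. A minimal sketch of that pattern, with the module layout and names assumed rather than taken from the merged code:

import atexit

_shutting_down = False

def _mark_shutting_down():
    # atexit handlers run early in interpreter shutdown, before the
    # final garbage collection that triggers most __del__ calls.
    global _shutting_down
    _shutting_down = True

atexit.register(_mark_shutting_down)

def is_shutting_down():
    return _shutting_down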

@leofang leofang added bug Something isn't working P0 High priority - Must do! cuda.core Everything related to the cuda.core module labels Sep 16, 2025
@leofang leofang added this to the cuda.core beta 7 milestone Sep 16, 2025
if self._handle is not None:
    if not is_shutting_down():
        err, = driver.cuEventDestroy(self._handle)
Contributor

@cpcloud cpcloud Sep 16, 2025

What's the issue with this?

Suggested change
err, = driver.cuEventDestroy(self._handle)
if (destroy := getattr(driver, "cuEventDestroy", None)) is not None:
    err, = destroy(self._handle)

Then you don't need the global hack.

Member

We want to use a field-tested solution (see the internal thread) instead of coming up with a new one.

Contributor

How about sys.is_finalizing()? There's probably no more field-tested solution than what's in the standard library.

In this case it also seems particularly suited to the problem being solved here.
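A minimal sketch of that approach - Event here is an illustrative stand-in, not the actual cuda.core class:

import sys

class Event:
    def close(self):
        # In the real class this would call driver.cuEventDestroy(...).
        print("releasing handle")

    def __del__(self):
        if sys.is_finalizing():
            return  # interpreter is shutting down; skip driver calls
        self.close()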

Collaborator

This is definitely better, but it is still a hack: if we run under tools like cuda-memcheck that check for resource leaks, the leak will still show up.

Additionally, we'd need to carry these checks everywhere the code could be called from __del__ methods, I believe even transitively, i.e., the raise_if_driver_error function.

What if we moved to a __dealloc__ function?
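For reference, the __dealloc__ shape being floated would look roughly like this in Cython - the extern declarations here are illustrative assumptions, and note that __dealloc__ may only make C-level calls, which is exactly the blocker raised in the next comment:

cdef extern from "cuda.h":
    ctypedef void* CUevent
    int cuEventDestroy(CUevent event)

cdef class Event:
    cdef CUevent _handle

    def __dealloc__(self):
        # __dealloc__ runs even during interpreter shutdown, but only
        # C-level calls are safe here - no Python-level bindings.
        if self._handle != NULL:
            cuEventDestroy(self._handle)
            self._handle = NULL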

Member

@leofang leofang Sep 17, 2025

As discussed internally, sys.is_finalizing() - while official - only returns True at a very late stage of interpreter shutdown, later than all of the exit handlers (which this PR is based on). It is unclear to me whether this solves the problem, and I feel nervous about it. Do we know of any big projects using this solution?

What if we moved to a __dealloc__ function?

No, we can't do this (yet), because we currently call the Python bindings. Once #866 lands we can switch to this, but I need some time to work it out, and I'd prefer this to be fixed independently (and ASAP).

Contributor

@cpcloud cpcloud Sep 17, 2025

Here are the relevant steps, in order, during interpreter shutdown:

  1. wait for threads to shut down
  2. wait for any pending calls
  3. call atexit handlers (where the flag would be set)
  4. set the interpreter to be officially in finalizing mode (this information is what sys.is_finalizing() uses)
  5. collect garbage (__del__ would be called here)

So it doesn't really matter which approach we take, and using a standard-library builtin is overall less code and less hacky.
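A small probe makes the claimed ordering observable - Probe is a hypothetical class; a module-level instance is collected during shutdown GC, at which point both signals can be inspected:

import atexit
import sys

_atexit_fired = False

def _mark():
    global _atexit_fired
    _atexit_fired = True

atexit.register(_mark)

class Probe:
    def __del__(self):
        # Under the ordering above (steps 3-5), both should print True.
        print("atexit flag:", _atexit_fired,
              "is_finalizing:", sys.is_finalizing())

probe = Probe()  # kept alive by the module global until shutdown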

Member

For posterity: it turns out is_finalizing becomes True too late, and this PR does not fix all shutdown errors: #1063. We'll fix this in #1070.

Contributor

@cpcloud cpcloud Oct 3, 2025

This makes it sound like the previously proposed solution using atexit would work, but it wouldn't, because __del__ is still called at the same point in the program regardless of how the interpreter's state is checked.

It would be more helpful to provide thorough reasoning (as I did above). Right now it just looks like everything I said was incorrect, with no explanation of why.

@brandon-b-miller
Contributor Author

Thanks, Brandon! Let's please not touch the .close() method. It is an explicit call, so there is no chance of hitting the shutdown issue there. The is_shutting_down call should only be added to __del__, which is where all the issues happen.

@leofang the problem with this approach is that cuda.core.Stream is a cdef class, which requires a fixed signature for __del__ itself, whereas the CuPy stream object, for instance, is just a pure Python class. But maybe we can keep __del__ and close the same if we introduce something like a safe_close method, called only by __del__, that runs the extra check? That would keep existing calls to close unchanged while changing behavior only when closing happens under a __del__.

@brandon-b-miller
Contributor Author

safe_close() currently violates DRY, but I wanted to get the idea out there before refactoring.

@leofang
Member

leofang commented Sep 17, 2025

@leofang the problem with this approach is that cuda.core.Stream is a cdef class, which requires a fixed signature for __del__ itself, whereas the CuPy stream object, for instance, is just a pure Python class.

Ah! This is what I have missed - thanks!

But maybe we can keep __del__ and close the same if we introduce something like a safe_close method, called only by __del__, that runs the extra check? That would keep existing calls to close unchanged while changing behavior only when closing happens under a __del__.

Sounds like a good idea. I like a variant of this:

cdef _shutdown_safe_close(self, is_shutting_down=is_shutting_down):
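    # Note: binding is_shutting_down as a default argument evaluates it at
    # definition time, so the function stays reachable even if module
    # globals are cleared before __del__ runs at interpreter shutdown.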
    if is_shutting_down and is_shutting_down():
        return
    # do cleanup

def __del__(self):
    self._shutdown_safe_close()

cpdef close(self, stream=None):
    self._shutdown_safe_close(is_shutting_down=None)  # bypass the shutdown check

Member

@leofang leofang left a comment

LGTM!

Member

@leofang leofang left a comment

Ah, sorry Brandon, one more thing: could you add a release note entry to cuda_core/docs/source/release/0.X.Y-notes.rst?

@leofang
Member

leofang commented Sep 26, 2025

/ok to test

@leofang leofang enabled auto-merge (squash) September 26, 2025 21:35


@leofang leofang merged commit 62d6963 into NVIDIA:main Sep 27, 2025
56 checks passed
Doc Preview CI
Preview removed because the pull request was closed or merged.

Labels
bug (Something isn't working), cuda.core (Everything related to the cuda.core module), P0 (High priority - Must do!)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG]: cuda.core Event failed to complete cleanup properly
4 participants