
Conversation

brandon-b-miller
Contributor
Closes #803

I verified the fixes locally by applying them on top of the 12.9.2 tag, since I'm on CUDA 12. Because _memory.pyx did not exist at that tag, the changes there are "blind" for now. However, the change to the stream object reliably gets me past the original error.

copy-pr-bot bot commented Sep 16, 2025

Auto-sync is disabled for ready-for-review pull requests in this repository. Workflows must be run manually.

@leofang
Member

leofang commented Sep 16, 2025

Thanks, Brandon! Let's please not touch the .close() method. It is an explicit call, so there is no chance of hitting the shutdown issue there. The is_shutting_down call should only be added to __del__, which is where all the issues happen.
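For context: the is_shutting_down referenced above is an atexit-flag helper. A minimal sketch of that pattern, with the module layout and names assumed rather than taken from the merged code:

import atexit

_shutting_down = False

def _mark_shutting_down():
    # atexit handlers run early in interpreter shutdown, before the
    # final garbage collection that triggers most __del__ calls.
    global _shutting_down
    _shutting_down = True

atexit.register(_mark_shutting_down)

def is_shutting_down():
    return _shutting_down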

@leofang leofang added bug Something isn't working P0 High priority - Must do! cuda.core Everything related to the cuda.core module labels Sep 16, 2025
@leofang leofang added this to the cuda.core beta 7 milestone Sep 16, 2025
if self._handle is not None:
    if not is_shutting_down():
        err, = driver.cuEventDestroy(self._handle)
Contributor

@cpcloud cpcloud Sep 16, 2025

What's the issue with this?

Suggested change
err, = driver.cuEventDestroy(self._handle)
if (destroy := getattr(driver, "cuEventDestroy", None)) is not None:
    err, = destroy(self._handle)

Then you don't need the global hack.

Member

We want to use a field-tested solution (see the internal thread) instead of coming up with a new one.

Contributor

How about sys.is_finalizing()? There's probably no more field-tested solution than what's in the standard library.

In this case it also seems particularly suited to the problem being solved here.
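A minimal sketch of that approach - Event here is an illustrative stand-in, not the actual cuda.core class:

import sys

class Event:
    def close(self):
        # In the real class this would call driver.cuEventDestroy(...).
        print("releasing handle")

    def __del__(self):
        if sys.is_finalizing():
            return  # interpreter is shutting down; skip driver calls
        self.close()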

Collaborator

This is definitely better, but it is still a hack: if we run under tools like cuda-memcheck that check for resource leaks, the leak will still show up.

Additionally, we'd need to carry these checks everywhere the code could be called from __del__ methods, I believe even transitively, i.e., the raise_if_driver_error function.

What if we moved to a __dealloc__ function?
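For reference, the __dealloc__ shape being floated would look roughly like this in Cython - the extern declarations here are illustrative assumptions, and note that __dealloc__ may only make C-level calls, which is exactly the blocker raised in the next comment:

cdef extern from "cuda.h":
    ctypedef void* CUevent
    int cuEventDestroy(CUevent event)

cdef class Event:
    cdef CUevent _handle

    def __dealloc__(self):
        # __dealloc__ runs even during interpreter shutdown, but only
        # C-level calls are safe here - no Python-level bindings.
        if self._handle != NULL:
            cuEventDestroy(self._handle)
            self._handle = NULL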

Member

@leofang leofang Sep 17, 2025

As discussed internally, sys.is_finalizing() - while official - only returns True at a very late stage of interpreter shutdown, later than all of the exit handlers (which this PR is based on). It is unclear to me whether this solves the problem, and I feel nervous about it. Do we know of any big projects using this solution?

What if we moved to a __dealloc__ function?

No, we can't do this (yet), because we currently call the Python bindings. Once #866 lands we can switch to this, but I need some time to work it out, and I'd prefer this to be fixed independently (and ASAP).

Contributor

@cpcloud cpcloud Sep 17, 2025

Here are the relevant steps, in order, during interpreter shutdown:

  1. wait for threads to shut down
  2. wait for any pending calls
  3. call atexit handlers (where the flag would be set)
  4. set the interpreter to be officially in finalizing mode (this information is what sys.is_finalizing() uses)
  5. collect garbage (__del__ would be called here)

So it doesn't really matter which approach we take, and using a standard-library builtin is overall less code and less hacky.
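A small probe makes the claimed ordering observable - Probe is a hypothetical class; a module-level instance is collected during shutdown GC, at which point both signals can be inspected:

import atexit
import sys

_atexit_fired = False

def _mark():
    global _atexit_fired
    _atexit_fired = True

atexit.register(_mark)

class Probe:
    def __del__(self):
        # Under the ordering above (steps 3-5), both should print True.
        print("atexit flag:", _atexit_fired,
              "is_finalizing:", sys.is_finalizing())

probe = Probe()  # kept alive by the module global until shutdown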

Member

For posterity: it turns out is_finalizing becomes True too late, and this PR does not fix all shutdown errors: #1063. We'll fix this in #1070.

Contributor

@cpcloud cpcloud Oct 3, 2025

This makes it sound like the previously proposed solution using atexit would work, but it wouldn't, because __del__ is still called at the same point in the program regardless of how the interpreter's state is checked.

It would be more helpful to provide thorough reasoning (as I did above). Right now it just looks like everything I said was incorrect, with no explanation of why.

@brandon-b-miller
Contributor Author

Thanks, Brandon! Let's please not touch the .close() method. It is an explicit call, so there is no chance of hitting the shutdown issue there. The is_shutting_down call should only be added to __del__, which is where all the issues happen.

@leofang the problem with this approach is that cuda.core.Stream is a cdef class, which requires a fixed signature for __del__ itself, whereas the CuPy stream object, for instance, is just a pure Python class. But maybe we can keep __del__ and close the same if we introduce something like a safe_close method, called only by __del__, that runs the extra check? That would keep existing calls to close unchanged while changing behavior only when closing happens under a __del__.

@brandon-b-miller
Contributor Author

safe_close() currently violates DRY, but I wanted to get the idea out there before refactoring.

@leofang
Member

leofang commented Sep 17, 2025

@leofang the problem with this approach is that cuda.core.Stream is a cdef class, which requires a fixed signature for __del__ itself, whereas the CuPy stream object, for instance, is just a pure Python class.

Ah! This is what I have missed - thanks!

But maybe we can keep __del__ and close the same if we introduce something like a safe_close method, called only by __del__, that runs the extra check? That would keep existing calls to close unchanged while changing behavior only when closing happens under a __del__.

Sounds like a good idea. I like a variant of this:

cdef _shutdown_safe_close(self, is_shutting_down=is_shutting_down):
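    # Note: binding is_shutting_down as a default argument evaluates it at
    # definition time, so the function stays reachable even if module
    # globals are cleared before __del__ runs at interpreter shutdown.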
    if is_shutting_down and is_shutting_down():
        return
    # do cleanup

def __del__(self):
    self._shutdown_safe_close()

cpdef close(self, stream=None):
    self._shutdown_safe_close(is_shutting_down=None)  # bypass the shutdown check

Member

@leofang leofang left a comment

LGTM!

Member

@leofang leofang left a comment

Ah, sorry Brandon, one more thing: could you add a release note entry to cuda_core/docs/source/release/0.X.Y-notes.rst?

@leofang
Member

leofang commented Sep 26, 2025

/ok to test

@leofang leofang enabled auto-merge (squash) September 26, 2025 21:35


@leofang leofang merged commit 62d6963 into NVIDIA:main Sep 27, 2025
56 checks passed
Doc Preview CI
Preview removed because the pull request was closed or merged.

Labels
bug (Something isn't working), cuda.core (Everything related to the cuda.core module), P0 (High priority - Must do!)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG]: cuda.core Event failed to complete cleanup properly
4 participants