
Conversation

cpcloud (Contributor) commented Sep 23, 2025

Updated codegen to fix an issue with calling cudart.getLocalRuntimeVersion more than once. See the generated comment for details.

copy-pr-bot (bot) commented Sep 23, 2025

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.


cpcloud (Contributor, Author) commented Sep 23, 2025

/ok to test

@cpcloud cpcloud changed the title leave dl dangling fix(library): avoid spurious close of cached shared library when calling cudart.getLocalRuntimeVersion Sep 23, 2025
kkraus14 previously approved these changes Sep 23, 2025
rwgk (Collaborator) commented Sep 23, 2025

I feel pretty uneasy about this approach, and also about the unnecessary duplication of the test code.


# achieved by caching (ultimately) `ctypes.CDLL` calls.
#
# Long(er)-term we can explore cleaning up the library using higher-level
# Python mechanisms, like `__del__` or `weakref.finalizer`s.
Collaborator

Copying my comment from the private PR:

Could you help me understand the context more? Where is the functools.cache?

I'm thinking it wouldn't be difficult to change this function to do the caching of the result/error right here.

Contributor Author

Where is the functools.cache?

https://github.com/nvidia/cuda-python/blob/main/cuda_pathfinder/cuda/pathfinder/_dynamic_libs/load_nvidia_dynamic_lib.py?plain=1#L54

I'm thinking it wouldn't be difficult to change this function to do the caching of the result/error right here.

The caching isn't of the function's result.

It's that this function loads the library and then closes it, invalidating the handle to that library that is cached by the functools.cache decorator on load_nvidia_dynamic_lib. The pointer value itself is unchanged, but the symbol table (at least in the ELF loader) ends up containing NULL pointers that are eventually dereferenced during a subsequent dlsym call made with that (now invalid) handle.
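
To make the failure mode concrete, here is a minimal sketch under assumed names (load_lib, a libcudart soname) rather than the actual pathfinder code; it shows a cached loader whose dlopen handle gets closed out from under the cache:

import ctypes
import functools

LIBDL = ctypes.CDLL("libdl.so.2")  # assumes glibc; dlclose may also live in libc
LIBDL.dlclose.argtypes = [ctypes.c_void_p]
LIBDL.dlclose.restype = ctypes.c_int

@functools.cache
def load_lib(soname: str) -> ctypes.CDLL:
    # The CDLL object, and with it its dlopen handle, is cached for the
    # lifetime of the process.
    return ctypes.CDLL(soname, mode=ctypes.RTLD_GLOBAL)

def get_version_then_cleanup() -> None:
    lib = load_lib("libcudart.so.13")  # illustrative soname
    # ... look up and call symbols through `lib` here ...
    LIBDL.dlclose(ctypes.c_void_p(lib._handle))  # closes the handle the cache still holds
    # The next call gets the same cached CDLL back, but its handle now refers
    # to a closed library, so a later dlsym through it can hit stale loader
    # state and crash.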

Collaborator

Reposting my response here as well:

Ah ... thanks. Could you please give me a moment to think about this?

I didn't realize that the caching implies: never close the handle. That's not good.

Contributor Author

We can keep the discussion here to avoid duplicating on cuda-python-private.

Contributor Author

I spent most of the day going down this rabbit hole, so I'm happy to talk it through IRL if that helps.

Collaborator

Assuming you need to get this issue out of the way asap:

WDYT about:

Comment out the code here (but don't delete for easy reference).

Add this comment:

# Skip closing handle until https://github.com/NVIDIA/cuda-python/issues/1011 is resolved.

Contributor Author

Comment out the code here (but don't delete for easy reference).

Not really a huge fan of that in general.

We have git history if someone really needs the exact code.

Contributor Author

Assuming you need to get this issue out of the way asap:

There's no rush here since 3.13t is experimental and 3.14 is still an RC.

If you have a solution you want to explore, have at it!

Collaborator

I think we will not have a solution overnight. For the moment I'd just do this:

# Currently pathfinder does not support closing the handle.
# See https://github.com/NVIDIA/cuda-python/issues/1011 for background.

It's fine to delete the code for closing the handle entirely. From what I learned yesterday afternoon, the code here will have to change for sure, if we decide to support closing the handles.

Contributor Author

I'm not following what problem remains with the weakref.finalize solution. I'm not saying it is without problems, but I currently don't see how it fails to correctly close a CDLL-opened library at the right time.

cpcloud (Contributor, Author) commented Sep 23, 2025

the unnecessary duplication of the test code.

The point of the duplication is to test a specific use case under a new test name. I suppose I could instead just call the function twice, since that's what the test exercises. Would that be better?


cpcloud (Contributor, Author) commented Sep 23, 2025

I have an implementation of proper cleanup. Should I push it here, or should I make a new PR? It still requires removing the unloading calls from Cython, so it might make more sense to push it up here.

cpcloud (Contributor, Author) commented Sep 23, 2025

@rwgk I pushed up a commit with the proposed clean up mechanism. Happy to discuss tomorrow as I am checking out for the day!

# We only need the address, so the rest of whatever is in `library` is free
# to be cleaned up. The integer address is immutable, so it gets copied
# upon being referenced here
weakref.finalize(library, unload_dl, ctypes.c_void_p(library._handle_uint))
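
As a standalone illustration of the pattern above (a sketch assuming glibc's libdl and ctypes' private _handle attribute, not the exact pathfinder code), the key point is that the finalizer's arguments must not reference the library object itself:

import ctypes
import weakref

LIBDL = ctypes.CDLL("libdl.so.2")
LIBDL.dlclose.argtypes = [ctypes.c_void_p]
LIBDL.dlclose.restype = ctypes.c_int

def unload_dl(handle: ctypes.c_void_p) -> None:
    if LIBDL.dlclose(handle):
        raise RuntimeError("dlclose failed")

def attach_unload_finalizer(library: ctypes.CDLL) -> weakref.finalize:
    # Capture only the integer address. Passing `library` itself as an argument
    # would give the finalizer a strong reference to it, so the object could
    # never be collected and the finalizer would only ever run at interpreter exit.
    return weakref.finalize(library, unload_dl, ctypes.c_void_p(library._handle))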
Collaborator

In the case where, within _load_lib_no_cache, check_if_already_loaded_from_elsewhere returns the LoadedDL, we probably shouldn't close the handle either, since we're just resolving the handle to a library already loaded from elsewhere?

Collaborator

Yeah, it's quite complex. It took me 5 minutes to write this prompt and upload the files, then the reasoning part took a whopping 10 minutes (longest I've seen so far):

https://chatgpt.com/share/68d3709e-a60c-8008-95c2-978251b18874

I already spent 20 minutes reviewing the response. I think it's really good, although I already discovered one small flaw in the reference implementation. I'll work on it a little more to be sure it can be made to work in general, then we can decide what balance of complexity vs. features we want.

Contributor Author

@kkraus14

we probably shouldn't close the handle either since we're just resolving the handle to an already loaded library from elsewhere

If we're dlopen-ing it, then we need a corresponding dlclose. Are we dlopen-ing the library when we resolve the handle?

@rwgk

I really think we should try to think through the problem and not just go with whatever chatgpt gives us.

I personally find it hard to justify a complicated AI-generated solution that adds extra APIs, each involving concepts (such as pinning) that are novel in Python APIs, plus the need to maintain a solution that a human did not come up with and likely cannot explain with the same level of understanding they would have had if they had produced it themselves.

That said, this is just my opinion, so take it with a grain of salt. If the AI solution is really better than getting to the bottom of the problem and building a tailor-made solution then perhaps it's fine.

Contributor Author

Since RTLD_NOLOAD increments the library's reference count, the finalize here is still necessary even if the library was loaded from elsewhere.
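
A small sketch of that refcount behavior (illustrative only; the soname is an assumption): opening an already-loaded library with RTLD_NOLOAD still bumps its reference count, which must later be balanced by a dlclose:

import ctypes
import os

RTLD_NOLOAD = getattr(os, "RTLD_NOLOAD", 0x04)  # glibc value as a fallback
soname = "libnvJitLink.so.12"  # assumed soname, for illustration only

try:
    # Succeeds only if the library is already mapped into the process;
    # on success the loader increments the library's reference count.
    lib = ctypes.CDLL(soname, mode=ctypes.RTLD_GLOBAL | RTLD_NOLOAD)
except OSError:
    print(f"{soname} is not currently loaded")
else:
    # That increment must eventually be balanced by a dlclose on lib._handle,
    # which is what the weakref.finalize above arranges.
    print(f"{soname} already loaded, handle {lib._handle:#x}")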

Collaborator

The RTLD_NOLOAD is via a ctypes.CDLL call, though, as opposed to a direct dlopen. It seems ctypes doesn't actually close the handle, which is surprising to me and makes my previous statement incorrect.

Collaborator

Wow, ok, tx.

I've been blissfully ignorant of the handle refcount system; I need to review the existing code with a new mindset. Unblocking you here with a minimal change and looking carefully in a separate PR is preferred.

One thing I want to look into: Is it ok for the existing clients (mainly cython-gen, cybind) if we close the handle, which may trigger unloading the library? The clients are holding on to pointers into the dynamic libs, e.g.

with gil, __symbol_lock:
    # Load library
    handle = load_nvidia_dynamic_lib("nvJitLink")._handle_uint
    # Load function
    global __nvJitLinkCreate
    __nvJitLinkCreate = GetProcAddress(handle, 'nvJitLinkCreate')
    global __nvJitLinkDestroy
    __nvJitLinkDestroy = GetProcAddress(handle, 'nvJitLinkDestroy')
    global __nvJitLinkAddData
    __nvJitLinkAddData = GetProcAddress(handle, 'nvJitLinkAddData')
    global __nvJitLinkAddFile
    __nvJitLinkAddFile = GetProcAddress(handle, 'nvJitLinkAddFile')
    global __nvJitLinkComplete
    __nvJitLinkComplete = GetProcAddress(handle, 'nvJitLinkComplete')
    global __nvJitLinkGetLinkedCubinSize
    __nvJitLinkGetLinkedCubinSize = GetProcAddress(handle, 'nvJitLinkGetLinkedCubinSize')
    global __nvJitLinkGetLinkedCubin
    __nvJitLinkGetLinkedCubin = GetProcAddress(handle, 'nvJitLinkGetLinkedCubin')
    global __nvJitLinkGetLinkedPtxSize
    __nvJitLinkGetLinkedPtxSize = GetProcAddress(handle, 'nvJitLinkGetLinkedPtxSize')
    global __nvJitLinkGetLinkedPtx
    __nvJitLinkGetLinkedPtx = GetProcAddress(handle, 'nvJitLinkGetLinkedPtx')
    global __nvJitLinkGetErrorLogSize
    __nvJitLinkGetErrorLogSize = GetProcAddress(handle, 'nvJitLinkGetErrorLogSize')
    global __nvJitLinkGetErrorLog
    __nvJitLinkGetErrorLog = GetProcAddress(handle, 'nvJitLinkGetErrorLog')
    global __nvJitLinkGetInfoLogSize
    __nvJitLinkGetInfoLogSize = GetProcAddress(handle, 'nvJitLinkGetInfoLogSize')
    global __nvJitLinkGetInfoLog
    __nvJitLinkGetInfoLog = GetProcAddress(handle, 'nvJitLinkGetInfoLog')
    global __nvJitLinkVersion
    __nvJitLinkVersion = GetProcAddress(handle, 'nvJitLinkVersion')
    __py_nvjitlink_init = True
    return 0

I need to find out: What happens if we trigger unloading a library while we still have the addresses where the symbols were loaded into memory?
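
In ctypes terms the concern looks roughly like this (a sketch, not the generated cython-gen code; the soname is an assumption): the client stashes a raw symbol address in a module-level global, and unloading the library later would turn that cached address into a dangling pointer.

import ctypes

lib = ctypes.CDLL("libnvJitLink.so.12", mode=ctypes.RTLD_GLOBAL)  # assumed soname

# Module-level global holding the raw address of a symbol, mirroring the
# GetProcAddress pattern above.
__nvJitLinkVersion_addr = ctypes.cast(lib.nvJitLinkVersion, ctypes.c_void_p).value

# If the library is later dlclose'd and actually unmapped, calling through
# __nvJitLinkVersion_addr dereferences memory that may no longer be mapped:
# undefined behavior, typically a segfault.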

Collaborator

It only runs when the process is finalizing.

I don't think that's true.

You're right, sorry.

Actually, no, based on this quick experiment it is true:

  • I checked out your leave-dl-dangling branch (this PR)

  • I added this one line:

--- a/cuda_pathfinder/cuda/pathfinder/_dynamic_libs/load_dl_linux.py
+++ b/cuda_pathfinder/cuda/pathfinder/_dynamic_libs/load_dl_linux.py
@@ -47,6 +47,7 @@ RTLD_DI_ORIGIN = 6


 def unload_dl(handle: ctypes.c_void_p) -> None:
+    print(f"\nLOOOK unload_dl({handle=!r})", flush=True)
     result = LIBDL.dlclose(handle)
     if result:
         raise RuntimeError(LIBDL.dlerror())
  • Then I ran the cuda_bindings unit tests:
(WslLocalCudaVenv) rwgk-win11.localdomain:~/forked/cuda-python/cuda_bindings $ pytest -ra -s -v tests/
====================================================================================================== test session starts =======================================================================================================
platform linux -- Python 3.12.3, pytest-8.4.2, pluggy-1.6.0 -- /home/rgrossekunst/forked/cuda-python/cuda_pathfinder/WslLocalCudaVenv/bin/python3
cachedir: .pytest_cache
benchmark: 5.1.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /home/rgrossekunst/forked/cuda-python/cuda_bindings/tests
configfile: pytest.ini
plugins: benchmark-5.1.0
collected 192 items / 1 skipped

tests/test_cuda.py::test_cuda_memcpy PASSED
tests/test_cuda.py::test_cuda_array PASSED
tests/test_cuda.py::test_cuda_repr_primitive PASSED
<... snip ...>
tests/test_cudart.py::test_getLocalRuntimeVersion PASSED
<... snip ...>
tests/test_utils.py::test_cyclical_imports[nvvm] PASSED
tests/test_utils.py::test_cyclical_imports[runtime] PASSED
tests/test_utils.py::test_cyclical_imports[cufile] PASSED

==================================================================================================== short test summary info =====================================================================================================
SKIPPED [1] tests/test_cufile.py:40: skipping cuFile tests on WSL
================================================================================================= 192 passed, 1 skipped in 7.03s =================================================================================================

LOOOK unload_dl(handle=c_void_p(791712928))

LOOOK unload_dl(handle=c_void_p(781336800))

LOOOK unload_dl(handle=c_void_p(732155104))

LOOOK unload_dl(handle=c_void_p(731635248))
(WslLocalCudaVenv) rwgk-win11.localdomain:~/forked/cuda-python/cuda_bindings $

Because of functools.cache, the finalizers only run in the process tear-down, when the handles are sure to disappear anyway.
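
The same behavior can be reproduced without CUDA at all; a minimal sketch with hypothetical names, showing that the cache's strong reference defers the finalizer to interpreter shutdown:

import functools
import weakref

class Handle:
    """Stand-in for LoadedDL / a CDLL object."""

def cleanup(tag: str) -> None:
    print(f"finalizer ran for {tag}")

@functools.cache
def load(tag: str) -> Handle:
    handle = Handle()
    # The cache holds a strong reference to `handle`, so this finalizer
    # cannot fire until the interpreter tears the cache down at exit.
    weakref.finalize(handle, cleanup, tag)
    return handle

load("nvJitLink")
load("nvJitLink")       # cache hit: same object, no new finalizer
print("end of script")  # "finalizer ran for nvJitLink" is printed after this line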

Member

I am very nervous about the finalizer solution: per Phillip's investigation last week, we will call the C APIs to do cleanup, but aren't the finalizers called before that step, leaving us with dangling pointers?

We have been leaking DSO handles on purpose since pretty much the beginning of this project (~2021) and no one complained, in particular with respect to our handling of the driver (libcuda). Why is this leaking suddenly not acceptable here?

Contributor Author

... aren't the finalizers called before that step and leaving us dangling pointers?

Is it possible that the void* function pointers in various places get called after the library is closed, which happens at the module level? This would mean that the function (literally, the function object corresponding to the thing created by def; this matters because that's where the cache is stored) gets torn down, and the finalizer called on LoadedDL, before one of those functions is invoked.

I can imagine the following scenario:

lib = load_nvidia_shared_library("nvjitlink")
foo = create_something_that_eventually_calls_a_cleanup_routine() # note the possible implicit call to `load_nvidia_shared_library`

# ... many lines of code later
#
# Python shuts down
# Python calls the finalizer for `lib`
# Python GC's `foo` which needs an open handle corresponding to `lib`'s address

So, we are probably always going to be constrained by this if we support code paths that can call globally-defined-and-expected-to-be-valid-pointers at any point in a program's lifetime.

I just spoke with @rwgk in person and we agreed that leaking the handles is preferable for now. I will revert the most recent commit, leave the comment, and then we can merge this fix.

Contributor Author

So, as best any of us knows, the answer is: "we don't know, but we do know it might be necessary to call these functions at any point, so we leak the handles to avoid a worse outcome."

rwgk (Collaborator) left a comment

Thanks @cpcloud, and sorry we accidentally sent you into this hole!

leofang (Member) commented Sep 24, 2025

/ok to test cd89f39

@leofang leofang added the bug (Something isn't working), P0 (High priority - Must do!), and cuda.bindings (Everything related to the cuda.bindings module) labels Sep 24, 2025
@leofang leofang added this to the cuda-python 13-next, 12-next milestone Sep 24, 2025
@leofang leofang enabled auto-merge (squash) September 24, 2025 22:41
@leofang leofang merged commit 0bcce3f into NVIDIA:main Sep 24, 2025
73 checks passed
Doc Preview CI: Preview removed because the pull request was closed or merged.

@cpcloud cpcloud deleted the leave-dl-dangling branch September 25, 2025 13:18