
Conversation

cpcloud (Contributor) commented Sep 23, 2025

Updated codegen to fix an issue with calling cudart.getLocalRuntimeVersion more than once. See the generated comment for details.

copy-pr-bot (bot) commented Sep 23, 2025

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.


cpcloud (Contributor, Author) commented Sep 23, 2025

/ok to test

@cpcloud cpcloud changed the title leave dl dangling fix(library): avoid spurious close of cached shared library when calling cudart.getLocalRuntimeVersion Sep 23, 2025
kkraus14 previously approved these changes Sep 23, 2025
rwgk (Collaborator) commented Sep 23, 2025

I feel pretty uneasy about this approach, and also about the unnecessary duplication of the test code.


# achieved by caching (ultimately) `ctypes.CDLL` calls.
#
# Long(er)-term we can explore cleaning up the library using higher-level
# Python mechanisms, like `__del__` or `weakref.finalizer`s.
Collaborator

Copying my comment from the private PR:

Could you help me understand the context more? Where is the functools.cache?

I'm thinking it wouldn't be difficult to change this function to do the caching of the result/error right here.

Contributor Author

Where is the functools.cache?

https://github.com/nvidia/cuda-python/blob/main/cuda_pathfinder/cuda/pathfinder/_dynamic_libs/load_nvidia_dynamic_lib.py?plain=1#L54

I'm thinking it wouldn't be difficult to change this function to do the caching of the result/error right here.

The caching isn't of the function's result.

It's that this function loads the library and then closes it, invalidating the handle to that library that is cached by the functools.cache decorator on load_nvidia_dynamic_lib. The pointer value itself is unchanged, but the symbol table (at least in the ELF loader) ends up containing NULL pointers that are eventually dereferenced during a subsequent dlsym call made with that (now invalid) handle.
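
To make the failure mode concrete, here is a minimal sketch under assumed names (load_lib, a libcudart soname) rather than the actual pathfinder code; it shows a cached loader whose dlopen handle gets closed out from under the cache:

import ctypes
import functools

LIBDL = ctypes.CDLL("libdl.so.2")  # assumes glibc; dlclose may also live in libc
LIBDL.dlclose.argtypes = [ctypes.c_void_p]
LIBDL.dlclose.restype = ctypes.c_int

@functools.cache
def load_lib(soname: str) -> ctypes.CDLL:
    # The CDLL object, and with it its dlopen handle, is cached for the
    # lifetime of the process.
    return ctypes.CDLL(soname, mode=ctypes.RTLD_GLOBAL)

def get_version_then_cleanup() -> None:
    lib = load_lib("libcudart.so.13")  # illustrative soname
    # ... look up and call symbols through `lib` here ...
    LIBDL.dlclose(ctypes.c_void_p(lib._handle))  # closes the handle the cache still holds
    # The next call gets the same cached CDLL back, but its handle now refers
    # to a closed library, so a later dlsym through it can hit stale loader
    # state and crash.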

Collaborator

Reposting my response here as well:

Ah ... thanks. Could you please give me a moment to think about this?

I didn't realize that the caching implies: never close the handle. That's not good.

Contributor Author

We can keep the discussion here to avoid duplicating on cuda-python-private.

Contributor Author

I spent most of the day going down this rabbit hole, so I'm happy to talk it through IRL if that helps.

Collaborator

Assuming you need to get this issue out of the way asap:

WDYT about:

Comment out the code here (but don't delete for easy reference).

Add this comment:

# Skip closing handle until https://github.com/NVIDIA/cuda-python/issues/1011 is resolved.

Contributor Author

Comment out the code here (but don't delete for easy reference).

Not really a huge fan of that in general.

We have git history if someone really needs the exact code.

Contributor Author

Assuming you need to get this issue out of the way asap:

There's no rush here since 3.13t is experimental and 3.14 is still an RC.

If you have a solution you want to explore, have at it!

Collaborator

I think we will not have a solution overnight. For the moment I'd just do this:

# Currently pathfinder does not support closing the handle.
# See https://github.com/NVIDIA/cuda-python/issues/1011 for background.

It's fine to delete the code for closing the handle entirely. From what I learned yesterday afternoon, the code here will have to change for sure, if we decide to support closing the handles.

Contributor Author

I'm not following what problem remains with the weakref.finalize solution. I'm not saying it is without problems, but I currently don't see how it fails to correctly close a CDLL-opened library at the right time.

cpcloud (Contributor, Author) commented Sep 23, 2025

the unnecessary duplication of the test code.

The point of the duplication is to test a specific use case under a new test name. I suppose I could instead just call the function twice, since that's what the test exercises. Would that be better?


cpcloud (Contributor, Author) commented Sep 23, 2025

I have an implementation of proper cleanup. Should I push it here, or should I make a new PR? It still requires removing the unloading calls from Cython, so it might make more sense to push it up here.

cpcloud (Contributor, Author) commented Sep 23, 2025

@rwgk I pushed up a commit with the proposed clean up mechanism. Happy to discuss tomorrow as I am checking out for the day!

# We only need the address, so the rest of whatever is in `library` is free
# to be cleaned up. The integer address is immutable, so it gets copied
# upon being referenced here
weakref.finalize(library, unload_dl, ctypes.c_void_p(library._handle_uint))
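
As a standalone illustration of the pattern above (a sketch assuming glibc's libdl and ctypes' private _handle attribute, not the exact pathfinder code), the key point is that the finalizer's arguments must not reference the library object itself:

import ctypes
import weakref

LIBDL = ctypes.CDLL("libdl.so.2")
LIBDL.dlclose.argtypes = [ctypes.c_void_p]
LIBDL.dlclose.restype = ctypes.c_int

def unload_dl(handle: ctypes.c_void_p) -> None:
    if LIBDL.dlclose(handle):
        raise RuntimeError("dlclose failed")

def attach_unload_finalizer(library: ctypes.CDLL) -> weakref.finalize:
    # Capture only the integer address. Passing `library` itself as an argument
    # would give the finalizer a strong reference to it, so the object could
    # never be collected and the finalizer would only ever run at interpreter exit.
    return weakref.finalize(library, unload_dl, ctypes.c_void_p(library._handle))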
Collaborator

In the case where, within _load_lib_no_cache, check_if_already_loaded_from_elsewhere returns the LoadedDL, we probably shouldn't close the handle either, since we're just resolving the handle to a library already loaded from elsewhere?

Collaborator

Yeah, it's quite complex. It took me 5 minutes to write this prompt and upload the files, then the reasoning part took a whopping 10 minutes (longest I've seen so far):

https://chatgpt.com/share/68d3709e-a60c-8008-95c2-978251b18874

I already spent 20 minutes reviewing the response. I think it's really good, although I already discovered one small flaw in the reference implementation. I'll work on it a little more to be sure it can be made to work in general, then we can decide what balance of complexity vs. features we want.

Contributor Author

@kkraus14

we probably shouldn't close the handle either since we're just resolving the handle to an already loaded library from elsewhere

If we're dlopen-ing it, then we need a corresponding dlclose. Are we dlopen-ing the library when we resolve the handle?

@rwgk

I really think we should try to think through the problem and not just go with whatever chatgpt gives us.

I personally find it hard to justify a complicated AI-generated solution that adds extra APIs, each involving concepts (such as pinning) that are novel in Python APIs, plus the need to maintain a solution that a human did not come up with and likely cannot explain with the same level of understanding they would have had if they had produced it themselves.

That said, this is just my opinion, so take it with a grain of salt. If the AI solution is really better than getting to the bottom of the problem and building a tailor-made solution then perhaps it's fine.

Contributor Author

Since RTLD_NOLOAD increments the library's reference count, the finalize here is still necessary even if the library was loaded from elsewhere.
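
A small sketch of that refcount behavior (illustrative only; the soname is an assumption): opening an already-loaded library with RTLD_NOLOAD still bumps its reference count, which must later be balanced by a dlclose:

import ctypes
import os

RTLD_NOLOAD = getattr(os, "RTLD_NOLOAD", 0x04)  # glibc value as a fallback
soname = "libnvJitLink.so.12"  # assumed soname, for illustration only

try:
    # Succeeds only if the library is already mapped into the process;
    # on success the loader increments the library's reference count.
    lib = ctypes.CDLL(soname, mode=ctypes.RTLD_GLOBAL | RTLD_NOLOAD)
except OSError:
    print(f"{soname} is not currently loaded")
else:
    # That increment must eventually be balanced by a dlclose on lib._handle,
    # which is what the weakref.finalize above arranges.
    print(f"{soname} already loaded, handle {lib._handle:#x}")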

Collaborator

The RTLD_NOLOAD is via a ctypes.CDLL call, though, as opposed to a direct dlopen. It seems ctypes doesn't actually close the handle, which is surprising to me and makes my previous statement incorrect.

Collaborator

Wow, ok, tx.

I've been blissfully ignorant of the handle refcount system; I need to review the existing code with a new mindset. Unblocking you here with a minimal change and looking carefully in a separate PR is preferred.

One thing I want to look into: Is it ok for the existing clients (mainly cython-gen, cybind) if we close the handle, which may trigger unloading the library? The clients are holding on to pointers into the dynamic libs, e.g.

with gil, __symbol_lock:
    # Load library
    handle = load_nvidia_dynamic_lib("nvJitLink")._handle_uint
    # Load function
    global __nvJitLinkCreate
    __nvJitLinkCreate = GetProcAddress(handle, 'nvJitLinkCreate')
    global __nvJitLinkDestroy
    __nvJitLinkDestroy = GetProcAddress(handle, 'nvJitLinkDestroy')
    global __nvJitLinkAddData
    __nvJitLinkAddData = GetProcAddress(handle, 'nvJitLinkAddData')
    global __nvJitLinkAddFile
    __nvJitLinkAddFile = GetProcAddress(handle, 'nvJitLinkAddFile')
    global __nvJitLinkComplete
    __nvJitLinkComplete = GetProcAddress(handle, 'nvJitLinkComplete')
    global __nvJitLinkGetLinkedCubinSize
    __nvJitLinkGetLinkedCubinSize = GetProcAddress(handle, 'nvJitLinkGetLinkedCubinSize')
    global __nvJitLinkGetLinkedCubin
    __nvJitLinkGetLinkedCubin = GetProcAddress(handle, 'nvJitLinkGetLinkedCubin')
    global __nvJitLinkGetLinkedPtxSize
    __nvJitLinkGetLinkedPtxSize = GetProcAddress(handle, 'nvJitLinkGetLinkedPtxSize')
    global __nvJitLinkGetLinkedPtx
    __nvJitLinkGetLinkedPtx = GetProcAddress(handle, 'nvJitLinkGetLinkedPtx')
    global __nvJitLinkGetErrorLogSize
    __nvJitLinkGetErrorLogSize = GetProcAddress(handle, 'nvJitLinkGetErrorLogSize')
    global __nvJitLinkGetErrorLog
    __nvJitLinkGetErrorLog = GetProcAddress(handle, 'nvJitLinkGetErrorLog')
    global __nvJitLinkGetInfoLogSize
    __nvJitLinkGetInfoLogSize = GetProcAddress(handle, 'nvJitLinkGetInfoLogSize')
    global __nvJitLinkGetInfoLog
    __nvJitLinkGetInfoLog = GetProcAddress(handle, 'nvJitLinkGetInfoLog')
    global __nvJitLinkVersion
    __nvJitLinkVersion = GetProcAddress(handle, 'nvJitLinkVersion')
    __py_nvjitlink_init = True
    return 0

I need to find out: What happens if we trigger unloading a library while we still have the addresses where the symbols were loaded into memory?
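
In ctypes terms the concern looks roughly like this (a sketch, not the generated cython-gen code; the soname is an assumption): the client stashes a raw symbol address in a module-level global, and unloading the library later would turn that cached address into a dangling pointer.

import ctypes

lib = ctypes.CDLL("libnvJitLink.so.12", mode=ctypes.RTLD_GLOBAL)  # assumed soname

# Module-level global holding the raw address of a symbol, mirroring the
# GetProcAddress pattern above.
__nvJitLinkVersion_addr = ctypes.cast(lib.nvJitLinkVersion, ctypes.c_void_p).value

# If the library is later dlclose'd and actually unmapped, calling through
# __nvJitLinkVersion_addr dereferences memory that may no longer be mapped:
# undefined behavior, typically a segfault.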

Collaborator

It only runs when the process is finalizing.

I don't think that's true.

You're right, sorry.

Actually, no, based on this quick experiment it is true:

  • I checked out your leave-dl-dangling branch (this PR)

  • I added this one line:

--- a/cuda_pathfinder/cuda/pathfinder/_dynamic_libs/load_dl_linux.py
+++ b/cuda_pathfinder/cuda/pathfinder/_dynamic_libs/load_dl_linux.py
@@ -47,6 +47,7 @@ RTLD_DI_ORIGIN = 6


 def unload_dl(handle: ctypes.c_void_p) -> None:
+    print(f"\nLOOOK unload_dl({handle=!r})", flush=True)
     result = LIBDL.dlclose(handle)
     if result:
         raise RuntimeError(LIBDL.dlerror())
  • Then I ran the cuda_bindings unit tests:
(WslLocalCudaVenv) rwgk-win11.localdomain:~/forked/cuda-python/cuda_bindings $ pytest -ra -s -v tests/
====================================================================================================== test session starts =======================================================================================================
platform linux -- Python 3.12.3, pytest-8.4.2, pluggy-1.6.0 -- /home/rgrossekunst/forked/cuda-python/cuda_pathfinder/WslLocalCudaVenv/bin/python3
cachedir: .pytest_cache
benchmark: 5.1.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /home/rgrossekunst/forked/cuda-python/cuda_bindings/tests
configfile: pytest.ini
plugins: benchmark-5.1.0
collected 192 items / 1 skipped

tests/test_cuda.py::test_cuda_memcpy PASSED
tests/test_cuda.py::test_cuda_array PASSED
tests/test_cuda.py::test_cuda_repr_primitive PASSED
<... snip ...>
tests/test_cudart.py::test_getLocalRuntimeVersion PASSED
<... snip ...>
tests/test_utils.py::test_cyclical_imports[nvvm] PASSED
tests/test_utils.py::test_cyclical_imports[runtime] PASSED
tests/test_utils.py::test_cyclical_imports[cufile] PASSED

==================================================================================================== short test summary info =====================================================================================================
SKIPPED [1] tests/test_cufile.py:40: skipping cuFile tests on WSL
================================================================================================= 192 passed, 1 skipped in 7.03s =================================================================================================

LOOOK unload_dl(handle=c_void_p(791712928))

LOOOK unload_dl(handle=c_void_p(781336800))

LOOOK unload_dl(handle=c_void_p(732155104))

LOOOK unload_dl(handle=c_void_p(731635248))
(WslLocalCudaVenv) rwgk-win11.localdomain:~/forked/cuda-python/cuda_bindings $

Because of functools.cache, the finalizers only run in the process tear-down, when the handles are sure to disappear anyway.
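
The same behavior can be reproduced without CUDA at all; a minimal sketch with hypothetical names, showing that the cache's strong reference defers the finalizer to interpreter shutdown:

import functools
import weakref

class Handle:
    """Stand-in for LoadedDL / a CDLL object."""

def cleanup(tag: str) -> None:
    print(f"finalizer ran for {tag}")

@functools.cache
def load(tag: str) -> Handle:
    handle = Handle()
    # The cache holds a strong reference to `handle`, so this finalizer
    # cannot fire until the interpreter tears the cache down at exit.
    weakref.finalize(handle, cleanup, tag)
    return handle

load("nvJitLink")
load("nvJitLink")       # cache hit: same object, no new finalizer
print("end of script")  # "finalizer ran for nvJitLink" is printed after this line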

Member

I am very nervous about the finalizer solution: per Phillip's investigation last week, we will call the C APIs to do cleanup, but aren't the finalizers called before that step, leaving us with dangling pointers?

We have been leaking DSO handles on purpose since pretty much the beginning of this project (~2021) and no one complained, in particular with respect to our handling of the driver (libcuda). Why is this leaking suddenly not acceptable here?

Contributor Author

... aren't the finalizers called before that step and leaving us dangling pointers?

Is it possible that the void* function pointers in various places get called after the library is closed, which happens at the module level? This would mean that the function (literally, the function object corresponding to the thing created by def; this matters because that's where the cache is stored) gets torn down, and the finalizer called on LoadedDL, before one of those functions is invoked.

I can imagine the following scenario:

lib = load_nvidia_shared_library("nvjitlink")
foo = create_something_that_eventually_calls_a_cleanup_routine() # note the possible implicit call to `load_nvidia_shared_library`

# ... many lines of code later
#
# Python shuts down
# Python calls the finalizer for `lib`
# Python GC's `foo` which needs an open handle corresponding to `lib`'s address

So, we are probably always going to be constrained by this if we support code paths that can call globally-defined-and-expected-to-be-valid-pointers at any point in a program's lifetime.

I just spoke with @rwgk in person and we agreed that leaking the handles is preferable for now. I will revert the most recent commit, leave the comment, and then we can merge this fix.

Contributor Author

So, as best any of us knows, the answer is: "we don't know, but we do know it might be necessary to call these functions at any point, so we leak the handles to avoid a worse outcome."

rwgk (Collaborator) left a comment

Thanks @cpcloud, and sorry we accidentally sent you into this hole!

leofang (Member) commented Sep 24, 2025

/ok to test cd89f39

@leofang leofang added the bug (Something isn't working), P0 (High priority - Must do!), and cuda.bindings (Everything related to the cuda.bindings module) labels Sep 24, 2025
@leofang leofang added this to the cuda-python 13-next, 12-next milestone Sep 24, 2025
@leofang leofang enabled auto-merge (squash) September 24, 2025 22:41
@leofang leofang merged commit 0bcce3f into NVIDIA:main Sep 24, 2025
73 checks passed
Doc Preview CI: Preview removed because the pull request was closed or merged.

@cpcloud cpcloud deleted the leave-dl-dangling branch September 25, 2025 13:18