fix(library): avoid spurious close of cached shared library when calling cudart.getLocalRuntimeVersion
#1010
Conversation
/ok to test
cudart.getLocalRuntimeVersion
I feel pretty uneasy about this approach, as well as about the unnecessary duplication of the test code.
# achieved by caching (ultimately) `ctypes.CDLL` calls.
#
# Long(er)-term we can explore cleaning up the library using higher-level
# Python mechanisms, like `__del__` or `weakref.finalizer`s.
Copying my comment from the private PR:
Could you help me understand the context more? Where is the functools.cache?
I'm thinking it wouldn't be difficult to change this function to do the caching of the result/error right here.
Where is the functools.cache?
I'm thinking it wouldn't be difficult to change this function to do the caching of the result/error right here.
The caching isn't done for the function's result. It's that this function loads the library and then closes it, invalidating the handle to that library that is cached by the functools.cache that decorates load_nvidia_dynamic_lib. The pointer itself remains valid, but the symbol table (at least in the ELF loader) ends up containing NULL pointers that are eventually dereferenced during a subsequent dlsym call with that (now invalid) handle.
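For illustration, here is a minimal sketch of the failure mode. This is not the actual pathfinder code: the loader function, the use of libm as the library, and the dlclose lookup are all placeholders chosen so the snippet runs on a typical Linux/glibc system. The point is that a handle cached via functools.cache stays reachable by later callers, so closing it behind the cache's back leaves those callers with a stale handle.

import ctypes
import ctypes.util
import functools

@functools.cache
def load_lib(name):
    # Placeholder for load_nvidia_dynamic_lib: the returned CDLL (and its
    # _handle) is cached for the life of the process.
    return ctypes.CDLL(ctypes.util.find_library(name))

# dlclose is reachable through the global symbol scope on typical glibc systems.
libdl = ctypes.CDLL(None)
libdl.dlclose.argtypes = [ctypes.c_void_p]

lib = load_lib("m")            # first caller triggers a fresh dlopen
libdl.dlclose(lib._handle)     # "cleanup" drops the loader refcount behind the cache's back
lib2 = load_lib("m")           # cache hit: same CDLL object, same (possibly stale) handle
# If the loader refcount had hit zero, a later symbol lookup through lib2 would
# dlsym on a dangling handle, which is the crash seen with getLocalRuntimeVersion.
# (libm stays mapped here because the interpreter also references it, so this
# snippet is safe to run.)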
Reposting my response as well:
Ah ... thanks. Could you please give me a moment to think about this?
I didn't realize that the caching implies: never close the handle. That's not good.
We can keep the discussion here to avoid duplicating on cuda-python-private.
I spent most of the day going down this rabbit hole, so I'm happy to talk it through IRL if that helps.
Assuming you need to get this issue out of the way asap:
WDYT about:
Comment out the code here (but don't delete for easy reference).
Add this comment:
# Skip closing handle until https://github.com/NVIDIA/cuda-python/issues/1011 is resolved.
Comment out the code here (but don't delete for easy reference).
Not really a huge fan of that in general.
We have git history if someone really needs the exact code.
Assuming you need to get this issue out of the way asap:
There's no rush here since 3.13t is experimental and 3.14 is still an RC.
If you have a solution you want to explore, have at it!
I think we will not have a solution overnight. For the moment I'd just do this:
# Currently pathfinder does not support closing the handle.
# See https://github.com/NVIDIA/cuda-python/issues/1011 for background.
It's fine to delete the code for closing the handle entirely. From what I learned yesterday afternoon, the code here will have to change for sure, if we decide to support closing the handles.
I'm not following what problem remains with the weakref finalize solution. I'm not saying it is without problems, but I currently don't see how it doesn't solve the problem of correctly closing a CDLL-opened library at the right time.
The point of the duplication is to test a specific use case with a new test name. I suppose I could just call the function twice, since that's the test. Would that be better? See the sketch below.
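For reference, a minimal sketch of the regression test being discussed. The import path and the test name are assumptions based on the existing test suite layout, not taken from this PR; the essence is simply calling the binding twice and checking that the second call still succeeds.

from cuda import cudart  # assumed import path, mirroring tests/test_cudart.py

def test_getLocalRuntimeVersion_called_twice():
    # The bug only shows up on the second call, after the first call has
    # (incorrectly) closed the cached library handle.
    first = cudart.getLocalRuntimeVersion()
    second = cudart.getLocalRuntimeVersion()
    assert first == second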
# achieved by caching (ultimately) `ctypes.CDLL` calls.
#
# Long(er)-term we can explore cleaning up the library using higher-level
# Python mechanisms, like `__del__` or `weakref.finalizer`s.
Assuming you need to get this issue out of the way asap:
WDYT about:
Comment out the code here (but don't delete for easy reference).
Add this comment:
# Skip closing handle until https://github.com/NVIDIA/cuda-python/issues/1011 is resolved.
I have an implementation of proper cleanup. Should I push it here, or should I make a new PR? It still requires removing the unloading calls from Cython, so it might make more sense to push it up here.
@rwgk I pushed up a commit with the proposed cleanup mechanism. Happy to discuss tomorrow as I am checking out for the day!
# We only need the address, so the rest of whatever is in `library` is free
# to be cleaned up. The integer address is immutable, so it gets copied
# upon being referenced here.
weakref.finalize(library, unload_dl, ctypes.c_void_p(library._handle_uint))
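For readers unfamiliar with the pattern: weakref.finalize attaches a callback that runs when the tracked object is collected (or at interpreter shutdown, whichever comes first), and the callback arguments must not reference the tracked object, or it can never be collected. A minimal, self-contained sketch follows; the class and function names are illustrative stand-ins, not the pathfinder API.

import weakref

class LoadedLib:  # stand-in for pathfinder's LoadedDL
    def __init__(self, handle_uint):
        self._handle_uint = handle_uint

def unload(handle_uint):
    # Stand-in for the real cleanup; the pathfinder version would call dlclose.
    print(f"unloading handle {handle_uint:#x}")

lib = LoadedLib(0xDEADBEEF)
# Pass only the plain integer: if we passed `lib` itself, the finalizer would
# keep the object alive and could never fire before interpreter shutdown.
weakref.finalize(lib, unload, lib._handle_uint)

del lib  # finalizer fires here; otherwise it fires at interpreter shutdown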
In the case where check_if_already_loaded_from_elsewhere returns the LoadedDL within _load_lib_no_cache, we probably shouldn't close the handle either, since we're just resolving a handle to a library that was already loaded from elsewhere?
Yeah, it's quite complex. It took me 5 minutes to write the prompt and upload the files, then the reasoning part took a whopping 10 minutes (longest I've seen so far):
https://chatgpt.com/share/68d3709e-a60c-8008-95c2-978251b18874
I already spent 20 minutes reviewing the response. I think it's really good, although I already discovered one small flaw in the reference implementation. I'll work on it a little more to be sure it can be made to work in general, then we can decide what balance of complexity vs features we want.
we probably shouldn't close the handle either since we're just resolving the handle to an already loaded library from elsewhere
If we're dlopen-ing it, then we need a corresponding dlclose. Are we dlopen-ing the library when we resolve the handle?
I really think we should try to think through the problem and not just go with whatever chatgpt gives us.
I personally find it hard to justify a complicated AI-generated solution that introduces additional APIs, each built on concepts (such as pinning) that are novel in Python APIs, and that commits us to maintaining a solution a human did not come up with and likely cannot explain with the same depth of understanding they would have if they had produced it themselves.
That said, this is just my opinion, so take it with a grain of salt. If the AI solution is really better than getting to the bottom of the problem and building a tailor-made solution then perhaps it's fine.
Since RTLD_NOLOAD increfs, the finalize here is still necessary even if the library was loaded from elsewhere.
The RTLD_NOLOAD is via a ctypes.CDLL call, though, as opposed to dlopen directly. It seems ctypes doesn't actually close the handle, which is surprising to me and makes my previous statement incorrect.
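A small Linux-only sketch of the behavior being described, using libm as a stand-in library (this is not the pathfinder code): a successful RTLD_NOLOAD open still takes a loader reference, and ctypes never calls dlclose on the resulting CDLL, so each such lookup effectively leaks a reference unless it is closed explicitly.

import ctypes
import ctypes.util
import os

path = ctypes.util.find_library("m")           # stand-in library
first = ctypes.CDLL(path)                      # dlopen: takes a reference
# RTLD_NOLOAD succeeds only if the library is already mapped, but on success
# it still takes another reference on the same handle.
again = ctypes.CDLL(path, mode=os.RTLD_NOLOAD)
assert first._handle == again._handle
# ctypes does not dlclose handles when CDLL objects are garbage collected,
# so both references persist until the process exits (or dlclose is called
# manually on the handle).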
Wow, ok, tx.
I've been blissfully ignorant of the handle refcount system; I need to review the existing code with a new mindset. My preference is to unblock you here with a minimal change and look at this carefully in a separate PR.
One thing I want to look into: Is it ok for the existing clients (mainly cython-gen, cybind) if we close the handle, which may trigger unloading the library? The clients are holding on to pointers into the dynamic libs, e.g.
cuda-python/cuda_bindings/cuda/bindings/_internal/nvjitlink_windows.pyx
Lines 101 to 149 in d2978aa
with gil, __symbol_lock:
    # Load library
    handle = load_nvidia_dynamic_lib("nvJitLink")._handle_uint

    # Load function
    global __nvJitLinkCreate
    __nvJitLinkCreate = GetProcAddress(handle, 'nvJitLinkCreate')
    global __nvJitLinkDestroy
    __nvJitLinkDestroy = GetProcAddress(handle, 'nvJitLinkDestroy')
    global __nvJitLinkAddData
    __nvJitLinkAddData = GetProcAddress(handle, 'nvJitLinkAddData')
    global __nvJitLinkAddFile
    __nvJitLinkAddFile = GetProcAddress(handle, 'nvJitLinkAddFile')
    global __nvJitLinkComplete
    __nvJitLinkComplete = GetProcAddress(handle, 'nvJitLinkComplete')
    global __nvJitLinkGetLinkedCubinSize
    __nvJitLinkGetLinkedCubinSize = GetProcAddress(handle, 'nvJitLinkGetLinkedCubinSize')
    global __nvJitLinkGetLinkedCubin
    __nvJitLinkGetLinkedCubin = GetProcAddress(handle, 'nvJitLinkGetLinkedCubin')
    global __nvJitLinkGetLinkedPtxSize
    __nvJitLinkGetLinkedPtxSize = GetProcAddress(handle, 'nvJitLinkGetLinkedPtxSize')
    global __nvJitLinkGetLinkedPtx
    __nvJitLinkGetLinkedPtx = GetProcAddress(handle, 'nvJitLinkGetLinkedPtx')
    global __nvJitLinkGetErrorLogSize
    __nvJitLinkGetErrorLogSize = GetProcAddress(handle, 'nvJitLinkGetErrorLogSize')
    global __nvJitLinkGetErrorLog
    __nvJitLinkGetErrorLog = GetProcAddress(handle, 'nvJitLinkGetErrorLog')
    global __nvJitLinkGetInfoLogSize
    __nvJitLinkGetInfoLogSize = GetProcAddress(handle, 'nvJitLinkGetInfoLogSize')
    global __nvJitLinkGetInfoLog
    __nvJitLinkGetInfoLog = GetProcAddress(handle, 'nvJitLinkGetInfoLog')
    global __nvJitLinkVersion
    __nvJitLinkVersion = GetProcAddress(handle, 'nvJitLinkVersion')

    __py_nvjitlink_init = True
    return 0
I need to find out: What happens if we trigger unloading a library while we still have the addresses where the symbols were loaded into memory?
It only runs when the process is finalizing.
I don't think that's true.
You're right, sorry.
Actually, no, based on this quick experiment it is true:
- I checked out your leave-dl-dangling branch (this PR)
- I added this one line:
--- a/cuda_pathfinder/cuda/pathfinder/_dynamic_libs/load_dl_linux.py
+++ b/cuda_pathfinder/cuda/pathfinder/_dynamic_libs/load_dl_linux.py
@@ -47,6 +47,7 @@ RTLD_DI_ORIGIN = 6
 def unload_dl(handle: ctypes.c_void_p) -> None:
+    print(f"\nLOOOK unload_dl({handle=!r})", flush=True)
     result = LIBDL.dlclose(handle)
     if result:
         raise RuntimeError(LIBDL.dlerror())
- Then I ran the cuda_bindings unit tests:
(WslLocalCudaVenv) rwgk-win11.localdomain:~/forked/cuda-python/cuda_bindings $ pytest -ra -s -v tests/
====================================================================================================== test session starts =======================================================================================================
platform linux -- Python 3.12.3, pytest-8.4.2, pluggy-1.6.0 -- /home/rgrossekunst/forked/cuda-python/cuda_pathfinder/WslLocalCudaVenv/bin/python3
cachedir: .pytest_cache
benchmark: 5.1.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /home/rgrossekunst/forked/cuda-python/cuda_bindings/tests
configfile: pytest.ini
plugins: benchmark-5.1.0
collected 192 items / 1 skipped
tests/test_cuda.py::test_cuda_memcpy PASSED
tests/test_cuda.py::test_cuda_array PASSED
tests/test_cuda.py::test_cuda_repr_primitive PASSED
<... snip ...>
tests/test_cudart.py::test_getLocalRuntimeVersion PASSED
<... snip ...>
tests/test_utils.py::test_cyclical_imports[nvvm] PASSED
tests/test_utils.py::test_cyclical_imports[runtime] PASSED
tests/test_utils.py::test_cyclical_imports[cufile] PASSED
==================================================================================================== short test summary info =====================================================================================================
SKIPPED [1] tests/test_cufile.py:40: skipping cuFile tests on WSL
================================================================================================= 192 passed, 1 skipped in 7.03s =================================================================================================
LOOOK unload_dl(handle=c_void_p(791712928))
LOOOK unload_dl(handle=c_void_p(781336800))
LOOOK unload_dl(handle=c_void_p(732155104))
LOOOK unload_dl(handle=c_void_p(731635248))
(WslLocalCudaVenv) rwgk-win11.localdomain:~/forked/cuda-python/cuda_bindings $
Because of functools.cache, the finalizers only run during process tear-down, when the handles are sure to disappear anyway.
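A quick way to convince yourself of that: the functools.cache on the loader keeps the cached object reachable for the life of the interpreter, so a weakref.finalize registered on it can only fire during shutdown. A self-contained sketch, with no CUDA involved and all names illustrative:

import functools
import weakref

class Handle:  # stand-in for the cached LoadedDL
    pass

@functools.cache
def load():
    h = Handle()
    weakref.finalize(h, print, "finalizer ran (interpreter shutdown)")
    return h

load()
print("end of script")
# Prints "end of script" first; the cache keeps `h` alive, so the finalizer
# only runs when the interpreter tears down, mirroring the LOOOK output above.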
I am very nervous about the finalizer solution because, as per Phillip's investigation last week, we will call the C APIs to do cleanup, but aren't the finalizers called before that step, leaving us with dangling pointers?
We have been leaking DSO handles on purpose since pretty much the beginning of this project (~2021) and no one complained, in particular with respect to our handling of the driver (libcuda). Why is this leaking suddenly not acceptable here?
... aren't the finalizers called before that step and leaving us dangling pointers?
Is it possible that the void* function pointers in various places get called after the library is closed, which happens at the module level? This would mean that the function (literally, the function object corresponding to the thing created by def; this matters because that's where the cache is stored) gets torn down and the finalizer called on LoadedDL before invoking one of those functions.
I can imagine the following scenario:
lib = load_nvidia_shared_library("nvjitlink")
foo = create_something_that_eventually_calls_a_cleanup_routine() # note the possible implicit call to `load_nvidia_shared_library`
# ... many lines of code later
#
# Python shuts down
# Python calls the finalizer for `lib`
# Python GC's `foo` which needs an open handle corresponding to `lib`'s address
So, we are probably always going to be constrained by this if we support code paths that can call globally-defined-and-expected-to-be-valid-pointers at any point in a program's lifetime.
I just spoke with @rwgk in person and we agreed that leaking the handles is preferable for now. I will revert the most recent commit, leave the comment, and then we can merge this fix.
So, as best as any of us knows, the answer is: "we don't know, but we know that these functions might need to be called at any point, so we leak the handles to avoid a worse outcome."
… a finalizer
This reverts commit 73c43ae.
Thanks @cpcloud, and sorry we accidentally sent you into this hole!
/ok to test cd89f39
Updated codegen to fix an issue with calling cudart.getLocalRuntimeVersion more than once. See the generated comment for details.