Rework Linker dispatching for cross-major nvJitLink/driver skew#1911
Open
cpcloud wants to merge 4 commits intoNVIDIA:mainfrom
Open
Rework Linker dispatching for cross-major nvJitLink/driver skew#1911cpcloud wants to merge 4 commits intoNVIDIA:mainfrom
cpcloud wants to merge 4 commits intoNVIDIA:mainfrom
Conversation
|
0c94703 to
5d8fa24
Compare
2ff8d39 to
35e5282
Compare
Replace module-level "decide once" backend selection with per-Linker instance dispatch at __init__ time. Factor the decision into a pure _choose_backend() helper so it can be unit-tested without a GPU. Handle nvJitLink/driver major-version skew: fall back to the driver linker for non-LTO linking, raise RuntimeError for LTO when the backends are incompatible. Probe driver_version() lazily so environments with nvJitLink but no driver (e.g., build containers) still work; only CUDAError from handle_return is treated as "driver unknown" so other exceptions propagate. _probe_nvjitlink() is cached and warns at most once when nvJitLink is absent. Validate each input as ObjectCode before the code_type pre-scan so invalid inputs surface a TypeError instead of a backend-dispatch RuntimeError. Breaking change: options.link_time_optimization=True with nvJitLink absent now raises RuntimeError instead of silently passing CU_JIT_LTO to the driver (which was not real LTO linking). Closes NVIDIA#712
Linker_link was nulling self._drv_log_bufs right after cuLinkComplete, releasing the bytearrays whose addresses were handed to the driver via CU_JIT_INFO_LOG_BUFFER and CU_JIT_ERROR_LOG_BUFFER at cuLinkCreate time. The CUlinkState retains those pointers until cuLinkDestroy, which runs during Linker tp_dealloc. Freeing the bytearrays first left the driver with dangling pointers and corrupted the heap; subsequent CUDA calls (e.g. NVRTC compilation in the next test fixture) segfaulted. This path became reachable in CI with the new per-instance backend dispatch: CTK 12.9.1 + driver 13.0 runners now hit the driver linker for cross-major linking, which was never exercised before. Retain _drv_log_bufs until the cdef class is deallocated; pxd declaration order ensures _culink_handle (and therefore cuLinkDestroy) runs before the bytearrays are cleared.
…etime The CUDA driver docs state: "optionValues must remain valid for the life of the CUlinkState if output options are used." The driver writes log- fill sizes (output) back into the optionValues slots for CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES and CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES. Linker_init previously declared c_jit_keys/c_jit_values as local cdef vector[...] on the stack of Linker_init; they were destroyed when the function returned, leaving the driver with dangling writes during subsequent cuLinkAddData/cuLinkComplete/cuLinkDestroy calls. This was always latent. It became reachable with the per-instance backend dispatch (CTK 12.9.1 runners now select the driver linker when they pair with a driver 13 install), and only manifested on driver 13 as heap corruption that killed the next NVRTC or link call. Promote the two arrays to cdef class fields declared after _culink_handle in the pxd. Cython's tp_dealloc destroys C++ fields in pxd declaration order, so the vectors are destroyed after the shared_ptr deleter runs cuLinkDestroy. The cuda.bindings high-level wrapper (driver.cuLinkCreate) already handles this by attaching a keepalive to CUlinkState; cuda.core's low-level cydriver.cuLinkCreate path did not. Also drop the now-unused void_ptr ctypedef.
Linker.close() reset only the _culink_handle, leaving the retained option-key/value vectors and the log-buffer bytearrays alive until Python GC/tp_dealloc. Those buffers exist for cuLinkDestroy's sake, but cuLinkDestroy has already run at this point via the shared_ptr deleter, so they can be released immediately. Cache decoded logs into _info_log/_error_log before releasing the raw buffers so get_error_log() / get_info_log() remain callable after close(), including on the failed-link path where link() never caches them itself. Swap the option vectors with empty locals to actually free the backing allocation (std::vector::clear only sets size to 0 and keeps capacity). Adds two driver-backend regression tests: one that links successfully, closes + drops, then performs another full compile + link cycle (prior heap-corruption bugs only surfaced in the next CUDA op after teardown); another that triggers a link failure, closes, and checks get_error_log() still returns the captured diagnostic.
35e5282 to
bd1874c
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Reworks the
Linkerbackend selection and fixes three latent driver-linker resource-lifetime bugs that the rework surfaced on cross-major CTK + driver CI runners.Per-instance dispatch (
refactor(core): per-instance Linker backend dispatch)Linker-instance dispatch at__init__time_choose_backend()helper for GPU-free unit testingRuntimeErrorfor LTO when the backends are incompatibledriver_version()lazily; onlyCUDAErrorfromcuDriverGetVersionis treated as "driver unknown" (other exceptions propagate)_probe_nvjitlink()is cached and warns at most once when nvJitLink is absentObjectCodebefore thecode_typepre-scan so malformed inputs raiseTypeErrorinstead of a dispatchRuntimeErrorBreaking change:
options.link_time_optimization=Truewith nvJitLink absent now raisesRuntimeErrorinstead of silently passingCU_JIT_LTOto the driver (which was not real LTO linking).Decision matrix
Driver-linker lifetime fixes
With per-instance dispatch, CTK 12.9 + driver 13.0 runners now hit the driver linker for cross-major linking, which was never exercised before. That exposed three independent latent bugs:
fix(core): retain driver log buffers for CUlinkState lifetime—Linker_linkwas nulling_drv_log_bufsright aftercuLinkComplete, leavingcuLinkDestroy(run later intp_dealloc) with dangling pointers to the info/error log bytearrays. Retain the bytearrays until thecdef classis deallocated;.pxddeclaration order ensures_culink_handleis destroyed before the buffers.fix(core): retain cuLinkCreate optionValues array for CUlinkState lifetime—c_jit_keys/c_jit_valueswere stack-localvectors destroyed onLinker_initreturn, but CUDA requiresoptionValuesto stay valid for the life of theCUlinkState(the driver writes log-fill sizes back into its slots). Promote them tocdef classfields declared after_culink_handlesotp_deallocdestroys them aftercuLinkDestroyruns.fix(core): release driver-backend buffers in Linker.close()—close()now caches decoded logs into_info_log/_error_logbefore dropping_drv_log_bufs, soget_error_log()/get_info_log()still work afterclose()on both success and failed-link paths. The option vectors are swapped with empty locals (not.clear()ed) so the backing allocation is actually freed rather than retaining capacity.Test plan
_choose_backend()decision matrix, including thedriver_major=None(build-container) path (tests/test_linker_dispatch.py)Linker, then do a full compile + link cycle (test_driver_linker_lifetime_no_heap_corruption)close()→get_error_log()regression test (test_driver_linker_get_error_log_after_close_on_failed_link)Closes #712
🤖 Generated with Claude Code