Rework Linker dispatching for cross-major nvJitLink/driver skew by cpcloud · Pull Request #1911 · NVIDIA/cuda-python

cpcloud · 2026-04-14T21:05:10Z

Summary

Reworks the Linker backend selection and fixes three latent driver-linker resource-lifetime bugs that the rework surfaced on cross-major CTK + driver CI runners.

Per-instance dispatch (`refactor(core): per-instance Linker backend dispatch`)

Replaces module-level "decide once" backend selection with per-Linker-instance dispatch at __init__ time
Factors the decision into a pure _choose_backend() helper for GPU-free unit testing
Handles nvJitLink/driver major-version mismatches: falls back to the driver linker for non-LTO linking, raises RuntimeError for LTO when the backends are incompatible
Probes driver_version() lazily; only CUDAError from cuDriverGetVersion is treated as "driver unknown" (other exceptions propagate)
_probe_nvjitlink() is cached and warns at most once when nvJitLink is absent
Validates every input is an ObjectCode before the code_type pre-scan so malformed inputs raise TypeError instead of a dispatch RuntimeError
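
The "cached and warns at most once" behavior of the probe can be illustrated with a small sketch. This is not the code in the PR: `probe_optional_module` is a hypothetical helper, and the warning text is made up for the example. It only shows how caching the probe result (including a `None` result) also caps the warning at one emission.

```python
import functools
import importlib
import warnings


@functools.lru_cache(maxsize=None)
def probe_optional_module(name):
    """Hypothetical probe: return the imported module, or None if it is absent.

    lru_cache stores the result per module name, so the import is attempted
    once and the warning below fires at most once per name.
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        warnings.warn(
            f"{name} is not available; a fallback backend will be used",
            RuntimeWarning,
            stacklevel=2,
        )
        return None
```

Callers can invoke the probe from every `__init__` without paying for repeated import attempts or flooding the log.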

Breaking change: options.link_time_optimization=True with nvJitLink absent now raises RuntimeError instead of silently passing CU_JIT_LTO to the driver (which was not real LTO linking).

Decision matrix

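| driver | nvJitLink | ltoir input | lto/ptx | result |
|---|---|---|---|---|
| any | None | no | no | driver |
| any | None | yes/lto | - | raise |
| M | (M,*) | any | any | nvJitLink |
| D≠N | (N,*) | no | no | driver fallback |
| D≠N | (N,*) | yes/lto | - | raise |
| None | available | any | any | nvJitLink |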
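
One way to read the matrix is as a pure function, sketched below. This is not the actual `_choose_backend()` from the PR: the name `choose_backend`, its parameters, the returned strings, and the error messages are all assumptions made for illustration. The sketch folds the "ltoir input" and LTO-option columns into a single `needs_lto` flag and does not model the ptx side of the "lto/ptx" column.

```python
def choose_backend(driver_major, nvjitlink_version, has_ltoir_input, lto_requested):
    """Hypothetical stand-in for the dispatch helper; returns "nvjitlink" or "driver".

    driver_major: int, or None when the driver version is unknown.
    nvjitlink_version: (major, minor) tuple, or None when nvJitLink is absent.
    """
    needs_lto = has_ltoir_input or lto_requested

    if nvjitlink_version is None:
        # Rows 1-2: without nvJitLink, only non-LTO linking can use the driver.
        if needs_lto:
            raise RuntimeError("LTO linking requires nvJitLink, which is not available")
        return "driver"

    if driver_major is None or driver_major == nvjitlink_version[0]:
        # Rows 3 and 6: majors match, or the driver is unknown (build container).
        return "nvjitlink"

    # Rows 4-5: cross-major skew between the driver and nvJitLink.
    if needs_lto:
        raise RuntimeError("LTO linking is not possible across driver/nvJitLink majors")
    return "driver"
```

Because the function takes plain values and touches no CUDA state, each row of the matrix can be checked on a machine without a GPU.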

Driver-linker lifetime fixes

With per-instance dispatch, CTK 12.9 + driver 13.0 runners now hit the driver linker for cross-major linking, which was never exercised before. That exposed three independent latent bugs:

fix(core): retain driver log buffers for CUlinkState lifetime — Linker_link was nulling _drv_log_bufs right after cuLinkComplete, leaving cuLinkDestroy (run later in tp_dealloc) with dangling pointers to the info/error log bytearrays. Retain the bytearrays until the cdef class is deallocated; .pxd declaration order ensures _culink_handle is destroyed before the buffers.
fix(core): retain cuLinkCreate optionValues array for CUlinkState lifetime — c_jit_keys/c_jit_values were stack-local vectors destroyed on Linker_init return, but CUDA requires optionValues to stay valid for the life of the CUlinkState (the driver writes log-fill sizes back into its slots). Promote them to cdef class fields declared after _culink_handle so tp_dealloc destroys them after cuLinkDestroy runs.
fix(core): release driver-backend buffers in Linker.close() — close() now caches decoded logs into _info_log/_error_log before dropping _drv_log_bufs, so get_error_log()/get_info_log() still work after close() on both success and failed-link paths. The option vectors are swapped with empty locals (not .clear()ed) so the backing allocation is actually freed rather than retaining capacity.
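
The "cache decoded logs, then drop the raw buffers" ordering in `close()` can be modeled in plain Python. The class below is a toy, not the Cython `Linker`: `DriverLogHolder`, its field names, and the NUL-terminated decoding are assumptions chosen to mirror the description above.

```python
class DriverLogHolder:
    """Toy model of an object that owns raw log buffers handed to a C API."""

    def __init__(self, size=64):
        # (info, error) buffers that an imaginary driver would write into.
        self._log_bufs = (bytearray(size), bytearray(size))
        self._info_log = None
        self._error_log = None

    @staticmethod
    def _decode(buf):
        # Keep only the bytes before the first NUL terminator.
        return bytes(buf).split(b"\0", 1)[0].decode("utf-8", errors="backslashreplace")

    def close(self):
        if self._log_bufs is None:
            return  # already closed; close() is idempotent
        info_buf, error_buf = self._log_bufs
        # Cache the decoded text first so the getters keep working afterwards.
        if self._info_log is None:
            self._info_log = self._decode(info_buf)
        if self._error_log is None:
            self._error_log = self._decode(error_buf)
        self._log_bufs = None  # now the raw buffers can be released

    def get_info_log(self):
        if self._info_log is not None:
            return self._info_log
        return self._decode(self._log_bufs[0])

    def get_error_log(self):
        if self._error_log is not None:
            return self._error_log
        return self._decode(self._log_bufs[1])
```

The key point is the ordering inside `close()`: decoding happens while the buffers are still alive, which is what keeps the getters usable on the failed-link path.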

Test plan

GPU-free parameterized tests for the full _choose_backend() decision matrix, including the driver_major=None (build-container) path (tests/test_linker_dispatch.py)
Test helpers handle driver-version failure gracefully
Driver-backend regression test: force the driver backend, link + close + drop a Linker, then do a full compile + link cycle (test_driver_linker_lifetime_no_heap_corruption)
Failed-link + close() → get_error_log() regression test (test_driver_linker_get_error_log_after_close_on_failed_link)
CI: existing GPU tests pass with per-instance dispatch
CI: cross-major behavior verified (CTK 12.9.1 + driver 13.0 runners exercise the driver-fallback path that surfaced the lifetime bugs)

Closes #712

🤖 Generated with Claude Code

github-actions · 2026-04-14T21:19:30Z

Doc Preview CI
🚀 View preview at https://nvidia.github.io/cuda-python/pr-preview/pr-1911/
https://nvidia.github.io/cuda-python/pr-preview/pr-1911/cuda-core/
https://nvidia.github.io/cuda-python/pr-preview/pr-1911/cuda-bindings/
https://nvidia.github.io/cuda-python/pr-preview/pr-1911/cuda-pathfinder/
Preview will be ready when the GitHub Pages deployment is complete.

Replace module-level "decide once" backend selection with per-Linker instance dispatch at __init__ time. Factor the decision into a pure _choose_backend() helper so it can be unit-tested without a GPU.

Handle nvJitLink/driver major-version skew: fall back to the driver linker for non-LTO linking, raise RuntimeError for LTO when the backends are incompatible.

Probe driver_version() lazily so environments with nvJitLink but no driver (e.g., build containers) still work; only CUDAError from handle_return is treated as "driver unknown" so other exceptions propagate. _probe_nvjitlink() is cached and warns at most once when nvJitLink is absent.

Validate each input as ObjectCode before the code_type pre-scan so invalid inputs surface a TypeError instead of a backend-dispatch RuntimeError.

Breaking change: options.link_time_optimization=True with nvJitLink absent now raises RuntimeError instead of silently passing CU_JIT_LTO to the driver (which was not real LTO linking).

Closes NVIDIA#712
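
The "only CUDAError means driver unknown" rule could look roughly like the sketch below. It is not the PR's code: `CUDAError` here is a local stand-in class, and `driver_major_or_none` and its `query_version` callable are hypothetical. The only CUDA fact it relies on is that the driver reports its version as an integer of the form 1000 * major + 10 * minor.

```python
class CUDAError(Exception):
    """Local stand-in for the library's CUDA error type (assumption)."""


def driver_major_or_none(query_version):
    """Return the driver major version, or None if the driver cannot be queried.

    query_version: hypothetical zero-argument callable returning an integer
    such as 13000 (encoded as 1000 * major + 10 * minor).
    """
    try:
        version = query_version()
    except CUDAError:
        # Driver absent or unusable, e.g. a build container: report "unknown".
        return None
    # Any other exception type propagates to the caller unchanged.
    return version // 1000
```

Catching only the one exception type keeps genuine programming errors visible instead of silently turning them into the "driver unknown" path.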

Linker_link was nulling self._drv_log_bufs right after cuLinkComplete, releasing the bytearrays whose addresses were handed to the driver via CU_JIT_INFO_LOG_BUFFER and CU_JIT_ERROR_LOG_BUFFER at cuLinkCreate time. The CUlinkState retains those pointers until cuLinkDestroy, which runs during Linker tp_dealloc. Freeing the bytearrays first left the driver with dangling pointers and corrupted the heap; subsequent CUDA calls (e.g. NVRTC compilation in the next test fixture) segfaulted. This path became reachable in CI with the new per-instance backend dispatch: CTK 12.9.1 + driver 13.0 runners now hit the driver linker for cross-major linking, which was never exercised before. Retain _drv_log_bufs until the cdef class is deallocated; pxd declaration order ensures _culink_handle (and therefore cuLinkDestroy) runs before the bytearrays are cleared.

…etime

The CUDA driver docs state: "optionValues must remain valid for the life of the CUlinkState if output options are used." The driver writes log-fill sizes (output) back into the optionValues slots for CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES and CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES.

Linker_init previously declared c_jit_keys/c_jit_values as local cdef vector[...] on the stack of Linker_init; they were destroyed when the function returned, leaving the driver with dangling writes during subsequent cuLinkAddData/cuLinkComplete/cuLinkDestroy calls.

This was always latent. It became reachable with the per-instance backend dispatch (CTK 12.9.1 runners now select the driver linker when they pair with a driver 13 install), and only manifested on driver 13 as heap corruption that killed the next NVRTC or link call.

Promote the two arrays to cdef class fields declared after _culink_handle in the pxd. Cython's tp_dealloc destroys C++ fields in pxd declaration order, so the vectors are destroyed after the shared_ptr deleter runs cuLinkDestroy.

The cuda.bindings high-level wrapper (driver.cuLinkCreate) already handles this by attaching a keepalive to CUlinkState; cuda.core's low-level cydriver.cuLinkCreate path did not.

Also drop the now-unused void_ptr ctypedef.

Linker.close() reset only the _culink_handle, leaving the retained option-key/value vectors and the log-buffer bytearrays alive until Python GC/tp_dealloc. Those buffers exist for cuLinkDestroy's sake, but cuLinkDestroy has already run at this point via the shared_ptr deleter, so they can be released immediately. Cache decoded logs into _info_log/_error_log before releasing the raw buffers so get_error_log() / get_info_log() remain callable after close(), including on the failed-link path where link() never caches them itself. Swap the option vectors with empty locals to actually free the backing allocation (std::vector::clear only sets size to 0 and keeps capacity). Adds two driver-backend regression tests: one that links successfully, closes + drops, then performs another full compile + link cycle (prior heap-corruption bugs only surfaced in the next CUDA op after teardown); another that triggers a link failure, closes, and checks get_error_log() still returns the captured diagnostic.

cpcloud added this to the cuda.core v1.0.0 milestone Apr 14, 2026

cpcloud added the enhancement, P0, cuda.core, and breaking labels Apr 14, 2026

cpcloud self-assigned this Apr 14, 2026

cpcloud force-pushed the linker-dispatch-rework-712 branch 2 times, most recently from 0c94703 to 5d8fa24 April 16, 2026 21:58

cpcloud requested review from leofang and mdboom April 17, 2026 13:07

cpcloud force-pushed the linker-dispatch-rework-712 branch 4 times, most recently from 2ff8d39 to 35e5282 April 19, 2026 11:04

cpcloud added 4 commits April 20, 2026 07:45

cpcloud force-pushed the linker-dispatch-rework-712 branch from 35e5282 to bd1874c April 20, 2026 11:49

cpcloud wants to merge 4 commits into NVIDIA:main from cpcloud:linker-dispatch-rework-712
