Skip to content

Rework Linker dispatching for cross-major nvJitLink/driver skew#1911

Open
cpcloud wants to merge 4 commits intoNVIDIA:mainfrom
cpcloud:linker-dispatch-rework-712
Open

Rework Linker dispatching for cross-major nvJitLink/driver skew#1911
cpcloud wants to merge 4 commits intoNVIDIA:mainfrom
cpcloud:linker-dispatch-rework-712

Conversation

@cpcloud
Copy link
Copy Markdown
Contributor

@cpcloud cpcloud commented Apr 14, 2026

Summary

Reworks the Linker backend selection and fixes three latent driver-linker resource-lifetime bugs that the rework surfaced on cross-major CTK + driver CI runners.

Per-instance dispatch (refactor(core): per-instance Linker backend dispatch)

  • Replaces module-level "decide once" backend selection with per-Linker-instance dispatch at __init__ time
  • Factors the decision into a pure _choose_backend() helper for GPU-free unit testing
  • Handles nvJitLink/driver major-version mismatches: falls back to the driver linker for non-LTO linking, raises RuntimeError for LTO when the backends are incompatible
  • Probes driver_version() lazily; only CUDAError from cuDriverGetVersion is treated as "driver unknown" (other exceptions propagate)
  • _probe_nvjitlink() is cached and warns at most once when nvJitLink is absent
  • Validates every input is an ObjectCode before the code_type pre-scan so malformed inputs raise TypeError instead of a dispatch RuntimeError

Breaking change: options.link_time_optimization=True with nvJitLink absent now raises RuntimeError instead of silently passing CU_JIT_LTO to the driver (which was not real LTO linking).

Decision matrix

driver nvJitLink ltoir input lto/ptx result
any None no no driver
any None yes/lto raise
M (M,*) any any nvJitLink
D≠N (N,*) no no driver fallback
D≠N (N,*) yes/lto raise
None available any any nvJitLink

Driver-linker lifetime fixes

With per-instance dispatch, CTK 12.9 + driver 13.0 runners now hit the driver linker for cross-major linking, which was never exercised before. That exposed three independent latent bugs:

  • fix(core): retain driver log buffers for CUlinkState lifetimeLinker_link was nulling _drv_log_bufs right after cuLinkComplete, leaving cuLinkDestroy (run later in tp_dealloc) with dangling pointers to the info/error log bytearrays. Retain the bytearrays until the cdef class is deallocated; .pxd declaration order ensures _culink_handle is destroyed before the buffers.
  • fix(core): retain cuLinkCreate optionValues array for CUlinkState lifetimec_jit_keys/c_jit_values were stack-local vectors destroyed on Linker_init return, but CUDA requires optionValues to stay valid for the life of the CUlinkState (the driver writes log-fill sizes back into its slots). Promote them to cdef class fields declared after _culink_handle so tp_dealloc destroys them after cuLinkDestroy runs.
  • fix(core): release driver-backend buffers in Linker.close()close() now caches decoded logs into _info_log/_error_log before dropping _drv_log_bufs, so get_error_log()/get_info_log() still work after close() on both success and failed-link paths. The option vectors are swapped with empty locals (not .clear()ed) so the backing allocation is actually freed rather than retaining capacity.

Test plan

  • GPU-free parameterized tests for the full _choose_backend() decision matrix, including the driver_major=None (build-container) path (tests/test_linker_dispatch.py)
  • Test helpers handle driver-version failure gracefully
  • Driver-backend regression test: force the driver backend, link + close + drop a Linker, then do a full compile + link cycle (test_driver_linker_lifetime_no_heap_corruption)
  • Failed-link + close()get_error_log() regression test (test_driver_linker_get_error_log_after_close_on_failed_link)
  • CI: existing GPU tests pass with per-instance dispatch
  • CI: cross-major behavior verified (CTK 12.9.1 + driver 13.0 runners exercise the driver-fallback path that surfaced the lifetime bugs)

Closes #712

🤖 Generated with Claude Code

@cpcloud cpcloud added this to the cuda.core v1.0.0 milestone Apr 14, 2026
@cpcloud cpcloud added enhancement Any code-related improvements P0 High priority - Must do! cuda.core Everything related to the cuda.core module breaking Breaking changes are introduced labels Apr 14, 2026
@cpcloud cpcloud self-assigned this Apr 14, 2026
@github-actions
Copy link
Copy Markdown

@cpcloud cpcloud force-pushed the linker-dispatch-rework-712 branch 2 times, most recently from 0c94703 to 5d8fa24 Compare April 16, 2026 21:58
@cpcloud cpcloud requested review from leofang and mdboom April 17, 2026 13:07
@cpcloud cpcloud force-pushed the linker-dispatch-rework-712 branch 4 times, most recently from 2ff8d39 to 35e5282 Compare April 19, 2026 11:04
cpcloud added 4 commits April 20, 2026 07:45
Replace module-level "decide once" backend selection with per-Linker
instance dispatch at __init__ time. Factor the decision into a pure
_choose_backend() helper so it can be unit-tested without a GPU.

Handle nvJitLink/driver major-version skew: fall back to the driver
linker for non-LTO linking, raise RuntimeError for LTO when the
backends are incompatible. Probe driver_version() lazily so
environments with nvJitLink but no driver (e.g., build containers)
still work; only CUDAError from handle_return is treated as "driver
unknown" so other exceptions propagate. _probe_nvjitlink() is cached
and warns at most once when nvJitLink is absent.

Validate each input as ObjectCode before the code_type pre-scan so
invalid inputs surface a TypeError instead of a backend-dispatch
RuntimeError.

Breaking change: options.link_time_optimization=True with nvJitLink
absent now raises RuntimeError instead of silently passing
CU_JIT_LTO to the driver (which was not real LTO linking).

Closes NVIDIA#712
Linker_link was nulling self._drv_log_bufs right after cuLinkComplete,
releasing the bytearrays whose addresses were handed to the driver via
CU_JIT_INFO_LOG_BUFFER and CU_JIT_ERROR_LOG_BUFFER at cuLinkCreate time.
The CUlinkState retains those pointers until cuLinkDestroy, which runs
during Linker tp_dealloc. Freeing the bytearrays first left the driver
with dangling pointers and corrupted the heap; subsequent CUDA calls
(e.g. NVRTC compilation in the next test fixture) segfaulted.

This path became reachable in CI with the new per-instance backend
dispatch: CTK 12.9.1 + driver 13.0 runners now hit the driver linker
for cross-major linking, which was never exercised before.

Retain _drv_log_bufs until the cdef class is deallocated; pxd
declaration order ensures _culink_handle (and therefore cuLinkDestroy)
runs before the bytearrays are cleared.
…etime

The CUDA driver docs state: "optionValues must remain valid for the life
of the CUlinkState if output options are used." The driver writes log-
fill sizes (output) back into the optionValues slots for
CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES and CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES.
Linker_init previously declared c_jit_keys/c_jit_values as local
cdef vector[...] on the stack of Linker_init; they were destroyed when
the function returned, leaving the driver with dangling writes during
subsequent cuLinkAddData/cuLinkComplete/cuLinkDestroy calls.

This was always latent. It became reachable with the per-instance
backend dispatch (CTK 12.9.1 runners now select the driver linker when
they pair with a driver 13 install), and only manifested on driver 13
as heap corruption that killed the next NVRTC or link call.

Promote the two arrays to cdef class fields declared after
_culink_handle in the pxd. Cython's tp_dealloc destroys C++ fields in
pxd declaration order, so the vectors are destroyed after the shared_ptr
deleter runs cuLinkDestroy. The cuda.bindings high-level wrapper
(driver.cuLinkCreate) already handles this by attaching a keepalive to
CUlinkState; cuda.core's low-level cydriver.cuLinkCreate path did not.

Also drop the now-unused void_ptr ctypedef.
Linker.close() reset only the _culink_handle, leaving the retained
option-key/value vectors and the log-buffer bytearrays alive until
Python GC/tp_dealloc. Those buffers exist for cuLinkDestroy's sake,
but cuLinkDestroy has already run at this point via the shared_ptr
deleter, so they can be released immediately.

Cache decoded logs into _info_log/_error_log before releasing the
raw buffers so get_error_log() / get_info_log() remain callable
after close(), including on the failed-link path where link()
never caches them itself. Swap the option vectors with empty
locals to actually free the backing allocation (std::vector::clear
only sets size to 0 and keeps capacity).

Adds two driver-backend regression tests: one that links
successfully, closes + drops, then performs another full
compile + link cycle (prior heap-corruption bugs only surfaced in
the next CUDA op after teardown); another that triggers a link
failure, closes, and checks get_error_log() still returns the
captured diagnostic.
@cpcloud cpcloud force-pushed the linker-dispatch-rework-712 branch from 35e5282 to bd1874c Compare April 20, 2026 11:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

breaking Breaking changes are introduced cuda.core Everything related to the cuda.core module enhancement Any code-related improvements P0 High priority - Must do!

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Re-work on Linker dispatching logic

1 participant