PTX: modernize compilation now that LLVM 22 is used #836
Merged
Machine code is generated by an external, up-to-date back-end, so the target should not be limited by what the in-process LLVM supports. That LLVM may not even know the targeted device -- unknown CPUs noisily degrade to the sm_20 subtarget -- and only drives middle-end heuristics, so construct its TargetMachine for the generic subtarget.
Reflection-based dispatch in libdevice can select intrinsics that only the external back-end supports, e.g. llvm.nvvm.tanh.approx.f32 chosen by __nv_fast_tanhf under __CUDA_ARCH >= 730. Validation used to reject these as unknown functions; defer to the back-end instead.
The LLVM 22 back-end accesses byval kernel arguments directly in parameter space when all uses are simple (possibly GEP-indexed) loads, which Julia's immutable semantics make the common case: such arguments qualify as implicit grid constants and are read via ld.param without a local copy, even when dynamically indexed. Arguments with complex uses (phis, selects, calls) do still get copied, but during machine-code generation, where the optimizer cannot clean that up anymore; keep materializing those early. The PTX-level parameter layout is identical either way, so the launch ABI does not change.
The ptx_kernel calling convention is what marks the entry point; the back-end auto-upgrades any remaining "kernel" annotation to it, and on LLVM >= 20 the redundant annotation even caused miscompilations.
String function attributes are accepted by any in-process LLVM, so there is no reason to gate them on LLVM 21. Keep the nvvm.annotations form too, for the in-process LLVM's NVPTX passes to consume.
The back-end carries the upstreamed equivalent of this pass, applying it only where needed (the ptxas unreachable bug is fixed in PTX 8.3).
The back-end natively honors per-instruction afn flags for f32 fdiv/sqrt/rsqrt, selecting approximate (and, based on the function's denormal mode, flushing) instructions, so the f32 rewrites were redundant. The f64 paths remain: NVPTX has no fast f64 lowerings, and does not select rsqrt.approx.d for 1/sqrt(x).
Notably, this declares the 32-bit tensor-memory address space, and aligns i128 like Julia does (16 bytes on Julia 1.12 and later, which is also the back-end's preference), keeping kernel argument layouts consistent between host and device.
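To illustrate why agreeing on i128 alignment matters for argument layout, here is a minimal sketch of C-like struct layout under two alignment tables. The helper `layout`, the field list, and the 8-byte table are hypothetical and only serve as a contrast; the 16-byte figure is the one stated above.

```python
# Sizes in bytes of the scalar types used in this sketch.
SIZES = {"i32": 4, "i64": 8, "i128": 16}

def layout(field_types, alignments):
    """Return (offsets, total_size) for a C-like struct layout."""
    offset, max_align, offsets = 0, 1, []
    for ty in field_types:
        align = alignments[ty]
        # Round the running offset up to the field's alignment.
        offset = (offset + align - 1) // align * align
        offsets.append(offset)
        offset += SIZES[ty]
        max_align = max(max_align, align)
    # The struct size is padded to a multiple of its largest alignment.
    size = (offset + max_align - 1) // max_align * max_align
    return offsets, size

# Hypothetical table that aligns i128 to 8 bytes (contrast case only).
ALIGN_8 = {"i32": 4, "i64": 8, "i128": 8}
# Table that aligns i128 to 16 bytes, as described above.
ALIGN_16 = {"i32": 4, "i64": 8, "i128": 16}

fields = ["i64", "i128", "i32"]  # hypothetical kernel argument struct
print(layout(fields, ALIGN_8))
print(layout(fields, ALIGN_16))
```

If the two sides of a launch used different tables, every field after the i128 would be read from the wrong offset, which is the inconsistency the shared data layout avoids.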
llc inherits our standard streams, leaking warnings to the terminal and losing the actual error on failure. Capture its output, include it in compilation errors, and warn when a successful invocation still printed diagnostics (e.g. having ignored an unrecognized CPU).
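The capture-and-report behavior described above can be sketched as follows. This is not the package's implementation; `run_tool` and `CompilationError` are hypothetical names, and the sketch assumes the tool writes its diagnostics to stderr and its result to stdout.

```python
import subprocess
import sys
import warnings

class CompilationError(Exception):
    """Hypothetical error type carrying the tool's captured diagnostics."""

def run_tool(cmd, input_bytes=b""):
    # Capture both streams instead of letting the child inherit ours.
    proc = subprocess.run(cmd, input=input_bytes,
                          stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    diagnostics = proc.stderr.decode(errors="replace").strip()
    if proc.returncode != 0:
        # On failure, the captured text is the actual error; keep it.
        raise CompilationError(
            f"{cmd[0]} failed with exit code {proc.returncode}:\n{diagnostics}")
    if diagnostics:
        # Success, but the tool still had something to say.
        warnings.warn(f"{cmd[0]} succeeded but printed diagnostics:\n{diagnostics}")
    return proc.stdout

if __name__ == "__main__":
    # Stand-in for the external tool: a child Python process.
    print(run_tool([sys.executable, "-c", "print('generated code')"]))
```

The child Python process stands in for the external back-end so the sketch runs anywhere; swapping in a real command only changes `cmd`.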
The built-in NVVMReflectPass, scheduled through the in-process LLVM's TargetMachine since LLVM 17, derives __CUDA_ARCH from the middle-end subtarget -- now the generic one, which would select the most conservative code from arch-dispatching libraries like libdevice. Instead, fold reflect calls ourselves on every version, at the true target, right after linking device libraries; the built-in pass is left with nothing to do.
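A toy model of why the fold has to happen at the true target: the reflect value for `__CUDA_ARCH` is derived from the compute capability (major * 100 + minor * 10), and arch-dispatching code compares it against a threshold. The helper names, the threshold, and the use of (2, 0) to stand in for a conservative subtarget are assumptions made for illustration; folding unknown keys to 0 is likewise an assumption of this sketch.

```python
def cuda_arch(major, minor):
    """Reflect value for a compute capability, e.g. (7, 5) -> 750."""
    return major * 100 + minor * 10

def fold_reflect(key, target):
    """Fold a reflect query to a constant for the given (major, minor) target."""
    if key == "__CUDA_ARCH":
        return cuda_arch(*target)
    return 0  # assumption: keys this sketch does not model fold to 0

def pick_variant(target, threshold):
    """Model an arch-dispatching library routine (hypothetical)."""
    if fold_reflect("__CUDA_ARCH", target) >= threshold:
        return "arch-specific"
    return "conservative"

true_target = (9, 0)     # hypothetical device being compiled for
middle_end = (2, 0)      # stand-in for a conservative middle-end subtarget
threshold = 750          # illustrative dispatch threshold

print(pick_variant(middle_end, threshold))
print(pick_variant(true_target, threshold))
```

Folding against the middle-end subtarget would take the conservative branch even though the device supports the arch-specific one, which is the outcome the commit avoids by folding at the true target.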
CI failure is benign; we're actually generating better code now.
PTX machine code is now generated by an external, up-to-date LLVM 22; the in-process LLVM only drives the middle end. This PR ports several improvements that build on that split:
- uses a generic TargetMachine, folds __nvvm_reflect itself at the true target, and accepts NVVM intrinsics the in-process LLVM does not know about. Together this fully supports devices newer than the in-process LLVM;
- drops work the back-end now does itself: lower_unreachable!, byval lowering of arguments it wouldn't copy, and fast-math rewrites of f32 operations it handles natively;
- diagnostics from llc are now surfaced.