PTX: modernize compilation now that LLVM 22 is used #836

Merged
maleadt merged 10 commits into main from tb/ptx_llvm22 on Jun 6, 2026

maleadt (Member) commented Jun 6, 2026

PTX machine code is now generated by an external, up-to-date LLVM 22; the in-process LLVM only drives the middle end. This PR ports several improvements that build on that split:

  • the middle end now uses a generic TargetMachine, folds __nvvm_reflect itself at the true target, and accepts NVVM intrinsics the in-process LLVM does not know about. Together, these changes fully support devices newer than those the in-process LLVM knows about;
  • the datalayout, launch-bound attributes, and kernel calling convention now match what the external back-end expects;
  • workarounds the back-end no longer needs are gone: lower_unreachable!, byval lowering of arguments it wouldn't copy, and fast-math rewrites of f32 operations it handles natively;
  • warnings and errors from the external llc are now surfaced.

maleadt added 10 commits June 6, 2026 09:40

Machine code is generated by an external, up-to-date back-end, so the
target should not be limited by what the in-process LLVM supports.
That LLVM may not even know the targeted device -- unknown CPUs noisily
degrade to the sm_20 subtarget -- and only drives middle-end
heuristics, so construct its TargetMachine for the generic subtarget.

Reflection-based dispatch in libdevice can select intrinsics that only
the external back-end supports, e.g. llvm.nvvm.tanh.approx.f32 chosen
by __nv_fast_tanhf under __CUDA_ARCH >= 750. Validation used to reject
these as unknown functions; defer to the back-end instead.
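
The deferral described above can be sketched as a small classification step. This is only an illustration in Python: `classify_declaration` and the toy set of known intrinsics are hypothetical names, not the project's actual validation code, and the prefix check is an assumed rule.

```python
# Hypothetical sketch: decide what validation does with a function that
# is declared but not defined in the module.

# Toy stand-in for "intrinsics the in-process LLVM recognizes".
KNOWN_TO_IN_PROCESS_LLVM = {
    "llvm.nvvm.read.ptx.sreg.tid.x",
    "llvm.nvvm.barrier0",
}

def classify_declaration(name, known=KNOWN_TO_IN_PROCESS_LLVM):
    """Return 'known', 'deferred', or 'error' for an undefined function."""
    if name in known:
        return "known"
    # Assumption: anything in the NVVM intrinsic namespace is left for
    # the external back-end to accept or reject.
    if name.startswith("llvm.nvvm."):
        return "deferred"
    # Everything else is still reported as an unknown function.
    return "error"
```

With this rule, an intrinsic such as `llvm.nvvm.tanh.approx.f32` is no longer rejected up front, while a genuinely missing function is still caught.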

The LLVM 22 back-end accesses byval kernel arguments directly in
parameter space when all uses are simple (possibly GEP-indexed) loads,
which Julia's immutable semantics make the common case: such arguments
qualify as implicit grid constants and are read via ld.param without a
local copy, even when dynamically indexed. Arguments with complex uses
(phis, selects, calls) do still get copied, but during machine-code
generation, where the optimizer cannot clean that up anymore; keep
materializing those early. The PTX-level parameter layout is identical
either way, so the launch ABI does not change.
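
The distinction between simple and complex uses can be sketched as a recursive check over a use tree. This is a simplified model in Python, not the back-end's actual analysis: each use is a hypothetical `(opcode, nested_uses)` pair, and `lowering_strategy` is an illustrative name.

```python
# Hypothetical model: a use is (opcode, [uses of that instruction's result]).
SIMPLE_LEAVES = {"load"}
TRANSPARENT = {"getelementptr"}  # indexing is fine if its users are simple

def only_simple_uses(uses):
    """True if every use is a load, possibly reached through GEPs."""
    for opcode, nested in uses:
        if opcode in SIMPLE_LEAVES:
            continue
        if opcode in TRANSPARENT and only_simple_uses(nested):
            continue
        return False  # phi, select, call, store, ...
    return True

def lowering_strategy(uses):
    """Leave simple arguments to the back-end; copy the rest early."""
    return "leave-byval" if only_simple_uses(uses) else "materialize-early"
```

An argument that is only loaded from, directly or through an index, stays in parameter space; one that flows into a select or call is copied early so the optimizer still gets a chance to clean up the copy.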

The ptx_kernel calling convention is what marks the entry point; the
back-end auto-upgrades any remaining "kernel" annotation to it, and on
LLVM >= 20 the redundant annotation even caused miscompilations.

String function attributes are accepted by any in-process LLVM, so
there is no reason to gate them on LLVM 21. Keep the nvvm.annotations
form too, for the in-process LLVM's NVPTX passes to consume.

The back-end carries the upstreamed equivalent of this pass, applying
it only where needed (the ptxas unreachable bug is fixed in PTX 8.3).

The back-end natively honors per-instruction afn flags for f32
fdiv/sqrt/rsqrt, selecting approximate (and, based on the function's
denormal mode, flushing) instructions, so the f32 rewrites were
redundant. The f64 paths remain: NVPTX has no fast f64 lowerings, and
does not select rsqrt.approx.d for 1/sqrt(x).

Notably, this declares the 32-bit tensor-memory address space, and
aligns i128 like Julia does (16 bytes on Julia 1.12 and later, which
is also the back-end's preference), keeping kernel argument layouts
consistent between host and device.
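
Why the i128 alignment matters for argument layouts can be seen with a small struct-layout calculation. This is a minimal Python sketch under stated assumptions: the `layout` helper and the type tables are hypothetical, and the 8-byte alignment is only a counter-example to show the mismatch, not a claim about any particular configuration.

```python
def layout(field_types, sizes, aligns):
    """Compute C-style field offsets and total size for a struct."""
    offset, offsets, max_align = 0, [], 1
    for ty in field_types:
        a = aligns[ty]
        offset = (offset + a - 1) // a * a  # round up to the field's alignment
        offsets.append(offset)
        offset += sizes[ty]
        max_align = max(max_align, a)
    total = (offset + max_align - 1) // max_align * max_align  # tail padding
    return offsets, total

SIZES = {"i8": 1, "i64": 8, "i128": 16}
ALIGN_I128_16 = {"i8": 1, "i64": 8, "i128": 16}  # matches the text above
ALIGN_I128_8 = {"i8": 1, "i64": 8, "i128": 8}    # assumed counter-example

fields = ["i8", "i128"]
layout_16 = layout(fields, SIZES, ALIGN_I128_16)
layout_8 = layout(fields, SIZES, ALIGN_I128_8)
```

If the two sides disagreed on the alignment, the same two-field argument would place its i128 field at different offsets (16 versus 8 here), so the device would read the wrong bytes.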

llc inherits our standard streams, leaking warnings to the terminal and
losing the actual error on failure. Capture its output, include it in
compilation errors, and warn when a successful invocation still printed
diagnostics (e.g. having ignored an unrecognized CPU).
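
The capture-and-report behavior can be sketched with Python's `subprocess` module. This is an illustration of the pattern only: `run_backend` and `CompilationError` are hypothetical names, and the command to run is supplied by the caller rather than being a real llc invocation.

```python
import subprocess
import sys
import warnings

class CompilationError(Exception):
    """Raised when the external tool exits with a non-zero status."""

def run_backend(cmd, input_bytes=b""):
    """Run an external tool with captured streams instead of inherited ones."""
    proc = subprocess.run(
        cmd,
        input=input_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    diagnostics = proc.stderr.decode(errors="replace").strip()
    if proc.returncode != 0:
        # Keep the tool's own message in the error instead of losing it.
        raise CompilationError(
            f"back-end failed with exit code {proc.returncode}:\n{diagnostics}"
        )
    if diagnostics:
        # Success, but the tool still had something to say.
        warnings.warn(f"back-end printed diagnostics:\n{diagnostics}")
    return proc.stdout
```

Because nothing is inherited, diagnostics reach the user only through the raised error or the warning, which is what allows them to be attached to the compilation that produced them.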

The built-in NVVMReflectPass, scheduled through the in-process LLVM's
TargetMachine since LLVM 17, derives __CUDA_ARCH from the middle-end
subtarget -- now the generic one, which would select the most
conservative code from arch-dispatching libraries like libdevice.
Instead, fold reflect calls ourselves on every version, at the true
target, right after linking device libraries; the built-in pass is
left with nothing to do.
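
Folding a reflect query at the true target amounts to replacing it with a constant and keeping only the matching code path. The Python sketch below illustrates that idea; `reflect_value` and `fold_arch_dispatch` are hypothetical helpers, the thresholds in the usage are made up, and mapping unknown queries to 0 is an assumption modeled on the default in LLVM's NVVMReflect pass.

```python
def reflect_value(query, sm_major, sm_minor, ftz=False):
    """Constant that a reflect query folds to for a given target."""
    if query == "__CUDA_ARCH":
        # e.g. compute capability 7.5 -> 750
        return sm_major * 100 + sm_minor * 10
    if query == "__CUDA_FTZ":
        return int(ftz)
    return 0  # assumed default for unrecognized queries

def fold_arch_dispatch(variants, sm_major, sm_minor):
    """Pick the variant with the highest minimum arch the target satisfies.

    `variants` is a list of (min_cuda_arch, name) pairs.
    """
    arch = reflect_value("__CUDA_ARCH", sm_major, sm_minor)
    eligible = [(min_arch, name) for min_arch, name in variants if arch >= min_arch]
    return max(eligible)[1]

# Made-up thresholds, for illustration only.
variants = [(0, "baseline"), (800, "newer")]
```

Folding with the true target's capability selects the newer variant where it applies, whereas folding with a lower, conservative value would always fall back to the baseline.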

maleadt (Member, Author) commented Jun 6, 2026

CI failure is benign; we're actually generating better code now.

maleadt merged commit 8adcb92 into main Jun 6, 2026
35 of 37 checks passed
maleadt deleted the tb/ptx_llvm22 branch June 6, 2026 09:08