PTX: modernize compilation now that LLVM 22 is used by maleadt · Pull Request #836 · JuliaGPU/GPUCompiler.jl

maleadt · 2026-06-06T07:52:58Z

PTX machine code is now generated by an external, up-to-date LLVM 22; the in-process LLVM only drives the middle end. This PR ports several improvements that build on that split:

the middle end now uses a generic TargetMachine, folds __nvvm_reflect itself at the true target, and accepts NVVM intrinsics the in-process LLVM does not know about. Together, these changes fully support devices newer than the in-process LLVM knows about;
the datalayout, launch-bound attributes, and kernel calling convention now match what the external back-end expects;
workarounds the back-end no longer needs are gone: lower_unreachable!, byval lowering of arguments it wouldn't copy, and fast-math rewrites of f32 operations it handles natively;
warnings and errors from the external llc are now surfaced.

Machine code is generated by an external, up-to-date back-end, so the target should not be limited by what the in-process LLVM supports. That LLVM may not even know the targeted device -- unknown CPUs noisily degrade to the sm_20 subtarget -- and only drives middle-end heuristics, so construct its TargetMachine for the generic subtarget.

Reflection-based dispatch in libdevice can select intrinsics that only the external back-end supports, e.g. llvm.nvvm.tanh.approx.f32 chosen by __nv_fast_tanhf under __CUDA_ARCH >= 730. Validation used to reject these as unknown functions; defer to the back-end instead.

The LLVM 22 back-end accesses byval kernel arguments directly in parameter space when all uses are simple (possibly GEP-indexed) loads, which Julia's immutable semantics make the common case: such arguments qualify as implicit grid constants and are read via ld.param without a local copy, even when dynamically indexed. Arguments with complex uses (phis, selects, calls) do still get copied, but during machine-code generation, where the optimizer cannot clean that up anymore; keep materializing those early. The PTX-level parameter layout is identical either way, so the launch ABI does not change.
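As a rough illustration of that split, here is a toy model of the classification in plain Julia. It does not use LLVM.jl and is not the code from this PR: the `Use` struct, the kind symbols, and both function names are hypothetical, and only mirror the rule described above (loads, possibly behind GEPs, are simple; everything else forces an early copy).

```julia
# Toy model of a use of a byval argument (hypothetical, not an LLVM.jl type).
struct Use
    kind::Symbol          # e.g. :load, :gep, :phi, :select, :call
    users::Vector{Use}    # uses of the value this use produces (relevant for GEPs)
end
Use(kind::Symbol) = Use(kind, Use[])

# A use is simple if it is a load, or a GEP whose users are all simple.
is_simple(u::Use) = u.kind === :load ||
                    (u.kind === :gep && all(is_simple, u.users))

# Only arguments with at least one complex use need to be copied early.
needs_early_copy(uses) = !all(is_simple, uses)

# Dynamically indexed loads stay in parameter space.
indexed = [Use(:load), Use(:gep, [Use(:load)])]
# A phi anywhere in the chain makes the argument complex.
complex = [Use(:gep, [Use(:phi)])]

println(needs_early_copy(indexed))
println(needs_early_copy(complex))
```

In this model the first argument would be left to the back-end, while the second would still be materialized during the middle end so the optimizer can clean up the copy.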

The ptx_kernel calling convention is what marks the entry point; the back-end auto-upgrades any remaining "kernel" annotation to it, and on LLVM >= 20 the redundant annotation even caused miscompilations.

String function attributes are accepted by any in-process LLVM, so there is no reason to gate them on LLVM 21. Keep the nvvm.annotations form too, for the in-process LLVM's NVPTX passes to consume.
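A minimal sketch of what emitting these as string attributes could look like, in plain Julia without LLVM.jl. The helper name and its keyword arguments are hypothetical, and the attribute names (`nvvm.maxntid`, `nvvm.minctasm`, `nvvm.maxnreg`) are the ones I understand recent NVPTX back-ends to read; treat both as assumptions rather than as the code in this PR.

```julia
# Build (name => value) pairs for launch-bound string function attributes.
# Keyword names are illustrative only; attribute names are assumed, see above.
function launch_bound_attributes(; maxthreads=nothing, blocks_per_sm=nothing,
                                   maxregs=nothing)
    attrs = Pair{String,String}[]
    # Thread counts may be a scalar or an (x, y, z) tuple; join with commas.
    maxthreads === nothing ||
        push!(attrs, "nvvm.maxntid" => join(maxthreads, ","))
    blocks_per_sm === nothing ||
        push!(attrs, "nvvm.minctasm" => string(blocks_per_sm))
    maxregs === nothing ||
        push!(attrs, "nvvm.maxnreg" => string(maxregs))
    return attrs
end

for (name, value) in launch_bound_attributes(maxthreads=(16, 16), maxregs=32)
    println(name, " = ", value)
end
```

Because the attributes are plain strings, nothing here depends on which LLVM version is loaded in-process.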

The back-end carries the upstreamed equivalent of the lower_unreachable! pass, applying it only where needed (the ptxas unreachable bug is fixed in PTX 8.3).

The back-end natively honors per-instruction afn flags for f32 fdiv/sqrt/rsqrt, selecting approximate (and, based on the function's denormal mode, flushing) instructions, so the f32 rewrites were redundant. The f64 paths remain: NVPTX has no fast f64 lowerings, and does not select rsqrt.approx.d for 1/sqrt(x).

Notably, the updated datalayout declares the 32-bit tensor-memory address space, and aligns i128 like Julia does (16 bytes on Julia 1.12 and later, which is also the back-end's preference), keeping kernel argument layouts consistent between host and device.

Previously, llc inherited our standard streams, leaking warnings to the terminal and losing the actual error on failure. Capture its output, include it in compilation errors, and warn when a successful invocation still printed diagnostics (e.g. having ignored an unrecognized CPU).
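The capture pattern can be sketched with Base Julia alone. This is a generic illustration for any external command, not the implementation in this PR; `run_captured` is a hypothetical name, and the `sh` invocation in the usage line merely stands in for llc on a Unix-like system.

```julia
# Run `cmd` with stdout and stderr captured instead of inherited.
# Returns (output, diagnostics); throws with the diagnostics attached on failure.
function run_captured(cmd::Cmd)
    out = IOBuffer()
    err = IOBuffer()
    # `ignorestatus` keeps `run` from throwing on a non-zero exit code,
    # so the captured stderr can be included in our own error.
    proc = run(pipeline(ignorestatus(cmd); stdout=out, stderr=err))
    output = String(take!(out))
    diagnostics = String(take!(err))
    if !success(proc)
        error("command failed with exit code $(proc.exitcode):\n$diagnostics")
    end
    # A successful run may still have printed warnings; surface them.
    isempty(diagnostics) ||
        @warn "command succeeded but printed diagnostics" diagnostics
    return output, diagnostics
end

# Stand-in for a tool that succeeds but complains on stderr.
output, diagnostics = run_captured(`sh -c "echo ptx; echo warning >&2"`)
```

Routing the warning through the logging system rather than the terminal means callers can filter or test for it like any other log message.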

The built-in NVVMReflectPass, scheduled through the in-process LLVM's TargetMachine since LLVM 17, derives __CUDA_ARCH from the middle-end subtarget -- now the generic one, which would select the most conservative code from arch-dispatching libraries like libdevice. Instead, fold reflect calls ourselves on every version, at the true target, right after linking device libraries; the built-in pass is left with nothing to do.
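To make the difference concrete, here is a small plain-Julia sketch of the values such folding produces. It assumes the usual convention that `__CUDA_ARCH` folds to ten times the sm number and that unknown queries fold to 0; `reflect_value` and `picks_fast_tanh` are hypothetical names, the 730 threshold is simply the one quoted above, and the real fold operates on `__nvvm_reflect` calls in LLVM IR rather than on strings.

```julia
# Constant that an `__nvvm_reflect(query)` call would fold to for a device
# of compute capability `cap` (illustrative model, not the actual pass).
function reflect_value(query::AbstractString, cap::VersionNumber; ftz::Bool=false)
    if query == "__CUDA_ARCH"
        # sm_75 -> 750, sm_90 -> 900, sm_100 -> 1000
        return Int(cap.major) * 100 + Int(cap.minor) * 10
    elseif query == "__CUDA_FTZ"
        return ftz ? 1 : 0
    else
        # Unknown queries are assumed to fold to 0.
        return 0
    end
end

# Arch dispatch as described for __nv_fast_tanhf, using the threshold from the text.
picks_fast_tanh(cap) = reflect_value("__CUDA_ARCH", cap) >= 730

println(reflect_value("__CUDA_ARCH", v"9.0"))
println(picks_fast_tanh(v"9.0"))
println(picks_fast_tanh(v"5.0"))
```

Folding with a conservative capability would always take the slow branch, which is why the fold has to use the true target rather than whatever subtarget the middle-end TargetMachine was built for.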

maleadt · 2026-06-06T09:08:00Z

CI failure is benign; we're actually generating better code now.

maleadt added 10 commits June 6, 2026 09:40

PTX: drop the "kernel" nvvm.annotation.

8a25ecc

The ptx_kernel calling convention is what marks the entry point; the back-end auto-upgrades any remaining "kernel" annotation to it, and on LLVM >= 20 the redundant annotation even caused miscompilations.

PTX: emit launch-bound attributes unconditionally.

625b564

String function attributes are accepted by any in-process LLVM, so there is no reason to gate them on LLVM 21. Keep the nvvm.annotations form too, for the in-process LLVM's NVPTX passes to consume.

PTX: drop lower_unreachable!.

d995060

The back-end carries the upstreamed equivalent of this pass, applying it only where needed (the ptxas unreachable bug is fixed in PTX 8.3).

maleadt merged commit 8adcb92 into main Jun 6, 2026
35 of 37 checks passed

maleadt deleted the tb/ptx_llvm22 branch June 6, 2026 09:08
