Metal: Improve lowering of math intrinsics by maleadt · Pull Request #821 · JuliaGPU/GPUCompiler.jl

maleadt · 2026-06-02T17:47:05Z

Just like we did for CUDA.jl, this will allow us to use LLVM intrinsics in Metal.jl.

AIR has no vector floating-point min/max intrinsic, only the scalar air.fmin/air.fmax. Vector llvm.minimum/llvm.maximum (which Julia's NaN-propagating min/max can produce once LLVM vectorizes a min/max loop) hit the "Unsupported maximum/minimum type" error, and vector llvm.minnum/llvm.maxnum would lower to a nonexistent air.fmin.v4f32-style call.

Scalarize each vector min/max into element-wise scalar intrinsic calls before AIR lowering so the existing scalar path applies. Learned from the native AIR LLVM backend, which scalarizes vector min/max for the same reason.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
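As a rough illustration of the scalarization described in this commit, the sketch below generates pseudo-IR for one vector min/max call split into element-wise scalar calls. It is not the GPUCompiler.jl implementation: the function name `scalarize`, the value names (`%a`, `%b`, `%r0`, ...) and the `.f32` suffix on the scalar intrinsic are assumptions made for the example.

```python
def scalarize(intrinsic: str, width: int, elem_ty: str = "float", suffix: str = "f32") -> list[str]:
    """Return pseudo-IR lines replacing one <width x elem_ty> call to
    `intrinsic` with `width` scalar calls (hypothetical naming throughout)."""
    vec_ty = f"<{width} x {elem_ty}>"
    lines = []
    acc = "poison"  # the result vector is built up one lane at a time
    for i in range(width):
        # Pull lane i out of both vector operands.
        lines.append(f"%a{i} = extractelement {vec_ty} %a, i32 {i}")
        lines.append(f"%b{i} = extractelement {vec_ty} %b, i32 {i}")
        # Call the scalar form, which the existing scalar lowering understands.
        lines.append(f"%r{i} = call {elem_ty} @{intrinsic}.{suffix}({elem_ty} %a{i}, {elem_ty} %b{i})")
        # Put the scalar result back into lane i of the result vector.
        lines.append(f"%v{i} = insertelement {vec_ty} {acc}, {elem_ty} %r{i}, i32 {i}")
        acc = f"%v{i}"
    return lines


if __name__ == "__main__":
    print("\n".join(scalarize("llvm.minimum", 4)))
```

After this step only scalar `llvm.minimum.f32`-style calls remain, so no vector-typed AIR name is ever requested.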

Move the math-function lowering out of the front-end and into the back-end, matching the PTX design (front-ends emit plain LLVM; GPUCompiler lowers to device intrinsics).

`lower_math_intrinsics!` maps the LLVM math intrinsics Julia generates (sqrt, fma, floor, ceil, trunc and rint, which Julia uses for round and every rounding mode) to Metal's `air.<op>` device functions for f16/f32, and to the relaxed, f32-only `air.fast_<op>` when the call carries the `afn` fast-math flag. f64 and vector types are left untouched (no matching AIR intrinsic).

The precise/fast/f16 mapping was verified against Apple's own frontend (`xcrun metal -S -emit-llvm`, precise vs -ffast-math): all six ops have f16+f32 forms; all but fma (which is exact) have an f32 fast variant; half always stays precise.

Add a `fastmath` field to MetalCompilerTarget (defaulting to the process --math-mode, like the PTX target) that, via `apply_fastmath!` in finish_linked_module!, flags every FP op `afn` so the lowering selects the fast variants module-wide.

This lets Metal.jl drop its hand-written air.*/air.fast_* overrides for these ops and rely on the intrinsics Julia emits natively. Transcendentals without LLVM intrinsics (sin/cos/exp/log/pow/...) stay as direct air.* calls in the front-end.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
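The selection rule in this commit message can be summarised as a small lookup. The Python sketch below only models the rule as stated (six ops, f16/f32 only, fast variant for f32 with `afn`, no fast fma); the function name `lower_math` and the `.f16`/`.f32` suffix on the returned names are assumptions for illustration, not the actual code in `lower_math_intrinsics!`.

```python
# Ops the commit message lists as having precise f16 and f32 AIR forms.
PRECISE_OPS = {"sqrt", "fma", "floor", "ceil", "trunc", "rint"}
# fma is exact, so it has no relaxed variant.
NO_FAST_VARIANT = {"fma"}


def lower_math(op: str, ty: str, afn: bool = False):
    """Return the AIR function name for llvm.<op> on scalar type `ty`,
    or None when the call is left untouched (f64, vectors, unknown ops).
    The type suffix on the returned name is an assumed spelling."""
    if op not in PRECISE_OPS or ty not in ("f16", "f32"):
        return None
    if afn and ty == "f32" and op not in NO_FAST_VARIANT:
        return f"air.fast_{op}.{ty}"
    # Half precision always stays on the precise form, even with afn.
    return f"air.{op}.{ty}"


if __name__ == "__main__":
    for args in [("sqrt", "f32", False), ("sqrt", "f32", True), ("fma", "f32", True), ("rint", "f16", True)]:
        print(args, "->", lower_math(*args))
```

With a module-wide `fastmath` setting, every call would take the `afn=True` path of this table.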

Julia emits LLVM intrinsics for the integer ops too (llvm.abs, llvm.{s,u}{min,max}, llvm.ctlz/cttz/ctpop/bitreverse), so lower them in the back-end rather than making front-ends wrap each one (matching how CUDA.jl relies on NVPTX).

- Fix the existing llvm.abs lowering: llvm.abs is (value, i1 is_int_min_poison) but air.abs.{s,u} takes only the value, so drop the trailing i1 operand. The old code passed it through, producing a malformed air.abs call; this was latent because Metal.jl currently overrides Base.abs and preempts this path.
- Add pure renames for the bit intrinsics to Apple's builtin names, preserving signatures: llvm.ctlz->air.clz, llvm.cttz->air.ctz (both keep the i1), llvm.ctpop->air.popcount, llvm.bitreverse->air.reverse_bits. Apple's frontend emits these air.* forms, so we rename rather than rely on the metallib loader accepting the llvm.* names.

Names/signatures verified against `xcrun metal -S -emit-llvm`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
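The two cases above (drop the `i1` for abs, pure rename for the bit intrinsics) can be sketched as a function from an intrinsic call to its lowered form. This is a simplified model under stated assumptions: calls are represented as a base name plus an operand list, type suffixes are omitted, `lower_int` is a hypothetical helper, and mapping `llvm.abs` to the signed `air.abs.s` form is an assumption based on the `air.abs.{s,u}` naming in the commit message.

```python
# Pure renames: the operand list is passed through unchanged.
RENAMES = {
    "llvm.ctlz": "air.clz",              # keeps (value, i1)
    "llvm.cttz": "air.ctz",              # keeps (value, i1)
    "llvm.ctpop": "air.popcount",
    "llvm.bitreverse": "air.reverse_bits",
}


def lower_int(name: str, operands: list):
    """Return (lowered name, lowered operands) for one integer intrinsic call."""
    if name == "llvm.abs":
        # llvm.abs is (value, i1 is_int_min_poison); the AIR builtin takes
        # only the value, so the trailing flag must be dropped.
        value, _is_int_min_poison = operands
        return "air.abs.s", [value]  # signed form assumed for this sketch
    if name in RENAMES:
        return RENAMES[name], list(operands)
    # Anything else is not handled by this sketch and is left as-is.
    return name, list(operands)


if __name__ == "__main__":
    print(lower_int("llvm.abs", ["%x", "false"]))
    print(lower_int("llvm.ctlz", ["%x", "false"]))
```

Passing the flag through for abs, as the old code did, would yield a two-operand call to a one-operand builtin.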


Validated end-to-end on Apple GPU hardware (AGXG13G).

- have_fma(::MetalCompilerTarget) = true (mirroring the PTX target): Apple GPUs have FMA, so Julia's `fma` should use the hardware instruction (llvm.fma -> air.fma) rather than its Float64 `fma_emulated` fallback. Without this, dropping Metal.jl's fma override made `fma(::Float32)` emulate in Float64, which Metal rejects. (Float16 still needs a front-end override: Julia gates it on have_fma(Float16), which lowers to an unfoldable jl_have_fma call.)
- Honor `nnan` on llvm.minimum/maximum: @fastmath / fastmath=true assume no NaNs, so lower to the relaxed air.fast_fmin/fmax (f32; f16 has no fast form, so air.fmin/fmax) instead of the NaN-propagating wrapper. This lets Metal.jl drop its FastMath.min_fast/max_fast overrides too. Plain min/max keep the wrapper, matching CPU Julia's NaN-propagating semantics.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
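To make the `nnan` rule concrete, the sketch below models which lowering is chosen for a scalar llvm.minimum, together with a plain-Python model of what "NaN-propagating" means. The helper names and the `"nan_propagating_wrapper"` label are hypothetical; the actual wrapper GPUCompiler emits is not shown in this PR text, and signed-zero ordering is not modeled here.

```python
import math


def select_min_lowering(ty: str, nnan: bool) -> str:
    """Pick the lowering for a scalar llvm.minimum on f16 or f32."""
    if ty not in ("f16", "f32"):
        raise ValueError("this sketch only covers f16 and f32")
    if nnan:
        # No NaNs assumed: the relaxed builtin is fine. f16 has no fast
        # form, so it uses the precise air.fmin instead.
        return "air.fast_fmin" if ty == "f32" else "air.fmin"
    # Plain min keeps CPU Julia's semantics via a NaN-propagating wrapper.
    return "nan_propagating_wrapper"


def minimum_nan_propagating(a: float, b: float) -> float:
    """Reference semantics: the result is NaN if either input is NaN."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return min(a, b)


if __name__ == "__main__":
    print(select_min_lowering("f32", nnan=True))
    print(minimum_nan_propagating(1.0, math.nan))
```

The same table applies to llvm.maximum with `fmax` in place of `fmin`.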

Add a back-end peephole that folds air.{min,max}.{s,u}.iN of a single-use call to the same builtin into AGX's 3-way air.{min,max}3.{s,u}.iN. AGX has native 3-way min/max, but neither Julia (which reduces min(a,b,c) to nested 2-arg calls) nor Apple's own frontend emits it.

Doing this in the back-end catches every chained min/max, not just the literal 3-argument calls a front-end override could intercept, and lets Metal.jl drop its min3/max3 overrides.

Integer only: float min/max go through the NaN-propagating wrapper, which air.f{min,max}3 would not preserve. Validated on hardware.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
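The peephole can be illustrated on a toy instruction list. The sketch below is a simplified model, not the GPUCompiler.jl pass: instructions are a dict from result name to `(op, operands)`, the function name `fold_min_max3` is hypothetical, and single-use is checked with a plain use count.

```python
from collections import Counter


def fold_min_max3(instrs: dict) -> dict:
    """Fold op(op(a, b), c) into op3(a, b, c) when the inner call has one use.

    `instrs` maps result name -> (op name, [operand names]) in program order.
    Only integer builtins named air.min.* / air.max.* are considered.
    """
    uses = Counter(arg for _, args in instrs.values() for arg in args)
    out = dict(instrs)
    for name, (op, args) in instrs.items():
        if not op.startswith(("air.min.", "air.max.")) or len(args) != 2:
            continue
        for i, arg in enumerate(args):
            inner = out.get(arg)
            # The inner call must be the same builtin and used only here.
            if inner and inner[0] == op and len(inner[1]) == 2 and uses[arg] == 1:
                op3 = op[:7] + "3" + op[7:]  # air.min.s.i32 -> air.min3.s.i32
                out[name] = (op3, inner[1] + [args[1 - i]])
                del out[arg]
                break
    return out


if __name__ == "__main__":
    chain = {
        "t0": ("air.min.s.i32", ["a", "b"]),
        "t1": ("air.min.s.i32", ["t0", "c"]),
    }
    print(fold_min_max3(chain))
```

If `t0` had a second user, the fold would be skipped, since removing the inner call would break that other use. Float builtins such as `air.fmin` never match the prefix check, which mirrors the integer-only restriction.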

GPUCompiler 1.16 (JuliaGPU/GPUCompiler.jl#821) started lowering llvm.ctlz/llvm.cttz to air.clz/air.ctz using the 2-operand form emitted by Apple's frontend (value + i1 is_zero_undef). Released Metal.jl versions still provide device overrides that ccall the same intrinsics in their 1-operand form, so kernels reaching both sources put two signatures behind one symbol, failing to compile or miscompiling (JuliaGPU/GPUCompiler.jl#832).

The overrides were removed on Metal.jl master (JuliaGPU/Metal.jl#801), whose next release requires GPUCompiler 1.17; retro-cap released versions to 1.15.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

maleadt and others added 5 commits June 2, 2026 16:49

maleadt merged commit 2df76fe into main Jun 3, 2026
34 of 37 checks passed

maleadt deleted the tb/metal branch June 3, 2026 05:38

maleadt mentioned this pull request Jun 3, 2026

Use LLVM intrinsics for many math operations JuliaGPU/Metal.jl#801

Merged

aplavin mentioned this pull request Jun 4, 2026

Metal: drop the spurious i1 operand when lowering ctlz/cttz to air.clz/air.ctz #832

Closed

maleadt mentioned this pull request Jun 5, 2026

Cap GPUCompiler compat of Metal 1.5-1.9 at 1.15 JuliaRegistries/General#157199

Merged
