AIR has no vector floating-point min/max intrinsic, only the scalar air.fmin/air.fmax. Vector llvm.minimum/llvm.maximum (which Julia's NaN-propagating min/max can produce once LLVM vectorizes a min/max loop) hit the "Unsupported maximum/minimum type" error, and vector llvm.minnum/llvm.maxnum would lower to a nonexistent air.fmin.v4f32-style call.

Scalarize each vector min/max into element-wise scalar intrinsic calls before AIR lowering so the existing scalar path applies. Learned from the native AIR LLVM backend, which scalarizes vector min/max for the same reason.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
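The scalarization described above can be modelled outside LLVM. The following is a minimal Python sketch, not the GPUCompiler implementation: a vector is a plain list, and `scalar_minimum` and `scalarize` are hypothetical names standing in for the scalar intrinsic and for the extract/call/insert sequence the pass would emit per lane.

```python
import math

def scalar_minimum(a: float, b: float) -> float:
    """Scalar minimum with llvm.minimum-style semantics (hypothetical helper):
    NaN in either operand gives NaN, and -0.0 orders below +0.0."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    if a == b:
        # Equal values only differ for signed zeros; prefer the negative one.
        return a if math.copysign(1.0, a) < 0 else b
    return a if a < b else b

def scalarize(scalar_op, lhs, rhs):
    """Apply a scalar op lane by lane, standing in for
    extractelement -> scalar call -> insertelement on each lane."""
    if len(lhs) != len(rhs):
        raise ValueError("vector operands must have the same length")
    return [scalar_op(x, y) for x, y in zip(lhs, rhs)]

result = scalarize(scalar_minimum,
                   [1.0, 5.0, math.nan, -2.0],
                   [3.0, 4.0, 0.0, -7.0])
```

Each lane now goes through the same scalar path that already exists, so no vector-typed AIR call is ever needed.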
Move the math-function lowering out of the front-end and into the back-end, matching the PTX design (front-ends emit plain LLVM; GPUCompiler lowers to device intrinsics).

`lower_math_intrinsics!` maps the LLVM math intrinsics Julia generates (sqrt, fma, floor, ceil, trunc and rint, which Julia uses for round and every rounding mode) to Metal's `air.<op>` device functions for f16/f32, and to the relaxed, f32-only `air.fast_<op>` when the call carries the `afn` fast-math flag. f64 and vector types are left untouched (no matching AIR intrinsic). The precise/fast/f16 mapping was verified against Apple's own frontend (`xcrun metal -S -emit-llvm`, precise vs -ffast-math): all six ops have f16+f32 forms; all but fma (which is exact) have an f32 fast variant; half always stays precise.

Add a `fastmath` field to MetalCompilerTarget (defaulting to the process --math-mode, like the PTX target) that, via `apply_fastmath!` in finish_linked_module!, flags every FP op `afn` so the lowering selects the fast variants module-wide.

This lets Metal.jl drop its hand-written air.*/air.fast_* overrides for these ops and rely on the intrinsics Julia emits natively. Transcendentals without LLVM intrinsics (sin/cos/exp/log/pow/...) stay as direct air.* calls in the front-end.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
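The precise/fast/f16 selection rule above fits in a few lines. This is a hedged Python sketch of that rule only: `lower_math_intrinsic` is a hypothetical name, and the `.f16`/`.f32` type suffix on the AIR names is an assumption inferred from the `air.fmin.v4f32`-style naming mentioned in this PR.

```python
# Ops with an f32 fast variant; fma is exact, so it has none.
FAST_CAPABLE = {"sqrt", "floor", "ceil", "trunc", "rint"}
SUPPORTED_OPS = FAST_CAPABLE | {"fma"}

def lower_math_intrinsic(op: str, ty: str, afn: bool = False):
    """Return the AIR function name for llvm.<op> on scalar type `ty`,
    or None when the call should be left untouched.
    The '.<ty>' suffix is an assumed naming convention."""
    if op not in SUPPORTED_OPS or ty not in ("f16", "f32"):
        return None  # f64, vectors and unknown ops: no matching AIR intrinsic
    if afn and ty == "f32" and op in FAST_CAPABLE:
        return f"air.fast_{op}.{ty}"  # relaxed variant, f32 only
    return f"air.{op}.{ty}"           # precise variant; half always stays precise
```

For example, `lower_math_intrinsic("sqrt", "f32", afn=True)` selects the fast form, while the same call with `"f16"` or with `"fma"` falls back to the precise one.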
Julia emits LLVM intrinsics for the integer ops too — llvm.abs, llvm.{s,u}{min,max},
llvm.ctlz/cttz/ctpop/bitreverse — so lower them in the back-end rather than
making front-ends wrap each one (matching how CUDA.jl relies on NVPTX).
- Fix the existing llvm.abs lowering: llvm.abs is (value, i1 is_int_min_poison)
but air.abs.{s,u} takes only the value, so drop the trailing i1 operand (the
old code passed it through, producing a malformed air.abs call — latent
because Metal.jl currently overrides Base.abs and preempts this path).
- Add pure renames for the bit intrinsics to Apple's builtin names, preserving
signatures: llvm.ctlz->air.clz, llvm.cttz->air.ctz (both keep the i1),
llvm.ctpop->air.popcount, llvm.bitreverse->air.reverse_bits. Apple's frontend
emits these air.* forms, so we rename rather than rely on the metallib loader
accepting the llvm.* names.
Names/signatures verified against `xcrun metal -S -emit-llvm`.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
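The two kinds of integer lowering in this commit (operand-dropping for abs, pure renames for the bit intrinsics) can be illustrated with a small Python model. It is a sketch under assumptions: `lower_int_intrinsic` is a hypothetical name, calls are modelled as a name plus an operand list with type suffixes omitted, and the sketch always picks `air.abs.s` because llvm.abs treats its operand as signed.

```python
# Pure renames: the operand list is passed through unchanged.
RENAMES = {
    "llvm.ctlz": "air.clz",            # keeps the trailing i1
    "llvm.cttz": "air.ctz",            # keeps the trailing i1
    "llvm.ctpop": "air.popcount",
    "llvm.bitreverse": "air.reverse_bits",
}

def lower_int_intrinsic(name: str, operands: list):
    """Return (new_name, new_operands) for one call (hypothetical model)."""
    if name == "llvm.abs":
        # llvm.abs is (value, i1 is_int_min_poison); air.abs takes only the value.
        value, _is_int_min_poison = operands
        return "air.abs.s", [value]
    if name in RENAMES:
        return RENAMES[name], list(operands)
    return name, list(operands)  # anything else is left alone
```

Passing the i1 through for abs, as the old code did, would yield a two-operand `air.abs` call, which is the malformed form this commit removes.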
Validated end-to-end on Apple GPU hardware (AGXG13G).
- have_fma(::MetalCompilerTarget) = true (mirroring the PTX target): Apple GPUs have FMA, so Julia's `fma` should use the hardware instruction (llvm.fma -> air.fma) rather than its Float64 `fma_emulated` fallback. Without this, dropping Metal.jl's fma override made `fma(::Float32)` emulate in Float64, which Metal rejects. (Float16 still needs a front-end override: Julia gates it on have_fma(Float16), which lowers to an unfoldable jl_have_fma call.)
- Honor `nnan` on llvm.minimum/maximum: @fastmath / fastmath=true assume no NaNs, so lower to the relaxed air.fast_fmin/fmax (f32; f16 has no fast form, so air.fmin/fmax) instead of the NaN-propagating wrapper. This lets Metal.jl drop its FastMath.min_fast/max_fast overrides too. Plain min/max keep the wrapper, matching CPU Julia's NaN-propagating semantics.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
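The `nnan` rule for llvm.minimum/maximum reduces to a three-way choice. The Python sketch below models only that choice: `lower_minimum_maximum` is a hypothetical name, the `.f16`/`.f32` suffixes are an assumed naming convention, and the string returned for the NaN-propagating case is a placeholder, since the wrapper's real name is not given here.

```python
def lower_minimum_maximum(op: str, ty: str, nnan: bool) -> str:
    """Pick the lowering target for llvm.minimum / llvm.maximum (model only)."""
    if op not in ("minimum", "maximum"):
        raise ValueError(f"unexpected op: {op}")
    if ty not in ("f16", "f32"):
        raise ValueError(f"unsupported type: {ty}")
    base = "fmin" if op == "minimum" else "fmax"
    if not nnan:
        # Plain min/max keep CPU Julia's NaN-propagating semantics.
        return f"<nan-propagating {base} wrapper>"  # placeholder, not a real symbol
    if ty == "f32":
        return f"air.fast_{base}.f32"  # relaxed form, valid because no NaNs are assumed
    return f"air.{base}.{ty}"          # f16 has no fast form
```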
Add a back-end peephole that folds air.{min,max}.{s,u}.iN of a single-use call
to the same builtin into AGX's 3-way air.{min,max}3.{s,u}.iN. AGX has native
3-way min/max, but neither Julia (which reduces min(a,b,c) to nested 2-arg
calls) nor Apple's own frontend emits it. Doing this in the back-end catches
every chained min/max, not just the literal 3-argument calls a front-end
override could intercept, and lets Metal.jl drop its min3/max3 overrides.
Integer only: float min/max go through the NaN-propagating wrapper, which
air.f{min,max}3 would not preserve. Validated on hardware.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
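The peephole can be sketched on a toy expression tree. This Python model is an illustration under assumptions, not the pass itself: `Call` and `fold_min_max3` are hypothetical names, and `uses` stands in for LLVM's use count on the inner call.

```python
from dataclasses import dataclass

@dataclass
class Call:
    name: str      # e.g. "air.min.s.i32"
    args: list     # operands: strings for plain values, Call for nested calls
    uses: int = 1  # how many users the result has

def fold_min_max3(call: Call) -> Call:
    """Fold min(min(a, b), c) into min3(a, b, c), and likewise for max.
    Only integer builtins match; float names such as air.fmin are skipped."""
    prefix = next((p for p in ("air.min.", "air.max.")
                   if call.name.startswith(p)), None)
    if prefix is None or len(call.args) != 2:
        return call
    for i, arg in enumerate(call.args):
        same_builtin = isinstance(arg, Call) and arg.name == call.name
        if same_builtin and arg.uses == 1 and len(arg.args) == 2:
            other = call.args[1 - i]
            # "air.min.s.i32" -> "air.min3.s.i32"
            name3 = call.name.replace(prefix, prefix[:-1] + "3.", 1)
            return Call(name3, [*arg.args, other], call.uses)
    return call

inner = Call("air.min.s.i32", ["a", "b"])
folded = fold_min_max3(Call("air.min.s.i32", [inner, "c"]))
```

The single-use check matters: if the inner result has other users, it must still be computed, so folding would add work rather than remove it.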
ViralBShah pushed a commit to JuliaRegistries/General that referenced this pull request Jun 5, 2026
GPUCompiler 1.16 (JuliaGPU/GPUCompiler.jl#821) started lowering llvm.ctlz/llvm.cttz to air.clz/air.ctz using the 2-operand form emitted by Apple's frontend (value + i1 is_zero_undef). Released Metal.jl versions still provide device overrides that ccall the same intrinsics in their 1-operand form, so kernels reaching both sources put two signatures behind one symbol, failing to compile or miscompiling (JuliaGPU/GPUCompiler.jl#832).

The overrides were removed on Metal.jl master (JuliaGPU/Metal.jl#801), whose next release requires GPUCompiler 1.17; retro-cap released versions to 1.15.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Just like we did for CUDA.jl, this will allow us to use LLVM intrinsics in Metal.jl.