
Metal: Improve lowering of math intrinsics #821

Merged
maleadt merged 5 commits into main from tb/metal on Jun 3, 2026

maleadt commented Jun 2, 2026

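Just like we did for CUDA.jl, this lets Metal.jl rely on the LLVM intrinsics Julia emits natively: GPUCompiler now lowers them to AIR in the back-end, so the front-end no longer has to override each function.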

maleadt and others added 5 commits June 2, 2026 16:49
AIR has no vector floating-point min/max intrinsic, only the scalar
air.fmin/air.fmax. Vector llvm.minimum/llvm.maximum (which Julia's
NaN-propagating min/max can produce once LLVM vectorizes a min/max loop)
hit the "Unsupported maximum/minimum type" error, and vector
llvm.minnum/llvm.maxnum would lower to a nonexistent air.fmin.v4f32-style
call. Scalarize each vector min/max into element-wise scalar intrinsic
calls before AIR lowering so the existing scalar path applies.

Learned from the native AIR LLVM backend, which scalarizes vector min/max
for the same reason.
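
The scalarization described above can be modelled outside the compiler. The following is a minimal Python sketch, not the GPUCompiler implementation: `scalar_minimum` and `scalarize` are hypothetical names, vectors are plain lists, and signed-zero ordering is ignored for brevity.

```python
import math

def scalar_minimum(a: float, b: float) -> float:
    """NaN-propagating scalar minimum, modelling llvm.minimum on one lane
    (signed-zero ordering is deliberately not modelled)."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return min(a, b)

def scalarize(scalar_op, lhs, rhs):
    """Apply a scalar binary op lane by lane, the way the scalarized IR would:
    extract each lane, issue one scalar call, insert the result back."""
    if len(lhs) != len(rhs):
        raise ValueError("vector operands must have the same lane count")
    result = []
    for lane in range(len(lhs)):
        result.append(scalar_op(lhs[lane], rhs[lane]))
    return result
```

Each lane ends up as one scalar call, which is what lets the existing scalar lowering path handle it unchanged.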

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Move the math-function lowering out of the front-end and into the back-end,
matching the PTX design (front-ends emit plain LLVM; GPUCompiler lowers to
device intrinsics). `lower_math_intrinsics!` maps the LLVM math intrinsics
Julia generates — sqrt, fma, floor, ceil, trunc and rint (which Julia uses
for round and every rounding mode) — to Metal's `air.<op>` device functions
for f16/f32, and to the relaxed, f32-only `air.fast_<op>` when the call
carries the `afn` fast-math flag. f64 and vector types are left untouched (no
matching AIR intrinsic). The precise/fast/f16 mapping was verified against
Apple's own frontend (`xcrun metal -S -emit-llvm`, precise vs -ffast-math):
all six ops have f16+f32 forms; all but fma (which is exact) have an f32 fast
variant; half always stays precise.

Add a `fastmath` field to MetalCompilerTarget (defaulting to the process
--math-mode, like the PTX target) that, via `apply_fastmath!` in
finish_linked_module!, flags every FP op `afn` so the lowering selects the
fast variants module-wide.

This lets Metal.jl drop its hand-written air.*/air.fast_* overrides for these
ops and rely on the intrinsics Julia emits natively. Transcendentals without
LLVM intrinsics (sin/cos/exp/log/pow/...) stay as direct air.* calls in the
front-end.
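
The precise/fast selection rule stated above can be summarized as a small lookup. This is a Python sketch under assumptions: `lower_math_intrinsic` here is a hypothetical stand-in rather than the real `lower_math_intrinsics!`, and the `.f16`/`.f32` suffix on the AIR names is assumed by analogy with the `air.fmin.v4f32`-style names mentioned earlier.

```python
PRECISE_OPS = {"sqrt", "fma", "floor", "ceil", "trunc", "rint"}
# fma is exact, so it is the one op without a fast variant.
FAST_OPS = PRECISE_OPS - {"fma"}

def lower_math_intrinsic(op, ty, afn=False):
    """Return the AIR callee for llvm.<op> on type `ty`,
    or None when the call is left untouched."""
    if op not in PRECISE_OPS or ty not in ("f16", "f32"):
        # f64, vector types and other ops have no matching AIR intrinsic.
        return None
    if afn and ty == "f32" and op in FAST_OPS:
        # The relaxed variants exist for f32 only; half always stays precise.
        return f"air.fast_{op}.{ty}"
    return f"air.{op}.{ty}"
```

With `fastmath` enabled on the target, every FP op carries `afn`, so the same rule picks the fast variants module-wide.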

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Julia emits LLVM intrinsics for the integer ops too (llvm.abs, llvm.{s,u}{min,max},
llvm.ctlz/cttz/ctpop/bitreverse), so lower them in the back-end rather than
making front-ends wrap each one (matching how CUDA.jl relies on NVPTX).

- Fix the existing llvm.abs lowering: llvm.abs is (value, i1 is_int_min_poison)
  but air.abs.{s,u} takes only the value, so drop the trailing i1 operand (the
  old code passed it through, producing a malformed air.abs call — latent
  because Metal.jl currently overrides Base.abs and preempts this path).
- Add pure renames for the bit intrinsics to Apple's builtin names, preserving
  signatures: llvm.ctlz->air.clz, llvm.cttz->air.ctz (both keep the i1),
  llvm.ctpop->air.popcount, llvm.bitreverse->air.reverse_bits. Apple's frontend
  emits these air.* forms, so we rename rather than rely on the metallib loader
  accepting the llvm.* names.

Names/signatures verified against `xcrun metal -S -emit-llvm`.
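
The two kinds of rewrite listed above (dropping an operand for abs, pure renames for the bit intrinsics) can be sketched as follows. This is an illustrative Python model, not the actual pass: `lower_integer_intrinsic` is a hypothetical name, calls are `(name, operands)` pairs, type suffixes are omitted, and picking `air.abs.s` for `llvm.abs` is an assumption.

```python
# Pure renames: the operand list (including the trailing i1 of ctlz/cttz) is kept.
RENAMES = {
    "llvm.ctlz": "air.clz",
    "llvm.cttz": "air.ctz",
    "llvm.ctpop": "air.popcount",
    "llvm.bitreverse": "air.reverse_bits",
}

def lower_integer_intrinsic(name, operands):
    """Return the (callee, operands) pair after lowering one call."""
    if name == "llvm.abs":
        # llvm.abs is (value, i1 is_int_min_poison); air.abs takes only the value.
        value, _is_int_min_poison = operands
        return "air.abs.s", [value]  # signed form assumed for illustration
    if name in RENAMES:
        return RENAMES[name], list(operands)
    return name, list(operands)  # anything else is left alone
```

For example, `lower_integer_intrinsic("llvm.ctlz", ["%x", "false"])` keeps both operands, while the `llvm.abs` case returns a single-operand call.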

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Validated end-to-end on Apple GPU hardware (AGXG13G).

- have_fma(::MetalCompilerTarget) = true (mirroring the PTX target): Apple GPUs
  have FMA, so Julia's `fma` should use the hardware instruction (llvm.fma ->
  air.fma) rather than its Float64 `fma_emulated` fallback. Without this,
  dropping Metal.jl's fma override made `fma(::Float32)` emulate in Float64,
  which Metal rejects. (Float16 still needs a front-end override: Julia gates
  it on have_fma(Float16), which lowers to an unfoldable jl_have_fma call.)
- Honor `nnan` on llvm.minimum/maximum: @fastmath / fastmath=true assume no
  NaNs, so lower to the relaxed air.fast_fmin/fmax (f32; f16 has no fast form,
  so air.fmin/fmax) instead of the NaN-propagating wrapper. This lets Metal.jl
  drop its FastMath.min_fast/max_fast overrides too. Plain min/max keep the
  wrapper, matching CPU Julia's NaN-propagating semantics.
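
To make the `nnan` rule above concrete, here is a small Python model of the selection and of the semantic difference it trades on. All function names are hypothetical, and the string `"nan_propagating_wrapper"` merely stands in for whatever the back-end emits for plain min/max.

```python
import math

def select_min_lowering(ty, nnan):
    """Pick the lowering for a scalar llvm.minimum call on f16 or f32."""
    if not nnan:
        return "nan_propagating_wrapper"  # plain min keeps CPU Julia semantics
    if ty == "f32":
        return "air.fast_fmin"  # relaxed form, f32 only
    if ty == "f16":
        return "air.fmin"  # f16 has no fast form
    raise ValueError(f"unsupported type: {ty}")

def nan_propagating_min(a, b):
    """What the wrapper must guarantee: any NaN input yields NaN."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return min(a, b)

def relaxed_min(a, b):
    """A min that assumes no NaNs, so it makes no promise about them."""
    return b if b < a else a
```

On NaN-free inputs the two functions agree, which is why `nnan` makes the cheaper form legal; with a NaN operand `relaxed_min` can silently return the other value.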

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add a back-end peephole that folds air.{min,max}.{s,u}.iN of a single-use call
to the same builtin into AGX's 3-way air.{min,max}3.{s,u}.iN. AGX has native
3-way min/max, but neither Julia (which reduces min(a,b,c) to nested 2-arg
calls) nor Apple's own frontend emits it. Doing this in the back-end catches
every chained min/max, not just the literal 3-argument calls a front-end
override could intercept, and lets Metal.jl drop its min3/max3 overrides.

Integer only: float min/max go through the NaN-propagating wrapper, which
air.f{min,max}3 would not preserve. Validated on hardware.
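
The fold described above can be illustrated on a toy instruction list. This is a hedged Python sketch rather than the real peephole: `fold_three_way` is a hypothetical name, instructions are `(result, callee, operands)` tuples in program order, and only uses inside the list are counted.

```python
import re
from collections import Counter

# Integer 2-way min/max builtins only; float min/max are deliberately not matched.
MINMAX = re.compile(r"^air\.(min|max)\.(s|u)\.i\d+$")

def fold_three_way(instrs):
    """Fold op(op(a, b), c) into the 3-way builtin when the inner call has one use."""
    defs = {res: (callee, args) for res, callee, args in instrs}
    uses = Counter(arg for _, _, args in instrs for arg in args)
    rewritten, dead = {}, set()
    for res, callee, args in instrs:
        if not MINMAX.match(callee) or len(args) != 2:
            continue
        for i, arg in enumerate(args):
            inner = defs.get(arg)
            if (inner and inner[0] == callee and len(inner[1]) == 2
                    and uses[arg] == 1 and arg not in rewritten and arg not in dead):
                three_way = re.sub(r"^air\.(min|max)\.", r"air.\g<1>3.", callee)
                rewritten[res] = (three_way, [*inner[1], args[1 - i]])
                dead.add(arg)  # the inner call's only use is gone
                break
    return [(res, *rewritten.get(res, (callee, args)))
            for res, callee, args in instrs if res not in dead]
```

Because the match is on the call chain rather than on a 3-argument source call, any nested pair is caught; an inner call with a second use is left alone so its result is not lost.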

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
maleadt merged commit 2df76fe into main Jun 3, 2026
34 of 37 checks passed
maleadt deleted the tb/metal branch June 3, 2026 05:38
ViralBShah pushed a commit to JuliaRegistries/General that referenced this pull request Jun 5, 2026
GPUCompiler 1.16 (JuliaGPU/GPUCompiler.jl#821) started lowering
llvm.ctlz/llvm.cttz to air.clz/air.ctz using the 2-operand form emitted
by Apple's frontend (value + i1 is_zero_undef). Released Metal.jl
versions still provide device overrides that ccall the same intrinsics
in their 1-operand form, so kernels reaching both sources put two
signatures behind one symbol, failing to compile or miscompiling
(JuliaGPU/GPUCompiler.jl#832). The overrides were removed on Metal.jl
master (JuliaGPU/Metal.jl#801), whose next release requires
GPUCompiler 1.17; retro-cap released versions to 1.15.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>