PTX: lower @fastmath sqrt to NVPTX approx intrinsics (for LLVM 18)#805

Merged: maleadt merged 1 commit into `main` from `tb/ptx_fast_sqrt` on May 20, 2026

@maleadt (Member) commented May 20, 2026

Add `PTXFSqrtFastPass` alongside `PTXFDivFastPass`. It rewrites `afn`-flagged `llvm.sqrt.f{32,64}` calls to `llvm.nvvm.sqrt.approx{,.ftz}.f` (f32) and to `rcp(rsqrt(x))` (f64). These are the same sequences that NVPTX's `getSqrtEstimate` emits on LLVM 21+, where per-instruction `afn` and the function-level `unsafe-fp-math` attribute are honored. Like the f32 path of `PTXFDivFastPass`, both sqrt paths are temporary backports for LLVM 18, which only consults `TargetMachine.Options.UnsafeFPMath` (unreachable through LLVM.jl).

This will allow us to get rid of most of libdevice in CUDA.jl.
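
For illustration, the f64 path rests on the identity sqrt(x) = 1 / (1 / sqrt(x)). Below is a minimal host-side Julia sketch of that shape. The helpers `rsqrt`, `rcp`, and `sqrt_via_rsqrt` are hypothetical names that do not come from this PR, and they use full-precision division, so the sketch models neither the approximation error of the PTX instructions nor the pass itself.

```julia
# Hypothetical host-side stand-ins for the approximate PTX operations.
# The hardware versions are approximate; these use full-precision division.
rsqrt(x::Float64) = 1.0 / sqrt(x)   # models a reciprocal square root
rcp(x::Float64)   = 1.0 / x         # models a reciprocal

# Shape of the f64 rewrite described above: sqrt(x) becomes rcp(rsqrt(x)).
sqrt_via_rsqrt(x::Float64) = rcp(rsqrt(x))

# Compare against Base.sqrt for a few inputs, including the edge cases 0 and Inf.
for x in (0.0, 0.25, 2.0, 1.0e10, Inf)
    println(x, " => ", sqrt_via_rsqrt(x), " (Base.sqrt: ", sqrt(x), ")")
end
```

In this model the edge cases survive the round trip: an input of 0 passes through `Inf` and returns 0, and an input of `Inf` passes through 0 and returns `Inf`.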

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@maleadt changed the title from "PTX: lower @fastmath sqrt to NVPTX approx intrinsics." to "PTX: lower @fastmath sqrt to NVPTX approx intrinsics (for LLVM 18)" on May 20, 2026
@maleadt merged commit f7d7418 into main on May 20, 2026
36 of 37 checks passed
@maleadt deleted the tb/ptx_fast_sqrt branch on May 20, 2026 14:06
@codecov (bot) commented May 20, 2026

Codecov Report

❌ Patch coverage is 97.87234% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 75.95%. Comparing base (eded413) to head (afd4765).
⚠️ Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/ptx.jl | 97.87% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #805      +/-   ##
==========================================
+ Coverage   75.69%   75.95%   +0.25%     
==========================================
  Files          25       25              
  Lines        3983     4026      +43     
==========================================
+ Hits         3015     3058      +43     
  Misses        968      968              

maleadt added a commit to JuliaGPU/CUDA.jl that referenced this pull request May 21, 2026
Building on JuliaGPU/GPUCompiler.jl#805, JuliaGPU/GPUCompiler.jl#804, JuliaGPU/GPUCompiler.jl#800, avoid some of the uses of `libdevice`'s intrinsics, instead emitting vanilla LLVM IR and having GPUCompiler.jl post-process it into what we need in PTX. This has many advantages, including (potentially) better optimization, compatibility with LLVM tools like Enzyme, etc.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
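
To see the kind of vanilla LLVM IR this refers to, one can inspect what Julia emits on the host for `@fastmath sqrt`. This is a small sketch: `fast_sqrt` is a hypothetical example function, the printed IR is host IR rather than PTX output, and its exact form depends on the Julia and LLVM versions in use. On recent Julia versions it is expected to contain a call to `llvm.sqrt.f32` or `llvm.sqrt.f64` carrying fast-math flags, which is the pattern a pass like the one in JuliaGPU/GPUCompiler.jl#805 looks for.

```julia
using InteractiveUtils  # provides code_llvm outside the REPL

# Hypothetical example function; @fastmath swaps sqrt for its fast-math variant.
fast_sqrt(x) = @fastmath sqrt(x)

# Capture the host LLVM IR for Float32 and Float64 arguments as strings.
ir32 = sprint(io -> code_llvm(io, fast_sqrt, (Float32,)))
ir64 = sprint(io -> code_llvm(io, fast_sqrt, (Float64,)))

print(ir32)
print(ir64)
```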