Revert "Fix always_inline via inlining policy override" #797

Merged · maleadt merged 1 commit into master from revert-795-tb/always_inline · May 13, 2026
Conversation

@maleadt (Member) commented May 13, 2026

Reverts #795, which breaks the following MWE:

using CUDA
function mykernel(A)
    i = (Int(threadIdx().x) - 1) + (Int(blockIdx().x) - 1) * Int(blockDim().x)
    A[i+1] = i   # bounds-checked store ⇒ throw_boundserror reachable
    return
end
A = CuArray{Int}(undef, 16)
@cuda always_inline=true threads=8 mykernel(A)   # InvalidIRError
@cuda                  threads=8 mykernel(A)     # OK

InvalidIRError: ... resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to unsafe_store!)
Stacktrace:
 [1] setproperty!         @ CUDACore/src/device/runtime.jl:89
 [2] throw_boundserror    @ CUDACore/src/device/quirks.jl:13
 [3] throw_boundserror    @ CUDACore/src/device/quirks.jl:53
 [4] arrayset             @ CUDACore/src/device/array.jl:130
 [5] setindex!            @ CUDACore/src/device/array.jl:177
 [6] mykernel             @ ...

This surfaced with KernelAbstractions.jl's unittest_testsuite, which uses always_inline=true.

@aviatesk Could you take a look at this? I wanted to fix this downstream since JuliaLang/julia#48257 doesn't seem to be moving forwards, and we really need this in the GPU stack (in fact, cuTile.jl depends on the ability to inline everything).

Here's a description by Claude:

What's happening

CUDA's Base.setproperty!(::ExceptionInfo, sym::Symbol, value) (CUDACore/src/device/runtime.jl:79) has six Symbol-dispatched branches with different value types per branch (Int32, @NamedTuple{x,y,z::Int32}, Ptr{UInt8}). Its body is big enough that inlining_cost saturates to MAX_INLINE_COST=0xffff — verified directly:

m = first(methods(setproperty!, (CUDACore.ExceptionInfo, Symbol, Any)))
Base.uncompressed_ir(m).inlining_cost   # 0xffff

Before #795, the default policy rejected this saturated MaybeCompressed source (is_inlineable=false), so setproperty! stayed out-of-line and the standalone-compiled body — with type-incompatible branches correctly folded to Core.throw_methoderror calls by inference — validated fine.

After #795, the override force-inlines the generic MaybeCompressed body (slottypes [..., Symbol, Ptr{UInt8}]) into throw_boundserror. The full if/elseif chain survives:

%6  = builtin (:subtype === :status)::Bool       # not folded!
%17 = builtin (:subtype === :threadIdx)::Bool    # not folded!
... 12 such comparisons preserved as runtime checks

The dead :threadIdx/:blockIdx branches contain unsafe_store!(::Ptr{NamedTuple{...}}, ::Ptr{UInt8}) — type-incompatible dispatches that ssa_inlining + optimization can't resolve statically. They end up as ijl_apply_generic(unsafe_store!, ...) calls in the final IR, which validation rejects.

The const-prop'd specialized versions (cost=10, body folded to a single pointerset) are offered to the policy too, but the inliner picks the generic source for some calls. I couldn't track down exactly why in the time I spent on it.

Probable cause

Force-inlining a MaybeCompressed source whose inlining_cost saturated is unsafe when the body has dead branches that only fold under const-prop. Julia's inliner doesn't re-run the constant-folding that would have eliminated those branches during regular inference; ssa_inlining_pass! + compact! + ADCE don't fold === on Symbol literals after they've been inlined as builtin calls into a larger body.

The motivating case in #795 (philox2x_rounds-style: saturated from raw call count, no Symbol-dispatched branches) doesn't hit this — its inlined body has nothing to fold. cuTile uses the same policy shape and works for the same reason.
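The Symbol-dispatch pattern discussed above can be explored outside of CUDA with a minimal sketch in plain Julia. ToyInfo, toy_set!, set_status! and set_any! are hypothetical names invented for this illustration; they are not CUDA's ExceptionInfo code. The sketch does not reproduce the InvalidIRError or the policy override. It only lets one compare the optimized IR of a caller that passes a literal Symbol (where constant propagation can fold the branches) with a caller that passes a runtime Symbol (where every branch, including the type-incompatible one, has to survive).

```julia
using InteractiveUtils  # provides code_typed when run as a script

# Hypothetical stand-in for a Symbol-dispatched setter. This is NOT CUDA's
# ExceptionInfo code, only a toy with the same shape.
mutable struct ToyInfo
    status::Int32
    name::String
end

function toy_set!(info::ToyInfo, sym::Symbol, value)
    if sym === :status
        setfield!(info, :status, convert(Int32, value))
    elseif sym === :name
        # Type-incompatible for an Int32 `value`: there is no applicable
        # convert(String, ::Int32) method, so this branch can only throw.
        setfield!(info, :name, convert(String, value))
    else
        throw(ArgumentError("unknown field"))
    end
    return nothing
end

# The Symbol is a literal at the call site, so inference can constant-propagate it.
set_status!(info::ToyInfo) = toy_set!(info, :status, Int32(1))

# The Symbol is only known at run time, so every branch has to be kept.
set_any!(info::ToyInfo, sym::Symbol) = toy_set!(info, sym, Int32(1))

info = ToyInfo(Int32(0), "")
set_status!(info)

# Print the optimized IR of both callers for side-by-side comparison.
println(only(code_typed(set_status!, (ToyInfo,))))
println(only(code_typed(set_any!, (ToyInfo, Symbol))))
```

The exact IR depends on the Julia version, so no output is shown here. The point of comparison is whether the === checks on the Symbol and the call in the :name branch are still present in each printout.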

Sanity checks while debugging

  • Reverting just may_discard_trees: still fails.
  • Reverting just src_inlining_policy: passes. So the policy override is the trigger, not the tree-preservation change.
  • Matching cuTile's policy exactly (no @invoke fall-through, inline_cost_threshold=typemax(Int)): still fails. cuTile would fail on this CUDA pattern too; it just doesn't have one.
  • Restricting the override to non-ConstCallInfo calls: fixes the @cuda MWE but not the KA path (where the inliner offers both generic and const-prop'd sources and still picks the generic one).
  • Restricting the override to IRCode only: fixes both. But in the failing PTX compile the inliner never passes IRCode, so this is effectively a full revert.

@maleadt merged commit bff356c into master on May 13, 2026
31 of 37 checks passed
@maleadt deleted the revert-795-tb/always_inline branch on May 13, 2026, 15:15