EnzymeCoreExt: overlay autodiff/deferred_codegen for kernel codegen #3150

Open
gbaraldi wants to merge 2 commits into JuliaGPU:main from gbaraldi:enzyme-autodiff-overlay

@gbaraldi
Member

Companion to EnzymeAD/Enzyme.jl#3112 — addresses EnzymeAD/Enzyme.jl#3091.

Summary

Two new overlays in `CUDA.method_table` inside `ext/EnzymeCoreExt.jl`:

  1. `EnzymeCore.deferred_codegen(id::UInt) -> Ptr{Cvoid}`: the body holds the `ccall("extern deferred_codegen", llvmcall, …)` marker that GPUCompiler's scanner picks up to link the inner Enzyme adjoint into the kernel module. EnzymeCore declares `deferred_codegen` as a method-less generic function, so `jl_compile_all_defs` has no despecialized llvmcall body to compile into a host sysimage. CUDA's overlay supplies the body during kernel compilation; Enzyme.jl provides a matching overlay in `GPUCompiler.GLOBAL_METHOD_TABLE` for its own interpreter.

  2. `EnzymeCore.autodiff(::ReverseMode|::ForwardMode, …)` redirect: inside CUDA's `GPUInterpreter` this forwards to `autodiff_deferred`, so user kernels can write plain `Enzyme.autodiff(...)` and get the deferred-codegen path automatically. This removes the manual `autodiff` vs `autodiff_deferred` choice that KernelAbstractions's `gpu_fwd` makes today.

The overlays only use EnzymeCore types, so no new weakdep is needed.
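
For readers unfamiliar with method-table overlays, the redirect in item 2 can be sketched with Julia's `Base.Experimental.@MethodTable` and `Base.Experimental.@overlay`. This is a minimal illustration under stated assumptions, not the code in `ext/EnzymeCoreExt.jl`: `demo_method_table`, `demo_autodiff`, and `demo_autodiff_deferred` are hypothetical stand-ins for `CUDA.method_table`, `EnzymeCore.autodiff`, and `EnzymeCore.autodiff_deferred`, and their bodies only return tags.

```julia
# Hypothetical stand-ins; none of these names come from EnzymeCore or CUDA.jl.
Base.Experimental.@MethodTable demo_method_table

# Ordinary (host) definitions, visible to normal dispatch.
demo_autodiff(mode::Symbol, f, x) = (:eager, mode, f(x))
demo_autodiff_deferred(mode::Symbol, f, x) = (:deferred, mode, f(x))

# Overlay method: only a compiler/interpreter that consults
# `demo_method_table` sees it. It redirects the plain entry point
# to the deferred one.
Base.Experimental.@overlay demo_method_table function demo_autodiff(mode::Symbol, f, x)
    return demo_autodiff_deferred(mode, f, x)
end

# Host dispatch does not consult the overlay table,
# so an ordinary call still takes the eager path.
host_result = demo_autodiff(:reverse, abs2, 3)
```

The point of the design is that the redirect is scoped to one compilation context: host code calling the same function is unaffected, while code compiled through an interpreter that uses the overlay table picks up the replacement method.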

Test plan

  • CI green
  • Verified locally with a CUDA kernel calling plain `Enzyme.autodiff(Reverse, kernel, ...)` — gradient comes out correct
  • Existing `test/extensions/enzyme.jl` keeps passing

Two new overlays in `CUDA.method_table`:

1. `EnzymeCore.deferred_codegen(id::UInt) -> Ptr{Cvoid}` —
   The body holds the `ccall("extern deferred_codegen", llvmcall, …)`
   marker that GPUCompiler's scanner picks up to link the inner Enzyme
   adjoint into the kernel module. EnzymeCore declares
   `deferred_codegen` as a method-less generic function so that
   `jl_compile_all_defs` can't despecialize the llvmcall into a host
   sysimage. CUDA's overlay supplies the body during kernel
   compilation; Enzyme.jl provides a matching overlay in
   `GPUCompiler.GLOBAL_METHOD_TABLE` for its own interpreter.
   See EnzymeAD/Enzyme.jl#3091 and the companion PR.

2. `EnzymeCore.autodiff(::ReverseMode|::ForwardMode, …)` —
   Redirect to `autodiff_deferred` inside CUDA's GPUInterpreter so user
   kernels can call plain `Enzyme.autodiff(...)` and get the
   deferred-codegen path automatically, removing the manual
   `autodiff` vs `autodiff_deferred` choice that
   KernelAbstractions's `gpu_fwd` makes today.

Companion to EnzymeAD/Enzyme.jl#3112.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

@github-actions github-actions Bot left a comment

CUDA.jl Benchmarks

| Benchmark suite | Current: e874f57 | Previous: d96bb17 | Ratio |
|---|---|---|---|
| array/accumulate/Float32/1d | 101249 ns | 99804 ns | 1.01 |
| array/accumulate/Float32/dims=1 | 77191 ns | 74695 ns | 1.03 |
| array/accumulate/Float32/dims=1L | 1586424.5 ns | 1575172 ns | 1.01 |
| array/accumulate/Float32/dims=2 | 144038 ns | 140050 ns | 1.03 |
| array/accumulate/Float32/dims=2L | 658149 ns | 652746 ns | 1.01 |
| array/accumulate/Int64/1d | 118341 ns | 116667 ns | 1.01 |
| array/accumulate/Int64/dims=1 | 80145 ns | 78138 ns | 1.03 |
| array/accumulate/Int64/dims=1L | 1706952 ns | 1683352 ns | 1.01 |
| array/accumulate/Int64/dims=2 | 156969 ns | 151632 ns | 1.04 |
| array/accumulate/Int64/dims=2L | 962119 ns | 958723 ns | 1.00 |
| array/broadcast | 20427 ns | 19844 ns | 1.03 |
| array/construct | 1239 ns | 1172.2 ns | 1.06 |
| array/copy | 17978 ns | 16637 ns | 1.08 |
| array/copyto!/cpu_to_gpu | 215916 ns | 213824 ns | 1.01 |
| array/copyto!/gpu_to_cpu | 284942 ns | 278790 ns | 1.02 |
| array/copyto!/gpu_to_gpu | 10826 ns | 10365 ns | 1.04 |
| array/iteration/findall/bool | 134850 ns | 130581 ns | 1.03 |
| array/iteration/findall/int | 148813.5 ns | 144724 ns | 1.03 |
| array/iteration/findfirst/bool | 82415.5 ns | 79128 ns | 1.04 |
| array/iteration/findfirst/int | 84072.5 ns | 80847 ns | 1.04 |
| array/iteration/findmin/1d | 82650 ns | 66285 ns | 1.25 |
| array/iteration/findmin/2d | 114182 ns | 101434 ns | 1.13 |
| array/iteration/logical | 201744 ns | 188254 ns | 1.07 |
| array/iteration/scalar | 66299 ns | 64586 ns | 1.03 |
| array/permutedims/2d | 52586 ns | 49748 ns | 1.06 |
| array/permutedims/3d | 52731 ns | 50091 ns | 1.05 |
| array/permutedims/4d | 53030.5 ns | 49706 ns | 1.07 |
| array/random/rand/Float32 | 12572 ns | 12116 ns | 1.04 |
| array/random/rand/Int64 | 36650 ns | 23371 ns | 1.57 |
| array/random/rand!/Float32 | 8443.333333333334 ns | 7960.333333333333 ns | 1.06 |
| array/random/rand!/Int64 | 34121 ns | 20391 ns | 1.67 |
| array/random/randn/Float32 | 37407 ns | 34710 ns | 1.08 |
| array/random/randn!/Float32 | 30936 ns | 23894 ns | 1.29 |
| array/reductions/mapreduce/Float32/1d | 33335.5 ns | 32647 ns | 1.02 |
| array/reductions/mapreduce/Float32/dims=1 | 39372 ns | 37819 ns | 1.04 |
| array/reductions/mapreduce/Float32/dims=1L | 51253 ns | 50203 ns | 1.02 |
| array/reductions/mapreduce/Float32/dims=2 | 56414 ns | 55479 ns | 1.02 |
| array/reductions/mapreduce/Float32/dims=2L | 69109 ns | 67023 ns | 1.03 |
| array/reductions/mapreduce/Int64/1d | 42869 ns | 39410 ns | 1.09 |
| array/reductions/mapreduce/Int64/dims=1 | 51434 ns | 40857 ns | 1.26 |
| array/reductions/mapreduce/Int64/dims=1L | 87004 ns | 86396 ns | 1.01 |
| array/reductions/mapreduce/Int64/dims=2 | 59396.5 ns | 57852 ns | 1.03 |
| array/reductions/mapreduce/Int64/dims=2L | 84511 ns | 82413 ns | 1.03 |
| array/reductions/reduce/Float32/1d | 34084 ns | 32758 ns | 1.04 |
| array/reductions/reduce/Float32/dims=1 | 48998 ns | 38011 ns | 1.29 |
| array/reductions/reduce/Float32/dims=1L | 51206 ns | 50349 ns | 1.02 |
| array/reductions/reduce/Float32/dims=2 | 56383 ns | 55642 ns | 1.01 |
| array/reductions/reduce/Float32/dims=2L | 69066 ns | 67383 ns | 1.02 |
| array/reductions/reduce/Int64/1d | 42820 ns | 39537 ns | 1.08 |
| array/reductions/reduce/Int64/dims=1 | 41986 ns | 40656 ns | 1.03 |
| array/reductions/reduce/Int64/dims=1L | 86968 ns | 86280 ns | 1.01 |
| array/reductions/reduce/Int64/dims=2 | 59257 ns | 57865 ns | 1.02 |
| array/reductions/reduce/Int64/dims=2L | 83898 ns | 82279 ns | 1.02 |
| array/reverse/1d | 17704 ns | 16600 ns | 1.07 |
| array/reverse/1dL | 68278 ns | 67647 ns | 1.01 |
| array/reverse/1dL_inplace | 65712 ns | 65318 ns | 1.01 |
| array/reverse/1d_inplace | 10193.5 ns | 8340.333333333334 ns | 1.22 |
| array/reverse/2d | 21246 ns | 19869 ns | 1.07 |
| array/reverse/2dL | 73393.5 ns | 71763 ns | 1.02 |
| array/reverse/2dL_inplace | 65729 ns | 65099 ns | 1.01 |
| array/reverse/2d_inplace | 10213 ns | 9659 ns | 1.06 |
| array/sorting/1d | 2735901 ns | 2713733 ns | 1.01 |
| array/sorting/2d | 1068724.5 ns | 1062891 ns | 1.01 |
| array/sorting/by | 3303971 ns | 3280831 ns | 1.01 |
| cuda/synchronization/context/auto | 1154.9 ns | 1138.4 ns | 1.01 |
| cuda/synchronization/context/blocking | 938.974358974359 ns | 962.5652173913044 ns | 0.98 |
| cuda/synchronization/context/nonblocking | 7578.6 ns | 5944.2 ns | 1.27 |
| cuda/synchronization/stream/auto | 1013.3 ns | 997.8333333333334 ns | 1.02 |
| cuda/synchronization/stream/blocking | 824.1359223300971 ns | 821.8235294117648 ns | 1.00 |
| cuda/synchronization/stream/nonblocking | 7164.4 ns | 5912.4 ns | 1.21 |
| integration/byval/reference | 143670 ns | 143240 ns | 1.00 |
| integration/byval/slices=1 | 145773 ns | 145261 ns | 1.00 |
| integration/byval/slices=2 | 284454 ns | 283622 ns | 1.00 |
| integration/byval/slices=3 | 423151 ns | 421946 ns | 1.00 |
| integration/cudadevrt | 102401 ns | 101870 ns | 1.01 |
| integration/volumerhs | 9960003 ns | 9906591 ns | 1.01 |
| kernel/indexing | 13275 ns | 12574 ns | 1.06 |
| kernel/indexing_checked | 14035 ns | 13327 ns | 1.05 |
| kernel/launch | 2156.1111111111113 ns | 2037.111111111111 ns | 1.06 |
| kernel/occupancy | 677.404458598726 ns | 696.7364864864865 ns | 0.97 |
| kernel/rand | 18050.5 ns | 14332 ns | 1.26 |
| latency/import | 3548878692 ns | 3854422204 ns | 0.92 |
| latency/precompile | 45248138979 ns | 4627097887 ns | 9.78 |
| latency/ttfp | 13051637302 ns | 4514039137 ns | 2.89 |

This comment was automatically generated by workflow using github-action-benchmark.
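
As a reading aid for the benchmark table: the Ratio column appears to be Current divided by Previous, rounded to two decimals, so values above 1 mean the current commit measured slower. The sketch below checks that assumption against the three latency rows; the `ratio` helper and the `rows` list are illustrative names, not part of the benchmark workflow.

```julia
# (name, current_ns, previous_ns) copied from the latency rows of the table.
rows = [
    ("latency/import",      3548878692, 3854422204),
    ("latency/precompile", 45248138979, 4627097887),
    ("latency/ttfp",       13051637302, 4514039137),
]

# Assumed definition of the Ratio column: current / previous, two decimals.
ratio(current, previous) = round(current / previous; digits = 2)

for (name, current, previous) in rows
    println(rpad(name, 22), ratio(current, previous))
end
```

Under that reading, the `latency/precompile` and `latency/ttfp` rows are the ones that stand out in this run.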

gbaraldi added a commit to EnzymeAD/Enzyme.jl that referenced this pull request May 21, 2026
`Compiler.deferred_codegen` is the function whose body contains
`ccall("extern deferred_codegen", llvmcall, …)`, which is not valid for
normal Julia codegen — `jl_compile_all_defs` enqueues the despecialised
body during a sysimage build and that fails (#3091).

Split it into:
  * `EnzymeCore.deferred_codegen(id::UInt) -> Ptr{Cvoid}` — a 1-arg shell
    declared in EnzymeCore. Its default body is just
    `reinterpret(Ptr{Cvoid}, id)`. `id` is already the OrcV2 trampoline
    pointer allocated by `Compiler.deferred_id_codegen`, so this default
    matches the runtime behaviour of the existing `extern deferred_codegen`
    symbol (which is also identity, registered via
    `@cfunction(deferred_codegen, …)`). Direct CPU calls to
    `autodiff_deferred` continue to work via this trampoline path.
  * `Enzyme.Compiler.deferred_codegen(fa, a, tt, …)` — the existing 12-arg
    method, now restructured so its body computes `id` via
    `deferred_id_codegen` and delegates to `EnzymeCore.deferred_codegen(id)`.
    No more llvmcall in this body.
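
The shape of this split can be sketched as follows. It is a simplified illustration under stated assumptions, not the actual Enzyme.jl code: `CoreShim` and `CompilerShim` are hypothetical stand-ins for EnzymeCore and `Enzyme.Compiler`, and `fake_deferred_id` is a placeholder for `deferred_id_codegen` that returns a fixed fake address rather than a real trampoline pointer.

```julia
# Hypothetical stand-in for EnzymeCore: owns the 1-arg shell
# with a safe default body (a plain reinterpret, no llvmcall).
module CoreShim
    deferred_codegen(id::UInt) = reinterpret(Ptr{Cvoid}, id)
end

# Hypothetical stand-in for Enzyme.Compiler.
module CompilerShim
    import ..CoreShim

    # Placeholder for `deferred_id_codegen`: a fixed fake address,
    # not a real OrcV2 trampoline pointer.
    fake_deferred_id(args...) = UInt(0x1000)

    # Wrapper: compute the id, then delegate to the shell.
    # Nothing in this body needs an extern symbol.
    deferred_codegen(args...) = CoreShim.deferred_codegen(fake_deferred_id(args...))
end

ptr = CompilerShim.deferred_codegen(:f, :activity, :argtypes)
```

With this layout, anything that compiles the wrapper or the shell outside an overlay context only ever sees ordinary Julia code.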

The `extern deferred_codegen` ccall marker that GPUCompiler's scanner needs
during deferred compilation lives in two method-table overlays of
`EnzymeCore.deferred_codegen(::UInt)`:

  * `GPUCompiler.GLOBAL_METHOD_TABLE` (in `src/compiler.jl`) — consulted by
    Enzyme's own `EnzymeInterpreter`, so higher-order CPU AD (e.g. forward-
    over-reverse from `test/basic.jl:660` "Nested Type Error" or
    `test/advanced.jl:1469` "Hessian") emits the deferred marker correctly.

  * `CUDA.method_table` (in JuliaGPU/CUDA.jl#3150) — consulted by CUDA's
    `GPUInterpreter`, so GPU kernels using `autodiff_deferred` — or, via a
    matching `autodiff → autodiff_deferred` redirect overlay added in the
    same CUDA-side extension, plain `Enzyme.autodiff(...)` — emit the
    marker into the GPU IR for GPUCompiler's scanner to resolve.
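
A minimal sketch of one function carrying a default body plus an overlay in two separate method tables. Everything here is hypothetical: `demo_global_table` and `demo_cuda_table` stand in for `GPUCompiler.GLOBAL_METHOD_TABLE` and `CUDA.method_table`, `demo_marker` stands in for the 1-arg shell, and the overlay bodies return `C_NULL` as a placeholder where the real overlays hold the ccall marker.

```julia
# Two independent overlay tables (hypothetical names).
Base.Experimental.@MethodTable demo_global_table
Base.Experimental.@MethodTable demo_cuda_table

# Default body: what ordinary Julia codegen sees.
demo_marker(id::UInt) = reinterpret(Ptr{Cvoid}, id)

# Overlay for the first table; placeholder body only.
Base.Experimental.@overlay demo_global_table function demo_marker(id::UInt)
    return C_NULL
end

# Overlay for the second table; same signature, separate table.
Base.Experimental.@overlay demo_cuda_table function demo_marker(id::UInt)
    return C_NULL
end

# Ordinary dispatch ignores both overlay tables and uses the default body.
default_ptr = demo_marker(UInt(42))
```

Each table is consulted only by the interpreter configured to use it, which is why the same signature can be overlaid twice without conflict.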

Overlay methods have `external_mt` set, so `compile_all_collect__`
(`precompile_utils.c:147`) skips them. The despecialised body that
`jl_compile_all_defs` sees during a sysimage build is the safe
`reinterpret(Ptr{Cvoid}, id)` stub plus the regular Julia call wrapper in
`Compiler.deferred_codegen` — no extern symbol to resolve. Verified locally
with `PackageCompiler.create_sysimage(["Enzyme"]; …)`: the sysimage no
longer references `deferred_codegen` in its undefined symbols.

The companion CUDA-side PR also overlays
`EnzymeCore.autodiff(::ReverseMode|::ForwardMode, …)` to forward to
`autodiff_deferred` inside CUDA codegen, so user kernels can write plain
`Enzyme.autodiff(...)` and get the deferred-codegen path automatically —
removing the manual `autodiff` vs `autodiff_deferred` choice that
KernelAbstractions's `gpu_fwd` makes today.

Companion change in CUDA.jl: JuliaGPU/CUDA.jl#3150.

Fixes #3091.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gbaraldi added a commit to EnzymeAD/Enzyme.jl that referenced this pull request May 21, 2026
`Compiler.deferred_codegen` is the function whose body contains
`ccall("extern deferred_codegen", llvmcall, …)`, which is not valid for
normal Julia codegen — `jl_compile_all_defs` enqueues the despecialised
body during a sysimage build and that fails (#3091).

Give it a safe default body that returns `reinterpret(Ptr{Cvoid}, id)`,
matching the runtime behaviour of the registered `extern deferred_codegen`
symbol (which is itself identity, see `register_deferred_codegen`). Direct
CPU calls to `autodiff_deferred` continue to work via the resulting
OrcV2 trampoline pointer.

The actual ccall marker lives in a
`Base.Experimental.@overlay GPUCompiler.GLOBAL_METHOD_TABLE` definition
with the same 12-arg signature. `EnzymeInterpreter` (which Enzyme jobs use
by default) consults this overlay, so higher-order CPU AD (e.g. forward-
over-reverse from `test/basic.jl:660` "Nested Type Error" or
`test/advanced.jl:1469` "Hessian") still emits the deferred-codegen marker
into the parent module's IR for GPUCompiler's scanner to resolve.

Overlay methods have `external_mt` set, so `compile_all_collect__`
(`precompile_utils.c:147`) skips them. The despecialised body that
`jl_compile_all_defs` sees during a sysimage build is just the safe
`reinterpret` — no extern symbol to resolve. Verified locally:
`PackageCompiler.create_sysimage(["Enzyme"]; …)` succeeds and the resulting
sysimage no longer references `deferred_codegen` in its undefined symbols.

The companion JuliaGPU/CUDA.jl#3150 adds:
  * An `@overlay CUDA.method_table` for `EnzymeCore.autodiff(...)` to
    forward to `autodiff_deferred` inside CUDA codegen — letting user
    kernels write plain `Enzyme.autodiff(...)` and get the deferred path.
  * A matching `@overlay CUDA.method_table` for `Compiler.deferred_codegen`
    so the marker is emitted into the GPU IR.

Fixes #3091.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gbaraldi force-pushed the enzyme-autodiff-overlay branch from 9a9a252 to 85b1681 on May 21, 2026 21:47
gbaraldi added a commit to EnzymeAD/Enzyme.jl that referenced this pull request May 21, 2026
The Enzyme-side overlay attempt at fixing #3091 regressed Julia 1.10
higher-order CPU AD (Enzyme's pass couldn't trace through the
non-inlined overlay function to recognise the indirect call as
active). Reverted; #3091 is left for a structural follow-up (likely
via JuliaGPU/GPUCompiler.jl#582 + #799 or a `@generated` variant).

This PR now only carries the buildkite cross-PR CI link, paired with
JuliaGPU/CUDA.jl#3150, which adds an `@overlay CUDA.method_table` on
`EnzymeCore.autodiff(::ReverseMode|::ForwardMode, …)` to redirect to
`autodiff_deferred` inside CUDA codegen — so user kernels can call
plain `Enzyme.autodiff(...)` and get the deferred-codegen path
automatically (removing the manual `autodiff` vs `autodiff_deferred`
choice that KernelAbstractions's `gpu_fwd` makes today).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gbaraldi added a commit to EnzymeAD/Enzyme.jl that referenced this pull request May 22, 2026
…e weakdep

Previously the 1-arg `_deferred_marker(id::UInt) -> Ptr{Cvoid}` helper lived in `Enzyme.Compiler`. That meant CUDA's matching `@overlay CUDA.method_table` overlay would need to reach into Enzyme.Compiler — forcing CUDA.jl's EnzymeCoreExt to weak-depend on Enzyme.

Move the helper to EnzymeCore (with the default `reinterpret(Ptr{Cvoid}, id)` body) so CUDA's overlay can stay EnzymeCore-only. The actual `ccall("extern deferred_codegen", llvmcall, …)` body still lives in two method-table overlays:

  * `GPUCompiler.GLOBAL_METHOD_TABLE` (in src/compiler.jl) — consulted by Enzyme's own EnzymeInterpreter for higher-order CPU AD.
  * `CUDA.method_table` (in CUDA.jl/ext/EnzymeCoreExt.jl, see JuliaGPU/CUDA.jl#3150) — consulted by CUDA's GPUInterpreter during kernel compilation.

Both overlays reference `EnzymeCore._deferred_marker(::UInt)` directly — no Enzyme weakdep on the CUDA side.

`Enzyme.Compiler.deferred_codegen` (12-arg wrapper) now delegates to `EnzymeCore._deferred_marker(id)` after computing the id via the generated `deferred_id_codegen`. Its body contains no llvmcall, so its despecialised form is safe for sysimage builds (fixes #3091).

Local verification on Julia 1.12.6: CPU autodiff_deferred, higher-order CPU AD (Hessian, nested HVP), and a CUDA kernel calling plain `Enzyme.autodiff(...)` all pass; `PackageCompiler.create_sysimage(["Enzyme"]; ...)` succeeds and the sysimage no longer references `deferred_codegen` in its undefined symbols.

The Julia 1.10 guards from the previous commit (skipping forward-over-reverse via `autodiff_deferred` in inner functions) remain — that path still trips on 1.10's inlining behaviour even with this refactor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gbaraldi force-pushed the enzyme-autodiff-overlay branch from 85b1681 to ccce51b on May 22, 2026 12:41
Two overlays in CUDA.method_table that route deferred-codegen through CUDA's GPUInterpreter during kernel compilation:

1. `EnzymeCore._deferred_marker(id::UInt)` — its body holds the `extern deferred_codegen` ccall marker that GPUCompiler's scanner picks up to link the inner Enzyme adjoint into the kernel module. The function is declared in EnzymeCore with a safe default (`reinterpret(Ptr{Cvoid}, id)`); the overlay supplies the ccall during CUDA compilation. Sister overlay in `GPUCompiler.GLOBAL_METHOD_TABLE` from Enzyme.jl covers higher-order CPU AD.

2. `EnzymeCore.autodiff(::ReverseMode|::ForwardMode, ...)` — redirect to `autodiff_deferred` so user kernels can call plain `Enzyme.autodiff(...)` and get the deferred-codegen path automatically — removing the manual `autodiff` vs `autodiff_deferred` choice that KernelAbstractions's `gpu_fwd` makes today.

Companion to EnzymeAD/Enzyme.jl#3112.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gbaraldi force-pushed the enzyme-autodiff-overlay branch from ccce51b to e874f57 on May 22, 2026 13:02