EnzymeCoreExt: overlay autodiff/deferred_codegen for kernel codegen #3150
Open
gbaraldi wants to merge 2 commits into
Two new overlays in `CUDA.method_table`:
1. `EnzymeCore.deferred_codegen(id::UInt) -> Ptr{Cvoid}` —
The body holds the `ccall("extern deferred_codegen", llvmcall, …)`
marker that GPUCompiler's scanner picks up to link the inner Enzyme
adjoint into the kernel module. EnzymeCore declares
`deferred_codegen` as a method-less generic function so that
`jl_compile_all_defs` can't despecialize the llvmcall into a host
sysimage. CUDA's overlay supplies the body during kernel
compilation; Enzyme.jl provides a matching overlay in
`GPUCompiler.GLOBAL_METHOD_TABLE` for its own interpreter.
See EnzymeAD/Enzyme.jl#3091 and the companion PR.
2. `EnzymeCore.autodiff(::ReverseMode|::ForwardMode, …)` —
Redirect to `autodiff_deferred` inside CUDA's GPUInterpreter so user
kernels can call plain `Enzyme.autodiff(...)` and get the
deferred-codegen path automatically, removing the manual
`autodiff` vs `autodiff_deferred` choice that
KernelAbstractions's `gpu_fwd` makes today.
Companion to EnzymeAD/Enzyme.jl#3112.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CUDA.jl Benchmarks
| Benchmark suite | Current: e874f57 | Previous: d96bb17 | Ratio |
|---|---|---|---|
| array/accumulate/Float32/1d | 101249 ns | 99804 ns | 1.01 |
| array/accumulate/Float32/dims=1 | 77191 ns | 74695 ns | 1.03 |
| array/accumulate/Float32/dims=1L | 1586424.5 ns | 1575172 ns | 1.01 |
| array/accumulate/Float32/dims=2 | 144038 ns | 140050 ns | 1.03 |
| array/accumulate/Float32/dims=2L | 658149 ns | 652746 ns | 1.01 |
| array/accumulate/Int64/1d | 118341 ns | 116667 ns | 1.01 |
| array/accumulate/Int64/dims=1 | 80145 ns | 78138 ns | 1.03 |
| array/accumulate/Int64/dims=1L | 1706952 ns | 1683352 ns | 1.01 |
| array/accumulate/Int64/dims=2 | 156969 ns | 151632 ns | 1.04 |
| array/accumulate/Int64/dims=2L | 962119 ns | 958723 ns | 1.00 |
| array/broadcast | 20427 ns | 19844 ns | 1.03 |
| array/construct | 1239 ns | 1172.2 ns | 1.06 |
| array/copy | 17978 ns | 16637 ns | 1.08 |
| array/copyto!/cpu_to_gpu | 215916 ns | 213824 ns | 1.01 |
| array/copyto!/gpu_to_cpu | 284942 ns | 278790 ns | 1.02 |
| array/copyto!/gpu_to_gpu | 10826 ns | 10365 ns | 1.04 |
| array/iteration/findall/bool | 134850 ns | 130581 ns | 1.03 |
| array/iteration/findall/int | 148813.5 ns | 144724 ns | 1.03 |
| array/iteration/findfirst/bool | 82415.5 ns | 79128 ns | 1.04 |
| array/iteration/findfirst/int | 84072.5 ns | 80847 ns | 1.04 |
| array/iteration/findmin/1d | 82650 ns | 66285 ns | 1.25 |
| array/iteration/findmin/2d | 114182 ns | 101434 ns | 1.13 |
| array/iteration/logical | 201744 ns | 188254 ns | 1.07 |
| array/iteration/scalar | 66299 ns | 64586 ns | 1.03 |
| array/permutedims/2d | 52586 ns | 49748 ns | 1.06 |
| array/permutedims/3d | 52731 ns | 50091 ns | 1.05 |
| array/permutedims/4d | 53030.5 ns | 49706 ns | 1.07 |
| array/random/rand/Float32 | 12572 ns | 12116 ns | 1.04 |
| array/random/rand/Int64 | 36650 ns | 23371 ns | 1.57 |
| array/random/rand!/Float32 | 8443.333333333334 ns | 7960.333333333333 ns | 1.06 |
| array/random/rand!/Int64 | 34121 ns | 20391 ns | 1.67 |
| array/random/randn/Float32 | 37407 ns | 34710 ns | 1.08 |
| array/random/randn!/Float32 | 30936 ns | 23894 ns | 1.29 |
| array/reductions/mapreduce/Float32/1d | 33335.5 ns | 32647 ns | 1.02 |
| array/reductions/mapreduce/Float32/dims=1 | 39372 ns | 37819 ns | 1.04 |
| array/reductions/mapreduce/Float32/dims=1L | 51253 ns | 50203 ns | 1.02 |
| array/reductions/mapreduce/Float32/dims=2 | 56414 ns | 55479 ns | 1.02 |
| array/reductions/mapreduce/Float32/dims=2L | 69109 ns | 67023 ns | 1.03 |
| array/reductions/mapreduce/Int64/1d | 42869 ns | 39410 ns | 1.09 |
| array/reductions/mapreduce/Int64/dims=1 | 51434 ns | 40857 ns | 1.26 |
| array/reductions/mapreduce/Int64/dims=1L | 87004 ns | 86396 ns | 1.01 |
| array/reductions/mapreduce/Int64/dims=2 | 59396.5 ns | 57852 ns | 1.03 |
| array/reductions/mapreduce/Int64/dims=2L | 84511 ns | 82413 ns | 1.03 |
| array/reductions/reduce/Float32/1d | 34084 ns | 32758 ns | 1.04 |
| array/reductions/reduce/Float32/dims=1 | 48998 ns | 38011 ns | 1.29 |
| array/reductions/reduce/Float32/dims=1L | 51206 ns | 50349 ns | 1.02 |
| array/reductions/reduce/Float32/dims=2 | 56383 ns | 55642 ns | 1.01 |
| array/reductions/reduce/Float32/dims=2L | 69066 ns | 67383 ns | 1.02 |
| array/reductions/reduce/Int64/1d | 42820 ns | 39537 ns | 1.08 |
| array/reductions/reduce/Int64/dims=1 | 41986 ns | 40656 ns | 1.03 |
| array/reductions/reduce/Int64/dims=1L | 86968 ns | 86280 ns | 1.01 |
| array/reductions/reduce/Int64/dims=2 | 59257 ns | 57865 ns | 1.02 |
| array/reductions/reduce/Int64/dims=2L | 83898 ns | 82279 ns | 1.02 |
| array/reverse/1d | 17704 ns | 16600 ns | 1.07 |
| array/reverse/1dL | 68278 ns | 67647 ns | 1.01 |
| array/reverse/1dL_inplace | 65712 ns | 65318 ns | 1.01 |
| array/reverse/1d_inplace | 10193.5 ns | 8340.333333333334 ns | 1.22 |
| array/reverse/2d | 21246 ns | 19869 ns | 1.07 |
| array/reverse/2dL | 73393.5 ns | 71763 ns | 1.02 |
| array/reverse/2dL_inplace | 65729 ns | 65099 ns | 1.01 |
| array/reverse/2d_inplace | 10213 ns | 9659 ns | 1.06 |
| array/sorting/1d | 2735901 ns | 2713733 ns | 1.01 |
| array/sorting/2d | 1068724.5 ns | 1062891 ns | 1.01 |
| array/sorting/by | 3303971 ns | 3280831 ns | 1.01 |
| cuda/synchronization/context/auto | 1154.9 ns | 1138.4 ns | 1.01 |
| cuda/synchronization/context/blocking | 938.974358974359 ns | 962.5652173913044 ns | 0.98 |
| cuda/synchronization/context/nonblocking | 7578.6 ns | 5944.2 ns | 1.27 |
| cuda/synchronization/stream/auto | 1013.3 ns | 997.8333333333334 ns | 1.02 |
| cuda/synchronization/stream/blocking | 824.1359223300971 ns | 821.8235294117648 ns | 1.00 |
| cuda/synchronization/stream/nonblocking | 7164.4 ns | 5912.4 ns | 1.21 |
| integration/byval/reference | 143670 ns | 143240 ns | 1.00 |
| integration/byval/slices=1 | 145773 ns | 145261 ns | 1.00 |
| integration/byval/slices=2 | 284454 ns | 283622 ns | 1.00 |
| integration/byval/slices=3 | 423151 ns | 421946 ns | 1.00 |
| integration/cudadevrt | 102401 ns | 101870 ns | 1.01 |
| integration/volumerhs | 9960003 ns | 9906591 ns | 1.01 |
| kernel/indexing | 13275 ns | 12574 ns | 1.06 |
| kernel/indexing_checked | 14035 ns | 13327 ns | 1.05 |
| kernel/launch | 2156.1111111111113 ns | 2037.111111111111 ns | 1.06 |
| kernel/occupancy | 677.404458598726 ns | 696.7364864864865 ns | 0.97 |
| kernel/rand | 18050.5 ns | 14332 ns | 1.26 |
| latency/import | 3548878692 ns | 3854422204 ns | 0.92 |
| latency/precompile | 45248138979 ns | 4627097887 ns | 9.78 |
| latency/ttfp | 13051637302 ns | 4514039137 ns | 2.89 |
This comment was automatically generated by workflow using github-action-benchmark.
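For readers skimming the numbers: the Ratio column appears to be the current timing divided by the previous one, rounded to two decimals, so values above 1 mean the current commit was slower. The sketch below recomputes a few rows copied from the comment; the `threshold` cutoff is a hypothetical value chosen for illustration, not something the benchmark workflow defines.

```julia
# Timings in nanoseconds, copied from the benchmark comment: name => (current, previous).
timings = [
    "array/random/rand/Int64" => (36650.0, 23371.0),
    "array/random/rand!/Int64" => (34121.0, 20391.0),
    "latency/precompile" => (45248138979.0, 4627097887.0),
    "kernel/occupancy" => (677.404458598726, 696.7364864864865),
]

# Ratio as the table appears to define it: current divided by previous.
ratio(current, previous) = round(current / previous; digits=2)

# Hypothetical cutoff for "worth a second look"; not taken from the workflow.
threshold = 1.5

ratios = [name => ratio(cur, prev) for (name, (cur, prev)) in timings]
flagged = [name for (name, r) in ratios if r > threshold]

for (name, r) in ratios
    println(rpad(name, 28), r, r > threshold ? "  <- above cutoff" : "")
end
```

With this cutoff the two `Int64` random benchmarks and `latency/precompile` are flagged, while `kernel/occupancy` (a ratio below 1) is not.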
gbaraldi added a commit to EnzymeAD/Enzyme.jl that referenced this pull request May 21, 2026
`Compiler.deferred_codegen` is the function whose body contains
`ccall("extern deferred_codegen", llvmcall, …)`, which is not valid for
normal Julia codegen — `jl_compile_all_defs` enqueues the despecialised
body during a sysimage build and that fails (#3091).
Split it into:
* `EnzymeCore.deferred_codegen(id::UInt) -> Ptr{Cvoid}` — a 1-arg shell
declared in EnzymeCore. Its default body is just
`reinterpret(Ptr{Cvoid}, id)`. `id` is already the OrcV2 trampoline
pointer allocated by `Compiler.deferred_id_codegen`, so this default
matches the runtime behaviour of the existing `extern deferred_codegen`
symbol (which is also identity, registered via
`@cfunction(deferred_codegen, …)`). Direct CPU calls to
`autodiff_deferred` continue to work via this trampoline path.
* `Enzyme.Compiler.deferred_codegen(fa, a, tt, …)` — the existing 12-arg
method, now restructured so its body computes `id` via
`deferred_id_codegen` and delegates to `EnzymeCore.deferred_codegen(id)`.
No more llvmcall in this body.
The `extern deferred_codegen` ccall marker that GPUCompiler's scanner needs
during deferred compilation lives in two method-table overlays of
`EnzymeCore.deferred_codegen(::UInt)`:
* `GPUCompiler.GLOBAL_METHOD_TABLE` (in `src/compiler.jl`) — consulted by
Enzyme's own `EnzymeInterpreter`, so higher-order CPU AD (e.g. forward-
over-reverse from `test/basic.jl:660` "Nested Type Error" or
`test/advanced.jl:1469` "Hessian") emits the deferred marker correctly.
* `CUDA.method_table` (in JuliaGPU/CUDA.jl#3150) — consulted by CUDA's
`GPUInterpreter`, so GPU kernels using `autodiff_deferred` — or, via a
matching `autodiff → autodiff_deferred` redirect overlay added in the
same CUDA-side extension, plain `Enzyme.autodiff(...)` — emit the
marker into the GPU IR for GPUCompiler's scanner to resolve.
Overlay methods have `external_mt` set, so `compile_all_collect__`
(`precompile_utils.c:147`) skips them. The despecialised body that
`jl_compile_all_defs` sees during a sysimage build is the safe
`reinterpret(Ptr{Cvoid}, id)` stub plus the regular Julia call wrapper in
`Compiler.deferred_codegen` — no extern symbol to resolve. Verified locally
with `PackageCompiler.create_sysimage(["Enzyme"]; …)`: the sysimage no
longer references `deferred_codegen` in its undefined symbols.
The companion CUDA-side PR also overlays
`EnzymeCore.autodiff(::ReverseMode|::ForwardMode, …)` to forward to
`autodiff_deferred` inside CUDA codegen, so user kernels can write plain
`Enzyme.autodiff(...)` and get the deferred-codegen path automatically —
removing the manual `autodiff` vs `autodiff_deferred` choice that
KernelAbstractions's `gpu_fwd` makes today.
Companion change in CUDA.jl: JuliaGPU/CUDA.jl#3150.
Fixes #3091.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gbaraldi added a commit to EnzymeAD/Enzyme.jl that referenced this pull request May 21, 2026
`Compiler.deferred_codegen` is the function whose body contains
`ccall("extern deferred_codegen", llvmcall, …)`, which is not valid for
normal Julia codegen — `jl_compile_all_defs` enqueues the despecialised
body during a sysimage build and that fails (#3091).
Give it a safe default body that returns `reinterpret(Ptr{Cvoid}, id)`,
matching the runtime behaviour of the registered `extern deferred_codegen`
symbol (which is itself identity, see `register_deferred_codegen`). Direct
CPU calls to `autodiff_deferred` continue to work via the resulting
OrcV2 trampoline pointer.
The actual ccall marker lives in a
`Base.Experimental.@overlay GPUCompiler.GLOBAL_METHOD_TABLE` definition
with the same 12-arg signature. `EnzymeInterpreter` (which Enzyme jobs use
by default) consults this overlay, so higher-order CPU AD (e.g. forward-
over-reverse from `test/basic.jl:660` "Nested Type Error" or
`test/advanced.jl:1469` "Hessian") still emits the deferred-codegen marker
into the parent module's IR for GPUCompiler's scanner to resolve.
Overlay methods have `external_mt` set, so `compile_all_collect__`
(`precompile_utils.c:147`) skips them. The despecialised body that
`jl_compile_all_defs` sees during a sysimage build is just the safe
`reinterpret` — no extern symbol to resolve. Verified locally:
`PackageCompiler.create_sysimage(["Enzyme"]; …)` succeeds and the resulting
sysimage no longer references `deferred_codegen` in its undefined symbols.
The companion JuliaGPU/CUDA.jl#3150 adds:
* An `@overlay CUDA.method_table` for `EnzymeCore.autodiff(...)` to
forward to `autodiff_deferred` inside CUDA codegen — letting user
kernels write plain `Enzyme.autodiff(...)` and get the deferred path.
* A matching `@overlay CUDA.method_table` for `Compiler.deferred_codegen`
so the marker is emitted into the GPU IR.
Fixes #3091.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
9a9a252 to 85b1681
gbaraldi added a commit to EnzymeAD/Enzyme.jl that referenced this pull request May 21, 2026
The Enzyme-side overlay attempt at fixing #3091 regressed Julia 1.10 higher-order CPU AD (Enzyme's pass couldn't trace through the non-inlined overlay function to recognise the indirect call as active). Reverted; #3091 is left for a structural follow-up (likely via JuliaGPU/GPUCompiler.jl#582 + #799 or a `@generated` variant).
This PR now only carries the buildkite cross-PR CI link, paired with JuliaGPU/CUDA.jl#3150, which adds an `@overlay CUDA.method_table` on `EnzymeCore.autodiff(::ReverseMode|::ForwardMode, …)` to redirect to `autodiff_deferred` inside CUDA codegen, so user kernels can call plain `Enzyme.autodiff(...)` and get the deferred-codegen path automatically (removing the manual `autodiff` vs `autodiff_deferred` choice that KernelAbstractions's `gpu_fwd` makes today).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gbaraldi added a commit to EnzymeAD/Enzyme.jl that referenced this pull request May 22, 2026
…e weakdep
Previously the 1-arg `_deferred_marker(id::UInt) -> Ptr{Cvoid}` helper lived in `Enzyme.Compiler`. That meant CUDA's matching `@overlay CUDA.method_table` overlay would need to reach into Enzyme.Compiler — forcing CUDA.jl's EnzymeCoreExt to weak-depend on Enzyme.
Move the helper to EnzymeCore (with the default `reinterpret(Ptr{Cvoid}, id)` body) so CUDA's overlay can stay EnzymeCore-only. The actual `ccall("extern deferred_codegen", llvmcall, …)` body still lives in two method-table overlays:
* `GPUCompiler.GLOBAL_METHOD_TABLE` (in src/compiler.jl) — consulted by Enzyme's own EnzymeInterpreter for higher-order CPU AD.
* `CUDA.method_table` (in CUDA.jl/ext/EnzymeCoreExt.jl, see JuliaGPU/CUDA.jl#3150) — consulted by CUDA's GPUInterpreter during kernel compilation.
Both overlays reference `EnzymeCore._deferred_marker(::UInt)` directly — no Enzyme weakdep on the CUDA side.
`Enzyme.Compiler.deferred_codegen` (12-arg wrapper) now delegates to `EnzymeCore._deferred_marker(id)` after computing the id via the generated `deferred_id_codegen`. Its body contains no llvmcall, so its despecialised form is safe for sysimage builds (fixes #3091).
Local verification on Julia 1.12.6: CPU autodiff_deferred, higher-order CPU AD (Hessian, nested HVP), and a CUDA kernel calling plain `Enzyme.autodiff(...)` all pass; `PackageCompiler.create_sysimage(["Enzyme"]; ...)` succeeds and the sysimage no longer references `deferred_codegen` in its undefined symbols.
The Julia 1.10 guards from the previous commit (skipping forward-over-reverse via `autodiff_deferred` in inner functions) remain — that path still trips on 1.10's inlining behaviour even with this refactor.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
85b1681 to ccce51b
Two overlays in CUDA.method_table that route deferred-codegen through CUDA's GPUInterpreter during kernel compilation:
1. `EnzymeCore._deferred_marker(id::UInt)` — its body holds the `extern deferred_codegen` ccall marker that GPUCompiler's scanner picks up to link the inner Enzyme adjoint into the kernel module. The function is declared in EnzymeCore with a safe default (`reinterpret(Ptr{Cvoid}, id)`); the overlay supplies the ccall during CUDA compilation. Sister overlay in `GPUCompiler.GLOBAL_METHOD_TABLE` from Enzyme.jl covers higher-order CPU AD.
2. `EnzymeCore.autodiff(::ReverseMode|::ForwardMode, ...)` — redirect to `autodiff_deferred` so user kernels can call plain `Enzyme.autodiff(...)` and get the deferred-codegen path automatically — removing the manual `autodiff` vs `autodiff_deferred` choice that KernelAbstractions's `gpu_fwd` makes today.
Companion to EnzymeAD/Enzyme.jl#3112.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ccce51b to e874f57
Companion to EnzymeAD/Enzyme.jl#3112 — addresses EnzymeAD/Enzyme.jl#3091.
Summary
Two new overlays in `CUDA.method_table` inside `ext/EnzymeCoreExt.jl`:
1. `EnzymeCore.deferred_codegen(id::UInt) -> Ptr{Cvoid}`: the body holds the `ccall("extern deferred_codegen", llvmcall, …)` marker that GPUCompiler's scanner picks up to link the inner Enzyme adjoint into the kernel module. EnzymeCore declares `deferred_codegen` as a method-less generic function so that `jl_compile_all_defs` can't despecialize the llvmcall into a host sysimage. CUDA's overlay supplies the body during kernel compilation; Enzyme.jl provides a matching overlay in `GPUCompiler.GLOBAL_METHOD_TABLE` for its own interpreter.
2. `EnzymeCore.autodiff(::ReverseMode|::ForwardMode, …)` redirect: inside CUDA's `GPUInterpreter` this forwards to `autodiff_deferred`, so user kernels can write plain `Enzyme.autodiff(...)` and get the deferred-codegen path automatically. Removes the manual `autodiff` vs `autodiff_deferred` choice that KernelAbstractions's `gpu_fwd` makes today.
The overlays only use EnzymeCore types, so no new weakdep is needed.
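For reviewers less familiar with method-table overlays, here is a minimal sketch of the two patterns using only `Base.Experimental.@MethodTable` and `Base.Experimental.@overlay`. It is not the code in this PR: `DemoCore`, `demo_table`, `marker`, `demo_autodiff`, and `demo_autodiff_deferred` are hypothetical stand-ins for EnzymeCore, `CUDA.method_table`, and the functions being overlaid, and the overlay bodies are plain Julia rather than the real ccall marker.

```julia
# Illustrative only: every name below is a hypothetical stand-in,
# not an EnzymeCore, CUDA.jl, or GPUCompiler API.
module DemoCore
    # Safe default: plain Julia that a host sysimage build can compile.
    marker(id::UInt) = reinterpret(Ptr{Cvoid}, id)

    # Two entry points, standing in for `autodiff` and `autodiff_deferred`.
    demo_autodiff(f, x) = (:eager, f(x))
    demo_autodiff_deferred(f, x) = (:deferred, f(x))
end

# A dedicated method table, playing the role of `CUDA.method_table`.
Base.Experimental.@MethodTable(demo_table)

# Pattern 1: swap the body only for compilers that consult `demo_table`.
Base.Experimental.@overlay demo_table function DemoCore.marker(id::UInt)
    # The real overlay would place the `extern deferred_codegen` ccall here.
    return reinterpret(Ptr{Cvoid}, id)
end

# Pattern 2: redirect the plain entry point to the deferred one.
Base.Experimental.@overlay demo_table function DemoCore.demo_autodiff(f, x)
    return DemoCore.demo_autodiff_deferred(f, x)
end

# Ordinary host dispatch ignores the overlays and keeps the defaults.
host_ptr = DemoCore.marker(UInt(16))
host_diff = DemoCore.demo_autodiff(abs2, 3)
```

On the host, `host_diff` still comes from the eager method, because overlay methods are only visible to an interpreter configured to look them up in `demo_table`. In the PR that role is played by CUDA's `GPUInterpreter`.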
Test plan