EnzymeCoreExt: overlay autodiff/deferred_codegen for kernel codegen by gbaraldi · Pull Request #3150 · JuliaGPU/CUDA.jl

gbaraldi · 2026-05-20T18:56:19Z

Companion to EnzymeAD/Enzyme.jl#3112 — addresses EnzymeAD/Enzyme.jl#3091.

Summary

Two new overlays in `CUDA.method_table` inside `ext/EnzymeCoreExt.jl`:

1. `EnzymeCore.deferred_codegen(id::UInt) -> Ptr{Cvoid}`: the body holds the `ccall("extern deferred_codegen", llvmcall, …)` marker that GPUCompiler's scanner picks up to link the inner Enzyme adjoint into the kernel module. EnzymeCore declares `deferred_codegen` as a method-less generic function so that `jl_compile_all_defs` can't despecialize the llvmcall into a host sysimage. CUDA's overlay supplies the body during kernel compilation; Enzyme.jl provides a matching overlay in `GPUCompiler.GLOBAL_METHOD_TABLE` for its own interpreter.
2. `EnzymeCore.autodiff(::ReverseMode|::ForwardMode, …)` redirect: inside CUDA's `GPUInterpreter` this forwards to `autodiff_deferred`, so user kernels can write plain `Enzyme.autodiff(...)` and get the deferred-codegen path automatically. This removes the manual `autodiff` vs `autodiff_deferred` choice that KernelAbstractions's `gpu_fwd` makes today.

The overlays only use EnzymeCore types, so no new weakdep is needed.
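The overlay mechanism behind both entries can be sketched with `Base.Experimental` alone. This is a minimal illustration, not the code in `ext/EnzymeCoreExt.jl`: the module `CoreDemo`, the table `demo_method_table`, and the one-argument `autodiff` stand-ins are all hypothetical, and the first overlay uses a placeholder body instead of the real llvmcall marker. Only a compiler that consults the overlay table sees these methods, so the sketch can only show that ordinary host dispatch is left untouched.

```julia
# Stand-in for EnzymeCore; every name in this module is hypothetical.
module CoreDemo
    # Declared without methods, as the PR description says EnzymeCore does.
    function deferred_codegen end

    autodiff(x) = :eager              # stand-in for the ordinary entry point
    autodiff_deferred(x) = :deferred  # stand-in for the deferred entry point
end

# Stand-in for CUDA.method_table.
Base.Experimental.@MethodTable(demo_method_table)

# Overlay 1: supply a body only for compilers that consult demo_method_table.
# The real overlay holds the llvmcall marker; this placeholder does not.
Base.Experimental.@overlay demo_method_table function CoreDemo.deferred_codegen(id::UInt)
    return reinterpret(Ptr{Cvoid}, id)
end

# Overlay 2: redirect the ordinary entry point to the deferred one.
Base.Experimental.@overlay demo_method_table function CoreDemo.autodiff(x)
    return CoreDemo.autodiff_deferred(x)
end

# Ordinary host dispatch never looks at the overlay table.
host_has_body = applicable(CoreDemo.deferred_codegen, UInt(1))
host_result = CoreDemo.autodiff(1.0)
```

On the host, `deferred_codegen` still has no applicable method and `autodiff` still takes its original path; the redirect would only apply inside an interpreter configured with `demo_method_table`.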

Test plan

CI green
Verified locally with a CUDA kernel calling plain `Enzyme.autodiff(Reverse, kernel, ...)` — gradient comes out correct
Existing `test/extensions/enzyme.jl` keeps passing

Two new overlays in `CUDA.method_table`:

1. `EnzymeCore.deferred_codegen(id::UInt) -> Ptr{Cvoid}`: The body holds the `ccall("extern deferred_codegen", llvmcall, …)` marker that GPUCompiler's scanner picks up to link the inner Enzyme adjoint into the kernel module. EnzymeCore declares `deferred_codegen` as a method-less generic function so that `jl_compile_all_defs` can't despecialize the llvmcall into a host sysimage. CUDA's overlay supplies the body during kernel compilation; Enzyme.jl provides a matching overlay in `GPUCompiler.GLOBAL_METHOD_TABLE` for its own interpreter. See EnzymeAD/Enzyme.jl#3091 and the companion PR.
2. `EnzymeCore.autodiff(::ReverseMode|::ForwardMode, …)`: Redirect to `autodiff_deferred` inside CUDA's GPUInterpreter so user kernels can call plain `Enzyme.autodiff(...)` and get the deferred-codegen path automatically, removing the manual `autodiff` vs `autodiff_deferred` choice that KernelAbstractions's `gpu_fwd` makes today.

Companion to EnzymeAD/Enzyme.jl#3112.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions

CUDA.jl Benchmarks

Benchmark suite	Current: `e874f57`	Previous: `d96bb17`	Ratio
`array/accumulate/Float32/1d`	`101249` ns	`99804` ns	`1.01`
`array/accumulate/Float32/dims=1`	`77191` ns	`74695` ns	`1.03`
`array/accumulate/Float32/dims=1L`	`1586424.5` ns	`1575172` ns	`1.01`
`array/accumulate/Float32/dims=2`	`144038` ns	`140050` ns	`1.03`
`array/accumulate/Float32/dims=2L`	`658149` ns	`652746` ns	`1.01`
`array/accumulate/Int64/1d`	`118341` ns	`116667` ns	`1.01`
`array/accumulate/Int64/dims=1`	`80145` ns	`78138` ns	`1.03`
`array/accumulate/Int64/dims=1L`	`1706952` ns	`1683352` ns	`1.01`
`array/accumulate/Int64/dims=2`	`156969` ns	`151632` ns	`1.04`
`array/accumulate/Int64/dims=2L`	`962119` ns	`958723` ns	`1.00`
`array/broadcast`	`20427` ns	`19844` ns	`1.03`
`array/construct`	`1239` ns	`1172.2` ns	`1.06`
`array/copy`	`17978` ns	`16637` ns	`1.08`
`array/copyto!/cpu_to_gpu`	`215916` ns	`213824` ns	`1.01`
`array/copyto!/gpu_to_cpu`	`284942` ns	`278790` ns	`1.02`
`array/copyto!/gpu_to_gpu`	`10826` ns	`10365` ns	`1.04`
`array/iteration/findall/bool`	`134850` ns	`130581` ns	`1.03`
`array/iteration/findall/int`	`148813.5` ns	`144724` ns	`1.03`
`array/iteration/findfirst/bool`	`82415.5` ns	`79128` ns	`1.04`
`array/iteration/findfirst/int`	`84072.5` ns	`80847` ns	`1.04`
`array/iteration/findmin/1d`	`82650` ns	`66285` ns	`1.25`
`array/iteration/findmin/2d`	`114182` ns	`101434` ns	`1.13`
`array/iteration/logical`	`201744` ns	`188254` ns	`1.07`
`array/iteration/scalar`	`66299` ns	`64586` ns	`1.03`
`array/permutedims/2d`	`52586` ns	`49748` ns	`1.06`
`array/permutedims/3d`	`52731` ns	`50091` ns	`1.05`
`array/permutedims/4d`	`53030.5` ns	`49706` ns	`1.07`
`array/random/rand/Float32`	`12572` ns	`12116` ns	`1.04`
`array/random/rand/Int64`	`36650` ns	`23371` ns	`1.57`
`array/random/rand!/Float32`	`8443.333333333334` ns	`7960.333333333333` ns	`1.06`
`array/random/rand!/Int64`	`34121` ns	`20391` ns	`1.67`
`array/random/randn/Float32`	`37407` ns	`34710` ns	`1.08`
`array/random/randn!/Float32`	`30936` ns	`23894` ns	`1.29`
`array/reductions/mapreduce/Float32/1d`	`33335.5` ns	`32647` ns	`1.02`
`array/reductions/mapreduce/Float32/dims=1`	`39372` ns	`37819` ns	`1.04`
`array/reductions/mapreduce/Float32/dims=1L`	`51253` ns	`50203` ns	`1.02`
`array/reductions/mapreduce/Float32/dims=2`	`56414` ns	`55479` ns	`1.02`
`array/reductions/mapreduce/Float32/dims=2L`	`69109` ns	`67023` ns	`1.03`
`array/reductions/mapreduce/Int64/1d`	`42869` ns	`39410` ns	`1.09`
`array/reductions/mapreduce/Int64/dims=1`	`51434` ns	`40857` ns	`1.26`
`array/reductions/mapreduce/Int64/dims=1L`	`87004` ns	`86396` ns	`1.01`
`array/reductions/mapreduce/Int64/dims=2`	`59396.5` ns	`57852` ns	`1.03`
`array/reductions/mapreduce/Int64/dims=2L`	`84511` ns	`82413` ns	`1.03`
`array/reductions/reduce/Float32/1d`	`34084` ns	`32758` ns	`1.04`
`array/reductions/reduce/Float32/dims=1`	`48998` ns	`38011` ns	`1.29`
`array/reductions/reduce/Float32/dims=1L`	`51206` ns	`50349` ns	`1.02`
`array/reductions/reduce/Float32/dims=2`	`56383` ns	`55642` ns	`1.01`
`array/reductions/reduce/Float32/dims=2L`	`69066` ns	`67383` ns	`1.02`
`array/reductions/reduce/Int64/1d`	`42820` ns	`39537` ns	`1.08`
`array/reductions/reduce/Int64/dims=1`	`41986` ns	`40656` ns	`1.03`
`array/reductions/reduce/Int64/dims=1L`	`86968` ns	`86280` ns	`1.01`
`array/reductions/reduce/Int64/dims=2`	`59257` ns	`57865` ns	`1.02`
`array/reductions/reduce/Int64/dims=2L`	`83898` ns	`82279` ns	`1.02`
`array/reverse/1d`	`17704` ns	`16600` ns	`1.07`
`array/reverse/1dL`	`68278` ns	`67647` ns	`1.01`
`array/reverse/1dL_inplace`	`65712` ns	`65318` ns	`1.01`
`array/reverse/1d_inplace`	`10193.5` ns	`8340.333333333334` ns	`1.22`
`array/reverse/2d`	`21246` ns	`19869` ns	`1.07`
`array/reverse/2dL`	`73393.5` ns	`71763` ns	`1.02`
`array/reverse/2dL_inplace`	`65729` ns	`65099` ns	`1.01`
`array/reverse/2d_inplace`	`10213` ns	`9659` ns	`1.06`
`array/sorting/1d`	`2735901` ns	`2713733` ns	`1.01`
`array/sorting/2d`	`1068724.5` ns	`1062891` ns	`1.01`
`array/sorting/by`	`3303971` ns	`3280831` ns	`1.01`
`cuda/synchronization/context/auto`	`1154.9` ns	`1138.4` ns	`1.01`
`cuda/synchronization/context/blocking`	`938.974358974359` ns	`962.5652173913044` ns	`0.98`
`cuda/synchronization/context/nonblocking`	`7578.6` ns	`5944.2` ns	`1.27`
`cuda/synchronization/stream/auto`	`1013.3` ns	`997.8333333333334` ns	`1.02`
`cuda/synchronization/stream/blocking`	`824.1359223300971` ns	`821.8235294117648` ns	`1.00`
`cuda/synchronization/stream/nonblocking`	`7164.4` ns	`5912.4` ns	`1.21`
`integration/byval/reference`	`143670` ns	`143240` ns	`1.00`
`integration/byval/slices=1`	`145773` ns	`145261` ns	`1.00`
`integration/byval/slices=2`	`284454` ns	`283622` ns	`1.00`
`integration/byval/slices=3`	`423151` ns	`421946` ns	`1.00`
`integration/cudadevrt`	`102401` ns	`101870` ns	`1.01`
`integration/volumerhs`	`9960003` ns	`9906591` ns	`1.01`
`kernel/indexing`	`13275` ns	`12574` ns	`1.06`
`kernel/indexing_checked`	`14035` ns	`13327` ns	`1.05`
`kernel/launch`	`2156.1111111111113` ns	`2037.111111111111` ns	`1.06`
`kernel/occupancy`	`677.404458598726` ns	`696.7364864864865` ns	`0.97`
`kernel/rand`	`18050.5` ns	`14332` ns	`1.26`
`latency/import`	`3548878692` ns	`3854422204` ns	`0.92`
`latency/precompile`	`45248138979` ns	`4627097887` ns	`9.78`
`latency/ttfp`	`13051637302` ns	`4514039137` ns	`2.89`

This comment was automatically generated by workflow using github-action-benchmark.
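The Ratio column appears to be Current divided by Previous, rounded to two decimals; that reading is an assumption, but it matches the rows in the table. A short check on the three latency rows, with the numbers copied from the table:

```julia
# (name, current ns, previous ns) copied from three rows of the table above.
rows = [
    ("latency/import",     3548878692,  3854422204),
    ("latency/precompile", 45248138979, 4627097887),
    ("latency/ttfp",       13051637302, 4514039137),
]

# Assumption: Ratio = Current / Previous, rounded to two decimals.
ratios = Dict(name => round(cur / prev; digits = 2) for (name, cur, prev) in rows)

for (name, _, _) in rows
    println(name, " => ", ratios[name])
end
```

Under that reading, a ratio above 1 means the current commit is slower on that benchmark, which makes `latency/precompile` and `latency/ttfp` the rows to look at first.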

`Compiler.deferred_codegen` is the function whose body contains `ccall("extern deferred_codegen", llvmcall, …)`, which is not valid for normal Julia codegen: `jl_compile_all_defs` enqueues the despecialised body during a sysimage build and that fails (#3091). Split it into:

* `EnzymeCore.deferred_codegen(id::UInt) -> Ptr{Cvoid}`: a 1-arg shell declared in EnzymeCore. Its default body is just `reinterpret(Ptr{Cvoid}, id)`. `id` is already the OrcV2 trampoline pointer allocated by `Compiler.deferred_id_codegen`, so this default matches the runtime behaviour of the existing `extern deferred_codegen` symbol (which is also identity, registered via `@cfunction(deferred_codegen, …)`). Direct CPU calls to `autodiff_deferred` continue to work via this trampoline path.
* `Enzyme.Compiler.deferred_codegen(fa, a, tt, …)`: the existing 12-arg method, now restructured so its body computes `id` via `deferred_id_codegen` and delegates to `EnzymeCore.deferred_codegen(id)`. No more llvmcall in this body.

The `extern deferred_codegen` ccall marker that GPUCompiler's scanner needs during deferred compilation lives in two method-table overlays of `EnzymeCore.deferred_codegen(::UInt)`:

* `GPUCompiler.GLOBAL_METHOD_TABLE` (in `src/compiler.jl`): consulted by Enzyme's own `EnzymeInterpreter`, so higher-order CPU AD (e.g. forward-over-reverse from `test/basic.jl:660` "Nested Type Error" or `test/advanced.jl:1469` "Hessian") emits the deferred marker correctly.
* `CUDA.method_table` (in JuliaGPU/CUDA.jl#3150): consulted by CUDA's `GPUInterpreter`, so GPU kernels using `autodiff_deferred` (or, via a matching `autodiff → autodiff_deferred` redirect overlay added in the same CUDA-side extension, plain `Enzyme.autodiff(...)`) emit the marker into the GPU IR for GPUCompiler's scanner to resolve.

Overlay methods have `external_mt` set, so `compile_all_collect__` (`precompile_utils.c:147`) skips them. The despecialised body that `jl_compile_all_defs` sees during a sysimage build is the safe `reinterpret(Ptr{Cvoid}, id)` stub plus the regular Julia call wrapper in `Compiler.deferred_codegen`, with no extern symbol to resolve.

Verified locally with `PackageCompiler.create_sysimage(["Enzyme"]; …)`: the sysimage no longer references `deferred_codegen` in its undefined symbols.

The companion CUDA-side PR also overlays `EnzymeCore.autodiff(::ReverseMode|::ForwardMode, …)` to forward to `autodiff_deferred` inside CUDA codegen, so user kernels can write plain `Enzyme.autodiff(...)` and get the deferred-codegen path automatically, removing the manual `autodiff` vs `autodiff_deferred` choice that KernelAbstractions's `gpu_fwd` makes today.

Companion change in CUDA.jl: JuliaGPU/CUDA.jl#3150. Fixes #3091.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
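The split described in this commit message (a 1-arg shell whose default body is the identity on the pointer bits, plus a wrapper that computes the id and delegates) can be sketched as follows. The names `marker`, `fake_id`, and `wrapper` are hypothetical stand-ins, and `fake_id` only imitates the id computation; the real one is said to allocate an OrcV2 trampoline.

```julia
# Hypothetical stand-ins; none of these names are Enzyme's.

# Safe default: return a pointer with exactly the bits of `id`.
marker(id::UInt) = reinterpret(Ptr{Cvoid}, id)

# Imitation of the id computation; the real one allocates a trampoline.
fake_id(args...) = UInt(0x1000) + UInt(length(args))

# Wrapper: compute the id, then delegate. No llvmcall in this body,
# so there is nothing here that host codegen cannot compile.
wrapper(args...) = marker(fake_id(args...))

p = wrapper(:f, 1.0, 2.0)
roundtrip = UInt(p)   # same bits that fake_id produced
```

Because the default is bit-identical to the id, a host caller that receives the pointer gets the same value it would have received from an identity function registered under the extern symbol, which is the equivalence the commit message relies on.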

@overlay

`Compiler.deferred_codegen` is the function whose body contains `ccall("extern deferred_codegen", llvmcall, …)`, which is not valid for normal Julia codegen: `jl_compile_all_defs` enqueues the despecialised body during a sysimage build and that fails (#3091).

Give it a safe default body that returns `reinterpret(Ptr{Cvoid}, id)`, matching the runtime behaviour of the registered `extern deferred_codegen` symbol (which is itself identity, see `register_deferred_codegen`). Direct CPU calls to `autodiff_deferred` continue to work via the resulting OrcV2 trampoline pointer.

The actual ccall marker lives in a `Base.Experimental.@overlay GPUCompiler.GLOBAL_METHOD_TABLE` definition with the same 12-arg signature. `EnzymeInterpreter` (which Enzyme jobs use by default) consults this overlay, so higher-order CPU AD (e.g. forward-over-reverse from `test/basic.jl:660` "Nested Type Error" or `test/advanced.jl:1469` "Hessian") still emits the deferred-codegen marker into the parent module's IR for GPUCompiler's scanner to resolve.

Overlay methods have `external_mt` set, so `compile_all_collect__` (`precompile_utils.c:147`) skips them. The despecialised body that `jl_compile_all_defs` sees during a sysimage build is just the safe `reinterpret`, with no extern symbol to resolve.

Verified locally: `PackageCompiler.create_sysimage(["Enzyme"]; …)` succeeds and the resulting sysimage no longer references `deferred_codegen` in its undefined symbols.

The companion JuliaGPU/CUDA.jl#3150 adds:

* An `@overlay CUDA.method_table` for `EnzymeCore.autodiff(...)` to forward to `autodiff_deferred` inside CUDA codegen, letting user kernels write plain `Enzyme.autodiff(...)` and get the deferred path.
* A matching `@overlay CUDA.method_table` for `Compiler.deferred_codegen` so the marker is emitted into the GPU IR.

Fixes #3091.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The Enzyme-side overlay attempt at fixing #3091 regressed Julia 1.10 higher-order CPU AD (Enzyme's pass couldn't trace through the non-inlined overlay function to recognise the indirect call as active). Reverted; #3091 is left for a structural follow-up (likely via JuliaGPU/GPUCompiler.jl#582 + #799 or a `@generated` variant).

This PR now only carries the buildkite cross-PR CI link, paired with JuliaGPU/CUDA.jl#3150, which adds an `@overlay CUDA.method_table` on `EnzymeCore.autodiff(::ReverseMode|::ForwardMode, …)` to redirect to `autodiff_deferred` inside CUDA codegen, so user kernels can call plain `Enzyme.autodiff(...)` and get the deferred-codegen path automatically (removing the manual `autodiff` vs `autodiff_deferred` choice that KernelAbstractions's `gpu_fwd` makes today).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…e weakdep

Previously the 1-arg `_deferred_marker(id::UInt) -> Ptr{Cvoid}` helper lived in `Enzyme.Compiler`. That meant CUDA's matching `@overlay CUDA.method_table` overlay would need to reach into Enzyme.Compiler, forcing CUDA.jl's EnzymeCoreExt to weak-depend on Enzyme. Move the helper to EnzymeCore (with the default `reinterpret(Ptr{Cvoid}, id)` body) so CUDA's overlay can stay EnzymeCore-only.

The actual `ccall("extern deferred_codegen", llvmcall, …)` body still lives in two method-table overlays:

* `GPUCompiler.GLOBAL_METHOD_TABLE` (in src/compiler.jl): consulted by Enzyme's own EnzymeInterpreter for higher-order CPU AD.
* `CUDA.method_table` (in CUDA.jl/ext/EnzymeCoreExt.jl, see JuliaGPU/CUDA.jl#3150): consulted by CUDA's GPUInterpreter during kernel compilation.

Both overlays reference `EnzymeCore._deferred_marker(::UInt)` directly, so there is no Enzyme weakdep on the CUDA side.

`Enzyme.Compiler.deferred_codegen` (12-arg wrapper) now delegates to `EnzymeCore._deferred_marker(id)` after computing the id via the generated `deferred_id_codegen`. Its body contains no llvmcall, so its despecialised form is safe for sysimage builds (fixes #3091).

Local verification on Julia 1.12.6: CPU autodiff_deferred, higher-order CPU AD (Hessian, nested HVP), and a CUDA kernel calling plain `Enzyme.autodiff(...)` all pass; `PackageCompiler.create_sysimage(["Enzyme"]; ...)` succeeds and the sysimage no longer references `deferred_codegen` in its undefined symbols.

The Julia 1.10 guards from the previous commit (skipping forward-over-reverse via `autodiff_deferred` in inner functions) remain: that path still trips on 1.10's inlining behaviour even with this refactor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two overlays in CUDA.method_table that route deferred-codegen through CUDA's GPUInterpreter during kernel compilation:

1. `EnzymeCore._deferred_marker(id::UInt)`: its body holds the `extern deferred_codegen` ccall marker that GPUCompiler's scanner picks up to link the inner Enzyme adjoint into the kernel module. The function is declared in EnzymeCore with a safe default (`reinterpret(Ptr{Cvoid}, id)`); the overlay supplies the ccall during CUDA compilation. Sister overlay in `GPUCompiler.GLOBAL_METHOD_TABLE` from Enzyme.jl covers higher-order CPU AD.
2. `EnzymeCore.autodiff(::ReverseMode|::ForwardMode, ...)`: redirect to `autodiff_deferred` so user kernels can call plain `Enzyme.autodiff(...)` and get the deferred-codegen path automatically, removing the manual `autodiff` vs `autodiff_deferred` choice that KernelAbstractions's `gpu_fwd` makes today.

Companion to EnzymeAD/Enzyme.jl#3112.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gbaraldi mentioned this pull request May 20, 2026

Hide deferred_codegen ccall body behind overlay table EnzymeAD/Enzyme.jl#3112

Open

3 tasks

github-actions Bot reviewed May 20, 2026

gbaraldi force-pushed the enzyme-autodiff-overlay branch from 9a9a252 to 85b1681 Compare May 21, 2026 21:47

gbaraldi force-pushed the enzyme-autodiff-overlay branch from 85b1681 to ccce51b Compare May 22, 2026 12:41

gbaraldi force-pushed the enzyme-autodiff-overlay branch from ccce51b to e874f57 Compare May 22, 2026 13:02

gbaraldi wants to merge 2 commits into JuliaGPU:main from gbaraldi:enzyme-autodiff-overlay
