Integrate with CUDA.jl + reduce launch overhead #214
Merged
The package extension was effectively a fiction — every functional path
(launch, broadcast, RNG, examples, all host/device tests) requires
CUDACore. Codegen-only paths still work, but they live behind
CUDACore-aware code anyway.
Moves the ext contents into:
- src/launch.jl — launch infrastructure
- src/broadcast.jl — TiledStyle vs CuArrayStyle override
- src/language/random.jl — host-side rand!/randn!/randexp! on CuArray
Removes the `launch(args...) = error("Please import CUDA.jl …")` stub.
Replaces the bespoke `to_tile_arg` flat dispatch with a `KernelAdaptor` + `cuTileconvert` mirroring `CUDACore.cudaconvert`. User-defined struct types containing arrays now compose naturally — register an `Adapt.adapt_structure` method and they're recursively converted before launch. Adapt's default `adapt_structure(to, ::PermutedDimsArray)` rebuilds the wrapper around the adapted parent, which doesn't fit `TileArray`'s absorb-strides-into-the-struct design. Added an explicit override so the whole wrapper collapses to a single TileArray.
Wires cuTile into CUDACore's `@cuda backend=…` dispatch protocol and introduces a hoistable `cufunction(f, tt)` step that mirrors `CUDACore.cufunction`. Three user-facing entry points now exist:

    cuTile.launch(f, grid, args…; opts…)                  # function form
    @cuda backend=cuTile.TileBackend() blocks=N f(…)      # macro form
    k = cuTile.cufunction(f, tt; opts…); k(args…; …)      # pre-compiled

The pre-compiled form skips MI lookup and CompilerCaching dispatch on each call, which is useful for tight launch loops.

Caching of the `TileKernel` wrapper rides on CompilerCaching: a new `tile_kernel` field on `CuTileResults` stores it alongside the cubin and CuFunction, so kernel-instance lifecycle follows the underlying `CodeInstance` instead of a separate global Dict.

Other changes:
- `check_tile_ir_support()` is memoized per `(capability, cuda_version)` instead of running every launch.
- `kernel_convert(::TileBackend, x) = cuTileconvert(x)` and `kernel_compile(::TileBackend, f, tt; …) = cufunction(f, tt; …)` register cuTile as a backend with CUDACore.
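The memoization of `check_tile_ir_support()` per `(capability, cuda_version)` could look roughly like the sketch below. This is an illustration only: `SUPPORT_CACHE`, `probe_support`, `check_support_memoized`, and the version thresholds are hypothetical names and placeholder values, not cuTile's actual implementation.

```julia
# Hypothetical sketch: memoize an expensive support check per (capability, cuda_version).
const SUPPORT_CACHE = Dict{Tuple{VersionNumber,VersionNumber},Bool}()
const SUPPORT_LOCK = ReentrantLock()
const PROBE_COUNT = Ref(0)   # only here to show how often the slow path runs

# Stand-in for the real check; the thresholds are arbitrary placeholders.
function probe_support(capability::VersionNumber, cuda_version::VersionNumber)
    PROBE_COUNT[] += 1
    return capability >= v"8.0" && cuda_version >= v"13.0"
end

# The slow probe runs once per distinct key; later calls are a Dict lookup.
function check_support_memoized(capability::VersionNumber, cuda_version::VersionNumber)
    lock(SUPPORT_LOCK) do
        get!(SUPPORT_CACHE, (capability, cuda_version)) do
            probe_support(capability, cuda_version)
        end
    end
end
```

Calling `check_support_memoized` twice with the same pair runs `probe_support` only once, so the per-launch cost reduces to a lock and a dictionary lookup.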
Each emit phase used to do its own `get(cache, mi)` + `results(cache, ci, ...)`, so a single launch resolved the same `(ci, res)` pair four times. With CompilerCaching's new `lookup` accessor and an `ensure_compiled` helper, `cufunction` resolves them once at the top and threads them down through `emit_function!` → `emit_binary!` → `emit_tile!` → `emit_structured!`. The cache short-circuit in each phase moves to *after* the recursion so `compile_hook` (used by `@device_code_*` reflection) still fires on every launch when downstream artifacts are fully cached. This shaves ~1.4 µs / 9 allocations off the function-form launch overhead on a vadd kernel (9.6 µs → 8.2 µs).
…rated`.
The original three passes over `tt.parameters` (`any` for `has_consts`, `map`
for the unwrapped tuple, then a `for ... push!` for the const-argtypes vector)
cost ~1 µs/launch on a vadd kernel. With `tt` `@nospecialize`d, none of it
const-folded.
Drop the `@nospecialize(tt)` so `cufunction` specializes per kernel signature,
and replace the body with a single `@generated` `unwrap_argtypes` that:
- emits the `Tuple{...}` argtypes type at compile time (no allocation),
- emits the `Any[...]` const-argtypes literal so only the runtime CC.Const
boxes for argument values survive,
- short-circuits to `nothing` when `tt` has no `Constant` slots, skipping
the const-prop pipeline entirely.
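The three behaviours listed above can be sketched with a small `@generated` function. This is a minimal illustration under stated assumptions: the `Constant{T,V}` layout, the `ConstSlot` box (standing in for a `CC.Const`), and the `(argtypes, consts)` return shape are all hypothetical and may differ from cuTile's real definitions.

```julia
# Hypothetical stand-ins; cuTile's real `Constant` and compiler types differ.
struct Constant{T,V} end      # compile-time constant of type T with value V
struct ConstSlot              # plays the role of a `CC.Const` box
    val::Any
end

@generated function unwrap_argtypes(::Type{TT}) where {TT<:Tuple}
    unwrapped = Any[]     # parameter types with `Constant` wrappers stripped
    slots = Any[]         # expressions for the const-argtypes literal
    has_consts = false
    for p in TT.parameters
        if p <: Constant
            T, V = p.parameters
            push!(unwrapped, T)
            push!(slots, :(ConstSlot($(QuoteNode(V)))))
            has_consts = true
        else
            push!(unwrapped, p)
            push!(slots, p)
        end
    end
    # The argtypes tuple type is built here, at generation time.
    argtypes = Tuple{unwrapped...}
    # No `Constant` slots: skip the const-argtypes vector entirely.
    has_consts || return :(($argtypes, nothing))
    # Otherwise only the `Any[...]` vector and its boxes are built at run time.
    return :(($argtypes, Any[$(slots...)]))
end
```

Because the function specializes on `TT`, the loop over `TT.parameters` runs once per kernel signature during code generation rather than on every launch.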
`cufunction` time on a vadd kernel: 1.85 µs → 1.14 µs → ~700 ns. The total launch
gap (function-form vs pre-compiled) narrowed from 4.4 µs to 1.6 µs.
- `DefaultBackend() = TileBackend()` so `@cuda backend=cuTile …` resolves through CUDACore's module-as-backend hook.
- Default `blocks=1` on `(::TileKernel)(args...; …)` to match CUDACore's driver-level `launch`/`cudacall`, so a no-grid `@cuda backend=cuTile f(args...)` works.
- `public` list now includes `TileBackend` and `DefaultBackend`; doc/example strings lead with the module-form invocation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`Base.rand(Random.RandomDevice(), UInt32)` costs ~110 ns; the task-local `Base.rand(UInt32)` is ~1.5 ns and produces equally distinct seeds across launches. This mirrors what `HostKernel.make_seed` does on the LLVM backend.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
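For reference, the two seed sources compared above differ only in which RNG they draw from. The helper names `fast_seed` and `slow_seed` below are hypothetical and exist only to put the two calls side by side.

```julia
using Random

# Task-local default RNG: cheap, since it avoids reading OS entropy on each draw.
fast_seed() = rand(UInt32)

# OS entropy source: the costlier call that the text above reports at ~110 ns.
slow_seed() = rand(Random.RandomDevice(), UInt32)
```

Both return a `UInt32`, so swapping one for the other does not change the type seen by the rest of the launch path.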
The previous body built `flat_args` via `Iterators.flatten(map(flatten, args))...`
and `flat_types` via `Tuple{map(typeof, flat_args)...}`. Inference widens both
to abstract tuple/type lattice elements and lowers to `Core._apply_iterate`,
costing ~400 ns of splat overhead per launch. The replacement mirrors the
`@generated` callable on CUDACore's `AbstractKernel`, which folds the same work
at compile time.
`_flatten_static!` recurses over the arg types: TileArray expands to
`(ptr, sizes..., strides...)`, ghost types contribute nothing, primitives pass
through, structs recurse field-by-field via `getfield(_, i)`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
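The type-driven flattening described for `_flatten_static!` can be sketched as follows. This is an illustration, not the PR's code: `FakeTileArray`, `flatten_exprs!`, and `flatten_args` are hypothetical names, and the field layout of the stand-in array type is an assumption.

```julia
# Hypothetical stand-in for a TileArray: a pointer plus static-rank sizes and strides.
struct FakeTileArray{T,N}
    ptr::Ptr{T}
    sizes::NTuple{N,Int32}
    strides::NTuple{N,Int32}
end

# Collect, from the type alone, expressions that read each flat value out of `ex`.
function flatten_exprs!(out::Vector{Any}, ex, T)
    if T <: FakeTileArray
        N = T.parameters[2]
        push!(out, :($ex.ptr))
        for i in 1:N; push!(out, :($ex.sizes[$i])); end
        for i in 1:N; push!(out, :($ex.strides[$i])); end
    elseif isprimitivetype(T)
        push!(out, ex)                  # primitives pass through unchanged
    elseif isstructtype(T) && fieldcount(T) == 0
        # ghost type: contributes nothing
    elseif isstructtype(T)
        for i in 1:fieldcount(T)        # structs recurse field by field
            flatten_exprs!(out, :(getfield($ex, $i)), fieldtype(T, i))
        end
    else
        error("cannot flatten argument of type $T")
    end
    return out
end

# Inside the generator, `args` is a tuple of argument types, so the whole
# traversal happens once per signature and the body is a plain tuple literal.
@generated function flatten_args(args...)
    out = Any[]
    for (i, T) in enumerate(args)
        flatten_exprs!(out, :(args[$i]), T)
    end
    return :(($(out...),))
end
```

The generated body contains no splatting or iteration, which is the property the commit message credits for removing the `Core._apply_iterate` overhead.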
The previous owner `(:cuTile, opts::NamedTuple)` carried two `VersionNumber`
fields, whose non-isbits prerelease/build tuples forced the per-`lookup` heap
box up to ~80 B / ~17 ns. Pack `sm_arch` and `bytecode_version` into `UInt16`
and use `-1` as the `nothing` sentinel for hint fields, so the owner is
isbits — box drops to ~32 B / ~6 ns. Decoding back to the original
`VersionNumber` / `Union{Int, Nothing}` types only happens once per cache miss
in `emit_binary!` and `emit_tile!`, never on the hot lookup path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
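The packing idea can be illustrated with a small sketch. The byte layout (major in the high byte, minor in the low byte), the `PackedOpts` struct, and the `num_ctas` hint field are assumptions made for the example; the PR's actual encoding is not shown in the text above.

```julia
# Hypothetical encoding: major in the high byte, minor in the low byte.
# Assumes major and minor both fit in 8 bits; patch and prerelease are dropped.
pack_version(v::VersionNumber) = (UInt16(v.major) << 8) | UInt16(v.minor)
unpack_version(p::UInt16) = VersionNumber(Int(p >> 8), Int(p & 0x00ff))

# `-1` stands in for `nothing` so the hint field stays a plain Int.
encode_hint(h::Union{Int,Nothing}) = h === nothing ? -1 : h
decode_hint(h::Int) = h == -1 ? nothing : h

# All fields are isbits, so the struct is isbits too.
struct PackedOpts
    sm_arch::UInt16
    bytecode_version::UInt16
    num_ctas::Int   # hypothetical hint field
end
```

`isbitstype(VersionNumber)` is `false` while `isbitstype(PackedOpts)` is `true`, which is the difference the commit exploits; decoding with `unpack_version` and `decode_hint` is only needed where the original types are required.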