RFC: Device and runtime layer (aligned with pluggable compile backends)
| Status | Draft |
|---|---|
| Author | Peter Han |
| Created | 2026-03-18 |
Summary
This RFC proposes a runtime layer for FlyDSL that complements the previous pluggable GPU compile backend work. The runtime exposes Device, Stream, and Event in a vendor-neutral way. Concrete APIs (for example HIP on AMD ROCm vs CUDA driver/runtime on NVIDIA) live inside a single per-process device runtime implementation.
Illustrative stacks: we use ROCm/HIP (current default) and NVIDIA CUDA as examples. The same ideas apply to any other stack that offers stream-like queues and event-like synchronization.
Motivation
- **Compile vs runtime separation** — Compile backends decide how Fly IR is lowered and which native libraries the JIT loads. Execution still requires a consistent way to allocate memory, submit work, and synchronize, without scattering `hip*` or `cu*` calls through the Python DSL and generic compiler code.
- **Multiple stacks across builds and processes** — Different installations or processes may target AMD (HIP) or NVIDIA (CUDA). The abstractions should be shared; implementations are swappable.
- **Opaque Stream/Event** — Higher layers should depend only on abstract stream and event handles, not on `hipStream_t`, `cudaStream_t`, or other vendor types.
Non-goals
- **Multiple GPU runtimes in one process** — We do not support loading both ROCm/HIP and NVIDIA CUDA (or two unrelated native GPU ABIs) in the same OS process. A process picks one stack for its lifetime.
- **Mixed vendor devices in one process** — We do not design for “some tensors on HIP and some on CUDA” concurrently in a single FlyDSL process. Multiple GPUs are still in scope when they are multiple devices under the same stack (e.g. several AMD GPUs or several NVIDIA GPUs).
- **Specifying a second compile backend implementation in this RFC** — CUDA-oriented lowering and MLIR pipelines are examples for alignment; concrete CUDA compile-backend work can be a separate RFC.
Core invariants
- **Single runtime stack per process** — Exactly one `DeviceRuntime` (name illustrative) is active. All `Device`, `Stream`, and `Event` instances are interpreted by that implementation.
- **Compile backend matches runtime** — The active compile backend (e.g. `rocm` today) must match the loaded runtime kind. On mismatch, FlyDSL should fail explicitly (e.g. at the first `@jit` compile or first launch), not proceed with undefined behavior.
- **Opaque handles** — Python and portable C++ surfaces expose opaque stream/event (and optionally device memory) handles. Vendor types and headers are confined to runtime implementation translation units (similar in spirit to isolating HIP inside `FlyRocmRuntimeWrappers.cpp` today).
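The "compile backend matches runtime" invariant can be made concrete with an explicit pairing check. A minimal sketch, assuming illustrative names (`check_pairing`, the backend/runtime-kind strings) that are not an existing FlyDSL API:

```python
# Sketch only: the mapping and function names here are assumptions for
# illustration, not FlyDSL's actual API. Each compile backend id maps to
# the runtime kind it requires; anything else is a hard error.
_BACKEND_TO_RUNTIME = {
    "rocm": "hip",   # current default pairing
    "cuda": "cuda",  # hypothetical future pairing
}

def check_pairing(compile_backend: str, runtime_kind: str) -> None:
    """Fail explicitly on a mismatch instead of proceeding with UB."""
    expected = _BACKEND_TO_RUNTIME.get(compile_backend)
    if expected is None:
        raise RuntimeError(f"unknown compile backend: {compile_backend!r}")
    if expected != runtime_kind:
        raise RuntimeError(
            f"compile backend {compile_backend!r} requires runtime "
            f"{expected!r}, but {runtime_kind!r} is loaded"
        )
```

Running such a check at the first `@jit` compile or first launch (see Open questions) keeps the failure loud and early.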
Layered design
| Layer | Responsibility |
|---|---|
| Compile backend (existing direction) | MLIR pipelines, gpu.module targets, ExecutionEngine shared libraries, compile-time cache key inputs. |
| Device | Logical device id, ordinal, and capabilities (memory, wave/warp width, etc.). Does not select the MLIR pipeline. |
| DeviceRuntime | Single per-process implementation: allocation, host↔device copies, create/destroy Stream and Event, synchronization, and glue required for kernel launch. |
| Stream / Event (abstract) | Same conceptual model as HIP/CUDA: an asynchronous queue and synchronization primitives. HIP implements it with hipStream* / hipEvent*; CUDA with cudaStream* / cudaEvent* (or driver API equivalents), behind the runtime boundary. |
Optional: buffers (device pointers + lifetime) are also owned by the runtime so upper layers never call hipMalloc vs cudaMalloc directly.
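The layering above could be sketched as abstract Python interfaces. All class and method names below are illustrative assumptions following the table, not FlyDSL's real surface:

```python
# Sketch only: an assumed shape for the Device / DeviceRuntime split.
# Vendor types (hipStream_t, cudaStream_t, ...) would live only inside
# concrete DeviceRuntime subclasses; callers see opaque handles.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class Device:
    ordinal: int     # logical device id within the active stack
    wave_width: int  # e.g. 64-wide AMD wavefronts, 32-wide NVIDIA warps

class DeviceRuntime(ABC):
    """Single per-process implementation of the runtime layer."""

    @abstractmethod
    def allocate(self, device: Device, nbytes: int) -> Any: ...  # opaque buffer

    @abstractmethod
    def create_stream(self, device: Device) -> Any: ...  # opaque stream handle

    @abstractmethod
    def create_event(self) -> Any: ...                   # opaque event handle

    @abstractmethod
    def synchronize(self, stream: Any) -> None: ...
```

A HIP-backed subclass would implement these with `hipMalloc`/`hipStreamCreate` and so on; a CUDA-backed one with the CUDA equivalents, without either leaking into upper layers.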
Relationship to pluggable compile backends
- **Identifiers** — Define a clear mapping between compile backend id (e.g. `FLYDSL_COMPILE_BACKEND`) and runtime kind (e.g. `rocm` ↔ HIP runtime; a future `cuda` ↔ CUDA runtime). Avoid two unrelated naming schemes.
- **Extensions** — A vendor or downstream package may register both a compile backend and a matching runtime, or the project may ship paired defaults so a single configuration switch selects a consistent pair.
- **Caching** — JIT disk caches remain partitioned by compile backend and target architecture (and related inputs). Device ordinal typically does not belong in the cache key unless multiple devices in one process can imply different ISAs and incorrect reuse must be prevented.
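The caching point can be illustrated with a cache key built only from backend, target architecture, and source; the field names and arch strings below are assumptions for illustration:

```python
# Sketch only: a cache key partitioned by compile backend and target
# arch, as described above. Device ordinal is deliberately absent
# unless per-device ISA differences make reuse unsafe.
import hashlib
import json

def jit_cache_key(compile_backend: str, target_arch: str,
                  source_hash: str) -> str:
    payload = json.dumps(
        {"backend": compile_backend, "arch": target_arch, "src": source_hash},
        sort_keys=True,  # stable serialization -> stable key
    )
    return hashlib.sha256(payload.encode()).hexdigest()
```

With this shape, two processes on the same backend and arch share artifacts, while a different arch (say `gfx90a` vs `gfx942`) yields a distinct key.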
Registration and initialization
- **Runtime** — A registration hook (e.g. `register_device_runtime`) remains useful for tests and plugins, but its semantics should reflect at most one active stack: e.g. register once, or set a default HIP implementation that is replaced only before any GPU work.
- **Compile** — Continues to use the existing compile-backend registry; validation ties the two together early.
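The "at most one active stack" registration semantics could be sketched as follows; `register_device_runtime` is the hook name suggested above, and the locking-on-first-use behavior is one possible policy, not a settled design:

```python
# Sketch only: registration that allows replacement until the first
# GPU work, then locks in the choice for the process lifetime.
_active_runtime = None
_gpu_work_started = False

def register_device_runtime(runtime) -> None:
    global _active_runtime
    if _gpu_work_started:
        raise RuntimeError("cannot replace the runtime after GPU work began")
    _active_runtime = runtime

def get_device_runtime():
    """First use locks in the active stack for this process."""
    global _gpu_work_started
    if _active_runtime is None:
        raise RuntimeError("no device runtime registered")
    _gpu_work_started = True
    return _active_runtime
```

Tests and plugins can swap implementations freely before any launch; after that, re-registration fails loudly rather than silently mixing stacks.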
Interop with frameworks (e.g. PyTorch)
Adapters may map an external “current CUDA device” or ROCm-visible device to FlyDSL’s Device, but the core runtime API should not depend on PyTorch. Documentation should state when the external framework’s active stack must match FlyDSL’s single runtime choice.
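An adapter along these lines might look like the sketch below. The function and its parameters are hypothetical; in practice the ordinal could come from something like `torch.cuda.current_device()`, but it is passed in here so the core stays framework-independent:

```python
# Sketch only: mapping an external framework's "current device" to a
# FlyDSL Device while checking that the framework's active stack
# matches FlyDSL's single runtime choice. All names are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class Device:
    ordinal: int  # logical device id within the active stack

def device_from_framework(ordinal: int, framework_stack: str,
                          flydsl_runtime_kind: str) -> Device:
    if framework_stack != flydsl_runtime_kind:
        raise RuntimeError(
            f"framework uses {framework_stack!r} but FlyDSL's runtime is "
            f"{flydsl_runtime_kind!r}; per-process stacks must match"
        )
    return Device(ordinal=ordinal)
```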
Open questions
- **When to validate compile/runtime pairing** — import time vs first `@jit` vs first launch.
- **AOT cache metadata** — Whether to record `compile_backend` and target arch in on-disk artifacts for strict runtime checks.