
[Feature]: RFC: Device and runtime layer (aligned with pluggable compile backends) #255

@Peter9606


Suggestion Description

RFC: Device and runtime layer (aligned with pluggable compile backends)

Status: Draft
Author: Peter Han
Created: 2026-03-18

Summary

This RFC proposes a runtime layer for FlyDSL that complements the previous pluggable GPU compile backend work. The runtime exposes Device, Stream, and Event in a vendor-neutral way. Concrete APIs (for example HIP on AMD ROCm vs CUDA driver/runtime on NVIDIA) live inside a single per-process device runtime implementation.

Illustrative stacks: we use ROCm/HIP (current default) and NVIDIA CUDA as examples. The same ideas apply to any other stack that offers stream-like queues and event-like synchronization.

Motivation

  1. Compile vs runtime separation — Compile backends decide how Fly IR is lowered and which native libraries the JIT loads. Execution still requires a consistent way to allocate memory, submit work, and synchronize — without scattering hip* or cu* calls through the Python DSL and generic compiler code.

  2. Multiple stacks across builds and processes — Different installations or processes may target AMD (HIP) or NVIDIA (CUDA). The abstractions should be shared; implementations are swappable.

  3. Opaque Stream/Event — Higher layers should depend only on abstract stream and event handles, not on hipStream_t, cudaStream_t, or other vendor types.
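As a sketch of the third point (all names here are illustrative, not existing FlyDSL API), the opaque handles can be as thin as a runtime kind plus an integer token; the vendor types never leave the runtime implementation:

```python
from dataclasses import dataclass

# Hypothetical sketch: an opaque handle carries only the runtime kind that
# minted it and an integer token. The token indexes a runtime-private table
# of native handles (hipStream_t, cudaStream_t, ...), so vendor types never
# appear in Python or portable C++ surfaces.
@dataclass(frozen=True)
class StreamHandle:
    runtime_kind: str  # e.g. "hip" or "cuda"
    token: int

@dataclass(frozen=True)
class EventHandle:
    runtime_kind: str
    token: int
```

Freezing the dataclasses keeps the handles hashable and prevents higher layers from mutating them.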

Non-goals

  1. Multiple GPU runtimes in one process — We do not support loading both ROCm/HIP and NVIDIA CUDA (or two unrelated native GPU ABIs) in the same OS process. A process picks one stack for its lifetime.

  2. Mixed vendor devices in one process — We do not design for “some tensors on HIP and some on CUDA” concurrently in a single FlyDSL process. Multiple GPUs are still in scope when they are multiple devices under the same stack (e.g. several AMD GPUs or several NVIDIA GPUs).

  3. Specifying a second compile backend implementation in this RFC — CUDA-oriented lowering and MLIR pipelines are examples for alignment; concrete CUDA compile-backend work can be a separate RFC.

Core invariants

  1. Single runtime stack per process
    Exactly one DeviceRuntime (name illustrative) is active. All Device, Stream, and Event instances are interpreted by that implementation.

  2. Compile backend matches runtime
    The active compile backend (e.g. rocm today) must match the loaded runtime kind. On mismatch, FlyDSL should fail explicitly (e.g. at first @jit compile or first launch), not proceed with undefined behavior.

  3. Opaque handles
    Python and portable C++ surfaces expose opaque stream/event (and optionally device memory) handles. Vendor types and headers are confined to runtime implementation translation units (similar in spirit to isolating HIP inside FlyRocmRuntimeWrappers.cpp today).
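A minimal sketch of invariant 2's fail-fast check, assuming a static table of compatible pairs (`COMPATIBLE_PAIRS` and `validate_pairing` are illustrative names, not existing FlyDSL API):

```python
# Hypothetical mapping from compile backend id to the runtime kind it
# requires; a future cuda backend is included only as an example.
COMPATIBLE_PAIRS = {
    "rocm": "hip",
    "cuda": "cuda",
}

def validate_pairing(compile_backend: str, runtime_kind: str) -> None:
    """Fail explicitly on mismatch instead of proceeding with UB."""
    expected = COMPATIBLE_PAIRS.get(compile_backend)
    if expected != runtime_kind:
        raise RuntimeError(
            f"compile backend {compile_backend!r} requires runtime "
            f"{expected!r}, but {runtime_kind!r} is loaded"
        )
```

Calling this at first `@jit` compile (or first launch) would satisfy the invariant; see also the open question on when exactly to validate.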

Layered design

  • Compile backend (existing direction) — MLIR pipelines, gpu.module targets, ExecutionEngine shared libraries, compile-time cache key inputs.

  • Device — Logical device id, ordinal, and capabilities (memory, wave/warp width, etc.). Does not select the MLIR pipeline.

  • DeviceRuntime — Single per-process implementation: allocation, host↔device copies, create/destroy of Stream and Event, synchronization, and the glue required for kernel launch.

  • Stream / Event (abstract) — Same conceptual model as HIP/CUDA: an asynchronous queue and synchronization primitives. HIP implements them with hipStream* / hipEvent*; CUDA with cudaStream* / cudaEvent* (or driver API equivalents), behind the runtime boundary.

Optional: buffers (device pointers + lifetime) are also owned by the runtime, so upper layers never call hipMalloc or cudaMalloc directly.
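The DeviceRuntime layer could have roughly the following shape (a sketch only — method names and signatures are assumptions for this RFC, with opaque integer tokens standing in for the handle types above):

```python
from abc import ABC, abstractmethod

# Illustrative interface for the single per-process device runtime.
# A HIP implementation would back these with hip* calls, a CUDA one
# with cuda* / cu* calls, confined to its own translation units.
class DeviceRuntime(ABC):
    @abstractmethod
    def allocate(self, device: int, nbytes: int) -> int:
        """Allocate device memory; returns an opaque buffer token."""

    @abstractmethod
    def copy_host_to_device(self, dst: int, src: bytes) -> None:
        """Copy host bytes into the buffer identified by dst."""

    @abstractmethod
    def create_stream(self, device: int) -> int:
        """Create an asynchronous queue; returns an opaque stream token."""

    @abstractmethod
    def create_event(self) -> int:
        """Create a synchronization primitive; returns an opaque token."""

    @abstractmethod
    def synchronize(self, stream: int) -> None:
        """Block until all work submitted to the stream has completed."""
```

Kernel-launch glue would live here as well, but its signature depends on the compile backend's artifact format, so it is left out of this sketch.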

Relationship to pluggable compile backends

  • Identifiers — Define a clear mapping between compile backend id (e.g. FLYDSL_COMPILE_BACKEND) and runtime kind (e.g. rocm ↔ HIP runtime; a future cuda ↔ CUDA runtime). Avoid two unrelated naming schemes.

  • Extensions — A vendor or downstream package may register both a compile backend and a matching runtime, or the project may ship paired defaults so a single configuration switch selects a consistent pair.

  • Caching — JIT disk caches remain partitioned by compile backend and target architecture (and related inputs). Device ordinal typically does not belong in the cache key unless multiple devices in one process can imply different ISAs and incorrect reuse must be prevented.
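To make the caching bullet concrete, a disk-cache key partitioned as described might look like the following sketch (function and parameter names are illustrative; note the device ordinal is deliberately not an input):

```python
import hashlib

# Hypothetical JIT disk-cache key: partitioned by compile backend and
# target architecture (plus the IR itself), excluding the device ordinal.
def cache_key(compile_backend: str, target_arch: str, ir_text: str) -> str:
    h = hashlib.sha256()
    for part in (compile_backend, target_arch, ir_text):
        h.update(part.encode())
        h.update(b"\x00")  # separator so ("a", "bc") != ("ab", "c")
    return h.hexdigest()
```

If a process can hold devices with different ISAs, the target arch input already distinguishes their artifacts, which is why the ordinal itself stays out of the key.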

Registration and initialization

  • Runtime — A registration hook (e.g. register_device_runtime) remains useful for tests and plugins, but its semantics should reflect at most one active stack: e.g. register once, or set a default HIP implementation that is replaced only before any GPU work.

  • Compile — Continues to use the existing compile-backend registry; validation ties the two together early.
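The at-most-one-active-stack semantics for the registration hook could be sketched as follows (module-level state and names are illustrative; `register_device_runtime` is the name floated above, not a finalized API):

```python
# Hypothetical single-slot registry: the runtime may be replaced only
# before any GPU work has observed it.
_active_runtime = None
_runtime_used = False

def register_device_runtime(runtime) -> None:
    global _active_runtime
    if _runtime_used:
        raise RuntimeError(
            "cannot replace the device runtime after first use"
        )
    _active_runtime = runtime

def get_device_runtime():
    global _runtime_used
    if _active_runtime is None:
        raise RuntimeError("no device runtime registered")
    _runtime_used = True  # from here on, the stack is locked in
    return _active_runtime
```

This keeps the hook useful for tests and plugins (register a fake before any GPU work) while enforcing the single-stack invariant for the rest of the process lifetime.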

Interop with frameworks (e.g. PyTorch)

Adapters may map an external “current CUDA device” or ROCm-visible device to FlyDSL’s Device, but the core runtime API should not depend on PyTorch. Documentation should state when the external framework’s active stack must match FlyDSL’s single runtime choice.
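A hypothetical adapter along these lines (the `Device` shape and the stack-match check are illustrative; the ordinal would come from the framework, e.g. torch.cuda.current_device(), which is not imported here to keep the core free of a PyTorch dependency):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Device:
    runtime_kind: str  # e.g. "hip" or "cuda"
    ordinal: int

# Illustrative adapter: maps an external framework's current device
# ordinal onto a FlyDSL Device, refusing to proceed if the framework's
# active stack differs from this process's single runtime choice.
def adapt_external_device(ordinal: int, framework_stack: str,
                          active_runtime_kind: str) -> Device:
    if framework_stack != active_runtime_kind:
        raise RuntimeError(
            f"framework uses {framework_stack!r} but this process's "
            f"runtime is {active_runtime_kind!r}"
        )
    return Device(active_runtime_kind, ordinal)
```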

Open questions

  1. When to validate compile/runtime pairing — import time vs first @jit vs first launch.

  2. AOT cache metadata — Whether to record compile_backend and target arch in on-disk artifacts for strict runtime checks.
