
[Feature]: RFC: Device and runtime layer (aligned with pluggable compile backends) #255

@Peter9606


Suggestion Description

RFC: Device and runtime layer (aligned with pluggable compile backends)

Status: Draft
Author: Peter Han
Created: 2026-03-18

Summary

This RFC proposes a runtime layer for FlyDSL that complements the previous pluggable GPU compile backend work. The runtime exposes Device, Stream, and Event in a vendor-neutral way. Concrete APIs (for example HIP on AMD ROCm vs CUDA driver/runtime on NVIDIA) live inside a single per-process device runtime implementation.

Illustrative stacks: we use ROCm/HIP (current default) and NVIDIA CUDA as examples. The same ideas apply to any other stack that offers stream-like queues and event-like synchronization.

Motivation

  1. Compile vs runtime separation — Compile backends decide how Fly IR is lowered and which native libraries the JIT loads. Execution still requires a consistent way to allocate memory, submit work, and synchronize — without scattering hip* or cu* calls through the Python DSL and generic compiler code.

  2. Multiple stacks across builds and processes — Different installations or processes may target AMD (HIP) or NVIDIA (CUDA). The abstractions should be shared; implementations are swappable.

  3. Opaque Stream/Event — Higher layers should depend only on abstract stream and event handles, not on hipStream_t, cudaStream_t, or other vendor types.
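As a sketch of the third point (all names here are illustrative, not existing FlyDSL API), the opaque handles can be as thin as a runtime kind plus an integer token; the vendor types never leave the runtime implementation:

```python
from dataclasses import dataclass

# Hypothetical sketch: an opaque handle carries only the runtime kind that
# minted it and an integer token. The token indexes a runtime-private table
# of native handles (hipStream_t, cudaStream_t, ...), so vendor types never
# appear in Python or portable C++ surfaces.
@dataclass(frozen=True)
class StreamHandle:
    runtime_kind: str  # e.g. "hip" or "cuda"
    token: int

@dataclass(frozen=True)
class EventHandle:
    runtime_kind: str
    token: int
```

Freezing the dataclasses keeps the handles hashable and prevents higher layers from mutating them.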

Non-goals

  1. Multiple GPU runtimes in one process — We do not support loading both ROCm/HIP and NVIDIA CUDA (or two unrelated native GPU ABIs) in the same OS process. A process picks one stack for its lifetime.

  2. Mixed vendor devices in one process — We do not design for “some tensors on HIP and some on CUDA” concurrently in a single FlyDSL process. Multiple GPUs are still in scope when they are multiple devices under the same stack (e.g. several AMD GPUs or several NVIDIA GPUs).

  3. Specifying a second compile backend implementation in this RFC — CUDA-oriented lowering and MLIR pipelines are examples for alignment; concrete CUDA compile-backend work can be a separate RFC.

Core invariants

  1. Single runtime stack per process
    Exactly one DeviceRuntime (name illustrative) is active. All Device, Stream, and Event instances are interpreted by that implementation.

  2. Compile backend matches runtime
    The active compile backend (e.g. rocm today) must match the loaded runtime kind. On mismatch, FlyDSL should fail explicitly (e.g. at first @jit compile or first launch), not proceed with undefined behavior.

  3. Opaque handles
    Python and portable C++ surfaces expose opaque stream/event (and optionally device memory) handles. Vendor types and headers are confined to runtime implementation translation units (similar in spirit to isolating HIP inside FlyRocmRuntimeWrappers.cpp today).
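A minimal sketch of invariant 2's fail-fast check, assuming a static table of compatible pairs (`COMPATIBLE_PAIRS` and `validate_pairing` are illustrative names, not existing FlyDSL API):

```python
# Hypothetical mapping from compile backend id to the runtime kind it
# requires; a future cuda backend is included only as an example.
COMPATIBLE_PAIRS = {
    "rocm": "hip",
    "cuda": "cuda",
}

def validate_pairing(compile_backend: str, runtime_kind: str) -> None:
    """Fail explicitly on mismatch instead of proceeding with UB."""
    expected = COMPATIBLE_PAIRS.get(compile_backend)
    if expected != runtime_kind:
        raise RuntimeError(
            f"compile backend {compile_backend!r} requires runtime "
            f"{expected!r}, but {runtime_kind!r} is loaded"
        )
```

Calling this at first `@jit` compile (or first launch) would satisfy the invariant; see also the open question on when exactly to validate.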

Layered design

  • Compile backend (existing direction) — MLIR pipelines, gpu.module targets, ExecutionEngine shared libraries, compile-time cache key inputs.

  • Device — Logical device id, ordinal, and capabilities (memory, wave/warp width, etc.). Does not select the MLIR pipeline.

  • DeviceRuntime — Single per-process implementation: allocation, host↔device copies, create/destroy of Stream and Event, synchronization, and the glue required for kernel launch.

  • Stream / Event (abstract) — Same conceptual model as HIP/CUDA: an asynchronous queue and synchronization primitives. HIP implements them with hipStream* / hipEvent*; CUDA with cudaStream* / cudaEvent* (or driver API equivalents), behind the runtime boundary.

Optional: buffers (device pointers + lifetime) are also owned by the runtime, so upper layers never call hipMalloc or cudaMalloc directly.
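The DeviceRuntime layer could have roughly the following shape (a sketch only — method names and signatures are assumptions for this RFC, with opaque integer tokens standing in for the handle types above):

```python
from abc import ABC, abstractmethod

# Illustrative interface for the single per-process device runtime.
# A HIP implementation would back these with hip* calls, a CUDA one
# with cuda* / cu* calls, confined to its own translation units.
class DeviceRuntime(ABC):
    @abstractmethod
    def allocate(self, device: int, nbytes: int) -> int:
        """Allocate device memory; returns an opaque buffer token."""

    @abstractmethod
    def copy_host_to_device(self, dst: int, src: bytes) -> None:
        """Copy host bytes into the buffer identified by dst."""

    @abstractmethod
    def create_stream(self, device: int) -> int:
        """Create an asynchronous queue; returns an opaque stream token."""

    @abstractmethod
    def create_event(self) -> int:
        """Create a synchronization primitive; returns an opaque token."""

    @abstractmethod
    def synchronize(self, stream: int) -> None:
        """Block until all work submitted to the stream has completed."""
```

Kernel-launch glue would live here as well, but its signature depends on the compile backend's artifact format, so it is left out of this sketch.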

Relationship to pluggable compile backends

  • Identifiers — Define a clear mapping between compile backend id (e.g. FLYDSL_COMPILE_BACKEND) and runtime kind (e.g. rocm ↔ HIP runtime; a future cuda ↔ CUDA runtime). Avoid two unrelated naming schemes.

  • Extensions — A vendor or downstream package may register both a compile backend and a matching runtime, or the project may ship paired defaults so a single configuration switch selects a consistent pair.

  • Caching — JIT disk caches remain partitioned by compile backend and target architecture (and related inputs). Device ordinal typically does not belong in the cache key unless multiple devices in one process can imply different ISAs and incorrect reuse must be prevented.
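To make the caching bullet concrete, a disk-cache key partitioned as described might look like the following sketch (function and parameter names are illustrative; note the device ordinal is deliberately not an input):

```python
import hashlib

# Hypothetical JIT disk-cache key: partitioned by compile backend and
# target architecture (plus the IR itself), excluding the device ordinal.
def cache_key(compile_backend: str, target_arch: str, ir_text: str) -> str:
    h = hashlib.sha256()
    for part in (compile_backend, target_arch, ir_text):
        h.update(part.encode())
        h.update(b"\x00")  # separator so ("a", "bc") != ("ab", "c")
    return h.hexdigest()
```

If a process can hold devices with different ISAs, the target arch input already distinguishes their artifacts, which is why the ordinal itself stays out of the key.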

Registration and initialization

  • Runtime — A registration hook (e.g. register_device_runtime) remains useful for tests and plugins, but its semantics should reflect at most one active stack: e.g. register once, or set a default HIP implementation that is replaced only before any GPU work.

  • Compile — Continues to use the existing compile-backend registry; validation ties the two together early.
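The at-most-one-active-stack semantics for the registration hook could be sketched as follows (module-level state and names are illustrative; `register_device_runtime` is the name floated above, not a finalized API):

```python
# Hypothetical single-slot registry: the runtime may be replaced only
# before any GPU work has observed it.
_active_runtime = None
_runtime_used = False

def register_device_runtime(runtime) -> None:
    global _active_runtime
    if _runtime_used:
        raise RuntimeError(
            "cannot replace the device runtime after first use"
        )
    _active_runtime = runtime

def get_device_runtime():
    global _runtime_used
    if _active_runtime is None:
        raise RuntimeError("no device runtime registered")
    _runtime_used = True  # from here on, the stack is locked in
    return _active_runtime
```

This keeps the hook useful for tests and plugins (register a fake before any GPU work) while enforcing the single-stack invariant for the rest of the process lifetime.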

Interop with frameworks (e.g. PyTorch)

Adapters may map an external “current CUDA device” or ROCm-visible device to FlyDSL’s Device, but the core runtime API should not depend on PyTorch. Documentation should state when the external framework’s active stack must match FlyDSL’s single runtime choice.
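A hypothetical adapter along these lines (the `Device` shape and the stack-match check are illustrative; the ordinal would come from the framework, e.g. torch.cuda.current_device(), which is not imported here to keep the core free of a PyTorch dependency):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Device:
    runtime_kind: str  # e.g. "hip" or "cuda"
    ordinal: int

# Illustrative adapter: maps an external framework's current device
# ordinal onto a FlyDSL Device, refusing to proceed if the framework's
# active stack differs from this process's single runtime choice.
def adapt_external_device(ordinal: int, framework_stack: str,
                          active_runtime_kind: str) -> Device:
    if framework_stack != active_runtime_kind:
        raise RuntimeError(
            f"framework uses {framework_stack!r} but this process's "
            f"runtime is {active_runtime_kind!r}"
        )
    return Device(active_runtime_kind, ordinal)
```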

Open questions

  1. When to validate compile/runtime pairing — import time vs first @jit vs first launch.

  2. AOT cache metadata — Whether to record compile_backend and target arch in on-disk artifacts for strict runtime checks.
