
feRcuda runtime

feRcuda is a userspace GPU operating layer. It installs a single governed memory and execution regime inside a CUDA process so frameworks become tenants of one machine instead of competing owners of the device.

At the core is a single-pool TLSF memory regime. A single LD_PRELOAD hook intercepts every cudaMalloc from every library (PyTorch, Triton, bitsandbytes, cuBLAS, cuDNN, JAX) and routes it through a deterministic O(1) allocator. No library modifications. No cross-framework heap fragmentation. fallback=0 is the invariant.

On top of that regime, feRcuda provides kernel-style lifecycle control for GPU resources: session ownership, rebootable runtime state, logical buffer management, JIT dispatch, ignition for pre-planned allocation, fused Triton kernels, and streaming inference support for HuggingFace-style models.

The current stack is:

  • CUDA driver and low-level native mechanisms at the bottom
  • Rust as the stable machine and control plane above them
  • Python as the operator surface and library bridge over that Rust-owned substrate

That distinction matters. Python is not a second runtime architecture. It is how frameworks and scripts drive the machine that the native and Rust layers establish.

Quick start

git clone https://github.com/DaronPopov/feRcuda_runtime.git
cd feRcuda_runtime
./uvw run fercuda-install
./uvw run fercuda-install runtimes --list

That installer is the supported build path. It keeps the toolchain and runtime state self-contained under build/, auto-bootstraps a repo-local Rust toolchain if cargo is missing, builds the native and Rust runtime artifacts, and installs the default hf frontend bundle.

Run the Mistral-7B streaming chatbot:

FERCUDA_NATIVE_BUILD=build LD_PRELOAD=build/libptx_hook.so ./uvw run python tests/integration/benchmarks/llm_inference/chat_mistral.py

Or Qwen3.5-4B with fused Triton kernels:

FERCUDA_NATIVE_BUILD=build LD_PRELOAD=build/libptx_hook.so ./uvw run python tests/integration/benchmarks/llm_inference/chat_qwen.py

Requirements:

  • NVIDIA driver 580+, CUDA toolkit 13.0+
  • Linux with a modern NVIDIA GPU
  • uv installed and on PATH

Use uvw for repo-local commands. It pins:

  • UV_PROJECT_ENVIRONMENT=build/.venv
  • PYTHONPYCACHEPREFIX=build/pycache
  • CARGO_TARGET_DIR=build/rust

That keeps the system self-contained under build/. Extra frontends stay opt-in through installer bundles.

Layout

Path Role
native/ptx_core/ Core: TLSF allocator, elastic pool, VMM/VFS OS glue, memcpy, kernels, JIT lowering, Linux preload hook
native/fercuda_api/ Session, planner, pools, c_api.cu
include/fercuda/ Public C/C++/CUDA headers (api/, jit/, daemon/, runtime/)
build/ Native build outputs (.so artifacts)
python/ fercuda-runtime Python package
rust/ Cargo workspace (public + internal + external crates)
tests/integration/benchmarks/ Benchmarks grouped by allocator, framework, mechanics, and LLM inference
tests/integration/native/ Native C++ smoke tests grouped by VMM subsystem
python/tests/ Python pytest coverage for memory, inference, and runtime behavior
tests/ Benchmarks, native smoke tests, and standalone validation scripts
results/paper/ Paper (LaTeX source + PDF)
docs/ Design documents and benchmark results
examples/ Ad hoc example scripts and experiments; not part of the required build

Build artifacts

Library Purpose
libptx_hook.so LD_PRELOAD intercept — routes all cudaMalloc/cudaFree through TLSF
libfercuda_capi.so Shared C API (Python loads via ctypes)
libptx_core.so Hot runtime, TLSF allocator, VMM OS glue
libptx_kernels.so Tensor/graphics-style GPU kernels
libfercuda_ignition.so Ignition engine — pre-planned tensor allocation
libfer_triton.so Triton CUBIN loader + direct cuLaunchKernel dispatch
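
After an install, one quick sanity check is to confirm these shared objects were synced into build/ (a minimal sketch; it assumes the artifacts sit directly under build/, which matches the LD_PRELOAD paths used elsewhere in this README):

from pathlib import Path

# Runtime artifacts from the table above; adjust the list if the build layout differs.
expected = [
    "libptx_hook.so",
    "libfercuda_capi.so",
    "libptx_core.so",
    "libptx_kernels.so",
    "libfercuda_ignition.so",
    "libfer_triton.so",
]
missing = [name for name in expected if not (Path("build") / name).exists()]
print("missing artifacts:", missing or "none")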

Memory regime

The core idea: one GPU pool, every library is a tenant.

LD_PRELOAD=build/libptx_hook.so intercepts all cudaMalloc calls at the process level. PyTorch, bitsandbytes, cuBLAS, Triton — none of them hold their own memory pools. They allocate and free through TLSF without knowing it.

When the hook is active, the intended shape is one process-level runtime pool. Native Session objects should attach to that pool, not create parallel pools of their own.

  • fallback=0 is the invariant. If fallback > 0, something escaped the pool.
  • TLSF is O(1) — bounded allocation time, no fragmentation.
  • torch.cuda.memory_reserved() is always ~0.
  • No torch.cuda.empty_cache() hacks needed.
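
A minimal way to observe the regime from the PyTorch side (a sketch using only standard torch APIs; it assumes the process was launched with LD_PRELOAD=build/libptx_hook.so as shown in the quick start):

import torch

# With the hook active, this allocation is served by the process-level TLSF pool,
# so PyTorch's caching allocator should report essentially nothing reserved.
x = torch.empty(1024, 1024, device="cuda")
print("torch.cuda.memory_reserved():", torch.cuda.memory_reserved())  # expected: ~0 under the hook

The fallback counter itself is reported through the hook telemetry bindings (fercuda_runtime.runtime.telemetry; see the C API section below).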

Machine Invariants

These are the rules the repository is trying to preserve:

  • one GPU pool per process when the hook is active
  • no parallel session-owned pool when the hook-owned pool already exists
  • fallback=0 for workloads that are meant to be fully covered
  • buffer_id is the canonical internal execution handle
  • raw device_ptr export is an interop boundary, not the internal execution identity
  • Rust owns machine logic above CUDA; Python describes workloads and bridges libraries onto that substrate

If a change weakens one of those rules, it should be treated as an architectural regression, not just a local implementation detail.

See docs/concepts/memory-regime.md for the full reference.

Python package

The fercuda-runtime package provides:

Module Purpose
scripting_api/ RuntimeSession, RuntimeConfig, ignition (TensorSpec, IgnitionPlan, ignite())
runtime/ ctypes bindings, session management, telemetry, constants
inference/ Model adapters (GGUF), tokenizers, generation pipeline (stream_generate, generate)
guest_hooks/ HF Transformers pre-import patches (hf_hook), JAX plugin hooks, Triton hook (TritonKernel)
platform/ Bootstrap, library discovery, platform compatibility checks

The repo-local uv project exposes fercuda-runtime from python/ as an editable source, so the supported entry point remains ./uvw run fercuda-install.
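
A quick check that the editable install resolves to the in-repo source (the printed path should land under python/):

./uvw run python -c "import fercuda_runtime; print(fercuda_runtime.__file__)"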

Benchmarks and chatbots

Script What it does
chat_qwen.py Streaming chatbot — Qwen3.5-4B (4-bit), with fused Triton kernels (--fused/--no-fused)
chat_mistral.py Streaming chatbot — Mistral-7B-Instruct-v0.2 (4-bit)
triton_kernels.py Fused Triton kernels: RMSNorm, SiLU×mul (SwiGLU), rotary embedding
patch_model.py Monkey-patches Qwen3.5 model with fused kernels
bench_triton_regime.py Triton kernels over TLSF: elementwise, softmax, matmul, launch overhead
bench_hf_transformers.py HF Transformers integration benchmark
bench_baseline.py Baseline allocation and throughput
bench_coexistence.py Multi-library coexistence on TLSF
bench_mechanics.py Low-level regime mechanics
bench_throughput.py Allocation throughput
bench_torch_regime.py PyTorch ops over TLSF
bench_jax_regime.py JAX ops over TLSF

Fused Triton kernels

Three kernels targeting per-layer elementwise hotspots in Qwen3.5:

  • fused_rms_norm — single kernel replacing 6 PyTorch ops (cast → pow → mean → rsqrt → mul → cast). Handles the Qwen3.5 (1 + weight) variant.
  • fused_silu_mul — fused SwiGLU activation silu(gate) * up in one pass, eliminating an intermediate tensor.
  • fused_rotary_emb — fused rotate_half + cos/sin multiply for rotary position embeddings.

Applied via patch_qwen35_with_triton(model) after model load. All kernels run on the normal torch CUDA stream — TLSF handles allocation transparently.
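
For reference, the unfused math that the first two kernels replace looks roughly like this in plain PyTorch (a sketch of the op sequence described above, not the Triton implementation; the eps value and dtypes are assumptions):

import torch

def rms_norm_reference(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # cast -> pow -> mean -> rsqrt -> mul -> cast, with the Qwen3.5 (1 + weight) variant
    h = x.to(torch.float32)
    variance = h.pow(2).mean(-1, keepdim=True)
    h = h * torch.rsqrt(variance + eps)
    return ((1.0 + weight) * h).to(x.dtype)

def silu_mul_reference(gate: torch.Tensor, up: torch.Tensor) -> torch.Tensor:
    # SwiGLU activation: silu(gate) * up, done in one pass by the fused kernel
    return torch.nn.functional.silu(gate) * up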

Rust workspace

Public crates:

Crate Purpose
ptx-runtime Core feRcuda runtime bindings
fercuda-ignition TLSF ignition cdylib
fercuda-inference Rust inference bridge

Internal: ptx-sys (low-level sys crate)

External crates (rust/external/fer-*):

Crate Purpose
fer-cudarc cudarc 0.13.9 with ptx-alloc
fer-cudarc-ptx cudarc-ptx 0.19.0 with ptx-os feature
fer-aten PyTorch TLSF allocator
fer-torch PyTorch + TLSF integration
fer-ptx-os PTX-OS runtime: RegimeRuntimeCore, DeviceBox
fer-math Pure Rust math: nalgebra + optional faer/candle
fer-ml Pure Rust ML utils: safetensors, serde, sampling
fer-candle-core Modified Candle core with TLSF/ptx-os
fer-candle Candle ML backend: FusedExecutor, GraphCache, PtxDevice
fer-triton Triton CUBIN loader + direct cuLaunchKernel dispatch
fer-ext-kernels FFI to libfer_ext_kernels.so (Q4K GEMV, RMS norm)
fer-bindgen Custom bindgen_cuda fork
fer-ug-cuda ug-cuda shim for candle

Ignition engine

Pre-planned tensor allocation via the ignition engine:

from fercuda_runtime.scripting_api.ignition import TensorSpec, IgnitionPlan, ignite

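# `session` below is an already-created runtime session (RuntimeSession from scripting_api);
# constructing it is outside the scope of this snippet.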
plan = IgnitionPlan(tensors=(TensorSpec(rank=1, dims=(64,)),), warmup_passes=1)
region = ignite(session, plan)  # RAII — memory freed on region close

Rust cdylib: rust/crates/public/fercuda-ignition/libfercuda_ignition.so

Native build

The supported build path is:

./uvw run fercuda-install

Or, for the OS/runtime layer without default frontend bundles:

./uvw run fercuda-install --no-default-frontends

That installer owns both the native CMake build and the Rust ignition cdylib build, and syncs the runtime-facing shared objects into build/. If cargo is missing, it bootstraps a minimal repo-local Rust toolchain under build/.cargo and build/.rustup.

Optional Inference Runtimes

The installer now has an explicit bundle model:

  • ./uvw run fercuda-install builds native + installs the default hf bundle
  • ./uvw run fercuda-install --no-default-frontends builds only the OS/runtime layer
  • ./uvw run fercuda-install --with gguf adds extra frontends during the main install

List the available bundles:

./uvw run fercuda-install runtimes --list

Preview the default install commands without executing them:

./uvw run fercuda-install --print-only

Build native only:

./uvw run fercuda-install --no-default-frontends

Build native + add optional bundles:

./uvw run fercuda-install --with gguf
./uvw run fercuda-install --with gguf --with dev

Print the exact commands for one or more bundles without rebuilding native:

./uvw run fercuda-install runtimes --print-only hf gguf

Install selected bundles later:

./uvw run fercuda-install runtimes hf
./uvw run fercuda-install runtimes gguf
./uvw run fercuda-install runtimes vllm

Tests

./uvw run fercuda-install --no-default-frontends
FERCUDA_NATIVE_BUILD=build LD_PRELOAD=build/libptx_hook.so ./uvw run python -m pytest python/tests/memory -v
FERCUDA_NATIVE_BUILD=build LD_PRELOAD=build/libptx_hook.so ./uvw run python -m pytest python/tests/inference -v
FERCUDA_NATIVE_BUILD=build LD_PRELOAD=build/libptx_hook.so ./uvw run python -m pytest python/tests/memory/test_subsystem_reboot.py -v
ctest --test-dir build --output-on-failure
cd rust && cargo test -p fercuda-inference                # Rust inference tests

C API

Key functions in include/fercuda/api/c_api.h:

  • fer_alloc_buffer, fer_free_buffer, fer_upload_bytes, fer_export_buffer_device_ptr
  • fer_ptxlaunch_submit — kernel dispatch
  • fer_persistent_dispatcher_boot — boot persistent dispatcher
  • fer_stream_sync — synchronize session stream
  • fercuda_ignite — ignition engine entry point

Hook telemetry: include/fercuda/api/intercept_telemetry.h and fercuda_runtime.runtime.telemetry (ctypes into libptx_hook.so).
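
As a smoke check that the C API surface is exported, the shared object can be opened directly with ctypes (a sketch; it loads from the default build/ location and does not assume any call signatures):

import ctypes

capi = ctypes.CDLL("build/libfercuda_capi.so")
for name in (
    "fer_alloc_buffer", "fer_free_buffer", "fer_upload_bytes",
    "fer_export_buffer_device_ptr", "fer_ptxlaunch_submit",
    "fer_persistent_dispatcher_boot", "fer_stream_sync", "fercuda_ignite",
):
    # Attribute access on a CDLL does a dlsym lookup; hasattr reports whether the symbol exists.
    print(f"{name}: {'found' if hasattr(capi, name) else 'missing'}")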

Execution ABI

  • C: include/fercuda/api/execution_contract.h — entrypoints (tensor, jit_intent, jit_launch, resident_daemon, framework_allocator), frontends, ABI version.
  • Python: python/src/fercuda_runtime/execution_contract.py — same names plus rust_ferrite_torchframework_allocator.
