feRcuda is a userspace GPU operating layer. It installs a single governed memory and execution regime inside a CUDA process so frameworks become tenants of one machine instead of competing owners of the device.
At the core is a single-pool TLSF memory regime. One LD_PRELOAD intercepts every cudaMalloc from every library — PyTorch, Triton, bitsandbytes, cuBLAS, cuDNN, JAX — and routes it through a deterministic O(1) allocator. No library modifications. No cross-framework heap fragmentation. fallback=0 is the invariant.
On top of that regime, feRcuda provides kernel-style lifecycle control for GPU resources: session ownership, rebootable runtime state, logical buffer management, JIT dispatch, ignition for pre-planned allocation, fused Triton kernels, and streaming inference support for HuggingFace-style models.
The current stack is:
- CUDA driver and low-level native mechanisms at the bottom
- Rust as the stable machine and control plane above them
- Python as the operator surface and library bridge over that Rust-owned substrate
That distinction matters. Python is not a second runtime architecture. It is how frameworks and scripts drive the machine that the native and Rust layers establish.
```bash
git clone https://github.com/DaronPopov/feRcuda_runtime.git
cd feRcuda_runtime
./uvw run fercuda-install
./uvw run fercuda-install runtimes --list
```

That installer is the supported build path. It keeps the toolchain and runtime state self-contained under build/, auto-bootstraps a repo-local Rust toolchain if cargo is missing, builds the native and Rust runtime artifacts, and installs the default hf frontend bundle.
Run the Mistral-7B streaming chatbot:
```bash
FERCUDA_NATIVE_BUILD=build LD_PRELOAD=build/libptx_hook.so ./uvw run python tests/integration/benchmarks/llm_inference/chat_mistral.py
```

Or Qwen3.5-4B with fused Triton kernels:

```bash
FERCUDA_NATIVE_BUILD=build LD_PRELOAD=build/libptx_hook.so ./uvw run python tests/integration/benchmarks/llm_inference/chat_qwen.py
```

Requirements:
- NVIDIA driver 580+, CUDA toolkit 13.0+
- Linux with a modern NVIDIA GPU
- uv (install it separately if it is not already available)
Use uvw for repo-local commands. It pins:
```bash
UV_PROJECT_ENVIRONMENT=build/.venv
PYTHONPYCACHEPREFIX=build/pycache
CARGO_TARGET_DIR=build/rust
```
That keeps the system self-contained under build/. Extra frontends stay opt-in through installer bundles.
| Path | Role |
|---|---|
| `native/ptx_core/` | Core: TLSF allocator, elastic pool, VMM/VFS OS glue, memcpy, kernels, JIT lowering, Linux preload hook |
| `native/fercuda_api/` | Session, planner, pools, `c_api.cu` |
| `include/fercuda/` | Public C/C++/CUDA headers (`api/`, `jit/`, `daemon/`, `runtime/`) |
| `build/` | Native build outputs (`.so` artifacts) |
| `python/` | `fercuda-runtime` Python package |
| `rust/` | Cargo workspace (public + internal + external crates) |
| `tests/integration/benchmarks/` | Benchmarks grouped by allocator, framework, mechanics, and LLM inference |
| `tests/integration/native/` | Native C++ smoke tests grouped by VMM subsystem |
| `python/tests/` | Python pytest coverage for memory, inference, and runtime behavior |
| `tests/` | Benchmarks, native smoke tests, and standalone validation scripts |
| `results/paper/` | Paper (LaTeX source + PDF) |
| `docs/` | Design documents and benchmark results |
| `examples/` | Ad hoc example scripts and experiments; not part of the required build |
| Library | Purpose |
|---|---|
| `libptx_hook.so` | LD_PRELOAD intercept — routes all cudaMalloc/cudaFree through TLSF |
| `libfercuda_capi.so` | Shared C API (Python loads via ctypes) |
| `libptx_core.so` | Hot runtime, TLSF allocator, VMM OS glue |
| `libptx_kernels.so` | Tensor/graphics-style GPU kernels |
| `libfercuda_ignition.so` | Ignition engine — pre-planned tensor allocation |
| `libfer_triton.so` | Triton CUBIN loader + direct cuLaunchKernel dispatch |
The core idea: one GPU pool, every library is a tenant.
LD_PRELOAD=build/libptx_hook.so intercepts all cudaMalloc calls at the process level. PyTorch, bitsandbytes, cuBLAS, Triton — none of them hold their own memory pools. They allocate and free through TLSF without knowing it.
When the hook is active, the intended shape is one process-level runtime pool. Native Session objects should attach to that pool, not create parallel pools of their own.
- `fallback=0` is the invariant. If fallback > 0, something escaped the pool.
- TLSF is O(1) — bounded allocation time, no fragmentation.
- `torch.cuda.memory_reserved()` is always ~0.
- No `torch.cuda.empty_cache()` hacks needed.
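These invariants can be spot-checked from Python. A minimal sketch, assuming the process was launched with the hook preloaded as in the quick start (`LD_PRELOAD=build/libptx_hook.so`):

```python
# Minimal sketch: run under LD_PRELOAD=build/libptx_hook.so so every cudaMalloc
# behind these tensors is routed through the TLSF pool. Uses only standard
# PyTorch introspection; the expected values come from the invariants above.
import torch

bufs = [torch.empty(64 << 20, dtype=torch.uint8, device="cuda") for _ in range(4)]
torch.cuda.synchronize()

print("allocated:", torch.cuda.memory_allocated())  # what PyTorch reports as allocated
print("reserved: ", torch.cuda.memory_reserved())   # expected ~0 under the regime
del bufs
```

If `memory_reserved()` grows instead of staying near zero, allocations are landing in a framework-owned cache rather than the hook-owned pool.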
These are the rules the repository is trying to preserve:
- one GPU pool per process when the hook is active
- no parallel session-owned pool when the hook-owned pool already exists
- `fallback=0` for workloads that are meant to be fully covered
- `buffer_id` is the canonical internal execution handle
- raw `device_ptr` export is an interop boundary, not the internal execution identity
- Rust owns machine logic above CUDA; Python describes workloads and bridges libraries onto that substrate
If a change weakens one of those rules, it should be treated as an architectural regression, not just a local implementation detail.
See docs/concepts/memory-regime.md for the full reference.
The fercuda-runtime package provides:
| Module | Purpose |
|---|---|
| `scripting_api/` | RuntimeSession, RuntimeConfig, ignition (TensorSpec, IgnitionPlan, ignite()) |
| `runtime/` | ctypes bindings, session management, telemetry, constants |
| `inference/` | Model adapters (GGUF), tokenizers, generation pipeline (stream_generate, generate) |
| `guest_hooks/` | HF Transformers pre-import patches (hf_hook), JAX plugin hooks, Triton hook (TritonKernel) |
| `platform/` | Bootstrap, library discovery, platform compatibility checks |
The repo-local uv project already exposes fercuda-runtime from python/ as an editable source, so the supported path is still to enter through ./uvw run fercuda-install.
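The generation pipeline in `inference/` is driven through `stream_generate`/`generate`. A hedged sketch of the intended call shape (the import path and argument names below are assumptions, not documented API; model loading uses plain HF Transformers):

```python
# Hedged sketch: stream_generate/generate are listed under inference/ above,
# but this import path and call signature are assumptions for illustration only.
from transformers import AutoModelForCausalLM, AutoTokenizer
from fercuda_runtime.inference import stream_generate  # assumed import path

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto").to("cuda")

# Assumed signature: (model, tokenizer, prompt, ...) yielding decoded text pieces.
for piece in stream_generate(model, tokenizer, "Hello!", max_new_tokens=32):
    print(piece, end="", flush=True)
```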
| Script | What it does |
|---|---|
| `chat_qwen.py` | Streaming chatbot — Qwen3.5-4B (4-bit), with fused Triton kernels (`--fused`/`--no-fused`) |
| `chat_mistral.py` | Streaming chatbot — Mistral-7B-Instruct-v0.2 (4-bit) |
| `triton_kernels.py` | Fused Triton kernels: RMSNorm, SiLU×mul (SwiGLU), rotary embedding |
| `patch_model.py` | Monkey-patches Qwen3.5 model with fused kernels |
| `bench_triton_regime.py` | Triton kernels over TLSF: elementwise, softmax, matmul, launch overhead |
| `bench_hf_transformers.py` | HF Transformers integration benchmark |
| `bench_baseline.py` | Baseline allocation and throughput |
| `bench_coexistence.py` | Multi-library coexistence on TLSF |
| `bench_mechanics.py` | Low-level regime mechanics |
| `bench_throughput.py` | Allocation throughput |
| `bench_torch_regime.py` | PyTorch ops over TLSF |
| `bench_jax_regime.py` | JAX ops over TLSF |
Three kernels targeting per-layer elementwise hotspots in Qwen3.5:
- `fused_rms_norm` — single kernel replacing 6 PyTorch ops (cast → pow → mean → rsqrt → mul → cast). Handles the Qwen3.5 `(1 + weight)` variant.
- `fused_silu_mul` — fused SwiGLU activation `silu(gate) * up` in one pass, eliminating an intermediate tensor.
- `fused_rotary_emb` — fused rotate_half + cos/sin multiply for rotary position embeddings.
Applied via patch_qwen35_with_triton(model) after model load. All kernels run on the normal torch CUDA stream — TLSF handles allocation transparently.
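The three repo kernels live in `triton_kernels.py`. As an illustration of the pattern they follow (a minimal sketch, not the repository's implementation; the kernel and wrapper names below are made up for this example), a fused SwiGLU in Triton looks roughly like this:

```python
# Minimal sketch of a fused silu(gate) * up elementwise kernel in Triton.
# Names here are illustrative only, not the repo's API.
import torch
import triton
import triton.language as tl


@triton.jit
def _silu_mul_kernel(gate_ptr, up_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    gate = tl.load(gate_ptr + offs, mask=mask, other=0.0).to(tl.float32)
    up = tl.load(up_ptr + offs, mask=mask, other=0.0).to(tl.float32)
    out = gate * tl.sigmoid(gate) * up  # silu(gate) * up in a single pass
    tl.store(out_ptr + offs, out.to(out_ptr.dtype.element_ty), mask=mask)


def silu_mul(gate: torch.Tensor, up: torch.Tensor) -> torch.Tensor:
    """Fused SwiGLU activation; avoids materializing silu(gate) separately."""
    gate, up = gate.contiguous(), up.contiguous()
    out = torch.empty_like(gate)
    n = gate.numel()
    grid = (triton.cdiv(n, 1024),)
    _silu_mul_kernel[grid](gate, up, out, n, BLOCK=1024)
    return out
```

Under the regime, the `torch.empty_like` output goes through TLSF like any other allocation, and the kernel launches on the normal torch CUDA stream.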
Public crates:
| Crate | Purpose |
|---|---|
| `ptx-runtime` | Core feRcuda runtime bindings |
| `fercuda-ignition` | TLSF ignition cdylib |
| `fercuda-inference` | Rust inference bridge |
Internal: ptx-sys (low-level sys crate)
External crates (rust/external/fer-*):
| Crate | Purpose |
|---|---|
| `fer-cudarc` | cudarc 0.13.9 with ptx-alloc |
| `fer-cudarc-ptx` | cudarc-ptx 0.19.0 with ptx-os feature |
| `fer-aten` | PyTorch TLSF allocator |
| `fer-torch` | PyTorch + TLSF integration |
| `fer-ptx-os` | PTX-OS runtime: RegimeRuntimeCore, DeviceBox |
| `fer-math` | Pure Rust math: nalgebra + optional faer/candle |
| `fer-ml` | Pure Rust ML utils: safetensors, serde, sampling |
| `fer-candle-core` | Modified Candle core with TLSF/ptx-os |
| `fer-candle` | Candle ML backend: FusedExecutor, GraphCache, PtxDevice |
| `fer-triton` | Triton CUBIN loader + direct cuLaunchKernel dispatch |
| `fer-ext-kernels` | FFI to libfer_ext_kernels.so (Q4K GEMV, RMS norm) |
| `fer-bindgen` | Custom bindgen_cuda fork |
| `fer-ug-cuda` | ug-cuda shim for candle |
Pre-planned tensor allocation via the ignition engine:
```python
from fercuda_runtime.scripting_api.ignition import TensorSpec, IgnitionPlan, ignite

plan = IgnitionPlan(tensors=(TensorSpec(rank=1, dims=(64,)),), warmup_passes=1)
region = ignite(session, plan)  # RAII — memory freed on region close
```

Rust cdylib: `rust/crates/public/fercuda-ignition/` → `libfercuda_ignition.so`
The supported build path is:
```bash
./uvw run fercuda-install
```

Or, for the OS/runtime layer without default frontend bundles:

```bash
./uvw run fercuda-install --no-default-frontends
```

That installer owns both the native CMake build and the Rust ignition cdylib build, and syncs the runtime-facing shared objects into build/.
If cargo is missing, it bootstraps a minimal repo-local Rust toolchain under build/.cargo and build/.rustup.
The installer now has an explicit bundle model:
- `./uvw run fercuda-install` builds native + installs the default `hf` bundle
- `./uvw run fercuda-install --no-default-frontends` builds only the OS/runtime layer
- `./uvw run fercuda-install --with gguf` adds extra frontends during the main install
List the available bundles:
```bash
./uvw run fercuda-install runtimes --list
```

Preview the default install commands without executing them:

```bash
./uvw run fercuda-install --print-only
```

Build native only:

```bash
./uvw run fercuda-install --no-default-frontends
```

Build native + add optional bundles:

```bash
./uvw run fercuda-install --with gguf
./uvw run fercuda-install --with gguf --with dev
```

Print the exact commands for one or more bundles without rebuilding native:

```bash
./uvw run fercuda-install runtimes --print-only hf gguf
```

Install selected bundles later:
```bash
./uvw run fercuda-install runtimes hf
./uvw run fercuda-install runtimes gguf
./uvw run fercuda-install runtimes vllm
```

```bash
./uvw run fercuda-install --no-default-frontends
FERCUDA_NATIVE_BUILD=build LD_PRELOAD=build/libptx_hook.so ./uvw run python -m pytest python/tests/memory -v
FERCUDA_NATIVE_BUILD=build LD_PRELOAD=build/libptx_hook.so ./uvw run python -m pytest python/tests/inference -v
FERCUDA_NATIVE_BUILD=build LD_PRELOAD=build/libptx_hook.so ./uvw run python -m pytest python/tests/memory/test_subsystem_reboot.py -v
ctest --test-dir build --output-on-failure
cd rust && cargo test -p fercuda-inference  # Rust inference tests
```

Key functions in `include/fercuda/api/c_api.h`:
- `fer_alloc_buffer`, `fer_free_buffer`, `fer_upload_bytes`, `fer_export_buffer_device_ptr`
- `fer_ptxlaunch_submit` — kernel dispatch
- `fer_persistent_dispatcher_boot` — boot persistent dispatcher
- `fer_stream_sync` — synchronize session stream
- `fercuda_ignite` — ignition engine entry point
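Python reaches the shared C API through ctypes (see the library table above). A minimal sketch that only checks the symbols are exported; which `.so` exports which symbol is an assumption here, and no argument or return types are declared:

```python
# Minimal sketch: load the shared C API and verify that the c_api.h entry points
# are exported. Signatures are intentionally not declared in this example;
# see include/fercuda/api/c_api.h for the real prototypes.
import ctypes

lib = ctypes.CDLL("build/libfercuda_capi.so")
for name in ("fer_alloc_buffer", "fer_free_buffer", "fer_upload_bytes",
             "fer_export_buffer_device_ptr", "fer_ptxlaunch_submit",
             "fer_persistent_dispatcher_boot", "fer_stream_sync"):
    assert hasattr(lib, name), f"missing symbol: {name}"
```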
Hook telemetry: include/fercuda/api/intercept_telemetry.h and fercuda_runtime.runtime.telemetry (ctypes into libptx_hook.so).
- C: `include/fercuda/api/execution_contract.h` — entrypoints (tensor, jit_intent, jit_launch, resident_daemon, framework_allocator), frontends, ABI version.
- Python: `python/src/fercuda_runtime/execution_contract.py` — same names plus `rust_ferrite_torch` → `framework_allocator`.