
feRcuda runtime

feRcuda is a userspace GPU operating layer. It installs a single governed memory and execution regime inside a CUDA process so frameworks become tenants of one machine instead of competing owners of the device.

At the core is a single-pool TLSF memory regime. A single LD_PRELOAD hook intercepts every cudaMalloc from every library (PyTorch, Triton, bitsandbytes, cuBLAS, cuDNN, JAX) and routes it through a deterministic O(1) allocator. No library modifications. No cross-framework heap fragmentation. fallback=0 is the invariant.

On top of that regime, feRcuda provides kernel-style lifecycle control for GPU resources: session ownership, rebootable runtime state, logical buffer management, JIT dispatch, ignition for pre-planned allocation, fused Triton kernels, and streaming inference support for HuggingFace-style models.

The current stack is:

  • CUDA driver and low-level native mechanisms at the bottom
  • Rust as the stable machine and control plane above them
  • Python as the operator surface and library bridge over that Rust-owned substrate

That distinction matters. Python is not a second runtime architecture. It is how frameworks and scripts drive the machine that the native and Rust layers establish.

Quick start

git clone https://github.com/DaronPopov/feRcuda_runtime.git
cd feRcuda_runtime
./uvw run fercuda-install
./uvw run fercuda-install runtimes --list

That installer is the supported build path. It keeps the toolchain and runtime state self-contained under build/, auto-bootstraps a repo-local Rust toolchain if cargo is missing, builds the native and Rust runtime artifacts, and installs the default hf frontend bundle.

Run the Mistral-7B streaming chatbot:

FERCUDA_NATIVE_BUILD=build LD_PRELOAD=build/libptx_hook.so ./uvw run python tests/integration/benchmarks/llm_inference/chat_mistral.py

Or Qwen3.5-4B with fused Triton kernels:

FERCUDA_NATIVE_BUILD=build LD_PRELOAD=build/libptx_hook.so ./uvw run python tests/integration/benchmarks/llm_inference/chat_qwen.py

Requirements:

  • NVIDIA driver 580+, CUDA toolkit 13.0+
  • Linux with a modern NVIDIA GPU
  • uv installed and on PATH

Use uvw for repo-local commands. It pins:

  • UV_PROJECT_ENVIRONMENT=build/.venv
  • PYTHONPYCACHEPREFIX=build/pycache
  • CARGO_TARGET_DIR=build/rust

That keeps the system self-contained under build/. Extra frontends stay opt-in through installer bundles.

Layout

Path Role
native/ptx_core/ Core: TLSF allocator, elastic pool, VMM/VFS OS glue, memcpy, kernels, JIT lowering, Linux preload hook
native/fercuda_api/ Session, planner, pools, c_api.cu
include/fercuda/ Public C/C++/CUDA headers (api/, jit/, daemon/, runtime/)
build/ Native build outputs (.so artifacts)
python/ fercuda-runtime Python package
rust/ Cargo workspace (public + internal + external crates)
tests/integration/benchmarks/ Benchmarks grouped by allocator, framework, mechanics, and LLM inference
tests/integration/native/ Native C++ smoke tests grouped by VMM subsystem
python/tests/ Python pytest coverage for memory, inference, and runtime behavior
tests/ Benchmarks, native smoke tests, and standalone validation scripts
results/paper/ Paper (LaTeX source + PDF)
docs/ Design documents and benchmark results
examples/ Ad hoc example scripts and experiments; not part of the required build

Build artifacts

Library Purpose
libptx_hook.so LD_PRELOAD intercept — routes all cudaMalloc/cudaFree through TLSF
libfercuda_capi.so Shared C API (Python loads via ctypes)
libptx_core.so Hot runtime, TLSF allocator, VMM OS glue
libptx_kernels.so Tensor/graphics-style GPU kernels
libfercuda_ignition.so Ignition engine — pre-planned tensor allocation
libfer_triton.so Triton CUBIN loader + direct cuLaunchKernel dispatch
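
After an install, one quick sanity check is to confirm these shared objects were synced into build/ (a minimal sketch; it assumes the artifacts sit directly under build/, which matches the LD_PRELOAD paths used elsewhere in this README):

from pathlib import Path

# Runtime artifacts from the table above; adjust the list if the build layout differs.
expected = [
    "libptx_hook.so",
    "libfercuda_capi.so",
    "libptx_core.so",
    "libptx_kernels.so",
    "libfercuda_ignition.so",
    "libfer_triton.so",
]
missing = [name for name in expected if not (Path("build") / name).exists()]
print("missing artifacts:", missing or "none")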

Memory regime

The core idea: one GPU pool, every library is a tenant.

LD_PRELOAD=build/libptx_hook.so intercepts all cudaMalloc calls at the process level. PyTorch, bitsandbytes, cuBLAS, Triton — none of them hold their own memory pools. They allocate and free through TLSF without knowing it.

When the hook is active, the intended shape is one process-level runtime pool. Native Session objects should attach to that pool, not create parallel pools of their own.

  • fallback=0 is the invariant. If fallback > 0, something escaped the pool.
  • TLSF is O(1) — bounded allocation time, no fragmentation.
  • torch.cuda.memory_reserved() is always ~0.
  • No torch.cuda.empty_cache() hacks needed.
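
A minimal way to observe the regime from the PyTorch side (a sketch using only standard torch APIs; it assumes the process was launched with LD_PRELOAD=build/libptx_hook.so as shown in the quick start):

import torch

# With the hook active, this allocation is served by the process-level TLSF pool,
# so PyTorch's caching allocator should report essentially nothing reserved.
x = torch.empty(1024, 1024, device="cuda")
print("torch.cuda.memory_reserved():", torch.cuda.memory_reserved())  # expected: ~0 under the hook

The fallback counter itself is reported through the hook telemetry bindings (fercuda_runtime.runtime.telemetry; see the C API section below).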

Machine Invariants

These are the rules the repository is trying to preserve:

  • one GPU pool per process when the hook is active
  • no parallel session-owned pool when the hook-owned pool already exists
  • fallback=0 for workloads that are meant to be fully covered
  • buffer_id is the canonical internal execution handle
  • raw device_ptr export is an interop boundary, not the internal execution identity
  • Rust owns machine logic above CUDA; Python describes workloads and bridges libraries onto that substrate

If a change weakens one of those rules, it should be treated as an architectural regression, not just a local implementation detail.

See docs/concepts/memory-regime.md for the full reference.

Python package

The fercuda-runtime package provides:

Module Purpose
scripting_api/ RuntimeSession, RuntimeConfig, ignition (TensorSpec, IgnitionPlan, ignite())
runtime/ ctypes bindings, session management, telemetry, constants
inference/ Model adapters (GGUF), tokenizers, generation pipeline (stream_generate, generate)
guest_hooks/ HF Transformers pre-import patches (hf_hook), JAX plugin hooks, Triton hook (TritonKernel)
platform/ Bootstrap, library discovery, platform compatibility checks

The repo-local uv project exposes fercuda-runtime from python/ as an editable source, so the supported entry point remains ./uvw run fercuda-install.
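
A quick check that the editable install resolves to the in-repo source (the printed path should land under python/):

./uvw run python -c "import fercuda_runtime; print(fercuda_runtime.__file__)"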

Benchmarks and chatbots

Script What it does
chat_qwen.py Streaming chatbot — Qwen3.5-4B (4-bit), with fused Triton kernels (--fused/--no-fused)
chat_mistral.py Streaming chatbot — Mistral-7B-Instruct-v0.2 (4-bit)
triton_kernels.py Fused Triton kernels: RMSNorm, SiLU×mul (SwiGLU), rotary embedding
patch_model.py Monkey-patches Qwen3.5 model with fused kernels
bench_triton_regime.py Triton kernels over TLSF: elementwise, softmax, matmul, launch overhead
bench_hf_transformers.py HF Transformers integration benchmark
bench_baseline.py Baseline allocation and throughput
bench_coexistence.py Multi-library coexistence on TLSF
bench_mechanics.py Low-level regime mechanics
bench_throughput.py Allocation throughput
bench_torch_regime.py PyTorch ops over TLSF
bench_jax_regime.py JAX ops over TLSF

Fused Triton kernels

Three kernels targeting per-layer elementwise hotspots in Qwen3.5:

  • fused_rms_norm — single kernel replacing 6 PyTorch ops (cast → pow → mean → rsqrt → mul → cast). Handles the Qwen3.5 (1 + weight) variant.
  • fused_silu_mul — fused SwiGLU activation silu(gate) * up in one pass, eliminating an intermediate tensor.
  • fused_rotary_emb — fused rotate_half + cos/sin multiply for rotary position embeddings.

Applied via patch_qwen35_with_triton(model) after model load. All kernels run on the normal torch CUDA stream — TLSF handles allocation transparently.
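
For reference, the unfused math that the first two kernels replace looks roughly like this in plain PyTorch (a sketch of the op sequence described above, not the Triton implementation; the eps value and dtypes are assumptions):

import torch

def rms_norm_reference(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # cast -> pow -> mean -> rsqrt -> mul -> cast, with the Qwen3.5 (1 + weight) variant
    h = x.to(torch.float32)
    variance = h.pow(2).mean(-1, keepdim=True)
    h = h * torch.rsqrt(variance + eps)
    return ((1.0 + weight) * h).to(x.dtype)

def silu_mul_reference(gate: torch.Tensor, up: torch.Tensor) -> torch.Tensor:
    # SwiGLU activation: silu(gate) * up, done in one pass by the fused kernel
    return torch.nn.functional.silu(gate) * up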

Rust workspace

Public crates:

Crate Purpose
ptx-runtime Core feRcuda runtime bindings
fercuda-ignition TLSF ignition cdylib
fercuda-inference Rust inference bridge

Internal: ptx-sys (low-level sys crate)

External crates (rust/external/fer-*):

Crate Purpose
fer-cudarc cudarc 0.13.9 with ptx-alloc
fer-cudarc-ptx cudarc-ptx 0.19.0 with ptx-os feature
fer-aten PyTorch TLSF allocator
fer-torch PyTorch + TLSF integration
fer-ptx-os PTX-OS runtime: RegimeRuntimeCore, DeviceBox
fer-math Pure Rust math: nalgebra + optional faer/candle
fer-ml Pure Rust ML utils: safetensors, serde, sampling
fer-candle-core Modified Candle core with TLSF/ptx-os
fer-candle Candle ML backend: FusedExecutor, GraphCache, PtxDevice
fer-triton Triton CUBIN loader + direct cuLaunchKernel dispatch
fer-ext-kernels FFI to libfer_ext_kernels.so (Q4K GEMV, RMS norm)
fer-bindgen Custom bindgen_cuda fork
fer-ug-cuda ug-cuda shim for candle

Ignition engine

Pre-planned tensor allocation via the ignition engine:

from fercuda_runtime.scripting_api.ignition import TensorSpec, IgnitionPlan, ignite

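# `session` below is an already-created runtime session (RuntimeSession from scripting_api);
# constructing it is outside the scope of this snippet.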
plan = IgnitionPlan(tensors=(TensorSpec(rank=1, dims=(64,)),), warmup_passes=1)
region = ignite(session, plan)  # RAII — memory freed on region close

Rust cdylib: rust/crates/public/fercuda-ignition/libfercuda_ignition.so

Native build

The supported build path is:

./uvw run fercuda-install

Or, for the OS/runtime layer without default frontend bundles:

./uvw run fercuda-install --no-default-frontends

That installer owns both the native CMake build and the Rust ignition cdylib build, and syncs the runtime-facing shared objects into build/. If cargo is missing, it bootstraps a minimal repo-local Rust toolchain under build/.cargo and build/.rustup.

Optional Inference Runtimes

The installer now has an explicit bundle model:

  • ./uvw run fercuda-install builds native + installs the default hf bundle
  • ./uvw run fercuda-install --no-default-frontends builds only the OS/runtime layer
  • ./uvw run fercuda-install --with gguf adds extra frontends during the main install

List the available bundles:

./uvw run fercuda-install runtimes --list

Preview the default install commands without executing them:

./uvw run fercuda-install --print-only

Build native only:

./uvw run fercuda-install --no-default-frontends

Build native + add optional bundles:

./uvw run fercuda-install --with gguf
./uvw run fercuda-install --with gguf --with dev

Print the exact commands for one or more bundles without rebuilding native:

./uvw run fercuda-install runtimes --print-only hf gguf

Install selected bundles later:

./uvw run fercuda-install runtimes hf
./uvw run fercuda-install runtimes gguf
./uvw run fercuda-install runtimes vllm

Tests

./uvw run fercuda-install --no-default-frontends
FERCUDA_NATIVE_BUILD=build LD_PRELOAD=build/libptx_hook.so ./uvw run python -m pytest python/tests/memory -v
FERCUDA_NATIVE_BUILD=build LD_PRELOAD=build/libptx_hook.so ./uvw run python -m pytest python/tests/inference -v
FERCUDA_NATIVE_BUILD=build LD_PRELOAD=build/libptx_hook.so ./uvw run python -m pytest python/tests/memory/test_subsystem_reboot.py -v
ctest --test-dir build --output-on-failure
cd rust && cargo test -p fercuda-inference                # Rust inference tests

C API

Key functions in include/fercuda/api/c_api.h:

  • fer_alloc_buffer, fer_free_buffer, fer_upload_bytes, fer_export_buffer_device_ptr
  • fer_ptxlaunch_submit — kernel dispatch
  • fer_persistent_dispatcher_boot — boot persistent dispatcher
  • fer_stream_sync — synchronize session stream
  • fercuda_ignite — ignition engine entry point

Hook telemetry: include/fercuda/api/intercept_telemetry.h and fercuda_runtime.runtime.telemetry (ctypes into libptx_hook.so).
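
As a smoke check that the C API surface is exported, the shared object can be opened directly with ctypes (a sketch; it loads from the default build/ location and does not assume any call signatures):

import ctypes

capi = ctypes.CDLL("build/libfercuda_capi.so")
for name in (
    "fer_alloc_buffer", "fer_free_buffer", "fer_upload_bytes",
    "fer_export_buffer_device_ptr", "fer_ptxlaunch_submit",
    "fer_persistent_dispatcher_boot", "fer_stream_sync", "fercuda_ignite",
):
    # Attribute access on a CDLL does a dlsym lookup; hasattr reports whether the symbol exists.
    print(f"{name}: {'found' if hasattr(capi, name) else 'missing'}")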

Execution ABI

  • C: include/fercuda/api/execution_contract.h — entrypoints (tensor, jit_intent, jit_launch, resident_daemon, framework_allocator), frontends, ABI version.
  • Python: python/src/fercuda_runtime/execution_contract.py — same names plus rust_ferrite_torchframework_allocator.
