Context
Continuum's inference substrate is pluggable via the `ModelBackend` trait defined in `src/workers/continuum-core/src/inference/backends/mod.rs`. Today two implementations live side-by-side:
- llama.cpp (`backends/llamacpp.rs`) — inference primary, vendored as a git submodule at `src/workers/vendor/llama.cpp`, exposed via the `workers/llama` Rust crate. Metal / CUDA / Vulkan backends all work through it.
- candle (`candle_adapter.rs`) — training primary + limited inference. We carry a fork (`joelteply/candle`, branch `fix/metal-memory-and-rope-neox`) while the RoPE NEOX fix (PR #3411 — accepted and merged upstream 2026-04-15) and the Metal memory fix propagate to a released crates.io version.
The trait is the seam. New backends slot in without the rest of the stack knowing — personas, scheduler, RAG, IPC are all backend-agnostic.
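To make the seam concrete, here is a minimal sketch of what a `ModelBackend`-shaped contract and a stub implementation could look like. Only the trait name matches the real code; every method, field, and type below is illustrative, and the trait in `mod.rs` is the source of truth.

```rust
// Hypothetical sketch of the ModelBackend seam; the real trait in
// src/workers/continuum-core/src/inference/backends/mod.rs is authoritative.

/// Per-request generation options (illustrative fields).
pub struct GenerateOptions {
    pub max_tokens: usize,
    pub temperature: f32,
}

/// The backend-agnostic contract: personas, scheduler, RAG, and IPC
/// only ever talk to this trait object.
pub trait ModelBackend {
    /// Load weights from a local path (GGUF, ONNX, safetensors, ...).
    fn load(&mut self, model_path: &str) -> Result<(), String>;
    /// Run one generation; streaming is elided for brevity.
    fn generate(&mut self, prompt: &str, opts: &GenerateOptions) -> Result<String, String>;
    /// Human-readable backend name, for logging and benchmarks.
    fn name(&self) -> &'static str;
}

/// A stub backend demonstrating how a new substrate slots in.
pub struct EchoBackend;

impl ModelBackend for EchoBackend {
    fn load(&mut self, _model_path: &str) -> Result<(), String> {
        Ok(())
    }
    fn generate(&mut self, prompt: &str, opts: &GenerateOptions) -> Result<String, String> {
        // A real backend runs inference; the stub just truncates the prompt.
        Ok(prompt.chars().take(opts.max_tokens).collect())
    }
    fn name(&self) -> &'static str {
        "echo"
    }
}

fn main() {
    let mut backend: Box<dyn ModelBackend> = Box::new(EchoBackend);
    backend.load("models/example.gguf").unwrap();
    let opts = GenerateOptions { max_tokens: 5, temperature: 0.0 };
    let out = backend.generate("hello world", &opts).unwrap();
    println!("{} -> {}", backend.name(), out);
}
```

The point of the sketch: any adapter below only has to satisfy this kind of contract behind a `Box<dyn ModelBackend>`, so the rest of the stack never changes.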
Standing directive (`memory/feedback_upstream_fixes_tri_repo.md`): build fearlessly in our own fork, cycle down to upstream as fixes merge, keep adapters in parallel for flexibility. "Control our own destiny with open source and pay it forward as we fix" (Joel, 2026-04-15).
This issue is an open invitation — to us eventually, to community contributors, or to anyone with a strong reason to want a particular substrate available in Continuum — to pick up one of the adapter candidates below.
Candidate adapters
1. MLX adapter (high-value, ambitious)
Apple MLX is Apple's native ML framework for Apple Silicon. Potentially Continuum's best long-term Mac inference path because it bypasses the Metal-via-ggml layer entirely and leverages unified memory natively.
Why it matters:
- Apple Silicon (M-series Mac) is Continuum's primary target (`project_m5_is_primary_audience.md` — BMW M4 tier).
- MLX has first-class Metal + unified-memory support, no cross-framework translation loss.
- Apple's team actively maintains it; perf improvements land without our intervention.
What's needed:
- Mature Rust bindings for MLX. Options: use FFI against the C++ API (mlx-c), or wait for / contribute to an ml-explore Rust binding project.
- Implement `ModelBackend` on top of MLX for the Qwen3.5 family (the model architectures we care about).
- Benchmark vs llama.cpp ggml-metal on Qwen3.5-4B Q4_K_M on M-series. Acceptance: parity or better vs ggml-metal for our concurrent sensory workload.
Difficulty: High. No stable Rust bindings exist today, and model-architecture support would need implementing.
2. ort-LLM adapter (natural extension, lower effort)
Continuum already uses ONNX Runtime (the `ort` crate) for Silero VAD and Piper TTS via the `load-dynamic-ort` feature (see `workers/continuum-core/Cargo.toml`). Extending it to LLM inference is an incremental step.
Why it matters:
- ONNX models are widely distributed; community-converted Qwen/LLaMA ONNX variants exist.
- Single runtime for LLM + STT + TTS simplifies the sensory pipeline.
- ORT has execution providers for CUDA, DirectML (Windows), CoreML (Mac), TensorRT.
What's needed:
- Implement `ModelBackend` using `ort` for LLM models (tokenization + session invocation + KV-cache handling).
- Add ONNX Qwen3.5 variants to `model_registry.json`.
- Benchmark vs llama.cpp. Acceptance: within a factor of 2 of llama.cpp on Apple Silicon; CUDA within 20% of ggml-cuda.
Difficulty: Medium. The `ort` crate is mature, but KV-cache management for autoregressive LLMs via ONNX requires care.
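The KV-cache caveat is the crux: ONNX decoder graphs take past key/values as inputs and return the updated ones as outputs, so the adapter must thread them through every step. A toy sketch of that loop structure in plain Rust (no `ort` calls; the "session run" and "logits" are faked so only the threading is shown):

```rust
// Illustrative autoregressive decode loop with explicit KV-cache threading,
// as an ort-LLM adapter would have to do with ONNX past_key_values tensors.
// No real inference here: the next token is faked so the loop is visible.

/// Stand-in for the per-layer key/value tensors an ONNX session returns
/// as outputs and expects back as inputs on the next step.
#[derive(Clone, Default)]
struct KvCache {
    /// One entry per cached position (a real cache holds K/V tensors of
    /// shape [batch, heads, seq_len, head_dim] per layer).
    positions: Vec<u32>,
}

/// Fake single-step "session run": consumes the cache, appends the new
/// position, and returns the next token id plus the updated cache.
fn run_decoder_step(token: u32, mut cache: KvCache) -> (u32, KvCache) {
    cache.positions.push(token);
    // A real adapter would argmax/sample the session's logits output.
    let next = token + 1;
    (next, cache)
}

fn decode(first_token: u32, steps: usize) -> (Vec<u32>, KvCache) {
    let mut cache = KvCache::default();
    let mut token = first_token;
    let mut out = Vec::new();
    for _ in 0..steps {
        // Feed only the newest token; the cache carries all prior context,
        // which is what keeps per-step cost flat in sequence length.
        let (next, new_cache) = run_decoder_step(token, cache);
        cache = new_cache;
        out.push(next);
        token = next;
    }
    (out, cache)
}

fn main() {
    let (tokens, cache) = decode(100, 4);
    println!("generated {:?}, cache holds {} positions", tokens, cache.positions.len());
}
```

Getting this right (including cache shape growth and batch handling) is most of the "takes care" in the difficulty rating.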
3. vLLM adapter (niche fit)
vLLM is a Python continuous-batching inference server with state-of-the-art throughput on GPU.
Why it matters (conditionally):
- State-of-the-art multi-request throughput on CUDA.
- Easy to deploy on BigMama-class hardware for forge/factory batch eval workloads.
Why it's a niche fit:
- The in-process scheduler (`workers/continuum-core/src/scheduler.rs`) already covers the multi-seq case for our Qwen3.5-4B live-chat workload.
What's needed:
- A `VllmRemoteBackend` that implements `ModelBackend` by proxying over a gRPC / HTTP connection to a vLLM server.
- Configuration to point at a vLLM endpoint (our own BigMama or a third-party).
Difficulty: Low-medium. Mostly glue; vLLM's API is stable.
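Since vLLM exposes an OpenAI-compatible HTTP API, most of a `VllmRemoteBackend` is request/response plumbing. A dependency-free sketch of the request side (the struct, field names, and host are hypothetical; real code would use an HTTP client and a JSON library rather than hand-rolled strings):

```rust
// Sketch of the request side of a hypothetical VllmRemoteBackend.
// vLLM serves an OpenAI-compatible HTTP API, so the proxy backend mostly
// builds request bodies and parses responses. JSON is hand-assembled here
// only to keep the sketch dependency-free.

struct VllmRemoteBackend {
    endpoint: String, // e.g. "http://bigmama:8000" (hypothetical host)
    model: String,
}

impl VllmRemoteBackend {
    /// OpenAI-compatible completions route on the vLLM server.
    fn completions_url(&self) -> String {
        format!("{}/v1/completions", self.endpoint)
    }

    /// Build an OpenAI-style /v1/completions body for one prompt.
    fn request_body(&self, prompt: &str, max_tokens: usize) -> String {
        // Minimal escaping for the sketch; real code must escape fully.
        let escaped = prompt.replace('\\', "\\\\").replace('"', "\\\"");
        format!(
            "{{\"model\":\"{}\",\"prompt\":\"{}\",\"max_tokens\":{}}}",
            self.model, escaped, max_tokens
        )
    }
}

fn main() {
    let backend = VllmRemoteBackend {
        endpoint: "http://bigmama:8000".to_string(),
        model: "Qwen/Qwen3.5-4B".to_string(), // illustrative model id
    };
    println!("{}", backend.completions_url());
    println!("{}", backend.request_body("hello", 64));
}
```

The actual adapter would wrap this in the `ModelBackend` contract, stream the response, and surface server errors as backend errors.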
4. Custom in-house adapter (speculative)
If the concurrent-sensory-model workload on M5 has requirements no off-the-shelf backend serves well — e.g., specialized KV-cache layouts for simultaneous avatar-driven personas, LoRA hot-swap semantics specific to our persona genome, domain-specific Metal kernels for DeltaNet — we could write our own.
Why it might matter:
- Full control over the inference hot path for our exact workload shape.
- No dependency on external release cadences.
Why it might not:
- Substantial maintenance surface.
- Only worth it if measured perf gaps on our actual workload justify the investment.
Difficulty: Very high. Scope and surface area would need explicit design RFC first. Not a "start coding" issue — a "decide whether to start designing" issue.
How to pick this up
Comment below with which adapter you want to tackle + rough timeline. We'll coordinate so contributors don't step on each other. The `ModelBackend` trait is the contract; anything that satisfies it is accepted into the candidate set. Final inclusion decisions gate on:
- Acceptance benchmark — your adapter must reach at least parity with llama.cpp on Qwen3.5-4B Q4_K_M for the target hardware, OR be clearly better on a specific axis (latency, memory, programmability) with the perf tradeoff documented.
- Concurrent sensory envelope — must fit alongside Bevy + Whisper + Piper + LiveKit on M5 without breaking the live-chat workload.
- No feature drop — accelerated only, no CPU fallback (`feedback_support_all_features_no_cheating.md`).
- Upstream friendliness — if your adapter surfaces a bug in its underlying library, help land the fix upstream in parallel (the same pattern candle PR #3411 followed).
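The acceptance gate above can be mechanized as a throughput ratio. A sketch with placeholder numbers (a real run would time identical prompts on Qwen3.5-4B Q4_K_M through both backends on the target hardware; nothing here measures anything):

```rust
// Sketch of the acceptance benchmark as a tokens/sec parity check.
// All numbers below are placeholders, not measured results.

struct BenchResult {
    backend: &'static str,
    tokens: u64,     // total tokens generated across the benchmark prompts
    wall_secs: f64,  // wall-clock time for the run
}

impl BenchResult {
    fn tokens_per_sec(&self) -> f64 {
        self.tokens as f64 / self.wall_secs
    }
}

/// Gate: candidate passes at >= `threshold` of the baseline throughput
/// (1.0 = strict parity; the ort-LLM "factor of 2" criterion would be 0.5).
fn passes_gate(candidate: &BenchResult, baseline: &BenchResult, threshold: f64) -> bool {
    candidate.tokens_per_sec() >= baseline.tokens_per_sec() * threshold
}

fn main() {
    // Placeholder numbers for illustration only.
    let baseline = BenchResult { backend: "llama.cpp", tokens: 2048, wall_secs: 40.0 };
    let candidate = BenchResult { backend: "candidate", tokens: 2048, wall_secs: 44.0 };
    println!(
        "{}: {:.1} tok/s vs {}: {:.1} tok/s, strict parity: {}",
        baseline.backend,
        baseline.tokens_per_sec(),
        candidate.backend,
        candidate.tokens_per_sec(),
        passes_gate(&candidate, &baseline, 1.0)
    );
}
```

Reporting both the raw tok/s numbers and the threshold used keeps the "clearly better on a specific axis" escape hatch honest: a candidate that fails strict parity documents which threshold it does clear and why.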
Related
- `src/workers/continuum-core/src/inference/backends/mod.rs`
- `backends/llamacpp.rs`, `candle_adapter.rs`
- `memory/feedback_upstream_fixes_tri_repo.md`