
Open invitation: ModelBackend adapters we'd welcome (MLX, vLLM, ort-LLM, custom) #896

@joelteply

Description


Context

Continuum's inference substrate is pluggable via the ModelBackend trait defined in src/workers/continuum-core/src/inference/backends/mod.rs. Today two implementations live side-by-side:

  • llama.cpp (backends/llamacpp.rs) — inference primary, vendored as git submodule at src/workers/vendor/llama.cpp, exposed via the workers/llama Rust crate. Metal / CUDA / Vulkan backends all work through it.
  • candle (candle_adapter.rs) — training primary + limited inference. We carry a fork (joelteply/candle branch fix/metal-memory-and-rope-neox) while the RoPE NEOX fix (PR #3411 — accepted and merged upstream 2026-04-15) and Metal memory fix propagate to a released crates.io version.

The trait is the seam. New backends slot in without the rest of the stack knowing — personas, scheduler, RAG, IPC are all backend-agnostic.
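To make the seam concrete, here is a minimal sketch of the adapter pattern. The real trait lives in src/workers/continuum-core/src/inference/backends/mod.rs and will differ; every name below (ModelBackend's methods, EchoBackend) is illustrative, not the actual API.

```rust
// Hypothetical shape of the backend seam. All names are illustrative.

/// A loaded model plus its generation entry point.
trait ModelBackend {
    /// Human-readable backend name, e.g. "llamacpp" or "candle".
    fn name(&self) -> &str;
    /// Run one generation request: prompt in, completed text out.
    fn generate(&mut self, prompt: &str, max_tokens: usize) -> Result<String, String>;
}

/// A do-nothing backend showing how any adapter slots in behind the trait.
struct EchoBackend;

impl ModelBackend for EchoBackend {
    fn name(&self) -> &str {
        "echo"
    }
    fn generate(&mut self, prompt: &str, _max_tokens: usize) -> Result<String, String> {
        Ok(format!("echo: {prompt}"))
    }
}

fn main() {
    // Callers (personas, scheduler, RAG) only ever see `dyn ModelBackend`,
    // which is why a new substrate needs no changes elsewhere in the stack.
    let mut backend: Box<dyn ModelBackend> = Box::new(EchoBackend);
    let out = backend.generate("hello", 8).unwrap();
    assert_eq!(out, "echo: hello");
    println!("{} -> {}", backend.name(), out);
}
```

An MLX, ort, or vLLM adapter is "just" another `impl ModelBackend` like `EchoBackend` above, however much machinery hides behind it.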

Standing directive (memory/feedback_upstream_fixes_tri_repo.md): build fearlessly in our own fork, cycle down to upstream as fixes merge, keep adapters in parallel for flexibility. "Control our own destiny with open source and pay it forward as we fix" (Joel, 2026-04-15).

This issue is an open invitation to anyone — us eventually, community contributors, anyone who has a strong reason to want a particular substrate available in Continuum — to pick up one of the adapter candidates below.

Candidate adapters

1. MLX adapter (high-value, ambitious)

Apple MLX is Apple's native ML framework for Apple Silicon. Potentially Continuum's best long-term Mac inference path because it bypasses the Metal-via-ggml layer entirely and leverages unified memory natively.

Why it matters:

  • Apple Silicon (M-series Mac) is Continuum's primary target (project_m5_is_primary_audience.md — BMW M4 tier).
  • MLX has first-class Metal + unified-memory support, no cross-framework translation loss.
  • Apple's team actively maintains it; perf improvements land without our intervention.

What's needed:

  • Mature Rust bindings for MLX. Options: use FFI against the C++ API (mlx-c), or wait for / contribute to an ml-explore Rust binding project.
  • Implement ModelBackend on top of MLX for Qwen3.5 family (the model architectures we care about).
  • Benchmark vs llama.cpp ggml-metal on Qwen3.5-4B Q4_K_M on M-series. Acceptance: parity or better vs ggml-metal for our concurrent sensory workload.

Difficulty: High. No stable Rust bindings exist today, and model-architecture support would need implementing from scratch.


2. ort-LLM adapter (natural extension, lower effort)

Continuum already uses ONNX Runtime (ort crate) for Silero VAD and Piper TTS via the load-dynamic-ort feature (see workers/continuum-core/Cargo.toml). Extending it to LLM inference is an incremental step.

Why it matters:

  • ONNX models are widely distributed; community-converted Qwen/LLaMA ONNX variants exist.
  • Single runtime for LLM + STT + TTS simplifies the sensory pipeline.
  • ORT has execution providers for CUDA, DirectML (Windows), CoreML (Mac), TensorRT.

What's needed:

  • Implement ModelBackend using ort for LLM models (tokenization + session invocation + KV-cache handling).
  • Add ONNX Qwen3.5 variants to model_registry.json.
  • Benchmark vs llama.cpp. Acceptance: within factor of 2 on Apple Silicon; CUDA within 20% of ggml-cuda.

Difficulty: Medium. The ort crate is mature, but KV-cache management for autoregressive LLMs via ONNX requires care.
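The KV-cache discipline is the subtle part, so here is a sketch of the loop shape an ort-based adapter must get right. `Session` below is a stand-in, not the real `ort` API: the point is that you prefill the whole prompt once, then decode one token per step, carrying past key/values forward instead of re-running the full sequence.

```rust
// Stand-in for a decoder-only ONNX session with KV-cache inputs/outputs.
// The real ort session would take past_key_values tensors; here we just
// track how many positions the cache holds.
struct Session {
    past_len: usize,
}

struct StepOutput {
    next_token: u32,
    past_len: usize,
}

impl Session {
    /// One forward pass over `tokens`, resuming from `past_len` cached positions.
    /// Toy "model": the next token is the last input token + 1.
    fn run(&mut self, tokens: &[u32], past_len: usize) -> StepOutput {
        self.past_len = past_len + tokens.len();
        StepOutput {
            next_token: tokens.last().copied().unwrap_or(0) + 1,
            past_len: self.past_len,
        }
    }
}

fn generate(session: &mut Session, prompt: &[u32], max_new: usize) -> Vec<u32> {
    // Prefill: the whole prompt in one pass, cache starts empty.
    let mut out = session.run(prompt, 0);
    let mut generated = vec![out.next_token];
    // Decode: feed only the newly sampled token each step; the KV cache
    // supplies everything before it. This per-step O(1)-tokens property is
    // what makes autoregressive ONNX inference viable at all.
    for _ in 1..max_new {
        out = session.run(&[out.next_token], out.past_len);
        generated.push(out.next_token);
    }
    generated
}

fn main() {
    let mut s = Session { past_len: 0 };
    let out = generate(&mut s, &[1, 2, 3], 4);
    assert_eq!(out, vec![4, 5, 6, 7]);
    // Cache holds prompt (3) plus every token fed during decode (3).
    assert_eq!(s.past_len, 6);
}
```

Getting the cache-tensor plumbing wrong (re-prefilling every step) still produces correct text, just quadratically slower, which is why the acceptance benchmark matters.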


3. vLLM adapter (niche fit)

vLLM is a Python continuous-batching inference server with state-of-the-art throughput on GPU.

Why it matters (conditionally):

  • State-of-the-art multi-request throughput on CUDA.
  • Easy to deploy on BigMama-class hardware for forge/factory batch eval workloads.

Why it's a niche fit:

  • vLLM is a server-side Python process with a CUDA focus; it doesn't serve Continuum's primary target of local, on-device inference on Apple Silicon.
  • It would run as a remote endpoint rather than in-process, so it complements the local backends instead of replacing them.
What's needed:

  • A VllmRemoteBackend that implements ModelBackend by proxying over a gRPC / HTTP connection to a vLLM server.
  • Configuration to point at a vLLM endpoint (our own BigMama or a third-party).

Difficulty: Low-medium. Mostly glue; vLLM's API is stable.


4. Custom in-house adapter (speculative)

If the concurrent-sensory-model workload on M5 has requirements no off-the-shelf backend serves well — e.g., specialized KV-cache layouts for simultaneous avatar-driven personas, LoRA hot-swap semantics specific to our persona genome, domain-specific Metal kernels for DeltaNet — we could write our own.

Why it might matter:

  • Full control over the inference hot path for our exact workload shape.
  • No dependency on external release cadences.

Why it might not:

  • Substantial maintenance surface.
  • Only worth it if measured perf gaps on our actual workload justify the investment.

Difficulty: Very high. Scope and surface area would need explicit design RFC first. Not a "start coding" issue — a "decide whether to start designing" issue.


How to pick this up

Comment below with which adapter you want to tackle + rough timeline. We'll coordinate so contributors don't step on each other. The ModelBackend trait is the contract; anything that satisfies it is accepted into the candidate set. Final inclusion decisions gate on:

  1. Acceptance benchmark — your adapter reaches parity or better with llama.cpp on Qwen3.5-4B Q4_K_M for the target hardware, OR is clearly better on a specific axis (latency, memory, programmability) with the perf tradeoff documented.
  2. Concurrent sensory envelope — must fit alongside Bevy + Whisper + Piper + LiveKit on M5 without breaking the live-chat workload.
  3. No feature drop — accelerated only, no CPU fallback (feedback_support_all_features_no_cheating.md).
  4. Upstream friendliness — if your adapter surfaces a bug in its underlying library, help land the fix upstream in parallel (same pattern candle PR #3411 followed).
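The quantitative gates above, written out as predicates so there's no ambiguity about what "parity", "within a factor of 2", and "within 20%" mean. Thresholds come straight from this issue; `tps` is tokens/sec against the llama.cpp baseline on the same hardware and quant.

```rust
/// Gate 1: parity or better with the llama.cpp baseline.
fn meets_parity(candidate_tps: f64, baseline_tps: f64) -> bool {
    candidate_tps >= baseline_tps
}

/// ort relaxation on Apple Silicon: within a factor of 2 of the baseline.
fn within_factor_of_two(candidate_tps: f64, baseline_tps: f64) -> bool {
    candidate_tps * 2.0 >= baseline_tps
}

/// ort on CUDA: within 20% of ggml-cuda.
fn within_twenty_percent(candidate_tps: f64, baseline_tps: f64) -> bool {
    candidate_tps >= baseline_tps * 0.8
}

fn main() {
    assert!(meets_parity(50.0, 50.0));
    assert!(!meets_parity(49.0, 50.0));
    assert!(within_factor_of_two(26.0, 50.0));
    assert!(!within_factor_of_two(24.0, 50.0));
    assert!(within_twenty_percent(40.0, 50.0));
    assert!(!within_twenty_percent(39.0, 50.0));
}
```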


Metadata

Assignees: none
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)
Milestone: none