
Open invitation: ModelBackend adapters we'd welcome (MLX, vLLM, ort-LLM, custom) #896

@joelteply

Description


Context

Continuum's inference substrate is pluggable via the ModelBackend trait defined in src/workers/continuum-core/src/inference/backends/mod.rs. Today two implementations live side-by-side:

  • llama.cpp (backends/llamacpp.rs) — inference primary, vendored as git submodule at src/workers/vendor/llama.cpp, exposed via the workers/llama Rust crate. Metal / CUDA / Vulkan backends all work through it.
  • candle (candle_adapter.rs) — training primary + limited inference. We carry a fork (joelteply/candle branch fix/metal-memory-and-rope-neox) while the RoPE NEOX fix (PR #3411 — accepted and merged upstream 2026-04-15) and Metal memory fix propagate to a released crates.io version.

The trait is the seam. New backends slot in without the rest of the stack knowing — personas, scheduler, RAG, IPC are all backend-agnostic.
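To make the seam concrete, here is a minimal sketch of the adapter pattern. The real trait lives in src/workers/continuum-core/src/inference/backends/mod.rs and will differ; every name below (ModelBackend's methods, EchoBackend) is illustrative, not the actual API.

```rust
// Hypothetical shape of the backend seam. All names are illustrative.

/// A loaded model plus its generation entry point.
trait ModelBackend {
    /// Human-readable backend name, e.g. "llamacpp" or "candle".
    fn name(&self) -> &str;
    /// Run one generation request: prompt in, completed text out.
    fn generate(&mut self, prompt: &str, max_tokens: usize) -> Result<String, String>;
}

/// A do-nothing backend showing how any adapter slots in behind the trait.
struct EchoBackend;

impl ModelBackend for EchoBackend {
    fn name(&self) -> &str {
        "echo"
    }
    fn generate(&mut self, prompt: &str, _max_tokens: usize) -> Result<String, String> {
        Ok(format!("echo: {prompt}"))
    }
}

fn main() {
    // Callers (personas, scheduler, RAG) only ever see `dyn ModelBackend`,
    // which is why a new substrate needs no changes elsewhere in the stack.
    let mut backend: Box<dyn ModelBackend> = Box::new(EchoBackend);
    let out = backend.generate("hello", 8).unwrap();
    assert_eq!(out, "echo: hello");
    println!("{} -> {}", backend.name(), out);
}
```

An MLX, ort, or vLLM adapter is "just" another `impl ModelBackend` like `EchoBackend` above, however much machinery hides behind it.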

Standing directive (memory/feedback_upstream_fixes_tri_repo.md): build fearlessly in our own fork, cycle down to upstream as fixes merge, keep adapters in parallel for flexibility. "Control our own destiny with open source and pay it forward as we fix" (Joel, 2026-04-15).

This issue is an open invitation to anyone — us eventually, community contributors, anyone who has a strong reason to want a particular substrate available in Continuum — to pick up one of the adapter candidates below.

Candidate adapters

1. MLX adapter (high-value, ambitious)

Apple MLX is Apple's native ML framework for Apple Silicon. Potentially Continuum's best long-term Mac inference path because it bypasses the Metal-via-ggml layer entirely and leverages unified memory natively.

Why it matters:

  • Apple Silicon (M-series Mac) is Continuum's primary target (project_m5_is_primary_audience.md — BMW M4 tier).
  • MLX has first-class Metal + unified-memory support, no cross-framework translation loss.
  • Apple's team actively maintains it; perf improvements land without our intervention.

What's needed:

  • Mature Rust bindings for MLX. Options: use FFI against the C++ API (mlx-c), or wait for / contribute to an ml-explore Rust binding project.
  • Implement ModelBackend on top of MLX for Qwen3.5 family (the model architectures we care about).
  • Benchmark vs llama.cpp ggml-metal on Qwen3.5-4B Q4_K_M on M-series. Acceptance: parity or better vs ggml-metal for our concurrent sensory workload.

Difficulty: High. No stable Rust bindings exist today, and model-architecture support would need implementing from scratch.


2. ort-LLM adapter (natural extension, lower effort)

Continuum already uses ONNX Runtime (ort crate) for Silero VAD and Piper TTS via the load-dynamic-ort feature (see workers/continuum-core/Cargo.toml). Extending it to LLM inference is an incremental step.

Why it matters:

  • ONNX models are widely distributed; community-converted Qwen/LLaMA ONNX variants exist.
  • Single runtime for LLM + STT + TTS simplifies the sensory pipeline.
  • ORT has execution providers for CUDA, DirectML (Windows), CoreML (Mac), TensorRT.

What's needed:

  • Implement ModelBackend using ort for LLM models (tokenization + session invocation + KV-cache handling).
  • Add ONNX Qwen3.5 variants to model_registry.json.
  • Benchmark vs llama.cpp. Acceptance: within factor of 2 on Apple Silicon; CUDA within 20% of ggml-cuda.

Difficulty: Medium. The ort crate is mature, but KV-cache management for autoregressive LLMs via ONNX requires care.
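The KV-cache discipline is the subtle part, so here is a sketch of the loop shape an ort-based adapter must get right. `Session` below is a stand-in, not the real `ort` API: the point is that you prefill the whole prompt once, then decode one token per step, carrying past key/values forward instead of re-running the full sequence.

```rust
// Stand-in for a decoder-only ONNX session with KV-cache inputs/outputs.
// The real ort session would take past_key_values tensors; here we just
// track how many positions the cache holds.
struct Session {
    past_len: usize,
}

struct StepOutput {
    next_token: u32,
    past_len: usize,
}

impl Session {
    /// One forward pass over `tokens`, resuming from `past_len` cached positions.
    /// Toy "model": the next token is the last input token + 1.
    fn run(&mut self, tokens: &[u32], past_len: usize) -> StepOutput {
        self.past_len = past_len + tokens.len();
        StepOutput {
            next_token: tokens.last().copied().unwrap_or(0) + 1,
            past_len: self.past_len,
        }
    }
}

fn generate(session: &mut Session, prompt: &[u32], max_new: usize) -> Vec<u32> {
    // Prefill: the whole prompt in one pass, cache starts empty.
    let mut out = session.run(prompt, 0);
    let mut generated = vec![out.next_token];
    // Decode: feed only the newly sampled token each step; the KV cache
    // supplies everything before it. This per-step O(1)-tokens property is
    // what makes autoregressive ONNX inference viable at all.
    for _ in 1..max_new {
        out = session.run(&[out.next_token], out.past_len);
        generated.push(out.next_token);
    }
    generated
}

fn main() {
    let mut s = Session { past_len: 0 };
    let out = generate(&mut s, &[1, 2, 3], 4);
    assert_eq!(out, vec![4, 5, 6, 7]);
    // Cache holds prompt (3) plus every token fed during decode (3).
    assert_eq!(s.past_len, 6);
}
```

Getting the cache-tensor plumbing wrong (re-prefilling every step) still produces correct text, just quadratically slower, which is why the acceptance benchmark matters.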


3. vLLM adapter (niche fit)

vLLM is a Python continuous-batching inference server with state-of-the-art throughput on GPU.

Why it matters (conditionally):

  • State-of-the-art multi-request throughput on CUDA.
  • Easy to deploy on BigMama-class hardware for forge/factory batch eval workloads.

Why it's a niche fit:

  • vLLM is a server-side Python process with a CUDA focus; it doesn't serve Continuum's primary target of local, on-device inference on Apple Silicon.
  • It would run as a remote endpoint rather than in-process, so it complements the local backends instead of replacing them.
What's needed:

  • A VllmRemoteBackend that implements ModelBackend by proxying over a gRPC / HTTP connection to a vLLM server.
  • Configuration to point at a vLLM endpoint (our own BigMama or a third-party).

Difficulty: Low-medium. Mostly glue; vLLM's API is stable.


4. Custom in-house adapter (speculative)

If the concurrent-sensory-model workload on M5 has requirements no off-the-shelf backend serves well — e.g., specialized KV-cache layouts for simultaneous avatar-driven personas, LoRA hot-swap semantics specific to our persona genome, domain-specific Metal kernels for DeltaNet — we could write our own.

Why it might matter:

  • Full control over the inference hot path for our exact workload shape.
  • No dependency on external release cadences.

Why it might not:

  • Substantial maintenance surface.
  • Only worth it if measured perf gaps on our actual workload justify the investment.

Difficulty: Very high. Scope and surface area would need explicit design RFC first. Not a "start coding" issue — a "decide whether to start designing" issue.


How to pick this up

Comment below with which adapter you want to tackle + rough timeline. We'll coordinate so contributors don't step on each other. The ModelBackend trait is the contract; anything that satisfies it is accepted into the candidate set. Final inclusion decisions gate on:

  1. Acceptance benchmark — your adapter reaches parity or better with llama.cpp on Qwen3.5-4B Q4_K_M for the target hardware, OR is clearly better on a specific axis (latency, memory, programmability) with the perf tradeoff documented.
  2. Concurrent sensory envelope — must fit alongside Bevy + Whisper + Piper + LiveKit on M5 without breaking the live-chat workload.
  3. No feature drop — accelerated only, no CPU fallback (feedback_support_all_features_no_cheating.md).
  4. Upstream friendliness — if your adapter surfaces a bug in its underlying library, help land the fix upstream in parallel (same pattern candle PR #3411 followed).
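The quantitative gates above, written out as predicates so there's no ambiguity about what "parity", "within a factor of 2", and "within 20%" mean. Thresholds come straight from this issue; `tps` is tokens/sec against the llama.cpp baseline on the same hardware and quant.

```rust
/// Gate 1: parity or better with the llama.cpp baseline.
fn meets_parity(candidate_tps: f64, baseline_tps: f64) -> bool {
    candidate_tps >= baseline_tps
}

/// ort relaxation on Apple Silicon: within a factor of 2 of the baseline.
fn within_factor_of_two(candidate_tps: f64, baseline_tps: f64) -> bool {
    candidate_tps * 2.0 >= baseline_tps
}

/// ort on CUDA: within 20% of ggml-cuda.
fn within_twenty_percent(candidate_tps: f64, baseline_tps: f64) -> bool {
    candidate_tps >= baseline_tps * 0.8
}

fn main() {
    assert!(meets_parity(50.0, 50.0));
    assert!(!meets_parity(49.0, 50.0));
    assert!(within_factor_of_two(26.0, 50.0));
    assert!(!within_factor_of_two(24.0, 50.0));
    assert!(within_twenty_percent(40.0, 50.0));
    assert!(!within_twenty_percent(39.0, 50.0));
}
```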


Metadata

Assignees: none
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)
Milestone: none