## Why
Per `docs/planning/ALPHA-GAP-ANALYSIS.md` → "The Inference Design Goal — Multi-Persona Live Chat at Low Latency" and `memory/project_m5_is_primary_audience.md`: Qwen3.5-4B Q4_K_M was forged specifically as a concurrent low-latency sensory model on Apple Silicon. The post-launch track is native multimodality — vision-enabled Qwen3.5 variants sized per device tier.

Stopgap until this lands (good enough for go-live): text-only Qwen3.5 + sensory bridges (`VisionDescriptionService` → image-to-text, Whisper → audio-to-text, Piper/Orpheus → text-to-audio). All already in the codebase. PR #891 restores parity by un-cheating the `SKIP_STT`/`SKIP_TTS` hatches.
Launch can go without native vision-Qwen3.5. But it's quickly needed post-launch.
## What
Forge native-vision Qwen3.5 variants using the factory + sentinel-ai pipeline (which was built for exactly this — see PR #891 parent narrative). One variant per device tier:
| Tier | Hardware | Target size | Memory budget incl. sensory stack |
| --- | --- | --- | --- |
| BMW M4 (primary) | MacBook M3-M5 Pro/Max (16-48GB unified) | 4B-vision Q4_K_M | ≤4GB for model, ≤8GB for Bevy + Whisper + Piper + LiveKit + UI, ≥20GB slack |
| BMW 2 Series (aspirational) | MacBook Air M1/M2 (8GB unified) | Smaller vision-Qwen3.5 — forge exploration (2B? 3B?) | Must fit 8GB alongside sensory stack; may require more aggressive quant (Q3_K / IQ-series) |
| Corvette / Mustang | RTX 3090+ desktop (24GB+ VRAM) | Larger vision-Qwen3.5 (7B, 14B, 27B+) | CUDA scales up as VRAM allows |
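The tier ladder above reduces to a simple mapping from detected hardware to a forged variant. A minimal sketch, assuming the thresholds in the table; the type and function names (`DeviceTier`, `pick_tier`, `variant_for`) are illustrative, not from the codebase:

```rust
/// Hypothetical device-tier ladder; names mirror the table above.
#[derive(Debug, PartialEq)]
enum DeviceTier {
    BmwM4,      // MacBook M3-M5 Pro/Max, 16-48GB unified
    Bmw2Series, // MacBook Air M1/M2, 8GB unified
    Corvette,   // RTX 3090+ desktop, 24GB+ VRAM
}

/// `mem_gb` is unified memory on Apple Silicon, VRAM on desktop GPUs.
fn pick_tier(is_apple_silicon: bool, mem_gb: u32) -> Option<DeviceTier> {
    match (is_apple_silicon, mem_gb) {
        (true, m) if m >= 16 => Some(DeviceTier::BmwM4),
        (true, m) if m >= 8 => Some(DeviceTier::Bmw2Series),
        (false, v) if v >= 24 => Some(DeviceTier::Corvette),
        _ => None, // below any supported tier
    }
}

/// Illustrative variant names following the distribution pattern.
fn variant_for(tier: &DeviceTier) -> &'static str {
    match tier {
        DeviceTier::BmwM4 => "qwen3.5-4b-vision Q4_K_M",
        DeviceTier::Bmw2Series => "qwen3.5-2b/3b-vision Q3_K / IQ-series",
        DeviceTier::Corvette => "qwen3.5-7b+-vision",
    }
}

fn main() {
    let tier = pick_tier(true, 36).expect("supported hardware");
    println!("{:?} -> {}", tier, variant_for(&tier));
}
```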
## Prior art already in the repo
- `workers/vendor/llama.cpp` vendored — Vulkan + CUDA + Metal backends wired via the `workers/llama` crate (PR #891)
- `n_seq_max` parallel sequences per model instance (PR #891)
- `memory/project_device_target_ladder.md`
## Constraints
- Never degrade: the Apple Silicon path must never drop to CPU. Forged variants must work through `ggml-metal` (native Dev) or `ggml-vulkan` (container Carl).
- Concurrent sensory envelope: each forged variant must fit alongside Bevy render + Whisper STT + Piper/Orpheus TTS + LiveKit WebRTC simultaneously on the target hardware. Not just "fits in RAM" — it must fit alongside everything else running.
- Publish via HF distribution (`continuum-ai/qwen3.5-<tier>-vision-forged-GGUF`) with continuum-ai tags; reproducibility via forge alloys.
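The "concurrent sensory envelope" constraint can be stated as arithmetic: model weight + sensory-stack budget + required slack must not exceed device memory. A minimal sketch using the M4-tier numbers from the table; the function name and the idea of a fixed slack figure are assumptions for illustration:

```rust
/// Returns true if the model fits the concurrent sensory envelope,
/// i.e. next to everything else running, not just in RAM.
/// Budgets are the BMW M4 tier figures from the table above.
fn fits_envelope(model_gb: f64, device_mem_gb: f64) -> bool {
    let sensory_stack_gb = 8.0; // Bevy + Whisper + Piper + LiveKit + UI
    let min_slack_gb = 20.0;    // headroom required on the primary tier
    model_gb + sensory_stack_gb + min_slack_gb <= device_mem_gb
}

fn main() {
    // 4B Q4_K_M (~4GB) on a 36GB MacBook Pro: fits.
    assert!(fits_envelope(4.0, 36.0));
    // Same model on a 16GB machine: the model alone fits, the envelope does not.
    assert!(!fits_envelope(4.0, 16.0));
}
```

Lower tiers would carry smaller stack and slack budgets; the point is that acceptance is checked against the whole envelope, not the model file size.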
## Dependencies
## Acceptance
- Variants published under `continuum-ai/*-vision-forged-GGUF`
- `src/workers/continuum-core/src/inference/model_registry.json` updated with device-tier hints
- `device_target_ladder` detection
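The registry entry shape is not specified here; a hypothetical fragment of `model_registry.json` with device-tier hints might look like the following (all field names are illustrative assumptions, not the actual schema):

```json
{
  "models": [
    {
      "id": "continuum-ai/qwen3.5-4b-vision-forged-GGUF",
      "quant": "Q4_K_M",
      "device_tier": "bmw_m4",
      "min_memory_gb": 16,
      "backends": ["ggml-metal", "ggml-vulkan", "cuda"]
    }
  ]
}
```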
## Related memory
- `memory/project_m5_is_primary_audience.md` — audience definition
- `memory/project_qwen35_forge_targets.md` — forge target rules
- `memory/project_device_target_ladder.md` — hardware tiers
- `memory/feedback_inference_runtime_split.md` — llama.cpp for inference, Candle for training