
Vision-enabled Qwen3.5 forge plan — variants per device tier (M5 / MBA / 3090+) #894

@joelteply

Description


Why

Per docs/planning/ALPHA-GAP-ANALYSIS.md → "The Inference Design Goal — Multi-Persona Live Chat at Low Latency" and memory/project_m5_is_primary_audience.md: Qwen3.5-4B Q4_K_M was forged specifically as a concurrent low-latency sensory model on Apple Silicon. The post-launch track is native multimodality — vision-enabled Qwen3.5 variants sized per device tier.

Stopgap until this lands (good enough for go-live): text-only Qwen3.5 plus the sensory bridges (VisionDescriptionService → image-to-text, Whisper → audio-to-text, Piper/Orpheus → text-to-audio). All of these are already in the codebase; PR #891 restores parity by closing the SKIP_STT/SKIP_TTS escape hatches.

Launch can ship without native vision-Qwen3.5, but it is needed quickly post-launch.

What

Forge native-vision Qwen3.5 variants using the factory + sentinel-ai pipeline (which was built for exactly this — see PR #891 parent narrative). One variant per device tier:

| Tier | Hardware | Target size | Memory budget incl. sensory stack |
|------|----------|-------------|-----------------------------------|
| BMW M4 (primary) | MacBook M3-M5 Pro/Max (16-48GB unified) | 4B-vision Q4_K_M | ≤4GB for model, ≤8GB for Bevy + Whisper + Piper + LiveKit + UI, ≥20GB slack |
| BMW 2 Series (aspirational) | MacBook Air M1/M2 (8GB unified) | Smaller vision-Qwen3.5 — forge exploration (2B? 3B?) | Must fit 8GB alongside sensory stack; may require more aggressive quant (Q3_K / IQ-series) |
| Corvette / Mustang | RTX 3090+ desktop (24GB+ VRAM) | Larger vision-Qwen3.5 (7B, 14B, 27B+) | CUDA scales up as VRAM allows |
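As a rough sanity check on the M4 tier's ≤4GB model budget, a back-of-envelope size estimate can be sketched. This assumes Q4_K_M averages roughly 4.85 bits per weight (a commonly cited figure for llama.cpp's Q4_K_M mix, not a measured value for these forges) and a hypothetical ~0.4B-parameter f16 vision encoder:

```rust
// Rough GGUF weight-size estimate: params (billions) * bits-per-weight / 8
// gives GB. KV cache and runtime buffers are excluded; the 4.85 bpw figure
// for Q4_K_M and the vision-encoder size are assumptions, not measurements.
fn est_model_gb(params_billions: f64, bits_per_weight: f64) -> f64 {
    params_billions * bits_per_weight / 8.0
}

fn main() {
    let lm = est_model_gb(4.0, 4.85); // ~2.4 GB of language-model weights
    let vision = est_model_gb(0.4, 16.0); // hypothetical f16 vision encoder
    let total = lm + vision;
    println!("LM ~{lm:.2} GB, +vision ~{total:.2} GB (budget: 4 GB)");
    assert!(total < 4.0); // leaves headroom inside the <=4 GB model budget
}
```

Under these assumptions the 4B tier lands around 3.2GB of weights, leaving headroom under the 4GB budget for KV cache growth during multi-persona chat.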

Prior art already in the repo

Constraints

  • Never degrade: Apple Silicon path must never drop to CPU. Forged variants must work through ggml-metal (native Dev) or ggml-vulkan (container Carl).
  • Concurrent sensory envelope: each forged variant must fit alongside Bevy render + Whisper STT + Piper/Orpheus TTS + LiveKit WebRTC simultaneously on the target hardware. Not just "fits in RAM" — fits alongside everything else running.
  • Publish via HF distribution (continuum-ai/qwen3.5-<tier>-vision-forged-GGUF) with continuum-ai tags, reproducibility via forge alloys.
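The never-degrade constraint implies backend selection should fail loudly rather than silently fall back to CPU. A minimal sketch of that policy, where the enum, function name, and detection inputs are all illustrative (not the repo's actual API), and `in_container` stands in for the native-Dev vs container-Carl split:

```rust
#[derive(Debug, PartialEq)]
enum Backend {
    Metal,  // ggml-metal, native on Apple Silicon
    Vulkan, // ggml-vulkan, Apple Silicon inside the container
    Cuda,   // RTX 3090+ desktops
}

// Pick a GPU backend or refuse outright; deliberately no CPU variant exists,
// so a fallback cannot be returned by accident. All names are hypothetical.
fn pick_backend(apple_silicon: bool, in_container: bool, has_cuda: bool)
    -> Result<Backend, String>
{
    if apple_silicon {
        Ok(if in_container { Backend::Vulkan } else { Backend::Metal })
    } else if has_cuda {
        Ok(Backend::Cuda)
    } else {
        Err("no GPU backend available; refusing CPU fallback".into())
    }
}

fn main() {
    assert_eq!(pick_backend(true, false, false), Ok(Backend::Metal));
    assert_eq!(pick_backend(true, true, false), Ok(Backend::Vulkan));
    assert_eq!(pick_backend(false, false, true), Ok(Backend::Cuda));
    assert!(pick_backend(false, false, false).is_err());
}
```

Encoding "no CPU" as the absence of a variant, rather than as a runtime check, makes the never-degrade invariant unrepresentable rather than merely discouraged.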

Dependencies

Acceptance

  • M5 Mac: vision-Qwen3.5 variant runs alongside 3-5 concurrent personas at conversation pace, sensory stack intact
  • MacBook Air: the smaller vision-Qwen3.5 variant fits within the 8GB envelope without breaking the concurrent-sensory invariant
  • RTX 3090+: larger vision-Qwen3.5 variant scales up cleanly, CUDA backend
  • All three published to HF under continuum-ai/*-vision-forged-GGUF
  • Each variant registered in src/workers/continuum-core/src/inference/model_registry.json with device-tier hints
  • Persona runtime auto-selects the right variant based on device_target_ladder detection
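The auto-selection criterion above could look roughly like the following sketch. Tier names follow the table in this issue, but the functions, thresholds, and the concrete HF repo ids (filled in from the `continuum-ai/qwen3.5-<tier>-vision-forged-GGUF` template) are assumptions, not the actual device_target_ladder or model_registry.json contents:

```rust
#[derive(Debug, PartialEq)]
enum Tier { M4, TwoSeries, Corvette }

// Map detected hardware onto the device-tier ladder. Thresholds are
// illustrative stand-ins for whatever device_target_ladder actually detects.
fn detect_tier(unified_mem_gb: u32, cuda_vram_gb: u32) -> Tier {
    if cuda_vram_gb >= 24 {
        Tier::Corvette
    } else if unified_mem_gb >= 16 {
        Tier::M4
    } else {
        Tier::TwoSeries
    }
}

// Hypothetical registry lookup keyed by tier, mirroring the kind of
// device-tier hint model_registry.json would carry per variant.
fn variant_repo(tier: &Tier) -> &'static str {
    match tier {
        Tier::M4 => "continuum-ai/qwen3.5-m4-vision-forged-GGUF",
        Tier::TwoSeries => "continuum-ai/qwen3.5-2series-vision-forged-GGUF",
        Tier::Corvette => "continuum-ai/qwen3.5-corvette-vision-forged-GGUF",
    }
}

fn main() {
    assert_eq!(detect_tier(32, 0), Tier::M4);
    assert_eq!(detect_tier(8, 0), Tier::TwoSeries);
    assert_eq!(detect_tier(8, 24), Tier::Corvette);
    println!("selected: {}", variant_repo(&detect_tier(32, 0)));
}
```

Keeping detection and registry lookup as separate steps means the registry can gain tiers (or per-tier quant hints) without touching the detection logic.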

Related memory

  • memory/project_m5_is_primary_audience.md — audience definition
  • memory/project_qwen35_forge_targets.md — forge target rules
  • memory/project_device_target_ladder.md — hardware tiers
  • memory/feedback_inference_runtime_split.md — llama.cpp for inference, Candle for training

Metadata


Labels: forge (Model forging and training)
