## Why
Per `docs/planning/ALPHA-GAP-ANALYSIS.md` → "The Inference Design Goal — Multi-Persona Live Chat at Low Latency" and `memory/project_m5_is_primary_audience.md`: Qwen3.5-4B Q4_K_M was forged specifically as a concurrent low-latency sensory model on Apple Silicon. The post-launch track is native multimodality — vision-enabled Qwen3.5 variants sized per device tier.

Stopgap until this lands (good enough for go-live): text-only Qwen3.5 + sensory bridges (`VisionDescriptionService` → image-to-text, Whisper → audio-to-text, Piper/Orpheus → text-to-audio). All already in the codebase. PR #891 restores parity by un-cheating the `SKIP_STT`/`SKIP_TTS` hatches.
Launch can go without native vision-Qwen3.5. But it's quickly needed post-launch.
## What
Forge native-vision Qwen3.5 variants using the factory + sentinel-ai pipeline (which was built for exactly this — see PR #891 parent narrative). One variant per device tier:
| Tier | Hardware | Target size | Memory budget incl. sensory stack |
| --- | --- | --- | --- |
| BMW M4 (primary) | MacBook M3-M5 Pro/Max (16-48GB unified) | 4B-vision Q4_K_M | ≤4GB for model, ≤8GB for Bevy + Whisper + Piper + LiveKit + UI, ≥20GB slack |
| BMW 2 Series (aspirational) | MacBook Air M1/M2 (8GB unified) | Smaller vision-Qwen3.5 — forge exploration (2B? 3B?) | Must fit 8GB alongside sensory stack; may require more aggressive quant (Q3_K / IQ-series) |
| Corvette / Mustang | RTX 3090+ desktop (24GB+ VRAM) | Larger vision-Qwen3.5 (7B, 14B, 27B+) | CUDA scales up as VRAM allows |
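The tier ladder above reduces to a simple mapping from detected hardware to a forged variant. A minimal sketch, assuming the thresholds in the table; the type and function names (`DeviceTier`, `pick_tier`, `variant_for`) are illustrative, not from the codebase:

```rust
/// Hypothetical device-tier ladder; names mirror the table above.
#[derive(Debug, PartialEq)]
enum DeviceTier {
    BmwM4,      // MacBook M3-M5 Pro/Max, 16-48GB unified
    Bmw2Series, // MacBook Air M1/M2, 8GB unified
    Corvette,   // RTX 3090+ desktop, 24GB+ VRAM
}

/// `mem_gb` is unified memory on Apple Silicon, VRAM on desktop GPUs.
fn pick_tier(is_apple_silicon: bool, mem_gb: u32) -> Option<DeviceTier> {
    match (is_apple_silicon, mem_gb) {
        (true, m) if m >= 16 => Some(DeviceTier::BmwM4),
        (true, m) if m >= 8 => Some(DeviceTier::Bmw2Series),
        (false, v) if v >= 24 => Some(DeviceTier::Corvette),
        _ => None, // below any supported tier
    }
}

/// Illustrative variant names following the distribution pattern.
fn variant_for(tier: &DeviceTier) -> &'static str {
    match tier {
        DeviceTier::BmwM4 => "qwen3.5-4b-vision Q4_K_M",
        DeviceTier::Bmw2Series => "qwen3.5-2b/3b-vision Q3_K / IQ-series",
        DeviceTier::Corvette => "qwen3.5-7b+-vision",
    }
}

fn main() {
    let tier = pick_tier(true, 36).expect("supported hardware");
    println!("{:?} -> {}", tier, variant_for(&tier));
}
```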
## Prior art already in the repo
- `workers/vendor/llama.cpp` vendored — Vulkan + CUDA + Metal backends wired via the `workers/llama` crate (PR #891)
- `n_seq_max` parallel sequences per model instance (PR #891)
- `memory/project_device_target_ladder.md`
## Constraints
- Never degrade: the Apple Silicon path must never drop to CPU. Forged variants must work through `ggml-metal` (native Dev) or `ggml-vulkan` (container Carl).
- Concurrent sensory envelope: each forged variant must fit alongside Bevy render + Whisper STT + Piper/Orpheus TTS + LiveKit WebRTC simultaneously on the target hardware. Not just "fits in RAM" — it must fit alongside everything else running.
- Publish via HF distribution (`continuum-ai/qwen3.5-<tier>-vision-forged-GGUF`) with continuum-ai tags; reproducibility via forge alloys.
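The "concurrent sensory envelope" constraint can be stated as arithmetic: model weight + sensory-stack budget + required slack must not exceed device memory. A minimal sketch using the M4-tier numbers from the table; the function name and the idea of a fixed slack figure are assumptions for illustration:

```rust
/// Returns true if the model fits the concurrent sensory envelope,
/// i.e. next to everything else running, not just in RAM.
/// Budgets are the BMW M4 tier figures from the table above.
fn fits_envelope(model_gb: f64, device_mem_gb: f64) -> bool {
    let sensory_stack_gb = 8.0; // Bevy + Whisper + Piper + LiveKit + UI
    let min_slack_gb = 20.0;    // headroom required on the primary tier
    model_gb + sensory_stack_gb + min_slack_gb <= device_mem_gb
}

fn main() {
    // 4B Q4_K_M (~4GB) on a 36GB MacBook Pro: fits.
    assert!(fits_envelope(4.0, 36.0));
    // Same model on a 16GB machine: the model alone fits, the envelope does not.
    assert!(!fits_envelope(4.0, 16.0));
}
```

Lower tiers would carry smaller stack and slack budgets; the point is that acceptance is checked against the whole envelope, not the model file size.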
## Dependencies
## Acceptance
- Variants published under `continuum-ai/*-vision-forged-GGUF`
- `src/workers/continuum-core/src/inference/model_registry.json` updated with device-tier hints
- `device_target_ladder` detection
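The registry entry shape is not specified here; a hypothetical fragment of `model_registry.json` with device-tier hints might look like the following (all field names are illustrative assumptions, not the actual schema):

```json
{
  "models": [
    {
      "id": "continuum-ai/qwen3.5-4b-vision-forged-GGUF",
      "quant": "Q4_K_M",
      "device_tier": "bmw_m4",
      "min_memory_gb": 16,
      "backends": ["ggml-metal", "ggml-vulkan", "cuda"]
    }
  ]
}
```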
## Related memory
- `memory/project_m5_is_primary_audience.md` — audience definition
- `memory/project_qwen35_forge_targets.md` — forge target rules
- `memory/project_device_target_ladder.md` — hardware tiers
- `memory/feedback_inference_runtime_split.md` — llama.cpp for inference, Candle for training