Discovery
Qwen3.5 0.8B does real-time video captioning on a Mac: <1s per frame, a ~1GB model, and streaming descriptions as the video plays. It understands scenes rather than just detecting objects.
Source: https://x.com/HuggingModels — running on Mac Studio M2 Ultra via MLX.
Impact on Continuum
This replaces our current VisionDescriptionService pipeline (YOLO + cloud vision) with a FULLY LOCAL solution:
1GB model — fits on MacBook Air alongside everything else
<1s per frame — real-time, not batch
Scene understanding — "a hand is positioned in the foreground, palm facing the viewer" not just "hand detected"
Streaming — describes as video plays, not after
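The streaming behavior above can be sketched as a per-frame loop that emits a description as each frame arrives and flags any frame that blows the real-time budget. This is a minimal sketch: `caption_frame` is a hypothetical stand-in for the actual local Qwen3.5-0.8B call, and the 1-second budget mirrors the "<1s per frame" figure from the source.

```python
import time

def caption_frame(frame):
    # Hypothetical stand-in for the local 0.8B captioner
    # (the source reports <1s per frame on Apple silicon).
    return f"scene description for {frame}"

def stream_captions(frames, budget_s=1.0):
    # Emit a description per frame as the video plays; the boolean flags
    # frames that finished within the real-time budget, so a caller can
    # drop frames instead of falling behind the live feed.
    for frame in frames:
        t0 = time.monotonic()
        text = caption_frame(frame)
        within_budget = (time.monotonic() - t0) <= budget_s
        yield frame, text, within_budget
```

Dropping over-budget frames (rather than queueing them) is what keeps this "streaming" rather than batch: the description always tracks the frame currently on screen.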
Use Cases
Live call vision: a persona watches a WebRTC video feed and describes it to text-only personas in real time
Screenshot verification (#453, coding agent visual + runtime verification: screenshot + console errors + simulator/emulator testing): describe a rendered page in <1s for coding-agent QA
Game testing: watch gameplay and describe it frame-by-frame for automated evaluation
Avatar QA: detect vertex corruption automatically ("geometry appears shredded" vs "clean anime face")
UI testing: "The button is in the top-right corner, the text says Submit" — automated visual assertions
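For the UI-testing use case, the model's free-text description can be turned into a pass/fail check by asserting that required phrases appear in it. `assert_visual` below is a hypothetical helper, not part of VisionDescriptionService; it shows the shape of an automated visual assertion over a caption string.

```python
def assert_visual(description, must_contain):
    # Case-insensitive check that every expected phrase appears in the
    # model's description; returns (passed, phrases that were missing).
    missing = [phrase for phrase in must_contain
               if phrase.lower() not in description.lower()]
    return (len(missing) == 0, missing)

# Example: verify the caption mentions the Submit button and its position.
ok, missing = assert_visual(
    "The button is in the top-right corner, the text says Submit",
    ["Submit", "top-right"],
)
```

Substring matching is deliberately crude; it is enough for smoke-level QA, and the missing-phrase list gives the coding agent something concrete to report when an assertion fails.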
Integration
Replace VisionDescriptionService's cloud path with local Qwen3.5-0.8B:
Load via Candle (GGUF) or MLX (on Mac)
Content-addressed cache still applies (don't re-describe identical frames)
Falls back to cloud vision if local model unavailable
0.8B is small enough to be ALWAYS loaded — no paging needed
The Bigger Picture
With this model:
Every persona can SEE at zero cost
Visual QA becomes instant and free
The sensory pipeline is 100% local
MacBook Air becomes a fully-sighted system
This is what "local AI is getting unreasonably capable" means for us.
Related