Discovery
Qwen3.5 0.8B does real-time video captioning on a Mac: <1s per frame, a ~1GB model, and streaming descriptions as the video plays. It understands scenes rather than just detecting objects.
Source: https://x.com/HuggingModels — running on Mac Studio M2 Ultra via MLX.
Impact on Continuum
This replaces our current VisionDescriptionService pipeline (YOLO + cloud vision) with a FULLY LOCAL solution:
1GB model — fits on MacBook Air alongside everything else
<1s per frame — real-time, not batch
Scene understanding — "a hand is positioned in the foreground, palm facing the viewer" not just "hand detected"
Streaming — describes as video plays, not after
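The streaming behavior above can be sketched as a per-frame loop that emits a description as each frame arrives and flags any frame that blows the real-time budget. This is a minimal sketch: `caption_frame` is a hypothetical stand-in for the actual local Qwen3.5-0.8B call, and the 1-second budget mirrors the "<1s per frame" figure from the source.

```python
import time

def caption_frame(frame):
    # Hypothetical stand-in for the local 0.8B captioner
    # (the source reports <1s per frame on Apple silicon).
    return f"scene description for {frame}"

def stream_captions(frames, budget_s=1.0):
    # Emit a description per frame as the video plays; the boolean flags
    # frames that finished within the real-time budget, so a caller can
    # drop frames instead of falling behind the live feed.
    for frame in frames:
        t0 = time.monotonic()
        text = caption_frame(frame)
        within_budget = (time.monotonic() - t0) <= budget_s
        yield frame, text, within_budget
```

Dropping over-budget frames (rather than queueing them) is what keeps this "streaming" rather than batch: the description always tracks the frame currently on screen.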
Use Cases
Live call vision: a persona watches a WebRTC video feed and describes it to text-only personas in real time
Screenshot verification (#453, coding agent visual + runtime verification: screenshot + console errors + simulator/emulator testing): describe a rendered page in <1s for coding-agent QA
Game testing: watch gameplay and describe it frame-by-frame for automated evaluation
Avatar QA: detect vertex corruption automatically ("geometry appears shredded" vs "clean anime face")
UI testing: "The button is in the top-right corner, the text says Submit" — automated visual assertions
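For the UI-testing use case, the model's free-text description can be turned into a pass/fail check by asserting that required phrases appear in it. `assert_visual` below is a hypothetical helper, not part of VisionDescriptionService; it shows the shape of an automated visual assertion over a caption string.

```python
def assert_visual(description, must_contain):
    # Case-insensitive check that every expected phrase appears in the
    # model's description; returns (passed, phrases that were missing).
    missing = [phrase for phrase in must_contain
               if phrase.lower() not in description.lower()]
    return (len(missing) == 0, missing)

# Example: verify the caption mentions the Submit button and its position.
ok, missing = assert_visual(
    "The button is in the top-right corner, the text says Submit",
    ["Submit", "top-right"],
)
```

Substring matching is deliberately crude; it is enough for smoke-level QA, and the missing-phrase list gives the coding agent something concrete to report when an assertion fails.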
Integration
Replace VisionDescriptionService's cloud path with local Qwen3.5-0.8B:
Load via Candle (GGUF) or MLX (on Mac)
Content-addressed cache still applies (don't re-describe identical frames)
Falls back to cloud vision if local model unavailable
0.8B is small enough to be ALWAYS loaded — no paging needed
The Bigger Picture
With this model:
Every persona can SEE at zero cost
Visual QA becomes instant and free
The sensory pipeline is 100% local
MacBook Air becomes a fully-sighted system
This is what "local AI is getting unreasonably capable" means for us.
Related