Summary
Personas need to see, hear, and speak at human-conversational speed. This is NOT about adding modalities (see #649, #650) — it's about making them FAST enough for real-time interaction.
Latency Budgets
| Sense | Target | Current | Gap |
| --- | --- | --- | --- |
| Vision (scene understanding) | <100ms | Bridge via VisionDescriptionService (slow) | Need native or distilled model |
| Audio input (hearing) | <50ms | STT bridge (200-500ms) | Need streaming encoder |
| Audio output (speech) | <150ms first-byte | TTS bridge (500ms+) | Need streaming vocoder |
| Touch/interaction events | <16ms | Already fast (DOM events) | OK |
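
One way to make these budgets checkable is to encode the targets as data and compare them against measured latencies. The sketch below assumes a nearest-rank 95th percentile is the statistic of interest; the names `TARGET_MS`, `percentile`, and `meetsBudget` are hypothetical and not taken from the codebase.

```typescript
// Latency targets in ms, copied from the table above. Everything else is illustrative.
type Sense = "vision" | "audioIn" | "audioOut" | "touch";

const TARGET_MS: Record<Sense, number> = {
  vision: 100,   // scene understanding
  audioIn: 50,   // hearing
  audioOut: 150, // speech, time to first byte
  touch: 16,     // interaction events
};

// Nearest-rank percentile of a non-empty sample set, with p in (0, 100].
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) {
    throw new Error("percentile needs at least one sample");
  }
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// True when the chosen percentile of the measured latencies is under the target.
function meetsBudget(sense: Sense, samplesMs: number[], p: number = 95): boolean {
  return percentile(samplesMs, p) < TARGET_MS[sense];
}
```

Checking a tail percentile rather than the mean matches the intent of the table: a pipeline that is fast on average but occasionally stalls still breaks the interaction.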
Architecture Requirements
- ALL processing off main thread (AudioWorklet, Web Workers, Rust workers)
- Streaming — don't wait for complete input. Process chunks as they arrive
- Transferable buffers — zero-copy between threads
- Adaptive quality — degrade gracefully (lower resolution, skip frames) rather than block
- Local inference only — can't hit an API for real-time sensory processing
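
The adaptive-quality requirement can be sketched as a small controller that steps down a quality ladder whenever a frame overruns its budget and steps back up after a run of on-time frames. The ladder values, the class name `AdaptiveQuality`, and the `recoverAfter` threshold are all assumptions chosen for illustration.

```typescript
// Hypothetical quality ladder: model input edge length and "process every Nth frame".
interface QualityLevel {
  edgePx: number;
  frameStride: number;
}

const LADDER: QualityLevel[] = [
  { edgePx: 448, frameStride: 1 },
  { edgePx: 336, frameStride: 1 },
  { edgePx: 224, frameStride: 2 },
  { edgePx: 224, frameStride: 4 },
];

class AdaptiveQuality {
  private level = 0;             // index into LADDER, 0 = best quality
  private underBudgetStreak = 0; // consecutive on-time frames
  private readonly budgetMs: number;
  private readonly recoverAfter: number;

  constructor(budgetMs: number, recoverAfter: number = 30) {
    this.budgetMs = budgetMs;
    this.recoverAfter = recoverAfter;
  }

  get current(): QualityLevel {
    return LADDER[this.level];
  }

  // Report how long the last frame took; returns the level to use for the next one.
  report(elapsedMs: number): QualityLevel {
    if (elapsedMs > this.budgetMs) {
      // Overrun: degrade immediately instead of blocking.
      this.level = Math.min(this.level + 1, LADDER.length - 1);
      this.underBudgetStreak = 0;
    } else {
      this.underBudgetStreak += 1;
      if (this.underBudgetStreak >= this.recoverAfter) {
        // Sustained headroom: cautiously restore one level.
        this.level = Math.max(this.level - 1, 0);
        this.underBudgetStreak = 0;
      }
    }
    return this.current;
  }
}
```

Degrading on a single overrun but recovering only after a streak is a deliberate asymmetry: it favors staying inside the budget over oscillating between levels.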
Vision Pipeline (target: <100ms)
- Frame capture (requestAnimationFrame, with Intersection Observer only to gate capture on visibility) → <1ms
- Resize/crop to model input size → <5ms (Web Worker)
- Run distilled vision model (Qwen3.5-0.8B or MobileCLIP) → <80ms (Rust worker)
- Inject description into persona context → <5ms
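
The stage budgets above add up to 1 + 5 + 80 + 5 = 91ms, which leaves about 9ms of headroom under the 100ms target. A minimal sketch of checking that arithmetic and flagging stages that overrun follows; the stage names and the `plannedTotalMs` and `overruns` helpers are hypothetical.

```typescript
interface Stage {
  name: string;
  budgetMs: number;
}

// Per-stage budgets in ms, taken from the pipeline list above.
const VISION_STAGES: Stage[] = [
  { name: "capture", budgetMs: 1 },
  { name: "resize", budgetMs: 5 },
  { name: "model", budgetMs: 80 },
  { name: "inject", budgetMs: 5 },
];

const VISION_TARGET_MS = 100;

// Sum of the per-stage budgets.
function plannedTotalMs(stages: Stage[]): number {
  return stages.reduce((sum, stage) => sum + stage.budgetMs, 0);
}

// Names of stages whose measured time exceeded their budget.
// Stages with no measurement are treated as 0ms (not flagged).
function overruns(stages: Stage[], measuredMs: Map<string, number>): string[] {
  return stages
    .filter((stage) => (measuredMs.get(stage.name) ?? 0) > stage.budgetMs)
    .map((stage) => stage.name);
}

// Headroom left for scheduling jitter and message passing between threads.
const headroomMs = VISION_TARGET_MS - plannedTotalMs(VISION_STAGES);
```

The model stage takes 80 of the 91 planned milliseconds, so it is the stage where a distilled model or a lower input resolution buys the most.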
Audio Pipeline (target: <50ms input, <150ms output)
Input:
- AudioWorklet captures PCM chunks → ~0ms dispatch (runs on audio thread, delivered in 128-frame render quanta)
- Hand off to Rust worker via SharedArrayBuffer (shared memory, no copy) → <1ms
- Streaming Whisper encoder → <40ms per chunk
- Text injected to persona → <5ms
Output:
- LLM generates speech tokens → streaming
- Vocoder decodes to PCM → <100ms first chunk
- AudioWorklet plays back → <1ms queue
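
The SharedArrayBuffer hand-off in the input path is commonly built as a single-producer, single-consumer ring buffer. The sketch below is one possible layout, not the project's implementation: the class name `SampleRing`, the 8-byte index header, and the choice to drop samples that do not fit are assumptions.

```typescript
// Single-producer/single-consumer ring of Float32 samples over shared memory.
// Layout (assumed): bytes 0-7 hold two Int32 indices, samples start at byte 8.
class SampleRing {
  private readonly idx: Int32Array;    // [0] = write index, [1] = read index
  private readonly data: Float32Array;
  private readonly capacity: number;

  constructor(sab: SharedArrayBuffer, capacity: number) {
    this.idx = new Int32Array(sab, 0, 2);
    this.data = new Float32Array(sab, 8, capacity);
    this.capacity = capacity;
  }

  // Allocate backing storage for `capacity` slots (one slot stays empty).
  static allocate(capacity: number): SharedArrayBuffer {
    return new SharedArrayBuffer(8 + capacity * 4);
  }

  // Producer side, e.g. called from the audio thread. Never blocks;
  // returns how many samples were written (the rest are dropped).
  push(chunk: Float32Array): number {
    const w = Atomics.load(this.idx, 0);
    const r = Atomics.load(this.idx, 1);
    const used = (w - r + this.capacity) % this.capacity;
    const n = Math.min(this.capacity - 1 - used, chunk.length);
    for (let i = 0; i < n; i++) {
      this.data[(w + i) % this.capacity] = chunk[i];
    }
    // Publish the new write index only after the samples are in place.
    Atomics.store(this.idx, 0, (w + n) % this.capacity);
    return n;
  }

  // Consumer side, e.g. called from the worker. Returns samples copied into `out`.
  pop(out: Float32Array): number {
    const w = Atomics.load(this.idx, 0);
    const r = Atomics.load(this.idx, 1);
    const available = (w - r + this.capacity) % this.capacity;
    const n = Math.min(available, out.length);
    for (let i = 0; i < n; i++) {
      out[i] = this.data[(r + i) % this.capacity];
    }
    Atomics.store(this.idx, 1, (r + n) % this.capacity);
    return n;
  }
}
```

Each side would construct its own `SampleRing` over the same `SharedArrayBuffer`, which is sent once via `postMessage`. In browsers, `SharedArrayBuffer` is only available on cross-origin isolated pages, so this path depends on the COOP and COEP response headers being set.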
Key Principle
This is a hard real-time system. The render loop is sacred. Miss a frame budget and the experience breaks. This is where Rust workers earn their keep — JS cannot meet these latency targets.
Related