Low-latency sensory pipeline — sub-100ms vision + real-time audio for personas

## Summary

Personas need to see, hear, and speak at human-conversational speed. This is NOT about adding modalities (see #649, #650) — it's about making them FAST enough for real-time interaction.

## Latency Budgets

| Sense | Target | Current | Gap |
|-------|--------|---------|-----|
| Vision (scene understanding) | <100ms | Bridge via VisionDescriptionService (slow) | Need native or distilled model |
| Audio input (hearing) | <50ms | STT bridge (200-500ms) | Need streaming encoder |
| Audio output (speech) | <150ms first-byte | TTS bridge (500ms+) | Need streaming vocoder |
| Touch/interaction events | <16ms | Already fast (DOM events) | OK |

## Architecture Requirements

- **ALL processing off main thread** (AudioWorklet, Web Workers, Rust workers)
- **Streaming** — don't wait for complete input. Process chunks as they arrive
- **Transferable buffers** — zero-copy between threads
- **Adaptive quality** — degrade gracefully (lower resolution, skip frames) rather than block
- **Local inference only** — can't hit an API for real-time sensory processing
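
The adaptive-quality requirement could be sketched as a small controller that steps down a quality ladder when smoothed latency exceeds the budget and steps back up when there is ample headroom. Everything here is an assumption for illustration: the `AdaptiveQuality` class, the ladder values, the smoothing factor, and the 50% recovery threshold are not taken from this issue.

```typescript
// Hypothetical quality ladder: resolutions and frame strides are illustrative only.
interface QualityLevel {
  readonly inputSize: number;   // square model input edge, in pixels
  readonly frameStride: number; // process every Nth frame
}

const LADDER: readonly QualityLevel[] = [
  { inputSize: 384, frameStride: 1 },
  { inputSize: 256, frameStride: 1 },
  { inputSize: 224, frameStride: 2 },
  { inputSize: 160, frameStride: 4 },
];

class AdaptiveQuality {
  private readonly budgetMs: number;
  private readonly alpha: number; // EWMA smoothing factor (assumed value)
  private level = 0;
  private ewmaMs = 0;
  private seeded = false;

  constructor(budgetMs: number, alpha: number = 0.3) {
    this.budgetMs = budgetMs;
    this.alpha = alpha;
  }

  /** Record one measured latency and return the quality level to use next. */
  record(latencyMs: number): QualityLevel {
    // Smooth the measurements so a single slow frame does not cause flapping.
    this.ewmaMs = this.seeded
      ? this.alpha * latencyMs + (1 - this.alpha) * this.ewmaMs
      : latencyMs;
    this.seeded = true;

    if (this.ewmaMs > this.budgetMs && this.level < LADDER.length - 1) {
      this.level += 1; // over budget: degrade one step instead of blocking
    } else if (this.ewmaMs < this.budgetMs * 0.5 && this.level > 0) {
      this.level -= 1; // well under budget: recover one step
    }
    return LADDER[this.level];
  }
}
```

A caller would construct one controller per stage (for example `new AdaptiveQuality(80)` for the model step) and feed it the measured time of each processed frame.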

## Vision Pipeline (target: <100ms)

1. Frame capture (on a requestAnimationFrame tick; IntersectionObserver only gates capture while the source is visible) → <1ms
2. Resize/crop to model input size → <5ms (Web Worker)
3. Run distilled vision model (Qwen3.5-0.8B or MobileCLIP) → <80ms (Rust worker)
4. Inject description into persona context → <5ms
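
As a quick sanity check on the steps above, the per-stage upper bounds can be summed against the 100ms target. This is only a sketch: the `Stage` type and the `headroomMs` and `overBudget` helpers are hypothetical names, and the numbers are the bounds listed in the steps.

```typescript
interface Stage {
  readonly name: string;
  readonly budgetMs: number; // upper bound from the pipeline steps
}

// Upper bounds copied from the four numbered vision steps.
const VISION_STAGES: readonly Stage[] = [
  { name: "capture", budgetMs: 1 },
  { name: "resize/crop", budgetMs: 5 },
  { name: "model inference", budgetMs: 80 },
  { name: "context injection", budgetMs: 5 },
];

/** Target minus the sum of stage budgets; negative means the plan cannot fit. */
function headroomMs(stages: readonly Stage[], targetMs: number): number {
  const total = stages.reduce((sum, s) => sum + s.budgetMs, 0);
  return targetMs - total;
}

/** Names of stages whose measured time reached or exceeded their budget. */
function overBudget(
  stages: readonly Stage[],
  measuredMs: Record<string, number>,
): string[] {
  return stages
    .filter((s) => (measuredMs[s.name] ?? 0) >= s.budgetMs)
    .map((s) => s.name);
}
```

The worst-case sum is 1 + 5 + 80 + 5 = 91ms, so `headroomMs(VISION_STAGES, 100)` leaves about 9ms of slack for scheduling and message passing between threads.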

## Audio Pipeline (target: <50ms input, <150ms output)

**Input:**
1. AudioWorklet captures PCM chunks → ~0ms added overhead (runs on the audio thread; the default 128-frame render quantum spans about 2.7ms at 48kHz)
2. Hand off to Rust worker via SharedArrayBuffer (shared memory, so nothing is copied or transferred) → <1ms
3. Streaming Whisper encoder → <40ms per chunk
4. Text injected to persona → <5ms
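
One way the SharedArrayBuffer handoff in step 2 might look is a single-producer, single-consumer ring buffer, with the AudioWorklet writing and the worker reading. This is a sketch under assumptions: the `PcmRingBuffer` class and its memory layout are hypothetical, and in the design above the consumer would be Rust/WASM code reading the same memory rather than TypeScript.

```typescript
// Layout (assumed): 8-byte header of two Int32 indices, then Float32 samples.
class PcmRingBuffer {
  private readonly state: Int32Array;     // [0] = write index, [1] = read index
  private readonly samples: Float32Array; // sample storage
  private readonly capacity: number;      // one slot always stays empty

  constructor(sab: SharedArrayBuffer, capacity: number) {
    this.state = new Int32Array(sab, 0, 2);
    this.samples = new Float32Array(sab, 8, capacity);
    this.capacity = capacity;
  }

  static bytesNeeded(capacity: number): number {
    return 8 + capacity * 4;
  }

  /** Producer side. Returns false (chunk dropped) instead of blocking when full. */
  push(chunk: Float32Array): boolean {
    const w = Atomics.load(this.state, 0);
    const r = Atomics.load(this.state, 1);
    const used = (w - r + this.capacity) % this.capacity;
    if (chunk.length > this.capacity - 1 - used) return false;
    for (let i = 0; i < chunk.length; i++) {
      this.samples[(w + i) % this.capacity] = chunk[i];
    }
    // Publish the new write index only after the samples are written.
    Atomics.store(this.state, 0, (w + chunk.length) % this.capacity);
    return true;
  }

  /** Consumer side. Copies up to out.length samples and returns the count. */
  pop(out: Float32Array): number {
    const w = Atomics.load(this.state, 0);
    const r = Atomics.load(this.state, 1);
    const n = Math.min((w - r + this.capacity) % this.capacity, out.length);
    for (let i = 0; i < n; i++) {
      out[i] = this.samples[(r + i) % this.capacity];
    }
    Atomics.store(this.state, 1, (r + n) % this.capacity);
    return n;
  }
}
```

Dropping a chunk when the buffer is full keeps the audio thread from ever waiting on the consumer. Note that browsers only expose SharedArrayBuffer on cross-origin isolated pages, so the page would need the appropriate COOP/COEP headers.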

**Output:**
1. LLM generates speech tokens → streaming
2. Vocoder decodes to PCM → <100ms first chunk
3. AudioWorklet plays back → <1ms queue
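
The playback step could be sketched as a queue that always fills the output block, padding with silence when the vocoder has not delivered enough samples yet. The `PlaybackQueue` class is a hypothetical illustration; a production AudioWorklet would likely use a preallocated ring buffer instead, since this version allocates and shifts arrays on the render path.

```typescript
class PlaybackQueue {
  private chunks: Float32Array[] = [];
  private offset = 0; // read position inside chunks[0]
  underruns = 0;      // render calls that had to pad with silence

  /** Called when the vocoder produces a decoded PCM chunk. */
  enqueue(chunk: Float32Array): void {
    if (chunk.length > 0) this.chunks.push(chunk);
  }

  /** Fill `out` completely; returns how many real (non-silence) samples were written. */
  render(out: Float32Array): number {
    let written = 0;
    while (written < out.length && this.chunks.length > 0) {
      const head = this.chunks[0];
      const n = Math.min(head.length - this.offset, out.length - written);
      out.set(head.subarray(this.offset, this.offset + n), written);
      written += n;
      this.offset += n;
      if (this.offset === head.length) {
        this.chunks.shift(); // chunk fully consumed
        this.offset = 0;
      }
    }
    if (written < out.length) {
      out.fill(0, written); // pad with silence rather than stall the audio thread
      this.underruns += 1;
    }
    return written;
  }
}
```

Inside an AudioWorkletProcessor, `render` would be called once per `process()` invocation with the output channel buffer, and the `underruns` counter gives a cheap signal that the vocoder is falling behind.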

## Key Principle

Treat this as a hard real-time system. The render loop is sacred: miss a frame budget (about 16ms at 60fps) and the experience breaks. This is where Rust workers earn their keep, because JS cannot meet these latency targets.

## Related

- #582 (Native multimodal pipeline — the capability)
- #649 (Vision encoder recipe — the training)
- #650 (Audio encoder recipe — the training)
- #480 (Qwen3.5-0.8B vision)
- This issue is about INFERENCE SPEED, not training

Low-latency sensory pipeline — sub-100ms vision + real-time audio for personas #652
