Desktop: cut floating-bar voice-ask latency #6727
warmupCoreAudio() queries default-input-device metadata on a background queue so coreaudiod wakes up before the first real startCapture(). Does not open the input stream, so no menu-bar mic indicator flash. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
On cold coreaudiod, startCapture() blocks ~1 s while enumerating input devices — longer than a typical short PTT press. Users were holding PTT, speaking, releasing, and the mic hadn't started capturing yet, so short queries transcribed as empty strings. Waking the daemon once at PTT setup eliminates the first-press latency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
sendAudio() previously dropped audio when !isConnected, so the first ~500 ms of every PTT session (the websocket handshake) was lost. Short utterances that landed entirely in that window transcribed as empty. Now we always buffer and flush the buffer as soon as the handshake completes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
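The buffering behaviour described in this commit can be sketched roughly as follows. This is a minimal TypeScript illustration of the idea only; the real code is Swift, and `BufferedAudioSender`, `SendFn`, and `onConnected` are hypothetical names that do not appear in the PR.

```typescript
// Hypothetical sketch: these names are illustrative, not the Swift TranscriptionService API.
type SendFn = (chunk: Uint8Array) => void;

class BufferedAudioSender {
  private pending: Uint8Array[] = [];
  private connected = false;
  private readonly send: SendFn;

  constructor(send: SendFn) {
    this.send = send;
  }

  // Called for every captured chunk, whether or not the socket is up yet.
  sendAudio(chunk: Uint8Array): void {
    if (this.connected) {
      this.send(chunk);
    } else {
      // Before the handshake completes, keep the audio instead of dropping it.
      this.pending.push(chunk);
    }
  }

  // Called once when the websocket handshake completes.
  onConnected(): void {
    this.connected = true;
    // Flush in capture order so the transcript sees the start of the utterance first.
    for (const chunk of this.pending) this.send(chunk);
    this.pending = [];
  }
}
```

The sketch leaves the pending buffer unbounded, which is only reasonable if the handshake window stays short; a cap would be a separate design decision.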
- streamingFlushInterval 0.1s → 0.033s: UI text and downstream TTS chunks refresh ~3x faster per token batch.
- Floating-session model fallback switches from Sonnet to Haiku 4.5 when the user hasn't picked one: first-token latency on simple voice queries drops from ~2-6 s to sub-1 s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
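To illustrate what a shorter flush interval changes, here is a rough TypeScript sketch of interval-based batching of streamed text. It is an assumption about the general technique, not the ChatProvider implementation; `StreamingFlusher` and its methods are hypothetical, and only the ~33 ms interval comes from the commit message.

```typescript
// Hypothetical sketch; only the ~33 ms interval is taken from the commit message.
class StreamingFlusher {
  private buffer = "";
  private lastFlushMs = 0;
  private readonly intervalMs: number;
  private readonly onFlush: (text: string) => void;

  constructor(intervalMs: number, onFlush: (text: string) => void) {
    this.intervalMs = intervalMs;
    this.onFlush = onFlush;
  }

  // `nowMs` is passed in explicitly so the logic stays deterministic.
  append(delta: string, nowMs: number): void {
    this.buffer += delta;
    if (nowMs - this.lastFlushMs >= this.intervalMs) this.flush(nowMs);
  }

  // Emit whatever has accumulated; also used for the final flush of a response.
  flush(nowMs: number): void {
    if (this.buffer.length === 0) return;
    this.onFlush(this.buffer);
    this.buffer = "";
    this.lastFlushMs = nowMs;
  }
}
```

With a 33 ms interval, deltas arriving within the same window are delivered together, so downstream consumers such as the UI or TTS chunking see updates about three times as often as with 100 ms.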
Haiku 4.5 has ~3-5x lower first-token latency than Sonnet for typical short voice queries while matching Sonnet's quality on those inputs. Put it at the top of the picker and make it the default for new sessions; existing users keep whatever they already selected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the FloatingControlBarState picker so the two settings surfaces don't disagree on which model is default. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Matches ChatProvider.startACPBridge() — the two call sites that build the floatingModel need to agree, otherwise the warmed ACP session and the user-query session use different models and no prompt cache is shared. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
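One way to keep the two call sites from drifting is to route both through a single helper. The TypeScript sketch below is illustrative only: the real call sites are Swift, `resolveFloatingModel` is a hypothetical name, and the model id string is a placeholder because the actual id is not shown in this PR text.

```typescript
// Hypothetical placeholder: the real model id string is not shown in this PR text.
const DEFAULT_FLOATING_MODEL = "haiku-4.5-placeholder";

// Single source of truth for the floating-session model.
// An empty selection (the equivalent of selectedModel.isEmpty) falls back to the default.
function resolveFloatingModel(selectedModel: string): string {
  return selectedModel.length === 0 ? DEFAULT_FLOATING_MODEL : selectedModel;
}

// Both the warmup path and the user-query path call the same helper,
// so the warmed session and the real session use the same model.
const warmupModel = resolveFloatingModel("");
const queryModel = resolveFloatingModel("");
```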
…uery
Two fixes in updateStreamingResponseIfEnabled / interruptCurrentResponse:
1. Backend-id swap no longer restarts playback. ChatProvider replaces the AI message's local UUID with the server id when the response is persisted. The voice service saw this as a new response, reset its pipeline, and re-enqueued the full answer as a "first chunk", so the whole response played a second time. We now detect when the incoming text is a continuation of what we've already streamed and keep the playback state instead of resetting.
2. interruptCurrentResponse() no longer latches shouldInterruptNextResponse when there's nothing to interrupt. Previously, calling it with no active response set the flag true, which then marked the next real response's id as "interrupted": every delta was swallowed and no audio ever played. Hit on the first voice query after every launch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
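Both fixes can be modelled in a few lines. The TypeScript sketch below is a simplified stand-in for the Swift playback service: the `PlaybackState` class, its fields, and the prefix check are assumptions made for illustration, with `isBackendIdSwap`, `currentResponseID`, and `interruptedResponseID` borrowed from names mentioned elsewhere in this PR.

```typescript
// Hypothetical model of the playback bookkeeping; the real logic lives in Swift.
class PlaybackState {
  currentResponseID: string | null = null;
  interruptedResponseID: string | null = null;
  streamedText = "";
  resets = 0; // counts how often the pipeline was restarted

  update(id: string, text: string): void {
    // Deltas for a response the user interrupted are dropped.
    if (id === this.interruptedResponseID) return;

    const isNewID = id !== this.currentResponseID;
    // Same answer under a new id: the incoming text extends what was already streamed.
    const isBackendIdSwap =
      isNewID && this.streamedText.length > 0 && text.startsWith(this.streamedText);

    if (isNewID && !isBackendIdSwap) {
      // Genuinely new response: restart the pipeline.
      this.resets += 1;
    }
    this.currentResponseID = id;
    this.streamedText = text;
  }

  interruptCurrentResponse(): void {
    // Nothing is active: do not latch anything that could poison the next response.
    if (this.currentResponseID === null) return;
    this.interruptedResponseID = this.currentResponseID;
    this.currentResponseID = null;
    this.streamedText = "";
  }
}
```

In this model an id swap mid-stream keeps `resets` at 1, and calling `interruptCurrentResponse()` on a fresh instance leaves the next response free to play.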
preWarmSession() already creates local ACP sessions and ships the
system prompt to the bridge, but that doesn't touch Anthropic's
server-side prompt cache. The user's first real query still pays the
full cacheWrite cost (~44k tokens for the floating system prompt),
which shows up as several extra seconds on first-token latency.
After the local warmup completes, fire a throwaway session/prompt with
"ready" against a scratch session keyed "{sessionKey}-cache-warm".
Anthropic caches by system-prompt-content hash, so the real floating
session reads the same cache on the user's first query. Throwaway
session keeps the reusable one's conversation history clean.
Runs fire-and-forget after preWarmPromise resolves so handleQuery
never blocks on it. Cost: one extra Haiku API call per app launch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closing per request - was intended for local testing only
Hey @kodjima33 👋 Thank you so much for taking the time to contribute to Omi! We truly appreciate you putting in the effort to submit this pull request. After careful review, we've decided not to merge this particular PR. Please don't take this personally — we genuinely try to merge as many contributions as possible, but sometimes we have to make tough calls based on:
Your contribution is still valuable to us, and we'd love to see you contribute again in the future! If you'd like feedback on how to improve this PR or want to discuss alternative approaches, please don't hesitate to reach out. Thank you for being part of the Omi community! 💜
Greptile Summary
This PR improves floating-bar voice query latency and correctness through several independent changes: defaulting to Haiku 4.5, pre-warming Anthropic's server-side prompt cache at launch, faster streaming flush (33 ms), CoreAudio HAL pre-warm on PTT setup, pre-connection audio buffering in
The overall approach is well-reasoned and each fix addresses a documented root cause. Most changes are low-risk; the playback logic rework (
Confidence Score: 4/5
Safe to merge; all remaining findings are style/cleanup and do not affect correctness of the primary flows.
The core logic fixes (interruptCurrentResponse, isBackendIdSwap, audio buffering) are well-reasoned and trace correctly through the code. Two P2 findings (dead code in shouldInterruptNextResponse, and orphaned warmup sessions with MCP subprocesses) are worth addressing but don't block merge. The third P2 (unbounded pre-connection audio buffer) is benign in practice.
desktop/Desktop/Sources/FloatingControlBar/FloatingBarVoicePlaybackService.swift (dead shouldInterruptNextResponse field), desktop/acp-bridge/src/index.ts (orphaned warmup sessions)
Important Files Changed
Sequence Diagram
sequenceDiagram
participant App as App Launch
participant PTT as PushToTalkManager
participant ACS as AudioCaptureService
participant TS as TranscriptionService
participant FVP as FloatingBarVoicePlaybackService
participant CP as ChatProvider
participant ACP as acp-bridge
App->>PTT: setup()
PTT->>ACS: warmupCoreAudio() [async, primes coreaudiod]
App->>ACP: warmup message (floating session)
ACP->>ACP: preWarmSession() → session/new + set_model
ACP->>ACP: warmAnthropicCache() [fire-and-forget, session/prompt "ready"]
Note over PTT,TS: User presses PTT key
PTT->>FVP: interruptCurrentResponse()
Note over FVP: Only sets interruptedResponseID if currentResponseID != nil
PTT->>ACS: startCapture()
ACS-->>TS: onAudioChunk callbacks
PTT->>TS: TranscriptionService.start()
TS->>TS: connect() → WebSocket handshake
Note over TS: sendAudio() buffers audio until isConnected=true
TS->>TS: flushAudioBuffer() after 0.5s
TS-->>PTT: onSegments callback
Note over PTT,CP: User releases PTT
PTT->>CP: sendMessage(query, model=Haiku/selected)
CP->>ACP: session/prompt (floating session reused)
ACP-->>CP: text_delta stream @ 33ms flush
CP-->>FVP: updateStreamingResponseIfEnabled(isFinal=false)
Note over FVP: Checks isBackendIdSwap to avoid replay on ID swap
CP-->>FVP: updateStreamingResponseIfEnabled(isFinal=true)
async function warmAnthropicCache(
  warmCwd: string,
  configs: WarmupSessionConfig[]
): Promise<void> {
  for (const cfg of configs) {
    if (!cfg.systemPrompt) continue;
    if (!CACHE_WARM_KEYS.has(cfg.key)) continue;
    try {
      const warmupSessionKey = `${cfg.key}-cache-warm`;
      const sessionParams: Record<string, unknown> = {
        cwd: warmCwd,
        mcpServers: buildMcpServers("act", warmCwd, warmupSessionKey),
        _meta: { systemPrompt: cfg.systemPrompt },
      };
      const created = (await acpRequest("session/new", sessionParams)) as { sessionId: string };
      await acpRequest("session/set_model", { sessionId: created.sessionId, modelId: cfg.model });
      await acpRequest("session/prompt", {
        sessionId: created.sessionId,
        prompt: [{ type: "text", text: "ready" }],
      });
      logErr(`Anthropic cache primed: key=${cfg.key} model=${cfg.model}`);
    } catch (err) {
      logErr(`Anthropic cache warmup failed for ${cfg.key}: ${err}`);
    }
  }
}
Warmup session and its MCP subprocesses are never cleaned up
warmAnthropicCache creates an ACP session (via session/new) with full MCP server configuration — including a spawned omi-tools stdio process and a Playwright server — but never calls any cleanup RPC (session/delete or equivalent). The session is also not added to the sessions Map, so it can't be invalidated later. These orphaned sessions and their child processes persist in the ACP subprocess until the entire bridge process exits.
For one launch this is harmless, but if preWarmSession is called multiple times in the same process lifetime (e.g., on cwd change), multiple abandoned sessions accumulate. Consider closing the warmup session after session/prompt resolves, or at least not passing mcpServers to it so no child processes are spawned:
const sessionParams: Record<string, unknown> = {
  cwd: warmCwd,
  mcpServers: [], // no child procs needed for a pure cache-warm call
  _meta: { systemPrompt: cfg.systemPrompt },
};
Summary
Several independent changes that compound to make the floating-bar voice flow noticeably faster and stop the "first-query dropped audio / response replays at the end" bugs.
Latency wins
- selectedModel.isEmpty now picks Haiku.
- In acp-bridge, the existing preWarmSession only creates local ACP state; it doesn't touch Anthropic's server-side cache. Now we fire a throwaway session/prompt against a scratch session keyed floating-cache-warm with the same system prompt, which causes Anthropic to write the ~44 k-token cache. The user's first real query then pays cacheRead instead of cacheWrite, saving a couple more seconds. Fire-and-forget, so handleQuery never blocks on it. Cost: one extra Haiku call per app launch (~$0.001).
- streamingFlushInterval 0.1s → 0.033s in ChatProvider. UI text and downstream TTS chunk checks fire ~3x faster per token batch.
First-query correctness fixes
- CoreAudio warmup in PushToTalkManager.setup(). On cold coreaudiod, startCapture() blocks ~1 s while enumerating input devices, longer than most short PTT presses. Users were holding, speaking, releasing, and the mic hadn't started capturing yet, so short queries transcribed as empty strings. New AudioCaptureService.warmupCoreAudio() queries device metadata only (no stream open, no mic indicator) once at setup.
- Pre-connection audio buffering (TranscriptionService). Previously sendAudio() dropped data when !isConnected, losing the first ~500 ms of every PTT session (the websocket handshake window). We now always buffer and flush as soon as the handshake completes.
Voice playback fixes
- ChatProvider replaces the AI message's local UUID with the server id when the response is persisted. The voice service saw this as a new response, reset its pipeline, and re-enqueued the full answer as a "first chunk", so the whole response played a second time from the top. We now detect when the incoming text is a prefix-continuation of what we've already streamed and keep playback state instead of resetting.
- interruptCurrentResponse() was latching shouldInterruptNextResponse=true even when nothing was playing. The flag then poisoned the next real response's id (marked as interrupted → every delta swallowed). Now the flag only gets set when there's actually a response to interrupt.
Test plan
- hi / yo / other short utterances transcribe correctly on first press (was empty-string before).
- Anthropic cache primed: key=floating appears in /tmp/omi-dev.log shortly after app launch.
- tts_chunk_enqueued first=true on the final chunk).
🤖 Generated with Claude Code