Desktop: cut floating-bar voice-ask latency by kodjima33 · Pull Request #6727 · BasedHardware/omi

kodjima33 · 2026-04-16T22:59:25Z

Summary

Several independent changes that compound to make the floating-bar voice flow noticeably faster and stop the "first-query dropped audio / response replays at the end" bugs.

Latency wins

Default floating-bar model → Haiku 4.5. First-token on typical short voice queries drops from ~2-6 s to sub-1 s. Users who already picked Sonnet / Opus keep their choice; only selectedModel.isEmpty now picks Haiku.
Anthropic prompt cache pre-warmed at app launch (acp-bridge). The existing preWarmSession only creates local ACP state — it doesn't touch Anthropic's server-side cache. Now we fire a throwaway session/prompt against a scratch session keyed floating-cache-warm with the same system prompt, which causes Anthropic to write the 44 k-token cache. The user's first real query then pays cacheRead instead of cacheWrite, saving a couple more seconds. Fire-and-forget, so handleQuery never blocks on it. Cost: one extra Haiku call per app launch ($0.001).
Streaming flush interval 0.1s → 0.033s in ChatProvider. UI text and downstream TTS chunk checks fire ~3x faster per token batch.
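
The default-model rule above (only an empty selection falls back to Haiku) can be sketched in a few lines. This is a minimal TypeScript illustration, not the PR's Swift code: `resolveFloatingModel` and the `"haiku-4.5"` id string are hypothetical placeholders.

```typescript
// Hypothetical model id, used only for illustration.
const HAIKU_FALLBACK = "haiku-4.5";

// Returns the model to use for a floating-bar query.
// An empty selection means the user never picked a model, so fall back to Haiku.
// Any explicit choice (for example Sonnet or Opus) is returned unchanged.
function resolveFloatingModel(selectedModel: string): string {
  return selectedModel.length === 0 ? HAIKU_FALLBACK : selectedModel;
}
```

Routing every call site that builds the floating model through one helper like this is one way to keep the warmed session and the user-query session on the same model.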

First-query correctness fixes

Coreaudiod pre-warm at PushToTalkManager.setup(). On cold coreaudiod, startCapture() blocks ~1 s while enumerating input devices — longer than most short PTT presses. Users were holding, speaking, releasing, and the mic hadn't started capturing yet, so short queries transcribed as empty strings. New AudioCaptureService.warmupCoreAudio() queries device metadata only (no stream open, no mic indicator) once at setup.
Buffer mic audio until the transcription socket connects (TranscriptionService). Previously sendAudio() dropped data when !isConnected, losing the first ~500 ms of every PTT session (the websocket handshake window). We now always buffer and flush as soon as the handshake completes.
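
The buffer-then-flush behaviour described above might look roughly like the sketch below. It is written in TypeScript for illustration only; the real change lives in the Swift TranscriptionService, and `BufferedAudioSender`, `onConnected`, and the `Send` callback are assumed names, not the PR's API.

```typescript
// Stand-in for whatever transmits bytes once the socket is connected.
type Send = (chunk: Uint8Array) => void;

class BufferedAudioSender {
  private pending: Uint8Array[] = [];
  private connected = false;
  private send: Send;

  constructor(send: Send) {
    this.send = send;
  }

  // Called for every mic chunk. Before the handshake completes, chunks are
  // queued instead of dropped.
  sendAudio(chunk: Uint8Array): void {
    if (this.connected) {
      this.send(chunk);
    } else {
      this.pending.push(chunk);
    }
  }

  // Called once the handshake completes: flush queued chunks in arrival
  // order, then switch to direct sends.
  onConnected(): void {
    this.connected = true;
    for (const chunk of this.pending) {
      this.send(chunk);
    }
    this.pending = [];
  }
}
```

Flushing in arrival order matters here: the transcription backend should receive the start of the utterance first, exactly as if the socket had been connected all along.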

Voice playback fixes

No more audio restart at end of stream. ChatProvider replaces the AI message's local UUID with the server id when the response is persisted. The voice service saw this as a new response, reset its pipeline, and re-enqueued the full answer as a "first chunk" — so the whole response played a second time from the top. We now detect when the incoming text is a prefix-continuation of what we've already streamed and keep playback state instead of resetting.
First-query-after-launch no longer silent. interruptCurrentResponse() was latching shouldInterruptNextResponse=true even when nothing was playing. The flag then poisoned the next real response's id (marked as interrupted → every delta swallowed). Now the flag only gets set when there's actually a response to interrupt.
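
Both playback fixes above come down to small state-machine rules, sketched below in TypeScript. This is an illustrative model under stated assumptions, not the Swift implementation: `PlaybackTracker`, `update`, `interrupt`, and the `resets` counter are hypothetical names, and the real service tracks more state than shown.

```typescript
class PlaybackTracker {
  private currentId: string | null = null;
  private spokenText = "";
  private interruptedId: string | null = null;
  resets = 0; // counts pipeline resets, for illustration only

  // Handles one streaming update and returns only the text not yet spoken.
  update(id: string, text: string): string {
    // Updates for an interrupted response are swallowed.
    if (id === this.interruptedId) return "";

    // A different id whose text continues what was already streamed is
    // treated as the backend-id swap, not as a new response.
    const isIdSwap =
      this.currentId !== null &&
      id !== this.currentId &&
      this.spokenText.length > 0 &&
      text.startsWith(this.spokenText);

    if (id !== this.currentId && !isIdSwap) {
      // Genuinely new response: reset the pipeline.
      this.spokenText = "";
      this.resets += 1;
    }
    this.currentId = id;

    const delta = text.startsWith(this.spokenText)
      ? text.slice(this.spokenText.length)
      : text;
    this.spokenText = text;
    return delta;
  }

  // Marks the active response as interrupted. With no active response this
  // is a no-op, so nothing is latched against the next response.
  interrupt(): void {
    if (this.currentId === null) return;
    this.interruptedId = this.currentId;
    this.currentId = null;
    this.spokenText = "";
  }
}
```

In this sketch the interrupt is recorded against a concrete response id rather than a "next response" flag, which is why calling `interrupt()` before any response exists cannot silence the first real one.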

Test plan

Mac mini: first voice query after launch produces audio (was silent before).
Mac mini: hi / yo / other short utterances transcribe correctly on first press (was empty-string before).
Mac mini: "how is it going" returns a spoken response in ≤ 4 s on warm runs.
Mac mini: confirm Anthropic cache primed: key=floating appears in /tmp/omi-dev.log shortly after app launch.
Mac mini: confirm audio does not play the final response twice when the backend-id swap fires (look for no duplicated tts_chunk_enqueued first=true on the final chunk).
Mac mini: confirm Sonnet / Opus still work if selected in the picker.

🤖 Generated with Claude Code

warmupCoreAudio() queries default-input-device metadata on a background queue so coreaudiod wakes up before the first real startCapture(). Does not open the input stream, so no menu-bar mic indicator flash. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

On cold coreaudiod, startCapture() blocks ~1 s while enumerating input devices — longer than a typical short PTT press. Users were holding PTT, speaking, releasing, and the mic hadn't started capturing yet, so short queries transcribed as empty strings. Waking the daemon once at PTT setup eliminates the first-press latency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

sendAudio() previously dropped audio when !isConnected, so the first ~500 ms of every PTT session (the websocket handshake) was lost. Short utterances that landed entirely in that window transcribed as empty. Now we always buffer and flush the buffer as soon as the handshake completes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- streamingFlushInterval 0.1s → 0.033s: UI text and downstream TTS chunks refresh ~3x faster per token batch.
- Floating-session model fallback switches from Sonnet to Haiku 4.5 when the user hasn't picked one; first-token latency on simple voice queries drops from ~2-6 s to sub-1 s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Haiku 4.5 has ~3-5x lower first-token latency than Sonnet for typical short voice queries while matching Sonnet's quality on those inputs. Put it at the top of the picker and make it the default for new sessions; existing users keep whatever they already selected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Mirrors the FloatingControlBarState picker so the two settings surfaces don't disagree on which model is default. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Matches ChatProvider.startACPBridge() — the two call sites that build the floatingModel need to agree, otherwise the warmed ACP session and the user-query session use different models and no prompt cache is shared. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…uery
Two fixes in updateStreamingResponseIfEnabled / interruptCurrentResponse:
1. Backend-id swap no longer restarts playback. ChatProvider replaces the AI message's local UUID with the server id when the response is persisted. The voice service saw this as a new response, reset its pipeline, and re-enqueued the full answer as a "first chunk", so the whole response played a second time. We now detect when the incoming text is a continuation of what we've already streamed and keep the playback state instead of resetting.
2. interruptCurrentResponse() no longer latches shouldInterruptNextResponse when there's nothing to interrupt. Previously, calling it with no active response set the flag true, which then marked the next real response's id as "interrupted", so every delta was swallowed and no audio ever played. Hit on the first voice query after every launch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

preWarmSession() already creates local ACP sessions and ships the system prompt to the bridge, but that doesn't touch Anthropic's server-side prompt cache. The user's first real query still pays the full cacheWrite cost (~44k tokens for the floating system prompt), which shows up as several extra seconds on first-token latency. After the local warmup completes, fire a throwaway session/prompt with "ready" against a scratch session keyed "{sessionKey}-cache-warm". Anthropic caches by system-prompt-content hash, so the real floating session reads the same cache on the user's first query. Throwaway session keeps the reusable one's conversation history clean. Runs fire-and-forget after preWarmPromise resolves so handleQuery never blocks on it. Cost: one extra Haiku API call per app launch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
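
The fire-and-forget ordering described above (start the cache warm only after the local warmup resolves, and never let it block or fail the query path) can be sketched as follows. This is a hedged illustration, not the bridge's code: `warmInBackground`, its parameters, and the log strings are hypothetical.

```typescript
type Log = (msg: string) => void;

// Starts `warm` only after `preWarm` resolves, and never lets a failure
// escape: errors are logged instead of rejecting. Callers on the query path
// simply do not await the returned promise.
function warmInBackground(
  preWarm: Promise<void>,
  warm: () => Promise<void>,
  log: Log
): Promise<void> {
  return preWarm
    .then(() => warm())
    .then(() => { log("cache primed"); })
    .catch((err: unknown) => { log(`cache warmup failed: ${String(err)}`); });
}
```

Because the returned promise always resolves, a caller can drop it on the floor without risking an unhandled rejection, while tests can still await it to observe the outcome.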

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

kodjima33 · 2026-04-16T23:00:05Z

Closing per request — was intended for local testing only

github-actions · 2026-04-16T23:00:15Z

Hey @kodjima33 👋

Thank you so much for taking the time to contribute to Omi! We truly appreciate you putting in the effort to submit this pull request.

After careful review, we've decided not to merge this particular PR. Please don't take this personally — we genuinely try to merge as many contributions as possible, but sometimes we have to make tough calls based on:

Project standards — Ensuring consistency across the codebase
User needs — Making sure changes align with what our users need
Code best practices — Maintaining code quality and maintainability
Project direction — Keeping aligned with our roadmap and vision

Your contribution is still valuable to us, and we'd love to see you contribute again in the future! If you'd like feedback on how to improve this PR or want to discuss alternative approaches, please don't hesitate to reach out.

Thank you for being part of the Omi community! 💜

greptile-apps · 2026-04-16T23:07:27Z

Greptile Summary

This PR improves floating-bar voice query latency and correctness through several independent changes: defaulting to Haiku 4.5, pre-warming Anthropic's server-side prompt cache at launch, faster streaming flush (33 ms), CoreAudio HAL pre-warm on PTT setup, pre-connection audio buffering in TranscriptionService, and two voice playback bug fixes — stopping the response-replay-on-ID-swap and the first-query-after-launch silence.

The overall approach is well-reasoned and each fix addresses a documented root cause. Most changes are low-risk; the playback logic rework (isBackendIdSwap, interruptCurrentResponse) deserves careful review.

Confidence Score: 4/5

Safe to merge; all remaining findings are style/cleanup and do not affect correctness of the primary flows.

The core logic fixes (interruptCurrentResponse, isBackendIdSwap, audio buffering) are well-reasoned and trace correctly through the code. Two P2 findings — dead code (shouldInterruptNextResponse) and orphaned warmup sessions with MCP subprocesses — are worth addressing but don't block merge. The third P2 (unbounded pre-connection audio buffer) is benign in practice.

desktop/Desktop/Sources/FloatingControlBar/FloatingBarVoicePlaybackService.swift (dead shouldInterruptNextResponse field), desktop/acp-bridge/src/index.ts (orphaned warmup sessions)

Important Files Changed

Filename	Overview
desktop/Desktop/Sources/FloatingControlBar/FloatingBarVoicePlaybackService.swift	Two important fixes: `interruptCurrentResponse` no longer sets `shouldInterruptNextResponse=true` unconditionally (fixes first-query silence); `isBackendIdSwap` detection prevents the response-replay bug. `shouldInterruptNextResponse` is now dead code — always false.
desktop/Desktop/Sources/TranscriptionService.swift	Pre-connection audio buffering added to `sendAudio`; buffered audio is flushed in the existing 0.5s post-handshake callback. Buffer has no explicit maximum size, though it's bounded in practice by PTT session length.
desktop/acp-bridge/src/index.ts	Adds `warmAnthropicCache` that fires a throwaway `session/prompt` to prime Anthropic's server-side cache. The warmup session is never added to `sessions` and is not explicitly closed; it accumulates in ACP until process exit.
desktop/Desktop/Sources/AudioCaptureService.swift	Adds `warmupCoreAudio()` static method that queries default-input-device and stream-format on a background queue to prime the coreaudiod mach connection. Safe — no stream opened, no mic indicator.
desktop/Desktop/Sources/Providers/ChatProvider.swift	`streamingFlushInterval` reduced from 0.1 s to 0.033 s for faster UI updates; also applies the Haiku fallback in `sendAIQuery`.

Sequence Diagram

sequenceDiagram
    participant App as App Launch
    participant PTT as PushToTalkManager
    participant ACS as AudioCaptureService
    participant TS as TranscriptionService
    participant FVP as FloatingBarVoicePlaybackService
    participant CP as ChatProvider
    participant ACP as acp-bridge

    App->>PTT: setup()
    PTT->>ACS: warmupCoreAudio() [async, primes coreaudiod]
    App->>ACP: warmup message (floating session)
    ACP->>ACP: preWarmSession() → session/new + set_model
    ACP->>ACP: warmAnthropicCache() [fire-and-forget, session/prompt "ready"]

    Note over PTT,TS: User presses PTT key
    PTT->>FVP: interruptCurrentResponse()
    Note over FVP: Only sets interruptedResponseID if currentResponseID != nil
    PTT->>ACS: startCapture()
    ACS-->>TS: onAudioChunk callbacks
    PTT->>TS: TranscriptionService.start()
    TS->>TS: connect() → WebSocket handshake
    Note over TS: sendAudio() buffers audio until isConnected=true
    TS->>TS: flushAudioBuffer() after 0.5s
    TS-->>PTT: onSegments callback

    Note over PTT,CP: User releases PTT
    PTT->>CP: sendMessage(query, model=Haiku/selected)
    CP->>ACP: session/prompt (floating session reused)
    ACP-->>CP: text_delta stream @ 33ms flush
    CP-->>FVP: updateStreamingResponseIfEnabled(isFinal=false)
    Note over FVP: Checks isBackendIdSwap to avoid replay on ID swap
    CP-->>FVP: updateStreamingResponseIfEnabled(isFinal=true)

Comments Outside Diff (2)

desktop/Desktop/Sources/FloatingControlBar/FloatingBarVoicePlaybackService.swift, line 43

shouldInterruptNextResponse is now dead code

After this PR, shouldInterruptNextResponse is never set to true anywhere — interruptCurrentResponse() now always assigns false, stop() also assigns false, and playVoiceSample() assigns false. The field will always be false, so the conditional on line 110 (shouldInterruptNextResponse ? message.id : nil) always evaluates to nil. The property and the conditional can be removed to avoid confusing future readers.

desktop/Desktop/Sources/TranscriptionService.swift, line 226-244

Pre-connection audio buffer has no maximum size cap

sendAudio now accumulates data unconditionally when !isConnected. With maxReconnectAttempts = 10 and exponential backoff up to 32 s, the connection attempt can stay in progress for over a minute. At 16 kHz 16-bit mono, that's ~32 KB/s — potentially several megabytes of buffered audio. In practice PTT sessions are short, but a brief comment (or a defensive cap, e.g., 5 s worth ≈ 160 KB) would make the intent explicit and prevent unbounded growth in edge cases (network totally down during a long locked-listening session).
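
The arithmetic behind the suggested cap, and one possible shape for it, is sketched below in TypeScript. The audio format comes from the comment above; `CappedAudioBuffer` and its drop-oldest policy are assumptions for illustration, and a real implementation might prefer to drop the newest audio or stop capturing instead.

```typescript
// Audio format from the review comment: 16 kHz, 16-bit (2 bytes), mono.
const BYTES_PER_SECOND = 16000 * 2 * 1; // 32,000 bytes per second
const MAX_BUFFER_BYTES = BYTES_PER_SECOND * 5; // 5 s of audio = 160,000 bytes

class CappedAudioBuffer {
  private chunks: Uint8Array[] = [];
  private totalBytes = 0;
  private maxBytes: number;

  constructor(maxBytes: number) {
    this.maxBytes = maxBytes;
  }

  // Appends a chunk, then drops the oldest chunks until the buffer fits the
  // cap. The newest chunk is always kept, even if it alone exceeds the cap.
  push(chunk: Uint8Array): void {
    this.chunks.push(chunk);
    this.totalBytes += chunk.byteLength;
    while (this.totalBytes > this.maxBytes && this.chunks.length > 1) {
      const dropped = this.chunks.shift();
      if (dropped === undefined) break;
      this.totalBytes -= dropped.byteLength;
    }
  }

  // Returns the buffered chunks in arrival order and empties the buffer.
  drain(): Uint8Array[] {
    const out = this.chunks;
    this.chunks = [];
    this.totalBytes = 0;
    return out;
  }

  get size(): number {
    return this.totalBytes;
  }
}
```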

Reviews (1): Last reviewed commit: "Add changelog entries for floating-bar v..."

greptile-apps · 2026-04-16T23:07:34Z

+async function warmAnthropicCache(
+  warmCwd: string,
+  configs: WarmupSessionConfig[]
+): Promise<void> {
+  for (const cfg of configs) {
+    if (!cfg.systemPrompt) continue;
+    if (!CACHE_WARM_KEYS.has(cfg.key)) continue;
+    try {
+      const warmupSessionKey = `${cfg.key}-cache-warm`;
+      const sessionParams: Record<string, unknown> = {
+        cwd: warmCwd,
+        mcpServers: buildMcpServers("act", warmCwd, warmupSessionKey),
+        _meta: { systemPrompt: cfg.systemPrompt },
+      };
+      const created = (await acpRequest("session/new", sessionParams)) as { sessionId: string };
+      await acpRequest("session/set_model", { sessionId: created.sessionId, modelId: cfg.model });
+      await acpRequest("session/prompt", {
+        sessionId: created.sessionId,
+        prompt: [{ type: "text", text: "ready" }],
+      });
+      logErr(`Anthropic cache primed: key=${cfg.key} model=${cfg.model}`);
+    } catch (err) {
+      logErr(`Anthropic cache warmup failed for ${cfg.key}: ${err}`);
+    }
+  }
 }


Warmup session and its MCP subprocesses are never cleaned up

warmAnthropicCache creates an ACP session (via session/new) with full MCP server configuration — including a spawned omi-tools stdio process and a Playwright server — but never calls any cleanup RPC (session/delete or equivalent). The session is also not added to the sessions Map, so it can't be invalidated later. These orphaned sessions and their child processes persist in the ACP subprocess until the entire bridge process exits.

For one launch this is harmless, but if preWarmSession is called multiple times in the same process lifetime (e.g., on cwd change), multiple abandoned sessions accumulate. Consider closing the warmup session after session/prompt resolves, or at least not passing mcpServers to it so no child processes are spawned:

const sessionParams: Record<string, unknown> = {
  cwd: warmCwd,
  mcpServers: [], // no child procs needed for a pure cache-warm call
  _meta: { systemPrompt: cfg.systemPrompt },
};

kodjima33 and others added 10 commits April 16, 2026 18:58

Add Haiku 4.5 to Ask Omi model picker as the default

005f5b4

Mirrors the FloatingControlBarState picker so the two settings surfaces don't disagree on which model is default. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add changelog entries for floating-bar voice-ask perf

b0c7386

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

kodjima33 closed this Apr 16, 2026

greptile-apps Bot reviewed Apr 16, 2026


aryaminus mentioned this pull request May 29, 2026

fix(desktop): slim floating bar prompt context #7544

Merged

kodjima33 wants to merge 10 commits into main from desktop-voice-perf
