Desktop: cut floating-bar voice-ask latency #6727

Closed
kodjima33 wants to merge 10 commits into main from desktop-voice-perf

@kodjima33
Collaborator

Summary

Several independent changes that compound to make the floating-bar voice flow noticeably faster and to fix the first-query dropped-audio and end-of-stream response-replay bugs.

Latency wins

  • Default floating-bar model → Haiku 4.5. First-token latency on typical short voice queries drops from ~2-6 s to sub-1 s. Users who already picked Sonnet or Opus keep their choice; Haiku is used only when selectedModel.isEmpty.
  • Anthropic prompt cache pre-warmed at app launch (acp-bridge). The existing preWarmSession only creates local ACP state and does not touch Anthropic's server-side cache. We now fire a throwaway session/prompt against a scratch session keyed floating-cache-warm with the same system prompt, which causes Anthropic to write the ~44k-token cache. The user's first real query then pays cacheRead instead of cacheWrite, saving a couple more seconds. The call is fire-and-forget, so handleQuery never blocks on it. Cost: one extra Haiku call per app launch ($0.001).
  • Streaming flush interval 0.1s → 0.033s in ChatProvider. UI text updates and downstream TTS chunk checks now fire about 3x as often (roughly every 33 ms instead of every 100 ms).
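
To make the flush-interval change concrete, here is a minimal TypeScript sketch of interval-based batching. It is not the PR's code: the real ChatProvider is Swift, and `DeltaBatcher`, `push`, `flush`, `countFlushes`, and the injected clock are hypothetical names chosen for illustration. It also throttles on push rather than using a timer, which the real implementation may well do differently.

```typescript
// Hypothetical sketch: coalesce streamed text deltas and flush them at most
// once per interval. Names are illustrative, not the PR's Swift code.
class DeltaBatcher {
  private pending = "";
  private lastFlushMs = Number.NEGATIVE_INFINITY;
  private readonly intervalMs: number;
  private readonly onFlush: (text: string) => void;
  private readonly now: () => number;

  constructor(
    intervalMs: number,
    onFlush: (text: string) => void,
    now: () => number = () => Date.now(),
  ) {
    this.intervalMs = intervalMs;
    this.onFlush = onFlush;
    this.now = now;
  }

  // Queue a delta; flush right away if the interval has elapsed.
  push(delta: string): void {
    this.pending += delta;
    if (this.now() - this.lastFlushMs >= this.intervalMs) this.flush();
  }

  // Force out whatever is pending (call once when the stream ends).
  flush(): void {
    if (this.pending.length === 0) return;
    this.onFlush(this.pending);
    this.pending = "";
    this.lastFlushMs = this.now();
  }
}

// Count flushes for a simulated stream that emits one delta per millisecond
// for one second, using a fake clock so the result is deterministic.
function countFlushes(intervalMs: number): number {
  let clock = 0;
  let flushes = 0;
  const batcher = new DeltaBatcher(intervalMs, () => { flushes += 1; }, () => clock);
  for (clock = 0; clock < 1000; clock += 1) batcher.push("x");
  batcher.flush();
  return flushes;
}

const flushesAt100ms = countFlushes(100);
const flushesAt33ms = countFlushes(33);
```

Under this simulated stream the 33 ms interval produces roughly three times as many flushes as the 100 ms one, which is the effect the bullet describes. The trade-off is more UI and TTS work per second in exchange for lower per-batch delay.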

First-query correctness fixes

  • Coreaudiod pre-warm at PushToTalkManager.setup(). On cold coreaudiod, startCapture() blocks ~1 s while enumerating input devices — longer than most short PTT presses. Users were holding, speaking, releasing, and the mic hadn't started capturing yet, so short queries transcribed as empty strings. New AudioCaptureService.warmupCoreAudio() queries device metadata only (no stream open, no mic indicator) once at setup.
  • Buffer mic audio until the transcription socket connects (TranscriptionService). Previously sendAudio() dropped data when !isConnected, losing the first ~500 ms of every PTT session (the websocket handshake window). We now always buffer and flush as soon as the handshake completes.
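
The buffering fix follows a common pattern: hold chunks while the socket is still connecting, then drain them in order once it is up. A minimal TypeScript sketch, assuming a callback-based transport; `PreconnectSender`, `transmit`, and `onConnected` are hypothetical names, and the real TranscriptionService is Swift:

```typescript
// Hypothetical sketch of "buffer until connected, then flush in order".
// Names are illustrative, not the PR's API.
class PreconnectSender {
  private connected = false;
  private readonly backlog: Uint8Array[] = [];
  private readonly transmit: (chunk: Uint8Array) => void;

  constructor(transmit: (chunk: Uint8Array) => void) {
    this.transmit = transmit;
  }

  // Before the handshake completes, hold audio instead of dropping it.
  sendAudio(chunk: Uint8Array): void {
    if (this.connected) {
      this.transmit(chunk);
    } else {
      this.backlog.push(chunk);
    }
  }

  // Call when the handshake completes: drain the backlog oldest-first,
  // after which later chunks go straight through.
  onConnected(): void {
    this.connected = true;
    for (const chunk of this.backlog) this.transmit(chunk);
    this.backlog.length = 0;
  }
}

// Usage: two chunks arrive during the handshake, one after it.
// Chunk lengths (1, 2, 3 bytes) stand in for chunk identity.
const sentLengths: number[] = [];
const sender = new PreconnectSender((chunk) => { sentLengths.push(chunk.length); });
sender.sendAudio(new Uint8Array(1));
sender.sendAudio(new Uint8Array(2));
sender.onConnected();
sender.sendAudio(new Uint8Array(3));
```

Draining oldest-first matters here because the transcription backend needs the audio in capture order; the sketch deliberately has no size limit, matching the behaviour described above.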

Voice playback fixes

  • No more audio restart at end of stream. ChatProvider replaces the AI message's local UUID with the server id when the response is persisted. The voice service saw this as a new response, reset its pipeline, and re-enqueued the full answer as a "first chunk" — so the whole response played a second time from the top. We now detect when the incoming text is a prefix-continuation of what we've already streamed and keep playback state instead of resetting.
  • First-query-after-launch no longer silent. interruptCurrentResponse() was latching shouldInterruptNextResponse=true even when nothing was playing. The flag then poisoned the next real response's id (marked as interrupted → every delta swallowed). Now the flag only gets set when there's actually a response to interrupt.
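
Both playback fixes come down to two small guards, sketched below in TypeScript. This is an illustration under assumptions, not the PR's implementation: the real FloatingBarVoicePlaybackService is Swift and tracks more state, and `PlaybackTracker`, `update`, `interrupt`, and `resets` are hypothetical names.

```typescript
// Hypothetical sketch of the two playback guards. All names are illustrative.
class PlaybackTracker {
  private currentId: string | null = null;
  private streamedText = "";
  private interruptedId: string | null = null;
  resets = 0; // how many times playback restarted from the top

  update(id: string, text: string): void {
    // Deltas for a response that was interrupted are dropped.
    if (id === this.interruptedId) return;
    const idChanged = id !== this.currentId;
    // An id change whose text extends what was already streamed is treated
    // as the local-UUID to server-id swap, not as a new response.
    const isIdSwap =
      idChanged && this.streamedText.length > 0 && text.startsWith(this.streamedText);
    if (idChanged && !isIdSwap) {
      this.resets += 1;
    }
    this.currentId = id;
    this.streamedText = text;
  }

  interrupt(): void {
    // Only mark a response as interrupted when one is actually active.
    if (this.currentId === null) return;
    this.interruptedId = this.currentId;
    this.currentId = null;
    this.streamedText = "";
  }
}

const tracker = new PlaybackTracker();
tracker.interrupt();                        // nothing playing: must not affect the next response
tracker.update("local-1", "Hello");         // new response, playback starts
tracker.update("local-1", "Hello there");   // same id, stream continues
tracker.update("server-9", "Hello there."); // id swap with continued text, no restart
tracker.update("server-10", "Different");   // genuinely new response, playback restarts
```

The prefix check is a heuristic: it assumes a persisted response always begins with the text that was already streamed, which is the condition the PR description relies on.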

Test plan

  • Mac mini: first voice query after launch produces audio (was silent before).
  • Mac mini: hi / yo / other short utterances transcribe correctly on first press (was empty-string before).
  • Mac mini: "how is it going" returns a spoken response in ≤ 4 s on warm runs.
  • Mac mini: confirm the log line "Anthropic cache primed: key=floating" appears in /tmp/omi-dev.log shortly after app launch.
  • Mac mini: confirm audio does not play the final response twice when the backend-id swap fires (there should be no duplicate tts_chunk_enqueued first=true entry for the final chunk).
  • Mac mini: confirm Sonnet / Opus still work if selected in the picker.

🤖 Generated with Claude Code

kodjima33 and others added 10 commits April 16, 2026 18:58
warmupCoreAudio() queries default-input-device metadata on a background
queue so coreaudiod wakes up before the first real startCapture(). Does
not open the input stream, so no menu-bar mic indicator flash.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

On cold coreaudiod, startCapture() blocks ~1 s while enumerating input
devices — longer than a typical short PTT press. Users were holding
PTT, speaking, releasing, and the mic hadn't started capturing yet, so
short queries transcribed as empty strings. Waking the daemon once at
PTT setup eliminates the first-press latency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

sendAudio() previously dropped audio when !isConnected, so the first
~500 ms of every PTT session (the websocket handshake) was lost. Short
utterances that landed entirely in that window transcribed as empty.
Now we always buffer and flush the buffer as soon as the handshake
completes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- streamingFlushInterval 0.1s → 0.033s: UI text and downstream TTS
  chunks refresh ~3x faster per token batch.
- Floating-session model fallback switches from Sonnet to Haiku 4.5
  when the user hasn't picked one — first-token latency on simple
  voice queries drops from ~2-6 s to sub-1 s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Haiku 4.5 has ~3-5x lower first-token latency than Sonnet for typical
short voice queries while matching Sonnet's quality on those inputs.
Put it at the top of the picker and make it the default for new
sessions; existing users keep whatever they already selected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Mirrors the FloatingControlBarState picker so the two settings surfaces
don't disagree on which model is default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Matches ChatProvider.startACPBridge(): the two call sites that build
the floatingModel need to agree, otherwise the warmed ACP session and
the user-query session use different models and no prompt cache is
shared.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…uery

Two fixes in updateStreamingResponseIfEnabled / interruptCurrentResponse:

1. Backend-id swap no longer restarts playback. ChatProvider replaces
   the AI message's local UUID with the server id when the response is
   persisted — the voice service saw this as a new response, reset its
   pipeline, and re-enqueued the full answer as a "first chunk", so
   the whole response played a second time. We now detect when the
   incoming text is a continuation of what we've already streamed and
   keep the playback state instead of resetting.

2. interruptCurrentResponse() no longer latches shouldInterruptNextResponse
   when there's nothing to interrupt. Previously, calling it with no
   active response set the flag true, which then marked the next real
   response's id as "interrupted" — every delta was swallowed and no
   audio ever played. Hit on the first voice query after every launch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

preWarmSession() already creates local ACP sessions and ships the
system prompt to the bridge, but that doesn't touch Anthropic's
server-side prompt cache. The user's first real query still pays the
full cacheWrite cost (~44k tokens for the floating system prompt),
which shows up as several extra seconds on first-token latency.

After the local warmup completes, fire a throwaway session/prompt with
"ready" against a scratch session keyed "{sessionKey}-cache-warm".
Anthropic caches by system-prompt-content hash, so the real floating
session reads the same cache on the user's first query. Throwaway
session keeps the reusable one's conversation history clean.

Runs fire-and-forget after preWarmPromise resolves so handleQuery
never blocks on it. Cost: one extra Haiku API call per app launch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kodjima33
Collaborator Author

Closing per request — was intended for local testing only

@kodjima33 kodjima33 closed this Apr 16, 2026
@github-actions
Contributor

Hey @kodjima33 👋

Thank you so much for taking the time to contribute to Omi! We truly appreciate you putting in the effort to submit this pull request.

After careful review, we've decided not to merge this particular PR. Please don't take this personally — we genuinely try to merge as many contributions as possible, but sometimes we have to make tough calls based on:

  • Project standards — Ensuring consistency across the codebase
  • User needs — Making sure changes align with what our users need
  • Code best practices — Maintaining code quality and maintainability
  • Project direction — Keeping aligned with our roadmap and vision

Your contribution is still valuable to us, and we'd love to see you contribute again in the future! If you'd like feedback on how to improve this PR or want to discuss alternative approaches, please don't hesitate to reach out.

Thank you for being part of the Omi community! 💜

@greptile-apps
Contributor

greptile-apps Bot commented Apr 16, 2026

Greptile Summary

This PR improves floating-bar voice query latency and correctness through several independent changes: defaulting to Haiku 4.5, pre-warming Anthropic's server-side prompt cache at launch, faster streaming flush (33 ms), CoreAudio HAL pre-warm on PTT setup, pre-connection audio buffering in TranscriptionService, and two voice playback bug fixes — stopping the response-replay-on-ID-swap and the first-query-after-launch silence.

The overall approach is well-reasoned and each fix addresses a documented root cause. Most changes are low-risk; the playback logic rework (isBackendIdSwap, interruptCurrentResponse) deserves careful review.

Confidence Score: 4/5

Safe to merge; all remaining findings are style/cleanup and do not affect correctness of the primary flows.

The core logic fixes (interruptCurrentResponse, isBackendIdSwap, audio buffering) are well-reasoned and trace correctly through the code. Two P2 findings — dead code (shouldInterruptNextResponse) and orphaned warmup sessions with MCP subprocesses — are worth addressing but don't block merge. The third P2 (unbounded pre-connection audio buffer) is benign in practice.

desktop/Desktop/Sources/FloatingControlBar/FloatingBarVoicePlaybackService.swift (dead shouldInterruptNextResponse field), desktop/acp-bridge/src/index.ts (orphaned warmup sessions)

Important Files Changed

Filename | Overview
desktop/Desktop/Sources/FloatingControlBar/FloatingBarVoicePlaybackService.swift | Two important fixes: interruptCurrentResponse no longer sets shouldInterruptNextResponse=true unconditionally (fixes first-query silence); isBackendIdSwap detection prevents the response-replay bug. shouldInterruptNextResponse is now dead code (always false).
desktop/Desktop/Sources/TranscriptionService.swift | Pre-connection audio buffering added to sendAudio; buffered audio is flushed in the existing 0.5s post-handshake callback. Buffer has no explicit maximum size, though it's bounded in practice by PTT session length.
desktop/acp-bridge/src/index.ts | Adds warmAnthropicCache that fires a throwaway session/prompt to prime Anthropic's server-side cache. The warmup session is never added to sessions and is not explicitly closed; it accumulates in ACP until process exit.
desktop/Desktop/Sources/AudioCaptureService.swift | Adds warmupCoreAudio() static method that queries default-input-device and stream-format on a background queue to prime the coreaudiod mach connection. Safe: no stream opened, no mic indicator.
desktop/Desktop/Sources/Providers/ChatProvider.swift | streamingFlushInterval reduced from 0.1 s to 0.033 s for faster UI updates; also applies the Haiku fallback in sendAIQuery.

Sequence Diagram

sequenceDiagram
    participant App as App Launch
    participant PTT as PushToTalkManager
    participant ACS as AudioCaptureService
    participant TS as TranscriptionService
    participant FVP as FloatingBarVoicePlaybackService
    participant CP as ChatProvider
    participant ACP as acp-bridge

    App->>PTT: setup()
    PTT->>ACS: warmupCoreAudio() [async, primes coreaudiod]
    App->>ACP: warmup message (floating session)
    ACP->>ACP: preWarmSession() → session/new + set_model
    ACP->>ACP: warmAnthropicCache() [fire-and-forget, session/prompt "ready"]

    Note over PTT,TS: User presses PTT key
    PTT->>FVP: interruptCurrentResponse()
    Note over FVP: Only sets interruptedResponseID if currentResponseID != nil
    PTT->>ACS: startCapture()
    ACS-->>TS: onAudioChunk callbacks
    PTT->>TS: TranscriptionService.start()
    TS->>TS: connect() → WebSocket handshake
    Note over TS: sendAudio() buffers audio until isConnected=true
    TS->>TS: flushAudioBuffer() after 0.5s
    TS-->>PTT: onSegments callback

    Note over PTT,CP: User releases PTT
    PTT->>CP: sendMessage(query, model=Haiku/selected)
    CP->>ACP: session/prompt (floating session reused)
    ACP-->>CP: text_delta stream @ 33ms flush
    CP-->>FVP: updateStreamingResponseIfEnabled(isFinal=false)
    Note over FVP: Checks isBackendIdSwap to avoid replay on ID swap
    CP-->>FVP: updateStreamingResponseIfEnabled(isFinal=true)

Comments Outside Diff (2)

  1. desktop/Desktop/Sources/FloatingControlBar/FloatingBarVoicePlaybackService.swift, line 43

    P2 shouldInterruptNextResponse is now dead code

    After this PR, shouldInterruptNextResponse is never set to true anywhere — interruptCurrentResponse() now always assigns false, stop() also assigns false, and playVoiceSample() assigns false. The field will always be false, so the conditional on line 110 (shouldInterruptNextResponse ? message.id : nil) always evaluates to nil. The property and the conditional can be removed to avoid confusing future readers.

  2. desktop/Desktop/Sources/TranscriptionService.swift, lines 226-244

    P2 Pre-connection audio buffer has no maximum size cap

    sendAudio now accumulates data unconditionally when !isConnected. With maxReconnectAttempts = 10 and exponential backoff up to 32 s, the connection attempt can stay in progress for over a minute. At 16 kHz 16-bit mono, that's ~32 KB/s — potentially several megabytes of buffered audio. In practice PTT sessions are short, but a brief comment (or a defensive cap, e.g., 5 s worth ≈ 160 KB) would make the intent explicit and prevent unbounded growth in edge cases (network totally down during a long locked-listening session).
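
The figures in this comment can be checked with a few lines of arithmetic, and the suggested cap is easy to sketch. The TypeScript below is illustrative only: `pcmBytesPerSecond` and `pushCapped` are hypothetical helpers, the actual service is Swift, and the drop-oldest policy is an assumption rather than something the review specifies.

```typescript
// Hypothetical sketch of a defensive cap on pre-connection audio.
// Helper names and the drop-oldest policy are assumptions, not the PR's code.

// Raw PCM byte rate: samples per second * bytes per sample * channels.
function pcmBytesPerSecond(sampleRateHz: number, bitsPerSample: number, channels: number): number {
  return sampleRateHz * (bitsPerSample / 8) * channels;
}

// Append a chunk, then drop the oldest chunks until the total fits maxBytes.
function pushCapped(backlog: Uint8Array[], chunk: Uint8Array, maxBytes: number): void {
  backlog.push(chunk);
  let total = backlog.reduce((sum, c) => sum + c.length, 0);
  while (total > maxBytes && backlog.length > 1) {
    const dropped = backlog.shift();
    if (dropped) total -= dropped.length;
  }
}

const bytesPerSecond = pcmBytesPerSecond(16000, 16, 1); // 16 kHz, 16-bit, mono
const capBytes = 5 * bytesPerSecond;                    // 5 s worth of audio

// Simulate 8 s of audio arriving as 100 ms chunks while disconnected.
const cappedBacklog: Uint8Array[] = [];
for (let i = 0; i < 80; i++) {
  pushCapped(cappedBacklog, new Uint8Array(bytesPerSecond / 10), capBytes);
}
const bufferedBytes = cappedBacklog.reduce((sum, c) => sum + c.length, 0);
```

Dropping the oldest chunks keeps the most recent speech but discards the start of the utterance; dropping the newest chunks instead would preserve the opening words. Which is preferable depends on how the transcription backend handles a truncated start, which the review does not discuss.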

Reviews (1): Last reviewed commit: "Add changelog entries for floating-bar v..."

Comment on lines +662 to 687
async function warmAnthropicCache(
  warmCwd: string,
  configs: WarmupSessionConfig[]
): Promise<void> {
  for (const cfg of configs) {
    if (!cfg.systemPrompt) continue;
    if (!CACHE_WARM_KEYS.has(cfg.key)) continue;
    try {
      const warmupSessionKey = `${cfg.key}-cache-warm`;
      const sessionParams: Record<string, unknown> = {
        cwd: warmCwd,
        mcpServers: buildMcpServers("act", warmCwd, warmupSessionKey),
        _meta: { systemPrompt: cfg.systemPrompt },
      };
      const created = (await acpRequest("session/new", sessionParams)) as { sessionId: string };
      await acpRequest("session/set_model", { sessionId: created.sessionId, modelId: cfg.model });
      await acpRequest("session/prompt", {
        sessionId: created.sessionId,
        prompt: [{ type: "text", text: "ready" }],
      });
      logErr(`Anthropic cache primed: key=${cfg.key} model=${cfg.model}`);
    } catch (err) {
      logErr(`Anthropic cache warmup failed for ${cfg.key}: ${err}`);
    }
  }
}

P2 Warmup session and its MCP subprocesses are never cleaned up

warmAnthropicCache creates an ACP session (via session/new) with full MCP server configuration — including a spawned omi-tools stdio process and a Playwright server — but never calls any cleanup RPC (session/delete or equivalent). The session is also not added to the sessions Map, so it can't be invalidated later. These orphaned sessions and their child processes persist in the ACP subprocess until the entire bridge process exits.

For one launch this is harmless, but if preWarmSession is called multiple times in the same process lifetime (e.g., on cwd change), multiple abandoned sessions accumulate. Consider closing the warmup session after session/prompt resolves, or at least not passing mcpServers to it so no child processes are spawned:

const sessionParams: Record<string, unknown> = {
  cwd: warmCwd,
  mcpServers: [],  // no child procs needed for a pure cache-warm call
  _meta: { systemPrompt: cfg.systemPrompt },
};
