Desktop: remove client-side API keys, route STT + Gemini through backend #5393

@beastoin

Problem

The desktop app (macOS) bundles vendor API keys (DEEPGRAM_API_KEY, GEMINI_API_KEY) in the app bundle's .env file and calls external APIs directly from the client:

  • Deepgram STT: TranscriptionService.swift connects directly to wss://api.deepgram.com/v1/listen with the API key in the WebSocket auth header
  • Gemini: GeminiClient.swift and EmbeddingService.swift call Google APIs with the key in URL query parameters (?key=<KEY>)

Security risks:

  • Keys are extractable from the app bundle (Contents/Resources/.env — plain text)
  • Keys are visible in network traffic (auth headers, URL params)
  • No per-user attribution, rate limiting, or revocation granularity
  • Blast radius = full vendor account billing

Architectural inconsistency:

  • Mobile app routes ALL audio through the Python backend's /v4/listen WebSocket — API keys stay server-side
  • Desktop app bypasses the backend entirely for STT — keys ship in the client
  • Desktop misses backend features: VAD gate (~75% Deepgram cost savings), speech profiles, speaker identification, unified billing/monitoring

Proposed Solution

Phase 1: Route desktop STT through /v4/listen

The Python backend already has a fully-featured /v4/listen WebSocket endpoint with Firebase auth, used by all mobile clients. Desktop should use it too.

Swift changes:

  • Replace direct Deepgram WebSocket connection in TranscriptionService.swift with a WebSocket connection to the backend's /v4/listen (or /v4/web/listen which supports first-message token auth)
  • Remove DEEPGRAM_API_KEY from client-side .env
  • Desktop gets the VAD gate, speech profiles, and speaker ID for free
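
To make the shape of the new connection concrete, here is a minimal sketch of how a client might assemble the /v4/listen request. It is written in Python for brevity; the desktop app would do the equivalent in Swift. The query parameter names (sample_rate, codec), the "pcm16" codec value, the Bearer header format, and the example host are assumptions for illustration, not the confirmed contract of the endpoint. Only /v4/listen, Firebase auth, and the source=desktop idea come from this issue.

```python
from urllib.parse import urlencode


def build_listen_url(base_ws_url: str, *, sample_rate: int = 16000,
                     codec: str = "pcm16", source: str = "desktop") -> str:
    """Build the WebSocket URL for the backend listen endpoint.

    The parameter names and the default codec value are hypothetical;
    check backend/routers/transcribe.py for the real ones.
    """
    query = urlencode({"sample_rate": sample_rate, "codec": codec, "source": source})
    return f"{base_ws_url.rstrip('/')}/v4/listen?{query}"


def auth_headers(firebase_id_token: str) -> dict[str, str]:
    """Headers carrying the user's Firebase ID token instead of a vendor key.

    Assumes the backend accepts a standard Bearer token header.
    """
    return {"Authorization": f"Bearer {firebase_id_token}"}


# Example with a placeholder host; no vendor API key appears anywhere.
url = build_listen_url("wss://backend.example.com")
headers = auth_headers("<firebase-id-token>")
```

The point of the sketch is that the only credential leaving the client is the per-user Firebase token, which can be attributed and revoked individually.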

Backend changes:

  • May need minor adjustments to handle the desktop audio format (16 kHz stereo PCM vs. mobile's opus/pcm8)
  • Add source=desktop parameter for monitoring/billing segmentation
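
One possible format adjustment, assuming the server-side STT pipeline expects mono input, is to downmix the desktop's interleaved stereo frames before they reach the provider. The sketch below assumes 16-bit signed little-endian samples; the function name is hypothetical and nothing here is taken from the existing backend code.

```python
import sys
from array import array


def downmix_stereo_to_mono(pcm: bytes) -> bytes:
    """Average interleaved L/R 16-bit little-endian samples into one channel."""
    if len(pcm) % 4 != 0:
        raise ValueError("expected whole stereo frames (4 bytes each)")
    samples = array("h")  # signed 16-bit integers
    samples.frombytes(pcm)
    if sys.byteorder == "big":
        samples.byteswap()  # input is little-endian on the wire
    # The mean of two int16 values always fits in int16, so no clipping is needed.
    mono = array("h", ((samples[i] + samples[i + 1]) // 2
                       for i in range(0, len(samples), 2)))
    if sys.byteorder == "big":
        mono.byteswap()
    return mono.tobytes()
```

Whether the downmix belongs on the client (halving upload bandwidth) or on the backend (keeping the client simple) is a design choice this issue leaves open.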

Phase 2: Route Gemini through backend endpoints

  • Add backend API endpoints for the proactive assistant operations currently calling Gemini directly (embeddings, generation)
  • Remove GEMINI_API_KEY from client-side .env
  • Enables server-side rate limiting, cost tracking, prompt governance
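
As an illustration of the server-side rate limiting this phase enables, here is a minimal in-memory sliding-window limiter keyed by user ID. The class name and limits are hypothetical, and a real deployment would more likely keep this state in a shared store than in process memory; treat it as a sketch of the idea rather than the backend's implementation.

```python
import time
from collections import defaultdict, deque


class PerUserRateLimiter:
    """Allow at most max_requests per user within a sliding time window."""

    def __init__(self, max_requests: int, window_seconds: float, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so the limiter can be tested deterministically
        self._hits = defaultdict(deque)  # uid -> timestamps of accepted requests

    def allow(self, uid: str) -> bool:
        now = self._clock()
        hits = self._hits[uid]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

A Gemini proxy endpoint could call allow(uid) with the authenticated Firebase user ID before forwarding a request, which is not possible while the client calls Google directly with a shared key.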

Phase 3: Decommission direct API paths

  • Remove direct Deepgram/Gemini code paths from desktop app
  • Remove .env bundling of vendor keys from build pipeline
  • Add CI check to block shipping vendor API keys in app bundles
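
The CI check in the last bullet could look roughly like the following: walk the built app bundle, inspect every file whose name ends in .env, and fail the build if a vendor key has a non-empty value. The function names, the set of key names checked, and the .env-only scope are assumptions for illustration; a real check might also scan binaries and plists.

```python
import os
import re

VENDOR_KEYS = ("DEEPGRAM_API_KEY", "GEMINI_API_KEY")

# Matches lines such as `DEEPGRAM_API_KEY=abc` or `export GEMINI_API_KEY="abc"`,
# but not empty assignments such as `GEMINI_API_KEY=`.
_ASSIGNMENT = re.compile(
    r"^\s*(?:export\s+)?(" + "|".join(VENDOR_KEYS) + r")\s*=\s*['\"]?([^'\"\s#]+)"
)


def find_bundled_keys(bundle_root: str) -> list[tuple[str, str]]:
    """Return (file path, key name) pairs for vendor keys set in .env files."""
    findings = []
    for dirpath, _dirnames, filenames in os.walk(bundle_root):
        for name in filenames:
            if not name.endswith(".env"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as fh:
                for line in fh:
                    match = _ASSIGNMENT.match(line)
                    if match:
                        findings.append((path, match.group(1)))
    return sorted(findings)


def main(argv: list[str]) -> int:
    """Print findings and return a process exit code (1 means keys were found)."""
    findings = find_bundled_keys(argv[0])
    for path, key in findings:
        print(f"vendor key {key} found in {path}")  # never print the value itself
    return 1 if findings else 0
```

In a pipeline, the script would be pointed at the built .app directory and its return value passed to sys.exit so that any finding fails the job.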

Benefits

|  | Current (direct) | Proposed (backend proxy) |
| --- | --- | --- |
| API key exposure | Client-side, extractable | Server-side only |
| Cost visibility | Invisible to backend | Unified monitoring |
| VAD gate savings | Not available | ~75% Deepgram cost reduction |
| Speech profiles | Not available | Speaker identification |
| Rate limiting | None | Per-user/device/session |
| Key rotation | Requires app update | Server-side, instant |
| Provider flexibility | Hardcoded Deepgram | Backend can switch STT providers |

Latency Consideration

Adding a backend hop adds some latency. In practice, with persistent WebSocket connections and region colocation, the increase is modest relative to STT model inference and endpointing delays. It can be further mitigated with dedicated streaming workers and autoscaling (the same infrastructure mobile already uses).

References

  • desktop/Desktop/Sources/TranscriptionService.swift — direct Deepgram connection
  • desktop/Desktop/Sources/ProactiveAssistants/Core/GeminiClient.swift — direct Gemini calls
  • backend/routers/transcribe.py — existing /v4/listen endpoint
  • backend/utils/stt/streaming.py — server-side STT providers
  • backend/utils/stt/vad_gate.py — VAD gate (active on mobile)

by AI for @beastoin

Labels: desktop, p1 (Priority: Critical, score 22-29)
