Desktop: Batch transcription 413 on long speech chunks (3.2MB, 50s+) — 5.5K events #6195

@beastoin

Problem

The desktop batch transcription fails with HTTP 413 ("Failed to buffer the request body: length limit exceeded") when the VAD gate accumulates long speech chunks:

  • OMI-DESKTOP-10: 413 Payload Too Large
  • 5,574 events across 70 users
  • Sentry breadcrumb: VADGate batch speech chunk complete 3,208,532 bytes (50.1s) -> TranscriptionService batch transcribing -> 413 rejected

The VAD gate accumulates stereo PCM audio with no size limit (~3.2MB in the case above), but the backend (or its reverse proxy) rejects payloads above a threshold.

Root Cause Analysis

Traced through VADGateService.swift, TranscriptionService.swift, and AppState.swift:

1. VAD gate has no maximum chunk size

VADGateService.swift: Speech audio is accumulated in batchAudioBuffer (line 262) until 2+ seconds of silence (hangover timeout, line 207: batchHangoverMs = 2000). There is no maximum duration or size limit. A user speaking continuously for 50+ seconds produces a single 3.2MB chunk.
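
A duration cap on the accumulator would bound the chunk size regardless of how long the user speaks. The following is a minimal sketch of that idea, not the actual VADGateService code: `CappedSpeechBuffer`, `append`, and `flush` are hypothetical names, and the 30-second default and 64,000 bytes/s rate are assumptions taken from the figures in this issue.

```swift
import Foundation

/// Hypothetical accumulator illustrating a size cap; the real
/// batchAudioBuffer handling in VADGateService.swift may differ.
struct CappedSpeechBuffer {
    let maxBytes: Int
    private(set) var buffer = Data()

    /// Assumed defaults: 30 s cap, stereo Int16 at 16 kHz (64,000 bytes/s, 4-byte frames).
    init(maxSeconds: Double = 30, bytesPerSecond: Int = 64_000, frameSize: Int = 4) {
        let raw = Int(maxSeconds * Double(bytesPerSecond))
        // Align the cap to a whole frame so a cut never splits a sample pair.
        maxBytes = max(frameSize, raw - raw % frameSize)
    }

    /// Appends speech audio and returns every chunk that reached the cap.
    mutating func append(_ audio: Data) -> [Data] {
        buffer.append(audio)
        var completed: [Data] = []
        while buffer.count >= maxBytes {
            completed.append(Data(buffer.prefix(maxBytes)))
            buffer = Data(buffer.dropFirst(maxBytes))
        }
        return completed
    }

    /// Returns whatever is left, e.g. when the silence hangover expires.
    mutating func flush() -> Data? {
        guard !buffer.isEmpty else { return nil }
        defer { buffer = Data() }
        return buffer
    }
}
```

With the assumed defaults, no emitted chunk exceeds 1,920,000 bytes; the hangover timeout would call `flush()` for the final, shorter chunk.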

2. Stereo format doubles payload size

Audio format: stereo Int16 PCM at 16kHz = 64 KB/s. For 50.1 seconds: 50.1 x 16000 x 4 bytes = 3,206,400 bytes (~3.2 MB).
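
The arithmetic above can be checked with a small helper. This is an illustrative sketch; `pcmByteCount` is a hypothetical function, not part of the codebase, and its defaults encode the format stated in this issue (16 kHz, 2 channels, 2 bytes per Int16 sample).

```swift
/// Hypothetical helper: size in bytes of raw PCM audio of a given duration.
func pcmByteCount(
    seconds: Double,
    sampleRate: Int = 16_000,
    channels: Int = 2,
    bytesPerSample: Int = 2
) -> Int {
    // 16,000 samples/s x 2 channels x 2 bytes = 64,000 bytes/s for the defaults.
    let bytesPerSecond = sampleRate * channels * bytesPerSample
    return Int((seconds * Double(bytesPerSecond)).rounded())
}

let fiftySeconds = pcmByteCount(seconds: 50.1)             // 3,206,400 bytes
let thirtySeconds = pcmByteCount(seconds: 30)              // 1,920,000 bytes
let monoFifty = pcmByteCount(seconds: 50.1, channels: 1)   // 1,603,200 bytes
```

The mono figure shows why the stereo format matters: the same 50.1 seconds would be half the size with a single channel.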

3. Single HTTP POST with no chunking

TranscriptionService.batchTranscribeFull() (lines 639-737) sends the entire buffer as a single POST request with Content-Type: application/octet-stream to /v1/proxy/deepgram/v1/listen. There is no logic to split large audio into smaller chunks before upload.

4. Backend body size limit

The 413 indicates a body size limit at the backend or its reverse proxy (nginx, GCP Cloud Load Balancer, or Cloud Run). The exact limit has not been identified; it is estimated to be between 1 and 5MB. Cloud Run's default request limit is 32MB, but an nginx proxy or middleware in front of the service may impose a tighter limit.

5. No retry with smaller chunks

When a 413 is received, the client logs the error and throws TranscriptionError.invalidResponse. There is no fallback that splits the payload and retries, so the audio is lost.
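
A split-and-retry fallback could look roughly like the sketch below. It is deliberately simplified and rests on assumptions: the upload is modelled as a synchronous `send` closure (the real client is presumably asynchronous), `UploadError` is a hypothetical stand-in for however the client surfaces a 413, and `minBytes` is an arbitrary floor that stops the recursion.

```swift
import Foundation

/// Hypothetical error type; the real client currently throws
/// TranscriptionError.invalidResponse for a 413.
enum UploadError: Error {
    case payloadTooLarge
    case other(Int)
}

/// Sends `audio`; if the server rejects it as too large, splits it in half
/// on a frame boundary and retries each half, returning transcripts in order.
func transcribeWithSplit(
    _ audio: Data,
    frameSize: Int = 4,        // stereo Int16: 2 channels x 2 bytes
    minBytes: Int = 64_000,    // assumed floor (~1 s); smaller payloads are not split
    send: (Data) throws -> String
) throws -> [String] {
    do {
        return [try send(audio)]
    } catch UploadError.payloadTooLarge {
        // Stop splitting once the halves would fall below the floor.
        guard audio.count >= 2 * minBytes else { throw UploadError.payloadTooLarge }
        let half = audio.count / 2
        let mid = half - half % frameSize   // keep sample pairs intact
        let first = Data(audio.prefix(mid))
        let second = Data(audio.dropFirst(mid))
        return try transcribeWithSplit(first, frameSize: frameSize, minBytes: minBytes, send: send)
            + transcribeWithSplit(second, frameSize: frameSize, minBytes: minBytes, send: send)
    }
}
```

Errors other than `payloadTooLarge` propagate unchanged. A hard cut at the midpoint can land inside a word, so this is best treated as a last-resort fallback rather than the primary chunking strategy.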

Proposed Fix

Client-side (recommended primary fix)

  1. Add max chunk duration in VAD gate — cap at 30 seconds (1.92MB stereo). When buffer exceeds this, emit the chunk as complete and start a new accumulation
  2. Add chunk splitting in TranscriptionService — if audio exceeds a size threshold (e.g., 2MB), split into overlapping segments (with 1-2s overlap for context) and transcribe separately, then merge results
  3. Handle 413 gracefully — on 413 response, split the payload in half and retry each half
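
The overlapping split in step 2 can be sketched as follows. `splitWithOverlap` is a hypothetical helper, not an existing function in TranscriptionService; the 4-byte frame size assumes the stereo Int16 format described above, and the threshold and overlap values are left to the caller.

```swift
import Foundation

/// Hypothetical helper: splits raw PCM into segments of at most `maxBytes`,
/// where each segment repeats the last `overlapBytes` of the previous one.
func splitWithOverlap(
    _ audio: Data,
    maxBytes: Int,
    overlapBytes: Int,
    frameSize: Int = 4
) -> [Data] {
    // Align both sizes to whole frames so no sample pair is cut in two.
    let maxLen = maxBytes - maxBytes % frameSize
    let overlap = overlapBytes - overlapBytes % frameSize
    precondition(overlap >= 0 && maxLen > overlap, "segment must be longer than the overlap")

    // Small payloads are sent as they are.
    guard audio.count > maxLen else { return [audio] }

    var segments: [Data] = []
    let base = audio.startIndex
    var start = 0
    while start < audio.count {
        let end = min(start + maxLen, audio.count)
        segments.append(audio.subdata(in: (base + start)..<(base + end)))
        if end == audio.count { break }
        start = end - overlap   // step back so the next segment has context
    }
    return segments
}
```

For example, `splitWithOverlap(chunk, maxBytes: 2_000_000, overlapBytes: 64_000)` would apply a 2MB threshold with roughly 1 second of overlap at 64,000 bytes/s. Merging the resulting transcripts still requires removing words duplicated in the overlap regions, which this sketch does not attempt.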

Backend-side (defense in depth)

  1. Increase body size limit — if the proxy or middleware has a limit below 5MB, increase it to at least 10MB
  2. Add streaming upload support — accept chunked transfer encoding for large audio payloads

Key Files

  • desktop/Desktop/Sources/VADGateService.swift — lines 554-684 (batch audio accumulation, no size limit)
  • desktop/Desktop/Sources/TranscriptionService.swift — lines 639-737 (batchTranscribeFull, single POST)
  • desktop/Desktop/Sources/AppState.swift — lines 1434-1438 (batchTranscribeChunk caller)

by AI for @beastoin

Metadata

Labels

  • capture (Layer: Audio recording, device pairing, BLE)
  • p1 (Priority: Critical (score 22-29))
