Fix batch transcription 413 on long speech chunks (#6195) #6207
VADGateService accumulates unbounded audio during continuous speech, producing 3.2MB+ chunks that exceed backend body size limits. Add maxBatchBytes=1.5MB (~23.4s stereo) cap with auto-emit: when the buffer exceeds the cap during SPEECH or HANGOVER state, emit the current buffer and start fresh accumulation with correct timestamp advancement. Fixes #6195 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
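The byte and duration figures in this commit can be checked with a little arithmetic. The sketch below assumes 16 kHz, 16-bit, two-channel PCM (64,000 bytes per second); the PR does not state the format, but it is the one consistent with both "1.5MB ≈ 23.4s" and "50s+ = 3.2MB". `CappedBatchBuffer` is a hypothetical stand-in for the gate's buffer, not the actual `VADGateService` code, and it emits at `>=` the cap as in the sequence diagram.

```swift
import Foundation

// Assumed PCM format (inferred from the PR's numbers, not stated in it):
// 16 kHz sample rate, 2 bytes per sample, 2 channels.
let sampleRate = 16_000
let bytesPerSample = 2
let channels = 2
let bytesPerSecond = sampleRate * bytesPerSample * channels  // 64_000

let maxBatchBytes = 1_500_000  // cap quoted in the PR

/// Duration in seconds of a PCM buffer holding `byteCount` bytes.
func durationSeconds(ofBytes byteCount: Int) -> Double {
    Double(byteCount) / Double(bytesPerSecond)
}

/// Hypothetical capped accumulator: appends chunks and hands back the
/// accumulated audio once the cap is reached, advancing the start time
/// by the duration of what was emitted.
struct CappedBatchBuffer {
    private(set) var buffer = Data()
    private(set) var startSeconds: Double = 0

    /// Returns the emitted audio and its start time when the cap is hit, else nil.
    mutating func append(_ chunk: Data) -> (audio: Data, start: Double)? {
        buffer.append(chunk)
        guard buffer.count >= maxBatchBytes else { return nil }
        let emitted = (audio: buffer, start: startSeconds)
        startSeconds += durationSeconds(ofBytes: buffer.count)
        buffer = Data()  // fresh accumulation for the next batch
        return emitted
    }
}
```

Under these assumptions `durationSeconds(ofBytes: maxBatchBytes)` is 23.4375 s, and a 3.2 MB buffer corresponds to 50 s of audio.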
Add defense-in-depth for HTTP 413: batchTranscribeWithSplitting() proactively splits audio exceeding maxBatchPayloadBytes at midpoint with 1s overlap, transcribes each half, and merges word-level results per channel with timestamp offset and overlap deduplication. Also retries with splitting on 413 response. Add payloadTooLarge error case to distinguish 413 from other HTTP errors. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
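A minimal sketch of a frame-aligned midpoint split with a 1 s overlap follows. It assumes 16-bit stereo PCM at 16 kHz (4 bytes per frame) and places half of the overlap on each side of the midpoint, which matches the "3.2MB → two 1.63MB halves" figure quoted later in this thread. `SplitAudio` and `splitAtMidpoint` are hypothetical names, not the PR's implementation.

```swift
import Foundation

// Assumed format: 16 kHz, 16-bit, stereo.
let bytesPerFrame = 4        // 2 bytes per sample x 2 channels
let bytesPerSecond = 64_000  // 16_000 frames per second x 4 bytes

struct SplitAudio {
    let first: Data
    let second: Data
    /// Seconds from the start of the original buffer to the start of `second`.
    let secondOffsetSeconds: Double
}

/// Splits `audio` at its midpoint so that the halves share `overlapSeconds`
/// of audio. All cut points fall on frame boundaries; trailing bytes that do
/// not form a whole frame are dropped.
func splitAtMidpoint(_ audio: Data, overlapSeconds: Double = 1.0) -> SplitAudio {
    let frameCount = audio.count / bytesPerFrame
    let midFrame = frameCount / 2
    let halfOverlapFrames = Int(overlapSeconds * Double(bytesPerSecond) / 2) / bytesPerFrame

    let firstEndFrame = min(frameCount, midFrame + halfOverlapFrames)
    let secondStartFrame = max(0, midFrame - halfOverlapFrames)

    // Data slices may not start at index 0, so offset from startIndex.
    let base = audio.startIndex
    let first = audio.subdata(in: base ..< base + firstEndFrame * bytesPerFrame)
    let second = audio.subdata(
        in: base + secondStartFrame * bytesPerFrame ..< base + frameCount * bytesPerFrame)

    let offset = Double(secondStartFrame * bytesPerFrame) / Double(bytesPerSecond)
    return SplitAudio(first: first, second: second, secondOffsetSeconds: offset)
}
```

The returned offset is what the second half's local word timestamps would need to be shifted by before merging.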
Switch batchTranscribeChunk from batchTranscribeFull to batchTranscribeWithSplitting, which handles proactive splitting and 413 retry automatically. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
7 tests covering word deduplication, timestamp offsetting, multi-channel merge, maxBatchBytes consistency, and frame alignment. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
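The offsetting and deduplication behaviour these tests target could look roughly like the sketch below for a single channel. The `Word` type, the `mergeWords` function, and the matching rule (same text, start times within 0.5 s) are assumptions based on the PR description; the real merge operates on `TranscriptSegment` values per channel.

```swift
/// Hypothetical word-level result; the real code uses TranscriptSegment.
struct Word: Equatable {
    let text: String
    let start: Double  // seconds
    let end: Double    // seconds
}

/// Shifts the second half's local timestamps by `secondOffset`, drops words
/// that repeat a word from the first half (same text, start times within
/// `window` seconds), and returns the combined list ordered by start time.
func mergeWords(first: [Word], second: [Word],
                secondOffset: Double, window: Double = 0.5) -> [Word] {
    let shifted = second.map {
        Word(text: $0.text, start: $0.start + secondOffset, end: $0.end + secondOffset)
    }
    let unique = shifted.filter { candidate in
        !first.contains { $0.text == candidate.text && abs($0.start - candidate.start) <= window }
    }
    return (first + unique).sorted { $0.start < $1.start }
}

// Usage: "world" falls in the overlap and is reported by both halves.
let firstHalf = [Word(text: "hello", start: 23.0, end: 23.4),
                 Word(text: "world", start: 24.6, end: 25.0)]
let secondHalf = [Word(text: "world", start: 0.12, end: 0.5),
                  Word(text: "again", start: 1.0, end: 1.4)]
let merged = mergeWords(first: firstHalf, second: secondHalf, secondOffset: 24.5)
```

Because matching is by text and proximity only, a word genuinely spoken twice within the window would also be collapsed, which is the false-match risk the PR description acknowledges.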
Greptile Summary
This PR fixes a silent audio loss bug where 50 s+ of continuous speech (≥ 3.2 MB stereo PCM) caused an HTTP 413 from the Deepgram proxy and the audio was dropped. It applies two complementary layers of defence:
Key observations:
Confidence Score: 4/5
Safe to merge; the 413 fix is correct and well-tested, but two P2 issues (dead-code parameters and silent audio loss on a double-413) are worth a follow-up. All findings are P2: (1) unused
Important Files Changed
Sequence Diagram
sequenceDiagram
participant VA as VADGateService
participant AS as AppState
participant TS as TranscriptionService
participant DG as Deepgram API
Note over VA: Audio chunk arrives
VA->>VA: append to batchAudioBuffer
alt buffer >= maxBatchBytes (1.5 MB)
VA->>VA: autoEmitBatchBuffer()<br/>reset buffer, advance timestamp
VA-->>AS: BatchGateOutput(isComplete=true)
else hangover timeout
VA-->>AS: BatchGateOutput(isComplete=true)
end
AS->>TS: batchTranscribeWithSplitting(audioData)
alt audioData > maxBatchPayloadBytes
TS->>TS: splitAndTranscribe()
TS->>DG: batchTranscribeFull(firstHalf)
DG-->>TS: firstSegments
TS->>DG: batchTranscribeFull(secondHalf)
DG-->>TS: secondSegments (local timestamps)
TS->>TS: mergeSegments(offset + dedupeOverlapWords)
TS-->>AS: merged [TranscriptSegment]
else audioData <= limit
TS->>DG: batchTranscribeFull(audioData)
alt HTTP 200
DG-->>TS: segments
TS-->>AS: [TranscriptSegment]
else HTTP 413
DG-->>TS: 413
TS->>TS: splitAndTranscribe()
TS->>DG: batchTranscribeFull(firstHalf)
DG-->>TS: firstSegments
TS->>DG: batchTranscribeFull(secondHalf)
DG-->>TS: secondSegments
TS->>TS: mergeSegments(offset + dedupeOverlapWords)
TS-->>AS: merged [TranscriptSegment]
end
end
AS->>AS: offset words by wallStartTime<br/>handleTranscriptSegment()
/// Auto-emit the current batch buffer when it exceeds maxBatchBytes.
/// Stays in .speech state so the next audio continues accumulating into a fresh buffer.
/// Called under lock.
private func autoEmitBatchBuffer(nextChunkMs: Double, nextChunkData: Data) -> BatchGateOutput {
Unused parameters nextChunkData and nextChunkMs
Both parameters are declared but never referenced inside the function body. Because batchAudioBuffer.append(stereoData) runs before the overflow check in both the .speech and .hangover cases, the triggering chunk is already inside completedBuffer when this function is called — making the carry-over parameters redundant.
If the intent was to seed the new buffer with the triggering chunk (for context continuity between consecutive auto-emitted chunks), nextChunkData needs to actually be appended to batchAudioBuffer and batchSpeechStartWallTime should be offset only by completedBuffer.count - nextChunkData.count frames. If the current behaviour (chunk included in emitted buffer, fresh empty start) is intentional, the parameters should be removed to avoid confusion.
Suggested change:
- private func autoEmitBatchBuffer(nextChunkMs: Double, nextChunkData: Data) -> BatchGateOutput {
+ private func autoEmitBatchBuffer() -> BatchGateOutput {
let firstSegments = try await batchTranscribeFull(
    audioData: Data(firstHalf), language: language, vocabulary: vocabulary
)
let secondSegments = try await batchTranscribeFull(
    audioData: Data(secondHalf), language: language, vocabulary: vocabulary
)
Single-level split — 413 on either half causes silent audio loss
splitAndTranscribe calls batchTranscribeFull (not batchTranscribeWithSplitting) for each half, so a 413 on a half propagates uncaught up to AppState.batchTranscribeChunk, which swallows it with logError(...). That speech audio is permanently lost with no user-visible indication.
In practice the VAD gate's auto-emit cap (~23.4 s) means buffers arriving here should only slightly exceed maxBatchPayloadBytes, so each half will be well under the limit. But if the proxy's actual limit is lower than expected, or if flushBatchBuffer delivers a large chunk, a split half could still 413 and audio would be silently dropped.
A light defensive fix would be to catch payloadTooLarge on each half and log a prominent error rather than silently discarding the speech:
let firstSegments: [TranscriptSegment]
do {
firstSegments = try await batchTranscribeFull(
audioData: Data(firstHalf), language: language, vocabulary: vocabulary)
} catch TranscriptionError.payloadTooLarge {
logError("TranscriptionService: First half still too large after split — dropping \(firstHalf.count) bytes", error: nil)
firstSegments = []
}
// same for secondSegments
Split halves with overlap can still exceed maxBatchPayloadBytes (e.g., 3.2MB → two 1.63MB halves). Use batchTranscribeWithSplitting recursively instead of batchTranscribeFull directly, so oversized halves get split again. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
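The recursive approach described in this commit can be sketched synchronously as follows. This is an illustration under assumptions, not the PR's code: `send` stands in for `batchTranscribeFull`, results are plain strings concatenated without the real merge step, `maxBatchPayloadBytes` is assumed to equal the 1.5 MB cap, and the byte constants assume 16 kHz 16-bit stereo PCM. The actual method is `async`.

```swift
import Foundation

enum TranscriptionError: Error { case payloadTooLarge }

let maxBatchPayloadBytes = 1_500_000  // assumed equal to the VAD cap
let bytesPerFrame = 4                 // assumed 16-bit stereo
let halfOverlapBytes = 32_000         // 0.5 s at an assumed 64_000 bytes/s

/// Sends `audio` as-is when it fits, otherwise splits at a frame-aligned
/// midpoint (with overlap) and recurses, so oversized halves are split again.
/// A 413 on an in-limit payload also falls through to splitting.
func transcribeWithSplitting(_ audio: Data,
                             send: (Data) throws -> [String]) throws -> [String] {
    if audio.count <= maxBatchPayloadBytes {
        do {
            return try send(audio)
        } catch TranscriptionError.payloadTooLarge {
            // Server limit is lower than expected: fall through and split.
        }
    }
    // Below this size the overlap would stop the halves from shrinking.
    guard audio.count > 4 * halfOverlapBytes else {
        throw TranscriptionError.payloadTooLarge
    }
    let mid = (audio.count / 2 / bytesPerFrame) * bytesPerFrame
    let base = audio.startIndex
    let first = audio.subdata(in: base ..< base + mid + halfOverlapBytes)
    let second = audio.subdata(in: base + mid - halfOverlapBytes ..< audio.endIndex)
    return try transcribeWithSplitting(first, send: send)
        + transcribeWithSplitting(second, send: send)
}
```

With these constants a 3.2 MB buffer is first cut into two 1,632,000-byte halves, each of which still exceeds the limit and is split once more into 848,000-byte quarters. The guard gives the recursion a floor, so a persistent 413 surfaces as an error instead of looping.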
autoEmitBatchBuffer left batchState unchanged, so auto-emit during hangover would leave an empty buffer in hangover state, potentially emitting a silence-only follow-up chunk. Always transition to .speech after auto-emit to continue proper accumulation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nce-only chunk After auto-emit, batchLastSpeechMs still pointed to the old buffer's last speech time. The next silent chunk would immediately trigger hangover→silence transition (timeSinceSpeechMs > 2000) and emit an empty/silence-only buffer. Reset batchLastSpeechMs to batchAudioCursorMs so the hangover timer starts fresh after auto-emit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
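The two state fixes above (always return to `.speech`, and restart the hangover timer) can be illustrated with a small model of the gate's bookkeeping. `GateState`, `bufferedBytes`, and `hangoverExpired` are hypothetical; `batchState`, `batchLastSpeechMs`, and `batchAudioCursorMs` are the names used in the commit messages, and the 2000 ms threshold is the one quoted there.

```swift
enum BatchState { case silence, speech, hangover }

/// Hypothetical reduction of the gate's state to the fields these commits touch.
struct GateState {
    var batchState: BatchState
    var batchAudioCursorMs: Double
    var batchLastSpeechMs: Double
    var bufferedBytes: Int

    /// Emits the buffered byte count and resets for fresh accumulation.
    mutating func autoEmit() -> Int {
        let emitted = bufferedBytes
        bufferedBytes = 0
        batchState = .speech                    // never stay in .hangover with an empty buffer
        batchLastSpeechMs = batchAudioCursorMs  // hangover timer starts fresh
        return emitted
    }

    /// True when silence has lasted longer than the hangover window.
    func hangoverExpired(hangoverMs: Double = 2000) -> Bool {
        batchAudioCursorMs - batchLastSpeechMs > hangoverMs
    }
}

// Usage: the cap is hit 1.5 s into a hangover period.
var gate = GateState(batchState: .hangover, batchAudioCursorMs: 25_000,
                     batchLastSpeechMs: 23_500, bufferedBytes: 1_500_000)
let emittedBytes = gate.autoEmit()
gate.batchAudioCursorMs += 600  // more silence arrives after the auto-emit
```

Without the `batchLastSpeechMs` reset, the 600 ms of further silence in this example would put the gate 2100 ms past the old speech time and trigger an immediate silence-only emit.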
Add testAutoEmit() method and test property accessors to VADGateService for testing the auto-emit state machine path without requiring ONNX model loading. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
4 tests verifying: speech→speech transition, hangover→speech transition (prevents silence-only follow-up), batchLastSpeechMs reset, and start wall time advancement. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CP9 Changed-Path and Sequence Coverage Checklist
PR diff: 6 files changed, 418 insertions(+), 4 deletions(-)
L1 Evidence Summary
Build:
L1 synthesis: All 10 changed paths (P1-P10) are proven at L1 via unit tests and compile verification. P1-P4 prove VAD auto-emit state machine transitions (speech, hangover, batchLastSpeechMs reset, startWallTime advance). P5-P8 prove batch splitting logic (frame alignment, timestamp offset, word deduplication, multi-channel merge). P9-P10 are callsite/error-case changes verified by compilation. No paths remain UNTESTED: HTTP integration paths (413 retry) are covered by proactive splitting, which exercises the same code. Sequence IDs are N/A (path-only mode, no cross-service boundaries).
by AI for @beastoin
CP8 Test Detail Table
by AI for @beastoin
L2 Evidence: Integration Analysis
Integration Assessment
This PR's changes are entirely client-side with no protocol or API changes:
Why L2 is satisfied by L1 evidence
L2 synthesis: All 10 changed paths (P1-P10) are integration-safe at L2. The changes reduce payload size (P1-P4 buffer cap) and add client-side splitting (P5-P9); both are transparent to the backend proxy. The backend receives the same PCM→transcript API calls with smaller payloads. No path creates a new integration boundary. Sequence IDs: N/A (path-only mode).
by AI for @beastoin
BasedHardware#6207)

## Summary
- **VADGateService**: Added `maxBatchBytes = 1_500_000` (~23.4s stereo) cap. When buffer exceeds limit during SPEECH or HANGOVER state, auto-emits current buffer and starts fresh accumulation with correct timestamp advancement
- **TranscriptionService**: Added `batchTranscribeWithSplitting()` that proactively splits audio exceeding the limit at midpoint with 1s overlap, transcribes each half sequentially, and merges word-level results per channel with timestamp offset and overlap deduplication. Also retries on HTTP 413 with splitting
- **AppState**: Switched to splitting-aware transcription method
- Added `payloadTooLarge` error case to distinguish 413 from other HTTP errors
- 7 new unit tests covering dedupe, merge, offset, multi-channel, consistency, and alignment

## Root Cause
VADGateService accumulated unbounded audio during continuous speech (50s+ = 3.2MB stereo PCM). TranscriptionService sent this as a single HTTP POST to Deepgram proxy. Backend/proxy body size limit rejected it with 413. No retry or splitting logic existed, so the audio was silently lost.

## Risks
- Mid-speech auto-emit produces chunk boundaries mid-sentence. Deepgram handles this well since each chunk has context, but word boundaries at the split point may be slightly less accurate
- Overlap deduplication uses text + timestamp proximity (0.5s window); false matches on repeated words are unlikely but possible
- No ordering serialization added to AppState (deferred per CODEx recommendation for follow-up if needed)

Fixes BasedHardware#6195

_by AI for @beastoin_
…rdware#6207 — merged without approval (BasedHardware#6218)
Summary
- VADGateService: Added maxBatchBytes = 1_500_000 (~23.4s stereo) cap. When buffer exceeds limit during SPEECH or HANGOVER state, auto-emits current buffer and starts fresh accumulation with correct timestamp advancement
- TranscriptionService: Added batchTranscribeWithSplitting() that proactively splits audio exceeding the limit at midpoint with 1s overlap, transcribes each half sequentially, and merges word-level results per channel with timestamp offset and overlap deduplication. Also retries on HTTP 413 with splitting
- AppState: Switched to splitting-aware transcription method
- Added payloadTooLarge error case to distinguish 413 from other HTTP errors
- 7 new unit tests covering dedupe, merge, offset, multi-channel, consistency, and alignment
Root Cause
VADGateService accumulated unbounded audio during continuous speech (50s+ = 3.2MB stereo PCM). TranscriptionService sent this as a single HTTP POST to Deepgram proxy. Backend/proxy body size limit rejected it with 413. No retry or splitting logic existed — audio was silently lost.
Risks
Fixes #6195
by AI for @beastoin