fix: add PCM to WAV encoding and TTS MIME support #123
Add support for Gemini TTS returning raw PCM by encoding PCM16LE into a WAV container and propagating MIME types through the stack. Introduces a new encodePcm16AsWav utility and SynthesizedAudio type, updates GeminiTTSExecutor to normalize model responses and return {audioBase64, mimeType}, and exports the WAV encoder. Adjusts frontend to use the provided audio MIME type (removing mp3 defaults), updates API spec and conversation service to handle the new payload shape, and updates tests (including a new unit test for WAV encoding) to expect audio/wav.
Deploying mentora-app

| Field | Value |
|---|---|
| Latest commit: | 0f8283a |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://0326bdc3.mentora-app.pages.dev |
| Branch Preview URL: | https://fix-audio-play.mentora-app.pages.dev |
Pull request overview
This PR updates the Mentora TTS pipeline to support Gemini returning raw PCM by wrapping it in a WAV container and propagating an explicit audio MIME type through backend → SDK → frontend playback.
Changes:
- Add `encodePcm16AsWav` utility and a `SynthesizedAudio { audioBase64, mimeType }` contract for TTS executors.
- Update `GeminiTTSExecutor` to normalize model audio responses (PCM → WAV) and return MIME type alongside base64 audio.
- Adjust API/service/frontend/tests to expect and use `audio/wav` rather than defaulting to `audio/mp3`.
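
For orientation, wrapping PCM in a WAV container can be sketched as below. This is a minimal illustration, not the PR's `encodePcm16AsWav`: the name `wrapPcm16InWav` is hypothetical, it works on raw bytes rather than base64 strings, and the 24 kHz mono defaults are assumed from the "24 kHz 16-bit mono" comment in the diff.

```typescript
// Hypothetical helper (not the PR's implementation): prepends a canonical
// 44-byte RIFF/WAVE header to raw little-endian 16-bit PCM bytes.
function wrapPcm16InWav(
  pcm: Uint8Array,
  sampleRate = 24000, // assumed default
  channelCount = 1, // assumed default
): Uint8Array {
  const bytesPerSample = 2; // 16-bit samples
  const blockAlign = channelCount * bytesPerSample;
  const byteRate = sampleRate * blockAlign;
  const out = new Uint8Array(44 + pcm.length);
  const view = new DataView(out.buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + pcm.length, true); // RIFF chunk size = file size - 8
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size for linear PCM
  view.setUint16(20, 1, true); // audio format 1 = linear PCM
  view.setUint16(22, channelCount, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bytesPerSample * 8, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, pcm.length, true); // data chunk size in bytes
  out.set(pcm, 44); // samples follow the header unchanged
  return out;
}
```

A base64 variant would decode the input before this step and encode the result afterwards; that part is omitted here.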
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| packages/mentora-api/tests/sdk-full-scenario.integration.test.ts | Updates integration expectations for returned audioMimeType to WAV. |
| packages/mentora-api/tests/conversation-service-asr.unit.test.ts | Updates TTS mock + assertions to the new { audioBase64, mimeType } payload and WAV MIME. |
| packages/mentora-api/src/lib/server/application/conversation-service.ts | Plumbs the new TTS payload shape through addTurn and returns audioMimeType from executor output. |
| packages/mentora-api/src/lib/explorer/api-spec.ts | Updates API explorer example to show audio/wav. |
| packages/mentora-ai/tests/tts-audio.test.ts | Adds unit test verifying WAV header/data layout for PCM→WAV encoding. |
| packages/mentora-ai/src/types.ts | Introduces SynthesizedAudio and updates TTSExecutor.synthesize return type accordingly. |
| packages/mentora-ai/src/executor/wav.ts | Adds WAV container encoding helper for raw PCM. |
| packages/mentora-ai/src/executor/tts.ts | Normalizes Gemini TTS audio responses (PCM → WAV) and enforces supported MIME handling. |
| packages/mentora-ai/src/executor/index.ts | Exports the new WAV encoder from the executor barrel. |
| apps/mentora/src/routes/conversations/[id]/+page.svelte | Removes MP3 defaulting and plays audio using the MIME type returned from the API. |
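
As a rough illustration of the frontend change in the last row, playback can build its source from the MIME type the API returns instead of assuming MP3. The helper name `toAudioDataUrl` and the data-URL approach are assumptions made for this sketch; the actual Svelte page may construct its audio source differently.

```typescript
// Hypothetical helper: the name and the data-URL approach are assumptions,
// not code from the PR.
function toAudioDataUrl(audioBase64: string, audioMimeType: string): string {
  if (!audioMimeType) {
    // No silent fallback to audio/mp3: a mislabeled payload may fail to play.
    throw new Error("Missing audio MIME type");
  }
  return `data:${audioMimeType};base64,${audioBase64}`;
}
```

In a browser, the resulting string could be passed to `new Audio(url)` for playback.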
```ts
const normalizedMimeType = audioPart.mimeType
  ?.toLowerCase()
  .split(";")[0]
  ?.trim();

if (
  normalizedMimeType === "audio/wav" ||
  normalizedMimeType === "audio/wave"
) {
  return {
    audioBase64: data,
    mimeType: "audio/wav",
  };
}

// Gemini preview TTS returns raw 24 kHz 16-bit mono PCM by default.
if (!normalizedMimeType || PCM_MIME_TYPES.has(normalizedMimeType)) {
  return encodePcm16AsWav(data);
}

throw new Error(
  `Unsupported TTS audio MIME type: ${audioPart.mimeType ?? "<missing>"}`,
);
```
normalizeGeminiAudioResponse currently only accepts "audio/wav"/"audio/wave" and raw PCM (missing/"audio/l16"/"audio/pcm"). If Gemini returns other common but browser-playable audio types (e.g. audio/mpeg) or WAV aliases (e.g. audio/x-wav, audio/vnd.wave), this will throw and break TTS despite the PR goal of propagating MIME types. Consider normalizing additional WAV aliases and passing through supported codecs (or explicitly documenting/handling them) instead of throwing.
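
One possible shape for this suggestion is sketched below. It is illustrative only: `classifyTtsMimeType`, the alias set, and the passthrough set are hypothetical, and which codecs are safe to pass through is a product decision this sketch does not settle.

```typescript
// Illustrative only: these sets are assumptions, not values taken from the PR.
const WAV_ALIASES = new Set([
  "audio/wav",
  "audio/wave",
  "audio/x-wav",
  "audio/vnd.wave",
]);
const PCM_TYPES = new Set(["audio/l16", "audio/pcm"]);
const PASSTHROUGH_TYPES = new Set(["audio/mpeg", "audio/ogg"]);

type AudioHandling =
  | { action: "passthrough"; mimeType: string }
  | { action: "wrap-pcm-in-wav" };

function classifyTtsMimeType(rawMimeType?: string): AudioHandling {
  // Drop parameters such as ";rate=24000" and normalize case.
  const base = rawMimeType?.toLowerCase().split(";")[0]?.trim();

  // A missing type is treated as raw PCM, mirroring the reviewed code.
  if (!base || PCM_TYPES.has(base)) {
    return { action: "wrap-pcm-in-wav" };
  }
  // Collapse every WAV alias to one canonical type.
  if (WAV_ALIASES.has(base)) {
    return { action: "passthrough", mimeType: "audio/wav" };
  }
  // Forward other allow-listed types unchanged.
  if (PASSTHROUGH_TYPES.has(base)) {
    return { action: "passthrough", mimeType: base };
  }
  throw new Error(
    `Unsupported TTS audio MIME type: ${rawMimeType ?? "<missing>"}`,
  );
}
```

Keeping an explicit allow-list means unknown types still fail loudly, while the aliases named in the comment no longer break synthesis.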
```ts
    bytesPerSample?: number;
  } = {},
): SynthesizedAudio {
```
encodePcm16AsWav exposes bytesPerSample/channelCount/sampleRate options, so it can generate WAV headers for non-16-bit PCM as well. The current name implies it only supports PCM16; consider either asserting bytesPerSample===2 (and maybe channelCount/sampleRate expectations) or renaming the helper to reflect the broader capability to avoid misuse.
Suggested change:

```diff
-    bytesPerSample?: number;
-  } = {},
-): SynthesizedAudio {
+    bytesPerSample?: 2;
+  } = {},
+): SynthesizedAudio {
+  if (bytesPerSample !== 2) {
+    throw new Error(
+      `encodePcm16AsWav only supports 16-bit PCM input (bytesPerSample must be 2, received ${bytesPerSample}).`,
+    );
+  }
```