feat: add audio input support and voice recognition features by zhangmo8 · Pull Request #1623 · ThinkInAIXYZ/deepchat

zhangmo8 · 2026-05-14T09:47:28Z

- Implemented audio input handling in NewThreadPage.vue, allowing users to attach audio files and transcribe them.
- Enhanced ChatInputBox and ChatInputToolbar components to support voice input functionality.
- Added speech recognition capabilities using the useSpeechRecognition composable.
- Updated model capabilities to include support for audio input and speech recognition.
- Introduced new routes for audio transcription and updated related tests to ensure functionality.
- Added tests for audio input handling, speech recognition, and integration with the AI SDK.

20260514_173950.mp4

20260514_174252.mp4

Summary by CodeRabbit

New Features
- Local voice recording with transcription and inserting recognized text into the composer.
- Model-aware audio attachment handling and per-model speech-recognition toggle.
- Microphone keyboard shortcut (Ctrl/Meta + Shift + M) and animated waveform recording UI.
- Model capability indicator showing audio-input support.
Documentation
- Added implementation plan, spec, and task checklist for voice-input transcription.
Localization
- Added UI strings for voice input states, errors, and settings across locales.
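
As an illustration of the microphone shortcut listed above, here is a minimal sketch of how a Ctrl/Meta + Shift + M match could be written. The `KeyLike` shape and the `isMicShortcut` name are hypothetical and not taken from the PR; the real handler in ChatInputBox.vue may differ.

```typescript
// Hypothetical structural type covering the KeyboardEvent fields the check needs.
interface KeyLike {
  key: string
  ctrlKey: boolean
  metaKey: boolean
  shiftKey: boolean
}

// Returns true for Ctrl+Shift+M or Meta+Shift+M.
// With Shift held the reported key is usually "M", so compare case-insensitively.
function isMicShortcut(event: KeyLike): boolean {
  return (event.ctrlKey || event.metaKey) && event.shiftKey && event.key.toLowerCase() === 'm'
}
```

Accepting either Ctrl or Meta keeps one binding working on both Windows/Linux and macOS.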


coderabbitai · 2026-05-14T09:47:43Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 82278ec9-eaac-4033-a5b4-0a592b6b1013

📥 Commits

Reviewing files that changed from the base of the PR and between d33da1c and 9537cfc.

📒 Files selected for processing (19)

docs/features/voice-input-transcription/plan.md
docs/features/voice-input-transcription/spec.md
docs/features/voice-input-transcription/tasks.md
src/main/presenter/llmProviderPresenter/aiSdk/messageMapper.ts
src/main/presenter/llmProviderPresenter/index.ts
src/renderer/api/ModelClient.ts
src/renderer/src/components/chat/composables/useAudioRecorder.ts
src/renderer/src/components/chat/composables/useSpeechRecognition.ts
src/renderer/src/i18n/fa-IR/settings.json
src/renderer/src/i18n/ru-RU/chat.json
src/renderer/src/pages/NewThreadPage.vue
src/shared/contracts/routes/models.routes.ts
src/shared/modelConfigDefaults.ts
test/main/presenter/llmProviderPresenter.test.ts
test/main/presenter/llmProviderPresenter/aiSdkMessageMapper.test.ts
test/main/routes/contracts.test.ts
test/renderer/components/NewThreadPage.test.ts
test/renderer/composables/useSpeechRecognition.test.ts
test/renderer/lib/audioInputSupport.test.ts

✅ Files skipped from review due to trivial changes (5)

docs/features/voice-input-transcription/plan.md
src/renderer/src/i18n/fa-IR/settings.json
src/renderer/src/i18n/ru-RU/chat.json
docs/features/voice-input-transcription/tasks.md
docs/features/voice-input-transcription/spec.md

🚧 Files skipped from review as they are similar to previous changes (11)

src/shared/contracts/routes/models.routes.ts
src/renderer/api/ModelClient.ts
test/renderer/lib/audioInputSupport.test.ts
test/main/presenter/llmProviderPresenter.test.ts
src/shared/modelConfigDefaults.ts
src/renderer/src/components/chat/composables/useAudioRecorder.ts
test/main/presenter/llmProviderPresenter/aiSdkMessageMapper.test.ts
src/renderer/src/pages/NewThreadPage.vue
src/main/presenter/llmProviderPresenter/aiSdk/messageMapper.ts
src/main/presenter/llmProviderPresenter/index.ts
src/renderer/src/components/chat/composables/useSpeechRecognition.ts

📝 Walkthrough

Adds local microphone recording, WAV normalization and upload, a typed transcription route + renderer client, provider-native OpenAI-style transcription with a completion fallback, model capability gating for audio, UI recording controls and animations, audio-attachment filtering, i18n strings, and tests.

Changes

Voice Input Transcription

| Layer / File(s) | Summary |
| --- | --- |
| Docs & planning `docs/features/voice-input-transcription/*` | Feature plan, spec, and task checklist describing architecture, UI, routing, fallback, and tests. |
| Types & schemas `src/shared/types/`, `src/shared/contracts/`, `src/shared/modelConfigDefaults.ts` | Adds `input_audio` message variant, `speechRecognition` model config default, `supportsAudioInput` capability schema, and route contract for transcription. |
| Provider transcription surface & AiSdk provider `src/main/presenter/llmProviderPresenter/*`, `src/main/presenter/llmProviderPresenter/providers/aiSdkProvider.ts` | Adds base provider transcribe API, LLMPresenter.transcribeAudioStandalone with fallback, and AiSdkProvider OpenAI-style `/audio/transcriptions` handling with abort/timeout/error mapping. |
| Message mapping & runtime `src/main/presenter/llmProviderPresenter/aiSdk/messageMapper.ts`, `src/main/presenter/llmProviderPresenter/aiSdk/runtime.ts` | Maps `input_audio` parts to provider file parts, optionally injects OpenAI-compatible data URLs for compatibility. |
| Context & compaction plumbing `src/main/presenter/agentRuntimePresenter/*` | Thread `supportsAudioInput` through context building, createUserChatMessage, compaction, resume/recovery flows so `input_audio` parts are included when supported. |
| Routes & renderer client `src/main/routes/models/*`, `src/main/routes/models/modelRouteHandler.ts`, `src/renderer/api/ModelClient.ts` | Adds typed `models.transcribeAudio` route, handler wiring, and `ModelClient.transcribeAudio` bridge method. |
| Renderer audio capture & encoding `src/renderer/src/components/chat/composables/useAudioRecorder.ts`, `src/renderer/src/components/chat/composables/useSpeechRecognition.ts` | MediaRecorder-based recorder, preferred MIME selection, WAV (16-bit PCM) encoding, base64 conversion, abort/timeout racing, and transcribe invocation contract. |
| Voice input abstraction `src/renderer/src/components/chat/composables/useVoiceInput.ts` | Provider-agnostic voice input controller wrapping recorder and exposing start/stop/toggle/cleanup and reactive state. |
| UI: input box, toolbar, status bar, pages `src/renderer/src/components/chat/ChatInputBox.vue`, `ChatInputToolbar.vue`, `ChatStatusBar.vue`, `src/renderer/src/pages/*` | Keyboard shortcut, exposed insertRecognizedText, voice button with waveform animation and accessibility states, model-capability mic indicator, chat/new-thread pages integrate voice flows and audio-attachment filtering. |
| Attachment filtering & lib `src/renderer/src/lib/audioInputSupport.ts` | Detects audio attachments (MIME/extension) and filters or rejects them when model does not support audio input. |
| Model config UI & i18n `src/renderer/src/components/settings/ModelConfigDialog.vue`, `src/renderer/src/i18n/*` | Adds `speechRecognition` switch in model config and localized strings for voice input states, errors, and model audio capability across languages. |
| Tests `test/main/`, `test/renderer/` | Adds unit and integration tests for context building, message mapping, provider transcription behavior (success/fallback/error), recorder/composable flows, UI events, attachment filtering, and route validation. |
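
To make the "WAV (16-bit PCM) encoding" step in the table concrete, the following is a minimal sketch of a mono encoder that writes the standard 44-byte RIFF/WAVE header followed by little-endian 16-bit samples. The function name `encodeWavPcm16` and the mono-only assumption are illustrative; the PR's actual encoder in useAudioRecorder.ts is not shown here and may resample or handle multiple channels.

```typescript
// Encodes mono float samples in [-1, 1] as a 16-bit PCM WAV file.
function encodeWavPcm16(samples: Float32Array, sampleRate: number): ArrayBuffer {
  const bytesPerSample = 2
  const dataSize = samples.length * bytesPerSample
  const buffer = new ArrayBuffer(44 + dataSize)
  const view = new DataView(buffer)

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i))
  }

  writeAscii(0, 'RIFF')
  view.setUint32(4, 36 + dataSize, true) // file size minus the first 8 bytes
  writeAscii(8, 'WAVE')
  writeAscii(12, 'fmt ')
  view.setUint32(16, 16, true) // fmt chunk size for PCM
  view.setUint16(20, 1, true) // audio format 1 = linear PCM
  view.setUint16(22, 1, true) // channel count (mono, an assumption of this sketch)
  view.setUint32(24, sampleRate, true)
  view.setUint32(28, sampleRate * bytesPerSample, true) // byte rate
  view.setUint16(32, bytesPerSample, true) // block align
  view.setUint16(34, 16, true) // bits per sample
  writeAscii(36, 'data')
  view.setUint32(40, dataSize, true)

  samples.forEach((sample, index) => {
    // Clamp, then scale to the signed 16-bit range.
    const clamped = Math.max(-1, Math.min(1, sample))
    const scaled = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff
    view.setInt16(44 + index * bytesPerSample, scaled, true)
  })

  return buffer
}
```

Normalizing every recording to one container and sample format means the transcription route only has to handle a single MIME type, whatever the browser's MediaRecorder produced.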

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

ThinkInAIXYZ/deepchat#1620: Overlaps ACP gating/submit flow changes that also affect NewThread/submit logic.
ThinkInAIXYZ/deepchat#1558: Related context-builder/compaction signature changes and flags threading.

Suggested reviewers

zerob13
deepinfect

🐰 "I tapped the mic and heard a squeak,
WAV waves glow as bytes take a peek,
From whisper to text in a twinkling beat,
Insert the words — oh what a treat!"

coderabbitai

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/shared/modelConfigDefaults.ts (1)

8-18: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Wire speechRecognition into the fallback defaults.

DEFAULT_MODEL_SPEECH_RECOGNITION is defined but not included in DEFAULT_MODEL_CAPABILITY_FALLBACKS, so fallback-derived configs can still expose speechRecognition as undefined instead of false.

🔧 Proposed fix

 export const DEFAULT_MODEL_CAPABILITY_FALLBACKS = Object.freeze({
   contextLength: DEFAULT_MODEL_CONTEXT_LENGTH,
   maxTokens: DEFAULT_MODEL_MAX_TOKENS,
   vision: DEFAULT_MODEL_VISION,
+  speechRecognition: DEFAULT_MODEL_SPEECH_RECOGNITION,
   functionCall: DEFAULT_MODEL_FUNCTION_CALL,
   reasoning: DEFAULT_MODEL_REASONING
 })
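
A small sketch of why the missing key matters. The `false` default for speech recognition follows the finding above; the `resolveWithFallbacks` helper, the `Capabilities` type, and the `vision` value are assumptions made only for illustration.

```typescript
// Assumed shape, reduced to two flags for the example.
interface Capabilities {
  vision: boolean
  speechRecognition: boolean
}

const DEFAULT_MODEL_SPEECH_RECOGNITION = false

// Before the fix: speechRecognition is absent from the fallback object.
const fallbacksBefore: Partial<Capabilities> = Object.freeze({ vision: false })

// After the fix: the default is wired in.
const fallbacksAfter: Partial<Capabilities> = Object.freeze({
  vision: false,
  speechRecognition: DEFAULT_MODEL_SPEECH_RECOGNITION
})

// Hypothetical resolver: stored values win, fallbacks fill the gaps.
function resolveWithFallbacks(
  stored: Partial<Capabilities>,
  fallbacks: Partial<Capabilities>
): Partial<Capabilities> {
  return { ...fallbacks, ...stored }
}

const before = resolveWithFallbacks({}, fallbacksBefore).speechRecognition // undefined
const after = resolveWithFallbacks({}, fallbacksAfter).speechRecognition // false
```

An `undefined` flag forces every consumer to remember a `?? false`, while an explicit `false` keeps strict comparisons and UI toggles consistent.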

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/shared/modelConfigDefaults.ts` around lines 8 - 18,
DEFAULT_MODEL_SPEECH_RECOGNITION is defined but not included in
DEFAULT_MODEL_CAPABILITY_FALLBACKS, leaving fallback-derived configs with
speechRecognition undefined; update the DEFAULT_MODEL_CAPABILITY_FALLBACKS
Object.freeze block to include a speechRecognition property set to
DEFAULT_MODEL_SPEECH_RECOGNITION so fallback resolution returns false by default
(modify the DEFAULT_MODEL_CAPABILITY_FALLBACKS object where contextLength,
maxTokens, vision, functionCall, and reasoning are defined).

🧹 Nitpick comments (1)

src/renderer/api/ModelClient.ts (1)

222-237: ⚡ Quick win

Consider simplifying the optional filename spread.

The conditional spread for the optional filename parameter works correctly but could be more concise using short-circuit evaluation.

♻️ Simpler optional parameter pattern

   const result = await bridge.invoke(modelsTranscribeAudioRoute.name, {
     providerId,
     modelId,
     audioBase64,
     mimeType,
-    ...(filename ? { filename } : {})
+    ...(filename && { filename })
   })
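
For reference, a self-contained sketch comparing the two spread forms. The payload type and builder names are hypothetical; only the field names come from the diff above. Both forms omit `filename` when it is `undefined` or an empty string, because spreading a falsy value adds no properties.

```typescript
interface TranscribePayload {
  providerId: string
  modelId: string
  audioBase64: string
  mimeType: string
  filename?: string
}

type BasePayload = Omit<TranscribePayload, 'filename'>

// Ternary form: always spreads an object.
function buildPayloadTernary(base: BasePayload, filename?: string): TranscribePayload {
  return { ...base, ...(filename ? { filename } : {}) }
}

// Short-circuit form: spreads either the object or a falsy value (a no-op).
function buildPayloadShortCircuit(base: BasePayload, filename?: string): TranscribePayload {
  return { ...base, ...(filename && { filename }) }
}

const samplePayloadBase: BasePayload = {
  providerId: 'provider',
  modelId: 'model',
  audioBase64: 'AAAA',
  mimeType: 'audio/wav'
}
```

The short-circuit form is shorter, but it relies on the reader knowing that spreading `undefined` or `''` is harmless, so the choice is largely stylistic.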

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/renderer/api/ModelClient.ts` around lines 222 - 237, The conditional
spread for the optional filename in transcribeAudio is verbose; replace the
ternary spread (...(filename ? { filename } : {})) with a concise short-circuit
spread like ...(filename && { filename }) when building the payload for
modelsTranscribeAudioRoute.name so the filename is included only when defined.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/main/presenter/llmProviderPresenter/aiSdk/messageMapper.ts`:
- Around line 151-157: The code computes a fallback mediaType but still passes
the original actualMediaType into buildAudioProviderOptions, causing the
generated data URL to use an unsupported MIME type; update the call to
buildAudioProviderOptions to pass the computed mediaType (not actualMediaType)
and adjust buildAudioProviderOptions' signature/usage to accept and use this
mediaType so the data:<mime>;base64,... string and provider options consistently
use OPENAI_COMPATIBLE_AUDIO_FALLBACK_MEDIA_TYPE when applicable.

In `@src/main/presenter/llmProviderPresenter/index.ts`:
- Around line 401-403: The check uses normalizedMimeType but may be comparing
with mixed-case input (e.g., "Audio/WAV"); ensure normalizedMimeType is created
by lowercasing the incoming mimeType (e.g., normalizedMimeType =
mimeType.toLowerCase()) before performing the startsWith('audio/') validation so
the condition accepts valid audio types regardless of case; update the logic
around the normalizedMimeType variable where the MIME is validated in the LLM
provider presenter (the block that throws Error(`Invalid audio MIME type for
transcription: ${mimeType}`)) to use the lowercased value.
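
A minimal sketch of the case-insensitive check described in this item. The function name `normalizeAudioMimeType` and the extra `trim()` are assumptions; the error message mirrors the one quoted above.

```typescript
// Lowercases the MIME type before validating it, so "Audio/WAV" is accepted.
function normalizeAudioMimeType(mimeType: string): string {
  const normalizedMimeType = mimeType.trim().toLowerCase()
  if (!normalizedMimeType.startsWith('audio/')) {
    throw new Error(`Invalid audio MIME type for transcription: ${mimeType}`)
  }
  return normalizedMimeType
}
```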

In `@src/renderer/src/components/chat/composables/useAudioRecorder.ts`:
- Around line 88-95: mediaRecorder.onstop currently calls options.onRecorded
even after cleanupRecorder()/cleanup() has run, which can emit stale callbacks;
fix by adding a disposal guard: introduce and set a local boolean flag (e.g.,
isDisposed or isActiveRecording) that cleanupRecorder()/cleanup() flips to true,
and in mediaRecorder.onstop check the flag before invoking options.onRecorded
(or alternatively null out options.onRecorded in cleanup and guard for its
existence in onstop); reference mediaRecorder.onstop, cleanupRecorder(),
cleanup(), and options.onRecorded when implementing the guard.
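
A sketch of the disposal guard suggested in this item. `RecorderLike` and `createFakeRecorder` are hypothetical stand-ins so the example runs outside a browser; a real `MediaRecorder` fires its `stop` event asynchronously, which is exactly when a stale callback could slip through.

```typescript
// Hypothetical stand-in for the small part of MediaRecorder used here.
interface RecorderLike {
  onstop: (() => void) | null
  stop(): void
}

// Fake recorder that invokes onstop synchronously when stopped.
function createFakeRecorder(): RecorderLike {
  const recorder: RecorderLike = {
    onstop: null,
    stop() {
      recorder.onstop?.()
    }
  }
  return recorder
}

function attachGuardedStopHandler(
  recorder: RecorderLike,
  onRecorded: () => void
): { cleanup(): void } {
  let isDisposed = false
  recorder.onstop = () => {
    if (isDisposed) return // cleanup already ran, so drop the stale callback
    onRecorded()
  }
  return {
    cleanup() {
      isDisposed = true
      recorder.stop() // stopping during cleanup must not report a recording
    }
  }
}

let recordedCount = 0

const liveRecorder = createFakeRecorder()
attachGuardedStopHandler(liveRecorder, () => {
  recordedCount += 1
})
liveRecorder.stop() // normal stop: the callback fires once

const disposedRecorder = createFakeRecorder()
attachGuardedStopHandler(disposedRecorder, () => {
  recordedCount += 1
}).cleanup() // cleanup path: the callback is suppressed
```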

In `@src/renderer/src/components/chat/composables/useSpeechRecognition.ts`:
- Around line 100-106: The switch in useSpeechRecognition.ts that inspects
error.message currently groups 'transcription-timeout' with decode failures;
update the switch in the function handling speech errors so that
'transcription-timeout' is not returned as 'decode-failed'—remove it from the
decode-failed case and return a distinct, appropriate error key (e.g.,
'transcription-timeout' or 'timeout') in the switch's default/own case so
callers of the composable (useSpeechRecognition) can distinguish timeout vs
decode failures.
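
A sketch of the separated mapping. The `'decode-failed'` and `'transcription-timeout'` keys come from this item; the `'unknown'` fallback key and the function name are assumptions, since the composable's full set of error keys is not shown here.

```typescript
type SpeechErrorKey = 'decode-failed' | 'transcription-timeout' | 'unknown'

// Maps a thrown error to a UI-facing key, keeping timeouts distinct from decode failures.
function toSpeechErrorKey(error: unknown): SpeechErrorKey {
  const message = error instanceof Error ? error.message : ''
  switch (message) {
    case 'decode-failed':
      return 'decode-failed'
    case 'transcription-timeout':
      return 'transcription-timeout'
    default:
      return 'unknown' // hypothetical catch-all for this sketch
  }
}
```

Keeping the two keys apart lets the UI suggest a retry after a timeout while reporting a decode failure as a problem with the recording itself.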

In `@src/renderer/src/i18n/fa-IR/settings.json`:
- Line 427: In the fa-IR translation entry (the JSON description string
currently reading "مشخص می‌کند آیا این مدل ورود صوتی با تبدیل محلی گفتار به متن
را مجاز می‌کند یا نه."), replace the phrase "ورود صوتی" with "ورودی صوتی" so the
description reads with the correct wording for "voice input" and matches other
feature labels; update the value for the same "description" string in
settings.json accordingly.

In `@src/renderer/src/i18n/ru-RU/chat.json`:
- Line 71: The message for the JSON key "audioInputUnsupportedDescription"
contains the pseudo-plural "аудиовложение(й)"; replace that with a neutral,
natural Russian phrase that works for any count (for example, use "аудиофайлы"
or "аудиозаписи") so the string reads smoothly: "Модель {model} не поддерживает
аудиоввод. {count} аудиофайлы были пропущены." Update the value for
audioInputUnsupportedDescription accordingly.

In `@src/renderer/src/pages/NewThreadPage.vue`:
- Around line 899-918: prepareFilesForCurrentModel currently calls
resolveModel(), which can return a different model than the active ACP draft
target and thus mis-filter attachments; replace the resolveModel() call with the
actual submission target used for the ACP draft (either by reading the active
ACP draft target from state/context or by adding a parameter like
submissionTarget to prepareFilesForCurrentModel) and use that selection when
calling modelClient.getCapabilities(selection.providerId, selection.modelId);
keep the filtering logic with filterUnsupportedAudioAttachments and
notifyUnsupportedAudioAttachments the same and preserve the early-return when no
selection or files are empty.

In `@src/shared/contracts/routes/models.routes.ts`:
- Around line 192-198: The input schema's audioBase64 and mimeType are unbounded
causing potential oversized IPC payloads; update the zod schema in
models.routes.ts (the input: z.object({...}) block) to add max limits: add
.max(15_000_000) to audioBase64 to cap base64 audio around ~10MB binary and add
.max(255) to mimeType (and consider .max(255) on filename.optional() if
desired). Keep the field names providerId, modelId, audioBase64, mimeType, and
filename unchanged when applying these .max(...) constraints.
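
To sanity-check the suggested cap, here is the base64 size arithmetic as a dependency-free sketch. Base64 encodes every 3 bytes as 4 characters, so a 15,000,000-character limit allows at most 11,250,000 decoded bytes, a little above the "~10MB" figure quoted in this item. The helper names are hypothetical.

```typescript
// Upper bound on decoded bytes for a padded base64 string of the given length.
function maxDecodedBytes(base64Length: number): number {
  return Math.floor(base64Length / 4) * 3
}

// Length of the padded base64 encoding of a payload with the given byte length.
function base64LengthFor(byteLength: number): number {
  return Math.ceil(byteLength / 3) * 4
}

const capInBytes = maxDecodedBytes(15_000_000) // 11250000
const tenMiBEncoded = base64LengthFor(10 * 1024 * 1024) // 13981016
```

A 10 MiB recording therefore encodes to about 13.98 million characters and fits under the proposed limit with some headroom.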

In `@test/renderer/lib/audioInputSupport.test.ts`:
- Around line 12-22: Add a negative test that asserts isAudioAttachment returns
false for non-audio MIME types by creating a file via createFile with a
non-audio mimeType (e.g., application/pdf, text/plain, or image/png) and
expecting the result toBe(false); place the new test alongside the existing
'detects audio attachments from mime type' test and name it something like
'returns false for non-audio attachments' to clearly cover the negative case.
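
A sketch of the behavior the negative test would pin down. `AttachmentLike`, `isAudioAttachmentSketch`, and the extension list are hypothetical; the real `isAudioAttachment` in audioInputSupport.ts and the `createFile` test helper may use different shapes and rules. This version assumes the extension is consulted only when no MIME type is available.

```typescript
// Hypothetical attachment shape for this sketch.
interface AttachmentLike {
  name: string
  mimeType?: string
}

// Assumed list of audio file extensions.
const AUDIO_EXTENSIONS = ['.wav', '.mp3', '.m4a', '.ogg', '.flac']

function isAudioAttachmentSketch(file: AttachmentLike): boolean {
  const mime = (file.mimeType ?? '').toLowerCase()
  if (mime) return mime.startsWith('audio/')
  // No MIME type available: fall back to the file extension.
  const lowerName = file.name.toLowerCase()
  return AUDIO_EXTENSIONS.some((extension) => lowerName.endsWith(extension))
}
```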

---

Outside diff comments:
In `@src/shared/modelConfigDefaults.ts`:
- Around line 8-18: DEFAULT_MODEL_SPEECH_RECOGNITION is defined but not included
in DEFAULT_MODEL_CAPABILITY_FALLBACKS, leaving fallback-derived configs with
speechRecognition undefined; update the DEFAULT_MODEL_CAPABILITY_FALLBACKS
Object.freeze block to include a speechRecognition property set to
DEFAULT_MODEL_SPEECH_RECOGNITION so fallback resolution returns false by default
(modify the DEFAULT_MODEL_CAPABILITY_FALLBACKS object where contextLength,
maxTokens, vision, functionCall, and reasoning are defined).

---

Nitpick comments:
In `@src/renderer/api/ModelClient.ts`:
- Around line 222-237: The conditional spread for the optional filename in
transcribeAudio is verbose; replace the ternary spread (...(filename ? {
filename } : {})) with a concise short-circuit spread like ...(filename && {
filename }) when building the payload for modelsTranscribeAudioRoute.name so the
filename is included only when defined.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f530e40c-8f2f-4439-8f75-4caa292bfd90

📥 Commits

Reviewing files that changed from the base of the PR and between ad62bab and d33da1c.

📒 Files selected for processing (68)

docs/features/voice-input-transcription/plan.md
docs/features/voice-input-transcription/spec.md
docs/features/voice-input-transcription/tasks.md
src/main/presenter/agentRuntimePresenter/compactionService.ts
src/main/presenter/agentRuntimePresenter/contextBuilder.ts
src/main/presenter/agentRuntimePresenter/index.ts
src/main/presenter/configPresenter/index.ts
src/main/presenter/configPresenter/modelCapabilities.ts
src/main/presenter/configPresenter/modelConfig.ts
src/main/presenter/llmProviderPresenter/aiSdk/messageMapper.ts
src/main/presenter/llmProviderPresenter/aiSdk/runtime.ts
src/main/presenter/llmProviderPresenter/baseProvider.ts
src/main/presenter/llmProviderPresenter/index.ts
src/main/presenter/llmProviderPresenter/providers/aiSdkProvider.ts
src/main/routes/config/configRouteSupport.ts
src/main/routes/models/modelRouteHandler.ts
src/renderer/api/ModelClient.ts
src/renderer/src/components/chat/ChatInputBox.vue
src/renderer/src/components/chat/ChatInputToolbar.vue
src/renderer/src/components/chat/ChatStatusBar.vue
src/renderer/src/components/chat/composables/useAudioRecorder.ts
src/renderer/src/components/chat/composables/useSpeechRecognition.ts
src/renderer/src/components/chat/composables/useVoiceInput.ts
src/renderer/src/components/settings/ModelConfigDialog.vue
src/renderer/src/i18n/da-DK/chat.json
src/renderer/src/i18n/da-DK/settings.json
src/renderer/src/i18n/en-US/chat.json
src/renderer/src/i18n/en-US/settings.json
src/renderer/src/i18n/fa-IR/chat.json
src/renderer/src/i18n/fa-IR/settings.json
src/renderer/src/i18n/fr-FR/chat.json
src/renderer/src/i18n/fr-FR/settings.json
src/renderer/src/i18n/he-IL/chat.json
src/renderer/src/i18n/he-IL/settings.json
src/renderer/src/i18n/ja-JP/chat.json
src/renderer/src/i18n/ja-JP/settings.json
src/renderer/src/i18n/ko-KR/chat.json
src/renderer/src/i18n/ko-KR/settings.json
src/renderer/src/i18n/pt-BR/chat.json
src/renderer/src/i18n/pt-BR/settings.json
src/renderer/src/i18n/ru-RU/chat.json
src/renderer/src/i18n/ru-RU/settings.json
src/renderer/src/i18n/zh-CN/chat.json
src/renderer/src/i18n/zh-CN/settings.json
src/renderer/src/i18n/zh-HK/chat.json
src/renderer/src/i18n/zh-HK/settings.json
src/renderer/src/i18n/zh-TW/chat.json
src/renderer/src/i18n/zh-TW/settings.json
src/renderer/src/lib/audioInputSupport.ts
src/renderer/src/pages/ChatPage.vue
src/renderer/src/pages/NewThreadPage.vue
src/shared/contracts/domainSchemas.ts
src/shared/contracts/routes.ts
src/shared/contracts/routes/models.routes.ts
src/shared/modelConfigDefaults.ts
src/shared/types/core/chat-message.ts
src/shared/types/presenters/legacy.presenters.d.ts
src/shared/types/presenters/llmprovider.presenter.d.ts
test/main/presenter/agentRuntimePresenter/contextBuilder.test.ts
test/main/presenter/llmProviderPresenter.test.ts
test/main/presenter/llmProviderPresenter/aiSdkMessageMapper.test.ts
test/main/presenter/llmProviderPresenter/openAICompatibleProvider.test.ts
test/main/presenter/llmProviderPresenter/openAIResponsesProvider.test.ts
test/renderer/components/ChatInputBox.test.ts
test/renderer/components/ChatInputToolbar.test.ts
test/renderer/components/ModelConfigDialog.test.ts
test/renderer/composables/useSpeechRecognition.test.ts
test/renderer/lib/audioInputSupport.test.ts

👮 Files not reviewed due to content moderation or server errors (4)

src/main/presenter/agentRuntimePresenter/contextBuilder.ts
test/main/presenter/agentRuntimePresenter/contextBuilder.test.ts
src/main/presenter/agentRuntimePresenter/index.ts
src/main/presenter/agentRuntimePresenter/compactionService.ts

zhangmo8 mentioned this pull request May 14, 2026

[Feature] Voice Input Button #949

Closed

coderabbitai Bot reviewed May 14, 2026

fix(asr): address PR review feedback

9537cfc

zerob13 merged commit 9c7060f into dev May 14, 2026
3 checks passed

zhangmo8 deleted the asr branch May 14, 2026 11:59

coderabbitai Bot mentioned this pull request May 18, 2026

fix: update TTS references and localization across components #1633

Merged

zerob13 merged 2 commits into dev from asr
