
feat: add audio input support and voice recognition features #1623

Merged
zerob13 merged 2 commits into dev from asr on May 14, 2026

Conversation

@zhangmo8
Collaborator

@zhangmo8 zhangmo8 commented May 14, 2026

  • Implemented audio input handling in NewThreadPage.vue, allowing users to attach audio files and transcribe them.
  • Enhanced ChatInputBox and ChatInputToolbar components to support voice input functionality.
  • Added speech recognition capabilities using the useSpeechRecognition composable.
  • Updated model capabilities to include support for audio input and speech recognition.
  • Introduced new routes for audio transcription and updated related tests to ensure functionality.
  • Added tests for audio input handling, speech recognition, and integration with the AI SDK.
20260514_173950.mp4
20260514_174252.mp4

Summary by CodeRabbit

  • New Features

    • Local voice recording with transcription and inserting recognized text into the composer.
    • Model-aware audio attachment handling and per-model speech-recognition toggle.
    • Microphone keyboard shortcut (Ctrl/Meta + Shift + M) and animated waveform recording UI.
    • Model capability indicator showing audio-input support.
  • Documentation

    • Added implementation plan, spec, and task checklist for voice-input transcription.
  • Localization

    • Added UI strings for voice input states, errors, and settings across locales.
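
The microphone shortcut listed above (Ctrl/Meta + Shift + M) could be matched with a small predicate like the one below. This is only a sketch: `ShortcutEventLike` and `isVoiceInputShortcut` are hypothetical names, the PR's actual handler is not shown in this conversation, and rejecting Alt is an assumption made here to avoid clashing with other key combinations.

```typescript
// Subset of the DOM KeyboardEvent fields this check needs, so the sketch
// does not depend on browser typings.
interface ShortcutEventLike {
  key: string
  ctrlKey: boolean
  metaKey: boolean
  shiftKey: boolean
  altKey: boolean
}

// Returns true for Ctrl+Shift+M or Meta+Shift+M, ignoring the letter's case.
// Hypothetical helper; rejecting Alt is an assumption, not taken from the PR.
function isVoiceInputShortcut(event: ShortcutEventLike): boolean {
  const hasPrimaryModifier = event.ctrlKey || event.metaKey
  return (
    hasPrimaryModifier &&
    event.shiftKey &&
    !event.altKey &&
    event.key.toLowerCase() === 'm'
  )
}
```

Accepting either Ctrl or Meta keeps one binding working on Windows/Linux and macOS without platform detection.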

@coderabbitai
Contributor

coderabbitai Bot commented May 14, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 82278ec9-eaac-4033-a5b4-0a592b6b1013

📥 Commits

Reviewing files that changed from the base of the PR and between d33da1c and 9537cfc.

📒 Files selected for processing (19)
  • docs/features/voice-input-transcription/plan.md
  • docs/features/voice-input-transcription/spec.md
  • docs/features/voice-input-transcription/tasks.md
  • src/main/presenter/llmProviderPresenter/aiSdk/messageMapper.ts
  • src/main/presenter/llmProviderPresenter/index.ts
  • src/renderer/api/ModelClient.ts
  • src/renderer/src/components/chat/composables/useAudioRecorder.ts
  • src/renderer/src/components/chat/composables/useSpeechRecognition.ts
  • src/renderer/src/i18n/fa-IR/settings.json
  • src/renderer/src/i18n/ru-RU/chat.json
  • src/renderer/src/pages/NewThreadPage.vue
  • src/shared/contracts/routes/models.routes.ts
  • src/shared/modelConfigDefaults.ts
  • test/main/presenter/llmProviderPresenter.test.ts
  • test/main/presenter/llmProviderPresenter/aiSdkMessageMapper.test.ts
  • test/main/routes/contracts.test.ts
  • test/renderer/components/NewThreadPage.test.ts
  • test/renderer/composables/useSpeechRecognition.test.ts
  • test/renderer/lib/audioInputSupport.test.ts
✅ Files skipped from review due to trivial changes (5)
  • docs/features/voice-input-transcription/plan.md
  • src/renderer/src/i18n/fa-IR/settings.json
  • src/renderer/src/i18n/ru-RU/chat.json
  • docs/features/voice-input-transcription/tasks.md
  • docs/features/voice-input-transcription/spec.md
🚧 Files skipped from review as they are similar to previous changes (11)
  • src/shared/contracts/routes/models.routes.ts
  • src/renderer/api/ModelClient.ts
  • test/renderer/lib/audioInputSupport.test.ts
  • test/main/presenter/llmProviderPresenter.test.ts
  • src/shared/modelConfigDefaults.ts
  • src/renderer/src/components/chat/composables/useAudioRecorder.ts
  • test/main/presenter/llmProviderPresenter/aiSdkMessageMapper.test.ts
  • src/renderer/src/pages/NewThreadPage.vue
  • src/main/presenter/llmProviderPresenter/aiSdk/messageMapper.ts
  • src/main/presenter/llmProviderPresenter/index.ts
  • src/renderer/src/components/chat/composables/useSpeechRecognition.ts

📝 Walkthrough


Adds local microphone recording, WAV normalization and upload, a typed transcription route + renderer client, provider-native OpenAI-style transcription with a completion fallback, model capability gating for audio, UI recording controls and animations, audio-attachment filtering, i18n strings, and tests.

Changes

Voice Input Transcription

| Layer / File(s) | Summary |
| --- | --- |
| Docs & planning: `docs/features/voice-input-transcription/*` | Feature plan, spec, and task checklist describing architecture, UI, routing, fallback, and tests. |
| Types & schemas: `src/shared/types/*`, `src/shared/contracts/*`, `src/shared/modelConfigDefaults.ts` | Adds input_audio message variant, speechRecognition model config default, supportsAudioInput capability schema, and route contract for transcription. |
| Provider transcription surface & AiSdk provider: `src/main/presenter/llmProviderPresenter/*`, `src/main/presenter/llmProviderPresenter/providers/aiSdkProvider.ts` | Adds base provider transcribe API, LLMPresenter.transcribeAudioStandalone with fallback, and AiSdkProvider OpenAI-style /audio/transcriptions handling with abort/timeout/error mapping. |
| Message mapping & runtime: `src/main/presenter/llmProviderPresenter/aiSdk/messageMapper.ts`, `src/main/presenter/llmProviderPresenter/aiSdk/runtime.ts` | Maps input_audio parts to provider file parts, optionally injects OpenAI-compatible data URLs for compatibility. |
| Context & compaction plumbing: `src/main/presenter/agentRuntimePresenter/*` | Thread supportsAudioInput through context building, createUserChatMessage, compaction, resume/recovery flows so input_audio parts are included when supported. |
| Routes & renderer client: `src/main/routes/models/*`, `src/main/routes/models/modelRouteHandler.ts`, `src/renderer/api/ModelClient.ts` | Adds typed models.transcribeAudio route, handler wiring, and ModelClient.transcribeAudio bridge method. |
| Renderer audio capture & encoding: `src/renderer/src/components/chat/composables/useAudioRecorder.ts`, `src/renderer/src/components/chat/composables/useSpeechRecognition.ts` | MediaRecorder-based recorder, preferred MIME selection, WAV (16-bit PCM) encoding, base64 conversion, abort/timeout racing, and transcribe invocation contract. |
| Voice input abstraction: `src/renderer/src/components/chat/composables/useVoiceInput.ts` | Provider-agnostic voice input controller wrapping recorder and exposing start/stop/toggle/cleanup and reactive state. |
| UI: input box, toolbar, status bar, pages: `src/renderer/src/components/chat/ChatInputBox.vue`, `ChatInputToolbar.vue`, `ChatStatusBar.vue`, `src/renderer/src/pages/*` | Keyboard shortcut, exposed insertRecognizedText, voice button with waveform animation and accessibility states, model-capability mic indicator, chat/new-thread pages integrate voice flows and audio-attachment filtering. |
| Attachment filtering & lib: `src/renderer/src/lib/audioInputSupport.ts` | Detects audio attachments (MIME/extension) and filters or rejects them when model does not support audio input. |
| Model config UI & i18n: `src/renderer/src/components/settings/ModelConfigDialog.vue`, `src/renderer/src/i18n/*` | Adds speechRecognition switch in model config and localized strings for voice input states, errors, and model audio capability across languages. |
| Tests: `test/main/*`, `test/renderer/*` | Adds unit and integration tests for context building, message mapping, provider transcription behavior (success/fallback/error), recorder/composable flows, UI events, attachment filtering, and route validation. |
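
The "WAV (16-bit PCM) encoding" step in the renderer layer can be illustrated with a minimal encoder. This is a sketch under stated assumptions, not the code from `useAudioRecorder.ts`: `encodeWavPcm16` is a hypothetical name, and the input is assumed to be mono Float32 samples in the range [-1, 1].

```typescript
// Encodes mono Float32 samples in [-1, 1] as a 16-bit PCM WAV file:
// a 44-byte RIFF/WAVE header followed by little-endian sample data.
// Hypothetical helper; mono input is an assumption of this sketch.
function encodeWavPcm16(samples: Float32Array, sampleRate: number): Uint8Array {
  const bytesPerSample = 2
  const channels = 1
  const dataSize = samples.length * bytesPerSample
  const buffer = new ArrayBuffer(44 + dataSize)
  const view = new DataView(buffer)

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i))
  }

  writeAscii(0, 'RIFF')
  view.setUint32(4, 36 + dataSize, true) // file size minus the first 8 bytes
  writeAscii(8, 'WAVE')
  writeAscii(12, 'fmt ')
  view.setUint32(16, 16, true) // fmt chunk size for PCM
  view.setUint16(20, 1, true) // audio format 1 = linear PCM
  view.setUint16(22, channels, true)
  view.setUint32(24, sampleRate, true)
  view.setUint32(28, sampleRate * channels * bytesPerSample, true) // byte rate
  view.setUint16(32, channels * bytesPerSample, true) // block align
  view.setUint16(34, 16, true) // bits per sample
  writeAscii(36, 'data')
  view.setUint32(40, dataSize, true)

  for (let i = 0; i < samples.length; i++) {
    const clamped = Math.max(-1, Math.min(1, samples[i]))
    // Scale asymmetrically so -1 maps to -32768 and 1 maps to 32767.
    const scaled = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff
    view.setInt16(44 + i * 2, scaled, true)
  }
  return new Uint8Array(buffer)
}
```

Normalizing to one fixed container before upload means the transcription route only ever has to handle a single, widely supported audio format, whatever MIME type the recorder produced.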

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested reviewers

  • zerob13
  • deepinfect

🐰 "I tapped the mic and heard a squeak,
WAV waves glow as bytes take a peek,
From whisper to text in a twinkling beat,
Insert the words — oh what a treat!"


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/shared/modelConfigDefaults.ts (1)

8-18: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Wire speechRecognition into the fallback defaults.

DEFAULT_MODEL_SPEECH_RECOGNITION is defined but not included in DEFAULT_MODEL_CAPABILITY_FALLBACKS, so fallback-derived configs can still expose speechRecognition as undefined instead of false.

🔧 Proposed fix
 export const DEFAULT_MODEL_CAPABILITY_FALLBACKS = Object.freeze({
   contextLength: DEFAULT_MODEL_CONTEXT_LENGTH,
   maxTokens: DEFAULT_MODEL_MAX_TOKENS,
   vision: DEFAULT_MODEL_VISION,
+  speechRecognition: DEFAULT_MODEL_SPEECH_RECOGNITION,
   functionCall: DEFAULT_MODEL_FUNCTION_CALL,
   reasoning: DEFAULT_MODEL_REASONING
 })
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/shared/modelConfigDefaults.ts` around lines 8 - 18,
DEFAULT_MODEL_SPEECH_RECOGNITION is defined but not included in
DEFAULT_MODEL_CAPABILITY_FALLBACKS, leaving fallback-derived configs with
speechRecognition undefined; update the DEFAULT_MODEL_CAPABILITY_FALLBACKS
Object.freeze block to include a speechRecognition property set to
DEFAULT_MODEL_SPEECH_RECOGNITION so fallback resolution returns false by default
(modify the DEFAULT_MODEL_CAPABILITY_FALLBACKS object where contextLength,
maxTokens, vision, functionCall, and reasoning are defined).
🧹 Nitpick comments (1)
src/renderer/api/ModelClient.ts (1)

222-237: ⚡ Quick win

Consider simplifying the optional filename spread.

The conditional spread for the optional filename parameter works correctly but could be more concise using short-circuit evaluation.

♻️ Simpler optional parameter pattern
   const result = await bridge.invoke(modelsTranscribeAudioRoute.name, {
     providerId,
     modelId,
     audioBase64,
     mimeType,
-    ...(filename ? { filename } : {})
+    ...(filename && { filename })
   })
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/renderer/api/ModelClient.ts` around lines 222 - 237, The conditional
spread for the optional filename in transcribeAudio is verbose; replace the
ternary spread (...(filename ? { filename } : {})) with a concise short-circuit
spread like ...(filename && { filename }) when building the payload for
modelsTranscribeAudioRoute.name so the filename is included only when defined.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/main/presenter/llmProviderPresenter/aiSdk/messageMapper.ts`:
- Around line 151-157: The code computes a fallback mediaType but still passes
the original actualMediaType into buildAudioProviderOptions, causing the
generated data URL to use an unsupported MIME type; update the call to
buildAudioProviderOptions to pass the computed mediaType (not actualMediaType)
and adjust buildAudioProviderOptions' signature/usage to accept and use this
mediaType so the data:<mime>;base64,... string and provider options consistently
use OPENAI_COMPATIBLE_AUDIO_FALLBACK_MEDIA_TYPE when applicable.
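
The shape of this fix might look like the sketch below, in which the media type is resolved once and then used for both the file part and the data URL. The constant values, the supported-type set, and the helpers `resolveAudioMediaType`, `buildAudioDataUrl`, and `mapAudioPart` are all assumptions for illustration; the real `buildAudioProviderOptions` signature is not shown in this review.

```typescript
// Assumed values for illustration only; the real constants live in messageMapper.ts.
const OPENAI_COMPATIBLE_AUDIO_FALLBACK_MEDIA_TYPE = 'audio/wav'
const OPENAI_COMPATIBLE_AUDIO_MEDIA_TYPES = new Set(['audio/wav', 'audio/mpeg'])

// Picks the media type that will actually be sent to the provider.
function resolveAudioMediaType(actualMediaType: string): string {
  const normalized = actualMediaType.toLowerCase()
  return OPENAI_COMPATIBLE_AUDIO_MEDIA_TYPES.has(normalized)
    ? normalized
    : OPENAI_COMPATIBLE_AUDIO_FALLBACK_MEDIA_TYPE
}

// Hypothetical stand-in for buildAudioProviderOptions: it receives the resolved
// media type rather than the original one.
function buildAudioDataUrl(base64: string, mediaType: string): string {
  return `data:${mediaType};base64,${base64}`
}

// Resolves the media type once so the file part and data URL cannot disagree.
function mapAudioPart(
  base64: string,
  actualMediaType: string
): { mediaType: string; dataUrl: string } {
  const mediaType = resolveAudioMediaType(actualMediaType)
  return { mediaType, dataUrl: buildAudioDataUrl(base64, mediaType) }
}
```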

In `@src/main/presenter/llmProviderPresenter/index.ts`:
- Around line 401-403: The check uses normalizedMimeType but may be comparing
with mixed-case input (e.g., "Audio/WAV"); ensure normalizedMimeType is created
by lowercasing the incoming mimeType (e.g., normalizedMimeType =
mimeType.toLowerCase()) before performing the startsWith('audio/') validation so
the condition accepts valid audio types regardless of case; update the logic
around the normalizedMimeType variable where the MIME is validated in the LLM
provider presenter (the block that throws Error(`Invalid audio MIME type for
transcription: ${mimeType}`)) to use the lowercased value.
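
A minimal sketch of the case-insensitive check follows. `normalizeAudioMimeType` is a hypothetical helper name; only the lowercasing, the `startsWith('audio/')` test, and the error message come from the comment above.

```typescript
// Lowercases the incoming MIME type before validating it, so inputs such as
// "Audio/WAV" are accepted. Hypothetical helper name.
function normalizeAudioMimeType(mimeType: string): string {
  const normalizedMimeType = mimeType.toLowerCase()
  if (!normalizedMimeType.startsWith('audio/')) {
    throw new Error(`Invalid audio MIME type for transcription: ${mimeType}`)
  }
  return normalizedMimeType
}
```

Reporting the original `mimeType` in the error, rather than the lowercased one, keeps the message faithful to what the caller actually sent.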

In `@src/renderer/src/components/chat/composables/useAudioRecorder.ts`:
- Around line 88-95: mediaRecorder.onstop currently calls options.onRecorded
even after cleanupRecorder()/cleanup() has run, which can emit stale callbacks;
fix by adding a disposal guard: introduce and set a local boolean flag (e.g.,
isDisposed or isActiveRecording) that cleanupRecorder()/cleanup() flips to true,
and in mediaRecorder.onstop check the flag before invoking options.onRecorded
(or alternatively null out options.onRecorded in cleanup and guard for its
existence in onstop); reference mediaRecorder.onstop, cleanupRecorder(),
cleanup(), and options.onRecorded when implementing the guard.

In `@src/renderer/src/components/chat/composables/useSpeechRecognition.ts`:
- Around line 100-106: The switch in useSpeechRecognition.ts that inspects
error.message currently groups 'transcription-timeout' with decode failures;
update the switch in the function handling speech errors so that
'transcription-timeout' is not returned as 'decode-failed'—remove it from the
decode-failed case and return a distinct, appropriate error key (e.g.,
'transcription-timeout' or 'timeout') in the switch's default/own case so
callers of the composable (useSpeechRecognition) can distinguish timeout vs
decode failures.
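
The separated mapping could look like the sketch below. The type `SpeechErrorKey`, the function `toSpeechErrorKey`, and the `'audio-decode-failed'` message are assumptions; only the `'transcription-timeout'` and `'decode-failed'` keys appear in the comment above.

```typescript
type SpeechErrorKey = 'decode-failed' | 'transcription-timeout' | 'unknown'

// Maps a thrown error to a UI error key, keeping timeouts distinct from
// decode failures. Hypothetical helper.
function toSpeechErrorKey(error: unknown): SpeechErrorKey {
  const message = error instanceof Error ? error.message : ''
  switch (message) {
    case 'audio-decode-failed': // assumed decode-failure message, for illustration
      return 'decode-failed'
    case 'transcription-timeout':
      return 'transcription-timeout'
    default:
      return 'unknown'
  }
}
```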

In `@src/renderer/src/i18n/fa-IR/settings.json`:
- Line 427: In the fa-IR translation entry (the JSON description string
currently reading "مشخص می‌کند آیا این مدل ورود صوتی با تبدیل محلی گفتار به متن
را مجاز می‌کند یا نه."), replace the phrase "ورود صوتی" with "ورودی صوتی" so the
description reads with the correct wording for "voice input" and matches other
feature labels; update the value for the same "description" string in
settings.json accordingly.

In `@src/renderer/src/i18n/ru-RU/chat.json`:
- Line 71: The message for the JSON key "audioInputUnsupportedDescription"
contains the pseudo-plural "аудиовложение(й)"; replace that with a neutral,
natural Russian phrase that works for any count (for example, use "аудиофайлы"
or "аудиозаписи") so the string reads smoothly: "Модель {model} не поддерживает
аудиоввод. {count} аудиофайлы были пропущены." Update the value for
audioInputUnsupportedDescription accordingly.

In `@src/renderer/src/pages/NewThreadPage.vue`:
- Around line 899-918: prepareFilesForCurrentModel currently calls
resolveModel(), which can return a different model than the active ACP draft
target and thus mis-filter attachments; replace the resolveModel() call with the
actual submission target used for the ACP draft (either by reading the active
ACP draft target from state/context or by adding a parameter like
submissionTarget to prepareFilesForCurrentModel) and use that selection when
calling modelClient.getCapabilities(selection.providerId, selection.modelId);
keep the filtering logic with filterUnsupportedAudioAttachments and
notifyUnsupportedAudioAttachments the same and preserve the early-return when no
selection or files are empty.

In `@src/shared/contracts/routes/models.routes.ts`:
- Around line 192-198: The input schema's audioBase64 and mimeType are unbounded
causing potential oversized IPC payloads; update the zod schema in
models.routes.ts (the input: z.object({...}) block) to add max limits: add
.max(15_000_000) to audioBase64 to cap base64 audio at roughly 11 MB of binary data and add
.max(255) to mimeType (and consider .max(255) on filename.optional() if
desired). Keep the field names providerId, modelId, audioBase64, mimeType, and
filename unchanged when applying these .max(...) constraints.
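
The arithmetic behind the proposed cap can be checked with a dependency-free sketch. The review suggests zod `.max(...)` constraints; the plain functions below only mirror those limits, and the names `MAX_AUDIO_BASE64_LENGTH`, `MAX_MIME_TYPE_LENGTH`, `maxDecodedBytes`, and `isWithinTranscriptionLimits` are hypothetical.

```typescript
// Limits taken from the review suggestion above.
const MAX_AUDIO_BASE64_LENGTH = 15_000_000
const MAX_MIME_TYPE_LENGTH = 255

// Upper bound on decoded size: every 4 base64 characters carry 3 bytes.
function maxDecodedBytes(base64Length: number): number {
  return Math.floor(base64Length / 4) * 3
}

// Plain-TypeScript mirror of the suggested length constraints.
function isWithinTranscriptionLimits(audioBase64: string, mimeType: string): boolean {
  return (
    audioBase64.length <= MAX_AUDIO_BASE64_LENGTH &&
    mimeType.length <= MAX_MIME_TYPE_LENGTH
  )
}
```

With these numbers, `maxDecodedBytes(MAX_AUDIO_BASE64_LENGTH)` is 11,250,000 bytes, so the cap admits a little over 11 MB of raw audio.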

In `@test/renderer/lib/audioInputSupport.test.ts`:
- Around line 12-22: Add a negative test that asserts isAudioAttachment returns
false for non-audio MIME types by creating a file via createFile with a
non-audio mimeType (e.g., application/pdf, text/plain, or image/png) and
expecting the result toBe(false); place the new test alongside the existing
'detects audio attachments from mime type' test and name it something like
'returns false for non-audio attachments' to clearly cover the negative case.
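
The negative case can be illustrated against a stand-in detector. This is a sketch only: `AttachmentLike`, the extension list, and this `isAudioAttachment` body are assumptions, since the real implementation in `audioInputSupport.ts` and the `createFile` test helper are not shown in this review.

```typescript
interface AttachmentLike {
  name: string
  mimeType?: string
}

// Assumed extension list for illustration; the real list lives in audioInputSupport.ts.
const AUDIO_EXTENSIONS = ['.mp3', '.wav', '.m4a', '.ogg', '.flac']

// Stand-in detector: checks the MIME type first, then falls back to the extension.
function isAudioAttachment(file: AttachmentLike): boolean {
  const mimeType = (file.mimeType ?? '').toLowerCase()
  if (mimeType.startsWith('audio/')) return true
  const name = file.name.toLowerCase()
  return AUDIO_EXTENSIONS.some((extension) => name.endsWith(extension))
}

// Negative cases the review asks for: non-audio MIME types must not be detected as audio.
const nonAudioFiles: AttachmentLike[] = [
  { name: 'report.pdf', mimeType: 'application/pdf' },
  { name: 'notes.txt', mimeType: 'text/plain' },
  { name: 'photo.png', mimeType: 'image/png' }
]
const falsePositives = nonAudioFiles.filter((file) => isAudioAttachment(file))
```

In the project's test file, the same inputs would go through its own `createFile` helper with an expectation that the result is false.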


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f530e40c-8f2f-4439-8f75-4caa292bfd90

📥 Commits

Reviewing files that changed from the base of the PR and between ad62bab and d33da1c.

📒 Files selected for processing (68)
  • docs/features/voice-input-transcription/plan.md
  • docs/features/voice-input-transcription/spec.md
  • docs/features/voice-input-transcription/tasks.md
  • src/main/presenter/agentRuntimePresenter/compactionService.ts
  • src/main/presenter/agentRuntimePresenter/contextBuilder.ts
  • src/main/presenter/agentRuntimePresenter/index.ts
  • src/main/presenter/configPresenter/index.ts
  • src/main/presenter/configPresenter/modelCapabilities.ts
  • src/main/presenter/configPresenter/modelConfig.ts
  • src/main/presenter/llmProviderPresenter/aiSdk/messageMapper.ts
  • src/main/presenter/llmProviderPresenter/aiSdk/runtime.ts
  • src/main/presenter/llmProviderPresenter/baseProvider.ts
  • src/main/presenter/llmProviderPresenter/index.ts
  • src/main/presenter/llmProviderPresenter/providers/aiSdkProvider.ts
  • src/main/routes/config/configRouteSupport.ts
  • src/main/routes/models/modelRouteHandler.ts
  • src/renderer/api/ModelClient.ts
  • src/renderer/src/components/chat/ChatInputBox.vue
  • src/renderer/src/components/chat/ChatInputToolbar.vue
  • src/renderer/src/components/chat/ChatStatusBar.vue
  • src/renderer/src/components/chat/composables/useAudioRecorder.ts
  • src/renderer/src/components/chat/composables/useSpeechRecognition.ts
  • src/renderer/src/components/chat/composables/useVoiceInput.ts
  • src/renderer/src/components/settings/ModelConfigDialog.vue
  • src/renderer/src/i18n/da-DK/chat.json
  • src/renderer/src/i18n/da-DK/settings.json
  • src/renderer/src/i18n/en-US/chat.json
  • src/renderer/src/i18n/en-US/settings.json
  • src/renderer/src/i18n/fa-IR/chat.json
  • src/renderer/src/i18n/fa-IR/settings.json
  • src/renderer/src/i18n/fr-FR/chat.json
  • src/renderer/src/i18n/fr-FR/settings.json
  • src/renderer/src/i18n/he-IL/chat.json
  • src/renderer/src/i18n/he-IL/settings.json
  • src/renderer/src/i18n/ja-JP/chat.json
  • src/renderer/src/i18n/ja-JP/settings.json
  • src/renderer/src/i18n/ko-KR/chat.json
  • src/renderer/src/i18n/ko-KR/settings.json
  • src/renderer/src/i18n/pt-BR/chat.json
  • src/renderer/src/i18n/pt-BR/settings.json
  • src/renderer/src/i18n/ru-RU/chat.json
  • src/renderer/src/i18n/ru-RU/settings.json
  • src/renderer/src/i18n/zh-CN/chat.json
  • src/renderer/src/i18n/zh-CN/settings.json
  • src/renderer/src/i18n/zh-HK/chat.json
  • src/renderer/src/i18n/zh-HK/settings.json
  • src/renderer/src/i18n/zh-TW/chat.json
  • src/renderer/src/i18n/zh-TW/settings.json
  • src/renderer/src/lib/audioInputSupport.ts
  • src/renderer/src/pages/ChatPage.vue
  • src/renderer/src/pages/NewThreadPage.vue
  • src/shared/contracts/domainSchemas.ts
  • src/shared/contracts/routes.ts
  • src/shared/contracts/routes/models.routes.ts
  • src/shared/modelConfigDefaults.ts
  • src/shared/types/core/chat-message.ts
  • src/shared/types/presenters/legacy.presenters.d.ts
  • src/shared/types/presenters/llmprovider.presenter.d.ts
  • test/main/presenter/agentRuntimePresenter/contextBuilder.test.ts
  • test/main/presenter/llmProviderPresenter.test.ts
  • test/main/presenter/llmProviderPresenter/aiSdkMessageMapper.test.ts
  • test/main/presenter/llmProviderPresenter/openAICompatibleProvider.test.ts
  • test/main/presenter/llmProviderPresenter/openAIResponsesProvider.test.ts
  • test/renderer/components/ChatInputBox.test.ts
  • test/renderer/components/ChatInputToolbar.test.ts
  • test/renderer/components/ModelConfigDialog.test.ts
  • test/renderer/composables/useSpeechRecognition.test.ts
  • test/renderer/lib/audioInputSupport.test.ts
👮 Files not reviewed due to content moderation or server errors (4)
  • src/main/presenter/agentRuntimePresenter/contextBuilder.ts
  • test/main/presenter/agentRuntimePresenter/contextBuilder.test.ts
  • src/main/presenter/agentRuntimePresenter/index.ts
  • src/main/presenter/agentRuntimePresenter/compactionService.ts

Comment thread src/main/presenter/llmProviderPresenter/aiSdk/messageMapper.ts Outdated
Comment thread src/main/presenter/llmProviderPresenter/index.ts
Comment thread src/renderer/src/components/chat/composables/useAudioRecorder.ts Outdated
Comment thread src/renderer/src/i18n/fa-IR/settings.json Outdated
Comment thread src/renderer/src/i18n/ru-RU/chat.json Outdated
Comment thread src/renderer/src/pages/NewThreadPage.vue
Comment thread src/shared/contracts/routes/models.routes.ts
Comment thread test/renderer/lib/audioInputSupport.test.ts
@zerob13 zerob13 merged commit 9c7060f into dev May 14, 2026
3 checks passed
@zhangmo8 zhangmo8 deleted the asr branch May 14, 2026 11:59
2 participants