Add "Qwen3 TTS (CrispASR)" engine + rename Chatterbox TTS #11097
Merged
Makes it explicit that Chatterbox runs through the CrispASR runtime, in line with the upcoming "Qwen3 TTS (CrispASR)" engine that uses the same binary. Cosmetic only - the engine class, settings, voices, and model files are unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spawns the existing crispasr binary in --server mode with one of:
- --backend qwen3-tts-1.7b-voicedesign (default; needs instructions)
- --backend qwen3-tts-1.7b-customvoice (voice cloning, takes a WAV + ref-text)

and POSTs to its OpenAI-compatible /v1/audio/speech endpoint.

Why both engines: qwen3-tts.cpp's 0.6B model never emits EOS for short prompts (frame-by-frame logit dump shows the EOS slot is deeply negative and trending more negative), so a "Hello world" runs to the max_audio_tokens cap. CrispASR's qwen3-tts-1.7b-voicedesign backend cleanly emits EOS on the same prompt (1 s of audio in 1.4 s wall time, RTF=1.5 on M4) - useful as a working baseline while the qwen3-tts.cpp EOS work continues upstream.

Engine layout (separate from qwen3-tts.cpp's folder per user direction):
- TextToSpeech/Qwen3TtsCrispAsr/models/: talker GGUF + 12Hz codec
- TextToSpeech/Qwen3TtsCrispAsr/voices/: reference WAVs (+ .txt sidecar)

The CrispASR-style 12Hz codec (~986 MB) is a different file from qwen3-tts.cpp's tokenizer; if not staged locally, the engine passes --auto-download to crispasr so it fetches into ~/.cache/crispasr/ on first run (first-run timeout bumped to 30 min for that case). The talker GGUF must be placed manually until the auto-downloader lands - the cstr-uploaded 1.7B VoiceDesign GGUF that qwen3-tts.cpp already downloads is bit-for-bit compatible and can be symlinked.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
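As a rough illustration of the request side, the sketch below builds a JSON body and POSTs it to `/v1/audio/speech`. It is not the PR's code: `SpeechRequestSketch`, `BuildBody`, and `SynthesizeAsync` are hypothetical names, the `input` field is assumed from the OpenAI speech API that the endpoint is said to be compatible with, and the server address is left to the caller because the commit does not state a port.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

// Hypothetical helper, for illustration only.
public static class SpeechRequestSketch
{
    // Builds the JSON body. "input" follows the OpenAI speech API (assumed);
    // "instructions" and "voice" are the fields named in this PR.
    public static string BuildBody(string text, string instructions, string voice)
    {
        var body = new Dictionary<string, string> { ["input"] = text };

        if (!string.IsNullOrWhiteSpace(instructions))
        {
            body["instructions"] = instructions; // VoiceDesign: free-text voice description
        }

        if (!string.IsNullOrWhiteSpace(voice))
        {
            body["voice"] = voice; // CustomVoice: name of the reference voice
        }

        return JsonSerializer.Serialize(body);
    }

    // Posts the body to the local server and returns the raw audio bytes.
    public static async Task<byte[]> SynthesizeAsync(HttpClient client, Uri baseAddress, string jsonBody)
    {
        using var content = new StringContent(jsonBody, Encoding.UTF8, "application/json");
        using var response = await client.PostAsync(new Uri(baseAddress, "/v1/audio/speech"), content);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsByteArrayAsync();
    }
}
```

Omitting empty fields instead of sending empty strings keeps the two backends' requests distinct: a VoiceDesign request carries only `instructions`, a CustomVoice request only `voice`.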
The new engine needs the same free-text instruction box the qwen3-tts.cpp engine uses for VoiceDesign. Without it there's no UI to set the voice description and the engine silently falls back to "a calm female voice" on every request.
- IsVoiceDesignModel helper on Qwen3TtsCrispAsr matches the one on Qwen3TtsCpp.
- RefreshInstructionVisibility shows the instruction text box for either Qwen3 engine when VoiceDesign is the selected model.
- UpdateVoiceLock disables the voice combo for both engines on VoiceDesign, since voice cloning has no effect there.
- GetInstructionForEngine reads Qwen3TtsCppInstruction for both engines, so the same description is shared when A/B testing.
- Save path persists Qwen3TtsCrispAsrModel + the shared instruction.
- Engine-change path restores SelectedModel from Qwen3TtsCrispAsrModel.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The engine-combo dot defaulted to None for Qwen3TtsCrispAsr because GetTtsEngineDotStatus had no case for it. Mirror the Chatterbox approach: add GetEngineUpdateStatus on the engine class (reads the CrispASR runtime's sidecar, which is what the engine sits on top of) and route the new case through StatusDots.From in TextToSpeechWindow.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three items from manual testing:
1. CustomVoice fails up-front with "talker GGUF not found". The 1.7B CustomVoice talker isn't bit-for-bit compatible with the VoiceDesign one, so symlinking from qwen3-tts.cpp doesn't help and no SE-side downloader exists yet. Hand the talker off to crispasr's own --auto-download when missing locally (same approach we already use for the 12Hz codec). Pass `-m auto` so crispasr resolves the right model from the backend name. First-run timeout already covers the longer download window.
2. CustomVoice has no baked default voice. It's pure voice cloning and rejects requests without a `voice` field. Drop the "Default" entry from the voice combo when CustomVoice is selected, and add a clear up-front error in Speak if the user picks an empty voice anyway (mentions 24 kHz mono WAV + .txt sidecar requirements).
3. The voice list now depends on the selected model, so trigger a RefreshVoices when the user toggles between VoiceDesign and CustomVoice. Also persist the new model choice immediately so GetVoices sees it on the next call.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
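The model-dependent voice list and the up-front check could look roughly like the sketch below. The class, method names, model-key strings, and error wording are all hypothetical; only the behaviour (no "Default" entry for CustomVoice, and an error that mentions the 24 kHz mono WAV and .txt sidecar) comes from the commit message.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, for illustration only.
public static class VoiceListSketch
{
    // Assumed model keys; the real constants live on Qwen3TtsCrispAsr.
    public const string VoiceDesign = "voicedesign";
    public const string CustomVoice = "customvoice";

    // CustomVoice is pure cloning, so it gets no "Default" entry.
    public static List<string> GetVoices(string modelKey, IEnumerable<string> importedVoices)
    {
        var voices = importedVoices
            .OrderBy(v => v, StringComparer.OrdinalIgnoreCase)
            .ToList();

        if (modelKey != CustomVoice)
        {
            voices.Insert(0, "Default");
        }

        return voices;
    }

    // Returns an error message to show the user, or null when the request can proceed.
    public static string Validate(string modelKey, string voice)
    {
        if (modelKey == CustomVoice && string.IsNullOrWhiteSpace(voice))
        {
            return "CustomVoice needs a reference voice: import a 24 kHz mono WAV with a matching .txt sidecar.";
        }

        return null;
    }
}
```

Validating before the request is sent means the user sees an actionable message instead of the server's rejection of a missing `voice` field.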
The new "Qwen3 TTS (CrispASR)" engine had ImportVoice implemented but VoiceSettingsViewModel.IsImportVoiceVisible didn't include it, so the button stayed hidden and Import Voice was effectively unavailable to users.

Also seed our voices folder from the existing qwen3-tts.cpp install (TextToSpeech/Qwen3TtsCpp/voices/) on first GetVoices when our folder is empty. The reference WAVs and .txt sidecars are bit-for-bit usable by both engines, so users who already downloaded the qwen3-tts.cpp voice pack get them for free here. Users who never installed qwen3-tts.cpp start empty and can use Import Voice. A dedicated CrispASR-side voice-pack downloader (fetching the support-files voices.zip into this engine's folder) is still a follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Probe `crispasr --version` once per binary mtime and cache the parsed
semver. Surfaces the installed runtime version next to the existing
backend / status labels in two places:
- Speech-to-text engine settings: BackendLabel now reads e.g.
"macOS universal, v0.6.9" instead of just "macOS universal".
- Chatterbox TTS settings: EngineLabel now reads e.g.
"CrispASR v0.6.9 (Chatterbox-capable)"; previously the version was omitted.
Both call into a shared CrispAsrVersion helper (new file under
SpeechToText/Engines). The probe accepts both the new structured
--version output (v0.6.9+, "version : 0.6.9") and the legacy single-line
banner ("crispasr 0.6.7 (git ..., Release) [backends: ...]").
The new Qwen3 TTS (CrispASR) engine doesn't have a settings dialog yet,
so it doesn't pick up the version display in this commit - that's a
follow-up once that dialog lands.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
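A minimal sketch of the parsing and caching described above, assuming only the two output shapes quoted in the commit message. `CrispAsrVersionSketch` and its members are hypothetical names, not the PR's `CrispAsrVersion` helper, and the probe itself is passed in as a delegate so the sketch does not guess how the process is launched.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class CrispAsrVersionSketch
{
    // Structured output, e.g. "version : 0.6.9"
    private static readonly Regex Structured = new Regex(
        @"^\s*version\s*:\s*v?(\d+\.\d+\.\d+)",
        RegexOptions.IgnoreCase | RegexOptions.Multiline);

    // Legacy banner, e.g. "crispasr 0.6.7 (git ..., Release) [backends: ...]"
    private static readonly Regex Legacy = new Regex(
        @"^\s*crispasr\s+v?(\d+\.\d+\.\d+)",
        RegexOptions.IgnoreCase | RegexOptions.Multiline);

    private static readonly ConcurrentDictionary<string, (DateTime Mtime, string Version)> Cache =
        new ConcurrentDictionary<string, (DateTime Mtime, string Version)>();

    // Returns the bare semver ("0.6.9") or null when neither shape matches.
    public static string Parse(string output)
    {
        if (string.IsNullOrWhiteSpace(output))
        {
            return null;
        }

        var match = Structured.Match(output);
        if (!match.Success)
        {
            match = Legacy.Match(output);
        }

        return match.Success ? match.Groups[1].Value : null;
    }

    // Re-runs the probe only when the binary's modification time changes.
    public static string GetCached(string executablePath, Func<string, string> runVersionProbe)
    {
        var mtime = File.GetLastWriteTimeUtc(executablePath);
        if (Cache.TryGetValue(executablePath, out var hit) && hit.Mtime == mtime)
        {
            return hit.Version;
        }

        var version = Parse(runVersionProbe(executablePath));
        Cache[executablePath] = (mtime, version);
        return version;
    }
}
```

Keying the cache on the modification time means a re-downloaded binary is probed again without any explicit invalidation call.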
Mirrors the Chatterbox TTS settings dialog: shows CrispASR runtime status + version, the install state of the VoiceDesign / CustomVoice talker GGUFs and the 12 Hz codec, count of imported voices, and the install folder. Status dots match the rest of the SE TTS UI - green when present locally, grey "Auto-download on first use" otherwise (missing local files aren't fatal because the engine passes --auto-download to crispasr in that case).

Buttons:
- Re-download CrispASR (delegates to EnsureCrispAsrForChatterbox, same flow Chatterbox already uses)
- Open models folder
- Open voices folder
- Close

Also sets IsEngineSettingsVisible = true when this engine is selected so the gear icon appears in the main TTS UI, and registers the new ViewModel in DI.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Adds a new Qwen3 TTS (CrispASR) engine that routes synthesis through the existing CrispASR runtime (VoiceDesign + CustomVoice), and updates UI/settings plumbing to support engine selection, engine settings, and CrispASR version display. It also renames the existing Chatterbox engine to make the CrispASR dependency explicit.
Changes:
- Introduce `Qwen3TtsCrispAsr` engine with local CrispASR server lifecycle + OpenAI-compatible `/v1/audio/speech` calls.
- Add a dedicated settings window/viewmodel for Qwen3 (CrispASR), plus a new persisted model key setting.
- Show CrispASR runtime version in engine settings UIs and rename Chatterbox engine display name.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| src/ui/Logic/Config/SeVideoTextToSpeech.cs | Adds persisted setting for Qwen3 (CrispASR) model selection + default. |
| src/ui/Features/Video/TextToSpeech/VoiceSettings/VoiceSettingsViewModel.cs | Enables voice import UI for the new engine. |
| src/ui/Features/Video/TextToSpeech/TextToSpeechWindow.cs | Adds status-dot support for Qwen3 (CrispASR) engine. |
| src/ui/Features/Video/TextToSpeech/TextToSpeechViewModel.cs | Registers the new engine, persists its model selection, and wires settings dialog + model-dependent voice list behavior. |
| src/ui/Features/Video/TextToSpeech/Qwen3TtsCrispAsrSettings/Qwen3TtsCrispAsrSettingsWindow.cs | New settings dialog UI for Qwen3 (CrispASR). |
| src/ui/Features/Video/TextToSpeech/Qwen3TtsCrispAsrSettings/Qwen3TtsCrispAsrSettingsViewModel.cs | New settings VM: CrispASR runtime/model/codec/voices status + actions. |
| src/ui/Features/Video/TextToSpeech/Engines/Qwen3TtsCrispAsr.cs | New engine implementation (server start/stop, voice handling, import). |
| src/ui/Features/Video/TextToSpeech/Engines/ChatterboxTtsCpp.cs | Renames engine display name to “Chatterbox TTS (CrispASR)”. |
| src/ui/Features/Video/TextToSpeech/ChatterboxTtsSettings/ChatterboxTtsSettingsViewModel.cs | Adds CrispASR version suffix to improve runtime visibility. |
| src/ui/Features/Video/SpeechToText/EngineSettings/SpeechToTextEngineSettingsViewModel.cs | Appends CrispASR version to backend labels for CrispASR engines. |
| src/ui/Features/Video/SpeechToText/Engines/CrispAsrVersion.cs | New cached helper to probe crispasr --version and parse version output. |
| src/ui/DependencyInjectionExtensions.cs | Registers the new Qwen3 (CrispASR) settings ViewModel for DI. |
| change-log.txt | Documents the new engine and Chatterbox rename. |
Comments suppressed due to low confidence (1)
src/ui/Features/Video/TextToSpeech/TextToSpeechViewModel.cs:179
`Qwen3TtsCrispAsr` is added to the available engines, but `IsEngineInstalled(...)` in this view model does not have a branch to guide the user through installing CrispASR (unlike `ChatterboxTtsCpp`). If CrispASR is missing, the flow will just return false with no prompt, making the new engine effectively unusable on first use. Add an install/update prompt (likely reusing the CrispASR download flow) when `SelectedEngine is Qwen3TtsCrispAsr`.
new MistralSpeech(ttsDownloadService),
new Murf(ttsDownloadService),
new GoogleSpeech(ttsDownloadService),
new Qwen3TtsCpp(),
new Qwen3TtsCrispAsr(),
new KokoroTtsCpp(),
new ChatterboxTtsCpp(),
new OmniVoiceTtsCpp(),
];
      public class ChatterboxTtsCpp : ITtsEngine
      {
    -     public string Name => "Chatterbox TTS";
    +     public string Name => "Chatterbox TTS (CrispASR)";

    if (SelectedEngine is Qwen3TtsCrispAsr engine)
    {
        Se.Settings.Video.TextToSpeech.Qwen3TtsCrispAsrModel = value ?? Qwen3TtsCrispAsr.DefaultModelKey;
        _ = RefreshVoices(engine);

Comment on lines +148 to +153

        // CrispASR is shared with Speech-to-text; piggy-back on the same redownload flow
        // Chatterbox uses, then refresh status here. The crispasr binary itself does not
        // care which TTS engine triggered the download.
        await TtsVoiceInstaller.EnsureCrispAsrForChatterbox(Window, _windowService, forceRedownload: true);
        Refresh();
    }

Comment on lines +23 to +27

    UiUtil.InitializeWindow(this, GetType().Name);
    Title = "Qwen3 TTS (CrispASR) settings";
    SizeToContent = SizeToContent.WidthAndHeight;
    CanResize = false;
    MinWidth = 580;

Comment on lines +188 to +189

    var openModelsFolder = UiUtil.MakeButton("Open models folder", vm.OpenModelsFolderCommand).WithIconLeft(IconNames.FolderOpen);
    var openVoicesFolder = UiUtil.MakeButton("Open voices folder", vm.OpenVoicesFolderCommand).WithIconLeft(IconNames.FolderOpen);

Comment on lines +657 to +688
    public bool ImportVoice(string fileName)
    {
        if (string.IsNullOrEmpty(fileName) || !File.Exists(fileName))
        {
            return false;
        }

        var voicesFolder = GetSetVoicesFolder();
        var baseName = Path.GetFileNameWithoutExtension(fileName);
        var destinationFileName = GetUniqueDestinationFileName(voicesFolder, baseName);

        // CrispASR's qwen3-tts CustomVoice backend expects a 24 kHz mono reference WAV.
        // Always resample on import via ffmpeg so the saved file is in the right shape
        // regardless of what the user picked.
        try
        {
            var process = FfmpegGenerator.ConvertToMono24kHzWav(fileName, destinationFileName);
            if (!process.Start())
            {
                return false;
            }

            process.WaitForExit();
        }
        catch (Exception ex)
        {
            Se.LogError(ex, "Qwen3 TTS (CrispASR) voice import failed (ffmpeg conversion).");
            return false;
        }

        return File.Exists(destinationFileName);
    }
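The follow-up commit later in this conversation says ImportVoice also copies an adjacent .txt sidecar. A minimal sketch of that step, assuming the sidecar shares the audio file's base name; `VoiceSidecarSketch` and `TryCopySidecar` are hypothetical names and this is not the PR's actual implementation.

```csharp
using System.IO;

// Hypothetical helper, for illustration only.
public static class VoiceSidecarSketch
{
    // Copies "<source>.txt" next to the imported WAV as "<destination>.txt".
    // Returns false when the source audio has no sidecar; the import itself
    // can still succeed without one.
    public static bool TryCopySidecar(string sourceAudioPath, string destinationWavPath)
    {
        var sourceSidecar = Path.ChangeExtension(sourceAudioPath, ".txt");
        if (!File.Exists(sourceSidecar))
        {
            return false;
        }

        // Derive the name from the destination so it still matches when the
        // WAV was renamed to avoid a collision in the voices folder.
        var destinationSidecar = Path.ChangeExtension(destinationWavPath, ".txt");
        File.Copy(sourceSidecar, destinationSidecar, overwrite: true);
        return true;
    }
}
```

Deriving the sidecar name from the destination file keeps the WAV and its transcription paired even when the import picked a unique file name.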
Comment on lines +92 to +109

    // Talker GGUFs live in the engine's own models folder OR are auto-downloaded by
    // crispasr into ~/.cache/crispasr. We can only verify the engine-folder copy here;
    // a missing local talker isn't fatal (the engine falls back to --auto-download) so
    // surface it as a neutral "Auto-download on first use" rather than an error.
    var voiceDesignPath = Qwen3TtsCrispAsr.GetTalkerPath(Qwen3TtsCrispAsr.ModelKeyVoiceDesign);
    ApplyModelStatus(File.Exists(voiceDesignPath),
        label => VoiceDesignTalkerLabel = label,
        brush => VoiceDesignTalkerBrush = brush);

    var customVoicePath = Qwen3TtsCrispAsr.GetTalkerPath(Qwen3TtsCrispAsr.ModelKeyCustomVoice);
    ApplyModelStatus(File.Exists(customVoicePath),
        label => CustomVoiceTalkerLabel = label,
        brush => CustomVoiceTalkerBrush = brush);

    var codecPath = Qwen3TtsCrispAsr.GetCodecPath();
    ApplyModelStatus(File.Exists(codecPath),
        label => CodecLabel = label,
        brush => CodecBrush = brush);

Comment on lines +91 to +105

    // For CrispASR engines, append the installed runtime version (probed once via
    // `crispasr --version`) so users can tell at a glance which build they have.
    // Other engines stay as-is.
    private static string AppendVersion(ISpeechToTextEngine engine, string backendLabel)
    {
        if (engine is not ICrispAsrEngine crispAsr)
        {
            return backendLabel;
        }

        var version = CrispAsrVersion.TryGet(crispAsr.GetExecutable());
        return string.IsNullOrEmpty(version)
            ? backendLabel
            : $"{backendLabel}, v{version}";
    }
Six fixes plus a partial localization pass:

1. Migrate the persisted engine name: settings serialize the engine by Name, so the rename "Chatterbox TTS" -> "Chatterbox TTS (CrispASR)" would silently drop users' saved selection. LoadSettings rewrites the old name on load.
2. RefreshVoices on Qwen3TtsCrispAsr model change was fire-and-forget; any exception would surface later as an unobserved task exception. Wrap in Task.Run + try/catch and route the actual refresh through the UI thread dispatcher.
3. EnsureCrispAsrForChatterbox was hardcoded with "Chatterbox TTS" in every prompt, which is confusing when invoked from the new Qwen3 (CrispASR) settings dialog. Refactor to a shared EnsureCrispAsrAsync helper that takes an engine display name + optional capability check, with thin Chatterbox / Qwen3 wrappers. Qwen3 dialog now uses the new EnsureCrispAsrForQwen3 entry point.
6. ImportVoice on Qwen3TtsCrispAsr now also copies an adjacent .txt sidecar (transcription of the reference WAV), which CustomVoice cloning uses for best quality.
7. Settings dialog model status: "Auto-download on first use" is misleading when CrispASR itself isn't installed (nothing to download into). Show "CrispASR required" + grey for the talker / codec rows in that case.
8. `crispasr --version` probe blocked the UI thread (up to 5 s timeout) when the engine-settings dialog first opened. Move the probe into a Task.Run and patch the version into BackendLabel / EngineLabel via the dispatcher when it returns. Applied to both SpeechToTextEngineSettings and the two CrispASR TTS dialogs (Chatterbox, Qwen3).

Partial fix for #5: localize "Open containing folder" via existing Se.Language.General string. Title + remaining engine-specific status strings stay hardcoded English to match the existing pattern in the sibling Chatterbox / Qwen3TtsCpp settings dialogs; full localization across language resource files is a separate follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
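The engine-name migration from fix 1 amounts to a one-time rewrite of the persisted string. A minimal sketch, assuming the saved value is a plain string compared exactly; `EngineNameMigrationSketch` and `Migrate` are hypothetical names, and the real change lives in LoadSettings.

```csharp
// Hypothetical helper, for illustration only.
public static class EngineNameMigrationSketch
{
    private const string OldChatterboxName = "Chatterbox TTS";
    private const string NewChatterboxName = "Chatterbox TTS (CrispASR)";

    // Maps the pre-rename display name to the new one; any other value
    // (including null or an already-migrated name) passes through unchanged.
    public static string Migrate(string savedEngineName)
    {
        return savedEngineName == OldChatterboxName
            ? NewChatterboxName
            : savedEngineName;
    }
}
```

Because the mapping is idempotent, it can run on every settings load without a separate "already migrated" flag.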
Summary
Adds a second route to Qwen3 TTS, this time via the existing CrispASR runtime instead of qwen3-tts.cpp. Useful as an A/B comparison engine while the qwen3-tts.cpp EOS issue is being investigated upstream.

Also renames the existing Chatterbox TTS engine to Chatterbox TTS (CrispASR) for consistency, since both run through the same `crispasr` binary now.

Why
The qwen3-tts.cpp 0.6B model never emits EOS for short prompts (per the frame-by-frame logit dump from our earlier investigation: the EOS slot is deeply negative and trends more negative each frame). "Hello world" runs to `max_audio_tokens=4096` and yields ~5 min of audio in ~10 min wall time.

CrispASR's `qwen3-tts-1.7b-voicedesign` backend, on the same M4 hardware with the same cstr-uploaded 1.7B VoiceDesign GGUF, cleanly emits EOS (1 s of audio in 1.4 s wall time, per the test plan below).

That's a working baseline while the qwen3-tts.cpp EOS bug remains open.
How
New `Qwen3TtsCrispAsr` engine spawns `crispasr --server --backend qwen3-tts-1.7b-voicedesign|customvoice -m <talker> [--codec-model <codec> | --auto-download] --voice-dir <voices>` and POSTs to `/v1/audio/speech` (OpenAI-compatible).
- VoiceDesign: requires `instructions` (defaults to "a calm female voice" if the user hasn't set one). Reuses `Qwen3TtsCppInstruction` so the same description applies to both Qwen3 engines.
- CustomVoice: takes a reference WAV from the voices folder; an adjacent `.txt` sidecar (matching filename) provides the reference transcription that CrispASR's qwen3-tts-customvoice expects.
- Voice import resamples to 24 kHz mono via ffmpeg (same convention as Chatterbox).

Engine folder layout (kept separate per user direction):
- TextToSpeech/Qwen3TtsCrispAsr/models/: talker GGUF + 12Hz codec
- TextToSpeech/Qwen3TtsCrispAsr/voices/: reference WAVs (+ .txt sidecar)
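The server invocation described in this section can be sketched as an argument builder. The flags are the ones quoted in this PR (`--server`, `--backend`, `-m`, `--codec-model`, `--auto-download`, `--voice-dir`, and `-m auto` from the manual-testing commit). How they combine when only one of the two model files is missing is an assumption, and `CrispAsrServerArgsSketch` is a hypothetical name.

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, for illustration only.
public static class CrispAsrServerArgsSketch
{
    public static List<string> Build(string backend, string talkerPath, string codecPath, string voicesDir)
    {
        var talkerPresent = File.Exists(talkerPath);
        var codecPresent = File.Exists(codecPath);

        var args = new List<string> { "--server", "--backend", backend };

        // "-m auto" lets crispasr resolve the model from the backend name
        // when no local talker GGUF is staged.
        args.Add("-m");
        args.Add(talkerPresent ? talkerPath : "auto");

        if (codecPresent)
        {
            args.Add("--codec-model");
            args.Add(codecPath);
        }

        // Assumption: one flag covers whichever file is missing locally.
        if (!talkerPresent || !codecPresent)
        {
            args.Add("--auto-download");
        }

        args.Add("--voice-dir");
        args.Add(voicesDir);
        return args;
    }
}
```

Returning a list instead of one concatenated string fits `ProcessStartInfo.ArgumentList`, which avoids manual quoting of paths that contain spaces.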
Known limitations / follow-ups
- Missing model files are fetched via `crispasr --auto-download` into `~/.cache/crispasr/` on first run; the first-run timeout is bumped to 30 min for that case. A proper SE-side download service is a follow-up.

Test plan
- `dotnet build src/ui/UI.csproj` clean (0 warnings, 0 errors)
- POST to `/v1/audio/speech` with `instructions: "a calm female voice"`: HTTP 200, valid 24 kHz mono WAV, 1 s audio in 1.4 s wall time.

🤖 Generated with Claude Code