Add "Qwen3 TTS (CrispASR)" engine + rename Chatterbox TTS by niksedk · Pull Request #11097 · SubtitleEdit/subtitleedit

niksedk · 2026-05-21T20:21:48Z

Summary

Adds a second route to Qwen3 TTS, this time via the existing CrispASR runtime instead of qwen3-tts.cpp. Useful as an A/B comparison engine while the qwen3-tts.cpp EOS issue is being investigated upstream.

Also renames the existing Chatterbox TTS engine to Chatterbox TTS (CrispASR) for consistency — both run through the same crispasr binary now.

Why

The qwen3-tts.cpp 0.6B model never emits EOS for short prompts (per the frame-by-frame logit dump from our earlier investigation — EOS slot is deeply negative and trends more negative each frame). "Hello world" runs to max_audio_tokens=4096 and yields ~5 min of audio in ~10 min wall time.

CrispASR's qwen3-tts-1.7b-voicedesign backend, on the same M4 hardware with the same cstr-uploaded 1.7B VoiceDesign GGUF, cleanly emits EOS:

qwen3_tts: produced 12 frames × 16 codebooks = 192 codes
crispasr-server: synthesized 1.0s audio in 1.44s (RTF=1.50) voice='<startup>' speed=1.00
HTTP 200 body=46124 time=1.444497s

That's a working baseline while the qwen3-tts.cpp EOS bug remains open.
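
The numbers in the log above can be cross-checked with a little arithmetic. This is only a sketch: it assumes the 46124-byte response is a WAV with a canonical 44-byte header and 16-bit mono PCM at 24 kHz, and that one `max_audio_tokens` unit in qwen3-tts.cpp covers the same number of samples as one CrispASR frame. Neither assumption is stated in the PR.

```csharp
using System;

// Values copied from the crispasr-server log above.
const int bodyBytes = 46124;
const int frames = 12;
const double wallSeconds = 1.444497;

// Assumptions (not from the PR): 44-byte WAV header, 16-bit mono PCM, 24 kHz.
const int headerBytes = 44;
const int bytesPerSample = 2;
const int sampleRate = 24000;

int samples = (bodyBytes - headerBytes) / bytesPerSample;
double audioSeconds = (double)samples / sampleRate;
double rtf = wallSeconds / audioSeconds;

// Samples per codec frame, then the audio length if 4096 frames were produced.
int samplesPerFrame = samples / frames;
double capSeconds = 4096.0 * samplesPerFrame / sampleRate;

Console.WriteLine(FormattableString.Invariant(
    $"audio={audioSeconds:F2}s rtf={rtf:F2} cap={capSeconds / 60:F1}min"));
// prints: audio=0.96s rtf=1.50 cap=5.5min
```

Under those assumptions the logged "1.0s" is a rounded 0.96 s, which is consistent with the reported RTF of 1.50, and a run to the 4096-token cap lands in the same range as the "~5 min" observed with qwen3-tts.cpp.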

How

New Qwen3TtsCrispAsr engine spawns crispasr --server --backend qwen3-tts-1.7b-voicedesign|customvoice -m <talker> [--codec-model <codec> | --auto-download] --voice-dir <voices> and POSTs to /v1/audio/speech (OpenAI-compatible).
VoiceDesign: requires instructions (defaults to "a calm female voice" if the user hasn't set one). Reuses Qwen3TtsCppInstruction so the same description applies to both Qwen3 engines.
CustomVoice: takes a reference WAV from the voices folder; an adjacent .txt sidecar (matching filename) provides the reference transcription that CrispASR's qwen3-tts-customvoice expects.
Voice import resamples to 24 kHz mono via ffmpeg (same convention as Chatterbox).
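
The launch arguments and request body described above can be sketched roughly as follows. This is not the engine's actual code: `BuildServerArgs` and `BuildSpeechRequest` are hypothetical helper names, the flags are the ones quoted in this PR, and the JSON field names `input`, `voice` and `instructions` are assumed from the OpenAI speech API shape rather than confirmed against CrispASR.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper: assembles the crispasr command line quoted in the PR text.
static List<string> BuildServerArgs(string backend, string talkerPath, string? codecPath, string voicesDir)
{
    var serverArgs = new List<string> { "--server", "--backend", backend, "-m", talkerPath };
    if (!string.IsNullOrEmpty(codecPath))
    {
        serverArgs.Add("--codec-model");
        serverArgs.Add(codecPath);
    }
    else
    {
        // No local codec staged: let crispasr fetch it itself.
        serverArgs.Add("--auto-download");
    }
    serverArgs.Add("--voice-dir");
    serverArgs.Add(voicesDir);
    return serverArgs;
}

// Hypothetical helper: builds the JSON body for POST /v1/audio/speech.
// Field names are an assumption based on the OpenAI-compatible endpoint shape.
static string BuildSpeechRequest(string text, string? voice, string? instructions, bool voiceDesign)
{
    var body = new Dictionary<string, object?> { ["input"] = text };
    if (voiceDesign)
    {
        // VoiceDesign requires instructions; fall back to the default named in the PR.
        body["instructions"] = string.IsNullOrWhiteSpace(instructions) ? "a calm female voice" : instructions;
    }
    else if (!string.IsNullOrEmpty(voice))
    {
        body["voice"] = voice;
    }
    return JsonSerializer.Serialize(body);
}

var cliArgs = BuildServerArgs("qwen3-tts-1.7b-voicedesign", "models/talker.gguf", null, "voices");
var json = BuildSpeechRequest("Hello world", null, null, voiceDesign: true);
Console.WriteLine(string.Join(" ", cliArgs));
Console.WriteLine(json);
```

Keeping argument construction separate from process start makes the two code paths (staged codec versus `--auto-download`) easy to unit-test without launching the server.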

Engine folder layout (kept separate per user direction):

TextToSpeech/Qwen3TtsCrispAsr/models/  talker GGUF + 12Hz codec
TextToSpeech/Qwen3TtsCrispAsr/voices/  reference WAVs (+ .txt sidecars)

Known limitations / follow-ups

No download UX yet. The talker GGUF must be manually placed (or symlinked from the existing qwen3-tts.cpp install — the 1.7B VoiceDesign file is bit-for-bit compatible). The CrispASR-style 12Hz codec (~986 MB, different file from qwen3-tts.cpp's tokenizer) gets fetched by crispasr --auto-download into ~/.cache/crispasr/ on first run — first-run timeout bumped to 30 min for that case. A proper SE-side download service is a follow-up.
No engine-settings dialog yet. Model selection (VoiceDesign vs CustomVoice) currently only via direct setting edit / dropdown in the main TTS UI. The Qwen3 instruction string is shared with the existing qwen3-tts.cpp engine.
0.6B not supported. The koboldcpp 0.6B GGUF uses the old tensor names and won't load in CrispASR (mirror of the qwen3-tts.cpp PR #14 issue, opposite direction). Only the cstr 1.7B variants work.

Test plan

dotnet build src/ui/UI.csproj clean (0 warnings, 0 errors)
End-to-end smoke: symlink existing 1.7B VoiceDesign GGUF, let crispasr auto-download the 12Hz codec, POST /v1/audio/speech with instructions: "a calm female voice" — HTTP 200, valid 24 kHz mono WAV, 1 s audio in 1.4 s wall time.
Manual UI test on Mac/Windows: select "Qwen3 TTS (CrispASR)" in the engine combo, run the voice test
Manual UI test: verify "Chatterbox TTS (CrispASR)" rename appears in the engine combo and engine settings

🤖 Generated with Claude Code

Makes it explicit that Chatterbox runs through the CrispASR runtime, in line with the upcoming "Qwen3 TTS (CrispASR)" engine that uses the same binary. Cosmetic only - the engine class, settings, voices, and model files are unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Spawns the existing crispasr binary in --server mode with one of:
--backend qwen3-tts-1.7b-voicedesign (default; needs instructions)
--backend qwen3-tts-1.7b-customvoice (voice cloning, takes a WAV + ref-text)
and POSTs to its OpenAI-compatible /v1/audio/speech endpoint.
Why both engines: qwen3-tts.cpp's 0.6B model never emits EOS for short prompts (frame-by-frame logit dump shows the EOS slot is deeply negative and trending more negative), so a "Hello world" runs to the max_audio_tokens cap. CrispASR's qwen3-tts-1.7b-voicedesign backend cleanly emits EOS on the same prompt (1 s of audio in 1.4 s wall time, RTF=1.5 on M4) - useful as a working baseline while the qwen3-tts.cpp EOS work continues upstream.
Engine layout (separate from qwen3-tts.cpp's folder per user direction):
TextToSpeech/Qwen3TtsCrispAsr/models/  talker GGUF + 12Hz codec
TextToSpeech/Qwen3TtsCrispAsr/voices/  reference WAVs (+ .txt sidecar)
The CrispASR-style 12Hz codec (~986 MB) is a different file from qwen3-tts.cpp's tokenizer; if not staged locally, the engine passes --auto-download to crispasr so it fetches into ~/.cache/crispasr/ on first run (first-run timeout bumped to 30 min for that case). The talker GGUF must be placed manually until the auto-downloader lands - the cstr-uploaded 1.7B VoiceDesign GGUF that qwen3-tts.cpp already downloads is bit-for-bit compatible and can be symlinked.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The new engine needs the same free-text instruction box the qwen3-tts.cpp engine uses for VoiceDesign. Without it there's no UI to set the voice description and the engine silently falls back to "a calm female voice" on every request.
- IsVoiceDesignModel helper on Qwen3TtsCrispAsr matches the one on Qwen3TtsCpp.
- RefreshInstructionVisibility shows the instruction text box for either Qwen3 engine when VoiceDesign is the selected model.
- UpdateVoiceLock disables the voice combo for both engines on VoiceDesign, since voice cloning has no effect there.
- GetInstructionForEngine reads Qwen3TtsCppInstruction for both engines, so the same description is shared when A/B testing.
- Save path persists Qwen3TtsCrispAsrModel + the shared instruction.
- Engine-change path restores SelectedModel from Qwen3TtsCrispAsrModel.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The engine-combo dot defaulted to None for Qwen3TtsCrispAsr because GetTtsEngineDotStatus had no case for it. Mirror the Chatterbox approach: add GetEngineUpdateStatus on the engine class (reads the CrispASR runtime's sidecar — that's what the engine sits on top of) and route the new case through StatusDots.From in TextToSpeechWindow. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three issues from manual testing:
1. CustomVoice fails up-front with "talker GGUF not found": the 1.7B CustomVoice talker isn't bit-for-bit compatible with the VoiceDesign one, so symlinking from qwen3-tts.cpp doesn't help and no SE-side downloader exists yet. Hand the talker off to crispasr's own --auto-download when missing locally (same approach we already use for the 12Hz codec). Pass `-m auto` so crispasr resolves the right model from the backend name. First-run timeout already covers the longer download window.
2. CustomVoice has no baked default voice: it's pure voice cloning and rejects requests without a `voice` field. Drop the "Default" entry from the voice combo when CustomVoice is selected, and add a clear up-front error in Speak if the user picks an empty voice anyway (mentions 24 kHz mono WAV + .txt sidecar requirements).
3. The voice list now depends on the selected model, so trigger a RefreshVoices when the user toggles between VoiceDesign and CustomVoice. Also persist the new model choice immediately so GetVoices sees it on the next call.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The new "Qwen3 TTS (CrispASR)" engine had ImportVoice implemented but VoiceSettingsViewModel.IsImportVoiceVisible didn't include it, so the button stayed hidden and Import Voice was effectively unavailable to users.
Also seed our voices folder from the existing qwen3-tts.cpp install (TextToSpeech/Qwen3TtsCpp/voices/) on first GetVoices when our folder is empty. The reference WAVs and .txt sidecars are bit-for-bit usable by both engines, so users who already downloaded the qwen3-tts.cpp voice pack get them for free here. Users who never installed qwen3-tts.cpp start empty and can use Import Voice.
A dedicated CrispASR-side voice-pack downloader (fetching the support-files voices.zip into this engine's folder) is still a follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
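
The seeding step described in this commit could look roughly like the sketch below. `SeedVoices` is a hypothetical name and the real engine code may differ; the only behavior taken from the commit message is "copy WAVs and their .txt sidecars from the other engine's folder when ours is empty".

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper: copies reference WAVs (and matching .txt sidecars) from
// sourceDir into targetDir, but only when targetDir has no entries yet.
// Returns the number of WAV files copied.
static int SeedVoices(string sourceDir, string targetDir)
{
    if (!Directory.Exists(sourceDir))
    {
        return 0; // the other engine was never installed
    }

    Directory.CreateDirectory(targetDir);
    if (Directory.EnumerateFileSystemEntries(targetDir).Any())
    {
        return 0; // user already has voices; never overwrite
    }

    var copied = 0;
    foreach (var wav in Directory.EnumerateFiles(sourceDir, "*.wav"))
    {
        File.Copy(wav, Path.Combine(targetDir, Path.GetFileName(wav)));
        copied++;

        var sidecar = Path.ChangeExtension(wav, ".txt");
        if (File.Exists(sidecar))
        {
            File.Copy(sidecar, Path.Combine(targetDir, Path.GetFileName(sidecar)));
        }
    }
    return copied;
}

// Usage with throwaway folders.
var root = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N"));
var source = Path.Combine(root, "Qwen3TtsCpp", "voices");
var target = Path.Combine(root, "Qwen3TtsCrispAsr", "voices");
Directory.CreateDirectory(source);
File.WriteAllText(Path.Combine(source, "anna.wav"), "fake wav");
File.WriteAllText(Path.Combine(source, "anna.txt"), "reference transcript");

var firstRun = SeedVoices(source, target);
var secondRun = SeedVoices(source, target);
Console.WriteLine($"first={firstRun} second={secondRun}");
```

The second call copies nothing because the target folder is no longer empty, which is what keeps the seeding from overwriting voices the user has imported since.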

Probe `crispasr --version` once per binary mtime and cache the parsed semver. Surfaces the installed runtime version next to the existing backend / status labels in two places:
- Speech-to-text engine settings: BackendLabel now reads e.g. "macOS universal, v0.6.9" instead of just "macOS universal".
- Chatterbox TTS settings: EngineLabel reads e.g. "CrispASR v0.6.9 (Chatterbox-capable)" instead of omitting the version.
Both call into a shared CrispAsrVersion helper (new file under SpeechToText/Engines). The probe accepts both the new structured --version output (v0.6.9+, "version : 0.6.9") and the legacy single-line banner ("crispasr 0.6.7 (git ..., Release) [backends: ...]").
The new Qwen3 TTS (CrispASR) engine doesn't have a settings dialog yet, so it doesn't pick up the version display in this commit - that's a follow-up once that dialog lands.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
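
Parsing the two output shapes quoted in this commit might be sketched as below. `ParseCrispAsrVersion` is a hypothetical name, not the actual CrispAsrVersion helper, and the regular expressions only assume the two sample strings given above; the real --version output may carry extra lines or different spacing.

```csharp
#nullable enable
using System;
using System.Text.RegularExpressions;

// Hypothetical parser: returns "major.minor.patch" or null if nothing matches.
static string? ParseCrispAsrVersion(string? output)
{
    if (string.IsNullOrWhiteSpace(output))
    {
        return null;
    }

    const RegexOptions options = RegexOptions.Multiline | RegexOptions.IgnoreCase;

    // Structured form, e.g. a line reading "version : 0.6.9".
    var match = Regex.Match(output, @"^\s*version\s*:\s*v?(\d+\.\d+\.\d+)", options);
    if (match.Success)
    {
        return match.Groups[1].Value;
    }

    // Legacy banner, e.g. "crispasr 0.6.7 (git ..., Release) [backends: ...]".
    match = Regex.Match(output, @"^\s*crispasr\s+v?(\d+\.\d+\.\d+)", options);
    return match.Success ? match.Groups[1].Value : null;
}

var structured = ParseCrispAsrVersion("version : 0.6.9");
var legacy = ParseCrispAsrVersion("crispasr 0.6.7 (git ..., Release) [backends: ...]");
var unknown = ParseCrispAsrVersion("no version here");
Console.WriteLine($"{structured} / {legacy} / {unknown ?? "(null)"}");
// prints: 0.6.9 / 0.6.7 / (null)
```

Trying the structured form first means a newer binary that prints both a banner and a version line still resolves to the explicit version field.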

Mirrors the Chatterbox TTS settings dialog: shows CrispASR runtime status + version, the install state of the VoiceDesign / CustomVoice talker GGUFs and the 12 Hz codec, count of imported voices, and the install folder. Status dots match the rest of the SE TTS UI - green when present locally, grey "Auto-download on first use" otherwise (missing local files aren't fatal because the engine passes --auto-download to crispasr in that case).
Buttons:
- Re-download CrispASR (delegates to EnsureCrispAsrForChatterbox, same flow Chatterbox already uses)
- Open models folder
- Open voices folder
- Close
Also sets IsEngineSettingsVisible = true when this engine is selected so the gear icon appears in the main TTS UI, and registers the new ViewModel in DI.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Adds a new Qwen3 TTS (CrispASR) engine that routes synthesis through the existing CrispASR runtime (VoiceDesign + CustomVoice), and updates UI/settings plumbing to support engine selection, engine settings, and CrispASR version display. It also renames the existing Chatterbox engine to make the CrispASR dependency explicit.

Changes:

Introduce Qwen3TtsCrispAsr engine with local CrispASR server lifecycle + OpenAI-compatible /v1/audio/speech calls.
Add a dedicated settings window/viewmodel for Qwen3 (CrispASR), plus a new persisted model key setting.
Show CrispASR runtime version in engine settings UIs and rename Chatterbox engine display name.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 8 comments.

File	Description
src/ui/Logic/Config/SeVideoTextToSpeech.cs	Adds persisted setting for Qwen3 (CrispASR) model selection + default.
src/ui/Features/Video/TextToSpeech/VoiceSettings/VoiceSettingsViewModel.cs	Enables voice import UI for the new engine.
src/ui/Features/Video/TextToSpeech/TextToSpeechWindow.cs	Adds status-dot support for Qwen3 (CrispASR) engine.
src/ui/Features/Video/TextToSpeech/TextToSpeechViewModel.cs	Registers the new engine, persists its model selection, and wires settings dialog + model-dependent voice list behavior.
src/ui/Features/Video/TextToSpeech/Qwen3TtsCrispAsrSettings/Qwen3TtsCrispAsrSettingsWindow.cs	New settings dialog UI for Qwen3 (CrispASR).
src/ui/Features/Video/TextToSpeech/Qwen3TtsCrispAsrSettings/Qwen3TtsCrispAsrSettingsViewModel.cs	New settings VM: CrispASR runtime/model/codec/voices status + actions.
src/ui/Features/Video/TextToSpeech/Engines/Qwen3TtsCrispAsr.cs	New engine implementation (server start/stop, voice handling, import).
src/ui/Features/Video/TextToSpeech/Engines/ChatterboxTtsCpp.cs	Renames engine display name to “Chatterbox TTS (CrispASR)”.
src/ui/Features/Video/TextToSpeech/ChatterboxTtsSettings/ChatterboxTtsSettingsViewModel.cs	Adds CrispASR version suffix to improve runtime visibility.
src/ui/Features/Video/SpeechToText/EngineSettings/SpeechToTextEngineSettingsViewModel.cs	Appends CrispASR version to backend labels for CrispASR engines.
src/ui/Features/Video/SpeechToText/Engines/CrispAsrVersion.cs	New cached helper to probe `crispasr --version` and parse version output.
src/ui/DependencyInjectionExtensions.cs	Registers the new Qwen3 (CrispASR) settings ViewModel for DI.
change-log.txt	Documents the new engine and Chatterbox rename.

Comments suppressed due to low confidence (1)

src/ui/Features/Video/TextToSpeech/TextToSpeechViewModel.cs:179

Qwen3TtsCrispAsr is added to the available engines, but IsEngineInstalled(...) in this view model does not have a branch to guide the user through installing CrispASR (unlike ChatterboxTtsCpp). If CrispASR is missing, the flow will just return false with no prompt, making the new engine effectively unusable on first use. Add an install/update prompt (likely reusing the CrispASR download flow) when SelectedEngine is Qwen3TtsCrispAsr.

            new MistralSpeech(ttsDownloadService),
            new Murf(ttsDownloadService),
            new GoogleSpeech(ttsDownloadService),
            new Qwen3TtsCpp(),
            new Qwen3TtsCrispAsr(),
            new KokoroTtsCpp(),
            new ChatterboxTtsCpp(),
            new OmniVoiceTtsCpp(),
        ];

 public class ChatterboxTtsCpp : ITtsEngine
 {
-    public string Name => "Chatterbox TTS";
+    public string Name => "Chatterbox TTS (CrispASR)";


+        if (SelectedEngine is Qwen3TtsCrispAsr engine)
+        {
+            Se.Settings.Video.TextToSpeech.Qwen3TtsCrispAsrModel = value ?? Qwen3TtsCrispAsr.DefaultModelKey;
+            _ = RefreshVoices(engine);


+        // CrispASR is shared with Speech-to-text; piggy-back on the same redownload flow
+        // Chatterbox uses, then refresh status here. The crispasr binary itself does not
+        // care which TTS engine triggered the download.
+        await TtsVoiceInstaller.EnsureCrispAsrForChatterbox(Window, _windowService, forceRedownload: true);
+        Refresh();
+    }


+        UiUtil.InitializeWindow(this, GetType().Name);
+        Title = "Qwen3 TTS (CrispASR) settings";
+        SizeToContent = SizeToContent.WidthAndHeight;
+        CanResize = false;
+        MinWidth = 580;


+        var openModelsFolder = UiUtil.MakeButton("Open models folder", vm.OpenModelsFolderCommand).WithIconLeft(IconNames.FolderOpen);
+        var openVoicesFolder = UiUtil.MakeButton("Open voices folder", vm.OpenVoicesFolderCommand).WithIconLeft(IconNames.FolderOpen);


+    public bool ImportVoice(string fileName)
+    {
+        if (string.IsNullOrEmpty(fileName) || !File.Exists(fileName))
+        {
+            return false;
+        }
+
+        var voicesFolder = GetSetVoicesFolder();
+        var baseName = Path.GetFileNameWithoutExtension(fileName);
+        var destinationFileName = GetUniqueDestinationFileName(voicesFolder, baseName);
+
+        // CrispASR's qwen3-tts CustomVoice backend expects a 24 kHz mono reference WAV.
+        // Always resample on import via ffmpeg so the saved file is in the right shape
+        // regardless of what the user picked.
+        try
+        {
+            var process = FfmpegGenerator.ConvertToMono24kHzWav(fileName, destinationFileName);
+            if (!process.Start())
+            {
+                return false;
+            }
+
+            process.WaitForExit();
+        }
+        catch (Exception ex)
+        {
+            Se.LogError(ex, "Qwen3 TTS (CrispASR) voice import failed (ffmpeg conversion).");
+            return false;
+        }
+
+        return File.Exists(destinationFileName);
+    }


+        // Talker GGUFs live in the engine's own models folder OR are auto-downloaded by
+        // crispasr into ~/.cache/crispasr. We can only verify the engine-folder copy here;
+        // a missing local talker isn't fatal (the engine falls back to --auto-download) so
+        // surface it as a neutral "Auto-download on first use" rather than an error.
+        var voiceDesignPath = Qwen3TtsCrispAsr.GetTalkerPath(Qwen3TtsCrispAsr.ModelKeyVoiceDesign);
+        ApplyModelStatus(File.Exists(voiceDesignPath),
+            label => VoiceDesignTalkerLabel = label,
+            brush => VoiceDesignTalkerBrush = brush);
+
+        var customVoicePath = Qwen3TtsCrispAsr.GetTalkerPath(Qwen3TtsCrispAsr.ModelKeyCustomVoice);
+        ApplyModelStatus(File.Exists(customVoicePath),
+            label => CustomVoiceTalkerLabel = label,
+            brush => CustomVoiceTalkerBrush = brush);
+
+        var codecPath = Qwen3TtsCrispAsr.GetCodecPath();
+        ApplyModelStatus(File.Exists(codecPath),
+            label => CodecLabel = label,
+            brush => CodecBrush = brush);


+    // For CrispASR engines, append the installed runtime version (probed once via
+    // `crispasr --version`) so users can tell at a glance which build they have.
+    // Other engines stay as-is.
+    private static string AppendVersion(ISpeechToTextEngine engine, string backendLabel)
+    {
+        if (engine is not ICrispAsrEngine crispAsr)
+        {
+            return backendLabel;
+        }
+
+        var version = CrispAsrVersion.TryGet(crispAsr.GetExecutable());
+        return string.IsNullOrEmpty(version)
+            ? backendLabel
+            : $"{backendLabel}, v{version}";
+    }


Six fixes plus a partial localization pass:
1. Migrate the persisted engine name: settings serialize the engine by Name, so the rename "Chatterbox TTS" -> "Chatterbox TTS (CrispASR)" would silently drop users' saved selection. LoadSettings rewrites the old name on load.
2. RefreshVoices on Qwen3TtsCrispAsr model change was fire-and-forget; any exception would surface later as an unobserved task exception. Wrap in Task.Run + try/catch and route the actual refresh through the UI thread dispatcher.
3. EnsureCrispAsrForChatterbox was hardcoded with "Chatterbox TTS" in every prompt, which is confusing when invoked from the new Qwen3 (CrispASR) settings dialog. Refactor to a shared EnsureCrispAsrAsync helper that takes an engine display name + optional capability check, with thin Chatterbox / Qwen3 wrappers. Qwen3 dialog now uses the new EnsureCrispAsrForQwen3 entry point.
6. ImportVoice on Qwen3TtsCrispAsr now also copies an adjacent .txt sidecar (transcription of the reference WAV), which CustomVoice cloning uses for best quality.
7. Settings dialog model status: "Auto-download on first use" is misleading when CrispASR itself isn't installed (nothing to download into). Show "CrispASR required" + grey for the talker / codec rows in that case.
8. `crispasr --version` probe blocked the UI thread (up to 5 s timeout) when the engine-settings dialog first opened. Move the probe into a Task.Run and patch the version into BackendLabel / EngineLabel via the dispatcher when it returns. Applied to both SpeechToTextEngineSettings and the two CrispASR TTS dialogs (Chatterbox, Qwen3).
Partial fix for #5: localize "Open containing folder" via existing Se.Language.General string. Title + remaining engine-specific status strings stay hardcoded English to match the existing pattern in the sibling Chatterbox / Qwen3TtsCpp settings dialogs; full localization across language resource files is a separate follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
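
The name migration in fix 1 comes down to a lookup applied when settings are loaded. The sketch below is illustrative only: `MigrateEngineName` is a hypothetical helper, and the actual LoadSettings change is not shown in this PR page.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical helper: maps engine names persisted by older builds to their
// current display names. Unknown or already-current names pass through unchanged.
static string MigrateEngineName(string? savedName)
{
    var renames = new Dictionary<string, string>(StringComparer.Ordinal)
    {
        ["Chatterbox TTS"] = "Chatterbox TTS (CrispASR)",
    };

    if (string.IsNullOrEmpty(savedName))
    {
        return string.Empty;
    }

    return renames.TryGetValue(savedName, out var current) ? current : savedName;
}

Console.WriteLine(MigrateEngineName("Chatterbox TTS"));            // Chatterbox TTS (CrispASR)
Console.WriteLine(MigrateEngineName("Chatterbox TTS (CrispASR)")); // Chatterbox TTS (CrispASR)
Console.WriteLine(MigrateEngineName("Qwen3 TTS (CrispASR)"));      // Qwen3 TTS (CrispASR)
```

Because the lookup is idempotent, running it on every load is harmless, and a table leaves room for future renames without touching the load path again.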

niksedk and others added 8 commits May 21, 2026 22:11

niksedk requested a review from Copilot May 22, 2026 06:16

Copilot started reviewing on behalf of niksedk May 22, 2026 06:16

Copilot AI reviewed May 22, 2026

niksedk merged commit 70cfef2 into main May 22, 2026
1 of 3 checks passed

niksedk deleted the qwen3-tts-via-crispasr branch May 22, 2026 07:15

niksedk merged 9 commits into main from qwen3-tts-via-crispasr
