Add "Qwen3 TTS (CrispASR)" engine + rename Chatterbox TTS #11097

Merged
niksedk merged 9 commits into main from qwen3-tts-via-crispasr on May 22, 2026

Conversation

@niksedk
Member

@niksedk niksedk commented May 21, 2026

Summary

Adds a second route to Qwen3 TTS, this time via the existing CrispASR runtime instead of qwen3-tts.cpp. Useful as an A/B comparison engine while the qwen3-tts.cpp EOS issue is being investigated upstream.

Also renames the existing Chatterbox TTS engine to Chatterbox TTS (CrispASR) for consistency — both run through the same crispasr binary now.

Why

The qwen3-tts.cpp 0.6B model never emits EOS for short prompts (per the frame-by-frame logit dump from our earlier investigation — EOS slot is deeply negative and trends more negative each frame). "Hello world" runs to max_audio_tokens=4096 and yields ~5 min of audio in ~10 min wall time.

CrispASR's qwen3-tts-1.7b-voicedesign backend, on the same M4 hardware with the same cstr-uploaded 1.7B VoiceDesign GGUF, cleanly emits EOS:

qwen3_tts: produced 12 frames × 16 codebooks = 192 codes
crispasr-server: synthesized 1.0s audio in 1.44s (RTF=1.50) voice='<startup>' speed=1.00
HTTP 200 body=46124 time=1.444497s

That's a working baseline while the qwen3-tts.cpp EOS bug remains open.
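
The durations quoted above can be sanity-checked with a little arithmetic. This is only a sketch: it assumes the codec produces 12 frames per second (inferred from the "12Hz codec" name and the 12-frame, 1.0 s log line, not confirmed anywhere in this PR) and that one audio token corresponds to one frame. `FramesToSeconds` is a hypothetical helper.

```csharp
using System;

// Assumption: 12 codec frames per second, inferred from the "12Hz codec" name
// and the log line above (12 frames -> 1.0 s of audio).
const double AssumedFramesPerSecond = 12.0;

// Hypothetical helper: audio duration for a given number of codec frames.
static double FramesToSeconds(int frames, double framesPerSecond) => frames / framesPerSecond;

// CrispASR run from the log: 12 frames x 16 codebooks.
Console.WriteLine($"codes: {12 * 16}");
Console.WriteLine($"EOS after 12 frames: {FramesToSeconds(12, AssumedFramesPerSecond):F1} s");

// qwen3-tts.cpp run that never emits EOS and stops at max_audio_tokens=4096
// (assuming one audio token per frame).
var capped = FramesToSeconds(4096, AssumedFramesPerSecond);
Console.WriteLine($"cap at 4096 frames: {capped:F0} s (~{capped / 60:F1} min)");
```

Under those assumptions the 4096-token cap works out to several minutes of audio, which is in line with the "~5 min" observed for "Hello world".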

How

  • New Qwen3TtsCrispAsr engine spawns crispasr --server --backend qwen3-tts-1.7b-voicedesign|customvoice -m <talker> [--codec-model <codec> | --auto-download] --voice-dir <voices> and POSTs to /v1/audio/speech (OpenAI-compatible).

  • VoiceDesign: requires instructions (defaults to "a calm female voice" if the user hasn't set one). Reuses Qwen3TtsCppInstruction so the same description applies to both Qwen3 engines.

  • CustomVoice: takes a reference WAV from the voices folder; an adjacent .txt sidecar (matching filename) provides the reference transcription that CrispASR's qwen3-tts-customvoice expects.

  • Voice import resamples to 24 kHz mono via ffmpeg (same convention as Chatterbox).

  • Engine folder layout (kept separate per user direction):

    TextToSpeech/Qwen3TtsCrispAsr/models/  talker GGUF + 12Hz codec
    TextToSpeech/Qwen3TtsCrispAsr/voices/  reference WAVs (+ .txt sidecars)
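
As a rough illustration of the command line described in the first bullet, the sketch below assembles the argument list from the flags quoted there. `BuildServerArgs` and the file names are hypothetical and not taken from the PR's code; only the flags and backend names come from this description.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper (not the PR's code): assembles the crispasr command line
// from the flags quoted in the description above.
static List<string> BuildServerArgs(string backend, string talkerPath, string codecPath, string voicesDir)
{
    var arguments = new List<string> { "--server", "--backend", backend, "-m", talkerPath };

    if (File.Exists(codecPath))
    {
        // Codec staged locally: point crispasr at it.
        arguments.Add("--codec-model");
        arguments.Add(codecPath);
    }
    else
    {
        // Otherwise let crispasr fetch the codec on first run.
        arguments.Add("--auto-download");
    }

    arguments.Add("--voice-dir");
    arguments.Add(voicesDir);
    return arguments;
}

// Illustrative paths only; the GGUF file names are placeholders.
var root = Path.Combine("TextToSpeech", "Qwen3TtsCrispAsr");
var serverArgs = BuildServerArgs(
    "qwen3-tts-1.7b-voicedesign",
    Path.Combine(root, "models", "talker.gguf"),
    Path.Combine(root, "models", "codec.gguf"),
    Path.Combine(root, "voices"));

Console.WriteLine("crispasr " + string.Join(" ", serverArgs));
```

Keeping the arguments as a list rather than one concatenated string avoids quoting problems when paths contain spaces, assuming the process is started with an argument list.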
    

Known limitations / follow-ups

  • No download UX yet. The talker GGUF must be manually placed (or symlinked from the existing qwen3-tts.cpp install — the 1.7B VoiceDesign file is bit-for-bit compatible). The CrispASR-style 12Hz codec (~986 MB, different file from qwen3-tts.cpp's tokenizer) gets fetched by crispasr --auto-download into ~/.cache/crispasr/ on first run — first-run timeout bumped to 30 min for that case. A proper SE-side download service is a follow-up.
  • No engine-settings dialog yet. Model selection (VoiceDesign vs CustomVoice) currently only via direct setting edit / dropdown in the main TTS UI. The Qwen3 instruction string is shared with the existing qwen3-tts.cpp engine.
  • 0.6B not supported. The koboldcpp 0.6B GGUF uses the old tensor names and won't load in CrispASR (a mirror of the qwen3-tts.cpp PR #14 issue, in the opposite direction). Only the cstr 1.7B variants work.

Test plan

  • dotnet build src/ui/UI.csproj clean (0 warnings, 0 errors)
  • End-to-end smoke: symlink existing 1.7B VoiceDesign GGUF, let crispasr auto-download the 12Hz codec, POST /v1/audio/speech with instructions: "a calm female voice" — HTTP 200, valid 24 kHz mono WAV, 1 s audio in 1.4 s wall time.
  • Manual UI test on Mac/Windows: select "Qwen3 TTS (CrispASR)" in the engine combo, run the voice test
  • Manual UI test: verify "Chatterbox TTS (CrispASR)" rename appears in the engine combo and engine settings
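
For anyone repeating the end-to-end smoke test by hand, here is a minimal sketch of how the request could be built. The field names follow the OpenAI speech API (`input`, `voice`, `instructions`); whether crispasr requires further fields such as `model` is not stated in this PR, and the base address below is a placeholder. Both helpers are hypothetical.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text;
using System.Text.Json;

// Hypothetical helper: builds the JSON body for POST /v1/audio/speech.
// Optional fields are only emitted when they have a value.
static string BuildSpeechRequestJson(string text, string? voice, string? instructions)
{
    var body = new Dictionary<string, string> { ["input"] = text };
    if (!string.IsNullOrWhiteSpace(voice)) body["voice"] = voice;
    if (!string.IsNullOrWhiteSpace(instructions)) body["instructions"] = instructions;
    return JsonSerializer.Serialize(body);
}

// Hypothetical helper: wraps the body in a POST request against the local server.
static HttpRequestMessage BuildSpeechRequest(Uri baseUrl, string json) =>
    new(HttpMethod.Post, new Uri(baseUrl, "/v1/audio/speech"))
    {
        Content = new StringContent(json, Encoding.UTF8, "application/json"),
    };

var json = BuildSpeechRequestJson("Hello world", voice: null, instructions: "a calm female voice");
Console.WriteLine(json);

// Placeholder address: the port crispasr listens on is not given in this PR.
using var request = BuildSpeechRequest(new Uri("http://127.0.0.1:8080"), json);
Console.WriteLine($"{request.Method} {request.RequestUri}");
```

Sending the request with `HttpClient.SendAsync` and writing the response bytes to a `.wav` file would complete the smoke test.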

🤖 Generated with Claude Code

niksedk and others added 8 commits May 21, 2026 22:11
Makes it explicit that Chatterbox runs through the CrispASR runtime, in
line with the upcoming "Qwen3 TTS (CrispASR)" engine that uses the same
binary. Cosmetic only - the engine class, settings, voices, and model
files are unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spawns the existing crispasr binary in --server mode with one of:
  --backend qwen3-tts-1.7b-voicedesign   (default; needs instructions)
  --backend qwen3-tts-1.7b-customvoice   (voice cloning, takes a WAV + ref-text)
and POSTs to its OpenAI-compatible /v1/audio/speech endpoint.

Why both engines: qwen3-tts.cpp's 0.6B model never emits EOS for short
prompts (frame-by-frame logit dump shows the EOS slot is deeply negative
and trending more negative), so a "Hello world" runs to the
max_audio_tokens cap. CrispASR's qwen3-tts-1.7b-voicedesign backend
cleanly emits EOS on the same prompt (1 s of audio in 1.4 s wall time,
RTF=1.5 on M4) - useful as a working baseline while the qwen3-tts.cpp
EOS work continues upstream.

Engine layout (separate from qwen3-tts.cpp's folder per user direction):
  TextToSpeech/Qwen3TtsCrispAsr/models/  talker GGUF + 12Hz codec
  TextToSpeech/Qwen3TtsCrispAsr/voices/  reference WAVs (+ .txt sidecar)

The CrispASR-style 12Hz codec (~986 MB) is a different file from
qwen3-tts.cpp's tokenizer; if not staged locally, the engine passes
--auto-download to crispasr so it fetches into ~/.cache/crispasr/ on
first run (first-run timeout bumped to 30 min for that case).

The talker GGUF must be placed manually until the auto-downloader lands
- the cstr-uploaded 1.7B VoiceDesign GGUF that qwen3-tts.cpp already
downloads is bit-for-bit compatible and can be symlinked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The new engine needs the same free-text instruction box the qwen3-tts.cpp
engine uses for VoiceDesign — without it there's no UI to set the voice
description and the engine silently falls back to "a calm female voice"
on every request.

- IsVoiceDesignModel helper on Qwen3TtsCrispAsr matches the one on
  Qwen3TtsCpp.
- RefreshInstructionVisibility shows the instruction text box for either
  Qwen3 engine when VoiceDesign is the selected model.
- UpdateVoiceLock disables the voice combo for both engines on
  VoiceDesign — voice cloning has no effect there.
- GetInstructionForEngine reads Qwen3TtsCppInstruction for both engines,
  so the same description is shared when A/B testing.
- Save path persists Qwen3TtsCrispAsrModel + the shared instruction.
- Engine-change path restores SelectedModel from Qwen3TtsCrispAsrModel.
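
The instruction fallback and voice lock described in this commit could be sketched roughly as below. Both helpers are hypothetical stand-ins, not the actual GetInstructionForEngine or UpdateVoiceLock code, and the local variable only imitates the shared Qwen3TtsCppInstruction setting.

```csharp
#nullable enable
using System;

// Hypothetical helper: use the saved description, or the fallback when none is set.
static string ResolveInstruction(string? saved, string fallback) =>
    string.IsNullOrWhiteSpace(saved) ? fallback : saved;

// Hypothetical helper: voice cloning has no effect on VoiceDesign,
// so the voice combo is only enabled for other models.
static bool IsVoiceComboEnabled(bool isVoiceDesignModel) => !isVoiceDesignModel;

// Stand-in for the shared Qwen3TtsCppInstruction setting read by both Qwen3 engines.
string? sharedInstruction = null;
Console.WriteLine(ResolveInstruction(sharedInstruction, "a calm female voice"));

sharedInstruction = "an energetic male narrator";
Console.WriteLine(ResolveInstruction(sharedInstruction, "a calm female voice"));
Console.WriteLine(IsVoiceComboEnabled(isVoiceDesignModel: true));
```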

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The engine-combo dot defaulted to None for Qwen3TtsCrispAsr because
GetTtsEngineDotStatus had no case for it. Mirror the Chatterbox approach:
add GetEngineUpdateStatus on the engine class (reads the CrispASR
runtime's sidecar — that's what the engine sits on top of) and route the
new case through StatusDots.From in TextToSpeechWindow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three issues from manual testing:

1. CustomVoice fails up-front with "talker GGUF not found" — the
   1.7B CustomVoice talker isn't bit-for-bit compatible with the
   VoiceDesign one, so symlinking from qwen3-tts.cpp doesn't help and
   no SE-side downloader exists yet. Hand the talker off to crispasr's
   own --auto-download when missing locally (same approach we already
   use for the 12Hz codec). Pass `-m auto` so crispasr resolves the
   right model from the backend name. First-run timeout already covers
   the longer download window.

2. CustomVoice has no baked default voice — it's pure voice cloning
   and rejects requests without a `voice` field. Drop the "Default"
   entry from the voice combo when CustomVoice is selected, and add a
   clear up-front error in Speak if the user picks an empty voice
   anyway (mentions 24 kHz mono WAV + .txt sidecar requirements).

3. The voice list now depends on the selected model, so trigger a
   RefreshVoices when the user toggles between VoiceDesign and
   CustomVoice. Also persist the new model choice immediately so
   GetVoices sees it on the next call.
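
A rough sketch of the three behaviours above, assuming nothing about the real implementation: all three helpers are hypothetical, and the error text is only an example of the kind of message described.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: pass the local talker when present, otherwise "auto"
// so that crispasr resolves the model from the backend name.
static string ResolveTalkerArgument(string localTalkerPath) =>
    File.Exists(localTalkerPath) ? localTalkerPath : "auto";

// Hypothetical helper: VoiceDesign keeps a "Default" entry, CustomVoice lists
// only reference voices because it has no baked-in voice.
static List<string> BuildVoiceList(bool isCustomVoice, IEnumerable<string> referenceVoices)
{
    var voices = new List<string>();
    if (!isCustomVoice) voices.Add("Default");
    voices.AddRange(referenceVoices);
    return voices;
}

// Hypothetical helper: returns an error message for Speak, or null when the request is fine.
static string? ValidateVoice(bool isCustomVoice, string? voice) =>
    isCustomVoice && string.IsNullOrWhiteSpace(voice)
        ? "CustomVoice needs a reference voice: a 24 kHz mono WAV with a matching .txt sidecar."
        : null;

Console.WriteLine(ResolveTalkerArgument(Path.Combine("models", "missing-talker.gguf")));
Console.WriteLine(string.Join(", ", BuildVoiceList(isCustomVoice: true, new[] { "alice", "bob" })));
Console.WriteLine(ValidateVoice(isCustomVoice: true, voice: "") ?? "ok");
```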

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The new "Qwen3 TTS (CrispASR)" engine had ImportVoice implemented but
VoiceSettingsViewModel.IsImportVoiceVisible didn't include it, so the
button stayed hidden, leaving Import Voice unreachable for users.

Also seed our voices folder from the existing qwen3-tts.cpp install
(TextToSpeech/Qwen3TtsCpp/voices/) on first GetVoices when our folder
is empty. The reference WAVs and .txt sidecars are bit-for-bit usable
by both engines, so users who already downloaded the qwen3-tts.cpp voice
pack get them for free here.

Users who never installed qwen3-tts.cpp start empty and can use Import
Voice. A dedicated CrispASR-side voice-pack downloader (fetching the
support-files voices.zip into this engine's folder) is still a follow-up.
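
The one-time seeding step might look roughly like the sketch below. `SeedVoices` is a hypothetical helper, not the PR's implementation; it copies only `.wav` and `.txt` files and does nothing when the target folder already has content.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper: copy reference WAVs and .txt sidecars into an empty
// voices folder. Returns the number of files copied.
static int SeedVoices(string sourceDir, string targetDir)
{
    if (!Directory.Exists(sourceDir)) return 0;

    Directory.CreateDirectory(targetDir);
    if (Directory.EnumerateFileSystemEntries(targetDir).Any()) return 0; // never overwrite user content

    var copied = 0;
    foreach (var file in Directory.EnumerateFiles(sourceDir))
    {
        var extension = Path.GetExtension(file);
        var isVoiceFile = extension.Equals(".wav", StringComparison.OrdinalIgnoreCase)
                          || extension.Equals(".txt", StringComparison.OrdinalIgnoreCase);
        if (!isVoiceFile) continue;

        File.Copy(file, Path.Combine(targetDir, Path.GetFileName(file)));
        copied++;
    }

    return copied;
}

// Folder names taken from the commit message; resolved relative to the working directory here.
var copiedCount = SeedVoices(
    Path.Combine("TextToSpeech", "Qwen3TtsCpp", "voices"),
    Path.Combine("TextToSpeech", "Qwen3TtsCrispAsr", "voices"));
Console.WriteLine($"seeded {copiedCount} file(s)");
```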

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Probe `crispasr --version` once per binary mtime and cache the parsed
semver. Surfaces the installed runtime version next to the existing
backend / status labels in two places:

- Speech-to-text engine settings: BackendLabel now reads e.g.
  "macOS universal, v0.6.9" instead of just "macOS universal".
- Chatterbox TTS settings: EngineLabel reads e.g.
  "CrispASR v0.6.9 (Chatterbox-capable)" instead of omitting the version.

Both call into a shared CrispAsrVersion helper (new file under
SpeechToText/Engines). The probe accepts both the new structured
--version output (v0.6.9+, "version : 0.6.9") and the legacy single-line
banner ("crispasr 0.6.7 (git ..., Release) [backends: ...]").
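
Parsing both output shapes could be done along these lines. This is a sketch based only on the two sample strings quoted above; `ParseCrispAsrVersion` is a hypothetical name and the real helper may match differently.

```csharp
#nullable enable
using System;
using System.Text.RegularExpressions;

// Hypothetical parser: returns "major.minor.patch" or null when nothing matches.
static string? ParseCrispAsrVersion(string output)
{
    // Structured output: a line such as "version : 0.6.9".
    var structured = Regex.Match(
        output,
        @"^\s*version\s*:\s*v?(\d+\.\d+\.\d+)",
        RegexOptions.Multiline | RegexOptions.IgnoreCase);
    if (structured.Success) return structured.Groups[1].Value;

    // Legacy banner: "crispasr 0.6.7 (git ..., Release) [backends: ...]".
    var legacy = Regex.Match(output, @"\bcrispasr\s+v?(\d+\.\d+\.\d+)", RegexOptions.IgnoreCase);
    return legacy.Success ? legacy.Groups[1].Value : null;
}

Console.WriteLine(ParseCrispAsrVersion("version : 0.6.9"));
Console.WriteLine(ParseCrispAsrVersion("crispasr 0.6.7 (git ..., Release) [backends: ...]"));
Console.WriteLine(ParseCrispAsrVersion("unexpected output") ?? "(no version found)");
```

Trying the structured form first means a newer binary that also prints a banner line would still report the value from its `version :` line.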

The new Qwen3 TTS (CrispASR) engine doesn't have a settings dialog yet,
so it doesn't pick up the version display in this commit - that's a
follow-up once that dialog lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the Chatterbox TTS settings dialog: shows CrispASR runtime
status + version, the install state of the VoiceDesign / CustomVoice
talker GGUFs and the 12 Hz codec, count of imported voices, and the
install folder. Status dots match the rest of the SE TTS UI - green
when present locally, grey "Auto-download on first use" otherwise
(missing local files aren't fatal because the engine passes
--auto-download to crispasr in that case).
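
The status-dot rule for one model row can be pictured with the sketch below. `GetModelStatus` is hypothetical, the "Installed" label is a placeholder, and the "CrispASR required" branch reflects the follow-up fix described later in this PR rather than this commit.

```csharp
using System;

// Hypothetical helper returning a (label, dot colour) pair for one model row.
static (string Label, string Dot) GetModelStatus(bool crispAsrInstalled, bool fileExistsLocally)
{
    // Present in the engine's models folder: green. The label text is a placeholder.
    if (fileExistsLocally) return ("Installed", "green");

    // Without the runtime there is nothing to download into (follow-up fix in this PR).
    if (!crispAsrInstalled) return ("CrispASR required", "grey");

    // Missing locally is not fatal: the engine passes --auto-download to crispasr.
    return ("Auto-download on first use", "grey");
}

foreach (var (installed, present) in new[] { (true, true), (true, false), (false, false) })
{
    var status = GetModelStatus(installed, present);
    Console.WriteLine($"crispasr={installed}, local file={present}: {status.Label} ({status.Dot})");
}
```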

Buttons:
- Re-download CrispASR (delegates to EnsureCrispAsrForChatterbox,
  same flow Chatterbox already uses)
- Open models folder
- Open voices folder
- Close

Also sets IsEngineSettingsVisible = true when this engine is selected
so the gear icon appears in the main TTS UI, and registers the new
ViewModel in DI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

Adds a new Qwen3 TTS (CrispASR) engine that routes synthesis through the existing CrispASR runtime (VoiceDesign + CustomVoice), and updates UI/settings plumbing to support engine selection, engine settings, and CrispASR version display. It also renames the existing Chatterbox engine to make the CrispASR dependency explicit.

Changes:

  • Introduce Qwen3TtsCrispAsr engine with local CrispASR server lifecycle + OpenAI-compatible /v1/audio/speech calls.
  • Add a dedicated settings window/viewmodel for Qwen3 (CrispASR), plus a new persisted model key setting.
  • Show CrispASR runtime version in engine settings UIs and rename Chatterbox engine display name.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| src/ui/Logic/Config/SeVideoTextToSpeech.cs | Adds persisted setting for Qwen3 (CrispASR) model selection + default. |
| src/ui/Features/Video/TextToSpeech/VoiceSettings/VoiceSettingsViewModel.cs | Enables voice import UI for the new engine. |
| src/ui/Features/Video/TextToSpeech/TextToSpeechWindow.cs | Adds status-dot support for Qwen3 (CrispASR) engine. |
| src/ui/Features/Video/TextToSpeech/TextToSpeechViewModel.cs | Registers the new engine, persists its model selection, and wires settings dialog + model-dependent voice list behavior. |
| src/ui/Features/Video/TextToSpeech/Qwen3TtsCrispAsrSettings/Qwen3TtsCrispAsrSettingsWindow.cs | New settings dialog UI for Qwen3 (CrispASR). |
| src/ui/Features/Video/TextToSpeech/Qwen3TtsCrispAsrSettings/Qwen3TtsCrispAsrSettingsViewModel.cs | New settings VM: CrispASR runtime/model/codec/voices status + actions. |
| src/ui/Features/Video/TextToSpeech/Engines/Qwen3TtsCrispAsr.cs | New engine implementation (server start/stop, voice handling, import). |
| src/ui/Features/Video/TextToSpeech/Engines/ChatterboxTtsCpp.cs | Renames engine display name to “Chatterbox TTS (CrispASR)”. |
| src/ui/Features/Video/TextToSpeech/ChatterboxTtsSettings/ChatterboxTtsSettingsViewModel.cs | Adds CrispASR version suffix to improve runtime visibility. |
| src/ui/Features/Video/SpeechToText/EngineSettings/SpeechToTextEngineSettingsViewModel.cs | Appends CrispASR version to backend labels for CrispASR engines. |
| src/ui/Features/Video/SpeechToText/Engines/CrispAsrVersion.cs | New cached helper to probe crispasr --version and parse version output. |
| src/ui/DependencyInjectionExtensions.cs | Registers the new Qwen3 (CrispASR) settings ViewModel for DI. |
| change-log.txt | Documents the new engine and Chatterbox rename. |
Comments suppressed due to low confidence (1)

src/ui/Features/Video/TextToSpeech/TextToSpeechViewModel.cs:179

  • Qwen3TtsCrispAsr is added to the available engines, but IsEngineInstalled(...) in this view model does not have a branch to guide the user through installing CrispASR (unlike ChatterboxTtsCpp). If CrispASR is missing, the flow will just return false with no prompt, making the new engine effectively unusable on first use. Add an install/update prompt (likely reusing the CrispASR download flow) when SelectedEngine is Qwen3TtsCrispAsr.
            new MistralSpeech(ttsDownloadService),
            new Murf(ttsDownloadService),
            new GoogleSpeech(ttsDownloadService),
            new Qwen3TtsCpp(),
            new Qwen3TtsCrispAsr(),
            new KokoroTtsCpp(),
            new ChatterboxTtsCpp(),
            new OmniVoiceTtsCpp(),
        ];

public class ChatterboxTtsCpp : ITtsEngine
{
- public string Name => "Chatterbox TTS";
+ public string Name => "Chatterbox TTS (CrispASR)";
if (SelectedEngine is Qwen3TtsCrispAsr engine)
{
Se.Settings.Video.TextToSpeech.Qwen3TtsCrispAsrModel = value ?? Qwen3TtsCrispAsr.DefaultModelKey;
_ = RefreshVoices(engine);
Comment on lines +148 to +153
// CrispASR is shared with Speech-to-text; piggy-back on the same redownload flow
// Chatterbox uses, then refresh status here. The crispasr binary itself does not
// care which TTS engine triggered the download.
await TtsVoiceInstaller.EnsureCrispAsrForChatterbox(Window, _windowService, forceRedownload: true);
Refresh();
}
Comment on lines +23 to +27
UiUtil.InitializeWindow(this, GetType().Name);
Title = "Qwen3 TTS (CrispASR) settings";
SizeToContent = SizeToContent.WidthAndHeight;
CanResize = false;
MinWidth = 580;
Comment on lines +188 to +189
var openModelsFolder = UiUtil.MakeButton("Open models folder", vm.OpenModelsFolderCommand).WithIconLeft(IconNames.FolderOpen);
var openVoicesFolder = UiUtil.MakeButton("Open voices folder", vm.OpenVoicesFolderCommand).WithIconLeft(IconNames.FolderOpen);
Comment on lines +657 to +688
public bool ImportVoice(string fileName)
{
if (string.IsNullOrEmpty(fileName) || !File.Exists(fileName))
{
return false;
}

var voicesFolder = GetSetVoicesFolder();
var baseName = Path.GetFileNameWithoutExtension(fileName);
var destinationFileName = GetUniqueDestinationFileName(voicesFolder, baseName);

// CrispASR's qwen3-tts CustomVoice backend expects a 24 kHz mono reference WAV.
// Always resample on import via ffmpeg so the saved file is in the right shape
// regardless of what the user picked.
try
{
var process = FfmpegGenerator.ConvertToMono24kHzWav(fileName, destinationFileName);
if (!process.Start())
{
return false;
}

process.WaitForExit();
}
catch (Exception ex)
{
Se.LogError(ex, "Qwen3 TTS (CrispASR) voice import failed (ffmpeg conversion).");
return false;
}

return File.Exists(destinationFileName);
}
Comment on lines +92 to +109
// Talker GGUFs live in the engine's own models folder OR are auto-downloaded by
// crispasr into ~/.cache/crispasr. We can only verify the engine-folder copy here;
// a missing local talker isn't fatal (the engine falls back to --auto-download) so
// surface it as a neutral "Auto-download on first use" rather than an error.
var voiceDesignPath = Qwen3TtsCrispAsr.GetTalkerPath(Qwen3TtsCrispAsr.ModelKeyVoiceDesign);
ApplyModelStatus(File.Exists(voiceDesignPath),
label => VoiceDesignTalkerLabel = label,
brush => VoiceDesignTalkerBrush = brush);

var customVoicePath = Qwen3TtsCrispAsr.GetTalkerPath(Qwen3TtsCrispAsr.ModelKeyCustomVoice);
ApplyModelStatus(File.Exists(customVoicePath),
label => CustomVoiceTalkerLabel = label,
brush => CustomVoiceTalkerBrush = brush);

var codecPath = Qwen3TtsCrispAsr.GetCodecPath();
ApplyModelStatus(File.Exists(codecPath),
label => CodecLabel = label,
brush => CodecBrush = brush);
Comment on lines +91 to +105
// For CrispASR engines, append the installed runtime version (probed once via
// `crispasr --version`) so users can tell at a glance which build they have.
// Other engines stay as-is.
private static string AppendVersion(ISpeechToTextEngine engine, string backendLabel)
{
if (engine is not ICrispAsrEngine crispAsr)
{
return backendLabel;
}

var version = CrispAsrVersion.TryGet(crispAsr.GetExecutable());
return string.IsNullOrEmpty(version)
? backendLabel
: $"{backendLabel}, v{version}";
}
Six fixes plus a partial localization pass:

1. Migrate the persisted engine name: settings serialize the engine by
   Name, so the rename "Chatterbox TTS" -> "Chatterbox TTS (CrispASR)"
   would silently drop users' saved selection. LoadSettings rewrites the
   old name on load.

2. RefreshVoices on Qwen3TtsCrispAsr model change was fire-and-forget;
   any exception would surface later as an unobserved task exception.
   Wrap in Task.Run + try/catch and route the actual refresh through
   the UI thread dispatcher.

3. EnsureCrispAsrForChatterbox was hardcoded with "Chatterbox TTS" in
   every prompt — confusing when invoked from the new Qwen3 (CrispASR)
   settings dialog. Refactor to a shared EnsureCrispAsrAsync helper
   that takes an engine display name + optional capability check, with
   thin Chatterbox / Qwen3 wrappers. Qwen3 dialog now uses the new
   EnsureCrispAsrForQwen3 entry point.

6. ImportVoice on Qwen3TtsCrispAsr now also copies an adjacent .txt
   sidecar (transcription of the reference WAV) — CustomVoice cloning
   uses it for best quality.

7. Settings dialog model status: "Auto-download on first use" is
   misleading when CrispASR itself isn't installed (nothing to download
   into). Show "CrispASR required" + grey for the talker / codec rows
   in that case.

8. `crispasr --version` probe blocked the UI thread (up to 5 s timeout)
   when the engine-settings dialog first opened. Move the probe into a
   Task.Run and patch the version into BackendLabel / EngineLabel via
   the dispatcher when it returns. Applied to both
   SpeechToTextEngineSettings and the two CrispASR TTS dialogs
   (Chatterbox, Qwen3).
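
Fix 1 in the list above (migrating the persisted engine name) amounts to a small rewrite on load. The sketch below is illustrative only: `MigrateEngineName` is a hypothetical helper and the real LoadSettings code may be structured differently.

```csharp
#nullable enable
using System;

// Hypothetical helper: map the pre-rename engine name to the new one so a
// saved selection still matches an engine after the rename.
static string? MigrateEngineName(string? savedName) =>
    savedName == "Chatterbox TTS" ? "Chatterbox TTS (CrispASR)" : savedName;

Console.WriteLine(MigrateEngineName("Chatterbox TTS"));
Console.WriteLine(MigrateEngineName("Qwen3 TTS (CrispASR)"));
```

Names that are already current pass through unchanged, so running the migration on every load is harmless.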

Partial fix for #5: localize "Open containing folder" via existing
Se.Language.General string. Title + remaining engine-specific status
strings stay hardcoded English to match the existing pattern in the
sibling Chatterbox / Qwen3TtsCpp settings dialogs; full localization
across language resource files is a separate follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@niksedk niksedk merged commit 70cfef2 into main May 22, 2026
1 of 3 checks passed
@niksedk niksedk deleted the qwen3-tts-via-crispasr branch May 22, 2026 07:15

2 participants