v0.7.2

Latest

@CrispStrobe released this 14 Jun 18:15
· 51 commits to main since this release

New Backends

ASR:

  • LFM2-Audio 1.5B (LiquidAI LFM2.5-Audio) — Depthformer encoder with KV cache, conv state caching, gallocr graph allocation. ASR + TTS + speech-to-speech. GPU backend support.
  • Mini-Omni2 — Qwen2-0.5B LLM + Whisper encoder + adapter. ASR/TTS/S2S with BPE tokenizer. Q4_K auto-download.
  • Nemotron 3.5 ASR Streaming 0.6B — 39-language streaming ASR scaffold (NVIDIA Parakeet architecture + prompt kernel). Chunked encoder with graph reuse and causal DW conv padding.
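
The "causal DW conv padding" and "conv state caching" mentioned above can be illustrated with a small pure-Python sketch. This is a generic illustration of the technique, not CrispASR's ggml implementation; the function names `causal_conv_chunk` and `causal_depthwise_conv1d` are hypothetical. A causal depthwise convolution pads only on the left with `k - 1` values, so each output frame depends on current and past inputs only. In streaming, the last `k - 1` inputs of one chunk serve as that left padding for the next chunk.

```python
from typing import List, Tuple


def causal_conv_chunk(
    chunk: List[float], kernel: List[float], state: List[float]
) -> Tuple[List[float], List[float]]:
    """Convolve one channel of one chunk.

    `state` holds the previous k-1 inputs (all zeros before the first chunk).
    Returns the outputs for this chunk and the state for the next chunk.
    """
    k = len(kernel)
    padded = list(state) + list(chunk)  # left context only, never future frames
    out = [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(chunk))
    ]
    new_state = padded[len(padded) - (k - 1):]  # last k-1 inputs seen so far
    return out, new_state


def causal_depthwise_conv1d(
    x: List[List[float]], kernels: List[List[float]]
) -> List[List[float]]:
    """Depthwise: each channel in x ([channels][time]) has its own kernel."""
    out = []
    for channel, kernel in zip(x, kernels):
        y, _ = causal_conv_chunk(channel, kernel, [0.0] * (len(kernel) - 1))
        out.append(y)
    return out
```

Processing a sequence in chunks while carrying `state` forward gives the same result as processing it in one pass, which is why a chunked encoder can cache the convolution state rather than re-reading earlier audio.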

Major Features

  • Speech-to-speech (--s2s) — new CLI flag for end-to-end speech-to-speech with LFM2-Audio and Mini-Omni2; POST /v1/audio/speech-to-speech server endpoint; session-level S2S C API + Dart FFI bindings
  • WebSocket streaming ASR — real-time ASR via --ws-port with proper WebSocket handshake
  • M4A/AAC/Opus/WebM input — ffmpeg-based container support across CLI, WASM demo, and HF Space
  • WASM ASR session surface — backend-agnostic asrOpen/asrTranscribe/asrSet* for JavaScript/WASM consumers
  • Node.js addon — transcribeSession() via crispasr_session C ABI with Jest test coverage
  • Full C-ABI session-setter parity — all 7 language bindings (C, Go, Python, Ruby, Rust, JS/WASM, Dart) at full parity; crispasr_session_set_punc_model + hotwords setters
  • Server parity — truecase, per-request diarize/LID knobs, punc-model (PCS + CTC auto-enable), POST /v1/translate, all remaining transcription params exposed per-request
  • --dry-run-resolve — now honors sub-variant model keys
  • FunASR -l language flag — language routing wired into prompt template
  • Regression manifest — 65 total backends tracked (17 new); pinned SHA revisions for all model repos
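
For readers unfamiliar with what a WebSocket handshake involves, here is a minimal sketch of the server side as defined by RFC 6455. It is not taken from the CrispASR server code; the helper names `websocket_accept` and `handshake_response` are hypothetical. The server appends a fixed GUID to the client's `Sec-WebSocket-Key`, hashes the result with SHA-1, and returns the Base64 digest in `Sec-WebSocket-Accept`.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455, section 1.3.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def websocket_accept(sec_websocket_key: str) -> str:
    """Compute the Sec-WebSocket-Accept value for a client key."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


def handshake_response(sec_websocket_key: str) -> str:
    """Build the HTTP 101 response that completes the upgrade."""
    return (
        "HTTP/1.1 101 Switching Protocols\r\n"
        "Upgrade: websocket\r\n"
        "Connection: Upgrade\r\n"
        f"Sec-WebSocket-Accept: {websocket_accept(sec_websocket_key)}\r\n"
        "\r\n"
    )
```

With the sample key from RFC 6455, `websocket_accept("dGhlIHNhbXBsZSBub25jZQ==")` returns `s3pPLMBiTxaQ9kYGzzhZRbK+xOo=`. A client that receives any other value must fail the connection.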

Performance

  • LFM2-Audio — KV cache (3x decode speedup), conv state caching (10x), gallocr graph allocation (prefill buffer 2 GB → 256 MB), depthformer buffer reuse, streaming TTS API
  • FireRed — batch beam decoder (4.5x faster beam search), OpenMP + vectorizable dot (~4x faster AED decoder)
  • Server — VAD slicing now matches CLI for unbounded backends (#165); LID model kept resident across requests; Silero/FireRed/ECAPA LID contexts cached across calls
  • Mini-Omni2 — batch embed for TTS/S2S
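
To show why a KV cache speeds up autoregressive decoding, the toy sketch below compares decoding with and without one. It is a generic single-head attention in pure Python, not the LFM2-Audio code; `Projector`, `decode_no_cache`, and `decode_with_cache` are hypothetical names, and the projections are fixed scalings chosen only to keep the example short. Both paths produce identical outputs, but the cached path projects each token to key/value once instead of once per later step.

```python
import math
from typing import List, Tuple

Vector = List[float]


class Projector:
    """Stand-in for the K/V projection; counts how often it is invoked."""

    def __init__(self) -> None:
        self.calls = 0

    def kv(self, x: Vector) -> Tuple[Vector, Vector]:
        self.calls += 1
        return [2.0 * v for v in x], [0.5 * v for v in x]


def attend(q: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention for one query over all cached positions."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(len(q))
    ]


def decode_no_cache(tokens: List[Vector], proj: Projector) -> List[Vector]:
    """Recompute K/V for the whole prefix at every step."""
    outputs = []
    for t in range(1, len(tokens) + 1):
        pairs = [proj.kv(x) for x in tokens[:t]]
        keys = [k for k, _ in pairs]
        values = [v for _, v in pairs]
        outputs.append(attend(tokens[t - 1], keys, values))
    return outputs


def decode_with_cache(tokens: List[Vector], proj: Projector) -> List[Vector]:
    """Project each token once and append it to the cache."""
    keys: List[Vector] = []
    values: List[Vector] = []
    outputs = []
    for x in tokens:
        k, v = proj.kv(x)
        keys.append(k)
        values.append(v)
        outputs.append(attend(x, keys, values))
    return outputs
```

For a sequence of n tokens the uncached path performs n(n+1)/2 projections and the cached path performs n. The actual speedup in a real model depends on how much of the step time the projections account for.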

Notable Bug Fixes

  • #167 — SIGILL on AMD Ryzen 5700X (and all non-AVX512 CPUs): release binaries were compiled with -march=native on AVX-512 CI runners. All Linux x86_64 builds now use -DGGML_NATIVE=OFF -DGGML_AVX2=ON
  • #164 — VoxCPM2 TTS: 12+ fixes — VAE Vulkan work-group overflow (CPU fallback for long decodes), RALM NaN on Vulkan, CUDA SIGABRT in attention permute, FSQ NaN on CUDA, stop predictor crash, null-guard graph tensors
  • #165 — Silero-LID crash on GPU builds (CPU threads set on GPU backend)
  • #89 — Parakeet-JA: auto-enable VAD instead of shorter chunks for Japanese models
  • #81 — Nemotron: tensor name loading (conv.bn.*, prompt_kernel.linear*.*), exact-size graphs (no zero-padding), language mapping, causal DW conv padding
  • #52 — Qwen3-TTS O15 CUDA crash: dedicated scheduler for cached T=1 graph
  • #125 — Gemma4-E2B prefers_vad + MIMO-ASR sweep fixes
  • MOSS-Audio GPT-2 byte detokenization (remove Ġ artifacts)
  • Orpheus token_embd kept at F16 during quantization (SNAC codec is quant-sensitive)
  • Core attention: skip RoPE when rope_theta <= 0 (fixes RALM NaN)
  • Core attention: ggml_cont after permute before set_rows (fixes CUDA SIGABRT)
  • Server: --no-warmup opt-out, guarded warmup, surfaced 500s, robust scratch dir
  • CLI: --dry-run-resolve honors sub-variant model keys
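
The Ġ artifacts in the MOSS-Audio item above come from GPT-2's byte-level BPE, which stores every byte as a printable Unicode character (the space byte becomes Ġ). The sketch below shows the standard reverse mapping in Python. It follows the well-known GPT-2 table construction and is not the CrispASR fix itself; the function names `bytes_to_unicode` and `detokenize` are illustrative.

```python
from typing import Dict, List


def bytes_to_unicode() -> Dict[int, str]:
    """Build the GPT-2 byte -> printable character table."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes are remapped to code points from 256 upward.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


# Inverse table: printable character -> original byte.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def detokenize(pieces: List[str]) -> str:
    """Join token strings, map each character back to its byte, decode UTF-8."""
    raw = bytes(UNICODE_TO_BYTE[ch] for ch in "".join(pieces))
    return raw.decode("utf-8", errors="replace")
```

For example, `detokenize(["Hello", "Ġworld"])` returns `Hello world`. Decoding must happen on the concatenated pieces, because a multi-byte UTF-8 character can be split across tokens.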

Build / CI

  • Emscripten: fix libwhisper.worker.js copy failure with SINGLE_FILE=1 (modern Emscripten inlines pthread worker)
  • Windows MSYS2: fix kokoro phonemize_builtin_* linker errors (phonemizer extracted to shared OBJECT library)
  • Regression manifest: all model revisions pinned to SHAs; voice_preset added to 12 TTS entries
  • Pre-commit hook auto-syncs Go CGO LDFLAGS on CMake changes
  • Docker Smoke workflow fixed (correct build context)
  • Go CGO LDFLAGS regenerated for new backends (snac, mini-omni2, lfm2_audio)
  • clang-format v18 pass on 25 files with format drift
  • HF Space: pre-built binary workflow, model pre-download, Ubuntu 24.04 fixes
  • WASM build workflow added (all backends)
  • CI: windows-blas pkgconfiglite fix for windows-2025 runner

SNAC Refactor

  • SNAC decoder extracted to core/snac (shared by Orpheus + Mini-Omni2)
  • Unit + live tests for core/snac.h