Local AI text-to-speech with voice cloning and voice design, powered by GGML. C++17 port of OmniVoice (k2-fsa/OmniVoice). 646 languages, 24 kHz mono output, runs on CPU, CUDA, ROCm, Metal, Vulkan.
- Voice cloning from a reference WAV plus its transcript
- Voice design via attribute keywords (gender, age, pitch, style, volume, emotion)
- Auto voice with consistent speaker identity across long inputs
- Long-form synthesis with punctuation-aware text chunking, voice prompt promotion, cross-fade and pydub-strict silence removal
- Bit-for-bit deterministic generation in greedy mode; seedable Philox PRNG for stochastic sampling
- Q8_0 quantisation of the 612 M-parameter Qwen3 backbone
- Two CLI tools: omnivoice-tts (text -> WAV) and omnivoice-codec (WAV <-> RVQ codes)
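Long-form synthesis splits the input on punctuation before generating audio. A minimal sketch of how such punctuation-aware chunking can work (illustrative Python only; the actual C++ splitter, its delimiter set, and its size limits are assumptions here):

```python
# Simplified punctuation-aware text chunking (illustrative; the real
# C++ implementation and its chunk-size limits may differ).
import re

def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    # Split on sentence-ending punctuation, keeping the delimiter attached.
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?;])\s+", text.strip())
                 if s.strip()]
    chunks, current = [], ""
    for sent in sentences:
        # Start a new chunk when appending would exceed the budget.
        if current and len(current) + 1 + len(sent) > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks

print(chunk_text("One. Two! Three? Four.", max_chars=10))
# -> ['One. Two!', 'Three?', 'Four.']
```

Each chunk is then synthesized separately and joined with cross-fades, which is why sentence boundaries (rather than fixed character offsets) matter for seam quality.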
git clone --recurse-submodules https://github.com/ServeurpersoCom/omnivoice.cpp.git
cd omnivoice.cpp
./buildcuda.sh # NVIDIA GPU
./buildvulkan.sh # AMD/Intel GPU (Vulkan)
./buildcpu.sh # CPU only
./buildall.sh # all backends, runtime DL loading
Pre-converted GGUFs are available on Hugging Face:
https://huggingface.co/Serveurperso/OmniVoice-GGUF
Drop them in models/ and skip to the quick start. To convert from the original checkpoint:
./checkpoints.sh # hf download k2-fsa/OmniVoice -> checkpoints/
./convert.py # 2 GGUFs in BF16 -> models/
./quantize.sh # base LM Q8_0 (tokenizer stays at native dtype)
echo "Hello world." | ./build/omnivoice-tts \
--model models/omnivoice-base-Q8_0.gguf \
--codec models/omnivoice-tokenizer-F32.gguf \
--lang English -o hello.wav
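Output is 24 kHz mono WAV. A quick sanity check of a synthesized file can be done with the Python standard library; the snippet below writes a small 24 kHz mono test tone so it is runnable on its own (assuming 16-bit PCM output, which is typical but not stated above; point `wav_info` at your own hello.wav instead):

```python
# Inspect a WAV file's format with the stdlib 'wave' module.
import math
import struct
import wave

def wav_info(path: str) -> tuple[int, int, int]:
    """Return (sample_rate, channels, sample_width_bytes)."""
    with wave.open(path, "rb") as w:
        return w.getframerate(), w.getnchannels(), w.getsampwidth()

# Write a 0.1 s 440 Hz test tone in the expected format (24 kHz, mono,
# 16-bit) so the check is demonstrable without running the synthesizer.
with wave.open("demo.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(24000)
    samples = [int(8000 * math.sin(2 * math.pi * 440 * t / 24000))
               for t in range(2400)]
    w.writeframes(struct.pack(f"<{len(samples)}h", *samples))

print(wav_info("demo.wav"))  # -> (24000, 1, 2)
```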
Voice cloning:
./build/omnivoice-tts \
--model models/omnivoice-base-Q8_0.gguf \
--codec models/omnivoice-tokenizer-F32.gguf \
--ref-wav ref.wav --ref-text ref.txt \
--lang French -o out.wav < prompt.txt
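Reference audio for cloning may come from arbitrary recordings. Whether --ref-wav requires mono input is not stated above, so the helper below is a hedged convenience sketch: it downmixes a stereo 16-bit WAV to mono by averaging channels, using only the Python standard library (the function name and the mono requirement are assumptions, not part of the CLI):

```python
# Hypothetical helper: downmix a stereo 16-bit PCM WAV to mono by
# averaging the two channels. Illustrative preprocessing only; the
# omnivoice-tts input requirements are not documented here.
import struct
import wave

def stereo_to_mono(src: str, dst: str) -> None:
    with wave.open(src, "rb") as r:
        assert r.getnchannels() == 2 and r.getsampwidth() == 2
        rate = r.getframerate()
        n = r.getnframes()
        frames = struct.unpack(f"<{n * 2}h", r.readframes(n))
    # Average interleaved left/right samples.
    mono = [(frames[i] + frames[i + 1]) // 2 for i in range(0, len(frames), 2)]
    with wave.open(dst, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(struct.pack(f"<{len(mono)}h", *mono))

# Self-contained demo: build a 2-frame stereo file, then downmix it.
with wave.open("stereo_demo.wav", "wb") as w:
    w.setnchannels(2)
    w.setsampwidth(2)
    w.setframerate(24000)
    w.writeframes(struct.pack("<4h", 100, 200, 300, 400))
stereo_to_mono("stereo_demo.wav", "mono_demo.wav")
with wave.open("mono_demo.wav", "rb") as r:
    print(r.getnchannels(), r.getnframes())  # -> 1 2
```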
See docs/ARCHITECTURE.md for the model, the GGUF layout, the inference pipeline, every CLI flag, and the validation results.
MIT. See LICENSE.
Upstream model: OmniVoice by Xiaomi / k2-fsa, Apache 2.0.
Audio codec: Higgs Audio v2 (bosonai/higgs-audio-v2-tokenizer), Apache 2.0.