A comprehensive WebUI for fine-tuning and interacting with VoxCPM models. This application leverages efficient LoRA (Low-Rank Adaptation) fine-tuning to enable high-quality voice cloning.
- Data Preparation: Upload/record audio and generate transcriptions using Faster-Whisper.
- Fine-Tuning: Configure LoRA or Full Fine-Tuning parameters and train your voice adapter.
- Inference: Generate speech using the base model combined with your trained weights.
- Persistent Torch & Triton cache: Integration of `triton-windows` and a custom kernel caching system in `models/.cache`, enabling the full power of `torch.compile` for inference speed-up.
  - Building the persistent cache for the first time might take up to 5 minutes. This is a one-time process. Once cached, subsequent generations will be significantly faster.
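The persistent cache behaviour can be reproduced in a standalone script by pointing the Inductor and Triton cache environment variables at a stable directory before `torch` is imported. A minimal sketch; the sub-paths under `models/.cache` are assumptions, not necessarily the app's exact layout:

```python
import os

# Assumed cache root mirroring the models/.cache directory described above.
CACHE_DIR = os.path.join("models", ".cache")

# These must be set *before* `import torch` so that kernels compiled by
# torch.compile are written to (and reused from) a persistent directory
# instead of a per-session temp folder.
os.environ.setdefault("TORCHINDUCTOR_CACHE_DIR", os.path.join(CACHE_DIR, "inductor"))
os.environ.setdefault("TRITON_CACHE_DIR", os.path.join(CACHE_DIR, "triton"))
```

With the variables in place, the first compiled run populates the cache and later runs skip recompilation.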
- Python: 3.10 – 3.11 (Recommended for stability during training).
- PyTorch: 2.5.0+
- CUDA: 12.0+
- Format Support: `.wav` is recommended.
| Model | LoRA Training |
|---|---|
| VoxCPM 1.5 (750M) | ~12 GB VRAM |
| VoxCPM 2.0 (2B) | ~20 GB VRAM |
- Format: `.wav` is highly recommended. Other formats supported by `torchaudio` also work.
- Duration: 3–30 seconds per clip is the "sweet spot."
- Warning: Clips < 1s produce unstable results.
- Warning: Very long clips increase VRAM usage and may be filtered by `max_batch_tokens`.
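The duration guidance above can be sketched as a simple pre-filter. The token rate and budget below are illustrative assumptions, not the app's actual defaults:

```python
# Hypothetical filter mirroring the max_batch_tokens behaviour described above:
# clips whose estimated token count exceeds the budget are dropped before batching.

TOKENS_PER_SECOND = 25   # assumed audio-token frame rate; check your model config
MAX_BATCH_TOKENS = 2048  # example budget, not the app's real default

def keep_clip(duration_s: float) -> bool:
    """Return True if a clip fits the duration and token limits."""
    if duration_s < 1.0:  # clips under 1 s produce unstable results
        return False
    return duration_s * TOKENS_PER_SECOND <= MAX_BATCH_TOKENS

clips = [0.5, 5.0, 12.0, 120.0]
kept = [d for d in clips if keep_clip(d)]  # → [5.0, 12.0]
```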
- Sample Rate: The dataloader resamples automatically. Your config `sample_rate` must match the AudioVAE encoder input:
  - VoxCPM 1.0: 16kHz
  - VoxCPM 1.5: 44.1kHz
  - VoxCPM 2.0: 16kHz (the encoder operates at 16kHz; the decoder outputs 48kHz).
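A config sanity check for the sample rates listed above might look like this (the model keys and helper function are hypothetical, shown only to make the mapping concrete):

```python
# Encoder input sample rates per model version, from the list above.
ENCODER_SAMPLE_RATE = {
    "voxcpm-1.0": 16_000,
    "voxcpm-1.5": 44_100,
    "voxcpm-2.0": 16_000,  # decoder still outputs 48 kHz
}

def check_config(model: str, sample_rate: int) -> None:
    """Fail fast if the config sample_rate does not match the encoder input."""
    expected = ENCODER_SAMPLE_RATE[model]
    if sample_rate != expected:
        raise ValueError(f"{model} expects sample_rate={expected}, got {sample_rate}")
```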
- Trim Trailing Silence: Keep silence to < 0.5 seconds. Excessive trailing silence is the leading cause of "infinite generation" issues after fine-tuning.
- Normalize Volume: Ensure consistent levels across all training samples.
- Clean Transcripts: Text must match audio exactly. Inaccurate transcripts degrade both cloning quality and text adherence.
- Remove Noise: The model is highly sensitive to background noise. Use clean, isolated voice recordings.
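The trimming and normalization advice can be sketched in plain Python. This is a simplified peak-based approach on raw sample lists; a real pipeline would operate on `torchaudio` tensors and likely use an energy threshold rather than per-sample amplitude:

```python
def trim_trailing_silence(samples, sample_rate, threshold=1e-3, keep_s=0.5):
    """Trim trailing near-silence, keeping at most `keep_s` seconds of it."""
    end = len(samples)
    while end > 0 and abs(samples[end - 1]) < threshold:
        end -= 1
    keep = int(keep_s * sample_rate)
    return samples[: min(len(samples), end + keep)]

def normalize_peak(samples, target=0.9):
    """Scale so the loudest sample sits at `target` of full scale."""
    peak = max((abs(s) for s in samples), default=0.0)
    return samples if peak == 0 else [s * target / peak for s in samples]
```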
- Inference (Running the model):
  - Minimum: 8 GB VRAM
  - Recommended: 12 GB VRAM
- Training (LoRA):
  - Minimum: 12 GB VRAM (VoxCPM 1.5)
  - Recommended: 20 GB VRAM (VoxCPM 2.0)
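The VRAM table above can be turned into a quick fit check (a hypothetical helper; the figures come straight from the LoRA training table):

```python
# Approximate LoRA training requirements in GB, from the table above.
LORA_VRAM_GB = {
    "VoxCPM 1.5 (750M)": 12,
    "VoxCPM 2.0 (2B)": 20,
}

def trainable_models(gpu_vram_gb: float) -> list[str]:
    """Models whose LoRA training fits within the given VRAM."""
    return [m for m, need in LORA_VRAM_GB.items() if gpu_vram_gb >= need]
```

For example, a 16 GB card can train the 1.5 (750M) model but not the 2.0 (2B) model.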
Clone the repository with `git clone https://github.com/OpenBMB/VoxCPM.git`. This project utilizes uv for lightning-fast dependency management.
- Run Installer: Double-click `install.bat`.
  - This installs `uv` via Winget (if not present).
  - Synchronizes the environment and installs all required libraries automatically.
- Launch App: Double-click `start.bat`.
- Access: Navigate to `http://127.0.0.1:7860` in your web browser.
Inspired by FranckyB Voice Clone Studio



