English | 中文
PersonaVoice v10.4.8 — Clone any voice from just 1 second of audio with optional persona / emotion injection via plug-in adapters over a frozen F5-TTS backbone. A systematic trade-off analysis of timbre fidelity (SECS SOTA) vs. text-alignment stability.
- What is PersonaVoice?
- Key Results
- Core Innovations
- Architecture Overview
- Installation
- Data & Model Download
- Quick Start
- Reproducing the Paper Results
- Inference Parameters
- Project Structure
- Design Philosophy
- Related Work & Acknowledgements
- Citation
PersonaVoice is a persona-driven plug-in adapter architecture for 1-second voice cloning. It addresses a fundamental challenge:
Given only 1 second of a person's voice and optional text records (chat history / structured recalls), generate speech that preserves their timbre, linguistic habits, and individual emotional expression patterns.
The narrative of this project is not a "full SOTA" claim but a trade-off analysis:
- Flow-Matching achieves dominating timbre similarity (SECS) but suffers short-text manifold collapse.
- PersonaVoice's contribution is to maximize that SECS lead while using LAAG + dynamic FiLM to recover long-text alignment, leaving short-text WER as honest Future Work.
Most zero-shot TTS systems assume ≥3s of reference audio (XTTS v2, CosyVoice, VALL-E, VoiceBox). The 1-second regime exposes fundamental architectural limits:
- Information bottleneck (94 mel frames @ 24 kHz / hop 256)
- Manifold collapse under Flow-Matching for very short texts
- Reference audio impurity (breath, fricatives) dominates the cloning signal
PersonaVoice is the first open-source system to systematically study this regime.
Below is a screenshot of the PersonaVoice v10.4.8 interactive dashboard. Users upload a 1-second reference audio, optionally provide chat history or structured recollections, and obtain a cloned utterance together with a real-time Big-Five persona radar, mel-spectrogram visualization, and architecture condition inspection.
Setup: 1-second reference audio, 200-sample LibriTTS dev-clean evaluation, ECAPA-TDNN SECS, Whisper-large-v3-turbo WER.
| Metric | PersonaVoice v10.4.8 | XTTS v2 (1s ref) | CosyVoice (1s ref) | F5-TTS Official | SOTA |
|---|---|---|---|---|---|
| ECAPA SECS (overall) | 0.4945 | 0.3349 | 0.3595 | 0.2508 | PV ✓ (+47.6% vs XTTS v2) |
| ECAPA SECS (long text) | 0.5044 | 0.31 | 0.33 | 0.22 | PV ✓ (+62.7% vs XTTS v2) |
| WER (overall) | 0.1928 | 0.1296 | 0.6130 | 0.8982 | XTTS v2 (AR) |
| WER (long text) | 0.1158 | 0.10 | 0.55 | 0.85 | XTTS v2 (acknowledged trade-off) |
| WER (short text ≤8 words) | 0.4066 | 0.16 | 0.68 | 0.95 | XTTS v2 (Future Work) |
| CFR (overall, WER>0.5) | 13.5% | ~5%* | ~30%* | ~45%* | XTTS v2 |
| CFR (short text) | 35.8% | ~10%* | ~50%* | ~70%* | XTTS v2 |
| UTMOS (Objective Proxy) | 4.0-4.5 | 3.8 | 4.0 | 3.5 | PV (Human MOS pending) |
| RTF | 0.55-0.66 | 0.85 | 0.72 | 0.50 | F5-TTS |
| 1111.mp3 SECS (long text) | 0.5854 | - | - | - | PV ✓ |
| 1111.mp3 WER (long text) | 0.0676 | - | - | - | PV ✓ |
*CFR for XTTS v2 / CosyVoice / F5-TTS are estimated from WER distribution; measured comparison is planned. Full evaluation details: SOTA_VERIFICATION_REPORT.md.
PersonaVoice introduces 5 core innovations, all validated by 200-sample paired t-tests (p < 0.05, Cohen's d > 0.1).
Solves the fundamental flaw of uneven performance between short and long text under 1s extreme cloning:
- Dynamic Chunking: Long text split into chunks (max 135 chars), each fully utilizes the 1s reference audio
- Dynamic CFG base:
cfg_base = 2.5tuned for short/long-text balance - Dynamic FiLM Activation (v10.4.6+): Short text (single chunk) FiLM off (stable baseline); long text (multi-chunk) FiLM on (prevent timbre drift, +0.06 SECS)
- mel splice fix (v10.4.5): Corrected
(T, mel_dim)vs(mel_dim, T)dimension order across chunks (+56.6% SECS) - F5-TTS official flow integration (v10.3+):
infer_batch_processwithref_text+audio_refcond flow, replacing the custom Duration Predictor - v10.4.8 boundary hardening: ref_text punctuation + space ensures clean ref/gen boundary; gen_text leading-space sacrifice prevents head-truncation; lead-silence trimming removes artifacts
Decomposes the speaker embedding into timbre and environment sub-spaces:
-
env_basis: learnable orthogonal basis$B \in \mathbb{R}^{192 \times 32}$ , QR-orthogonalized -
v10.0 fix:
env_scale = 0.1gradual initialization (v9.x's 1.0 init caused the 1111.mp3 SECS=0.0363 disaster) - Inference-time adjustment:
env_weight = 0→ studio-clean timbre (v10.0 default SOTA config),env_weight = 1→ original mic texture
Upgraded from IEAG (full self-attention entropy) to CEAG (text-audio cross-attention entropy), directly targeting short-text alignment collapse:
- v10.0 Landmine 1 fix: Added
text_maskandmel_maskbeforelog_softmaxto prevent gradient explosion from<PAD>tokens (NaN defense) - v10.4.7+ status: Code path preserved in
ceag_sampler.pyfor reproducibility, butuse_ceag=FalseinSOTAConfig. v10.4.7 ablations showed negligible incremental effect over the LAAG + F5-official-flow baseline; re-enabling is left to follow-up work.
Persona + emotion conditioned FiLM modulating DiT hidden layers:
- Zero-initialized
gamma_net/beta_net(starts as identity mapping, safe for pretrained backbone) - v10.4.6+ dynamic activation: FiLM off when
len(chunks) == 1(short text, avoids over-correction); FiLM on for multi-chunk (long text, anchors timbre across chunks)
Replaces legacy WebRTC VAD (GMM-based, harsh on breath/fricatives) with Silero VAD (neural network, <100 MB VRAM), dramatically improving 1s sample reference purity for consonants like /s/, /f/. RMS normalization (target_rms = 0.1) ensures loudness consistency.
| Module | Ablation p-value | Cohen's d | Status |
|---|---|---|---|
| IBOP | 0.85 / 0.51 | <0.05 | Removed |
| AM-ODE | Zero-init | Identity mapping | Removed |
| TD-CFG | 0.41 / 0.74 | <0.06 | Removed |
| SBM | 0.71 / 0.90 | <0.03 | Removed |
| GRPO | Test-time only | Non-core | Removed (v10.0) |
| Duration Predictor | v10.3+ F5-TTS official formula is more accurate | — | Removed (v10.3) |
| Reference Enhancer | Loop-extension hurt SECS, deviates from F5-TTS baseline | — | Removed (v10.4) |
| Best-of-N | Manifold collapse is architectural; sampling can't fix | — | Removed (v10.4.7) |
The figure above illustrates the full inference pipeline of PersonaVoice v10.4.8. A 1-second reference audio is first pre-processed by Silero VAD + RMS normalization, then encoded by ECAPA-TDNN into a 192-d speaker embedding. Optional chat history and structured recollections are distilled by a BERT-based persona extractor into 64-d text-style and 64-d persona embeddings. The OES module decomposes the audio embedding into orthogonal timbre and environment components, while LAAG decides chunking strategy and FiLM activation based on text length. Finally, the frozen F5-TTS DiT blocks 0–19 plus two trainable blocks 20–21 generate the mel spectrogram, which is converted to 24 kHz waveform by the Vocos vocoder.
Full architecture details: ARCHITECTURE.md
Windows (PowerShell):
git clone https://github.com/personavoice/personavoice.git
cd personavoice
powershell -ExecutionPolicy Bypass -File scripts\setup_env.ps1Linux / macOS:
git clone https://github.com/personavoice/personavoice.git
cd personavoice
bash scripts/setup_env.shThe setup script will:
- Create a
.venvvirtual environment (inherits system torch/speechbrain/vocos) - Install PyTorch (CUDA 11.8 or CPU)
- Install all PersonaVoice dependencies
- Install
silero-vad,f5-tts,imageio-ffmpeg - Install PersonaVoice in editable mode (
pip install -e .) - Verify the installation
- Download pretrained models (~3 GB, optional — pass
--skip-models/-SkipModelsto skip)
# 1. Create virtual environment
python -m venv --system-site-packages .venv
# 2. Activate it
# Windows: .\.venv\Scripts\activate
# Linux: source .venv/bin/activate
# 3. Install PyTorch (CUDA 11.8)
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu118
# 4. Install dependencies
pip install -r requirements.txt
pip install silero-vad f5-tts imageio-ffmpeg
pip install -e .- Python ≥ 3.10
- PyTorch ≥ 2.0 (CUDA 11.8 recommended, CPU works but slow)
- GPU: 8 GB VRAM sufficient (tested on RTX 4070)
- ffmpeg (for Whisper ASR; auto-provided by
imageio-ffmpeg)
PersonaVoice does not bundle large data or model weights. Use the scripts below.
All models are downloaded from HuggingFace. For users in mainland China, set the mirror:
export HF_ENDPOINT=https://hf-mirror.com # Linux/macOS
$env:HF_ENDPOINT="https://hf-mirror.com" # Windows PowerShellThen run:
python scripts/download_models.py # All required models
python scripts/download_models.py --optional # Also optional models (Qwen2.5, MiniLM)
python scripts/download_models.py --check # Check cache status onlyRequired models:
| Model | HuggingFace ID | Size | Purpose |
|---|---|---|---|
| F5-TTS | SWivid/F5-TTS |
~1.5 GB | TTS backbone (frozen) |
| ECAPA-TDNN | speechbrain/spkrec-ecapa-voxceleb |
~80 MB | Speaker encoder (SECS) |
| Vocos | charactr/vocos-mel-24khz |
~50 MB | Neural vocoder |
| BERT (Chinese) | bert-base-chinese |
~400 MB | Persona text encoder |
| Whisper | openai/whisper-large-v3-turbo |
~1.5 GB | ASR for WER evaluation & ref_text |
Optional models:
| Model | HuggingFace ID | Size | Purpose |
|---|---|---|---|
| Qwen2.5-0.5B | Qwen/Qwen2.5-0.5B |
~1 GB | LLM persona dialogue (experimental) |
| all-MiniLM-L6-v2 | sentence-transformers/all-MiniLM-L6-v2 |
~90 MB | Text embedding baseline |
For reproducing the 200-sample evaluation:
python scripts/prepare_data.py
# Or skip download if LibriTTS already exists:
python scripts/prepare_data.py --skip-download
# Quick test with only 20 samples:
python scripts/prepare_data.py --n_samples 20This downloads LibriTTS dev-clean from OpenSLR #60, extracts mel spectrograms + ECAPA embeddings, and saves to data/libritts_processed/libritts_devclean_processed.pt.
The repo includes a single 1-second reference audio 1111.mp3 (~35 KB) for quick testing — no download required.
python -m personavoice.demo.api_server
# Open http://localhost:8000 in browserThe web UI exposes:
/api/health— backend health check/api/architecture— full v10.4.8 module graph (powers the frontend visualization)/api/clone— voice cloning with optional chat history (persona injection)/api/persona/analyze— Big Five trait radar from chat history/api/persona/calibrate— interactive persona recalibration/api/demo/sample— bundled demo sample
from personavoice.tts_backbone.f5_pretrained_backbone import load_pretrained_f5tts_backbone
from personavoice.config import SOTA_CONFIG
# 1. Load backbone (F5-TTS + FiLM + OES adapters)
backbone, _ = load_pretrained_f5tts_backbone(device="cuda", use_film=True)
# 2. All inference parameters auto-loaded from config.py — single source of truth
mel_gen, laag_info = backbone.laag_synthesize(
text_str="Hello world, this is a voice cloning test.",
mel_ref=mel_ref, # 1-second reference mel (T_ref, 100)
speaker_emb=speaker_emb, # ECAPA 192-d
persona_emb=persona_emb, # 64-d (zeros for no-persona)
emotion_emb=emotion_emb, # 4-d (zeros for no-emotion)
tokenizer=tokenizer,
ref_text=ref_text, # F5-TTS official ref_text concatenation
audio_ref=audio_ref, # F5-TTS official cond=audio flow
)
# 3. Adjust OES environment weight (0.0 = studio quality, SOTA config)
backbone.set_env_weight(SOTA_CONFIG.env_weight) # 0.0python examples/clone_demo.py
# Outputs saved to outputs/1111_clone_test_*.wav# (requires LibriTTS prepared via scripts/prepare_data.py)
python -m personavoice.experiment.eval_200_samples
# Outputs:
# results/eval_200_samples.json — raw per-sample results
# results/eval_200_statistics.json — statistical significance
# results/eval_200_by_length.json — stratified by text length# Install CosyVoice (https://github.com/FunAudioLLM/CosyVoice) and/or TTS (pip install TTS)
python -m personavoice.experiment.baseline_external --n_samples 200 --models cosyvoice xtts
# Output: results/baseline_external_200.jsonpython -m personavoice.experiment.cfr_analysis
# Output: results/cfr_analysis.jsonpython -m personavoice.experiment.visualize
# Outputs:
# results/figures/figure4_pareto_frontier.png
# results/figures/figure5_waveform_spectrogram.png
# results/figures/figure6_attention_heatmap.pngAll parameters are centralized in personavoice/config.py — the single source of truth.
| Parameter | Value | Description |
|---|---|---|
| steps | 32 | ODE integration steps (v10.4: reduced from 96 for speed, RTF -45%) |
| sway_coef | -1.0 | Sway Sampling (F5-TTS official default) |
| cfg_strength | 2.5 | Static CFG (v10.4: tuned for SECS, replaces TD-CFG) |
| env_weight | 0.0 | OES: studio quality (SECS SOTA) |
| oes_env_scale_init | 0.1 | OES gradual init (v10.0 fix for 1111.mp3) |
| use_ceag | False | CEAG disabled in v10.4.7+ (negligible incremental effect; code preserved) |
| ceag_lambda_max | 0.20 | CEAG guidance strength (used when enabled) |
| ceag_t_start | 0.1 | CEAG activation start |
| ceag_t_end | 0.4 | CEAG activation end |
| ceag_layers | (-2,-1) | CEAG attention extraction layers |
| use_laag | True | LAAG dynamic chunking + dynamic FiLM |
| laag_chunk_max_chars | 135 | LAAG chunk size cap |
| laag_cfg_base | 2.5 | LAAG base CFG |
| laag_cfg_alpha | 0.0 | Dynamic CFG α (disabled) |
| best_of_n | 1 | Best-of-N disabled (manifold collapse can't be fixed by sampling) |
| use_silero_vad | True | Silero VAD preprocessing (v10.0) |
| use_rms_normalize | True | RMS energy normalization (v10.0) |
| rms_target | 0.1 | RMS target level |
| silero_vad_threshold | 0.5 | Silero VAD trigger threshold |
PersonaVoice/
├── ARCHITECTURE.md # Architecture design (v10.4.8)
├── SOTA_VERIFICATION_REPORT.md # SOTA verification report (honest trade-off edition)
├── README.md / README_zh.md # Documentation
├── CONTRIBUTING.md # Contribution guidelines
├── CITATION.cff # Citation metadata
├── LICENSE # MIT (note F5-TTS backbone is CC-BY-NC-4.0)
├── requirements.txt # Python dependencies
├── setup.py # pip-installable package
├── 1111.mp3 # Bundled 1s reference demo audio (~35KB)
├── scripts/ # Setup & data prep
│ ├── download_models.py # One-click model download
│ ├── prepare_data.py # LibriTTS download + preprocess
│ ├── setup_env.ps1 # Windows one-click env setup
│ └── setup_env.sh # Linux/macOS one-click env setup
├── .github/ # CI, issue/PR templates, discussion template
│ ├── workflows/python-package.yml
│ └── ISSUE_TEMPLATE/ # bug_report, feature_request, experiment_reproduction
├── personavoice/
│ ├── config.py # ★ Unified SOTA config (single source of truth)
│ ├── microaug/ # Module: Ultra-short sample enhancement
│ │ └── cross_manifold_refiner.py # OES (v10.0 Core B, env_scale=0.1)
│ ├── tts_backbone/ # Module: F5-TTS DiT backbone
│ │ ├── f5_pretrained_backbone.py # Backbone + adapters (F5 official flow)
│ │ ├── ceag_sampler.py # ★ CEAG (v10.0 Core C, with padding mask; disabled)
│ │ ├── laag_generator.py # ★ LAAG (v10.1 Core A, dynamic FiLM + mel splice fix)
│ │ └── vocoder.py # Vocos vocoder
│ ├── persona/ # Module: Persona extraction pipeline
│ │ └── extractor.py # BERT chat → Big Five → persona_emb
│ ├── common/ # Shared utilities
│ │ └── local_models.py # Local model paths (offline mode)
│ ├── demo/ # Interactive demo
│ │ ├── api_server.py # FastAPI server (unified config + persona integration)
│ │ ├── audio_preprocess.py # ★ Silero VAD + RMS normalize (v10.0 Core E)
│ │ └── index.html # Web UI (architecture visualization + cloning + persona radar)
│ └── experiment/ # Evaluation scripts (top-conference experiments)
│ ├── eval_200_samples.py # 200-sample statistical evaluation (PV vs F5 baseline)
│ ├── baseline_external.py # External SOTA comparison (CosyVoice / XTTS v2)
│ ├── cfr_analysis.py # Catastrophic Failure Rate analysis
│ ├── comprehensive_evaluator.py # SECS + WER + UTMOS + RTF + SIM-o
│ ├── ecapa_evaluator.py # ECAPA-TDNN SECS evaluation
│ ├── wer_evaluator.py # Whisper-based WER evaluation
│ ├── visualize.py # Pareto frontier + waveform/spectrogram + attention heatmap
│ └── utils.py # Shared utilities (data loading, tokenizer, logger)
├── examples/ # Usage examples & quick-start demos
│ ├── clone_demo.py # 1111.mp3 clone verification (Quick Start)
│ └── api_demo.py # Frontend API end-to-end test
└── results/ # Experiment results (JSON + figures)
├── eval_200_samples.json
├── eval_200_statistics.json
├── honest_metrics_v10.4.7.json
├── ablation_200_samples.json
├── ablation_200_statistics.json
├── cfr_analysis.json
├── baseline_external_200.json
└── figures/
├── figure4_pareto_frontier.png
├── figure5_waveform_spectrogram.png
└── figure6_attention_heatmap.png
- Freeze F5-TTS pretrained backbone (first 20 layers)
- Train only lightweight adapters (FiLM + OES, ~2 M parameters)
- CEAG is an inference-time optimization, zero training cost
- 8 GB VRAM sufficient for training and inference
- Total trainable: ~31.9 M (9.39% of 339.6 M)
All modules must pass 200-sample paired t-test (p < 0.05) and Cohen's d (> 0.1) verification. v10.0 removed 5 ineffective modules (IBOP, AM-ODE, TD-CFG, SBM, GRPO). v10.3+ removed Duration Predictor (F5 official formula is more accurate). v10.4.7+ disabled CEAG and Best-of-N after ablations showed no incremental benefit on the LAAG baseline.
- UTMOS is reported as an Objective Proxy, not Human MOS
- Short-text WER (0.4066, CFR 35.8%) is openly reported as a Flow-Matching architectural limitation, left as Future Work
- Long-text WER (0.1158) is honestly reported as worse than XTTS v2 (0.10) — a trade-off, not a defeat
- Estimated CFR values for baselines are explicitly marked
* - Human MOS evaluation is pending (planned: 20 raters, Naturalness MOS + Similarity MOS)
PersonaVoice builds on and references the following prior work. We gratefully acknowledge the open-source community.
- F5-TTS (SWivid/F5-TTS, Cheng et al., 2024) — Flow-Matching with DiT backbone, CC-BY-NC-4.0. PersonaVoice freezes the first 20 layers and trains lightweight FiLM adapters.
- Vocos (gemelo-ai/vocos, Siuzdak, 2023) — Neural vocoder, MIT. Used for mel → waveform.
- Flow Matching (Lipman et al., 2023) — Generative framework underlying F5-TTS.
- ECAPA-TDNN (SpeechBrain, Desplanques et al., 2020) — Speaker encoder for SECS evaluation, Apache-2.0.
- BERT (bert-base-chinese, Devlin et al., 2019) — Persona text style encoder.
- Whisper (openai/whisper-large-v3-turbo, Radford et al., 2023) — ASR for WER evaluation & ref_text transcription, MIT.
- CosyVoice (FunAudioLLM/CosyVoice, 2024) — Alibaba's zero-shot TTS, used as external baseline.
- XTTS v2 (coqui/XTTS-v2, Casanova et al., 2024) — Coqui's 3s voice cloning baseline.
- VALL-E (Wang et al., 2023) — Early neural codec language model for TTS.
- VoiceBox (Le et al., 2024) — Flow-Matching TTS at scale.
- Big Five Personality Traits (McCrae & Costa, 1992) — Theoretical basis for persona extraction.
- FiLM (Perez et al., 2018) — Feature-wise Linear Modulation, the adapter mechanism we use.
- LoRA (Hu et al., 2022) — Inspired our plug-in adapter design philosophy.
- Silero VAD (snakers4/silero-vad, 2021) — Neural VAD replacing WebRTC GMM, MIT.
- UTMOS (Saeki et al., 2022) — Objective MOS prediction (reported as Objective Proxy).
- SECS — Speaker Encoder Cosine Similarity, standard in zero-shot TTS evaluation.
- WER — Word Error Rate via Whisper ASR.
- CFR — Catastrophic Failure Rate, proposed in this project for tail-risk analysis.
If we missed any work that should be cited, please open an issue — academic accuracy is a priority.
If PersonaVoice helps your research, please cite:
@article{personavoice2026,
title = {PersonaVoice v10.4.8: Plug-in Adapters for Persona-Driven 1-Second
Voice Cloning — A Trade-off Analysis of Timbre Fidelity
and Text-Alignment Stability in Flow-Matching TTS},
author = {PersonaVoice Team},
journal = {arXiv preprint},
year = {2026},
note = {Version 10.4.8. Code: https://github.com/personavoice/personavoice}
}See also CITATION.cff for machine-readable metadata.
- PersonaVoice codebase (adapters, integrations, novel modules OES/LAAG/CEAG): MIT
- F5-TTS pretrained backbone: CC-BY-NC-4.0 (non-commercial)
- Vocos, ECAPA-TDNN, Whisper, Silero VAD: see their respective licenses
Commercial use of F5-TTS pretrained backbone requires compliance with CC-BY-NC-4.0. See LICENSE for details.
- Issues: GitHub Issues — use the appropriate template (bug / feature / reproduction)
- Discussions: GitHub Discussions — research questions, design debate, use cases
- Contributing: see CONTRIBUTING.md
Keywords (for academic search discoverability): voice cloning, zero-shot TTS, few-shot voice cloning, 1-second voice cloning, flow matching, F5-TTS, speech synthesis, persona-driven TTS, emotional TTS, plug-in adapter, FiLM modulation, speaker embedding, ECAPA-TDNN, orthogonal decomposition, length-adaptive generation, manifold collapse, trade-off analysis, LibriTTS, Vocos vocoder, Silero VAD, Big Five personality, BERT persona, Whisper ASR, WER, SECS, UTMOS, CFR, catastrophic failure rate.

