InsiderX-Pro/SE-Bridge-TTS

Bridging the Stability-Expressivity Gap

Synthetic Data Scaling and Preference Alignment
for Low-Resource Spoken Language Models

SE-Bridge-TTS | ICML 2026

SE-Bridge-TTS is a low-resource Thai and Lao speech synthesis project. It studies a practical failure mode in spoken language models: synthetic data improves pronunciation stability, but too much flat synthetic speech erodes prosody and speaker expressivity. The release provides public Thai and Lao CosyVoice2-compatible checkpoints on Hugging Face.

What This Work Does

| Capability | What it enables |
| --- | --- |
| Thai DGSA | Recovers expressive speech at high synthetic-data ratios while preserving intelligibility and speaker identity. |
| Lao TDSC | Builds a pure-synthetic self-improvement loop for Lao TTS and zero-shot voice cloning without authentic target-language recordings. |
| Open checkpoints | Provides public Thai and Lao CosyVoice2-compatible weights, audio demos, a project page, and Hugging Face inference guidance. |

Methods

DGSA: Disentanglement-Guided Self-Alignment

[Figure: Disentanglement-Guided Self-Alignment method overview]

DGSA uses the prosody-timbre separation in flow-matching SLMs to build preference pairs that reward stable, expressive speech without collapsing speaker identity.

TDSC: Temperature-Driven Self-Critique

[Figure: Temperature-Driven Self-Critique method overview]

TDSC samples candidates across conservative-to-expressive temperatures, filters them with automatic quality checks, and iteratively improves low-resource synthesis when real target-language speech is unavailable.
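
One round of the sample-and-filter step described above might look roughly like the sketch below. Everything in it is an assumption made for illustration: the `synthesize` and `score_cer` callables, the temperature grid, the `max_cer` threshold, and the rule of keeping the highest-temperature candidate that passes the check are all hypothetical and are not taken from the paper or the released code.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Sequence


@dataclass
class Candidate:
    text: str
    temperature: float
    audio: Any   # whatever the synthesizer returns (placeholder type)
    cer: float   # automatic quality score for this candidate


def tdsc_round(
    texts: Iterable[str],
    synthesize: Callable[[str, float], Any],   # hypothetical TTS call
    score_cer: Callable[[str, Any], float],    # hypothetical ASR-based check
    temperatures: Sequence[float] = (0.6, 0.8, 1.0, 1.2),  # assumed grid
    max_cer: float = 0.09,                                  # assumed threshold
) -> List[Candidate]:
    """Sample each text at several temperatures and keep one filtered candidate."""
    kept: List[Candidate] = []
    for text in texts:
        passing = []
        for temperature in temperatures:
            audio = synthesize(text, temperature)
            cer = score_cer(text, audio)
            if cer <= max_cer:
                passing.append(Candidate(text, temperature, audio, cer))
        if passing:
            # Assumed selection rule: among candidates that pass the quality
            # check, prefer the most expressive (highest-temperature) one.
            kept.append(max(passing, key=lambda c: c.temperature))
    # In an iterative setup, `kept` would become training data for the next
    # round; that fine-tuning step is outside this sketch.
    return kept
```

Texts for which no candidate passes the check are simply dropped in this sketch, so a stricter `max_cer` trades data volume for stability.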

Main Results

Paper accuracy is reported as Accuracy = 100 - WER, so higher is better. Higher SIM, NMOS, and SMOS are also better.

| Paper setting | SE-Bridge-TTS | Strong comparison | Takeaway |
| --- | --- | --- | --- |
| Thai standard TTS | 61.1% accuracy (38.9 WER), 4.51 NMOS | Azure: 63.5% accuracy (36.5 WER), 4.01 NMOS; ElevenLabs-v3: 59.4% accuracy, 4.21 NMOS | Comparable intelligibility with stronger naturalness. |
| Lao standard TTS | 70.2% accuracy (29.8 WER), 4.53 NMOS | Gemini Flash: 65.8% accuracy, 4.12 NMOS; MMS-TTS: 55.2% accuracy, 3.52 NMOS | Best accuracy and naturalness among tested systems. |
| Thai zero-shot cloning | 61.1% accuracy (38.9 WER), 0.84 SIM, 4.51 SMOS | ElevenLabs-v3: 57.7% accuracy (42.3 WER), 0.78 SIM, 4.23 SMOS | Better intelligibility and speaker preservation. |
| Lao zero-shot cloning | 70.2% accuracy (29.8 WER), 0.81 SIM, 4.32 SMOS | Other tested systems: not supported | Public Lao voice cloning capability. |
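
As a quick arithmetic check, the Accuracy = 100 - WER convention can be applied to the rows that quote both numbers. The helper name `accuracy_from_wer` is made up for this example; the WER and accuracy values are the ones listed above.

```python
def accuracy_from_wer(wer_percent: float) -> float:
    """Paper-style accuracy in percent: 100 - WER, rounded to one decimal."""
    return round(100.0 - wer_percent, 1)


# (WER, reported accuracy) pairs for rows that quote both values.
reported = {
    "Thai standard TTS, SE-Bridge-TTS": (38.9, 61.1),
    "Thai standard TTS, Azure": (36.5, 63.5),
    "Lao standard TTS, SE-Bridge-TTS": (29.8, 70.2),
    "Thai zero-shot cloning, ElevenLabs-v3": (42.3, 57.7),
}

for name, (wer, accuracy) in reported.items():
    # Each reported accuracy should equal 100 minus the quoted WER.
    assert accuracy_from_wer(wer) == accuracy, name
```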

Selected demos are available on the project page, including Thai standard TTS, Lao standard TTS, Thai cloning, and Lao cloning.

Open FLEURS Evaluation

We also publish a reproducible FLEURS benchmark comparing the released SE-Bridge-TTS checkpoints with three recent open multilingual TTS systems: Higgs Audio v3, OmniVoice, and X-Voice Stage1. The evaluation uses 255 paired Lao/Thai target sentences and reference prompts from Lao, Thai, Chinese, and English. It tests whether a model can synthesize Lao/Thai speech while preserving the prompt speaker.

Although SE-Bridge-TTS was developed before these newer open systems, it still reaches the best overall calibrated accuracy in the Chinese/English prompt -> Lao/Thai target setting, while remaining competitive on speaker similarity.

Accuracy keeps the CER signal but makes it easier to read:

calibrated CER = max(0, generated CER - ground-truth CER)
Accuracy = 1 - calibrated CER

Ground-truth CER is the ASR error measured on original FLEURS target audio, so the metric discounts recognizer baseline errors and focuses on synthesis degradation. Higher is better for Accuracy and speaker similarity.
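
A minimal sketch of the metric follows, assuming CER is a plain character-level Levenshtein distance divided by the reference length. The function names are illustrative, and details such as text normalization, handling of Thai/Lao combining marks, and whether the subtraction is applied per utterance or per corpus are not specified here; the published protocol is the authority on those.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Character-level Levenshtein distance (insert, delete, substitute)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of an ASR transcript against the reference text."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def calibrated_accuracy(generated_cer: float, ground_truth_cer: float) -> float:
    """Accuracy = 1 - max(0, generated CER - ground-truth CER)."""
    calibrated_cer = max(0.0, generated_cer - ground_truth_cer)
    return 1.0 - calibrated_cer
```

For example, a generated CER of 0.30 against a ground-truth CER of 0.05 gives a calibrated CER of 0.25 and an Accuracy of 0.75, while a generated CER below the ground-truth CER is clamped to an Accuracy of 1.0.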

Best results are bold. Second-best results are underlined.

| Model | Supported samples | Accuracy | Speaker similarity |
| --- | --- | --- | --- |
| Higgs Audio v3 | 1020/1020 | <u>78.2%</u> | 0.520 |
| OmniVoice | 1020/1020 | 75.9% | **0.645** |
| SE-Bridge-TTS | 1020/1020 | **83.4%** | <u>0.593</u> |
| X-Voice Stage1 | 510/1020 | 53.7% | 0.361 |

Each cell in the per-direction table below reports calibrated CER (lower is better) / speaker similarity (higher is better).

| Target | Prompt | Higgs Audio v3 | OmniVoice | SE-Bridge-TTS | X-Voice Stage1 |
| --- | --- | --- | --- | --- | --- |
| Lao | Lao | 0.2330 / 0.699 | 0.3912 / 0.771 | 0.2170 / 0.694 | - / - |
| Lao | English | 0.4491 / 0.492 | 0.4532 / 0.537 | 0.3408 / 0.459 | - / - |
| Lao | Chinese | 0.3828 / 0.651 | 0.4306 / 0.711 | 0.2603 / 0.726 | - / - |
| Thai | Thai | 0.0095 / 0.761 | 0.0210 / 0.794 | 0.0264 / 0.763 | 0.1879 / 0.774 |
| Thai | English | 0.0310 / 0.263 | 0.0307 / 0.586 | 0.0268 / 0.452 | 0.8227 / -0.019 |
| Thai | Chinese | 0.0089 / 0.674 | 0.0497 / 0.745 | 0.0356 / 0.736 | 0.1035 / 0.741 |

This public run covers same-language Lao/Thai prompts plus Chinese and English prompts to Lao/Thai targets. Unsupported Lao directions for X-Voice Stage1 are counted as coverage failures and excluded from quality averages.
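
The overall Accuracy and speaker-similarity figures appear to match an unweighted mean over the four Chinese/English prompt directions (255 samples each, 1020 in total), with unsupported directions skipped. That reading is inferred from the published numbers rather than quoted from the protocol, so treat the sketch below as a consistency check, not as the official table renderer. The dictionary layout and the `overall` helper are illustrative.

```python
from statistics import mean

# (calibrated CER, SIM) per model for the Chinese/English prompt rows of the
# per-direction table. None marks an unsupported direction.
CROSS_LINGUAL = {
    ("Lao", "English"): {
        "Higgs Audio v3": (0.4491, 0.492), "OmniVoice": (0.4532, 0.537),
        "SE-Bridge-TTS": (0.3408, 0.459), "X-Voice Stage1": None,
    },
    ("Lao", "Chinese"): {
        "Higgs Audio v3": (0.3828, 0.651), "OmniVoice": (0.4306, 0.711),
        "SE-Bridge-TTS": (0.2603, 0.726), "X-Voice Stage1": None,
    },
    ("Thai", "English"): {
        "Higgs Audio v3": (0.0310, 0.263), "OmniVoice": (0.0307, 0.586),
        "SE-Bridge-TTS": (0.0268, 0.452), "X-Voice Stage1": (0.8227, -0.019),
    },
    ("Thai", "Chinese"): {
        "Higgs Audio v3": (0.0089, 0.674), "OmniVoice": (0.0497, 0.745),
        "SE-Bridge-TTS": (0.0356, 0.736), "X-Voice Stage1": (0.1035, 0.741),
    },
}


def overall(model: str):
    """Return (accuracy in percent, mean SIM, number of supported directions)."""
    cells = [row[model] for row in CROSS_LINGUAL.values() if row[model] is not None]
    accuracy = 100.0 * (1.0 - mean(cer for cer, _ in cells))
    similarity = mean(sim for _, sim in cells)
    return accuracy, similarity, len(cells)


for model in ("Higgs Audio v3", "OmniVoice", "SE-Bridge-TTS", "X-Voice Stage1"):
    accuracy, similarity, directions = overall(model)
    print(f"{model}: {accuracy:.1f}% accuracy, {similarity:.3f} SIM, "
          f"{directions * 255}/1020 samples")
```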

The full protocol, machine-readable results, and table renderer are in evaluation/fleurs-lo-th-255pair.

Use the Weights

The release checkpoints are hosted at:

https://huggingface.co/isabeth/SE-Bridge-TTS

For inference:

  1. Open the Hugging Face model card above.
  2. Download thai_tts.pt or lao_tts.pt from the model repository.
  3. Follow the CosyVoice2 loading example in the model card.

| File | Language | Recommended use |
| --- | --- | --- |
| thai_tts.pt | Thai | CosyVoice2 cross-lingual inference by default; optional zero-shot use with caution |
| lao_tts.pt | Lao | CosyVoice2 cross-lingual inference |
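
The download step can also be scripted. The sketch below uses `hf_hub_download` from the `huggingface_hub` package and assumes the two checkpoint files sit at the root of the model repository; the helper names are made up for this example. Loading the checkpoint into CosyVoice2 is not shown, since the model card is the reference for that step.

```python
from pathlib import Path
from typing import Optional

REPO_ID = "isabeth/SE-Bridge-TTS"
# File names as listed in the table above; their location at the repository
# root is an assumption.
CHECKPOINTS = {"thai": "thai_tts.pt", "lao": "lao_tts.pt"}


def checkpoint_filename(language: str) -> str:
    """Map a language name to the released checkpoint file name."""
    key = language.strip().lower()
    if key not in CHECKPOINTS:
        raise ValueError(f"unsupported language {language!r}; expected one of {sorted(CHECKPOINTS)}")
    return CHECKPOINTS[key]


def download_checkpoint(language: str, cache_dir: Optional[str] = None) -> Path:
    """Download one checkpoint from Hugging Face and return its local path."""
    # Imported lazily so checkpoint_filename() works without the dependency.
    from huggingface_hub import hf_hub_download

    local_path = hf_hub_download(
        repo_id=REPO_ID,
        filename=checkpoint_filename(language),
        cache_dir=cache_dir,
    )
    return Path(local_path)
```

Calling `download_checkpoint("thai")` would fetch `thai_tts.pt` into the local Hugging Face cache; the returned path can then be used with the CosyVoice2 loading example from the model card.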

This GitHub repository is intentionally lightweight: it hosts the project page, audio demos, paper links, and release pointers; the runnable checkpoint package lives on Hugging Face.

Links

| Resource | Link |
| --- | --- |
| Project page and audio browser | https://insiderx-pro.github.io/SE-Bridge-TTS/ |
| Paper | https://arxiv.org/abs/2605.27383 |
| Weights and inference notes | https://huggingface.co/isabeth/SE-Bridge-TTS |
| FLEURS evaluation protocol and results | evaluation/fleurs-lo-th-255pair/ |
| Demo metadata | assets/data/demo-data.json |

Citation

@inproceedings{geng2026bridging,
  title = {Bridging the Stability-Expressivity Gap: Synthetic Data Scaling and Preference Alignment for Low-Resource Spoken Language Models},
  author = {Geng, Yizhong and Li, Yanliang and Yang, Jinghan and Jiang, Tianhan and An, Boxun and Li, Ya and Shen, Xiaoyu},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year = {2026}
}

Acknowledgements

SE-Bridge-TTS thanks the following open-source projects:

  • FunAudioLLM/CosyVoice, which provides the CosyVoice2 toolkit and inference stack used by the released checkpoints.
  • facebook/mms-tts, a multilingual open-source TTS baseline included in the project demos.
  • Higgs Audio v3, OmniVoice, and X-Voice, recent open multilingual speech generation systems referenced in the public FLEURS evaluation.

We thank the maintainers and contributors of these projects for making multilingual and low-resource speech synthesis research easier to reproduce, compare, and extend.
