Bridging the Stability-Expressivity Gap

Synthetic Data Scaling and Preference Alignment
for Low-Resource Spoken Language Models

SE-Bridge-TTS | ICML 2026

SE-Bridge-TTS is a low-resource Thai and Lao speech synthesis project. It studies a practical failure mode in spoken language models: synthetic data improves pronunciation stability, but too much flat synthetic speech erodes prosody and speaker expressivity. The release provides public Thai and Lao CosyVoice2-compatible checkpoints on Hugging Face.

What This Work Does

Capability	What it enables
Thai DGSA	Recovers expressive speech at high synthetic-data ratios while preserving intelligibility and speaker identity.
Lao TDSC	Builds a pure-synthetic self-improvement loop for Lao TTS and zero-shot voice cloning without authentic target-language recordings.
Open checkpoints	Provides public Thai and Lao CosyVoice2-compatible weights, audio demos, a project page, and Hugging Face inference guidance.

Methods

DGSA: Disentanglement-Guided Self-Alignment

DGSA uses the prosody-timbre separation in flow-matching SLMs to build preference pairs that reward stable, expressive speech without collapsing speaker identity.

TDSC: Temperature-Driven Self-Critique

TDSC samples candidates across conservative-to-expressive temperatures, filters them with automatic quality checks, and iteratively improves low-resource synthesis when real target-language speech is unavailable.
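The sample-then-filter idea can be illustrated with a toy loop. This is only a sketch under stated assumptions, not the paper's implementation: the `synthesize` and `score` callables, the temperature grid, the CER and similarity thresholds, and the rule of keeping the highest-temperature sample that passes are all hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Candidate:
    text: str
    temperature: float
    audio: object  # whatever the (hypothetical) synthesizer returns
    cer: float     # character error rate from an ASR check, lower is better
    sim: float     # speaker similarity to the prompt, higher is better


def select_candidates(
    texts: Sequence[str],
    synthesize: Callable[[str, float], object],
    score: Callable[[str, object], Tuple[float, float]],
    temperatures: Sequence[float] = (0.6, 0.8, 1.0, 1.2),  # assumed grid
    max_cer: float = 0.10,  # assumed quality threshold
    min_sim: float = 0.70,  # assumed identity threshold
) -> List[Candidate]:
    """Sample each text at several temperatures and keep one passing sample."""
    kept: List[Candidate] = []
    for text in texts:
        passing = []
        for temperature in temperatures:
            audio = synthesize(text, temperature)
            cer, sim = score(text, audio)
            # Automatic quality check: drop unintelligible or off-speaker samples.
            if cer <= max_cer and sim >= min_sim:
                passing.append(Candidate(text, temperature, audio, cer, sim))
        if passing:
            # Assumed preference: the most expressive (highest-temperature)
            # sample among those that survive the filter.
            kept.append(max(passing, key=lambda c: c.temperature))
    return kept
```

In a self-improvement setting, the returned candidates would serve as training data for the next round, after which the loop repeats with the updated model.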

Main Results

The paper reports accuracy as Accuracy = 100 - WER (both in percent), so higher is better. Higher SIM (speaker similarity), NMOS (naturalness MOS), and SMOS (speaker-similarity MOS) are also better.

Paper setting	SE-Bridge-TTS	Strong comparison	Takeaway
Thai standard TTS	61.1% accuracy (38.9 WER), 4.51 NMOS	Azure: 63.5% accuracy (36.5 WER), 4.01 NMOS; ElevenLabs-v3: 59.4% accuracy, 4.21 NMOS	Comparable intelligibility with stronger naturalness.
Lao standard TTS	70.2% accuracy (29.8 WER), 4.53 NMOS	Gemini Flash: 65.8% accuracy, 4.12 NMOS; MMS-TTS: 55.2% accuracy, 3.52 NMOS	Best accuracy and naturalness among tested systems.
Thai zero-shot cloning	61.1% accuracy (38.9 WER), 0.84 SIM, 4.51 SMOS	ElevenLabs-v3: 57.7% accuracy (42.3 WER), 0.78 SIM, 4.23 SMOS	Better intelligibility and speaker preservation.
Lao zero-shot cloning	70.2% accuracy (29.8 WER), 0.81 SIM, 4.32 SMOS	Other tested systems: not supported	Public Lao voice cloning capability.
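As a quick consistency check on the accuracy convention, the snippet below recomputes accuracy from the WER values listed in the table. The function name is an illustrative choice; the numbers are copied from the rows above.

```python
def accuracy_from_wer(wer_percent: float) -> float:
    """Accuracy = 100 - WER, both in percent, rounded to one decimal."""
    return round(100.0 - wer_percent, 1)


# (setting and system, WER %, accuracy % as listed in the table)
rows = [
    ("Thai standard TTS, SE-Bridge-TTS", 38.9, 61.1),
    ("Thai standard TTS, Azure", 36.5, 63.5),
    ("Lao standard TTS, SE-Bridge-TTS", 29.8, 70.2),
    ("Thai zero-shot cloning, ElevenLabs-v3", 42.3, 57.7),
]

for name, wer, listed in rows:
    computed = accuracy_from_wer(wer)
    # Every listed accuracy should match 100 - WER.
    assert computed == listed, name
    print(f"{name}: WER {wer}% -> accuracy {computed}%")
```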

Selected demos are available on the project page, including Thai standard TTS, Lao standard TTS, Thai cloning, and Lao cloning.

Open FLEURS Evaluation

We also publish a reproducible FLEURS benchmark comparing the released SE-Bridge-TTS checkpoints with three recent open multilingual TTS systems: Higgs Audio v3, OmniVoice, and X-Voice Stage1. The evaluation uses 255 paired Lao/Thai target sentences and reference prompts from Lao, Thai, Chinese, and English. It tests whether a model can synthesize Lao/Thai speech while preserving the prompt speaker.

Although SE-Bridge-TTS was developed before these newer open systems, it still reaches the best overall calibrated accuracy in the Chinese/English prompt -> Lao/Thai target setting, while remaining competitive on speaker similarity.

Accuracy keeps the CER signal but makes it easier to read:

calibrated CER = max(0, generated CER - ground-truth CER)
Accuracy = 1 - calibrated CER

Ground-truth CER is the ASR error measured on original FLEURS target audio, so the metric discounts recognizer baseline errors and focuses on synthesis degradation. Higher is better for Accuracy and speaker similarity.
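A minimal sketch of the two formulas, using made-up CER values. It applies the calibration to a single pair of numbers and makes no claim about the granularity (per utterance or per corpus) at which the evaluation applies it.

```python
def calibrated_cer(generated_cer: float, ground_truth_cer: float) -> float:
    """Subtract the recognizer's baseline error, clamping at zero."""
    return max(0.0, generated_cer - ground_truth_cer)


def accuracy(generated_cer: float, ground_truth_cer: float) -> float:
    """Accuracy = 1 - calibrated CER."""
    return 1.0 - calibrated_cer(generated_cer, ground_truth_cer)


# Hypothetical values: ASR reads synthesized audio with CER 0.30 and the
# original recording with CER 0.08, so 0.22 is attributed to synthesis.
print(round(accuracy(0.30, 0.08), 2))  # 0.78

# If synthesized audio is recognized better than the recording, the clamp
# keeps calibrated CER at 0 and Accuracy at 1.
print(accuracy(0.05, 0.08))  # 1.0
```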

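The summary table averages the four cross-lingual directions (Chinese or English prompt, Lao or Thai target; 4 × 255 = 1020 samples). SE-Bridge-TTS has the best Accuracy and Higgs Audio v3 the second best; OmniVoice has the best speaker similarity and SE-Bridge-TTS the second best.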

Model	Supported samples	Accuracy	Speaker similarity
Higgs Audio v3	1020/1020	78.2%	0.520
OmniVoice	1020/1020	75.9%	0.645
SE-Bridge-TTS	1020/1020	83.4%	0.593
X-Voice Stage1	510/1020	53.7%	0.361

Each detail cell below is Cal. CER↓ / SIM↑.

Target	Prompt	Higgs Audio v3	OmniVoice	SE-Bridge-TTS	X-Voice Stage1
Lao	Lao	0.2330 / 0.699	0.3912 / 0.771	0.2170 / 0.694	- / -
Lao	English	0.4491 / 0.492	0.4532 / 0.537	0.3408 / 0.459	- / -
Lao	Chinese	0.3828 / 0.651	0.4306 / 0.711	0.2603 / 0.726	- / -
Thai	Thai	0.0095 / 0.761	0.0210 / 0.794	0.0264 / 0.763	0.1879 / 0.774
Thai	English	0.0310 / 0.263	0.0307 / 0.586	0.0268 / 0.452	0.8227 / -0.019
Thai	Chinese	0.0089 / 0.674	0.0497 / 0.745	0.0356 / 0.736	0.1035 / 0.741

This public run covers same-language Lao/Thai prompts plus Chinese and English prompts to Lao/Thai targets. Unsupported Lao directions for X-Voice Stage1 are counted as coverage failures and excluded from quality averages.
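The summary numbers can be recomputed from the detail table. The sketch below assumes each summary value is the unweighted mean over the Chinese/English-prompt directions a model supports, an assumption the published figures are consistent with. The dictionary layout and function name are illustrative.

```python
from statistics import mean

MODELS = ("Higgs Audio v3", "OmniVoice", "SE-Bridge-TTS", "X-Voice Stage1")

# (target, prompt) -> one (calibrated CER, SIM) cell per model, in MODELS order.
# None marks an unsupported direction.
DETAIL = {
    ("Lao", "Lao"): ((0.2330, 0.699), (0.3912, 0.771), (0.2170, 0.694), None),
    ("Lao", "English"): ((0.4491, 0.492), (0.4532, 0.537), (0.3408, 0.459), None),
    ("Lao", "Chinese"): ((0.3828, 0.651), (0.4306, 0.711), (0.2603, 0.726), None),
    ("Thai", "Thai"): ((0.0095, 0.761), (0.0210, 0.794), (0.0264, 0.763), (0.1879, 0.774)),
    ("Thai", "English"): ((0.0310, 0.263), (0.0307, 0.586), (0.0268, 0.452), (0.8227, -0.019)),
    ("Thai", "Chinese"): ((0.0089, 0.674), (0.0497, 0.745), (0.0356, 0.736), (0.1035, 0.741)),
}


def summarize(model: str):
    """Return (accuracy %, mean SIM, supported samples) over cross-lingual prompts."""
    idx = MODELS.index(model)
    cells = [
        row[idx]
        for (_target, prompt), row in DETAIL.items()
        if prompt in ("English", "Chinese") and row[idx] is not None
    ]
    acc = 100.0 * (1.0 - mean(cer for cer, _ in cells))
    sim = mean(s for _, s in cells)
    # 255 sentences per direction, so supported samples = 255 * directions.
    return round(acc, 1), round(sim, 3), 255 * len(cells)


for model in MODELS:
    print(model, summarize(model))
```

Under that assumption the output matches the summary table for all four models, including the 510/1020 coverage for X-Voice Stage1.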

The full protocol, machine-readable results, and table renderer are in evaluation/fleurs-lo-th-255pair.

Use the Weights

The release checkpoints are hosted at:

https://huggingface.co/isabeth/SE-Bridge-TTS

For inference:

1. Open the Hugging Face model card above.
2. Download `thai_tts.pt` or `lao_tts.pt` from the model repository.
3. Follow the CosyVoice2 loading example in the model card.

File	Language	Recommended use
`thai_tts.pt`	Thai	CosyVoice2 cross-lingual inference by default; optional zero-shot use with caution
`lao_tts.pt`	Lao	CosyVoice2 cross-lingual inference
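The download step can also be scripted. The sketch below uses `hf_hub_download` from the `huggingface_hub` package and assumes the two checkpoint files sit at the root of the model repository; the helper names are illustrative. Loading the checkpoint into CosyVoice2 is not shown here, since the model card is the reference for that step.

```python
REPO_ID = "isabeth/SE-Bridge-TTS"

# Checkpoint file names as listed in the table above.
CHECKPOINTS = {"thai": "thai_tts.pt", "lao": "lao_tts.pt"}


def checkpoint_filename(language: str) -> str:
    """Map a language name to its checkpoint file name."""
    key = language.strip().lower()
    if key not in CHECKPOINTS:
        raise ValueError(f"unsupported language: {language!r}")
    return CHECKPOINTS[key]


def download_checkpoint(language: str) -> str:
    """Download one checkpoint and return its local cache path.

    Assumes the file is stored at the repository root. Requires network
    access and `pip install huggingface_hub`.
    """
    # Imported here so checkpoint_filename works without the package installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=checkpoint_filename(language))


# Usage (needs network access):
#   path = download_checkpoint("thai")
```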

This GitHub repository is intentionally lightweight: it hosts the project page, audio demos, paper links, and release pointers; the runnable checkpoint package lives on Hugging Face.

Links

Resource	Link
Project page and audio browser	https://insiderx-pro.github.io/SE-Bridge-TTS/
Paper	https://arxiv.org/abs/2605.27383
Weights and inference notes	https://huggingface.co/isabeth/SE-Bridge-TTS
FLEURS evaluation protocol and results	`evaluation/fleurs-lo-th-255pair/`
Demo metadata	`assets/data/demo-data.json`

Citation

@inproceedings{geng2026bridging,
  title = {Bridging the Stability-Expressivity Gap: Synthetic Data Scaling and Preference Alignment for Low-Resource Spoken Language Models},
  author = {Geng, Yizhong and Li, Yanliang and Yang, Jinghan and Jiang, Tianhan and An, Boxun and Li, Ya and Shen, Xiaoyu},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year = {2026}
}

Acknowledgements

SE-Bridge-TTS thanks the following open-source projects:

FunAudioLLM/CosyVoice, which provides the CosyVoice2 toolkit and inference stack used by the released checkpoints.
facebook/mms-tts, a multilingual open-source TTS baseline included in the project demos.
Higgs Audio v3, OmniVoice, and X-Voice, recent open multilingual speech generation systems referenced in the public FLEURS evaluation.

We thank the maintainers and contributors of these projects for making multilingual and low-resource speech synthesis research easier to reproduce, compare, and extend.
