**Stability comparison.** As the noise scale increases from 0% to 100%, StableToken maintains highly stable token sequences (bottom), while the baseline tokenizer (middle) shows significant instability and jitter.
| Date | News |
|---|---|
| 2026-02-28 | 🎉 Initial release of StableToken on GitHub and HuggingFace! |
| 2026-01-26 | 🎉 Our paper has been accepted to ICLR 2026! |
Existing semantic speech tokenizers suffer from critical instability in noisy environments, causing downstream SpeechLLMs to generate inconsistent or erroneous outputs when processing real-world audio.
StableToken solves this through two key innovations:
- **Voting-LFQ**: A novel multi-voter quantization mechanism that achieves robust consensus under noise
- **Noise-Aware Consensus Training**: A multi-branch training paradigm that enhances representational stability by achieving a global consensus between the noisy and clean branches
This results in:
- ✅ 2.5× more stable than existing tokenizers (UED: 10.17% vs. 26.17%)
- ✅ High-quality speech reconstruction from discrete tokens
- ✅ Seamless integration with downstream LLMs
- ✅ Superior downstream performance for SpeechLLMs on noisy audio
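The multi-voter consensus idea behind Voting-LFQ can be illustrated with a toy sketch: several "voter" views of the same latent are binarized by sign (as in lookup-free quantization), and the token id is formed from the bitwise majority, so a single flipped bit in one voter does not change the token. The voter count, bit width, tie-breaking rule, and helper names below are illustrative assumptions, not the paper's exact formulation.

```python
# Toy sketch of multi-voter sign quantization in the spirit of Voting-LFQ.
# Voter count, bit width, and the majority rule are illustrative assumptions.
from typing import List

def sign_bits(latent: List[float]) -> List[int]:
    """LFQ-style binarization: each latent dimension becomes one bit via its sign."""
    return [1 if x >= 0 else 0 for x in latent]

def majority_vote(votes: List[List[int]]) -> List[int]:
    """Bitwise majority across voters; ties (even voter counts) round up to 1."""
    n = len(votes)
    return [1 if sum(col) * 2 >= n else 0 for col in zip(*votes)]

def bits_to_token(bits: List[int]) -> int:
    """Interpret the consensus bit vector as an integer token id."""
    out = 0
    for b in bits:
        out = (out << 1) | b
    return out

# Three voters see slightly perturbed views of the same 4-dim latent.
voters = [
    [0.9, -0.2, 0.10, -0.7],
    [0.8, -0.3, -0.05, -0.6],   # one bit flips under noise...
    [1.0, -0.1, 0.20, -0.8],
]
consensus = majority_vote([sign_bits(v) for v in voters])
print(bits_to_token(consensus))  # ...but the consensus token id (10) is unchanged
```

A single tokenizer head would emit a different token whenever any sign flips; the majority over several voters absorbs isolated flips, which is the intuition behind the robustness gain.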
```shell
git clone --recursive https://github.com/Tencent/StableToken.git
cd StableToken && pip install -r requirements.txt
```

**Detailed Installation Guide Using Conda**
1. Clone the repository with submodules:

   ```shell
   git clone --recursive https://github.com/Tencent/StableToken.git
   cd StableToken
   ```

   If you have already cloned without `--recursive`:

   ```shell
   git submodule init
   git submodule update
   ```

2. Create a conda environment:

   ```shell
   conda create -n stabletoken python=3.10.13 -y
   conda activate stabletoken
   ```

3. Install dependencies:

   ```shell
   conda install -c conda-forge libsndfile -y
   pip install -r requirements.txt
   ```
Using huggingface-cli:

```shell
huggingface-cli download tencent/StableToken --local-dir checkpoints/StableToken
```

Or using Python:

```python
from huggingface_hub import snapshot_download
snapshot_download(repo_id="tencent/StableToken", local_dir="checkpoints/StableToken")
```

Then run the example script:

```shell
python example_usage.py \
    --device auto \
    --model_path checkpoints/StableToken \
    --audio_path /path/to/audio.wav
```

Command Line Arguments:
| Argument | Type | Default | Description |
|---|---|---|---|
| `--device` | str | `auto` | Device for inference (`auto`, `cpu`, `cuda`, `cuda:0`, etc.) |
| `--model_path` | str | Required | Path to the model directory |
| `--audio_path` | list[str] | Required | Path(s) to input audio file(s) |
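For orientation, the documented arguments could be wired up with the standard-library `argparse` module roughly as below. This is a hedged sketch of a parser matching the table; the actual parser in `example_usage.py` may differ in details.

```python
# Sketch of an argparse parser matching the documented CLI (an assumption,
# not necessarily how example_usage.py implements it).
import argparse

parser = argparse.ArgumentParser(description="StableToken example CLI (sketch)")
parser.add_argument("--device", type=str, default="auto",
                    help="Device for inference (auto, cpu, cuda, cuda:0, etc.)")
parser.add_argument("--model_path", type=str, required=True,
                    help="Path to the model directory")
parser.add_argument("--audio_path", type=str, nargs="+", required=True,
                    help="Path(s) to input audio file(s)")

# Parse an explicit argument list instead of sys.argv for demonstration.
args = parser.parse_args(
    ["--model_path", "checkpoints/StableToken",
     "--audio_path", "sample_en.wav", "sample_zh.wav"]
)
print(args.device, len(args.audio_path))  # auto 2
```

Note that `nargs="+"` is what lets `--audio_path` accept several files in one flag, as in the example command below.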
Example Command:

```shell
python example_usage.py \
    --device cuda \
    --model_path checkpoints/StableToken \
    --audio_path sample_en.wav sample_zh.wav
```

Example Output:

```
================================== Arguments ===================================
Using device: cuda
Model path: checkpoints/StableToken
Audio path: ['assets/sample_en.wav', 'assets/sample_zh.wav']
--------------------------------------------------------------------------------
================================= Tokenization =================================
[1/2] `sample_en.wav` Generated 443 tokens:
    [2963, 3232, 3236, 3556, 3301, ...]
--------------------------------------------------------------------------------
[2/2] `sample_zh.wav` Generated 381 tokens:
    [3283, 7271, 7239, 5214, 5183, ...]
--------------------------------------------------------------------------------
================================ Reconstruction ================================
[1/2] Reconstructed audio saved to: `reconstruction/sample_en.wav`
[2/2] Reconstructed audio saved to: `reconstruction/sample_zh.wav`
--------------------------------------------------------------------------------
```
For a complete runnable example, please refer to `example_usage.py`. Below is a simplified example of using the core components:

```python
import os
from transformers import WhisperFeatureExtractor
from src.model.modeling_whisper import WhisperLFQEncoder
from src.utils.flow_inference import AudioDecoder
from src.utils.utils import extract_speech_token, speech_token_to_wav

# 1. Load Models
model_dir = "path/to/model"
tokenizer = WhisperLFQEncoder.from_pretrained(os.path.join(model_dir, "tokenizer")).eval().cuda()
feature_extractor = WhisperFeatureExtractor.from_pretrained(os.path.join(model_dir, "tokenizer"))
decoder = AudioDecoder(
    config_path=os.path.join(model_dir, "decoder", "config.yaml"),
    flow_ckpt_path=os.path.join(model_dir, "decoder", "flow.pt"),
    hift_ckpt_path=os.path.join(model_dir, "decoder", "hift.pt"),
    device="cuda"
)

# 2. Tokenize
tokens = extract_speech_token(tokenizer, feature_extractor, ["audio.wav"], device="cuda")[0]

# 3. Reconstruct
tts_speech, sampling_rate = speech_token_to_wav(decoder, tokens)
```

We recommend using WAV format; however, StableToken supports all audio formats compatible with torchaudio (including .flac, .mp3, etc.).
StableToken achieves state-of-the-art noise robustness while maintaining high reconstruction quality.
| Model | Frame Rate | Codebook Size | Noise Robustness (UED%, ↓) |
|---|---|---|---|
| GLM-4-Voice-Tokenizer | 12.5Hz | 16,384 | 31.10 |
| S3 Tokenizer | 25Hz | 4,096 | 26.17 |
| CosyVoice2 | 25Hz | 6,561 | 38.66 |
| StableToken | 25Hz | 8,192 | **10.17** |
> [!NOTE]
> UED (Unit Edit Distance) measures the edit distance between token sequences from clean and noisy audio. Lower UED indicates better noise robustness. StableToken achieves a 60% UED reduction over the best existing supervised semantic tokenizer.
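Since UED is an edit distance between the token sequences of clean and noisy versions of the same audio, a minimal sketch looks like the following. The normalization (here by clean-sequence length, reported as a percentage) is an assumption for illustration, not necessarily what `ued.py` computes.

```python
# Minimal sketch of a UED-style metric: Levenshtein edit distance between
# clean and noisy token sequences. The percentage normalization by clean
# length is an assumption, not the exact formula used by ued.py.
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance over token ids."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,      # deletion
                                     dp[j - 1] + 1,  # insertion
                                     prev + (x != y))  # substitution / match
    return dp[-1]

clean = [2963, 3232, 3236, 3556, 3301]
noisy = [2963, 3232, 7239, 3556, 3301, 42]  # one substitution, one insertion
ued = 100.0 * edit_distance(clean, noisy) / len(clean)
print(f"{ued:.1f}%")  # 40.0%
```

A perfectly noise-robust tokenizer would emit identical sequences for both versions, giving a UED of 0%.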
Before running the evaluation, you need to prepare a parquet file containing paired clean and noisy audio data. You can use the audiomentations library to add noise to clean audio samples.
> [!TIP]
> Data Format: The current code in `ued.py` expects the parquet file to contain specific columns: `audio_en_clean`, `audio_en_noise`, `audio_zh_clean`, and `audio_zh_noise`. You can easily modify the column names in the script to match your custom dataset structure.
```shell
python ued.py \
    --model_path checkpoints/StableToken \
    --parquet_files /path/to/data.parquet \
    --output_file ./UED_results/ued_results.json
```

Measurements of Word Error Rate (WER, ↓) and Mean Opinion Score (MOS, ↑) on LibriSpeech (LS) and SEED benchmarks.
| Model | Frame Rate | BPS | WER (↓) LS-clean | WER (↓) LS-other | WER (↓) SEED-en | WER (↓) SEED-zh | MOS (↑) LS-clean | MOS (↑) LS-other | MOS (↑) SEED-en | MOS (↑) SEED-zh |
|---|---|---|---|---|---|---|---|---|---|---|
| GLM-4-Voice-Tokenizer | 12.5Hz | 175 | 4.04 | 9.33 | 3.54 | 3.23 | 4.07 | 3.99 | 4.16 | 4.10 |
| S3 Tokenizer | 25Hz | 300 | 5.78 | 13.38 | 5.91 | 4.26 | 3.40 | 3.31 | 3.40 | 3.31 |
| CosyVoice2 | 25Hz | 325 | 4.25 | 9.68 | 4.34 | 2.75 | 3.36 | 3.25 | 3.31 | 3.58 |
| StableToken | 25Hz | 325 | 3.84 | 7.99 | 3.44 | 2.62 | 4.09 | 3.83 | 4.01 | 4.18 |
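The BPS (bits per second) column is consistent with frame rate × bits per token, taking bits per token as ⌈log2(codebook size)⌉. This reconstruction is inferred from the table values rather than stated in the text, but it reproduces every row above.

```python
# BPS = frame rate x ceil(log2(codebook size)); an inference from the table,
# not a formula stated in the document.
import math

def bps(frame_rate_hz: float, codebook_size: int) -> int:
    bits_per_token = math.ceil(math.log2(codebook_size))
    return round(frame_rate_hz * bits_per_token)

for name, fr, k in [("GLM-4-Voice-Tokenizer", 12.5, 16384),
                    ("S3 Tokenizer", 25, 4096),
                    ("CosyVoice2", 25, 6561),
                    ("StableToken", 25, 8192)]:
    print(name, bps(fr, k))  # 175, 300, 325, 325 respectively
```

Note that CosyVoice2's codebook of 6,561 = 3^8 needs 12.68 bits, which rounds up to 13 bits per token, matching StableToken's BPS despite the smaller codebook.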
| Model | Frame Rate | Codebook Size | BPS | Download |
|---|---|---|---|---|
| StableToken | 25Hz | 8,192 | 325 | |
We thank the authors of GLM-4-Voice for their open-source code.
If you find StableToken useful for your research, please cite:
```
@article{song2025stabletoken,
  title={StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs},
  author={Song, Yuhan and Zhang, Linhao and Wu, Chuhan and Liu, Aiwei and Jia, Wei and Wang, Houfeng and Zhou, Xiao},
  journal={arXiv preprint arXiv:2509.22220},
  year={2025}
}
```

This project is licensed under the License Term of StableToken.

