Building real-time speech-text foundation models that can both understand and generate speech has attracted significant attention; notable examples include GPT-4o and Moshi. However, training such models remains challenging for the research community. We introduce RSTnet, a new open-source platform for developing real-time speech-text foundation models. RSTnet provides a comprehensive framework for data processing, pre-training, and fine-tuning, aimed at helping researchers build their own real-time speech-text models efficiently. It builds on previous works such as the real-time spoken dialogue model Moshi and the universal audio generation model UniAudio. RSTnet consists of the following key components: (1) data preparation; (2) streaming audio codec models; (3) speech-text foundation models; (4) benchmarking and evaluation.
- 2025.3.4. We release the second version of RSTnet, which supports pre-training speech-text foundation models. Please refer to MLLM_v2 for details.
- 2024.10.7. We release the first version of RSTnet.
The project is still ongoing. If you are interested in RSTnet, you are welcome to contribute. You can:
- [1] Open an issue or PR to fix bugs
- [2] Propose ideas about data collection, streaming codecs, and speech-text foundation models
- [3] Join us as an author of this project (contact: dcyang@se.cuhk.edu.hk)
```bash
conda create -n RSTnet python=3.12
conda activate RSTnet
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install tqdm
pip install librosa==0.9.1
pip install matplotlib
pip install omegaconf
pip install einops
pip install vector_quantize_pytorch
pip install tensorboard
pip install deepspeed
pip install peft
```
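After installation, you can optionally run a quick sanity check to confirm that PyTorch was installed with CUDA support:

```python
# Optional environment check for the setup above.
import torch
import torchaudio

print("torch:", torch.__version__, "| torchaudio:", torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())
```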
You can find our technical report at https://github.com/yangdongchao/RSTnet/blob/main/RSTnet.pdf
More details will be added soon. Please refer to the DataPipeline part.
We plan to support more state-of-the-art streaming audio codecs. Currently, we have reproduced MimiCodec.
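MimiCodec compresses speech into discrete tokens with residual vector quantization (RVQ). The snippet below is only a minimal illustration of the RVQ idea, using the vector_quantize_pytorch package from the install list above; it is not the MimiCodec implementation in this repo, and the streaming encoder that would produce the latents is omitted.

```python
# Minimal RVQ illustration (not the MimiCodec implementation shipped here).
import torch
from vector_quantize_pytorch import ResidualVQ

# 8 stacked quantizers, each with a 1024-entry codebook, over 512-d latents.
rvq = ResidualVQ(dim=512, num_quantizers=8, codebook_size=1024)

latents = torch.randn(1, 100, 512)            # (batch, frames, dim), e.g. from an encoder
quantized, codes, commit_loss = rvq(latents)  # codes: (1, 100, 8) discrete tokens
print(codes.shape)
```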
We have released (1) the fine-tuning code for Moshi and (2) pre-training recipes for speech-text foundation models, which offer the following advantages:
(1) Supports any LLM backbone, including LLaMA, Gemma, Mistral, Phi, StableLM, and Qwen
(2) Supports LoRA fine-tuning of the LLM backbone to save GPU resources (a minimal sketch follows this list)
(3) Supports full training of the MLLM if you have enough GPUs
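As a minimal sketch of how LoRA adapters attach to an LLM backbone, here is an illustrative setup with the peft package installed above (the model ID, rank, and target modules are examples, not the exact settings used in our recipes; transformers is pulled in as a peft dependency):

```python
# Illustrative LoRA setup with peft; not the exact RSTnet recipe configuration.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Any supported backbone works; this Hugging Face model ID is just an example.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

lora_cfg = LoraConfig(
    r=16,                                # low-rank adapter dimension
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], # attention projections in Qwen/LLaMA-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()       # only the adapter weights are trainable
```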
The implementations of the streaming audio codec and speech-text language models are based on the following codebases: https://github.com/kyutai-labs/moshi and https://github.com/yangdongchao/UniAudio
```bibtex
@techreport{RSTnet,
  title={RSTnet: Real-time Speech-Text Foundation Model Toolkit},
  author={RSTnet team},
  note={Technical report},
  year={2024}
}
```
```bibtex
@techreport{kyutai2024moshi,
  author={Alexandre D\'efossez and Laurent Mazar\'e and Manu Orsini and Am\'elie Royer and
          Patrick P\'erez and Herv\'e J\'egou and Edouard Grave and Neil Zeghidour},
  title={Moshi: a speech-text foundation model for real-time dialogue},
  institution={Kyutai},
  year={2024},
  month={September},
  url={http://kyutai.org/Moshi.pdf},
}
```
```bibtex
@article{yang2023uniaudio,
  title={UniAudio: An Audio Foundation Model Toward Universal Audio Generation},
  author={Dongchao Yang and Jinchuan Tian and Xu Tan and Rongjie Huang and Songxiang Liu and Xuankai Chang and Jiatong Shi and Sheng Zhao and Jiang Bian and Xixin Wu and Zhou Zhao and Helen Meng},
  journal={arXiv preprint arXiv:2310.00704},
  year={2023}
}
```