Building real-time speech-text foundation models that can both understand and generate speech has attracted significant attention; notable examples include GPT-4o and Moshi. However, training such models remains challenging for the research community. We introduce RSTnet, a new open-source platform for developing real-time speech-text foundation models. RSTnet provides a comprehensive framework for data processing, pre-training, and fine-tuning, aimed at helping researchers build their own real-time speech-text models efficiently. It builds on previous works such as the real-time spoken dialogue model Moshi and the universal audio generation model UniAudio. RSTnet consists of the following key components: (1) data preparation; (2) streaming audio codec models; (3) speech-text foundation models; (4) benchmarking and evaluation.
- 2025.3.4. We release the second version of RSTnet, which supports pre-training speech-text foundation models. Please refer to MLLM_v2 for details.
- 2024.10.7. We release the first version of RSTnet.
The project is still ongoing. If you are interested in RSTnet, you are welcome to contribute. You can:
- [1] Open an issue or PR to fix bugs
- [2] Propose ideas about data collection, streaming codecs, and speech-text foundation models
- [3] Join us as an author of this project (contact: dcyang@se.cuhk.edu.hk)
```bash
conda create -n RSTnet python=3.12
conda activate RSTnet
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install tqdm
pip install librosa==0.9.1
pip install matplotlib
pip install omegaconf
pip install einops
pip install vector_quantize_pytorch
pip install tensorboard
pip install deepspeed
pip install peft
```
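After installation, you can optionally run a quick sanity check to confirm that PyTorch was installed with CUDA support:

```python
# Optional environment check for the setup above.
import torch
import torchaudio

print("torch:", torch.__version__, "| torchaudio:", torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())
```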
You can find our technical report at https://github.com/yangdongchao/RSTnet/blob/main/RSTnet.pdf
More details will be added soon. Please refer to the DataPipeline part.
We plan to support more state-of-the-art streaming audio codecs. Currently, we have reproduced MimiCodec.
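MimiCodec compresses speech into discrete tokens with residual vector quantization (RVQ). The snippet below is only a minimal illustration of the RVQ idea, using the vector_quantize_pytorch package from the install list above; it is not the MimiCodec implementation in this repo, and the streaming encoder that would produce the latents is omitted.

```python
# Minimal RVQ illustration (not the MimiCodec implementation shipped here).
import torch
from vector_quantize_pytorch import ResidualVQ

# 8 stacked quantizers, each with a 1024-entry codebook, over 512-d latents.
rvq = ResidualVQ(dim=512, num_quantizers=8, codebook_size=1024)

latents = torch.randn(1, 100, 512)            # (batch, frames, dim), e.g. from an encoder
quantized, codes, commit_loss = rvq(latents)  # codes: (1, 100, 8) discrete tokens
print(codes.shape)
```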
We have released (1) the fine-tuning code for Moshi and (2) pre-training recipes for speech-text foundation models, which offer the following advantages:
(1) Supports any LLM backbone, including LLaMA, Gemma, Mistral, Phi, StableLM, and Qwen
(2) Supports LoRA fine-tuning of the LLM backbone to save GPU resources (a minimal sketch follows this list)
(3) Supports full training of the MLLM if you have enough GPUs
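As a minimal sketch of how LoRA adapters attach to an LLM backbone, here is an illustrative setup with the peft package installed above (the model ID, rank, and target modules are examples, not the exact settings used in our recipes; transformers is pulled in as a peft dependency):

```python
# Illustrative LoRA setup with peft; not the exact RSTnet recipe configuration.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Any supported backbone works; this Hugging Face model ID is just an example.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

lora_cfg = LoraConfig(
    r=16,                                # low-rank adapter dimension
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], # attention projections in Qwen/LLaMA-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()       # only the adapter weights are trainable
```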
The implementations of the streaming audio codec and speech-text language models are based on the following codebases: https://github.com/kyutai-labs/moshi and https://github.com/yangdongchao/UniAudio
```bibtex
@techreport{RSTnet,
  title={RSTnet: Real-time Speech-Text Foundation Model Toolkit},
  author={RSTnet team},
  note={Technical report},
  year={2024}
}
```
```bibtex
@techreport{kyutai2024moshi,
  author={Alexandre D\'efossez and Laurent Mazar\'e and Manu Orsini and Am\'elie Royer and
          Patrick P\'erez and Herv\'e J\'egou and Edouard Grave and Neil Zeghidour},
  title={Moshi: a speech-text foundation model for real-time dialogue},
  institution={Kyutai},
  year={2024},
  month={September},
  url={http://kyutai.org/Moshi.pdf},
}
```
```bibtex
@article{yang2023uniaudio,
  title={UniAudio: An Audio Foundation Model Toward Universal Audio Generation},
  author={Dongchao Yang and Jinchuan Tian and Xu Tan and Rongjie Huang and Songxiang Liu and Xuankai Chang and Jiatong Shi and Sheng Zhao and Jiang Bian and Xixin Wu and Zhou Zhao and Helen Meng},
  journal={arXiv preprint arXiv:2310.00704},
  year={2023}
}
```