OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia

Xuelong Geng, Kun Wei, Qijie Shao, Shuiyun Liu*, Zhennan Lin*, Zhixian Zhao*, Guojian Li*, Wenjie Tian*, Peikun Chen, Yangze Li, Pengcheng Guo, Mingchen Shao, Shuiyuan Wang, Yuang Cao, Chengyou Wang, Tianyi Xu, Yuhang Dai, Xinfa Zhu, Yue Li, Li Zhang, Lei Xie†

Huggingface Test Page
📑 Paper (v2.0) | 📑 Demo | 💬 WeChat (微信)

OSUM is pronounced as ‘awesome’ (/ˈɔː.səm/).

Large Language Models (LLMs) have made significant progress in various downstream tasks, inspiring the development of Speech Understanding Language Models (SULMs) to enable comprehensive speech-based interactions. However, most advanced SULMs are developed in industry, leveraging large-scale datasets and computational resources that are not readily available to the academic community. Moreover, the lack of transparency in training details creates additional barriers to further innovation. In this study, we present OSUM, an Open Speech Understanding Model designed to explore the potential of training SULMs under constrained academic resources. The OSUM model combines a Whisper encoder with a Qwen2 LLM and supports a wide range of speech tasks, including speech recognition (ASR), speech recognition with timestamps (SRWT), vocal event detection (VED), speech emotion recognition (SER), speaking style recognition (SSR), speaker gender classification (SGC), speaker age prediction (SAP), and speech-to-text chat (STTC). By employing an ASR+X training strategy, OSUM achieves efficient and stable multi-task training by simultaneously optimizing ASR alongside target tasks. Beyond delivering strong performance, OSUM emphasizes transparency by providing openly available data preparation and training methodologies, offering valuable insights and practical guidance for the academic community. By doing so, we aim to accelerate research and innovation in advanced SULM technologies.
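One way to picture the ASR+X strategy is that each training target carries the transcript together with the target-task output, so a single training objective covers both. The sketch below is only an illustration under that assumption: `Sample`, `build_asr_plus_x_target`, and the `<LABEL>` tag format are hypothetical names and are not taken from the OSUM codebase.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    transcript: str  # ASR reference text
    task: str        # task abbreviation, e.g. "ASR", "SER", "SGC"
    label: str = ""  # target-task label, e.g. "happy"; unused for plain ASR


def build_asr_plus_x_target(sample: Sample) -> str:
    """Build one target string holding the transcript and the task label.

    Assumption: the label is appended as a "<LABEL>" tag after the transcript.
    The real OSUM target format may differ.
    """
    if sample.task == "ASR":
        # Plain ASR has no extra task output.
        return sample.transcript
    return f"{sample.transcript}<{sample.label.upper()}>"


print(build_asr_plus_x_target(Sample("hello world", "SER", "happy")))
```

Because the transcript is always part of the target in this sketch, every non-ASR sample also contributes to the ASR objective, which is the intuition behind optimizing ASR alongside the target task.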

Architecture

The overview of the architecture and tasks of OSUM.

News and Updates

2025.2.16 🎉 We updated the OSUM technical report to v2.0 and released the checkpoint and the online test page on Hugging Face.

In technical report v2.0, the OSUM model was trained for more steps, and the training data volume increased from 44.1K hours in v1.0 to 50.5K hours:

Speaker gender classification (SGC) data expansion: 3000 hours of SGC data, comprising 1500 hours of existing data augmented with noise and 1500 hours of new data.
Speaker age prediction (SAP) data expansion: The original 3400 hours of age prediction data were augmented with noise, doubling the volume to 6800 hours.
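The figures above can be cross-checked with a little arithmetic. The sketch below assumes that all 3000 SGC hours and the 3400 noise-augmented SAP hours are net additions to the v1.0 corpus; that reading is an interpretation of the numbers listed here, not something stated elsewhere in this README.

```python
# Figures from the v2.0 news item, in hours.
V1_TOTAL_HOURS = 44_100

# Assumption: both SGC portions count as newly added training hours.
SGC_ADDED_HOURS = 1_500 + 1_500  # noise-augmented existing data + new data

# Noise augmentation doubled 3400 h to 6800 h, so 3400 h were added.
SAP_ADDED_HOURS = 6_800 - 3_400


def v2_total_hours(v1_total: int = V1_TOTAL_HOURS) -> int:
    """Return the v2.0 training data volume implied by the listed additions."""
    return v1_total + SGC_ADDED_HOURS + SAP_ADDED_HOURS


print(v2_total_hours() / 1000)  # 50.5 (K hours)
```

Under this reading, the two expansions account for the full increase from 44.1K to 50.5K hours.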

2025.1.22 🔥 We released the OSUM technical report v1.0.

Evaluation

Comparison of Qwen2-Audio and our OSUM model. On most tasks, OSUM outperforms Qwen2-Audio despite using significantly fewer computational resources and less training data.

Evaluation results of ASR tasks on public and internal test sets. Bold font marks the best result on each test set. All results on internal test sets were obtained by running inference ourselves.

Evaluation results of multi-task performance on public and internal test sets. The best result on each test set is highlighted in bold font. Results shown in blue font, as well as those on internal test sets, were obtained by running inference ourselves with the originally released models.

Requirements

pip install -r requirements.txt
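After installing, it can help to confirm that the listed packages are actually present in the active environment. The following is a minimal sketch using only the Python standard library; `missing_requirements` is a hypothetical helper, not part of the OSUM repository. It checks package names only: version pins are ignored, and requirements given as URLs are not handled and may be misreported.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Leading package name of a requirement line, e.g. "torch" in "torch==2.1.0".
_NAME = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*")


def missing_requirements(lines):
    """Return the names of requirements that are not installed.

    `lines` is any iterable of requirement lines, such as an open
    requirements.txt file.
    """
    missing = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options (-r, -e, ...)
            continue
        match = _NAME.match(line)
        if match is None:                     # line does not start with a package name
            continue
        name = match.group(0)
        try:
            version(name)                     # raises if the distribution is absent
        except PackageNotFoundError:
            missing.append(name)
    return missing
```

To use it, pass an open file, for example `missing_requirements(open("requirements.txt", encoding="utf-8"))`; an empty list means every named package was found.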

How to use the OSUM framework for inference and training? Please refer to here

License Agreement

We use the Apache 2.0 license. Researchers and developers are free to use OSUM's code and model weights, including for commercial purposes. See LICENSE.txt for details.

Citation

@article{geng2025osum,
  title={{OSUM}: {Advancing} Open Speech Understanding Models with Limited Resources in Academia},
  author={Geng, Xuelong and Wei, Kun and Shao, Qijie and Liu, Shuiyun and Lin, Zhennan and Zhao, Zhixian and Li, Guojian and Tian, Wenjie and Chen, Peikun and Li, Yangze and others},
  journal={arXiv preprint arXiv:2501.13306},
  year={2025}
}

Contact Us

If you are interested in leaving a message to our research team, feel free to email xlgeng@mail.nwpu.edu.cn.

