Zihan Liu*
·
Zhikang Niu*
·
Qiuyang Xiao
·
Zhisheng Zheng
·
Ruoqi Yuan
·
Yuhang Zang†
·
Yuhang Cao
·
Xiaoyi Dong
·
Jianze Liang
·
Xie Chen
·
Leilei Sun
·
Dahua Lin
·
Jiaqi Wang†
* Equal Contribution. †Corresponding authors.
- 🚀 [10/28/2025] We have released the STAR-Bench 🏠repository and 🌐homepage.
- 🚀 [10/28/2025] STAR-Bench v1.0 is now available on 🤗HuggingFace!
Compared with v0.5 (introduced in our arXiv paper), v1.0 features revised and refined Questions & Answers for improved clarity and quality for spatial tasks. 📌 Please cite this version (v1.0) when reporting results going forward. The leaderboard will be updated soon.
We formalize audio 4D intelligence, defined as reasoning over sound dynamics in time and 3D space, and introduce STAR-Bench to measure it. STAR-Bench combines a Foundational Acoustic Perception setting (six attributes under absolute and relative regimes) with a Holistic Spatio-Temporal Reasoning setting that includes segment reordering for continuous and discrete processes and spatial tasks spanning static localization, multi-source relations, and dynamic trajectories.
Unlike prior benchmarks, where caption-only answering reduces accuracy only slightly, STAR-Bench induces far larger drops (-31.5% temporal, -35.2% spatial), evidencing its focus on cues that are hard to describe linguistically. Evaluating 19 models reveals substantial gaps to humans and a clear capability hierarchy. STAR-Bench provides critical insights and a clear path forward for developing future models with a more robust understanding of the physical world. Benchmark examples are illustrated below; you can also visit the 🌐homepage for a more intuitive overview.
Evaluation results of various models on STAR-Bench v0.5 are shown below. The leaderboard for v1.0 will be released soon.
Error distribution across temporal and spatial tasks:
- 🔥 A clear capability hierarchy between the two groups. Closed-source models are bottlenecked by fine-grained perception, while open-source models lag across perception, knowledge, and reasoning.
- 🔥 Enhancing dense audio captioning. Open-source models struggle to produce dense, fine-grained captions, which limits their perceptual sensitivity and ability to extract embedded knowledge. Bridging this gap is a crucial first step.
- 🔥 Improving multi-audio reasoning. Open-source models lag significantly in comparing, integrating, and grounding information across multiple audio clips.
- 🔥 Moving beyond channel-averaged audio preprocessing. The common practice of averaging multi-channel audio into a mono signal is a major bottleneck for spatial reasoning. Developing architectures that natively process multi-channel cues is essential for unlocking genuine spatial awareness.
For the holistic spatio-temporal reasoning task, the curation process comprises four key stages, including human annotation and final selection based on human performance, as illustrated below.
The ALMEval_code/ directory is partially adapted from VLMEvalKit and Kimi-Audio-Evalkit.
It provides a unified evaluation pipeline for multimodal large models on STAR-Bench.
Step 1: Prepare Environment
git clone https://github.com/InternLM/StarBench.git
cd StarBench
conda create -n starbench python=3.10
conda activate starbench
pip install -r requirements.txt
cd ALMEval_code
Step 2: Get STAR-Bench v1.0 Dataset
Download the STAR-Bench v1.0 dataset from 🤗HuggingFace:
huggingface-cli download --repo-type dataset --resume-download <repo_name> --local-dir your_local_data_dir
Step 3: Set Up Your Model for Evaluation
Currently supported models include: Qwen2.5-Omni, Qwen2-Audio-Instruct, DeSTA2.5-Audio, Phi4-MM, Kimi-Audio, MiDashengLM, Step-Audio-2-mini, Gemma-3n-E4B-it, Ming-Lite-Omni-1.5, Xiaomi-MiMo-Audio, MiniCPM-O-v2.6, Audio Flamingo 3, Gemini, and GPT-4o Audio.
To integrate a new model, create a new file yourmodel.py under the models/ directory and implement the function generate_inner().
✅ Example: generate_inner()
def generate_inner(self, msg):
    """
    Args:
        msg (dict): input with the following format:
            {
                "meta": {
                    "id": ...,
                    "task": ...,
                    "category": ...,
                    "sub-category": ...,
                    "options": ...,
                    "answer": ...,
                    "answer_letter": ...,
                    "rotate_id": ...,
                    "seed": ...
                },
                "prompts": [
                    {"type": "text", "value": "xxxx"},
                    {"type": "audio", "value": "audio1.wav"},
                    {"type": "text", "value": "xxxx"},
                    {"type": "audio", "value": "audio2.wav"},
                    ...
                ]
            }
    Returns:
        str: the model's textual response.
    """
    # Build the model input from msg["prompts"], run inference,
    # and return the generated text.
    return "your model output here"
Step 4: Configure Model Settings
Modify the configuration file: /models/model.yaml.
For existing models, you may need to update parameters such as model_path to match your local model weight path.
To add a new model variant, follow these steps:
- Create a new top-level key for your alias (e.g., 'my_model_variant:').
- Set 'base_model' to the NAME attribute of the corresponding Python class.
- Add any necessary arguments for the class's __init__ method under init_args.
Example:
qwen25-omni:
  base_model: qwen25-omni
  init_args:
    model_path: your_model_weight_path_here
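For reference, a matching model class in models/ might look like the following sketch. The class name and its body are illustrative assumptions; only the NAME attribute, the __init__(model_path) signature, and the generate_inner() contract come from the steps above.

```python
# models/yourmodel.py -- minimal skeleton (class name is illustrative)
class MyModelVariant:
    # `base_model` in model.yaml must match this NAME attribute.
    NAME = "my_model_variant"

    def __init__(self, model_path):
        # Keys listed under `init_args` in model.yaml are passed here as kwargs.
        self.model_path = model_path
        # Load your model weights / processor here.

    def generate_inner(self, msg):
        # See Step 3 for the expected `msg` format.
        return "your model output here"
```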
Step 5: Run Evaluation
Run the following command:
python ./run.py \
--model qwen25-omni \
--data starbench_default \
--dataset_root your_local_data_dir \
--work-dir ./eval_results
Evaluation results will be automatically saved to the ./eval_results directory.
You can also evaluate specific subtasks or their combinations by modifying the --data argument.
The full list of available task names can be found in
ALMEval_code/datasets/__init__.py.
Example: Evaluate only the temporal reasoning and spatial reasoning tasks:
python ./run.py \
--model qwen25-omni \
--data tr sr \
--dataset_root your_local_data_dir \
--work-dir ./eval_results
TBD
Usage and License Notices: The data and code are intended and licensed for research use only.
We sincerely thank 2077AI for providing the platform that supported our data annotation, verification, and review processes.