logo
STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence

Zihan Liu* · Zhikang Niu* · Qiuyang Xiao · Zhisheng Zheng · Ruoqi Yuan · Yuhang Zang
Yuhang Cao · Xiaoyi Dong · Jianze Liang · Xie Chen · Leilei Sun · Dahua Lin · Jiaqi Wang

* Equal Contribution. Corresponding authors.

📖arXiv | 🌐Homepage | 🤗Dataset

📢 News

  • 🚀 [10/28/2025] We have released the STAR-Bench 🏠repository and 🌐homepage.

  • 🚀 [10/28/2025] STAR-Bench v1.0 is now available on 🤗HuggingFace!

    Compared with v0.5 (introduced in our arXiv paper), v1.0 features revised and refined questions and answers for the spatial tasks, improving their clarity and quality. 📌 Please cite this version (v1.0) when reporting results going forward. The leaderboard will be updated soon.

🌈Overview

We formalize audio 4D intelligence, defined as reasoning over sound dynamics in time and 3D space, and introduce STAR-Bench to measure it. STAR-Bench combines a Foundational Acoustic Perception setting (six attributes under absolute and relative regimes) with a Holistic Spatio-Temporal Reasoning setting that includes segment reordering for continuous and discrete processes and spatial tasks spanning static localization, multi-source relations, and dynamic trajectories.

teaser

Unlike prior benchmarks, where caption-only answering reduces accuracy only slightly, STAR-Bench induces far larger drops (-31.5% temporal, -35.2% spatial), evidencing its focus on cues that are hard to describe linguistically. Evaluating 19 models reveals substantial gaps to humans and a clear capability hierarchy. STAR-Bench provides critical insights and a clear path forward for developing future models with a more robust understanding of the physical world.

Benchmark examples are illustrated below. You can also visit the 🌐homepage for a more intuitive overview.

STAR-Bench Examples

📊Results and Analysis

Evaluation results of various models on STAR-Bench v0.5 are shown below. The leaderboard for v1.0 will be released soon.

Results

Error distribution across temporal and spatial tasks:

Results

💡 Key Insights

  • 🔥 A clear capability hierarchy between the two groups. Closed-source models are bottlenecked by fine-grained perception, while open-source models lag across perception, knowledge, and reasoning.
  • 🔥 Enhancing dense audio captioning. Open-source models struggle to produce dense, fine-grained captions, which limits their perceptual sensitivity and ability to extract embedded knowledge. Bridging this gap is a crucial first step.
  • 🔥 Improving multi-audio reasoning. Open-source models lag significantly in comparing, integrating, and grounding information across multiple audio clips.
  • 🔥 Moving beyond channel-averaged audio preprocessing. The common practice of averaging multi-channel audio into a mono signal is a major bottleneck for spatial reasoning. Developing architectures that natively process multi-channel cues is essential for unlocking genuine spatial awareness (see the small illustration after this list).
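
To make the last point concrete, the toy snippet below (not part of this repository; the tone frequency, delay, and gains are arbitrary) shows how averaging a stereo signal to mono collapses the interaural time and level differences that spatial reasoning relies on.

import numpy as np

fs = 16000
t = np.arange(fs) / fs  # one second of samples

# A 500 Hz tone that reaches the left channel 0.5 ms earlier and louder than the
# right channel; these interaural time/level differences are the core spatial cues.
left = 1.0 * np.sin(2 * np.pi * 500 * t)
right = 0.6 * np.sin(2 * np.pi * 500 * (t - 0.0005))

stereo = np.stack([left, right])  # shape (2, n_samples): cues preserved
mono = stereo.mean(axis=0)        # common preprocessing: both cues collapsed into one channel

# The interaural level difference is measurable in stereo but undefined after averaging.
ild_db = 20 * np.log10(np.abs(left).max() / np.abs(right).max())
print(f"ILD before averaging: {ild_db:.1f} dB; after averaging there is a single channel and no ILD.")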

⚙️Data Curation

All audio for the foundational perception task is synthesized using precise parameterization or the Pyroomacoustics physics-based simulator, providing complete control over acoustic parameters. Domain experts rigorously validate the task difficulty levels, which are then calibrated through human testing.
For the holistic spatio-temporal reasoning task, the curation process comprises four key stages, including human annotation and final selection based on human performance, as illustrated below.

pipeline
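
The exact generation scripts and simulation parameters used for STAR-Bench are not reproduced here; as a rough illustration of the control Pyroomacoustics offers, a minimal simulation of one source recorded by a two-microphone array might look like the sketch below (room size, absorption, and positions are placeholder values).

import numpy as np
import pyroomacoustics as pra

fs = 16000
# Placeholder shoebox room: 6 m x 5 m x 3 m with a uniform absorption coefficient.
room = pra.ShoeBox([6.0, 5.0, 3.0], fs=fs, materials=pra.Material(0.3), max_order=10)

# One second of noise as a stand-in source signal placed inside the room.
signal = np.random.randn(fs)
room.add_source([2.0, 3.0, 1.5], signal=signal)

# A two-microphone array; the inter-mic spacing determines the spatial cues captured.
mic_positions = np.c_[[3.00, 2.0, 1.5], [3.05, 2.0, 1.5]]
room.add_microphone_array(pra.MicrophoneArray(mic_positions, fs))

room.simulate()
multichannel_audio = room.mic_array.signals  # shape: (n_mics, n_samples)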

🛠️ Test Your Model!

The code in ALMEval_code/ is partially adapted from VLMEvalKit and Kimi-Audio-Evalkit.
It provides a unified evaluation pipeline for multimodal large models on STAR-Bench.

Step 1: Prepare Environment

git clone https://github.com/InternLM/StarBench.git
cd StarBench
conda create -n starbench python=3.10
conda activate starbench
pip install -r requirements.txt
cd ALMEval_code

Step 2: Get STAR-Bench v1.0 Dataset

Download the STAR-Bench v1.0 dataset from 🤗HuggingFace:

huggingface-cli download --repo-type dataset --resume-download <repo_name> --local-dir your_local_data_dir 
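
If you prefer a programmatic download, the equivalent call with the huggingface_hub Python library looks roughly like this (keep the <repo_name> placeholder; use the dataset repo id linked from 🤗HuggingFace above):

from huggingface_hub import snapshot_download

# Substitute <repo_name> with the STAR-Bench dataset repo id linked above.
snapshot_download(
    repo_id="<repo_name>",
    repo_type="dataset",
    local_dir="your_local_data_dir",
)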

Step 3: Set Up Your Model for Evaluation

Currently supported models include: Qwen2.5-Omni, Qwen2-Audio-Instruct, DeSTA2.5-Audio, Phi4-MM, Kimi-Audio, MiDashengLM, Step-Audio-2-mini, Gemma-3n-E4B-it, Ming-Lite-Omni-1.5, Xiaomi-MiMo-Audio, MiniCPM-O-v2.6, Audio Flamingo 3, Gemini, and GPT-4o Audio.

To integrate a new model, create a new file yourmodel.py under the models/ directory and implement the function generate_inner().

✅ Example: generate_inner()

def generate_inner(self, msg):
    """
    Args:
        msg: dict with the following structure:
            {
                "meta": {
                    "id": ...,
                    "task": ...,
                    "category": ...,
                    "sub-category": ...,
                    "options": ...,
                    "answer": ...,
                    "answer_letter": ...,
                    "rotate_id": ...,
                    "seed": ...
                },
                "prompts": [
                    {"type": "text", "value": "xxxx"},
                    {"type": "audio", "value": "audio1.wav"},
                    {"type": "text", "value": "xxxx"},
                    {"type": "audio", "value": "audio2.wav"},
                    ...
                ]
            }

    Returns:
        str: the model's textual response.
    """
    # Build the model input from the interleaved text/audio items in msg["prompts"],
    # run inference, and return the raw text output.
    return "your model output here"
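
Putting it together, a new file under models/ might look roughly like the sketch below. The base class and registration details inside ALMEval_code are not shown, and run_inference is a hypothetical helper; only the NAME attribute (referenced by base_model in Step 4) and generate_inner() follow from the steps above.

# models/yourmodel.py -- minimal sketch; everything beyond NAME and generate_inner is assumed.

class MyModel:
    NAME = "my_model"  # referenced by `base_model` in models/model.yaml (see Step 4)

    def __init__(self, model_path, **kwargs):
        # Load your checkpoint / processor here; `model_path` comes from init_args.
        self.model_path = model_path

    def generate_inner(self, msg):
        # Split the interleaved prompt items into text strings and audio file paths.
        texts = [p["value"] for p in msg["prompts"] if p["type"] == "text"]
        audios = [p["value"] for p in msg["prompts"] if p["type"] == "audio"]
        # Run your model on the interleaved inputs and return its raw text answer.
        return self.run_inference(texts, audios)  # hypothetical inference helper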

Step 4: Configure Model Settings

Modify the configuration file models/model.yaml.

For existing models, you may need to update parameters such as model_path to match your local model weight path.

To add a new model variant, follow these steps:

  1. Create a new top-level key for your alias (e.g., 'my_model_variant:').
  2. Set 'base_model' to the NAME attribute of the corresponding Python class.
  3. Add any necessary arguments for the class's __init__ method under init_args.

Example:

qwen25-omni:
  base_model: qwen25-omni
  init_args:
    model_path: your_model_weight_path_here

Step 5: Run Evaluation

Run the following command:

python ./run.py \
  --model qwen25-omni \
  --data starbench_default \
  --dataset_root your_local_data_dir  \
  --work-dir ./eval_results

Evaluation results will be automatically saved to the ./eval_results directory.

You can also evaluate specific subtasks or their combinations by modifying the --data argument. The full list of available task names can be found in ALMEval_code/datasets/__init__.py.

Example: Evaluate only the temporal reasoning and spatial reasoning tasks:

python ./run.py \
  --model qwen25-omni \
  --data tr sr \
  --dataset_root your_local_data_dir  \
  --work-dir ./eval_results

✒️Citation

TBD

📄 License

Usage and License Notices: The data and code are intended and licensed for research use only.

Acknowledgement

We sincerely thank 2077AI for providing the platform that supported our data annotation, verification, and review processes.
