MSU-Bench is a comprehensive benchmark for evaluating multi-speaker conversational understanding with a speaker-centric design. It compiles speaker-related questions about conversational understanding and organizes them into four hierarchical levels of increasing difficulty, comprising 25 tasks with over 1,200 open-ended QAs.
This hierarchical framework covers four progressive tiers: single-speaker static attribute understanding, single-speaker dynamic attribute understanding, multi-speaker background understanding, and multi-speaker interaction understanding. This structure ensures all tasks are grounded in speaker-centric contexts, from basic perception to complex reasoning across multiple speakers.
This document records the key events and milestones of this repository.
| Date | Event |
|---|---|
| 2025-08-10 | 📂 Repository created. |
| 2025-08-10 | ✨ Added initial project structure and demo page. |
| 2025-08-10 | 📝 Updated README with project description and usage instructions. |
| 2025-08-xx | 🚀 |
MSU-Bench progresses from speaker-level perception to complex multi-party interaction reasoning. The progression follows a natural cognitive hierarchy: Tier 1 establishes foundational recognition capabilities for static speaker attributes, Tier 2 extends to temporal dynamics analysis within individual speakers, Tier 3 advances to contextual inference and background understanding across multiple speakers, and Tier 4 culminates in comprehensive multi-speaker interaction understanding.
The QA data in MSU-Bench is produced by automatic labeling and QA generation, followed by automated quality assessment and human screening. Each audio segment used for QA construction is clipped to 60–120 seconds and labeled through an open-source pipeline (SDASR/Speaker Attribute). The table below summarizes the benchmark's dataset statistics, and a rough segmentation sketch follows the table.
| Feature | Description |
|---|---|
| Session Duration | 60–120 s |
| Number of Trials | 1232 |
| Number of Speakers | 2–4 speakers per session |
| Languages | CN, EN |
| Annotations | Speaker diarization, gender, age group, accent, emotions, speech flow, voice quality |
| Formats | Audio (.wav), Metadata (.json), Transcripts (.txt) |
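For reference, a minimal sketch of the 60–120 second segmentation step is shown below; the file paths, the 90-second target length, and the use of soundfile are illustrative assumptions, not the benchmark's actual pipeline.

```python
# Hypothetical sketch: clip a long multi-speaker recording into ~90 s segments
# (within the 60-120 s range used by the benchmark). Paths and segment length
# are illustrative only.
import soundfile as sf

SEGMENT_SECONDS = 90

audio, sr = sf.read("raw_audio/session_001.wav")
samples_per_segment = SEGMENT_SECONDS * sr

for i, start in enumerate(range(0, len(audio), samples_per_segment)):
    segment = audio[start:start + samples_per_segment]
    if len(segment) < 60 * sr:  # skip tail segments shorter than 60 s
        continue
    sf.write(f"audio/session_001_seg{i:03d}.wav", segment, sr)
```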
MSU-Bench constructs its dataset by selecting QAs from six open-source datasets, covering both Chinese and English as well as diverse acoustic environments.
The data sources are as follows: Chinese two-speaker telephone conversation data MDT-AA007, Chinese far-field multi-speaker conversation data Alimeeting, Chinese film and television dialogue data CN-Film, English two-speaker telephone conversation data MDT-AD015, English far-field multi-speaker conversation data CHiME6, and English film and television dialogue data EN-Film. The detailed statistics of these data sources are as follows:
| Name | Lang | Domain | Description | Duration | Open-source |
|---|---|---|---|---|---|
| MDT-AD015 | EN | Telephone conversation | Two-speaker dialogues with background noise and significant channel loss | 304.80 min | ✅ True |
| CHiME-6 | EN | Far-field multi-speaker | Multi-speaker family dinner scenario with background event noise and spatial reverberation | 433.21 min | ✅ True |
| R3VQA-Audio | EN | Complex film scenes | Multi-speaker film/TV drama scenes with background event noise and background music | 2470.66 min | ✅ True |
| MDT-AA007 | CN | Near-field multi-speaker | Multi-speaker podcast conversations with noise, reverberation, and background sound | 378×6 min | ✅ True |
| AliMeeting | CN | Far-field multi-speaker | Multi-speaker far-field meeting data with noise and reverberation | 252 min | ✅ True |
| CN-Film | CN | Complex film scenes | Multi-speaker film/TV drama scenes with background event noise and background music | 1135 min | ✅ True |
To construct the four-tier benchmark with diverse speaker-centric audio-text tasks, we build a rigorous QA generation pipeline (to be fully open-sourced) that automatically produces high-quality question–answer pairs from multi-speaker dialogues spanning various real-world scenarios and acoustic conditions. For each core ability, we design dedicated prompts to guide template construction and question formulation, ensuring that the resulting QA samples are tightly aligned with task-specific objectives.
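The dedicated prompts themselves are not yet released; as a hedged illustration of how per-ability templates could drive question formulation, the sketch below fills a capability-specific template with a diarized transcript. The template text, capability names, and helper function are hypothetical placeholders, not the actual pipeline.

```python
# Hypothetical sketch of prompt-driven QA generation. The templates, capability
# names, and example transcript are illustrative placeholders only.
QA_PROMPT_TEMPLATES = {
    "speaker_static_attribute": (
        "Given the diarized transcript below, write one open-ended question about a "
        "single speaker's static attribute (gender, age group, or accent) and its "
        "reference answer.\n\nTranscript:\n{transcript}"
    ),
    "multi_speaker_interaction": (
        "Given the diarized transcript below, write one open-ended question about how "
        "the speakers interact (interruptions, agreement, turn-taking) and its "
        "reference answer.\n\nTranscript:\n{transcript}"
    ),
}

def build_prompt(capability: str, transcript: str) -> str:
    """Fill the capability-specific template with a diarized transcript."""
    return QA_PROMPT_TEMPLATES[capability].format(transcript=transcript)

# Toy usage
print(build_prompt(
    "multi_speaker_interaction",
    "[S1] I think we should ship on Friday. [S2] Wait, the tests are not done yet.",
))
```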

Fig.5 LALM Evaluation Results on MSU-Bench
MSU-Bench presents comprehensive evaluation results of both state-of-the-art open-source and commercial models on our benchmark (see Fig. 4). For each benchmark tier and capability, we report the average results of all tasks related to that capability, with the task–capability mappings detailed in Fig. 2. Moreover, the comparison of different systems across all 25 tasks is illustrated in Fig. 5, providing an intuitive performance comparison across models.

Fig.6 Task Performance on MSU-Bench
```
MSU-Bench/
│
├── audio/        # Raw audio files
├── transcripts/  # Text transcripts
├── metadata/     # Speaker attributes, emotion labels, etc.
└── examples/     # Sample dialogues
```
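After downloading, a quick way to sanity-check the layout above is to count the entries in each directory; this is a convenience sketch, and the root path is an assumption about where you placed the download.

```python
# Quick sanity check of the assumed directory layout shown above.
from pathlib import Path

root = Path("MSU-Bench")  # adjust to your local download path
for sub in ["audio", "transcripts", "metadata", "examples"]:
    folder = root / sub
    count = sum(1 for _ in folder.iterdir()) if folder.is_dir() else 0
    print(f"{sub}: {count} entries")
```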
Audio Annotation Components:
- Speaker Identity & Attributes — gender, age group, etc.
- Paralinguistic Cues — pitch, volume variations.
- Conversational Events — interruptions, overlaps.
- Background Context — relationship between speakers, scene type.
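The exact metadata schema is defined by the released JSON files; as a rough sketch of how these annotation components might be read (the field names "speakers", "gender", "age_group", "accent", and "events" are assumptions, not the confirmed schema):

```python
# Hypothetical sketch of iterating over annotation components; field names are
# assumptions and may differ from the released metadata schema.
import json

with open("metadata/sample.json", "r", encoding="utf-8") as f:
    meta = json.load(f)

# Speaker identity and attributes
for spk in meta.get("speakers", []):
    print(spk.get("id"), spk.get("gender"), spk.get("age_group"), spk.get("accent"))

# Conversational events such as interruptions or overlaps
for event in meta.get("events", []):
    print(event.get("type"), event.get("start"), event.get("end"))
```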
1. Clone the Repository

```bash
git clone https://github.com/ASLP-lab/MSU-Bench.git
cd MSU-Bench
```

2. Install Dependencies

```bash
pip install -r requirements.txt
```

3. Access the Dataset
You can download the dataset here.
Load Metadata in Python

```python
import json

with open("metadata/sample.json", "r", encoding="utf-8") as f:
    data = json.load(f)

print(data["speakers"])
print(data["dialogue"])
```

Simple Visualization of Speaker Turns
```python
import matplotlib.pyplot as plt

# Example: plot a speaker-turn timeline
speakers = ["S1", "S2", "S3"]
turns = [(0, 3, "S1"), (3, 6, "S2"), (6, 9, "S1")]  # (start, end, speaker)

for start, end, spk in turns:
    plt.plot([start, end], [speakers.index(spk)] * 2, linewidth=6, label=spk)

plt.yticks(range(len(speakers)), speakers)
plt.xlabel("Time (s)")
plt.title("Speaker Turns")
plt.show()
```
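As an illustrative follow-up (not part of the released tooling), the same toy turn list can be used to tally total speaking time per speaker:

```python
# Total speaking time per speaker from a (start, end, speaker) turn list;
# the turn list is the same toy example used above.
from collections import defaultdict

turns = [(0, 3, "S1"), (3, 6, "S2"), (6, 9, "S1")]

speaking_time = defaultdict(float)
for start, end, spk in turns:
    speaking_time[spk] += end - start

for spk, seconds in sorted(speaking_time.items()):
    print(f"{spk}: {seconds:.1f} s")
```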
If you use MSU-Bench in your research, please cite:

```bibtex
@dataset{msu-bench2025,
  title={MSU-Bench: Towards Understanding the Conversational Multi-Speaker Scenarios},
  author={Shuai Wang and Zhaokai Sun and Zhennan Lin and Chenyou Wang and Zhou Pan and Lei Xie},
  year={2025},
  url={https://github.com/ASLP-lab/MSU-Bench}
}
```

© 2025 MSU-Bench Team. Released under the MIT License.



