MSU-Bench: Towards Understanding the Conversational Multi-Speaker Scenarios



📄 Abstract

MSU-Bench is a comprehensive benchmark for evaluating multi-speaker conversational understanding with a speaker-centric design. It compiles speaker-related questions on conversational understanding and organizes them into four hierarchical levels of increasing difficulty, comprising 25 tasks with over 1,200 open-ended QA pairs.

This hierarchical framework covers four progressive tiers: single-speaker static attribute understanding, single-speaker dynamic attribute understanding, multi-speaker background understanding, and multi-speaker interaction understanding. This structure ensures all tasks are grounded in speaker-centric contexts, from basic perception to complex reasoning across multiple speakers.


Fig.1 MSU-Bench Example


🗓️ Event List

This section records the key events and milestones of this repository.

| Date | Event |
|------|-------|
| 2025-08-10 | 📂 Repository created. |
| 2025-08-10 | ✨ Added initial project structure and demo page. |
| 2025-08-10 | 📝 Updated README with project description and usage instructions. |
| 2025-08-xx | 🚀 Release v1.0.0 coming soon. |

🧩 Benchmark Tasks

MSU-Bench progresses from speaker-level perception to complex multi-party interaction reasoning. The progression follows a natural cognitive hierarchy: Tier 1 establishes foundational recognition of static speaker attributes, Tier 2 extends to temporal dynamics within individual speakers, Tier 3 advances to contextual inference and background understanding across multiple speakers, and Tier 4 culminates in comprehensive multi-speaker interaction understanding.
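
As a quick reference, the four tiers can be represented as a simple mapping from tier index to the capability it probes. The sketch below is illustrative only: the tier names follow the description above, and it is not the official 25-task taxonomy from Fig.2.

# Illustrative sketch of the four-tier hierarchy described above.
# Tier names follow the text; this is not the official task taxonomy.
TIERS = {
    1: "Single-speaker static attribute understanding",
    2: "Single-speaker dynamic attribute understanding",
    3: "Multi-speaker background understanding",
    4: "Multi-speaker interaction understanding",
}

for level, capability in TIERS.items():
    print(f"Tier {level}: {capability}")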


Fig.2 MSU-Bench Tasks


📊 Data Overview

The QA data in MSU-Bench is produced through automatic labeling and QA generation, followed by automated quality assessment and human screening. Each audio segment used for QA construction is clipped to 60–120 seconds and labeled through an open-source pipeline (SDASR/Speaker Attribute). The table below summarizes the benchmark statistics.

| Feature | Description |
|---------|-------------|
| Session Duration | 60–120 s |
| Number of Trials | 1,232 |
| Number of Speakers | 2–4 per session |
| Languages | CN, EN |
| Annotations | Speaker diarization, gender, age group, accent, emotions, speech flow, voice quality |
| Formats | Audio (.wav), Metadata (.json), Transcripts (.txt) |
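
For orientation, a single session's metadata might look like the hypothetical example below. The field names (session_id, segments, etc.) are assumptions that mirror the annotation list above, not the released schema; consult the .json files in the dataset for the actual layout.

# Hypothetical per-session metadata mirroring the annotations listed above
# (diarization, gender, age group, accent, emotions). Field names are
# assumptions, not the released schema.
example_session = {
    "session_id": "demo_0001",
    "duration_s": 92.4,
    "language": "EN",
    "speakers": [
        {"id": "S1", "gender": "female", "age_group": "adult", "accent": "US"},
        {"id": "S2", "gender": "male", "age_group": "senior", "accent": "UK"},
    ],
    "segments": [
        {"start": 0.0, "end": 4.2, "speaker": "S1", "emotion": "neutral"},
        {"start": 4.2, "end": 9.8, "speaker": "S2", "emotion": "surprised"},
    ],
}

print(len(example_session["speakers"]), "speakers in", example_session["session_id"])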


Fig.3 MSU-Bench Trials


📊 Data Source Details

MSU-Bench constructs its dataset by selecting QAs from six open-source datasets, covering both Chinese and English as well as diverse acoustic environments.

The data sources are as follows: Chinese two-speaker telephone conversation data MDT-AA007, Chinese far-field multi-speaker conversation data AliMeeting, Chinese film and television dialogue data CN-Film, English two-speaker telephone conversation data MDT-AD015, English far-field multi-speaker conversation data CHiME-6, and English film and television dialogue data EN-Film. Detailed statistics for these sources are given below:

| Name | Lang | Domain | Description | Duration | Open-source |
|------|------|--------|-------------|----------|-------------|
| MDT-AD015 | EN | Telephone conversation | Two-speaker dialogues with background noise and significant channel loss | 304.80 min | ✅ |
| CHiME-6 | EN | Far-field multi-speaker | Multi-speaker family dinner scenario with background event noise and spatial reverberation | 433.21 min | ✅ |
| R3VQA-Audio | EN | Complex film scenes | Multi-speaker film/TV drama scenes with background event noise and background music | 2470.66 min | ✅ |
| MDT-AA007 | CN | Near-field multi-speaker | Multi-speaker podcast conversations with noise, reverberation, and background sound | 378×6 min | ✅ |
| AliMeeting | CN | Far-field multi-speaker | Multi-speaker far-field meeting data with noise and reverberation | 252 min | ✅ |
| CN-Film | CN | Complex film scenes | Multi-speaker film/TV drama scenes with background event noise and background music | 1135 min | ✅ |

🤖 QA Pipeline


Fig.4 MSU-Bench Pipeline

To construct the four-tier benchmark with diverse speaker-centric audio-text tasks, we build a rigorous QA generation pipeline (which will be fully open-sourced) that automatically produces high-quality question–answer pairs from multi-speaker dialogues spanning various real-world scenarios and acoustic conditions. For each core ability, we design dedicated prompts to guide template construction and question formulation, ensuring that the resulting QA samples are tightly aligned with task-specific objectives.
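
As a rough illustration of this idea, the sketch below shows how a capability-specific prompt template could be filled with a transcript and passed to a model call. The prompt wording, the build_qa() helper, and the generate() hook are placeholders invented for this example, not the actual pipeline, which will be released separately.

# Minimal sketch of capability-specific prompting for QA generation.
# PROMPTS, build_qa(), and the generate() hook are placeholders, not the
# released pipeline.
PROMPTS = {
    "static_attribute": (
        "Given the transcript with speaker labels, write one open-ended question "
        "about a fixed attribute (gender, age group, accent) of a single speaker, "
        "followed by its answer."
    ),
    "interaction": (
        "Given the transcript with speaker labels, write one open-ended question "
        "about how two speakers interact (interruptions, agreement, relationship), "
        "followed by its answer."
    ),
}

def build_qa(capability, transcript, generate):
    # Fill the capability-specific template and delegate to a user-supplied model call.
    prompt = PROMPTS[capability] + "\n\nTranscript:\n" + transcript
    return {"capability": capability, "qa": generate(prompt)}

# usage (hypothetical): build_qa("interaction", transcript_text, my_model.generate)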


📈 Evaluation Performance


Fig.5 LALM Evaluation Results on MSU-Bench

MSU-Bench presents comprehensive evaluation results of both state-of-the-art open-source and commercial models on our benchmark (see Fig. 5). For each benchmark tier and capability, we report the average result over all tasks related to that capability, with the task–capability mappings detailed in Fig. 2. The comparison of different systems across all 25 tasks is illustrated in Fig. 6, providing an intuitive performance comparison across models.
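
The per-capability numbers reported above are simple averages over the tasks mapped to each capability. The snippet below sketches that aggregation; the task names, scores, and mapping are made-up placeholders, with the real mapping given in Fig. 2.

# Sketch of per-capability aggregation: average the scores of all tasks
# mapped to a capability. All names and numbers here are placeholders.
from collections import defaultdict
from statistics import mean

task_scores = {"gender_recognition": 0.81, "age_estimation": 0.64, "interruption_detection": 0.52}
task_to_capability = {
    "gender_recognition": "static_attribute",
    "age_estimation": "static_attribute",
    "interruption_detection": "interaction",
}

per_capability = defaultdict(list)
for task, score in task_scores.items():
    per_capability[task_to_capability[task]].append(score)

print({cap: round(mean(scores), 3) for cap, scores in per_capability.items()})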


Fig.6 Task Performance on MSU-Bench


📂 Repository Composition

MSU-Bench/
│
├── audio/                  # Raw audio files
├── transcripts/            # Text transcripts
├── metadata/               # Speaker attributes, emotion labels, etc.
└── examples/               # Sample dialogues

Audio Annotation Components:

  1. Speaker Identity & Attributes — gender, age group, etc.
  2. Paralinguistic Cues — pitch, volume variations.
  3. Conversational Events — interruptions, overlaps.
  4. Background Context — relationship between speakers, scene type.
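
Assuming each session shares one ID across the three folders (e.g. a session_id reused as .wav/.txt/.json filenames — the exact naming scheme is an assumption here), the layout above can be traversed like this:

# Pair audio, transcript, and metadata files by a shared session ID.
# The <session_id>.wav/.txt/.json naming is an assumption about the release.
from pathlib import Path

root = Path("MSU-Bench")
for meta_path in sorted((root / "metadata").glob("*.json")):
    session_id = meta_path.stem
    audio_path = root / "audio" / f"{session_id}.wav"
    text_path = root / "transcripts" / f"{session_id}.txt"
    if audio_path.exists() and text_path.exists():
        print(session_id, audio_path, text_path, meta_path)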

🚀 Getting Started

1. Clone the Repository

git clone https://github.com/ASLP-lab/MSU-Bench.git
cd MSU-Bench

2. Install Dependencies

pip install -r requirements.txt

3. Access the Dataset
You can download the dataset here.


💡 Example Usage

Load Metadata in Python

import json

with open("metadata/sample.json", "r", encoding="utf-8") as f:
    data = json.load(f)

print(data["speakers"])
print(data["dialogue"])

Simple Visualization of Speaker Turns

import matplotlib.pyplot as plt

# Example: plot speaker turns timeline
speakers = ["S1", "S2", "S3"]
turns = [(0, 3, "S1"), (3, 6, "S2"), (6, 9, "S1")]  # (start, end, speaker)

for start, end, spk in turns:
    plt.plot([start, end], [speakers.index(spk)]*2, linewidth=6, label=spk)

plt.yticks(range(len(speakers)), speakers)
plt.xlabel("Time (s)")
plt.title("Speaker Turns")
plt.show()

📜 Citation

If you use MSU-Bench in your research, please cite:

@dataset{msu-bench2025,
  title={MSU-Bench: Towards Understanding the Conversational Multi-Speaker Scenarios},
  author={Shuai Wang and Zhaokai Sun and Zhennan Lin and Chenyou Wang and Zhou Pan and Lei Xie},
  year={2025},
  url={https://github.com/ASLP-lab/MSU-Bench}
}

© 2025 MSU-Bench Team. Released under the MIT License.
