MSU-Bench: Towards Understanding the Conversational Multi-Speaker Scenarios



📄 Abstract

MSU-Bench is a comprehensive benchmark for evaluating multi-speaker conversational understanding with a speaker-centric design. It compiles speaker-related questions on conversational understanding and organizes them into four hierarchical levels of increasing difficulty, comprising 25 tasks with over 1,200 open-ended QA pairs.

This hierarchical framework covers four progressive tiers: single-speaker static attribute understanding, single-speaker dynamic attribute understanding, multi-speaker background understanding, and multi-speaker interaction understanding. This structure ensures all tasks are grounded in speaker-centric contexts, from basic perception to complex reasoning across multiple speakers.


Fig.1 MSU-Bench Example


🗓️ Event List

This section records the key events and milestones of this repository.

| Date | Event |
|------|-------|
| 2025-08-10 | 📂 Repository created. |
| 2025-08-10 | ✨ Added initial project structure and demo page. |
| 2025-08-10 | 📝 Updated README with project description and usage instructions. |
| 2025-08-xx | 🚀 Release v1.0.0 coming soon. |

🧩 Benchmark Tasks

MSU-Bench progresses from speaker-level perception to complex multi-party interaction reasoning. The progression follows a natural cognitive hierarchy: Tier 1 establishes foundational recognition of static speaker attributes, Tier 2 extends to temporal dynamics within individual speakers, Tier 3 advances to contextual inference and background understanding across multiple speakers, and Tier 4 culminates in comprehensive multi-speaker interaction understanding.
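
As a quick reference, the four tiers can be represented as a simple mapping from tier index to the capability it probes. The sketch below is illustrative only: the tier names follow the description above, and it is not the official 25-task taxonomy from Fig.2.

# Illustrative sketch of the four-tier hierarchy described above.
# Tier names follow the text; this is not the official task taxonomy.
TIERS = {
    1: "Single-speaker static attribute understanding",
    2: "Single-speaker dynamic attribute understanding",
    3: "Multi-speaker background understanding",
    4: "Multi-speaker interaction understanding",
}

for level, capability in TIERS.items():
    print(f"Tier {level}: {capability}")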


Fig.2 MSU-Bench Tasks


📊 Data Overview

The QA data in MSU-Bench is produced through automatic labeling and QA generation, followed by automated quality assessment and human screening. Each audio segment used for QA construction is clipped to 60–120 seconds and labeled through an open-source pipeline (SDASR/Speaker Attribute). The table below summarizes the benchmark statistics.

| Feature | Description |
|---------|-------------|
| Session Duration | 60–120 s |
| Number of Trials | 1,232 |
| Number of Speakers | 2–4 per session |
| Languages | CN, EN |
| Annotations | Speaker diarization, gender, age group, accent, emotions, speech flow, voice quality |
| Formats | Audio (.wav), Metadata (.json), Transcripts (.txt) |
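
For orientation, a single session's metadata might look like the hypothetical example below. The field names (session_id, segments, etc.) are assumptions that mirror the annotation list above, not the released schema; consult the .json files in the dataset for the actual layout.

# Hypothetical per-session metadata mirroring the annotations listed above
# (diarization, gender, age group, accent, emotions). Field names are
# assumptions, not the released schema.
example_session = {
    "session_id": "demo_0001",
    "duration_s": 92.4,
    "language": "EN",
    "speakers": [
        {"id": "S1", "gender": "female", "age_group": "adult", "accent": "US"},
        {"id": "S2", "gender": "male", "age_group": "senior", "accent": "UK"},
    ],
    "segments": [
        {"start": 0.0, "end": 4.2, "speaker": "S1", "emotion": "neutral"},
        {"start": 4.2, "end": 9.8, "speaker": "S2", "emotion": "surprised"},
    ],
}

print(len(example_session["speakers"]), "speakers in", example_session["session_id"])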


Fig.3 MSU-Bench Trials


📊 Data Source Details

MSU-Bench constructs its dataset by selecting QAs from six open-source datasets, covering both Chinese and English as well as diverse acoustic environments.

The data sources are as follows: Chinese two-speaker telephone conversation data MDT-AA007, Chinese far-field multi-speaker conversation data AliMeeting, Chinese film and television dialogue data CN-Film, English two-speaker telephone conversation data MDT-AD015, English far-field multi-speaker conversation data CHiME-6, and English film and television dialogue data EN-Film. Detailed statistics for these sources are given below:

| Name | Lang | Domain | Description | Duration | Open-source |
|------|------|--------|-------------|----------|-------------|
| MDT-AD015 | EN | Telephone conversation | Two-speaker dialogues with background noise and significant channel loss | 304.80 min | ✅ |
| CHiME-6 | EN | Far-field multi-speaker | Multi-speaker family dinner scenario with background event noise and spatial reverberation | 433.21 min | ✅ |
| R3VQA-Audio | EN | Complex film scenes | Multi-speaker film/TV drama scenes with background event noise and background music | 2470.66 min | ✅ |
| MDT-AA007 | CN | Near-field multi-speaker | Multi-speaker podcast conversations with noise, reverberation, and background sound | 378×6 min | ✅ |
| AliMeeting | CN | Far-field multi-speaker | Multi-speaker far-field meeting data with noise and reverberation | 252 min | ✅ |
| CN-Film | CN | Complex film scenes | Multi-speaker film/TV drama scenes with background event noise and background music | 1135 min | ✅ |

🤖 QA Pipeline


Fig.4 MSU-Bench Pipeline

To construct the four-tier benchmark with diverse speaker-centric audio-text tasks, we build a rigorous QA generation pipeline (which will be fully open-sourced) that automatically produces high-quality question–answer pairs from multi-speaker dialogues spanning various real-world scenarios and acoustic conditions. For each core ability, we design dedicated prompts to guide template construction and question formulation, ensuring that the resulting QA samples are tightly aligned with task-specific objectives.
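
As a rough illustration of this idea, the sketch below shows how a capability-specific prompt template could be filled with a transcript and passed to a model call. The prompt wording, the build_qa() helper, and the generate() hook are placeholders invented for this example, not the actual pipeline, which will be released separately.

# Minimal sketch of capability-specific prompting for QA generation.
# PROMPTS, build_qa(), and the generate() hook are placeholders, not the
# released pipeline.
PROMPTS = {
    "static_attribute": (
        "Given the transcript with speaker labels, write one open-ended question "
        "about a fixed attribute (gender, age group, accent) of a single speaker, "
        "followed by its answer."
    ),
    "interaction": (
        "Given the transcript with speaker labels, write one open-ended question "
        "about how two speakers interact (interruptions, agreement, relationship), "
        "followed by its answer."
    ),
}

def build_qa(capability, transcript, generate):
    # Fill the capability-specific template and delegate to a user-supplied model call.
    prompt = PROMPTS[capability] + "\n\nTranscript:\n" + transcript
    return {"capability": capability, "qa": generate(prompt)}

# usage (hypothetical): build_qa("interaction", transcript_text, my_model.generate)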


📈 Evaluation Performance


Fig.5 LALM Evaluation Results on MSU-Bench

MSU-Bench presents comprehensive evaluation results of both state-of-the-art open-source and commercial models on our benchmark (see Fig. 5). For each benchmark tier and capability, we report the average result over all tasks related to that capability, with the task–capability mappings detailed in Fig. 2. The comparison of different systems across all 25 tasks is illustrated in Fig. 6, providing an intuitive performance comparison across models.
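
The per-capability numbers reported above are simple averages over the tasks mapped to each capability. The snippet below sketches that aggregation; the task names, scores, and mapping are made-up placeholders, with the real mapping given in Fig. 2.

# Sketch of per-capability aggregation: average the scores of all tasks
# mapped to a capability. All names and numbers here are placeholders.
from collections import defaultdict
from statistics import mean

task_scores = {"gender_recognition": 0.81, "age_estimation": 0.64, "interruption_detection": 0.52}
task_to_capability = {
    "gender_recognition": "static_attribute",
    "age_estimation": "static_attribute",
    "interruption_detection": "interaction",
}

per_capability = defaultdict(list)
for task, score in task_scores.items():
    per_capability[task_to_capability[task]].append(score)

print({cap: round(mean(scores), 3) for cap, scores in per_capability.items()})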


Fig.6 Task Performance on MSU-Bench


📂 Repository Composition

MSU-Bench/
│
├── audio/                  # Raw audio files
├── transcripts/            # Text transcripts
├── metadata/               # Speaker attributes, emotion labels, etc.
└── examples/               # Sample dialogues

Audio Annotation Components:

  1. Speaker Identity & Attributes — gender, age group, etc.
  2. Paralinguistic Cues — pitch, volume variations.
  3. Conversational Events — interruptions, overlaps.
  4. Background Context — relationship between speakers, scene type.
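
Assuming each session shares one ID across the three folders (e.g. a session_id reused as .wav/.txt/.json filenames — the exact naming scheme is an assumption here), the layout above can be traversed like this:

# Pair audio, transcript, and metadata files by a shared session ID.
# The <session_id>.wav/.txt/.json naming is an assumption about the release.
from pathlib import Path

root = Path("MSU-Bench")
for meta_path in sorted((root / "metadata").glob("*.json")):
    session_id = meta_path.stem
    audio_path = root / "audio" / f"{session_id}.wav"
    text_path = root / "transcripts" / f"{session_id}.txt"
    if audio_path.exists() and text_path.exists():
        print(session_id, audio_path, text_path, meta_path)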

🚀 Getting Started

1. Clone the Repository

git clone https://github.com/ASLP-lab/MSU-Bench.git
cd MSU-Bench

2. Install Dependencies

pip install -r requirements.txt

3. Access the Dataset
You can download the dataset here.


💡 Example Usage

Load Metadata in Python

import json

with open("metadata/sample.json", "r", encoding="utf-8") as f:
    data = json.load(f)

print(data["speakers"])
print(data["dialogue"])

Simple Visualization of Speaker Turns

import matplotlib.pyplot as plt

# Example: plot speaker turns timeline
speakers = ["S1", "S2", "S3"]
turns = [(0, 3, "S1"), (3, 6, "S2"), (6, 9, "S1")]  # (start, end, speaker)

for start, end, spk in turns:
    plt.plot([start, end], [speakers.index(spk)]*2, linewidth=6, label=spk)

plt.yticks(range(len(speakers)), speakers)
plt.xlabel("Time (s)")
plt.title("Speaker Turns")
plt.show()

📜 Citation

If you use MSU-Bench in your research, please cite:

@dataset{msu-bench2025,
  title={MSU-Bench: Towards Understanding the Conversational Multi-Speaker Scenarios},
  author={Shuai Wang and Zhaokai Sun and Zhennan Lin and Chenyou Wang and Zhou Pan and Lei Xie},
  year={2025},
  url={https://github.com/ASLP-lab/MSU-Bench}
}

© 2025 MSU-Bench Team. Released under the MIT License.
