A human-recorded multi-turn emotional intelligence benchmark for audio language models.
HumDial-EIBench is designed to evaluate whether audio language models (ALMs) truly understand emotion in speech, rather than relying on text transcription shortcuts.
The benchmark is built from authentic human-recorded dialogues from the ICASSP 2026 HumDial Challenge and includes both Chinese and English subsets.
Existing ALM benchmarks often suffer from one or more of these issues:
- synthetic (TTS-only) speech instead of authentic human recordings
- single-turn settings that miss emotional evolution over context
- subjective open-ended scoring that confounds reasoning and generation quality
HumDial-EIBench addresses these gaps by combining:
- real human multi-turn audio
- objective adversarial MCQ tasks for reasoning-heavy evaluation
- a dedicated acoustic-semantic conflict task
- separate diagnosis of textual empathy vs acoustic empathy
- Total samples: 1,077
- Languages: Chinese + English
- Core goal: diagnose emotional intelligence in ALMs across memory, reasoning, generation, and cross-modal robustness
| Task | Type | CN / EN | Turns | Main Metric |
|---|---|---|---|---|
| Task 1: Emotional Trajectory Detection | MCQ | 150 / 150 | 3-5 | Accuracy |
| Task 2: Implicit Causal Reasoning | MCQ | 134 / 149 | 3-5 | Accuracy |
| Task 3: Empathetic Response Generation | Open Generation | 144 / 150 | 3-5 | LLM + Human |
| Task 4: Acoustic-Semantic Conflict | MCQ | 100 / 100 | 1 | Accuracy |
| Total | — | 528 / 549 | — | — |
Track emotion changes across dialogue turns (for example, E_t1 -> E_t2 -> E_t3), instead of classifying isolated utterances.
Infer the latent emotional trigger from scattered context clues. The MCQ format helps reduce evaluator subjectivity.
Evaluate generated responses in three dimensions:
- D1: Textual Empathy & Insight (LLM-judge, 1-5)
- D2: Vocal Empathy & Congruence (human rating, 1-5)
- D3: Audio Quality & Naturalness (human rating, 1-5)
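Per-sample scores on these three dimensions are typically aggregated into per-dimension means. A minimal sketch, assuming a hypothetical JSONL file in which each line carries `D1`, `D2`, and `D3` fields (the field names and file layout are illustrative, not the script's actual output format):

```python
import json
from statistics import mean

def summarize_task3(path):
    """Mean D1/D2/D3 over a JSONL file of per-sample scores.

    Assumed (hypothetical) per-line format:
      {"dialogue_id": "...", "D1": 4, "D2": 3, "D3": 5}
    """
    scores = {"D1": [], "D2": [], "D3": []}
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            for dim in scores:
                scores[dim].append(float(rec[dim]))
    # Round for a compact summary table
    return {dim: round(mean(vals), 2) for dim, vals in scores.items()}
```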
Test robustness when text sentiment contradicts vocal affect (for example, sarcasm-like cases), exposing text-dominance bias.
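Tasks 1, 2, and 4 are all scored by plain accuracy over MCQ answers. A minimal scoring sketch, assuming a hypothetical JSONL of model outputs with `pred` and `answer` option letters (these field names are an assumption for illustration, not the benchmark's official format):

```python
import json

def mcq_accuracy(path):
    """Accuracy over a JSONL file of MCQ results.

    Assumed (hypothetical) per-line format:
      {"pred": "B", "answer": "B"}
    """
    total = correct = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            total += 1
            # Case-insensitive match on the chosen option letter
            correct += rec["pred"].strip().upper() == rec["answer"].strip().upper()
    return correct / total if total else 0.0
```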
- Most ALMs still struggle with multi-turn emotional tracking and implicit causal reasoning.
- Strong decoupling exists between textual empathy and acoustic empathy.
- All tested models show a notable text-dominance bias under acoustic-semantic conflict.
```
.
├── assets/
│   └── humdial-bench.png
├── eval/
│   └── eval_task3.py
└── README.md
```
- The dataset will be released soon.
`eval/eval_task3.py` scores model outputs for D1/D2/D3 and writes per-sample and summary results.
```bash
python eval/eval_task3.py \
    --model Qwen3-Omni-30B-A3B-Instruct \
    --input_file results/task3_outputs.jsonl \
    --output_file results/task3_scores.jsonl
```

The script automatically identifies the target evaluation turn (the second non-neutral turn) and builds context from the preceding turns.
Environment note: this script requires a GPU runtime and vLLM. Set the local judge checkpoint path in `eval/eval_task3.py` before running.
For questions or collaboration, please open an issue in this repository.

```json
{
  "dialogue_id": "sample_001",
  "turns": [
    {
      "input_emotion": "sad",
      "input_text": "I've been feeling really overwhelmed lately...",
      "response_text": "It sounds like you're carrying a lot right now.",
      "response_audio": "outputs/sample_001_turn1.wav"
    }
  ]
}
```
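The evaluation targets the second non-neutral turn of each dialogue. A minimal sketch of that selection logic over the sample format above (an assumed re-implementation for illustration, not the eval script's actual code):

```python
def target_turn_index(turns):
    """Return the index of the second non-neutral turn, or None.

    `turns` is a list of dicts with an "input_emotion" field, as in
    the sample dialogue format; "neutral" turns are skipped.
    """
    seen = 0
    for i, turn in enumerate(turns):
        if turn.get("input_emotion", "neutral") != "neutral":
            seen += 1
            if seen == 2:
                return i
    return None  # fewer than two non-neutral turns
```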