AttuneBench evaluates LLM emotional intelligence (EI) through 200 genuine multi-turn human-model conversations with first-person turn-level annotations of emotional state, model behavior, and response preferences provided by the human participants who held the conversations. The benchmark is grounded in the Mayer-Salovey-Caruso Four Branch Model of EI and measures EI through observable model behavior in conversation: emotion inference, response evaluation, preference anticipation, and outcome prediction.
This repository contains the runner, scorer, and supplementary analysis scripts. The benchmark dataset (200-conversation Sample200 plus four smaller subsamples) is bundled in Test Samples/.
```
.
├── attunebench/              ← runner + scorer package (CLI: `attune-bench`)
│   ├── runner.py             ← per-turn EM call orchestration
│   ├── scorer.py             ← canonical 3-pillar Composite + per-metric scoring
│   ├── reporter.py           ← `attune-bench report` formatting
│   ├── harness/              ← lm-evaluation-harness task adapter
│   └── prompts/              ← system + user prompt templates per call type
├── analyses/                 ← supplementary analysis scripts (paper-internal)
├── tests/                    ← pytest suite
├── Test Samples/             ← bundled conversation data
│   ├── Sample200/            ← canonical 200-conversation benchmark
│   ├── Subsample100/
│   ├── Subsample50/          ← Verbose Mode coverage in paper runs
│   ├── Subsample25/          ← Omniscient Mode coverage in paper runs
│   └── Subsample20/
├── subset_report.py          ← trait-stratified score filtering tool
├── model_stats.py            ← per-model summary stats from a scores.json
├── predict_profile.py        ← experimental: profile prediction from conversation
├── rejudge.py                ← re-run draft-judge on existing run outputs
├── pyproject.toml
├── QUICKSTART.md
├── README.md                 ← this file
└── SUBSET_REPORT_README.md   ← detailed `subset_report.py` reference
```
The per-conversation scoring outputs and aggregated scores.json files for the paper's reported runs (across all 11 models × 3 modes) live in the dataset deposit under Experimental Runs/, not in this code repository — see the dataset README for that layout.
- Python 3.10+
- An API key for OpenRouter, OpenAI, or Prime Intellect
```bash
git clone https://anonymous.4open.science/r/attunebench-release-n-1F9B
cd <repo-dir>
pip install -e .
```

Optional extras:

```bash
pip install -e ".[bertscore]"   # BERTScore for response similarity
pip install -e ".[harness]"     # lm-evaluation-harness integration
pip install -e ".[analyses]"    # pandas / matplotlib / seaborn for supplementary analyses
pip install -e ".[dev]"         # pytest, ruff for development
```

Set your API key as an environment variable:
```bash
# OpenRouter (default backend)
export OPENROUTER_API_KEY="your-key-here"
# Or OpenAI
export OPENAI_API_KEY="your-key-here"
# Or Prime Intellect
export PRIME_API_KEY="your-key-here"
```

Prime Intellect also supports `~/.prime/config.json` with `api_key`, `team_id`, and
`inference_url`. AttuneBench checks `RLVR_PRIME_API_KEY` / `PRIME_API_KEY` first, then
falls back to that config file.
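The lookup order described above can be sketched as follows. This is a minimal illustration, not the package's actual code: the function name `resolve_prime_api_key` is hypothetical, and checking `RLVR_PRIME_API_KEY` before `PRIME_API_KEY` is an assumption based on the order in which they are listed.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_prime_api_key(
    env: Mapping[str, str] = os.environ,
    config_path: Path = Path.home() / ".prime" / "config.json",
) -> Optional[str]:
    """Hypothetical helper: return the first Prime Intellect key found, or None."""
    # 1. Environment variables take priority (order between the two is assumed).
    for name in ("RLVR_PRIME_API_KEY", "PRIME_API_KEY"):
        value = env.get(name)
        if value:
            return value
    # 2. Otherwise fall back to the `api_key` field of the config file.
    try:
        config = json.loads(config_path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return None
    if not isinstance(config, dict):
        return None
    return config.get("api_key") or None
```

Passing `env` and `config_path` explicitly keeps the sketch easy to test without touching the real environment.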
1. Run the benchmark against a model:
```bash
# Default mode only (when --modes is omitted)
attune-bench run openrouter anthropic/claude-sonnet-4 <path/to/conversations>

# All 4 modes at once
attune-bench run prime anthropic/claude-sonnet-4 <path/to/conversations> --modes all

# Pick specific modes
attune-bench run openai gpt-4o <path/to/conversations> --modes default verbose

# With draft quality judge (LLM-as-judge + binary alignment)
attune-bench run prime anthropic/claude-sonnet-4 <path/to/conversations> --judge-model anthropic/claude-opus-4.6

# Custom output directory, call timeout, explicit API key
attune-bench run prime anthropic/claude-opus-4.6 <path/to/conversations> \
  -o ./my-results/ --call-timeout 300 --api-key sk-...
```

Runs are resumable: if a run is interrupted, re-run the same command with the same `--output` directory. Completed conversations are automatically skipped. Failed conversations are logged to `_skipped_<model>.json` with re-run instructions.
2. Score the results against ground-truth annotations:
```bash
attune-bench score \
  --results ./results/ \
  --ground-truth <path/to/conversations> \
  --output ./scores.json
```

3. View the report:

```bash
attune-bench report --scores ./scores.json
```

4. (Optional) Trait-stratified analysis. Filter scores by participant traits or conversation metadata:

```bash
python subset_report.py -s ./scores.json -d <path/to/conversations> --diagnosis ADHD --compare
```

See `SUBSET_REPORT_README.md` for the full set of filters and the per-tool metric reference.
`attune-bench run` writes to the `--output` directory:
- `<conversationId>_<provider>_<model>_<mode>.json`: one per conversation, containing per-turn EM predictions (emotion tags, both binary phrasings, pairwise rankings, response draft, and judge ratings if `--judge-model` was set), conversation-wide predictions (Four-Branch ratings, Q1–Q3, post-PANAS estimate), and aggregated per-conversation scores.
- `_skipped_<model>.json`: conversations that errored out during the run, with re-run instructions. Re-running the same command resumes from where it left off.
`attune-bench score` writes a single `scores.json`:
- Per-conversation scores (one entry per conversation: every metric defined below, plus the source result file path).
- Aggregated per-(model, mode) metrics (mean across conversations).
- Run metadata: model identifier, mode, ground-truth path, number of conversations scored.
`subset_report.py` does not write to disk by default; it prints a filtered report to stdout.
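The resume behavior of `attune-bench run` follows from the per-conversation result-file naming described above: a conversation whose result file already exists can be skipped. The sketch below illustrates that idea only. The helper names are hypothetical, and the replacement of `/` in model identifiers is an assumption made so the name is a valid file name; the runner's actual sanitization may differ.

```python
from pathlib import Path
from typing import Iterable, List


def result_filename(conversation_id: str, provider: str, model: str, mode: str) -> str:
    """Hypothetical helper mirroring <conversationId>_<provider>_<model>_<mode>.json."""
    # ASSUMPTION: "/" in identifiers such as "anthropic/claude-sonnet-4" is replaced.
    safe_model = model.replace("/", "-")
    return f"{conversation_id}_{provider}_{safe_model}_{mode}.json"


def pending_conversations(
    conversation_ids: Iterable[str],
    output_dir: str,
    provider: str,
    model: str,
    mode: str,
) -> List[str]:
    """Return the conversation IDs that have no result file yet."""
    out = Path(output_dir)
    return [
        cid
        for cid in conversation_ids
        if not (out / result_filename(cid, provider, model, mode)).exists()
    ]
```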
| Mode | EM context and outputs |
|---|---|
| `default` | EM observes the HP message, drafts a response, then observes the OM response and produces emotion predictions, binary judgments, and pairwise rankings. |
| `verbose` | Same as default; the EM additionally produces reasoning traces justifying each prediction. |
| `omniscient` | Same as default; the EM is additionally provided the participant profile, pre-conversation PANAS, and topic attitude at turn 1. |
| `verbose_omniscient` | Combines verbose (reasoning traces) and omniscient (background context). |
All modes follow the same per-turn call flow:
- Draft generation: EM sees the HP's message and drafts how it would respond, before observing the OM response.
- Optional draft judging: when `--judge-model` is provided, a separate judge LLM rates the EM's draft on overall quality, emotional appropriateness, helpfulness, and tone match (1–7 scale).
- Emotion inference: EM predicts the HP's current emotional state using PANAS terminology.
- Binary judgment prediction: for each applicable binary question, the EM predicts both the observed OM behavior and the HP's preferred behavior. Each question is posed twice, once first-person ("in your message") and once observer ("in the HP's message"), in separate calls to measure perspective effect.
- Pairwise preference prediction: EM ranks the three response variants (original OM, model-generated alternate, human-edited golden response) from the HP's perspective; response labels are anonymized to reduce source bias.
The EM's own prior predictions are not passed back as context — each turn's analysis is independent to prevent anchoring bias.
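Label anonymization for the pairwise step can be pictured as shuffling the three variants and presenting them under neutral labels, while keeping a private key to map the EM's choices back to their sources. The sketch below is an illustration under assumptions: the function name, the `A`/`B`/`C` labels, and the seeded shuffle are hypothetical and not taken from the runner.

```python
import random
from typing import Dict, Tuple


def anonymize_variants(
    responses: Dict[str, str], seed: int
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Hypothetical helper: hide the source of each response variant.

    `responses` maps a source name (e.g. "original") to its text.
    Returns (label -> text shown to the EM, label -> source kept private).
    """
    sources = sorted(responses)           # fixed starting order, independent of dict order
    random.Random(seed).shuffle(sources)  # seeded so a run can be reproduced
    labels = [chr(ord("A") + i) for i in range(len(sources))]
    shown = {label: responses[src] for label, src in zip(labels, sources)}
    key = {label: src for label, src in zip(labels, sources)}
    return shown, key
```

After the EM answers in terms of labels, `key` translates each label back to its source before scoring.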
Cost estimation. With one judge model on default mode, each turn makes 5 LLM calls (EM draft, judge quality + binary alignment, EM emotion + binary judgments, HP-facing binary re-prompt, pairwise selections). One additional post-conversation call covers PANAS and the conversation-wide questions. A 5-turn conversation is therefore 5 × 5 + 1 = 26 calls total. Verbose mode adds reasoning traces but does not change the call count; omniscient adds context to existing calls without changing the count.
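The arithmetic above generalizes to a whole run. The helper below is a small sketch for budgeting, assuming the stated configuration (one judge model, a single mode, 5 calls per turn plus 1 post-conversation call); the function name is hypothetical.

```python
from typing import Iterable


def estimate_llm_calls(
    turns_per_conversation: Iterable[int],
    calls_per_turn: int = 5,
    post_conversation_calls: int = 1,
) -> int:
    """Total LLM calls for one mode, given the number of turns in each conversation."""
    return sum(
        turns * calls_per_turn + post_conversation_calls
        for turns in turns_per_conversation
    )


# A single 5-turn conversation: 5 * 5 + 1 calls.
single = estimate_llm_calls([5])
```

Running several modes multiplies the total by the number of modes, since each mode is a separate pass over the conversations.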
Run multiple modes at once:
```bash
attune-bench run prime anthropic/claude-sonnet-4 ./data/ --modes default verbose
attune-bench run prime anthropic/claude-sonnet-4 ./data/ --modes all  # all 4 modes
```

To run through lm-evaluation-harness, install the `harness` extra and invoke `lm_eval`:

```bash
pip install -e ".[harness]"
lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-3-70B \
  --tasks attunebench
```

Conversation files (located in `Test Samples/`) follow the annotated format below. `Sample200/` contains 200 conversations meeting protocol-completeness and engagement criteria; conversation IDs are non-sequential because they preserve their position in the original collection (full inclusion criteria documented in the dataverse README).
```json
{
  "conversationId": "429fdfd2-9b97-4f8e-b61f-d5615305bf3f",
  "metadata": {
    "model": "gpt-4o-mini",
    "category": "Family",
    "subtopic": "Children / Family Planning",
    "text": "Family: Children / Family Planning"
  },
  "participant_profile": {
    "schema_version": "2.0",
    "participant_id": "A00002",
    "profile_type": "shortened",
    "demographics": {
      "country": "United States",
      "education": "High school",
      "work_edu_background": ["..."],
      "gender": "Male",
      "age_range": "25-34",
      "english_proficiency": "Native",
      "diagnoses": ["None of the above"]
    },
    "computed_scores": {
      "who_5": 19,
      "asrs_6": 1,
      "aq_10": 6,
      "promis_anxiety": 5,
      "promis_anxiety_t": 48.0,
      "promis_depression": 4,
      "promis_depression_t": 41.0,
      "promis_sleep": 6,
      "promis_sleep_t": 41.1,
      "tipi_extraversion": 9,
      "tipi_agreeableness": 9,
      "tipi_conscientiousness": 11,
      "tipi_stability": 11,
      "tipi_openness": 11
    },
    "conversation_topic_attitude": "Excited/Optimistic"
  },
  "prePanas": {
    "totalPositiveAffect": 43,
    "totalNegativeAffect": 11,
    "responses": {
      "alert": 5, "proud": 4, "upset": 1, "active": 5,
      "afraid": 1, "guilty": 1, "scared": 1, "strong": 3,
      "ashamed": 1, "excited": 4, "hostile": 1, "jittery": 1,
      "nervous": 2, "inspired": 4, "attentive": 5, "irritable": 1,
      "determined": 5, "distressed": 1, "interested": 5, "enthusiastic": 3
    }
  },
  "postPanas": {
    "totalPositiveAffect": 49,
    "totalNegativeAffect": 10,
    "responses": { "...same 20 items, 1-7 each...": 0 }
  },
  "conversationWideQuestions": {
    "fourBranchScores": {
      "managing": 4, "perceiving": 5, "facilitating": 6, "understanding": 5
    },
    "q1_lookingFor": ["To help me understand or sort out my feelings"],
    "q2_emotionClarity": "Implied or indirect",
    "q3_modelFit": "Mixed, some good moments, some misses",
    "q3_followUp_whatFeltOff": ["..."]
  },
  "turns": [
    {
      "turnNumber": 1,
      "userMessage": "...",
      "llmResponse": "...",
      "moodShiftTags": [
        { "emotion": "Proud", "intensity": 4 }
      ],
      "annotations": {
        "binaryJudgements": [
          { "questionId": "B1", "observedBehavior": "yes", "preferredBehavior": "yes" }
        ],
        "alternateResponses": {
          "llmImproved": "...",
          "humanEdited": "..."
        },
        "pairwiseComparisons": [
          { "questionId": "general", "responseA": "original", "responseB": "alternate", "winner": "A" }
        ],
        "selectedPairwiseQuestions": ["PW7", "PW4", "PW3"]
      }
    }
  ]
}
```

Key data format details:
- Metadata: Topic info (`category`, `subtopic`, `text`) and OM model identifier sit directly on `metadata`.
- Participant profile: `demographics` (country, education, work background, gender, age range, English proficiency, self-reported mental health diagnoses); `computed_scores` (TIPI Big-Five subscores plus WHO-5, AQ-10, ASRS-6, and PROMIS-4 raw + T-scores for anxiety, depression, and sleep); `conversation_topic_attitude` (participant's pre-conversation attitude toward the assigned topic).
- PANAS: Both aggregate totals and all 20 individual item scores (1-7 scale) under `responses`. The scorer derives aggregates from items for consistency.
- Mood shift tags: PANAS-SF 20-item taxonomy only. Empty list `[]` for neutral/stable turns. Intensity is 1-7.
- Binary judgements: `observedBehavior` = did the OM do this? `preferredBehavior` = would the HP have wanted this? Both are `yes`/`no`/`na`. Supports 36 questions (B1–B20 core + B108–B505 extended).
- Pairwise comparisons: 12 per turn (4 questions × 3 pairs). `winner: "A"|"B"` is the HP's choice. Each comparison has a unique index key in scored output.
- Conversation-wide: `q1_lookingFor` (multi-select), `q2_emotionClarity`, `q3_modelFit`, `q3_followUp_whatFeltOff` (multi-select, populated only when `q3_modelFit` is negative), `fourBranchScores` (perceiving / facilitating / understanding / managing on 1–7 scale).
- Question text lookup: full text for each binary ID (B1–B505) and pairwise ID (PW1–PW16) is consolidated in `annotation_codebook.json` at the repo root, alongside option lists for conversation-wide Q1/Q2/Q3, Four-Branch definitions, and PANAS taxonomy. Question text is also defined as Python dicts in `attunebench/constants.py` (`BINARY_QUESTION_TEXT`, `BINARY_HP_QUESTION_TEXT`, `PAIRWISE_QUESTION_TEXT`); the runtime imports those.
- Two phrasings for binary questions: each binary question has an HP-facing phrasing ("your message") used during participant annotation, and a parallel observer phrasing ("the HP's message") used when EMs predict HP labels during evaluation. Both refer to the same construct and differ only in pronoun stance. The codebook lists `text` (HP-facing canonical) and `observerText` (where different) per question. Pairwise questions use HP-facing phrasing only.
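As a concrete reading of "the scorer derives aggregates from items", the sketch below recomputes the two PANAS totals from the 20 item scores. It assumes the standard PANAS split into ten positive-affect and ten negative-affect items; the function name is hypothetical and the scorer's own implementation may differ.

```python
from typing import Dict, Mapping

# Standard PANAS subscales (assumed to match the benchmark's taxonomy).
POSITIVE_ITEMS = ("active", "alert", "attentive", "determined", "enthusiastic",
                  "excited", "inspired", "interested", "proud", "strong")
NEGATIVE_ITEMS = ("afraid", "ashamed", "distressed", "guilty", "hostile",
                  "irritable", "jittery", "nervous", "scared", "upset")


def panas_totals(responses: Mapping[str, int]) -> Dict[str, int]:
    """Sum item scores into positive-affect and negative-affect totals."""
    missing = [item for item in POSITIVE_ITEMS + NEGATIVE_ITEMS if item not in responses]
    if missing:
        raise ValueError(f"missing PANAS items: {missing}")
    return {
        "totalPositiveAffect": sum(responses[item] for item in POSITIVE_ITEMS),
        "totalNegativeAffect": sum(responses[item] for item in NEGATIVE_ITEMS),
    }


# Item scores from the `prePanas` block of the sample conversation above.
pre_panas_responses = {
    "alert": 5, "proud": 4, "upset": 1, "active": 5,
    "afraid": 1, "guilty": 1, "scared": 1, "strong": 3,
    "ashamed": 1, "excited": 4, "hostile": 1, "jittery": 1,
    "nervous": 2, "inspired": 4, "attentive": 5, "irritable": 1,
    "determined": 5, "distressed": 1, "interested": 5, "enthusiastic": 3,
}
totals = panas_totals(pre_panas_responses)  # agrees with the stored totals, 43 and 11
```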
The report displays scores in three groups plus an overall composite:
Turn-by-Turn Scores (averaged across all turns):
| Metric | Description |
|---|---|
| Emotion F1 | F1 score (harmonic mean of precision and recall) for predicted PANAS emotion tags vs ground truth. |
| Emotion VA | Valence-Arousal similarity using Hungarian matching against a precomputed 20×20 VAD similarity matrix. Gives graded credit for near-miss predictions (e.g., "Distressed" vs "Upset" = 0.92 similarity). |
| Emotion Intensity MAE | Mean absolute error on intensity (1-7 scale) for matched emotion tags. |
| Binary OM Accuracy | Whether the EM correctly assesses what the OM did (`observedBehavior`). Pooled metrics include precision, recall, F1, and MCC. |
| Binary HP Accuracy | Whether the EM correctly predicts the HP's preference (`preferredBehavior`). |
| Binary OM/HP (HP-facing) | Same binary questions re-asked with HP-facing text. Stored separately as `_hp` fields. |
| Pairwise Accuracy | How often the EM matches the HP's preferred winner across all 12 comparisons per turn. |
| Kendall Tau | Rank correlation of implied response rankings (original vs alternate vs human) between EM and GT. |
| Draft Judge | LLM-as-judge score comparing EM's drafted response to `humanEdited` reference (optional, requires `--judge-model`). |
| Draft Binary Alignment | Binary preference alignment between EM's draft and HP's `preferredBehavior` (optional, requires `--judge-model`). |
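To make the Emotion F1 row concrete, here is a minimal per-turn sketch that treats predicted and ground-truth tags as sets. The function name is hypothetical, and two choices are assumptions rather than documented scorer behavior: tags are compared case-insensitively, and a turn where both sets are empty (a correctly predicted neutral turn) scores 1.0.

```python
from typing import Iterable


def emotion_f1(predicted: Iterable[str], truth: Iterable[str]) -> float:
    """Set-based F1 between predicted and ground-truth emotion tags for one turn."""
    pred = {tag.strip().lower() for tag in predicted}
    gold = {tag.strip().lower() for tag in truth}
    if not pred and not gold:
        return 1.0  # ASSUMPTION: agreeing on "no mood shift" counts as a full match
    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting `["Proud", "Excited"]` against a ground truth of `["Proud"]` gives precision 0.5 and recall 1.0, so F1 is 2/3.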
Conversation-Wide Scores:
| Metric | Description |
|---|---|
| PANAS Normalized | `1 - MAE/60` for aggregate PA and NA predictions. Scale is 1-7 per item, 10 items per aggregate (range 10-70). |
| PANAS Item Normalized | `1 - item_MAE/6` averaged across all 20 individual PANAS items. |
| PANAS Baseline-Adjusted | Model score relative to a "predict no change from pre-PANAS" naive baseline. 0.0 = no better than naive, 1.0 = perfect, negative = worse than naive. Clamped to [-1, 1]. |
| Q1 Goals | Set overlap on `q1_lookingFor` (intersection/union of selected options). |
| Q2 Clarity | Exact match on `q2_emotionClarity`. |
| Q3 Fit | Exact match + ordinal distance on `q3_modelFit`. |
| Q3 Follow-up | Set overlap with fuzzy matching on `q3_followUp_whatFeltOff`. |
| 4-Branch | Normalized similarity on perceiving/facilitating/understanding/managing scores (1-7 scale; `1 - MAE/max_error`). Per-branch breakdown stored in output. |
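The two aggregate PANAS metrics above can be sketched as follows. `panas_normalized` follows the stated `1 - MAE/60` formula. The baseline-adjusted form is an assumption: the table gives only its anchor points (0.0 at the naive baseline, 1.0 when perfect, clamped to [-1, 1]), which a skill score `1 - model_MAE / baseline_MAE` satisfies, but the scorer's exact formula is not stated here. Function names are hypothetical.

```python
def aggregate_mae(pred_pa: float, pred_na: float, true_pa: float, true_na: float) -> float:
    """Mean absolute error over the two aggregates (PA and NA)."""
    return (abs(pred_pa - true_pa) + abs(pred_na - true_na)) / 2


def panas_normalized(mae: float, max_error: float = 60.0) -> float:
    """1 - MAE/60: aggregates range from 10 to 70, so the largest error is 60."""
    return 1 - mae / max_error


def panas_baseline_adjusted(model_mae: float, baseline_mae: float) -> float:
    """ASSUMED skill-score form, relative to 'predict no change from pre-PANAS'."""
    if baseline_mae == 0:
        # ASSUMPTION: the naive baseline is already perfect, so the model can at
        # best tie it (0.0); any error is scored as worse than naive.
        return 0.0 if model_mae == 0 else -1.0
    return max(-1.0, min(1.0, 1 - model_mae / baseline_mae))


# Sample conversation above: pre-PANAS PA=43, NA=11; post-PANAS PA=49, NA=10.
baseline = aggregate_mae(43, 11, 49, 10)   # naive "no change" prediction
model = aggregate_mae(47, 10, 49, 10)      # a hypothetical EM prediction
adjusted = panas_baseline_adjusted(model, baseline)
```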
Composite Scoring (0-100 scale):
The composite blends 0.5 * emotion_f1 + 0.5 * emotion_va_score for the emotion tracking component. Response similarity is excluded from the composite (informational only). Composite weights:
| Component | Weight | Inputs |
|---|---|---|
| Emotion Tracking | 24% | Blended F1 + VA score |
| Evaluation Quality | 49% | Binary OM/HP accuracy + pairwise accuracy |
| Holistic Comprehension | 27% | PANAS + conv-wide questions + four-branch |
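The weighting in the table can be written out as a short sketch. It assumes each pillar input is already a score in [0, 1]; how the sub-metrics are combined inside the Evaluation Quality and Holistic Comprehension pillars is left to the scorer and is not reproduced here. The function and constant names are hypothetical.

```python
WEIGHTS = {
    "emotion_tracking": 0.24,
    "evaluation_quality": 0.49,
    "holistic_comprehension": 0.27,
}


def composite(
    emotion_f1: float,
    emotion_va_score: float,
    evaluation_quality: float,
    holistic_comprehension: float,
) -> float:
    """Weighted composite on a 0-100 scale from pillar scores in [0, 1]."""
    pillars = {
        # Emotion tracking blends F1 and VA 50/50, as stated above.
        "emotion_tracking": 0.5 * emotion_f1 + 0.5 * emotion_va_score,
        "evaluation_quality": evaluation_quality,
        "holistic_comprehension": holistic_comprehension,
    }
    return 100 * sum(WEIGHTS[name] * pillars[name] for name in WEIGHTS)
```

Because the weights sum to 1, a model scoring 1.0 on every pillar reaches 100.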
The scorer outputs more metrics than the paper analyzes — including exploratory and diagnostic variants of paper-canonical metrics. To compare numbers directly with paper claims, use the canonical-Composite group below.
Paper-canonical Composite-input metrics:
- `Emotion F1` and `Emotion VA` (blended 50/50 inside the emotion-tracking pillar)
- `Binary OM Accuracy` and `Binary HP Accuracy` (averaged inside the evaluation-quality pillar, with `Pairwise Accuracy`)
- `PANAS Baseline-Adjusted`, `4-Branch` (normalized similarity), and the mean of `Q1`/`Q2`/`Q3`/`Q3 Follow-up` (inside the holistic-comprehension pillar)
Reported alongside but not in the Composite:
- `Kendall τ` (correlated with Pairwise Accuracy; reported for interpretability)
- `Draft Judge` (judge-model artifact, supplementary)
Diagnostic-only (related but distinct from paper-canonical):
- `Emotion Hit Rate`: set-level recall on mood-shift tags. Different metric from `Emotion F1`; same input, different formula. Not used in the Composite or paper analyses.
- `PANAS Normalized`: simple `1 - MAE/max_error` accuracy. Different from `PANAS Baseline-Adjusted`, which discounts the trivial "predict no change" baseline. The paper uses the baseline-adjusted version.
- `Response Similarity`: intentionally excluded from the canonical Composite per scorer design.
- `Emotion Intensity MAE`: exploratory; not part of any paper-reported analysis.
The subset_report.py tool surfaces some of these diagnostic metrics directly in its output; see SUBSET_REPORT_README.md for the per-tool metric reference.
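Kendall τ, listed above as reported alongside Pairwise Accuracy, compares the rankings implied by pairwise winners. The sketch below shows one way to compute it: count wins per variant, then take the tau-a statistic (tied pairs contribute zero) over the variants. The record fields `responseA`, `responseB`, and `winner` follow the data format, but the win-count ranking, the variant name `human`, and the function names are assumptions, not the scorer's documented method.

```python
from itertools import combinations
from typing import Dict, List, Mapping


def implied_scores(comparisons: List[Mapping[str, str]]) -> Dict[str, int]:
    """Count pairwise wins per response variant."""
    wins: Dict[str, int] = {}
    for comp in comparisons:
        a, b = comp["responseA"], comp["responseB"]
        wins.setdefault(a, 0)
        wins.setdefault(b, 0)
        wins[a if comp["winner"] == "A" else b] += 1
    return wins


def kendall_tau(scores_x: Mapping[str, int], scores_y: Mapping[str, int]) -> float:
    """Tau-a over the variants present in both score maps."""
    items = sorted(set(scores_x) & set(scores_y))
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0
    total = 0
    for i, j in pairs:
        product = (scores_x[i] - scores_x[j]) * (scores_y[i] - scores_y[j])
        total += (product > 0) - (product < 0)  # +1 concordant, -1 discordant, 0 tie
    return total / len(pairs)


def make_comparisons(winners: List[str]) -> List[Dict[str, str]]:
    """Build the three pairs for one question from a list of 'A'/'B' winners."""
    pairs = [("original", "alternate"), ("original", "human"), ("alternate", "human")]
    return [
        {"responseA": a, "responseB": b, "winner": w}
        for (a, b), w in zip(pairs, winners)
    ]


gt_scores = implied_scores(make_comparisons(["B", "B", "B"]))  # human > alternate > original
em_scores = implied_scores(make_comparisons(["A", "A", "A"]))  # original > alternate > human
```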
```bash
pip install -e ".[dev]"
pytest tests/ -v
```

AttuneBench grounds its evaluation in the Mayer-Salovey-Caruso Four Branch Model of emotional intelligence (Mayer, Caruso, & Salovey 2016): perceiving emotion, using emotion to facilitate thought, understanding emotion, and managing emotion. Benchmark components map pragmatically to these four branches:
- Mood-shift tag predictions → perceiving emotion
- Binary judgments (observed and preferred behavior) → facilitating thought
- Post-conversation PANAS estimation → understanding emotion
- Response drafting and pairwise preference prediction → managing emotion
These are pragmatic approximations rather than strict correspondences, but together span major dimensions of emotional intelligence operationalized as observable behavior. See the accompanying paper for full theoretical grounding, methodology details, and result analyses.
These guidelines are not technically enforced but constitute the expected standard for published results. To ensure comparability and diagnostic value:
- No training on benchmark data. Evaluated models must not be fine-tuned on any conversations, annotations, or metadata from this dataset.
- No access to human annotations during evaluation. The EM must not access binary judgments, pairwise preferences, PANAS labels, or other ground-truth annotations.
- Report evaluation mode. Specify Default, Omniscient, or Verbose. Default Mode is the primary benchmark; other modes should not be presented as primary without qualification.
- No test set contamination. All conversations were collected through March 2026. Users are encouraged to verify that EMs were not exposed to similar content during training.
This dataset is intended for research evaluation of LLM emotional intelligence. Conversations were not collected as therapeutic interactions; participants were adults who were explicitly informed the LLM was not a therapy tool, and the topic list deliberately excludes high-risk content categories (e.g., sex, drugs, self-harm). The released data contain no PII; anonymization comprises manual PII review and redaction, randomized participant identifiers, and coarse-grained demographic and diagnostic categories. Researchers requiring formal de-identification under specific regulatory frameworks (e.g., HIPAA Safe Harbor) should apply additional review for those frameworks.
This repository contains both code and data, released under two licenses:
- Code (Python package `attunebench/`, analyses, top-level utilities, tests): MIT License; see `LICENSE`.
- Data (`Test Samples/` conversation files, per-conversation aggregated metric outputs, `annotation_codebook.json`): Creative Commons Attribution 4.0 International (CC BY 4.0); see `LICENSE-DATA`.
[Citation pending publication]
This release constitutes v1.0. Bug reports and replication questions should be directed to the repository issue tracker (anonymous URL during review; deanonymized post-acceptance).