Thoughtful-Lab/attunebench

AttuneBench

AttuneBench evaluates LLM emotional intelligence (EI) through 200 genuine multi-turn human-model conversations with first-person turn-level annotations of emotional state, model behavior, and response preferences provided by the human participants who held the conversations. The benchmark is grounded in the Mayer-Salovey-Caruso Four Branch Model of EI and measures EI through observable model behavior in conversation: emotion inference, response evaluation, preference anticipation, and outcome prediction.

This repository contains the runner, scorer, and supplementary analysis scripts. The benchmark dataset (200-conversation Sample200 plus four smaller subsamples) is bundled in Test Samples/.

Repository layout

.
├── attunebench/              ← runner + scorer package (CLI: `attune-bench`)
│   ├── runner.py             ← per-turn EM call orchestration
│   ├── scorer.py             ← canonical 3-pillar Composite + per-metric scoring
│   ├── reporter.py           ← `attune-bench report` formatting
│   ├── harness/              ← lm-evaluation-harness task adapter
│   └── prompts/              ← system + user prompt templates per call type
├── analyses/                 ← supplementary analysis scripts (paper-internal)
├── tests/                    ← pytest suite
├── Test Samples/             ← bundled conversation data
│   ├── Sample200/            ← canonical 200-conversation benchmark
│   ├── Subsample100/
│   ├── Subsample50/          ← Verbose Mode coverage in paper runs
│   ├── Subsample25/          ← Omniscient Mode coverage in paper runs
│   └── Subsample20/
├── subset_report.py          ← trait-stratified score filtering tool
├── model_stats.py            ← per-model summary stats from a scores.json
├── predict_profile.py        ← experimental: profile prediction from conversation
├── rejudge.py                ← re-run draft-judge on existing run outputs
├── pyproject.toml
├── QUICKSTART.md
├── README.md                 ← this file
└── SUBSET_REPORT_README.md   ← detailed `subset_report.py` reference

The per-conversation scoring outputs and aggregated scores.json files for the paper's reported runs (across all 11 models × 3 modes) live in the dataset deposit under Experimental Runs/, not in this code repository — see the dataset README for that layout.

Setup Guide

Requirements

Installation

git clone https://anonymous.4open.science/r/attunebench-release-n-1F9B
cd <repo-dir>
pip install -e .

Optional extras:

pip install -e ".[bertscore]"   # BERTScore for response similarity
pip install -e ".[harness]"     # lm-evaluation-harness integration
pip install -e ".[analyses]"    # pandas / matplotlib / seaborn for supplementary analyses
pip install -e ".[dev]"         # pytest, ruff for development

Configuration

Set your API key as an environment variable:

# OpenRouter (default backend)
export OPENROUTER_API_KEY="your-key-here"

# Or OpenAI
export OPENAI_API_KEY="your-key-here"

# Or Prime Intellect
export PRIME_API_KEY="your-key-here"

Prime Intellect also supports ~/.prime/config.json with api_key, team_id, and inference_url. AttuneBench checks RLVR_PRIME_API_KEY / PRIME_API_KEY first, then falls back to that config file.
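
The lookup order described above can be sketched roughly as follows. This is an illustration, not the package's actual code: the function name `resolve_prime_api_key` is hypothetical, and the order between the two environment variables is assumed to match the order in which they are listed.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_prime_api_key(
    env: Optional[Mapping[str, str]] = None,
    config_path: Optional[Path] = None,
) -> Optional[str]:
    """Hypothetical helper: environment variables first, then the config file."""
    env = os.environ if env is None else env
    if config_path is None:
        config_path = Path.home() / ".prime" / "config.json"

    # Assumption: RLVR_PRIME_API_KEY is consulted before PRIME_API_KEY.
    for name in ("RLVR_PRIME_API_KEY", "PRIME_API_KEY"):
        value = env.get(name)
        if value:
            return value

    # Fall back to the "api_key" field of the config file, if it is readable.
    try:
        config = json.loads(config_path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return None
    if not isinstance(config, dict):
        return None
    return config.get("api_key") or None
```

Calling `resolve_prime_api_key()` with no arguments reads the real environment and `~/.prime/config.json`; passing explicit arguments makes the order easy to check in isolation.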

Quick Start

1. Run the benchmark against a model:

# Default mode only (when --modes is omitted)
attune-bench run openrouter anthropic/claude-sonnet-4 <path/to/conversations>

# All 4 modes at once
attune-bench run prime anthropic/claude-sonnet-4 <path/to/conversations> --modes all

# Pick specific modes
attune-bench run openai gpt-4o <path/to/conversations> --modes default verbose

# With draft quality judge (LLM-as-judge + binary alignment)
attune-bench run prime anthropic/claude-sonnet-4 <path/to/conversations> --judge-model anthropic/claude-opus-4.6

# Custom output directory, call timeout, explicit API key
attune-bench run prime anthropic/claude-opus-4.6 <path/to/conversations> \
    -o ./my-results/ --call-timeout 300 --api-key sk-...

Runs are resumable — if a run is interrupted, re-run the same command with the same --output directory. Completed conversations are automatically skipped. Failed conversations are logged to _skipped_<model>.json with re-run instructions.

2. Score the results against ground-truth annotations:

attune-bench score \
    --results ./results/ \
    --ground-truth <path/to/conversations> \
    --output ./scores.json

3. View the report:

attune-bench report --scores ./scores.json

4. (Optional) Trait-stratified analysis — filter scores by participant traits or conversation metadata:

python subset_report.py -s ./scores.json -d <path/to/conversations> --diagnosis ADHD --compare

See SUBSET_REPORT_README.md for the full set of filters and the per-tool metric reference.

Output files

attune-bench run writes to the --output directory:

  • <conversationId>_<provider>_<model>_<mode>.json — one per conversation, containing per-turn EM predictions (emotion tags, both binary phrasings, pairwise rankings, response draft, and judge ratings if --judge-model was set), conversation-wide predictions (Four-Branch ratings, Q1–Q3, post-PANAS estimate), and aggregated per-conversation scores.
  • _skipped_<model>.json — conversations that errored out during the run, with re-run instructions. Re-running the same command resumes from where it left off.

attune-bench score writes a single scores.json:

  • Per-conversation scores (one entry per conversation: every metric defined below, plus the source result file path).
  • Aggregated per-(model, mode) metrics (mean across conversations).
  • Run metadata: model identifier, mode, ground-truth path, number of conversations scored.

subset_report.py does not write to disk by default — it prints a filtered report to stdout.

Evaluation Modes

| Mode | EM context and outputs |
| --- | --- |
| default | EM observes the HP message, drafts a response, then observes the OM response and produces emotion predictions, binary judgments, and pairwise rankings. |
| verbose | Same as default; the EM additionally produces reasoning traces justifying each prediction. |
| omniscient | Same as default; the EM is additionally provided the participant profile, pre-conversation PANAS, and topic attitude at turn 1. |
| verbose_omniscient | Combines verbose (reasoning traces) and omniscient (background context). |

All modes follow the same per-turn call flow:

  1. Draft generation: EM sees the HP's message and drafts how it would respond, before observing the OM response.
  2. Optional draft judging: when --judge-model is provided, a separate judge LLM rates the EM's draft on overall quality, emotional appropriateness, helpfulness, and tone match (1–7 scale).
  3. Emotion inference: EM predicts the HP's current emotional state using PANAS terminology.
  4. Binary judgment prediction: for each applicable binary question, the EM predicts both the observed OM behavior and the HP's preferred behavior. Each question is posed twice, once in first-person phrasing ("in your message") and once in observer phrasing ("in the HP's message"), in separate calls to measure the effect of perspective.
  5. Pairwise preference prediction: EM ranks the three response variants (original OM, model-generated alternate, human-edited golden response) from the HP's perspective; response labels are anonymized to reduce source bias.

The EM's own prior predictions are not passed back as context — each turn's analysis is independent to prevent anchoring bias.

Cost estimation. With one judge model on default mode, each turn makes 5 LLM calls (EM draft, judge quality + binary alignment, EM emotion + binary judgments, HP-facing binary re-prompt, pairwise selections). One additional post-conversation call covers PANAS and the conversation-wide questions. A 5-turn conversation is therefore 5 × 5 + 1 = 26 calls total. Verbose mode adds reasoning traces but does not change the call count; omniscient adds context to existing calls without changing the count.
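
The arithmetic above extends to a whole dataset by summing over conversations. The sketch below is only a back-of-the-envelope estimator: the function name `estimate_llm_calls` is hypothetical, and the default of 5 calls per turn applies to the judged default-mode setup described above.

```python
def estimate_llm_calls(turn_counts, calls_per_turn=5, post_conversation_calls=1):
    """Total LLM calls for one mode, given the number of turns in each conversation."""
    return sum(
        turns * calls_per_turn + post_conversation_calls
        for turns in turn_counts
    )


# One 5-turn conversation: 5 * 5 + 1 = 26 calls.
print(estimate_llm_calls([5]))  # 26

# Three conversations with 4, 5, and 6 turns: (4 + 5 + 6) * 5 + 3 = 78 calls.
print(estimate_llm_calls([4, 5, 6]))  # 78
```

Since every mode follows the same call flow, running several modes should scale the total roughly by the number of modes.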

Run multiple modes at once:

attune-bench run prime anthropic/claude-sonnet-4 ./data/ --modes default verbose
attune-bench run prime anthropic/claude-sonnet-4 ./data/ --modes all  # all 4 modes

Using with HuggingFace lm-evaluation-harness

pip install -e ".[harness]"
lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-3-70B \
    --tasks attunebench

Conversation Data Format

Conversation files (located in Test Samples/) follow the annotated format below. Sample200/ contains 200 conversations meeting protocol-completeness and engagement criteria; conversation IDs are non-sequential because they preserve their position in the original collection (full inclusion criteria documented in the dataverse README).

{
  "conversationId": "429fdfd2-9b97-4f8e-b61f-d5615305bf3f",
  "metadata": {
    "model": "gpt-4o-mini",
    "category": "Family",
    "subtopic": "Children / Family Planning",
    "text": "Family: Children / Family Planning"
  },
  "participant_profile": {
    "schema_version": "2.0",
    "participant_id": "A00002",
    "profile_type": "shortened",
    "demographics": {
      "country": "United States",
      "education": "High school",
      "work_edu_background": ["..."],
      "gender": "Male",
      "age_range": "25-34",
      "english_proficiency": "Native",
      "diagnoses": ["None of the above"]
    },
    "computed_scores": {
      "who_5": 19,
      "asrs_6": 1,
      "aq_10": 6,
      "promis_anxiety": 5,
      "promis_anxiety_t": 48.0,
      "promis_depression": 4,
      "promis_depression_t": 41.0,
      "promis_sleep": 6,
      "promis_sleep_t": 41.1,
      "tipi_extraversion": 9,
      "tipi_agreeableness": 9,
      "tipi_conscientiousness": 11,
      "tipi_stability": 11,
      "tipi_openness": 11
    },
    "conversation_topic_attitude": "Excited/Optimistic"
  },
  "prePanas": {
    "totalPositiveAffect": 43,
    "totalNegativeAffect": 11,
    "responses": {
      "alert": 5, "proud": 4, "upset": 1, "active": 5,
      "afraid": 1, "guilty": 1, "scared": 1, "strong": 3,
      "ashamed": 1, "excited": 4, "hostile": 1, "jittery": 1,
      "nervous": 2, "inspired": 4, "attentive": 5, "irritable": 1,
      "determined": 5, "distressed": 1, "interested": 5, "enthusiastic": 3
    }
  },
  "postPanas": {
    "totalPositiveAffect": 49,
    "totalNegativeAffect": 10,
    "responses": { "...same 20 items, 1-7 each..." : 0 }
  },
  "conversationWideQuestions": {
    "fourBranchScores": {
      "managing": 4, "perceiving": 5, "facilitating": 6, "understanding": 5
    },
    "q1_lookingFor": ["To help me understand or sort out my feelings"],
    "q2_emotionClarity": "Implied or indirect",
    "q3_modelFit": "Mixed, some good moments, some misses",
    "q3_followUp_whatFeltOff": ["..."]
  },
  "turns": [
    {
      "turnNumber": 1,
      "userMessage": "...",
      "llmResponse": "...",
      "moodShiftTags": [
        { "emotion": "Proud", "intensity": 4 }
      ],
      "annotations": {
        "binaryJudgements": [
          { "questionId": "B1", "observedBehavior": "yes", "preferredBehavior": "yes" }
        ],
        "alternateResponses": {
          "llmImproved": "...",
          "humanEdited": "..."
        },
        "pairwiseComparisons": [
          { "questionId": "general", "responseA": "original", "responseB": "alternate", "winner": "A" }
        ],
        "selectedPairwiseQuestions": ["PW7", "PW4", "PW3"]
      }
    }
  ]
}

Key data format details:

  • Metadata: Topic info (category, subtopic, text) and OM model identifier sit directly on metadata.
  • Participant profile: demographics (country, education, work background, gender, age range, English proficiency, self-reported mental health diagnoses); computed_scores (TIPI Big-Five subscores plus WHO-5, AQ-10, ASRS-6, and PROMIS-4 raw + T-scores for anxiety, depression, and sleep); conversation_topic_attitude (participant's pre-conversation attitude toward the assigned topic).
  • PANAS: Both aggregate totals AND all 20 individual item scores (1-7 scale) under responses. The scorer derives aggregates from items for consistency.
  • Mood shift tags: PANAS-SF 20-item taxonomy only. Empty list [] for neutral/stable turns. Intensity is 1-7.
  • Binary judgements: observedBehavior = did the OM do this? preferredBehavior = would the HP have wanted this? Both are yes/no/na. Supports 36 questions (B1–B20 core + B108–B505 extended).
  • Pairwise comparisons: 12 per turn (4 questions × 3 pairs). winner: "A"|"B" is the HP's choice. Each comparison has a unique index key in scored output.
  • Conversation-wide: q1_lookingFor (multi-select), q2_emotionClarity, q3_modelFit, q3_followUp_whatFeltOff (multi-select, populated only when q3_modelFit is negative), fourBranchScores (perceiving / facilitating / understanding / managing on 1–7 scale).
  • Question text lookup: full text for each binary ID (B1–B505) and pairwise ID (PW1–PW16) is consolidated in annotation_codebook.json at the repo root, alongside option lists for conversation-wide Q1/Q2/Q3, Four-Branch definitions, and PANAS taxonomy. Question text is also defined as Python dicts in attunebench/constants.py (BINARY_QUESTION_TEXT, BINARY_HP_QUESTION_TEXT, PAIRWISE_QUESTION_TEXT); the runtime imports those.
  • Two phrasings for binary questions: each binary question has an HP-facing phrasing ("your message") used during participant annotation, and a parallel observer phrasing ("the HP's message") used when EMs predict HP labels during evaluation. Both refer to the same construct and differ only in pronoun stance. The codebook lists text (HP-facing canonical) and observerText (where different) per question. Pairwise questions use HP-facing phrasing only.
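
As a quick consistency check on the PANAS fields, the aggregate totals can be recomputed from the individual items. The sketch below uses the standard PANAS split into ten positive-affect and ten negative-affect terms and the `prePanas.responses` values from the example conversation above; the helper name `panas_totals` is hypothetical, and the scorer's own derivation may differ in detail.

```python
# Standard PANAS item split: 10 positive-affect and 10 negative-affect terms.
PA_ITEMS = ("active", "alert", "attentive", "determined", "enthusiastic",
            "excited", "inspired", "interested", "proud", "strong")
NA_ITEMS = ("afraid", "ashamed", "distressed", "guilty", "hostile",
            "irritable", "jittery", "nervous", "scared", "upset")


def panas_totals(responses):
    """Return (positive_affect, negative_affect) summed from the 20 item scores."""
    positive = sum(responses[item] for item in PA_ITEMS)
    negative = sum(responses[item] for item in NA_ITEMS)
    return positive, negative


# prePanas.responses from the example conversation.
pre_responses = {
    "alert": 5, "proud": 4, "upset": 1, "active": 5,
    "afraid": 1, "guilty": 1, "scared": 1, "strong": 3,
    "ashamed": 1, "excited": 4, "hostile": 1, "jittery": 1,
    "nervous": 2, "inspired": 4, "attentive": 5, "irritable": 1,
    "determined": 5, "distressed": 1, "interested": 5, "enthusiastic": 3,
}

# Matches totalPositiveAffect = 43 and totalNegativeAffect = 11 in the example.
print(panas_totals(pre_responses))  # (43, 11)
```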

Scored Metrics

The report displays scores in three groups plus an overall composite:

Turn-by-Turn Scores (averaged across all turns):

| Metric | Description |
| --- | --- |
| Emotion F1 | F1 score (harmonic mean of precision and recall) for predicted PANAS emotion tags vs ground-truth. |
| Emotion VA | Valence-Arousal similarity using Hungarian matching against a precomputed 20×20 VAD similarity matrix. Gives graded credit for near-miss predictions (e.g., "Distressed" vs "Upset" = 0.92 similarity). |
| Emotion Intensity MAE | Mean absolute error on intensity (1-7 scale) for matched emotion tags. |
| Binary OM Accuracy | Whether the EM correctly assesses what the OM did (observedBehavior). Pooled metrics include precision, recall, F1, and MCC. |
| Binary HP Accuracy | Whether the EM correctly predicts the HP's preference (preferredBehavior). |
| Binary OM/HP (HP-facing) | Same binary questions re-asked with HP-facing text. Stored separately as _hp fields. |
| Pairwise Accuracy | How often the EM matches the HP's preferred winner across all 12 comparisons per turn. |
| Kendall Tau | Rank correlation of implied response rankings (original vs alternate vs human) between EM and GT. |
| Draft Judge | LLM-as-judge score comparing EM's drafted response to humanEdited reference (optional, requires --judge-model). |
| Draft Binary Alignment | Binary preference alignment between EM's draft and HP's preferredBehavior (optional, requires --judge-model). |
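
To make the Emotion F1 row concrete, here is a minimal set-based sketch for a single turn. It is not the scorer's implementation: the function name `emotion_f1` is hypothetical, tags are compared case-insensitively by assumption, and scoring an empty prediction against an empty ground truth as 1.0 is also an assumption.

```python
def emotion_f1(predicted, ground_truth):
    """Set-based F1 between predicted and ground-truth emotion tags for one turn."""
    pred = {tag.strip().lower() for tag in predicted}
    gold = {tag.strip().lower() for tag in ground_truth}

    if not pred and not gold:
        # Assumption: agreeing that the turn had no mood shift is a perfect match.
        return 1.0

    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0

    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Precision 1/2, recall 1/1, so F1 = 2 * 0.5 * 1.0 / 1.5.
print(round(emotion_f1(["Proud", "Excited"], ["Proud"]), 3))  # 0.667
```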

Conversation-Wide Scores:

| Metric | Description |
| --- | --- |
| PANAS | Normalized 1 - MAE/60 for aggregate PA and NA predictions. Scale is 1-7 per item, 10 items per aggregate (range 10-70). |
| PANAS Item | Normalized 1 - item_MAE/6 averaged across all 20 individual PANAS items. |
| PANAS Baseline-Adjusted | Model score relative to a "predict no change from pre-PANAS" naive baseline. 0.0 = no better than naive, 1.0 = perfect, negative = worse than naive. Clamped to [-1, 1]. |
| Q1 Goals | Set overlap on q1_lookingFor (intersection/union of selected options). |
| Q2 Clarity | Exact match on q2_emotionClarity. |
| Q3 Fit | Exact match + ordinal distance on q3_modelFit. |
| Q3 Follow-up | Set overlap with fuzzy matching on q3_followUp_whatFeltOff. |
| 4-Branch | Normalized similarity on perceiving/facilitating/understanding/managing scores (1-7 scale; 1 - MAE/max_error). Per-branch breakdown stored in output. |
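
The PANAS and Q1 Goals rows can be illustrated with a short sketch. The 1 - MAE/60 and intersection/union formulas come from the table; the baseline-adjusted form shown here (1 - model_MAE / naive_MAE, clamped) is an assumption chosen to match the described behavior, and all function names are hypothetical.

```python
def aggregate_mae(pa_a, na_a, pa_b, na_b):
    """Mean absolute error over the two PANAS aggregates (PA and NA)."""
    return (abs(pa_a - pa_b) + abs(na_a - na_b)) / 2


def panas_normalized(model_mae):
    """1 - MAE/60: each aggregate ranges 10-70, so the largest possible error is 60."""
    return 1 - model_mae / 60


def panas_baseline_adjusted(model_mae, naive_mae):
    """Assumed form: skill relative to predicting 'no change from pre-PANAS'."""
    if naive_mae == 0:
        # Assumption: when the naive baseline is already perfect, only a perfect
        # prediction avoids the minimum score.
        return 1.0 if model_mae == 0 else -1.0
    return max(-1.0, min(1.0, 1 - model_mae / naive_mae))


def jaccard(predicted, ground_truth):
    """Intersection over union of two option sets (Q1 Goals)."""
    pred, gold = set(predicted), set(ground_truth)
    if not pred and not gold:
        return 1.0  # Assumption: two empty selections count as full agreement.
    return len(pred & gold) / len(pred | gold)


# Example conversation: pre-PANAS (43, 11), post-PANAS (49, 10).
# Suppose, purely for illustration, that the EM predicts post-PANAS (47, 10).
model_mae = aggregate_mae(47, 10, 49, 10)   # (2 + 0) / 2 = 1.0
naive_mae = aggregate_mae(43, 11, 49, 10)   # (6 + 1) / 2 = 3.5

print(round(panas_normalized(model_mae), 3))                     # 0.983
print(round(panas_baseline_adjusted(model_mae, naive_mae), 3))   # 0.714
print(jaccard({"goal A", "goal B"}, {"goal A"}))                 # 0.5
```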

Composite Scoring (0-100 scale):

The emotion tracking component is a 50/50 blend: 0.5 * emotion_f1 + 0.5 * emotion_va_score. Response similarity is excluded from the composite (informational only). Composite weights:

| Component | Weight | Inputs |
| --- | --- | --- |
| Emotion Tracking | 24% | Blended F1 + VA score |
| Evaluation Quality | 49% | Binary OM/HP accuracy + pairwise accuracy |
| Holistic Comprehension | 27% | PANAS + conv-wide questions + four-branch |

A note on metric naming and what's in the Composite

The scorer outputs more metrics than the paper analyzes — including exploratory and diagnostic variants of paper-canonical metrics. To compare numbers directly with paper claims, use the canonical-Composite group below.

Paper-canonical Composite-input metrics:

  • Emotion F1 and Emotion VA (blended 50/50 inside the emotion-tracking pillar)
  • Binary OM Accuracy and Binary HP Accuracy (averaged inside the evaluation-quality pillar, with Pairwise Accuracy)
  • PANAS Baseline-Adjusted, 4-Branch (normalized similarity), and the mean of Q1/Q2/Q3/Q3 Follow-up (inside the holistic-comprehension pillar)
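
Putting the three pillars together, the Composite can be sketched as below. The 24/49/27 weights and the 50/50 emotion blend come from this README; everything else is an assumption made for illustration: the equal-weight means inside the evaluation-quality and holistic-comprehension pillars, the function name `composite_score`, and every dictionary key other than `emotion_f1` and `emotion_va_score`. Consult `attunebench/scorer.py` for the canonical computation.

```python
PILLAR_WEIGHTS = {
    "emotion_tracking": 0.24,
    "evaluation_quality": 0.49,
    "holistic_comprehension": 0.27,
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def composite_score(m):
    """Weighted 0-100 composite from per-metric scores (mostly on a 0-1 scale)."""
    emotion = 0.5 * m["emotion_f1"] + 0.5 * m["emotion_va_score"]

    # Assumption: equal-weight mean of the three evaluation-quality inputs.
    evaluation = mean([
        m["binary_om_accuracy"], m["binary_hp_accuracy"], m["pairwise_accuracy"],
    ])

    # Conversation-wide questions enter as a single mean, per the list above.
    conv_wide = mean([m["q1"], m["q2"], m["q3"], m["q3_followup"]])
    # Assumption: equal-weight mean of the three holistic inputs.
    holistic = mean([m["panas_baseline_adjusted"], m["four_branch"], conv_wide])

    pillars = {
        "emotion_tracking": emotion,
        "evaluation_quality": evaluation,
        "holistic_comprehension": holistic,
    }
    return 100 * sum(PILLAR_WEIGHTS[name] * pillars[name] for name in PILLAR_WEIGHTS)


# Sanity check: a model that is perfect on every input scores 100.
perfect = dict.fromkeys([
    "emotion_f1", "emotion_va_score", "binary_om_accuracy", "binary_hp_accuracy",
    "pairwise_accuracy", "q1", "q2", "q3", "q3_followup",
    "panas_baseline_adjusted", "four_branch",
], 1.0)
print(round(composite_score(perfect), 1))  # 100.0
```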

Reported alongside but not in the Composite:

  • Kendall τ (correlated with Pairwise Accuracy; reported for interpretability)
  • Draft Judge (judge-model artifact, supplementary)

Diagnostic-only (related but distinct from paper-canonical):

  • Emotion Hit Rate — set-level recall on mood-shift tags. Different metric from Emotion F1; same input, different formula. Not used in the Composite or paper analyses.
  • PANAS Normalized — simple 1 - MAE/max_error accuracy. Different from PANAS Baseline-Adjusted, which discounts the trivial "predict no change" baseline. The paper uses the baseline-adjusted version.
  • Response Similarity — intentionally excluded from the canonical Composite per scorer design.
  • Emotion Intensity MAE — exploratory; not part of any paper-reported analysis.

The subset_report.py tool surfaces some of these diagnostic metrics directly in its output; see SUBSET_REPORT_README.md for the per-tool metric reference.

Running Tests

pip install -e ".[dev]"
pytest tests/ -v

Theoretical Framework

AttuneBench grounds its evaluation in the Mayer-Salovey-Caruso Four Branch Model of emotional intelligence (Mayer, Caruso, & Salovey 2016): perceiving emotion, using emotion to facilitate thought, understanding emotion, and managing emotion. Benchmark components map pragmatically to these four branches:

  • Mood-shift tag predictions → perceiving emotion
  • Binary judgments (observed and preferred behavior) → facilitating thought
  • Post-conversation PANAS estimation → understanding emotion
  • Response drafting and pairwise preference prediction → managing emotion

These are pragmatic approximations rather than strict correspondences, but together span major dimensions of emotional intelligence operationalized as observable behavior. See the accompanying paper for full theoretical grounding, methodology details, and result analyses.

Usage Guidelines

These guidelines are not technically enforced but constitute the expected standard for published results. To ensure comparability and diagnostic value:

  • No training on benchmark data. Evaluated models must not be fine-tuned on any conversations, annotations, or metadata from this dataset.
  • No access to human annotations during evaluation. The EM must not access binary judgments, pairwise preferences, PANAS labels, or other ground-truth annotations.
  • Report evaluation mode. Specify which mode was used (default, verbose, omniscient, or verbose_omniscient). Default mode is the primary benchmark; other modes should not be presented as primary without qualification.
  • No test set contamination. All conversations were collected through March 2026. Users are encouraged to verify that EMs were not exposed to similar content during training.

Limitations

This dataset is intended for research evaluation of LLM emotional intelligence. Conversations were not collected as therapeutic interactions; participants were adults who were explicitly informed the LLM was not a therapy tool, and the topic list deliberately excludes high-risk content categories (e.g., sex, drugs, self-harm). The released data contain no PII; anonymization comprises manual PII review and redaction, randomized participant identifiers, and coarse-grained demographic and diagnostic categories. Researchers requiring formal de-identification under specific regulatory frameworks (e.g., HIPAA Safe Harbor) should apply additional review for those frameworks.

License

This repository contains both code and data, released under two licenses:

  • Code (Python package attunebench/, analyses, top-level utilities, tests) — MIT License; see LICENSE.
  • Data (Test Samples/ conversation files, per-conversation aggregated metric outputs, annotation_codebook.json) — Creative Commons Attribution 4.0 International (CC BY 4.0); see LICENSE-DATA.

Citation

[Citation pending publication]

Maintenance

This release constitutes v1.0. Bug reports and replication questions should be directed to the repository issue tracker (anonymous URL during review; deanonymized post-acceptance).
