# Explore Candidate ToM Benchmarks

This notebook explores several Theory of Mind (ToM) benchmarks that evaluate how well language models reason about agents' mental states, beliefs, and intentions.

**Benchmarks covered:**
- **ToMi** - Theory of Mind Inventory with first/second-order belief tracking
- **FANToM** - A benchmark for stress-testing machine ToM in conversations
- **SimpleToM** - Allen AI's simplified ToM evaluation dataset

## Setup

Install required dependencies for loading and evaluating datasets.

In [None]:
! pip install evaluate datasets

---

## ToMi (Theory of Mind Inventory)

ToMi is a benchmark from Facebook Research that tests models on false-belief understanding through story-based question answering.

**Key features:**
- **First-order beliefs**: "Where does X think the object is?"
- **Second-order beliefs**: "Where does X think Y thinks the object is?"
- **ToM vs No-ToM**: Questions that require mental state reasoning vs. simple memory retrieval

**Paper:** [ToMi: A Test of Theory of Mind in Language Models](https://arxiv.org/abs/1909.01871)
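To make the first-order versus second-order distinction concrete, the sketch below replays a Sally-Anne style story and records what each agent believes. It is an illustration only, not ToMi's own generator: the `track_beliefs` function, the event tuples, and the rule that agents update beliefs only when they witness a move are all simplifying assumptions.

```python
def track_beliefs(events):
    """Replay events and return (true_location, first_order, second_order).

    events: list of tuples, each one of
        ("enter", agent), ("exit", agent), ("move", agent, location)
    first_order[x]       -> where x believes the object is
    second_order[(x, y)] -> where x believes y believes the object is
    """
    present = set()      # agents currently in the room
    location = None      # true location of the object
    first_order = {}
    second_order = {}
    for event in events:
        kind, agent = event[0], event[1]
        if kind == "enter":
            present.add(agent)
        elif kind == "exit":
            present.discard(agent)
        elif kind == "move":
            location = event[2]
            # Only agents in the room witness the move and update their beliefs.
            for x in present:
                first_order[x] = location
                for y in present:
                    if x != y:
                        second_order[(x, y)] = location
    return location, first_order, second_order


story = [
    ("enter", "Sally"),
    ("enter", "Anne"),
    ("move", "Sally", "basket"),
    ("exit", "Sally"),
    ("move", "Anne", "box"),
]
location, first_order, second_order = track_beliefs(story)
print(location)                         # true location of the object
print(first_order["Sally"])             # first-order: where Sally thinks it is
print(second_order[("Anne", "Sally")])  # second-order: where Anne thinks Sally thinks it is
```

Sally leaves before the second move, so her belief stays at the earlier location while the true location changes. That mismatch is the false-belief case the ToM questions target.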

### Install ToMi and Dependencies

In [None]:
! git clone https://github.com/facebookresearch/ToMi.git tomi
! rm -rf tomi/.git
! pip install tqdm pandas matplotlib seaborn numpy jupyterlab openpyxl transformers datasets evaluate wandb

In [None]:
! unzip -o -q tomi/tomi_balanced_story_types.zip -d tomi/

### Extract Question-Answer Pairs

The extractor script processes the raw story traces into structured JSONL files, organized by:
- Order (first-order vs second-order belief questions)
- Story type (0 or 1)
- ToM requirement (whether mental state reasoning is needed)

In [None]:
! python tomi/tomi_pair_extractor.py --data_dir ./tomi/tomi_balanced_story_types --output_dir ./tomi/tomi_pairs --split test
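Before formatting prompts, it can help to check how the extracted examples are distributed across these categories. The sketch below is a hypothetical helper (`summarize_pairs` is not part of the ToMi repository). It assumes each JSONL record carries the `question_type`, `story_type`, and `requires_tom` fields that the formatting function later in this notebook reads.

```python
import json
from collections import Counter


def summarize_pairs(jsonl_path):
    """Count extracted examples per (question_type, story_type, requires_tom)."""
    counts = Counter()
    with open(jsonl_path) as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            ex = json.loads(line)
            key = (ex["question_type"], ex["story_type"], ex["requires_tom"])
            counts[key] += 1
    return counts


# Example usage (the path is one of the files produced by the extractor above):
# for key, n in summarize_pairs("tomi/tomi_pairs/first_order_1_tom.jsonl").items():
#     print(key, n)
```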

### Format Data for Model Inference

Convert the extracted JSONL files into a prompt format suitable for LLM evaluation.

In [None]:
import json
from pathlib import Path
from typing import Optional


def format_for_inference(jsonl_path: str, output_path: Optional[str] = None):
    """Convert ToMi JSONL to inference-ready format.
    
    Args:
        jsonl_path: Path to input JSONL file with story/question pairs
        output_path: Optional output path (defaults to input path with '_prompts' suffix)
    
    Returns:
        List of formatted examples with prompts and metadata
    """
    examples = []
    with open(jsonl_path) as f:
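        for line in f:
            if not line.strip():
                continue  # skip blank lines in the JSONL file
            ex = json.loads(line)

            # Clean up story (strip the leading line number from each line).
            # Slicing with [:1] avoids an IndexError on empty story lines.
            story_lines = ex['story'].split('\n')
            clean_story = '\n'.join(
                story_line.split(' ', 1)[1]
                if story_line[:1].isdigit() and ' ' in story_line
                else story_line
                for story_line in story_lines
            )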
            
            # Format prompt
            prompt = f"""Story:
{clean_story}

Question: {ex['question']}
Answer:"""
            
            examples.append({
                'prompt': prompt,
                'answer': ex['answer'],
                'question_type': ex['question_type'],
                'story_type': ex['story_type'],
                'requires_tom': ex['requires_tom'],
            })
    
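    # Save. The default name is derived from the file stem, so the input
    # file is never overwritten even if it lacks a '.jsonl' extension.
    src = Path(jsonl_path)
    out_path = output_path or str(src.with_name(src.stem + '_prompts.jsonl'))
    with open(out_path, 'w') as f: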
        for ex in examples:
            f.write(json.dumps(ex) + '\n')
    
    print(f"Saved {len(examples)} prompts to {out_path}")
    return examples


def evaluate_response(response: str, correct_answer: str) -> bool:
    """Check if model response contains the correct answer.
    
    Uses case-insensitive substring matching.
    """
    response_lower = response.lower().strip()
    answer_lower = correct_answer.lower()
    return answer_lower in response_lower
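Substring matching is lenient: an answer such as `box` would also match a response that says `toolbox`. As one possible alternative, the sketch below matches the answer only as a whole word. The function name `evaluate_response_strict` is hypothetical, and the approach assumes answers begin and end with word characters (letters, digits, or underscores), since `\b` only marks a boundary next to those.

```python
import re


def evaluate_response_strict(response: str, correct_answer: str) -> bool:
    """Check whether the answer appears in the response as a whole word.

    Case-insensitive. Underscores count as word characters, so 'box'
    does not match inside 'blue_box' or 'toolbox'.
    """
    answer = correct_answer.lower().strip()
    pattern = r"\b" + re.escape(answer) + r"\b"
    return re.search(pattern, response.lower()) is not None


print(evaluate_response_strict("It is in the toolbox.", "box"))  # False
print(evaluate_response_strict("It is in the Box.", "box"))      # True
```

Neither matcher handles a response that lists several containers, so results are best spot-checked by hand.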

### Generate Prompt Files

Process all ToMi data splits into inference-ready prompts.

In [None]:
# First-order belief questions
format_for_inference('tomi/tomi_pairs/first_order_0_no_tom.jsonl')
format_for_inference('tomi/tomi_pairs/first_order_1_no_tom.jsonl')
format_for_inference('tomi/tomi_pairs/first_order_1_tom.jsonl')

# Second-order belief questions
format_for_inference('tomi/tomi_pairs/second_order_0_tom.jsonl')
format_for_inference('tomi/tomi_pairs/second_order_0_no_tom.jsonl')
format_for_inference('tomi/tomi_pairs/second_order_1_tom.jsonl')
format_for_inference('tomi/tomi_pairs/second_order_1_no_tom.jsonl')
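Once prompts are generated, a natural next step is to compare accuracy on questions that require ToM against those that do not. The sketch below is one way to do that. The names `accuracy_by_tom` and `generate_fn` are hypothetical: `generate_fn` stands in for whatever callable maps a prompt string to a model response. Scoring uses the same case-insensitive substring check as `evaluate_response`.

```python
from collections import defaultdict


def accuracy_by_tom(examples, generate_fn):
    """Return accuracy per `requires_tom` value.

    examples:    list of dicts with 'prompt', 'answer', 'requires_tom'
                 (the structure returned by format_for_inference)
    generate_fn: callable taking a prompt string and returning a response string
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for ex in examples:
        response = generate_fn(ex["prompt"])
        key = ex["requires_tom"]
        totals[key] += 1
        # Case-insensitive substring match, as in evaluate_response.
        if ex["answer"].lower() in response.lower():
            correct[key] += 1
    return {key: correct[key] / totals[key] for key in totals}


# Example usage with a hypothetical model wrapper:
# examples = format_for_inference('tomi/tomi_pairs/first_order_1_tom.jsonl')
# print(accuracy_by_tom(examples, generate_fn=my_model_generate))
```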

---

## FANToM

FANToM (FAct-tracking in NaTural cOnversations with Models) is a benchmark that tests ToM capabilities through naturalistic multi-party conversations.

**Key features:**
- Realistic conversational scenarios with multiple speakers
- Information asymmetry between conversation participants
- Tests whether models can track who knows what

**Paper:** [FANToM: A Benchmark for Stress-Testing Machine Theory of Mind in Interactions](https://arxiv.org/abs/2310.15421)
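The idea of information asymmetry can be illustrated with a small sketch: a participant knows only the facts mentioned while they were present in the conversation. This is not FANToM's data format or evaluation code. The `who_knows_what` helper, the turn structure, and the example names and facts are all made up for illustration.

```python
def who_knows_what(turns):
    """Map each participant to the set of facts they were present for.

    turns: list of (present_participants, fact) tuples, in conversation order
    """
    knowledge = {}
    for present, fact in turns:
        for person in present:
            knowledge.setdefault(person, set()).add(fact)
    return knowledge


turns = [
    ({"Kailey", "Linda", "David"}, "Linda adopted a dog"),
    ({"Linda", "David"}, "the dog is named Max"),  # Kailey has left
]
knowledge = who_knows_what(turns)
print("the dog is named Max" in knowledge["Kailey"])  # False
print("the dog is named Max" in knowledge["David"])   # True
```

A model with good ToM should answer questions about Kailey's beliefs using only what Kailey heard, even though the full conversation is visible in the prompt.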

### Setup and Run FANToM Evaluation

FANToM requires a separate conda environment due to specific dependency requirements. The commands below assume the FANToM repository is already available under `fantom/`.

In [None]:
! conda env create -f fantom/environment.yml
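# Each `!` command runs in its own subshell, so `conda activate` would not
# carry over to the next line. `conda run` executes inside the environment.
! conda run -n fantom python fantom/eval_fantom.py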

---

## SimpleToM

SimpleToM from Allen AI provides a clean, focused evaluation of ToM capabilities across three question types.

**Question types:**
- **Mental-state QA**: Questions about characters' beliefs and knowledge
- **Behavior QA**: Predicting actions based on mental states
- **Judgment QA**: Evaluating decisions given information asymmetry

**Paper:** [SimpleToM: Exposing the Gap between Explicit ToM Inference and Implicit ToM Application in LLMs](https://arxiv.org/abs/2410.13648)

### Load SimpleToM Dataset

In [None]:
from datasets import load_dataset

In [None]:
simple_tom_mental = load_dataset("allenai/SimpleToM", 'mental-state-qa')
simple_tom_behavior = load_dataset("allenai/SimpleToM", 'behavior-qa')
simple_tom_judgment = load_dataset("allenai/SimpleToM", 'judgment-qa')

### Explore Dataset Structure

In [None]:
print("Mental-state QA:", simple_tom_mental)
print("\nBehavior QA:", simple_tom_behavior)
print("\nJudgment QA:", simple_tom_judgment)
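Printing the dataset objects shows the splits and columns, but a compact summary with row counts and field names can be easier to scan. The helper below is a minimal sketch (`describe_splits` is a hypothetical name). It relies only on each split supporting `len()` and integer indexing, which holds for the objects returned by `load_dataset`. It makes no assumption about which fields SimpleToM actually contains.

```python
def describe_splits(dataset_dict):
    """Summarize each split: number of rows and field names of the first row."""
    summary = {}
    for split_name, split in dataset_dict.items():
        first = split[0] if len(split) > 0 else {}
        summary[split_name] = {
            "num_rows": len(split),
            "fields": sorted(first.keys()),
        }
    return summary


# Example usage with the datasets loaded above:
# for name, info in describe_splits(simple_tom_mental).items():
#     print(name, info["num_rows"], info["fields"])
```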