# Hangman Bench Experimental Plan

This notebook captures a reproducible blueprint for analysing the Hangman Bench codebase, planning model evaluations, and preparing a research paper that reports the benchmark results.

## Repository at a Glance

The following code cell enumerates the key Python modules that implement the benchmark along with companion analysis utilities bundled with the repository.

In [1]:
from pathlib import Path

src_files = sorted(Path('src/hangman_bench').glob('**/*.py'))
analysis_files = sorted(Path('analysis').glob('*.py'))

print('Source modules:')
for path in src_files:
    print(f"  - {path}")

print('\nAnalysis utilities:')
for path in analysis_files:
    print(f"  - {path}")

Source modules:
  - src/hangman_bench/__init__.py
  - src/hangman_bench/datasets.py
  - src/hangman_bench/hangman.py

Analysis utilities:
  - analysis/bin_difficulty.py
  - analysis/cost_estimation.py
  - analysis/extract_wordlist.py
  - analysis/ingest_simulation.py
  - analysis/measure_difficulty.py
  - analysis/reclassified_from_coverage.py
  - analysis/reclassified_from_freq.py
  - analysis/reclassify_words.py
  - analysis/zen_hangman.py


## Dataset Composition

Hangman Bench ships with a curated, difficulty-annotated English word list. The cell below summarises the distribution of words per difficulty tier and reports descriptive statistics that are relevant when sampling evaluation subsets.

In [2]:
import importlib.util
from collections import Counter
from statistics import mean
from pathlib import Path

spec = importlib.util.spec_from_file_location('hangman_datasets', Path('src/hangman_bench/datasets.py'))
datasets = importlib.util.module_from_spec(spec)
spec.loader.exec_module(datasets)

words = datasets.get_words_by_language(datasets.Language.ENGLISH)
difficulty_counts = Counter(word.difficulty for word in words)
lengths = [len(word.word) for word in words]

print(f"Total words: {len(words)}")
print('Difficulty distribution:')
for difficulty, count in sorted(difficulty_counts.items()):
    pct = (count / len(words)) * 100
    print(f"  - {difficulty:7s}: {count:3d} words ({pct:5.1f}%)")

print('\nWord length statistics:')
print(f"  - Min: {min(lengths)} characters")
# Upper median: for an even-sized list this picks the larger of the two middle values.
print(f"  - Median: {sorted(lengths)[len(lengths) // 2]} characters")
print(f"  - Mean: {mean(lengths):.2f} characters")
print(f"  - Max: {max(lengths)} characters")

Total words: 100
Difficulty distribution:
  - easy   :  20 words ( 20.0%)
  - hard   :  20 words ( 20.0%)
  - medium :  20 words ( 20.0%)
  - v_easy :  20 words ( 20.0%)
  - v_hard :  20 words ( 20.0%)

Word length statistics:
  - Min: 3 characters
  - Median: 6 characters
  - Mean: 6.28 characters
  - Max: 10 characters
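
Because the tiers are evenly populated, an evaluation subset can be drawn with the same number of words from each tier. The sketch below shows one way to do that reproducibly. `WordEntry` and `stratified_sample` are hypothetical names introduced for illustration; the only assumption about the real records is that they expose `word` and `difficulty` attributes, as used in the cell above.

```python
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class WordEntry:
    """Hypothetical stand-in for the benchmark's word records."""
    word: str
    difficulty: str


def stratified_sample(entries, per_tier, seed=0):
    """Draw `per_tier` entries from every difficulty tier, reproducibly."""
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    by_tier = defaultdict(list)
    for entry in entries:
        by_tier[entry.difficulty].append(entry)

    sample = []
    for tier in sorted(by_tier):  # sorted for a deterministic tier order
        pool = by_tier[tier]
        if per_tier > len(pool):
            raise ValueError(f"tier {tier!r} has only {len(pool)} words")
        sample.extend(rng.sample(pool, per_tier))
    return sample


# Toy data: five tiers with four placeholder words each.
tiers = ["v_easy", "easy", "medium", "hard", "v_hard"]
toy_words = [WordEntry(f"{tier}_word{i}", tier) for tier in tiers for i in range(4)]

subset = stratified_sample(toy_words, per_tier=2, seed=42)
print(f"Sampled {len(subset)} words across {len(tiers)} tiers")
```

Fixing the seed keeps the subset identical across reruns, which matters when the same words must be presented to every model.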


## Task Configuration Signals

Because the Inspect runtime is an optional dependency, we analyse the task definition statically. The next cell parses `hangman.py` with Python's `ast` module to expose the default parameters that shape each evaluation run.

In [3]:
from pathlib import Path
import ast

source = Path('src/hangman_bench/hangman.py').read_text()
module = ast.parse(source)

defaults = {}
constants = {}

for node in module.body:
    if isinstance(node, ast.Assign):
        for target in node.targets:
            if isinstance(target, ast.Name):
                try:
                    constants[target.id] = ast.literal_eval(node.value)
                except Exception:
                    pass
    if isinstance(node, ast.FunctionDef) and node.name == 'hangman':
        args = node.args
        positional = args.args
        default_values = [None] * (len(positional) - len(args.defaults)) + list(args.defaults)
        for arg, default in zip(positional, default_values):
            if default is None:
                defaults[arg.arg] = None
            else:
                try:
                    defaults[arg.arg] = ast.literal_eval(default)
                except Exception:
                    defaults[arg.arg] = ast.unparse(default)

print('Top-level constants:')
for key in sorted(constants):
    if key.isupper():
        print(f"  - {key} = {constants[key]}")

print('\n`hangman` task parameters:')
for key in defaults:
    print(f"  - {key}: default={defaults[key]}")

Top-level constants:
  - DEFAULT_MAX_GUESSES = 10
  - NUM_ALLOWABLE_EXTRA_MESSAGES = 5

`hangman` task parameters:
  - language: default=DEFAULT_LANGUAGE.value
  - difficulty: default=None
  - max_guesses: default=DEFAULT_MAX_GUESSES
  - shuffle: default=True
  - allow_word_guesses: default=False
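
The extraction above relies on one detail of the `ast` representation: `args.defaults` lists only the defaults of the trailing parameters, so it must be left-padded before being zipped with the parameter names. The self-contained sketch below replays that alignment on a toy function. The source string, `NO_DEFAULT`, and the `"<required>"` label are illustrative and not part of the benchmark.

```python
import ast

# Toy source standing in for a task definition; all names are illustrative.
toy_source = '''
LIMIT = 10

def task(language, difficulty=None, max_guesses=LIMIT, shuffle=True):
    pass
'''

func = next(n for n in ast.parse(toy_source).body if isinstance(n, ast.FunctionDef))
params = func.args.args

# `args.defaults` covers only the trailing parameters, so pad on the left.
NO_DEFAULT = object()  # sentinel that cannot be confused with a literal `None` default
missing = len(params) - len(func.args.defaults)
aligned = [NO_DEFAULT] * missing + list(func.args.defaults)

extracted = {}
for param, default in zip(params, aligned):
    if default is NO_DEFAULT:
        extracted[param.arg] = "<required>"
        continue
    try:
        # Literals (None, True, numbers, strings) evaluate directly.
        extracted[param.arg] = ast.literal_eval(default)
    except ValueError:
        # Names and attribute lookups are kept as source text instead.
        extracted[param.arg] = ast.unparse(default)

print(extracted)
```

Using a dedicated sentinel rather than `None` keeps "no default" distinguishable from "default is `None`", which the printed summary above does not separate.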


## Experiment Matrix

The following experiment plan enumerates the benchmark scenarios we will run. Each row captures the evaluation axis, the hypothesis under test, and any task parameters that need to be toggled.

In [4]:
experiments = [
    {
        'axis': 'Difficulty sensitivity',
        'description': 'Evaluate performance separately on each difficulty tier to quantify robustness across easy and hard words.',
        'task_parameters': 'language="english"; difficulty in {v_easy, easy, medium, hard, v_hard}',
        'primary_metrics': 'Grouped accuracy, stderr',
    },
    {
        'axis': 'Word guessing ability',
        'description': 'Allow full-word submissions to measure whether models leverage early inference versus letter-by-letter play.',
        'task_parameters': 'allow_word_guesses=True',
        'primary_metrics': 'Win rate, number of incorrect guesses',
    },
    {
        'axis': 'Guess budget',
        'description': 'Stress-test models with tighter mistake budgets (max_guesses in {6, 8, 10}) to see how strategic play adapts.',
        'task_parameters': 'max_guesses varied',
        'primary_metrics': 'Win rate, mean remaining guesses',
    },
    {
        'axis': 'Language generalisation',
        'description': 'If additional dictionaries are added, repeat the mixed-difficulty evaluation for each supported language.',
        'task_parameters': 'language varied; shuffle=True',
        'primary_metrics': 'Accuracy per language, stderr',
    },
]

headers = ['Axis', 'Description', 'Task parameters', 'Primary metrics']
col_widths = []
for header, key in zip(headers, ['axis', 'description', 'task_parameters', 'primary_metrics']):
    col_width = len(header)
    for row in experiments:
        col_width = max(col_width, len(row[key]))
    col_widths.append(col_width)

header_row = ' | '.join(header.ljust(width) for header, width in zip(headers, col_widths))
print(header_row)
print('-+-'.join('-' * width for width in col_widths))
for row in experiments:
    print(' | '.join(row[key].ljust(width) for key, width in zip(['axis', 'description', 'task_parameters', 'primary_metrics'], col_widths)))

Axis                    | Description                                                                                                   | Task parameters                                                        | Primary metrics                      
------------------------+---------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------+--------------------------------------
Difficulty sensitivity  | Evaluate performance separately on each difficulty tier to quantify robustness across easy and hard words.    | language="english"; difficulty in {v_easy, easy, medium, hard, v_hard} | Grouped accuracy, stderr             
Word guessing ability   | Allow full-word submissions to measure whether models leverage early inference versus letter-by-letter play.  | allow_word_guesses=True                                                | Win rate, number of incorrect guesses
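Guess budget            | Stress-test models with tighter mistake budgets (max_guesses in {6, 8, 10}) to see how strategic play adapts. | max_guesses varied                                                     | Win rate, mean remaining guesses     
Language generalisation | If additional dictionaries are added, repeat the mixed-difficulty evaluation for each supported language.     | language varied; shuffle=True                                          | Accuracy per language, stderr        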
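
Several rows of the matrix report accuracy with a standard error per group. As a minimal sketch of that metric, the function below computes per-group accuracy and the sample standard error of the mean from `(group, won)` pairs. The function name and the toy outcomes are hypothetical, and the scorer shipped with Inspect may compute its `stderr` differently.

```python
import math
from collections import defaultdict


def grouped_accuracy(results):
    """Map each group to (accuracy, standard error) from (group, won) pairs."""
    outcomes = defaultdict(list)
    for group, won in results:
        outcomes[group].append(1.0 if won else 0.0)

    summary = {}
    for group, values in outcomes.items():
        n = len(values)
        acc = sum(values) / n
        if n > 1:
            # Unbiased sample variance, then standard error of the mean.
            variance = sum((v - acc) ** 2 for v in values) / (n - 1)
            stderr = math.sqrt(variance / n)
        else:
            stderr = float("nan")  # undefined for a single game
        summary[group] = (acc, stderr)
    return summary


# Toy outcomes, not benchmark results.
toy_results = [
    ("easy", True), ("easy", True), ("easy", False), ("easy", True),
    ("hard", False), ("hard", True), ("hard", False), ("hard", False),
]

for group, (acc, stderr) in grouped_accuracy(toy_results).items():
    print(f"{group}: accuracy={acc:.2f} +/- {stderr:.2f}")
```

With only a handful of games per tier the standard error stays wide, which is worth bearing in mind when choosing the per-tier sample size.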

## Target Model Coverage

To turn the experiment matrix into a paper-ready benchmark, we curate a cross-provider model list. Costs will be projected using the helpers introduced later in this notebook.

In [5]:
models = [
    {
        'provider': 'OpenAI',
        'model': 'gpt-4o-mini',
        'notes': 'Fast baseline with tool-use support; primary benchmark anchor.',
    },
    {
        'provider': 'OpenAI',
        'model': 'gpt-4o',
        'notes': 'Higher-end capability reference for English Hangman.',
    },
    {
        'provider': 'Anthropic',
        'model': 'claude-3-5-sonnet',
        'notes': 'Latest Claude family model with strong reasoning.',
    },
    {
        'provider': 'Google',
        'model': 'gemini-2.0-flash-thinking',
        'notes': 'Gemini model with tool-use; evaluate instruction following.',
    },
    {
        'provider': 'Cohere',
        'model': 'command-r-plus',
        'notes': 'Tool-enabled Command series model for multilingual robustness.',
    },
]

headers = ['Provider', 'Model', 'Notes']
col_widths = []
for header, key in zip(headers, ['provider', 'model', 'notes']):
    col_width = len(header)
    for entry in models:
        col_width = max(col_width, len(entry[key]))
    col_widths.append(col_width)

print(' | '.join(header.ljust(width) for header, width in zip(headers, col_widths)))
print('-+-'.join('-' * width for width in col_widths))
for entry in models:
    print(' | '.join(entry[key].ljust(width) for key, width in zip(['provider', 'model', 'notes'], col_widths)))

Provider  | Model                     | Notes                                                         
----------+---------------------------+---------------------------------------------------------------
OpenAI    | gpt-4o-mini               | Fast baseline with tool-use support; primary benchmark anchor.
OpenAI    | gpt-4o                    | Higher-end capability reference for English Hangman.          
Anthropic | claude-3-5-sonnet         | Latest Claude family model with strong reasoning.             
Google    | gemini-2.0-flash-thinking | Gemini model with tool-use; evaluate instruction following.   
Cohere    | command-r-plus            | Tool-enabled Command series model for multilingual robustness.


## Cost Projection Workflow

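The benchmark uses pilot runs to calibrate token usage per game and then extrapolates costs for full-scale experiments. The following code demonstrates the `analysis.cost_estimation` helpers with illustrative data: the pilot logs are synthetic, so the token counts and resulting costs are placeholders rather than measurements.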

In [6]:
from datetime import datetime, timezone

from inspect_ai.log import EvalLog

from analysis.cost_estimation import (
    ModelPricing,
    project_costs,
    summarise_eval_logs,
)


def fake_eval_log(model_name, samples, input_tokens, output_tokens):
    return EvalLog.model_validate(
        {
            "status": "success",
            "eval": {
                "created": datetime.now(timezone.utc).isoformat(),
                "task": "hangman_bench/hangman",
                "model": model_name,
                "dataset": {"name": "hangman-sample", "samples": samples},
                "config": {},
            },
            "stats": {
                "model_usage": {
                    model_name: {
                        "input_tokens": input_tokens,
                        "output_tokens": output_tokens,
                        "total_tokens": input_tokens + output_tokens,
                    }
                }
            },
            "results": {
                "total_samples": samples,
                "completed_samples": samples,
            },
        }
    )


pilot_logs = [
    fake_eval_log("openai/gpt-4o-mini", samples=3, input_tokens=2150, output_tokens=540),
    fake_eval_log("anthropic/claude-3-5-sonnet", samples=3, input_tokens=2400, output_tokens=720),
]

usage_summaries = summarise_eval_logs(pilot_logs)

print("Per-model usage captured from Inspect logs:")
for model, summary in usage_summaries.items():
    print(f"  - {model}: {summary.samples} samples, {summary.total_tokens} total tokens")

pricing_table = [
    ModelPricing(provider="OpenAI", model="openai/gpt-4o-mini", input_per_million=0.15, output_per_million=0.60),
    ModelPricing(provider="Anthropic", model="anthropic/claude-3-5-sonnet", input_per_million=3.00, output_per_million=15.00),
]

projections = project_costs(usage_summaries, target_games=100, pricings=pricing_table)

for estimate in projections:
    print(f"{estimate.provider} / {estimate.model} -> ${estimate.total_cost:.2f} ({estimate.currency})")
    print(f"  input tokens: {estimate.input_tokens:,}")
    print(f"  output tokens: {estimate.output_tokens:,}")


Per-model usage captured from Inspect logs:
  - openai/gpt-4o-mini: 3 samples, 2690 total tokens
  - anthropic/claude-3-5-sonnet: 3 samples, 3120 total tokens
OpenAI / openai/gpt-4o-mini -> $0.02 (USD)
  input tokens: 71,667
  output tokens: 18,000
Anthropic / anthropic/claude-3-5-sonnet -> $0.60 (USD)
  input tokens: 80,000
  output tokens: 24,000
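
The printed figures are consistent with a simple linear extrapolation: scale the pilot token counts by `target_games / pilot_games`, then multiply by the per-million-token prices. The sketch below reproduces the numbers under that assumption. `linear_projection` is a hypothetical helper written for this check and is not the implementation of `project_costs`.

```python
def linear_projection(pilot_games, pilot_input, pilot_output,
                      target_games, input_per_million, output_per_million):
    """Scale pilot token usage linearly and price it per million tokens."""
    input_tokens = round(pilot_input * target_games / pilot_games)
    output_tokens = round(pilot_output * target_games / pilot_games)
    cost = (input_tokens * input_per_million
            + output_tokens * output_per_million) / 1_000_000
    return input_tokens, output_tokens, cost


# Same illustrative pilot numbers and prices as in the cell above.
mini = linear_projection(3, 2150, 540, 100, 0.15, 0.60)
sonnet = linear_projection(3, 2400, 720, 100, 3.00, 15.00)

for name, (tokens_in, tokens_out, cost) in [("gpt-4o-mini", mini),
                                            ("claude-3-5-sonnet", sonnet)]:
    print(f"{name}: {tokens_in:,} in, {tokens_out:,} out -> ${cost:.2f}")
```

A linear model assumes that token usage per game in the pilot is representative; a three-game pilot is a thin basis, so a larger pilot is advisable before committing to a budget.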


## Reproducible Evaluation Loop

The Inspect CLI exposes every ingredient we need for reproducible experiments. Run the following template once per model, updating the provider identifier and evaluation limit as needed:

```bash
uv run inspect eval hangman_bench/hangman \
    --model <provider/model-id> \
    --limit 50 \
    --log-dir runs/<provider>-<model>/
```

Each run writes an Inspect log file to the log directory (recent Inspect releases default to the `.eval` format, with `.json` as an alternative); the per-sample transcripts it contains can be browsed with `inspect view`. Load those logs with `inspect_ai.log.read_eval_log` (or capture them directly from the return value of `inspect_ai.eval(...)`) and feed the resulting `EvalLog` objects into the cost estimation helpers introduced above.
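
When several models are evaluated, the per-model command can be generated rather than edited by hand. The sketch below builds the argument list from the template above using only the flags it already shows. `build_eval_command` is a hypothetical helper, and the log directory naming simply follows the `runs/<provider>-<model>/` pattern.

```python
import shlex


def build_eval_command(model, limit=50, log_root="runs"):
    """Return the `inspect eval` argument list for one provider/model id."""
    provider, _, name = model.partition("/")
    log_dir = f"{log_root}/{provider}-{name}/"
    return [
        "uv", "run", "inspect", "eval", "hangman_bench/hangman",
        "--model", model,
        "--limit", str(limit),
        "--log-dir", log_dir,
    ]


for model in ["openai/gpt-4o-mini", "anthropic/claude-3-5-sonnet"]:
    # shlex.join quotes arguments where needed, so the line is safe to paste.
    print(shlex.join(build_eval_command(model)))
```

Returning a list rather than a string also lets the same command be passed straight to `subprocess.run` without shell quoting concerns.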


## Paper Outline

The final section sketches a publication-ready structure. Each numbered item links back to deliverables generated earlier in the notebook:

1. **Introduction** – Motivation for evaluating tool-using agents on Hangman; highlight dataset composition statistics.
2. **Related Work** – Survey tool-augmented reasoning benchmarks and prior Hangman solvers.
3. **Benchmark Description** – Describe the Inspect task configuration, solver architecture, and scoring pipeline.
4. **Experimental Setup** – Reference the experiment matrix and model list; detail prompt/call parameters.
5. **Results** – Present grouped accuracy, error bars, and qualitative transcripts; include cost analysis tables.
6. **Ablations** – Discuss effects of difficulty tiers, guess budgets, and optional word guesses.
7. **Discussion & Limitations** – Interpret strategic behaviours, failure modes, and token efficiency.
8. **Cost & Accessibility** – Summarise projected spend per provider and guidelines for budget-conscious replication.
9. **Conclusion** – Reflect on broader implications and future extensions (e.g., multilingual datasets).