# Build Your Own Benchmark

In Notebook 01, we explored the pre-computed results of administering the SIRI-2 (Suicide Intervention Response Inventory, second edition) to nine language models. This notebook walks through **running your own evaluation** from scratch: making live API calls, scoring the results, and comparing your model to published human benchmarks.

### Why run your own evaluation?

The central finding of our study is that model performance varies dramatically depending on how the model is prompted and configured. A model that scores at the level of a trained counselor under one set of conditions can score at the level of an untrained undergraduate under another. The only way to know how a model performs in *your* context — with your prompts, your clinical domain, your scoring criteria — is to test it yourself.

This notebook makes that process accessible. Every API call is a simple request-response: you send the model a clinical scenario and a helper response, and it sends back a rating. No machine learning expertise is required. If you can interpret a SIRI-2 score, you can interpret the output of this pipeline.

### What you'll need

- **At least one API key** from OpenAI, Anthropic, or Google. The setup cell below has a place to paste your key directly. If you don't have an API key yet, each provider offers a free or low-cost tier — see the README for links.
- **Python dependencies** installed (`pip install -r requirements.txt`).

### What this costs

The tutorial run below uses a single model, one prompt variant, one temperature, and three repetitions: 24 scenarios x 2 helper responses x 3 repetitions = **144 API calls**. This costs a few cents and completes in about a minute. Section 8 shows how to scale up to the full experiment design (~$100 for all nine models).

---
## 1. Setup

The cell below loads libraries and detects your API key. **You have two options** for providing your key:

1. **Paste it directly** in the cell below (simplest — uncomment one line and paste your key)
2. **Use a `.env` file** if you prefer to keep keys out of the notebook. Note: `.env` files are hidden by default in Jupyter Lab. To see them, go to **View > Show Hidden Files** in the Jupyter Lab menu bar, then copy `.env.example` to `.env` and fill in your key(s).

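Either way works. If you paste a key directly, it stays in your local notebook and is sent only to the API provider. Because the key is saved with the notebook, remove it before sharing the notebook or committing it to version control.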

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import json
import sys
import os
from pathlib import Path

# ── Paths ─────────────────────────────────────────────────────────────
REPO_ROOT = Path('..').resolve()
INSTRUMENT_DIR = REPO_ROOT / 'instrument'

# Tutorial results go in their own directory so they don't mix with
# the published experiment data in experiment-results/
TUTORIAL_OUTPUT = REPO_ROOT / 'tutorial-results'

# ── API keys ──────────────────────────────────────────────────────────
# OPTION 1: Paste your key directly (uncomment ONE line below).
# os.environ['ANTHROPIC_API_KEY'] = 'sk-ant-...'   # Anthropic
# os.environ['OPENAI_API_KEY']    = 'sk-...'        # OpenAI
# os.environ['GOOGLE_API_KEY']    = 'AI...'          # Google

# OPTION 2: Load from .env file (works automatically if .env exists).
from dotenv import load_dotenv
load_dotenv(REPO_ROOT / '.env')

# Make the src/ modules importable from this notebook
sys.path.insert(0, str(REPO_ROOT / 'src'))

# Plot styling
plt.style.use('seaborn-v0_8-whitegrid')
pd.set_option('display.max_columns', 15)
pd.set_option('display.width', 120)

# ── Detect which providers are available ──────────────────────────────
# Each provider needs its own API key. The cheapest/fastest model for
# each provider is listed as the default tutorial model.
AVAILABLE_MODELS = {}
if os.getenv('ANTHROPIC_API_KEY'):
    AVAILABLE_MODELS['anthropic'] = 'claude-3-5-haiku-20241022'
if os.getenv('GOOGLE_API_KEY'):
    AVAILABLE_MODELS['google'] = 'gemini-2.0-flash'
if os.getenv('OPENAI_API_KEY'):
    AVAILABLE_MODELS['openai'] = 'gpt-3.5-turbo-0125'

if not AVAILABLE_MODELS:
    raise RuntimeError(
        'No API keys found.\n\n'
        'Uncomment one of the os.environ lines at the top of this cell\n'
        'and paste your API key (or add it to a .env file in the repo root),\n'
        'then re-run this cell.'
    )

# Pick the first available model. Dicts preserve insertion order, so this
# follows the provider order above (which prioritizes cost).
MODEL = next(iter(AVAILABLE_MODELS.values()))

print(f'Available providers: {list(AVAILABLE_MODELS.keys())}')
print(f'Auto-selected model: {MODEL}')
print(f'\nTo use a different model, change MODEL in the configuration cell below.')
print(f'Tutorial output will be saved to: {TUTORIAL_OUTPUT}')

---
## 2. Review the Instrument

Before running the experiment, let's look at what the models will be asked to evaluate. The SIRI-2 instrument is stored as two files:

1. **`siri2_items.json`** — the 24 clinical scenarios, each presenting a client expressing suicidal ideation and two possible helper responses
2. **`siri2_expert_scores.csv`** — the expert panel's mean ratings and standard deviations for all 48 helper responses (24 scenarios x 2 helpers)

The model receives exactly what a human respondent would: a client statement, a single helper response, and instructions to rate it on the -3 to +3 scale. Each helper response is scored independently — the model never sees both options for the same scenario at once, just as a human respondent rates each one separately.
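
To make the "one helper response per call" idea concrete, here is a minimal sketch of how a scenario could be split into two independent requests. The template wording and the `build_requests` helper are hypothetical illustrations; the prompts actually used by the benchmark runner are defined in `src/benchmark_runner.py` and may differ.

```python
# Hypothetical prompt template (an assumption for illustration only).
TEMPLATE = (
    "Client: {client}\n"
    "Helper: {helper}\n"
    "Rate the helper response from -3 (very inappropriate) "
    "to +3 (very appropriate)."
)

def build_requests(item):
    """Return one independent request per helper response in a scenario."""
    requests = []
    for helper_key, label in (("helper_a", "A"), ("helper_b", "B")):
        requests.append({
            # "01" + "A" -> "1A", assuming numeric item ids
            "item": f"{int(item['item_id'])}{label}",
            "prompt": TEMPLATE.format(client=item["client"],
                                      helper=item[helper_key]),
        })
    return requests

# Placeholder scenario, not an actual SIRI-2 item.
example_item = {
    "item_id": "01",
    "client": "Placeholder client statement.",
    "helper_a": "Placeholder helper response A.",
    "helper_b": "Placeholder helper response B.",
}

example_requests = build_requests(example_item)
for req in example_requests:
    print(req["item"])
    print(req["prompt"])
    print()
```

Each request contains the client statement and exactly one helper response, so the rating of one option cannot leak into the other.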

Any instrument that follows this same format will work with the benchmark runner (see Section 9).

In [None]:
# Load the instrument items
with open(INSTRUMENT_DIR / 'siri2_items.json') as f:
    items = json.load(f)

print(f'{len(items)} scenarios\n')

# Show one scenario as an example.
# Each scenario has a client statement and two helper responses.
# In clinical terms: this is a client in crisis, and we're asking
# the model to evaluate how appropriate each possible helper response is.
item = items[0]
print(f'Scenario {item["item_id"]}:')
print(f'  Client: "{item["client"]}"')
print(f'  Helper A: "{item["helper_a"]}"')
print(f'  Helper B: "{item["helper_b"]}"')
print(f'\nRequired JSON fields: item_id, client, helper_a, helper_b')
print(f'The model will rate Helper A and Helper B separately.')

In [None]:
# Load the expert scoring key.
# These are the ground-truth ratings from a panel of seven nationally
# recognized suicidologists. "M" is the panel mean, "SD" is the standard
# deviation. Small SDs mean the experts largely agreed on the rating.
expert = pd.read_csv(INSTRUMENT_DIR / 'siri2_expert_scores.csv')
expert.columns = ['Item', 'expert_mean', 'expert_sd']

print(f'{len(expert)} scored items (24 scenarios x 2 helpers = 48 items)')
print(f'Rating scale: -3 (very inappropriate) to +3 (very appropriate)\n')
print(expert.head(10).to_string(index=False))
print(f'\nNote: Item 14 is included in the data but excluded from scoring.')
print(f'The original expert panel could not reach consensus on it — both')
print(f'helper responses represent defensible crisis intervention strategies.')

---
## 3. Configure the Experiment

The benchmark runner supports four configuration axes:

| Parameter | What it controls | Full experiment | Tutorial run |
|-----------|-----------------|----------------|---------------|
| **Models** | Which LLMs to test | 9 (across 3 providers) | 1 |
| **Prompt variants** | How much clinical context the model receives | 3 | 1 (detailed) |
| **Temperatures** | How much randomness in the model's output | 2 (0.0, 1.0) | 1 (1.0) |
| **Repetitions** | How many times to ask the same question | 10 | 3 |

A few notes on what these mean in practice:

- **Temperature** controls randomness. At temperature 0, the model gives nearly the same answer every time (most provider APIs are close to deterministic at this setting, though this is not guaranteed). At temperature 1.0, there is stochastic variation: the model may rate the same item differently across repetitions. We use 1.0 here so you can see how consistent the model is. A model that gives wildly different ratings to the same scenario on repeated trials is less trustworthy, even if its average is close to the expert mean.

- **Prompt variant** determines how much clinical context the model receives before rating each item. The "detailed" prompt frames the task as rating an excerpt from a counseling session on the SIRI-2 scale. Notebook 01 (Section 4) showed that this framing substantially improves performance for most models — without it, models often default to generic sentiment rather than clinical judgment.

- **Repetitions** let us measure both the average response and its variability. Three repetitions is enough to get a rough estimate; the full study used ten.
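
The point about consistency can be illustrated with a small sketch. The ratings below are made up for illustration and are not results from any model: two hypothetical models share the same mean rating on one item, but only the standard deviation reveals that one of them is unstable.

```python
from statistics import mean, stdev

def describe(ratings):
    """Return (mean, sample standard deviation) for repeated ratings."""
    return mean(ratings), stdev(ratings)

# Made-up ratings for a single item across three repetitions.
steady_ratings = [-2, -2, -2]
erratic_ratings = [-3, -2, -1]

steady_mean, steady_sd = describe(steady_ratings)
erratic_mean, erratic_sd = describe(erratic_ratings)

print(f"steady:  mean={steady_mean:+.2f}, sd={steady_sd:.2f}")
print(f"erratic: mean={erratic_mean:+.2f}, sd={erratic_sd:.2f}")
```

Both sets average to -2, so a mean-only comparison would treat them as identical. With only three repetitions the SD is itself a rough estimate, which is why the full study used ten.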

In [None]:
# ── Tutorial configuration ──────────────────────────────────────────
# MODEL was auto-selected above based on your API keys.
# To use a different model, uncomment one of the lines below.
#
# Available models (must match your API key):
#   Anthropic: claude-3-5-haiku-20241022, claude-3-5-sonnet-20241022,
#             claude-sonnet-4-20250514, claude-opus-4-20250514
#   OpenAI:   gpt-3.5-turbo-0125, gpt-4o-2024-11-20
#   Google:   gemini-2.0-flash, gemini-2.5-flash, gemini-2.5-pro

# MODEL = 'claude-3-5-haiku-20241022'

TEMPERATURE = 1.0       # stochastic — lets us measure consistency
PROMPT_VARIANT = 'detailed'  # provides full clinical context
REPEATS = 3             # enough to estimate mean and variability

# Show what the experiment will look like
# 1 model x 1 temperature x 1 prompt variant x scenarios x 2 helpers x repeats
n_calls = 1 * 1 * 1 * len(items) * 2 * REPEATS
print(f'Model:           {MODEL}')
print(f'Temperature:     {TEMPERATURE}')
print(f'Prompt variant:  {PROMPT_VARIANT}')
print(f'Repetitions:     {REPEATS}')
print(f'Scenarios:       {len(items)} (x2 helpers each = {len(items) * 2} items)')
print(f'\nTotal API calls: {len(items)} scenarios x 2 helpers x {REPEATS} repeats = {n_calls}')
print(f'Estimated time:  ~1 minute')
print(f'Estimated cost:  a few cents')

---
## 4. Run the Experiment

This is the step that actually talks to the model. The `run_experiment()` function sends each helper response to the model as an **independent API call** with no shared conversation context. This means the model's rating of Item 1A cannot influence its rating of Item 1B or any other item — there are no order effects or anchoring between items, just as in a properly administered paper-and-pencil assessment.

Results are written incrementally to a JSONL file (one JSON record per line). **If the run is interrupted** (network issue, kernel restart, etc.), re-running the cell picks up where it left off — completed calls are detected and skipped automatically. You will not be double-charged.
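
One way a resumable runner like this could work is sketched below. This is not the actual implementation in `src/benchmark_runner.py`. The helper names `completed_keys` and `run_resumable` and the stub rating function are assumptions for illustration. The sketch identifies each call by the metadata fields recorded in the JSONL file and skips any call whose key is already on disk.

```python
import json
import tempfile
from pathlib import Path

# Fields that together identify one API call (taken from the record layout).
KEY_FIELDS = ("model", "temperature", "prompt_variant",
              "item_id", "helper", "repeat")

def completed_keys(jsonl_path):
    """Collect the identifying key of every call already written to disk."""
    done = set()
    if jsonl_path.exists():
        with open(jsonl_path) as f:
            for line in f:
                if line.strip():
                    record = json.loads(line)
                    done.add(tuple(record[k] for k in KEY_FIELDS))
    return done

def run_resumable(jsonl_path, planned_calls, rate_fn):
    """Append results for planned calls that are not yet recorded."""
    done = completed_keys(jsonl_path)
    n_new = 0
    with open(jsonl_path, "a") as f:
        for call in planned_calls:
            key = tuple(call[k] for k in KEY_FIELDS)
            if key in done:
                continue  # already paid for, so skip it
            record = dict(call, score=rate_fn(call))
            f.write(json.dumps(record) + "\n")
            f.flush()  # make the record durable before the next call
            done.add(key)
            n_new += 1
    return n_new

# Four planned calls; rate_fn is a stub standing in for a live API call.
planned = [
    {"model": "stub-model", "temperature": 1.0, "prompt_variant": "detailed",
     "item_id": "01", "helper": helper, "repeat": repeat}
    for helper in ("a", "b") for repeat in (1, 2)
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "responses.jsonl"
    first_run = run_resumable(path, planned[:2], rate_fn=lambda call: 0)
    second_run = run_resumable(path, planned, rate_fn=lambda call: 0)
    third_run = run_resumable(path, planned, rate_fn=lambda call: 0)

print(first_run, second_run, third_run)
```

The first run simulates an interruption after two calls. The second run makes only the two remaining calls, and the third makes none.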

In [None]:
from benchmark_runner import run_experiment, summarize

# This cell makes live API calls. It will take about a minute.
# A progress bar will appear showing each item being scored.
run_experiment(
    prompts_path=INSTRUMENT_DIR / 'siri2_items.json',
    output_dir=TUTORIAL_OUTPUT,
    models=[MODEL],
    temperatures=[TEMPERATURE],
    prompt_variants=[PROMPT_VARIANT],
    repeats=REPEATS,
)

In [None]:
# Let's look at what the model actually returned for one API call.
# Each record captures the model's numeric rating ("score") plus
# metadata about which item, helper, and repetition it was.
with open(TUTORIAL_OUTPUT / 'api_responses_raw.jsonl') as f:
    first = json.loads(f.readline())

print('Fields recorded for each API call:\n')
for key in ['model', 'temperature', 'prompt_variant', 'item_id', 'helper',
            'repeat', 'score', 'reasoning', 'raw_response']:
    val = first.get(key)
    if isinstance(val, str) and len(val) > 80:
        val = val[:80] + '...'
    print(f'  {key}: {val}')

print(f'\nThe "score" field is the model\'s rating on the -3 to +3 scale.')
print(f'The "reasoning" field is null here because the "detailed" prompt')
print(f'only asks for a score. The "detailed_w_reasoning" variant also')
print(f'asks the model to explain its rating in free text.')

---
## 5. Summarize Results

The raw JSONL file contains one record per API call — with 3 repetitions per item, that means 3 separate ratings for each of the 48 helper responses. The `summarize()` function aggregates these into a single row per item with the **mean** rating (average across repetitions), **standard deviation** (how much the ratings varied), and **count** (number of repetitions).

This is analogous to computing a respondent's average score when they take a test multiple times. The mean tells you where the model lands; the SD tells you how stable that landing is.
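
The aggregation step can be sketched with the standard library alone. The records below are toy data, and `aggregate` is a hypothetical stand-in for what `summarize()` does; the real function may group by additional fields such as model, temperature, and prompt variant.

```python
from collections import defaultdict
from statistics import mean, stdev

# Toy records: three repetitions for each of two helper responses.
records = [
    {"item_id": 1, "helper": "A", "repeat": 1, "score": -2},
    {"item_id": 1, "helper": "A", "repeat": 2, "score": -3},
    {"item_id": 1, "helper": "A", "repeat": 3, "score": -2},
    {"item_id": 1, "helper": "B", "repeat": 1, "score": 2},
    {"item_id": 1, "helper": "B", "repeat": 2, "score": 2},
    {"item_id": 1, "helper": "B", "repeat": 3, "score": 2},
]

def aggregate(records):
    """Collapse repeated ratings into one row per helper response."""
    groups = defaultdict(list)
    for r in records:
        groups[(r["item_id"], r["helper"])].append(r["score"])
    rows = []
    for (item_id, helper), scores in sorted(groups.items()):
        rows.append({
            "Item": f"{item_id}{helper}",
            "mean": mean(scores),
            # Sample SD is undefined for a single rating.
            "std": stdev(scores) if len(scores) > 1 else float("nan"),
            "count": len(scores),
        })
    return rows

summary_rows = aggregate(records)
for row in summary_rows:
    print(row)
```

With pandas, the same result comes from grouping the raw records and calling `.agg(['mean', 'std', 'count'])` on the score column.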

In [None]:
# Aggregate raw responses: compute mean, std, and count across repetitions
summarize(TUTORIAL_OUTPUT)

# Load the summary table
summary = pd.read_csv(TUTORIAL_OUTPUT / 'model_scores_by_condition.csv')
print(f'\n{len(summary)} rows (one per item x helper)')
print(f'Columns: {summary.columns.tolist()}')
print(f'\nKey columns:')
print(f'  mean  = average score across {REPEATS} repetitions')
print(f'  std   = standard deviation across repetitions (consistency)')
print(f'  count = number of repetitions\n')
summary.head(10)

---
## 6. Score Against Experts

Now we compute the SIRI-2 total score — the same metric used in every published human study of this instrument. For each of the 48 validated items, we take the absolute difference between the model's mean rating and the expert panel's mean rating, then sum across all items.

$$\text{SIRI-2 Total Score} = \sum_{i=1}^{48} |\text{model mean}_i - \text{expert mean}_i|$$

The result is a single number: **lower is better**. A score of 0 would mean perfect agreement with the expert panel on every item. The expected score for an individual expert panelist (rating against the panel mean) is approximately 32.5 — this represents the natural disagreement *within* the expert panel itself. Scores below this threshold indicate performance that exceeds the typical agreement among the experts themselves.
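
As a worked example of the formula, the sketch below scores three items by hand. The expert and model means are illustrative values, not results from any run, so the total here is far smaller than a real score over the whole instrument.

```python
# Illustrative means for three items (not real results).
expert_means = {"1A": -2.71, "1B": 1.86, "2A": -2.71}
model_means = {"1A": -1.50, "1B": 2.00, "2A": -3.00}

def siri2_total(model_means, expert_means):
    """Sum of absolute differences between model and expert means."""
    return sum(abs(model_means[item] - expert_means[item])
               for item in expert_means)

per_item_error = {item: abs(model_means[item] - expert_means[item])
                  for item in expert_means}
toy_total = siri2_total(model_means, expert_means)

for item, err in per_item_error.items():
    print(f"{item}: |{model_means[item]:+.2f} - ({expert_means[item]:+.2f})| = {err:.2f}")
print(f"Total: {toy_total:.2f}")
```

Because errors are taken as absolute values before summing, overrating one item cannot cancel out underrating another.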

In [None]:
# ── Step 1: Merge model scores with expert ratings ────────────────────
# This joins each item's model mean with the corresponding expert mean,
# so we can compute the distance between them.
merged = summary.merge(expert, on='Item', how='inner')

# ── Step 2: Exclude item 14 (not validated) ───────────────────────────
merged = merged[merged['item_id'] != 14].copy()

# ── Step 3: Compute absolute error per item ───────────────────────────
# For each item: how far was the model's average rating from the expert mean?
# Example: if experts rated an item -2.71 and the model averaged -1.5,
# the absolute error is |(-1.5) - (-2.71)| = 1.21
merged['abs_error'] = (merged['mean'] - merged['expert_mean']).abs()

# ── Step 4: Sum to get the SIRI-2 total score ────────────────────────
siri2_score = merged['abs_error'].sum()
n_items = len(merged)
mae = siri2_score / n_items

print(f'SIRI-2 Total Score: {siri2_score:.1f}')
print(f'  Items scored: {n_items}')
print(f'  Mean absolute error per item: {mae:.2f}')
print(f'  Model: {MODEL}')
print(f'  Condition: {PROMPT_VARIANT}, T={TEMPERATURE}, {REPEATS} repeats')

In [None]:
# ── Compare to published human benchmarks ─────────────────────────────
# The SIRI-2 has been administered to a wide range of clinical and
# non-clinical groups over the past 30 years. These published scores
# let us place the model on the same scale as real clinicians.
from siri2_scoring import HUMAN_BENCHMARKS

EXPERT_BASELINE = 32.5  # expected score for an individual expert panelist

# Key reference points from the SIRI-2 literature
refs = [
    ('Expert panel baseline', EXPERT_BASELINE),
    ("Master's counselors (post-training)", 41.0),
    ('Clinical psych PhD students', 45.4),
    ('K-12 school staff (pre-training)', 52.9),
    ("Master's counselors (pre-training)", 54.7),
    ('Intro psychology students', 70.4),
]

print(f'Your model scored {siri2_score:.1f}. Here\'s how that compares:\n')
print(f'{"Respondent":<42s}  {"SIRI-2 Score":>12s}')
print(f'{"-" * 42}  {"-" * 12}')

# Tag the model's row explicitly instead of encoding it in the label text
rows = [(label, score, False) for label, score in refs]
rows.append((MODEL, siri2_score, True))

for label, score, is_model in sorted(rows, key=lambda r: r[1]):
    suffix = '  <-- your run' if is_model else ''
    print(f'  {label:<42s}  {score:>10.1f}{suffix}')

print(f'\nScores below 32.5 mean the model agrees with experts more closely')
print(f'than a typical individual expert agrees with the rest of the panel.')

---
## 7. Visualize Your Results

### Per-item comparison

The chart below shows the expert panel mean (green) alongside your model's mean rating (blue) for each of the 48 validated items. Error bars on the blue bars show the standard deviation across repetitions — wider bars mean the model was less consistent on that item.

**What to look for:**
- **Close alignment:** green and blue bars of similar height mean the model agrees with experts.
- **Sign reversals:** if a green bar points up (appropriate) but the blue bar points down (inappropriate), or vice versa, the model and experts fundamentally disagree about that response. This is the most clinically concerning type of error.
- **Consistent overrating:** if blue bars sit systematically higher than green bars on inappropriate items (those where the green bar extends below zero), the model is rating bad responses too favorably. We found this pattern across most models in the full study.
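
The last two patterns can also be checked numerically. The sketch below uses made-up (expert mean, model mean) pairs. The helper names are hypothetical and the code is not part of the repository's analysis modules. It reports the mean signed error on items the experts rated as inappropriate, plus any sign reversals.

```python
# Made-up (expert_mean, model_mean) pairs for three items.
toy_ratings = {
    "1A": (-2.5, -1.5),
    "1B": (2.0, 2.5),
    "2A": (-2.0, 1.0),
}

def overrating_bias(ratings):
    """Mean of (model - expert) over items the experts rated below zero.

    A positive value means inappropriate responses were rated too favorably.
    """
    errors = [model - expert for expert, model in ratings.values()
              if expert < 0]
    return sum(errors) / len(errors) if errors else 0.0

def sign_reversals(ratings):
    """Items where model and expert means fall on opposite sides of zero."""
    return [item for item, (expert, model) in ratings.items()
            if expert * model < 0]

bias = overrating_bias(toy_ratings)
reversed_items = sign_reversals(toy_ratings)

print(f"Mean signed error on inappropriate items: {bias:+.2f}")
print(f"Sign reversals: {reversed_items}")
```

Unlike the total score, the signed error keeps the direction of the disagreement, which is what distinguishes systematic overrating from random noise.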

In [None]:
# Sort items in order (1A, 1B, 2A, 2B, ...) for a clean plot
plot_data = merged.sort_values('Item').copy()

fig, ax = plt.subplots(figsize=(14, 5))

x = np.arange(len(plot_data))
width = 0.35

# Green bars = expert panel means (ground truth)
ax.bar(x - width/2, plot_data['expert_mean'], width, color='#2ca02c', alpha=0.7,
       label='Expert panel mean')

# Blue bars = model's mean ratings
ax.bar(x + width/2, plot_data['mean'], width, color='#4a90d9', alpha=0.7,
       label=f'{MODEL}')

# Error bars showing model variability across repetitions
if 'std' in plot_data.columns and plot_data['std'].notna().any():
    ax.errorbar(x + width/2, plot_data['mean'], yerr=plot_data['std'],
                fmt='none', ecolor='#4a90d9', alpha=0.4, capsize=1.5)

ax.axhline(0, color='black', linewidth=0.5)
ax.set_xlabel('Item', fontsize=10)
ax.set_ylabel('Rating (-3 to +3)', fontsize=10)
ax.set_title(f'Expert vs. Model Ratings by Item ({MODEL})', fontsize=12, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(plot_data['Item'], fontsize=6, rotation=90)
ax.legend(fontsize=9)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.tight_layout()
plt.show()

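# Flag sign reversals: items where model and expert disagree on the
# basic direction (appropriate vs. inappropriate). These are the most
# clinically significant errors. The 0.5 threshold skips model means
# within 0.5 of zero, where the sign carries little information.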
reversals = plot_data[
    (plot_data['mean'] * plot_data['expert_mean'] < 0) &
    (plot_data['mean'].abs() > 0.5)
]
if len(reversals) > 0:
    print(f'Sign reversals (model and expert on opposite sides of zero):')
    for _, r in reversals.iterrows():
        direction = 'appropriate' if r['expert_mean'] > 0 else 'inappropriate'
        print(f'  Item {r["Item"]}: expert = {r["expert_mean"]:+.2f} ({direction}), '
              f'model = {r["mean"]:+.2f}')
else:
    print('No sign reversals detected — the model agrees with experts on the')
    print('basic direction (appropriate vs. inappropriate) for all items.')

### Score in context

The chart below places your model's SIRI-2 total score (red star) among published human benchmark groups (gray squares) and the expert panel baseline (green diamond). These human groups span a wide range of clinical training and experience — from untrained introductory psychology students to post-training master's-level counselors.

This is the same comparison that has been used in the SIRI-2 literature for 30 years to contextualize the clinical competence of new respondent groups. The only difference is that the respondent here is a language model.

In [None]:
# Build the comparison dataset: human benchmark groups + your model
entries = [{'label': h['group'], 'score': h['score'], 'type': 'human'}
           for h in HUMAN_BENCHMARKS if 'chat tool' not in h['group']]
entries.append({'label': 'Expert panel baseline', 'score': EXPERT_BASELINE, 'type': 'expert'})
entries.append({'label': f'{MODEL} (your run)', 'score': siri2_score, 'type': 'model'})
entries = sorted(entries, key=lambda x: x['score'])

fig, ax = plt.subplots(figsize=(10, max(6, len(entries) * 0.28)))

# Different markers for each type of entry
colors = {'human': '#5a6872', 'expert': '#2ca02c', 'model': '#d64541'}
markers = {'human': 's', 'expert': 'D', 'model': '*'}
sizes = {'human': 5, 'expert': 7, 'model': 12}

for i, entry in enumerate(entries):
    t = entry['type']
    ax.plot(entry['score'], i, markers[t], color=colors[t],
            markersize=sizes[t], markeredgecolor='white', markeredgewidth=0.5, zorder=3)

# Style the y-axis labels to match the marker colors
ax.set_yticks(range(len(entries)))
ax.set_yticklabels([e['label'] for e in entries], fontsize=8)
for i, entry in enumerate(entries):
    tick = ax.get_yticklabels()[i]
    tick.set_color(colors[entry['type']])
    if entry['type'] == 'model':
        tick.set_fontweight('bold')

# Dashed line at expert baseline for reference
ax.axvline(EXPERT_BASELINE, color='#2ca02c', linewidth=1, linestyle='--', alpha=0.5, zorder=1)

ax.invert_yaxis()
ax.set_xlabel('SIRI-2 Total Score (lower = better)', fontsize=10)
ax.set_title('Your Model vs. Human Benchmarks', fontsize=12, fontweight='bold')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.grid(axis='x', alpha=0.2)
ax.tick_params(axis='y', length=0)

plt.tight_layout()
plt.show()

---
## 8. Scaling Up

The tutorial run above used a single model and condition. To run a more comprehensive evaluation, you can expand any of the configuration parameters. The benchmark runner handles the combinatorics automatically: it crosses every model with every temperature and prompt variant, then repeats each combination the requested number of times.

The full experiment design from the paper crosses **9 models x 3 prompt variants x 2 temperatures x 10 repetitions**, producing about 27,000 API calls at a total cost of roughly $100. The bulk of that cost comes from the larger models (Claude Opus 4, GPT-4o, Gemini 2.5 Pro). A more targeted evaluation — say, 3 models, 2 prompts, both temperatures, 5 repetitions — would cost under $10.

Because the runner is **resumable**, you can scale up incrementally. Run one model today, add another tomorrow. Already-completed calls are skipped automatically.
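
Before scaling up, it helps to estimate the number of calls a design will make. The sketch below is a simple multiplication. `count_api_calls` is a hypothetical helper, and the scenario count of 24 is taken from the instrument description in Section 2 (the exact total depends on how many scenarios your items file contains).

```python
def count_api_calls(n_scenarios, n_models, n_prompt_variants,
                    n_temperatures, n_repeats, helpers_per_scenario=2):
    """Every condition rates every helper response n_repeats times."""
    return (n_scenarios * helpers_per_scenario * n_models
            * n_prompt_variants * n_temperatures * n_repeats)

tutorial_calls = count_api_calls(24, n_models=1, n_prompt_variants=1,
                                 n_temperatures=1, n_repeats=3)
targeted_calls = count_api_calls(24, n_models=3, n_prompt_variants=2,
                                 n_temperatures=2, n_repeats=5)
full_calls = count_api_calls(24, n_models=9, n_prompt_variants=3,
                             n_temperatures=2, n_repeats=10)

print(f"Tutorial run:    {tutorial_calls:>6,}")
print(f"Targeted design: {targeted_calls:>6,}")
print(f"Full design:     {full_calls:>6,}")
```

Call count is only a rough proxy for cost, since per-call prices differ widely between the smaller and larger models.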

In [None]:
# Example: full experiment configuration
# (Don't run this cell unless you intend to — it will make ~27,000 API calls.)
# Requires API keys for all three providers (OpenAI, Anthropic, Google).

FULL_CONFIG = """
run_experiment(
    prompts_path=INSTRUMENT_DIR / 'siri2_items.json',
    output_dir=REPO_ROOT / 'my-full-results',
    models=[
        'gpt-3.5-turbo-0125', 'gpt-4o-2024-11-20',              # OpenAI
        'claude-3-5-haiku-20241022', 'claude-3-5-sonnet-20241022',
        'claude-sonnet-4-20250514', 'claude-opus-4-20250514',     # Anthropic
        'gemini-2.0-flash', 'gemini-2.5-flash', 'gemini-2.5-pro', # Google
    ],
    temperatures=[0.0, 1.0],
    prompt_variants=['minimal', 'detailed', 'detailed_w_reasoning'],
    repeats=10,
)
"""
print(FULL_CONFIG)
print('For deeper analysis on full results, the src/ directory includes:')
print('  analysis.py           — RMSE decomposition, prompt/temperature effect figures')
print('  siri2_scoring.py      — SIRI-2 total scores and human benchmark comparison')
print('  tutorial_figures.py   — publication-quality figures from the paper')
print('  reasoning_analysis.py — embedding-based reasoning consistency (advanced)')

---
## 9. Adapting for Your Own Instrument

The benchmark runner works with **any** clinical instrument that follows the same format. The SIRI-2 is used here because it provides a well-validated scoring key and decades of published human comparison data, but the pipeline itself is instrument-agnostic. If you have a clinical assessment that presents scenarios with ratable responses, you can plug it in.

To substitute your own instrument, you need two files:

### Items file (JSON)

A JSON array where each object has four fields:

```json
[
  {
    "item_id": "01",
    "client": "The client statement or scenario description.",
    "helper_a": "One possible helper/clinician response.",
    "helper_b": "An alternative helper/clinician response."
  },
  ...
]
```

Each helper response is sent to the model independently — the model sees only the client statement and one helper response at a time. This mirrors how the SIRI-2 is administered to human respondents.
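
Before spending money on API calls, it is worth checking that a new items file has the expected shape. The validator below is a minimal sketch. `validate_items` is a hypothetical helper, not part of the repository, and it checks only the four fields described above.

```python
import json

REQUIRED_FIELDS = ("item_id", "client", "helper_a", "helper_b")

def validate_items(items):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    seen_ids = set()
    for index, item in enumerate(items):
        missing = [field for field in REQUIRED_FIELDS
                   if item.get(field) is None
                   or not str(item[field]).strip()]
        if missing:
            problems.append(f"entry {index}: missing or empty {missing}")
        item_id = item.get("item_id")
        if item_id is not None and item_id in seen_ids:
            problems.append(f"entry {index}: duplicate item_id {item_id!r}")
        seen_ids.add(item_id)
    return problems

# Placeholder content: the second entry is missing helper_b and reuses an id.
sample_json = """
[
  {"item_id": "01", "client": "Statement.", "helper_a": "Reply A.", "helper_b": "Reply B."},
  {"item_id": "01", "client": "Statement.", "helper_a": "Reply A."}
]
"""

item_problems = validate_items(json.loads(sample_json))
for problem in item_problems:
    print(problem)
```

An empty list means every entry has all four fields and a unique `item_id`.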

### Expert scoring key (CSV)

A CSV with three columns:

```
Item,M,SD
1A,-2.71,0.49
1B,+1.86,0.38
2A,-2.71,0.49
...
```

The `Item` column uses the format `{item_number}{A|B}` to match each helper response. `M` is the expert panel mean and `SD` is the standard deviation. The scoring pipeline uses `M` to compute alignment; `SD` is used for the expert baseline calculation and for understanding items where experts themselves disagreed.
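
A quick consistency check between the two files can catch a missing row in the scoring key before it silently drops out of the merge. The sketch below assumes that an `item_id` of `"01"` corresponds to the labels `1A` and `1B`. That mapping is inferred from the examples above rather than confirmed against the runner's code, and the helper names are hypothetical.

```python
import csv
import io

def expected_labels(items):
    """Build the Item labels ("1A", "1B", ...) implied by the items file."""
    labels = []
    for item in items:
        number = int(item["item_id"])  # "01" -> 1, assuming numeric ids
        labels += [f"{number}A", f"{number}B"]
    return labels

def missing_from_key(items, key_csv_text):
    """Return labels that have no row in the expert scoring key."""
    reader = csv.DictReader(io.StringIO(key_csv_text))
    scored = {row["Item"].strip() for row in reader}
    return [label for label in expected_labels(items) if label not in scored]

# Placeholder data: the key is missing a row for 2B.
sample_items = [{"item_id": "01"}, {"item_id": "02"}]
sample_key = "Item,M,SD\n1A,-2.71,0.49\n1B,+1.86,0.38\n2A,-2.71,0.49\n"

unscored = missing_from_key(sample_items, sample_key)
print(f"Items without an expert score: {unscored}")
```

The leading `+` in values such as `+1.86` is accepted by both Python's `float()` and pandas' CSV reader, so it does not need to be stripped.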

### Prompt text

You will also want to modify the system prompt to match your instrument's rating scale and clinical context. The prompt variants are defined in `src/benchmark_runner.py` in the `PROMPT_VARIANTS` dictionary. The key elements to customize are:

1. The **rating scale** — the SIRI-2 uses -3 to +3; your instrument may use a different range (e.g., 1-5, 0-10)
2. The **clinical framing** — "hypothetical counseling session" may not fit your domain. A diagnostic reasoning instrument would need different framing than a therapeutic alliance measure.
3. The **output format** — JSON with `score` and optionally `reasoning`. The parser handles common variations (markdown code blocks, leading + signs, surrounding prose).
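
If you change the output format, the parsing step may need to change with it. The sketch below shows one way a tolerant parser could handle the variations listed above. It is a hypothetical reimplementation, not the parser in `src/benchmark_runner.py`, and it assumes ratings are whole numbers within the scale bounds.

```python
import json
import re

def _to_rating(value, low, high):
    """Convert a raw value to an in-range integer rating, else None."""
    try:
        number = float(str(value).strip())  # float() accepts "+2"
    except ValueError:
        return None
    if number.is_integer() and low <= number <= high:
        return int(number)
    return None

def parse_score(raw_text, low=-3, high=3):
    """Extract a rating from a model reply, or return None if none is found."""
    # Look for a JSON object anywhere in the reply; this also covers
    # markdown code fences and surrounding prose.
    match = re.search(r"\{.*\}", raw_text, flags=re.DOTALL)
    if match:
        # JSON forbids a leading "+" on numbers, so drop it before parsing.
        candidate = re.sub(r'("score"\s*:\s*)\+', r"\1", match.group(0))
        try:
            payload = json.loads(candidate)
        except json.JSONDecodeError:
            payload = None
        if isinstance(payload, dict) and "score" in payload:
            return _to_rating(payload["score"], low, high)
    # Fall back to a reply that is nothing but a signed integer.
    bare = re.fullmatch(r"\s*([+-]?\d+)\s*", raw_text)
    if bare:
        return _to_rating(bare.group(1), low, high)
    return None

fence = "`" * 3  # a markdown code fence, built here to keep the example tidy
sample_replies = [
    '{"score": 2}',
    fence + 'json\n{"score": -3}\n' + fence,
    'Here is my rating: {"score": +1, "reasoning": "ok"} Thanks.',
    "+2",
    '{"score": 7}',
    "no rating here",
]
parsed = [parse_score(reply) for reply in sample_replies]
print(parsed)
```

Returning `None` for out-of-range or unparseable replies, rather than guessing, keeps bad outputs visible so they can be counted or retried. For a different rating scale, pass the new bounds as `low` and `high`.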

### Assembling an expert panel

The scoring key is the most important component of any benchmark — it defines ground truth. For a new instrument, this requires recruiting qualified experts to independently rate each item and computing the panel means. The SIRI-2 used seven nationally recognized suicidologists; modern psychometric practice generally recommends larger and more diverse panels. The key criterion is that panelists should be people whose judgment you would trust in the clinical domain the instrument covers.
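
Once panelists have rated each item independently, turning their ratings into a scoring key is a short calculation. The sketch below uses made-up ratings from seven hypothetical panelists. It assumes the sample standard deviation; whether the original SIRI-2 key used the sample or population formula is not stated here.

```python
from statistics import mean, stdev

# Made-up ratings from seven hypothetical panelists (not real SIRI-2 data).
panel_ratings = {
    "1A": [-3, -3, -2, -3, -3, -2, -3],
    "1B": [2, 2, 1, 2, 3, 2, 1],
}

def build_scoring_key(panel_ratings):
    """Return (Item, M, SD) rows, rounded to two decimals."""
    return [(item, round(mean(ratings), 2), round(stdev(ratings), 2))
            for item, ratings in panel_ratings.items()]

scoring_key = build_scoring_key(panel_ratings)

print("Item,M,SD")
for item, m, sd in scoring_key:
    print(f"{item},{m:+.2f},{sd:.2f}")
```

Items with a large SD are worth reviewing before they go into the key, since they may lack expert consensus in the way Item 14 did.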

---
## 10. Next Steps

This notebook demonstrated the complete pipeline from API call to scored benchmark result. You sent a clinical instrument to a language model, collected its ratings, scored them against expert consensus, and placed the result in the context of three decades of published human performance data — all in a few minutes and for a few cents.

From here, you can:

- **Scale up** to more models, prompt variants, and temperatures (Section 8). The runner is resumable, so you can add conditions incrementally.
- **Substitute your own instrument** following the format described in Section 9. The scoring code works identically regardless of clinical content.
- **Run deeper analysis** using the modules in `src/` — RMSE decomposition, per-item bias analysis, and reasoning embedding consistency.
- **Compare to the full results** from our nine-model experiment in Notebook 01.

### Get involved

The research team behind this paper is building **MindBench.ai**, a benchmarking aggregation project for mental health AI. We are assembling a growing library of expert-rated clinical scenarios across crisis intervention, therapeutic alliance, diagnostic reasoning, and other domains where clinical expertise is essential for AI evaluation.

There are two ways to contribute:

1. **Rate existing benchmark items** — bring your clinical perspective to instruments like the SIRI-2. Your areas of agreement and disagreement with existing scoring keys are informative, particularly if they reflect clinical perspectives that original expert panels may not have represented.
2. **Write new scenarios** — develop items for domains that existing instruments don't cover. Clinicians who write scenarios for training exercises, supervision, or competency exams are already producing material in this format.

If you are interested in contributing, please contact the corresponding author.