Code and data release for "The Severance Problem: LLMs are Unaware of the Person Beyond the Prompt".
The Severance Schema is a prompt-level structural prior that gives an LLM an explicit inventory of which categories of person-context exist for the user it is serving, even when no personal data is filled in. We show that this single structural change (i) cuts harm and sycophancy across all five model families tested, (ii) recovers most of the safety cost that bullet-style memory introduces (hallucination 3.7–11.7% → 1.7–4.0%), (iii) holds across fill levels from 0% to 100%, and (iv) is the only condition under which a model's clarifying questions translate into a usefulness gain on the next turn.
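For intuition, an unfilled schema block in this spirit might look like the sketch below. The dimension names here are illustrative placeholders, not the paper's actual inventory; the real prompt text lives in core.py.

```python
# Illustrative sketch only: placeholder dimension names, NOT the schema
# actually used in the experiments (see core.py for the real prompt text).
SEVERANCE_SCHEMA_EMPTY = """\
The person you are advising has a life beyond this prompt. Context may
exist in each of the following categories, even where none is provided:
  - Health:                [not provided]
  - Finances:              [not provided]
  - Relationships:         [not provided]
  - Work and obligations:  [not provided]
  - Values and beliefs:    [not provided]
  - Constraints and risks: [not provided]
"""
```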
This release contains everything needed to reproduce the four experiments reported in the paper.
release/
├── core.py # Prompts, schema, profile-routing, scenario logic
├── run_all.py # Single entrypoint: generation, evaluation, tables
├── run_all_batch.py # Optional: batch-mode judge (Anthropic Batch API)
├── data/ # Profiles, scenarios, claims (the benchmark)
├── results/ # Pre-computed model outputs + judge scores (134 MB)
├── assets/ # Bootstrap, table-builder, figure scripts
├── examples/ # Single qualitative transcript (Fig. 4)
├── requirements.txt
└── LICENSE
pip install -r requirements.txt
# Portkey is used for ALL judging and for Claude/GPT subject generation.
# Set up two virtual keys in your Portkey dashboard (one fronting Anthropic,
# one fronting OpenAI) and export them along with your Portkey API key:
export PORTKEY_API_KEY=...
export PORTKEY_VK_ANTHROPIC=... # Portkey virtual key for Anthropic
export PORTKEY_VK_OPENAI=... # Portkey virtual key for OpenAI
# Together is only needed for the open-weight subject models:
export TOGETHER_API_KEY=...
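To sanity-check the keys, something like the following should work. This is a minimal sketch assuming the portkey-ai Python SDK; run_all.py does this routing for you.

```python
import os
from portkey_ai import Portkey

# Route a test call through the Anthropic-fronting virtual key.
client = Portkey(
    api_key=os.environ["PORTKEY_API_KEY"],
    virtual_key=os.environ["PORTKEY_VK_ANTHROPIC"],
)
resp = client.chat.completions.create(
    model="claude-sonnet-4-20250514",
    max_tokens=32,
    messages=[{"role": "user", "content": "ping"}],
)
print(resp.choices[0].message.content)
```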
Models referenced in the paper:

| Slug | Provider |
|---|---|
| claude-sonnet-4-20250514 | Portkey/Anthropic |
| meta-llama/Llama-3.3-70B-Instruct-Turbo | Together |
| deepseek-ai/DeepSeek-V3 | Together |
| google/gemma-4-31B-it | Together |
| Qwen/Qwen3-235B-A22B-Instruct-2507-tput | Together |
| gpt-5.2-2025-12-11 (judge) | Portkey/OpenAI |
Each experiment is a three-stage pipeline:
1. generate subject-model responses;
2. evaluate them with one or both judges;
3. run the table / figure / bootstrap scripts to produce the paper artifacts.
The pre-computed outputs of stages 1–2 are shipped under results/, so a reviewer can skip directly to stage 3 if they trust the API outputs.
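To see what stages 1–2 produced without making any API calls, a quick inventory of the shipped files (this assumes only that results/ contains JSON; record layouts are the scripts' concern):

```python
import json
from pathlib import Path

# Print every shipped results file with its top-level type and size.
for path in sorted(Path("results").rglob("*.json")):
    with open(path) as f:
        data = json.load(f)
    n = len(data) if isinstance(data, (list, dict)) else 1
    print(f"{path}: {type(data).__name__}, {n} top-level entries")
```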
exp_outie (cross-family sweep): five subject models × four conditions (No Schema, Memory, Severance Schema, Severance Schema + Mem.).
# Generate (one command per subject model):
python run_all.py exp_outie --model claude-sonnet-4-20250514 --concurrency 8
python run_all.py exp_outie --model meta-llama/Llama-3.3-70B-Instruct-Turbo --concurrency 8
python run_all.py exp_outie --model deepseek-ai/DeepSeek-V3 --concurrency 8
python run_all.py exp_outie --model google/gemma-4-31B-it --concurrency 8
python run_all.py exp_outie --model Qwen/Qwen3-235B-A22B-Instruct-2507-tput --concurrency 8
# Evaluate (per subject, with both judges):
python run_all.py evaluate exp_outie --model <subject> --judge-model gpt-5.2-2025-12-11
python run_all.py evaluate exp_outie --model <subject> --judge-model claude-sonnet-4-20250514

exp_cal (fill-level curve): Claude Sonnet 4, five fill levels (0%, 25%, 50%, 75%, 100%) × two formats (Memory, Severance Schema).
python run_all.py exp_cal --model claude-sonnet-4-20250514 --concurrency 8
python run_all.py evaluate exp_cal --model claude-sonnet-4-20250514 --judge-model gpt-5.2-2025-12-11

exp_multi_natural (two-turn natural asking): Claude Sonnet 4, two-turn protocol across four conditions. The "natural" suffix marks the protocol used in the paper: turn 1 is the bare scenario question (no instruction to ask anything), and turn 2 re-runs the same scenario with the model's own clarifying questions answered from the profile.
python run_all.py exp_multi_natural \
--model claude-sonnet-4-20250514 \
--extractor-model claude-haiku-4-5-20251001 \
--concurrency 8
python run_all.py evaluate exp_multi_natural --model claude-sonnet-4-20250514 --judge-model gpt-5.2-2025-12-11
python run_all.py evaluate exp_multi_natural --model claude-sonnet-4-20250514 --judge-model claude-sonnet-4-20250514

The paper's exp_multi-main table reports T2 (from this experiment) and Δ = T2 − T1, where T1 is the same condition's row from exp_outie (no asking, no retrieval). assets/build_tables.py does the join automatically; a sketch of the idea follows.
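Conceptually, that join is just the following. File and column names below are hypothetical, for illustration only; the authoritative logic is in assets/build_tables.py.

```python
import pandas as pd

# Hypothetical file and column names, for illustration only.
t1 = pd.read_json("results/exp_outie_scores.json")          # single-turn baseline (T1)
t2 = pd.read_json("results/exp_multi_natural_scores.json")  # two-turn protocol (T2)

m = t2.merge(t1, on=["condition", "metric"], suffixes=("_t2", "_t1"))
m["delta"] = m["score_t2"] - m["score_t1"]                  # Δ = T2 − T1
print(m[["condition", "metric", "score_t2", "delta"]])
```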
exp_ablation (schema ablations): Claude Sonnet 4, three ablation arms (Format Only, Content Only, full Severance Schema) plus the No Schema baseline.
python run_all.py exp_ablation --model claude-sonnet-4-20250514 --concurrency 8
python run_all.py evaluate exp_ablation --model claude-sonnet-4-20250514 --judge-model gpt-5.2-2025-12-11

All headline numbers in the paper are produced from the JSONs in results/. No API calls are needed.
# Main paper tables (Tab. 2, exp_cal-main, exp_multi-main, app-ablation, ...):
python assets/build_tables.py
# Per-cell appendix tables and full breakdowns:
python assets/full_tables.py
# Cluster-bootstrap CIs (B = 10,000, clustered on profile × scenario):
python assets/bootstrap.py
# → writes assets/bootstrap_cis.txt (the file the paper transcribes verbatim)
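For reference, the scheme behind those CIs is a standard cluster bootstrap: resample whole profile × scenario clusters with replacement, recompute the mean on each resample, and take percentile quantiles. A minimal sketch of that idea (not the assets/bootstrap.py implementation; the cluster-key format is hypothetical):

```python
import numpy as np

def cluster_bootstrap_ci(scores, clusters, B=10_000, alpha=0.05, seed=0):
    """Percentile CI for the mean, resampling whole clusters with replacement."""
    rng = np.random.default_rng(seed)
    scores, clusters = np.asarray(scores, float), np.asarray(clusters)
    groups = [scores[clusters == c] for c in np.unique(clusters)]
    means = np.empty(B)
    for b in range(B):
        draw = rng.integers(0, len(groups), size=len(groups))
        means[b] = np.concatenate([groups[i] for i in draw]).mean()
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

# Example: cluster keys as "profile|scenario" strings, e.g. "p03|s21" (hypothetical).
# lo, hi = cluster_bootstrap_ci(harm_scores, cluster_keys)
```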
# Cross-judge agreement statistics (App. judge):
python assets/compute_judge_agreement.py
# Figures used in the paper (output: release/figures/):
python assets/figures.py
# Produces:
# fig_b_cross_family_6panel.pdf (Fig. 2 — exp_outie cross-family sweep)
# fig_c_fill_curve.pdf (Fig. 3 — exp_cal fill-level curve)
# The qualitative example (Fig. 4, example_lin_vaccines_mem.pdf) is shipped
# pre-rendered under release/examples/; it is hand-composed, not generated.

If you want to verify a single cell of Tab. 2 from scratch with a tiny budget:
python run_all.py exp_outie --model claude-sonnet-4-20250514 --max-profiles 2 --max-scenarios 5
python run_all.py evaluate exp_outie --model claude-sonnet-4-20250514 --judge-model gpt-5.2-2025-12-11
python run_all.py tables exp_outie --model claude-sonnet-4-20250514 --judge-model gpt-5.2-2025-12-11

This runs ~40 generations and ~40 judge calls (2 profiles × 5 scenarios × 4 conditions).
| File | Contents |
|---|---|
| profiles.json | 10 synthetic person profiles, organized by the schema's six dimensions |
| scenarios.json | 30 advisory scenarios + per-scenario claim lists |
| scenario_variants.json | Profile-routed variants (e.g. s21_grandkid for childless personas) |
| claims_by_dimension.json | The 52 claim labels grouped by dimension |
Per-scenario claim assignments (which claims are decision-flip vs. good-answer for each scenario) are visualized in assets/scenario_claim_matrix.pdf.
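To poke at the benchmark directly, a minimal sketch (it assumes only that each file is a top-level JSON array or object; field layouts are core.py's concern):

```python
import json

# Counts should match the table above: 10 profiles, 30 scenarios.
with open("data/profiles.json") as f:
    profiles = json.load(f)
with open("data/scenarios.json") as f:
    scenarios = json.load(f)

print(f"{len(profiles)} profiles, {len(scenarios)} scenarios")
```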
MIT. See LICENSE.
