DorLitvak/severance


The Severance Problem

Code and data release for "The Severance Problem: LLMs are Unaware of the Person Beyond the Prompt".

The Severance Schema is a prompt-level structural prior that gives an LLM an explicit inventory of which categories of person-context exist for the user it is serving, even when no personal data is filled in. We show that this single structural change (i) cuts harm and sycophancy across all five model families tested, (ii) recovers most of the safety cost introduced by bullet-style memory (hallucination 3.7–11.7% → 1.7–4.0%), (iii) holds across fill levels from 0% to 100%, and (iv) is the only condition under which a model's clarifying questions translate into a usefulness gain on the next turn.
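
To make "an explicit inventory even when no personal data is filled in" concrete, here is a minimal sketch of how such a prompt block could be rendered. The six dimension names below are hypothetical placeholders, not the schema shipped in `core.py` (the release's `data/claims_by_dimension.json` defines the actual dimensions):

```python
# Illustrative sketch only: these dimension names are hypothetical
# placeholders, not the actual Severance Schema from core.py.
DIMENSIONS = [
    "health", "finances", "relationships",
    "work", "living_situation", "values",
]

def render_severance_schema(profile: dict) -> str:
    """Render the person-context inventory, keeping unfilled slots explicit."""
    lines = ["Person-context inventory for this user:"]
    for dim in DIMENSIONS:
        value = profile.get(dim)
        lines.append(f"- {dim}: {value if value else '[unknown - not provided]'}")
    return "\n".join(lines)

# Even a fully empty profile still yields an explicit structural prior:
print(render_severance_schema({}))
```

The point of the sketch is that the category labels are always present; only the values vary with fill level.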

This release contains everything needed to reproduce the four experiments reported in the paper.


Repository layout

release/
├── core.py                 # Prompts, schema, profile-routing, scenario logic
├── run_all.py              # Single entrypoint: generation, evaluation, tables
├── run_all_batch.py        # Optional: batch-mode judge (Anthropic Batch API)
├── data/                   # Profiles, scenarios, claims (the benchmark)
├── results/                # Pre-computed model outputs + judge scores (134 MB)
├── assets/                 # Bootstrap, table-builder, figure scripts
├── examples/               # Single qualitative transcript (Fig. 4)
├── requirements.txt
└── LICENSE

Setup

pip install -r requirements.txt

# Portkey is used for ALL judging and for Claude/GPT subject generation.
# Set up two virtual keys in your Portkey dashboard (one fronting Anthropic,
# one fronting OpenAI) and export them along with your Portkey API key:
export PORTKEY_API_KEY=...
export PORTKEY_VK_ANTHROPIC=...      # Portkey virtual key for Anthropic
export PORTKEY_VK_OPENAI=...         # Portkey virtual key for OpenAI

# Together is only needed for the open-weight subject models:
export TOGETHER_API_KEY=...
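
Before launching long runs, it can be worth a quick preflight check that the keys above are actually exported. This is a small hypothetical helper, not part of the release:

```python
import os

# The environment variables the release's setup instructions require.
REQUIRED = [
    "PORTKEY_API_KEY",
    "PORTKEY_VK_ANTHROPIC",
    "PORTKEY_VK_OPENAI",
    "TOGETHER_API_KEY",  # only needed for the open-weight subject models
]

def missing_keys(env=os.environ) -> list[str]:
    """Return the required environment variables that are unset or empty."""
    return [k for k in REQUIRED if not env.get(k)]

missing = missing_keys()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All API keys present.")
```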

Models referenced in the paper:

| Slug | Provider |
|------|----------|
| `claude-sonnet-4-20250514` | Portkey/Anthropic |
| `meta-llama/Llama-3.3-70B-Instruct-Turbo` | Together |
| `deepseek-ai/DeepSeek-V3` | Together |
| `google/gemma-4-31B-it` | Together |
| `Qwen/Qwen3-235B-A22B-Instruct-2507-tput` | Together |
| `gpt-5.2-2025-12-11` (judge) | Portkey/OpenAI |

The four experiments

Each experiment is a three-stage pipeline:

  1. generate subject-model responses
  2. evaluate them with one or both judges
  3. tables / figures / bootstrap to produce the paper artifacts

The pre-computed outputs of stages 1–2 are shipped under results/, so a reviewer can skip directly to stage 3 if they trust the API outputs.

1 — exp_outie (Sec. 3.1, Tab. 2, Fig. fig_b_cross_family_6panel)

Five subject models × four conditions (No Schema, Memory, Severance Schema, Severance Schema + Mem.).

# Generate (one command per subject model):
python run_all.py exp_outie --model claude-sonnet-4-20250514 --concurrency 8
python run_all.py exp_outie --model meta-llama/Llama-3.3-70B-Instruct-Turbo --concurrency 8
python run_all.py exp_outie --model deepseek-ai/DeepSeek-V3 --concurrency 8
python run_all.py exp_outie --model google/gemma-4-31B-it --concurrency 8
python run_all.py exp_outie --model Qwen/Qwen3-235B-A22B-Instruct-2507-tput --concurrency 8

# Evaluate (per subject, with both judges):
python run_all.py evaluate exp_outie --model <subject> --judge-model gpt-5.2-2025-12-11
python run_all.py evaluate exp_outie --model <subject> --judge-model claude-sonnet-4-20250514

2 — exp_cal (Sec. 3.2, Fig. fig_c_fill_curve)

Claude Sonnet 4 across five fill levels (0%, 25%, 50%, 75%, 100%) × 2 formats (Memory, Severance Schema).

python run_all.py exp_cal --model claude-sonnet-4-20250514 --concurrency 8
python run_all.py evaluate exp_cal --model claude-sonnet-4-20250514 --judge-model gpt-5.2-2025-12-11

3 — exp_multi_natural (Sec. 3.3, Tab. exp_multi-main)

Claude Sonnet 4, two-turn natural-asking protocol across four conditions. The "natural" suffix marks the protocol used in the paper: turn 1 is the bare scenario question (no instruction to ask anything), and turn 2 re-runs the same scenario with the model's own clarifying questions answered from the profile.

python run_all.py exp_multi_natural \
    --model claude-sonnet-4-20250514 \
    --extractor-model claude-haiku-4-5-20251001 \
    --concurrency 8
python run_all.py evaluate exp_multi_natural --model claude-sonnet-4-20250514 --judge-model gpt-5.2-2025-12-11
python run_all.py evaluate exp_multi_natural --model claude-sonnet-4-20250514 --judge-model claude-sonnet-4-20250514

The paper's exp_multi-main table reports T2 (from this experiment) and Δ = T2 − T1, where T1 is the same condition's row from exp_outie (no asking, no retrieval). assets/build_tables.py does the join automatically.
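
The join is simple in spirit. Here is an illustrative sketch of the Δ = T2 − T1 computation (field names and numbers are made up for illustration; `assets/build_tables.py` defines the real ones):

```python
def usefulness_delta(t1_scores: dict[str, float],
                     t2_scores: dict[str, float]) -> dict[str, float]:
    """Join T1 rows (exp_outie: no asking, no retrieval) with T2 rows
    (exp_multi_natural) by condition and report the gain Δ = T2 − T1."""
    shared = t1_scores.keys() & t2_scores.keys()
    return {cond: t2_scores[cond] - t1_scores[cond] for cond in shared}

# Illustrative numbers only, not the paper's results:
t1 = {"No Schema": 0.5, "Severance Schema": 0.5}
t2 = {"No Schema": 0.5, "Severance Schema": 0.75}
print(usefulness_delta(t1, t2))
```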

4 — exp_ablation (App., Tab. exp-ablation)

Claude Sonnet 4, three ablation arms (Format Only, Content Only, full Severance Schema) plus the No Schema baseline.

python run_all.py exp_ablation --model claude-sonnet-4-20250514 --concurrency 8
python run_all.py evaluate exp_ablation --model claude-sonnet-4-20250514 --judge-model gpt-5.2-2025-12-11

Reproducing the paper's tables and figures

All headline numbers in the paper are produced from the JSONs in results/. No API calls are needed.

# Main paper tables (Tab. 2, exp_cal-main, exp_multi-main, app-ablation, ...):
python assets/build_tables.py

# Per-cell appendix tables and full breakdowns:
python assets/full_tables.py

# Cluster-bootstrap CIs (B = 10,000, clustered on profile × scenario):
python assets/bootstrap.py
# → writes assets/bootstrap_cis.txt (the file the paper transcribes verbatim)
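
For reference, the cluster-bootstrap idea can be sketched as resampling whole profile × scenario clusters with replacement, so within-cluster correlation is preserved. This is a simplified illustration, not the release's `assets/bootstrap.py`:

```python
import random
from collections import defaultdict

def cluster_bootstrap_ci(rows, b=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean score, resampling
    (profile, scenario) clusters with replacement.
    rows: iterable of ((profile, scenario), score) pairs."""
    clusters = defaultdict(list)
    for key, score in rows:
        clusters[key].append(score)
    keys = list(clusters)
    rng = random.Random(seed)
    means = []
    for _ in range(b):
        # Draw len(keys) clusters with replacement, pool their scores.
        sample = [s for k in rng.choices(keys, k=len(keys)) for s in clusters[k]]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int((alpha / 2) * b)]
    hi = means[int((1 - alpha / 2) * b) - 1]
    return lo, hi
```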

# Cross-judge agreement statistics (App. judge):
python assets/compute_judge_agreement.py

# Figures used in the paper (output: release/figures/):
python assets/figures.py
# Produces:
#   fig_b_cross_family_6panel.pdf   (Fig. 2 — exp_outie cross-family sweep)
#   fig_c_fill_curve.pdf            (Fig. 3 — exp_cal fill-level curve)
# The qualitative example (Fig. 4, example_lin_vaccines_mem.pdf) is shipped
# pre-rendered under release/examples/ — it is hand-composed, not generated.

Reproducing the headline number end-to-end (smoke test)

If you want to verify a single cell of Tab. 2 from scratch with a tiny budget:

python run_all.py exp_outie --model claude-sonnet-4-20250514 --max-profiles 2 --max-scenarios 5
python run_all.py evaluate exp_outie --model claude-sonnet-4-20250514 --judge-model gpt-5.2-2025-12-11
python run_all.py tables exp_outie --model claude-sonnet-4-20250514 --judge-model gpt-5.2-2025-12-11

This runs ~40 generations and ~40 judge calls.
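
The call count follows directly from the truncated benchmark size, assuming all four exp_outie conditions run per profile × scenario pair:

```python
# --max-profiles 2, --max-scenarios 5, four conditions per pair (assumed)
profiles, scenarios, conditions = 2, 5, 4
generations = profiles * scenarios * conditions
print(generations)  # 40 subject-model responses, then ~one judge call each
```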


The benchmark (data/)

| File | Contents |
|------|----------|
| `profiles.json` | 10 synthetic person profiles, organized by the schema's six dimensions |
| `scenarios.json` | 30 advisory scenarios + per-scenario claim lists |
| `scenario_variants.json` | Profile-routed variants (e.g. `s21_grandkid` for childless personas) |
| `claims_by_dimension.json` | The 52 claim labels grouped by dimension |

Per-scenario claim assignments (which claims are decision-flip vs. good-answer for each scenario) are visualized in assets/scenario_claim_matrix.pdf.


License

MIT. See LICENSE.
