
Pando: Do Interpretability Methods Work When Models Won't Explain Themselves?

Ziqian Zhong, Aashiq Muhamed, Mona T. Diab, Virginia Smith, Aditi Raghunathan

📝 arXiv | 🖇️ Livepaper | 📦 Dataset | 🌐 Website

Pando is a benchmark for evaluating interpretability methods on language models with known ground-truth decision rules. We fine-tune 1000+ "model organisms" — small LMs with planted decision-tree circuits — and measure whether interpretability agents can recover the hidden rules under budget constraints.

Use the benchmark

Pando provides pre-trained model organisms so you can evaluate interpretability methods without training from scratch. The model organisms are hosted on HuggingFace under pando-dataset, organized into 17 repos by scenario and training configuration.

Setup

git clone https://github.com/AR-FORUM/Pando.git
cd Pando
pip install -r requirements.txt

Running agents on existing model organisms

# Download a set of model organisms from HuggingFace
pip install huggingface_hub
hf download pando-dataset/car-purchase-freeform-std \
    --local-dir outputs/models/car-purchase-freeform-std

# Run specific agents on one model (requires GPU + OPENAI_API_KEY)
# --exclude-seen reports accuracy only on the 90 heldout samples
python scripts/eval.py \
    --model-dir outputs/models/car-purchase-freeform-std/<model_name> \
    --agents gradient relp blackbox \
    --fixed-prompt-budget --budget 10 --exclude-seen

# Run all agents
python scripts/eval.py \
    --model-dir outputs/models/car-purchase-freeform-std/<model_name> \
    --fixed-prompt-budget --budget 10 --exclude-seen

Each model directory contains circuit.json (the planted decision rule with causal field sensitivities) and validation.json (2,000 pre-scored samples).
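A small helper can load both files for one model organism (a sketch: the key names inside each JSON are not asserted here, so inspect your download for the exact schema):

```python
import json
from pathlib import Path

def load_organism(model_dir):
    """Load the planted rule and pre-scored samples for one model organism.

    Only parses the two files every model directory is documented to
    contain; the structure inside each JSON is left to the caller.
    """
    model_dir = Path(model_dir)
    circuit = json.loads((model_dir / "circuit.json").read_text())
    validation = json.loads((model_dir / "validation.json").read_text())
    return circuit, validation
```

For example, `load_organism("outputs/models/car-purchase-freeform-std/<model_name>")` returns the circuit and validation objects for one downloaded model.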

Training new model organisms

# Train a depth-3 decision-tree model organism (LoRA, Gemma 2 2B-it)
python scripts/train.py \
    --scenario car_purchase --depth 3 \
    --base-model google/gemma-2-2b-it --chat-model --use-lora \
    --training-format freeform --format-style natural

See scripts/train.py --help for all options (scenarios, circuit types, format styles, rationale training, data mixing).

Scenarios

| Scenario | Fields | Decision |
|---|---|---|
| car_purchase | Brand, Year, Color, HP, Drivetrain, MPG, Seats, Interior, Condition, Price | Purchase yes/no |
| movie_pick | Release Year, Genre, Language, Runtime, Rating, Format, Budget, Box Office, Color, Cast | Watch yes/no |
| oversight_defection | Deploy Phase, Turn Count, Minutes, Auth, Trust, Complexity, Risk, Tools, Oversight, Logging | Violation yes/no |

Each scenario has 5 binary ENUM fields and 5 INTEGER fields. Circuits are randomly generated decision trees of configurable depth (d1-d4).
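For intuition, a randomly generated decision-tree circuit over such fields might look like the sketch below. This is purely illustrative: the field names, enum values, thresholds, and dict-based node layout are assumptions, not Pando's actual circuit format (see each model's circuit.json for that).

```python
import random

# Hypothetical field inventory mirroring the car_purchase scenario.
ENUM_FIELDS = ["Brand", "Color", "Drivetrain", "Interior", "Condition"]
INT_FIELDS = ["Year", "HP", "MPG", "Seats", "Price"]

def random_tree(depth, rng):
    """Build a random binary decision tree of the given depth."""
    if depth == 0:
        return {"leaf": rng.choice([True, False])}
    if rng.random() < 0.5:  # branch on an enum equality test
        test = {"field": rng.choice(ENUM_FIELDS), "op": "==",
                "value": rng.choice(["A", "B"])}
    else:                   # branch on an integer threshold test
        test = {"field": rng.choice(INT_FIELDS), "op": ">=",
                "value": rng.randint(0, 100)}
    return {"test": test,
            "yes": random_tree(depth - 1, rng),
            "no": random_tree(depth - 1, rng)}

def evaluate(node, sample):
    """Walk the tree for one input and return its yes/no decision."""
    while "leaf" not in node:
        t = node["test"]
        ok = (sample[t["field"]] == t["value"] if t["op"] == "=="
              else sample[t["field"]] >= t["value"])
        node = node["yes"] if ok else node["no"]
    return node["leaf"]
```

A depth-3 tree (`random_tree(3, random.Random(0))`) corresponds to the d3 setting trained by scripts/train.py above.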

Agents

| Agent | Strategy | Interp tool |
|---|---|---|
| blackbox | GPT pattern discovery | None |
| gradient | Gradient saliency | Embedding gradients |
| relp | RelP-modified gradients | LRP-rule gradients |
| logit_lens | Logit lens projections | Vocabulary projections |
| logit_lens_field | Per-field logit lens | Vocabulary projections |
| prefill | Prefill extraction | Forced decoding |
| sae_tfidf | SAE feature TF-IDF | Sparse autoencoder |
| sae_tfidf_filtered | SAE TF-IDF + keyword filtering | Sparse autoencoder |
| sae_gradient | SAE gradient attribution | Sparse autoencoder |
| sae_autointerp | SAE + Neuronpedia descriptions | Sparse autoencoder |
| sae_mean_diff | SAE mean activation difference | Sparse autoencoder |
| sae_token | SAE token-level features | Sparse autoencoder |
| res_token | Residual token similarity | Residual stream |
| circuit_tracer | Circuit tracing (unfiltered, large context) | Activation patching |
| circuit_tracer_filtered | Circuit tracing + keyword filtering (paper default) | Activation patching |
| tree_vote | Decision tree voting ensemble | Embedding gradients |

Plus baselines: majority, nn, logreg, always_true/false.
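As a point of reference, the majority baseline can be sketched in a few lines, assuming it simply predicts whichever label is more common among the visible queried samples for every heldout input (the tie-breaking rule here is a guess):

```python
def majority_baseline(visible_labels):
    """Predict one label for all heldout inputs: the majority label
    among the ~10 visible queried samples. Ties break toward True
    (an assumption, not a documented choice)."""
    trues = sum(bool(label) for label in visible_labels)
    return trues >= len(visible_labels) - trues
```

With a 50/50-balanced test set, this baseline hovers near chance, which is what makes it a useful floor for the interpretability agents above.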

Reproduce the paper

To reproduce the paper, we additionally provide cached evaluation results, so you can verify the numbers without re-running inference.

You only need lightweight dependencies (no GPU, no API keys):

pip install -r requirements-repro.txt

Each evaluation presents 100 test inputs (50/50 balanced). The agent spends a budget of ~10 forward passes querying the model on a seeded subset of ~10 visible inputs (identical across agents), then predicts the remaining ~90 heldout inputs. We report heldout-only accuracy. (Active sampling is not allowed in the main experiments.)

Download evaluation data

Option A — single zip (recommended, avoids rate-limiting):

hf download pando-dataset/evaluation-results-zip \
    --repo-type dataset --local-dir /tmp/eval-zip
unzip /tmp/eval-zip/evaluation-results.zip -d .

Option B — individual files via HF:

hf download pando-dataset/evaluation-results \
    --repo-type dataset --local-dir outputs/

Both populate outputs/evaluations/ (74 batch directories, ~3 GB) and outputs/sensitivity_cache.json. The cached JSONs store predictions on all 100 inputs; the analysis scripts below reconstruct heldout accuracy from per_input_results by excluding the ~10 queried indices.
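The heldout-accuracy reconstruction the analysis scripts perform can be sketched as follows. It assumes each per_input_results entry carries index and correct fields as documented under "Eval data format"; where a given batch stores its queried indices may differ, so the caller supplies them:

```python
def heldout_accuracy(agent_result, queried_indices):
    """Recompute heldout-only accuracy from one cached agent result.

    agent_result is the parsed agent_results/<agent>.json dict; the
    ~10 queried (visible) indices are dropped before scoring.
    """
    queried = set(queried_indices)
    held = [r for r in agent_result["per_input_results"]
            if r["index"] not in queried]
    return sum(bool(r["correct"]) for r in held) / len(held)
```

So the headline numbers are accuracy over ~90 of the 100 cached predictions, never over the inputs the agent was allowed to see.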

Tables and figures

| Paper artifact | Command |
|---|---|
| Table 3 (main accuracy + F1) | `python scripts/analysis/generate_tables.py` |
| Table 4 (variance decomposition) | `python scripts/analysis/analyze_value_bias.py outputs/evaluations/batch_20260301_033718 outputs/evaluations/batch_20260301_033721 --agents relp gradient sae_gradient sae_tfidf logit_lens_field` |
| Table 6 (full agent variants) | `python scripts/analysis/generate_tables.py --scenario car_purchase movie_pick` |
| Table 8 (format robustness) | `python scripts/analysis/generate_tables.py --table 6` |
| Table 9 (data mixing) | `python scripts/analysis/generate_tables.py --table 7` |
| Tables 10/11 (tree voting) | `python scripts/analysis/generate_tables.py` (tree_vote.json already in eval artifact) |
| Table 13 (per-field AUC) | `python scripts/analysis/analyze_interp_field_bias.py outputs/evaluations/batch_20260301_033718 outputs/evaluations/batch_20260301_033721 --agents relp gradient logit_lens_field sae_tfidf sae_raw sae_gradient` |
| Figure 3 (budget sweep) | `python paper_artifacts/plot_budget_sweep.py` |
| Figure 4 (autoresearch) | `python scripts/analysis/plot_autoresearch_progression.py` |

Eval data format

Each outputs/evaluations/batch_*/ directory holds one eval run. Each model subdirectory contains:

  • config.json -- model metadata (scenario, depth, training config)
  • test_data.json -- 100 test samples (50/50 balanced)
  • agent_results/<agent>.json -- per-agent predictions
    • accuracy, correct, total
    • per_input_results -- per-sample {index, predicted, correct, ...}
    • agent_metadata.pattern -- natural-language rule discovered by GPT-5.1

outputs/sensitivity_cache.json maps circuit expressions to per-field causal sensitivity scores (0-1), used as ground truth for field-F1 metrics.
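A field-level F1 against these cached sensitivity scores could be computed roughly like this. The sketch treats a field as ground-truth "relevant" when its sensitivity exceeds a threshold; the 0.1 threshold is an assumption for illustration, not Pando's documented choice:

```python
def field_f1(predicted_fields, sensitivities, threshold=0.1):
    """F1 between the fields an agent claims the rule uses and the
    fields whose cached causal sensitivity exceeds `threshold`."""
    truth = {f for f, s in sensitivities.items() if s > threshold}
    pred = set(predicted_fields)
    tp = len(truth & pred)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)
```

For instance, with sensitivities `{"Price": 0.8, "Year": 0.5, "Color": 0.0}`, an agent that names Price and Color scores 0.5 (one true positive against one miss and one false alarm).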

Live paper

livepaper/ contains an agent-replication-friendly version of the paper, generated with the livepaper harness; refer to it for further details.

Citation

@article{zhong2026pando,
  title   = {Pando: Do Interpretability Methods Work When Models Won't Explain Themselves?},
  author  = {Zhong, Ziqian and Muhamed, Aashiq and Diab, Mona T. and Smith, Virginia and Raghunathan, Aditi},
  journal = {arXiv preprint arXiv:2604.11061},
  year    = {2026}
}

License

MIT
