Ziqian Zhong, Aashiq Muhamed, Mona T. Diab, Virginia Smith, Aditi Raghunathan
📝 Arxiv | 🖇️ Livepaper | 📦 Dataset | 🌐 Website
Pando is a benchmark for evaluating interpretability methods on language models with known ground-truth decision rules. We fine-tune 1000+ "model organisms" — small LMs with planted decision-tree circuits — and measure whether interpretability agents can recover the hidden rules under budget constraints.
Pando provides pre-trained model organisms so you can evaluate interpretability methods without training from scratch. The model organisms are hosted on HuggingFace under pando-dataset, organized into 17 repos by scenario and training configuration.
git clone https://github.com/AR-FORUM/Pando.git
cd pando
pip install -r requirements.txt

# Download a set of model organisms from HuggingFace
pip install huggingface_hub
hf download pando-dataset/car-purchase-freeform-std \
--local-dir outputs/models/car-purchase-freeform-std
# Run specific agents on one model (requires GPU + OPENAI_API_KEY)
# --exclude-seen reports accuracy only on the 90 heldout samples
python scripts/eval.py \
--model-dir outputs/models/car-purchase-freeform-std/<model_name> \
--agents gradient relp blackbox \
--fixed-prompt-budget --budget 10 --exclude-seen
# Run all agents
python scripts/eval.py \
--model-dir outputs/models/car-purchase-freeform-std/<model_name> \
--fixed-prompt-budget --budget 10 --exclude-seen

Each model directory contains circuit.json (the planted decision rule with
causal field sensitivities) and validation.json (2,000 pre-scored samples).
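For programmatic use, the two JSON artifacts can be loaded directly. This is a minimal sketch that assumes only that circuit.json and validation.json are valid JSON files at the top level of the model directory; their exact schemas are not modeled here:

```python
import json
from pathlib import Path

def load_organism_metadata(model_dir):
    """Load a model organism's planted rule and its pre-scored samples.

    Assumes only that circuit.json and validation.json sit at the top
    level of the model directory; their schemas are not modeled here.
    """
    model_dir = Path(model_dir)
    circuit = json.loads((model_dir / "circuit.json").read_text())
    validation = json.loads((model_dir / "validation.json").read_text())
    return circuit, validation
```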
# Train a depth-3 decision-tree model organism (LoRA, Gemma 2 2B-it)
python scripts/train.py \
--scenario car_purchase --depth 3 \
--base-model google/gemma-2-2b-it --chat-model --use-lora \
--training-format freeform --format-style natural

See scripts/train.py --help for all options (scenarios, circuit types, format
styles, rationale training, data mixing).
| Scenario | Fields | Decision |
|---|---|---|
| car_purchase | Brand, Year, Color, HP, Drivetrain, MPG, Seats, Interior, Condition, Price | Purchase yes/no |
| movie_pick | Release Year, Genre, Language, Runtime, Rating, Format, Budget, Box Office, Color, Cast | Watch yes/no |
| oversight_defection | Deploy Phase, Turn Count, Minutes, Auth, Trust, Complexity, Risk, Tools, Oversight, Logging | Violation yes/no |
Each scenario has 5 binary ENUM fields and 5 INTEGER fields. Circuits are randomly generated decision trees of configurable depth (d1-d4).
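Concretely, a planted circuit can be thought of as a randomly sampled tree over these two field types. The sketch below is illustrative only: the node schema, enum values, and integer ranges are assumptions, not Pando's actual circuit format.

```python
import random

# Illustrative split of car_purchase fields into the two types;
# the actual assignment lives in the repo's scenario definitions.
ENUM_FIELDS = ["Brand", "Color", "Drivetrain", "Interior", "Condition"]
INT_FIELDS = ["Year", "HP", "MPG", "Seats", "Price"]

def random_tree(depth, rng):
    """Sample a random decision tree of the given depth.

    Internal nodes test one field (equality for ENUM fields, a
    threshold for INTEGER fields); leaves carry a yes/no decision.
    """
    if depth == 0:
        return {"decision": rng.choice([True, False])}
    if rng.random() < 0.5:
        test = {"field": rng.choice(ENUM_FIELDS), "op": "==",
                "value": rng.choice(["A", "B"])}
    else:
        test = {"field": rng.choice(INT_FIELDS), "op": "<=",
                "value": rng.randint(0, 100)}
    return {"test": test,
            "true": random_tree(depth - 1, rng),
            "false": random_tree(depth - 1, rng)}

def evaluate(tree, sample):
    """Walk the tree on a dict of field values and return the decision."""
    while "decision" not in tree:
        t = tree["test"]
        v = sample[t["field"]]
        ok = (v == t["value"]) if t["op"] == "==" else (v <= t["value"])
        tree = tree["true"] if ok else tree["false"]
    return tree["decision"]
```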
| Agent | Strategy | Interp tool |
|---|---|---|
| blackbox | GPT pattern discovery | None |
| gradient | Gradient saliency | Embedding gradients |
| relp | RelP-modified gradients | LRP-rule gradients |
| logit_lens | Logit lens projections | Vocabulary projections |
| logit_lens_field | Per-field logit lens | Vocabulary projections |
| prefill | Prefill extraction | Forced decoding |
| sae_tfidf | SAE feature TF-IDF | Sparse autoencoder |
| sae_tfidf_filtered | SAE TF-IDF + keyword filtering | Sparse autoencoder |
| sae_gradient | SAE gradient attribution | Sparse autoencoder |
| sae_autointerp | SAE + Neuronpedia descriptions | Sparse autoencoder |
| sae_mean_diff | SAE mean activation difference | Sparse autoencoder |
| sae_token | SAE token-level features | Sparse autoencoder |
| res_token | Residual token similarity | Residual stream |
| circuit_tracer | Circuit tracing (unfiltered, large context) | Activation patching |
| circuit_tracer_filtered | Circuit tracing + keyword filtering (paper default) | Activation patching |
| tree_vote | Decision tree voting ensemble | Embedding gradients |
Plus baselines: majority, nn, logreg, always_true/false.
To reproduce the paper, we additionally provide cached evaluation results, so you can verify the numbers without re-running inference.
You only need lightweight dependencies (no GPU, no API keys):
pip install -r requirements-repro.txt

Each evaluation presents 100 test inputs (50/50 balanced). The agent has a budget of ~10 forward passes querying the model on a seeded subset (~10 visible inputs, identical across agents), then predicts the remaining ~90 heldout inputs. We report heldout-only accuracy. (We do not allow active sampling in the main experiments.)
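The budgeted split can be sketched as below. Both helpers (`split_visible_heldout`, `heldout_accuracy`) are hypothetical names mirroring the protocol described above, not the repo's actual implementation:

```python
import random

def split_visible_heldout(n_inputs=100, budget=10, seed=0):
    """Pick a seeded subset the agent may query; the rest are heldout.

    The same seed yields the same visible subset for every agent,
    matching the 'identical across agents' requirement.
    """
    rng = random.Random(seed)
    visible = sorted(rng.sample(range(n_inputs), budget))
    heldout = [i for i in range(n_inputs) if i not in set(visible)]
    return visible, heldout

def heldout_accuracy(predictions, labels, heldout):
    """Accuracy on heldout indices only (what --exclude-seen reports)."""
    correct = sum(predictions[i] == labels[i] for i in heldout)
    return correct / len(heldout)
```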
Option A — single zip (recommended, avoids rate-limiting):
hf download pando-dataset/evaluation-results-zip \
--repo-type dataset --local-dir /tmp/eval-zip
unzip /tmp/eval-zip/evaluation-results.zip -d .

Option B — individual files via HF:
hf download pando-dataset/evaluation-results \
--repo-type dataset --local-dir outputs/Both populate outputs/evaluations/ (74 batch directories, ~3 GB) and
outputs/sensitivity_cache.json. The cached JSONs store predictions on all
100 inputs; the analysis scripts below reconstruct heldout accuracy from
per_input_results by excluding the ~10 queried indices.
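Given the per_input_results schema ({index, predicted, correct, ...}), that reconstruction amounts to roughly the following; the helper name is hypothetical and only the "index" and "correct" fields are relied on:

```python
def reconstruct_heldout_accuracy(per_input_results, queried_indices):
    """Recompute heldout-only accuracy from a cached agent_results JSON.

    Drops entries whose "index" was in the agent's queried (visible)
    subset, then averages the boolean "correct" field over the rest.
    """
    queried = set(queried_indices)
    heldout = [r for r in per_input_results if r["index"] not in queried]
    return sum(r["correct"] for r in heldout) / len(heldout)
```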
| Paper artifact | Command |
|---|---|
| Table 3 (main accuracy + F1) | python scripts/analysis/generate_tables.py |
| Table 4 (variance decomposition) | python scripts/analysis/analyze_value_bias.py outputs/evaluations/batch_20260301_033718 outputs/evaluations/batch_20260301_033721 --agents relp gradient sae_gradient sae_tfidf logit_lens_field |
| Table 6 (full agent variants) | python scripts/analysis/generate_tables.py --scenario car_purchase movie_pick |
| Table 8 (format robustness) | python scripts/analysis/generate_tables.py --table 6 |
| Table 9 (data mixing) | python scripts/analysis/generate_tables.py --table 7 |
| Tables 10/11 (tree voting) | python scripts/analysis/generate_tables.py (tree_vote.json already in eval artifact) |
| Table 13 (per-field AUC) | python scripts/analysis/analyze_interp_field_bias.py outputs/evaluations/batch_20260301_033718 outputs/evaluations/batch_20260301_033721 --agents relp gradient logit_lens_field sae_tfidf sae_raw sae_gradient |
| Figure 3 (budget sweep) | python paper_artifacts/plot_budget_sweep.py |
| Figure 4 (autoresearch) | python scripts/analysis/plot_autoresearch_progression.py |
Each outputs/evaluations/batch_*/ directory holds one eval run. Each model
subdirectory contains:
- config.json -- model metadata (scenario, depth, training config)
- test_data.json -- 100 test samples (50/50 balanced)
- agent_results/<agent>.json -- per-agent predictions:
  - accuracy, correct, total
  - per_input_results -- per-sample {index, predicted, correct, ...}
  - agent_metadata.pattern -- natural-language rule discovered by GPT-5.1
outputs/sensitivity_cache.json maps circuit expressions to per-field causal
sensitivity scores (0-1), used as ground truth for field-F1 metrics.
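A field-F1 score of this kind can be sketched as below; treating fields with sensitivity above a threshold as the ground-truth causal set is an assumption here (the repo's exact thresholding may differ):

```python
def field_f1(predicted_fields, sensitivities, threshold=0.0):
    """F1 between the fields an agent names and the truly causal fields.

    Ground truth: fields whose cached sensitivity score exceeds
    `threshold` (the cutoff choice is an assumption of this sketch).
    """
    truth = {f for f, s in sensitivities.items() if s > threshold}
    pred = set(predicted_fields)
    tp = len(pred & truth)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)
```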
livepaper/ contains an agent-replication-friendly version of the paper, generated with the livepaper harness; refer to it for more details.
@article{zhong2026pando,
title = {Pando: Do Interpretability Methods Work When Models Won't Explain Themselves?},
author = {Zhong, Ziqian and Muhamed, Aashiq and Diab, Mona T. and Smith, Virginia and Raghunathan, Aditi},
journal = {arXiv preprint arXiv:2604.11061},
year = {2026}
}

MIT