# OpenInterpretability · notebooks

From `pip install transformers` to your own paper-grade SAE — 23 Colab / Kaggle / cloud notebooks covering every step.



## Part of a 5-repo ecosystem

| Repo | What's in it |
|---|---|
| `.github` | Org profile + shared CoC + SECURITY |
| `web` | Next.js site behind openinterp.org |
| `notebooks` (you are here) | 23 training + interpretability notebooks |
| `cli` | `pip install openinterp` — Python SDK |
| `mechreward` | SAE features as dense RL reward |

## 🚀 The core ladder — train your first SAE

| Tier | Notebook | Platform | VRAM | Cost | Model | Time |
|---|---|---|---|---|---|---|
| Hobbyist | `01_hobbyist_gemma2_2b_colab.ipynb` | Colab Free T4 | 15 GB | $0 | Gemma-2-2B | 30–40 min |
| Explorer | `02_explorer_qwen35_4b_kaggle.ipynb` | Kaggle 2× T4 | 32 GB | $0 | Qwen3.5-4B (hybrid GDN) | 4–5 h |
| Paper-grade | `03_papergrade_qwen36_27b_cloud.ipynb` | Cloud RTX 6000 Pro | 96 GB | ~$30–60 | Qwen3.6-27B | 20–24 h |

## 🔍 After you train — close the loop

| Notebook | What it does |
|---|---|
| `04_discover_features.ipynb` | Auto-label your SAE's features with Claude or GPT-4; emits `feature_catalog.json` |
| `05_build_shareable_trace.ipynb` | Your SAE + your prompt → `trace.json` in the Trace Theater format |
| `06_steer_your_model.ipynb` | Live feature intervention: baseline vs α ∈ {−3, 0, 1, 3}. Q1 preview of the Q2 Sandbox. |
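The intervention in the steering notebook boils down to adding a scaled copy of one feature's decoder direction to the residual stream. A minimal numpy sketch — toy sizes, illustrative names, and normalization is one design choice among several:

```python
import numpy as np

def steer(resid, decoder, feat_idx, alpha):
    """Add alpha times one SAE feature's decoder direction to a
    residual-stream activation. resid: (d_model,), decoder:
    (n_features, d_model)."""
    direction = decoder[feat_idx]
    # Unit-normalizing keeps alpha comparable across features.
    direction = direction / np.linalg.norm(direction)
    return resid + alpha * direction

rng = np.random.default_rng(0)
resid = rng.standard_normal(256)              # toy d_model
decoder = rng.standard_normal((2048, 256))    # toy dictionary

baseline = steer(resid, decoder, feat_idx=7, alpha=0.0)
boosted = steer(resid, decoder, feat_idx=7, alpha=3.0)
```

In the notebook this runs inside a forward hook at the SAE's layer, so every generated token sees the shifted residual.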

## 🧭 Before you train — reduce friction

| Notebook | What it does |
|---|---|
| `07_pick_your_tier.ipynb` | VRAM calculator + layer recommender. Zero GPU needed. |
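The kind of back-of-envelope arithmetic the calculator automates can be sketched in a few lines. This is a rough heuristic, not notebook 07's actual logic — the expansion factor, batch size, and overhead margin are assumptions:

```python
def estimate_train_vram_gb(n_params, d_model, expansion=8, batch_tokens=4096):
    """Very rough VRAM estimate for SAE training on a frozen model:
    bf16 weights (2 bytes/param) + fp32 SAE encoder/decoder + cached
    residual activations, padded by ~30% for optimizer state and
    fragmentation. Heuristic only."""
    model_gb = n_params * 2 / 1e9                  # bf16 weights
    d_sae = d_model * expansion
    sae_gb = 2 * d_model * d_sae * 4 / 1e9         # enc + dec, fp32
    acts_gb = batch_tokens * d_model * 4 / 1e9     # cached residuals
    return (model_gb + sae_gb + acts_gb) * 1.3

# Gemma-2-2B (~2.6B params, d_model 2304) should fit the 15 GB T4 tier.
est = estimate_train_vram_gb(2.6e9, 2304)
```

Real usage also depends on sequence length, attention implementation, and whether activations are streamed from disk.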

## 🧪 More models — same recipe, different architectures

| Notebook | Model | Platform |
|---|---|---|
| `08_explorer_llama3_8b_kaggle.ipynb` | Llama-3.1-8B (Meta license) | Kaggle 2× T4 |
| `09_explorer_mistral_7b_kaggle.ipynb` | Mistral-7B-v0.3 | Kaggle 2× T4 |
| `10_hobbyist_phi3_mini_colab.ipynb` | Phi-3-mini-4k (Microsoft) | Colab Free T4 |

## 🎓 Research-grade — replicate published results

| Notebook | Paper / protocol |
|---|---|
| `11_stage_gate_g1.ipynb` | Stage Gate 1 correlation pre-test (mechreward protocol) — ρ ≥ 0.30 on held-out GSM8K |
| `12_batchtopk_vs_topk.ipynb` | BatchTopK vs TopK (Bussmann et al., arXiv:2412.06410) |

## 🛡️ Safety + production preview

| Notebook | What it does |
|---|---|
| `13_watchtower_preview.ipynb` | Monitor input prompts for anomalous feature activations. Q1 preview of Q4 Watchtower Enterprise. Forward-only, no generation. |

## 🔗 Circuits — attribution graphs between SAE features

| Notebook | What it does |
|---|---|
| `14_attribution_patching.ipynb` | AtP* (Kramár et al. 2024, arXiv:2403.00745) — QK-fix + GradDrop node attribution |
| `15_sparse_feature_circuits.ipynb` | Marks et al. 2024 (arXiv:2403.19647) replication — node + edge + error-term DAG |
| `16_autocircuit_acdc.ipynb` | ACDC slow mode via AutoCircuit |
| `17_train_crosscoder.ipynb` | Sparse crosscoder (Lindsey et al. 2024) — shared dictionary across L11/L31/L55 |

All circuit notebooks emit JSON consumed directly by the Circuit Canvas on openinterp.org.
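The estimator at the heart of attribution patching is simple enough to show inline. This is the vanilla first-order AtP estimate — notebook 14's AtP* adds the QK-fix and GradDrop corrections on top — and the array names are illustrative:

```python
import numpy as np

def attribution_patch(clean_act, corrupt_act, grad_clean):
    """First-order estimate of the metric change from patching one
    node: (a_corrupt - a_clean) . dL/da, with the gradient taken on
    the clean run. One backward pass scores every node at once,
    versus one forward pass per node for activation patching."""
    return np.dot(corrupt_act - clean_act, grad_clean)

rng = np.random.default_rng(0)
clean = rng.standard_normal(64)    # node activation, clean prompt
corrupt = rng.standard_normal(64)  # same node, corrupted prompt
grad = rng.standard_normal(64)     # dL/d(activation) on clean run
effect = attribution_patch(clean, corrupt, grad)
```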

## 📊 Leaderboard — InterpScore v0.0.1

| Notebook | What it does |
|---|---|
| `18_interpscore_eval.ipynb` | Composite SAE ranking — loss_recovered + alive + L0 + sparse probing + TPP. Emits `interpscore.json` → PR to `web/lib/leaderboard.ts`. |
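To make the composite concrete, here is a hypothetical equal-weight combination of the five components. The actual InterpScore v0.0.1 weights and normalizations live in the notebook and `web/lib/leaderboard.ts`; the L0 mapping below is an assumption for illustration:

```python
def interp_score(loss_recovered, alive_frac, l0, probe_acc, tpp,
                 l0_target=64):
    """Illustrative composite: mean of five [0, 1] components.
    L0 is mapped so hitting the target sparsity scores 1.0 and
    over-dense SAEs are penalized proportionally."""
    l0_score = min(1.0, l0_target / max(l0, 1e-9))
    parts = [loss_recovered, alive_frac, l0_score, probe_acc, tpp]
    return sum(parts) / len(parts)

score = interp_score(loss_recovered=0.92, alive_frac=0.85,
                     l0=64, probe_acc=0.78, tpp=0.41)
```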

## 🔭 Lenses — classic layer-wise prediction tools

| Notebook | Method |
|---|---|
| `19_logit_lens.ipynb` | Logit Lens (nostalgebraist 2020). 5 lines of PyTorch, ~5 min on T4. |
| `20_tuned_lens.ipynb` | Tuned Lens (Belrose et al. 2023, arXiv:2303.08112). Pretrained or fresh-fit. |
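The Logit Lens really is that small: push an intermediate residual state through the model's final LayerNorm and unembedding as if the forward pass stopped there. A framework-agnostic numpy sketch (toy shapes; the notebook's version uses the real model tensors):

```python
import numpy as np

def logit_lens(hidden, W_U, ln_gamma, eps=1e-5):
    """Project a layer-l residual state through final LayerNorm +
    unembedding. hidden: (d_model,), W_U: (d_model, vocab)."""
    h = (hidden - hidden.mean()) / np.sqrt(hidden.var() + eps)
    return (h * ln_gamma) @ W_U

rng = np.random.default_rng(0)
hidden = rng.standard_normal(256)         # toy residual state
W_U = rng.standard_normal((256, 1000))    # toy unembedding
logits = logit_lens(hidden, W_U, np.ones(256))
top_token = int(np.argmax(logits))
```

Tuned Lens replaces the identity projection with a learned affine map per layer, which is why notebook 20 has a fresh-fit mode.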

## 📏 Probing — the supervised baselines SAE features must beat

| Notebook | Method |
|---|---|
| `21_linear_probe.ipynb` | sklearn `LogisticRegression` on residuals + diff-of-means baseline (Farquhar 2023 requires it) |
| `22_ccs_probe.ipynb` | Contrast-Consistent Search (Burns 2022) with honest critique baselines |
| `23_repe_reading_vector.ipynb` | Representation Engineering LAT (Zou 2023) — extract + monitor + steer |
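The diff-of-means baseline from notebook 21 needs no fitting library at all — the probe direction is just the vector between class centroids. A self-contained numpy sketch on synthetic residuals:

```python
import numpy as np

def diff_of_means_probe(X_pos, X_neg):
    """Diff-of-means probe: direction between class centroids,
    threshold at the midpoint. Returns (w, b) for the classifier
    sign(x @ w + b)."""
    mu_pos, mu_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    w = mu_pos - mu_neg
    b = -((mu_pos + mu_neg) / 2) @ w
    return w, b

rng = np.random.default_rng(0)
X_pos = rng.standard_normal((200, 32)) + 1.0   # separated synthetic classes
X_neg = rng.standard_normal((200, 32)) - 1.0
w, b = diff_of_means_probe(X_pos, X_neg)
labels = np.array([True] * 200 + [False] * 200)
acc = ((np.vstack([X_pos, X_neg]) @ w + b > 0) == labels).mean()
```

If an SAE feature can't match this two-line baseline on your concept, that's a finding worth reporting honestly.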

## 🛠️ Shared recipe (every training tier)

All tiers use the same research-grade protocol; only the hyperparameters scale:

- TopK activation (Gao et al. 2024) — hard top-k, no L1 penalty
- AuxK auxiliary loss — dead-feature revival (α=1/32, k_aux=d/2, dead_threshold=10M tokens)
- Geometric-median `b_dec` init (Weiszfeld) — robust to heavy-tailed residuals
- Decoder column renorm every step — keeps features interpretable
- Cosine LR + warmup — non-zero floor for continued dead-feature revival
- HuggingFace streaming checkpoints — crash-safe; never lose more than 5–10 min
- sae_lens-compatible export — `safetensors` + `cfg.json`
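The first and fourth ingredients can be sketched in a few lines of numpy. This is a toy-scale illustration of the TopK forward pass and per-step decoder renorm, not the notebooks' actual training code (tensor names and initialization are illustrative):

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """TopK SAE forward: keep the k largest pre-activations, zero
    the rest. The hard top-k enforces sparsity directly, so no L1
    penalty is needed."""
    pre = (x - b_dec) @ W_enc + b_enc
    acts = np.where(pre >= np.sort(pre)[-k], np.maximum(pre, 0), 0.0)
    return acts @ W_dec + b_dec, acts

def renorm_decoder(W_dec):
    """Renormalize decoder rows to unit norm (done every step) so
    feature magnitude lives in the activations, not the dictionary."""
    return W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)

rng = np.random.default_rng(0)
d, f, k = 64, 512, 16                                 # toy sizes
W_enc = rng.standard_normal((d, f)) / np.sqrt(d)
W_dec = renorm_decoder(rng.standard_normal((f, d)))
x = rng.standard_normal(d)
recon, acts = topk_sae_forward(x, W_enc, np.zeros(f), W_dec, np.zeros(d), k)
```

The real recipe adds the AuxK loss on revived dead features and initializes `b_dec` at the geometric median of the residuals rather than zero.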

## 🚦 Hard constraints on every notebook

If you port an existing notebook or write a new one, honor these — CI and review will check:

| ✅ DO | ❌ DON'T |
|---|---|
| `dtype=torch.bfloat16` | `torch_dtype=` (deprecated in transformers 5.x) |
| `attn_implementation='sdpa'` | flash-attn (reproducibility + install pain) |
| `HF_TOKEN` via Colab/Kaggle secret | Hard-coded tokens |
| HF streaming checkpoints every 5–10M tokens | Drive-only checkpoints (kernel dies = data loss) |
| Per-layer `model.language_model.layers[N]` fallback | Hard-coded `.layers[N]` (breaks on multimodal) |
| Honest var_expl + L0 + dead% | Cherry-picked seeds |
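The per-layer fallback rule might look like the helper below. The helper itself is a sketch (the attribute paths cover common transformers layouts but are not exhaustive); the commented load call shows the required `dtype` / `attn_implementation` flags in context:

```python
def get_decoder_layers(model):
    """Resolve the decoder layer list across text-only and
    multimodal checkpoints instead of hard-coding .layers[N]."""
    for path in ("language_model.layers",   # multimodal wrappers
                 "model.layers",            # most text-only LLMs
                 "layers"):                 # bare decoder stacks
        obj = model
        try:
            for attr in path.split("."):
                obj = getattr(obj, attr)
            return obj
        except AttributeError:
            continue
    raise AttributeError("no decoder layer list found")

# Intended use (needs transformers + a GPU; not run here):
# model = AutoModelForCausalLM.from_pretrained(
#     MODEL_ID, dtype=torch.bfloat16, attn_implementation="sdpa")
# target_layer = get_decoder_layers(model)[LAYER]

# Smoke test with a stub standing in for a multimodal model:
from types import SimpleNamespace
stub = SimpleNamespace(language_model=SimpleNamespace(layers=["l0", "l1"]))
assert get_decoder_layers(stub) == ["l0", "l1"]
```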

## 📓 How to contribute a new notebook

Full rules in `CONTRIBUTING.md`. The 3 most common PR patterns:

### 1. Port a notebook to a new model

The most valuable contribution. Pick an existing notebook that matches your tier (01 for hobbyist, 02 for Kaggle-scale, 03 for paper-grade) and swap:

```python
MODEL_ID   = 'meta-llama/Llama-3.2-3B'   # was: 'google/gemma-2-2b'
LAYER      = 14                          # was: 15 — middle-stack heuristic
D_MODEL    = 3072                        # was: 2304
```

Name the new file `NN_<tier>_<model-slug>_<platform>.ipynb`, where `NN` is the next free number.

PR title: `Add Hobbyist tier for Llama-3.2-3B (notebook 24)` — include a screenshot of the final eval cell output.

### 2. Replicate a 2024–2026 paper

Add a notebook under notebooks/ that reproduces the main result. Structure:

  1. Title markdown cell with full citation + arxiv link
  2. Install cell with pinned versions
  3. Config cell with all hyperparameters from the paper
  4. Implementation of the method (inline, not a separate repo — notebooks are self-contained)
  5. Validation cell that outputs the paper's headline metric

PR title: `Replicate: <paper short title> (notebook NN)` — match the paper's exact numbers within tolerance.

### 3. Add a platform (TPU, ROCm, MPS)

Right now every notebook assumes CUDA. Adding a platform is a multi-notebook effort, usually via a common helper:

- Write `notebooks/_platform_<name>.py` with `pick_device()`, `get_dtype()`, etc.
- Patch one existing notebook to use it as proof-of-concept
- Open a draft PR and tag @caiovicentino for design review before the full port
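The helper's shape might look like this — a hypothetical sketch of such a module. Only the two function names come from the bullet above; the bodies and fallback order are assumptions:

```python
"""Hypothetical platform helper: shared device/dtype selection
so individual notebooks never branch on hardware themselves."""

def pick_device():
    """Prefer CUDA, then Apple MPS, else CPU. Degrades cleanly
    when torch is not installed so the module stays importable."""
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"
        mps = getattr(torch.backends, "mps", None)
        if mps is not None and mps.is_available():
            return "mps"
    except ImportError:
        pass
    return "cpu"

def get_dtype(device):
    """bf16 on CUDA; fp32 on MPS/CPU for numerical safety."""
    return "bfloat16" if device == "cuda" else "float32"

device = pick_device()
```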

## ✔️ Before opening a PR — validate locally

```bash
python3 -c "import json; json.load(open('notebooks/YOUR_NOTEBOOK.ipynb'))"
```

This catches the most common breakage (bad JSON, unclosed cell). CI also runs `nbformat.validate` on every PR.
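A slightly stricter local pre-flight wraps both checks in one function — the helper name is illustrative, and the `nbformat` step is skipped gracefully when the package isn't installed since CI runs it regardless:

```python
import json

def preflight(path):
    """Check a notebook parses as JSON, has a top-level cells list,
    and (when nbformat is available) passes the same schema
    validation CI runs."""
    with open(path) as f:
        nb = json.load(f)                      # catches bad JSON
    assert isinstance(nb.get("cells"), list), "missing cells list"
    try:
        import nbformat                        # same check as CI
        nbformat.validate(nbformat.from_dict(nb))
    except ImportError:
        pass                                   # CI will still run it
    return len(nb["cells"])
```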

If you have a GPU and want to dry-run the first ~10 cells:

```bash
jupyter nbconvert --to notebook --execute notebooks/YOUR_NOTEBOOK.ipynb --ExecutePreprocessor.timeout=300
```

(Expect the heavy training cells to fail under 300s — that's fine; the goal is to catch import errors + dtype bugs early.)


## Output schemas other tools consume

If your notebook emits JSON that the website consumes, match the schema:

| Tool | Schema (TypeScript source) |
|---|---|
| Trace Theater | `web/lib/trace-data.ts` · `TraceScenario` |
| Circuit Canvas | `web/lib/circuit-data.ts` · `CircuitData` |
| InterpScore leaderboard | `web/lib/leaderboard.ts` · `LeaderboardEntry` |

## 🎯 After you run a notebook

Your SAE is an asset — put it to work: label it (04), trace it (05), steer with it (06), and submit it to the leaderboard (18).



Apache-2.0 · openinterp.org · 2026
