Reverse-engineered Transformer models as a benchmark for interpretability methods

AlejoAcelas/Interp-Benchmarks


Interpretability Benchmarks

This repository provides detailed interpretability analyses of Transformer models trained on algorithmic tasks. Each analysis maps model activations to a simplified causal graph that recovers over 90% of the original model's loss, making these models and explanations a useful benchmark for evaluating interpretability techniques.

Contents

The repository assumes basic familiarity with interpretability methods such as causal scrubbing and circuit-style analysis of Transformer models. Key features:

  • Explanations formatted as simplified causal models
  • Resampling tests that quantify explanation accuracy
  • Accuracy that matches or exceeds previous attempts at causal scrubbing analysis on algorithmic tasks
  • A ready-made benchmark for evaluating new interpretability techniques
  • Notebooks and scripts for training, evaluation, and analysis
  • A modular codebase for extending to new models and tasks
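To make the "recovers over 90% of the original model loss" claim concrete, a common way to score a causal-scrubbing explanation is the fraction of the loss gap it closes between a fully randomised baseline and the unmodified model. The sketch below is illustrative only; the function name and exact metric are assumptions, not this repository's API.

```python
def loss_recovered(clean_loss: float, random_loss: float, scrubbed_loss: float) -> float:
    """Fraction of model performance an explanation accounts for.

    clean_loss    -- loss of the unmodified model
    random_loss   -- loss with activations resampled fully at random (baseline)
    scrubbed_loss -- loss under resampling ablations that respect the
                     explanation's causal graph
    """
    return (random_loss - scrubbed_loss) / (random_loss - clean_loss)

# Example: a scrubbed loss of 0.20 against a clean loss of 0.10 and a
# random baseline of 2.10 recovers 95% of the loss gap.
print(loss_recovered(clean_loss=0.10, random_loss=2.10, scrubbed_loss=0.20))  # → 0.95
```

A score near 1.0 means the simplified causal model preserves almost all of the behaviour the explanation claims to capture; the analyses in this repository target scores above 0.9.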

By open-sourcing detailed analyses tied to measured model performance, this repository aims to advance interpretability research. Contributions are welcome!
