Reverse-engineered Transformer models as a benchmark for interpretability methods

AlejoAcelas/Interp-Benchmarks


Interpretability Benchmarks

This repository provides detailed interpretability analyses of Transformer models trained on algorithmic tasks. Each analysis maps model activations to a simplified causal graph that recovers over 90% of the original model's loss, making these models and explanations a useful benchmark for evaluating interpretability techniques.

Contents

The repository assumes basic familiarity with interpretability methods such as causal scrubbing and circuit-style analysis of Transformer models. Key features:

  • Explanations formatted as simplified causal models
  • Resampling tests that quantify explanation accuracy
  • Accuracy that matches or exceeds previous attempts at causal scrubbing analysis on algorithmic tasks
  • A ready-made benchmark for evaluating new interpretability techniques
  • Notebooks and scripts for training, evaluation, and analysis
  • A modular codebase for extending to new models and tasks
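To make the "recovers over 90% of the original model loss" claim concrete, a common way to score a causal-scrubbing explanation is the fraction of the loss gap it closes between a fully randomised baseline and the unmodified model. The sketch below is illustrative only; the function name and exact metric are assumptions, not this repository's API.

```python
def loss_recovered(clean_loss: float, random_loss: float, scrubbed_loss: float) -> float:
    """Fraction of model performance an explanation accounts for.

    clean_loss    -- loss of the unmodified model
    random_loss   -- loss with activations resampled fully at random (baseline)
    scrubbed_loss -- loss under resampling ablations that respect the
                     explanation's causal graph
    """
    return (random_loss - scrubbed_loss) / (random_loss - clean_loss)

# Example: a scrubbed loss of 0.20 against a clean loss of 0.10 and a
# random baseline of 2.10 recovers 95% of the loss gap.
print(loss_recovered(clean_loss=0.10, random_loss=2.10, scrubbed_loss=0.20))  # → 0.95
```

A score near 1.0 means the simplified causal model preserves almost all of the behaviour the explanation claims to capture; the analyses in this repository target scores above 0.9.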

By open-sourcing detailed analyses tied to measured model performance, this repository aims to advance interpretability research. Contributions are welcome!
