Mechanistic Interpretability Challenges

This repository contains the starting materials for the Capture the Flag Mechanistic Interpretability Challenges. For more information on the challenges and their purpose, see the announcement post on LessWrong.

Here's a short description of the files and folders in the repo:

demo.py: Loads the models for the three challenges and evaluates them on a sample dataset.
dataset_public.py: Defines dataset classes matching each model's task. These differ from the datasets used to train the models.
scoring.py: A slightly modified copy of the file used to score submissions.
/models: Contains the weights of the models used for each challenge.
/submission_example: An example submission for the three challenges, with a simple baseline for each.
model.py: Defines a custom function to instantiate a transformer model using TransformerLens.
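
After cloning, it can help to confirm that the entries listed above are present before running demo.py. The snippet below is a minimal sketch using only the Python standard library; the helper name `missing_entries` and the two constant lists are hypothetical and not part of the repository, and it assumes the files sit at the repository root as the list suggests.

```python
from pathlib import Path

# Entries named in the file list above (assumed to live at the repo root).
EXPECTED_FILES = ["demo.py", "dataset_public.py", "scoring.py", "model.py"]
EXPECTED_DIRS = ["models", "submission_example"]


def missing_entries(repo_root):
    """Return the expected files and folders that are absent under repo_root."""
    root = Path(repo_root)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    missing += [name for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing


if __name__ == "__main__":
    # Assumes the script is run from the root of the cloned repository.
    absent = missing_entries(".")
    print("All expected entries found." if not absent else f"Missing: {absent}")
```

An empty result means the checkout matches the layout described here; it says nothing about whether the model weights themselves load correctly.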

You can submit your challenge solutions through the CodaBench competition.

AlejoAcelas/Mech-Interp-Challenges

Folders and files

Latest commit

History

Repository files navigation

Mechanistic Interpretability Challenges

About

Topics

Resources

Stars

Watchers

Forks

Languages