Sparse Coding

This repo contains code for applying sparse coding to activation vectors in language models, including the code used for the results in the paper Sparse Autoencoders Find Highly Interpretable Features in Language Models. Work done with Logan Riggs and Aidan Ewart, advised by Lee Sharkey.

The repo is designed to train multiple sparse autoencoders simultaneously with different L1 coefficients, on either a single GPU or across multiple GPUs. big_sweep_experiments contains a number of example run functions.
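To give a rough sense of what each autoencoder in such a sweep is optimising, below is a minimal sketch of a tied-weight sparse autoencoder trained with an L1 sparsity penalty. All names here (TiedSAE, d_activation, dict_ratio, l1_coeff) are hypothetical and illustrative only; they are not this repo's actual API.

```python
# Minimal sketch of a sparse autoencoder with an L1 penalty (illustrative only;
# names and defaults here are hypothetical, not this repo's actual API).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedSAE(nn.Module):
    def __init__(self, d_activation: int, dict_ratio: int = 4):
        super().__init__()
        d_dict = d_activation * dict_ratio
        self.encoder = nn.Linear(d_activation, d_dict)
        self.decoder_bias = nn.Parameter(torch.zeros(d_activation))

    def forward(self, x):
        # Feature activations are kept non-negative to encourage a sparse code.
        feats = F.relu(self.encoder(x - self.decoder_bias))
        # Tied decoder: reconstruct using the transpose of the encoder weights.
        recon = feats @ self.encoder.weight + self.decoder_bias
        return recon, feats

def sae_loss(x, recon, feats, l1_coeff: float):
    # Reconstruction error plus an L1 penalty on the feature activations;
    # sweeping l1_coeff trades sparsity off against reconstruction quality.
    return F.mse_loss(recon, x) + l1_coeff * feats.abs().mean()
```

A sweep then amounts to training several such autoencoders in parallel on the same activations, one per L1 coefficient.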

Automatic Interpretation

interpret.py contains tools to interpret learned dictionaries using OpenAI's automatic interpretation protocol. Set --load_interpret_autoencoder to the location of the autoencoder you want to test, and --model_name, --layer and --layer_loc to specify which activations should be used. --activation_transform should be set to feature_dict for interpreting a learned dictionary, but there are also a number of baselines that can be run, including pca, ica, nmf, neuron_basis, and random.
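A hypothetical invocation using the flags described above might look like the following; the paths, model name, layer and layer_loc values are placeholders, and the exact command structure may differ from interpret.py's actual argument parsing.

```bash
# Illustrative only: all values below are placeholders.
python interpret.py \
    --load_interpret_autoencoder /path/to/learned_dict.pt \
    --model_name EleutherAI/pythia-70m-deduped \
    --layer 2 \
    --layer_loc residual \
    --activation_transform feature_dict
```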

If you run interpret.py read_results --kwargs.. and select --model_name, --layer and --layer_loc, this will produce a series of plots comparing the selected dictionaries in terms of their sparsity and the fraction of variance left unexplained.
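For example, a read_results run might look like the sketch below; again the model name and layer values are placeholders rather than defaults taken from this repo.

```bash
# Illustrative only: model name and layer values are placeholders.
python interpret.py read_results \
    --model_name EleutherAI/pythia-70m-deduped \
    --layer 2 \
    --layer_loc residual
```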

Training Your Own Sparse Autoencoders

If you'd like to train your own sparse autoencoders, we recommend using the sparse_autoencoder library, which is currently under development, should be easier to use, and will keep up with best practices as they develop.

About

Using sparse coding to find distributed representations used by neural networks.
