Sparse Coding

This repo contains code for applying sparse coding to activation vectors in language models, including the code used for the results in the paper Sparse Autoencoders Find Highly Interpretable Features in Language Models. Work done with Logan Riggs and Aidan Ewart, advised by Lee Sharkey.

The repo is designed to train multiple sparse autoencoders simultaneously with different L1 coefficients, on either a single GPU or across multiple GPUs. big_sweep_experiments contains a number of example run functions.
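To give a rough sense of what each autoencoder in such a sweep is optimising, below is a minimal sketch of a tied-weight sparse autoencoder trained with an L1 sparsity penalty. All names here (TiedSAE, d_activation, dict_ratio, l1_coeff) are hypothetical and illustrative only; they are not this repo's actual API.

```python
# Minimal sketch of a sparse autoencoder with an L1 penalty (illustrative only;
# names and defaults here are hypothetical, not this repo's actual API).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedSAE(nn.Module):
    def __init__(self, d_activation: int, dict_ratio: int = 4):
        super().__init__()
        d_dict = d_activation * dict_ratio
        self.encoder = nn.Linear(d_activation, d_dict)
        self.decoder_bias = nn.Parameter(torch.zeros(d_activation))

    def forward(self, x):
        # Feature activations are kept non-negative to encourage a sparse code.
        feats = F.relu(self.encoder(x - self.decoder_bias))
        # Tied decoder: reconstruct using the transpose of the encoder weights.
        recon = feats @ self.encoder.weight + self.decoder_bias
        return recon, feats

def sae_loss(x, recon, feats, l1_coeff: float):
    # Reconstruction error plus an L1 penalty on the feature activations;
    # sweeping l1_coeff trades sparsity off against reconstruction quality.
    return F.mse_loss(recon, x) + l1_coeff * feats.abs().mean()
```

A sweep then amounts to training several such autoencoders in parallel on the same activations, one per L1 coefficient.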

Automatic Interpretation

interpret.py contains tools to interpret learned dictionaries using OpenAI's automatic interpretation protocol. Set --load_interpret_autoencoder to the location of the autoencoder you want to test, and --model_name, --layer and --layer_loc to specify which activations should be used. --activation_transform should be set to feature_dict for interpreting a learned dictionary, but there are also a number of baselines that can be run, including pca, ica, nmf, neuron_basis, and random.
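A hypothetical invocation using the flags described above might look like the following; the paths, model name, layer and layer_loc values are placeholders, and the exact command structure may differ from interpret.py's actual argument parsing.

```bash
# Illustrative only: all values below are placeholders.
python interpret.py \
    --load_interpret_autoencoder /path/to/learned_dict.pt \
    --model_name EleutherAI/pythia-70m-deduped \
    --layer 2 \
    --layer_loc residual \
    --activation_transform feature_dict
```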

If you run interpret.py read_results --kwargs.. and select --model_name, --layer and --layer_loc, this will produce a series of plots comparing the selected dictionaries in terms of their sparsity and the fraction of variance left unexplained.
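For example, a read_results run might look like the sketch below; again the model name and layer values are placeholders rather than defaults taken from this repo.

```bash
# Illustrative only: model name and layer values are placeholders.
python interpret.py read_results \
    --model_name EleutherAI/pythia-70m-deduped \
    --layer 2 \
    --layer_loc residual
```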

Training Your Own Sparse Autoencoders

If you'd like to train your own sparse autoencoders, we recommend using the sparse_autoencoder library, which is currently under development, should be easier to use, and will keep up with best practices as they develop.

About

Using sparse coding to find distributed representations used by neural networks.
