A short exploration and paper on an approach to AI safety in which we attempt to retrain layers and/or train autoencoders on clean data to avoid trojans.

4gatepylon/IfYouDontUnderstandItDontUseIt

IfYouDontUnderstandItDontUseIt

A short exploration and paper on an approach to AI safety in which we attempt to train autoencoder-like filters on clean data to avoid trojans. The broad goal is a soft form of enumerative safety: in effect, saying "only things in this dataset are OK". The proposed approach is to replace layers with sparse autoencoders (SAEs), low-rank replacements, or plain replacement layers trained on that clean data.
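
To make the "filter trained on clean data" idea concrete, here is a minimal sketch of the low-rank variant. It is not the code in src/. The function names (`fit_clean_subspace`, `filter_activation`), the use of power iteration, and the toy 4-dimensional activations are all assumptions made for illustration. It is written in pure Python so it runs without dependencies.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def fit_clean_subspace(clean, rank, iters=50, seed=0):
    """Hypothetical helper: estimate the top-`rank` directions of the clean
    activations (uncentered second moment) by power iteration with deflation."""
    rng = random.Random(seed)
    dim = len(clean[0])
    n = len(clean)
    moment = [[sum(x[i] * x[j] for x in clean) / n for j in range(dim)]
              for i in range(dim)]
    basis = []
    for _ in range(rank):
        v = normalize([rng.gauss(0, 1) for _ in range(dim)])
        for _ in range(iters):
            v = [dot(row, v) for row in moment]
            # Remove components along directions that were already found.
            for b in basis:
                c = dot(v, b)
                v = [vi - c * bi for vi, bi in zip(v, b)]
            v = normalize(v)
        basis.append(v)
    return basis


def filter_activation(x, basis):
    """Hypothetical helper: project an activation onto the clean subspace,
    discarding anything the clean data never exercised."""
    out = [0.0] * len(x)
    for b in basis:
        c = dot(x, b)
        out = [o + c * bi for o, bi in zip(out, b)]
    return out


# Toy data (assumption): clean activations only ever use the first two
# coordinates; a "trojan" input also lights up the fourth coordinate.
rng = random.Random(1)
clean = [[rng.gauss(0, 1), rng.gauss(0, 1), 0.0, 0.0] for _ in range(50)]
basis = fit_clean_subspace(clean, rank=2)

triggered = [0.3, -1.2, 0.0, 5.0]
print(filter_activation(triggered, basis))
```

In this toy setup the filter passes through the part of the activation that clean data covers and drops the rest, which is the sense of "only things in this dataset are OK". An SAE or a retrained layer would play the same role with a learned, nonlinear map instead of a fixed projection.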

The main work is in src/, which also contains the instructions for running it.

Older and supporting material:

  • notes/ holds observations I found noteworthy. Some include numbers, but most are qualitative.
  • Earlier iterations of this work live in other branches. The main one is adriano/scratch, which contains unmaintained scripts that were later refactored into the current code.
  • experiment_results_2024_07_08 consolidates the experiment result files (YAML), which can be used plug-and-play to reproduce the results here.
