This is the repository for our project "Encouraging Chain-of-Thought Reasoning in Language Models Through Feature Steering". The project was conducted by Soumyadeep Bose, Kutay Buyruk, Shreyan Jain and me (Thomas Walker), as part of the Apart Research "Reprogramming AI Models" sprint sponsored by Goodfire AI.
The report for our project can be found here.
We identify features that are active during chain-of-thought reasoning, with the intention of then using them to steer the model to conduct chain-of-thought reasoning. Although most large language models answer questions using chain-of-thought reasoning by default (since they are fine-tuned on vast amounts of such data), there are several motivations for why identifying these features is desirable.
- Chain-of-thought reasoning is not well understood from an interpretability perspective. To our knowledge, only works (1) and (2) apply interpretability methods to chain-of-thought reasoning. Since chain-of-thought reasoning can improve model performance, it seems natural to want to understand how it works.
- Typically, chain-of-thought reasoning is elicited from models with heavy prompt engineering. Consequently, the model's responses are heavily influenced by the user, and one cannot get a true sense of what the model is capable of. Steering techniques therefore let us better understand the reasoning capabilities of the model itself. Similarly, feature steering provides a principled and systematic method for inducing chain-of-thought reasoning, which heavy prompt engineering lacks.
- Our intention is that by steering the model toward chain-of-thought reasoning, we obtain more effective chain-of-thought reasoning. That is, the reasoning becomes more coherent, succinct and accurate, allowing us to better interpret what the model is thinking.
- Moreover, since chain-of-thought reasoning can significantly influence model capabilities, the ability to directly control this behavior through feature steering gives us greater control over the capabilities of these large language models. This is important for a model deployer, who cannot always guarantee their models will be deployed in low-risk environments.
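As a concrete illustration of the feature-identification and steering workflow described above, here is a minimal sketch assuming the Goodfire SDK's feature search and variant steering interface. The model identifier, search query, steering value, and response access pattern are illustrative assumptions and may differ from what we actually used.

```python
import goodfire

# Assumption: a Goodfire API key and a steerable hosted model variant.
client = goodfire.Client(api_key="YOUR_API_KEY")
variant = goodfire.Variant("meta-llama/Meta-Llama-3.1-8B-Instruct")

# Search the SAE feature space for features plausibly related to
# chain-of-thought reasoning.
features = client.features.search(
    "step-by-step chain-of-thought reasoning",
    model=variant,
    top_k=5,
)

# Up-weight a candidate feature to steer the model toward
# chain-of-thought style answers (the steering value is illustrative).
variant.set(features[0], 0.4)

# Query the steered variant and inspect the response.
response = client.chat.completions.create(
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
    model=variant,
)
print(response.choices[0].message["content"])  # access pattern may vary by SDK version
```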
References
(1) Wei, J. et al. (2023) ‘Chain-of-Thought Prompting Elicits Reasoning in Large Language Models’. arXiv. Available at: https://doi.org/10.48550/arXiv.2201.11903.
(2) Wu, S. et al. (2023) ‘Analyzing Chain-of-Thought Prompting in Large Language Models via Gradient-based Feature Attributions’. arXiv. Available at: https://doi.org/10.48550/arXiv.2307.13339.