In [None]:
#| echo: false
#| output: false
import html
import pandas as pd
from datasets import load_dataset
import matplotlib.pyplot as plt
from IPython.display import display, Markdown

%config InlineBackend.figure_format='retina'
plt.rcParams['font.family'] = 'Arial'

## Introduction

<sub>Note: This is a preliminary report. We will be doing further work on these problems.</sub>

Feature visualization has been a powerful technique for vision model interpretability [@olah2017feature; @mordvintsev-2015].
Feature visualization for vision models (also known as "dreaming") is the process of optimizing an input image to maximize the activation of a particular neuron, channel, layer or output classification.
Feature visualization has not been successfully applied to language models because of the discrete nature of text.
Attempts to run continuous optimization algorithms on relaxations (footnote: e.g. the Gumbel-Softmax, [@jang2017categorical]) of the discrete space have not been successful [@bauerle-2018]. 

In contrast, the automatic prompt engineering and adversarial attack literatures have both made steady progress towards optimizing prompts for language models [@zou2023universal; @wen2023hard; @kumar2022gradientbased; @shi2022human; @shin2020autoprompt; @ebrahimi2018hotflip].
In particular, the Greedy Coordinate Gradient (GCG) discrete search method of @zou2023universal is successful (NUMBERSSS!!) at optimizing in token space to trigger objectionable content.
We modify and extend their algorithm to perform feature visualization in language models.

Specifically, we:

1. Optimize a two-term objective function consisting of a feature term (e.g. the activation of a particular neuron) and a fluency term based on the model's own cross-entropy loss for the generated text.
2. Extend GCG using concepts from evolutionary algorithms. We use a population of candidate prompts, each optimizing for a different point on the Pareto frontier between cross entropy and feature activation. 

## Evolutionary prompt optimization (EPO)

Past work on language model feature visualization has struggled with two main problems [@bauerle-2018]:

1. Optimization is done in the model's continuous embedding space. Projecting this optimized embedding to the nearest token can produce low feature activation values.
2. The optimized prompt is often gibberish because the optimization process is not constrained to produce reasonable text.

We solve both problems:

1. We optimize in the discrete token space using an evolutionary algorithm while using token gradients to constrain the search space.
2. We use a language model to evaluate the fluency of the prompt by computing the self-cross-entropy of the prompt [@shi2022human]. This allows us to bias the search process towards reasonable text.

Given a feature $f(p)$ and a language model $m(p)$, we want to find a prompt $p^*$ that maximizes the feature while minimizing the self-cross-entropy of the prompt:

$$
\begin{aligned}
L_{\lambda}(p) &= f(p) - \frac{\lambda}{n} \sum_{i=0}^{n-1} x\big(m(p_{\leq i}), p_{i+1}\big) \\
p^*_{\lambda} &= \underset{p}{\mathrm{argmax}} ~ L_{\lambda}(p) \\
\end{aligned}
$$

where $\lambda$ is a weighting parameter determining the tradeoff between the fluency regularizer and feature maximization, $n$ is the number of tokens in the prompt, $x$ is the cross-entropy between the model's predicted next-token distribution and the actual next token, $p_{\leq i}$ is the prefix of $p$ up to token $i$, and $p_{i+1}$ is the next token in $p$.
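As a rough illustration of the objective above, the following sketch evaluates $L_\lambda(p)$ from a precomputed feature value and a matrix of next-token logits. The function names `self_cross_entropy` and `epo_objective`, and the convention that row $i$ of `logits` holds the prediction for token $i+1$, are assumptions made for this example rather than part of our implementation.

```python
import numpy as np

def self_cross_entropy(logits: np.ndarray, tokens: np.ndarray) -> float:
    """Mean next-token cross-entropy of a prompt.

    Assumed layout: logits[i] holds the unnormalized scores for token i+1
    given the prefix p_{<=i}, so logits has shape (n, vocab_size) and
    tokens has shape (n,).
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of each actual token, averaged over positions.
    nll = -log_probs[np.arange(len(tokens)), tokens]
    return float(nll.mean())

def epo_objective(feature_value: float, logits: np.ndarray,
                  tokens: np.ndarray, lam: float) -> float:
    """L_lambda(p) = f(p) - lambda * (mean self-cross-entropy of p)."""
    return feature_value - lam * self_cross_entropy(logits, tokens)
```

With uniform logits over a vocabulary of size $V$, the cross-entropy term reduces to $\log V$, which makes a convenient sanity check.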

Note that the feature under investigation may be a component of a different model from the model used for evaluating cross entropy.
It is, however, necessary for the feature model and the language model to share the same tokenization so that gradients can be propagated to the token space.
We find that using larger models for evaluating cross-entropy produces more coherent feature visualizations (footnote: unfortunately, using a large model for evaluating fluency is expensive).

Empirically, $\lambda$ is difficult to select, with different values being optimal for different features. Our algorithm explores the Pareto frontier across a range of $\lambda$ values, allowing the user to select the desired tradeoff between fluency and feature maximization after the fact. We optimize a family of objectives $L_{\lambda_i}$ parameterized by $M$ values of the fluency regularization strength, $\lambda_1, \dots, \lambda_M$.

To optimize this family of objectives, we use an evolutionary algorithm:

1. We randomly initialize a population of $M$ prompts, each of length $n$ (footnote: we don't consider allowing the optimization algorithm to change the length of the prompt here).
2. We compute the feature activation and cross-entropy of each prompt. Then, following the approaches of [@ebrahimi2018hotflip; @shin2020autoprompt; @zou2023universal]:
    a. We backpropagate to a one-hot encoding of the prompt to get a gradient for each token in the vocabulary in each token position.
    b. We select the top-$k$ tokens in each token position according to the magnitude of the gradient.
    c. For each member of the population, we generate $r$ children, each by replacing a single token with one of the top-$k$ tokens in that position (footnote: both the token position and the token are chosen uniformly at random).
3. We now have a population of $Mr$ prompts. We select the top-$M$ prompts *without replacement* according to a random ordering of the family of objectives $L_{\lambda_i}$.

We repeat steps 2 and 3 for $T$ iterations.
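The steps above can be sketched on a toy problem. This is not our implementation: the feature is a made-up linear function of the one-hot prompt (`W`), cross-entropy is replaced by a made-up per-token `cost`, and no language model is involved. Two further assumptions are labeled in the comments: population member $j$ is paired with $\lambda_j$ when computing gradients, and candidate tokens are ranked by the signed gradient of the objective.

```python
import numpy as np

def epo_toy(W, cost, lambdas, r=4, k=3, T=50, seed=0):
    """Toy EPO loop. W[i, v] is a hypothetical feature contribution of token v
    at position i; cost[v] is a hypothetical stand-in for cross-entropy."""
    rng = np.random.default_rng(seed)
    lambdas = np.asarray(lambdas, dtype=float)
    n, V = W.shape
    M = len(lambdas)

    def feature(p):
        return W[np.arange(n), p].sum()

    def xent(p):
        return cost[p].mean()

    # Step 1: random initial population of M prompts of length n.
    pop = rng.integers(0, V, size=(M, n))
    for _ in range(T):
        children = []
        for j, p in enumerate(pop):
            # Step 2a: gradient of L_lambda w.r.t. the one-hot encoding.
            # The toy objective is linear in the one-hot, so this is exact.
            # Assumption: member j uses lambda_j for its gradient.
            grad = W - (lambdas[j] / n) * cost[None, :]
            # Step 2b: top-k tokens per position (assumption: ranked by the
            # signed gradient, i.e. the largest predicted increase).
            topk = np.argsort(-grad, axis=1)[:, :k]
            # Step 2c: r children, each with one random position replaced
            # by a random top-k token for that position.
            for _ in range(r):
                child = p.copy()
                pos = rng.integers(n)
                child[pos] = topk[pos, rng.integers(k)]
                children.append(child)
        children = np.array(children)
        f = np.array([feature(c) for c in children])
        x = np.array([xent(c) for c in children])
        # Step 3: visit the objectives in random order; each one claims its
        # best child that has not already been claimed.
        taken = np.zeros(len(children), dtype=bool)
        new_pop = np.empty_like(pop)
        for j in rng.permutation(M):
            score = np.where(taken, -np.inf, f - lambdas[j] * x)
            best = int(np.argmax(score))
            taken[best] = True
            new_pop[j] = children[best]
        pop = new_pop
    return pop
```

In this sketch only the children survive each iteration, matching the description of selecting from the $Mr$ new prompts, so an individual objective value is not guaranteed to improve monotonically.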

Compared to existing algorithms, the addition of a population of multiple candidate prompts and the use of a family of fluency regularization strengths allows us to explore the full Pareto frontier of fluency and feature maximization. The presence of multiple regularization strengths also helps to prevent the algorithm from getting stuck in local optima.
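To make the notion of a Pareto frontier concrete, here is a small sketch that filters a set of prompts down to those not dominated in (cross-entropy, activation). The helper name `pareto_front` and the example numbers are hypothetical and are not taken from our results.

```python
import numpy as np

def pareto_front(xent, activation):
    """Return indices of prompts that are not dominated by any other prompt.

    Lower cross-entropy and higher activation are both better. Indices are
    returned in order of increasing cross-entropy.
    """
    xent = np.asarray(xent, dtype=float)
    activation = np.asarray(activation, dtype=float)
    # Sort by cross-entropy ascending, breaking ties by activation descending.
    order = np.lexsort((-activation, xent))
    front, best_act = [], -np.inf
    for idx in order:
        # Keep a prompt only if it beats every lower-cross-entropy prompt
        # on activation.
        if activation[idx] > best_act:
            front.append(int(idx))
            best_act = activation[idx]
    return front

# Hypothetical values for four prompts.
front = pareto_front([1.0, 2.0, 3.0, 2.5], [1.0, 3.0, 2.0, 4.0])
```

Here the third prompt is dropped because the fourth has both lower cross-entropy and higher activation.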

## Demonstrations using Pythia-12B

We apply EPO to a few of the individual neurons of Pythia-12B [@biderman2023pythia]. For simplicity, we also use Pythia-12B for evaluating fluency.

We set hyperparameters:

- $T=200$
- $M=32$
- $r=4$
- $k=512$
- The $M=32$ values of $\log(\lambda_i)$ are uniformly gridded between $\log(1/16)$ and $\log(16)$.
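A grid that is uniform in $\log(\lambda)$ is a geometric grid in $\lambda$. One way to construct it is shown below, assuming both endpoints are included in the grid (the variable names are illustrative).

```python
import numpy as np

M = 32
# Uniform spacing in log(lambda) is geometric spacing in lambda.
# Assumption: both endpoints, 1/16 and 16, are part of the grid.
lambdas = np.geomspace(1 / 16, 16, num=M)
```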

### L10.N5

SECTION IS NOT WRITTEN YET.

Layer 10, neuron 5 (L10.N5) reacts strongly to a variety of phrases related to examples (SEE TABLE). Note that these results were generated without regard to the training data.

- The neuron clearly reacts to common example-related words: "example", "for example", "i.e.", "e.g.", "say", "ex."
- “For exam\n ple” shows that the neuron is robust to a mid-word line break! The neuron also responds to "exmaple" (a misspelling of example).
- There are also some fun adversarial examples in there: “for.g” probably attacks the “e.g.” and "for example" circuitry. Other examples: “ie for example”, “so for eged example”, “for eg”.

While many of these insights into L10.N5 would have been possible with dataset examples, optimization provides us with adversarial prompts that would be almost impossible to find in the training data. These prompts can inform mechanistic interpretation of the feature. For instance (footnote: pun intended?), "for.g." is plausibly an adversarial attack on circuitry that responds to "for example" and "e.g.". We see several other similar examples: "ie for example", "so for eged example", "for eg".

As we discuss in the next section, a weakness of our analysis here is that we are ignoring intermediate-magnitude activation values. There are many prompts that cause L10.N5 to produce activations up to ~5. Only at activation values above 5 does the neuron respond primarily to example-related phrases.

## Limitations

### Polysemanticity and prompt diversity

Language model neurons are plagued by polysemanticity [@elhage2022superposition; @gurnee2023finding].
That is, a neuron may activate for multiple distinct and often anti-correlated prompts.
As a result, using optimization for feature visualization can be misleading when there are important prompts that only mildly activate the feature, or when, for whatever reason, certain distinct prompts are harder to find using discrete search.
Optimizing for diversity has been successful in feature visualization for vision models [@olah2017feature] and might be useful for identifying less-than-maximally activating prompts for language model features.
There is a mild tendency towards diversity in our current algorithm because the search gets stuck in different local optima for different random seeds.
We distinguish between three different forms of diversity: syntactic, semantic and mechanistic.
For example, the results for L10.N5 are syntactically diverse but semantically similar.
We would be particularly excited about extensions of our algorithm that encourage mechanistic diversity.
That is, prompts that activate a feature via different internal model circuitry.
Such a method might be useful for disentangling superposition.
It may be the case that methods based on measuring composition between neurons and attention heads [@elhage2021mathematical] could be used to measure mechanistic diversity.

### Computational performance

EPO is fairly slow because it requires a forward pass of the model for each candidate prompt. With the hyperparameters we have used here, repeating the prompt optimization for a single feature 20 times requires 512,000 forward passes ($T \times M \times r \times 20$) and takes ~40 minutes on a single NVIDIA RTX A6000 GPU. We believe that more careful tuning of hyperparameters might reduce the cost by a factor of 10. However, discrete optimization is fundamentally quite expensive.
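The forward-pass count follows directly from the hyperparameters. The short calculation below reproduces it, assuming one forward pass per child per iteration and ignoring backward passes and batching; the implied per-pass time is a rough average, not a measured quantity.

```python
# Hyperparameters from the demonstration section.
T, M, r = 200, 32, 4
repeats = 20

# Assumption: one forward pass per child prompt per iteration.
passes_per_run = T * M * r                 # 25,600
total_passes = passes_per_run * repeats    # 512,000

# Implied average wall-clock time per forward pass for a ~40 minute run.
seconds_per_pass = 40 * 60 / total_passes  # roughly 4.7 ms
```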