# Interactive Neuroscope

*This is an interactive accompaniment to [neuroscope.io](https://neuroscope.io) and to the [studying learned language features post](https://www.alignmentforum.org/posts/Qup9gorqpd9qKAEav/200-cop-in-mi-studying-learned-features-in-language-models) in [200 Concrete Open Problems in Mechanistic Interpretability](https://neelnanda.io/concrete-open-problems)*

There's a surprisingly rich ecosystem of easy ways to create interactive graphics, especially for ML systems. If you're trying to do mechanistic interpretability, the ability to do web dev and to both visualize data and interact with it seems high value! 

This is a demo of how you can combine HookedTransformer and [Gradio](https://gradio.app/) to create an interactive Neuroscope - a visualization of a neuron's activations on text that will dynamically update as you edit the text. I don't particularly claim that this code is any *good*, but the goal is to illustrate what quickly hacking together a custom visualisation (while knowing fuck all about web dev, like me) can look like! (And as such, I try to explain the basic web dev concepts I use)

Note that you'll need to run the code yourself to get the interactive interface, so the cell at the bottom will be blank at first!

To emphasise - the point of this notebook is to be a rough proof of concept that just about works, *not* to be the well executed ideal of interactively studying neurons! You are highly encouraged to write your own (and ideally, to [make a pull request](https://github.com/neelnanda-io/TransformerLens/pulls) with improvements!)

## Setup

In [1]:
import os

try:
    import google.colab

    IN_COLAB = True
    print("Running as a Colab notebook")

except:
    IN_COLAB = False
    print("Running as a Jupyter notebook - intended for development only!")
    from IPython import get_ipython

    ipython = get_ipython()
    print(ipython)
    # Code to automatically update the HookedTransformer code as its edited without restarting the kernel
    ipython.magic("load_ext autoreload")
    ipython.magic("autoreload 2")

Running as a Colab notebook


In [2]:
import os

if IN_COLAB:
    os.system("pip install git+https://github.com/neelnanda-io/TransformerLens.git")
    os.system("pip install gradio")

In [3]:
import gradio as gr
from transformer_lens import HookedTransformer
from transformer_lens.utils import to_numpy
from IPython.display import HTML

## Extracting Model Activations

We first write some code using HookedTransformer's cache to extract the neuron activations on a given layer and neuron, for a given text

In [4]:
model_name = "distilgpt2" # NYTK/PULI-GPT-2
model = HookedTransformer.from_pretrained(model_name)

Using pad_token, but it is not set yet.


Loaded pretrained model distilgpt2 into HookedTransformer


In [5]:
model

HookedTransformer(
  (embed): Embed()
  (hook_embed): HookPoint()
  (pos_embed): PosEmbed()
  (hook_pos_embed): HookPoint()
  (blocks): ModuleList(
    (0): TransformerBlock(
      (ln1): LayerNormPre(
        (hook_scale): HookPoint()
        (hook_normalized): HookPoint()
      )
      (ln2): LayerNormPre(
        (hook_scale): HookPoint()
        (hook_normalized): HookPoint()
      )
      (attn): Attention(
        (hook_k): HookPoint()
        (hook_q): HookPoint()
        (hook_v): HookPoint()
        (hook_z): HookPoint()
        (hook_attn_scores): HookPoint()
        (hook_pattern): HookPoint()
        (hook_result): HookPoint()
      )
      (mlp): MLP(
        (hook_pre): HookPoint()
        (hook_post): HookPoint()
      )
      (hook_attn_out): HookPoint()
      (hook_mlp_out): HookPoint()
      (hook_resid_pre): HookPoint()
      (hook_resid_mid): HookPoint()
      (hook_resid_post): HookPoint()
    )
    (1): TransformerBlock(
      (ln1): LayerNormPre(
        (hook_sc

In [6]:
# # to be removed later 
# from transformers import GPT2Tokenizer, GPT2Model
# tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# model2 = GPT2Model.from_pretrained('gpt2')
# text = "Replace me by any text you'd like."
# encoded_input = tokenizer(text, return_tensors='pt')
# output = model2(**encoded_input)

In [7]:
from transformers.generation.utils import MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING
MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING #.__dict__

_LazyAutoMapping()

In [8]:
def get_neuron_acts(text, layer, neuron_index):
    # Hacky way to get out state from a single hook - we have a single element list and edit that list within the hook.
    cache = {}

    def caching_hook(act, hook):
        cache["activation"] = act[0, :, neuron_index]

    model.run_with_hooks(
        text, fwd_hooks=[(f"blocks.{layer}.mlp.hook_post", caching_hook)]
    )
    return to_numpy(cache["activation"])

We can run this function and verify that it gives vaguely sensible outputs

In [9]:
# default_layer = 0
# default_neuron_index = 3071
# default_text = "Bill Moyers criticized the corporate media for parroting the"
# # default_text = "Bill Moyers kritizálta a vállalati médiát mert utánozták a"
# print(model.to_str_tokens(default_text))
# print(get_neuron_acts(default_text, default_layer, default_neuron_index))

In [10]:
default_layer = 5
default_neuron_index = 220
default_text = 'the Brothers (and Salafis) argue that while it is not mandatory, it is nevertheless mukarama (preferable, pleasing in the eyes of God).[57] One hadith from the Sunan Abu Dawood collection states: "A woman used to perform circumcision in Medina"'
# default_text = "Bill Moyers kritizálta a vállalati médiát mert utánozták a"
print(model.to_str_tokens(default_text))
print(get_neuron_acts(default_text, default_layer, default_neuron_index))

['<|endoftext|>', 'the', ' Brothers', ' (', 'and', ' Sal', 'af', 'is', ')', ' argue', ' that', ' while', ' it', ' is', ' not', ' mandatory', ',', ' it', ' is', ' nevertheless', ' m', 'uk', 'ar', 'ama', ' (', 'pre', 'fer', 'able', ',', ' pleasing', ' in', ' the', ' eyes', ' of', ' God', ').[', '57', ']', ' One', ' had', 'ith', ' from', ' the', ' Sun', 'an', ' Abu', ' Daw', 'ood', ' collection', ' states', ':', ' "', 'A', ' woman', ' used', ' to', ' perform', ' circumcision', ' in', ' Medina', '"']
[-0.09019658 -0.16825671 -0.16022548 -0.10385371 -0.16405521 -0.12235729
  0.12736283 -0.16036858 -0.15163518 -0.15545735 -0.16369638 -0.16364644
 -0.16536835 -0.15426308 -0.15184145 -0.01847607 -0.07175732 -0.13938677
 -0.12283999 -0.14176096 -0.16275087 -0.12137292 -0.15182668 -0.16854402
 -0.15053138 -0.16540338 -0.08222538 -0.13916709 -0.1110108  -0.0836805
 -0.06269553 -0.04969161 -0.04285033 -0.06914173 -0.04093773 -0.07153136
 -0.07366642 -0.11953531 -0.13063058 -0.14887872 -0.14031702 

https://neuroscope.io/gpt2-small/5/220.html

Model: disttiled-gpt-2:

 6  Layers, ? Neurons per Layer
Dataset: Open Web Text
Neuron 220 in Layer 5
Max Range: 4.8084. Min Range: -4.8084
Max Act: 4.7290. Min Act: -0.1700
Data Index: 4108216 (Open Web Text)
Max Activating Token Index: 225

# Trancuted 
 in prison in Kiribati as of September of that year.[261] No information was located on whether any children resided in prison with the women. 
 
Back to Top 
 
Kuwait 
 
Article 34 of Law No. 26 of 1962 states that a newborn in Kuwait can remain with his/

# Full Text
Full Text #1

<|endoftext|> indicated that Kenya has an average of three hundred children aged zero to fifty-nine months living with their mothers in the thirty-five women��s prisons around the country.[257] A 2013 US State Department report indicated that 117 of the 4,314 prisoners nationwide in 2012 were women.[258] 
 
Back to Top 
 
Kiribati 
 
In Kiribati, the Prisons Ordinance provides that an infant child of a female prisoner may be received into prison with its mother and ��may be supplied with clothing and necessaries at the public expense.��[259] When the child has been weaned, the officer in charge must send the child to relatives or friends, provided there are such relatives or friends capable and willing to support the child.[260] According to a 2013 US Department of State report, there were four female detainees in prison in Kiribati as of September of that year.[261] No information was located on whether any children resided in prison with the women. 
 
Back to Top 
 
Kuwait 
 
Article 34 of Law No. 26 of 1962 states that a newborn in Kuwait can remain with his/her imprisoned mother until the child reaches the age of two. If the mother is not willing to have the child stay with her or when the child has reached two years of age, the child must live with his/her father or any relative selected by the mother. If the child does not have a father or any other relatives, the prison authorities place the child in an outside orphanage. The imprisoned mother will be notified of the location of the orphanage so that she can visit the child in accordance with regulations.[262] No information was located on the number of children residing in prison with their mothers. 
 
Back to Top 
 
Libya 
 
Pursuant to the Law on Reform and Rehabilitation Institutions, a pregnant woman inmate shall be treated during the pregnancy and until forty days after delivery in accordance with what the physician in charge decides.[263] The same treatment may be accorded to the breast-feeding inmate if so decided by the physician. The woman inmate is allowed to keep her child with her until he is two years old.[264] Information on the number of children living with their mothers in prison could not be located. 
 
Back to Top 
 
Luxembourg 
 
Children who are too young be separated from their mother are allowed to stay with their mother in prison.[265] 
 
Back to Top 
 
Malawi 
 
Malawian law provides that a breastfeeding child of a female prisoner may be permitted to live with the mother until the child has been weaned. During the child��s stay with the mother, the child may be provided with ��clothing and necessaries at the public expense.��[266] Once the child has been weaned, the Prison Service is required to place the child with a relative or family friend able and willing to support the child and, in the absence of such a person, with a government-approved child care provider.[267] 
 
A 2013 US Department of State report indicated that there were a total of 12,505 inmates in the country��s prisons, 107 of whom were women.[268] No statistical information regarding the number of children currently living in prison with their mothers was located. 
 
Back to Top 
 
Malaysia 
 
Under the Malaysian Prison Act 1995, regulations may be issued for various matters, including ��the treatment and wellbeing of a child born to a prisoner while in custody and a child of a female prisoner admitted with his mother.��[269] The Prisons Regulations 2000 provide that a child under the age of three years may be admitted with his or her mother.[270] Such a child ��must be provided with basic necessities for the child��s maintenance and care by the Director General.��[271] Furthermore, the Medical Officer must, where possible, ��see every child accompanying [a] female prisoner as often as necessary.��[272] The regulations also specify the daily diet for each child.[273] 
 
When a child reaches the age of three years, a Medical Officer must report on whether the child should be retained in the prison for a longer period.[274] However, except by special authority of the Director General, no child may be kept in prison after he or she reaches the age of four years.[275] Special instructions from the Director General must be sought if a child reaches the age of three or four years and there are no known relations willing or in a position to receive the child.[276] No information on the number of children residing in prisons with their mothers could be located. 
 
Back to Top 
 
Mali 
 
Malian law appears to allow mothers to keep their young children with them in prison. According to Association Asmae Soeur Emmanuelle, a nongovernmental organization that focuses on child poverty, sixty-nine babies or young children lived with their

## Visualizing Model Activations

We now write some code to visualize the neuron activations on some text - we're going to hack something together which just does some string processing to make an HTML string, with each token element colored according to the intensity neuron activation. We normalize the neuron activations so they all lie in [0, 1]. You can do much better, but this is a useful proof of concept of what "just hack stuff together" can look like!

I'll be keeping neuron 562 in layer 9 as a running example, as it seems to activate strongly on powers of 10.

Note that this visualization is very sensitive to `max_val` and `min_val`! You can tune those to whatever seems reasonable for the distribution of neuron activations you care about - I generally default to `min_val=0` and `max_val` as the max activation across the dataset.

In [11]:
# This is some CSS (tells us what style )to give each token a thin gray border, to make it easy to see token separation
style_string = """<style> 
    span.token {
        border: 1px solid rgb(123, 123, 123)
        } 
    </style>"""


def calculate_color(val, max_val, min_val):
    # Hacky code that takes in a value val in range [min_val, max_val], normalizes it to [0, 1] and returns a color which interpolates between slightly off-white and red (0 = white, 1 = red)
    # We return a string of the form "rgb(240, 240, 240)" which is a color CSS knows
    normalized_val = (val - min_val) / max_val
    return f"rgb(240, {240*(1-normalized_val)}, {240*(1-normalized_val)})"


def basic_neuron_vis(text, layer, neuron_index, max_val=None, min_val=None):
    """
    text: The text to visualize
    layer: The layer index
    neuron_index: The neuron index
    max_val: The top end of our activation range, defaults to the maximum activation
    min_val: The top end of our activation range, defaults to the minimum activation

    Returns a string of HTML that displays the text with each token colored according to its activation

    Note: It's useful to be able to input a fixed max_val and min_val, because otherwise the colors will change as you edit the text, which is annoying.
    """
    if layer is None:
        return "Please select a Layer"
    if neuron_index is None:
        return "Please select a Neuron"
    acts = get_neuron_acts(text, layer, neuron_index)
    print('acts ', acts , '\n')
    act_max = acts.max()
    act_min = acts.min()
    print('act_max :', act_max,'\n act_min : ', act_min )
    # Defaults to the max and min of the activations
    if max_val is None:
        max_val = act_max
    if min_val is None:
        min_val = act_min
    # We want to make a list of HTML strings to concatenate into our final HTML string
    # We first add the style to make each token element have a nice border
    htmls = [style_string]
    # We then add some text to tell us what layer and neuron we're looking at 
    # - we're just dealing with strings and can use f-strings as normal
    # h4 means "small heading"
    htmls.append(f"<h4>Layer: <b>{layer}</b>. Neuron Index: <b>{neuron_index}</b></h4>")
    # We then add a line telling us the limits of our range
    htmls.append(
        f"<h4>Max Range: <b>{max_val:.4f}</b>. Min Range: <b>{min_val:.4f}</b></h4>"
    )
    # If we added a custom range, print a line telling us the range of our activations too.
    if act_max != max_val or act_min != min_val:
        htmls.append(
            f"<h4>Custom Range Set. Max Act: <b>{act_max:.4f}</b>. Min Act: <b>{act_min:.4f}</b></h4>"
        )
    # Convert the text to a list of tokens
    str_tokens = model.to_str_tokens(text)
    print('str_tokens', str_tokens)
    for tok, act in zip(str_tokens, acts):
        # A span is an HTML element that lets us style a part of a string (and remains on the same line by default)
        # We set the background color of the span to be the color we calculated from the activation
        # We set the contents of the span to be the token
        htmls.append(
            f"<span class='token' style='background-color:{calculate_color(act, max_val, min_val)}' >{tok}</span>"
        )

    return "".join(htmls)

In [12]:
# The function outputs a string of HTML
default_max_val =  2.0
default_min_val = -2.0
default_html_string = basic_neuron_vis(
    default_text,
    default_layer,
    default_neuron_index,
    max_val=default_max_val,
    min_val=default_min_val,
)

# IPython lets us display HTML
print("Displayed HTML")
display(HTML(default_html_string))

# We can also print the string directly
print("HTML String - it's just raw HTML code!")
print(default_html_string)

acts  [-0.09019658 -0.16825671 -0.16022548 -0.10385371 -0.16405521 -0.12235729
  0.12736283 -0.16036858 -0.15163518 -0.15545735 -0.16369638 -0.16364644
 -0.16536835 -0.15426308 -0.15184145 -0.01847607 -0.07175732 -0.13938677
 -0.12283999 -0.14176096 -0.16275087 -0.12137292 -0.15182668 -0.16854402
 -0.15053138 -0.16540338 -0.08222538 -0.13916709 -0.1110108  -0.0836805
 -0.06269553 -0.04969161 -0.04285033 -0.06914173 -0.04093773 -0.07153136
 -0.07366642 -0.11953531 -0.13063058 -0.14887872 -0.14031702 -0.100984
 -0.08735689 -0.1257798  -0.1307348  -0.15216658 -0.09925007 -0.11645029
 -0.16961572 -0.11716732 -0.09728693 -0.12917487 -0.15304792 -0.12109021
 -0.06023592 -0.12418018 -0.05660514 -0.09330777 -0.15382323 -0.12229802
 -0.13538375] 

act_max : 0.12736283 
 act_min :  -0.16961572
str_tokens ['<|endoftext|>', 'the', ' Brothers', ' (', 'and', ' Sal', 'af', 'is', ')', ' argue', ' that', ' while', ' it', ' is', ' not', ' mandatory', ',', ' it', ' is', ' nevertheless', ' m', 'uk', 'ar',

HTML String - it's just raw HTML code!
<style> 
    span.token {
        border: 1px solid rgb(123, 123, 123)
        } 
    </style><h4>Layer: <b>5</b>. Neuron Index: <b>220</b></h4><h4>Max Range: <b>2.0000</b>. Min Range: <b>-2.0000</b></h4><h4>Custom Range Set. Max Act: <b>0.1274</b>. Min Act: <b>-0.1696</b></h4><span class='token' style='background-color:rgb(240, 10.823589563369751, 10.823589563369751)' ><|endoftext|></span><span class='token' style='background-color:rgb(240, 20.190805792808533, 20.190805792808533)' >the</span><span class='token' style='background-color:rgb(240, 19.227057695388794, 19.227057695388794)' > Brothers</span><span class='token' style='background-color:rgb(240, 12.462445199489594, 12.462445199489594)' > (</span><span class='token' style='background-color:rgb(240, 19.686625599861145, 19.686625599861145)' >and</span><span class='token' style='background-color:rgb(240, 14.682874381542206, 14.682874381542206)' > Sal</span><span class='token' style='background

In [13]:
print()




## Create Interactive UI

We now put all these together to create an interactive visualization in Gradio! 

The internal format is that there's a bunch of elements - Textboxes, Numbers, etc which the user can interact with and which return strings and numbers. And we can also define output elements that just display things - in this case, one which takes in an arbitrary HTML string. We call `input.change(update_function, inputs, output)` - this says "if that input element changes, run the update function on the value of each of the elements in `inputs` and set the value of `output` to the output of the function". As a bonus, this gives us live interactivity!

This is also more complex than a typical Gradio intro example - I wanted to use custom HTML to display the nice colours, which made things much messier! Normally you could just make `out` into another Textbox and pass it a string.

In [14]:
# The `with gr.Blocks() as demo:` syntax just creates a variable called demo containing all these components
with gr.Blocks() as demo:
    gr.HTML(value=f"Hacky Interactive Neuroscope for {model_name}")
    # The input elements
    with gr.Row():
        with gr.Column():
            text = gr.Textbox(label="Text", value=default_text)
            # Precision=0 makes it an int, otherwise it's a float
            # Value sets the initial default value
            layer = gr.Number(label="Layer", value=default_layer, precision=0)
            neuron_index = gr.Number(
                label="Neuron Index", value=default_neuron_index, precision=0
            )
            # If empty, these two map to None
            max_val = gr.Number(label="Max Value", value=default_max_val)
            min_val = gr.Number(label="Min Value", value=default_min_val)
            inputs = [text, layer, neuron_index, max_val, min_val]
        with gr.Column():
            # The output element
            out = gr.HTML(label="Neuron Acts", value=default_html_string)
    for inp in inputs:
        inp.change(basic_neuron_vis, inputs, out)

We can now launch our demo element, and we're done! The setting share=True even gives you a public link to the demo (though it just redirects to the backend run by this notebook, and will go away once you turn the notebook off!) Sharing makes it much slower, and can be turned off if you aren't in a colab.

**Exercise:** Explore where this neuron does and does not activate. Is it just powers of ten? Just comma separated numbers? Numbers in any particular sequence?

In [15]:
demo.launch(share=True, height=1000)

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://d4e085af-868f-4137.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces


