# Demo: nnterp Features Showcase

This notebook demonstrates the key features of `nnterp`, which aims to offer a unified interface for all transformer models and to put `NNsight` best practices for LLMs in everyone's hands.

## 1. Standardized Interface

Similar to [`transformer_lens`](https://github.com/TransformerLensOrg/TransformerLens), `nnterp` provides a standardized interface for all transformer models.
The main difference is that `nnterp` still uses the Hugging Face implementation under the hood through `NNsight`, while `transformer_lens` uses its own implementation of the transformer architecture. Because each transformer implementation has its own quirks, `transformer_lens` cannot support every model, and its outputs can sometimes differ significantly from the Hugging Face implementation.

The implementation relies on the `NNsight` built-in renaming feature to make all models follow the llama naming convention, without having to write `model.model`, namely:
```
StandardizedTransformer
├── layers
│   ├── self_attn
│   └── mlp
├── norm
└── lm_head
```

In [1]:
from transformers import AutoModelForCausalLM

print(AutoModelForCausalLM.from_pretrained("Maykeye/TinyLLama-v0"))
print(AutoModelForCausalLM.from_pretrained("gpt2"))

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 64, padding_idx=0)
    (layers): ModuleList(
      (0-7): 8 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=64, out_features=64, bias=False)
          (k_proj): Linear(in_features=64, out_features=64, bias=False)
          (v_proj): Linear(in_features=64, out_features=64, bias=False)
          (o_proj): Linear(in_features=64, out_features=64, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=64, out_features=256, bias=False)
          (up_proj): Linear(in_features=64, out_features=256, bias=False)
          (down_proj): Linear(in_features=256, out_features=64, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((64,), eps=1e-06)
        (post_attention_layernorm): LlamaRMSNorm((64,), eps=1e-06)
      )
    )
    (norm): LlamaRMSNorm((64,), eps=1e-06)
    (rotary_emb): LlamaRotaryEmbeddi

As you can see, the naming scheme of gpt2 is different from the llama naming convention.
A simple way to fix this is to use the `rename` feature of `NNsight` to rename the gpt2 modules to the llama naming convention.

In [2]:
from nnsight import LanguageModel

model = LanguageModel(
    "gpt2", rename=dict(transformer="model", h="layers", ln_f="norm", attn="self_attn")
)
print(model)
# Access the attn module as if it was a llama model
print(model.model.layers[0].self_attn)

GPT2LMHeadModel(
  (model/transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (layers/h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (self_attn/attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (norm/ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
  (generator): WrapperModule()
)
GPT2Attention(
  (c_attn): Conv1D()
  (c_proj): Conv1
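
To make the renaming concrete, here is a minimal pure-Python sketch of what such a mapping does to dotted module paths. `rename_path` is a hypothetical helper written for this illustration and is not part of `NNsight`, which applies the renaming internally; the sketch only reuses the `rename` dict from the cell above.

```python
# Hypothetical helper for illustration only; NNsight handles renaming internally.
RENAME = dict(transformer="model", h="layers", ln_f="norm", attn="self_attn")


def rename_path(path: str, rename: dict) -> str:
    """Map each dotted component of a module path through the rename dict."""
    return ".".join(rename.get(part, part) for part in path.split("."))


print(rename_path("transformer.h.0.attn.c_proj", RENAME))  # model.layers.0.self_attn.c_proj
print(rename_path("transformer.ln_f", RENAME))             # model.norm
```

Components that are not in the dict (such as `lm_head` or `mlp`) are left untouched, which is why only the modules that differ from the llama convention need an entry.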

You can see that renamed modules are displayed as `(new_name/old_name)`. However, many model families have their own naming conventions, so `nnterp` provides a global renaming scheme that should map any model to the llama naming convention. The easiest way to use it is to load your model with the `StandardizedTransformer` class, which inherits from `nnsight.LanguageModel`.

In [3]:
from nnterp import StandardizedTransformer

# You will see the `layers` module printed twice; this is explained later.
nnterp_gpt2 = StandardizedTransformer("gpt2")
print(nnterp_gpt2)
# StandardizedTransformer also uses `device_map="auto"` by default:
nnterp_gpt2.dispatch()
print(nnterp_gpt2.model.device)

This is most likely okay, but you may want to at least check that the attention probabilities hook makes sense by calling `model.attention_probabilities.print_source()`. It is recommended to switch to None or 0.4.53 if possible or:
  - run the nnterp tests with your version of transformers to ensure everything works as expected.
  - check if the attention probabilities hook makes sense before using them by calling `model.attention_probabilities.print_source()` (prettier in a notebook).
2025-07-07 17:18:26.565 | INFO     | nnterp.standardized_transformer:__init__:179 - Enforcing eager attention implementation for attention pattern tracing. The HF default would be to use sdpa if available. To use sdpa, set attn_implementation='sdpa' or None to use the HF default.


GPT2LMHeadModel(
  (model/transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (layers/h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (self_attn/attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (norm/ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
  (generator): WrapperModule()
  (layers): ModuleList(
    (0-11): 12 x GPT2Block(
   

Great! But I can see you at the back of the classroom, asking yourself:
> "Why would you create a package that just passes the right dict to the `NNsight` `rename` feature?"

And actually, I'm glad you asked! `StandardizedTransformer` and `nnterp` have a lot of other features, so bear with me!

## 2. Accessing Modules I/O
With `NNsight`, the most robust way to set the residual stream after layer 1 to the residual stream after layer 0 for a Llama-like model would be:

In [4]:
llama = LanguageModel("Maykeye/TinyLLama-v0")
with llama.trace("hello"):
    llama.model.layers[1].output = (llama.model.layers[0].output[0], *llama.model.layers[1].output[1:])

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.


Note that the following can cause issues:

In [5]:
with llama.trace("hello"):
    # can't do this because .output is a tuple
    # llama.model.layers[1].output[0] = llama.model.layers[0].output[0]

    # Can cause errors with gradient computation
    llama.model.layers[1].output[0][:] = llama.model.layers[0].output[0]

with llama.trace("hello"):
    # Can cause errors with opt if you do this at its last layer (thanks pytest)
    llama.model.layers[1].output = (llama.model.layers[0].output[0], )
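
The robust pattern from the first cell (rebuild the tuple, keep everything after the hidden state) can be sketched as a small helper. This is only an illustration with plain Python stand-ins instead of tensors, and `with_new_hidden` is a hypothetical name; whether a given architecture returns a tuple or a bare tensor is an assumption that varies between models and `transformers` versions.

```python
# Stand-in values: real layer outputs hold tensors, plain lists are used here.
def with_new_hidden(output, new_hidden):
    """Return `output` with its hidden state replaced, keeping any extra entries."""
    if isinstance(output, tuple):
        # Rebuild the tuple as (hidden, *rest) so caches or attention weights survive
        return (new_hidden, *output[1:])
    # Some architectures return the hidden state directly
    return new_hidden


layer0_out = ([0.1, 0.2], "kv_cache_0")
layer1_out = ([0.3, 0.4], "kv_cache_1")
patched = with_new_hidden(layer1_out, layer0_out[0])
print(patched)
```

Dropping the trailing entries, as in the last cell above, is exactly what this helper avoids.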

`nnterp` makes this much cleaner:

In [6]:
# First, you can access layer inputs and outputs directly:
with nnterp_gpt2.trace("hello"):
    # Access layer 5's output
    layer_5_output = nnterp_gpt2.layers_output[5]
    # Set layer 10's output to be layer 5's output
    nnterp_gpt2.layers_output[10] = layer_5_output

# You can also access attention and MLP outputs:
with nnterp_gpt2.trace("hello"):
    attn_output = nnterp_gpt2.attentions_output[3]
    mlp_output = nnterp_gpt2.mlps_output[3]

## 3. Builtin interventions

`StandardizedTransformer` also provides convenient methods for common operations:

In [23]:
import torch as th
# Project hidden states to vocabulary using the unembed norm and lm_head
with nnterp_gpt2.trace("The capital of France is"):
    hidden = nnterp_gpt2.layers_output[5]
    logits = nnterp_gpt2.project_on_vocab(hidden)

# Skip layers entirely
with nnterp_gpt2.trace("Hello world"):
    # Skip layer 1
    nnterp_gpt2.skip_layer(1)
    # Skip layers 2 through 3 (inclusive)
    nnterp_gpt2.skip_layers(2, 3)

# This is useful if you want to start at a later layer than the first one
with nnterp_gpt2.trace("Hello world") as tracer:
    layer_6_out = nnterp_gpt2.layers_output[6].save()
    tracer.stop()  # avoid computations after layer 6

with nnterp_gpt2.trace("Hello world"):
    nnterp_gpt2.skip_layers(0, 6, skip_with=layer_6_out)
    half_half_logits = nnterp_gpt2.logits.save()

with nnterp_gpt2.trace("Hello world"):
    vanilla_logits = nnterp_gpt2.logits.save()

assert th.allclose(vanilla_logits, half_half_logits)  # they should be the same

# Direct steering
steering_vector = th.randn(768)  # gpt2 hidden size
with nnterp_gpt2.trace("The weather today is"):
    nnterp_gpt2.steer(layers=[1, 3], steering_vector=steering_vector, factor=0.5)
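
The `skip_with` check above (resuming from a cached layer 6 output gives the same logits as a full forward pass) can be reproduced on a toy stack of functions. This is a conceptual sketch only: `forward`, `skip_upto` and the toy layers are hypothetical names for this example and do not reflect how `nnterp` implements `skip_layers`.

```python
from typing import Callable, Optional, Sequence

Layer = Callable[[float], float]


def forward(layers: Sequence[Layer], x: float, skip_upto: Optional[int] = None,
            skip_with: Optional[float] = None) -> float:
    """Run the stack; if skip_upto is set, layers 0..skip_upto are skipped and
    skip_with is used as the output of layer skip_upto."""
    start = 0
    if skip_upto is not None:
        x, start = skip_with, skip_upto + 1
    for layer in layers[start:]:
        x = layer(x)
    return x


# Toy "layers": each one is a simple deterministic function of its input
layers = [lambda h, i=i: 0.5 * h + i for i in range(12)]

layer_6_out = forward(layers[:7], 1.0)  # run only layers 0..6 and cache the result
vanilla = forward(layers, 1.0)          # full forward pass
resumed = forward(layers, 1.0, skip_upto=6, skip_with=layer_6_out)
assert vanilla == resumed               # resuming from the cache changes nothing
```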

## 4. Attention Probabilities

For models that support it, you can access attention probabilities directly:

In [8]:
nnterp_gpt2.tokenizer.padding_side = "left"  # ensure left padding for easy access to the first token

with th.no_grad():
    with nnterp_gpt2.trace("The cat sat on the mat"):
        # Access attention probabilities for layers 2 and 5
        attn_probs_l2 = nnterp_gpt2.attention_probabilities[2].save()
        attn_probs = nnterp_gpt2.attention_probabilities[5].save()
        print(
            f"Attention probs shape will be: (batch, heads, seq_len, seq_len): {attn_probs.shape}"
        )
        # knock out the attention to the first token
        attn_probs[:, :, :, 0] = 0
        attn_probs /= attn_probs.sum(
            dim=-1, keepdim=True
        )
        corr_logits = nnterp_gpt2.logits.save()
    with nnterp_gpt2.trace("The cat sat on the mat"):
        baseline_logits = nnterp_gpt2.logits.save()

assert not th.allclose(corr_logits, baseline_logits)

sums = attn_probs_l2.sum(dim=-1)
assert th.allclose(sums, th.ones_like(sums))

Attention probs shape will be: (batch, heads, seq_len, seq_len): torch.Size([1, 12, 6, 6])
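
The knockout and renormalisation step can be checked by hand on a small causal attention pattern. The sketch below uses plain lists and a hypothetical `knock_out_key` helper. One caveat it makes explicit: under a causal mask the first query position can only attend to the first token, so its row sums to zero after the knockout and an unguarded division would compute 0/0 there. The sketch leaves such rows at zero.

```python
def knock_out_key(attn, key_idx=0):
    """Zero the attention paid to `key_idx` and renormalise each query row."""
    out = []
    for row in attn:
        row = [0.0 if j == key_idx else p for j, p in enumerate(row)]
        total = sum(row)
        # A row with no remaining mass is left at zero instead of dividing by 0
        out.append([p / total for p in row] if total > 0 else row)
    return out


# Causal attention pattern for 3 tokens (each row sums to 1, values made up)
attn = [
    [1.0, 0.0, 0.0],
    [0.6, 0.4, 0.0],
    [0.5, 0.25, 0.25],
]
for row in knock_out_key(attn):
    print(row)
```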


Under the hood, this uses the new tracing system implemented in `NNsight v0.5`, which allows access to most intermediate variables of the model during the forward pass. This means that if the `transformers` implementation changes, this could break or give unexpected results. It is therefore recommended to use one of the tested versions of `transformers`, and to check that the attention probabilities hook makes sense by calling `model.attention_probabilities.print_source()` if you use a different version of `transformers` or an architecture that has not been tested.

In [9]:
nnterp_gpt2.attention_probabilities.print_source()  # pretty markdown display in a notebook

## Accessing attention probabilities from:
```py
model.transformer.h.0.attn.attention_interface_0.module_attn_dropout_0:

    ....
            attn_weights = attn_weights + causal_mask
    
        attn_weights = nn.functional.softmax(attn_weights, dim=-1)
    
        # Downcast (if necessary) back to V's dtype (if in mixed-precision) -- No-Op otherwise
        attn_weights = attn_weights.type(value.dtype)
    --> attn_weights = module.attn_dropout(attn_weights) <--
    
        # Mask heads if we want to
        if head_mask is not None:
            attn_weights = attn_weights * head_mask
    
        attn_output = torch.matmul(attn_weights, value)
    ....
```

## Full module source:
```py
                             * def eager_attention_forward(module, query, key, value, attention_mask, head_mask=None, **kwargs):
 key_transpose_0         ->  0     attn_weights = torch.matmul(query, key.transpose(-1, -2))
 torch_matmul_0          ->  +     ...
                             1 
                             2     if module.scale_attn_weights:
 torch_full_0            ->  3         attn_weights = attn_weights / torch.full(
 value_size_0            ->  4             [], value.size(-1) ** 0.5, dtype=attn_weights.dtype, device=attn_weights.device
                             5         )
                             6 
                             7     # Layer-wise attention scaling
                             8     if module.scale_attn_by_inverse_layer_idx:
 float_0                 ->  9         attn_weights = attn_weights / float(module.layer_idx + 1)
                            10 
                            11     if not module.is_cross_attention:
                            12         # if only "normal" attention layer implements causal mask
 query_size_0            -> 13         query_length, key_length = query.size(-2), key.size(-2)
 key_size_0              ->  +         ...
                            14         causal_mask = module.bias[:, :, key_length - query_length : key_length, :key_length]
 torch_finfo_0           -> 15         mask_value = torch.finfo(attn_weights.dtype).min
                            16         # Need to be a tensor, otherwise we get error: `RuntimeError: expected scalar type float but found double`.
                            17         # Need to be on the same device, otherwise `RuntimeError: ..., x and y to be on the same device`
 torch_full_1            -> 18         mask_value = torch.full([], mask_value, dtype=attn_weights.dtype, device=attn_weights.device)
 attn_weights_to_0       -> 19         attn_weights = torch.where(causal_mask, attn_weights.to(attn_weights.dtype), mask_value)
 torch_where_0           ->  +         ...
                            20 
                            21     if attention_mask is not None:
                            22         # Apply the attention mask
                            23         causal_mask = attention_mask[:, :, :, : key.shape[-2]]
                            24         attn_weights = attn_weights + causal_mask
                            25 
 nn_functional_softmax_0 -> 26     attn_weights = nn.functional.softmax(attn_weights, dim=-1)
                            27 
                            28     # Downcast (if necessary) back to V's dtype (if in mixed-precision) -- No-Op otherwise
 attn_weights_type_0     -> 29     attn_weights = attn_weights.type(value.dtype)
 module_attn_dropout_0   -> 30     attn_weights = module.attn_dropout(attn_weights)
                            31 
                            32     # Mask heads if we want to
                            33     if head_mask is not None:
                            34         attn_weights = attn_weights * head_mask
                            35 
 torch_matmul_1          -> 36     attn_output = torch.matmul(attn_weights, value)
 attn_output_transpose_0 -> 37     attn_output = attn_output.transpose(1, 2)
                            38 
                            39     return attn_output, attn_weights
                            40 
```

## 5. Activation Collection

`nnterp` provides utilities for collecting activations efficiently:

In [10]:
from nnterp.nnsight_utils import (
    get_token_activations,
    collect_token_activations_batched,
)

# Collect activations for specific tokens
prompts = ["The capital of France is", "The weather today is"]
with nnterp_gpt2.trace(prompts) as tracer:
    # Get last token activations for all layers
    activations = get_token_activations(nnterp_gpt2, prompts, idx=-1, tracer=tracer)
    # activations shape: (num_layers, batch_size, hidden_size)

# For large datasets, use batched collection
large_prompts = ["Sample text " + str(i) for i in range(100)]
batch_activations = collect_token_activations_batched(
    nnterp_gpt2,
    large_prompts,
    batch_size=16,
    layers=[3, 9, 11],  # Only collect specific layers, default is all layers
    idx=-1,  # Last token (default)
)
print(f"Batched activations shape: {batch_activations.shape}")

Batched activations shape: torch.Size([3, 100, 768])
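
Conceptually, batched collection splits the prompts into chunks, collects per-layer activations for each chunk, and concatenates along the batch axis. The sketch below shows that bookkeeping with plain lists. `collect_batched` and `fake_collect` are hypothetical stand-ins for this example, not the `nnterp` implementation, and the "activations" are dummy numbers.

```python
from typing import Callable, List, Sequence


def collect_batched(prompts: Sequence[str], batch_size: int,
                    collect_fn: Callable[[Sequence[str]], List[list]]) -> List[list]:
    """Call collect_fn on consecutive batches and concatenate along the batch axis.

    collect_fn returns one list per layer, each holding one entry per prompt.
    """
    per_layer: List[list] = []
    for start in range(0, len(prompts), batch_size):
        batch_out = collect_fn(prompts[start:start + batch_size])
        if not per_layer:
            per_layer = [[] for _ in batch_out]
        for layer_acts, new in zip(per_layer, batch_out):
            layer_acts.extend(new)
    return per_layer


# Hypothetical stand-in for a model call: 3 "layers", activation = prompt length + layer
def fake_collect(batch):
    return [[len(p) + layer for p in batch] for layer in range(3)]


large_prompts = ["Sample text " + str(i) for i in range(100)]
acts = collect_batched(large_prompts, batch_size=16, collect_fn=fake_collect)
print(len(acts), len(acts[0]))  # (num_layers, num_prompts)
```

With 100 prompts and a batch size of 16, the last batch holds only 4 prompts; the result still has one entry per prompt for each layer, mirroring the `[3, 100, 768]` shape above without the hidden dimension.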


## 6. Prompt Utilities

`nnterp` provides utilities for working with prompts and tracking the probabilities of the first tokens of given strings. For each string, it tracks the first token of both "string" and " string".

You can provide multiple strings per category; the returned probability is the sum of the probabilities of the first tokens of all the strings.

In [11]:
from nnterp.prompt_utils import Prompt, run_prompts

# Create prompts with target tokens to track
prompt1 = Prompt.from_strings(
    "The capital of France (not England or Spain) is",
    {
        "target": "Paris",
        "traps": ["London", "Madrid"],
        "longstring": "the country of France",
    },
    nnterp_gpt2.tokenizer,
)
for name, tokens in prompt1.target_tokens.items():
    print(f"{name}: {nnterp_gpt2.tokenizer.convert_ids_to_tokens(tokens)}")

prompt2 = Prompt.from_strings(
    "The largest planet (not Earth or Neptune) is",
    {"target": "Jupiter", "traps": ["Earth", "Neptune"], "longstring": "Palace planet"},
    nnterp_gpt2.tokenizer,
)
for name, tokens in prompt2.target_tokens.items():
    print(f"{name}: {nnterp_gpt2.tokenizer.convert_ids_to_tokens(tokens)}")

# Run prompts and get probabilities for target tokens
results = run_prompts(nnterp_gpt2, [prompt1, prompt2], batch_size=2)
print("Target token probabilities:")
for target, probs in results.items():
    print(f"  {target}: shape {probs.shape}")

target: ['Paris', 'ĠParis']
traps: ['London', 'ĠLondon', 'Mad', 'ĠMadrid']
longstring: ['the', 'Ġthe']
target: ['J', 'ĠJupiter']
traps: ['Earth', 'ĠEarth', 'Ne', 'ĠNeptune']
longstring: ['Pal', 'ĠPalace']


Running prompts:   0%|          | 0/1 [00:00<?, ?it/s]

Target token probabilities:
  target: shape torch.Size([2, 1])
  traps: shape torch.Size([2, 1])
  longstring: shape torch.Size([2, 1])
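
The per-category value is a sum of next-token probabilities over the tracked tokens. The following sketch shows that aggregation in isolation. The token lists are copied from the output above, while the probabilities and the `category_prob` helper are made up for illustration.

```python
def category_prob(next_token_probs: dict, target_tokens: dict) -> dict:
    """Sum the next-token probability over every tracked token of each category."""
    return {
        name: sum(next_token_probs.get(tok, 0.0) for tok in tokens)
        for name, tokens in target_tokens.items()
    }


# Token lists copied from the output above; the probabilities are invented
target_tokens = {
    "target": ["Paris", "ĠParis"],
    "traps": ["London", "ĠLondon", "Mad", "ĠMadrid"],
}
next_token_probs = {"ĠParis": 0.25, "Paris": 0.0625, "ĠLondon": 0.125, "Ġthe": 0.5}
print(category_prob(next_token_probs, target_tokens))
```

Note that a multi-token string such as "Madrid" without a leading space only contributes its first token (`Mad`), so unrelated continuations starting with that token are counted too.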


## 7. Interventions

`nnterp` provides several intervention methods inspired by mechanistic interpretability research:

In [12]:
from nnterp.interventions import (
    logit_lens,
    patchscope_lens,
    TargetPrompt,
    repeat_prompt,
    steer,
)

# Logit Lens: See predictions at each layer
prompts = ["The capital of France is", "The sun rises in the"]
probs = logit_lens(nnterp_gpt2, prompts)
print(f"Logit lens output shape: {probs.shape}")  # (batch, layers, vocab)

# Patchscope: Replace activations from one context into another
source_prompts = ["Paris is beautiful", "London is foggy"]
custom_target_prompt = TargetPrompt("city: Paris\nfood: croissant\n?", -1)
target_prompt = repeat_prompt()  # Creates a repetition task
custom_repeat_prompt = repeat_prompt(
    words=["car", "cross", "azdrfa"],
    rel=":",
    sep="\n\n",
    placeholder="*",
)
print(f"repeat_prompt: {custom_repeat_prompt}")
print(f"custom_repeat_prompt: {custom_repeat_prompt}")
patchscope_probs = patchscope_lens(
    nnterp_gpt2, source_prompts=source_prompts, target_patch_prompts=target_prompt
)
print(f"patchscope_probs: {patchscope_probs.shape}")

# Steering with intervention function
with nnterp_gpt2.trace("The weather is"):
    steer(nnterp_gpt2, layers=[5, 10], steering_vector=steering_vector)

Logit lens output shape: torch.Size([2, 12, 50257])
repeat_prompt: TargetPrompt(prompt='car:car\n\ncross:cross\n\nazdrfa:azdrfa\n\n*', index_to_patch=-1)
custom_repeat_prompt: TargetPrompt(prompt='car:car\n\ncross:cross\n\nazdrfa:azdrfa\n\n*', index_to_patch=-1)


patchscope_probs: torch.Size([2, 12, 50257])
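
As a reminder of what the logit lens computes, here is a toy version in pure Python: each layer's hidden state is multiplied by the unembedding matrix and passed through a softmax. It is a simplified sketch with made-up numbers. In particular, it omits the final normalisation that `project_on_vocab` applies before `lm_head`, and `toy_logit_lens` is a hypothetical name.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_logit_lens(hidden_per_layer, unembed):
    """For each layer's hidden state h, return softmax(unembed @ h)."""
    probs = []
    for h in hidden_per_layer:
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembed]
        probs.append(softmax(logits))
    return probs


# 2 layers, hidden size 2, vocabulary of 3 tokens (all values made up)
hidden_per_layer = [[1.0, 0.0], [0.0, 2.0]]
unembed = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
lens = toy_logit_lens(hidden_per_layer, unembed)
top_tokens = [max(range(3), key=p.__getitem__) for p in lens]
print(top_tokens)  # most likely token id at each layer
```

The result has one probability distribution per layer, which matches the `(batch, layers, vocab)` shape printed above once a batch dimension is added.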


You can combine `run_prompts` with an intervention to get the probabilities of the tracked tokens under your custom intervention.

In [13]:
demo_model = StandardizedTransformer("google/gemma-2-2b")
# uncomment if you don't have a GPU
# demo_model = nnterp_gpt2

prompts_str = [
    "The translation of 'car' in French is",
    "The translation of 'cat' in Spanish is",
]
tokens = [
    {"target": ["voiture", "bagnole"], "english": "car", "format": "'"},
    {"target": ["gato", "minino"], "english": "cat", "format": "'"},
]
prompts = [
    Prompt.from_strings(prompt, tokens, demo_model.tokenizer)
    for prompt, tokens in zip(prompts_str, tokens)
]
results = run_prompts(demo_model, prompts, batch_size=2, get_probs_func=logit_lens)
for category, probs in results.items():
    print(f"{category}: {probs.shape}")  # (batch, layers)

# Create a plotly plot showing mean probabilities for each category across layers
import plotly.graph_objects as go

# Calculate mean probabilities across batches for each category and layer
mean_probs = {category: probs.mean(dim=0) for category, probs in results.items()}

fig = go.Figure()

# Add a line for each category
for category, probs in mean_probs.items():
    fig.add_trace(
        go.Scatter(
            x=list(range(len(probs))),
            y=probs.tolist(),
            mode="lines+markers",
            name=category,
            line=dict(width=2),
            marker=dict(size=6),
        )
    )

fig.update_layout(
    title="Mean Token Probabilities Across Layers",
    xaxis_title="Layer",
    yaxis_title="Mean Probability",
    hovermode="x unified",
    template="plotly_white",
)

fig.show()

2025-07-07 17:18:33.734 | INFO     | nnterp.standardized_transformer:__init__:179 - Enforcing eager attention implementation for attention pattern tracing. The HF default would be to use sdpa if available. To use sdpa, set attn_implementation='sdpa' or None to use the HF default.
Feel free to open an issue on github (https://github.com/butanium/nnterp/issues) or run the tests yourself with a toy model if you want to add test coverage for this model.


Running prompts:   0%|          | 0/1 [00:00<?, ?it/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

target: torch.Size([2, 26])
english: torch.Size([2, 26])
format: torch.Size([2, 26])


## 8. Visualization

Finally, `nnterp` provides visualization utilities for analyzing model probabilities and prompts:

In [14]:
from nnterp.display import plot_topk_tokens, prompts_to_df

probs = logit_lens(demo_model, prompts_str[0])
# Visualize top tokens from logit lens
plot_topk_tokens(
    probs[0],
    demo_model.tokenizer,
    k=5,
    width=1000,
    height=1000,
    title="Top 5 tokens at each layer for 'The translation of 'car' in French is'",
)

# Convert prompts to DataFrame for analysis
df = prompts_to_df(prompts, demo_model.tokenizer)
print("\nPrompts DataFrame:")
display(df)


Prompts DataFrame:


| | 0 | 1 |
|---|---|---|
| prompt | The translation of 'car' in French is | The translation of 'cat' in Spanish is |
| target_string | [voiture, bagnole] | [gato, minino] |
| english_string | car | cat |
| format_string | ' | ' |
| target_tokens | [voiture, ▁voiture, bagno, ▁bagno] | [gato, ▁gato, min, ▁min] |
| english_tokens | [car, ▁car] | [cat, ▁cat] |
| format_tokens | [', ▁'] | [', ▁'] |


## Summary

`nnterp` provides a unified, standardized interface for working with transformer models, built on top of `nnsight`. Key features include:

1. **Standardized naming** across all transformer architectures
2. **Easy access** to layer/attention/MLP inputs and outputs
3. **Built-in methods** for common operations (steering, skipping layers, projecting to vocab)
4. **Efficient activation collection** with batching support
5. **Prompt utilities** for tracking target tokens
6. **Intervention methods** from mechanistic interpretability research
7. **Visualization tools** for analyzing model behavior

All of this while maintaining the full power and flexibility of `nnsight` under the hood!

# Appendix: `NNsight` cheatsheet

## 1) You must execute your interventions in order
Recent `NNsight` versions enforce that you access model internals *in the same order* as the model executes them.

In [15]:
from nnterp import StandardizedTransformer

nnterp_gpt2 = StandardizedTransformer("gpt2")

with nnterp_gpt2.trace("My tailor is rich"):
    l2 = nnterp_gpt2.layers_output[2]
    l1 = nnterp_gpt2.layers_output[1]  # will fail! You need to collect l1 before l2

2025-07-07 17:18:40.940 | INFO     | nnterp.standardized_transformer:__init__:179 - Enforcing eager attention implementation for attention pattern tracing. The HF default would be to use sdpa if available. To use sdpa, set attn_implementation='sdpa' or None to use the HF default.


NNsightException: 

Traceback (most recent call last):
  File "/tmp/ipykernel_112758/2815926761.py", line 7, in <module>
    l1 = nnterp_gpt2.layers_output[1]  # will fail! You need to collect l1 before l2
  File "/workspace/nnterp/nnterp/rename_utils.py", line 99, in __getitem__
    target = module.output
  File "/root/.venv/lib/python3.10/site-packages/nnsight/intervention/envoy.py", line 140, in output
    return self._interleaver.current.request(
  File "/root/.venv/lib/python3.10/site-packages/nnsight/intervention/interleaver.py", line 804, in request
    value = self.send(Events.VALUE, requester)
  File "/root/.venv/lib/python3.10/site-packages/nnsight/intervention/interleaver.py", line 788, in send
    raise response

OutOfOrderError: Value was missed for model.transformer.h.1.output.i0. Did you call an Envoy out of order?
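
The ordering constraint can be mimicked with a toy object whose layer outputs can only be read while the forward pass moves forward. `ForwardOnlyStack` and `read_in_order` are hypothetical names for this sketch and have nothing to do with the `NNsight` internals; the point is only that sorting the requested layers first avoids the error.

```python
class ForwardOnlyStack:
    """Toy model whose layer outputs can only be read in execution order."""

    def __init__(self, num_layers: int):
        self.num_layers = num_layers
        self._next = 0  # index of the next layer that has not run yet

    def output(self, idx: int) -> str:
        if idx < self._next:
            raise RuntimeError(
                f"layer {idx} already ran; request it before layer {self._next}"
            )
        self._next = idx + 1  # the forward pass advances past this layer
        return f"output of layer {idx}"


def read_in_order(stack: ForwardOnlyStack, wanted):
    """Request the layers sorted by index, then return them in the caller's order."""
    fetched = {i: stack.output(i) for i in sorted(set(wanted))}
    return [fetched[i] for i in wanted]


print(read_in_order(ForwardOnlyStack(12), [2, 1]))
```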

## 2) Gradient computation
To compute gradients, you need to open a `.backward()` context, and save the gradients *inside it*.

In [17]:
with nnterp_gpt2.trace("My tailor is rich"):
    l1_out = nnterp_gpt2.layers_output[1]  # get l1 before accessing logits
    logits = nnterp_gpt2.output.logits
    with logits.sum().backward(
        retain_graph=True
    ):  # use retain_graph if you want to do multiple backprops
        if False:
            l1_grad = nnterp_gpt2.layers_output[1].grad.save()
            # this would fail as we'd access nnterp_gpt2.layers_output[1] after nnterp_gpt2.output
        l1_grad = l1_out.grad.save()
    with (logits.sum() ** 2).backward():
        l1_grad_2 = l1_out.grad.save()

assert not th.allclose(l1_grad, l1_grad_2)
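
The two gradients above differ by a predictable factor: writing `s = logits.sum()`, the chain rule gives `grad(s ** 2) = 2 * s * grad(s)`. The sketch below checks this numerically on a toy weighted sum with finite differences, using only the standard library; the function `s` and the weights are made up for the illustration.

```python
def s(x, w):
    """Scalar stand-in for `logits.sum()`: a weighted sum of the activations x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + eps] + x[i + 1:]
        down = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


w = [0.5, -1.0, 2.0]
x = [1.0, 2.0, 3.0]
grad_s = numeric_grad(lambda v: s(v, w), x)        # approximately w
grad_s2 = numeric_grad(lambda v: s(v, w) ** 2, x)  # approximately 2 * s(x) * w
scale = 2 * s(x, w)
assert all(abs(g2 - scale * g1) < 1e-4 for g1, g2 in zip(grad_s, grad_s2))
```

The two gradients therefore only coincide in the special case where the sum equals 0.5, which is why the final `assert not th.allclose(...)` is expected to pass.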

## 3) Use `tracer.stop()` to skip unnecessary computations
If you're just computing activations, don't forget to call `tracer.stop()` at the end of your trace. This will stop the model from executing the rest of its computations, and save you some time, as demonstrated below (with the contribution of Claude 4 Sonnet).

In [18]:
import time
import pandas as pd

print("🎭 Welcome to the Theatrical Performance Comparison! 🎭\n" + "=" * 60 + "\n\n🐌 ACT I: 'The Tragedy of the Unstoppable Tracer' 🐌\nIn which our hero forgets to call tracer.stop()...")

start_time = time.time()
for _ in range(30):
    with nnterp_gpt2.trace(["Neel Samba", "Chris Aloha"]):
        out5 = nnterp_gpt2.layers_output[5].save()
end_time = time.time()
nostop_time = end_time - start_time

print(f"⏰ Duration of suffering: {nostop_time:.4f} seconds\n\n⚡ ACT II: 'The Redemption of the Stopped Tracer' ⚡\nOur hero learns the ancient art of tracer.stop()...")


start_time = time.time()
for _ in range(30):
    with nnterp_gpt2.trace(["Neel Samba", "Chris Aloha"]) as tracer:
        out5 = nnterp_gpt2.layers_output[5].save()
        tracer.stop()
end_time = time.time()
stop_time = end_time - start_time

print(f"⏰ Duration of enlightenment: {stop_time:.4f} seconds")

speedup = nostop_time / stop_time
time_saved = nostop_time - stop_time

# fun display
print("\n" + "=" * 60 + "\n🎉 THE GRAND RESULTS SPECTACULAR! 🎉\n" + "=" * 60)
results_df = pd.DataFrame({
    '🎭 Performance Type': ['Without tracer.stop() 🐌', 'With tracer.stop() ⚡', 'Time Saved 💰'],
    '⏱️ Time (seconds)': [f"{nostop_time:.4f}", f"{stop_time:.4f}", f"{time_saved:.4f}"],
    '🎯 Rating': ['Tragic 😭', 'Magnificent! 🌟', 'PROFIT! 📈']
})
display(results_df)
speedup_bars = int(speedup * 10)
meter = "█" * min(speedup_bars, 48) + "░" * (50 - min(speedup_bars, 48))
print(f"\n🏎️ SPEEDUP METER 🏎️\n┌{'─' * 50}┐\n│{meter}│\n└{'─' * 50}┘\n   💫 COSMIC SPEEDUP: {speedup:.2f}x FASTER! 💫")


🎭 Welcome to the Theatrical Performance Comparison! 🎭

🐌 ACT I: 'The Tragedy of the Unstoppable Tracer' 🐌
In which our hero forgets to call tracer.stop()...
⏰ Duration of suffering: 0.8498 seconds

⚡ ACT II: 'The Redemption of the Stopped Tracer' ⚡
Our hero learns the ancient art of tracer.stop()...
⏰ Duration of enlightenment: 0.3863 seconds

🎉 THE GRAND RESULTS SPECTACULAR! 🎉


| | 🎭 Performance Type | ⏱️ Time (seconds) | 🎯 Rating |
|---|---|---|---|
| 0 | Without tracer.stop() 🐌 | 0.8498 | Tragic 😭 |
| 1 | With tracer.stop() ⚡ | 0.3863 | Magnificent! 🌟 |
| 2 | Time Saved 💰 | 0.4635 | PROFIT! 📈 |



🏎️ SPEEDUP METER 🏎️
┌──────────────────────────────────────────────────┐
│█████████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░░░░│
└──────────────────────────────────────────────────┘
   💫 COSMIC SPEEDUP: 2.20x FASTER! 💫
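
A back-of-the-envelope model makes the measured speedup plausible: `gpt2` has 12 blocks (see the printed model above), and stopping right after `layers_output[5]` means only blocks 0 to 5 run. The toy loop below counts executed blocks. It ignores embeddings, the final norm, `lm_head` and tracing overhead, so it is only a rough sanity check and not a benchmark.

```python
def run(num_layers: int, stop_after=None) -> int:
    """Return how many layers execute; stop_after mimics an early exit after that index."""
    executed = 0
    for idx in range(num_layers):
        executed += 1  # stand-in for the cost of running one block
        if stop_after is not None and idx == stop_after:
            break
    return executed


full = run(12)                 # no early stop: all 12 blocks run
early = run(12, stop_after=5)  # only blocks 0..5 are needed for layers_output[5]
print(full, early, full / early)
```

Running half of the blocks gives an ideal ratio of 2, which is in the same range as the roughly 2.2x measured above.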


## 4) Using NNsight builtin cache to collect activations

`NNsight 0.5` introduces a built-in way to cache activations during the forward pass. Be careful not to call `tracer.stop()` before all the modules in the cache have been accessed.

NOTE: Broken in current dev version of `NNsight` (0.5.0dev7)

In [None]:
with nnterp_gpt2.trace("Hello"):
    cache = nnterp_gpt2.cache(modules=[layer for layer in nnterp_gpt2.layers[::2]]).save()

print(cache.keys())
print(cache['model.layers.10'].inputs)