## Questions

#### Ane
Do you use the TransformerLens library (for doing interventions)?

Can you say a bit about how you plan to use the notebook? Or, alternatively, what is the focus of your project?

Will we be doing *circuit detection* or *interventions on pre-established circuits*?




#### Theory (for GPT/research)
How are tokens chosen, and how does the model always pick the "longest possible" token (e.g. if ' cat' and ' ', 'c', 'a', 't' are all tokens)?


How should we visualise a transformer - WHAT are the 'nodes' (in e.g. Neuronpedia)? Do we treat an entire attention head as one node? What about individual nodes in the MLP layers?




## Glossary

**Transformers**
- Causal/autoregressive: Uses only the current and previous tokens for prediction, achieved by masking the attention pattern.
- Logits: Output vector (length d_vocab) after unembedding but before softmax.
- LayerNorm: Subtract the mean and divide by the standard deviation (to get variance 1, NB - this is non-linear), with a small $\epsilon$ added to the variance for numerical robustness, then multiply by learned weights and add a bias. In GPT-2 style models this is applied to the input of each attention and MLP layer and once more before the unembedding, rather than after the block.
- Attend to: We say that (query) position $q$ attends to (key) position $k$ if the corresp. attention pattern value is large.
- Residual stream: Main channel in the transformer diagram. It tracks the cumulative sum of the embeddings and all layer outputs so far, i.e. the updated embedding vector at each position.
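
To make the causal mask and LayerNorm entries concrete, here is a minimal sketch in plain Python (no dependencies). The function names and the tiny score matrix are made up for illustration and are not taken from any library; a real model would do this batched on tensors.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_pattern(scores):
    """scores[q][k] is the raw score of query position q against key position k.
    Keys with k > q (future tokens) are set to -inf before the softmax,
    so they receive exactly zero attention."""
    n = len(scores)
    pattern = []
    for q in range(n):
        masked = [scores[q][k] if k <= q else float("-inf") for k in range(n)]
        pattern.append(softmax(masked))
    return pattern

def layer_norm(x, weight, bias, eps=1e-5):
    """Centre x, scale to unit variance (eps guards against division by zero),
    then apply the learned elementwise weight and bias."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    scale = math.sqrt(var + eps)
    return [w * (v - mean) / scale + b for v, w, b in zip(x, weight, bias)]

# Illustrative 3-token example (arbitrary numbers).
scores = [[0.0, 1.0, 2.0],
          [1.0, 0.0, 3.0],
          [2.0, 1.0, 0.0]]
pattern = causal_attention_pattern(scores)
normed = layer_norm([1.0, 2.0, 3.0, 4.0], weight=[1.0] * 4, bias=[0.0] * 4)
```

Each row of `pattern` sums to 1 and is zero above the diagonal, which is what "position $q$ can only attend to positions $\le q$" means in practice.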

- Transformer config:
    * batch: input index in batch.
    * position: token position.
    * d_model: residual stream/embedding dimension
    * d_vocab: vocab size
    * n_ctx: context size
    * d_head: dimension of key/query space in attention head
    * d_mlp: MLP hidden layer size (typically 4 x d_model)
    * n_heads: number of attention heads (typically d_model/d_head)
    * n_layers: number of layers
- Transformer params:
    * LayerNorm weights and biases
    * $W_E$: Embedding matrix (d_vocab, d_model)
    * $W_p$: Position embedding matrix, to give sequential information (n_ctx, d_model)
    * $W_U$: Unembedding matrix (d_model, d_vocab) [we may include bias to allow folding in LayerNorm]

    * $W_{QK} = W_Q W_K^T$: (product of two matrices - important for interpretability since only the product enters the attention scores, so the entries of the two factors are entangled)
        * $W_Q$: Query matrix, for each head in each layer (n_heads, d_model, d_head)
        * $W_K$: Key matrix (n_heads, d_model, d_head)
    * $W_{VO}$:
        * $W_{V\downarrow}$ and $W_{O\uparrow}$: Value matrix (decomposed) (n_heads, d_model, d_head) and (n_heads, d_head, d_model)
    * Attention biases

    * MLP weights and biases (two layers / one hidden layer)
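
A small sketch of why the query and key matrices are best viewed through their product, assuming the row-vector convention (score = $x_q W_Q W_K^T x_k^T$) and tiny made-up sizes for a single head. The helper names are hypothetical. The GPT-2 small numbers at the top are only there to check the two rules of thumb from the config list.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

# GPT-2 small hyperparameters, checked against the rules of thumb above.
gpt2_small = dict(d_model=768, d_head=64, n_heads=12, d_mlp=3072,
                  n_layers=12, n_ctx=1024, d_vocab=50257)
assert gpt2_small["d_mlp"] == 4 * gpt2_small["d_model"]
assert gpt2_small["n_heads"] == gpt2_small["d_model"] // gpt2_small["d_head"]

rng = random.Random(0)
d_model, d_head = 6, 2  # tiny illustrative sizes for one head

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

W_Q = rand_matrix(d_model, d_head)
W_K = rand_matrix(d_model, d_head)
W_QK = matmul(W_Q, transpose(W_K))   # shape (d_model, d_model), rank <= d_head

x_query = rand_matrix(1, d_model)    # residual stream at the query position
x_key = rand_matrix(1, d_model)      # residual stream at the key position

# Attention score the usual way: dot product of query and key vectors.
q = matmul(x_query, W_Q)[0]
k = matmul(x_key, W_K)[0]
score_from_q_and_k = sum(a * b for a, b in zip(q, k))

# The same score read off the combined matrix.
score_from_W_QK = matmul(matmul(x_query, W_QK), transpose(x_key))[0][0]

# Rescaling W_Q by 2 and W_K by 1/2 changes both factors but not the product,
# so individual entries of W_Q or W_K carry no meaning on their own.
W_Q_scaled = [[2.0 * v for v in row] for row in W_Q]
W_K_scaled = [[0.5 * v for v in row] for row in W_K]
W_QK_scaled = matmul(W_Q_scaled, transpose(W_K_scaled))
```

The same argument applies to the value/output pair: only the product of the two factors is visible to the rest of the model.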

**Mech Interp**

*General hypotheses/concepts*
- Linear representation hypothesis (LH): LLMs choose to use (only) linear representations, i.e. features correspond to directions in activation space. Nodes (neurons) are the basis vectors, since the elementwise non-linear activations pick these out as privileged directions, and a general direction is a linear combination of them.
- Superposition: The model knows more concepts (features) than it has basis directions, hence each node encodes multiple concepts. We thus do not expect a feature to be represented by a single basis vector (a challenge for interpretability). The number of almost orthogonal directions grows exponentially with dimension, so large models have room for a very large number of concepts with minimal interference.
    - Polysemanticity: A single neuron looks for several (completely unrelated) things (eg for a visual model: cat faces, car fronts and cat legs - we know they are independent by feature visualisation). [Speculative: We want to 'unfold' polysemanticity with SAEs, decomposing into interpretable features in a higher dim space?]
- Universality (UH): Analogous features and circuits form across models and tasks (and even in biological brains). Mech Interp goal: 'Periodic table of features/circuits'. If untrue, Mech Interp will fail/should only focus on concrete models of high societal importance, but it seems mostly true.
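
The geometric intuition behind superposition can be checked numerically. The sketch below (plain Python, hypothetical helper names, arbitrary sizes and seed) draws more random unit vectors than there are dimensions in the low-dimensional case and compares the worst pairwise overlap in low and high dimension. It only illustrates that random directions become nearly orthogonal as the dimension grows; it does not prove the exponential count.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Random direction: Gaussian sample rescaled to unit length."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def max_abs_cosine(n_vectors, dim, seed=0):
    """Largest |cosine similarity| over all pairs of n_vectors random
    unit vectors in `dim` dimensions (a crude measure of interference)."""
    rng = random.Random(seed)
    vecs = [random_unit_vector(dim, rng) for _ in range(n_vectors)]
    worst = 0.0
    for i in range(n_vectors):
        for j in range(i + 1, n_vectors):
            cos = sum(a * b for a, b in zip(vecs[i], vecs[j]))
            worst = max(worst, abs(cos))
    return worst

low_dim = max_abs_cosine(n_vectors=30, dim=4)     # 30 directions crammed into 4 dims
high_dim = max_abs_cosine(n_vectors=30, dim=512)  # same 30 directions in 512 dims
print(f"dim=4:   max |cos| = {low_dim:.3f}")
print(f"dim=512: max |cos| = {high_dim:.3f}")
```

In the high-dimensional case the worst overlap is small, which is the sense in which many features can share a space "with minimal interference".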


*Methods/keywords*
- Activation patching: To identify important model activations, we define a performance metric, make a clean and a corrupted prompt, and pick a specific model activation. We run the model on the corrupted prompt, but overwrite that activation with its value from the clean run, to see how much of the clean performance is restored by this activation.

- Attribution: The process of determining which parts of a model (tokens, features, neurons) are responsible for a given output.
- Intervention: The act of changing the internal components of a model (like features or activations) to see how the output changes.
- Hook point: Allows us to intervene/edit the corresp. activation. We apply a hook function which replaces the model activation with our desired intervention (the function output). We can also extract an activation by having the hook store it and return nothing.
- Ablation/knockout: 'deleting' one activation (zero/mean/random ablation are different approaches)
- Toy model: Simple / very small model for easy interpretability.

- TransformerLens: Nanda's library for Mech Interp on GPT2-style LMs, specifically accessing/editing hook points in models ([demo](https://colab.research.google.com/github/neelnanda-io/TransformerLens/blob/main/demos/Main_Demo.ipynb#scrollTo=Eo1vbABrq9Ba)).
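
The hook, patching and ablation entries above fit together as in the sketch below. This is a hand-rolled toy, not the TransformerLens API: `toy_model`, its hook-point names and the inputs are all hypothetical, chosen so the arithmetic is easy to follow.

```python
def toy_model(x, hooks=None):
    """Hypothetical two-step 'model' with hook points "hidden" and "output".

    `hooks` maps a hook-point name to a function that receives the activation.
    A returned value replaces the activation; returning None leaves it
    unchanged (useful for just reading it out).
    """
    hooks = hooks or {}

    def hook_point(name, value):
        fn = hooks.get(name)
        if fn is None:
            return value
        replacement = fn(value)
        return value if replacement is None else replacement

    hidden = hook_point("hidden", [x[0] + x[1], x[0] - x[1]])
    return hook_point("output", 2.0 * hidden[0] + 0.5 * hidden[1])

clean, corrupted = [3.0, 1.0], [1.0, 1.0]

# 1) Clean run, caching the hidden activation (the hook returns None).
cache = {}
def save_hidden(act):
    cache["hidden"] = list(act)

clean_out = toy_model(clean, hooks={"hidden": save_hidden})
corrupted_out = toy_model(corrupted)

# 2) Activation patching: corrupted run, one hidden unit overwritten
#    with its cached clean value.
def patch_unit(i):
    def hook(act):
        patched = list(act)
        patched[i] = cache["hidden"][i]
        return patched
    return hook

restored = {}
for i in range(2):
    patched_out = toy_model(corrupted, hooks={"hidden": patch_unit(i)})
    # Fraction of the clean-vs-corrupted gap recovered by this unit.
    restored[i] = (patched_out - corrupted_out) / (clean_out - corrupted_out)

# 3) Zero ablation: clean run with hidden unit 0 knocked out.
def zero_unit(i):
    def hook(act):
        ablated = list(act)
        ablated[i] = 0.0
        return ablated
    return hook

ablated_out = toy_model(clean, hooks={"hidden": zero_unit(0)})
print(restored, ablated_out)
```

Here the metric is simply the model output; in a real experiment it would be something like a logit difference between the correct and incorrect answer token.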

*Concepts*
- Features: Directions in semantic space / idealised nodes - a basic concept the model has grasped.
- Circuits: Connection of features by weights to grasp complex concepts (eg car = wheel + body + windows + doors etc.). Tightly linked features making a subgraph.
    - Ex: curve detectors, consisting of multiple units to span all orientations, can be viewed as one idealised node; a feature, which later feeds into circuits which do eg circle detection.
- Feature visualisation: By using gradient descent with cost function maximizing the firing of a (known) feature/circuit, we can tweak the model input to maximally represent the feature (this uses LH, and gives a causal link).
- Feature implementation: A known circuit can be reimplemented by hand - if the behaviour remains, it is an isolated algorithm, like we want.
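
A very reduced sketch of feature visualisation, assuming the feature is a single linear read-out direction `w` (a made-up vector) and the input is constrained to unit norm. In a real network the gradient would come from backpropagation through the model; here it is simply `w`, so the optimised input should end up aligned with the feature direction.

```python
import math

def feature_activation(x, w):
    """Pre-activation of a hypothetical feature with read-in direction w."""
    return sum(a * b for a, b in zip(x, w))

def normalise(x):
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]

def visualise_feature(w, steps=200, lr=0.1):
    """Gradient ascent on the input, projected back to unit norm each step.
    For this linear feature the gradient of the activation w.r.t. x is w."""
    x = normalise([1.0] * len(w))      # arbitrary starting input
    for _ in range(steps):
        grad = w                       # d(w . x) / dx
        x = normalise([xi + lr * gi for xi, gi in zip(x, grad)])
    return x

w = [1.0, -2.0, 0.5, 3.0]
x_start = normalise([1.0] * len(w))
x_opt = visualise_feature(w)
cosine = feature_activation(x_opt, normalise(w))  # alignment with the feature
print(f"activation before: {feature_activation(x_start, w):.3f}")
print(f"activation after:  {feature_activation(x_opt, w):.3f}")
```

The norm constraint plays the role of the regularisers used in practice: without some constraint the activation could be increased indefinitely just by scaling the input.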

*Circuits* (sufficiently small subgraphs that one can make falsifiable predictions and reconstruct by hand)
- Motifs: Recurring abstract patterns in circuits such as equivariance, superposition, unioning:

    - Equivariant circuits: Exhibit a symmetry in the weights corresponding to a symmetry in the problem to be solved. For instance, a curve detector (orientation invariant) is composed of directed curve detectors, and these are essentially rotations of each other.

    - Unioning over cases: Often a concept is the union of several cases, eg the concept curve/dog is orientation invariant. The network might detect each case separately (inhibiting each other on the way, XOR-properties) and then take the union at the end to get at the general concept.

    - Phenomenon superposition: A pure neuron is pushed forward to the output through superpositions to save on neurons (rather than keeping a separate 'trivial' stream encoding dog - dog - dog - ... until the output). Superposition is most useful when the concepts are mostly separate, so that the model can retrieve the correct concept without interference.

- Induction circuits: Circuit to detect/continue repeated subsequences. Composed of two consecutive heads (previous token head + induction head, which attends to the token right after the previous instance of the current token). Works for arbitrary length patterns. Not too hard to find, since they attend to tokens with the same spacing.
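
The behaviour of an idealised induction head can be written down directly. The sketch below (hypothetical function name, made-up token sequence) computes, for each position, where such a head would attend and which token it would then predict. It describes the target behaviour only, not how the two heads implement it.

```python
def induction_targets(tokens):
    """For each position i, return the position an idealised induction head
    attends to: one after the most recent earlier occurrence of tokens[i],
    or None if the token has not appeared before."""
    last_seen = {}
    targets = []
    for i, tok in enumerate(tokens):
        j = last_seen.get(tok)
        targets.append(j + 1 if j is not None else None)
        last_seen[tok] = i
    return targets

tokens = ["A", "B", "C", "D", "A", "B", "C", "D"]
targets = induction_targets(tokens)

# The head copies the attended token forward as its prediction for the next token.
predictions = [tokens[t] if t is not None else None for t in targets]

# Distance between each query position and the position it attends to.
offsets = [i - t for i, t in enumerate(targets) if t is not None]
print(targets, predictions, offsets)
```

On a sequence repeated with period 4, every offset equals 3 (period minus one), which is the constant spacing that makes induction heads easy to spot in attention patterns.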



## Articles/Resources

- Arena tutorials: https://arena3-chapter1-transformer-interp.streamlit.app/[1.2]_Intro_to_Mech_Interp

- Neel Nanda Quickstart Guide
- Neel Nanda Reading List

Demos/tutorials for Neuronpedia:


- Article: https://transformer-circuits.pub/2025/attribution-graphs/biology.html

- GitHub: https://github.com/safety-research/circuit-tracer?tab=readme-ov-file

- Intervention demo: https://github.com/safety-research/circuit-tracer/blob/main/demos/intervention_demo.ipynb
- Attribution demo: https://github.com/safety-research/circuit-tracer/blob/main/demos/attribute_demo.ipynb
- Circuit tracing tutorial: https://github.com/safety-research/circuit-tracer/blob/main/demos/circuit_tracing_tutorial.ipynb


- Exploration demo: https://colab.research.google.com/github/neelnanda-io/TransformerLens/blob/main/demos/Exploratory_Analysis_Demo.ipynb

