# Overview of decoder-style models and methods for interpretability


## Table of contents

1. [Theory](#Theory)
    - [Transformers](#Transformers)
    - [Features](#Features)
    - [Circuits](#Circuits)
    - [Decoders](#Decoders)
    - [A note on metrics](#a-note-on-metrics)
2. [Methods](#Methods)
    - [Induction Heads](#induction-heads)
    - [Attribution Patching](#attribution-patching)
    - [Activation Patching](#activation-patching)
    - [Logit Lens](#logit-lens)
    - [Direct Logit Attribution](#direct-logit-attribution)
3. [Libraries](#Libraries)
    - [TransformerLens](#transformerlens)
    - [CircuitsVis](#circuitsvis)
    - [Circuit Tracer](#circuit_tracer)

## Introduction

...blablabla...

General vs specific

Complete vs incomplete

## Theory

### Transformers
Each transformer block has two parts: the attention layer and the multilayer perceptron (MLP).

### Features


A feature is a (taken from notes...) 

### Circuits

A circuit is a somewhat vague term in mechanistic interpretability, but in general we can say that it is some subsystem of the entire model network performing some specific task. These tasks can themselves be more or less vague; one circuit might detect if an input prompt is written in English, while another activates if a certain concept or idea is present.

More specifically, when we look for and study circuits in a model, we think of a circuit as a set of nodes and the connections between them. As circuits are rarely isolated, we must approximate slightly and often group smaller, similar circuits into a single larger circuit. (Neuronpedia...) However, finding circuits is not easy without a good hypothesis. While many circuits and features have been found in smaller models (induction heads, capital Texas, IOI), there is still a significant part we do not yet understand.

### Decoders

hdh

### A note on metrics

When it comes to decoders, there are multiple ways to measure the output of the model. The most obvious is to simply look at the next predicted token. If we intervene at a point in the network and the predicted token changes, this could indicate that the intervention point contributed meaningfully to that output. However, looking at the output tokens does not directly give us a way to quantify the changes in output. Therefore, we might need other metrics to judge whether a change was relevant, and if so, in what way.

To quantify the output of the model, we can start by looking at the output probabilities. For each position, the model produces a vector whose length equals the size of the vocabulary; after a softmax, this vector contains the probability of each token being the next in the sequence. By looking at e.g. the 10 most likely tokens and how they change after an intervention, we can gain some insight into how the intervention point affects the output. For instance, if ablating a node moves words connected to a certain feature down the list and words connected to an "opposite" feature up the list, we might conclude that the node contributes positively to the first feature.
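
As a minimal sketch of this comparison, the snippet below turns two sets of logits into probabilities and lists the most likely tokens before and after an intervention. The vocabulary, the logits and the helper names `softmax` and `top_k` are made up for illustration and do not come from a real model or library.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, vocab, k):
    """Return the k most likely (token, probability) pairs, most likely first."""
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical toy vocabulary and logits (assumptions, not real model output).
vocab = ["Oslo", "Stockholm", "Bergen", "the", "a"]
logits_before = [4.0, 2.0, 1.5, 0.5, 0.0]
logits_after = [2.0, 3.5, 1.5, 0.5, 0.0]  # after a made-up intervention

top_before = top_k(softmax(logits_before), vocab, k=3)
top_after = top_k(softmax(logits_after), vocab, k=3)

# Print the two rankings side by side.
for rank, (before, after) in enumerate(zip(top_before, top_after), start=1):
    print(f"{rank}. before: {before[0]:<10} {before[1]:.3f} | after: {after[0]:<10} {after[1]:.3f}")
```

In this toy case "Oslo" and "Stockholm" swap places in the ranking, which is the kind of change one would look for after a real intervention.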

Although the output probabilities are an improvement over looking at the output token in isolation, this measurement still has problems. The difficulty is that we have to interpret words and tokens quantitatively. It is, for example, hard to say what an "opposite" feature might be, which makes the interpretation in the paragraph above hard to justify. To counter this, we can look at the logits instead, that is, the raw scores before they are converted to probabilities. These are more natural to compare as numbers, and are thus better suited for a proper analysis. The difference in the logit of a specific token before and after an intervention gives us a concrete measurement to use in visualizations or analyses.
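
A small sketch of such a logit-based measurement is given below. It computes the change in one token's logit and, as a commonly used variant that is an addition here, the difference between the logits of two candidate answers. All numbers and the helper names `logit_change` and `logit_diff` are hypothetical.

```python
def logit_change(logits_before, logits_after, vocab, token):
    """Change in the logit of one token caused by an intervention."""
    idx = vocab.index(token)
    return logits_after[idx] - logits_before[idx]

def logit_diff(logits, vocab, correct, incorrect):
    """Logit of the correct answer minus logit of a competing answer."""
    return logits[vocab.index(correct)] - logits[vocab.index(incorrect)]

# Hypothetical logits (assumptions, not real model output).
vocab = ["Oslo", "Stockholm", "Bergen"]
logits_before = [5.0, 1.0, 2.0]
logits_after = [3.0, 2.5, 2.0]  # after a made-up intervention

oslo_change = logit_change(logits_before, logits_after, vocab, "Oslo")
diff_before = logit_diff(logits_before, vocab, "Oslo", "Stockholm")
diff_after = logit_diff(logits_after, vocab, "Oslo", "Stockholm")

print("change in the Oslo logit:", oslo_change)
print("logit difference before/after:", diff_before, diff_after)
```

Either number can be plotted per layer or per position, which is harder to do with a ranked list of tokens.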

Sometimes we only want to look at the internals of a network, without focusing on the output. In these cases, the attention patterns are a useful thing to inspect, since they show which earlier tokens each token attends to. They can also be easily visualized using the [CircuitsVis](#circuitsvis) package.


-L2, cosine etc for encoders? Relevant for a thorough guide, I suppose...


## Methods


### Induction heads
- More general, but also incomplete, i.e. it does not explain all of the decision process in a model

Induction heads are one of the simplest circuits found in transformer models. They locate repeated tokens: when the current token has appeared earlier in the prompt, an induction head attends to the token that followed it last time, which lets the model reuse information from the input prompt. This is easiest to see in simple, repeated prompts like "The capital of Norway is Oslo. The capital of Norway is Oslo."

We can visualize the attention patterns of the model and look for heads whose pattern shows a clear stripe offset from the diagonal in the repeated part of the prompt. Hence, locating induction heads is among the simplest tasks in mechanistic interpretability and a good starting point for studying a model.
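
Besides looking at the plot, one can put a number on how induction-like a head is. The sketch below assumes a prompt made of one segment of `segment_len` tokens repeated twice and averages the attention that each token in the second copy pays to the token right after its first occurrence. The attention patterns are hand-built toy matrices, and the function names are hypothetical.

```python
def induction_score(pattern, segment_len):
    """Mean attention from second-copy tokens to the token after their first occurrence."""
    seq_len = len(pattern)
    scores = [pattern[i][i - segment_len + 1] for i in range(segment_len, seq_len)]
    return sum(scores) / len(scores)

def uniform_causal_pattern(seq_len):
    """Each query attends equally to itself and all earlier positions."""
    return [[1.0 / (i + 1) if j <= i else 0.0 for j in range(seq_len)]
            for i in range(seq_len)]

def toy_induction_pattern(seq_len, segment_len, strength=0.9):
    """Like the uniform pattern, but second-copy queries focus on the induction target."""
    pattern = uniform_causal_pattern(seq_len)
    for i in range(segment_len, seq_len):
        target = i - segment_len + 1
        rest = (1.0 - strength) / i  # spread over the i other visible positions
        pattern[i] = [strength if j == target else (rest if j <= i else 0.0)
                      for j in range(seq_len)]
    return pattern

segment_len, seq_len = 4, 8
score_induction = induction_score(toy_induction_pattern(seq_len, segment_len), segment_len)
score_uniform = induction_score(uniform_causal_pattern(seq_len), segment_len)

print(f"induction-like head: {score_induction:.2f}")
print(f"uniform head:        {score_uniform:.2f}")
```

With a real model, `pattern` would be the attention pattern of a single head on the repeated prompt, and heads with a high score are candidates for induction heads.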

Although induction heads provide a good starting point, they do not give a complete understanding of a model. They are too simple to account for every "decision" in the model. Therefore, we need more sophisticated theories and methods for explaining the network.


### Attribution patching
- Using a local linear approximation of the network, we can simplify activation patching so that we can patch more heads in fewer passes through the network.

This is a more general method, suited for a first sweep through the model and for locating areas of interest. The method works best when patching smaller parts of the model in parallel, like attention heads, because the linear approximation is more accurate for these.
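
The idea can be sketched on a toy problem. The estimated effect of patching one component is (clean activation minus corrupted activation) times the gradient of the metric with respect to that activation, evaluated on the corrupted run. Below, the "network" is a made-up scalar function of three activations with an analytic gradient standing in for a backward pass; all names and numbers are assumptions for illustration.

```python
import math

WEIGHTS = [0.5, -0.3, 0.8]  # hypothetical readout weights

def metric(acts):
    """Toy nonlinear metric computed from three activations."""
    return math.tanh(sum(w * a for w, a in zip(WEIGHTS, acts)))

def metric_grad(acts):
    """Analytic gradient of the metric; stands in for one backward pass."""
    m = metric(acts)
    return [w * (1.0 - m * m) for w in WEIGHTS]

clean_acts = [0.6, 0.1, 0.4]
corrupt_acts = [0.2, 0.3, 0.1]

# Attribution patching: one gradient computation gives an estimate for every component.
grad = metric_grad(corrupt_acts)
estimated = [(c - x) * g for c, x, g in zip(clean_acts, corrupt_acts, grad)]

# Activation patching for comparison: one extra evaluation per component.
def patch_one(index):
    patched = list(corrupt_acts)
    patched[index] = clean_acts[index]
    return metric(patched) - metric(corrupt_acts)

exact = [patch_one(i) for i in range(len(corrupt_acts))]

for i, (est, ex) in enumerate(zip(estimated, exact)):
    print(f"component {i}: estimated {est:+.4f}, exact {ex:+.4f}")
```

The estimates are close to the exact values here because the activations change only a little; for larger changes the linear approximation becomes less reliable.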



### Activation patching
- Choosing one clean and one corrupted prompt, we patch the clean activations onto the corrupted run. We then measure how much this changes the output in the clean direction.

Activation patching is a method that shows which parts of a network are responsible for which features and how the information flows through the network. It works best if the prompts have the same length and format, that is, if the only difference is the specific feature one wants to study. Note that the length of a prompt is its number of tokens, which may or may not equal the number of words.

To perform activation patching, we must first choose a clean and a corrupted prompt with corresponding answers. As an example, take the clean prompt "The capital of Norway is ", with correct answer "Oslo", and the similar corrupted prompt "The capital of Sweden is ", with corresponding answer "Stockholm". We first run the clean prompt through the model and cache the activations. When we then run the corrupted prompt through the model, we stop at a selected token position and layer and replace the corresponding activations with the cached clean activations. By looping through all positions for all layers, we can see how the change in the output depends on where we patch. In particular, we see which nodes (here, pairs of layer and position) are important for that specific feature. In our example, the feature is the name of the capital.
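
The loop over layers and positions can be sketched with a toy stand-in for a model. Below, a list of scalars plays the role of the residual stream, and each "layer" adds a causal mean of the stream to every position. This is not a real transformer and does not use any library; the function `run`, the weights and the inputs are all hypothetical.

```python
LAYER_WEIGHTS = [0.5, -0.25, 0.75]  # hypothetical per-layer mixing weights

def run(inputs, patch=None, cache=None):
    """Run the toy model; optionally patch one (layer, position) and cache layer inputs."""
    resid = list(inputs)
    for layer, weight in enumerate(LAYER_WEIGHTS):
        if patch is not None and patch[0] == layer:
            _, patch_pos, value = patch
            resid[patch_pos] = value  # overwrite with the cached clean activation
        if cache is not None:
            cache[layer] = list(resid)
        prev = resid
        # Causal mixing: position p only sees positions 0..p.
        resid = [x + weight * sum(prev[: p + 1]) / (p + 1) for p, x in enumerate(prev)]
    return resid[-1]  # the "output" is read off the last position

clean_inputs = [1.0, 2.0, 3.0, 4.0]
corrupt_inputs = [1.0, 2.0, -3.0, 4.0]  # differs only at position 2

clean_cache = {}
clean_out = run(clean_inputs, cache=clean_cache)
corrupt_out = run(corrupt_inputs)

# Fraction of the clean output recovered by patching each (layer, position).
recovered = []
for layer in range(len(LAYER_WEIGHTS)):
    row = []
    for pos in range(len(corrupt_inputs)):
        patched_out = run(corrupt_inputs, patch=(layer, pos, clean_cache[layer][pos]))
        row.append((patched_out - corrupt_out) / (clean_out - corrupt_out))
    recovered.append(row)

for layer, row in enumerate(recovered):
    print(f"layer {layer}: " + "  ".join(f"{v:5.2f}" for v in row))
```

A value of 1 means the patch fully restores the clean output and 0 means it has no effect. In this toy run the information starts at position 2 and gradually moves to the last position in later layers, which is the kind of picture one hopes to get from a real model.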

### Logit Lens

One way to look at how a model arrives at its decision is a method called logit lens. It works by imagining that the network is cut short: if we stopped a forward pass after a certain number of layers, what would the most likely output be? This way, we can see how the predictions change throughout the network.
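
A minimal sketch of the idea: after each layer, take the residual stream at the last position, map it to logits with the unembedding matrix, and record the top token. The two-dimensional residual stream, the unembedding `W_U` and the layer updates below are invented toy values. In a real model the final layer norm is usually applied before unembedding, which is left out here.

```python
VOCAB = ["Oslo", "Stockholm", "the"]

# Hypothetical unembedding: one row per residual dimension, one column per token.
W_U = [
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.5],
]

# Hypothetical updates that each layer adds to the residual stream.
LAYER_UPDATES = [[0.2, 0.6], [0.5, -0.1], [0.8, -0.4]]
INITIAL_RESID = [0.1, 0.1]

def unembed(resid):
    """Map a residual stream vector to one logit per vocabulary token."""
    return [sum(resid[d] * W_U[d][t] for d in range(len(resid)))
            for t in range(len(VOCAB))]

def logit_lens(initial_resid, updates):
    """Return the top token predicted from the residual stream after each layer."""
    resid = list(initial_resid)
    predictions = []
    for update in updates:
        resid = [r + u for r, u in zip(resid, update)]
        logits = unembed(resid)
        best = max(range(len(VOCAB)), key=lambda t: logits[t])
        predictions.append(VOCAB[best])
    return predictions

predictions = logit_lens(INITIAL_RESID, LAYER_UPDATES)
for layer, token in enumerate(predictions):
    print(f"after layer {layer}: {token}")
```

In this toy run the early prediction differs from the final one, which illustrates how the method shows a prediction changing with depth.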



### Direct Logit Attribution


### ACDC

### Causal Scrubbing


## Libraries

### TransformerLens
- TransformerLens: Apart from some gaps in the documentation of inputs and outputs, this is a very suitable library for decoder-style models.

Hook points:...

### CircuitsVis

- CircuitsVis: Works well with TransformerLens. Has nice (and easy to use) visualizations for attention patterns.
        -> Can be used for finding induction heads.

### Circuit_Tracer
- Circuit_Tracer: No success
    -> The environment crashes, gpt2 is not supported, and Colab does not have enough memory.
    Virtual environment with a downgraded transformer_lens?