# MATS Application - Christophe Thomassin: <br> Factual Knowledge in Transformer Language Models

Note: I will try to back up my code with comments for better understanding as time allows. For an in-depth analysis, however, I refer to the Google Docs document: [Link](https://docs.google.com/document/d/13CZcQ818lBNqBGEqn3wxq3iVc7uHldprSr2tgTrBUr8/edit?usp=sharing)

**The Task**: <br>
I am interested in understanding how Transformer language models store and retrieve knowledge, specifically factual knowledge. Factual knowledge is essential to virtually every information-processing task, since facts are the underlying assumptions of almost every reasoning step. If LLMs were unable to acquire factual knowledge, they would be little more than random word generators. I have three reasons for thinking that applying mechanistic interpretability (MI) to how LLMs acquire and store factual knowledge is a good idea:
1. I see factual knowledge as the simplest form of intelligence, the entry point to cognitive capabilities. Hence, I find it intuitive to start with factual knowledge when trying to "mechanistically" unwind the complexity of LLMs. Once understood, it will open a lot of doors for digging deeper into the "mind" of Transformers.
2. Factual knowledge, or rather the lack of it, can be considered a major driver of hallucinations. Understanding how and where factual knowledge is stored could allow us to remediate many of them.
3. Factual knowledge, as the name indicates, is based on commonly known facts, which should make it quite easy to evaluate (one would think): a fact can only be True or False.

One of the main drawbacks is that we cannot expect factual knowledge to be "universal". Au contraire: because of its factual nature, it is highly dependent on the training data. A model cannot reason its way to Paris being the capital of France without having been told so during pre-training. Hence, some models may have developed specific factual knowledge while others have not. Still, my hope is that the mechanisms for storing and retrieving factual knowledge are somewhat universal.

## Setup

In [1]:
# allows to reload packages when reimported
%load_ext autoreload
%autoreload 2

In [2]:
import json
import os

import numpy as np
import pandas as pd
import torch
import transformer_lens.utils as utils
from dotenv import load_dotenv
from huggingface_hub import login
from transformer_lens import ActivationCache, FactoredMatrix, HookedTransformer

from src.config import DATA

In [3]:
load_dotenv()

login(token=os.getenv("HF_TOKEN"))

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /home/chrisoutho_gmail_com/.cache/huggingface/token
Login successful


In [4]:
torch.set_grad_enabled(False)

device: torch.device = utils.get_device()
print(f"Device: {device}")

Device: cuda


### Load data

In [5]:
# load data
df = pd.read_csv(DATA / "fk_samples.csv")
print(df.head())

                                  question  answer
0        Is Paris the capital of France?\n    True
1        Is Paris the capital of France?\n    True
2  Is the Eiffel Tower located in Paris?\n    True
3   Is Paris known as the City of Light?\n    True
4          Is the Louvre Museum in Rome?\n   False


In [6]:
with open(DATA / "answer_map.json", "r") as f:
    answer_map = json.load(f)

print(answer_map)

{'True': ['Yes', 'yes', 'Sure', 'Correct', 'Certainly', 'Absolutely', 'Indeed', 'True', 'Yeah', ' Yes', 'Yep'], 'False': ['No', 'no', 'Nope', ' No', ' no', 'Wrong', 'NO', 'False']}


### Load model

In [7]:
model = HookedTransformer.from_pretrained("google/gemma-2b-it", device=device)

`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Loaded pretrained model google/gemma-2b-it into HookedTransformer


### Get model activations

In [8]:
from src.logit_diff import df_to_logits

logits, cache = df_to_logits(model, df)

## Task definition

As said before, we want to understand how the model stores and retrieves factual knowledge. To do so, as in every MI investigation, we have to define a task that lets us evaluate LLMs on the concept of interest, here factual knowledge. There are a few things to consider when engineering this task:
1. Most importantly, we want to isolate the concept we are testing for, here factual knowledge, as much as possible. Good performance on our task should imply that the model is good at exactly the concept we are interested in. Conversely, if a model does not perform well on our task, this should imply that it is not good at this concept.
2. The generation should be only one token long for easy evaluation. Ideally, one can also name a second token that would be expected if the model could not answer the query correctly. These two tokens can then be used to introduce logit difference as the evaluation metric for the task.
3. It should be easy to scale the task to many different examples to rule out randomness.
4. It should be possible, by changing only a few tokens, to remove the concept from the task and make it random. This alternative distribution is needed for path patching and ablation.

For this time-constrained assessment, I have to focus on a specific example. I chose to inspect factual knowledge around the French city "Paris". Paris is well known enough to appear in pretty much every pre-training dataset, yet offers enough diversity to ask multiple questions in order to demonstrate generalisation.

Taking all these requirements into account, I selected a question-answer format as the task for factual knowledge. This format isolates specific factual knowledge, yields one-token generations, scales perfectly, and makes it possible to "disarm" the query by removing the "key".

An example, which I will use a lot throughout this assessment, is: <br>
    - Query: Is Paris the capital of France? <br>
    - Label: True <br>
    - Opposite Generation: False <br>
    - Clue: capital, France <br>
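
The "disarming" idea can be illustrated at the word level. The snippet below is only a sketch: the `FactSample` container, the `disarm` helper, and the filler word are hypothetical and not part of the notebook's `src` package. It keeps the number of words constant, but the model's tokenizer may still split the filler differently from the original clue.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class FactSample:
    """One factual-knowledge query (hypothetical container, not from `src`)."""

    query: str
    label: str  # "True" or "False"
    clue: tuple  # words that carry the fact being tested


def disarm(sample: FactSample, filler: str = "thing") -> str:
    """Replace every clue word with a neutral filler, keeping the word count."""
    disarmed = sample.query
    for word in sample.clue:
        # \b avoids touching substrings of longer words.
        disarmed = re.sub(rf"\b{re.escape(word)}\b", filler, disarmed)
    return disarmed


sample = FactSample(
    query="Is Paris the capital of France?\n",
    label="True",
    clue=("capital", "France"),
)
print(repr(disarm(sample)))
```

Keeping the word count fixed is a deliberate choice here, since patching experiments usually compare activations position by position.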

I admit this task has a drawback: I am fairly sure that question answering is a circuit of its own, worth studying by itself. I expect the question-answering circuit and the factual knowledge circuit to overlap, which means I did not isolate our concept 100%. Nonetheless, after a lot of back-and-forth, I believe this is the best way of testing factual knowledge for now.

### Single example

Let's see if the model is able to confidently solve this task in the first place:

In [34]:
query = "Is Paris the capital of France?\n"
ground_truth = "Yes"

<details><summary>Note</summary>
It seems I have to add the '\n' token at the end of the query for this model to answer properly.
</details>

In [10]:
utils.test_prompt(query, ground_truth, model, prepend_bos=True, prepend_space_to_answer=False)

Tokenized prompt: ['<bos>', 'Is', ' Paris', ' the', ' capital', ' of', ' France', '?', '\n']
Tokenized answer: ['Yes']


Top 0th token. Logit: 32.54 Prob: 55.63% Token: |Sure|
Top 1th token. Logit: 32.09 Prob: 35.34% Token: |Yes|
Top 2th token. Logit: 29.96 Prob:  4.22% Token: |No|
Top 3th token. Logit: 29.90 Prob:  3.96% Token: |Paris|
Top 4th token. Logit: 26.54 Prob:  0.14% Token: |Answer|
Top 5th token. Logit: 26.24 Prob:  0.10% Token: |Certainly|
Top 6th token. Logit: 26.02 Prob:  0.08% Token: |The|
Top 7th token. Logit: 25.92 Prob:  0.07% Token: |Of|
Top 8th token. Logit: 25.76 Prob:  0.06% Token: |Correct|
Top 9th token. Logit: 25.70 Prob:  0.06% Token: |*|


While this looks good, we see the first "inconvenience" of our task here. The model is more than capable of answering the query. Yet it does not concentrate all its probability mass on a single "True-token" such as "Yes" but spreads it across several, such as "Sure" (56% probability), "Yes" (35% probability), and "Certainly" (0.1% probability). This is most likely because all of these words are used in the training data to assert that something is true. While this does not make the generation wrong, it raises two questions:
1. Is the model able to understand that all these tokens express "True" and build a circuit out of that?
2. How do we evaluate the task if we cannot measure its outcome based on the logit/probability of a single token?

For the first question, I do not have a concrete answer, but for the sake of this experiment I will simply hope that the model chosen here is indeed able to abstract the answers into "Fact is True" and "Fact is False". Larger models such as GPT-4o are definitely able to do so. This model achieved close to 100% accuracy on 10 simple factual knowledge questions around Paris under the condition of answering with True or False only. For the second question, we will have to find a solution that captures all possible True- and False-Tokens.

Going through a bunch of example questions, I have mapped all True- and False-Tokens to their underlying meaning. Although this is computationally not very elegant (especially because the lists have different lengths), we will always have to evaluate the task on all these tokens.
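
A minimal sketch of what evaluating on all these tokens could look like. It is not the code in `src`: `to_token_id` stands in for a tokenizer lookup (in TransformerLens this would plausibly be `model.to_single_token`), and the toy vocabulary and logits are made up for illustration.

```python
from typing import Callable, Dict, List


def map_answers_to_ids(
    answer_map: Dict[str, List[str]],
    to_token_id: Callable[[str], int],
) -> Dict[str, List[int]]:
    """Turn each answer string into a token id; the lists may differ in length."""
    return {
        label: [to_token_id(text) for text in strings]
        for label, strings in answer_map.items()
    }


def summed_logit_diff(
    final_logits: List[float],
    answer_ids: Dict[str, List[int]],
    label: str,
) -> float:
    """Sum of logits for the correct group minus the sum for the opposite group."""
    true_sum = sum(final_logits[i] for i in answer_ids["True"])
    false_sum = sum(final_logits[i] for i in answer_ids["False"])
    diff = true_sum - false_sum
    # Flip the sign for False-labelled prompts so that positive means "correct".
    return diff if label == "True" else -diff


# Toy vocabulary and final-position logits (made-up numbers).
toy_vocab = {"Yes": 0, "Sure": 1, "No": 2, "Paris": 3}
toy_logits = [32.0, 32.5, 30.0, 29.9]
toy_map = {"True": ["Yes", "Sure"], "False": ["No"]}

ids = map_answers_to_ids(toy_map, toy_vocab.__getitem__)
print(summed_logit_diff(toy_logits, ids, label="True"))
```

Because the two groups have different sizes, a plain sum favours the larger group; averaging per group would be one alternative, at the cost of no longer matching the metric used in this notebook.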

In [11]:
answer_map

{'True': ['Yes',
  'yes',
  'Sure',
  'Correct',
  'Certainly',
  'Absolutely',
  'Indeed',
  'True',
  'Yeah',
  ' Yes',
  'Yep'],
 'False': ['No', 'no', 'Nope', ' No', ' no', 'Wrong', 'NO', 'False']}

In [12]:
from src.utils import answer_map_to_tokens

answer_map_tokens = answer_map_to_tokens(model, answer_map)

In [35]:
df = pd.DataFrame({'question': [query], 'answer': ["True"]})

In [36]:
from src.logit_diff import logits_to_logit_diff

logits, cache = df_to_logits(model, df)
logit_diff = logits_to_logit_diff(df, logits, answer_map_tokens, device)
print(f"Logit difference of True and False tokens summed up is {logit_diff:.2f} nats")

Logit difference of True and False tokens summed up is 113.51 nats


When accounting for all True- and False-Tokens, the model is clearly able to classify the statement as True. If the model were able to concentrate the probability mass of all True-Tokens on one token and that of all False-Tokens on another, the True-Token would have an $e^{113}\approx 10^{49}\times$ higher probability.
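
The arithmetic behind this statement can be checked in a few lines. The logits and probabilities are copied from the `test_prompt` output for the example query; the only assumption is the standard softmax property that the probability ratio of two tokens equals the exponential of their logit difference.

```python
import math

# Logits and probabilities reported by `test_prompt` for the example query.
logit_yes, prob_yes = 32.09, 0.3534
logit_no, prob_no = 29.96, 0.0422

# Under one softmax, the probability ratio of two tokens is exp(logit difference).
ratio_from_logits = math.exp(logit_yes - logit_no)
ratio_from_probs = prob_yes / prob_no
print(f"ratio from logits: {ratio_from_logits:.2f}, from probabilities: {ratio_from_probs:.2f}")

# Converting the summed logit difference of 113.51 into powers of ten.
orders_of_magnitude = 113.51 / math.log(10)
print(f"e^113.51 is about 10^{orders_of_magnitude:.1f}")
```

The two ratios agree up to the rounding of the printed values, which is what makes a logit difference interpretable as a log probability ratio in the two-token case.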

<details> <summary>Note</summary>
Considering typical logit distributions over the vocabulary at the end of a Transformer, it is questionable whether the model would assign such a high logit to the True-Token even if it could concentrate on two different tokens. Furthermore, we have to consider that far more positive tokens are among the top 10 tokens, which is why the difference is so big in the first place. Nevertheless, this supports our hypothesis that the model can retrieve factual knowledge and use it to answer questions.
</details>

### Build dataset

Let's get some more examples testing factual knowledge around Paris to make sure our results are representative. I will use OpenAI's GPT-4o for dataset generation and subsequently filter to include only valid examples. This might seem unnecessary for our experiment, since there are not too many question-answer pairs around Paris that such a small model can answer, but it makes it easy to adapt this script to other use cases in the future, and most of the code is simply recycled...

In [15]:
model_name = "gpt-4o-2024-08-06"
query = [
    {
        "role": "system",
        "content": (
            "Please generate 10 questions about Paris that can be answered with "
            "either Yes or No. Annotate each question with True if the answer is "
            "Yes and False if the answer is No. The questions should be easy to "
            "answer for any human. Return the output as a JSON object with each "
            "entry having the keys 'question' and 'answer' filled with a list of "
            "10 values. There should be 5 questions with answer Yes and 5 "
            "questions with answer no. Each question should end with a question "
            "mark followed by the newline character.\n"
            "Example question: Is Paris the capital of France?\n"
            "Example answer: True"
        )
    }
]
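
The prompt asks for a JSON object whose `question` and `answer` keys each hold a list. A hedged sketch of how such a response could be checked before use follows; `filter_valid_samples` is a hypothetical helper, not part of `src.data`, and it only validates the format, not whether the model can answer the question.

```python
import json
from typing import Dict, List


def filter_valid_samples(raw_json: str) -> List[Dict[str, str]]:
    """Keep only well-formed question/answer pairs from a JSON response."""
    payload = json.loads(raw_json)
    questions = payload.get("question", [])
    answers = payload.get("answer", [])
    valid = []
    for question, answer in zip(questions, answers):
        answer = str(answer)  # tolerate JSON booleans as well as strings
        if not isinstance(question, str) or not question.endswith("?\n"):
            continue  # does not match the prompt format used in this notebook
        if answer not in {"True", "False"}:
            continue  # label must be one of the two classes
        valid.append({"question": question, "answer": answer})
    return valid


example = json.dumps(
    {
        "question": [
            "Is Paris the capital of France?\n",
            "Is the Louvre Museum in Rome?",  # missing trailing newline
            "Is Montmartre a district in Paris?\n",
        ],
        "answer": ["True", "False", "Maybe"],  # last label is invalid
    }
)
print(filter_valid_samples(example))
```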

In [16]:
import asyncio

import nest_asyncio

from src.data import invoke_openai

nest_asyncio.apply()

dataset = asyncio.run(invoke_openai(model_name, query))


In [17]:
dataset = pd.DataFrame(dataset[0])
df = pd.concat([df, dataset], ignore_index=True)

In [18]:
df.head(10).style.set_properties(**{'text-align': 'left'}).set_table_styles([dict(selector='th', props=[('text-align', 'left')])])

| | question | answer |
|---|---|---|
| 0 | Is Paris the capital of France? | True |
| 1 | Is Paris the capital of France? | True |
| 2 | Is the Eiffel Tower located in Paris? | True |
| 3 | Is Paris known as the City of Light? | True |
| 4 | Is the Louvre Museum in Rome? | False |
| 5 | Is the official language of Paris English? | False |
| 6 | Is Montmartre a district in Paris? | True |
| 7 | Does Paris have a desert climate? | False |
| 8 | Is the Seine River in Paris? | True |
| 9 | Is Paris larger in area than New York City? | False |


Finally, let's make sure our model is able to answer all questions correctly...

In [19]:
logits, cache = df_to_logits(model, df)

In [26]:
print(
    "Per prompt logit difference:",
    logits_to_logit_diff(df, logits, answer_map_tokens, device, per_prompt=True)
    .cpu()
    .round(decimals=3),
)
print(
    "Average logit difference:",
    logits_to_logit_diff(df, logits, answer_map_tokens, device)
    .clone()
    .detach()
    .cpu()
    .round(decimals=2),
)

Per prompt logit difference: tensor([ 103.9010,  103.9010,   89.8990,  146.5240,  -50.1270,  -63.0340,
         125.9540,  -24.3910,  108.8900, -118.3550,   91.9470])
Average logit difference: tensor(46.8300)


Hmmm, for four different samples we get a negative logit difference. Let's have a look at these cases:

In [27]:
for i in [4, 5, 7, 9]:
    utils.test_prompt(df["question"].iloc[i], df["answer"].iloc[i], model, prepend_bos=True, prepend_space_to_answer=False)

Tokenized prompt: ['<bos>', 'Is', ' the', ' Louvre', ' Museum', ' in', ' Rome', '?', '\n']
Tokenized answer: ['False']


Top 0th token. Logit: 34.64 Prob: 93.46% Token: |No|
Top 1th token. Logit: 31.77 Prob:  5.32% Token: |The|
Top 2th token. Logit: 29.80 Prob:  0.74% Token: |no|
Top 3th token. Logit: 28.31 Prob:  0.17% Token: |Yes|
Top 4th token. Logit: 28.00 Prob:  0.12% Token: | no|
Top 5th token. Logit: 27.82 Prob:  0.10% Token: |Sure|
Top 6th token. Logit: 26.24 Prob:  0.02% Token: |While|
Top 7th token. Logit: 26.20 Prob:  0.02% Token: | No|
Top 8th token. Logit: 25.54 Prob:  0.01% Token: |NO|
Top 9th token. Logit: 24.60 Prob:  0.00% Token: |Actually|


Tokenized prompt: ['<bos>', 'Is', ' the', ' official', ' language', ' of', ' Paris', ' English', '?', '\n']
Tokenized answer: ['False']


Top 0th token. Logit: 32.93 Prob: 95.15% Token: |No|
Top 1th token. Logit: 29.18 Prob:  2.22% Token: |The|
Top 2th token. Logit: 28.97 Prob:  1.81% Token: |Paris|
Top 3th token. Logit: 27.31 Prob:  0.34% Token: |Yes|
Top 4th token. Logit: 26.17 Prob:  0.11% Token: |no|
Top 5th token. Logit: 25.53 Prob:  0.06% Token: |Official|
Top 6th token. Logit: 25.33 Prob:  0.05% Token: |Sure|
Top 7th token. Logit: 24.85 Prob:  0.03% Token: |There|
Top 8th token. Logit: 24.57 Prob:  0.02% Token: | No|
Top 9th token. Logit: 24.41 Prob:  0.02% Token: |Is|


Tokenized prompt: ['<bos>', 'Does', ' Paris', ' have', ' a', ' desert', ' climate', '?', '\n']
Tokenized answer: ['False']


Top 0th token. Logit: 35.83 Prob: 90.46% Token: |No|
Top 1th token. Logit: 33.55 Prob:  9.26% Token: |Paris|
Top 2th token. Logit: 29.38 Prob:  0.14% Token: |no|
Top 3th token. Logit: 27.71 Prob:  0.03% Token: |The|
Top 4th token. Logit: 27.09 Prob:  0.01% Token: | No|
Top 5th token. Logit: 26.98 Prob:  0.01% Token: |Sure|
Top 6th token. Logit: 26.93 Prob:  0.01% Token: |Yes|
Top 7th token. Logit: 26.91 Prob:  0.01% Token: | Paris|
Top 8th token. Logit: 26.69 Prob:  0.01% Token: |It|
Top 9th token. Logit: 26.24 Prob:  0.01% Token: |While|


Tokenized prompt: ['<bos>', 'Is', ' Paris', ' larger', ' in', ' area', ' than', ' New', ' York', ' City', '?', '\n']
Tokenized answer: ['False']


Top 0th token. Logit: 35.04 Prob: 56.73% Token: |Paris|
Top 1th token. Logit: 34.51 Prob: 33.41% Token: |Sure|
Top 2th token. Logit: 33.06 Prob:  7.84% Token: |Yes|
Top 3th token. Logit: 30.97 Prob:  0.97% Token: |No|
Top 4th token. Logit: 30.33 Prob:  0.51% Token: |The|
Top 5th token. Logit: 29.99 Prob:  0.37% Token: | Paris|
Top 6th token. Logit: 27.90 Prob:  0.05% Token: |While|
Top 7th token. Logit: 27.80 Prob:  0.04% Token: |Certainly|
Top 8th token. Logit: 26.12 Prob:  0.01% Token: |New|
Top 9th token. Logit: 25.90 Prob:  0.01% Token: |That|


Here we see a clear drawback of our method. As anticipated in the note earlier on, having more True-Tokens than False-Tokens in our evaluation dilutes the quality of the metric: logits seem to have a lower bound, so many meaningless tokens (close to 0% probability) have almost as high logits as far more probable tokens. Let's add a `return_probs` flag to our `logits_to_logit_diff` function to sum the probabilities of the tokens instead. Since all token probabilities add up to one, this is more insightful.
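
The effect can be reproduced with a toy example. The group sizes (11 True-Tokens, 8 False-Tokens) come from `answer_map`; the logit values are assumptions chosen to mimic a confident "No" on top of a high logit floor, and the softmax is taken over these 19 tokens only rather than the full vocabulary.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


n_true, n_false = 11, 8  # group sizes taken from `answer_map`
floor = 25.0  # assumed logit floor for tokens with ~0% probability
true_logits = [floor] * n_true
false_logits = [35.0] + [floor] * (n_false - 1)  # one confident "No"

# Metric 1: summed logits, signed so that positive means "correct" (label is False).
summed_logit_diff = sum(false_logits) - sum(true_logits)

# Metric 2: summed probabilities over the same two groups.
probs = softmax(true_logits + false_logits)
prob_diff = sum(probs[n_true:]) - sum(probs[:n_true])

print(f"summed logit difference:       {summed_logit_diff:.1f}")
print(f"summed probability difference: {prob_diff:.3f}")
```

The summed logits come out negative purely because the True group contains three more floor-level tokens, while the probability-based metric correctly reports a confident "No".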

In [22]:
from src.logit_diff import logits_to_logit_diff

In [23]:
print(
    "Per prompt logit difference:",
    logits_to_logit_diff(df, logits, answer_map_tokens, device, return_probs=True, per_prompt=True)
    .cpu()
    .round(decimals=3),
)
print(
    "Average logit difference:",
    logits_to_logit_diff(df, logits, answer_map_tokens, device, return_probs=True)
    .clone()
    .detach()
    .cpu()
    .round(decimals=2),
)

Per prompt logit difference: tensor([ 0.6320,  0.6320,  0.3670,  0.9000,  0.8930,  0.8810,  0.9740,  0.9430,
         0.7010, -0.4030,  0.5180])
Average logit difference: tensor(0.6400)


In [24]:
answer_map

{'True': ['Yes',
  'yes',
  'Sure',
  'Correct',
  'Certainly',
  'Absolutely',
  'Indeed',
  'True',
  'Yeah',
  ' Yes',
  'Yep'],
 'False': ['No', 'no', 'Nope', ' No', ' no', 'Wrong', 'NO', 'False']}

Well well, see how quickly things turn. Let's look at the samples with $p_{diff} < 0.15$:

In [28]:
for i in [9]:
    utils.test_prompt(df["question"].iloc[i], df["answer"].iloc[i], model, prepend_bos=True, prepend_space_to_answer=False)

Tokenized prompt: ['<bos>', 'Is', ' Paris', ' larger', ' in', ' area', ' than', ' New', ' York', ' City', '?', '\n']
Tokenized answer: ['False']


Top 0th token. Logit: 35.04 Prob: 56.73% Token: |Paris|
Top 1th token. Logit: 34.51 Prob: 33.41% Token: |Sure|
Top 2th token. Logit: 33.06 Prob:  7.84% Token: |Yes|
Top 3th token. Logit: 30.97 Prob:  0.97% Token: |No|
Top 4th token. Logit: 30.33 Prob:  0.51% Token: |The|
Top 5th token. Logit: 29.99 Prob:  0.37% Token: | Paris|
Top 6th token. Logit: 27.90 Prob:  0.05% Token: |While|
Top 7th token. Logit: 27.80 Prob:  0.04% Token: |Certainly|
Top 8th token. Logit: 26.12 Prob:  0.01% Token: |New|
Top 9th token. Logit: 25.90 Prob:  0.01% Token: |That|


This looks way better. Now we have an insightful metric to evaluate our task. Let's remove the sample with index 9, save the dataframe, and move on to the fun part...
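
The index to drop can also be derived from the threshold instead of being hard-coded. The values below are copied from the per-prompt output above; the variable names are illustrative.

```python
# Per-prompt probability differences copied from the notebook output.
prob_diffs = [0.632, 0.632, 0.367, 0.900, 0.893, 0.881, 0.974, 0.943, 0.701, -0.403, 0.518]
threshold = 0.15  # samples below this are treated as unreliable

drop_idx = [i for i, p in enumerate(prob_diffs) if p < threshold]
kept = [p for p in prob_diffs if p >= threshold]

print("Indices to drop:", drop_idx)
print(f"Mean probability difference after filtering: {sum(kept) / len(kept):.2f}")
```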

In [29]:
drop_idx = [9]
df.drop(drop_idx, inplace=True)
df.reset_index(drop=True, inplace=True)

In [30]:
print(f"Final dataset:\n{df}")

Final dataset:
                                       question answer
0             Is Paris the capital of France?\n   True
1             Is Paris the capital of France?\n   True
2       Is the Eiffel Tower located in Paris?\n   True
3        Is Paris known as the City of Light?\n   True
4               Is the Louvre Museum in Rome?\n  False
5  Is the official language of Paris English?\n  False
6          Is Montmartre a district in Paris?\n   True
7           Does Paris have a desert climate?\n  False
8                Is the Seine River in Paris?\n   True
9        Is the Mona Lisa displayed in Paris?\n   True


In [31]:
df.to_csv(DATA / 'fk_samples.csv', index=False)

### Load dataset (Shortcut)

In [37]:
df = pd.read_csv(DATA / 'fk_samples.csv')

In [38]:
df

| | question | answer |
|---|---|---|
| 0 | Is Paris the capital of France?\n | True |
| 1 | Is Paris the capital of France?\n | True |
| 2 | Is the Eiffel Tower located in Paris?\n | True |
| 3 | Is Paris known as the City of Light?\n | True |
| 4 | Is the Louvre Museum in Rome?\n | False |
| 5 | Is the official language of Paris English?\n | False |
| 6 | Is Montmartre a district in Paris?\n | True |
| 7 | Does Paris have a desert climate?\n | False |
| 8 | Is the Seine River in Paris?\n | True |
| 9 | Is the Mona Lisa displayed in Paris?\n | True |


## Direct Logit Attribution (Logit Lens)

The idea of direct logit attribution is to quantify how the different components of our Transformer affect the outcome of our task. To do so, we take the residual stream direction of our logit difference and observe how the dot product between this vector and the residual stream at the final token position behaves, layer by layer. Since layer norm and unembedding, the two transformations applied to the final state of the residual stream, are approximately linear, we can move the evaluation step forward to the residual stream itself. Unfortunately, softmax, the final transformation that yields token probabilities (the evaluation metric we deemed most useful for our use case), is not linear. Hence, we cannot directly translate a positive or negative change in direct logit attribution into an equal change in our previously selected metric, the difference in accumulated token probability. Still, the sign and the relative magnitude of a change in direct logit attribution can be interpreted with respect to our evaluation metric, allowing us to discover which layers are more relevant for our task than others.
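
The linearity argument can be made concrete with a toy residual stream. This is a sketch under simplifying assumptions: layer norm is ignored entirely, the model has three dimensions, and all vectors are made up. It only shows that the logit difference of the summed residual stream equals the sum of each component's projection onto the logit-difference direction.

```python
from typing import Dict, List


def dot(u: List[float], v: List[float]) -> float:
    """Plain dot product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


# Toy contributions written into the residual stream at the final position.
components: Dict[str, List[float]] = {
    "embed": [0.5, -1.0, 0.0],
    "attn_0": [0.2, 0.3, -0.1],
    "mlp_0": [1.0, 0.5, 0.4],
}

# Toy unembedding columns for one True-token and one False-token.
w_u_yes = [1.0, 0.0, 2.0]
w_u_no = [0.0, 1.0, 1.0]
direction = [y - n for y, n in zip(w_u_yes, w_u_no)]

# The final residual stream is the sum of all component outputs.
final_resid = [sum(values) for values in zip(*components.values())]

logit_diff = dot(final_resid, w_u_yes) - dot(final_resid, w_u_no)
per_component = {name: dot(vec, direction) for name, vec in components.items()}

print("logit difference:", round(logit_diff, 6))
print("per-component attribution:", {k: round(v, 6) for k, v in per_component.items()})
```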

First, we create an answer map that maps the True-/False-Tokens to their corresponding embeddings (we already did this for the last step, too).

In [9]:
from src.utils import answer_map_to_tokens

answer_map_tokens = answer_map_to_tokens(model, answer_map)

Next, we get the residual stream directions of our True-/False-Tokens. This simply means retrieving the column vectors of the unembedding matrix that determine the logits of the tokens we are interested in via a dot product with the input vector. We then sum those unembedding vectors for the True- and False-Tokens respectively and subtract the summed True-Token unembeddings from the summed False-Token unembeddings (or vice versa), according to our label.
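
Written out, and assuming `get_answer_residual_direction` does what this paragraph describes (its source is not shown here), the direction for a prompt with label $y$ would be

$$
d_y = s_y \left( \sum_{t \in \mathcal{T}} W_U[:, t] \; - \sum_{f \in \mathcal{F}} W_U[:, f] \right),
\qquad
s_y =
\begin{cases}
+1 & \text{if } y = \text{True} \\
-1 & \text{if } y = \text{False}
\end{cases}
$$

where $\mathcal{T}$ and $\mathcal{F}$ are the sets of True- and False-Token ids and $W_U[:, t]$ is the column of the unembedding matrix belonging to token $t$. The symbols $d_y$, $s_y$, $\mathcal{T}$, and $\mathcal{F}$ are notation introduced here for illustration. The summed logit difference for a residual stream vector $x$ is then approximately $\mathrm{LN}(x) \cdot d_y$, with $\mathrm{LN}$ denoting the final layer norm.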

In [26]:
df

| | question | answer |
|---|---|---|
| 0 | Is Paris the capital of France?\n | True |
| 1 | Is Paris the capital of France?\n | True |
| 2 | Is the Eiffel Tower located in Paris?\n | True |
| 3 | Is Paris known as the City of Light?\n | True |
| 4 | Is the Louvre Museum in Rome?\n | False |
| 5 | Is the official language of Paris English?\n | False |
| 6 | Is Montmartre a district in Paris?\n | True |
| 7 | Does Paris have a desert climate?\n | False |
| 8 | Is the Seine River in Paris?\n | True |
| 9 | Is the Mona Lisa displayed in Paris?\n | True |


In [34]:
from src.direct_logit_attribution import get_answer_residual_direction

logit_diff_directions = get_answer_residual_direction(
    model,
    df,
    answer_map_tokens,
    device
)

Finally, we take the residual stream at the last token position at every point in the network where a layer's output is fed back into the residual stream, and plot its value along the answer residual direction after each layer.

In [35]:
from src.direct_logit_attribution import residual_stack_to_logit_diff
from src.utils import line

accumulated_residual, labels = cache.accumulated_resid(
    layer=-1, incl_mid=True, pos_slice=-1, return_labels=True
)
logit_lens_logit_diffs = residual_stack_to_logit_diff(
    accumulated_residual, 
    logit_diff_directions, 
    cache
)
line(
    logit_lens_logit_diffs,
    x=np.arange(model.cfg.n_layers * 2 + 1) / 2,
    hover_name=labels,
    title="Logit Difference From Accumulated Residual Stream",
)

We can see that even with this slightly flawed metric, the model improves continuously; the plot is almost monotonic. It starts with the embedding of the last token, which yields a highly negative logit difference. This makes sense, since the last token (the newline token) is not our answer. With almost every layer, the logit difference increases. We see a steep slope at the beginning (layer 1), followed by a rather steady increase (layers 2-13.5), a jump in capability at the MLP of layer 13 (~40% of the overall increase between 0 and the final logit difference), and finally another steady increase up to the last MLP, where we see a small drop again.

In [39]:
from src.direct_logit_attribution import increase_per_layer_type

increase_per_layer_type(logit_lens_logit_diffs)

Summed increase in self-attention layer after layer 1: 20.38
Summed increase in MLP layer after layer 1: 64.74



If we neglect the first layer, we observe that the logit difference increases much more in the MLP layers than in the self-attention layers. Over the 17 remaining layers, the logit difference in the residual stream increases by 20.38 nats over the self-attention layers and by 64.74 nats (70% of the total increase) over the MLP layers.
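
A sketch of how such a split can be computed from the logit-lens curve. This is not the implementation in `src.direct_logit_attribution`; the helper name is hypothetical, and it assumes the values are ordered as `accumulated_resid(..., incl_mid=True)` labels them (`0_pre`, `0_mid`, `1_pre`, `1_mid`, ..., `final_post`), so that every layer contributes one attention step followed by one MLP step.

```python
from typing import Dict, List


def split_increase_by_layer_type(values: List[float], skip_layers: int = 1) -> Dict[str, float]:
    """Split the growth of a logit-lens curve into attention and MLP parts.

    `values[2 * l]` is the residual stream before layer l, `values[2 * l + 1]`
    the residual stream after its attention block (the "mid" point).
    """
    totals = {"attn": 0.0, "mlp": 0.0}
    for step in range(2 * skip_layers, len(values) - 1):
        delta = values[step + 1] - values[step]
        # Even steps go pre -> mid (attention), odd steps go mid -> next pre (MLP).
        kind = "attn" if step % 2 == 0 else "mlp"
        totals[kind] += delta
    return totals


# Toy curve for a 3-layer model (made-up numbers, 2 * 3 + 1 points).
toy_curve = [-10.0, -4.0, -3.0, -2.0, 1.0, 1.5, 4.0]
totals = split_increase_by_layer_type(toy_curve, skip_layers=1)
share_mlp = totals["mlp"] / (totals["attn"] + totals["mlp"])
print(totals, f"MLP share: {share_mlp:.0%}")
```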

## Layer Attribution