# MATS Application - Christophe Thomassin: <br> Factual Knowledge in Transformer Language Models

Note: I will try to back up my code with comments for better understanding as time allows. For an in-depth analysis, please refer to the Google Docs document: [Link](https://docs.google.com/document/d/13CZcQ818lBNqBGEqn3wxq3iVc7uHldprSr2tgTrBUr8/edit?usp=sharing)

**The Task**: <br>
I am interested in understanding how Transformer language models store and retrieve knowledge, specifically factual knowledge. Factual knowledge is essential to virtually every information-processing task, since facts serve as the underlying assumptions of almost every reasoning step. If LLMs were unable to acquire factual knowledge, they would be little more than random word generators. I have three reasons for thinking that applying mechanistic interpretability (MI) to understand how LLMs acquire and store factual knowledge is a good idea:
1. I see factual knowledge as the simplest form of intelligence, the entry point to cognitive capabilities. Hence, I find it intuitive to start with factual knowledge when trying to "mechanistically" unwind the complexity of LLMs. Once understood, it should open many doors for digging deeper into the "mind" of Transformers.
2. Factual knowledge, or rather the lack of it, can be considered a major driver of hallucinations. Understanding how and where factual knowledge is stored could allow us to remediate many of them.
3. Factual knowledge, as the name indicates, is based on commonly known facts, which should make it quite easy to evaluate (one would think): a fact can only be True or False.

One of the main drawbacks is that, as one might imagine, we cannot expect factual knowledge to be "universal". On the contrary, because of its factual nature it is highly dependent on the training data. A model cannot deduce that Paris is the capital of France without having been told so during pre-training. Hence, some models may have acquired specific factual knowledge while others have not. Still, my hope is that the mechanisms for storing and retrieving factual knowledge are somewhat universal.

## Setup

In [31]:
# automatically reload imported modules before executing code
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [27]:
import json
import os

import pandas as pd
import torch
import transformer_lens.utils as utils
from dotenv import load_dotenv
from huggingface_hub import login
from transformer_lens import ActivationCache, FactoredMatrix, HookedTransformer

from src.config import DATA


In [2]:
load_dotenv()

login(token=os.getenv("HF_TOKEN"))

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /home/chrisoutho_gmail_com/.cache/huggingface/token
Login successful


In [3]:
torch.set_grad_enabled(False)

device: torch.device = utils.get_device()
print(f"Device: {device}")

Device: cuda


### Load data

In [8]:
# load data
df = pd.read_csv(DATA / "fk_samples.csv")
print(df.head())

                                            question answer
0                  Is Paris the capital of France?\n    Yes
1            Is the Eiffel Tower located in Paris?\n    Yes
2             Is Paris the capital city of France?\n    Yes
3  Is the Louvre Museum in Paris known for housin...    Yes
4          Is Paris nicknamed "The City of Light"?\n    Yes


In [9]:
with open(DATA / "answer_map.json", "r") as f:
    answer_map = json.load(f)

print(answer_map)

{'True': ['Yes', 'yes', 'Sure', 'Correct', 'Certainly', 'Absolutely', 'Indeed', 'True', 'Yeah', ' Yes'], 'False': ['No', 'no', 'Nope', ' No', ' no', 'Wrong', 'NO']}


### Load model

In [10]:
model = HookedTransformer.from_pretrained("google/gemma-2b-it", device=device)

`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Loaded pretrained model google/gemma-2b-it into HookedTransformer


## Task definition

As stated above, we want to understand how the model stores and retrieves factual knowledge. To do so, as in every MI investigation, we have to define a task that lets us evaluate LLMs on the concept of interest, here factual knowledge. There are a few things to consider when engineering this task:
1. Most importantly: We want to isolate the concept we are testing for, here factual knowledge, as much as possible. Good performance on our task should imply that the tested model is good at the exact concept we are interested in. Conversely, if a model does not perform well on our task, this should imply that the model is not good at this concept.
2. The generation should be only one token long for easy evaluation. Ideally, one can also name a second token that would be expected if the model could not answer the query correctly. These two tokens then allow using the logit difference as the evaluation metric for the task.
3. It should be easy to scale the task to many different examples to rule out randomness.
4. It should be possible, by changing only a few tokens, to remove the concept from the task and make the answer random. This alternative distribution is needed for path patching and ablation.

For this time-constrained assessment, I will have to focus on a specific example. I chose to inspect factual knowledge around the French city "Paris". Paris is well-known enough to appear in pretty much every pre-training dataset, yet offers enough diversity to ask multiple questions in order to test generalisation.

Taking all these requirements into account, I selected a question-answer format as the task. It isolates specific factual knowledge, yields one-token generations, scales perfectly, and makes it possible to "disarm" the query by removing the "key".

An example, which I will use a lot throughout this assessment, is: <br>
    - Query: Is Paris the capital of France? <br>
    - Label: True <br>
    - Opposite Generation: False <br>
    - Clue: capital, France <br>
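
As a rough illustration of how such an example and its "disarmed" counterpart could be represented, here is a minimal sketch. The class `TaskExample`, the function `disarm`, the choice of "Paris" as the key, and the placeholder word "it" are all hypothetical and only serve to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskExample:
    """One question-answer example of the task (field names are illustrative)."""

    query: str  # prompt shown to the model
    label: str  # expected meaning of the answer: "True" or "False"
    key: str    # assumption: the word whose removal "disarms" the query


def disarm(example: TaskExample, placeholder: str) -> str:
    """Return the query with the first occurrence of the key swapped for a placeholder."""
    if example.key not in example.query:
        raise ValueError(f"key {example.key!r} not found in query")
    return example.query.replace(example.key, placeholder, 1)


# Hypothetical choice of key and placeholder, for illustration only.
example = TaskExample(query="Is Paris the capital of France?\n", label="True", key="Paris")
corrupted_query = disarm(example, placeholder="it")
print(repr(corrupted_query))
```

For patching experiments, the clean and corrupted prompts usually need the same number of tokens, which this sketch does not check.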

I admit this task has a drawback: I am quite sure that question-answering is a circuit of its own, worth studying by itself. I expect the question-answering circuit and the factual knowledge circuit to overlap, which means I have not isolated our concept 100%. Nonetheless, after a lot of back-and-forth, I believe this is the best way of testing factual knowledge for now.

Let's see if the model is able to confidently solve this task in the first place:

In [11]:
query = "Is Paris the capital of France?\n"
ground_truth = "Yes"

<details><summary>Note</summary>
It seems that I have to add the '\n' token at the end of the query for this model to answer properly.
</details>

In [32]:
utils.test_prompt(query, ground_truth, model, prepend_bos=True, prepend_space_to_answer=False)

Tokenized prompt: ['<bos>', 'Is', ' Paris', ' the', ' capital', ' of', ' France', '?', '\n']
Tokenized answer: ['Yes']


Top 0th token. Logit: 32.54 Prob: 55.63% Token: |Sure|
Top 1th token. Logit: 32.09 Prob: 35.34% Token: |Yes|
Top 2th token. Logit: 29.96 Prob:  4.22% Token: |No|
Top 3th token. Logit: 29.90 Prob:  3.96% Token: |Paris|
Top 4th token. Logit: 26.54 Prob:  0.14% Token: |Answer|
Top 5th token. Logit: 26.24 Prob:  0.10% Token: |Certainly|
Top 6th token. Logit: 26.02 Prob:  0.08% Token: |The|
Top 7th token. Logit: 25.92 Prob:  0.07% Token: |Of|
Top 8th token. Logit: 25.76 Prob:  0.06% Token: |Correct|
Top 9th token. Logit: 25.70 Prob:  0.06% Token: |*|


While this looks good, we see the first "inconvenience" of our task here. The model is more than capable of answering the query, yet it does not concentrate its probability mass on a single "True-token" such as "Yes". Instead, it spreads the mass across several, such as "Sure" (56% probability), "Yes" (35% probability), and "Certainly" (0.1% probability). This is most likely because all of these words are used in the training data to assert that something is true. While this does not make the generation wrong, it raises two questions:
1. Is the model able to understand that all these tokens express "True" and build a circuit out of that?
2. How do we evaluate the task if we cannot measure its outcome based on the logit/probability of a single token?

For the first question, I do not have a concrete answer. For the sake of this experiment, I will simply hope that the model chosen here is indeed able to abstract the answers into "Fact is True" and "Fact is False". Larger models such as ChatGPT 4o are definitely able to do so. This model achieved close to 100% accuracy on 10 simple factual knowledge questions around Paris under the condition of answering with True or False only. For the second question, we will have to find a way to capture all possible True- and False-tokens.
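
To make the spread of probability mass concrete, the sketch below groups the probabilities from the printed top-10 output by meaning. The function `grouped_mass` is a hypothetical helper. Only the ten printed tokens are covered, so answer tokens outside the top 10 count as zero here.

```python
# Probabilities (in percent) copied from the printed top-10 output above.
top10_probs = {
    "Sure": 55.63, "Yes": 35.34, "No": 4.22, "Paris": 3.96, "Answer": 0.14,
    "Certainly": 0.10, "The": 0.08, "Of": 0.07, "Correct": 0.06, "*": 0.06,
}

# Answer map as printed in the data-loading section.
answer_map = {
    "True": ["Yes", "yes", "Sure", "Correct", "Certainly", "Absolutely",
             "Indeed", "True", "Yeah", " Yes"],
    "False": ["No", "no", "Nope", " No", " no", "Wrong", "NO"],
}


def grouped_mass(probs, answer_map):
    """Sum the probability of every listed answer token that appears in `probs`."""
    return {
        meaning: sum(probs.get(token, 0.0) for token in tokens)
        for meaning, tokens in answer_map.items()
    }


mass = grouped_mass(top10_probs, answer_map)
print({meaning: round(value, 2) for meaning, value in mass.items()})
# {'True': 91.13, 'False': 4.22}
```

Within the top 10, the four True-tokens together hold about 91% of the probability, whereas "Yes" alone holds only about 35%.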

Going through a number of example questions, I have mapped all True- and False-tokens to their underlying meaning. Although this is computationally not very elegant (especially because the lists have different lengths), we will always have to evaluate the task on all of these tokens.

In [13]:
answer_map

{'True': ['Yes',
  'yes',
  'Sure',
  'Correct',
  'Certainly',
  'Absolutely',
  'Indeed',
  'True',
  'Yeah',
  ' Yes'],
 'False': ['No', 'no', 'Nope', ' No', ' no', 'Wrong', 'NO']}

In [14]:
from src.data import answer_map_to_tokens

answer_map_tokens = answer_map_to_tokens(model, answer_map)
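
The helper `answer_map_to_tokens` lives in `src.data` and its body is not shown here. The following is only a minimal sketch of what such a mapping might look like, not the actual implementation. The function name `answer_map_to_token_ids`, its signature, and the toy vocabulary with made-up ids are assumptions chosen so the example runs without a model.

```python
from typing import Callable, Dict, List


def answer_map_to_token_ids(
    to_single_token: Callable[[str], int],
    answer_map: Dict[str, List[str]],
) -> Dict[str, List[int]]:
    """Map every answer string to its token id, grouped by meaning."""
    return {
        meaning: [to_single_token(answer) for answer in answers]
        for meaning, answers in answer_map.items()
    }


# Toy stand-in vocabulary with made-up ids, only to make the sketch runnable.
toy_vocab = {"Yes": 0, "Sure": 1, "No": 2, " No": 3}


def toy_to_single_token(text: str) -> int:
    """Look up a string in the toy vocabulary, failing loudly if it is missing."""
    if text not in toy_vocab:
        raise ValueError(f"{text!r} is not a single token in the toy vocabulary")
    return toy_vocab[text]


toy_answer_map = {"True": ["Yes", "Sure"], "False": ["No", " No"]}
token_ids = answer_map_to_token_ids(toy_to_single_token, toy_answer_map)
print(token_ids)  # {'True': [0, 1], 'False': [2, 3]}
```

With the real model, `model.to_single_token` from TransformerLens could be passed as the callable. It raises an error for strings that are not exactly one token, which would flag answer strings that the tokenizer splits.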

In [35]:
df = pd.DataFrame({'question': [query], 'answer': ["True"]})

In [40]:
from src.logit_diff import df_to_logits, logits_to_logit_diff

logits, cache = df_to_logits(model, df)
logit_diff = logits_to_logit_diff(df, logits, answer_map_tokens, device)
print(f"Logit difference of True and False tokens summed up is {logit_diff:.2f} nats")

Logit difference of True and False tokens summed up is 109.43 nats
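
`logits_to_logit_diff` is imported from `src.logit_diff` and its body is not shown, so how exactly the two token groups are aggregated is not known here. The sketch below assumes one literal reading of "summed up" (sum of True logits minus sum of False logits) and contrasts it with a log-sum-exp variant, which equals the log of the ratio between total True and total False probability mass. Both function names are hypothetical, and only tokens from the printed top 10 are used.

```python
import math

# Logits copied from the printed top-10 output. Answer-map tokens outside the
# top 10 are unknown here and therefore left out of this toy example.
true_logits = {"Sure": 32.54, "Yes": 32.09, "Certainly": 26.24, "Correct": 25.76}
false_logits = {"No": 29.96}


def summed_logit_diff(true_vals, false_vals):
    """Sum of True logits minus sum of False logits (grows with list length)."""
    return sum(true_vals) - sum(false_vals)


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def logsumexp_logit_diff(true_vals, false_vals):
    """Log of the ratio between total True and total False probability mass."""
    return logsumexp(true_vals) - logsumexp(false_vals)


t, f = list(true_logits.values()), list(false_logits.values())
summed = summed_logit_diff(t, f)
lse = logsumexp_logit_diff(t, f)
print(f"summed:    {summed:.2f}")
print(f"logsumexp: {lse:.2f}")
```

Under the summed reading, every extra token in a list adds its full logit (roughly 25 to 33 here), so lists of unequal length shift the result strongly. The log-sum-exp variant does not depend on list length in that way.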


When accounting for all True and False tokens, the model is clearly able to classify the statement as True. If the model were able to concentrate the probability mass of all True-tokens on one token and that of all False-tokens on another, the True-token would have an $e^{109}\approx 10^{47}\times$ higher probability.
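
The order of magnitude can be checked with a short conversion from base $e$ to base 10. This only converts the printed number and says nothing about whether the summed logit difference is the right quantity to exponentiate.

```python
import math

logit_diff_nats = 109.43  # value printed by the cell above

# e**x == 10**(x / ln 10), so dividing by ln(10) gives the base-10 exponent.
exponent_base10 = logit_diff_nats / math.log(10)
print(f"e^{logit_diff_nats} is about 10^{exponent_base10:.1f}")
```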

<details> <summary>Note</summary>
Considering typical logit distributions over the vocabulary at the end of a Transformer, it is questionable whether the model, even if it could concentrate on two different tokens, would assign such a high logit to the True token. Furthermore, we have to consider that far more positive (True) tokens are among the top 10 tokens, which is why the difference is so big in the first place. Nevertheless, this supports our hypothesis that the model can retrieve factual knowledge and use it to answer questions.