# **Probing a GPT model**

This tutorial is about probing, a simple but powerful method for studying the inner workings of LLMs (Large Language Models). It can be used to gain approximate knowledge about the patterns a model learns and how that knowledge is distributed across its layers.

### **Probing idea**

Feed the model an input text $x$ and get a prediction $output$. Save and consider the chain of hidden states:

$$x \to hidden_1 \to hidden_2 \to \dots \to hidden_n \to output$$

where $hidden_i$ is the vector representation of the input $x$ at the $i$-th layer.

By training a simple model on the $hidden_i$ collected for a labeled dataset, we can form hypotheses that answer questions such as:

- Does $hidden_i$ contain information about parts of speech? (Collect representations and solve a classification problem on annotated data.)
- Where is the knowledge about facts or concepts that the model "learned" from the data encoded? (Handled the same way as the previous question.)

Probing does require routine labeling, but that is not always difficult. In this tutorial we will cover:

1. The probing process, using GPT-2 as an example;
2. Analysis of the information content of hidden states using PCA;
3. Setting up and running an experiment to answer the question: which layer's hidden states allow an approximate solution of the regression problem, i.e. which layer stores information about years?

### **Problem statement - where in the model is knowledge about dates stored?**

Let's say we have a generative model. Let's pose the question: **where exactly does this model "know" what years a particular person lived?**

To do this, we need to:

1. Create a dataset: pairs of the form (question: "When was Newton born?", answer: "1643").
2. Run questions through the model and extract hidden states from different layers.
3. Extract the predicted year from the generated answer.
4. Train a probe that predicts the date based on the hidden state.
5. Analyze on which layers the model most effectively stores information about dates.

Let's get started.


Inspired by: https://ai-office-hours.beehiiv.com/p/llm-probing

Paper: https://arxiv.org/pdf/2310.02207.pdf

In [2]:
#!pip install datasets==3.2.0 -q

In [3]:
import torch
import random
import json
import re

import numpy as np
import pandas as pd

#from datasets import load_dataset
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from tqdm.auto import tqdm
# suppress sklearn warning
import warnings; warnings.filterwarnings("ignore")



### **GPT-2 Model**
GPT-2 is a decoder-only transformer language model. First, let's load the model and see an example of how it works. In our analysis, we will use `gpt2-medium`, trained on `WebText`. You can read more about the model in the [paper](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf), but let's note that the dataset the model was trained on has the following features:

- No Wikipedia pages;
- The training data was collected before 2019;
- The data consists of web pages linked from `Reddit` posts that received at least 3 karma;

Also, let's note that when we pass text to the model, the following happens:

1. The text is broken down into tokens (e.g. `"Albert"` → `15433`, `"Einstein"` → `8372`).

2. Each token is transformed into a vector representation (embedding).

3. The vectors are passed through the layers of the transformer one after another. Each successive layer adds more information about the context.

4. At the output of each layer we get a *sequence* of hidden states, which are gradually "saturated" with context. The states after the last layer are used to produce the final prediction.

In [5]:
# Load the tokenizer and the model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2-medium')
model = GPT2LMHeadModel.from_pretrained('gpt2-medium')

# Select the device (GPU, if available)
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# model = model.to(device)

# Input text
text = "When was Albert Einstein Born?"
encoded_input = tokenizer(text, return_tensors='pt') #.to(device)

# Generate a continuation of the text
with torch.no_grad():
    output_ids = model.generate(**encoded_input, max_length=20)

# Decode the tokens back into text
decoded_output = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(decoded_output)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


When was Albert Einstein Born?

Albert Einstein was born on November 20, 1879, in


In our case, the *text generation* model performs the task of answering a question. We will gradually modify the code so that only the year is extracted from the answer. We will limit ourselves to years from 1800 to 2019.

We will extract the year using regular expressions.

In [6]:
# regular expression for finding years

pattern = r"\b(1[89][0-9]{2}|20[01][0-9])(?:[s’']{0,2}|th)?\b"
example = "In the 2000s, I fell in love with cats."

match_ = re.search(pattern, example)

year = match_.group(1) if match_ else None

print("Extracted year:", year)

Extracted year: 2000
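
The extraction step can be wrapped in a small helper so that it is easy to check on a few strings. This is a minimal sketch: the name `extract_year` and the choice to return an `int` are assumptions made for illustration, while the pattern is the one defined above.

```python
import re

# Same pattern as above: years 1800-2019, optionally followed by "s", "'s" or "th".
YEAR_PATTERN = re.compile(r"\b(1[89][0-9]{2}|20[01][0-9])(?:[s’']{0,2}|th)?\b")

def extract_year(text):
    """Return the first year in 1800-2019 found in `text` as an int, or None."""
    match = YEAR_PATTERN.search(text)
    # group(1) is the four-digit year without any suffix such as "s"
    return int(match.group(1)) if match else None

examples = [
    "Albert Einstein was born on November 20, 1879, in",
    "In the 2000s, I fell in love with cats.",
    "Napoleon was born in 1769.",   # outside the 1800-2019 range
    "Order number 21879 shipped.",  # digits inside a longer number
]

for sentence in examples:
    print(repr(sentence), "->", extract_year(sentence))
```

Strings without a year in the supported range yield `None`, which makes it possible to filter out answers where the model did not produce a usable date.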


### **Preparing data and collecting hidden states**

Hidden states are the **internal representations** of the input text within the model. They are produced at the output of each layer of the network and contain encoded information about the text. *"What does a hidden state contain?"* is an open question. A number of studies show that different layers contain different kinds of information. For example, the first layers tend to contain information about parts of speech and the position of words in a sentence, while the last layers contain the meaning of words. This interpretation is not universal, however, and probing is precisely an attempt to understand these internal processes in models.

### **How to get hidden states in Hugging Face?**
If the model supports `output_hidden_states=True`, then after processing the text we can access them like this:

``` python
outputs = model(**encoded_input, output_hidden_states=True)
hidden_states = outputs.hidden_states # A tuple of tensors: the embedding output plus the output of each layer
```

- `hidden_states[0]`: the embedding output (before the first transformer block)
- `hidden_states[-1]`: the output of the last layer
- `hidden_states[len(hidden_states) // 2]`: the output of a middle layer

The shape of each hidden state tensor:
```
(batch_size, sequence_length, hidden_size)
```
For example, for `gpt2-medium`:
```
(1, 6, 1024) # 6 tokens, each represented by 1024-dimensional vector
```
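
To check the structure of `hidden_states` without downloading any weights, one can build a tiny, randomly initialised GPT-2. This is only a sketch: the configuration values (3 blocks, 32-dimensional states, vocabulary of 100) are hypothetical and chosen to keep the example fast, so the numbers differ from `gpt2-medium`.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny random GPT-2 (hypothetical sizes, no pretrained weights are loaded).
config = GPT2Config(vocab_size=100, n_positions=16, n_embd=32, n_layer=3, n_head=2)
tiny_model = GPT2LMHeadModel(config).eval()

# A fake batch: 2 sequences of 5 token ids each.
input_ids = torch.randint(0, config.vocab_size, (2, 5))

with torch.no_grad():
    outputs = tiny_model(input_ids=input_ids, output_hidden_states=True)

hidden_states = outputs.hidden_states

# One tensor for the embedding output plus one per transformer block.
num_states = len(hidden_states)
# Every tensor has shape (batch_size, sequence_length, hidden_size).
state_shapes = [tuple(h.shape) for h in hidden_states]

print(num_states, state_shapes[0])
```

The tuple has `n_layer + 1` entries because index 0 holds the embedding output. By the same rule, `gpt2-medium` with its 24 blocks returns 25 tensors.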

In [7]:
tokenizer.pad_token = tokenizer.eos_token

In [8]:
# Texts
texts = ["When was Albert Einstein born?", "When was Frida Kahlo born?", "When was Claus Hammel born?"]
encoded_input = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)

# Run through the model and get hidden states
with torch.no_grad():
    outputs = model(**encoded_input, output_hidden_states=True)
    output_ids = model.generate(**encoded_input, max_length=30)
    hidden_states = outputs.hidden_states  # All hidden states (tuple of 25 tensors: embeddings + 24 layers)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.


Let's see what the hidden_states dimensions look like

In [9]:
# Get hidden states (first, middle and last entries of the tuple)
first_layer = hidden_states[0]  # index 0: embedding output, before the first transformer block
middle_layer = hidden_states[len(hidden_states) // 2]  # middle layer
last_layer = hidden_states[-1]  # last layer

# Print the shapes of the hidden states
print("First layer shape:", first_layer.shape)  # (batch_size, seq_len, hidden_dim)
print("Middle layer shape:", middle_layer.shape)
print("Last layer shape:", last_layer.shape)

First layer shape: torch.Size([3, 8, 1024])
Middle layer shape: torch.Size([3, 8, 1024])
Last layer shape: torch.Size([3, 8, 1024])
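
Because the batch is padded to a common length, some of the 8 positions in the shorter questions are padding tokens. One common way to get a single fixed-size vector per question is to take the hidden state of the last real token. The sketch below illustrates this with NumPy arrays; it assumes right padding (as reported in the warning above), and the helper name `last_token_states` is hypothetical, not something the tutorial's code defines.

```python
import numpy as np

def last_token_states(hidden, attention_mask):
    """Pick the hidden state of the last non-padding token of each sequence.

    hidden:         array of shape (batch_size, seq_len, hidden_dim)
    attention_mask: array of shape (batch_size, seq_len), 1 for real tokens,
                    0 for padding; right padding is assumed.
    """
    last_idx = attention_mask.sum(axis=1) - 1           # index of last real token
    batch_idx = np.arange(hidden.shape[0])
    return hidden[batch_idx, last_idx]                  # (batch_size, hidden_dim)

# Toy batch: 2 sequences, 3 positions, 2-dimensional states.
toy_hidden = np.arange(2 * 3 * 2).reshape(2, 3, 2)
toy_mask = np.array([[1, 1, 0],    # first sequence has 2 real tokens
                     [1, 1, 1]])   # second sequence has 3 real tokens

pooled = last_token_states(toy_hidden, toy_mask)
print(pooled.shape)
```

With real model outputs, the same indexing can be applied to a layer converted with `.numpy()` and to `encoded_input['attention_mask']`, giving one `hidden_dim`-sized vector per question regardless of its length.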


In [10]:
# Decode the model responses
decoded_output = [tokenizer.decode(i, skip_special_tokens=True) for i in output_ids]

for i in decoded_output:
    print(i + '\n')


When was Albert Einstein born?The answer is: in 1879. Einstein was born in Vienna, Austria, and died in New York City

When was Frida Kahlo born?

Frida Kahlo was born on November 28, 1894, in the small town of La Pl

When was Claus Hammel born?The answer is: in 1848. He was born in the town of St. Paul, Minnesota, and



In [11]:
# Extract the years

pattern = r"\b(1[89][0-9]{2}|20[01][0-9])(?:[s’']{0,2}|th)?\b"

years = []

for i in decoded_output:
    match_ = re.search(pattern, i)
    # group(1) is the four-digit year only; group(0) would also include a suffix such as "s"
    year = match_.group(1) if match_ else None
    years.append(year)

print("Extracted years:", years)

Extracted years: ['1879', '1894', '1848']


In [12]:
# Print the results
for text, year in zip(texts, years):
    print(f"Input: {text}")
    print(f"Generated year: {year}")
    print()


Input: When was Albert Einstein born?
Generated year: 1879

Input: When was Frida Kahlo born?
Generated year: 1894

Input: When was Claus Hammel born?
Generated year: 1848



### **Dataset**

We will work with the [people dataset](https://www.nature.com/articles/s41597-022-01369-4), pre-processed and cleaned for this experiment. The prepared dataset can be downloaded from [Google Drive](https://drive.google.com/file/d/1QbEbJlABsbhzyKfQ6L4ES1rH3wbbuPfG/view?usp=sharing).

In [13]:
#people_dataset = people_dataset.to_pandas() # download the original dataset

In [14]:
#!wget https://github.com/SadSabrina/XAI-open_materials/raw/refs/heads/main/gpt2_probing/people_dataset_prepared_most_popular.csv.zip

In [15]:
#!unzip /Users/sabrinasadieh/Code/XAI-open_materials/gpt2_probing/people_dataset_prepared_most_popular.csv.zip

In [16]:
most_popular = pd.read_csv('//Users/sabrinasadieh/Code/XAI-open_materials/gpt2_probing/people_dataset_prepared_most_popular.csv')

In [17]:
most_popular.head()

| Unnamed: 0 | name | birth | death | wiki_readers_2015_2018 | birth_min | birth_max | death_min | death_max |
|---|---|---|---|---|---|---|---|---|
| 0 | Karel Matěj Čapek-Chod | 1860.0 | 1927.0 | 25008 | 1860.0 | 1860.0 | 1927.0 | 1927.0 |
| 1 | Florian Eichinger | 1971.0 | | 27285 | 1971.0 | 1971.0 | | |
| 2 | Florian Jahr | 1983.0 | | 37331 | 1983.0 | 1983.0 | | |
| 3 | Tadeusz Borowski | 1922.0 | 1951.0 | 341110 | 1922.0 | 1922.0 | 1951.0 | 1951.0 |
| 4 | Joseph C. O'Mahoney | 1884.0 | 1962.0 | 15428 | 1884.0 | 1884.0 | 1962.0 | 1962.0 |


In [18]:
most_popular.shape

(556311, 8)

In [19]:
most_popular['birth'].min()

np.float64(1800.0)

So, we have a dataset that has been pre-filtered by birth year: it contains people born from 1800 to 2019. We are interested in two columns: `name` (to build the question) and `birth` (the target year).
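
As an illustration of how these two columns turn into probing inputs and targets, here is a minimal sketch on a few rows copied from the `head()` output above. The helper name `build_prompts` and the `question` / `birth_year` column names are assumptions for this example. The question template is the one used throughout the tutorial.

```python
import pandas as pd

# A few rows copied from the dataset preview above.
sample = pd.DataFrame({
    "name": ["Tadeusz Borowski", "Florian Jahr", "Joseph C. O'Mahoney"],
    "birth": [1922.0, 1983.0, 1884.0],
})

def build_prompts(df):
    """Return a frame with a `question` column and an integer `birth_year` target."""
    out = df.dropna(subset=["name", "birth"]).copy()   # skip rows without a name or year
    out["question"] = "When was " + out["name"] + " born?"
    out["birth_year"] = out["birth"].astype(int)       # 1922.0 -> 1922
    return out[["question", "birth_year"]]

prompts = build_prompts(sample)
print(prompts.to_string(index=False))
```

Each `question` is what gets tokenized and passed through the model, and `birth_year` is the label the probe is trained to predict.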

### **How does it work?**

1. **Data selection:** We select input $x$ and labels for the auxiliary task. For example, this could be part-of-speech detection, time-frame extraction, or fact classification.

2. **Hidden state extraction:** We pass input $x$ through the model and save hidden states $hidden_i$ from one or more layers.

3. **Probe training:** We build a simple classifier or regression model (e.g. logistic regression, SVM, or a small neural network layer) and train it on $hidden_i$. This model is our "probe".

4. **Analysis:** We evaluate the performance of the probe. If the probe solves the task well, this suggests that the information needed for the task is encoded in $hidden_i$ in a form the probe can read out.
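
The four steps above can be sketched end to end on synthetic data. This is a toy illustration, not the tutorial's experiment. The "hidden states" are random vectors, the "year" is constructed to depend linearly on a few of their dimensions, and the probe is a closed-form ridge regression written with NumPy. All sizes and the helper names `fit_ridge` and `r2_score` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Steps 1-2 (stand-in): 500 synthetic "hidden states" of dimension 64.
n_samples, hidden_dim = 500, 64
X = rng.normal(size=(n_samples, hidden_dim))

# Synthetic target: a "year" that depends linearly on five dimensions, plus noise.
true_w = np.zeros(hidden_dim)
true_w[:5] = [10.0, -6.0, 4.0, 3.0, -2.0]
y = 1900.0 + X @ true_w + rng.normal(scale=2.0, size=n_samples)

# Hold out the last 100 examples for evaluation.
X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalised intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 is perfect, 0 matches predicting the mean."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Step 3: train the probe.
w, b = fit_ridge(X_train, y_train)

# Step 4: evaluate it on held-out data.
probe_r2 = r2_score(y_test, X_test @ w + b)

# Control: the same probe trained on shuffled labels should not generalise.
w_c, b_c = fit_ridge(X_train, rng.permutation(y_train))
control_r2 = r2_score(y_test, X_test @ w_c + b_c)

print(f"probe R^2: {probe_r2:.3f}, shuffled-label control R^2: {control_r2:.3f}")
```

In a real run, `X` would hold the hidden states of one layer and `y` the birth years, and the procedure would be repeated per layer to compare scores. The shuffled-label control is a simple sanity check that a high score comes from the representations rather than from the probe's own capacity.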

In [20]:
# Function for collecting hidden-state activations and the generated year
def get_activations_and_attention(model, enc_inputs, pattern_to_response):
    activations = {}

    with torch.no_grad():
        outputs = model(**enc_inputs, output_hidden_states=True)
        output_ids = model.generate(**enc_inputs, max_length=30)

    # Decode the generated answer (uses the global `tokenizer`) and extract the year
    decoded_output = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    match_ = re.search(pattern_to_response, decoded_output)
    year = match_.group(1) if match_ else None

    # Keep three hidden states: first, middle and last entries of the tuple
    activations['layer_1'] = outputs.hidden_states[0]  # index 0: embedding output
    activations['layer_middle'] = outputs.hidden_states[len(outputs.hidden_states) // 2]  # middle layer
    activations['layer_last'] = outputs.hidden_states[-1]  # last layer

    return activations, year

# # Collecting data for training
# activations_data_l1 = []
# activations_data_lmid = []
# activations_data_llast = []
# predicted_years = [] # Years of birth predicted
# actual_years = [] # Years of birth actual

In [21]:
# question = "When was Albert Einstein born?"
# inputs = tokenizer(question, return_tensors="pt", padding=True, truncation=True)

In [22]:
# cnt = 0
# torch.manual_seed(0)
# random.seed(0)

# np.random.seed(0)

# # We collect activations and attention for all people
# for index, row in most_popular.iloc[:5001, :].iterrows():
#     name = row['name']
#     true_year = row['birth']

#     # Create requests for each person
#     question = f"When was {name} born?"

#     # tokenize
#     inputs = tokenizer(question, return_tensors="pt", padding=True, truncation=True)

#     # Extracting activations and attention
#     activations, predicted = get_activations_and_attention(model, inputs, pattern)

#     #Layer activations
#     act_layer_1 = activations['layer_1'].flatten().cpu().numpy()
#     act_layer_middle = activations['layer_middle'].flatten().cpu().numpy()
#     act_layer_last = activations['layer_last'].flatten().cpu().numpy()

#     # collect data
#     activations_data_l1.append(act_layer_1)
#     activations_data_lmid.append(act_layer_middle)
#     activations_data_llast.append(act_layer_last)

#     predicted_years.append(predicted)
#     actual_years.append(true_year)


#     cnt += 1

#     if cnt % 500 == 0:
#       print(cnt)