Sheet 6.1 LLM probing & attribution
========
**Author:** Polina Tsvilodub

In this sheet, we will familiarize ourselves with some methods for looking "under the hood" of transformers. In particular, we will see how to visualize and trace which inputs are processed where in the model, how they contribute to the output, and what kinds of information are processed. Specifically, the learning goals for this sheet are:

* familiarization with transformer attention visualization for inspecting attention patterns
* understanding how to extract representations of a model from different layers 
* familiarization with probing of a transformer's syntactic 'knowledge'.

## Attention visualization

One of the core processing mechanisms in the transformer is the attention mechanism. As discussed in the lecture on transformers, depending on the architecture of the model, there might be various attention blocks: 
* if the model is an encoder-only model (e.g., BERT), it has encoder self-attention; 
* if it is a decoder-only model (e.g., all GPT models), it has decoder (i.e., causal) self-attention; 
* if it is an encoder-decoder model (e.g., translation models, architectures like T5), it has both of these and, additionally, cross-attention between the encoder and the decoder.
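
The three cases above differ in which positions a token is allowed to attend to. The following minimal sketch builds these visibility patterns for toy sequence lengths. The helper name `visibility_masks` and the chosen lengths are illustrative assumptions and not part of any library.

```python
import numpy as np

def visibility_masks(src_len, tgt_len):
    """Return boolean masks (rows = attending token, columns = attended token)."""
    # encoder self-attention: every source token may attend to every source token
    encoder_self = np.ones((src_len, src_len), dtype=bool)
    # decoder (causal) self-attention: token i may only attend to tokens 0..i
    decoder_self = np.tril(np.ones((tgt_len, tgt_len), dtype=bool))
    # cross-attention: every target token may attend to every source token
    cross = np.ones((tgt_len, src_len), dtype=bool)
    return encoder_self, decoder_self, cross

encoder_self, decoder_self, cross = visibility_masks(src_len=4, tgt_len=3)
print(decoder_self.astype(int))
```

The printed decoder mask is lower-triangular: each row has ones only up to and including its own position, which is the causal pattern to look for in the decoder attention visualizations.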


First, we will inspect attention visualizations, which indicate the magnitudes of the attention scores between a specific token $i$ and the other tokens. (Reminder: the scores are computed as the dot products of token $i$'s query vector with the other tokens' key vectors, normalized with a softmax.) Intuitively, the larger a score, the more the representation of the respective other token contributes to predicting the output based on $i$.
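
The reminder above can be made concrete with a minimal NumPy sketch that computes the attention weights of one query vector over a few key vectors. The function name `attention_weights` and the toy vectors are made up for illustration. The scaling by the square root of the key dimension follows the standard formulation and is an assumption here, since individual architectures may handle it differently.

```python
import numpy as np

def attention_weights(query, keys):
    """Softmax-normalized, scaled dot products between one query and all keys."""
    # raw scores: one dot product per key vector
    scores = keys @ query
    # scaling by sqrt(d_k) as in the standard formulation (an assumption here;
    # individual architectures may handle scaling differently)
    scores = scores / np.sqrt(query.shape[-1])
    # softmax (shifted by the maximum for numerical stability)
    exp_scores = np.exp(scores - scores.max())
    return exp_scores / exp_scores.sum()

# toy example: three tokens with 4-dimensional key vectors
keys = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
# query of token i, constructed to point in the direction of the second key
query = np.array([0.0, 2.0, 0.0, 0.0])

weights = attention_weights(query, keys)
print(weights)
```

The weights sum to one, and the second token receives the largest weight because its key has the largest dot product with the query.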

We start by exploring the example from the lecture (slide 46) hands-on. In the example, a sequence-to-sequence (i.e., encoder-decoder) model is used to translate the English sentence "The brown dog ran." into the French sentence "Le chien brun a couru.". We will load the [FLAN-T5 small model](https://huggingface.co/google/flan-t5-small), a seq2seq model fine-tuned to follow various task instructions (including translation). 

We will use the package [BertViz](https://github.com/jessevig/bertviz) for the visualization. It allows us to explore parts of the model interactively, i.e., to select specific model parts (e.g., encoder or decoder), specific layers (i.e., attention layers in transformer blocks), and attention heads.

In [None]:
# install the packages required for running the visualization
#!pip install bertviz ipywidgets

In [1]:
# import packages
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from bertviz import model_view, head_view


In [2]:
# load the model
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
model_t5 = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")



In [11]:
# define input and target
input_ids = tokenizer.encode("Translate to French: The brown dog ran.", return_tensors="pt")
target_ids = tokenizer.encode("Le chien brun a couru.", return_tensors="pt")
# Run model and get the attentions by setting output_attentions=True
output = model_t5(input_ids=input_ids, labels=target_ids, output_attentions=True, return_dict=True) 


# we will need to pass the attention scores to the visualization function
# therefore, we look at the output of the model to see how to access the attention scores
print(output.keys())

odict_keys(['loss', 'logits', 'past_key_values', 'decoder_attentions', 'cross_attentions', 'encoder_last_hidden_state', 'encoder_attentions'])



In [16]:
# we retrieve various attention scores from the output
encoder_attention = output.encoder_attentions
cross_attention = output.cross_attentions
decoder_attention = output.decoder_attentions

# furthermore, to make the visualization easier to interpret, we convert the token ids to the corresponding token strings
input_tokens = tokenizer.convert_ids_to_tokens(input_ids[0]) 
decoder_tokens = tokenizer.convert_ids_to_tokens(target_ids[0])

In [17]:
# now we use the overall model attention visualization
# select the attention parts you want to look at via the drop-down
# click on the facets to zoom in on the attention heads in a specific layer
model_view(
    encoder_attention=encoder_attention,
    decoder_attention=decoder_attention,
    cross_attention=cross_attention,
    encoder_tokens=input_tokens,
    decoder_tokens=decoder_tokens
)


Now, we zoom in on the encoder attention, which is used to create representations of the instruction + source sentence. To do so, we inspect the `encoder_attention` below.

In [12]:
# there is also a head view that shows the attention heads of a single layer
# (the layer can be selected via the drop-down);
# a single head can be isolated by double-clicking on its colored tile
head_view(encoder_attention, input_tokens)


Next, we look at the *cross-attention*, i.e., the attention weights computed based on query vectors of the decoder representations and key vectors from the encoder. Intuitively, these represent the importance of input representations (the English sentence) for computing the output (French sentence).

In [20]:
# by default, the head view visualizes self-attention (i.e., attention among the tokens of a single sequence). 
# For cross-attention, pass the cross_attention parameter together with the encoder and decoder tokens
head_view(
    cross_attention=cross_attention, 
    encoder_tokens=input_tokens, 
    decoder_tokens=decoder_tokens
)
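
Beyond visual inspection, the same kind of cross-attention weights can be summarized numerically. The following is a minimal sketch, assuming the attention is given as a sequence with one array per layer of shape `(batch, heads, target_length, source_length)`, which is the layout of `output.cross_attentions`. The function name `strongest_source_tokens`, the toy tokens, and the toy numbers are made up for illustration and are not model outputs.

```python
import numpy as np

def strongest_source_tokens(cross_attention, encoder_tokens, decoder_tokens, layer=-1):
    """For each decoder token, find the encoder token with the highest head-averaged weight."""
    # attention of the chosen layer: (batch, heads, tgt_len, src_len)
    layer_attention = np.asarray(cross_attention[layer])
    # take the first batch element and average over heads: (tgt_len, src_len)
    mean_over_heads = layer_attention[0].mean(axis=0)
    # index of the most strongly attended source token per target token
    best = mean_over_heads.argmax(axis=-1)
    return [
        (tgt, encoder_tokens[j], float(mean_over_heads[i, j]))
        for i, (tgt, j) in enumerate(zip(decoder_tokens, best))
    ]

# toy data (hypothetical numbers): one layer, batch size 1, two heads
toy_layer = np.array([[
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],   # head 0
    [[0.5, 0.3, 0.2], [0.2, 0.6, 0.2]],   # head 1
]])
toy_cross_attention = (toy_layer,)

summary = strongest_source_tokens(
    toy_cross_attention,
    encoder_tokens=["The", "dog", "ran"],
    decoder_tokens=["Le", "chien"],
)
for target, source, weight in summary:
    print(f"{target} -> {source} ({weight:.2f})")
```

To apply this sketch to the model output above, the per-layer tensors would first need to be converted to arrays, e.g., `[a.detach().numpy() for a in cross_attention]`. Averaging over heads is only one possible design choice; it hides differences between heads, which can be inspected by indexing a single head instead.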


> <strong><span style="color:#D83D2B;">Exercise 6.1.1: Interpreting attention scores</span></strong>
>
> 1. Inspect the visualization above. How many layers does the model have? How many attention heads per layer are there? Access visualizations of scores of a single attention head. Do you observe any interesting patterns across layers and / or attention heads? 
> 2. Consider the input "What is the capital of France?" and output "The capital is Paris". Intuitively, which tokens do you expect to receive high attention scores, from which other tokens, and in which part of the model? Adapt the code above to this example and inspect the output. Do the results match your intuition?
> 3. Use the functions above to inspect decoder attention. Make sure you identify the causal part of the attention scores.

**TODO**
* ideally, some example where different attention heads do different things: e.g., Jack's world capital or IOI task?
* some code for quantitative work with attention scores (i.e., not just eyeballing but using the actual numbers)

In [None]:
# TODO: neuron view

## Gradient tracing

TODO

In [None]:
# TODO

## Probing

**TODO**

Resources:
* https://github.com/rycolab/probing-via-prompting