In [1]:
# !pip install -U torch
# !pip install accelerate
# !pip install transformers

In [1]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from getpass import getpass
import gc

In [2]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

In [3]:
hf_token = getpass()

 ········


## Extracting lookback-ratio

The original Lookback-Ratio code is Llama-specific, so I wrote a script that takes a model, a tokenizer, and a list of texts, generates an answer for each text, and extracts the lookback-ratio:

In [4]:
llama2_model_name = "meta-llama/Llama-2-7b-chat-hf"

llama2_tokenizer = AutoTokenizer.from_pretrained(llama2_model_name, token=hf_token)
llama2_model = AutoModelForCausalLM.from_pretrained(llama2_model_name, device_map="auto", token=hf_token)

  warn(f"Failed to load image Python extension: {e}")


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Some parameters are on the meta device because they were offloaded to the cpu.


In [5]:
input_text = "Write me a poem about Machine Learning. Be creative!"

In [6]:
input_ids = llama2_tokenizer(input_text, return_tensors="pt").to(device)
input_ids.input_ids.shape

torch.Size([1, 13])

In [7]:
outputs = llama2_model.generate(**input_ids, max_new_tokens=32, output_attentions=True, return_dict_in_generate=True)

attentions = outputs.attentions

From v4.47 onwards, when a model cache is to be returned, `generate` will return a `Cache` instance instead by default (as opposed to the legacy tuple of tuples format). If you want to keep returning the legacy format, please set `return_legacy_cache=True`.


In [8]:
len(attentions), len(attentions[0]), attentions[0][0].shape

(32, 32, torch.Size([1, 32, 13, 13]))

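```attentions``` is a tuple (32 elements, one per generated token) of tuples (32 elements, one per layer) of tensors with shape (1 = batch_size, 32 = num_heads, 13 = query_length, 13 = key_length). For the first generated token the last two dimensions both equal the prompt length, because the whole prompt is processed at once.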

In [9]:
attentions[1][0].shape

torch.Size([1, 32, 1, 14])

But for tokens other than the first, the shape of the tensor is different. The 3rd dimension is always 1, because we are generating one token at a time, and the 4th dimension is always 1 element longer than for the preceding token. For ```attentions[2][0].shape``` the last dimension will be 15, and so on.
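The growth pattern described above can be written down as a small helper. This is a minimal sketch in plain Python; `growing_attention_shapes` is a hypothetical name, and it only reproduces the shapes observed for Llama 2 in this notebook, not a rule that holds for every model.

```python
def growing_attention_shapes(prompt_length, n_heads, max_new_tokens, batch_size=1):
    """Per-step attention shapes when the key dimension grows with each token."""
    shapes = []
    for step in range(max_new_tokens):
        # Step 0 processes the whole prompt; later steps process one new token.
        query_length = prompt_length if step == 0 else 1
        # Keys cover the prompt plus every token generated before this step.
        key_length = prompt_length + step
        shapes.append((batch_size, n_heads, query_length, key_length))
    return shapes


shapes = growing_attention_shapes(prompt_length=13, n_heads=32, max_new_tokens=32)
print(shapes[0])  # (1, 32, 13, 13)
print(shapes[1])  # (1, 32, 1, 14)
print(shapes[2])  # (1, 32, 1, 15)
```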

**However, those shapes differ between models:**

In [10]:
del llama2_model

torch.cuda.empty_cache()

gc.collect()

12132

In [15]:
gemma2_model_name = "google/gemma-2-2b-it"

gemma2_tokenizer = AutoTokenizer.from_pretrained(gemma2_model_name, token=hf_token)
gemma2_model = AutoModelForCausalLM.from_pretrained(gemma2_model_name, device_map="auto", token=hf_token)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [16]:
input_ids = gemma2_tokenizer(input_text, return_tensors="pt").to(device)
input_ids.input_ids.shape

torch.Size([1, 12])

In [17]:
outputs = gemma2_model.generate(**input_ids, max_new_tokens=32, output_attentions=True, return_dict_in_generate=True)

attentions = outputs.attentions

In [19]:
len(attentions), len(attentions[0]), attentions[0][0].shape

(32, 26, torch.Size([1, 8, 12, 44]))

In [20]:
attentions[1][0].shape

torch.Size([1, 8, 1, 44])

In this case the last dimension of the tensor is always 44 (12 prompt tokens + 32 generated tokens). It does not grow with each generated token. Positions for tokens that have not been generated yet hold zeros.
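One way to make the two layouts comparable is to cut each attention row down to the positions that actually exist at that step. A minimal sketch on plain Python lists follows; `trim_to_valid_keys` is a hypothetical helper, and the toy row with made-up values stands in for one head's attention at one step.

```python
def trim_to_valid_keys(attention_row, prompt_length, step):
    """Keep only the key positions that exist at this generation step."""
    valid_length = prompt_length + step
    return attention_row[:valid_length]


# Toy fixed-width row: 3 prompt tokens, 2 generated so far, 3 unused slots.
padded_row = [0.5, 0.2, 0.1, 0.15, 0.05, 0.0, 0.0, 0.0]
trimmed = trim_to_valid_keys(padded_row, prompt_length=3, step=2)
print(trimmed)  # [0.5, 0.2, 0.1, 0.15, 0.05]

# Everything past the valid length is zero padding, so no attention mass is lost.
assert all(value == 0.0 for value in padded_row[len(trimmed):])
```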

The function below is designed to extract the lookback-ratio despite those differences:

In [21]:
def extract_lbrs(model, tokenizer, input_texts):
  results = []

  for input_text in input_texts:
    print(50*"=")
    input_ids = tokenizer(input_text, return_tensors="pt").to(device)
    print("input_ids.input_ids.shape", input_ids.input_ids.shape)
    outputs = model.generate(**input_ids, max_new_tokens=32, output_attentions=True, return_dict_in_generate=True)

    attentions = outputs.attentions
    new_token_length = len(attentions)
    n_layers = len(attentions[0])
    n_heads = attentions[0][0].shape[1]

    print("new_token_length: {}, n_layers: {}, n_heads: {}".format(new_token_length, n_layers, n_heads))

    prompt_length = input_ids.input_ids.shape[1]
    print("prompt_length", prompt_length)

    lbr = torch.zeros(n_layers, n_heads, new_token_length).to(device)

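    for t in range(new_token_length):
      for l in range(n_layers):
        # Attention of the newest position (last query row) to the prompt tokens.
        attn_on_context = attentions[t][l][0, :, -1, :prompt_length]
        avg_attn_on_context = attn_on_context.mean(-1)

        # Attention to the t tokens generated so far. The slice stops at prompt_length+t,
        # so zero-padded positions in fixed-width layouts are skipped.
        attn_on_new = attentions[t][l][0, :, -1, prompt_length:prompt_length+t]
        # At t=0 the slice is empty, so use zeros instead of the NaN that mean() would give.
        avg_attn_on_new = attn_on_new.mean(-1) if attn_on_new.shape[-1] > 0 else torch.zeros(attn_on_new.shape[0]).to(device)

        lbr[l, :, t] = avg_attn_on_context / (avg_attn_on_context + avg_attn_on_new)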
    print("lbr.shape", lbr.shape)

    decoded = tokenizer.batch_decode(outputs.sequences, skip_special_tokens=False)[0]
    print("decoded", decoded)

    results.append({
        "input_text": input_text,
        "decoded": decoded,
        "lbr": lbr
    })

  return results
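
To see what the inner loop computes for a single head, here is a minimal sketch on plain Python lists. `lookback_ratio` is a hypothetical helper that mirrors the slicing in `extract_lbrs`; the attention values are made up for illustration.

```python
def lookback_ratio(attention_row, prompt_length, step):
    """Lookback ratio for one head at one step, mirroring extract_lbrs."""
    context = attention_row[:prompt_length]
    new = attention_row[prompt_length:prompt_length + step]
    avg_context = sum(context) / len(context)
    # No generated tokens exist at step 0, so their average attention is 0.
    avg_new = sum(new) / len(new) if new else 0.0
    return avg_context / (avg_context + avg_new)


# Same attention weights in the growing layout and in the zero-padded layout.
growing_row = [0.4, 0.2, 0.2, 0.1, 0.1]
padded_row = growing_row + [0.0, 0.0, 0.0]

ratio_growing = lookback_ratio(growing_row, prompt_length=3, step=2)
ratio_padded = lookback_ratio(padded_row, prompt_length=3, step=2)
print(ratio_growing == ratio_padded)  # True
```

Because the slice for generated tokens ends at `prompt_length + step`, the padded slots never enter the average, which is why one formula covers both layouts.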

In [22]:
lbrs = extract_lbrs(gemma2_model, gemma2_tokenizer, [input_text])

input_ids.input_ids.shape torch.Size([1, 12])
new_token_length: 32, n_layers: 26, n_heads: 8
prompt_length 12
lbr.shape torch.Size([26, 8, 32])
decoded <bos>Write me a poem about Machine Learning. Be creative!

In silicon valleys, where code takes flight,
A mind of metal, bathed in digital light.
Machine learning, a whisper in the breeze,



In [28]:
lbrs[0]["lbr"].shape

torch.Size([26, 8, 32])

The tensor covers 26 layers and 8 heads, with one lookback-ratio value for each of the 32 generated tokens.

For example, the lookback-ratio for the first layer, first head:

In [29]:
lbrs[0]["lbr"][0][0]

tensor([1.0000, 0.8250, 0.0315, 0.1540, 0.0871, 0.0709, 0.0543, 0.1421, 0.1626,
        0.0849, 0.1643, 0.0702, 0.0952, 0.2524, 0.0533, 0.4463, 0.1824, 0.5189,
        0.0738, 0.4419, 0.2973, 0.5011, 0.1802, 0.3557, 0.2904, 0.2679, 0.1703,
        0.5507, 0.1415, 0.1690, 0.4247, 0.3822], device='cuda:0')

**Side note:** The lookback-ratio for the first token is exactly 1, because it is computed as attention_on_context / (attention_on_context + attention_on_new), and attention_on_new is 0 for the first token. However, in the original Lookback-Ratio implementation the value for the first token is not equal to 1. The reason is that the original implementation appends some extra text to the end of the prompt when building it, for example ```"\n#Summary#:"``` for the summarization task. When the lookback-ratio is calculated, those extra tokens are treated as if they were generated by the model, although they were not. This is why the lookback-ratio there is not equal to 1, although according to the paper it should be.
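
The effect of the appended prompt suffix can be reproduced on toy numbers. In this sketch (plain Python, made-up attention values, hypothetical names), the first-step ratio is computed twice: once counting the whole prompt as context, and once treating the last `suffix_length` prompt tokens as if they were generated.

```python
def first_step_lookback_ratio(attention_row, context_length):
    """Lookback ratio at the first generation step.

    Positions before context_length count as context; any remaining
    positions are treated as generated tokens.
    """
    context = attention_row[:context_length]
    treated_as_new = attention_row[context_length:]
    avg_context = sum(context) / len(context)
    avg_new = sum(treated_as_new) / len(treated_as_new) if treated_as_new else 0.0
    return avg_context / (avg_context + avg_new)


# Toy attention over a 6-token prompt whose last 2 tokens are an appended suffix.
row = [0.3, 0.2, 0.1, 0.1, 0.2, 0.1]
suffix_length = 2

whole_prompt = first_step_lookback_ratio(row, context_length=len(row))
suffix_as_new = first_step_lookback_ratio(row, context_length=len(row) - suffix_length)
print(whole_prompt)         # 1.0
print(suffix_as_new < 1.0)  # True
```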

## Modifying attention "by hand"

1. I tried to use torch hooks (model.register_forward_hook()). However, a forward hook is called **after** forward has run, so even though we can read the attention maps, the token has already been generated.
2. Another approach is to modify the transformers library code, but it takes more time, is prone to bugs, and is not model-agnostic, so I gave it up.
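
The timing issue from item 1 can be seen on a toy module. This sketch assumes only that PyTorch is installed; the `TinyBlock` module and the `events` list are made up for illustration and have nothing to do with the attention layers of a real model.

```python
import torch
from torch import nn

events = []


class TinyBlock(nn.Module):
    def forward(self, x):
        events.append("forward")
        return x * 2


def after_forward(module, inputs, output):
    # A forward hook receives the output that forward has already computed.
    events.append("forward_hook")


block = TinyBlock()
handle = block.register_forward_hook(after_forward)
result = block(torch.ones(2))
handle.remove()  # Detach the hook so later calls are unaffected.

print(events)  # ['forward', 'forward_hook']
```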