<a href="https://www.kaggle.com/code/aisuko/interpreting-the-attention-scores-tensors?scriptVersionId=164540244" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

For each token in the sequence, the attention mechanism identifies which other tokens are important for understanding the current token in the given context. A simple way to understand attention is to think of it as a method that replaces each token embedding with an embedding that includes information about its neighbouring tokens, instead of using the same embedding for every token regardless of its context.
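As a rough illustration of that idea, the sketch below runs scaled dot-product self-attention on random embeddings. The tensor sizes, the seed, and the use of the raw embeddings as queries, keys, and values are simplifying assumptions (a real Transformer layer applies learned projections first), so this is not BERT's actual implementation.

```python
import math
import torch

torch.manual_seed(0)  # arbitrary seed, only for repeatability

seq_len, dim = 4, 8  # hypothetical sizes chosen for illustration
embeddings = torch.randn(seq_len, dim)

# Simplification: use the embeddings directly as queries, keys and values.
scores = embeddings @ embeddings.T / math.sqrt(dim)  # (seq_len, seq_len)
weights = torch.softmax(scores, dim=-1)              # each row sums to 1

# Each output row is a weighted mix of all token embeddings.
contextual = weights @ embeddings                    # (seq_len, dim)

print(weights.sum(dim=-1))
print(contextual.shape)
```

Each row of `weights` plays the same role as one row of the attention tensors inspected later in this notebook.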


# How to check attention scores?

As an example, we will calculate self-attention, where scores are computed between all pairs of input tokens in a sequence. We do this with the BERT model, which uses self-attention in its Transformer layers.

In [1]:
import torch
from transformers import BertTokenizer, BertModel

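tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

prompt = "Western Victoria"
# add_special_tokens=True wraps the prompt as [CLS] western victoria [SEP]
input_ids = tokenizer.encode(prompt, add_special_tokens=True)

# Convert to tensor and add batch dimension
input_ids = torch.tensor(input_ids).unsqueeze(0)

outputs = model(input_ids, output_attentions=True)
# Tuple with one tensor per layer, each of shape (batch, num_heads, seq_len, seq_len)
attention = outputs.attentions

# Shape of the first layer's attention weights
print(attention[0].shape)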

torch.Size([1, 12, 4, 4])


# The attention scores of `Western`

In [2]:
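# Position of the "western" token in the input sequence (1, right after [CLS])
western_index = input_ids[0].tolist().index(tokenizer.encode("Western", add_special_tokens=False)[0])

# attention[0] has shape (batch, num_heads, seq_len, seq_len), so indexing axis 1
# selects the head at `western_index` and returns its full 4x4 query-by-key matrix
print(attention[0][:, western_index, :])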

tensor([[[0.9602, 0.0198, 0.0124, 0.0076],
         [0.0209, 0.2147, 0.4705, 0.2939],
         [0.0738, 0.4338, 0.2646, 0.2278],
         [0.2916, 0.2842, 0.1946, 0.2295]]], grad_fn=<SliceBackward0>)
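Because `attention[0]` has four axes (batch, head, query token, key token), the axis being indexed determines what comes back. The sketch below uses a random stand-in tensor with the same shape instead of the real model output, so its values are placeholders; `layer0` and `token_index` are illustrative names.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only for repeatability

# Stand-in for attention[0]: (batch, num_heads, seq_len, seq_len)
layer0 = torch.softmax(torch.randn(1, 12, 4, 4), dim=-1)
token_index = 1  # position of "western" in [CLS] western victoria [SEP]

# Indexing axis 1 picks one head and keeps its whole query-by-key matrix.
one_head = layer0[:, token_index, :]
print(one_head.shape)   # torch.Size([1, 4, 4])

# Indexing axis 2 picks one query token and keeps its row in every head.
one_token = layer0[0, :, token_index, :]
print(one_token.shape)  # torch.Size([12, 4])
```

The first form matches the `[1, 4, 4]` shape printed above. The second form is one way to look at a single token's attention across all 12 heads.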


# How to interpret the attention scores tensor?

Before interpreting it, we need to understand the concept of the batch dimension.


# Batch dimension

In ML, a model can process inputs in batches. The batch dimension is an additional dimension used to hold multiple instances of the data. For example, if we are processing images that are `28x28` pixels, each image would be a `28x28` matrix. If you want to process 32 images at once, you could stack these matrices into a 3D tensor with shape **(32, 28, 28)**. Here, 32 is the batch size, and the batch dimension is the first dimension of the tensor.

Here are a few reasons for using a batch dimension:

**Efficiency**
Processing inputs in batches allows the model to parallelize operations, which can lead to significant speedups, especially on hardware like GPUs.

**Generalization**
Training on batches can help the model generalize better, as each update is based on several examples rather than a single one, which makes the updates less noisy.

**Memory usage**
The batch size gives direct control over memory usage: larger batches require more memory, so the size can be reduced to fit the available hardware.


In the context of the BERT model, the input tensor needs to have a batch dimension because the model is designed to process multiple inputs at once. Even if you're only processing one sentence, you still need to add a batch dimension with size 1. The `unsqueeze(0)` operation was used to add this extra dimension.
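The two cases described in this section, stacking several inputs and wrapping a single input, can be sketched as follows. The zero-filled images and the token ids are placeholder values chosen only to show the shapes.

```python
import torch

# 32 hypothetical 28x28 "images" stacked along a new first axis
images = [torch.zeros(28, 28) for _ in range(32)]
batch = torch.stack(images)
print(batch.shape)        # torch.Size([32, 28, 28])

# A single sequence of 4 token ids (placeholder values, not real vocabulary ids)
token_ids = torch.tensor([0, 1, 2, 3])
batched_ids = token_ids.unsqueeze(0)  # add a batch axis of size 1 at position 0
print(batched_ids.shape)  # torch.Size([1, 4])
```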

## Interpret the tensor of `Western`

We can interpret the data as follows:

```
Batch 1:
  Head 2 (index 1):
    Token 1 ([CLS]) attention scores: [0.9602, 0.0198, 0.0124, 0.0076]
    Token 2 (western) attention scores: [0.0209, 0.2147, 0.4705, 0.2939]
    Token 3 (victoria) attention scores: [0.0738, 0.4338, 0.2646, 0.2278]
    Token 4 ([SEP]) attention scores: [0.2916, 0.2842, 0.1946, 0.2295]
```

* The first dimension with size 1 is the batch size. This dimension is used because the model can process multiple inputs at once. In this case, we're only processing one input, so the size is 1.

* The second dimension with size 4 is the position of the token that is paying attention (the query). The head dimension (`num_heads`, 12 in the shape printed earlier) no longer appears, because indexing it with `western_index` selected a single head, the one at index 1. In the BERT model, the attention mechanism is applied multiple times in parallel (once per head) to capture different types of relationships in the input.

* The third dimension with size 4 is the position of the token being attended to (the key). Both 4s are the sequence length, that is, the number of tokens in the input sequence: the two word tokens `western` and `victoria` plus the special tokens `[CLS]` and `[SEP]` added by the tokenizer. No truncation or padding is involved.

So, the tensor represents the attention scores from one head for each token in a sequence of 4 tokens. Each score is a value between 0 and 1 that indicates how much attention a token pays to each token in the sequence, and the scores in each row sum to 1.

According to the above data, in the first batch, this head gives the first token (Token 1) a high attention score of 0.9602 with itself and much lower scores with the other tokens. The second token (Token 2) pays the most attention to the third token (0.4705), the third token pays the most attention to the second token (0.4338), and the fourth token spreads its attention fairly evenly across the sequence.
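A quick way to check a reading like this is to recompute the row sums and the strongest link per row from the printed values. The numbers below are copied from the tensor shown earlier with the batch axis dropped; the token labels assume the tokenizer produced `[CLS] western victoria [SEP]`.

```python
import torch

# Values copied from the printed tensor (batch axis dropped)
scores = torch.tensor([
    [0.9602, 0.0198, 0.0124, 0.0076],
    [0.0209, 0.2147, 0.4705, 0.2939],
    [0.0738, 0.4338, 0.2646, 0.2278],
    [0.2916, 0.2842, 0.1946, 0.2295],
])
tokens = ["[CLS]", "western", "victoria", "[SEP]"]  # assumed token labels

# Each row is a softmax output, so it should sum to roughly 1
print(scores.sum(dim=-1))

# For each query token, report the key token with the largest weight
strongest = scores.argmax(dim=-1).tolist()
for query, key_index in zip(tokens, strongest):
    print(f"{query} -> {tokens[key_index]}")
```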
