# Transformers and self-attention exercise

### Self-Attention Mechanism

Self-attention lets the model weigh how relevant every other word (token) in a sentence is to each word.
This captures relationships and dependencies between words regardless of how far apart they are in the sentence.
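
To make the weighing concrete, here is a minimal NumPy sketch of scaled dot-product self-attention for a single sequence. It is an illustration under stated assumptions, not BERT's actual implementation: the function names, the random inputs, and the sizes (`seq_len`, `d_model`, `d_k`) are made up for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention for one sequence.

    x: [sequence_length, d_model]; w_q, w_k, w_v: [d_model, d_k]
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])      # [sequence_length, sequence_length]
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ v, weights

# Hypothetical toy sizes and random data, only for illustration
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(weights.shape)         # one row of attention weights per token
print(weights.sum(axis=-1))  # every row sums to 1
```

Row `i` of `weights` says how strongly token `i` attends to each token in the sequence, which is the kind of matrix the heatmap in this exercise displays.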

### Multi-Head Attention

Instead of calculating attention scores once, multi-head attention splits the computations into multiple parallel "heads".
Each head performs its own self-attention calculations on a different linear projection of the input.
This allows the model to focus on different parts of the sentence simultaneously, capturing various aspects of the relationships between words.
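
The head split can be sketched in NumPy as follows. This is a simplified illustration that only computes the per-head attention weights; the value and output projections are left out, and all names and sizes (`multi_head_attention_weights`, `d_model = 24`, four heads) are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention_weights(x, w_q, w_k, num_heads):
    """Return attention weights of shape [num_heads, seq_len, seq_len].

    x: [seq_len, d_model]; w_q, w_k: [d_model, d_model]
    """
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    head_dim = d_model // num_heads
    # Project, then split the feature dimension into heads:
    # [seq_len, d_model] -> [num_heads, seq_len, head_dim]
    q = (x @ w_q).reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)
    k = (x @ w_k).reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)
    # Each head computes its own scaled dot-product scores
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    return softmax(scores, axis=-1)

# Hypothetical toy sizes and random data, only for illustration
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 6, 24, 4
x = rng.normal(size=(seq_len, d_model))
w_q = rng.normal(size=(d_model, d_model))
w_k = rng.normal(size=(d_model, d_model))

weights = multi_head_attention_weights(x, w_q, w_k, num_heads)
print(weights.shape)  # one [seq_len, seq_len] attention map per head
```

Because every head sees a different slice of the projected input, the resulting attention maps generally differ from head to head.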

### Number of Heads (num_heads)

The number of attention heads in the multi-head attention mechanism.

If num_heads is 12, as in the `bert-base-uncased` model used below, each layer runs 12 separate attention calculations in parallel.
Each head operates on its own linear projection of the input, so different heads can capture different patterns.


### Code explanation 

* The attention tensor returned for each layer has the shape [batch_size, num_heads, sequence_length, sequence_length]. The code below uses the last layer.
* For a single sentence (batch_size = 1), the shape is [1, num_heads, sequence_length, sequence_length].
* The code removes the batch dimension before plotting, so in the visualization attention[0] refers to the first head (index 0) of the multi-head attention mechanism. That is, the heatmap for attention[0] shows how the first head attends to different tokens: each row shows how one token attends to every token in the sequence.
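
The indexing described above can be tried on a stand-in array without loading the model. The sketch below fills an array of the same shape with random, row-normalized values; the sizes and the random data are assumptions for illustration only, not real BERT attention.

```python
import numpy as np

# Hypothetical sizes: one sentence, 12 heads, 7 tokens
batch_size, num_heads, sequence_length = 1, 12, 7

# Stand-in for a layer's attention tensor: random values, normalized so that
# each row sums to 1, as softmax output would
rng = np.random.default_rng(0)
raw = rng.random((batch_size, num_heads, sequence_length, sequence_length))
attention = raw / raw.sum(axis=-1, keepdims=True)

per_head = attention.squeeze(0)       # remove batch dimension: [num_heads, seq, seq]
first_head = per_head[0]              # what the heatmap shows for head index 0
last_head = per_head[num_heads - 1]   # the highest valid head index is num_heads - 1

print(attention.shape, per_head.shape, first_head.shape)
```

Valid values for the head index therefore run from 0 to num_heads - 1.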


In the printed tokens, you'll see two special tokens:

* [CLS]: This token appears at the beginning of the sequence and is used to capture the overall context of the input.
* [SEP]: This token marks the end of a sentence. With a single input sentence, it appears at the end of the sequence.


--------------------------

# >>> Your task

* Play around with different sentences
* Try visualizing different attention heads

----------------------

In [1]:
from transformers import BertTokenizer, BertModel

In [2]:
import torch
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
import warnings
warnings.filterwarnings('ignore')

In [4]:
# Function to visualize self-attention heatmap
def visualize_self_attention(sentence, head_to_viz):
    # Load pre-trained model and tokenizer
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased', output_attentions=True)

    # Tokenize input sentence
    inputs = tokenizer(sentence, return_tensors="pt")

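    # Get model outputs (inference only, so gradient tracking is disabled)
    with torch.no_grad():
        outputs = model(**inputs)
    # Attention from the last layer: [1, num_heads, sequence_length, sequence_length]
    attention = outputs.attentions[-1]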

    # Process attention scores
    attention = attention.squeeze(0).detach().numpy()  # Remove batch dimension
    token_ids = inputs["input_ids"].squeeze(0).numpy()  # Remove batch dimension
    tokens = tokenizer.convert_ids_to_tokens(token_ids)

    # Display tokens for clarity
    print("Tokens:", tokens)

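    # Visualize attention heatmap for the selected head
    # Rows are the attending tokens, columns are the tokens being attended to
    fig, ax = plt.subplots(figsize=(10, 10))
    sns.heatmap(attention[head_to_viz], xticklabels=tokens, yticklabels=tokens, cmap='viridis', ax=ax)
    ax.set_title(f"Self-Attention Heatmap (last layer, head {head_to_viz})")
    plt.show()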

In [None]:
# Example sentence
sentence = "The girl poured water into a tall glass until it overflowed."

# Visualize self-attention for the example sentence
visualize_self_attention(sentence, 0)