# Attention mechanisms

## Dataset

We will use a dataset from <https://github.com/Charlie9/enron_intent_dataset_verified?tab=readme-ov-file>. This dataset consists of sentences from emails sent between employees of the Enron corporation. Each sentence has been manually labeled according to whether or not it contains a request. We will train an attention model to classify sentences as "request" or "no request" sentences.

In [None]:
def read_intent_file(file_path: str) -> list[str]:
    with open(file_path, 'r') as file:
        lines = file.readlines()
    return [line.strip() for line in lines]

# Read positive and negative intent files
pos_intent_path = "data/Enron/intent_pos"
neg_intent_path = "data/Enron/intent_neg"

pos_intent_sentences = read_intent_file(pos_intent_path)
neg_intent_sentences = read_intent_file(neg_intent_path)

Take a look at some of these sentences. Does the dataset look as you would expect?

In [None]:
for i in range(5):
    # print out some sentences and remove the ...
    pass

## Tokenization

Now that we have the sentences, we need to parse them into tokens that can be fed to the model. Tokenization is a surprisingly complicated task which is highly language-dependent.

(It is not as simple as identifying words; parts of words are often tokens in their own right. The past-tense marker `-ed` must be separated from the verb `trained`, for example, to create two tokens: `train` and `ed`. German then requires a different algorithm: the word `trainiert` clearly has the token `t` at the end, but should the tokens be `trainieren` and `t`, or `trainier` and `t`?)

Luckily, we are physicists rather than linguists, and some very clever people have done the work already. We can parse the sentences using a pre-written tokenizer from `torchtext`, a text-processing companion library to PyTorch.

NB: When importing `torchtext`, you may see a warning like

```txt
A module that was compiled using NumPy 1.x cannot be run in NumPy 2.2.5 as it may crash.
```

You can safely ignore this warning and just re-run the cell. It should not affect the rest of the tutorial.

In [None]:
from torchtext.data.utils import get_tokenizer

tokenizer = get_tokenizer('basic_english')
tokens = tokenizer("Please send me the report by EOD.")
tokens

Tokenize some sentences of your choosing and see what happens. Does it work as you expect, or are there any surprises? What if you try to tokenize a non-English sentence?

In [None]:
# Tokenize some sentences

Now that we can tokenize individual sentences, we need to build up a vocabulary of tokens that appear in our training data. We will also add two special tokens to the vocabulary: a token `<unk>` representing an unknown input, and a token `<pad>` used to pad shorter sentences so that all sentences in a batch have the same length.

In [None]:
from torchtext.vocab import build_vocab_from_iterator

def yield_tokens(data_iter):
    for txt in data_iter:
        yield tokenizer(txt)

all_sentences = pos_intent_sentences + neg_intent_sentences

vocab = build_vocab_from_iterator(yield_tokens(all_sentences), specials=["<unk>", "<pad>"])

# We set the default token to be <unk>
vocab.set_default_index(vocab['<unk>'])
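
To make the idea concrete, here is a minimal, dependency-free sketch of what a vocabulary does. It is a simplified stand-in, not the `torchtext` implementation: the helper names `build_simple_vocab` and `lookup` and the toy sentences are made up for illustration, and the ordering rule (most frequent first, ties alphabetical) is an assumption of this sketch.

```python
from collections import Counter

def build_simple_vocab(tokenized_sentences, specials=("<unk>", "<pad>")):
    """Map each token to an integer index; special tokens come first."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    # Assumption of this sketch: most frequent tokens first, ties broken alphabetically
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate(list(specials) + ordered)}

def lookup(vocab_dict, tokens):
    """Convert tokens to indices, falling back to <unk> for unseen tokens."""
    unk = vocab_dict["<unk>"]
    return [vocab_dict.get(tok, unk) for tok in tokens]

toy_sentences = [["please", "send", "the", "report"], ["the", "report", "is", "late"]]
toy_vocab = build_simple_vocab(toy_sentences)

# "invoice" never appeared in the toy sentences, so it maps to the <unk> index 0
toy_ids = lookup(toy_vocab, ["please", "send", "the", "invoice"])
print(toy_ids)
```

The key point is the fallback: any token the vocabulary has never seen is replaced by the index of `<unk>`, which is the role `set_default_index` plays above.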

Take a look at a few entries of the `vocab` object. What sort of object is it? What does it map words onto?

In [None]:
# Look at some vocab entries

# print(vocab['the'])

## Model training

Now let's train an attention-based model to classify sentences as requests or not. We will train an extremely simple model with a single attention head, just to show you how it all works.

First, we need to load the data into a PyTorch `DataLoader` that can be passed to the model. The `DataLoader` takes care of batching and shuffling, and makes it easy to parallelize data loading (though that is not critical for us in this application).

We have hidden some technical details in the `data_management.py` file. Feel free to look at the implementation in there if you are curious.

In [None]:
import torch

from data_management import EnronRequestDataset, collate_fn
from torch.utils.data import DataLoader

# Gather sentences and assign labels
sentences = pos_intent_sentences + neg_intent_sentences
labels = [1] * len(pos_intent_sentences) + [0] * len(neg_intent_sentences)

# Wrap the sentences in a DataLoader object
dataset = EnronRequestDataset(sentences, labels, vocab, tokenizer)
loader  = DataLoader(dataset,
                     batch_size=32,
                     shuffle=True,
                     collate_fn=lambda batch: collate_fn(batch, vocab),
                     num_workers=0,
                     pin_memory=True)
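
Sentences in a batch have different lengths, so they must be padded to a common length before they can be stacked into one tensor. The sketch below illustrates the kind of work a collate function typically does, using plain Python lists. It is only a guess at the idea, not the code in `data_management.py`; the name `pad_batch` and the toy indices are hypothetical.

```python
def pad_batch(id_sequences, pad_idx):
    """Pad variable-length index lists to a common length and build a padding mask."""
    max_len = max(len(seq) for seq in id_sequences)
    padded, pad_mask = [], []
    for seq in id_sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_idx] * n_pad)
        # True marks a padding position that attention should ignore
        pad_mask.append([False] * len(seq) + [True] * n_pad)
    return padded, pad_mask

# Two toy "sentences" of lengths 3 and 5, padded with a hypothetical pad index of 1
toy_batch = [[6, 7, 3], [3, 2, 4, 5, 2]]
toy_padded, toy_mask = pad_batch(toy_batch, pad_idx=1)
print(toy_padded)
print(toy_mask)
```

The mask uses the same convention as the rest of this notebook: `True` where the token is `<pad>`.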


Now we can actually train the model! In order to best understand what is happening, we will use a very simple attention mechanism attached to a multilayer perceptron. The implementation has been hidden in the file `attention_model.py` --- feel free to look in there if you'd like! It's not as scary as you might think.
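
Before loading the real model, it may help to see the core computation of a single attention head written out by hand. This is a minimal sketch of scaled dot-product self-attention in plain Python, under simplifying assumptions: it omits the learned query, key, and value projections that a real attention layer has, and it is not the implementation in `attention_model.py`. The names `softmax` and `self_attention` and the toy vectors are chosen for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, pad_mask=None):
    """Single-head scaled dot-product self-attention without learned projections.

    x: list of token vectors (seq_len x dim); pad_mask[j] is True for padding keys.
    Returns (outputs, weights), where weights[i][j] is how much query i attends to key j.
    Assumes at least one key is not masked.
    """
    dim = len(x[0])
    weights = []
    for q in x:
        # Similarity of this query with every key, scaled by sqrt(dim)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in x]
        if pad_mask is not None:
            # Masked keys get a score of -inf, hence a weight of exactly zero
            scores = [-math.inf if masked else s for s, masked in zip(scores, pad_mask)]
        weights.append(softmax(scores))
    # Each output is a weighted average of the value vectors (here the inputs themselves)
    outputs = [
        [sum(w * v[d] for w, v in zip(row, x)) for d in range(dim)]
        for row in weights
    ]
    return outputs, weights

toy_x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_outputs, toy_weights = self_attention(toy_x, pad_mask=[False, False, True])
for row in toy_weights:
    print([round(w, 3) for w in row])
```

Each row of the weight matrix sums to one, and the column belonging to the masked position is zero everywhere.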

For now, we will focus on trying to understand what the model is doing. First, we will load the model.

In [None]:
import attention_model

import torch.nn as nn

model = attention_model.RequestClassifier(len(vocab))

What does the model look like?

In [None]:
model

Now let's see how the untrained model performs on the dataset. We can define a simple function to compute the true and false positive rates.

In [None]:
# Compute true positives, false positives, true negatives, and false negatives
def compute_metrics(preds, labels):
    tp = ((preds == 1) & (labels == 1)).sum().item()
    fp = ((preds == 1) & (labels == 0)).sum().item()
    tn = ((preds == 0) & (labels == 0)).sum().item()
    fn = ((preds == 0) & (labels == 1)).sum().item()
    return tp, fp, tn, fn

# Evaluate the model
def evaluate_model(model, data_loader):
    model.eval()
    all_preds = []
    all_labels = []

    with torch.no_grad():
        for src, labels, pad_mask in data_loader:
            logits = model(src, src_key_padding_mask=pad_mask)
            preds = torch.argmax(logits, dim=1)

            all_preds.append(preds)
            all_labels.append(labels)

    all_preds = torch.cat(all_preds)
    all_labels = torch.cat(all_labels)

    tp, fp, tn, fn = compute_metrics(all_preds, all_labels)
    print(f"True positive rate: {tp / (tp + fn) * 100:.2f}%")
    print(f"False positive rate: {fp / (fp + tn) * 100:.2f}%")
    print("")
    print(f"True negative rate: {tn / (tn + fp) * 100:.2f}%")
    print(f"False negative rate: {fn / (fn + tp) * 100:.2f}%")

evaluate_model(model, loader)

You should find that the true positive and false positive rates are roughly equal, typically somewhere around 50%. This is expected for the untrained model: its weights are random, so its predictions carry no information about the label.
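
As a small worked example of these rates, consider the hypothetical counts below. The helper `rates` is made up for illustration; unlike `evaluate_model`, it guards against a class with no examples.

```python
def rates(tp, fp, tn, fn):
    """Return (true positive rate, false positive rate), guarding against empty classes."""
    tpr = tp / (tp + fn) if (tp + fn) > 0 else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) > 0 else float("nan")
    return tpr, fpr

# Hypothetical counts: 100 requests and 100 non-requests, with a coin-flip classifier
toy_tpr, toy_fpr = rates(tp=50, fp=50, tn=50, fn=50)
print(f"TPR = {toy_tpr:.2f}, FPR = {toy_fpr:.2f}")

# A classifier that always answers "request" is equally uninformative: TPR = FPR = 1
always_tpr, always_fpr = rates(tp=100, fp=100, tn=0, fn=0)
print(f"TPR = {always_tpr:.2f}, FPR = {always_fpr:.2f}")
```

In both cases the true and false positive rates are equal, which is the signature of a classifier that has learned nothing about the labels.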

Now let's perform the training. We can train the model over a small number of epochs using a simple cross-entropy loss over the two classes. Play around with the learning rate and number of epochs. What do you find in terms of performance?

In [None]:
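opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

epochs = 20

for epoch in range(epochs):
    # evaluate_model() left the model in eval mode, so switch back to training mode
    model.train()
    epoch_loss, n_batches, found_nan = 0.0, 0, False

    for src, labels, pad_mask in loader:
        opt.zero_grad()
        logits = model(src, src_key_padding_mask=pad_mask)

        # Stop before the optimizer step if the forward pass has diverged
        if torch.isnan(logits).any():
            print("🛑 NaN in logits!")
            found_nan = True
            break

        loss = loss_fn(logits, labels)
        loss.backward()
        opt.step()

        epoch_loss += loss.item()
        n_batches += 1

    if found_nan:
        break

    # Report the mean loss over the epoch rather than the last batch only
    print(f"Epoch {epoch+1}/{epochs}, Loss: {epoch_loss / n_batches:.4f}")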
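
To get a feeling for the numbers printed during training, here is a hand-written version of the cross-entropy loss for a single example with two logits. `nn.CrossEntropyLoss` takes raw logits, applies a log-softmax, and by default averages the negative log-probability of the true class over the batch. The function below is a plain-Python sketch of the per-example quantity, with made-up logit values.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy for one example: -log softmax(logits)[label]."""
    m = max(logits)
    # log-sum-exp computed stably by subtracting the largest logit first
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# Equal logits mean the model is undecided, so the loss is log(2), about 0.693
undecided_loss = cross_entropy([0.0, 0.0], label=1)
# A confident, correct prediction gives a loss close to zero
confident_loss = cross_entropy([-4.0, 4.0], label=1)
# A confident, wrong prediction is penalized heavily
wrong_loss = cross_entropy([4.0, -4.0], label=1)

print(f"{undecided_loss:.4f} {confident_loss:.4f} {wrong_loss:.4f}")
```

A loss near 0.69 therefore means the model is doing no better than a coin flip, and the loss should fall well below that as training progresses.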


We have trained a very simple model as a proof-of-concept, but think about what might be missing here, using what you have learned from the other tutorials in the workshop. Would you want to deploy this model in the real world? How could we make it better?

## Evaluation

Now that the model is trained, we can evaluate its performance. First, we should compute the true positive and false positive rates.

In [None]:
evaluate_model(model, loader)


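You probably found very good performance (unless something went wrong with the training). In fact, the true positive and true negative rates might be 100%. Keep in mind that we evaluated on the same sentences we trained on, so these numbers say nothing about how well the model generalizes; near-perfect scores here are a sign of possible overfitting. How could we detect and avoid overfitting in future training?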
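
One common way to detect overfitting is to hold out part of the data and never train on it. The following is a minimal sketch of such a split using only the standard library. The function name `train_val_split`, the 20% validation fraction, and the example data are assumptions made for illustration.

```python
import random

def train_val_split(sentences, labels, val_fraction=0.2, seed=0):
    """Shuffle (sentence, label) pairs and hold out a fraction for validation."""
    pairs = list(zip(sentences, labels))
    # A seeded generator keeps the split reproducible between runs
    random.Random(seed).shuffle(pairs)
    n_val = int(len(pairs) * val_fraction)
    val_pairs, train_pairs = pairs[:n_val], pairs[n_val:]
    return train_pairs, val_pairs

example_sentences = [f"sentence {i}" for i in range(10)]
example_labels = [i % 2 for i in range(10)]
train_pairs, val_pairs = train_val_split(example_sentences, example_labels)
print(len(train_pairs), len(val_pairs))
```

With a split like this, one could build a separate `EnronRequestDataset` and `DataLoader` for each part, train on the first, and call `evaluate_model` on the second. A large gap between training and validation performance would then point to overfitting.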

Now that we have the trained model, we can play around with it a bit. Let's write a function to evaluate a single sentence.

In [None]:
# Evaluate single sentence
import torch
import torch.nn.functional as F

def predict_sentence(model, sentence, vocab, tokenizer):
    model.eval()
    with torch.no_grad():
        # Tokenize and map to vocabulary
        tokens = tokenizer(sentence)
        ids    = torch.tensor(vocab(tokens), dtype=torch.long).unsqueeze(0)

        # Build the padding mask: True marks <pad> positions the model should ignore
        # (a single un-batched sentence normally contains none)
        pad_idx  = vocab['<pad>']
        pad_mask = ids == pad_idx

        # Run the model
        logits = model(ids, src_key_padding_mask=pad_mask)
        probs  = F.softmax(logits, dim=-1)

        # Return the class with the highest probability (0 or 1)
        pred   = probs.argmax(dim=-1).item()

    return pred, probs.squeeze().tolist()

def print_prediction_for_sentence(sentence):
    pred, probs = predict_sentence(model, sentence, vocab, tokenizer)

    label_map = {0: "no_request", 1: "request"}
    print(f"→ {sentence!r}")
    print(f"Prediction: {label_map[pred]} (P(request)={probs[1]:.4f})")
    print()

test_sentences = [
    "Please send me the report today by the end of the day.",
    "I need the report as soon as possible.",
    "Can you send me the report?",
    "The weather is nice today.",
    "Knut is teaching a lecture on transformers.",
    "Student, please evaluate the model performance."
]

for sentence in test_sentences:
    print_prediction_for_sentence(sentence)


This is looking pretty good! Try playing around now with your own sentences.

In [None]:
# Write a request that you might find in a business email (so it uses words that might be in the vocab)
sentence = "Write a request here."
print_prediction_for_sentence(sentence)

# Write a non-request that you might find in a business email
sentence = "Write a non-request here."
print_prediction_for_sentence(sentence)

# Write a request with some words that you think might not be in the vocab
sentence = "Write a request with non-vocab words here."
print_prediction_for_sentence(sentence)

# Write a request in another language (e.g., German)
sentence = "Schreiben Sie eine Anfrage in einer anderen Sprache hier."
print_prediction_for_sentence(sentence)

# Write any sentence of your choosing
sentence = "Your sentence here."
print_prediction_for_sentence(sentence)

What do you notice about the model behavior? Do certain words seem to cause the model to predict a sentence as being more request-like? Do certain requests consistently fail? Can you figure out how to modify an incorrectly-classified request in order to make it classify correctly?

## Visualization

In [None]:
import matplotlib.pyplot as plt
import numpy as np

import utility

def plot_attention(tokens, attn: np.ndarray):
    # tokens: list[str]; attn: NumPy array of shape [S, S]
    fig, ax = plt.subplots()
    cax = ax.matshow(attn)             # heatmap of the attention weights
    fig.colorbar(cax)
    ax.set_xticks(range(len(tokens)))
    ax.set_yticks(range(len(tokens)))
    ax.set_xticklabels(tokens, rotation=90)
    ax.set_yticklabels(tokens)
    ax.set_xlabel("Key positions")
    ax.set_ylabel("Query positions")
    plt.show()


text = "Please help me."

tokens = tokenizer(text)
ids    = torch.tensor([vocab(tokens)], dtype=torch.long)
padm   = ids == vocab['<pad>']

model.eval()
with torch.no_grad():
    logits, attn = model(ids, src_key_padding_mask=padm, return_attn=True)
    # attn: [1, seq_len, seq_len]  (since single head)

    attn_np = np.asarray(attn[0].detach().cpu().tolist())

plot_attention(tokens, attn_np)

col_importance = [ sum(row[j] for row in attn_np) for j in range(len(tokens)) ]

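# Pair each token with its score and sort from most to least attended
token_scores = sorted(zip(tokens, col_importance), key=lambda pair: pair[1], reverse=True)

print("Tokens ranked by attention paid to them:")
for token, score in token_scores:
    print(f"  {token:>10} → {score:.3f}")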

utility.display_tokens_with_alpha(tokens, col_importance)
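
The "importance" score used here is simply a column sum of the attention matrix: for each token, we add up the attention that every query position pays to it. The toy example below works this out for a hypothetical 3x3 attention matrix with made-up weights.

```python
# Hypothetical attention matrix: row i holds the weights query i puts on each key
toy_attn = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.5, 0.2, 0.3],
]
toy_tokens = ["please", "help", "me"]

# Column sums: total attention each token receives across all query positions
toy_importance = [sum(row[j] for row in toy_attn) for j in range(len(toy_tokens))]

# Sort from most to least attended
ranked = sorted(zip(toy_tokens, toy_importance), key=lambda pair: pair[1], reverse=True)
for token, score in ranked:
    print(f"{token:>10} -> {score:.3f}")
```

Because every row of an attention matrix sums to one, the column sums always add up to the sequence length. The scores are therefore best read relative to each other, not as absolute values.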


Now let's plot a number of sentences, and see which words are most important.

In [None]:
test_sentences = [
    "Please send me the report by EOD.",
    "I need the report ASAP.",
    "Please help me out with the report.",
    "The boss would like the Facebook post today.",
    "I hope all is well with you, and one day when you are in the DC area and have nothing better to do you will let me have the opportunity to visit over dinner and reflect on how many years it has been since you and we created the Natural Gas Clearing House."
]

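# label_map was local to print_prediction_for_sentence, so define it here as well
label_map = {0: "no_request", 1: "request"}

for sentence in test_sentences: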
    tokens = tokenizer(sentence)
    ids    = torch.tensor([vocab(tokens)], dtype=torch.long)
    padm   = ids == vocab['<pad>']

    model.eval()
    with torch.no_grad():
        logits, attn = model(ids, src_key_padding_mask=padm, return_attn=True)
        # attn: [1, seq_len, seq_len]  (since single head)

        attn_np = np.asarray(attn[0].detach().cpu().tolist())

    col_importance = [ sum(row[j] for row in attn_np) for j in range(len(tokens)) ]

    utility.display_tokens_with_alpha(tokens, col_importance)
    print(f"Prediction: {label_map[logits.argmax(dim=-1).item()]} ; P(request)={F.softmax(logits, dim=-1)[0][1]:.4f}")
    print("")
