$\Huge AS4501$

Transformers and Attention

Francisco Förster

Bibliography:

* [Attention is all you need, Vaswani et al. 2017](https://arxiv.org/pdf/1706.03762.pdf)
* https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html (many figures from this great website)
* https://towardsdatascience.com/attention-and-transformer-models-fe667f958378

# Motivation

Recurrent neural networks have two big problems:

1. They tend to give too much weight to recent elements in a sequence, but sometimes the most important connections in a sentence are separated by a large number of elements.

2. They are intrinsically serial: the elements of a sequence must be processed one after another to compute the output of an RNN, which prevents parallelization along the sequence.

This is how an RNN processes a sentence, giving more weight to the last word at each step and requiring serial processing:

![](images/sentence-classification-rnn.png)

But in many cases the last word is not the most important, and we would like to be able to process each word and its association with other words in parallel:

![](images/sentence-example-attention.png)

This also happens in the problem of translation:

![](images/sentence.png)

# Softmax

Let's recall the softmax function applied to a vector $x$:

$\Large {\rm softmax}(x)_i = \frac{\exp(x_i)}{\sum\limits_j \exp(x_j)}$ 

This function returns positive values that sum to one. When the largest value of the vector is well separated from the rest, the output is ~1 at that position and ~0 elsewhere.

![](images/softmax.png)
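As a minimal sketch of the formula above, softmax can be written in a few lines of NumPy. Subtracting the maximum before exponentiating is a standard numerical-stability trick that does not change the result; the example vector `scores` is made up for illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = np.asarray(x, dtype=float)
    shifted = x - np.max(x, axis=-1, keepdims=True)  # avoids overflow in exp
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

scores = np.array([1.0, 2.0, 5.0])   # illustrative values
weights = softmax(scores)            # approximately [0.017, 0.047, 0.936]
sharp = softmax(10 * scores)         # larger gaps push the output towards one-hot

print(weights, weights.sum())
print(sharp)
```

Note how the output always sums to one, and how scaling the input up makes the distribution more sharply peaked at the largest value.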

# Attention mechanism

The attention mechanism is an approach in deep learning that allows models to focus on different parts of the input when producing the output. Instead of compressing the input into a single hidden state as in RNNs, with attention each output explicitly depends on all the input states, weighted by attention scores.

For example in this sentence with the following attention scores:

 I love travelling
   
   [0.1,  0.2,  0.7] ---> J'adore
  
  [0.5,  0.5,  0.0] ---> voyager

'J'adore' pays more attention or has more affinity to 'travelling' as the next word when translating.

'voyager' pays attention to 'I' and 'love' equally when translating.
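To make "weighted by attention scores" concrete, here is a small sketch that turns the scores above into one context vector per output word. The 2-dimensional input states are hypothetical numbers chosen only for illustration.

```python
import numpy as np

# Attention scores from the example: one row per output word,
# one column per input word ("I", "love", "travelling").
scores = np.array([
    [0.1, 0.2, 0.7],   # "J'adore"
    [0.5, 0.5, 0.0],   # "voyager"
])

# Hypothetical 2-dimensional input states (made up for illustration).
input_states = np.array([
    [1.0, 0.0],   # "I"
    [0.0, 1.0],   # "love"
    [1.0, 1.0],   # "travelling"
])

# Each output is the score-weighted sum of all the input states.
context = scores @ input_states
print(context)
```

Each row of `context` is a mixture of the input states, and an input with score 0 does not contribute at all.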

# Self-attention

Self-attention, also known as intra-attention, is an attention mechanism that relates different positions of a single sequence in order to compute a representation of that same sequence.

![](images/intraattention.png)

In a self-attention layer, an input matrix $X$ ($n$ tokens of dimension $d$) is turned into an output matrix $Z$ ($n$ components of dimension $d_v$) via three representational matrices of the input:

* queries Q
* keys K
* values V

$\Large {\rm Attention}(Q, K, V) = {\rm softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$

where the softmax is applied to each row, and $Q$, $K$ and $V$ are linear transformations of the input matrix $X$ via learnable parameters $W^Q$, $W^K$ and $W^V$:

* $Q = X W^Q$
* $K = X W^K$
* $V = X W^V$

Note that
* $X \in \mathbb{R}^{n \times d}$
* $Q \in \mathbb{R}^{n \times d_k}$
* $K \in \mathbb{R}^{n \times d_k}$
* $V \in \mathbb{R}^{n \times d_v}$
* $W^Q \in \mathbb{R}^{d \times d_k}$
* $W^K \in \mathbb{R}^{d \times d_k}$
* $W^V \in \mathbb{R}^{d \times d_v}$

![](images/attention_detail.png)

![](images/selfattention_summary.png)
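The self-attention computation can be sketched in NumPy as follows. This is a minimal illustration with random, untrained weights, assuming the convention $Q = X W^Q$ (so each weight matrix has $d$ rows); the sizes `n`, `d`, `d_k` and `d_v` are arbitrary.

```python
import numpy as np

def softmax(x):
    """Row-wise, numerically stable softmax."""
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention for a single sequence X of shape (n, d)."""
    Q = X @ W_q                                # (n, d_k)
    K = X @ W_k                                # (n, d_k)
    V = X @ W_v                                # (n, d_v)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (n, n) token-to-token affinities
    weights = softmax(scores)                  # each row sums to one
    return weights @ V, weights                # Z has shape (n, d_v)

rng = np.random.default_rng(0)
n, d, d_k, d_v = 4, 8, 3, 5                    # arbitrary sizes for illustration
X = rng.normal(size=(n, d))
W_q = rng.normal(size=(d, d_k))
W_k = rng.normal(size=(d, d_k))
W_v = rng.normal(size=(d, d_v))

Z, A = self_attention(X, W_q, W_k, W_v)
print(Z.shape, A.shape)  # (4, 5) (4, 4)
```

The attention matrix `A` has one row per token, and row $i$ tells us how much token $i$ attends to every token of the same sequence.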

# Cross-attention

One can generalize the previous computation for combining two input matrices $X_1$ and $X_2$:

![](images/cross-attention-summary.png)

And this is an example of a cross attention matrix:

![](images/bahdanau-fig3.png)

and a visualization of one row

![](images/attention.png)
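The cross-attention computation can be sketched in the same way as self-attention. Here we assume the usual convention, as in the decoder of Vaswani et al. 2017, that the queries come from $X_1$ and the keys and values come from $X_2$; the weights are random and the sizes arbitrary.

```python
import numpy as np

def softmax(x):
    """Row-wise, numerically stable softmax."""
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=-1, keepdims=True)

def cross_attention(X1, X2, W_q, W_k, W_v):
    """Queries from X1 (n1 tokens), keys and values from X2 (n2 tokens)."""
    Q = X1 @ W_q                                        # (n1, d_k)
    K = X2 @ W_k                                        # (n2, d_k)
    V = X2 @ W_v                                        # (n2, d_v)
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (n1, n2)
    return weights @ V, weights                         # output: (n1, d_v)

rng = np.random.default_rng(1)
n1, n2, d, d_k, d_v = 2, 3, 8, 4, 6                     # arbitrary sizes
X1 = rng.normal(size=(n1, d))
X2 = rng.normal(size=(n2, d))
W_q = rng.normal(size=(d, d_k))
W_k = rng.normal(size=(d, d_k))
W_v = rng.normal(size=(d, d_v))

Z, A = cross_attention(X1, X2, W_q, W_k, W_v)
print(Z.shape, A.shape)  # (2, 6) (2, 3)
```

The two sequences can have different lengths: the attention matrix is rectangular, with one row per token of $X_1$ and one column per token of $X_2$.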

# Multi-head attention

In multi-head attention we run several attention heads in parallel, each head $i$ with its own learnable parameters $W_i^Q$, $W_i^K$ and $W_i^V$, concatenate their outputs, and then linearly transform the result with learnable parameters $W^O$:

$\Large {\rm Multihead} = {\rm concat}({\rm head}_1, \ldots, {\rm head}_h) W^O$

where ${\rm head}_i = {\rm Attention}(X W_i^Q, X W_i^K, X W_i^V)$.

![](images/multi-head.png)
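A minimal NumPy sketch of multi-head self-attention is shown below. It assumes $h$ heads with $d_k = d_v = d / h$, which is the choice made in the original paper; the weights are random, and the helper names are ours.

```python
import numpy as np

def softmax(x):
    """Row-wise, numerically stable softmax."""
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) tuples, one per head."""
    outputs = [attention(X @ W_q, X @ W_k, X @ W_v) for W_q, W_k, W_v in heads]
    # Concatenate along the feature axis, then mix the heads with W_o
    return np.concatenate(outputs, axis=-1) @ W_o

rng = np.random.default_rng(2)
n, d, h = 5, 16, 4              # arbitrary sizes for illustration
d_k = d_v = d // h              # per-head dimension
X = rng.normal(size=(n, d))
heads = [tuple(rng.normal(size=(d, d_k)) for _ in range(3)) for _ in range(h)]
W_o = rng.normal(size=(h * d_v, d))

out = multi_head_attention(X, heads, W_o)
print(out.shape)  # (5, 16)
```

With this choice of dimensions the output has the same shape as the input, so multi-head attention layers can be stacked.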

# Positional encodings

One problem with the previous strategy is that the order of the input is never used to compute the attention scores. To fix this, information about the positions of the inputs must be added. In the original paper, Vaswani et al. use sine and cosine functions of different frequencies, with sines in the even dimensions and cosines in the odd ones:

* $PE(pos, 2i) = \sin(pos / 10000^{2i/d})$
* $PE(pos, 2i+1) = \cos(pos / 10000^{2i/d})$

![](images/PE.png)
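The sinusoidal encoding can be sketched in NumPy as follows, with sines in the even columns and cosines in the odd columns. The function name is ours and the model dimension `d` is assumed to be even.

```python
import numpy as np

def positional_encoding(n_positions, d):
    """Sinusoidal positional encoding of shape (n_positions, d); d is assumed even."""
    pos = np.arange(n_positions)[:, None]    # (n_positions, 1)
    i = np.arange(d // 2)[None, :]           # (1, d/2), index of each sin/cos pair
    angles = pos / 10000 ** (2 * i / d)      # (n_positions, d/2)
    pe = np.zeros((n_positions, d))
    pe[:, 0::2] = np.sin(angles)             # even dimensions
    pe[:, 1::2] = np.cos(angles)             # odd dimensions
    return pe

pe = positional_encoding(50, 16)
print(pe.shape)  # (50, 16)
```

In the original paper these encodings are simply added to the input embeddings before the first attention layer, so a token embedding matrix `X` of shape `(n, d)` would become `X + positional_encoding(n, d)`.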

In other works, a set of functions is learned as the positional encoder. For example, [Pimentel+2023](https://arxiv.org/pdf/2201.08482.pdf) use the following function (timeFiLM):

![](images/timefilm.png)
![](images/timefilm2.png)

# Transformers

The full transformer architecture proposed by Vaswani et al. 2017 is the following:

![](images/transformer.png)

The model is composed of an encoder and a decoder. 

The encoder is composed of 6 identical layers, each one with two sublayers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. Each sublayer is wrapped in a residual connection (we add the input of the sublayer to its output), which helps with convergence, and the result is normalized using layer normalization.
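The structure of one encoder layer can be sketched in NumPy as below. This is a simplified illustration, not the full model: a single attention head stands in for multi-head attention, layer normalization omits its learnable gain and bias, dropout is left out, and all weights are random. The helper names and sizes are ours.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    """Normalize each token vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x, W_q, W_k, W_v):
    """Single-head stand-in for the multi-head self-attention sublayer."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise network: the same two linear maps applied to every token."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def encoder_layer(x, p):
    # Sublayer 1: self-attention + residual connection + layer normalization
    x = layer_norm(x + self_attention(x, p["W_q"], p["W_k"], p["W_v"]))
    # Sublayer 2: feed-forward + residual connection + layer normalization
    return layer_norm(x + feed_forward(x, p["W1"], p["b1"], p["W2"], p["b2"]))

def random_params(rng, d, d_ff):
    """Random weights for one layer (untrained, for illustration only)."""
    return {
        "W_q": rng.normal(size=(d, d)), "W_k": rng.normal(size=(d, d)),
        "W_v": rng.normal(size=(d, d)),
        "W1": rng.normal(size=(d, d_ff)), "b1": np.zeros(d_ff),
        "W2": rng.normal(size=(d_ff, d)), "b2": np.zeros(d),
    }

rng = np.random.default_rng(3)
n, d, d_ff = 4, 8, 32
layers = [random_params(rng, d, d_ff) for _ in range(6)]  # 6 layers, each with its own weights

x = rng.normal(size=(n, d))
for p in layers:
    x = encoder_layer(x, p)
print(x.shape)  # (4, 8)
```

Because every sublayer maps `(n, d)` to `(n, d)`, the residual additions are well defined and the layers can be stacked.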

The decoder is also composed of 6 identical layers. In addition to the two sublayers used in the encoder, a third sublayer is inserted between them that performs multi-head cross-attention over the output of the encoder. The multi-head self-attention is also modified to mask positions that have not yet been generated by the decoder (predictions for position $i$ can depend only on the known outputs at positions less than $i$).
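The masking in the decoder self-attention can be sketched as follows: scores for future positions are set to $-\infty$ before the softmax, so their attention weights become exactly zero. Each position is still allowed to attend to itself; in the paper the decoder input is the output sequence shifted right by one, so position $i$ of that input only contains outputs at positions less than $i$. The function name and sizes are ours.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=-1, keepdims=True)

def masked_attention_weights(Q, K):
    """Attention weights where position i cannot attend to positions j > i."""
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above the diagonal
    scores = np.where(future, -np.inf, scores)          # exp(-inf) = 0 after the softmax
    return softmax(scores)

rng = np.random.default_rng(4)
n, d_k = 4, 3                                           # arbitrary sizes
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))

A = masked_attention_weights(Q, K)
print(np.round(A, 2))  # lower-triangular matrix whose rows sum to one
```

The first row has a single non-zero entry, since the first position can only attend to itself.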



# Examples

## Internet movie database reviews using [BERT](https://arxiv.org/pdf/1810.04805.pdf)

![](images/bert.png)

Here we will load the pre-trained weights and biases of [BERT](https://arxiv.org/abs/1810.04805) and fine-tune the model to classify the sentiment of reviews from the [IMDB dataset](https://www.tensorflow.org/datasets/catalog/imdb_reviews).

In [None]:
import tensorflow_datasets as tfds
from transformers import BertTokenizerFast, TFBertForSequenceClassification
import tensorflow as tf
import numpy as np

In [None]:
# 1. Load the IMDB dataset
dataset, info = tfds.load('imdb_reviews', with_info=True, as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']

In [None]:
# Convert datasets to NumPy arrays and get labels
train_reviews = []
train_labels = []
for review, label in tfds.as_numpy(train_dataset):
    train_reviews.append(review)
    train_labels.append(label)

test_reviews = []
test_labels = []
for review, label in tfds.as_numpy(test_dataset):
    test_reviews.append(review)
    test_labels.append(label)

# Convert labels to NumPy arrays
train_labels = np.array(train_labels)
test_labels = np.array(test_labels)

In [None]:
train_reviews

In [None]:
train_labels

In [None]:
# 2. Initialize the tokenizer
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

In [None]:
# 3. Preprocess the training and testing data
def encode_reviews(tokenizer, reviews, max_length):
    token_ids = np.zeros(shape=(len(reviews), max_length), dtype=np.int32)
    for i, review in enumerate(reviews):
        # truncation=True cuts reviews longer than max_length so they fit in the array
        encoded = tokenizer.encode(review.decode(), max_length=max_length, truncation=True)
        token_ids[i, :len(encoded)] = encoded
    attention_mask = (token_ids != 0).astype(np.int32)
    return {"input_ids": token_ids, "attention_mask": attention_mask}

# Define a max_length for the reviews
max_length = 512  # Adjust this depending on your resources

# Preprocess the training data
train_data = encode_reviews(tokenizer, train_reviews, max_length)

# Preprocess the testing data
test_data = encode_reviews(tokenizer, test_reviews, max_length)

In [None]:
# 4. Initialize the model
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')

In [None]:
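# 5. Compile the model (it outputs two logits per review; use a small learning rate for fine-tuning)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])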

In [None]:
# 6. Train the model
model.fit(train_data, train_labels, validation_data=(test_data, test_labels), epochs=3)

In [None]:
# 7. Evaluation
# Evaluate the model on the test set
model.evaluate(test_data, test_labels)

## Vision transformers

This is based on the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929)

![](images/vit.png)

See https://github.com/huggingface/notebooks/blob/main/examples/image_classification.ipynb
    

We will use the Hugging Face libraries to fine-tune a pre-trained image classification model on the CIFAR-10 dataset.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
from datasets import load_dataset 

In [None]:
model_checkpoint = "microsoft/swin-tiny-patch4-window7-224" # pre-trained model from which to fine-tune
batch_size = 32 # batch size for training and evaluation

In [None]:
dataset = load_dataset("cifar10")

In [None]:
import evaluate

# load_metric is deprecated in the datasets library in favor of the evaluate library
metric = evaluate.load("accuracy")

In [None]:
dataset

In [None]:
example = dataset["train"][10]
example

In [None]:
dataset["train"].features

In [None]:
example['img']

In [None]:
example['img'].resize((200, 200))

In [None]:
example['label']

In [None]:
dataset["train"].features["label"]

In [None]:
labels = dataset["train"].features["label"].names
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = i
    id2label[i] = label

id2label[5]

In [None]:
from transformers import AutoImageProcessor

image_processor  = AutoImageProcessor.from_pretrained(model_checkpoint)
image_processor 

In [None]:
from torchvision.transforms import (
    CenterCrop,
    Compose,
    Normalize,
    RandomHorizontalFlip,
    RandomResizedCrop,
    Resize,
    ToTensor,
)

normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std)
if "height" in image_processor.size:
    size = (image_processor.size["height"], image_processor.size["width"])
    crop_size = size
    max_size = None
elif "shortest_edge" in image_processor.size:
    size = image_processor.size["shortest_edge"]
    crop_size = (size, size)
    max_size = image_processor.size.get("longest_edge")

train_transforms = Compose(
        [
            RandomResizedCrop(crop_size),
            RandomHorizontalFlip(),
            ToTensor(),
            normalize,
        ]
    )

val_transforms = Compose(
        [
            Resize(size),
            CenterCrop(crop_size),
            ToTensor(),
            normalize,
        ]
    )

def preprocess_train(example_batch):
    """Apply train_transforms across a batch."""
    example_batch["pixel_values"] = [
        train_transforms(image.convert("RGB")) for image in example_batch["img"]
    ]
    return example_batch

def preprocess_val(example_batch):
    """Apply val_transforms across a batch."""
    example_batch["pixel_values"] = [val_transforms(image.convert("RGB")) for image in example_batch["img"]]
    return example_batch

In [None]:
# split up training into training + validation
splits = dataset["train"].train_test_split(test_size=0.1)
train_ds = splits['train']
val_ds = splits['test']

In [None]:
train_ds.set_transform(preprocess_train)
val_ds.set_transform(preprocess_val)

In [None]:
train_ds[0]

In [None]:
from transformers import AutoModelForImageClassification, TrainingArguments, Trainer

model = AutoModelForImageClassification.from_pretrained(
    model_checkpoint, 
    label2id=label2id,
    id2label=id2label,
    ignore_mismatched_sizes = True, # provide this in case you're planning to fine-tune an already fine-tuned checkpoint
)

In [None]:
model_name = model_checkpoint.split("/")[-1]

args = TrainingArguments(
    f"{model_name}-finetuned-cifar10",
    remove_unused_columns=False,
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    push_to_hub=True,
)

In [None]:
import numpy as np

# the compute_metrics function takes a Named Tuple as input:
# predictions, which are the logits of the model as Numpy arrays,
# and label_ids, which are the ground-truth labels as Numpy arrays.
def compute_metrics(eval_pred):
    """Computes accuracy on a batch of predictions"""
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)

In [None]:
import torch

def collate_fn(examples):
    pixel_values = torch.stack([example["pixel_values"] for example in examples])
    labels = torch.tensor([example["label"] for example in examples])
    return {"pixel_values": pixel_values, "labels": labels}

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    tokenizer=image_processor,
    compute_metrics=compute_metrics,
    data_collator=collate_fn,
)

In [None]:
train_results = trainer.train()
# rest is optional but nice to have
trainer.save_model()
trainer.log_metrics("train", train_results.metrics)
trainer.save_metrics("train", train_results.metrics)
trainer.save_state()

In [None]:
metrics = trainer.evaluate()
# some nice to haves:
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)

In [None]:
trainer.push_to_hub()