$\Huge AS4501$

Transformers and Attention

Francisco Förster

Bibliography:

* [Attention is all you need, Vaswani et al. 2017](https://arxiv.org/pdf/1706.03762.pdf)
* https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html (many figures from this great website)
* https://towardsdatascience.com/attention-and-transformer-models-fe667f958378

# Motivation

Recurrent neural networks have two big problems:

1. They tend to give too much weight to recent elements in a sequence, but sometimes the most important connections in a sentence are separated by a large number of elements.

2. They are intrinsically serial: each step depends on the hidden state of the previous step, so the sequence must be processed in order and the computation cannot be parallelized across positions.

This is how an RNN processes a sentence, paying more attention to the last word at each step and requiring serial processing:

![](images/sentence-classification-rnn.png)

But in many cases the last word is not the most important, and we would like to be able to process each word and its association with other words in parallel:

![](images/sentence-example-attention.png)

This also happens in the problem of translation:

![](images/sentence.png)

# Attention mechanism

The attention mechanism is an approach in deep learning that allows models to focus on different parts of the input when producing the output. Instead of relying on a single hidden state as in RNNs, with attention each output explicitly depends on all the input states, weighted by attention scores.

For example in this sentence with the following attention scores:

 I love travelling
   
   [0.1,  0.2,  0.7] ---> J'adore
  
  [0.5,  0.5,  0.0] ---> voyager

'J'adore' pays more attention or has more affinity to 'travelling' when translating.

'voyager' pays attention to 'I' and 'love' equally when translating.
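
To make the role of the scores concrete, here is a minimal sketch in which each output is the score-weighted sum of the input vectors. The 2-dimensional embeddings in `inputs` are made-up values for illustration; only the scores come from the example above.

```python
import numpy as np

# Hypothetical 2-dimensional embeddings for "I", "love", "travelling"
# (made-up values, only for illustration).
inputs = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])

# Attention scores from the example: one row per output word,
# one column per input word. Each row sums to 1.
scores = np.array([[0.1, 0.2, 0.7],   # J'adore
                   [0.5, 0.5, 0.0]])  # voyager

# Each output representation is the score-weighted sum of the inputs.
context = scores @ inputs
print(context)
```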

# Self-attention

Self-attention, also known as intra-attention, is an attention mechanism that relates different positions of a single sequence in order to compute a representation of that same sequence.

![](images/intraattention.png)

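Recall the softmax function applied to a vector $x$:

$\Large {\rm softmax}(x)_i = \frac{\exp(x_i)}{\sum\limits_j \exp(x_j)}$

This function maps a vector to a probability distribution: every output is positive and the outputs sum to 1. When one element is much larger than the others, its output is ~1 and the rest are ~0.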

![](images/softmax.png)
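
A minimal NumPy sketch of softmax follows. Subtracting the maximum before exponentiating is a common numerical-stability trick that is not part of the formula above (it does not change the result); the input values are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    """Softmax along `axis`, shifted by the maximum for numerical stability."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)

x = np.array([1.0, 2.0, 3.0])
p = softmax(x)             # a smooth distribution over the three entries
p_sharp = softmax(10 * x)  # larger gaps push the output towards one-hot
print(p, p_sharp)
```

Scaling the input up makes the largest entry dominate, which is why the scores in attention are divided by $\sqrt{d_k}$ before the softmax.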

In a self-attention layer, an input matrix $X$ ($n$ tokens of dimension $d$) is turned into an output matrix $Z$ ($n$ components of dimension $d_v$) via three representations of the input:

* queries Q
* keys K
* values V

$\Large {\rm Attention}(Q, K, V) = {\rm softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$

where $Q$, $K$ and $V$ are linear transformations of the input matrix $X$ via learnable parameter matrices $W^Q$, $W^K$ and $W^V$:

* $Q = X W^Q$
* $K = X W^K$
* $V = X W^V$

Note that 
* $X \in \mathbb{R}^{n \times d}$
* $Q \in \mathbb{R}^{n \times d_k}$
* $K \in \mathbb{R}^{n \times d_k}$
* $V \in \mathbb{R}^{n \times d_v}$
* $W^Q \in \mathbb{R}^{d \times d_k}$
* $W^K \in \mathbb{R}^{d \times d_k}$
* $W^V \in \mathbb{R}^{d \times d_v}$

![](images/attention_detail.png)

![](images/selfattention_summary.png)
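
The computation above can be sketched in a few lines of NumPy. This is an illustrative sketch, not a trained model: the sizes `n`, `d`, `d_k`, `d_v` and the random weights are arbitrary assumptions, and the weight matrices have shapes $d \times d_k$ and $d \times d_v$ so that $Q = X W^Q$ is well defined.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention for one sequence X of shape (n, d)."""
    Q = X @ W_q                       # (n, d_k)
    K = X @ W_k                       # (n, d_k)
    V = X @ W_v                       # (n, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n, n): affinity of every token with every token
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights       # Z has shape (n, d_v)

# Arbitrary sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
n, d, d_k, d_v = 4, 8, 6, 5
X = rng.normal(size=(n, d))
W_q = rng.normal(size=(d, d_k))
W_k = rng.normal(size=(d, d_k))
W_v = rng.normal(size=(d, d_v))

Z, A = self_attention(X, W_q, W_k, W_v)
print(Z.shape, A.shape)  # (4, 5) (4, 4)
```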

# Cross-attention

The previous computation can be generalized to combine two input matrices $X_1$ and $X_2$. Typically the queries are computed from one sequence and the keys and values from the other:

![](images/cross-attention-summary.png)

This is an example of a cross-attention matrix:

![](images/bahdanau-fig3.png)

and this is a visualization of one of its rows:

![](images/attention.png)
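
A minimal sketch of cross-attention, assuming the common convention in which the queries come from $X_1$ and the keys and values come from $X_2$. The sizes and random weights are arbitrary; both sequences use the same embedding dimension `d` here only for simplicity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def cross_attention(X1, X2, W_q, W_k, W_v):
    """Queries from X1 (n1 tokens), keys and values from X2 (n2 tokens)."""
    Q = X1 @ W_q                                        # (n1, d_k)
    K = X2 @ W_k                                        # (n2, d_k)
    V = X2 @ W_v                                        # (n2, d_v)
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (n1, n2)
    return weights @ V, weights                         # output: (n1, d_v)

# Arbitrary sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
n1, n2, d, d_k, d_v = 3, 5, 8, 6, 4
X1 = rng.normal(size=(n1, d))
X2 = rng.normal(size=(n2, d))
W_q = rng.normal(size=(d, d_k))
W_k = rng.normal(size=(d, d_k))
W_v = rng.normal(size=(d, d_v))

Z, A = cross_attention(X1, X2, W_q, W_k, W_v)
print(Z.shape, A.shape)  # (3, 4) (3, 5)
```

The attention matrix is no longer square: it has one row per token of the first sequence and one column per token of the second, like the matrix shown above.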

# Multi-head attention

In multi-head attention we concatenate the outputs of several heads $i$, each with its own learnable parameters $W_i^Q$, $W_i^K$ and $W_i^V$, and then linearly transform the concatenated result with learnable parameters $W^O$:

$\Large {\rm Multihead} = {\rm concat}({\rm head}_1, \ldots, {\rm head}_h) W^O$

where ${\rm head}_i = {\rm Attention}(X W_i^Q, X W_i^K, X W_i^V)$.

![](images/multi-head.png)
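
A minimal sketch of multi-head self-attention, under the assumption that each head works in a reduced dimension $d_k = d_v = d / h$ (the choice made by Vaswani et al.). The names `head_params` and `W_o`, the sizes and the random weights are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V):
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return weights @ V

def multi_head_attention(X, head_params, W_o):
    """head_params is a list of (W_q, W_k, W_v) tuples, one per head."""
    heads = [attention(X @ W_q, X @ W_k, X @ W_v)
             for W_q, W_k, W_v in head_params]
    # Concatenate along the feature axis, then mix the heads with W_o.
    return np.concatenate(heads, axis=-1) @ W_o

# Arbitrary sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
n, d, h = 4, 8, 2
d_k = d_v = d // h
head_params = [tuple(rng.normal(size=(d, d_k)) for _ in range(3))
               for _ in range(h)]
W_o = rng.normal(size=(h * d_v, d))
X = rng.normal(size=(n, d))

out = multi_head_attention(X, head_params, W_o)
print(out.shape)  # (4, 8)
```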

# Positional encodings

One problem with the previous strategy is that the order of the input is never used to compute the attention scores. To fix this, information about the relative or absolute positions of the inputs must be added. In the original paper, Vaswani et al. (2017) use sine and cosine functions of different frequencies:

* $PE(pos, 2i) = \sin(pos / 10000^{2i/d})$
* $PE(pos, 2i+1) = \cos(pos / 10000^{2i/d})$

where $pos$ is the position in the sequence, $i$ indexes pairs of dimensions, and $d$ is the embedding dimension.

![](images/PE.png)
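
A minimal NumPy sketch of the sinusoidal encoding of Vaswani et al., where even dimensions ($2i$) use the sine and odd dimensions ($2i+1$) use the cosine of the same angle. The function name and the sizes are illustrative, and the sketch assumes an even embedding dimension `d`.

```python
import numpy as np

def positional_encoding(n_positions, d):
    """Sinusoidal positional encodings of shape (n_positions, d); assumes d is even."""
    pos = np.arange(n_positions)[:, None]         # (n_positions, 1)
    two_i = np.arange(0, d, 2)[None, :]           # (1, d/2): the even indices 2i
    angles = pos / np.power(10000.0, two_i / d)   # (n_positions, d/2)
    pe = np.zeros((n_positions, d))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = positional_encoding(n_positions=50, d=16)
print(pe.shape)  # (50, 16)
```

In the original paper the encoding is added to the token embeddings (`X + pe`) before the first attention layer.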

In other works, the positional encoding is learned. For example, [Pimentel+2023](https://arxiv.org/pdf/2201.08482.pdf) use the following function (timeFiLM):

![](images/timefilm.png)
![](images/timefilm2.png)

# Transformers

The full transformer architecture proposed by Vaswani et al. 2017 is the following:

![](images/transformer.png)

The model is composed of an encoder and a decoder. 

The encoder is composed of 6 identical layers, each one with two sublayers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. Each sublayer is wrapped in a residual connection (the input is added to the output of the sublayer), which helps with convergence, and the result is normalized using layer normalization.
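
The structure of one encoder layer, `LayerNorm(x + Sublayer(x))` applied twice, can be sketched as follows. This is a simplified sketch under several assumptions: a single attention head with $d_v = d$ (so the residual sum has matching shapes), no learnable gain or bias in the layer normalization, and small random weights. The names `init_params`, `encoder_layer` and the sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    # Normalize each token (row) to zero mean and unit variance.
    # The learnable gain and bias of real layer normalization are omitted.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x, W_q, W_k, W_v):
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise: the same two-layer ReLU network is applied to every row.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def encoder_layer(x, p):
    # Sublayer 1: self-attention, residual connection, layer normalization.
    x = layer_norm(x + self_attention(x, p["W_q"], p["W_k"], p["W_v"]))
    # Sublayer 2: feed-forward network, residual connection, layer normalization.
    return layer_norm(x + feed_forward(x, p["W1"], p["b1"], p["W2"], p["b2"]))

def init_params(rng, d, d_ff):
    shapes = {"W_q": (d, d), "W_k": (d, d), "W_v": (d, d),
              "W1": (d, d_ff), "b1": (d_ff,), "W2": (d_ff, d), "b2": (d,)}
    return {name: 0.1 * rng.normal(size=shape) for name, shape in shapes.items()}

rng = np.random.default_rng(0)
n, d, d_ff = 4, 8, 32
layers = [init_params(rng, d, d_ff) for _ in range(6)]  # 6 layers, separate weights

out = rng.normal(size=(n, d))
for p in layers:
    out = encoder_layer(out, p)
print(out.shape)  # (4, 8)
```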

The decoder is also composed of 6 identical layers. In addition to the two sublayers used in the encoder, a sublayer is added in between that applies multi-head cross-attention to the output of the encoder. The multi-head self-attention is also modified to mask positions that the decoder has not generated yet (predictions for position i can depend only on the known outputs at positions less than i).
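
The masking in the decoder self-attention can be sketched by setting the scores of future positions to $-\infty$ before the softmax, so that their weights become exactly 0. In this sketch position $i$ attends to decoder-input positions up to and including $i$; in Vaswani et al. the decoder input is the output sequence shifted right by one position, which is what makes the prediction for position $i$ depend only on earlier outputs. The function name and sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def masked_self_attention(Q, K, V):
    """Scaled dot-product attention with a causal (look-ahead) mask."""
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # True above the diagonal: entries (i, j) with j > i are future positions.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)   # exp(-inf) = 0, so future weights vanish
    return weights @ V, weights

# Arbitrary sizes and random inputs, for illustration only.
rng = np.random.default_rng(0)
n, d_k = 4, 6
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_k))

Z, A = masked_self_attention(Q, K, V)
print(np.round(A, 2))  # lower-triangular attention matrix
```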



In [2]:
# !pip install tensorflow tensorflow_datasets transformers
# Import necessary libraries
import tensorflow as tf
import tensorflow_datasets as tfds
from transformers import BertTokenizer, TFBertForSequenceClassification

In [10]:
# 1. Load the Dataset
# Load the IMDB dataset
dataset, info = tfds.load('imdb_reviews', with_info=True, as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']

In [11]:
# 2. Preprocessing
# Initialize the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Maximum sequence length (in tokens) accepted by bert-base-uncased
MAX_LENGTH = 512

# Tokenize a single review; runs eagerly inside tf.py_function
def encode_texts(text, label):
    encoded_text = tokenizer(text.numpy().decode('utf-8'), truncation=True, padding='max_length', max_length=MAX_LENGTH, return_tensors='tf')
    # return_tensors='tf' adds a leading batch axis of size 1; drop it so each example is 1-D
    return encoded_text['input_ids'][0], encoded_text['attention_mask'][0], label

def encode_map_fn(text, label):
    # tf.py_function doesn't set the shape of the returned tensors.
    encoded_text, attention_mask, label = tf.py_function(encode_texts, inp=[text, label], Tout=(tf.int32, tf.int32, tf.int64))
    
    # `tf.data.Datasets` work best if all components have a shape set
    encoded_text.set_shape([MAX_LENGTH])
    attention_mask.set_shape([MAX_LENGTH])
    label.set_shape([])
    
    return {"input_ids": encoded_text, "attention_mask": attention_mask}, label

train_dataset = train_dataset.map(encode_map_fn)
test_dataset = test_dataset.map(encode_map_fn)

# Shuffle the training set and batch both splits
train_dataset = train_dataset.shuffle(10000).batch(32)
test_dataset = test_dataset.batch(32)

In [12]:
# 3. Load the Transformer Model
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# 4. Training
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])

# Train the model
model.fit(train_dataset, validation_data=test_dataset, epochs=3)

Epoch 1/3


In [None]:
# 5. Evaluation
# Evaluate the model on the test set
model.evaluate(test_dataset)