# Attention mechanisms and transformers

One major limitation of recurrent networks is that all words in a sequence are treated as having the same influence on the result. This leads to suboptimal performance in standard LSTM encoder-decoder models for sequence-to-sequence tasks, such as Named Entity Recognition and Machine Translation. In reality, certain words in the input sequence often have a greater impact on the sequential outputs than others.

Consider a sequence-to-sequence task such as machine translation. It is typically implemented using two recurrent networks: one network (**encoder**) compresses the input sequence into a hidden state, and another network (**decoder**) expands this hidden state into the translated output. The issue with this approach is that the final state of the encoder struggles to retain information from the beginning of a sentence, resulting in poor model performance on longer sentences.

**Attention Mechanisms** offer a way to assign varying levels of importance to each input vector when predicting each output of the RNN. This is achieved by creating shortcuts between the intermediate states of the input RNN and the output RNN. In this way, when generating the output symbol $y_t$, all input hidden states $h_i$ are considered, each with a different weight coefficient $\alpha_{t,i}$.

![Image showing an encoder/decoder model with an additive attention layer](../../../../../translated_images/encoder-decoder-attention.7a726296894fb567aa2898c94b17b3289087f6705c11907df8301df9e5eeb3de.en.png)
*The encoder-decoder model with additive attention mechanism in [Bahdanau et al., 2015](https://arxiv.org/pdf/1409.0473.pdf), cited from [this blog post](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html)*
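
To make the weighting concrete, the additive attention of Bahdanau et al. can be written as follows. The decoder state $s_{t-1}$ and the parameters $v_a$, $W_a$, $U_a$ are notation borrowed from the paper rather than from the text above, and $c_t$ denotes the context vector passed to the decoder at step $t$:

$$
e_{t,i} = v_a^\top \tanh(W_a s_{t-1} + U_a h_i), \qquad
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_k \exp(e_{t,k})}, \qquad
c_t = \sum_i \alpha_{t,i} h_i
$$

Because of the softmax, the weights $\alpha_{t,i}$ are positive and sum to 1 over the input positions $i$ for every output step $t$.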

The attention matrix $\{\alpha_{i,j}\}$ represents the extent to which specific input words contribute to the generation of a particular word in the output sequence. Below is an example of such a matrix:

![Image showing a sample alignment found by RNNsearch-50, taken from Bahdanau - arxiv.org](../../../../../translated_images/bahdanau-fig3.09ba2d37f202a6af11de6c82d2d197830ba5f4528d9ea430eb65fd3a75065973.en.png)

*Figure taken from [Bahdanau et al., 2015](https://arxiv.org/pdf/1409.0473.pdf) (Fig.3)*

Attention mechanisms underlie much of the recent state of the art in Natural Language Processing. However, adding attention significantly increases the number of model parameters, which makes RNNs harder to scale. The main obstacle is their sequential nature: each element of a sequence must be processed in order, so training is difficult to batch and parallelize.

The adoption of attention mechanisms, combined with this limitation, led to the development of Transformer models, now the state of the art, which underpin widely used models such as BERT and GPT-3.

## Transformer models

Instead of passing the context of each previous prediction into the next evaluation step, **transformer models** use **positional encodings** and attention to capture the context of a given input within a specified window of text. The image below illustrates how positional encodings combined with attention can capture context within a given window.

![Animated GIF showing how the evaluations are performed in transformer models.](../../../../../lessons/5-NLP/18-Transformers/images/transformer-animated-explanation.gif)
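
As a rough illustration of what a positional encoding can look like, the sketch below builds the sinusoidal table proposed in the original Transformer paper. This is an assumption about one possible scheme, not a description of every model: BERT, for instance, learns its position embeddings during training. The function name `positional_encoding` and the sizes used are chosen only for this example.

```python
import math

def positional_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of sinusoidal encodings.

    Even columns hold sin(pos / 10000^(2i/d_model)) and odd columns the
    matching cos, where i = column // 2.
    """
    table = []
    for pos in range(num_positions):
        row = []
        for col in range(d_model):
            i = col // 2
            angle = pos / (10000 ** (2 * i / d_model))
            row.append(math.sin(angle) if col % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

# Toy sizes (assumed): 4 positions, 6-dimensional vectors.
pe = positional_encoding(num_positions=4, d_model=6)
for pos, row in enumerate(pe):
    print(pos, [round(x, 3) for x in row])
```

Each row is added to the embedding of the token at that position, so identical words at different positions end up with different input vectors.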

Since each input position is mapped independently to each output position, transformers can parallelize more effectively than RNNs, enabling much larger and more expressive language models. Each attention head can learn different relationships between words, enhancing downstream Natural Language Processing tasks.
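
The independence of positions can be seen in a minimal sketch of single-head scaled dot-product self-attention. It is a simplification under stated assumptions: queries, keys, and values are all the raw input vectors, whereas real transformer layers first apply learned projection matrices, and the toy vectors are made up.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head scaled dot-product self-attention with Q = K = V = x.

    x is a list of n vectors (one per position), each of dimension d.
    Returns (outputs, weights); weights[i][j] is how much position i
    attends to position j.
    """
    d = len(x[0])
    weights = []
    for q in x:  # every row depends only on x, so rows can be computed in parallel
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[c] for w, v in zip(row, x)) for c in range(d)]
        for row in weights
    ]
    return outputs, weights

# Three toy 2-dimensional "word" vectors (made-up numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

No row of `weights` depends on another row, which is why a transformer can process all positions of a sequence at once, while an RNN has to step through them in order.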

**BERT** (Bidirectional Encoder Representations from Transformers) is a large multi-layer transformer network with 12 layers for *BERT-base* and 24 layers for *BERT-large*. The model is first pre-trained on a large corpus of text data (Wikipedia + books) using unsupervised training (predicting masked words in a sentence). During pre-training, the model acquires a significant level of language understanding, which can then be leveraged with other datasets through fine-tuning. This process is known as **transfer learning**.

![picture from http://jalammar.github.io/illustrated-bert/](../../../../../translated_images/jalammarBERT-language-modeling-masked-lm.34f113ea5fec4362e39ee4381aab7cad06b5465a0b5f053a0f2aa05fbe14e746.en.png)
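
The masked-word objective can be sketched in a few lines. The code below follows the corruption rule described in the BERT paper (about 15% of tokens are selected; of those, 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged), but it is a simplified illustration: it picks each token independently, works on whole words instead of sub-word pieces, and the helper name `mask_tokens` and the toy vocabulary are assumptions of this example.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, targets) for masked-language-model training.

    targets[i] is the original token where a prediction is required,
    and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            targets.append(tok)  # the model must recover this token
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK)               # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random token
            else:
                corrupted.append(tok)                # 10%: leave unchanged
        else:
            corrupted.append(tok)
            targets.append(None)
    return corrupted, targets

sentence = "the quick brown fox jumps over the lazy dog".split()
toy_vocab = sorted(set(sentence))
# A higher mask_prob than 0.15 is used here only so that the short example
# is likely to show some masked positions.
corrupted, targets = mask_tokens(sentence, toy_vocab, mask_prob=0.3, seed=1)
print(corrupted)
print(targets)
```

During pre-training, the loss is computed only at the positions where `targets` is not `None`.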

There are many variations of Transformer architectures, including BERT, DistilBERT, BigBird, GPT-3, and others, many of which can be fine-tuned. The [HuggingFace package](https://github.com/huggingface/) provides a repository for training many of these architectures using PyTorch.

## Using BERT for text classification

Let's explore how we can use a pre-trained BERT model to solve a traditional task: sequence classification. We will classify our original AG News dataset.

First, let's load the HuggingFace library and our dataset:


In [10]:
import torch
import torchtext
from torchnlp import *
import transformers
train_dataset, test_dataset, classes, vocab = load_dataset()
vocab_len = len(vocab)

Loading dataset...
Building vocab...


Since we will be using a pre-trained BERT model, we need to use a specific tokenizer. First, we'll load the tokenizer associated with the pre-trained BERT model.

The HuggingFace library provides a repository of pre-trained models that you can access simply by passing their names to the `from_pretrained` methods. All the necessary binary files for the model will be downloaded automatically.

However, there may be situations where you need to load your own models. In such cases, you can specify the directory containing all the relevant files, including the tokenizer parameters, the `config.json` file with model settings, binary weights, and more.


In [11]:
# Option 1: load the model from the online repository by name.
# Use this if you are running from your own copy of the notebooks.
bert_model = 'bert-base-uncased' 

# Option 2: load the model from a directory on disk. Use this for the Microsoft Learn
# module, because all required files have been prepared for you. This assignment
# overrides the one above, so comment it out if you want Option 1.
bert_model = './bert'

tokenizer = transformers.BertTokenizer.from_pretrained(bert_model)

MAX_SEQ_LEN = 128
PAD_INDEX = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
UNK_INDEX = tokenizer.convert_tokens_to_ids(tokenizer.unk_token)

The `tokenizer` object provides an `encode` method that can be used directly to encode text:


In [15]:
tokenizer.encode('PyTorch is a great framework for NLP')

[101, 1052, 22123, 2953, 2818, 2003, 1037, 2307, 7705, 2005, 17953, 2361, 102]
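
The output above contains 13 ids for a 7-word sentence. In the standard `bert-base-uncased` vocabulary, `101` and `102` are the special `[CLS]` and `[SEP]` tokens that `encode` adds around the text, so the remaining 11 ids show that some words were split into several sub-word pieces. The sketch below illustrates the greedy longest-match-first idea behind WordPiece splitting. It is not the library's implementation, and `toy_vocab` is a hypothetical miniature vocabulary chosen only for this illustration.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces get a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical miniature vocabulary (an assumption of this example).
toy_vocab = {"[UNK]", "p", "##yt", "##or", "##ch", "is", "great", "nl", "##p"}

def tokenize(text):
    """Lower-case, split on whitespace, and add [CLS] / [SEP] markers."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, toy_vocab))
    tokens.append("[SEP]")
    return tokens

print(tokenize("PyTorch is great NLP"))
# ['[CLS]', 'p', '##yt', '##or', '##ch', 'is', 'great', 'nl', '##p', '[SEP]']
```

With the real tokenizer, the pieces behind the ids can be inspected using `tokenizer.convert_ids_to_tokens`.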

Next, let's create the iterators that we will use during training to access the data. Because BERT uses its own encoding function, we need to define a padding function similar to the `padify` function we defined earlier:


In [4]:
def pad_bert(b):
    # b is the list of tuples of length batch_size
    #   - first element of a tuple = label, 
    #   - second = feature (text sequence)
    # build vectorized sequence (token ids, including [CLS] and [SEP])
    v = [tokenizer.encode(x[1]) for x in b]
    # compute max length of a sequence in this minibatch
    max_len = max(map(len,v))
    return ( # tuple of two tensors - labels and features
        torch.LongTensor([t[0] for t in b]),
        # right-pad every sequence to max_len with the tokenizer's padding id
        torch.stack([torch.nn.functional.pad(torch.tensor(t),(0,max_len-len(t)),mode='constant',value=PAD_INDEX) for t in v])
    )

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=8, collate_fn=pad_bert, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=8, collate_fn=pad_bert)
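
Note that `pad_bert` pads shorter sequences but does not tell the model which positions are padding. HuggingFace BERT models accept an optional `attention_mask` argument for this purpose (1 for real tokens, 0 for padding). The sketch below shows how such a mask could be built alongside the padded ids. It uses plain Python lists so that it runs on its own, and the helper name `pad_with_mask` and the sample ids are assumptions of this example.

```python
def pad_with_mask(sequences, pad_id=0):
    """Pad token-id lists to a common length and build the matching mask."""
    max_len = max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return input_ids, attention_mask

# Two toy sequences of different lengths.
batch = [[101, 2003, 102], [101, 2003, 1037, 2307, 102]]
ids, mask = pad_with_mask(batch)
print(ids)   # [[101, 2003, 102, 0, 0], [101, 2003, 1037, 2307, 102]]
print(mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Both lists can be converted with `torch.tensor` and passed to the model as `model(texts, attention_mask=mask, labels=labels)`, so that padded positions are ignored by the attention layers.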

In our case, we will use the pre-trained BERT model called `bert-base-uncased`. Let's load it using the `BertForSequenceClassification` class. This ensures that the model already has the architecture required for classification, including the final classifier. You will see a warning stating that the weights of the final classifier are not initialized and that the model requires training; that is perfectly fine, because training is exactly what we are about to do!


In [9]:
model = transformers.BertForSequenceClassification.from_pretrained(bert_model,num_labels=4).to(device)

Some weights of the model checkpoint at ./bert were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at ./bert and

Now we are ready to begin training! Since BERT is already pre-trained, we want to start with a relatively small learning rate to avoid disrupting the initial weights.

The heavy lifting is done by the `BertForSequenceClassification` model. When we apply the model to the training data, it returns both the loss and the network output for the input minibatch. We use the loss for parameter optimization (`loss.backward()` performs the backward pass), and `out` to calculate the training accuracy by comparing the predicted labels `labs` (computed using `argmax`) with the expected `labels`.

To monitor the process, we accumulate the loss and accuracy over several iterations and display them every `report_freq` training cycles.

This training process will likely take a significant amount of time, so we limit the number of iterations.


In [6]:
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

report_freq = 50
iterations = 500 # make this larger to train for longer time!

model.train()

i,c = 0,0
acc_loss = 0
acc_acc = 0

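for labels,texts in train_loader:
    labels = labels.to(device)-1 # get labels in the range 0-3
    texts = texts.to(device)
    # the forward pass returns the loss and the raw class scores (logits)
    loss, out = model(texts, labels=labels)[:2]
    labs = out.argmax(dim=1) # predicted class for each sample
    acc = torch.mean((labs==labels).type(torch.float32))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # detach so that the running total does not hold on to the autograd graph
    acc_loss += loss.detach()
    acc_acc += acc
    i+=1
    c+=1
    if i%report_freq==0:
        # report the averages over the last c minibatches, then reset
        print(f"Loss = {acc_loss.item()/c}, Accuracy = {acc_acc.item()/c}")
        c = 0
        acc_loss = 0
        acc_acc = 0
    iterations-=1
    if not iterations:
        break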

Loss = 1.1254194641113282, Accuracy = 0.585
Loss = 0.6194715118408203, Accuracy = 0.83
Loss = 0.46665248870849607, Accuracy = 0.8475
Loss = 0.4309701919555664, Accuracy = 0.8575
Loss = 0.35427074432373046, Accuracy = 0.8825
Loss = 0.3306886291503906, Accuracy = 0.8975
Loss = 0.30340143203735354, Accuracy = 0.8975
Loss = 0.26139299392700194, Accuracy = 0.915
Loss = 0.26708646774291994, Accuracy = 0.9225
Loss = 0.3667240524291992, Accuracy = 0.8675


You can observe (especially if you increase the number of iterations and wait long enough) that BERT classification gives quite good accuracy! This is because BERT already has a strong understanding of language structure, so a short fine-tuning run with a newly added final classifier is enough. However, since BERT is a large model, the entire training process is time-consuming and demands significant computational resources (a GPU, and ideally more than one).

> **Note:** In our example, we have been using one of the smallest pre-trained BERT models. Larger models are available and are likely to produce better results.


## Evaluating the model performance

Now we can assess how well our model performs on the test dataset. The evaluation loop is quite similar to the training loop, but we must remember to switch the model to evaluation mode by calling `model.eval()`.


In [10]:
model.eval()
iterations = 100
acc = 0
i = 0
with torch.no_grad(): # gradients are not needed for evaluation
    for labels,texts in test_loader:
        labels = labels.to(device)-1 # get labels in the range 0-3
        texts = texts.to(device)
        _, out = model(texts, labels=labels)[:2]
        labs = out.argmax(dim=1)
        acc += torch.mean((labs==labels).type(torch.float32))
        i+=1
        if i>iterations: break # stops after iterations+1 minibatches

print(f"Final accuracy: {acc.item()/i}")

Final accuracy: 0.9047029702970297
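
A single accuracy number can hide differences between classes. As a possible next step, the sketch below computes accuracy per class from lists of expected and predicted labels. The label lists are made up for illustration, the helper name `per_class_accuracy` is hypothetical, and the class names assume the usual AG News ordering (World, Sports, Business, Sci/Tech) for labels 0-3.

```python
from collections import Counter

def per_class_accuracy(expected, predicted, class_names):
    """Return {class_name: accuracy} for every class that occurs in `expected`."""
    total = Counter(expected)
    correct = Counter(e for e, p in zip(expected, predicted) if e == p)
    return {class_names[c]: correct[c] / total[c] for c in sorted(total)}

# Made-up labels for illustration only; these are not results from the model.
names = ["World", "Sports", "Business", "Sci/Tech"]
expected = [0, 0, 1, 1, 2, 2, 3, 3]
predicted = [0, 0, 1, 1, 2, 3, 3, 2]
print(per_class_accuracy(expected, predicted, names))
# {'World': 1.0, 'Sports': 1.0, 'Business': 0.5, 'Sci/Tech': 0.5}
```

In the evaluation loop, the two lists could be collected by extending them with `labels.tolist()` and `labs.tolist()` for every minibatch.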


## Key Points

In this unit, we explored how simple it is to use a pre-trained language model from the **transformers** library and tailor it to our text classification task. Likewise, BERT models can be applied to entity extraction, question answering, and other NLP tasks.

Transformer models are the current state-of-the-art in NLP, and in most cases, they should be your starting point when experimenting with custom NLP solutions. However, grasping the fundamental principles of recurrent neural networks covered in this module is crucial if you aim to develop advanced neural models.



---

**Disclaimer**:  
This document has been translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
