# Attention mechanisms and transformers

One major drawback of recurrent networks is that all the words in a sequence have the same impact on the result. This causes suboptimal performance with standard LSTM encoder-decoder models for sequence-to-sequence tasks, such as named entity recognition and machine translation.

For example, machine translation is implemented by two recurrent networks, where one network, the **encoder**, incorporates the input sequence into the hidden state, and another one, the **decoder**, unrolls this hidden state into the translated result. The problem with this approach is that the final state of the network has a hard time remembering the beginning of the sentence, which causes poor quality results on long sentences.

**Attention mechanisms** attempt to fix that problem by weighing the contextual impact of each input vector on each output prediction of the RNN. This is implemented by creating weighted connections between intermediate states. As you can see in the image below, when generating an output symbol $y_t$, we take into account input hidden states $h_i$ with different weight coefficients $\alpha_{t,i}$. 

![Image showing an encoder/decoder model with an additive attention layer](notebooks/images/encoder-decoder-attention.png)
*The encoder-decoder model with additive attention mechanism in [Bahdanau et al., 2015](https://arxiv.org/pdf/1409.0473.pdf), cited from [this blog post](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html)*
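
To make the weights $\alpha_{t,i}$ more concrete, here is a minimal NumPy sketch of one decoder step with additive attention. It is only an illustration: the parameter names `W`, `U` and `v`, the dimensions, and the random values are assumptions chosen for the example (in a real model they are learned), and the code is not taken from the paper.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def additive_attention(s_prev, h, W, U, v):
    """One decoder step of additive attention.

    s_prev:  previous decoder state, shape (d_dec,)
    h:       encoder hidden states h_i, shape (n_inputs, d_enc)
    W, U, v: illustrative score parameters (random here, learned in practice)
    """
    scores = np.tanh(s_prev @ W + h @ U) @ v  # one score per input position
    alpha = softmax(scores)                   # weights alpha_{t,i}, they sum to 1
    context = alpha @ h                       # weighted sum of the hidden states h_i
    return alpha, context

rng = np.random.default_rng(0)
n_inputs, d_enc, d_dec, d_att = 5, 8, 8, 16
h = rng.normal(size=(n_inputs, d_enc))
s_prev = rng.normal(size=d_dec)
W = rng.normal(size=(d_dec, d_att))
U = rng.normal(size=(d_enc, d_att))
v = rng.normal(size=d_att)

alpha, context = additive_attention(s_prev, h, W, U, v)
print(alpha.shape, context.shape)
```

Repeating this step for every output position and stacking the `alpha` vectors row by row gives an attention matrix.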

The attention matrix $\{\alpha_{i,j}\}$ contains information about the impact of each input word in the generation of each output word. Below is a representation of an attention matrix:

![Image showing a sample alignment found by RNNsearch-50, taken from Bahdanau - arxiv.org](notebooks/images/bahdanau-fig-3.png)

*Figure taken from [Bahdanau et al., 2015](https://arxiv.org/pdf/1409.0473.pdf) (Fig.3)*

Attention mechanisms have contributed to much of the current or near current state of the art in NLP. However, adding attention greatly increases the number of model parameters, which leads to issues with scaling RNNs. In addition, because of the recurrent nature of the models, each element of a sequence needs to be processed in sequential order, which means that training cannot be easily parallelized.

The popularity of attention mechanisms, combined with these drawbacks, led to the creation of **transformer models**, such as BERT and GPT-3.

## Transformer models

Instead of forwarding the context of each previous prediction into the next evaluation step, **transformer models** use **positional encodings** and **attention** to capture the context of a given input within a provided window of text. The image below shows how positional encodings with attention can capture context within a given window.

![Animated GIF showing how the evaluations are performed in transformer models.](notebooks/images/transformer-animated-explanation.gif) 

Since each input position is mapped independently to each output position, transformers can parallelize better than RNNs, which enables larger and more expressive language models. Each attention head can be used to learn different relationships between words, which improves downstream NLP tasks.
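
The following NumPy sketch shows why this computation parallelizes well: the attention weights for all positions come out of a single matrix product, with no loop over time steps. It assumes the scaled dot-product form of attention with one head, and the projection matrices `Wq`, `Wk` and `Wv` are random placeholders for weights that would normally be learned.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a whole sequence.

    x: (seq_len, d_model); Wq, Wk, Wv: (d_model, d_head) projections.
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (seq_len, seq_len), all pairs at once
    weights = softmax(scores)                # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))

output, weights = self_attention(x, Wq, Wk, Wv)
print(output.shape, weights.shape)
```

A multi-head layer runs several such computations side by side with different projections and combines the results, so each head is free to focus on a different relationship.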

## Building a classification model based on a transformer block

Before understanding transformer language models as a whole, let's start with a **transformer block**.
Keras doesn't contain a built-in Transformer layer, but we can build our own. As before, we'll focus
on text classification of the AG News dataset. Note, however, that transformer models show more impressive results when
used to solve more difficult NLP tasks.

> For the sandbox environment, we need to run the following cell to make sure the required library is installed, and the data is prefetched. If you're running locally, you can skip the following cell.

In [1]:
import sys
!{sys.executable} -m pip install --quiet tensorflow_datasets
!cd ~ && wget -q -O - https://mslearntensorflowlp.blob.core.windows.net/data/tfds-ag-news.tgz | tar xz

In [2]:
import tensorflow as tf
from tensorflow import keras
import tensorflow_datasets as tfds
import numpy as np

ds_train, ds_test = tfds.load('ag_news_subset').values()

def extract_text(x):
    return x['title']+' '+x['description']

def tupelize(x):
    return (extract_text(x),x['label'])

New layers in Keras should subclass the `Layer` class, and implement the `call` method. Let's start by implementing a **positional embedding** layer. We'll use [code from the official Keras documentation](https://keras.io/examples/nlp/text_classification_with_transformer/), and we'll assume that we pad all input sequences to length `maxlen`.

In [3]:
class TokenAndPositionEmbedding(keras.layers.Layer):
    def __init__(self, maxlen, vocab_size, embed_dim):
        super(TokenAndPositionEmbedding, self).__init__()
        self.token_emb = keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = keras.layers.Embedding(input_dim=maxlen, output_dim=embed_dim)
        self.maxlen = maxlen

    def call(self, x):
        maxlen = self.maxlen
        positions = tf.range(start=0, limit=maxlen, delta=1)
        positions = self.pos_emb(positions)
        x = self.token_emb(x)
        return x+positions

This layer consists of two `Embedding` layers: one for embedding tokens (created using techniques we've discussed before) and another for token positions. Token positions are created as a sequence of integers from 0 to `maxlen - 1` using `tf.range`, and then passed through the embedding layer. The two resulting embedding vectors are then added, producing a positionally-embedded representation of the input of shape `maxlen`$\times$`embed_dim`.

<img src="notebooks/images/pos-embedding.png" width="40%"/>
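
The same idea can be sketched without Keras. In the NumPy example below, the two lookup tables are random stand-ins for the trained `token_emb` and `pos_emb` weights, and the token ids are random as well, so only the shapes and the addition are meaningful.

```python
import numpy as np

maxlen, vocab_size, embed_dim = 256, 20000, 32  # values used later in this unit
rng = np.random.default_rng(0)

token_table = rng.normal(size=(vocab_size, embed_dim))  # stands in for token_emb
pos_table = rng.normal(size=(maxlen, embed_dim))        # stands in for pos_emb

token_ids = rng.integers(0, vocab_size, size=(2, maxlen))  # batch of 2 padded sequences
positions = np.arange(maxlen)                               # 0, 1, ..., maxlen - 1

# Look up both embeddings and add them; the position table broadcasts over the batch.
embedded = token_table[token_ids] + pos_table[positions]
print(embedded.shape)
```

Because the position vectors are added rather than concatenated, the output keeps the `embed_dim` width of the token embedding.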

Now let's implement the transformer block, which takes as input the output of the previously defined 
embedding layer and produces an attention vector. We'll use a `MultiHeadAttention` layer which is 
included in the `tensorflow_addons` library:

In [4]:
!pip install tensorflow_addons==0.13.0

Collecting tensorflow_addons==0.13.0
  Downloading tensorflow_addons-0.13.0-cp37-cp37m-manylinux2010_x86_64.whl (679 kB)
Collecting typeguard>=2.7
  Downloading typeguard-2.12.1-py3-none-any.whl (17 kB)
Installing collected packages: typeguard, tensorflow-addons
Successfully installed tensorflow-addons-0.13.0 typeguard-2.12.1


In [5]:
import tensorflow_addons as tfa

class TransformerBlock(keras.layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerBlock, self).__init__()
        self.att = tfa.layers.MultiHeadAttention(num_heads=num_heads, head_size=embed_dim, name='attn')
        self.ffn = keras.Sequential(
            [keras.layers.Dense(ff_dim, activation="relu"), keras.layers.Dense(embed_dim),]
        )
        self.layernorm1 = keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = keras.layers.Dropout(rate)
        self.dropout2 = keras.layers.Dropout(rate)

    def call(self, inputs, training):
        attn_output = self.att([inputs, inputs])
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)

The transformer layer applies `MultiHeadAttention` to the positionally-encoded input to produce the attention vector of dimension `maxlen`$\times$`embed_dim`, which is then added to the input and normalized using a `LayerNormalization` layer.

> **Note**: `LayerNormalization` is similar to the `BatchNormalization` layer discussed in the *Computer Vision* part of this learning path, but it normalizes the outputs of the previous layer for each training sample independently, so that they have approximately zero mean and unit variance.

The output of this layer is then passed through a `Dense` network (in our case, a two-layer perceptron), and the result is added back to that network's input (`out1` in the code above) and normalized again to produce the final output.

<img src="notebooks/images/transformer-layer.png" width="30%" />
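
The two "add and normalize" steps of the block can be sketched in NumPy as follows. This is a simplified illustration rather than the Keras implementation: the learned scale and shift of `LayerNormalization` are left out, dropout is omitted, the attention output is a random placeholder, and the feed-forward weights are random.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token vector to zero mean and unit variance.

    The learned scale and shift of the Keras layer are omitted for brevity.
    """
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Two-layer perceptron with a ReLU in between, applied to every position."""
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

rng = np.random.default_rng(0)
maxlen, embed_dim, ff_dim = 6, 32, 32
inputs = rng.normal(size=(maxlen, embed_dim))
attn_output = rng.normal(size=(maxlen, embed_dim))  # placeholder for the attention result

W1, b1 = rng.normal(size=(embed_dim, ff_dim)) * 0.1, np.zeros(ff_dim)
W2, b2 = rng.normal(size=(ff_dim, embed_dim)) * 0.1, np.zeros(embed_dim)

out1 = layer_norm(inputs + attn_output)                       # first add and normalize
out2 = layer_norm(out1 + feed_forward(out1, W1, b1, W2, b2))  # second add and normalize
print(out2.shape)
```

Both residual additions require the sub-layer output to have the same width as its input, which is why the second `Dense` layer in the block maps back to `embed_dim`.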

We're now ready to define the classification model using the transformer block:

In [6]:
embed_dim = 32  # Embedding size for each token
num_heads = 2  # Number of attention heads
ff_dim = 32  # Hidden layer size in feed forward network inside transformer
maxlen = 256
vocab_size = 20000

model = keras.models.Sequential([
    keras.layers.experimental.preprocessing.TextVectorization(max_tokens=vocab_size,output_sequence_length=maxlen, input_shape=(1,)),
    TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim),
    TransformerBlock(embed_dim, num_heads, ff_dim),
    keras.layers.GlobalAveragePooling1D(),
    keras.layers.Dropout(0.1),
    keras.layers.Dense(20, activation="relu"),
    keras.layers.Dropout(0.1),
    keras.layers.Dense(4, activation="softmax")
])

model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
text_vectorization (TextVect (None, 256)               0         
_________________________________________________________________
token_and_position_embedding (None, 256, 32)           648192    
_________________________________________________________________
transformer_block (Transform (None, 256, 32)           10464     
_________________________________________________________________
global_average_pooling1d (Gl (None, 32)                0         
_________________________________________________________________
dropout_3 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 20)                660       
_________________________________________________________________
dropout_4 (Dropout)          (None, 20)               
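
The parameter counts reported by `model.summary()` can be checked with a little arithmetic. The sketch below is only a sanity check: the attention line assumes that the layer keeps separate query, key and value kernels for each head plus an output projection with a bias, which is an assumption about the layer's internals rather than something stated in this unit.

```python
maxlen, vocab_size, embed_dim, num_heads, ff_dim = 256, 20000, 32, 2, 32
head_size = embed_dim  # the value passed as head_size in TransformerBlock

# Token table plus position table.
embedding_params = vocab_size * embed_dim + maxlen * embed_dim

# Assumed attention layout: per-head query, key and value kernels,
# followed by an output projection with a bias vector.
attention_params = (3 * num_heads * embed_dim * head_size
                    + num_heads * head_size * embed_dim + embed_dim)
ffn_params = (embed_dim * ff_dim + ff_dim) + (ff_dim * embed_dim + embed_dim)
layernorm_params = 2 * (2 * embed_dim)  # scale and shift for each of the two layers
block_params = attention_params + ffn_params + layernorm_params

dense_params = embed_dim * 20 + 20  # the Dense(20) layer after pooling

print(embedding_params, block_params, dense_params)
```

Under these assumptions the three numbers agree with the `token_and_position_embedding`, `transformer_block` and `dense_2` rows of the summary, and they show that almost all of the parameters sit in the token embedding table.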

In [7]:
print('Training tokenizer')
model.layers[0].adapt(ds_train.map(extract_text))
model.compile(loss='sparse_categorical_crossentropy',metrics=['acc'], optimizer='adam')
model.fit(ds_train.map(tupelize).batch(64),validation_data=ds_test.map(tupelize).batch(64))

Training tokenizer


<tensorflow.python.keras.callbacks.History at 0x7fa2c9e322d0>

## BERT transformer models

What we have seen above is the use of a transformer block for classification. However, the main power
of transformer models is their use in **language modelling**. The main idea is to start with raw
unlabeled text, and try to build a model that will predict some masked words in the text. This allows the model
to learn the overall structure of the language on very large datasets.

A complete transformer model contains, in addition to the transformer block discussed above, which is called the **encoder**,
a **decoder** that is responsible for predicting the masked tokens.

**BERT** (Bidirectional Encoder Representations from Transformers) is a very large multilayer transformer network with 12 layers for *BERT-base*, and 24 for *BERT-large*. The model is first pretrained on a large corpus of text data (Wikipedia + books) using unsupervised training (predicting masked words in a sentence). During pretraining, the model absorbs a significant level of language understanding, which can then be fine-tuned using other datasets. This process is called **transfer learning**.

![picture from http://jalammar.github.io/illustrated-bert/](notebooks/images/jalammar-bert-language-modeling-masked-lm.png)
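
To see how raw text turns into training data without any labels, consider the following simplified sketch. The helper `mask_tokens` is hypothetical, it works on whole words rather than BERT's word pieces, and the masking probability is an assumption for the example (a figure of about 15% is commonly quoted for BERT). The real pretraining procedure has further details that are not reproduced here.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Hide a random subset of tokens and remember the originals as targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = tok  # the model is trained to recover this word
        else:
            masked.append(tok)
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3)
print(masked)   # model input
print(targets)  # positions and words the model has to predict
```

The text itself supplies both the input and the expected answer, which is why very large unlabeled corpora can be used for pretraining.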

There are many variations of transformer architectures, including BERT, DistilBERT, BigBird, GPT-3 and more, that can be fine-tuned.

Let's see how we can use a pretrained BERT model for solving our traditional sequence classification problem. We'll borrow the idea and some code from the [official documentation](https://www.tensorflow.org/text/tutorials/classify_text_with_bert).

To load pretrained models, we'll use the **TensorFlow hub**. We need to make sure that all required libraries are installed:

In [8]:
import sys
!{sys.executable} -m pip install -q tensorflow_hub
!{sys.executable} -m pip install --no-deps -q tensorflow_text==2.3

First, let's load the BERT-specific vectorizer.

> In the sandbox environment, we need to prefetch the weights for the pretrained BERT network and the vectorizer. If you're running the code locally, you can skip the next cell. Also, this cell is likely to cause problems on non-UNIX machines.

In [9]:
!cd /tmp && wget -q -O - https://mslearntensorflowlp.blob.core.windows.net/models/tfhub_modules.tgz | tar xz

In [10]:
import tensorflow_text 
import tensorflow_hub as hub
tf.get_logger().setLevel('ERROR')
vectorizer = hub.KerasLayer('https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3')

Let's see how the vectorizer works:

In [11]:
vectorizer(['I love transformers'])

{'input_mask': <tf.Tensor: shape=(1, 128), dtype=int32, numpy=
 array([[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
       dtype=int32)>,
 'input_word_ids': <tf.Tensor: shape=(1, 128), dtype=int32, numpy=
 array([[  101,  1045,  2293, 19081,   102,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     

It's important that you use the same vectorizer as the one that the original network was trained on. The BERT vectorizer returns three components:
* `input_word_ids`, which is a sequence of token numbers for the input sentence.
* `input_mask`, showing which part of the sequence contains actual input, and which one is padding. It's similar to the mask produced by the `Masking` layer.
* `input_type_ids`, which is used for language modeling tasks and allows you to specify two input sentences in one sequence.
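
The relationship between the three components can be sketched with a small helper. `pack_inputs` is a hypothetical function written for this illustration, not the code that runs inside the vectorizer. It assumes that the token ids are already known, and that `101` and `102` are the ids of the special `[CLS]` and `[SEP]` tokens that open and close the sequence in the output above.

```python
import numpy as np

def pack_inputs(token_ids, seq_length=128, cls_id=101, sep_id=102, pad_id=0):
    """Build the three arrays for a single sentence (illustrative helper)."""
    # Truncate if needed, leaving room for the two special tokens.
    ids = [cls_id] + list(token_ids)[: seq_length - 2] + [sep_id]
    n_pad = seq_length - len(ids)
    return {
        "input_word_ids": np.array(ids + [pad_id] * n_pad, dtype=np.int32),
        "input_mask": np.array([1] * len(ids) + [0] * n_pad, dtype=np.int32),
        "input_type_ids": np.zeros(seq_length, dtype=np.int32),  # one segment only
    }

packed = pack_inputs([1045, 2293, 19081])  # token ids taken from the output above
print(packed["input_word_ids"][:8])
print(packed["input_mask"][:8])
```

With a pair of sentences, `input_type_ids` would contain 0 for the positions of the first sentence and 1 for those of the second.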

Then, we can instantiate the BERT feature extractor:

In [12]:
bert = hub.KerasLayer('https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-128_A-2/1')

In [13]:
z = bert(vectorizer(['I love transformers']))
for i,x in z.items():
    print(f"{i} -> { len(x) if isinstance(x, list) else x.shape }")

sequence_output -> (1, 128, 128)
pooled_output -> (1, 128)
encoder_outputs -> 4
default -> (1, 128)


The BERT layer returns a number of useful results:
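* `pooled_output` is a single vector that summarizes the whole sequence. In BERT it is computed from the final hidden state of the first (`[CLS]`) token rather than by averaging all tokens. You can view it as an intelligent semantic embedding of the whole sequence. It plays the same role as the output of the `GlobalAveragePooling1D` layer in our previous model.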
* `sequence_output` is the output of the last transformer layer (corresponds to the output of `TransformerBlock` in our model above).
* `encoder_outputs` are the outputs of all transformer layers. Since we've loaded a 4-layer BERT model (as you can probably guess from the name, which contains `L-4`), it has 4 tensors. The last one is the same as `sequence_output`.

Now we'll define the end-to-end classification model. We'll use the *functional model definition*, where we define the model input, and then provide a series of expressions to calculate its output. We will also make the BERT model weights non-trainable, and train just the final classifier:

In [14]:
inp = keras.Input(shape=(),dtype=tf.string)
x = vectorizer(inp)
x = bert(x)
x = keras.layers.Dropout(0.1)(x['pooled_output'])
out = keras.layers.Dense(4,activation='softmax')(x)
model = keras.models.Model(inp,out)
bert.trainable = False
model.summary()

Model: "functional_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None,)]            0                                            
__________________________________________________________________________________________________
keras_layer (KerasLayer)        {'input_mask': (None 0           input_1[0][0]                    
__________________________________________________________________________________________________
keras_layer_1 (KerasLayer)      {'sequence_output':  4782465     keras_layer[0][0]                
                                                                 keras_layer[0][1]                
                                                                 keras_layer[0][2]                
_______________________________________________________________________________________

In [15]:
model.compile(loss='sparse_categorical_crossentropy',metrics=['acc'], optimizer='adam')
model.fit(ds_train.map(tupelize).batch(64),validation_data=ds_test.map(tupelize).batch(64))



<tensorflow.python.keras.callbacks.History at 0x7fa19d7161d0>

Despite the fact that there are few trainable parameters, the process is pretty slow, because the BERT feature extractor is computationally heavy. It looks like we were unable to achieve reasonable accuracy, either due to lack of training, or lack of model parameters.

If you want to further experiment with BERT training, you can try to unfreeze some of the BERT weights and train them as well. This requires a very small learning rate, and more care with the training strategy. In this scenario, it's recommended that we use the **AdamW** optimizer. You can also experiment with more advanced optimization strategies with initial **warmup** (the `tf-models-official` package may be helpful). 
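
As an illustration of what a warmup strategy looks like, here is a minimal sketch of a schedule that raises the learning rate linearly during the first steps and then decays it linearly to zero. The function name, the step counts, and the choice of linear decay are assumptions made for this example, and the peak value simply mirrors the learning rate used in this unit.

```python
def warmup_linear_decay(step, total_steps, warmup_steps, peak_lr=3e-5):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)

total_steps, warmup_steps = 1000, 100
for step in (0, 50, 99, 100, 500, 1000):
    print(step, warmup_linear_decay(step, total_steps, warmup_steps))
```

Starting with very small updates helps to avoid destroying the pretrained weights while the freshly initialized classifier is still producing large, noisy gradients.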

In [16]:
bert.trainable=True
model.summary()
epochs = 3
opt = tfa.optimizers.AdamW(learning_rate=3e-5, weight_decay=1e-5)

model.compile(loss='sparse_categorical_crossentropy',metrics=['acc'], optimizer=opt)
model.fit(ds_train.map(tupelize).batch(16),validation_data=ds_test.map(tupelize).batch(16),epochs=epochs)

Model: "functional_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None,)]            0                                            
__________________________________________________________________________________________________
keras_layer (KerasLayer)        {'input_mask': (None 0           input_1[0][0]                    
__________________________________________________________________________________________________
keras_layer_1 (KerasLayer)      {'sequence_output':  4782465     keras_layer[0][0]                
                                                                 keras_layer[0][1]                
                                                                 keras_layer[0][2]                
_______________________________________________________________________________________

<tensorflow.python.keras.callbacks.History at 0x7fa19cc9e4d0>

As you can see, the training is quite slow. You may want to experiment and train the model for more epochs (5-10) and see if you can get a better result compared to the approaches we've used before.

## Hugging Face Transformers library

Another very common (and a bit simpler) way to use Transformer models is the [Hugging Face Transformers package](https://github.com/huggingface/), which provides simple building blocks for different NLP tasks. It's available both for TensorFlow and PyTorch, another very popular neural network framework.

> **Note**: If you're not interested in seeing how the transformers library works, you may skip to the end of this notebook, because you won't see anything substantially different from what we've done above. We'll be repeating the same steps of training the BERT model using a different library and a substantially larger model. Because the process involves some rather long training, you may want to just look through the code.

Let's see how our problem can be solved using [Hugging Face Transformers](http://huggingface.co).

In [17]:
!{sys.executable} -m pip install -q transformers

The first thing we need to do is choose the model that we'll be using. In addition to a few built-in models, Huggingface contains an [online model repository](https://huggingface.co/models), where you can find many more pretrained models created by the community. We can load any of these models just by providing a model name, and all required binary files for the model are automatically downloaded.

If you want to load your own models, then you can specify the directory that contains all relevant files, including the parameters for the tokenizer, `config.json` file with model parameters, and binary weights.

> When running inside the sandbox, we'll need to download the BERT model files manually. If you're running a local copy of the notebook, then you can skip the next cell, and modify the model name accordingly.

In [18]:
!wget -q -O - https://mslearntensorflowlp.blob.core.windows.net/models/huggingface-bert.tgz | tar xz

From the model name or model representation on disk, we can instantiate both the model and the tokenizer (those two always go together). Let's start with a tokenizer:

In [19]:
import transformers

# To load the model from Internet repository using model name. 
# Use this if you are running from your own copy of the notebooks
bert_model = 'bert-base-uncased' 

# To load the model from the directory on disk. Use this for Microsoft Learn module.
bert_model = './tfbert'

tokenizer = transformers.BertTokenizer.from_pretrained(bert_model)

MAX_SEQ_LEN = 128
PAD_INDEX = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
UNK_INDEX = tokenizer.convert_tokens_to_ids(tokenizer.unk_token)

The `tokenizer` object contains the `encode` function that can be used directly to encode text:

In [20]:
tokenizer.encode('TensorFlow is a great framework for NLP')

[101, 23435, 12314, 2003, 1037, 2307, 7705, 2005, 17953, 2361, 102]

We can use the tokenizer to encode a sequence in a way that's suitable for passing to the model, including the `input_ids`, `token_type_ids` and `attention_mask` fields (the counterparts of `input_word_ids`, `input_type_ids` and `input_mask` in the TensorFlow Hub vectorizer). We can also specify that we want TensorFlow tensors by providing the `return_tensors='tf'` argument:

In [21]:
tokenizer(['Hello, there'],return_tensors='tf')

{'input_ids': <tf.Tensor: shape=(1, 5), dtype=int32, numpy=array([[ 101, 7592, 1010, 2045,  102]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(1, 5), dtype=int32, numpy=array([[0, 0, 0, 0, 0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(1, 5), dtype=int32, numpy=array([[1, 1, 1, 1, 1]], dtype=int32)>}

In our case, we'll be using the pretrained BERT model called `bert-base-uncased`. *Uncased* indicates that the model is case-insensitive.

When training the model, we need to provide the tokenized sequence as input, so let's include that in the data processing pipeline. Since `tokenizer.encode` is a Python function, we'll call it using `py_function`, as you've seen before:

In [22]:
def process(x):
    return tokenizer.encode(x.numpy().decode('utf-8'),return_tensors='tf',padding='max_length',max_length=MAX_SEQ_LEN,truncation=True)[0]

def process_fn(x):
    s = x['title']+' '+x['description']
    e = tf.py_function(process,inp=[s],Tout=(tf.int32))
    e.set_shape(MAX_SEQ_LEN)
    return e,x['label']

Now we can load the actual model using the `TFBertForSequenceClassification` class. This ensures that our model already has the required architecture for classification, including the final classifier. You'll see a warning message stating that the weights of the final classifier are not initialized, and that the model requires training. That's perfectly okay, because it's exactly what we are about to do!

In [23]:
model = transformers.TFBertForSequenceClassification.from_pretrained(bert_model,num_labels=4,output_attentions=False)

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at ./tfbert and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [24]:
model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  109482240 
_________________________________________________________________
dropout_43 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  3076      
Total params: 109,485,316
Trainable params: 109,485,316
Non-trainable params: 0
_________________________________________________________________


As you can see from the `summary()`, the model contains almost 110 million parameters! Presumably, if we want to do a simple classification task on a relatively small dataset, we don't want to train the BERT base layer:

In [25]:
model.layers[0].trainable = False
model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  109482240 
_________________________________________________________________
dropout_43 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  3076      
Total params: 109,485,316
Trainable params: 3,076
Non-trainable params: 109,482,240
_________________________________________________________________
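
The number of trainable parameters left after freezing can be verified by hand. The check below assumes a hidden size of 768 for BERT-base, a figure that does not appear in this unit, and takes the encoder count directly from the summary above.

```python
hidden_size, num_labels = 768, 4  # assumed BERT-base hidden size; four AG News classes

classifier_params = hidden_size * num_labels + num_labels  # weights plus biases
total_params = 109_482_240 + classifier_params             # encoder count from summary()

print(classifier_params, total_params)
```

Only the small `classifier` layer is updated during training, yet every batch still has to pass through the full frozen encoder.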


Now we're ready to begin training!

> **Note**: Training a full-scale BERT model can be very time consuming! Thus we will only train it for the first 32 batches. This is just to show how the model training is set up. If you're interested in trying the full-scale training, just remove `steps_per_epoch` and `validation_steps` parameters, and prepare to wait!

In [26]:
model.compile('adam','sparse_categorical_crossentropy',['acc'])
tf.get_logger().setLevel('ERROR')
model.fit(ds_train.map(process_fn).batch(32),validation_data=ds_test.map(process_fn).batch(32),steps_per_epoch=32,validation_steps=2)



<tensorflow.python.keras.callbacks.History at 0x7fa17c2f6190>

If you increase the number of iterations, train for several epochs, and wait long enough, you can expect that BERT classification gives us the best accuracy! That's because BERT already understands the structure of the language, and we only need to fine-tune the final classifier. However, because BERT is a large model, the whole training process takes a long time and requires serious computational power (a GPU, and preferably more than one).

> **Note:** In our example, we've been using one of the smallest pretrained BERT models. There are larger models that are likely to yield better results.

## Takeaway

In this unit, we saw very recent model architectures based on **transformers**. We applied them to our text classification task, but similarly, BERT models can be used for entity extraction, question answering, and other NLP tasks.

Transformer models represent the current state of the art in NLP, and in most cases they should be the first solution you start experimenting with when implementing custom NLP solutions. However, understanding the basic underlying principles of recurrent neural networks discussed in this module is extremely important if you want to build advanced neural models.