# Lecture 20 - Transformer Networks

[![View notebook on Github](https://img.shields.io/static/v1.svg?logo=github&label=Repo&message=View%20On%20Github&color=lightgrey)](https://github.com/avakanski/Fall-2023-Python-Programming-for-Data-Science/blob/main/docs/Lectures/Theme_3-Model_Engineering/Lecture_19-Natural_Language_Processing/Lecture_19-NLP.ipynb)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/avakanski/Fall-2023-Python-Programming-for-Data-Science/blob/main/docs/Lectures/Theme_3-Model_Engineering/Lecture_19-Natural_Language_Processing/Lecture_19-NLP.ipynb)

<a id='top'></a>

- [20.1 Introduction to Transformers](#20.1-introduction-to-transformers)
- [20.2 Self-attention Mechanism](#20.2-self-attention-mechanism)
- [20.3 Multi-head Attention](#20.3-multi-head-attention)
- [20.4 Encoder Block](#20.4-encoder-block)
- [20.5 Positional Encoding](#20.5-positional-encoding)
- [20.6 Using a Transformer Model for Classification](#20.6-using-a-transformer-model-for-classification)
- [20.7 Fine-tuning a Pretrained BERT Model](#20.7-fine-tuning-a-pretrained-bert-model)
- [20.8 Decoder Sub-network](#20.8-decoder-sub-network)
- [20.9 Vision Transformers](#20.9-vision-transformers)
- [References](#references)

## 20.1 Introduction to Transformers <a name='20.1-introduction-to-transformers'></a>

**Transformer Neural Networks**, or simply **Transformers**, are a neural network architecture introduced in 2017 in the now-famous paper [“Attention is all you need”](https://arxiv.org/abs/1706.03762). The title refers to the attention mechanism, which forms the basis for data processing with Transformers.

Transformer Networks have been the predominant type of Deep Learning model for NLP in recent years. They have replaced Recurrent Neural Networks in most NLP tasks, and today's Large Language Models are built on the Transformer architecture. Transformer Networks have also been adapted to other domains, where they have achieved strong results in image and video processing, protein and DNA sequence prediction, and time-series data processing, and they have been used for reinforcement learning tasks. Consequently, Transformers are currently among the most important Neural Network architectures.

## 20.2 Self-attention Mechanism <a name='20.2-self-attention-mechanism'></a>

**Self-attention** in NNs is a mechanism that allows a model to attend to the most relevant portions of the data when making predictions. For instance, in NLP, the self-attention mechanism is used to identify the words in a sentence that are significant for a given query word in that sentence. That is, the model should pay more attention to some words and less attention to other words that are less relevant for a given task.

In the two sentences shown in the next figure, the word "it" refers to "street" in the left subfigure, while in the right subfigure the word "it" refers to "animal". Understanding the relationships between the words in such sentences has been challenging with traditional NLP approaches. Transformers use the self-attention mechanism to model the relationships between all words in a sentence, and assign weights to the other words in the sentence based on their importance. In the left subfigure, the mechanism estimated that the **query word** "it" is most related to the word "street", but the word "it" is also somewhat related to the words "The" and "animal". These words are referred to as **key words** for the query word "it". The intensity of the lines connecting the words, as well as the intensity of the blue color, signifies the attention weights or scores. The wider and bluer the lines, the higher the attention scores between two words.

<img src="images/attn_1.png" width="700">

*Figure: Attention to words in sentences.*

Specifically, a Transformer Network compares each word to every other word in the sentence and calculates attention scores. This is shown in the next figure, where, for example, the word "caves" has the highest **attention scores** for the words "glacier" and "formed". The attention scores are calculated as the dot (i.e., inner) product of the input representations of two words. That is, for each Query word $Q$ and Key word $K$, the attention score is $Q\cdot K$.


<img src="images/attn_2.png" width="700">

*Figure: Attention scores.*

As we explained in the previous lecture, Transformers employ word embeddings for representing the individual words in text sequences (where each text sequence can have one or several sentences). Recall that **word embeddings** are vector representations of words, such that the vectors of words that have similar semantic meaning have close spatial positions in the embeddings space. Therefore, the attention scores are dot products of the embedding vectors for each pair of words in sentences.

The obtained attention scores for each word are first scaled (by dividing the values by $\sqrt d$) and then normalized by applying a softmax function, so that the scores for each query word are in the [0,1] range and sum to 1. That is, the attention scores are calculated as $a_{ij}=softmax(\frac{Q_i\cdot K_j}{\sqrt d})$, where $d$ is the dimensionality of the embedding vectors. Scaling the values by $\sqrt d$ keeps the dot products from growing with the dimensionality, which would otherwise push the softmax function into regions with very small gradients; this improves the flow of the gradients during training. The resulting scaled and normalized attention scores are then multiplied with the initial representation of the words, which in the self-attention module is referred to as **value** or $V$.

This is shown in the next figure. The left subfigure shows the attention scores calculated as the product of the input representations of the words $Q$ and $K$, which are afterwards multiplied with the input representation $V$ to obtain the output of the module. Note that for text classification, all three terms Query, Key, and Value are the input representation of the words in sentences. However, the original Transformer was developed for machine translation, where the words in the target language are queries, and the words in the source language are pairs of keys and values. This terminology is also related to search engines, which compare queries to keys and return the matching values. Self-attention works in a similar way, where each query word is matched to the key words, and a weighted value is returned.

The right subfigure below shows how self-attention is implemented in Transformer Networks. Namely, `Matmul` stands for a matrix multiplication layer which calculates the dot product $Q\cdot K$, which is afterwards scaled by $\sqrt d$. Then there is an optional masking layer, and the final attention scores are obtained by passing the result through a `Softmax` layer to obtain $softmax(\frac{Q_i\cdot K_j}{\sqrt d})$. Finally, the attention scores are multiplied with $V$ via another matrix multiplication layer `Matmul` to calculate the output of the self-attention module.

<img src="images/attn_3.png" width="400">

*Figure: Self-attention in Transformer Networks*
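
The `Matmul`, scale, mask, `Softmax`, and `Matmul` steps described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used later in the lecture: the function names and the random toy embeddings are assumptions, and, following the simplified description in this section, the same input matrix is used directly as $Q$, $K$, and $V$ without any learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum for numerical stability before exponentiating
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (num_queries, num_keys)
    if mask is not None:
        # Optional masking: positions where mask is False are ignored
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)            # attention scores a_ij
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                       # 4 tokens, embedding size 8 (toy values)
output, weights = scaled_dot_product_attention(X, X, X)
print(output.shape)            # (4, 8)
print(weights.sum(axis=-1))    # the scores for each query word sum to 1
```

Each row of `weights` holds the attention scores of one query word over all key words, and each row of `output` is the corresponding weighted combination of the value vectors.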

In conclusion, self-attention is applied to determine the meaning of the words in a sentence based on the context. That is, Transformers use the attention scores to modify the input vector representations for each word and generate a new representation based on the context of the sentence. During the training of the network, the representations of the words are updated and projected into a new embeddings space that takes the context into account.

## 20.3 Multi-head Attention <a name='20.3-multi-head-attention'></a>

Transformer Networks include multiple self-attention modules in their architecture. Each self-attention module is called an **attention head**, and the aggregation of the outputs of multiple attention heads is called **multi-head attention**. For instance, the original Transformer model had 8 attention heads, while the largest GPT-3 language model has 96 attention heads in each layer.

The multi-head attention module is shown in the next figure, where the inputs are first passed through a linear layer (dense or fully-connected layer), next they are fed to the multiple attention heads, and the outputs of all attention heads are concatenated, and passed through one more linear (dense) layer.

A logical question is why multiple attention heads are needed. The reason is that multiple attention modules can learn different relationships between the words in sentences. Each module can extract context independently from the other modules, which makes it possible to capture less obvious context and enhances the learning capabilities of the model. For example, one head may capture the relationship between the nouns and numerical values in sentences, another head may focus on the relationship between the adjectives in sentences, and another head may focus on rhyming words, etc.

Also, the computations of each attention head can be performed in parallel on different workers, which allows for accelerating the training and scaling up the models.

<img src="images/multihead_1.png" width="500">

*Figure: Multi-head attention*
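
The flow in the figure (linear layers, parallel attention heads, concatenation, final linear layer) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions: the weight matrices are random stand-ins for learned parameters, biases are omitted, and each head works on `embed_dim // num_heads` features as in the original paper (the Keras layer used later in this lecture sets the head size separately through `key_dim`).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    seq_len, embed_dim = X.shape
    head_dim = embed_dim // num_heads

    def split_heads(M):
        # (seq_len, embed_dim) -> (num_heads, seq_len, head_dim)
        return M.reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)

    # Linear projections, then one slice of the features per head
    Q, K, V = split_heads(X @ Wq), split_heads(X @ Wk), split_heads(X @ Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(head_dim)    # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                       # (heads, seq, head_dim)
    # Concatenate the heads and apply the final linear layer
    concat = heads.transpose(1, 0, 2).reshape(seq_len, embed_dim)
    return concat @ Wo

rng = np.random.default_rng(0)
seq_len, embed_dim, num_heads = 5, 8, 2                       # toy sizes
X = rng.normal(size=(seq_len, embed_dim))
Wq, Wk, Wv, Wo = (rng.normal(size=(embed_dim, embed_dim)) for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads)
print(out.shape)  # (5, 8)
```

Because the heads are stacked along the first axis, all of them are computed in a single batched matrix multiplication, which reflects why the heads can be processed in parallel.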

## 20.4 Encoder Block <a name='20.4-encoder-block'></a>

The **Encoder Block** in Transformer Networks is shown in the next figure. It processes the input embeddings of words and extracts representations in text data that can afterwards be used for different NLP tasks.

The components in the Encoder Block are:

- *Multi-head Attention layer*, which, as explained, consists of multiple self-attention modules.
- *Dropout layer* is a regular dropout layer.
- *Residual connections* are skip connections in neural networks, where the input to a layer is added to the processed output of the layer. Residual connections were popularized in the ResNet models, as they were shown to stabilize the training phase of neural networks and mitigate the problems of *vanishing and exploding gradients* (i.e., cases when the gradients become too small or too large during training). The `Add` term in the figure refers to the residual connection, which adds the input embeddings to the output of the Dropout layer.
- *Layer Normalization* is an operation similar to batch normalization in CNNs, but instead of normalizing across the samples in a batch, it normalizes the features of each token independently of the other sequences in the batch, so that the data have 0 mean and 1 standard deviation. This type of normalization is better suited to text data. As we learned in the previous lectures, such normalization layers improve the flow of gradients during training. The `Norm` term in the figure refers to the Layer Normalization operation.
- *Feed Forward network* consists of 2 fully-connected (dense) layers that extract useful data representations.
- The Encoder Block also contains one more *Dropout layer*, and another *Add & Norm* layer that forms a residual connection for the input to the Feed Forward network and applies a layer normalization operation.
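
The *Add & Norm* step from the list above can be sketched in NumPy. This is a simplified illustration: it assumes normalization over the feature axis of each token (the default axis of the Keras `LayerNormalization` layer), and it leaves out the learnable scale and offset parameters that the real layer also applies. The function names and toy inputs are hypothetical.

```python
import numpy as np

def layer_norm(x, epsilon=1e-6):
    # Normalize over the last (feature) axis, separately for every token
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)

def add_and_norm(x, sublayer_output):
    # Residual connection ("Add") followed by layer normalization ("Norm")
    return layer_norm(x + sublayer_output)

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 16))                  # 3 tokens, 16 features (toy values)
sublayer_output = rng.normal(size=(3, 16))    # e.g., output of attention + dropout
y = add_and_norm(x, sublayer_output)
print(y.mean(axis=-1))   # approximately 0 for every token
print(y.std(axis=-1))    # approximately 1 for every token
```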

Larger Transformer networks typically include several encoder blocks in a sequence. For instance, in the original paper the authors used 6 encoder blocks.

<img src="images/enc_1.png" width="300">

*Figure: Encoder block*

The implementation of the Encoder Block in Keras and TensorFlow is shown in the cell following the imported libraries.

The Encoder Block is implemented as a layer which is a subclass of the `layers.Layer` class. The `__init__()` constructor method lists the definitions of the layers in the Encoder, and the method `call` provides the forward pass with the flow of information through the layers.

- *Multi-head attention* layer is implemented in Keras, and it can be directly imported. The arguments in the layer are `num_heads`, the number of attention heads, and `key_dim`, the size of each attention head for the query and key projections (here set to the embedding dimension of the input tokens).
- *Dropout* and *Normalization* layers are also directly imported, with the arguments `rate` for the dropout rate and `epsilon`, a small float added to the variance to avoid division by 0.
- *Feed forward network* includes 2 dense layers, with the number of neurons set to `ff_dim` and `embed_dim`, respectively.
- Also note the residual connections, which are implemented by adding the input of each sub-layer to its output before the layer normalization, e.g., the inputs are added to the output of the multi-head attention.

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

In [None]:
class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super().__init__()
        self.multi_head_attention = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.feed_forward_net = keras.Sequential([layers.Dense(ff_dim, activation="relu"), layers.Dense(embed_dim),])
        self.layer_normalization1 = layers.LayerNormalization(epsilon=1e-6)
        self.layer_normalization2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs, training=None):
        # Self-attention: the inputs serve as the query and as the value (and key)
        multi_head_att_output = self.multi_head_attention(inputs, inputs)
        multi_head_att_dropout = self.dropout1(multi_head_att_output, training=training)
        # First Add & Norm: residual connection around the attention sub-layer
        add_norm_output_1 = self.layer_normalization1(inputs + multi_head_att_dropout)
        feed_forward_output = self.feed_forward_net(add_norm_output_1)
        feed_forward_dropout = self.dropout2(feed_forward_output, training=training)
        # Second Add & Norm: residual connection around the feed-forward sub-layer
        add_norm_output_2 = self.layer_normalization2(add_norm_output_1 + feed_forward_dropout)
        return add_norm_output_2

## 20.5 Positional Encoding <a name='20.5-positional-encoding'></a>

We mentioned that Transformers use word embeddings as inputs. However, the embeddings alone don't provide information about the order of words in sentences. The order of the words in a sentence is important, since a different order of the same words can convey a different meaning. To provide such information, the Transformer Network introduces a **positional encoding** for each word that is added to the input embedding, as shown in the next figure.

<img src="images/positional_encoding_1.png" width="300">

*Figure: Positional encoding*

There are different ways in which positional encoding can be implemented. In the original Transformer paper, the positional encoding is a vector that has the same size as the word embedding vector, and the authors used sine and cosine functions of different frequencies to create the position vectors, so their values lie in the range from -1 to 1. Using such positional encoding, each encoding vector corresponds to a unique position in a sequence of words. This type is called *sinusoidal positional encoding*.
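
For reference, the sinusoidal encoding from the original paper is defined as $PE_{(pos,2i)}=\sin(pos/10000^{2i/d})$ and $PE_{(pos,2i+1)}=\cos(pos/10000^{2i/d})$, where $pos$ is the position of the word and $d$ is the embedding size. A minimal NumPy sketch is shown below; the function name is hypothetical and the sketch assumes an even embedding size.

```python
import numpy as np

def sinusoidal_positional_encoding(maxlen, embed_dim):
    positions = np.arange(maxlen)[:, np.newaxis]              # (maxlen, 1)
    even_dims = np.arange(0, embed_dim, 2)[np.newaxis, :]     # feature indices 2i
    angles = positions / np.power(10000.0, even_dims / embed_dim)
    encoding = np.zeros((maxlen, embed_dim))
    encoding[:, 0::2] = np.sin(angles)    # even feature indices use sine
    encoding[:, 1::2] = np.cos(angles)    # odd feature indices use cosine
    return encoding

pe = sinusoidal_positional_encoding(maxlen=200, embed_dim=32)
print(pe.shape)             # (200, 32)
print(pe.min(), pe.max())   # all values stay within [-1, 1]
```

The resulting matrix has one row per position and can be added directly to a `(maxlen, embed_dim)` matrix of word embeddings.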

The following cell implements the addition of positional encoding to word embeddings in Keras. In this case, we will not use the approach for obtaining positional encodings based on sine and cosine functions, but instead we will use a simpler approach and learn the positional encodings in the same way the word embeddings are learned. This type of positional encoding is referred to as *learned positional embeddings*. Therefore, for both token and positional embeddings we will use the `Embedding` layer in Keras which we introduced in the previous lecture. The arguments in the `Embedding` layer are the input dimension `input_dim` and the dimension of the embedding vectors `output_dim`. For the token embeddings layer, the input dimension is the size of the vocabulary, whereas for the positional embeddings layer the input dimension is the length of the text sequences.


In the `call` method, first the length of the text sequences is assigned to `maxlen`. The function `tf.range` is similar to NumPy's `arange` and creates numbers from `start` up to (but not including) `limit` with a step `delta`. Next, the two separate `Embedding` layers are called, and the sum of the token and positional embeddings is returned.

In [None]:
class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, maxlen, vocab_size, embed_dim):
        super().__init__()
        self.token_embeddings = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.positional_embeddings = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, inputs):
        maxlen = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        position_embeddings = self.positional_embeddings(positions)
        input_embeddings = self.token_embeddings(inputs)
        return input_embeddings + position_embeddings
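
What the two `Embedding` lookups and their sum compute can be mimicked in NumPy with plain table indexing. This is only an illustration: the tables below are random stand-ins for learned embedding weights, and the sizes and token ids are toy values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, maxlen, embed_dim = 50, 6, 4                # toy sizes
token_table = rng.normal(size=(vocab_size, embed_dim))  # stand-in for token embeddings
position_table = rng.normal(size=(maxlen, embed_dim))   # stand-in for positional embeddings

token_ids = np.array([[7, 3, 9, 0, 0, 0],
                      [4, 4, 1, 8, 2, 0]])              # batch of 2 sequences
positions = np.arange(token_ids.shape[-1])              # 0, 1, ..., maxlen - 1

token_emb = token_table[token_ids]                      # (2, 6, 4)
position_emb = position_table[positions]                # (6, 4)
embeddings = token_emb + position_emb                   # broadcast over the batch
print(embeddings.shape)  # (2, 6, 4)
```

The same positional vectors are added to every sequence in the batch, so a token receives a different final embedding depending on where it appears in the sequence.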

## 20.6 Using a Transformer Model for Classification <a name='20.6-using-a-transformer-model-for-classification'></a>

### Model Definition

We will now employ the layers that we defined above, to create a Transformer model for text classification.

It is a simple model that consists of the following parts:

- **Encoder**, which includes an `Input` layer that defines the maximum length of input sequences, a `TokenAndPositionEmbedding` layer, and the `TransformerEncoder` layer.
- **Classifier**, which consists of a `GlobalAveragePooling1D` layer, and two `Dropout` and `Dense` layers. The Encoder block outputs a feature representation vector for each word in the input text sequences. Global Average Pooling averages these vectors over all words in a sequence, producing a single feature vector per sequence, and it passes that vector to the dense layers to classify the text sequences.

In [None]:
maxlen = 200  # Maximum length of input sequences is 200 words
embed_dim = 32  # Embedding size for each token
num_heads = 2  # Number of attention heads
ff_dim = 32  # Dense layer size in the feed forward network inside transformer
vocab_size = 20000  # The size of the vocabulary is 20k words

# encoder
inputs = layers.Input(shape=(maxlen,))
embedding_layer = TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim)(inputs)
x = TransformerEncoder(embed_dim, num_heads, ff_dim)(embedding_layer)

# classifier
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dropout(0.1)(x)
x = layers.Dense(20, activation="relu")(x)
x = layers.Dropout(0.1)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

model = keras.Model(inputs=inputs, outputs=outputs)
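
The effect of the `GlobalAveragePooling1D` step on the tensor shapes can be checked with a small NumPy sketch. The random array below is a stand-in for the encoder output of a batch of 4 sequences; only the shapes matter here.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, maxlen, embed_dim = 4, 200, 32
encoder_output = rng.normal(size=(batch_size, maxlen, embed_dim))

# Average over the sequence axis (axis=1): one feature vector per sequence
pooled = encoder_output.mean(axis=1)
print(encoder_output.shape)  # (4, 200, 32)
print(pooled.shape)          # (4, 32)
```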

The summary of the model is shown below.

In [None]:
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 200)]             0         
                                                                 
 token_and_position_embeddin  (None, 200, 32)          646400    
 g (TokenAndPositionEmbeddin                                     
 g)                                                              
                                                                 
 transformer_encoder (Transf  (None, 200, 32)          10656     
 ormerEncoder)                                                   
                                                                 
 global_average_pooling1d (G  (None, 32)               0         
 lobalAveragePooling1D)                                          
                                                                 
 dropout_2 (Dropout)         (None, 32)                0     
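
The parameter counts in the summary can be reproduced by hand. The short calculation below assumes the default Keras layer configurations (biases enabled, and a value projection with the same size as `key_dim` in `MultiHeadAttention`); the variable names are introduced only for this illustration.

```python
maxlen, vocab_size, embed_dim = 200, 20000, 32
num_heads, key_dim, ff_dim = 2, 32, 32

# Token embeddings plus positional embeddings
embedding_params = vocab_size * embed_dim + maxlen * embed_dim

# Multi-head attention: query, key, and value projections plus the output projection
proj_dim = num_heads * key_dim
qkv_params = 3 * (embed_dim * proj_dim + proj_dim)
out_params = proj_dim * embed_dim + embed_dim
attention_params = qkv_params + out_params

# Feed-forward network: Dense(ff_dim) followed by Dense(embed_dim)
ffn_params = (embed_dim * ff_dim + ff_dim) + (ff_dim * embed_dim + embed_dim)

# Two layer normalizations, each with a scale and an offset per feature
norm_params = 2 * (2 * embed_dim)

encoder_params = attention_params + ffn_params + norm_params
print(embedding_params)  # 646400
print(encoder_params)    # 10656
```

Almost all of the parameters are in the embedding tables, while the encoder block itself is comparatively small for these settings.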

### Loading the Dataset

Let's apply the model to sentiment analysis of the movie reviews in the IMDB dataset. The data is loaded from the Keras datasets, and it contains 25,000 training sequences and 25,000 validation sequences. Since the reviews have different lengths, the sequences are padded or truncated to `maxlen` tokens.

In [None]:
(x_train, y_train), (x_val, y_val) = keras.datasets.imdb.load_data(num_words=vocab_size)
print(len(x_train), "Training sequences")
print(len(x_val), "Validation sequences")
x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_val = keras.preprocessing.sequence.pad_sequences(x_val, maxlen=maxlen)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
25000 Training sequences
25000 Validation sequences
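
With its default arguments, `pad_sequences` pads short sequences with zeros at the beginning and truncates long sequences from the beginning. A pure-Python sketch of that behavior is shown below; the helper name `pad_or_truncate` is hypothetical and is not part of Keras.

```python
def pad_or_truncate(sequence, maxlen, value=0):
    # Keep the last `maxlen` tokens, then left-pad with `value` if needed
    sequence = list(sequence)[-maxlen:]
    return [value] * (maxlen - len(sequence)) + sequence

print(pad_or_truncate([5, 8, 2], maxlen=5))               # [0, 0, 5, 8, 2]
print(pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5))   # [3, 4, 5, 6, 7]
```

After this step every review is represented by exactly `maxlen` token ids, which is what the fixed-size `Input` layer of the model expects.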


### Model Training

In [None]:
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

model.fit(x_train, y_train, batch_size=32, epochs=2, validation_data=(x_val, y_val))

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7fd2552d61d0>

## 20.7 Fine-tuning a Pretrained BERT Model <a name='20.7-fine-tuning-a-pretrained-bert-model'></a>

**BERT** (Bidirectional Encoder Representations from Transformers) is a Transformer Network and a language model that can be used for a variety of NLP tasks such as question answering, text classification, machine translation, etc.

In this section we will use a pretrained version of BERT and fine-tune it for classification of news articles in the AG News dataset (which we used in the previous lecture).

TensorFlow Hub is a repository of pretrained machine learning models, and it offers several versions of [BERT](https://tfhub.dev/google/collections/bert/1) such as [Small BERT](https://tfhub.dev/google/collections/bert/1), [ALBERT](https://tfhub.dev/google/collections/albert/1), and [BERT Expert](https://tfhub.dev/google/collections/experts/bert/1). The different versions of BERT are optimized for different use cases. In our case, we will use a [Small BERT](https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/2) model.

To use this model we will need to install the TensorFlow Text library for text processing.

In [None]:
!pip install tensorflow_text
import tensorflow_text as text



The BERT model in TensorFlow Hub has a corresponding text preprocessing model for converting texts into tokens.

In [None]:
import tensorflow_hub as hub

bert_handle = 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/2'
preprocessing_model = 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'

The output of the preprocessing model has 3 elements:

- `input_word_ids`: token ids of the input sequences.
- `input_mask`: has value 1 for all input tokens before padding, and value 0 for the padding tokens.
- `input_type_ids`: has different values for segments in text; e.g., if there are 3 sentences in the input text, the tokens in the same sentences will have the same index.
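
The relationship between the three outputs can be illustrated with a small pure-Python sketch. This is not the TensorFlow Hub preprocessing model: the helper `build_bert_inputs` is hypothetical, the token ids are placeholders, and only a single text segment is assumed. It shows only how padding to a fixed length determines `input_mask`.

```python
def build_bert_inputs(token_ids, seq_length=128, pad_id=0):
    # Truncate, then right-pad the token ids to a fixed length
    token_ids = list(token_ids)[:seq_length]
    num_pad = seq_length - len(token_ids)
    return {
        "input_word_ids": token_ids + [pad_id] * num_pad,
        "input_mask": [1] * len(token_ids) + [0] * num_pad,   # 1 = real token, 0 = padding
        "input_type_ids": [0] * seq_length,                   # single segment: all zeros
    }

# Placeholder ids for a short text, padded to 8 positions for readability
bert_inputs = build_bert_inputs([101, 2023, 2003, 102], seq_length=8)
print(bert_inputs["input_word_ids"])  # [101, 2023, 2003, 102, 0, 0, 0, 0]
print(bert_inputs["input_mask"])      # [1, 1, 1, 1, 0, 0, 0, 0]
print(bert_inputs["input_type_ids"])  # [0, 0, 0, 0, 0, 0, 0, 0]
```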

Let's wrap `preprocessing_model` into a `hub.KerasLayer` and test it on a sample sentence.

In [None]:
preprocess_layer = hub.KerasLayer(preprocessing_model)



In [None]:
sample_news = ['Tech rumors: The tech giant Apple is working on a self driving car']

preprocessed_news = preprocess_layer(sample_news)

print('Keys:', preprocessed_news.keys())
# length of the input sequence
print('Shape:', preprocessed_news["input_word_ids"].shape)
print('Word Ids:', preprocessed_news["input_word_ids"][0,:10])
print('Input Mask:', preprocessed_news["input_mask"][0, :10])
print('Type Ids:', preprocessed_news["input_type_ids"][0, :10])

Keys: dict_keys(['input_type_ids', 'input_mask', 'input_word_ids'])
Shape: (1, 128)
Word Ids: tf.Tensor([  101  6627 11256  1024  1996  6627  5016  6207  2003  2551], shape=(10,), dtype=int32)
Input Mask: tf.Tensor([1 1 1 1 1 1 1 1 1 1], shape=(10,), dtype=int32)
Type Ids: tf.Tensor([0 0 0 0 0 0 0 0 0 0], shape=(10,), dtype=int32)


### Loading the Dataset

The news articles in the AG dataset are classified into 4 categories: World, Sports, Business, and Sci/Tech.

In [None]:
import tensorflow_datasets as tfds

(train_data, val_data), info = tfds.load('ag_news_subset:1.0.0', #version 1.0.0
                                         split=('train', 'test'),
                                         with_info=True,
                                         as_supervised=True)

Downloading and preparing dataset 11.24 MiB (download: 11.24 MiB, generated: 35.79 MiB, total: 47.03 MiB) to ~/tensorflow_datasets/ag_news_subset/1.0.0...



Dataset ag_news_subset downloaded and prepared to ~/tensorflow_datasets/ag_news_subset/1.0.0. Subsequent calls will reuse this data.


In [None]:
# Dataset information
class_names = info.features['label'].names
print('Classes:', class_names)

print('Number of training samples:', info.splits['train'].num_examples)
print('Number of test samples:', info.splits['test'].num_examples)

Classes: ['World', 'Sports', 'Business', 'Sci/Tech']
Number of training samples: 120000
Number of test samples: 7600


In [None]:
buffer_size = 1000
batch_size = 32

# prepare the data
train_data = train_data.shuffle(buffer_size)
train_data = train_data.batch(batch_size).prefetch(1)
val_data = val_data.batch(batch_size).prefetch(1)

### Model Definition with BERT

The model defined below includes an Input layer, a preprocessing layer that converts the text data into token IDs (together with the input mask and type IDs), and a layer for the BERT model.

Afterward, the pooled output of BERT is passed through a classifier head, which includes two dense layers and a dropout layer.

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np

# input layer
input_text = layers.Input(shape=(), dtype=tf.string)

# preprocessing model
preprocessing_layer = hub.KerasLayer(preprocessing_model)(input_text)
# Bert model, set trainable to True
bert_encoder = hub.KerasLayer(bert_handle, trainable=True)(preprocessing_layer)

# For fine-tuning use pooled output
pooled_bert_output = bert_encoder['pooled_output']

# classifier
x = layers.Dense(16, activation='relu')(pooled_bert_output)
x = layers.Dropout(0.2)(x)
final_output = keras.layers.Dense(4, activation='softmax')(x)


# Combine input and output
news_model = keras.Model(input_text, final_output)

### Model Training

Let's compile and train the model.

In [None]:
# compile
news_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
                        loss='sparse_categorical_crossentropy',
                        metrics=['accuracy'])
# train
news_model.fit(train_data, epochs=3, validation_data=val_data)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7fa662301910>

### Model Evaluation

Finally, let's predict the class of two news articles.

In [None]:
sample_news_1 = ['Tesla, a self driving car company is also planning to make a humanoid robot. This humanoid robot appeared dancing in the latest Tesla AI day']

predictions_1 = news_model.predict(np.array(sample_news_1))

predicted_class_1 = np.argmax(predictions_1)

print('Predicted class:', predicted_class_1)
print('Predicted class name:', class_names[predicted_class_1])

Predicted class: 3
Predicted class name: Sci/Tech


In [None]:
sample_news_2 = ["In the last weeks, there has been many transfer suprises in footbal. Ronaldo went back to Old Trafford, "
                "while Messi went to Paris Saint Germain to join his former colleague Neymar."
                "We can't wait to see these two clubs will perform in upcoming leagues"]

predictions_2 = news_model.predict(np.array(sample_news_2))

predicted_class_2 = np.argmax(predictions_2)

print('Predicted class:', predicted_class_2)
print('Predicted class name:', class_names[predicted_class_2])

Predicted class: 1
Predicted class name: Sports
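
The examples above predict one article at a time, so `np.argmax` is applied to a single row of probabilities. For a batch of several articles, the index of the largest probability should be taken along the last axis. The sketch below uses hypothetical softmax outputs instead of calling the trained model.

```python
import numpy as np

class_names = ['World', 'Sports', 'Business', 'Sci/Tech']

# Hypothetical softmax outputs for a batch of three articles (each row sums to 1)
predictions = np.array([
    [0.05, 0.02, 0.03, 0.90],
    [0.10, 0.70, 0.15, 0.05],
    [0.60, 0.10, 0.20, 0.10],
])

predicted_classes = np.argmax(predictions, axis=-1)   # one class index per article
predicted_names = [class_names[i] for i in predicted_classes]
print(predicted_classes)  # [3 1 0]
print(predicted_names)    # ['Sci/Tech', 'Sports', 'World']
```

Without `axis=-1`, `np.argmax` would return a single index into the flattened array, which matches the class index only when the batch contains one article.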


## 20.8 Decoder Sub-network <a name='20.8-decoder-sub-network'></a>

The Transformer Network in the original paper was designed for machine translation. Unlike the text classification task, where the model predicts a class label for an input text sentence, in machine translation the model takes a text sentence in a source language and predicts the corresponding text sentence in a target language. Therefore, both the input and output of the model are text sequences. These types of models are called **sequence-to-sequence models**, often abbreviated to **seq2seq models**. Besides machine translation, other NLP tasks that employ seq2seq models include question answering, text summarization, dialog generation, and others.

The architecture of Transformer Networks designed to handle seq2seq tasks consists of encoder and decoder sub-networks.

- **Encoder sub-network** takes a source text sequence as an input, and extracts a useful representation of the text data.
- **Decoder sub-network** takes a target text sequence as an input, and it also receives the intermediate representation from the encoder sub-network. The decoder combines the information from the target sequence and the encoded source sequence, and learns to predict the next word (token) in the target sequence.

During inference, the model does not have access to the target sequence. It is fed only the source sequence, and it predicts the first word of the target sequence. Afterward, the words predicted so far are fed back to the decoder, and the next word is predicted. This step is repeated until the decoder generates an end-of-sequence token. Models that generate one token at each time step and feed the output back into the model are called **autoregressive models**.

This is shown in the next figure, where the French sequence "Je suis etudiant" is translated into "I am a student". The decoder outputs one word at each time step until the end-of-sequence is reached.

<img src="images/transformer_decoding_2.png" width="700">

*Figure: Decoder block*
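
The autoregressive loop described above can be sketched in plain Python. The lookup table below is a hypothetical stand-in for a trained decoder: it maps the tokens generated so far to the next token, whereas a real model would compute a probability for every word in the vocabulary and would also attend to the encoded source sequence. The token names `<start>` and `<end>` are assumptions for this illustration.

```python
START, END = "<start>", "<end>"

# Hypothetical stand-in for a trained decoder: prefix of generated tokens -> next token
TOY_NEXT_TOKEN = {
    (START,): "I",
    (START, "I"): "am",
    (START, "I", "am"): "a",
    (START, "I", "am", "a"): "student",
    (START, "I", "am", "a", "student"): END,
}

def greedy_decode(next_token_fn, max_steps=20):
    generated = [START]
    for _ in range(max_steps):
        token = next_token_fn(tuple(generated))   # predict one token per step
        if token == END:                          # stop at the end-of-sequence token
            break
        generated.append(token)                   # feed the output back as input
    return generated[1:]

translation = greedy_decode(lambda prefix: TOY_NEXT_TOKEN[prefix])
print(" ".join(translation))  # I am a student
```

The `max_steps` limit is a common safeguard so that decoding terminates even if the end-of-sequence token is never produced.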

The architecture of the decoder is similar to the encoder and it is shown in the next figure. The upper part of the decoder is practically the same as the encoder, and it consists of a multi-head attention module with residual connections and layer normalization, followed by a feed-forward network with residual connections and layer normalization. The output of the encoder is passed to the multi-head attention module.

The main difference from the encoder is the *masked multi-head attention* module in the lower part of the decoder. This module is inserted between the target sequence (i.e., the output sequence of the decoder) and the multi-head attention module. Masked multi-head attention module applies masking to the next words in the target sequence, so that the network does not have access to those words. That is, during training, if the model needs to predict the 4th word in a sentence, masks will be applied to all words after the 3rd word, so that the model has access only to the words 1, 2, and 3, in order to predict the 4th word. This step ensures that the model uses only the previous steps to predict the word in the next step in the target sequence. This type of mask is also referred to as *causal attention mask*.
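
A causal attention mask can be sketched in NumPy as a lower-triangular matrix that is applied to the attention scores before the softmax. The function names and the random scores are assumptions made for this illustration.

```python
import numpy as np

def causal_mask(seq_len):
    # True where attention is allowed: position i may attend to positions 0..i
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    scores = np.where(mask, scores, -np.inf)             # hide the future positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)                                   # exp(-inf) = 0 for masked entries
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))                         # toy attention scores for 4 tokens
weights = masked_softmax(scores, causal_mask(4))
print(np.round(weights, 2))   # entries above the diagonal are 0
```

Each row still sums to 1, but all the weight is distributed over the current and earlier positions, so no information flows from later words in the target sequence.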

Finally, the output representations from the decoder are passed to a linear (dense) layer and a softmax layer, which output a probability for each word in the vocabulary (learned from the training dataset) being the next word.

Note also the `Nx` marks in the figure. They indicate that the shown modules in the encoder and decoder are repeated multiple times in the network. In the original Transformer Network, the encoder sub-network has 6 blocks of multi-head attention and feed forward modules, and similarly the decoder sub-network has 6 blocks of masked multi-head attention, multi-head attention, and feed forward modules. Introducing multiple modules in the sub-networks increases the learning ability, as it allows the model to learn more abstract representations.

<img src="images/transformer.png" width="700">

*Figure: Transformer Network*

Note that Recurrent Neural Networks can also be used as seq2seq models. Transformer Networks have several advantages over RNNs: they can inspect entire text sequences at once, capture context in long sequences, are parallelizable, and are more powerful in general. Conversely, RNNs process the tokens in a sequence one at a time. As a result, they have difficulty finding correlations in long sequences (because the information needs to pass through many processing steps), cannot perform parallel computations across a sequence (and are therefore slow to train), and their gradients can become unstable.

## 20.9 Vision Transformers <a name='20.9-vision-transformers'></a>

After the initial success of Transformer Networks in NLP, they have also been adapted for computer vision tasks. The initial Transformer model for vision tasks, proposed in 2021, was called **Vision Transformer (ViT)**.

The architecture of ViT is very similar to the Transformers used in NLP. However, Transformer Networks were designed for working with sequential data, while images are spatial data. Treating each pixel in an image as a sequential token would be impractical and too time-consuming. Therefore, ViT splits images into a set of smaller image patches (16x16 pixels), and it uses the sequence of image patches as inputs to the model. Each image patch is first flattened into a one-dimensional vector, and those vectors are afterward passed through a dense layer to learn an embedding for each patch. Positional embeddings are added, and the sequences are fed to a standard transformer encoder. The encoder block in ViT closely follows the encoder in the original Transformer Network. The steps are depicted in the figure below.

<img src="images/vision_transformer.gif" width="700">

*Figure: Vision Transformer*
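
The patch extraction step can be sketched with NumPy reshaping. The example assumes a 224x224 RGB input image, a resolution chosen here for illustration, and the function name `image_to_patches` is hypothetical. The dense projection and the positional embeddings that follow in ViT are not included.

```python
import numpy as np

def image_to_patches(image, patch_size=16):
    height, width, channels = image.shape
    rows, cols = height // patch_size, width // patch_size
    # Split both spatial axes into (number of patches, pixels per patch)
    patches = image.reshape(rows, patch_size, cols, patch_size, channels)
    patches = patches.transpose(0, 2, 1, 3, 4)      # (rows, cols, patch, patch, channels)
    # Flatten every patch into a one-dimensional vector
    return patches.reshape(rows * cols, patch_size * patch_size * channels)

image = np.zeros((224, 224, 3), dtype=np.float32)   # assumed input resolution
patches = image_to_patches(image)
print(patches.shape)  # (196, 768)
```

With these settings an image becomes a sequence of 196 tokens, each a vector of 16 x 16 x 3 = 768 values, which is far shorter than a sequence with one token per pixel.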

The authors trained 3 versions of ViT, called Base (12 blocks, 768 embeddings dimension, 86M parameters), Large (24 blocks, 1,024 embeddings dimension, 307M parameters), and Huge (32 blocks, 1,280 embeddings dimension, 632M parameters).

Various other versions of vision transformers have been introduced since, including MaxViT (Multi-axis ViT), Swin (Shifted Window ViT), DeiT (Data-efficient image Transformer), T2T-ViT (Token-to-token ViT), and others. These models achieved higher accuracy on many vision tasks in comparison to Convolutional Neural Networks. The following figure shows the accuracy on ImageNet at the time this lecture was prepared.

<img src="images/imagenet_accuracy.png" width="500">

*Figure: Accuracy on the ImageNet dataset*

## References <a name='references'></a>

1. The Illustrated Transformer, Jay Alammar, available at: [https://jalammar.github.io/illustrated-transformer/](https://jalammar.github.io/illustrated-transformer/).
2. Keras Examples, Text classification with Transformer, available at: [https://keras.io/examples/nlp/text_classification_with_transformer/](https://keras.io/examples/nlp/text_classification_with_transformer/).
3. Using Pretrained BERT for Text Classification, Jean de Dieu Nyandwi, available at: [https://github.com/Nyandwi/machine_learning_complete/blob/main/9_nlp_with_tensorflow/5_using_pretrained_bert_for_text_classification.ipynb](https://github.com/Nyandwi/machine_learning_complete/blob/main/9_nlp_with_tensorflow/5_using_pretrained_bert_for_text_classification.ipynb).
4. Deep Learning with Python, Francois Chollet, Second Edition, Manning Publications, 2021.
5. TensorFlow Tutorials, Neural Machine Translation with a Transformer and Keras, available at [https://www.tensorflow.org/text/tutorials/transformer](https://www.tensorflow.org/text/tutorials/transformer).
6. How the Vision Transformer (ViT) Works in 10 Minutes: An Image is Worth 16x16 Words, Nikolas Adaloglou, available at [https://theaisummer.com/vision-transformer/](https://theaisummer.com/vision-transformer/).


[BACK TO TOP](#top)