# Transformers

## Team members

Zhangbyrbay Zhassulan and Abishev Rauan

## Introduction

Transformers were introduced in 2017 by Google Brain researchers and have since revolutionized the field of machine learning, particularly in processing sequential data like text. Their architecture allows them to handle entire sequences of data simultaneously, which is a major shift from the previous models that processed data one element at a time. This makes transformers not only more efficient in understanding context in tasks like language processing but also faster to train due to their parallel processing capabilities. Their versatility has also led to significant advancements in various areas beyond language, like computer vision and audio processing, making them a popular choice in the field of artificial intelligence.

## Architecture explanation

Below, you can see the transformer architecture from the original paper, "Attention Is All You Need".

![attention.png](assets/attention.png)

On the left side of the diagram, we can see the encoder. It applies N identical blocks, one after another, to the input sequence. The original transformer model used N = 6 blocks:

![encoder.png](assets/encoder.png)

Each block outputs a sequence of the same length as its input. It has two main sub-layers: multi-head attention and a feed-forward network. Around each sub-layer, the sub-layer's input is added to its output. This is known as a residual connection. It helps preserve information from the input as it passes through the network, mitigating issues like vanishing gradients and enabling deeper networks to learn effectively.

Afterwards, the result passes through layer normalization. In the figure, this step is denoted as 'Add & Norm'.
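
The 'Add & Norm' step described above can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions: the names `layer_norm` and `add_and_norm` are made up for this example, the learnable scale and shift of layer normalization are omitted, and a single random matrix stands in for the attention or feed-forward sub-layer.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row (one token) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    """Residual connection (x + sublayer(x)) followed by layer normalization."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))            # 4 tokens, model dimension 8
W = rng.normal(size=(8, 8)) * 0.1      # hypothetical stand-in for a sub-layer
out = add_and_norm(x, lambda h: h @ W)
print(out.shape)                       # same shape as the input: (4, 8)
```

Because the sub-layer's output is added to `x` rather than replacing it, the input always has a direct path to the next block.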

The decoder has a similar scheme, also with 6 blocks. But inside each of the blocks there are two multi-head attention layers, one of which uses the outputs of the encoder.

### Attention Layer

The first part of a transformer block is the self-attention layer. Unlike earlier attention mechanisms, in which one sequence attends to another, self-attention builds new representations of a sequence's elements from the elements of that same sequence. Each element interacts directly with every other element in the sequence, which is the key feature of self-attention.

To elaborate, the attention computation for a sequence involves three matrices that are learned during training: $ W_q $, $ W_k $, and $ W_v $. Each input sequence element $ x_i $, where $ i $ denotes the element's position, is transformed by these matrices into vectors $ q_i $, $ k_i $, and $ v_i $, known as the query, key, and value, respectively. The roles of these components are as follows:

- $ q_i $ acts as a request for information, analogous to a database query;
- $ k_i $ represents the keys in a database of values, against which searches are performed;
- $ v_i $ are the actual values that correspond to the keys.

These mechanisms enable the model to dynamically focus on different parts of the input sequence, drawing richer contextual information for each element.

![QKV.png](assets/QKV.png)

We can figure out how closely a query relates to a key by using the dot product (MatMul in the image above). Here's how we calculate the self-attention weights:

$$
\text{self-attention weights}_i = \text{softmax}\left( \frac{q_i k_1^T}{C}, \frac{q_i k_2^T}{C}, \ldots \right),
$$

In this formula, $ C $ scales the dot products so that they don't get too large, which would otherwise push the softmax into regions with very small gradients. Following the original paper, $ C $ is the square root of the dimension of the queries and keys, $ d_k $:
$$ C = \sqrt{d_k} $$

After calculating these weights, we use them to take a weighted sum of the values $ v_1, v_2, \ldots $, which gives the output of the self-attention layer for position $ i $. When we write this process for the entire set of queries, keys, and values, it looks like this:

$$
\text{self-attention}(Q, K, V) = \text{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right)V,
$$

Here, $ Q, K, V $ stand for the whole matrices of queries, keys, and values. They are arranged so that row $ i $ of each matrix holds $ q_i $, $ k_i $, and $ v_i $, respectively. The softmax is applied to each row to turn the scores into probabilities.
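
The matrix formula above translates almost directly into NumPy. The sketch below is a minimal single-head version; the function names, the sizes, and the random matrices standing in for trained $ W_q $, $ W_k $, $ W_v $ are assumptions made for illustration.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    """softmax(Q K^T / sqrt(d_k)) V for a single sequence x of shape (n, d_model)."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # (n, n), one row per query
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 16, 8
x = rng.normal(size=(n, d_model))
# Random projections stand in for trained weights
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
output, weights = self_attention(x, W_q, W_k, W_v)
print(output.shape, weights.shape)
```

Row $ i $ of `weights` holds the attention weights of query $ q_i $ over all keys, so every row sums to 1.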

In [1]:
from jupyterquiz import display_quiz
import json
from base64 import b64encode

def get_spanned_encoded_q(q, q_name):
    byte_code = b64encode(bytes(json.dumps(q), 'utf8'))
    return f'<span style="display:none" id="{q_name}">{byte_code.decode()}</span>'

In [2]:
q_transformers = [{
    "question": "What are the main components and functions of the self-attention layer in a transformer block?",
    "type": "many_choice",
    "answers": [
        {
            "answer": "Each element of the sequence interacts only with the subsequent element.",
            "correct": False,
            "feedback": "Actually, each element interacts with every other element in the sequence."
        },
        {
            "answer": "Three matrices that transform each element of the input sequence: queries (Q), keys (K), and values (V).",
            "correct": True,
            "feedback": "Correct. These matrices are used to create representations of queries, keys, and values."
        },
        {
            "answer": "The softmax function is applied to each column of the score matrix.",
            "correct": False,
            "feedback": "No, the softmax is applied to each row, not to the column."
        },
        {
            "answer": "Dot product is used in calculating self-attention weights.",
            "correct": True,
            "feedback": "True. The dot product is used to assess how much a query corresponds to a key."
        },
        {
            "answer": "The parameter (C) in the self-attention weight formula is used to reduce the size of the product.",
            "correct": True,
            "feedback": "Yes, (C) is used to scale the products so they don't become too large."
        }
    ]
}]

In [3]:
display_quiz(q_transformers)


In [4]:
get_spanned_encoded_q(q_transformers, "q_transformers")


### Attention Layer in the Decoder

As we mentioned earlier, the decoder has a special type of attention layer called cross-attention. In this layer, the queries come from the output sequence, while the keys and values come from the input – that is, from the encoder's results.

It's also important to note that with the attention mechanism as described, every token would "see" the entire sequence. This isn't good for the decoder's self-attention. When generating text, we create one token at a time, and if the decoder could see future tokens during training, it would lead to "cheating," where the decoder learns from information it won't have access to at generation time, resulting in a poor-quality model. To prevent this, during training, we use something called an autoregressive (causal) mask. Before the softmax is applied, this mask sets the scores of positions that hold future information to negative infinity. This way, after the softmax, these tokens have zero weight and can't influence the current token being generated.
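
One common way to implement the autoregressive mask is to overwrite the scores of future positions with negative infinity before the softmax, so that their weights come out as exactly zero. The sketch below assumes a square score matrix for a single sequence; the function name is made up for this example.

```python
import numpy as np

def causal_attention_weights(scores):
    """Softmax over each row, with positions to the right of the diagonal masked out."""
    n = scores.shape[0]
    # True above the diagonal: key position j lies in the future of query position i
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # The diagonal is never masked, so each row has a finite maximum
    masked = masked - masked.max(axis=-1, keepdims=True)
    exp = np.exp(masked)                 # exp(-inf) == 0 for masked positions
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                # equal scores make the pattern easy to read
causal_weights = causal_attention_weights(scores)
print(causal_weights)
```

With equal scores, the first token attends only to itself, the second splits its attention evenly over the first two tokens, and so on.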

![cross-attention.png](assets/cross-attention.png)

### Multi-head attention

Think of Q, K, and V like a team that can only see one kind of connection between words in a sentence, and they only get a small slice of the whole picture. To make sure we're not missing out on anything important, the creators of transformers came up with a clever trick: use multiple teams (or "heads") of Q, K, and V at the same time, each looking at the words in a different way. Then, we mix together what all these teams find to get a full understanding. Below, there's a picture that shows how this multi-team approach, or multi-head attention, works.

![multi-head.png](assets/multi-head.png)
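
The multi-head idea can be sketched as follows. This is an illustrative NumPy version, not the original implementation: it assumes the model dimension is divisible by the number of heads, it uses random matrices in place of trained weights, and the output projection `W_o` is the step that "mixes together" what the heads found.

```python
import numpy as np

def softmax(scores):
    """Softmax over the last axis, shifted for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    n, d_model = x.shape
    d_head = d_model // num_heads        # assumes d_model is divisible by num_heads

    def split_heads(t):
        # (n, d_model) -> (num_heads, n, d_head)
        return t.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (num_heads, n, n)
    heads = softmax(scores) @ V                             # (num_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # heads side by side
    return concat @ W_o                                     # combine the heads

rng = np.random.default_rng(0)
n, d_model, num_heads = 6, 16, 4
x = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
mha_out = multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads)
print(mha_out.shape)
```

Each head works with vectors of size `d_head = d_model / num_heads`, so the total cost stays close to that of one full-width head.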

In [5]:
q_cross_attention = [{
    "question": "What is the role of the cross-attention layer in the transformer's decoder, and how does it work?",
    "type": "many_choice",
    "answers": [
        {
            "answer": "The queries in cross-attention come from the output sequence, while keys and values come from the input sequence.",
            "correct": True,
            "feedback": "Correct. In cross-attention, queries are derived from the output sequence, and keys and values from the encoder's output."
        },
        {
            "answer": "An autoregressive mask is used to prevent the decoder from seeing future tokens.",
            "correct": True,
            "feedback": "True. This mask is essential for ensuring the decoder does not 'cheat' by learning from future tokens."
        },
        {
            "answer": "In the decoder's masked self-attention, each token can see the entire output sequence, including future tokens.",
            "correct": False,
            "feedback": "Incorrect. The autoregressive mask prevents each token from attending to future tokens."
        },
        {
            "answer": "Multiple 'teams' of Q, K, and V, known as multi-head attention, are used to get a comprehensive understanding.",
            "correct": True,
            "feedback": "Yes, multi-head attention allows the model to capture different aspects of the relationship between words."
        },
        {
            "answer": "The autoregressive mask is applied after the softmax function.",
            "correct": False,
            "feedback": "No, the mask is applied before the softmax function, so that future tokens end up with zero weight."
        }
    ]
}]
display_quiz(q_cross_attention)


In [6]:
get_spanned_encoded_q(q_cross_attention, "q_cross_attention")


### Efficiency of Attention Mechanisms
Self-attention gives each token direct access to every other token in the sequence, in contrast to recurrent models, which pass information along through sequential hidden-state updates. However, this comes at a cost that is quadratic in the sequence length, $ O(n^2) $, in both time and memory. More recent models such as the Longformer address this by restricting most tokens to a local attention window, which reduces the complexity to linear in the sequence length. This makes it practical to fit much longer sequences into GPU memory and broadens the range of tasks Transformers can handle. The graphs below compare the time and memory demands of standard full self-attention and Longformer attention across sequence lengths.

![efficiency](assets/efficiency.png)
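
To get a feeling for the difference between quadratic and linear growth, we can simply count how many query-key pairs have to be scored. The sketch below is a simplification: it models only a sliding local window and ignores the few global tokens that the Longformer also uses. The function names and the window size of 128 are arbitrary choices for illustration.

```python
def full_attention_pairs(n):
    """Every token attends to every token: n * n scored pairs."""
    return n * n

def sliding_window_pairs(n, window):
    """Each token attends to itself and up to window // 2 neighbours on each side."""
    half = window // 2
    return sum(min(n - 1, i + half) - max(0, i - half) + 1 for i in range(n))

for n in (512, 1024, 2048, 4096):
    print(n, full_attention_pairs(n), sliding_window_pairs(n, window=128))
```

Doubling the sequence length multiplies the first count by four, but the second only by about two.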

### Feed-Forward Networks and Normalization in Transformer Blocks

The second component of a transformer block is the feed-forward network (FFN), comprising two fully connected layers. Each layer operates independently on every element of the input sequence. Modern architectures often employ an intermediate representation from the first layer that is significantly larger, typically four times the size of the block's outputs. This amplification in size means the computational load of the FFN is substantial. In larger models or with shorter sequences, the FFN's processing time can surpass that of the self-attention mechanism, despite the latter's quadratic complexity.

The FFN's function is represented as:
$$ \text{FFN}(x) = \text{act}\left( xW_1 + b_1 \right)W_2 + b_2, $$
where $\text{act}$ denotes the activation function in the FFN. The choice of activation function has evolved: initially, ReLU (Rectified Linear Unit) was common, but there has been a shift towards GELU (Gaussian Error Linear Unit), defined as $x\Phi(x)$, with $\Phi$ being the cumulative distribution function of the standard normal distribution.

A comparison of activation functions like ReLU, ELU, and GELU, as illustrated in the original article, highlights their differences in performance and application.

![Fully Connected Layer and Normalization](assets/gelu.png)
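
The FFN formula and the two activation functions mentioned above can be written out as follows. This is a sketch with made-up sizes and random weights; the hidden layer is four times wider than the model dimension, as described in the text, and `gelu` uses the exact definition $x\Phi(x)$ rather than one of the faster approximations.

```python
import math
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gelu(x):
    """Exact GELU, x * Phi(x), where Phi is the standard normal CDF."""
    erf = np.vectorize(math.erf)
    return x * 0.5 * (1.0 + erf(x / math.sqrt(2.0)))

def ffn(x, W1, b1, W2, b2, act=gelu):
    """FFN(x) = act(x W1 + b1) W2 + b2, applied to each row (token) of x."""
    return act(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 32              # hidden layer is four times wider
x = rng.normal(size=(4, d_model))      # 4 tokens
W1, b1 = rng.normal(size=(d_model, d_hidden)) * 0.1, np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d_hidden, d_model)) * 0.1, np.zeros(d_model)
ffn_out = ffn(x, W1, b1, W2, b2)
print(ffn_out.shape)
```

Since the same weights are applied to every row, processing one token alone gives the same result as processing it as part of the sequence.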

### Layer Normalization in Transformers

Layer normalization plays a crucial role in the architecture of Transformer models, with its positioning within the residual branch being particularly significant. Research has revealed that the position of layer normalization affects model stability and performance.

The conventional design, known as Post-Layer Normalization (PostLN), involves applying normalization after the residual connection. This approach, standard in many architectures, encounters stability issues, especially in models with numerous layers.

To mitigate these stability concerns, an alternative, Pre-Layer Normalization (PreLN), has been proposed. PreLN applies normalization at the beginning of the residual branch, prior to other operations. This slight rearrangement, as depicted on the right side of the accompanying figure, has shown improved stability during the training of deep models. 

The choice between PostLN and PreLN is thus a critical consideration in Transformer design, impacting both training stability and overall model performance.

![Layer Normalization](assets/normalization.png)
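
The difference between the two placements comes down to where `layer_norm` sits relative to the residual addition. The sketch below uses simplified, hypothetical helper functions (no learnable scale or shift) to make the contrast explicit.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row (one token) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    """PostLN: normalize after the residual addition."""
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    """PreLN: normalize only the input of the residual branch."""
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8)) * 3.0 + 1.0

def zero_sublayer(h):
    # A sub-layer that contributes nothing, to isolate the skip path
    return np.zeros_like(h)

pre_out = pre_ln_block(x, zero_sublayer)
post_out = post_ln_block(x, zero_sublayer)
print(np.allclose(pre_out, x), np.allclose(post_out, x))
```

With PreLN the skip path leaves `x` untouched, so the input can flow through many stacked blocks unchanged, whereas PostLN rescales it in every block. This is one intuition that is often given for the better training stability of PreLN.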

### Position Encoding in Transformers

Self-attention by itself is insensitive to token order: if the input tokens are shuffled, the outputs are shuffled in the same way but are otherwise unchanged, so the model has no inherent notion of which token comes first. To address this, positional embeddings are used. These embeddings inform the model about the location of each token in the sequence, distinguishing identical tokens at different positions.

Originally, positional encoding used fixed trigonometric functions:
$$ \text{PE}_{(\text{pos},2i)} = \sin\left(\frac{\text{pos}}{10000^{2i/d}}\right) $$
$$ \text{PE}_{(\text{pos},2i+1)} = \cos\left(\frac{\text{pos}}{10000^{2i/d}}\right) $$
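where $ \text{pos} $ is the position of the token in the sequence, $ i $ indexes the sine/cosine pairs along the embedding (so $ 2i $ and $ 2i+1 $ are the even and odd embedding dimensions), and $ d $ is the embedding dimension.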
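
The two formulas above can be computed for all positions at once. The sketch below assumes an even embedding dimension $ d $; the function name is chosen for this example.

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions, d):
    """Return a (num_positions, d) matrix of fixed sine/cosine encodings (d even)."""
    positions = np.arange(num_positions)[:, None]    # pos, shape (num_positions, 1)
    pair_index = np.arange(d // 2)[None, :]          # i, shape (1, d / 2)
    angles = positions / (10000 ** (2 * pair_index / d))
    pe = np.zeros((num_positions, d))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions 2i
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions 2i + 1
    return pe

pe = sinusoidal_positional_encoding(num_positions=50, d=16)
print(pe.shape)
```

The resulting rows are added to the token embeddings, so the same token gets a different input vector at each position.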

Newer methods include learnable positional embeddings added to token embeddings and models learning relative positions. These innovations have improved performance on tasks sensitive to word order.

In [7]:
q_layer_normalization = [{
    "question": "Which normalization technique, used as an alternative to Post-Layer Normalization, improves stability in deep transformer models?",
    "type": "many_choice",
    "answers": [
        {
            "answer": "Pre-Layer Normalization",
            "correct": True,
            "feedback": "Correct! Pre-Layer Normalization is applied at the beginning of the residual branch to improve stability."
        },
        {
            "answer": "Intra-Layer Normalization",
            "correct": False,
            "feedback": "Incorrect, this is not the alternative normalization technique used in transformers."
        },
        {
            "answer": "Post-Layer Normalization",
            "correct": False,
            "feedback": "No, this is the conventional technique, not the alternative one."
        }
    ]
}]
display_quiz(q_layer_normalization)


In [8]:
get_spanned_encoded_q(q_layer_normalization, "q_layer_normalization")


## BERT and GPT: Key Transformers in NLP

Transformers have become central in NLP, largely due to two architectures: BERT and GPT. They exemplify the encoder and decoder aspects of Transformers, respectively.

### GPT: Generative Pretrained Transformer

GPT, the earlier of the two, is a stack of Transformer decoder blocks acting as a standard language model. It predicts the next token, which is a multiclass classification over the vocabulary. An attention mask is essential during training to prevent information from 'future' tokens leaking into the prediction. GPT models, including conversational versions like ChatGPT, are primarily used for text generation tasks.
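
Next-token prediction as multiclass classification can be illustrated with a small loss computation. This sketch is not GPT's training code: the function name, the toy vocabulary size, and the token ids are invented for the example. It only shows how the targets are the inputs shifted by one position.

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Average cross-entropy of predicting token t + 1 from position t."""
    predictions = logits[:-1]            # outputs at positions 0 .. n - 2
    targets = token_ids[1:]              # the tokens that actually came next
    shifted = predictions - predictions.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

vocab_size = 10
token_ids = np.array([3, 7, 1, 4])
# A model that knows nothing assigns the same score to every token
uniform_logits = np.zeros((len(token_ids), vocab_size))
loss = next_token_loss(uniform_logits, token_ids)
print(loss, np.log(vocab_size))
```

For an uninformed model the loss equals the logarithm of the vocabulary size; training pushes it below that value.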

### BERT: Bidirectional Encoder Representations from Transformers

BERT, in contrast, employs bidirectional attention, allowing each token in an input sequence to use information from all other tokens. This makes BERT better for tasks involving understanding or analyzing entire inputs, like sentence classification or document similarity assessment. BERT's training involves masked language modeling (predicting hidden words) and next sentence prediction, but it's not designed for text generation.

### Distinctive Attention Mechanisms

The fundamental difference between BERT and GPT lies in their attention mechanisms:
- **BERT's Bidirectional Attention**: Each token attends to all others, enhancing context understanding.
- **GPT's Unidirectional Attention**: Each token attends only to preceding ones, fitting for generative tasks where predictions build on previous tokens.

These distinct attention approaches define their applicability and strengths in various NLP tasks.

![BERT](assets/bert.png)

## Practice

Here, we apply BERT, one of the most popular models built on the transformer architecture. Rather than training it ourselves, we load a BERT checkpoint that has already been fine-tuned on SST-2, a dataset of movie-review sentences labeled as positive or negative, and use it to estimate the sentiment of a few example sentences.

In [9]:
import plotly.graph_objects as go
from transformers import BertTokenizer, BertForSequenceClassification
from torch.nn.functional import softmax
import torch

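# Predict sentiment probabilities for one text.
# Returns (positive, negative): index 1 of the logits is read as "positive"
# and index 0 as "negative", the label order this checkpoint is assumed to use.
def get_sentiment_probabilities(text):
    # Tokenize the text into a batch containing one sequence of token ids
    encoded_input = tokenizer.encode(text, return_tensors='pt')
    # Inference only, so gradient tracking is not needed
    with torch.no_grad():
        outputs = model(encoded_input)
    # Turn the two logits into probabilities that sum to 1
    predictions = softmax(outputs.logits, dim=1)
    return predictions[:,1].item(), predictions[:,0].item()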

# Load tokenizer and model
tokenizer = BertTokenizer.from_pretrained('textattack/bert-base-uncased-SST-2')
model = BertForSequenceClassification.from_pretrained('textattack/bert-base-uncased-SST-2')

# Sample texts for analysis
texts = ["I love machine learning, it's awesome!",
         "I hate waiting in long lines.",
         "This movie is not bad, but also not great.",
         "What a wonderful day!"]

# Get probabilities
positive_probs = []
negative_probs = []
for text in texts:
    pos_prob, neg_prob = get_sentiment_probabilities(text)
    positive_probs.append(pos_prob)
    negative_probs.append(neg_prob)

# Create interactive graph
fig = go.Figure(data=[
    go.Bar(name='Positive', x=texts, y=positive_probs),
    go.Bar(name='Negative', x=texts, y=negative_probs)
])

# Update layout
fig.update_layout(barmode='group', title='Sentiment Probabilities for Different Texts',
                  xaxis_title='Texts', yaxis_title='Probability')

fig.show()

When BERT analyzed "I love machine learning, it's awesome!", it picked up the positive sentiment clearly. The sentence "I hate waiting in long lines." was easily recognized as negative. For the more nuanced "This movie is not bad, but also not great.", BERT gave a balanced sentiment score, capturing the mixed feelings. And "What a wonderful day!" was identified as highly positive, reflecting BERT's capacity to understand clear expressions of joy. These sentences showcase BERT's proficiency in gauging sentiments from text.

## Transformers in Diverse Domains

The success of Transformer architectures in text processing has inspired their application in other fields, notably computer vision and reinforcement learning.

### Transformers in Computer Vision: Vision Transformer (ViT)
In computer vision, the Vision Transformer (ViT) has been a groundbreaking application. ViT splits an image into square patches and applies self-attention to the resulting sequence of patches, setting new benchmarks in image classification. This adaptation rests on the observation that self-attention is general enough to do without task-specific architectural features, given enough training, parameters, and data.
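
Turning an image into a sequence of patches is mostly a matter of reshaping. The sketch below assumes an image stored as a (height, width, channels) array whose sides are divisible by the patch size; the function name and the 32x32 example image are illustrative only.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping square patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)       # (rows, cols, p, p, C)
    return patches.reshape(rows * cols, patch_size * patch_size * c)

image = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)
patches = image_to_patches(image, patch_size=16)
print(patches.shape)   # 4 patches, each with 16 * 16 * 3 values
```

Each row then plays the role that a token embedding plays in text: it is projected to the model dimension and fed to the transformer together with a positional embedding.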

### DALL-E: Image Generation from Text Descriptions
Transformers form the core of DALL-E, a model spearheading research in generating images from textual descriptions. DALL-E functions similarly to an autoregressive language model but generates images one "visual token" at a time. This demonstrates the adaptability of Transformer models to tasks beyond their initial design.

### Transformers in Reinforcement Learning: Decision Transformer
In reinforcement learning, the Decision Transformer illustrates the application of Transformers. It treats control as autoregressive sequence modeling. For each timestep, the model receives a triple of return-to-go (the cumulative reward still to be collected), state, and action, and it sequentially predicts the next action. This approach mirrors text generation techniques, showcasing Transformers' flexibility in different predictive scenarios.
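
To make the input format more concrete, the sketch below builds the kind of sequence a Decision Transformer is trained on, assuming the return-to-go formulation from the Decision Transformer paper, where each timestep's reward is replaced by the sum of rewards from that step onward. The function names and the toy trajectory are made up for this example, and real implementations embed each element instead of keeping tagged tuples.

```python
def returns_to_go(rewards):
    """For each timestep, the sum of rewards from that step to the end."""
    total = 0.0
    result = []
    for reward in reversed(rewards):
        total += reward
        result.append(total)
    return result[::-1]

def interleave(returns, states, actions):
    """Flatten per-timestep triples into one sequence for autoregressive modeling."""
    sequence = []
    for rtg, state, action in zip(returns, states, actions):
        sequence.extend([("return_to_go", rtg), ("state", state), ("action", action)])
    return sequence

rewards = [1.0, 0.0, 2.0]
states = ["s0", "s1", "s2"]
actions = ["left", "right", "left"]
rtg = returns_to_go(rewards)
sequence = interleave(rtg, states, actions)
print(rtg)
print(sequence[:3])
```

At prediction time the model is given the desired return and the current state, and the action is the next element to be generated, much like the next token in text generation.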

These examples highlight the Transformer architecture's versatility, transcending its initial NLP confines to impact various fields, from vision processing to dynamic decision-making systems.

![ViT](assets/vit.png)