# Introduction to Transformers and the Attention Mechanism

Transformers have revolutionized natural language processing (NLP) and many other fields of artificial intelligence. Introduced by Vaswani et al. in the landmark 2017 paper ["Attention is All You Need"](https://arxiv.org/pdf/1706.03762.pdf), the transformer architecture has since become the backbone of state-of-the-art models like BERT, GPT, and T5. Transformers are particularly effective at handling sequential data such as text: the **attention mechanism** lets them process all tokens of a sequence in parallel, which makes them faster to train on parallel hardware than recurrent models such as RNNs and LSTMs, which process tokens one at a time.

## Understanding the Attention Mechanism

The attention mechanism allows models to selectively focus on relevant parts of an input sequence when making predictions. Rather than processing each word or token in isolation or in strict sequential order, attention enables the model to dynamically weigh the importance of each token in relation to others. This is especially helpful in capturing long-range dependencies in sentences or sequences.

### Key Components of Attention

1. **Query, Key, and Value (Q, K, V)**: Each token in the sequence is transformed into three representations — the query, key, and value — through learned linear transformations. These representations enable the model to compute relevance scores (or attention weights) between tokens.
2. **Scaled Dot-Product Attention**: The attention mechanism computes attention scores by taking the dot product between queries and keys. These scores are scaled and passed through a softmax function to generate probabilities, which are then used to weight the values, effectively focusing on relevant tokens.
3. **Multi-Head Attention**: Multiple sets of attention heads allow the model to capture different aspects of relationships between tokens, enabling a richer representation.

### Why Transformers Are Powerful

The attention mechanism, combined with the ability to process tokens in parallel, makes transformers highly efficient to train on large datasets. In an RNN, information between distant tokens must pass through every intermediate step; attention instead connects any two positions directly, so transformers are far less prone to vanishing gradients over long sequences and are more effective at capturing long-range dependencies. This architecture has set the foundation for the development of models that excel in language understanding, generation, and various cross-domain applications.

In summary, transformers and the attention mechanism together provide a robust framework for processing sequential data, transforming the field of NLP and paving the way for advancements in other AI domains.

For more information, refer to the original paper: **[Attention is All You Need](https://arxiv.org/pdf/1706.03762.pdf)**.


In [1]:
# Import required libraries
from torch import Tensor
import torch.nn.functional as f
import torch
from torch import nn
# Import visualization libraries
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
%reload_ext autoreload
%autoreload 2
%matplotlib inline

# Multi-Head Attention Mechanism

Transformers utilize a specialized attention mechanism known as **multi-head attention**, which enables the model to capture multiple types of relationships and dependencies within a sequence. Multi-head attention is fundamental to the power and flexibility of transformers, allowing them to process complex data in parallel and capture nuanced contextual information.

Understanding transformers becomes straightforward once we grasp the concept of multi-head attention.

Below is an illustration of attention and multi-head attention mechanisms, adapted from the original paper, ["Attention is All You Need"](https://arxiv.org/pdf/1706.03762.pdf). 


<div align="center">
    <img src="../images/Attention.png" alt="Diagram of Attention Mechanism">
</div>



# 1. Self-Attention

We begin by understanding **scaled dot-product attention**, which is essential for building the multi-head attention layer in transformers. Mathematically, scaled dot-product attention is expressed as:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V
$$

where:
- **Q** (queries), **K** (keys), and **V** (values) are batches of matrices with shapes \((\text{batch\_size}, \text{seq\_length}, \text{num\_features})\).
- **\(d_k\)** is the feature dimension (the size of the last axis) of **Q** and **K**.
- **\(K^T\)** refers to the transpose of **K**.

Multiplying the queries **Q** by the transposed keys **\(K^T\)** results in a matrix of shape \((\text{batch\_size}, \text{seq\_length}, \text{seq\_length})\). Entry \((i, j)\) of this matrix scores how relevant element \(j\) of the sequence is to element \(i\), indicating the "attention" each element should receive relative to others.

### Normalization with Softmax

The resulting attention scores are then normalized with the softmax function along the last dimension, so that the weights in each row (one row per query position) sum to one. This normalization highlights which elements in the sequence are more significant, guiding the model's focus.
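To illustrate why the scores are divided by \(\sqrt{d_k}\) before the softmax, here is a small sketch. The sample count, the seed, and the use of standard-normal vectors are assumptions made purely for illustration. For such vectors the raw dot products have a standard deviation close to \(\sqrt{d_k}\), which would push the softmax toward near one-hot outputs with tiny gradients; after scaling, the standard deviation is close to 1.

```python
import torch

torch.manual_seed(0)
d_k = 512
q = torch.randn(10_000, d_k)   # 10,000 random query vectors
k = torch.randn(10_000, d_k)   # 10,000 random key vectors

raw = (q * k).sum(dim=-1)      # one dot product per (query, key) pair
scaled = raw / d_k ** 0.5      # the scaling used in the attention formula

raw_std = raw.std().item()
scaled_std = scaled.std().item()
print(f"std of raw scores:    {raw_std:.2f}")
print(f"std of scaled scores: {scaled_std:.2f}")

# Softmax over the last dimension turns each row of scores into weights that sum to one.
scores = torch.randn(2, 4, 4)                # (batch_size, seq_length, seq_length)
weights = torch.softmax(scores, dim=-1)
row_sums = weights.sum(dim=-1)               # shape (2, 4), every entry close to 1
```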

### Applying Attention to Values

The final step involves applying the attention scores to the values **V** via matrix multiplication, determining the final weighted representation.

> **Note**: For simplicity, we omit the optional masking operation shown in the original figure.
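For readers curious about the omitted step, the following is a minimal sketch of one common form of masking, a causal (lower-triangular) mask that stops each position from attending to later positions. The function name `masked_attention` and the choice of a causal mask are assumptions for illustration; the rest of this notebook does not use masking.

```python
from typing import Tuple

import torch
import torch.nn.functional as F
from torch import Tensor


def masked_attention(query: Tensor, key: Tensor, value: Tensor) -> Tuple[Tensor, Tensor]:
    """Scaled dot-product attention with a causal mask (hypothetical helper)."""
    scores = query.bmm(key.transpose(1, 2)) / query.size(-1) ** 0.5
    seq_q, seq_k = scores.size(-2), scores.size(-1)
    # allowed[i, j] is True when position i may attend to position j (j <= i).
    allowed = torch.tril(torch.ones(seq_q, seq_k, dtype=torch.bool, device=scores.device))
    # Disallowed scores become -inf, so softmax assigns them a weight of exactly 0.
    scores = scores.masked_fill(~allowed, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights.bmm(value), weights


torch.manual_seed(0)
x = torch.rand(2, 5, 8)                      # (batch_size, seq_length, num_features)
out, weights = masked_attention(x, x, x)
print(out.shape, weights.shape)
```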


In the following code, the matrix multiplications (MatMul) are implemented using `torch.bmm` in PyTorch. This is because **Q**, **K**, and **V** are batches of matrices with the shape \((\text{batch\_size}, \text{sequence\_length}, \text{num\_features})\), where batch matrix multiplication is performed over the last two dimensions.


In [12]:

def scaled_dot_product_attention(query: Tensor, key: Tensor, value: Tensor) -> Tensor:
    temp = query.bmm(key.transpose(1, 2))
    scale = query.size(-1) ** 0.5
    softmax = f.softmax(temp / scale, dim=-1)
    return softmax.bmm(value)
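
As a sanity check, the hand-written computation can be compared with PyTorch's built-in `torch.nn.functional.scaled_dot_product_attention`. This sketch assumes PyTorch 2.0 or later, where that function is available. The helper `reference_attention` repeats the function above so the cell runs on its own, and the tensor sizes are arbitrary.

```python
import torch
import torch.nn.functional as F
from torch import Tensor


def reference_attention(query: Tensor, key: Tensor, value: Tensor) -> Tensor:
    # Same computation as the function above, repeated so this cell is self-contained.
    scores = query.bmm(key.transpose(1, 2)) / query.size(-1) ** 0.5
    return F.softmax(scores, dim=-1).bmm(value)


torch.manual_seed(0)
q = torch.rand(4, 10, 16)   # (batch_size, query_length, num_features)
k = torch.rand(4, 12, 16)   # (batch_size, key_length, num_features)
v = torch.rand(4, 12, 16)

ours = reference_attention(q, k, v)
# The built-in is given an explicit head dimension of size 1, which is removed afterwards.
builtin = F.scaled_dot_product_attention(
    q.unsqueeze(1), k.unsqueeze(1), v.unsqueeze(1)
).squeeze(1)
max_diff = (ours - builtin).abs().max().item()
print(ours.shape, max_diff)
```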

From the diagram above, we see that multi-head attention is composed of several identical attention heads. Each attention head contains 3 linear layers, followed by scaled dot-product attention. Let's encapsulate this in an AttentionHead layer.

In [13]:
class AttentionHead(nn.Module):
    def __init__(self, dim_in: int, dim_k: int, dim_v: int):
        super().__init__()
        self.q = nn.Linear(dim_in, dim_k)
        self.k = nn.Linear(dim_in, dim_k)
        self.v = nn.Linear(dim_in, dim_v)

    def forward(self, query: Tensor, key: Tensor, value: Tensor) -> Tensor:
        return scaled_dot_product_attention(self.q(query), self.k(key), self.v(value))

# 2. Multi-Head Attention Layer

The **multi-head attention layer** is an extension of the single-head attention mechanism, allowing the model to capture various relationships and dependencies across different parts of the input sequence. To create the multi-head attention layer, we simply combine multiple (i.e., `num_heads`) independent attention heads and add a linear layer for the output.

### How It Works

Each attention head in the multi-head attention layer:
- Computes its own **query (Q)**, **key (K)**, and **value (V)** matrices.
- Applies **scaled dot-product attention** independently.

This means that each head can focus on a different part of the input sequence, attending to specific tokens or words independently of the others. By using multiple attention heads, the model can capture a broader range of relationships within the data, making it more robust and capable of handling complex sequences.

Increasing the number of attention heads allows the model to "pay attention" to more parts of the sequence simultaneously, making it more powerful in capturing fine-grained information and enhancing its ability to learn contextual relationships.

Overall, multi-head attention provides transformers with the flexibility to capture diverse and nuanced patterns, resulting in stronger and more versatile models.


In [14]:
class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads: int, dim_in: int, dim_k: int, dim_v: int):
        super().__init__()
        self.heads = nn.ModuleList(
            [AttentionHead(dim_in, dim_k, dim_v) for _ in range(num_heads)]
        )
        self.linear = nn.Linear(num_heads * dim_v, dim_in)

    def forward(self, query: Tensor, key: Tensor, value: Tensor) -> Tensor:
        return self.linear(
            torch.cat([h(query, key, value) for h in self.heads], dim=-1)
        )
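
For comparison, PyTorch ships its own `nn.MultiheadAttention` layer. The sketch below only checks input and output shapes; it is not a drop-in replacement for the class above. The built-in layer requires `embed_dim` to be divisible by `num_heads`, uses its own internal projections, and returns the attention weights alongside the output. The tensor sizes are arbitrary choices.

```python
import torch
from torch import nn

torch.manual_seed(0)
# batch_first=True makes the layer accept (batch_size, seq_length, embed_dim) inputs.
mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

x = torch.rand(4, 16, 512)
out, weights = mha(x, x, x)   # self-attention: query = key = value
print(out.shape)              # (batch_size, seq_length, embed_dim)
print(weights.shape)          # (batch_size, seq_length, seq_length), averaged over heads
```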

# 3. Positional Encoding

To build the complete transformer, we need to introduce **positional encoding**. The multi-head attention mechanism itself has no component, trainable or otherwise, that accounts for the order of tokens in a sequence: the linear layers act on each token's features independently, and the attention weights depend only on the content of the tokens, not on where they appear. Shuffling the input tokens therefore just shuffles the outputs in the same way. However, understanding the order of tokens is crucial for tasks where word order affects meaning.

### Why Positional Encoding?

Since the attention mechanism doesn’t inherently encode position information, we need a way to tell the model where each token sits within its input sequence. Vaswani et al. addressed this by adding **positional encodings** generated from trigonometric functions to the token embeddings, allowing the model to differentiate between positions without any additional learned parameters.
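The lack of position information can be checked directly. The sketch below uses a bare self-attention function with no learned projections (a simplification assumed here for brevity) and shows that permuting the input tokens simply permutes the output rows in the same way.

```python
import torch
from torch import Tensor


def attention(x: Tensor) -> Tensor:
    # Plain self-attention with no learned projections and no positional information.
    scores = x.bmm(x.transpose(1, 2)) / x.size(-1) ** 0.5
    return torch.softmax(scores, dim=-1).bmm(x)


torch.manual_seed(0)
x = torch.rand(1, 5, 8)        # (batch_size, seq_length, num_features)
perm = torch.randperm(5)       # a random reordering of the 5 tokens

out_then_shuffle = attention(x)[:, perm]
shuffle_then_out = attention(x[:, perm])
same = torch.allclose(out_then_shuffle, shuffle_then_out, atol=1e-5)
print(same)
```

Without something like a positional encoding added to the inputs, the model cannot tell these two orderings apart.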

The encoding is defined by:

$$
\text{PE}_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \quad \text{and} \quad \text{PE}_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
$$


where:
- **pos** is the position of the token in the sequence.
- **2i** and **2i+1** represent even and odd dimensions, respectively.
- **d_model** is the model’s feature dimension (the total number of features per token).
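
As a worked example of the formula, the sketch below evaluates individual entries using only the standard library. The helper name `pe_value` and the tiny `d_model = 4` are assumptions chosen so the numbers are easy to inspect by hand.

```python
import math


def pe_value(pos: int, dim: int, d_model: int) -> float:
    """Positional-encoding entry for one position and one feature index."""
    two_i = dim - dim % 2                        # even index shared by the (sin, cos) pair
    angle = pos / (10000 ** (two_i / d_model))
    return math.sin(angle) if dim % 2 == 0 else math.cos(angle)


d_model = 4
for pos in range(3):
    row = [pe_value(pos, dim, d_model) for dim in range(d_model)]
    print(pos, [f"{value:.4f}" for value in row])
```

At position 0 every sine entry is 0 and every cosine entry is 1. Higher feature indices use lower frequencies, so their values change more slowly as `pos` grows.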


### Why Use Sinusoidal Functions?

The authors also experimented with learned positional embeddings and found that the two methods yielded nearly identical results. They chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than those seen during training. Because sine and cosine are periodic and bounded within the [-1, 1] range, they provide a consistent encoding pattern across sequences of varying lengths.

This is potentially beneficial during inference when processing sequences longer than those encountered during training: the encoding of an unseen position is still well defined and follows the same pattern, although good extrapolation is not guaranteed.

While learned embeddings might be easier to implement and debug, sinusoidal encodings offer a theoretical advantage for handling longer sequences. In this explanation, we follow the authors’ approach and use sinusoidal encoding for positional information.
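
One way to make the intuition about relative positions concrete is the angle-addition identity. The original paper notes that, for a fixed offset \(k\), the encoding at position \(pos + k\) is a linear function of the encoding at position \(pos\). Writing \(\omega_i = 1 / 10000^{2i/d_{\text{model}}}\) (a shorthand introduced here), each sine and cosine pair is rotated by a matrix that depends only on \(k\):

$$
\begin{pmatrix} \text{PE}_{(pos+k,\, 2i)} \\ \text{PE}_{(pos+k,\, 2i+1)} \end{pmatrix}
=
\begin{pmatrix} \cos(k\omega_i) & \sin(k\omega_i) \\ -\sin(k\omega_i) & \cos(k\omega_i) \end{pmatrix}
\begin{pmatrix} \text{PE}_{(pos,\, 2i)} \\ \text{PE}_{(pos,\, 2i+1)} \end{pmatrix}
$$

The first row is \(\sin((pos+k)\omega_i) = \sin(pos\,\omega_i)\cos(k\omega_i) + \cos(pos\,\omega_i)\sin(k\omega_i)\), and the second row is the corresponding identity for the cosine.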


In [15]:
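def position_encoding(seq_len: int, dim_model: int, device: torch.device = torch.device("cpu")) -> Tensor:
    # pos: (1, seq_len, 1) and dim: (1, 1, dim_model) broadcast to (1, seq_len, dim_model).
    pos = torch.arange(seq_len, dtype=torch.float, device=device).reshape(1, -1, 1)
    dim = torch.arange(dim_model, dtype=torch.float, device=device).reshape(1, 1, -1)
    # Features 2i and 2i+1 share the exponent 2i / d_model, as in the formula above.
    even_dim = dim - dim % 2
    phase = pos / (1e4 ** (even_dim / dim_model))

    return torch.where(dim.long() % 2 == 0, torch.sin(phase), torch.cos(phase))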

# 4. Transformer Architecture

With all the core components in place, we can now assemble the **Transformer** model! Below is a diagram of the complete network architecture.


<div align="center">
    <img src="../images/Transformer.png" alt="Transformer Architecture">
</div>

Notice that the **Transformer** model follows an encoder-decoder architecture. The **encoder** (left side of the diagram) processes the input sequence and generates a sequence of feature vectors, one per input token, collectively referred to here as the **memory vector**. This memory vector is then passed to the **decoder** (right side), which processes the target sequence while incorporating information from the encoder’s memory. The output from the decoder represents the model’s final prediction.

### Building the Encoder and Decoder Modules

We can implement the encoder and decoder modules as separate components, combining them later to create the complete transformer model. Before diving into the implementation, there are a few more details to consider, particularly for the **feed-forward networks** within each layer of the encoder and decoder.

### Feed-Forward Network Design

Each layer in the encoder and decoder contains a **fully connected feed-forward network**. This network consists of two linear transformations with a ReLU activation function in between:
- **Input and Output Dimensionality**: 512
- **Inner Layer Dimensionality**: 2048

This gives a simple implementation for the Feed Forward modules above:


In [16]:
def feed_forward(dim_input: int = 512, dim_feedforward: int = 2048) -> nn.Module:
    return nn.Sequential(
        nn.Linear(dim_input, dim_feedforward),
        nn.ReLU(),
        nn.Linear(dim_feedforward, dim_input),
    )
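
The feed-forward network is applied to every position separately and identically. The sketch below rebuilds the same two-layer network inline and checks that running it on the whole sequence matches running it one position at a time. It also counts the parameters for the 512 and 2048 dimensions quoted above. The batch size and sequence length are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)
ffn = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))

x = torch.rand(2, 16, 512)                       # (batch_size, seq_length, dim_model)
full = ffn(x)                                    # all positions in one call
per_position = torch.stack(                      # one position at a time
    [ffn(x[:, t]) for t in range(x.size(1))], dim=1
)
matches = torch.allclose(full, per_position, atol=1e-4)

# Two weight matrices (512 x 2048 each way) plus two bias vectors.
num_params = sum(p.numel() for p in ffn.parameters())
print(matches, num_params)
```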

### 4.1 Regularization

What type of normalization should be applied? Is regularization necessary, such as dropout layers?

In the Transformer model, the output of each sub-layer is calculated as **LayerNorm(x + Sublayer(x))**, where **Sublayer(x)** represents the function executed by the sub-layer itself. Dropout is applied to the output of each sub-layer before it is added to the original input and normalized. Dropout reduces overfitting, while the residual connection and layer normalization help keep the training of deep stacks stable.
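
Written out step by step, the computation looks roughly like the sketch below. A single `nn.Linear` stands in for the real sub-layer, and the small `dim_model = 8` is chosen only to keep the example light; both are assumptions for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
dim_model = 8
sublayer = nn.Linear(dim_model, dim_model)   # stand-in for attention or feed-forward
dropout = nn.Dropout(p=0.1)
norm = nn.LayerNorm(dim_model)

x = torch.rand(2, 4, dim_model)              # (batch_size, seq_length, dim_model)
out = norm(x + dropout(sublayer(x)))         # LayerNorm(x + Dropout(Sublayer(x)))

# A freshly created LayerNorm has unit scale and zero shift, so each token
# vector comes out with mean close to 0 and (biased) variance close to 1.
token_means = out.mean(dim=-1)
token_vars = out.var(dim=-1, unbiased=False)
print(out.shape)
```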


We can encapsulate all of this in a Module:

In [17]:
class Residual(nn.Module):
    def __init__(self, sublayer: nn.Module, dimension: int, dropout: float = 0.1):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(dimension)
        self.dropout = nn.Dropout(dropout)

    def forward(self, *tensors: Tensor) -> Tensor:
        # Assume that the "value" tensor is given last, so we can compute the
        # residual.  This matches the signature of 'MultiHeadAttention'.
        return self.norm(tensors[-1] + self.dropout(self.sublayer(*tensors)))

### 4.2 Encoder

Now, let’s dive into building the encoder. With the utility methods we’ve set up, constructing the encoder is straightforward.


In [18]:
class TransformerEncoderLayer(nn.Module):
    def __init__(
        self, 
        dim_model: int = 512, 
        num_heads: int = 8,
        dim_feedforward: int = 2048, 
        dropout: float = 0.1, 
    ):
        super().__init__()
        dim_k = dim_v = dim_model // num_heads
        self.attention = Residual(
            MultiHeadAttention(num_heads, dim_model, dim_k, dim_v),
            dimension=dim_model,
            dropout=dropout,
        )
        self.feed_forward = Residual(
            feed_forward(dim_model, dim_feedforward),
            dimension=dim_model,
            dropout=dropout,
        )

    def forward(self, src: Tensor) -> Tensor:
        src = self.attention(src, src, src)
        return self.feed_forward(src)


class TransformerEncoder(nn.Module):
    def __init__(
        self, 
        num_layers: int = 6,
        dim_model: int = 512, 
        num_heads: int = 8, 
        dim_feedforward: int = 2048, 
        dropout: float = 0.1, 
    ):
        super().__init__()
        self.layers = nn.ModuleList([
            TransformerEncoderLayer(dim_model, num_heads, dim_feedforward, dropout)
            for _ in range(num_layers)
        ])

    def forward(self, src: Tensor) -> Tensor:
        seq_len, dimension = src.size(1), src.size(2)
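        # Out-of-place add on the input's device, so the caller's tensor is not modified.
        src = src + position_encoding(seq_len, dimension, device=src.device)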
        for layer in self.layers:
            src = layer(src)

        return src

### 4.3 Decoder

The decoder module is quite similar to the encoder, with a few key differences:

- The decoder takes **two inputs**: the target sequence and the memory vector from the encoder.
- Each layer in the decoder contains **two multi-head attention modules** instead of one.
- The second multi-head attention module receives the memory vector as part of its input, allowing the decoder to incorporate information from the encoder.
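
In the original paper, this second attention module takes its queries from the decoder and its keys and values from the encoder output. The sketch below shows only the resulting shapes. It leaves out the learned projections for brevity and deliberately uses different source and target lengths; all sizes are arbitrary.

```python
import torch

torch.manual_seed(0)
batch_size, src_len, tgt_len, dim = 2, 7, 5, 16
memory = torch.rand(batch_size, src_len, dim)   # encoder output
tgt = torch.rand(batch_size, tgt_len, dim)      # decoder hidden states

# Cross-attention: queries from the decoder, keys and values from the memory.
scores = tgt.bmm(memory.transpose(1, 2)) / dim ** 0.5   # (batch_size, tgt_len, src_len)
weights = torch.softmax(scores, dim=-1)
out = weights.bmm(memory)                                # (batch_size, tgt_len, dim)
print(weights.shape, out.shape)
```

The output has one row per target position, so it can be added back to the decoder state through the residual connection.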


In [20]:
class TransformerDecoderLayer(nn.Module):
    def __init__(
        self, 
        dim_model: int = 512, 
        num_heads: int = 8,
        dim_feedforward: int = 2048, 
        dropout: float = 0.1, 
    ):
        super().__init__()
        dim_k = dim_v = dim_model // num_heads
        self.attention_1 = Residual(
            MultiHeadAttention(num_heads, dim_model, dim_k, dim_v),
            dimension=dim_model,
            dropout=dropout,
        )
        self.attention_2 = Residual(
            MultiHeadAttention(num_heads, dim_model, dim_k, dim_v),
            dimension=dim_model,
            dropout=dropout,
        )
        self.feed_forward = Residual(
            feed_forward(dim_model, dim_feedforward),
            dimension=dim_model,
            dropout=dropout,
        )

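    def forward(self, tgt: Tensor, memory: Tensor) -> Tensor:
        tgt = self.attention_1(tgt, tgt, tgt)
        # Cross-attention: queries come from the decoder, keys and values from
        # the encoder memory. Residual adds its last argument, so the wrapper's
        # parts are applied by hand to keep `tgt` on the residual path.
        cross = self.attention_2
        tgt = cross.norm(tgt + cross.dropout(cross.sublayer(tgt, memory, memory)))
        return self.feed_forward(tgt)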


class TransformerDecoder(nn.Module):
    def __init__(
        self, 
        num_layers: int = 6,
        dim_model: int = 512, 
        num_heads: int = 8, 
        dim_feedforward: int = 2048, 
        dropout: float = 0.1, 
    ):
        super().__init__()
        self.layers = nn.ModuleList([
            TransformerDecoderLayer(dim_model, num_heads, dim_feedforward, dropout)
            for _ in range(num_layers)
        ])
        self.linear = nn.Linear(dim_model, dim_model)

    def forward(self, tgt: Tensor, memory: Tensor) -> Tensor:
        seq_len, dimension = tgt.size(1), tgt.size(2)
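        # Out-of-place add on the input's device, so the caller's tensor is not modified.
        tgt = tgt + position_encoding(seq_len, dimension, device=tgt.device)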
        for layer in self.layers:
            tgt = layer(tgt, memory)

        return torch.softmax(self.linear(tgt), dim=-1)

### 4.4 Transformer Class

Finally, we’ll combine everything into a single **Transformer class**. This step is straightforward: we simply bring together the encoder and decoder modules and ensure data flows through them in the proper sequence.


In [21]:
class Transformer(nn.Module):
    def __init__(
        self, 
        num_encoder_layers: int = 6,
        num_decoder_layers: int = 6,
        dim_model: int = 512, 
        num_heads: int = 8,
        dim_feedforward: int = 2048, 
        dropout: float = 0.1, 
        activation: nn.Module = nn.ReLU(),
    ):
        super().__init__()
        self.encoder = TransformerEncoder(
            num_layers=num_encoder_layers,
            dim_model=dim_model,
            num_heads=num_heads,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
        )
        self.decoder = TransformerDecoder(
            num_layers=num_decoder_layers,
            dim_model=dim_model,
            num_heads=num_heads,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
        )

    def forward(self, src: Tensor, tgt: Tensor) -> Tensor:
        return self.decoder(tgt, self.encoder(src))

# 5. Testing the Model

With everything in place, it’s time to test the model to ensure our implementation works correctly. We’ll create random tensors for `src` and `tgt`, run them through the model to verify it executes without errors, and confirm that the output tensor has the expected shape.


In [24]:
src = torch.rand(64, 16, 512)
tgt = torch.rand(64, 16, 512)
out = Transformer()(src, tgt)
print(out.shape)

torch.Size([64, 16, 512])
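
As a point of reference, the same kind of smoke test can be run against PyTorch's built-in `nn.Transformer`, here with source and target sequences of different lengths. This is a sketch under assumed small dimensions to keep it fast. The built-in module differs from the class above in that it does not add positional encodings or apply a final softmax.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Transformer(
    d_model=32,
    nhead=4,
    num_encoder_layers=2,
    num_decoder_layers=2,
    dim_feedforward=64,
    dropout=0.1,
    batch_first=True,          # inputs are (batch_size, seq_length, d_model)
)

src = torch.rand(8, 10, 32)    # source sequences of length 10
tgt = torch.rand(8, 6, 32)     # target sequences of length 6
out = model(src, tgt)
print(out.shape)               # one output vector per target position
```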
