# **Transformer (Attention Is All You Need) Overview**

This notebook provides a concise summary of the seminal paper "[Attention is All You Need (2017)](https://arxiv.org/abs/1706.03762)" by Vaswani et al., which introduced the Transformer architecture. The goal is to explain the core concepts and mathematical formulations (in English) without any images.

## **1. Introduction**
Traditional sequence-to-sequence models relied on recurrent or convolutional structures, which made it difficult to process sequences in parallel and capture long-range dependencies efficiently. The Transformer architecture replaces these recurrent/convolutional components entirely with **attention mechanisms**, allowing for significantly improved parallelization and better performance on many NLP tasks.


## **2. Embedding**
Before feeding the input tokens to the Transformer, they are transformed into dense vector representations known as **embeddings**. An embedding maps each token (e.g., a word, subword, or character) to a fixed-dimensional continuous vector space.

### **2.1 Brief Historical Perspective**
Early approaches to represent words as numbers used **one-hot encoding**: each word in the vocabulary \( V \) is represented by a sparse vector with only one dimension set to 1 and the rest set to 0. This approach does not capture the semantic or syntactic relationships between words.

Later, researchers developed **distributed representations** for words:
- **Word2Vec (Mikolov et al., 2013)**: Learns embeddings by predicting context words from a target word (Skip-gram) or vice versa (CBOW).
- **GloVe (Pennington et al., 2014)**: A model based on global word co-occurrence statistics.
- **FastText (Bojanowski et al., 2017)**: Extends Word2Vec by representing each word as a bag of character n-grams, so that rare and unseen words still receive embeddings.

The Transformer architecture paper (Vaswani et al., 2017) did not invent the concept of embeddings from scratch. Instead, it built upon this foundation. The key idea: each token in the vocabulary is associated with a learned dense vector, and this mapping is jointly trained with the rest of the Transformer.

### **2.2 Building the Vocabulary**
Transformers can work with different vocabulary construction methods:
- **Word-based**: Assign each unique word an ID.
- **Subword-based (BPE, WordPiece, etc.)**: Break words into smaller subword units.
- **Character-based**: Each character is a token.

Once the vocabulary is defined, each token \( x_i \) in a sequence is mapped to an index \( \text{index}(x_i) \). If \( E \) is an embedding matrix of size \( |V| \times d_{model} \), then
\[
e_i = E[\text{index}(x_i)],\quad i = 1, 2, \dots, n,
\]
where \( |V| \) is the vocabulary size, and \( d_{model} \) is the dimension of each embedding vector.
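
As a rough illustration of the word-based option, the sketch below builds a toy vocabulary and maps a sentence to indices. The corpus, the `<unk>` token, and the helper names `build_vocab` and `encode` are assumptions made for this example; they do not come from the paper.

```python
def build_vocab(sentences, unk_token="<unk>"):
    """Assign each unique whitespace-separated word an integer ID."""
    vocab = {unk_token: 0}  # reserve ID 0 for out-of-vocabulary words
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def encode(sentence, vocab, unk_token="<unk>"):
    """Map each word to its index, falling back to the unknown-token ID."""
    return [vocab.get(word, vocab[unk_token]) for word in sentence.lower().split()]


corpus = ["the cat sat", "the dog sat down"]  # hypothetical toy corpus
vocab = build_vocab(corpus)
indices = encode("the bird sat", vocab)       # "bird" is not in the vocabulary

print(vocab)    # {'<unk>': 0, 'the': 1, 'cat': 2, 'sat': 3, 'dog': 4, 'down': 5}
print(indices)  # [1, 0, 3]
```

The resulting index list is what an embedding matrix \( E \) would be indexed with. Subword schemes such as BPE differ in how the tokens are produced, but the token-to-index step is the same.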

### **2.3 Embedding Mechanism in Transformers**
The embedding lookup itself is a simple row selection:
\[
e_i = E[\text{index}(x_i)],
\]
where \( \text{index}(x_i) \) is the integer index of token \( x_i \), and \( E \) is a learnable matrix. During training, backpropagation updates the rows of this embedding matrix together with the other model parameters, so tokens that occur in similar contexts tend to end up in similar regions of the embedding space.
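
One way to connect this lookup to the one-hot encoding from Section 2.1: selecting a row of \( E \) gives the same result as multiplying a one-hot vector by \( E \), only without building the sparse vector. The sketch below checks this numerically; the sizes and token indices are arbitrary values chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, d_model = 10, 4            # illustrative sizes only
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([3, 7, 3])    # arbitrary example indices

# Direct row lookup: e_i = E[index]
looked_up = embedding(token_ids)

# Equivalent but wasteful formulation: one-hot vectors multiplied by E
one_hot = F.one_hot(token_ids, num_classes=vocab_size).float()
via_matmul = one_hot @ embedding.weight

print(torch.allclose(looked_up, via_matmul))     # both routes agree
print(torch.equal(looked_up[0], looked_up[2]))   # same token, same vector
```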

### **2.4 Example Code for a Simple Embedding Layer**
Below is a minimal PyTorch example that demonstrates how an embedding layer converts token indices to vectors. This is not the full Transformer code, just a snippet:


In [1]:
import torch
import torch.nn as nn

# Suppose we have the following parameters
vocab_size = 100  # just an example
d_model = 16     # embedding dimension
max_seq_len = 5  # assume we have 5 tokens in a sequence

# Create an embedding layer
embedding_layer = nn.Embedding(num_embeddings=vocab_size, embedding_dim=d_model)

# Example: a batch of 2 sequences, each of length 5
batch_token_indices = torch.tensor([
    [1, 5, 2, 99, 3],
    [10, 11, 12, 13, 14]
])

# Pass through the embedding layer
embedded_output = embedding_layer(batch_token_indices)
print("Embedded output shape:", embedded_output.shape)
embedded_output

Embedded output shape: torch.Size([2, 5, 16])


tensor([[[-0.1922, -1.7240, -1.2398, -0.4538,  1.1732,  0.2561,  0.8341,
           0.9263,  1.0329,  1.1626,  0.7192,  0.2055, -2.1686,  0.3992,
          -1.4849, -0.0553],
         [ 1.1031, -0.1748, -0.4883,  0.7280,  0.2793,  0.5590, -1.0705,
           2.5811, -1.1439, -0.7423,  1.0178,  0.6332, -0.1173,  2.1692,
          -0.6107,  0.5177],
         [-1.2862, -0.5340,  0.2357, -0.6905,  1.7576,  0.2066,  1.8103,
           0.7516, -0.8231, -0.1784,  1.3053, -0.2123,  1.1030, -0.8248,
           0.1393, -0.4550],
         [-0.4893,  0.8707,  0.0177, -0.0708,  0.5352,  1.1423,  2.9484,
          -0.0720,  2.2152,  0.6408, -0.1529, -1.2833,  0.3310,  1.4489,
          -0.1356, -0.9928],
         [ 1.3111, -0.4251, -0.9976, -0.1813, -0.4673, -0.1273,  0.2141,
           0.1439,  1.0661, -0.4623, -0.3320,  0.2234, -0.5272, -1.2586,
           0.1332,  0.2646]],

        [[ 1.0795, -0.2132,  1.4992, -1.4102, -0.5895, -0.4260, -1.2568,
           1.2368,  1.2039,  1.5795,  3.7722, -1.4

### **2.5 Integrating Positional Encoding**
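Because the embedding lookup carries no information about token order, the Transformer adds a **positional encoding** to each embedding (Section 3 gives the details):
\[
z_i = e_i + PE_i,\quad i = 1, 2, \dots, n,
\]
where \( PE_i \) is the positional encoding for the \( i \)-th position, computed in the original paper with sinusoidal functions of different frequencies.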

---
## **3. Positional Encoding**
Unlike RNNs, Transformers have no built-in notion of sequence order. To address this, the model uses **positional encodings** to inject information about the relative or absolute position of tokens in the sequence.

The paper uses sine and cosine functions of different frequencies:
\[
PE_{(pos,2i)} = \sin\Bigl(pos / 10000^{2i/d_{model}}\Bigr),\quad
PE_{(pos,2i+1)} = \cos\Bigl(pos / 10000^{2i/d_{model}}\Bigr),
\]
where
- \( pos \) is the position in the sequence (0, 1, 2, ...),
- \( i \) indexes pairs of dimensions (\( i = 0, 1, ..., d_{model}/2 - 1 \)), so dimension \( 2i \) uses the sine and dimension \( 2i+1 \) the cosine,
- \( d_{model} \) is the dimension of the embeddings.

This way, each position has a unique positional encoding vector of dimension \( d_{model} \). The final input representation for each token is:
\[
z_i = e_i + PE_i,
\]
where \( PE_i \) is the positional encoding vector for the \( i \)-th token (here \( i \) indexes tokens, not dimensions).
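
A minimal sketch of the sinusoidal encoding, following the formulation in Vaswani et al. (base 10000, exponent \( 2i/d_{model} \)). The function name `sinusoidal_positional_encoding` and the random stand-in for the embeddings are assumptions for illustration, and the sketch assumes an even \( d_{model} \).

```python
import math
import torch


def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) tensor of sinusoidal encodings (d_model even)."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    # 1 / 10000^(2i / d_model) for i = 0, 1, ..., d_model/2 - 1
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions 2i
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions 2i + 1
    return pe


pe = sinusoidal_positional_encoding(max_len=5, d_model=16)
embeddings = torch.randn(2, 5, 16)  # stand-in for e_i: (batch, seq_len, d_model)
z = embeddings + pe                 # pe broadcasts over the batch dimension
print(pe.shape, z.shape)
```

Because the encoding depends only on the position and the dimension, it can be computed once and reused for every batch.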

## **4. Scaled Dot-Product Attention**
The core idea behind the Transformer is the **self-attention** mechanism. Self-attention computes a set of **queries** (Q), **keys** (K), and **values** (V) from the input to decide how much each token should pay attention to the other tokens in the sequence.

### **4.1 Formulas**
Given queries \( Q \) and keys \( K \) of dimension \( d_k \), and values \( V \) of dimension \( d_v \) (the code in this notebook uses \( d_v = d_k \)), the attention output is:
\[
\text{Attention}(Q, K, V) = \text{softmax}\Bigl(\frac{Q K^T}{\sqrt{d_k}}\Bigr) V.
\]
- \( Q K^T \) produces a score matrix (how relevant each query is to each key).
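- We divide by \( \sqrt{d_k} \) because dot products grow in magnitude with \( d_k \); without scaling, large scores push the softmax into regions with very small gradients.
- We apply the softmax over the keys so that each query's attention weights sum to 1.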
- Finally, we use these weights to produce a weighted sum of the values \( V \).
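
The effect of the scaling factor can be checked numerically. The paper's argument is that if the components of a query and a key are independent with mean 0 and variance 1, their dot product has variance \( d_k \). The sketch below samples such vectors (the sample count and \( d_k = 512 \) are arbitrary choices for illustration) and compares the spread of raw and scaled scores.

```python
import torch

torch.manual_seed(0)
d_k = 512
q = torch.randn(10000, d_k)  # components ~ N(0, 1)
k = torch.randn(10000, d_k)

raw = (q * k).sum(dim=-1)    # 10,000 independent dot products
scaled = raw / d_k ** 0.5

print("std of raw scores:   ", raw.std().item())     # roughly sqrt(512), about 22.6
print("std of scaled scores:", scaled.std().item())  # roughly 1
```

With unscaled scores of this magnitude, the softmax output is typically dominated by one or two entries, which is the saturation the scaling is meant to avoid.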

### **4.2 Code Implementation**


In [2]:
import math

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """
    Q and K have shape (batch_size, seq_len, d_k); V has shape (batch_size, seq_len, d_v).
    Returns the attention output and the attention weights.
    """
    d_k = Q.size(-1)  # query/key dimension
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)  # (batch_size, seq_len, seq_len)
    attn_weights = F.softmax(scores, dim=-1)  # each row sums to 1 over the keys
    output = torch.matmul(attn_weights, V)
    return output, attn_weights

# Example usage:
batch_size, seq_len, d_k = 2, 5, 16
Q = torch.rand(batch_size, seq_len, d_k)
K = torch.rand(batch_size, seq_len, d_k)
V = torch.rand(batch_size, seq_len, d_k)

attn_output, attn_weights = scaled_dot_product_attention(Q, K, V)
print("Attention Output Shape:", attn_output.shape)
print("Attention Weights Shape:", attn_weights.shape)

Attention Output Shape: torch.Size([2, 5, 16])
Attention Weights Shape: torch.Size([2, 5, 5])


## **5. Multi-Head Attention**
Instead of computing a single attention function, the Transformer uses **multiple attention heads** to capture different aspects of the relationships between tokens.

- We have \( h \) heads, each with parameters \( W_Q^{(i)}, W_K^{(i)}, W_V^{(i)} \) (linear transformations for Q, K, V for head \( i \)).
- Each head produces an output, which we then concatenate and project again.

### **5.1 Formula**
\[
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h) W^O,
\]
where
- \( \text{head}_i = \text{Attention}(Q W_Q^{(i)}, K W_K^{(i)}, V W_V^{(i)}) \).
- \( W^O \) is a final linear transformation.
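
The formula can be sketched as a small PyTorch module. This is one common way to implement it, not the paper's reference code: the per-head matrices \( W_Q^{(i)}, W_K^{(i)}, W_V^{(i)} \) are packed side by side into single bias-free linear layers, and the class name `MultiHeadAttention` and the sizes are assumptions for illustration.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # Each layer holds all per-head projection matrices side by side.
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)  # W^O

    def _split_heads(self, x):
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_k)
        batch, seq_len, _ = x.shape
        return x.view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)

    def forward(self, query, key, value):
        q = self._split_heads(self.w_q(query))
        k = self._split_heads(self.w_k(key))
        v = self._split_heads(self.w_v(value))

        # Scaled dot-product attention, computed for all heads at once.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        weights = F.softmax(scores, dim=-1)
        heads = weights @ v  # (batch, num_heads, seq_len, d_k)

        # Concat(head_1, ..., head_h): merge the head and d_k dimensions.
        batch, _, seq_len, _ = heads.shape
        concat = heads.transpose(1, 2).contiguous().view(batch, seq_len, -1)
        return self.w_o(concat), weights


mha = MultiHeadAttention(d_model=16, num_heads=4)
x = torch.rand(2, 5, 16)
out, weights = mha(x, x, x)  # self-attention: Q, K, V all come from x
print(out.shape, weights.shape)
```

Splitting \( d_{model} \) across heads keeps the total cost close to that of a single full-width head while letting each head learn its own attention pattern.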


## **6. Position-wise Feed-Forward Network (FFN)**
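After the attention sublayer, each position in the sequence passes through a fully connected feed-forward network, applied independently and identically to each position. The typical form is:
\[
\text{FFN}(x) = \max(0, x W_1 + b_1) W_2 + b_2.
\]
The ReLU supplies the non-linearity, and the weights \( W_1, W_2 \) are shared across all positions within a layer. In the paper's base model the inner layer is wider than the model dimension (\( d_{ff} = 2048 \) versus \( d_{model} = 512 \)).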
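
A minimal sketch of this sublayer in PyTorch. The class name `PositionwiseFeedForward` and the small sizes are assumptions for illustration. The last lines check the position-wise property: running a single position through the network gives the same result as running the whole sequence.

```python
import torch
import torch.nn as nn


class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)  # W_1, b_1
        self.linear2 = nn.Linear(d_ff, d_model)  # W_2, b_2

    def forward(self, x):
        # max(0, x W_1 + b_1) W_2 + b_2, applied along the last dimension
        return self.linear2(torch.relu(self.linear1(x)))


ffn = PositionwiseFeedForward(d_model=16, d_ff=64)
x = torch.rand(2, 5, 16)     # (batch, seq_len, d_model)
y = ffn(x)

single = ffn(x[:, 2:3, :])   # only the third position
print(y.shape)
print(torch.allclose(y[:, 2:3, :], single, atol=1e-5))
```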


## **7. Putting It All Together**
A single **Transformer Encoder** layer consists of:
1. Multi-head self-attention sublayer (with residual connection & layer normalization).
2. Position-wise feed-forward network sublayer (with residual connection & layer normalization).
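
The two sublayers can be sketched as follows, using PyTorch's built-in `nn.MultiheadAttention` and the post-normalization order from the paper, \( \text{LayerNorm}(x + \text{Sublayer}(x)) \). The class name `EncoderLayer` and the sizes are assumptions for illustration, and dropout is omitted for brevity.

```python
import torch
import torch.nn as nn


class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Sublayer 1: multi-head self-attention, residual connection, layer norm
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Sublayer 2: position-wise feed-forward, residual connection, layer norm
        x = self.norm2(x + self.ffn(x))
        return x


layer = EncoderLayer(d_model=16, num_heads=4, d_ff=64)
z = torch.rand(2, 5, 16)   # (batch, seq_len, d_model), e.g. embeddings plus positions
encoded = layer(z)
print(encoded.shape)
```

Since the output has the same shape as the input, such layers can be stacked directly; the paper's encoder stacks six of them.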

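The **Transformer Decoder** has a similar structure, with two differences: its self-attention is masked so that each position attends only to itself and earlier positions, and it has an additional cross-attention sublayer that attends to the encoder output.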
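
In the paper, the decoder's self-attention is additionally masked so that a position cannot attend to later positions. A minimal sketch of such a causal mask applied to scaled dot-product attention is shown below; the function name `causal_attention` and the tensor sizes are assumptions for illustration.

```python
import math
import torch
import torch.nn.functional as F


def causal_attention(Q, K, V):
    """Scaled dot-product attention where position t sees only positions <= t."""
    seq_len = Q.size(-2)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
    # True above the diagonal marks "future" positions that must be hidden.
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))
    weights = F.softmax(scores, dim=-1)  # masked entries receive weight 0
    return weights @ V, weights


Q = K = V = torch.rand(1, 4, 8)
masked_out, masked_weights = causal_attention(Q, K, V)
print(masked_weights[0])  # lower-triangular: zeros above the diagonal
```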

Overall, the key takeaway from *Attention Is All You Need* is that attention mechanisms alone, without recurrences or convolutions, can achieve state-of-the-art performance in sequence modeling tasks, especially in NLP.

### **7.1 Advantages**
- **Parallelization**: Unlike RNNs, the Transformer can process all tokens in a sequence at once.
- **Long-range dependencies**: Self-attention connects any two positions directly, so the path between distant tokens has constant length, whereas in an RNN it grows with the distance between them.
- **Modular design**: Easy to scale by increasing model depth, width, or number of heads.


## **References**
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). [Attention Is All You Need](https://arxiv.org/abs/1706.03762). *Advances in Neural Information Processing Systems*, 30.
- Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781). *ICLR*.
- Pennington, J., Socher, R., & Manning, C. (2014). [GloVe: Global Vectors for Word Representation](https://aclanthology.org/D14-1162). *EMNLP*.
- Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). [Enriching Word Vectors with Subword Information](https://arxiv.org/abs/1607.04606). *TACL*.

---
**End of Notebook**
