# **Transformer (Attention Is All You Need) Overview**

This notebook provides a concise summary of the seminal paper "[Attention Is All You Need (2017)](https://arxiv.org/abs/1706.03762)" by Vaswani et al., which introduced the Transformer architecture. The goal is to explain the core concepts and mathematical formulations in plain English, without relying on images.

## **1. Introduction**
Traditional sequence-to-sequence models relied on recurrent or convolutional structures. Recurrent models process tokens one after another, which prevents parallelization within a sequence, and convolutional models need many stacked layers to relate distant positions. The Transformer architecture replaces these recurrent/convolutional components entirely with **attention mechanisms**, allowing for significantly improved parallelization and more direct modeling of long-range dependencies.


## **2. Embedding**
Before feeding the input tokens to the Transformer, they are transformed into dense vector representations known as **embeddings**. An embedding maps each token (e.g., a word, subword, or character) to a fixed-dimensional continuous vector space.

- Let the input sequence be \( x = (x_1, x_2, ..., x_n) \).
- We convert each token \( x_i \) into an embedding vector \( e_i \) of dimension \( d_{model} \).

Mathematically, if \( E \) is the embedding matrix of size \( |V| \times d_{model} \) (where \( |V| \) is vocabulary size), then
\[
e_i = E[x_i],\quad i = 1, 2, ..., n.
\]
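
The lookup \( e_i = E[x_i] \) can be sketched with PyTorch's `nn.Embedding`, whose weight matrix plays the role of \( E \). The vocabulary size, model dimension, and token ids below are arbitrary illustrative values, not taken from the paper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, d_model = 100, 8                    # illustrative sizes (assumption)
embedding = nn.Embedding(vocab_size, d_model)   # embedding.weight plays the role of E

# One sequence of n = 4 token ids (arbitrary values below vocab_size).
x = torch.tensor([[5, 17, 42, 5]])
e = embedding(x)                                # shape: (batch_size, n, d_model)

# The lookup is plain row indexing into E: e_i = E[x_i].
same_as_indexing = torch.equal(e[0], embedding.weight[x[0]])
print(e.shape, same_as_indexing)
```

Because the same token id (5) appears at the first and last position, both positions receive identical embedding vectors, which is why position information has to be added separately.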

The embeddings are then supplemented with **positional encodings** to incorporate sequence order information.


## **3. Positional Encoding**
Unlike RNNs, Transformers do not have a built-in notion of sequence order. To address this, the model uses **positional encodings** to inject information about the relative or absolute position of tokens in the sequence.

The paper uses sine and cosine functions of different frequencies:
\[
PE_{(pos,2i)} = \sin\Bigl(pos / 10000^{2i/d_{model}}\Bigr),\quad
PE_{(pos,2i+1)} = \cos\Bigl(pos / 10000^{2i/d_{model}}\Bigr),
\]
where
- \( pos \) is the position in the sequence (0, 1, 2, ...),
- \( i \) indexes the sine/cosine pairs of dimensions, \( i = 0, 1, ..., d_{model}/2 - 1 \),
- \( d_{model} \) is the dimension of the embeddings.

This way, each position has a unique positional encoding vector of dimension \( d_{model} \). The final input representation for the token at position \( t \) is:
\[
z_t = e_t + PE_t,
\]
where \( e_t \) is the embedding of that token and \( PE_t \) is the positional encoding vector for \( pos = t \).
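
A minimal sketch of the sinusoidal encoding, using base 10000 and exponent \( 2i/d_{model} \) as in the paper. The function name `sinusoidal_positional_encoding` and the tensor sizes are illustrative assumptions, and the sketch assumes an even \( d_{model} \).

```python
import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) tensor of encodings; assumes d_model is even."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)       # 0, 2, 4, ...
    # 1 / 10000^(2i / d_model), computed in log space for numerical stability.
    inv_freq = torch.exp(-math.log(10000.0) * two_i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * inv_freq)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * inv_freq)   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)

# Add the encodings to stand-in embeddings of shape (batch_size, seq_len, d_model).
e = torch.rand(2, 10, 16)
z = e + pe[:10]                               # broadcasts over the batch dimension
print(pe.shape, z.shape)
```

The encoding depends only on the position and the dimension, so it can be computed once and reused for every batch.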

## **4. Scaled Dot-Product Attention**
The core idea behind the Transformer is the **self-attention** mechanism. Self-attention computes a set of **queries** (Q), **keys** (K), and **values** (V) from the input to decide how much each token should pay attention to the other tokens in the sequence.

### **4.1 Formulas**
Given matrices \( Q \) and \( K \) whose rows are queries and keys of dimension \( d_k \), and a matrix \( V \) whose rows are values of dimension \( d_v \), the attention output is:
\[
\text{Attention}(Q, K, V) = \text{softmax}\Bigl(\frac{Q K^T}{\sqrt{d_k}}\Bigr) V.
\]
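- \( Q K^T \) produces a score matrix whose entry \( (i, j) \) measures how relevant key \( j \) is to query \( i \).
- We divide by \( \sqrt{d_k} \) because the dot products grow in magnitude with \( d_k \); without this scaling, the softmax is pushed into regions with extremely small gradients.
- We apply the softmax function along each row, so the attention weights for each query are non-negative and sum to 1.
- Finally, we use these weights to produce a weighted sum of the values \( V \).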
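
The effect of the scaling can be checked numerically. The sketch below assumes query and key components are independent with mean 0 and variance 1 (the setting the paper uses to motivate the factor); the sample count and \( d_k \) are arbitrary choices.

```python
import torch

torch.manual_seed(0)

d_k = 256
num_samples = 10_000
q = torch.randn(num_samples, d_k)   # components with mean 0, variance 1
k = torch.randn(num_samples, d_k)

raw = (q * k).sum(dim=-1)           # one dot product per sample
scaled = raw / d_k ** 0.5           # same dot products divided by sqrt(d_k)

# The raw variance is close to d_k; the scaled variance is close to 1.
print(raw.var().item(), scaled.var().item())
```

Keeping the scores at roughly unit variance keeps the softmax away from near one-hot outputs, where gradients vanish.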

### **4.2 Code Implementation**


In [None]:
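import math

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """
    Q, K are of shape (batch_size, seq_len, d_k); V is (batch_size, seq_len, d_v).
    Returns the attention output (batch_size, seq_len, d_v) and the
    attention weights (batch_size, seq_len, seq_len).
    """
    d_k = Q.size(-1)  # query/key dimension
    # Similarity of every query with every key, scaled by sqrt(d_k).
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    # Normalize over the key axis so each query's weights sum to 1.
    attn_weights = F.softmax(scores, dim=-1)
    # Weighted sum of the values.
    output = torch.matmul(attn_weights, V)
    return output, attn_weights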

# Example usage:
batch_size, seq_len, d_k = 2, 5, 16
Q = torch.rand(batch_size, seq_len, d_k)
K = torch.rand(batch_size, seq_len, d_k)
V = torch.rand(batch_size, seq_len, d_k)

attn_output, attn_weights = scaled_dot_product_attention(Q, K, V)
print("Attention Output Shape:", attn_output.shape)
print("Attention Weights Shape:", attn_weights.shape)

## **5. Multi-Head Attention**
Instead of computing a single attention function, the Transformer uses **multiple attention heads** to capture different aspects of the relationships between tokens.

- We have \( h \) heads, each with parameters \( W_Q^{(i)}, W_K^{(i)}, W_V^{(i)} \) (linear transformations for Q, K, V for head \( i \)).
- Each head produces an output, which we then concatenate and project again.

### **5.1 Formula**
\[
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h) W^O,
\]
where
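- \( \text{head}_i = \text{Attention}(Q W_Q^{(i)}, K W_K^{(i)}, V W_V^{(i)}) \), with \( W_Q^{(i)}, W_K^{(i)} \in \mathbb{R}^{d_{model} \times d_k} \) and \( W_V^{(i)} \in \mathbb{R}^{d_{model} \times d_v} \).
- \( W^O \in \mathbb{R}^{h d_v \times d_{model}} \) is a final linear transformation. The paper uses \( h = 8 \) and \( d_k = d_v = d_{model}/h = 64 \), so the total cost is similar to single-head attention with full dimensionality.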
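
The formula can be sketched as a small PyTorch module. This is one possible layout, not the paper's reference code: it assumes \( d_k = d_v = d_{model}/h \), packs the \( h \) per-head projection matrices into a single linear layer per role, and omits biases, dropout, and masking. The class name `MultiHeadAttention` is illustrative.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.h = num_heads
        self.d_k = d_model // num_heads
        # Each layer stacks the h per-head matrices side by side.
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)   # W^O

    def _split_heads(self, x):
        # (batch, seq_len, d_model) -> (batch, h, seq_len, d_k)
        b, n, _ = x.shape
        return x.view(b, n, self.h, self.d_k).transpose(1, 2)

    def forward(self, q, k, v):
        q = self._split_heads(self.w_q(q))
        k = self._split_heads(self.w_k(k))
        v = self._split_heads(self.w_v(v))
        # Scaled dot-product attention, computed for all heads at once.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        weights = F.softmax(scores, dim=-1)
        heads = weights @ v                                   # (batch, h, seq_len, d_k)
        # Concat(head_1, ..., head_h), then project with W^O.
        b, _, n, _ = heads.shape
        concat = heads.transpose(1, 2).reshape(b, n, self.h * self.d_k)
        return self.w_o(concat), weights

# Self-attention usage: queries, keys, and values all come from x.
torch.manual_seed(0)
mha = MultiHeadAttention(d_model=32, num_heads=4)
x = torch.rand(2, 5, 32)
out, w = mha(x, x, x)
print(out.shape, w.shape)
```

The output keeps the input shape, while the weights carry one `seq_len × seq_len` attention map per head.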


## **6. Position-wise Feed-Forward Network (FFN)**
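After the attention sublayer, each position in the sequence passes through a fully connected feed-forward network (applied independently and identically to each position). The paper uses two linear transformations with a ReLU activation in between:
\[
\text{FFN}(x) = \max(0, x W_1 + b_1) W_2 + b_2.
\]
The weights are shared across positions within a layer but differ from layer to layer. In the paper's base model, the input and output have dimension \( d_{model} = 512 \) and the inner layer has dimension \( d_{ff} = 2048 \). This sublayer supplies the non-linearity that transforms each position's representation after attention has mixed information across positions.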
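
A minimal sketch of the feed-forward sublayer, with a check that it really acts on each position independently. The class name `PositionwiseFFN` and the small dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)   # W_1, b_1
        self.linear2 = nn.Linear(d_ff, d_model)   # W_2, b_2

    def forward(self, x):
        # max(0, x W_1 + b_1) W_2 + b_2, applied along the last dimension.
        return self.linear2(torch.relu(self.linear1(x)))

torch.manual_seed(0)
ffn = PositionwiseFFN(d_model=16, d_ff=64)
x = torch.rand(2, 5, 16)             # (batch_size, seq_len, d_model)
y = ffn(x)

# Feeding a single position on its own gives the same result for that position.
y_single = ffn(x[:, 2:3, :])
print(y.shape, torch.allclose(y[:, 2:3, :], y_single, atol=1e-6))
```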


## **7. Putting It All Together**
A single **Transformer Encoder** layer consists of:
1. Multi-head self-attention sublayer (with residual connection & layer normalization).
2. Position-wise feed-forward network sublayer (with residual connection & layer normalization).
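
The two sublayers above can be combined into a single encoder layer. The sketch below uses PyTorch's built-in `nn.MultiheadAttention` for the attention sublayer and applies the paper's arrangement \( \text{LayerNorm}(x + \text{Sublayer}(x)) \). Dropout and padding masks are omitted for brevity, and the class name `EncoderLayer` and the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Sublayer 1: multi-head self-attention, residual connection, layer norm.
        attn_out, _ = self.self_attn(x, x, x, need_weights=False)
        x = self.norm1(x + attn_out)
        # Sublayer 2: position-wise feed-forward, residual connection, layer norm.
        x = self.norm2(x + self.ffn(x))
        return x

torch.manual_seed(0)
layer = EncoderLayer(d_model=32, num_heads=4, d_ff=128)
x = torch.rand(2, 5, 32)             # (batch_size, seq_len, d_model)
y = layer(x)
print(y.shape)
```

Because the layer maps `(batch_size, seq_len, d_model)` to the same shape, identical layers can be stacked; the paper stacks six of them in the encoder.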

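The **Transformer Decoder** has a similar structure, with two differences: its self-attention sublayer is masked so that a position cannot attend to later positions, and an additional cross-attention sublayer attends to the encoder output (queries come from the decoder, keys and values from the encoder).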
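
In the paper, the decoder's self-attention also hides future positions so that the prediction at one position cannot depend on later tokens. A minimal sketch of that causal masking, assuming single-head attention and an illustrative function name `causal_attention`:

```python
import math
import torch
import torch.nn.functional as F

def causal_attention(Q, K, V):
    """Scaled dot-product self-attention where position i ignores positions j > i."""
    n = Q.size(-2)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
    # True above the diagonal marks "future" keys that must be hidden.
    future = torch.triu(
        torch.ones(n, n, dtype=torch.bool, device=Q.device), diagonal=1
    )
    # -inf scores become exactly zero weight after the softmax.
    scores = scores.masked_fill(future, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ V, weights

torch.manual_seed(0)
x = torch.rand(1, 4, 8)              # (batch_size, seq_len, d_k)
out, w = causal_attention(x, x, x)
print(w[0])                          # lower-triangular attention weights
```

The first position can only attend to itself, so its output is simply its own value vector.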

Overall, the key takeaway from *Attention Is All You Need* is that attention mechanisms alone, without recurrences or convolutions, can achieve state-of-the-art performance in sequence modeling. The paper demonstrated this on the WMT 2014 English-to-German and English-to-French machine translation tasks.

### **7.1 Advantages**
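- **Parallelization**: Unlike RNNs, which must process tokens one after another, self-attention processes all tokens in a sequence at once.
- **Long-range dependencies**: Self-attention connects any two positions in a single step, whereas information in an RNN must pass through every intermediate position, so signals between distant tokens degrade less.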
- **Modular design**: Easy to scale by increasing model depth, width, or number of heads.


## **References**
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). [Attention Is All You Need](https://arxiv.org/abs/1706.03762). *Advances in Neural Information Processing Systems*, 30.

