# Module 4 - Transformers

## Contents
- Part 1:
    - 1.1 From Recurrence (RNN) to Attention-based NLP Models
    - 1.2 Transformer Models, Results and Trade-Offs

## Misc and Keywords
- **Attention**: A mechanism that lets models weigh the importance of different parts of the input when making predictions.  
- **Self-Attention**: A form of attention where queries, keys, and values all come from the same input sequence. Enables tokens to attend to each other.
- **Positional Encoding**: Extra information added to token embeddings to provide a sense of order, since self-attention has no inherent sequence awareness.
- **Multi-Head Attention**: Runs several self-attention operations in parallel on different representation subspaces and concatenates the outputs.
- **Causal Masking**: A technique to prevent tokens from attending to future positions in a sequence, used in language generation tasks.
- **Residual Connections (Add & Norm)**: A pattern where the input to a layer is added to its output, followed by layer normalisation. Helps with gradient flow and training stability.
- **Feed-Forward Networks**: Position-wise fully connected layers applied after self-attention, usually with non-linear activations like ReLU or GELU.
- **Encoder**: A component of the transformer that processes the entire input sequence at once, typically using unmasked attention.
- **Decoder**: A transformer component that generates output sequences one token at a time, using masked attention to prevent future peeking.
- **Transformer**: An architecture built entirely on attention mechanisms, forgoing recurrence and convolution, enabling efficient parallel training.
- **Autoregressive Generation**: A decoding method where the model predicts one token at a time, feeding each prediction back into the model.
- **Parallelisation**: The ability to compute multiple operations simultaneously — a key advantage of transformers over RNNs.

### Notes:
- Self-attention enables all-pair interactions, unlike RNNs, which are limited by sequential processing.
- Transformers achieve state-of-the-art results in many NLP tasks, but at the cost of high computational demand.
- Encoding position is essential in transformers since token order is not inherently preserved.

# Part 1: Transformers

## 1.1 From Recurrence (RNN) to Attention-based NLP Models

### Issues with Recurrent Models (RNNs)

- **Linear Interaction Distance**
    - RNNs process sequences sequentially (unrolled left to right).
    - This enforces **linear locality**: nearby words influence each other more than distant ones.
    - Makes learning **long-range dependencies** difficult due to vanishing gradients and time-step bottlenecks.

- **Lack of Parallelisability**
    - RNNs require sequential computation:
        - Each hidden state depends on the previous one, so forward and backward passes are inherently **non-parallelisable** across time steps.
    - This limits training speed and scalability.
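
The sequential bottleneck can be seen in a minimal sketch of a plain tanh RNN. The function name `rnn_forward`, the weight names and the sizes are illustrative assumptions, not taken from the module; the point is only that the loop body needs the hidden state from the previous step.

```python
import numpy as np

def rnn_forward(x, W_xh, W_hh, h0):
    """Run a plain tanh RNN over x of shape (seq_len, d_in)."""
    h = h0
    states = []
    for x_t in x:
        # Step t needs h from step t-1, so the steps cannot run in parallel.
        h = np.tanh(x_t @ W_xh + h @ W_hh)
        states.append(h)
    return np.stack(states)  # (seq_len, d_h)

rng = np.random.default_rng(0)
seq_len, d_in, d_h = 5, 3, 4  # illustrative sizes
x = rng.normal(size=(seq_len, d_in))
W_xh = rng.normal(size=(d_in, d_h))
W_hh = rng.normal(size=(d_h, d_h))
states = rnn_forward(x, W_xh, W_hh, np.zeros(d_h))
print(states.shape)  # (5, 4)
```

Information from the first token reaches the last one only after passing through every intermediate state, which is the linear interaction distance described above.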

### Enter Attention

- **Attention Mechanism**
    - Each word's representation acts as a **query** that interacts with a set of **keys and values** (often from other words).
    - This allows direct modelling of **relationships between all tokens**, regardless of distance.
    - It solves:
        - **Long-distance dependency modelling** (by allowing all-pair interactions).
        - **Parallel computation** (since all token interactions can be computed simultaneously).

### Self-Attention

- In **Self-Attention**, the **queries**, **keys**, and **values** all come from the **same sequence**.
- The mechanism computes:

  $$
  \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
  $$

  Where:
    - $Q$, $K$, $V$ are the query, key, and value matrices.
    - $d_k$ is the dimensionality of the keys (used for scaling).
    - Each token attends to every other token, capturing context-rich representations.
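
The formula can be sketched in NumPy as follows. This is a minimal single-head version under stated assumptions: the projection matrices `W_q`, `W_k`, `W_v` and all sizes are illustrative and randomly initialised, and there is no batching or masking.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the max along the axis for numerical stability.
    scores = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n, n): one score per token pair
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8               # illustrative sizes
X = rng.normal(size=(n, d_model))       # one row per token
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

# Self-attention: queries, keys and values are all projections of the same X.
out, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape, weights.shape)  # (4, 8) (4, 4)
```

Row `i` of `weights` shows how much token `i` attends to each token in the sequence, and every row is computed in the same matrix product rather than one step at a time.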

### Barriers to Using Self-Attention

These barriers must be addressed in order to construct a minimal self-attention building block.

#### No Implicit Sequence Order
- Unlike RNNs, self-attention does not inherently encode the order of tokens in a sequence.
- To address this, **positional encodings** are added to token embeddings.

**Types of Positional Encodings:**
- **Sinusoidal**: Use fixed sine and cosine functions with varying frequencies to encode position.
- **Learned**: Introduce trainable position embeddings that are optimised during training.
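
As an illustration of the sinusoidal variant, the sketch below follows the formulation from the original Transformer paper, where even dimensions use a sine and odd dimensions a cosine with wavelengths controlled by the constant 10000. The function name and the sizes are assumptions chosen for the example.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of fixed position encodings."""
    assert d_model % 2 == 0, "this sketch assumes an even model dimension"
    positions = np.arange(seq_len)[:, None]   # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]  # (1, d_model / 2): the values 2i
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=6, d_model=8)
print(pe.shape)  # (6, 8)
```

The result is added element-wise to the token embeddings. A learned encoding would replace `pe` with a trainable matrix of the same shape.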

#### No Nonlinearities for Deep Learning 'Magic'
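- Self-attention has no element-wise nonlinearity: its output is just a weighted average of the value vectors, so stacking self-attention layers only re-averages them.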
- To introduce nonlinearity and model more complex functions, a **position-wise feedforward neural network** is applied to the output of the self-attention layer.
- The same feedforward network is applied independently to each position.
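
A minimal sketch of such a position-wise network, assuming two linear layers with a ReLU in between. The hidden width `d_ff = 4 * d_model` is a common choice rather than something stated in these notes, and the weights are random placeholders.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Apply the same two-layer MLP to every row (position) of x."""
    hidden = np.maximum(0.0, x @ W1 + b1)  # ReLU nonlinearity
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
n, d_model = 4, 8
d_ff = 4 * d_model  # assumed hidden width
x = rng.normal(size=(n, d_model))  # stands in for the self-attention output
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

out = feed_forward(x, W1, b1, W2, b2)
# Processing one position on its own gives the same result as processing it
# alongside the others: positions do not interact inside this layer.
row_out = feed_forward(x[2:3], W1, b1, W2, b2)
print(np.allclose(out[2], row_out[0]))  # True
```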

#### Need to Prevent Looking into the Future
- In tasks like language modelling, it is important to **prevent tokens from attending to future tokens**.
- This is achieved using **causal masking**, which ensures that each position can only attend to earlier positions (or itself). Scores for future positions are set to $-\infty$ before the softmax, so their attention weights become zero.
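
A small sketch of causal masking, assuming the usual approach of overwriting future scores with negative infinity before the softmax. The helper names `causal_mask` and `masked_softmax` are illustrative, and the scores are all zero so the effect of the mask is easy to read.

```python
import numpy as np

def causal_mask(n):
    """Boolean (n, n) matrix, True where column j lies in the future of row i."""
    return np.triu(np.ones((n, n), dtype=bool), k=1)

def masked_softmax(scores, mask):
    scores = np.where(mask, -np.inf, scores)  # future positions get -inf
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(scores)                      # exp(-inf) is exactly 0
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # equal raw scores for every token pair
weights = masked_softmax(scores, causal_mask(4))
print(weights)
# [[1.         0.         0.         0.        ]
#  [0.5        0.5        0.         0.        ]
#  [0.33333333 0.33333333 0.33333333 0.        ]
#  [0.25       0.25       0.25       0.25      ]]
```

Each row spreads its weight over the current and earlier positions only, and everything above the diagonal is zero.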

### The Self-Attention Building Block

The self-attention mechanism is used as a core component in transformer architectures. Below is the standard processing flow for generating output probabilities from input tokens:

**Processing Steps:**

1. **Inputs** – Raw token sequence (e.g., words or subwords).
2. **Embeddings** – Convert each token into a dense vector representation using an embedding layer.
3. **Add Positional Embeddings** – Inject information about token order using either sinusoidal or learned positional encodings.
4. **Masked Self-Attention** – Compute self-attention while preventing each token from attending to future tokens (used in autoregressive tasks).
5. **Feed-Forward Network** – Apply a position-wise feedforward neural network with nonlinearities to each token's representation.
6. **Linear Layer** – Project the outputs into vocabulary-size logits for prediction.
7. **Softmax** – Convert logits into probabilities.
8. **Output Probabilities** – Final distribution over the next possible tokens.

This block forms the basis of each decoder layer in the transformer architecture.
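
The eight steps above can be traced end to end in a small NumPy sketch. Everything here is an assumption made for illustration: the vocabulary size, the dimensions, the parameter names and the random, untrained weights. It uses a single attention head and omits residual connections and layer normalisation, so it only shows how the shapes flow.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
vocab_size, max_len, d_model, d_ff = 10, 16, 8, 32  # illustrative sizes

# Randomly initialised, untrained parameters (placeholders only).
E = rng.normal(size=(vocab_size, d_model))  # token embedding table
P = rng.normal(size=(max_len, d_model))     # learned-style positional embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
W1 = rng.normal(size=(d_model, d_ff))
W2 = rng.normal(size=(d_ff, d_model))
W_out = rng.normal(size=(d_model, vocab_size))

def next_token_probs(token_ids):
    n = len(token_ids)                                # step 1: inputs
    x = E[token_ids] + P[:n]                          # steps 2-3: embeddings + positions
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    x = softmax(scores) @ v                           # step 4: masked self-attention
    x = np.maximum(0.0, x @ W1) @ W2                  # step 5: feed-forward network
    logits = x @ W_out                                # step 6: linear layer
    return softmax(logits)                            # steps 7-8: probabilities

probs = next_token_probs([3, 1, 4, 1])
print(probs.shape)  # (4, 10)
```

Row `i` of `probs` is the predicted distribution for the token following position `i`. Because of the causal mask, that row depends only on tokens up to position `i`.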

## 1.2 Transformer Models, Results and Trade-Offs

### Transformer Decoder

The decoder is the core of many language models, such as GPT. It is designed to generate text one token at a time, conditioning on previously generated tokens.

**Processing Steps:**

1. **Embeddings**
    - Converts input tokens into dense vector representations (word embeddings).

2. **Add Positional Embeddings**
    - Injects information about the position of each token, since self-attention alone does not encode sequence order.
    - These positional embeddings are added element-wise to the token embeddings.

3. **Masked Multi-Head Attention**
    - Allows each token to attend to previous tokens in the sequence (and itself), but not future tokens.
    - Multi-head attention allows the model to jointly attend to information from different representation subspaces.

4. **Add & Norm**
    - Residual connection: the input to the attention layer is added to its output.
    - Layer normalisation is applied to stabilise training.

5. **Feed-Forward Network**
    - A fully connected network applied independently to each position.
    - Usually consists of two linear layers with a ReLU or GELU nonlinearity in between.

6. **Add & Norm**
    - Another residual connection: the input to the feed-forward network is added to its output.
    - Followed by layer normalisation.

The final output can then be passed through a linear projection and softmax to obtain token probabilities.
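
Steps 3 to 6 of the decoder can be sketched as one function. This is a simplified illustration under several assumptions: weights are random and untrained, layer normalisation has no learned scale or shift, the output projection `W_o` applied after concatenating the heads is the usual convention rather than something spelled out above, and dropout is omitted.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalise each position's feature vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def masked_multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads):
    n, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):  # (n, d_model) -> (n_heads, n, d_head)
        return t.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ W_q), split(x @ W_k), split(x @ W_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, n, n)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)           # mask future positions
    heads = softmax(scores) @ v                          # (n_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o

def decoder_block(x, p, n_heads):
    attn = masked_multi_head_attention(
        x, p["W_q"], p["W_k"], p["W_v"], p["W_o"], n_heads)
    x = layer_norm(x + attn)                             # Add & Norm
    ff = np.maximum(0.0, x @ p["W1"]) @ p["W2"]          # feed-forward (ReLU)
    return layer_norm(x + ff)                            # Add & Norm

rng = np.random.default_rng(0)
n, d_model, d_ff, n_heads = 5, 8, 32, 2                  # illustrative sizes
p = {name: rng.normal(size=(d_model, d_model))
     for name in ("W_q", "W_k", "W_v", "W_o")}
p["W1"] = rng.normal(size=(d_model, d_ff))
p["W2"] = rng.normal(size=(d_ff, d_model))

x = rng.normal(size=(n, d_model))  # stands in for embeddings + positions
out = decoder_block(x, p, n_heads)
print(out.shape)  # (5, 8)
```

Because the output has the same shape as the input, blocks like this can be stacked. Dropping the `future` mask would give the unmasked attention used in an encoder block.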

---

### Transformer Encoder

The encoder is typically used in models for understanding tasks (e.g. BERT), where the full input sequence is available at once.

**Key Differences:**
- Uses **unmasked multi-head attention** — each token can attend to all others in the sequence.
- The rest of the structure mirrors the decoder, but without the masking step.

Encoder and decoder can be combined (as in the original Transformer architecture) for tasks like translation.

---

### Results and Trade-Offs

**Advantages:**
- Highly parallelisable, enabling fast training.
- Excellent performance on a wide range of NLP tasks.
- Long-range dependencies can be modelled directly.

**Trade-Offs:**
- Self-attention has quadratic complexity with respect to sequence length ($O(n^2)$), limiting scalability for very long texts.
- Requires large amounts of data and compute to train effectively.
- Positional encoding is a workaround rather than an inherent solution for sequential structure.
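
To make the quadratic cost concrete, the sketch below estimates the memory needed to store one $n \times n$ attention-weight matrix. It assumes 4-byte (float32) entries and counts a single head in a single layer; the function name is illustrative.

```python
def attention_matrix_mib(n, bytes_per_entry=4):
    """Memory in MiB for one n x n attention-weight matrix."""
    return n * n * bytes_per_entry / 2**20

for n in (512, 2048, 8192):
    print(n, attention_matrix_mib(n))
# 512 1.0
# 2048 16.0
# 8192 256.0
```

Multiplying the sequence length by 4 multiplies the memory by 16, and the total grows further with the number of heads and layers.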
