## **Attention was all we need, apparently...**
**made by:**
* Anas Aldadi

<img src="https://drive.google.com/uc?id=1hQHQ81Zw8CFK0wczsuxn053gpxUxvlod" width="800"/>

This notebook explains and implements the most influential attention mechanism variants, so it can serve as a reference whenever you forget how one of them works!


---

It is important to note this notebook assumes you know the following.

Prerequisites:

* Basics of deep learning (from FFNs-CNNs-RNNs)

* Transformers and their architecture variants (Encoder-Decoder, Encoder-Only, Decoder-Only)

* Transformer pre-training approaches (Masked LMs, Autoregressive, Conditional Transformers)

Why? Because this notebook explains the variants of the attention mechanism, not the transformer architecture, and without knowing transformers you will struggle to find value in it. Another goal of this notebook is to be a comprehensive reference for people who want to refresh their memory about the different kinds of attention mechanisms.

---


The notebook will start with:
* Introduction: Why Attention? (analogy and the problem it solves)
* The Core of Attention: (Q, K, V explained intuitively and mathematically)

Attention variants that will be explained here:

* [Bahdanau Attention (Additive Attention) (2014)](https://arxiv.org/abs/1409.0473) Neural Machine Translation by Jointly Learning to Align and Translate

* [Luong Attention (Multiplicative Attention) (2015)](https://arxiv.org/abs/1508.04025) Effective Approaches to Attention-based Neural Machine Translation
 (aka soft attention)

* [Hard Attention vs Soft Attention (2015)](https://arxiv.org/abs/1502.03044) Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

* [Self Attention (2017)](https://arxiv.org/abs/1706.03762) Attention Is All You Need duh.

1. Multi Head Attention (MHA)

2. Scaled Dot-Product Attention (SDPA)

3. Cross Attention (will be explained more in-depth in another notebook of multimodality)

4. Causal/Masked Attention

* [Sparse Attention / Sparse Transformers (2019)](https://arxiv.org/abs/1904.10509) Generating Long Sequences with Sparse Transformers

* [Linformer (2020)](https://arxiv.org/abs/2006.04768) Linformer: Self-Attention with Linear Complexity

* [Performer (2020)](https://arxiv.org/abs/2009.14794) Rethinking Attention with Performers

* [Longformer (2020)](https://arxiv.org/abs/2004.05150) Longformer: The Long-Document Transformer

* [BigBird (2020)](https://arxiv.org/abs/2007.14062) Big Bird: Transformers for Longer Sequences

* [Multi-head Latent Attention (MLA) (2024)](https://arxiv.org/abs/2405.04434) DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

---


Influential variants that are not explained or implemented in this notebook:

* [FlashAttention (2022)](https://arxiv.org/abs/2205.14135) FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

    **Optimizations for MHA (for LLM Inference):**

* [Multi-Query Attention (MQA) (2019)](https://arxiv.org/abs/1911.02150) Fast Transformer Decoding: One Write-Head is All You Need

* [Grouped-Query Attention (GQA) (2023)](https://arxiv.org/abs/2305.13245) GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

## Introduction

Below is the abstract of the first attention paper. It beautifully states the problem and neatly presents the contribution.

<img src="https://drive.google.com/uc?id=1xNChrMndBghWdRdDlAvcZSrVhp6760mq" width="800"/>

[Bahdanau Attention (Additive Attention) (2014)](https://arxiv.org/abs/1409.0473)

The paper, briefly explained:

They used a bidirectional RNN encoder with gated hidden units (GRU-style, similar to LSTM units), which was among the SOTA choices at the time, and applied the new mechanism they proposed:

<img src="https://drive.google.com/uc?id=1iVvmxozIaDHoUNvNBy8kn1pYKqTciJLX" width="800"/>

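The goal was to allow the decoder to look back at all of the encoder's hidden states when generating each word in the target sequence, instead of relying on a single fixed-length vector.

It showed really promising results! In the paper's experiments, translation quality stayed far more stable as sentences got longer, instead of dropping off like the fixed-vector encoder-decoder baseline.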

<img src="https://drive.google.com/uc?id=1GoUbppumo3dFdNIEaq6HxxYjZoFGOTLO" width="800"/>

---

Now I'll explain the attention mechanism in this paper using its own notation, then map it to the modern QKV notation of the mechanism!
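As a compact reference, the alignment model from the paper can be written as below. Here $s_{i-1}$ is the previous decoder hidden state, $h_j$ is the encoder annotation for source position $j$, $T_x$ is the source length, and $W_a$, $U_a$, $v_a$ are learned weights:

$$
\begin{aligned}
e_{ij} &= v_a^{\top} \tanh\left(W_a s_{i-1} + U_a h_j\right) \\
\alpha_{ij} &= \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})} \\
c_i &= \sum_{j=1}^{T_x} \alpha_{ij} h_j
\end{aligned}
$$

Read in QKV terms, this maps roughly as follows: the decoder state $s_{i-1}$ plays the role of the query, the encoder annotations $h_j$ act as both keys and values, $\alpha_{ij}$ are the attention weights, and the context vector $c_i$ is the attention output.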

## Attention variants

### Soft Attention

* General concept, differentiability.
* Briefly contrast Additive (Bahdanau) vs. Multiplicative (Luong).
* Deep Dive: Scaled Dot-Product Attention (Crucial for Transformers)
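To make the additive vs. multiplicative contrast concrete, here is a minimal sketch of the two scoring functions. The class and function names (`AdditiveScore`, `dot_score`), the tensor shapes, and the hidden size of 16 are assumptions chosen for illustration, not code from either paper. Only the simplest "dot" variant of Luong's scores is shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdditiveScore(nn.Module):
    """Bahdanau-style score: v^T tanh(W_q q + W_k k). Names are illustrative."""

    def __init__(self, query_dim, key_dim, hidden_dim):
        super().__init__()
        self.W_q = nn.Linear(query_dim, hidden_dim, bias=False)
        self.W_k = nn.Linear(key_dim, hidden_dim, bias=False)
        self.v = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, query, keys):
        # query: (batch, query_dim), keys: (batch, src_len, key_dim)
        hidden = torch.tanh(self.W_q(query).unsqueeze(1) + self.W_k(keys))
        return self.v(hidden).squeeze(-1)  # (batch, src_len)


def dot_score(query, keys):
    """Luong-style "dot" score: a plain inner product (query and keys share a dim)."""
    # (batch, src_len, dim) @ (batch, dim, 1) -> (batch, src_len)
    return torch.bmm(keys, query.unsqueeze(-1)).squeeze(-1)


torch.manual_seed(0)
batch, src_len, dim = 2, 5, 8
query = torch.randn(batch, dim)          # e.g. a decoder state
keys = torch.randn(batch, src_len, dim)  # e.g. encoder hidden states

# Both scores are turned into attention weights with the same softmax
additive_weights = F.softmax(AdditiveScore(dim, dim, 16)(query, keys), dim=-1)
dot_weights = F.softmax(dot_score(query, keys), dim=-1)

# Either way, the context vector is a weighted sum of the encoder states
context = torch.bmm(dot_weights.unsqueeze(1), keys).squeeze(1)  # (batch, dim)
```

The additive score needs an extra small network per query-key pair, while the dot score is a single matrix multiplication, which is one reason the multiplicative form scales better on modern hardware.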

### Hard Attention vs Soft Attention

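Soft Attention: Differentiable. The output is a weighted average over all positions, so the attention weights are learned with ordinary backpropagation (e.g., Bahdanau).

Hard Attention: Non-differentiable. It samples a single position to attend to, so it is trained with gradient estimators such as REINFORCE. It is less common in practice because those gradient estimates have high variance, which makes training unstable.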
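The difference is easiest to see side by side. The sketch below uses made-up scores and feature vectors (the shapes and variable names are assumptions for illustration) and only shows the forward pass; it does not implement REINFORCE training.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
scores = torch.randn(1, 6)     # unnormalized scores over 6 locations
values = torch.randn(1, 6, 4)  # one 4-d feature vector per location
weights = F.softmax(scores, dim=-1)

# Soft attention: weighted average over all locations (differentiable)
soft_context = (weights.unsqueeze(-1) * values).sum(dim=1)  # (1, 4)

# Hard attention: sample one location and use only its feature vector.
# The sampling step has no gradient, hence the need for an estimator.
index = torch.multinomial(weights, num_samples=1)  # (1, 1)
hard_context = values[torch.arange(values.size(0)), index.squeeze(-1)]  # (1, 4)
```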

### Self Attention

In [None]:
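import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention(nn.Module):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""

    def __init__(self):
        super().__init__()

    def forward(self, Q, K, V):
        # Q, K: (..., seq_len, d_k); V: (..., seq_len, d_v)
        d_k = Q.size(-1)
        # Matrix product (not elementwise *) gives one score per query-key pair
        scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
        # Normalize over the key dimension so each query's weights sum to 1
        weights = F.softmax(scores, dim=-1)
        return weights @ V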

#### Multihead Attention
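Multi-head attention runs several attention operations in parallel on lower-dimensional projections of the input, then concatenates the heads and mixes them with an output projection. Below is a minimal sketch of the self-attention case. The class name `MultiHeadSelfAttention`, the fused QKV projection, and the sizes used are illustrative assumptions, not a reference implementation.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)  # fused Q, K, V projection
        self.out = nn.Linear(d_model, d_model)

    def _split_heads(self, t):
        # (batch, seq_len, d_model) -> (batch, heads, seq_len, d_head)
        batch, seq_len, _ = t.shape
        return t.reshape(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)

    def forward(self, x):
        batch, seq_len, d_model = x.shape
        q, k, v = (self._split_heads(t) for t in self.qkv(x).chunk(3, dim=-1))
        # Each head computes its own scaled dot-product attention
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        weights = F.softmax(scores, dim=-1)
        heads = weights @ v  # (batch, heads, seq_len, d_head)
        # Concatenate the heads, then mix them with the output projection
        merged = heads.transpose(1, 2).reshape(batch, seq_len, d_model)
        return self.out(merged)


torch.manual_seed(0)
mha = MultiHeadSelfAttention(d_model=16, num_heads=4)
out = mha(torch.randn(2, 5, 16))  # same shape as the input: (2, 5, 16)
```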

#### Scaled Dot-Product Attention
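For reference, the formula from "Attention Is All You Need", where $Q$, $K$, $V$ are the query, key, and value matrices and $d_k$ is the key dimension:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
$$

The $\sqrt{d_k}$ scaling is there because dot products tend to grow in magnitude with $d_k$, which pushes the softmax into regions with very small gradients.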

#### Cross Attention

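Key Idea: The queries come from one sequence while the keys and values come from another. Used when attending from one source to another (e.g., decoder-to-encoder in encoder-decoder models, or across modalities such as text-to-image).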

Used In: Transformers, Diffusion Models (UNet + Cross Attention), Vision-Language models.
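A minimal single-head sketch of the idea, assuming text tokens attend over image patches. The class name `CrossAttention`, the dimension names, and the example shapes are hypothetical choices for illustration.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossAttention(nn.Module):
    def __init__(self, query_dim, context_dim, d_attn):
        super().__init__()
        # Queries are projected from x; keys and values from the other source
        self.W_q = nn.Linear(query_dim, d_attn, bias=False)
        self.W_k = nn.Linear(context_dim, d_attn, bias=False)
        self.W_v = nn.Linear(context_dim, d_attn, bias=False)

    def forward(self, x, context):
        # x: (batch, tgt_len, query_dim), context: (batch, src_len, context_dim)
        q = self.W_q(x)
        k, v = self.W_k(context), self.W_v(context)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        weights = F.softmax(scores, dim=-1)  # (batch, tgt_len, src_len)
        return weights @ v, weights


torch.manual_seed(0)
text_tokens = torch.randn(2, 7, 32)     # 7 text tokens with 32-d features
image_patches = torch.randn(2, 49, 64)  # 49 image patches with 64-d features
attended, cross_weights = CrossAttention(32, 64, 16)(text_tokens, image_patches)
```

Note that the two inputs can have different lengths and feature sizes; only the projected attention dimension has to match.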



#### Causal/Masked Attention

Where It’s Used
* GPT family (OpenAI)

* Decoder-only Transformers

* Autoregressive tasks like language modeling, image generation (e.g., DALL·E), code generation, etc.
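What these models share is a mask that stops each position from attending to later positions, so a token is predicted only from what came before it. A minimal sketch, assuming a hypothetical helper `causal_attention` and small made-up shapes:

```python
import math

import torch
import torch.nn.functional as F


def causal_attention(q, k, v):
    seq_len = q.size(-2)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    # True above the diagonal marks "future" positions
    future = torch.triu(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device), diagonal=1
    )
    # -inf scores become exactly 0 after the softmax
    scores = scores.masked_fill(future, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ v, weights


torch.manual_seed(0)
x = torch.randn(2, 4, 8)  # (batch, seq_len, d_model)
causal_out, causal_weights = causal_attention(x, x, x)
```

The first token can only attend to itself, so its attention weight on position 0 is 1, and every entry above the diagonal of `causal_weights` is 0.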

### Sparse Attention

### Linformer

### Performer

### Longformer

### BigBird

### Multi-head Latent Attention (MLA)