# Paper Summary — “A General Survey on Attention Mechanisms in Deep Learning”  
**Authors:** Gianni Brauwers & Flavius Frasincar (Erasmus University Rotterdam)

**Source:** https://arxiv.org/abs/2203.14263

---

### **Abstract**

This paper presents a comprehensive cross-domain survey of attention mechanisms in deep learning. It introduces a unified framework and notation for understanding how attention operates, offers a detailed taxonomy that classifies mechanisms based on feature-, query-, and general-related properties, and reviews methods for evaluating attention models. The paper synthesizes prior research across NLP, vision, and other domains to provide a structured overview and identify future research opportunities in model interpretability and architecture design.

---

### **Problems**

1. **Fragmented Literature:** Prior surveys focus narrowly on domain-specific uses (e.g., NLP or vision), lacking an integrative framework.  
2. **Taxonomic Gaps:** Existing classifications do not systematically distinguish mechanisms (e.g., co-attention vs. hierarchical attention).  
3. **Technical Incompleteness:** Earlier works often omit mathematical details or intuitive explanations.  
4. **Evaluation Inconsistency:** No common set of evaluation metrics or structural characterization methods for attention models.

---

### **Proposed Solutions**

1. **Unified Framework:**  
   Establishes a general attention model divided into feature, query, attention, and output submodules—expressed with consistent notation.

   $$
   \text{Attention}(Q, K, V) = \text{softmax}\left( f(Q, K) \right) V
   $$

   where \( Q, K, V \) denote the query, key, and value matrices, respectively, and \( f(Q, K) \) represents a scoring function (e.g., dot-product or additive). Scaled dot-product attention is the special case \( f(Q, K) = QK^\top / \sqrt{d_k} \).

2. **Comprehensive Taxonomy:**  
   Organizes attention mechanisms into three major categories:
   * **Feature-related:** multiplicity (co-/rotatory attention), levels (hierarchical/attention-via-attention), representations (multi-representational attention).  
   * **Query-related:** query types (self-attention, specialized queries), and multiplicity (multi-head, multi-hop, capsule-based).  
   * **General mechanisms:** scoring, alignment, and dimensionality (multi-dimensional vs. scalar attention).

3. **Cross-Domain Synthesis:**  
   Demonstrates how the same mathematical core generalizes across text, vision, audio, graphs, and time series.

4. **Evaluation Guidelines:**  
   Reviews performance measures and interpretability criteria for comparing attention structures.

---

### **Purpose**

To provide a domain-agnostic understanding of attention mechanisms by:
* Clarifying their mathematical underpinnings.  
* Establishing a structural taxonomy applicable to multiple data modalities.  
* Bridging conceptual and practical gaps in prior surveys to guide new model design and research directions.

---

### **Methodology**

* **Analytical Framework:** Builds upon Bahdanau et al. (2015) and Vaswani et al. (2017) to generalize attention into a modular system.  
* **Comparative Synthesis:** Reviews and categorizes mechanisms by functionality—e.g., coarse vs. fine-grained co-attention, global vs. local alignment.  
* **Unified Notation:** Uses consistent mathematical symbols (for queries, keys, values, and context vectors) to express all attention forms.  
* **Cross-Domain Mapping:** Demonstrates applicability in NLP (translation, sentiment), CV (captioning, VQA), speech, graphs, and recommendation systems.

---

### **Results**

* **General Model Established:**  
  Formalizes how attention scores and weights are computed through scoring, alignment, and context aggregation:
  $$
  e_{ij} = f(q_i, k_j), \quad
  \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})}, \quad
  c_i = \sum_j \alpha_{ij} v_j
  $$
* **Taxonomy Created:**  
  The paper’s Figure 3 presents a hierarchical classification of attention mechanisms along feature-related, general, and query-related lines.  
* **Comparative Analysis:**  
  Demonstrates how different score functions (additive, multiplicative, scaled, general) and alignment schemes (soft, hard, local, reinforced) affect interpretability and computational efficiency.  
* **Domain Extensions:**  
  Confirms that the general framework successfully accommodates complex variants such as multi-head, hierarchical, and co-attention used in Transformers and other architectures.

---

### **Conclusions**

The survey establishes attention as a universal mechanism for focusing computational resources and improving interpretability across deep learning domains. It unifies previously fragmented concepts under a single theoretical model, offering both practical and conceptual clarity. The authors emphasize that future work should:
* Develop standardized evaluation metrics for attention quality.  
* Explore hybrid and multi-modal attention integrations.  
* Investigate interpretability and fairness aspects of attention-based models.

**In essence**, the paper transforms attention from a model-specific innovation into a generalized, mathematically coherent paradigm spanning all areas of modern AI.


### **Mathematical and Statistical Content Summary**

*(Based on “A General Survey on Attention Mechanisms in Deep Learning,” Brauwers & Frasincar, 2022)*

---

### **1. General Attention Framework**

The paper formalizes **attention** as a mathematical mechanism that computes a *weighted average* of feature vectors, where the weights represent the model’s focus on different parts of the input.

#### **Core Notation**

| Symbol | Meaning |
|:-------|:---------|
| \( X \in \mathbb{R}^{d_x \times n_x} \) | Input matrix (e.g., words, pixels, etc.) |
| \( F = [f_1, \dots, f_{n_f}] \in \mathbb{R}^{d_f \times n_f} \) | Feature vectors extracted from input |
| \( q \in \mathbb{R}^{d_q} \) | Query vector – represents the current question or context |
| \( K = [k_1, \dots, k_{n_f}] \in \mathbb{R}^{d_k \times n_f} \) | Key vectors – represent the “addresses” of features |
| \( V = [v_1, \dots, v_{n_f}] \in \mathbb{R}^{d_v \times n_f} \) | Value vectors – represent the “content” to extract |
| \( a = [a_1, \dots, a_{n_f}] \) | Attention weights |
| \( c \in \mathbb{R}^{d_v} \) | Context vector – the final weighted output |

---

### **2. Linear Transformations of Features**

The feature matrix \( F \) is transformed into keys and values using learnable matrices:

$$
K = W_K F, \quad V = W_V F
$$

where \( W_K \in \mathbb{R}^{d_k \times d_f} \) and \( W_V \in \mathbb{R}^{d_v \times d_f} \) are trainable weight matrices.

**Purpose:** To map features into different subspaces suitable for computing attention scores and extracting context.
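The projection step can be illustrated with a minimal sketch in plain Python. The matrix values and toy sizes (\( d_f = 3 \), \( d_k = d_v = 2 \), two feature vectors) are made up for illustration, and `matvec` is a hypothetical helper, not something defined in the paper.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# Each inner list is one feature vector f_l (one column of F); values are illustrative.
features = [[1.0, 0.0, 2.0],
            [0.0, 1.0, 1.0]]

# Stand-ins for the trainable matrices W_K (d_k x d_f) and W_V (d_v x d_f).
W_K = [[1.0, 0.0, 0.0],
       [0.0, 1.0, 0.0]]
W_V = [[0.0, 0.0, 1.0],
       [1.0, 1.0, 0.0]]

keys = [matvec(W_K, f) for f in features]    # k_l = W_K f_l
values = [matvec(W_V, f) for f in features]  # v_l = W_V f_l

print(keys)    # [[1.0, 0.0], [0.0, 1.0]]
print(values)  # [[2.0, 1.0], [1.0, 1.0]]
```

Here the same features yield different keys and values because the two matrices select different components, which is the sense in which keys act as addresses and values as content.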

---

### **3. Attention Scoring Function**

The *score function* measures how relevant each key \( k_l \) is to the query \( q \):

$$
e_l = \text{score}(q, k_l)
$$

This scalar score \( e_l \) is then normalized into a probability-like weight \( a_l \).

#### **Common Score Functions**

| Type | Equation | Intuition |
|:------|:----------|:-----------|
| **Additive (Bahdanau)** | \( e_l = w^\top \text{act}(W_1 q + W_2 k_l + b) \) | Combines query and key via learned linear layers; nonlinear activation adds flexibility. |
| **Multiplicative (Luong)** | \( e_l = q^\top k_l \) | Measures similarity using a dot product; fast but less flexible. |
| **Scaled Dot Product (Vaswani)** | \( e_l = \frac{q^\top k_l}{\sqrt{d_k}} \) | Reduces large values in high dimensions to stabilize gradients. |
| **General / Biased / Activated** | \( e_l = k_l^\top W q + b \) or \( \text{act}(k_l^\top W q + b) \) | Adds learnable weights and bias; can include nonlinearity. |
| **Similarity-based** | \( e_l = \text{sim}(q, k_l) \) | Uses distance or cosine similarity. |
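The score functions in the table can be sketched as small Python functions. This is an illustrative sketch only: `tanh` is assumed as the activation in the additive score, and all parameter values below are placeholders rather than learned weights.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    return [dot(row, x) for row in W]

def multiplicative_score(q, k):
    return dot(q, k)                          # q^T k

def scaled_score(q, k):
    return dot(q, k) / math.sqrt(len(k))      # q^T k / sqrt(d_k)

def general_score(q, k, W, b=0.0):
    return dot(k, matvec(W, q)) + b           # k^T W q + b

def additive_score(q, k, w, W1, W2, b):
    # w^T act(W1 q + W2 k + b), with act = tanh as an assumption
    hidden = [math.tanh(a + c + bias)
              for a, c, bias in zip(matvec(W1, q), matvec(W2, k), b)]
    return dot(w, hidden)

q = [1.0, 2.0]
k = [3.0, 4.0]
identity = [[1.0, 0.0], [0.0, 1.0]]           # placeholder weight matrix

print(multiplicative_score(q, k))             # 11.0
print(scaled_score(q, k))
print(general_score(q, k, identity))          # 11.0
print(additive_score(q, k, [1.0, 1.0], identity, identity, [0.0, 0.0]))
```

With an identity matrix and zero bias the general score reduces to the plain dot product, which shows how the variants nest.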

---

### **4. Alignment Function (Normalization)**

Once scores \( e = [e_1, \dots, e_{n_f}] \) are computed, they are normalized into attention weights:

$$
a_l = \frac{\exp(e_l)}{\sum_{j=1}^{n_f} \exp(e_j)} \quad \text{(Softmax)}
$$

This ensures all weights are positive and sum to 1 — producing a *probability distribution of attention* across inputs.
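A minimal softmax sketch follows. Subtracting the maximum score before exponentiating is a common implementation trick (not something the summary attributes to the paper) that leaves the result unchanged while avoiding overflow.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)                            # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

weights = softmax([2.0, 1.0, 0.1])
print(weights)
print(sum(weights))

# Very large scores would overflow math.exp without the shift.
print(softmax([1000.0, 1000.0]))               # [0.5, 0.5]
```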

**Alternative alignments:**

* **Hard attention:** samples one vector based on probabilities (non-differentiable, trained via reinforcement learning).  
* **Local attention:** applies softmax within a limited window around a predicted position \( p \).  
* **Reinforced alignment:** selects subsets of vectors using a policy network.
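Two of these alternatives can be sketched in a few lines. This is a simplified illustration: `local_alignment` applies a plain softmax inside a window of assumed half-width around a given position `p` (how `p` is predicted is left out), and `hard_alignment` only shows the sampling step, not the reinforcement-learning training. Both function names are hypothetical.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def local_alignment(scores, p, half_width):
    """Softmax over positions p - half_width .. p + half_width, zero elsewhere."""
    lo = max(0, p - half_width)
    hi = min(len(scores), p + half_width + 1)
    window = softmax(scores[lo:hi])
    return [0.0] * lo + window + [0.0] * (len(scores) - hi)

def hard_alignment(scores, rng):
    """Sample one position from the softmax distribution; returns a one-hot vector."""
    probs = softmax(scores)
    index = rng.choices(range(len(scores)), weights=probs, k=1)[0]
    return [1.0 if i == index else 0.0 for i in range(len(scores))]

scores = [0.5, 2.0, 1.0, 0.1, 3.0]
local = local_alignment(scores, p=1, half_width=1)
hard = hard_alignment(scores, random.Random(0))
print(local)
print(hard)
```

The sampled one-hot vector is what makes hard attention non-differentiable: a small change in the scores does not change the output smoothly.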

---

### **5. Context Vector Computation**

The **output of attention** is a weighted average of value vectors:

$$
c = \sum_{l=1}^{n_f} a_l v_l
$$

* This operation fuses information from multiple features according to their learned importance.  
* It is equivalent to computing an **expectation** over features under the distribution \( a_l \).
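As a small worked example of the weighted average, with hand-picked weights and values chosen so the arithmetic is easy to check by hand:

```python
def context_vector(weights, values):
    """c = sum_l a_l * v_l, computed one output dimension at a time."""
    dim = len(values[0])
    return [sum(a * v[d] for a, v in zip(weights, values)) for d in range(dim)]

weights = [0.5, 0.25, 0.25]                       # illustrative attention weights
values = [[2.0, 0.0], [0.0, 4.0], [4.0, 4.0]]     # illustrative value vectors

c = context_vector(weights, values)
print(c)  # [2.0, 2.0]
```

Each component of `c` lies between the smallest and largest corresponding component of the value vectors, as expected of an average under a probability distribution.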

---

### **6. Output Prediction**

The context vector is transformed into final predictions:

$$
\hat{y} = \text{softmax}(W_c c + b_c)
$$

* Here \( W_c \) and \( b_c \) are trainable parameters.  
* The softmax converts the output into probabilities over classes (for classification tasks).
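A sketch of this output step for a three-class classifier is shown below. The weight matrix, bias, and context vector are placeholder values, and `predict` is a hypothetical name.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(c, W_c, b_c):
    """y_hat = softmax(W_c c + b_c)."""
    logits = [sum(w * x for w, x in zip(row, c)) + b for row, b in zip(W_c, b_c)]
    return softmax(logits)

c = [2.0, 2.0]                                   # context vector from the attention step
W_c = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]       # 3 classes x d_v, illustrative
b_c = [0.0, 0.0, 0.0]

probs = predict(c, W_c, b_c)
print(probs)
```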

---

### **7. Multi-dimensional Attention**

Extends scalar attention weights \( a_l \) to vector-valued weights \( a_l \in \mathbb{R}^{d_v} \):

$$
c = \sum_{l=1}^{n_f} a_l \circ v_l
$$

where \( \circ \) denotes element-wise multiplication.

* Each feature dimension gets its own weight → enables fine-grained control within each vector component.
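A small sketch of the element-wise version follows. It assumes the weight vectors are normalized separately for each dimension (each column of weights sums to 1); that normalization choice is an assumption here, and the numbers are made up.

```python
def multi_dim_context(weight_vectors, values):
    """c = sum_l a_l * v_l with element-wise products (a_l is a vector)."""
    dim = len(values[0])
    return [sum(a[d] * v[d] for a, v in zip(weight_vectors, values))
            for d in range(dim)]

values = [[1.0, 10.0],
          [3.0, 20.0]]
# One weight vector per value vector; per dimension the weights sum to 1.
weight_vectors = [[1.0, 0.25],
                  [0.0, 0.75]]

c = multi_dim_context(weight_vectors, values)
print(c)  # [1.0, 17.5]
```

The first dimension attends only to the first vector while the second dimension leans toward the second vector, which scalar weights could not express.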

---

### **8. Hierarchical and Multi-level Attention**

In **hierarchical models**, attention is applied recursively:

1. Compute attention at word-level → get sentence representations.  
2. Compute attention at sentence-level → get document representation.

If \( c^{(s)} \) are sentence vectors:

$$
c^{(D)} = \sum_{s} a_s c^{(s)}
$$

* Each level reuses the same weighted-average principle.
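The two-level procedure can be sketched as follows. This is a simplification under stated assumptions: dot-product scoring is used at both levels, the vectors serve as their own keys and values, and `word_query` and `sentence_query` stand in for learned query vectors.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, vectors):
    """Weighted average of `vectors`, weighted by softmax of dot-product scores."""
    weights = softmax([dot(query, v) for v in vectors])
    dim = len(vectors[0])
    return [sum(a * v[d] for a, v in zip(weights, vectors)) for d in range(dim)]

# A toy document: two sentences, each a list of word vectors.
document = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 2.0]],
]
word_query = [1.0, 0.0]        # hypothetical learned query, word level
sentence_query = [0.0, 1.0]    # hypothetical learned query, sentence level

sentence_vectors = [attend(word_query, words) for words in document]  # c^(s)
document_vector = attend(sentence_query, sentence_vectors)            # c^(D)
print(sentence_vectors)
print(document_vector)
```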

---

### **9. Co-Attention and Multi-Input Extensions**

For models with two inputs (e.g., question & image), co-attention computes mutual relevance.

* **Alternating co-attention:** use one input’s context as the query for the other.  
* **Parallel co-attention:** construct an *affinity matrix* \( A \) that measures pairwise similarity:

$$
A = \text{act}\left( {K^{(1)}}^{\top} W_A K^{(2)} \right)
$$

* \( A_{ij} \) reflects similarity between features of two modalities.  
* Aggregating over \( A \) yields cross-modal attention scores.
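A sketch of the affinity matrix for two small key sets is given below. It assumes `tanh` as the activation and uses an identity matrix in place of the learned \( W_A \). The final aggregation (scoring each key of the first input by its strongest match in the second) is one simple, hypothetical choice among several possible ones.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    return [dot(row, x) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def affinity_matrix(keys_1, keys_2, W_A):
    """A[i][j] = tanh(k1_i^T W_A k2_j) for every pair of keys."""
    projected = [matvec(W_A, k2) for k2 in keys_2]
    return [[math.tanh(dot(k1, p)) for p in projected] for k1 in keys_1]

keys_1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # e.g. three image regions
keys_2 = [[1.0, 0.0], [0.0, 2.0]]               # e.g. two question words
W_A = [[1.0, 0.0], [0.0, 1.0]]                  # placeholder for learned weights

A = affinity_matrix(keys_1, keys_2, W_A)
weights_1 = softmax([max(row) for row in A])    # hypothetical aggregation
print(A)
print(weights_1)
```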

---

### **10. Statistical Interpretation**

The attention mechanism can be viewed as a **probability-weighted estimator**:

$$
c = \mathbb{E}_{a}[v]
$$

where \( a \) represents an attention distribution over input features.

Thus, soft attention acts as a **data-driven reweighting** of the value vectors: it computes this expectation exactly, whereas hard attention approximates it by sampling a single vector from \( a \).

---

### **11. Multi-head and Self-Attention**

For **multi-head attention**:

$$
\text{MultiHead}(Q,K,V) = [\text{head}_1, \dots, \text{head}_h] W^O
$$

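with each head computing attention on different projections (written in the row-vector convention of Vaswani et al., where each row of \( Q, K, V \) is one vector, rather than the column convention used for \( F \) above):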

$$
\text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)
$$

For **self-attention**, the query, key, and value all come from the same feature matrix:

$$
Q = W_Q F, \quad K = W_K F, \quad V = W_V F
$$

* This allows modeling of dependencies between all elements in the same sequence or image.  
* Mathematically, it builds a *relation matrix* capturing pairwise interactions among features.
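The following sketch puts self-attention and multi-head attention together in plain Python. It is an illustration under simplifying assumptions: each feature is a list (one column of \( F \)), scoring is scaled dot-product, identity matrices stand in for the learned projections, and the function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    return [dot(row, x) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: one context vector per query."""
    scale = math.sqrt(len(keys[0]))
    contexts = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        contexts.append([sum(a * v[d] for a, v in zip(weights, values))
                         for d in range(len(values[0]))])
    return contexts

def self_attention_head(features, W_Q, W_K, W_V):
    """Queries, keys, and values are all projections of the same features."""
    queries = [matvec(W_Q, f) for f in features]
    keys = [matvec(W_K, f) for f in features]
    values = [matvec(W_V, f) for f in features]
    return attention(queries, keys, values)

def multi_head(features, heads, W_O):
    """Run every head, concatenate per position, then apply the output projection."""
    per_head = [self_attention_head(features, *head) for head in heads]
    outputs = []
    for position in range(len(features)):
        concat = [x for head_out in per_head for x in head_out[position]]
        outputs.append(matvec(W_O, concat))
    return outputs

identity = [[1.0, 0.0], [0.0, 1.0]]
features = [[1.0, 0.0], [0.0, 1.0]]
heads = [(identity, identity, identity), (identity, identity, identity)]
W_O = [[0.5, 0.0, 0.5, 0.0],      # averages the two heads, illustrative only
       [0.0, 0.5, 0.0, 0.5]]

outputs = multi_head(features, heads, W_O)
print(outputs)
```

Because both heads use the same placeholder projections, the averaged output equals the single-head result; with distinct learned projections each head would attend differently.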

---

### **12. Statistical Learning Aspects**

Although the paper does not conduct experiments, it relies on standard **supervised learning** principles:

* All weight matrices \( W_K, W_V, W_Q, W_c \) are optimized via **gradient descent** and **backpropagation**.  
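* Softmax normalization keeps the attention weights in \( [0, 1] \) and summing to 1, so the context vector stays on the same scale as the value vectors.  
* Scaling the dot product by \( 1/\sqrt{d_k} \) keeps the scores from growing with \( d_k \), which would otherwise push the softmax into saturated regions with very small gradients.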
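
The effect of the scaling can be made explicit with a short derivation. It assumes, following the argument given by Vaswani et al., that the components of \( q \) and \( k_l \) are independent with zero mean and unit variance; this is an idealization rather than a property guaranteed for trained models.

$$
\mathbb{E}\left[ q^\top k_l \right] = \sum_{i=1}^{d_k} \mathbb{E}[q_i]\, \mathbb{E}[k_{l,i}] = 0, \qquad
\text{Var}\left( q^\top k_l \right) = \sum_{i=1}^{d_k} \mathbb{E}[q_i^2]\, \mathbb{E}[k_{l,i}^2] = d_k
$$

$$
\text{Var}\left( \frac{q^\top k_l}{\sqrt{d_k}} \right) = \frac{d_k}{d_k} = 1
$$

Under these assumptions the unscaled scores have standard deviation \( \sqrt{d_k} \), so the division keeps the softmax inputs at unit scale regardless of the key dimension.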

---

### **Summary**

| **Concept** | **Mathematical Role** | **Intuitive Role in the Paper** |
|:-------------|:----------------------|:--------------------------------|
| Linear Transformations | \( K = W_K F, \; V = W_V F \) | Learn how to represent information as address (K) and content (V). |
| Score Function | \( e_l = \text{score}(q, k_l) \) | Compute relevance between query and each feature. |
| Softmax Alignment | \( a_l = \frac{\exp(e_l)}{\sum_j \exp(e_j)} \) | Convert scores into a probability-like focus. |
| Context Vector | \( c = \sum_l a_l v_l \) | Weighted combination of features = attention output. |
| Multi-head Extension | Parallel sets of attention computations | Capture multiple relational patterns simultaneously. |
| Co-/Hierarchical Attention | Nested or cross-modal weighting | Combine structured or multimodal inputs adaptively. |

---

### **Final Insight**

The mathematical structure shows that **attention is a generalized weighting mechanism** built upon linear algebra and probability normalization.  
Every variant—whether additive, dot-product, or hierarchical—is an adaptation of the same core concept:

**Weighted aggregation of information based on learned relevance.**


| # | **Research Problem / Gap Identified** | **How It Limits Prior Work** | **Proposed Solution in This Paper** |
|---|--------------------------------------|--------------------------------|------------------------------------|
| **1** | Fragmented, domain-specific surveys on attention mechanisms | Existing reviews focus narrowly on single domains (e.g., NLP, vision, or graphs), preventing a unified understanding of attention across tasks and data modalities. | Develops a cross-domain survey presenting attention mechanisms in a generalized mathematical and conceptual framework applicable to multiple domains. |
| **2** | Lack of a consistent mathematical framework for attention | Previous works describe attention intuitively or qualitatively, without a unified formalism, making comparisons difficult. | Introduces a general attention model composed of feature, query, attention, and output submodules with uniform notation and explicit mathematical definitions. |
| **3** | Absence of a structured taxonomy distinguishing attention variants | Earlier taxonomies fail to distinguish mechanisms such as co-attention, hierarchical attention, and self-attention systematically, leading to conceptual overlap. | Proposes a comprehensive taxonomy that categorizes attention mechanisms into feature-related, query-related, and general classes with clear subtypes and hierarchy. |
| **4** | Insufficient integration of technical and intuitive explanations | Many surveys emphasize either theory (equations) or intuition (descriptions), causing a gap between understanding and implementation. | Combines technical derivations (e.g., scoring, alignment, and weighting) with intuitive examples from NLP, vision, and multimodal learning for accessible comprehension. |
| **5** | Inconsistent evaluation and interpretability frameworks for attention models | Lack of standard metrics or methods hinders comparison of different attention mechanisms’ performance and interpretability. | Reviews and formalizes evaluation measures for attention quality, interpretability, and model performance; suggests structural analysis using the proposed taxonomy. |
| **6** | Limited understanding of cross-modal and multi-level attention | Prior research rarely connects how attention operates across multiple modalities (e.g., vision + text) or hierarchical levels (e.g., word, sentence, document). | Illustrates cross-modal extensions (co-attention, rotatory attention) and multi-level architectures (hierarchical and attention-via-attention) within the same framework. |
| **7** | Lack of theoretical unification between classic RNN-based attention and Transformer-based self-attention | Fragmented explanations of how modern self-attention evolved from earlier sequence models impede theoretical continuity. | Unifies both paradigms under one formal model showing that attention is sufficient as a standalone computation, exemplified by the Transformer architecture. |

---

### **Summary Insight**

This paper closes the conceptual and methodological gaps in prior literature by **formalizing attention as a general, domain-agnostic mechanism**, integrating **mathematical rigor**, **structural taxonomy**, and **evaluation consistency**—thereby transforming attention from a collection of heuristic techniques into a **unified theoretical framework in deep learning**.


### **TABLE 3 — Attention Models Analyzed Based on the Proposed Taxonomy**

*A plus sign (+) between two mechanisms indicates that both techniques were combined in the same model, while a comma (,) indicates that both mechanisms were tested in the same paper, but not necessarily as a combination in the same model.*

| **Model / Paper** | **Feature-Related: Multiplicity** | **Feature-Related: Levels** | **Feature-Related: Representations** | **General: Scoring** | **General: Alignment** | **General: Dimensionality** | **Query-Related: Type** | **Query-Related: Multiplicity** |
|--------------------|----------------------------------|-----------------------------|-------------------------------------|-----------------------|-------------------------|-----------------------------|--------------------------|----------------------------------|
| Bahdanau et al. [3] | Singular | Single-Level | Single-Representational | Additive | Global | Single-Dimensional | Basic | Singular |
| Luong et al. [4] | Singular | Single-Level | Single-Representational | Multiplicative, Location | Global, Local | Single-Dimensional | Basic | Singular |
| Xu et al. [8] | Singular | Single-Level | Single-Representational | Additive | Soft, Hard | Single-Dimensional | Basic | Singular |
| Lu et al. [32] | Parallel Co-attention | Hierarchical | Single-Representational | Additive | Soft, Global | Single-Dimensional | Specialized | Singular |
| Yang et al. [5] | Singular | Hierarchical | Single-Representational | Additive | Global | Single-Dimensional | Self-Attentive | Singular |
| Li et al. [47] | Singular | Hierarchical | Single-Representational | Additive | Global | Single-Dimensional | Self-Attentive | Singular |
| Vaswani et al. [13] | Singular | Single-Level | Single-Representational | Scaled-Multiplicative | Global | Single-Dimensional | Self-Attentive + Basic | Multi-Head + Multi-Hop |
| Wallaart and Frasincar [43] | Rotatory | Single-Level | Single-Representational | Activated General | Global | Single-Dimensional | Specialized | Singular |
| Kiela et al. [50] | Singular | Single-Level | Multi-Representational | Additive | Global | Single-Dimensional | Self-Attentive | Singular |
| Shen et al. [64] | Singular | Single-Level | Single-Representational | Additive | Global | Multi-Dimensional | Self-Attentive | Singular |
| Zhang et al. [74] | Singular | Single-Level | Single-Representational | Scaled-Multiplicative | Global | Single-Dimensional | Self-Attentive + Specialized | Singular |
| Li et al. [105] | Parallel Co-attention | Single-Level | Single-Representational | Multiplicative | Global | Single-Dimensional | Self-Attentive + Specialized | Multi-Hop |
| Yu et al. [106] | Parallel Co-attention | Single-Level | Single-Representational | Additive | Reinforced | Single-Dimensional | Self-Attentive | Singular |
| Wang et al. [62] | Parallel Co-attention | Single-Level | Single-Representational | Additive | Global | Single-Dimensional | Self-Attentive | Singular |
| Oktay et al. [67] | Singular | Single-Level | Single-Representational | Additive | Global | Multi-Dimensional | Self-Attentive + Specialized | Singular |
| Winata et al. [52] | Singular | Single-Level | Multi-Representational | Additive | Global | Single-Dimensional | Self-Attentive | Multi-Head |
| Wang et al. [89] | Singular | Single-Level | Single-Representational | Additive | Global | Single-Dimensional | Self-Attentive | Capsule-Based |


### **Academic Interpretation of Table 3**

---

### **Purpose of the Table**

Table 3 is designed to **validate the proposed taxonomy of attention mechanisms** by mapping well-known attention models from the literature onto the framework introduced by Brauwers & Frasincar (2022).  
It systematically classifies each foundational paper—such as **Bahdanau (2015), Luong (2015), Xu (2015), Vaswani (2017)**—along structural and functional dimensions.  
This mapping demonstrates that the taxonomy is not merely theoretical but can be **empirically applied** to describe and compare diverse architectures across domains (NLP, vision, multimodal learning).

---

### **Axes and Aspects Used to Construct the Table**

| **Aspect** | **What It Means** | **Why It Matters** |
|:------------|:------------------|:-------------------|
| **Multiplicity** | Whether the model uses one attention mechanism (Singular) or multiple combined ones (e.g., Parallel, Co-attention, Rotatory). | Reveals if the model’s attention operates on a single source of features or integrates multiple sources/modalities (e.g., image + text). |
| **Feature-Related** | Specifies the structure of the input features: Single-Level (flat), Hierarchical (multi-level), Single-Representational (one embedding), or Multi-Representational (multiple embeddings). | Identifies whether the model attends to one representational layer or several hierarchical ones (e.g., words → sentences). |
| **General** | Indicates the mathematical design of the attention core: scoring, alignment, and dimensionality. | Demonstrates which scoring or weighting method (additive, multiplicative, scaled) and which alignment (global, hard, local) are used, showing how relevance is computed. |
| **Dimensionality** | Whether the attention weights are scalars (Single-Dimensional) or vectors (Multi-Dimensional). | Reflects the granularity of focus—entire feature vectors versus individual elements within them. |
| **Type (Query-Related)** | Describes how the query is defined—Basic (fixed query), Self-Attentive (query = feature itself), Specialized (cross-modal), etc. | Differentiates models that use external queries (e.g., decoder states) from those that use self-attention or task-specific mechanisms. |
| **Query Multiplicity** | Whether the model employs one query (Singular), several queries (Multi-Head), or iterative queries (Multi-Hop). | Distinguishes classical attention from modern multi-head/self-attention systems like the Transformer. |

---

### **Interpretive Reading of the Table**

Each row corresponds to a **distinct attention architecture**, classified according to the proposed taxonomy.

| **Model** | **Highlights from the Table** | **Interpretation** |
|:-----------|:------------------------------|:-------------------|
| **Bahdanau et al. (2015)** | Singular / Single-Level / Additive / Global / Basic / Singular | Prototype additive attention for RNNs — the foundational “encoder–decoder” model integrating alignment and translation. |
| **Luong et al. (2015)** | Singular / Single-Level / Multiplicative / Global / Basic / Singular | Introduces dot-product attention for computational efficiency; conceptually similar to Bahdanau but with different scoring. |
| **Xu et al. (2015)** | Singular / Single-Representational / Additive / Soft, Hard / Basic / Singular | The “Show, Attend and Tell” model — extends attention to visual domains and introduces both deterministic and stochastic alignments. |
| **Vaswani et al. (2017)** | Singular / Single-Representational / Scaled / Global / Self-Attentive / Multi-Head + Multi-Hop | The Transformer — introduces multi-head self-attention as a standalone mechanism, generalizing attention to all input positions simultaneously. |
| **Wallaart & Frasincar (2021)** | Rotatory / Single-Level / Single-Representational / Activated General / Global / Specialized / Singular | A sentiment analysis model using rotating attention queries; exemplifies specialization within the unified taxonomy. |

Through this mapping, the table captures **how successive models evolved**—from RNN-based additive attention to multi-head self-attention—while remaining describable under the same mathematical structure.

---

### **Why These Aspects Were Chosen**

1. **Taxonomic Completeness:**  
   The paper’s taxonomy divides attention mechanisms into three orthogonal groups—**feature-related**, **general**, and **query-related**.  
   Table 3 operationalizes these categories with empirical evidence.

2. **Cross-Domain Applicability:**  
   By combining **structural descriptors** (e.g., hierarchy, multiplicity) and **mathematical attributes** (e.g., scoring, alignment), the taxonomy becomes **domain-agnostic**, applicable across text, vision, graphs, and multimodal settings.

3. **Comparative Clarity:**  
   Earlier surveys lacked a consistent comparative framework.  
   This table standardizes the analysis, revealing conceptual continuities (e.g., **Bahdanau → Luong → Vaswani**) across architectures.

4. **Framework Validation:**  
   Populating the taxonomy with canonical models shows that the framework can **describe a wide range of existing attention mechanisms** under one notation and conceptual structure.

---

### **In Summary**

Table 3 serves as a **taxonomy validation matrix**.  
It classifies major attention architectures across **eight analytical dimensions**: Feature Multiplicity, Feature Level, Representation Type, Scoring, Alignment, Dimensionality, Query Type, and Query Multiplicity.

By synthesizing **architectural design** and **functional operation**, it demonstrates that all known attention mechanisms—from early additive RNN models to modern Transformer-based systems—can be expressed within a **single, mathematically unified framework** proposed by *A General Survey on Attention Mechanisms in Deep Learning*.


## **Comparative Analysis of Attention Types in Table 3**
*(Based on “A General Survey on Attention Mechanisms in Deep Learning,” Brauwers & Frasincar, 2022)*

---

## **1. Feature-Related Mechanisms**

| **Type** | **Definition** | **Structural Role** | **Example Models** | **Distinctive Insight** |
|:----------|:---------------|:--------------------|:--------------------|:--------------------------|
| **Singular Attention** | Operates on a *single source* of features (e.g., encoder hidden states). | Focuses on one modality or feature sequence. | Bahdanau (2015), Luong (2015) | The simplest and foundational form—aligns one encoder output set to one decoder query. |
| **Parallel Attention** | Runs multiple attention modules *independently* and aggregates their outputs (e.g., summation or concatenation). | Expands representational coverage without dependency among branches. | Li et al. (2018) | Functionally analogous to multi-head attention but at the module level rather than projection level. |
| **Co-Attention** | Computes *joint attention* between two modalities (e.g., image ↔ text). | Both modalities act as queries and contexts for each other. | Lu et al. (2016); Yu et al. (2017) | Enables *mutual relevance learning* — the foundation of visual question answering (VQA). |
| **Rotatory Attention** | Alternates query–context roles cyclically between components (e.g., Q → A → Q). | Sequential bidirectional focus between dependent inputs. | Wallaart & Frasincar (2021) | Models *iterative reasoning* or dialogue exchange symmetry. |
| **Hierarchical Attention** | Applies attention recursively (word → sentence → document). | Builds representations at multiple abstraction levels. | Yang et al. (2016) | Captures *compositional hierarchy* and long-range dependencies across textual structures. |
| **Multi-Representational Attention** | Weights several representations of the same input (e.g., multiple embeddings). | Fuses heterogeneous embeddings for richer semantics. | Kiela et al. [50]; Winata et al. [52] | Extends attention beyond homogeneous sequences to multi-source input representations. |

**Insight:**  
Prior reviews grouped *hierarchical* and *co-attention* under “multi-level.”  
Brauwers & Frasincar (2022) clarify that hierarchy pertains to *intra-modal* structure, while co-attention refers to *inter-modal* coupling—resolving conceptual ambiguity.

---

## **2. General Mechanisms (Mathematical Core)**

| **Dimension** | **Variants** | **Mathematical Role** | **Interpretation / Importance** |
|:---------------|:-------------|:----------------------|:--------------------------------|
| **Scoring Function** | Additive (Bahdanau), Multiplicative (Luong), Scaled (Vaswani), Activated (Wallaart & Frasincar) | Computes similarity: \( e_l = \text{score}(q, k_l) \). | Determines expressivity–efficiency trade-off: additive (nonlinear, expressive), multiplicative (fast), scaled (stable), activated (nonlinear sharpening). |
| **Alignment Function** | Global (Soft), Hard (Stochastic), Local (Windowed), Reinforced (Policy-based) | Normalizes scores into probabilities: \( a_l = \frac{\exp(e_l)}{\sum_j \exp(e_j)} \). | Controls focus scope—global covers all keys; hard/local restricts; reinforced uses policy gradients for discrete sampling. |
| **Dimensionality** | Single-Dimensional vs. Multi-Dimensional | Defines scalar vs. vector weights \( a_l \). | Scalar weights attend to whole vectors; vector weights enable fine-grained, per-dimension attention. |

**Insight:**  
The “general” axis abstracts the mathematical machinery independent of data modality—linking all attention forms through a shared probabilistic weighting framework.

---

## **3. Query-Related Mechanisms**

| **Type** | **Definition** | **Purpose / Query Source** | **Representative Models** | **Why Distinct** |
|:-----------|:----------------|:----------------------------|:----------------------------|:------------------|
| **Basic Query** | External or fixed query (e.g., decoder hidden state). | Classical encoder–decoder setting. | Bahdanau (2015), Luong (2015) | Provides the first explicit mechanism defining “what to focus on.” |
| **Self-Attentive Query** | Query, key, and value all originate from same feature set. | Captures intra-sequence dependencies. | Vaswani (2017), Yang (2016) | Enables parallel and global relational modeling across inputs. |
| **Specialized Query** | Domain- or task-specific query type. | Tailored to specific inputs (e.g., image region, graph node). | Lu et al. (2016); Wallaart & Frasincar (2021) | Generalizes query construction beyond linguistic sequences. |
| **Multiplicity of Queries** | **Singular**, **Multi-Head**, **Multi-Hop**, **Capsule-Based** | Defines number and iteration of queries. | Vaswani (2017): Multi-Head; Li et al. (2018): Multi-Hop; Wang et al. (2019): Capsule | Expands model capacity for diverse or sequential attention reasoning. |

**Key Clarification:**  
Older taxonomies merged *multi-head*, *multi-hop*, and *self-attention* under one umbrella.  
Brauwers & Frasincar separate them by **query origin** (self vs. external) and **query multiplicity** (parallel multi-head vs. sequential multi-hop vs. capsule-based).

---

## **4. Comparative Summary Across Table 3 Models**

| **Model** | **Feature Focus** | **General Mechanism** | **Query Behavior** | **Interpretation** |
|:-----------|:------------------|:-----------------------|:--------------------|:--------------------|
| **Bahdanau (2015)** | Singular / Single-Level | Additive + Global | Basic / Singular | Introduced soft additive attention for machine translation; first differentiable alignment mechanism. |
| **Luong (2015)** | Singular / Single-Level | Multiplicative + Global | Basic / Singular | Computationally simpler; uses dot-product for faster similarity estimation. |
| **Xu et al. (2015)** | Singular / Single-Representational | Additive + Soft, Hard | Basic / Singular | Pioneered visual attention; integrates deterministic and stochastic alignment. |
| **Yang (2016)** | Hierarchical / Single-Representational | Additive + Global | Self-Attentive / Singular | Established hierarchical word–sentence document-level modeling (HAN). |
| **Vaswani (2017)** | Singular / Single-Representational | Scaled + Global | Self-Attentive / Multi-Head + Multi-Hop | Defined Transformer architecture—multi-head parallel self-attention replacing recurrence entirely. |
| **Wallaart & Frasincar (2021)** | Rotatory / Single-Level | Activated General + Global | Specialized / Singular | Cyclic exchange of attention queries between textual components; exemplifies iterative refinement. |
| **Lu (2016)** | Parallel Co-Attention / Hierarchical | Additive + Global | Specialized / Singular | Introduced co-attention for image–text alignment in VQA; dual modality interaction. |
| **Yu (2017)** | Parallel Co-Attention | Additive + Reinforced | Self-Attentive / Singular | Enhances co-attention with internal self-refinement loops. |
| **Oktay (2018)** | Singular / Single-Representational | Additive + Global (Multi-Dimensional) | Self-Attentive + Specialized / Singular | Introduces Attention U-Net for medical segmentation; integrates attention into convolutional networks. |
| **Wang (2019)** | Singular / Single-Level | Additive + Global | Self-Attentive / Capsule-Based | Combines capsule routing and attention for dynamic part–whole reasoning. |

---

## **5. Conceptual Hierarchy (Taxonomic Map)**
```
Attention Mechanisms
│
├── Feature-Related
│   ├─ Singular
│   ├─ Parallel
│   ├─ Co-Attention
│   ├─ Rotatory
│   ├─ Hierarchical
│   └─ Multi-Representational
│
├── General
│   ├─ Scoring (Additive, Multiplicative, Scaled, Activated)
│   ├─ Alignment (Global, Hard, Local, Reinforced)
│   └─ Dimensionality (Single, Multi)
│
└── Query-Related
    ├─ Type (Basic, Self-Attentive, Specialized)
    └─ Multiplicity (Singular, Multi-Head, Multi-Hop, Capsule-Based)
```

This hierarchical breakdown isolates the **axes of diversity** in attention mechanisms — structural (Feature), computational (General), and functional (Query).
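
As a rough illustration, the map can be encoded as a nested mapping and used to check that a model's labels exist in the taxonomy. This is only a sketch: the sub-axis name `"Structure"` is a hypothetical placeholder (the map lists the feature-related entries without a sub-axis), and the `bahdanau` profile uses the labels from the comparative table restricted to those that appear in the map.

```python
# Nested mapping mirroring the taxonomic map: axis -> sub-axis -> allowed labels.
TAXONOMY = {
    "Feature-Related": {
        # "Structure" is a placeholder name; the map gives no sub-axis here.
        "Structure": {"Singular", "Parallel", "Co-Attention", "Rotatory",
                      "Hierarchical", "Multi-Representational"},
    },
    "General": {
        "Scoring": {"Additive", "Multiplicative", "Scaled", "Activated"},
        "Alignment": {"Global", "Hard", "Local", "Reinforced"},
        "Dimensionality": {"Single", "Multi"},
    },
    "Query-Related": {
        "Type": {"Basic", "Self-Attentive", "Specialized"},
        "Multiplicity": {"Singular", "Multi-Head", "Multi-Hop", "Capsule-Based"},
    },
}

def unknown_labels(profile):
    """Return the (axis, sub_axis, label) triples that the taxonomy lacks."""
    problems = []
    for (axis, sub_axis), label in profile.items():
        if label not in TAXONOMY.get(axis, {}).get(sub_axis, set()):
            problems.append((axis, sub_axis, label))
    return problems

# Profile for Bahdanau (2015), using labels that appear in the map.
bahdanau = {
    ("Feature-Related", "Structure"): "Singular",
    ("General", "Scoring"): "Additive",
    ("General", "Alignment"): "Global",
    ("Query-Related", "Type"): "Basic",
    ("Query-Related", "Multiplicity"): "Singular",
}
issues = unknown_labels(bahdanau)
```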

---

## **6. Academic Interpretation**

- **Purpose:** To resolve long-standing taxonomic gaps by explicitly mapping where attention acts (*feature structure*), how it operates (*mathematical form*), and what drives it (*query formulation*).  
- **Outcome:** Apparent diversity among models (e.g., RNN attention, co-attention, Transformer) collapses into **a unified algebraic framework** governed by three independent axes.  
- **Contribution:** The taxonomy redefines attention not as a family of isolated architectures but as a **generalized operator for weighted relevance computation** across modalities, structures, and abstraction levels.

---

## **7. Final Summary**

> Table 3 in Brauwers & Frasincar (2022) serves as a **taxonomic validation matrix** uniting all attention mechanisms under three orthogonal principles:
> **(1) Feature multiplicity and representation, (2) Mathematical scoring–alignment–dimensionality core, and (3) Query type and multiplicity.**
>
> This multidimensional view clarifies structural and computational distinctions, for instance *Co-Attention* (cross-modal integration) vs. *Hierarchical Attention* (intra-modal aggregation), and offers a single, consistently notated taxonomy for describing attention-based architectures across domains.


## **Comparative Analysis of Core Query-Related Mechanisms**
### *(Based on “A General Survey on Attention Mechanisms in Deep Learning,” Brauwers & Frasincar, 2022)*

---

## **1. Comparative Overview Table**

| **Mechanism** | **Definition (from paper context)** | **Mathematical Formulation** | **Feature–Query Relationship** | **Purpose / Functional Role** | **Representative Model(s)** |
|:---------------|:------------------------------------|:------------------------------|:-------------------------------|:-------------------------------|:-----------------------------|
| **Self-Attention** | The query (Q), key (K), and value (V) vectors originate from the same feature set; each element attends to all others within the same sequence. | $$\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$ | Intra-modal, intra-sequence. The query = feature itself. | Captures contextual dependencies between elements in one modality (e.g., words in a sentence, pixels in an image). | Vaswani et al. (2017); Yang et al. (2016) |
| **Multi-Head Attention** | Runs multiple parallel self-attention (or cross-attention) operations with independent projections, then concatenates results. | $$\text{MultiHead}(Q,K,V) = [\text{head}_1, ..., \text{head}_h]W^O,\quad \text{where}\ \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$ | Parallel intra- or inter-modal. Uses several learned subspaces. | Allows the model to attend to information from multiple representational perspectives simultaneously (syntactic, positional, semantic). | Transformer (Vaswani et al., 2017) |
| **Masked Attention** | A restricted variant of self-attention that blocks information flow from future positions using a causal mask. | $$A = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V,\quad M_{ij} = -\infty \text{ if } j > i \text{ (future position)},\ 0 \text{ otherwise.}$$ | Temporal intra-modal. Enforces one-way flow: each query sees only itself and earlier elements. | Ensures autoregressive causality; used in language models and decoders so that a position cannot attend to the tokens it must predict. | GPT; Transformer Decoder (Vaswani et al., 2017) |
| **Cross-Attention** | Query originates from one modality (e.g., text), while keys/values come from another (e.g., image). | $$\text{Attention}(Q^{(A)}, K^{(B)}, V^{(B)}) = \text{softmax}\left(\frac{Q^{(A)} K^{(B)\top}}{\sqrt{d_k}}\right)V^{(B)}$$ | Inter-modal (cross-domain). Query ≠ key/value source. | Enables information fusion across modalities or layers (e.g., encoder–decoder attention in translation). | Bahdanau et al. (2015); Lu et al. (2016) |
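
The self-attention and cross-attention rows share one formula and differ only in where the queries come from. The following is a minimal sketch in plain Python, assuming small list-of-lists matrices and omitting the learned projections; the helper names (`softmax`, `attention`) and the toy inputs `X` and `Y` are illustrative, not taken from the paper.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on lists of row vectors.

    Q is n_q x d_k, K is n_k x d_k, V is n_k x d_v.
    Returns (outputs, weights); learned projections are omitted.
    """
    d_k = len(K[0])
    weights = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[j] for w, v in zip(row, V)) for j in range(len(V[0]))]
        for row in weights
    ]
    return outputs, weights

# Self-attention: one feature set supplies queries, keys, and values.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
self_out, self_w = attention(X, X, X)

# Cross-attention: queries come from a second (hypothetical) feature set Y.
Y = [[0.5, 0.5]]
cross_out, cross_w = attention(Y, X, X)
```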

---

## **2. Conceptual Distinctions**

| **Dimension** | **Self-Attention** | **Multi-Head Attention** | **Masked Attention** | **Cross-Attention** |
|:---------------|:------------------|:--------------------------|:----------------------|:----------------------|
| **Query Source** | \( Q = K = V \) (same features) | \( Q = K = V \) but partitioned across multiple heads | \( Q = K = V \) | \( Q \neq K,V \) (different modalities or layers) |
| **Information Flow** | Fully bidirectional within sequence | Bidirectional, multi-representational | Unidirectional (causal) | Cross-directional (one domain → another) |
| **Parallelism** | Single operation | Multiple independent operations combined | Single but restricted | Single per modality pair |
| **Focus Granularity** | Learns one global context | Learns multiple parallel contexts | Learns context from the current and all earlier positions | Learns inter-modality alignment |
| **Mathematical Control** | Softmax over all pairwise relations | Multiple softmaxes in distinct subspaces | Softmax with causal mask | Softmax between heterogeneous vector spaces |
| **Interpretability** | Clear, easy to visualize via attention maps | Harder (many heads), but more expressive | Clear temporal causality | Clear cross-domain correspondence (e.g., text ↔ image) |

---

## **3. Functional & Architectural Roles**

| **Mechanism** | **Architectural Placement** | **Functional Emphasis** |
|:---------------|:----------------------------|:--------------------------|
| **Self-Attention** | Encoder blocks, sequence modeling layers | Global dependency modeling; replaces recurrence and convolution. |
| **Multi-Head Attention** | Core of Transformer encoder and decoder | Captures multiple relational subspaces (semantic, syntactic, positional). |
| **Masked Attention** | Decoder blocks of autoregressive models | Preserves causal order; ensures generative consistency in language models. |
| **Cross-Attention** | Between encoder and decoder or across modalities | Integrates contextual information from external representations (e.g., vision–language fusion). |

---

## **4. Mathematical Comparison Summary**

| **Mechanism** | **Computation Equation** | **Constraint or Modification** | **Learning Impact** |
|:----------------|:--------------------------|:-------------------------------|:---------------------|
| **Self-Attention** | \( A = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V \) | None | Learns all pairwise dependencies among elements. |
| **Multi-Head Attention** | \( \text{concat}_i\left(\text{softmax}\left(\frac{Q_i K_i^\top}{\sqrt{d_k}}\right)V_i\right) W^O \), with \( Q_i = QW_i^Q,\ K_i = KW_i^K,\ V_i = VW_i^V \) | Parallel projections with shared output map | Increases representational diversity; stabilizes training. |
| **Masked Attention** | \( A = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)V \) | Adds mask \( M \) to block future tokens | Enforces sequential dependency; prevents leakage of future info. |
| **Cross-Attention** | \( A = \text{softmax}\left(\frac{Q^{(A)} K^{(B)\top}}{\sqrt{d_k}}\right)V^{(B)} \) | Queries and keys/values from different domains | Enables encoder–decoder alignment or multimodal integration. |
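
The effect of the additive mask \( M \) is easiest to see on the attention weights themselves. The sketch below, in plain Python with hypothetical helper names (`causal_mask`, `masked_attention_weights`), assumes the usual causal convention that position \( i \) may attend to positions \( j \le i \).

```python
import math

def causal_mask(n):
    """M[i][j] is 0 when j <= i (visible) and -inf when j > i (future)."""
    return [[0.0 if j <= i else -math.inf for j in range(n)] for i in range(n)]

def masked_attention_weights(Q, K, M):
    """Row-wise softmax of QK^T / sqrt(d_k) + M."""
    d_k = len(K[0])
    weights = []
    for q, mask_row in zip(Q, M):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) + m
            for k, m in zip(K, mask_row)
        ]
        top = max(scores)  # finite, because position i always sees itself
        exps = [math.exp(s - top) for s in scores]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = masked_attention_weights(X, X, causal_mask(3))
# Entries above the diagonal are exactly 0.0, so W is lower triangular.
```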

---

## **5. Taxonomic Placement under Brauwers & Frasincar (2022)**

| **Taxonomic Axis** | **Self-Attention** | **Multi-Head** | **Masked** | **Cross-Attention** |
|:--------------------|:------------------|:----------------|:-------------|:----------------------|
| **Feature-Related** | Singular, single-representational | Singular, multi-representational (heads as subspaces) | Singular, single-level | Co-attention (dual modality) |
| **General Mechanism** | Scaled scoring, global alignment | Scaled scoring, parallel global alignments | Scaled scoring + masked alignment | Scaled scoring, global alignment |
| **Query Type** | Self-attentive | Self-attentive + multi-head multiplicity | Self-attentive + masked | Specialized (cross-modal or encoder–decoder) |
| **Query Multiplicity** | Singular | Multi-head | Singular | Singular or multi-hop |
| **Dimensionality** | Single or multi (variant-dependent) | Multi (per head, aggregated) | Single | Single or multi (based on modality fusion) |

---

## **6. Academic Interpretation**

- **Self-Attention — Core Principle:**  
  Models full pairwise relationships within a single modality. Every element may attend to every other element, although the resulting attention weights are generally not symmetric; this establishes the base form of attention.

- **Multi-Head Attention — Representational Expansion:**  
  Introduces *parallel subspaces*, allowing simultaneous learning of multiple relational aspects—positional, syntactic, or semantic.

- **Masked Attention — Temporal Constraint:**  
  Enforces causal unidirectionality. Used in decoders to preserve the natural sequence order essential for generation tasks.

- **Cross-Attention — Modal or Layer Fusion:**  
  Enables interactions across modalities (e.g., text ↔ image) or across network levels (e.g., encoder → decoder), serving as the bridge for context conditioning.
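
The "parallel subspaces" idea can be sketched as follows in plain Python. This is an illustration under stated assumptions: the projection matrices are fixed coordinate selectors rather than learned weights, the output map \( W^O \) is the identity, and all names (`matmul`, `attention`, `multi_head`, `select_first`, `select_last`) are hypothetical.

```python
import math

def matmul(A, B):
    """Plain matrix product for lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def attention(Q, K, V):
    """Scaled dot-product attention; returns the output rows only."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs

def multi_head(Q, K, V, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) triples, one triple per head."""
    head_outputs = [
        attention(matmul(Q, W_Q), matmul(K, W_K), matmul(V, W_V))
        for W_Q, W_K, W_V in heads
    ]
    concat = []
    for i in range(len(Q)):
        row = []
        for head in head_outputs:
            row.extend(head[i])  # concatenate the heads position by position
        concat.append(row)
    return matmul(concat, W_O)

# Two heads over 4-dimensional features; each head sees two coordinates.
X = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0], [2.0, 0.0, 1.0, 0.0]]
select_first = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
select_last = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
heads = [(select_first,) * 3, (select_last,) * 3]
out = multi_head(X, X, X, heads, identity)
```

Because each head computes its own softmax, the two halves of every output row can weight the three positions differently.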

---

## **7. Conceptual Summary Diagram**
```
Attention Variants (Query-Related)
│
├── Self-Attention → Q = K = V (Intra-modal)
│   ├── Masked → Restrict to past tokens (Causal)
│   └── Multi-Head → Parallel attention subspaces
│
└── Cross-Attention → Q ≠ K,V (Inter-modal or Encoder–Decoder)
```

---

## **8. Final Scholarly Insight**

Within the **Query-Related dimension** of Brauwers & Frasincar’s taxonomy:

* **Self-Attention** introduces the foundational intra-sequence relational mechanism.  
* **Multi-Head Attention** increases representational capacity through multiple projection subspaces.  
* **Masked Attention** adapts the mechanism for causal sequence generation.  
* **Cross-Attention** generalizes attention to inter-domain and encoder–decoder communication.

Together, these four mechanisms cover the query-side variations most widely used in practice (the taxonomy's multi-hop and capsule-based forms extend the same axis), and architectures such as the Transformer, GPT, BERT, and vision–language models can be read as **specific realizations of a single unified mathematical attention framework.**


# **Taxonomy of Attention Mechanisms (Based on Figure 3 — “A General Survey on Attention Mechanisms in Deep Learning”)**

---

### **Table — Taxonomic Structure**

| **Category** | **Subcategory** | **Elements / Types** | **Explanation** |
|:--------------|:----------------|:----------------------|:----------------|
| **Feature-Related Attention** | **Multiplicity** | • Singular Features Attention<br>• Coarse-Grained Co-Attention ↳ (Alternating / Interactive Co-Attention)<br>• Fine-Grained Co-Attention ↳ (Parallel Co-Attention)<br>• Multi-Grained Co-Attention<br>• Rotatory Attention | Governs how attention operates *across feature sets or modalities* (e.g., text ↔ image). Co-attention jointly models relevance between paired inputs. “Grained” specifies how one input queries the other: coarse-grained uses a compact summary of one input as the query, fine-grained uses each of its feature vectors, and multi-grained combines both. Rotatory mechanisms cyclically alternate the direction of focus. |
| | **Levels** | • Single-Level Attention<br>• Attention-via-Attention<br>• Hierarchical Attention | Refers to the **depth of the attention hierarchy**. Single-level applies one layer; hierarchical stacks multiple layers (e.g., word → sentence → document). Attention-via-Attention feeds the output of one attention into another, enabling meta-attention over prior focus. |
| | **Representations** | • Single-Representational Attention<br>• Multi-Representational Attention | Determines whether the model works with one representation of the input or weighs several representations of the same input (e.g., different embeddings of the same words), using attention to decide how much each representation contributes. |
| **General Attention** | **Scoring** | • Additive (Bahdanau)<br>• Multiplicative (Dot-Product)<br>• Scaled Multiplicative (Vaswani)<br>• Biased / Activated / Similarity-Based | Defines *how relevance scores \(e_l\)* are computed between queries and keys. Additive uses an MLP; multiplicative uses inner products; scaling stabilizes gradients in high-dimensional spaces. Activated or similarity-based variants introduce nonlinearity or distance metrics. |
| | **Alignment** | • Soft / Global Alignment<br>• Hard Alignment<br>• Local Alignment<br>• Reinforced Alignment | Determines *which subset* of inputs is attended. Global (soft) alignment applies a softmax over all positions; hard alignment samples a single position; local alignment applies soft attention within a window; reinforced alignment employs policy learning for discrete attention. |
| | **Dimensionality** | • Single-Dimensional Attention<br>• Multi-Dimensional Attention | Specifies whether each feature vector receives a single scalar weight (single-dimensional) or a separate weight for every dimension of the vector (multi-dimensional). |
| **Query-Related Attention** | **Type** | • Basic Queries<br>• Specialized Queries<br>• Self-Attentive Queries | Classifies how the query (Q) is defined. Basic queries originate externally (decoder state), specialized are task-conditioned, and self-attentive derive from the same feature matrix (\(Q=K=V\)). |
| | **Multiplicity** | • Singular Query Attention<br>• Multi-Head Attention<br>• Multi-Hop Attention<br>• Capsule-Based Attention | Describes the *number and interaction* of concurrent queries. Multi-head splits attention into parallel subspaces; multi-hop applies it iteratively for refinement; capsule-based introduces routing dynamics between query groups. |
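
The three most common scoring functions in the table can be compared side by side. The sketch below uses plain Python; the parameter names (`W1`, `W2`, `b`, `w`) and the `tanh` activation for the additive form are assumptions chosen for illustration, and in a real model these parameters would be learned.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    return [dot(row, x) for row in W]

def multiplicative_score(q, k):
    """Plain inner product between query and key."""
    return dot(q, k)

def scaled_score(q, k):
    """Inner product divided by sqrt(d_k)."""
    return dot(q, k) / math.sqrt(len(k))

def additive_score(q, k, W1, W2, b, w):
    """w^T tanh(W1 q + W2 k + b): a one-hidden-layer network over (q, k)."""
    hidden = [
        math.tanh(a + c + bias)
        for a, c, bias in zip(matvec(W1, q), matvec(W2, k), b)
    ]
    return dot(w, hidden)

q = [1.0, 2.0, 3.0, 4.0]
k = [1.0, 1.0, 1.0, 1.0]
W1 = [[0.5, 0.0, 0.0, 0.0], [0.0, 0.5, 0.0, 0.0]]  # hypothetical weights
W2 = [[0.5, 0.0, 0.0, 0.0], [0.0, 0.5, 0.0, 0.0]]
b = [0.0, 0.0]
w = [1.0, 1.0]

raw = multiplicative_score(q, k)       # 10.0
scaled = scaled_score(q, k)            # 10.0 / sqrt(4) = 5.0
additive = additive_score(q, k, W1, W2, b, w)
```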

---

### **Summary Explanation**

1. **Feature-Related Attention**  
   Focuses on *what content or modality* the model attends to and how multiple inputs interact.  
   Includes **co-attention** (mutual relevance) and **hierarchical attention** (multi-level composition).

2. **General Attention**  
   Defines the *mathematical core* — how similarity is scored, aligned, and aggregated.  
   Unifies additive, multiplicative, and scaled formulations under one computational principle.

3. **Query-Related Attention**  
   Addresses *who attends* and *how many times*.  
   Differentiates between self-derived, externally conditioned, or multi-query mechanisms.
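
The alignment step of the general attention core can be sketched in the same spirit. The plain-Python functions below are simplified illustrations with hypothetical names: `local_alignment` uses a hard window around an assumed center position, and `hard_alignment` samples one position from the soft distribution. Reinforced alignment, which needs a learned policy, is not covered.

```python
import math
import random

def soft_alignment(scores):
    """Global (soft) alignment: softmax over every position."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def local_alignment(scores, center, half_width):
    """Soft alignment restricted to a window around `center`."""
    lo = max(0, center - half_width)
    hi = min(len(scores), center + half_width + 1)
    window = soft_alignment(scores[lo:hi])
    return [0.0] * lo + window + [0.0] * (len(scores) - hi)

def hard_alignment(scores, rng):
    """Hard alignment: sample one position and return a one-hot weight vector."""
    probs = soft_alignment(scores)
    chosen = rng.choices(range(len(scores)), weights=probs, k=1)[0]
    return [1.0 if i == chosen else 0.0 for i in range(len(scores))]

scores = [0.1, 2.0, 0.3, 1.5, -0.4]
soft = soft_alignment(scores)
local = local_alignment(scores, center=1, half_width=1)
hard = hard_alignment(scores, random.Random(0))
```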

---

### **Conceptual Integration**

| **Axis** | **Question Answered** | **Examples** |
|:-----------|:----------------------|:--------------|
| **Feature Axis** | *What to attend to?* | Co-attention, hierarchical attention |
| **Computation Axis** | *How to compute relevance?* | Additive vs. multiplicative vs. scaled scoring |
| **Query Axis** | *Who attends — and how often?* | Self-, multi-head, multi-hop |
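
A short variance argument shows why the scaled form divides by \( \sqrt{d_k} \). It is sketched here under the assumption that the components of the query \( q \) and key \( k \) are independent with zero mean and unit variance, so that each product \( q_i k_i \) has zero mean and unit variance:

$$
\operatorname{Var}\left(q^\top k\right) = \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i) = d_k
\quad\Longrightarrow\quad
\operatorname{Var}\left(\frac{q^\top k}{\sqrt{d_k}}\right) = 1
$$

Unscaled scores therefore grow in magnitude with \( d_k \), which pushes the softmax toward saturation, where gradients are small.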

---

### **Scholarly Insight**

> Figure 3 in *Brauwers & Frasincar (2022)* presents a **three-dimensional taxonomy**—feature, computation, and query—that transforms attention from a collection of heuristics into a **unified mathematical framework**.  
> It enables systematic classification of all known mechanisms—from *Bahdanau* additive alignment to *Transformer* multi-head self-attention—within a single conceptual structure.


### **Table — Works Referenced in “Related Work” Discussion**

| **Author(s)** | **Year** | **Title** | **Venue** | **Connection to This Paper** |
|:---------------|:----------|:-----------|:-----------|:------------------------------|
| **Bahdanau, D., Cho, K., & Bengio, Y.** | 2015 | *Neural Machine Translation by Jointly Learning to Align and Translate* | ICLR | Introduced the first general neural attention mechanism for RNN-based translation; serves as the conceptual starting point of all modern attention research discussed in this survey. |
| **Luong, M.-T., Pham, H., & Manning, C. D.** | 2015 | *Effective Approaches to Attention-based Neural Machine Translation* | EMNLP | Proposed multiplicative and global/local alignment forms of attention; forms the basis of the “general” dimension of the paper’s taxonomy. |
| **Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., & Bengio, Y.** | 2015 | *Show, Attend and Tell: Neural Image Caption Generation with Visual Attention* | ICML | Introduced attention in computer vision and multimodal settings; foundational for cross-modal and co-attention mechanisms categorized in the survey. |
| **Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I.** | 2017 | *Attention Is All You Need* | NeurIPS | Proposed the Transformer architecture using multi-head self-attention; central to the paper’s query-related taxonomy dimension (self, multi-head, masked, cross-attention). |
| **Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., & Hovy, E.** | 2016 | *Hierarchical Attention Networks for Document Classification* | NAACL | Established hierarchical attention for multi-level feature structures; informs the paper’s “feature-level” category. |
| **Lu, J., Yang, J., Batra, D., & Parikh, D.** | 2016 | *Hierarchical Question-Image Co-Attention for Visual Question Answering* | NeurIPS | Proposed co-attention (alternating and parallel); cornerstone of the feature multiplicity taxonomy and comparative basis for cross-modal attention analysis. |
| **Wallart, G., & Frasincar, F.** | 2021 | *Rotatory Attention for Aspect-based Sentiment Analysis* | *Expert Systems with Applications* | Presents the “rotatory” attention mechanism, later classified under feature multiplicity in this survey’s taxonomy. |
| **Oktay, O., Schlemper, J., Le Folgoc, L., Lee, M., Heinrich, M., Misawa, K., et al.** | 2018 | *Attention U-Net: Learning Where to Look for the Pancreas* | *arXiv preprint* | Early application of attention in medical imaging; categorized under “multi-representational attention.” |
| **Li, J., Xiong, C., & Hoi, S. C.** | 2018 | *Learning to Attend and to Generate: Parallel and Multi-hop Attention Networks* | ACL | Example of multi-hop and parallel attention; contributes to the query multiplicity dimension of the taxonomy. |
| **Wang, J., et al.** | 2019 | *Capsule-Based Attention Networks* | AAAI | Integrates capsule routing with attention; illustrates expansion of query multiplicity beyond standard multi-head mechanisms. |
| **Yang, Z., He, X., Gao, J., Deng, L., & Smola, A.** | 2016 | *Hierarchical Attention for Recommender Systems* | RecSys | Demonstrates hierarchical attention outside NLP; supports the cross-domain generality of the proposed taxonomy. |
| **Itti, L., Koch, C., & Niebur, E.** | 1998 | *A Model of Saliency-Based Visual Attention for Rapid Scene Analysis* | IEEE PAMI | An early computational attention model in vision; included to ground modern neural attention in classical saliency theory. |
| **Hu, J., Shen, L., & Sun, G.** | 2018 | *Squeeze-and-Excitation Networks* | CVPR | Cited as an architectural instance of attention embedded in CNNs, bridging traditional vision models and the taxonomy’s “feature-related” class. |
| **Veličković, P., et al.** | 2018 | *Graph Attention Networks* | ICLR | Representative of attention applied to graph-structured data; anchors the survey’s cross-domain motivation. |
| **Li, X., et al.** | 2019 | *Attention-based Co-Representation for Medical Report Generation* | IEEE TMI | Cited as an example of co-attention in multimodal medical data; supports cross-domain generality of taxonomy. |
| **Liu, Q., et al.** | 2018 | *Context-Aware Attention for Video Recommendation* | ACM Multimedia | Example of co-attention in recommender systems; strengthens domain diversity of attention mechanisms reviewed. |

---

### **Summary and Scholarly Context**

The *Related Work* corpus in **Brauwers & Frasincar (2022)** integrates three primary strands of the attention literature:

1. **Foundational Works** — seminal contributions that define the conceptual and mathematical essence of attention:  
   *Bahdanau (2015), Luong (2015), Vaswani (2017).*  
   These works form the theoretical substrate for the paper’s unified framework.

2. **Architectural Expansions** — research introducing **structural and functional variants** such as co-attention, hierarchical, rotatory, multi-hop, and capsule-based mechanisms.  
   These expand the taxonomy’s *feature multiplicity* and *query multiplicity* dimensions.

3. **Domain-Specific Applications** — studies extending attention beyond NLP into **vision, medical imaging, graph networks, and recommender systems.**  
   These works demonstrate that the proposed taxonomy is **domain-agnostic** and scalable across modalities.

**In Essence:**  
The *Related Work* section positions the survey as a synthesis that unites decades of fragmented developments—from **saliency-based visual attention (Itti et al., 1998)** to **Transformer-based architectures (Vaswani et al., 2017)**—within a **coherent mathematical taxonomy** encompassing the **feature-related, general, and query-related** dimensions of modern deep learning attention.
