# 📚 Table of Contents

- [🎯 Understanding Attention Mechanisms](#understanding-attention-mechanisms)
  - [👁️ What is attention in the context of NLP?](#what-is-attention-in-the-context-of-nlp)
  - [🚀 Why attention mechanisms improve sequence-to-sequence models](#why-attention-mechanisms-improve-sequence-to-sequence-models)
  - [🧪 Example: Implementing Bahdanau attention for sequence-to-sequence tasks](#example-implementing-bahdanau-attention-for-sequence-to-sequence-tasks)
- [🎯 Bahdanau Attention (Additive Attention)](#bahdanau-attention-additive-attention)
  - [🔑 Key components: Query, Key, and Value](#key-components-query-key-and-value)
  - [📚 How Bahdanau attention works and its application in machine translation](#how-bahdanau-attention-works-and-its-application-in-machine-translation)
  - [🧪 Example: Using Bahdanau attention with an RNN encoder-decoder](#example-using-bahdanau-attention-with-an-rnn-encoder-decoder)
- [🧬 Transformers and Self-Attention](#transformers-and-self-attention)
  - [📘 Introduction to transformers and their self-attention mechanism](#introduction-to-transformers-and-their-self-attention-mechanism)
  - [⚡ How transformers outperform RNNs](#how-transformers-outperform-rnns)
  - [🛠️ Example: Building a simple transformer model for text classification](#example-building-a-simple-transformer-model-for-text-classification)

---


### **1. Core Attention Mechanism Overview**
**Diagram Type:** Comparative Architecture Flow  
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'fontSize': '14px'}}}%%
flowchart LR
    %% Traditional Seq2Seq vs Attention
    subgraph NoAttn["Traditional Seq2Seq"]
        direction TB
        EncoderRNN -->|Last Hidden State| DecoderRNN
    end
    
    subgraph WithAttn["Seq2Seq with Attention"]
        direction TB
        EncoderAll[All Encoder States] --> Attention
        DecoderStep[Decoder State] --> Attention
        Attention --> Context[Context Vector]
        Context --> DecoderAttn[DecoderRNN]
    end
    
    %% Key Annotation
    note[["Attention allows dynamic context weighting<br/>instead of fixed last state"]]:::yellow
    WithAttn ~~~ note
    
    classDef yellow fill:#ffffcc,stroke:#ffcc00
    linkStyle 0,1,2,3,4,5 stroke:#999,stroke-width:1px
```

---

### **2. Bahdanau Attention Mechanics**
**Diagram Type:** Step-by-Step Process Flow  
```mermaid
%%{init: {'theme': 'base', 'themeVariables': {'fontSize': '14px'}}}%%
flowchart TD
    %% Components
    Q[Decoder Query]:::blue --> Score[Attention Scores]
    K[Encoder Keys]:::green --> Score
    Score --> Weights[Softmax Weights]
    V[Encoder Values]:::red --> Context[Weighted Sum]
    Weights --> Context

    %% Mathematical Details
    Score -.-> Formula["e_ij = v^T tanh(W1*h_j + W2*s_(i-1))"]:::formula

    %% Implementation Flow
    Context --> DecoderStep[Decoder Output]

    %% Style Definitions
    classDef blue fill:#e6f3ff,stroke:#0066cc
    classDef green fill:#e6ffe6,stroke:#009900
    classDef red fill:#ffe6e6,stroke:#cc0000
    classDef formula fill:#f0e6ff,stroke:#6600cc,font-size:12px
    linkStyle 0,1,2,3,4,5 stroke:#999,stroke-width:1px
```
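
The flow in the diagram above (score, softmax, weighted sum) can be traced in a few lines of NumPy. This is a minimal sketch for a single decoder step, not a reference implementation: the function name `additive_attention`, the shapes, and the random weights are assumptions made for illustration, and `W1` is applied to the encoder states while `W2` is applied to the decoder state, following the formula node.

```python
import numpy as np

def additive_attention(query, keys, values, W1, W2, v):
    """Additive (Bahdanau-style) attention for one decoder step.

    query:  (d_dec,)    current decoder state
    keys:   (T, d_enc)  encoder states used for scoring
    values: (T, d_enc)  encoder states that are averaged
    W1: (d_enc, units), W2: (d_dec, units), v: (units,)
    """
    # One scalar score per encoder position: v^T tanh(W1*h + W2*s)
    scores = np.tanh(keys @ W1 + query @ W2) @ v   # (T,)
    # Softmax over encoder positions, shifted for numerical stability
    exp = np.exp(scores - scores.max())
    weights = exp / exp.sum()                      # (T,)
    # Context vector: attention-weighted sum of the values
    context = weights @ values                     # (d_enc,)
    return context, weights

# Hypothetical sizes and random parameters, for illustration only
rng = np.random.default_rng(0)
T, d_enc, d_dec, units = 5, 8, 6, 4
enc_states = rng.normal(size=(T, d_enc))
dec_state = rng.normal(size=(d_dec,))
W1 = rng.normal(size=(d_enc, units))
W2 = rng.normal(size=(d_dec, units))
v = rng.normal(size=(units,))

# In this setup the encoder states serve as both keys and values
context, weights = additive_attention(dec_state, enc_states, enc_states, W1, W2, v)
print(weights.shape, context.shape)
```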
---

### **3. Transformer Self-Attention**
**Diagram Type:** Matrix Operation Diagram  
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'fontSize': '14px'}}}%%
flowchart LR
    %% Input Processing
    X[Input Embeddings] --> QKV[Q,K,V Matrices]
    QKV -->|Split| Q[Query]:::blue
    QKV -->|Split| K[Key]:::green
    QKV -->|Split| V[Value]:::red
    
    %% Attention Calculation
    Q --> MatMul[Q×K<sup>T</sup>]
    K --> MatMul
    MatMul --> Scale["Scale/√d<sub>k</sub>"]
    Scale --> Softmax[Softmax]
    Softmax --> Attn[Attention Weights]
    Attn --> Output[Weighted Sum of V]
    V --> Output
    
    classDef blue fill:#e6f3ff,stroke:#0066cc
    classDef green fill:#e6ffe6,stroke:#009900
    classDef red fill:#ffe6e6,stroke:#cc0000
    linkStyle 0,1,2,3,4,5 stroke:#999,stroke-width:1px
```
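
The matrix operations in this diagram correspond to scaled dot-product attention. The sketch below is a single-head version in NumPy with no masking. The helper names, the sizes, and the use of three separate projection matrices (rather than one fused Q,K,V matrix that is split) are assumptions made to keep the example short.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over one sequence.

    X: (T, d_model) input embeddings; Wq, Wk, Wv: (d_model, d_k) projections.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (T, T): every position scores every other
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted sum of the value vectors

# Hypothetical sizes and random parameters, for illustration only
rng = np.random.default_rng(0)
T, d_model, d_k = 4, 8, 8
X = rng.normal(size=(T, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, attn = self_attention(X, Wq, Wk, Wv)
print(output.shape, attn.shape)
```

Every row of `attn` is computed from the same matrix product, so no position has to wait for the previous one, unlike the step-by-step updates of an RNN.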

---

### **4. Full Transformer Architecture**
**Diagram Type:** Layered Block Diagram  
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'fontSize': '14px'}}}%%
flowchart TD
    subgraph Transformer["Transformer Block"]
        direction LR
        Input --> PosEnc[Positional Encoding]
        PosEnc --> MultiAttn[Multi-Head Attention]
        MultiAttn --> AddNorm[Add & Norm]
        AddNorm --> FFN[Feed Forward]
        FFN --> AddNorm2[Add & Norm]
        AddNorm2 --> Output
    end
    
    %% Key Features
    note[["Parallel processing of all positions<br/>No recurrence → faster training"]]:::yellow
    Transformer ~~~ note
    
    classDef yellow fill:#ffffcc,stroke:#ffcc00
    linkStyle 0,1,2,3,4,5 stroke:#999,stroke-width:1px
```
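
As a rough illustration of the block in this diagram, the sketch below runs one forward pass in NumPy. Several choices are assumptions rather than anything the diagram specifies: sinusoidal positional encodings, a single attention head instead of multi-head attention, layer normalization without learned scale and shift, a ReLU feed-forward layer, and an Add & Norm step after both sub-layers. All names and sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector (no learned gain or bias here)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def positional_encoding(T, d_model):
    # Sinusoidal encoding: sine on even feature indices, cosine on odd ones
    pos = np.arange(T)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

def encoder_block(x, p):
    # Self-attention sub-layer (single head for brevity)
    Q, K, V = x @ p["Wq"], x @ p["Wk"], x @ p["Wv"]
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V
    x = layer_norm(x + attn @ p["Wo"])                # Add & Norm
    # Position-wise feed-forward sub-layer
    ffn = np.maximum(0.0, x @ p["W1"] + p["b1"]) @ p["W2"] + p["b2"]
    return layer_norm(x + ffn)                        # Add & Norm

# Hypothetical sizes and random parameters, for illustration only
rng = np.random.default_rng(0)
T, d_model, d_ff = 6, 16, 32
params = {name: rng.normal(size=(d_model, d_model)) * 0.1
          for name in ("Wq", "Wk", "Wv", "Wo")}
params.update({
    "W1": rng.normal(size=(d_model, d_ff)) * 0.1, "b1": np.zeros(d_ff),
    "W2": rng.normal(size=(d_ff, d_model)) * 0.1, "b2": np.zeros(d_model),
})

tokens = rng.normal(size=(T, d_model))       # stand-in for input embeddings
x = tokens + positional_encoding(T, d_model)
out = encoder_block(x, params)
print(out.shape)
```

The output has the same shape as the input, which is what allows blocks like this to be stacked.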

---

### **Implementation Examples**

**1. Bahdanau Attention Code Snippet**  
```python
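import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)  # projects the encoder states (values)
        self.W2 = tf.keras.layers.Dense(units)  # projects the decoder state (query)
        self.V = tf.keras.layers.Dense(1)       # maps each position to a scalar score

    def call(self, query, values):
        query = tf.expand_dims(query, 1)  # (batch, 1, hidden): broadcasts over time steps
        score = self.V(tf.nn.tanh(self.W1(values) + self.W2(query)))  # (batch, time, 1)
        weights = tf.nn.softmax(score, axis=1)  # normalize over encoder positions
        return tf.reduce_sum(weights * values, axis=1)  # context vector: (batch, hidden)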
```

**2. Transformer Text Classification**  
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'fontSize': '14px'}}}%%
flowchart TD
    Input[Text] --> Emb[Embedding]
    Emb --> Pos[Positional Encoding]
    Pos --> T1[Transformer Block] 
    T1 --> T2[Transformer Block]
    T2 --> Pool[Global Average Pooling]
    Pool --> Dense[Classifier]
    linkStyle 0,1,2,3,4,5 stroke:#999,stroke-width:1px
```
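
The last two stages of this pipeline, pooling and classification, can be sketched separately from the transformer blocks. The NumPy example below uses a random array as a stand-in for the output of the final block. The padding mask, the softmax classifier head, and all names and sizes are assumptions added for illustration; the diagram itself only specifies global average pooling followed by a classifier.

```python
import numpy as np

def masked_average_pool(hidden, mask):
    """Average token vectors over time, ignoring padded positions.

    hidden: (batch, T, d_model); mask: (batch, T) with 1 for real tokens, 0 for padding.
    """
    mask = mask[..., None].astype(hidden.dtype)     # (batch, T, 1)
    summed = (hidden * mask).sum(axis=1)            # (batch, d_model)
    counts = np.maximum(mask.sum(axis=1), 1.0)      # avoid division by zero
    return summed / counts

def classify(pooled, W, b):
    # Linear layer followed by a softmax over classes
    logits = pooled @ W + b
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

# Hypothetical sizes and random parameters, for illustration only
rng = np.random.default_rng(0)
batch, T, d_model, num_classes = 2, 5, 8, 3
hidden = rng.normal(size=(batch, T, d_model))   # stand-in for the last block's output
mask = np.array([[1, 1, 1, 0, 0],               # first sequence has two padded positions
                 [1, 1, 1, 1, 1]])
W = rng.normal(size=(d_model, num_classes))
b = np.zeros(num_classes)

pooled = masked_average_pool(hidden, mask)
probs = classify(pooled, W, b)
print(probs.shape)
```

Pooling collapses a variable-length sequence into one fixed-size vector, so the classifier does not depend on the input length.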

---




# <a id="understanding-attention-mechanisms"></a>🎯 Understanding Attention Mechanisms

## <a id="what-is-attention-in-the-context-of-nlp"></a>👁️ What is attention in the context of NLP?

## <a id="why-attention-mechanisms-improve-sequence-to-sequence-models"></a>🚀 Why attention mechanisms improve sequence-to-sequence models

## <a id="example-implementing-bahdanau-attention-for-sequence-to-sequence-tasks"></a>🧪 Example: Implementing Bahdanau attention for sequence-to-sequence tasks

---

# <a id="bahdanau-attention-additive-attention"></a>🎯 Bahdanau Attention (Additive Attention)

## <a id="key-components-query-key-and-value"></a>🔑 Key components: Query, Key, and Value

## <a id="how-bahdanau-attention-works-and-its-application-in-machine-translation"></a>📚 How Bahdanau attention works and its application in machine translation

## <a id="example-using-bahdanau-attention-with-an-rnn-encoder-decoder"></a>🧪 Example: Using Bahdanau attention with an RNN encoder-decoder

---

# <a id="transformers-and-self-attention"></a>🧬 Transformers and Self-Attention

## <a id="introduction-to-transformers-and-their-self-attention-mechanism"></a>📘 Introduction to transformers and their self-attention mechanism

## <a id="how-transformers-outperform-rnns"></a>⚡ How transformers outperform RNNs

## <a id="example-building-a-simple-transformer-model-for-text-classification"></a>🛠️ Example: Building a simple transformer model for text classification

---
