
---

### 🔹 **1. Masking in the Encoder**

In the **encoder**, masking is mostly used for **padding tokens**:

- **Padding Mask (or Attention Mask)**:
  - Purpose: Prevents the model from attending to padded positions (`[PAD]`) that were added to make sequences the same length in a batch.
  - Usage: Applied in the encoder's self-attention layers, and again in the decoder's encoder-decoder (cross) attention, where the keys and values come from the encoder output.
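As a minimal sketch of where such a mask comes from, the snippet below derives it from a batch of token IDs. The padding index `PAD_ID = 0` and the token IDs themselves are assumptions made up for illustration; real tokenizers define their own values.

```python
import torch

PAD_ID = 0  # assumed padding index, not taken from any specific tokenizer

# Two sequences of (made-up) token IDs; the second is padded to length 6.
token_ids = torch.tensor([
    [5, 8, 3, 9, 5, 7],
    [4, 6, 2, PAD_ID, PAD_ID, PAD_ID],
])

# True at real tokens, False at padded positions.
keep = token_ids != PAD_ID                # [batch_size, seq_len]

# Add singleton head and query dimensions so the mask can broadcast over
# attention scores of shape [batch_size, heads, query_len, key_len].
padding_mask = keep[:, None, None, :]     # [batch_size, 1, 1, seq_len]
```

The mask only has a real extent along the key dimension, because padding is a property of the positions being attended to.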

---

### 🔹 **2. Masking in the Decoder**

In the **decoder**, we use **two types of masks**:

#### a. **Causal Mask (Look-Ahead Mask):**
- Purpose: Prevents the decoder from "cheating" by looking at future tokens during training.
- Implementation: Usually a triangular matrix where positions `[i][j]` are masked if `j > i`.
- Effect: Each token can only attend to itself and previous tokens (i.e., autoregressive behavior).

#### b. **Padding Mask (from encoder output):**
- Purpose: When the decoder attends to the encoder's outputs (via encoder-decoder attention), this mask blocks attention to the encoder's padded positions.
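A rough sketch of encoder-decoder attention with this mask is shown below. The sizes, the random scores, and the names `src_keep` and `weights` are all illustrative assumptions. The point is that the mask follows the source (encoder) length, not the target length.

```python
import torch

batch_size, heads, tgt_len, src_len = 2, 4, 3, 6  # assumed sizes

# Random stand-in for decoder-query x encoder-key attention scores.
scores = torch.randn(batch_size, heads, tgt_len, src_len)

# Encoder padding mask: the second source sequence ends in 3 padded positions.
src_keep = torch.tensor([
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
], dtype=torch.bool)[:, None, None, :]    # [batch_size, 1, 1, src_len]

# Block padded encoder positions for every head and every decoder position.
scores = scores.masked_fill(~src_keep, float("-inf"))
weights = torch.softmax(scores, dim=-1)   # [batch_size, heads, tgt_len, src_len]
```

Every decoder position in the second example ends up with zero weight on the three padded encoder positions.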

---



### 🔹 1. **Encoder Padding Mask Example**

Let's say we have a batch with two sequences:
```text
Seq 1: [The, cat, sat, on, the, mat]
Seq 2: [Dogs, bark, loudly, <PAD>, <PAD>, <PAD>]
```

#### ➤ Padding Mask for these:
We mask out the `<PAD>` tokens:

```python
import torch

# Shape: [batch_size, 1, 1, seq_len] = [2, 1, 1, 6]
# 1 = keep (attend), 0 = mask (do not attend)
mask = torch.tensor([
    [[[1, 1, 1, 1, 1, 1]]],   # Seq 1: no padding
    [[[1, 1, 1, 0, 0, 0]]],   # Seq 2: mask the last 3 positions
])
```

This mask is **broadcast** across attention heads and query positions during self-attention, so no token attends to a padding position.
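To make the broadcasting concrete, here is a small sketch that applies a mask of this shape to self-attention scores. The number of heads and the all-zero scores are assumptions chosen so that the resulting weights are easy to read.

```python
import torch

# 1 = keep, 0 = mask; shape [batch_size, 1, 1, seq_len] = [2, 1, 1, 6]
mask = torch.tensor([
    [[[1, 1, 1, 1, 1, 1]]],
    [[[1, 1, 1, 0, 0, 0]]],
])

num_heads, seq_len = 2, 6  # assumed sizes

# Uniform (all-zero) scores, so any non-uniformity comes from the mask alone.
scores = torch.zeros(2, num_heads, seq_len, seq_len)

# The [2, 1, 1, 6] mask broadcasts over heads and query positions.
masked_scores = scores.masked_fill(mask == 0, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)
```

For the first sequence each row spreads its weight evenly over all six tokens. For the second, the weight is split evenly over the three real tokens and the padded positions receive none.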

---

### 🔹 2. **Decoder Causal (Look-Ahead) Mask Example**

Say the target sequence is:
```text
["I", "love", "you"]
```

We don’t want "love" to see "you" during training. So, the **look-ahead mask** is:

```python
import torch

seq_len = 3
# 1s strictly above the diagonal mark future positions (j > i)
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)
# Turn it into an additive mask: -inf at future positions, 0 elsewhere
mask = mask.masked_fill(mask == 1, float('-inf'))
```

#### ➤ Resulting mask (added to the attention scores before softmax):
```text
[[  0.  -inf  -inf]
 [  0.    0.  -inf]
 [  0.    0.    0.]]
```

When softmax is applied row-wise to the masked scores:
- `-inf` entries become **0 probability**, because `exp(-inf) = 0`.
- Valid positions share the full probability mass, so each row still sums to 1.
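The effect on the three-token example can be checked with a short sketch. The all-zero raw scores are an assumption for readability; in a real model they would come from the query-key dot products.

```python
import torch

seq_len = 3  # "I", "love", "you"

# Additive causal mask: -inf strictly above the diagonal, 0 elsewhere.
causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)
causal_mask = causal_mask.masked_fill(causal_mask == 1, float("-inf"))

# Uniform raw scores make the effect of the mask easy to see.
scores = torch.zeros(seq_len, seq_len)

# Add the mask, then normalize each row.
weights = torch.softmax(scores + causal_mask, dim=-1)
print(weights)
```

With uniform scores, "I" puts all of its weight on itself, "love" splits its weight evenly between "I" and "love", and "you" spreads it evenly over all three tokens.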

#### ➤ Visual (Heatmap-style):

| Query \ Attends to | 0 (I) | 1 (love) | 2 (you) |
|--------------------|-------|----------|---------|
| 0 (I)              | ✅    | ❌       | ❌      |
| 1 (love)           | ✅    | ✅       | ❌      |
| 2 (you)            | ✅    | ✅       | ✅      |

This enforces autoregressive prediction: each token can only attend to earlier tokens and itself.

---
