## Encoder Decoder Architecture
![image.png](attachment:image.png)

![image.png](attachment:image.png)

# Attention Mechanism

# Problem with Encoder–Decoder (Seq2Seq) Architecture
## And the Need for the Attention Mechanism

---

## ✅ Recap: What Does Encoder–Decoder (Seq2Seq) Do?

In Seq2Seq models:
- The **encoder** reads the input sequence and compresses it into a **context vector** (usually the final hidden state)
- The **decoder** uses this context vector to generate the output sequence
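
To make the recap concrete, here is a minimal NumPy sketch of an encoder. It assumes a vanilla tanh RNN with randomly initialised weights; the names `encode`, `W_xh`, `W_hh`, and `b_h` are illustrative and not taken from any particular library. It shows that the context vector has the same size whether the input has 5 tokens or 50.

```python
import numpy as np

def encode(inputs, W_xh, W_hh, b_h):
    """Run a vanilla tanh RNN over `inputs` (shape: [T, input_dim]).

    Returns every hidden state (shape: [T, hidden_dim]) and the final
    state h_T, which a basic Seq2Seq model uses as its context vector.
    """
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states), h

# Assumption: small random weights stand in for a trained encoder.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

short_states, short_context = encode(rng.normal(size=(5, input_dim)), W_xh, W_hh, b_h)
long_states, long_context = encode(rng.normal(size=(50, input_dim)), W_xh, W_hh, b_h)

# The context vector does not grow with the input length.
print(short_context.shape, long_context.shape)  # (16,) (16,)
```

The encoder computes one hidden state per input token (`long_states` has 50 rows), but a basic decoder only ever receives the last one.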



---

## Problems with Encoder–Decoder Architecture (Without Attention)

### 1. **Fixed-Length Bottleneck**
- All the input information is squashed into **a single final hidden state** (`h_T`)
- This **limits capacity**, especially with **long or complex inputs**
- Example: translating a whole paragraph from a single vector tends to drop or distort details

---

### 2. **Information Loss**
- Early input tokens (e.g., the first few words) may be **forgotten**
- The decoder only sees `h_T`, not the full sequence of encoder states
- This leads to **missing or incorrect words** in the output
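
The forgetting effect can be illustrated with a toy experiment. This sketch assumes a vanilla tanh RNN whose recurrent matrix is rescaled to have spectral norm 0.5, which deliberately exaggerates the effect; gated cells such as LSTMs and GRUs retain more, but they still pass only a single vector to the decoder. The helper names (`final_state`, `effect_of_changing`) are hypothetical. It changes one input token at a time and measures how far `h_T` moves.

```python
import numpy as np

def final_state(inputs, W_xh, W_hh):
    """Return h_T of a vanilla tanh RNN run over `inputs` ([T, input_dim])."""
    h = np.zeros(W_hh.shape[0])
    for x_t in inputs:
        h = np.tanh(W_xh @ x_t + W_hh @ h)
    return h

rng = np.random.default_rng(0)
T, input_dim, hidden_dim = 20, 4, 8
W_xh = rng.normal(scale=0.3, size=(hidden_dim, input_dim))
W_hh = rng.normal(size=(hidden_dim, hidden_dim))
# Assumption: force the largest singular value to 0.5 so that each step
# shrinks the influence of earlier hidden states.
W_hh *= 0.5 / np.linalg.norm(W_hh, 2)

inputs = rng.normal(size=(T, input_dim))
baseline = final_state(inputs, W_xh, W_hh)

def effect_of_changing(position):
    """Distance h_T moves when the token at `position` is perturbed."""
    perturbed = inputs.copy()
    perturbed[position] += 1.0
    return np.linalg.norm(final_state(perturbed, W_xh, W_hh) - baseline)

first_token_effect = effect_of_changing(0)
last_token_effect = effect_of_changing(T - 1)
print(f"first token: {first_token_effect:.2e}, last token: {last_token_effect:.2e}")
```

In this setup the first token's influence on `h_T` is several orders of magnitude smaller than the last token's, so a decoder that sees only `h_T` has almost no access to the start of the input.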

---

### 3. **No Dynamic Focus**
- The decoder cannot choose **which parts of the input** to pay attention to
- It always relies on **the same fixed vector**, regardless of which word it's generating

---

## Solution: Add the **Attention Mechanism**

> Attention allows the decoder to **look at all hidden states** of the encoder, not just the last one.

### What Attention Does:
- For every output token, it **computes weights** over all encoder hidden states
- It produces a **weighted sum** of those states (called the "context vector")
- This **context vector changes at each time step**, allowing **dynamic focus**
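
One common way to write these three steps, assuming a decoder state $s_{t-1}$ and a scoring function $\mathrm{score}$ (for example a dot product or a small feed-forward network; neither is specified above), is:

$$
e_{t,i} = \mathrm{score}(s_{t-1}, h_i), \qquad
\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{T} \exp(e_{t,j})}, \qquad
c_t = \sum_{i=1}^{T} \alpha_{t,i}\, h_i
$$

Here $h_1, \dots, h_T$ are the encoder hidden states, $\alpha_{t,i}$ is the weight placed on $h_i$ when generating output token $t$, and $c_t$ is the context vector for that step. The weights are non-negative and sum to 1 for each $t$.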


Encoder hidden states `[h1, h2, h3, h4]` → attention weights `(α1, α2, α3, α4)` → context vector for each output token
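
The flow from hidden states to weights to a context vector can be sketched in NumPy. This is a minimal illustration that assumes dot-product scoring and random vectors in place of trained states; `attention_context` and `decoder_states` are hypothetical names, and real models usually learn the scoring function.

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

def attention_context(decoder_state, encoder_states):
    """Return (context vector, attention weights) for one decoder step.

    Assumption: the score is a plain dot product between the decoder
    state and each encoder hidden state.
    """
    scores = encoder_states @ decoder_state   # one score per encoder state
    weights = softmax(scores)                 # α1..α4
    context = weights @ encoder_states        # weighted sum of h1..h4
    return context, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(4, 6))  # [h1, h2, h3, h4], hidden size 6
decoder_states = rng.normal(size=(3, 6))  # hypothetical states for 3 output steps

results = [attention_context(s, encoder_states) for s in decoder_states]
for step, (context, weights) in enumerate(results, start=1):
    print(f"step {step}: weights = {np.round(weights, 2)}")
```

Each output step gets its own weights and therefore its own context vector, unlike the single `h_T` that a basic Seq2Seq decoder reuses at every step.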



---

## Benefits of Attention

| Problem in Basic Seq2Seq        | Attention Fix                                        |
|---------------------------------|------------------------------------------------------|
| Fixed-length bottleneck         | Uses **all encoder hidden states**                   |
| Long input → info loss          | Focuses dynamically on relevant parts                |
| Decoder sees one fixed context  | Gets a **custom context** for each time step         |
| Poor performance on long input  | Typically better translation and summarization on long inputs |

---

## Final Takeaway

> The basic encoder–decoder compresses all information into **one fixed vector**, which leads to **loss of important context**, especially for long or complex sequences.  
> The **attention mechanism** addresses this by allowing the decoder to **look at different parts of the input** at every step, which enables more accurate, context-aware outputs.

