# ============================================
# Module 8: From Classical to Deep NLP
# LSTM Critical Thinking & Application Lab
# ============================================
# Author: Prof. Dr. Swati Chandna
# Course: M.Sc. Applied Data Science & AI
# --------------------------------------------
# Learning Goals:
# - Explain how attention distributes “focus” across words.
# - Connect the dot-product formula to linguistic intuition.
# - Implement and visualize scaled dot-product attention.
# ============================================

## Part 1 — Predict the Attention Weights

Before coding, let’s reason like the model!

### Task

1. Choose one or more of the following sentences:
   - **I love AI**
   - **AI loves me**
   - **Birds can fly**
   - **Students learn quickly**

2. For each sentence, fill in a small **attention weight table** (in code).  
   Each **row** corresponds to a *query word*, and each **column** to a *key word*.  
   The values in every row must sum to **1.0**.

   Example:

   | From → To | I | love | AI |
   |------------|---|------|----|
   | **I** | 0.25 | 0.35 | 0.40 |
   | **love** | 0.30 | 0.30 | 0.40 |
   | **AI** | 0.20 | 0.40 | 0.40 |



3. Discuss with your partner:
   - Which word does each word depend on most — and why?  
   - Would the pattern change if the sentence were *“AI loves me”*?  
   - Is attention symmetric (if *love → AI* = 0.7, does *AI → love* = 0.7)?

**Hint:** Think about *subject*, *verb*, and *object* roles.  
The verb often links both sides, so it typically shares its attention between the subject and the object.
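
One possible way to write a predicted table in code is sketched below. It reuses the numbers from the example table above; the variable names (`predicted`, `strongest`) are illustrative choices, not part of the assignment. The checks cover the row-sum rule and the symmetry question from the discussion.

```python
import numpy as np

words = ["I", "love", "AI"]

# Hand-written guesses copied from the example table
# (rows = query words, columns = key words).
predicted = np.array([
    [0.25, 0.35, 0.40],  # I    -> I, love, AI
    [0.30, 0.30, 0.40],  # love -> I, love, AI
    [0.20, 0.40, 0.40],  # AI   -> I, love, AI
])

# Each row is a probability distribution, so it must sum to 1.0.
row_sums = predicted.sum(axis=1)
assert np.allclose(row_sums, 1.0), f"rows must sum to 1, got {row_sums}"

# Symmetry check: is weight(i -> j) equal to weight(j -> i) for every pair?
is_symmetric = bool(np.allclose(predicted, predicted.T))

# Key word that each query word attends to most.
# Note: argmax returns the first maximum when a row contains a tie.
strongest = {w: words[j] for w, j in zip(words, predicted.argmax(axis=1))}

print("symmetric:", is_symmetric)
print("strongest:", strongest)
```

Swap in your own numbers for another sentence; the row-sum assertion fails loudly if a row does not add up to 1.0.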

---

##  Part 2 — Compute Scaled Dot-Product Attention

Now we’ll compute the same idea mathematically.

### Formula

$$
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right)V
$$
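
To get a feel for the $\sqrt{d_k}$ term, the following sketch compares raw and scaled dot products for several key sizes. It assumes query and key components are independent draws with zero mean and unit variance, which is a simplification and not a property of trained models; the sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, arbitrary choice

stds = {}
for d_k in (4, 64, 512):
    # 2000 random query/key pairs with unit-variance components (an assumption).
    q = rng.standard_normal((2000, d_k))
    k = rng.standard_normal((2000, d_k))

    raw = (q * k).sum(axis=1)     # one dot product per pair
    scaled = raw / np.sqrt(d_k)   # the scaling used in the formula

    stds[d_k] = (raw.std(), scaled.std())
    print(f"d_k={d_k:4d}  raw std={raw.std():6.2f}  scaled std={scaled.std():5.2f}")
```

Under this assumption the spread of the raw scores grows roughly like $\sqrt{d_k}$, while the scaled scores stay near 1, which keeps the softmax from collapsing onto a single word as the dimension grows.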

### Your Tasks

1. **Create small embeddings** for each word in your chosen sentence (e.g., 3–5 dimensions) and stack them as the rows of a matrix $X$.  

2. **Initialize random weight matrices** for queries, keys, and values:
   - $W_Q$, $W_K$, $W_V$

3. **Compute:**
   - $Q = X W_Q$
   - $K = X W_K$
   - $V = X W_V$

4. **Calculate attention scores**, where $d_k$ is the dimension of the key vectors (the number of columns of $K$):
   $$
   \text{scores} = \frac{Q K^{T}}{\sqrt{d_k}}
   $$

5. **Apply softmax** to obtain attention weights:  
   $$
   \text{weights} = \text{softmax}(\text{scores})
   $$

6. **Compute final outputs** (a matrix product, not an element-wise one):  
   $$
   \text{output} = \text{weights} \, V
   $$
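
The six steps can be sketched in NumPy as follows. This is a minimal illustration, not a reference solution: the embeddings and weight matrices are random and untrained, and the sizes `d_model = 4` and `d_k = 3`, the seed, and all variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the run is repeatable

words = ["I", "love", "AI"]
d_model = 4  # embedding size (task 1 suggests 3-5 dimensions)
d_k = 3      # size of queries, keys and values; arbitrary for this sketch

# 1. Random toy embeddings, one row per word (untrained, purely illustrative).
X = rng.standard_normal((len(words), d_model))

# 2. Random projection matrices for queries, keys and values.
W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))

# 3. Project the embeddings.
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# 4. Scaled dot-product scores: entry (i, j) compares query i with key j.
scores = Q @ K.T / np.sqrt(d_k)


def softmax(x, axis=-1):
    """Softmax along `axis`, shifted by the maximum for numerical stability."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


# 5. Row-wise softmax turns each row of scores into weights that sum to 1.
weights = softmax(scores, axis=-1)

# 6. Each output row is a weighted average of the value vectors.
output = weights @ V

print(np.round(weights, 3))
```

Because everything is random, the resulting weights carry no linguistic meaning; comparing them with your Part 1 predictions mainly shows what the formula does mechanically.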

---

### Visualize Attention

Create two visualizations for your sentence:

- **Heatmap:** shows how much each word attends to others.  
- **Arrows:** show direction and strength of attention between words.

Example visual goals:
- Darker cells = stronger attention.  
- Thicker arrows = stronger connections.
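
A possible starting point for both plots with Matplotlib is sketched below. The function name `plot_attention`, the colour map, and the line-width formula are illustrative choices; the example weights are the hand-written values from the Part 1 table, so replace them with the weights you computed.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_attention(weights, words):
    """Draw a heatmap and an arrow diagram for one attention matrix."""
    weights = np.asarray(weights)
    n = len(words)
    fig, (ax_heat, ax_arrow) = plt.subplots(1, 2, figsize=(9, 3.5))

    # Heatmap: rows are query words, columns are key words, darker = stronger.
    im = ax_heat.imshow(weights, cmap="Blues", vmin=0.0, vmax=1.0)
    ax_heat.set_xticks(range(n))
    ax_heat.set_xticklabels(words)
    ax_heat.set_yticks(range(n))
    ax_heat.set_yticklabels(words)
    ax_heat.set_xlabel("key (attended to)")
    ax_heat.set_ylabel("query (attending)")
    for i in range(n):
        for j in range(n):
            ax_heat.text(j, i, f"{weights[i, j]:.2f}", ha="center", va="center")
    fig.colorbar(im, ax=ax_heat)

    # Arrow diagram: words on a line, one curved arrow per ordered pair.
    for i, word in enumerate(words):
        ax_arrow.text(i, 0, word, ha="center", va="center",
                      bbox=dict(boxstyle="round", fc="white"))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # self-attention is visible in the heatmap only
            ax_arrow.annotate(
                "", xy=(j, 0), xytext=(i, 0),
                arrowprops=dict(arrowstyle="->",
                                lw=1 + 5 * weights[i, j],  # thicker = stronger
                                connectionstyle="arc3,rad=-0.4",
                                shrinkA=12, shrinkB=12, alpha=0.6),
            )
    ax_arrow.set_xlim(-0.5, n - 0.5)
    ax_arrow.set_ylim(-1, 1)
    ax_arrow.axis("off")
    fig.tight_layout()
    return fig


words = ["I", "love", "AI"]
example_weights = [[0.25, 0.35, 0.40],
                   [0.30, 0.30, 0.40],
                   [0.20, 0.40, 0.40]]
fig = plot_attention(example_weights, words)
# plt.show()  # uncomment when running as a script instead of in a notebook
```

Each arrow runs from the query word to the key word, so the two arrows between a pair of words can differ in thickness when the weights are not symmetric.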

---

### Reflection

- Which word places the most attention on other words rather than on itself?  
- Do verbs distribute attention differently than nouns?  
- How does sentence order affect attention patterns?  
- Would adding an adverb (*“I love AI deeply”*) shift the weights?
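
For the word-order question, the sketch below may help as an experiment. It assumes the bare formula from Part 2 with random, untrained embeddings and no positional information added to them (real Transformer models typically add positional encodings, which are omitted here). The helper `attention_weights` and all sizes are hypothetical choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, arbitrary choice


def softmax(x, axis=-1):
    """Softmax along `axis`, shifted by the maximum for numerical stability."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def attention_weights(X, W_Q, W_K):
    """Attention weights for embeddings X (one row per word)."""
    Q, K = X @ W_Q, X @ W_K
    return softmax(Q @ K.T / np.sqrt(K.shape[-1]))


# One fixed random embedding per word, shared by both word orders.
embeddings = {w: rng.standard_normal(4) for w in ["I", "love", "AI"]}
W_Q = rng.standard_normal((4, 3))
W_K = rng.standard_normal((4, 3))

original = ["I", "love", "AI"]
reordered = ["AI", "love", "I"]

A = attention_weights(np.stack([embeddings[w] for w in original]), W_Q, W_K)
B = attention_weights(np.stack([embeddings[w] for w in reordered]), W_Q, W_K)

# Position of each reordered word in the original sentence.
perm = [original.index(w) for w in reordered]

# Reordering the words only reorders the rows and columns of the weights.
same_up_to_reordering = bool(np.allclose(B, A[np.ix_(perm, perm)]))
print("same weights up to reordering:", same_up_to_reordering)
```

Under these assumptions each pair of words receives the same weight in both orders, so any order sensitivity has to come from information added to the embeddings, not from the attention formula itself.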

---

