# Understanding Query, Key, and Value: A Deep Dive

Let's understand these concepts through **multiple analogies** and **step-by-step examples**.

## The Big Picture

**Query, Key, and Value** are three different "views" or "perspectives" of the same words in a sentence. Think of them as three different questions we ask about each word:

- **Query (Q)**: "What am I looking for?"
- **Key (K)**: "What information do I contain that others might search for?"
- **Value (V)**: "What is my actual content that I'll share?"

Let's explore this through several analogies!

## 📚 Analogy 1: The Library System

Imagine you're in a library searching for information.

### The Scenario
You walk into a library and ask: **"I need information about machine learning"**

Here's how Q, K, V work:

| Concept | Library Analogy | What It Does |
|---------|----------------|-------------|
| **Query** | Your search request: "machine learning" | Represents what you're looking for |
| **Key** | Book titles/labels on shelves: "AI Textbook", "Neural Networks Guide", "Cooking Recipes" | Help you find relevant books |
| **Value** | Actual book contents | The information you actually take away |

### The Process

1. **You have a Query**: "machine learning"
2. **You check Keys** (book titles):
   - "AI Textbook" ✅ (relevant! high match)
   - "Neural Networks Guide" ✅ (relevant! high match)
   - "Cooking Recipes" ❌ (not relevant, low match)
3. **You read the Values** (contents) from books with high matches
4. **You combine** the information proportionally:
   - 50% from "AI Textbook"
   - 50% from "Neural Networks Guide"
   - 0% from "Cooking Recipes"

### Key Insight
**The title (Key) helps you find the book, but the content (Value) is what you actually learn from!**
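
One way to make the library picture concrete is to compare it with a soft lookup in code. The sketch below is illustrative only: the 2-D vectors for the query and the book titles, and the single-number "contents", are made-up values chosen so that the two relevant books tie and the cookbook scores far lower.

```python
import numpy as np

# Hypothetical 2-D "topic" vectors; the numbers are invented for illustration.
query = np.array([4.0, 0.0])                      # "machine learning"
keys = np.array([
    [1.0, 0.0],    # "AI Textbook"
    [1.0, 0.0],    # "Neural Networks Guide"
    [-1.0, 1.0],   # "Cooking Recipes"
])
# Stand-in "contents": one number per book instead of a full vector.
values = np.array([10.0, 20.0, 99.0])

scores = keys @ query                             # one match score per book
weights = np.exp(scores - scores.max())           # softmax, shifted for stability
weights /= weights.sum()

blended = weights @ values                        # weighted mix of the contents

print("weights:", np.round(weights, 3))
print("blended value:", round(float(blended), 2))
```

Unlike a Python `dict`, which returns exactly one entry for an exact key match, this lookup returns a blend. The cookbook's weight comes out tiny but not exactly zero, because softmax never assigns a weight of exactly 0.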

## 🔍 Analogy 2: Google Search

This is even more intuitive!

### When you search on Google:

```
You type: "best pizza in New York"  ← This is your QUERY
```

### What happens:

1. **Query (Q)**: Your search term: "best pizza in New York"

2. **Keys (K)**: Metadata/tags of each webpage:
   - Page 1: ["pizza", "New York", "restaurant", "food"] ✅
   - Page 2: ["pizza", "Chicago", "deep dish"] ⚠️
   - Page 3: ["cars", "vehicles", "New York"] ❌

3. **Matching**: Google compares your Query against Keys:
   - Page 1: High relevance (0.9) - has "pizza" AND "New York"
   - Page 2: Medium relevance (0.3) - has "pizza" but wrong city
   - Page 3: Low relevance (0.1) - only has "New York"

4. **Values (V)**: Actual content of webpages
   - Page 1 Value: "Joe's Pizza on 42nd Street has amazing slices..."
   - Page 2 Value: "Chicago deep dish is characterized by..."
   - Page 3 Value: "The best cars to drive in New York are..."

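5. **Final Result**: In the attention picture, what you get back is a **weighted combination**. Dividing each relevance score by the total (0.9 + 0.3 + 0.1 = 1.3) gives weights that sum to 1 (real attention uses softmax for this normalization step):
   - about 69% of what you see comes from Page 1 (high relevance)
   - about 23% from Page 2 (medium relevance)
   - about 8% from Page 3 (low relevance)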

### The Magic
You don't search the actual content (Values) directly - that would be too slow! Instead:
- Keys are like **quick summaries** for fast matching
- Values are the **full content** you actually want
- Query is **what you're looking for**
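
The relevance scores from the matching step (0.9, 0.3, 0.1) are not yet weights, because they do not sum to 1. The sketch below turns them into weights in two ways: plain division by the total, and softmax, which is what attention uses. The scores are the illustrative ones from the analogy, not real search data.

```python
import numpy as np

scores = np.array([0.9, 0.3, 0.1])   # Page 1, Page 2, Page 3

# Option 1: divide by the total so the weights sum to 1.
proportional = scores / scores.sum()

# Option 2: softmax, the normalization used inside attention.
exp_scores = np.exp(scores - scores.max())
softmax_weights = exp_scores / exp_scores.sum()

print("proportional:", np.round(proportional, 3))
print("softmax:     ", np.round(softmax_weights, 3))
```

Both options keep the ranking of the pages. On scores this close together softmax gives flatter weights than plain division, so how sharply attention focuses depends on the scale of the scores.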

## 📝 Analogy 3: Understanding a Sentence (The Real Use Case)

Now let's see how this works for actual language.

### Sentence: "The cat sat on the mat"

Let's focus on the word **"sat"** and see what it should pay attention to.

### For the word "sat":

**Query (from "sat")**: "I'm a verb. Who is performing me? Where am I happening?"
- This is like "sat" asking: "What should I pay attention to?"

**Keys (from all words)**:
- "The": "I'm an article, not very important" ⚪
- "cat": "I'm a noun, an ANIMAL, could be a subject of action!" 🟢
- "sat": "That's me" ⚪
- "on": "I'm a preposition, showing relationships" 🟡
- "the": "I'm an article" ⚪
- "mat": "I'm a noun, a LOCATION where things happen!" 🟢

**Matching (Query of "sat" against all Keys)**, shown as illustrative attention weights that sum to 1:
- "sat" × "The": 0.04 (low match)
- "sat" × "cat": 0.38 (high match! verbs care about their subjects)
- "sat" × "sat": 0.08 (words do pay some attention to themselves)
- "sat" × "on": 0.12 (medium match)
- "sat" × "the": 0.04 (low match)
- "sat" × "mat": 0.34 (high match! verbs care about locations)

**Values (the actual meaning each word contributes)**:
- "The": [simple article information]
- "cat": [animal, furry, pet, subject of action]
- "sat": [action, past tense, resting position]
- "on": [spatial relationship, above surface]
- "the": [simple article information]
- "mat": [object, flat surface, location]

**Final representation of "sat" after attention**:
```
New "sat" = 0.38 × [cat info] + 0.34 × [mat info] + 0.12 × [on info] + 0.08 × [sat info] + ...
           = "sitting action performed by a cat on a mat surface"
```

The word "sat" now has a **richer representation** that includes context about WHO sat and WHERE!

In [None]:
# Let's visualize this with actual code!

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set style
sns.set_style("whitegrid")
np.random.seed(42)

print("Let's build intuition with a concrete example!")

## 🎯 Concrete Example: Step by Step

Let's work through a **tiny example** with actual numbers.

### Sentence: "cat sat"

We'll use 3-dimensional vectors to keep it simple.

In [None]:
# Step 1: Start with word embeddings (these come from a lookup table)
# In real transformers, these are learned 512 or 768 dimensional vectors
# We use 3D for visualization

embeddings = {
    'cat': np.array([1.0, 0.5, 0.2]),  # represents the word "cat"
    'sat': np.array([0.3, 1.0, 0.8])   # represents the word "sat"
}

print("Original Word Embeddings:")
print("cat:", embeddings['cat'])
print("sat:", embeddings['sat'])
print("\nThese are just vector representations of words.")
print("Think of them as coordinates in 'meaning space'")

In [None]:
# Step 2: Create transformation matrices
# These are LEARNED during training
# They transform embeddings into Query, Key, and Value spaces

# Small weight matrices (3x3) for our 3D embeddings
W_query = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0]
]) * 0.5

W_key = np.array([
    [0.8, 0.2, 0.0],
    [0.2, 0.8, 0.0],
    [0.0, 0.0, 1.0]
]) * 0.5

W_value = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0]
]) * 0.8

print("Weight Matrices created!")
print("These transform embeddings into Q, K, V representations")

In [None]:
# Step 3: Compute Query, Key, Value for each word

Q = {}
K = {}
V = {}

for word in ['cat', 'sat']:
    embedding = embeddings[word]
    
    Q[word] = W_query @ embedding  # Matrix multiplication
    K[word] = W_key @ embedding
    V[word] = W_value @ embedding

print("="*60)
print("STEP 3: Creating Q, K, V for each word")
print("="*60)

for word in ['cat', 'sat']:
    print(f"\nWord: '{word}'")
    print(f"  Original embedding: {embeddings[word]}")
    print(f"  Query (Q):  {Q[word]}  ← 'What am I looking for?'")
    print(f"  Key (K):    {K[word]}  ← 'What do I offer?'")
    print(f"  Value (V):  {V[word]}  ← 'What is my content?'")

## 🔑 The Critical Insight

### Why do we need THREE different representations?

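Each word plays **two roles** simultaneously, and the provider role is split in two:

1. **As a searcher (Query)**: "What information do I need from other words?"
2. **As a provider (Key & Value)**: "What information can I provide to other words?"

The Key decides whether a word gets found; the Value decides what it hands over once found. Keeping them separate lets a word be matched on one property and contribute another.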

#### Example with "sat":

**When "sat" is the QUERY (searching):**
- "I'm a verb. I need to know who performed me (subject) and where (location)"
- So "sat"'s Query looks for subjects and locations

**When "sat" is the KEY (being searched):**
- "I'm an action that happened. Other words might want to know about me."
- So "sat"'s Key advertises: "I'm an action, a verb, describes what happened"

**When "sat" is the VALUE (providing information):**
- "Here's the actual meaning I contribute: past-tense action, sitting position, etc."
- The Value contains the rich semantic information

### The Same Word, Three Perspectives!

Think of it like a person at a networking event:
- **Query**: "I'm looking for software engineers" (what you seek)
- **Key**: "Hi, I'm a software engineer!" (how you advertise yourself)
- **Value**: [Your actual skills, experience, knowledge] (what you offer)

You're simultaneously **searching for others** AND **being found by others**!

In [None]:
# Step 4: Calculate attention scores
# This determines how much each word should attend to every other word

print("="*60)
print("STEP 4: Computing Attention (Query × Key matching)")
print("="*60)

# Let's see how much "sat" should attend to each word
print("\nFocus: How much should 'sat' pay attention to each word?")
print("\nWe compute: Query('sat') • Key(each word)")
print("The dot product measures similarity/relevance\n")

query_sat = Q['sat']

# Attention scores: how much 'sat' attends to each word
score_sat_to_cat = np.dot(query_sat, K['cat'])
score_sat_to_sat = np.dot(query_sat, K['sat'])

print(f"Query('sat') • Key('cat') = {score_sat_to_cat:.3f}")
print(f"Query('sat') • Key('sat') = {score_sat_to_sat:.3f}")
print("\nHigher score = more relevant = should pay more attention")

In [None]:
# Step 5: Scale the scores
d_k = 3  # dimension of our key vectors
scaling_factor = np.sqrt(d_k)

scaled_score_sat_to_cat = score_sat_to_cat / scaling_factor
scaled_score_sat_to_sat = score_sat_to_sat / scaling_factor

print("="*60)
print("STEP 5: Scale by √d_k")
print("="*60)
print(f"\nScaling factor: √{d_k} = {scaling_factor:.3f}")
print(f"Scaled score ('sat' → 'cat'): {scaled_score_sat_to_cat:.3f}")
print(f"Scaled score ('sat' → 'sat'): {scaled_score_sat_to_sat:.3f}")
print("\nWhy scale? Dot products grow with d_k; large scores saturate softmax and make training difficult")
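
A quick experiment can show why the scaling helps. The sketch assumes query and key entries are independent draws with mean 0 and variance 1, which is an idealization and not something the small example vectors above satisfy. Under that assumption the dot product has a standard deviation of about √d_k, and dividing by √d_k brings it back to about 1. The helper name `dot_product_spread` is chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed so the run is repeatable

def dot_product_spread(d_k, n_samples=10_000):
    """Std of raw and scaled dot products for random unit-variance q, k."""
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    raw = (q * k).sum(axis=1)           # one dot product per sample
    return raw.std(), (raw / np.sqrt(d_k)).std()

for d_k in (3, 64, 512):
    raw_std, scaled_std = dot_product_spread(d_k)
    print(f"d_k={d_k:4d}  raw std={raw_std:6.2f}  scaled std={scaled_std:5.2f}")
```

The raw spread grows with the dimension while the scaled spread stays near 1, so the softmax in the next step sees scores of a similar size regardless of how wide the vectors are.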

In [None]:
# Step 6: Apply softmax to get attention weights
# This converts scores into probabilities that sum to 1

def softmax(scores):
    exp_scores = np.exp(scores - np.max(scores))  # numerical stability
    return exp_scores / exp_scores.sum()

attention_scores = np.array([scaled_score_sat_to_cat, scaled_score_sat_to_sat])
attention_weights = softmax(attention_scores)

print("="*60)
print("STEP 6: Apply Softmax (convert to probabilities)")
print("="*60)
print(f"\nAttention weight ('sat' → 'cat'): {attention_weights[0]:.3f}")
print(f"Attention weight ('sat' → 'sat'): {attention_weights[1]:.3f}")
print(f"Sum: {attention_weights.sum():.3f} (must equal 1.0)")

print("\n💡 Interpretation:")
print(f"   'sat' should focus {attention_weights[0]*100:.1f}% on 'cat'")
print(f"   'sat' should focus {attention_weights[1]*100:.1f}% on itself")

In [None]:
# Step 7: Compute weighted sum of Values
# This is the final output!

print("="*60)
print("STEP 7: Compute Weighted Sum of Values")
print("="*60)

# New representation of 'sat' after attention
new_sat = attention_weights[0] * V['cat'] + attention_weights[1] * V['sat']

print("\nOriginal 'sat' embedding:", embeddings['sat'])
print("Original 'sat' value:    ", V['sat'])
print("\nNEW 'sat' representation:", new_sat)

print("\n🎯 This new representation combines:")
print(f"   {attention_weights[0]*100:.1f}% information from 'cat': {V['cat']}")
print(f"   {attention_weights[1]*100:.1f}% information from 'sat': {V['sat']}")
print("\n✨ Result: 'sat' now has context about 'cat'!")

## 🎨 Visual Summary

Let's create a visual representation of the entire process!

In [None]:
# Create a comprehensive visualization

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('The Complete Query-Key-Value Process', fontsize=16, fontweight='bold')

# Plot 1: Original Embeddings
ax1 = axes[0, 0]
words = ['cat', 'sat']
colors = ['#FF6B6B', '#4ECDC4']

for i, word in enumerate(words):
    # Offset each word's bars so the two groups sit side by side
    ax1.bar(np.arange(3) + (i - 0.5) * 0.35, embeddings[word], width=0.35,
            alpha=0.7, label=word, color=colors[i])

ax1.set_xlabel('Dimension')
ax1.set_ylabel('Value')
ax1.set_title('1. Original Word Embeddings')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Plot 2: Q, K, V Representations
ax2 = axes[0, 1]
word = 'sat'
x = np.arange(3)
width = 0.25

ax2.bar(x - width, Q[word], width, label='Query (Q)', alpha=0.8, color='#FF6B6B')
ax2.bar(x, K[word], width, label='Key (K)', alpha=0.8, color='#4ECDC4')
ax2.bar(x + width, V[word], width, label='Value (V)', alpha=0.8, color='#95E1D3')

ax2.set_xlabel('Dimension')
ax2.set_ylabel('Value')
ax2.set_title('2. Q, K, V for "sat"')
ax2.legend()
ax2.grid(True, alpha=0.3)

# Plot 3: Attention Weights
ax3 = axes[1, 0]
# Show only the 'sat' row of the attention matrix (shape 1 x 2)
im = ax3.imshow(attention_weights.reshape(1, -1), cmap='YlOrRd', aspect='auto', vmin=0, vmax=1)
ax3.set_xticks([0, 1])
ax3.set_xticklabels(['cat', 'sat'])
ax3.set_yticks([0])
ax3.set_yticklabels(['sat'])
ax3.set_xlabel('Attending TO (Key)')
ax3.set_ylabel('Attending FROM (Query)')
ax3.set_title('3. Attention Weights')

# Add text annotations
for i in range(2):
    ax3.text(i, 0, f'{attention_weights[i]:.2f}', ha='center', va='center', 
            color='white', fontweight='bold', fontsize=12)

plt.colorbar(im, ax=ax3, label='Weight')

# Plot 4: Before and After
ax4 = axes[1, 1]
x_pos = np.arange(3)

ax4.bar(x_pos - 0.2, V['sat'], 0.4, label='Before Attention', alpha=0.7, color='#4ECDC4')
ax4.bar(x_pos + 0.2, new_sat, 0.4, label='After Attention', alpha=0.7, color='#FF6B6B')

ax4.set_xlabel('Dimension')
ax4.set_ylabel('Value')
ax4.set_title('4. "sat" Before vs After Attention')
ax4.legend()
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n" + "="*60)
print("KEY TAKEAWAY")
print("="*60)
print("The 'sat' vector has been ENRICHED with information from 'cat'")
print("This is how words understand their context!")

## 🧩 The Final Puzzle: Why Three Separate Transformations?

You might ask: **"Why can't we just use the original embeddings? Why transform them into Q, K, V?"**

Great question! Here's why:

### Without Q, K, V (using just embeddings):
```python
# If we just did: embedding('sat') • embedding('cat')
# This would measure general similarity
```

**Problem**: Words that are similar in meaning might not need to attend to each other!

Example: "happy" and "joyful" are similar, but when understanding a sentence, "happy" might need to attend to the **subject** (who is happy?) rather than other emotion words.

### With Q, K, V transformations:

```python
# We do: Query('sat') • Key('cat')
# This measures TASK-SPECIFIC relevance
```

**Solution**: The transformations learn to encode **relationships** rather than just similarity!

- **Query transformation** learns: "What relationships should I look for?"
- **Key transformation** learns: "What relationships do I participate in?"
- **Value transformation** learns: "What information should I contribute?"
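
A small sketch can make the difference visible. A plain dot product between embeddings is symmetric, so "a attends to b" always scores the same as "b attends to a". With separate query and key projections the score can be directed. The 2-D embeddings and the matrices `W_q` and `W_k` below are hypothetical values picked by hand to make this obvious; they are not learned weights.

```python
import numpy as np

# Hypothetical 2-D embeddings for two words.
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])

# Hypothetical projections (hand-picked, not learned).
W_q = np.eye(2)
W_k = np.array([[0.0, 1.0],
                [0.0, 0.0]])

def score(x, y):
    """Unscaled attention score for x attending to y: Query(x) . Key(y)."""
    return float((W_q @ x) @ (W_k @ y))

print("embedding similarity a.b:", float(a @ b), " b.a:", float(b @ a))
print("score a -> b:", score(a, b))
print("score b -> a:", score(b, a))
```

Here the two words have zero similarity as raw embeddings, yet `a` scores `b` at 1.0 while `b` scores `a` at 0.0. That one-directional relevance is something a plain embedding dot product cannot express.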

### Concrete Example:

Sentence: "The **cat** sat on the **mat**"

- **Query('sat')** learns to look for:
  - Noun patterns (for subjects)
  - Location patterns (for where)
  
- **Key('cat')** learns to advertise:
  - "I'm a noun"
  - "I can be a subject"
  
- **Value('cat')** provides:
  - Rich semantic meaning: [animal, furry, pet, small, ...]

## 🎓 Let's Test Your Understanding!

### Question 1: Database Analogy
In a database, you're looking for "employees with Python skills".

- What is the Query?
- What are the Keys?
- What are the Values?

<details>
<summary>Click for answer</summary>

- **Query**: "employees with Python skills" (what you're searching for)
- **Keys**: Tags/metadata on each employee record ("Python", "Java", "Manager", etc.)
- **Values**: Full employee profiles (name, experience, projects, etc.)

</details>

### Question 2: Sentence Understanding
Sentence: "The dog chased the cat"

What should the word "chased" pay attention to?

<details>
<summary>Click for answer</summary>

"chased" should pay high attention to:
- **"dog"** (the subject performing the action)
- **"cat"** (the object receiving the action)

Lower attention to:
- **"The"** (not semantically important)

</details>

### Question 3: The Key Question
If two words have identical embeddings, will they have identical Q, K, and V?

<details>
<summary>Click for answer</summary>

**YES!** Q, K, and V are computed by multiplying the embedding by the same three weight matrices for every word (written here in the same order as the code, `W @ embedding`):
- Q = W_query × Embedding
- K = W_key × Embedding
- V = W_value × Embedding

Same input (embedding) → Same output (Q, K, V)

However, in practice, even identical words in different positions will have different embeddings after adding positional encoding!
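
A minimal sketch of both halves of this answer, using a hypothetical embedding, query matrix, and position vectors (none of these values come from a trained model or a real positional-encoding scheme):

```python
import numpy as np

W_query = np.eye(3) * 0.5                     # hypothetical query projection
the_embedding = np.array([0.2, 0.1, 0.4])     # hypothetical embedding of "the"

# Same embedding in, same Query out.
q_first = W_query @ the_embedding
q_second = W_query @ the_embedding
print(np.array_equal(q_first, q_second))      # True

# Hypothetical position vectors for two different positions.
pos_1 = np.array([0.0, 1.0, 0.0])
pos_5 = np.array([1.0, 0.0, 0.5])

# After adding position information, the inputs (and so the Queries) differ.
q_at_1 = W_query @ (the_embedding + pos_1)
q_at_5 = W_query @ (the_embedding + pos_5)
print(np.array_equal(q_at_1, q_at_5))         # False
```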

</details>

## 🎯 Summary: The Big Picture

### What Happens in Attention (Simple Version):

1. **Start**: You have word embeddings
2. **Transform**: Create three views (Q, K, V) of each word
3. **Compare**: Match Queries against Keys ("What's relevant?")
4. **Combine**: Mix Values based on relevance weights
5. **Result**: Each word now understands its context!

### The Genius:

- **No fixed rules**: The model **learns** what to pay attention to
- **Flexible**: Different attention heads can learn different types of relationships
- **Parallel**: All words processed simultaneously (unlike RNNs)
- **Contextual**: Each word's representation depends on surrounding words

### Remember:

🔍 **Query**: "What should I look for?"
🔑 **Key**: "What do I represent for others to find?"
💎 **Value**: "What do I actually contribute?"

### The Magic Formula:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Breaking it down:
- $QK^T$: "How relevant is each word to each other word?"
- $\frac{1}{\sqrt{d_k}}$: "Scale the scores down so softmax does not saturate as $d_k$ grows"
- $\text{softmax}$: "Convert to probabilities"
- $\times V$: "Mix the values based on relevance"
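
The formula can be sketched in a few lines of NumPy. This version stacks the example embeddings for "cat" and "sat" into a matrix with one row per word. Because rows are used instead of column vectors, the projections are written `X @ W.T`, which gives the same numbers as applying `W @ x` to each word. The function name `scaled_dot_product_attention` is a label chosen here, not a library call.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with one row per word."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # (n_words, n_words)
    scores = scores - scores.max(axis=-1, keepdims=True)   # stability shift
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)         # each row sums to 1
    return weights @ V, weights

# Embeddings and weight matrices from the worked example.
X = np.array([[1.0, 0.5, 0.2],     # cat
              [0.3, 1.0, 0.8]])    # sat
W_query = np.eye(3) * 0.5
W_key = np.array([[0.8, 0.2, 0.0],
                  [0.2, 0.8, 0.0],
                  [0.0, 0.0, 1.0]]) * 0.5
W_value = np.eye(3) * 0.8

output, weights = scaled_dot_product_attention(
    X @ W_query.T, X @ W_key.T, X @ W_value.T
)

print("attention weights (rows: cat, sat):\n", np.round(weights, 3))
print("new representations:\n", np.round(output, 3))
```

The second row of `weights` is the "sat" row computed one step at a time in the walkthrough; the matrix form simply produces the row for every word at once.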

And that's Query, Key, and Value explained! 🎉

## 🔬 Bonus: Experiment Yourself!

In [None]:
# Try changing these values and see what happens!

# Create your own embeddings
my_embeddings = {
    'word1': np.array([1.0, 0.0, 0.5]),
    'word2': np.array([0.5, 1.0, 0.2])
}

# Create simple weight matrices
W_q = np.eye(3) * 0.5
W_k = np.eye(3) * 0.5
W_v = np.eye(3) * 0.8

# Compute Q, K, V
Q1 = W_q @ my_embeddings['word1']
K1 = W_k @ my_embeddings['word1']
K2 = W_k @ my_embeddings['word2']
V1 = W_v @ my_embeddings['word1']
V2 = W_v @ my_embeddings['word2']

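# Compute attention (scaled by sqrt(d_k) as in Step 5; uses softmax() from Step 6)
d_k = 3
score1 = np.dot(Q1, K1) / np.sqrt(d_k)
score2 = np.dot(Q1, K2) / np.sqrt(d_k)
weights = softmax(np.array([score1, score2]))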

# Final output
output = weights[0] * V1 + weights[1] * V2

print("Your custom attention result:")
print(f"Attention weights: {weights}")
print(f"Output: {output}")
print("\nTry changing the embeddings and see how attention changes!")

## 📚 Further Reading

Now that you understand Q, K, V, you're ready to:

1. Learn about **multi-head attention** (multiple Q, K, V transformations in parallel)
2. Understand **self-attention vs cross-attention**
3. Study complete **transformer architecture**
4. Explore **real implementations** (BERT, GPT, etc.)

You've got this! 🚀