# 🎯 Understanding Q, K, V Vectors in Attention Mechanisms

## Introduction
In this notebook, we will learn about the fundamental components of attention mechanisms in neural networks: Query (Q), Key (K), and Value (V) vectors. They are central to how models like Transformers decide which parts of the input to focus on.

### The QKV Trinity
**Analogy:** Like a sophisticated library system!
- 🔍 **Query (Q):** "What information am I looking for?"
- 🗝️ **Key (K):** "What information do I have available?"
- 📚 **Value (V):** "The actual information content"

### The QKV Process
Let's break down the steps involved in using Q, K, V vectors:
1. **Create Vectors:** Project each token embedding into Q, K, and V vectors using three learned linear maps.
2. **Calculate Similarity:** Compare each Query with all Keys using the dot product, then scale the scores by √d<sub>k</sub>.
3. **Normalize Scores:** Apply softmax to each Query's scores to get attention weights that sum to 1.
4. **Weighted Sum:** Combine the Values using those attention weights.
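
The four steps above can be traced by hand on a tiny example. The sketch below uses plain Python lists rather than PyTorch so every number stays visible. The embeddings and the projection matrices `W_q`, `W_k`, `W_v` are made-up illustrative values (in a real model the projections are learned), and the helper names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def project(x, W):
    """Multiply a row vector x by a matrix W (given as a list of rows)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

# Hypothetical 2-d embeddings for three tokens (illustrative values only).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical projection matrices; in a real model these are learned.
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[2.0, 0.0], [0.0, 2.0]]
W_v = [[0.0, 1.0], [1.0, 0.0]]

# Step 1: create Q, K, V for every token.
Q = [project(x, W_q) for x in embeddings]
K = [project(x, W_k) for x in embeddings]
V = [project(x, W_v) for x in embeddings]

d_k = len(K[0])
all_weights = []
outputs = []
for q in Q:
    # Step 2: dot-product similarity with every key, scaled by sqrt(d_k).
    scores = [dot(q, k) / math.sqrt(d_k) for k in K]
    # Step 3: softmax turns the scores into weights that sum to 1.
    weights = softmax(scores)
    # Step 4: weighted sum of the value vectors.
    out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
    all_weights.append(weights)
    outputs.append(out)

for i, (w, o) in enumerate(zip(all_weights, outputs)):
    print(f"token {i}: weights={[round(p, 3) for p in w]} output={[round(c, 3) for c in o]}")
```

For the first token, the query `[1, 0]` scores equally against the first and third keys and lower against the second, so the first and third values receive the same, larger weight.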


### Mathematical Foundation of Attention
**Attention Formula:**
> Attention(Q,K,V) = softmax(QK<sup>T</sup>/√d<sub>k</sub>)V
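- QK<sup>T</sup>: Query-Key similarity scores (one row per Query, one column per Key)
- √d<sub>k</sub>: Scaling factor, where d<sub>k</sub> is the Key dimension; it keeps large dot products from saturating the softmax, where gradients become very small
- softmax: Normalizes each row of scores into a probability distribution
- V: Value matrix; the output is a weighted combination of its rows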
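
To see why the √d<sub>k</sub> term matters, here is a small sketch in plain Python. It assumes d<sub>k</sub> = 64 and uses hypothetical raw scores of ±8. That magnitude is plausible because, if the components of a query and a key are independent with zero mean and unit variance, their dot product has standard deviation √d<sub>k</sub>. The sketch compares the softmax output and the diagonal of its Jacobian with and without scaling.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64                       # assumed key dimension
raw_scores = [8.0, 0.0, -8.0]  # hypothetical unscaled dot products

unscaled = softmax(raw_scores)
scaled = softmax([s / math.sqrt(d_k) for s in raw_scores])

# Diagonal of the softmax Jacobian: d p_i / d s_i = p_i * (1 - p_i).
grad_unscaled = [p * (1 - p) for p in unscaled]
grad_scaled = [p * (1 - p) for p in scaled]

print("unscaled weights:", [round(p, 4) for p in unscaled])
print("scaled weights:  ", [round(p, 4) for p in scaled])
print("largest gradient entry, unscaled:", max(grad_unscaled))
print("largest gradient entry, scaled:  ", max(grad_scaled))
```

Without scaling, almost all of the weight lands on one key and the gradient entries are close to zero. With scaling, the weights are spread out and the gradients are much larger.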

### Real-World Example: Search Engine
- 🔍 **Your search:** "best pizza restaurants"
- 🗝️ **Keys:** Restaurant descriptions in database
- 📚 **Values:** Full restaurant information
- 🎯 **Result:** Restaurants ranked by relevance!
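
The search analogy can be sketched as a toy ranking in plain Python. Everything here is hypothetical: the feature axes, the restaurant names, and the key vectors are made up for illustration. One difference worth noting is that a search engine returns a ranked list, whereas attention uses the same weights to blend the Values into a single output.

```python
import math

# Hypothetical feature axes: [pizza, sushi, highly_rated]
query = [1.0, 0.0, 1.0]  # stands in for "best pizza restaurants"

# Hypothetical database: each entry has a key vector and a value record.
restaurants = [
    {"key": [1.0, 0.0, 0.9], "value": {"name": "Luigi's", "cuisine": "pizza"}},
    {"key": [0.0, 1.0, 0.8], "value": {"name": "Sakura", "cuisine": "sushi"}},
    {"key": [1.0, 0.0, 0.3], "value": {"name": "Slice Shack", "cuisine": "pizza"}},
]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

# Query-Key similarity for every restaurant.
scores = [dot(query, r["key"]) for r in restaurants]

# Softmax turns the scores into relevance weights that sum to 1.
m = max(scores)
exps = [math.exp(s - m) for s in scores]
weights = [e / sum(exps) for e in exps]

# Rank the value records by their weight, highest first.
ranked = sorted(zip(weights, (r["value"]["name"] for r in restaurants)), reverse=True)
ranked_names = [name for _, name in ranked]

for weight, name in ranked:
    print(f"{name}: {weight:.3f}")
```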

### Demo: Implementing QKV Attention
Let's see a simple implementation of the attention mechanism using PyTorch.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math


In [None]:
class SimpleAttention(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        self.embed_dim = embed_dim
        
        # Linear transformations for Q, K, V
        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)
    
    def forward(self, x):
        # x shape: [seq_len, embed_dim]
        Q = self.query(x)  # Queries
        K = self.key(x)    # Keys  
        V = self.value(x)  # Values
        
        # Calculate attention scores, scaled by sqrt(d_k);
        # with a single head, d_k equals embed_dim
        scores = torch.matmul(Q, K.transpose(-2, -1))
        scores = scores / math.sqrt(self.embed_dim)
        
        # Apply softmax
        attention_weights = F.softmax(scores, dim=-1)
        
        # Apply attention to values
        output = torch.matmul(attention_weights, V)
        
        return output, attention_weights


### Example Usage
Set parameters and create a sample input tensor.

In [None]:
embed_dim = 64
seq_len = 5
x = torch.randn(seq_len, embed_dim)

attention_layer = SimpleAttention(embed_dim)
output, weights = attention_layer(x)

print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {weights.shape}")
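
Beyond checking shapes, a quick sanity check is that every row of the attention weights is a probability distribution. The sketch below is self-contained, so it repeats the computation from `SimpleAttention.forward` with stand-in `nn.Linear` projections instead of reusing the class. The fixed seed and the variable names are assumptions made for illustration.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)  # fixed seed so the run is repeatable

seq_len, embed_dim = 5, 64
x = torch.randn(seq_len, embed_dim)

# Stand-in projections; SimpleAttention holds equivalent nn.Linear layers.
query = nn.Linear(embed_dim, embed_dim)
key = nn.Linear(embed_dim, embed_dim)
value = nn.Linear(embed_dim, embed_dim)

with torch.no_grad():
    Q, K, V = query(x), key(x), value(x)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(embed_dim)
    weights = F.softmax(scores, dim=-1)
    output = weights @ V

# Each row holds one query's weights over all keys, so it should sum to 1.
row_sums = weights.sum(dim=-1)
print(f"Row sums: {row_sums}")
print(f"All weights non-negative: {bool((weights >= 0).all())}")
```

The weights have shape `[seq_len, seq_len]`: row `i` describes how much token `i` attends to every token in the sequence, including itself.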


### Concept Simplified — Think of QKV like dating apps:
- 💝 **Query:** Your dating preferences (e.g., "I like funny people")
- 🏷️ **Key:** Other profiles' tags (e.g., "Funny, Smart, Kind")
- 👤 **Value:** Full profiles with photos and details
- 💕 **Match:** The algorithm surfaces the profiles whose Keys best match your Query!

### Different Perspective: Whiteboard view
Imagine Q, K, V as components of a restaurant recommendation system:
- Query: What you want (e.g., type of cuisine)
- Key: Restaurant features (e.g., "Italian", "Vegetarian")
- Value: Full restaurant details (menu, photos, reviews)

QKV vectors are the secret sauce behind effective attention!

### Practical Question
If you were building a code completion tool, what would your Query vector represent when trying to complete "def calculate_"?
Think about the information the model needs to generate the next part of the code!