## Summary of the previous chapter 

In [81]:
import urllib.request
import torch
import tiktoken  
from torch.utils.data import Dataset, DataLoader

# Download the text file
url = ("https://raw.githubusercontent.com/rasbt/"
       "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
       "the-verdict.txt")
file_path = "the-verdict.txt"
urllib.request.urlretrieve(url, file_path)

# Load and tokenize the text
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

tokenizer = tiktoken.get_encoding("gpt2")  # Same encoding that create_dataloader_v1 uses below
enc_text = tokenizer.encode(raw_text)

# Define a custom dataset class
class GPTDatasetV1(Dataset):
    def __init__(self, txt, tokenizer, max_length, stride):
        self.input_ids = []
        self.target_ids = []
        token_ids = tokenizer.encode(txt)
        for i in range(0, len(token_ids) - max_length, stride):
            input_chunk = token_ids[i:i + max_length]
            target_chunk = token_ids[i + 1: i + max_length + 1]
            self.input_ids.append(torch.tensor(input_chunk))
            self.target_ids.append(torch.tensor(target_chunk))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

# Function to create a DataLoader
def create_dataloader_v1(txt, batch_size, max_length=256, stride=128, shuffle=True, drop_last=True, num_workers=0):
    tokenizer = tiktoken.get_encoding("gpt2")
    dataset = GPTDatasetV1(txt, tokenizer, max_length, stride)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last, num_workers=num_workers)
    return dataloader 

# Create DataLoader
dataloader = create_dataloader_v1(raw_text, batch_size=1, max_length=4, stride=2, shuffle=False)
data_iter = iter(dataloader)
first_batch = next(data_iter)
second_batch = next(data_iter)
print("First batch:", first_batch)
print("Second batch:", second_batch)

# --- Adding Token and Positional Embeddings ---

# Define token embedding layer
vocab_size = 50257  # Vocabulary size of the GPT-2 BPE tokenizer
embedding_dim = 256  # Example embedding size (GPT-3 uses 12,288)
token_embedding_layer = torch.nn.Embedding(vocab_size, embedding_dim)

# Convert token IDs into token embeddings
token_embeddings = token_embedding_layer(first_batch[0])  # First batch of inputs
print("Token Embeddings Shape:", token_embeddings.shape)

# Define positional embedding layer
max_length = 4  # Same as the max sequence length
pos_embedding_layer = torch.nn.Embedding(max_length, embedding_dim)

# Generate position embeddings
positions = torch.arange(max_length).unsqueeze(0)  # Create position indices
pos_embeddings = pos_embedding_layer(positions)
print("Positional Embeddings Shape:", pos_embeddings.shape)

# Combine token and positional embeddings
input_embeddings = token_embeddings + pos_embeddings
print("Final Input Embeddings Shape:", input_embeddings.shape)


First batch: [tensor([[  40,  367, 2885, 1464]]), tensor([[ 367, 2885, 1464, 1807]])]
Second batch: [tensor([[2885, 1464, 1807, 3619]]), tensor([[1464, 1807, 3619,  402]])]
Token Embeddings Shape: torch.Size([1, 4, 256])
Positional Embeddings Shape: torch.Size([1, 4, 256])
Final Input Embeddings Shape: torch.Size([1, 4, 256])
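
To make the sliding-window logic of `GPTDatasetV1` concrete, the following minimal sketch applies the same loop to a short list of made-up integer IDs instead of real tokenizer output. The helper name `sliding_windows` and the toy IDs are illustrative assumptions, not part of the original notebook.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) chunks; each target is its input shifted by one position."""
    pairs = []
    for i in range(0, len(token_ids) - max_length, stride):
        input_chunk = token_ids[i:i + max_length]
        target_chunk = token_ids[i + 1:i + max_length + 1]
        pairs.append((input_chunk, target_chunk))
    return pairs


toy_ids = list(range(10, 20))  # hypothetical token IDs 10..19
pairs = sliding_windows(toy_ids, max_length=4, stride=2)

for input_chunk, target_chunk in pairs:
    print(input_chunk, "->", target_chunk)
```

With `stride=2` and `max_length=4`, each window starts two tokens after the previous one, so consecutive windows overlap by two tokens. This is consistent with the two batches printed above, where the second batch begins with 2885, the third token of the first batch.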


# Ch 3 - Coding Attention mechanisms

This chapter explains the self-attention mechanism in large language models (LLMs), a crucial part of transformer-based architectures like GPT. The key concepts and steps are summarized below:  

### **1. The Problem with Long Sequences**  
- Traditional models like **Recurrent Neural Networks (RNNs)** process text sequentially and struggle with long-range dependencies.  
- The **encoder-decoder RNN architecture** attempts to capture context but compresses all information into a single hidden state, leading to loss of information over long sequences.  
- **Attention mechanisms** were introduced to solve this by allowing the model to focus on different parts of the input while generating output.  

### **2. Introduction to Attention Mechanisms**  
- **Bahdanau attention** (2014) improved RNNs by allowing the decoder to focus on relevant words at each step rather than relying only on a final hidden state.  
- In 2017, researchers found that **RNNs were unnecessary** for language models and developed the **Transformer architecture**, which relies entirely on **self-attention** mechanisms.  

### **3. Understanding Self-Attention**  
- **Self-attention** enables each word in a sentence to attend to (focus on) all other words, capturing relationships between words more effectively than RNNs.  
- **Context vectors** are generated, which enrich each word’s representation by incorporating information from other words.  

### **4. Implementing a Simple Self-Attention Mechanism**  
- The **goal** is to compute a **context vector** for each input word, weighted by its relevance to the other words.  
- The **dot product** is used to calculate similarity scores between words.  
- **Normalization** ensures that the attention weights sum to 1, making them interpretable.  
- The **softmax function** is preferred over simply dividing by the sum because it keeps all weights positive and handles extreme values more gracefully.  
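
The difference between plain sum-normalization and softmax can be sketched without any library. The snippet below is a minimal illustration that reuses the six "journey" attention scores printed further down in this notebook; the helper names `normalize_by_sum` and `softmax` are assumptions made for this example.

```python
import math


def normalize_by_sum(scores):
    """Divide each score by the total; only safe when all scores are positive."""
    total = sum(scores)
    return [s / total for s in scores]


def softmax(scores):
    """Exponentiate and normalize; subtracting the max avoids overflow in exp()."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865]
sum_weights = normalize_by_sum(scores)
soft_weights = softmax(scores)

print("Sum-normalized:", [round(w, 4) for w in sum_weights])
print("Softmax:       ", [round(w, 4) for w in soft_weights])

# With a negative score, sum-normalization can yield negative "weights",
# while softmax always stays strictly positive.
mixed_scores = [-1.0, 2.0, 0.5]
print("Sum-normalized (mixed):", normalize_by_sum(mixed_scores))
print("Softmax (mixed):       ", softmax(mixed_scores))
```

Both variants produce weights that sum to 1, but only softmax guarantees positive weights for arbitrary scores, which is one reason it is the usual choice.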

### **5. Example Implementation in PyTorch**  
- **Step 1: Define input tokens** (already embedded as 3D vectors).  
- **Step 2: Compute attention scores** using the dot product between each word and a query word.  
- **Step 3: Normalize scores to obtain attention weights**, ensuring they sum to 1.  
- **Step 4: Compute the weighted sum to get the final context vector.**  

### **Key Takeaways**  
- **Self-attention** allows LLMs to capture relationships between words efficiently.  
- It **overcomes the limitations** of RNNs by allowing each word to reference all others, not just previous ones.  
- This chapter **focused on implementing self-attention**, with future sections covering trainable weights and multi-head attention.  


In [82]:
import torch

# Define the input sentence as token embeddings (example 3D vectors for simplicity)
inputs = torch.tensor([
    [0.43, 0.15, 0.89],  # Token 1: "Your"
    [0.55, 0.87, 0.66],  # Token 2: "journey"
    [0.57, 0.85, 0.64],  # Token 3: "starts"
    [0.22, 0.58, 0.33],  # Token 4: "with"
    [0.77, 0.25, 0.10],  # Token 5: "one"
    [0.05, 0.80, 0.55]   # Token 6: "step"
])

# Select a query token (e.g., the second token "journey")
query = inputs[1]

# Step 1: Compute attention scores (dot product of the query with all tokens)
attn_scores = torch.empty(inputs.shape[0])
for i, token in enumerate(inputs):
    attn_scores[i] = torch.dot(query, token)

# Print the attention scores
print("Attention Scores:", attn_scores)
# This takes the dot product of "journey" with every word in the sentence (inside the loop)
# to measure how close it is to each of them, and then prints the scores.

# Step 2: Normalize the attention scores using softmax
attn_weights = torch.nn.functional.softmax(attn_scores, dim=0)
print("Attention Weights after softmax:", attn_weights)
# Softmax then turns the scores into weights that behave like probabilities (they sum to 1).

# Step 3: Compute the context vector as a weighted sum of input embeddings
context_vector = torch.zeros_like(query)
for i, token in enumerate(inputs):
    context_vector += attn_weights[i] * token

print("Context Vector:", context_vector)
# The context vector we computed represents a new, enriched version of the word "journey"
# (our query word), now infused with information from the entire sentence.
# Instead of treating "journey" as an isolated word, the context vector captures how
# much "journey" relates to the other words in the sentence ("Your", "starts", "one", "step", etc.).


Attention Scores: tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])
Attention Weights after softmax: tensor([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
Context Vector: tensor([0.4419, 0.6515, 0.5683])


In [83]:
print(inputs[1])

tensor([0.5500, 0.8700, 0.6600])


Just as we calculated the attention scores of "journey" with respect to every word, we now calculate the attention scores for every pair of words in the sentence.

In [84]:
attn_scores = torch.empty(6, 6)
for i, x_i in enumerate(inputs):
    for j, x_j in enumerate(inputs):
        attn_scores[i, j] = torch.dot(x_i, x_j)
print(attn_scores)

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])


In [85]:
attn_scores = inputs @ inputs.T
print(attn_scores)

tensor([[0.9995, 0.9544, 0.9422, 0.4753, 0.4576, 0.6310],
        [0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865],
        [0.9422, 1.4754, 1.4570, 0.8296, 0.7154, 1.0605],
        [0.4753, 0.8434, 0.8296, 0.4937, 0.3474, 0.6565],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654, 0.2935],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])


In [86]:
attn_weights = torch.softmax(attn_scores, dim=-1)
print(attn_weights)

tensor([[0.2098, 0.2006, 0.1981, 0.1242, 0.1220, 0.1452],
        [0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581],
        [0.1390, 0.2369, 0.2326, 0.1242, 0.1108, 0.1565],
        [0.1435, 0.2074, 0.2046, 0.1462, 0.1263, 0.1720],
        [0.1526, 0.1958, 0.1975, 0.1367, 0.1879, 0.1295],
        [0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])


In [87]:
row_2_sum = sum([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
print("Row 2 sum:", row_2_sum)
print("All row sums:", attn_weights.sum(dim=-1))

Row 2 sum: 1.0
All row sums: tensor([1.0000, 1.0000, 1.0000, 1.0000, 1.0000, 1.0000])


In [88]:
all_context_vecs = attn_weights @ inputs
print(all_context_vecs)

tensor([[0.4421, 0.5931, 0.5790],
        [0.4419, 0.6515, 0.5683],
        [0.4431, 0.6496, 0.5671],
        [0.4304, 0.6298, 0.5510],
        [0.4671, 0.5910, 0.5266],
        [0.4177, 0.6503, 0.5645]])


In [89]:
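# `context_vec_2` is not defined at this point; the loop-based result from In [82] is stored in `context_vector`.
# print("Previous 2nd context vector:", context_vector)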

In [90]:
inputs.shape

torch.Size([6, 3])

### **Understanding Self-Attention Step by Step (For an Absolute Beginner)**  

Let's go through each of the code cells above carefully, assuming you're new to this. Each step is explained in a way that makes sense even if you're not familiar with deep learning.

---

## **CODE 1** (Basic Self-Attention Mechanism)  

This code implements a simplified self-attention mechanism for one token (the word "journey").  

### **Step 1: Defining Inputs (Words as Vectors)**  
```python
inputs = torch.tensor([
    [0.43, 0.15, 0.89],  # "Your"
    [0.55, 0.87, 0.66],  # "journey"
    [0.57, 0.85, 0.64],  # "starts"
    [0.22, 0.58, 0.33],  # "with"
    [0.77, 0.25, 0.10],  # "one"
    [0.05, 0.80, 0.55]   # "step"
])
```
- Each word is represented as a **3D vector**.  
- These vectors are just **numbers representing the meaning of words** (they are called "embeddings").  

---

### **Step 2: Selecting a Query Word**
```python
query = inputs[1]
```
- We **select the word "journey"** as the "query".  
- This means we are trying to find **how "journey" relates to other words** in the sentence.  

---

### **Step 3: Compute Attention Scores (Dot Product)**
```python
attn_scores = torch.empty(inputs.shape[0])  # Create an empty tensor to store scores
for i, token in enumerate(inputs):
    attn_scores[i] = torch.dot(query, token)
```
#### What is happening here?  
- We are taking the **dot product** of "journey" with every other word in the sentence.  
- This measures **how similar each word is to "journey"**.  
- The dot product gives a **higher number if the words are similar** and a **lower number if they are different**.  
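
The dot product itself is just the sum of element-wise products. As a small sketch (the helper `dot` is an illustrative stand-in for `torch.dot`), here is the score between "journey" and "starts" computed by hand, next to the score between "journey" and "one":

```python
def dot(a, b):
    """Sum of element-wise products of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))


journey = [0.55, 0.87, 0.66]
starts = [0.57, 0.85, 0.64]
one = [0.77, 0.25, 0.10]

score_starts = dot(journey, starts)  # 0.55*0.57 + 0.87*0.85 + 0.66*0.64
score_one = dot(journey, one)        # 0.55*0.77 + 0.87*0.25 + 0.66*0.10

print("journey . starts =", round(score_starts, 4))
print("journey . one    =", round(score_one, 4))
```

"starts" has an embedding that points in almost the same direction as "journey", so its score is much higher than the score for "one". Keep in mind that the dot product also grows with vector length, so a high score reflects both alignment and magnitude.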

---

### **Step 4: Convert Scores to Probabilities (Softmax)**
```python
attn_weights = torch.nn.functional.softmax(attn_scores, dim=0)
```
#### Why do we need this?  
- The attention scores are just numbers. We need to **convert them into probabilities** so they sum to **1**.  
- The **softmax function** makes sure the biggest score gets the highest probability.  

---

### **Step 5: Compute the Context Vector (Weighted Sum)**
```python
context_vector = torch.zeros_like(query)  # Initialize the context vector
for i, token in enumerate(inputs):
    context_vector += attn_weights[i] * token
```
#### What is happening here?  
- Each word **contributes to the final output** based on how similar it is to "journey".  
- Words that are **more relevant get a bigger share** in the final output.  
- The **context vector** is a new version of "journey", **but now it includes information from all words in the sentence**.  

📌 **Think of this like a group discussion**:  
- If many people talk about the same topic, that topic becomes important.  
- The final opinion (context vector) depends on how much each person (word) contributed.  

---

## **CODE 2** (Computing Attention Scores for All Tokens)  

```python
attn_scores = torch.empty(6, 6)  # Create a 6x6 empty matrix
for i, x_i in enumerate(inputs):
    for j, x_j in enumerate(inputs):
        attn_scores[i, j] = torch.dot(x_i, x_j)
print(attn_scores)
```
### **What is happening here?**  
- Instead of computing attention scores **only for "journey"**, now we compute attention scores for **every word with every other word**.  
- The result is a **6x6 matrix**, where row *i* holds the raw scores between word *i* and every word in the sentence (including itself).  

---

## **CODE 3** (Using Matrix Multiplication for Faster Computation)  

```python
attn_scores = inputs @ inputs.T
print(attn_scores)
```
### **Why do this?**  
- Instead of using **for loops**, we use **matrix multiplication** (`@`) to compute all dot products **at once**.  
- `inputs.T` means **transpose** (swap rows and columns).  
- The result is the **same as CODE 2, but computed much faster**.  
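
To check the claim that both approaches give the same result, the following self-contained sketch computes the scores both ways and compares them with `torch.allclose`. The variable names `loop_scores` and `matmul_scores` are chosen for this example only.

```python
import torch

inputs = torch.tensor([
    [0.43, 0.15, 0.89],  # "Your"
    [0.55, 0.87, 0.66],  # "journey"
    [0.57, 0.85, 0.64],  # "starts"
    [0.22, 0.58, 0.33],  # "with"
    [0.77, 0.25, 0.10],  # "one"
    [0.05, 0.80, 0.55],  # "step"
])

# Loop version: one dot product per (i, j) pair
n = inputs.shape[0]
loop_scores = torch.empty(n, n)
for i, x_i in enumerate(inputs):
    for j, x_j in enumerate(inputs):
        loop_scores[i, j] = torch.dot(x_i, x_j)

# Matrix version: all dot products in a single operation
matmul_scores = inputs @ inputs.T

same = torch.allclose(loop_scores, matmul_scores)
symmetric = torch.allclose(matmul_scores, matmul_scores.T)
print("Loop and matmul agree:", same)
print("Score matrix is symmetric:", symmetric)
```

The matrix is symmetric here because the same vectors act as both queries and keys. Once separate query and key projections are introduced, that symmetry generally no longer holds.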

---

## **CODE 4** (Applying Softmax to Get Attention Weights)  

```python
attn_weights = torch.softmax(attn_scores, dim=-1)
print(attn_weights)
```
### **What is happening?**  
- We take the **6x6 attention scores** and apply **softmax** row-wise.  
- This converts raw scores into **probabilities** (each row sums to **1**).  
- This tells us **how much each word attends to every other word**.  

---

## **CODE 5** (Verifying Softmax Normalization)  

```python
row_2_sum = sum([0.1385, 0.2379, 0.2333, 0.1240, 0.1082, 0.1581])
print("Row 2 sum:", row_2_sum)
print("All row sums:", attn_weights.sum(dim=-1))
```
### **What is happening?**  
- We **check if softmax is working correctly**.  
- **Each row should sum to 1** (because softmax converts numbers into probabilities).  
- `attn_weights.sum(dim=-1)` prints the sum of each row to confirm that.  

---

## **CODE 6** (Computing All Context Vectors)  

```python
all_context_vecs = attn_weights @ inputs
print(all_context_vecs)
```
### **What is happening?**  
- Before, we computed **one context vector (for "journey")**.  
- Now, we compute **context vectors for all words at once** using **matrix multiplication**.  
- This creates **a new sentence representation** where each word **is updated based on its relationship with other words**.  

---

## **CODE 7** (Comparing with Previous Computation)  

```python
print("Previous 2nd context vector:", context_vector)  # context_vector is the result from CODE 1
```
### **Why do this?**  
- This checks if the context vector we computed earlier for **"journey"** (from CODE 1) is **the same as the second row in CODE 6**.  
- If both are the same, it means our new method (matrix-based computation) is correct!  
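
A self-contained version of this check might look as follows. It recomputes the loop-based context vector for "journey" and compares it with the second row of the matrix-based result; the names `context_loop` and `matches` are assumptions for this sketch.

```python
import torch

inputs = torch.tensor([
    [0.43, 0.15, 0.89],  # "Your"
    [0.55, 0.87, 0.66],  # "journey"
    [0.57, 0.85, 0.64],  # "starts"
    [0.22, 0.58, 0.33],  # "with"
    [0.77, 0.25, 0.10],  # "one"
    [0.05, 0.80, 0.55],  # "step"
])

# Matrix-based computation for all tokens at once
attn_weights = torch.softmax(inputs @ inputs.T, dim=-1)
all_context_vecs = attn_weights @ inputs

# Loop-based computation for the second token only
query = inputs[1]
scores = torch.stack([torch.dot(query, x) for x in inputs])
weights = torch.softmax(scores, dim=0)
context_loop = torch.zeros_like(query)
for w, x in zip(weights, inputs):
    context_loop += w * x

matches = torch.allclose(context_loop, all_context_vecs[1], atol=1e-6)
print("Loop result:  ", context_loop)
print("Matrix row 2: ", all_context_vecs[1])
print("Match:", matches)
```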

---

## **Final Summary**  
### **Understanding the Progression**  
| Code | What It Does |
|------|-------------|
| **CODE 1** | Computes self-attention for one word ("journey") using loops |
| **CODE 2** | Computes attention scores for all words using loops |
| **CODE 3** | Optimizes CODE 2 using matrix multiplication |
| **CODE 4** | Converts attention scores to probabilities (softmax) |
| **CODE 5** | Verifies that softmax probabilities sum to 1 |
| **CODE 6** | Computes context vectors for all words using matrix multiplication |
| **CODE 7** | Checks if the optimized method matches the manual method |

---

## **Key Takeaways**  
- **Attention scores** tell us **how much each word is related to every other word**.  
- **Softmax converts scores into probabilities** so they sum to **1**.  
- **Context vectors combine information from all words**, making each word **aware of its surroundings**.  
- **Using matrices speeds up computation**, making the model efficient.  

---

## **Was Your Initial Understanding Correct?**  
Your understanding was mostly right, but now you can see the **matrix-based optimization** steps. Instead of doing everything in a loop, we used **matrix multiplication** to compute **all attention scores, weights, and context vectors at once**.  


In [91]:
x_2 = inputs[1]
d_in = inputs.shape[1]
d_out = 2
torch.manual_seed(123)
w_query = torch.nn.Parameter(torch.randn(d_in, d_out), requires_grad=False)
W_key = torch.nn.Parameter(torch.randn(d_in, d_out), requires_grad=False)
w_value = torch.nn.Parameter(torch.randn(d_in, d_out), requires_grad=False)

query_2 = x_2 @ w_query
value_2 = x_2 @ w_value  # Value vector for the second token only, like query_2 and key_2
key_2 = x_2 @ W_key

print("Query 2:", query_2)


Query 2: tensor([-1.1729, -0.0048])


In [92]:
print(d_in)

3


Here, “weight” is short for “weight parameters,” the values of a neural network that are optimized during training. This is not to be confused with the attention weights.

Weight parameters are the fundamental, learned coefficients that define the network’s connections, while attention weights are dynamic, context-specific values.
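
The distinction can be sketched in a few lines. In the example below, the projection matrices are hand-picked stand-ins for learned weight parameters (an assumption for illustration; real ones would come from training), and the helper `attention_weights` is hypothetical. The parameters stay fixed, while the attention weights change as soon as the input changes.

```python
import torch

d_out = 2

# Hand-picked stand-ins for learned weight parameters (fixed after training)
W_query = torch.tensor([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
W_key = torch.tensor([[0.0, 1.0], [1.0, 0.0], [0.5, -0.5]])


def attention_weights(x):
    """Attention weights are recomputed from the input on every call."""
    queries = x @ W_query
    keys = x @ W_key
    scores = queries @ keys.T
    return torch.softmax(scores / d_out**0.5, dim=-1)


x_a = torch.tensor([[0.43, 0.15, 0.89], [0.55, 0.87, 0.66]])  # "Your", "journey"
x_b = torch.tensor([[0.77, 0.25, 0.10], [0.05, 0.80, 0.55]])  # "one", "step"

W_before = W_query.clone()
weights_a = attention_weights(x_a)
weights_b = attention_weights(x_b)

print("Attention weights for input A:\n", weights_a)
print("Attention weights for input B:\n", weights_b)
print("Weight parameters unchanged:", torch.equal(W_before, W_query))
```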

In [93]:
keys = inputs @ W_key
values = inputs @ w_value
print("keys.shape:", keys.shape)
print("values.shape:", values.shape)

keys.shape: torch.Size([6, 2])
values.shape: torch.Size([6, 2])


As we can tell from the outputs, we successfully projected the six input tokens from a
three-dimensional onto a two-dimensional embedding space.

In [94]:
keys_2 = keys[1]
attn_score_22 = query_2.dot(keys_2)
print(attn_score_22)

tensor(0.1376)


In [95]:
attn_scores_2 = query_2 @ keys.T
print(attn_scores_2)

tensor([ 0.2172,  0.1376,  0.1730, -0.0491,  0.7616, -0.3809])


In [96]:
d_k = keys.shape[-1]
attn_weights_2 = torch.softmax(attn_scores_2 / d_k**0.5, dim=-1)
print(attn_weights_2)

tensor([0.1704, 0.1611, 0.1652, 0.1412, 0.2505, 0.1117])


In [97]:
context_vec_2 = attn_weights_2 @ values
print(context_vec_2)

tensor([0.2854, 0.4081])


Let's go step by step and break down **each code line by line** in a way that's super easy to understand. We'll use a **real-world analogy** wherever possible.  

---

# **CODE 1: Creating Query, Key, and Value Vectors**  
```python
x_2 = inputs[1]  # Select the 2nd word's embedding (word: "journey")

d_in = inputs.shape[1]  # Get the number of features (3 in this case)
d_out = 2  # We want to transform our embeddings into a smaller space (2D)

torch.manual_seed(123)  # Set seed for reproducibility

# Define weight matrices for Query, Key, and Value transformations
w_query = torch.nn.Parameter(torch.randn(d_in, d_out), requires_grad=False)
W_key = torch.nn.Parameter(torch.randn(d_in, d_out), requires_grad=False)
w_value = torch.nn.Parameter(torch.randn(d_in, d_out), requires_grad=False)

# Compute transformed vectors
query_2 = x_2 @ w_query  # Transform "journey" into a query vector
value_2 = x_2 @ w_value  # Transform "journey" into a value vector
key_2 = x_2 @ W_key  # Transform "journey" into a key vector

print("Query 2:", query_2)  # Print the query vector
```

### **What is happening here?**  
- Each word is represented as a **3D vector (embedding)**.  
- We want to transform them into a **smaller 2D space** using **weight matrices**.  
- The weight matrices (`w_query`, `W_key`, `w_value`) help compute **Query, Key, and Value vectors**.  
- Specifically:
  - `query_2` → The transformed version of "journey" as a **query**.  
  - `value_2` → The transformed version of "journey" as a **value**.  
  - `key_2` → The transformed version of "journey" as a **key**.  

📌 **Analogy**  
Imagine a **teacher grading students** on different subjects:  
- **Query** = What you are looking for (e.g., "Who is good at math?")  
- **Keys** = Students' skills (Math, Science, English)  
- **Values** = Students' detailed reports (full information)  

This step **converts raw student data into a form we can use for searching**.  

---

# **CODE 2: Compute Keys and Values for All Words**  
```python
keys = inputs @ W_key  # Transform all words into keys
values = inputs @ w_value  # Transform all words into values
```

### **What is happening here?**  
- Instead of transforming **only "journey"**, we now transform **all words** into:
  - **Keys** (for comparison)
  - **Values** (for final representation)  

📌 **Analogy**  
Now, instead of looking at only one student ("journey"), we **assess all students** (all words) and store their:  
- **Skills** (Keys)  
- **Detailed performance reports** (Values)  

Now, each word has its **own key and value representation**.  

---

# **CODE 3: Compute Similarity Between "Journey" and Its Own Key**  
```python
keys_2 = keys[1]  # Extract the key vector for "journey"
attn_score_22 = query_2.dot(keys_2)  # Compute dot product (similarity)
print(attn_score_22)
print("keys.shape:", keys.shape)
print("values.shape:", values.shape)
```

### **What is happening here?**  
- We take the **key** corresponding to "journey" (`keys_2`).  
- We **compare "journey's" query with its own key** using the **dot product**.  
- This gives us **a similarity score** (higher = more similar).  

📌 **Analogy**  
If a student is **good at math**, we check **how much their skill profile matches their math abilities**.  
This tells us **how well they match their own description**.  

---

# **CODE 4: Compute Attention Scores for All Words**  
```python
attn_scores_2 = query_2 @ keys.T  # Compute attention scores for all words
print(attn_scores_2)
```

### **What is happening here?**  
- Instead of **just comparing "journey" with itself**, we now compare "journey" **with every word in the sentence**.  
- This gives us **a list of similarity scores** for "journey" **vs every other word**.  

📌 **Analogy**  
Instead of checking **only one student's skill match**, we now compare this student **with all other students** to see who is similar to them.  

📌 **Output**  
A **list of numbers**, each representing **how much "journey" relates to another word**.  

---

# **CODE 5: Apply Softmax for Normalized Attention Weights**  
```python
d_k = keys.shape[-1]  # Get the number of features in keys
attn_weights_2 = torch.softmax(attn_scores_2 / d_k**0.5, dim=-1)  # Normalize scores
print(attn_weights_2)
```

### **What is happening here?**  
- The **raw attention scores** are just numbers.  
- We **normalize them** using **softmax** to get probabilities.  
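- Dividing by `d_k**0.5` (the square root of the key dimension) keeps the dot products from growing with the embedding size, so softmax does not saturate and gradients stay usable during training.  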

📌 **Analogy**  
If we give students **raw scores**, they might not make sense.  
- We **convert them into percentages** (so they sum to 100%).  
- Now, we **know how much "journey" should focus on each word**.  

📌 **Output**  
A **list of probabilities**, showing **how much attention "journey" gives to each word**.  
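
The effect of the scaling step can be illustrated with made-up numbers. The sketch below assumes a hypothetical key dimension of 64 (much larger than the `d_k = 2` used in this notebook) and invented raw scores of the magnitude that larger dimensions tend to produce. The usual argument is that if query and key components are independent with zero mean and unit variance, their dot product has a standard deviation of about `sqrt(d_k)`, so dividing by `sqrt(d_k)` brings the scores back to a moderate range.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


d_k = 64                       # hypothetical key dimension
raw_scores = [0.8, 4.0, 1.6]   # invented unscaled scores

unscaled = softmax(raw_scores)
scaled = softmax([s / math.sqrt(d_k) for s in raw_scores])

print("Without scaling:", [round(w, 4) for w in unscaled])
print("With scaling:   ", [round(w, 4) for w in scaled])
```

Without scaling, almost all attention lands on a single token. With scaling, the distribution is flatter, which leaves room for the other tokens to contribute and avoids the very small gradients of a saturated softmax.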

---

# **CODE 6: Compute the Final Context Vector for "Journey"**  
```python
context_vec_2 = attn_weights_2 @ values  # Multiply attention weights with values
print(context_vec_2)
```

### **What is happening here?**  
- We **multiply the attention weights with the value vectors**.  
- This creates a **new representation of "journey" that includes information from all words**.  

📌 **Analogy**  
Imagine writing an **essay**.  
- Initially, you write each sentence **independently**.  
- After reviewing the whole essay, you **refine each sentence** by considering the full context.  
- Now, **each sentence makes more sense in the overall story**.  

📌 **Output**  
A **new vector for "journey"** that has **learned from all other words** in the sentence.  

---

# **Final Summary**
| Code | What It Does | Simple Explanation |
|------|-------------|--------------------|
| **Code 1** | Create query, key, and value vectors | Convert words into "searchable" forms |
| **Code 2** | Compute keys and values for all words | Assign skills (keys) and reports (values) to all words |
| **Code 3** | Compute similarity of "journey" with itself | Check if a student matches their own skills |
| **Code 4** | Compute similarity with all words | Compare "journey" with every word in the sentence |
| **Code 5** | Convert scores into probabilities (softmax) | Normalize scores so they sum to 1 |
| **Code 6** | Compute final enriched word representation | Update "journey" with information from all words |

---

# **Final Takeaways**
- **Query** = What we are searching for.  
- **Key** = Information about each word.  
- **Value** = The actual representation of each word.  
- **Attention scores** show how words relate to each other.  
- **Softmax converts scores into probabilities**.  
- **Context vector updates words with information from the entire sentence**.  
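
One way to connect these terms is to view attention as a "soft" dictionary lookup. The sketch below reuses the teacher-and-students analogy with invented skill vectors and report scores (all names and numbers are assumptions for illustration). A hard lookup returns the value of the single best-matching key, while a soft lookup returns a weighted average of all values.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Keys describe each entry, values hold what we actually retrieve
keys = {"math": [1.0, 0.0], "science": [0.8, 0.6], "english": [0.0, 1.0]}
values = {"math": 90.0, "science": 75.0, "english": 60.0}
query = [1.0, 0.2]  # "Who is good at math?"

scores = {name: dot(query, key) for name, key in keys.items()}

# Hard lookup: pick the single best match
best = max(scores, key=scores.get)
hard_result = values[best]

# Soft lookup: softmax over the scores, then a weighted average of the values
m = max(scores.values())
exps = {name: math.exp(s - m) for name, s in scores.items()}
total = sum(exps.values())
weights = {name: e / total for name, e in exps.items()}
soft_result = sum(weights[name] * values[name] for name in values)

print("Best match:", best, "->", hard_result)
print("Attention weights:", {k: round(w, 3) for k, w in weights.items()})
print("Soft lookup result:", round(soft_result, 2))
```

The soft result lies between the individual values because every entry contributes in proportion to its attention weight, which is exactly what a context vector is.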

💡 **Now you understand self-attention!** This is how transformer-based models such as **GPT and BERT** learn meaning from text. 🚀  


In [98]:
import torch.nn as nn


class SelfAttention_v1(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        # Trainable projection matrices for queries, keys, and values
        self.W_query = nn.Parameter(torch.rand(d_in, d_out))
        self.W_key = nn.Parameter(torch.rand(d_in, d_out))
        self.W_value = nn.Parameter(torch.rand(d_in, d_out))

    def forward(self, x):
        keys = x @ self.W_key
        values = x @ self.W_value
        queries = x @ self.W_query
        attn_scores = queries @ keys.T
        # Scale by sqrt(d_k) before softmax; each row of weights sums to 1
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        context_vec = attn_weights @ values
        return context_vec




In [99]:
torch.manual_seed(123)
sa_v1 = SelfAttention_v1(d_in, d_out)
print(sa_v1(inputs))

tensor([[0.2996, 0.8053],
        [0.3061, 0.8210],
        [0.3058, 0.8203],
        [0.2948, 0.7939],
        [0.2927, 0.7891],
        [0.2990, 0.8040]], grad_fn=<MmBackward0>)
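
A common variation replaces the manually created `nn.Parameter` matrices with `nn.Linear` layers (with the bias disabled), which perform the same matrix multiplication but use PyTorch's default weight initialization. The walkthrough further down calls `sa_v2.W_query(inputs)`, which presupposes callable layers of this kind. The sketch below is an assumption about how such a module might look; the class name `SelfAttention_v2`, the `qkv_bias` argument, and the seed are chosen for illustration, and because the initialization differs, its output will not match the values printed above.

```python
import torch
import torch.nn as nn


class SelfAttention_v2(nn.Module):
    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        # nn.Linear stores the weight matrix and applies it when called
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def forward(self, x):
        queries = self.W_query(x)
        keys = self.W_key(x)
        values = self.W_value(x)
        attn_scores = queries @ keys.T
        attn_weights = torch.softmax(attn_scores / keys.shape[-1] ** 0.5, dim=-1)
        return attn_weights @ values


inputs = torch.tensor([
    [0.43, 0.15, 0.89],  # "Your"
    [0.55, 0.87, 0.66],  # "journey"
    [0.57, 0.85, 0.64],  # "starts"
    [0.22, 0.58, 0.33],  # "with"
    [0.77, 0.25, 0.10],  # "one"
    [0.05, 0.80, 0.55],  # "step"
])

torch.manual_seed(789)  # arbitrary seed for reproducibility
sa_v2 = SelfAttention_v2(d_in=3, d_out=2)
out = sa_v2(inputs)
print(out.shape)
```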


This attention module is still missing two key components:
 1) The masking mechanism is missing.
 2) The multi-head module is not implemented.
 

1. Masking Mechanism

In [100]:
context_length = attn_scores.shape[0]
mask_simple = torch.tril(torch.ones(context_length, context_length))
print(mask_simple)

tensor([[1., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1.]])


Now, we can multiply this mask with the attention weights to zero-out the values above
the diagonal

In [101]:
masked_simple = attn_weights*mask_simple
print(masked_simple)

tensor([[0.2098, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1385, 0.2379, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1390, 0.2369, 0.2326, 0.0000, 0.0000, 0.0000],
        [0.1435, 0.2074, 0.2046, 0.1462, 0.0000, 0.0000],
        [0.1526, 0.1958, 0.1975, 0.1367, 0.1879, 0.0000],
        [0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])


As we can see, the elements above the diagonal are successfully zeroed out. Because the rows no longer sum to 1, the next cell divides each row by its sum:


In [102]:
row_sums = masked_simple.sum(dim = -1 , keepdim= True)
masked_simple_norm = masked_simple/ row_sums
print(masked_simple_norm)

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3680, 0.6320, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2284, 0.3893, 0.3822, 0.0000, 0.0000, 0.0000],
        [0.2046, 0.2956, 0.2915, 0.2084, 0.0000, 0.0000],
        [0.1753, 0.2250, 0.2269, 0.1570, 0.2158, 0.0000],
        [0.1385, 0.2184, 0.2128, 0.1420, 0.0988, 0.1896]])


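This third step renormalized the attention weights so that each row sums to 1 again.
A more efficient alternative is to mask the attention scores with negative infinity before applying softmax: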

In [103]:
mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)
masked = attn_scores.masked_fill(mask.bool(), -torch.inf)
print(masked)

tensor([[0.9995,   -inf,   -inf,   -inf,   -inf,   -inf],
        [0.9544, 1.4950,   -inf,   -inf,   -inf,   -inf],
        [0.9422, 1.4754, 1.4570,   -inf,   -inf,   -inf],
        [0.4753, 0.8434, 0.8296, 0.4937,   -inf,   -inf],
        [0.4576, 0.7070, 0.7154, 0.3474, 0.6654,   -inf],
        [0.6310, 1.0865, 1.0605, 0.6565, 0.2935, 0.9450]])


In [104]:
attn_weights = torch.softmax(masked / keys.shape[-1]**0.5, dim=1)
print(attn_weights)

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.4056, 0.5944, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2566, 0.3741, 0.3693, 0.0000, 0.0000, 0.0000],
        [0.2176, 0.2823, 0.2796, 0.2205, 0.0000, 0.0000],
        [0.1826, 0.2178, 0.2191, 0.1689, 0.2115, 0.0000],
        [0.1473, 0.2033, 0.1996, 0.1500, 0.1160, 0.1839]])
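
Note that the renormalized matrix from In [102] (0.3680 and 0.6320 in row 2) differs from the result above (0.4056 and 0.5944). This is because the weights used in In [101] came from a softmax without the `sqrt(d_k)` scaling, whereas In [104] applies it. The following sketch, which uses names chosen for this example, suggests that the two masking approaches agree when the same scaling is used in both.

```python
import torch

inputs = torch.tensor([
    [0.43, 0.15, 0.89],  # "Your"
    [0.55, 0.87, 0.66],  # "journey"
    [0.57, 0.85, 0.64],  # "starts"
    [0.22, 0.58, 0.33],  # "with"
    [0.77, 0.25, 0.10],  # "one"
    [0.05, 0.80, 0.55],  # "step"
])

scores = inputs @ inputs.T
n = scores.shape[0]
scale = 2 ** 0.5  # sqrt(d_k) with d_k = 2, as in the cell above

# Approach 1: softmax, zero out future positions, renormalize each row
weights = torch.softmax(scores / scale, dim=-1)
masked = weights * torch.tril(torch.ones(n, n))
approach_1 = masked / masked.sum(dim=-1, keepdim=True)

# Approach 2: set future positions to -inf, then apply softmax once
future = torch.triu(torch.ones(n, n), diagonal=1).bool()
approach_2 = torch.softmax(scores.masked_fill(future, -torch.inf) / scale, dim=-1)

agree = torch.allclose(approach_1, approach_2, atol=1e-6)
print("Both approaches agree:", agree)
print(approach_2)
```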


### 📌 **How is the Causal Attention Mask Applied? (Step-by-Step Explanation)**  

Causal attention ensures that a token (word) in a sequence **can only attend to itself and the tokens before it** but **not future tokens** (to prevent information leakage). This is **especially important in autoregressive models** like GPT.  

Let's break down the **entire process step by step**:

---

## **1️⃣ Compute Attention Scores and Weights**  

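We first compute the **attention scores** and apply the **softmax function** to obtain attention weights. The snippet assumes `sa_v2` is a self-attention module whose `W_query` and `W_key` are callable layers (for example `nn.Linear`); it is not defined in the cells above, so the example values in this walkthrough differ from the earlier outputs: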

```python
queries = sa_v2.W_query(inputs)  # Compute query vectors
keys = sa_v2.W_key(inputs)  # Compute key vectors
attn_scores = queries @ keys.T  # Compute attention scores (dot product of queries and keys)
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)  # Normalize with softmax
```

### 🔹 **What’s Happening Here?**
- `attn_scores` is a **matrix** that tells us **how much each word attends to every other word**.  
- `torch.softmax(..., dim=-1)` ensures that **each row sums to 1**, meaning that each word distributes its attention across other words.

### 🔹 **Example Output (`attn_weights`)**
```
tensor([[0.1921, 0.1646, 0.1652, 0.1550, 0.1721, 0.1510],
        [0.2041, 0.1659, 0.1662, 0.1496, 0.1665, 0.1477],
        [0.2036, 0.1659, 0.1662, 0.1498, 0.1664, 0.1480],
        [0.1869, 0.1667, 0.1668, 0.1571, 0.1661, 0.1564],
        [0.1830, 0.1669, 0.1670, 0.1588, 0.1658, 0.1585],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]])
```
🔹 **Problem:** Words are attending to **future words**, which **we don’t want** in causal attention.  

---

## **2️⃣ Create a Lower Triangular Mask**
We now create a **mask** to ensure each token **only attends to itself and past tokens**.

```python
context_length = attn_scores.shape[0]
mask_simple = torch.tril(torch.ones(context_length, context_length))
print(mask_simple)
```

### 🔹 **What’s Happening Here?**
- `torch.tril()` creates a **lower triangular matrix** where:  
  - **1s** are for allowed positions (tokens can attend to these).  
  - **0s** are for disallowed positions (tokens **cannot** attend to future tokens).

### 🔹 **Example Output (`mask_simple`)**
```
tensor([[1., 0., 0., 0., 0., 0.],
        [1., 1., 0., 0., 0., 0.],
        [1., 1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0.],
        [1., 1., 1., 1., 1., 0.],
        [1., 1., 1., 1., 1., 1.]])
```
🔹 **Effect:**  
- The **first word** can only attend to itself.  
- The **second word** can attend to itself and the first word.  
- The **third word** can attend to the first two, and so on.  
- **Future words are masked (0s).**  
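
One way to see that the mask really blocks information from future tokens is to change only the last token and confirm that the context vectors of all earlier tokens stay the same. The sketch below uses the simple unscaled attention from earlier in this notebook together with the mask-and-renormalize procedure; the helper `masked_context` and the replacement vector are assumptions for illustration.

```python
import torch


def masked_context(x):
    """Unscaled self-attention with a lower-triangular mask and row renormalization."""
    n = x.shape[0]
    weights = torch.softmax(x @ x.T, dim=-1)
    masked = weights * torch.tril(torch.ones(n, n))
    masked = masked / masked.sum(dim=-1, keepdim=True)
    return masked @ x


inputs = torch.tensor([
    [0.43, 0.15, 0.89],  # "Your"
    [0.55, 0.87, 0.66],  # "journey"
    [0.57, 0.85, 0.64],  # "starts"
    [0.22, 0.58, 0.33],  # "with"
    [0.77, 0.25, 0.10],  # "one"
    [0.05, 0.80, 0.55],  # "step"
])

altered = inputs.clone()
altered[-1] = torch.tensor([1.0, 0.0, 0.0])  # change only the last token

ctx_original = masked_context(inputs)
ctx_altered = masked_context(altered)

earlier_rows_equal = torch.allclose(ctx_original[:-1], ctx_altered[:-1], atol=1e-6)
last_row_equal = torch.allclose(ctx_original[-1], ctx_altered[-1], atol=1e-6)
print("Earlier tokens unaffected:", earlier_rows_equal)
print("Last token unaffected:", last_row_equal)
```

Only the last row changes, because it is the only position allowed to attend to the altered token.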

---

## **3️⃣ Apply the Mask to Attention Weights**
Now we **apply the mask** by multiplying it element-wise (`*` operator):

```python
masked_simple = attn_weights * mask_simple
print(masked_simple)
```

### 🔹 **Effect:**  
- Any value **above the diagonal becomes 0** (future words are ignored).  

### 🔹 **Example Output (`masked_simple`)**
```
tensor([[0.1921, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2041, 0.1659, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2036, 0.1659, 0.1662, 0.0000, 0.0000, 0.0000],
        [0.1869, 0.1667, 0.1668, 0.1571, 0.0000, 0.0000],
        [0.1830, 0.1669, 0.1670, 0.1588, 0.1658, 0.0000],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]])
```
🔹 **Problem:** After masking, **the row sums are no longer 1**.  
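
To see the problem in numbers, the sketch below masks a set of softmax weights and prints the row sums. The weights come from random scores, which is an assumption made so the snippet runs on its own:

```python
import torch

torch.manual_seed(0)
n = 6  # hypothetical number of tokens

weights = torch.softmax(torch.randn(n, n), dim=-1)
masked = weights * torch.tril(torch.ones(n, n))

row_sums = masked.sum(dim=-1)
print(row_sums)  # only the last row still sums to 1
```

The first row keeps a single entry, so its sum equals that one weight; only the last row, which has nothing masked, is still a full distribution.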

---

## **4️⃣ Renormalize Attention Weights**
Since masking removed some values, we **renormalize** to ensure each row still sums to 1:

```python
row_sums = masked_simple.sum(dim=-1, keepdim=True)
masked_simple_norm = masked_simple / row_sums
print(masked_simple_norm)
```

### 🔹 **Effect:**  
- Each row’s sum **is forced back to 1**, making it a **valid probability distribution**.  

### 🔹 **Example Output (`masked_simple_norm`)**
```
tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.5517, 0.4483, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3800, 0.3097, 0.3103, 0.0000, 0.0000, 0.0000],
        [0.2758, 0.2460, 0.2462, 0.2319, 0.0000, 0.0000],
        [0.2175, 0.1983, 0.1984, 0.1888, 0.1971, 0.0000],
        [0.1935, 0.1663, 0.1666, 0.1542, 0.1666, 0.1529]])
```
🔹 **Now the masked attention weights sum to 1** in each row! ✅  

---

## **5️⃣ A More Efficient Trick: Using -∞ Instead of Multiplication**
Instead of using `0s` in the mask and manually renormalizing, we can **directly use -∞ before applying softmax**:

```python
mask = torch.triu(torch.ones(context_length, context_length), diagonal=1)  # Upper triangular mask
masked = attn_scores.masked_fill(mask.bool(), -torch.inf)  # Replace upper triangle with -∞
print(masked)
```

🔹 **Effect:**  
- `masked_fill()` replaces **future positions with -∞**.  
- When we apply `softmax()`, those positions **become 0 automatically**.

### 🔹 **Example Output (`masked`)**
```
tensor([[0.2899, -inf, -inf, -inf, -inf, -inf],
        [0.4656, 0.1723, -inf, -inf, -inf, -inf],
        [0.4594, 0.1703, 0.1731, -inf, -inf, -inf],
        [0.2642, 0.1024, 0.1036, 0.0186, -inf, -inf],
        [0.2183, 0.0874, 0.0882, 0.0177, 0.0786, -inf],
        [0.3408, 0.1270, 0.1290, 0.0198, 0.1290, 0.0078]])
```

Now applying `softmax()`:

```python
attn_weights = torch.softmax(masked / keys.shape[-1]**0.5, dim=-1)
```

🔹 **This is faster & avoids manual renormalization!** 🚀  
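
The two routes give the same weights, which can be confirmed numerically. This is a sketch under assumed sizes (`n = 6`, `d_k = 2`) with random scores standing in for `attn_scores`:

```python
import torch

torch.manual_seed(0)
n, d_k = 6, 2  # hypothetical sequence length and key dimension
scores = torch.randn(n, n)

# Route 1: softmax first, zero out the future, then renormalize each row
weights = torch.softmax(scores / d_k**0.5, dim=-1)
masked = weights * torch.tril(torch.ones(n, n))
route1 = masked / masked.sum(dim=-1, keepdim=True)

# Route 2: fill future positions with -inf, then apply softmax once
future = torch.triu(torch.ones(n, n), diagonal=1).bool()
route2 = torch.softmax(scores.masked_fill(future, -torch.inf) / d_k**0.5, dim=-1)

same = torch.allclose(route1, route2, atol=1e-6)
print(same)
```

The results match because `exp(-inf)` is exactly 0, so the masked positions drop out of the softmax denominator, which is precisely what the manual renormalization does.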

---

### ✅ **Final Thoughts**
- **Causal masking prevents future words from influencing past words.**
- **Multiplying with 0s works, but using -∞ is more efficient.**
- **Renormalization ensures valid attention probabilities.**
- **Softmax automatically handles the masked values.** 🎯

### Masking additional attention weights with dropout

Dropout in deep learning is a technique where randomly selected hidden layer units
are ignored during training, effectively “dropping” them out. This method helps
prevent overfitting by ensuring that a model does not become overly reliant on any
specific set of hidden layer units. It’s important to emphasize that dropout is only used
during training and is disabled afterward.
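
In PyTorch, the switch between the two behaviors is the module's training flag. A small sketch, using an all-ones input chosen only for illustration:

```python
import torch

torch.manual_seed(0)
dropout = torch.nn.Dropout(p=0.5)
x = torch.ones(4, 4)

dropout.train()          # training mode: entries are randomly zeroed
train_out = dropout(x)

dropout.eval()           # evaluation mode: dropout is a no-op
eval_out = dropout(x)

print(train_out)
print(torch.equal(eval_out, x))
```

Calling `model.eval()` on a parent module sets the same flag on every dropout layer it contains.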


In [105]:
torch.manual_seed(123)
dropout = torch.nn.Dropout(0.5)
example = torch.ones(6,6)
print(dropout(example))

tensor([[2., 2., 2., 2., 2., 2.],
        [0., 2., 0., 0., 0., 0.],
        [0., 0., 2., 0., 2., 0.],
        [2., 2., 0., 0., 0., 2.],
        [2., 0., 0., 0., 0., 2.],
        [0., 2., 0., 0., 0., 0.]])


When applying dropout to an attention weight matrix with a rate of 50%, each element
is set to zero independently with probability 0.5, so on average half of the elements
in the matrix are dropped. To compensate for the reduction in active elements, the
values of the remaining elements are scaled up by a factor of 1/(1 - 0.5) = 2. This
scaling keeps the expected value of each attention weight unchanged, so the average
influence of the attention mechanism remains consistent during both the training and
inference phases.
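
The effect of the rescaling shows up clearly on a large tensor. The sketch below assumes a 1000 x 1000 all-ones input, chosen only so that the averages are stable:

```python
import torch

torch.manual_seed(0)
p = 0.5
dropout = torch.nn.Dropout(p)   # modules start in training mode
x = torch.ones(1000, 1000)

out = dropout(x)

kept_fraction = (out != 0).float().mean().item()  # roughly 1 - p
mean_after = out.mean().item()                    # roughly 1.0, the mean of x
print(kept_fraction, mean_after)
```

The kept fraction is only approximately one half, since each element is dropped independently, but the surviving values are scaled by 1/(1 - p) so the mean stays close to that of the input.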

In [106]:
torch.manual_seed(123)
print(dropout(attn_weights))

tensor([[2.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 1.1888, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.7386, 0.0000, 0.0000, 0.0000],
        [0.4352, 0.5646, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.3652, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.4065, 0.0000, 0.0000, 0.0000, 0.0000]])


Before implementing the `CausalAttention` class, let’s ensure that the code can handle
batches consisting of more than one input, so that the class supports the batch outputs
produced by the data loader we implemented in chapter 2.
For simplicity, to simulate such batch inputs, we duplicate the input text example:

In [107]:
batch = torch.stack((inputs, inputs), dim=0)
print(batch.shape)

torch.Size([2, 6, 3])


In [108]:
class CausalAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)
        self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1))

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)
        attn_scores = queries @ keys.transpose(1, 2)
        attn_scores.masked_fill_(self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)
        context_vec = attn_weights @ values
        return context_vec


In [112]:
print(inputs.shape)

torch.Size([6, 3])


In [113]:
torch.manual_seed(123)
context_length = batch.shape[1]
ca = CausalAttention(d_in, d_out, context_length, 0.0)
context_vecs= ca(batch)
print("context_vecs.shape:", context_vecs.shape)

context_vecs.shape: torch.Size([2, 6, 2])
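
A useful sanity check for any causal attention layer is that changing a later token leaves the outputs at earlier positions untouched. The sketch below repeats the class so that it runs on its own and feeds it a random input, which is an assumption made for illustration:

```python
import torch
import torch.nn as nn

class CausalAttention(nn.Module):
    # Same structure as the class above, repeated so the snippet is self-contained
    def __init__(self, d_in, d_out, context_length, dropout, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask", torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        keys, queries, values = self.W_key(x), self.W_query(x), self.W_value(x)
        attn_scores = queries @ keys.transpose(1, 2)
        attn_scores.masked_fill_(self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        attn_weights = torch.softmax(attn_scores / keys.shape[-1] ** 0.5, dim=-1)
        return self.dropout(attn_weights) @ values

torch.manual_seed(123)
ca = CausalAttention(d_in=3, d_out=2, context_length=6, dropout=0.0)

x = torch.rand(1, 6, 3)            # hypothetical input: 1 sequence, 6 tokens
x_changed = x.clone()
x_changed[0, -1] = torch.rand(3)   # alter only the last token

with torch.no_grad():
    out = ca(x)
    out_changed = ca(x_changed)

# Positions 0..4 cannot see the last token, so their outputs are unaffected
unchanged = torch.allclose(out[0, :5], out_changed[0, :5])
print(unchanged)
```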


In [114]:
print(batch)

tensor([[[0.4300, 0.1500, 0.8900],
         [0.5500, 0.8700, 0.6600],
         [0.5700, 0.8500, 0.6400],
         [0.2200, 0.5800, 0.3300],
         [0.7700, 0.2500, 0.1000],
         [0.0500, 0.8000, 0.5500]],

        [[0.4300, 0.1500, 0.8900],
         [0.5500, 0.8700, 0.6600],
         [0.5700, 0.8500, 0.6400],
         [0.2200, 0.5800, 0.3300],
         [0.7700, 0.2500, 0.1000],
         [0.0500, 0.8000, 0.5500]]])


In [111]:
print(context_length)

6


### Implementing Multi-Head Attention

In [121]:
class MultiHeadAttentionWrapper(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        # One independent CausalAttention module per head
        self.heads = nn.ModuleList(
            [CausalAttention(d_in, d_out, context_length, dropout, qkv_bias)
             for _ in range(num_heads)]
        )

    def forward(self, x):
        # Concatenate the per-head outputs along the feature dimension
        return torch.cat([head(x) for head in self.heads], dim=-1)

In [125]:
torch.manual_seed(123)
context_length = batch.shape[1] # This is the number of tokens
d_in, d_out = 3, 2
mha = MultiHeadAttentionWrapper(
    d_in, d_out, context_length, 0.0, num_heads=2
)
context_vecs = mha(batch)
print(context_vecs)
print("context_vecs.shape:", context_vecs.shape)

tensor([[[-0.4519,  0.2216,  0.4772,  0.1063],
         [-0.5874,  0.0058,  0.5891,  0.3257],
         [-0.6300, -0.0632,  0.6202,  0.3860],
         [-0.5675, -0.0843,  0.5478,  0.3589],
         [-0.5526, -0.0981,  0.5321,  0.3428],
         [-0.5299, -0.1081,  0.5077,  0.3493]],

        [[-0.4519,  0.2216,  0.4772,  0.1063],
         [-0.5874,  0.0058,  0.5891,  0.3257],
         [-0.6300, -0.0632,  0.6202,  0.3860],
         [-0.5675, -0.0843,  0.5478,  0.3589],
         [-0.5526, -0.0981,  0.5321,  0.3428],
         [-0.5299, -0.1081,  0.5077,  0.3493]]], grad_fn=<CatBackward0>)
context_vecs.shape: torch.Size([2, 6, 4])


In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads  # size of each head's slice of d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  # combines the concatenated head outputs
        self.dropout = nn.Dropout(dropout)
        self.register_buffer('mask', torch.triu(torch.ones(context_length, context_length), diagonal=1))

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        keys = self.W_key(x)        # (b, num_tokens, d_out)
        queries = self.W_query(x)
        values = self.W_value(x)

        # Split d_out into heads: (b, num_tokens, num_heads, head_dim)
        keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
        values = values.view(b, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)

        # Move the head axis forward: (b, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Per-head scores: (b, num_heads, num_tokens, num_tokens)
        attn_scores = queries @ keys.transpose(-2, -1)
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
        attn_scores.masked_fill_(mask_bool, -torch.inf)

        attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Back to (b, num_tokens, num_heads, head_dim), then merge the heads
        context_vec = (attn_weights @ values).transpose(1, 2)
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
        context_vec = self.out_proj(context_vec)
        return context_vec

### **Understanding Reshaping (`view()`) in Multi-Head Attention**  
Let's break these lines down **step by step** and explain why we reshape the tensors.

---

## **1️⃣ Reshape `keys`, `values`, and `queries`**
```python
keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
values = values.view(b, num_tokens, self.num_heads, self.head_dim)
queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)
```

### ✅ **What is happening?**
- `keys`, `values`, and `queries` are originally **(batch_size, num_tokens, d_out)**.
- `view()` **reshapes them** into **(batch_size, num_tokens, num_heads, head_dim)**.

### 🛠 **Breaking It Down**
| Variable | Original Shape | After `view()` | Why? |
|----------|---------------|----------------|------|
| `keys` / `values` / `queries` | `(batch_size, num_tokens, d_out)` | `(batch_size, num_tokens, num_heads, head_dim)` | Splits `d_out` into `num_heads × head_dim` |

### ❓ **Why do we do this?**
- In **multi-head attention**, instead of using a single attention mechanism, we **split** `d_out` into multiple **heads**.
- Each head learns **different attention patterns** by focusing on different aspects of the input.

---

## **2️⃣ Why `d_out = num_heads * head_dim`?**
When defining the multi-head attention layer, we ensure:
```python
assert (d_out % num_heads == 0), "d_out must be divisible by num_heads"
```
This ensures that:
```
head_dim = d_out // num_heads
```
Each attention head gets a **smaller slice of the total embedding dimension**.

---

## **🛠 Example**
Suppose:
- `batch_size = 2`
- `num_tokens = 4`
- `d_out = 8`
- `num_heads = 2`
- Then, `head_dim = d_out / num_heads = 8 / 2 = 4`

Before reshaping (`keys.shape`):  
```
(batch_size, num_tokens, d_out) = (2, 4, 8)
```
After reshaping (`keys.shape`):  
```
(batch_size, num_tokens, num_heads, head_dim) = (2, 4, 2, 4)
```
Now, each of the **2 heads** gets a 4-dimensional representation per token.
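
The same example can be run directly. The sketch below fills `keys` with consecutive integers (an assumption made so that the split is easy to read) and checks which features land in which head:

```python
import torch

batch_size, num_tokens, d_out, num_heads = 2, 4, 8, 2
head_dim = d_out // num_heads

# Stand-in for the projected keys: consecutive values make the layout visible
keys = torch.arange(batch_size * num_tokens * d_out, dtype=torch.float32)
keys = keys.view(batch_size, num_tokens, d_out)

split = keys.view(batch_size, num_tokens, num_heads, head_dim)
print(split.shape)  # torch.Size([2, 4, 2, 4])

# Head 0 holds features 0..3 of every token, head 1 holds features 4..7
head0_ok = torch.equal(split[:, :, 0, :], keys[:, :, :head_dim])
head1_ok = torch.equal(split[:, :, 1, :], keys[:, :, head_dim:])
print(head0_ok, head1_ok)
```

So the split is simply a partition of each token's feature vector into consecutive chunks of length `head_dim`.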

---

## **🚀 Summary of Each Line**
| Line | What Happens? | Why? |
|------|-------------|------|
| `keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)` | Reshapes keys into multiple heads | Allows **each head** to learn different attention patterns |
| `values = values.view(b, num_tokens, self.num_heads, self.head_dim)` | Reshapes values for multi-head attention | Ensures each head gets its own representation |
| `queries = queries.view(b, num_tokens, self.num_heads, self.head_dim)` | Reshapes queries for multi-head attention | Allows each head to compute attention separately |

This step **prepares the data** for **multi-head attention**, ensuring each head gets its own independent representation.

Two properties of `view()` are worth keeping in mind:

1. **The total number of elements must stay the same** before and after reshaping.  
   - Example: If you have a tensor of shape `(6,)` with 6 elements, you can reshape it into `(2, 3)`, but you can't reshape it to `(2, 4)` because that would require 8 elements.

2. `view()` does not change the **values inside the tensor**, it only **changes the structure** (how the values are grouped).
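
Both points can be checked on a tiny tensor. This is a short sketch using the `(6,)` example from the list:

```python
import torch

t = torch.arange(6)          # shape (6,), 6 elements
m = t.view(2, 3)             # 2 * 3 = 6 elements, so this works
print(m)

# view() returns a new shape over the same underlying memory
shares_storage = m.data_ptr() == t.data_ptr()
print(shares_storage)

view_failed = False
try:
    t.view(2, 4)             # would need 8 elements
except RuntimeError as err:
    view_failed = True
    print("view failed:", err)
```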

### **Why is `view()` used here?**
In the multi-head attention code, `view()` **reshapes** the `keys`, `queries`, and `values` so that each can be split into multiple **attention heads**:
```python
keys = keys.view(b, num_tokens, self.num_heads, self.head_dim)
```
- **Before:** The `keys` tensor is shaped like `(batch_size, num_tokens, d_out)`.
- **After:** It is reshaped to `(batch_size, num_tokens, num_heads, head_dim)`.

This allows each head to independently handle a smaller part of the data. It's like **splitting the work among multiple workers** (attention heads), so each one can focus on a different aspect.

---

### **1️⃣ Transposing Keys, Queries, and Values**
```python
keys = keys.transpose(1,2)
queries = queries.transpose(1,2)
values = values.transpose(1,2)
```
#### ✅ **What Happens Here?**
- The shape of `keys`, `queries`, and `values` before transposing is:
  ```
  (batch_size, num_tokens, num_heads, head_dim)
  ```
- After `.transpose(1,2)`, the shape becomes:
  ```
  (batch_size, num_heads, num_tokens, head_dim)
  ```

#### ❓ **Why Do We Do This?**
- We need to compute **attention scores** separately for each head.
- Moving `num_heads` to the second dimension allows parallel computation over multiple heads.

---

### **2️⃣ Compute Attention Scores**
```python
attn_scores = queries @ keys.transpose(-2,-1)
```
#### ✅ **What Happens Here?**
- `keys.transpose(-2,-1)` swaps the last two dimensions, changing its shape:
  ```
  (batch_size, num_heads, head_dim, num_tokens)
  ```
- Now, when we do:
  ```
  attn_scores = queries @ keys.transpose(-2,-1)
  ```
  - This performs **matrix multiplication** between:
    ```
    (batch_size, num_heads, num_tokens, head_dim)  
    @ (batch_size, num_heads, head_dim, num_tokens)
    ```
  - The result is:
    ```
    (batch_size, num_heads, num_tokens, num_tokens)
    ```
  - This matrix contains **attention scores** indicating how much each token should attend to every other token.

#### ❓ **Why Do We Do This?**
- We measure how similar each query (`Q`) is to each key (`K`).
- This similarity score determines how much focus each token should have on others.
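
The shapes, and the meaning of a single entry, can be verified on random tensors. The sizes below are assumptions chosen for illustration:

```python
import torch

torch.manual_seed(0)
b, num_heads, num_tokens, head_dim = 2, 2, 4, 4  # hypothetical sizes

# Build tensors in the (b, num_tokens, num_heads, head_dim) layout, then transpose
queries = torch.randn(b, num_tokens, num_heads, head_dim).transpose(1, 2)
keys = torch.randn(b, num_tokens, num_heads, head_dim).transpose(1, 2)

attn_scores = queries @ keys.transpose(-2, -1)
print(attn_scores.shape)  # torch.Size([2, 2, 4, 4])

# Entry [i, h, q, k] is the dot product of query q and key k inside head h
manual = torch.dot(queries[0, 1, 2], keys[0, 1, 3])
matches = torch.allclose(attn_scores[0, 1, 2, 3], manual, atol=1e-5)
print(matches)
```

The `@` operator treats the leading `(b, num_heads)` dimensions as batch dimensions, so every head in every sequence gets its own `num_tokens × num_tokens` score matrix in a single call.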

---

For reference, here is the rest of the `forward()` method in one piece:

```python
keys = keys.transpose(1, 2)
queries = queries.transpose(1, 2)
values = values.transpose(1, 2)

attn_scores = queries @ keys.transpose(-2, -1)
mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
attn_scores.masked_fill_(mask_bool, -torch.inf)

attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
attn_weights = self.dropout(attn_weights)

context_vec = (attn_weights @ values).transpose(1, 2)
context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
context_vec = self.out_proj(context_vec)  # final linear projection over the merged heads

return context_vec
```

---

### **🔹 Compute Attention Scores**
```python
attn_scores = queries @ keys.transpose(-2, -1)
```
- Computes attention scores:  
  - `queries @ keys^T` → `(b, num_heads, num_tokens, num_tokens)`.  
  - Measures how well each token attends to every other token.  

---

### **🔹 Apply Causal Mask**
```python
mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
attn_scores.masked_fill_(mask_bool, -torch.inf)
```
- **Mask future tokens** by setting unwanted values to `-inf` (ensures no information flows from future tokens).  

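**🚨 Common typo:** doubling the method name raises an `AttributeError`, because tensors have no such method:
```python
attn_scores.masked_fill_masked_fill_(mask_bool, -torch.inf)  # ❌ Typo
```
- The correct call is `masked_fill_` (the trailing underscore marks the in-place variant of `masked_fill`):
  ```python
  attn_scores.masked_fill_(mask_bool, -torch.inf)
  ```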

---

### **🔹 Compute Attention Weights**
```python
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
```
- Uses **softmax** to normalize scores into probabilities.  
- Dividing by `sqrt(head_dim)` keeps the scores from growing with the head dimension, so softmax does not saturate and gradients stay usable.  
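
The reason for the `sqrt(head_dim)` factor can be seen empirically. The sketch below assumes queries and keys with independent, unit-variance entries (an idealization, since trained projections need not behave this way) and a hypothetical `head_dim` of 64:

```python
import torch

torch.manual_seed(0)
head_dim = 64          # hypothetical head size
num_samples = 10_000

q = torch.randn(num_samples, head_dim)
k = torch.randn(num_samples, head_dim)

# One dot product per (q, k) pair
dots = (q * k).sum(dim=-1)

raw_var = dots.var().item()                         # close to head_dim
scaled_var = (dots / head_dim**0.5).var().item()    # close to 1
print(raw_var, scaled_var)
```

Under these assumptions the variance of a dot product grows linearly with `head_dim`, so dividing by its square root brings the scores back to roughly unit variance regardless of head size.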

---

### **🔹 Apply Dropout to Attention Weights**
```python
attn_weights = self.dropout(attn_weights)
```
- Randomly drops some attention weights (helps generalization).  

---

### **1️⃣ Compute Context Vector from Attention Weights**
```python
context_vec = (attn_weights @ values).transpose(1, 2)
```
#### ✅ **What is happening?**
- `attn_weights` (the attention probabilities) is multiplied with `values` (the value projections of the input tokens).
- This gives a **weighted sum of values** for each token, based on how much attention it should give to other tokens.

#### ❓ **Why do we do this?**
- This step **creates new representations** for each token by combining information from other tokens.
- The model decides which tokens are important and **blends their information together**.

#### 🛠 **Transpose Explanation**
```python
.transpose(1, 2)
```
- This swaps the second and third dimensions.
- Changes shape from:
  ```
  (batch_size, num_heads, num_tokens, head_dim)
  ```
  - to  
  ```
  (batch_size, num_tokens, num_heads, head_dim)
  ```
- **Why?** → Makes it easier to reshape in the next step.

---

### **2️⃣ Reshape to Merge Attention Heads**
```python
context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
```
#### ✅ **What is happening?**
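- `.contiguous()` copies the tensor into a contiguous memory layout. This is needed because `transpose()` only changes strides, and `view()` cannot merge dimensions whose strides are incompatible.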
- `.view(b, num_tokens, self.d_out)` reshapes the tensor from:
  ```
  (batch_size, num_tokens, num_heads, head_dim)
  ```
  - to  
  ```
  (batch_size, num_tokens, d_out)
  ```
  where `d_out = num_heads * head_dim`.

#### ❓ **Why do we do this?**
- Each attention **head** computed a separate representation.
- Now, we **combine all heads into a single representation** per token.
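
The need for `.contiguous()` and the effect of the merge can both be demonstrated on a random tensor. The sizes below are assumptions for illustration:

```python
import torch

torch.manual_seed(0)
b, num_heads, num_tokens, head_dim = 2, 2, 4, 4  # hypothetical sizes
d_out = num_heads * head_dim

# Stand-in for (attn_weights @ values): one output per head
per_head = torch.randn(b, num_heads, num_tokens, head_dim)

swapped = per_head.transpose(1, 2)      # (b, num_tokens, num_heads, head_dim)
print(swapped.is_contiguous())          # False: only the strides changed

view_failed = False
try:
    swapped.view(b, num_tokens, d_out)  # view() cannot merge these dimensions
except RuntimeError:
    view_failed = True

merged = swapped.contiguous().view(b, num_tokens, d_out)

# Merging is the same as concatenating the head outputs along the feature axis
expected = torch.cat([per_head[:, h] for h in range(num_heads)], dim=-1)
same_as_cat = torch.equal(merged, expected)
print(view_failed, same_as_cat)
```

This also shows that the merged tensor has the same layout as the output of the `MultiHeadAttentionWrapper`, which concatenates head outputs explicitly with `torch.cat`.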

---

### **3️⃣ Apply Final Linear Transformation**
```python
context_vec = self.out_proj(context_vec)
```
#### ✅ **What is happening?**
- `self.out_proj` is a **fully connected linear layer** (`nn.Linear`) that applies a transformation to `context_vec`.

#### ❓ **Why do we do this?**
- After the merge, each head's result simply sits in its own slice of the vector. `out_proj` maps `d_out` to `d_out`, so the shape is unchanged, but it lets the model **mix information across heads**.
- This **refines the information** before passing it to the next layers.

---

### **🚀 Summary of Each Line**
| Line | What Happens? | Why? |
|------|-------------|------|
| `context_vec = (attn_weights @ values).transpose(1, 2)` | Computes weighted sum of values based on attention | Lets each token "learn" from others |
| `context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)` | Merges multiple attention heads into one representation | Combines information from all heads |
| `context_vec = self.out_proj(context_vec)` | Applies a final linear transformation (`d_out` → `d_out`) | Mixes information across heads before the next layer |
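
Putting the steps together, here is a compact, self-contained sketch of the whole module with a usage example. The `_split_heads` helper and the random input batch are assumptions introduced for illustration; the logic follows the steps walked through above.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask", torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def _split_heads(self, t, b, num_tokens):
        # Hypothetical helper: (b, num_tokens, d_out) -> (b, num_heads, num_tokens, head_dim)
        return t.view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, x):
        b, num_tokens, _ = x.shape
        queries = self._split_heads(self.W_query(x), b, num_tokens)
        keys = self._split_heads(self.W_key(x), b, num_tokens)
        values = self._split_heads(self.W_value(x), b, num_tokens)

        # Per-head scores with future positions masked out
        attn_scores = queries @ keys.transpose(-2, -1)
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
        attn_scores = attn_scores.masked_fill(mask_bool, -torch.inf)

        attn_weights = torch.softmax(attn_scores / self.head_dim**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Merge heads back into (b, num_tokens, d_out) and project
        context_vec = (attn_weights @ values).transpose(1, 2)
        context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
        return self.out_proj(context_vec)

torch.manual_seed(123)
batch = torch.rand(2, 6, 3)  # hypothetical batch: 2 sequences, 6 tokens, 3 features
mha = MultiHeadAttention(d_in=3, d_out=4, context_length=6, dropout=0.0, num_heads=2)
context_vecs = mha(batch)
print("context_vecs.shape:", context_vecs.shape)  # torch.Size([2, 6, 4])
```

Unlike the wrapper approach, which sets the size of each head and ends up with `num_heads * d_out` output features, here `d_out` is the total output size and is divided among the heads.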
