⚙️ Let’s spin up the **neuronal machinery** of the transformer —  
We’re no longer just using GPT.  
We’re going *inside its brain.*  

---

# 🧪 `08_lab_transformer_forward_pass_step_by_step.ipynb`  
### 📁 `05_llm_engineering/01_llm_fundamentals`  
> Manually implement and visualize a **single Transformer block**  
> → Includes **single-head self-attention, residuals, layer norm, and an MLP**  
> → This is the **deepest you’ll ever debug a Transformer without crying** 😅

---

## 🎯 Learning Goals

- Understand **Transformer block architecture**  
- Implement attention, MLP, skip connections manually  
- Visualize **attention matrices** and **token dependencies**  
- Compare to PyTorch/HF equivalents

---

## 💻 Runtime Specs

| Feature       | Spec                   |
|---------------|------------------------|
| Framework     | PyTorch ✅             |
| Sequence      | Short input strings ✅ |
| Attention     | Manual + visual ✅     |
| Platform      | Colab-friendly ✅      |

---

## 🧠 Section 1: Input Tokens

```python
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

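# Toy vocabulary: each token in the sentence gets its own id.
# Note: this shortcut assumes every token is unique.
tokens = ["hello", "i", "am", "moooaahhh"]
vocab = {tok: i for i, tok in enumerate(tokens)}
x = torch.tensor([vocab[tok] for tok in tokens]).unsqueeze(0)  # shape (1, 4)

embed_dim = 8  # size of each token vector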
```

---

## 🔤 Section 2: Embedding Layer

```python
embedding = nn.Embedding(len(vocab), embed_dim)
x_embed = embedding(x)  # shape (1, 4, 8)
```
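
To make the embedding step less magical, here is a small sketch showing that `nn.Embedding` is just a lookup table: calling the layer returns the same rows you get by indexing its weight matrix with the token ids. The seed and the name `ids` are assumptions added for this illustration, not part of the lab.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # assumption: fixed seed only so reruns match

tokens = ["hello", "i", "am", "moooaahhh"]
vocab = {tok: i for i, tok in enumerate(tokens)}
ids = torch.tensor([vocab[tok] for tok in tokens]).unsqueeze(0)  # (1, 4)

embedding = nn.Embedding(len(vocab), 8)

# Calling the layer is the same as indexing its weight matrix by token id.
looked_up = embedding(ids)        # (1, 4, 8)
indexed = embedding.weight[ids]   # (1, 4, 8)

print(looked_up.shape)
print(torch.equal(looked_up, indexed))
```

The rows start out random. They only become meaningful once training nudges them.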

---

## 🧠 Section 3: Self-Attention Layer (1 head)

```python
class SimpleSelfAttention(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        self.q = nn.Linear(embed_dim, embed_dim)
        self.k = nn.Linear(embed_dim, embed_dim)
        self.v = nn.Linear(embed_dim, embed_dim)
        self.scale = embed_dim ** 0.5

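    def forward(self, x):
        # x: (batch, seq_len, embed_dim)
        Q, K, V = self.q(x), self.k(x), self.v(x)
        # Scaled dot-product scores: (batch, seq_len, seq_len)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / self.scale
        # Each query row becomes a probability distribution over keys
        weights = torch.softmax(scores, dim=-1)
        out = torch.matmul(weights, V)  # weighted sum of values
        return out, weights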
```

```python
attn = SimpleSelfAttention(embed_dim)
attn_out, weights = attn(x_embed)
```
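
As a sanity check against a PyTorch equivalent, the sketch below redoes the same math by hand and compares it with `torch.nn.functional.scaled_dot_product_attention`. It assumes PyTorch 2.0 or newer, where that function exists. The names `x_demo`, `q_proj`, `k_proj`, and `v_proj` are stand-ins invented here so the cell runs on its own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
embed_dim = 8
x_demo = torch.randn(1, 4, embed_dim)  # hypothetical stand-in for x_embed

q_proj = nn.Linear(embed_dim, embed_dim)
k_proj = nn.Linear(embed_dim, embed_dim)
v_proj = nn.Linear(embed_dim, embed_dim)
Q, K, V = q_proj(x_demo), k_proj(x_demo), v_proj(x_demo)

# Manual path: same formula as SimpleSelfAttention.forward
scores = Q @ K.transpose(-2, -1) / embed_dim ** 0.5
weights_manual = torch.softmax(scores, dim=-1)
out_manual = weights_manual @ V

# Built-in path (assumes PyTorch >= 2.0). An explicit head dimension of
# size 1 is added so the input is (batch, heads, seq_len, dim).
out_builtin = F.scaled_dot_product_attention(
    Q.unsqueeze(1), K.unsqueeze(1), V.unsqueeze(1)
).squeeze(1)

row_sums = weights_manual.sum(dim=-1)  # every query row should sum to 1
max_diff = (out_manual - out_builtin).abs().max().item()
print("row sums:", row_sums)
print("max abs difference:", max_diff)
```

The two outputs should agree up to floating-point noise. The built-in does not hand back the attention weights, which is why the manual version is the one to use for visualization.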

---

## 🎨 Section 4: Visualize Attention Matrix

```python
import seaborn as sns

plt.figure(figsize=(6, 4))
sns.heatmap(weights[0].detach().numpy(), annot=True, xticklabels=tokens, yticklabels=tokens, cmap="Blues")
plt.title("Self-Attention Weights")
plt.xlabel("Key")
plt.ylabel("Query")
plt.show()
```

---

## 🧪 Section 5: Add LayerNorm + Residual + FeedForward

```python
class TransformerBlock(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        self.attn = SimpleSelfAttention(embed_dim)
        self.norm1 = nn.LayerNorm(embed_dim)
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),
            nn.ReLU(),
            nn.Linear(4 * embed_dim, embed_dim)
        )
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        attn_out, weights = self.attn(x)
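        x = self.norm1(x + attn_out)  # Residual + Norm (post-norm ordering)
        ff_out = self.ff(x)  # position-wise MLP: embed_dim -> 4*embed_dim -> embed_dim
        x = self.norm2(x + ff_out)  # second Residual + Norm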
        return x, weights
```
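
To see what the `LayerNorm` calls in the block do, here is a minimal sketch that normalizes each token vector by hand and compares the result with `nn.LayerNorm`. It assumes a freshly created layer (weight 1, bias 0), and `h` is a made-up activation tensor rather than anything produced by the lab.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim = 8
h = torch.randn(1, 4, embed_dim)  # hypothetical activations (batch, seq, dim)

norm = nn.LayerNorm(embed_dim)  # fresh layer: weight = 1, bias = 0

# Normalize each token vector over its last dimension.
mean = h.mean(dim=-1, keepdim=True)
var = h.var(dim=-1, keepdim=True, unbiased=False)  # LayerNorm uses biased variance
h_manual = (h - mean) / torch.sqrt(var + norm.eps)

h_builtin = norm(h)

print(torch.allclose(h_manual, h_builtin, atol=1e-5))
print(h_builtin.mean(dim=-1))  # close to 0 for every token
```

Each token is normalized independently of the others, so the statistics never mix information across positions.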

---

## 🧪 Section 6: Final Forward Pass

```python
block = TransformerBlock(embed_dim)
out, final_weights = block(x_embed)
print("Final shape:", out.shape)
```
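
The block above uses a single attention head. Production models usually split `embed_dim` across several heads that attend in parallel. The class below is a hedged sketch of that idea, not code from the lab: the name `MultiHeadSelfAttention`, the fused `qkv` projection, and the output projection `out` are all choices made for this illustration.

```python
import torch
import torch.nn as nn


class MultiHeadSelfAttention(nn.Module):
    """Hypothetical multi-head variant of SimpleSelfAttention (illustration only)."""

    def __init__(self, embed_dim, num_heads):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must divide evenly into heads"
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)  # fused Q, K, V projection
        self.out = nn.Linear(embed_dim, embed_dim)      # mixes the heads back together

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)  # each (B, T, C)

        def split_heads(t):
            # (B, T, C) -> (B, num_heads, T, head_dim)
            return t.reshape(B, T, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split_heads(q), split_heads(k), split_heads(v)
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5  # (B, H, T, T)
        weights = torch.softmax(scores, dim=-1)
        heads = weights @ v                                      # (B, H, T, head_dim)
        merged = heads.transpose(1, 2).reshape(B, T, C)          # concatenate heads
        return self.out(merged), weights


torch.manual_seed(0)
mha = MultiHeadSelfAttention(embed_dim=8, num_heads=2)
y, w = mha(torch.randn(1, 4, 8))
print(y.shape, w.shape)
```

With two heads you get two 4×4 attention maps instead of one, so `w[0, 0]` and `w[0, 1]` can each be passed to the heatmap from Section 4.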

---

## ✅ Wrap-Up Summary

| Feature                      | Done |
|------------------------------|----|
| Token embeddings created     | ✅ |
| Manual attention implemented | ✅ |
| Visual attention matrix      | ✅ |
| Residual + FF + LayerNorm    | ✅ |

---

## 🧠 What You Learned

- Transformers rely on **dot products of Q & K** to decide “who attends to whom”  
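- Skip connections keep gradients flowing through deep stacks, and LayerNorm keeps activation scales in check, which together **stabilize training**  
- This lab = a **mental microscope** on the block at the core of GPT, BERT, and LLaMA (real models add multiple heads, masking, and their own norm and activation choices)  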
- You now **own the block**, not just use it
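
The "who attends to whom" point can be checked by hand. This sketch uses hand-picked 2-d queries and keys for three hypothetical tokens, with no learned weights, so the highest-scoring key for each query is easy to predict before running it.

```python
import torch

# Hand-picked queries and keys for three hypothetical tokens (illustration only).
Q = torch.tensor([[1.0, 0.0],
                  [0.0, 1.0],
                  [-1.0, -1.0]])
K = torch.tensor([[2.0, 0.0],
                  [0.0, 2.0],
                  [-1.0, -1.0]])

# Scaled dot products: large when a query points the same way as a key.
scores = Q @ K.T / Q.shape[-1] ** 0.5
weights = torch.softmax(scores, dim=-1)

top_key = weights.argmax(dim=-1)  # index of the most-attended key per query
print(weights)
print(top_key)
```

Each query lines up best with the key pointing in the same direction, so query 0 attends mostly to key 0, query 1 to key 1, and query 2 to key 2.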

---

Next lab is **logit-level decoding**:  
> `09_lab_prompt_patterns_and_token_logprobs.ipynb`  
We’ll send prompts into a real LLM, extract **token-wise logits**, and experiment with **top-k, top-p, temperature**.

Moooaaahh mode activated?