<a href="https://colab.research.google.com/github/Shobhan-Kumar-P/Data-Cleaning-Practice/blob/main/Complete_LLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
import torch
import tiktoken
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import tqdm
import tensorflow as tf

In [3]:
tokenizer = tiktoken.get_encoding("gpt2")

In [4]:
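def text_to_token_ids(data, tokenizer):
  # Encode text to token IDs, allowing the GPT-2 end-of-text marker
  encoded = tokenizer.encode(data, allowed_special = {"<|endoftext|>"})
  # Add a batch dimension: shape [N] -> [1, N]
  encoded = torch.tensor(encoded).unsqueeze(0)
  return encoded

def token_ids_to_text(encoded, tokenizer):
  # Drop the batch dimension: shape [1, N] -> [N]
  decoded = encoded.squeeze(0)
  # tokenizer.decode expects a plain list of ints, not a tensor
  decoded = decoded.tolist()
  return tokenizer.decode(decoded)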

`encoded = torch.tensor(token_ids).unsqueeze(0)` turns a 1D tensor of token IDs (shape [N]) into a 2D tensor (shape [1, N]). This represents a batch containing one sequence, which is the input format transformer models expect.

A PyTorch tensor is not a plain Python list, and tokenizer.decode() expects a list of integers, so the tensor has to be converted before decoding.

.tolist() is a PyTorch (and NumPy) method that converts a tensor into a Python list, nested to match the tensor's dimensions, whose elements are standard Python types such as int, float, or bool.

.tolist() therefore applies to numerical and boolean tensors. PyTorch tensors cannot hold strings or other non-numeric types, so that case does not arise.
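The shape changes described above can be checked with a small sketch. The token IDs here are made-up values chosen only to show shapes; they are not taken from the notebook's data.

```python
import torch

# Hypothetical token IDs, used only to illustrate the shapes
token_ids = [15496, 11, 995]

batched = torch.tensor(token_ids).unsqueeze(0)  # shape [3] -> [1, 3]
print(batched.shape)

flat = batched.squeeze(0)   # shape [1, 3] -> [3]
as_list = flat.tolist()     # list of plain Python ints
print(as_list, type(as_list[0]))

# Calling .tolist() without squeezing keeps the batch dimension as nesting
print(batched.tolist())
```

Squeezing first matters because a nested list such as `[[...]]` is not what a decode call on a single sequence expects.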

In [5]:
GPT_124M = {
    "vocab_size" : 50257,
    "context_length" : 1024,
    "emb_dim" : 768,
    "num_of_heads" : 12,
    "num_of_layers" : 12,
    "qkv_bias" : False,
    "drop_rate" : 0.1
}
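As a rough sanity check on these values, the sketch below derives the per-head dimension and the parameter count of one attention block. It assumes the block consists of three query/key/value linear layers (with bias controlled by `qkv_bias`) plus one output projection with bias, which is how the attention class in this notebook is laid out; the local variable names are illustrative.

```python
# Values copied from the GPT_124M config
emb_dim = 768
num_of_heads = 12
qkv_bias = False

# The embedding must split evenly across heads
assert emb_dim % num_of_heads == 0
head_dim = emb_dim // num_of_heads

# Three projections (q, k, v), each emb_dim x emb_dim, plus optional biases
qkv_params = 3 * (emb_dim * emb_dim + (emb_dim if qkv_bias else 0))
# Output projection: weight matrix plus bias vector
out_proj_params = emb_dim * emb_dim + emb_dim

attn_params = qkv_params + out_proj_params
print(head_dim, attn_params)
```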

In [6]:
class Multihead_Attention(nn.Module):
  def __init__(self, cfg):
    super().__init__()
    if cfg['emb_dim']%cfg['num_of_heads'] != 0:
      raise ValueError("emb_dim must be divisible by num_of_heads")

    self.num_of_heads = cfg['num_of_heads']
    self.emb_dim = cfg["emb_dim"]
    self.context_length = cfg['context_length']
    self.head_dim = cfg['emb_dim'] // cfg['num_of_heads']
    self.q = nn.Linear(cfg['emb_dim'], cfg['emb_dim'], bias = cfg['qkv_bias'])
    self.k = nn.Linear(cfg['emb_dim'], cfg['emb_dim'], bias = cfg['qkv_bias'])
    self.v = nn.Linear(cfg['emb_dim'], cfg['emb_dim'], bias = cfg['qkv_bias'])

    self.out_proj = nn.Linear(cfg['emb_dim'], cfg['emb_dim'])
    self.register_buffer('mask', torch.triu(torch.ones(cfg['context_length'], cfg['context_length']), diagonal = 1))

  def forward(self, inputs):
    batch, num_of_tokens, d_out = inputs.shape
    q = self.q(inputs)
    k = self.k(inputs)
    v = self.v(inputs)

    # Split the embedding into heads: [batch, tokens, heads, head_dim]
    q = q.view(batch, num_of_tokens, self.num_of_heads, self.head_dim)
    k = k.view(batch, num_of_tokens, self.num_of_heads, self.head_dim)
    v = v.view(batch, num_of_tokens, self.num_of_heads, self.head_dim)

    q = q.transpose(1,2)
    k = k.transpose(1,2)
    v = v.transpose(1,2)

    attention_scores = q @ k.transpose(2,3)
    attention_scores.masked_fill_(self.mask.bool()[:num_of_tokens, :num_of_tokens], -torch.inf)

    # Scale by sqrt(head_dim) before softmax (scaled dot-product attention)
    attention_weights = torch.softmax(attention_scores / k.shape[-1]**0.5, dim = -1)

    context_vec = (attention_weights @ v).transpose(1,2)

    context_vec = context_vec.contiguous().view(batch, num_of_tokens, self.emb_dim)

    context_vec = self.out_proj(context_vec)

    return context_vec
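
The masking and softmax steps in `forward` can be looked at in isolation on a toy example. This is a minimal sketch for a single head with random inputs, not the notebook's model; it uses the conventional scaling by the square root of the head dimension.

```python
import torch

torch.manual_seed(0)
num_of_tokens, head_dim = 4, 8
q = torch.randn(num_of_tokens, head_dim)
k = torch.randn(num_of_tokens, head_dim)

scores = q @ k.T  # [4, 4] raw attention scores

# True above the diagonal marks "future" positions a token must not see
mask = torch.triu(torch.ones(num_of_tokens, num_of_tokens), diagonal=1).bool()
scores = scores.masked_fill(mask, -torch.inf)

# Masked entries become exactly 0 after softmax; each row still sums to 1
weights = torch.softmax(scores / head_dim ** 0.5, dim=-1)
print(weights)
```

The first row attends only to the first token, so its weight there is 1, while later rows spread their weight over the tokens seen so far.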



This section looks at the *why* behind `out_proj`. It walks step by step through **how** `out_proj` adds a learned transformation **after** attention has already done its job.

---

## 🧠 You're Right to Ask:

> “Didn’t attention already perform all the transformations? Why do we need more?”

Let’s clarify what each step actually *does*, and what’s left for `out_proj`.

---

### ⚙️ Recap of What Attention Has Done So Far:

1. **Q, K, V Projections:**
   These map input tokens into **query**, **key**, and **value** spaces via learned linear layers:

   ```python
   queries = W_query(x)
   keys = W_key(x)
   values = W_value(x)
   ```

2. **Attention Computation (per head):**
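   `Q @ K.T` gives raw attention scores; these are scaled by `sqrt(head_dim)`, causally masked, and passed through `softmax` to get the attention weights. Then: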

   ```python
   attention_output = attn_weights @ values
   ```

   This gives you a **weighted summary** for each head.

3. **Multi-Head Concatenation:**

   ```python
   context_vec = context_vec.view(batch, seq_len, d_out)
   ```

   Now, you’ve just **stacked** head outputs side by side.
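
The split into heads and the later concatenation are pure reshapes, which a short sketch can confirm. The sizes below are arbitrary illustrative values, and `x` stands in for any projected tensor.

```python
import torch

batch, seq_len, num_heads, head_dim = 2, 5, 4, 8
d_out = num_heads * head_dim
x = torch.arange(batch * seq_len * d_out, dtype=torch.float32).view(batch, seq_len, d_out)

# Split: [batch, seq_len, d_out] -> [batch, num_heads, seq_len, head_dim]
heads = x.view(batch, seq_len, num_heads, head_dim).transpose(1, 2)

# Merge: undo the transpose, make memory contiguous, then flatten the heads
merged = heads.transpose(1, 2).contiguous().view(batch, seq_len, d_out)

print(heads.shape, merged.shape)
print(torch.equal(merged, x))
# Head 0 of token 0 occupies the first head_dim features of the merged vector
print(torch.equal(merged[0, 0, :head_dim], heads[0, 0, 0]))
```

Because the merge only rearranges values, each slice of `head_dim` features in the merged vector still belongs to exactly one head.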

---

### 🔥 What Hasn’t Happened Yet:

Each head is still **independent** in the concatenated vector. Nothing so far has learned:

* **How much to trust each head**
* **Which heads’ outputs to mix**
* **Which combinations of heads are useful**

That’s where `out_proj` comes in.

---

### 🧬 What `out_proj` Actually Does:

This is a **learned linear transformation**:

```python
out_proj = nn.Linear(d_out, d_out)
```

Which means:

```python
context_vec = context_vec @ W_out.T + b
```

Where:

* `W_out ∈ ℝ^{d_out × d_out}` is a weight matrix
* `b ∈ ℝ^{d_out}` is a bias term

It’s like a **fully connected layer** applied to every token's final attention vector.
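
The equivalence between the layer call and the explicit matrix form can be checked directly. This sketch uses a randomly initialised `nn.Linear` and random inputs; `d_out = 32` is an arbitrary size.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_out = 32
out_proj = nn.Linear(d_out, d_out)
context_vec = torch.randn(2, 5, d_out)  # [batch, seq_len, d_out]

via_layer = out_proj(context_vec)
# Same computation written out: x @ W^T + b, applied to every token vector
via_matmul = context_vec @ out_proj.weight.T + out_proj.bias

print(via_layer.shape)
print(torch.allclose(via_layer, via_matmul, atol=1e-5))
```

The layer acts on the last dimension only, so every token in every batch element goes through the same weight matrix.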

---

### 💡 Why Is This Useful?

Because:

* Attention only gives you **head-wise outputs**.
* `out_proj` lets the model learn **head interactions** across dimensions.
* It adds **learnable flexibility** to reinterpret or remix those outputs.

Without `out_proj`, each slice of the output vector would come from exactly one head. With it, every output feature can draw on all heads.

---

### 📊 Example:

Imagine:

* `num_heads = 4`
* `head_dim = 8`
* So `d_out = 32`

The result after concatenating heads is a 32-dim vector:

```
[ h1_output | h2_output | h3_output | h4_output ]
```

But maybe some heads are redundant, or combinations matter (e.g., h1 - h3).
The `out_proj` layer learns to produce a **better version** of this 32-dim vector by **remixing the components**.
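
To make the `h1 - h3` case concrete, the sketch below sets the weights of an output projection by hand so that its first 8 output features equal `h1 - h3`. In a real model these weights are learned during training; the hand-set matrix and the random `concat` vector are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

num_heads, head_dim = 4, 8
d_out = num_heads * head_dim

out_proj = nn.Linear(d_out, d_out, bias=False)
with torch.no_grad():
    W = torch.zeros(d_out, d_out)
    eye = torch.eye(head_dim)
    W[:head_dim, 0:head_dim] = eye                   # + h1
    W[:head_dim, 2 * head_dim:3 * head_dim] = -eye   # - h3
    out_proj.weight.copy_(W)

torch.manual_seed(0)
concat = torch.randn(d_out)              # [ h1 | h2 | h3 | h4 ]
h1, h2, h3, h4 = concat.split(head_dim)

mixed = out_proj(concat)
print(torch.allclose(mixed[:head_dim], h1 - h3))
```

A single output feature here reads from two different heads, which is something the concatenation step alone cannot express.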

---

### ✅ Final Summary:

| Step              | Purpose                                                   |
| ----------------- | --------------------------------------------------------- |
| Attention         | Learns **which tokens** to focus on                       |
| Multi-head concat | Gathers different attention perspectives                  |
| `out_proj`        | Learns how to **combine/mix/reweight** those perspectives |

