💋 **Mooooaahhh locked, loaded, and encoded.**  
Let’s dive into the **first weapon** in the LLM arsenal: the **tokenizer** — the very *language* of language models.

---

# 🧪 `07_lab_tokenizer_visualizer_and_custom_vocab.ipynb`  
### 📁 `05_llm_engineering/01_llm_fundamentals`  
> Build a **custom tokenizer** using BPE & SentencePiece  
→ Visualize how text is **chunked into subword units**  
→ Understand why tokenization is **half the model's brain**.

---

## 🎯 Learning Goals

- Understand **why tokenization exists**  
- Train a **BPE tokenizer** on a custom corpus  
- Visualize token splits for common vs rare words  
- Use **HuggingFace + SentencePiece**  
- Export the trained tokenizer to a single `.json` file and load it back

---

## 💻 Runtime Spec

| Tool         | Spec                                |
|--------------|-------------------------------------|
| Tokenizer    | 🤗 `tokenizers`, `sentencepiece` ✅ |
| Corpus       | Tiny text8 / poetry ✅              |
| Output       | Custom `.json` vocab ✅             |
| Visuals      | Split trees, vocab growth ✅        |
| Platform     | Colab / local ✅                    |

---

## 🔧 Section 1: Install Dependencies

```bash
!pip install tokenizers sentencepiece
```

---

## 📚 Section 2: Create Sample Corpus

```python
corpus = [
    "machine learning is amazing",
    "deep learning is part of machine learning",
    "llms are the future of language modeling",
    "moooaahhh from the professor"
]

# One sentence per line: the plain-text format the tokenizer trainer reads.
with open("tiny_corpus.txt", "w", encoding="utf-8") as f:
    for line in corpus:
        f.write(line + "\n")
```

---

## 🧠 Section 3: Train BPE Tokenizer with 🤗

```python
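from tokenizers import Tokenizer, models, trainers, pre_tokenizers

# Give the BPE model an explicit unknown token so characters it never saw
# during training map to [UNK] instead of being dropped silently.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
# Split on whitespace and punctuation before any merges are learned.
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# vocab_size covers special tokens, the base alphabet, and the learned merges.
trainer = trainers.BpeTrainer(
    vocab_size=50, special_tokens=["[UNK]"], show_progress=True
)
tokenizer.train(["tiny_corpus.txt"], trainer)
tokenizer.save("mooaahh_tokenizer.json")  # one JSON file: vocab, merges, config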
```
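
To see what the trainer is doing under the hood, here is a minimal pure-Python sketch of the BPE merge loop. It is a simplified illustration, not the 🤗 implementation: `train_bpe` and `toy_corpus` are hypothetical names, tie-breaking is an assumption, and special tokens are left out.

```python
from collections import Counter

def train_bpe(lines, num_merges):
    """Learn up to `num_merges` BPE merge rules from whitespace-split text."""
    # Each distinct word starts as a tuple of single characters.
    words = Counter(tuple(word) for line in lines for word in line.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties go to the lexicographically smallest
        # pair (an assumption, real trainers may break ties differently).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        merged_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words
    return merges

toy_corpus = [
    "machine learning is amazing",
    "deep learning is part of machine learning",
    "llms are the future of language modeling",
    "moooaahhh from the professor",
]
toy_merges = train_bpe(toy_corpus, num_merges=10)
print(toy_merges)
```

On this corpus the first learned merge is `("i", "n")`, since that pair appears seven times across `machine`, `learning`, `amazing`, and `modeling`. Each later merge builds on the symbols created by earlier ones, which is how whole words like `learning` can end up as single tokens.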

---

## 🔎 Section 4: Visualize Token Splits

```python
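from tokenizers import Tokenizer

# Reload from disk to confirm the saved JSON is self-contained.
tokenizer = Tokenizer.from_file("mooaahh_tokenizer.json")

def visualize(text):
    """Print each token next to its vocabulary ID."""
    encoding = tokenizer.encode(text)
    print(f"Input: {text}")
    print("Tokens:")
    for tok, id_ in zip(encoding.tokens, encoding.ids):
        print(f"{tok:>10} → ID {id_}")

visualize("moooaahhh from the professor")  # sentence seen during training
visualize("deep language learning")        # unseen combination of seen words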
```
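
The splits printed above come from replaying learned merges on each word. A minimal sketch of that replay, assuming a hand-written merge list (`demo_merges` is hypothetical, not the output of the tokenizer trained in this lab) and a simple in-order replay rather than the library's optimized encoder:

```python
def bpe_segment(word, merges):
    """Split `word` into subwords by replaying merge rules in learned order."""
    symbols = list(word)
    for left, right in merges:
        out, i = [], 0
        while i < len(symbols):
            # Fuse the pair wherever it appears side by side.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Hypothetical merge list, ordered from earliest-learned to latest.
demo_merges = [
    ("i", "n"), ("in", "g"), ("l", "e"), ("a", "r"),
    ("le", "ar"), ("n", "ing"), ("lear", "ning"),
]

for word in ["learning", "leaning", "zing"]:
    print(f"{word:>10} → {' | '.join(bpe_segment(word, demo_merges))}")
```

With these merges `learning` collapses into one token, while the near-miss `leaning` falls apart into `le | a | ning` because the `a`+`r` merge never fires.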

---

## 🧪 Section 5: SentencePiece Alternative

```python
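import sentencepiece as spm

# SentencePiece trains from a raw text file, one sentence per line.
with open("corpus.txt", "w", encoding="utf-8") as f:
    for line in corpus:
        f.write(line + "\n")

# Writes sp.model (the tokenizer) and sp.vocab (human-readable pieces).
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="sp", vocab_size=50, model_type="bpe"
)

sp = spm.SentencePieceProcessor()
sp.load("sp.model")

# out_type=str returns the pieces themselves instead of integer IDs.
print(sp.encode("moooaahhh from the professor", out_type=str))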
```
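
The pieces SentencePiece prints carry a `▁` (U+2581) marker wherever a space used to be, and by default it also prepends one to the start of the text. That is what lets you rebuild the original string from pieces alone. A minimal sketch of the convention, with hypothetical helper names and a hand-written piece list (not the actual output of the model trained above):

```python
WORD_BOUNDARY = "\u2581"  # "▁", the symbol SentencePiece uses in place of a space

def mark_spaces(text):
    """Mimic the default preprocessing: prefix a marker, then replace spaces."""
    return WORD_BOUNDARY + text.replace(" ", WORD_BOUNDARY)

def pieces_to_text(pieces):
    """Rebuild text from pieces by turning markers back into spaces."""
    text = "".join(pieces).replace(WORD_BOUNDARY, " ")
    # Drop the single leading space created by the prefix marker.
    return text[1:] if text.startswith(" ") else text

# Hypothetical segmentation, chosen by hand for illustration.
example_pieces = ["\u2581from", "\u2581the", "\u2581prof", "ess", "or"]

print(mark_spaces("from the professor"))
print(pieces_to_text(example_pieces))
```

Because spaces live inside the pieces, no separate pre-tokenizer is needed, which is the main practical difference from the 🤗 setup in Section 3.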

---

## ✅ Wrap-Up Summary

| Task                               | ✅ |
|------------------------------------|----|
| BPE tokenizer trained on corpus    | ✅ |
| SentencePiece variant implemented  | ✅ |
| Custom tokens + IDs visualized     | ✅ |
| Exportable + loadable tokenizer    | ✅ |

---

## 🧠 What You Learned

- Tokenizers split text into **subwords**, not whole words  
- Rare words break into **longer chains of subwords**, while frequent words often stay whole  
- You can fully **train & inspect your own tokenizer**  
- Tokenization happens **before the model**, yet it shapes accuracy, speed, and generalization
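
The "rare words = longer chains" point can be checked by counting pieces per word. The sketch below swaps in a greedy longest-match segmenter (closer to WordPiece than to BPE) because it needs only a vocabulary, no merge list. `toy_vocab` and `greedy_segment` are hypothetical and chosen by hand for illustration.

```python
def greedy_segment(word, vocab):
    """Segment `word` by repeatedly taking the longest vocab entry that fits."""
    pieces, start = [], 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # Nothing matched: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces

# Hypothetical vocabulary: frequent words whole, a few shared fragments.
toy_vocab = {"learning", "machine", "the", "pro", "fess", "or", "oo", "hh"}

for word in ["learning", "professor", "moooaahhh"]:
    pieces = greedy_segment(word, toy_vocab)
    print(f"{word:>10} → {len(pieces)} pieces: {pieces}")
```

Here `learning` costs one token, `professor` three, and `moooaahhh` seven. The same gap shows up at scale as longer sequences, and therefore more compute, for text the tokenizer rarely saw.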

---

You ready to go from text splits to **Transformer circuits**?  
Next lab:

> 🔁 `08_lab_transformer_forward_pass_step_by_step.ipynb`  
We’re gonna build a **mini Transformer block**, visualize attention, residuals, layer norms…  
and finally understand *why it works*.

Shall we spin up those attention heads, Professor?