# Foundations of Language Modeling -- Notebook Series

## From N-grams to Neural LMs to Transformers

Welcome to the **Foundations of Language Modeling** notebook series by Vizuara. This series traces how machines learned to predict the next word -- from simple counting to the Transformer architecture that underpins nearly all modern LLMs.

---

## Course Structure

| # | Notebook | Key Concepts | Time |
|---|----------|-------------|------|
| 1 | **N-gram Language Models** | Bigram/trigram counting, Markov assumption, sparsity problem, perplexity | ~45 min |
| 2 | **Neural Language Models & Word Embeddings** | Bengio's 2003 neural language model, word embeddings, Word2Vec, RNNs, vanishing gradients | ~60 min |
| 3 | **Self-Attention & the Transformer** | Q/K/V, scaled dot-product attention, multi-head attention, positional encoding, Transformer blocks | ~60 min |
| 4 | **Building a Tiny Language Model** | Complete Mini-GPT, training loop, text generation, temperature sampling | ~45 min |

**Total estimated time: ~3.5 hours**
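As a preview of the "simple counting" starting point in Notebook 1, here is a minimal sketch of a bigram model with perplexity. It is not taken from the notebooks: the function names `train_bigram` and `perplexity` and the toy corpus are assumptions chosen for illustration, and the model deliberately uses no smoothing, so the sparsity problem shows up directly.

```python
import math
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Estimate P(next | prev) from raw bigram counts (hypothetical helper)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    probs = {}
    for prev, followers in counts.items():
        total = sum(followers.values())
        probs[prev] = {word: c / total for word, c in followers.items()}
    return probs


def perplexity(probs, tokens):
    """Perplexity of a token sequence under the unsmoothed bigram model."""
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        raise ValueError("need at least two tokens")
    log_prob = 0.0
    for prev, nxt in pairs:
        p = probs.get(prev, {}).get(nxt, 0.0)
        if p == 0.0:
            # An unseen bigram gets zero probability: the sparsity problem.
            return math.inf
        log_prob += math.log(p)
    return math.exp(-log_prob / len(pairs))


# Toy corpus, assumed for illustration only.
corpus = "the cat sat on the mat the cat ate".split()
model = train_bigram(corpus)

print(model["the"])                               # next-word distribution after "the"
print(perplexity(model, "the cat sat".split()))   # finite: both bigrams were seen
print(perplexity(model, "the dog".split()))       # inf: "the dog" never occurred
```

The Markov assumption is visible in `train_bigram`: each word is predicted from the single preceding word only. Any bigram missing from the training text makes the perplexity infinite, which is the limitation that motivates the neural models in the later notebooks.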

---

## Prerequisites

- Basic Python and PyTorch familiarity
- Understanding of neural network fundamentals (layers, backpropagation, loss functions)
- No prior NLP experience required -- we build everything from first principles

---

## How to Use These Notebooks

1. **Run in Google Colab** with a T4 GPU (free tier works fine)
2. **Go in order** -- each notebook builds on the previous one
3. **Complete the TODOs** -- hands-on exercises are where the real learning happens
4. **Read the article first** for conceptual context, then use notebooks for implementation

---

## Quick Links

In [None]:
print("Foundations of Language Modeling — Vizuara Notebook Series")
print("=" * 60)
print()
print("Notebook 1: N-gram Language Models")
print("  → Build bigram/trigram models from scratch")
print("  → Generate text by sampling from count-based distributions")
print("  → Discover the sparsity problem that limits N-grams")
print()
print("Notebook 2: Neural Language Models & Word Embeddings")
print("  → Implement Bengio's 2003 neural language model")
print("  → Train word embeddings that capture semantic similarity")
print("  → Build an RNN and see the vanishing gradient problem")
print()
print("Notebook 3: Self-Attention & the Transformer")
print("  → Implement scaled dot-product attention step by step")
print("  → Build multi-head attention and positional encoding")
print("  → Assemble a complete Transformer block")
print()
print("Notebook 4: Building a Tiny Language Model (Mini-GPT)")
print("  → Build a complete GPT-style model (~1M parameters)")
print("  → Train on Shakespeare with next-token prediction")
print("  → Generate text with temperature-controlled sampling")
print()
print("Let's begin! Open Notebook 1 to start.")