# **CHAPTER 15: LARGE LANGUAGE MODELS (LLMs) & GENERATIVE AI**

*The Era of Foundation Models*

## **Chapter Overview**

Large Language Models (LLMs) have made capabilities such as reasoning, code generation, and instruction following widely accessible, and several of these abilities appear only at large scale. This chapter covers the engineering and scientific principles behind models like GPT-4, LLaMA, and Claude: from scaling laws and distributed training infrastructure to alignment via RLHF and production deployment strategies including RAG and efficient serving.

**Estimated Time:** 70-80 hours (5-6 weeks)  
**Prerequisites:** Chapters 11-14 (Deep Learning Frameworks, Transformers, Attention mechanisms)

---

## **15.0 Learning Objectives**

By the end of this chapter, you will be able to:
1. Apply scaling laws to estimate optimal model size and training compute (Chinchilla optimal)
2. Implement efficient large-scale training using DeepSpeed ZeRO, FSDP, and FlashAttention
3. Understand and apply alignment techniques: RLHF, DPO, and Constitutional AI
4. Build RAG (Retrieval-Augmented Generation) systems with vector databases and embedding models
5. Engineer complex prompt patterns: Chain-of-Thought, ReAct, and Tree-of-Thoughts
6. Deploy and serve LLMs efficiently using quantization (GPTQ, AWQ), vLLM, and speculative decoding
7. Fine-tune open-source LLMs (LLaMA-2, Mistral) using LoRA and QLoRA for domain adaptation

---

## **15.1 Scaling Laws and Compute-Optimal Training**

#### **15.1.1 The Scaling Laws**

Kaplan et al. (OpenAI, 2020) established that loss scales as a power law with model size ($N$), dataset size ($D$), and compute ($C$):

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}$$

Where $\alpha_N \approx 0.076$, $\alpha_D \approx 0.095$.

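**Implication:** Returns diminish steeply. With these exponents, 10x more parameters lowers the loss by only about 16% ($10^{-0.076} \approx 0.84$), and 10x more data lowers it by about 20% ($10^{-0.095} \approx 0.80$). Halving the loss would take roughly $2^{1/0.076} \approx 9{,}000\times$ more parameters or $2^{1/0.095} \approx 1{,}500\times$ more data.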
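
A short numerical check of what these exponents imply may help. The sketch below treats each power law in isolation (the other resource is assumed not to be the bottleneck) and uses the two exponents quoted above; the function names are illustrative and not taken from the paper.

```python
# Exponents quoted above (Kaplan et al., 2020).
ALPHA_N = 0.076  # model-size exponent
ALPHA_D = 0.095  # dataset-size exponent


def loss_ratio(scale_factor: float, alpha: float) -> float:
    """Factor by which the loss changes when N (or D) grows by scale_factor."""
    return scale_factor ** (-alpha)


def scale_needed(target_loss_ratio: float, alpha: float) -> float:
    """Growth in N (or D) needed to multiply the loss by target_loss_ratio."""
    return target_loss_ratio ** (-1.0 / alpha)


print(f"10x parameters -> loss x {loss_ratio(10, ALPHA_N):.3f}")
print(f"10x data       -> loss x {loss_ratio(10, ALPHA_D):.3f}")
print(f"halve loss via parameters: {scale_needed(0.5, ALPHA_N):,.0f}x")
print(f"halve loss via data:       {scale_needed(0.5, ALPHA_D):,.0f}x")
```

Because the exponents are small, a tenfold increase in either resource buys only a modest loss reduction, while halving the loss takes several orders of magnitude more.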

#### **15.1.2 Chinchilla Optimal (Hoffmann et al., 2022)**

Kaplan suggested training large models on relatively little data. Chinchilla showed this was suboptimal.

**Optimal allocation:** For compute budget $C$ (FLOPs), model size $N$ and tokens $D$ should scale equally:

$$N_{opt} \propto C^{0.5}, \quad D_{opt} \propto C^{0.5}$$

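**Rule of Thumb:** ~20 tokens per parameter. A 70B model calls for ~1.4T tokens; by the same rule, GPT-3 (175B) would have called for ~3.5T tokens rather than the 300B it was trained on.

**Practical Impact:**
- GPT-3 (175B, 300B tokens): Undertrained by Chinchilla standards (~1.7 tokens per parameter)
- LLaMA-2 (70B, 2T tokens): ~29 tokens per parameter, slightly past the compute-optimal ratio, which spends extra training compute to get a stronger model at a fixed inference cost
- For a fixed compute budget, smaller models trained longer often outperform larger models trained briefly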
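
The rule of thumb can be turned into a small budget calculator. This sketch relies on two assumptions that are not stated in the text above: the common approximation that training costs about $6ND$ FLOPs, and a fixed ratio of 20 tokens per parameter. Solving $C = 6 \cdot N \cdot 20N$ gives $N = \sqrt{C / 120}$. The function name is hypothetical.

```python
import math


def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Return (parameters, tokens) for a compute budget.

    Assumes training FLOPs C ~= 6 * N * D (a widely used approximation)
    and D = tokens_per_param * N (the ~20 tokens/parameter rule of thumb).
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Budget that corresponds to a 70B model trained on 1.4T tokens.
budget = 6 * 70e9 * 1.4e12
n, d = chinchilla_optimal(budget)
print(f"params ~ {n / 1e9:.0f}B, tokens ~ {d / 1e12:.1f}T")
```

Under these assumptions both quantities grow as $C^{0.5}$, matching the scaling relation above.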

---

## **15.2 Modern LLM Architectures**

#### **15.2.1 Decoder-Only Dominance**

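Most modern LLMs with published architectures (GPT-3, LLaMA, PaLM) use decoder-only Transformer variants. GPT-4 and Claude are widely assumed to follow the same design, although their architectures have not been published.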

**Key Modifications from Original Transformer:**
1. **Pre-LayerNorm:** LayerNorm before attention/FFN (stabilizes training at scale)
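2. **Rotary Position Embeddings (RoPE):** $f(q, m) = qe^{im\theta}$: encodes relative position by rotating query/key pairs by a position-dependent angle; extending well beyond the training context length usually still requires extra techniques such as position interpolation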
3. **SwiGLU Activation:** $\text{SwiGLU}(x) = \text{Swish}_1(xW) \otimes xV$ — improves performance over ReLU/GeLU
4. **Grouped Query Attention (GQA):** Share key/value heads across query heads (reduces memory bandwidth during inference)

```python
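# Simplified RoPE implementation (interleaved-pair variant)
import torch

def apply_rotary_pos_emb(q, k, cos, sin):
    # q, k: (batch, heads, seq, dim); dim must be even
    # cos, sin: broadcastable to q, with each angle repeated for both
    # elements of a pair, i.e. [c0, c0, c1, c1, ...] along the last axis
    # Map each pair (x0, x1) -> (-x1, x0), a 90-degree rotation
    q_rot = torch.stack([-q[..., 1::2], q[..., ::2]], dim=-1).flatten(-2)
    k_rot = torch.stack([-k[..., 1::2], k[..., ::2]], dim=-1).flatten(-2)

    # x * cos + rot(x) * sin rotates each pair by its position-dependent angle
    q = q * cos + q_rot * sin
    k = k * cos + k_rot * sin
    return q, k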
```
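
To put a number on the memory argument for GQA (item 4 in the list above), the KV cache per generated token can be estimated directly. This is a back-of-the-envelope sketch: the function name is hypothetical, and the shape used (80 layers, 64 query heads, head dimension 128, 8 KV heads, 16-bit values) is an illustrative configuration in the range of a 70B-class model.

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache added per token (the factor 2 covers keys and values)."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


# Illustrative shape: 64 query heads, compared with and without KV sharing.
mha = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=64, head_dim=128)
gqa = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)

print(f"MHA: {mha / 2**20:.2f} MiB per token")
print(f"GQA: {gqa / 2**10:.0f} KiB per token ({mha // gqa}x smaller)")
```

The saving scales with the ratio of query heads to KV heads, which is why GQA matters most for long contexts and large batches at inference time.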

#### **15.2.2 Mixture of Experts (MoE)**

Models such as Mixtral (and, according to unconfirmed reports, GPT-4) use sparse layers: instead of one dense FFN, each layer holds $E$ experts, and a router sends each token to the top-$k$ of them.

**Benefits:**
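- Scale parameters without scaling compute proportionally (only a subset of experts is active per token)
- Specialization: experts can learn different functions, although published analyses (e.g., of Mixtral) find that routing tends to follow token-level and syntactic patterns more than clean domains such as code, math, or science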
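
The first benefit is easy to quantify. The sketch below counts FFN parameters in one MoE layer and compares the total with what a single token actually uses. The dimensions are illustrative, the three-matrix FFN corresponds to a SwiGLU-style block, and router bias terms are ignored; all names are hypothetical.

```python
def ffn_params(d_model: int, d_ff: int, n_matrices: int = 3) -> int:
    """Parameters in one expert FFN (3 matrices for a SwiGLU-style block)."""
    return n_matrices * d_model * d_ff


def moe_ffn_params(d_model: int, d_ff: int, num_experts: int, top_k: int):
    """Return (total, active-per-token) parameters for one MoE layer."""
    per_expert = ffn_params(d_model, d_ff)
    router = d_model * num_experts  # linear gate, bias ignored
    total = num_experts * per_expert + router
    active = top_k * per_expert + router
    return total, active


total, active = moe_ffn_params(d_model=4096, d_ff=14336, num_experts=8, top_k=2)
print(f"total:  {total:,}")
print(f"active: {active:,}")
print(f"ratio:  {total / active:.2f}x")
```

With 8 experts and top-2 routing, the layer stores about four times as many FFN parameters as each token touches, so memory grows with the total while per-token compute tracks the active count.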

```python
# Simplified MoE layer
import torch
import torch.nn as nn

class FFN(nn.Module):
    """Minimal dense expert (illustrative; the 4x hidden width is an assumption)."""
    def __init__(self, d_model):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        return self.net(x)

class MoELayer(nn.Module):
    def __init__(self, d_model, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([FFN(d_model) for _ in range(num_experts)])
        self.gate = nn.Linear(d_model, num_experts)
        self.top_k = top_k
        
    def forward(self, x):
        # x: (batch*seq, d_model)
        gates = torch.softmax(self.gate(x), dim=-1)  # (batch*seq, num_experts)
        
        # Select top-k experts per token
        top_gates, top_indices = torch.topk(gates, self.top_k, dim=-1)
        top_gates = top_gates / top_gates.sum(dim=-1, keepdim=True)  # Renormalize over the chosen k
        
        output = torch.zeros_like(x)
        for i in range(self.top_k):
            expert_idx = top_indices[:, i]      # (batch*seq,) expert chosen in slot i
            expert_gate = top_gates[:, i:i+1]   # (batch*seq, 1) weight for that expert
            
            # Route to expert (simplified - actual implementation uses efficient grouping)
            for j in range(len(self.experts)):
                mask = (expert_idx == j)
                if mask.any():
                    output[mask] += self.experts[j](x[mask]) * expert_gate[mask]
        
        return output
```

---

## **15.3 Training at Scale**

Training 7B+ parameter models requires specialized distributed strategies.

#### **15.3.1 ZeRO (Zero Redundancy Optimizer) — DeepSpeed**

Partitions optimizer states, gradients, and parameters across data parallel processes.

**Stages:**
- **ZeRO-1:** Partition optimizer states (up to 4x memory reduction with mixed-precision Adam)
- **ZeRO-2:** + Partition gradients (up to 8x reduction)
- **ZeRO-3:** + Partition parameters (reduction grows linearly with the data-parallel degree)
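
The reduction factors follow from simple accounting. The sketch below assumes mixed-precision Adam as analyzed in the ZeRO paper: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for optimizer state (fp32 weights, momentum, variance). Activations and buffers are ignored, and the function name is hypothetical.

```python
def zero_bytes_per_gpu(n_params: float, n_gpus: int, stage: int) -> float:
    """Model-state memory per GPU under ZeRO stage 0 (plain data parallel) to 3."""
    params = 2.0 * n_params       # fp16 weights
    grads = 2.0 * n_params        # fp16 gradients
    opt_state = 12.0 * n_params   # fp32 weights + Adam momentum + variance
    if stage >= 1:
        opt_state /= n_gpus
    if stage >= 2:
        grads /= n_gpus
    if stage >= 3:
        params /= n_gpus
    return params + grads + opt_state


for stage in range(4):
    gb = zero_bytes_per_gpu(7.5e9, n_gpus=64, stage=stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

For a 7.5B-parameter model on 64 GPUs this gives about 120, 31.4, 16.6, and 1.9 GB, which is where the "up to 4x" and "up to 8x" figures come from: they are the limits as the number of GPUs grows.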

```python
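# DeepSpeed config
import deepspeed

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,  # ZeRO-2
        "allgather_partitions": True,
        "reduce_scatter": True,
    },
    # "auto" values are only resolved by the Hugging Face Trainer integration;
    # deepspeed.initialize() needs concrete numbers (the values below are examples).
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    # Example optimizer so that initialize() returns a ZeRO-wrapped optimizer.
    "optimizer": {"type": "AdamW", "params": {"lr": 2e-5}},
}

# `model` is assumed to be an already constructed torch.nn.Module.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config
)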
```

#### **15.3.2 FlashAttention**

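IO-aware exact attention algorithm. It never materializes the $N \times N$ attention matrix in HBM (high bandwidth memory), so the extra memory falls from $O(N^2)$ to $O(N)$ in sequence length $N$. HBM reads/writes drop from $\Theta(Nd + N^2)$ to $\Theta(N^2 d^2 / M)$, where $d$ is the head dimension and $M$ is the SRAM size: still quadratic in $N$, but many times smaller in practice.

**Key Idea:** Tile Q, K, and V into blocks that fit in SRAM (fast on-chip memory), compute the softmax incrementally with running statistics (online softmax), and recompute attention blocks in the backward pass instead of storing them.

**Speedup:** The authors report roughly 2-4x faster attention and 10-20x lower attention memory at sequence lengths of 2K-4K.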
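
The online-softmax trick can be shown in a few lines of plain Python. This is a didactic sketch for a single query row with scalar values, not the fused GPU kernel: it processes keys and values block by block, keeps a running maximum and running denominator, and rescales earlier partial sums whenever the maximum changes. Function names are illustrative.

```python
import math


def naive_attention_row(scores, values):
    """Reference: softmax over all scores, then a weighted sum of values."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def online_attention_row(scores, values, block_size=2):
    """Same result, visiting keys/values one block at a time."""
    running_max = float("-inf")  # largest score seen so far
    running_sum = 0.0            # softmax denominator relative to running_max
    acc = 0.0                    # unnormalized weighted sum of values
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        new_max = max(running_max, max(s_blk))
        # Rescale earlier partial results to the new reference maximum.
        correction = math.exp(running_max - new_max)
        p_blk = [math.exp(s - new_max) for s in s_blk]
        running_sum = running_sum * correction + sum(p_blk)
        acc = acc * correction + sum(p * v for p, v in zip(p_blk, v_blk))
        running_max = new_max
    return acc / running_sum


scores = [0.5, 2.0, -1.0, 3.0, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(naive_attention_row(scores, values), online_attention_row(scores, values))
```

Only three running quantities are kept per query row, so the full row of attention weights never has to be stored. In the real kernel the values are vectors and each block lives in SRAM.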

```python
# Using FlashAttention 2 (requires the flash-attn package and a supported GPU)
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
```

#### **15.3.3 Gradient Checkpointing (Activation Checkpointing)**

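Trade compute for memory: store activations only at selected checkpoints during the forward pass and recompute the rest during the backward pass.

**Memory:** With checkpoints placed every $\sqrt{n}$ layers, activation memory grows as $O(\sqrt{n})$ in the number of layers $n$ instead of $O(n)$, at the cost of roughly one extra forward pass. This allows larger models, batches, or sequence lengths on the same hardware.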

```python
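model.gradient_checkpointing_enable()  # Hugging Face Transformers models
# Or in PyTorch directly (`block` and `hidden` are placeholder names):
from torch.utils.checkpoint import checkpoint
hidden = checkpoint(block, hidden, use_reentrant=False)  # block is re-run in backward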
```

---

## **15.4 Alignment: RLHF and Alternatives**

Raw pretrained models predict internet text; alignment aims to make them helpful, harmless, and honest.

#### **15.4.1 Supervised Fine-Tuning (SFT)**

Train on high-quality instruction-response pairs.

**Data Format:**
```
[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>

What is the capital of France? [/INST]
The capital of France is Paris.</s>
```

#### **15.4.2 RLHF (Reinforcement Learning from Human Feedback)**

**Three-Step Process:**

1. **Collect Preferences:** Humans rank multiple model outputs (A > B)
2. **Train Reward Model (RM):** $r_\theta(x, y)$ predicts human preference
3. **Optimize Policy with PPO:**
   
   $$\max_{\pi} \mathbb{E}_{x \sim D, y \sim \pi}[r_\theta(x, y)] - \beta \mathbb{D}_{KL}[\pi || \pi_{ref}]$$

   The KL penalty keeps the policy close to the reference model $\pi_{ref}$ (typically the SFT checkpoint), which limits reward hacking and loss of fluency.

**PPO (Proximal Policy Optimization):** Clipped surrogate objective for stable training.

```python
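# Simplified PPO policy loss (conceptual; omits the value and entropy terms)
import torch

def ppo_policy_loss(new_logprob, old_logprob, advantages, eps=0.2):
    ratio = torch.exp(new_logprob - old_logprob)  # pi_new / pi_old
    surrogate1 = ratio * advantages
    surrogate2 = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(surrogate1, surrogate2).mean()  # negate to maximize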
```

#### **15.4.3 DPO (Direct Preference Optimization)**

Bypass reward model and PPO. Optimize directly on preference data with classification loss.

$$\mathcal{L}_{DPO} = -\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right)$$

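**Advantages:** Simpler and typically more stable than PPO, with no separate reward model to train; reported quality is often comparable to RLHF. Here $y_w$ and $y_l$ are the preferred and rejected responses, and $\beta$ controls how far the policy may move from $\pi_{ref}$.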
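
The loss above maps almost directly to code. The sketch below is a minimal PyTorch version and assumes the per-sequence log-probabilities (summed over response tokens) have already been computed for the policy and the frozen reference model. The function and argument names are illustrative, not taken from any particular library.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss averaged over a batch of preference pairs.

    Each argument is a 1-D tensor of summed log-probabilities log pi(y|x).
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps        # log pi/pi_ref for y_w
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log pi/pi_ref for y_l
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)), computed in a numerically stable way.
    return -F.logsigmoid(logits).mean()


# When the policy equals the reference, every pair contributes log(2).
zeros = torch.zeros(4)
print(dpo_loss(zeros, zeros, zeros, zeros).item())
```

Raising the policy's log-probability of the preferred response relative to the reference lowers the loss, while the reference terms act as the implicit KL anchor.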

#### **15.4.4 Constitutional AI (Anthropic)**

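Alignment using AI feedback guided by a written "constitution" (a list of principles), in two phases:
1. Model generates responses
2. The model critiques its own responses against the constitution
3. The model revises each response based on the critique
4. Supervised phase: fine-tune on the revised responses
5. RL phase (RLAIF - RL from AI Feedback): an AI model compares pairs of responses against the constitution, a preference model is trained on those labels, and the policy is optimized against it with RL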

---

## **15.5 Prompt Engineering and Advanced Inference**

#### **15.5.1 Chain-of-Thought (CoT)**

Prompt the model to show reasoning steps: "Let's think step by step."

**Self-Consistency:** Sample multiple reasoning paths (temperature > 0) and take a majority vote over their final answers.
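
The voting step is simple to sketch once the sampled completions are available. The example below assumes each completion ends with a line of the form `Answer: ...`, which is a prompt-format convention chosen here for illustration; the helper names are hypothetical and the completions are hard-coded stand-ins for sampled model outputs.

```python
import re
from collections import Counter


def extract_answer(completion: str):
    """Pull the text after 'Answer:' from a completion, or None if absent."""
    match = re.search(r"Answer:\s*(.+)", completion)
    return match.group(1).strip() if match else None


def self_consistent_answer(completions):
    """Majority vote over the final answers of several reasoning paths."""
    answers = [a for a in map(extract_answer, completions) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Stand-ins for three completions sampled from the same CoT prompt.
samples = [
    "6 boxes of 7 is 42.\nAnswer: 42",
    "7 + 7 + 7 + 7 + 7 + 7 = 41.\nAnswer: 41",
    "6 * 7 = 42.\nAnswer: 42",
]
print(self_consistent_answer(samples))
```

Completions that fail to produce a parsable answer are simply dropped from the vote, which is one reasonable policy among several.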

#### **15.5.2 ReAct (Reasoning + Acting)**

Interleave reasoning traces with actions (API calls, tool use).

```
Thought: I need to find the current weather in Paris.
Action: search_weather[Paris]
Observation: 15°C, sunny
Thought: Now I can answer the user.
Final Answer: It's 15°C and sunny in Paris.
```
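
A minimal control loop for this format might look like the sketch below. The `llm` argument is any callable that maps the transcript so far to the next Thought/Action text; here it is replaced by a scripted stand-in so the example runs without a model. The tool registry, regular expressions, and function names are all illustrative assumptions.

```python
import re

ACTION_RE = re.compile(r"^Action:\s*(\w+)\[(.*)\]\s*$", re.MULTILINE)
FINAL_RE = re.compile(r"^Final Answer:\s*(.+)$", re.MULTILINE)


def run_react(llm, tools, question, max_steps=5):
    """Alternate model steps and tool calls until a final answer appears."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += step + "\n"
        final = FINAL_RE.search(step)
        if final:
            return final.group(1).strip()
        action = ACTION_RE.search(step)
        if action is None:
            transcript += "Observation: could not parse an action\n"
            continue
        name, arg = action.group(1), action.group(2)
        tool = tools.get(name)
        try:
            observation = tool(arg) if tool else f"unknown tool '{name}'"
        except Exception as exc:  # report tool failures back to the model
            observation = f"tool error: {exc}"
        transcript += f"Observation: {observation}\n"
    return None  # step budget exhausted


def scripted_llm(transcript):
    """Stand-in for a real model: replays the trace shown above."""
    if "Observation:" not in transcript:
        return ("Thought: I need to find the current weather in Paris.\n"
                "Action: search_weather[Paris]")
    return ("Thought: Now I can answer the user.\n"
            "Final Answer: It's 15°C and sunny in Paris.")


tools = {"search_weather": lambda city: "15°C, sunny"}
answer = run_react(scripted_llm, tools, "What is the weather in Paris?")
print(answer)
```

Feeding tool errors back as observations, rather than raising, lets the model retry or choose a different tool. The step limit guards against loops that never reach a final answer.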

#### **15.5.3 Tree of Thoughts (ToT)**

Maintain multiple reasoning paths, evaluate each, prune unpromising branches (like beam search but for reasoning).
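
The search skeleton can be sketched independently of any model. In the breadth-first variant below, `propose` and `score` are hypothetical stand-ins for what would be LLM calls in a real system (generate candidate next thoughts, rate a partial solution). A toy arithmetic task takes their place so the example is runnable.

```python
def tree_of_thoughts(root, propose, score, beam_width=2, depth=3):
    """Breadth-first search that keeps only the best `beam_width` states per level."""
    frontier = [root]
    for _ in range(depth):
        candidates = [child for state in frontier for child in propose(state)]
        if not candidates:
            break
        # Evaluate every candidate and prune the unpromising ones.
        frontier = sorted(candidates, key=score, reverse=True)[:beam_width]
    return max(frontier, key=score)


# Toy task: pick three steps from {1, 2, 3} whose sum is as close to 7 as possible.
TARGET = 7


def propose(state):
    return [state + (step,) for step in (1, 2, 3)]


def score(state):
    return -abs(TARGET - sum(state))


best = tree_of_thoughts((), propose, score)
print(best, sum(best))
```

Unlike standard beam search over tokens, each state here is a whole intermediate thought, and the scoring function can itself be a model asked to judge how promising the partial reasoning is.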

---

## **15.6 Retrieval-Augmented Generation (RAG)**

Ground LLMs in external knowledge to reduce hallucinations and provide source attribution.

#### **15.6.1 Architecture**

1. **Indexing:** Chunk documents, embed each chunk with an embedding model (e.g., OpenAI text-embedding-ada-002, BGE, or E5), and store the vectors in a vector DB
2. **Retrieval:** Embed the query and find the top-k most similar chunks (cosine similarity or maximum inner product search)
3. **Generation:** Concatenate the retrieved context with the query and generate the answer

```python
# Simplified RAG pipeline
from sentence_transformers import SentenceTransformer
import faiss

# 1. Indexing
encoder = SentenceTransformer('BAAI/bge-large-en-v1.5')
documents = load_docs()  # placeholder: returns a list of text chunks
# Normalize so that inner product equals cosine similarity
embeddings = encoder.encode(documents, normalize_embeddings=True)

index = faiss.IndexFlatIP(embeddings.shape[1])  # Inner product = cosine for normalized vectors
index.add(embeddings)

# 2. Retrieval
query = "What are the leave policies?"
query_emb = encoder.encode([query], normalize_embeddings=True)
D, I = index.search(query_emb, k=3)  # D: similarity scores, I: indices of the top 3 chunks
retrieved_docs = [documents[i] for i in I[0]]

# 3. Generation
context = "\n".join(retrieved_docs)
prompt = f"Context: {context}\n\nQuestion: {query}\nAnswer:"
response = llm.generate(prompt)  # placeholder: any LLM client exposing generate()
```
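
The indexing step starts with chunking, which the pipeline above leaves to the document loader. A minimal sketch of fixed-size chunking with overlap follows. It splits on whitespace for simplicity, whereas a production system would more likely count tokens with the embedding model's tokenizer; the function name and default sizes are illustrative.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50):
    """Split text into word windows of chunk_size that overlap by `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


demo = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_text(demo, chunk_size=4, overlap=2):
    print(chunk)
```

Overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of indexing some text twice.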

#### **15.6.2 Advanced RAG**

- **Hybrid Search:** Combine vector similarity with keyword (BM25) using Reciprocal Rank Fusion
- **Reranking:** Use cross-encoder to rerank retrieved chunks (more accurate than bi-encoder)
- **Query Expansion:** Generate hypothetical answer embedding (HyDE) to improve retrieval
- **Iterative RAG:** Generate, check if answer complete, retrieve more if needed
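
Reciprocal Rank Fusion is compact enough to show in full. Each document receives a score of $\sum 1/(k + \text{rank})$ over the rankings in which it appears. The constant $k = 60$ is the conventional default, and the document IDs below are made up for illustration.

```python
def reciprocal_rank_fusion(rankings, k: int = 60):
    """Fuse several ranked lists of document IDs into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


dense_hits = ["a", "b", "c"]  # e.g., from vector search
bm25_hits = ["b", "d", "a"]   # e.g., from keyword search
print(reciprocal_rank_fusion([dense_hits, bm25_hits]))
```

Because only ranks are used, the method needs no calibration between cosine similarities and BM25 scores, which live on different scales.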

---

## **15.7 LLM Application Development**

#### **15.7.1 LangChain / LlamaIndex Frameworks**

**LangChain Components:**
- **Chains:** Sequences of calls (LLM, tool, LLM)
- **Agents:** Dynamic chain that decides which tools to use
- **Memory:** Conversation buffer, vector store memory
- **Retrievers:** Interface to vector DBs

```python
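# Legacy LangChain API: import paths vary by version, and LLMChain is
# deprecated in recent releases. Adjust to the version you have installed.
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain.memory import ConversationBufferMemory
from langchain_openai import OpenAI  # provided by the langchain-openai package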

template = """You are a helpful assistant.

History: {history}
Human: {input}
Assistant:"""

prompt = PromptTemplate(
    input_variables=["history", "input"],
    template=template
)

memory = ConversationBufferMemory()
llm = OpenAI(temperature=0)
chain = LLMChain(llm=llm, prompt=prompt, memory=memory)

chain.predict(input="Hi there!")
```

#### **15.7.2 Efficient Serving**

**Quantization:**
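- **GPTQ:** Post-training weight quantization (commonly 4-bit) that uses approximate second-order information to limit quality loss
- **AWQ:** Activation-aware weight quantization; protects the weights that matter most for the activations and is often reported to match or beat GPTQ at the same bit width
- **GGUF/llama.cpp:** GGUF is the model file format used by llama.cpp, which targets CPU and consumer-GPU inference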
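
All of these methods build on the same basic operation: mapping a group of floating-point weights to small integers plus a scale. The sketch below shows plain round-to-nearest symmetric quantization for one group. It is deliberately the naive baseline, not GPTQ or AWQ, both of which choose the rounding or scaling more carefully to reduce error. Names are illustrative.

```python
def quantize_group(weights, bits: int = 4):
    """Round-to-nearest symmetric quantization of one weight group."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit signed integers
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Clamp to the signed integer range [-qmax - 1, qmax].
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q, scale):
    """Recover approximate floating-point weights."""
    return [qi * scale for qi in q]


group = [0.7, -0.3, 0.12, 0.0]
q, scale = quantize_group(group)
restored = dequantize_group(q, scale)
print(q, scale)
print(restored)
```

Each weight is stored in 4 bits plus a shared scale per group, and the reconstruction error per weight is bounded by half the scale. Smaller groups give smaller scales and lower error at the cost of storing more scales.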

**vLLM:**
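Uses the PagedAttention algorithm, which stores the KV cache in fixed-size blocks to reduce memory fragmentation, together with continuous batching for high-throughput serving.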

```bash
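# Serve with vLLM. --quantization awq expects a checkpoint that has already
# been quantized with AWQ; the repo below is one example, substitute your own.
python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Llama-2-7b-Chat-AWQ \
    --tensor-parallel-size 1 \
    --quantization awq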
```

---

## **15.8 Workbook Labs**

### **Lab 1: QLoRA Fine-tuning**
Fine-tune Llama-2-7B on custom dataset using QLoRA (4-bit quantization + LoRA):
1. Load model in 4-bit (bitsandbytes)
2. Add LoRA adapters (rank 64)
3. Train on instruction dataset (Alpaca format)
4. Merge adapters and evaluate vs base model

**Deliverable:** Fine-tuned model capable of domain-specific instruction following (e.g., medical QA).
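
Before starting the lab, it can help to estimate how few parameters LoRA actually trains. For a weight matrix of shape $d_{out} \times d_{in}$, LoRA adds two factors of shapes $d_{out} \times r$ and $r \times d_{in}$. The sketch below applies this to a single 4096 x 4096 projection with rank 64; the dimensions are illustrative and the function name is hypothetical.

```python
def lora_param_counts(d_in: int, d_out: int, rank: int):
    """Return (frozen full-matrix params, trainable LoRA params) for one layer."""
    full = d_in * d_out
    lora = rank * (d_in + d_out)  # B is d_out x r, A is r x d_in
    return full, lora


full, lora = lora_param_counts(d_in=4096, d_out=4096, rank=64)
print(f"full: {full:,}  lora: {lora:,}  trainable fraction: {lora / full:.2%}")
```

Even at the fairly high rank of 64, the adapter trains about 3% of the parameters of the matrix it modifies, which is what makes fine-tuning on a single GPU practical once the frozen base weights are stored in 4 bits.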

### **Lab 2: RAG Pipeline**
Build RAG for technical documentation:
1. Chunk markdown docs with overlap
2. Embed using BGE-large
3. FAISS index with HNSW (fast approximate search)
4. Evaluate retrieval accuracy (hit rate @k)
5. Compare generation with vs without retrieval (hallucination reduction)

**Deliverable:** Working RAG system with evaluation metrics.
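
For step 4, hit rate @k is the fraction of queries whose gold document appears among the top k retrieved results. A minimal sketch, assuming one gold document ID per query and made-up IDs for the demonstration:

```python
def hit_rate_at_k(retrieved, relevant, k: int) -> float:
    """Fraction of queries whose gold document is in the top-k retrieved IDs.

    retrieved: list of ranked ID lists, one per query
    relevant:  list of gold IDs, one per query
    """
    if not retrieved:
        return 0.0
    hits = sum(1 for docs, gold in zip(retrieved, relevant) if gold in docs[:k])
    return hits / len(retrieved)


retrieved = [["d1", "d2", "d3"], ["d4", "d5", "d6"], ["d7", "d8", "d9"]]
relevant = ["d2", "d6", "d0"]
print(hit_rate_at_k(retrieved, relevant, k=2))
print(hit_rate_at_k(retrieved, relevant, k=3))
```

Reporting the metric at several values of k shows whether failures come from documents ranked just outside the cutoff (fixable with reranking) or from documents not retrieved at all.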

### **Lab 3: RLHF Simulation**
Implement DPO from scratch (simplified):
1. Create synthetic preference dataset (pairs of good/bad responses)
2. Implement DPO loss
3. Fine-tune small GPT-2 (125M) using DPO
4. Show win rate improvement vs SFT baseline

**Deliverable:** DPO training script and preference learning curves.

### **Lab 4: Multi-Agent System**
Build ReAct agent with tool use:
1. Tools: Calculator, Wikipedia search, Weather API
2. Agent loop: Thought → Action → Observation
3. Handle tool errors gracefully
4. Evaluate on multi-hop questions requiring tool chaining

**Deliverable:** Agent that answers "What is the temperature in the capital of France?" by calling tools in sequence.

---

## **15.9 Common Pitfalls**

1. **Exceeding the Context Length:** Sending 10k tokens to a model with a 4k context window causes silent truncation or errors. Always count tokens with the model's tokenizer before sending.

2. **Prompt Injection:** User input like "Ignore previous instructions and..." can hijack system prompts. Input sanitization and delimiters reduce the risk but do not eliminate it, so treat model output as untrusted.

3. **Temperature=0 Non-Determinism:** Even with temperature 0, GPU floating-point operations and batching can introduce slight non-determinism. For reproducibility, set seeds, and run on CPU if exact repeatability is critical.

4. **Hallucinations in RAG:** Retrieved context helps but doesn't eliminate hallucinations. Always verify critical facts.

5. **Fine-tuning Catastrophic Forgetting:** Training on narrow domain makes model forget general knowledge. Use lower learning rates and mix with general instruction data.

---

## **15.10 Interview Questions**

**Q1:** Explain the difference between ZeRO-2 and ZeRO-3 in DeepSpeed.
*A: ZeRO-2 partitions optimizer states and gradients across data parallel processes, but keeps full parameters on each GPU. ZeRO-3 also partitions the model parameters, so each GPU only holds a slice of parameters. When needed, parameters are gathered via all-gather communication. ZeRO-3 enables training much larger models (trillion parameters across GPUs) but with more communication overhead.*

**Q2:** What is the "Chinchilla optimal" training regime and why did it change LLM training practices?
*A: Kaplan scaling laws suggested model size should grow faster than data, which led to very large models trained on ~300B tokens. Chinchilla showed compute is optimally allocated when model size and training tokens scale equally (~20 tokens per parameter). For example, Chinchilla (70B parameters, 1.4T tokens) outperformed Gopher (280B parameters, 300B tokens) at roughly the same training compute. This shifted practice toward smaller models trained on more, higher-quality data.*

**Q3:** Compare RLHF with DPO. When would you choose one over the other?
*A: RLHF trains a separate reward model then uses PPO to optimize policy against it. Complex, unstable (KL divergence tuning), but flexible for complex reward functions. DPO optimizes directly on preference data using classification loss (implicit reward). Simpler, more stable, no reward model needed, often performs as well or better. Choose DPO for simplicity and stability; RLHF if you need explicit reward model for inspection or complex multi-objective rewards.*

**Q4:** How does FlashAttention reduce memory usage from $O(N^2)$ to $O(N)$?
*A: Standard attention materializes the full $N \times N$ attention matrix in HBM (high bandwidth memory). FlashAttention uses tiling to compute attention in blocks that fit in fast SRAM (on-chip memory). It computes softmax incrementally (online softmax) without storing the full matrix, writes only the final output and per-row softmax statistics to HBM, and recomputes attention blocks during the backward pass. Memory therefore grows linearly in sequence length. The number of HBM accesses is still quadratic in $N$, but it is reduced by a large factor that depends on the SRAM size.*

**Q5:** What is the "reversal curse" in LLMs and how does it relate to training?
*A: Models trained on "A is B" often fail to answer "B is A" (e.g., knowing "Olaf Scholz is Chancellor of Germany" but failing "Who is Chancellor of Germany? → Olaf Scholz"). This arises because autoregressive models learn from sequences in a fixed order, and the reverse order is rare in the training data. Proposed mitigations include bidirectional training objectives and augmenting the training data so that facts appear in both directions.*

---

## **15.11 Further Reading**

**Papers:**
- "Scaling Laws for Neural Language Models" (Kaplan et al., 2020)
- "Training Compute-Optimal Large Language Models" (Hoffmann et al., 2022 - Chinchilla)
- "Llama 2: Open Foundation and Fine-Tuned Chat Models" (Touvron et al., 2023)
- "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" (Rafailov et al., 2023)
- "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (Dao et al., 2022)

**Resources:**
- DeepSpeed Documentation: https://www.deepspeed.ai/
- Hugging Face PEFT: Parameter-Efficient Fine-Tuning
- "Building LLM Applications for Production" (blog post) by Chip Huyen

---

## **15.12 Checkpoint Project: Production LLM Chatbot**

Build a domain-specific AI assistant (e.g., legal, medical, or technical support) deployable at scale.

**Requirements:**

1. **Base Model:** LLaMA-2-13B or Mistral-7B (openly available weights; check each license)

2. **Fine-tuning:**
   - Prepare 10k+ instruction-response pairs in domain
   - Use QLoRA (4-bit) to fit on single A100 40GB
   - DPO training on preference pairs (human-ranked responses)

3. **RAG Integration:**
   - Ingest 1000+ domain documents (PDFs, markdown)
   - Chunk with semantic boundaries (not just fixed size)
   - Hybrid retrieval: Dense (embeddings) + Sparse (BM25)
   - Rerank with cross-encoder

4. **Safety & Alignment:**
   - System prompt with safety guidelines
   - Content moderation filter (toxicity classifier)
   - Refusal training for out-of-scope queries

5. **Deployment:**
   - vLLM serving with continuous batching
   - Quantization to AWQ 4-bit (reduce VRAM, increase throughput)
   - REST API with rate limiting and request validation
   - Streaming responses (Server-Sent Events)

6. **Evaluation:**
   - Benchmark against GPT-3.5 on domain-specific questions
   - Human evaluation (helpfulness, accuracy, safety) on 100 conversations
   - Latency: p95 time to first token < 500ms when generating 500-token responses

**Deliverables:**
- `llm_chatbot/` repository with training, RAG, and serving code
- Docker Compose setup: vLLM server + Vector DB (Chroma/Weaviate) + API gateway
- Evaluation report with measured results, phrased along the lines of "Model achieves X% accuracy vs GPT-3.5's Y%, at Z× lower serving cost"
- Demo video showing multi-turn conversation with RAG grounding

**Success Criteria:**
- Handles 10 concurrent users with <2s latency
- Cites sources for factual claims (RAG attribution)
- Refuses harmful requests appropriately
- Maintains context across 5+ turn conversations

---

**End of Chapter 15**

*You have now covered the core engineering behind Large Language Models. Chapter 16 covers advanced computer vision: Vision Transformers, Diffusion Models, and Multimodal AI.*

