# 📈 Week 14: Advanced Topics in AI Engineering

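This notebook surveys advanced topics for senior AI engineers: model and inference optimization, multi-modal systems, distributed training, and emerging architectures.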

## Table of Contents
1. [Model Optimization](#1-model-optimization)
2. [Inference Optimization](#2-inference-optimization)
3. [Multi-Modal Systems](#3-multi-modal-systems)
4. [Distributed Training](#4-distributed-training)
5. [Emerging Architectures](#5-emerging-architectures)

---

## 1. Model Optimization

### 1.1 Quantization

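Quantization reduces model size and can increase speed by storing weights at lower precision. The figures below are rough rules of thumb relative to FP32; actual results depend on hardware and the quantization method:

| Precision | Bits | Memory (vs FP32) | Speed (vs FP32) | Quality (vs FP32) |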
|-----------|------|--------|-------|--------|
| FP32 | 32 | 1x | 1x | Baseline |
| FP16 | 16 | 0.5x | 1.5-2x | ~Same |
| INT8 | 8 | 0.25x | 2-4x | 95-99% |
| INT4 | 4 | 0.125x | 3-5x | 90-95% |
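
To make the memory column concrete, here is a minimal sketch of the arithmetic. The 7B parameter count is a hypothetical example, and the helper `weight_memory_gb` is illustrative rather than part of any library. It counts weight storage only and ignores quantization scales, activations, and optimizer state.

```python
# Bits per weight for each precision in the table above
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}

def weight_memory_gb(num_params: int, precision: str) -> float:
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return num_params * BITS_PER_WEIGHT[precision] / 8 / 1e9

num_params = 7_000_000_000  # hypothetical 7B-parameter model
for precision in BITS_PER_WEIGHT:
    print(f"{precision}: {weight_memory_gb(num_params, precision):.1f} GB")
```

Halving the bit width halves the storage, which is where the 1x / 0.5x / 0.25x / 0.125x ratios come from.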

In [None]:
import numpy as np

def quantize_to_int8(weights, scale=None):
    """
    Symmetric INT8 quantization.
    
    q = round(w / scale)
    w_reconstructed = q * scale
    """
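    if scale is None:
        # Map the largest magnitude to 127; fall back to 1.0 for all-zero weights
        max_abs = np.max(np.abs(weights))
        scale = max_abs / 127 if max_abs > 0 else 1.0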
    
    quantized = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
    return quantized, scale

def dequantize_int8(quantized, scale):
    """Reconstruct FP32 from INT8."""
    return quantized.astype(np.float32) * scale

# Example
weights = np.random.randn(1000).astype(np.float32)
quantized, scale = quantize_to_int8(weights)
reconstructed = dequantize_int8(quantized, scale)

error = np.mean(np.abs(weights - reconstructed))
print(f"Original size:   {weights.nbytes:,} bytes")
print(f"Quantized size:  {quantized.nbytes:,} bytes")
print(f"Compression:     {weights.nbytes / quantized.nbytes:.1f}x")
print(f"Mean abs error:  {error:.6f}")

### 1.2 Knowledge Distillation

Train a smaller "student" model to mimic a larger "teacher":

$$L = \alpha \cdot L_{CE}(y, \hat{y}_{student}) + (1-\alpha) \cdot L_{KD}(\hat{y}_{teacher}, \hat{y}_{student})$$
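
The soft-target term is usually written with a temperature $T$. As a sketch, assuming $z$ denotes a model's logits (a symbol not used above) and $p^{(T)}$ the temperature-softened distribution:

$$L_{KD} = T^2 \cdot \mathrm{KL}\left(p_{teacher}^{(T)} \,\|\, p_{student}^{(T)}\right), \qquad p^{(T)} = \mathrm{softmax}(z / T)$$

The $T^2$ factor keeps the gradient magnitude of the soft term roughly comparable to the hard term as $T$ changes.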

In [None]:
def distillation_loss(student_logits, teacher_logits, labels, temperature=3.0, alpha=0.5):
    """
    Knowledge distillation loss.
    
    Args:
        student_logits: Student model outputs
        teacher_logits: Teacher model outputs
        labels: Ground truth labels
        temperature: Softmax temperature for soft targets
        alpha: Weight for hard vs soft targets
    """
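    def softmax(logits, t=1.0):
        # Subtract the row max before exponentiating for numerical stability
        z = logits / t
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)
    
    # Soft targets: temperature-scaled distributions from student and teacher
    soft_student = softmax(student_logits, temperature)
    soft_teacher = softmax(teacher_logits, temperature)
    
    # KL(teacher || student), averaged over the batch and scaled by T^2
    kl = np.sum(soft_teacher * (np.log(soft_teacher) - np.log(soft_student)), axis=-1)
    kl_loss = np.mean(kl) * (temperature ** 2)
    
    # Hard target loss: cross-entropy at T=1.
    # Assumes logits are [batch, classes] and labels are integer class indices [batch].
    probs = softmax(student_logits)
    hard_loss = -np.mean(np.log(probs[np.arange(len(labels)), labels]))
    
    return alpha * hard_loss + (1 - alpha) * kl_loss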

print("Distillation function defined!")
print("\nBenefits of distillation:")
print("  - Students are typically 2-10x smaller than the teacher")
print("  - Often retains 95%+ of teacher performance (task-dependent)")
print("  - Architecture-agnostic: only the output logits must align")

---

## 2. Inference Optimization

### 2.1 KV Cache

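During autoregressive generation, each new token attends to all previous tokens. Caching the keys and values of past positions means only the new token's K and V are computed at each step: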

In [None]:
class KVCache:
    """
    Key-Value cache for efficient autoregressive generation.
    
    Without cache: O(n²) per token (recompute all)
    With cache:    O(n) per token (only new position)
    """
    
    def __init__(self, num_layers: int, max_length: int, dim: int):
        self.num_layers = num_layers
        self.max_length = max_length
        self.dim = dim
        self.cache = {}
        self.length = 0
    
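    def update(self, layer_idx: int, key: np.ndarray, value: np.ndarray):
        """Append new K, V for one layer (assumes shape [new_tokens, dim])."""
        if layer_idx not in self.cache:
            self.cache[layer_idx] = {"key": [], "value": []}
        
        # Enforce the max_length budget declared in __init__
        cached = sum(k.shape[0] for k in self.cache[layer_idx]["key"])
        if cached + key.shape[0] > self.max_length:
            raise ValueError("KV cache would exceed max_length")
        
        self.cache[layer_idx]["key"].append(key)
        self.cache[layer_idx]["value"].append(value)
        self.length = cached + key.shape[0]
    
    def get(self, layer_idx: int):
        """Get cached K, V concatenated along the sequence axis."""
        if layer_idx in self.cache:
            return (
                np.concatenate(self.cache[layer_idx]["key"], axis=0),
                np.concatenate(self.cache[layer_idx]["value"], axis=0)
            )
        return None, None
    
    def clear(self):
        """Drop all cached entries and reset the tracked length."""
        self.cache = {}
        self.length = 0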

cache = KVCache(num_layers=12, max_length=2048, dim=768)
print("✅ KV Cache implemented for efficient generation")
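
To see why caching is safe, the sketch below checks that single-head attention computed step by step with cached K and V matches a full recomputation with a causal mask. It is a self-contained toy: the projection matrices `Wq`, `Wk`, `Wv`, the sizes, and the helper `attend` are all illustrative assumptions, and plain Python lists stand in for the cache.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, K, V):
    """Scaled dot-product attention of queries q over keys K and values V."""
    return softmax(q @ K.T / np.sqrt(K.shape[-1])) @ V

rng = np.random.default_rng(0)
dim, steps = 16, 5
Wq, Wk, Wv = (rng.standard_normal((dim, dim)) * 0.1 for _ in range(3))
xs = rng.standard_normal((steps, dim))  # one input vector per generated position

# Incremental decoding: project only the new position, reuse cached K and V
k_cache, v_cache, outputs = [], [], []
for t in range(steps):
    x_t = xs[t:t + 1]
    k_cache.append(x_t @ Wk)
    v_cache.append(x_t @ Wv)
    K = np.concatenate(k_cache, axis=0)
    V = np.concatenate(v_cache, axis=0)
    outputs.append(attend(x_t @ Wq, K, V))
cached_out = np.concatenate(outputs, axis=0)

# Reference: recompute attention for all positions with a causal mask
scores = (xs @ Wq) @ (xs @ Wk).T / np.sqrt(dim)
scores[np.triu(np.ones((steps, steps), dtype=bool), k=1)] = -np.inf
full_out = softmax(scores) @ (xs @ Wv)

print("Cached and full outputs match:", np.allclose(cached_out, full_out))
```

Each cached step projects one new position instead of the whole prefix, which is the saving the docstring describes.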

### 2.2 Speculative Decoding

Use a small model to draft, then verify with the large model:

```
1. Draft model generates K tokens quickly
2. Target model verifies all K tokens in parallel
3. Accept the longest prefix of draft tokens the target agrees with; at the first rejection, resample that token from the target and discard the rest
```
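
The steps above can be sketched for the greedy case, where "agrees" simply means the target's argmax token equals the drafted token. This is a simplification: sampling-based speculative decoding uses a probabilistic accept/reject rule instead. The function names and the toy models below are hypothetical stand-ins, and the verification loop calls the target once per position where a real system would use one batched forward pass.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token context to the next token id

def speculative_decode_greedy(target_next: NextToken, draft_next: NextToken,
                              prompt: List[int], k: int = 4,
                              max_new_tokens: int = 12) -> List[int]:
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # 1. Draft model proposes up to k tokens autoregressively
        draft: List[int] = []
        for _ in range(min(k, limit - len(tokens))):
            draft.append(draft_next(tokens + draft))
        # 2. Target checks each drafted position
        accepted, correction = 0, None
        for i, tok in enumerate(draft):
            expected = target_next(tokens + draft[:i])
            if tok != expected:
                correction = expected  # 3. first mismatch: take the target's token
                break
            accepted += 1
        tokens.extend(draft[:accepted])
        if correction is not None:
            tokens.append(correction)
    return tokens

# Toy deterministic "models": the draft agrees with the target most of the time
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] * 3 + 1) % 11

def toy_draft(ctx: List[int]) -> int:
    return 0 if len(ctx) % 3 == 0 else toy_target(ctx)

output = speculative_decode_greedy(toy_target, toy_draft, prompt=[1, 2])
print(output)
```

In this greedy setting the result is identical to decoding with the target alone; the draft only changes how many target calls can be batched together.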

---

## 3. Multi-Modal Systems

### 3.1 Vision-Language Models

| Model | Architecture | Input | Use Case |
|-------|-------------|-------|----------|
| CLIP | Dual encoder | Image + Text | Retrieval |
| BLIP | Encoder-Decoder | Image → Text | Captioning |
| LLaVA | LLM + Vision | Image + Text | Chat |
| GPT-4V | Undisclosed (proprietary) | Image + Text | General |
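
For the dual-encoder row, retrieval reduces to ranking images by cosine similarity to a text embedding in a shared space. The sketch below shows only that ranking step. The embeddings are random stand-ins rather than outputs of a real encoder, and the names `retrieve` and `l2_normalize` and the 512-dimensional size are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def retrieve(text_embedding, image_embeddings, top_k=3):
    """Return indices and cosine similarities of the top_k closest images."""
    sims = l2_normalize(image_embeddings) @ l2_normalize(text_embedding)
    order = np.argsort(-sims)[:top_k]
    return order, sims[order]

rng = np.random.default_rng(0)
image_embeddings = rng.standard_normal((100, 512))  # stand-in image features
# Stand-in "caption" embedding: a noisy copy of image 42's embedding
text_embedding = image_embeddings[42] + 0.1 * rng.standard_normal(512)

indices, scores = retrieve(text_embedding, image_embeddings)
print("Top matches:", indices)
print("Similarities:", np.round(scores, 3))
```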

In [None]:
class SimpleVisionLanguageModel:
    """
    Conceptual VLM architecture.
    
    Image → Vision Encoder → Projection → LLM
    """
    
    def __init__(self, vision_dim: int = 768, llm_dim: int = 4096):
        self.vision_dim = vision_dim
        self.llm_dim = llm_dim
        
        # Projection layer to align vision with LLM
        self.projection = np.random.randn(vision_dim, llm_dim) * 0.01
    
    def encode_image(self, image):
        """Encode image to vision features."""
        # Simulated vision encoder output
        vision_features = np.random.randn(196, self.vision_dim)  # 14x14 patches
        return vision_features
    
    def project_to_llm(self, vision_features):
        """Project vision features to LLM embedding space."""
        return vision_features @ self.projection
    
    def forward(self, image, text_embeddings):
        """Process image and text together."""
        vision_features = self.encode_image(image)
        projected = self.project_to_llm(vision_features)
        
        # Concatenate vision tokens with text tokens
        combined = np.concatenate([projected, text_embeddings], axis=0)
        return combined

vlm = SimpleVisionLanguageModel()
print(f"Vision dim:     {vlm.vision_dim}")
print(f"LLM dim:        {vlm.llm_dim}")
print(f"Projection:     {vlm.projection.shape}")

---

## 4. Distributed Training

### 4.1 Parallelism Strategies

| Strategy | Splits | Use Case |
|----------|--------|----------|
| **Data Parallel** | Data batches | Most common |
| **Tensor Parallel** | Individual layers | Large layers |
| **Pipeline Parallel** | Sequential layers | Deep models |
| **FSDP** | Parameters, gradients, and optimizer state (sharded across data-parallel workers) | Memory efficient |
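
Data parallelism, the first row above, works because averaging per-worker gradients over equal-sized shards reproduces the full-batch gradient. The NumPy sketch below simulates this on one machine with a linear model and mean squared error. The model, the sizes, and the helper `grad_mse` are illustrative assumptions, and `np.mean` stands in for a real all-reduce.

```python
import numpy as np

def grad_mse(w, X, y):
    """Gradient of mean squared error for a linear model y ~ X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 8))
y = rng.standard_normal(64)
w = np.zeros(8)
num_workers = 4

# Split the batch into equal shards, one per simulated worker
X_shards = np.split(X, num_workers)
y_shards = np.split(y, num_workers)

# Each worker computes a gradient on its own shard
local_grads = [grad_mse(w, Xs, ys) for Xs, ys in zip(X_shards, y_shards)]

# Stand-in for all-reduce: average the local gradients
avg_grad = np.mean(local_grads, axis=0)

# Every replica applies the same update, so the weights stay in sync
w_new = w - 0.1 * avg_grad

full_grad = grad_mse(w, X, y)
print("Matches full-batch gradient:", np.allclose(avg_grad, full_grad))
```

The equality relies on equal shard sizes; with uneven shards the average would need to be weighted by shard size.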

In [None]:
# Data Parallel pseudocode
print("""
Data Parallel Training:
=======================

# Each GPU gets a different shard of the batch
for batch in data_loader:
    # 1. Split batch across GPUs
    batch_gpu0, batch_gpu1, batch_gpu2, batch_gpu3 = split(batch)
    
    # 2. Forward pass (parallel)
    loss_0 = model_gpu0(batch_gpu0)
    loss_1 = model_gpu1(batch_gpu1)
    loss_2 = model_gpu2(batch_gpu2)
    loss_3 = model_gpu3(batch_gpu3)
    
    # 3. Backward pass (parallel)
    grads_0 = backward(loss_0)
    grads_1 = backward(loss_1)
    grads_2 = backward(loss_2)
    grads_3 = backward(loss_3)
    
    # 4. All-reduce gradients
    avg_grads = all_reduce_avg([grads_0, grads_1, grads_2, grads_3])
    
    # 5. Update weights (same on all GPUs)
    optimizer.step(avg_grads)
""")

---

## 5. Emerging Architectures

### 5.1 Mixture of Experts (MoE)

```
Input → Router → [Expert 1] ←─ Selected
            ╲         [Expert 2]
             ╲        [Expert 3] ←─ Selected
              ╲       [Expert 4]
               ───────────────────→ Weighted Sum → Output
```

In [None]:
class SimpleMoE:
    """
    Simple Mixture of Experts layer.
    
    - Router selects top-k experts per token
    - Only selected experts compute
    - Enables massive models with sparse compute
    """
    
    def __init__(self, num_experts: int = 8, top_k: int = 2, dim: int = 768):
        self.num_experts = num_experts
        self.top_k = top_k
        
        # Expert networks (simplified as linear)
        self.experts = [np.random.randn(dim, dim) for _ in range(num_experts)]
        
        # Router
        self.router = np.random.randn(dim, num_experts)
    
    def forward(self, x):
        """Forward pass with sparse computation."""
        # Compute router scores
        scores = x @ self.router  # [batch, num_experts]
        
        # Select top-k experts
        top_k_indices = np.argsort(scores)[:, -self.top_k:]
        
        # Softmax over selected experts (subtract the row max for numerical stability)
        top_k_scores = np.take_along_axis(scores, top_k_indices, axis=1)
        exp_scores = np.exp(top_k_scores - top_k_scores.max(axis=1, keepdims=True))
        weights = exp_scores / exp_scores.sum(axis=1, keepdims=True)
        
        # Compute weighted sum of expert outputs
        output = np.zeros_like(x)
        for i, (idx, w) in enumerate(zip(top_k_indices, weights)):
            for j, (expert_idx, weight) in enumerate(zip(idx, w)):
                expert_out = x[i:i+1] @ self.experts[expert_idx]
                output[i] += weight * expert_out.flatten()
        
        return output

moe = SimpleMoE(num_experts=8, top_k=2)
print(f"Total experts:   {moe.num_experts}")
print(f"Active per token: {moe.top_k}")
print(f"Compute ratio:    {moe.top_k / moe.num_experts:.1%}")
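
The same ratio shows up in parameter counts. As a rough sketch, assuming each expert is a single `dim x dim` linear map and the router is `dim x num_experts` as in the class above, total versus per-token active parameters can be estimated as follows; the helper `moe_param_counts` is hypothetical.

```python
def moe_param_counts(num_experts: int, top_k: int, dim: int):
    """Return (total, active-per-token) parameter counts for one MoE layer."""
    expert_params = dim * dim          # one linear expert
    router_params = dim * num_experts  # router is always evaluated
    total = num_experts * expert_params + router_params
    active = top_k * expert_params + router_params
    return total, active

total, active = moe_param_counts(num_experts=8, top_k=2, dim=768)
print(f"Total parameters:  {total:,}")
print(f"Active per token:  {active:,}")
print(f"Active fraction:   {active / total:.1%}")
```

Adding experts grows the total while the active count stays fixed, which is how sparse models scale capacity without a matching increase in per-token compute.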

---

## 📝 Summary

### Key Advanced Topics

| Topic | Key Technique | Benefit |
|-------|--------------|--------|
| Quantization | INT8/INT4 | 4-8x smaller than FP32, roughly 2-5x faster |
| Distillation | Teacher-student | 2-10x smaller models |
| KV Cache | Reuse computation | Faster generation |
| MoE | Sparse experts | Scale to trillions of parameters |