# **CHAPTER 16: COMPUTER VISION ADVANCED**

*Beyond Convolution*

## **Chapter Overview**

While CNNs dominated computer vision for a decade, the Transformer architecture has now conquered vision through Vision Transformers (ViT), enabling unprecedented scale and multimodal understanding. Simultaneously, generative models have evolved from GANs to Diffusion Models, enabling photorealistic image synthesis. This chapter bridges CNNs and Transformers, covers self-supervised learning at scale, and explores the multimodal frontier where vision meets language.

**Estimated Time:** 60-70 hours (4-5 weeks)  
**Prerequisites:** Chapters 12 (CNNs), 14 (Transformers), 15 (LLMs)

---

## **16.0 Learning Objectives**

By the end of this chapter, you will be able to:
1. Implement Vision Transformers (ViT) from scratch and understand patch embedding strategies
2. Train self-supervised vision models using contrastive learning (SimCLR) and masked autoencoding (MAE)
3. Build and train diffusion models for image generation, understanding the forward/reverse processes
4. Implement and fine-tune multimodal vision-language models (CLIP, LLaVA) for zero-shot classification and retrieval
5. Apply computer vision models to video understanding (temporal modeling, 3D convolutions)
6. Optimize vision models for efficient deployment (knowledge distillation, patching strategies)

---

## **16.1 Vision Transformers (ViT)**

#### **16.1.1 From CNNs to Transformers**

CNNs excel at local feature extraction but struggle with global relationships without deep stacks. ViT applies Transformers directly to image patches.

**Architecture:**
1. **Patch Embedding:** Split image $x \in \mathbb{R}^{H \times W \times C}$ into patches $x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$
   - $P$ = patch resolution (typically 16)
   - $N = HW/P^2$ = number of patches (196 for 224×224 image with P=16)

2. **Linear Projection:** Map each patch to embedding dimension $D$ using trainable matrix $\mathbf{E} \in \mathbb{R}^{(P^2 \cdot C) \times D}$

3. **Position Embeddings:** Add learned 1D position embeddings (or 2D sin-cos)

4. **Transformer Encoder:** Standard Transformer blocks (Multi-head Self-Attention + MLP)

5. **Classification Head:** Prepend learnable [CLS] token, use its final state for classification

```python
import torch
import torch.nn as nn
import math

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.img_size = img_size
        self.patch_size = patch_size
        self.n_patches = (img_size // patch_size) ** 2
        
        # Conv2d with stride = kernel_size is equivalent to patch extraction
        self.proj = nn.Conv2d(
            in_channels, embed_dim, 
            kernel_size=patch_size, stride=patch_size
        )
        
    def forward(self, x):
        # x: (B, C, H, W)
        x = self.proj(x)  # (B, embed_dim, H/P, W/P)
        x = x.flatten(2)  # (B, embed_dim, n_patches)
        x = x.transpose(1, 2)  # (B, n_patches, embed_dim)
        return x

class VisionTransformer(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_channels=3, 
                 num_classes=1000, embed_dim=768, depth=12, 
                 num_heads=12, mlp_ratio=4.0):
        super().__init__()
        self.patch_embed = PatchEmbedding(img_size, patch_size, in_channels, embed_dim)
        num_patches = self.patch_embed.n_patches
        
        # Class token and position embeddings
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        
        # Transformer Encoder
        # norm_first=True gives the pre-LayerNorm blocks used by ViT
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, 
            dim_feedforward=int(embed_dim * mlp_ratio),
            dropout=0.1, activation='gelu', batch_first=True,
            norm_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)
        
        # Initialize
        nn.init.normal_(self.cls_token, std=0.02)
        nn.init.normal_(self.pos_embed, std=0.02)
        
    def forward(self, x):
        B = x.shape[0]
        x = self.patch_embed(x)  # (B, n_patches, embed_dim)
        
        # Add cls token
        cls_tokens = self.cls_token.expand(B, -1, -1)  # (B, 1, embed_dim)
        x = torch.cat([cls_tokens, x], dim=1)  # (B, n_patches+1, embed_dim)
        
        # Add position embeddings
        x = x + self.pos_embed
        
        # Transformer
        x = self.transformer(x)
        x = self.norm(x)
        
        # Classifier on CLS token
        cls_output = x[:, 0]
        return self.head(cls_output)
```
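
The shape bookkeeping in the architecture steps above can be checked with a few lines of plain Python. This is a minimal sketch; the helper name `vit_shapes` is hypothetical and it assumes square images and square patches.

```python
def vit_shapes(img_size=224, patch_size=16, in_channels=3, embed_dim=768):
    """Token bookkeeping for a square image split into square patches."""
    if img_size % patch_size != 0:
        raise ValueError("img_size must be divisible by patch_size")
    n_patches = (img_size // patch_size) ** 2        # N = HW / P^2
    patch_dim = patch_size ** 2 * in_channels        # P^2 * C values per flattened patch
    seq_len = n_patches + 1                          # +1 for the [CLS] token
    proj_params = patch_dim * embed_dim + embed_dim  # weights + bias of the projection
    return {
        "n_patches": n_patches,
        "patch_dim": patch_dim,
        "seq_len": seq_len,
        "proj_params": proj_params,
    }


print(vit_shapes())               # ViT-Base style defaults on 224x224 images
print(vit_shapes(32, 4, 3, 192))  # a CIFAR-sized setting with 4x4 patches
```

With the defaults, each 16×16×3 patch flattens to 768 values, which happens to equal the embedding dimension, and the Transformer sees 197 tokens.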

#### **16.1.2 Inductive Bias Trade-off**

**CNNs:** Locality and translation equivariance baked into architecture (prior knowledge about images).

**ViT:** No image-specific inductive bias except patch extraction. Learns spatial relationships from scratch.

**Implication:** ViT requires more data (ImageNet-21k or JFT-300M) to outperform CNNs, but scales better to huge datasets.

**Hybrid Approaches:**
- **CoAtNet:** Combines convolutions and attention
- **Swin Transformer:** Hierarchical ViT with shifted windows (local attention)
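
To make the benefit of local (windowed) attention concrete, the sketch below counts query-key pairs for global attention versus attention restricted to non-overlapping windows. The 56×56 token grid and 7×7 window are illustrative assumptions, and the function names are hypothetical.

```python
def global_attention_pairs(h, w):
    """Every token attends to every token: (h*w)^2 query-key pairs."""
    n = h * w
    return n * n


def windowed_attention_pairs(h, w, window):
    """Each token attends only to the window*window tokens in its own window."""
    if h % window or w % window:
        raise ValueError("grid must be divisible by the window size")
    return h * w * window * window


g = global_attention_pairs(56, 56)
l = windowed_attention_pairs(56, 56, 7)
print(g, l, g // l)
```

Global attention grows quadratically with the number of tokens, while windowed attention grows linearly for a fixed window size, which is what makes hierarchical designs practical at high resolution.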

---

## **16.2 Self-Supervised Learning for Vision**

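Learning representations without labels, using supervisory signals derived from the data itself (for example, agreement between augmented views of an image, or reconstruction of masked patches).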

#### **16.2.1 Contrastive Learning (SimCLR, MoCo)**

**Core Idea:** Pull together augmented views of same image, push apart different images.

**SimCLR Loss (NT-Xent):**

$$\mathcal{L}_{i,j} = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{k \neq i} \exp(\text{sim}(z_i, z_k)/\tau)}$$

Where $\text{sim}(u,v) = \frac{u^T v}{\|u\| \|v\|}$ (cosine similarity)

**Key Components:**
1. **Data Augmentation:** Random crop, color jitter, Gaussian blur (stronger than supervised)
2. **Projection Head:** Small MLP maps representations to contrastive loss space (removed after pretraining)
3. **Large Batch Size:** Needs many negatives (4096+)

```python
import torch
import torch.nn.functional as F

# Simplified SimCLR loss
def nt_xent_loss(z_i, z_j, temperature=0.5):
    """
    z_i, z_j: (batch, dim) representations of two views
    """
    batch_size = z_i.shape[0]
    z = torch.cat([z_i, z_j], dim=0)  # (2*batch, dim)
    z = F.normalize(z, dim=1)
    
    # Cosine similarity matrix, scaled by temperature
    similarity = torch.mm(z, z.T) / temperature  # (2*batch, 2*batch)
    
    # Mask out self-similarity so k = i never enters the denominator
    mask = torch.eye(2 * batch_size, device=z.device, dtype=torch.bool)
    similarity = similarity.masked_fill(mask, float("-inf"))
    
    # Positive pairs: row i matches row i+batch, and row i+batch matches row i
    idx = torch.arange(batch_size, device=z.device)
    targets = torch.cat([idx + batch_size, idx], dim=0)  # (2*batch,)
    
    # Each row is a softmax over all other samples, with the positive as the
    # target class, so the positive appears exactly once in the denominator
    return F.cross_entropy(similarity, targets)
```

**MoCo (Momentum Contrast):** Uses queue of past representations as negatives, allowing smaller batches.
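
The two mechanisms behind MoCo can be sketched in plain Python: an exponential-moving-average update of the key encoder, and a fixed-size FIFO queue of past keys used as negatives. This is a toy sketch on flat lists of floats; the names `momentum_update` and `NegativeQueue` are hypothetical, and `m=0.999` is an assumed typical momentum value.

```python
from collections import deque


def momentum_update(key_params, query_params, m=0.999):
    """Key encoder slowly tracks the query encoder: k <- m*k + (1-m)*q."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


class NegativeQueue:
    """FIFO queue of past key representations; the oldest entries drop out first."""

    def __init__(self, max_size):
        self.items = deque(maxlen=max_size)

    def enqueue(self, keys):
        self.items.extend(keys)

    def __len__(self):
        return len(self.items)


queue = NegativeQueue(max_size=4)
queue.enqueue(["k1", "k2", "k3"])
queue.enqueue(["k4", "k5"])  # "k1" is evicted
print(list(queue.items))
print(momentum_update([1.0, 2.0], [0.0, 0.0], m=0.9))
```

Because the queue decouples the number of negatives from the batch size, and the slow-moving key encoder keeps queued keys roughly consistent, small batches can still see many negatives.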

#### **16.2.2 Masked Autoencoders (MAE)**

BERT-style pretraining for images: Mask random patches, reconstruct pixel values.

**Asymmetric Design:**
- **Encoder:** ViT processing only visible patches (25% of image) → efficient
- **Decoder:** Lightweight Transformer reconstructing all patches
- **Reconstruction Target:** Normalized pixel values (per-patch)

**High Masking Ratio:** 75% (unlike BERT's 15%), because images have high redundancy.
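
A quick calculation shows why the high masking ratio makes the encoder cheap. The sketch below assumes 196 patches (224×224 image, 16×16 patches) and uses the quadratic cost of self-attention as a rough proxy; the helper name `mae_encoder_cost` is hypothetical.

```python
def mae_encoder_cost(n_patches=196, mask_ratio=0.75):
    """Return (visible tokens, masked tokens, attention cost relative to no masking)."""
    n_visible = int(n_patches * (1 - mask_ratio))
    n_masked = n_patches - n_visible
    # Self-attention cost scales with the square of the sequence length
    attention_fraction = (n_visible / n_patches) ** 2
    return n_visible, n_masked, attention_fraction


print(mae_encoder_cost())  # 49 visible, 147 masked, 1/16 of the attention cost
```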

```python
class MAE(nn.Module):
    def __init__(self, img_size=224, patch_size=16, mask_ratio=0.75):
        super().__init__()
        self.patch_embed = PatchEmbedding(img_size, patch_size, 3, 768)
        self.mask_ratio = mask_ratio
        num_patches = self.patch_embed.n_patches
        
        # Position embeddings for encoder and decoder (learned here for brevity;
        # the MAE paper uses fixed 2D sin-cos embeddings)
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches, 768) * 0.02)
        self.decoder_pos_embed = nn.Parameter(torch.randn(1, num_patches, 768) * 0.02)
        
        # Encoder (standard ViT)
        encoder_layer = nn.TransformerEncoderLayer(768, 12, 3072, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, 12)
        
        # Decoder (lightweight)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, 768))
        decoder_layer = nn.TransformerEncoderLayer(768, 16, 2048, batch_first=True)
        self.decoder = nn.TransformerEncoder(decoder_layer, 8)
        
        # Reconstruction head: one flattened RGB patch per token
        self.head = nn.Linear(768, patch_size**2 * 3)
        
    def random_masking(self, x):
        N, L, D = x.shape  # batch, length, dim
        len_keep = int(L * (1 - self.mask_ratio))
        
        # Sorting random noise yields an independent random permutation per sample
        noise = torch.rand(N, L, device=x.device)
        ids_shuffle = torch.argsort(noise, dim=1)
        ids_restore = torch.argsort(ids_shuffle, dim=1)
        
        # Keep subset
        ids_keep = ids_shuffle[:, :len_keep]
        x_masked = torch.gather(x, dim=1, index=ids_keep.unsqueeze(-1).expand(-1, -1, D))
        
        # Generate mask (0 is keep, 1 is remove), unshuffled to original patch order
        mask = torch.ones([N, L], device=x.device)
        mask[:, :len_keep] = 0
        mask = torch.gather(mask, dim=1, index=ids_restore)
        
        return x_masked, mask, ids_restore
        
    def forward(self, imgs):
        # Embed patches
        x = self.patch_embed(imgs)
        
        # Add position embeddings before masking so visible patches keep their location
        x = x + self.pos_embed
        
        # Masking
        x, mask, ids_restore = self.random_masking(x)
        
        # Encode visible patches only
        x = self.encoder(x)
        
        # Append mask tokens, then unshuffle back to the original patch order
        mask_tokens = self.mask_token.expand(x.shape[0], ids_restore.shape[1] - x.shape[1], -1)
        x_full = torch.cat([x, mask_tokens], dim=1)
        x_full = torch.gather(x_full, dim=1, index=ids_restore.unsqueeze(-1).expand(-1, -1, x.shape[2]))
        
        # Position embeddings tell the decoder where each mask token sits
        x_full = x_full + self.decoder_pos_embed
        
        # Decode and reconstruct
        x = self.decoder(x_full)
        pred = self.head(x)
        
        return pred, mask
```

---

## **16.3 Generative Models**

#### **16.3.1 Generative Adversarial Networks (GANs)**

Two-player game: Generator $G$ creates fake images, Discriminator $D$ distinguishes real from fake.

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$
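
As a numeric illustration of the value function, the sketch below evaluates $V(D, G)$ from discriminator outputs on a batch of real and generated samples. The helper `gan_value` is hypothetical and takes the discriminator probabilities directly rather than running any networks.

```python
import math


def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G).

    d_real: discriminator outputs D(x) on real samples, each in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1)
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# A confident, correct discriminator pushes V toward 0 (its maximum)
print(gan_value([0.99, 0.98], [0.01, 0.02]))

# When the generator matches the data distribution, the best D can do is 0.5
# everywhere, which gives V = -log(4)
print(gan_value([0.5, 0.5], [0.5, 0.5]), -math.log(4))
```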

**Modern GANs:**
- **StyleGAN:** Style-based generator with progressive growing, mapping latent to intermediate latent space (disentanglement)
- **CycleGAN:** Unpaired image-to-image translation using cycle consistency loss

**GAN Limitations:** Mode collapse, training instability, hard to evaluate.

#### **16.3.2 Variational Autoencoders (VAEs)**

An encoder learns an approximate posterior $q_\phi(z|x)$ over latents; a decoder $p_\theta(x|z)$ generates images from samples of $z$.

**ELBO (Evidence Lower Bound):**
$$\mathcal{L} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) || p(z))$$

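The first term is the reconstruction log-likelihood; the second is a KL regularizer that pulls the approximate posterior toward the prior. Training maximizes the ELBO, which is equivalent to minimizing reconstruction loss plus KL divergence.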
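
For the common case of a diagonal Gaussian posterior and a standard normal prior, the KL term has a closed form, and sampling uses the reparameterization trick. The sketch below assumes that setup (the text above does not fix the distribution family); the function names are hypothetical.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), so gradients can reach mu and sigma."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


rng = random.Random(0)
mu, log_var = [1.0, -0.5], [0.0, 0.0]
print(kl_to_standard_normal(mu, log_var))  # 0.5 * (1.0 + 0.25) = 0.625
print(reparameterize(mu, log_var, rng))    # one random latent sample
```

The KL term is zero only when the posterior equals the prior, which is why it acts as a regularizer on the latent space.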

**VQ-VAE (Vector Quantized):** Discrete latents using codebook.
- Encoder outputs continuous → nearest neighbor lookup in codebook
- Straight-through estimator for backpropagation
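- Discrete-latent tokenizers of this kind appear in DALL-E (a closely related discrete VAE) and SoundStream (residual vector quantization)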
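
The nearest-neighbor lookup can be sketched without any framework. This is a toy illustration with a three-entry codebook; `quantize` is a hypothetical helper, and the straight-through estimator is not shown because it needs an autograd framework.

```python
def quantize(vector, codebook):
    """Return (index, code) of the codebook entry closest to `vector` in squared L2 distance."""

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return index, codebook[index]


codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
index, code = quantize([0.9, 0.1], codebook)
print(index, code)  # the continuous encoder output snaps to entry 1
```

The decoder only ever sees codebook entries, so each image becomes a grid of integer indices that a downstream sequence model can treat like tokens.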

#### **16.3.3 Diffusion Models**

Current state-of-the-art for image generation (Stable Diffusion, DALL-E 2, Imagen).

**Forward Process (Diffusion):**
Gradually add Gaussian noise over $T$ timesteps:
$$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t} x_{t-1}, \beta_t \mathbf{I})$$

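After reparameterization, with $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, $x_t$ can be sampled directly from $x_0$: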
$$x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon, \quad \epsilon \sim \mathcal{N}(0, \mathbf{I})$$
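
To see how much signal survives at each timestep, the sketch below computes $\bar{\alpha}_t$ as the running product of $1 - \beta_t$. It assumes a linear schedule from $10^{-4}$ to $0.02$ over 1000 steps; the helper names are hypothetical.

```python
import math


def linear_beta_schedule(timesteps=1000, beta_start=1e-4, beta_end=0.02):
    """Evenly spaced betas from beta_start to beta_end (inclusive)."""
    step = (beta_end - beta_start) / (timesteps - 1)
    return [beta_start + i * step for i in range(timesteps)]


def alpha_bars(betas):
    """Cumulative product of (1 - beta_t)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def noising_coefficients(alpha_bar_t):
    """Weights on x_0 and on the noise in x_t = sqrt(ab) * x_0 + sqrt(1 - ab) * eps."""
    return math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)


abar = alpha_bars(linear_beta_schedule())
for t in (0, 250, 500, 999):
    signal, noise = noising_coefficients(abar[t])
    print(f"t={t:4d}  signal={signal:.4f}  noise={noise:.4f}")
```

The squared coefficients always sum to one, and by the final step the signal weight is close to zero, so $x_T$ is almost pure Gaussian noise.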

**Reverse Process (Denoising):**
Learn $p_\theta(x_{t-1}|x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$

**Training Objective (Simple):**
Predict the noise $\epsilon$ added to $x_0$:
$$\mathcal{L} = \mathbb{E}_{x_0, t, \epsilon} \|\epsilon - \epsilon_\theta(x_t, t)\|^2$$

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Simplified Diffusion Model (DDPM)
class Diffusion(nn.Module):
    def __init__(self, model, timesteps=1000):
        super().__init__()
        self.model = model  # Noise-prediction network (e.g., U-Net), called as model(x_t, t)
        self.timesteps = timesteps
        
        # Pre-compute a linear noise schedule (a cosine schedule is a common alternative)
        betas = torch.linspace(0.0001, 0.02, timesteps)
        alphas = 1.0 - betas
        alphas_cumprod = torch.cumprod(alphas, dim=0)
        
        self.register_buffer('betas', betas)
        self.register_buffer('alphas_cumprod', alphas_cumprod)
        self.register_buffer('sqrt_alphas_cumprod', torch.sqrt(alphas_cumprod))
        self.register_buffer('sqrt_one_minus_alphas_cumprod', torch.sqrt(1.0 - alphas_cumprod))
        
    def forward(self, x_0, t):
        # x_0: (B, C, H, W); t: (B,) long tensor of timesteps in [0, timesteps)
        noise = torch.randn_like(x_0)
        
        # Reshape per-sample coefficients to (B, 1, 1, 1) so they broadcast over C, H, W
        sqrt_ab = self.sqrt_alphas_cumprod[t].view(-1, 1, 1, 1)
        sqrt_one_minus_ab = self.sqrt_one_minus_alphas_cumprod[t].view(-1, 1, 1, 1)
        x_t = sqrt_ab * x_0 + sqrt_one_minus_ab * noise
        
        # Predict noise
        predicted_noise = self.model(x_t, t)
        
        return F.mse_loss(predicted_noise, noise)
    
    @torch.no_grad()
    def sample(self, batch_size, channels, height, width):
        # Start from pure noise
        x = torch.randn(batch_size, channels, height, width, device=self.betas.device)
        
        for t in reversed(range(self.timesteps)):
            t_batch = torch.full((batch_size,), t, device=x.device, dtype=torch.long)
            predicted_noise = self.model(x, t_batch)
            
            beta_t = self.betas[t]
            alpha_bar_t = self.alphas_cumprod[t]
            
            # Posterior mean: (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t),
            # where alpha_t = 1 - beta_t
            x = (x - beta_t / torch.sqrt(1 - alpha_bar_t) * predicted_noise) / torch.sqrt(1 - beta_t)
            
            if t > 0:
                # Add fresh noise with variance beta_t (skipped at the final step)
                x = x + torch.sqrt(beta_t) * torch.randn_like(x)
        
        return x
```

**Latent Diffusion Models (Stable Diffusion):** Apply diffusion in VAE latent space (lower dimensional), conditioned on text embeddings (CLIP).
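
The saving from working in latent space is easy to quantify. The sketch below assumes the commonly cited Stable Diffusion configuration (512×512 RGB input, 8× spatial downsampling, 4 latent channels); these numbers and the helper name are assumptions, not taken from the text above.

```python
def latent_compression(img_size=512, img_channels=3, downsample=8, latent_channels=4):
    """Return (latent spatial size, ratio of pixel elements to latent elements)."""
    latent_size = img_size // downsample
    pixel_elements = img_size * img_size * img_channels
    latent_elements = latent_size * latent_size * latent_channels
    return latent_size, pixel_elements / latent_elements


size, ratio = latent_compression()
print(f"latent: {size}x{size}x4, {ratio:.0f}x fewer elements than pixel space")
```

Every denoising step runs on this smaller tensor, which is where most of the speedup over pixel-space diffusion comes from.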

---

## **16.4 Multimodal Vision-Language Models**

#### **16.4.1 CLIP (Contrastive Language-Image Pre-training)**

Jointly train an image encoder and a text encoder to maximize the cosine similarity of matching (image, text) pairs in a batch and minimize it for all non-matching pairs.
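
The training objective can be sketched as a symmetric cross-entropy over the batch similarity matrix: each image should pick out its own caption, and each caption its own image. This is a plain-Python sketch on toy embeddings; the function names are hypothetical, and the fixed `temperature=0.07` is an illustrative assumption (CLIP treats the temperature as a learned parameter).

```python
import math


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross_entropy(logits, target):
    """-log softmax(logits)[target], computed with the log-sum-exp trick."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Pair i of image_embs and text_embs is assumed to be the matching pair."""
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[r][c] = cosine similarity of image r and text c, divided by temperature
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    loss_i2t = sum(_cross_entropy(logits[r], r) for r in range(n)) / n
    loss_t2i = sum(_cross_entropy([logits[r][c] for r in range(n)], c) for c in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)


aligned = clip_contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
shuffled = clip_contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
print(aligned, shuffled)  # aligned pairs give a much lower loss
```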

**Zero-shot Classification:**
```python
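import torch
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Load the image to classify ("example.jpg" is a placeholder path)
image = Image.open("example.jpg")

# Prepare text descriptions, one prompt per candidate class
texts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

# Inference only, so no gradients are needed
with torch.no_grad():
    outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # (n_images, n_texts)
probs = logits_per_image.softmax(dim=1)  # probability of each prompt per image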
```

#### **16.4.2 LLaVA (Large Language and Vision Assistant)**

Connect CLIP vision encoder to LLM (Vicuna/Llama) via projection layer. Fine-tune on instruction-following data.

**Architecture:**
1. Image → CLIP ViT → patch (grid) features, one vector per image patch
2. Projection layer → LLM embedding space
3. Concatenate with text tokens → LLM generates response
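
The data flow through the three steps can be sketched with plain lists: visual feature vectors are multiplied by a projection matrix so they have the LLM's embedding width, then placed in front of the text token embeddings. The dimensions are tiny toy values and the function names are hypothetical; a real system uses learned weights and much larger dimensions.

```python
def project(features, weight):
    """Matrix product: features (n x d_in) times weight (d_in x d_out)."""
    d_in, d_out = len(weight), len(weight[0])
    return [[sum(f[k] * weight[k][j] for k in range(d_in)) for j in range(d_out)]
            for f in features]


def build_llm_input(visual_features, weight, text_embeddings):
    """Project visual features into the LLM space and prepend them to the text tokens."""
    visual_tokens = project(visual_features, weight)
    return visual_tokens + text_embeddings


visual_features = [[1.0, 2.0], [0.5, 0.5], [0.0, 1.0]]    # 3 visual vectors, d_vision = 2
weight = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]     # d_vision = 2 -> d_llm = 4
text_embeddings = [[0.1] * 4 for _ in range(5)]           # 5 text tokens, d_llm = 4

sequence = build_llm_input(visual_features, weight, text_embeddings)
print(len(sequence), len(sequence[0]))  # 8 tokens, each of width 4
```

From the LLM's point of view, the projected visual vectors are just extra tokens in its input sequence.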

---

## **16.5 Video Understanding**

#### **16.5.1 3D Convolutions**

Extend 2D conv to temporal dimension: $(C_{out}, C_{in}, k_h, k_w, k_t)$

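**I3D (Inflated 3D CNN):** Inflate 2D ImageNet pre-trained filters to 3D by repeating them across time and dividing by the temporal kernel size, so that a static (repeated-frame) video reproduces the 2D activations.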
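
A toy sketch of filter inflation: the 2D kernel is copied along the time axis and rescaled by the temporal size, so a clip of identical frames produces the same response as the original 2D filter on a single frame. The function names are hypothetical, and only a single kernel position is evaluated.

```python
def inflate_kernel(kernel_2d, k_t):
    """Repeat a 2D kernel k_t times along time and divide each weight by k_t."""
    return [[[v / k_t for v in row] for row in kernel_2d] for _ in range(k_t)]


def response_2d(patch, kernel_2d):
    """Response of a 2D kernel at one location (elementwise product and sum)."""
    return sum(p * k for p_row, k_row in zip(patch, kernel_2d)
               for p, k in zip(p_row, k_row))


def response_3d(clip, kernel_3d):
    """Response of a 3D kernel at one location: sum of per-frame 2D responses."""
    return sum(response_2d(frame, k2d) for frame, k2d in zip(clip, kernel_3d))


patch = [[1.0, 2.0], [3.0, 4.0]]
kernel = [[1.0, 0.0], [0.0, -1.0]]
static_clip = [patch] * 3  # the same frame repeated three times

print(response_2d(patch, kernel))
print(response_3d(static_clip, inflate_kernel(kernel, 3)))
```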

#### **16.5.2 Video Transformers**

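**TimeSformer:** Divided space-time attention. Each block applies temporal attention (a patch attends to the same spatial location across the $T$ frames) and spatial attention (a patch attends to the $HW$ patches of its own frame) as separate steps, reducing the attention cost from $O((HWT)^2)$ for joint space-time attention to $O(HWT \cdot (HW + T))$.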
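
Counting query-key pairs directly makes the saving concrete. The sketch assumes that under joint attention every one of the $HW \cdot T$ tokens attends to all tokens, while under divided attention each token attends to $T$ tokens in time plus $HW$ tokens in space; the 14×14 grid and 8 frames are illustrative values and the function names are hypothetical.

```python
def joint_attention_pairs(n_spatial, n_frames):
    """Every token attends to every token in the clip."""
    n = n_spatial * n_frames
    return n * n


def divided_attention_pairs(n_spatial, n_frames):
    """Each token attends over time (n_frames) and then over space (n_spatial)."""
    n = n_spatial * n_frames
    return n * (n_spatial + n_frames)


hw, t = 14 * 14, 8
joint = joint_attention_pairs(hw, t)
divided = divided_attention_pairs(hw, t)
print(joint, divided, round(joint / divided, 2))
```

The gap widens as clips get longer or higher resolution, since the joint cost grows with the square of the total token count.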

**VideoMAE:** Masked autoencoding for video (tube masking).
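
Tube masking can be sketched as sampling one spatial mask and reusing it for every frame, so a masked patch cannot be trivially copied from a neighboring frame. The 0.9 masking ratio is an assumed illustrative value, and `tube_mask` is a hypothetical helper.

```python
import random


def tube_mask(n_frames, n_patches, mask_ratio=0.9, seed=0):
    """Return an n_frames x n_patches mask (1 = masked) that is identical in every frame."""
    rng = random.Random(seed)
    n_masked = int(n_patches * mask_ratio)
    masked = set(rng.sample(range(n_patches), n_masked))
    frame_mask = [1 if i in masked else 0 for i in range(n_patches)]
    # Copy the same spatial mask along time, forming "tubes" through the clip
    return [list(frame_mask) for _ in range(n_frames)]


mask = tube_mask(n_frames=8, n_patches=196)
print(sum(mask[0]), "of", len(mask[0]), "patches masked in every frame")
```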

---

## **16.6 Workbook Labs**

### **Lab 1: Vision Transformer from Scratch**
Implement ViT-Tiny on CIFAR-10:
1. Patch embedding (4x4 patches for 32x32 images)
2. Custom Transformer encoder (no PyTorch nn.Transformer)
3. Train from scratch (no pretraining)
4. Compare to ResNet-18 accuracy

**Deliverable:** ViT achieving >80% on CIFAR-10.

### **Lab 2: Contrastive Pretraining**
SimCLR pretraining on unlabeled subset of ImageNet:
1. Strong augmentation pipeline
2. ResNet-50 encoder + projection head
3. Train with NT-Xent loss
4. Evaluate: Linear probing (freeze encoder, train linear classifier) vs supervised baseline

**Deliverable:** Pretrained checkpoint with competitive linear probe accuracy.

### **Lab 3: Fine-tuning Stable Diffusion**
Use diffusers library to fine-tune Stable Diffusion on custom concept (e.g., "a photo of [V] dog"):
1. DreamBooth or LoRA fine-tuning
2. Inference pipeline generating new images of the concept
3. Evaluation: CLIP similarity to text prompt, FID score

**Deliverable:** Personalized diffusion model generating consistent concept images.

### **Lab 4: Multimodal RAG**
Build CLIP-based image search:
1. Index 10k images with CLIP embeddings (FAISS)
2. Text-to-image search (natural language queries)
3. Image-to-image search (similarity search)
4. Hybrid: Combine with text metadata
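
Before wiring up FAISS, the retrieval logic can be prototyped with a brute-force cosine-similarity search. The sketch below uses hand-written 2D vectors as stand-ins for CLIP embeddings; the file names, vectors, and function names are all hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def search(query_emb, index, k=3):
    """Return the k item names whose embeddings are most similar to the query."""
    scored = sorted(((cosine(query_emb, emb), name) for name, emb in index.items()),
                    reverse=True)
    return [name for _, name in scored[:k]]


# Toy index: in the lab these would be CLIP image embeddings
index = {
    "cat.jpg": [1.0, 0.0],
    "dog.jpg": [0.8, 0.6],
    "car.jpg": [0.0, 1.0],
}
# The query can be a text embedding (text-to-image) or an image embedding (image-to-image)
print(search([1.0, 0.1], index, k=2))
```

Because text and image embeddings share one space, the same `search` function covers both query types; an approximate nearest-neighbor index only changes how the top-k is found.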

**Deliverable:** Image search engine with web interface.

---

## **16.7 Common Pitfalls**

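1. **ViT on Small Data:** A ViT trained from scratch on ImageNet-1k (1.2M images) without strong augmentation and regularization typically underperforms a comparable ResNet. Use transfer learning from a larger pretraining set, or heavy augmentation (RandAugment, Mixup, CutMix).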

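2. **Diffusion Sampling Steps:** Standard DDPM ancestral sampling is built around the full noise schedule (often 1000 steps); running it with only a handful of steps (<20) gives poor quality. Use a fast sampler such as DDIM or DPM-Solver, which can produce good samples in roughly 10-50 steps.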

3. **GAN Mode Collapse:** Generator produces limited variety. Monitor diversity, use techniques like MiniBatch Discrimination or switch to Diffusion.

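4. **Contrastive Learning Temperature:** Temperature $\tau$ is crucial. Too low (e.g., 0.01) makes the softmax so sharp that the loss is dominated by the few hardest negatives, which can destabilize training; too high (e.g., 1.0) flattens the softmax and weakens the learning signal.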
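
The effect of $\tau$ on how sharply the contrastive softmax concentrates on the most similar sample can be seen on a toy set of cosine similarities. The similarity values and the helper name below are illustrative assumptions.

```python
import math


def softmax_with_temperature(similarities, temperature):
    """Softmax over similarities / temperature, computed stably."""
    scaled = [s / temperature for s in similarities]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


sims = [0.9, 0.5, 0.1]  # cosine similarities of one anchor to three candidates
for tau in (0.01, 0.1, 1.0):
    probs = softmax_with_temperature(sims, tau)
    print(tau, [round(p, 4) for p in probs])
```

At a very low temperature nearly all probability mass lands on the single most similar candidate, while at $\tau = 1.0$ the three candidates are weighted almost equally.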

---

## **16.8 Interview Questions**

**Q1:** Why do Vision Transformers need more data than CNNs to perform well?
*A: CNNs have strong inductive biases built in: locality (local receptive fields), translation equivariance (weight sharing), and hierarchical structure. ViT has no image-specific prior except patch extraction—it must learn spatial relationships from scratch. With limited data, CNNs generalize better due to these biases; with massive data (ImageNet-21k, JFT), ViT's flexibility and global attention allow better scaling.*

**Q2:** Explain the difference between SimCLR (contrastive) and MAE (masked autoencoding) for self-supervised learning.
*A: SimCLR uses contrastive learning: pulls together augmented views of same image, pushes apart different images. Requires large batches or memory banks for negatives. MAE uses reconstruction: masks large portions of input (75%), reconstructs missing pixels via encoder-decoder. No negatives needed, asymmetric design (encoder only on visible patches) makes it efficient. SimCLR learns semantic representations; MAE learns both semantic and low-level details.*

**Q3:** How does Latent Diffusion (Stable Diffusion) differ from pixel-space diffusion?
*A: Pixel-space diffusion (original DDPM) operates on full-resolution RGB images, which is expensive. Latent Diffusion first compresses images to a latent space using a pre-trained VAE (e.g., 64x64x4 instead of 512x512x3, a 48x reduction in elements), applies the diffusion process in this compressed space, then decodes the result with the VAE decoder. Additionally, it is conditioned on text embeddings (CLIP) via cross-attention in the U-Net, enabling text-to-image generation. This is much faster and makes high-resolution generation practical.*

**Q4:** What is the purpose of the projection head in SimCLR, and why is it discarded for downstream tasks?
*A: The projection head (a small MLP) maps the encoder output $h$ to the space $z$ where the contrastive loss is applied. The contrastive objective pushes $z$ to be invariant to the augmentations, so $z$ tends to discard information such as color or orientation that downstream tasks may still need. Because the head absorbs this loss of information, the backbone representation $h$ can retain it. Empirically, representations before the projection head ($h$) perform better on downstream tasks than those after it ($z$), so the head is discarded after pretraining.*

**Q5:** How does CLIP enable zero-shot classification, and what are its limitations?
*A: CLIP jointly trains image and text encoders to align image-text pairs in embedding space. For zero-shot classification, you embed class names as text prompts ("a photo of a [class]"), embed the image, and take the highest cosine similarity. Limitations: 1) Sensitive to prompt engineering ("a photo of" vs "a picture of"), 2) Poor on fine-grained distinctions (breeds of dogs), 3) Biases from internet training data, 4) Struggles with abstract concepts or counting.*

---

## **16.9 Further Reading**

**Papers:**
- "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (Dosovitskiy et al., 2020) - ViT
- "Masked Autoencoders Are Scalable Vision Learners" (He et al., 2021) - MAE
- "Denoising Diffusion Probabilistic Models" (Ho et al., 2020) - DDPM
- "High-Resolution Image Synthesis with Latent Diffusion Models" (Rombach et al., 2021) - Stable Diffusion
- "Learning Transferable Visual Models From Natural Language Supervision" (Radford et al., 2021) - CLIP

---

## **16.10 Checkpoint Project: Multimodal Content Generation Platform**

Build a system that generates images from text descriptions with editing capabilities.

**Requirements:**

1. **Base Model:** Stable Diffusion 1.5 or XL (open source)

2. **Features:**
   - Text-to-image generation with prompt optimization (expand simple prompts to detailed)
   - Image editing: Inpainting (fill masked regions) and Outpainting (extend borders)
   - Style transfer: Combine content from one image, style from another (using CLIP embeddings)
   - Upscaling: 4x super-resolution using Real-ESRGAN

3. **Personalization:**
   - DreamBooth fine-tuning on user-provided 5-10 images of concept
   - LoRA training for efficient style adaptation

4. **Safety:**
   - NSFW content detection (CLIP-based classifier)
   - Watermarking generated images
   - Metadata injection (C2PA standard)

5. **Deployment:**
   - FastAPI backend with async generation queue (Celery/Redis)
   - WebSocket streaming for generation progress
   - Model weights cached in VRAM, LoRA adapters hot-swappable
   - Quantization (INT8) for memory efficiency

**Deliverables:**
- `genai_platform/` with generation, editing, and training modules
- Frontend (React/Gradio) showing prompt input, gallery, editing canvas
- Performance: 512x512 image generation < 5 seconds on A100
- Report: "Platform supports 10 concurrent users with LoRA hot-swapping"

**Success Criteria:**
- Text alignment (CLIP score > 0.3 between prompt and generated image)
- User can train personalized concept in < 10 minutes
- Editing maintains consistency with original image (structural similarity > 0.8)

---

**End of Chapter 16**

*You now master advanced computer vision and multimodal AI. Chapter 17 will cover Reinforcement Learning (RL) — the foundation of RLHF and autonomous decision-making systems.*

---