# Topic 10: Advanced Data Loading & Augmentation

## Learning Objectives

By the end of this notebook, you will:
- Understand **why** efficient data loading is critical for training
- Master PyTorch's `Dataset` and `DataLoader` classes
- Build custom datasets for various data types
- Implement data augmentation strategies
- Optimize data loading with multiprocessing and prefetching
- Handle common data loading challenges
- Connect data loading to model performance

## The Big Picture: Why Data Loading Matters

### The Training Bottleneck

**Training loop**:
```python
for epoch in range(num_epochs):
    for inputs, targets in dataloader:  # ← THIS is often the bottleneck!
        optimizer.zero_grad()           # Clear gradients from the previous step
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
```

**Common problem**: GPU sits idle waiting for data!

```
Bad data loading:
CPU: [Load]     [Load]     [Load]     [Load]
GPU:        [Train] [Idle] [Train] [Idle] [Train]
         ↑ GPU waiting for data!

Good data loading:
CPU: [Load][Load][Load][Load][Load][Load]
GPU: [Train][Train][Train][Train][Train]
         ↑ GPU always busy!
```

**Why this matters**:
- **Cost**: GPU time is expensive (roughly $1-4/hour for a cloud A100)
- **Training time**: Can be 2-10x slower with bad data loading
- **Model quality**: Faster epochs mean more iterations and experiments in the same budget, which usually leads to better models
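
**Measuring the bottleneck**: One way to check whether the loader is the problem is to time how long each iteration waits for a batch versus how long the training step takes. The sketch below uses a toy dataset and model on CPU; `measure_data_wait` and `toy_step` are illustrative names, not PyTorch APIs, and the numbers only mean something on your own hardware.

```python
import time

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset


def measure_data_wait(loader, step_fn, max_batches=20):
    """Return (seconds spent waiting for batches, seconds spent in step_fn)."""
    data_time, step_time = 0.0, 0.0
    batch_iter = iter(loader)
    for _ in range(max_batches):
        t0 = time.perf_counter()
        try:
            batch = next(batch_iter)  # Time spent here is data-loading wait
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)
        # On a GPU, call torch.cuda.synchronize() here before reading the
        # clock, because CUDA kernels run asynchronously.
        t2 = time.perf_counter()
        data_time += t1 - t0
        step_time += t2 - t1
    return data_time, step_time


# Toy stand-ins for a real dataset and model
toy_dataset = TensorDataset(torch.randn(512, 3, 32, 32),
                            torch.randint(0, 10, (512,)))
toy_loader = DataLoader(toy_dataset, batch_size=64)
toy_model = torch.nn.Sequential(torch.nn.Flatten(),
                                torch.nn.Linear(3 * 32 * 32, 10))


def toy_step(batch):
    inputs, targets = batch
    loss = F.cross_entropy(toy_model(inputs), targets)
    loss.backward()


data_s, step_s = measure_data_wait(toy_loader, toy_step)
wait_fraction = data_s / (data_s + step_s)
print(f"Waiting for data: {100 * wait_fraction:.0f}% of loop time")
```

If the waiting share is large, the loader is worth optimizing before touching the model.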

### Real-World Impact

**Example** (illustrative numbers): Training on ImageNet (~1.3M images)
- **Naive loading**: Read and decode from disk in the main process every epoch = ~3 hours per epoch
- **Optimized loading**: Prefetch + cache + augmentation in parallel workers = ~30 min per epoch
- **Savings**: 6x faster = 6x more experiments!

**Why it cannot be skipped**: Efficient data loading is the difference between:
- Research that takes weeks vs days
- Models that train to completion vs OOM errors
- Affordable experimentation vs prohibitive costs

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, random_split
import torchvision
import torchvision.transforms as transforms
from torchvision.transforms import v2  # New transforms API (torchvision 0.15+)

import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
import time
from pathlib import Path

torch.manual_seed(42)
np.random.seed(42)

print(f"PyTorch version: {torch.__version__}")

## PyTorch Data Loading: The Foundation

### Core Components

**1. `Dataset`**: Represents your data
- Abstract class with `__len__` and `__getitem__`
- Defines how to access individual samples

**2. `DataLoader`**: Loads data in batches
- Batching, shuffling, parallel loading
- Handles multiprocessing and prefetching

**3. `Sampler`**: Controls sampling strategy
- Sequential, random, weighted, distributed

**4. `Transforms`**: Data preprocessing and augmentation
- Resize, normalize, augment

### The Data Flow

```
Dataset → Sampler → DataLoader → Model
   ↓         ↓          ↓
  Access   Order    Batching
  data     samples  + parallel
```

### Why This Design?

**Separation of concerns**:
- **Dataset**: What data to load (logic)
- **Sampler**: Which samples to load (order)
- **DataLoader**: How to load (parallelism, batching)

**Benefits**:
- Modular and reusable
- Easy to customize each part
- Efficient parallelization
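
**Seeing the three roles separately**: A minimal sketch that wires a `Dataset`, a `Sampler`, and a `DataLoader` together by hand. The toy tensors and variable names are made up for illustration; targets equal their own indices so the index batches are easy to follow.

```python
import torch
from torch.utils.data import (BatchSampler, DataLoader, SequentialSampler,
                              TensorDataset)

# Dataset: how to access sample i
toy_features = torch.arange(10, dtype=torch.float32).unsqueeze(1)  # (10, 1)
toy_targets = torch.arange(10)                                     # target == index
toy_dataset = TensorDataset(toy_features, toy_targets)

# Sampler: which indices to visit, and in what order
index_sampler = SequentialSampler(toy_dataset)

# BatchSampler: groups the sampler's indices into batches
batch_sampler = BatchSampler(index_sampler, batch_size=4, drop_last=False)
index_batches = list(batch_sampler)
print(index_batches)

# DataLoader: fetches dataset[i] for each index and collates into tensors
toy_loader = DataLoader(toy_dataset, batch_sampler=batch_sampler)
loaded_targets = [targets.tolist() for _, targets in toy_loader]
print(loaded_targets)
```

Passing `shuffle=True` to `DataLoader` swaps in a random sampler for you; the dataset itself does not change.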

## Building Custom Datasets

### The Minimal Dataset

Every dataset needs:
1. `__init__`: Initialize (load file paths, metadata, etc.)
2. `__len__`: Return number of samples
3. `__getitem__`: Return one sample given an index

**Why this interface?**
- `__len__`: DataLoader needs to know how many samples
- `__getitem__`: Allows indexing like `dataset[0]`
- Simple contract enables powerful functionality

In [None]:
class SimpleDataset(Dataset):
    """
    Minimal custom dataset example.
    
    Demonstrates the required interface.
    """
    def __init__(self, data, labels, transform=None):
        """
        Args:
            data: Input data (can be paths, arrays, etc.)
            labels: Corresponding labels
            transform: Optional transform function
        """
        self.data = data
        self.labels = labels
        self.transform = transform
    
    def __len__(self):
        """Return the total number of samples."""
        return len(self.data)
    
    def __getitem__(self, idx):
        """
        Load and return a single sample.
        
        Args:
            idx: Index of the sample to retrieve
        
        Returns:
            (sample, label) tuple
        """
        sample = self.data[idx]
        label = self.labels[idx]
        
        # Apply transform if provided
        if self.transform:
            sample = self.transform(sample)
        
        return sample, label

# Test the dataset
data = torch.randn(100, 3, 32, 32)  # 100 samples, 3 channels, 32x32
labels = torch.randint(0, 10, (100,))  # 10 classes

dataset = SimpleDataset(data, labels)

print(f"Dataset length: {len(dataset)}")
print(f"First sample shape: {dataset[0][0].shape}")
print(f"First label: {dataset[0][1]}")

# Create DataLoader
dataloader = DataLoader(dataset, batch_size=16, shuffle=True)

# Iterate
for batch_idx, (samples, batch_labels) in enumerate(dataloader):
    print(f"Batch {batch_idx}: samples {samples.shape}, labels {batch_labels.shape}")
    if batch_idx == 2:  # Just show first 3 batches
        break

print("\nDataset + DataLoader: ✓")

## Real-World Dataset: Image Classification from Disk

### Common Pattern: Load Images from Directory

**Directory structure**:
```
dataset/
  class_0/
    img1.jpg
    img2.jpg
  class_1/
    img3.jpg
    img4.jpg
```

**Why this pattern?**
- **Memory efficient**: Don't load all images at once
- **Flexible**: Easy to add/remove samples
- **Standard**: Most datasets organized this way

In [None]:
class ImageFolderDataset(Dataset):
    """
    Load images from a directory structure.
    
    Expected structure:
      root/
        class_0/
          img1.jpg
        class_1/
          img2.jpg
    """
    def __init__(self, root_dir, transform=None):
        """
        Args:
            root_dir: Root directory path
            transform: Optional transform to apply
        """
        self.root_dir = Path(root_dir)
        self.transform = transform
        
        # Find all images and create class mapping
        self.samples = []  # List of (image_path, class_idx)
        self.classes = []  # List of class names
        self.class_to_idx = {}  # Map class name to index
        
        # Scan directory
        self._scan_directory()
    
    def _scan_directory(self):
        """Scan directory and build sample list."""
        # Get all class directories
        class_dirs = sorted([d for d in self.root_dir.iterdir() if d.is_dir()])
        
        for class_idx, class_dir in enumerate(class_dirs):
            class_name = class_dir.name
            self.classes.append(class_name)
            self.class_to_idx[class_name] = class_idx
            
            # Find all images in this class
            for img_path in class_dir.glob('*.jpg'):  # Or *.png, *.jpeg, etc.
                self.samples.append((img_path, class_idx))
    
    def __len__(self):
        return len(self.samples)
    
    def __getitem__(self, idx):
        """Load and return a sample."""
        img_path, label = self.samples[idx]
        
        # Load image
        # Why here? Lazy loading - only load when needed
        image = Image.open(img_path).convert('RGB')
        
        # Apply transforms
        if self.transform:
            image = self.transform(image)
        
        return image, label

print("ImageFolderDataset: ✓")
print("\nNote: This is similar to torchvision.datasets.ImageFolder")
print("Building your own helps understand the internals!")

## Data Augmentation: Creating More Data

### Why Augmentation?

**Problem**: Limited training data leads to overfitting

**Solution**: Generate variations of existing data

**Benefits**:
1. **More training data**: "Free" samples from existing ones
2. **Better generalization**: Model sees more variations
3. **Robustness**: Model learns invariances (rotation, brightness, etc.)
4. **Regularization**: Prevents overfitting

**Why it works**:
- Real-world data has natural variations
- Augmentation simulates these variations
- Model learns to be invariant to irrelevant changes

### Common Image Augmentations

**Geometric**:
- Random crop, resize, flip, rotate
- Why? Objects appear at different positions/scales/orientations

**Color**:
- Brightness, contrast, saturation, hue
- Why? Lighting conditions vary

**Modern (advanced)**:
- Cutout, Mixup, CutMix, AutoAugment
- Why? Further improve generalization
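
**Mixup in a few lines**: Mixup blends each sample with a randomly chosen partner from the same batch and mixes the loss with the same coefficient. The sketch below is a simplified version under stated assumptions: `mixup_batch` is a hypothetical helper, `alpha=0.2` is just an example value, and the random `logits` stand in for a real model's output.

```python
import torch
import torch.nn.functional as F


def mixup_batch(inputs, targets, alpha=0.2):
    """Blend each sample with a random partner from the same batch."""
    # One mixing coefficient per batch, drawn from Beta(alpha, alpha)
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(inputs.size(0))  # Partner index for every sample
    mixed = lam * inputs + (1.0 - lam) * inputs[perm]
    return mixed, targets, targets[perm], lam


images = torch.rand(8, 3, 32, 32)          # Fake batch with values in [0, 1]
image_labels = torch.randint(0, 10, (8,))

mixed, targets_a, targets_b, lam = mixup_batch(images, image_labels)

logits = torch.randn(8, 10)                # Placeholder for model(mixed)
loss = (lam * F.cross_entropy(logits, targets_a)
        + (1.0 - lam) * F.cross_entropy(logits, targets_b))
print(f"lambda = {lam:.3f}, loss = {loss.item():.3f}")
```

Unlike the per-image transforms above, Mixup works on whole batches, so it is applied in the training loop (or a collate function) rather than inside `__getitem__`.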

### When to Apply Augmentation?

**Training**: Apply random augmentations
- Each epoch sees different variations
- Creates effectively infinite dataset

**Validation/Test**: NO augmentation (or minimal)
- Need consistent evaluation
- Only apply necessary preprocessing (resize, normalize)

In [None]:
# Define augmentation pipelines

# Training: Heavy augmentation
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # Random crop and resize
    transforms.RandomHorizontalFlip(p=0.5),  # 50% chance of flipping
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    transforms.RandomRotation(degrees=15),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet stats
                        std=[0.229, 0.224, 0.225])
])

# Validation: Minimal preprocessing only
val_transform = transforms.Compose([
    transforms.Resize(256),  # Resize to slightly larger
    transforms.CenterCrop(224),  # Center crop to target size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                        std=[0.229, 0.224, 0.225])
])

print("Data Augmentation Pipelines:")
print("\nTraining pipeline:")
for i, t in enumerate(train_transform.transforms):
    print(f"  {i+1}. {t.__class__.__name__}")

print("\nValidation pipeline:")
for i, t in enumerate(val_transform.transforms):
    print(f"  {i+1}. {t.__class__.__name__}")

print("\nKey differences:")
print("✓ Training: Random crops, flips, color jitter, rotation")
print("✓ Validation: Deterministic resize and center crop only")
print("✓ Both: Normalize with same statistics")
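
**Different transforms after `random_split`**: Subsets produced by `random_split` share the underlying dataset, so they also share its transform. One possible workaround is to build the base dataset without a transform and wrap each subset. In this sketch `TransformedSubset`, `random_flip`, and `identity` are illustrative names; in practice the two functions would be the training and validation pipelines.

```python
import torch
from torch.utils.data import Dataset, TensorDataset, random_split


class TransformedSubset(Dataset):
    """Apply a split-specific transform on top of an untransformed subset."""

    def __init__(self, subset, transform=None):
        self.subset = subset
        self.transform = transform

    def __len__(self):
        return len(self.subset)

    def __getitem__(self, idx):
        sample, label = self.subset[idx]
        if self.transform is not None:
            sample = self.transform(sample)
        return sample, label


def random_flip(img):
    """Stand-in for a training pipeline: flip the width axis half the time."""
    return img.flip(-1) if torch.rand(()) < 0.5 else img


def identity(img):
    """Stand-in for a deterministic validation pipeline."""
    return img


# Base dataset holds raw samples only (no transform attached)
base_dataset = TensorDataset(torch.rand(100, 3, 8, 8),
                             torch.randint(0, 10, (100,)))

# A seeded generator keeps the split reproducible
split_gen = torch.Generator().manual_seed(42)
train_subset, val_subset = random_split(base_dataset, [80, 20],
                                        generator=split_gen)

train_data = TransformedSubset(train_subset, transform=random_flip)
val_data = TransformedSubset(val_subset, transform=identity)
print(len(train_data), len(val_data))
```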

### Visualizing Augmentations

Let's see what augmentations actually do to images.

In [None]:
# Load CIFAR-10 for demonstration
cifar_dataset = torchvision.datasets.CIFAR10(root='./data', train=True, 
                                               download=True, transform=None)

# Get one image
original_img, label = cifar_dataset[0]

# Define augmentation (without normalization for visualization)
augmentation = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),
    transforms.RandomRotation(degrees=20),
])

# Apply augmentation multiple times
fig, axes = plt.subplots(2, 4, figsize=(14, 7))
axes = axes.flatten()

# Show original
axes[0].imshow(original_img)
axes[0].set_title('Original', fontsize=12, fontweight='bold')
axes[0].axis('off')

# Show augmented versions
for i in range(1, 8):
    augmented = augmentation(original_img)
    axes[i].imshow(augmented)
    axes[i].set_title(f'Augmented {i}', fontsize=12)
    axes[i].axis('off')

plt.suptitle('Data Augmentation: Same Image, Different Views', 
             fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("Each augmented version is slightly different!")
print("During training, the model sees new variations every epoch.")
print("This forces the model to learn robust features.")

## Optimizing DataLoader Performance

### Key Parameters

**1. `num_workers`**: Number of subprocesses for data loading
- **Why?** Parallel loading while GPU trains
- **Starting point**: Usually 4-8; tune against your CPU core count
- **Too many**: Extra memory use, CPU contention, and process startup overhead

**2. `batch_size`**: Samples per batch
- **Why?** Trade-off: memory vs convergence
- **Larger**: Faster (better GPU utilization), but more memory
- **Smaller**: More updates, but slower

**3. `shuffle`**: Randomize order
- **Training**: True (prevents overfitting to order)
- **Validation/Test**: False (reproducible evaluation)

**4. `pin_memory`**: Pin memory for GPU transfer
- **Why?** Faster CPU→GPU transfer
- **When**: Usually True when training on a CUDA GPU; it brings no benefit on CPU-only runs

**5. `prefetch_factor`**: Batches to prefetch per worker
- **Why?** Keep GPU fed
- **Default**: 2 when `num_workers > 0` (usually good); only valid with worker processes

**6. `persistent_workers`**: Keep workers alive between epochs
- **Why?** Avoid worker restart overhead
- **When**: True for multi-epoch training with `num_workers > 0` (PyTorch 1.7+)
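
**Keeping the options consistent**: Some of these parameters depend on each other: `persistent_workers=True` requires `num_workers > 0`, and recent PyTorch releases also reject an explicit `prefetch_factor` when there are no workers. A small helper can encode those rules once. `make_loader` below is a hypothetical convenience function, not part of PyTorch, and the defaults are only a starting point.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_loader(dataset, batch_size, train, num_workers=0):
    """Build a DataLoader whose options are valid for the given worker count."""
    kwargs = {
        "batch_size": batch_size,
        "shuffle": train,                         # Shuffle only for training
        "num_workers": num_workers,
        "pin_memory": torch.cuda.is_available(),  # Only useful with a GPU
    }
    if num_workers > 0:
        # These two options are only meaningful with worker processes
        kwargs["persistent_workers"] = True
        kwargs["prefetch_factor"] = 2
    return DataLoader(dataset, **kwargs)


toy_dataset = TensorDataset(torch.randn(64, 3, 32, 32),
                            torch.randint(0, 10, (64,)))

# num_workers=0 keeps this demo single-process; raise it for real training
eval_loader = make_loader(toy_dataset, batch_size=16, train=False)
first_images, first_labels = next(iter(eval_loader))
print(first_images.shape, first_labels.shape)
```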

In [None]:
# Load CIFAR-10
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

dataset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)

# Compare different configurations
configs = [
    {'name': 'Slow (no parallelism)', 'num_workers': 0, 'pin_memory': False},
    {'name': 'Fast (optimized)', 'num_workers': 4, 'pin_memory': torch.cuda.is_available(), 'persistent_workers': True},
]

print("Comparing DataLoader Configurations:\n")
print("="*60)

for config in configs:
    name = config.pop('name')
    
    # Create dataloader
    dataloader = DataLoader(dataset, batch_size=128, shuffle=True, **config)
    
    # Time the first 50 batches (includes worker startup cost)
    start_time = time.perf_counter()
    for batch_idx, (data, target) in enumerate(dataloader):
        # Simulate some processing
        _ = data.mean()
        if batch_idx >= 49:  # Stop after exactly 50 batches
            break
    
    elapsed = time.perf_counter() - start_time
    
    print(f"\n{name}")
    print("-" * 60)
    print(f"  Time for 50 batches: {elapsed:.2f}s")
    print(f"  Configuration: {config}")

print("\n" + "="*60)
print("\nKey takeaway: Parallel workers + pin_memory usually help, but measure on your own hardware!")

### Best Practices Summary

**For training on GPU**:
```python
DataLoader(
    dataset,
    batch_size=64,        # As large as GPU memory allows
    shuffle=True,         # Randomize order
    num_workers=4,        # 4-8 workers (depends on CPU)
    pin_memory=True,      # Faster GPU transfer
    persistent_workers=True,  # Keep workers alive
    prefetch_factor=2,    # Prefetch 2 batches per worker
)
```

**For validation/test**:
```python
DataLoader(
    dataset,
    batch_size=128,       # Can be larger (no backprop)
    shuffle=False,        # Deterministic evaluation
    num_workers=4,
    pin_memory=True,
)
```
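
**Using pinned memory in the loop**: `pin_memory=True` pays off when the host-to-device copy is issued with `non_blocking=True`, so the copy can overlap with other work. A minimal sketch with a toy dataset (the names are illustrative); it falls back to CPU when no GPU is present, where `non_blocking` is simply ignored.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

toy_dataset = TensorDataset(torch.randn(96, 3, 32, 32),
                            torch.randint(0, 10, (96,)))
toy_loader = DataLoader(toy_dataset, batch_size=32,
                        pin_memory=(device.type == "cuda"))

num_batches_seen = 0
for images, labels in toy_loader:
    # With pinned host memory these copies can run asynchronously on CUDA
    images = images.to(device, non_blocking=True)
    labels = labels.to(device, non_blocking=True)
    num_batches_seen += 1

print(f"Moved {num_batches_seen} batches to {device}")
```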

## Advanced: Weighted Sampling for Imbalanced Data

### The Problem

**Imbalanced dataset**:
```
Class 0: 900 samples
Class 1: 100 samples
```

**Issue**: Model sees class 0 much more → biased predictions

### Solutions

**1. Weighted Loss**:
```python
weights = [0.1, 0.9]  # Higher weight for rare class
criterion = nn.CrossEntropyLoss(weight=torch.tensor(weights))
```

**2. Weighted Sampling**: Sample classes equally
```python
# Takes one weight per sample (not per class)
sampler = WeightedRandomSampler(sample_weights, num_samples=len(sample_weights))
```

**Why sampling?**
- Model sees balanced batches
- Learns all classes equally
- No need to modify loss function

In [None]:
from torch.utils.data import WeightedRandomSampler

# Create imbalanced dataset
class_counts = [900, 100]  # Class 0 has 900, class 1 has 100
labels = [0] * 900 + [1] * 100
data = torch.randn(1000, 3, 32, 32)

imbalanced_dataset = SimpleDataset(data, torch.tensor(labels))

# Compute sample weights (inverse of class frequency)
class_sample_counts = torch.tensor(class_counts)
class_weights = 1.0 / class_sample_counts

# Assign weight to each sample based on its class (vectorized lookup)
sample_weights = class_weights[torch.tensor(labels)]

# Create sampler
sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(sample_weights),
    replacement=True  # Sample with replacement
)

# Create dataloader with sampler
balanced_loader = DataLoader(imbalanced_dataset, batch_size=64, sampler=sampler)

# Check class distribution in batches
class_counts_in_batches = [0, 0]
num_batches = 10

for batch_idx, (_, labels_batch) in enumerate(balanced_loader):
    for label in labels_batch:
        class_counts_in_batches[label.item()] += 1
    
    if batch_idx >= num_batches - 1:
        break

print("Imbalanced Dataset:")
print(f"  Class 0: {class_counts[0]} samples")
print(f"  Class 1: {class_counts[1]} samples")
print(f"  Ratio: {class_counts[0]/class_counts[1]:.1f}:1\n")

print(f"After Weighted Sampling (first {num_batches} batches):")
print(f"  Class 0: {class_counts_in_batches[0]} samples")
print(f"  Class 1: {class_counts_in_batches[1]} samples")
print(f"  Ratio: {class_counts_in_batches[0]/class_counts_in_batches[1]:.1f}:1")

print("\n✓ Classes are now balanced in training batches!")
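
**The weighted-loss alternative**: For comparison, the loss weights can be derived from the label counts instead of being typed in by hand. This sketch uses the common "balanced" heuristic, `n_samples / (n_classes * count)`; that choice is an assumption, and other weighting schemes are equally valid. The random `fake_logits` stand in for model output.

```python
import torch
import torch.nn as nn

all_labels = torch.tensor([0] * 900 + [1] * 100)

# Count samples per class: tensor([900., 100.])
counts = torch.bincount(all_labels, minlength=2).float()

# Balanced heuristic: rare classes get proportionally larger weights
loss_weights = counts.sum() / (len(counts) * counts)
print(loss_weights)

criterion = nn.CrossEntropyLoss(weight=loss_weights)

fake_logits = torch.randn(16, 2)            # Placeholder for model output
fake_targets = torch.randint(0, 2, (16,))
weighted_loss = criterion(fake_logits, fake_targets)
print(f"Weighted loss: {weighted_loss.item():.3f}")
```

Weighted loss keeps every sample in each epoch, while weighted sampling changes what the model sees; applying both at full strength would correct for the imbalance twice.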

## Mini Exercises

### Exercise 1: Build a Text Dataset

Create a dataset for text classification from a CSV file.

In [None]:
# YOUR CODE HERE


# SOLUTION
def show_solution_1():
    import pandas as pd
    
    class TextDataset(Dataset):
        """
        Text classification dataset from CSV.
        
        CSV format: text, label
        """
        def __init__(self, csv_file, tokenizer=None, max_len=128):
            """
            Args:
                csv_file: Path to CSV file
                tokenizer: Function to tokenize text
                max_len: Maximum sequence length
            """
            self.df = pd.read_csv(csv_file)
            self.tokenizer = tokenizer
            self.max_len = max_len
        
        def __len__(self):
            return len(self.df)
        
        def __getitem__(self, idx):
            text = self.df.iloc[idx]['text']
            label = self.df.iloc[idx]['label']
            
            # Tokenize if tokenizer provided
            if self.tokenizer:
                tokens = self.tokenizer(text)
                # Pad/truncate to max_len
                if len(tokens) < self.max_len:
                    tokens = tokens + [0] * (self.max_len - len(tokens))
                else:
                    tokens = tokens[:self.max_len]
                return torch.tensor(tokens), label
            
            return text, label
    
    print("TextDataset implementation:")
    print("\nKey features:")
    print("✓ Reads from CSV file")
    print("✓ Optional tokenization")
    print("✓ Padding/truncation to fixed length")
    print("\nUsage:")
    print("  dataset = TextDataset('data.csv', tokenizer=my_tokenizer)")
    print("  loader = DataLoader(dataset, batch_size=32, shuffle=True)")

# Uncomment to see solution:
# show_solution_1()
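
**Alternative: pad per batch instead of per dataset**: The solution above pads every sample to `max_len`. Another option is to return unpadded token lists from `__getitem__` and pad each batch only to its own longest sequence with a custom `collate_fn`. The sketch assumes samples are `(token_id_list, label)` pairs and that `0` is the padding id; `pad_collate` is a hypothetical name.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader


def pad_collate(batch, pad_id=0):
    """Pad variable-length token lists to the longest sequence in the batch."""
    token_lists, labels = zip(*batch)
    sequences = [torch.tensor(tokens, dtype=torch.long) for tokens in token_lists]
    lengths = torch.tensor([len(seq) for seq in sequences])
    padded = pad_sequence(sequences, batch_first=True, padding_value=pad_id)
    return padded, lengths, torch.tensor(labels)


# A plain list works as a map-style dataset (it has __len__ and __getitem__)
toy_samples = [
    ([5, 2, 9], 1),
    ([7], 0),
    ([3, 3, 8, 1, 4], 1),
    ([6, 2], 0),
]

text_loader = DataLoader(toy_samples, batch_size=2, collate_fn=pad_collate)
text_batches = list(text_loader)

for padded, lengths, batch_labels in text_batches:
    print(padded.tolist(), lengths.tolist(), batch_labels.tolist())
```

Returning the true lengths lets the model mask out padding positions later.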

### Exercise 2: Implement Custom Augmentation

Create a custom transformation that adds Gaussian noise to images.

In [None]:
# YOUR CODE HERE


# SOLUTION
def show_solution_2():
    class AddGaussianNoise(object):
        """
        Add Gaussian noise to tensor.
        
        Why? Simulates sensor noise, improves robustness.
        """
        def __init__(self, mean=0.0, std=0.1):
            self.mean = mean
            self.std = std
        
        def __call__(self, tensor):
            """
            Args:
                tensor: Input tensor (C, H, W)
            
            Returns:
                Tensor with added noise
            """
            # randn_like matches the input's shape, dtype, and device
            noise = torch.randn_like(tensor) * self.std + self.mean
            noisy_tensor = tensor + noise
            # Clamp to valid range [0, 1]
            return torch.clamp(noisy_tensor, 0.0, 1.0)
        
        def __repr__(self):
            return f"{self.__class__.__name__}(mean={self.mean}, std={self.std})"
    
    # Test the augmentation
    transform = transforms.Compose([
        transforms.ToTensor(),
        AddGaussianNoise(mean=0.0, std=0.05)
    ])
    
    # Load sample image
    dataset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                            download=True, transform=None)
    img, _ = dataset[0]
    
    # Apply transformation
    img_original = transforms.ToTensor()(img)
    img_noisy = transform(img)
    
    # Visualize
    fig, axes = plt.subplots(1, 2, figsize=(10, 5))
    axes[0].imshow(img_original.permute(1, 2, 0))
    axes[0].set_title('Original')
    axes[0].axis('off')
    
    axes[1].imshow(img_noisy.permute(1, 2, 0))
    axes[1].set_title('With Gaussian Noise')
    axes[1].axis('off')
    
    plt.tight_layout()
    plt.show()
    
    print("Custom augmentation: AddGaussianNoise")
    print("\nWhy add noise?")
    print("- Simulates real-world sensor noise")
    print("- Improves model robustness")
    print("- Acts as regularization")

# Uncomment to see solution:
# show_solution_2()

### Exercise 3: Implement Data Caching

Create a dataset wrapper that caches loaded samples in memory for faster access.

In [None]:
# YOUR CODE HERE


# SOLUTION
def show_solution_3():
    class CachedDataset(Dataset):
        """
        Wrapper that caches dataset samples in memory.
        
        Trade-off: Speed vs Memory
        - Faster: No repeated disk I/O
        - Cost: Stores all samples in RAM
        
        Use when: Dataset fits in memory and I/O is bottleneck
        """
        def __init__(self, dataset, cache_size=None):
            """
            Args:
                dataset: Underlying dataset to cache
                cache_size: Max samples to cache (None = cache all)
            """
            self.dataset = dataset
            self.cache = {}
            # Explicit None check so that cache_size=0 disables caching
            self.cache_size = len(dataset) if cache_size is None else cache_size
        
        def __len__(self):
            return len(self.dataset)
        
        def __getitem__(self, idx):
            # Check cache first
            if idx in self.cache:
                return self.cache[idx]
            
            # Load from dataset
            sample = self.dataset[idx]
            
            # Cache if space available
            if len(self.cache) < self.cache_size:
                self.cache[idx] = sample
            
            return sample
    
    # Test caching
    # Create base dataset
    transform = transforms.Compose([transforms.ToTensor()])
    base_dataset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                                  download=True, transform=transform)
    
    # Create cached version
    cached_dataset = CachedDataset(base_dataset, cache_size=1000)
    
    # Measure time for first access (cache miss)
    start = time.time()
    for i in range(100):
        _ = cached_dataset[i]
    first_access = time.time() - start
    
    # Measure time for second access (cache hit)
    start = time.time()
    for i in range(100):
        _ = cached_dataset[i]
    second_access = time.time() - start
    
    print("CachedDataset Performance:")
    print(f"  First access (cache miss): {first_access:.3f}s")
    print(f"  Second access (cache hit): {second_access:.3f}s")
    print(f"  Speedup: {first_access / max(second_access, 1e-9):.1f}x\n")
    
    print("When to use caching:")
    print("✓ Dataset fits in RAM")
    print("✓ Disk I/O is bottleneck")
    print("✓ Multiple epochs over same data")
    print("\nWhen NOT to use:")
    print("✗ Large datasets (won't fit in RAM)")
    print("✗ Random augmentation inside the cached dataset (cached samples never change)")

# Uncomment to see solution:
# show_solution_3()
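
**Caching together with random augmentation**: If the wrapped dataset already applies random transforms, a cache stores one fixed augmented copy per sample. One way around this is to cache the raw sample and apply the random transform after the lookup. In this sketch `CacheThenAugment`, `CountingDataset`, and `add_noise` are illustrative names; note that with `num_workers > 0` each worker process would keep its own copy of the cache.

```python
import torch
from torch.utils.data import Dataset


class CountingDataset(Dataset):
    """Toy dataset that counts how often a sample is actually loaded."""

    def __init__(self, data, labels):
        self.data, self.labels = data, labels
        self.loads = 0

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        self.loads += 1  # Stands in for an expensive disk read
        return self.data[idx], self.labels[idx]


class CacheThenAugment(Dataset):
    """Cache raw samples, then apply a fresh random transform on each access."""

    def __init__(self, raw_dataset, transform=None):
        self.raw_dataset = raw_dataset
        self.transform = transform
        self.cache = {}

    def __len__(self):
        return len(self.raw_dataset)

    def __getitem__(self, idx):
        if idx not in self.cache:
            self.cache[idx] = self.raw_dataset[idx]  # Load once
        sample, label = self.cache[idx]
        if self.transform is not None:
            sample = self.transform(sample)  # Must not modify sample in place
        return sample, label


def add_noise(img):
    return img + 0.1 * torch.randn_like(img)


raw = CountingDataset(torch.rand(10, 3, 8, 8), torch.randint(0, 10, (10,)))
augmented = CacheThenAugment(raw, transform=add_noise)

first_view, _ = augmented[0]
second_view, _ = augmented[0]
print(f"Loads from source: {raw.loads}")
print(f"Views identical: {torch.equal(first_view, second_view)}")
```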

## Key Takeaways

### Core Concepts

**1. Why data loading matters**:
- Often the training bottleneck
- Poor data loading = GPU sits idle = wasted money
- Good data loading = 2-10x faster training

**2. PyTorch data loading pipeline**:
- **Dataset**: Defines data access (`__len__`, `__getitem__`)
- **DataLoader**: Batching, shuffling, parallel loading
- **Transforms**: Preprocessing and augmentation
- **Sampler**: Controls sampling strategy

**3. Data augmentation**:
- Creates variations of existing data
- Improves generalization and robustness
- **Training**: Heavy augmentation
- **Validation/Test**: Minimal preprocessing only

**4. Optimization techniques**:
- **`num_workers`**: Parallel loading (4-8 workers)
- **`pin_memory`**: Faster CPU→GPU transfer (enable when training on GPU)
- **`persistent_workers`**: Avoid restart overhead
- **Caching**: Trade memory for speed
- **Weighted sampling**: Handle imbalanced data

**5. Best practices**:
- Lazy loading (load in `__getitem__`, not `__init__`)
- Separate train/val transforms
- Use multiple workers for large datasets
- Monitor GPU utilization (should be near 100%)
- Profile to find bottlenecks

### Common Patterns

**Image classification**:
```python
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(...),
    transforms.ToTensor(),
    transforms.Normalize(...)
])
```

**Efficient DataLoader**:
```python
DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=4,
    pin_memory=True,
    persistent_workers=True
)
```

### What's Next?

You've mastered data loading! This completes the intermediate section.

**You now understand**:
- CNNs and their evolution
- Attention mechanisms (the foundation of transformers)
- Positional encodings (injecting sequence information)
- Complete transformer architecture
- Efficient data loading and augmentation

**Ready for advanced topics**:
- Modern transformer variants (Flash Attention, GQA)
- Training large models at scale
- Building production-ready systems

Congratulations on completing the intermediate section!

## Further Reading

### PyTorch Documentation
1. **Data Loading Tutorial**: https://pytorch.org/tutorials/beginner/data_loading_tutorial.html
2. **torch.utils.data API**: https://pytorch.org/docs/stable/data.html

### Data Augmentation
3. **AutoAugment** (Cubuk et al., 2019): Learned augmentation policies
4. **RandAugment** (Cubuk et al., 2020): Simplified AutoAugment
5. **Mixup** (Zhang et al., 2017): Mix samples for training
6. **CutMix** (Yun et al., 2019): Cut and paste regions

### Advanced Techniques
7. **FFCV** (Fast Forward Computer Vision): Ultra-fast data loading
8. **WebDataset**: Efficient loading from cloud storage
9. **NVIDIA DALI**: GPU-accelerated data loading

### Tools
- **albumentations**: Fast augmentation library
- **torchvision.transforms.v2**: New transforms API
- **kornia**: Differentiable augmentation