# Controller-Based Ghost Clipping: Memory-Efficient DP Training

This tutorial demonstrates how to use **Ghost Clipping** with the controller-based privacy engine. Ghost Clipping provides significant memory savings by computing per-sample gradient *norms* directly without materializing full per-sample gradients.
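
To make "norms without full gradients" concrete, here is a minimal sketch for a single linear layer with non-sequential inputs. It is an illustration in plain PyTorch, not Opacus's implementation: the variable names are made up for this example, and the bias term is left out. For one sample, the weight gradient is the outer product of the output gradient and the input activation, so its Frobenius norm is the product of two vector norms.

```python
import torch

torch.manual_seed(0)
batch_size, d_in, d_out = 4, 5, 3

# Inputs to a linear layer and the gradients of the loss w.r.t. its outputs,
# one row per sample. Both are available during an ordinary backward pass.
activations = torch.randn(batch_size, d_in)
backprops = torch.randn(batch_size, d_out)

# Explicit route: materialize every per-sample weight gradient,
# a tensor of shape (batch_size, d_out, d_in), then take its norm.
per_sample_grads = torch.einsum("bo,bi->boi", backprops, activations)
explicit_norms = per_sample_grads.flatten(start_dim=1).norm(dim=1)

# Ghost route: each per-sample gradient is an outer product, so its
# Frobenius norm factorizes into a product of two vector norms.
ghost_norms = backprops.norm(dim=1) * activations.norm(dim=1)

assert torch.allclose(explicit_norms, ghost_norms, atol=1e-5)
```

The explicit route allocates `batch_size * d_out * d_in` values, while the ghost route only needs two vectors of length `batch_size`.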

## Why Ghost Clipping + Controller?

Combining these two approaches gives you:
- **Memory Efficiency**: Ghost clipping avoids storing full per-sample gradients
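- **No Model Wrapping**: The controller-based approach attaches hooks to your model instead of wrapping it, so the model keeps its original type
- **Transformer Compatibility**: Works well with large models such as BERT and GPT
- **Larger Batch Sizes**: The reduced memory footprint lets you fit larger batches, which can offset the cost of the extra backward pass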

## Setup

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from opacus.privacy_engine_gsc import PrivacyEngineGradSampleController
from opacus.utils.fast_gradient_clipping_utils import DPLossFastGradientClipping
import warnings
warnings.simplefilter("ignore")

## Create Dataset and Model

In [None]:
# Create a synthetic dataset
n_samples = 1000
n_features = 50
n_classes = 10

X = torch.randn(n_samples, n_features)
y = torch.randint(0, n_classes, (n_samples,))

dataset = TensorDataset(X, y)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

In [None]:
class DeepClassifier(nn.Module):
    """Deeper model to demonstrate memory savings"""
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, hidden_dim)
        self.fc4 = nn.Linear(hidden_dim, output_dim)
        self.relu = nn.ReLU()
        self.layer_norm = nn.LayerNorm(hidden_dim)
    
    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.layer_norm(self.relu(self.fc2(x)))
        x = self.relu(self.fc3(x))
        x = self.fc4(x)
        return x

model = DeepClassifier(n_features, 256, n_classes)
print(f"Model has {sum(p.numel() for p in model.parameters())} parameters")

## Standard Controller-Based Training (without Ghost Clipping)

First, let's see the standard approach:

In [None]:
# Standard controller-based approach
model_standard = DeepClassifier(n_features, 256, n_classes)
optimizer_standard = optim.Adam(model_standard.parameters(), lr=0.001)

privacy_engine_standard = PrivacyEngineGradSampleController()

model_standard, optimizer_standard, dataloader_standard = privacy_engine_standard.make_private(
    module=model_standard,
    optimizer=optimizer_standard,
    data_loader=dataloader,
    noise_multiplier=1.0,
    max_grad_norm=1.0,
    grad_sample_mode="hooks",  # Standard mode
)

print("Standard controller-based approach configured")

## Ghost Clipping with Controller

Now let's use ghost clipping for memory efficiency:

In [None]:
# Ghost clipping approach
model_ghost = DeepClassifier(n_features, 256, n_classes)
optimizer_ghost = optim.Adam(model_ghost.parameters(), lr=0.001)

privacy_engine_ghost = PrivacyEngineGradSampleController()

# Use grad_sample_mode="ghost" for ghost clipping
controller, optimizer_ghost, dataloader_ghost = privacy_engine_ghost.make_private(
    module=model_ghost,
    optimizer=optimizer_ghost,
    data_loader=dataloader,
    noise_multiplier=1.0,
    max_grad_norm=1.0,
    grad_sample_mode="ghost",  # Ghost clipping mode!
    return_controller=True,  # Get controller for loss wrapper
)

print("Ghost clipping configured")
print(f"Model type preserved: {isinstance(model_ghost, DeepClassifier)}")

## Training with Ghost Clipping

Ghost clipping requires a loss wrapper, `DPLossFastGradientClipping`. Calling `backward()` on the loss it returns runs two backward passes:

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_ghost = model_ghost.to(device)

# Create loss wrapper for ghost clipping
criterion = nn.CrossEntropyLoss(reduction="mean")
loss_fn = DPLossFastGradientClipping(
    module=controller,
    optimizer=optimizer_ghost,
    criterion=criterion,
    loss_reduction="mean",
)

EPOCHS = 3
DELTA = 1e-5

for epoch in range(EPOCHS):
    model_ghost.train()
    total_loss = 0
    
    for batch_idx, (data, target) in enumerate(dataloader_ghost):
        data, target = data.to(device), target.to(device)
        
        optimizer_ghost.zero_grad()
        output = model_ghost(data)
        
        # The wrapper returns a loss object whose backward() runs both passes
        loss = loss_fn(output, target)
        loss.backward()  # Pass 1: per-sample norms; pass 2: clipped gradients
        
        optimizer_ghost.step()
        
        total_loss += loss.item()
    
    epsilon = privacy_engine_ghost.get_epsilon(DELTA)
    avg_loss = total_loss / len(dataloader_ghost)
    print(f"Epoch {epoch + 1}/{EPOCHS} | Loss: {avg_loss:.4f} | ε: {epsilon:.2f} (δ={DELTA})")

## How Ghost Clipping Works

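Ghost clipping performs **two backward passes**, with a clipping step in between:

1. **First Pass**: Computes per-sample gradient norms (without storing full gradients)
2. **Compute Clipping Coefficients**: `coeff = min(1, max_grad_norm / norm)`
3. **Second Pass**: Backprop with the per-sample losses scaled by the clipping coefficients

Because the second pass yields the sum of clipped gradients directly, the per-sample gradient tensor (batch size × number of parameters) is never stored. That is where the memory savings come from.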
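
The steps above can be sketched in plain PyTorch on a tiny model. This is an illustration only, not how Opacus implements it: step 1 here obtains the norms by materializing per-sample gradients, which is exactly what ghost clipping avoids, and it uses a summed loss rather than the `"mean"` reduction used in this tutorial. The point is to check that backpropagating the coefficient-weighted loss gives the same result as clipping each per-sample gradient and summing.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(5, 3)
params = list(model.parameters())
x, y = torch.randn(4, 5), torch.randint(0, 3, (4,))
max_grad_norm = 1.0

# One loss value per sample, so each sample can be weighted separately.
per_sample_loss = nn.functional.cross_entropy(model(x), y, reduction="none")

# Step 1 (stand-in): per-sample gradient norms. This sketch materializes the
# per-sample gradients to get them; ghost clipping computes the same norms
# without doing so.
per_sample_grads = [
    torch.autograd.grad(loss_i, params, retain_graph=True)
    for loss_i in per_sample_loss
]
norms = torch.stack(
    [torch.sqrt(sum(g.pow(2).sum() for g in grads)) for grads in per_sample_grads]
)

# Step 2: clipping coefficients, coeff = min(1, max_grad_norm / norm).
coeff = torch.clamp(max_grad_norm / norms, max=1.0)

# Step 3: a single backward pass on the coefficient-weighted loss.
(coeff * per_sample_loss).sum().backward()

# Reference: clip every per-sample gradient explicitly and sum the results.
for j, p in enumerate(params):
    reference = sum(c * grads[j] for c, grads in zip(coeff, per_sample_grads))
    assert torch.allclose(p.grad, reference, atol=1e-6)
```

The coefficients are constants with respect to the parameters, so the weighted backward pass distributes them over the per-sample gradients by linearity.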

## Memory Comparison

Let's measure the peak GPU memory of a single batch under each approach:

In [None]:
import torch.cuda as cuda

if torch.cuda.is_available():
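    # Move the standard model to the GPU (only model_ghost was moved earlier)
    model_standard = model_standard.to(device)
    optimizer_standard.zero_grad()
    
    # Reset memory stats
    cuda.reset_peak_memory_stats()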
    
    # Run one batch with standard approach
    data, target = next(iter(dataloader_standard))
    data, target = data.to(device), target.to(device)
    output = model_standard(data)
    loss = nn.functional.cross_entropy(output, target)
    loss.backward()
    
    standard_memory = cuda.max_memory_allocated() / 1024**2  # MB
    print(f"Standard approach peak memory: {standard_memory:.2f} MB")
    
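    # Release the standard run's graph and per-sample gradients so they do not
    # count toward the ghost clipping peak, then reset the stats
    del output, loss
    optimizer_standard.zero_grad()
    optimizer_ghost.zero_grad()
    cuda.reset_peak_memory_stats()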
    
    # Run one batch with ghost clipping
    data, target = next(iter(dataloader_ghost))
    data, target = data.to(device), target.to(device)
    output = model_ghost(data)
    loss = loss_fn(output, target)
    loss.backward()
    
    ghost_memory = cuda.max_memory_allocated() / 1024**2  # MB
    print(f"Ghost clipping peak memory: {ghost_memory:.2f} MB")
    print(f"Memory savings: {(1 - ghost_memory/standard_memory)*100:.1f}%")
else:
    print("GPU not available for memory comparison")

## When to Use Ghost Clipping?

### Use Ghost Clipping When:
- Training large models (transformers, ResNets, etc.)
- Memory is a bottleneck
- You want to use larger batch sizes
- Working with Linear and LayerNorm layers (optimized support)

### Use Standard Mode When:
- Model is small
- Memory is not a concern
- Model has many custom layers without ghost clipping support
- You need per-sample gradients for analysis
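
A rough way to judge whether memory is a concern is to estimate the size of the per-sample gradient buffer that standard mode keeps: batch size × trainable parameters × bytes per element. The sketch below does that arithmetic. The helper name `per_sample_grad_bytes` is hypothetical, the `nn.Sequential` model is a stand-in with the same layer sizes as `DeepClassifier`, and float32 storage is assumed. The estimate covers only the per-sample gradients, not activations or optimizer state.

```python
import torch.nn as nn


def per_sample_grad_bytes(model: nn.Module, batch_size: int, bytes_per_element: int = 4) -> int:
    """Estimate the memory needed to hold one gradient copy per sample."""
    n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return batch_size * n_params * bytes_per_element


# Stand-in with the same layer sizes as DeepClassifier(50, 256, 10).
model = nn.Sequential(
    nn.Linear(50, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(), nn.LayerNorm(256),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

estimate = per_sample_grad_bytes(model, batch_size=32)
print(f"Per-sample gradients: {estimate / 1024**2:.2f} MiB")
```

For this tutorial's model at a nominal batch size of 32 the estimate is about 18 MiB, so standard mode is fine. The same formula applied to a 100-million-parameter model at the same batch size gives about 12.8 GB, which is where ghost clipping starts to matter.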

## Ghost Clipping with Target Epsilon

You can also use `make_private_with_epsilon`:

In [None]:
model_epsilon = DeepClassifier(n_features, 256, n_classes)
optimizer_epsilon = optim.Adam(model_epsilon.parameters(), lr=0.001)
dataloader_epsilon = DataLoader(dataset, batch_size=32, shuffle=True)

privacy_engine_epsilon = PrivacyEngineGradSampleController()

controller_epsilon, optimizer_epsilon, dataloader_epsilon = privacy_engine_epsilon.make_private_with_epsilon(
    module=model_epsilon,
    optimizer=optimizer_epsilon,
    data_loader=dataloader_epsilon,
    target_epsilon=3.0,
    target_delta=1e-5,
    epochs=3,
    max_grad_norm=1.0,
    grad_sample_mode="ghost",
    return_controller=True,
)

print("Target epsilon: 3.0")
print(f"Computed noise multiplier: {optimizer_epsilon.noise_multiplier:.3f}")

## Supported Layers

Ghost clipping is optimized for:
- `nn.Linear`
- `nn.LayerNorm`
- `nn.Embedding`
- And more...

For unsupported layers, it automatically falls back to Fast Gradient Clipping, which computes that layer's full per-sample gradients and then takes their norms.
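
Before choosing a mode, it can help to see which parametrized layers a model actually contains. The sketch below is a hypothetical helper, not part of Opacus. It only checks against the three layer types named in the list above, so a layer landing in the "other" bucket is not necessarily unsupported; it just needs to be checked against the Opacus documentation.

```python
from collections import Counter

import torch.nn as nn

# Assumption: only the layer types explicitly named in this tutorial.
# The real set of layers with ghost clipping support is larger.
NAMED_GHOST_LAYERS = (nn.Linear, nn.LayerNorm, nn.Embedding)


def survey_parametrized_layers(model: nn.Module):
    """Count layers that own parameters, split by whether they are named above."""
    named, other = Counter(), Counter()
    for module in model.modules():
        # Skip containers and activations: keep modules with their own parameters.
        if not list(module.parameters(recurse=False)):
            continue
        bucket = named if isinstance(module, NAMED_GHOST_LAYERS) else other
        bucket[type(module).__name__] += 1
    return named, other


model = nn.Sequential(
    nn.Linear(50, 256), nn.ReLU(), nn.LayerNorm(256), nn.GroupNorm(4, 256)
)
named, other = survey_parametrized_layers(model)
print("Named in this tutorial:", dict(named))
print("Check the docs for:", dict(other))
```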

## Cleanup

Remember to clean up the controller when you are done. This removes the hooks it registered on the model:

In [None]:
controller.cleanup()
print("Hooks cleaned up!")

## Summary

| Feature | Standard Controller | Ghost Clipping Controller |
|---------|--------------------|--------------------------|
| Memory Usage | Higher | **Lower** |
| Speed | Faster (1 backward pass) | Slower (2 backward passes) |
| Model Wrapping | No | **No** |
| Type Preservation | Yes | **Yes** |
| Best For | Small models | **Large models** |
| Batch Size | Smaller | **Larger** |

The combination of the controller-based approach and ghost clipping is well suited to training large transformers with differential privacy, where per-sample gradient memory is usually the limiting factor.

## Learn More

- [Ghost Clipping Paper](https://arxiv.org/abs/2205.09632)
- [Controller-Based Tutorial](controller_based_privacy_engine.ipynb)
- [Opacus Documentation](https://opacus.ai)