# Privacy Techniques for Generative Models

## 1. Differential Privacy with Opacus

Differential Privacy (DP) is a formal framework that bounds how much the inclusion or exclusion of any single training example can change the distribution of a model's outputs. In this example, we use the `Opacus` library with PyTorch to train a model with DP-SGD, a variant of stochastic gradient descent that clips each per-sample gradient and adds Gaussian noise to the clipped sum, which limits what the trained model can reveal about any individual sample.
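The clip-then-add-noise update at the heart of DP-SGD can be sketched in plain Python. This is a minimal illustration, not Opacus's implementation: the function name `dp_sgd_step`, the flat-list representation of weights and gradients, and the parameter defaults are all assumptions made for the example.

```python
import math
import random

def dp_sgd_step(weights, per_sample_grads, lr=0.1, max_grad_norm=1.0,
                noise_multiplier=1.0, rng=None):
    """One illustrative DP-SGD update on a flat list of weights (hypothetical helper)."""
    rng = rng or random.Random()
    summed = [0.0] * len(weights)
    for grad in per_sample_grads:
        # Clip each sample's gradient so its L2 norm is at most max_grad_norm.
        norm = math.sqrt(sum(g * g for g in grad))
        scale = min(1.0, max_grad_norm / (norm + 1e-12))
        for i, g in enumerate(grad):
            summed[i] += g * scale
    # Add Gaussian noise scaled to the clipping bound, then average over the batch.
    sigma = noise_multiplier * max_grad_norm
    batch_size = len(per_sample_grads)
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / batch_size for s in summed]
    # Plain gradient-descent update using the noisy averaged gradient.
    return [w - lr * g for w, g in zip(weights, noisy_mean)]

# Usage: two per-sample gradients; the first has norm 5 and gets clipped to norm 1.
new_weights = dp_sgd_step([0.0, 0.0], [[3.0, 4.0], [0.0, 0.5]], rng=random.Random(0))
```

Clipping bounds the influence any one sample can have on the update, which is what allows the noise level to be calibrated to `max_grad_norm`.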

In [9]:
from opacus import PrivacyEngine
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Use CPU
device = torch.device("cpu")

# Define model
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 10)
).to(device)

# Optimizer and loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# Data
train_loader = DataLoader(
    datasets.MNIST('.', train=True, download=True, transform=transforms.ToTensor()),
    batch_size=64,
    shuffle=True
)

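# Attach the privacy engine. The noise level is calibrated so that training for
# `epochs` epochs reaches (target_epsilon, target_delta)-DP; max_grad_norm is the
# per-sample gradient clipping bound.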
privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private_with_epsilon(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    target_epsilon=10,
    target_delta=1e-5,
    epochs=1,
    max_grad_norm=1.0
)

# Training loop with visible output
for batch_idx, (x, y) in enumerate(train_loader):
    x, y = x.to(device), y.to(device)
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    
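    if batch_idx % 100 == 0:
        print(f"Batch {batch_idx}: Loss = {loss.item():.4f}")

    # Limit for quick demo. Stopping before the full epoch spends less privacy
    # budget than target_epsilon; privacy_engine.get_epsilon(delta=1e-5)
    # reports the value actually spent.
    if batch_idx >= 300:
        break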

Batch 0: Loss = 2.3245
Batch 100: Loss = 2.2106
Batch 200: Loss = 2.0593
Batch 300: Loss = 1.9786


## 2. Federated Learning with Flower

Federated Learning (FL) is a technique that allows multiple devices or institutions to collaboratively train a model without exchanging raw data. Each participant trains the model locally and shares only model updates with a central server. This keeps raw data decentralized, although the shared updates can still leak information unless they are combined with techniques such as differential privacy or secure aggregation. The following code uses the Flower library (imported as `flwr`).
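The server-side aggregation step used by strategies such as federated averaging can be sketched as an average of client parameters weighted by how many examples each client trained on. This is a minimal sketch and not Flower's API: the function name `fed_avg` and the representation of each model as a flat list of floats are assumptions made for illustration.

```python
def fed_avg(client_updates):
    """Weighted average of client parameters (hypothetical helper).

    client_updates: list of (weights, num_examples) pairs, where weights is a
    flat list of floats and all clients share the same model shape.
    """
    total_examples = sum(n for _, n in client_updates)
    dim = len(client_updates[0][0])
    # Each coordinate is averaged across clients, weighted by local dataset size.
    return [
        sum(weights[i] * n for weights, n in client_updates) / total_examples
        for i in range(dim)
    ]

# Usage: the second client holds three times as much data, so it dominates.
global_weights = fed_avg([([1.0, 2.0], 10), ([3.0, 6.0], 30)])
```

Only the parameter lists and example counts reach the server in this scheme; the training examples themselves never leave the clients.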

In [None]:
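### Server Code (run separately)
import flwr as fl

def fit_config(server_round):
    # Sent to every client at the start of each round.
    return {"epochs": 1}

# Flower 1.x API: the number of rounds is set through ServerConfig, and the
# per-round config callback is passed to FedAvg as on_fit_config_fn.
fl.server.start_server(
    config=fl.server.ServerConfig(num_rounds=3),
    strategy=fl.server.strategy.FedAvg(on_fit_config_fn=fit_config),
)


### Client Code


import flwr as fl
import torch, torch.nn as nn, torch.optim as optim
from torchvision import datasets, transforms

model = nn.Sequential(nn.Flatten(), nn.Linear(28*28, 10))

def get_data():
    # For simplicity every client loads the full MNIST training set; in a real
    # deployment each client would load only its own local data.
    train = datasets.MNIST(".", train=True, transform=transforms.ToTensor(), download=True)
    return torch.utils.data.DataLoader(train, batch_size=32, shuffle=True)

class FlowerClient(fl.client.NumPyClient):
    def get_parameters(self, config):
        # Flower 1.x passes a config dict to get_parameters.
        return [p.detach().numpy() for p in model.parameters()]

    def fit(self, parameters, config):
        # Load the global parameters received from the server.
        for p, new_p in zip(model.parameters(), parameters):
            p.data = torch.tensor(new_p)
        loader = get_data()
        optimizer = optim.SGD(model.parameters(), lr=0.01)
        # Train locally for the number of epochs requested by the server.
        for _ in range(int(config.get("epochs", 1))):
            for x, y in loader:
                optimizer.zero_grad()
                loss = nn.CrossEntropyLoss()(model(x), y)
                loss.backward()
                optimizer.step()
        return self.get_parameters(config), len(loader.dataset), {}

    def evaluate(self, parameters, config):
        # Placeholder: reports a dummy loss of 0.0 without evaluating the model.
        return 0.0, len(get_data().dataset), {}

# Newer Flower releases deprecate start_numpy_client in favour of
# fl.client.start_client(..., client=FlowerClient().to_client()).
fl.client.start_numpy_client(server_address="localhost:8080", client=FlowerClient())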


---

## 3. PATE (Private Aggregation of Teacher Ensembles)

PATE is a privacy technique that trains multiple teacher models on disjoint subsets of private data and uses their noisy aggregated votes to label data for a student model. Because noise is added to the vote counts before the winning label is released, PATE can provide differential privacy guarantees for the teachers' training data while still letting the student learn from that data indirectly.

In [10]:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import numpy as np

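# The training split plays the role of private data; the held-out split plays
# the role of unlabeled public data for the student.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Train 5 teachers on disjoint partitions of the private data.
teacher_data = np.array_split(X_train, 5)
teacher_labels = np.array_split(y_train, 5)

teachers = [
    RandomForestClassifier().fit(X_part, y_part)
    for X_part, y_part in zip(teacher_data, teacher_labels)
]

def noisy_vote(x, epsilon=1.0):
    # Count the teachers' votes per class, add Laplace noise with scale
    # 1/epsilon to each count, and release only the noisy winner.
    votes = np.array([t.predict([x])[0] for t in teachers])
    counts = np.bincount(votes, minlength=10)
    noisy_counts = counts + np.random.laplace(0, 1/epsilon, size=10)
    return np.argmax(noisy_counts)

# Label the public data with noisy teacher votes; y_test is never used.
student_data = X_test
student_labels = np.array([noisy_vote(x) for x in student_data])

student = RandomForestClassifier().fit(student_data, student_labels)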
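To get a feel for how the noise scale affects the labels the student sees, the following dependency-free sketch estimates how often a noisy vote still returns the plurality label. The helper names (`laplace`, `noisy_argmax`, `agreement_rate`) and the example vote counts are assumptions made for illustration; the noise scale `1/epsilon` mirrors `noisy_vote` above.

```python
import random

def laplace(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_argmax(counts, epsilon, rng):
    # Add Laplace noise with scale 1/epsilon to each count, return the winner.
    noisy = [c + laplace(1.0 / epsilon, rng) for c in counts]
    return max(range(len(noisy)), key=noisy.__getitem__)

def agreement_rate(counts, epsilon, trials=2000, seed=0):
    # Fraction of trials in which the noisy winner equals the true plurality.
    rng = random.Random(seed)
    top = max(range(len(counts)), key=counts.__getitem__)
    hits = sum(noisy_argmax(counts, epsilon, rng) == top for _ in range(trials))
    return hits / trials

# Usage: a unanimous vote of 5 teachers versus a narrow 3-to-2 split.
unanimous = agreement_rate([5, 0, 0], epsilon=1.0)
split = agreement_rate([3, 2, 0], epsilon=1.0)
```

With only a handful of teachers, narrow votes are easily flipped by the noise, which is one reason PATE setups tend to use many teachers.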


---

## 4. Synthetic Sample Filtering (Post-hoc Privacy Check)

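This method is a post-processing step applied after synthetic data has been generated. It compares each synthetic sample to the real data to flag likely memorization. If a synthetic sample is too similar to a real one (according to a thresholded similarity metric such as cosine similarity), it can be filtered out to reduce the risk of leaking a real record. Note that cosine similarity ignores vector magnitude, and for non-negative features even unrelated vectors can score high, so the threshold needs to be calibrated on the data at hand.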

In [12]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def is_too_similar(real_data, synthetic_sample, threshold=0.95):
    sims = cosine_similarity([synthetic_sample], real_data)[0]
    return np.any(sims > threshold)

real = np.random.rand(100, 10)
synth = np.random.rand(10)

print("Too similar?" if is_too_similar(real, synth) else "Safe to keep")

Too similar?
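The same check can be applied to a whole batch of synthetic samples, keeping only those below the threshold. This is a minimal, dependency-free sketch: the helper names `cosine` and `filter_synthetic` and the toy vectors are assumptions made for illustration.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors (0.0 if either is all zeros).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def filter_synthetic(real_rows, synthetic_rows, threshold=0.95):
    # Keep a synthetic row only if its closest real row is at or below the threshold.
    kept = []
    for sample in synthetic_rows:
        if max(cosine(sample, row) for row in real_rows) <= threshold:
            kept.append(sample)
    return kept

# Usage: the first synthetic row nearly duplicates a real row and is dropped.
real_rows = [[1.0, 0.0], [0.0, 1.0]]
synthetic_rows = [[1.0, 0.01], [1.0, 1.0]]
kept_rows = filter_synthetic(real_rows, synthetic_rows)
```

Comparing against the maximum similarity over all real rows matches the `np.any(sims > threshold)` test in the cell above.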
