# Differentially Private Deep Learning with Opacus

Adapted from https://github.com/pytorch/opacus/blob/main/tutorials/intro_to_advanced_features.ipynb

In this document, we will investigate the advanced features of Opacus and see how to implement custom functionality.

First of all, we recommend you get a GPU runtime for this Colab! You can do so by clicking on Runtime > Change Runtime Type above, and selecting GPU.

Before anything else, Opacus needs to be installed; the install cell follows the overview.

## Overview

There are three components essential to DP-SGD:

1. The norm of each per-sample gradient is clipped to a fixed threshold.

2. Calibrated Gaussian noise is added to the resulting batch gradient to hide individual contributions.

3. Minibatches should be formed by uniform sampling, i.e. on each training step, each sample from the dataset is included independently with probability `q`. Note that this differs from the standard approach of shuffling the dataset and splitting it into batches: each sample has a non-zero probability of appearing multiple times in a given epoch, or of not appearing at all.


This translates into four distinctions from standard training:

1. We need to compute per-sample gradients (so that we know what to clip). Currently, the PyTorch autograd engine only stores gradients aggregated over a batch.
2. We need to incorporate Poisson sampling into the training process.
3. We need to implement gradient clipping and noise addition.
4. Finally, we need to keep an account of the privacy parameter.
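
As a rough illustration of the clipping and noise-addition step, the sketch below takes a batch of already-computed per-sample gradients, clips each one in L2 norm, sums them, adds Gaussian noise, and averages. It uses plain PyTorch rather than Opacus, and the function name `dp_sgd_aggregate` and its arguments are hypothetical, chosen for this example only.

```python
import torch


def dp_sgd_aggregate(per_sample_grads, max_grad_norm, noise_multiplier,
                     expected_batch_size, generator=None):
    """Clip, sum, noise and average per-sample gradients (illustrative only).

    per_sample_grads: tensor of shape (batch, num_params), one row per sample.
    """
    # L2 norm of every per-sample gradient
    norms = per_sample_grads.norm(p=2, dim=1)
    # Scale factor <= 1 that shrinks gradients whose norm exceeds the threshold
    clip_factor = (max_grad_norm / (norms + 1e-6)).clamp(max=1.0)
    clipped = per_sample_grads * clip_factor.unsqueeze(1)
    summed = clipped.sum(dim=0)
    # Gaussian noise with standard deviation noise_multiplier * max_grad_norm
    std = noise_multiplier * max_grad_norm
    noise = torch.randn(summed.shape, generator=generator) * std
    # Average with the expected (not the realized) batch size
    return (summed + noise) / expected_batch_size


toy_grads = torch.tensor([[3.0, 4.0], [0.3, 0.4]])  # L2 norms 5.0 and 0.5
noiseless = dp_sgd_aggregate(toy_grads, max_grad_norm=1.0,
                             noise_multiplier=0.0, expected_batch_size=2)
print(noiseless)
```

With `noise_multiplier=0.0` the result is just the mean of the clipped gradients, which makes the effect of clipping easy to inspect: the first row is scaled down to norm 1, the second is left untouched.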

In [None]:
!pip install opacus
%env PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512

In [None]:
# Usual suspects
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from tqdm.autonotebook import tqdm

# Our shiny cool lib
import opacus

# Part 0: Prerequisites

## Let's load the training data
Our task: **train a CIFAR10 model with differential privacy.**



In [None]:
BATCH_SIZE = 128

In [None]:
from torchvision.datasets import CIFAR10, CIFAR100
from torchvision.transforms import Compose, Normalize, ToTensor
from torch.utils.data import DataLoader
from opacus.utils.uniform_sampler import UniformWithReplacementSampler

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

train_ds = CIFAR10('.',
                   train=True,
                   download=True,
                   transform=Compose([ToTensor(), Normalize(IMAGENET_MEAN, IMAGENET_STD)])
)

train_loader = torch.utils.data.DataLoader(
    train_ds,
    batch_size=BATCH_SIZE,
)

test_ds = CIFAR10('.',
                  train=False,
                  download=True,
                  transform=Compose([ToTensor(), Normalize(IMAGENET_MEAN, IMAGENET_STD)])
)
test_loader = torch.utils.data.DataLoader(
    test_ds,
    batch_size=BATCH_SIZE,
    shuffle=False,
)

# Helpful for quick checks
x, y = next(iter(train_loader))

In [None]:
x.shape

## Load pretrained model
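We load a pretrained ResNet-18 and fine-tune only its head: the last residual stage, the pooling layer, and a new 10-class linear layer. The rest of the network (the backbone) stays frozen.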

In [None]:
from torchvision.models import resnet18

resnet_modules = list(resnet18(pretrained=True).children())

backbone = nn.Sequential(*resnet_modules[:-3])
head = nn.Sequential(*resnet_modules[-3:-1], nn.Flatten(), nn.Linear(512, 10))

backbone = backbone.eval()
head = head.train()

# Quick sanity check

with torch.no_grad():
  representation = backbone(x)

head(representation).shape

For maximal speed, we can check if CUDA is available and supported by the PyTorch installation. If GPU is available, set the device variable to your CUDA-compatible device. We can then transfer the neural network onto that device.

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
backbone = backbone.to(device)
head = head.to(device)

In the next few sections, we will replicate the functionality of `make_private` using custom functions.

In [None]:
# validate that the model is compatible with opacus and fix any issues

from opacus.validators import ModuleValidator
head = ModuleValidator.fix(head)
ModuleValidator.validate(head, strict=False)

# Part 1: Defining custom private components

The `make_private` function we used earlier does a lot of heavy lifting.
This image captures its overall structure. https://github.com/pytorch/opacus/blob/main/tutorials/img/make_private.png

We will see how we can customize and replicate its functionality. This will be very useful for research and perhaps for your project report as well.



## 1.a. Defining a private model

We start by wrapping the model with GradSampleModule - very straightforward.

In [None]:
from opacus import GradSampleModule

head = GradSampleModule(head)
head
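
`GradSampleModule` exposes per-sample gradients on each parameter as `grad_sample`, with one leading dimension for the batch. To make that quantity concrete, here is a naive sketch in plain PyTorch that computes per-sample gradients by looping over the batch. This is not how Opacus computes them (the loop is far slower); the toy layer, data, and the helper name `per_sample_grads_naive` are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_layer = nn.Linear(4, 3)
toy_x = torch.randn(5, 4)
toy_y = torch.randn(5, 3)


def per_sample_grads_naive(layer, inputs, targets):
    """Return weight gradients of shape (batch, out_features, in_features)."""
    grads = []
    for xi, yi in zip(inputs, targets):
        # Loss of a single sample, treated as a batch of size one
        loss = ((layer(xi.unsqueeze(0)) - yi.unsqueeze(0)) ** 2).sum()
        (g,) = torch.autograd.grad(loss, layer.weight)
        grads.append(g)
    return torch.stack(grads)


toy_grad_sample = per_sample_grads_naive(toy_layer, toy_x, toy_y)

# The usual batch gradient is the sum of the per-sample gradients
batch_loss = ((toy_layer(toy_x) - toy_y) ** 2).sum()
(batch_grad,) = torch.autograd.grad(batch_loss, toy_layer.weight)
print(toy_grad_sample.shape, torch.allclose(toy_grad_sample.sum(dim=0), batch_grad, atol=1e-5))
```

Standard autograd only gives us `batch_grad`; clipping needs every slice of `toy_grad_sample` individually, which is why the wrapper is required.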

### Question 1. When implementing Poisson sampling, we need to forbid gradient accumulation. Why is this? (1 point)

Answer here or on Overleaf

If we're using Poisson sampling, we have to forbid gradient accumulation: you'd have to call `optimizer.step()` and `zero_grad()` after every forward/backward pass.

In [None]:
head.forbid_grad_accumulation()

# Note: comment out the code below and re-run this cell once before running the training code.
# Opacus seems not to like zero_grad() being called outside training.

# # first backward should work fine
# with torch.no_grad():
#     representation = backbone(x)
# preds = head(representation)
# preds.sum().backward()

# print("First backward successful")

# # second should fail
# with torch.no_grad():
#     representation = backbone(x)
# preds = head(representation)
# preds.sum().backward()

# head.zero_grad()

## 1.b. Private Data loader

We now get to the data loader. Note that `DPDataLoader.from_data_loader` returns a brand new DataLoader, which is backed by the same dataset.

In [None]:
from opacus.data_loader import DPDataLoader

dp_data_loader = DPDataLoader.from_data_loader(train_loader, distributed=False)

print("Is dataset the same: ", dp_data_loader.dataset == train_loader.dataset)
print(f"DPDataLoader length: {len(dp_data_loader)}, original: {len(train_loader)}")
print("DPDataLoader sampler: ", dp_data_loader.batch_sampler)

data_loader = dp_data_loader

The main reason we need a separate private loader is that privacy amplification by subsampling relies on Poisson sampling. An interesting property of Poisson sampling, which we need to take into account, is that batch sizes are not constant. On average the batch size matches that of the original data loader, but it varies on every iteration:

In [None]:
import matplotlib.pyplot as plt

batch_sizes = []
for x,y in dp_data_loader:
    batch_sizes.append(len(x))

plt.hist(batch_sizes)
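
The same effect can be reproduced without any dataset. The sketch below is a minimal stand-in for Poisson sampling using only the standard library: every index is included independently with probability `sample_rate`. The helper `poisson_sample` and the toy sizes are hypothetical and are not part of Opacus.

```python
import random


def poisson_sample(num_samples, sample_rate, rng):
    """Return the indices of one Poisson-sampled batch."""
    # Each index is kept independently with probability sample_rate
    return [i for i in range(num_samples) if rng.random() < sample_rate]


rng = random.Random(0)
toy_num_samples, toy_rate = 1000, 0.1  # expected batch size: 100
toy_batch_sizes = [
    len(poisson_sample(toy_num_samples, toy_rate, rng)) for _ in range(50)
]
toy_mean_size = sum(toy_batch_sizes) / len(toy_batch_sizes)
print(min(toy_batch_sizes), max(toy_batch_sizes), toy_mean_size)
```

The realized sizes scatter around `toy_num_samples * toy_rate`, and an individual index may be drawn in several batches or in none.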

## 1.c Custom Private Optimizer

Because of this variability, we can't infer the batch size from the input shape. Yet we need a batch size to average the (noised) gradients, and under the DP assumptions the sampling outcome (i.e. the realized batch size) does not affect the amount of noise added.

So we calculate the expected batch size from the data loader. It will match the batch size of the original data loader; we only need a few extra steps in case the original data loader was initialized with a custom sampler.

In [None]:
from opacus.optimizers import DPOptimizer
from opacus.optimizers.optimizer import _check_processed_flag, _mark_as_processed
from torch.distributions.laplace import Laplace

sample_rate = 1 / len(data_loader)
expected_batch_size = int(len(data_loader.dataset) * sample_rate)
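
To see what these two lines evaluate to, here is the same arithmetic with concrete numbers. It assumes the 50,000-image CIFAR-10 training split, `BATCH_SIZE = 128`, and a loader that reports `ceil(50000 / 128)` batches; the `toy_` names are introduced here only for illustration.

```python
import math

num_train = 50_000   # size of the CIFAR-10 training split
batch_size = 128     # BATCH_SIZE above

# Assumed value of len(data_loader): the last partial batch is kept
num_batches = math.ceil(num_train / batch_size)

toy_sample_rate = 1 / num_batches
toy_expected_batch_size = int(num_train * toy_sample_rate)

print(num_batches, toy_expected_batch_size)  # 391 127
```

Because `int()` truncates, the expected batch size comes out as 127 rather than 128 under these assumptions. The difference is small, but it is the value the optimizer uses when averaging.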

Many non-standard use cases will involve customizing the behaviour of `DPOptimizer`. We will customize this object to implement Laplace noise. By default, the noise added is Gaussian. To do this, we need to modify the `add_noise` method.

See this image for an overview of how the `DPOptimizer` functions: https://github.com/pytorch/opacus/blob/main/tutorials/img/optimizer.png

### Question 2. What should be the loc and scale of the Laplace noise below as a function of `self.noise_multiplier` and `self.max_grad_norm`? (1 point)

In [None]:
# Define a non-private optimizer
optimizer = torch.optim.SGD(head.parameters(), lr=0.3, momentum=0.9, nesterov=True)


# Define a custom class which adds laplace noise
class LaplaceDPOptimizer(DPOptimizer):
    def add_noise(self):
        # FILL IN BELOW.
        # COMPUTE `loc` and `scale` as a function of `self.noise_multiplier` and `self.max_grad_norm`
        laplace = Laplace(loc= FILLINHERE , scale= FILLINHERE)
        for p in self.params:
            _check_processed_flag(p.summed_grad)

            noise = laplace.sample(p.summed_grad.shape)
            # because grad may be on GPU, we need to send noise to GPU
            noise = noise.to(p.summed_grad.device)
            p.grad = p.summed_grad + noise

            _mark_as_processed(p.summed_grad)


# Convert our non-private optimizer to a private one
optimizer = LaplaceDPOptimizer(
    optimizer=optimizer,
    noise_multiplier=1.0,
    max_grad_norm=1.0,
    expected_batch_size=expected_batch_size,
)

## 1.d Privacy Accounting

And now the final (and most important) piece of the puzzle - privacy accounting. We will define an accountant for our Laplace mechanism.

Let us first write a function which computes the privacy parameter ɛ.

### Question 3. Given history (list of values of `noise_multiplier` and `q`), compute epsilon (2 points)

In [None]:
from typing import List, Tuple

def compute_eps(
    history: List[Tuple[float, float, int]]
) -> float:
    r"""Computes Differential Privacy guarantees of the
    Sampled Laplace Mechanism (SLM) given history.

    Args:
        history: list of (noise_multiplier, q, step) tuples, where
            noise_multiplier: The ratio of the additive Laplacian noise parameter
                to the L1-sensitivity of the function to which it is added.
            q: Sampling rate of SLM.
            step: Current iteration step.

    Returns:
        The float value of epsilon for pure DP
    """

    for noise_multiplier, q, step in history:
        # FILL IN HERE LOGIC TO COMPUTE EPSILON

    return epsilon

Now we will subclass the abstract class `IAccountant` to implement our custom `LaplacianAccountant`.

In [None]:
from opacus.accountants.accountant import IAccountant

class LaplacianAccountant(IAccountant):
    def __init__(self):
        super().__init__()

    def step(self, *, noise_multiplier: float, sample_rate: float):
        if len(self.history) >= 1:
            step_num = self.history[-1][-1]
            self.history.append((noise_multiplier, sample_rate, step_num + 1))
        else:
            self.history = [(noise_multiplier, sample_rate, 1)]

    def get_epsilon(self, delta: float = 0, poisson: bool = True) -> float:
        """
        Return privacy budget (epsilon) expended so far.

        Args:
            delta: this is always 0 for Laplace mechanism
            poisson: ``True`` if input batches were sampled via Poisson sampling,
                ``False`` otherwise
        """
        assert delta == 0, "Laplace mechanism does not support delta"
        assert poisson, "Our mechanism only supports Poisson sampling"

        return compute_eps(self.history)

    def __len__(self):
        return len(self.history)

    @classmethod
    def mechanism(cls) -> str:
        return "lap"

All we need to do now is initialize the accountant object and attach it to the `DPOptimizer` so that it tracks every step.

In [None]:
accountant = LaplacianAccountant()
optimizer.attach_step_hook(accountant.get_optimizer_hook_fn(sample_rate=sample_rate))

# Part 2: Finally, train the model!

Let us write code to train for 1 epoch.

In [None]:
import numpy as np
from opacus.utils.batch_memory_manager import BatchMemoryManager

def accuracy(preds, labels):
    return (preds == labels).mean()

def train(backbone, head, optimizer, train_loader, epoch=1, verbose=True):
    top1_accs = []
    losses = []
    criterion = nn.CrossEntropyLoss()
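    head.train()
    # The backbone is frozen (only used under no_grad), so keep it in eval mode
    # to avoid updating its BatchNorm running statistics with private data.
    backbone.eval()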
    with BatchMemoryManager(
        data_loader=train_loader,
        max_physical_batch_size=BATCH_SIZE,
        optimizer=optimizer
    ) as memory_safe_data_loader:
        for i, (x, y) in tqdm(enumerate(memory_safe_data_loader), desc="Step", unit="step"):
            optimizer.zero_grad()
            x = x.to(device)
            y = y.to(device)

            # compute output
            with torch.no_grad():
                x = backbone(x)

            logits = head(x)
            loss = criterion(logits, y)

            preds = np.argmax(logits.detach().cpu().numpy(), axis=1)
            labels = y.detach().cpu().numpy()

            # measure accuracy and record loss
            acc = accuracy(preds, labels)
            losses.append(loss.item())
            top1_accs.append(acc)

            # compute update
            loss.backward()
            optimizer.step()
            if i % 50 == 0 and verbose:
                epsilon = accountant.get_epsilon()
                print(
                    f"\tTrain Epoch: {epoch} \t"
                    f"Step: {i} \t"
                    f"Loss: {np.mean(losses):.6f} "
                    f"Acc@1: {np.mean(top1_accs) * 100:.6f} "
                    f"(ε = {epsilon:.2f})"
                )

    return

Next, we will create a function to validate on our test dataset

In [None]:
def test(head, backbone, test_loader):
    head.eval()
    backbone.eval()
    criterion = nn.CrossEntropyLoss()
    losses = []
    top1_acc = []

    with torch.no_grad():
        for x, y in test_loader:
            x = x.to(device)
            y = y.to(device)

            x = backbone(x)
            output = head(x)
            loss = criterion(output, y)
            preds = np.argmax(output.detach().cpu().numpy(), axis=1)
            labels = y.detach().cpu().numpy()
            acc = accuracy(preds, labels)

            losses.append(loss.item())
            top1_acc.append(acc)

    top1_avg = np.mean(top1_acc)

    print(
        f"\tTest set: "
        f"Loss: {np.mean(losses):.6f} "
        f"Acc: {top1_avg * 100:.6f} "
    )
    return np.mean(top1_acc)

In [None]:
for epoch in tqdm(range(10), desc="Epoch", unit="epoch"):
    # Train on the Poisson-sampled DP loader so the accountant's assumptions hold
    train(backbone, head, optimizer, data_loader, epoch+1)
test(head, backbone, test_loader)

### Question 4. There is a fundamental mistake in the above implementation. We clip L2 norm, whereas we should have clipped L1 norm. Fix this (2 points)

To fix this, you will have to override the `clip_and_accumulate()` method of the `DPOptimizer` class when implementing `LaplaceDPOptimizer`. Copy the default implementation from [here](https://github.com/pytorch/opacus/blob/a246aa644bc9f04aaf17faa5706efb145a7c6ac7/opacus/optimizers/optimizer.py#L429C9-L429C28) and edit what is required to change from L2 norm to L1 norm.

Re-run your code. What happens to the convergence?