# Getting Started with NVFlare: Differential Privacy with DP-SGD
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/NVIDIA/NVFlare/blob/main/examples/hello-world/hello-dp/hello-dp.ipynb)

This example demonstrates how to use NVIDIA FLARE with PyTorch and **Differential Privacy (DP)** to train a regression model with privacy guarantees. We use [Opacus](https://opacus.ai) to apply DP-SGD (Differentially Private Stochastic Gradient Descent) during local training on each client, which provides sample-level differential privacy.

## What is Differential Privacy?

[Differential Privacy (DP)](https://en.wikipedia.org/wiki/Differential_privacy) is a rigorous mathematical framework designed to provide strong privacy guarantees when handling sensitive data. In Federated Learning (FL), DP protects user information by adding carefully calibrated noise to the model updates during training.

**DP-SGD** (Differentially Private Stochastic Gradient Descent) adds noise during each optimization step:
1. **Gradient Clipping**: Gradients are clipped to a maximum norm to bound their sensitivity
2. **Noise Addition**: Gaussian noise is added to the clipped gradients  
3. **Privacy Accounting**: The privacy budget (ε, δ) is tracked across training steps
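
The clipping and noise steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not how Opacus implements them: the function names `clip_gradient` and `dp_sgd_gradient` are hypothetical, gradients are plain lists, the noise standard deviation is taken as `noise_multiplier * max_grad_norm`, and privacy accounting (step 3) is left out.

```python
import math
import random


def clip_gradient(grad, max_grad_norm):
    """Scale one per-sample gradient so its L2 norm is at most max_grad_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_grad_norm / (norm + 1e-12))
    return [g * scale for g in grad]


def dp_sgd_gradient(per_sample_grads, max_grad_norm, noise_multiplier, rng):
    """Clip each per-sample gradient, sum, add Gaussian noise, then average."""
    clipped = [clip_gradient(g, max_grad_norm) for g in per_sample_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_grad_norm  # noise scale tied to the clipping bound
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_sample_grads) for v in noisy]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # first gradient has norm 5, second has norm 0.5
print(dp_sgd_gradient(grads, max_grad_norm=1.0, noise_multiplier=1.1, rng=rng))
```

Only the first gradient is rescaled here, because the second is already within the clipping bound.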

The privacy guarantee is characterized by:
- **ε (epsilon)**: Privacy budget - **smaller values = stronger privacy** (more noise)
- **δ (delta)**: Probability of privacy breach - typically set to 1/n² where n is dataset size
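
For reference, the two parameters come from the standard definition of (ε, δ)-differential privacy. The notation below is an assumption of this note rather than something used elsewhere in this example: $M$ is the randomized training mechanism, $D$ and $D'$ are any two datasets that differ in a single sample, and $S$ is any set of possible outputs.

$$
\Pr[M(D) \in S] \le e^{\varepsilon} \, \Pr[M(D') \in S] + \delta
$$

A smaller ε tightens the multiplicative bound, and δ is the small additive slack allowed on top of it.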

## Setup environment

If running in Google Colab, reinstall `blinker` and then download the source code for this example:

In [None]:
%pip install --ignore-installed blinker

In [None]:
! npx --yes degit -f NVIDIA/NVFlare/examples/hello-world/hello-dp .

Install nvflare and dependencies (including opacus for DP):

In [None]:
%pip install -r requirements.txt

## Dataset: California Housing

This example uses the [California Housing dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html) - a regression problem to predict median house values.

**Dataset characteristics:**
- 20,640 samples
- 8 features (median income, house age, average rooms, etc.)
- 1 target (median house value)

The dataset is automatically partitioned across clients, so each client has a non-overlapping subset.
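
As an illustration of non-overlapping partitioning, the sketch below splits sample indices into contiguous shards. It is a hypothetical scheme, not necessarily the one `client.py` uses; the helper name `partition_indices` and the zero-based `client_id` are assumptions.

```python
def partition_indices(n_samples, n_clients, client_id):
    """Return the half-open index range [start, end) owned by one client."""
    base, extra = divmod(n_samples, n_clients)
    # The first `extra` clients each take one additional sample.
    start = client_id * base + min(client_id, extra)
    end = start + base + (1 if client_id < extra else 0)
    return start, end


for client_id in range(2):
    print(client_id, partition_indices(20640, 2, client_id))
```

With 20,640 samples and 2 clients, each shard holds 10,320 samples and the ranges do not overlap.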

## Model: Tabular MLP

We use a simple Multi-Layer Perceptron (MLP) for tabular data regression. See [model.py](model.py).

In [None]:
import torch
import torch.nn as nn

class TabularMLP(nn.Module):
    """Simple Multi-Layer Perceptron for tabular data regression"""
    
    def __init__(self, input_dim=8, hidden_dims=[64, 32], output_dim=1):
        super(TabularMLP, self).__init__()
        
        layers = []
        prev_dim = input_dim
        
        # Build hidden layers
        for hidden_dim in hidden_dims:
            layers.append(nn.Linear(prev_dim, hidden_dim))
            layers.append(nn.ReLU())
            layers.append(nn.Dropout(0.2))
            prev_dim = hidden_dim
        
        # Output layer
        layers.append(nn.Linear(prev_dim, output_dim))
        
        self.model = nn.Sequential(*layers)
    
    def forward(self, x):
        return self.model(x)

The architecture:
- **Input layer**: 8 features
- **Hidden layers**: 64 → 32 neurons with ReLU activation and dropout
- **Output layer**: 1 neuron (predicted median house value)
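
To get a feel for the model size, the trainable parameter count of this architecture can be worked out by hand. The helper below is a hypothetical convenience, not part of `model.py`; it counts one weight matrix and one bias vector per linear layer, since ReLU and dropout add no parameters.

```python
def mlp_param_count(input_dim=8, hidden_dims=(64, 32), output_dim=1):
    """Count weights and biases of a fully connected MLP."""
    dims = [input_dim, *hidden_dims, output_dim]
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims, dims[1:]))


# (8*64 + 64) + (64*32 + 32) + (32*1 + 1)
print(mlp_param_count())
```

DP-SGD adds noise to every gradient coordinate, so the parameter count is a useful number to keep in mind when comparing architectures.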

## Client Code with Differential Privacy

The client code [client.py](client.py) implements DP-SGD using **Opacus**. The key difference from standard training is adding the `PrivacyEngine`:

```python
from opacus import PrivacyEngine
import nvflare.client as flare

# Initialize NVFlare client
flare.init()

# model, optimizer, and train_loader are assumed to be created earlier in client.py

# === Apply Differential Privacy (once, so ε accumulates across rounds) ===
privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    noise_multiplier=1.1,  # Controls noise level
    max_grad_norm=1.0,     # Gradient clipping threshold
)
# =========================================================================

while flare.is_running():
    # Receive model from server and load it into the wrapped original module
    input_model = flare.receive()
    model._module.load_state_dict(input_model.params)
    
    # Train as usual - PrivacyEngine handles gradient clipping & noise
    for epoch in range(epochs):
        for data, target in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(data), target)
            loss.backward()
            optimizer.step()
    
    # Check privacy budget spent
    epsilon = privacy_engine.get_epsilon(delta)
    print(f"Privacy spent: (ε = {epsilon:.2f}, δ = {delta})")
    
    # Send model back (note: use _module to get original model)
    output_model = flare.FLModel(
        params=model._module.state_dict(),
        metrics={"rmse": rmse, "privacy_epsilon": epsilon}
    )
    flare.send(output_model)
```

The `PrivacyEngine.make_private()` method:
1. Wraps the model to enable per-sample gradient computation
2. Modifies the optimizer to clip gradients and add noise
3. Wraps the data loader for privacy accounting
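
Because the wrapped model keeps the original network under its `_module` attribute (as used in the code above), parameter names taken from the wrapper's own `state_dict()` are expected to carry a `_module.` prefix. The sketch below shows one way to strip that prefix so the keys match the unwrapped `TabularMLP`. The helper `strip_wrapper_prefix` is hypothetical, and placeholder strings stand in for tensors.

```python
def strip_wrapper_prefix(state_dict, prefix="_module."):
    """Return a copy of state_dict with the wrapper prefix removed from each key."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# Placeholder values stand in for tensors in this illustration.
wrapped = {"_module.model.0.weight": "W0", "_module.model.0.bias": "b0"}
print(strip_wrapper_prefix(wrapped))
```

Calling `model._module.state_dict()`, as the client code does, avoids the prefix in the first place and is the simpler option.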

## Run FL Experiment with DP

Now we'll create a federated job that trains the model with differential privacy across multiple clients.

### 1. Define the FedJob Recipe

We use the `FedAvgRecipe`, which implements the FedAvg algorithm. No custom server code needed!

In [None]:
from model import TabularMLP

from nvflare.app_opt.pt.recipes.fedavg import FedAvgRecipe
from nvflare.recipe import SimEnv, add_experiment_tracking
from nvflare.recipe.utils import add_cross_site_evaluation

n_clients = 2
num_rounds = 5
batch_size = 64
target_epsilon = 1.0  # Total privacy budget (cumulative across all rounds)

recipe = FedAvgRecipe(
    name="hello-dp",
    min_clients=n_clients,
    num_rounds=num_rounds,
    initial_model=TabularMLP(input_dim=8, hidden_dims=[64, 32], output_dim=1),
    train_script="client.py",
    train_args=f"--batch_size {batch_size} --target_epsilon {target_epsilon} --n_clients {n_clients}",
)
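
The settings above determine a few per-client quantities that matter for DP-SGD. The sketch below is a rough calculation under stated assumptions: the dataset is split evenly, every local sample is used for training (any held-out test split is ignored), and δ follows the 1/n² rule of thumb mentioned earlier. The helper `client_dp_settings` is hypothetical and not part of NVFlare or Opacus.

```python
import math


def client_dp_settings(total_samples, n_clients, batch_size):
    """Estimate per-client quantities that feed into DP-SGD accounting."""
    n = total_samples // n_clients  # samples per client, assuming an even split
    return {
        "samples_per_client": n,
        "sample_rate": batch_size / n,  # expected fraction of data per step
        "steps_per_epoch": math.ceil(n / batch_size),  # approximate step count
        "delta": 1.0 / (n * n),  # 1/n^2 rule of thumb
    }


settings = client_dp_settings(total_samples=20640, n_clients=2, batch_size=64)
for name, value in settings.items():
    print(f"{name}: {value}")
```

More clients means fewer samples per client, which raises the sample rate for a fixed batch size and makes a given ε harder to reach with the same amount of noise.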

### 2. Add experiment tracking

Enable TensorBoard to visualize training metrics and privacy budget:

In [None]:
add_experiment_tracking(recipe, tracking_type="tensorboard")

### 3. (Optional) Add Cross-Site Evaluation

Uncomment to evaluate trained models across all client sites:

In [None]:
# Uncomment to enable cross-site evaluation
# add_cross_site_evaluation(recipe)

### 4. Run the FL Job

Execute the federated learning job in simulation mode:

In [None]:
env = SimEnv(num_clients=n_clients)
run = recipe.execute(env)
print()
print("Job Status is:", run.get_status())
print("Result can be found in:", run.get_result())
print()

## Visualize Results

You can visualize training metrics and privacy budget using TensorBoard:

In [None]:
%load_ext tensorboard
%tensorboard --bind_all --logdir /tmp/nvflare/simulation/hello-dp

### Metrics to Monitor:
- **train_loss**: Training loss over time
- **rmse**: Root Mean Squared Error on test set
- **privacy_epsilon**: Privacy budget spent (ε)
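
For reference, RMSE is the square root of the mean squared difference between predictions and targets. The sketch below computes it with the standard library only; the function name `rmse` and the sample values are illustrative, not taken from `client.py`.

```python
import math


def rmse(predictions, targets):
    """Root Mean Squared Error between two equal-length sequences."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


print(rmse([2.5, 1.0, 3.0], [3.0, 1.0, 2.0]))
```

Lower values are better, and the metric is in the same units as the target.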

If you enabled cross-site evaluation, view results:

In [None]:
import json
try:
    with open('/tmp/nvflare/simulation/hello-dp/server/simulate_job/cross_site_val/cross_val_results.json') as f:
        print(json.dumps(json.load(f), indent=2))
except FileNotFoundError:
    print("Cross-site evaluation not enabled. Uncomment add_cross_site_evaluation(recipe) to enable.")

## Privacy-Utility Trade-off

Experiment with different epsilon values to observe the privacy-utility trade-off. **Note**: Epsilon is cumulative across all federated rounds.

| Epsilon (ε) | Privacy Level | Model Accuracy | Use Case |
|-------------|---------------|----------------|----------|
| ε ≤ 0.5     | Very Strong   | Lower          | Highly sensitive |
| ε = 0.5-1.0 | Strong        | Moderate       | Sensitive data |
| ε = 1.0-3.0 | Moderate      | Good (default) | General private |
| ε > 10      | Weak          | High           | Minimal privacy |
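
One way to read the ε column: under the standard (ε, δ)-DP definition, and ignoring δ, a single sample can change the probability of any training outcome by at most a factor of e^ε. The sketch below evaluates that factor for a few values from the table; the helper name `max_probability_ratio` is hypothetical.

```python
import math


def max_probability_ratio(epsilon):
    """Upper bound e^epsilon on the outcome probability ratio (ignoring delta)."""
    return math.exp(epsilon)


for eps in (0.5, 1.0, 3.0, 10.0):
    print(f"epsilon={eps}: ratio bound = {max_probability_ratio(eps):.2f}")
```

The bound grows exponentially, which is why large ε values give only a weak guarantee.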

Try running with different epsilon values:

In [None]:
# Example: Try with different privacy levels
# Uncomment and run to compare results

# Stronger privacy (lower epsilon)
# target_epsilon = 0.5  # Very strong privacy
# recipe_strong_privacy = FedAvgRecipe(
#     name="hello-dp-strong",
#     min_clients=n_clients,
#     num_rounds=num_rounds,
#     initial_model=TabularMLP(input_dim=8, hidden_dims=[64, 32], output_dim=1),
#     train_script="client.py",
#     train_args=f"--batch_size {batch_size} --target_epsilon {target_epsilon} --n_clients {n_clients}",
# )
# add_experiment_tracking(recipe_strong_privacy, tracking_type="tensorboard")
# run_strong = recipe_strong_privacy.execute(SimEnv(num_clients=n_clients))
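
To compare several privacy levels, the `train_args` string can be generated per run. The sketch below builds only the argument strings, using the same flags passed to `client.py` above; the job names and the helper `build_train_args` are hypothetical. Each entry would still need its own `FedAvgRecipe` and `execute` call, as in the commented example.

```python
def build_train_args(batch_size, target_epsilon, n_clients):
    """Format the command-line arguments passed to client.py."""
    return (
        f"--batch_size {batch_size} "
        f"--target_epsilon {target_epsilon} "
        f"--n_clients {n_clients}"
    )


# One hypothetical job name per epsilon value to try.
sweep = {
    f"hello-dp-eps-{eps}": build_train_args(batch_size=64, target_epsilon=eps, n_clients=2)
    for eps in (0.5, 1.0, 3.0)
}
for job_name, args in sweep.items():
    print(job_name, "->", args)
```

Giving each run its own job name should keep the TensorBoard logs for different ε values separate.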

## Summary

In this example, you learned:

1. **What is Differential Privacy**: A mathematical framework for privacy-preserving machine learning
2. **DP-SGD with Opacus**: How to add differential privacy using Opacus' `PrivacyEngine`
3. **FL with DP**: How to integrate DP into federated learning with NVFlare
4. **Privacy Accounting**: How to track privacy budget (ε, δ) during training
5. **Privacy-Utility Trade-off**: Balance between privacy (noise) and model accuracy

### Best Practices:
- Start with a moderate epsilon (e.g., 1.0) and adjust based on privacy requirements
- For sensitive data (medical, financial), use ε ≤ 1.0
- Pre-train on public data before fine-tuning on private data
- Monitor privacy budget to ensure it doesn't exceed target
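
Monitoring the budget can be as simple as comparing the reported `privacy_epsilon` metric against the target after each round. The sketch below assumes the reported values are already cumulative, as noted above. The helper `first_round_over_budget` is hypothetical and the ε values are made up for illustration; they are not measured results.

```python
def first_round_over_budget(cumulative_epsilons, target_epsilon):
    """Return the 1-based round where reported epsilon first exceeds the target, or None."""
    for round_number, eps in enumerate(cumulative_epsilons, start=1):
        if eps > target_epsilon:
            return round_number
    return None


# Hypothetical cumulative epsilon values, one per round (not measured results).
reported = [0.45, 0.62, 0.78, 0.91, 1.04]
print(first_round_over_budget(reported, target_epsilon=1.0))
```

A `None` result means the run stayed within budget for every reported round.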

### Next Steps:
- Explore [Homomorphic Encryption](https://nvflare.readthedocs.io/) for additional privacy
- Try different model architectures and datasets
- Learn about [Secure Aggregation](https://nvflare.readthedocs.io/)
- Apply to your own use cases

## References

1. Abadi, M., et al. (2016). [Deep Learning with Differential Privacy](https://arxiv.org/abs/1607.00133). ACM CCS 2016.
2. McMahan, B., et al. (2017). [Communication-Efficient Learning of Deep Networks from Decentralized Data](https://proceedings.mlr.press/v54/mcmahan17a). AISTATS 2017.
3. [Opacus: User-friendly library for training PyTorch models with differential privacy](https://opacus.ai/)
4. [NVIDIA FLARE Documentation](https://nvflare.readthedocs.io/)