# Differential Privacy with TensorFlow Privacy

In this exercise, you will use a differentially private version of stochastic gradient descent (SGD) to classify a binary subset of the MNIST dataset. As a baseline, you will use logistic regression with L2 regularization, and you will compare it to a convolutional neural network (CNN). Finally, you will search for a good trade-off between model utility (i.e. $F_1$ score) and privacy ($\epsilon$).

If you are not yet familiar with TensorFlow Privacy, you may use this [**tutorial**](https://www.tensorflow.org/responsible_ai/privacy/tutorials/classification_privacy) to get an overview of the library, or check out its [**GitHub repository**](https://github.com/tensorflow/privacy).

In [None]:
# imports
import numpy as np
np.random.seed(0)

from sklearn.metrics import accuracy_score, f1_score
from sklearn.linear_model import LogisticRegression

import tensorflow_privacy
from tensorflow_privacy.privacy.analysis import compute_dp_sgd_privacy

import tensorflow as tf
from tensorflow.keras.datasets import mnist

import matplotlib.pyplot as plt
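# 'seaborn-darkgrid' was renamed to 'seaborn-v0_8-darkgrid' in Matplotlib 3.6
_style = 'seaborn-v0_8-darkgrid'
plt.style.use(_style if _style in plt.style.available else 'seaborn-darkgrid')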

## Load and pre-process the dataset

* Just as in the last exercise, you should start by loading the **MNIST** dataset. You may use `tensorflow.keras.datasets.mnist`, which is included in the import statements above.

* Note that you need a **binarized** version of the MNIST dataset. Also, keep only **2** of the 10 classes, namely the digits 5 and 8.
* Rescale the feature values such that each sample has **Euclidean norm** $\leq 1$.
* Plot some example images.
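
The first two steps above can be sketched with plain NumPy. This is only one possible approach, not the reference solution: the helper name `preprocess` is hypothetical, it assumes that "binarized" refers to the two-class restriction (5 becomes label 0, 8 becomes label 1), and it scales every sample by its own norm, which is one of several ways to satisfy the norm constraint. Random stand-in data is used so the snippet runs without downloading MNIST.

```python
import numpy as np

def preprocess(X, y, classes=(5, 8)):
    """Keep two digit classes, flatten the images, scale each to L2 norm <= 1."""
    mask = np.isin(y, classes)
    X_sel = X[mask].reshape(mask.sum(), -1).astype(np.float32)
    # Map the first class to label 0 and the second class to label 1.
    y_bin = (y[mask] == classes[1]).astype(np.int64)
    # Divide each row by its own Euclidean norm; all-zero rows stay zero.
    norms = np.linalg.norm(X_sel, axis=1, keepdims=True)
    X_unit = X_sel / np.maximum(norms, 1e-12)
    return X_unit, y_bin

# Stand-in data with MNIST's shape and value range (random, not real digits).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 256, size=(100, 28, 28), dtype=np.uint8)
y_demo = rng.integers(0, 10, size=100)
X_small, y_small = preprocess(X_demo, y_demo)
```

With the real data you would call `preprocess(X_train, y_train)` and `preprocess(X_test, y_test)`. Scaling each sample by its own norm does not depend on any other sample, which keeps the preprocessing step itself free of cross-sample information.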

In [None]:
# load dataset using Keras
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# rescale each sample so that its Euclidean norm is at most 1

# [insert your code here]

In [None]:
# transform into binary classification dataset

# [insert your code here]

In [None]:
# plot some example images

# [insert your code here]

## Set up, train and evaluate baseline model

* Set up a logistic regression model with L2 regularization and train it on the binarized MNIST data using **Differentially Private Stochastic Gradient Descent (DP-SGD)**. You may use `sklearn.linear_model.LogisticRegression` or an equivalent layer from TensorFlow (softmax/sigmoid). Use `tensorflow_privacy.DPKerasSGDOptimizer` as the optimizer (or the differentially private version of Adam). Note that the DP optimizers are Keras optimizers, so they can only be attached to the TensorFlow variant of the model.

* Train the model for different privacy budgets: $\epsilon \in \{0.1, 0.5, 1.0, 5.0, 10.0\}$ with fixed $\delta = 0.0001$.

* Measure the classification performance using the $F_1$ score for each epsilon.
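
To make concrete what a DP-SGD optimizer does in each update, here is a minimal NumPy sketch of a single step for L2-regularized logistic regression: compute per-example gradients, clip each one, add Gaussian noise to the sum, then average. It is an illustration under stated assumptions, not the code of `DPKerasSGDOptimizer`; the function `dp_sgd_step` and all of its parameter values are hypothetical, and the model has no bias term.

```python
import numpy as np

def dp_sgd_step(w, X, y, lr, l2_norm_clip, noise_multiplier, l2_reg, rng):
    """One DP-SGD update for logistic regression (weights only, no bias)."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))  # predicted P(y = 1)
    # Per-example gradient of the L2-regularized log loss, shape (batch, dim).
    grads = (p - y)[:, None] * X + l2_reg * w
    # Clip each example's gradient to L2 norm <= l2_norm_clip.
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    grads = grads / np.maximum(1.0, norms / l2_norm_clip)
    # Add Gaussian noise with std = noise_multiplier * l2_norm_clip to the sum.
    noise = rng.normal(0.0, noise_multiplier * l2_norm_clip, size=w.shape)
    return w - lr * (grads.sum(axis=0) + noise) / len(X)

# Toy batch (random, unit-norm rows) just to exercise the function.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(32, 5))
X_toy /= np.linalg.norm(X_toy, axis=1, keepdims=True)
y_toy = (X_toy[:, 0] > 0).astype(float)
w_new = dp_sgd_step(np.zeros(5), X_toy, y_toy, lr=0.5, l2_norm_clip=1.0,
                    noise_multiplier=1.1, l2_reg=0.01, rng=rng)
```

The regularization term is folded into each per-example gradient before clipping, so the clipping bound still limits how much any single sample can influence the update.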

_Hint_: change the `noise_multiplier` hyperparameter in your `tensorflow_privacy.DPKerasSGDOptimizer` and observe how $\epsilon$ changes using `tensorflow_privacy.privacy.analysis.compute_dp_sgd_privacy`.
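
Instead of adjusting `noise_multiplier` by hand, you could search for it. The sketch below is one possible approach: a bisection over the noise multiplier that relies on the assumption that $\epsilon$ decreases as the noise grows. The helper `find_noise_multiplier` and its default search bounds are hypothetical; it takes any callable `eps_fn` that maps a noise multiplier to an $\epsilon$, so it does not depend on a particular TensorFlow Privacy version.

```python
def find_noise_multiplier(eps_fn, target_eps, low=0.3, high=50.0, tol=1e-3):
    """Smallest noise multiplier in [low, high] whose epsilon is <= target_eps.

    Assumes eps_fn(noise_multiplier) decreases as the noise multiplier grows.
    """
    if eps_fn(high) > target_eps:
        raise ValueError("target_eps not reachable; increase `high`")
    if eps_fn(low) <= target_eps:
        return low
    while high - low > tol:
        mid = 0.5 * (low + high)
        if eps_fn(mid) <= target_eps:
            high = mid  # enough noise, try less
        else:
            low = mid   # too little noise
    return high

# Hypothetical wiring (batch_size and epochs are placeholder values). It
# assumes the module imported at the top exposes a function
# compute_dp_sgd_privacy(n, batch_size, noise_multiplier, epochs, delta)
# that returns (epsilon, optimal_order); check this for your version.
#
# eps_fn = lambda nm: compute_dp_sgd_privacy.compute_dp_sgd_privacy(
#     n=len(X_train), batch_size=250, noise_multiplier=nm,
#     epochs=15, delta=1e-4)[0]
# noise = find_noise_multiplier(eps_fn, target_eps=1.0)
```

Returning the upper end of the final interval guarantees that the reported multiplier meets the target, at the cost of being up to `tol` larger than strictly necessary.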

In [None]:
# use DP-SGD to train logistic regression with l2 regularization

# [insert your code here]

## Set up, train and evaluate CNN trained with DP-SGD

* Repeat the previous task, but instead of training L2-regularized Logistic Regression, you should train a 3-layer CNN using DP-SGD.
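
As a starting point, a CNN with two convolutional layers and one dense output layer could be built as sketched below. The architecture, the function name `build_dp_cnn`, and all default hyperparameters are assumptions for illustration, loosely following the tutorial linked at the top; they are not prescribed by this exercise. The imports sit inside the function so the definition itself does not require TensorFlow to be importable.

```python
def build_dp_cnn(noise_multiplier, l2_norm_clip=1.0, num_microbatches=32,
                 learning_rate=0.1):
    """Compile a small CNN for 28x28x1 inputs with a DP-SGD optimizer."""
    import tensorflow as tf
    import tensorflow_privacy

    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 8, strides=2, padding='same',
                               activation='relu', input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPool2D(2, 1),
        tf.keras.layers.Conv2D(32, 4, strides=2, padding='valid',
                               activation='relu'),
        tf.keras.layers.MaxPool2D(2, 1),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(1),  # single logit for the binary task
    ])
    optimizer = tensorflow_privacy.DPKerasSGDOptimizer(
        l2_norm_clip=l2_norm_clip,
        noise_multiplier=noise_multiplier,
        num_microbatches=num_microbatches,
        learning_rate=learning_rate)
    # The DP optimizer needs one loss value per example, hence no reduction.
    loss = tf.keras.losses.BinaryCrossentropy(
        from_logits=True, reduction=tf.keras.losses.Reduction.NONE)
    model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
    return model
```

When fitting such a model, the batch size has to be an integer multiple of `num_microbatches`, and the images need an explicit channel axis, i.e. shape `(n, 28, 28, 1)`.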

In [None]:
# use DP-SGD to train 3-layer CNN

# [insert your code here]

## Optional: Set up, train and evaluate CNN trained with "vanilla" SGD

* Retrain the CNN from the previous task with a "vanilla" (i.e. not differentially private) SGD optimizer, such as `tensorflow.keras.optimizers.SGD`.

In [None]:
# use "vanilla" SGD to train 3-layer CNN

# [insert your code here]

## Plot results

* Plot curves for each model, with $\epsilon$ on the x-axis and your utility metric ($F_1$ score) on the y-axis.

* What do you observe?
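
One way to organize the plot is a small helper that takes the measured scores per model. The sketch below assumes you collected your results in a dictionary mapping a model name to a pair of lists (epsilons, $F_1$ scores); the function name `plot_tradeoff` and this data layout are hypothetical. It uses a logarithmic x-axis because the requested $\epsilon$ values span two orders of magnitude.

```python
import matplotlib.pyplot as plt

def plot_tradeoff(results, baseline=None):
    """Plot F1 against epsilon.

    results:  {model name: (list of epsilons, list of F1 scores)}
    baseline: optional F1 score of a non-private model, drawn as a dashed line
    """
    fig, ax = plt.subplots()
    for name, (eps, f1) in results.items():
        ax.plot(eps, f1, marker='o', label=name)
    if baseline is not None:
        ax.axhline(baseline, linestyle='--', color='gray',
                   label='non-private SGD')
    ax.set_xscale('log')
    ax.set_xlabel(r'$\epsilon$')
    ax.set_ylabel(r'$F_1$ score')
    ax.legend()
    return fig, ax
```

A call could look like `plot_tradeoff({'LogReg (DP-SGD)': (eps_lr, f1_lr), 'CNN (DP-SGD)': (eps_cnn, f1_cnn)}, baseline=f1_vanilla)`, where all variable names stand for your own measurements.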

In [None]:
# line chart displaying privacy/utility trade-off

# [insert your code here]

**Bonus:** Try out the Gaussian-DP-based accounting [here](https://github.com/tensorflow/privacy/blob/master/tensorflow_privacy/privacy/analysis/gdp_accountant.py).
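
To get a feeling for what Gaussian-DP accounting computes, the sketch below re-implements the idea with the standard library only. It is not the API of the linked module: the function names are hypothetical, and the formulas are assumptions taken from the Gaussian differential privacy framework, namely the central-limit approximation of $\mu$ for Poisson-subsampled DP-SGD and the conversion $\delta(\epsilon) = \Phi(-\epsilon/\mu + \mu/2) - e^{\epsilon}\,\Phi(-\epsilon/\mu - \mu/2)$.

```python
import math

def _phi(x):
    """Standard normal CDF."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def gdp_mu_poisson(epochs, noise_multiplier, n, batch_size):
    """CLT approximation of mu for DP-SGD with Poisson subsampling."""
    steps = epochs * n / batch_size
    return (math.sqrt(math.exp(noise_multiplier ** -2) - 1.0)
            * math.sqrt(steps) * batch_size / n)

def gdp_delta(eps, mu):
    """delta(eps) implied by mu-GDP."""
    return (_phi(-eps / mu + mu / 2.0)
            - math.exp(eps) * _phi(-eps / mu - mu / 2.0))

def gdp_eps(mu, delta, tol=1e-6):
    """Epsilon such that mu-GDP implies (eps, delta)-DP, via bisection."""
    low, high = 0.0, 1.0
    while gdp_delta(high, mu) > delta:  # grow the bracket until it contains eps
        high *= 2.0
    while high - low > tol:
        mid = 0.5 * (low + high)
        if gdp_delta(mid, mu) > delta:
            low = mid
        else:
            high = mid
    return high
```

For example, `gdp_eps(gdp_mu_poisson(epochs, noise_multiplier, n, batch_size), delta=1e-4)` gives a value you can compare with the $\epsilon$ reported by the accountant used in the tasks above.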