# Chapter 14: Data Privacy for Machine Learning

Three main methods for privacy-preserving machine learning will be covered:
 - Differential Privacy
 - Federated Learning
 - Encrypted Machine Learning

## Data Privacy Issues

Answer the following questions to decide which privacy-preserving machine learning method to choose:
 - Who are you trying to keep the data private from?
 - Which parts of the system can be private and which can be exposed to the world?
 - Who are the trusted parties that can view the data?

### The Simplest Way to Increase Privacy

Only collect data that is necessary, and only give the model the data it really needs for a good prediction. Many fields, such as name, gender or similar attributes, can often be deleted without any negative effects. But be careful about potential biases that could result.
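
As a minimal sketch of this idea, assuming records arrive as plain Python dictionaries, the snippet below keeps only an allow-list of fields before the data reaches the model. The field names and the `REQUIRED_FIELDS` set are hypothetical and only serve as an illustration.

```python
# Hypothetical allow-list: only these fields are assumed to be needed by the model.
REQUIRED_FIELDS = {"age_group", "product_category", "purchase_amount"}


def minimize_record(record, required_fields=REQUIRED_FIELDS):
    """Return a copy of the record that contains only the required fields."""
    return {key: value for key, value in record.items() if key in required_fields}


raw_record = {
    "name": "Jane Doe",            # direct identifier, not needed for prediction
    "gender": "female",            # dropped here, but check for resulting bias
    "age_group": "30-39",
    "product_category": "books",
    "purchase_amount": 42.5,
}

model_input = minimize_record(raw_record)
print(model_input)
```

An allow-list is used rather than a block-list so that newly added fields are excluded by default.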

### What Data Needs to Be Kept Private?

 - Personally Identifying Information (PII) is data that can directly identify a single person.
 - Sensitive data, often defined as data that could harm someone if it were released (Health Data, Financial Data, ...).
 - Quasi-identifying data, i.e. data that could uniquely identify someone if enough of it is known.
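
To illustrate why quasi-identifying data matters, the following sketch counts how many combinations of seemingly harmless fields occur only once in a dataset. The records and the choice of `QUASI_IDENTIFIERS` are made up for this example.

```python
from collections import Counter

# Hypothetical records: none of these fields is a direct identifier on its own.
records = [
    {"zip_code": "10115", "birth_year": 1985, "gender": "f"},
    {"zip_code": "10115", "birth_year": 1985, "gender": "f"},
    {"zip_code": "10115", "birth_year": 1990, "gender": "m"},
    {"zip_code": "20095", "birth_year": 1972, "gender": "m"},
]

QUASI_IDENTIFIERS = ("zip_code", "birth_year", "gender")


def unique_combinations(rows, columns):
    """Return the quasi-identifier combinations that occur exactly once."""
    counts = Counter(tuple(row[col] for col in columns) for row in rows)
    return [combo for combo, count in counts.items() if count == 1]


at_risk = unique_combinations(records, QUASI_IDENTIFIERS)
print(f"{len(at_risk)} of {len(records)} records are unique on {QUASI_IDENTIFIERS}")
```

Every combination returned by `unique_combinations` points to exactly one record, so anyone who knows those values for a person can single out that person's row.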

## Differential Privacy (DP)

DP is a formalization of the idea that a query or a transformation of a dataset should not reveal whether a person is in that dataset.

### Local and Global Differential Privacy

In local DP, noise or randomness is added at the individual level, so privacy is maintained between an individual and the collector of the data.<br>
In global DP, noise is added to a transformation of the entire dataset. The data collector is trusted with the raw data, but the result of the transformation does not reveal data about an individual.
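
The difference can be sketched with two classic mechanisms, using only the standard library. Randomized response stands in for local DP (this particular coin-flip scheme corresponds to $\epsilon = \ln 3$), and a Laplace-noised count stands in for global DP. The data is synthetic and all function names are chosen for this illustration.

```python
import random


def randomized_response(truth, rng):
    """Local DP: each person randomizes their own yes/no answer before sharing it."""
    if rng.random() < 0.5:
        return truth                 # answer truthfully half of the time
    return rng.random() < 0.5        # otherwise report a fair coin flip


def estimate_true_rate(noisy_answers):
    """Undo the known bias: P(report yes) = 0.25 + 0.5 * true_rate."""
    observed = sum(noisy_answers) / len(noisy_answers)
    return 2 * (observed - 0.25)


def laplace_noise(scale, rng):
    """The difference of two exponential samples follows a Laplace distribution."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(values, epsilon, rng):
    """Global DP: a trusted collector adds noise to the aggregate only.

    A count changes by at most 1 when one person is added or removed,
    so the Laplace scale is 1 / epsilon.
    """
    return sum(values) + laplace_noise(1 / epsilon, rng)


rng = random.Random(0)
answers = [rng.random() < 0.3 for _ in range(100_000)]   # synthetic data, ~30% "yes"

local_reports = [randomized_response(a, rng) for a in answers]
print("local DP estimate of the yes-rate:", estimate_true_rate(local_reports))
print("global DP count of yes answers:", private_count(answers, epsilon=0.5, rng=rng))
```

In the local case the collector never sees a trustworthy individual answer, while in the global case the collector holds the raw `answers` and only the published count is noisy.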

### Epsilon, Delta and the Privacy Budget

See page 404 for details.
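
For orientation, the standard textbook definition behind these terms is given here (it is the general formulation, not a quote from the referenced page). A randomized mechanism $M$ is $(\epsilon, \delta)$-differentially private if, for any two datasets $D$ and $D'$ that differ in one person's data and for any set of outputs $S$:

$$
\Pr[M(D) \in S] \le e^{\epsilon} \, \Pr[M(D') \in S] + \delta
$$

A smaller $\epsilon$ means the outputs on $D$ and $D'$ are harder to tell apart. Loosely speaking, $\delta$ is a small slack that allows the bound to be exceeded with low probability. The privacy budget follows from basic composition: running $k$ mechanisms that are $(\epsilon_i, \delta_i)$-differentially private on the same data is, in total,

$$
\left( \sum_{i=1}^{k} \epsilon_i, \; \sum_{i=1}^{k} \delta_i \right)\text{-differentially private,}
$$

so every additional query or training step spends part of a fixed overall budget.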

### Differential Privacy for Machine Learning

Places DP can be used:
 - Federated Learning System
  - Either local or global DP
 - TensorFlow Privacy Library
  - Global DP (Raw data is available for model training)
 - Private Aggregation of Teacher Ensembles (PATE) approach

## Introduction to TensorFlow Privacy

TensorFlow Privacy (TFP) adds DP to an optimizer during model training. The type of DP used in TFP is an example of global DP, i.e. noise is added during training so that private data is not exposed in a model's predictions.

### Training with a Differentially Private Optimizer

TFP can be installed with pip, but it currently only works with TensorFlow 1.x:

In [None]:
!pip install tensorflow_privacy

In [1]:
import tensorflow as tf

In [2]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

The differentially private optimizer requires that we set two extra hyperparameters compared to a normal tf.keras model:
 - Noise Multiplier
 - L2 Norm Clip

It's best to tune these to suit your dataset and measure their impact on $\epsilon$:

In [3]:
# DP parameters
NOISE_MULTIPLIER = 1.1
NUM_MICROBATCHES = 32 # The batch size must be exactly divisible by the number of microbatches
LEARNING_RATE = 0.1
POPULATION_SIZE = 1000 # The number of examples in the training set
L2_NORM_CLIP = 1.0
BATCH_SIZE = 32 # The population size must be exactly divisible by the batch size
EPOCHS = 1
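
To make the roles of the noise multiplier and the L2 norm clip concrete, here is a simplified, standard-library sketch of a single differentially private gradient step. It is not TFP's implementation: microbatching is ignored (every example is treated as its own microbatch), gradients are plain Python lists, and the function names are chosen for this illustration.

```python
import math
import random


def clip_gradient(gradient, l2_norm_clip):
    """Scale a per-example gradient down so its L2 norm is at most l2_norm_clip."""
    norm = math.sqrt(sum(g * g for g in gradient))
    scale = min(1.0, l2_norm_clip / norm) if norm > 0 else 1.0
    return [g * scale for g in gradient]


def private_gradient(per_example_gradients, l2_norm_clip, noise_multiplier, rng):
    """Clip each gradient, sum them, add Gaussian noise, and average."""
    clipped = [clip_gradient(g, l2_norm_clip) for g in per_example_gradients]
    summed = [sum(column) for column in zip(*clipped)]
    # The noise scale is tied to the clip: stddev = noise_multiplier * l2_norm_clip.
    stddev = noise_multiplier * l2_norm_clip
    noisy = [s + rng.gauss(0.0, stddev) for s in summed]
    return [value / len(per_example_gradients) for value in noisy]


rng = random.Random(0)
gradients = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]   # toy per-example gradients
print(private_gradient(gradients, l2_norm_clip=1.0, noise_multiplier=1.1, rng=rng))
```

Clipping bounds how much any single example can influence the update, and the noise then hides what influence remains. A larger noise multiplier gives a smaller $\epsilon$ at the cost of a noisier gradient.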

Initialize the differentially private optimizer:

In [4]:
from tensorflow_privacy.privacy.optimizers.dp_optimizer import DPGradientDescentGaussianOptimizer

In [6]:
optimizer = DPGradientDescentGaussianOptimizer(
    l2_norm_clip=L2_NORM_CLIP,
    noise_multiplier=NOISE_MULTIPLIER,
    num_microbatches=NUM_MICROBATCHES,
    learning_rate=LEARNING_RATE
)

# Loss must be calculated on a per-example basis rather than over an entire mini-batch
loss = tf.keras.losses.BinaryCrossentropy(
    from_logits=False,  # the model's last layer already applies a sigmoid
    reduction=tf.losses.Reduction.NONE
)
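
The reason for the per-example loss is that the private optimizer has to clip the contribution of each example (or microbatch) separately, which is impossible once the batch has been collapsed into one number. The following plain-Python sketch, with made-up labels and predictions, shows the difference between the two shapes.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Binary cross-entropy for a single example, given a predicted probability."""
    p = min(max(y_prob, eps), 1 - eps)   # avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


labels = [1, 0, 1, 1]
predictions = [0.9, 0.2, 0.6, 0.1]

# One loss value per example: this is what a reduction of NONE preserves.
per_example_loss = [binary_crossentropy(y, p) for y, p in zip(labels, predictions)]

# The usual reduction collapses the batch into a single scalar.
mean_loss = sum(per_example_loss) / len(per_example_loss)

print(per_example_loss)
print(mean_loss)
```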

Training a private model is just like training a normal tf.keras model:

In [None]:
model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])

model.fit(X_train,
          y_train,
          epochs=EPOCHS,
          validation_data=(X_test, y_test),
          batch_size=BATCH_SIZE
)

### Calculating Epsilon

Now, we calculate the differential privacy parameters for our model and our choice of noise multiplier and gradient clip:

In [7]:
from tensorflow_privacy.privacy.analysis import compute_dp_sgd_privacy

In [9]:
compute_dp_sgd_privacy.compute_dp_sgd_privacy(
    n=POPULATION_SIZE,
    batch_size=BATCH_SIZE,
    noise_multiplier=NOISE_MULTIPLIER,
    epochs=EPOCHS,
    delta=1e-4 # A common guideline is to choose delta smaller than 1/(size of the dataset), rounded to an order of magnitude
)

DP-SGD with sampling rate = 3.2% and noise_multiplier = 1.1 iterated over 32 steps satisfies differential privacy with eps = 1.72 and delta = 0.0001.
The optimal RDP order is 8.0.


(1.715653096511666, 8.0)
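
The sampling rate and step count in the reported output can be reproduced by hand. The sketch below assumes the usual DP-SGD accounting, where the sampling rate is the batch size divided by the population size and the number of steps is rounded up. It does not recompute $\epsilon$ itself, which requires the library's RDP accountant.

```python
import math

# Values copied from the parameter cell above.
POPULATION_SIZE = 1000
BATCH_SIZE = 32
EPOCHS = 1

sampling_rate = BATCH_SIZE / POPULATION_SIZE
steps = math.ceil(EPOCHS * POPULATION_SIZE / BATCH_SIZE)

print(f"sampling rate = {sampling_rate:.1%}, steps = {steps}")
```

Because 1000 is not a multiple of 32, the last partial batch is counted as a full step, which gives the 32 steps shown in the output.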

## Federated Learning (FL)

FL is a protocol where the training of a machine learning model is distributed across many different devices and the trained model is combined on a central server. The key point is that the raw data never leaves the separate devices and is never pooled in one place. This is very different from the traditional architecture of gathering a dataset in a central location and then training a model.<br>
Useful applications:
 - Mobile phones
 - Users' browsers
 - Sharing of sensitive data that is distributed across multiple data owners.

FL is most useful in use cases that share the following characteristics:
 - The data required for the model can only be collected from distributed sources
 - The number of data sources is large
 - The data is sensitive in some way
 - The data does not require extra labelling - the labels are provided directly by the user and do not leave the source
 - Ideally, the data is drawn from close to identical distributions

### Federated Learning in TensorFlow

TensorFlow Federated (TFF) simulates the distributed setup of FL and contains a version of SGD that can calculate updates on distributed data.
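
The FL protocol itself can be sketched without TFF. The plain-Python example below is not the TFF API. It simulates two clients that each train a one-weight linear model on their own synthetic data, and a server that averages the resulting weights in proportion to each client's number of examples. All names and hyperparameters are assumptions made for this illustration.

```python
def local_update(weight, data, learning_rate=0.05, local_epochs=5):
    """Train on one client's private data and return the new weight and data size.

    The model is y = weight * x with a mean squared error loss.
    """
    for _ in range(local_epochs):
        gradient = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= learning_rate * gradient
    return weight, len(data)


def federated_average(client_results):
    """Combine client weights on the server, weighted by number of examples."""
    total_examples = sum(n for _, n in client_results)
    return sum(weight * n for weight, n in client_results) / total_examples


# Synthetic private datasets following y = 2 * x; they never leave the "devices".
client_datasets = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)],
]

global_weight = 0.0
for round_number in range(20):
    # Each client starts from the current global model and trains locally.
    results = [local_update(global_weight, data) for data in client_datasets]
    # Only the updated weights are sent to the server, never the raw data.
    global_weight = federated_average(results)

print(global_weight)
```

The server only ever handles `results`, so the raw `(x, y)` pairs stay with their owners. In this toy setup both clients share the same underlying relationship, which matches the "close to identical distributions" condition listed above.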

## Encrypted Machine Learning

Encrypted Machine Learning leans on technology and research from the cryptographic community and applies these techniques to machine learning. The major methods that have been adopted so far are homomorphic encryption (HE) and secure multiparty computation (SMPC).<br>
There are two ways to use these techniques:
 - Encrypting a model that has already been trained on plain text data
 - Encrypting an entire system (if the data must stay encrypted during training)
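
As a rough illustration of the idea behind SMPC, the sketch below uses additive secret sharing, one common building block, to let three parties compute the sum of two private values without any of them seeing the inputs. It is a toy example and not the protocol of any particular library. The modulus, function names and values are assumptions, and a real system would draw its randomness from a cryptographically secure source rather than `random`.

```python
import random

MODULUS = 2**61 - 1   # all arithmetic on shares is done modulo this number


def share(secret, num_parties, rng):
    """Split an integer into random shares that sum to the secret modulo MODULUS."""
    shares = [rng.randrange(MODULUS) for _ in range(num_parties - 1)]
    shares.append((secret - sum(shares)) % MODULUS)
    return shares


def reconstruct(shares):
    """Recombine all shares; an incomplete set of shares looks like random numbers."""
    return sum(shares) % MODULUS


rng = random.Random(0)   # fixed seed for reproducibility only, not secure
salary_a = share(52_000, num_parties=3, rng=rng)
salary_b = share(61_000, num_parties=3, rng=rng)

# Each party adds the two shares it holds, without seeing either salary.
sum_shares = [(a + b) % MODULUS for a, b in zip(salary_a, salary_b)]

print(reconstruct(sum_shares))
```

Addition works directly on shares because the sharing is linear. Multiplications and non-linear functions need more elaborate protocols.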

### Encrypted Model Training

Encrypted model training is useful when the raw data needs to be kept private from the data scientist training the model, or when two or more parties own the raw data and want to train a model using all parties' data, but don't want to share the raw data.<br>
TensorFlow Encrypted (TFE) can be used to train an encrypted model for this use case.

In [None]:
!pip install tf_encrypted

The first step in building a TFE model is to define a class that yields training data in batches. This class is implemented locally by the data owner(s). It is converted to encrypted data using a decorator:
    
    @tfe.local_computation

In [11]:
import tf_encrypted as tfe

In [None]:
# batch_size and num_features are assumed to be defined to match the training data
model = tfe.keras.Sequential()
model.add(tfe.keras.layers.Dense(1, batch_input_shape=[batch_size, num_features]))
model.add(tfe.keras.layers.Activation("sigmoid"))

### Converting a Trained Model to Serve Encrypted Predictions

The second scenario where TFE is useful is when you would like to serve encrypted models that have been trained on plain-text data. In this case, you have full access to the unencrypted training data, but you want the users of your application to be able to receive private predictions. This provides privacy to the users, who upload encrypted data and receive an encrypted prediction.<br>
Keras models can be converted to TFE models via:

    tfe_model = tfe.keras.models.clone_model(model)

In this scenario, the following steps need to be carried out:
 - Load and preprocess the data locally on the client
 - Encrypt the data on the client
 - Send the encrypted data to the servers
 - Make a prediction on the encrypted data
 - Send the encrypted prediction to the client
 - Decrypt the prediction on the client and show the result to the user.
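
The steps above can be sketched end to end with a toy stand-in for the encryption. The example does not use the TFE API. Instead, it assumes two servers, additive secret sharing of the client's features, and a linear model with small integer weights that the servers hold in plain text. A real system would also protect the weights and encode floating-point values as fixed-point numbers. All names are hypothetical.

```python
import random

MODULUS = 2**61 - 1


def encrypt(features, rng):
    """Client side: split each feature into two additive shares, one per server."""
    share_0 = [rng.randrange(MODULUS) for _ in features]
    share_1 = [(x - s) % MODULUS for x, s in zip(features, share_0)]
    return share_0, share_1


def predict_on_share(weights, feature_share):
    """Server side: a linear model can be evaluated on a share directly."""
    return sum(w * x for w, x in zip(weights, feature_share)) % MODULUS


def decrypt(result_0, result_1):
    """Client side: adding the two partial results reveals the prediction."""
    return (result_0 + result_1) % MODULUS


weights = [3, 1, 2]                     # toy integer model held by the servers
features = [10, 20, 30]                 # loaded and preprocessed on the client

rng = random.Random(0)                  # fixed seed for reproducibility only
share_0, share_1 = encrypt(features, rng)        # encrypt on the client
result_0 = predict_on_share(weights, share_0)    # server 0 sees only share_0
result_1 = predict_on_share(weights, share_1)    # server 1 sees only share_1
prediction = decrypt(result_0, result_1)         # returned to and decrypted by the client

print(prediction)
```

Neither server can recover `features` from its own share, yet the client obtains the same value as a plain dot product of `weights` and `features`.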

# References and Additional Resources

 - <a href="https://github.com/tensorflow/privacy">TensorFlow Privacy Github</a>
 - <a href="https://medium.com/dropoutlabs/introducing-pysyft-tensorflow-cc361ac75137">PySyft TensorFlow Medium Article</a>
 - <a href="https://tf-encrypted.io/">TensorFlow Encrypted</a>
 - <a href="https://github.com/tf-encrypted/tf-encrypted/tree/master/examples/notebooks/keras-training">TensorFlow Encrypted Github</a>
 - <a href="https://github.com/tf-encrypted/tf-encrypted/tree/master/examples/notebooks">TFE Notebook examples for serving private predictions</a>