## AI PRIVACY USING DIFFERENTIAL PRIVACY

### PROBLEM STATEMENT:
Data anonymization is the process of removing personal identifiers, both direct and indirect, that could lead to an individual being identified when AI/ML models are trained. This helps organizations maintain confidentiality and AI privacy.
To protect data that holds sensitive information, intelligent solutions should be built on privacy frameworks and accelerators that use *differential privacy*.

Differential privacy makes it possible to obtain useful information from a dataset without divulging private information about, or the identity of, any individual.

It does this by adding "noise" to the data, or to computations on it such as the gradients used in training, so that a user cannot identify any individual's data.
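
To make the idea of "adding noise" concrete, here is a minimal sketch of the classic Laplace mechanism applied to a counting query. It is not the method used later in this notebook (which adds Gaussian noise to gradients); the `ages` records, the helper names and the choice of `epsilon=0.5` are all assumptions made for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon, rng):
    """Return a noisy count of the items that satisfy `predicate`.

    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)
ages = [23, 35, 41, 29, 52, 67, 38, 45, 31, 58]  # hypothetical records
true_count = sum(1 for a in ages if a >= 40)      # 5 people are 40 or older

# Each released answer is perturbed, so no single answer pins down one record.
noisy_answers = [private_count(ages, lambda a: a >= 40, epsilon=0.5, rng=rng)
                 for _ in range(5)]
print(true_count, [round(x, 2) for x in noisy_answers])
```

Individual answers vary around the true count, while the average over many answers stays close to it. A smaller `epsilon` means a larger noise scale and therefore stronger privacy.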

#### IMPORT NECESSARY LIBRARIES

In [7]:
import tensorflow as tf
import numpy as np

In [8]:
tf.compat.v1.disable_v2_behavior()

Instructions for updating:
non-resource variables are not supported in the long term


In [10]:
import tensorflow as tf
print(tf.reduce_sum(tf.random.normal([1000, 1000])))

Tensor("Sum:0", shape=(), dtype=float32)


In [12]:
tf.get_logger().setLevel('ERROR')

In [14]:
import tensorflow_privacy

from tensorflow_privacy.privacy.analysis import compute_dp_sgd_privacy

### Load and pre-process the dataset

In [15]:
train, test = tf.keras.datasets.mnist.load_data()
train_data, train_labels = train
test_data, test_labels = test

train_data = np.array(train_data, dtype=np.float32) / 255
test_data = np.array(test_data, dtype=np.float32) / 255

train_data = train_data.reshape(train_data.shape[0], 28, 28, 1)
test_data = test_data.reshape(test_data.shape[0], 28, 28, 1)

train_labels = np.array(train_labels, dtype=np.int32)
test_labels = np.array(test_labels, dtype=np.int32)

train_labels = tf.keras.utils.to_categorical(train_labels, num_classes=10)
test_labels = tf.keras.utils.to_categorical(test_labels, num_classes=10)

assert train_data.min() == 0.
assert train_data.max() == 1.
assert test_data.min() == 0.
assert test_data.max() == 1.
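
The preprocessing above does three things: it scales pixels to [0, 1], adds a channel axis, and one-hot encodes the labels. The sketch below repeats those steps on a tiny hand-made array so the effect of each one is visible. The 2x2 "images" and their labels are hypothetical, and `np.eye(10)[labels]` is used here as a plain-NumPy stand-in for `to_categorical`.

```python
import numpy as np

# Hypothetical stand-in for a tiny batch of 8-bit grayscale images (2 images of 2x2 pixels).
raw_images = np.array([[[0, 128], [255, 64]],
                       [[255, 0], [32, 16]]], dtype=np.uint8)
raw_labels = np.array([3, 0])

# Scale pixel intensities from [0, 255] to [0, 1].
images = raw_images.astype(np.float32) / 255

# Add a trailing channel axis: (batch, height, width) -> (batch, height, width, 1).
images = images.reshape(images.shape[0], 2, 2, 1)

# One-hot encode: row i of the identity matrix is the encoding of class i.
one_hot = np.eye(10, dtype=np.float32)[raw_labels]

print(images.shape)   # (2, 2, 2, 1)
print(one_hot.shape)  # (2, 10)
```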

### Define hyperparameters

Epochs - The number of complete passes of the training dataset through the algorithm.

Batch size - The number of training examples used in one iteration (one gradient update).

In [16]:
epochs = 3
batch_size = 250
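
These two values fix how many gradient updates are performed and what fraction of the dataset each update sees. The short sketch below works through the arithmetic for the 60,000 MNIST training examples. `training_schedule` is a hypothetical helper, and it assumes the dataset size is an exact multiple of the batch size. The privacy report later in this notebook shows the same 720 steps and 0.417% sampling rate.

```python
def training_schedule(num_examples, batch_size, epochs):
    """Return (steps_per_epoch, total_steps, sampling_rate) for minibatch training."""
    steps_per_epoch = num_examples // batch_size
    total_steps = steps_per_epoch * epochs
    sampling_rate = batch_size / num_examples
    return steps_per_epoch, total_steps, sampling_rate


steps_per_epoch, total_steps, sampling_rate = training_schedule(
    num_examples=60000, batch_size=250, epochs=3)
print(steps_per_epoch)         # 240
print(total_steps)             # 720
print(f"{sampling_rate:.3%}")  # 0.417%
```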

1. l2_norm_clip - The maximum Euclidean (L2) norm of each gradient that is applied to update model parameters. This hyperparameter bounds the optimizer's sensitivity to individual training points.

2. noise_multiplier - Controls how much noise is added to the gradients during training. More noise generally gives stronger privacy, often at some cost in accuracy.

3. num_microbatches - Each batch of data is split into smaller units called microbatches. By default, each microbatch should contain a single training example. This allows us to clip gradients on a per-example basis rather than after they have been averaged across the minibatch.

4. learning_rate - The tuning parameter of an optimization algorithm that determines the step size at each iteration while moving toward a minimum of the loss function.
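
The way these four hyperparameters interact can be sketched as a single update step in plain NumPy. This is a simplified illustration of differentially private gradient descent with one example per microbatch, not the TensorFlow Privacy implementation; `dp_sgd_step` and the toy gradients are assumptions made for the example.

```python
import numpy as np


def dp_sgd_step(params, per_example_grads, l2_norm_clip, noise_multiplier,
                learning_rate, rng):
    """Apply one differentially private gradient descent update.

    per_example_grads has shape (batch_size, num_params): one gradient per
    training example (that is, one example per microbatch).
    """
    # 1. Clip each per-example gradient so its L2 norm is at most l2_norm_clip.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, l2_norm_clip / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale

    # 2. Sum the clipped gradients and add Gaussian noise whose standard
    #    deviation is l2_norm_clip * noise_multiplier.
    noise = rng.normal(0.0, l2_norm_clip * noise_multiplier, size=params.shape)
    noisy_sum = clipped.sum(axis=0) + noise

    # 3. Average over the batch and take a gradient descent step.
    noisy_mean = noisy_sum / per_example_grads.shape[0]
    return params - learning_rate * noisy_mean, clipped


rng = np.random.default_rng(0)
params = np.zeros(3)
grads = np.array([[3.0, 4.0, 0.0],   # norm 5.0, scaled down to norm 1.5
                  [0.3, 0.4, 0.0]])  # norm 0.5, left unchanged
new_params, clipped = dp_sgd_step(params, grads, l2_norm_clip=1.5,
                                  noise_multiplier=1.3, learning_rate=0.25, rng=rng)
print(np.linalg.norm(clipped, axis=1))  # every norm is at most 1.5
```

Clipping caps how much any single example can move the parameters, and the noise is scaled to that cap, which is why the two settings are usually tuned together.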





In [17]:
l2_norm_clip = 1.5
noise_multiplier = 1.3
num_microbatches = 250
learning_rate = 0.25

if batch_size % num_microbatches != 0:
  raise ValueError('Batch size should be an integer multiple of the number of microbatches')
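
The divisibility check exists because every microbatch must hold the same number of examples. A minimal sketch of the split, using a hypothetical `split_into_microbatches` helper and integer indices in place of real training examples:

```python
def split_into_microbatches(batch, num_microbatches):
    """Split `batch` into `num_microbatches` equally sized consecutive groups."""
    if len(batch) % num_microbatches != 0:
        raise ValueError('Batch size should be an integer multiple of the number of microbatches')
    size = len(batch) // num_microbatches
    return [batch[i:i + size] for i in range(0, len(batch), size)]


batch = list(range(250))  # hypothetical example indices standing in for one batch

per_example = split_into_microbatches(batch, 250)
print(len(per_example), len(per_example[0]))  # 250 1

grouped = split_into_microbatches(batch, 50)
print(len(grouped), len(grouped[0]))          # 50 5
```

With `num_microbatches` equal to `batch_size`, as in this notebook, each microbatch holds one example and clipping is applied per example. Using fewer microbatches clips the averaged gradient of each group instead, which is cheaper to compute but coarser.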

### Build the model

In [18]:
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 8,
                           strides=2,
                           padding='same',
                           activation='relu',
                           input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPool2D(2, 1),
    tf.keras.layers.Conv2D(32, 4,
                           strides=2,
                           padding='valid',
                           activation='relu'),
    tf.keras.layers.MaxPool2D(2, 1),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(10)
])
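
To see what this architecture does to a 28x28 input, the sketch below traces the spatial size through each layer and counts the trainable parameters. It uses the standard output-size rules for 'same' and 'valid' padding; `conv_or_pool_output` is a hypothetical helper, and the pooling layers are assumed to use Keras's default 'valid' padding.

```python
import math


def conv_or_pool_output(size, kernel, stride, padding):
    """Spatial output size of a Keras-style convolution or pooling layer."""
    if padding == 'same':
        return math.ceil(size / stride)
    return (size - kernel) // stride + 1  # 'valid' padding


size = 28
size = conv_or_pool_output(size, kernel=8, stride=2, padding='same')   # Conv2D(16, 8) -> 14
size = conv_or_pool_output(size, kernel=2, stride=1, padding='valid')  # MaxPool2D(2, 1) -> 13
size = conv_or_pool_output(size, kernel=4, stride=2, padding='valid')  # Conv2D(32, 4) -> 5
size = conv_or_pool_output(size, kernel=2, stride=1, padding='valid')  # MaxPool2D(2, 1) -> 4
flattened = size * size * 32  # 32 channels after the second convolution
print(size, flattened)  # 4 512

# Weights plus biases for the two convolutions and the two dense layers.
params = ((8 * 8 * 1 * 16 + 16) + (4 * 4 * 16 * 32 + 32)
          + (flattened * 32 + 32) + (32 * 10 + 10))
print(params)  # 26010
```

The final `Dense(10)` layer has no activation, so the model outputs raw logits, which is why the loss defined below is created with `from_logits=True`.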

In [19]:
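# Privacy accounting helper (already imported above; repeated so this cell runs on its own).
from tensorflow_privacy.privacy.analysis import compute_dp_sgd_privacy
# Differentially private gradient descent optimizer that adds Gaussian noise.
from tensorflow_privacy.privacy.optimizers.dp_optimizer import DPGradientDescentGaussianOptimizer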


In [20]:
import sys
from tensorflow_privacy.version import __version__
if hasattr(sys, 'skip_tf_privacy_import'):
  # Useful for standalone scripts.
  pass
else:
  # TensorFlow v1 imports
  from tensorflow_privacy import v1

 

Define the optimizer and loss function for the learning model.

In [21]:
optimizer = DPGradientDescentGaussianOptimizer(
    l2_norm_clip=l2_norm_clip,
    noise_multiplier=noise_multiplier,
    num_microbatches=num_microbatches,
    learning_rate=learning_rate)

In [22]:
loss = tf.keras.losses.CategoricalCrossentropy(
    from_logits=True, reduction=tf.losses.Reduction.NONE)
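
Setting the reduction to `NONE` keeps the loss as a vector with one value per example instead of a single batch average, which is what lets the optimizer work on each microbatch separately. The plain-Python sketch below shows the difference on hypothetical logits and labels; `softmax_cross_entropy` is an illustrative helper, not the Keras implementation.

```python
import math


def softmax_cross_entropy(logits, one_hot_label):
    """Cross-entropy between a one-hot label and softmax(logits) for one example."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(y * (z - log_sum_exp) for z, y in zip(logits, one_hot_label))


batch_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # hypothetical model outputs
batch_labels = [[1, 0, 0], [0, 1, 0]]

# No reduction: one loss value per example.
per_example_losses = [softmax_cross_entropy(z, y)
                      for z, y in zip(batch_logits, batch_labels)]

# Usual reduction: a single scalar for the whole batch.
mean_loss = sum(per_example_losses) / len(per_example_losses)
print(per_example_losses, mean_loss)
```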

### Train the model

In [23]:
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])

model.fit(train_data, train_labels,
          epochs=epochs,
          validation_data=(test_data, test_labels),
          batch_size=batch_size)

Train on 60000 samples, validate on 10000 samples
Epoch 1/3

  updates = self.state_updates


Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x217359eb8b0>

Two metrics are used to express the DP guarantee of an ML algorithm:

Delta (δ) - Bounds the probability of the privacy guarantee not holding. A rule of thumb is to set it to be less than the inverse of the size of the training dataset. In this tutorial, it is set to 10^-5 as the MNIST dataset has 60,000 training points.

Epsilon (ε) - This is the privacy budget. It measures the strength of the privacy guarantee by bounding how much the probability of a particular model output can vary by including (or excluding) a single training point. A smaller value for ε implies a better privacy guarantee. However, the ε value is only an upper bound and a large value could still mean good privacy in practice.
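
A short numeric sketch may help in reading these two quantities. It checks the rule of thumb for δ and converts ε into the multiplicative bound it implies. `max_probability_ratio` is a hypothetical helper, and the two ε values are the ones reported by the privacy computations in this notebook.

```python
import math

num_training_points = 60000
delta = 1e-5

# Rule of thumb: delta should be below 1 / (number of training points).
print(delta < 1 / num_training_points)  # True, since 1/60000 is about 1.67e-5


def max_probability_ratio(epsilon):
    """Factor exp(epsilon) in the bound Pr[M(D) in S] <= exp(epsilon) * Pr[M(D') in S] + delta."""
    return math.exp(epsilon)


# D and D' are datasets that differ in a single training point.
for epsilon in (0.79, 1.18):
    print(epsilon, round(max_probability_ratio(epsilon), 2))
```

Because ε sits in an exponent, small increases in ε loosen the bound quickly, which is why lower values are preferred.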

In [24]:
compute_dp_sgd_privacy.compute_dp_sgd_privacy(n=train_data.shape[0],
                                              batch_size=batch_size,
                                              noise_multiplier=noise_multiplier,
                                              epochs=epochs,
                                              delta=1e-5)

DP-SGD with sampling rate = 0.417% and noise_multiplier = 1.3 iterated over 720 steps satisfies differential privacy with eps = 0.79 and delta = 1e-05.
The optimal RDP order is 18.0.


(0.7903529309843027, 18.0)

In [25]:
compute_dp_sgd_privacy.compute_dp_sgd_privacy(n=60000, batch_size=250, noise_multiplier=1.3, epochs=15, delta=1e-5)

DP-SGD with sampling rate = 0.417% and noise_multiplier = 1.3 iterated over 3600 steps satisfies differential privacy with eps = 1.18 and delta = 1e-05.
The optimal RDP order is 17.0.


(1.179900673982703, 17.0)

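The hyperparameters can be tuned to trade off accuracy against epsilon. For example, with the other settings unchanged, raising the number of epochs from 3 to 15 increases epsilon from 0.79 to 1.18, as the two computations above show.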