## AI PRIVACY USING DIFFERENTIAL PRIVACY

### PROBLEM STATEMENT:
Data anonymization is the process of removing direct and indirect personal identifiers that could lead to an individual being identified when AI/ML models are trained on the data. This helps organizations maintain confidentiality and AI privacy.
To protect datasets that hold sensitive information, intelligent solutions should be built with appropriate privacy frameworks and accelerators, such as *differential privacy*.


Differential privacy makes it possible to extract useful information from a dataset without divulging private information about, or the identity of, any individual.

Differential privacy addresses this problem by adding calibrated "noise", so that a user cannot identify any individual's data from the result.
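As a rough illustration of the noise idea, the sketch below answers a counting query with the Laplace mechanism, a standard differential privacy technique. It is not the mechanism used later in this notebook, which adds Gaussian noise to gradients. The `ages` list, the helper names, and the choice of `epsilon=0.5` are all hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample zero-mean Laplace noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Return a noisy count of the records that satisfy predicate."""
    true_count = sum(1 for r in records if predicate(r))
    # Adding or removing one record changes a count by at most 1,
    # so the sensitivity of a counting query is 1.
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)                       # fixed seed for repeatability
ages = [23, 35, 41, 29, 52, 61, 38, 45]      # hypothetical data
noisy = private_count(ages, lambda a: a >= 40, epsilon=0.5, rng=rng)
print(f"true count = 4, noisy count = {noisy:.2f}")
```

A smaller `epsilon` means a larger noise scale, so individual answers become less accurate but reveal less about any single record.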


#### IMPORT NECESSARY LIBRARIES

In [5]:
%tensorflow_version 2.x
import tensorflow  as tf
import numpy as np
from keras import backend as K


In [6]:
tf.compat.v1.disable_v2_behavior()

Instructions for updating:
non-resource variables are not supported in the long term


In [8]:
pip install tensorflow_privacy

Collecting tensorflow_privacy
  Downloading tensorflow_privacy-0.8.0-py3-none-any.whl (287 kB)
Collecting pandas~=1.1.4
  Downloading pandas-1.1.5-cp37-cp37m-manylinux1_x86_64.whl (9.5 MB)
Collecting matplotlib~=3.3.4
  Downloading matplotlib-3.3.4-cp37-cp37m-manylinux1_x86_64.whl (11.5 MB)
Collecting tensorflow-datasets~=4.5.2
  Downloading tensorflow_datasets-4.5.2-py3-none-any.whl (4.2 MB)
Collecting attrs~=21.2.0
  Downloading attrs-21.2.0-py2.py3-none-any.whl (53 kB)
Collecting scipy~=1.5.0
  Downloading scipy-1.5.4-cp37-cp37m-manylinux1_x86_64.whl (25.9 MB)
Collecting tensorflow-probability~=0.15.0


In [9]:
import tensorflow_privacy
from tensorflow_privacy.privacy.analysis import compute_dp_sgd_privacy

### Load and pre-process the dataset

In [10]:
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz


In [11]:
x_train = np.array(x_train,dtype=np.float32)/255
x_test =  np.array(x_test,dtype=np.float32)/255
y_train = tf.keras.utils.to_categorical(y_train, num_classes=10)
y_test = tf.keras.utils.to_categorical(y_test, num_classes=10)

In [12]:
x_train.shape

(50000, 32, 32, 3)

In [13]:
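# x_train and x_test were already scaled to [0, 1] in cell In [11];
# dividing by 255 a second time would shrink the pixels to [0, 1/255].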

### Define hyperparameters
Epochs - One epoch is one complete pass of the training dataset through the algorithm.

Batch size - The number of training examples used in one iteration.

1. l2_norm_clip - The maximum Euclidean (L2) norm of each gradient that is applied to update model parameters. This hyperparameter bounds the optimizer's sensitivity to individual training points.

2. noise_multiplier - Controls the amount of noise that is sampled and added to the gradients during training. More noise generally gives stronger privacy, often at the cost of accuracy.

3. num_microbatches - Each batch of data is split into smaller units called microbatches. By default, each microbatch should contain a single training example, which allows gradients to be clipped on a per-example basis rather than after they have been averaged across the minibatch.

4. learning_rate - The tuning parameter of the optimization algorithm that determines the step size at each iteration while moving toward a minimum of the loss function.
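To show how the clipping and noise hyperparameters interact, here is a minimal pure-Python sketch of one differentially private gradient average with one example per microbatch. It follows the usual DP-SGD recipe (clip each gradient, sum, add Gaussian noise with standard deviation `noise_multiplier * l2_norm_clip`, then average). The function names and toy gradients are hypothetical, and this is not the TensorFlow Privacy implementation.

```python
import math
import random


def clip_by_l2_norm(grad, l2_norm_clip):
    """Scale grad down so that its L2 norm is at most l2_norm_clip."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, l2_norm_clip / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_average_gradient(per_example_grads, l2_norm_clip, noise_multiplier, rng):
    """Clip each per-example gradient, sum, add Gaussian noise, then average."""
    dim = len(per_example_grads[0])
    clipped = [clip_by_l2_norm(g, l2_norm_clip) for g in per_example_grads]
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # The noise scale is tied to the clipping norm, i.e. to the largest
    # contribution any single example can make to the sum.
    stddev = noise_multiplier * l2_norm_clip
    noisy = [s + rng.gauss(0.0, stddev) for s in summed]
    return [v / len(per_example_grads) for v in noisy]


rng = random.Random(0)
toy_grads = [[3.0, 4.0], [0.3, 0.4]]   # hypothetical per-example gradients
print(dp_average_gradient(toy_grads, l2_norm_clip=1.5,
                          noise_multiplier=1.3, rng=rng))
```

The first toy gradient has norm 5 and is scaled down to norm 1.5, while the second (norm 0.5) passes through unchanged. This is how clipping limits the influence of any one example before noise is added.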



In [14]:
epochs = 1
batch_size = 250
l2_norm_clip = 1.5
noise_multiplier = 1.3
num_microbatches = 250
learning_rate = 0.25

if batch_size % num_microbatches != 0:
  raise ValueError('Batch size should be an integer multiple of the number of microbatches')

In [15]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Activation,Flatten

In [45]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d (Conv2D)             (None, 16, 16, 16)        3088      
                                                                 
 max_pooling2d (MaxPooling2D  (None, 15, 15, 16)       0         
 )                                                               
                                                                 
 conv2d_1 (Conv2D)           (None, 6, 6, 32)          8224      
                                                                 
 max_pooling2d_1 (MaxPooling  (None, 5, 5, 32)         0         
 2D)                                                             
                                                                 
 flatten (Flatten)           (None, 800)               0         
                                                                 
 dense (Dense)               (None, 32)                2

In [23]:
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 8,
                           strides=2,
                           padding='same',
                           activation='relu',
                           input_shape=(32, 32, 3)),
    tf.keras.layers.MaxPool2D(2, 1),
    tf.keras.layers.Conv2D(32, 4,
                           strides=2,
                           padding='valid',
                           activation='relu'),
    tf.keras.layers.MaxPool2D(2, 1),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(10)
])

model.call = tf.function(model.call)


In [22]:
from tensorflow_privacy.privacy.analysis import compute_dp_sgd_privacy
from tensorflow_privacy.privacy.optimizers.dp_optimizer import DPGradientDescentGaussianOptimizer
#import optimizers 


In [17]:
import sys
from tensorflow_privacy.version import __version__
if hasattr(sys, 'skip_tf_privacy_import'):
  # Useful for standalone scripts.
  pass
else:
  # TensorFlow v1 imports
  from tensorflow_privacy import v1

 

### Define the optimizer and loss function for the learning model.

In [24]:
optimizer = DPGradientDescentGaussianOptimizer(
    l2_norm_clip=l2_norm_clip,
    noise_multiplier=noise_multiplier,
    num_microbatches=num_microbatches,
    learning_rate=learning_rate)

In [19]:
loss = tf.keras.losses.CategoricalCrossentropy(
    from_logits=True, reduction=tf.losses.Reduction.NONE)
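The loss above is created with `Reduction.NONE`, so it returns one loss value per example instead of a single mean. As far as I understand the TensorFlow Privacy optimizers, they need this unreduced vector in order to compute and clip gradients per microbatch. The pure-Python sketch below only illustrates the difference between the two shapes; the helper name and toy logits are hypothetical.

```python
import math


def per_example_cross_entropy(logits_batch, onehot_batch):
    """Softmax cross-entropy from logits, one loss value per example."""
    losses = []
    for logits, onehot in zip(logits_batch, onehot_batch):
        m = max(logits)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        losses.append(-sum(t * (z - log_sum_exp)
                           for z, t in zip(logits, onehot)))
    return losses


toy_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # hypothetical model outputs
toy_labels = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # one-hot targets

vector_loss = per_example_cross_entropy(toy_logits, toy_labels)  # like Reduction.NONE
mean_loss = sum(vector_loss) / len(vector_loss)                  # like a mean reduction

print("per-example losses:", vector_loss)
print("mean loss:", mean_loss)
```

With a mean reduction, the contribution of each example is already mixed into a single number, so per-example clipping is no longer possible.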

### Train  the model

In [25]:
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])


In [27]:

model.fit(x_train,y_train,
         batch_size=batch_size,
         epochs=epochs,
         validation_data=(x_test,y_test))

Train on 50000 samples, validate on 10000 samples


<keras.callbacks.History at 0x7fae4df3da90>

In [35]:
from tensorflow_privacy.privacy.analysis.rdp_accountant import compute_rdp

In [30]:
compute_dp_sgd_privacy.compute_dp_sgd_privacy(n=x_train.shape[0],
                                              batch_size=batch_size,
                                              noise_multiplier=noise_multiplier,
                                              epochs=epochs,
                                              delta=1e-5)

DP-SGD with sampling rate = 0.5% and noise_multiplier = 1.3 iterated over 200 steps satisfies differential privacy with eps = 0.52 and delta = 1e-05.
The optimal RDP order is 17.0.


(0.5203060966744342, 17.0)
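The sampling rate and step count in the report above can be reproduced from the hyperparameters with simple arithmetic. This sketch assumes the sampling rate is `batch_size / n` and that the number of steps is `epochs * n / batch_size` rounded up; the helper name is hypothetical.

```python
import math


def dp_sgd_schedule(n, batch_size, epochs):
    """Return (sampling_rate, steps) for a DP-SGD training run."""
    sampling_rate = batch_size / n
    steps = math.ceil(epochs * n / batch_size)
    return sampling_rate, steps


rate, steps = dp_sgd_schedule(n=50_000, batch_size=250, epochs=1)
print(f"sampling rate = {rate:.1%}, steps = {steps}")
# prints: sampling rate = 0.5%, steps = 200
```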

Two metrics are used to express the DP guarantee of an ML algorithm:

Delta (δ) - Bounds the probability of the privacy guarantee not holding. A rule of thumb is to set it to be less than the inverse of the size of the training dataset. In this tutorial, it is set to 10^-5 because the CIFAR-10 training set has 50,000 points.
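The rule of thumb for delta can be checked directly. The sketch below uses the 50,000 training examples reported by `x_train.shape` earlier; the helper name is hypothetical.

```python
def delta_below_inverse_size(delta, n):
    """Check the rule of thumb that delta should be smaller than 1 / n."""
    return delta < 1.0 / n


n_train = 50_000          # from x_train.shape[0]
delta = 1e-5
print(f"1/n = {1.0 / n_train:.0e}, delta = {delta:.0e}, "
      f"rule satisfied: {delta_below_inverse_size(delta, n_train)}")
```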


Epsilon (ε) - This is the privacy budget. It measures the strength of the privacy guarantee by bounding how much the probability of a particular model output can vary by including (or excluding) a single training point. A smaller value for ε implies a better privacy guarantee. However, the ε value is only an upper bound, and a large value could still mean good privacy in practice.
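One way to read epsilon: under the standard definition of differential privacy, and ignoring the small delta term, the probability of any output can change by at most a factor of e^ε when a single training point is added or removed. The sketch below evaluates that factor for the epsilon reported above; the helper name is hypothetical.

```python
import math


def max_probability_ratio(epsilon):
    """Upper bound e^epsilon on the output probability ratio (delta ignored)."""
    return math.exp(epsilon)


reported_epsilon = 0.5203060966744342   # value returned by the accountant above
print(f"e^epsilon = {max_probability_ratio(reported_epsilon):.2f}")
```

For the reported value, the factor is roughly 1.68, so including or excluding one training point can change the probability of any particular outcome by at most about 68 percent, up to the delta term.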