## An Example Scenario: A Health Neural Network

First, let's consider a scenario: you work for a hospital and have a large collection of patient images, but you don't know what's in them. You would like to use these images to train a neural network that can classify them automatically. However, since the images aren't labeled, they aren't sufficient to train a classifier.

However, being a cunning strategist, you realize that you can reach out to 10 partner hospitals which DO have annotated data. It is your hope to train your new classifier on their datasets so that you can automatically label your own. While these hospitals are interested in helping, they have privacy concerns regarding information about their patients. Thus, you will use the following technique to train a classifier which protects the privacy of patients in the other hospitals.

- 1) You'll ask each of the 10 hospitals to train a model on its own dataset (all of which have the same kinds of labels).
- 2) You'll then use each of the 10 partner models to predict on your local dataset, generating 10 labels for each of your datapoints.
- 3) Then, for each local datapoint (now with 10 labels), you will perform a DP query to generate the final label. This query is a "max" function, where "max" is the most frequent label across the 10 labels. You will need to add Laplacian noise to the vote counts to make this differentially private within a certain epsilon/delta constraint.
- 4) Finally, you will train a new model on your local dataset, which now has labels. This will be the final "DP" model.
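
Step 3 can also be written compactly. The notation below is introduced here only for illustration: $f_i$ is the model trained by partner hospital $i$, $n_j(x)$ is the number of partner models that vote for label $j$ on datapoint $x$, and $\mathrm{Lap}(b)$ is a fresh draw of Laplace noise with scale $b$ for each label. The scale $1/\epsilon$ matches the toy code in this section rather than a tuned setting.

$$
n_j(x) = \left|\{\, i : f_i(x) = j \,\}\right|, \qquad
\hat{y}(x) = \arg\max_j \left( n_j(x) + \mathrm{Lap}\!\left(\tfrac{1}{\epsilon}\right) \right)
$$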

So, let's walk through these steps. I will assume you're already familiar with how to train/predict a deep neural network, so we'll skip steps 1 and 2 and work with example data. We'll focus instead on step 3, namely how to perform the DP query for each example using toy data.

So, let's say we have 10,000 training examples, and we've got 10 labels for each example (from our 10 "teacher models" which were trained directly on private data). Each label is chosen from a set of 10 possible labels (categories) for each image.

To put it in simpler words: we want to train a classifier for medical images, but we don't have the labelled data to do so. Meanwhile, 10 other hospitals do have labelled data. They can't share it with us for confidentiality reasons, but each of them can train a classifier on its own data. Thus, the pipeline for our problem is:
- The 10 hospitals each train a separate classifier, all with the same set of labels.
- We use these 10 models to label our dataset, but in a privacy-preserving manner using global differential privacy.
- We label each datapoint with the label that the largest number of the 10 classifiers chose. This is where differential privacy comes in: we perform a differentially private max query.

Once our dataset is labelled, we can use it to train a classifier of our own that makes predictions on new medical images autonomously. So, the part we are concerned with here is the differentially private max query.

So, for this project we're going to synthesize the outputs of the 10 classifiers on our dataset and perform the differentially private query.

In [37]:
import numpy as np

In [38]:
num_teachers = 10 # number of partner hospitals
num_examples = 10000 # size of the dataset
num_labels = 10 # number of labels for our classifier

In [47]:
# Generate a fake prediction array with one row per teacher and one column per example, i.e., 10 x 10000

preds = (np.random.rand(num_teachers, num_examples) * num_labels).astype(int) # fake predictions

In [48]:
preds[:,0]

array([0, 6, 4, 0, 4, 9, 6, 3, 7, 6])

The vector above holds the predictions that the 10 partner models made for the first image, one prediction per hospital. Each prediction reflects what a model learned from a whole hospital's worth of patients. If we combine the 10 predictions into a single majority label in a differentially private manner, so that removing any one hospital barely changes the distribution of the query's output, then removing any one patient from a hospital's training set barely changes it either. We achieve this by adding noise to the vote counts before taking the max. The noise means some labels will be wrong, but a model trained on the noisy labels can often still learn to arrive at the accurate label, because the errors are random rather than systematic.

In [49]:
# So, here are the predictions on one image
an_image =  preds[:,0]
an_image

array([0, 6, 4, 0, 4, 9, 6, 3, 7, 6])

In [50]:
np.bincount(an_image, minlength=num_labels) # returns how many times that index number occurs

array([2, 0, 0, 1, 2, 0, 3, 1, 0, 1], dtype=int64)

In [51]:
label_counts = np.bincount(an_image, minlength=num_labels)
# Now we can find out which label received the most votes using np.argmax
np.argmax(label_counts)

6

But this is the exact output, which is not differentially private, so we turn to the Laplace mechanism and add random noise to the counts.

In [52]:
epsilon = 0.1
beta = 1/epsilon

for i in range(len(label_counts)):
    label_counts[i] += np.random.laplace(0, beta) # add noise to each label's vote count (truncated to an integer on assignment)

In [53]:
label_counts

array([ -7,  15,  18,   5,   2,  31,   7, -11,   0,   6], dtype=int64)

In [54]:
np.argmax(label_counts)

5
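
To get a feel for how often the noise changes the winning label, here is a minimal simulation sketch. The helper names `noisy_argmax` and `agreement_rate`, the example vote vector, and the chosen epsilon values are all assumptions made for illustration; they are not part of the notebook above. It uses the same noise scale of `1/epsilon` as the cell above, but keeps the counts as floats.

```python
import numpy as np

def noisy_argmax(votes, num_labels, epsilon, rng):
    """Return the label with the highest Laplace-noised vote count."""
    counts = np.bincount(votes, minlength=num_labels).astype(float)
    counts += rng.laplace(0.0, 1.0 / epsilon, size=num_labels)
    return int(np.argmax(counts))

def agreement_rate(votes, num_labels, epsilon, trials=2000, seed=0):
    """Fraction of trials in which the noisy label equals the exact majority label."""
    rng = np.random.default_rng(seed)
    true_label = int(np.argmax(np.bincount(votes, minlength=num_labels)))
    hits = sum(
        noisy_argmax(votes, num_labels, epsilon, rng) == true_label
        for _ in range(trials)
    )
    return hits / trials

# Hypothetical votes: 8 of the 10 teachers agree on label 3
votes = np.array([3, 3, 3, 3, 3, 3, 3, 3, 5, 5])
for eps in (0.1, 1.0, 5.0):
    print(eps, agreement_rate(votes, num_labels=10, epsilon=eps))
```

With only 10 teachers, a small epsilon such as 0.1 gives noise that is large relative to the vote counts, so the noisy label frequently differs from the majority label. A larger epsilon recovers the majority almost every time, at the cost of weaker privacy.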

So, in the code above we found, in a differentially private manner, the majority prediction (however wrong) for one image in our dataset. We now have to repeat this process for the whole dataset.


In [63]:
preds = (np.random.rand(num_teachers, num_examples) * num_labels).astype(int).transpose(1,0) # fake predictions

In [64]:
preds.shape # now we can iterate through the rows to find the label for one image at a time in our dataset

(10000, 10)

In [65]:
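epsilon = 0.1
beta = 1/epsilon # scale of the Laplace noise

new_labels = list()
for an_image in preds:

    # count the votes each label received for this image
    label_counts = np.bincount(an_image, minlength = num_labels)

    for i in range(len(label_counts)):
        label_counts[i] += np.random.laplace(0, beta) # noisy count is truncated to an integer

    # the noisy majority label for this image
    new_label = np.argmax(label_counts)

    new_labels.append(new_label)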
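
The same computation can be done without Python loops. The following is a sketch under a few assumptions: the function name `noisy_labels` is hypothetical, the input is laid out as `(num_examples, num_teachers)` like the transposed `preds` above, and the counts are kept as floats instead of being truncated to integers, so the results will not match the loop digit for digit.

```python
import numpy as np

def noisy_labels(preds, num_labels, epsilon, seed=None):
    """Noisy-max label per example.

    preds: integer array of shape (num_examples, num_teachers).
    """
    rng = np.random.default_rng(seed)
    num_examples, num_teachers = preds.shape

    # Build a (num_examples, num_labels) table of vote counts
    counts = np.zeros((num_examples, num_labels))
    rows = np.repeat(np.arange(num_examples), num_teachers)
    np.add.at(counts, (rows, preds.ravel()), 1)

    # Add independent Laplace noise with scale 1/epsilon to every count
    counts += rng.laplace(0.0, 1.0 / epsilon, size=counts.shape)
    return counts.argmax(axis=1)

# Usage on fake predictions shaped like the toy data in this section
fake_preds = np.random.default_rng(0).integers(0, 10, size=(10000, 10))
labels = noisy_labels(fake_preds, num_labels=10, epsilon=0.1, seed=0)
print(labels.shape)
```

`np.add.at` is used here because it accumulates correctly when the same (row, label) pair appears more than once, which plain fancy-index assignment would not do.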

In [67]:
new_labels[:10] # show only the first 10 noisy labels

[6, 7, 7, 7, 2, 4, 5, 7, 5, 7]


Now, to find out how much of the epsilon budget the computation above spends, we need to go through the PATE analysis.
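
As a point of comparison before that analysis, a naive upper bound comes from basic sequential composition: if each query is charged a budget of epsilon, then k queries together are charged at most k times epsilon. The sketch below assumes, as the code in this section does, that every noisy-max query is charged `epsilon = 0.1`; the variable names are illustrative only.

```python
# Naive budget accounting via basic sequential composition (an assumption-laden baseline)
epsilon_per_query = 0.1   # budget charged to each noisy-max query, as in the cells above
num_queries = 10_000      # one query per example in the toy dataset

naive_total_epsilon = epsilon_per_query * num_queries
print(naive_total_epsilon)
```

A total on this order is far too large to be a meaningful guarantee, which is why a tighter, data-dependent accounting such as the PATE analysis is needed.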