# Project 1: Differential Privacy for Deep Learning

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MarianoOG/Lesson-Notes-Secure-and-Private-AI/blob/master/Lectures/Project%201%20-%20Differential%20Privacy%20for%20Deep%20Learning.ipynb)

Previously, we defined perfect privacy as _a query to a database returns the same value even if we remove any person from the database_, and used this intuition in the description of epsilon/delta. In the context of deep learning we have a similar standard. **Training a model on a dataset should return the same model even if we remove any person from the dataset.**
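
For reference, the standard epsilon/delta definition that this intuition relaxes can be written as follows. The symbols $\mathcal{M}$ (the randomized mechanism, here either a query or a training procedure), $D$ and $D'$ (two datasets that differ in one person's data) and $S$ (any set of possible outputs) are notation introduced here, not taken from the lesson:

$$
\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta
$$

Setting $\varepsilon = 0$ and $\delta = 0$ recovers the perfect-privacy case, where removing a person does not change the output distribution at all.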

We've replaced "querying a database" with "training a model on a dataset". However, one should note that this adds two points of complexity which database queries did not have:

### 1. Do we always know where _people_ are referenced in the dataset?

We treat each training example as a single, separate person. Strictly speaking, this is not accurate: some training examples have no relevance to people, and others may relate to several people or only partially to one (consider an image containing multiple people or none).

### 2. Neural models rarely train to the same output model

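Even when trained twice on identical data, the resulting models usually differ, for example because of random weight initialization and data shuffling. Several proposals have been made to solve this problem. We're going to focus on one of the most popular, PATE (Private Aggregation of Teacher Ensembles).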

## Example Scenario: A Health Neural Network

You have a large collection of images from the patients of a hospital but you don't know what's in them. You realize that you can reach out to 10 partner hospitals which do have annotated data. It is your hope to train a classifier on their datasets so that you can automatically label your own. While these hospitals are interested in helping, they have privacy concerns regarding information about their patients. So you will use the following technique to train a classifier which protects the privacy of patients in the other hospitals.

1. You'll ask each of the 10 hospitals to train a model on their own datasets (all of which have the same kinds of labels).
2. You'll then use each of the 10 partner models to predict on your local dataset, generating 10 labels for each of your datapoints.
3. For each local data point (now with 10 labels), you'll apply a _max_ function, where _max_ returns the most frequent label across the 10 labels, to obtain the final label. You'll add Laplace noise to the label counts before taking the max, so that the result is differentially private under a given epsilon/delta constraint.
4. Finally, you'll train a new model on your local dataset, which now has labels. This will be your final _DP_ model.
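
Steps 2 and 3 above can be sketched as follows. This is a minimal illustration, not the lesson's reference implementation: the function name `noisy_max_labels` is hypothetical, the teacher predictions are random stand-ins for real model outputs, and the Laplace scale is assumed to be `1 / epsilon`.

```python
import numpy as np

def noisy_max_labels(teacher_preds, num_labels, epsilon, rng):
    """Aggregate teacher votes per example with a Laplace-noised argmax.

    teacher_preds: int array of shape (num_examples, num_teachers).
    """
    beta = 1.0 / epsilon  # Laplace scale (assumption: 1 / epsilon)
    labels = np.empty(len(teacher_preds), dtype=int)
    for i, votes in enumerate(teacher_preds):
        # Count votes per label as floats so the noise is not truncated
        counts = np.bincount(votes, minlength=num_labels).astype(float)
        # Add independent Laplace noise to every count, then pick the largest
        counts += rng.laplace(0.0, beta, size=num_labels)
        labels[i] = np.argmax(counts)
    return labels

rng = np.random.default_rng(0)
# Hypothetical stand-in for step 2: random "predictions" from 10 teachers on 5 images
teacher_preds = rng.integers(0, 10, size=(5, 10))
noisy = noisy_max_labels(teacher_preds, num_labels=10, epsilon=0.1, rng=rng)
print(noisy)
```

A smaller `epsilon` means a larger noise scale, so the returned label is more likely to differ from the plain majority vote.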

In [6]:
'''
In this project:
We will start at step 3 of the example scenario, using 10,000 toy training examples with 10 labels each.
Each label is chosen from a set of 10 possible labels (categories) for each image.
Then you're going to be given a dataset which you need to use to train a DP model using the PATE method.
'''
import numpy as np
from syft.frameworks.torch.differential_privacy import pate

# Variables of the toy database
num_teachers = 10    # we're working with 10 partner hospitals
num_examples = 10000 # the size of OUR dataset
num_labels   = 10    # number of labels for our classifier

# Fake predictions
preds = (np.random.rand(num_teachers, num_examples) * num_labels).astype(int).transpose(1,0)

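new_labels = list()
epsilon = 0.1
beta = 1 / epsilon  # scale of the Laplace noise

for an_image in preds:
    # Count the teacher votes per label; use floats so the noise is not truncated to int
    label_counts = np.bincount(an_image, minlength=num_labels).astype(float)

    # Add independent Laplace noise to every vote count
    label_counts += np.random.laplace(0, beta, num_labels)

    # The noisy max is the label with the highest noisy count
    new_label = np.argmax(label_counts)
    new_labels.append(new_label)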
    
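## PATE Analysis
# Toy example: the votes of 10 teachers for a single image
labels = np.array([9, 9, 3, 6, 9, 9, 9, 9, 8, 2])
counts = np.bincount(labels, minlength=10)

# Without noise, the query result is simply the most frequent label
query_result = np.argmax(counts)
print(query_result)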

num_teachers, num_examples, num_labels = (100, 100, 10)
preds = (np.random.rand(num_teachers, num_examples) * num_labels).astype(int) #fake preds
indices = (np.random.rand(num_examples) * num_labels).astype(int) # true answers

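# Force all teachers to agree on label 0 for the first 50 examples
preds[:,0:50] *= 0

data_dep_eps, data_ind_eps = pate.perform_analysis(teacher_preds=preds, indices=indices, noise_eps=0.1, delta=1e-5, moments=20)

# With this much teacher agreement, the data-dependent epsilon should be the smaller of the two
assert data_dep_eps < data_ind_eps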

print("Data Independent Epsilon:", data_ind_eps)
print("Data Dependent Epsilon:", data_dep_eps)

9
Data Independent Epsilon: 11.756462732485115
Data Dependent Epsilon: 0.9029013677789843
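
One way to build intuition for why the data-dependent epsilon is so much smaller here: when teachers strongly agree, the Laplace noise rarely changes the winning label, so the noisy answer depends very little on any single teacher. The sketch below is not the computation done by `pate.perform_analysis`; it only estimates, by simulation, how often the noise flips the plain majority vote. The function name `flip_rate` and the trial count are illustrative choices.

```python
import numpy as np

def flip_rate(votes, num_labels, epsilon, trials, rng):
    """Fraction of trials in which Laplace noise changes the majority label."""
    counts = np.bincount(votes, minlength=num_labels).astype(float)
    winner = np.argmax(counts)
    # One row of noise per trial, one column per label
    noise = rng.laplace(0.0, 1.0 / epsilon, size=(trials, num_labels))
    noisy_winners = np.argmax(counts + noise, axis=1)
    return np.mean(noisy_winners != winner)

rng = np.random.default_rng(0)
num_teachers, num_labels = 100, 10

unanimous = np.zeros(num_teachers, dtype=int)          # every teacher votes for label 0
scattered = rng.integers(0, num_labels, num_teachers)  # teachers vote at random

unanimous_rate = flip_rate(unanimous, num_labels, 0.1, 2000, rng)
scattered_rate = flip_rate(scattered, num_labels, 0.1, 2000, rng)
print("flip rate, unanimous votes:", unanimous_rate)
print("flip rate, scattered votes:", scattered_rate)
```

With unanimous votes the flip rate should be close to zero, while with scattered votes the noise changes the outcome much more often.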
