# Project 1: Differential Privacy for Deep Learning

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]( https://colab.research.google.com/github/MarianoOG/Lesson-Notes-Secure-and-Private-AI/blob/master/Lectures/Project%201%20-%20Differential%20Privacy%20for%20Deep%20Learning.ipynb)

Previously, we defined perfect privacy as _a query to a database returns the same value even if we remove any person from the database_, and used this intuition in the description of epsilon/delta. In the context of deep learning we have a similar standard. **Training a model on a dataset should return the same model even if we remove any person from the dataset.**
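
For reference, the epsilon/delta guarantee mentioned above is usually written as follows. The symbols are the standard ones and are assumptions here, not notation from these notes: $M$ is a randomized mechanism (in this project, the training procedure), $D$ and $D'$ are two datasets that differ in one person, and $S$ is any set of possible outputs.

$$
\Pr[M(D) \in S] \le e^{\varepsilon} \, \Pr[M(D') \in S] + \delta
$$

Perfect privacy in the sense above corresponds to $\varepsilon = 0$ and $\delta = 0$: removing a person does not change the output distribution at all.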

We've replaced "querying a database" with "training a model on a dataset". However, one should note that this adds two points of complexity which database queries did not have:

### 1. Do we always know where _people_ are referenced in the dataset?

We need to treat each training example as a single, separate person. Strictly speaking, this is not true as some training examples have no relevance to people and others may have multiple or partial relevance to people (consider an image with multiple or no people contained within it).

### 2. Neural models rarely train to the same output model

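Even on identical data, two training runs usually produce different models, for example because of random weight initialization and random ordering of the training examples. Several proposals have been made to solve this problem. We're going to focus on one of the most popular, PATE (Private Aggregation of Teacher Ensembles).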
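
The point about training runs can be seen on a very small scale. The sketch below is only an illustration: a tiny logistic regression stands in for a neural network, and `train_logistic` is a hypothetical helper written for this example, not part of any library used in these notes. Both runs see exactly the same data and differ only in the random seed.

```python
import numpy as np

def train_logistic(X, y, seed, epochs=5, lr=0.1):
    """Train a tiny logistic regression with SGD; `seed` controls init and shuffling."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=X.shape[1])   # random initialization
    for _ in range(epochs):
        for i in rng.permutation(len(X)):        # random example order
            p = 1.0 / (1.0 + np.exp(-X[i] @ w))  # predicted probability
            w -= lr * (p - y[i]) * X[i]          # gradient step on the log loss
    return w

# Identical toy data for both runs
data_rng = np.random.default_rng(0)
X = data_rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

w_a = train_logistic(X, y, seed=1)
w_b = train_logistic(X, y, seed=2)
print("run A weights:", w_a)
print("run B weights:", w_b)
print("identical models?", np.allclose(w_a, w_b))
```

Because the two runs already disagree without any change to the data, comparing trained models directly cannot tell us whether removing a person made a difference.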

## Example Scenario: A Health Neural Network

You have a large collection of images from the patients of a hospital but you don't know what's in them. You realize that you can reach out to partner hospitals which do have annotated data. It is your hope to train a classifier on their datasets so that you can automatically label your own. While these hospitals are interested in helping, they have privacy concerns regarding information about their patients. So you will use the following technique to train a classifier which protects the privacy of patients in the other hospitals.

1. You'll ask each of the hospitals to train a model on their own datasets (all of which have the same kinds of labels).
2. You'll then use each of the partner models to predict on your local dataset, generating labels for each of your datapoints.
3. For each local data point (now with one label per partner model), you'll take the most frequent label across the partner models' predictions as the final label (a _max_ over the label counts). To make this step differentially private under a given epsilon/delta constraint, you'll add Laplacian noise to the label counts before taking the max.
4. Finally, you'll train a new model on your local dataset, which now has labels. This will be your final _DP_ model.
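
The four steps above can be sketched end to end on toy data. Everything in this block is an assumption made for illustration: the 2-D points, the nearest-centroid "models", and the helper names `sample`, `fit_centroids` and `predict` are hypothetical stand-ins for real images and neural networks. The Laplace scale `1 / epsilon` is the same choice the project code makes.

```python
import numpy as np

rng = np.random.default_rng(0)
num_teachers, num_labels, epsilon = 20, 3, 0.5
# Hypothetical toy classes: 2-D points clustered around three fixed centers
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])

def sample(n):
    """Draw n labeled toy points; every class appears at least once for n >= 3."""
    labels = np.arange(n) % num_labels
    return centers[labels] + rng.normal(size=(n, 2)), labels

def fit_centroids(X, y):
    """'Train' a model: store the mean point of each class."""
    return np.stack([X[y == c].mean(axis=0) for c in range(num_labels)])

def predict(model, X):
    """Label each point with the class of its nearest centroid."""
    dists = np.linalg.norm(X[:, None, :] - model[None, :, :], axis=2)
    return dists.argmin(axis=1)

# Step 1: each partner hospital trains a teacher on its own private data
teachers = [fit_centroids(*sample(60)) for _ in range(num_teachers)]

# Step 2: every teacher labels our local, unlabeled data
X_local, y_hidden = sample(90)   # y_hidden is used only to check the result
teacher_preds = np.stack([predict(t, X_local) for t in teachers])

# Step 3: noisy max over the teachers' votes for each local data point
noisy_labels = np.empty(len(X_local), dtype=int)
for j, votes in enumerate(teacher_preds.T):
    counts = np.bincount(votes, minlength=num_labels).astype(float)
    counts += rng.laplace(0.0, 1.0 / epsilon, size=num_labels)
    noisy_labels[j] = counts.argmax()

# Step 4: train the final (student) model on the local data and noisy labels
student = fit_centroids(X_local, noisy_labels)
X_test, y_test = sample(300)
accuracy = float(np.mean(predict(student, X_test) == y_test))
print("student accuracy on fresh toy data:", accuracy)
```

Only the noisy labels from step 3 reach the student, so the student never touches the partner hospitals' data or their models directly.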

In [2]:
'''
In this project:
We will assume we already have the labels from the partner models, so we start at step 3 of the example scenario.
We will use predictions from 100 partner hospitals, each labeling the 100 toy examples in our dataset.
Each label is chosen from a set of 10 possible labels (categories) for each image.
Then we use the PATE analysis to measure how much privacy (epsilon) this database is leaking.
'''
import numpy as np
from syft.frameworks.torch.differential_privacy import pate

# Variables of the toy database
num_teachers = 100    # we're working with 100 partner hospitals
num_examples = 100    # the size of OUR dataset
num_labels   = 10     # number of labels for our classifier

# Predictions from each hospital's model on each of OUR images (partner labels)
preds = (np.random.rand(num_teachers, num_examples) * num_labels).astype(int).transpose(1,0)

# Adding noise to the labels
real_result = list()
private_result = list()
for an_image in preds:
    # Calculate real label (not private)
    label_counts = np.bincount(an_image, minlength=num_labels).astype(float)
    real_result.append(np.argmax(label_counts))

    # Laplace noise scale for the chosen epsilon
    epsilon = 0.1
    beta = 1 / epsilon

    # Add independent Laplacian noise to every label count.
    # The counts are floats so the noise is not truncated to an integer.
    label_counts += np.random.laplace(0, beta, size=num_labels)

    # Storing private results
    private_result.append(np.argmax(label_counts))

# Arranging the datasets to be used by PATE
preds = preds.transpose(1,0)
indices = np.asarray(private_result)

# PATE analysis for several levels of forced agreement
for i in range(6):
    k = i*2
    # Force all teachers to predict label 0 on the first k examples
    preds[:,0:k] *= 0

    dep, ind = pate.perform_analysis(teacher_preds=preds, indices=indices, noise_eps=0.1, delta=1e-5, moments=20)

    assert dep < ind
    
    print("Forced agreement =", 100*k/num_examples, "%")
    print("Data Independent Epsilon:", ind)
    print("Data Dependent Epsilon:", dep)
    print("")

Forced agreement = 0.0 %
Data Independent Epsilon: 11.756462732485115
Data Dependent Epsilon: 11.756462732485105

Forced agreement = 2.0 %
Data Independent Epsilon: 11.756462732485115
Data Dependent Epsilon: 10.567352334004832

Forced agreement = 4.0 %
Data Independent Epsilon: 11.756462732485115
Data Dependent Epsilon: 8.900097381461716

Forced agreement = 6.0 %
Data Independent Epsilon: 11.756462732485115
Data Dependent Epsilon: 6.81765626580294

Forced agreement = 8.0 %
Data Independent Epsilon: 11.756462732485115
Data Dependent Epsilon: 4.877227210477253

Forced agreement = 10.0 %
Data Independent Epsilon: 11.756462732485115
Data Dependent Epsilon: 0.9029013677789843
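
The data-independent epsilon printed above is identical in every run, which suggests it depends only on the number of queries, the noise level, delta and the number of moments. The sketch below reproduces it under one assumption: that `pate.perform_analysis` uses the data-independent moments bound from the PATE paper, where each noisy query costs `2 * noise_eps**2 * l * (l + 1)` at moment order `l`. The function name `data_independent_epsilon` is hypothetical and is not part of PySyft.

```python
import numpy as np

def data_independent_epsilon(num_queries, noise_eps, delta, moments):
    """Moments-accountant epsilon, assuming a per-query cost of
    2 * noise_eps**2 * l * (l + 1) at each moment order l."""
    orders = np.arange(1, moments + 1, dtype=float)
    log_mgf = num_queries * 2.0 * noise_eps**2 * orders * (orders + 1.0)
    # Convert each moment bound into an epsilon and keep the tightest one
    return float(np.min((log_mgf + np.log(1.0 / delta)) / orders))

eps = data_independent_epsilon(num_queries=100, noise_eps=0.1, delta=1e-5, moments=20)
print("Data Independent Epsilon (recomputed):", eps)
```

With the settings used in this project (100 labeled examples, `noise_eps=0.1`, `delta=1e-5`, `moments=20`) this evaluates to roughly 11.756, in line with the printed value. The teacher votes never enter the formula, so forcing agreement cannot change it.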



Notice that the data-independent epsilon is the same in every run, while the data-dependent epsilon drops sharply as the agreement between the teacher predictions (partner models) increases.
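
One intuition for this effect: when the teachers agree strongly, the Laplacian noise almost never changes the winning label, so the published label reveals little about any single teacher. The rough simulation below illustrates that intuition only. It is not the computation that `pate.perform_analysis` performs, and `flip_rate` is a hypothetical helper written for this example.

```python
import numpy as np

def flip_rate(counts, epsilon, trials=20000, seed=0):
    """Estimate how often Laplacian noise changes the winning label."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    noise = rng.laplace(0.0, 1.0 / epsilon, size=(trials, counts.size))
    noisy_winner = (counts + noise).argmax(axis=1)
    # Ties in the noise-free counts are resolved toward the lowest label index
    return float(np.mean(noisy_winner != counts.argmax()))

split_votes = [10] * 10            # 100 teachers spread evenly over 10 labels
agreed_votes = [100] + [0] * 9     # all 100 teachers vote for label 0

rate_split = flip_rate(split_votes, epsilon=0.1)
rate_agreed = flip_rate(agreed_votes, epsilon=0.1)
print("flip rate, no agreement:  ", rate_split)
print("flip rate, full agreement:", rate_agreed)
```

With evenly split votes the noise decides the outcome most of the time, while with full agreement the winner is almost always unchanged.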