# Section 1 - Differential Privacy

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]( https://colab.research.google.com/github/MarianoOG/Lesson-Notes-Secure-and-Private-AI)

A **differential attack** (also called a differencing attack) occurs when an attacker obtains information about a particular individual or group by querying both the entire database and a sub-database derived from it (with some data removed). The attacker then compares the outputs of these two queries, and the difference reveals information about the individual or group. We say that a database is **differentially private** when this kind of attack is not successful.

In this section we will first explore how a query can reveal information about an individual or group in the database. Then we will define differential privacy with the help of the concept of sensitivity. Finally, we will explore differential privacy in the context of deep learning.

## Lesson 1: Simple Database Queries

We're going to make a toy database (for educational reasons) by initializing a random list of 1s and 0s which will represent sensitive data from a population. The number of entries directly corresponds to the number of people in our database.

**Key to the definition of differential privacy is the ability to ask the question "When querying a database, if I removed someone from the database, would the output of the query be any different?"** To check this, we will create sub-databases that we will call *parallel databases*: copies of the database with one entry removed.

In [1]:
'''
In this first project, we will create parallel databases "pdb" from the main database "db".
'''

# First we import the needed libraries
import torch

# Function to create database with n entries (number of people)
def create_db(n):
    return torch.rand(n) > 0.5

# Function to create one parallel database by removing one particular index
def get_parallel_db(db, remove_index):
    return torch.cat((db[0:remove_index], 
                      db[remove_index+1:]))

# Function to create all parallel databases from a database
def get_parallel_dbs(db):
    parallel_dbs = list()
    for i in range(len(db)):
        pdb = get_parallel_db(db, i)
        parallel_dbs.append(pdb)
    return parallel_dbs

# Function to create database and parallel databases
def create_db_and_parallels(n):
    db = create_db(n)
    pdbs = get_parallel_dbs(db)
    return db, pdbs

# Creating actual database and parallel databases
db, pdbs = create_db_and_parallels(20)

# Printing results:
print(db)         # Database
print(len(pdbs))  # Number of parallel databases
# print(pdbs)      # Parallel databases

tensor([1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1],
       dtype=torch.uint8)
20


## Lesson 2: Towards Evaluating The Differential Privacy of a Function

We want to be able to query our database and evaluate whether or not the result of the query is leaking private information. As mentioned previously, **this is about evaluating whether the output of a query changes when we remove someone from the database**. Specifically, we want to evaluate the *maximum* amount the query changes when someone is removed (maximum over all possible people who could be removed). 

In order to evaluate how much privacy is leaked, first we are going to define functions that will be our queries to the database. Then, we will measure the difference between each parallel db's query result and the query result for the entire database. Finally, we will calculate the maximum value along those results. This value is called sensitivity. **Sensitivity is measuring how much the output of the query changes when a person is removed from the database.**
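As a hand-checkable illustration, the sketch below computes this maximum for a small fixed database, using plain Python lists instead of tensors. The helper names (`parallel_dbs`, `empirical_sensitivity`) and the example database are assumptions made for illustration and are not part of the lesson code.

```python
def parallel_dbs(db):
    """Return every copy of db with exactly one entry removed."""
    return [db[:i] + db[i + 1:] for i in range(len(db))]

def empirical_sensitivity(query, db):
    """Largest change in the query result when one entry is removed from db."""
    full = query(db)
    return max(abs(query(pdb) - full) for pdb in parallel_dbs(db))

def mean(db):
    return sum(db) / len(db)

db = [1, 0, 1, 1]  # hypothetical toy database

sum_sensitivity = empirical_sensitivity(sum, db)    # removing a 1 changes the sum by 1
mean_sensitivity = empirical_sensitivity(mean, db)  # removing the 0 moves the mean from 0.75 to 1.0

print(sum_sensitivity, mean_sensitivity)
```

Note that this measures the sensitivity on one particular database. A different database can give a different value for the same query, which is why it helps to think of sensitivity as the worst case over all databases.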

In [2]:
'''
In this project we will:
1) Create a list of queries (sum, threshold and mean)
2) Calculate sensitivity for each one
'''
    
# Sum query function
def query_sum(db):
    return db.sum().float()

# Threshold query function
def query_threshold(db, threshold=5):
    return (db.sum() > threshold).float()

# Mean query function
def query_mean(db):
    return db.float().mean()

# Calculate sensitivity
def sensitivity(query, n):
    db, pdbs = create_db_and_parallels(n)
    db_result = query(db)
    max_distance = 0
    for pdb in pdbs:
        pdb_result = query(pdb)
        db_distance = torch.abs(pdb_result - db_result)
        if(db_distance > max_distance):
            max_distance = db_distance
    return max_distance

# Sensitivity for sum
print("Sum query")
for i in range(1,10):
    s = sensitivity(query_sum, i)
    print("N =", i, "- sensitivity =", s)
print("Some results are not deterministic; they will change if you re-run this cell.")
print("")
    
# Sensitivity for threshold
print("Threshold query")
for i in range(15):
    s = sensitivity(query_threshold, 10)
    print("N =", 10, "- sensitivity =", s)
print("Even when the number of samples is the same, the sensitivity of this function varies")
print("")

# Sensitivity for mean
print("Mean query")
for i in range(1,10):
    s = sensitivity(query_mean, i*10)
    print("N =", i*10, "- sensitivity =", s)
print("Observe how the sensitivity is reduced when the database is bigger.")

Sum query
N = 1 - sensitivity = 0
N = 2 - sensitivity = tensor(1.)
N = 3 - sensitivity = tensor(1.)
N = 4 - sensitivity = tensor(1.)
N = 5 - sensitivity = tensor(1.)
N = 6 - sensitivity = tensor(1.)
N = 7 - sensitivity = tensor(1.)
N = 8 - sensitivity = tensor(1.)
N = 9 - sensitivity = tensor(1.)
Some results are not deterministic; they will change if you re-run this cell.

Threshold query
N = 10 - sensitivity = tensor(1.)
N = 10 - sensitivity = 0
N = 10 - sensitivity = 0
N = 10 - sensitivity = 0
N = 10 - sensitivity = 0
N = 10 - sensitivity = 0
N = 10 - sensitivity = 0
N = 10 - sensitivity = tensor(1.)
N = 10 - sensitivity = 0
N = 10 - sensitivity = 0
N = 10 - sensitivity = tensor(1.)
N = 10 - sensitivity = 0
N = 10 - sensitivity = 0
N = 10 - sensitivity = tensor(1.)
N = 10 - sensitivity = 0
Even when the number of samples is the same, the sensitivity of this function varies

Mean query
N = 10 - sensitivity = tensor(0.0556)
N = 20 - sensitivity = tensor(0.0316)
N = 30 - sensitivity = tensor(0.0184)
N = 40 - sensitivity = tensor(0.0135)
N = 50 

## Lesson 3 - A Basic Differential Attack

We have now measured how sensitive each query is in those particular cases, but none of the queries we've looked at so far is differentially private. We will now see how to extract information from them using a basic differential attack.

The most basic type of attack works as follows: say we want to figure out a specific person's value in the database. All we have to do is query for the sum of the entire database and then for the sum of the database without that person; the difference between the two results is that person's value. A similar comparison can be made with the mean and threshold queries defined earlier.
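The arithmetic behind the attack can be sketched without any tensors. The database contents and the `target` index below are hypothetical; the point is only that two aggregate results are enough to recover one private entry.

```python
db = [0, 1, 1, 0, 1, 0, 0, 1]  # hypothetical private bits, one per person
target = 4                      # index of the person under attack
n = len(db)

# The attacker never reads db[target] directly; only aggregate queries are used.
without_target = db[:target] + db[target + 1:]

# Attack through the sum query: the difference is the missing entry.
recovered_from_sum = sum(db) - sum(without_target)

# Attack through the mean query: rescale each mean by its database size first.
mean_with = sum(db) / n
mean_without = sum(without_target) / (n - 1)
recovered_from_mean = round(n * mean_with - (n - 1) * mean_without)

print(recovered_from_sum, recovered_from_mean, db[target])
```

The mean version needs the attacker to know the database size, which is a mild assumption in practice because a count query usually reveals it.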

In [3]:
'''
In this project we will:
1) Construct a database.
2) Demonstrate how two different sum queries can expose the value of the person represented by row 10.
'''

# Generate a database with only one parallel database (in index 10)
db = create_db(100)
pdb = get_parallel_db(db, remove_index=10)

# Printing results:
print("Private value:", db[10])
print("")
print("Differencing attacks:")
print("\tSum query =", query_sum(db) - query_sum(pdb))
print("\tMean query =", query_mean(db) - query_mean(pdb))
print("\tThreshold query =", query_threshold(db, 50) - query_threshold(pdb, 50))
print("")
print("As you can see, the basic sum query is not differentially private at all!")

Private value: tensor(0, dtype=torch.uint8)

Differencing attacks:
	Sum query = tensor(0.)
	Mean query = tensor(-0.0048)
	Threshold query = tensor(0.)

As you can see, the basic sum query is not differentially private at all!


Differential privacy always requires a form of randomness added to the query. 

One technique is to add randomness to each person's response. Keep in mind that adding noise to each response skews the result. However, if the amount of noise introduced is known, it is possible to calculate an approximation of the real result. Note that this comes at the cost of accuracy, especially when we only have a few samples: the greater the privacy protection, the less accurate the results.
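Why the correction works can be made explicit. This is a sketch, assuming each response is replaced by a fair coin flip with probability $p$ and kept unchanged otherwise; the symbols $\mu$ (true mean), $\tilde{\mu}$ (mean of the noisy responses) and $p$ are introduced here for illustration only:

$$
\mathbb{E}[\tilde{\mu}] = (1 - p)\,\mu + p \cdot \tfrac{1}{2}
\quad\Longrightarrow\quad
\mu \approx \frac{\tilde{\mu} - p/2}{1 - p}
$$

For example, with $p = 0.2$ and a noisy mean of $0.5$, the estimate is $(0.5 - 0.1) / 0.8 = 0.5$.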

In [4]:
'''
In this project we will:
1) Create a function that will introduce noise to the database 
2) Create a function that will calculate an approximation of the real result from a skewed result.
3) Demonstrate how the effect of the induced noise is diminished when the data set is bigger.
'''

def add_noise(db, noise=0.2):
    selected = (torch.rand(len(db)) > noise).float() # Will decide if the entry will be real or random
    random_answer = (torch.rand(len(db)) > 0.5).float()
    augmented_database = db.float() * selected + random_answer * (1 - selected)
    return augmented_database
    
def deskew(sk_result, noise=0.2):
    return ((sk_result / noise) - 0.5) * noise / (1 - noise)

for i in range(2,8):
    db = create_db(10**i)
    true_result = query_mean(db)
    augmented_database = add_noise(db)
    sk_result = query_mean(augmented_database)
    private_result = deskew(sk_result)
    print("N =", 10**i, "\tWithout Noise =", true_result, "\tWith Noise =", private_result)

N = 100 	Without Noise = tensor(0.4400) 	With Noise = tensor(0.4750)
N = 1000 	Without Noise = tensor(0.4650) 	With Noise = tensor(0.4450)
N = 10000 	Without Noise = tensor(0.4960) 	With Noise = tensor(0.4955)
N = 100000 	Without Noise = tensor(0.5006) 	With Noise = tensor(0.4998)
N = 1000000 	Without Noise = tensor(0.4999) 	With Noise = tensor(0.5001)
N = 10000000 	Without Noise = tensor(0.5001) 	With Noise = tensor(0.5000)


The previous method of adding noise is called **Local Differential Privacy** because we added noise to each datapoint individually. This is necessary in situations where the data is so sensitive that individuals do not trust anyone else to add the noise later. However, it comes at a very high cost in terms of accuracy. 

Alternatively, we can add noise _after_ the data has been aggregated by a function. This allows us to apply differential privacy to smaller groups of individuals with lower amounts of noise, so accuracy is less affected. However, participants must be able to trust that no one looked at their datapoints _before_ the aggregation took place. This method is called **Global Differential Privacy**.
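The accuracy gap between the two approaches can be illustrated with a rough comparison. The sketch below is not the lesson's code and rests on several assumptions: it uses plain Python, treats the mean query as having sensitivity 1/n, and matches the two methods at the epsilon implied by randomized response with noise p (each bit is reported truthfully with probability 1 - p/2, which gives epsilon = ln((1 - p/2) / (p/2))). All helper names are hypothetical.

```python
import math
import random

def sample_laplace(b, rng):
    """Draw one Laplace(0, b) sample as the difference of two exponentials."""
    return rng.expovariate(1 / b) - rng.expovariate(1 / b)

def local_mean(db, p, rng):
    """Local DP: randomize every entry, then de-skew the mean of the noisy entries."""
    noisy = [bit if rng.random() > p else int(rng.random() > 0.5) for bit in db]
    return (sum(noisy) / len(noisy) - p / 2) / (1 - p)

def global_mean(db, epsilon, rng):
    """Global DP: true mean plus Laplace noise scaled to an assumed sensitivity of 1/n."""
    n = len(db)
    return sum(db) / n + sample_laplace((1 / n) / epsilon, rng)

rng = random.Random(0)  # seeded only to make the sketch repeatable
db = [int(rng.random() > 0.5) for _ in range(1000)]
true_mean = sum(db) / len(db)

p = 0.2
epsilon = math.log((1 - p / 2) / (p / 2))  # epsilon of this randomized response, ln(9)

trials = 200
local_err = sum(abs(local_mean(db, p, rng) - true_mean) for _ in range(trials)) / trials
global_err = sum(abs(global_mean(db, epsilon, rng) - true_mean) for _ in range(trials)) / trials
print(local_err, global_err)  # average absolute error of each approach
```

Under these assumptions the global approach should show a much smaller average error, at the price of trusting whoever holds the raw data.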

In [29]:
'''
In this project we will:
1) Create a function that will introduce noise to the result of the query (global differential privacy)
'''

def M(true_result, noise=0.2):
    sk = (torch.rand(1)).float()*noise - noise/2.0
    private_result = true_result + true_result*sk
    return private_result

for i in range(1, 10):
    db = create_db(10*i)
    true_result = query_sum(db)
    private_result = M(true_result)
    print("N =", 10*i, "\tWithout Noise =", true_result, "\tWith Noise =", private_result)

N = 10 	Without Noise = tensor(5.) 	With Noise = tensor([4.5829])
N = 20 	Without Noise = tensor(7.) 	With Noise = tensor([6.3510])
N = 30 	Without Noise = tensor(17.) 	With Noise = tensor([17.4437])
N = 40 	Without Noise = tensor(18.) 	With Noise = tensor([19.3799])
N = 50 	Without Noise = tensor(25.) 	With Noise = tensor([22.7138])
N = 60 	Without Noise = tensor(34.) 	With Noise = tensor([35.6781])
N = 70 	Without Noise = tensor(35.) 	With Noise = tensor([35.5389])
N = 80 	Without Noise = tensor(33.) 	With Noise = tensor([36.1455])
N = 90 	Without Noise = tensor(42.) 	With Noise = tensor([44.7209])


## Lesson 4 - The Formal Definition of Differential Privacy

This definition is a measure of how much privacy is afforded by a query M. Specifically, it's a comparison between running the query M on a database (x) and a parallel database (y). As a reminder, parallel databases are defined to be the same as a full database (x) with one entry/person removed.

[![Image From: "The Algorithmic Foundations of Differential Privacy" - Cynthia Dwork and Aaron Roth](dp_formula.png "Title")](https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf)

This definition is called "epsilon-delta" differential privacy. It says that, for all parallel databases, the probability of any set of outputs of the query on database (x) is at most e^epsilon times the probability of the same outputs on database (y), plus a probability delta.
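For readers who cannot see the image, the inequality can be written out as follows, assuming the image shows the standard definition from the Dwork and Roth book. Here $S$ ranges over sets of possible outputs of $M$ (notation taken from the book rather than from this lesson), and $x$, $y$ are any two parallel databases:

$$
\Pr[M(x) \in S] \le e^{\varepsilon}\,\Pr[M(y) \in S] + \delta
$$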

### Epsilon

**Epsilon Zero:** If a query satisfies this inequality with epsilon 0 (and delta 0), then the query has exactly the same output distribution on every parallel database as on the full database. If the sensitivity of a query is 0, then epsilon is also 0.

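**Epsilon One:** If a query satisfies this inequality with epsilon 1, then the maximum distance between the two random distributions M(x) and M(y) is 1. This distance is measured on a log scale, so the probability of any output can differ by a factor of at most e^1 (about 2.72).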
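To get a feel for these values: when delta is 0, the inequality caps the ratio between the two output probabilities at e^epsilon. The short sketch below prints that cap for a few epsilons; the function name `max_probability_ratio` is hypothetical.

```python
import math

def max_probability_ratio(epsilon):
    """Upper bound on Pr[M(x) in S] / Pr[M(y) in S] when delta is 0."""
    return math.exp(epsilon)

for eps in (0.0, 0.1, 1.0):
    print(eps, round(max_probability_ratio(eps), 3))
```

An epsilon of 0 gives a ratio of exactly 1 (identical distributions), and smaller epsilons keep the ratio closer to 1.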

### Delta

Delta is basically the probability that the epsilon guarantee breaks. The privacy loss of some outputs can exceed epsilon, and delta bounds the probability that this happens. Note that this expression doesn't represent the full tradeoff between epsilon and delta.

## Lesson 5: How To Add Noise for Global Differential Privacy

In this lesson, we're going to learn how to take a query and add varying amounts of noise so that it satisfies a certain degree of differential privacy (certain epsilon-delta values). In particular, we're going to focus on global differential privacy.

There are two kinds of noise we can add: **Gaussian Noise** or **Laplacian Noise**. Generally speaking Laplacian is better, but both are still valid.

### How much noise should we add?

The amount of noise necessary to add to the output of a query is a function of four things:

- the type of noise (Gaussian/Laplacian)
- the sensitivity of the query/function
- the desired epsilon (ε)
- the desired delta (δ)

**Querying Repeatedly:** If we query the database multiple times we can simply add the epsilons, even if we change the amount of noise and their epsilons are not the same.
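A minimal sketch of this bookkeeping, assuming simple additive composition as described above. The `PrivacyBudget` class and its method names are hypothetical and are not part of any library used in this lesson.

```python
class PrivacyBudget:
    """Tracks the total epsilon spent across queries by adding the epsilons up."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon):
        """Record one query; refuse it if the budget would be exceeded."""
        # The small tolerance avoids rejecting a query because of float rounding.
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            raise ValueError("privacy budget exceeded")
        self.spent += epsilon
        return self.total_epsilon - self.spent

budget = PrivacyBudget(total_epsilon=1.0)
budget.spend(0.5)               # first query
budget.spend(0.25)              # second query, with a different epsilon
remaining = budget.spend(0.25)  # third query uses up the rest
print(budget.spent, remaining)
```

Once the budget is used up, any further query has to be refused, or the overall guarantee becomes weaker than the chosen total epsilon.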

### Using Laplacian Noise

Laplacian noise is increased/decreased according to a _scale_ parameter b. We choose _b_ based on the following formula:

b = sensitivity(query) / epsilon

In other words, if we set b to be this value, then we know that we will have a privacy leakage of <= epsilon. 

The Laplace mechanism guarantees that delta is equal to 0 (there are some tunings with very low epsilon where delta is non-zero, but we'll ignore them for now).
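To see what a given epsilon means in practice, it helps to know that the average absolute value of Laplace noise with scale b equals b. The sketch below checks this empirically using only the standard library; sampling Laplace noise as the difference of two exponential draws is a choice made for this illustration, and the helper names are hypothetical.

```python
import random

def laplace_scale(sensitivity, epsilon):
    """Scale b of the Laplace noise for a given sensitivity and epsilon."""
    return sensitivity / epsilon

def sample_laplace(b, rng):
    """Draw one Laplace(0, b) sample as the difference of two exponentials."""
    return rng.expovariate(1 / b) - rng.expovariate(1 / b)

rng = random.Random(0)  # seeded only to make the sketch repeatable
for epsilon in (1.0, 0.1, 0.01):
    b = laplace_scale(1, epsilon)  # sensitivity 1, as for a sum over 0/1 entries
    samples = [abs(sample_laplace(b, rng)) for _ in range(20000)]
    print(epsilon, b, sum(samples) / len(samples))  # mean |noise| is close to b
```

With sensitivity 1 and epsilon 0.01, for instance, b is 100, so a noisy sum is typically off by about 100.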

In [35]:
'''
In this project we will:
1) Create a query function that adds the right amount of noise to satisfy an epsilon constraint.
2) Test it using a "sum" and "mean" query. Ensure that you use the correct sensitivity measures for both.
'''
import numpy as np

def laplacian_noise(sensitivity, epsilon):
    beta = sensitivity / epsilon
    noise = torch.tensor(np.random.laplace(0, beta, 1)).float()
    return noise

# Sum query
print("Sum query")
epsilon = 0.01             # Try looping over different values of epsilon
for i in range(1, 10):
    db = create_db(10*i)
    true_result = query_sum(db)
    noise = laplacian_noise(1, epsilon)
    private_result = true_result + noise
    print("N =", 10*i, "\tWithout Noise =", true_result, "\tWith Noise =", private_result)
print("")
    
# Mean query
print("Mean query")
epsilon = 0.0001
for i in range(1, 10):
    db = create_db(10*i)
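    # NOTE: this calls query_sum, so the values printed for this loop are sums; use query_mean(db) for a true mean query
    true_result = query_sum(db)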
    noise = laplacian_noise(1/(10*i), epsilon)
    private_result = true_result + noise
    print("N =", 10*i, "\tWithout Noise =", true_result, "\tWith Noise =", private_result)

Sum query
N = 10 	Without Noise = tensor(6.) 	With Noise = tensor([135.6311])
N = 20 	Without Noise = tensor(8.) 	With Noise = tensor([77.1518])
N = 30 	Without Noise = tensor(11.) 	With Noise = tensor([235.4594])
N = 40 	Without Noise = tensor(19.) 	With Noise = tensor([22.2522])
N = 50 	Without Noise = tensor(26.) 	With Noise = tensor([210.6535])
N = 60 	Without Noise = tensor(29.) 	With Noise = tensor([171.1573])
N = 70 	Without Noise = tensor(34.) 	With Noise = tensor([47.3793])
N = 80 	Without Noise = tensor(39.) 	With Noise = tensor([145.0871])
N = 90 	Without Noise = tensor(47.) 	With Noise = tensor([-184.5951])

Mean query
N = 10 	Without Noise = tensor(5.) 	With Noise = tensor([-603.0083])
N = 20 	Without Noise = tensor(14.) 	With Noise = tensor([-271.9333])
N = 30 	Without Noise = tensor(19.) 	With Noise = tensor([18.8040])
N = 40 	Without Noise = tensor(20.) 	With Noise = tensor([140.6546])
N = 50 	Without Noise = tensor(23.) 	With Noise = tensor([-286.5993])
N = 60 	Without

## Lesson 6: Differential Privacy for Deep Learning

So in the last lessons you may have been wondering - what does all of this have to do with Deep Learning? Well, these same techniques we were just studying form the core primitives for how Differential Privacy provides guarantees in the context of Deep Learning. 

Previously, we defined perfect privacy as "a query to a database returns the same value even if we remove any person from the database", and used this intuition in the description of epsilon/delta. In the context of deep learning we have a similar standard.

Training a model on a dataset should return the same model even if we remove any person from the dataset.

Thus, we've replaced "querying a database" with "training a model on a dataset". In essence, the training process is a kind of query. However, one should note that this adds two points of complexity which database queries did not have:

1. Do we always know where "people" are referenced in the dataset?
2. Neural models rarely, if ever, train to the same output model, even on identical data.

The answer to (1) is to treat each training example as a single, separate person. Strictly speaking, this is often overly zealous, as some training examples have no relevance to people and others may relate to several people or only part of one (consider an image with multiple people in it). Thus, localizing exactly where "people" are referenced, and thus how much your model would change if people were removed, is challenging.

The answer to (2) is also an open problem, but several interesting proposals have been made. We're going to focus on one of the most popular proposals, PATE.

## An Example Scenario: A Health Neural Network

First we're going to consider a scenario - you work for a hospital and you have a large collection of images about your patients. However, you don't know what's in them. You would like to use these images to develop a neural network which can automatically classify them, however since your images aren't labeled, they aren't sufficient to train a classifier. 

However, being a cunning strategist, you realize that you can reach out to 10 partner hospitals which DO have annotated data. It is your hope to train your new classifier on their datasets so that you can automatically label your own. While these hospitals are interested in helping, they have privacy concerns regarding information about their patients. Thus, you will use the following technique to train a classifier which protects the privacy of patients in the other hospitals.

- 1) You'll ask each of the 10 hospitals to train a model on their own datasets (All of which have the same kinds of labels)
- 2) You'll then use each of the 10 partner models to predict on your local dataset, generating 10 labels for each of your datapoints
- 3) Then, for each local data point (now with 10 labels), you will perform a DP query to generate the final true label. This query is a "max" function, where "max" is the most frequent label across the 10 labels. We will need to add Laplacian noise to make this differentially private to a certain epsilon/delta constraint.
- 4) Finally, we will retrain a new model on our local dataset which now has labels. This will be our final "DP" model.

So, let's walk through these steps. I will assume you're already familiar with how to train/predict a deep neural network, so we'll skip steps 1 and 2 and work with example data. We'll focus instead on step 3, namely how to perform the DP query for each example using toy data.

So, let's say we have 10,000 training examples, and we've got 10 labels for each example (from our 10 "teacher models" which were trained directly on private data). Each label is chosen from a set of 10 possible labels (categories) for each image.
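Before scaling up to 10,000 examples, the noisy "max" query can be tried on a single example. The sketch below uses only the standard library and a hypothetical vote vector; it adds Laplace noise with scale 1/epsilon to each label count and reports how often the noisy winner matches the plain majority label. The helper names are assumptions made for illustration.

```python
import random
from collections import Counter

def sample_laplace(b, rng):
    """Draw one Laplace(0, b) sample as the difference of two exponentials."""
    return rng.expovariate(1 / b) - rng.expovariate(1 / b)

def noisy_max(votes, num_labels, epsilon, rng):
    """Label with the highest count after adding Laplace(0, 1/epsilon) noise to each count."""
    counts = Counter(votes)
    noisy = [counts.get(label, 0) + sample_laplace(1 / epsilon, rng)
             for label in range(num_labels)]
    return max(range(num_labels), key=lambda label: noisy[label])

votes = [9, 9, 3, 6, 9, 9, 9, 9, 8, 2]  # hypothetical labels from ten teachers for one example
rng = random.Random(0)                   # seeded only to make the sketch repeatable

for epsilon in (0.1, 1.0, 10.0):
    hits = sum(noisy_max(votes, 10, epsilon, rng) == 9 for _ in range(2000))
    print(epsilon, hits / 2000)  # fraction of runs that return the majority label 9
```

A small epsilon means heavy noise, so the returned label often differs from the majority vote; a large epsilon almost always returns the majority label but offers less privacy.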

In [49]:
import numpy as np

In [54]:
num_teachers = 10 # we're working with 10 partner hospitals
num_examples = 10000 # the size of OUR dataset
num_labels = 10 # number of labels for our classifier

In [55]:
preds = (np.random.rand(num_teachers, num_examples) * num_labels).astype(int).transpose(1,0) # fake predictions

In [56]:
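new_labels = list()
for an_image in preds:

    # Count the votes for each label; use floats so the added noise is not truncated to integers
    label_counts = np.bincount(an_image, minlength=num_labels).astype(float)

    epsilon = 0.1
    beta = 1 / epsilon

    # Add independent Laplacian noise to every vote count
    for i in range(len(label_counts)):
        label_counts[i] += np.random.laplace(0, beta)

    # The noisy "max" query: the label with the highest noisy count
    new_label = np.argmax(label_counts)
    
    new_labels.append(new_label)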

In [57]:
# new_labels

## Project - PATE Analysis

For the final project for this section, you're going to be given a dataset which you need to use to train a DP model using this PATE method. 

In [58]:
labels = np.array([9, 9, 3, 6, 9, 9, 9, 9, 8, 2])
counts = np.bincount(labels, minlength=10)
query_result = np.argmax(counts)
query_result

9

In [59]:
from syft.frameworks.torch.differential_privacy import pate

In [61]:
num_teachers, num_examples, num_labels = (100, 100, 10)
preds = (np.random.rand(num_teachers, num_examples) * num_labels).astype(int) #fake preds
indices = (np.random.rand(num_examples) * num_labels).astype(int) # true answers

preds[:,0:10] *= 0

data_dep_eps, data_ind_eps = pate.perform_analysis(teacher_preds=preds, indices=indices, noise_eps=0.1, delta=1e-5)

assert data_dep_eps < data_ind_eps





In [64]:
data_dep_eps, data_ind_eps = pate.perform_analysis(teacher_preds=preds, indices=indices, noise_eps=0.1, delta=1e-5)
print("Data Independent Epsilon:", data_ind_eps)
print("Data Dependent Epsilon:", data_dep_eps)

Data Independent Epsilon: 11.756462732485115
Data Dependent Epsilon: 1.52655213289881


In [65]:
preds[:,0:50] *= 0

In [66]:
data_dep_eps, data_ind_eps = pate.perform_analysis(teacher_preds=preds, indices=indices, noise_eps=0.1, delta=1e-5, moments=20)
print("Data Independent Epsilon:", data_ind_eps)
print("Data Dependent Epsilon:", data_dep_eps)

Data Independent Epsilon: 11.756462732485115
Data Dependent Epsilon: 0.9029013677789843
