# Section 1 - Differential Privacy

In this section we're going to play around with Differential Privacy in the context of a database query.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)]( https://colab.research.google.com/github/MarianoOG/Lesson-Notes-Secure-and-Private-AI)

## Lesson 1: Toy Differential Privacy - Simple Database Queries

The database is going to be a very simple database with only one boolean column. Each row corresponds to a person. Each value corresponds to whether or not that person has a certain private attribute (such as whether they have a certain disease, or whether they are above/below a certain age). 

We are then going to learn how to tell whether a database query over such a small database is differentially private or not and, more importantly, what techniques are at our disposal to ensure various levels of privacy.

### Project - Simple Database and Parallel Databases

Step one is to create our database. We're going to do this by initializing a random list of 1s and 0s (which are the entries in our database).

*Note:* the number of entries directly corresponds to the number of people in our database.

Key to the definition of differential privacy is the ability to ask the question "When querying a database, if I removed someone from the database, would the output of the query be any different?". Thus, in order to check this, we must construct what we term "parallel databases", which are simply databases with one entry removed.

In this first project, I want you to create a list of every parallel database to the one currently contained in the "db" variable. Then, I want you to create a function which both:

- creates the initial database (db)
- creates all parallel databases

In [1]:
# First we import the needed libraries
import torch

# Function to create database
def create_db(n):
    return torch.rand(n) > 0.5

# Function to create 1 parallel database (db with the entry at remove_index dropped)
def get_parallel_db(db, remove_index):
    return torch.cat((db[0:remove_index], 
                      db[remove_index+1:]))

# Function to create all parallel databases (one per entry in db)
def get_parallel_dbs(db):
    parallel_dbs = list()
    for i in range(len(db)):
        pdb = get_parallel_db(db, i)
        parallel_dbs.append(pdb)
    return parallel_dbs

# Function to create db and parallels
def create_db_and_parallels(n):
    db = create_db(n)
    pdbs = get_parallel_dbs(db)
    return db, pdbs

db, pdbs = create_db_and_parallels(20)
print(db)
print(pdbs)

tensor([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1],
       dtype=torch.uint8)
[tensor([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1],
       dtype=torch.uint8), tensor([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1],
       dtype=torch.uint8), tensor([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1],
       dtype=torch.uint8), tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1],
       dtype=torch.uint8), tensor([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1],
       dtype=torch.uint8), tensor([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1],
       dtype=torch.uint8), tensor([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1],
       dtype=torch.uint8), tensor([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1],
       dtype=torch.uint8), tensor([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1],
       dtype=torch.uint8), tensor([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1],
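The printed tensors are hard to read at 20 entries, so here is a smaller, hand-checkable sketch of the same idea. It uses plain Python lists instead of torch tensors; the helper name `parallel_dbs_list` and the 4-entry database are assumptions made only for illustration.

```python
def parallel_dbs_list(db):
    """Return every copy of db with exactly one entry removed."""
    return [db[:i] + db[i + 1:] for i in range(len(db))]

# Hypothetical 4-person database (1 = has the private attribute)
small_db = [1, 0, 0, 1]
small_pdbs = parallel_dbs_list(small_db)

for i, pdb in enumerate(small_pdbs):
    print(f"removed index {i}: {pdb}")
```

A database with n entries yields n parallel databases, each with n - 1 entries.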
 

## Lesson 2: Towards Evaluating The Differential Privacy of a Function

Intuitively, we want to be able to query our database and evaluate whether or not the result of the query is leaking "private" information. As mentioned previously, this is about evaluating whether the output of a query changes when we remove someone from the database. Specifically, we want to evaluate the *maximum* amount the query changes when someone is removed (maximum over all possible people who could be removed). So, in order to evaluate how much privacy is leaked, we're going to iterate over each person in the database and measure the difference in the output of the query relative to when we query the entire database. 
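Written as a formula, the quantity described above looks as follows. The symbols are notation introduced here for convenience rather than taken from the lesson: $D$ is the full database, $D_{-i}$ is the parallel database with person $i$ removed, $n$ is the number of people, and $f$ is the query.

```latex
\Delta f(D) = \max_{1 \le i \le n} \left| f(D) - f(D_{-i}) \right|
```

Note that this is measured on one particular database, so it can come out smaller than the worst case over all possible databases.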

### Project - Evaluating the Privacy of a Function

Let's make our first "database query" a simple sum. In other words, we're going to count the number of 1s in the database.

We measure the difference between each parallel db's query result and the query result for the entire database, and then take the maximum (which is 1 here). This value is called "sensitivity", and it is a property of the function we chose for the query. Namely, the "sum" query over a database of 0s and 1s has a sensitivity of 1 (a single run can still report 0 if the database it drew contains no 1s). However, we can calculate sensitivity for other functions as well.

Note the intuition here. "Sensitivity" measures how sensitive the output of the query is to a person being removed from the database. For a simple sum, this is always 1, but for the mean, removing a person changes the result of the query by roughly 1 divided by the size of the database (which is much smaller). Thus, "mean" is a VASTLY less "sensitive" function (query) than "sum".
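To make the comparison concrete, here is a small worked example in plain Python. The fixed 10-entry database and the helper names `max_change` and `mean` are assumptions chosen for illustration; the lesson's own code uses random torch tensors instead.

```python
def max_change(query, db):
    """Largest change in query(db) when any single entry is removed."""
    full = query(db)
    return max(abs(full - query(db[:i] + db[i + 1:])) for i in range(len(db)))

def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical database: 10 people, five of them with the attribute
db_example = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]

sum_sensitivity = max_change(sum, db_example)
mean_sensitivity = max_change(mean, db_example)

print(sum_sensitivity)   # 1
print(mean_sensitivity)  # about 0.056, i.e. 0.5 / 9
```

Removing a person with value x from a database of n entries moves the mean by |x - mean| / (n - 1), which is at most 1 / (n - 1), while the sum moves by a full 1 whenever the removed value is 1.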

- Create a list of queries (sum, threshold and mean)
- Calculate sensitivity for each one
- Calculate L1 sensitivity for threshold
    - First compute the sum over the database (i.e. sum(db)) and return whether that sum is greater than a certain threshold.
    - Then, I want you to create databases of size 10 and threshold of 5 and calculate the sensitivity of the function. 
    - Finally, re-initialize the database 10 times and calculate the sensitivity each time.
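The threshold steps above can be sketched as follows. This is a torch-free illustration using Python's `random` module; the names `threshold_query` and `empirical_sensitivity`, and the fixed seed, are assumptions and not part of the lesson's code.

```python
import random

def threshold_query(db, threshold=5):
    """Return 1.0 if the number of 1s exceeds threshold, else 0.0."""
    return float(sum(db) > threshold)

def empirical_sensitivity(query, db):
    """Max change in the query result when one entry is removed."""
    full = query(db)
    return max(abs(full - query(db[:i] + db[i + 1:])) for i in range(len(db)))

random.seed(0)  # assumption: fixed seed so reruns draw the same databases
results = []
for _ in range(10):
    # Re-initialize a random database of size 10 on every iteration
    db = [int(random.random() > 0.5) for _ in range(10)]
    results.append(empirical_sensitivity(threshold_query, db))

print(results)  # each entry is 0.0 or 1.0
```

With a threshold of 5, the result is 1.0 only when the database holds exactly 6 ones, because that is the only case where removing a single person flips the answer. This is why the measured sensitivity varies from one random database to the next.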

In [3]:
# Sum query function
def query_sum(db):
    return db.sum()

# Threshold query function
def query_threshold(db, threshold=5):
    return (db.sum() > threshold).float()

# Mean query function
def query_mean(db):
    return db.float().mean()

# Calculate the empirical sensitivity of a query on a random database of size n
def sensitivity(query, n):
    db, pdbs = create_db_and_parallels(n)
    db_result = query(db)
    max_distance = 0
    # Compare the full-database result with each parallel database's result
    for pdb in pdbs:
        pdb_result = query(pdb)
        db_distance = torch.abs(pdb_result - db_result)
        if db_distance > max_distance:
            max_distance = db_distance
    return max_distance

# Sensitivity for sum
print("Sum query")
for i in range(10):
    s = sensitivity(query_sum, i)
    print(i, s)
    
# Sensitivity for threshold
print("Threshold query")
for i in range(10):
    s = sensitivity(query_threshold, i)
    print(i, s)

# Sensitivity for mean
print("Mean query")
for i in range(10):
    s = sensitivity(query_mean, i)
    print(i, s)


Sum query
0 0
1 tensor(1)
2 tensor(1)
3 tensor(1)
4 tensor(1)
5 tensor(1)
6 tensor(1)
7 tensor(1)
8 tensor(1)
9 tensor(1)
Threshold query
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
Mean query
0 0
1 0
2 0
3 tensor(0.3333)
4 tensor(0.1667)
5 tensor(0.1500)
6 tensor(0.1333)
7 tensor(0.1190)
8 tensor(0.1071)
9 tensor(0.0694)
