<b>Generate parallel databases</b>
<p>Key to the definition of differential privacy is the ability to ask the question "When querying a database, if I removed someone from the database, would the output of the query be any different?" To check this, we must construct what we term "parallel databases", which are simply databases with one entry removed.</p>
<p>In this first project, I am going to create a list of every parallel database to the one currently contained in the "db" variable.</p>
Then, I am going to create a function which both:
<ul>
    <li>creates the initial database (db)</li>
    <li>creates all parallel databases</li>
</ul>

In [1]:
import torch

#the number of entries in our database
num_entries = 5000

db = torch.rand(num_entries) > 0.5
db

tensor([1, 0, 1,  ..., 0, 0, 0], dtype=torch.uint8)

In [2]:
remove_index = 2
db[0:5]

tensor([1, 0, 1, 0, 1], dtype=torch.uint8)

In [3]:
def get_parallel_db(db, remove_index):
    return torch.cat((db[0:remove_index],
                      db[remove_index + 1:]))

In [4]:
get_parallel_db(db, 3).shape

torch.Size([4999])

In [5]:
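#an out-of-range index removes nothing: slicing past the end is
#silently clipped, so all 5000 entries are returned
get_parallel_db(db, 11111).shape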

torch.Size([5000])
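<p>The result above shows that an out-of-range index goes unnoticed, because Python-style slicing clips indices past the end of the tensor. A minimal sketch of a stricter variant follows; the name get_parallel_db_checked is hypothetical and not part of the original notebook.</p>

```python
import torch

def get_parallel_db_checked(db, remove_index):
    """Return db without the entry at remove_index.

    Hypothetical variant of get_parallel_db: it raises an IndexError
    instead of silently returning the full database when the index
    is out of range.
    """
    if not 0 <= remove_index < len(db):
        raise IndexError(
            f"remove_index {remove_index} is out of range "
            f"for a database of size {len(db)}"
        )
    return torch.cat((db[:remove_index], db[remove_index + 1:]))

small_db = torch.tensor([1, 0, 1, 0, 1])
print(get_parallel_db_checked(small_db, 2))  # the entry at index 2 is removed
```

<p>Whether to raise or to clip is a design choice; raising makes indexing mistakes visible early.</p>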

In [6]:
def get_parallel_dbs(db):
    
    parallel_dbs = list()
    
    for i in range(len(db)):
        pdb = get_parallel_db(db, i)
        parallel_dbs.append(pdb)
    
    return parallel_dbs

In [7]:
pdbs = get_parallel_dbs(db)

In [8]:
pdbs

[tensor([0, 1, 0,  ..., 0, 0, 0], dtype=torch.uint8),
 tensor([1, 1, 0,  ..., 0, 0, 0], dtype=torch.uint8),
 tensor([1, 0, 0,  ..., 0, 0, 0], dtype=torch.uint8),
 tensor([1, 0, 1,  ..., 0, 0, 0], dtype=torch.uint8),
 tensor([1, 0, 1,  ..., 0, 0, 0], dtype=torch.uint8),
 tensor([1, 0, 1,  ..., 0, 0, 0], dtype=torch.uint8),
 tensor([1, 0, 1,  ..., 0, 0, 0], dtype=torch.uint8),
 tensor([1, 0, 1,  ..., 0, 0, 0], dtype=torch.uint8),
 tensor([1, 0, 1,  ..., 0, 0, 0], dtype=torch.uint8),
 tensor([1, 0, 1,  ..., 0, 0, 0], dtype=torch.uint8),
 tensor([1, 0, 1,  ..., 0, 0, 0], dtype=torch.uint8),
 tensor([1, 0, 1,  ..., 0, 0, 0], dtype=torch.uint8),
 tensor([1, 0, 1,  ..., 0, 0, 0], dtype=torch.uint8),
 tensor([1, 0, 1,  ..., 0, 0, 0], dtype=torch.uint8),
 tensor([1, 0, 1,  ..., 0, 0, 0], dtype=torch.uint8),
 tensor([1, 0, 1,  ..., 0, 0, 0], dtype=torch.uint8),
 tensor([1, 0, 1,  ..., 0, 0, 0], dtype=torch.uint8),
 tensor([1, 0, 1,  ..., 0, 0, 0], dtype=torch.uint8),
 tensor([1, 0, 1,  ..., 0, 0

In [9]:
def create_db_and_parallels(num_entries):
    db = torch.rand(num_entries) > 0.5
    pdbs = get_parallel_dbs(db)
    
    return db, pdbs

In [10]:
db, pdbs = create_db_and_parallels(20)

In [11]:
db

tensor([1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1],
       dtype=torch.uint8)

In [12]:
pdbs

[tensor([0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1],
        dtype=torch.uint8),
 tensor([1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1],
        dtype=torch.uint8),
 tensor([1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1],
        dtype=torch.uint8),
 tensor([1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1],
        dtype=torch.uint8),
 tensor([1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1],
        dtype=torch.uint8),
 tensor([1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1],
        dtype=torch.uint8),
 tensor([1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1],
        dtype=torch.uint8),
 tensor([1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1],
        dtype=torch.uint8),
 tensor([1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1],
        dtype=torch.uint8),
 tensor([1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1],
        dtype=torch.uint8),
 tensor([1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1,

<b>Towards evaluating the differential privacy of a function</b>

<p>
Intuitively, we want to query our database and evaluate whether the result of the query leaks private information. This means evaluating whether the output of a query changes when we remove someone from the database. Specifically, we want the maximum amount by which the query changes when someone is removed (the maximum over all people who could be removed). So, to evaluate how much privacy is leaked, we're going to iterate over each person in the database and measure the difference in the output of the query relative to querying the entire database.
</p>
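<p>The quantity described above can be written compactly. The notation is an assumption introduced here for illustration: D is the full database with n entries, D_{-i} is the parallel database with entry i removed, and f is the query.</p>

```latex
\Delta f(D) \;=\; \max_{0 \le i < n} \bigl| f(D) - f(D_{-i}) \bigr|
```

<p>This is the empirical sensitivity of f on one particular database D, which is what the loop over parallel databases measures.</p>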
<p>Just for the sake of argument, let's make our first "database query" a simple sum. In other words, we're going to count the number of 1s in the database.</p>

In [13]:
db, pdbs = create_db_and_parallels(5000)

In [14]:
db

tensor([0, 0, 0,  ..., 0, 1, 0], dtype=torch.uint8)

In [15]:
def query(db):
    return db.sum()

In [16]:
query(db)

tensor(2477)

In [17]:
#next we see that:
#when we remove data from the database, the output of the query may change;
#here it stays at 2477, so the removed entry (index 5) was a 0
query(pdbs[5])

tensor(2477)

In [18]:
#let's see the max amount of change
full_db_result = query(db)

In [19]:
max_distance = 0
for pdb in pdbs:
    pdb_result = query(pdb)
    
    #L1 sensitivity
    db_distance = torch.abs(pdb_result - full_db_result)
    
    if (db_distance > max_distance):
        max_distance = db_distance

In [20]:
#Sensitivity
max_distance

tensor(1)

Some facts:
Every entry is either 0 or 1, so removing one entry changes the sum by at most 1. max_distance is therefore exactly 1 for this query whenever the database contains at least one 1.
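<p>A quick check of this fact on two tiny hand-written databases, including the edge case of a database with no 1s, where removing any entry leaves the sum unchanged. The helper name sum_sensitivity is hypothetical.</p>

```python
import torch

def sum_sensitivity(db):
    """Empirical max change of the sum query when one entry is removed."""
    full_result = db.sum()
    changes = []
    for i in range(len(db)):
        parallel_db = torch.cat((db[:i], db[i + 1:]))
        # Removing entry i lowers the sum by exactly db[i] (0 or 1).
        changes.append(abs(int(full_result - parallel_db.sum())))
    return max(changes)

print(sum_sensitivity(torch.tensor([1, 0, 1, 0, 0])))  # 1: some entry is a 1
print(sum_sensitivity(torch.tensor([0, 0, 0, 0, 0])))  # 0: no entry is a 1
```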

<b>Evaluating the privacy of a function</b>

Create a function that measures the sensitivity of any query over a database and its parallel databases

In [21]:
#Let's try to calculate the sensitivity for the "mean" function
def sensitivity(query, n_entries = 1000):
    
    db, pdbs = create_db_and_parallels(n_entries)
    
    full_db_result = query(db)
    
    max_distance = 0
    for pdb in pdbs:
        pdb_result = query(pdb)
        
        #L1 sensitivity
        db_distance = torch.abs(pdb_result - full_db_result)
        
        if (db_distance > max_distance):
            max_distance = db_distance
    return max_distance

In [22]:
def query(db):
    return db.float().mean()

In [23]:
sensitivity(query)

tensor(0.0005)
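<p>The value 0.0005 can be cross-checked by hand. The symbols below are assumptions introduced for illustration: x_j are the entries, S is their sum, and n is the number of entries (1000 here).</p>

```latex
\begin{aligned}
\bar{x} &= \frac{S}{n}, \qquad S = \sum_{j=0}^{n-1} x_j \\
\bar{x}_{-i} &= \frac{S - x_i}{n - 1} \\
\left| \bar{x} - \bar{x}_{-i} \right|
  &= \left| \frac{S(n-1) - n(S - x_i)}{n(n-1)} \right|
   = \frac{\left| x_i - \bar{x} \right|}{n - 1}
\end{aligned}
```

<p>With entries in {0, 1}, a mean close to 0.5 and n = 1000, the largest change is roughly 0.5 / 999 ≈ 0.0005, which agrees with the measured result.</p>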

Some conclusions:
We are assuming that each entry corresponds to a person. We care about sensitivity to people: not the effect of removing individual values, but the effect of removing ALL values related to a person
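<p>In this notebook each person contributes exactly one entry, so removing an entry and removing a person coincide. A minimal sketch of the more general case, where one person may own several rows, might look as follows. The owner_ids tensor and the function name are hypothetical and do not appear in the original notebook.</p>

```python
import torch

def person_level_sensitivity(values, owner_ids, query):
    """Max change in the query output when ALL rows of one person are removed.

    values and owner_ids are hypothetical parallel tensors: owner_ids[k]
    identifies the person that row k of values belongs to.
    """
    full_result = query(values)
    max_distance = 0
    for person in owner_ids.unique():
        kept = values[owner_ids != person]  # drop every row of this person
        distance = torch.abs(query(kept) - full_result).item()
        max_distance = max(max_distance, distance)
    return max_distance

values = torch.tensor([1, 1, 0, 1, 1, 0])
owner_ids = torch.tensor([0, 0, 1, 2, 2, 2])  # person 0 owns rows 0 and 1, and so on
print(person_level_sensitivity(values, owner_ids, lambda v: v.sum()))  # 2
```

<p>Here the sum query changes by 2 when person 0 or person 2 is removed, even though removing any single row changes it by at most 1.</p>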

<b>Calculate L1 sensitivity for threshold</b>
<p>I am going to calculate the sensitivity for the "threshold" function</p>
<ul>
    <li>Create the query function: a sum over the database that returns whether the sum is greater than a threshold</li>
    <li>Create 10 databases of size 10 (threshold = 5) and query them, calculating the sensitivity of each</li>
    <li>Print out the sensitivity of each database</li>
</ul>

In [24]:
def query(db, threshold = 5):
    return (db.sum() > threshold).float()

In [25]:
for i in range(10):
    sens_f = sensitivity(query, n_entries = 10)
    print(sens_f)

0
0
0
0
0
0
tensor(1.)
0
0
tensor(1.)


Some conclusions:
<ul>
    <li>When n_entries is less than or equal to the threshold, the sum can never exceed the threshold, so the query returns 0 for the original and every parallel database, giving sensitivity 0</li>
    <li>When the original database returns 1 and every parallel database also returns 1, the sensitivity is 0</li>
    <li>When n_entries is greater than the threshold, the original database returns 1, and some parallel database returns 0, the sensitivity is 1</li>
</ul>
<p>Thus the sensitivity is conditioned on the database: it depends on the data the database contains</p>
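<p>To make the dependence on the data concrete, the sketch below builds databases of size 10 with a fixed number of 1s instead of random ones. With integer sums, the query can flip from 1 to 0 only when the sum is exactly threshold + 1, because removing a 1 then drops the sum to the threshold. The helper names are hypothetical.</p>

```python
import torch

def threshold_query(db, threshold=5):
    return (db.sum() > threshold).float()

def threshold_sensitivity(db, threshold=5):
    """Empirical sensitivity of the threshold query on one database."""
    full_result = threshold_query(db, threshold)
    distances = []
    for i in range(len(db)):
        parallel_db = torch.cat((db[:i], db[i + 1:]))
        distance = torch.abs(threshold_query(parallel_db, threshold) - full_result)
        distances.append(distance.item())
    return max(distances)

# One database per possible count of 1s, from 0 to 10.
for ones in range(11):
    db = torch.cat((torch.ones(ones), torch.zeros(10 - ones)))
    print(ones, threshold_sensitivity(db))
```

<p>Only the database with six 1s (threshold + 1) reports a sensitivity of 1.0; all others report 0.0.</p>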

<b>Perform a differencing attack</b>
<p>I am going to perform a basic differencing attack to divulge the value in row 10 of the database</p>
<p>To do this, I am going to first query the entire database, and then the database without row 10</p>

In [32]:
db, _ = create_db_and_parallels(100)

In [33]:
pdb = get_parallel_db(db, remove_index = 10)

In [34]:
db[10]

tensor(1, dtype=torch.uint8)

In [39]:
sum(db)

tensor(43, dtype=torch.uint8)

In [36]:
#differencing attack using sum query
sum(db) - sum(pdb)

tensor(1, dtype=torch.uint8)

In [38]:
#differencing attack using mean query
(sum(db).float()/len(db)) - (sum(pdb).float()/len(pdb))

tensor(0.0058)

In [40]:
#differencing attack using threshold
(sum(db).float() > 42) - (sum(pdb).float() > 42)

tensor(1, dtype=torch.uint8)
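<p>The sum-based attack can be wrapped in a small helper. This is a sketch under the assumption that the attacker may run a sum query on both the full database and the database without the target row; the function name differencing_attack is hypothetical. The values are cast to integers before subtracting so that the arithmetic is also well defined for boolean tensors.</p>

```python
import torch

def differencing_attack(db, target_index):
    """Recover db[target_index] from two sum queries."""
    without_target = torch.cat((db[:target_index], db[target_index + 1:]))
    # The difference of the two sums is exactly the removed value.
    return int(db.long().sum() - without_target.long().sum())

torch.manual_seed(0)  # fixed seed only to make the run repeatable
db = torch.rand(100) > 0.5
recovered = differencing_attack(db, 10)
print(recovered == int(db[10]))  # True: the private value is exposed
```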