In [1]:
# We are now ready to look at more robust techniques for protecting data privacy against a differencing attack. This is
# achieved by adding noise to the data. Noise can be added at two levels, which gives rise to local and global
# differential privacy. When the noise is added by the data owner before the data is submitted to the database, it is
# called Local Differential Privacy.

In [2]:
# Here the noise is added at the local level, before the data is sent to the database. But the question is: how much
# noise is enough? This question is addressed with the help of randomized response.

## Randomized Response
A technique used in the social sciences when surveying people about unlawful or taboo behaviour, in order to learn high-level trends about that behaviour.

In [3]:
# For instance, consider a sociologist trying to learn how many people have committed a certain crime. Many people will
# not be inclined to answer this honestly. So we rely on plausible deniability: each person answers the
# question based on the outcome of two coin flips, which the surveyor is not allowed to see. It goes like this:

## Plausible Deniability
 - Flip a coin two times.
 - If the first coin flip is heads, answer (yes/no) honestly.
 - If the first coin flip is tails, answer according to the second coin flip, i.e., yes if heads and no if tails.
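
The steps above can be sketched for a single respondent as follows. This is a minimal illustration in plain Python; the function name `randomized_response` and the use of the standard `random` module are assumptions made for this sketch, not part of the notebook's PyTorch code.

```python
import random

def randomized_response(truth, rng):
    """Return the answer one respondent reports (hypothetical helper).

    truth: the respondent's honest answer (True for yes, False for no).
    rng:   a random.Random instance used for the two coin flips.
    """
    first_flip_heads = rng.random() < 0.5
    second_flip_heads = rng.random() < 0.5
    if first_flip_heads:
        return truth              # heads: answer honestly
    return second_flip_heads      # tails: yes if the second flip is heads, else no

rng = random.Random(0)
n = 100_000
# Fraction of "yes" reports among people whose honest answer is yes / no.
yes_rate_if_true = sum(randomized_response(True, rng) for _ in range(n)) / n
yes_rate_if_false = sum(randomized_response(False, rng) for _ in range(n)) / n
print(yes_rate_if_true, yes_rate_if_false)
```

In expectation, an honest "yes" is reported as yes with probability 0.5 + 0.25 = 0.75, and an honest "no" is reported as yes with probability 0.25, so a single "yes" never proves the underlying answer.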

In [4]:
# Now, on average, half of the responses will be honest, and for the other half there is a 50-50 chance that the
# coin-determined answer matches the honest one.
# The interesting thing is that a person who answers 'yes' has a degree of plausible deniability: they may be
# answering that way only because of the coin flip. So each individual's data gets Local Differential Privacy,
# which gives them the freedom to answer more honestly. The researcher can then aggregate over the whole
# population and remove the added noise to get an accurate statistic.
# To understand this process, let's assume that 70% of people were actually involved in the taboo behaviour that we're
# surveying for.

From the randomized responses, about 50% of the population will say yes with 70% probability (i.e., honestly, when the first coin flip is heads), and the other half will say yes with 50% probability (when the first coin flip is tails, the answer follows the second coin flip, which is again an equiprobable event). So the expected share of yes answers is the average of the two halves, i.e., the average of 50% and 70%, which is 60%.

This value of around 60% is what we would observe in the survey. We can recover the actual value of around 70% because we know that the 60% is the average of 50% and the true percentage of people who committed the act in question: 2 × 60% - 50% = 70%. This knowledge is obtained without knowing whether any particular person was involved or not.

This technique is quite promising, but it comes at the cost of accuracy. The added noise averages out only when a large number of people take part in the study. If the population is small, then the randomness of the coin flips can skew the estimate considerably.
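
The arithmetic above can be checked with a short sketch. The helper names `expected_yes_rate` and `recover_true_rate` are hypothetical, and both assume the fair two-coin scheme described earlier.

```python
def expected_yes_rate(true_rate):
    """Expected share of 'yes' answers under the fair two-coin scheme."""
    # Half answer honestly (true_rate), half follow a fair coin (0.5).
    return 0.5 * true_rate + 0.5 * 0.5

def recover_true_rate(observed_rate):
    """Invert expected_yes_rate: observed = 0.5 * true + 0.25."""
    return 2 * observed_rate - 0.5

observed = expected_yes_rate(0.7)      # what the survey would show on average
estimate = recover_true_rate(observed) # de-skewed estimate of the true rate
print(round(observed, 4), round(estimate, 4))
```

The inversion is exact only in expectation; on a real sample the observed rate fluctuates around its expected value, and the recovered estimate fluctuates with it.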

In [5]:
# With the methodology in place, we can now implement Local Differential Privacy using randomized responses via
# coin flips. Let's use the mean function as our query function.

In [19]:
# Database generation functions

import torch

def get_parallel_db(db,remove_index):
    return torch.cat((db[0:remove_index], db[remove_index+1:]))

def get_parallel_dbs(db):
    parallel_dbs = list()
    
    for i in range(len(db)):
        parallel_dbs.append(get_parallel_db(db, i))
    
    return parallel_dbs

def create_db_and_pdbs(num_entries):
    
    db = (torch.rand(num_entries) > 0.5).float()
    pdbs = get_parallel_dbs(db)
    
    return db, pdbs

In [20]:
db, pdbs = create_db_and_pdbs(100)

In [21]:
true_result = torch.mean(db.float())
true_result

tensor(0.5200)

In [22]:
db

tensor([0., 0., 0., 0., 1., 1., 0., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 0.,
        0., 1., 1., 0., 1., 0., 0., 1., 1., 1., 0., 1., 0., 1., 1., 1., 0., 0.,
        1., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1., 0., 1.,
        0., 1., 0., 1., 1., 1., 1., 1., 0., 0., 1., 0., 0., 1., 1., 0., 1., 1.,
        1., 0., 1., 0., 0., 1., 1., 0., 1., 0., 1., 1., 0., 1., 0., 0., 0., 1.,
        1., 1., 0., 0., 1., 0., 0., 1., 0., 1.])

Now we want to add noise by replacing some of the above data points with randomized responses, as explained above. To do that, we first perform the coin flips for each individual in the database.

In [23]:
first_coin_flip = (torch.rand(len(db))>0.5).float()

In [24]:
first_coin_flip

tensor([1., 0., 1., 1., 1., 1., 1., 0., 0., 0., 0., 1., 1., 1., 0., 1., 0., 1.,
        0., 0., 0., 0., 0., 1., 0., 0., 1., 1., 1., 1., 1., 0., 1., 0., 1., 1.,
        0., 1., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0.,
        1., 0., 1., 1., 1., 0., 1., 0., 1., 0., 0., 1., 1., 0., 0., 1., 0., 1.,
        1., 1., 1., 0., 1., 0., 1., 0., 1., 0., 1., 1., 1., 1., 1., 0., 1., 0.,
        0., 1., 0., 1., 1., 0., 1., 0., 1., 1.])

In [25]:
second_coin_flip = (torch.rand(len(db))>0.5).float()
second_coin_flip

tensor([0., 0., 0., 0., 0., 1., 0., 0., 1., 1., 0., 0., 1., 0., 1., 1., 1., 0.,
        0., 0., 1., 0., 1., 1., 0., 0., 1., 1., 1., 1., 1., 1., 0., 1., 1., 0.,
        1., 1., 0., 0., 1., 0., 0., 1., 0., 1., 1., 1., 1., 0., 1., 1., 0., 1.,
        1., 1., 0., 0., 1., 1., 0., 0., 1., 0., 0., 1., 1., 0., 1., 0., 0., 0.,
        1., 0., 0., 1., 1., 0., 1., 1., 0., 1., 1., 0., 1., 0., 0., 0., 0., 0.,
        1., 0., 0., 0., 1., 1., 0., 1., 0., 1.])

In [26]:
# Here a 1 acts as heads and a 0 as tails, so the first coin flip decides whether the person answers honestly or not. Also, a 1
# in the database means yes and a 0 means no, i.e., those are all the honest responses. So we can keep the honest answers where
# the first coin flip is heads (and zero out the rest) by simply multiplying the database with the first coin flip.
db*first_coin_flip

tensor([0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1., 1., 0., 0., 1., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 1., 0., 0., 1., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
        0., 0., 0., 1., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
        1., 0., 1., 0., 0., 0., 1., 0., 1., 0., 1., 1., 0., 1., 0., 0., 0., 0.,
        0., 1., 0., 0., 1., 0., 0., 0., 0., 1.])

In [27]:
# Now we want the response to follow the second coin flip at the positions where the first coin flip was tails (0).
# That is, we want to put the outcome of the second flip at the (1-first_coin_flip) positions. Adding the
# product of the database and the first coin flip to the product of (1-first_coin_flip) and the second coin flip gives
# us the augmented database, which is differentially private.
augmented_database = db.float()*first_coin_flip + (1-first_coin_flip)*second_coin_flip
augmented_database

tensor([0., 0., 0., 0., 1., 1., 0., 0., 1., 1., 0., 1., 1., 0., 1., 1., 1., 0.,
        0., 0., 1., 0., 1., 0., 0., 0., 1., 1., 0., 1., 0., 1., 1., 1., 0., 0.,
        1., 0., 0., 0., 0., 0., 0., 1., 0., 1., 1., 1., 1., 0., 1., 1., 0., 1.,
        0., 1., 0., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 1.,
        1., 0., 1., 1., 0., 0., 1., 1., 1., 1., 1., 1., 0., 1., 0., 0., 0., 0.,
        1., 1., 0., 0., 1., 1., 0., 1., 0., 1.])
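
To see what the mask arithmetic does, here is a tiny hand-checkable sketch with four respondents. It uses plain Python lists instead of tensors, and the values are made up for illustration only.

```python
# Hypothetical toy values: 1 = yes/heads, 0 = no/tails.
honest_answers   = [1, 1, 0, 0]
first_coin_flip  = [1, 0, 1, 0]
second_coin_flip = [0, 0, 1, 1]

augmented = [
    # Keep the honest answer where the first flip is heads,
    # otherwise take the outcome of the second flip.
    answer * first + (1 - first) * second
    for answer, first, second in zip(honest_answers, first_coin_flip, second_coin_flip)
]
print(augmented)  # [1, 0, 0, 1]
```

Respondents 0 and 2 (first flip heads) keep their honest answers, while respondents 1 and 3 (first flip tails) report the second flip, which here happens to differ from their honest answer.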

In [28]:
# The augmented database is differentially private, but we still want the output of our query on it to be
# close to the output of the same query on the original db.
# As discussed above, the output of the query is skewed towards 0.5, because it averages 0.5 with the true result of
# the query. This means that we need to de-skew the output: double it and subtract 0.5.
dp_result = torch.mean(augmented_database.float())*2-0.5
dp_result

tensor(0.4800)

In [30]:
# Now we can package all this functionality into a single query function
def query(db):
    true_result = torch.mean(db.float())
    first_coin_flip = (torch.rand(len(db))>0.5).float()
    second_coin_flip = (torch.rand(len(db))>0.5).float()
    augmented_database = db.float()*first_coin_flip+(1-first_coin_flip)*second_coin_flip
    dp_result = torch.mean(augmented_database.float())*2-0.5
    return dp_result, true_result

In [31]:
query(db)

(tensor(0.4800), tensor(0.5200))

In [41]:
# Now we run the query on databases of different sizes
db, pdbs = create_db_and_pdbs(10000) #vary size here
private_result, true_result = query(db)
print("With Noise:"+str(private_result))
print("Without Noise:"+str(true_result))

With Noise:tensor(0.5020)
Without Noise:tensor(0.4965)
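
The effect of database size can be explored with a small experiment along these lines. This is a sketch in plain Python rather than the notebook's PyTorch code; `private_mean`, `mean_abs_error`, the trial count, and the sizes are assumptions chosen for illustration.

```python
import random

def private_mean(db, rng):
    """De-skewed mean of a randomized-response copy of db (list of 0/1)."""
    noisy = []
    for answer in db:
        first_heads = rng.random() < 0.5
        second_heads = rng.random() < 0.5
        noisy.append(answer if first_heads else int(second_heads))
    return 2 * sum(noisy) / len(noisy) - 0.5

def mean_abs_error(size, trials, rng):
    """Average |private - true| over several random databases of one size."""
    total = 0.0
    for _ in range(trials):
        db = [int(rng.random() > 0.5) for _ in range(size)]
        true_mean = sum(db) / size
        total += abs(private_mean(db, rng) - true_mean)
    return total / trials

rng = random.Random(0)
errors = {size: mean_abs_error(size, trials=20, rng=rng)
          for size in (10, 100, 1000, 10000)}
for size, err in errors.items():
    print(size, round(err, 4))
```

The gap between the private and true result should shrink roughly with the square root of the database size, which is why the 10,000-entry run above lands much closer to the true mean than a 10-entry run typically would.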


In [None]:
# Our next goal is to vary the amount of noise that we add to the data. Basically, we want to bias the first coin
# flip so that there is an increased or decreased probability of getting a 1.
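
One possible way to do this is sketched below; it is an assumption about how the biased version could look, not the notebook's own implementation. The parameter name `honesty_prob` (the probability that the first flip is heads) and the function `biased_private_mean` are hypothetical. With a biased first coin the expected noisy mean is `honesty_prob * true_mean + (1 - honesty_prob) * 0.5`, so the de-skewing step inverts that expression.

```python
import random

def biased_private_mean(db, honesty_prob, rng):
    """Randomized response with a biased first coin (hypothetical sketch).

    db:           list of honest 0/1 answers.
    honesty_prob: probability that the first flip is heads (answer honestly);
                  must be > 0, otherwise the true mean cannot be recovered.
    """
    noisy = []
    for answer in db:
        first_heads = rng.random() < honesty_prob
        second_heads = rng.random() < 0.5   # the second coin stays fair
        noisy.append(answer if first_heads else int(second_heads))
    noisy_mean = sum(noisy) / len(noisy)
    # Invert: noisy_mean = honesty_prob * true_mean + (1 - honesty_prob) * 0.5
    return (noisy_mean - (1 - honesty_prob) * 0.5) / honesty_prob

rng = random.Random(0)
db = [int(rng.random() > 0.5) for _ in range(50_000)]
true_mean = sum(db) / len(db)
for honesty_prob in (0.9, 0.5, 0.2):
    estimate = biased_private_mean(db, honesty_prob, rng)
    print(honesty_prob, round(estimate, 4), round(true_mean, 4))
```

With `honesty_prob = 0.5` the correction reduces to the `*2-0.5` de-skewing used earlier. A lower `honesty_prob` means more noise and more deniability, at the price of a noisier estimate for the same database size.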