## 1.4 Demonstration of Differential Privacy using PyDP
The PyDP package provides a Python API for Google's Differential Privacy library. This example uses the alpha 1.0 version of the package, which has the following limitations:

- Only the Laplace noise generation technique is supported.
- Only integer and floating-point values are supported.

To demonstrate DP, we protect users from a Membership Inference Attack (MIA).

The idea behind DP is that we should **not** be able to identify an individual user from the result of a query.

To show this, we take a database and create a second copy that differs from it in exactly one record; that is, exactly one record from the original database is absent from the copy.
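
Stated a little more formally (this is the standard textbook definition of $\varepsilon$-differential privacy, not something specific to PyDP): a randomized mechanism $M$ is $\varepsilon$-differentially private if, for every pair of databases $D$ and $D'$ that differ in exactly one record and for every set of possible outputs $S$,

$$
\Pr[M(D) \in S] \le e^{\varepsilon} \, \Pr[M(D') \in S]
$$

A smaller $\varepsilon$ forces the two output distributions closer together, so a query result reveals less about whether any single record is present.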

#### The Proof

We have generated a synthetic dataset of 5000 records containing private information such as each user's name and email.

The objective of this notebook is to demonstrate how DP can protect users from MIA.

The dataset records the amount spent by each user (`sales_amount`). If we compare the sum of `sales_amount` over the full database with the sum over a database that has exactly one record fewer, the difference tells us exactly how much the missing user spent, which in turn can identify that user. With DP, we can prevent this.
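
The differencing idea can be sketched in a few lines of plain Python. The list below is a toy stand-in for the `sales_amount` column, not the real dataset, and the variable names are chosen for illustration only.

```python
# Toy stand-in for the sales_amount column (illustrative values only).
sales = [31.94, 12.46, 191.14, 126.58]

# The same data with exactly one record (the first one) removed.
sales_minus_one = sales[1:]

# An attacker who can query both exact sums recovers the missing value by subtraction.
recovered = round(sum(sales) - sum(sales_minus_one), 2)
print(recovered)  # 31.94
```

No names or emails are involved: two exact aggregate queries are enough to expose one person's value.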

In [1]:
!pip install python-dp # installing PyDP



In [2]:
import pydp as dp # by convention our package is to be imported as dp (for Differential Privacy!)
from pydp.algorithms.laplacian import BoundedSum, BoundedMean, Count, Max
import pandas as pd
import statistics # for calculating mean without applying differential privacy


### Fetching the Data and loading it! 

In [3]:
# get the sales data from our public GitHub repo
url1 = 'https://raw.githubusercontent.com/OpenMined/PyDP/demo/examples/launch_demo/data/01.csv'
df1 = pd.read_csv(url1,sep=",", engine = "python")
df1.head()

Unnamed: 0,id,first_name,last_name,email,sales_amount,state
0,1,Osbourne,Gillions,ogillions0@feedburner.com,31.94,Florida
1,2,Glynn,Friett,gfriett1@blog.com,12.46,California
2,3,Jori,Blockley,jblockley2@unesco.org,191.14,Colorado
3,4,Garald,Dorian,gdorian3@webeden.co.uk,126.58,Texas
4,5,Mercy,Pilkington,mpilkington4@jugem.jp,68.32,Florida


In [4]:
url2 = 'https://raw.githubusercontent.com/OpenMined/PyDP/demo/examples/launch_demo/data/02.csv'
df2 = pd.read_csv(url2,sep=",", engine = "python")
df2.head()

Unnamed: 0,id,first_name,last_name,email,sales_amount,state
0,1,Wallie,Kaman,wkaman0@samsung.com,99.69,Idaho
1,2,Raynard,Tooby,rtooby1@indiegogo.com,208.61,Texas
2,3,Mandie,Stallibrass,mstallibrass2@princeton.edu,42.87,Michigan
3,4,Nonna,Regitz,nregitz3@icq.com,160.94,Iowa
4,5,Barthel,Cowgill,bcowgill4@tiny.cc,179.88,Ohio


In [5]:
url3 = 'https://raw.githubusercontent.com/OpenMined/PyDP/demo/examples/launch_demo/data/03.csv'
df3 = pd.read_csv(url3,sep=",", engine = "python")
df3.head()

Unnamed: 0,id,first_name,last_name,email,sales_amount,state
0,1,Tomasina,Marcos,tmarcos0@wix.com,161.38,Indiana
1,2,Mill,Yitzhak,myitzhak1@barnesandnoble.com,182.22,Florida
2,3,Hobart,Banaszczyk,hbanaszczyk2@mac.com,41.67,Texas
3,4,Bonita,Benting,bbenting3@smugmug.com,190.26,Indiana
4,5,Kasper,Deyes,kdeyes4@storify.com,177.94,Ohio


In [6]:
url4 = 'https://raw.githubusercontent.com/OpenMined/PyDP/demo/examples/launch_demo/data/04.csv'
df4 = pd.read_csv(url4,sep=",", engine = "python")
df4.head()

Unnamed: 0,id,first_name,last_name,email,sales_amount,state
0,1,Dylan,Mattocks,dmattocks0@elegantthemes.com,141.9,Wisconsin
1,2,Tully,Pettko,tpettko1@engadget.com,15.09,Missouri
2,3,Ruy,Rodrigo,rrodrigo2@whitehouse.gov,90.72,Florida
3,4,Blakeley,Lower,blower3@macromedia.com,29.87,California
4,5,Horace,Studdert,hstuddert4@theatlantic.com,196.99,Ohio


In [7]:
url5 = 'https://raw.githubusercontent.com/OpenMined/PyDP/demo/examples/launch_demo/data/05.csv'
df5 = pd.read_csv(url5,sep=",", engine = "python")
df5.head()

Unnamed: 0,id,first_name,last_name,email,sales_amount,state
0,1,Susi,Barker,sbarker0@comsenz.com,220.5,Kentucky
1,2,Gan,Stork,gstork1@who.int,31.75,California
2,3,Corene,Izod,cizod2@wikia.com,163.53,California
3,4,Cornell,Schoales,cschoales3@freewebs.com,59.09,Minnesota
4,5,Petrina,Kennaird,pkennaird4@patch.com,186.38,Georgia


#### Combining the whole data into one single dataframe. 

In [8]:
combined_df_temp = [df1, df2, df3, df4, df5]
orignial_dataset = pd.concat(combined_df_temp)

The size of the combined dataset: 

In [9]:
orignial_dataset.shape

(5000, 6)

Now we create a second dataset for testing DP by removing exactly one record from the original database.

In [10]:
reduct_dataset = orignial_dataset.copy()
# drop the first row, so this dataset does not contain:
# Osbourne, Gillions, ogillions0@feedburner.com, 31.94, Florida
reduct_dataset = reduct_dataset.iloc[1:]

In [11]:
orignial_dataset.head()

Unnamed: 0,id,first_name,last_name,email,sales_amount,state
0,1,Osbourne,Gillions,ogillions0@feedburner.com,31.94,Florida
1,2,Glynn,Friett,gfriett1@blog.com,12.46,California
2,3,Jori,Blockley,jblockley2@unesco.org,191.14,Colorado
3,4,Garald,Dorian,gdorian3@webeden.co.uk,126.58,Texas
4,5,Mercy,Pilkington,mpilkington4@jugem.jp,68.32,Florida


In [12]:
reduct_dataset.head()

Unnamed: 0,id,first_name,last_name,email,sales_amount,state
1,2,Glynn,Friett,gfriett1@blog.com,12.46,California
2,3,Jori,Blockley,jblockley2@unesco.org,191.14,Colorado
3,4,Garald,Dorian,gdorian3@webeden.co.uk,126.58,Texas
4,5,Mercy,Pilkington,mpilkington4@jugem.jp,68.32,Florida
5,6,Elle,McConachie,emcconachie5@census.gov,76.91,Texas


If we compute the sum of `sales_amount` in `orignial_dataset` and in `reduct_dataset`, the difference between the two sums should be exactly the amount spent (`sales_amount`) by Osbourne Gillions.

In [13]:
sum_orignial_dataset = round(sum(orignial_dataset['sales_amount'].to_list()), 2)
sum_reduct_dataset = round(sum(reduct_dataset['sales_amount'].to_list()), 2)
sales_amount_Osbourne = round((sum_orignial_dataset - sum_reduct_dataset), 2)

In [14]:
assert sales_amount_Osbourne == orignial_dataset.iloc[0, 4]
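
The equality check above works because both sides are rounded to two decimals. A slightly more defensive variant, sketched here on toy values rather than the real dataset, uses `math.fsum` for the totals and `math.isclose` for the comparison, so accumulated floating-point error cannot cause a spurious failure.

```python
import math

# Toy stand-ins for the two sales_amount columns (not the real data).
full = [31.94, 12.46, 191.14, 126.58, 68.32]
reduced = full[1:]

# math.fsum tracks partial sums precisely, avoiding accumulated rounding error.
difference = math.fsum(full) - math.fsum(reduced)

# Compare with a tolerance of half a cent instead of exact equality.
matches = math.isclose(difference, full[0], abs_tol=0.005)
print(matches)
```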

Clearly, with traditional methods, even if we remove private information such as name and email, we can still infer the identity of a user.

Using Differential Privacy, we can solve this! 

In [15]:
# We set the lower bound to 5 because we know each person spends at least $5,
# and the upper bound to 250 because that is the maximum amount a user can spend.
# The dtype is float because the data contains floating-point numbers.
# If your data contains only integers, consider using int, as it saves a lot of memory.
dp_sum_original_dataset = BoundedSum(epsilon=1, lower_bound=5, upper_bound=250, dtype='float')
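
Conceptually, a bounded sum clamps every input to `[lower_bound, upper_bound]`, adds up the clamped values, and then adds Laplace noise whose scale grows with the bounds and shrinks as epsilon grows. The sketch below illustrates that idea in plain Python. It is not PyDP's implementation: the names `laplace_noise` and `toy_bounded_sum` are hypothetical, and taking the sensitivity to be `max(|lower|, |upper|)` is an assumption made for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Hypothetical helper: one draw from Laplace(0, scale).

    The difference of two i.i.d. exponential variables with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def toy_bounded_sum(values, epsilon, lower, upper, rng):
    """Hypothetical illustration of a clamped sum with Laplace noise."""
    # Clamp each value so that no single record can contribute more than the bounds allow.
    clamped = [min(max(v, lower), upper) for v in values]
    # Assumed sensitivity: the largest possible contribution of one record.
    sensitivity = max(abs(lower), abs(upper))
    return sum(clamped) + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)  # seeded only to make the sketch reproducible
values = [31.94, 12.46, 191.14, 126.58, 300.0]  # 300.0 is clamped to 250
noisy = toy_bounded_sum(values, epsilon=1.0, lower=5, upper=250, rng=rng)
print(round(noisy, 2))
```

Tighter bounds mean less noise for the same epsilon, which is why the bounds should reflect what is actually known about the data.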

In [16]:
dp_sum_original_dataset.reset()
dp_sum_original_dataset.add_entries(orignial_dataset['sales_amount'].to_list()) # adding the data to the DP algorithm

In [17]:
dp_sum_og = round(dp_sum_original_dataset.result(), 2)
print(dp_sum_og)

636641.93


#### Taking the Sum on the Reduced Dataset

In [18]:
dp_reduct_dataset = BoundedSum(epsilon=1, lower_bound=5, upper_bound=250, dtype='float')
dp_reduct_dataset.add_entries(reduct_dataset['sales_amount'].to_list())

In [19]:
dp_reduct_dataset.memory_used()

320

In [20]:
dp_sum_reduct = round(dp_reduct_dataset.result(),2)
print(dp_sum_reduct)

636578.06


In [21]:
round(dp_sum_og - dp_sum_reduct, 2)

63.87

In [22]:
print(f"Difference in sum using DP: {round(dp_sum_og - dp_sum_reduct, 2)}")
print(f"Actual Value: {sales_amount_Osbourne}")
assert round(dp_sum_og - dp_sum_reduct, 2) != sales_amount_Osbourne


Difference in sum using DP: 63.87
Actual Value: 31.94
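
Because each DP sum carries its own independent noise, the difference of two noisy sums is the true difference plus the difference of two noise draws, which swamps a single record's value. The simulation below is a sketch under assumptions: it uses Laplace noise with scale 250 (the upper bound divided by epsilon = 1) as a stand-in for whatever PyDP does internally, and the helper `laplace_noise` is hypothetical.

```python
import random
import statistics


def laplace_noise(scale, rng):
    # Hypothetical helper: the difference of two exponentials is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


rng = random.Random(42)  # seeded only to make the sketch reproducible
true_difference = 31.94  # the removed record's sales_amount
scale = 250 / 1.0        # assumed noise scale: upper bound / epsilon

# Repeat the differencing attack many times, with fresh noise on both sums each time.
estimates = [
    true_difference + laplace_noise(scale, rng) - laplace_noise(scale, rng)
    for _ in range(10_000)
]

spread = statistics.pstdev(estimates)
print(round(spread, 2))
```

Under these assumptions a single noisy difference typically lands hundreds of dollars away from 31.94, so one pair of queries tells the attacker very little. Averaging many repeated queries would shrink the noise, which is why every query has to be charged against a finite privacy budget.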


In [23]:
print(f"Sum of sales_amount in the original dataset: {sum_orignial_dataset}")
print(f"Sum of sales_amount in the original dataset using DP: {dp_sum_og}")
assert dp_sum_og != sum_orignial_dataset

Sum of sales_amount in the original dataset: 636594.59
Sum of sales_amount in the original dataset using DP: 636641.93


In [24]:
print(f"Sum of sales_amount in the reduced dataset: {sum_reduct_dataset}")
print(f"Sum of sales_amount in the reduced dataset using DP: {dp_sum_reduct}")
assert dp_sum_reduct != sum_reduct_dataset

Sum of sales_amount in the reduced dataset: 636562.65
Sum of sales_amount in the reduced dataset using DP: 636578.06


#### Querying on Partial Data

Consider a case where data arrives as a stream and you want to report a partial result each time a new batch comes in. The more data you receive, the better the picture you get, but here you have to publish a result as each batch arrives.

To achieve this, PyDP lets you spend only part of your privacy budget on each query.
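
The bookkeeping behind a partial budget can be sketched as follows. The class `ToyBudget` is hypothetical: it only mimics fraction-based accounting, where the budget starts at 1.0 and each query deducts the fraction it spends. It is not PyDP code.

```python
class ToyBudget:
    """Hypothetical tracker for the unspent fraction of a privacy budget."""

    def __init__(self):
        self.left = 1.0

    def spend(self, fraction):
        # Refuse queries that ask for more budget than remains.
        if fraction <= 0 or fraction > self.left + 1e-12:
            raise ValueError("not enough privacy budget left")
        self.left = max(0.0, self.left - fraction)
        return fraction


budget = ToyBudget()
budget.spend(0.3)             # a first partial result uses 30% of the budget
print(round(budget.left, 2))  # 0.7
budget.spend(budget.left)     # the final result uses everything that remains
print(budget.left)            # 0.0
```

Once the budget reaches zero, no further results can be released without weakening the privacy guarantee.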



In [25]:
parital_dp_obj = BoundedSum(epsilon=1, lower_bound=5, upper_bound=250, dtype='float')

We treat the first 3000 records as the first batch of the stream and the remaining 2000 records as the second batch.

In [26]:
new_df_1 = pd.concat([df1, df2, df3])
new_df_2 = pd.concat([df4, df5])
print(new_df_1.shape,new_df_2.shape)

(3000, 6) (2000, 6)


In [27]:
parital_dp_obj.add_entries(new_df_1['sales_amount'].to_list()) # adding the first 3000 records

In [28]:
parital_dp_obj.privacy_budget_left()

1.0

In [29]:
partial_sum_dp = round(parital_dp_obj.result(privacy_budget=0.3), 2) # using only 30% of available privacy budget 
print(partial_sum_dp)

383382.22


In [30]:
actual_partial_sum = round(sum(new_df_1['sales_amount'].to_list()), 2)
print(actual_partial_sum)

383911.03


In [31]:
print(f"Difference in sum for first 3000 records which used only 30% privacy budget= {round(abs(actual_partial_sum - partial_sum_dp), 2)}")

Difference in sum for first 3000 records which used only 30% privacy budget= 528.81


In [32]:
parital_dp_obj.privacy_budget_left()

0.7

In [33]:
parital_dp_obj.add_entries(new_df_2['sales_amount'].to_list()) # adding the remaining 2000 records to the list
partial_total_sum = round(parital_dp_obj.result(), 2)
print(partial_total_sum)

637046.82


In [34]:
parital_dp_obj.privacy_budget_left() # we have used up all the budget available to us

0.0

In [35]:
def sum_og_dataset(budget):
    '''
    Sample Function to calculate BoundedSum on the whole dataset with budget as specified
    '''
    dp_sum_original_dataset.reset()
    dp_sum_original_dataset.add_entries(orignial_dataset['sales_amount'].to_list())
    return round(dp_sum_original_dataset.result(budget), 2)


In [36]:
print(f"Actual Sum: {sum_orignial_dataset}")
print(f"Sum from the previous run with privacy budget 1.0: {dp_sum_og}")
print(f"Sum when using privacy_budget as 0.7 on the whole dataset together: {sum_og_dataset(budget=0.7)}")
print(f"Sum from this run with privacy budget 0.7 on split dataset: {partial_total_sum}")


Actual Sum: 636594.59
Sum from the previous run with privacy budget 1.0: 636641.93
Sum when using privacy_budget as 0.7 on the whole dataset together: 637237.89
Sum from this run with privacy budget 0.7 on split dataset: 637046.82
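
One way to read this comparison: for Laplace noise, the expected absolute error equals the noise scale, and the scale is inversely proportional to the epsilon actually spent. The sketch below assumes a sensitivity of 250 (the upper bound) and that a budget fraction simply multiplies epsilon. Both are assumptions about the mechanism rather than documented PyDP behaviour.

```python
sensitivity = 250.0  # assumed: the largest contribution of one record
epsilon = 1.0

# Expected absolute noise for each fraction of the budget used in this notebook.
scales = {fraction: sensitivity / (epsilon * fraction) for fraction in (1.0, 0.7, 0.3)}

for fraction, scale in scales.items():
    print(f"budget fraction {fraction}: expected absolute noise ~ {scale:.1f}")
```

Each error reported in this notebook is a single random draw, so it will not match these scales exactly, but the trend holds: the smaller the share of the budget spent on a query, the noisier its answer.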
