# Lab 2: Differential Privacy


Unlike $k$-Anonymity, differential privacy is a property of *algorithms*, and not a property of *data*. That is, we can prove that an *algorithm* satisfies differential privacy, and to show that a *dataset* satisfies differential privacy, we must show that the algorithm which produced it satisfies differential privacy.


A function which satisfies differential privacy is often called a *mechanism*. We say that a *mechanism* $F$ satisfies $\epsilon$-differential privacy if for all *neighboring/adjacent datasets* $D$ and $\bar{D}$, and all possible sets of outputs $O$,
$
\begin{equation}
\frac{\mathsf{Pr}[F(D) \in O]}{\mathsf{Pr}[F(\bar{D}) \in O]} \leq e^\epsilon
\end{equation}
$

Two datasets are considered neighbors if they differ in the data of a single record. 

The important implication of this definition is that $F$'s output will be pretty much the same, *with or without* the data of any specific record. That is, an adversary can't determine which of $D$ or $\bar{D}$ was the input to $F$ and hence the adversary can't tell whether or not a specific data record was *present* in the input.

The $\epsilon$ parameter in the definition is called the *privacy parameter* or the *privacy budget*. Small values of $\epsilon$  provide higher levels of privacy and large values of $\epsilon$ provide less privacy. 
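
To get a feel for the bound in the definition, the short sketch below evaluates $e^\epsilon$ for a few values of $\epsilon$. The values are illustrative choices for this sketch, not recommendations from the lab.

```python
import math

# Illustrative epsilon values (assumed for this sketch, not taken from the lab).
epsilons = [0.1, 0.5, 1.0, 5.0]

# e^epsilon is the largest factor by which the probability of any set of
# outputs may differ between two neighboring datasets.
ratio_bounds = {eps: math.exp(eps) for eps in epsilons}

for eps, bound in ratio_bounds.items():
    print(f"epsilon = {eps}: probability ratio is at most {bound:.3f}")
```

For small $\epsilon$ the bound stays close to 1, so the two output distributions are nearly indistinguishable; for large $\epsilon$ the bound is so loose that it offers little protection.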

## The Laplace Mechanism

Differential privacy is typically used to protect the data behind the answers to specific types of queries. Let's consider some queries on the attached sample data, *without* and *with* differential privacy.

In [1]:
import pandas as pd
import numpy as np

In [2]:
Data = pd.read_csv("sampleData.csv", na_values='',na_filter=False)

In [3]:
Data[0:15]

Unnamed: 0,last_name_syn,first_name_syn,id_card_syn,person_id_syn,Alcoholic,lang_spoken_cd,Gender,state_cd,cnty_cd,zip_cd,...,cov_end_date,void_ind,end_reason_cd,src_div_id,src_major_lob_cd,sold_ledger_nbr,fin_prod_cd,fin_sub_cd,mco_contract_nbr,plan_benefit_package_id
0,PETERSON,DUSTIN,H53571312,03MO0SeTLY93A6597426I4c6,N,ENG,M,MI,49,48502,...,9/28/2019,N,107,0,RSK,39052,MEDR,MRP,H5216,23.0
1,FULLER,AUSTIN,H73151310,31deaMOaa2c1fST84L5Y2AIb,Y,SPN,M,PR,1,601,...,12/31/9999,N,,0,RSK,39012,MEDR,MRP,H5216,148.0
2,RUSSELL,CAROL,H80863066,b2Mf42OSTLbd9YA8be2b2a3I,N,ENG,F,FL,5,32405,...,12/31/9999,N,,*,VIS,49803,VISN,NSG,,
3,GALAN,CODY,H65136334,eMO4ScdT00bf290052bL7YAI,N,ENG,M,KY,147,40769,...,12/31/2019,N,TE,73552805,EPO,32362,VIT,VIT,,
4,KHAN,JOSEPH,H50623362,b5M029599a3fO7S9T5L22YAI,N,,M,IL,31,60617,...,12/31/9999,N,,*,PPO,81879,ASO,ASP,,
5,LIRIANO,MARTHA,H22531400,MO5d2ce5ST03L63a6Y4A58I1,N,,F,TX,113,75006,...,12/31/9999,N,,0,SUP,39648,MEDS,MES,,
6,NOEL,CRYSTAL,H43542591,MO1cST192aLY8c81ea6A69bI,,ENG,F,TN,21,37080,...,12/26/2019,N,602,0,RSK,84283,MEDR,MDI,S5884,158.0
7,CEJAS,MILAGROS,H84556247,83Md9O6S68TL4Y0fde2AaccI,N,ENG,F,TN,93,37916,...,3/30/2020,N,602,0,RSK,20622,MEDR,MER,H4461,34.0
8,KUMAR,LETICIA,H73721165,e8MOST69Lf34e345Yc9AaIf0,,,F,MS,53,39038,...,12/31/9999,N,,0,RSK,84283,MEDR,MDI,S5884,158.0
9,CAVAZOS,CHAD,H64802189,b70M7eObbb17STLY3ba1AbIb,,,M,NY,119,10803,...,12/31/9999,N,,0,RSK,83206,MEDR,MDI,S5552,4.0


### Let's create an anonymized version by removing the names and IDs

In [4]:
anon_data = Data.copy().drop(columns=['last_name_syn', 'first_name_syn','id_card_syn','person_id_syn'])
pii_data = Data[['last_name_syn', 'first_name_syn','id_card_syn','person_id_syn','Gender','birth_date','state_cd','zip_cd']]
anon_data[:5]

Unnamed: 0,Alcoholic,lang_spoken_cd,Gender,state_cd,cnty_cd,zip_cd,birth_date,decsd_date,cov_eff_date,cov_end_date,void_ind,end_reason_cd,src_div_id,src_major_lob_cd,sold_ledger_nbr,fin_prod_cd,fin_sub_cd,mco_contract_nbr,plan_benefit_package_id
0,N,ENG,M,MI,49,48502,9/1/1948,,6/1/2018,9/28/2019,N,107,0,RSK,39052,MEDR,MRP,H5216,23.0
1,Y,SPN,M,PR,1,601,12/1/1982,,1/1/2019,12/31/9999,N,,0,RSK,39012,MEDR,MRP,H5216,148.0
2,N,ENG,F,FL,5,32405,8/1/1984,,1/1/2019,12/31/9999,N,,*,VIS,49803,VISN,NSG,,
3,N,ENG,M,KY,147,40769,1/1/2003,,1/1/2019,12/31/2019,N,TE,73552805,EPO,32362,VIT,VIT,,
4,N,,M,IL,31,60617,12/1/2016,,12/1/2019,12/31/9999,N,,*,PPO,81879,ASO,ASP,,


In [5]:
pii_data

Unnamed: 0,last_name_syn,first_name_syn,id_card_syn,person_id_syn,Gender,birth_date,state_cd,zip_cd
0,PETERSON,DUSTIN,H53571312,03MO0SeTLY93A6597426I4c6,M,9/1/1948,MI,48502
1,FULLER,AUSTIN,H73151310,31deaMOaa2c1fST84L5Y2AIb,M,12/1/1982,PR,601
2,RUSSELL,CAROL,H80863066,b2Mf42OSTLbd9YA8be2b2a3I,F,8/1/1984,FL,32405
3,GALAN,CODY,H65136334,eMO4ScdT00bf290052bL7YAI,M,1/1/2003,KY,40769
4,KHAN,JOSEPH,H50623362,b5M029599a3fO7S9T5L22YAI,M,12/1/2016,IL,60617
...,...,...,...,...,...,...,...,...
100,BROWN,LILLIAN,H51471107,M5OSd808b6TLd73Y723AI8f8,F,5/1/1949,TN,37229
101,WHEELOCK,JOSEPH,H32056510,4ddMb9c9OSTLY13AIb944e34,M,10/1/1942,SD,57001
102,ELDER,FRANCES,H37827406,MO8b55abS7TLca96YA7Ic0ec,F,8/1/1944,IA,51201
103,SYMES,VALERIE,H06803427,dM1a9OSTf2L92e10cYeAdI24,F,12/1/1992,FL,32211


### Attack \#1 - Linkage using non-PII

Let's try to link the two datasets using non-PII data.


Let's create a subset of data of interest


In [6]:
sData=pii_data[:5].copy()
sData

Unnamed: 0,last_name_syn,first_name_syn,id_card_syn,person_id_syn,Gender,birth_date,state_cd,zip_cd
0,PETERSON,DUSTIN,H53571312,03MO0SeTLY93A6597426I4c6,M,9/1/1948,MI,48502
1,FULLER,AUSTIN,H73151310,31deaMOaa2c1fST84L5Y2AIb,M,12/1/1982,PR,601
2,RUSSELL,CAROL,H80863066,b2Mf42OSTLbd9YA8be2b2a3I,F,8/1/1984,FL,32405
3,GALAN,CODY,H65136334,eMO4ScdT00bf290052bL7YAI,M,1/1/2003,KY,40769
4,KHAN,JOSEPH,H50623362,b5M029599a3fO7S9T5L22YAI,M,12/1/2016,IL,60617


And now let's try to link it to the anonymized data

In [7]:
pd.merge(sData, anon_data, left_on=['birth_date', 'zip_cd','Gender'], \
         right_on=['birth_date', 'zip_cd','Gender'])

Unnamed: 0,last_name_syn,first_name_syn,id_card_syn,person_id_syn,Gender,birth_date,state_cd_x,zip_cd,Alcoholic,lang_spoken_cd,...,cov_end_date,void_ind,end_reason_cd,src_div_id,src_major_lob_cd,sold_ledger_nbr,fin_prod_cd,fin_sub_cd,mco_contract_nbr,plan_benefit_package_id
0,PETERSON,DUSTIN,H53571312,03MO0SeTLY93A6597426I4c6,M,9/1/1948,MI,48502,N,ENG,...,9/28/2019,N,107,0,RSK,39052,MEDR,MRP,H5216,23.0
1,FULLER,AUSTIN,H73151310,31deaMOaa2c1fST84L5Y2AIb,M,12/1/1982,PR,601,Y,SPN,...,12/31/9999,N,,0,RSK,39012,MEDR,MRP,H5216,148.0
2,RUSSELL,CAROL,H80863066,b2Mf42OSTLbd9YA8be2b2a3I,F,8/1/1984,FL,32405,N,ENG,...,12/31/9999,N,,*,VIS,49803,VISN,NSG,,
3,GALAN,CODY,H65136334,eMO4ScdT00bf290052bL7YAI,M,1/1/2003,KY,40769,N,ENG,...,12/31/2019,N,TE,73552805,EPO,32362,VIT,VIT,,
4,KHAN,JOSEPH,H50623362,b5M029599a3fO7S9T5L22YAI,M,12/1/2016,IL,60617,N,,...,12/31/9999,N,,*,PPO,81879,ASO,ASP,,


In [8]:
Data[:5]

Unnamed: 0,last_name_syn,first_name_syn,id_card_syn,person_id_syn,Alcoholic,lang_spoken_cd,Gender,state_cd,cnty_cd,zip_cd,...,cov_end_date,void_ind,end_reason_cd,src_div_id,src_major_lob_cd,sold_ledger_nbr,fin_prod_cd,fin_sub_cd,mco_contract_nbr,plan_benefit_package_id
0,PETERSON,DUSTIN,H53571312,03MO0SeTLY93A6597426I4c6,N,ENG,M,MI,49,48502,...,9/28/2019,N,107,0,RSK,39052,MEDR,MRP,H5216,23.0
1,FULLER,AUSTIN,H73151310,31deaMOaa2c1fST84L5Y2AIb,Y,SPN,M,PR,1,601,...,12/31/9999,N,,0,RSK,39012,MEDR,MRP,H5216,148.0
2,RUSSELL,CAROL,H80863066,b2Mf42OSTLbd9YA8be2b2a3I,N,ENG,F,FL,5,32405,...,12/31/9999,N,,*,VIS,49803,VISN,NSG,,
3,GALAN,CODY,H65136334,eMO4ScdT00bf290052bL7YAI,N,ENG,M,KY,147,40769,...,12/31/2019,N,TE,73552805,EPO,32362,VIT,VIT,,
4,KHAN,JOSEPH,H50623362,b5M029599a3fO7S9T5L22YAI,N,,M,IL,31,60617,...,12/31/9999,N,,*,PPO,81879,ASO,ASP,,


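As you can see, we were able to successfully launch a linkage attack on the anonymized dataset: every record in the subset was matched back to its row in the original data using only birth date, ZIP code, and gender.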
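
The linkage succeeds because the combination of birth date, ZIP code, and gender is unique for these records. A minimal sketch of how one might measure that exposure is shown below; the `toy` frame and its rows are made up for illustration and only reuse the lab's column names.

```python
import pandas as pd

# Toy stand-in for anon_data; the rows are invented for this sketch.
toy = pd.DataFrame({
    "birth_date": ["9/1/1948", "12/1/1982", "8/1/1984", "8/1/1984"],
    "zip_cd": [48502, 601, 32405, 32405],
    "Gender": ["M", "M", "F", "F"],
})

quasi_identifiers = ["birth_date", "zip_cd", "Gender"]

# keep=False marks every row whose quasi-identifier combination occurs more than once.
is_shared = toy.duplicated(subset=quasi_identifiers, keep=False)

# Rows that share their combination with nobody can be linked to exactly one person.
unique_rows = toy[~is_shared]
print(f"{len(unique_rows)} of {len(toy)} rows are unique on {quasi_identifiers}")
```

The larger the share of unique rows, the more records a linkage attack like the one above can re-identify.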

## Inference attack

#### Example 1:
Let's find out whether *Austin Fuller* is in this anonymized dataset, knowing that he is among the few people who speak Spanish (SPN).

### Q1: "How many individuals in the dataset speak Spanish?"

In [9]:
anon_data[anon_data['lang_spoken_cd'] =='SPN'].shape[0]

2

### Q2: "How many _male_ individuals in the dataset speak Spanish?"

In [11]:
anon_data[(anon_data['lang_spoken_cd'] =='SPN') & (anon_data['Gender'] =='M')].shape[0]

1

As the second query shows, we were able to guess (with some confidence) whether a person's data is present in this dataset by using some prior knowledge about them.

## Using DP

One way to achieve differential privacy for the above queries (Q1 and Q2) is to add random noise to their answers. But how much noise should we add, and of what kind?
We need to add enough noise to satisfy the definition of differential privacy, but not so much that the answer becomes useless.

#### The *Laplace mechanism*.

$\begin{equation}
F(D) = f(D) + \textsf{Lap}\left(\frac{\Delta f}{\epsilon}\right)
\end{equation}$

where $\Delta f$ is the *sensitivity* of $f$, and $\textsf{Lap}(b)$ denotes sampling from the Laplace distribution with center 0 and scale $b$.

The *sensitivity* of a function $f$ is the maximum amount $f$'s output can change when its input changes by one record.
*Counting queries* always have a sensitivity of 1: if a query counts the number of rows in the dataset with a particular property, and we then modify exactly one row of the dataset, the query's output can change by at most 1.
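
As a sanity check of that claim, the sketch below builds every neighbor of a tiny dataset by changing one record and measures how far the count can move. The data and the helper name `count_spanish` are assumptions made for this illustration.

```python
import pandas as pd

def count_spanish(df):
    """Counting query: number of rows whose language code is 'SPN'."""
    return int((df["lang_spoken_cd"] == "SPN").sum())

# Made-up dataset for the sketch.
d = pd.DataFrame({"lang_spoken_cd": ["ENG", "SPN", "ENG", "SPN"]})

# Build each neighbor by changing the value of exactly one record,
# and record how much the query answer changes.
differences = []
for i in range(len(d)):
    for new_value in ["ENG", "SPN"]:
        neighbor = d.copy()
        neighbor.loc[i, "lang_spoken_cd"] = new_value
        differences.append(abs(count_spanish(d) - count_spanish(neighbor)))

max_change = max(differences)
print(f"Largest change in the count across neighbors: {max_change}")
```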


Thus we can achieve differential privacy for our example queries (Q1 and Q2) above by using the Laplace mechanism with sensitivity 1 and an $\epsilon$ of our choice.

We can sample from the Laplace distribution using NumPy's `np.random.laplace` function.

##### For Q1

In [14]:
sensitivity = 1
epsilon = 0.5

anon_data[anon_data['lang_spoken_cd'] =='SPN'].shape[0] + np.random.laplace(loc=0, scale=sensitivity/epsilon)

9.397413154757253

##### For Q2:

In [25]:
sensitivity = 1
epsilon = 0.5

anon_data[(anon_data['lang_spoken_cd'] =='SPN') & (anon_data['Gender'] =='M')].shape[0]+ \
np.random.laplace(loc=0, scale=sensitivity/epsilon)

1.96484359388292

You can see the effect of adding the random noise by running this code multiple times. Each time, the output changes, but most of the time, the answer is close enough to the true answer (which is 1 in this case) to be useful.
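
Rather than re-running the cell by hand, one can draw many noisy answers at once and summarize them. The sketch below assumes the true answer of 1 from Q2 and uses a seeded generator (`np.random.default_rng`) so the run is repeatable; the sample size of 10,000 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable

true_count = 1          # true answer to Q2 in the lab
sensitivity = 1
epsilon = 0.5
scale = sensitivity / epsilon

# Simulate asking the same noisy query 10,000 times.
noisy_answers = true_count + rng.laplace(loc=0, scale=scale, size=10_000)
abs_errors = np.abs(noisy_answers - true_count)

mean_abs_error = abs_errors.mean()
share_within_2 = (abs_errors <= 2).mean()

print(f"Mean absolute error: {mean_abs_error:.2f}")
print(f"Share of answers within 2 of the truth: {share_within_2:.2f}")
```

For Laplace noise the expected absolute error equals the scale $b = \Delta f / \epsilon$, which is 2 here, so the simulated mean should land close to that value.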

## How Much Noise is Enough?

How do we know that the Laplace mechanism adds enough noise to prevent the re-identification of individuals in the dataset? 

Cynthia Dwork et al. showed that it is sufficient to add noise drawn from a Laplace distribution centered at 0 with scale $b = \frac{\Delta f}{\epsilon}$.
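
This choice of scale can be checked numerically against the definition. The sketch below compares the Laplace output densities for two neighboring counts (1 and 2, chosen for illustration) over a grid of possible outputs; `laplace_pdf` is a helper written for this sketch from the standard Laplace density formula. A bound on the density ratio at every point implies the same bound for every set of outputs.

```python
import numpy as np

def laplace_pdf(x, loc, scale):
    """Density of the Laplace distribution with the given center and scale."""
    return np.exp(-np.abs(x - loc) / scale) / (2 * scale)

epsilon = 0.5
sensitivity = 1
scale = sensitivity / epsilon

count_d = 1      # f(D), assumed true count on one dataset
count_d_bar = 2  # f(D-bar), count on a neighboring dataset

# Grid of possible mechanism outputs.
outputs = np.linspace(-20, 20, 2001)
ratio = laplace_pdf(outputs, count_d, scale) / laplace_pdf(outputs, count_d_bar, scale)

max_ratio = ratio.max()
bound = np.exp(epsilon)
print(f"Largest density ratio: {max_ratio:.4f}, bound e^epsilon: {bound:.4f}")
```

The ratio never exceeds $e^\epsilon$ in either direction, which is what the definition requires.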

Let's see if that works!

#### Example 2: 
Let's write down a malicious counting query that is specifically designed to determine whether _Austin Fuller_ is an alcoholic.

Assuming the attacker knows that _Austin Fuller_ is among the few people in the dataset who speak SPN, they may craft the following two benign-looking queries to infer private information about him.

### Q3: How many people are alcoholic?

In [26]:
Alc = anon_data[anon_data['Alcoholic'] == 'Y'].shape[0]
Alc

4

### Q3': How many SPN speakers are alcoholic?

In [27]:
Alc2 = anon_data[(anon_data['Alcoholic'] == 'Y') &  \
                 (anon_data['lang_spoken_cd'] == 'SPN')].shape[0]  
# We could also use != 'SPN' to exclude the individual from the results of the second query
Alc2

1

This result definitely violates Austin's privacy, since it reveals his alcoholic status.

So now let's use differential privacy for counting queries with the Laplace mechanism, and see how it goes:

In [33]:
sensitivity = 1
epsilon = 0.1

Alc2 = anon_data[(anon_data['Alcoholic'] == 'Y') &  \
                 (anon_data['lang_spoken_cd'] != 'SPN')].shape[0]  + \
                  np.random.laplace(loc=0, scale=sensitivity/epsilon)
Alc2

9.245758113883177

Is the true answer 3, 4, 5, or something else? There's too much noise to tell reliably. The idea is not to reject suspicious queries that might be malicious (rejection by itself could leak information), but to add enough noise that the results of a malicious query are useless to the adversary.
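
How much noise a given $\epsilon$ implies can be read off the Laplace scale. The sketch below tabulates the scale and standard deviation of the noise for a counting query at a few $\epsilon$ values; the values are illustrative, and the standard deviation uses the fact that a Laplace distribution with scale $b$ has standard deviation $\sqrt{2}\,b$.

```python
import math

sensitivity = 1  # counting query

# Illustrative epsilon values (assumed for this sketch).
epsilons = [0.1, 0.5, 1.0, 2.0]

# Standard deviation of Lap(b) is sqrt(2) * b, with b = sensitivity / epsilon.
noise_std = {eps: math.sqrt(2) * sensitivity / eps for eps in epsilons}

for eps, std in noise_std.items():
    print(f"epsilon = {eps}: scale = {sensitivity / eps:.1f}, std = {std:.2f}")
```

With $\epsilon = 0.1$ the noise has a standard deviation of about 14, far larger than the difference of 1 the attacker is trying to detect.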

## <font color = red> Lab \#2 Exercise </font> (15 points):

Find a public dataset with PII, anonymize it as shown in this lab, and then use the anonymized data to demonstrate the following:

* 1- Linkage attack, by using non-PII data
* 2- count and average queries inference attacks
* 3- use DP to protect against the inference attacks in 2

Submit your notebook and the dataset you used on Canvas under Lab2.

After completing the exercise, rename the notebook to __`your name-lab2.ipynb`__ and attach it, together with the CSV file of the dataset you used, in Canvas.

__Grade__: This lab is graded by the number of tasks you complete correctly (from 0 pts, if no new dataset is used or none of the tasks is completed correctly, up to 15 pts, if a reasonable new dataset is used and all tasks are completed correctly).

In [40]:
import pandas as pd
import numpy as np
from datetime import datetime

sample_data = pd.read_csv("pii_10_rows.csv")

anonymized_data = sample_data.drop(columns=['First Name', 'Last Name', 'Email', 'Phone Number', 'Address'])

anonymized_data['Age'] = anonymized_data['Date of Birth'].apply(lambda x: (datetime.now().year - datetime.strptime(x, '%Y-%m-%d').year))

anonymized_data.drop(columns=['Date of Birth'], inplace=True)

age_attack = anonymized_data[anonymized_data['Age'] > 60]

older_than_60_count = age_attack.shape[0]

average_age = anonymized_data['Age'].mean()

def laplace_mechanism(value, sensitivity, epsilon):
    noise = np.random.laplace(0, sensitivity / epsilon)
    return value + noise

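epsilon = 0.5

# A counting query changes by at most 1 when one record changes.
count_sensitivity = 1
dp_older_than_60_count = laplace_mechanism(older_than_60_count, count_sensitivity, epsilon)

# A mean does not have sensitivity 1. Assuming ages are clipped to [0, 100]
# (bounds chosen here for illustration) and the number of rows is public,
# changing one record moves the mean by at most (100 - 0) / n_rows.
age_lower, age_upper = 0, 100
n_rows = len(anonymized_data)
clipped_average_age = anonymized_data['Age'].clip(age_lower, age_upper).mean()
mean_sensitivity = (age_upper - age_lower) / n_rows
dp_average_age = laplace_mechanism(clipped_average_age, mean_sensitivity, epsilon)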

print(f"Count of individuals older than 60 (with DP): {dp_older_than_60_count}")
print(f"Average Age (with DP): {dp_average_age}")


Count of individuals older than 60 (with DP): 6.554688045113425
Average Age (with DP): 50.724824071210385
