# Differential Privacy

## Attack model

We have studied privacy attacks where attackers learn sensitive information about an individual by linking two (or more) published data sources. What if the datasets are not published? For example, imagine a situation where a hospital dataset is created, as shown below, where anyone can query it to learn aggregated statistics (e.g., "how many people suffer from obesity?") but the actual dataset is never released.

```{figure} medical-record.png
---
height: 400px
name: medical-record
---
```
You know that your neighbor Igor goes to the same hospital. You know Igor's nationality (and, obviously, his zip code and gender), and you know or can guess his age. Can you learn what medical problem Igor is suffering from just by querying the dataset, without accessing it directly?

The answer is, unfortunately, **Yes**, which is why we have this chapter. By progressively adding new conditions to your query, you can determine what Igor is suffering from. For example, you can query: *how many people, who are*

|   | Name | SSN | Marital status | Sex | DOB | Zip | Ethnicity | Problem |
|---|------|-----|----------------|-----|-----|-----|-----------|---------|
| 0 | -- | -- | Divorced | Male | 1995-12-21 | 2139 | Asian | Hypertension |
| 1 | -- | -- | Divorced | Female | 1996-06-27 | 2139 | Asian | Obesity |
| 2 | -- | -- | Married | Female | 1996-08-24 | 2139 | Asian | Chest pain |
| 3 | -- | -- | Married | Female | 1996-09-22 | 2139 | Asian | Obesity |
| 4 | -- | -- | Married | Male | 1995-08-11 | 2148 | Black | Hypertension |
| 5 | -- | -- | Married | Male | 1995-09-20 | 2138 | Black | Shortness of breath |
| 6 | -- | -- | Married | Female | 1996-08-07 | 2141 | Black | Shortness of breath |
| 7 | -- | -- | Married | Male | 1995-07-23 | 2141 | Black | Obesity |
| 8 | -- | -- | Single | Female | 1996-04-24 | 2138 | White | Chest pain |
| 9 | -- | -- | Single | Male | 1995-02-18 | 2138 | White | Obesity |
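
The progressive narrowing described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `count` helper is a hypothetical stand-in for the hospital's count-only query interface, and the attributes assumed for Igor (male, Zip 2139, Asian) are chosen purely for the demo so that they match exactly one row of the table.

```python
import pandas as pd

# Toy copy of the records shown above (names and SSNs are already hidden).
records = pd.DataFrame({
    "Sex": ["Male", "Female", "Female", "Female", "Male",
            "Male", "Female", "Male", "Female", "Male"],
    "Zip": [2139, 2139, 2139, 2139, 2148, 2138, 2141, 2141, 2138, 2138],
    "Ethnicity": ["Asian", "Asian", "Asian", "Asian", "Black",
                  "Black", "Black", "Black", "White", "White"],
    "Problem": ["Hypertension", "Obesity", "Chest pain", "Obesity",
                "Hypertension", "Shortness of breath",
                "Shortness of breath", "Obesity", "Chest pain", "Obesity"],
})


def count(**conditions):
    """Hypothetical count-only interface: how many rows match all conditions."""
    mask = pd.Series(True, index=records.index)
    for column, value in conditions.items():
        mask &= records[column] == value
    return int(mask.sum())


# Assumed background knowledge about Igor (illustrative only).
known = {"Sex": "Male", "Zip": 2139, "Ethnicity": "Asian"}

# Step 1: how many people fit what we already know?
matches = count(**known)

# Step 2: add one more condition per candidate problem and see which count survives.
leaks = {p: count(**known, Problem=p) for p in records["Problem"].unique()}

print(matches)
print(leaks)
```

If the first count is 1, the only problem with a non-zero count in the second step must be Igor's, even though no individual row was ever returned.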


In [8]:
import pandas as pd
import numpy as np
adult = pd.read_csv('../../datasets/adult_with_pii.csv')
# Show only the identifying and quasi-identifying columns.
columns = ['Name', 'DOB', 'SSN', 'Zip', 'Education', 'Marital Status', 'Occupation', 'Sex']
adult.head()[columns]

|   | Name | DOB | SSN | Zip | Education | Marital Status | Occupation | Sex |
|---|------|-----|-----|-----|-----------|----------------|------------|-----|
| 0 | Karrie Trusslove | 9/7/1967 | 732-14-6110 | 64152 | Bachelors | Never-married | Adm-clerical | Male |
| 1 | Brandise Tripony | 6/7/1988 | 150-19-2766 | 61523 | Bachelors | Married-civ-spouse | Exec-managerial | Male |
| 2 | Brenn McNeely | 8/6/1991 | 725-59-9860 | 95668 | HS-grad | Divorced | Handlers-cleaners | Male |
| 3 | Dorry Poter | 4/6/2009 | 659-57-4974 | 25503 | 11th | Married-civ-spouse | Handlers-cleaners | Male |
| 4 | Dick Honnan | 9/16/1951 | 220-93-3811 | 75387 | Bachelors | Married-civ-spouse | Prof-specialty | Female |


## Definitions

### Neighboring datasets
Two datasets $D_1$ and $D_2$ are neighbors if they differ in at most one record (i.e., one row). Thus, either $D_1$ has at most one more (or one fewer) row than $D_2$, or at most one row in $D_1$ has values that are different from the corresponding row in $D_2$. We write $|D_1 - D_2|\leq 1$.

```{figure} ./figs/neighbor-datasets.png
:height: 150px
:name: neighbor-datasets

Neighboring datasets
```

For example, in {ref}`figure <neighbor-datasets>`, $D_2$ can be obtained by deleting one row from $D_1$ (or $D_1$ can be obtained by adding one row to $D_2$), and thus they are neighbors. $D_1$ and $D_3$ have the same number of rows, but the values in (exactly) one row are different, thus they are also neighbors. Note that you can obtain $D_3$ by first deleting the last row from $D_1$ and then adding one row to the resulting dataset; thus, this replacement can be thought of as a combination of a deletion and an addition (i.e., two operations in total).
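
The definition can be checked mechanically. The sketch below is a minimal illustration, not a standard API: `are_neighbors` is a hypothetical helper, the toy rows are made up, and a replaced row is counted as "one record differs", as in the text above.

```python
from collections import Counter


def are_neighbors(d1, d2):
    """Return True if d1 and d2 differ in at most one record.

    Each dataset is a list of rows (tuples); row order is ignored.
    """
    c1, c2 = Counter(d1), Counter(d2)
    only_in_d1 = sum((c1 - c2).values())  # rows that would have to be deleted
    only_in_d2 = sum((c2 - c1).values())  # rows that would have to be added
    return only_in_d1 <= 1 and only_in_d2 <= 1


d1 = [("Male", 25), ("Female", 31), ("Male", 47)]
d2 = [("Male", 25), ("Female", 31)]                # one row deleted
d3 = [("Male", 25), ("Female", 31), ("Male", 52)]  # one row changed
d4 = [("Male", 25), ("Female", 40), ("Male", 52)]  # two rows changed

print(are_neighbors(d1, d2), are_neighbors(d1, d3), are_neighbors(d1, d4))
```

Here `d1` is a neighbor of both `d2` and `d3`, but not of `d4`, which differs from it in two rows.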

### Privacy loss

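For a randomized mechanism $M$, two neighboring datasets $D_1$ and $D_2$, and an output $t$, the privacy loss is $\ln{\frac{P(M(D_1)=t)}{P(M(D_2)=t)}}$.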

### $\epsilon$-Differential Privacy

A *randomized* mechanism (or algorithm) $M$ satisfies $\epsilon$-differential privacy if and only if for any two neighboring datasets $D_1$ and $D_2$, the following condition is satisfied

$\forall S \subseteq Range(M): P(M(D_1) \in S) \leq e^{\epsilon} P(M(D_2) \in S)$

where $\epsilon \geq 0$ and $Range(M)$ denotes the set of all possible outputs of the algorithm $M$.

This definition can be written as

$\forall S \subseteq Range(M): \frac{P(M(D_1) \in S)}{ P(M(D_2) \in S)} \leq e^{\epsilon}$

where, by convention, $\frac{0}{0}=1$.


The important implication of this definition is that $M$'s output will be pretty much the same, with or without the data of any specific individual. Since $M$ is randomized, its built-in randomness is "enough" to prevent someone who sees the output from guessing which dataset was used to compute it. For instance, if your data is present in $D_1$ but not in $D_2$, an adversary won't be able to tell whether or not your data was present in the input dataset.

**Note:** The mechanism $M$ is differentially private (if it satisfies the above condition), not the datasets.

**Note:** The parameter $\epsilon$ dictates how much privacy you get: with smaller values, the outcomes of applying algorithm $M$ to $D_1$ and $D_2$ are more similar (i.e., more privacy) than with larger values.
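
To get a feel for what $\epsilon$ means in the condition above, it can help to evaluate the bound $e^{\epsilon}$ for a few values. The values of $\epsilon$ below are arbitrary choices for illustration.

```python
import math

# e^epsilon bounds how much more likely any set of outputs can be
# on one neighboring dataset than on the other.
bounds = {epsilon: math.exp(epsilon) for epsilon in [0.1, 1.0, 5.0]}

for epsilon, bound in bounds.items():
    print(f"epsilon = {epsilon}: probability ratio at most {bound:.3f}")
```

With $\epsilon = 0.1$ the two output probabilities may differ by a factor of only about 1.1, while with $\epsilon = 5$ they may differ by a factor of about 148, which is a much weaker guarantee.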

### Bounded and Unbounded DP

## Achieving Differential Privacy
Recall that the goal is to make an algorithm differentially private, i.e., the algorithm gives *similar* outputs for neighboring datasets. How do we achieve this? The most popular approach is to add noise to the outcomes, so that they become similar enough. There are multiple ways to select the noise; we will explore some of them below.

### The Laplace mechanism

Let’s consider a query on the census data: “How many individuals in the dataset are 40 years old or older?”

In [30]:
adult[adult['Age'] >= 40].shape[0]

7161

How can we add enough noise to satisfy the DP property, but not so much that the answer becomes useless? The Laplace mechanism is one of the most popular ways to achieve DP in such cases. Here, we add noise sampled from the Laplace distribution to the actual query output and return the noisy result. If $f$ denotes the query function evaluated on dataset $D$, then the noisy return value is $f(D)+ L$, where $L$ is the Laplace noise. Since the noise is sampled from a distribution, each call to the query function adds a different amount of noise. Thus, the whole mechanism becomes randomized, which is exactly what DP requires. The remaining question is how to set the parameters of the Laplace distribution so that adding noise from it satisfies the DP condition.

The zero-mean Laplace distribution with scale parameter $b$, written $L(b)$, has the density $\frac{1}{2b} e^{-\frac{|x|}{b}}$ (its variance is $2b^2$). To satisfy DP, we set $b= \frac{\Delta}{\epsilon}$, where $\Delta$ is the sensitivity of $f$, i.e., the largest amount by which the output of $f$ can change between two neighboring datasets. So the noisy output becomes $f(D)+L(\frac{\Delta}{\epsilon})$. This whole randomized mechanism can be represented as $M(D, \epsilon, \Delta) = f(D)+L(\frac{\Delta}{\epsilon})$, where $M$ takes a dataset, the privacy budget, and the sensitivity of the query function, and produces an $\epsilon$-differentially private outcome.
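
The formula $f(D)+L(\frac{\Delta}{\epsilon})$ translates almost directly into code. The sketch below is a minimal illustration: the function name `laplace_mechanism` and its signature are assumptions made for this example, and the seed is fixed only so that the demo is repeatable.

```python
import numpy as np


def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    """Return true_answer plus Laplace noise with scale sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = np.random.default_rng() if rng is None else rng
    scale = sensitivity / epsilon
    return true_answer + rng.laplace(loc=0.0, scale=scale)


rng = np.random.default_rng(0)  # fixed seed only to make the demo repeatable

# 7161 is the true count from the query above; each call adds fresh noise.
noisy = [laplace_mechanism(7161, sensitivity=1, epsilon=0.1, rng=rng) for _ in range(5)]
print(noisy)
```

Passing the random generator in explicitly keeps the function easy to test; in a real deployment the noise source would not be seeded with a known value.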

For counting queries the sensitivity is 1: if a query counts the number of rows in the dataset with a particular property, and then we modify exactly one row of the dataset, then the query’s output can change by at most 1.
Thus we can achieve differential privacy for our example query by using the Laplace mechanism with sensitivity 1 and an $\epsilon$ of our choosing.
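
The claim that a counting query has sensitivity 1 can be checked empirically on a small example. The sketch below uses a hypothetical toy dataset with made-up ages and only considers neighbors obtained by deleting one row.

```python
import pandas as pd

# Hypothetical toy dataset; the ages are made up for illustration.
toy = pd.DataFrame({"Age": [23, 40, 57, 38, 61]})


def count_40_plus(df):
    """Counting query: number of rows with Age >= 40."""
    return int((df["Age"] >= 40).sum())


full_count = count_40_plus(toy)

# Delete each row in turn to form a neighboring dataset,
# and record how much the count changes.
changes = [abs(full_count - count_40_plus(toy.drop(index=i))) for i in toy.index]

print(full_count, changes, max(changes))
```

No single deletion changes the count by more than 1, which matches the sensitivity used for the Laplace noise.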

In [42]:
sensitivity = 1
epsilon = 0.1
# Get the differentially private count of adults who are 40 years old or older.
# Note that each call produces a slightly different result, but all of them are close to the original value.
for i in range(5):
    print(int(adult[adult['Age'] >= 40].shape[0] + np.random.laplace(loc=0, scale=sensitivity/epsilon)))

7152
7150
7161
7164
7152


Now see that a larger value of $\epsilon$ adds less noise: the outputs stay very close to the true count, so the outputs for two neighboring datasets may no longer be "similar enough" to each other to protect privacy.

In [44]:
sensitivity = 1
epsilon = 5
for i in range(5):
    print(int(adult[adult['Age'] >= 40].shape[0] + np.random.laplace(loc=0, scale=sensitivity/epsilon)))

7160
7161
7160
7161
7161
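
The effect of $\epsilon$ on the amount of noise can also be measured directly. The sketch below is a rough illustration with an arbitrary seed and sample size: it draws many Laplace samples for a few values of $\epsilon$ and reports the average noise magnitude, which for a Laplace distribution equals its scale $\frac{\Delta}{\epsilon}$.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, only for repeatability
sensitivity = 1
mean_abs_noise = {}

for epsilon in [0.1, 1, 5]:
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=100_000)
    mean_abs_noise[epsilon] = float(np.mean(np.abs(noise)))
    print(f"epsilon = {epsilon}: average noise magnitude ~ {mean_abs_noise[epsilon]:.2f}")
```

The average magnitude is roughly 10 for $\epsilon = 0.1$ and roughly 0.2 for $\epsilon = 5$, which is why the second set of outputs above barely moves away from 7161.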


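Below is the analytical proof that the Laplace mechanism satisfies the DP condition.

For $D_1$, the mechanism outputs a value $t$ exactly when the sampled noise equals $t - f(D_1)$, so the probability density of the output at $t$ is $P(M(D_1)=t) = \frac{\epsilon}{2\Delta} e^{-\frac{\epsilon |t - f(D_1)|}{\Delta}}$

For $D_2$, the probability density of the output at $t$ is $P(M(D_2)=t) = \frac{\epsilon}{2\Delta} e^{-\frac{\epsilon |t - f(D_2)|}{\Delta}}$

For the DP mechanism $M$, the randomization comes only from this Laplace noise sampling; all other steps in the mechanism are deterministic. Thus, the final outcome of $M$, which is random, follows a Laplace distribution centered at the true answer. That means, for any output $t$,

$\frac{P(M(D_1)=t)}{P(M(D_2)=t)} = \frac{e^{-\frac{\epsilon|t - f(D_1)|}{\Delta}}}{e^{-\frac{\epsilon|t - f(D_2)|}{\Delta}}} = e^{\frac{\epsilon (|t - f(D_2)| - |t - f(D_1)|)}{\Delta}} \leq e^{\frac{\epsilon |f(D_1) - f(D_2)|}{\Delta}} \leq e^{\epsilon}$

The first inequality is the triangle inequality, and the second holds because $|f(D_1) - f(D_2)| \leq \Delta$ by the definition of sensitivity. Since the bound holds for every output $t$, integrating over any set $S$ of outputs gives $P(M(D_1) \in S) \leq e^{\epsilon} P(M(D_2) \in S)$.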
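
The bound can also be checked numerically. The sketch below is an illustration under stated assumptions: it takes the true count 7161 for one dataset and assumes a neighboring dataset whose count is 7160, then evaluates the log-ratio of the two Laplace output densities (the privacy loss) over a grid of candidate outputs. The helper `laplace_log_density` is written out by hand for this example.

```python
import numpy as np


def laplace_log_density(t, center, scale):
    """Log of the Laplace density with the given center and scale, evaluated at t."""
    return -np.log(2 * scale) - np.abs(t - center) / scale


epsilon, sensitivity = 0.1, 1
scale = sensitivity / epsilon
f_d1, f_d2 = 7161, 7160  # counts on two neighboring datasets (7160 is assumed)

t = np.linspace(7000, 7300, 3001)  # grid of candidate outputs
privacy_loss = laplace_log_density(t, f_d1, scale) - laplace_log_density(t, f_d2, scale)
worst_case = float(np.max(np.abs(privacy_loss)))

print(worst_case, worst_case <= epsilon + 1e-9)
```

On this grid the largest privacy loss matches $\epsilon$ up to floating-point error, in line with the inequality derived above.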


```{bibliography}
```