In [1]:
import pandas as pd, numpy as np
import matplotlib.pyplot as plt

# Local differential privacy

## Threat model

Previously, we considered a situation where a trusted data curator collected data from the population. But what if there is no such trusted curator? This is often the case in practice, because the data collector (e.g., the government or a private for-profit company) may also be the one who analyzes and uses the collected data. Thus, we need a mechanism that applies DP (the randomization) *before* the data is collected. There are several ways to do that for different data types.

## Randomized Response (RR) mechanism for binary data

Let's assume that the data to be collected is binary (Yes or No). Under RR, the original response is randomized twice, following this process:

1. Flip a coin: if it lands tails, send the original response.
2. Otherwise, flip another coin: if the second coin lands heads, answer "yes".
3. If the second coin lands tails, answer "no".
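
The three steps above can be sketched in code as follows. This is a minimal illustration rather than a reference implementation: the function name `double_coin_flip`, the 0/1 encoding of "no"/"yes", and the fixed seed are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible

def double_coin_flip(orig_response, rng=rng):
    """Randomize one binary response (1 = "yes", 0 = "no") with two fair coins."""
    # First coin: tails (probability 1/2) means report the original response.
    if rng.random() < 0.5:
        return orig_response
    # Second coin: heads means answer "yes", tails means answer "no".
    return 1 if rng.random() < 0.5 else 0

# Empirical check of the response probabilities:
# a true "yes" should be reported as "yes" about 3/4 of the time,
# a true "no" should be reported as "yes" about 1/4 of the time.
n = 100_000
yes_given_yes = np.mean([double_coin_flip(1) for _ in range(n)])
yes_given_no = np.mean([double_coin_flip(0) for _ in range(n)])
print(f"P(report yes | true yes) ~ {yes_given_yes:.3f}")
print(f"P(report yes | true no)  ~ {yes_given_no:.3f}")
```

The randomization runs on the respondent's side, so the collector only ever sees the returned value.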

Randomizing the original response twice provides privacy in the form of *plausible deniability*: even if the final response is the same as the original response, the data subject can claim that it was produced by the coin flips rather than by their true answer.

Does RR satisfy the DP criterion? It does for $\epsilon=\ln 3$, as the following argument shows.

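Imagine that $D_1$ and $D_2$ are neighboring datasets that differ in a single individual's original response: it is "yes" in $D_1$ and "no" in $D_2$, and all other rows are the same. The mechanism $M$ reports "yes" for that individual with probability $\frac{1}{2}+\frac{1}{2}\cdot\frac{1}{2}=\frac{3}{4}$ under $D_1$ and with probability $\frac{1}{2}\cdot\frac{1}{2}=\frac{1}{4}$ under $D_2$. Then,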

$\frac{P(M(D_1) = Y)}{P(M(D_2)= Y )} = \frac{P(Y|Y)}{P(Y|N)} = \frac{\frac{3}{4}}{\frac{1}{4}} = 3$

Thus, $3\leq e^{\epsilon}$, or $\epsilon \geq \ln 3$. By symmetry, the same ratio of 3 is obtained when the reported response is "no".

```{note}
Randomized response is $\ln 3$-differentially private.
```
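
The ratio used in the argument above can be checked numerically. The sketch below enumerates both possible reported values and takes the worst-case probability ratio; the dictionary layout and variable names are assumptions made for illustration.

```python
import math

# P(reported response | true response) under the double coin flip.
# Outer key: reported response, inner key: true response.
prob = {
    "yes": {"yes": 0.5 + 0.5 * 0.5, "no": 0.5 * 0.5},  # 3/4 and 1/4
    "no":  {"yes": 0.5 * 0.5, "no": 0.5 + 0.5 * 0.5},  # 1/4 and 3/4
}

# Worst-case ratio over every reported value and every ordered pair of true values.
worst_ratio = max(
    prob[reported][a] / prob[reported][b]
    for reported in prob
    for a in ("yes", "no")
    for b in ("yes", "no")
)

epsilon = math.log(worst_ratio)  # smallest epsilon satisfying the DP bound
print(f"worst-case ratio = {worst_ratio}, epsilon = {epsilon:.4f}")
```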

### Utility of RR

How much utility is retained in randomized responses? How accurately can we recover the true "Yes" (or "No") responses?
Let's assume that `p` is the true probability (proportion) of "Yes" responses.

Observed proportion of "yes" responses, $Y$ = expected proportion of true "yes" + expected proportion of false "yes"\
$Y = p \cdot (\frac{1}{2} + \frac{1}{2} \cdot \frac{1}{2}) + (1-p) \cdot \frac{1}{2} \cdot \frac{1}{2}$\
$Y = \frac{1}{2} p + \frac{1}{4}$\
$p=2 (Y-\frac{1}{4})$

For a given (observed) `Y`, we can therefore estimate the true proportion of "Yes" responses.
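
As a rough check of this estimator, the sketch below simulates a population, applies the double coin flip to every response, and inverts the formula. The true proportion `true_p = 0.3`, the sample size, and the seed are arbitrary values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded only for reproducibility
n = 100_000
true_p = 0.3  # hypothetical true proportion of "yes" responses

x = rng.random(n) < true_p          # true responses (True = "yes")
keep = rng.random(n) < 0.5          # first coin: tails -> keep the original response
second_coin = rng.random(n) < 0.5   # second coin: heads -> "yes", tails -> "no"
y = np.where(keep, x, second_coin)  # randomized responses sent to the collector

Y = y.mean()            # observed proportion of "yes"
p_hat = 2 * (Y - 0.25)  # invert Y = p/2 + 1/4

print(f"observed Y = {Y:.3f}, estimated p = {p_hat:.3f}, true p = {true_p}")
```

The estimate is noisier than the raw proportion would be, because any sampling error in $Y$ is doubled by the inversion.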

### General version of randomized response

Let $Y_i$ denote the final response, and $X_i$ denote the original response. Then, for some $\gamma \in [0, 0.5]$,\
$Y_i = \begin{cases} X_i & \text{w.p. } \frac{1}{2}+\gamma \\ 1-X_i & \text{w.p. } \frac{1}{2}-\gamma \end{cases}$

If we set $\gamma=1/4$, we recover the double coin flip setting. Setting $\gamma=0$ provides perfect privacy, because the final outcome ($Y_i$) is equally likely to be the original response ($X_i$) or its opposite ($1-X_i$). Setting $\gamma=0.5$ provides perfect utility (but no privacy), because the final outcome ($Y_i$) is always equal to the original response ($X_i$). The following simulation shows how varying $\gamma$ affects privacy and utility.

In [19]:
def randomize(orig_response, gamma):
    """Keep the original response w.p. 1/2 + gamma; otherwise flip it."""
    if np.random.random() < .5 + gamma:
        return orig_response

    return 1 - orig_response

x = [1] * 10000  # originally all "Yes" responses
# Sweep gamma from 0.5 (no privacy) down to 0 (perfect privacy)
for gamma in np.arange(.5, -0.1, -.1):
    y = np.array([randomize(i, gamma) for i in x])
    print('gamma: {:.2f}, count[yes]={}'.format(gamma, len(y[y==1])))

gamma: 0.50, count[yes]=10000
gamma: 0.40, count[yes]=9031
gamma: 0.30, count[yes]=7973
gamma: 0.20, count[yes]=7003
gamma: 0.10, count[yes]=5965
gamma: 0.00, count[yes]=4967
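
To put numbers on the trade-off, the sketch below computes the privacy parameter and a debiased estimate for several values of $\gamma$. Both formulas are derived here by repeating the earlier reasoning with $\frac{1}{2}+\gamma$ and $\frac{1}{2}-\gamma$ in place of $\frac{3}{4}$ and $\frac{1}{4}$; they are not stated in the text above. The helper names `rr_epsilon` and `debias`, the true proportion `true_p = 0.3`, and the seed are assumptions made for illustration.

```python
import math
import numpy as np

def rr_epsilon(gamma):
    """Epsilon of generalized RR: log of the ratio (1/2 + gamma) / (1/2 - gamma)."""
    keep, flip = 0.5 + gamma, 0.5 - gamma
    return math.inf if flip == 0 else math.log(keep / flip)

def debias(observed_yes, gamma):
    """Invert Y = p * (1/2 + gamma) + (1 - p) * (1/2 - gamma) for p (needs gamma > 0)."""
    return (observed_yes - (0.5 - gamma)) / (2 * gamma)

rng = np.random.default_rng(1)  # seeded only for reproducibility
n, true_p = 100_000, 0.3        # hypothetical population
x = (rng.random(n) < true_p).astype(int)

estimates = {}
for gamma in [0.5, 0.4, 0.25, 0.1]:
    keep = rng.random(n) < 0.5 + gamma  # keep the original w.p. 1/2 + gamma
    y = np.where(keep, x, 1 - x)        # otherwise report the opposite
    estimates[gamma] = debias(y.mean(), gamma)
    print(f"gamma={gamma:.2f}  epsilon={rr_epsilon(gamma):.3f}  "
          f"estimated p={estimates[gamma]:.3f}")
```

Smaller values of $\gamma$ give a smaller $\epsilon$ (stronger privacy), but the estimate becomes noisier because the observed proportion is divided by $2\gamma$. At $\gamma=0$ the responses carry no information about $p$, so the estimator is undefined.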
