# Differential Privacy

> "You can't stop change any more than you can stop the sun from setting." ~ Shmi Skywalker

In this lesson we'll be learning about **Differential Privacy.** This is divided into two sections.

Section 1 will focus on **Intuition**: 
* Intuition
    * What is differential privacy (DP)?
    * How does DP work?
        * How does adding noise protect privacy?
        * How much noise do we add?
        * What is the tradeoff?
    * What is the privacy budget?
        * Connection between privacy budget and risk
    * What is epsilon?
    
Section 2 will focus on DP in **our codebase.**
* DP in PySyft
    * How DP in PySyft is different than DP elsewhere
        * Adversarial
        * Individual
        * Automatic
    * Differential Privacy Tensors
        * PhiTensors
        * GammaTensors
        * Helper Classes:
            * LazyRepeatArrays
            * DataSubjectArrays
    * Ledgers and Privacy Budget Accounting
    * Sigma and noise addition

<hr>

## Motivation 

In a previous lesson, we discussed how difficult it is to protect people's privacy when working with or releasing data. We discussed the Netflix prize, where participants were de-anonymized with shocking accuracy. We mentioned problems caused by Data Linkages, and talked about the copy problem.

Differential Privacy is one of the Privacy Enhancing Technologies (PETs) that we discussed in a previous session. Like other PETs, it tries to solve some of these problems.

We'll unpack this in more detail. But first, let's quickly standardize some terminology.

<img src="imgs/ds_terminology.png">

This is the standard "journey" of data in data science:
* Raw data is collected from lots of people (like you and me, called **data subjects**).
* This raw data is collected and often cleaned/preprocessed by **data owners**.
* The data owners then pass on these datasets to their data scientists, who then test their algorithms or workflows on the data to draw useful conclusions, or build products.

<hr>

## Section 1: Intuition

Put simply, differential privacy is a mathematical guarantee that the output of an algorithm is similar when data belonging to one person is removed.

<img src="imgs/dp_definition.png">

Because the outputs are so similar, adding a small amount of noise can hide the effect of the person's data.

<img src="imgs/dp_similarity.png">

<br>
<br>
This means the Data Scientist still gets a reasonable and accurate answer (~25.7) but won't clue into the effect of your data.

So in a nutshell, DP makes it so that the data scientist doesn't work with just the datasets- they work with datasets plus some noise that serves to protect the privacy of the data subjects.

<img src="imgs/dp_ds.png">

<br>

Thus far, we've seen how differential privacy (DP) leads to the addition of a small amount of noise to protect the privacy of someone's data.
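Here's a minimal sketch of that idea in plain Python. The ages are made-up values (chosen so the true average is 25.7), and the noise scale `sigma=1.0` is an arbitrary assumption for illustration, not a calibrated DP parameter.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical ages; the last entry plays the role of "your" data.
ages = [23, 31, 27, 22, 29, 24, 26, 25, 28, 22]

mean_with_you = statistics.mean(ages)
mean_without_you = statistics.mean(ages[:-1])


def noisy(value, sigma=1.0):
    """Return the value plus Gaussian noise with standard deviation sigma."""
    return value + rng.gauss(0.0, sigma)


# The two exact answers differ only slightly...
gap = abs(mean_with_you - mean_without_you)
print("True gap between the two answers:", round(gap, 3))

# ...so noise of a comparable or larger size makes them hard to tell apart.
print("Noisy answer (with you):   ", round(noisy(mean_with_you), 2))
print("Noisy answer (without you):", round(noisy(mean_without_you), 2))
```

The data scientist only ever sees the noisy answers, so they can't easily tell which of the two datasets produced them.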

A reasonable next question to ask is ***how much noise do we add?***

### How Much Noise To Add? (in English)

Let's start with an easy, intuitive answer:

<img src="imgs/dp_enuf2hide.png">

In the image above, we haven't added enough noise to properly protect the person's privacy. We need to add more!


<img src="imgs/dp_not_enuf.png">

This time, we went way overboard with adding noise. The Data Scientist is going to suffer a serious loss of accuracy in their calculations. We need to add less!

What this should hopefully convey to you is that there is a **tradeoff** between the amount of noise we add, and the corresponding privacy and utility of the results we get from our algorithm.
* If we add too much noise, the results are no longer accurate, rendering the algorithm useless.
* If we don't add enough noise, the results are accurate, but we aren't able to protect anyone's privacy.
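The utility side of this tradeoff is easy to see with a quick simulation. This is only a sketch: the three `sigma` values are arbitrary, and `average_error` is a helper made up for this example.

```python
import random
import statistics

rng = random.Random(42)
true_answer = 25.7  # the exact result the data scientist would like to see


def average_error(sigma, trials=2000):
    """Average absolute error of the noisy answer for a given noise scale."""
    errors = []
    for _ in range(trials):
        noisy_answer = true_answer + rng.gauss(0.0, sigma)
        errors.append(abs(noisy_answer - true_answer))
    return statistics.mean(errors)


for sigma in (0.01, 1.0, 100.0):
    print(f"sigma={sigma:>6}: average error ~ {average_error(sigma):.3f}")
```

With a tiny `sigma` the answer is almost exact (great accuracy, little protection); with a huge `sigma` the answer is mostly noise (lots of protection, little accuracy).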

### How Much Noise to Add? (in Math)


DP is a mathematical guarantee, and so there is a mathematical answer to this question.
But before we can answer this question of how much noise to add, we'll need to understand a key insight: **datasets are distributions.**


This might seem a bit strange to some of you. Let's take a simple example.

Imagine you had a dataset consisting of the numbers [1, 2, 3, 4 and 5].
An easy way you could convert this from a dataset into a distribution is to iterate through every datapoint in the dataset, and ask how likely you'd be to get that number if you picked an entry from the dataset at random.

You'd then get a graph that looks a lot like this:

<img src="imgs/dp_datasets_distr.png">


Voila! Using this simple scheme, we've converted our dataset into a probability distribution.
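In code, this scheme is just counting. A small sketch (the function name `to_distribution` is made up for this example):

```python
from collections import Counter


def to_distribution(dataset):
    """Map each distinct value to its relative frequency in the dataset."""
    counts = Counter(dataset)
    total = len(dataset)
    return {value: count / total for value, count in counts.items()}


distribution = to_distribution([1, 2, 3, 4, 5])
print(distribution)  # each of the five values gets probability 0.2
```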

Now, you might ask- what was the point of that? Why did we need to convert our dataset into a distribution?

Well, it turns out- there are lots of ways to compare two distributions! Here are 16 of them, just for illustration:

<img src="imgs/dp_distr_comp.png">

This means we now have a way to compare two datasets- by converting them to distributions, and then comparing them using any of the methods shown in the image above.
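As a rough sketch, here's one common way to compare two distributions: the total variation distance. I picked it because it's simple, not because it's the measure DP uses. The toy dataset and helper names are assumptions for this example.

```python
from collections import Counter


def to_distribution(dataset):
    """Map each distinct value to its relative frequency in the dataset."""
    counts = Counter(dataset)
    return {value: count / len(dataset) for value, count in counts.items()}


def total_variation(p, q):
    """Half the sum of absolute probability differences over all values."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)


full = to_distribution([4, 4, 4, 4, 6])
without_typical = to_distribution([4, 4, 4, 6])  # one of the 4s removed
without_outlier = to_distribution([4, 4, 4, 4])  # the 6 removed

tv_typical = total_variation(full, without_typical)
tv_outlier = total_variation(full, without_outlier)
print("Removing a typical value changes the distribution by", round(tv_typical, 3))
print("Removing the outlier changes the distribution by   ", round(tv_outlier, 3))
```

Removing the value that stands out moves the distribution further than removing a value that blends in.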

You might ask- but Ishan, **Why is it important to compare datasets?** What does this have to do with figuring out how much noise we want to add?

Well, let's revisit the definition of Differential Privacy:

Differential Privacy: The outcome of an algorithm is ***similar*** when a single person's data is removed from a dataset.

**Intuition:** As soon as we know how similar the outcomes are, we immediately know how much noise is enough, and how much is too much. And it's easier to calculate this similarity by thinking about our datasets as if they were distributions.

### Introducing $\epsilon$, the privacy budget

This "similarity" or difference is captured in a parameter called $\epsilon$ (**epsilon**). It's also called the **privacy budget.**

$\epsilon$ has different formulas (depending on which method we used to compare our datasets/distributions), but it is always a measure of how much your data affects the outcome of the result:

<img src="imgs/dp_intro_epsilon.png">

$\epsilon$, or the privacy budget, is probably the most important idea in differential privacy, so it's worth taking some time to emphasize what exactly it is.


The privacy budget, $\epsilon$, is a measure of:
* How much your data stands out
    * Thus, it's also a measure of *privacy risk*; how likely your data is going to be identified
* How much your data affects the outcome of the query or algorithm
* How much noise is needed to hide your data's influence.

**Note:** 
There is a mathematical definition of epsilon (see resources below), but thinking about it in this way helps to build intuition.
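
For reference, here's the standard textbook statement of that definition. The symbols $M$, $D$, $D'$ and $S$ are notation introduced just for this note. A randomized algorithm $M$ is $\epsilon$-differentially private if, for every pair of datasets $D$ and $D'$ that differ in one person's data, and for every set of possible outputs $S$:

$$
\Pr[M(D) \in S] \le e^{\epsilon} \cdot \Pr[M(D') \in S]
$$

In words: adding or removing one person can change the probability of any outcome by at most a factor of $e^{\epsilon}$. When $\epsilon$ is close to 0, the two output distributions are nearly identical.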

### Effects of $\epsilon$

**The higher the $\epsilon$, the more your data affects the outcome of an algorithm.**

<img src="imgs/dp_epsilon_eff1.png">

For example:
* Imagine if we're trying to calculate the net worth of all the people living inside a single city, and we did this for every city in the US. Warren Buffett and Elon Musk would have gigantic values of $\epsilon$, whereas someone who's homeless probably has a much lower $\epsilon$.

<br>

**The higher the $\epsilon$, the more noise needs to be added to protect your privacy.**

This should make intuitive sense:
* As we discussed above, the higher the epsilon, the more your data affects the outcome of an algorithm.
* The more your data affects the outcome of an algorithm, the more noise you need to add to mask the effects of your data.

<img src="imgs/dp_epsilon_eff2.png">


In theory, noise is a random value that depends on $\epsilon$. In practice, we obtain it from sampling from a distribution where $\epsilon$ is a parameter. 

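Often, it's a normal (Gaussian) distribution where $\epsilon$ affects the standard deviation.
(Formal treatments usually state this the other way around: for a fixed query, choosing a smaller target $\epsilon$ requires a larger standard deviation. Both views agree that data with more influence needs more noise to hide.)
So the bigger the $\epsilon$, in the sense used in this lesson, the wider the range of values the noise is likely to be sampled from [1]: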

<img src="imgs/dp_sigma.png">

[1]: We know this because of the empirical (68-95-99.7) rule in statistics, which says that when you sample from a normal distribution, approximately 68% of the values will be within one standard deviation of the mean, 95% will be within 2 standard deviations, and 99.7% will be within 3 standard deviations. (This rule is also where the Six Sigma process-improvement methodology gets its name! It aims to bring the defect rate of manufacturing processes down to a six sigma level.)
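
If you'd like to check the empirical rule yourself, here's a small sketch that samples Gaussian noise and counts how much of it lands within 1, 2 and 3 standard deviations. The value `sigma = 2.0` and the sample count are arbitrary choices.

```python
import random

rng = random.Random(7)
sigma = 2.0  # arbitrary standard deviation for the noise
samples = [rng.gauss(0.0, sigma) for _ in range(100_000)]


def fraction_within(k):
    """Fraction of samples within k standard deviations of the mean (0)."""
    return sum(abs(x) <= k * sigma for x in samples) / len(samples)


for k in (1, 2, 3):
    print(f"within {k} sigma: {fraction_within(k):.3f}")
```

A larger `sigma` stretches the same percentages over a wider range of values, which is exactly why a bigger standard deviation means bigger noise on average.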

<img src="imgs/dp_empirical_rule.png">


<br>

### Final Intuition- Why does DP work?

I'd like to leave you with an intuitive, gut feeling as to why differential privacy, this way of adding noise, actually works. 

The intuition is that when it comes to protecting privacy, your data "hides" in the background of other people's data. The more your data fits in with other people's data, the easier it is to blend in and hide in plain sight.

Put it this way- it's much easier to spot the horse in the first image than the second one.

<img src="imgs/dp_horse.png">


<br>

Put it another way- the best way to hide a data point is to surround it with other, very similar data points! They'll be hard to tell apart, and it'll be like trying to look for a piece of **hay** in a very large, fluffy haystack:


<img src="imgs/dp_haystack.png">

By adding noise, we're doing something very similar- we're making it hard to discern one data point from another, the same way dumping all your hay in a haystack makes it hard to single out individual pieces of hay.

### Takeaways

* Differential Privacy allows you to learn high level statistics and trends (how big is the haystack? Am I running out of hay? Is my hay on fire?) but not individual data (is the piece of hay in the center getting moldy?)
    * You can answer questions such as "What causes cancer?" and not "Does Walter White have cancer?"
* Differential Privacy works by adding noise to the output of an algorithm to protect privacy.
* $\epsilon$, the privacy budget, is an indicator of how likely it is that someone's data stands out and can be identified.
    * The higher the $\epsilon$, the more likely that person's data stands out and will be identified.
    * $\epsilon$ can be calculated in many ways, but it's always an indicator of the risk of being identified and having your privacy violated.

<hr>

## Section 2: DP in PySyft

The goal of this section is to help you understand the major components of the Differential Privacy system in PySyft, as well as how PySyft's DP system is different than regular DP systems elsewhere.



Differential Privacy in PySyft is different in 3 crucial ways:

1. It's **Adversarial**
    - There is a hard limit on how much any data scientist who interacts with your dataset can learn about it.
    - This limit is how much privacy budget ($\epsilon$) the data owner gives them.
    - For instance- the data owner gives you 5 $\epsilon$ of privacy budget. Every time you'd like to get a result, it would cost you a few $\epsilon$, and deduct from your available budget.
2. It's **Individual**
    - We track the privacy risk for **every individual in a dataset.** (every data subject.)
    - This is useful because we know exactly how much each person's data and privacy is at risk.
        - If the risk for any particular individual becomes too high, we can remove their data, and still use other people's data. This gives you more "mileage," since you're able to wring out as much signal as possible from the dataset while still ensuring no one's privacy is blown.
        - If a given algorithm violates any individual person's privacy too much, the data scientist gets penalized by having more $\epsilon$ deducted.
    - Note: [2]
3. It's **Automatic**
    - If you have enough privacy budget ($\epsilon$), you can get the results of your algorithms without waiting.


[2]: This is in contrast to regular Differential Privacy, where an entire dataset is characterized by a single $\epsilon$. This is because regular DP only considers the privacy risk of the individual whose data is most at risk. (There's a good reason for this- as soon as one person's privacy is blown, you have a data leak and your infrastructure is by definition not secure.)
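
To make the **adversarial** idea concrete, here's a toy sketch of a budget that gets spent with every query. The class and method names (`PrivacyBudget`, `spend`, `BudgetExceededError`) and the query costs are hypothetical, made up for this illustration. This is not PySyft's actual accounting code.

```python
class BudgetExceededError(Exception):
    """Raised when a query would cost more epsilon than remains."""


class PrivacyBudget:
    """Hypothetical tracker for one data scientist's remaining epsilon."""

    def __init__(self, total_epsilon):
        self.remaining = float(total_epsilon)

    def spend(self, cost):
        """Deduct the cost of a query, or refuse it if it's too expensive."""
        if cost > self.remaining:
            raise BudgetExceededError(
                f"query costs {cost}, only {self.remaining} left"
            )
        self.remaining -= cost
        return self.remaining


budget = PrivacyBudget(total_epsilon=5)
budget.spend(2)      # first query: 3 epsilon left
budget.spend(2.5)    # second query: 0.5 epsilon left
try:
    budget.spend(1)  # too expensive: rejected, budget unchanged
except BudgetExceededError as err:
    print("Rejected:", err)
print("Remaining budget:", budget.remaining)
```

Once the budget runs out, no more results come back, which is what puts a hard limit on how much can be learned.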

<br>

## PySyft's DP Scenario

Let's say we have a Data Scientist. This person was given $\epsilon$ = 10 of privacy budget to work with.
The dataset they're using has 3 people's data in it- Rob, Bob, and Job.

<img src="imgs/dp_scen.png">


Let's say Rob and Bob's data is both "4", and Job's data is "6".
Although this dataset is small, we can intuitively tell that Job's data sticks out more than Rob's and Bob's. Thus, we expect his privacy to be more at risk, and any epsilon for Job to be higher.

(The fact that PySyft is able to reason this way, and deduce privacy risk for each person in the dataset- is an example of its **individual** nature at work.)
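
We can check the "Job sticks out more" intuition with a few lines of Python. This sketch uses a simple stand-in for privacy risk (how far the average moves when one person is removed). It is an assumption for illustration and not how PySyft computes $\epsilon$.

```python
data = {"Rob": 4, "Bob": 4, "Job": 6}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


overall = mean(data.values())

# Hypothetical proxy for "how much does this person stand out":
# how far the mean moves when that person's value is removed.
influence = {}
for name in data:
    others = [value for other, value in data.items() if other != name]
    influence[name] = abs(overall - mean(others))

for name, shift in influence.items():
    print(f"{name}: mean shifts by {shift:.3f} when removed")
```

Rob and Bob each move the average by the same small amount, while removing Job moves it twice as far.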

Let's say the Data Scientist wants to do a `sum()` operation:

<img src="imgs/dp_indiv.png">


<br>

## Components

Awesome- now that we've seen the entire DP process, end to end, in PySyft, let's deep dive into the key components- **Tensors**, **Ledgers**, and **Publishing.**

### DP Tensors in PySyft

The first (and arguably most used) component in our DP system are our custom Tensor classes.
The biggest difference between our Tensor classes and your standard `torch.Tensor()` or `np.array()` is that our Tensors not only store data, but also store *metadata.*

Let me explain. This is a regular Tensor:

In [2]:
import torch
torch.Tensor([1,2,3,4])

tensor([1., 2., 3., 4.])

<br>
This is a regular NumPy array:

In [3]:
import numpy as np
np.array([1,2,3,4])

array([1, 2, 3, 4])

<br>
As you can see, they both store the same data- [1,2,3,4]. They go a step further and let you do math with that data:

In [4]:
torch.Tensor([1,2,3,4]) + 2

tensor([3., 4., 5., 6.])

In [5]:
np.array([1,2,3,4]) + 2

array([3, 4, 5, 6])

<br>
And so on and so forth.

Our DP Tensors in PySyft also provide this ability to store data:

In [6]:
import syft as sy
sy.Tensor([1,2,3,4])

Tensor(child=[1 2 3 4])

<br>
... as well as do arithmetic on them:

In [7]:
sy.Tensor([1,2,3,4]) + 2

Tensor(child=[3 4 5 6])

<br>
But there's one key difference. Our tensors also store metadata, which helps us figure out how much noise we should add to protect privacy.

Specifically, we track 3 kinds of metadata:
* a theoretical lower bound on the data held in the tensor
* a theoretical upper bound on the data held in the tensor
* the data subjects, or the people whose data is stored in the tensor (and thus whose privacy we're protecting).

This metadata is provided by calling `.private()` on a Syft Tensor:

In [8]:
sy.Tensor([1,2,3,4]).private(min_val=0, max_val=5, data_subjects="Ishan")

Tensor(child=PhiTensor(child=[1 2 3 4], min_vals=<lazyrepeatarray data: [0] -> shape: (4,)>, max_vals=<lazyrepeatarray data: [5] -> shape: (4,)>))

<br>

Let's understand why each of these is necessary:

**Knowing the upper and lower bounds of the data helps us figure out how much noise is too much and too little.**

For instance, if I add 10 data points in the range of [0, 10], and I add a noise value of 300, that will probably drown all the signal and skew the results heavily.

In [10]:
signal = sum([1,2,3,4,5,3,4,2,1,8])
print("Signal =", signal)
noise = 300
print("Noise =", noise)
result = signal + noise
print("Final result after DP = ", result)

Signal = 33
Noise = 300
Final result after DP =  333


On the other hand, if I add 10 data points in the range of [1,000,000 , 10,000,000], and I add a noise value of 50, that's probably too little noise to meaningfully protect anyone's privacy.

In [11]:
signal = sum([1e6,2e6,3e6,4e6,5e6,3e6,4e6,2e6,1e6,8e6])
print("Signal =", signal)
noise = 50
print("Noise =", noise)
result = signal + noise
print("Final result after DP = ", result)

Signal = 33000000.0
Noise = 50
Final result after DP =  33000050.0


For those of you who've done signal processing before, this might look similar to the idea of the **signal-to-noise ratio.**
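
Here's a sketch of how declared bounds could be used to pick a sensible noise scale. The function `noisy_sum` and its `noise_fraction` knob are hypothetical, invented for this example. PySyft's real calculation is more involved.

```python
import random

rng = random.Random(1)


def noisy_sum(values, min_val, max_val, noise_fraction=0.1):
    """Sum the values and add Gaussian noise scaled to the declared range.

    noise_fraction is a hypothetical knob: sigma is that fraction of
    (max_val - min_val), the most one data point could move the sum
    if it were swapped for another value in the declared range.
    """
    if any(v < min_val or v > max_val for v in values):
        raise ValueError("value outside the declared bounds")
    sigma = noise_fraction * (max_val - min_val)
    return sum(values) + rng.gauss(0.0, sigma), sigma


small = [1, 2, 3, 4, 5, 3, 4, 2, 1, 8]
large = [v * 1e6 for v in small]

_, sigma_small = noisy_sum(small, min_val=0, max_val=10)
_, sigma_large = noisy_sum(large, min_val=1e6, max_val=1e7)
print("sigma for [0, 10] data:    ", sigma_small)
print("sigma for [1e6, 1e7] data: ", sigma_large)
```

Because the noise scale follows the bounds, the same code adds small noise to small data and large noise to large data, instead of a fixed 300 or 50.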

<br>

Meanwhile, **knowing the data subjects helps us figure out whose privacy we're protecting.**

This is tremendously useful when it comes to things like Linkage Attacks (refer to Lesson 01- the Netflix Challenge example.)
If we're able to keep track of which data belongs to which person, we could quantify in strict mathematical terms ($\epsilon$) how much that individual's privacy is at risk.



Based on the number of Data Subjects in a Tensor, we can ***categorize*** our DP Tensors into 2 categories:

* PhiTensor: 1 unique data subject per data point in the Tensor
* GammaTensor: 2 or more unique data subjects per data point in the Tensor


Let's try playing around with them!

In [13]:
# Let's say we have some data:
data = np.random.randint(low=1, high=7, size=(5,5))
data

array([[6, 4, 4, 6, 5],
       [5, 4, 3, 2, 2],
       [1, 6, 2, 6, 5],
       [1, 5, 6, 6, 1],
       [6, 4, 5, 5, 1]])

In [14]:
# We put it in our Syft Tensor:
tensor = sy.Tensor(data)
tensor

Tensor(child=[[6 4 4 6 5]
 [5 4 3 2 2]
 [1 6 2 6 5]
 [1 5 6 6 1]
 [6 4 5 5 1]])

In [15]:
# We then annotate our Tensor with metadata, thus creating a PhiTensor under the hood:
private_tensor = tensor.private(min_val=0, max_val=10, data_subjects="Ishan")
private_tensor

Tensor(child=PhiTensor(child=[[6 4 4 6 5]
 [5 4 3 2 2]
 [1 6 2 6 5]
 [1 5 6 6 1]
 [6 4 5 5 1]], min_vals=<lazyrepeatarray data: [0] -> shape: (5, 5)>, max_vals=<lazyrepeatarray data: [10] -> shape: (5, 5)>))

In [11]:
private_tensor.child

PhiTensor(child=[[6 4 4 6 5]
 [5 4 3 2 2]
 [1 6 2 6 5]
 [1 5 6 6 1]
 [6 4 5 5 1]], min_vals=<lazyrepeatarray data: [0] -> shape: (5, 5)>, max_vals=<lazyrepeatarray data: [10] -> shape: (5, 5)>)

This is the first of our DP tensors- called a **PhiTensor**.
A PhiTensor only has one unique data subject per value in the Tensor. You can see this below:

In [15]:
# This array has the same shape as our NumPy data- so there's a 1:1 correspondence.
private_tensor.child.data_subjects

array([[DataSubjectArray: {'Ishan'}, DataSubjectArray: {'Ishan'},
        DataSubjectArray: {'Ishan'}, DataSubjectArray: {'Ishan'},
        DataSubjectArray: {'Ishan'}],
       [DataSubjectArray: {'Ishan'}, DataSubjectArray: {'Ishan'},
        DataSubjectArray: {'Ishan'}, DataSubjectArray: {'Ishan'},
        DataSubjectArray: {'Ishan'}],
       [DataSubjectArray: {'Ishan'}, DataSubjectArray: {'Ishan'},
        DataSubjectArray: {'Ishan'}, DataSubjectArray: {'Ishan'},
        DataSubjectArray: {'Ishan'}],
       [DataSubjectArray: {'Ishan'}, DataSubjectArray: {'Ishan'},
        DataSubjectArray: {'Ishan'}, DataSubjectArray: {'Ishan'},
        DataSubjectArray: {'Ishan'}],
       [DataSubjectArray: {'Ishan'}, DataSubjectArray: {'Ishan'},
        DataSubjectArray: {'Ishan'}, DataSubjectArray: {'Ishan'},
        DataSubjectArray: {'Ishan'}]], dtype=object)

If you add two PhiTensors belonging to different people, you would get a **GammaTensor**.
This is shown below:

In [None]:
second_tensor = sy.Tensor(data).private(min_val=0, max_val=10, data_subjects="Carl")

gamma_tensor = private_tensor + second_tensor
gamma_tensor.child

And you can see the data subjects in this array:

In [None]:
gamma_tensor.child.data_subjects

There's one important thing to note here. Once we have a GammaTensor, we need to be extra careful since we are combining private data belonging to two different people in a single Tensor.

We keep track of every operation that occurs with a GammaTensor in a dictionary called `sources`. You can think of this as a tree keeping track of which input tensors combined to give which output tensors.

Each key in this dictionary is an integer (unique to every GammaTensor) which maps to the corresponding GammaTensor.

We can see below how `gamma_tensor`, which was created by adding `private_tensor` and `second_tensor`, has both of those tensors in its `sources` tree:

In [None]:
gamma_tensor.child.sources

A GammaTensor also tracks which operation created it, in its `func_str` attribute:

In [None]:
gamma_tensor.child.func_str

Notes:

- You might imagine that as we do more and more operations, this source tree can become gigantic. You're correct.
- This is why we keep track of PhiTensors separately. Many data science workflows can be (and are) done with data belonging to just a single individual, in which case we don't need to take the precautions associated with GammaTensor.
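
To get a feel for this bookkeeping, here's a toy sketch of a tensor that remembers its inputs and the operation that produced it. `TrackedTensor` is a hypothetical stand-in written for this lesson. It borrows the attribute names `sources` and `func_str` from the text above, but it is not PySyft's GammaTensor implementation.

```python
import itertools

_ids = itertools.count(1)  # hypothetical source of unique integer ids


class TrackedTensor:
    """Hypothetical stand-in that records how it was built."""

    def __init__(self, values, data_subjects, func_str="no_op", sources=None):
        self.id = next(_ids)
        self.values = list(values)
        self.data_subjects = set(data_subjects)
        self.func_str = func_str      # operation that produced this tensor
        self.sources = sources or {}  # id -> input tensor

    def __add__(self, other):
        summed = [a + b for a, b in zip(self.values, other.values)]
        # Carry forward each input's own history, then add the inputs themselves.
        sources = {**self.sources, **other.sources,
                   self.id: self, other.id: other}
        return TrackedTensor(summed,
                             self.data_subjects | other.data_subjects,
                             func_str="add", sources=sources)


a = TrackedTensor([1, 2, 3], {"Ishan"})
b = TrackedTensor([4, 5, 6], {"Carl"})
c = a + b
print(c.values, sorted(c.data_subjects), c.func_str, sorted(c.sources))
```

Every further operation on `c` would copy its `sources` forward and add to them, which is why the tree keeps growing as a workflow gets longer.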

### DP Ledger in PySyft

### Publishing in PySyft