# Differential Privacy

> "You can't stop change any more than you can stop the sun from setting." ~ Shmi Skywalker

In this lesson we'll learn about **Differential Privacy.** The lesson is divided into two sections.

Section 1 will focus on **Intuition**: 
* Intuition
    * What is differential privacy (DP)?
    * How does DP work?
        * How does adding noise protect privacy?
        * How much noise do we add?
        * What is the tradeoff?
    * What is the privacy budget?
        * Connection between privacy budget and risk
    * What is epsilon?
    
Section 2 will focus on DP in **our codebase.**
* DP in PySyft
    * How DP in PySyft is different from DP elsewhere
        * Adversarial
        * Individual
        * Automatic
    * Differential Privacy Tensors
        * PhiTensors
        * GammaTensors
        * Helper Classes:
            * LazyRepeatArrays
            * DataSubjectArrays
    * Ledgers and Privacy Budget Accounting
    * Sigma and noise addition

<hr>

## Motivation 

In a previous lesson, we discussed how difficult it is to protect people's privacy when working with or releasing data. We discussed the Netflix Prize, where supposedly anonymized subscribers in the released dataset were de-anonymized with shocking accuracy. We mentioned problems caused by data linkages, and talked about the copy problem.

Differential Privacy is one of the Privacy Enhancing Technologies (PETs) that we discussed in a previous session. Like other PETs, it tries to solve some of these problems.

We'll unpack this in more detail. But first, let's quickly standardize some terminology.

<img src="imgs/ds_terminology.png">

This is the standard "journey" of data in data science:
* Raw data is collected from lots of people (like you and me, called **data subjects**).
* This raw data is collected and often cleaned/preprocessed by **data owners**.
* The data owners then pass these datasets on to their data scientists, who test their algorithms or workflows on the data to draw useful conclusions or build products.

<hr>

## Section 1: Intuition

Put simply, differential privacy is a mathematical guarantee that the output of an algorithm is similar when data belonging to one person is removed.

<img src="imgs/dp_definition.png">

Because the outputs are so similar, adding even a small amount of noise can hide the effect of that person's data.

<img src="imgs/dp_similarity.png">

<br>
<br>
This means the Data Scientist still gets a reasonably accurate answer (~25.7) but can't pick out the effect of your data.

So in a nutshell, DP makes it so that the data scientist doesn't work with just the datasets: they work with datasets plus some noise that serves to protect the privacy of the data subjects.
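
Here is a minimal sketch of that idea. The ages and the noise scale are made up for illustration (they are not the numbers in the figures above, and the noise scale is not a calibrated DP guarantee). It shows that the mean barely moves when one person is removed, and that a small amount of random noise is larger than that movement.

```python
import random
import statistics

# Hypothetical data: ages of 1,000 data subjects (made up for illustration).
rng = random.Random(0)
ages = [rng.randint(18, 90) for _ in range(1000)]

# Run the same query with and without the first person's data.
mean_with = statistics.mean(ages)
mean_without = statistics.mean(ages[1:])
effect_of_one_person = abs(mean_with - mean_without)

# Illustrative noise scale (an assumption, not a calibrated value).
noise_scale = 0.5
noisy_answer = mean_with + rng.gauss(0, noise_scale)

print(f"mean with the person:    {mean_with:.3f}")
print(f"mean without the person: {mean_without:.3f}")
print(f"effect of one person:    {effect_of_one_person:.3f}")
print(f"noisy answer released:   {noisy_answer:.3f}")
```

With 1,000 people whose ages lie between 18 and 90, one person can shift the mean by at most 0.072, which is well below the noise scale used here.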

<img src="imgs/dp_ds.png">

<br>

Thus far, we've seen how differential privacy (DP) leads to the addition of a small amount of noise to protect the privacy of someone's data.

A reasonable next question to ask is ***how much noise do we add?***

### How Much Noise To Add? (in English)


Let's start with an easy, intuitive answer:

<img src="imgs/dp_enuf2hide.png">

In the image above, we haven't added enough noise to properly protect the person's privacy. We need to add more!


<img src="imgs/dp_not_enuf.png">

This time, we went way overboard with adding noise. The Data Scientist is going to suffer a serious loss of accuracy in their calculations. We need to add less!

### How Much Noise to Add? (in Math)


DP is a mathematical guarantee, and so there is a mathematical answer to this question.
But before we can answer this question of how much noise to add, we'll need to understand a key insight: **datasets are distributions.**


This might seem a bit strange to some of you. Let's take a simple example.

Imagine you had a dataset consisting of the numbers [1, 2, 3, 4 and 5].
An easy way to convert this dataset into a distribution is to iterate through every datapoint and ask: what is the probability of drawing this number from the dataset?

You'd then get a graph that looks a lot like this:

<img src="imgs/dp_datasets_distr.png">


Voila! Using this simple scheme, we've converted our dataset into a probability distribution.
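
In code, this scheme is just a relative-frequency count. The helper name `to_distribution` below is made up for this sketch:

```python
from collections import Counter

dataset = [1, 2, 3, 4, 5]

def to_distribution(data):
    """Map each distinct value to its relative frequency in the data."""
    counts = Counter(data)
    total = len(data)
    return {value: count / total for value, count in counts.items()}

distribution = to_distribution(dataset)
print(distribution)  # {1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2, 5: 0.2}
```

Each number appears once out of five datapoints, so every value gets a probability of 0.2.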

Now, you might ask: what was the point of that? Why did we need to convert our dataset into a distribution?

Well, it turns out there are lots of ways to compare two distributions! Here are 16 of them, just for illustration:

<img src="imgs/dp_distr_comp.png">

This means we now have a way to compare two datasets: convert them to distributions, and then compare them using any of the methods shown in the image above.
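
As a rough sketch, here is one common comparison, the total variation distance, applied to a dataset with and without one person's datapoint. This measure is chosen purely for illustration, and the function names are made up for this example.

```python
from collections import Counter

def to_distribution(data):
    """Map each distinct value to its relative frequency in the data."""
    counts = Counter(data)
    return {value: count / len(data) for value, count in counts.items()}

def total_variation_distance(p, q):
    """Half the sum of absolute probability differences over all values."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

full = [1, 2, 3, 4, 5]
without_one = [1, 2, 3, 4]  # the person who contributed the 5 is removed

distance = total_variation_distance(
    to_distribution(full), to_distribution(without_one)
)
print(round(distance, 3))  # 0.2
```

A distance of 0 would mean the two datasets look identical as distributions, and a distance of 1 would mean they share no values at all.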

You might ask: but Ishan, **why is it important to compare datasets?** What does this have to do with figuring out how much noise we want to add?

Well, let's revisit the definition of Differential Privacy:

Differential Privacy: The outcome of an algorithm is ***similar*** when a single person's data is removed from a dataset.

**Intuition:** As soon as we know how similar the outcomes are, we immediately know how much noise is enough, and how much is too much. And it's easier to calculate this similarity by thinking about our datasets as if they were distributions.

### Introducing $\epsilon$, the privacy budget

This "similarity" or difference is captured in a parameter called $\epsilon$ (**epsilon**). It's also called the **privacy budget.**

$\epsilon$ has different formulas (depending on which method we used to compare our datasets/distributions), but it is always a measure of how much your data affects the outcome of the result:

<img src="imgs/dp_intro_epsilon.png">

$\epsilon$, or the privacy budget, is probably the most important idea in differential privacy, so it's worth taking some time to emphasize what exactly it is.


The privacy budget, $\epsilon$, is a measure of:
* How much your data stands out
    * Thus, it's also a measure of *privacy risk*: how likely your data is to be identified
* How much your data affects the outcome of the query or algorithm
* How much noise is needed to hide your data's influence.

**Note:** 
There is a mathematical definition of epsilon (see resources below), but thinking about it in this way helps to build intuition.
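
For readers who want the formal statement now, here is the standard textbook definition of $\epsilon$-differential privacy. The symbols $M$ (a randomized algorithm), $D$ and $D'$ (two datasets that differ in one person's data), and $S$ (any set of possible outputs) are conventional notation, not names used elsewhere in this lesson:

$$
\Pr[M(D) \in S] \le e^{\epsilon} \cdot \Pr[M(D') \in S]
$$

When $\epsilon$ is small, $e^{\epsilon}$ is close to 1, so the output distributions with and without one person's data are nearly indistinguishable. That is the "similarity" described above.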

### Effects of $\epsilon$

**The higher the $\epsilon$, the more your data affects the outcome of an algorithm.**

<img src="imgs/dp_epsilon_eff1.png">

For example:
* Imagine we're trying to calculate the total net worth of all the people living inside a single city, and we do this for every city in the US. Warren Buffett and Elon Musk would have gigantic values of $\epsilon$, whereas someone who's homeless would probably have a much lower $\epsilon$.

<br>

**The higher the $\epsilon$, the more noise needs to be added to protect your privacy.**

This should make intuitive sense:
* As we discussed above, the higher the epsilon, the more your data affects the outcome of an algorithm.
* The more your data affects the outcome of an algorithm, the more noise you need to add to mask the effects of your data.
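
One hedged way to see this in code is the textbook Laplace mechanism (used here for illustration, not as a description of what PySyft does). There, the noise scale is the largest amount one person can change the answer, divided by the privacy loss we are willing to tolerate. The textbook formula calls that tolerated loss $\epsilon$, so for a fixed tolerance, the person with more influence needs more noise. The dollar figures and function names below are made up.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noise_scale(max_influence, tolerated_privacy_loss):
    """Textbook Laplace scale: one person's maximum influence / tolerated loss."""
    return max_influence / tolerated_privacy_loss

rng = random.Random(42)
tolerated_privacy_loss = 1.0  # hypothetical target, the same for everyone

# Hypothetical bounds on how much one person can change a city's total (USD).
typical_resident = 1e6
billionaire = 1e11

scale_typical = noise_scale(typical_resident, tolerated_privacy_loss)
scale_billionaire = noise_scale(billionaire, tolerated_privacy_loss)

print(f"noise scale to hide a typical resident: {scale_typical:,.0f}")
print(f"noise scale to hide a billionaire:      {scale_billionaire:,.0f}")
print(f"one noise draw at the typical scale:    {laplace_noise(scale_typical, rng):,.0f}")
```

Hiding the billionaire takes noise that is 100,000 times larger, which is the accuracy cost the tradeoff discussion above warned about.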

<img src="imgs/dp_epsilon_eff2.png">


In theory, noise is a random value that depends on $\epsilon$. In practice, we obtain it by sampling from a distribution where $\epsilon$ is a parameter.

Often, it's a normal (Gaussian) distribution where $\epsilon$ affects the standard deviation.
So the bigger the $\epsilon$, the wider the range of values the noise is likely to be sampled from [1]:

<img src="imgs/dp_sigma.png">

[1]: We know this because of the empirical (68-95-99.7) rule in statistics, which says that when you sample from a normal distribution, approximately 68% of the values will be within one standard deviation of the mean, 95% will be within two standard deviations, and 99.7% will be within three. (The Six Sigma process-improvement methodology takes its name from the same idea: it aims to push manufacturing defect rates out to six standard deviations from the mean.)

<img src="imgs/dp_empirical_rule.png">
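
The empirical rule is easy to check by sampling. The sketch below uses an arbitrary standard deviation and a fixed random seed, both chosen just for this example:

```python
import random

rng = random.Random(0)
sigma = 2.0  # hypothetical standard deviation of the noise
samples = [rng.gauss(0, sigma) for _ in range(100_000)]

def fraction_within(samples, k, sigma):
    """Fraction of samples that fall within k standard deviations of 0."""
    return sum(abs(x) <= k * sigma for x in samples) / len(samples)

for k in (1, 2, 3):
    share = fraction_within(samples, k, sigma)
    print(f"within {k} standard deviation(s): {share:.1%}")
```

The printed shares should land close to 68%, 95%, and 99.7%. Changing `sigma` widens or narrows the samples but leaves these shares essentially unchanged.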


<br>

### Final Intuition: Why does DP work?

I'd like to leave you with an intuitive, gut feeling as to why differential privacy, this way of adding noise, actually works.

The intuition is that when it comes to protecting privacy, your data "hides" in the background of other people's data. The more your data fits in with other people's data, the easier it is to blend in and hide in plain sight.

Put it this way: it's much easier to spot the horse in the first image than the second one.

<img src="imgs/dp_horse.png">


<br>

Put it another way: the best way to hide a data point is to surround it with other, very similar data points! They'll be hard to tell apart, and it'll be like trying to look for a piece of **hay** in a very large, fluffy haystack:


<img src="imgs/dp_haystack.png">

By adding noise, we're doing something very similar: we're making it hard to discern one data point from another, the same way dumping all your hay in a haystack makes it hard to single out individual pieces of hay.

### Takeaways

* Regular Differential Privacy allows you to learn high-level statistics and trends (how big is the haystack? Am I running out of hay? Is my hay on fire?) but not individual data (is hay #5 turning blue?)
    * You can answer questions such as "What causes cancer?" and not "Does Walter White have cancer?"
* Differential Privacy works by adding noise to the output of an algorithm to protect privacy.
* $\epsilon$, the privacy budget, is an indicator of how likely it is that someone's data stands out and can be identified.
    * The higher the $\epsilon$, the more likely that person's data stands out and will be identified.
    * $\epsilon$ can be calculated in many ways, but it's always an indicator of the risk of being identified and having your privacy violated.

<hr>

## Section 2: DP in PySyft

<hr>