## Overview

In the previous section, we learned how to deploy a Domain Server as a Data Owner and how to upload private data to the server by annotating it with Auto DP metadata. In this section, we look more closely at Differential Privacy (DP) and at how it is used to set up a privacy budget. We also walk through the Data Scientist workflow for doing remote data science.

By the end of this tutorial, we will understand the underlying concept behind Differential Privacy and how setting a privacy budget for a user or Data Scientist determines how much can be learned from any individual data subject in the private data.


**This tutorial covers the following**
- Differential Privacy: Intuition and Terminology
- Differential Privacy in PySyft
- Assign privacy budget to private data
- Create user account for Data Scientist with assigned privacy budget
- Make a query and publish result as a Data Scientist
- Request more privacy budget

### Differential Privacy: Intuition and Terminology

Differential Privacy is a mathematical guarantee that if you run an algorithm against a dataset, the output is nearly indistinguishable from the output of running the same algorithm against the same dataset a second time, with the data belonging to one specific Data Subject removed.

```
# result of operation on all data is indistinguishable from
# result of operation on all data minus a Data Subject's data points
dp(op(data[:])) ~= dp(op(data[:i] + data[i+1:]))
```

The result of such a guarantee is that it should not be possible to guess whether a specific Data Subject is represented in a dataset. By extension, the privacy of any given individual is protected: you cannot learn anything about someone if you cannot know with any certainty whether they are present in or absent from the dataset used in your algorithm.



#### Dataset without Differential Privacy

Let's say we have a dataset of 10K data subjects.
First let's see what the real difference would be between a query with and without a single data subject's data.

![caption](./files/without_dp_noise.webp)

#### Dataset protected with Differential Privacy

Now let's see how we can hide that difference by just adding noise so both the query results look similar.

![caption](./files/with_dp_noise.webp)

This means that the Data Scientist still gets a reasonably accurate answer of approximately 25.7, but they can't be sure whether you were in the dataset or, if you were, what impact your age had on the average.

So in a nutshell, `DP` makes sure that the Data Scientist doesn't work with the raw datasets; rather, they work with datasets + `DP noise`. `DP` also adds just enough noise to protect the privacy of the data subjects while still providing useful results.
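The effect described above can be reproduced locally with NumPy. This is only a toy illustration and not PySyft's mechanism: the ages are randomly generated, and the seed and `sigma` are arbitrary values assumed for the sketch.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical dataset: 10,000 data subjects with ages between 18 and 90
ages = rng.integers(low=18, high=91, size=10_000)

# Exact query results with and without the first data subject
mean_with = ages.mean()
mean_without = ages[1:].mean()
true_gap = abs(mean_with - mean_without)

# Add independent Gaussian noise to each released result
sigma = 0.05  # assumed noise scale, chosen only for illustration
noisy_with = mean_with + rng.normal(0.0, sigma)
noisy_without = mean_without + rng.normal(0.0, sigma)

print(f"true gap between the two queries: {true_gap:.6f}")
print(f"noisy result (with subject):      {noisy_with:.3f}")
print(f"noisy result (without subject):   {noisy_without:.3f}")
```

Because the true gap is much smaller than the noise scale, the two noisy results cannot reliably be told apart, while both stay close to the real average.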

![caption](./files/dp_full_arch.webp)

#### How Much Noise to Add?

Thus far, we've seen how `DP (Differential Privacy)` leads to the addition of a small amount of noise to protect the privacy of someone's data.

A reasonable next question to ask is: `How much noise should we add?`

Let's try with an example:

![caption](./files/how-much-noise-1.webp)

It's important to understand that a Data Scientist can learn private information about individuals in a dataset by comparing results over time. In the above example, suppose the Data Scientist got the mean of the dataset, `25.705`, and was then able to query the system without your data (many data APIs allow for different ways to group or select data based on fields) and saw a result of `25.70`. This small difference might tip them off enough to guess your age with reasonable accuracy.

Let's try again.

![caption](./files/how-much-noise-2.webp)

This time, while your age is now very hard to guess, we have added so much noise that the results have lost some of their accuracy and utility to the Data Scientist.

#### Privacy-Accuracy Trade-off

This shows that there is a `tradeoff` between the privacy gained by adding noise and the corresponding utility lost in the results we get from our algorithms.

- If we add too much noise, the results are no longer accurate, rendering the algorithm useless
- If we don't add enough noise, the results are accurate, but we aren't able to protect anyone's privacy
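The accuracy side of this trade-off can be sketched numerically. The helper below is hypothetical (it is not part of `syft`) and assumes zero-mean Gaussian noise added to a true result of `25.7`; it only measures how far the noisy answers drift as `sigma` grows.

```python
import numpy as np

def mean_abs_error(sigma, true_value=25.7, trials=10_000, seed=0):
    """Average distance between the noisy and true result for a given sigma."""
    rng = np.random.default_rng(seed)
    noisy = true_value + rng.normal(0.0, sigma, size=trials)
    return float(np.abs(noisy - true_value).mean())

# Small sigma keeps the answer accurate, large sigma drowns it in noise
for sigma in (0.01, 1.0, 20.0):
    print(f"sigma={sigma:>5}: mean absolute error ~ {mean_abs_error(sigma):.3f}")
```

The privacy side moves in the opposite direction: the smaller the noise, the easier it is to spot the contribution of a single data subject.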

#### Privacy Budget (∈)

This "similarity" or difference is captured in a parameter called `∈`, which we often refer to as the privacy budget.

`∈` has different formulas (depending on which method we used to compare our datasets or distributions), but it is always a measure of how much your data (belonging to a data subject) affects the algorithm's result.

![caption](./files/epsilon.webp)

`∈`, or the Privacy Budget, is probably the most important idea in Differential Privacy, so it's worth taking some time to emphasize exactly what it is.

The privacy budget, `∈`, is a measure of:

- How much your data stands out
    - Thus, it's also a measure of privacy risk: how likely your data is to be identified
- How much your data affects the outcome of the query or algorithm
- How much noise is needed to hide your data's influence

**Note**: There is a more rigorous mathematical definition of epsilon, but thinking about it in this way helps to build an intuition.

**The higher the `∈`, the more your data affects the outcome of an algorithm.**
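For reference, the standard textbook definition of pure differential privacy makes this intuition precise. The notation is an assumption layered on top of this tutorial: $\mathcal{M}$ is the randomized algorithm, $D$ and $D'$ are two datasets that differ only in one Data Subject's data, $S$ is any set of possible outputs, and $\varepsilon$ is the privacy budget written as `∈` elsewhere on this page. PySyft's own accounting may use a different variant of this definition.

$$
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S]
$$

When $\varepsilon$ is close to 0, the two probabilities are nearly equal, so the output reveals almost nothing about whether the Data Subject was included. A larger $\varepsilon$ allows the two output distributions to differ more.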



#### DP in a nutshell

Differential Privacy:
- Helps answer questions such as "What causes cancer?" but not "Does Walter White have cancer?"
- Works by adding noise to the output of an algorithm to protect privacy

∈, the privacy budget, is an indicator of how likely it is that someone's data stands out and can be identified:
- The higher the ∈, the more likely that person's data stands out and will be identified
- It can be calculated in many ways, but it's always an indicator of the risk of being identified and having your privacy violated

### Differential Privacy in PySyft

The goal of this section is to help you understand the major components of the `Differential Privacy` (DP) system in `syft`, as well as how `Syft's` DP system differs from other DP systems elsewhere.

#### Syft's DP system is

**Adversarial**
- A Data Scientist can learn no more than what their privacy budget allows

#### How do these features help you as a `Data Owner?`

- Data Scientist has a hard limit on how much they can learn
- They are incentivized to be frugal with their query costs due to the hard limits of their total budget
- Lower cost queries mean lower risk to Data Subject privacy

#### How do these features help you as a `Data Scientist?`

- The safeguards in place increase the confidence of Data Owners, making it more likely that you will be trusted with access to important datasets

**Individual**
- Privacy budgeting ∈ is tracked at the level of individual Data Subjects.
- While the privacy budget for a query is the maximum epsilon of any individual Data Subject whose data was involved in the query, we can also filter Data Subjects automatically to lower the cost

#### How do these features help you as a `Data Owner?`

- You have a guarantee that each of your Data Subject's data will remain private

#### How do these features help you as a `Data Scientist?`

- You can do your work without worrying about accidentally violating someone's privacy because the system won't let you

**Automatic**
- If you have enough privacy budget, you can get the results of any query you want without waiting

#### How do these features help you as a `Data Owner?`

- Imagine you were giving a Data Scientist access to a classified dataset
- You don't have to get on the phone, approve every line of code they write, or fill out paperwork

#### How do these features help you as a `Data Scientist?`

- Imagine you were the Data Scientist getting access to this classified dataset
- You don't have to get on the phone or fill out any paperwork just to get permission to do your work

Differential Privacy in syft is made up of three main components:
- DP Tensors
- DP Ledger
- Publishing

![caption](./files/dp_publish.webp)

In our previous example, our private data consists of `Rob`, `Bob` and `Job` as individual data subjects. This is handled by syft's `DP Tensors` and acted on with our Pointer system so that Data Scientists can't copy the data.

We talked about how syft tracks and differentiates individuals by recording the history of their cumulative privacy risk in something we call a `DP Ledger`, which lives inside a Domain alongside the private data.

Finally, the process of calculating epsilon, filtering out data which is over budget and automatically returning a `result + noise` to a Data Scientist is something we call `Publishing`.
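To make the interplay of these three components concrete, here is a deliberately simplified sketch. It is not Syft's implementation: `publish_mean`, the dictionary-based ledger, the flat per-subject `cost` and all numeric values are hypothetical and chosen only to show the filter, charge and add-noise steps.

```python
import numpy as np

def publish_mean(values, subjects, ledger, budget, cost, sigma, rng):
    """Toy publish step: filter over-budget subjects, charge the rest, add noise.

    `ledger` maps each data subject to the epsilon already spent on them.
    `cost` is an assumed per-subject epsilon charge for this query.
    """
    # Keep only data subjects who can still afford this query
    keep = [i for i, s in enumerate(subjects) if ledger.get(s, 0.0) + cost <= budget]
    if not keep:
        raise ValueError("every data subject is over budget")

    # Record the spend for each data subject that took part
    for i in keep:
        ledger[subjects[i]] = ledger.get(subjects[i], 0.0) + cost

    # Release the result plus Gaussian noise
    result = float(np.mean([values[i] for i in keep]))
    return result + rng.normal(0.0, sigma), [subjects[i] for i in keep]

# Bob has nearly exhausted an assumed budget of 1.0
ledger = {"Rob": 0.0, "Bob": 0.9, "Job": 0.0}
noisy, used = publish_mean(
    values=[30, 40, 50], subjects=["Rob", "Bob", "Job"],
    ledger=ledger, budget=1.0, cost=0.5, sigma=1.5,
    rng=np.random.default_rng(0),
)
print(used)    # data subjects included in the result
print(ledger)  # updated cumulative spend per data subject
```

In this sketch Bob's data is dropped from the query because including it would push his cumulative spend past the budget, while Rob and Job are each charged for taking part.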

### DP in Action in PySyft

Before we move forward with DP, first we need to deploy a Domain Server and Upload our private dataset as we did in the [previous tutorial](./tutorial-1-domain-setup-and-data-upload.ipynb). 

#### Deploy a Domain Server
Before deploying a Domain Server, let's make sure that we have all the required packages installed.

In [10]:
!pip install -U hagrid

Collecting hagrid
  Downloading hagrid-0.2.125-py3-none-any.whl (63 kB)
Installing collected packages: hagrid
  Attempting uninstall: hagrid
    Found existing installation: hagrid 0.2.121
    Uninstalling hagrid-0.2.121:
      Successfully uninstalled hagrid-0.2.121
Successfully installed hagrid-0.2.125

[notice] A new release of pip available: 22.3 -> 22.3.1
[notice] To update, run: pip install --upgrade pip


In [11]:
from hagrid import wizard

In [12]:
wizard.check_hagrid

In [13]:
!pip install syft==0.7.0b59

Collecting syft==0.7.0b59
  Downloading syft-0.7.0b59-py2.py3-none-any.whl (7.8 MB)
Collecting sqlalchemy==1.4.36
  Downloading SQLAlchemy-1.4.36.tar.gz (8.1 MB)
  Preparing metadata (setup.py) ... done
Collecting pycapnp==1.2.1
  Downloading pycapnp-1.2.1-cp310-cp310-macosx_11_0_arm64.whl (1.4 MB)


Installing collected packages: pycapnp, sqlalchemy, syft
  Attempting uninstall: pycapnp
    Found existing installation: pycapnp 1.2.2
    Uninstalling pycapnp-1.2.2:
      Successfully uninstalled pycapnp-1.2.2
  Attempting uninstall: sqlalchemy
    Found existing installation: SQLAlchemy 1.4.45
    Uninstalling SQLAlchemy-1.4.45:
      Successfully uninstalled SQLAlchemy-1.4.45
  DEPRECATION: sqlalchemy is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at https://github.com/pypa/pip/issues/8559
  Running setup.py install for sqlalchemy ... done
  Attempting uninstall: syft
    Found existing installation: syft 0.7.0b62
    Uninstalling syft-0.7.0b62:
      Successfully uninstalled syft-0.7.0b62
Successfully installed pycapnp-1

In [14]:
wizard.check_syft

Other than `syft` and `hagrid`, we will also need docker to deploy the backend containers and essentially deploy the Domain Server.

In [16]:
wizard.check_grid_docker

Now we are ready to **Launch the Domain Server.**

In [17]:
DOMAIN_NAME = 'test_domain' # edit DOMAIN_NAME as per your choice

!hagrid launch {DOMAIN_NAME} domain to docker:80 --tag=0.7.0-beta.59

✅ Updated HAGrid from branch: 0.7.0
✅ Docker service is running
✅ Git 2.32.1
✅ Docker 20.10.17
✅ Docker Compose 2.7.0


 _   _       _     _                 _   _                       _
| | | |     | |   | |               | | | |                     | |
| |_| | ___ | | __| |   ___  _ __   | |_| | __ _ _ __ _ __ _   _| |
|  _  |/ _ \| |/ _` |  / _ \| '_ \  |  _  |/ _` | '__| '__| | | | |
| | | | (_) | | (_| | | (_) | | | | | | | | (_| | |  | |  | |_| |_|
\_| |_/\___/|_|\__,_|  \___/|_| |_| \_| |_/\__,_|_|  |_|   \__, (_)
                                                            __/ |
                                                           |___/
        
Launching a PyGrid Domain node on port 80!

  - NAME: test_domain
  - RELEASE: production
  - ARCH: linux/arm64
  - TYPE: domain
  - DOCKER_TAG: 0.7.0-beta.59
  - HAGRID_

Let's check the health of the `Domain Server` we have just deployed.

In [24]:
host = !curl -s https://icanhazip.com
host = host[0]

# hagrid can automatically detect an external IP. Or you can run `!hagrid check host` to check your server health 

#!hagrid check {host}
!hagrid check localhost:80

┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━┓
┃ PyGrid    ┃ Info                                    ┃    ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━┩
│ UI (βeta) │ http://localhost:80/login               │ ✅ │
│ api       │ http://localhost:80/api/v1/openapi.json │ ✅ │
└───────────┴─────────────────────────────────────────┴────┘


Descriptions of what each of the above PyGrid components means are available in the [previous tutorial](./tutorial-1-domain-setup-and-data-upload.ipynb). 
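If you prefer to script the health check, a minimal sketch using only the Python standard library is shown here. The helper name `is_domain_up` is hypothetical, and the default URL is simply the `api` endpoint listed in the table above; it assumes that a reachable endpoint answering with HTTP 200 means the server is healthy.

```python
import urllib.error
import urllib.request

def is_domain_up(url="http://localhost:80/api/v1/openapi.json", timeout=5):
    """Return True if the URL answers with HTTP 200, False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts and HTTP error statuses
        return False

if __name__ == "__main__":
    print("Domain reachable:", is_domain_up())
```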

#### Uploading Private Data to a Domain Server

To utilize the privacy-enhancing features offered in PyGrid and to communicate with your domain server, first you need to import `syft`.

In [20]:
try:
    import syft as sy
    print("Syft is imported")
except ImportError:
    print("Syft is not installed. Please follow the Getting Started section above.")

Syft is imported


Before we can upload data, we need to login to the server as an `admin`.

WARNING: Please change the default username and password below to a more secure and private combination of your preference.

In [21]:
try:
    domain_client = sy.login(
        port=80,
        email="info@openmined.org",
        password="changethis",
    )
except Exception as e:
    # Report the underlying error and point at the port the domain was launched on
    print(f"Unable to login ({e}). Please check your domain is up with `!hagrid check localhost:80`")


Anyone can login as an admin to your node right now because your password is still the default PySyft username and password!!!

Connecting to localhost... done! 	 Logging into test_domain... done!


For this tutorial, we will use a dataset of ages, and the names of the people these ages belong to will be the letters `A -> J`.

In [25]:
import numpy as np
ages = np.array([25, 35, 21, 19, 40, 55, 31, 18, 27, 33])
names = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]
print(len(ages))
print(len(names))

10
10


Now, as we have done previously we can create a `PhiTensor` by combining the `data` and the `data_subjects` and annotating the bounds based on our knowledge of the dataset.

In [28]:
age_tensor = sy.Tensor(ages).annotate_with_dp_metadata(
   lower_bound=0, upper_bound=100, data_subjects=names
)
age_tensor

Tensor annotated with DP Metadata!
You can upload this Tensor to a domain node by calling `<domain_client>.load_dataset` and passing in this tensor as an asset.


Tensor(child=PhiTensor(child=[25 35 21 19 40 55 31 18 27 33], min_vals=<lazyrepeatarray data: [0] -> shape: (10,)>, max_vals=<lazyrepeatarray data: [100] -> shape: (10,)>))
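The bounds matter because they limit how much any single data subject can change a result. As a rough illustration (a local sketch, not necessarily how Syft computes it), for a mean over `n` values that all lie in `[lower_bound, upper_bound]`, replacing one value can shift the mean by at most `(upper_bound - lower_bound) / n`. The helper `mean_sensitivity` is a hypothetical name used only for this example.

```python
import numpy as np

def mean_sensitivity(lower_bound, upper_bound, n):
    """Largest change in the mean when one of n bounded values is replaced."""
    return (upper_bound - lower_bound) / n

ages = np.array([25, 35, 21, 19, 40, 55, 31, 18, 27, 33])
bound = mean_sensitivity(lower_bound=0, upper_bound=100, n=len(ages))

# Worst case check: swap each value for either extreme and record the shift
worst = 0.0
for i in range(len(ages)):
    for extreme in (0, 100):
        changed = ages.copy()
        changed[i] = extreme
        worst = max(worst, abs(changed.mean() - ages.mean()))

print(bound)  # 10.0
print(worst)  # never exceeds the bound
```

Tighter bounds give a smaller worst-case shift, which is why it pays to annotate the data with bounds that reflect real knowledge of the dataset.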

Now we are ready to upload the dataset to the Domain Server.

In [29]:
domain_client.load_dataset(
    assets={"ages_tensor": age_tensor},
    name="ages_dataset",
    description="Ages of a group of people"
)

Loading dataset...Loading dataset... checking assets...Loading dataset... checking dataset name for uniqueness...Loading dataset... checking dataset name for uniqueness...                                                                                                                    Loading dataset... checking asset types...                              Loading dataset... uploading...🚀                        

Uploading `ages_tensor`: 100%|[32m████████████████████████████████████████[0m| 1/1 [00:00<00:00, 50.28it/s][0m


Dataset is uploaded successfully !!! 🎉

Run `<your client variable>.datasets` to see your new dataset loaded into your machine!


### Create Data Scientist Account

As a Data Owner, you can use PySyft to pre-determine the degree of information access you grant to a user using your domain server. For PyGrid, you can think of a privacy budget as a specified limit to the `visibility` a user can have into any one data subject on your domain server. When you create a user account in PyGrid, by default that user is assigned the lowest level of permissions and is given a privacy budget of `0` which means that they have `0` visibility into your domain’s data subjects.

Therefore, when creating an account for a Data Scientist, you have to explicitly assign the `privacy budget` associated with that account.

In [52]:
starting_budget = 999999
data_scientist_details = {
    "name": "Jane Doe",
    "email": "janedoe@skywalker.net",
    "password": "janes_house",
    "budget": starting_budget,
}
domain_client.users.create(**data_scientist_details)

User created successfully!


We have just created an account for a Data Scientist and assigned a privacy budget.

### Make a query and publish result as a Data Scientist

After the account has been created, it's time for Domain Owners to wear the hat of a Data Scientist and try out the Data Scientist workflow. Let's make a query using `0.5e` and then analyze the results to see how close they are to the actual value we would get if there was no `DP noise` in place.

First, we log in to the domain as a Data Scientist using the same credentials we used when creating the Data Scientist account.

In [53]:
skywalker = sy.login(port=80, email="janedoe@skywalker.net", password="janes_house")

Connecting to localhost... done! 	 Logging into test_domain... done!


In [54]:
print("Allotted Privacy Budget: ", skywalker.privacy_budget)

Allotted Privacy Budget:  999999.0


Once logged in, the Data Scientist can now access the dataset info and run computations through `syft` pointers.

In [55]:
age_dataset_prt = skywalker.datasets[-1]
age_dataset_prt

Dataset: ages_dataset
Description: Ages of a group of people



| Asset Key | Type | Shape |
|---|---|---|
| `["ages_tensor"]` | Tensor | (10,) |


In [56]:
age_ptr = age_dataset_prt["ages_tensor"]
age_ptr

PointerId: c67261a051844435943a9123f04e79ba
Status: Ready
Representation: array([16, 26, 51, 52, 78, 45, 18, 23, 73, 43])

(The data printed above is synthetic - it's an imitation of the real data.)

Since we specified the bounds above by their theoretical limits, the synthetic dataset matches those bounds and gives us a bit of a feeling for what kind of data there might be.

#### Calculate Private Mean

Let's find out what the average age in this dataset is by using the `mean` op on the data pointers, instead of raw data.

In [57]:
mean_ptr = age_ptr.mean()
mean_ptr

PointerId: 04f8b658bc764e40aba45d99ad7c5d7f
Status: Processing
Representation: array([81.45952218])

(The data printed above is synthetic - it's an imitation of the real data.)

We could continue to run additional operations but at some point we want our result.

#### Publish Result

A Data Scientist interacts with the `DP` system in `syft` by calling `publish` on their results. In effect they are saying: I am done with my computation and I want to see the results without having to go through a lengthy approval process.

The Data Scientist provides the desired amount of noise as `sigma`, the standard deviation of the noise distribution, and the system calculates the amount of `epsilon` required to protect individuals at that level of noise. If the Data Scientist has that amount of `epsilon` budget available and has not previously exceeded it for any Data Subject involved in the results, the system can account for this `publish` action, deduct the budget and release a noisy result to the Data Scientist immediately.
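To get a feel for how `sigma` drives the cost, the sketch below uses one textbook accounting rule for Gaussian noise: under Rényi differential privacy of order `alpha`, noise with standard deviation `sigma` on a query whose result one data subject can change by at most `sensitivity` costs `alpha * sensitivity**2 / (2 * sigma**2)`. This is an assumption made for illustration; it is not claimed to be the formula PySyft applies, and the numbers will not match the costs reported by the domain. The function name and the example values are hypothetical.

```python
def gaussian_rdp_epsilon(sensitivity, sigma, alpha=2.0):
    """Renyi-DP cost of the Gaussian mechanism at order alpha (textbook formula)."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return alpha * sensitivity**2 / (2.0 * sigma**2)

# Doubling sigma divides the cost by four
for sigma in (0.75, 1.5, 3.0):
    cost = gaussian_rdp_epsilon(sensitivity=10.0, sigma=sigma)
    print(f"sigma={sigma}: cost ~ {cost:.2f}")
```

The qualitative behaviour is what matters here: less noise means a sharply higher cost, and more noise means a lower one.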

Now, let's get the result of this computation by calling `publish` on the pointer with a `sigma`. The return value is a `pointer` to the result which if successful will be available for download.

More information about selecting the right `sigma` can be found in the [How to find the right sigma](#How-to-find-the-right-sigma) section.

In [58]:
result_ptr = mean_ptr.publish(sigma=1.5) # 1.5 is the default

Since `publish` can take some time we will have to wait before we can call `get`.

In [59]:
result_ptr.exists

True

In [60]:
result_ptr.block_with_timeout(60)
remote_mean = result_ptr.get(delete_obj=False)
remote_mean

31.91326755087075

#### Compare the noisy result to the true result

In [61]:
local_mean = ages.mean()
print("Real Mean:      =", local_mean)
print("Published Mean: =", remote_mean)
print("Difference:     =", round(abs(local_mean - remote_mean) / local_mean * 100, 2), "%")

Real Mean:      = 30.4
Published Mean: = 31.91326755087075
Difference:     = 4.98 %


As we can see, the noisy result differs only slightly from the original result without noise.
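A single published value is only one draw of the noise. To see what range of differences to expect, we can simulate many hypothetical releases locally. This sketch assumes that zero-mean Gaussian noise with standard deviation `sigma` is added directly to the true mean; it runs on the raw `ages` array and never touches the domain or the privacy budget.

```python
import numpy as np

ages = np.array([25, 35, 21, 19, 40, 55, 31, 18, 27, 33])
true_mean = ages.mean()

# Simulate many hypothetical releases with Gaussian noise of standard deviation sigma
sigma = 1.5
rng = np.random.default_rng(seed=0)
releases = true_mean + rng.normal(0.0, sigma, size=100_000)

relative_error = np.abs(releases - true_mean) / true_mean * 100
print(f"typical relative error:       {np.median(relative_error):.2f} %")
print(f"share of releases within 5 %: {(relative_error <= 5).mean():.2%}")
```

Under these assumptions, a difference of a few percent like the one observed above is typical rather than exceptional.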

#### Track Epsilon Spend

In [62]:
current_budget = skywalker.privacy_budget
print("Starting Budget:", starting_budget)
print("Current Budget: ", current_budget)
print("Cost:           ", starting_budget - current_budget)

Starting Budget: 999999
Current Budget:  999690.0357041412
Cost:            308.9642958587501


**Note**: If you create new pointers and publish them, they count as new epsilon spend, so the cost shown above will change the more you re-run the cells. At any point, you can check your available privacy budget by running the following command.

In [63]:
print("Allotted Privacy Budget: ", skywalker.privacy_budget)

Allotted Privacy Budget:  999690.0357041412


### Request More Privacy Budget

There might be instances when the Data Scientist runs out of privacy budget before getting their expected results. In such cases, they can request more budget directly from `PyGrid`. 

In [64]:
skywalker.request_budget(eps=55, reason="Query might save millions of lives")

Requested 55 epsilon of budget. Call .privacy_budget to see if your budget has arrived!


This request will show up on the Data Owner's end, where they can either `approve` or `reject` the request.

In [65]:
domain_client.requests

|   | Name | Email | Role | Request Type | Status | Reason | Request ID | Requested Object's ID | Requested Object's tags | Requested Budget | Current Budget |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Zarreen1 | znr1@openmined.org | Data Scientist | DATA | accepted | | `<UID: bd60e487c6864957b0e0ddb4fec510ae>` | `<UID: a5fa8a2aeecd4838bf8c357bcd4d838f>` | [] | | |
| 1 | Zarreen1 | znr1@openmined.org | Data Scientist | DATA | accepted | | `<UID: 9553044db5d8459ca1e19c7acefdb744>` | `<UID: c55fcfacfa184a91b8308a91f7101f26>` | [] | | |
| 2 | Jane Doe | janedoe@skywalker.net | Data Scientist | BUDGET | pending | Query might save millions of lives | `<UID: 0e8d26d3b71940f6b251d842c904cd37>` | | [] | 55.0 | 999690.0625 |


In [67]:
domain_client.requests[-1].accept()

In [69]:
domain_client.requests

|   | Name | Email | Role | Request Type | Status | Reason | Request ID | Requested Object's ID | Requested Object's tags | Requested Budget | Current Budget |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Zarreen1 | znr1@openmined.org | Data Scientist | DATA | accepted | | `<UID: bd60e487c6864957b0e0ddb4fec510ae>` | `<UID: a5fa8a2aeecd4838bf8c357bcd4d838f>` | [] | | |
| 1 | Zarreen1 | znr1@openmined.org | Data Scientist | DATA | accepted | | `<UID: 9553044db5d8459ca1e19c7acefdb744>` | `<UID: c55fcfacfa184a91b8308a91f7101f26>` | [] | | |
| 2 | Jane Doe | janedoe@skywalker.net | Data Scientist | BUDGET | accepted | Query might save millions of lives | `<UID: 0e8d26d3b71940f6b251d842c904cd37>` | | [] | 55.0 | 999690.0625 |


On the Data Scientist's side, they can check if they were granted additional budget.

In [70]:
skywalker.privacy_budget

999745.0357041412

Awesome 👏 !! You have successfully executed **remote data science** on private data!

If you would like to shut down the domain server, you can easily do so by running the following command.

In [71]:
!hagrid land {DOMAIN_NAME} --silent --force

✅ Updated HAGrid from branch: 0.7.0
HAGrid land test_domain complete!


#### How to find the right `sigma`

`Sigma` represents how much noise the user wants added to the result. The noise is randomly selected from a Gaussian distribution with `sigma` as the `standard deviation` and `zero mean`.

The first thing to remember when setting sigma is that a sigma that is very low compared to the published value adds little noise: the result is more accurate, but it costs more privacy budget and leaks more information about the private data. On the other hand, a sigma that is very high compared to the published value costs little privacy budget but can make the result too noisy to be useful.

Example: Let's assume the value being published is `100000`. Adding noise of `20` results in `100020`, which is insignificant noise by comparison and thus requires a large budget to be spent. Similarly, if the value being published is `0.1` and you add noise of `20`, the result is `20.1`, which is way off from the actual value and thus hurts the accuracy of the result, even though little `epsilon` is spent.
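One quick way to sanity-check a candidate `sigma` is to express it as a percentage of the value you expect to publish. The helper `relative_noise` below is a hypothetical convenience function, not part of `syft`; it only restates the two examples above as numbers.

```python
def relative_noise(value, sigma):
    """Noise scale as a percentage of the value being published."""
    if value == 0:
        raise ValueError("value must be non-zero")
    return abs(sigma / value) * 100

# Noise of 20 is negligible next to 100000 but overwhelming next to 0.1
print(f"{relative_noise(value=100_000, sigma=20):.4f} %")  # roughly 0.02 %
print(f"{relative_noise(value=0.1, sigma=20):.0f} %")      # roughly 20000 %
```

A percentage far below 1 suggests the noise barely masks the value and will be expensive in budget, while a percentage far above 100 suggests the published result will carry little usable signal.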