## Introduction

Welcome 👋!

First of all, thank you for helping the world become more privacy-friendly by using PySyft. This notebook is primarily intended for **data scientists**, and it will guide you through the capabilities that PySyft offers in its latest version, 0.8.

This tutorial will cover five parts:\
(1) 👉 Understanding the workflow\
(2) 👉 Understanding the dataset\
(3) 👉 Writing a code request\
(4) 👉 Sending a code request\
(5) 👉 What queries are supported in syft 0.8

## Part 1 - Understanding the workflow

#### Getting started
Suppose you would like to pursue research in the field of algorithmic transparency to protect the rights of social media users. We have collaborated with private companies, which have made an initial set of datasets available via PySyft to enable such research.

They will provide access to their data, as explained below. At this stage, you should have received a URL and credential information (email and password) from a designated data manager. A data manager is a company representative who will collaborate with you to provide access to the data.

Without the credentials, you cannot log in to the PySyft API or platform, and thus cannot carry out your research tasks. Please reach out to the data managers if you have not received instructions about logging in.

#### The role of data managers
Data managers are PySyft users who act on behalf of companies and are responsible for the datasets. They can upload, edit and remove datasets. Most importantly, a data manager is the person who will share the login credentials with you in the first place, and who will approve your code requests (also called code queries or code submissions).

#### Code request
Once you receive your credentials to log into the domain, you will be able to submit code queries, which will allow you to learn from the data. A code query can be seen as a function which gives you a desired answer (for example, the average rating for a post type, or the highest/lowest value matching specific criteria). We'll see how to write and send code queries in parts (3) and (4) of this tutorial.

#### However...

You won't directly have access to the private data. This is convenient for data managers, as they retain ownership of the data and prevent it from getting copied elsewhere. However, you will still be able to learn relevant statistical information from the dataset, even without having access to the data per se. This is good for data scientists like you, as you can still make use of the knowledge contained in the dataset and be able to advance your research. Win-win!

#### Real data and mock data

There are two datasets:
- the real dataset, containing the true, unaltered information
- a mock dataset, containing fake data generated to have the same type and shape as the real one
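
To make "same type and shape" concrete, here is a minimal sketch of how a mock table could be derived from a real one. It uses plain pandas and NumPy rather than PySyft; the helper `make_mock`, the example values, and the choice of random fake values are all assumptions for illustration, and a data manager may generate mock data quite differently.

```python
import numpy as np
import pandas as pd

# Hypothetical "real" data; in practice it never leaves the data owner.
real_df = pd.DataFrame(
    {
        "drink_name": ["brown_sugar", "thai_milk", "mango_coconut"],
        "sugar_level": [0.7, 0.3, 0.9],
        "has_pearls": [True, True, False],
    }
)


def make_mock(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Return fake data with the same columns, dtypes and shape as df."""
    rng = np.random.default_rng(seed)
    n = len(df)
    fake = {}
    for col in df.columns:
        # Check booleans first: pandas also treats them as numeric.
        if pd.api.types.is_bool_dtype(df[col]):
            fake[col] = rng.integers(0, 2, size=n).astype(bool)
        elif pd.api.types.is_numeric_dtype(df[col]):
            fake[col] = rng.uniform(0.0, 1.0, size=n).astype(df[col].dtype)
        else:
            # Any other column gets placeholder strings.
            fake[col] = [f"fake_{i}" for i in range(n)]
    return pd.DataFrame(fake, columns=df.columns)


example_mock = make_mock(real_df)
print(example_mock.dtypes)
```

Code written against `example_mock` would also run on `real_df`, because the column names and types match, yet none of the real values appear in the mock.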

#### High-side domain and low-side domain

The real dataset will be kept private in what is called the **high-side domain**. Only the data manager or other company employees will have direct access to it, not data scientists.

The mock dataset will be stored in the **low-side domain**, and you will have direct access to it via code requests.

#### Workflow altogether

Let's assume the data manager has approved your project request to study the dataset, and you can now start learning from it via code requests.

**Decide on a research question**\
First, what question do you want to answer? Let's imagine that you want the average number of reactions for a social media post of a certain type.

**Prepare the query** (purple arrows)\
Once you know your goal, you will start by preparing the query. You will only work on the mock dataset, the one from the low-side domain. The dataset will look and feel like the real one, but it won't contain the true data. You can use it to write and test your query, to make sure that the code runs without errors and returns an answer. You can edit and test your code query as many times as you want.

**Submit the code request** (orange arrows)\
Once you are happy with the code query, you can send it to be run on the real data. This is how you will get an accurate (but still privacy-friendly) answer to your research question. Your code is sent via the PySyft API directly to the data manager. The data manager checks your code to make sure it is safe to run on the real data and, if so, executes it on the private data and generates the answer. This is why it's important that your code runs correctly when you submit the code request: otherwise, the data manager will also be unable to run it, which will result in delays.

**Get the answer** (orange arrows)\
After the data manager runs the code, they will also inspect the answer to make sure privacy is preserved if they share it with you. If all is fine, they will send you the unaltered answer which your function returned on the private, real data. You will be notified when the answer is ready and will be able to retrieve it using the PySyft API.

All the technical details will be explained in parts (3) and (4) of this tutorial.

*Note*: because getting the real answer involves manual checks and execution by the data manager, you will often not receive the answer right away; it may arrive hours or even days later.

![Screenshot 2023-04-26 at 22.23.04.png](level-0.png)
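
The three steps above can be mimicked in a few lines of plain Python. This is only a toy simulation, not PySyft code: the two DataFrames, the `request` dictionary and the `data_manager_review` function are hypothetical stand-ins for the low-side data, the high-side data, a code request and the manual review.

```python
import pandas as pd

# Hypothetical stand-ins for the two copies of the data.
mock_data = pd.DataFrame({"reactions": [1, 2, 3]})     # low side: fake values
real_data = pd.DataFrame({"reactions": [40, 50, 60]})  # high side: private values


def average_reactions(df: pd.DataFrame) -> float:
    """The query the data scientist wants answered."""
    return float(df["reactions"].mean())


# 1. Prepare the query: test it on the mock data only.
mock_answer = average_reactions(mock_data)

# 2. Submit the code request: it waits until the data manager reviews it.
request = {"func": average_reactions, "status": "submitted", "result": None}


def data_manager_review(req: dict, approve: bool) -> None:
    """Inspect the request and, if approved, run the code on the real data."""
    if approve:
        req["result"] = req["func"](real_data)
        req["status"] = "approved"
    else:
        req["status"] = "rejected"


data_manager_review(request, approve=True)

# 3. Get the answer: only the result is released, never the real rows.
real_answer = request["result"]
print(mock_answer, real_answer)  # 2.0 50.0
```

The data scientist's code only ever touches `mock_data`; `real_data` is read solely inside the review step, which is the separation the diagram describes.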

## Part 2 - Understanding the dataset

Many of the technical solutions we use on a day-to-day basis (including social media platforms) rely on AI algorithms for the decisions they make. As such, it is increasingly important to be able to question, understand and explain how these algorithms work (algorithmic transparency), and to be able to answer ethical questions (for example, to prove or disprove that an AI algorithm shows signs of bias towards a minority group, in order to mitigate the risks posed to that group). To achieve this, we need relevant data gathered on a topic, and skilled scientists to draw conclusions from it.

In writing this tutorial, we assume the motivation of **advancing algorithmic transparency** in this research project, and as such, the dataset will be relevant for this purpose.

The inspiration for this project is a study run by Twitter to examine the effects of machine learning on the amplification of political content from elected officials or news outlets. Their paper can be found [here](https://www.pnas.org/doi/full/10.1073/pnas.2025334119).

**Study summary**\
Personalization algorithms play an important role in how the content on the Twitter timeline is selected and ordered, which can lead to amplification or suppression of certain messages. To assess whether some political groups benefit from this more than others and whether this results in a preferential treatment due to algorithm internals rather than the interactions, Twitter ran an in-depth study comparing the algorithmically ranked Home timeline versus the reverse chronological Home timeline.

The study was carried out from March to June 2020 and targeted only 5% of Twitter users, as follows:
- 4% of the users were placed in the **treatment group**, meaning that they experienced the **algorithmically ranked Home timeline**
- 1% of the users were placed in the **control group**, meaning that they experienced the **reverse chronological Home timeline**
- the remaining 95% of Twitter users were not part of the study
- the users placed in the treatment and in the control group were randomly selected
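
As an illustration of the random assignment described above, the sketch below draws a bucket for each user with the stated probabilities. The population size, the seed and the bucket labels are assumptions made for the example; the study's actual assignment mechanism is not described here.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n_users = 100_000  # hypothetical population size

# Each user independently lands in one bucket with the stated probabilities.
buckets = rng.choice(
    ["treatment", "control", "not_in_study"],
    size=n_users,
    p=[0.04, 0.01, 0.95],
)

# Share of users that ended up in each bucket.
labels, counts = np.unique(buckets, return_counts=True)
shares = dict(zip(labels.tolist(), (counts / n_users).tolist()))
print(shares)
```

Because the assignment is random, the two groups should differ only in the timeline they see, which is what allows the impressions in the two buckets to be compared.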

The dataset has the following form:
| Tweet ID | Link      | Impression | Bucket    | ... |
|----------|-----------|------------|-----------|-----|
| id1      | news.ca/a | 100        | Control   |     |
| id2      | news.ca/a | 150        | Treatment |     |
| id3      | news.ca/b | 205        | Control   |     |
| id4      | news.ca/b | 178        | Treatment |     |
| ...      |           |            |           |     |

It records the number of impressions (in this case, the number of views) for a specific post, both from the control group and from the treatment group.

Of course, the idea of the study can be extended to any type of data. In this example, we can replace `tweets` with any other object recommended to the user, such as pictures, videos, reels, items to purchase etc.

Next, we will work with an imaginary dataset to show the abilities of the PySyft library. Following the Twitter study above, let's imagine that there exists a social media platform where users can only post images which contain bubble tea drinks. Each picture has a link that takes the user to an online bubble tea store, where they can order the particular drink featured in the image. Impressions, in this example, are measured by the number of clicks on the link in each picture.

The dataset has the following form:
| PostId | Link            | Impression | Bucket    | ... |
|--------|-----------------|------------|-----------|-----|
| id1    | bt/brown-sugar1 | 10         | Control   |     |
| id2    | bt/brown-sugar1 | 28         | Treatment |     |
| id3    | bt/oolong1      | 21         | Control   |     |
| id4    | bt/oolong1      | 4          | Treatment |     |
| ...    |                 |            |           |     |

You are trying to identify whether the order in which the images are shown influences the purchasing decisions of the users. Just like in the Twitter example, there is a control group (chronological order) and a treatment group (using a ranking algorithm).
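
A first look at this question could compare impressions between the two buckets. The sketch below rebuilds the four example rows from the table above as a pandas DataFrame (the values are the illustrative ones from the table, not real measurements) and computes the average per bucket as well as the per-link difference.

```python
import pandas as pd

# The four illustrative rows from the table above.
posts = pd.DataFrame(
    {
        "PostId": ["id1", "id2", "id3", "id4"],
        "Link": ["bt/brown-sugar1", "bt/brown-sugar1", "bt/oolong1", "bt/oolong1"],
        "Impression": [10, 28, 21, 4],
        "Bucket": ["Control", "Treatment", "Control", "Treatment"],
    }
)

# Average number of impressions in each bucket.
mean_by_bucket = posts.groupby("Bucket")["Impression"].mean()
print(mean_by_bucket)

# One row per link, one column per bucket, plus the treatment effect per link.
per_link = posts.pivot(index="Link", columns="Bucket", values="Impression")
per_link["Difference"] = per_link["Treatment"] - per_link["Control"]
print(per_link)
```

With only four rows the numbers carry no statistical meaning; the point is the shape of the analysis, which stays the same on a larger dataset.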

You have access to this dataset on the low-side domain hosted by the company. Their private data is on the high-side domain, as explained before, so for now you will work with mock data. We will use a small mock example containing only a few rows and four columns.

## Part 3 - Writing a code request

You have access to a dataset (with fake data) on which you will prepare the query. Below are the steps to follow.

#### Logging in

In [1]:
import syft as sy
import pandas as pd



In [2]:
# replace the email and password with your own credentials
guest_domain_client = sy.login(url='http://localhost:8082', email='ana.banana@uni.org', password='student')

Logged into Trusting Silver as <ana.banana@uni.org>


In [3]:
guest_domain_client

<SyftClient - Trusting Silver <d87e4efcc9e8494b826df57b8670a2b3>: HTTPConnection: http://localhost:8082>

#### Access the dataset

In [6]:
# you can see a list of all available datasets! For now, it will be just one.
guest_domain_client.datasets

|   | type | id | name | url |
|---|------|----|------|-----|
| 0 | syft.service.dataset.dataset.Dataset | 3d404c84206b40b4a93f8d1c8ad0dd3a | Bubble tea mock data |  |


In [8]:
# you can directly access one of the datasets in the list, let's save it in a variable
dataset = guest_domain_client.datasets[0]

In [9]:
# the pointer to the dataset
dataset

```python
Syft Dataset: Bubble tea mock data
Assets:
	mock_bubble_tea_data: Mock data for bubble tea consumption
Citation: Carmen Popa
Description: The fake data to show bubble tea trends for consumers

```

In [10]:
# check out the assets of the dataset; in the list below, there is just one asset available
dataset.assets

|   | key | type | id |
|---|-----|------|----|
| 0 | mock_bubble_tea_data | syft.service.dataset.dataset.Asset | dfbfa3dc97c241d6ba3e4496dc2835fb |


In [74]:
# TODO Q1

In [29]:
print(type(dataset))
print(type(dataset.assets[0]))
print(type(dataset.assets[0].mock))
print(type(dataset.assets[0].mock.syft_action_data))

<class 'syft.service.dataset.dataset.Dataset'>
<class 'syft.service.dataset.dataset.Asset'>
<class 'syft.service.action.pandas.PandasDataFrameObject'>
<class 'pandas.core.frame.DataFrame'>


In [32]:
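# keep a handle on the mock object; `mock` is used in later cells, but the cell defining it is not shown, so this definition is an assumption
mock = dataset.assets[0].mock
# let's access the underlying pandas DataFrame
mock_df = mock.syft_action_data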

You can work with this object like with any other dataframe. You can access the shape, print some rows, get statistical results on the columns, split the dataset etc.

In [37]:
mock_df.shape

(4, 4)

In [38]:
mock_df.head()

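|   | drink_name | sugar_level | has_pearls | customer_ratings |
|---|------------|-------------|------------|------------------|
| 0 | brown_sugar | 0.7 | True | 4.9 |
| 1 | thai_milk | 0.3 | True | 4.6 |
| 2 | mango_coconut | 0.9 | False | 3.5 |
| 3 | strawberry_cheese | 0.5 | True | 4.3 |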


In [39]:
print('Maximum sugar level:', mock_df["sugar_level"].max())

Maximum sugar level: 0.9


In [40]:
mock_df[0:2]

|   | drink_name | sugar_level | has_pearls | customer_ratings |
|---|------------|-------------|------------|------------------|
| 0 | brown_sugar | 0.7 | True | 4.9 |
| 1 | thai_milk | 0.3 | True | 4.6 |


#### Query for mock data

Now that you can access the dataset, let's try to answer a question based on the data.

Q: what is the average sugar level of the drinks?

In [41]:
# check the result on the mock dataset
mock_df['sugar_level'].mean()

0.6

Next, let's write the function which will be tested on the mock data and run on the real data to provide the accurate answer. 

The template below is a good start for this function. It has a **syft annotation**, which specifies the **input policy** and the **output policy**. These will be explained later. TODO Q2 Q3 Q4

In [59]:
@sy.syft_function(input_policy=sy.ExactMatch(bubble_tea_data=mock),
                  output_policy=sy.SingleExecutionExactOutput())
def sugar_level_query(bubble_tea_data):
    # imports and general info for each query
    import pandas as pd
    from opendp.mod import enable_features
    enable_features('contrib')
    from opendp.measurements import make_base_laplace
    aggregate = 0.
    base_lap = make_base_laplace(scale=5.)
    noise = base_lap(aggregate)

    # customize your query here
    df = bubble_tea_data
    sugar_level_avg = df['sugar_level'].mean()
    return (float(sugar_level_avg), float(noise))

In [60]:
result = sugar_level_query(bubble_tea_data=mock)
result

(0.6, 1.5430206738337342)
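
The second element of the returned tuple is a random draw, so it will differ on every run. Judging from the call in the template, `noise` is a sample of Laplace noise centred on 0 with scale 5. The sketch below uses NumPy as a stand-in for OpenDP (an assumption made to keep the example self-contained) to show what such draws look like in aggregate.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
scale = 5.0  # same scale as in the template above

# Many independent draws of Laplace noise centred on 0.
samples = rng.laplace(loc=0.0, scale=scale, size=100_000)

# The draws average out to roughly 0, and their typical magnitude
# (mean absolute value) is roughly equal to the scale.
mean_noise = samples.mean()
mean_abs_noise = np.abs(samples).mean()
print(round(mean_noise, 2), round(mean_abs_noise, 2))
```

A larger scale spreads the draws further from 0. Note that the template returns the noise next to the average rather than adding it to the average.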

#### [Good practice] exploring the function before sending it

In [61]:
from syft.client.api import NodeView
node_view = NodeView.from_api(guest_domain_client.api)
assert node_view in sugar_level_query.kwargs

In [62]:
node_view

NodeView(node_name='Trusting Silver', verify_key=04137bfea77b4c86492df2b9849190bd4a19b7a2833b3c500dd37fef6957615c)

In [63]:
sugar_level_query.input_policy_type

syft.service.policy.policy.ExactMatch

In [64]:
sugar_level_query.output_policy_type

syft.service.policy.policy.OutputPolicyExecuteOnce

In [65]:
print(sugar_level_query.code)

@sy.syft_function(input_policy=sy.ExactMatch(bubble_tea_data=mock),
                  output_policy=sy.SingleExecutionExactOutput())
def sugar_level_query(bubble_tea_data):
    # imports and general info for each query
    import pandas as pd
    from opendp.mod import enable_features
    enable_features('contrib')
    from opendp.measurements import make_base_laplace
    aggregate = 0.
    base_lap = make_base_laplace(scale=5.)
    noise = base_lap(aggregate)

    # customize your query here
    df = bubble_tea_data
    sugar_level_avg = df['sugar_level'].mean()
    return (float(sugar_level_avg), float(noise))



## Part 4 - Sending a code request

Now that you have written a code request, the next step is to send it. This involves creating a project first, where you explain the purpose of your research. TODO Q5

In [70]:
new_project = sy.Project(name="Bubble tea average sugar level")

Add a description to the project. It's important to be specific and clear here, and even to include relevant links that help the approver get as much insight as possible into your research question and intention. A well-informed data manager is more likely to approve your request.

In [71]:
proj_desc = """Hello, I want to explore the average sugar level in bubble tea drinks from the dataset. This will help me see how ..."""
new_project.set_description(proj_desc)

In [73]:
# Send the code to the guest domain! Use the function name as the parameter (e.g. sugar_level_query) TODO Q6
guest_domain_client.api.services.code.request_code_execution(sugar_level_query)

In [75]:
# You can check all the previous code submissions you've made to this domain.
guest_domain_client.code

|   | type | id | status | service_func_name |
|---|------|----|--------|-------------------|
| 0 | syft.service.code.user_code.UserCode | 3b72f052e2a440179cecab37e723af6b | {NodeView(node_name='Modest Karp', verify_key=... | most_liked_bubble_tea |
| 1 | syft.service.code.user_code.UserCode | 65881481da1b4bb48749e5619cd3f043 | {NodeView(node_name='Trusting Silver', verify_... | sugar_level_query |


In [76]:
# Pick a specific code submission by the index and inspect it
submitted_code = guest_domain_client.code[0]
submitted_code

```python
class UserCode:
  id: str = 3b72f052e2a440179cecab37e723af6b
  node_uid: str = None
  user_verify_key: str = 4ea8053fe7417899e19a84448a46ffceed451d06ed90eef0bc5b1f2aa9e10143
  raw_code: str = "@sy.syft_function(input_policy=sy.ExactMatch(drink_data=mock),
                  output_policy=sy.SingleExecutionExactOutput())
def most_liked_bubble_tea(drink_data):
    import pandas as pd
    from opendp.mod import enable_features
    enable_features('contrib')
    from opendp.measurements import make_base_laplace
    aggregate = 0.
    base_lap = make_base_laplace(scale=5.)
    noise = base_lap(aggregate)

    # your code:
    df = drink_data
    most_liked_drink_row = df.iloc[mock_df['customer_ratings'].argmax()]
    return (# str(most_liked_drink_row["drink_name"]),
            float(most_liked_drink_row["customer_ratings"]),
            float(noise)
           )
"
  input_policy_type: str = <class 'syft.service.policy.policy.ExactMatch'>
  input_policy_init_kwargs: str = {NodeView(node_name='Modest Karp', verify_key=04137bfea77b4c86492df2b9849190bd4a19b7a2833b3c500dd37fef6957615c): {'drink_data': <UID: 4d6b7de22de8464ea7d53ec6a1ee5788>}}
  input_policy_state: str = b''
  output_policy_type: str = <class 'syft.service.policy.policy.OutputPolicyExecuteOnce'>
  output_policy_init_kwargs: str = {}
  output_policy_state: str = b''
  parsed_code: str = "def user_func_most_liked_bubble_tea_4ea8053fe7417899e19a84448a46ffceed451d06ed90eef0bc5b1f2aa9e10143_b4bd04f2a5bd30a85536f05f53741ceb9121c015867cf1fc684d3217df0b60d4(drink_data):

    def most_liked_bubble_tea(drink_data):
        import pandas as pd
        from opendp.mod import enable_features
        enable_features('contrib')
        from opendp.measurements import make_base_laplace
        aggregate = 0.0
        base_lap = make_base_laplace(scale=5.0)
        noise = base_lap(aggregate)
        df = drink_data
        most_liked_drink_row = df.iloc[mock_df['customer_ratings'].argmax()]
        return (float(most_liked_drink_row['customer_ratings']), float(noise))
    result = most_liked_bubble_tea(drink_data=drink_data)
    return result"
  service_func_name: str = "most_liked_bubble_tea"
  unique_func_name: str = "user_func_most_liked_bubble_tea_4ea8053fe7417899e19a84448a46ffceed451d06ed90eef0bc5b1f2aa9e10143_b4bd04f2a5bd30a85536f05f53741ceb9121c015867cf1fc684d3217df0b60d4"
  user_unique_func_name: str = "user_func_most_liked_bubble_tea_4ea8053fe7417899e19a84448a46ffceed451d06ed90eef0bc5b1f2aa9e10143"
  code_hash: str = "b4bd04f2a5bd30a85536f05f53741ceb9121c015867cf1fc684d3217df0b60d4"
  signature: str = (drink_data)
  status: str = {NodeView(node_name='Modest Karp', verify_key=04137bfea77b4c86492df2b9849190bd4a19b7a2833b3c500dd37fef6957615c): <UserCodeStatus.SUBMITTED: 'submitted'>}
  input_kwargs: str = ['drink_data']
  enclave_metadata: str = None

```

In [79]:
new_project.add_request(obj=submitted_code, permission=sy.UserCodeStatus.EXECUTE)

In [78]:
new_project

```python
class ProjectSubmit:
  id: str = None
  name: str = "Bubble tea average sugar level"
  description: str = "Hello, I want to explore the average sugar level in bubble tea drinks from the dataset. This will help me see how ..."
  changes: str = [syft.service.project.project.ObjectPermissionChange]

```

In [80]:
new_project.changes

|   | type | id |
|---|------|----|
| 0 | syft.service.project.project.ObjectPermissionC... | 4c9f96d1849c403bb7347f36e2f15585 |
| 1 | syft.service.project.project.ObjectPermissionC... | 3c021ffd169348c9a070cfb953e84b53 |


In [84]:
# Submit the project
guest_domain_client.submit_project(new_project)

#### Receiving an answer to your request

Below we'll see how you can obtain an answer to your request. As mentioned before, the code is inspected by a data manager and then run on the real data. Because this involves manual work, it can take a while, sometimes up to a few days.

There are three possible answers which you can get:
- code still waiting for approval:
        `SyftNotReady: Your code is waiting for approval: {NodeView(node_name='Trusting Silver', verify_key=04137bfea77b4c86492df2b9849190bd4a19b7a2833b3c500dd37fef6957615c): }`
- code approved: TODO
- code rejected: TODO
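
Since approval can take a while, you may want to check back periodically instead of calling by hand. The helper below is a generic sketch in plain Python, not part of PySyft: `wait_for_result`, `fetch` and `is_pending` are hypothetical names, and the fake responses only imitate the "waiting for approval" message shown above. How the pending state actually surfaces in your client (returned object or raised error) should be checked before adapting it.

```python
import time
from typing import Any, Callable


def wait_for_result(
    fetch: Callable[[], Any],
    is_pending: Callable[[Any], bool],
    attempts: int = 5,
    delay_seconds: float = 0.0,
) -> Any:
    """Call fetch() until the response is no longer pending or attempts run out."""
    response = fetch()
    for _ in range(attempts - 1):
        if not is_pending(response):
            break
        time.sleep(delay_seconds)
        response = fetch()
    return response


# Stand-in for the domain: pending twice, then a placeholder answer.
fake_responses = iter(
    [
        "Your code is waiting for approval",
        "Your code is waiting for approval",
        "placeholder answer",
    ]
)

result = wait_for_result(
    fetch=lambda: next(fake_responses),
    is_pending=lambda r: isinstance(r, str) and "waiting for approval" in r,
)
print(result)  # placeholder answer
```

In real use, `delay_seconds` would be minutes or hours rather than zero, given that the review is manual.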

In [None]:
guest_domain_client.api.services.code[-1].output_policy # TODO Q7

In [86]:
# To check your result, access the asset again
asset = guest_domain_client.datasets[0].assets[0]
asset

```python
Asset: mock_bubble_tea_data
Pointer Id: 4d6b7de22de8464ea7d53ec6a1ee5788
Description: Mock data for bubble tea consumption
Total Data Subjects: 0
Shape: (4, 4)
Contributors: 0

```

In [91]:
# Then call the function you submitted, passing the asset as the parameter; the response reflects the current status of your request TODO Q8
real_result = guest_domain_client.api.services.code.sugar_level_query(bubble_tea_data=asset)
real_result