## Introduction

Welcome 👋!

First of all, thank you for helping the world become more privacy-friendly by using PySyft. This notebook is primarily intended for **data scientists**. It will guide you through the capabilities PySyft offers in the latest version, 0.8.

This tutorial will cover three parts:\
(1) 👉 Understanding the workflow\
(2) 👉 How to send your first code request\
(3) 👉 What queries are supported in syft 0.8

## Part 1 - Understanding the workflow

#### Getting started
You would like to pursue research in the field of algorithmic transparency to protect the rights of social media users. To enable such research, we have collaborated with private companies, which have made an initial set of datasets available via PySyft.

They will provide access to their data, as explained below. At this stage, you should have received a URL and credentials (email and password) from a designated data manager. A data manager is a company representative who will collaborate with you to provide access to the data.

Without the credentials, you cannot log in to the PySyft API or platform, and thus cannot carry out your research tasks. Please reach out to the data managers if you have not received login instructions.

#### The role of data managers
Data managers are PySyft users who act on behalf of companies and are responsible for their datasets. They can upload, edit and remove datasets. Most importantly, a data manager is the person who will share the login credentials with you in the first place, and who will approve your code requests (also called code queries or code submissions).

#### Code request
Once you receive your credentials and log in to the domain, you will be able to submit code queries, which will allow you to learn from the data. A code query can be seen as a function that returns a desired answer (for example, the average rating for a post type, or the highest/lowest value matching specific criteria). We'll see how to send code queries in part (2) of this tutorial.

#### However...

You won't directly have access to the private data. This is convenient for data managers, as they retain ownership of the data and prevent it from getting copied elsewhere. However, you will still be able to learn relevant statistical information from the dataset, even without having access to the data per se. This is good for data scientists like you, as you can still make use of the knowledge contained in the dataset and be able to advance your research. Win-win!

#### Real data and mock data

Each dataset exists in two versions:
- the real dataset, containing the true unaltered information
- a mock dataset, containing fake data generated to be of the same type and shape as the real one
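
To make "same type and shape" concrete, here is a minimal sketch in plain Python. It does not use the PySyft API, and how a data manager actually produces the mock data is up to them. The column names (`post_type`, `reactions`), the values, and the helper `make_mock` are all hypothetical, chosen only for illustration.

```python
import random

# Hypothetical "real" rows; in practice these stay private with the data manager.
real_rows = [
    {"post_type": "video", "reactions": 120},
    {"post_type": "photo", "reactions": 45},
    {"post_type": "video", "reactions": 300},
]

def make_mock(rows, seed=0):
    """Return fake rows with the same columns, value types, and row count."""
    rng = random.Random(seed)
    mock = []
    for row in rows:
        fake = {}
        for column, value in row.items():
            if isinstance(value, int):
                # Random integer, unrelated to the real value.
                fake[column] = rng.randint(0, 1000)
            else:
                # Random category, unrelated to the real value.
                fake[column] = rng.choice(["video", "photo", "text"])
        mock.append(fake)
    return mock

mock_rows = make_mock(real_rows)

# Same shape and schema, different content.
same_shape = len(mock_rows) == len(real_rows)
same_schema = all(
    m.keys() == r.keys() and all(type(m[k]) is type(r[k]) for k in r)
    for m, r in zip(mock_rows, real_rows)
)
print(same_shape, same_schema)
```

Because the mock rows share the columns and types of the real ones, code written against the mock should also run against the real data, even though the values it returns on the mock are meaningless.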

#### High-side domain and low-side domain

The real dataset will be kept private in what is called the **high-side domain**. Only the data manager and other company employees will have direct access to it; data scientists will not.

The mock dataset will be stored in the **low-side domain**, and you will have direct access to it via code requests.

#### Workflow altogether

Let's assume the data manager approved your project request to study the dataset, and you can now start learning from it via code requests.

**Decide on a research question**\
First, what question do you want to answer? Let's imagine that you want the average number of reactions for a social media post of a certain type.

**Prepare the query** (purple arrows)\
Once you know your goal, you will start by preparing the query. You will only work on the mock dataset, the one from the low-side domain. The dataset will look and feel like the real one, but it won't contain the true data. You can use it to write and test your query, making sure that the code runs without errors and returns an answer. You can edit and test your code query as many times as you want.
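
As a sketch of this step, the hypothetical query below computes the average number of reactions for one post type and is tried out on a small hand-written mock table. It is plain Python, not the PySyft submission API (that is covered in part (2)); the function name, column names, and values are assumptions made for illustration.

```python
from statistics import mean

# Hypothetical mock rows standing in for the low-side dataset.
mock_posts = [
    {"post_type": "video", "reactions": 10},
    {"post_type": "photo", "reactions": 4},
    {"post_type": "video", "reactions": 30},
]

def average_reactions(posts, post_type):
    """Average number of reactions over posts of the given type."""
    counts = [p["reactions"] for p in posts if p["post_type"] == post_type]
    if not counts:
        return None  # no matching posts; avoid raising on an empty selection
    return mean(counts)

# Try the query locally before submitting it. The value itself is meaningless
# (the data is fake), but a successful run shows the code executes end to end.
result = average_reactions(mock_posts, "video")
print(result)
```

Handling the empty case explicitly is a deliberate choice here: a query that raises on an unexpected input would also fail when the data manager runs it on the real data.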

**Submit the code request** (orange arrows)\
Once you are happy with the code query, you can send it to be run on the real data. This is how you will get an accurate (but still privacy-friendly) answer to your research question. Your code will be sent via the PySyft API directly to the data manager. The data manager will check your code to make sure it is safe to run on the real data and, if so, will execute it on the private data and generate the answer. This is why it's important that your code runs correctly when you submit the code request - otherwise, the data manager will also be unable to run it, and this will result in delays.

**Get the answer** (orange arrows)\
After the data manager runs the code, they will also inspect the answer to make sure that sharing it with you preserves privacy. If all is fine, they will send you the unaltered answer that your function returned on the private, real data. You will be notified when the answer is ready and will be able to retrieve it using the PySyft API.

All the technical details will be explained in this tutorial, in part (2).

*Note*: because getting the real answer involves manual checks and execution by the data manager, you often won't receive the answer right away; it might arrive hours or even days later.
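
The review steps described above can be summarized as a small toy model. This is only an illustration of the order of the checks, not PySyft code: `RequestStatus`, `review`, and the two callback parameters are hypothetical names, and the callbacks stand in for the data manager's human judgment.

```python
from enum import Enum

class RequestStatus(Enum):
    PENDING = "pending"    # submitted, waiting for the data manager
    REJECTED = "rejected"  # the code or the answer was judged unsafe
    APPROVED = "approved"  # the answer is released to the data scientist

def review(query, real_data, code_is_safe, answer_is_safe):
    """Toy version of the data manager's manual checks."""
    if not code_is_safe(query):
        return RequestStatus.REJECTED, None
    answer = query(real_data)  # executed on the private data only
    if not answer_is_safe(answer):
        return RequestStatus.REJECTED, None
    return RequestStatus.APPROVED, answer  # the unaltered answer is released

# Example: the query counts rows; this manager accepts any code
# and any integer answer.
status, answer = review(
    query=len,
    real_data=[{"reactions": 1}, {"reactions": 5}],
    code_is_safe=lambda q: True,
    answer_is_safe=lambda a: isinstance(a, int),
)
print(status.value, answer)
```

In this model a request can be rejected at two points: before the code runs (unsafe code) and after it runs (an answer that would leak private information).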

![Screenshot 2023-04-26 at 22.23.04.png](level-0.png)