## Introduction

Welcome 👋!

First of all, thank you for helping the world become more privacy-friendly by using PySyft. This notebook is primarily intended for **data scientists**, and it will guide you through the capabilities PySyft offers in the latest version, 0.8.

This tutorial will cover three parts:\
(1) 👉 Understanding the workflow\
(2) 👉 How to send your first code request\
(3) 👉 What queries are supported in Syft 0.8

## Part 1 - Understanding the workflow

#### Context
Let's imagine you want to research bubble tea trends because you want to open your own store. You found a dataset which contains the information you need to answer your questions. The dataset is owned by a company and maintained by Amie, one of their employees. For this scenario, Amie has the role of **data manager**.

#### The role of data managers
Data managers are PySyft users who are responsible for datasets. They can upload, edit and remove datasets. Most importantly, a data manager is the person who approves project requests and code submissions.

#### Project request
Once you find Amie's dataset listed on the platform, you can submit a project request asking for access to the data. In your project request, you should include the reasons why you need access to the data, and any other trustworthy information that proves you have good intentions, such as a university affiliation (this will increase your chances of getting your project request approved).

We'll see how to submit a project request in part (2) of the tutorial.

#### Code request
After your project request is approved, you will be able to submit code queries, which will allow you to learn from the data. A code query can be seen as a function which will give you a desired answer (for example, the average rating for a particular type of bubble tea drink, or the drink with the most sugar). We'll see how to send code queries in part (2) of this tutorial.

#### However...

You won't directly have access to the data stored in the dataset. This is convenient for data managers, as they retain ownership of the data and prevent it from getting copied elsewhere. However, you will still be able to learn relevant statistical information from the dataset, even without having access to the data per se. This is good for data scientists like you, as you can still make use of the knowledge contained in the dataset and be able to advance your research. Win-win!

#### Real data and mock data

This works because there are two datasets:
- the real dataset, containing the true unaltered information
- a mock dataset, containing fake data generated to be of the same type and shape as the real one
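To make "same type and shape" concrete, here is a minimal sketch in plain NumPy. The helper `make_mock_like` and the 1 to 5 rating range are hypothetical and only for illustration; how a data manager actually generates mock data is up to them.

```python
import numpy as np


def make_mock_like(real, seed=0):
    """Return random fake ratings with the same shape and dtype as `real`.

    Hypothetical helper: the values are random integers in 1..5 and are
    unrelated to the real data.
    """
    rng = np.random.default_rng(seed)
    fake = rng.integers(1, 6, size=real.shape)  # upper bound is exclusive
    return fake.astype(real.dtype)


# Stand-in for the private data (made-up values for this sketch).
real_ratings = np.array([[5, 4, 3], [2, 5, 4]], dtype=np.int64)
mock_ratings = make_mock_like(real_ratings)

print(mock_ratings.shape, mock_ratings.dtype)
```

Code that runs on `mock_ratings` should also run on `real_ratings`, because both arrays expose the same structure.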

#### High-side domain and low-side domain

The real dataset will be kept private in what is called the **high-side domain**. Only the data manager or other company employees will have direct access to it, not data scientists.

The mock dataset will be stored in the **low-side domain** and you will have direct access to it via code requests.

#### Workflow altogether

Let's assume Amie approved your project request to study the dataset, and now you can start learning from it via code requests.

**Decide on a research question**\
First, what question do you want to answer? Let's imagine that you want the average rating for the brown sugar bubble milk tea, i.e. how liked it was overall by all customers who bought it.

**Prepare the query** (purple arrows)\
Once you know your goal, you will start by preparing the query. You will only work on the mock dataset, the one from the low-side domain. The dataset will look and feel like the real one, but it won't contain the true data. You can use it to write and test your query, to make sure that the code runs properly and returns an answer. You can edit and test your code query as many times as you want.
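As an illustration, a query for the research question above could look like the sketch below. It uses plain NumPy on made-up mock arrays; the names `mock_drinks`, `mock_ratings` and `average_rating` are hypothetical, and the PySyft-specific steps are left to part (2).

```python
import numpy as np

# Hypothetical mock data: one entry per purchase.
mock_drinks = np.array(["brown sugar", "taro", "brown sugar", "matcha"])
mock_ratings = np.array([4.0, 3.0, 5.0, 2.0])


def average_rating(drinks, ratings, drink_name):
    """Average rating over all purchases of `drink_name`."""
    mask = drinks == drink_name  # boolean array, True where the drink matches
    selected = ratings[mask]     # keep only the matching ratings
    return selected.mean()


mock_answer = average_rating(mock_drinks, mock_ratings, "brown sugar")
print(mock_answer)
```

The number printed here is meaningless, since it comes from fake data. What matters at this stage is that the function runs without errors.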

**Submit the code request** (orange arrows)\
Once you are happy with the code query, you can send it to be run on the real data. This is how you will get an accurate (but still privacy-friendly) answer to your research question. Your code will be sent via the PySyft API directly to the data manager (Amie, in this case). The data manager will check your code to make sure it is safe to run on the real data and, if so, execute it on the private data to generate the answer. This is why it's important that your code runs correctly when you decide to submit the code request - otherwise, the data manager will also be unable to run it and this will result in delays.

**Get the answer** (orange arrows)\
After the data manager runs the code, they will also inspect the answer to make sure privacy is preserved if they share it with you. If all is fine, they will send you the unaltered answer which your function returned on the private, real data. You will be notified when the answer is ready and will be able to retrieve it using the PySyft API.

All the technical details will be explained in this tutorial, in part (2).

*Note*: because getting the real answer involves manual checks and execution by the data manager, the data scientist often won't receive the answer right away; it might arrive hours or even days later.

![Screenshot 2023-04-26 at 22.23.04.png](level-0.png)

## Part 3 - Queries supported in Syft 0.8

This part of the tutorial will briefly outline what is and is not supported when using Syft. There are some limitations, as shown below, which might result in errors when running the code request on the private data. This section will explain what to bear in mind when writing your code request, and how to adapt your code to the current Syft capabilities.

**What is supported for NumPy?**

- array creation
- computing statistical information: `mean`, `std`, `min`, `max` etc
- slicing and indexing
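For reference, the three categories above look like this in plain NumPy. The array values are made up, and the sketch assumes nothing about Syft beyond what the list states.

```python
import numpy as np

# Array creation
ratings = np.array([[4.0, 3.5, 5.0], [2.0, 4.5, 3.0]])

# Statistical information
mean_rating = ratings.mean()
std_rating = ratings.std()
lowest = ratings.min()
highest = ratings.max()

# Slicing and indexing
first_row = ratings[0]
last_column = ratings[:, -1]

print(mean_rating, std_rating, lowest, highest)
```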

Generally, the most common operations for NumPy are also supported when using Syft. But here are some things which you should keep in mind when writing a code request:

**[1] Converting objects**

`np.asarray(mock_object)`, `pd.DataFrame(mock_object)` and `torch.Tensor(mock_object)` will return objects which lose the Syft class associated with them, which is not desired. Avoid using them.

However, the array method `astype()` is safe to use.
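A short sketch of a dtype conversion with `astype()`, shown on a plain NumPy array with made-up values. That a Syft object behaves the same way is taken from the statement above, not verified here.

```python
import numpy as np

counts = np.array([3, 1, 4])

# astype() returns a converted copy and leaves `counts` unchanged.
as_float = counts.astype(np.float64)

print(as_float.dtype)
```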

**[2] Tuples**

Methods which return tuples are supported, but to avoid hard-to-debug issues it's recommended that you unpack the elements of the tuple into separate variables rather than continue working with the tuple as a whole.
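For instance, `np.unique(..., return_counts=True)` returns a tuple of two arrays. The sketch below unpacks it immediately and then works with each array on its own; the data is made up and plain NumPy is used.

```python
import numpy as np

drinks = np.array(["taro", "matcha", "taro", "brown sugar", "taro"])

# Unpack the tuple right away instead of keeping it in a single variable.
names, counts = np.unique(drinks, return_counts=True)

# Work with each array separately, one step per line.
top_index = counts.argmax()
most_popular = names[top_index]

print(most_popular)
```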

**[3] Methods that return new arrays**

NumPy methods that return new arrays (e.g. `np.pad()`), as well as methods that extract arrays (e.g. `np.diag()`), might cause issues for the reason explained in [1].

**[4] NaN values**

Pay attention to functions and operations which can return `NaN`, as this can lead to undefined behaviour.
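One defensive pattern is to detect and drop `NaN` values explicitly before computing a statistic. The sketch below shows it on a plain NumPy array with made-up values; whether each call is supported on a Syft object is an assumption that is best checked on the mock data first.

```python
import numpy as np

ratings = np.array([4.0, np.nan, 5.0, 3.0])

missing = np.isnan(ratings)  # True where a value is NaN
has_nan = missing.any()

valid = ratings[~missing]    # keep only the non-NaN entries
mean_valid = valid.mean()

print(has_nan, mean_valid)
```

Without the filtering step, `ratings.mean()` would itself be `NaN`.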

**[5] Metadata**

TODO - but this is bad.

**[6] Chaining operations, and using multiple operations in a line**

Using multiple functions on the same line might not work as expected. For example, if you want to calculate `(n - n.mean()) / n.std()` it is advisable to compute the `std` and the `mean` separately, and then to compute the final result using the variables with the values from the standard deviation and the mean. 

Similarly, chaining multiple functions one after another might cause an error. For example, if you want to filter an array and then compute the average, it is advisable to do this in two different steps, rather than do it in a one-liner.
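Both recommendations can be sketched as follows, using plain NumPy and made-up values. Each intermediate result gets its own variable, so that no line applies more than one operation.

```python
import numpy as np

n = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Standardization, split into separate steps instead of
# (n - n.mean()) / n.std() on a single line.
mean = n.mean()
std = n.std()
centered = n - mean
standardized = centered / std

# Filter first, then average, instead of n[n > 4.0].mean().
mask = n > 4.0
filtered = n[mask]
filtered_mean = filtered.mean()

print(standardized, filtered_mean)
```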

**[7] In-place modifications**

In-place modifications will cause a crash. Avoid patterns like the ones below:
```
mock_object += 1
np.add(mock_object, mock_object, out=mock_object)
np.subtract(mock_object, mock_object, out=mock_object)
np.multiply(mock_object, mock_object, out=mock_object)
np.divide(mock_object, mock_object, out=mock_object)
mock_object[mock_object > 0.5] *= 2
```
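A possible way around this is to assign each result to a new variable instead of writing into the existing object. The sketch below uses plain NumPy with made-up values. Note that `np.where` returns a new array, so, given the caveat in [3], its behaviour on Syft objects is an assumption worth testing on the mock data.

```python
import numpy as np

mock_object = np.array([0.2, 0.6, 0.9])

# Instead of `mock_object += 1`
plus_one = mock_object + 1

# Instead of `np.add(mock_object, mock_object, out=mock_object)`
doubled = np.add(mock_object, mock_object)

# Instead of `mock_object[mock_object > 0.5] *= 2`
mask = mock_object > 0.5
scaled_values = mock_object * 2
scaled = np.where(mask, scaled_values, mock_object)

print(plus_one, doubled, scaled)
```

In each case `mock_object` itself is left unmodified.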

**[8] NumPy subclasses**

For example, `np.random.shuffle(mock_object)` will show a warning and generate an incorrect result.

**[9] Record arrays**

Record arrays are not supported and using them will throw errors.