## Introduction

Welcome 👋!

First of all, thank you for helping the world become more privacy-friendly by using PySyft. This notebook is primarily intended for **data scientists**, and it will guide you through the capabilities PySyft has to offer in the latest version, 0.8.

This tutorial will cover five parts:\
(1) 👉 Understanding the workflow\
(2) 👉 Understanding the dataset\
(3) 👉 Writing a code request\
(4) 👉 Sending a code request\
(5) 👉 What queries are supported in Syft 0.8

## Part 1 - Understanding the workflow

#### Getting started

You would like to pursue research in the field of algorithmic transparency to protect the rights of social media users. We collaborated with private companies that made an initial set of datasets available via PySyft to enable such research.

They will provide access to their data, as explained below. At this stage, you should have received a URL and credential information (email and password) from a designated data manager. A data manager is a company representative who will collaborate with you to provide access to the data.

Without the credentials, you cannot log in to the PySyft API or platform, and thus cannot carry out your research tasks. Please reach out to the data managers if you have not received instructions about logging in.

#### The role of data managers
Data managers are PySyft users who act on behalf of companies and are responsible for datasets. They can upload, edit and remove datasets. Most importantly, a data manager is the person who will share the login credentials with you in the first place, and who will approve your code requests (also named code queries, or code submissions).

#### Code request
Once you receive your credentials to log into the domain, you will be able to submit code queries, which will allow you to learn from the data. A code query can be seen as a function which will give you a desired answer (for example, the average rating for a post type, the highest/lowest value with specific criteria etc). We'll see how to work with code queries in parts 3 and 4 of this tutorial.

#### However...

You won't directly have access to the private data. This is convenient for data managers, as they retain ownership of the data and prevent it from getting copied elsewhere. However, you will still be able to learn relevant statistical information from the dataset, even without having access to the data per se. This is good for data scientists like you, as you can still make use of the knowledge contained in the dataset and be able to advance your research. Win-win!

#### Real data and mock data

How it works is that there will be two datasets:
- the real dataset, containing the true unaltered information
- a mock dataset, containing fake data generated to be of the same type and shape to the real one
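To make the relationship between the two datasets concrete, here is a minimal sketch of a mock table that mirrors a real one. The values, the names `real_df` and `mock_df`, and the way the fake values are generated are all assumptions made for illustration; a data manager may produce mock data in a different way.

```python
import random

import pandas as pd

# Hypothetical "real" data; in practice this never leaves the data owner.
real_df = pd.DataFrame(
    {
        "PostId": ["id1", "id2", "id3", "id4"],
        "Impression": [120, 275, 181, 49],
        "Bucket": ["Control", "Treatment", "Control", "Treatment"],
    }
)

# Build a mock table with the same columns, dtypes and number of rows,
# but with made-up values instead of the true ones.
rng = random.Random(0)
n_rows = len(real_df)
mock_df = pd.DataFrame(
    {
        "PostId": [f"id{i + 1}" for i in range(n_rows)],
        "Impression": [rng.randint(0, 300) for _ in range(n_rows)],
        "Bucket": [rng.choice(["Control", "Treatment"]) for _ in range(n_rows)],
    }
)

# The mock has the same type and shape as the real data.
print(list(mock_df.columns) == list(real_df.columns))
print(mock_df.shape == real_df.shape)
print((mock_df.dtypes == real_df.dtypes).all())
```

Code that only relies on the column names and types will behave the same way on both tables, which is what makes the mock useful for preparing a query.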

#### High-side domain and low-side domain

The real dataset will be kept private in what is called the **high-side domain**. Only the data manager or other company employees will have direct access to it, and not data scientists.

The mock dataset will be stored in the **low-side domain**, and you will have direct access to it via code requests.

#### Workflow altogether

Let's assume the data manager approved your project request to study the dataset, and now you can start learning from it via code requests.

**Decide on a research question**\
First, what question do you want to answer? Let's imagine that you want the average number of reactions for a social media post of a certain type.

**Prepare the query** (purple arrows)\
Once you know your goal, you will start by preparing the query. You will only work on the mock dataset, the one from the low-side domain. The dataset will look and feel like the real one, but it won't contain the true data. You can make use of this to write and test your query, to make sure that the code compiles properly and returns an answer. You can edit and test your code query locally as many times as you want.

**Submit the code request** (orange arrows)\
Once you are happy with the code query, then you can send it to be run on the real data. This is how you will get an accurate (but still privacy-friendly) answer for your research question. What happens here is that your code will be sent via PySyft API directly to the data manager. The data manager will check your code, make sure it is safe to be run on the real data, and if so, then they will execute your code on the private data, and generate the answer. This is why it's important that your code runs correctly when you decide to submit the code request - otherwise, the data manager will also be unable to run it and this will result in delays. 

**Get the answer** (orange arrows)\
After the data manager runs the code, they will also inspect the answer to make sure privacy is preserved if they share it with you. If all is fine, then they will send you the unaltered answer which your function returned on the private, real data this time. You will be notified when the answer is ready and will be able to retrieve this using the PySyft API.

All the technical details will be explained in this tutorial, in parts 3 and 4.

**Important note**: Because getting the real answer involves manual checks and execution by the data manager, the data scientist often won't receive the answer right away; it might arrive hours or even days later. We know that this is not ideal, but as privacy-enhancing techniques become more mature, and as your organisation trusts them more and more, the whole workflow will simplify and the number of steps between your query and the result will decrease. This advanced level of accessibility and convenience in doing remote data science, while maintaining privacy, is at the core of OpenMined's mission. This is what we are pushing for, and, right now, you are a pioneer of this technology, helping us bring it forward through your work.

![Screenshot 2023-04-26 at 22.23.04.png](level-0.png)

## Part 2 - Understanding the dataset

So many of the technical solutions which we use on a day to day basis (including social media platforms) make use of AI algorithms in the decisions that they make. As such, it is increasingly important to be able to question, understand and explain how these algorithms work (algorithmic transparency), and to be able to answer ethical questions (for example, to prove or disprove if an AI algorithm shows any signs of bias towards a minority group, in order to mitigate the risks posed for that group). But to achieve this, we need relevant data gathered on a topic, and skilled scientists to derive conclusions out of it.

In writing this tutorial, we assume the motivation of **advancing algorithmic transparency** in this research project, and as such, the dataset will be relevant for this purpose.

The inspiration for this project is a study run by Twitter to examine the effects of machine learning on the amplification of political content from elected officials or news outlets. Their paper can be found [here](https://www.pnas.org/doi/full/10.1073/pnas.2025334119).

**Study summary**\
Personalization algorithms play an important role in how the content on the Twitter timeline is selected and ordered, which can lead to amplification or suppression of certain messages. To assess whether some political groups benefit from this more than others and whether this results in a preferential treatment due to algorithm internals rather than the interactions, Twitter ran an in-depth study comparing the algorithmically ranked Home timeline versus the reverse chronological Home timeline.

The study was carried out from March to June 2020 and targeted only 5% of Twitter users, as follows:
- 4% of the users were placed in the **treatment group**, meaning that they experienced the **algorithmically ranked Home timeline**
- 1% of the users were placed in the **control group**, meaning that they experienced the **reverse chronological Home timeline**
- the remaining 95% of the Twitter users were not part of the study
- the users placed in the treatment and in the control group were randomly selected
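The random split described above can be sketched with the standard library. This is only an illustration of the proportions (4% treatment, 1% control, 95% not in the study); the function name `assign_bucket` and the simulated population size are assumptions, and this is not the procedure Twitter actually used.

```python
import random
from collections import Counter


def assign_bucket(rng: random.Random) -> str:
    """Randomly place one user in a bucket with the proportions from the study."""
    draw = rng.random()  # uniform in [0, 1)
    if draw < 0.04:
        return "Treatment"  # algorithmically ranked Home timeline
    if draw < 0.05:
        return "Control"  # reverse chronological Home timeline
    return "Not in study"


n_users = 100_000  # hypothetical population size
rng = random.Random(42)  # fixed seed so the sketch is reproducible
counts = Counter(assign_bucket(rng) for _ in range(n_users))

for bucket, count in counts.most_common():
    print(f"{bucket}: {count / n_users:.1%}")
```

Because the assignment is random, the two groups are comparable, and differences in impressions can be attributed to the timeline type rather than to who was placed in each group.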

The dataset has the following form:
| Tweet ID | Link      | Impression | Bucket    | ... |
|----------|-----------|------------|-----------|-----|
| id1      | news.ca/a | 100        | Control   |     |
| id2      | news.ca/a | 150        | Treatment |     |
| id3      | news.ca/b | 205        | Control   |     |
| id4      | news.ca/b | 178        | Treatment |     |
| ...      |           |            |           |     |

It records the number of impressions (in this case, the number of views) for a specific post, both from the control group and from the treatment group.

Of course, the idea of the study can be extended to any type of data. In this example, we can replace `tweets` with any other object recommended to the user, such as pictures, videos, reels, items to purchase etc.

Next, we will work with an imaginary dataset to show the abilities of the PySyft library. Following the Twitter study above, let's imagine that there exists a social media platform where users can only post images which contain bubble tea drinks. Each picture has a link that takes the user to an online bubble tea store, where they can order the particular drink featured in the image. Impressions, in this example, are measured by the number of clicks on the link in each picture.

The form of the dataset is as such:
| PostId   | Link      | Impression | Bucket    | ... |
|----------|-----------|------------|-----------|-----|
| id1      | bt/brown-sugar1 | 10        | Control   |     |
| id2      | bt/brown-sugar1 | 28        | Treatment |     |
| id3      | bt/oolong1 | 21        | Control   |     |
| id4      | bt/oolong1 | 4        | Treatment |     |
| ...      |           |            |           |     |

You are trying to identify whether the order in which the images are shown influences the purchasing decisions of the users. Just like in the Twitter example, there is a control group (chronological order) and a treatment group (using a ranking algorithm).
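As a first look at how such a comparison could be computed, the sketch below rebuilds the four example rows from the table above in pandas and compares the buckets. The values are the illustrative ones from the table, and the variable names are assumptions.

```python
import pandas as pd

# The four example rows from the table above (illustrative values only).
posts = pd.DataFrame(
    {
        "PostId": ["id1", "id2", "id3", "id4"],
        "Link": ["bt/brown-sugar1", "bt/brown-sugar1", "bt/oolong1", "bt/oolong1"],
        "Impression": [10, 28, 21, 4],
        "Bucket": ["Control", "Treatment", "Control", "Treatment"],
    }
)

# Average number of impressions (clicks) per bucket.
mean_by_bucket = posts.groupby("Bucket")["Impression"].mean()
print(mean_by_bucket)

# Per-link comparison: one row per link, one column per bucket.
per_link = posts.pivot(index="Link", columns="Bucket", values="Impression")
per_link["Difference"] = per_link["Treatment"] - per_link["Control"]
print(per_link)
```

On these four rows the bucket averages are 15.5 (Control) and 16.0 (Treatment), while the per-link differences point in opposite directions (+18 and -17), which shows why a single average can hide a lot of variation.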

You have access to this dataset on the low-side domain hosted by the company. Their private data is on the high-side domain, as explained before, and currently you will work with mock data. We will work with a small example, containing only 6 rows and the four columns from before.

## Part 3 - Writing a code request

You have access to a dataset (with fake data) on which you will prepare the query. Below are the steps you need to do.

#### Logging in

In [1]:
import syft as sy
import pandas as pd



In [2]:
# change the url, and your email and password with your own credentials

guest_domain_client = sy.login(url='http://localhost:8081', email='ana.banana@uni.org', password='student')

Logged into Nice Bach as <ana.banana@uni.org>


In [3]:
guest_domain_client

<SyftClient - Nice Bach <9af5795ca6b1447cafaced0bfbe2eca3>: HTTPConnection: http://localhost:8081>

#### Access the dataset

In [4]:
# You can see a list of all available datasets! For now, it will be just one, the bubble tea mock data described above.

guest_domain_client.datasets

Unnamed: 0,type,id,name,url
0,syft.service.dataset.dataset.Dataset,575c017e652140369133196d432102b4,Algorithmic study - Bubble tea mock data,


In [5]:
# You can directly access one of the datasets in the list, let's save it in a variable.

dataset = guest_domain_client.datasets[0]

In [6]:
# You can see all the class fields (metadata added by the data owner) of the dataset.

dataset

```python
Syft Dataset: Algorithmic study - Bubble tea mock data
Assets:
	bubble_tea_data_model_auditing: Mock data for bubble tea social media study
Citation: Carmen Popa
Description: The fake dataset for the bubble tea algorithmic study

```

**About assets**\
PySyft library provides flexibility so that data owners can group together assets into a single dataset. As such, a dataset is a collection of assets, and each asset is a singular object which holds the data. For example, one can add the same data in multiple formats, but use a single uploaded dataset, with more assets (one asset for each format of the data). Similarly, one can split the dataset into testing / training / validation, and use assets to mark this.

For our current example, the dataset has just one asset, as shown below. But it might be that a dataset contains multiple assets, and their meaning or difference will be marked in the `key` field.

In [7]:
# Checkout the assets of the dataset; in the list below, there is just one asset available.

dataset.assets

Unnamed: 0,key,type,id
0,bubble_tea_data_model_auditing,syft.service.dataset.dataset.Asset,0d5710b409d54b10b6293bd6a1d70929


In [8]:
# The PySyft dataset has multiple wrappers around the core pandas dataframe object.
# To work with the data, you should extract it as explained below:

print(type(dataset))
print(type(dataset.assets[0]))
print(type(dataset.assets[0].mock))  # ---> this is the dataset which needs to be passed to the code submission; it's the Syft object that holds the data
print(type(dataset.assets[0].mock.syft_action_data))  # ---> this is the dataframe which you can use locally; it's the pandas dataframe

<class 'syft.service.dataset.dataset.Dataset'>
<class 'syft.service.dataset.dataset.Asset'>
<class 'syft.service.action.pandas.PandasDataFrameObject'>
<class 'pandas.core.frame.DataFrame'>


In [9]:
# Let's access the Pandas dataframe. You can work with this object like with any other dataframe.
# You can access the shape, print some rows, get statistical results on the columns, split the dataset etc.

mock_df = dataset.assets[0].mock.syft_action_data

In [10]:
mock_df.shape

(6, 4)

In [11]:
mock_df.head()

Unnamed: 0,PostId,Link,Impression,Bucket
0,id1,bt/brown-sugar1,120,Control
1,id2,bt/brown-sugar1,275,Treatment
2,id3,bt/oolong1,181,Control
3,id4,bt/oolong1,49,Treatment
4,id5,bt/taro1,122,Control


In [12]:
print('Total number of impressions:', mock_df["Impression"].sum())

Total number of impressions: 885


In [13]:
mock_df[0:2]

Unnamed: 0,PostId,Link,Impression,Bucket
0,id1,bt/brown-sugar1,120,Control
1,id2,bt/brown-sugar1,275,Treatment


#### Query for mock data

Now that you can access the dataset, let's try to answer some ethical questions based on the data. Let's consider the example in which you want to query the average number of impressions for the Treatment group.

In [14]:
# check the result on the mock dataset

mock_df[mock_df['Bucket'] == 'Treatment']['Impression'].mean()

154.0

But this is the answer computed on the mock data. As explained before, you need to submit a function (code request) which will be run on the private data to obtain the real result. 

The template below is a good start for this function. It has a **syft annotation**, which specifies the **input policy** and the **output policy**.

Input and output policies are rules set by the data owner on what data can go IN and what data can come OUT of custom code. These policies pair up the code submission with the dataset asset it was intended for.

The input policy gives the data owner the confidence that the system will only allow the code to run on the specified asset from the dataset (i.e. `ExactMatch(bubble_tea_data=mock)`). This means that an approved code cannot be run on any asset at random.

Output policies are used to maintain state, meaning they can remember information between executions. They are useful for imposing limits, such as allowing a piece of code to be executed at most 10 times. This gives the data owner confidence about how many times the custom code is executed and what its output structure looks like.
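The idea of a stateful policy that limits the number of executions can be illustrated with a small wrapper in plain Python. This is a toy sketch and not how PySyft implements output policies; the class `LimitedExecutions` and the helper `average` are hypothetical names used only here.

```python
class LimitedExecutions:
    """Toy stand-in for a policy that allows at most `limit` runs."""

    def __init__(self, func, limit: int):
        self.func = func
        self.limit = limit
        self.count = 0  # state remembered between executions

    def __call__(self, *args, **kwargs):
        if self.count >= self.limit:
            raise PermissionError(f"Execution limit of {self.limit} reached")
        self.count += 1
        return self.func(*args, **kwargs)


def average(values):
    return sum(values) / len(values)


limited_average = LimitedExecutions(average, limit=2)
print(limited_average([1, 2, 3]))  # first run is allowed
print(limited_average([4, 5, 6]))  # second run is allowed

try:
    limited_average([7, 8, 9])  # third run exceeds the limit
except PermissionError as err:
    print("Blocked:", err)
```

The counter is the "state" mentioned above: it survives between calls, so the wrapper can refuse further executions once the limit is reached.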

In [15]:
# extracting the PySyft object which holds the dataset, to pass it to the code submission below

mock = dataset.assets[0].mock

In [16]:
# the parameter name needs to match the argument name specified in the input policy (e.g. `bubble_tea_data`)

@sy.syft_function(input_policy=sy.ExactMatch(bubble_tea_data=mock),
                  output_policy=sy.SingleExecutionExactOutput())
def average_impression_treatment_query_v1(bubble_tea_data):
    # customize your query here
    
    df = bubble_tea_data
    result = df[df['Bucket'] == 'Treatment']['Impression'].mean()
    
    return float(result)

*Note: The system will not prevent you from submitting a function with the same name, as long as the code within is different. However, we recommend adding version identifiers to your function name (e.g. `_v1` used above) to be able to distinguish between multiple submissions. You might want to change the query at some point, or to correct something in your query after you have already submitted it. If you reuse the same function name, it might be hard to identify a particular version of the code, because all versions will appear under the same name, and then you might not know which returned result is the one you were looking for. Thus, we suggest incrementing the version number in the name for every submission.*

In [17]:
# Notice the use of mock (PySyft object) instead of mock_df (Pandas dataframe)

result = average_impression_treatment_query_v1(bubble_tea_data=mock)
result  # should be the same one as the one you tested locally (e.g. 154 for the current example)

154.0

#### Noise

As we can see from the code, `average_impression_treatment_query_v1` returns the result as it is. This function will also be run on the private data, which means that it will then return the exact average computed on the private data.

In some cases, this might not be accepted by the data owner for privacy reasons. Without any alteration of the computed result, you could, for example, make multiple slightly different queries and then infer private information about the subjects, which infringes their privacy. As such, adding some noise to the result is needed.
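To see why exact answers can leak information, consider the following sketch of two "slightly different" queries. The data is made up, and the scenario (one average over all posts, one over all posts except a single target) is an assumption chosen to keep the arithmetic easy to follow.

```python
import pandas as pd

# Hypothetical private data: one row per post (values are made up).
private_df = pd.DataFrame(
    {
        "PostId": ["id1", "id2", "id3", "id4"],
        "Impression": [120, 275, 181, 49],
    }
)

# Query 1: exact average over all posts.
n_all = len(private_df)
mean_all = private_df["Impression"].mean()

# Query 2: exact average over all posts except "id4".
without_target = private_df[private_df["PostId"] != "id4"]
n_rest = len(without_target)
mean_rest = without_target["Impression"].mean()

# Combining the two exact answers reveals the value for "id4".
recovered = mean_all * n_all - mean_rest * n_rest
print(recovered)
```

Neither query returns a single row, yet together they recover the exact value for one post (49 in this example). Noise added to each answer makes this kind of reconstruction unreliable.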

Ideally, this would be taken care of automatically by PySyft library, or by the data manager, making sure that they do not send to you the exact real result. However, currently, this is not supported, and the proposed workflow is that the data scientist will add the noise in their request.

In your work, it might be that for some datasets, it is not needed to add any noise. This depends on the nature of the data, and on the privacy requirements imposed by the data owners. However, for some datasets, adding noise to a result might be mandatory, and a query like the one above will be rejected.

If you need to add noise to your result, you can make use of data privacy engines, such as the [OpenDP](https://docs.opendp.org/en/stable/index.html) open-source library.

Next, we will write a new version of the function (`average_impression_treatment_noise_query_v1`), which still computes the average number of impressions for the Treatment group, but adds noise to the result.

In [18]:
# OpenDP documentation for mean: 
# https://docs.opendp.org/en/stable/user/transformations/aggregation-mean.html

@sy.syft_function(input_policy=sy.ExactMatch(bubble_tea_data=mock),
                  output_policy=sy.SingleExecutionExactOutput())
def average_impression_treatment_noise_query_v1(bubble_tea_data):
    # imports
    from opendp.mod import enable_features
    from opendp.measurements import make_base_laplace
    enable_features('contrib')
    
    # generate noise
    aggregate = 0.
    base_lap = make_base_laplace(scale=5.)
    noise = base_lap(aggregate)

    # customize your query here
    df = bubble_tea_data
    result = df[df['Bucket'] == 'Treatment']['Impression'].mean()
    
    # return the result with noise
    return float(result) + float(noise)

In [19]:
result_with_noise = average_impression_treatment_noise_query_v1(bubble_tea_data=mock)
result_with_noise  # try running it multiple times and see how the result changes

153.46826112023587
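To get a feeling for how much the answer changes between runs, the sketch below simulates many noisy answers. It uses NumPy's Laplace sampler instead of OpenDP, purely to illustrate the spread for a scale of 5; it is not a substitute for a differential privacy library, and the variable names are assumptions.

```python
import numpy as np

true_mean = 154.0  # the result computed on the mock data earlier
scale = 5.0  # same Laplace scale as in the noisy query

rng = np.random.default_rng(seed=0)
noisy_results = true_mean + rng.laplace(loc=0.0, scale=scale, size=10_000)

print("first five noisy answers:", np.round(noisy_results[:5], 2))
print("average of all noisy answers:", round(float(noisy_results.mean()), 2))
print("spread (standard deviation):", round(float(noisy_results.std()), 2))
```

Individual answers differ by a few units from the true value, while their average stays close to it. This is also why the number of allowed executions matters: averaging many noisy answers would cancel out most of the noise.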

#### Things to keep in mind when writing functions

Remember that you can only test your function locally on the mock dataset; you cannot run it yourself on the dataset with the real data. So there might be cases when your function works just fine locally, but the data manager is not able to run it successfully on the real data. To minimize these cases, write your queries as generically as possible. For example, if you make use of the size of the dataset (number of rows, number of columns), it's better to compute it inside the function using the Pandas tools for this (`.shape`, `len()`) than to hardcode it based on the values given by the mock dataset. As a rule of thumb, try not to hardcode things which you can compute from the input dataset, as this will help keep your query generic.
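The difference between a hardcoded and a generic query can be shown with plain pandas. Both function names below are hypothetical, and the two small dataframes stand in for a mock dataset (6 rows) and a larger real dataset.

```python
import pandas as pd


def treatment_share_hardcoded(df: pd.DataFrame) -> float:
    # Fragile: 6 is the number of rows in the mock dataset only.
    return float((df["Bucket"] == "Treatment").sum() / 6)


def treatment_share_generic(df: pd.DataFrame) -> float:
    # Robust: the number of rows is computed from whatever data is passed in.
    return float((df["Bucket"] == "Treatment").sum() / len(df))


mock_like = pd.DataFrame({"Bucket": ["Control", "Treatment"] * 3})  # 6 rows
larger = pd.DataFrame({"Bucket": ["Control", "Treatment"] * 50})  # 100 rows

print(treatment_share_hardcoded(mock_like), treatment_share_generic(mock_like))
print(treatment_share_hardcoded(larger), treatment_share_generic(larger))
```

Both versions agree on the 6-row table, but only the generic one still returns a valid share (0.5) on the larger table; the hardcoded one silently returns a value greater than 1.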

#### [Good practice] exploring the function before sending it

These are instructions on how to double-check the settings of your custom function, before sending it to be approved. 

In [20]:
average_impression_treatment_query_v1.kwargs

In [21]:
average_impression_treatment_query_v1.input_policy_type

syft.service.policy.policy.ExactMatch

In [22]:
average_impression_treatment_query_v1.output_policy_type

syft.service.policy.policy.OutputPolicyExecuteOnce

In [23]:
print(average_impression_treatment_query_v1.code)

@sy.syft_function(input_policy=sy.ExactMatch(bubble_tea_data=mock),
                  output_policy=sy.SingleExecutionExactOutput())
def average_impression_treatment_query_v1(bubble_tea_data):
    # customize your query here
    
    df = bubble_tea_data
    result = df[df['Bucket'] == 'Treatment']['Impression'].mean()
    
    return float(result)



## Part 4 - Sending a code request

#### Creating a project

Now that you understand how to write a code request, the next step is to send it. You will send the request as part of a project, so first, you need to create a project.

Within a project, you will be able to send multiple code requests relevant to the scope of the project. In this tutorial, we will send the two code requests (with noise and without noise) in the same project.

However, you can only submit one project with the same name. If you try to submit the same project twice, it will show an error due to name clashes. This means that, if you have follow-up requests or if you want to correct something in the previous submission, you will need to create a new project. 

Rule of thumb: create a project every time you want to submit one or more code requests. Currently, once a project is submitted, you cannot modify or reuse it for another submission.

In [24]:
# Create a project

new_project = sy.Project(name="Research on algorithmic transparency in bubble tea social media posts")

One important aspect about the project is the description. It's important here to be specific and clear, and even include relevant links in the description that can help the approver (data owner) get as much insight as possible into your research questions and intention. This will increase the chances of having your project approved.

In [25]:
# Add a proper description

proj_desc = """Hello, I want to explore the average number of impressions from the treatment group in this dataset. This will help me see how ..."""
new_project.set_description(proj_desc)

In [26]:
# Send the code to the guest domain. Use the function name to specify which function you want to send.
# Note: this is not the proper code submission, that will happen when you submit the project. At this stage,
# the data manager will not be able to see your code request, not until you submit the project.

guest_domain_client.api.services.code.request_code_execution(average_impression_treatment_query_v1)
guest_domain_client.api.services.code.request_code_execution(average_impression_treatment_noise_query_v1)

```python
class Request:
  id: str = fbd55358ec45424ca4df42bb8355c4d1
  requesting_user_verify_key: str = 7a7adf41e1306099798cf8ceaad88c4b946765966ca71c5033c68782a4967cf8
  approving_user_verify_key: str = None
  request_time: str = 2023-05-20 11:22:56
  approval_time: str = None
  status: str = RequestStatus.PENDING
  node_uid: str = 9af5795ca6b1447cafaced0bfbe2eca3
  request_hash: str = "3133cfbfb3833c88efbf3371cb1512be34e3d3801341a846f912f6dd3e5d6b24"
  changes: str = [syft.service.request.request.UserCodeStatusChange]

```

In [27]:
# You can check all the previous code uploads you've made to this domain.
# For this tutorial, there will be only two, namely the functions that we wrote before.

guest_domain_client.code

Unnamed: 0,type,id,status,service_func_name
0,syft.service.code.user_code.UserCode,98fb26d759b048d69cb0d3a165ef5e2f,"{NodeView(node_name='Nice Bach', verify_key=24...",average_impression_treatment_query_v1
1,syft.service.code.user_code.UserCode,04c1d81c23e647d9ace1aafd6650f035,"{NodeView(node_name='Nice Bach', verify_key=24...",average_impression_treatment_noise_query_v1


In [28]:
# You can pick a specific code submission from the list above by the index and inspect it.

submitted_code = guest_domain_client.code[0]

In [29]:
type(submitted_code)  # notice the UserCode syft class; this object is needed when adding the request to the project

syft.service.code.user_code.UserCode

In [30]:
submitted_code.input_policy_type

syft.service.policy.policy.ExactMatch

In [31]:
submitted_code.output_policy_type

syft.service.policy.policy.OutputPolicyExecuteOnce

In [32]:
submitted_code.code

"@sy.syft_function(input_policy=sy.ExactMatch(bubble_tea_data=mock),\n                  output_policy=sy.SingleExecutionExactOutput())\ndef average_impression_treatment_query_v1(bubble_tea_data):\n    # customize your query here\n    \n    df = bubble_tea_data\n    result = df[df['Bucket'] == 'Treatment']['Impression'].mean()\n    \n    return float(result)\n"

In [33]:
# Next, add the code requests to the project (before submitting the project).
# Note: for this step, you need to use the objects which have been uploaded to the domain client
# This means that referring to the functions by their name is not enough.

new_project.add_request(obj=guest_domain_client.code[0], permission=sy.UserCodeStatus.EXECUTE)
new_project.add_request(obj=guest_domain_client.code[1], permission=sy.UserCodeStatus.EXECUTE)

In [34]:
new_project

```python
class ProjectSubmit:
  id: str = None
  name: str = "Research on algorithmic transparency in bubble tea social media posts"
  description: str = "Hello, I want to explore the average number of impressions from the treatment group in this dataset. This will help me see how ..."
  changes: str = [syft.service.project.project.ObjectPermissionChange, syft.service.project.project.ObjectPermissionChange]

```

In [35]:
# You can see a list of the code submissions added

new_project.changes

Unnamed: 0,type,id
0,syft.service.project.project.ObjectPermissionC...,e2ebbe8bdbd54c3b93b8a70602fd1369
1,syft.service.project.project.ObjectPermissionC...,6f98ba325bf24f2386994954c56649ee


In [36]:
# Submit the project

guest_domain_client.submit_project(new_project)

#### Receiving an answer to your request

Below we'll see how you can obtain an answer to your request. As mentioned before, the code is inspected by a data manager, and then run on the real data. This can take a while, as it involves manual work. It can be up to a few days.

The possible answers which you can get include:
- code still waiting for approval:
        `SyftNotReady: Your code is waiting for approval: {NodeView(node_name='Trusting Silver', verify_key=04137bfea77b4c86492df2b9849190bd4a19b7a2833b3c500dd37fef6957615c): }`
- code approved: you will see the returned result, i.e. `(157.9, 2.02)`

In [37]:
# To check your result, you will need the asset again. This is because the code can be run on one or multiple
# assets, as specified in the input policy. Our examples will only accept one input, but it might be that some queries
# are supported on multiple inputs. For this, specify which result you are looking for.

asset = guest_domain_client.datasets[0].assets[0]
asset

```python
Asset: bubble_tea_data_model_auditing
Pointer Id: cc51b887d2334f00813a8b9cf3570083
Description: Mock data for bubble tea social media study
Total Data Subjects: 0
Shape: (6, 4)
Contributors: 0

```

In [40]:
# And check the code status for the function which you created.

real_result = guest_domain_client.api.services.code.average_impression_treatment_query_v1(bubble_tea_data=asset)
real_result

157.9

In [39]:
# For this request, the answer was not provided yet by the data manager

real_result = guest_domain_client.api.services.code.average_impression_treatment_noise_query_v1(bubble_tea_data=asset)
real_result

## Part 5 - Queries supported in Syft 0.8

This part of the tutorial will briefly outline what is supported and not when using Syft. There are some limitations, as shown below, and this might result in errors when running the code request on the private data. This section will explain what to bear in mind when writing your code request, and how to adapt your code to the current Syft capabilities.

**Important note:**

You can work with the dataset in two ways (both before submitting the project and the code request):

(1) Fully local:
- this is when you inspect the data outside the code request
- for this, you should use `dataset.assets[0].mock.syft_action_data` as explained in part 3
- the code written as such will use the real NumPy and Pandas libraries, the ones which you have locally, and not the Syft library

(2) Low-side domain:
- this is when you run the code on the low-side domain, by writing a code request
- for this, you should use `dataset.assets[0].mock` as explained in part 3
- this will use the code request which you wrote, and as such, will use the Syft library
- the behaviour of this function should, in theory, be the same as the behaviour of the submitted function (when it's run on the real data, on the high-side domain)*

\*In theory, a code request tested using method (2) should work whenever it would also work on the real data, and should fail whenever it would also fail on the real data. This would be a good indicator of problems in the code: if it works via method (2), then the whole submission should be alright! However, there is currently a bug in the library which means this behaviour is not guaranteed. In other words, even if all is fine when testing with method (2), **the code run on the real data can still crash.**

**What is supported for NumPy?**

- array creation
- computing statistical information: `mean`, `std`, `min`, `max` etc
- slicing and indexing

Generally, the most common operations for `NumPy` are also supported when using Syft. But here are some things which you should keep in mind when writing a code request:

**[1] Converting objects**

`np.asarray(mock_object)`, `pd.DataFrame(mock_object)` and `torch.Tensor(mock_object)` will return objects which lose the Syft class associated with them, which is not desired. Avoid using them.

However, `np.astype()` is safe to be used.

**[2] Tuples**

Methods which return tuples are supported, but it's recommended that you extract the elements in the tuple in different variables and don't continue working with the tuple as a whole to avoid hard-to-debug issues.
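The recommended style looks like this on a plain NumPy array. The sketch only illustrates the coding pattern of unpacking the tuple right away; it does not by itself demonstrate how Syft objects behave, and `np.unique` is just one example of a method that returns a tuple.

```python
import numpy as np

values = np.array([3, 1, 3, 2, 1, 3])

# np.unique with return_counts=True returns a tuple of two arrays.
# Unpack it right away instead of passing the tuple around.
unique_values, counts = np.unique(values, return_counts=True)

# Continue working with the individual arrays.
most_frequent = unique_values[np.argmax(counts)]
print(unique_values, counts, most_frequent)
```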

**[3] Methods that return or extract arrays**

NumPy methods that return new arrays (e.g. `np.pad()`), as well as methods that extract arrays (e.g. `np.diag()`), might cause issues, for the reason explained in \[1\].

**[4] NaN values**

Pay attention to functions and operations which can return NaN, as this can show undefined behaviour. 
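One common source of NaN is computing a statistic on an empty selection. The sketch below shows this with plain pandas and one possible guard; the helper `mean_or_default` and the choice of default value are assumptions, so pick whatever makes sense for your query.

```python
import math

import pandas as pd

df = pd.DataFrame({"Bucket": ["Control", "Control"], "Impression": [10, 21]})

# No row matches the filter, so the mean of the empty selection is NaN.
selection = df[df["Bucket"] == "Treatment"]["Impression"]
result = selection.mean()
print(math.isnan(result))


def mean_or_default(series: pd.Series, default: float) -> float:
    """Return the mean, or `default` when there is nothing to average."""
    if len(series) == 0:
        return default
    return float(series.mean())


print(mean_or_default(selection, default=0.0))
print(mean_or_default(df["Impression"], default=0.0))
```

Handling the empty case explicitly means the function returns a well-defined value even if the real data contains no rows for a category that exists in the mock data.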

**[5] Metadata**

TODO - but this is bad.

**[6] Chaining operation, and using multiple operations in a line**

Using multiple functions on the same line might not work as expected. For example, if you want to calculate `(n - n.mean()) / n.std()` it is advisable to compute the `std` and the `mean` separately, and then to compute the final result using the variables with the values from the standard deviation and the mean. 

Similarly, chaining multiple functions one after another might cause an error. For example, if you want to filter an array and then compute the average, it is advisable to do this in two different steps, rather than do it in a one-liner.
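Written out step by step, the two examples above could look as follows. The sketch uses a plain NumPy array with made-up values, so it only shows the coding style and does not reproduce any Syft-specific behaviour.

```python
import numpy as np

n = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Compute each statistic in its own step and store it in a variable.
mean = n.mean()
std = n.std()

# Then combine the intermediate values.
centered = n - mean
standardized = centered / std

# Same idea for "filter, then average": separate steps instead of a one-liner.
mask = n > 4.0
filtered = n[mask]
filtered_mean = filtered.mean()

print(mean, std, filtered_mean)
```

Keeping the intermediate values in named variables also makes it easier to find out which step failed if the request crashes.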

**[7] In-place modifications**

In-place modifications will cause a crash. Avoid patterns like the ones in the examples below:
```
mock_object += 1
np.add(mock_object, mock_object, out=mock_object)
np.subtract(mock_object, mock_object, out=mock_object)
np.multiply(mock_object, mock_object, out=mock_object)
np.divide(mock_object, mock_object, out=mock_object)
mock_object[mock_object > 0.5] *= 2
```
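A possible way to avoid in-place modifications is to create a new object and bind it to a new name. The sketch below shows such alternatives on a plain NumPy array; whether each of them runs on a Syft object is an assumption that you should verify on the mock data, especially given the note in \[3\] about methods that return new arrays.

```python
import numpy as np

x = np.array([0.2, 0.6, 0.9])

# Instead of `x += 1`: create a new array under a new name.
x_plus_one = x + 1

# Instead of `np.add(x, x, out=x)`: let NumPy allocate the result.
doubled = np.add(x, x)

# Instead of `x[x > 0.5] *= 2`: build the result with np.where.
scaled = np.where(x > 0.5, x * 2, x)

print(x)  # the original array is unchanged
print(x_plus_one, doubled, scaled)
```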

**[8] NumPy subclasses**

For example, `np.random.shuffle(mock_object)` will show a warning and generate an incorrect result.

**[9] Record arrays**

Record arrays are not supported and using them will throw errors.