# PySyft - Docs Overview

The goal of PySyft is to provide a general-purpose web server for remote data science. A data owner can launch such a server, load data into it, configure it with their definitions of “use” and “mis-use”, and empower a remote data scientist to use Syft’s Python client to study this data while being held accountable to the data owner’s definition of acceptable use.

Note: to fully view the previous outputs of this notebook, go to `File` -> `Trust Notebook`.

## What is a domain server?

A domain server is a web server that hosts a data owner’s data and manages its remote study by external data scientists.

It lets the Data Owner manage the data and the incoming requests from data scientists, and it gatekeeps the data scientist’s access to the data, compute, and experimental results stored within the data owner’s compute infrastructure.


## How is a domain server used to enable remote data science?

The simplest way of performing remote data science resembles a manual process for remote procedure calls:

<a href="https://ibb.co/ydNSBJH"><img src="https://i.ibb.co/djfgJy8/Fire-Shot-Capture-052-Internal-Data-Owner-Handbook-Google-Docs-docs-google-com.png" alt="Fire-Shot-Capture-052-Internal-Data-Owner-Handbook-Google-Docs-docs-google-com" border="0"></a>

##### Step 1:
An IT Specialist launches a “high-side” domain server that is fully walled off from the internet.
##### Step 2:
An IT Specialist launches a second, “low-side” domain server that is connected to the internet.
##### Step 3:
The Data Manager loads the “high-side” server with private data.
##### Step 4:
The Data Manager uses the “high-side” server to generate fake/mock data assets that mirror the real data in every way possible without containing the real values.
##### Step 5:
The Data Manager copies the fake/mock assets into the “low-side” server.
##### Step 6:
A Data Scientist is given access to the low-side domain to:
- access the fake data assets to understand the dataset schema
- write code to be run on private data, and test it against the fake assets
- submit the code for review and approval by the Data Manager in order to retrieve the private result

##### Step 7:
The Data Manager checks each incoming code request, reviews it, and responds with the private result.

Steps 6 and 7 are repeated until the Data Scientist completes their study.
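
The separation between the two servers can be illustrated with a minimal plain-Python sketch. It does not use Syft: `Server` and `copy_mocks` are hypothetical names chosen for illustration, and the mock is simply a list of zeros with the same length as the private data.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical stand-in for a domain server (not Syft's API)."""
    name: str
    online: bool
    private: dict = field(default_factory=dict)  # real values, high side only
    mock: dict = field(default_factory=dict)     # fake values, same layout


def copy_mocks(high: Server, low: Server) -> None:
    """Step 5: only mock assets cross over to the internet-facing server."""
    low.mock.update(high.mock)


# Steps 1-2: an air-gapped high side and an internet-facing low side.
high = Server(name="high-side", online=False)
low = Server(name="low-side", online=True)

# Step 3: private data is loaded on the high side only.
high.private["ages"] = [34, 51, 29]

# Step 4: a mock with the same length and element type, but no real values.
high.mock["ages"] = [0 for _ in high.private["ages"]]

copy_mocks(high, low)


# Step 6: the data scientist writes code and tests it on the low-side mock.
def mean(values):
    return sum(values) / len(values)


mock_result = mean(low.mock["ages"])

# Step 7: after review, the data manager runs the same code on the high side.
private_result = mean(high.private["ages"])
```

In this sketch the low-side server never holds a private value; only the reviewed result leaves the high side.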

## How to install Syft

For instructions, please visit the following tutorial: 
    
https://github.com/OpenMined/PySyft/blob/dev/notebooks/tutorials/data-scientist/01-installing-syft-client.ipynb


For testing, installing Python 3.9+ and the latest beta version using `pip install --pre syft` should suffice.
        

## Launching a Domain Server

There are two types of deployment for a domain server:
- heavy deployment, which launches the domain server as a Docker container
- light deployment, which uses in-memory workers to simulate the actual web server

For testing, we will use only the light deployment, which works the same on all platforms and requires only Python 3.9+ to be installed on the system. It is launched as follows:

In [1]:
import syft as sy

domain_server = sy.orchestra.launch(name="my-domain", port=8080, dev_mode=False, reset=True) 


Warning: syft is imported in light mode by default.         
To switch to dark mode, please run `sy.options.color_theme = 'dark'`

Starting my-domain server on 0.0.0.0:8080
SQLite Store Path:
!open file:///tmp/3ef2b06132fd4d8e8317eaef4c5f940d.sqlite

Waiting for server to start... Done.


To check that it works, log in as a data owner using the default credentials:

In [2]:
domain_client = sy.login(port=8080, email="info@openmined.org", password="changethis")

Logged into my-domain as <info@openmined.org>


## Access a Domain Server

Accessing a domain server is as simple as logging into the domain with the correct credentials. Logging in returns a Python-based client that lets the user interact with the domain server according to their access rights.

To log in, one needs to know the port on which the domain server was launched, in addition to the credentials:

In [3]:
domain_client = sy.login(port=8080, email="info@openmined.org", password="changethis")

Logged into my-domain as <info@openmined.org>


## Managing access to the domain server

### Registering a Data Scientist account

A data scientist account can be created by the data owner of the domain, by providing a set of credentials: 
- name
- email
- password
- (optional) institution
- (optional) website

In [4]:
domain_client.register(
    name="Jane Doe", 
    email="jane@caltech.edu", 
    password="abc123"
)

## Data on the Domain Server

### Uploading data to the Domain Server

Creating a dataset (`sy.Dataset`) requires passing:
- `name`, which acts as a key within datasets
- `asset_list` containing the actual data uploaded as part of the dataset
- (optional) `description`, which contains a short description of the dataset
- (optional) `citation`
- (optional) `url`
- (optional) `contributors_list` consisting of contributors to the dataset


Creating an asset (`sy.Asset`) requires passing:
- `name`, which acts as a key within the asset list
- `data`, which is the private data
- `mock`, which is the fake counterpart of the data: it has the same schema but does not contain any sensitive information
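
One possible way to produce such a mock is to generate random values that follow the schema of the private records. The sketch below is plain Python and not part of Syft: `make_mock` is a hypothetical helper, the example records are invented, and only `int`, `float`, and `str` fields are handled.

```python
import random
import string


def make_mock(records, seed=0):
    """Return fake records with the same keys and value types as `records`.

    Hypothetical helper: values are drawn at random, so the mock reveals
    only the field names, the value types, and the number of rows.
    """
    rng = random.Random(seed)
    generators = {
        int: lambda: rng.randint(0, 100),
        float: lambda: rng.uniform(0.0, 1.0),
        str: lambda: "".join(rng.choices(string.ascii_lowercase, k=6)),
    }
    return [
        {key: generators[type(value)]() for key, value in row.items()}
        for row in records
    ]


private_records = [
    {"age": 34, "income": 52000.0, "city": "Lyon"},
    {"age": 51, "income": 61000.5, "city": "Oslo"},
]
mock_records = make_mock(private_records)
```

Because the mock keeps the same keys and types, code written against `mock_records` should also run unchanged on `private_records`.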

Creating a contributor requires passing:
- `role`, usually `sy.roles.UPLOADER` or `sy.roles.EDITOR`
- `name`
- `email`

The email is useful as a contact for questions about the dataset.

In [5]:
# An example creating a dataset

import numpy as np 

dataset = sy.Dataset(
    name="my dataset",
    description="description of the dataset",
    asset_list=[
        sy.Asset(
            name="my asset",
            data=np.array([1, 2, 3]),
            mock=np.array([1, 1, 1]),
        )
    ],
)

# Optional, one could add contributors
dataset.add_contributor(
    role=sy.roles.UPLOADER, 
    name="Andrew",
    email="andrew@gmail.com"
)

In the case where no mock is available, the mock can be passed as an empty object:

In [6]:
dataset.add_asset(
    sy.Asset(
        name="my asset without mock",
        data=np.array([1,2,3]),
        mock=sy.ActionObject.empty()
    )
)

dataset

(Output: an interactive table with a `name` column.)


Uploading a dataset only requires running the following:

In [7]:
domain_client.upload_dataset(dataset)

  0%|                                                     | 0/2 [00:00<?, ?it/s]

Uploading: my asset


100%|█████████████████████████████████████████████| 2/2 [00:00<00:00,  4.31it/s]


Uploading: my asset without mock


### Finding data on the Domain Server

To retrieve datasets that are uploaded to a Domain Server, one could do:

In [9]:
datasets = domain_client.datasets
datasets

(Output: an interactive table with columns `id`, `name`, `url`.)


Similarly, one can search for a specific dataset using `.search()`:

In [10]:
datasets.search('my')

(Output: an interactive table with columns `id`, `name`, `url`.)


### Understanding the Dataset Object

The PySyft library lets data owners group assets together into a single dataset.

A dataset (`sy.Dataset`) is a collection of assets, and each asset is a single object that holds the data.

For example, one can upload the same data in multiple formats as a single dataset with one asset per format. Similarly, one can split the dataset into training / validation / testing sets and use assets to mark each split.

To refer to a specific dataset, you can use `.get_by_id(..)` or index into the list.

In [11]:
dataset = datasets[0]
dataset

(Output: an interactive table with columns `id`, `name`, `shape`.)


To retrieve a specific asset, you can use the name of the asset:

In [12]:
asset_my = dataset.assets['my asset']
asset_my



Each asset is a two-sided object that secures the private data:
- `asset` or `asset.pointer` points to the private data; it can be passed into computations but cannot be accessed directly by anyone except the Data Manager
- `asset.mock` is the fake counterpart of the data (a NumPy array in this example), which can be safely used for testing code
- `asset.data` is the actual private data, which is inaccessible to the Data Scientist
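
The two-sided idea can be sketched in plain Python. This is not how Syft implements assets: `TwoSidedAsset` and its `role` argument are hypothetical, and the sketch only shows that the mock is readable by anyone while the private side is gated.

```python
class TwoSidedAsset:
    """Hypothetical illustration of an asset with a mock and a private side."""

    def __init__(self, name, data, mock):
        self.name = name
        self.mock = mock    # safe to share with data scientists
        self._data = data   # private values

    def data(self, role):
        """Return the private values only to the data owner."""
        if role != "data_owner":
            raise PermissionError(
                f"{role} may not read private data of {self.name!r}"
            )
        return self._data


demo_asset = TwoSidedAsset("my asset", data=[1, 2, 3], mock=[1, 1, 1])

# A data scientist can test code on the mock...
mock_sum = sum(demo_asset.mock)

# ...but a direct read of the private side is refused.
try:
    demo_asset.data(role="data_scientist")
    refused = False
except PermissionError:
    refused = True
```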

In [13]:
asset = dataset.assets[0]
asset



The `mock` is useful for testing your code locally before submission.

In [14]:
mock = dataset.assets[0].mock
mock

array([1, 1, 1])

## Interaction: data owners <> data scientists

The interaction between data owners and data scientists is structured as follows:
1. The Data Scientist creates a project, under which they submit a series of code requests
2. The Data Owner inspects incoming requests and chooses to:
    - approve and respond with the result computed on the private data
    - deny and provide a reason for the denial
3. The Data Scientist inspects the status of their requests and downloads the private result
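
The lifecycle of a single request can be modelled as a small state machine. The sketch below is plain Python with hypothetical names (`SketchRequest`, `Status`); it mirrors the approve/deny choice described in the steps rather than Syft's internal classes.

```python
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    DENIED = "denied"


class SketchRequest:
    """Hypothetical code request that starts as PENDING."""

    def __init__(self, code_name):
        self.code_name = code_name
        self.status = Status.PENDING
        self.result = None
        self.reason = None

    def approve(self, result):
        """Data owner: release the result computed on the private data."""
        self.status, self.result = Status.APPROVED, result

    def deny(self, reason):
        """Data owner: refuse the request and explain why."""
        self.status, self.reason = Status.DENIED, reason

    def outcome(self):
        """What the data scientist sees when checking the request."""
        if self.status is Status.APPROVED:
            return self.result
        if self.status is Status.DENIED:
            return f"denied: {self.reason}"
        return "pending"


approved = SketchRequest("compute_sum")
approved.approve(result=6)

denied = SketchRequest("compute_sum")
denied.deny(reason="Risk of privacy leak")
```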

Before moving forward, let's use two domain clients to distinguish the steps done by the Data Owner (DO) from those done by the Data Scientist (DS):
- `domain_client`, as we logged in previously, is the client used by the Data Owner
- `ds_client`, as shown below, is the client used by the Data Scientist

In [15]:
ds_client = sy.login(port=8080, email="jane@caltech.edu", password="abc123")

Logged into my-domain as <jane@caltech.edu>


### Creating Projects

A project (`sy.Project`) sets the scope of an ongoing collaboration between data owners and data scientists. It has the following fields:
- `name`
- `description`
- `members` - the clients creating this project, in this case the Data Scientist's client (`ds_client`)

As it is an ongoing collaboration, requests can be added before the project starts or while it is ongoing.

In [16]:
project = sy.Project(
    name="research on algorithmic fairness",
    description="Exploration of the algorithmic outcomes of algorithm A",
    members=[ds_client]
)

project

```python
class ProjectSubmit:
  id: str = cda964ee4c754614a69bccc82232a0e5
  name: str = "research on algorithmic fairness"
  description: str = "Exploration of the algorithmic outcomes of algorithm A"
  created_by: str = "jane@caltech.edu"

```

##### NOTE: Before submitting any request, a project must be started.

The project above currently exists only locally, which lets us check it before submitting it to the domain server.

To start a project, run:

In [17]:
project_ongoing = project.start()
project_ongoing

```python
class Project:
  id: str = cda964ee4c754614a69bccc82232a0e5
  name: str = "research on algorithmic fairness"
  description: str = "Exploration of the algorithmic outcomes of algorithm A"
  created_by: str = "jane@caltech.edu"
  events: str = []

```

### Viewing projects on a Domain Server

Projects can be accessed by both the data owner and the data scientist. However, a data scientist can see only the projects that they have submitted, not other data scientists' projects.
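
This visibility rule can be sketched as a simple filter. The names below (`visible_projects`, the `role` strings, and the second project) are hypothetical and only illustrate the rule as stated here, not Syft's access-control code.

```python
def visible_projects(projects, email, role):
    """Data owners see every project; data scientists see only their own."""
    if role == "data_owner":
        return list(projects)
    return [p for p in projects if p["created_by"] == email]


projects = [
    {"name": "research on algorithmic fairness", "created_by": "jane@caltech.edu"},
    {"name": "another study", "created_by": "sam@example.org"},
]

owner_view = visible_projects(projects, "info@openmined.org", "data_owner")
jane_view = visible_projects(projects, "jane@caltech.edu", "data_scientist")
```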

Let's use the Data Owner client here to get all projects.

In [18]:
domain_client.projects

(Output: an interactive table with columns `id`, `name`, `description`, `created_by`, `events`.)


As with datasets, you can index into the list above to select a specific project, or use `get_project` and pass in the name.

In [19]:
project_research = domain_client.get_project('research on algorithmic fairness')
project_research

```python
class Project:
  id: str = cda964ee4c754614a69bccc82232a0e5
  name: str = "research on algorithmic fairness"
  description: str = "Exploration of the algorithmic outcomes of algorithm A"
  created_by: str = "jane@caltech.edu"
  events: str = []

```

## Requests within projects

As mentioned before, the data scientist can attach multiple requests under a project.

### How to create a code request as a Data Scientist

A `sy.Request` contains a specific change requested by the Data Scientist and the object to which the change applies.

The most common type of request is a user code execution request, where the object is a code snippet (a `sy.UserCode`) that is requested to run on the private data, with specific constraints defined via a decorator, namely an `InputPolicy` and an `OutputPolicy`.

Let's do the following steps:
- create a Python method `compute_sum` that we would like to run against private data
- create a Syft Object that wraps `compute_sum` and enforces how the code can be executed (for the safety of the private data)
- test the execution of the Syft object against the mock data

### Defining a Python method

In [20]:
def compute_sum(data):
    return data.sum()

# Run on the mock data (a NumPy array) for testing
compute_sum(mock)

3

### Add Syft Function wrapper and Policies
To convert the Python method into a Syft Object, we add a `sy.syft_function` decorator with two policies. These policies are rules set by the data owner on what data can go IN and what data can come OUT of custom code; they pair the code submission with the dataset asset it was intended for.
- `input_policy` ensures that the system will only allow the code to run on the specified asset from the dataset (using a pre-defined policy such as `ExactMatch(data=asset)`). This means that approved code cannot be run on an arbitrary asset.
- `output_policy` can maintain state, meaning it can store metadata between executions, which is useful for limiting how many times a method can be run. The most common policy is single execution (`sy.SingleExecutionExactOutput`).
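
To make the two policies concrete, here is a minimal plain-Python sketch of the same idea. It is not Syft's implementation: `guarded`, `allowed_input`, and `max_runs` are hypothetical names, and an identity check (`is`) stands in for whatever matching `ExactMatch` actually performs.

```python
import functools


def guarded(allowed_input, max_runs=1):
    """Let a function run only on `allowed_input`, at most `max_runs` times."""

    def decorator(func):
        runs = 0  # state kept between calls, as an output policy would keep it

        @functools.wraps(func)
        def wrapper(data):
            nonlocal runs
            if data is not allowed_input:  # input rule: the exact asset only
                raise PermissionError("code was not approved for this input")
            if runs >= max_runs:           # output rule: limited executions
                raise PermissionError("execution limit reached")
            runs += 1
            return func(data)

        return wrapper

    return decorator


approved_asset = [1, 2, 3]
other_asset = [4, 5, 6]


@guarded(allowed_input=approved_asset, max_runs=1)
def compute_sum_sketch(data):
    return sum(data)


first = compute_sum_sketch(approved_asset)  # right asset, first run: allowed
```

A second call on `approved_asset`, or any call on `other_asset`, raises `PermissionError` in this sketch.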

If we want to enforce using an exact asset and single execution, we will define it as:

In [21]:
@sy.syft_function(input_policy=sy.ExactMatch(data=asset),
                  output_policy=sy.SingleExecutionExactOutput())
def compute_sum(data):
    return data.sum()

##### How to send a code request as a Data Scientist
Once the cells above have been run, the Syft object `compute_sum` is available for submission. Submitting it requires only the project reference and the Data Scientist's client:

In [27]:
project_research.create_code_request(
    compute_sum, 
    ds_client)

### View requests within project
To view the requests which are part of the project, you can inspect:

In [28]:
project_research.requests

(Output: an interactive table with columns `id`, `request_time`, `updated_at`, `status`, `changes`, `requesting_user_verify_key`.)


To view the code that reached the Domain Server, you can look at the `.code` property of the domain via the client.

In [29]:
ds_client.code

(Output: an interactive table with columns `id`, `status.approved`, `service_func_name`.)


### Inspect existing code requests in a domain as a Data Owner

While a Data Scientist can only see the requests within their own project, the Data Owner can see all requests across their domain.

To see the requests on the domain, look at:

In [30]:
domain_client.requests

(Output: an interactive table with columns `id`, `request_time`, `updated_at`, `status`, `changes`, `requesting_user_verify_key`.)


### Answering code requests as a Data Owner

The Data Owner has two options:
- answer by depositing a response computed on the private counterpart of the data
- deny, by providing a written reason

Before that, the Data Owner would:
- inspect the code to check that it is not malicious
- retrieve a callable reference to the method
- run the method against mock data for safety
- in case of approval, run the method against the private data
- in case of denial, specify the reason
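
The checklist can be expressed as a small helper. This is a hedged sketch in plain Python, not a Syft API: `review` is a hypothetical function that runs the submitted code on mock data first and touches the private data only if that dry run succeeds and the reviewer approves.

```python
def review(func, mock, private, approve, reason="Risk of privacy leak"):
    """Dry-run `func` on mock data, then run it on private data if approved.

    Returns a (decision, payload) pair: the payload is the private result
    on approval, or a written reason on denial.
    """
    try:
        func(mock)  # safety check on fake data only
    except Exception as exc:
        return "denied", f"code failed on mock data: {exc}"
    if not approve:
        return "denied", reason
    return "approved", func(private)


def total(data):
    return sum(data)


decision, payload = review(total, mock=[1, 1, 1], private=[1, 2, 3], approve=True)
```

Inspecting the code for malicious behaviour remains a manual step; the `approve` flag stands for the outcome of that inspection.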

In [31]:
request = domain_client.requests[0]
request

```python
class Request:
  id: str = 18c32f58831c4979a6688d845f62683e
  request_time: str = 2023-06-15 12:38:34
  updated_at: str = None
  status: str = RequestStatus.PENDING
  changes: str = [syft.service.request.request.UserCodeStatusChange]
  requesting_user_verify_key: str = d9dfe5dc262b96a07ab91b4250c76e721f9bba9b4412681f377e938ac5891a03

```

In [32]:
# Inspect the code that was sent

print(request.changes[0].link.code)

@sy.syft_function(input_policy=sy.ExactMatch(data=asset),
                  output_policy=sy.SingleExecutionExactOutput())
def compute_sum(data):
    return data.sum()



In [39]:
# Get a callable method to the function sent and run against mock data

user_func = request.changes[0].link.unsafe_function
user_func(data=mock)



3

In [44]:
# Denial

request.deny(reason="Risk of privacy leak")

As a Data Owner, you can check the status of requests again to see that the `RequestStatus` was updated:

In [45]:
domain_client.requests

(Output: an interactive table with columns `id`, `request_time`, `updated_at`, `status`, `changes`, `requesting_user_verify_key`.)


In [46]:
# Approval

real_result = user_func(data=asset.data)
real_result

6

In [47]:
request.accept_by_depositing_result(real_result)

In [48]:
domain_client.requests

(Output: an interactive table with columns `id`, `request_time`, `updated_at`, `status`, `changes`, `requesting_user_verify_key`.)


#### Checking status of request as Data Scientist

As a data scientist, you can:
1. check the requests within your project to see their status
2. re-run the method on the pointer, which returns the actual result if it was deposited, or the denial reason otherwise

In [49]:
project_research.requests

(Output: an interactive table with columns `id`, `request_time`, `updated_at`, `status`, `changes`, `requesting_user_verify_key`.)


In [50]:
# Download the result, as it was approved:

ds_client.code.compute_sum(data=asset.data).get_from(ds_client)

6

# Support

For further questions, please refer to our Slack channel for this exercise:

https://join.slack.com/share/enQtNTQyNDYyMzk1MDI3OC00YTdiYjZjMGU1ODRlYTZjYjg0MWJmNzEyMGJiNjM0ZWZkMjk5M2Q4ODdjZWMzOTg0NjNhM2JlMTQ2NWNkZWY4