# Loading data into Syft Domain Server as a Data Owner

Welcome to Syft! This tutorial consists of four Jupyter notebooks that cover the basics of Syft:
* [Uploading a private dataset as a Data Owner](./00-load-data.ipynb)
* [Submitting code to run analysis on the private dataset as a Data Scientist](./01-submit-code.ipynb)
* [Reviewing and approving the code as a Data Owner](./02-review-code-and-approve.ipynb)
* [Downloading/Retrieving the results of the code execution as a Data Scientist](./03-data-scientist-download-result.ipynb)

In Syft, a **Data Owner** provides datasets that they would like to make available for study by an outside party whose intentions they may or may not fully trust. **Data Scientists**, meanwhile, are end users who want to perform computations or answer specific questions using one or more Data Owners' datasets.

### Install Syft & Import packages

In [None]:
SYFT_VERSION = ">=0.8.2.b0,<0.9"
package_string = f'"syft{SYFT_VERSION}"'
%pip install {package_string} -f https://whls.blob.core.windows.net/unstable/index.html -q

In [None]:
import syft as sy
sy.requires(SYFT_VERSION)
from syft import autocache
import pandas as pd

### Launch a Syft Domain Server

In [None]:
# Launch a fresh domain server named "test-domain-1" in dev mode on the local machine
node = sy.orchestra.launch(name="test-domain-1", port="auto", dev_mode=True, reset=True)

In [None]:
# log into the node with default root credentials
domain_client = node.login(email="info@openmined.org", password="changethis")

In [None]:
# List the available API endpoints
domain_client.api

### Data Subjects

Think of Data Subjects as the individuals, organizations, or institutions owning a dataset that you can pool together privately in Syft.

For this notebook, we'll create a sample dataset that includes trade information of various commodities for different countries.

In [None]:
# Check for existing Data Subjects
data_subjects = domain_client.data_subject_registry.get_all()

In [None]:
data_subjects

In [None]:
assert len(data_subjects) == 0

### Add Data Subjects

In [None]:
country = sy.DataSubject(name="Country", aliases=["country_code"])

In [None]:
canada = sy.DataSubject(name="Canada", aliases=["country_code:ca"])
germany = sy.DataSubject(name="Germany", aliases=["country_code:de"])
spain = sy.DataSubject(name="Spain", aliases=["country_code:es"])
france = sy.DataSubject(name="France", aliases=["country_code:fr"])
japan = sy.DataSubject(name="Japan", aliases=["country_code:jp"])
uk = sy.DataSubject(name="United Kingdom", aliases=["country_code:uk"])
usa = sy.DataSubject(name="United States of America", aliases=["country_code:us"])
australia = sy.DataSubject(name="Australia", aliases=["country_code:au"])
india = sy.DataSubject(name="India", aliases=["country_code:in"])

In [None]:
country.add_member(canada)
country.add_member(germany)
country.add_member(spain)
country.add_member(france)
country.add_member(japan)
country.add_member(uk)
country.add_member(usa)
country.add_member(australia)
country.add_member(india)

country.members

In [None]:
# Adds the data subject and all its members to the registry
response = domain_client.data_subject_registry.add_data_subject(country)
response

In [None]:
assert response

In [None]:
# Let's look at the data subjects added to the registry
data_subjects = domain_client.data_subject_registry.get_all()
data_subjects

In [None]:
assert len(data_subjects) == 10
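
The registry holds 10 entries because `add_data_subject` stores the parent `Country` together with its nine members. The plain-Python sketch below mirrors that counting. `Subject` and `flatten` are hypothetical names used for illustration only; they are not part of the Syft API and do not show how Syft stores data subjects internally.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Subject:
    """Hypothetical stand-in for a data subject that may contain members."""

    name: str
    members: list[Subject] = field(default_factory=list)

    def add_member(self, member: Subject) -> None:
        self.members.append(member)


def flatten(subject: Subject) -> list[str]:
    """Return the subject's name followed by the names of all nested members."""
    names = [subject.name]
    for member in subject.members:
        names.extend(flatten(member))
    return names


country = Subject("Country")
for member_name in [
    "Canada", "Germany", "Spain", "France", "Japan",
    "United Kingdom", "United States of America", "Australia", "India",
]:
    country.add_member(Subject(member_name))

# One parent plus nine members
registered = flatten(country)
print(len(registered))
```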

### Prepare the dataset

For simplicity, we'll be working with Canada's trade dataset.

In [None]:
canada_dataset_url = "https://github.com/OpenMined/datasets/blob/main/trade_flow/ca%20-%20feb%202021.csv?raw=True"

In [None]:
df = pd.read_csv(autocache(canada_dataset_url))
df

In Syft, every dataset has two variants: **Mock** and **Private**.

* The **Mock** dataset is a dummy version of the private data that data scientists can access and read.
* The **Private** dataset is the actual data, which the data scientist never accesses.

To keep things simple, we sample different rows of the same data for the Mock and Private variants. In practice, you would generate a random dataset for the Mock variant.
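
One possible way to build a random Mock is to keep the private data's column names and value types while drawing every value at random. The sketch below uses only the standard library. The helper `make_mock_rows`, the column names, and the value ranges are all hypothetical; they are not taken from the trade dataset or from Syft.

```python
import random


def make_mock_rows(private_rows, seed=0):
    """Build random rows with the same keys and value types as private_rows.

    Numbers are drawn from an arbitrary range and strings are replaced by a
    generic placeholder, so no value is read from the private rows.
    """
    rng = random.Random(seed)
    mock_rows = []
    for i, row in enumerate(private_rows):
        mock_row = {}
        for key, value in row.items():
            if isinstance(value, bool):
                mock_row[key] = rng.choice([True, False])
            elif isinstance(value, int):
                mock_row[key] = rng.randint(0, 1_000_000)
            elif isinstance(value, float):
                mock_row[key] = round(rng.uniform(0.0, 1_000_000.0), 2)
            else:
                mock_row[key] = f"{key}_{i}"
        mock_rows.append(mock_row)
    return mock_rows


# Hypothetical private records (column names are assumptions)
private_rows = [
    {"commodity": "wheat", "trade_value_usd": 125000, "weight_kg": 830.5},
    {"commodity": "copper", "trade_value_usd": 98000, "weight_kg": 412.0},
]
mock_rows = make_mock_rows(private_rows)
```

With pandas available, `pd.DataFrame(mock_rows)` would turn the result into a DataFrame with the same columns as the private one.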

In [None]:
# Private data samples
ca_data = df[0:10]
ca_data

In [None]:
# Mock data samples
mock_ca_data = df[10:20]
mock_ca_data

### Create a Syft Dataset

In Syft, a `Dataset` is a collection of Assets. For example, a `Dataset` can be a "Lung Cancer Dataset", and its `Assets` would be the train, test, and validation splits of that dataset.
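
As a rough mental model, the hierarchy can be pictured with two small dataclasses. `SketchAsset` and `SketchDataset` are hypothetical names chosen for this illustration; the real `sy.Dataset` and `sy.Asset` classes carry more metadata and behave differently.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class SketchAsset:
    """Hypothetical asset: one named piece of data with an optional mock."""

    name: str
    data: Any = None
    mock: Optional[Any] = None


@dataclass
class SketchDataset:
    """Hypothetical dataset: a named collection of assets."""

    name: str
    assets: List[SketchAsset] = field(default_factory=list)

    def add_asset(self, asset: SketchAsset) -> None:
        self.assets.append(asset)

    def remove_asset(self, name: str) -> None:
        self.assets = [a for a in self.assets if a.name != name]


lung = SketchDataset("Lung Cancer Dataset")
for split in ("train", "test", "validation"):
    lung.add_asset(SketchAsset(name=split))

asset_names = [a.name for a in lung.assets]
print(asset_names)
```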

In [None]:
dataset = sy.Dataset(name="Canada Trade Value")

In [None]:
dataset.set_description("Canada Trade Data")

In [None]:
dataset.add_citation("Person, place or thing")
dataset.add_url("https://github.com/OpenMined/datasets/tree/main/trade_flow")

In [None]:
dataset.add_contributor(name="Andrew Trask", 
                        email="andrew@openmined.org",
                        note="Andrew runs this domain and prepared the dataset metadata.")

dataset.add_contributor(name="Madhava Jay", 
                        email="madhava@openmined.org",
                        note="Madhava tweaked the description to add the URL because Andrew forgot.")

In [None]:
dataset.contributors

In [None]:
assert len(dataset.contributors) == 2

### Add Assets to the Syft Dataset

In [None]:
ctf = sy.Asset(name="canada_trade_flow")
ctf.set_description("Canada trade flow represents export & import of different commodities to other countries")

In [None]:
ctf.add_contributor(name="Andrew Trask", 
                    email="andrew@openmined.org",
                    note="Andrew runs this domain and prepared the asset.")

In [None]:
# This is where we add the private data (pandas df/numpy array) to the `Asset`
ctf.set_obj(ca_data)

In [None]:
# We must set the shape of this private data
ctf.set_shape(ca_data.shape)

In [None]:
# We assign the data subject to whom this data belongs, in this case Canada
ctf.add_data_subject(canada)

In [None]:
# Optionally, if we don't want to add any Mock dataset
ctf.no_mock()

In [None]:
# We must add this Asset to our Dataset
dataset.add_asset(ctf)

In [None]:
# In case we want to remove an Asset from the Dataset
dataset.remove_asset(name=ctf.name)

In [None]:
# Let's assign the Mock data to the Asset by calling the `set_mock` method
ctf.set_mock(mock_ca_data, mock_is_real=False)

In [None]:
# Let's add our Asset back into our "Canada Trade Value" Dataset
dataset.add_asset(ctf)

### Upload Syft Dataset to Domain Server

In [None]:
domain_client.upload_dataset(dataset)

In [None]:
# We can list all the datasets on the Domain Server by invoking the following
datasets = domain_client.datasets.get_all()
datasets

In [None]:
assert len(datasets) == 1

In [None]:
datasets

### Reading the Syft Dataset from Domain Server

Following the logical hierarchy of `Dataset`, `Asset`, and its variants, we can read the data as follows:

In [None]:
# Reading the mock dataset
mock = domain_client.datasets[0].assets[0].mock

In [None]:
assert mock_ca_data.equals(mock)

In [None]:
# Reading the real dataset
# NOTE: Private data can be accessed by the Data Owners, but NOT the Data Scientists
real = domain_client.datasets[0].assets[0].data

In [None]:
assert ca_data.equals(real)
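
The access rule in the note above can be pictured as a role check on the private variant. This is a plain-Python illustration only: `GatedAsset`, its `read` method, and the role strings are hypothetical, and the sketch does not show Syft's actual permission system.

```python
class GatedAsset:
    """Hypothetical asset that exposes its mock to everyone but gates private data."""

    def __init__(self, data, mock):
        self._data = data
        self.mock = mock

    def read(self, role, variant="mock"):
        if variant == "mock":
            return self.mock
        if variant == "private":
            # Assumption for this sketch: only the role "data_owner" may read private data
            if role != "data_owner":
                raise PermissionError("only the data owner may read private data")
            return self._data
        raise ValueError(f"unknown variant: {variant!r}")


asset = GatedAsset(data=[1, 2, 3], mock=[7, 8, 9])
owner_view = asset.read("data_owner", variant="private")
scientist_view = asset.read("data_scientist", variant="mock")
```

A call such as `asset.read("data_scientist", variant="private")` raises `PermissionError` in this sketch.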

### Create a new Data Scientist account on the Domain Server

Signup is disabled by default. An admin or Data Owner can enable it by calling `domain_client.settings.allow_guest_signup(enable=True)`.

Refer to notebook [07-domain-register-control-flow](./07-domain-register-control-flow.ipynb) for more information.

In [None]:
domain_client.register(name="Jane Doe", email="jane@caltech.edu", password="abc123", institution="Caltech", website="https://www.caltech.edu/")

In [None]:
# Clean up the local domain server
if node.node_type.value == "python":
    node.land()