## Overview

In remote data science, data science and machine learning tasks are carried out collaboratively by two main actors:
 - Data Owner
 - Data Scientist
 
The network that facilitates this collaboration is called **Domain Server** or **Domain Node**. 
 
As the name suggests, a **Data Owner** owns the data that will be used for remote data science. They make their datasets available for study in a secured and protected way to an outside party they may or may not fully trust to have good intentions. This outside party is usually a data scientist.

**Data Scientists** are end users who desire to perform computations or answer a specific question using one or more data owners' secured datasets with remote execution. 

It is important to note that the people charged with protecting data (Data Owners) should not be the same people whose job is to extract meaningful value from it (Data Scientists). If they were the same person, the conflict of interest would provide too many opportunities for things to go wrong.

This tutorial will demonstrate how a Data Owner can launch their own Domain Server and securely host private datasets.

![caption](./files/big-arch.png)

## Required Tools

There are three main components that work together to orchestrate the remote data science between a data owner and data scientists. 

- PySyft: Privacy-Preserving Library

- PyGrid: Networking and Management Platform

- HAGrid: Deployment and Command Line Tool

**PySyft** is the main library. It contains a set of data serialization and remote code execution APIs that mimic popular data science tools and work interchangeably with popular data types. One way it does this is by providing a special Proxy object in Python that acts like a network pointer to a remote object on a Domain Server. These pointers look, act, and feel just like real objects, but cannot be copied or viewed without special permissions. PySyft therefore enables data science operations to be executed without sending in raw code or copying data from the data owners' server. The Python package for PySyft is called `syft`.

**PyGrid** is the server component of PySyft. PyGrid nodes are referred to by their type, e.g. domain server or network server. These are cross-platform servers that run where the data lives (i.e. on the data owners' premises).

Finally, **HAGrid** is a handy command-line (`cli`) tool that takes care of the heavy lifting in the background and makes deploying a Domain or Network server easy. It also comes with an interactive UI. The Python package for HAGrid is called `hagrid`.

## Getting Started

The purpose of this tutorial is to help you install everything you need to run a Domain Node on your personal machine (`localhost`) or on Azure. We will also install everything you need to run Jupyter notebooks with PySyft, for example if you are playing both Data Owner and Data Scientist as part of an experiment.

We will be setting up the following dependencies before installing `syft` and `hagrid`.

- Python >=3.9
- pip
- Conda
- Jupyter notebook
- Docker
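
To see which of these prerequisites are already present, a small check like the one below may help. It is only a sketch: the command names in `REQUIRED_TOOLS` are assumptions about how each tool appears on your `PATH`, and finding a command says nothing about its version.

```python
import shutil

# Assumed command names for the prerequisites listed above.
REQUIRED_TOOLS = ["python3", "pip", "conda", "jupyter", "docker"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Still to install:", ", ".join(missing))
else:
    print("All prerequisite commands were found on PATH.")
```

Anything reported as missing is covered by one of the installation steps that follow.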

### Installation on Linux

This section helps you install the prerequisites and deploy a Domain Node on Ubuntu Linux `20.04.03` or newer in the simplest way possible. If you use a distribution other than Ubuntu, replace `apt` and `apt-get` with your package manager.

#### 1. Launching a Terminal Instance
We will use the Linux Terminal to install all the prerequisites and launch the domain. A quick way to launch the terminal is by pressing Ctrl+Alt+T.

#### 2. Installing Python 3.9 or newer
We’ll be working with Python 3.9 or newer. To check if you have it installed, you may run:

`python3 --version`

Your output should look something like Python `3.x.y` where x>=9.
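
The same check can be made from inside Python, which is handy in a notebook. This is a minimal sketch; `MIN_VERSION` and `meets_minimum` are illustrative names, not part of PySyft.

```python
import sys

MIN_VERSION = (3, 9)  # minimum Python version stated in this tutorial


def meets_minimum(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True when (major, minor) is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


status = "OK" if meets_minimum() else "too old for this tutorial"
print(sys.version.split()[0], status)
```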

If you don’t have the correct version of Python, installing it is as easy as running the following:

```
sudo apt update
sudo apt install python3.9
python3 --version
```

#### 3. Installing and using Pip

Pip is the most widely used package installer for Python and makes installing the required dependencies much easier. You can install it by running:

`python3 -m ensurepip --upgrade`

If you already have it installed, you can upgrade it to the latest version by running:

`python3 -m pip install --upgrade pip`

If pip is already up to date, your output should look something like `Requirement already satisfied: pip in <package-dir>.`

#### 4. Conda and setting up a virtual environment

Conda is a package manager that makes it easy to install many data science and machine learning packages, and to create an isolated environment when a particular set of dependencies is needed. To install Conda, you can:

a. Download the [Anaconda installer](https://www.anaconda.com/products/individual#Downloads).

b. Run the following code, modifying it depending on where you downloaded the installer (e.g. `~/Downloads/`):

`bash ~/Downloads/Anaconda3-2020.02-Linux-x86_64.sh`

The naming might be different given it could be a newer version of Anaconda.

c. Create a new environment with a specific Python version (we recommend Python 3.9) in the terminal:

```
conda create -n syft_env python=3.9
conda activate syft_env
```

d. To exit the env, you can run:

`conda deactivate`
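
Before installing packages, it can help to confirm that `syft_env` is the active environment. The sketch below relies on the `CONDA_DEFAULT_ENV` environment variable, which `conda activate` sets. The helper names are illustrative.

```python
import os


def active_conda_env(environ=os.environ):
    """Return the name of the active Conda environment, or None if none is active."""
    return environ.get("CONDA_DEFAULT_ENV")


def in_expected_env(expected="syft_env", environ=os.environ):
    """Return True when the active Conda environment is `expected`."""
    return active_conda_env(environ) == expected


print("Active Conda environment:", active_conda_env())
print("Running inside syft_env:", in_expected_env())
```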

#### 5. Install Jupyter Notebook

A very convenient way to interact with a deployed node is via Python, using a Jupyter Notebook. You can install it by running:

`pip install jupyterlab`

If you encounter issues, you can also install it using Conda:

`conda install -c conda-forge notebook`

To launch the Jupyter Notebook, you can run the following in your terminal:

`jupyter notebook`

#### 6. Installing and configuring Docker

[Docker](https://docs.docker.com/get-started/overview/) is a framework that lets us package the infrastructure needed to run PySyft in an isolated environment called a `container`, which you can use off the shelf without many concerns. If it sounds complicated, please don’t worry: we will walk you through all the steps, and you’ll be done in no time! Additionally, we will use [Docker Compose V2](https://docs.docker.com/compose/), which allows us to run multi-container applications.

a. Install Docker by running this command:

`sudo apt-get update && sudo apt-get install -y docker.io && sudo docker run hello-world`

b. Install Docker Compose V2 as described [here](https://docs.docker.com/compose/cli-command/#installing-compose-v2).

c. Run the below command to verify the install:

`docker compose version`

You should see something like `Docker Compose version 2.x.y` in the output when running the above command.
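
If you would rather script this check, the sketch below parses the version string and reports whether the major version is 2 or higher. The regular expression is an assumption about the output format: it accepts both `2.x.y` and `v2.x.y`. The function names are illustrative.

```python
import re
import subprocess


def parse_compose_major(output):
    """Extract the major version from `docker compose version` output, or None."""
    match = re.search(r"version\s+v?(\d+)\.(\d+)\.(\d+)", output)
    return int(match.group(1)) if match else None


def compose_v2_available():
    """Run `docker compose version` and report whether it is V2 or newer."""
    try:
        result = subprocess.run(
            ["docker", "compose", "version"],
            capture_output=True, text=True, check=False, timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        # docker is not installed, not executable, or did not answer in time
        return False
    major = parse_compose_major(result.stdout)
    return major is not None and major >= 2


print("Compose V2 available:", compose_v2_available())
```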

d. If you see something else, go through the [instructions here](https://www.rockyourcode.com/how-to-install-docker-compose-v2-on-linux-2021/), or try installing the plugin manually:

```
mkdir -p ~/.docker/cli-plugins
curl -sSL https://github.com/docker/compose/releases/download/v2.2.3/docker-compose-linux-x86_64 -o ~/.docker/cli-plugins/docker-compose
chmod +x ~/.docker/cli-plugins/docker-compose
```

e. Also, make sure you can run without sudo:

```
echo $USER  # should print your username
sudo usermod -aG docker $USER  # log out and back in for the group change to apply
```
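
To confirm the group change, a check along these lines may help. It is a sketch that uses the Unix-only `grp` module. It reads the system group database, so it can report membership before your current session has picked it up, and it only sees users listed explicitly as members of the group.

```python
import grp
import os


def user_in_group(user, group="docker"):
    """Return True if `user` is listed as a member of `group`.

    Returns False when the group does not exist on this machine.
    """
    try:
        return user in grp.getgrnam(group).gr_mem
    except KeyError:
        return False


current_user = os.environ.get("USER", "")
print(f"{current_user!r} in docker group:", user_in_group(current_user))
```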

#### 7. Install PySyft and HAGrid
Finally, to install the OpenMined stack that you need in order to deploy a node, please run:

`pip install -U syft hagrid --pre`
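
To confirm that both packages landed in the active environment, you could query their installed versions with the standard library. This is a minimal sketch; `installed_version` is an illustrative helper, not part of either package.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("syft", "hagrid"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```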

## Launch a Domain Server as a Data Owner

The concept of Remote Data Science starts with a server-based model called a Domain Server. It allows data owners to upload their private data to the server and create accounts, each with a username and password, for Data Scientists.

To reiterate, the advantage of using a Domain Server is that, as a data owner, you can catalyze the impact of your dataset by allowing a Data Scientist to:

- get answers only to the types of questions you allow
- get those answers without directly accessing or copying your data

![caption](https://openmined.github.io/PySyft/_images/00-deploy-domain-00.gif)

To launch a domain node, there are three things that you need to know:
    
1. **What type of node do you need to deploy?** There are two different types of nodes: `Domain Node` and `Network Node`. By default, HAGrid launches the primary node that is our Domain Node.


2. **Where are you going to launch this node to?** We need to specify that we want to launch it to a Docker container at port `8081`.


3. **What is the name of your Domain Node going to be?** For that, please specify the DOMAIN_NAME to your preference.

Now the final step is to launch a domain server. For that please follow these steps:

1. Start Docker (from terminal or using Docker Desktop)
2. Run the following one-line command:

In [5]:
DOMAIN_NAME = 'test_domain' # edit DOMAIN_NAME as per your preference

!hagrid launch {DOMAIN_NAME} domain to docker:8081 --tag=latest --tail=false


✅ Updated HAGrid from branch: dev
✅ Docker service is running
✅ Git 2.32.1
✅ Docker 20.10.17
✅ Docker Compose 2.7.0


 _   _       _     _                 _   _                       _
| | | |     | |   | |               | | | |                     | |
| |_| | ___ | | __| |   ___  _ __   | |_| | __ _ _ __ _ __ _   _| |
|  _  |/ _ \| |/ _` |  / _ \| '_ \  |  _  |/ _` | '__| '__| | | | |
| | | | (_) | | (_| | | (_) | | | | | | | | (_| | |  | |  | |_| |_|
\_| |_/\___/|_|\__,_|  \___/|_| |_| \_| |_/\__,_|_|  |_|   \__, (_)
                                                            __/ |
                                                           |___/
        
Launching a PyGrid Domain node on port 8081!

  - NAME: test_domain
  - RELEASE: production
  - TYPE: domain
  - DOCKER_TAG: latest
  - HAGRID_VERSION: 
  - PORT: 8081
  - DO

While this command runs, you will see various `volumes` and `containers` being created. Once this step is complete, let's move on to the next step, where we will learn to monitor the health of our Domain Server.
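
For scripted deployments outside a notebook, the launch command shown above can be assembled from its three inputs (node name, node type, and port). The sketch below only builds the argument list and uses only the flags that appear in this tutorial. `build_launch_command` is an illustrative helper, not part of HAGrid.

```python
import shlex


def build_launch_command(name, node_type="domain", port=8081, tag="latest"):
    """Build the `hagrid launch` argument list used in this tutorial."""
    if not 1 <= int(port) <= 65535:
        raise ValueError(f"port out of range: {port}")
    return [
        "hagrid", "launch", name, node_type,
        "to", f"docker:{port}",
        f"--tag={tag}", "--tail=false",
    ]


command = build_launch_command("test_domain")
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once Docker is running.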

## Monitor Domain Server

One exciting benefit of HAGrid is that it makes it easier for your organization or IT department to monitor and maintain the status of your system as you move forward with other steps. Let’s do a quick health check to ensure the Domain is up and running.

In [11]:
!hagrid check localhost:8081

┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━┓
┃ PyGrid    ┃ Info                                      ┃    ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━┩
│ UI (βeta) │ http://localhost:8081/login               │ ✅ │
│ api       │ http://localhost:8081/api/v1/openapi.json │ ✅ │
└───────────┴───────────────────────────────────────────┴────┘


If your output is similar to the table above, voilà! Your very own Domain Server was just born. Depending on where the node is deployed, the output may include the following entries (the local Docker deployment above shows only `UI (βeta)` and `api`):

- `host`: IP address of the launched Domain Node.

- `UI (βeta)`: Link to an admin portal that allows you to control Domain Node from a web browser.

- `api`: Application layer that we run in our notebooks to make the experience more straightforward and intuitive.

- `ssh`: Key to get into virtual machine.

- `jupyter`: Notebook environment you will use to upload your datasets.

Congratulations 👏 You have now successfully deployed a Domain Server!
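
The same health check can be approximated from Python by requesting the two URLs reported by `hagrid check`. This is a rough sketch using only the standard library: it treats any successful HTTP response as "up" and does not replace HAGrid's own check. `endpoint_is_up` is an illustrative helper.

```python
import http.client
import urllib.error
import urllib.request

# URLs taken from the `hagrid check` output above.
ENDPOINTS = {
    "UI (βeta)": "http://localhost:8081/login",
    "api": "http://localhost:8081/api/v1/openapi.json",
}


def endpoint_is_up(url, timeout=3.0):
    """Return True if `url` answers with a successful HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or malformed URL
        return False


for name, url in ENDPOINTS.items():
    print(f"{name}: {'up' if endpoint_is_up(url) else 'not reachable'}")
```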

## Uploading Private Data to a Domain Server

At this point, you have successfully deployed a Domain Server that represents your organization’s private data server. Now, as promised, you can upload your private data to the Domain Server and make it securely available for remote data science. In this section, you will learn how to upload data to your new Domain Server, which involves preprocessing (ETL) and annotating the data before uploading.

#### Steps to Upload Private Data

 - Preprocessing of Data

 - Marking it with correct metadata

 - Uploading data to Domain Server

![caption](https://openmined.github.io/PySyft/_images/01-upload-data-00.jpg)

### Import Syft

To utilize the privacy-enhancing features offered in PyGrid and to communicate with your domain server, first you need to import `syft`.

In [12]:
try:
    import syft as sy
    print("Syft is imported")

except ImportError:
    print("Syft is not installed. Please follow the Getting Started section above.")

Syft is imported


### Log into Domain

By default, only the Domain Server admin can upload data, so to upload your data you first need to log in as the admin. Upload permissions can be customized after logging into the Domain Server.

To log in to your Domain Server, you need to specify which Domain you are logging into and who you are. In this case, that takes the form of:

- IP Address of the domain host

- Your user account Email and Password

**WARNING**: Please change the default username and password below to a more secure and private combination of your preference.

In [13]:
try:
    domain_client = sy.login(
        port=8081,
        email="info@openmined.org",
        password="changethis",
    )
except Exception as e:
    # Show the underlying error alongside the hint
    print(f"Unable to login ({e}). Please check your domain is up with `!hagrid check localhost:8081`")



Anyone can login as an admin to your node right now because your password is still the default PySyft username and password!!!

Connecting to localhost... done! 	 Logging into test_domain... done!

Version on your system: 0.7.0-beta.59
Version on the node: 0.7.0-beta.62



You have just logged in to your Domain! It is highly recommended to change the credentials before moving forward. You can do so directly from the server UI, at the address listed as `UI (βeta)` in the Monitor Domain Server section. The steps to change the default admin credentials as a Data Owner are shown below.
![caption](https://openmined.github.io/PySyft/_images/01-upload-data-01.gif)
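
Once the credentials are changed, one way to keep them out of notebooks is to read them from environment variables. The sketch below is only an illustration: the variable names `DOMAIN_ADMIN_EMAIL` and `DOMAIN_ADMIN_PASSWORD` are hypothetical, and the defaults are the ones shown in this tutorial.

```python
import os

# Defaults shown in this tutorial; they should be changed on a real node.
DEFAULT_EMAIL = "info@openmined.org"
DEFAULT_PASSWORD = "changethis"


def load_credentials(environ=os.environ):
    """Read login details from (hypothetical) environment variables.

    Returns (email, password, uses_default_password).
    """
    email = environ.get("DOMAIN_ADMIN_EMAIL", DEFAULT_EMAIL)
    password = environ.get("DOMAIN_ADMIN_PASSWORD", DEFAULT_PASSWORD)
    return email, password, password == DEFAULT_PASSWORD


email, password, uses_default_password = load_credentials()
if uses_default_password:
    print("Warning: still using the default admin password.")
```

The returned `email` and `password` can then be passed to `sy.login(port=8081, email=email, password=password)`, the same call used in the login cell.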

### Preprocessing of Data

For this tutorial, we will use a simple dataset of four people's `ages` and `hourly income`.

In [14]:
try:
    import pandas as pd
    data = {'ID': ['011', '015', '022', '034'],
            'Age': [40, 39, 9, 8],
            'Hourly Income': [20, 25, 32, 18]}

    dataset = pd.DataFrame(data)
    print(dataset.head())

except ImportError:
    print("Install the latest version of Pandas using the command: !pip install pandas")

    ID  Age  Hourly Income
0  011   40             20
1  015   39             25
2  022    9             32
3  034    8             18


### Marking data with correct metadata

Now that we have our dataset, we need to annotate it with privacy-specific metadata called `Auto DP metadata`. Auto DP metadata allows the PySyft library to protect our data subjects and to adjust how much visibility different Data Scientists have into any one of them. Data subjects are the entities whose privacy we want to protect. In this case, they are the four individual people.

To protect the privacy of the people in our dataset, we first need to specify who those people are. In this example, we have created a column with a unique ID for each person in the dataset.

#### Important Steps

- Data subjects are the entities whose privacy we want to protect.

- Each feature needs appropriate minimum and maximum bounds.

- When defining the min and max values, we are defining the theoretical range of values that the feature could take, not the range observed in the data.

- To help obscure what someone may learn about these datasets, we set `lower_bound` to the lowest possible age of a person (0) and `upper_bound` to the highest plausible age (100). The same procedure should be followed for the hourly income data.

If your project has a `training set`, `validation set` and `test set`, you must annotate each data set with Auto DP metadata.
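
Before annotating, it may be worth checking that every value falls inside the bounds you are about to declare and that each data subject ID is unique. The sketch below uses plain Python lists with the values from this tutorial's dataset. The helper names are illustrative and not part of PySyft.

```python
def within_bounds(values, lower_bound, upper_bound):
    """Return True if every value lies in [lower_bound, upper_bound]."""
    return all(lower_bound <= value <= upper_bound for value in values)


def ids_are_unique(ids):
    """Return True if no data subject ID appears more than once."""
    return len(set(ids)) == len(ids)


# Values from the example dataset in this tutorial.
ids = ["011", "015", "022", "034"]
ages = [40, 39, 9, 8]
hourly_income = [20, 25, 32, 18]

print("IDs unique:", ids_are_unique(ids))
print("Ages within [0, 100]:", within_bounds(ages, 0, 100))
print("Hourly income within [10, 500]:", within_bounds(hourly_income, 10, 500))
```

The bounds themselves should come from what is theoretically possible for the feature, not from the minimum and maximum observed in the data.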

In [15]:
data_subjects = sy.DataSubjectArray.from_objs(dataset["ID"])

age_data = sy.Tensor(dataset["Age"]).annotate_with_dp_metadata(
   lower_bound=0, upper_bound=100, data_subjects=data_subjects
)
hourly_income_data = sy.Tensor(dataset["Hourly Income"]).annotate_with_dp_metadata(
   lower_bound=10, upper_bound=500, data_subjects=data_subjects
)

Tensor annotated with DP Metadata!
You can upload this Tensor to a domain node by calling `<domain_client>.load_dataset` and passing in this tensor as an asset.
Tensor annotated with DP Metadata!
You can upload this Tensor to a domain node by calling `<domain_client>.load_dataset` and passing in this tensor as an asset.


### Uploading Data to Domain Server

Once you have prepared your data, it’s time to upload it to the Domain Server. To help Data Scientists later search and discover our datasets, we will add details like a name and a description of what this dataset represents.

If your project has a train, validation and test set, you need to add them as `assets`. In this case, `Age` and `Hourly Income` columns are assets.

In [16]:
domain_client.load_dataset(
   name="Age_Income_Dataset",
   assets={
      "Age_Data": age_data,
      "Hourly_Income": hourly_income_data
   },
   description="Our dataset contains the Ages and Hourly Incomes of four employees with unique ID's. There are 3 columns and 4 rows in our dataset."
)

Loading dataset...Loading dataset... checking assets...Loading dataset... checking dataset name for uniqueness...Loading dataset... checking dataset name for uniqueness...                                                                                                                    Loading dataset... checking asset types...                              Loading dataset... uploading...🚀                        

Uploading `Age_Data`: 100%|[32m███████████████████████████████████████████[0m| 1/1 [00:00<00:00, 58.35it/s][0m
Uploading `Hourly_Income`: 100%|[32m█████████████████████████████████████[0m| 1/1 [00:00<00:00, 128.80it/s][0m


Dataset is uploaded successfully !!! 🎉

Run `<your client variable>.datasets` to see your new dataset loaded into your machine!


### Checking the Dataset

To check the dataset you uploaded to the Domain Server, please run the below command, and it will list all the datasets on this Domain with their Names, Descriptions, Assets, and Unique IDs.

In [19]:
domain_client.datasets

| Idx | Name | Description | Assets | Id |
|---|---|---|---|---|
| [0] | Customer data in a Mall in Canada | This dataset contains information about 200 customers from a Mall in Canada. Columns include Age, Annual Income (k$), Spending Score (1-100), Male, Female | ["age"] -> Tensor ["income"] -> Tensor ["spend"] -> Tensor ... | b9619f8b-4543-412d-8dee-3a969f9c598f |
| [1] | Customer data in a Mall in Canada | This dataset contains information about 200 customers from a Mall in Canada. Columns include Age, Annual Income (k$), Spending Score (1-100), Male, Female | ["age"] -> Tensor ["income"] -> Tensor ["spend"] -> Tensor ... | b03741fe-a7e1-4c0d-bb97-dc5941b6402a |
| [2] | Customer data in a Mall in Canada | This dataset contains information about 200 customers from a Mall in Canada. Columns include Age, Annual Income (k$), Spending Score (1-100), Male, Female | ["age"] -> Tensor ["income"] -> Tensor ["spend"] -> Tensor ... | 96a79430-6a3b-4d24-ba88-ae5881bf6b6a |
| [3] | Customer data in a Mall in Canada | This dataset contains information about 200 customers from a Mall in Canada. Columns include Age, Annual Income (k$), Spending Score (1-100), Male, Female | ["age"] -> Tensor ["income"] -> Tensor ["spend"] -> Tensor ... | 9fc2dc1c-2ce6-4153-a32e-b636ba36d475 |
| [4] | Age_Income_Dataset | Our dataset contains the Ages and Hourly Incomes of four employees with unique ID's. There are 3 columns and 4 rows in our dataset. | ["Age_Data"] -> Tensor ["Hourly_Income"] -> Tensor | 5fa28a75-2f5c-42c3-a2f0-7f387a648d23 |


In [20]:
domain_client.datasets[-1]

Dataset: Age_Income_Dataset
Description: Our dataset contains the Ages and Hourly Incomes of four employees with unique ID's. There are 3 columns and 4 rows in our dataset.



| Asset Key | Type | Shape |
|---|---|---|
| ["Age_Data"] | Tensor | (4,) |
| ["Hourly_Income"] | Tensor | (4,) |


Awesome 👏 !! You have uploaded the dataset onto your Domain Server!

