# Lesson 2: Remote Data Science Demo!

<b><u>Instructors</u></b>: Ishan Mishra, Phil Culliton


## Concept 1: Lesson Introduction

Welcome to Lesson 2!

In the previous lesson, we learned about Remote Data Science, its motivations, key technical components, and main terminology. 

In this lesson, we're going to jump right into experiencing Remote Data Science ourselves, learning how to be both Data Scientist and Data Owner along the way!

In the next 15 minutes, we'll give you a rapid run-through. First, as a Data Owner, you'll prepare a dataset, deploy a domain node, and upload that dataset to the domain node. Then, as a Data Scientist, you'll take this data and perform remote data science on it!

If any part of this demo doesn't seem clear right now, don't worry! We'll be walking you through each of these steps in more detail in subsequent lessons.

<hr>

## Concept 2: The Scenario

You work at a lab, and some people have been showing symptoms of COVID19. You ask everyone to take a COVID19 test, and collect the results. However, not everyone wants their test results to be known publicly. You want to respect everyone's privacy, but at the same time you want to know if there's been a big COVID outbreak at your lab!

You've heard about a brilliant scientist at Caltech named Dr. Sheldon Cooper. He says he can analyze your dataset without ever seeing it, and still tell you how many people in your lab tested positive for COVID19.

<br>

<b> Relevant information: </b>
- You're the owner of a COVID19 test result dataset. 
- You'll be spinning up a domain node and creating an account for a data scientist (Dr. Cooper).
- The dataset you're giving him is a binary array, such as: [0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1].
    - In this array, each entry represents a different individual who works in your lab and was tested for COVID19.
    - A zero means they tested negative, and a one means they unfortunately tested positive.

- The simple question you're asking him to evaluate is <b> how many people in the dataset have tested positive for COVID19? </b>
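
For reference, this is what the question amounts to when the array is directly visible. The snippet below is a minimal sketch in plain Python using the example array from the list above; `count_positive` is a hypothetical helper name and is not part of PySyft.

```python
# Example array from the scenario: 0 = negative test, 1 = positive test
test_results = [0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1]


def count_positive(results):
    """Count how many entries in a binary list are equal to 1."""
    return sum(1 for result in results if result == 1)


print(count_positive(test_results))  # 7
```

The point of the rest of this lesson is that Dr. Cooper should be able to answer the same question without ever having `test_results` on his own machine.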

<hr>

## Concept 3: Installation Steps

In the last concept, we described the scenario: we have a COVID dataset that we want a data scientist (Dr. Cooper) to be able to study, but we don't want to hand the dataset over to him. We've decided to launch a domain node to make this happen!

Before we launch a node, we need to install some dependencies, and before we do that, we need to create a virtual environment to install them into. Assuming you have conda installed ([Install Instructions](https://docs.anaconda.com/anaconda/install/index.html)), run the following commands:

<hr>

### <b>1. Create a Virtual Environment with Python 3.9</b>

```
conda create -n lab python=3.9
conda activate lab
```
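
To confirm that the activated environment really provides Python 3.9, you can run a quick check from the interpreter. This is a minimal sketch; `is_expected_python` is a hypothetical helper written for this example.

```python
import sys


def is_expected_python(version_info=sys.version_info, expected=(3, 9)):
    """Return True when the major and minor version match the expected pair."""
    return tuple(version_info[:2]) == tuple(expected)


# Prints True when run inside the "lab" environment created above
print(is_expected_python())
```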

<hr>

### <b>2. Download the PySyft and Hagrid libraries from PyPI: </b>

Once you've activated your environment, you're ready to install the two Python libraries we'll need for this lesson: syft and hagrid. You'll learn more about these tools later in the course, but for now it suffices to say that PySyft is a Python library for remote data science, and hagrid is a Python-based CLI tool for deploying domain nodes that PySyft can talk to. You can install them in your virtual environment using the following commands (note: run them in the same terminal where you ran the conda commands):

```
pip install --pre syft
pip install -U hagrid
```
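
If you want to verify that both packages landed in the active environment, the standard library can report their installed versions. This is a minimal sketch; `installed_version` is a hypothetical helper, and the exact version numbers you see will depend on when you install.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


for name in ("syft", "hagrid"):
    print(name, installed_version(name) or "not installed")
```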

### <b>3. Launch Docker, and ensure it has at least 8 GB of RAM available.</b>

- OSX:  <a href="https://www.docker.com/products/docker-desktop">Docker Desktop</a>

In order for hagrid to work correctly, you must have Docker installed and configured with at least 8 GB of RAM.

- Open the Docker Website
- Install Docker for your operating system (OSX and Windows should install Docker Desktop)
- Open/Launch Docker Desktop app if you're on OSX/Windows (see Linux guide if you're on Linux)
- Docker Desktop Users (OSX and Windows):
    - Click the docker icon at the top of your monitor (looks like a little whale)
    - Click "Preferences"
    - On the left panel, click "Resources"
    - Then drag the "Memory" dot until it says 8.00 GB (can go higher if you wish).
    - Then click "Apply & Restart"

- Linux: No configuration needed, Docker will automatically scale the amount of memory it uses.

Congratulations! You've successfully downloaded, installed, and configured all of the dependencies needed to launch a Domain node! Keep your terminal windows open and proceed to the next concept!

<hr>

## Concept 4: Launch Your Domain Node

### <b>4. Launch HAGrid</b>

Pick a name for your Domain node! In this case, let's use "local_node":
```
hagrid launch local_node
```

<hr>

### <b>5. Log in to the Domain!</b>


<i>Voila!</i> You now have a working domain node. To log in, go to <b>localhost:"port_number"</b> in your browser (port_number is 8081 by default).

The default username and password are as follows:

- Username: <b> info@openmined.org </b>
- PW: <b> changethis </b>
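
If the page does not load in your browser, it can help to check whether anything is listening on the port at all. The sketch below uses only the standard library; `is_port_open` is a hypothetical helper, and the default port of 8081 is taken from the text above.

```python
import socket


def is_port_open(host="localhost", port=8081, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False


print(is_port_open())
```

A result of `False` shortly after launching usually just means the containers are still starting, so it is worth retrying after a short wait.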

<hr>

### 6. Upload your Dataset to your Domain!

Open Python, and run the following:

In [32]:
from pathlib import Path
import json

def docker_desktop_memory():
    # Location of the Docker Desktop settings file on macOS
    path = Path.home() / "Library/Group Containers/group.com.docker/settings.json"

    # Read the settings and return the configured memory limit in MiB
    with open(path, 'r') as f:
        return json.load(f)['memoryMiB']

In [33]:
docker_desktop_memory()

2048
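
The value printed above is in MiB, so it can be compared directly against the memory requirement from step 3. This is a minimal sketch; `meets_memory_requirement` is a hypothetical helper, and it assumes the lesson's "8 GB" corresponds to 8 × 1024 MiB.

```python
REQUIRED_GIB = 8  # assumption: "8 GB" in this lesson is treated as 8 GiB


def meets_memory_requirement(memory_mib, required_gib=REQUIRED_GIB):
    """Return True when a memory limit in MiB is at least the required number of GiB."""
    return memory_mib >= required_gib * 1024


print(meets_memory_requirement(2048))  # False: 2048 MiB is only 2 GiB
print(meets_memory_requirement(8192))  # True
```

If the check comes back `False`, as it would for the 2048 shown above, raise the memory limit in Docker Desktop as described in step 3 before launching the node.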

In [1]:
import numpy as np
import syft as sy
from syft.core.adp.entity import DataSubject

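# Simulate 10 test results: 0 = negative, 1 = positive
raw_data = np.random.choice([0, 1], size=10).astype(np.int32)
dataset = {}

for person_index, test_result in enumerate(raw_data):
    # Each person is registered as a separate data subject
    data_subject = DataSubject(name=f'Patient #{person_index}')
    # Wrap the single result in a tensor annotated with its value bounds and data subject
    dataset[person_index] = sy.Tensor(np.ones(1, dtype=np.int32) * test_result).annotate_with_dp_metadata(
        lower_bound=0, upper_bound=1, entities=data_subject
    )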


domain_node = sy.login(email="info@openmined.org", password="changethis", port=8081)
# Upload the dataset to the domain node
domain_node.load_dataset(assets=dataset, name="COVID19 Test Results", description="Positive/Negative COVID19 Test results", metadata="No metadata")


Anyone can login as an admin to your node right now because your password is still the default PySyft username and password!!!

Connecting to http://localhost:8081... done! 	 Logging into canada... done!


In [2]:
import pandas as pd

In [3]:
datasets = domain_node.datasets.all()

In [4]:
dataset = datasets[0]

In [5]:
assets = {}
for raw_asset in dataset['data']:
    asset = list()
    asset.append(raw_asset['shape'])
    asset.append(raw_asset['dtype'])
    assets[raw_asset['name']] = asset
dataset['data'] = None
dataset['assets'] = assets

In [6]:
domain_node.datasets

| Idx | Name | Description | Assets | Id |
|---|---|---|---|---|
| [0] | COVID19 Test Results | Positive/Negative COVID19 Test results | ["0"] -> Tensor ["1"] -> Tensor ["2"] -> Tensor ... | 1d0a43e4-66c9-4648-964c-a39c42fab35b |


In [7]:
domain_node.datasets[0]

Dataset: COVID19 Test Results
Description: Positive/Negative COVID19 Test results

Use `assets = my_dataset.assets` to receive a dictionary you can parse through using Python (as opposed to blowing up your notebook with a massive printed table).

| Asset Key | Type | Shape |
|---|---|---|
| ["0"] | Tensor | (1,) |
| ["1"] | Tensor | (1,) |
| ["2"] | Tensor | (1,) |
| ["3"] | Tensor | (1,) |
| ["4"] | Tensor | (1,) |
| ["5"] | Tensor | (1,) |
| ["6"] | Tensor | (1,) |
| ["7"] | Tensor | (1,) |
| ["8"] | Tensor | (1,) |
| ["9"] | Tensor | (1,) |


<hr>

### 7. Create an account for a Data Scientist
Create a data scientist account. This is who will be using the dataset you've just uploaded to your domain node. (Dr. Sheldon Cooper, in our scenario!)

In [70]:
domain_node.users.create(
    **{
        "name": "Sheldon Cooper",
        "email": "sheldon@caltech.edu",
        "password": "bazinga",
        "budget": 100
    }
)

Let's quickly double check that the Data Scientist account was created properly, by checking the list of users:

In [71]:
domain_node.users

| | added_by | allocated_budget | budget | budget_spent | created_at | daa_pdf | email | id | institution | name | role | verify_key | website |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | `<syft.lib.python._SyNone object at 0x7feaa814d...` | 0.0 | 5.55 | 0.0 | 2021-11-01 05:24:06.991358 | `<syft.lib.python._SyNone object at 0x7feaa814d...` | info@openmined.org | 1 | `<syft.lib.python._SyNone object at 0x7feaa814d...` | Jane Doe | Owner | e07db5d214010770d9146551d4dc4f3daf1bb7d9c89e66... | `<syft.lib.python._SyNone object at 0x7feaa814d...` |
| 1 | Jane Doe | 0.0 | 100.0 | 0.0 | 2021-11-01 05:27:59.352133 | 1 | sheldon@caltech.edu | 2 | | Sheldon Cooper | Data Scientist | eff162ee3f57400fb11d204cdbfd625d58b229901fdb23... | |


<hr>

Congratulations! You've successfully deployed a domain node, loaded in data, and created a user account! In the next concept, you're going to pretend to be Sheldon Cooper and do some remote data science!

## Concept 5: Remote Data Science

### 8. Log in to the Domain Node as the Data Scientist

In [72]:
ds_node = sy.login(email="sheldon@caltech.edu", password="bazinga", port=8081)

Connecting to http://localhost:8081... done! 	 Logging into adp... done!


<hr>

### 9. View the available datasets on the Node

In [73]:
ds_node.datasets

| Idx | Name | Description | Assets | Id |
|---|---|---|---|---|
| [0] | COVID19 Test Results | Positive/Negative COVID19 Test results | ["0"] -> Tensor ["1"] -> Tensor ["2"] -> Tensor ["3"] -> Tensor ["4"] -> Tensor ["5"] -> Tensor ["6"] -> Tensor ["7"] -> Tensor ["8"] -> Tensor ["9"] -> Tensor | 3de3d35a-5c22-44e8-8aef-b24df4c2c07d |


In [74]:
# Let's get a pointer to the dataset
dataset = ds_node.datasets[0]

<b> Let's try to find out how many people in this dataset had COVID19... </b>

In [79]:
dataset

Dataset: COVID19 Test Results
Description: Positive/Negative COVID19 Test results

| Asset Key | Type | Shape |
|---|---|---|
| ["0"] | Tensor | (1,) |
| ["1"] | Tensor | (1,) |
| ["2"] | Tensor | (1,) |
| ["3"] | Tensor | (1,) |
| ["4"] | Tensor | (1,) |
| ["5"] | Tensor | (1,) |
| ["6"] | Tensor | (1,) |
| ["7"] | Tensor | (1,) |
| ["8"] | Tensor | (1,) |
| ["9"] | Tensor | (1,) |


Let's see whether we, as the Data Scientist working with this remote dataset, can find out what it looks like by printing it.

In [75]:
print(dataset)

<syft.core.node.common.client_manager.dataset_api.Dataset object at 0x7fea9a251040>


Now, normally, if we had a Tensor object (such as the ones from PyTorch or TensorFlow), we'd be able to print it and look at it. However, as you can see, printing only shows a reference to a Dataset object, not the test results themselves.

In [77]:
results = [dataset[f'{i}'] for i in range(10)]

In [82]:
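from time import sleep

total_cases = 0
for result in results:
    # Ask the domain node to publish a result for this asset
    ptr = result.publish()
    # Give the domain node a moment to process the request
    sleep(1)
    # Retrieve the published value and add it to the running total
    total_cases += ptr.get()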

In [84]:
print(f'The total number of COVID19 cases is: {total_cases[0]}')

The total number of COVID19 cases is: 6.811603905014408


So as you can see, Dr. Cooper was able to estimate the number of people in your lab who tested positive for COVID. The total is not a whole number because the published results are approximate rather than exact. More incredibly, he did this <b>without knowing any one individual's test result!</b>

This is the power of remote data science, and of the PySyft framework. <b> We're able to work with and get the benefits of remote data, without directly owning it, or exposing anyone's privacy.</b>
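
A non-integer total like the one above is what you would expect if a small amount of random noise is added to each published value. The sketch below illustrates that idea in plain Python. It assumes Gaussian noise with a standard deviation of `sigma`, and both `noisy_count` and its parameters are hypothetical names chosen for this example; it is not PySyft's actual implementation.

```python
import random


def noisy_count(results, sigma, seed=None):
    """Sum binary results after adding independent Gaussian noise to each one."""
    rng = random.Random(seed)  # a seed makes the illustration reproducible
    total = 0.0
    for value in results:
        total += value + rng.gauss(0.0, sigma)
    return total


test_results = [0, 1, 0, 0, 1, 1, 1, 0, 1, 1]
print(sum(test_results))                             # exact count: 6
print(noisy_count(test_results, sigma=0.3, seed=0))  # close to 6, but not exact
```

The noisy total is still useful for answering "was there a big outbreak?", while any single published value says much less about the individual behind it.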

<hr>

## Concept 6: Shut down your Domain Node
Open a new terminal window with the virtual environment active:

```
conda activate lab
```

Run this command in your terminal:

```
hagrid land local_node
```

<hr>

## Concept 7: Lesson Conclusion

# <marquee> CONGRATULATIONS! </marquee>
In just a few minutes, you spun up a domain node, created and uploaded a dataset, and then as a data scientist, you used that dataset to perform remote data science!

In the next three lessons, you'll peer behind the curtain and see how all of this comes together. You'll be guided, step by step, through working with more complex and sophisticated datasets, preparing them for upload, and managing and configuring your domain node to your liking. Finally, you'll get to do more interesting remote data science!

<hr>
