# PyGrid: Remote Inference - Data Scientist

<img src="../../../docs/img/pygrid_logo.png" align="center"/>

The ability to evaluate custom models on private datasets without having direct access to them is a powerful idea that will change the way we interact with data during a machine learning workflow. PySyft and PyGrid make it possible to run inference remotely by combining a variety of technologies and applications.

In this notebook series, we'll be covering all the nuances of this process, showing how to send private datasets _(as Data Owner)_, and how to perform remote computation using private environments _(as Data Scientist)_.

The main goal of these notebooks is to explore different techniques and technologies that can finally make **private data _"accessible"_ to Data Scientists, while also providing Data Owners with total control of their data.**

**NOTE**: _This notebook was designed to be run alongside the [PyGrid Remote Inference - Data Owner](./PyGrid%20Remote%20Inference%20-%20Data%20Owner.ipynb) notebook. To reproduce it properly, follow the checkpoints and instructions described in the next sections._

### Overview

- [**Creating User Accounts**](#creating-user-accounts)
- [**Remote Datasets**](#remote-datasets)
- [**Train a Local Model**](#train-local-model)
- [**PyGrid Workers**](#setting-computing-environment)
- [**Remote Inference + Differential Privacy**](#remote-inference)
- [**Data Retrieval**](#data-access-request)

<a id="creating-user-accounts"></a>
## Creating User Accounts

#### Import libs

In [None]:
from syft.grid.client.client import connect  # Method used to connect with the domain.
from syft.grid.client.grid_connection import (
    GridHTTPConnection,
)  # Protocol used to talk with the domain

import syft as sy
import torch as th

# Set logging level
import logging

logging.basicConfig(level=logging.INFO)

import pydp

sy.load("pydp")

#### Create User account
In this scenario, we're assuming that the data scientist will start from scratch.

In [None]:
PYGRID_PORT = 5000

In [None]:
# Since we still don't have our own account,
# we can connect with the domain without credentials.
unauthenticated_client = connect(
    url=f"http://localhost:{PYGRID_PORT}",  # Domain Address
    conn_type=GridHTTPConnection,
)  # HTTP Connection Protocol

unauthenticated_client.users.create(
    email="scientist@researchorg.edu", password="pwd123"
)

In [None]:
# Now we can finally log-in using our credentials.
domain_client = connect(
    url=f"http://localhost:{PYGRID_PORT}",  # Domain Address
    credentials={"email": "scientist@researchorg.edu", "password": "pwd123"},
    conn_type=GridHTTPConnection,
)  # HTTP Connection Protocol

Done! We now have a user account!

<a id="remote-datasets"></a>
## Remote Datasets

### Checking for available datasets
Now, let's take a look at the domain repository.

In [None]:
domain_client.datasets.all(pandas=True)

In [None]:
remote_dataset = domain_client.datasets["b76f9a38-ecc4-43ca-86cd-d232ee22cc7a"]

As we can see, we have a dataset available to be used. Datasets are robust grid structures designed to point to several pieces of remote data, following the structure chosen by whoever created them. They can represent CSV files, images, or even the abstraction of train/test datasets. In this example, we'll work with a dataset composed of several CSV files, _although they are no longer stored as CSV files inside the domain_.

### Exploring the metadata
As you probably know, we can't access the real values of the dataset. It's private and remote! However, we can explore its metadata to understand how the data has been organized.

#### Manifest
A dataset manifest is a document commonly used to describe what the data means. Here, we expect to learn the meaning of each column, its object type, and the purpose of the dataset.

In [None]:
print(remote_dataset.manifest)

#### Tags
Tags are commonly used to give a quick overview of the data.

In [None]:
remote_dataset.tags

#### Dataset Pandas
Used to understand how the dataset pointers are organized: their types, shapes, and names.

In [None]:
remote_dataset.pandas


_PS: At the time this notebook was written, the domain only supported compressed "tar.gz" files as datasets. Contact the author of this article to find out the current status of this feature._

<a id="train-local-model"></a>
## Train a Local Model

In [None]:
from diabetes_model_training import train_diabetes_model, plot_training_acc

model, loss, acc, epochs_list = train_diabetes_model(th)
plot_training_acc(acc, loss, epochs_list)

<a id="setting-computing-environment"></a>
## PyGrid Workers

PyGrid aims to provide a custom, private environment in which users can perform their computation. That way, users are empowered to choose their own computing resources. We currently support **Azure**, **GCP** and **AWS**. In this notebook we'll be using Azure as the cloud platform, since it provides an additional security layer based on a Trusted Execution Environment (TEE). That way, the data owner can protect their data both from the user and from the infrastructure where the domain lives.

#### Get Instance type
First, we need to know what instances are available to be deployed.

In [None]:
domain_client.workers.instance_type(pandas=True)

#### Create a worker
Once we have decided on the VM instance type, we can ask the domain to create one for us.

In [None]:
domain_client.workers.create(instance_type="t2.large")
domain_client.workers.all(pandas=True)

Then, with the worker deployed, we can get a proxy client which will be used to send messages to the environment through the domain.

In [None]:
worker = domain_client.workers[1]
print("Worker Provider: ", worker.provider)
print("Worker Instance Type: ", worker.instance_type)
print("Worker Region:", worker.region)
print("Worker Syft (Logic) Address:", worker.address)

#### Loading private dataset
OK, now we have a worker that can perform remote computation, but it's still empty. To transfer private datasets and tensors to it, we must use the _load_ method. This method takes a pointer to an object that lives in the domain and sends that object to our worker.

In [None]:
# 1 - Let's choose one of those dataset pointers to be our data sample during the remote inference
private_data_sample = remote_dataset.files[0]
print("Dataset Name: ", private_data_sample.name)
print("Dataset Shape: ", private_data_sample.shape)
print("Dataset Type: ", private_data_sample.dtype)
print("Dataset Pointer: ", private_data_sample.pointer)

# 2 - Then we can load it from the domain to our own worker.
domain_client.load(private_data_sample.pointer, worker.address)

# 3 - Finally, we can see the data inside the worker store
worker.store.pandas

#### Loading a model
Let's do the same thing with our model trained locally.

In [None]:
# PS: Since we're transferring the model from our own machine to our Virtual Machine,
# we'll be using "send" instead of "load".
remote_model = model.send(worker)

<a id="remote-inference"></a>
## Remote Inference + Differential Privacy
**Finally!** We have everything ready.

#### Running inference

In [None]:
# Splitting the private dataset into features and labels
feature = worker.store[0][0:, 0:8]
labels = worker.store[0][0:, 8]

predicted = remote_model(feature)

#### Computing accuracy
Now, we need to compare the predicted results with the private dataset labels.

In [None]:
acc = (predicted.reshape(-1).round() == labels).int().tolist()
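
To make the slicing and comparison above easier to follow, here is a minimal local sketch using ordinary `torch` tensors in place of the remote pointers. The table contents and the `predicted` values are made up for illustration; only the indexing and the comparison expression mirror the cells above, and remote pointers may not support every operation a local tensor does.

```python
import torch as th

# Hypothetical stand-in for the private table:
# 4 rows, 8 feature columns followed by 1 label column.
th.manual_seed(0)
local_features = th.rand(4, 8)
local_labels = th.tensor([1.0, 0.0, 1.0, 1.0])
data = th.cat([local_features, local_labels.unsqueeze(1)], dim=1)  # shape (4, 9)

# Same indexing as in the notebook cell above.
feature = data[0:, 0:8]  # all rows, columns 0..7 -> shape (4, 8)
labels = data[0:, 8]     # all rows, column 8     -> shape (4,)

# Hypothetical model outputs (one probability per row), shape (4, 1).
predicted = th.tensor([[0.9], [0.2], [0.4], [0.7]])

# Flatten, round to 0/1, and compare element-wise with the labels.
hits = (predicted.reshape(-1).round() == labels).int().tolist()
print(hits)                   # [1, 1, 0, 1]
print(sum(hits) / len(hits))  # 0.75
```

The result is a list with one entry per row (1 for a correct prediction, 0 otherwise), so its mean is the accuracy.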

#### Adding noise
To strengthen data privacy, we'll add a small amount of noise to our result using Differential Privacy techniques. <br><br>
Since we intend to compute the accuracy of our prediction, we can simply compute a private mean, which can vary from 0 to 1. For this example, we set our privacy budget to 0.8.

In [None]:
BoundedMean = worker.pydp.algorithms.laplacian.BoundedMean
mean_ptr = BoundedMean(0.8, lower_bound=0.01, upper_bound=1.0, dtype="float")

In [None]:
acc_result = mean_ptr.quick_result(acc)
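
For intuition about what a private bounded mean does, here is a simplified, standard-library-only sketch of the textbook Laplace mechanism. It is not PyDP's implementation, which may use a different and more careful construction internally. The function names, the example data, and the assumption that the number of records is public are all illustrative.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_bounded_mean(values, epsilon, lower, upper, rng):
    """Illustrative epsilon-DP mean via the Laplace mechanism.

    Assumes the record count is public, so replacing one record moves
    the clamped mean by at most (upper - lower) / n.
    """
    if not values:
        raise ValueError("values must not be empty")
    # Clamp every record into [lower, upper] to bound its influence.
    clamped = [min(max(v, lower), upper) for v in values]
    n = len(clamped)
    true_mean = sum(clamped) / n
    sensitivity = (upper - lower) / n
    noisy_mean = true_mean + laplace_noise(sensitivity / epsilon, rng)
    # Clamping the output is post-processing and does not weaken the guarantee.
    return min(max(noisy_mean, lower), upper)


# Hypothetical per-row hit/miss list with a true accuracy of 0.75.
hits = [1, 1, 0, 1] * 25
rng = random.Random(0)
noisy_acc = private_bounded_mean(hits, epsilon=0.8, lower=0.0, upper=1.0, rng=rng)
print(noisy_acc)
```

A smaller epsilon (a tighter privacy budget) increases the noise scale, while more records shrink it, which is why the reported accuracy stays close to the true value on large datasets.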

#### Saving the results
Before deleting our ephemeral instance, we must save the result of our computation. We can do this with the _save_ command, which sends data from our worker to the domain.

In [None]:
# 1 - Sending from worker to domain
worker.save(acc_result)

# 2 - Deleting our Virtual Machine
# del domain_client.workers[1]

<a id="data-access-request"></a>
## Data Retrieval
The last step is to ask the compliance officer for permission to access the data.

In [None]:
domain_client.store.pandas

In [None]:
acc_ptr = domain_client.store["0cbd040781a04559ba40e983723d8d2c"]

Done! Our accuracy result is now stored inside our domain. Now let's finally request access to it.

In [None]:
acc_ptr.request(reason="I'd like to have access to my accuracy result!")

### <img src="https://github.com/OpenMined/design-assets/raw/master/logos/OM/mark-primary-light.png" alt="he-black-box" width="100"/> Checkpoint: Now STOP and run the Data Owner notebook until the next checkpoint.

In [None]:
acc_ptr.get()

**_Voilà!_** This is our accuracy result!

## Congratulations!!! - Time to Join the Community!

Congratulations on completing this notebook tutorial! If you enjoyed this and would like to join the movement toward privacy-preserving, decentralized ownership of AI and the AI supply chain (data), you can do so in the following ways!

### Take our FREE Video Course "The Private AI Series"
Learn how privacy technology is changing our world and how you can lead the charge.
We cover non-technical concepts about structured transparency, and also dive deep into the technical aspects of various cryptographic technologies and how to use them with Syft and Grid.
* [📺 Video Course](https://courses.openmined.org/)

### Star PySyft and PyGrid on GitHub
The easiest way to help our community is just by starring the GitHub repos! This helps raise awareness of the cool tools we're building.

* [⭐️ Star PySyft](https://github.com/OpenMined/PySyft)
* [⭐️ Star PyGrid](https://github.com/OpenMined/PyGrid)

### Join our Slack!
The best way to keep up to date on the latest advancements is to join our community! You can do so by filling out the form at http://slack.openmined.org

### Join a Code Project!
The best way to contribute to our community is to become a code contributor! At any time you can go to the PySyft GitHub Issues page and filter for "Projects". This will show you all the top-level tickets, giving an overview of the projects you can join! If you don't want to join a project but would still like to do a bit of coding, you can also look for "one-off" mini-projects by searching for GitHub issues marked "good first issue".

* [PySyft Good First Issue Tickets](https://github.com/OpenMined/PySyft/labels/Good%20first%20issue%20%3Amortar_board%3A)
* [PyGrid Good First Issue Tickets](https://github.com/OpenMined/PyGrid/labels/good%20first%20issue)

### Donate
If you don't have time to contribute to our codebase, but would still like to lend support, you can also become a Backer on our Open Collective. All donations go toward our web hosting and other community expenses such as hackathons and meetups!

* [OpenMined's Open Collective Page](https://opencollective.com/openmined)