# PyGrid: A Peer-to-Peer Platform for Private Data Science and Federated Learning

## This blog post helps newcomers understand, through a simple example, what PyGrid is and how to use it. Credits to Patrick, Ionesio, Juan and Dev (Links TBC).

![](images/1.png)

## __What if you could train on all of the world’s data, without that data leaving the device, and while keeping that data private?__

### PyGrid is a peer-to-peer platform for private data science and federated learning. With PyGrid, data owners can provide, monitor, and __manage access to their own private data clusters. The data does not leave the data owner’s server__.

### Data scientists can then use PyGrid to __perform private statistical analysis on the private dataset__, or even __perform federated learning across multiple institutions’ datasets__.

### This blog post will cover:

#### 1. Basic concepts needed to understand __PyGrid__, such as __Federated Learning__ and __Secure Multi-party Computation__
#### 2. The __PySyft library__ and the __PyGrid platform__. PySyft is a private machine learning library which is deployed using the PyGrid platform
#### 3. Practical examples of privacy preserving analysis using PyGrid: these examples will help us understand PyGrid’s architecture and figure out how it can be applied to real problems

# Federated Learning

### The first concept we need to understand is Federated Learning. Federated Learning is a technique which allows AI models to learn without users having to give up their data. How does this work?

### The first step is to create an initial model. The data scientist (here, an AI company) then sends this model to a dataset owner, in this case Joe’s device.

![](images/4.png)

### Joe updates the model by training it on his dataset. After training, the updated model is sent back to the AI company.

![](images/5.png)

### Next, the AI company sends the updated model to another device, in this case Jane’s.

### Jane updates the model by training it on her dataset, then sends the updated weights back to the data scientist.

![](images/6.png)

### Now the model has learned from both Joe and Jane’s data. We can repeat this process over many nodes, and we can even train models simultaneously on multiple nodes and average them, allowing for faster improvements to the model.
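
### The round trip described above can be sketched in a few lines of plain Python. This is a toy illustration, not PySyft code: the one-parameter model `y = w * x`, the made-up datasets for Joe and Jane, the learning rate, and the helper names `local_train` and `federated_average` are all assumptions made for this example.

```python
# Toy federated learning: only model weights travel, never the raw data.
# All names and numbers here are illustrative assumptions.

def local_train(w, data, lr=0.01, epochs=50):
    """Train the model y = w * x on one owner's data with gradient descent."""
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_average(weights):
    """Average the weights returned by several owners."""
    return sum(weights) / len(weights)

# Each owner's data stays in its own variable (its own "device").
joe_data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
jane_data = [(1.5, 3.0), (2.5, 5.1), (4.0, 7.9)]

# Sequential training: Joe first, then Jane continues from Joe's weights.
w_initial = 0.0
w_after_joe = local_train(w_initial, joe_data)
w_sequential = local_train(w_after_joe, jane_data)

# Parallel training: both start from the same model, results are averaged.
w_parallel = federated_average([
    local_train(w_initial, joe_data),
    local_train(w_initial, jane_data),
])

print(f"sequential: {w_sequential:.3f}, parallel: {w_parallel:.3f}")
```

### In both variants only the weight `w` moves between the parties; `joe_data` and `jane_data` are never combined in one place.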

### The main benefits of federated learning are:

#### - The training data stays on users’ devices (or, for example, on hospital servers)
#### - Sensitive data is better protected
#### - Model owners (data scientists, companies) carry less legal liability
#### - Less network bandwidth is spent uploading huge datasets

### It’s easy to think of potential use cases of this technology. For example,

#### - [A scientist training on data from multiple hospitals](https://blog.openmined.org/federated-learning-differential-privacy-and-encrypted-computation-for-medical-imaging/)
#### - [A smartphone app training on data from multiple phones](https://blog.openmined.org/apheris-openmined-pytorch-announcement/)
#### - [A company learning to predict when its machines need maintenance by training models across data from many sensors](https://blog.openmined.org/predictive-maintenance-of-turbofan-engines-using-federated-learning/)

## __Secure Multi-party Computation__

### Secure Multi-party Computation (SMPC) is a different way to protect data: a value is split into shares that are distributed across several devices. Its main advantage over traditional encryption is that SMPC lets us perform logic and arithmetic operations on the data while it remains encrypted.

### How can you perform math with multi-party computation? Here’s a (very) simplified example of how this works:

![](images/2.png)

### In this example, Andrew owns the number 5, his personal data. Andrew can anonymize his data by decomposing the number into 2 (or more) different numbers; here, he splits 5 into 2 and 3. He can then share these pieces with his friends Marianne and Bob.

### Here, none of them knows the real value of Andrew’s data; each holds only a part of it. None of them can perform any kind of operation without the agreement of all the others. But while the number is split between them, we are still able to perform computations on it. __That way, we can compute on users’ encrypted data without revealing any sensitive information__.

![](images/3.png)
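
### The splitting idea can be sketched as additive secret sharing in plain Python. This is a simplified illustration under stated assumptions, not the protocol PySyft uses: the modulus `Q` and the helpers `share`, `reconstruct` and `add_shared` are names invented for this example. Instead of fixed pieces such as 2 and 3, the shares are drawn at random modulo a large number, so that a single share reveals nothing about the secret.

```python
import secrets

# Shares live in a finite field so that a single share reveals nothing.
# The modulus is an arbitrary large prime chosen for this sketch.
Q = 2**61 - 1

def share(secret, n_parties=2):
    """Split `secret` into n random shares that sum to it modulo Q."""
    shares = [secrets.randbelow(Q) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % Q)
    return shares

def reconstruct(shares):
    """Recombine all shares; every party must contribute."""
    return sum(shares) % Q

def add_shared(a_shares, b_shares):
    """Each party adds its own two shares locally, with no communication."""
    return [(a + b) % Q for a, b in zip(a_shares, b_shares)]

andrew = share(5)   # one share for Marianne, one for Bob
other = share(7)    # a second secret, shared the same way

total = add_shared(andrew, other)
print(reconstruct(andrew))  # 5
print(reconstruct(total))   # 12
```

### Marianne and Bob each add the two shares they hold, and the sum only becomes readable once both results are combined.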

### With these concepts understood, we can now explain PySyft and PyGrid.

## __Use Case: Private Statistical Analysis__
### Let’s explore two workflows:

### 1. The __data owner__, who wants to publish sensitive data on their node (in this case, a hospital pediatric ward).
### 2. The __data scientist__, who wants to find a specific dataset on the grid network and run some statistical analysis on it.

## __The Data Owner__
## __Step 1: Import PySyft and dependencies__
### The first step as a data owner is importing our dependencies.

### In this case we’ll import syft and replace the standard torch modules using the syft hook. Follow the instructions in the [PyGrid repository](https://github.com/OpenMined/PyGrid) to install the necessary dependencies.

### Visit http://localhost:5000/connected-nodes to make sure all nodes are ready and set up. For the standard setup you should get a response like this:

### {"grid-nodes": ["James", "Alice", "Bill", "Bob"]}

In [4]:
import syft as sy
from syft.grid.clients.data_centric_fl_client import DataCentricFLClient
from syft.grid.clients.model_centric_fl_client import ModelCentricFLClient

import torch as th
hook = sy.TorchHook(th)



## __Step 2: Connect your node__

### The next step is to connect to your own node. Note that the node app has already been deployed in some environment, so you need to know its address in advance. In this case we’ll connect to the hospital node.

In [5]:
hospital_datacluster = DataCentricFLClient(hook, "http://localhost:3000")
hospital_datacluster

<Federated Worker id:Bob>

## __Step 3: Prepare data as tensors and add brief description__

### Now we need to prepare our dataset to be published on the hospital node. To make the data easy to understand, we should add a brief description explaining what the data means and how it is structured. In this use case, we want to publish the hospital’s monthly birth records.

In [6]:
data_description = """Description:
                        This data presents the monthly birth records.
                        
                        Columns:
                            Gender: 0 - Male, 1 - Female
                            Weight: in Kg
                            Height: in cm


                        Shape 5 * 3
                        """

monthly_birth_records = th.tensor([[1, 3.5, 47.3],
                                   [0, 3.7, 48.1],
                                   [0, 3.9, 50.0],
                                   [1, 4.1, 52.3],
                                   [0, 4.1, 49.7]
                                  ])

## __Step 4: Define access rules and permissions__

### After that, we need to define rules that control access to the data. In this case, we’re allowing some users (Bob, Ana and Alice) full access to the real values of this data.

In [7]:
private_dataset = monthly_birth_records.private_tensor(allowed_users = ("Bob", "Ana", "Alice"))
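
### To make the idea of an access rule concrete, here is a toy model in plain Python. It is not how PySyft implements `private_tensor`: the class `GuardedData`, its `get` method and the user "Mallory" are hypothetical and only illustrate an allow-list check.

```python
class GuardedData:
    """Hypothetical wrapper: values are released only to allowed users."""

    def __init__(self, values, allowed_users):
        self._values = values
        self._allowed = frozenset(allowed_users)

    def get(self, user=None):
        # Refuse the request unless the caller is on the allow-list.
        if user not in self._allowed:
            raise PermissionError(f"user {user!r} may not read this data")
        return self._values

records = GuardedData([[1, 3.5, 47.3], [0, 3.7, 48.1]],
                      allowed_users=("Bob", "Ana", "Alice"))

print(records.get(user="Bob")[0])   # allowed

try:
    records.get(user="Mallory")     # not on the list
except PermissionError as err:
    print("denied:", err)
```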

## __Step 5: Add tags and labels to help data scientists find your dataset__

### To make our data discoverable through queries, we also need to add tags that identify and label it.

### In this example, we’re adding two tags: __#February__ to identify the month and __#birth-records__ to identify what the data contains.

In [8]:
private_dataset = private_dataset.tag("#February", "#birth-records").describe(data_description)
private_dataset

(Wrapper)>PrivateTensor>tensor([[ 1.0000,  3.5000, 47.3000],
        [ 0.0000,  3.7000, 48.1000],
        [ 0.0000,  3.9000, 50.0000],
        [ 1.0000,  4.1000, 52.3000],
        [ 0.0000,  4.1000, 49.7000]])

## __Step 6: Publish! You’re done.__

### Now the data is ready to be published. Note that you need permission to publish private datasets on this node. In this example we’re using Bob’s credentials to publish the data on the node.

In [9]:
data_pointer = private_dataset.send(hospital_datacluster, user = "Bob")
data_pointer

(Wrapper)>[PointerTensor | me:16456678568 -> Bob:25523317939]
	Tags: #February #birth-records 
	Shape: torch.Size([5, 3])
	Description: Description:...

### As the data owner, that's all that we need to do!

## __The Data Scientist__

## __Step 1: Import PySyft and dependencies__

### As a data scientist, we also need to import the syft library and replace the torch modules using the syft hook.

In [10]:
import syft as sy
from syft.grid.public_grid import PublicGridNetwork

import torch as th
hook = sy.TorchHook(th)



## __Step 2: Connect to Grid Platform__

### Unlike the data owner, we don’t know where the nodes and the datasets are, so we first need to connect to the grid network. Its address is the address of the gateway component.

In [11]:
grid = PublicGridNetwork(hook, "http://localhost:5000")
grid

<syft.grid.public_grid.PublicGridNetwork at 0x7fa8ce650250>

## __Step 3: What data are you looking for? Search the network.__

### After connecting to the grid network, we can search for datasets by tag. Perhaps you’re looking for x-rays of pneumonia, or hospital birth records. In this example, we’re using the same tags that the data owner published earlier. The grid network returns a dictionary with node ids as keys and data pointers as values.

In [12]:
results = grid.search("#February", "#birth-records")
results

{'Bob': [(Wrapper)>[PointerTensor | me:18426392449 -> Bob:25523317939]
  	Tags: #February #birth-records 
  	Shape: torch.Size([5, 3])
  	Description: Description:...]}
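
### The shape of this result (node id mapped to a list of matching datasets) can be mimicked with a small in-memory model. This is only a sketch of the lookup logic, not the gateway's real implementation: the `nodes` dictionary, the `search` helper and the entry for Alice are assumptions made for illustration.

```python
# Hypothetical in-memory stand-in for a grid: node id -> published datasets.
# Bob's entry mirrors the example above; Alice's entry is made up.
nodes = {
    "Bob":   [{"id": 25523317939, "tags": {"#February", "#birth-records"}}],
    "Alice": [{"id": 101, "tags": {"#January", "#birth-records"}}],
    "Bill":  [],
}

def search(nodes, *tags):
    """Return {node_id: [matching datasets]} for datasets carrying every tag."""
    wanted = set(tags)
    results = {}
    for node_id, datasets in nodes.items():
        # A dataset matches only if it carries all requested tags.
        matches = [d for d in datasets if wanted <= d["tags"]]
        if matches:
            results[node_id] = matches
    return results

results = search(nodes, "#February", "#birth-records")
print(list(results))  # ['Bob']
```

### Searching for "#birth-records" alone would return both Bob and Alice, which is why combining tags narrows the result.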

## __Step 4: Create a reference to that data pointer__

### Next, we define a direct reference to the hospital’s data pointer.

In [16]:
feb_records = results['Bob'][0]
print(feb_records.description)

Description:
                        This data presents the monthly birth records.
                        
                        Columns:
                            Gender: 0 - Male, 1 - Female
                            Weight: in Kg
                            Height: in cm


                        Shape 5 * 3
                        


## __Wait - what if I try to copy that data?__

### If we try to retrieve the real values behind the data pointer without permission, an exception is raised. This is how PyGrid keeps the data in the owner’s hands and gives them control to allow or deny access to the data samples.

In [18]:
feb_records.get()
# Raises GetNotPermittedError: no allowed user credentials were supplied

## __Step 5: Perform your computations__

### Even without copying the data, we can still perform remote computations on it. In this example, we want to compute the average of the babies’ weight and height. To do this, we need to sum the column values remotely.

In [19]:
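def sum_column(dataset, column):
    # Start from a copy of the first row's value so the dataset itself is not modified in place
    sum_result = dataset[0][column].copy()
    # Add the remaining rows; each operation runs on the remote node
    for i in range(1, dataset.shape[0]):
        sum_result += dataset[i][column]
    return sum_result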

### Now we can compute the weight sum remotely. This generates another remote tensor.

In [28]:
weight_sum = sum_column(feb_records, 1)
weight_sum

(Wrapper)>[PointerTensor | me:30827401111 -> Bob:79401854137]
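
### The result is a pointer rather than a number because the additions ran on the node, next to the data. The toy model below illustrates that idea in plain Python. It is not PySyft's implementation: `ToyWorker` and `ToyPointer` are hypothetical classes written for this example.

```python
import itertools

class ToyWorker:
    """Hypothetical node: holds the real values in a private object store."""

    def __init__(self, name):
        self.name = name
        self._store = {}
        self._ids = itertools.count(1)

    def register(self, value):
        obj_id = next(self._ids)
        self._store[obj_id] = value
        return ToyPointer(self, obj_id)

    def add(self, id_a, id_b):
        # The arithmetic happens here, next to the data.
        return self.register(self._store[id_a] + self._store[id_b])

class ToyPointer:
    """Hypothetical handle: knows where a value lives, not what it is."""

    def __init__(self, worker, obj_id):
        self.worker = worker
        self.obj_id = obj_id

    def __add__(self, other):
        # Ask the worker to add the two stored values; get a new pointer back.
        return self.worker.add(self.obj_id, other.obj_id)

    def __repr__(self):
        return f"[ToyPointer -> {self.worker.name}:{self.obj_id}]"

bob = ToyWorker("Bob")
weights = [bob.register(w) for w in (3.5, 3.7, 3.9, 4.1, 4.1)]

total = weights[0]
for pointer in weights[1:]:
    total = total + pointer

print(total)  # a pointer, not a number
```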

### We can do the same thing with the height column.

In [29]:
height_sum = sum_column(feb_records, 2)
height_sum

(Wrapper)>[PointerTensor | me:51795564794 -> Bob:48861446528]

### Now we just need to retrieve the aggregated value using our credentials and divide it by 5, the dataset size.

In [30]:
avg_weight = weight_sum.get(user='Bob')/5
avg_weight

(Wrapper)>PrivateTensor>tensor(3.8600)

### We can do the same thing to get the average height. That way, we are able to compute the average weight and height of the babies born this month without getting access to any sensitive data.

In [31]:
avg_height = height_sum.get(user='Bob')/5
avg_height

(Wrapper)>PrivateTensor>tensor(49.4800)
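
### As a cross-check, the data owner (who holds the raw records) can reproduce both averages locally in plain Python. The helper `column_mean` is a name introduced for this sketch; the numbers are the five records published earlier.

```python
# The five birth records from the data owner's example: gender, weight, height.
records = [
    [1, 3.5, 47.3],
    [0, 3.7, 48.1],
    [0, 3.9, 50.0],
    [1, 4.1, 52.3],
    [0, 4.1, 49.7],
]

def column_mean(rows, column):
    """Average one column of a list of rows."""
    return sum(row[column] for row in rows) / len(rows)

print(round(column_mean(records, 1), 2))  # 3.86
print(round(column_mean(records, 2), 2))  # 49.48
```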

### __Done! We know the average height and weight of the babies born in February without ever moving the dataset to our own server, and we never needed to receive any private information about the individual babies.__