# Medical Federated Learning Program - Data Science

Welcome!
Up until this point, you were a Data Owner, who:
- Deployed a domain node using a `hagrid launch` command
- Annotated a dataset
- Uploaded this annotated dataset to the domain node
- Connected this domain node to a Network node
- And created a Data Scientist account!

In this notebook, we'll be demonstrating the basic building blocks that enable a data scientist to use this data, while preserving the privacy, security and integrity of the data you uploaded to your domain node.

Now let's walk through how a data scientist would go about using this data, and this infrastructure!

## Accessing data you don't have: Introducing Tensor Pointers

The raw, private data that you uploaded to your domain node never leaves your domain node.
How then do data scientists work with it? The answer is called a **Tensor Pointer.**


Let's log in to our domain node as the Data Scientist:

In [1]:
import syft as sy

# Log in to the domain node as a data scientist (DS)
ds_domain_node_1 = sy.login(email="sam@stargate.net", password="changethis", port=8081)


Anyone can login as an admin to your node right now because your password is still the default PySyft password!!!

Connecting to localhost... done! 	 Logging into canada... done!


<hr>
**We've logged in!**

Now as the data scientist, let's see which datasets are on this domain node:

In [2]:
ds_domain_node_1.datasets

Idx,Name,Description,Assets,Id
[0],TissueMNIST,This dataset is a modified form of TissueMNIST which is made available from the Broad Bioimage Benchmark Collection.,"[""train_images""] -> int64 [""train_labels""] -> int64 [""val_images""] -> int64 ...",39f5779b-27f2-4156-b3d2-ef037a0f2fb5
[1],Cohort 1 Data,Test results from Cohort 1,"[""data""] -> int64",25cf2c92-5fa6-4da8-a1c9-4d50ae2ee7e0


In [3]:
data = ds_domain_node_1.datasets[-1]["data"]

In [4]:
data

array([4, 4, 2, 2])

 (The data printed above is synthetic - it's an imitation of the real data.)

## Working with data you don't have: Remote Procedure Calls

So now we know that **Tensor Pointers** let you access data on another domain node without having to make a copy of it.
But this is only half the story; after all, we don't just want to access data, we want to *work with it!* How do we do that?

The answer is something called **Remote Procedure Calls**. Let's start with our tensor pointer:

In [5]:
data.public_shape

(4,)

In [6]:
transformed_data = data.T

In [7]:
transformed_data.public_shape

(4,)
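To make the idea concrete, here is a minimal toy sketch of a pointer and a remote procedure call in plain Python. It is not PySyft's implementation: `ToyNode` and `ToyPointer` are hypothetical names invented for illustration, and the "node" is just an in-memory dictionary rather than a separate machine. The point it shows is that the caller only ever holds an ID and public metadata such as the shape, while the operation runs where the data lives.

```python
import uuid
import numpy as np


class ToyNode:
    """Hypothetical stand-in for a domain node: it owns the real arrays."""

    def __init__(self):
        self._store = {}

    def put(self, array):
        # Store the private array and hand back a pointer, never the data itself.
        obj_id = uuid.uuid4().hex
        self._store[obj_id] = np.asarray(array)
        return ToyPointer(self, obj_id, self._store[obj_id].shape)

    def run(self, obj_id, op_name):
        # The "remote procedure call": the operation executes on the node,
        # and only a pointer to the new result is returned.
        result = getattr(self._store[obj_id], op_name)
        if callable(result):
            result = result()
        return self.put(result)


class ToyPointer:
    """Holds an ID and public metadata (the shape), but no private values."""

    def __init__(self, node, obj_id, public_shape):
        self._node = node
        self.obj_id = obj_id
        self.public_shape = public_shape

    @property
    def T(self):
        return self._node.run(self.obj_id, "T")

    def sum(self):
        return self._node.run(self.obj_id, "sum")


node = ToyNode()
pointer = node.put([[1, 2, 3], [4, 5, 6]])  # hypothetical example data
transposed = pointer.T                      # runs on the node, returns a new pointer
print(pointer.public_shape, transposed.public_shape)
```

In this sketch, every operation on a pointer returns another pointer, so the values never reach the caller unless the node chooses to release them.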

## Getting results you can see: Differential Privacy

To recap: 
- We now know that **Tensor Pointers** give a data scientist the ability to access data remotely
- They can work on remote data (using **Remote Procedure Calls**) without physically having it on their device!

Now let's say you've done your analysis. How do you actually get results? And how do we make sure the data scientist seeing the results of their analysis doesn't invade or violate anyone's privacy?

The answer lies in something called **Differential Privacy**.

In [8]:
ds_domain_node_1.privacy_budget

9999.0

In [9]:
transformed_data.get()

[2022-06-08T04:13:00.621489+0000][CRITICAL][logger]][3014356] You do not have permission to .get() Object with ID: <UID: ce88ef0131d84315a79c73d2960700ff> on node canada Please submit a request.


AuthorizationException: You do not have permission to .get() Object with ID: <UID: ce88ef0131d84315a79c73d2960700ff> on node canada Please submit a request.

As you can see, this failed: the data scientist tried to download raw data that they don't have permission to access.

Instead, they can retrieve a noised version of the data by spending something called a **privacy budget.** They specify how much noise to add, and their privacy budget is deducted accordingly:

In [10]:
ptr = data.publish(sigma=0.5)

In [11]:
res = ptr.get()



In [12]:
res.decode()

array([-1.15815735, -0.9848175 , -0.92993164,  0.37869263])

In [13]:
ds_domain_node_1.privacy_budget

9999.0

In [14]:
transformed_data

array([4, 4, 2, 2])

 (The data printed above is synthetic - it's an imitation of the real data.)

In [15]:
new_pointer = transformed_data.publish(sigma=0.5)

In [16]:
result = new_pointer.get()

In [17]:
result.decode()

array([-0.89207458,  0.45620728, -0.22010803,  0.17390442])

In [18]:
result.shape

(4,)
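The role of `sigma` can be illustrated with a small NumPy sketch. It assumes that publishing adds zero-mean Gaussian noise with standard deviation `sigma` to each value; that is an assumption about the mechanism, not a description of PySyft's internals, and the sketch does no privacy budget accounting. `publish_sketch` is a hypothetical helper and the input values are made up.

```python
import numpy as np


def publish_sketch(values, sigma, rng):
    """Return values plus zero-mean Gaussian noise with standard deviation sigma."""
    values = np.asarray(values, dtype=float)
    return values + rng.normal(loc=0.0, scale=sigma, size=values.shape)


rng = np.random.default_rng(seed=0)  # seeded only so the sketch is repeatable
private = np.array([4, 4, 2, 2])     # hypothetical example values

low_noise = publish_sketch(private, sigma=0.5, rng=rng)
high_noise = publish_sketch(private, sigma=50.0, rng=rng)

print(low_noise)   # stays fairly close to the private values
print(high_noise)  # individual values are largely hidden by the noise
```

In differential privacy, more noise generally means a stronger privacy guarantee and a less accurate result, which is the trade-off a data scientist makes when choosing `sigma`.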

## Finding data you don't have: Network Nodes



In [19]:
sy.networks

Unnamed: 0,name,host_or_ip,vpn_host_or_ip,protocol,port,admin_email,website,slack,slack_channel
0,opengrid,40.83.192.48,100.64.0.1,http,80,support@openmined.org,https://www.openmined.org/,https://slack.openmined.org/,#support
1,GovAI Network 1,20.84.12.110,100.64.0.1,http,80,support@openmined.org,https://www.governance.ai/,https://slack.openmined.org/,#support
2,GovAI Network 2,20.84.14.55,100.64.0.1,http,80,support@openmined.org,https://www.governance.ai/,https://slack.openmined.org/,#support
3,GovAI Network 3,20.84.13.24,100.64.0.1,http,80,support@openmined.org,https://www.governance.ai/,https://slack.openmined.org/,#support
4,GovAI Network 4,20.84.14.84,100.64.0.1,http,80,support@openmined.org,https://www.governance.ai/,https://slack.openmined.org/,#support
5,GovAI Network 5,20.84.12.64,100.64.0.1,http,80,support@openmined.org,https://www.governance.ai/,https://slack.openmined.org/,#support
6,GovAI Network 6,20.84.14.94,100.64.0.1,http,80,support@openmined.org,https://www.governance.ai/,https://slack.openmined.org/,#support
7,GovAI Network 7,20.84.12.138,100.64.0.1,http,80,support@openmined.org,https://www.governance.ai/,https://slack.openmined.org/,#support
8,GovAI Network 8,20.84.14.128,100.64.0.1,http,80,support@openmined.org,https://www.governance.ai/,https://slack.openmined.org/,#support
9,GovAI Network 9,20.84.12.33,100.64.0.1,http,80,support@openmined.org,https://www.governance.ai/,https://slack.openmined.org/,#support


## Combining data from many nodes: Secure Multiparty Computation

Okay, so we created a Tensor Pointer, used remote procedure calls to conduct an experiment, and got the result by spending some privacy budget. This combination lets you use data from one domain node. But what if you wanted to combine data from several domain nodes?

This is possible through a cryptographic tool called **Secure Multiparty Computation**, or SMPC for short.

Let's start by logging in to a second domain node:

In [20]:
ds_domain_node_2 = sy.login(email="sam@stargate.net", password="changethis", port=8082)


Anyone can login as an admin to your node right now because your password is still the default PySyft password!!!

Connecting to localhost... done! 	 Logging into italy... done!


In [21]:
ds_domain_node_2.datasets

Idx,Name,Description,Assets,Id
[0],TissueMNIST,This dataset is a modified form of TissueMNIST which is made available from the Broad Bioimage Benchmark Collection.,"[""train_images""] -> int64 [""train_labels""] -> int64 [""val_images""] -> int64 ...",92aee3ef-185a-438a-b058-757bfd4180ef
[1],Cohort 2 Data,Test results from Cohort 2,"[""data""] -> int64",8cd9180d-208f-4bc0-a00e-1e83de947883


Let's say we want to take the two example datasets, and see the result of their addition. We would do this by simply adding the two tensor pointers:

In [23]:
pointer1 = ds_domain_node_1.datasets[-1]["data"]
pointer2 = ds_domain_node_2.datasets[-1]["data"]

pointer_to_result = pointer1 + pointer2

Once we have a pointer to the result, we can spend some privacy budget by publishing!

In [24]:
published_result = pointer_to_result.publish(2)

In [26]:
published_result.get()

array([[ 3.54360979e+18,  7.93422757e+18,  2.00631757e+18,
         6.16724546e+18],
       [-3.05205646e+00, -5.06864302e+00, -4.14999905e+00,
        -2.80169500e-01]])

The power of SMPC is that it lets us compute on data that's stored in two separate places (maybe even separated by thousands of miles!) without exposing any private information to any party.
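One classic building block behind SMPC is additive secret sharing, sketched below in plain Python. This is a simplified illustration and not necessarily the protocol PySyft uses; `MODULUS`, `share`, and `reconstruct` are hypothetical names chosen for this example, and the two input numbers are made up.

```python
import secrets

MODULUS = 2**61 - 1  # hypothetical choice; it must be larger than any value we share


def share(value, n_parties=2):
    """Split an integer into n random-looking shares that sum to value mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def reconstruct(shares):
    """Recombine shares; any incomplete subset on its own reveals nothing useful."""
    return sum(shares) % MODULUS


# Two data owners each hold one private number and split it into shares.
a_shares = share(25)
b_shares = share(17)

# Each party adds the shares it holds locally; no party ever sees 25 or 17.
sum_shares = [(a + b) % MODULUS for a, b in zip(a_shares, b_shares)]

print(reconstruct(sum_shares))  # 42
```

Addition works share by share because the shares are combined with the same modular arithmetic used to create them.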

## Learning more from data: Machine Learning

So far, all of the techniques have focused on providing privacy and security in order to get access to more data.

This final technique focuses more on what we'll be using the data *to do.*

In [None]:
import numpy as np

# Toy inputs (4 samples, 3 features) and their targets
X = np.array([ [0,0,1],[0,1,1],[1,0,1],[1,1,1] ])
y = np.array([[0,1,1,0]]).T

# Randomly initialise the two weight matrices in [-1, 1)
syn0 = 2*np.random.random((3,4)) - 1
syn1 = 2*np.random.random((4,1)) - 1
for j in range(60000):
    # Forward pass through two sigmoid layers
    l1 = 1/(1+np.exp(-(np.dot(X,syn0))))
    l2 = 1/(1+np.exp(-(np.dot(l1,syn1))))
    # Backpropagate the error
    l2_delta = (y - l2)*(l2*(1-l2))
    l1_delta = l2_delta.dot(syn1.T) * (l1 * (1-l1))
    # Update the weights
    syn1 += l1.T.dot(l2_delta)
    syn0 += X.T.dot(l1_delta)

## Everything combined: PySyft

PySyft combines all of these tools and techniques, and improves upon many of them. We'll soon be training a model with data across 100 domain nodes.

In [None]:
# Let's get pointers to the training data on both domain nodes!
# `om` and `fb` are assumed to be domain node clients returned by `sy.login(...)`.
om_train_data = om.datasets[-1]["training_data"]
om_targets_data = om.datasets[-1]["training_targets"]
fb_train_data = fb.datasets[-1]["training_data"]
fb_targets_data = fb.datasets[-1]["training_targets"]

In [None]:
# Let's combine all the training data!
train_data = om_train_data.concatenate(fb_train_data)
targets_data = om_targets_data.concatenate(fb_targets_data)
X = train_data
y = targets_data

In [None]:
def relu(x, deriv=False):
    if deriv:
        return x > 0
    return x * (x > 0)
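
As a quick local check of what this function computes, the sketch below applies the same definition to an ordinary NumPy array instead of a tensor pointer. The input values are made up for illustration.

```python
import numpy as np


def relu(x, deriv=False):
    # With deriv=True, return the gradient mask: True where x is positive.
    if deriv:
        return x > 0
    # Otherwise zero out the non-positive entries and keep the rest.
    return x * (x > 0)


x = np.array([-2.0, -0.5, 0.0, 1.5])  # hypothetical example inputs
print(relu(x))              # negative entries are clipped to zero
print(relu(x, deriv=True))  # boolean mask, True only for positive inputs
```

The boolean mask acts as 0 or 1 when multiplied with the error terms during backpropagation.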

In [None]:
layer0_weights = 2*np.random.random((3,4)) - 1
layer1_weights = 2*np.random.random((4,1)) - 1

In [None]:
for j in range(1):
    # Forward propagation
    layer1_inputs = relu(X @ layer0_weights)
    layer1_inputs.block
    layer2_inputs = relu(layer1_inputs @ layer1_weights)
    layer2_inputs.block

    # Calculate errors
    layer2_inputs_delta = (y - layer2_inputs) * relu(layer2_inputs, deriv=True)
    layer2_inputs_delta.block
    layer1_inputs_delta = (layer2_inputs_delta @ layer1_weights.T) * relu(layer1_inputs, deriv=True)
    layer1_inputs_delta.block

    # Update weights
    layer1_weights = layer1_weights + layer1_inputs.T @ layer2_inputs_delta
    layer1_weights.block
    layer0_weights = layer0_weights + X.T @ layer1_inputs_delta
    layer0_weights.block

In [None]:
layer0_weights_dp = layer0_weights.publish(sigma=1e4)
layer1_weights_dp = layer1_weights.publish(sigma=1e4)

In [None]:
print(layer0_weights_dp.get_copy())
print(layer1_weights_dp.get_copy())

And voila! The model weights have been updated.

## Conclusion

Thanks for following along! I hope you walk away from this with a better understanding of how PySyft works, and also get a sense of where we're going with all this. Very soon, we'll be using the domain nodes you set up to train an AI model on breast cancer data spread across 100 institutions. We're very humbled and grateful to get the chance to help all of you.

If you have any questions or concerns, please feel free to post a message in the #support Slack channel! :)