# Medical Federated Learning Program - Data Science
<img src="imgs/heading_title.jpg" style="width: 100%; margin:0;" />

<img src="imgs/heading_recap.png" style="width: 100%; margin:0;" />

<img src="imgs/heading_ds.png" style="width: 100%; margin:0;" />

![heading_overview_1](imgs/heading_overview.png)

<img src="imgs/login_tab.png" style="width: 664px; margin:0;"/>

In [None]:
import syft as sy

ds_domain_node_1 = sy.login(
    email="sam@stargate.net", 
    password="changethis", 
    port=8081
)

<img src="imgs/tab_list_datasets.png" style="width: 664px; margin:0;" />

In [None]:
ds_domain_node_1.datasets

<img src="imgs/tab_pointer_tensor.png" style="width: 100%; margin:0;"/>

<img src="imgs/tab_dataset_pointer.png" style="width: 664px; margin:0;" />

In [None]:
data = ds_domain_node_1.datasets[0]["data"]
data

## Step 4: Working with data you don't have: Remote Procedure Calls

So now we know that **Tensor Pointers** let you access data on another domain node without having to make a copy of it.
But this is only half the story; after all, we don't just want to access data, we want to *work with it!* How do we do that?

The answer is something called **Remote Procedure Calls**. Let's start with our tensor pointer:

In [None]:
data.public_shape

In [None]:
transformed_data = data.T

In [None]:
transformed_data.public_shape
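
To build some intuition for what just happened, here is a toy sketch of the idea in plain Python. The `ToyDomainNode` and `ToyPointer` classes are hypothetical names invented for this illustration and are not how PySyft is implemented; they only show that the pointer carries a location and a public shape, while the operation itself runs where the data lives.

```python
# Toy illustration only: these classes are hypothetical, not PySyft's implementation.
import numpy as np


class ToyDomainNode:
    """Holds private arrays; the values never leave this object."""

    def __init__(self):
        self._store = {}
        self._next_id = 0

    def upload(self, array):
        # Store the array and hand back a pointer instead of the values.
        obj_id = self._next_id
        self._store[obj_id] = np.asarray(array)
        self._next_id += 1
        return ToyPointer(self, obj_id, self._store[obj_id].shape)

    def run(self, obj_id, op_name):
        # The "remote procedure call": the operation executes next to the data.
        if op_name == "T":
            result = self._store[obj_id].T
        else:
            raise ValueError(f"unsupported operation: {op_name}")
        return self.upload(result)


class ToyPointer:
    """Knows where the data is and its public shape, but holds no values."""

    def __init__(self, node, obj_id, public_shape):
        self._node = node
        self._obj_id = obj_id
        self.public_shape = public_shape

    @property
    def T(self):
        # Ask the node to transpose the data; we get a new pointer back.
        return self._node.run(self._obj_id, "T")


node = ToyDomainNode()
toy_data = node.upload(np.zeros((2, 3)))
toy_transposed = toy_data.T
print(toy_data.public_shape, toy_transposed.public_shape)
```

Notice that `toy_transposed` is itself just another pointer: the transposed values stay on the node.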

## Step 5: Getting results you can see: Differential Privacy

To recap: 
- We now know that **Tensor Pointers** give a data scientist the ability to access data remotely
- They can work on remote data (using **Remote Procedure Calls**) without physically having it on their device!

Now let's say you've finished your analysis. How do you actually get the results? And how do we make sure that a data scientist viewing those results doesn't violate anyone's privacy?

The answer lies in something called **Differential Privacy**.

In [None]:
ds_domain_node_1.privacy_budget

In [None]:
transformed_data.get()

As you can see, this failed because our data scientist is trying to download raw data that they don't have permission to access.

Instead, they can retrieve a noisy version of this data by spending something called a **privacy budget.** They specify how much noise they want to add, and their privacy budget is deducted accordingly:

In [None]:
ptr = data.publish(sigma=0.5)

In [None]:
res = ptr.get()
res.decode()

In [None]:
ds_domain_node_1.privacy_budget

In [None]:
transformed_data

In [None]:
new_pointer = transformed_data.publish(sigma=0.5)

In [None]:
result = new_pointer.get()

In [None]:
result.decode()

In [None]:
result.shape
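
To get a feel for what `sigma` controls, here is a minimal NumPy sketch of adding Gaussian noise to a result. The `toy_publish` function is a hypothetical stand-in written for this illustration; it does not reproduce PySyft's `publish` or its privacy budget accounting. It only shows the trade-off: a larger `sigma` hides individual values better but makes the answer less accurate.

```python
import numpy as np

rng = np.random.default_rng(seed=0)


def toy_publish(values, sigma, rng):
    """Hypothetical helper: return a copy of `values` with Gaussian noise added."""
    values = np.asarray(values, dtype=float)
    return values + rng.normal(loc=0.0, scale=sigma, size=values.shape)


# Pretend these are private measurements (made-up numbers for the demo).
private_values = np.full(10_000, 50.0)

low_noise = toy_publish(private_values, sigma=0.5, rng=rng)
high_noise = toy_publish(private_values, sigma=5.0, rng=rng)

# Average distance between the noisy answers and the true values.
low_error = np.abs(low_noise - private_values).mean()
high_error = np.abs(high_noise - private_values).mean()

print(f"mean error with sigma=0.5: {low_error:.2f}")
print(f"mean error with sigma=5.0: {high_error:.2f}")
```

In differential privacy generally, more noise means less privacy loss, which is why the amount of noise and the amount of budget spent are tied together.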

## Step 6: Combining data from many nodes: Secure Multiparty Computation

Okay, so we created a Tensor Pointer, used its remote procedure calls to conduct an experiment, and got the result by spending some privacy budget. This combination lets you use data from one domain node. But what if you wanted to combine data from several domain nodes?

This is possible through a cryptographic tool called **Secure Multiparty Computation**, or SMPC for short.

Let's start by logging onto a second domain node:

In [None]:
ds_domain_node_2 = sy.login(email="sam@stargate.net", password="changethis", port=8082)

In [None]:
ds_domain_node_2.datasets

Let's say we want to take the two example datasets, and see the result of their addition. We would do this by simply adding the two tensor pointers:

In [None]:
pointer1 = ds_domain_node_1.datasets[0]["data"]
pointer2 = ds_domain_node_2.datasets[0]["data"]

pointer_to_result = pointer1 + pointer2

Once we have a pointer to the result, we can spend some privacy budget by publishing!

In [None]:
published_result = pointer_to_result.publish(sigma=2)

In [None]:
published_result.get()

The power of SMPC is that it lets us calculate and use data that's stored in two separate places (maybe even separated by thousands of miles!) without exposing any private information to any party.
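
One common building block of SMPC is additive secret sharing, and the sketch below shows it with plain Python integers. This is an illustration under stated assumptions, not a description of the protocol PySyft runs: the helper names `make_shares` and `reconstruct`, the modulus, and the example values are all made up for this demo.

```python
import secrets

# Assumption for this sketch: all arithmetic is done modulo a large prime.
MODULUS = 2**61 - 1


def make_shares(secret, n_parties):
    """Split `secret` into random shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last_share = (secret - sum(shares)) % MODULUS
    return shares + [last_share]


def reconstruct(shares):
    """Recombine shares into the value they encode."""
    return sum(shares) % MODULUS


# Two private values, each held by a different data owner (made-up numbers).
value_1 = 120
value_2 = 45

shares_1 = make_shares(value_1, n_parties=2)
shares_2 = make_shares(value_2, n_parties=2)

# Each party adds only the shares it holds; a single share looks random.
party_a_sum = (shares_1[0] + shares_2[0]) % MODULUS
party_b_sum = (shares_1[1] + shares_2[1]) % MODULUS

# Only the combined result reveals the sum, never the individual inputs.
total = reconstruct([party_a_sum, party_b_sum])
print(total)
```

Neither party ever sees the other owner's value, yet combining the two partial sums yields `value_1 + value_2`.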

## Step 7: Finding data you don't have: Network Nodes

In [None]:
sy.networks

## Learning more from data: Machine Learning

So far, all of the techniques have focused on providing privacy and security in order to get access to more data.

This final technique focuses more on what we'll be using the data *to do.*

In [None]:
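import numpy as np

# Inputs (4 samples, 3 features); the target is the XOR of the first two columns
X = np.array([ [0,0,1],[0,1,1],[1,0,1],[1,1,1] ])
y = np.array([[0,1,1,0]]).T

# Randomly initialize both weight matrices with values in [-1, 1)
syn0 = 2*np.random.random((3,4)) - 1
syn1 = 2*np.random.random((4,1)) - 1
for j in range(60000):
    # Forward pass through two sigmoid layers
    l1 = 1/(1+np.exp(-(np.dot(X,syn0))))
    l2 = 1/(1+np.exp(-(np.dot(l1,syn1))))
    # Backpropagate the error through each layer
    l2_delta = (y - l2)*(l2*(1-l2))
    l1_delta = l2_delta.dot(syn1.T) * (l1 * (1-l1))
    # Update the weights
    syn1 += l1.T.dot(l2_delta)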
    syn0 += X.T.dot(l1_delta)
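
The cell above trains the network but never shows what it learned. As a rough sketch, the same loop can be rerun with a fixed random seed and followed by one more forward pass to check the predictions. The seed value, the `sigmoid` helper, and the `predictions` and `mean_error` names are additions for this illustration.

```python
import numpy as np

np.random.seed(1)  # assumption: a fixed seed so the run is repeatable

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y = np.array([[0, 1, 1, 0]]).T


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


# Same initialization and update rule as the cell above.
syn0 = 2 * np.random.random((3, 4)) - 1
syn1 = 2 * np.random.random((4, 1)) - 1

for j in range(60000):
    l1 = sigmoid(X @ syn0)
    l2 = sigmoid(l1 @ syn1)
    l2_delta = (y - l2) * (l2 * (1 - l2))
    l1_delta = (l2_delta @ syn1.T) * (l1 * (1 - l1))
    syn1 += l1.T @ l2_delta
    syn0 += X.T @ l1_delta

# One more forward pass with the trained weights.
predictions = sigmoid(sigmoid(X @ syn0) @ syn1)
mean_error = np.mean(np.abs(y - predictions))

print("predictions:", np.round(predictions, 2).ravel())
print("mean absolute error:", mean_error)
```

If training went well, the predictions should sit close to the targets in `y`.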

## Everything combined: PySyft

PySyft combines all of these tools and techniques, and improves upon many of them. We'll soon be training a model with data across 100 domain nodes, but for now, let's start with two:

In [None]:
# Let's get pointers to the training data on both domain nodes!
domain_node_1_train_data = ds_domain_node_1.datasets[-1]["training_data"]
domain_node_1_targets_data = ds_domain_node_1.datasets[-1]["training_targets"]
domain_node_2_train_data = ds_domain_node_2.datasets[-1]["training_data"]
domain_node_2_targets_data = ds_domain_node_2.datasets[-1]["training_targets"]

In [None]:
# Let's combine all the training data!
train_data = domain_node_1_train_data.concatenate(domain_node_2_train_data)
targets_data = domain_node_1_targets_data.concatenate(domain_node_2_targets_data)
X = train_data
y = targets_data

In [None]:
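def relu(x, deriv=False):
    # With deriv=True, return the gradient mask: True wherever x is positive
    if deriv:
        return x > 0
    # Otherwise zero out the non-positive entries
    return x * (x > 0)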

In [None]:
layer0_weights = 2*np.random.random((3,4)) - 1
layer1_weights = 2*np.random.random((4,1)) - 1

In [None]:
for j in range(1):
    # Forward propagation
    layer1_inputs = relu(X @ layer0_weights)  ; layer1_inputs.block
    layer2_inputs = relu(layer1_inputs @ layer1_weights) ; layer2_inputs.block 
    
    # Calculate errors
    layer2_inputs_delta = (y - layer2_inputs)* relu(layer2_inputs,deriv=True) ; layer2_inputs_delta.block
    layer1_inputs_delta = (layer2_inputs_delta@(layer1_weights.T)) * relu(layer1_inputs,deriv=True) ; layer1_inputs_delta.block
    
    # Update weights
    layer1_weights  = layer1_weights + layer1_inputs.T @ layer2_inputs_delta ; layer1_weights.block
    layer0_weights =  layer0_weights + X.T @ layer1_inputs_delta ; layer0_weights.block

In [None]:
layer0_weights_dp = layer0_weights.publish(sigma=1e4)
layer1_weights_dp = layer1_weights.publish(sigma=1e4)

In [None]:
print(layer0_weights_dp.get_copy())
print(layer1_weights_dp.get_copy())

And voilà! The model weights have been updated.

## Conclusion

Thanks for following along! I hope you walk away from this with a better understanding of how PySyft works, and also get a sense of where we're going with all this. Very soon, we'll be using the domain nodes you set up to train an AI model on breast cancer data that's spread across 100 institutions. We're very humbled and grateful to get the chance to help all of you.

If you have any questions or concerns, please feel free to post a message in the #support Slack channel! :)