## How can we perform statistical analysis/train models on sensitive or private data?

Is it possible to protect privacy and still perform predictive analysis on private data? Most people would say no, because the traditional data science process starts by collecting data from diverse sources into one server or location and then training or running analysis on that central copy. In this tutorial, we're going to look at tools and techniques for answering questions about data we do not own or cannot access locally because of privacy concerns, but would still love to analyze.

More specifically, we are going to look at the basics of PySyft, an open-source tool built on top of PyTorch and TensorFlow for computing over information you do not own, on machines you do not control. PySyft keeps data at its edge locations while still allowing computation over that data in a decentralized manner. This tutorial takes a practical, coding-based approach, and the best way to learn the material is to execute the code cell by cell and experiment with the examples.

# Prerequisites
- Familiarity with PyTorch (see the official [60-minute blitz tutorial](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html))

# Setup 

## How to run the code

This notebook was developed on [Jovian.ml](https://www.jovian.ml), a platform for sharing data science projects online. You can "run" this tutorial and experiment with the code examples in a couple of ways: *using free online resources* (recommended) or *on your own computer*.

### Option 1: Running using free online resources (1-click, recommended)

The easiest way to start executing this notebook is to click the "Run" button at the top of the jovian page [via this link](https://jovian.ai/tifeasypeasy/introducing-privacy-preserving-tool), and select "Run on Binder". This will run the notebook on [mybinder.org](https://mybinder.org), a free online service for running Jupyter notebooks. You can also select "Run on Colab" or "Run on Kaggle", but you'll need to create an account on [Google Colab](https://colab.research.google.com) or [Kaggle](https://kaggle.com) to use these platforms.


### Option 2: Running on your computer locally

You'll need to install Python and download this notebook on your computer to run it locally. I recommend using the [Conda](https://docs.conda.io/en/latest/) distribution of Python. Here's what you need to do to get started:

1. Install Conda by [following these instructions](https://github.com/Boluwatifeh/Secure-AI/blob/master/INSTALLATION.md). Make sure to add Conda binaries to your system `PATH` to be able to run the `conda` command line tool from your Mac/Linux terminal or Windows command prompt. 


2. Create and activate a [Conda virtual environment](https://github.com/Boluwatifeh/Secure-AI/blob/master/INSTALLATION.md) called `my_env` which you can use for this tutorial series:
```
conda create -n my_env python=3.6 
conda activate my_env
```
You'll need to create the environment only once, but you'll have to activate it every time you want to run the notebook. When the environment is activated, you should see the prefix `(my_env)` in your terminal or command prompt.


3. Install the required Python libraries within the environment by running the following commands in your terminal or command prompt:
```
conda install jupyter notebook
pip install syft
```
If you run into issues installing syft locally, skip the `pip install syft` command and move on; syft can also be installed later from inside the Jupyter notebook.

4. Clone this repository from GitHub:
```
git clone https://github.com/Boluwatifeh/Secure-AI.git
```

5. Enter the project directory and start the Jupyter notebook:
```
cd tutorials
jupyter notebook
```

6. You can now access Jupyter's web interface by clicking the link that shows up on the terminal or by visiting http://localhost:8888 on your browser. Click on the notebook `introducing-privacy-preserving-tool.ipynb` to open it and run the code. If you want to type out the code yourself, you can also create a new notebook using the "New" button.

If for some reason any part of the above instructions doesn't work, first check INSTALLATION.md for help, and then open a GitHub issue.


# Ways of preserving privacy 

## 1. Data anonymization
Data anonymization is the process of protecting private or sensitive information by erasing or encrypting the identifiers that connect an individual to stored data. For example, you can run Personally Identifiable Information (PII) such as names, social security numbers, and addresses through an anonymization process that retains the data but keeps the source anonymous. Although this technique protects privacy to some degree, it is risky to rely on, because anonymized data has been de-anonymized in several well-known cases. One example is the Netflix Prize competition, where participating teams were given an anonymized dataset of over one hundred million movie ratings from roughly half a million users so that they could train movie recommendation models. Some time later, two researchers showed that users in the anonymized ratings could be re-identified.
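
As a rough illustration of the idea, the sketch below replaces a direct identifier with a salted hash and drops another field outright. The records, the field names, and the `pseudonymize` helper are all made up for this example; it is not a vetted anonymization scheme, and the de-anonymization risk described above applies to it as well.

```python
import hashlib
import secrets

# Hypothetical records; the field names are invented for this illustration.
records = [
    {"name": "Alice Smith", "ssn": "123-45-6789", "zip": "90210", "rating": 5},
    {"name": "Bob Jones", "ssn": "987-65-4321", "zip": "10001", "rating": 3},
]

# A random salt kept secret by the data holder, so the tokens cannot be
# reproduced by simply hashing a list of known names.
salt = secrets.token_bytes(16)


def pseudonymize(record, salt):
    """Return a copy of `record` with direct identifiers hidden."""
    token = hashlib.sha256(salt + record["name"].encode("utf-8")).hexdigest()
    return {
        "user_token": token[:16],    # replaces the name
        "zip": record["zip"],        # quasi-identifier, kept as-is
        "rating": record["rating"],  # the value we want to analyze
        # the SSN is dropped entirely
    }


anonymized = [pseudonymize(r, salt) for r in records]
for row in anonymized:
    print(row)
```

Note that the `zip` field survives untouched. Fields like this are exactly what makes linking against a second dataset possible.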

### Disadvantages
- If two related anonymized datasets exist, it is often possible to de-anonymize them and invade privacy by studying both and looking for correlations between them.
- Anonymizing a collected dataset limits the value and insight that can be derived from it, so the user experience cannot be personalized when only anonymized data is available.
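
To make the first point concrete, here is a minimal sketch of a linkage attack on two toy datasets. All of the data and the `link` helper are invented for illustration; real attacks use fuzzier matching, but the principle of joining on shared attributes is the same.

```python
# Toy "anonymized" ratings: user names replaced by opaque ids.
anonymized_ratings = [
    {"user": "u1", "movie": "Movie A", "date": "2005-03-01", "rating": 5},
    {"user": "u1", "movie": "Movie B", "date": "2005-03-04", "rating": 2},
    {"user": "u2", "movie": "Movie A", "date": "2005-06-10", "rating": 4},
    {"user": "u2", "movie": "Movie C", "date": "2005-06-11", "rating": 1},
]

# Toy public reviews posted under real names on another site.
public_reviews = [
    {"name": "alice", "movie": "Movie A", "date": "2005-03-01"},
    {"name": "alice", "movie": "Movie B", "date": "2005-03-04"},
    {"name": "bob", "movie": "Movie C", "date": "2005-06-11"},
]


def link(anonymized, public, min_matches=2):
    """Guess which anonymous id belongs to which public name by counting
    (movie, date) pairs that appear in both datasets."""
    seen = {}
    for row in anonymized:
        seen.setdefault(row["user"], set()).add((row["movie"], row["date"]))
    matches = {}
    for name in sorted({r["name"] for r in public}):
        events = {(r["movie"], r["date"]) for r in public if r["name"] == name}
        for user, user_events in seen.items():
            if len(events & user_events) >= min_matches:
                matches[user] = name
    return matches


print(link(anonymized_ratings, public_reviews))
```

With the default threshold of two shared events, only `u1` is linked to `alice`. Lowering `min_matches` to 1 also links `u2` to `bob`, at the cost of more false positives on real data.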


## 2. Federated learning 
The simplest way to protect the privacy of data is not to collect it in the first place. As we saw, data anonymization does not reliably protect privacy, but without collecting the data we cannot perform analysis or train a model in the traditional way. So how do we fix that? Federated learning is a machine learning setting where many clients (e.g. mobile devices or whole organizations) collaboratively train a model under the orchestration of a central server while keeping the training data decentralized. It embodies the principles of focused collection and data minimization, and can mitigate many of the systemic privacy risks and costs resulting from traditional, centralized machine learning.
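
The training loop described above can be sketched in plain Python. This is a toy version of weighted federated averaging on a one-parameter model, under assumptions chosen for the example: the client names, the data, the learning rate, and the helpers `local_update` and `federated_round` are all hypothetical, and no PySyft code is involved.

```python
# Each client keeps its (x, y) pairs locally; only model weights are shared.
# The data follows y = 3 * x, so the ideal weight is 3.0.
clients = {
    "phone_a": [(1.0, 3.0), (2.0, 6.0)],
    "phone_b": [(3.0, 9.0)],
    "hospital": [(0.5, 1.5), (1.5, 4.5), (2.5, 7.5)],
}


def local_update(w, data, lr=0.05, steps=5):
    """Run a few gradient-descent steps on one client's own data."""
    for _ in range(steps):
        # gradient of the mean squared error for the model y_hat = w * x
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_round(w_global, clients):
    """One round: every client trains locally, then the server averages the
    returned weights, weighted by how many examples each client holds."""
    total = sum(len(data) for data in clients.values())
    weighted = sum(local_update(w_global, data) * len(data) for data in clients.values())
    return weighted / total


w = 0.0
for round_number in range(20):
    w = federated_round(w, clients)
print(round(w, 3))
```

The server only ever sees the weights returned by `local_update`, never the `(x, y)` pairs, and the global weight still ends up very close to 3.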


# Introducing PySyft
We are going to use an open-source library called PySyft, which can be used to implement the federated learning we just discussed.

## Installing libraries
The `jovian` library lets us commit our notebook to Jovian, a data science collaboration and sharing platform.

In [2]:
# !pip install jovian --upgrade --quiet
# uncomment the above command if you're running on Binder

In [3]:
# this installs the syft library; comment out the command below if you were able to install syft locally
!pip install syft

Collecting syft
Successfully installed Flask-1.1.2 Pillow-8.0.1 RestrictedPython-5.1 Werkzeug-1.0.1 aioice-0.6.18 aiortc-0.9.28 av-8.0.2 crc32c-2.2 dill-0.3.2 flask-socketio-4.2.1 importlib-metadata-2.0.0 importlib-resources-1.5.0 itsdangerous-1.1.0 lz4-3.0.2 msgpack-1.0.0 netifaces-0.10.9 numpy-1.18.5 openmined.threepio-0.2.0 phe-1.4.0 protobuf-3.13.0 psutil-5.7.0 pyee-8.1.0 pylibsrtp-0.6.7 python-engineio-3.13.2 python-socketio-4.6.0 requests-toolbelt-0.9.1 scipy-1.4.1 shaloop-0.2.1a11 syft-0.2.9 syft-proto-0.5.2 tblib-1.6.0 torch-1.4.0 torchvision-0.5.0 tornado-4.5.3 websocket-client-0.57.0 websockets-8.1 zipp-3.4.0


# Imports and initializing hook 
Let's import torch and initialize the hook. The hook overrides PyTorch's methods so that commands can be executed on workers.

In [27]:
import torch 
import syft as sy
hook = sy.TorchHook(torch)
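
To get a feel for what "overriding methods" means, here is a small sketch of the general technique in plain Python. It only illustrates method patching; `ToyTensor` and `toy_hook` are made-up names, and this is not how `TorchHook` is actually implemented.

```python
class ToyTensor:
    """Stand-in for a tensor class; not torch.Tensor."""

    def __init__(self, values):
        self.values = list(values)

    def add(self, other):
        return ToyTensor(a + b for a, b in zip(self.values, other.values))


call_log = []


def toy_hook(cls, method_name):
    """Replace cls.method_name with a wrapper that records the call
    before delegating to the original method."""
    original = getattr(cls, method_name)

    def wrapper(self, *args, **kwargs):
        # A hook could decide here to forward the call to another machine.
        call_log.append(method_name)
        return original(self, *args, **kwargs)

    setattr(cls, method_name, wrapper)


toy_hook(ToyTensor, "add")
result = ToyTensor([1, 2, 3]).add(ToyTensor([1, 1, 1]))
print(result.values, call_log)
```

The calling code still writes `.add(...)` as before, which is why the hooked library keeps feeling like ordinary PyTorch.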



In [40]:
x = torch.tensor([1,2,3,4,5])
y = x + x
print(y)

tensor([ 2,  4,  6,  8, 10])


The code above should look familiar: it's ordinary PyTorch arithmetic, which requires us to have the data locally. But how can we perform this kind of basic arithmetic on a machine we do not own or have access to?

# Creating a remote machine(worker)
Before we can perform any computation on a remote machine, we need an interface that connects us to that machine. Let us explore one of syft's features by creating a pretend machine that belongs to Joe and sending some data over to perform basic arithmetic.

At this point we are no longer dealing with normal PyTorch tensors alone, but with **pointers** to tensors.

In [41]:
joe = sy.VirtualWorker(hook, id="joe")
print('Joe has:' + str(joe._objects)) 

Joe has:{}


Here we have Joe, which is a worker. Joe is an empty machine right now without any data on it, so let's create some tensors and send them to Joe.

In [42]:
x = torch.tensor([1,2,3,4,5])
y = torch.tensor([1,1,1,1,1])
# sending tensors to joe
x = x.send(joe)
y = y.send(joe)
print(x)
print(y)

(Wrapper)>[PointerTensor | me:45439385298 -> joe:25448987438]
(Wrapper)>[PointerTensor | me:46984909473 -> joe:44741931737]


As we can see, sending a tensor to Joe returns a PointerTensor to the tensor that was sent. This pointer is the basis of remote execution, as it holds information about the tensor we sent. We can use the `._objects` attribute to view the tensors now stored on Joe's machine.

In [43]:
print('Joe has' + str(joe._objects))

Joe has{25448987438: tensor([1, 2, 3, 4, 5]), 44741931737: tensor([1, 1, 1, 1, 1])}


NOTE: A PointerTensor does not hold data itself; it contains metadata about a tensor stored on another machine. The purpose of the PointerTensor is to give us an interface for telling the other machine to compute functions using this tensor.

Let's view the metadata that pointers consist of 

In [44]:
print(x)
print(y)

(Wrapper)>[PointerTensor | me:45439385298 -> joe:25448987438]
(Wrapper)>[PointerTensor | me:46984909473 -> joe:44741931737]


There are two main attributes specific to pointers:

- `x.location : joe`, the location, a reference to the worker that the pointer is pointing to
- `x.id_at_location : <random integer>`, the id under which the tensor is stored at the location

There are also other more generic attributes:
- `x.id : <random integer>`, the id of our pointer tensor, it was allocated randomly
- `x.owner : "me"`, the worker which owns the pointer tensor, here it's the local worker, named "me"
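
The four attributes can be mimicked with a few lines of plain Python. The sketch below is a toy model meant only to show which metadata a pointer carries; `ToyWorker`, `ToyPointer`, and `send` are hypothetical names and do not reflect PySyft's internals.

```python
import random


class ToyWorker:
    """Stand-in for a remote machine: just a dictionary of stored objects."""

    def __init__(self, worker_id):
        self.id = worker_id
        self._objects = {}


class ToyPointer:
    """Holds only metadata about where the data lives, never the data."""

    def __init__(self, owner, location, id_at_location):
        self.id = random.randrange(10**10)    # id of the pointer itself
        self.owner = owner                    # worker holding the pointer
        self.location = location              # worker holding the data
        self.id_at_location = id_at_location  # key of the data over there

    def get(self):
        """Move the data back to the caller and remove it from the worker."""
        return self.location._objects.pop(self.id_at_location)


def send(data, owner, location):
    """Store `data` on `location` and return a pointer to it."""
    remote_id = random.randrange(10**10)
    location._objects[remote_id] = data
    return ToyPointer(owner, location, remote_id)


me = ToyWorker("me")
joe = ToyWorker("joe")
x = send([1, 2, 3, 4, 5], owner=me, location=joe)

print(x.location.id, x.owner.id, x.id_at_location in joe._objects)
print(x.get(), joe._objects)
```

In this toy model, as in the cells that follow, the pointer's `location` is the very same worker object we created, and calling `get()` empties the worker's store.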

In [45]:
x.location

<VirtualWorker id:joe #objects:2>

In [46]:
joe

<VirtualWorker id:joe #objects:2>

In [47]:
joe == x.location

True

In [48]:
x.id_at_location

25448987438

In [49]:
x.owner

<VirtualWorker id:me #objects:0>

Now that we've sent some data to the remote machine for computation, you might ask how we get the data back once the computation is done. We can use the `get()` method to retrieve the tensors we sent, in this case x and y, from Joe's machine.

In [50]:
x = x.get()
y = y.get()
print(x)
print(y)
print('Joe has:' + str(joe._objects))

tensor([1, 2, 3, 4, 5])
tensor([1, 1, 1, 1, 1])
Joe has:{}


As you can see, after getting the data from Joe's machine, the tensors are back on our machine and Joe no longer has them.

# Performing remote arithmetic operations
As we saw in the introduction to PySyft, we can create torch tensors and perform arithmetic on them. The same applies to remote tensors: with syft we can perform arithmetic operations on remote tensors just as we normally do with PyTorch, while still having access to PyTorch functions, including backpropagation. Let's look at a basic arithmetic operation on PointerTensors.

In [51]:
x = torch.tensor([1,2,3,4,5]).send(joe)
y = torch.tensor([1,1,1,1,1]).send(joe)

In [52]:
z = x + y

In [53]:
z

(Wrapper)>[PointerTensor | me:11310952391 -> joe:26732459988]

In [55]:
z.get()

tensor([2, 3, 4, 5, 6])

As expected, z is a PointerTensor. Because the addition was not executed locally but on Joe's machine, the result also lives there, so we need to call the `get()` method to bring the resulting tensor back to us.
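
The flow of "ask the worker to add, receive a new pointer, then fetch the result" can be sketched as a toy model in plain Python. Everything here (`ToyWorker`, `ToyPointer`, `store`, `run_add`) is invented for illustration and is not PySyft's actual mechanism.

```python
import itertools

_ids = itertools.count(1)  # simple id generator for this sketch


class ToyWorker:
    def __init__(self, name):
        self.name = name
        self._objects = {}

    def store(self, value):
        """Keep `value` on this worker and hand back a pointer to it."""
        key = next(_ids)
        self._objects[key] = value
        return ToyPointer(self, key)

    def run_add(self, key_a, key_b):
        """Executed 'remotely': add two stored lists, keep the result here."""
        a, b = self._objects[key_a], self._objects[key_b]
        return self.store([p + q for p, q in zip(a, b)])


class ToyPointer:
    def __init__(self, location, key):
        self.location = location
        self.key = key

    def __add__(self, other):
        # No data is touched locally; we only ask the worker to do the work.
        if other.location is not self.location:
            raise ValueError("both operands must live on the same worker")
        return self.location.run_add(self.key, other.key)

    def get(self):
        return self.location._objects.pop(self.key)


joe = ToyWorker("joe")
x = joe.store([1, 2, 3, 4, 5])
y = joe.store([1, 1, 1, 1, 1])
z = x + y                  # z is another pointer; the sum lives on joe
print(len(joe._objects))   # x, y and the result are all stored on joe
print(z.get())
```

The `+` never sees the numbers on our side. It only passes two keys to the worker, which is why the result has to be fetched explicitly.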

# Torch functions and Backpropagation 

In [56]:
x, y

((Wrapper)>[PointerTensor | me:34301111294 -> joe:78664832368],
 (Wrapper)>[PointerTensor | me:3293199719 -> joe:90815115877])

In [57]:
z = torch.add(x,y)

In [58]:
z

(Wrapper)>[PointerTensor | me:5774831528 -> joe:68221850279]

In [60]:
z.get()

tensor([2, 3, 4, 5, 6])

The syft API gives us access to torch functions and makes the whole operation feel torch-like. It also supports backpropagation and many other features.

In [61]:
x = torch.tensor([1,2,3,4,5.], requires_grad=True).send(joe)
y = torch.tensor([1,1,1,1,1.], requires_grad=True).send(joe)

In [62]:
z = (x + y).sum()

In [63]:
z.backward()

(Wrapper)>[PointerTensor | me:28251579733 -> joe:6502340217]

In [64]:
x = x.get()

In [65]:
x

tensor([1., 2., 3., 4., 5.], requires_grad=True)

In [66]:
x.grad

tensor([1., 1., 1., 1., 1.])
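
The gradient of all ones is what we should expect: z is the sum of every element of x + y, so nudging any single element of x by some amount changes z by the same amount. As a sanity check that does not depend on torch or syft, the sketch below estimates the same gradient with finite differences; the helper `numerical_grad` is written just for this example.

```python
def z(x, y):
    """Same scalar as in the cells above: the sum of the elementwise x + y."""
    return sum(a + b for a, b in zip(x, y))


def numerical_grad(f, x, y, eps=1e-6):
    """Central finite-difference estimate of df/dx_i for every index i."""
    grads = []
    for i in range(len(x)):
        up = list(x)
        up[i] += eps
        down = list(x)
        down[i] -= eps
        grads.append((f(up, y) - f(down, y)) / (2 * eps))
    return grads


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 1.0, 1.0, 1.0, 1.0]
grads = numerical_grad(z, x, y)
print([round(g, 4) for g in grads])
```

Each estimate comes out at approximately 1, matching the `x.grad` returned from Joe's machine.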

# Conclusion

This brings us to the end of the tutorial, and I hope you now have a grasp of the PySyft library for basic remote execution. Beyond federated learning, there are more advanced techniques we can implement with PySyft, such as secure multi-party computation and differential privacy. Unfortunately we won't be able to cover them in this tutorial, but you can always check out the OpenMined community on GitHub.

## Learn more 
- Star the PySyft library on GitHub
- Sign up for the free Udacity course developed by the OpenMined team: [Secure and Private AI](https://www.udacity.com/course/secure-and-private-ai--ud185)

### Save and upload your notebook

Whether you're running this Jupyter notebook on an online service like Binder or on your local machine, it's important to save your work from time to time, so that you can access it later or share it online. You can upload this notebook to your [Jovian.ml](https://jovian.ml) account using the `jovian` Python library.

In [3]:
# Install the jovian library
!pip install jovian --upgrade --quiet

In [4]:
import jovian

In [5]:
project_name = 'introducing privacy preserving tool'

In [None]:
jovian.commit(project=project_name)

[jovian] Attempting to save notebook..


The first time you run `jovian.commit`, you'll be asked to provide an API Key, to securely upload the notebook to your Jovian.ml account. You can get the API key from your Jovian.ml profile page after logging in / signing up.

![api-key](https://github.com/Boluwatifeh/Secure-AI/blob/master/images/api-image.png)

`jovian.commit` uploads the notebook to your Jovian.ml account, captures the Python environment and creates a shareable link for your notebook as shown above. You can use this link to share your work and let anyone (including you) run your notebooks and reproduce your work.