# Train an Image-Classifier using Federated Learning with Differential Privacy
## A complete End-to-End Real World Example - PySyft, PyGrid
* This tutorial aims at demonstrating a high level of privacy using **Secured Federated Learning** and **Differential Privacy based on the Laplace-Mechanism** as part of the **PATE framework**. To tackle a problem similair to a real-world-problem, yet still using a well explored example (the privacy tools are of main interest in this tutorial) it was chosen to train an Image-Classifier on the **[Cifar-10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html)**. In the [original paper on the PATE framework](https://arxiv.org/pdf/1610.05755.pdf), simpler problems such as training a classifier on MNIST and SVHN were tackled.
* Containing 60000 32x32 pixels coloured images of 10 different classes (airplanes, birds, etc.) this should showcase a reasonably similair task to another real-world example such as the training of a classifier for skin-cancer-classification, which heavily relies on sensitive private data. See [Stanford's Skin Cancer Classification with Deep Learning](https://cs.stanford.edu/people/esteva/nature/) for more information on this specific example.
* To be able to give a complete end-to-end example the goal is to also use PyGrid to build a peer-to-peer network which should be used as basis for a coordination of the workers. 
* Besides a tutorial this should also serve as good testing ground for the *"production-readyness"* of PySyft and PyGrid. (Speed, Accuracy, etc.)

### Specific Situation - Real-World-Scenario: 
To make the example as close to a real world production use-case and to best give a quick overview over the PATE framework for DP along with FL, we define a simple usecase along three different parties taking part in this setup. <br>
**We will assume the role of the model provider who also coordinates the procedure given the following parties.** One can imagine that requests from both model providers and data providers can lead to the initiation of the proecss given a neutral party which coordinates the procedure in the future (See *Shortcomings* Section)
  1. **Data Provider:** Owns personal sensitive data. **E.g.:** in this case some photographers who all made pictures of some objects in real life and sorted them (into the ten classes for this examle). His/her goal is to either *create/train a classifier* which automates the sorting of the images for him/her or to *participate in creating a general image classifier which can recognize objects on images in general*. Possibly he/she also wants somebody to *host the classifier to speed up inference time*. 
  2. **Model Provider:** Creates the model, in this case the classifier-model. His/her goal is to use his model-architecture and train it on the photographers (Data Provider) personal data and to then make this trained model available to a broad range of user. 
  3. **Hosting Provider:** Provides computing ressources along possible hosting capabilities. His/her goal is to provide the Data and Model Provider with the neccessary computing power for training and hosting capabilities for deployment. 
  * **How is this example representative for the need/usage of privacy preserving ML?**
    This example portrays some of the key characteristics of privacy-preserving ML. 
      * **Combining Knowledge:** A single photographer wouldn't provide enough data to train a good image classifier and wouldn't have the knowledge to build a classifier. Together with different other (possibly even competing photographers) and a model provider he can train a good model together with the other parties. 
      * **Federated Learning:** He doesn't want his sorted photographs to be accessable by either the other competing photographers or the model creator. (He wants to sell his photographs)
      * **Differential Privacy:** Classifiers as ML-Models in general can memorize specific parts of the training data, making it possible to retrieve information about the training data (or even parts of some datapoints). In this case the photographer doesn't want users of the classifier (e.g.: the competing photographers) to be able to retrieve information about his unique way of photograhping (e.g.: key motives, etc.)  
      * *All the above points are also exactly the critical points when trainng a classifier e.g. on a skin-cancer dataset.*

### The PP-techniques that will be used here:
* **FL** - Federated Learning: We will be using **SMPC-Encrypted Federated Learning** *(For a quick overview see below)*
* **DP** - Differential Privacy: We will be using the **Laplace-Mechanism** as part of the "noisy voting" in the PATE procedure *(More information below)*


### The PP-tools that will be used here: TO BE UPDATE 
* **PySyft**
  * Plans (NOT YET: Currently can't be usefully applied with this setting, see *Shortcomings*)
  * Protocols (NOT YET: Currently can't be usefully applied with this setting, see *Shortcomings*)
  * VirtualWorkers
* **PyGrid** 
  * Gateway (To come)
  * GridNode (To come)
  
**TODO: CHECK SPECIFIC TYPES AND IN WHICH LIBRARIES THEY ARE! (Apparently currently changing)**

Author:
- Nicolas Remerscheid - GitHub: [@NiWaRe](https://github.com/NiWaRe)

References: 
*This example/usecase partly uses code from previous tutorials on PySyft and PyGrid and the Udacity Tutorial on Cifar10-Image-Classification (Part of the Deep Learning Nanodegree):* 
- Andrew Trask: DP and PATE Explanation part of [the Udacity Private AI Course](https://www.udacity.com/course/secure-and-private-ai--ud185) 
- Théo Ryffel: The general PySyft Tutorial Series, notably [Part 12 - Encrypted MNIST](https://github.com/OpenMined/PySyft/blob/master/examples/tutorials/Part%2012%20bis%20-%20Encrypted%20Training%20on%20MNIST.ipynb)
- [Udacity's Cifar 10 Notebook](https://github.com/udacity/deep-learning-v2-pytorch/blob/master/convolutional-neural-networks/cifar-cnn/cifar10_cnn_solution.ipynb)

 
***TODO: LINK SPECIFIC LEARNING MATERIALS AS PREREQUESITS.***

# The PATE Framework
## The assumptions 
Introduced in [this paper](https://arxiv.org/pdf/1610.05755.pdf) the *Private Aggregation of Teacher
Ensembles* (PATE) framework consists of the following basic setup. <br>
**Before:** It is important to note that the PATE framework assumes a certain situation: 
* **Discrete Model Output:** The PATE framework (acoording to the original paper) assumes that a model should be trained that has a discrete output i.e. it categorizes or classifies some input into categories. In this case this is an image-classifier. 
* **Data structure:** The PATE framework assumes that the following data exists and is stored in the following way:
  * **Private Data:** The private data is stored on multiple workers (photographers here) The private data is *labeled* and each worker has *different datapoints* (different photographs). Furthermore the private data isn't sufficient to train a good classifier alone.  
  * **Public Data:** In addition to the private data PATE assumes that there exists a second larger dataset of *unlabeled, public and unsensitive* images. In this case for example this could be very large database of pictures from various newspapers. They aren't labeled and everybody can access them since the newspapers decided to share them (e.g.: they don't sell them directly)
* **Semi-Supervised Learning:** Given the particular datasets that are given PATE assumes that we want to leverage smaller labeled datasets (normally used for supervised learning) together with an unlabeled dataset (normally used for unsupervised learning) to train a final model, meaning we want to train it using semi-supervised learning. 


## The procedure
(For more info see in the learning material section above, notably Andrew Trask's tutorial) <br>
This image shows an overview over the procedure: (from the [the original paper](https://arxiv.org/pdf/1610.05755.pdf))
<img src="./material/PATE_framework_overview_from_paper.png" width="600">

1. STEP: **Multiple Teacher-Models are trained on each of the private datasets**
  * This is also where we want to use **Federated Learning** to not *directly* access the data at all. To make sure that the photographers don't see the model of the model provider (he/she sells that) we'll be encrypting both the model and the data via Additive Secret Sharing using Secure Multi-Party Computing (SMPC). 
2. STEP: **For each unlabeled datapoint in the public dataset teacher-models jointly predict a label**
  * This is where noise is added for **Differential Privacy**. 
3. STEP: **One Student-Model is trained on the newly labeled public dataset**
  * This is where **the final model** is created. 

#  Setting Up (No Grid yet)  !CHECK NEEDED!
## Motivation/Explanation of the approach for the PATE Setup 
Ideally we would like to initiate a PATE procedure which is coordinated completely automatically given some existing workers (which could be searched for with the grid infrastructure) To achieve that we would need to assume the role of a neutral party which can remotely coordinate the process. <br> 
To date certain features don't exist to fully coordinate everything coordinately:  (more in *Shortcomings*) 
* ReLUs, Maxpool, etc. which are one of the most used functions in DNNs can only be computed encrypted on **2 workers** (based on the current implementation of the crypto-protocls [secureNN](https://www.microsoft.com/en-us/research/publication/securenn-efficient-and-private-neural-network-training/) and the SPDZ protocol for adding, mm)
* There is no possibility to `remote_share()` remotely encrypted a tensor which is stored on another machine. <br>(***TODO: Check if solvable***)
For setting up the PATE-Scenario as described above we assume that the data from the data providers and the model from the model provider are distributed for encryption (SMPC) <br>
Despite the given constraints the goal is to make the PATE procedure as realistic as possible: 
* For SMPC **"Hosting Groups"** consisting of 2 workers are created, one for each teacher. The idea is that only one hosting_group consisting of 2 workers is not powerful enough to host the potentially parallel (goal for the future) training of all the different teacher-models. To guarantee that the two workers don't work together for decryption of either the model or some dataset all the groups consist of a teacher (with his dataset) and the student. <br> In the future the goal is to dynamically host additional neutral workers from a third-party provider to support the computation if necessary. 
* *For now* the training of the student-models and the labeling of the student dataset through the noisy voting process happen both in **sequential order** 
* *For now* **no plans or protocols** were implemented for further automation of the procedure, because no alternative for a remote encryption (remote-share) was found. This would be needed e.g. to move the public dataset from hosting group to hosting group automatically to gather the teacher predictions (remote_share() from model provider, after first prediction remote_get() and again remote_share to next hosting group)
    * Nevertheless **the future goal** is to be able to create a completely automated PATE-protocol which only needs to be provided with the participating parties. (teachers, students + *dynamic task-based number of external hosting-workers* )
* *For now* **usage of the grid** (actual workers with a Public or Private Grid) as an added first step for finding of the correct workers to initiate a PATE-procedure. Whith the search functionality data-tags can be searched for and the corresponding workers (potential teachers) can be found (location attribute) The remaining procedure stays the same as with VirtualWorkers and without gateway.  

In [None]:
!pip install syft

In [1]:
import torch 
import torch.nn as nn
import torch.nn.functional as F 
import torch.optim as optim

from torchvision import datasets, models
import torchvision.transforms as transforms
from torch.utils.data.sampler import SubsetRandomSampler

import numpy as np
import syft as sy 
from copy import deepcopy

hook = sy.TorchHook(torch)

In [2]:
# One-Hot Encoding (Copied from @laRiffle Part 12)
def one_hot_of(index_tensor):
        """
        Transform to one hot tensor
        
        Example:
            [0, 3, 9]
            =>
            [[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
             [0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]]
            
        """
        onehot_tensor = torch.zeros(*index_tensor.shape, 10) # 10 classes for Cifar10
        onehot_tensor = onehot_tensor.scatter(1, index_tensor.view(-1, 1), 1)
        return onehot_tensor
    
# TODO: If we want to automated protocoll, we need to assume the role of a neutral instance which means we need 
#       a way to remotely share data 
def remote_share():
    pass

In [3]:
## SIMPLE VERSION ##
 
# TODO: Change explanation: Everything could be executed by neutral player except model distribution 
#       and encryption and training (remote training job and remote_share haven't been implemented for now)
#       For the other tasks the usage of plans doesn't make sense (as seen in notes on iPAD!)

###################
# WORKER CREATION #
###################

# TODO: Extend to be a central worker creation function (introduce variable grid=False)
#       to choose between creating actual grid workers or VirtualWorkers. 
def CREATE_V_WORKERS(w_number, w_name):
    """
        Create multiple, named virtual workers
        
        Args:
            w_number - number of workers to be created 
            w_name - name of worker group 
            
        Return:
            array of the workers 
    """
    worker = []
    for i in range(w_number):
        worker.append(sy.VirtualWorker(hook, id=w_name+str(i)))
    if len(worker) == 1: 
        worker = worker[0]
    return worker

# NOT USED YET 
# TODO: Find heuristic to estimate training effort. (Potentially also take in other arguments)
# TODO: Potentially also consider prediction of public dataset (if public dataset is large)
#       Normally the forward pass for the student dataset labeling is comparable small to training n-teacher-models
def ASSESS_COMP_EFFORD(size_model, size_teacher_ds, size_student_ds):
    """
        This function takes in the number of params of the model to be trained, the average size of the teacher
        datasets and the size of the student dataset. It then estimates the total amount of work necessary for 
        the PATE process using some metric. ...
    """
    comp_effort = 0 
    return comp_effort


# TODO: Potentially only take in teachers and students (from which the model and the dataset can be taken)
def CREATE_HOSTING_GROUPS(teachers, student, model=None, datatest=None, configs=None): 
    """
        This function creates and returns worker hosting groups for hosting the models and the data in a 
        SMPC-Encrypted way to then later use those for PATE process. If needed it dynamically adds hosts based 
        on the computational ressources of the workers and the assessed computational effort. 
        
        Args: 
            teachers - Array of the teacher workers
            student - Single worker 
            model - The model to be used for student and teachers 
            dataset - The data used
            configs - Dict with user specified configs (speed of procedure, etc.) for determining extra hosting
        
        Return: 
            Array of different hosting groups (also arrays)
    """
    
    # Create Groups based on two criteria: 
    # 1. Partners shouldn't be likely to collaborate for security reasons => Model-Owner and Data-Owner
    # 2. Enough computational ressources, otherwise additional hosting workers are created. 
    
    # TODO: Get comp. ressources of teachers and workers (For now: just set to 100)
    comp_res = {teacher.id:100 for teacher in teachers}
    comp_res[student.id] = 100
    
    # TODO: Get comp ressources from host from config file
    host_comp_res = 20
    
    # TODO: Get model and the datasets from the students and the teachers and calc numb_params and size of datasets
    # Get computational effort for task 
    comp_eff = ASSESS_COMP_EFFORD(0, 0, 0)
    
    # Create hosting groups (teacher always first, student always second)
    # TODO: Maybe use dicts instead of ordering 
    hosting_groups = []
    for teacher in teachers: 
        comp_diff = comp_res[teacher.id] + comp_res[student.id] - comp_eff
        if comp_diff >= 0: 
            hosting_groups.append([teacher, student])
        else: 
            # TODO: Dynamic adding of extra hosting providers only possible if secureNN with more than 2
            #       participants is possible
            required_extra = np.ceil(comp_diff / host_comp_res)
            extra_hosts = CREATE_V_WORKERS(required_extra, "hosting_providers")
            hosting_groups.append([teacher, student] + extra_hosts)
            
    return hosting_groups
    
    
##################### 
# Data Distribution #
#####################
 
# This could be done client-side from the data-providers in a real scenario 
# (Or could be initiated automatically by some protocol)

# TODO: Think about doing batching on device, having only one pointer to dataloader 
#       Could decrease communication complexity (Check how FederatedDataset Tut did that)
def SIMPLE_PATE_SETUP(hosting_groups, dataset, batch_size, crypto_provider, ratio=0.4, extra_public_ds=False):
    """
        Given the current implementation of secureNN Model-Encryption via SMPC only two hosts are possible.
        This function securely (SMPC) distributes a single dataset equally across different given 
        hosting groups. 
        
        The function takes an array of hosting groups (a hosting group is an array in the following order: 
        [teacher, student, possibly extra hosting workers]), a dataset, a batch_size, a crypto_provider and a ratio 
        which determines how much of the provided data should be regarded as private teacher data and how much 
        as public student data.
        
        If wanted the user can decide to pass in only the private dataset (setting extra_public_ds=True) 
        and send the public dataset himself to the student. (If for example the user wishes to take data 
        from another distribution for the public dataset)
        
        Args: 
            hosting_groups - Array of different worker groups (also array)
            dataset - Dataset to be used by torch.utils.data.DataLoader
            batch_size, crypto_provider - Self Explanatory 
            ratio - e.g.: 0.4 -> 40% of the total data wil be stored on teachers as sensitive data 
                    the rest on the student 
            extra_public_ds - If True the user can create a public dataset for the student from another dataset 
                              If e.g. tests want to be made where public data is of different distribution 
            
        Return: 
            dist_dataset_teachers - Dict, keys are the teacher.id, values are (inp, target) batches stored as arrays
            dist_dataset_student - Array of (inp) batches 
    """
    # No differentiation between training and testing is made, 
    # because in reality the data providers won't have the data seperated for ML 
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size)
    numb_teachers = len(hosting_groups)
    # increment = len(dataset) / numb_teachers
    # Student is stored in any group, second place 
    student = hosting_groups[0][1]

    # Directly store (input, target) pairs as an array to enumerate over them -> use as dataLoader over PTRs
    # Create dict of teacher pointer arrays, helpful for training afterwards 
    dist_dataset_teachers = {}
    # Instantiate to be able to use .append() in main loop afterwards 
    for hosting in hosting_groups: 
        # The teacher is always stored first in hosting group
        current_teacher = hosting[0]
        dist_dataset_teachers[current_teacher.id] = []
    dist_dataset_student = []
    
    for i, (data, target) in enumerate(dataloader):
        # Teachers: The first 40% (normally less than on public) of the data
        current_group = hosting_groups[i%numb_teachers]
        current_teacher = current_group[0]
        # TODO: Make faster and using index arrays directly split data and share only once 
        if i < int(ratio * (len(dataset)/batch_size)):
            ### PySyft ###
            # TODO: Add descritpion when using PyGrid.
            data = data.tag("cifar", "input").fix_precision().share(*current_group, crypto_provider=crypto_provider, 
                                                                    requires_grad=True)
            # As nn.CrossEntropyLoss can't be used we have to one-hot encode the targets
            target = one_hot_of(target).tag("cifar", "input").fix_precision().share(*current_group, 
                                                                        crypto_provider=crypto_provider, 
                                                                        requires_grad=True)
            # Store the pointers to the send data.
            dist_dataset_teachers[current_teacher.id].append((data, target))
        # Student: The larger part is unlabeled 
        else: 
            if not extra_public_ds:    
                # Directly send to student (should be public dataset) 
                ### PySyft ###
                # FOR NOW: We assume the role of the model owner, for neutral instance we need a remote_share()
                # data = data.tag("cifar", "input").send(student, requires_grad=True)
                # Store the pointers to the send data.
                dist_dataset_student.append(data)
        
    return dist_dataset_teachers, dist_dataset_student 

In [4]:
## SIMPLE VERSION ##

# Create Argument class (as @LaRiflle in Part 12)
class Arguments():
    def __init__(self):
        self.numb_teachers = 3
        self.numb_classes = 10
        # Epsilon we will use for noisy voting
        self.epsilon = 0.1
        # Delta we want for the total privacy leakage analysis AFTER training
        self.delta = 1e-5
        # TODO: Increase before complete test-run (only to have quick forward, backward for testing)
        self.batch_size = 20
        self.test_batch_size = 64
        # TODO: Increase if more time (training faster)
        self.epochs = 2
        self.lr = 0.02
        self.seed = 1
        self.patience = 3
        self.val_size = 0.2
        # self.ratio = 0.4
        # self.log_interval = 1 # Log info at each batch
        # self.precision_fractional = 3
    
args = Arguments()

# TODO: WHY DOESN?T THIS WORK?????????????????????????????????
_ = torch.manual_seed(args.seed)

# Normalize data and convert to torch.FloatTensor
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])
    
# Get Cifar10 Dataset from torchvision.datasets
cifar10_train_data = datasets.CIFAR10('data', train=True,
                              download=True, transform=transform)

# We distribute testing data also on workers, as in a real scenario. 
cifar10_test_data = datasets.CIFAR10('data', train=False,
                              download=True, transform=transform)

# For Testing
sample_input = torch.randn((args.batch_size, 3, 32, 32))*255

# Model Provider (+ Public Data Provider)
student = CREATE_V_WORKERS(1, "model_provider")

# We assume the role of the model provider (initiator of PATE procedure)
# IMPORTANT: Local worker should not be a client worker
#hook.local_worker.is_client_worker = False
#student = hook.local_worker

# Private Data Provider
teachers = CREATE_V_WORKERS(args.numb_teachers, "photographer")

# Crypto Provider 
crypto_provider = CREATE_V_WORKERS(1, "crypto_provider")

# Create Hosting Groups for SMPC-Encryption 
hosting_groups = CREATE_HOSTING_GROUPS(teachers, student)
print(hosting_groups)

# Distribute Data 
PTR_dtrain_teachers, dtrain_student = SIMPLE_PATE_SETUP(hosting_groups, cifar10_train_data, 
                                                            args.batch_size, crypto_provider)

# We assume that the we got a pointer to the total dataset on the worker and then parted the pointers into 
# the training and testing dicts 
# Note: There are now labeled test samples on the student as well for evaluation purposes only
PTR_dtest_teachers, dtest_student = SIMPLE_PATE_SETUP(hosting_groups, cifar10_test_data,
                                                        args.batch_size, crypto_provider)

# CHANGE: This is just for sim purposes: This test-data wouldn't exist normally but is still created for 
# performance evalulation purposes of the final model. (compare it to normal training)
# TODO: Possibly better for evaluation to use complete test data for testing (only store it on teachers)
# Merge part of the test data that was distributed on student together with rest, because 
# unlabeled datapoints can't be used for testing 
#PTR_dtrain_student += temp1 

Files already downloaded and verified
Files already downloaded and verified
[[<VirtualWorker id:photographer0 #objects:0>, <VirtualWorker id:model_provider0 #objects:0>], [<VirtualWorker id:photographer1 #objects:0>, <VirtualWorker id:model_provider0 #objects:0>], [<VirtualWorker id:photographer2 #objects:0>, <VirtualWorker id:model_provider0 #objects:0>]]


In [5]:
## SIMPLE VERSION ##
# TODO: Mini-Test: Make dynamic based on actual args.batch_size and consider non-complete batches 

# As an example: Get the target of the first example in the first batch -> Note: Not one-hot-encoded
print("Data Providers: ")
example_worker = teachers[0].id
print(PTR_dtrain_teachers[example_worker][0][0])
# For Encryption: no single location obviously 
##print(PTR_dtrain_teachers["target"][1][0].location)
##print(PTR_dtrain_teachers["target"][2][0].location)
#print("Pointer on first batch on student: ", dtrain_student[0])

numb_student_batches = len(dtrain_student) + len(dtest_student)
print("Total batches on student:",  len(dtrain_student), " + ", len(dtest_student))

# TODO: Important that they have equal number of datapoints (each from other distribution)
#       Introduce weighting at averaging afterwards?
numb_teacher_train_batches = [len(PTR_dtrain_teachers[worker.id]) for worker in teachers]
numb_teacher_test_batches = [len(PTR_dtest_teachers[worker.id]) for worker in teachers]
print("Total batches on teachers: training: ", numb_teacher_train_batches, " + testing: ", numb_teacher_test_batches)

# Total 60.000 datapoints = 1800 * 20 + 1200 * 20 = 60.000 (batch_size = 20)
total_distributed_batches = numb_student_batches + sum(numb_teacher_train_batches+numb_teacher_test_batches)
print("Distributed: ", total_distributed_batches*args.batch_size, " /60.000 datapoints")

# Quick check if data is rightfully stored on workers
#print(teachers[0]._objects)
#print(student._objects)

Data Providers: 
(Wrapper)>AutogradTensor>FixedPrecisionTensor>[AdditiveSharingTensor]
	-> [PointerTensor | me:70559082793 -> photographer0:99088322026]
	-> [PointerTensor | me:2845580457 -> model_provider0:3833413114]
	*crypto provider: crypto_provider0*
Total batches on student: 1500  +  300
Total batches on teachers: training:  [334, 333, 333]  + testing:  [67, 67, 66]
Distributed:  60000  /60.000 datapoints


In [6]:
## Validation Set Creation ##

# image classes
classes = ['airplane', 'automobile', 'bird', 'cat', 'deer',
           'dog', 'frog', 'horse', 'ship', 'truck']

# TODO: Possibly make function because we need to split twice 
def split_train_val(val_size): 
    pass

PTR_dvalid_teachers = {}
# Shuffle + Train/Val Split
for teacher in teachers:
    # Current batch 
    batches = PTR_dtrain_teachers[teacher.id]
    
    # Randomize Batches (Don't learn a pattern the photograher might have used for sorting)
    np.random.shuffle(batches)

    # Split Val and Train Data 
    numb_batches = len(batches)
    split = int(args.val_size * numb_batches)
    PTR_dvalid_teachers[teacher.id] = batches[:split] 
    PTR_dtrain_teachers[teacher.id] = batches[split:]

    # Test
    print(len(PTR_dvalid_teachers[teacher.id]))
    print(len(PTR_dtrain_teachers[teacher.id]))

# Student - Splitting: Later

66
268
66
267
66
267


## Setup Done. Actual Start.
Now we have all the workers the way we described in the setup above. (Everything in batches, for datapoints " \* self.batch_size") *Little Recap:* 
* `PTR_dtrain_teachers` 802 *labeled* batches are distributed across *3 teacher workers* (= **40% of 2500** batches (60.000 datapoints) in total = *1000* and again **80%** left for training and 20% for validation)
* `PTR_dtest_teachers` 200 *labeled* examples are also distributed across *3 teacher workers*
* `PTR_dvalid_teachers` 198 *labeled* examples for validation during trainig (66 + 66 + 66)
* `PTR_dtrain_student` 1500 (from test distribution) *unlabeled* examples are stored on *1 student worker*
* `PTR_dtest_student` 300 *labeled* examples are also distributed across *1 student* (only for testing)

* `PTR_dvalid_student` array (LATER)

<br> <br>
* We have the different pointers without access to the actual data/model. (*Assumption, Passiv Sec.:* We can't call `.get()` ?)
**IMPORTANT: This is where we would start normally, if real workers existed.**

### Model Creation 
The goal was to **ResNet18** to allow for fast training but to showcase SMPC-encryption with state-of-the-art models (using residual connections, batchnorm, relu, etc.) <br>
*Unfortunately for now this isn't possible, see Shortcomings* <br>
For that reason I used the same model as was used in Udacity's Notebook (for performance comparison afterwards) and a very small one for testing purposes. (both work, the second just takes considerably longer with SMPC)
<br>
<br>
***TODO: MAKE A SECTION: SECURITY ASSUMPTIONS***
* **IMPORTANT TO NOTE:** We could call `.get()` on the pointers to retrieve all the data, but in a real scenario with a gateway this wouldn't be possible. The important part is that we only see pointers and never the actual model or the data.
* The hosting parties could communicate with each other and make it possible to decrypt a tensor which was safed via SMPC on the hosting providers. (given they can identify which of the parts belong to each other, e.g. by recording the time when they were received along with some network model, knowledge)
<br>

In [7]:

################################# 
# Model Creation & Distribution #
#################################

# (Public dataset is already on student (local_worker) #
# Normally the model would also be send to the local_worker and then form there to the 
# different hosting_groups, but there isn't anything like .remote_share() for now
# (If there would be, we could assume a neutral role and coordinate the complete process, which then could also be
#  done completely automatically)

class Small_Model(nn.Module):
    def __init__(self):
        super(Small_Model, self).__init__()
        self.fc1 = nn.Linear(3 * 32 * 32, 64)
        self.fc3 = nn.Linear(64, 10)

    def forward(self, x):
        # 3 Channels, 32x32 images
        x = x.view(-1, 3 * 32 * 32)
        x = F.relu(self.fc1(x))
        x = self.fc3(x)
        return x

# Copied and changed from Udacity's Notebook 
# define the CNN architecture
class Cifar10Model(nn.Module):
    def __init__(self):
        super(Cifar10Model, self).__init__()
        # convolutional layer (sees 32x32x3 image tensor)
        self.conv1 = nn.Conv2d(3, 16, 3, padding=1)
        # convolutional layer (sees 16x16x16 tensor)
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        # convolutional layer (sees 8x8x32 tensor)
        self.conv3 = nn.Conv2d(32, 64, 3, padding=1)
        
        # max pooling layer 
        # PR: !!! RuntimeError: "max_pool2d_with_indices_cpu" not implemented for 'Long'
        # Should work as part of secureNN protocol
        #self.pool = nn.MaxPool2d(2, 2)
        
        # Conv with 1x1 kernel is the same as average pool -> good alternative 
        # TODO: Check number of params this alternative approach changes 
        self.pool1 = nn.Conv2d(16, 16, kernel_size=2, stride=2)
        self.pool2 = nn.Conv2d(32, 32, kernel_size=2, stride=2)
        self.pool3 = nn.Conv2d(64, 64, kernel_size=2, stride=2)
        # linear layer (64 * 4 * 4 -> 500)
        self.fc1 = nn.Linear(64 * 4 * 4, 500)
        # linear layer (500 -> 10)
        self.fc2 = nn.Linear(500, 10)
        # dropout layer (p=0.25)
        self.dropout = nn.Dropout(0.25)

    def forward(self, x):
        # add sequence of convolutional and max pooling layers
        x = self.pool1(F.relu(self.conv1(x)))
        x = self.pool2(F.relu(self.conv2(x)))
        x = self.pool3(F.relu(self.conv3(x)))
        # flatten image input
        x = x.view(-1, 64 * 4 * 4)
        # add dropout layer
        x = self.dropout(x)
        # add 1st hidden layer, with relu activation function
        x = F.relu(self.fc1(x))
        # add dropout layer
        x = self.dropout(x)
        # add 2nd hidden layer, with relu activation function
        x = self.fc2(x)
        return x

cifar10_model = Small_Model()
#print(cifar10_model)

# Copy and distribute to all hosting_groups
PTR_models = {}
for i, hosting_group in enumerate(hosting_groups): 
    # TODO: Could also use model as a plan (any advantages in this scenario?)
    # Create new model instance 
    cifar10_model_copy = type(cifar10_model)() 
    # Load actual params 
    cifar10_model_copy.load_state_dict(cifar10_model.state_dict()) 
    # Index models also with the teachers.id (the same for as for the data!)
    current_teacher = hosting_group[0]
    PTR_models[current_teacher.id] = cifar10_model_copy.fix_precision().share(*hosting_group, 
                                                      crypto_provider=crypto_provider, requires_grad=True)
    
print(PTR_models)

{'photographer0': Small_Model(
  (fc1): Linear(in_features=3072, out_features=64, bias=True)
  (fc3): Linear(in_features=64, out_features=10, bias=True)
), 'photographer1': Small_Model(
  (fc1): Linear(in_features=3072, out_features=64, bias=True)
  (fc3): Linear(in_features=64, out_features=10, bias=True)
), 'photographer2': Small_Model(
  (fc1): Linear(in_features=3072, out_features=64, bias=True)
  (fc3): Linear(in_features=64, out_features=10, bias=True)
)}


In [None]:
## Nr.2 - DO NOT EXECUTE ##

In [None]:
## Nr.3 - TESTING ##

In [None]:
## Nr.4 - TESTING ##

In [None]:
## Nr.5 - TESTING ##

In [None]:
## Nr.6 - DO NOT EXECUTE ##

## PATE Step 1: Traning Teacher Models 
***TODO: Possibly add more explanation.***
Here FL learning will be done. <br>
* **First** the **Sequential approach:** 
    * train one student model and produce labels for student dataset and store them on teacher
    * do this for all students one after the other
    * then to create the final labeling, cound the number a class was predicted by the teachers. Then add laplacian noise to every label count to then take the majority label (the label that was predicted the most among the teachers) | *the whole = laplace mechanism*
    * finally train the model on the newly labeled public dataset. 


In [8]:
# First start with a sequential approach: Parallel exection for protocols isn't possible for now 
# + I first have to checkout in more detail the serde aproach for serilization (remote training plans)
# Also deepcopy of the model and afterwards automatic encryption isn't possible because remote_share() doesn't exist

def criterion(pred, target):
            # loss = F.nll_loss(output, target)  <-- not possible here
            batch_size = pred.shape[0]
            loss = ((pred - target)**2).sum().refresh()/batch_size
            return loss

### Sequential approach ###
def run_epoch(model, dataloader, train, optimizer=None, criterion=None):
    for i, (inp, target) in enumerate(dataloader):

        # Reset Optimizer 
        optimizer.zero_grad()
        
        # Forward Pass
        out = model(inp)

        # Loss
        loss = criterion(out, target)
        
        # Only train if function is used for training 
        if train: 
            # Backward Pass
            loss.backward()
            # Optimizer Step 
            optimizer.step()
        
        # Stats
        true_classes = torch.argmax(target, dim=0).get().float_precision()
        pred_classes = torch.argmax(out, dim=0).get().float_precision()
        
        numb_correct = sum(torch.eq(true_classes, pred_classes)).item()
        batch_size = inp.shape[0]
        epoch_acc = numb_correct/batch_size
        
        # get and convert tensor into normal variable
        epoch_loss = loss.get().item()

    return epoch_loss, epoch_acc

# TODO: Do test as well 
def test(): 
    pass

In [None]:
## Train the teacher models ##
for i in range(args.numb_teachers):
    print("#### Start training of teacher-model Nr. ", i, " ####")
    # Choose the model 
    model = PTR_models[teachers[i].id]
    
    # Create an optimizer for each model
    optimizer = optim.SGD(model.parameters(), lr=args.lr)

    # Optimizer has to also have int values (see Part 12 Tutorial)
    optimizer = optimizer.fix_precision() 
    
    # Choose the dataset of the corresponding teacher 
    train_loader = PTR_dtrain_teachers[teachers[i].id]
    valid_loader = PTR_dvalid_teachers[teachers[i].id]
    
    # patience for early stoppping 
    current_patience = 0
    best_acc = 0
    best_params = {}
    
    for epoch in range(args.epochs):
        # Training
        train = True
        model.train()
        train_loss, train_acc = run_epoch(model, train_loader, train, optimizer, criterion)
        print(f"Epoch {epoch + 1: >3}/{args.epochs}, train loss: {train_loss:.2e}, accuracy: {train_acc * 100:.2f}%")
        
        current_patience += 1
        
        # Evaluation after each epoch for stats and for hyperparam tuning (epochs through early stopping for now)
        train = False
        model.eval()
        val_loss, val_acc = run_epoch(model, valid_loader, train, optimizer, criterion)
        print(f"Epoch {epoch + 1: >3}/{args.epochs}, val loss: {val_loss:.2e}, accuracy: {val_acc * 100:.2f}%")
        
        # Early stopping 
        #if val_acc > best_acc: 
            # reset timer
        #    current_patience = 0
        #    best_acc = val_acc
            
            # TODO: Neither of the both approaches work (Think about how it should work in theory with pointers)
            # Copy current best model, simple assignment won't work because it 
            # only copies a pointer on model which is still trained
            # normal copy.copy() isn't enough because dict has nested structure -> copy.deepcopy()
            #best_model = deepcopy(model.state_dict())
            
            # Create new model instance 
            #best_model = type(model)() 
            # Load actual params 
            #best_model.load_state_dict(model.state_dict()) 
        
        #if current_patience > args.patience: 
        #    break
        
    # Load final best model (Scope in Python egal!!)
    #model.load_state_dict(best_model)


#### Start training of teacher-model Nr.  0  ####
Epoch   1/2, train loss: 7.90e+02, accuracy: 10.00%


## PATE Step 2: Label Public Dataset with aggregated teacher-predictions
* *Quick recap:*
    * For each teacher a model was trained and is stored on the corresponding hosting group
    * A public dataset still resides unencrypted on model owner device (in this case a simple variable locally)
* **Adding DP, Noise:**
    * *1. Approach:* Adding noise to labels on respective hosting_groups, that way no raw label is ever stored on this device. (Given the model owner or possibly a neutral instance doesn't call .get())
    * *2. Approach:* First storing all raw labels from all different teachers and then add noise after the max-function. This would probably require less noise to be added for an equal privacy leakage (see Andrew Trasks tutorial series on that or PATE paper) but possibly respresent a greater security risk as the raw labels are stored unencrypted on a device other than the teacher's device. 
    * **CHOICE:** In this example we go for the first approach which should give us better performance at a lower privacy leakage. <br> ***TODO: THINK ABOUT HOW TO AGGREGATE WITHOUT TRUSTED CURATOR, EXAMPLE WITH ACTIVE SEC.***
   
*This could be executed in parallel while other teacher models are still being trained. (That's why has been put in an extra cell)* 

In [None]:
## SIMPLE VERSION (Sequ.) ##

# TODO: Possibly put the for loop to get the PTR_preds in this function in more general utility function 
#       Also in general: try to create more general purpose functions out of the capital functions 
def PATE_GET_LABELS(): 
    """
    """
    return 0

def DP_NOISY_VOTING(teacher_preds, num_classes, epsilon): 
    """
        Does "noisy voting" (ref. PATE paper above) given the label predictions of all participating teachers. 
        Implements Laplace Mechanism: Query on teacher_preds sample (extract majority label) and add laplace noise.
        
        Args: 
            teacher_preds - Matrix: num_teachers x num_samples
            num_classes - Int indicating how many classes exist
            epsilon - Int indicating the desired privacy budget (epsilon, delta)
        
        Returns: 
            Array of final noisy aggregated labels 
    """
    
    # Adding noise through laplace mechanism (based on tutorial from Andrew Trask) 
    new_labels = list()
    # Iterate through all samples (columns)
    for i in range(teacher_preds.shape[1]):
        # All different teacher predictions for a sample
        # TODO: Change to do everything in tensors (find alternative for np.bincount())
        # FloatTensor -> Numpy 
        preds = teacher_preds[:, i].numpy().astype(int)
      
        # Gives back array with counts of all values that exist in preds 
        # (array should be min of length num_classes so that all classes, even with 0 occurences, are present)
        label_counts = np.bincount(preds, minlength=num_classes)

        # laplace distr. param (delta = 0 for laplacian noise)
        beta = 1 / epsilon

        # label_counts should never have more than num_classes: Otherwise something wrong with preds 
        for i in range(num_classes):
            # loc = 0, scale = beta, output_dim = scalar
            label_counts[i] += np.random.laplace(0, beta, 1)

        new_label = np.argmax(label_counts)

        new_labels.append(new_label)
    return new_labels 

# Store pointers to predicted labels
teacher_preds = {}

for i in range(args.numb_teachers):
    # Select current variables
    teacher = teachers[i].id
    print("#### Starting Prediction Teacher: ", teacher, " ####")
    
    model = PTR_models[teacher]
    hosting_group = hosting_groups[i]
    
    # Move public dataset to correct hosting group: share (also encrypts public data in case it is sensitive)
    # TODO: Possibly also consider letting the test set also be labeled by the teachers
    PTR_dtrain_student = []
    for i, data in enumerate(dtrain_student):
        PTR_dtrain_student.append(data.fix_precision().share(*hosting_group, crypto_provider=crypto_provider))
    
    # Make predictions
    teacher_preds[teacher] = []
    for i, data in enumerate(PTR_dtrain_student):
        #print("Predicting Batch Nr.", i)
        # Run Forward Pass
        batch_out = model(data)
        
        # Get batch predictions
        batch_pred = torch.argmax(batch_out, dim=0)
        
        # (Append would create array of arrays, we want one array)
        teacher_preds[teacher] += batch_pred.get().float_precision()
    
    # Get student datasets back for next round
    dtrain_student = []
    for i, data in enumerate(PTR_dtrain_student):
        dtrain_student.append(data.get().float_precision())
    
    # TODO: Better than trusted curator: Running average + add noise ? (But first on still unprotected)

# Noisy aggregation of labels will be saved in array
# TODO: Think of better way to get all labels before aggregation 
#       (Secure Aggregator would require remote_share given current setup)
    
# Create prediction matrix: teachers x predictions
tuple_of_preds = [teacher_preds[teacher] for teacher in teacher_preds.keys()]
teacher_preds = np.stack(tuple_of_preds, axis=0)

# Do the actual noisy labeling
noisy_labels = torch.Tensor(DP_NOISY_VOTING(teacher_preds, args.numb_classes, args.epsilon))

# Merge labels into unlabeled student dataset 
temp = []
for i, data in enumerate(dtrain_student):
    corresp_targets = noisy_labels[i:i+args.batch_size]
    temp.append((data, corresp_targets))
    
dtrain_student = temp

In [None]:
### TEMPORARY ###

print(len(dtrain_student[-1]))
print(len(dtrain_student[-1]))
print(len(dtrain_student)*args.batch_size)
print(len(PTR_dtrain_teachers) * args.batch_size)

In [None]:
### TEMPORARY ###

# Create prediction matrix: teachers x predictions
test = (np.random.rand(len(dtrain_student)*args.batch_size, len(teachers)) * 10).astype(int)
print(test.shape)
#tuple_of_preds = [teacher_preds[teacher] for teacher in teacher_preds.keys()]
#teacher_preds = np.stack(tuple_of_preds, axis=1)

# TODO: Convert to Tensor earlier 
# Do the actual noisy labeling
noisy_labels = torch.LongTensor(DP_NOISY_VOTING(test, args.numb_classes, args.epsilon))
print(len(noisy_labels))
# Merge labels into unlabeled student dataset 
temp = []
for i, data in enumerate(dtrain_student):
    inc = i+args.batch_size
    # If not enough elements for whole batch, cut off (otherwise [] is returned and the target batch is larger than the data batch)
    if inc > len(noisy_labels):
        inc = len(noisy_labels) + 1
    corresp_targets = noisy_labels[i:inc]
    temp.append((data, one_hot_of(corresp_targets)))
    print("Data", data.shape)
    print("Target", corresp_targets.shape)
    
dtrain_student = temp

## PATE Step 3: Train Student Model
Here Student Model will be trained 

In [None]:
# As for training the student models: split training into train and valid
dvalid_student = []
batches = dtrain_student
    
# Randomize Batches
np.random.shuffle(batches)

# Split Val and Train Data 
numb_batches = len(batches)
split = int(args.val_size * numb_batches)
dvalid_student = batches[:split] 
dtrain_student = batches[split:]

# Test
print(len(dvalid_student))
print(len(dtrain_student))

In [None]:
# First start with a sequential approach: Parallel exection for protocols isn't possible for now 
# + I first have to checkout in more detail the serde aproach for serilization (remote training plans)
# Also deepcopy of the model and afterwards automatic encryption isn't possible because remote_share() doesn't exist

def criterion_stud(pred, target):
            # loss = F.nll_loss(output, target)  <-- not possible here
            batch_size = pred.shape[0]
            loss = ((pred - target)**2).sum()/batch_size
            return loss

### Sequential approach ###
def run_epoch_stud(model, dataloader, train, optimizer=None, criterion=None):
    for i, (inp, target) in enumerate(dataloader):

        # Reset Optimizer 
        optimizer.zero_grad()
        
        # Forward Pass
        out = model(inp)

       # print("Inp", inp.shape)
       # print("Target", target.shape)
        
        # Loss
        loss = criterion_stud(out, target)
        
        # Only train if function is used for training 
        if train: 
            # Backward Pass
            loss.backward()
            # Optimizer Step 
            optimizer.step()
        
        # Stats
        true_classes = torch.argmax(target, dim=0)
        pred_classes = torch.argmax(out, dim=0)

        numb_correct = sum(torch.eq(true_classes, pred_classes)).item()
        batch_size = inp.shape[0]
        epoch_acc = numb_correct/batch_size
        
        # get and convert tensor into normal variable
        epoch_loss = loss.item()

    return epoch_loss, epoch_acc

# TODO: Do test as well 
def test(): 
    pass

In [None]:
# TODO: Implement training of student model (Potentially other train configs could be used, change ARG-Class)

# Same as for student training but now training can happen on device.
# Choose the model 
#student_model = Cifar10Model()
student_model = Small_Model() 
    
# Create an optimizer for each model
optimizer = optim.SGD(student_model.parameters(), lr=args.lr)
    
for epoch in range(args.epochs):
    print("#### Start training of student model ####")
    # Training
    train = True
    student_model.train()
    train_loss, train_acc = run_epoch_stud(student_model, dtrain_student, train, optimizer, criterion_stud)
    print(f"Epoch {epoch + 1: >3}/{args.epochs}, train loss: {train_loss:.2e}, accuracy: {train_acc * 100:.2f}%")
        
    #current_patience += 1
    
    # Evaluation after each epoch for stats and for hyperparam tuning (epochs through early stopping for now)
    train = False
    student_model.eval()
    val_loss, val_acc = run_epoch_stud(student_model, dvalid_student, train, optimizer, criterion_stud)
    print(f"Epoch {epoch + 1: >3}/{args.epochs}, val loss: {val_loss:.2e}, accuracy: {val_acc * 100:.2f}%")
        
    # Early stopping 
    #if val_acc > best_acc: 
        # reset timer
     #   current_patience = 0
     #   best_acc = val_acc
        # Copy current best model, simple assignment won't work because it 
        # only copies a pointer on model which is still trained
     #   best_model = deepcopy(model.state_dict())
        
    #if current_patience > args.patience: 
    #    break
        
# Load final best model: 
#student_model.load_state_dict(best_model)

## Testing of PATE Procedure 
* **Evaluation:**
    * *Privacy Leakage:* PATE Analysis from syft library
    * *Model Performance Evaluation:* Test student model (and compare to alternative -> Udacity Notebook)
    * *Speed Evaluation:* How much more time did this take? 
* **OLD:**
    * Here using the testing infrastructure the PATE procedure will be tested. <br>
    * The goal is to implement the PATE procedure as a `PySyft.protocoll` using `PySyft.plans` which also should be tested as a general automated procedure before deploying everything on actual workers and a grid. 

In [None]:
# This calculates the privacy leakage using the PATE approach above: 
# From the pate.py file
# preds: a torch tensor of dim (num_teachers x num_examples). Each value corresponds to the
#        index of the label which a teacher gave for a specific example
# indices: a torch tensor of dim (num_examples) of aggregated examples which were aggregated using
#          the noisy max mechanism.
# noise_eps: the epsilon level USED to create the indices
# delta: the DESIRED level of delta
from syft.frameworks.torch.dp import pate

# No Float inputs
data_dep_eps, data_ind_eps = pate.perform_analysis(teacher_preds=teacher_preds.type(torch.LongTensor),
                                                   indices=noisy_labels.type(torch.LongTensor), 
                                                   noise_eps=args.epsilon, delta=args.delta)
print("Data Independent Epsilon:", data_ind_eps)
print("Data Dependent Epsilon:", data_dep_eps)

In [None]:
# Model Performance Testing 

In [None]:
# Speed Evaluation (gathered data from training and testing)

# Performance Evaluation (Combine with above)
* **Performance Evaluation:** *Depending on time can be extended or not*
  Analyse Performance of Decentralized Training with normal Training, Possible metrics: 
    * Training Time: Convergence Speed, Compuation Time (incl. communication, etc.)
    * Model-Performance: Relevant Metrics (Accuracy, etc.)
    * Privacy-Leakage-Anlysis in different security situations (active and passive security)
    * Model-Robustness: Impact of decentralized 
    * Model-Fairness: Impact of decentralized 

In [None]:
# Possibly just use the Udacity Notebook as comparision for performance 

# The final project on the grid (INCORPORATE NEW STRATEGY IN NOTES)
Here the final real end-to-end example of photographers training a image-classfier together with a model provider should be showcased using an actual grid, whith a gateway and workers which could actually run on remote machines. 

In [None]:
# TODO: Setting up a grid with workers + documentation + further links to grid tutorials 

In [None]:
# The main difference: we have a gateway and can automate the matching process + add an initiation/request func.

# Shortcomings 
## What can't be done
Why for now the PATE framework was implemented using **two host-providers** (and not as originally thought using all participating parties and only dynamically add hosts if needed for computation) and **sequentially** (the training of the teachers and the prediction on the public set could happen in parallel and would greatly speed overall execution time of the PATE procedure)
* **ReLU** can't yet be applied if more than two different workers are used for encryption 
* For automatic execution or creation of a completely automated protocol: 
    * No way of remotely encrypting data (no remote_share() for example). *Relevance:* Even if one assumes that every worker encrypts his data on his own at the beginning, for executing PATE in a parallel manner one would need to copy (after calling remote_get() to unencrypt the model) and re-encrypt (here remote_share() would be good) the model to then be able to train different teacher models in parallel. An alternative to a remote_share() function could be to alter the current `Plan` to support more than just tensor arguments. (Workers for example)
    * No stable way of creating a training plan. (Some experimental, limited work) <br> ***TODO: NEED TO FURTHER CHECK OUT THE WORK***
* ResNet can't be used because relu() with inplace param is not supported 
* MaxPooling isn't supported either (problem with secureNN implementation?)
* CUDA isn't possible with SMPC-encrypted model and data? 
    * In Google Colab it was tested to add data and model directly to cuda device -> produced error. (Float, Long Problem, ***TODO: DIG DEEPER***)
* CrossEntropyLoss can't be used! **NO:** F.nll_loss(output, target)
* Deepcopy or .load_state_dict() both fail for early stopping -> ***TODO: Better understand how it should work in theory with pointers***
* Training of Students uses VERY MUCH RAM. (Google Colab Session stopped after first epoch of second student because apparently RAM of 12.21 GB wasn't sufficient)
* Pate-Analysis also takes very much time on big datasets (timing data coming, for now 30 min still running)

## What wasn't done (yet)
* Parallelization: ***LINK TO EXPERIMENTAL NOTEBOOK IMPLEMENTING PARALLEL WORKERS***

# OLD CODE 
## Possibly interesting parts to reuse 

In [None]:
## Nr.1 - DO NOT EXECUTE ##
# Note: 
# - Federated Dataset could be used 
# - data attribute with VirtualWorkes can't be used to not use efficient data loading of DataLoader
#   everything would be stored in one place....think further about that (relook at forum)
# - Implement GPU Usage 
# - DOC. better

# TODO: Potentially extend the function to support creation of actual workers with additional argument 
#       Possibly create a worker-class with different worker options with and without grid-usage 
def CREATE_V_WORKERS(w_number, w_name):
    """
        Create multiple, named virtual workers
    """
    worker = []
    for i in range(w_number):
        worker.append(sy.VirtualWorker(hook, id=w_name+str(i)))
    if len(worker) == 1: 
        worker = worker[0]
    return worker

# TODO: ENCRYPTED=False (case in reality) NOT POSSIBLE, because neither with the PyGrid nor with PySyft a 
#       remote-share() has been developed beyond prototyping. For now: 
#       Train-Config torchscript-based jit-serialization or serde-based FL-Plans have been prototyped so far 
#       -> Look into that. (+ supported hosting if necessary)
# TODO: Think about a better way to pass in all the particpating parties (incl. crypto_Provider)
def PATE_DATA_SETUP(teachers, student, model_owner, dataset, batch_size, crypto_provider,
                    ratio=0.4, encrypted=True, hosting=False):
    """
        MAINLY FOR TESTING PURPOSES (having a centrally stored dataset)
        Takes in teachers and a student as workers, a dataset, a batch_size and a ratio
        The ratio specifies the percentage of the total data that gets stored on the teachers 
        Returns a a student and a teacher dataset consisting of pointers on the either already encrypted data 
        or the equally split data between the teachers. Encryption is done using all participating parties. 
        In a further step (if encrypted=False) a grid could remotely encrypt and/or support the teachers/students
        with hosting nodes to enhance computational power. 
    """
    # No differentiation between training and testing is made, 
    # because in reality the data providers won't have the data seperated for ML 
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size)
    numb_workers = len(teachers)+1
    increment = len(dataset) / numb_workers

    # Directly store (input, target) pairs as an array to enumerate over them -> use as dataLoader over PTRs
    # Create dict of teacher pointer arrays, helpful for training afterwards 
    dist_dataset_teachers = {}
    # Initiate to be able to use .append() in main loop afterwards 
    for worker in teachers: 
        dist_dataset_teachers[worker.id] = []
    dist_dataset_student = []
    
    # TODO: Dynamic Hosting Provider: Create only if needed (Think of return!)
    hosting_provider = {}
    if hosting: 
        pass
        #hosting_provider["hosting"] = CREATE_V_WORKERS().....
    else: 
        pass
        #hosting_provider["hosting"]  = dataproviders.....
      
    # Create array of all participating workers, [] because all elements need to lists to concatenate 
    workers = teachers + [student] + [model_owner]
    # Neutral party should provide necessary numbs for encryption
    crypto_provider = CREATE_V_WORKERS(1, "crypto_provider")
    
    for i, (data, target) in enumerate(dataloader):
        # Teachers: The first 40% (normally less than on public) of the data
        # TODO: Iterate through teachers to distribute data equally (Possibly more efficient way?)
        current_worker = teachers[i%len(teachers)]
        if i < int(ratio * (len(dataset)/batch_size)):
            if encrypted: 
                ### PySyft ###
                # TODO: Add descritpion when using PyGrid.
                data = data.tag("cifar", "input").fix_precision().share(*workers, crypto_provider=crypto_provider)
                target = target.tag("cifar", "input").fix_precision().share(*workers, crypto_provider=crypto_provider)
            else: 
                ### PySyft ###
                data = data.tag("cifar", "input").send(current_worker)
                target = target.tag("cifar", "target").send(current_worker)
            # Store the pointers to the send data.
            dist_dataset_teachers[current_worker.id].append((data, target))
        # Student: The larger part is unlabeled 
        else:    
            # Doesn't need to be encrypted 
            ### PySyft ###
            data = data.tag("cifar", "input").send(student)
            # Store the pointers to the send data.
            dist_dataset_student.append(data)
        
    return dist_dataset_teachers, dist_dataset_student 

# Normalize data and convert to torch.FloatTensor
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])
    
# Get Cifar10 Dataset from torchvision.datasets
cifar10_train_data = datasets.CIFAR10('data', train=True,
                              download=True, transform=transform)

# We distribute testing data also on workers, as in a real scenario. 
cifar10_test_data = datasets.CIFAR10('data', train=False,
                              download=True, transform=transform)

# Args
batch_size = 20 
ratio = 0.4
sample_input = torch.randn((batch_size, 3, 32, 32))*255

# Model Provider
model_provider = CREATE_V_WORKERS(1, "model_provider")

# Private Data Provider
teachers = CREATE_V_WORKERS(3, "photographer")

# Public Data Provider
student = CREATE_V_WORKERS(1, "public_cloud")

# Crypto Provider 
crypto_provider = CREATE_V_WORKERS(1, "crypto_provider")

# All workers
all_workers = [model_provider] + teachers + [student]

# Distribute Data 
PTR_dtrain_teachers, PTR_dtrain_student = PATE_DATA_SETUP(teachers, student, model_provider, 
                                                          cifar10_train_data, batch_size, crypto_provider)

# We assume that the we got a pointer to the total dataset on the worker and then parted the pointers into 
# the training and testing dicts 
# Note: There are now labeled test samples on the 
PTR_dtest_teachers, PTR_dtest_student = PATE_DATA_SETUP(teachers, student, model_provider, 
                                                        cifar10_test_data, batch_size, crypto_provider)

# CHANGE: This is just for sim purposes: This test-data wouldn't exist normally but is still created for 
# performance evalulation purposes of the final model. (compare it to normal training)
# TODO: Possibly better for evaluation to use complete test data for testing (only store it on teachers)
# Merge part of the test data that was distributed on student together with rest, because 
# unlabeled datapoints can't be used for testing 
#PTR_dtrain_student += temp1

In [None]:
## Nr.2 - DO NOT EXECUTE ##

### AGAIN: NOT POSSIBLE to start like that because no remote encrypting is possible
class Cifar10_model(sy.Plan): 
    def __init__(self):
        super(Cifar10_model , self).__init__()
        # ONLY WORKS WITH TWO PARTIES: FILE PR!!!!!
        # syft/frameworks/torch/mpc/securenn.py in relu_deriv(a_sh), line 386 "too many values to unpack (expected 2)
        resnet = models.resnet18(pretrained=False)
        # Change network classifier to have 10 output classes (standard has 1000)
        # See Transfer Learning Example from PyTorch 
        resnet.fc = nn.Linear(512, 10)
        self.resnet = resnet
        
    def forward(self, x): 
        return self.resnet(x)

# Model Provider
model_provider = CREATE_V_WORKERS(1, "model_provider")
# Create model as plan
cifar10_model = Cifar10_model()
# Build plan with arbitrary input: (batch_size, channels, img_w, img_h) * intensitie value (not necessary)
cifar10_model.build(sample_input)
# Check if successful
print("Build successful: ", cifar10_model.is_built)
# Send model to model provider
PTR_model = cifar10_model.tag("model", "cifar10").send(model_provider)
# Check if successful (ptr on model plan stored on worker)
print(PTR_model)


# TODO: CHECK if always true: Models aren't directly tagable or sendable -> First convert to plan

In [None]:
### TESTING ###

# Test Forward Pass: Input of first batch of first teacher (batch are stored as (inp, target) tupels)
sample_input_2 = PTR_dtrain_teachers[teachers[0].id][0][0]
sample_target_2 = PTR_dtrain_teachers[teachers[0].id][0][1]
sample_model = Cifar10Model()
sample_model = sample_model.fix_precision().share(*hosting_groups[0], crypto_provider=crypto_provider, 
                                                  requires_grad=True)
# WHY CAN'T I ACCESS THE MODEL POINTER?
# Check if on same hosting_group
#assert(sample_input.child.child.child.keys() == sample_model.child.child.child.keys())
print(sample_input_2.shape)
PTR_out = sample_model(sample_input_2)
print("Forward Pass: ", PTR_out.shape)

In [None]:
### TESTING ###

#Sample Forward Pass works - Now test criterion, optmizer and Backward Pass

def criterion(pred, target):
    # loss = F.nll_loss(output, target)  <-- not possible here
    batch_size = pred.shape[0]
    loss = ((pred - target)**2).sum().refresh()/batch_size
    return loss

#criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(sample_model.parameters(), lr=0.01)
optimizer = optimizer.fix_precision() 

loss = criterion(PTR_out, sample_target_2)
# Backward Pass
loss.backward()
# Optimization Step
optimizer.step()

In [None]:
### TESTING ###

# Stats 
#loss = loss.get().float_precision()
# Get predicted class per sample in batch 
print(sample_target_2.shape)
PTR_target_class = torch.argmax(sample_target_2, dim=1).get().float_precision()
PTR_pred_class = torch.argmax(PTR_out, dim=1).get().float_precision()
# DOCH: SONST GEHT AUTOGRAD DINGS AUCH NICHT WEG!!!!!
# Giant numbers because still fixed_precision numbers (could call float_precision but doesn't matter for comparison)
print(PTR_pred_class)
print(PTR_target_class)
# Check how much are correct 
print(PTR_out.shape)
print(sample_target_2.shape)
temp2 = torch.eq(PTR_target_class, PTR_pred_class)
#temp2 = temp2.get()
print(temp2)
# Get the 
numb_correct = sum(temp2).item()
batch_size = PTR_out.shape[0]
print(batch_size)
bach_acc = numb_correct/batch_size

print("Loss: ", loss.get().item())
print("Batch_Accuracy: ", bach_acc) 

In [None]:
## DO NOT EXECUTE ##

### OLD ###

# DOING FL WITHOUT BEING A PARTICIPANT AND WITHOUT HELP OF PYGRID VERY COMPLICATED BECAUSE REMOTE ENCRYPTION
# DOESN'T WORK (.share()) PLANS ONLY WORK WITH TENSORS...
# DOING .SHARE MANUALLY WOULD REQUIRE KNOWING WHAT WORKERS EXIST, etc. -> THAT'S THE MAIN FUNCT. OF PYGRID!

### Create DataLoaders consisting of pointers to encrypted data and model on hosting providers ###
# TODO: Create a scenario where data and model are SMPC encrypted without hosting provider and 
#       hosting provider are only used if necessary for computation (Also Read SDPZ-Protocol) 
#       Also possibly use FederatedTrainingLoader from PySyft

# SMPC-encrypt datasets and model using the hosting provider workers 

# TODO: Specifiy precision of converted tensor in args 
# This plan should make it possible to remotely encrypt data via SMPC-enryption (Additive Secret Sharing)

@sy.func2plan()
def remote_share(data, hosting_provider1, hosting_provider2, crypto_provider):
    """
        Create plan which enables to remotely execute the share() method. 
        Is needed here to be able to remotely encrypt data via SMPC
    """
    if data.type() == 'torch.FloatTensor': 
        data = data.fix_precision()
    ptr = data.share(hosting_provider1, hosting_provider2, crypto_provider=crypto_provider)
    return ptr

# TODO: IS REMOTE SHARE POSSIBLE? DO we have to have a specific plan for every worker? 
# -> Think about defining overloading some of the functions or moving around? 
def send_to_all(data, workers):
    """
        Send copys of some data to multiple workers 
    """
    ptrs = {}
    for worker in workers: 
        # Use deepcopy because we want the plan on each worker and is data is nested we want to copy everything 
        # deepcopy also copies nested arrays e.g. (important for model sending for example)
        ptrs[worker.id] = deepcopy(data).send(worker)
    return ptrs

hosting_provider = {}

extra_hosting = True
if extra_hosting: 
    hosting_provider["hosting"] = CreateVWorker().....
else: 
    hosting_provider["hosting"]  = dataproviders.....

# TODO: quick explanation of crypto_provider
# Build plan
remote_share.build(sample_input, hosting_provider["hosting"][0],
                   hosting_provider["hosting"][1], hosting_provider["crypto"])
# Send to data_providers
PTRS_remote_share = send_to_all(remote_share)
PTRS_remote_share[0]

#remote_share(sample_input, hosting_provider["hosting"], hosting_provider["crypto"])
#ptr = sample_input.fix_precision().share(*hosting_provider["hosting"], crypto_provider=hosting_provider["crypto"], 
#                                        requires_grad=False)
#remote_share.build(sample_input, hosting_provider["hosting"], hosting_provider["crypto"])

In [None]:
### ADITIONAL TESTS ####

# Test: compute with tensor which are saved on different devices -> Not possible 
# Logically: Additive Secret Sharing Concept: Computation can only be made in an encrypted, decentralized manner
# if input tensors to calculation exist on all parties that are participating in calculation. 
# (A share of all participating variables has to be on all participating workers)
worker1 = sy.VirtualWorker(hook, id="worker1")
worker2 = sy.VirtualWorker(hook, id="worker2")
worker3 = sy.VirtualWorker(hook, id="worker3")
crypto_provider = sy.VirtualWorker(hook, id="c_provider")
test = torch.tensor([1., 2.])
model = nn.Sequential(nn.Linear(2, 1), nn.ReLU())
model(test)

In [None]:
w_group = [worker1, worker2]
w_group2 = [worker2, worker3]
w_group3 = [worker1, worker3]
PTR_test = test.fix_precision().tag("test1").share(*w_group, crypto_provider=crypto_provider)
print(PTR_test)

test2 = torch.tensor([2., 4.])
PTR_test2 = test2.fix_precision().share(*w_group2, crypto_provider=crypto_provider)
print(PTR_test2)

PTR_model = model.fix_precision().share(*w_group, crypto_provider=crypto_provider)
print(PTR_model)

In [None]:
PTR_test.child.child.child.keys()

In [None]:
# Two questions: 
# 1. Why can I only use tagging if I specifiy data through creation a worker (data argument)
# 2. Why can't I only pass in one variable to VirtualWorker at beginning? 

x11 = torch.tensor([-1, 2.]).tag('input_data')
x12 = torch.tensor([1, -2.]).tag('input_data2')
x13 = torch.tensor([1, -2.]).tag('input_data3')

device_1 = sy.VirtualWorker(hook, id="device_1", data=(x11, x12)) 
(x11, x12).send(device_1)

print(device_1.search("input_data3"))

In [None]:
# ISSUE!!!!!!
out1 = PTR_model(PTR_test)
out2 = PTR_model(PTR_test2)

In [None]:
# QUESTIONS: 
# Why does the data also need "requires_grad=True" also autograd_tensor?
# Fundamentally: How is it restricted who can call .get() on the pointers? (Passive -> Active Security)
# Why does DP-Def. (constraint) only describe an upper bound? If Pr[M(x) e S] + Pr[M(y) e S] = 1? (Prob. not)
# Cuda doesn't with secureNN: Float can't be converted to Long .... Check in more detail (tested in Google Colab)
# max pooling layer 
        # PR: !!! RuntimeError: "max_pool2d_with_indices_cpu" not implemented for 'Long'
        # Should work as part of secureNN protocol
        #self.pool = nn.MaxPool2d(2, 2)
# get_all() would be a cool functionality as often pointer array on pointer data (issue/PR?)

In [None]:
workers = hosting_groups[0]
workers

In [None]:
class Net(nn.Module):
    def __init__(self):
        super(Net , self).__init__()
        # ONLY WORKS WITH TWO PARTIES: FILE PR!!!!!
        # syft/frameworks/torch/mpc/securenn.py in relu_deriv(a_sh), line 386 "too many values to unpack (expected 2)
        resnet = models.resnet18(pretrained=False)
        # Change network classifier to have 10 output classes (standard has 1000)
        # See Transfer Learning Example from PyTorch 
        resnet.fc = nn.Linear(512, 10)
        self.resnet = resnet
        
    def forward(self, x): 
        return self.resnet(x)

sample_input = torch.randn((20, 3, 32, 32))*255
sample_model = Net()
print(sample_input.shape)
print(sample_model(sample_input).shape)
ptr_si = sample_input.fix_precision().share(*workers, crypto_provider=crypto_provider, requires_grad=True)
ptr_sm = sample_model.fix_precision().share(*workers, crypto_provider=crypto_provider, requires_grad=True)
print(ptr_si)
print(ptr_sm)
print(ptr_sm(ptr_si))