# Fed-BioMed Researcher base example

Use this cell during development: it automatically reloads changes made across packages.

In [18]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Start the network
Before running this notebook, start the network with `./scripts/fedbiomed_run network`

## Setting the node up
A node must be configured beforehand:
1. `./scripts/fedbiomed_run node add`
  * Select option 2 (default) to add MNIST to the node
  * Confirm the default tags by typing "y" and pressing ENTER
  * Pick the folder where MNIST is downloaded (this is due to torch issue https://github.com/pytorch/vision/issues/3549)
  * The data must have been added (a warning saying that the data must be unique means it has already been added)
  
2. Check that your data has been added by executing `./scripts/fedbiomed_run node list`
3. Run the node using `./scripts/fedbiomed_run node run`. Wait until you see `Starting task manager`, which means the node is online.

## Define an experiment model and parameters

Declare a `MyTrainingPlan` class, built on `torch.nn`, to send to the node for training.

In [19]:
import torch
import torch.nn as nn
from fedbiomed.common.training_plans import TorchTrainingPlan
from fedbiomed.common.data import DataManager
from torchvision import datasets, transforms

# Here we define the model to be used.
# You can use any class name (here 'MyTrainingPlan')
class MyTrainingPlan(TorchTrainingPlan):
    def __init__(self, model_args: dict = {}):
        super(MyTrainingPlan, self).__init__(model_args)
        
        
        self.model = self.make_model()
        
        # Here we define the custom dependencies that will be needed by our custom Dataloader
        # In this case, we need the torch DataLoader classes
        # Since we will train on MNIST, we need datasets and transform from torchvision
        deps = ["from torchvision import datasets, transforms"]
        
        self.add_dependency(deps)

    def make_model(self):
        model = nn.Sequential(nn.Conv2d(1, 32, 3, 1),
                                  nn.ReLU(),
                                  nn.Conv2d(32, 64, 3, 1),
                                  nn.ReLU(),
                                  nn.MaxPool2d(2),
                                  nn.Dropout(0.25),
                                  nn.Flatten(),
                                  nn.Linear(9216, 128),
                                  nn.ReLU(),
                                  nn.Dropout(0.5),
                                  nn.Linear(128, 10),
                                  nn.LogSoftmax(dim=1))
        return model
        
        
    def forward(self, x):

        return self.model(x)

    def training_data(self, batch_size = 48):
        # Custom torch Dataloader for MNIST data
        transform = transforms.Compose([transforms.ToTensor(),
                                        transforms.Normalize((0.1307,), (0.3081,))])
        dataset1 = datasets.MNIST(self.dataset_path, train=True, download=False, transform=transform)
        train_kwargs = {'batch_size': batch_size, 'shuffle': True}
        return DataManager(dataset=dataset1, **train_kwargs)
    
    def training_step(self, data, target):
        output = self.forward(data)
        loss   = torch.nn.functional.nll_loss(output, target)
        return loss
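
The `9216` input size of the first `nn.Linear` layer follows from the layers before it. A minimal sketch of that arithmetic, assuming 28x28 single-channel MNIST images and the default padding and dilation; `conv2d_out` is a hypothetical helper written for this illustration, not part of torch or Fed-BioMed:

```python
def conv2d_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output side length of a square Conv2d/MaxPool2d layer (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

side = 28                                     # MNIST images are 28x28, 1 channel
side = conv2d_out(side, kernel=3)             # nn.Conv2d(1, 32, 3, 1)  -> 26
side = conv2d_out(side, kernel=3)             # nn.Conv2d(32, 64, 3, 1) -> 24
side = conv2d_out(side, kernel=2, stride=2)   # nn.MaxPool2d(2)         -> 12
flat_features = 64 * side * side              # 64 channels after the second conv
print(flat_features)  # 9216
```

If you change the kernel sizes or the input resolution, the first argument of `nn.Linear` has to change accordingly.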


These arguments are, respectively:
* `model_args`: a dictionary with the arguments related to the model (e.g. number of layers, features, etc.). This will be passed to the model class on the node side.
* `training_args`: a dictionary containing the arguments for the training routine (e.g. batch size, learning rate, epochs, etc.). This will be passed to the routine on the node side.

**NOTE:** typos and/or missing required (positional) arguments will raise an error. 🤓
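
Since a typo in a key only surfaces once the request reaches the node, a quick local check can help. The sketch below is a hypothetical helper, not a Fed-BioMed API; the key list is copied from the `training_args` echoed in this notebook's log output and may not match your Fed-BioMed version:

```python
import difflib

# Keys taken from the `training_args` shown in this notebook's logs (assumption:
# the set accepted by your Fed-BioMed version may differ).
KNOWN_TRAINING_KEYS = {
    "batch_size", "lr", "epochs", "dry_run", "batch_maxnum", "DP_args",
    "test_ratio", "test_on_local_updates", "test_on_global_updates",
    "test_metric", "test_metric_args",
}

def find_suspect_keys(args: dict) -> dict:
    """Map each unknown key to its closest known key (or None if nothing is close)."""
    suspects = {}
    for key in args:
        if key not in KNOWN_TRAINING_KEYS:
            close = difflib.get_close_matches(key, sorted(KNOWN_TRAINING_KEYS), n=1)
            suspects[key] = close[0] if close else None
    return suspects

print(find_suspect_keys({"batch_size": 48, "epoch": 3}))  # {'epoch': 'epochs'}
```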

In [20]:
model_args = {}

training_args = {
    'batch_size': 48, 
    'lr': 1e-3, 
    'epochs': 3, 
    'dry_run': False,  
    'batch_maxnum': 100, # Fast pass for development : only use ( batch_maxnum * batch_size ) samples
    'DP_args': {'type' : 'local', 'sigma': 1., 'clip': 1.},
}
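
The `DP_args` entry sets a noise multiplier `sigma` and a clipping bound `clip`. How Fed-BioMed applies them on the node is not shown in this notebook. As a rough illustration only, the sketch below follows the usual DP-SGD-style recipe: scale the gradient so its L2 norm is at most `clip`, then add Gaussian noise with standard deviation `sigma * clip`. `clip_and_noise` is a hypothetical helper and the noise scaling is an assumption:

```python
import math
import random

def clip_and_noise(grad, clip, sigma, rng=random):
    """Scale `grad` to L2 norm <= clip, then add N(0, (sigma * clip)^2) noise per entry."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip / norm) if norm > 0 else 1.0
    return [g * scale + rng.gauss(0.0, sigma * clip) for g in grad]

# With sigma=0 only the clipping step is visible: the norm drops from 5 to 1.
clipped = clip_and_noise([3.0, 4.0], clip=1.0, sigma=0.0)
print(math.sqrt(sum(c * c for c in clipped)))

# With sigma=1 (as in `DP_args` above) each entry also receives unit-scale noise.
noisy = clip_and_noise([3.0, 4.0], clip=1.0, sigma=1.0)
```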

## Dimensioning the training parameters for local differential privacy (LDP)

In [21]:
from fedbiomed.researcher.requests import Requests
import numpy as np

req = Requests()
xx = req.list()
min_dataset_size = np.min([xx[i][0]['shape'][0] for i in xx])
q = training_args['batch_size']/min_dataset_size

sigma = 1.
delta = 1e-6
max_epsilon = 1
max_N = int(1e5)

2022-04-01 14:47:45,744 fedbiomed INFO - Listing available datasets in all nodes... 
04/01/2022 14:47:45:INFO:Listing available datasets in all nodes... 
2022-04-01 14:47:45,749 fedbiomed INFO - log from: node_65e33263-8be2-42a8-b7ba-41b0cd90557d / DEBUG - Message received: {'researcher_id': 'researcher_8e4448ff-612c-4c3a-bb07-338c5e251a9b', 'command': 'list'}
04/01/2022 14:47:45:INFO:log from: node_65e33263-8be2-42a8-b7ba-41b0cd90557d / DEBUG - Message received: {'researcher_id': 'researcher_8e4448ff-612c-4c3a-bb07-338c5e251a9b', 'command': 'list'}
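
The sampling rate `q` computed above is the fraction of the smallest dataset covered by one batch. A minimal sketch with a mocked listing, assuming `req.list()` returns a mapping from node id to a list of dataset entries with a `shape` field (as the indexing in the cell suggests); the node ids and sizes below are invented:

```python
# Mocked stand-in for `req.list()`: node id -> list of dataset entries.
# Node ids and shapes are invented for illustration.
listing = {
    "node_a": [{"shape": [60000, 1, 28, 28]}],
    "node_b": [{"shape": [30000, 1, 28, 28]}],
}
batch_size = 48

# Same rule as the cell above: first dataset of each node, first dimension = sample count.
min_dataset_size = min(entries[0]["shape"][0] for entries in listing.values())
q = batch_size / min_dataset_size
print(min_dataset_size, q)  # 30000 0.0016
```

Using the smallest dataset gives the largest `q` across nodes, so the resulting privacy estimate is the most conservative one.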


In [30]:
from fedbiomed.researcher.privacy.rdp_accountant import get_iterations

N, eps_list = get_iterations(delta, sigma, q, max_epsilon, max_N)

max_epochs = int(N*training_args['batch_size']/min_dataset_size)

In [35]:
assert training_args['epochs'] <= max_epochs, 'Number of epochs not compatible with privacy budget'
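
The conversion above turns a budget of `N` iterations into whole epochs over the smallest dataset. The sketch below repeats that arithmetic with made-up numbers and also counts the optimizer steps a run would take, assuming `batch_maxnum` caps the batches per epoch (as the "Fast pass" comment in `training_args` suggests). Both helpers are hypothetical:

```python
import math

def steps_per_epoch(dataset_size: int, batch_size: int, batch_maxnum: int = 0) -> int:
    """Optimizer steps in one epoch; batch_maxnum > 0 caps the count (assumed semantics)."""
    full = math.ceil(dataset_size / batch_size)
    return min(full, batch_maxnum) if batch_maxnum > 0 else full

def max_epochs_for_budget(n_iterations: int, dataset_size: int, batch_size: int) -> int:
    """Whole epochs that fit in n_iterations steps, same formula as the cell above."""
    return int(n_iterations * batch_size / dataset_size)

# Illustrative numbers only: n_iter is made up, not an output of get_iterations.
n_iter, dataset_size, batch_size, epochs = 5000, 60000, 48, 3
print(max_epochs_for_budget(n_iter, dataset_size, batch_size))      # 4
print(epochs * steps_per_epoch(dataset_size, batch_size))           # 3750
print(epochs * steps_per_epoch(dataset_size, batch_size, 100))      # 300
```

With a cap of 100 batches per epoch, three epochs use far fewer steps than three full passes, so the epoch-based check is conservative in that case.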

## Dimensioning the training parameters for central differential privacy (CDP)

In [42]:
q = 1  # All clients are selected
sigma = 1.
delta = 1e-6
max_epsilon = 20
max_N = int(50)

N, eps_list = get_iterations(delta, sigma, q, max_epsilon, max_N)

In [43]:
print(N, eps_list)

10 [0, 5.2215396311544175, 7.766237903487095, 9.848401082073611, 11.68862679916635, 13.37637916335795, 14.950586874415979, 16.438090423375677, 17.86144010371158, 19.22988164567188]
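
One way to read this output, assuming each entry of `eps_list` is the cumulative epsilon after one more round (an interpretation, not something the notebook states): count how many entries stay within a given budget. `entries_within_budget` is a hypothetical helper:

```python
# Values copied from the output above, rounded to three decimals.
eps_list = [0, 5.222, 7.766, 9.848, 11.689, 13.376, 14.951, 16.438, 17.861, 19.230]

def entries_within_budget(eps_values, budget):
    """Count leading entries whose epsilon does not exceed the budget."""
    count = 0
    for eps in eps_values:
        if eps > budget:
            break
        count += 1
    return count

print(entries_within_budget(eps_list, 20))  # 10
print(entries_within_budget(eps_list, 10))  # 4
```

With the budget of 20 used above, all ten entries fit, which is consistent with the printed `N`; halving the budget would leave room for only four.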


## Declare and run the experiment

- search nodes serving data for these `tags`, optionally filtering on a list of node IDs with `nodes`
- run a round of local training on nodes with the model defined in `model_class`, then federate with `aggregator`
- run for `round_limit` rounds, applying the `node_selection_strategy` between rounds
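
To make the federation step concrete, here is a minimal sketch of generic federated averaging, where each node's parameters are weighted by its number of samples. `federated_average` is a hypothetical function written for this illustration; Fed-BioMed's `FedAverage` may weight or store parameters differently:

```python
def federated_average(node_params, node_sizes):
    """Weighted average of per-node parameter dicts, weights = sample counts."""
    total = sum(node_sizes)
    averaged = {}
    for name in node_params[0]:
        length = len(node_params[0][name])
        averaged[name] = [
            sum(p[name][i] * n for p, n in zip(node_params, node_sizes)) / total
            for i in range(length)
        ]
    return averaged

# Two toy nodes with one parameter vector each; the second node has 3x the data.
params = [{"w": [1.0, 2.0]}, {"w": [3.0, 6.0]}]
print(federated_average(params, node_sizes=[100, 300]))  # {'w': [2.5, 5.0]}
```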

In [16]:
from fedbiomed.researcher.experiment import Experiment
from fedbiomed.researcher.aggregators.fedavg import FedAverage

tags =  ['#MNIST', '#dataset']
rounds = 2

exp = Experiment(tags=tags,
                 model_args=model_args,
                 model_class=MyTrainingPlan,
                 training_args=training_args,
                 round_limit=rounds,
                 aggregator=FedAverage(),
                 node_selection_strategy=None)

2022-04-01 14:42:38,503 fedbiomed INFO - Searching dataset with data tags: ['#MNIST', '#dataset'] for all nodes
04/01/2022 14:42:38:INFO:Searching dataset with data tags: ['#MNIST', '#dataset'] for all nodes
2022-04-01 14:42:38,514 fedbiomed INFO - log from: node_65e33263-8be2-42a8-b7ba-41b0cd90557d / DEBUG - Message received: {'researcher_id': 'researcher_8e4448ff-612c-4c3a-bb07-338c5e251a9b', 'tags': ['#MNIST', '#dataset'], 'command': 'search'}
04/01/2022 14:42:38:INFO:log from: node_65e33263-8be2-42a8-b7ba-41b0cd90557d / DEBUG - Message received: {'researcher_id': 'researcher_8e4448ff-612c-4c3a-bb07-338c5e251a9b', 'tags': ['#MNIST', '#dataset'], 'command': 'search'}
2022-04-01 14:42:48,516 fedbiomed INFO - Node selected for training -> node_65e33263-8be2-42a8-b7ba-41b0cd90557d
04/01/2022 14:42:48:INFO:Node selected for training -> node_65e33263-8be2-42a8-b7ba-41b0cd90557d
2022-04-01 14:42:48,529 fedbiomed DEBUG - Model file has been saved: /Users/mlorenzi/works/temp/fedbiomed/var/ex

Let's start the experiment.

By default, this function doesn't stop until all `round_limit` rounds are done for all nodes.

In [17]:
exp.run()

2022-04-01 14:42:49,179 fedbiomed INFO - Sampled nodes in round 0 ['node_65e33263-8be2-42a8-b7ba-41b0cd90557d']
04/01/2022 14:42:49:INFO:Sampled nodes in round 0 ['node_65e33263-8be2-42a8-b7ba-41b0cd90557d']
2022-04-01 14:42:49,181 fedbiomed INFO - Send message to node node_65e33263-8be2-42a8-b7ba-41b0cd90557d - {'researcher_id': 'researcher_8e4448ff-612c-4c3a-bb07-338c5e251a9b', 'job_id': 'c18c371d-6c02-4982-9038-183fc3f0d4e1', 'training_args': {'test_ratio': 0.0, 'test_on_local_updates': False, 'test_on_global_updates': False, 'test_metric': None, 'test_metric_args': {}, 'batch_size': 48, 'lr': 0.001, 'epochs': 3, 'dry_run': False, 'batch_maxnum': 100, 'DP_args': {'type': 'local', 'sigma': 1.0, 'clip': 1.0}}, 'training': True, 'model_args': {}, 'command': 'train', 'model_url': 'http://localhost:8844/media/uploads/2022/04/01/my_model_213386a4-dc56-44b6-873b-3c688a1e4081.py', 'params_url': 'http://localhost:8844/media/uploads/2022/04/01/aggregated_params_init_0551782d-1682-46b7-a656-ff

2022-04-01 14:42:49,431 fedbiomed INFO - log from: node_65e33263-8be2-42a8-b7ba-41b0cd90557d / DEBUG - Using device cpu for training (cuda_available=False, gpu=False, gpu_only=False, use_gpu=False, gpu_num=None)
04/01/2022 14:42:49:INFO:log from: node_65e33263-8be2-42a8-b7ba-41b0cd90557d / DEBUG - Using device cpu for training (cuda_available=False, gpu=False, gpu_only=False, use_gpu=False, gpu_num=None)
2022-04-01 14:43:29,974 fedbiomed INFO - log from: node_65e33263-8be2-42a8-b7ba-41b0cd90557d / DEBUG - Reached 100 batches for this epoch, ignore remaining data
04/01/2022 14:43:29:INFO:log from: node_65e33263-8be2-42a8-b7ba-41b0cd90557d / DEBUG - Reached 100 batches for this epoch, ignore remaining data
2022-04-01 14:44:08,220 fedbiomed INFO - log from: node_65e33263-8be2-42a8-b7ba-41b0cd90557d / DEBUG - Reached 100 batches for this epoch, ignore remaining data
04/01/2022 14:44:08:INFO:log from: node_65e33263-8be2-42a8-b7ba-41b0cd90557d / DEBUG - Reached 100 batches for this epoch, ig

2022-04-01 14:44:45,246 fedbiomed INFO - log from: node_65e33263-8be2-42a8-b7ba-41b0cd90557d / DEBUG - Reached 100 batches for this epoch, ignore remaining data
04/01/2022 14:44:45:INFO:log from: node_65e33263-8be2-42a8-b7ba-41b0cd90557d / DEBUG - Reached 100 batches for this epoch, ignore remaining data
2022-04-01 14:44:45,247 fedbiomed INFO - log from: node_65e33263-8be2-42a8-b7ba-41b0cd90557d / DEBUG - running model.postprocess() method
04/01/2022 14:44:45:INFO:log from: node_65e33263-8be2-42a8-b7ba-41b0cd90557d / DEBUG - running model.postprocess() method
2022-04-01 14:44:45,249 fedbiomed INFO - log from: node_65e33263-8be2-42a8-b7ba-41b0cd90557d / DEBUG - model.postprocess() method not provided
04/01/2022 14:44:45:INFO:log from: node_65e33263-8be2-42a8-b7ba-41b0cd90557d / DEBUG - model.postprocess() method not provided
2022-04-01 14:44:45,656 fedbiomed INFO - log from: node_65e33263-8be2-42a8-b7ba-41b0cd90557d / DEBUG - upload (HTTP POST request) of file /Users/mlorenzi/works/temp

2022-04-01 14:44:54,853 fedbiomed INFO - log from: node_65e33263-8be2-42a8-b7ba-41b0cd90557d / DEBUG - [TASKS QUEUE] Item:{'researcher_id': 'researcher_8e4448ff-612c-4c3a-bb07-338c5e251a9b', 'job_id': 'c18c371d-6c02-4982-9038-183fc3f0d4e1', 'params_url': 'http://localhost:8844/media/uploads/2022/04/01/aggregated_params_96436496-26a9-4ecb-8d68-e50833ede6fe.pt', 'training_args': {'test_ratio': 0.0, 'test_on_local_updates': False, 'test_on_global_updates': False, 'test_metric': None, 'test_metric_args': {}, 'batch_size': 48, 'lr': 0.001, 'epochs': 3, 'dry_run': False, 'batch_maxnum': 100, 'DP_args': {'type': 'local', 'sigma': 1.0, 'clip': 1.0}}, 'training_data': {'node_65e33263-8be2-42a8-b7ba-41b0cd90557d': ['dataset_2fe7d813-044f-4648-b5dd-1f539110deb2']}, 'training': True, 'model_args': {}, 'model_url': 'http://localhost:8844/media/uploads/2022/04/01/my_model_213386a4-dc56-44b6-873b-3c688a1e4081.py', 'model_class': 'MyTrainingPlan', 'command': 'train'}
04/01/2022 14:44:54:INFO:log from:

2022-04-01 14:46:02,057 fedbiomed CRITICAL - Fed-BioMed researcher stopped due to keyboard interrupt
04/01/2022 14:46:02:CRITICAL:Fed-BioMed researcher stopped due to keyboard interrupt



--------------------
Fed-BioMed researcher stopped due to keyboard interrupt
--------------------


Local training results for each round and each node are available via `exp.training_replies()` (index 0 to `rounds - 1`).

For example, you can view the training results for the last round below.

Different timings (in seconds) are reported for each dataset of a node participating in a round:
- `rtime_training` real time (clock time) spent in the training function on the node
- `ptime_training` process time (user and system CPU) spent in the training function on the node
- `rtime_total` real time (clock time) spent in the researcher between sending the request and handling the response, at the `Job()` layer
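
The difference between real time and process time can be seen with the standard library alone. The sketch below is not Fed-BioMed's timing code; `timed` is a hypothetical helper that measures both clocks around a call that mostly waits:

```python
import time

def timed(fn):
    """Run fn() and return (wall-clock seconds, CPU process seconds)."""
    r0, p0 = time.perf_counter(), time.process_time()
    fn()
    return time.perf_counter() - r0, time.process_time() - p0

# Sleeping consumes wall-clock time but almost no CPU time.
rtime, ptime = timed(lambda: time.sleep(0.2))
print(f"rtime={rtime:.2f}s ptime={ptime:.2f}s")
```

A large gap between `rtime_training` and `ptime_training` on a node therefore hints at waiting (I/O, other processes) rather than computation.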

In [None]:
print("\nList the training rounds : ", exp.training_replies().keys())

print("\nList the nodes for the last training round and their timings : ")
round_data = exp.training_replies()[rounds - 1].data()
for c in range(len(round_data)):
    print("\t- {id} :\
    \n\t\trtime_training={rtraining:.2f} seconds\
    \n\t\tptime_training={ptraining:.2f} seconds\
    \n\t\trtime_total={rtotal:.2f} seconds".format(id = round_data[c]['node_id'],
        rtraining = round_data[c]['timing']['rtime_training'],
        ptraining = round_data[c]['timing']['ptime_training'],
        rtotal = round_data[c]['timing']['rtime_total']))
print('\n')
    
exp.training_replies()[rounds - 1].dataframe()

Federated parameters for each round are available via `exp.aggregated_params()` (index 0 to `rounds - 1`).

For example, you can view the federated parameters for the last round of the experiment:

In [None]:
print("\nList the training rounds : ", exp.aggregated_params().keys())

print("\nAccess the federated params for the last training round :")
print("\t- params_path: ", exp.aggregated_params()[rounds - 1]['params_path'])
print("\t- parameter data: ", exp.aggregated_params()[rounds - 1]['params'].keys())


Feel free to run other sample notebooks or try your own models :D