# Fed-BioMed Researcher base example

Use this cell during development: it automatically reloads changes made across packages.

In [1]:
%load_ext autoreload
%autoreload 2

## Start the network
Before running this notebook, start the network with `./scripts/fedbiomed_run network`

## Setting the node up
A node must be configured before running the experiment:
1. `./scripts/fedbiomed_run node add`
  * Select option 2 (default) to add MNIST to the node
  * Confirm the default tags by typing "y" and pressing ENTER
  * Pick the folder where MNIST is downloaded (this is needed because of the torch issue https://github.com/pytorch/vision/issues/3549)
  * The data must be added only once (a warning saying that data must be unique means it has already been added)
  
2. Check that your data has been added by executing `./scripts/fedbiomed_run node list`
3. Run the node using `./scripts/fedbiomed_run node run`. Wait until you get `Starting task manager`, which means the node is online.

## Define an experiment model and parameters

Declare a `MyTrainingPlan` class, built from `torch.nn` modules, to send to the node for training.

In [2]:
import torch
import torch.nn as nn
from fedbiomed.common.training_plans import TorchTrainingPlan
from fedbiomed.common.data import DataManager
from torchvision import datasets, transforms

# Here we define the training plan and the model it uses.
# You can use any class name (here 'MyTrainingPlan')
class MyTrainingPlan(TorchTrainingPlan):
    def __init__(self, model_args: dict = {}):
        super(MyTrainingPlan, self).__init__(model_args)
        
        
        self.model = self.make_model()
        
        # Here we declare the extra dependencies needed on the node by `training_data`.
        # Since we will train on MNIST, we need datasets and transforms from torchvision
        deps = ["from torchvision import datasets, transforms"]
        
        self.add_dependency(deps)

    def make_model(self):
        model = nn.Sequential(nn.Conv2d(1, 32, 3, 1),
                                  nn.ReLU(),
                                  nn.Conv2d(32, 64, 3, 1),
                                  nn.ReLU(),
                                  nn.MaxPool2d(2),
                                  nn.Dropout(0.25),
                                  nn.Flatten(),
                                  nn.Linear(9216, 128),
                                  nn.ReLU(),
                                  nn.Dropout(0.5),
                                  nn.Linear(128, 10),
                                  nn.LogSoftmax(dim=1))
        return model
        
        
    def forward(self, x):

        return self.model(x)

    def training_data(self, batch_size = 48):
        # Custom torch Dataloader for MNIST data
        transform = transforms.Compose([transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))])
        dataset1 = datasets.MNIST(self.dataset_path, train=True, download=False, transform=transform)
        train_kwargs = {'batch_size': batch_size, 'shuffle': True}
        return DataManager(dataset=dataset1, **train_kwargs)
    
    def training_step(self, data, target):
        output = self.forward(data)
        loss   = torch.nn.functional.nll_loss(output, target)
        return loss
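
The input size of `nn.Linear(9216, 128)` in `make_model` follows from the MNIST image size. Here is a minimal sketch of that arithmetic in plain Python, assuming 28x28 single-channel inputs and the unpadded, stride-1 convolutions used above. The helper `conv_out` is illustrative only and is not part of Fed-BioMed or torch.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution or pooling layer (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

side = 28                                   # MNIST images are 28x28
side = conv_out(side, kernel=3)             # nn.Conv2d(1, 32, 3, 1)  -> 26
side = conv_out(side, kernel=3)             # nn.Conv2d(32, 64, 3, 1) -> 24
side = conv_out(side, kernel=2, stride=2)   # nn.MaxPool2d(2)         -> 12

channels = 64                               # output channels of the second convolution
flat_features = channels * side * side      # size of the vector produced by nn.Flatten()

print(side, flat_features)  # 12 9216
```

If you change the kernel sizes or add padding, the first `nn.Linear` input size has to be recomputed the same way.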


The arguments defined in the next cell are:
* `model_args`: a dictionary with the arguments related to the model (e.g. number of layers, features, etc.). This will be passed to the model class on the node side.
* `training_args`: a dictionary containing the arguments for the training routine (e.g. batch size, learning rate, epochs, etc.). This will be passed to the routine on the node side.

**NOTE:** typos or missing required (positional) arguments will raise an error. 🤓

In [13]:
model_args = {}

training_args = {
    'batch_size': 48, 
    'lr': 1e-3, 
    'epochs': 1, 
    'dry_run': False,  
    'batch_maxnum': 100, # Fast pass for development : only use ( batch_maxnum * batch_size ) samples
    'DP_args': {'local' : 'xx', 'sigma': 'a', 'clip': 1.},
}
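
The effect of `batch_maxnum` can be made concrete with a little arithmetic. This is a rough sketch, assuming the node holds the full 60,000-image MNIST training split and that at most `batch_maxnum` batches are processed per epoch. The variable names are illustrative.

```python
import math

batch_size = 48
batch_maxnum = 100
dataset_size = 60_000  # assumption: full MNIST training split on the node

batches_in_dataset = math.ceil(dataset_size / batch_size)    # 1250
batches_used = min(batch_maxnum, batches_in_dataset)         # 100
samples_used = min(batches_used * batch_size, dataset_size)  # 4800

print(batches_used, samples_used)  # 100 4800
print(f"{samples_used / dataset_size:.0%} of the data per epoch")  # 8% of the data per epoch
```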

## Declare and run the experiment

- search for nodes serving data with these `tags`, optionally filtering on a list of node IDs with `nodes`
- run a round of local training on the nodes with the model defined in `model_class`, then federate the results with `aggregator`
- run for `round_limit` rounds, applying the `node_selection_strategy` between rounds
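
To give an idea of what the aggregation step does, here is a minimal sketch of federated averaging in plain Python. Each parameter is averaged across nodes, weighted by the number of samples each node trained on. It is an illustration under simplifying assumptions (parameters stored as flat lists of floats, a hypothetical `fed_average` helper) and not the code of Fed-BioMed's `FedAverage` class.

```python
from typing import Dict, List

Params = Dict[str, List[float]]

def fed_average(node_params: List[Params], sample_counts: List[int]) -> Params:
    """Weighted average of per-node parameters (weights = sample counts)."""
    total = sum(sample_counts)
    weights = [n / total for n in sample_counts]
    averaged: Params = {}
    for name in node_params[0]:
        length = len(node_params[0][name])
        averaged[name] = [
            sum(w * p[name][i] for w, p in zip(weights, node_params))
            for i in range(length)
        ]
    return averaged

# Hypothetical parameters returned by two nodes
node_a = {"fc.weight": [1.0, 2.0], "fc.bias": [0.0]}
node_b = {"fc.weight": [3.0, 6.0], "fc.bias": [4.0]}

# node_b trained on three times as many samples, so it weighs three times as much
print(fed_average([node_a, node_b], sample_counts=[100, 300]))
# {'fc.weight': [2.5, 5.0], 'fc.bias': [3.0]}
```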

In [14]:
from fedbiomed.researcher.experiment import Experiment
from fedbiomed.researcher.aggregators.fedavg import FedAverage

tags =  ['#MNIST', '#dataset']
rounds = 2

exp = Experiment(tags=tags,
                 model_args=model_args,
                 model_class=MyTrainingPlan,
                 training_args=training_args,
                 round_limit=rounds,
                 aggregator=FedAverage(),
                 node_selection_strategy=None)

2022-04-01 11:20:23,316 fedbiomed INFO - Searching dataset with data tags: ['#MNIST', '#dataset'] for all nodes
2022-04-01 11:20:23,322 fedbiomed INFO - log from: node_036f19cf-c3ba-4b4b-8705-480b2d845d73 / DEBUG - Message received: {'researcher_id': 'researcher_7784dad0-7f7b-4228-886f-ed413acea603', 'tags': ['#MNIST', '#dataset'], 'command': 'search'}
2022-04-01 11:20:33,326 fedbiomed INFO - Node selected for training -> node_036f19cf-c3ba-4b4b-8705-480b2d845d73
2022-04-01 11:20:33,341 fedbiomed DEBUG - Model file has been saved: /Users/mlorenzi/works/temp/fedbiomed/var/ex

Let's start the experiment.

By default, `exp.run()` does not return until all `round_limit` rounds are done for all the nodes.

In [15]:
exp.run()

2022-04-01 11:20:33,930 fedbiomed INFO - Sampled nodes in round 0 ['node_036f19cf-c3ba-4b4b-8705-480b2d845d73']
2022-04-01 11:20:33,931 fedbiomed INFO - Send message to node node_036f19cf-c3ba-4b4b-8705-480b2d845d73 - {'researcher_id': 'researcher_7784dad0-7f7b-4228-886f-ed413acea603', 'job_id': 'a9d012d5-d973-4c0f-bb4e-7c0239ae9683', 'training_args': {'test_ratio': 0.0, 'test_on_local_updates': False, 'test_on_global_updates': False, 'test_metric': None, 'test_metric_args': {}, 'batch_size': 48, 'lr': 0.001, 'epochs': 1, 'dry_run': False, 'batch_maxnum': 100, 'DP_args': {'local': 'xx', 'sigma': 'a', 'clip': 1.0}}, 'training': True, 'model_args': {}, 'command': 'train', 'model_url': 'http://localhost:8844/media/uploads/2022/04/01/my_model_23e03937-0d61-42f3-91b0-3a0b4f85f9be.py', 'params_url': 'http://localhost:8844/media/uploads/2022/04/01/aggregated_params_init_d175b51a-4a1f-4bc9-bf10-4dbd

2022-04-01 11:20:34,186 fedbiomed INFO - log from: node_036f19cf-c3ba-4b4b-8705-480b2d845d73 / ERROR - Cannot train model in round: 'type'
2022-04-01 11:20:43,940 fedbiomed INFO - Downloading model params after training on node_036f19cf-c3ba-4b4b-8705-480b2d845d73 - from 
2022-04-01 11:20:43,943 fedbiomed ERROR - FB604: repository error : bad URL when downloading file node_params_35db3bb4-aa77-47a3-8d16-f9c5393ec39c.pt(details :Invalid URL '': No scheme supplied. Perhaps you meant http://? )
2022-04-01 11:20:43,945 fedbiomed CRITICAL - Fed-BioMe


--------------------
Fed-BioMed researcher stopped due to exception:
FB604: repository error : bad URL when downloading file node_params_35db3bb4-aa77-47a3-8d16-f9c5393ec39c.pt(details :Invalid URL '': No scheme supplied. Perhaps you meant http://? )
--------------------


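Local training results for each round and each node are available via `exp.training_replies()` (indexed from 0 to `rounds` - 1).

For example, you can view the training results for the last round below. Note that this requires a run that completed all rounds, unlike the run shown above, which stopped with error FB604.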

Different timings (in seconds) are reported for each dataset of a node participating in a round:
- `rtime_training`: real time (clock time) spent in the training function on the node
- `ptime_training`: process time (user and system CPU) spent in the training function on the node
- `rtime_total`: real time (clock time) spent in the researcher between sending the request and handling the response, at the `Job()` layer
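
The difference between real time and process time can be seen with Python's standard `time` module. The small sketch below is independent of Fed-BioMed and does not show how the node measures its timings internally. Sleeping adds to clock time but uses almost no CPU, while a busy loop adds to both. The `measure` helper is illustrative.

```python
import time

def measure(func):
    """Return (real_seconds, process_seconds) spent running func()."""
    real_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # user + system CPU time of this process
    func()
    return time.perf_counter() - real_start, time.process_time() - cpu_start

sleep_real, sleep_cpu = measure(lambda: time.sleep(0.2))
busy_real, busy_cpu = measure(lambda: sum(i * i for i in range(1_000_000)))

print(f"sleep: real={sleep_real:.2f}s process={sleep_cpu:.2f}s")
print(f"busy : real={busy_real:.2f}s process={busy_cpu:.2f}s")
```

A large gap between `rtime_training` and `ptime_training` therefore suggests time spent waiting (for example on I/O) rather than computing.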

In [None]:
print("\nList the training rounds : ", exp.training_replies().keys())

print("\nList the nodes for the last training round and their timings : ")
round_data = exp.training_replies()[rounds - 1].data()
for reply in round_data:
    timing = reply['timing']
    print(f"\t- {reply['node_id']} :"
          f"\n\t\trtime_training={timing['rtime_training']:.2f} seconds"
          f"\n\t\tptime_training={timing['ptime_training']:.2f} seconds"
          f"\n\t\trtime_total={timing['rtime_total']:.2f} seconds")
print('\n')

exp.training_replies()[rounds - 1].dataframe()

Federated parameters for each round are available via `exp.aggregated_params()` (indexed from 0 to `rounds` - 1).

For example, you can view the federated parameters for the last round of the experiment:

In [None]:
print("\nList the training rounds : ", exp.aggregated_params().keys())

print("\nAccess the federated params for the last training round :")
print("\t- params_path: ", exp.aggregated_params()[rounds - 1]['params_path'])
print("\t- parameter data: ", exp.aggregated_params()[rounds - 1]['params'].keys())


Feel free to run other sample notebooks or try your own models :D