# Fedbiomed Researcher

Use this cell during development: it automatically reloads changes made across packages.

In [None]:
%load_ext autoreload
%autoreload 2

## Start the network
Before running this notebook, start the network with `./scripts/fedbiomed_run network`

## Setting the node up
A node must be configured before running the experiment:
1. `./scripts/fedbiomed_run node add`
  * Select option 2 (default) to add MNIST to the node
  * Confirm the default tags by typing "y" and pressing ENTER
  * Pick the folder where MNIST is downloaded (this is needed because of the torch issue https://github.com/pytorch/vision/issues/3549)
  * The data only needs to be added once (a warning saying that data must be unique means it has already been added)
  
2. Check that your data has been added by executing `./scripts/fedbiomed_run node list`
3. Run the node using `./scripts/fedbiomed_run node run`. Wait until you see `Starting task manager`, which means the node is online.

## Create an experiment to train a model on the data found

Declare a `MyTrainingPlan` class, built from `torch.nn` layers, to send to the node for training.

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F  # functional API used in forward() (relu, max_pool2d, log_softmax)
from fedbiomed.common.training_plans import TorchTrainingPlan
from fedbiomed.common.data import DataManager
from torchvision import datasets, transforms

# Here we define the model to be used.
# You can use any class name (here 'MyTrainingPlan')
class MyTrainingPlan(TorchTrainingPlan):
    def __init__(self, model_args: dict = {}):
        super(MyTrainingPlan, self).__init__(model_args)
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)
        
        # Here we define the custom dependencies that will be needed by our custom Dataloader
        # In this case, we need the torch DataLoader classes
        # Since we will train on MNIST, we need datasets and transform from torchvision
        deps = ["from torchvision import datasets, transforms"]
        
        self.add_dependency(deps)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        
        
        output = F.log_softmax(x, dim=1)
        return output

    def training_data(self, batch_size = 48):
        # Custom torch Dataloader for MNIST data
        transform = transforms.Compose([transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))])
        dataset1 = datasets.MNIST(self.dataset_path, train=True, download=False, transform=transform)
        train_kwargs = {'batch_size': batch_size, 'shuffle': True}
        return DataManager(dataset=dataset1, **train_kwargs)
    
    def training_step(self, data, target):
        output = self.forward(data)
        loss   = torch.nn.functional.nll_loss(output, target)
        return loss
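
The input size of `fc1` (9216) follows from the layer shapes above. The sketch below redoes that arithmetic in plain Python, assuming 28x28 single-channel MNIST images; the helper names `conv_out` and `pool_out` are illustrative and not part of Fed-BioMed or torch.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output side length of a square convolution (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size: int, kernel: int) -> int:
    """Output side length of max pooling whose stride equals the kernel size."""
    return (size - kernel) // kernel + 1


side = 28                          # assumed MNIST input: 28x28, one channel
side = conv_out(side, kernel=3)    # conv1 (kernel 3, stride 1): 28 -> 26
side = conv_out(side, kernel=3)    # conv2 (kernel 3, stride 1): 26 -> 24
side = pool_out(side, kernel=2)    # max_pool2d(x, 2): 24 -> 12

out_channels = 64                  # number of output channels of conv2
flat_features = out_channels * side * side
print(flat_features)               # 9216, the in_features of fc1
```

If you change a kernel size or the number of channels, the first argument of `nn.Linear` for `fc1` has to change accordingly.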


The two groups of arguments below correspond, respectively, to:
* `model_args`: a dictionary with the arguments related to the model (e.g. number of layers, features, etc.). This will be passed to the model class on the node side.
* `training_args`: a dictionary containing the arguments for the training routine (e.g. batch size, learning rate, epochs, etc.). This will be passed to the routine on the node side.

**NOTE:** typos or missing required (positional) arguments will raise an error. 🤓

In [2]:
model_args = {}

training_args = {
    'batch_size': 48, 
    'lr': 1e-3, 
    'epochs': 2, 
    'dry_run': False,  
    'batch_maxnum': 100 # Fast pass for development: use at most (batch_maxnum * batch_size) samples per epoch
}
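
As a rough check of what `batch_maxnum` implies, the sketch below computes how many samples the node processes with these settings. It assumes the node holds the full MNIST training split (60,000 images); the variable names are illustrative only.

```python
training_args = {'batch_size': 48, 'epochs': 2, 'batch_maxnum': 100}

mnist_train_size = 60_000  # assumption: the node serves the full MNIST training split

# batch_maxnum caps the number of batches per epoch, so at most
# batch_maxnum * batch_size samples are seen in one epoch.
samples_per_epoch = min(
    training_args['batch_maxnum'] * training_args['batch_size'],
    mnist_train_size,
)

# Each round of local training runs `epochs` epochs on the node.
samples_per_round = samples_per_epoch * training_args['epochs']

print(samples_per_epoch)   # 4800
print(samples_per_round)   # 9600
```

Under this assumption, each epoch touches only 8% of the training split, which is why this setting suits quick development runs rather than final training.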

Define an experiment:
- search nodes serving data for these `tags`, optionally filtering on a list of node IDs with `nodes`
- run a round of local training on nodes with the model defined in `model_class`, then federate with `aggregator`
- run for `round_limit` rounds, applying the `node_selection_strategy` between the rounds
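
To make the federation step more concrete, here is a minimal sketch of the general federated averaging idea in plain Python: parameters from each node are averaged, weighted by how many samples each node trained on. This is not Fed-BioMed's `FedAverage` implementation; the function `federated_average`, the parameter names, and the sample counts are all hypothetical.

```python
from typing import Dict, List


def federated_average(
    node_params: List[Dict[str, List[float]]],
    sample_counts: List[int],
) -> Dict[str, List[float]]:
    """Average per-node parameters, weighted by each node's sample count."""
    total = sum(sample_counts)
    weights = [count / total for count in sample_counts]

    averaged: Dict[str, List[float]] = {}
    for name in node_params[0]:
        length = len(node_params[0][name])
        averaged[name] = [
            sum(w * params[name][i] for w, params in zip(weights, node_params))
            for i in range(length)
        ]
    return averaged


# Two hypothetical nodes with tiny "models" (flat lists instead of tensors)
node_a = {"fc.weight": [1.0, 2.0], "fc.bias": [0.0]}
node_b = {"fc.weight": [3.0, 6.0], "fc.bias": [4.0]}

result = federated_average([node_a, node_b], sample_counts=[100, 300])
print(result)  # node_b counts three times as much as node_a
```

With a single node, as in this notebook, the weighted average reduces to that node's own parameters.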

In [3]:
from fedbiomed.researcher.experiment import Experiment
from fedbiomed.researcher.aggregators.fedavg import FedAverage

tags =  ['#MNIST', '#dataset']
rounds = 3

exp = Experiment(tags=tags,
                 model_args=model_args,
                 model_class=MyTrainingPlan,
                 training_args=training_args,
                 round_limit=rounds,
                 aggregator=FedAverage(),
                 node_selection_strategy=None,
                 tensorboard=True
                )

2022-03-16 13:32:40,331 fedbiomed INFO - Component environment:
2022-03-16 13:32:40,332 fedbiomed INFO - type = ComponentType.RESEARCHER
2022-03-16 13:32:40,547 fedbiomed INFO - Messaging researcher_96a37edc-2ba8-47d7-aa8e-33679104e4b2 successfully connected to the message broker, object = <fedbiomed.common.messaging.Messaging object at 0x7f1671bfc670>
2022-03-16 13:32:40,604 fedbiomed INFO - Searching dataset with data tags: ['#MNIST', '#dataset'] for all nodes
2022-03-16 13:32:40,606 fedbiomed INFO - log from: node_19ef0050-617d-4624-bbce-207469edf883 / DEBUG - Message received: {'researcher_id': 'researcher_96a37edc-2ba8-47d7-aa8e-33679104e4b2', 'tags': ['#MNIST', '#dataset'], 'command': 'search'}
2022-03-16 13:32:50,643 fedbiomed INFO - Node selected for training -> node_19ef0050-617d-4624-bbce-207469edf883
2022-03-16 13:32:50,685 fedbiomed DEBUG - Model file has been saved: /home/scansiz/Desktop/Inria/development/fedbiomed/var/experiments/Experiment_0093/my_model_1c431521-ae44-4da

Start TensorBoard to see the loss value after every iteration during training. An empty screen is normal at first: the dashboard fills in once the experiment is running. The notebook refreshes the results every 30 seconds; you can also click the refresh button to see the current training steps.

In [4]:
from fedbiomed.researcher.environ import environ
tensorboard_dir = environ['TENSORBOARD_RESULTS_DIR']

In [5]:
%load_ext tensorboard

In [6]:
%tensorboard --logdir "$tensorboard_dir"

In [7]:
exp.run()

2022-03-16 13:32:52,508 fedbiomed INFO - Sampled nodes in round 0 ['node_19ef0050-617d-4624-bbce-207469edf883']
2022-03-16 13:32:52,512 fedbiomed INFO - Send message to node node_19ef0050-617d-4624-bbce-207469edf883 - {'researcher_id': 'researcher_96a37edc-2ba8-47d7-aa8e-33679104e4b2', 'job_id': '443ea241-f06a-4e65-9307-ae10a8859bb3', 'training_args': {'batch_size': 48, 'lr': 0.001, 'epochs': 2, 'dry_run': False, 'batch_maxnum': 100}, 'model_args': {}, 'command': 'train', 'model_url': 'http://localhost:8844/media/uploads/2022/03/16/my_model_1c431521-ae44-4dad-91f3-9420652540ec.py', 'params_url': 'http://localhost:8844/media/uploads/2022/03/16/aggregated_params_init_298fd41b-f601-4f51-a401-e0751ea4d139.pt', 'model_class': 'MyTrainingPlan', 'training_data': {'node_19ef0050-617d-4624-bbce-207469edf883': ['dataset_ba55374f-ddc3-4f5d-8bb6-deac79c459ee']}}
2022-03-16 13:32:52,515 fedbiomed DEBUG - researcher_96a37edc-2ba8-47d7-aa8e-33679104e4b2
2022-03-16 13:32:52,521 fedbiomed INFO - log fr

2022-03-16 13:33:10,778 fedbiomed INFO - log from: node_19ef0050-617d-4624-bbce-207469edf883 / INFO - results uploaded successfully 
2022-03-16 13:33:17,552 fedbiomed INFO - Downloading model params after training on node_19ef0050-617d-4624-bbce-207469edf883 - from http://localhost:8844/media/uploads/2022/03/16/node_params_65f5510a-1a85-440e-9843-5f5f8cfa1f9e.pt
2022-03-16 13:33:17,600 fedbiomed DEBUG - upload (HTTP GET request) of file node_params_e5bcdf2f-c9c6-41d4-a382-cb054dc20c70.pt successful, with status code 200
2022-03-16 13:33:17,618 fedbiomed INFO - Nodes that successfully reply in round 0 ['node_19ef0050-617d-4624-bbce-207469edf883']
2022-03-16 13:33:17,829 fedbiomed DEBUG - upload (HTTP POST request) of file /home/scansiz/Desktop/Inria/development/fedbiomed/var/experiments/Experiment_0093/aggregated_params_edb9d8a0-6330-497f-a0bd-19141b3c73d7.pt successful, with status code 201
2022-03-16 13:33:17,832 fedbiomed INFO - Saved aggregated params for round 0 in /home/scansiz/De

2022-03-16 13:33:35,117 fedbiomed INFO - log from: node_19ef0050-617d-4624-bbce-207469edf883 / DEBUG - Reached 100 batches for this epoch, ignore remaining data
2022-03-16 13:33:35,119 fedbiomed INFO - log from: node_19ef0050-617d-4624-bbce-207469edf883 / DEBUG - running model.postprocess() method
2022-03-16 13:33:35,120 fedbiomed INFO - log from: node_19ef0050-617d-4624-bbce-207469edf883 / DEBUG - model.postprocess() method not provided
2022-03-16 13:33:35,300 fedbiomed INFO - log from: node_19ef0050-617d-4624-bbce-207469edf883 / DEBUG - upload (HTTP POST request) of file /home/scansiz/Desktop/Inria/development/fedbiomed/var/tmp/node_params_7fffc02b-91e5-4d22-ac6b-728879f7b483.pt successful, with status code 201
2022-03-16 13:33:35,303 fedbiomed INFO - log from: node_19ef0050-617d-4624-bbce-207469edf883 / INFO - results uploaded successfully 
2022-03-16 13:33:42,863 fedbiomed INFO - Downloading model params after training on node_19ef0050-617d-4624-bbce-207469edf883 - from http://loca

2022-03-16 13:33:59,288 fedbiomed INFO - log from: node_19ef0050-617d-4624-bbce-207469edf883 / DEBUG - Reached 100 batches for this epoch, ignore remaining data
2022-03-16 13:33:59,289 fedbiomed INFO - log from: node_19ef0050-617d-4624-bbce-207469edf883 / DEBUG - running model.postprocess() method
2022-03-16 13:33:59,290 fedbiomed INFO - log from: node_19ef0050-617d-4624-bbce-207469edf883 / DEBUG - model.postprocess() method not provided
2022-03-16 13:33:59,478 fedbiomed INFO - log from: node_19ef0050-617d-4624-bbce-207469edf883 / DEBUG - upload (HTTP POST request) of file /home/scansiz/Desktop/Inria/development/fedbiomed/var/tmp/node_params_8e742703-3025-4091-b62a-8bdf28828762.pt successful, with status code 201
2022-03-16 13:33:59,479 fedbiomed INFO - log from: node_19ef0050-617d-4624-bbce-207469edf883 / INFO - results uploaded successfully 
2022-03-16 13:34:08,145 fedbiomed INFO - Downloading model params after training on node_19ef0050-617d-4624-bbce-207469edf883 - from http://loca

3

To display the current values, click the refresh button on the TensorBoard screen.

Local training results for each round and each node are available via `exp.training_replies()` (index 0 to `rounds - 1`).

For example you can view the training results for the last round below.

Different timings (in seconds) are reported for each dataset of a node participating in a round:
- `rtime_training` real time (clock time) spent in the training function on the node
- `ptime_training` process time (user and system CPU) spent in the training function on the node
- `rtime_total` real time (clock time) spent in the researcher between sending the request and handling the response, at the `Job()` layer
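
The difference between real time and process time can be illustrated with Python's standard clocks. The sketch below is only an illustration: the helper `timed` and the keys `rtime` and `ptime` are hypothetical, and it is an assumption that Fed-BioMed measures its timings with comparable clocks.

```python
import time


def timed(func):
    """Run func() and return (result, timings) with real and process time."""
    r_start = time.perf_counter()   # wall-clock (real) time
    p_start = time.process_time()   # user + system CPU time of this process
    result = func()
    timings = {
        "rtime": time.perf_counter() - r_start,
        "ptime": time.process_time() - p_start,
    }
    return result, timings


# Sleeping consumes real time but almost no CPU time
_, timing = timed(lambda: time.sleep(0.2))
print("rtime={rtime:.2f} s, ptime={ptime:.2f} s".format(**timing))
```

A large gap between the two values usually means the process was waiting (on I/O, the network, or sleep) rather than computing.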

In [None]:
print("\nList the training rounds : ", exp.training_replies().keys())

print("\nList the nodes for the last training round and their timings : ")
round_data = exp.training_replies()[rounds - 1].data()
for c in range(len(round_data)):
    print("\t- {id} :\
    \n\t\trtime_training={rtraining:.2f} seconds\
    \n\t\tptime_training={ptraining:.2f} seconds\
    \n\t\trtime_total={rtotal:.2f} seconds".format(id = round_data[c]['node_id'],
        rtraining = round_data[c]['timing']['rtime_training'],
        ptraining = round_data[c]['timing']['ptime_training'],
        rtotal = round_data[c]['timing']['rtime_total']))
print('\n')
    
exp.training_replies()[rounds - 1].dataframe()

Federated parameters for each round are available via `exp.aggregated_params()` (index 0 to `rounds - 1`).

For example, you can view the federated parameters for the last round of the experiment:

In [None]:
print("\nList the training rounds : ", exp.aggregated_params().keys())

print("\nAccess the federated params for the last training round :")
print("\t- params_path: ", exp.aggregated_params()[rounds - 1]['params_path'])
print("\t- parameter data: ", exp.aggregated_params()[rounds - 1]['params'].keys())


## Optional: searching the data

In [None]:
from fedbiomed.researcher.requests import Requests

r = Requests()
data = r.search(tags)

import pandas as pd
for node_id in data.keys():
    print('\n','Data for ', node_id, '\n\n', pd.DataFrame(data[node_id]))

Feel free to try your own models :D