# Fedbiomed Researcher

Use this cell during development: it automatically reloads changes made across packages.

In [1]:
%load_ext autoreload
%autoreload 2

## Start the network
Before running this notebook, start the network with `./scripts/fedbiomed_run network`

## Setting the node up
Configure a node before running the experiment:
1. `./scripts/fedbiomed_run node add`
  * Select option 2 (default) to add MNIST to the node
  * Confirm the default tags by typing "y" and pressing ENTER
  * Pick the folder where MNIST is downloaded (this is needed because of torch issue https://github.com/pytorch/vision/issues/3549)
  * The data must end up registered on the node (a warning saying that data must be unique means it has already been added)
  
2. Check that your data has been added by executing `./scripts/fedbiomed_run node list`
3. Run the node using `./scripts/fedbiomed_run node run`. Wait until you get `Starting task manager`, which means the node is online.

## Create an experiment to train a model on the data found

Declare a `MyTrainingPlan` class, built from `torch.nn` layers, to send to the node for training.

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F  # referenced as `F` in forward() below
from fedbiomed.common.training_plans import TorchTrainingPlan
from fedbiomed.common.data import DataManager
from torchvision import datasets, transforms

# Here we define the model to be used. 
# You can use any class name (here 'MyTrainingPlan')
class MyTrainingPlan(TorchTrainingPlan):
    def __init__(self, model_args: dict = {}):
        super(MyTrainingPlan, self).__init__(model_args)
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)
        
        # Here we define the custom dependencies that will be needed by our custom Dataloader
        # In this case, we need the torch DataLoader classes
        # Since we will train on MNIST, we need datasets and transform from torchvision
        deps = ["from torchvision import datasets, transforms"
               ]
        
        self.add_dependency(deps)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        
        
        output = F.log_softmax(x, dim=1)
        return output

    def training_data(self, batch_size = 48):
        # Custom torch Dataloader for MNIST data
        transform = transforms.Compose([transforms.ToTensor(),
                                        transforms.Normalize((0.1307,), (0.3081,))])
        dataset1 = datasets.MNIST(self.dataset_path, train=True, download=False, transform=transform)
        train_kwargs = {'batch_size': batch_size, 'shuffle': True}
        return DataManager(dataset=dataset1, **train_kwargs)
    
    def training_step(self, data, target):
        output = self.forward(data)
        loss   = torch.nn.functional.nll_loss(output, target)
        return loss
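
The `9216` input size of `fc1` follows from the shapes of the layers before it. As a quick check, here is a minimal sketch assuming 28x28 single-channel MNIST images and no padding. The `conv_out` helper is illustrative and is not part of Fed-BioMed or PyTorch.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution or pooling layer on a square input."""
    return (size + 2 * padding - kernel) // stride + 1


side = 28                                   # assumed MNIST input: 28x28
side = conv_out(side, kernel=3)             # conv1 (3x3, stride 1): 28 -> 26
side = conv_out(side, kernel=3)             # conv2 (3x3, stride 1): 26 -> 24
side = conv_out(side, kernel=2, stride=2)   # max_pool2d(x, 2):      24 -> 12

flattened = 64 * side * side                # conv2 outputs 64 channels
print(flattened)                            # 9216
```

If you change a kernel size or add a layer, the first argument of `fc1` has to be recomputed the same way.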


These arguments are, respectively:
* `model_args`: a dictionary with the arguments related to the model (e.g. number of layers, features, etc.). This will be passed to the model class on the node side.
* `training_args`: a dictionary containing the arguments for the training routine (e.g. batch size, learning rate, epochs, etc.). This will be passed to the routine on the node side.

**NOTE:** typos and/or missing positional (required) arguments will raise an error. 🤓

In [3]:
model_args = {}

training_args = {
    'batch_size': 48, 
    'lr': 1e-3, 
    'epochs': 1, 
    'dry_run': False,  
    'batch_maxnum': 100 # Fast pass for development : only use ( batch_maxnum * batch_size ) samples
}
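
To make the effect of `batch_maxnum` concrete, the arithmetic from the comment above can be written out. This is only a back-of-the-envelope sketch using the values of this notebook; it does not call Fed-BioMed.

```python
# Values copied from the training_args cell above
training_args = {"batch_size": 48, "batch_maxnum": 100, "epochs": 1}

# With the fast pass enabled, each epoch stops after batch_maxnum batches
samples_per_epoch = training_args["batch_size"] * training_args["batch_maxnum"]
samples_per_round = samples_per_epoch * training_args["epochs"]

print(samples_per_round)  # 4800
```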

Define an experiment:
- search nodes serving data for these `tags`, optionally filtering on a list of node IDs with `nodes`
- run a round of local training on nodes with the model defined in `model_class`, then federate the results with `aggregator`
- run for `round_limit` rounds, applying the `node_selection_strategy` between the rounds
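
To give an intuition for the federation step, here is a minimal sketch of generic federated averaging on plain Python lists. It is not Fed-BioMed's `FedAverage` code. The `federated_average` function, the parameter names and the sample counts are all made up for illustration, and weighting by sample count is an assumption about the scheme.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def federated_average(node_params: List[Params], sample_counts: List[int]) -> Params:
    """Average each parameter across nodes, weighting nodes by their sample count."""
    total = sum(sample_counts)
    weights = [count / total for count in sample_counts]
    averaged: Params = {}
    for name in node_params[0]:
        length = len(node_params[0][name])
        averaged[name] = [
            sum(w * params[name][i] for w, params in zip(weights, node_params))
            for i in range(length)
        ]
    return averaged


# Two hypothetical nodes holding 100 and 300 samples
node_a = {"fc.weight": [1.0, 2.0], "fc.bias": [0.0]}
node_b = {"fc.weight": [3.0, 6.0], "fc.bias": [4.0]}
result = federated_average([node_a, node_b], sample_counts=[100, 300])
print(result)  # {'fc.weight': [2.5, 5.0], 'fc.bias': [3.0]}
```

With a single node, as in this notebook, the average is simply that node's parameters.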

In [4]:
from fedbiomed.researcher.experiment import Experiment
from fedbiomed.researcher.aggregators.fedavg import FedAverage

tags =  ['#MNIST', '#dataset']
rounds = 1

exp = Experiment(tags=tags,
                 model_args=model_args,
                 model_class=MyTrainingPlan,
                 training_args=training_args,
                 round_limit=rounds,
                 aggregator=FedAverage(),
                 node_selection_strategy=None,
                 tensorboard=True
                )

exp.set_test_ratio(0.1)
exp.set_test_on_local_updates(True)
exp.set_test_on_global_updates(True)

2022-03-30 21:58:26,255 fedbiomed INFO - Component environment:
2022-03-30 21:58:26,255 fedbiomed INFO - type = ComponentType.RESEARCHER
2022-03-30 21:58:26,410 fedbiomed INFO - Messaging researcher_ad3c024c-fb12-4ca1-9204-0f6b9220bed8 successfully connected to the message broker, object = <fedbiomed.common.messaging.Messaging object at 0x7f1568a0f2e0>
2022-03-30 21:58:26,452 fedbiomed INFO - Searching dataset with data tags: ['#MNIST', '#dataset'] for all nodes
2022-03-30 21:58:36,463 fedbiomed INFO - Node selected for training -> node_ad006bab-e62d-4745-948c-604a37b7f170
2022-03-30 21:58:36,495 fedbiomed DEBUG - Model file has been saved: /home/scansiz/Desktop/Inria/development/fedbiomed/var/experiments/Experiment_0025/my_model_6489c771-b461-42ff-bbb1-0b3d1d3978dc.py
2022-03-30 21:58:36,515 fedbiomed DEBUG - upload (HTTP POST request) of file /home/scansiz/Desktop/Inria/development/fedbiomed/var/experiments/Experiment_0025/my_model_6489c771-b461-42ff-bbb1-0b3d1d3978dc.py successful, 

True
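
`exp.set_test_ratio(0.1)` holds out a fraction of each node's dataset for testing. The sketch below shows the resulting sizes for the 60000-sample MNIST training set. The `split_sizes` helper is hypothetical and the rounding rule is an assumption, not Fed-BioMed's actual splitting code. The sizes it gives agree with the 54000 training and 6000 testing samples reported in the logs of this notebook.

```python
from typing import Tuple


def split_sizes(n_samples: int, test_ratio: float) -> Tuple[int, int]:
    """Return (train_size, test_size) for a given test ratio (rounding is assumed)."""
    n_test = round(n_samples * test_ratio)
    return n_samples - n_test, n_test


train_size, test_size = split_sizes(60000, 0.1)
print(train_size, test_size)  # 54000 6000
```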

Start TensorBoard to see the loss value after every iteration during training. An empty screen is normal at first: results appear on the dashboard once you run the experiment. The notebook refreshes the results every 30 seconds, and you can also click the refresh button to see the current training steps.

In [5]:
from fedbiomed.researcher.environ import environ
tensorboard_dir = environ['TENSORBOARD_RESULTS_DIR']

In [6]:
%load_ext tensorboard

In [7]:
%tensorboard --logdir "$tensorboard_dir"

In [10]:
exp.run(rounds=2, increase=True)

2022-03-30 22:00:09,572 fedbiomed DEBUG - Auto increasing total rounds for experiment from 2 to 4
2022-03-30 22:00:09,572 fedbiomed INFO - Sampled nodes in round 2 ['node_ad006bab-e62d-4745-948c-604a37b7f170']
2022-03-30 22:00:09,573 fedbiomed INFO - Sending request
					 To: node_ad006bab-e62d-4745-948c-604a37b7f170
					 Reqeust: : Perform training with the arguments: {'researcher_id': 'researcher_ad3c024c-fb12-4ca1-9204-0f6b9220bed8', 'job_id': '2a70f2b7-8ee9-4236-a9c3-9dbed7031615', 'training_args': {'test_ratio': 0.1, 'test_on_local_updates': True, 'test_on_global_updates': True, 'test_metric': None, 'test_metric_args': {}, 'batch_size': 48, 'lr': 0.001, 'epochs': 1, 'dry_run': False, 'batch_maxnum': 100}, 'training': True, 'model_args': {}, 'command': 'train', 'model_url': 'http://localhost:8844/media/uploads/2022/03/30/my_model_6489c771-b461-42ff-bbb1-0b3d1d3978dc.py', 'params_url': 'http://localhost:8844/media/uploads/2022/03/30/aggregated_params_dce63
2022-03-30 22:00:12,240 fedbiomed INFO - TRAINING
					 NODE_ID: node_ad006bab-e62d-4745-948c-604a37b7f170
					 Epoch: 1 | Completed: 480/54000 (1%)
					 Loss: 0.269563
					 ---------
2022-03-30 22:00:12,705 fedbiomed INFO - TRAINING
					 NODE_ID: node_ad006bab-e62d-4745-948c-604a37b7f170
					 Epoch: 1 | Completed: 960/54000 (2%)
					 Loss: 0.208132
					 ---------
2022-03-30 22:00:13,207 fedbiomed INFO - TRAINING
					 NODE_ID: node_ad006bab-e62d-4745-948c-604a37b7f170
					 Epoch: 1 | Completed: 1440/54000 (3%)
					 Loss: 0.132512
					 ---------
2022-03-30 22:00:13,765 fedbiomed INFO - TRAINING
					 NODE_ID: node_ad006bab-e62d-4745-948c-604a37b7f170
					 Epoch: 1 | Completed: 1920/54000 (4%)
					 Loss: 0.167659
					 ---------
2022-03-30 22:00:14,238 fedbiomed INFO - TRAINING
					 NODE_ID: node_ad006bab-e62d-4745-948c-604a37b7f170
					 Epoch: 1 | Completed: 2400/54000 (4%)
					 Loss: 0.249605
					 ---------
2022-03-30 22:00:14,713 fedbiomed INFO - TRAINING
					 NODE_ID: node_ad006bab-e62d-4745-948c-604a37b7f170
					 Epoch: 1 | Completed: 2880/54000 (5%)
					 Loss: 0.093042
					 ---------
2022-03-30 22:00:15,203 fedbiomed INFO - TRAINING
					 NODE_ID: node_ad006bab-e62d-4745-948c-604a37b7f170
					 Epoch: 1 | Completed: 3360/54000 (6%)
					 Loss: 0.243867
					 ---------
2022-03-30 22:00:15,784 fedbiomed INFO - TRAINING
					 NODE_ID: node_ad006bab-e62d-4745-948c-604a37b7f170
					 Epoch: 1 | Completed: 3840/54000 (7%)
					 Loss: 0.024946
					 ---------
2022-03-30 22:00:16,268 fedbiomed INFO - TRAINING
					 NODE_ID: node_ad006bab-e62d-4745-948c-604a37b7f170
					 Epoch: 1 | Completed: 4320/54000 (8%)
					 Loss: 0.065175
					 ---------
2022-03-30 22:00:17,392 fedbiomed INFO - INFO
					 NODE node_ad006bab-e62d-4745-948c-604a37b7f170
					 MESSAGE: No `testing_step` method found in TrainingPlan: using default metric ACCURACY for model evaluation
-----------------------------------------------------------------
2022-03-30 22:00:19,084 fedbiomed INFO - TESTING ON LOCAL UPDATES
					 NODE_ID: node_ad006bab-e62d-4745-948c-604a37b7f170
					 Completed: 6000/6000 (100%)
					 ACCURACY: 0.967833
					 ---------
2022-03-30 22:00:19,237 fedbiomed INFO - INFO
					 NODE node_ad006bab-e62d-4745-948c-604a37b7f170
					 MESSAGE: results uploaded successfully
-----------------------------------------------------------------
2022-03-30 22:00:24,591 fedbiomed INFO - Downloading model params after training on node_ad006bab-e62d-4745-948c-604a37b7f170 - from http://localhost:8844/media/uploads/2022/03/30/node_params_35e35746-8b78-4a41-bed0-581ed6c2d1cb.pt
2022-03-30 22:00:24,633 fedbiomed DEBUG - upload (HTTP GET request) of file node_params_cfd24157-eae6-43be-84e3-eafc733bb8c5.pt successful, with status code 200
2022-03-30 22:00:24,643 fedbiomed INFO - Nodes that successfully reply in round 2 ['node_ad006bab-e62d-4745-948c-604a37b7f170']
2022-03-30 22:00:24,822 fedbiomed DEBUG - upload (HTTP POST request) of file /home/scansiz/Desktop/Inria/development/fedbiomed/var/experiments/Experiment_0025/aggregated_params_866fd37b-1b29-4c80-8995-ceafff76c1a5.pt successful, with status code 201
2022-03-30 22:00:24,828 fedbiomed INFO - Saved aggregated params for round 2 in /home/scansiz/Desktop/Inria/development/fedbiomed/var/experiments/Experiment_0025/aggregated_params_866fd37b-1b29-4c80-8995-ceafff76c1a5.pt
2022-03-3

2022-03-30 22:00:27,860 fedbiomed INFO - TRAINING
					 NODE_ID: node_ad006bab-e62d-4745-948c-604a37b7f170
					 Epoch: 1 | Completed: 480/54000 (1%)
					 Loss: 0.124026
					 ---------
2022-03-30 22:00:28,409 fedbiomed INFO - TRAINING
					 NODE_ID: node_ad006bab-e62d-4745-948c-604a37b7f170
					 Epoch: 1 | Completed: 960/54000 (2%)
					 Loss: 0.193208
					 ---------
2022-03-30 22:00:28,956 fedbiomed INFO - TRAINING
					 NODE_ID: node_ad006bab-e62d-4745-948c-604a37b7f170
					 Epoch: 1 | Completed: 1440/54000 (3%)
					 Loss: 0.058132
					 ---------
2022-03-30 22:00:29,503 fedbiomed INFO - TRAINING
					 NODE_ID: node_ad006bab-e62d-4745-948c-604a37b7f170
					 Epoch: 1 | Completed: 1920/54000 (4%)
					 Loss: 0.158816
					 ---------
2022-03-30 22:00:30,134 fedbiomed INFO - TRAINING
					 NODE_ID: node_ad006bab-e62d-4745-948c-604a37b7f170
					 Epoch: 1 | Completed: 2400/54000 (4%)
					 Loss: 0.141086
					 ---------
2022-03-30 22:00:30,690 fedbiomed INFO - TRAINING
					 NODE_ID: node_ad006bab-e62d-4745-948c-604a37b7f170
					 Epoch: 1 | Completed: 2880/54000 (5%)
					 Loss: 0.132982
					 ---------
2022-03-30 22:00:31,320 fedbiomed INFO - TRAINING
					 NODE_ID: node_ad006bab-e62d-4745-948c-604a37b7f170
					 Epoch: 1 | Completed: 3360/54000 (6%)
					 Loss: 0.121289
					 ---------
2022-03-30 22:00:31,909 fedbiomed INFO - TRAINING
					 NODE_ID: node_ad006bab-e62d-4745-948c-604a37b7f170
					 Epoch: 1 | Completed: 3840/54000 (7%)
					 Loss: 0.128271
					 ---------
2022-03-30 22:00:32,488 fedbiomed INFO - TRAINING
					 NODE_ID: node_ad006bab-e62d-4745-948c-604a37b7f170
					 Epoch: 1 | Completed: 4320/54000 (8%)
					 Loss: 0.076825
					 ---------
2022-03-30 22:00:33,781 fedbiomed INFO - INFO
					 NODE node_ad006bab-e62d-4745-948c-604a37b7f170
					 MESSAGE: No `testing_step` method found in TrainingPlan: using default metric ACCURACY for model evaluation
-----------------------------------------------------------------
2022-03-30 22:00:35,635 fedbiomed INFO - TESTING ON LOCAL UPDATES
					 NODE_ID: node_ad006bab-e62d-4745-948c-604a37b7f170
					 Completed: 6000/6000 (100%)
					 ACCURACY: 0.971833
					 ---------
2022-03-30 22:00:35,788 fedbiomed INFO - INFO
					 NODE node_ad006bab-e62d-4745-948c-604a37b7f170
					 MESSAGE: results uploaded successfully
-----------------------------------------------------------------
2022-03-30 22:00:44,861 fedbiomed INFO - Downloading model params after training on node_ad006bab-e62d-4745-948c-604a37b7f170 - from http://localhost:8844/media/uploads/2022/03/30/node_params_ba5436c4-33e7-47bb-b4f8-03bc9e7e33bb.pt
2022-03-30 22:00:44,915 fedbiomed DEBUG - upload (HTTP GET request) of file node_params_9a4ebccc-84d9-4e13-92e1-2ffbb53052ad.pt successful, with status code 200
2022-03-30 22:00:44,925 fedbiomed INFO - Nodes that successfully reply in round 3 ['node_ad006bab-e62d-4745-948c-604a37b7f170']
2022-03-30 22:00:45,075 fedbiomed DEBUG - upload (HTTP POST request) of file /home/scansiz/Desktop/Inria/development/fedbiomed/var/experiments/Experiment_0025/aggregated_params_a99965c2-8406-4205-baae-7d806b5f22e2.pt successful, with status code 201
2022-03-30 22:00:45,078 fedbiomed INFO - Saved aggregated params for round 3 in /home/scansiz/Desktop/Inria/development/fedbiomed/var/experiments/Experiment_0025/aggregated_params_a99965c2-8406-4205-baae-7d806b5f22e2.pt
2022-03-3

2

In [None]:
exp._monitor._metric_store


To display the current values, click the refresh button on the TensorBoard screen.

Local training results for each round and each node are available via `exp.training_replies()` (indexed from 0 to `rounds` - 1).

For example you can view the training results for the last round below.

Several timings (in seconds) are reported for each dataset of a node participating in a round:
- `rtime_training` real time (clock time) spent in the training function on the node
- `ptime_training` process time (user and system CPU) spent in the training function on the node
- `rtime_total` real time (clock time) spent in the researcher between sending the request and handling the response, at the `Job()` layer
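
The difference between real time and process time can be illustrated with Python's standard `time` module. This is a generic sketch, not the way Fed-BioMed measures these values, and the `timed` helper is a made-up name.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, real_seconds, process_seconds)."""
    real_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # user + system CPU time of this process
    result = fn(*args, **kwargs)
    real_elapsed = time.perf_counter() - real_start
    cpu_elapsed = time.process_time() - cpu_start
    return result, real_elapsed, cpu_elapsed


# Sleeping consumes clock time but almost no CPU time
_, rtime, ptime = timed(time.sleep, 0.2)
print(f"real={rtime:.2f}s process={ptime:.2f}s")
```

A large gap between the two usually means the process was waiting (I/O, network, sleep) rather than computing.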

In [None]:
print("\nList the training rounds : ", exp.training_replies().keys())

print("\nList the nodes for the last training round and their timings : ")
round_data = exp.training_replies()[rounds - 1].data()
for reply in round_data:
    timing = reply['timing']
    print(f"\t- {reply['node_id']} :"
          f"\n\t\trtime_training={timing['rtime_training']:.2f} seconds"
          f"\n\t\tptime_training={timing['ptime_training']:.2f} seconds"
          f"\n\t\trtime_total={timing['rtime_total']:.2f} seconds")
print('\n')

exp.training_replies()[rounds - 1].dataframe()

Federated parameters for each round are available via `exp.aggregated_params()` (indexed from 0 to `rounds` - 1).

For example, you can view the federated parameters for the last round of the experiment:

In [None]:
print("\nList the training rounds : ", exp.aggregated_params().keys())

print("\nAccess the federated params for the last training round :")
print("\t- params_path: ", exp.aggregated_params()[rounds - 1]['params_path'])
print("\t- parameter data: ", exp.aggregated_params()[rounds - 1]['params'].keys())


## Optional: searching the data

In [None]:
from fedbiomed.researcher.requests import Requests

r = Requests()
data = r.search(tags)

import pandas as pd
for node_id in data.keys():
    print('\n','Data for ', node_id, '\n\n', pd.DataFrame(data[node_id]))

Feel free to try your own models :D