# Fedbiomed Researcher base example

Use for development (automatically reloads changes made across packages).

In [1]:
%load_ext autoreload
%autoreload 2

## Start the network
Before running this notebook, start the network with `./scripts/fedbiomed_run network`

## Setting the node up
A node must be configured beforehand:
1. `./scripts/fedbiomed_run node add`
  * Select option 2 (default) to add MNIST to the node
  * Confirm default tags by hitting "y" and ENTER
  * Pick the folder where MNIST is downloaded (this is due to torch issue https://github.com/pytorch/vision/issues/3549)
  * Data must have been added (if you get a warning saying that data must be unique, it is because the data has already been added)
  
2. Check that your data has been added by executing `./scripts/fedbiomed_run node list`
3. Run the node using `./scripts/fedbiomed_run node run`. Wait until you get `Starting task manager`: it means the node is online.

## Define an experiment model and parameters

Declare a `MyTrainingPlan` class (a torch.nn model) to send to the node for training.

In [2]:
from fedbiomed.researcher.environ import environ
import tempfile
tmp_dir_model = tempfile.TemporaryDirectory(dir=environ['TMP_DIR']+'/')
model_file = tmp_dir_model.name + '/class_export_mnist.py'

2022-01-07 11:41:33,230 fedbiomed INFO - Component environment:
2022-01-07 11:41:33,231 fedbiomed INFO - - type = ComponentType.RESEARCHER


Note: write **only** the code to export in the following cell.

In [3]:
%%writefile "$model_file"

import torch
import torch.nn as nn
import torch.nn.functional as F
from fedbiomed.common.torchnn import TorchTrainingPlan
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Here we define the model to be used.
# You can use any class name (here 'MyTrainingPlan')
class MyTrainingPlan(TorchTrainingPlan):
    def __init__(self):
        super(MyTrainingPlan, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)
        
        # Here we define the custom dependencies that will be needed by our custom DataLoader:
        # in this case, we need the torch DataLoader class.
        # Since we will train on MNIST, we need datasets and transforms from torchvision
        deps = ["from torchvision import datasets, transforms",
               "from torch.utils.data import DataLoader"]
        self.add_dependency(deps)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output

    def training_data(self, batch_size=48):
        # Custom torch DataLoader for MNIST data, normalized with the MNIST mean and standard deviation
        transform = transforms.Compose([transforms.ToTensor(),
                                        transforms.Normalize((0.1307,), (0.3081,))])
        dataset1 = datasets.MNIST(self.dataset_path, train=True, download=False, transform=transform)
        train_kwargs = {'batch_size': batch_size, 'shuffle': True}
        data_loader = DataLoader(dataset1, **train_kwargs)
        return data_loader
    
    def training_step(self, data, target):
        output = self.forward(data)
        loss   = torch.nn.functional.nll_loss(output, target)
        return loss


Writing /home/scansiz/Desktop/Inria/development/fedbiomed/var/tmp/tmpyfqfd_u3/class_export_mnist.py
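The input size `9216` of `fc1` is not arbitrary: it is the number of features left after the convolution and pooling layers have processed a 1 x 28 x 28 MNIST image. The short computation below retraces it, using the output-size formula for an unpadded layer, `out = (in - kernel) // stride + 1`, and the fact that `F.max_pool2d(x, 2)` uses a stride equal to its kernel size by default. The helper `out_size` is only an illustration, not part of Fed-BioMed.

```python
def out_size(size: int, kernel: int, stride: int = 1) -> int:
    """Output length of an unpadded conv/pooling layer along one spatial dimension."""
    return (size - kernel) // stride + 1


side = 28                                  # MNIST images are 1 x 28 x 28
side = out_size(side, kernel=3)            # conv1 (3x3, stride 1): 28 -> 26
side = out_size(side, kernel=3)            # conv2 (3x3, stride 1): 26 -> 24
side = out_size(side, kernel=2, stride=2)  # max_pool2d(x, 2): 24 -> 12
channels = 64                              # conv2 outputs 64 feature maps
flat_features = channels * side * side     # torch.flatten(x, 1) keeps the batch dimension
print(side, flat_features)                 # 12 9216
```

Dropout and ReLU do not change the shape, so changing the kernel sizes, the pooling, or the number of `conv2` channels requires updating `fc1` accordingly.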


These arguments correspond respectively to:
* `model_args`: a dictionary with the arguments related to the model (e.g. number of layers, features, etc.). This will be passed to the model class on the node side.
* `training_args`: a dictionary containing the arguments for the training routine (e.g. batch size, learning rate, epochs, etc.). This will be passed to the routine on the node side.

**NOTE:** typos and/or missing positional (required) arguments will raise an error. 🤓
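A minimal illustration of why this happens, assuming the node unpacks the dictionary as keyword arguments of its training routine: Python then rejects an unknown key (a typo) or a missing required parameter with a `TypeError`. The `training_routine` below is a hypothetical stand-in whose signature only mirrors a few keys of `training_args`; it is not the Fed-BioMed API.

```python
def training_routine(epochs, lr=1e-3, batch_size=48):
    """Hypothetical stand-in: 'epochs' is required, the other parameters have defaults."""
    return {"epochs": epochs, "lr": lr, "batch_size": batch_size}


def try_call(args: dict) -> str:
    """Unpack args as keyword arguments and report the outcome instead of raising."""
    try:
        training_routine(**args)
        return "ok"
    except TypeError as err:
        return f"TypeError: {err}"


print(try_call({"epochs": 1, "lr": 1e-3}))  # ok
print(try_call({"epoch": 1, "lr": 1e-3}))   # typo ('epoch' instead of 'epochs') -> TypeError
print(try_call({"lr": 1e-3}))               # required 'epochs' missing -> TypeError
```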

In [4]:
model_args = {}

training_args = {
    'batch_size': 48, 
    'lr': 1e-3, 
    'epochs': 1, 
    'dry_run': False,  
    'batch_maxnum': 100 # Fast pass for development: only use ( batch_maxnum * batch_size ) samples
}
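A quick calculation shows how much data one epoch actually uses with these values. This sketch assumes the node serves the full MNIST training split (60,000 images); the per-epoch cap itself matches the node log `Reached 100 batches for this epoch, ignore remaining data` printed during the run. The dictionary is named `example_args` so that running the cell does not overwrite `training_args`.

```python
example_args = {'batch_size': 48, 'epochs': 1, 'batch_maxnum': 100}  # same values as above
mnist_train_size = 60_000  # images in the MNIST training split

# With the cap, each epoch stops after batch_maxnum batches
samples_per_epoch = min(example_args['batch_maxnum'] * example_args['batch_size'],
                        mnist_train_size)
# Without the cap, one epoch would need ceil(60000 / 48) batches
full_batches = -(-mnist_train_size // example_args['batch_size'])
share = samples_per_epoch / mnist_train_size

print(samples_per_epoch, full_batches, f"{share:.0%}")  # 4800 1250 8%
```

Each round therefore trains on roughly 8% of the data per epoch, which keeps development runs short; set a larger `batch_maxnum` for real experiments.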

## Declare and run the experiment

- search for nodes serving data with these `tags`, optionally restricted to a list of node IDs given in `nodes`
- run a round of local training on these nodes with the model defined in `model_path`, then federate the results with `aggregator`
- run for `rounds` rounds, applying the `node_selection_strategy` between the rounds
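The dataset search in the first step can be pictured as a simple filter. The sketch below assumes that a dataset matches when it carries every requested tag, and that `nodes`, when given, restricts the search to those node IDs; `select_nodes` and the inventory layout are hypothetical, not Fed-BioMed's implementation. Its return value mirrors the shape of the `training_data` field (`{node_id: [dataset_id, ...]}`) visible in the training request logged during `exp.run()`.

```python
from typing import Dict, List, Optional


def select_nodes(node_datasets: Dict[str, List[dict]],
                 tags: List[str],
                 nodes: Optional[List[str]] = None) -> Dict[str, List[str]]:
    """Return {node_id: [dataset_id, ...]} for datasets carrying every requested tag."""
    selected = {}
    for node_id, datasets in node_datasets.items():
        if nodes is not None and node_id not in nodes:
            continue  # optional filter on node IDs
        matching = [d["dataset_id"] for d in datasets if set(tags) <= set(d["tags"])]
        if matching:
            selected[node_id] = matching
    return selected


# Illustrative inventory with made-up IDs
inventory = {
    "node_A": [{"dataset_id": "dataset_1", "tags": ["#MNIST", "#dataset"]}],
    "node_B": [{"dataset_id": "dataset_2", "tags": ["#MNIST"]}],
}
print(select_nodes(inventory, ["#MNIST", "#dataset"]))       # {'node_A': ['dataset_1']}
print(select_nodes(inventory, ["#MNIST"], nodes=["node_B"]))  # {'node_B': ['dataset_2']}
```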

In [5]:
from fedbiomed.researcher.experiment import Experiment
from fedbiomed.researcher.aggregators.fedavg import FedAverage

tags = ['#MNIST', '#dataset']
rounds = 2

exp = Experiment(tags=tags,
                 #nodes=None,
                 model_path=model_file,
                 model_args=model_args,
                 model_class='MyTrainingPlan',
                 training_args=training_args,
                 rounds=rounds,
                 aggregator=FedAverage(),
                 node_selection_strategy=None)

2022-01-07 11:41:38,332 fedbiomed INFO - Messaging researcher_f3bdbaa1-ecdd-4103-9433-736433398173 successfully connected to the message broker, object = <fedbiomed.common.messaging.Messaging object at 0x7f025d282400>
2022-01-07 11:41:38,348 fedbiomed INFO - Searching dataset with data tags: ['#MNIST', '#dataset'] for all nodes
2022-01-07 11:41:38,350 fedbiomed INFO - log from: node_d97aa17a-485a-4125-bc0d-fd4599cabc85 / DEBUG - Message received: {'researcher_id': 'researcher_f3bdbaa1-ecdd-4103-9433-736433398173', 'tags': ['#MNIST', '#dataset'], 'command': 'search'}
2022-01-07 11:41:48,359 fedbiomed INFO - Node selected for training -> node_d97aa17a-485a-4125-bc0d-fd4599cabc85
2022-01-07 11:41:48,480 fedbiomed DEBUG - torchnn saved model filename: /home/scansiz/Desktop/Inria/development/fedbiomed/var/experiments/Experiment_0000/my_model_837f6a65-8980-4e3d-a98a-60fdc3bdf825.py


Let's start the experiment.

By default, `run()` blocks until all the `rounds` have been completed for all the nodes.
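A blocking run can be sketched as a loop over rounds: sample nodes, train locally, aggregate. This sketch assumes that `node_selection_strategy=None` means every node takes part in every round, and it uses the usual FedAvg weighting (each node's parameters weighted by its number of training samples) as a stand-in for `FedAverage`. `fed_average`, `run` and `toy_local_train` are hypothetical helpers, and parameters are plain lists of floats rather than torch tensors.

```python
from typing import Callable, Dict, List, Tuple

Params = Dict[str, List[float]]  # parameter name -> flat list of weights


def fed_average(updates: List[Tuple[Params, int]]) -> Params:
    """Weighted mean of node parameters, weights = number of local training samples."""
    total = sum(n_samples for _, n_samples in updates)
    first = updates[0][0]
    return {
        name: [sum(params[name][i] * n for params, n in updates) / total
               for i in range(len(first[name]))]
        for name in first
    }


def run(initial: Params, nodes: List[str],
        local_train: Callable[[str, Params], Tuple[Params, int]],
        rounds: int) -> Dict[int, Params]:
    """Blocking loop: returns the aggregated parameters of each round, keyed 0..rounds-1."""
    params, history = initial, {}
    for r in range(rounds):
        sampled = list(nodes)  # assumed default strategy: all nodes, every round
        updates = [local_train(node, params) for node in sampled]
        params = fed_average(updates)
        history[r] = params
    return history


def toy_local_train(node: str, params: Params) -> Tuple[Params, int]:
    """Hypothetical local training: shift every weight by a node-specific offset."""
    offset, n_samples = {"node_A": (1.0, 100), "node_B": (3.0, 300)}[node]
    return {name: [w + offset for w in ws] for name, ws in params.items()}, n_samples


history = run({"w": [0.0, 0.0]}, ["node_A", "node_B"], toy_local_train, rounds=2)
print(history)  # {0: {'w': [2.5, 2.5]}, 1: {'w': [5.0, 5.0]}}
```

Because `node_B` holds three times as many samples, the average lands closer to its update; the per-round history indexed from 0 to `rounds - 1` mirrors how results are exposed after the run.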

In [6]:
exp.run()

2022-01-07 11:41:53,791 fedbiomed INFO - Sampled nodes in round 0 ['node_d97aa17a-485a-4125-bc0d-fd4599cabc85']
2022-01-07 11:41:53,793 fedbiomed INFO - Send message to node node_d97aa17a-485a-4125-bc0d-fd4599cabc85 - {'researcher_id': 'researcher_f3bdbaa1-ecdd-4103-9433-736433398173', 'job_id': '2dd42d23-bd1e-4a4d-bdca-8dc448ab29cf', 'training_args': {'batch_size': 48, 'lr': 0.001, 'epochs': 1, 'dry_run': False, 'batch_maxnum': 100}, 'model_args': {}, 'command': 'train', 'model_url': 'http://localhost:8844/media/uploads/2022/01/07/my_model_837f6a65-8980-4e3d-a98a-60fdc3bdf825.py', 'params_url': 'http://localhost:8844/media/uploads/2022/01/07/aggregated_params_init_8602615f-ff51-4d05-9e50-b894d84ced9b.pt', 'model_class': 'MyTrainingPlan', 'training_data': {'node_d97aa17a-485a-4125-bc0d-fd4599cabc85': ['dataset_55d371d4-09a2-45a8-84a3-a0dbf9ecb486']}}
2022-01-07 11:41:53,793 fedbiomed DEBUG - researcher_f3bdbaa1-ecdd-4103-9433-736433398173
2022-01-07 11:41:53,801 fedbiomed INFO - log fr

2022-01-07 11:42:14,107 fedbiomed INFO - log from: node_d97aa17a-485a-4125-bc0d-fd4599cabc85 / DEBUG - Reached 100 batches for this epoch, ignore remaining data
2022-01-07 11:42:14,274 fedbiomed INFO - log from: node_d97aa17a-485a-4125-bc0d-fd4599cabc85 / INFO - results uploaded successfully 
2022-01-07 11:42:24,098 fedbiomed INFO - Downloading model params after training on node_d97aa17a-485a-4125-bc0d-fd4599cabc85 - from http://localhost:8844/media/uploads/2022/01/07/node_params_08a063ca-7971-4f0c-9642-2866ae60dc23.pt
2022-01-07 11:42:24,154 fedbiomed INFO - Nodes that successfully reply in round 1 ['node_d97aa17a-485a-4125-bc0d-fd4599cabc85']
2022-01-07 11:42:24,338 fedbiomed INFO - Saved aggregated params for round 1 in /home/scansiz/Desktop/Inria/development/fedbiomed/var/experiments/Experiment_0000/aggregated_params_c7c515b1-854a-42ab-b9f4-ff2523fdb624.pt


Local training results for each round and each node are available in `exp.training_replies`, indexed from 0 to `rounds - 1`.

For example, you can view the training results for the last round below.

Different timings (in seconds) are reported for each dataset of a node participating in a round:
- `rtime_training` real time (clock time) spent in the training function on the node
- `ptime_training` process time (user and system CPU) spent in the training function on the node
- `rtime_total` real time (clock time) spent in the researcher between sending the request and handling the response, at the `Job()` layer
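The difference between real (clock) time and process time can be reproduced with the standard library: `time.perf_counter()` measures elapsed wall-clock time, while `time.process_time()` counts only the user and system CPU time of the current process, so time spent sleeping or waiting is excluded. This is a sketch of the two notions, not a claim about how Fed-BioMed measures them; `timed` and `compute_then_wait` are hypothetical helpers.

```python
import time


def timed(fn, *args, **kwargs):
    """Return (result, wall-clock seconds, CPU seconds of this process) for fn(*args)."""
    r_start, p_start = time.perf_counter(), time.process_time()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - r_start, time.process_time() - p_start


def compute_then_wait(n: int, pause: float) -> int:
    """Hypothetical workload: CPU-bound loop followed by an idle wait."""
    total = sum(i * i for i in range(n))  # counted in both real and process time
    time.sleep(pause)                     # counted in real time only
    return total


result, rtime, ptime = timed(compute_then_wait, 200_000, 0.2)
print(f"rtime={rtime:.2f} seconds, ptime={ptime:.2f} seconds")
```

Expect `rtime` to be at least the 0.2 s pause while `ptime` stays well below it. By the same reasoning, `rtime_total` is measured on the researcher side and therefore also includes the messaging and waiting time between sending the request and handling the reply.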

In [None]:
print("\nList the training rounds : ", exp.training_replies.keys())

print("\nList the nodes for the last training round and their timings : ")
round_data = exp.training_replies[rounds - 1].data
for node_reply in round_data:
    timing = node_reply['timing']
    print(f"\t- {node_reply['node_id']} :"
          f"\n\t\trtime_training={timing['rtime_training']:.2f} seconds"
          f"\n\t\tptime_training={timing['ptime_training']:.2f} seconds"
          f"\n\t\trtime_total={timing['rtime_total']:.2f} seconds")
print('\n')
    
exp.training_replies[rounds - 1].dataframe

Federated parameters for each round are available in `exp.aggregated_params`, indexed from 0 to `rounds - 1`.

For example, you can view the federated parameters for the last round of the experiment:

In [None]:
print("\nList the training rounds : ", exp.aggregated_params.keys())

print("\nAccess the federated params for the last training round :")
print("\t- params_path: ", exp.aggregated_params[rounds - 1]['params_path'])
print("\t- parameter data: ", exp.aggregated_params[rounds - 1]['params'].keys())


Feel free to run other sample notebooks or try your own models :D