# Performing Testing at Each Round of Training 

Use this cell during development: it automatically reloads changes made across packages.

In [1]:
%load_ext autoreload
%autoreload 2

## Start the network
Before running this notebook, start the network with `./scripts/fedbiomed_run network`

## Setting the node up
A node must be configured beforehand:
1. `./scripts/fedbiomed_run node add`
  * Select option 2 (default) to add MNIST to the node
  * Confirm default tags by hitting "y" and ENTER
  * Pick the folder where MNIST is downloaded (this is due to torch issue https://github.com/pytorch/vision/issues/3549)
  * Data must have been added (if you get a warning saying that data must be unique, it is because it has already been added)
  
2. Check that your data has been added by executing `./scripts/fedbiomed_run node list`
3. Run the node using `./scripts/fedbiomed_run node run`. Wait until you get `Starting task manager`, which means the node is online.

## Define an experiment model and parameters

Declare a `MyTrainingPlan` class, built from `torch.nn` modules, to send to the node for training.

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F  # used as `F` in `forward` below
from fedbiomed.common.training_plans import TorchTrainingPlan
from fedbiomed.common.data import DataManager
from torchvision import datasets, transforms

# Here we define the training plan, which contains the model to be used.
# You can use any class name (here 'MyTrainingPlan')
class MyTrainingPlan(TorchTrainingPlan):
    def __init__(self, model_args: dict = {}):
        super(MyTrainingPlan, self).__init__(model_args)
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)
        
        # Here we define the custom dependencies that will be needed by our custom Dataloader
        # In this case, we need the torch DataLoader classes
        # Since we will train on MNIST, we need datasets and transform from torchvision
        deps = ["from torchvision import datasets, transforms"]
        
        self.add_dependency(deps)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        
        
        output = F.log_softmax(x, dim=1)
        return output

    def training_data(self, batch_size = 48):
        # Custom torch Dataloader for MNIST data
        transform = transforms.Compose([transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))])
        dataset1 = datasets.MNIST(self.dataset_path, train=True, download=False, transform=transform)
        train_kwargs = {'batch_size': batch_size, 'shuffle': True}
        return DataManager(dataset=dataset1, **train_kwargs)
    
    def training_step(self, data, target):
        output = self.forward(data)
        loss   = torch.nn.functional.nll_loss(output, target)
        return loss


These arguments are, respectively:
* `model_args`: a dictionary with the arguments related to the model (e.g. number of layers, features, etc.). This will be passed to the model class on the node side.
* `training_args`: a dictionary containing the arguments for the training routine (e.g. batch size, learning rate, epochs, etc.). This will be passed to the routine on the node side.

**NOTE:** typos and/or missing required (positional) arguments will raise an error. 🤓

In [3]:
model_args = {}

training_args = {
    'batch_size': 48, 
    'lr': 1e-3, 
    'epochs': 1, 
    'dry_run': False,  
    'batch_maxnum': 100, # Fast pass for development : only use ( batch_maxnum * batch_size ) samples
    'test_ratio': .3,
    'test_on_local_updates': True, 
    'test_on_global_updates': True
}
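
As a rough illustration of how `test_ratio`, `batch_size` and `batch_maxnum` interact, the sketch below works out the sample counts for a node holding the 60,000-image MNIST training set. The result matches the `18000` testing and `42000` training samples reported in the logs of this notebook. The helper name `split_sizes` is hypothetical and not part of Fed-BioMed, and the exact rounding the library applies is an assumption.

```python
from typing import Tuple


def split_sizes(n_samples: int, test_ratio: float) -> Tuple[int, int]:
    """Return (n_train, n_test) for a simple ratio-based split (assumed rounding)."""
    n_test = round(n_samples * test_ratio)
    return n_samples - n_test, n_test


# MNIST training set size on one node, split with test_ratio = 0.3
n_train, n_test = split_sizes(60_000, 0.3)
print(n_train, n_test)

# batch_maxnum caps the number of batches processed per epoch,
# so at most batch_maxnum * batch_size samples are used for training.
batch_size, batch_maxnum = 48, 100
samples_per_epoch = min(n_train, batch_size * batch_maxnum)
print(samples_per_epoch)
```

With these settings each epoch touches only a fraction of the training split, which is why `batch_maxnum` is described as a fast pass for development.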

## Declare and run the experiment

- search nodes serving data for these `tags`, optionally filter on a list of node ID with `nodes`
- run a round of local training on the nodes with the model defined in `model_class`, followed by federation with `aggregator`
- run for `round_limit` rounds, applying the `node_selection_strategy` between the rounds
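
To give an intuition for the federation step, here is a minimal, framework-free sketch of federated averaging: each node's parameters are weighted by the number of samples it trained on and then summed. This is only an illustration under that weighting assumption. The function `federated_average` and the parameter names are hypothetical, and Fed-BioMed's `FedAverage` may differ in its details.

```python
from typing import Dict, List


def federated_average(node_params: List[Dict[str, List[float]]],
                      sample_counts: List[int]) -> Dict[str, List[float]]:
    """Average per-node parameters, weighting each node by its sample count."""
    total = sum(sample_counts)
    weights = [n / total for n in sample_counts]
    averaged = {}
    for name in node_params[0]:
        length = len(node_params[0][name])
        # Weighted sum of the i-th entry of this parameter across all nodes
        averaged[name] = [
            sum(w * params[name][i] for w, params in zip(weights, node_params))
            for i in range(length)
        ]
    return averaged


# Two hypothetical nodes with the same amount of training data
node_a = {"fc.weight": [1.0, 2.0], "fc.bias": [0.0]}
node_b = {"fc.weight": [3.0, 4.0], "fc.bias": [1.0]}
global_params = federated_average([node_a, node_b], sample_counts=[42_000, 42_000])
print(global_params)
```

When the nodes hold equal amounts of data, as here, the weighted average reduces to a plain mean of the node parameters.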

In [4]:
from fedbiomed.researcher.experiment import Experiment
from fedbiomed.researcher.aggregators.fedavg import FedAverage

tags =  ['#MNIST', '#dataset']
rounds = 2

exp = Experiment(tags=tags,
                 model_args=model_args,
                 model_class=MyTrainingPlan,
                 training_args=training_args,
                 round_limit=rounds,
                 aggregator=FedAverage(),
                 node_selection_strategy=None,
                 tensorboard=True)

2022-03-28 09:39:09,194 fedbiomed INFO - Component environment:
2022-03-28 09:39:09,196 fedbiomed INFO - type = ComponentType.RESEARCHER
2022-03-28 09:39:09,427 fedbiomed INFO - Messaging researcher_ad3c024c-fb12-4ca1-9204-0f6b9220bed8 successfully connected to the message broker, object = <fedbiomed.common.messaging.Messaging object at 0x7f622585bee0>
2022-03-28 09:39:09,438 fedbiomed INFO - Searching dataset with data tags: ['#MNIST', '#dataset'] for all nodes
2022-03-28 09:39:19,478 fedbiomed INFO - Node selected for training -> node_ad006bab-e62d-4745-948c-604a37b7f170
2022-03-28 09:39:19,480 fedbiomed INFO - Node selected for training -> node_d646c7eb-b388-4712-981d-f63fbe392c5c
2022-03-28 09:39:19,485 fedbiomed INFO - Checking data quality of federated datasets...
2022-03-28 09:39:19,521 fedbiomed DEBUG - Model file has been saved: /home/scansiz/Desktop/Inria/development/fedbiomed/var/experiments/Experiment_0009/my_model_dd3b0943-58ab-4d3d-8901-2009c35b1427.py
2022-03-28 09:39:19

In [5]:
from fedbiomed.researcher.environ import environ
tensorboard_dir = environ['TENSORBOARD_RESULTS_DIR']
%load_ext tensorboard

In [6]:
tensorboard --logdir "$tensorboard_dir"

Let's start the experiment.

By default, this function does not return until all `round_limit` rounds have completed for all the nodes.

In [7]:
exp.run()

2022-03-28 09:39:21,770 fedbiomed INFO - Sampled nodes in round 0 ['node_ad006bab-e62d-4745-948c-604a37b7f170', 'node_d646c7eb-b388-4712-981d-f63fbe392c5c']
2022-03-28 09:39:21,773 fedbiomed INFO - Sending request 
					 To: node_ad006bab-e62d-4745-948c-604a37b7f170 
					 Reqeust: : Perform training with the arguments: {'researcher_id': 'researcher_ad3c024c-fb12-4ca1-9204-0f6b9220bed8', 'job_id': '24f106d2-87de-4e9f-b1c2-cd022fb71f37', 'training_args': {'test_ratio': 0.3, 'test_on_local_updates': True, 'test_on_global_updates': True, 'test_metric': None, 'test_metric_args': {}, 'batch_size': 48, 'lr': 0.001, 'epochs': 1, 'dry_run': False, 'batch_maxnum': 100}, 'training': True, 'model_args': {}, 'command': 'train', 'model_url': 'http://localhost:8844/media/uploads/2022/03/28/my_model_dd3b0943-58ab-4d3d-8901-2009c35b1427.py', 'params_url': 'http://localhost:8844/media/uploads/2022/03/28/aggregated_params_init_a838adb0-a1cf-4e8b-a053-21d2109bb77f.pt', 'model_clas

2022-03-28 09:40:00,725 fedbiomed INFO - INFO
					 NODE node_ad006bab-e62d-4745-948c-604a37b7f170
					 MESSAGE: No `testing_step` method found in TrainingPlan: using default metric ACCURACY for model evaluation
-----------------------------------------------------------------
2022-03-28 09:40:01,533 fedbiomed INFO - INFO
					 NODE node_d646c7eb-b388-4712-981d-f63fbe392c5c
					 MESSAGE: No `testing_step` method found in TrainingPlan: using default metric ACCURACY for model evaluation
-----------------------------------------------------------------
2022-03-28 09:40:14,848 fedbiomed INFO - TESTING ON LOCAL UPDATES 
					 NODE_ID: node_ad006bab-e62d-4745-948c-604a37b7f170 
					 Completed: 18000/18000 (100%) 
 					 ACCURACY: 0.944056 
					 ---------
2022-03-28 09:40:15,023 fedbiomed INFO - TESTING ON LOCAL UPDATES 
					 NODE_ID: node_d646c7eb-b388-4712-981d-f63fbe392c5c 
					 Completed: 18000/18000 (10

2022-03-28 09:40:47,739 fedbiomed INFO - TRAINING 
					 NODE_ID: node_d646c7eb-b388-4712-981d-f63fbe392c5c 
					 Epoch: 1 | Completed: 1440/42000 (3%) 
 					 Loss: 0.368577 
					 ---------
2022-03-28 09:40:48,143 fedbiomed INFO - TRAINING 
					 NODE_ID: node_ad006bab-e62d-4745-948c-604a37b7f170 
					 Epoch: 1 | Completed: 1440/42000 (3%) 
 					 Loss: 0.229403 
					 ---------
2022-03-28 09:40:48,832 fedbiomed INFO - TRAINING 
					 NODE_ID: node_d646c7eb-b388-4712-981d-f63fbe392c5c 
					 Epoch: 1 | Completed: 1920/42000 (5%) 
 					 Loss: 0.235843 
					 ---------
2022-03-28 09:40:49,254 fedbiomed INFO - TRAINING 
					 NODE_ID: node_ad006bab-e62d-4745-948c-604a37b7f170 
					 Epoch: 1 | Completed: 1920/42000 (5%) 
 					 Loss: 0.197829 
					 ---------
2022-03-28 09:40:50,038 fedbiomed INFO - TRAINING 
					 NODE_ID: node_d646c7eb-b388-4712-981d-f63fbe392c5c 
					 Epoch: 1 | Completed: 2400/42000 (6%) 
 	

2022-03-28 09:41:49,597 fedbiomed INFO - TESTING ON GLOBAL UPDATES 
					 NODE_ID: node_d646c7eb-b388-4712-981d-f63fbe392c5c 
					 Completed: 18000/18000 (100%) 
 					 ACCURACY: 0.970444 
					 ---------


2

Local training results for each round and each node are available via `exp.training_replies()` (indexed from 0 to `rounds - 1`).

For example, the training results for the last round are stored at index `rounds - 1`.

Several timings (in seconds) are reported for each dataset of a node participating in a round:
- `rtime_training` real time (clock time) spent in the training function on the node
- `ptime_training` process time (user and system CPU) spent in the training function on the node
- `rtime_total` real time (clock time) spent in the researcher between sending the request and handling the response, at the `Job()` layer
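
The difference between real (clock) time and process (CPU) time can be illustrated with Python's standard `time` module. The sketch below is not Fed-BioMed code: the helper `timed` and the keys `rtime` and `ptime` are hypothetical names chosen to mirror the fields above, and it assumes the node measures these quantities in a comparable way.

```python
import time


def timed(fn):
    """Run fn() and report real (wall-clock) and process (CPU) time in seconds."""
    real_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # user + system CPU time of this process
    result = fn()
    timing = {
        "rtime": time.perf_counter() - real_start,
        "ptime": time.process_time() - cpu_start,
    }
    return result, timing


def mostly_waiting():
    # Sleeping consumes clock time but almost no CPU time
    time.sleep(0.2)
    return sum(range(10_000))


result, timing = timed(mostly_waiting)
print(timing)
```

A task that mostly waits (on I/O, for instance) shows a real time well above its process time, whereas a compute-bound task running on a single core shows the two values close together.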

Federated parameters for each round are available via `exp.aggregated_params()` (indexed from 0 to `rounds - 1`).

For example, you can view the federated parameters for the last round of the experiment:

In [None]:
print("\nList the training rounds : ", exp.aggregated_params().keys())

print("\nAccess the federated params for the last training round :")
print("\t- params_path: ", exp.aggregated_params()[rounds - 1]['params_path'])
print("\t- parameter data: ", exp.aggregated_params()[rounds - 1]['params'].keys())


Feel free to run other sample notebooks or try your own models :D

## Testing using your own testing metric

In [8]:
import torch
import torch.nn as nn
import torch.nn.functional as F  # used as `F` in `forward` below
from fedbiomed.common.training_plans import TorchTrainingPlan
from fedbiomed.common.data import DataManager
from torchvision import datasets, transforms

# Here we define the training plan, which contains the model to be used.
# You can use any class name (here 'MyTrainingPlanCM')
class MyTrainingPlanCM(TorchTrainingPlan):
    def __init__(self, model_args: dict = {}):
        super(MyTrainingPlanCM, self).__init__(model_args)
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)
        
        # Here we define the custom dependencies that will be needed by our custom Dataloader
        # In this case, we need the torch DataLoader classes
        # Since we will train on MNIST, we need datasets and transform from torchvision
        deps = ["from torchvision import datasets, transforms"]
        
        self.add_dependency(deps)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        
        
        output = F.log_softmax(x, dim=1)
        return output

    def training_data(self, batch_size = 48):
        # Custom torch Dataloader for MNIST data
        transform = transforms.Compose([transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))])
        dataset1 = datasets.MNIST(self.dataset_path, train=True, download=False, transform=transform)
        train_kwargs = {'batch_size': batch_size, 'shuffle': True}
        return DataManager(dataset=dataset1, **train_kwargs)
    
    def training_step(self, data, target):
        output = self.forward(data)
        loss   = torch.nn.functional.nll_loss(output, target)
        return loss

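    def testing_step(self, data, target):
        # Custom testing step: return a dictionary of named metric values.
        # Both entries use the same NLL loss on two forward passes
        # (`self.forward(data)` and `self(data)`) to illustrate reporting several metrics.
        output = self.forward(data)
        loss1   = torch.nn.functional.nll_loss(output, target)
        output = self(data)
        loss2   = torch.nn.functional.nll_loss(output, target)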
        return {"Loss_1": loss1, "Loss_2": loss2}
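
The `testing_step` above returns a dictionary that maps metric names to values. As a framework-free illustration of what such a dictionary could contain, the sketch below computes a negative log-likelihood and an accuracy from per-class log-probabilities. The function name `nll_and_accuracy` and the metric names are hypothetical; inside a real training plan the inputs would be torch tensors rather than Python lists.

```python
import math
from typing import Dict, List


def nll_and_accuracy(log_probs: List[List[float]], targets: List[int]) -> Dict[str, float]:
    """Return mean negative log-likelihood and accuracy as named metrics."""
    n = len(targets)
    # Mean of -log p(target) over the batch
    nll = -sum(row[t] for row, t in zip(log_probs, targets)) / n
    # A prediction is correct when the most likely class equals the target
    correct = sum(1 for row, t in zip(log_probs, targets) if row.index(max(row)) == t)
    return {"NLL": nll, "Accuracy": correct / n}


# Three samples, three classes; the third sample is misclassified
log_probs = [
    [math.log(0.7), math.log(0.2), math.log(0.1)],
    [math.log(0.1), math.log(0.6), math.log(0.3)],
    [math.log(0.5), math.log(0.3), math.log(0.2)],
]
targets = [0, 1, 2]
metrics = nll_and_accuracy(log_probs, targets)
print(metrics)
```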

In [9]:
model_args = {}

training_args = {
    'batch_size': 48, 
    'lr': 1e-3, 
    'epochs': 1, 
    'dry_run': False,  
    'batch_maxnum': 100, # Fast pass for development : only use ( batch_maxnum * batch_size ) samples
    'test_ratio': .3,
    'test_on_local_updates': True, 
    'test_on_global_updates': True
}

In [13]:
from fedbiomed.researcher.experiment import Experiment
from fedbiomed.researcher.aggregators.fedavg import FedAverage

tags =  ['#MNIST', '#dataset']
rounds = 2

exp = Experiment(tags=tags,
                 model_args=model_args,
                 model_class=MyTrainingPlanCM,
                 training_args=training_args,
                 round_limit=rounds,
                 aggregator=FedAverage(),
                 node_selection_strategy=None, 
                 tensorboard=True)

2022-03-28 09:47:00,326 fedbiomed INFO - Searching dataset with data tags: ['#MNIST', '#dataset'] for all nodes
2022-03-28 09:47:10,365 fedbiomed INFO - Node selected for training -> node_d646c7eb-b388-4712-981d-f63fbe392c5c
2022-03-28 09:47:10,367 fedbiomed INFO - Node selected for training -> node_ad006bab-e62d-4745-948c-604a37b7f170
2022-03-28 09:47:10,400 fedbiomed INFO - Checking data quality of federated datasets...
2022-03-28 09:47:10,453 fedbiomed DEBUG - Model file has been saved: /home/scansiz/Desktop/Inria/development/fedbiomed/var/experiments/Experiment_0011/my_model_1f3f9359-ff09-45b8-b063-d66a13aaa694.py
2022-03-28 09:47:10,497 fedbiomed DEBUG - upload (HTTP POST request) of file /home/scansiz/Desktop/Inria/development/fedbiomed/var/experiments/Experiment_0011/my_model_1f3f9359-ff09-45b8-b063-d66a13aaa694.py successful, with status code 201
2022-03-28 09:47:10,709 fedbiomed DEBUG - upload (HTTP POST request) of file /home/scansiz/Desktop/Inria/development/fedbiomed/var/ex

In [12]:
tensorboard --logdir "$tensorboard_dir"

Reusing TensorBoard on port 6006 (pid 8470), started 0:07:19 ago. (Use '!kill 8470' to kill it.)

In [14]:
exp.run()

2022-03-28 09:50:22,199 fedbiomed INFO - Sampled nodes in round 0 ['node_d646c7eb-b388-4712-981d-f63fbe392c5c', 'node_ad006bab-e62d-4745-948c-604a37b7f170']
2022-03-28 09:50:22,200 fedbiomed INFO - Sending request 
					 To: node_d646c7eb-b388-4712-981d-f63fbe392c5c 
					 Reqeust: : Perform training with the arguments: {'researcher_id': 'researcher_ad3c024c-fb12-4ca1-9204-0f6b9220bed8', 'job_id': '4a3c1174-40e2-4d5a-b3b9-7efeb7462169', 'training_args': {'test_ratio': 0.3, 'test_on_local_updates': True, 'test_on_global_updates': True, 'test_metric': None, 'test_metric_args': {}, 'batch_size': 48, 'lr': 0.001, 'epochs': 1, 'dry_run': False, 'batch_maxnum': 100}, 'training': True, 'model_args': {}, 'command': 'train', 'model_url': 'http://localhost:8844/media/uploads/2022/03/28/my_model_1f3f9359-ff09-45b8-b063-d66a13aaa694.py', 'params_url': 'http://localhost:8844/media/uploads/2022/03/28/aggregated_params_init_54783496-0e2f-4c4d-81f9-70d84676df14.pt', 'model_clas

2022-03-28 09:51:47,717 fedbiomed INFO - INFO
					 NODE node_d646c7eb-b388-4712-981d-f63fbe392c5c
					 MESSAGE: results uploaded successfully 
-----------------------------------------------------------------
2022-03-28 09:51:52,733 fedbiomed INFO - Downloading model params after training on node_ad006bab-e62d-4745-948c-604a37b7f170 - from http://localhost:8844/media/uploads/2022/03/28/node_params_c6f4ff17-712b-4cdb-81cb-36db9a1f30a7.pt
2022-03-28 09:51:52,802 fedbiomed DEBUG - upload (HTTP GET request) of file node_params_b08f0666-1be0-4de5-986d-b7117839129a.pt successful, with status code 200
2022-03-28 09:51:52,882 fedbiomed INFO - Downloading model params after training on node_d646c7eb-b388-4712-981d-f63fbe392c5c - from http://localhost:8844/media/uploads/2022/03/28/node_params_dd662436-8ad8-45fc-bbf1-98014e9d0772.pt
2022-03-28 09:51:52,919 fedbiomed DEBUG - upload (HTTP GET request) of file node_params_1e6744bc-7e5c-4699-8839-d30bc3a9d3ff.pt successf

2022-03-28 09:52:29,944 fedbiomed INFO - TRAINING 
					 NODE_ID: node_d646c7eb-b388-4712-981d-f63fbe392c5c 
					 Epoch: 1 | Completed: 3360/42000 (8%) 
 					 Loss: 0.126965 
					 ---------
2022-03-28 09:52:30,766 fedbiomed INFO - TRAINING 
					 NODE_ID: node_d646c7eb-b388-4712-981d-f63fbe392c5c 
					 Epoch: 1 | Completed: 3840/42000 (9%) 
 					 Loss: 0.164559 
					 ---------
2022-03-28 09:52:30,825 fedbiomed INFO - TRAINING 
					 NODE_ID: node_ad006bab-e62d-4745-948c-604a37b7f170 
					 Epoch: 1 | Completed: 3840/42000 (9%) 
 					 Loss: 0.229700 
					 ---------
2022-03-28 09:52:31,606 fedbiomed INFO - TRAINING 
					 NODE_ID: node_d646c7eb-b388-4712-981d-f63fbe392c5c 
					 Epoch: 1 | Completed: 4320/42000 (10%) 
 					 Loss: 0.242482 
					 ---------
2022-03-28 09:52:31,690 fedbiomed INFO - TRAINING 
					 NODE_ID: node_ad006bab-e62d-4745-948c-604a37b7f170 
					 Epoch: 1 | Completed: 4320/42000 (10%) 


2

2022-03-28 10:27:51,947 fedbiomed INFO - CRITICAL
					 NODE node_ad006bab-e62d-4745-948c-604a37b7f170
					 MESSAGE: Node stopped in signal_handler, probably by user decision (Ctrl C)
-----------------------------------------------------------------
