# Fed-BioMed Researcher - Saving and Loading breakpoints

## Setting the node up
Before running the experiment, a node must be configured:
1. `./scripts/fedbiomed_run node dataset add`
  * Select option 2 (default) to add MNIST to the node
  * Confirm default tags by hitting "y" and ENTER
  * Pick the folder where MNIST is downloaded (this is due to torch issue https://github.com/pytorch/vision/issues/3549)
  * Data must have been added (if you get a warning saying that data must be unique, it is because it has already been added)
  
2. Check that your data has been added by executing `./scripts/fedbiomed_run node dataset list`
3. Run the node using `./scripts/fedbiomed_run node start`. Wait until you get `Starting task manager`. It means you are online.

## Create an experiment to train a model on the data found

Declare a torch training plan class, `MyTrainingPlan`, to send to the node for training.

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F  # used as `F` in Net.forward below
from fedbiomed.common.training_plans import TorchTrainingPlan
from fedbiomed.common.data import DataManager
from torchvision import datasets, transforms

# Here we define the training plan. 
class MyTrainingPlan(TorchTrainingPlan):
    
    # Defines and returns the model
    def init_model(self, model_args):
        return self.Net(model_args = model_args)
    
    # Defines and returns the optimizer
    def init_optimizer(self, optimizer_args):
        return torch.optim.Adam(self.model().parameters(), lr = optimizer_args["lr"])
    
    # Declares and returns the dependencies
    def init_dependencies(self):
        deps = ["from torchvision import datasets, transforms"]
        return deps
    
    class Net(nn.Module):
        def __init__(self, model_args):
            super().__init__()
            self.conv1 = nn.Conv2d(1, 32, 3, 1)
            self.conv2 = nn.Conv2d(32, 64, 3, 1)
            self.dropout1 = nn.Dropout(0.25)
            self.dropout2 = nn.Dropout(0.5)
            self.fc1 = nn.Linear(9216, 128)
            self.fc2 = nn.Linear(128, 10)

        def forward(self, x):
            x = self.conv1(x)
            x = F.relu(x)
            x = self.conv2(x)
            x = F.relu(x)
            x = F.max_pool2d(x, 2)
            x = self.dropout1(x)
            x = torch.flatten(x, 1)
            x = self.fc1(x)
            x = F.relu(x)
            x = self.dropout2(x)
            x = self.fc2(x)


            output = F.log_softmax(x, dim=1)
            return output

    def training_data(self):
        # Custom torch Dataloader for MNIST data
        transform = transforms.Compose([transforms.ToTensor(),
                                        transforms.Normalize((0.1307,), (0.3081,))])
        dataset1 = datasets.MNIST(self.dataset_path, train=True, download=False, transform=transform)
        train_kwargs = { 'shuffle': True}
        return DataManager(dataset=dataset1, **train_kwargs)
    
    def training_step(self, data, target):
        output = self.model().forward(data)
        loss   = torch.nn.functional.nll_loss(output, target)
        return loss


The two groups of arguments below correspond to:
* `model_args`: a dictionary with the arguments related to the model (e.g. number of layers, features, etc.). This will be passed to the model class on the node side.
* `training_args`: a dictionary containing the arguments for the training routine (e.g. batch size, learning rate, epochs, etc.). This will be passed to the routine on the node side.

**NOTE:** typos and/or missing required (positional) arguments will raise an error. 🤓

In [2]:
model_args = {}

training_args = {
    'loader_args': { 'batch_size': 48, }, 
    'optimizer_args': {
        "lr" : 1e-3
    },
    'epochs': 1, 
    'dry_run': False,  
    'batch_maxnum': 100 # Fast pass for development : only use ( batch_maxnum * batch_size ) samples
}
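
As a quick sanity check on the `batch_maxnum` comment, the number of samples each node processes per epoch in this fast pass can be worked out directly. The sketch below only uses the two values from `training_args`; the variable names are illustrative.

```python
# Values copied from training_args above.
batch_size = 48
batch_maxnum = 100

# Upper bound on the samples seen per epoch on each node
# (fewer if the node's dataset is smaller than this).
samples_per_epoch = batch_maxnum * batch_size
print(samples_per_epoch)  # 4800
```

This agrees with the `Samples: .../4800` totals reported in the training logs of this notebook.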

Define an experiment with saved breakpoints:
- search nodes serving data for these `tags`, optionally filter on a list of node IDs with `nodes`
- run a round of local training on nodes with the model defined in `training_plan_class`, then federate with `aggregator`
- run for `round_limit` rounds, applying the `node_selection_strategy` between the rounds
- specify `save_breakpoints` to save a breakpoint at the end of each round.

Let's call `${FEDBIOMED_DIR}` the base directory where you cloned Fed-BioMed.
By default, breakpoints are saved under the `Experiment_xxxx` folder, at `${FEDBIOMED_DIR}/var/experiments/Experiment_xxxx/breakpoint_yyyy`.
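
To see which breakpoints exist on disk, the experiments folder can be inspected with the standard library. This is a minimal sketch, not part of the Fed-BioMed API: `list_breakpoints` is a hypothetical helper, it assumes folder names follow the `Experiment_NNNN` / `breakpoint_NNNN` pattern seen in the logs, and it assumes `FEDBIOMED_DIR` is exported as an environment variable.

```python
import os
from pathlib import Path


def list_breakpoints(experiment_dir):
    """Return the breakpoint folders of one experiment folder, sorted by name."""
    experiment_dir = Path(experiment_dir)
    return sorted(p for p in experiment_dir.glob("breakpoint_*") if p.is_dir())


# Assumption: FEDBIOMED_DIR is set in the environment; fall back to the current directory.
fedbiomed_dir = Path(os.environ.get("FEDBIOMED_DIR", "."))
experiments_dir = fedbiomed_dir / "var" / "experiments"

if experiments_dir.is_dir():
    for exp_dir in sorted(experiments_dir.glob("Experiment_*")):
        print(exp_dir.name, [bp.name for bp in list_breakpoints(exp_dir)])
```

Because the names are zero-padded, sorting by name also sorts by round, so the last entry is the most recent breakpoint of that experiment.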

In [4]:
from fedbiomed.researcher.federated_workflows import Experiment
from fedbiomed.researcher.aggregators.fedavg import FedAverage

tags =  ['#MNIST', '#dataset']
rounds = 2

exp = Experiment(tags=tags,
                 model_args=model_args,
                 training_plan_class=MyTrainingPlan,
                 training_args=training_args,
                 round_limit=rounds,
                 aggregator=FedAverage(),
                 node_selection_strategy=None,
                 save_breakpoints=True)

2024-04-02 10:53:39,775 fedbiomed INFO - Starting researcher service...

2024-04-02 10:53:39,828 fedbiomed INFO - Waiting 3s for nodes to connect...

2024-04-02 10:53:41,656 fedbiomed DEBUG - Node: NODE_90a7e795-9022-4bf0-b6c6-f1a2394b6dac polling for the tasks

2024-04-02 10:53:41,726 fedbiomed DEBUG - Node: NODE_a16525b2-5b6a-42cf-9cef-ff746b2a6d4b polling for the tasks

2024-04-02 10:53:42,832 fedbiomed INFO - Updating training data. This action will update FederatedDataset, and the nodes that will participate to the experiment.

2024-04-02 10:53:42,841 fedbiomed DEBUG - Node: NODE_90a7e795-9022-4bf0-b6c6-f1a2394b6dac polling for the tasks

2024-04-02 10:53:42,844 fedbiomed DEBUG - Node: NODE_a16525b2-5b6a-42cf-9cef-ff746b2a6d4b polling for the tasks

2024-04-02 10:53:42,849 fedbiomed INFO - Node selected for training -> NODE_90a7e795-9022-4bf0-b6c6-f1a2394b6dac

2024-04-02 10:53:42,850 fedbiomed INFO - Node selected for training -> NODE_a16525b2-5b6a-42cf-9cef-ff746b2a6d4b

2024-04-02 10:53:42,852 fedbiomed DEBUG - Model file has been saved: /home/ybouilla/Documents/github/fedbiomed/var/experiments/Experiment_0000/model_74506e8c-c5b1-4903-97e7-21d8c6f3ab18.py

Secure RNG turned off. This is perfectly fine for experimentation as it allows for much faster training performance, but remember to turn it on and retrain one last time before production with ``secure_mode`` turned on.


You can interrupt `exp.run()` after one round, then reload the breakpoint and continue the training.

In [5]:
exp.run()

2024-04-02 10:53:53,065 fedbiomed INFO - Sampled nodes in round 0 ['NODE_90a7e795-9022-4bf0-b6c6-f1a2394b6dac', 'NODE_a16525b2-5b6a-42cf-9cef-ff746b2a6d4b']

2024-04-02 10:53:53,070 fedbiomed INFO - Sending request
					 To: NODE_90a7e795-9022-4bf0-b6c6-f1a2394b6dac
					 Request: : TRAIN
 -----------------------------------------------------------------

2024-04-02 10:53:53,071 fedbiomed INFO - Sending request
					 To: NODE_a16525b2-5b6a-42cf-9cef-ff746b2a6d4b
					 Request: : TRAIN
 -----------------------------------------------------------------

2024-04-02 10:53:53,151 fedbiomed DEBUG - Node: NODE_90a7e795-9022-4bf0-b6c6-f1a2394b6dac polling for the tasks

2024-04-02 10:53:53,155 fedbiomed DEBUG - Node: NODE_a16525b2-5b6a-42cf-9cef-ff746b2a6d4b polling for the tasks

2024-04-02 10:53:55,835 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_a16525b2-5b6a-42cf-9cef-ff746b2a6d4b
					 Round 1 Epoch: 1 | Iteration: 1/100 (1%) | Samples: 48/4800
 					 Loss: 2.313489
					 ---------

2024-04-02 10:53:56,048 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_90a7e795-9022-4bf0-b6c6-f1a2394b6dac
					 Round 1 Epoch: 1 | Iteration: 1/100 (1%) | Samples: 48/4800
 					 Loss: 2.277154
					 ---------

2024-04-02 10:54:04,017 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_a16525b2-5b6a-42cf-9cef-ff746b2a6d4b
					 Round 1 Epoch: 1 | Iteration: 10/100 (10%) | Samples: 480/4800
 					 Loss: 1.411112
					 ---------

2024-04-02 10:54:05,438 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_90a7e795-9022-4bf0-b6c6-f1a2394b6dac
					 Round 1 Epoch: 1 | Iteration: 10/100 (10%) | Samples: 480/4800
 					 Loss: 1.647678
					 ---------

2024-04-02 10:54:12,140 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_90a7e795-9022-4bf0-b6c6-f1a2394b6dac
					 Round 1 Epoch: 1 | Iteration: 20/100 (20%) | Samples: 960/4800
 					 Loss: 0.812420
					 ---------

2024-04-02 10:54:14,721 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_a16525b2-5b6a-42cf-9cef-ff746b2a6d4b
					 Round 1 Epoch: 1 | Iteration: 20/100 (20%) | Samples: 960/4800
 					 Loss: 0.609296
					 ---------

2024-04-02 10:54:21,087 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_90a7e795-9022-4bf0-b6c6-f1a2394b6dac
					 Round 1 Epoch: 1 | Iteration: 30/100 (30%) | Samples: 1440/4800
 					 Loss: 0.438182
					 ---------

2024-04-02 10:54:24,003 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_a16525b2-5b6a-42cf-9cef-ff746b2a6d4b
					 Round 1 Epoch: 1 | Iteration: 30/100 (30%) | Samples: 1440/4800
 					 Loss: 0.577196
					 ---------

2024-04-02 10:54:30,970 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_90a7e795-9022-4bf0-b6c6-f1a2394b6dac
					 Round 1 Epoch: 1 | Iteration: 40/100 (40%) | Samples: 1920/4800
 					 Loss: 0.571755
					 ---------

2024-04-02 10:54:33,524 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_a16525b2-5b6a-42cf-9cef-ff746b2a6d4b
					 Round 1 Epoch: 1 | Iteration: 40/100 (40%) | Samples: 1920/4800
 					 Loss: 0.595199
					 ---------

2024-04-02 10:54:39,410 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_90a7e795-9022-4bf0-b6c6-f1a2394b6dac
					 Round 1 Epoch: 1 | Iteration: 50/100 (50%) | Samples: 2400/4800
 					 Loss: 0.490290
					 ---------

2024-04-02 10:54:42,761 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_a16525b2-5b6a-42cf-9cef-ff746b2a6d4b
					 Round 1 Epoch: 1 | Iteration: 50/100 (50%) | Samples: 2400/4800
 					 Loss: 0.732443
					 ---------

2024-04-02 10:54:47,887 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_90a7e795-9022-4bf0-b6c6-f1a2394b6dac
					 Round 1 Epoch: 1 | Iteration: 60/100 (60%) | Samples: 2880/4800
 					 Loss: 0.476133
					 ---------

2024-04-02 10:54:49,340 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_a16525b2-5b6a-42cf-9cef-ff746b2a6d4b
					 Round 1 Epoch: 1 | Iteration: 60/100 (60%) | Samples: 2880/4800
 					 Loss: 0.451733
					 ---------

2024-04-02 10:54:54,714 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_a16525b2-5b6a-42cf-9cef-ff746b2a6d4b
					 Round 1 Epoch: 1 | Iteration: 70/100 (70%) | Samples: 3360/4800
 					 Loss: 0.451611
					 ---------

2024-04-02 10:54:58,613 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_90a7e795-9022-4bf0-b6c6-f1a2394b6dac
					 Round 1 Epoch: 1 | Iteration: 70/100 (70%) | Samples: 3360/4800
 					 Loss: 0.241874
					 ---------

2024-04-02 10:55:02,743 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_a16525b2-5b6a-42cf-9cef-ff746b2a6d4b
					 Round 1 Epoch: 1 | Iteration: 80/100 (80%) | Samples: 3840/4800
 					 Loss: 0.449304
					 ---------

2024-04-02 10:55:08,445 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_90a7e795-9022-4bf0-b6c6-f1a2394b6dac
					 Round 1 Epoch: 1 | Iteration: 80/100 (80%) | Samples: 3840/4800
 					 Loss: 0.687822
					 ---------

2024-04-02 10:55:11,233 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_a16525b2-5b6a-42cf-9cef-ff746b2a6d4b
					 Round 1 Epoch: 1 | Iteration: 90/100 (90%) | Samples: 4320/4800
 					 Loss: 0.313533
					 ---------

2024-04-02 10:55:18,247 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_90a7e795-9022-4bf0-b6c6-f1a2394b6dac
					 Round 1 Epoch: 1 | Iteration: 90/100 (90%) | Samples: 4320/4800
 					 Loss: 0.460418
					 ---------

2024-04-02 10:55:18,645 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_a16525b2-5b6a-42cf-9cef-ff746b2a6d4b
					 Round 1 Epoch: 1 | Iteration: 100/100 (100%) | Samples: 4800/4800
 					 Loss: 0.514728
					 ---------

2024-04-02 10:55:18,944 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_90a7e795-9022-4bf0-b6c6-f1a2394b6dac
					 Round 1 Epoch: 1 | Iteration: 100/100 (100%) | Samples: 4800/4800
 					 Loss: 0.202897
					 ---------

2024-04-02 10:55:18,985 fedbiomed INFO - Nodes that successfully reply in round 0 ['NODE_90a7e795-9022-4bf0-b6c6-f1a2394b6dac', 'NODE_a16525b2-5b6a-42cf-9cef-ff746b2a6d4b']

2024-04-02 10:55:19,040 fedbiomed DEBUG - Model file has been saved: /home/ybouilla/Documents/github/fedbiomed/var/experiments/Experiment_0000/model_96d442b8-e0d6-49da-a451-5cc9594ae3b1.py

2024-04-02 10:55:19,078 fedbiomed INFO - breakpoint number 0 saved at /home/ybouilla/Documents/github/fedbiomed/var/experiments/Experiment_0000/breakpoint_0000

2024-04-02 10:55:19,084 fedbiomed INFO - Sampled nodes in round 1 ['NODE_90a7e795-9022-4bf0-b6c6-f1a2394b6dac', 'NODE_a16525b2-5b6a-42cf-9cef-ff746b2a6d4b']

2024-04-02 10:55:19,091 fedbiomed INFO - Sending request
					 To: NODE_90a7e795-9022-4bf0-b6c6-f1a2394b6dac
					 Request: : TRAIN
 -----------------------------------------------------------------

2024-04-02 10:55:19,093 fedbiomed INFO - Sending request
					 To: NODE_a16525b2-5b6a-42cf-9cef-ff746b2a6d4b
					 Request: : TRAIN
 -----------------------------------------------------------------

2024-04-02 10:55:19,137 fedbiomed DEBUG - Node: NODE_90a7e795-9022-4bf0-b6c6-f1a2394b6dac polling for the tasks

2024-04-02 10:55:19,139 fedbiomed DEBUG - Node: NODE_a16525b2-5b6a-42cf-9cef-ff746b2a6d4b polling for the tasks

2024-04-02 10:55:20,164 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_a16525b2-5b6a-42cf-9cef-ff746b2a6d4b
					 Round 2 Epoch: 1 | Iteration: 1/100 (1%) | Samples: 48/4800
 					 Loss: 0.337326
					 ---------

2024-04-02 10:55:20,520 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_90a7e795-9022-4bf0-b6c6-f1a2394b6dac
					 Round 2 Epoch: 1 | Iteration: 1/100 (1%) | Samples: 48/4800
 					 Loss: 0.452600
					 ---------

2024-04-02 10:55:28,317 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_90a7e795-9022-4bf0-b6c6-f1a2394b6dac
					 Round 2 Epoch: 1 | Iteration: 10/100 (10%) | Samples: 480/4800
 					 Loss: 0.260661
					 ---------

2024-04-02 10:55:29,379 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_a16525b2-5b6a-42cf-9cef-ff746b2a6d4b
					 Round 2 Epoch: 1 | Iteration: 10/100 (10%) | Samples: 480/4800
 					 Loss: 0.213722
					 ---------

2024-04-02 10:55:33,319 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_90a7e795-9022-4bf0-b6c6-f1a2394b6dac
					 Round 2 Epoch: 1 | Iteration: 20/100 (20%) | Samples: 960/4800
 					 Loss: 0.416366
					 ---------

2024-04-02 10:55:36,059 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_a16525b2-5b6a-42cf-9cef-ff746b2a6d4b
					 Round 2 Epoch: 1 | Iteration: 20/100 (20%) | Samples: 960/4800
 					 Loss: 0.234291
					 ---------

2024-04-02 10:55:43,333 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_90a7e795-9022-4bf0-b6c6-f1a2394b6dac
					 Round 2 Epoch: 1 | Iteration: 30/100 (30%) | Samples: 1440/4800
 					 Loss: 0.584430
					 ---------

2024-04-02 10:55:46,038 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_a16525b2-5b6a-42cf-9cef-ff746b2a6d4b
					 Round 2 Epoch: 1 | Iteration: 30/100 (30%) | Samples: 1440/4800
 					 Loss: 0.255125
					 ---------

2024-04-02 10:55:53,290 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_a16525b2-5b6a-42cf-9cef-ff746b2a6d4b
					 Round 2 Epoch: 1 | Iteration: 40/100 (40%) | Samples: 1920/4800
 					 Loss: 0.289334
					 ---------

2024-04-02 10:55:53,801 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_90a7e795-9022-4bf0-b6c6-f1a2394b6dac
					 Round 2 Epoch: 1 | Iteration: 40/100 (40%) | Samples: 1920/4800
 					 Loss: 0.194493
					 ---------

2024-04-02 10:56:02,122 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_a16525b2-5b6a-42cf-9cef-ff746b2a6d4b
					 Round 2 Epoch: 1 | Iteration: 50/100 (50%) | Samples: 2400/4800
 					 Loss: 0.105945
					 ---------

2024-04-02 10:56:02,330 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_90a7e795-9022-4bf0-b6c6-f1a2394b6dac
					 Round 2 Epoch: 1 | Iteration: 50/100 (50%) | Samples: 2400/4800
 					 Loss: 0.170798
					 ---------

2024-04-02 10:56:07,903 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_90a7e795-9022-4bf0-b6c6-f1a2394b6dac
					 Round 2 Epoch: 1 | Iteration: 60/100 (60%) | Samples: 2880/4800
 					 Loss: 0.235368
					 ---------

2024-04-02 10:56:11,915 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_a16525b2-5b6a-42cf-9cef-ff746b2a6d4b
					 Round 2 Epoch: 1 | Iteration: 60/100 (60%) | Samples: 2880/4800
 					 Loss: 0.221285
					 ---------

2024-04-02 10:56:15,715 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_90a7e795-9022-4bf0-b6c6-f1a2394b6dac
					 Round 2 Epoch: 1 | Iteration: 70/100 (70%) | Samples: 3360/4800
 					 Loss: 0.248518
					 ---------

2024-04-02 10:56:22,244 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_a16525b2-5b6a-42cf-9cef-ff746b2a6d4b
					 Round 2 Epoch: 1 | Iteration: 70/100 (70%) | Samples: 3360/4800
 					 Loss: 0.288931
					 ---------

2024-04-02 10:56:24,731 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_90a7e795-9022-4bf0-b6c6-f1a2394b6dac
					 Round 2 Epoch: 1 | Iteration: 80/100 (80%) | Samples: 3840/4800
 					 Loss: 0.138288
					 ---------

2024-04-02 10:56:32,201 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_a16525b2-5b6a-42cf-9cef-ff746b2a6d4b
					 Round 2 Epoch: 1 | Iteration: 80/100 (80%) | Samples: 3840/4800
 					 Loss: 0.240958
					 ---------

2024-04-02 10:56:34,333 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_90a7e795-9022-4bf0-b6c6-f1a2394b6dac
					 Round 2 Epoch: 1 | Iteration: 90/100 (90%) | Samples: 4320/4800
 					 Loss: 0.185055
					 ---------

2024-04-02 10:56:41,348 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_a16525b2-5b6a-42cf-9cef-ff746b2a6d4b
					 Round 2 Epoch: 1 | Iteration: 90/100 (90%) | Samples: 4320/4800
 					 Loss: 0.200031
					 ---------

2024-04-02 10:56:42,587 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_90a7e795-9022-4bf0-b6c6-f1a2394b6dac
					 Round 2 Epoch: 1 | Iteration: 100/100 (100%) | Samples: 4800/4800
 					 Loss: 0.065473
					 ---------

2024-04-02 10:56:42,882 fedbiomed INFO - TRAINING
					 NODE_ID: NODE_a16525b2-5b6a-42cf-9cef-ff746b2a6d4b
					 Round 2 Epoch: 1 | Iteration: 100/100 (100%) | Samples: 4800/4800
 					 Loss: 0.241057
					 ---------

2024-04-02 10:56:42,929 fedbiomed INFO - Nodes that successfully reply in round 1 ['NODE_90a7e795-9022-4bf0-b6c6-f1a2394b6dac', 'NODE_a16525b2-5b6a-42cf-9cef-ff746b2a6d4b']

2024-04-02 10:56:42,989 fedbiomed DEBUG - Model file has been saved: /home/ybouilla/Documents/github/fedbiomed/var/experiments/Experiment_0000/model_b3a4b01c-3fe5-4caa-b767-0b62e21e6918.py

2024-04-02 10:56:43,016 fedbiomed INFO - breakpoint number 1 saved at /home/ybouilla/Documents/github/fedbiomed/var/experiments/Experiment_0000/breakpoint_0001

2

Save trained model to file

In [None]:
exp.training_plan().export_model('./trained_model')

## Delete experiment

Here we simulate the loss of the ongoing experiment by deleting it.
Fret not! We have saved breakpoints, so we can retrieve the parameters
of the experiment using the `load_breakpoint` method.

In [None]:
del exp

## Resume an experiment

While the experiment is running, you can shut it down (after the first round) and resume it from the next cell, or wait for the experiment to complete.


**To load the latest breakpoint of the latest experiment**

Run `Experiment.load_breakpoint()`. It reloads the latest breakpoint and bypasses the `search` method.

Then use the `.run` method as you would with an existing experiment.

**To load a specific breakpoint**, specify the breakpoint folder.

- absolute path: use `Experiment.load_breakpoint("${FEDBIOMED_DIR}/var/experiments/Experiment_xxxx/breakpoint_yyyy")`. Replace `xxxx` and `yyyy` by the real values.
- relative path from a notebook: a notebook runs from the `${FEDBIOMED_DIR}/notebooks` directory, so use `Experiment.load_breakpoint("../var/experiments/Experiment_xxxx/breakpoint_yyyy")`. Replace `xxxx` and `yyyy` by the real values.
- relative path from a script: if launching the script from the `${FEDBIOMED_DIR}` directory (e.g. `python ./notebooks/general-breakpoint-save-resume.py`), then use a path relative to the current directory, e.g. `Experiment.load_breakpoint("./var/experiments/Experiment_xxxx/breakpoint_yyyy")`
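
Since the right relative path depends on where the code is launched from, one option is to always build an absolute path. The sketch below is illustrative only: `breakpoint_path` is a hypothetical helper (not part of Fed-BioMed), it assumes `FEDBIOMED_DIR` is exported as an environment variable, and the folder names are placeholders taken from the logs above.

```python
import os
from pathlib import Path


def breakpoint_path(fedbiomed_dir, experiment_name, breakpoint_name):
    """Build the absolute path of a breakpoint folder from its components."""
    path = Path(fedbiomed_dir) / "var" / "experiments" / experiment_name / breakpoint_name
    return path.resolve()


# Assumption: FEDBIOMED_DIR is set; replace the folder names with your real values.
path = breakpoint_path(os.environ.get("FEDBIOMED_DIR", "."),
                       "Experiment_0000", "breakpoint_0000")
print(path)
# The resulting string could then be passed as Experiment.load_breakpoint(str(path))
```

An absolute path gives the same result from a notebook or from a script, whatever the current working directory is.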

In [None]:
from fedbiomed.researcher.federated_workflows import Experiment

loaded_exp = Experiment.load_breakpoint()

In [None]:
print(f'Experimentation folder: {loaded_exp.experimentation_folder()}')
print(f'Loaded experiment path: {loaded_exp.experimentation_path()}')

Continue training for the experiment loaded from the breakpoint. If you ran all the rounds and loaded the last breakpoint, there won't be any more rounds to run.

In [None]:
loaded_exp.run(rounds=3, increase=True)

Save trained model to file

In [None]:
loaded_exp.training_plan().export_model('./trained_model')

In [None]:
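exp = loaded_exp
print("______________ loaded training replies_________________")
print("\nList the training rounds : ", exp.training_replies().keys())

print("\nList the nodes for the last training round and their timings : ")
# Use the highest round index available instead of `rounds - 1`:
# more rounds may have been run after loading the breakpoint
last_round = max(exp.training_replies().keys())
round_data = exp.training_replies()[last_round]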
for r in round_data.values():
    print("\t- {id} :\
    \n\t\trtime_training={rtraining:.2f} seconds\
    \n\t\tptime_training={ptraining:.2f} seconds\
    \n\t\trtime_total={rtotal:.2f} seconds".format(id = r['node_id'],
        rtraining = r['timing']['rtime_training'],
        ptraining = r['timing']['ptime_training'],
        rtotal = r['timing']['rtime_total']))
print('\n')

Federated parameters for each round are available via `exp.aggregated_params()` (indexed from 0 to `rounds` - 1).
For example, you can list the federated parameters for each round of the experiment:

In [None]:
print("\nList the training rounds : ", loaded_exp.aggregated_params().keys())

print("\nAccess the federated params for training rounds : ")
# `round_idx` avoids shadowing the built-in `round` function
for round_idx in loaded_exp.aggregated_params().keys():
    print("round {r}".format(r=round_idx))
    print("\t- parameter data: ", loaded_exp.aggregated_params()[round_idx]['params'].keys())
