# Federated Experiment Tracking

The following examples demonstrate federated experiment tracking with several different tools.

Each example uses [NVIDIA FLARE](https://nvflare.readthedocs.io/en/main/index.html) to train an image classifier with federated averaging ([FedAvg](https://arxiv.org/abs/1602.05629)) and [PyTorch](https://pytorch.org/) as the deep learning training framework.

The examples also highlight the capability to stream metrics from the clients to the server, where they can be viewed with tools such as TensorBoard, MLflow, and Weights & Biases.

> **_NOTE:_** This example uses the [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset and will load its data within the trainer code.


## 0. Prerequisites

This guide assumes that you have:
* created and activated a Python virtual environment (venv)
* installed NVFLARE; if not, follow the [Installation](https://nvflare.readthedocs.io/en/main/getting_started.html#installation) instructions
* cloned the NVFLARE repository with git


## 1. TensorBoard

There are two jobs under `hello-pt-tb`:
* hello-tb
* hello-tb-mix

hello-tb -- demonstrates the TBWriter and TBAnalyticsReceiver.

hello-tb-mix -- demonstrates that the metrics logged by the same code as in the previous job (hello-tb) can be displayed in MLflow and WandB in addition to TensorBoard by adding the MLflow receiver.

Because the two jobs share the same custom code and differ only in their configurations, the custom code has been moved to a shared `custom` directory that both jobs use. As a result, you need to add this directory to PYTHONPATH.
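To illustrate why PYTHONPATH matters here, the following minimal Python sketch builds a throwaway directory containing a stand-in module and shows that a Python process started with that directory on PYTHONPATH can import it. The module name `shared_learner` and the temporary directory are assumptions made for illustration; they are not part of the NVFLARE examples.

```python
import os
import subprocess
import sys
import tempfile

with tempfile.TemporaryDirectory() as custom_dir:
    # Stand-in for a shared custom module such as pt_learner.py (hypothetical name).
    with open(os.path.join(custom_dir, "shared_learner.py"), "w") as f:
        f.write("NAME = 'shared learner'\n")

    # Append the directory to PYTHONPATH for the child process only.
    env = dict(os.environ)
    existing = env.get("PYTHONPATH")
    env["PYTHONPATH"] = (
        custom_dir if not existing else existing + os.pathsep + custom_dir
    )

    # The child process can import the module only because of PYTHONPATH.
    result = subprocess.run(
        [sys.executable, "-c", "import shared_learner; print(shared_learner.NAME)"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    imported_name = result.stdout.strip()

print(imported_name)
```

The simulator resolves component paths such as `pt_learner.PTLearner` through normal Python imports, which is why the shared directory has to be visible to the process that runs the job.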

**_note_** If you don't have ```tree``` installed, you can change the command below to ```ls -al hello-pt-tb/*```.

In [1]:
%%bash 
tree hello-pt-tb

hello-pt-tb
├── custom
│   ├── pt_constants.py
│   ├── pt_learner.py
│   ├── simple_network.py
│   └── test_custom.py
├── hello-tb
│   ├── app
│   │   └── config
│   │       ├── config_fed_client.json
│   │       └── config_fed_server.json
│   └── meta.json
├── hello-tb-mix
│   ├── app
│   │   └── config
│   │       ├── config_fed_client.json
│   │       └── config_fed_server.json
│   └── meta.json
└── README.md

7 directories, 11 files


### Experiment Tracking with TensorBoard

In a terminal, install the additional requirements:

```
pip3 install torch torchvision tensorboard
```

#### Configurations

**Client Config**
```
 "components": [
    {
      "id": "pt_learner",
      "path": "pt_learner.PTLearner",
      "args": {
        "lr": 0.01,
        "epochs": 5,
        "analytic_sender_id": "log_writer"
      }
    },
    {
      "id": "log_writer",
      "path": "nvflare.app_opt.tracking.tb.tb_writer.TBWriter",
      "args": {"event_type": "analytix_log_stats"}
    },
    {
      "id": "event_to_fed",
      "name": "ConvertToFedEvent",
      "args": {"events_to_convert": ["analytix_log_stats"], "fed_event_prefix": "fed."}
    }
  ]
  ```
  Note that PTLearner requires the "analytic_sender_id" argument, which is the ID of the LogWriter component. For backward compatibility, the argument name was not changed, so existing configurations do not need to be updated.
  
  **Server Config**
  
  ```
   "components": [
   
    ... <omit other components> ...
    
    {
      "id": "tb_analytics_receiver",
      "name": "TBAnalyticsReceiver",
      "args": {"events": ["fed.analytix_log_stats"]}
    }
  ],
  
  ```
  
 On the server side, the TBAnalyticsReceiver is registered to receive the `fed.analytix_log_stats` events streamed from the clients.
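The client and server configurations have to agree on the event name: the client converts `analytix_log_stats` into a federated event using the `fed.` prefix, and the server receiver listens for `fed.analytix_log_stats`. The following is a minimal consistency check in plain Python, assuming the federated event name is simply the prefix followed by the original name, as the two configurations suggest. The helper names `streamed_event_names` and `received_event_names` are hypothetical and not part of NVFLARE.

```python
import json

# Trimmed copies of the client and server component lists shown above.
CLIENT_CONFIG = json.loads("""
{
  "components": [
    {"id": "log_writer",
     "path": "nvflare.app_opt.tracking.tb.tb_writer.TBWriter",
     "args": {"event_type": "analytix_log_stats"}},
    {"id": "event_to_fed",
     "name": "ConvertToFedEvent",
     "args": {"events_to_convert": ["analytix_log_stats"],
              "fed_event_prefix": "fed."}}
  ]
}
""")

SERVER_CONFIG = json.loads("""
{
  "components": [
    {"id": "tb_analytics_receiver",
     "name": "TBAnalyticsReceiver",
     "args": {"events": ["fed.analytix_log_stats"]}}
  ]
}
""")


def streamed_event_names(client_config):
    """Event names after the client adds the federation prefix (assumed rule)."""
    names = set()
    for component in client_config["components"]:
        if component.get("name") == "ConvertToFedEvent":
            args = component["args"]
            prefix = args["fed_event_prefix"]
            names.update(prefix + event for event in args["events_to_convert"])
    return names


def received_event_names(server_config):
    """Event names that any server component lists under args.events."""
    names = set()
    for component in server_config["components"]:
        names.update(component.get("args", {}).get("events", []))
    return names


streamed = streamed_event_names(CLIENT_CONFIG)
received = received_event_names(SERVER_CONFIG)
unmatched = received - streamed

print("streamed:", sorted(streamed))
print("receiver events with no matching client event:", sorted(unmatched))
```

If `unmatched` is not empty, the receiver is waiting for an event that no client sends, which is a common reason for an empty dashboard.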
 

#### Run the experiment

First, change to the `examples/experiment_tracking` directory:

```
cd examples/experiment_tracking
```
Then use the NVFLARE simulator to run the hello-tb job:

In [None]:
%%bash
# export so that the nvflare process started below inherits the variable
export PYTHONPATH=$PYTHONPATH:$(pwd)/hello-pt-tb/custom
nvflare simulator -w /tmp/nvflare/ -n 2 -t 2 hello-pt-tb/hello-tb

#### View Results

**TensorBoard View**

From a terminal:
```
tensorboard --logdir=/tmp/nvflare/simulate_job/tb_events
TensorFlow installation not found - running with reduced feature set.

NOTE: Using experimental fast data loading logic. To disable, pass
    "--load_fast=false" and report issues on GitHub. More details:
    https://github.com/tensorflow/tensorboard/issues/4784

Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.11.0 at http://localhost:6007/ (Press CTRL+C to quit)

```
Then open http://localhost:6007/ in a browser.


If the server is running on a remote machine, use port forwarding to view the TensorBoard dashboard in a browser. Forward the port that TensorBoard reports when it starts (6006 is the TensorBoard default; the run above used 6007).
For example:
```
ssh -L {local_machine_port}:127.0.0.1:6006 user@server_ip
```

> **_NOTE:_** For a more in-depth guide about the TensorBoard streaming feature, see [Quickstart (PyTorch with TensorBoard)](https://nvflare.readthedocs.io/en/main/examples/hello_pt_tb.html).



In [None]:
%%bash

tensorboard --logdir=/tmp/nvflare/simulate_job/tb_events

### TensorBoard with different receivers

Suppose we would like to use MLflow or Weights & Biases, but we don't want to rewrite the code in order to capture the metrics. With NVFLARE, this is easy: all you need to do is add the needed receiver. You can add both the MLflow and Weights & Biases receivers if you like.

The hello-tb-mix example does just that.

There is no change in the custom code or the client configuration. However, in the server configuration the TBAnalyticsReceiver is replaced with the MLflow receiver and the WandB receiver.

**Server Config**

The details of the configuration are explained in the MLflow example. Here we only demonstrate that you can use these receivers without code changes.


**_MLflow Receiver_**

```

   {
      "id": "mlflow_receiver",
      "path": "nvflare.app_opt.tracking.mlflow.mlflow_receiver.MLFlowReceiver",
      "args": {
        "kwargs": {
          "experiment_name": "hello-pt-experiments"
        },
        "artifact_location": "artifacts"
      }
    },
```    




In [None]:
%%bash 

pip3 install mlflow

In [None]:
%%bash 
# keep track of original python to avoid mixup in different examples
ORIGIN_PATH=$PYTHONPATH
echo $ORIGIN_PATH

In [None]:
%%bash
# export so that the nvflare process started below inherits the variable
export PYTHONPATH=$ORIGIN_PATH:$( pwd )/hello-pt-tb/custom
echo $PYTHONPATH
nvflare simulator -w /tmp/nvflare/ -n 2 -t 2 hello-pt-tb/hello-tb-mix

The results of the MLflow experiment run can be found in

```
/tmp/nvflare/mlruns
```

How to view the results is covered in the MLflow example below.


## 2. MLflow

There are two jobs under `hello-pt-mlflow`:
* hello-mlflow
* hello-mlflow-mix

hello-mlflow -- demonstrates the MLFlowWriter and MLFlowReceiver.

hello-mlflow-mix -- demonstrates that the metrics logged by the same code as in hello-mlflow can be displayed in TensorBoard in addition to MLflow by adding the TensorBoard receiver.

Because the two jobs share the same custom code and differ only in their configurations, the custom code has been moved to a shared `custom` directory that both jobs use. As a result, you need to add this directory to PYTHONPATH.

**_note_** If you don't have ```tree``` installed, you can change the command below to ```ls -al hello-pt-mlflow/*```.




In [2]:
%%bash 
tree hello-pt-mlflow

hello-pt-mlflow
├── custom
│   ├── pt_constants.py
│   ├── pt_learner.py
│   ├── simple_network.py
│   └── test_custom.py
├── hello-mlflow
│   ├── app
│   │   └── config
│   │       ├── config_fed_client.json
│   │       └── config_fed_server.json
│   └── meta.json
└── hello-mlflow-mix
    ├── app
    │   └── config
    │       ├── config_fed_client.json
    │       └── config_fed_server.json
    └── meta.json

7 directories, 10 files


### Experiment Tracking with MLflow

First, install the requirements (skip this step if you have already installed them):
```
pip3 install mlflow
```


In [None]:
%%bash 

pip3 install mlflow

#### Configurations

**Client Config**

```
   "components": [
    {
      "id": "pt_learner",
      "path": "pt_learner.PTLearner",
      "args": {
        "lr": 0.01,
        "epochs": 5,
        "analytic_sender_id": "mlflow_writer"
      }
    },
    {
      "id": "mlflow_writer",
      "path": "nvflare.app_opt.tracking.mlflow.mlflow_writer.MLFlowWriter",
      "args": {"event_type": "analytix_log_stats"}
    },
    {
      "id": "event_to_fed",
      "name": "ConvertToFedEvent",
      "args": {"events_to_convert": ["analytix_log_stats"], "fed_event_prefix": "fed."}
    }
  ]
```

To use the MLflow API syntax, we register the MLFlowWriter component.


**Server Config**

In addition to the normal training configuration, we need to add the following component to handle the streamed events.

If an MLflow tracking server is used, we need to specify the tracking URI in the arguments. If no MLflow tracking server is used, the tracking URI can be omitted.

**with tracking server**

  
``` 
  {
      "id": "mlflow_receiver",
      "path": "nvflare.app_opt.tracking.mlflow.mlflow_receiver.MLFlowReceiver",
      "args": {
        "kwargs": {
          "experiment_name": "hello-pt-experiment",
          "run_name": "hello-pt-with-mlflow",
          "experiment_tags": {
            "mlflow.note.content": "experiment description"
          },
          "run_tags": {
            "mlflow.note.content": "Run description"
          }
        },
        "artifact_location": "artifacts",
        "tracking_uri": "http://localhost:5001"
      }
    } 
```    
   **without tracking server**
```
   {
      "id": "mlflow_receiver",
      "path": "nvflare.app_opt.tracking.mlflow.mlflow_receiver.MLFlowReceiver",
      "args": {
        "kwargs": {
          "experiment_name": "hello-pt-experiment",
          "run_name": "hello-pt-with-mlflow",
          "experiment_tags": {
            "mlflow.note.content": "experiment description"
          },
          "run_tags": {
            "mlflow.note.content": "Run description"
          }
        },
        "artifact_location": "artifacts"
      }
    }
```


You can set the experiment name in the configuration. You can also set a relative artifact location (there is no need to create it in advance). The artifact location is only used for logged text; it is not used for parameters and metrics.
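The only difference between the two receiver configurations is whether `tracking_uri` is present. A minimal sketch that builds either variant is shown here; `mlflow_receiver_component` is a hypothetical helper written for this illustration, not an NVFLARE function, and the URI passed in is only an example value.

```python
import json


def mlflow_receiver_component(experiment_name, run_name, tracking_uri=None):
    """Build a receiver entry shaped like the ones above (hypothetical helper)."""
    args = {
        "kwargs": {
            "experiment_name": experiment_name,
            "run_name": run_name,
        },
        "artifact_location": "artifacts",
    }
    # Only add the tracking URI when a tracking server is actually used.
    if tracking_uri is not None:
        args["tracking_uri"] = tracking_uri
    return {
        "id": "mlflow_receiver",
        "path": "nvflare.app_opt.tracking.mlflow.mlflow_receiver.MLFlowReceiver",
        "args": args,
    }


with_server = mlflow_receiver_component(
    "hello-pt-experiment", "hello-pt-with-mlflow",
    tracking_uri="http://localhost:5000",  # example value; use your server's URI
)
without_server = mlflow_receiver_component(
    "hello-pt-experiment", "hello-pt-with-mlflow"
)

# json.dumps never emits a trailing comma, so the output is valid JSON.
print(json.dumps(without_server, indent=2))
```

Generating the fragment this way is optional; it mainly helps avoid hand-editing mistakes such as a trailing comma after removing the `tracking_uri` line.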


**Experiment logging code changes**

Unlike the TensorBoard SummaryWriter API syntax, which uses add_scalar() or add_scalars(), the MLflow writer follows the MLflow API with log_params() and log_metrics().
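The difference in call shape can be sketched with a small stand-in class. `RecordingWriter` is hypothetical and only records the calls it receives; it is not the NVFLARE TBWriter or MLFlowWriter, and the exact signatures of those classes should be checked against the NVFLARE source. The method names mirror the ones mentioned above, and the metric names and values are made up.

```python
class RecordingWriter:
    """Hypothetical stand-in that records calls instead of streaming them."""

    def __init__(self):
        self.calls = []

    # TensorBoard-style: one tagged value per call.
    def add_scalar(self, tag, scalar, global_step=None):
        self.calls.append(("add_scalar", tag, scalar, global_step))

    # MLflow-style: a dictionary of parameters, logged once.
    def log_params(self, params):
        self.calls.append(("log_params", dict(params)))

    # MLflow-style: a dictionary of metrics per step.
    def log_metrics(self, metrics, step=None):
        self.calls.append(("log_metrics", dict(metrics), step))


writer = RecordingWriter()

# TensorBoard SummaryWriter syntax
writer.add_scalar("train_loss", 0.53, global_step=10)

# MLflow syntax for the same information, plus hyperparameters
writer.log_params({"lr": 0.01, "epochs": 5})
writer.log_metrics({"train_loss": 0.53}, step=10)

for call in writer.calls:
    print(call)
```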



#### MLflow Tracking Server

An MLflow tracking server can be set up and deployed separately. For example, in an Azure ML workspace the MLflow tracking server is already set up, and all you need is to find the tracking URI.

In this example, we set up a simple tracking server with a SQLite database:

```
mlflow server --backend-store-uri=sqlite:///mlruns.db  --host localhost --port 5000

```
You can then go to http://localhost:5000 to monitor the experiments while the job runs. The port must match the one in the `tracking_uri` of the server configuration; the sample configuration above uses port 5001, so adjust one of the two so that they agree.
 
 

#### Run the experiment

Use the NVFLARE simulator to run the hello-mlflow job. Recall that the custom code has been moved to a shared location, so we need to set PYTHONPATH for the simulator to find it.

First, change to the `examples/experiment_tracking` directory.


In [None]:
%%bash

# export so that the nvflare process started below inherits the variable
export PYTHONPATH=$ORIGIN_PATH:$( pwd )/hello-pt-mlflow/custom
echo $PYTHONPATH

nvflare simulator -w /tmp/nvflare/ -n 2 -t 2 hello-pt-mlflow/hello-mlflow

#### Running experiments without a tracking server and tracking URI

If there is no tracking server and no tracking URI is specified, we can simply run the experiments as before. Meanwhile, we can track the progress with the `mlflow ui` command shown below (note that our workspace points to /tmp/nvflare).

Run it from a terminal (it doesn't work when run from the notebook):
```
 mlflow ui --backend-store-uri=/tmp/nvflare/mlruns
 
[2023-01-05 15:30:38 -0800] [71735] [INFO] Starting gunicorn 20.1.0
[2023-01-05 15:30:38 -0800] [71735] [INFO] Listening at: http://127.0.0.1:5000 (71735)
[2023-01-05 15:30:38 -0800] [71735] [INFO] Using worker: sync
[2023-01-05 15:30:38 -0800] [71737] [INFO] Booting worker with pid: 71737
[2023-01-05 15:30:38 -0800] [71738] [INFO] Booting worker with pid: 71738
[2023-01-05 15:30:38 -0800] [71739] [INFO] Booting worker with pid: 71739
[2023-01-05 15:30:38 -0800] [71740] [INFO] Booting worker with pid: 71740

```

Then open http://127.0.0.1:5000 in a browser to check the results.


If the server is running on a remote machine, use port forwarding to view the MLflow UI in a browser.
For example:
```
ssh -L {local_machine_port}:127.0.0.1:5000 user@server_ip
```


#### View Results

When the tracking URI is specified, you can review the results in a browser at the address of the tracking server, for example http://localhost:5000/


### MLflow with different receivers

As with the hello-mlflow job, the custom code logs its metrics with the MLflow writer. Suppose we also want to capture the metrics in another tool without rewriting the code. All we need to do is add the needed receiver. There is no change in the custom code or the client configuration; on the server side, we add the TensorBoard receiver. In this example, we also removed the tracking URI from the MLflow receiver.

**Server Config**

The details of the configuration are explained in the TensorBoard example. Here we only demonstrate that you can use these receivers without code changes.

TensorBoard Receiver

```
   {
      "id": "tb_analytics_receiver",
      "name": "TBAnalyticsReceiver",
      "args": {"events": ["fed.analytix_log_stats"]}
    },
    
```
```
  {
      "id": "mlflow_receiver_without_tracking_uri",
      "path": "nvflare.app_opt.tracking.mlflow.mlflow_receiver.MLFlowReceiver",
      "args": {
        "kwargs": {
          "experiment_name": "hello-pt-experiment",
          "run_name": "hello-pt-with-mlflow",
          "experiment_tags": {
            "mlflow.note.content": "## **Hello PyTorch experiment with MLFLOW**",
            "version": "v1",
            "priority": "P1"
          },
          "run_tags": {
            "mlflow.note.content": "run description"
          }
        },
        "artifact_location": "artifacts"
      }
    },
```

Before running the job, install the additional requirements (skip this step if you have already done so):

```
pip3 install torch torchvision tensorboard 

```


In [None]:
%%bash 

pip3 install torch torchvision tensorboard


In [None]:
%%bash

# export so that the nvflare process started below inherits the variable
export PYTHONPATH=$ORIGIN_PATH:$( pwd )/hello-pt-mlflow/custom
echo $PYTHONPATH

nvflare simulator -w /tmp/nvflare/ -n 2 -t 2 hello-pt-mlflow/hello-mlflow-mix

**View Results**

Since there is no tracking URI, run the following from a terminal:

```
mlflow ui --backend-store-uri=/tmp/nvflare/mlruns
```

Then open http://localhost:5000/ in a browser.


## 3. Weights and Biases

The hello-pt-wandb example shows two things:

1) How to customize the Receiver and LogWriter so that you can create your own integration for another ML experiment tracking tool. The `hello-pt-wandb/wandb` directory contains both the WandBReceiver and the LogWriter (WandBWriter). You can read the README there to learn about the design considerations for this work.

2) The `hello-pt-wandb/app/custom` directory contains the same hello-pt example, using WandBWriter to log the metrics.


**_Note_** Weights and Biases (W&B or WandB) requires registration on their website. It is free for individual personal and open source use, but not free for commercial or team use.

**The following assumes that you have a W&B account and are already logged in.**

First, install wandb:

```
pip3 install wandb
```


In [None]:
%%bash 
pip3 install wandb


In [3]:
%%bash 
tree hello-pt-wandb

hello-pt-wandb
├── app
│   ├── config
│   │   ├── config_fed_client.json
│   │   └── config_fed_server.json
│   └── custom
│       ├── pt_constants.py
│       ├── pt_learner.py
│       ├── simple_network.py
│       └── test_custom.py
├── meta.json
└── wandb
    ├── __init__.py
    ├── README.md
    ├── wandb_receiver.py
    └── wandb_writer.py

4 directories, 11 files


### Experiments with W&B

In this example, we use the WandB writer and receiver to track experiments. We could also use mixed receivers such as the TensorBoard receiver and the MLflow receiver, but the process is the same, so we do not repeat it here.

**Client Config**

```
 "components": [
    {
      "id": "pt_learner",
      "path": "pt_learner.PTLearner",
      "args": {
        "lr": 0.01,
        "epochs": 5,
        "analytic_sender_id": "log_writer"
      }
    },
    {
      "id": "log_writer",
      "path": "wandb_writer.WandBWriter",
      "args": {"event_type": "analytix_log_stats"}
    },
    {
      "id": "event_to_fed",
      "name": "ConvertToFedEvent",
      "args": {"events_to_convert": ["analytix_log_stats"], "fed_event_prefix": "fed."}
    }
  ]
```

Here we use the W&B log() API syntax via WandBWriter.



**Server Config**
 
WandB Receiver

```
    {
      "id": "wandb_receiver",
      "path": "wandb_receiver.WandBReceiver",
      "args": {
          "mode": "offline",

          "config": {
            "architecture": "CNN",
            "dataset_id": "CIFAR10",
            "momentum": 0.9,
            "optimizer": "SGD",
            "learning rate": 0.01,
            "epochs": 5
          },

          "kwargs" :  {
            "project": "hello-pt-experiment",
            "notes": "descripton",
            "tags": ["baseline", "paper1"],
            "group": "hello-pt",
            "job_type": "train-validate"
          }
      }
    }
```    

In Weights and Biases, the project- and run-related information passed to init() is specified in the **"kwargs"** argument. The keys must match the parameters of the WandB init() API. Here you can add project, notes, tags, etc.

You can also set the configuration parameters related to the experiment (whatever you think is important) in **"config"**.


The **"mode"** can be **"online"**, **"offline"**, or **"disabled"** (for testing only).

If the mode is set to online, the metrics are sent directly to the Weights & Biases hosted website, assuming you have opened an account and logged in.

If the mode is set to offline, the metrics are written to local files, which can then be synced to the W&B website.
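The way the receiver arguments map onto an init()-style call can be sketched as follows. This is an illustration under stated assumptions: `fake_init` is a hypothetical stand-in that accepts only the parameters used in this example, it is not `wandb.init()`, and `start_run` is not how the WandBReceiver is necessarily implemented. It shows why a misspelled key in "kwargs" is rejected and how the mode can be validated up front.

```python
import json

# Receiver arguments copied from the server configuration above.
RECEIVER_ARGS = json.loads("""
{
  "mode": "offline",
  "config": {"architecture": "CNN", "dataset_id": "CIFAR10", "epochs": 5},
  "kwargs": {
    "project": "hello-pt-experiment",
    "tags": ["baseline", "paper1"],
    "group": "hello-pt",
    "job_type": "train-validate"
  }
}
""")

ALLOWED_MODES = {"online", "offline", "disabled"}


def fake_init(project=None, notes=None, tags=None, group=None,
              job_type=None, mode=None, config=None):
    """Hypothetical stand-in for an init()-style function; not wandb.init()."""
    return {"project": project, "mode": mode, "config_keys": sorted(config or {})}


def start_run(receiver_args):
    """Validate the mode, then unpack "kwargs" as keyword arguments."""
    mode = receiver_args["mode"]
    if mode not in ALLOWED_MODES:
        raise ValueError(f"unsupported mode: {mode!r}")
    return fake_init(mode=mode,
                     config=receiver_args.get("config"),
                     **receiver_args.get("kwargs", {}))


run = start_run(RECEIVER_ARGS)
print(run)

# A key that does not match a parameter name fails at call time.
typo_rejected = False
try:
    start_run({"mode": "offline", "kwargs": {"projct": "hello-pt-experiment"}})
except TypeError as err:
    typo_rejected = True
    print("rejected:", err)
```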

Before running the job, make sure the additional requirements are installed (skip this step if you already have them).


Change to the ```examples/experiment_tracking``` directory, then run the example with the simulator.

Make sure that PYTHONPATH doesn't contain classes with the same names from the previous examples.


In [None]:
%%bash
# Set PYTHONPATH in the same cell as the run: each %%bash cell starts a new shell.
export PYTHONPATH=$ORIGIN_PATH:$(pwd)/hello-pt-wandb/wandb

nvflare simulator -w /tmp/nvflare/ -n 2 -t 2 hello-pt-wandb


When the job finishes successfully in **offline** mode, you will see something like this:
```
wandb: Waiting for W&B process to finish... (success).
wandb: Waiting for W&B process to finish... (success).
wandb: 
wandb: Run history:
wandb:          train_loss █▆▆▆▄▄▃▅▄▄▄▆▅▃▄▄▄▃▄▃▄▄▄▄▃▃▂▄▂▂▃▄▂▄▁▁▃▂▂▃
wandb: validation_accuracy ▁
wandb: 
wandb: Run summary:
wandb:          train_loss 0.5315
wandb: validation_accuracy 0.4926
wandb: 
wandb: 
wandb: Run history:
wandb:          train_loss █▇▆▅▄▅▅▄▅▅▃▄▅▅▂▅▃▃▃▃▄▃▃▄▃▃▁▃▄▃▄▃▄▂▂▁▁▃▂▂
wandb: validation_accuracy ▁
wandb: 
wandb: Run summary:
wandb:          train_loss 0.90587
wandb: validation_accuracy 0.4888
wandb: 
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /tmp/nvflare/wandb/offline-run-20230108_214713-3m98v39x
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /tmp/nvflare/wandb/offline-run-20230108_214713-tfyfivqt
wandb: Find logs at: ./wandb/offline-run-20230108_214713-tfyfivqt/logs
wandb: Find logs at: ./wandb/offline-run-20230108_214713-3m98v39x/logs
```

You can sync the results to W&B with the command printed in the output:

```
wandb sync <path>
```
Then you can log in to the W&B website to view the results.
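With several clients there is one offline run directory per client, so it can be convenient to list all of the sync commands at once. The following is a minimal sketch that assumes the run directories follow the `offline-run-*` naming seen in the output above; `sync_commands` is a hypothetical helper, and the demo uses a throwaway directory with made-up run names rather than a real workspace.

```python
import shlex
import tempfile
from pathlib import Path


def sync_commands(wandb_dir):
    """Return one `wandb sync` command per offline run directory found."""
    run_dirs = sorted(
        path for path in Path(wandb_dir).glob("offline-run-*") if path.is_dir()
    )
    # Quote each path so the command is safe to paste into a shell.
    return [f"wandb sync {shlex.quote(str(run_dir))}" for run_dir in run_dirs]


# Demo on a temporary directory with made-up run names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("offline-run-20230108_214713-aaaa",
                 "offline-run-20230108_214713-bbbb"):
        (Path(tmp) / name).mkdir()
    (Path(tmp) / "debug.log").write_text("")  # not a run directory, ignored
    commands = sync_commands(tmp)

for command in commands:
    print(command)
```

For the run in this example, you would call `sync_commands("/tmp/nvflare/wandb")` and execute the printed commands in a terminal.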


#### Troubleshooting

Because the WandB run for each client is implemented as a separate process, you may encounter the common multiprocessing issues noted in the [WandB documentation](https://docs.wandb.ai/guides/track/advanced/distributed-training):
* *_Hanging at the beginning of training_* 
* *_Hanging at the end of training_*

One way to check is to see whether the NVFLARE simulator process is still running.

In [None]:
%%bash 
ps -eaf | grep nvflare 

Kill any remaining jobs. In some cases, we found it helpful to simply remove the WandB output directory to clean up after a previous run. For example:

```
rm -r /tmp/nvflare/wandb
```