# Hello PyTorch with MLflow

This example demonstrates that the code from hello-pt-tb, which uses PyTorch with TensorBoard tracking, can switch to MLflow for experiment tracking without any changes to the training code.



This is an example of using [NVIDIA FLARE](https://nvflare.readthedocs.io/en/main/index.html) to train an image classifier using federated averaging ([FedAvg](https://arxiv.org/abs/1602.05629)) and [PyTorch](https://pytorch.org/) as the deep learning training framework. It also highlights metric streaming from the clients to the server: the clients keep the TensorBoard SummaryWriter sender syntax, while the server uses an MLflow receiver.

> **_NOTE:_** This example uses the [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset and will load its data within the trainer code.


### 1. Install NVIDIA FLARE

Follow the [Installation](https://nvflare.readthedocs.io/en/main/getting_started.html#installation) instructions.
Install additional requirements:


In [None]:
%%bash
pip3 install torch torchvision tensorboard mlflow

### 2. Change Configuration

In `fed_server_config.json`, add the following entry to the `components` list:
```
{
      "id": "mlflow_receiver",
      "path": "nvflare.app_opt.tracking.mlflow.mlflow_receiver.MLFlowReceiver",
      "args": {
        "kwargs": {"experiment_name": "hello-pt-experiments"},
        "artifact_location": "artifacts"
      }
}
```
This registers the MLflow receiver in addition to the TensorBoard receiver.
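The same configuration change can be scripted instead of edited by hand. The following is a minimal sketch using only the Python standard library; it assumes `components` is a JSON list at the top level of `fed_server_config.json`. The helper names `add_component` and `patch_config_file` are hypothetical, and the path you pass in depends on where the config lives in your job folder, which is not stated here.

```python
import json
from pathlib import Path

# Component entry copied from the configuration snippet in this section.
MLFLOW_RECEIVER = {
    "id": "mlflow_receiver",
    "path": "nvflare.app_opt.tracking.mlflow.mlflow_receiver.MLFlowReceiver",
    "args": {
        "kwargs": {"experiment_name": "hello-pt-experiments"},
        "artifact_location": "artifacts",
    },
}


def add_component(config: dict, component: dict) -> dict:
    """Append `component` to config["components"] unless its id is already present."""
    components = config.setdefault("components", [])
    if not any(c.get("id") == component["id"] for c in components):
        components.append(component)
    return config


def patch_config_file(path: Path) -> None:
    """Hypothetical helper: load the server config, add the receiver, write it back."""
    config = json.loads(path.read_text())
    add_component(config, MLFLOW_RECEIVER)
    path.write_text(json.dumps(config, indent=2) + "\n")
```

Because `add_component` checks the `id` first, running the script twice does not register the receiver twice.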




### 3. Run the experiment

Use the NVFlare simulator to run the example:

```
nvflare simulator -w /tmp/nvflare/ -n 2 -t 2 examples/hello-pt-tb-mlflow/hello-pt-tb-mlflow-job1
```

In [None]:
%%bash
cd $NVFLARE_HOME

nvflare simulator -w /tmp/nvflare/ -n 2 -t 2 examples/hello-pt-tb-mlflow/hello-pt-tb-mlflow-job1

2022-12-30 21:55:07,020 - SimulatorRunner - INFO - Create the Simulator Server.
2022-12-30 21:55:07,070 - nvflare.fuel.hci.server.hci - INFO - Starting Admin Server localhost on Port 39803
2022-12-30 21:55:07,072 - SimulatorServer - INFO - starting insecure server at localhost:41845
2022-12-30 21:55:07,074 - SimulatorRunner - INFO - Deploy the Apps.
2022-12-30 21:55:07,076 - SimulatorRunner - INFO - Create the simulate clients.
2022-12-30 21:55:07,093 - ClientManager - INFO - Client: New client site-1@127.0.0.1 joined. Sent token: 17a06432-2fe7-43ad-886a-9b97f887f35e.  Total clients: 1
2022-12-30 21:55:07,093 - FederatedClient - INFO - Successfully registered client:site-1 for project simulator_server. Token:17a06432-2fe7-43ad-886a-9b97f887f35e SSID:
2022-12-30 21:55:07,110 - ClientManager - INFO - Client: New client site-2@127.0.0.1 joined. Sent token: 70848535-124b-4205-b035-7f231724f37e.  Total clients: 2
2022-12-30 21:55:07,111 - FederatedClient - INFO - Successfully registered cli

E1230 21:55:10.118269652  159790 fork_posix.cc:76]           Other threads are currently calling into gRPC, skipping fork() handlers


2022-12-30 21:55:10,121 - SimulatorClientRunner - INFO - Simulate Run client: site-2


E1230 21:55:10.122527629  159791 fork_posix.cc:76]           Other threads are currently calling into gRPC, skipping fork() handlers


Files already downloaded and verified
2022-12-30 21:55:13,512 - ClientRunner - INFO - [identity=site-1, run=simulate_job]: client runner started
2022-12-30 21:55:13,512 - ClientTaskWorker - INFO - Initialize ClientRunner for client: site-1
2022-12-30 21:55:13,528 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=scatter_and_gather, peer=site-1, peer_run=simulate_job, task_name=train, task_id=f6b01c85-7ab8-4c78-a5c3-77e75b8deb0e]: assigned task to client site-1: name=train, id=f6b01c85-7ab8-4c78-a5c3-77e75b8deb0e
2022-12-30 21:55:13,528 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=scatter_and_gather, peer=site-1, peer_run=simulate_job, task_name=train, task_id=f6b01c85-7ab8-4c78-a5c3-77e75b8deb0e]: sent task assignment to client
2022-12-30 21:55:13,529 - SimulatorServer - INFO - GetTask: Return task: train to client: site-1 (17a06432-2fe7-43ad-886a-9b97f887f35e) 
Files already downloaded and verified
2022-12-30 21:55:13,529 - ClientRunn

2022-12-30 21:55:26,312 - SimulatorServer - INFO - getting AuxCommunicate request
2022-12-30 21:55:26,312 - SimulatorServer - INFO - getting AuxCommunicate request
2022-12-30 21:55:26,831 - SimulatorServer - INFO - getting AuxCommunicate request
2022-12-30 21:55:26,831 - SimulatorServer - INFO - getting AuxCommunicate request
2022-12-30 21:55:27,351 - SimulatorServer - INFO - getting AuxCommunicate request
2022-12-30 21:55:27,351 - SimulatorServer - INFO - getting AuxCommunicate request
2022-12-30 21:55:27,873 - SimulatorServer - INFO - getting AuxCommunicate request
2022-12-30 21:55:27,873 - SimulatorServer - INFO - getting AuxCommunicate request
2022-12-30 21:55:28,391 - SimulatorServer - INFO - getting AuxCommunicate request
2022-12-30 21:55:28,391 - SimulatorServer - INFO - getting AuxCommunicate request
2022-12-30 21:55:28,909 - SimulatorServer - INFO - getting AuxCommunicate request
2022-12-30 21:55:28,909 - SimulatorServer - INFO - getting AuxCommunicate request
2022-12-30 21:55

2022-12-30 21:55:51,150 - SimulatorServer - INFO - getting AuxCommunicate request
2022-12-30 21:55:51,151 - SimulatorServer - INFO - getting AuxCommunicate request
2022-12-30 21:55:51,993 - PTLearner - INFO - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=f6b01c85-7ab8-4c78-a5c3-77e75b8deb0e]: Epoch: 4/5, Iteration: 0, Loss: 1.4308373133341471e-05
2022-12-30 21:55:52,015 - PTLearner - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=f221fc1d-6c70-4e32-aa00-a1fac2c4ab83]: Epoch: 4/5, Iteration: 0, Loss: 1.1736134688059489e-05
2022-12-30 21:55:52,672 - SimulatorServer - INFO - getting AuxCommunicate request
2022-12-30 21:55:52,673 - SimulatorServer - INFO - getting AuxCommunicate request
2022-12-30 21:55:53,197 - SimulatorServer - INFO - getting AuxCommunicate request
2022-12-30 21:55:53,198 - SimulatorServer - INFO - getting AuxCommunicate request
2022-12-30 21:55

2022-12-30 21:56:01,448 - SimulatorServer - INFO - getting AuxCommunicate request
2022-12-30 21:56:03,018 - ClientTaskWorker - INFO - Finished one task run for client: site-2
2022-12-30 21:56:03,104 - ClientTaskWorker - INFO - Finished one task run for client: site-1
2022-12-30 21:56:05,051 - ClientTaskWorker - INFO - Finished one task run for client: site-2
2022-12-30 21:56:05,128 - ClientTaskWorker - INFO - Finished one task run for client: site-1
2022-12-30 21:56:07,071 - ClientTaskWorker - INFO - Finished one task run for client: site-2
2022-12-30 21:56:07,160 - ClientTaskWorker - INFO - Finished one task run for client: site-1
2022-12-30 21:56:08,314 - ScatterAndGather - INFO - [identity=simulator_server, run=simulate_job, wf=scatter_and_gather]: task train exit with status TaskCompletionStatus.TIMEOUT
2022-12-30 21:56:08,315 - ScatterAndGather - INFO - [identity=simulator_server, run=simulate_job, wf=scatter_and_gather]: Start aggregation.
2022-12-30 21:56:08,315 - DXOAggregator 

2022-12-30 21:56:09,192 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=cross_site_validate, peer=site-1, peer_run=simulate_job, task_name=submit_model, task_id=84c36581-08e2-46d6-9a14-00dba446a027]: assigned task to client site-1: name=submit_model, id=84c36581-08e2-46d6-9a14-00dba446a027
2022-12-30 21:56:09,192 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=cross_site_validate, peer=site-1, peer_run=simulate_job, task_name=submit_model, task_id=84c36581-08e2-46d6-9a14-00dba446a027]: sent task assignment to client
2022-12-30 21:56:09,192 - SimulatorServer - INFO - GetTask: Return task: submit_model to client: site-1 (17a06432-2fe7-43ad-886a-9b97f887f35e) 
2022-12-30 21:56:09,192 - Communicator - INFO - Received from simulator_server server  (586 Bytes). getTask time: 0.029262304306030273 seconds
2022-12-30 21:56:09,193 - FederatedClient - INFO - pull_task completed. Task name:submit_model Status:True 
2022-12-30 21:56:09,193 - ClientR

2022-12-30 21:56:12,328 - ClientRunner - INFO - [identity=site-2, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=validate, task_id=779dba6e-bba8-4671-8ae7-44e6bda42286]: finished processing task
2022-12-30 21:56:12,329 - FederatedClient - INFO - Starting to push execute result.
2022-12-30 21:56:12,329 - Communicator - INFO - Send submitUpdate to simulator_server server
2022-12-30 21:56:12,347 - SimulatorServer - INFO - received update from simulator_server_site-2_0 (959 Bytes, 1672466172 seconds)
2022-12-30 21:56:12,348 - ServerRunner - INFO - [identity=simulator_server, run=simulate_job, wf=cross_site_validate, peer=site-2, peer_run=simulate_job]: got result from client site-2 for task: name=validate, id=779dba6e-bba8-4671-8ae7-44e6bda42286
2022-12-30 21:56:12,348 - CrossSiteModelEval - INFO - [identity=simulator_server, run=simulate_job, wf=cross_site_validate, peer=site-2, peer_run=simulate_job, peer_rc=OK, task_name=validate, task_id=779dba6e-bba8-4671-8a

### 4. TensorBoard Tracking

On the client side, we are still using the TensorBoard SummaryWriter as the `AnalyticsSender`. 

Instead of writing to TB files, it actually generates NVFLARE events of type `analytix_log_stats`.
The `ConvertToFedEvent` widget will turn the event `analytix_log_stats` into a fed event `fed.analytix_log_stats`,
which will be delivered to the server side.

On the server side, the `TBAnalyticsReceiver` is configured to process `fed.analytix_log_stats` events,
which writes received TB data into appropriate TB files on the server.
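The event flow described above can be pictured with a toy model. This is not NVFlare code: `EventBus`, `add_scalar`, and the record lists are hypothetical stand-ins written for illustration, and only the two event names are taken from the text. It shows why the client code does not need to change: any number of server-side receivers can subscribe to the same fed event.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

LOCAL_EVENT = "analytix_log_stats"
FED_EVENT = "fed." + LOCAL_EVENT


@dataclass
class EventBus:
    """Tiny in-process publish/subscribe bus (hypothetical, for illustration only)."""

    handlers: Dict[str, List[Callable[[dict], None]]] = field(default_factory=dict)

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self.handlers.setdefault(event_type, []).append(handler)

    def fire(self, event_type: str, data: dict) -> None:
        for handler in self.handlers.get(event_type, []):
            handler(data)


client_bus = EventBus()
server_bus = EventBus()

# Stand-in for the ConvertToFedEvent widget: forward each local event
# to the server under the "fed." prefixed event type.
client_bus.subscribe(LOCAL_EVENT, lambda data: server_bus.fire(FED_EVENT, data))

# Stand-ins for two server-side receivers listening to the same fed event.
tb_records: List[dict] = []
mlflow_records: List[dict] = []
server_bus.subscribe(FED_EVENT, tb_records.append)
server_bus.subscribe(FED_EVENT, mlflow_records.append)


def add_scalar(tag: str, value: float, step: int) -> None:
    """Stand-in for the sender: fire a local event instead of writing a TB file."""
    client_bus.fire(LOCAL_EVENT, {"tag": tag, "value": value, "step": step})


add_scalar("train_loss", 0.25, 0)
```

After the call, both `tb_records` and `mlflow_records` hold the same record, even though the sender knows nothing about either receiver.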

To view training metrics that are being streamed to the server, run:

```
tensorboard --logdir=/tmp/nvflare/simulate_job/tb_events
```

In [None]:
%%bash
tensorboard --logdir=/tmp/nvflare/simulate_job/tb_events

TensorFlow installation not found - running with reduced feature set.

NOTE: Using experimental fast data loading logic. To disable, pass
    "--load_fast=false" and report issues on GitHub. More details:
    https://github.com/tensorflow/tensorboard/issues/4784

Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.11.0 at http://localhost:6006/ (Press CTRL+C to quit)


### 5. MLflow Tracking

On the server side, `MLFlowReceiver` is also configured to process `fed.analytix_log_stats` events.
It writes the received data to the MLflow backend store.

To view training metrics that are being streamed to the server, run:

```
mlflow ui --backend-store-uri=/tmp/nvflare/mlruns
```

Then open http://localhost:5000/ in a browser.
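For a quick check without the UI, the metric files can also be read directly. The sketch below is hedged: it assumes MLflow's local file store layout of `<store>/<experiment_id>/<run_id>/metrics/<metric_name>`, with one `timestamp value step` record per line. That layout is an implementation detail of MLflow and may differ between versions, so treat this as a debugging aid. The function names are hypothetical, and metric names containing `/` (stored in nested folders) are not handled.

```python
from pathlib import Path
from typing import Dict, List, Tuple

Row = Tuple[int, float, int]  # (timestamp_ms, value, step)


def read_metric_file(path: Path) -> List[Row]:
    """Parse one metric file into (timestamp_ms, value, step) tuples."""
    rows: List[Row] = []
    for line in path.read_text().splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        # Older records may lack the step column; default it to 0.
        step = int(parts[2]) if len(parts) > 2 else 0
        rows.append((int(parts[0]), float(parts[1]), step))
    return rows


def collect_metrics(store: Path) -> Dict[str, List[Row]]:
    """Map '<experiment_id>/<run_id>/<metric_name>' to its parsed rows."""
    found: Dict[str, List[Row]] = {}
    for metric_file in sorted(store.glob("*/*/metrics/*")):
        if metric_file.is_file():
            exp_id, run_id = metric_file.relative_to(store).parts[:2]
            found[f"{exp_id}/{run_id}/{metric_file.name}"] = read_metric_file(metric_file)
    return found
```

For this example, the store would be `Path("/tmp/nvflare/mlruns")`, the same location passed to `mlflow ui`.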

In [4]:
%ls -al /tmp/nvflare


total 200
drwxrwxr-x   6 chester chester   4096 Dec 30 21:53 ./
drwxrwxrwt 147 root    root     32768 Dec 30 21:54 ../
-rw-rw-r--   1 chester chester 139543 Dec 30 21:54 audit.log
drwxrwxr-x   2 chester chester   4096 Dec 30 11:34 local/
drwxrwxr-x   5 chester chester   4096 Dec 30 16:53 mlruns/
drwxrwxr-x   8 chester chester   4096 Dec 30 21:54 simulate_job/
drwxrwxr-x   2 chester chester   4096 Dec 30 11:34 startup/
