# FLARE Monitoring
FLARE Monitoring provides an initial solution for tracking system metrics of your federated learning jobs.
Unlike machine learning experiment tracking, which focuses on training metrics, the monitoring here focuses on the FL system itself, i.e., job and system lifecycle metrics.

This guide will walk you through the steps to set up and use the monitoring system effectively.


## Start up the monitoring system

In this example, we simulate the real setup on the local host. To keep the example simple, we will only cover setups 1 and 2. You can follow the same steps to work out setup 3.

Setups 1 and 2 need only one monitoring system. Assuming you already have Docker and Docker Compose installed, you can use the provided [`docker-compose.yml`](../setup/docker-compose.yml) file to set up StatsD Exporter, Prometheus, and Grafana.

> Note: As of July 2023, docker-compose v1 is no longer supported. Make sure you download and use v2. To verify your Docker Compose version, run:

```
docker compose version
```
 

### Steps:

Run the following steps from a terminal, not from a notebook cell.

1. Navigate to the setup directory:
    ```bash
    cd setup
    ```

2. Start the services using Docker Compose:
    ```bash
    docker compose up -d
    ```
    You should see something similar to the following:

    ```
    Creating network "setup_monitoring" with driver "bridge"
    Creating statsd-exporter ... done
    Creating prometheus      ... done
    Creating grafana         ... done
    ```

3. To stop the services, run:
    ```bash
    docker compose down
    ```

**Note:** The StatsD Exporter port is 9125 (not 8125).
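
To check that the exporter is reachable on this port, you can send it a test metric by hand. The sketch below is a minimal illustration using only the Python standard library. It assumes the exporter accepts plain StatsD lines (`name:value|type`) over UDP on `localhost:9125`; the metric name `flare_smoke_test` and both helper functions are made up for this example.

```python
import socket


def format_statsd(name: str, value: float, metric_type: str = "c") -> str:
    """Build one StatsD line, e.g. 'flare_smoke_test:1|c' for a counter."""
    return f"{name}:{value}|{metric_type}"


def send_statsd(line: str, host: str = "localhost", port: int = 9125) -> int:
    """Send one StatsD line over UDP and return the number of bytes sent.

    UDP is fire-and-forget: the call succeeds even if nothing is listening.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(line.encode("utf-8"), (host, port))


# Hypothetical metric name, used only for this check.
packet = format_statsd("flare_smoke_test", 1, "c")
```

Calling `send_statsd(packet)` while the containers are running should make a `flare_smoke_test` counter show up on the exporter's metrics page.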



In [None]:
! docker compose version

In [None]:
%cd setup

!docker compose up -d




In [None]:
! docker compose down




## Prepare FLARE Metrics Monitoring Configuration
 
### Prepare Configuration for Setup 1: All Sites Share the Same Monitoring System

![setup-1](./figures/setup-1.png)

As described in the [system monitoring introduction](./system_monitorinig.ipynb), the component configuration differs from setup to setup.

In this setup, all sites (server and clients) will share the same monitoring system with the same host and port.

#### Job Metrics Monitoring Configuration

Instead of manually configuring the metrics monitoring, we can directly use the Job API. You can refer to [jobs/setup-1/code/fl_job.py](./jobs/setup-1/code/fl_job.py).

This is done by adding additional components on top of the existing code:

```python
    
# add server side monitoring components
server_tags = {"site": "server", "env": "dev"}
metrics_reporter = StatsDReporter(site="server", host="localhost", port=9125)
metrics_collector = JobMetricsCollector(tags=server_tags, streaming_to_server=False)

job.to_server(metrics_collector, id="server_job_metrics_collector")
job.to_server(metrics_reporter, id="statsd_reporter")

# Add clients
for i in range(n_clients):

    # ... (code skipped) ...

    # add client side monitoring components
    tags = {"site": client_site, "env": "dev"}

    metrics_collector = JobMetricsCollector(tags=tags)

    job.to(metrics_collector, target=client_site, id=f"{client_site}_job_metrics_collector")
    job.to(metrics_reporter, target=client_site, id="statsd_reporter")

```

#### System Metrics Monitoring Configuration

System metrics collection has to be configured by editing each site's local configuration file.

On the server, we need to add the following components:

* system metrics collector
* statsd reporter

In the default POC setup, these components are added to
`/tmp/nvflare/poc/example_project/prod_00/server/local/resources.json`

On each client, we need to add the same two components:

* system metrics collector
* statsd reporter

In the default POC setup, these components are added to
`/tmp/nvflare/poc/example_project/prod_00/<site-n>/local/resources.json`


Instead of going through each file manually, we provide a small script that does this:

```bash
cd setup-1
./prepare_local_config.sh  false
```
This will generate the needed system configuration for each site in this setup. 
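
The script itself is not reproduced here. As a rough illustration of the kind of edit involved, the sketch below merges extra component entries into a parsed `resources.json`. It assumes the file has a top-level `"components"` list whose entries carry `id`, `path`, and `args` keys. The ids, class paths, and arguments shown are placeholders, not the values the script actually writes.

```python
import json


def add_components(resources: dict, new_components: list) -> dict:
    """Return a copy of `resources` with `new_components` merged in by id.

    An entry whose id already exists is replaced, so re-running is idempotent.
    """
    merged = {c["id"]: c for c in resources.get("components", [])}
    for component in new_components:
        merged[component["id"]] = component
    return {**resources, "components": list(merged.values())}


# Placeholder entries: ids, paths, and args are assumptions for illustration.
monitoring_components = [
    {
        "id": "sys_metrics_collector",
        "path": "<system metrics collector class>",
        "args": {"tags": {"site": "server", "env": "dev"}},
    },
    {
        "id": "statsd_reporter",
        "path": "<statsd reporter class>",
        "args": {"host": "localhost", "port": 9125},
    },
]

# Stand-in for the parsed contents of a site's resources.json.
existing = {
    "format_version": 2,
    "components": [{"id": "resource_manager", "path": "<existing class>", "args": {}}],
}
updated = add_components(existing, monitoring_components)
print(json.dumps(updated, indent=2))
```

To apply this to a real site, you would load the file with `json.load`, call `add_components`, and write the result back with `json.dump`.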


## Start up FLARE FL system with POC

Now we are ready to start the FLARE FL system.

1. Prepare POC:

    ```bash
    nvflare poc prepare
    ```

    This will prepare 1 server and 2 clients ("site-1", "site-2") and one admin console client (admin@nvidia.com). You can examine the output directory: `/tmp/nvflare/poc/example_project/prod_00`.

2. Start POC:
    ```bash
    nvflare poc start -ex admin@nvidia.com
    ```
    This will exclude the admin console service.

3. Run Job:
    See the "Run Job via CLI" section below.

4. Stop POC:
    After you complete the job run, you can stop the POC by:

    ```bash
    nvflare poc stop
    ```

## Run Job via CLI

To run the job from the command line, use the following commands:

```bash
# Generate job config folder
python3 fl_job.py -j /tmp/nvflare/jobs/job_config

# Submit the NVFlare job
nvflare job submit -j /tmp/nvflare/jobs/job_config/fedavg
```


## Monitoring View

Once the system is set up, you can view the metrics on the following pages.

### Statsd-exporter metrics view

<!-- markdown-link-check-disable -->
metrics page: "http://localhost:9102/metrics"

This page shows the metrics published to statsd-exporter, which Prometheus can scrape.
Here is a screenshot:

![screen shot](./figures/statsd_export_metrics_view.png)
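
The same page can also be read programmatically. The sketch below fetches the exporter's text output and splits it into samples. It assumes the default URL above and sample lines without trailing timestamps; the helper names and the `demo_task_count` metric are invented for this example.

```python
from urllib.request import urlopen


def fetch_metrics(url: str = "http://localhost:9102/metrics") -> str:
    """Download the raw metrics text (requires the exporter to be running)."""
    with urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


def parse_metrics(text: str) -> list:
    """Parse Prometheus text-format lines into (name, series, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        series, _, value = line.rpartition(" ")  # the value is the last token
        name = series.split("{", 1)[0]  # metric name without its labels
        samples.append((name, series, float(value)))
    return samples


# Made-up sample in the same text format, so the parser can be tried offline.
sample = """# HELP demo_task_count Example counter.
# TYPE demo_task_count counter
demo_task_count{site="site-1",env="dev"} 3
demo_task_count{site="site-2",env="dev"} 5
"""
for name, series, value in parse_metrics(sample):
    print(series, value)
```

With the containers running, `parse_metrics(fetch_metrics())` returns the live samples in the same shape.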


### Prometheus metrics view
The same metrics, once scraped by Prometheus, can be found at this URL:

<!-- markdown-link-check-disable -->
metrics page: "http://localhost:9090/metrics"
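
Prometheus also exposes an HTTP query API, which is handy for checking a single metric without opening the UI. The sketch below builds an instant-query URL for `/api/v1/query` and flattens the JSON response. It assumes Prometheus is reachable at `localhost:9090`, and it uses the built-in `up` metric only as a stand-in query; the helper names are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(expr: str, base: str = "http://localhost:9090") -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urlencode({"query": expr})
    return f"{base}/api/v1/query?{params}"


def extract_values(payload: dict) -> list:
    """Flatten an instant-vector response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload}")
    return [(r["metric"], float(r["value"][1])) for r in payload["data"]["result"]]


def query(expr: str) -> list:
    """Run the query against a live Prometheus (requires the containers)."""
    with urlopen(build_query_url(expr), timeout=5) as response:
        return extract_values(json.load(response))


# Hand-written response in the API's shape, so the parsing can be tried offline.
example = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "job": "prometheus"}, "value": [1700000000, "1"]},
        ],
    },
}
print(build_query_url("up"))
print(extract_values(example))
```

Replace `up` with one of the metric names shown on the statsd-exporter page to look up a FLARE metric.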


### Grafana Dashboard views

We can visualize them better via Grafana. 

<!-- markdown-link-check-disable -->
Visualization: http://localhost:3000

Here are two example metrics dashboards:

![Client heartbeat (before & after) time taken](./figures/grafana_plot_metrics_heatbeat_time_taken.png)

![task processed accumulated count](./figures/grafana_plot_metrics_view_task_count.png)



## Complete steps

Now, let's go to a terminal and follow all the steps to complete the exercise.

* install dependencies 

 ```
    pip install -r jobs/requirements.txt
    
 ```

* start monitoring systems (StatsD Exporter, Prometheus, and Grafana)
    
    ```
    cd setup 
    
    docker compose up -d

    cd ..
    ```


* prepare poc

```
    nvflare poc prepare -n  5

```

* prepare local site configurations

```
    # the argument stream_to_server = false
    
    jobs/prepare_local_config.sh false
```
   
* start poc 

```
    nvflare poc start -ex admin@nvidia.com 

```

* prepare data

```
   python jobs/data/download.py

```

* submit job


```bash
    cd jobs/setup-1/code

    ./submit_job.sh
```

* Monitor system performance

<!-- markdown-link-check-disable -->
statsd metrics page: "http://localhost:9102/metrics" 

<!-- markdown-link-check-disable -->
prometheus metrics page: "http://localhost:9090/metrics"

<!-- markdown-link-check-disable -->
grafana visualization: http://localhost:3000


* Stop POC

```
    nvflare poc stop

    nvflare poc clean
```
   
* Stop Monitoring Systems

```
    # run this from the setup directory, where docker-compose.yml lives
    docker compose down
```



## Setup 2: Client Metrics streamed to Server

In this setup, only the server site is connected to the monitoring system. This allows the server to monitor metrics on all client sites.

![setup-2](./figures/setup-2.png)

### Prepare Configuration for Setup 2: Client Metrics Streamed to Server

Similar to setup 1, we need to consider both job-level and system-level configurations.


#### Job Metrics Monitoring Configuration

We will configure the job to stream client metrics to the server. You can refer to [jobs/setup-2/code/fl_job.py](jobs/setup-2/code/fl_job.py).

Here is the configuration:

```python
job_name = "fedavg"

# add server side monitoring components
server_tags = {"site": "server", "env": "dev"}

metrics_reporter = StatsDReporter(site="server", host="localhost", port=9125)
metrics_collector = JobMetricsCollector(tags=server_tags, streaming_to_server=False)
remote_metrics_receiver = RemoteMetricsReceiver(events=[METRICS_EVENT_TYPE])

job.to_server(metrics_collector, id="server_job_metrics_collector")
job.to_server(metrics_reporter, id="statsd_reporter")
job.to_server(remote_metrics_receiver, id="remote_metrics_receiver")

# converts client-side metrics events into federated events sent to the server
fed_event_converter = ConvertToFedEvent(events_to_convert=[METRICS_EVENT_TYPE])

# clients (per-client loop, as in setup 1)
for i in range(n_clients):
    # ... (code skipped) ...

    client_site = f"site-{i + 1}"
    job.to(executor, client_site)

    # add client side monitoring components
    tags = {"site": client_site, "env": "dev"}

    metrics_collector = JobMetricsCollector(tags=tags)

    job.to(metrics_collector, target=client_site, id=f"{client_site}_job_metrics_collector")
    job.to(fed_event_converter, target=client_site, id="event_converter")
```

#### System Metrics Monitoring Configuration

As in setup 1, system metrics collection is configured through each site's local configuration files.

The steps are nearly the same as in setup 1, except for the following:

* prepare local configs

```bash
   # stream_to_server = true
   
   jobs/prepare_local_config.sh true
   
```

* submit job


```bash
    cd jobs/setup-2/code

    ./submit_job.sh
```


### Complete the rest of the steps

  * start monitoring system
  * start the POC
  * submit job
  * review the metrics and visualization
  * stop the POC 
  * stop monitoring system
