# Monitoring the NVIDIA FLARE System

## Introduction

Monitoring is a critical aspect of any distributed system, and federated learning is no exception. As federated learning systems operate across multiple sites and organizations, having visibility into system performance, job progress, and resource utilization becomes essential for troubleshooting, optimization, and ensuring reliable operation.

NVIDIA FLARE provides a monitoring solution that focuses on system metrics rather than machine learning metrics. Experiment tracking records training metrics such as loss and accuracy; FLARE's monitoring instead tracks the federated learning system itself: job lifecycle events, communication patterns, and system performance.

In this section, we'll explore how to set up and use NVIDIA FLARE's monitoring capabilities to gain insights into your federated learning deployments.

### Learning Objectives
By the end of this section, you will be able to:
- Understand the importance of system monitoring in federated learning
- Identify the different types of metrics collected by NVIDIA FLARE
- Configure various monitoring setups based on your deployment needs
- Implement monitoring components in your federated learning system
- Visualize and analyze system performance using industry-standard tools

## Monitoring Architecture

NVIDIA FLARE's monitoring system is built on industry-standard tools and follows a modular architecture:

1. **Metrics Collection**: FLARE components collect metrics from various system events
2. **Metrics Publishing**: These metrics are published in StatsD format to a metrics service (StatsD Exporter)
3. **Metrics Storage**: A time-series database (Prometheus) scrapes the metrics from that service and stores them
4. **Visualization**: A dashboard tool (Grafana) provides visual representations of the metrics

This architecture allows for flexible deployment options and integration with existing monitoring infrastructure.
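
To make the publishing step concrete, here is a minimal sketch of what a StatsD-format metric looks like on the wire. It formats one metric as `name:value|type` and sends it over UDP. The metric names, host, and port are illustrative assumptions (9125 is the usual StatsD Exporter listening port); in a real deployment FLARE's own reporter component does the publishing, so this is only meant to show the data format.

```python
import socket


def format_statsd_line(name: str, value: float, metric_type: str = "c") -> str:
    """Format one metric in the StatsD line protocol: <name>:<value>|<type>."""
    return f"{name}:{value}|{metric_type}"


def send_statsd_metric(line: str, host: str = "127.0.0.1", port: int = 9125) -> None:
    """Send one StatsD line over UDP (fire-and-forget, no delivery guarantee).

    The default host and port are assumptions for a local StatsD Exporter.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))
```

For example, `send_statsd_metric(format_statsd_line("demo_job_started_count", 1))` would emit a single counter increment under a hypothetical metric name.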

## Monitoring Setup Options

Depending on your deployment requirements and organizational constraints, NVIDIA FLARE supports three main monitoring setup options:

### 1. Shared Monitoring System

In this setup, all sites (server and clients) publish metrics to a central monitoring system. This provides a consolidated view of the entire federated learning system.

![setup-1](./figures/setup-1.png)

**Key characteristics:**
- Single monitoring infrastructure for all sites
- Consolidated view of all metrics
- Requires network connectivity from all sites to the monitoring system
- Simplest to manage and analyze

**Implementation steps:**
1. Install StatsD Exporter, Prometheus, and Grafana on a central monitoring server
2. Configure all FLARE sites to send metrics to the central StatsD Exporter
3. Configure Prometheus to scrape metrics from StatsD Exporter
4. Set up Grafana dashboards to visualize the metrics

> **Note**: Don't confuse StatsD Exporter and StatsD Reporter. StatsD Exporter is a service that receives metrics and exposes them for Prometheus to scrape, while StatsD Reporter is the FLARE component that sends metrics to the StatsD Exporter.
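
Once the steps above are in place, one quick sanity check is to read the page that Prometheus scrapes and confirm that FLARE metrics appear on it. The sketch below is a hedged example, not part of FLARE: the URL assumes a StatsD Exporter on its usual web port (9102) on the local machine, and the name fragment passed to the filter is whatever you expect your metric names to contain.

```python
from urllib.request import urlopen


def filter_metric_lines(exposition_text: str, name_fragment: str) -> list[str]:
    """Return sample lines whose metric name contains name_fragment.

    Comment lines (starting with '#') and blank lines are skipped.
    """
    matches = []
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first '{' (labels) or space (value).
        metric_name = line.split("{", 1)[0].split(" ", 1)[0]
        if name_fragment in metric_name:
            matches.append(line)
    return matches


def fetch_exporter_metrics(url: str = "http://localhost:9102/metrics") -> str:
    """Fetch the text page that Prometheus would scrape (assumed URL)."""
    with urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")
```

Calling `filter_metric_lines(fetch_exporter_metrics(), "_job_")` would then list any job-related samples the exporter currently exposes.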

### 2. Client-to-Server Metrics Forwarding

In this setup, clients forward their metrics to the server, which then publishes all metrics to a monitoring system. This is useful when clients cannot directly access the monitoring system.

![setup-2](figures/setup-2.png)

**Key characteristics:**
- Clients don't need direct access to the monitoring system
- Server acts as a metrics aggregator
- Consolidated view of all metrics
- Requires additional configuration for metrics forwarding

**Implementation steps:**
1. Install StatsD Exporter, Prometheus, and Grafana on a server accessible to the FL server
2. Configure clients to collect metrics and forward them to the server
3. Configure the server to receive client metrics and publish them to StatsD Exporter
4. Configure Prometheus to scrape metrics from StatsD Exporter
5. Set up Grafana dashboards to visualize the metrics

### 3. Individual Monitoring Systems

In this setup, each site (server and clients) has its own monitoring system. This is useful when organizations want to keep their monitoring data within their own infrastructure.

![setup-3](figures/setup-3.png)

**Key characteristics:**
- Each site maintains its own monitoring infrastructure
- No consolidated view of all metrics
- Maximum data privacy and security
- More complex to manage and analyze

**Implementation steps:**
1. Install StatsD Exporter, Prometheus, and Grafana at each site
2. Configure each FLARE site to send metrics to its local StatsD Exporter
3. Configure Prometheus at each site to scrape metrics from its local StatsD Exporter
4. Set up Grafana dashboards at each site to visualize the local metrics

## NVIDIA FLARE Monitoring Metrics

NVIDIA FLARE collects a wide range of metrics that provide insights into different aspects of the federated learning system. 

### Key Metrics

Here's a selection of important metrics collected by NVIDIA FLARE:

#### System Lifecycle Metrics
- **System Start/End**: Track when the system starts and stops
- **Client Connection**: Monitor client connections and disconnections
- **Resource Management**: Track resource allocation and availability

#### Job Lifecycle Metrics
- **Job Deployment**: Track when jobs are deployed to the system
- **Job Execution**: Monitor job starts, completions, aborts, and cancellations
- **Run Management**: Track the execution of individual runs within jobs

#### Federated Learning Metrics
- **Task Execution**: Monitor task assignment, execution, and completion
- **Data Transfer**: Track data movement between clients and server
- **Aggregation**: Monitor model aggregation operations
- **Training**: Track local training operations on clients
- **Round Management**: Monitor the progress of federated learning rounds


### Complete Metrics List

The following table shows the complete list of metrics collected by NVIDIA FLARE:

| Event | Metric Count | Metric Time Taken |
|-------|--------------|-------------------|
| SYSTEM_START | _system_start_count | |
| SYSTEM_END | _system_end_count | _system_time_taken |
| ABOUT_TO_START_RUN | _about_to_start_run_count | |
| START_RUN | _start_run_count | |
| ABOUT_TO_END_RUN | _about_to_end_run_count | |
| END_RUN | _end_run_count | _run_time_taken |
| CHECK_END_RUN_READINESS | _check_end_run_readiness_count | |
| SWAP_IN | _swap_in_count | |
| SWAP_OUT | _swap_out_count | |
| START_WORKFLOW | _start_workflow_count | |
| END_WORKFLOW | _end_workflow_count | _workflow_time_taken |
| ABORT_TASK | _abort_task_count | |
| FATAL_SYSTEM_ERROR | _fatal_system_error_count | |
| JOB_DEPLOYED | _job_deployed_count | |
| JOB_STARTED | _job_started_count | |
| JOB_COMPLETED | _job_completed_count | _job_time_taken |
| JOB_ABORTED | _job_aborted_count | |
| JOB_CANCELLED | _job_cancelled_count | |
| CLIENT_DISCONNECTED | _client_disconnected_count | |
| CLIENT_RECONNECTED | _client_reconnected_count | |
| BEFORE_PULL_TASK | _before_pull_task_count |  |
| AFTER_PULL_TASK | _after_pull_task_count | _pull_task_time_taken |
| BEFORE_PROCESS_TASK_REQUEST | _before_process_task_request_count | |
| AFTER_PROCESS_TASK_REQUEST | _after_process_task_request_count | _process_task_request_time_taken |
| BEFORE_PROCESS_SUBMISSION | _before_process_submission_count |  |
| AFTER_PROCESS_SUBMISSION | _after_process_submission_count | _process_submission_time_taken |
| BEFORE_TASK_DATA_FILTER | _before_task_data_filter_count |  |
| AFTER_TASK_DATA_FILTER | _after_task_data_filter_count | _data_filter_time_taken |
| BEFORE_TASK_RESULT_FILTER | _before_task_result_filter_count |  |
| AFTER_TASK_RESULT_FILTER | _after_task_result_filter_count | _result_filter_time_taken |
| BEFORE_TASK_EXECUTION | _before_task_execution_count |  |
| AFTER_TASK_EXECUTION | _after_task_execution_count | _task_execution_time_taken |
| BEFORE_SEND_TASK_RESULT | _before_send_task_result_count |  |
| AFTER_SEND_TASK_RESULT | _after_send_task_result_count | _send_task_result_time_taken |
| BEFORE_PROCESS_RESULT_OF_UNKNOWN_TASK | _before_process_result_of_unknown_task_count |  |
| AFTER_PROCESS_RESULT_OF_UNKNOWN_TASK | _after_process_result_of_unknown_task_count | _process_result_of_unknown_task_time_taken |
| PRE_RUN_RESULT_AVAILABLE | _pre_run_result_available_count | |
| BEFORE_CHECK_CLIENT_RESOURCES | _before_check_client_resources_count |  |
| AFTER_CHECK_CLIENT_RESOURCES | _after_check_client_resources_count | _check_client_resources_time_taken |
| SUBMIT_JOB | _submit_job_count | |
| DEPLOY_JOB_TO_SERVER | _deploy_job_to_server_count | |
| DEPLOY_JOB_TO_CLIENT | _deploy_job_to_client_count | |
| BEFORE_CHECK_RESOURCE_MANAGER | _before_check_resource_manager_count | |
| BEFORE_SEND_ADMIN_COMMAND | _before_send_admin_command_count | |
| BEFORE_CLIENT_REGISTER | _before_client_register_count | |
| AFTER_CLIENT_REGISTER | _after_client_register_count | _client_register_time_taken |
| CLIENT_REGISTER_RECEIVED | _client_register_received_count | |
| CLIENT_REGISTER_PROCESSED | _client_register_processed_count | |
| CLIENT_QUIT | _client_quit_count | |
| SYSTEM_BOOTSTRAP | _system_bootstrap_count | |
| BEFORE_AGGREGATION | _before_aggregation_count | |
| END_AGGREGATION | _end_aggregation_count | _aggregation_time_taken |
| RECEIVE_BEST_MODEL | _receive_best_model_count | |
| BEFORE_TRAIN | _before_train_count | |
| AFTER_TRAIN | _after_train_count | _train_time_taken |
| TRAIN_DONE | _train_done_count | |
| TRAINING_STARTED | _training_count | |
| TRAINING_FINISHED | _training_count | _training_time_taken |
| ROUND_STARTED | _round_started_count | |
| ROUND_DONE | _round_done_count | _round_time_taken |
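
Most count metrics in the table follow a simple pattern: the event name in lower case, wrapped in a leading underscore and a `_count` suffix. The sketch below encodes that observation so you can build dashboard queries programmatically. It is derived only from reading the table, not from FLARE's source code, and it does not cover every row (for example, `TRAINING_STARTED` and `TRAINING_FINISHED` are both listed with `_training_count`). Time-taken names are less regular, so a few are copied from the table explicitly.

```python
def count_metric_name(event_name: str) -> str:
    """Derive the count metric suffix for an event.

    Example: JOB_STARTED -> _job_started_count (pattern observed in the table).
    """
    return f"_{event_name.lower()}_count"


# Time-taken metrics do not follow one mechanical rule, so a few rows from
# the table are listed explicitly: closing event -> time-taken metric.
TIME_TAKEN_METRICS = {
    "SYSTEM_END": "_system_time_taken",
    "END_RUN": "_run_time_taken",
    "JOB_COMPLETED": "_job_time_taken",
    "AFTER_PULL_TASK": "_pull_task_time_taken",
    "ROUND_DONE": "_round_time_taken",
}
```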

## NVIDIA FLARE Monitoring Components

NVIDIA FLARE provides several components for collecting and publishing metrics. Understanding these components is essential for configuring your monitoring setup.

### Key Monitoring Components

1. **StatsDReporter**: Publishes collected metrics to a StatsD Exporter service
2. **JobMetricsCollector**: Collects job-level metrics and publishes them to the databus
3. **SysMetricsCollector**: Collects system-level metrics from the parent processes
4. **RemoteMetricsReceiver**: Receives metrics streamed from clients and publishes them

### Component Configuration

Let's look at how to configure these components for different monitoring setups.

> **Note**: NVIDIA FLARE uses a simple JSON format for component configuration:
```json
{ 
   "id": "<component_identifier>",
   "path": "<component_class_path>",
   "args": {
       "<arg_name>": "<arg_value>"
   }
}
```
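
Because every component entry has the same shape, a small check can catch typos before a site is started. The following is a minimal sketch of such a check, not a FLARE utility. It assumes `id` and `path` are required strings, that `path` is a dotted class path, and that `args` is optional; adjust these rules if your deployment differs.

```python
def validate_component(entry: dict) -> list[str]:
    """Return a list of problems found in one component entry (empty if none)."""
    problems = []
    for key in ("id", "path"):
        value = entry.get(key)
        if not isinstance(value, str) or not value:
            problems.append(f"'{key}' must be a non-empty string")
    # "args" is treated as optional here; when present it must be an object.
    if "args" in entry and not isinstance(entry["args"], dict):
        problems.append("'args' must be an object")
    path = entry.get("path")
    if isinstance(path, str) and path and "." not in path:
        problems.append("'path' should be a dotted class path")
    return problems
```

An empty list means the entry passed these basic checks; it does not prove that the class path can actually be imported.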

#### Setup 1: Shared Monitoring System

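In this setup, all sites post metrics to a common StatsD Exporter service. Each site needs both JobMetricsCollector and SysMetricsCollector components. The snippets below are fragments, not complete files: add the entries to the component list in the named file and replace the `<...>` placeholders with your own values (the port is a number, so it is written without quotes).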

**Client Configuration (`fed_config_client.json`):**

```json
{
    "id": "job_metrics_collector",
    "path": "nvflare.metrics.job_metrics_collector.JobMetricsCollector",
    "args": {
        "tags": {
            "site": "site_1",
            "env": "dev"
        }
    }
},
{
    "id": "statsd_reporter",
    "path": "nvflare.fuel_opt.statsd.statsd_reporter.StatsDReporter",
    "args": {
        "host": "<statsd_exporter_host>",
        "port": <statsd_exporter_port>
    }
}
```

**Server Configuration (`fed_config_server.json`):**

```json
{
    "id": "job_metrics_collector",
    "path": "nvflare.metrics.job_metrics_collector.JobMetricsCollector",
    "args": {
        "tags": {
            "site": "server",
            "env": "dev"
        }
    }
},
{
    "id": "statsd_reporter",
    "path": "nvflare.fuel_opt.statsd.statsd_reporter.StatsDReporter",
    "args": {
        "host": "<statsd_exporter_host>",
        "port": <statsd_exporter_port>
    }
}
```

**System Metrics Configuration (`resources.json`):**

For system-level metrics, you need to configure the SysMetricsCollector in the local resources configuration file for each site. This is done by creating or modifying the `resources.json` file in the site's local directory.

```json
{
    "id": "sys_metrics_collector",
    "path": "nvflare.metrics.sys_metrics_collector.SysMetricsCollector",
    "args": {
        "tags": {
            "site": "<site_name>",
            "env": "dev"
        }
    }
},
{
    "id": "statsd_reporter",
    "path": "nvflare.fuel_opt.statsd.statsd_reporter.StatsDReporter",
    "args": {
        "host": "<statsd_exporter_host>",
        "port": <statsd_exporter_port>
    }
}
```
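
The Setup 1 fragments differ only in the collector type, the site tag, and the exporter address, so they can be generated rather than copied by hand. The sketch below builds the two entries for one site using the class paths shown above. The function name, the `system_level` flag, and the example host and port are hypothetical; where the resulting list goes in each file follows your existing configuration.

```python
import json

JOB_COLLECTOR_PATH = "nvflare.metrics.job_metrics_collector.JobMetricsCollector"
SYS_COLLECTOR_PATH = "nvflare.metrics.sys_metrics_collector.SysMetricsCollector"
STATSD_REPORTER_PATH = "nvflare.fuel_opt.statsd.statsd_reporter.StatsDReporter"


def build_components(site: str, env: str, host: str, port: int,
                     system_level: bool = False) -> list[dict]:
    """Build the collector and reporter entries for one site.

    system_level=False yields the job-level pair, True the system-level pair.
    """
    collector_id = "sys_metrics_collector" if system_level else "job_metrics_collector"
    collector_path = SYS_COLLECTOR_PATH if system_level else JOB_COLLECTOR_PATH
    return [
        {"id": collector_id, "path": collector_path,
         "args": {"tags": {"site": site, "env": env}}},
        {"id": "statsd_reporter", "path": STATSD_REPORTER_PATH,
         "args": {"host": host, "port": port}},
    ]
```

For instance, `print(json.dumps(build_components("site_1", "dev", "localhost", 9125), indent=4))` prints the job-level pair for `site_1` with an assumed local exporter address.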

#### Setup 2: Client-to-Server Metrics Forwarding

In this setup, clients forward their metrics to the server, which then publishes all metrics to a monitoring system.

**Client Configuration (`fed_config_client.json`):**

```json
{
    "id": "job_metrics_collector",
    "path": "nvflare.metrics.job_metrics_collector.JobMetricsCollector",
    "args": {
        "tags": {
            "site": "site_1",
            "env": "dev"
        },
        "streaming_to_server": true
    }
},
{
    "id": "event_convertor",
    "path": "nvflare.app_common.widgets.convert_to_fed_event.ConvertToFedEvent",
    "args": {
        "events_to_convert": ["metrics_event"]
    }
}
```

**Server Configuration (`fed_config_server.json`):**

```json
{
    "id": "job_metrics_collector",
    "path": "nvflare.metrics.job_metrics_collector.JobMetricsCollector",
    "args": {
        "tags": {
            "site": "server",
            "env": "dev"
        }
    }
},
{
    "id": "statsd_reporter",
    "path": "nvflare.fuel_opt.statsd.statsd_reporter.StatsDReporter",
    "args": {
        "host": "<statsd_exporter_host>",
        "port": <statsd_exporter_port>
    }
},
{
    "id": "remote_metrics_receiver",
    "path": "nvflare.metrics.remote_metrics_reciever.RemoteMetricsReceiver",
    "args": {
        "events": ["fed.metrics_event"]
    }
}
```
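
In the two configurations above, the client converts `metrics_event` and the server listens for `fed.metrics_event`. A mismatch between those names would silently drop client metrics, so it can be worth checking them together. The sketch below assumes, based only on these two snippets, that forwarded events arrive with a `fed.` prefix; the function name is hypothetical.

```python
FED_EVENT_PREFIX = "fed."  # assumed prefix, inferred from the two configs above


def missing_server_events(client_components: list[dict],
                          server_components: list[dict]) -> list[str]:
    """Return forwarded client events that no server component listens for."""
    forwarded = []
    for comp in client_components:
        forwarded.extend(comp.get("args", {}).get("events_to_convert", []))
    listened = set()
    for comp in server_components:
        listened.update(comp.get("args", {}).get("events", []))
    return [FED_EVENT_PREFIX + e for e in forwarded
            if FED_EVENT_PREFIX + e not in listened]
```

An empty result means every event the client forwards has a matching listener on the server side.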



**System Metrics Configuration:**

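The system metrics configuration follows Setup 1, with one difference: only the server's `resources.json` needs the StatsDReporter. Clients omit it because their metrics reach the monitoring system through the server.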

#### Setup 3: Individual Monitoring Systems for Each Site

In this setup, each site (server and clients) has its own dedicated monitoring infrastructure.

The configuration is similar to Setup 1, but with each site pointing to its own StatsD Exporter instance:

- Each site uses the same components (JobMetricsCollector, StatsDReporter, etc.)
- The `host` and `port` parameters in the StatsDReporter configuration are unique for each site
- This allows for independent monitoring and potentially better isolation between sites

This approach is useful when sites are in different network environments or when you need separate monitoring dashboards for each participant.
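
Since Setup 3 relies on each site pointing at its own exporter, a simple check can confirm that no two sites were accidentally given the same address. This is a minimal sketch under the assumption that you keep a mapping from site name to `(host, port)`; the host names in the usage example are made up.

```python
from collections import Counter


def shared_endpoints(
    site_endpoints: dict[str, tuple[str, int]],
) -> dict[tuple[str, int], list[str]]:
    """Map each (host, port) used by more than one site to the sites using it."""
    counts = Counter(site_endpoints.values())
    return {
        endpoint: sorted(site for site, ep in site_endpoints.items() if ep == endpoint)
        for endpoint, n in counts.items()
        if n > 1
    }
```

For example, `shared_endpoints({"server": ("mon-a", 9125), "site_1": ("mon-b", 9125)})` returns an empty dictionary because the two sites use different hosts.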

## Practical Example: Monitoring a Federated Learning Job

To see how monitoring works in practice, see the [job example notebook](job_example.ipynb).


## Summary

In this section, we've explored NVIDIA FLARE's monitoring capabilities, including:

- Different monitoring setup options based on deployment needs
- The wide range of metrics collected by NVIDIA FLARE
- How to configure monitoring components for different setups

Effective monitoring is essential for maintaining reliable federated learning systems, optimizing performance, and troubleshooting issues. By implementing the monitoring approaches described in this section, you can gain valuable insights into your federated learning deployments.