# NVIDIA FLARE System Architecture

## Introduction
In the previous section, we explored the core concepts of NVIDIA FLARE. Now, we'll dive deeper into the system architecture that brings these concepts to life. Understanding this architecture will help you appreciate how NVIDIA FLARE enables secure, scalable, and flexible federated computing.

### Learning Objectives
By the end of this section, you will be able to:
- Describe the layered architecture of NVIDIA FLARE
- Explain the FLARE Communication Interface (FCI) and its capabilities
- Understand the federated job processing architecture
- Identify the different types of APIs available in NVIDIA FLARE
- Navigate the configuration system and job templates

## NVIDIA FLARE: Key Characteristics

Before examining the technical architecture, let's understand what makes NVIDIA FLARE unique as a federated computing platform:

* **Open Source**: Released under Apache License 2.0 to foster research and development in federated learning
  
* **Production-Ready**: Designed with enterprise requirements in mind, including security, scalability, and reliability
   
* **Hardware Flexibility**: Capable of running on CPU, GPU, and Multi-GPU environments

* **Global Collaboration**: Enables cross-country, distributed, multi-party collaborative learning

* **Enterprise Scalability**: Offers high availability and multi-task execution capabilities

* **Universal Applicability**: Framework, model, domain, and task agnostic design

* **Extensible Architecture**: Layered, pluggable, customizable federated compute architecture

## The Layered Architecture

NVIDIA FLARE is built using a layered architecture, where each layer provides specific functionality and builds upon the layers below it. This design enables modularity, extensibility, and separation of concerns.

<img src="./flare_overview.png" alt="FLARE Architecture" width="700" height="400">

Let's explore each layer from bottom to top:

### 1. Network Communication Layer

At the foundation of NVIDIA FLARE is the network communication layer, which handles all data exchange between participants in the federated system.

**Key components:**
- **Communication drivers**: gRPC, HTTP + WebSocket, TCP, and plugin drivers
- **CellNet**: Logical end-to-end (cell to cell) network
- **Message handling**: Reliable streaming message capabilities

### 2. Federated Computing Layer

Built on top of the communication layer, the federated computing layer manages the execution of federated workflows.

**Key capabilities:**
- **Job management**: Resource-based scheduling, monitoring, concurrent lifecycle management
- **High availability**: Ensuring system reliability even when components fail
- **Component management**: Handling plugin components
- **Configuration**: Managing system and job configurations
- **Event handling**: Processing both local and federated events

### 3. Privacy & Security Layer

This critical layer ensures that federated learning workflows maintain data privacy and system security.

**Key capabilities:**
- **Authentication**: Verifying the identity of participants using certificates
- **Authorization**: Controlling access to system resources based on roles
- **Secure communication**: Encrypting data in transit between participants
- **Privacy-preserving techniques**: Differential privacy, homomorphic encryption, secure aggregation
- **Audit logging**: Recording system activities for compliance and security analysis
- **Data governance**: Enforcing policies on what data can be shared
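
To give a feel for one of these techniques, the sketch below illustrates the idea behind secure aggregation with pairwise masks: each pair of clients shares a random mask that one adds and the other subtracts, so the server only learns the sum. This is a toy illustration, not NVIDIA FLARE's implementation. The function names, the modulus, and the single seeded random generator (standing in for a real pairwise key agreement) are all assumptions made for the example.

```python
import random
from typing import Dict, List

MODULUS = 2**32  # all arithmetic is done modulo this value (assumed for the toy)


def make_pairwise_masks(client_ids: List[str], length: int, seed: int = 0) -> Dict[str, List[int]]:
    """For each client pair (a, b), a adds a shared random mask and b subtracts it."""
    rng = random.Random(seed)  # stand-in for secrets agreed between each pair
    masks = {cid: [0] * length for cid in client_ids}
    for i, a in enumerate(client_ids):
        for b in client_ids[i + 1:]:
            shared = [rng.randrange(MODULUS) for _ in range(length)]
            masks[a] = [(m + s) % MODULUS for m, s in zip(masks[a], shared)]
            masks[b] = [(m - s) % MODULUS for m, s in zip(masks[b], shared)]
    return masks


def mask_update(update: List[int], mask: List[int]) -> List[int]:
    """Client side: hide the true update behind the client's combined mask."""
    return [(u + m) % MODULUS for u, m in zip(update, mask)]


def aggregate(masked_updates: List[List[int]]) -> List[int]:
    """Server side: summing all masked updates cancels every pairwise mask."""
    return [sum(column) % MODULUS for column in zip(*masked_updates)]


updates = {"site-1": [1, 2, 3], "site-2": [10, 20, 30], "site-3": [100, 200, 300]}
masks = make_pairwise_masks(list(updates), length=3)
masked = {cid: mask_update(u, masks[cid]) for cid, u in updates.items()}
total = aggregate(list(masked.values()))
print(total)  # [111, 222, 333]
```

The server sees only the masked vectors, yet the aggregate equals the sum of the original updates because every mask appears once with each sign.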

### 4. Federated Workflow Layer

This layer implements specific patterns for federated execution.

**Some of the supported workflows:**
- **Scatter and Gather (SAG)**: The classic federated learning pattern
- **Cyclic**: Sequential training across clients
- **Cross-site Evaluation**: Evaluating models across different sites
- **Swarm Learning**: Peer-to-peer federated learning
- **Federated Analytics**: Analyzing distributed data without sharing raw data
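
As a rough illustration of the scatter-and-gather pattern, the following sketch sends the global weights to every site, collects the locally "trained" weights, and combines them with a sample-weighted average (the FedAvg aggregation rule). It is a minimal, framework-free sketch: `local_train`, `weighted_average`, `scatter_and_gather`, and the site data are hypothetical names invented for this example, and local training is replaced by a simple offset.

```python
from typing import Dict, List, Tuple

Weights = Dict[str, float]


def local_train(global_weights: Weights, site_offset: float) -> Weights:
    """Stand-in for local training: shift every weight by a site-specific offset."""
    return {name: value + site_offset for name, value in global_weights.items()}


def weighted_average(results: List[Tuple[Weights, int]]) -> Weights:
    """Gather step: average client weights, weighted by local sample counts."""
    total_samples = sum(n for _, n in results)
    keys = results[0][0].keys()
    return {k: sum(w[k] * n for w, n in results) / total_samples for k in keys}


def scatter_and_gather(global_weights: Weights,
                       sites: Dict[str, Tuple[float, int]],
                       num_rounds: int) -> Weights:
    """Run num_rounds of: scatter weights, train locally, gather and aggregate."""
    for _ in range(num_rounds):
        results = [(local_train(global_weights, offset), n) for offset, n in sites.values()]
        global_weights = weighted_average(results)
    return global_weights


# Each site is described by (training offset, number of local samples).
sites = {"site-1": (1.0, 100), "site-2": (3.0, 300)}
final_weights = scatter_and_gather({"w": 0.0, "b": 0.0}, sites, num_rounds=2)
print(final_weights)  # {'w': 5.0, 'b': 5.0}
```

Because site-2 holds three times as many samples, its update pulls the average three times as hard: each round moves the weights by 2.5 rather than by the unweighted mean of 2.0.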

### 5. Programming APIs Layer

This layer provides interfaces for developers to interact with the NVIDIA FLARE system at different levels of abstraction.

**Key APIs:**
- **Low-level APIs**: Controller and Executor APIs for custom workflow development
- **High-level APIs**: ModelController and Client APIs for simplified ML workflows
- **Job APIs**: For job configuration and management
- **Admin APIs**: For system administration and monitoring
- **Integration APIs**: For connecting with external ML frameworks (PyTorch, TensorFlow, etc.)

### 6. Federated Learning Algorithms Layer

At this layer, specific federated learning algorithms are implemented using the capabilities provided by the lower layers.

**Examples include:**
- FedAvg (Federated Averaging)
- FedOpt (Federated Optimization)
- FedProx (Federated Proximal)
- Scaffold
- And many more specialized algorithms

### 7. Tools Layer

The top layer provides tools for both development and production environments.

#### Development Tools
- **Simulator**: For testing federated learning workflows in a simulated environment
- **POC Mode**: For creating proof-of-concept deployments on a single machine
- **Debugging utilities**: For troubleshooting federated learning applications
- **Visualization tools**: For analyzing results and system performance
- **Job templates**: Pre-configured job templates for common scenarios

#### Production Tools
- **Provisioning system**: For creating secure startup kits for participants
- **Admin console**: For managing the federated learning system
- **Monitoring dashboard**: For tracking system health and job progress
- **Deployment utilities**: For managing system deployment across organizations
- **Security management**: For handling certificates, keys, and access control

## FLARE Communication Interface (FCI)

The FLARE Communication Interface (FCI) underlies all communication in the federated system, so it is worth a closer look.

### What is FCI?

FCI is a logical network framework that supports asynchronous, two-way communication through multiple transport mechanisms. It's designed to be flexible, efficient, and secure.

### Key Capabilities of FCI

* **Pluggable Architecture**: Supports different messaging patterns (request-response, broadcast, pub/sub) and transport mechanisms (TCP, Pipe, HTTP/WS, gRPC) through drivers

* **Efficient Streaming**: Large binary data can be streamed in small chunks to minimize memory usage

* **Full-duplex Communication**: Both sides can send messages to each other without polling (if the transport supports it)

* **Multiplexing**: Multiple conversations can occur over the same connection simultaneously using stream IDs

* **Asynchronous Messaging**: Can send/receive messages asynchronously (fire-and-forget, message listening)

* **Client-Initiated Connections**: All TCP-based connections can be initiated from clients, eliminating the need for clients to expose ports

* **Inter-Process Communication**: Supports communication between processes through pipes or sockets

* **Built-in Heartbeats**: Maintains connection health through heartbeat mechanisms
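
Two of these capabilities, chunked streaming and multiplexing, can be sketched in a few lines. The example below splits two payloads into small frames tagged with a stream ID, interleaves them on one simulated connection, and reassembles them on the receiving side. The `Frame` fields and helper names are assumptions chosen for illustration and do not reflect FCI's actual frame format.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List


@dataclass
class Frame:
    stream_id: int   # identifies which conversation the frame belongs to
    seq: int         # position of the chunk within its stream
    last: bool       # True on the final chunk of a stream
    payload: bytes


def to_frames(stream_id: int, data: bytes, chunk_size: int) -> Iterator[Frame]:
    """Lazily yield the payload as chunks of at most chunk_size bytes."""
    total = max(1, -(-len(data) // chunk_size))  # ceiling division, at least one frame
    for seq in range(total):
        chunk = data[seq * chunk_size:(seq + 1) * chunk_size]
        yield Frame(stream_id, seq, seq == total - 1, chunk)


def interleave(*streams: Iterable[Frame]) -> Iterator[Frame]:
    """Round-robin frames from several streams onto one simulated connection."""
    iterators = [iter(s) for s in streams]
    while iterators:
        for it in list(iterators):
            try:
                yield next(it)
            except StopIteration:
                iterators.remove(it)


def reassemble(frames: Iterable[Frame]) -> Dict[int, bytes]:
    """Receiver side: collect chunks per stream ID and join them on the last frame."""
    buffers: Dict[int, List[bytes]] = {}
    done: Dict[int, bytes] = {}
    for frame in frames:
        buffers.setdefault(frame.stream_id, []).append(frame.payload)
        if frame.last:
            done[frame.stream_id] = b"".join(buffers.pop(frame.stream_id))
    return done


model_a = b"A" * 10
model_b = b"B" * 4
wire = list(interleave(to_frames(1, model_a, 4), to_frames(2, model_b, 4)))
received = reassemble(wire)
print([f.stream_id for f in wire])  # [1, 2, 1, 1]
```

The short message on stream 2 is delivered after a single frame instead of waiting behind the larger transfer on stream 1, which is the practical benefit of multiplexing.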

### FCI Architecture

FCI itself has a layered architecture:

<img src="./fci.png" alt="FLARE Communication Interface" width="300" height="400">

* **API Layer**: Exposes interfaces like Communicator and CellNet to application developers

* **Streamable Framed Message (SFM)**: The core of FCI that provides abstraction over different communication protocols and manages endpoints and connections

* **Transport Drivers**: Responsible for sending frames to other endpoints, treating frames as opaque bytes

**Why is this important?** The pluggable nature of FCI means you can switch transport drivers without affecting the application layers. This provides flexibility in deployment and allows for adaptation to different network environments and security requirements.
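
The separation between application code and transport can be pictured with a small sketch. Here a hypothetical `Driver` interface moves opaque byte frames, and a `Messenger` class (standing in for the application-facing layer) works unchanged with two different drivers. None of these class names come from NVIDIA FLARE; they are assumptions made for illustration.

```python
import queue
from abc import ABC, abstractmethod
from typing import Callable


class Driver(ABC):
    """Hypothetical transport driver: it moves opaque frames and nothing more."""

    @abstractmethod
    def send(self, frame: bytes) -> None:
        ...


class CallbackDriver(Driver):
    """Delivers each frame immediately to a callback (an in-process 'transport')."""

    def __init__(self, on_frame: Callable[[bytes], None]) -> None:
        self._on_frame = on_frame

    def send(self, frame: bytes) -> None:
        self._on_frame(frame)


class QueueDriver(Driver):
    """Buffers frames in a queue, as a transport with a receive loop might."""

    def __init__(self) -> None:
        self.frames: "queue.Queue[bytes]" = queue.Queue()

    def send(self, frame: bytes) -> None:
        self.frames.put(frame)


class Messenger:
    """Application-facing layer: encodes a message and hands the frame to a driver."""

    def __init__(self, driver: Driver) -> None:
        self._driver = driver

    def send_message(self, topic: str, body: str) -> None:
        self._driver.send(f"{topic}\n{body}".encode("utf-8"))


# The same application call works with either driver.
received = []
Messenger(CallbackDriver(received.append)).send_message("train", "round=1")

queue_driver = QueueDriver()
Messenger(queue_driver).send_message("train", "round=1")
queued = queue_driver.frames.get_nowait()
print(received[0] == queued)  # True
```

Swapping the driver changes how frames travel but not what the application sends, which mirrors the role the transport drivers play beneath SFM.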

## Federated Job Processing Architecture

NVIDIA FLARE uses a job-based architecture to manage federated workflows. This design enables concurrent execution of multiple federated tasks and provides isolation between different workflows.

<img src="./system_architecture.png" alt="FLARE System Architecture" width="700" height="400">

### Key Components

* **Parent Control Processes**: Each site (server and clients) has a parent control process that manages job execution

* **Job Processes**: Individual jobs run in separate processes, providing isolation and resource management

* **Job Scheduler**: Allocates resources and manages job execution based on priorities and resource availability

* **Job Monitor**: Tracks job status and performance metrics
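
To make resource-based scheduling concrete, here is a toy scheduler that starts a job only when every site it targets has enough free GPUs, and retries waiting jobs once resources are released. It is a simplified sketch under stated assumptions: `ToyScheduler`, `Job`, and the GPU-count resource model are invented for this example and ignore priorities, which a real scheduler would also consider.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List


@dataclass
class Job:
    name: str
    gpus_needed: Dict[str, int]  # site name -> GPUs required at that site


class ToyScheduler:
    def __init__(self, free_gpus: Dict[str, int]) -> None:
        self.free_gpus = dict(free_gpus)
        self.pending: Deque[Job] = deque()
        self.running: Dict[str, Job] = {}

    def submit(self, job: Job) -> None:
        self.pending.append(job)

    def _fits(self, job: Job) -> bool:
        """A job fits only if every targeted site has enough free GPUs."""
        return all(self.free_gpus.get(site, 0) >= n for site, n in job.gpus_needed.items())

    def schedule(self) -> List[str]:
        """Try each pending job once; start those that fit, requeue the rest."""
        started = []
        for _ in range(len(self.pending)):
            job = self.pending.popleft()
            if self._fits(job):
                for site, n in job.gpus_needed.items():
                    self.free_gpus[site] -= n
                self.running[job.name] = job
                started.append(job.name)
            else:
                self.pending.append(job)  # keep waiting, preserving order
        return started

    def finish(self, name: str) -> None:
        """Release the resources held by a completed job."""
        job = self.running.pop(name)
        for site, n in job.gpus_needed.items():
            self.free_gpus[site] += n


scheduler = ToyScheduler({"site-1": 2, "site-2": 1})
scheduler.submit(Job("job-a", {"site-1": 1, "site-2": 1}))
scheduler.submit(Job("job-b", {"site-1": 1, "site-2": 1}))
scheduler.submit(Job("job-c", {"site-1": 1}))

first_wave = scheduler.schedule()   # job-b must wait for the GPU on site-2
scheduler.finish("job-a")
second_wave = scheduler.schedule()
print(first_wave, second_wave)  # ['job-a', 'job-c'] ['job-b']
```

Two jobs run concurrently in the first wave, and the third starts as soon as the resources it needs are freed.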

### Benefits of the Job Architecture

* **Concurrency**: Multiple jobs can run simultaneously
* **Isolation**: Jobs are isolated from each other, preventing interference
* **Resource Management**: Resources can be allocated based on job requirements
* **Fault Tolerance**: Failures in one job don't affect others
* **Lifecycle Management**: Jobs can be started, paused, resumed, and terminated independently

## Event-Based System

NVIDIA FLARE uses an event-driven architecture, where components communicate through events rather than direct method calls. This design enables loose coupling and extensibility.

### How It Works

* All NVIDIA FLARE components (derived from FLComponent) can handle and fire events via the runtime engine
* Components can register to listen for specific events
* When an event occurs, all registered listeners are notified
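
The pattern can be sketched without the framework. In this toy version, every component exposes a `handle_event` method, the engine delivers each fired event to all components, and each component decides which event types it cares about. `ToyComponent`, `ToyEngine`, and the event names `ROUND_DONE` and `JOB_DONE` are hypothetical stand-ins, not NVIDIA FLARE classes or event types.

```python
from typing import Any, Dict, List, Optional


class ToyComponent:
    def handle_event(self, event_type: str, ctx: Dict[str, Any]) -> None:
        """Override to react to events; the default ignores everything."""


class ToyEngine:
    def __init__(self) -> None:
        self._components: List[ToyComponent] = []

    def add(self, component: ToyComponent) -> None:
        self._components.append(component)

    def fire_event(self, event_type: str, ctx: Dict[str, Any]) -> None:
        """Deliver the event to every component; each one filters for itself."""
        for component in self._components:
            component.handle_event(event_type, ctx)


class RoundLogger(ToyComponent):
    def __init__(self) -> None:
        self.lines: List[str] = []

    def handle_event(self, event_type: str, ctx: Dict[str, Any]) -> None:
        if event_type == "ROUND_DONE":
            self.lines.append(f"round {ctx['round']} accuracy={ctx['accuracy']}")


class BestModelTracker(ToyComponent):
    def __init__(self) -> None:
        self.best_round: Optional[int] = None
        self.best_accuracy = float("-inf")

    def handle_event(self, event_type: str, ctx: Dict[str, Any]) -> None:
        if event_type == "ROUND_DONE" and ctx["accuracy"] > self.best_accuracy:
            self.best_accuracy = ctx["accuracy"]
            self.best_round = ctx["round"]


engine = ToyEngine()
logger, tracker = RoundLogger(), BestModelTracker()
engine.add(logger)
engine.add(tracker)

for round_num, accuracy in enumerate([0.61, 0.74, 0.70]):
    engine.fire_event("ROUND_DONE", {"round": round_num, "accuracy": accuracy})
engine.fire_event("JOB_DONE", {})  # ignored by both components

print(tracker.best_round, len(logger.lines))  # 1 3
```

Neither component knows the other exists, and the code that fires `ROUND_DONE` does not change when a new listener is added.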

### Benefits

* **Extensibility**: New functionality can be added by creating components that listen for existing events
* **Loose Coupling**: Components don't need direct knowledge of each other
* **Customization**: Users can write custom FLComponents as plugins to listen for events and implement specialized logic at any layer

## Federated Learning Framework

Building on the core architecture, NVIDIA FLARE provides implementations of various federated learning algorithms and workflows.

### Available Algorithms

* **FedAvg**: The classic federated averaging algorithm
* **FedOpt**: Federated optimization with server-side optimization
* **FedProx**: Federated learning with proximal terms for stability
* **Scaffold**: Stochastic Controlled Averaging for variance reduction
* **Cyclic Learning**: Sequential training across clients
* **Swarm Learning**: Peer-to-peer federated learning
* **Split Learning**: Splitting model layers between client and server

### Resources

You can find examples and tutorials for these algorithms on the [NVIDIA FLARE website](https://nvidia.github.io/NVFlare/) and in the [tutorial catalog](https://nvidia.github.io/NVFlare/catalog/).

## Enterprise Security and Privacy

NVIDIA FLARE includes features to support enterprise security requirements and privacy-enhancing technologies (PETs). These topics are covered in detail in [Part-3 Security and Privacy](../../../part-3_security_and_privacy/part-3_introduction.ipynb).

## Simulation Capabilities

NVIDIA FLARE provides tools for simulating federated learning workflows, which is essential for development and testing.

### Simulation Tools

* **Python API**: Programmatic control of simulations
* **Command-Line Interface (CLI)**: Running simulations from the command line

You've already seen the Job API and simulator CLI in [Chapter-1](../../../part-1_federated_learning_introduction/Chapter-1_running_federated_learning_applications/01.0_introduction/introduction.ipynb). In [Section 3.2](../03.2_deployment_simulation/simulate_real_world_deployment.ipynb), we'll explore how to simulate deployment on a local machine.

## Setup and Deployment

Setting up a federated computing system involves multiple steps and considerations. NVIDIA FLARE provides tools to simplify this process, which we'll discuss in [Chapter 4](../../chapter-4_setup_federated_system/04.0_introduction/introduction.ipynb).

## NVIDIA FLARE APIs

NVIDIA FLARE provides multiple APIs at different levels of abstraction, allowing developers to choose the right level for their needs.

### Python APIs

#### 1. Controller and Executor API
* Low-level APIs that provide full control over federated computing
* Enable custom federated workflows and algorithms
* Offer maximum flexibility but require more detailed implementation

#### 2. ModelController and Client API
* Higher-level APIs based on the FLModel data structure
* Simplify common machine learning and deep learning workflows
* The FLModel structure captures model parameters, metrics, and metadata:

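```python
# Abridged constructor signature from nvflare/app_common/abstract/fl_model.py.
# ParamsType is an enum defined in the same module.
from typing import Any, Dict, Optional, Union


class FLModel:
    def __init__(
        self,
        params_type: Union[None, str, ParamsType] = None,
        params: Any = None,
        optimizer_params: Any = None,
        metrics: Optional[Dict] = None,
        start_round: Optional[int] = 0,
        current_round: Optional[int] = None,
        total_rounds: Optional[int] = None,
        meta: Optional[Dict] = None,
    ):
        ...
```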

* On the server side, ModelController consumes and produces FLModel objects
* On the client side, the Client API receives and sends model updates via FLModel
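
The exchange can be pictured with a small stand-in. The sketch below defines `ToyFLModel`, a dataclass carrying a subset of the FLModel fields shown above, and passes it between a simulated server round and simulated clients. In a real job these hand-offs go through the ModelController and the Client API; here plain function calls stand in for them, and `client_step`, `server_round`, and the training arithmetic are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ToyFLModel:
    """Stand-in carrying a subset of the FLModel fields."""
    params: Any = None
    metrics: Optional[Dict] = None
    current_round: Optional[int] = None
    total_rounds: Optional[int] = None
    meta: Dict = field(default_factory=dict)


def client_step(received: ToyFLModel, local_value: float) -> ToyFLModel:
    """Client side: take the global params, 'train', return params plus metrics."""
    new_params = {k: (v + local_value) / 2 for k, v in received.params.items()}
    loss = sum(abs(v - local_value) for v in new_params.values())
    return ToyFLModel(
        params=new_params,
        metrics={"loss": loss},
        current_round=received.current_round,
        total_rounds=received.total_rounds,
    )


def server_round(global_model: ToyFLModel, client_values: List[float]) -> ToyFLModel:
    """Server side: send the model out, collect replies, average the params."""
    replies = [client_step(global_model, value) for value in client_values]
    averaged = {
        k: sum(reply.params[k] for reply in replies) / len(replies)
        for k in global_model.params
    }
    return ToyFLModel(
        params=averaged,
        current_round=global_model.current_round + 1,
        total_rounds=global_model.total_rounds,
    )


model = ToyFLModel(params={"w": 0.0}, current_round=0, total_rounds=2)
while model.current_round < model.total_rounds:
    model = server_round(model, client_values=[2.0, 4.0])

print(model.params, model.current_round)  # {'w': 2.25} 2
```

Both directions use the same structure, so the server and client code agree on where to find parameters, metrics, and round information.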

#### 3. Job API
* Helps generate job configurations programmatically
* Allows construction of needed components and generation of job configurations
* Supports simulation through the `job.simulator_run()` method

#### 4. Simulator API
* Enables direct invocation of simulations through the `simulator_run()` method

#### 5. FLARE API
* Python equivalent of FLARE Console commands
* Allows programmatic interaction with the FL system
* Supports operations like connecting to servers, checking status, monitoring jobs, and submitting jobs

### Command Line Interface (CLI)

NVIDIA FLARE provides several command-line tools under the `nvflare` command:

* `nvflare --version`: Display version information
* `nvflare poc`: Create and manage proof-of-concept deployments
* `nvflare preflight_check`: Verify FL system setup and diagnose issues
* `nvflare provision`: Generate secure startup kits for participants
* `nvflare simulator`: Run simulations
* `nvflare dashboard`: Start the NVIDIA FLARE Dashboard for distributing provisioned startup kits
* `nvflare authz_preview`: View different user roles and permissions
* `nvflare job`: Create job configurations, list templates, and submit jobs
* `nvflare config`: Configure default directories for startup, POC workspace, and job templates

## Configuration System

NVIDIA FLARE uses a flexible configuration system that supports multiple formats and provides templates for common scenarios.

### Configuration Formats

NVIDIA FLARE supports several configuration formats:
* JSON
* HOCON (parsed with the pyhocon library)
* YAML

For details, see the [Configuration Files documentation](https://nvflare.readthedocs.io/en/main/user_guide/configurations.html).
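
Configuration files typically describe components by an `id`, a class `path`, and constructor `args`. The sketch below shows how such an entry can be turned into a live object using only the standard library. It is an illustration of the general pattern, not NVIDIA FLARE's loader: `build_component` and `load_components` are hypothetical helpers, and standard-library classes stand in for real FLARE components.

```python
import importlib
import json
from typing import Any, Dict

# Standard-library classes are used as placeholder components for this sketch.
CONFIG_TEXT = """
{
  "format_version": 2,
  "components": [
    {"id": "timeout", "path": "datetime.timedelta", "args": {"minutes": 5}},
    {"id": "ratio", "path": "fractions.Fraction", "args": {"numerator": 3, "denominator": 4}}
  ]
}
"""


def build_component(spec: Dict[str, Any]) -> Any:
    """Import the class named by 'path' and construct it with 'args'."""
    module_name, _, class_name = spec["path"].rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls(**spec.get("args", {}))


def load_components(text: str) -> Dict[str, Any]:
    """Parse the JSON text and return the built components keyed by their id."""
    config = json.loads(text)
    return {spec["id"]: build_component(spec) for spec in config["components"]}


components = load_components(CONFIG_TEXT)
print(components["timeout"].total_seconds(), components["ratio"])  # 300.0 3/4
```

Because classes are referenced by path, swapping one component for another is a configuration change rather than a code change.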

### Job Templates

Job templates are predefined job configurations that serve as starting points for common scenarios. You can start from one of the existing [job templates](https://github.com/NVIDIA/NVFlare/tree/main/job_templates) and use the [job CLI](https://github.com/NVIDIA/NVFlare/blob/main/examples/tutorials/job_cli.ipynb) to customize it to your needs.

#### Template Structure

A typical job template includes:

```
├── config_fed_client.conf  # Client configuration
├── config_fed_server.conf  # Server configuration
├── info.conf               # Template information
├── info.md                 # Template documentation
└── meta.conf               # Template metadata
```

Let's examine a sample client configuration from the `sag_pt` template:

! cat  ../../../../../../job_templates/sag_pt/config_fed_client.conf

### Using Job Templates

You can use the NVIDIA FLARE job CLI to view and modify templates when creating job configurations. This provides a user-friendly way to customize templates for your specific needs.

For more information, see the [job CLI tutorial](../../../../job_cli.ipynb).

## Summary

In this section, we've explored the architecture of NVIDIA FLARE, including:

* The layered design that provides modularity and extensibility
* The FLARE Communication Interface (FCI) that enables secure, efficient communication
* The job-based processing architecture that supports concurrent execution
* The event-driven system that enables loose coupling and customization
* The various APIs available at different levels of abstraction
* The configuration system and job templates that simplify deployment

This architectural understanding will help you leverage NVIDIA FLARE effectively for your federated computing needs.

In the next section, we'll explore how to [simulate real-world deployment](../03.2_deployment_simulation/simulate_real_world_deployment.ipynb) of NVIDIA FLARE on a local machine.