
## Advanced FL Algorithms
We provide several examples to help you quickly get started with NVFlare.
All examples in this folder are based on using [TensorFlow](https://tensorflow.org/) as the model training framework.

This example demonstrates three TensorFlow-based federated learning algorithms,
[FedAvg](https://arxiv.org/abs/1602.05629), [FedOpt](https://arxiv.org/abs/2003.00295), and [SCAFFOLD](https://arxiv.org/abs/1910.06378), on the CIFAR-10 dataset.

In this example, the latest Client APIs are used to implement the
client-side training logic (see
[`cifar10_tf_fl_alpha_split.py`](src/cifar10_tf_fl_alpha_split.py)),
and the new
[`FedJob`](../../../nvflare/job_config/api.py)
APIs are used to programmatically set up an
NVFlare job that can be exported or run by the simulator (see
[`tf_fl_script_runner_cifar10.py`](tf_fl_script_runner_cifar10.py)).
This removes the need to write job config files and simplifies the
development process.


## 1. Install requirements
Install required packages

In [None]:
!pip install --upgrade pip
!pip install -r ./requirements.txt

> **_NOTE:_**  We recommend using either a containerized deployment or a virtual environment.
> For details, refer to [getting started](https://nvflare.readthedocs.io/en/latest/getting_started.html).

## 2. Run experiments

The following examples use the simulator to run all experiments. The script
[`tf_fl_script_runner_cifar10.py`](tf_fl_script_runner_cifar10.py)
is the main script used to launch the experiments with
different arguments (see the sections below for details).

### 2.1 Impact of Data Heterogeneity
First, we run several experiments that simulate different heterogeneous data splits to see how they impact the FedAvg algorithm.

The CIFAR-10 dataset is downloaded the first time any experiment
runs. TensorBoard summary logs are generated during
every experiment, and you can use TensorBoard to visualize the
training and validation process as the experiment runs. Data split
files, summary logs, and results are saved in a workspace
directory, which defaults to `/tmp` and can be changed by setting
the `--workspace` argument of the `tf_fl_script_runner_cifar10.py`
script.

We apply Dirichlet sampling (as implemented in [FedMA](https://github.com/IBM/FedMA)) to
the CIFAR-10 data labels to simulate data heterogeneity among client sites, controlled by an
`alpha` value between 0 (exclusive) and 1. A higher alpha value means less data
heterogeneity, i.e., an alpha value equal to 1.0 would result in a homogeneous data
distribution among the different splits.
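
The idea behind Dirichlet label sampling can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the FedMA or NVFlare implementation: the function name `dirichlet_split`, its parameters, and the toy label array are hypothetical, and the actual splitting code may differ in details such as balancing client sizes.

```python
import numpy as np


def dirichlet_split(labels, n_clients, alpha, seed=0):
    """Assign sample indices to clients using per-class Dirichlet proportions.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(n_clients)]
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        # Draw the share of this class that each client receives.
        proportions = rng.dirichlet(alpha * np.ones(n_clients))
        # Convert the shares into split points over the shuffled indices.
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client_id, part in enumerate(np.split(idx, cuts)):
            client_indices[client_id].extend(part.tolist())
    return client_indices


# Toy labels: 10 classes with 100 samples each (a stand-in for CIFAR-10 labels).
toy_labels = np.repeat(np.arange(10), 100)
splits = dirichlet_split(toy_labels, n_clients=8, alpha=0.1)
for client_id, part in enumerate(splits):
    counts = np.bincount(toy_labels[part], minlength=10)
    print(f"client {client_id}: class counts = {counts.tolist()}")
```

With a small `alpha`, each class tends to be concentrated on a few clients, which is what makes the per-client label distributions differ.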

> Note: we use the following environment variables during training to prevent
> TensorFlow from allocating all GPU memory at once, so that the clients can run in parallel on the same GPU:
> `export TF_FORCE_GPU_ALLOW_GROWTH=true && export TF_GPU_ALLOCATOR=cuda_malloc_async`
>
> You should be able to run the 8 clients in parallel on a GPU with 16GB memory.

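If you launch the experiments from a Python session or notebook kernel rather than a shell, the same variables can be set through `os.environ` before TensorFlow is imported or any child process is started. This is a minimal sketch that assumes child processes inherit the kernel's environment:

```python
import os

# Let TensorFlow grow GPU memory on demand instead of reserving it all at once.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
# Use the asynchronous CUDA allocator.
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"

print({k: v for k, v in os.environ.items() if k.startswith("TF_")})
```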

In [None]:
# You can change GPU index if multiple GPUs are available
GPU_INDX=0

# Run FedAvg with different alpha values
for alpha in [1.0, 0.1]:
    !python ./tf_fl_script_runner_cifar10.py \
       --algo fedavg \
       --n_clients 8 \
       --num_rounds 50 \
       --batch_size 64 \
       --epochs 4 \
       --alpha $alpha \
       --gpu $GPU_INDX \
       --workspace /tmp # workspace root directory

You can visualize the results by running `tensorboard --logdir /tmp/nvflare/jobs` in a different terminal.

You can see the impact of data heterogeneity by varying the
`alpha` value, where lower values cause higher heterogeneity. As the
figure below shows, the performance of FedAvg decreases
as data heterogeneity increases.

![Impact of client data heterogeneity](./figs/cifar10_tf_alphas.png)

### 2.2 How Advanced FL Algorithms Can Help

Now, let's see how advanced FL algorithms can significantly improve performance in the presence of data heterogeneity, a common real-world challenge where clients' local data distributions differ substantially. Techniques like [FedOpt](https://arxiv.org/abs/2003.00295) extend FedAvg by applying server-side optimization methods such as Adam or momentum SGD to better adapt to diverse client updates. [SCAFFOLD](https://arxiv.org/abs/1910.06378) tackles the issue of client drift by using control variates to correct local updates, helping align them more closely with the global objective. These methods can be integrated into the client training script and server-side controllers to improve convergence and generalization in non-IID settings.

#### 2.2.1 FedOpt: Server-Side Optimization
*FedOpt* is a family of algorithms that extends FedAvg by replacing the simple averaging step at the server with a more sophisticated server-side optimizer, such as Adam, Adagrad, or momentum SGD. While clients still perform multiple steps of local training, the global model update at the server is treated as an optimization problem in its own right.

**Key idea:** Instead of just averaging the client updates, the server accumulates them and applies an optimizer to adjust the global model. This helps the server adapt more effectively to inconsistent or biased updates from clients, which are common in non-IID settings.

**Benefits:**

- Improved convergence speed.
- Greater stability across diverse client data distributions.
- Flexibility to tune optimizer hyperparameters for different tasks.
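
The key idea above can be sketched with NumPy. This is a minimal illustration assuming server-side SGD with momentum and flat weight vectors; it is not NVFlare's FedOpt controller, and the function name `fedopt_server_step` and its parameters are hypothetical.

```python
import numpy as np


def fedopt_server_step(global_w, client_ws, client_sizes, momentum_buf,
                       lr=1.0, beta=0.9):
    """One server round: treat the averaged client change as a pseudo-gradient."""
    weights = np.asarray(client_sizes, dtype=float)
    weights /= weights.sum()
    # Weighted average of the client models, as in FedAvg.
    avg_w = sum(w * cw for w, cw in zip(weights, client_ws))
    # Pseudo-gradient: direction from the averaged model back to the global model.
    pseudo_grad = global_w - avg_w
    # Momentum SGD update applied on the server.
    momentum_buf = beta * momentum_buf + pseudo_grad
    new_global = global_w - lr * momentum_buf
    return new_global, momentum_buf


# Toy usage: two equally sized clients report weights of 1.0 and 3.0.
global_w = np.zeros(3)
buf = np.zeros(3)
client_ws = [np.ones(3), 3.0 * np.ones(3)]
new_w, buf = fedopt_server_step(global_w, client_ws, [1, 1], buf)
print(new_w)
```

With `lr=1.0` and `beta=0.0`, this update reduces to plain FedAvg, which shows how FedOpt generalizes the averaging step.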



#### 2.2.2 SCAFFOLD: Correcting Client Drift
*SCAFFOLD* addresses a different issue known as client drift, which occurs when clients’ local updates deviate significantly from the direction that would optimize the global objective. This is especially problematic in heterogeneous environments where local optima vary widely.

**Key idea:** SCAFFOLD introduces control variates—auxiliary variables maintained at both the client and server—to estimate and correct the drift. During training, each client uses its control variate to adjust its local gradient updates. After training, the server updates its global control variate based on the clients’ contributions.

**Benefits:**

- Reduces variance across client updates.
- Encourages local updates to stay aligned with the global optimization direction.
- Leads to faster and more stable convergence in non-IID settings.
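
The control-variate correction can be sketched with NumPy on a toy quadratic problem. This is a minimal illustration, not the NVFlare implementation: it assumes full client participation, a server learning rate of 1, and the control-variate update that the SCAFFOLD paper lists as its second option. The function `scaffold_client_update` and the toy objectives are hypothetical.

```python
import numpy as np


def scaffold_client_update(global_w, c_global, c_local, grad_fn,
                           lr=0.1, local_steps=5):
    """Run corrected local SGD steps and return the new weights and control variate."""
    w = global_w.copy()
    for _ in range(local_steps):
        # Correct the local gradient with the difference of control variates.
        w = w - lr * (grad_fn(w) - c_local + c_global)
    # Re-estimate the client control variate from the net local movement.
    new_c_local = c_local - c_global + (global_w - w) / (local_steps * lr)
    return w, new_c_local


# Toy clients: client i minimizes 0.5 * ||w - t_i||^2, so its gradient is w - t_i.
targets = [np.array([1.0, 0.0]), np.array([-1.0, 2.0])]
global_w = np.zeros(2)
c_global = np.zeros(2)
c_locals = [np.zeros(2) for _ in targets]

for _ in range(50):
    new_ws, new_cs = [], []
    for t, c_i in zip(targets, c_locals):
        w_i, c_new = scaffold_client_update(
            global_w, c_global, c_i, lambda w, t=t: w - t
        )
        new_ws.append(w_i)
        new_cs.append(c_new)
    # Server: average the client models and the control-variate changes.
    global_w = np.mean(new_ws, axis=0)
    c_global = c_global + np.mean(
        [cn - co for cn, co in zip(new_cs, c_locals)], axis=0
    )
    c_locals = new_cs

print(global_w)  # approaches the mean of the targets
```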

#### 2.2.3 Compare the algorithms
Here, we use the `alpha=0.1` setting to compare FedAvg with FedOpt and SCAFFOLD.

In [None]:
# You can change GPU index if multiple GPUs are available
GPU_INDX=0

for algo in ["fedopt", "scaffold"]:
    !python ./tf_fl_script_runner_cifar10.py \
       --algo $algo \
       --n_clients 8 \
       --num_rounds 50 \
       --batch_size 64 \
       --epochs 4 \
       --alpha 0.1 \
       --gpu $GPU_INDX \
       --workspace /tmp # workspace root directory

Again, you can visualize the results by running `tensorboard --logdir /tmp/nvflare/jobs` in a different terminal.

We compared the performance of the different FL algorithms with the alpha value fixed to 0.1, i.e., high client data heterogeneity. The figure below shows that FedOpt and SCAFFOLD achieve better performance and better convergence rates than FedAvg with the same alpha setting. SCAFFOLD achieves this by adding a correction term when updating the client models, while FedOpt uses SGD with momentum to update the global model on the server. Both achieve better performance with the same number of training steps as FedAvg.

![Comparison of FL Algorithms](./figs/cifar10_tf_algos.png)