# FedAvg algorithm
<a id = "title"></a>

In this example, we will demonstrate the FedAvg algorithm using the CIFAR10 dataset.

Both the job life cycle and the training workflow are controlled on the server side, so we will simply use the existing SAG (Scatter and Gather) controller available in NVFlare.

For client-side training code, we will leverage the new DL to FL Client API.

First, let's look at the FedAvg Algorithm and SAG Workflow.


## Scatter and Gather (SAG)

FLARE's Scatter and Gather workflow is similar to the combination of MPI Broadcast and MPI Gather in the Message Passing Interface (MPI). [MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface) is a standardized, portable message-passing standard designed to function on parallel computing architectures. It includes [collective communication routines](https://mpitutorial.com/tutorials/mpi-broadcast-and-collective-communication/) such as MPI Broadcast, MPI Scatter, and MPI Gather.

<img src="mpi_scatter.png" alt="scatter" width=25% height=20% /><img src="mpi_gather.png" alt="gather" width=25% height=20% />
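
To make the pattern concrete, here is a minimal, framework-free sketch of one scatter-and-gather round in plain Python. It uses neither NVFlare nor MPI: the names `scatter_and_gather` and `add_site_offset` are hypothetical, and `add_site_offset` only stands in for whatever work a site would really do.

```python
from typing import Callable, Dict, List


def scatter_and_gather(
    payload: List[float],
    sites: List[str],
    local_task: Callable[[str, List[float]], List[float]],
) -> Dict[str, List[float]]:
    """Send the same payload to every site and collect one result per site."""
    results = {}
    for site in sites:
        # "Broadcast": every site receives its own copy of the payload.
        received = list(payload)
        # Each site works independently on what it received.
        results[site] = local_task(site, received)
    # "Gather": the caller (the server) now holds all site results.
    return results


def add_site_offset(site: str, values: List[float]) -> List[float]:
    """Hypothetical stand-in for local work: shift values by the site index."""
    offset = float(site.split("-")[1])
    return [v + offset for v in values]


gathered = scatter_and_gather([0.0, 1.0], ["site-1", "site-2"], add_site_offset)
print(gathered)
```

In a real deployment the loop body runs on separate machines in parallel; the sequential loop here only illustrates the data flow.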



## FedAvg with SAG
We use the [SAG workflow](https://nvflare.readthedocs.io/en/main/programming_guide/controllers/scatter_and_gather_workflow.html) to implement the FedAvg algorithm. The figure below shows one round of training in this workflow.

<img src="fed_avg_one_round.png" alt="FedAvg" width=35% height=30% />

<a id = "sag"></a>
<img src="fed_avg.png" alt="FedAvg" width=50% height=45% /> <img src="sag.png" alt="Scatter and Gather" width=40% height=40% />

FedAvg aggregation is done on the server side; each client's update is weighted by the number of training steps that client ran.
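
The step-weighted average can be sketched in a few lines of plain Python. This is an illustration of the idea only, not NVFlare's aggregator: the function name `fedavg`, the use of lists of floats for parameters, and the sample step counts are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def fedavg(updates: List[Tuple[Dict[str, List[float]], int]]) -> Dict[str, List[float]]:
    """Average client parameters, weighting each client by its step count."""
    total_steps = sum(steps for _, steps in updates)
    if total_steps <= 0:
        raise ValueError("total number of training steps must be positive")

    first_params, _ = updates[0]
    averaged = {name: [0.0] * len(values) for name, values in first_params.items()}
    for params, steps in updates:
        # A client that ran more steps contributes proportionally more.
        weight = steps / total_steps
        for name, values in params.items():
            for i, v in enumerate(values):
                averaged[name][i] += weight * v
    return averaged


# Two hypothetical clients: the second ran three times as many steps.
client_updates = [
    ({"w": [1.0, 2.0]}, 100),
    ({"w": [3.0, 6.0]}, 300),
]
global_params = fedavg(client_updates)
print(global_params)
```

With weights 0.25 and 0.75, the result sits closer to the second client's parameters than a plain mean would.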
 
## Convert training code to federated learning training code
<a id = "code"></a>
We will use the original [Training a Classifier](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html) example
from PyTorch as the code base.

The original code, with some clean-up (comments removed, etc.), can be found [here](../code/dl/train.py).


With the NVFlare DL to FL Client API, we can transform the existing PyTorch classifier training code into federated classifier training code with only a few lines of changes. The converted code can be found **[here](../code/fl/train.py)**.

For a detailed discussion of how to convert training code into federated learning training code using the Client API, you can also check out the examples [here](https://github.com/NVIDIA/NVFlare/blob/main/examples/hello-world/ml-to-fl/README.md).

The key changes are the following steps:

```
    #  import nvflare client API
    import nvflare.client as flare

    #  initializes NVFlare client API
    flare.init()

    # gets FLModel from NVFlare
    input_model = flare.receive()

    # loads model from NVFlare
    net.load_state_dict(input_model.params)

    # evaluate on received model
    accuracy = evaluate(input_model.params)
    
    # construct trained FL model
    output_model = flare.FLModel(
        params=net.cpu().state_dict(),
        metrics={"accuracy": accuracy},
        meta={"NUM_STEPS_CURRENT_ROUND": steps},
    )
    
    # send model back to NVFlare
    flare.send(output_model)
```
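
The `NUM_STEPS_CURRENT_ROUND` entry in `meta` is what lets the server weight each client by its training steps. As a rough sketch, the step count can be derived from the local dataset size, batch size, and number of local epochs. The helper `local_steps` is hypothetical, and the values below assume a client that trains on all 50,000 CIFAR10 training images for 2 local epochs without dropping the final partial batch.

```python
import math


def local_steps(num_samples: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps in one round when the final partial batch is also used."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return epochs * steps_per_epoch


# Assumed values for illustration: full CIFAR10 training set, batch size 6, 2 epochs.
steps = local_steps(num_samples=50_000, batch_size=6, epochs=2)
meta = {"NUM_STEPS_CURRENT_ROUND": steps}
print(meta)
```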

If you are using PyTorch Lightning, the changes are even smaller: a one-line import, a one-line change applied to the trainer, and one line for global model evaluation. See [cifar10_lightning_examples](https://github.com/NVIDIA/NVFlare/blob/main/examples/hello-world/ml-to-fl/pt/cifar10_lightning_fl.py).

## Prepare Data
<a id = "data"></a>

Let's get the data first. Following the CIFAR10 instructions, we can download the data with the following script.


In [None]:
CIFAR10_ROOT = "/tmp/nvflare/data/cifar10"

! python ../data/download.py

## Job Folder and Configurations
<a id = "job"></a>

NVFlare needs a job folder to run.
We can use the NVFlare Job API to set up the configurations for the server and clients.

Let's first copy the required files over:

In [None]:
! cp ../code/fl/train.py train.py
! cp ../code/fl/net.py net.py

The following code constructs a `FedAvgJob` and adds a `ScriptRunner` for each of 2 clients, running 5 rounds with a batch size of 6. Feel free to edit any of these parameters.

In [None]:
from net import Net

from nvflare.app_opt.pt.job_config.fed_avg import FedAvgJob
from nvflare.job_config.script_runner import ScriptRunner

if __name__ == "__main__":
    n_clients = 2
    num_rounds = 5
    train_script = "train.py"

    job = FedAvgJob(
        name="cifar10_fedavg",
        n_clients=n_clients,
        num_rounds=num_rounds,
        initial_model=Net()
    )

    # Add clients
    for i in range(n_clients):
        runner = ScriptRunner(
            script=train_script, script_args="--batch_size 6"
        )
        job.to(runner, f"site-{i+1}")

    job.export_job("/tmp/nvflare/jobs")
    job.simulator_run("/tmp/nvflare/jobs/workdir", gpu="0")


## Run Job

The previous cell exports the job configuration and runs the job in the NVFlare simulator.

To run in a production system, you would submit the exported job folder to the NVFlare system.

We can check the contents of the job folder using the `tree` command or `ls -al`.


In [None]:
!tree /tmp/nvflare/jobs/cifar10_fedavg
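
If the `tree` command is not installed, a small Python helper can list the same folder. This is a sketch using only the standard library; `list_tree` is a hypothetical helper name, and the path is the export location used in the job code above.

```python
from pathlib import Path
from typing import List


def list_tree(root: str) -> List[str]:
    """Return every path under root, relative to root, in sorted order."""
    base = Path(root)
    if not base.is_dir():
        # Nothing to list if the job has not been exported yet.
        return []
    return sorted(
        p.relative_to(base).as_posix() + ("/" if p.is_dir() else "")
        for p in base.rglob("*")
    )


for entry in list_tree("/tmp/nvflare/jobs/cifar10_fedavg"):
    print(entry)
```

Directories are printed with a trailing `/` so they are easy to tell apart from files.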


The next 5 examples use the same FedAvg workflow but demonstrate different execution APIs and features.
In the next example, [sag_deploy_map](../sag_deploy_map/sag_deploy_map.ipynb), we will learn about the deploy_map configuration for deploying apps to different sites.