# FedAvg Algorithm with SAG (Scatter & Gather) workflow

In this example, we will demonstrate the SAG workflow with FedAvg using the CIFAR10 dataset.

Both the job lifecycle and the training workflow are controlled on the **server side**, so we will simply use the existing SAG controller available in NVFLARE.

For the client-side training code, we will leverage the new DL-to-FL **Client API**.

First, let's look at the FedAvg algorithm and the SAG workflow.

## FedAvg with SAG
<img src="fed_avg.png" alt="FedAvg" width=50% height=45% /> <img src="sag.png" alt="Scatter and Gather" width=40% height=40% />

FedAvg aggregation is done on the server side; each client's update is weighted by the number of training steps that client ran.
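The weighting can be illustrated with a small, framework-free sketch. The hypothetical helper `fedavg` computes the step-weighted average of client weights (each client reports its step count, as the client code later does with `NUM_STEPS_CURRENT_ROUND`), and the hypothetical `run_sag` repeats scatter, local training, gather, and aggregate for a fixed number of rounds. Plain Python lists stand in for PyTorch tensors here, and NVFLARE's actual SAG controller also handles settings such as minimum client counts that this sketch omits.

```python
from typing import Callable, Dict, List, Tuple

# A "model" is a dict of parameter name -> flat list of floats,
# standing in for a PyTorch state_dict (assumption for illustration).
Weights = Dict[str, List[float]]
ClientFn = Callable[[Weights], Tuple[Weights, int]]


def fedavg(results: List[Tuple[Weights, int]]) -> Weights:
    """Average client weights, weighting each client by its number of training steps."""
    total_steps = sum(steps for _, steps in results)
    if total_steps == 0:
        raise ValueError("need at least one client with a positive step count")
    first_weights, _ = results[0]
    aggregated: Weights = {}
    for name, values in first_weights.items():
        acc = [0.0] * len(values)
        for weights, steps in results:
            for i, v in enumerate(weights[name]):
                acc[i] += v * steps / total_steps
        aggregated[name] = acc
    return aggregated


def run_sag(global_weights: Weights, clients: List[ClientFn], num_rounds: int) -> Weights:
    """Each round: scatter the global model, gather (weights, steps), aggregate with FedAvg."""
    for _ in range(num_rounds):
        results = [train(global_weights) for train in clients]  # scatter + gather
        global_weights = fedavg(results)
    return global_weights


# Two toy clients: each returns updated weights and the number of steps it ran.
def client_a(w: Weights) -> Tuple[Weights, int]:
    return {"w": [x + 1.0 for x in w["w"]]}, 100


def client_b(w: Weights) -> Tuple[Weights, int]:
    return {"w": [x - 1.0 for x in w["w"]]}, 300


print(run_sag({"w": [0.0, 2.0]}, [client_a, client_b], num_rounds=1))  # {'w': [-0.5, 1.5]}
```

Because `client_b` ran three times as many steps, the aggregate moves three times closer to its update than to `client_a`'s.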
 
## Convert training code to federated learning training code

We will use the original PyTorch [Training a Classifier](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html) example
as the code base. The cleaned-up code (comments removed, etc.) can be found [here](../code/dl/train.py).


With the NVFLARE DL-to-FL Client API, only a few lines of code need to change to turn the existing PyTorch classifier training code into federated classifier training code. The already converted code can be found **[here](../code/fl/train.py)**.

For a detailed discussion of how to convert training code into federated learning training code using the Client API, you can also check out the examples and code [here](https://github.com/NVIDIA/NVFlare/blob/main/examples/hello-world/ml-to-fl/README.md).

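The key changes are the following steps:

```python
# import the NVFlare Client API
import nvflare.client as flare

# initialize the NVFlare Client API
flare.init()

# receive the global model (an FLModel) from NVFlare
input_model = flare.receive()

# load the global weights into the local network
# (`net` and `evaluate` are assumed to be defined elsewhere in train.py)
net.load_state_dict(input_model.params)

# evaluate the received global model
accuracy = evaluate(input_model.params)

# ... train `net` locally as in the original code, counting optimizer steps in `steps` ...

# construct the trained FL model; NUM_STEPS_CURRENT_ROUND lets the server
# weight this client's update during FedAvg aggregation
output_model = flare.FLModel(
    params=net.cpu().state_dict(),
    metrics={"accuracy": accuracy},
    meta={"NUM_STEPS_CURRENT_ROUND": steps},
)

# send the model back to NVFlare
flare.send(output_model)
```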
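The snippet above leaves `steps` to the training loop. Assuming one optimizer step per mini-batch, a client's step count per round is the number of epochs times the number of batches per epoch; a PyTorch `DataLoader` with the default `drop_last=False` yields `ceil(num_samples / batch_size)` batches per epoch. A minimal sketch with a hypothetical helper `steps_per_round`, using the 50,000 CIFAR10 training images, the `batch_size=6` set for this job later, and `epochs=2` as an assumption borrowed from the PyTorch tutorial:

```python
import math


def steps_per_round(num_samples: int, batch_size: int, epochs: int, drop_last: bool = False) -> int:
    """Optimizer steps one client runs per round: one step per mini-batch, per epoch."""
    if drop_last:
        batches = num_samples // batch_size  # the final partial batch is skipped
    else:
        batches = math.ceil(num_samples / batch_size)  # the final partial batch is kept
    return epochs * batches


# CIFAR10 training set size; batch_size=6 and epochs=2 are assumptions for this illustration.
steps = steps_per_round(num_samples=50_000, batch_size=6, epochs=2)
print(steps)  # 16668
```

In real training code, simply incrementing a counter inside the batch loop gives the same number without relying on this formula.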

If you are using PyTorch Lightning, the changes are even smaller: a 1-line import, a 1-line change applied to the trainer, and a 1-line global model evaluation. See [cifar10_lightning_examples](https://github.com/NVIDIA/NVFlare/blob/main/examples/hello-world/ml-to-fl/codes/cifar10_lightning.py).

## Prepare Data

Let's get the data first. Following the CIFAR10 tutorial, we can download the data with the following script.


In [None]:
CIFAR10_ROOT = "/tmp/nvflare/data/cifar10"

! python ../data/download.py

## Job Folder and Configurations

Now we need to set up the server and client configurations and construct the job folder that NVFLARE needs to run. We can do this with the NVFLARE job CLI. You can study the [Job CLI tutorials](https://github.com/NVIDIA/NVFlare/blob/main/examples/tutorials/job_cli.ipynb) later for all the details; for now, just use the following commands.

* Find out the available job templates

We need to set the job templates directory so that the job CLI commands can find the job templates. If you have already set `NVFLARE_HOME=<NVFLARE git clone directory>`, you can skip the following step.


In [None]:
! nvflare config -jt ../../../../../job_templates

In [None]:
! nvflare job list_templates

* Create job folder and initial configs

The template **'sag_pt'** seems to fit our needs: SAG with PyTorch, using the Client API. Let's first create a job folder from this template without specifying the code location, just to see what needs to be changed.

In [None]:
! nvflare job create -j /tmp/nvflare/jobs/cifar10_sag_pt -w sag_pt

Let's also look at the server and client configurations.

In [None]:
!cat /tmp/nvflare/jobs/cifar10_sag_pt/app/config/config_fed_server.conf

In [None]:
!cat /tmp/nvflare/jobs/cifar10_sag_pt/app/config/config_fed_client.conf

In [None]:
!cat /tmp/nvflare/jobs/cifar10_sag_pt/app/config/config_exchange.conf

* Create job folder with all the configs

Let's set `min_clients = 2` in meta.conf, `num_rounds = 5` in config_fed_server.conf, and the training script to train.py. We would also like to change the train.py arguments to
`dataset_path=CIFAR10_ROOT`, `batch_size=6`, and `num_workers=2`. Here `dataset_path` is not actually changed; we just want to show that it can be.

In [None]:
! nvflare job create -j /tmp/nvflare/jobs/cifar10_sag_pt -w sag_pt -f meta.conf min_clients=2 -force -f config_fed_server.conf num_rounds=5 -s ../code/fl/train.py -sd ../code/fl -a batch_size=6 dataset_path={CIFAR10_ROOT} num_workers=2 
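The cell above packs several overrides into one command: each `-f <file> key=value ...` group targets one config file, `-s` and `-sd` point at the training script and its code directory, and `-a` passes arguments to train.py; IPython expands `{CIFAR10_ROOT}` from the Python variable defined earlier. A minimal sketch, using a hypothetical helper `build_job_create_cmd`, that assembles the same argument list from Python so the file each override targets is explicit. It only mirrors the flags already used in the cell above and does not validate them against the CLI.

```python
import shlex
from typing import Dict, List


def build_job_create_cmd(
    job_dir: str,
    template: str,
    conf_overrides: Dict[str, Dict[str, object]],
    script: str,
    script_dir: str,
    script_args: Dict[str, object],
) -> List[str]:
    """Assemble an `nvflare job create` argument list (hypothetical helper)."""
    cmd = ["nvflare", "job", "create", "-j", job_dir, "-w", template]
    for conf_file, overrides in conf_overrides.items():
        # one "-f <file> key=value ..." group per config file
        cmd += ["-f", conf_file] + [f"{k}={v}" for k, v in overrides.items()]
    cmd += ["-force", "-s", script, "-sd", script_dir]
    if script_args:
        # arguments forwarded to the training script
        cmd += ["-a"] + [f"{k}={v}" for k, v in script_args.items()]
    return cmd


CIFAR10_ROOT = "/tmp/nvflare/data/cifar10"
cmd = build_job_create_cmd(
    job_dir="/tmp/nvflare/jobs/cifar10_sag_pt",
    template="sag_pt",
    conf_overrides={"meta.conf": {"min_clients": 2}, "config_fed_server.conf": {"num_rounds": 5}},
    script="../code/fl/train.py",
    script_dir="../code/fl",
    script_args={"batch_size": 6, "dataset_path": CIFAR10_ROOT, "num_workers": 2},
)
print(shlex.join(cmd))  # shell-quoted command line, for inspection
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would execute it without going through a shell.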

OK, we are ready to run the job. Let's look at the job folder first (use `ls -al` if you don't have `tree` installed).

In [None]:
! tree /tmp/nvflare/jobs/cifar10_sag_pt  
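If `tree` is not installed and you want a recursive view rather than `ls -al`, a minimal sketch of a `tree`-like listing in plain Python with `pathlib` follows; `print_tree` is a hypothetical helper, and its output only approximates the `tree` format.

```python
from pathlib import Path
from typing import List


def print_tree(root: str) -> List[str]:
    """Print a directory tree similar to `tree` and return the printed lines."""
    root_path = Path(root)
    lines = [str(root_path)]

    def walk(directory: Path, prefix: str) -> None:
        entries = sorted(directory.iterdir(), key=lambda p: p.name)
        for i, entry in enumerate(entries):
            last = i == len(entries) - 1
            lines.append(f"{prefix}{'└── ' if last else '├── '}{entry.name}")
            if entry.is_dir():
                # continue the vertical guide only if more siblings follow
                walk(entry, prefix + ("    " if last else "│   "))

    walk(root_path, "")
    for line in lines:
        print(line)
    return lines


job_dir = Path("/tmp/nvflare/jobs/cifar10_sag_pt")
if job_dir.is_dir():  # only list the job folder if it has been created
    print_tree(str(job_dir))
```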

## Run Job
We can use the simulator to run the job directly.


In [None]:
! nvflare simulator /tmp/nvflare/jobs/cifar10_sag_pt  -w /tmp/nvflare/jobs/cifar10_sag_pt_workspace -t 2 -n 2 

The job should now be running in simulator mode, and once it finishes, the training is done. Let's move on to the next example and see how we can monitor the losses in MLflow or TensorBoard.