# NVFLARE JOB CLI

In this notebook, we will go through the different commands of the Job CLI to show the syntax and usage of each.


## Install NVIDIA FLARE

For this notebook, we will need a running NVFLARE project that we can connect to.
Follow the [installation](https://nvflare.readthedocs.io/en/main/getting_started.html#installation) 
instructions to set up an environment that has NVIDIA FLARE installed if you do not have one already.

To submit jobs with the Job CLI, you need a running NVFLARE system with a server and clients. You can either run a local system via the `nvflare poc` commands, or 
use a running production system. 

To see how to set up a local system, please refer to the [setup_poc tutorial](setup_poc.ipynb).


## Step-by-step walk-through: from creating a job to running a job

Taking the converted CIFAR10 PyTorch training code for a 2-client federated learning [program](https://github.com/NVIDIA/NVFlare/tree/main/examples/hello-world/step-by-step/cifar10/code), we can use the standard Scatter and Gather (SAG) workflow pattern to demonstrate the features of the Job CLI. 

First, we would like to see which pre-configured job templates are available for the user to use and modify. 


### Check out the available nvflare job templates
 

#### List Job Templates and job templates directory

The NVFLARE 2.4.0 release introduces job templates for the different types of job configurations.

To list the available templates, you can use the ```nvflare job list_templates``` command:

```
! nvflare job list_templates
```

If you installed nvflare 2.4.x via `pip install nvflare`, the above command should show you the available (built-in default) job templates. But if you cloned the GitHub repository and did not use ```pip install nvflare```, the command will expect you to provide the job_templates directory. When the job templates directory is not specified, the Job CLI will try to find the job_templates location in the following order:

* See if the NVFLARE_HOME environment variable is set. If NVFLARE_HOME is not empty, the Job CLI will look for the job_templates at:
 
 ```${NVFLARE_HOME}/job_templates```
 
* If the NVFLARE_HOME environment variable is not set, the Job CLI will look for the `job_template` path in the config file in the nvflare hidden directory:

```
cat ~/.nvflare/config.conf 

startup_kit {
  path = "/tmp/nvflare/poc1/example_project/prod_00"
}
poc_workspace {
  path = "/tmp/nvflare/poc1"
}
job_template {
  path = "../../job_templates"
}

```
Once the `-d <job_template_dir>` option is used, the `job_template` value in `~/.nvflare/config.conf` will be updated, so you don't need to specify `-d` again. 

If you want to change the `job_template` path, you can directly edit this config file or use the `nvflare config` command:

```
nvflare config -jt ../../job_templates

```
If `~/.nvflare/config.conf` is not defined yet, the command will look at the following location inside the installed NVFLARE package:
```
 job_templates_dir = os.path.join(os.path.dirname(nvflare.job.__file__), "templates")
```
 
If nvflare is installed and this directory exists, the command should find the built-in job templates. 

> Note: this directory may not exist in the following case: 
> * If you have done ```pip install nvflare``` but also have the NVFLARE source code from the GitHub repo, sys.path might point to your local NVFLARE repository when the nvflare.job module is loaded. In such a case, the above directory will not exist, because job_templates is not located at nvflare/job/templates in the GitHub repository. 


If the job templates directory is still not found, the command will raise an exception for the missing job templates directory. 
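The lookup order described above can be sketched in a few lines of Python. This is only an illustration of the described sequence, not NVFLARE's actual implementation: the function name `resolve_job_templates_dir` and its parameters are hypothetical, and the values are passed in explicitly so the logic is easy to follow.

```python
import os
from typing import Mapping, Optional


def resolve_job_templates_dir(
    cli_dir: Optional[str],
    env: Mapping[str, str],
    config_path: Optional[str],
    builtin_dir: str,
) -> str:
    """Hypothetical helper that mirrors the lookup order described in the text."""
    # 1. An explicit -d <job_templates_dir> value is used as given.
    if cli_dir:
        return cli_dir
    # 2. NVFLARE_HOME, if set and not empty.
    nvflare_home = env.get("NVFLARE_HOME")
    if nvflare_home:
        return os.path.join(nvflare_home, "job_templates")
    # 3. The job_template path stored in ~/.nvflare/config.conf (already parsed).
    if config_path:
        return config_path
    # 4. The built-in templates shipped with the installed package, if present.
    if os.path.isdir(builtin_dir):
        return builtin_dir
    raise ValueError("job templates directory not found")


if __name__ == "__main__":
    print(resolve_job_templates_dir(None, {}, "../../job_templates", "/nonexistent"))
```

Passing the environment and config value as arguments, rather than reading them inside the function, keeps the sketch free of side effects.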


By now, you should understand that ```nvflare job list_templates``` lets you list the built-in default job templates from the release, and also lets you provide your own job_templates directory to reflect recent changes. 

For now, let's specify the job templates directory location:


In [None]:
! nvflare job list_templates -d "../../job_templates"

Here, the option `-d "<job_templates_dir>"` or `--job_template_dir "<job_templates_dir>"` gives the location of the job_templates directory. Doing so also saves the location into the hidden configuration, so we don't need to specify it again next time. Let's look at the config file. 


In [None]:
! cat  ~/.nvflare/config.conf

You can also manually preset the job_templates directory if you don't want to rely on the default one. 

In [None]:
! nvflare config -jt ../../job_templates 


In [None]:
! cat  ~/.nvflare/config.conf

Now we can list the templates again without the `-d` option:

In [None]:
! nvflare job list_templates

With a job template that fits your needs, you can use the job template name to create a new job folder.


### Create a job folder

Since the code for our example is written in pytorch and we would like to try the FedAvg algorithm using the Scatter & Gather (SAG) workflow, the job template **"sag_pt"** is what we are looking for. We will use this template to create our job folder. 

Create a job folder that contains the base job configuration from the template, which can then be modified as desired. First, create a job folder with the intent for it to be modified, without specifying any code.


#### First try


In [None]:
! nvflare job create -j /tmp/nvflare/my_job -w sag_pt -force



The above command creates a job folder at ```/tmp/nvflare/my_job``` with job template ```sag_pt```. 
You can see that a few configuration files are created. Some of the configurations are open for you to overwrite.

If you have the ```tree``` command installed (```sudo apt install tree``` on Linux), you can use it to look at the job folder structure; otherwise, you can use `ls -al`:

In [None]:
! tree /tmp/nvflare/my_job

In [None]:
! cat /tmp/nvflare/my_job/meta.conf

Notice that the name is "my_job". In `config_fed_client.conf` we can specify the data exchange path, the exchange format, and the way to transfer the model. Let's look at the server-side configuration first, and then the client-side configuration. 


In [None]:
! cat /tmp/nvflare/my_job/app/config/config_fed_server.conf

In [None]:
! cat /tmp/nvflare/my_job/app/config/config_fed_client.conf

> Note that both the client and server configurations are commented with explanations. 
> If you create the job with customizations, such as `-f` with configuration overrides, the configuration files will be overwritten. As a result, the comments in the configuration will be lost in the final files. 

### Show variables

Now you can see that the job folder is auto-created with pre-defined configurations. To make sure this template works for your code and that the variables can be updated, let's check the variables with the following command:

In [None]:
! nvflare job show_variables -j /tmp/nvflare/my_job

You can see there are many variables you might want to change:

* Change num_rounds to 1 to test out a fast run first.
* Use custom cifar10 code which was already written based on Flare 2.4.0 Client API.


**Note**

Instead of a job template name such as ```sag_pt```, you can also use a directory path for the job template. You can try this yourself:

```
! nvflare job create -j /tmp/nvflare/my_job -w ../../job_templates/sag_pt -force
```



Let's do a second try.

#### The second try

In [None]:
! nvflare job create -j /tmp/nvflare/my_job -force -w sag_pt  \
-f config_fed_server.conf num_rounds=1 \
-f config_fed_client.conf app_script=train.py \
-sd ../hello-world/step-by-step/cifar10/code/fl

The above command creates a job folder at ```/tmp/nvflare/my_job``` with job template ```sag_pt``` again (`-force` replaces the existing job folder). 
Now, `num_rounds` is set to 1 and `{app_script}` is set to "train.py". The job will run ```python custom/{app_script}```, so the provided `train.py` will be called.
Now, take a look at the job folder structure again: 

In [None]:
! tree /tmp/nvflare/my_job


Notice that the code we had written is copied to the job directory. 

In config_fed_server.conf, we have ```PTFileModelPersistor```, a file-based persistor for pytorch. It requires the `net.Net` class for model initialization and also for saving the final model.
The "net.py" file matches the configuration. If your model file name and class name do not match `net.Net`, you will need to update your configuration to match. 
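To see why the file name and class name must line up with `net.Net`, here is a minimal sketch of how a dotted class path can be resolved into a class object. The helper `load_class` is hypothetical and only illustrates the general technique; it is not taken from NVFLARE, which may load the model class differently.

```python
import importlib


def load_class(class_path: str):
    """Hypothetical helper: resolve 'module.ClassName' into the class object."""
    module_name, _, class_name = class_path.rpartition(".")
    if not module_name:
        raise ValueError(f"'{class_path}' is not of the form 'module.ClassName'")
    # For "net.Net" this imports the module "net" (that is, net.py on the path)
    module = importlib.import_module(module_name)
    # and then looks up the attribute "Net" inside that module.
    return getattr(module, class_name)


if __name__ == "__main__":
    # A standard-library class is used here so that the example runs anywhere.
    print(load_class("collections.OrderedDict"))
```

With a path such as `net.Net`, the module part must be importable as `net` and must define a class named `Net`; otherwise the import or the attribute lookup fails.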

We will leave the rest of the values as default and try to run the job. 

### Download the data

Download the data first to avoid repeated downloading. You can use the download script:


In [None]:
! python ../../examples/hello-world/step-by-step/cifar10/data/download.py

### Run the Job in simulator 

You can first run the job with `nvflare simulator` to see if there are any issues:




In [None]:
! nvflare simulator /tmp/nvflare/my_job -w /tmp/my_job

If this does not work for you, you may need to make additional changes based on the error messages.

Assuming `nvflare simulator` works, you can try running locally with POC mode. For more realistic training, you can first recreate the job configuration with a larger number of rounds (num_rounds=100):



In [None]:
! nvflare job create -j /tmp/nvflare/my_job -force -w sag_pt \
-f config_fed_server.conf num_rounds=100 \
-f config_fed_client.conf app_script=train.py \
-sd ../hello-world/step-by-step/cifar10/code/fl


### Set up and start POC mode

From a terminal, run:

```
   nvflare poc prepare -n 2
   nvflare poc start -ex admin@nvidia.com
```
This prepares a POC workspace with n = 2 clients. The second command starts the POC server and clients but excludes the FLARE Admin Console (user name = 'admin@nvidia.com'). Since we are going to use the Job CLI to submit the job, we don't need the admin console for now. Once the system has started, we are ready to move to the next step: submitting the job.


### Submit Job from CLI

You can use the following command to submit a job directly from the command line. 

Even though `num_rounds = 100` in `config_fed_server.conf`, you can start with a smaller number of rounds by setting `num_rounds` in the `nvflare job submit` command, without changing the value in the config. 

You can also change `train_timeout` to 300 seconds instead of 0 (which means no timeout). This argument is also in `config_fed_server.conf`, so you can include it together with `num_rounds` after `-f config_fed_server.conf`.

Finally, instead of relying on the default `dataset_path`, you can specify the `dataset_path` in the `nvflare job submit` command.

In [None]:
! nvflare job submit -j /tmp/nvflare/my_job \
-f config_fed_server.conf num_rounds=1 train_timeout=300 \
-f config_fed_client.conf app_config="--dataset_path /tmp/nvflare/data/cifar10" \
-debug

You can go to the terminal to monitor the output log. 

> The CLI argument
> ```
>   app_config="--dataset_path /tmp/nvflare/data/cifar10"
> ```
> will be translated into
> ```
>    python custom/train.py --dataset_path "/tmp/nvflare/data/cifar10"
> ```
> In our case, `train.py` takes `--dataset_path` as an argument. 
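
The translation from `app_script` and `app_config` to a command line can be pictured with the following sketch. The function `build_launch_command` is a hypothetical name used for illustration; it assumes the command has the form `python custom/{app_script} {app_config}` as described in this notebook, and it is not NVFLARE's own code.

```python
import shlex


def build_launch_command(app_script: str, app_config: str) -> str:
    """Hypothetical helper: combine the script name and its argument string."""
    # Split the app_config string the way a shell would, honoring quotes.
    script_args = shlex.split(app_config)
    # Re-join everything into a single, safely quoted command line.
    return shlex.join(["python", f"custom/{app_script}", *script_args])


if __name__ == "__main__":
    print(build_launch_command("train.py", "--dataset_path /tmp/nvflare/data/cifar10"))
    # prints: python custom/train.py --dataset_path /tmp/nvflare/data/cifar10
```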




### Submit Job from CLI in production

Before you try to submit to production, the Job CLI will need to know the location of the admin console startup kit directory. 
In POC mode, this is set for the user automatically. In production, the user will need to set the path to the startup kit for the Job CLI. 

The startup kit path is stored in the `config.conf` file in the nvflare hidden directory at the user's home directory. First you can take a look at this file: 


In [None]:
! cat ~/.nvflare/config.conf
 

You can directly edit the path in the file:
```
    startup_kit {
        path = /tmp/nvflare/poc/example_project/prod_00
    }
```
Alternatively, you can use the following command:

In [None]:
! nvflare config --startup_kit_dir /tmp/nvflare/poc/example_project/prod_00

or

In [None]:
! nvflare config -d /tmp/nvflare/poc/example_project/prod_00

Once the startup kit directory path is set, you can submit the job:


In [None]:
! nvflare job submit -j /tmp/nvflare/my_job \
-f config_fed_server.conf num_rounds=1 \
-f config_fed_client.conf app_config="--dataset_path /tmp/nvflare/data/cifar10"

## Troubleshooting with the `-debug` flag

Since the ```nvflare job submit``` command does not overwrite the job folder configuration during submission, it has to use a temp job folder. 
If you want to check the final configs submitted to the server, or simply want to see the stack trace of an exception, you can use the `-debug` flag. 

With the `-debug` flag, the ``` nvflare job submit ``` command will not delete the temp job folder once it has finished job submission, and it will also print the exception stack trace in case of failure. 
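The temp-folder behavior can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual `nvflare job submit` code: `submit_with_temp_copy` is a hypothetical function, and the `submit` argument stands in for whatever sends the staged job to the server.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def submit_with_temp_copy(job_dir: str, submit: Callable[[str], None], debug: bool = False) -> str:
    """Hypothetical sketch: stage the job in a temp folder and keep it only in debug mode."""
    temp_dir = tempfile.mkdtemp()
    try:
        # Work on a copy so the original job folder is never modified.
        shutil.copytree(job_dir, temp_dir, dirs_exist_ok=True)
        # CLI overrides (for example num_rounds=1) would be applied to the copy here.
        submit(temp_dir)
    finally:
        if not debug:
            # Without -debug the temp job folder is removed after submission.
            shutil.rmtree(temp_dir, ignore_errors=True)
    return temp_dir


if __name__ == "__main__":
    source = tempfile.mkdtemp()
    Path(source, "meta.conf").write_text('name = "my_job"\n')
    kept = submit_with_temp_copy(source, lambda path: None, debug=True)
    print("temp job folder kept for inspection:", Path(kept, "meta.conf").exists())
```

Keeping the folder only in debug mode avoids leaving stale copies behind during normal use while still allowing inspection when something goes wrong.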



In [None]:
! nvflare job submit -j /tmp/nvflare/my_job \
-f config_fed_server.conf num_rounds=1 train_timeout=300 \
-f config_fed_client.conf app_config="--dataset_path /tmp/nvflare/data/cifar10" \
-debug

You should see a statement like the following after the message that the job was submitted (the actual random folder name will vary): 

```
in debug mode, job configurations can be examined in temp job directory '/tmp/tmpdnusoyzj'
```

You can check the job folder with `tree` or `ls -al` 
> note:  the temp folder name can be different on your machine

In [None]:
! tree '/tmp/tmpdnusoyzj'

In [None]:
!cat '/tmp/tmpdnusoyzj/app/config/config_fed_client.conf'

In [None]:
!cat '/tmp/tmpdnusoyzj/app/config/config_fed_server.conf'

You can check whether the configs for the server and clients contain the values you specified.

## Troubleshooting - Client API timeout

If the client API has not received training in 60 seconds, the job will be considered failed with a message like the following:
```
PTFilePipeLauncherExecutor - ERROR - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=db7940f1-d7b4-44e5-b509-dfed4adeb2ec]: received _PEER_GONE_ while waiting for result for train
```

If you need to, you can increase the value for the timeout: 

```
heartbeat_timeout = 120
``` 

## Cleanup

Make sure you shut down the POC system when you are done:

In [None]:
! nvflare poc stop

## Advanced Section

With the above sections, you should understand how to create a job from a job template, modify the configuration as needed (either via the CLI or manually), and submit the job. 
Now, what if you would like to:

* Use different configurations on different clients.
  You could have different datasets on different sites, so the epochs, batch size, learning rate, etc. can differ. 

* Deploy different code pieces to different sites.
  You don't need to deploy all the code everywhere; only certain code is needed at certain locations. 
  
* Add new arguments not in the job templates, or modify a specific config key using its path.

* Remove configuration.

* Modify custom configurations.




In this section, we will discuss how to do this. So far, we have assumed that all sites (server and client sites) have the same code and configuration, so we deploy all the code and configs to all sites with the following meta.conf:

```
name = "my_job"
resource_spec {}
deploy_map {
  app = [
    "@ALL"
  ]
}
min_clients = 2
mandatory_clients = []

```

Notice the **deploy_map** 
```
deploy_map {
  app = [
    "@ALL"
  ]
}

```
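In this `deploy_map`, the app named `app` is mapped to `"@ALL"`. As a rough illustration of what such a map expresses, the sketch below expands a deploy map into a per-site list of apps. The function `expand_deploy_map` and the site names are assumptions made for this example; NVFLARE's own handling of `deploy_map` may differ.

```python
from typing import Dict, List


def expand_deploy_map(deploy_map: Dict[str, List[str]], all_sites: List[str]) -> Dict[str, List[str]]:
    """Hypothetical helper: list which apps end up on which site."""
    site_to_apps: Dict[str, List[str]] = {site: [] for site in all_sites}
    for app, targets in deploy_map.items():
        for target in targets:
            # "@ALL" stands for every known site; anything else names one site.
            sites = all_sites if target == "@ALL" else [target]
            for site in sites:
                if site not in site_to_apps:
                    raise ValueError(f"unknown site '{site}' for app '{app}'")
                site_to_apps[site].append(app)
    return site_to_apps


if __name__ == "__main__":
    # Site names below are assumed for illustration.
    print(expand_deploy_map({"app": ["@ALL"]}, ["server", "site-1", "site-2"]))
    # prints: {'server': ['app'], 'site-1': ['app'], 'site-2': ['app']}
```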
The `"@ALL"` entry means that the app named "app" is deployed to all sites.

### Set up job with different site-specific configurations 

Let's look at a different example:



In [None]:
! nvflare job create \
-j /tmp/nvflare/my_job -w sag_pt_deploy_map 

Here we have three different apps : "app_server", "app_1" and "app_2". 
We would like to change the following: 

* change number of training rounds to 2
* change default app_script from "cifar10.py" to "train.py" for both app_1 and app_2
* change the app_1 batch_size to 4, app_2 batch_size to 6

In [None]:
! nvflare job create \
-j /tmp/nvflare/my_job -w sag_pt_deploy_map \
-f app_server/config_fed_server.conf num_rounds=2 \
-f app_1/config_fed_client.conf app_script=train.py app_config="--batch_size 4" \
-f app_2/config_fed_client.conf app_script=train.py app_config="--batch_size 6" \
-sd ../hello-world/step-by-step/cifar10/code/fl \
-force

Now let's look at the job folder structure. 

In [None]:
!tree /tmp/nvflare/my_job


The job folder consists of three sub-folders, each representing one application: app_server, app_1, app_2. Now look at the meta.conf's deploy_map

In [None]:
!cat /tmp/nvflare/my_job/meta.conf

Notice that app_server is deployed to "server", while "app_1" and "app_2" are deployed to their respective client sites. app_1 and app_2 only need client configurations, and app_server only needs the server configuration. Since the server is not doing the training, we can **remove** train.py from the app_server app and look at the folder again.

In [None]:
!rm /tmp/nvflare/my_job/app_server/custom/train.py

In [None]:
!tree /tmp/nvflare/my_job

Look at the job configuration variables 

In [None]:
! nvflare job show_variables -j /tmp/nvflare/my_job

This shows the same information we saw previously, except that it is shown for each app's configuration. Let's explain a bit more about the command syntax:

```
 nvflare job create \
-j /tmp/nvflare/my_job -w sag_pt_deploy_map \
-f app_server/config_fed_server.conf num_rounds=2 \
-f app_1/config_fed_client.conf app_script=train.py app_config="--batch_size 4" \
-f app_2/config_fed_client.conf app_script=train.py app_config="--batch_size 6" \
-sd ../hello-world/step-by-step/cifar10/code/fl \
-force

```

To specify an app-specific configuration, use

```-f app_server/config_fed_server.conf num_rounds=2 ```

instead of

```
-f config_fed_server.conf num_rounds=2 

```

This tells the command to change only the config for the "app_server" app. Without the "app_server/" prefix, the command assumes the default "app" configuration. 

If the app name is not defined in the job template, the command will show an error. For example:



In [None]:
! nvflare job create \
-j /tmp/nvflare/my_job -w sag_pt_deploy_map \
-f fl_server/config_fed_server.conf num_rounds=2 \
-force

Once you have the job folder, you should be able to run the job as before.


### Add arguments not originally specified in Job Template

In some cases, we need to add arguments that are not defined in the job templates, and we would like to add them to specific args of a certain component. This requires that we specify the path to the component. 

We use the following notations to indicate the path:
* For a single component, we can use dot notation, such as ```model.args.number_classes=2```
* For a component list, we use index notation, such as ```components[1].model.args.number_classes=2```

In the second case, ```components[1]``` indicates the second component in the component list. The first component is ```components[0]```.
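The path notation can be made concrete with a small sketch that parses a key path and sets a value in a nested structure of dicts and lists. The helpers `parse_key_path` and `set_by_path` are hypothetical and written only to illustrate the notation; they are not the Job CLI's implementation.

```python
import re
from typing import Any, List, Union

# A path token is either a name (between dots) or an index such as [1].
_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def parse_key_path(path: str) -> List[Union[str, int]]:
    """Hypothetical helper: 'components[1].model.args' -> ['components', 1, 'model', 'args']."""
    return [name if name else int(index) for name, index in _TOKEN.findall(path)]


def set_by_path(config: Any, path: str, value: Any) -> None:
    """Hypothetical helper: set a value, creating missing intermediate dict keys."""
    tokens = parse_key_path(path)
    node = config
    for token in tokens[:-1]:
        if isinstance(token, int):
            node = node[token]  # a list index must already exist
        else:
            node = node.setdefault(token, {})  # create missing dict levels
    node[tokens[-1]] = value


if __name__ == "__main__":
    config = {"components": [{"id": "first"}, {"id": "second"}]}
    set_by_path(config, "components[1].model.args.number_classes", 2)
    print(config["components"][1])
    # prints: {'id': 'second', 'model': {'args': {'number_classes': 2}}}
```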

Let's look at how we can use this to add to or modify the configuration from a job template. 


In [None]:
! nvflare job create -j /tmp/nvflare/my_job -w sag_pt -force

In [None]:
! cat /tmp/nvflare/my_job/app/config/config_fed_server.conf

Here, we would like to modify the persistor component configuration from 
```
  components = [
    {
      # This is the persistence component used in above workflow.
      # PTFileModelPersistor is a Pytorch persistor which save/read the model to/from file.

      id = "persistor"
      path = "nvflare.app_opt.pt.file_model_persistor.PTFileModelPersistor"

      # the persitor class take model class as argument
      # This imply that the model is initialized from the server-side.
      # The initialized model will be broadcast to all the clients to start the training.
      args.model.path = "{model_class_path}"
    },
```
to: 

```
  components = [
    {
      id = "persistor"
      path = "nvflare.app_opt.pt.file_model_persistor.PTFileModelPersistor"
      args {
        model {
          path = "{model_class_path}"
          args {
            in_channels = 165
            hidden_channels = 256
            num_classes = 2
            num_layers = 3
          }
        }
      }
    },
```
Notice that the `model.args` entries are new keys and values. The model persistor is the first component, so we address it as `components[0]`. 

In [None]:
! nvflare job create -j /tmp/nvflare/my_job -w sag_pt -force \
-f config_fed_server.conf \
components[0].args.model.args.in_channels=165 \
components[0].args.model.args.hidden_channels=256 \
components[0].args.model.args.num_classes=2 \
components[0].args.model.args.num_layers=3

Now, look at the modified configuration again

In [None]:
! cat /tmp/nvflare/my_job/app/config/config_fed_server.conf

### Remove configuration

In some cases, we have changed the constructor of the local training code class, and some arguments from the job template need to be removed. We will show you how to do that. 

Let's take a look at an example: 

In [None]:
! nvflare job create -j /tmp/nvflare/jobs/my_job -w stats_df -force

In [None]:
! cat /tmp/nvflare/jobs/my_job/app/config/config_fed_client.conf 


Notice that the 1st component: 
```
 "components": [
    {
      "id": "df_stats_generator",
      "path": "df_statistics.DFStatistics",
      "args": {
        "data_path": "data.csv"
      }
    },
    
    ...
   ]
```
The df_stats_generator uses the class at "df_statistics.DFStatistics", with the input argument data_path = "data.csv". What if I decide to write my own local class called "df_stats.MyDFStats", which only takes "data_root_dir"? 

Now we need to:
1) change the path of "df_stats_generator" to the new class path
2) remove the data_path configuration
3) add the data_root_dir argument. 

To remove a configuration key, use the **<key->** syntax, i.e. add "-" at the end of the key. The key must exist and must be expressed as a full path, such as
    **"components[0].args.data_path-"**
    
Both
    **"components[0].args.data_path-"** and 
    **"components[0].args.data_path-=value"** work, although the value will be ignored. 
    
    
>>Limitation
    **Key removal only works on leaf keys. We can't remove a parent key such as "components[0].args".**
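
The removal syntax can be illustrated with a short sketch. Both helpers below, `parse_override` and `remove_leaf`, are hypothetical names introduced for this example; they only mirror the rules stated above (trailing "-", ignored value, existing leaf key) and are not the Job CLI's code.

```python
from typing import Optional, Tuple


def parse_override(arg: str) -> Tuple[str, Optional[str], bool]:
    """Hypothetical helper: return (key_path, value, is_removal) for one CLI argument."""
    key, _, value = arg.partition("=")
    if key.endswith("-"):
        # "key-" and "key-=value" both mean removal; any value is ignored.
        return key[:-1], None, True
    return key, value, False


def remove_leaf(parent: dict, key: str) -> None:
    """Hypothetical helper: delete a key only if it exists and is a leaf."""
    if key not in parent:
        raise KeyError(f"key '{key}' does not exist")
    if isinstance(parent[key], (dict, list)):
        raise ValueError(f"'{key}' is not a leaf key and cannot be removed")
    del parent[key]


if __name__ == "__main__":
    component = {"id": "df_stats_generator", "args": {"data_path": "data.csv"}}
    path, _, is_removal = parse_override("components[0].args.data_path-")
    if is_removal:
        remove_leaf(component["args"], "data_path")
    print(path, component)
```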
    
Let's try this out


In [None]:
! nvflare job create -j /tmp/nvflare/jobs/my_job -w stats_df \
-f config_fed_client.conf  \
components[0].path="df_stats.MyDFStats" \
components[0].args.data_path- \
components[0].args.data_root_dir="/tmp/dataset" -force


Now let's look at the configuration file again. Notice that the arguments have changed:

In [None]:
! cat /tmp/nvflare/jobs/my_job/app/config/config_fed_client.conf 


### Modify custom configuration

So far, we have discussed how to create and modify NVFLARE-specific configurations such as meta.conf, config_fed_client.conf and config_fed_server.conf. What about custom configuration needed by the training code? The custom configuration files are located in the custom directory of the app, such as:

* app1/custom/my_config.yaml
* app2/custom/my_config.yaml
* app_server/custom/my_config.yaml
   
In such cases, the config file name is arbitrary, and the file format can be any of JSON, pyhocon, or OmegaConf. We can still modify these files. Before jumping into the specifics, let us see what forms the CLI accepts for specifying the input file. We can specify the config file in one of the following ways: 

```

 -f config_fed_client.conf
 
```

This implies that the input config file defaults to "app/config/config_fed_client.conf". Or 


```

 -f app1/config_fed_client.conf
 
```
This implies that the input config file defaults to "app1/config/config_fed_client.conf". Or we can directly spell out the full path:
 
 
```

 -f app1/config/config_fed_client.conf
 
```
 
Similarly, we can use 
 

```

 -f app1/custom/my_config.yaml
 
```
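The path forms listed above can be summarized in a small sketch. The function `resolve_config_path` is a hypothetical helper that encodes only the rules described in this section (default app "app", default sub-folder "config"); it is not taken from the NVFLARE source.

```python
from pathlib import PurePosixPath


def resolve_config_path(spec: str, default_app: str = "app") -> str:
    """Hypothetical helper: expand a -f file argument to a path inside the job folder."""
    parts = PurePosixPath(spec).parts
    if len(parts) == 1:
        # "config_fed_client.conf" -> "app/config/config_fed_client.conf"
        return str(PurePosixPath(default_app, "config", parts[0]))
    if len(parts) == 2:
        # "app1/config_fed_client.conf" -> "app1/config/config_fed_client.conf"
        return str(PurePosixPath(parts[0], "config", parts[1]))
    # Longer forms such as "app1/custom/my_config.yaml" are used as given.
    return spec


if __name__ == "__main__":
    for spec in (
        "config_fed_client.conf",
        "app1/config_fed_client.conf",
        "app1/config/config_fed_client.conf",
        "app1/custom/my_config.yaml",
    ):
        print(spec, "->", resolve_config_path(spec))
```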

Let's use an example to demonstrate this with job template "sag_nemo"


In [None]:
! nvflare job create -j /tmp/nvflare/my_job -w sag_nemo -force  -sd ../../integration/nemo/examples/peft/code 

In [None]:
! tree /tmp/nvflare/my_job


Notice that we use the example script directory "code". The `nvflare job create` command copied the custom folder to the job folder, and there are many custom configuration YAML files. Notice that "weight_decay" is one of the parameters in megatron_gpt_peft_tuning_config.yaml. Let's change it from 0.01 to 0.02 on app_server. 

In [None]:
! nvflare job create -j /tmp/nvflare/my_job -w sag_nemo -force \
  -sd ../../integration/nemo/examples/peft/code \
  -f app_server/custom/megatron_gpt_peft_tuning_config.yaml weight_decay=0.02

Notice that the weight_decay value in the app_server megatron_gpt_peft_tuning_config.yaml is updated to 0.02. We can also look at the saved file. 

In [None]:
!cat /tmp/nvflare/my_job/app_server/custom/megatron_gpt_peft_tuning_config.yaml | grep weight_decay

This works! Alternatively, you can spell out the full variable path. This can be used to update the exact variable when there are duplicated variables (such as two weight_decay keys under different paths). Let's do it again:

In [None]:
! nvflare job create -j /tmp/nvflare/my_job -w sag_nemo -force \
  -sd ../../integration/nemo/examples/peft/code \
  -f app_server/custom/megatron_gpt_peft_tuning_config.yaml model.optim.weight_decay=0.02

In [None]:
!cat /tmp/nvflare/my_job/app_server/custom/megatron_gpt_peft_tuning_config.yaml | grep weight_decay