# NVFLARE JOB CLI

In this notebook, we will go through the different commands of the Job CLI to show the syntax and usage of each.


## Install NVIDIA FLARE

For this notebook, we will need a running NVFLARE project that we can connect to.
Follow the [Installation](https://nvflare.readthedocs.io/en/main/getting_started.html#installation) 
instructions to set up an environment that has NVIDIA FLARE installed if you do not have one already.

To submit jobs with the Job CLI, you need a running NVFLARE system with a server and clients. You can either run a local system with the ```nvflare poc``` commands or 
use a running production system. 

To see how to set up a local system, please refer to the setup_poc tutorial. 


## Step-by-step Walk-Through: from creating a job to running a job. 

Assume we want to run with CIFAR10 data and have already converted the CIFAR10 PyTorch training code into a 2-client federated learning [program](../hello-world/step-by-step/cifar10/code). To demonstrate the Job CLI, we will use the standard Scatter and Gather (SAG) workflow pattern. 

First, let's see which pre-configured job templates are available to use and modify. 


### Check out the available nvflare job templates
 

#### List Job Templates

The NVFLARE 2.4.0 release introduces job templates: different types of job configurations are provided as templates.  

To list the available templates, you can use the ```nvflare job list_templates``` command


In [1]:
! nvflare job list_templates -d "../../job_templates"


The following job templates are available: 

------------------------------------------------------------------------------------------------------------------------
  name            Description                                                  Controller Type      Client Category     
------------------------------------------------------------------------------------------------------------------------
  sag_cross_np    scatter & gather and cross-site validation using numpy       server               client executor     
  sag_lightning   scatter & gather workflow using lightning                    server               client_api          
  sag_pt          scatter & gather workflow using pytorch                      server               client_api          
  stats_df        FedStats: tabular data with pandas                           server               stats executor      
-----------------------------------------------------------------------------------------------------------

Here -d "<job_templates_dir>" (or --job_template_dir "<job_templates_dir>") is the location of the job templates. 

We can also shorten the command to 
```
! nvflare job list_templates
```

without specifying the -d "<job_templates_dir>" location. When the job templates directory is not specified, the Job CLI looks for it in the following order: 

* If the NVFLARE_HOME environment variable is set, the CLI assumes that you have cloned the GitHub repo and that the job templates are located at 
 
 ```${NVFLARE_HOME}/job_templates```
 
* If the NVFLARE_HOME environment variable is not set, the Job CLI looks at the hidden nvflare config directory 

```
cat ~/.nvflare/config.conf 

startup_kit {
  path = "/tmp/nvflare/poc1/example_project/prod_00"
}
poc_workspace {
  path = "/tmp/nvflare/poc1"
}
job_template {
  path = "../../job_templates"
}

```
and will use the job_template path specified in that file.  

Once you have used -d <job_template_dir> once, ~/.nvflare/config.conf is updated and you don't need to specify -d again. 
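
The lookup order described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not NVFLARE's actual implementation: the helper name `resolve_job_templates_dir` is hypothetical, and the config file is read with a simple regular expression rather than a real HOCON parser.

```python
import os
import re
from typing import Mapping, Optional


def resolve_job_templates_dir(env: Mapping[str, str], config_text: str) -> Optional[str]:
    """Hypothetical helper mirroring the two-step lookup described in the text."""
    # Step 1: NVFLARE_HOME is set, so assume a cloned repo with job_templates inside it.
    nvflare_home = env.get("NVFLARE_HOME")
    if nvflare_home:
        return os.path.join(nvflare_home, "job_templates")

    # Step 2: fall back to the job_template.path entry of ~/.nvflare/config.conf.
    # Simplification: a regex stands in for a proper HOCON parser.
    match = re.search(r'job_template\s*\{\s*path\s*=\s*"?([^"\s}]+)"?', config_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    sample_config = 'job_template {\n  path = "../../job_templates"\n}\n'
    print(resolve_job_templates_dir({}, sample_config))
```

Passing the environment and the config text as arguments keeps the sketch easy to try without touching your real ```~/.nvflare/config.conf```.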

You can also directly edit this file or simply use

```
nvflare config -jt ../../job_templates

```

In [15]:
! nvflare config -jt ../../job_templates 

Now we can list the templates again without the job templates directory argument

In [2]:
! nvflare job list_templates


The following job templates are available: 

------------------------------------------------------------------------------------------------------------------------
  name            Description                                                  Controller Type      Client Category     
------------------------------------------------------------------------------------------------------------------------
  sag_cross_np    scatter & gather and cross-site validation using numpy       server               client executor     
  sag_lightning   scatter & gather workflow using lightning                    server               client_api          
  sag_pt          scatter & gather workflow using pytorch                      server               client_api          
  stats_df        FedStats: tabular data with pandas                           server               stats executor      
-----------------------------------------------------------------------------------------------------------

Once we have found the job template that fits our needs, we can use the template name to create a new job folder.


### Create a job folder

Since our code is written in PyTorch and we would like to try the FedAvg algorithm using the Scatter & Gather (SAG) workflow, the job template **"sag_pt"** is what we are looking for. We will use this template to create our job folder. 

We create a job folder that contains the base job configuration from the template and can then modify it as desired. First, we create the job folder without specifying our code, intending to modify it later.


#### First Try


In [3]:
! nvflare job create -j /tmp/nvflare/my_job -w sag_pt -force



The following are the variables you can change in the template

---------------------------------------------------------------------------------------------------------------------------------------
                                                                                                                                       
  job folder: /tmp/nvflare/my_job                                                                                                        
                                                                                                                                       
---------------------------------------------------------------------------------------------------------------------------------------
  file_name                      var_name                       value                               component                          
---------------------------------------------------------------------------------------------------------------------

The above command creates a job folder at ```/tmp/nvflare/my_job``` from the job template ```sag_pt```. 
We can see that a few configuration files are created, and some of the configurations can be overwritten. Let's take a look at the job folder structure first. 

If you have the ```tree``` command installed (for example, ```sudo apt install tree``` on Debian/Ubuntu), you can use it; otherwise, you can just use ```ls -al```. 

In [4]:
! tree /tmp/nvflare/my_job

/tmp/nvflare/my_job
├── app
│   ├── config
│   │   ├── config_exchange.conf
│   │   ├── config_fed_client.conf
│   │   └── config_fed_server.conf
│   └── custom
└── meta.conf

3 directories, 4 files


In [5]:
! cat /tmp/nvflare/my_job/meta.conf

name = "my_job"
resource_spec {}
deploy_map {
  app = [
    "@ALL"
  ]
}
min_clients = 1
mandatory_clients = []


In [5]:
! cat /tmp/nvflare/my_job/app/config/config_exchange.conf

{
  exchange_path = "./"
  exchange_format =  "pytorch"
  transfer_type =  "DIFF"
}

Notice that the job name is "my_job". config_exchange.conf is used by the Client API: it specifies the data exchange path, sets the exchange format to pytorch, and indicates that the model diff will be transferred. Let's look at the server-side configuration. 


In [6]:
! cat /tmp/nvflare/my_job/app/config/config_fed_server.conf

{
  # version of the configuration
  format_version = 2

  # task data filter: if filters are provided, the filter will filter the data flow out of server to client.
  task_data_filters =[]

  # task result filter: if filters are provided, the filter will filter the result flow out of server to client.
  task_result_filters = []

 # workflows: Array of workflows the control the Federated Learning workflow lifecycle.
 # One can specify mutliple workflows. The NVFLARE will run them in the order specified.
  workflows = [
      {
        # 1st workflow"
        id = "scatter_and_gather"

        # name = ScatterAndGather, path is the class path of the scatterAndGatter controller.
        path = "nvflare.app_common.workflows.scatter_and_gather.ScatterAndGather"
        args {
            # argument of the ScatterAndGather class.
            # min number of clients required for ScatterAndGather controller to move to the next round
            # during the workflow cycle. The controller will

In [7]:
! cat /tmp/nvflare/my_job/app/config/config_fed_client.conf

{
  # version of the configuration
  format_version = 2

  # This is application scripts which will be invoked.
  # Client can replaced this script with user's own traning script.
  app_script = "cifar10.py"

  # Additional arguments needed by the training code. For example, in lightning, these can be --trainer.batch_size=xxx.
  app_config = ""

  # Client Computing Executors.
  executors = [
    {
      # tasks the executors are defined to handle
      tasks = ["train"]

      # This particular executor
      executor {

        # Eexecutor name : PTFilePipeLauncherExecutor
        # This is an executor for pytorch. The underline data exchange is using FilePipe.
        path = "nvflare.app_opt.pt.file_pipe_launcher_executor.PTFilePipeLauncherExecutor"

        args {

          # This executor take an component named "launcher"
          launcher_id = "launcher"

          heartbeat_timeout = 60
        }
      }
    }
  ],

  # this defined an arru of task data filters. If provided, 

> Note that both the client and server configurations are commented with explanations. 
> However, if you create the job with customizations such as -f or other configuration overrides, the configuration files are rewritten and the comments are lost in the final files. 

### Show variables

The job folder has now been created for us with pre-defined configurations. We need to make sure this template works for our code and see which variables can be updated. Let's check the variables again with the following command

In [15]:
! nvflare job show_variables -j /tmp/nvflare/my_job


The following are the variables you can change in the template

---------------------------------------------------------------------------------------------------------------------------------------
                                                                                                                                       
  job folder: /tmp/nvflare/my_job                                                                                                        
                                                                                                                                       
---------------------------------------------------------------------------------------------------------------------------------------
  file_name                      var_name                       value                               component                          
---------------------------------------------------------------------------------------------------------------------

You can see there are many variables we might want to change.  

* I want to change num_rounds to 1 to test in the simulator first.
* I also want to use my own cifar10 code, which is already written against the FLARE 2.4.0 Client API.

Let's do a second try.

#### Second Try



In [7]:
! nvflare job create -j /tmp/nvflare/my_job -force -w sag_pt -f config_fed_server.conf num_rounds=1 -s ../hello-world/step-by-step/cifar10/code/fl/train.py -sd ../hello-world/step-by-step/cifar10/code/fl


The following are the variables you can change in the template

---------------------------------------------------------------------------------------------------------------------------------------
                                                                                                                                       
  job folder: /tmp/nvflare/my_job                                                                                                        
                                                                                                                                       
---------------------------------------------------------------------------------------------------------------------------------------
  file_name                      var_name                       value                               component                          
---------------------------------------------------------------------------------------------------------------------

Now num_rounds is set to 1 and app_script is "train.py". The Python script is invoked as ```python custom/{app_script}```, which is essentially ```python custom/train.py```. Now, let's 
take a look at the job folder structure again. 

In [8]:
! tree /tmp/nvflare/my_job


/tmp/nvflare/my_job
├── app
│   ├── config
│   │   ├── config_exchange.conf
│   │   ├── config_fed_client.conf
│   │   └── config_fed_server.conf
│   └── custom
│       ├── net.py
│       └── train.py
└── meta.conf

3 directories, 6 files


Notice that the code we wrote has been copied to the job directory. 

Notice that in config_fed_server.conf, ```PTFileModelPersistor``` is a file-based persistor for PyTorch. It requires the net.Net class, which is used to initialize the model and to save the final model.
The file "net.py" matches this configuration. If your model file name and class name do not match net.Net, you will need to update the configuration accordingly. 
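
Before running the job, it can be handy to confirm that the configured class path actually resolves against the ```custom``` folder. The following is a minimal sketch of such a check; `load_model_class` is a hypothetical helper written for this notebook, not part of NVFLARE, and it assumes the class path has the form `module.ClassName`, as in `net.Net`.

```python
import importlib
import sys
from pathlib import Path


def load_model_class(custom_dir: str, class_path: str = "net.Net"):
    """Hypothetical check: import `module.ClassName` with `custom_dir` on sys.path."""
    module_name, _, class_name = class_path.rpartition(".")
    search_path = str(Path(custom_dir).resolve())
    sys.path.insert(0, search_path)
    try:
        importlib.invalidate_caches()
        module = importlib.import_module(module_name)  # ImportError if the file is missing
        return getattr(module, class_name)             # AttributeError if the class is missing
    finally:
        # Restore sys.path so the check has no lasting side effects.
        sys.path.remove(search_path)


if __name__ == "__main__":
    # Path taken from the job folder created earlier in this notebook.
    print(load_model_class("/tmp/nvflare/my_job/app/custom", "net.Net"))
```

An `ImportError` or `AttributeError` here suggests that the model path in the server configuration and the files in ```custom``` are out of sync.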

We will leave the rest of the values at their defaults and try to run the job. 


### Download the data

Let's download the data first to avoid repeated downloads. We can simply use the existing download script:


In [12]:
! python ../../examples/hello-world/step-by-step/cifar10/data/download.py

Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to /tmp/nvflare/data/cifar10/cifar-10-python.tar.gz
100.0%
Extracting /tmp/nvflare/data/cifar10/cifar-10-python.tar.gz to /tmp/nvflare/data/cifar10


### Run the Job in simulator 

We can first run the job in the simulator to see if there are any issues. 




In [14]:
! nvflare simulator /tmp/nvflare/my_job -w /tmp/my_job

2023-08-25 21:55:58,902 - SimulatorRunner - INFO - Create the Simulator Server.
2023-08-25 21:55:58,903 - CoreCell - INFO - server: creating listener on tcp://0:50369
2023-08-25 21:55:58,913 - CoreCell - INFO - server: created backbone external listener for tcp://0:50369
2023-08-25 21:55:58,913 - ConnectorManager - INFO - 33155: Try start_listener Listener resources: {'secure': False, 'host': 'localhost'}
2023-08-25 21:55:58,913 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00002 PASSIVE tcp://0:58750] is starting
2023-08-25 21:55:59,414 - CoreCell - INFO - server: created backbone internal listener for tcp://localhost:58750
2023-08-25 21:55:59,414 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00001 PASSIVE tcp://0:50369] is starting
2023-08-25 21:55:59,463 - nvflare.fuel.hci.server.hci - INFO - Starting Admin Server localhost on Port 35609
2023-08-25 21:55:59,463 - SimulatorRunner - INFO - Deploy the Apps.
2023-08-25 21:55:59,465 - SimulatorRunner - INFO - Create

* If this doesn't work for you, let's figure out what it is complaining about and make the additional changes needed.  
* If this works for you, we can move to the next step. 

For now, let's assume the simulator works for you and you would like to try a real deployment, running locally in POC mode. To do this, let's first recreate the job configuration with a larger number of rounds: num_rounds=100.



In [24]:
! nvflare job create -j /tmp/nvflare/my_job -force -w sag_pt -f config_fed_server.conf num_rounds=100 -s ../hello-world/step-by-step/cifar10/code/fl/train.py -sd ../hello-world/step-by-step/cifar10/code/fl


The following are the variables you can change in the template

---------------------------------------------------------------------------------------------------------------------------------------
                                                                                                                                       
  job folder: /tmp/nvflare/my_job                                                                                                        
                                                                                                                                       
---------------------------------------------------------------------------------------------------------------------------------------
  file_name                      var_name                       value                               component                          
---------------------------------------------------------------------------------------------------------------------


### Setup and Start POC mode

From a terminal, run

```
   nvflare poc prepare -n 2
   nvflare poc start -ex admin@nvidia.com
```
Here we prepare a POC workspace with n = 2 clients, then start the POC server and clients, excluding the FLARE Admin Console (user name = 'admin@nvidia.com'). Since we are going to submit the job from the CLI, we don't need the admin console. Once the system has started, we are ready to move to the next step: submitting the job.


### Submit Job from CLI

We can use the following command to submit the job directly from the command line. 

Even though num_rounds = 100 in the job folder, I want to start with a small number of rounds without changing that value in the job folder. 

Also, instead of relying on the default dataset_path, we are going to specify the dataset_path from the command line. 

Lastly, I want to change ```train_timeout``` to 300 seconds instead of 0, which means never time out. 

In [37]:
! nvflare job submit -j /tmp/nvflare/my_job -f config_fed_server.conf num_rounds=1 train_timeout=300 -a dataset_path="/tmp/nvflare/data/cifar10"

['dataset_path=/tmp/nvflare/data/cifar10']
trying to connect to the server
job: 'de707e28-4790-41f9-aba8-45ae405b3e01 was submitted
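
The ```-f config_fed_server.conf num_rounds=1 train_timeout=300``` part of the command is a file name followed by ```key=value``` overrides. To make the idea concrete, here is a minimal sketch of applying such overrides to a nested configuration. It is an illustration only: `parse_overrides` and `apply_overrides` are hypothetical names, the configuration is modelled as plain Python dicts and lists, and this is not how the Job CLI is necessarily implemented.

```python
def parse_overrides(tokens):
    """Parse ["num_rounds=1", "train_timeout=300"] into a dict, converting integers."""
    overrides = {}
    for token in tokens:
        key, sep, raw = token.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {token!r}")
        try:
            overrides[key] = int(raw)
        except ValueError:
            overrides[key] = raw  # keep non-numeric values as strings
    return overrides


def apply_overrides(config, overrides):
    """Replace matching leaf values anywhere in a nested dict/list; return the count."""
    replaced = 0
    if isinstance(config, dict):
        for key, value in config.items():
            if key in overrides and not isinstance(value, (dict, list)):
                config[key] = overrides[key]
                replaced += 1
            else:
                replaced += apply_overrides(value, overrides)
    elif isinstance(config, list):
        for item in config:
            replaced += apply_overrides(item, overrides)
    return replaced


if __name__ == "__main__":
    # A trimmed-down stand-in for config_fed_server.conf.
    server_config = {
        "workflows": [
            {"id": "scatter_and_gather", "args": {"num_rounds": 100, "train_timeout": 0}}
        ]
    }
    apply_overrides(server_config, parse_overrides(["num_rounds=1", "train_timeout=300"]))
    print(server_config["workflows"][0]["args"])
```

The overrides are applied to an in-memory copy here, which matches the behaviour described in this notebook: the values stored in the job folder stay unchanged.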


Go to the terminal where the POC system is running to monitor the output log. 

> Note: 
> -a or --app_config specifies the arguments passed to the training script. 

> The CLI argument
> ```
>   -a dataset_path="/tmp/nvflare/data/cifar10"
> ```
> will be translated into 

> ```
>    python custom/train.py --dataset_path "/tmp/nvflare/data/cifar10"
> ```
> In our case, train.py takes --dataset_path as an argument. 
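
The translation in the note can be mimicked with a short sketch. It assumes the launcher's ```script``` template has the form ```python custom/{app_script}  {app_config} ```, as it appears in the generated client configuration; the helper names `app_config_from_cli` and `expand_launcher_script` are hypothetical and only illustrate the string substitution.

```python
import shlex


def app_config_from_cli(pairs):
    """Turn ["key=value", ...] into "--key value ..." (illustrative only)."""
    parts = []
    for pair in pairs:
        key, _, value = pair.partition("=")
        parts.append(f"--{key} {value}")
    return " ".join(parts)


def expand_launcher_script(template, app_script, app_config):
    """Fill the {app_script} and {app_config} placeholders of the launcher template."""
    return template.format(app_script=app_script, app_config=app_config)


app_config = app_config_from_cli(["dataset_path=/tmp/nvflare/data/cifar10"])
command = expand_launcher_script(
    "python custom/{app_script}  {app_config} ", "train.py", app_config
)
print(shlex.split(command))
# prints ['python', 'custom/train.py', '--dataset_path', '/tmp/nvflare/data/cifar10']
```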




### Submit Job from CLI in Production

Before submitting to production, the Job CLI needs to know the location of the admin console startup kit directory. 
In POC mode, this is set for the user automatically; in production, the user needs to tell the Job CLI. 

First you can take a look at this file: 


In [1]:
! cat ~/.nvflare/config.conf
 


    startup_kit {
        path = /tmp/nvflare/poc/example_project/prod_00
    }

    poc_workspace {
        path = /tmp/nvflare/poc
    }
    

You can directly edit the path
```
    startup_kit {
        path = /tmp/nvflare/poc/example_project/prod_00
    }
```
or use the following command 

In [2]:
! nvflare config --startup_kit_dir  /tmp/nvflare/poc/example_project/prod_00

or shorter form

In [3]:
! nvflare config -d  /tmp/nvflare/poc/example_project/prod_00

Once the startup kit directory path is set, we can submit the job.


In [None]:
! nvflare job submit -j /tmp/nvflare/my_job -f config_fed_server.conf num_rounds=1 -a dataset_path="/tmp/nvflare/data/cifar10"

## Troubleshooting: -debug flag

Because the ```nvflare job submit``` command should not overwrite the job folder configuration during submission, it uses a temporary job folder. 
If you want to check the final configs submitted to the server, or simply want to see the stack trace of an exception, you can use the -debug flag. 

With the -debug flag, the ```nvflare job submit``` command does not delete the temporary job folder once the submission has finished, and it also prints the exception stack trace in case of failure. 
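
The temp-folder behaviour described above follows a common pattern, sketched below. This is not the Job CLI's code: `stage_job_for_submit` and its `submit` callback are hypothetical, and the sketch assumes Python 3.8+ for `dirs_exist_ok`.

```python
import shutil
import tempfile
import traceback


def stage_job_for_submit(job_folder, submit=None, debug=False):
    """Copy the job folder to a temp dir, submit from there, keep the copy only in debug."""
    temp_dir = tempfile.mkdtemp()
    try:
        # Work on a copy so the original job folder is never modified.
        shutil.copytree(job_folder, temp_dir, dirs_exist_ok=True)
        # Command-line overrides would be written into the copy at this point.
        if submit is not None:
            submit(temp_dir)
    except Exception:
        if debug:
            traceback.print_exc()  # full stack trace only in debug mode
        raise
    finally:
        if debug:
            print(f"in debug mode, job configurations can be examined in '{temp_dir}'")
        else:
            shutil.rmtree(temp_dir, ignore_errors=True)
    return temp_dir
```

Keeping the staged copy only in debug mode avoids leaving stale folders behind in normal use, while still allowing the submitted configuration to be inspected when something goes wrong.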



In [38]:
! nvflare job submit -j /tmp/nvflare/my_job -f config_fed_server.conf num_rounds=1 train_timeout=300 -a dataset_path="/tmp/nvflare/data/cifar10" -debug

['dataset_path=/tmp/nvflare/data/cifar10']
trying to connect to the server
job: '92f84b0a-246a-4aae-904a-78bd903e14b4 was submitted
in debug mode, job configurations can be examined in temp job directory '/tmp/tmpdnusoyzj'


See the statement: 

```
in debug mode, job configurations can be examined in temp job directory '/tmp/tmpdnusoyzj'
```

We can check the job folder with ```tree``` or ```ls -al```.

In [39]:
! tree '/tmp/tmpdnusoyzj'

/tmp/tmpdnusoyzj
├── app
│   ├── config
│   │   ├── config_exchange.conf
│   │   ├── config_fed_client.conf
│   │   └── config_fed_server.conf
│   └── custom
│       ├── net.py
│       └── train.py
└── meta.conf

3 directories, 6 files


In [40]:
!cat '/tmp/tmpdnusoyzj/app/config/config_fed_client.conf'

format_version = 2
app_script = "train.py"
app_config = "--dataset_path /tmp/nvflare/data/cifar10"
executors = [
  {
    tasks = [
      "train"
    ]
    executor {
      path = "nvflare.app_opt.pt.file_pipe_launcher_executor.PTFilePipeLauncherExecutor"
      args {
        launcher_id = "launcher"
        heartbeat_timeout = 60
      }
    }
  }
]
task_data_filters = []
task_result_filters = []
components = [
  {
    id = "launcher"
    path = "nvflare.app_common.launchers.subprocess_launcher.SubprocessLauncher"
    args {
      script = "python custom/{app_script}  {app_config} "
    }
  }
]


In [41]:
!cat '/tmp/tmpdnusoyzj/app/config/config_fed_server.conf'

format_version = 2
task_data_filters = []
task_result_filters = []
workflows = [
  {
    id = "scatter_and_gather"
    path = "nvflare.app_common.workflows.scatter_and_gather.ScatterAndGather"
    args {
      min_clients = 2
      num_rounds = 1
      start_round = 0
      wait_time_after_min_received = 0
      aggregator_id = "aggregator"
      persistor_id = "persistor"
      shareable_generator_id = "shareable_generator"
      train_task_name = "train"
      train_timeout = 300
    }
  }
]
components = [
  {
    id = "persistor"
    path = "nvflare.app_opt.pt.file_model_persistor.PTFileModelPersistor"
    args {
      model {
        path = "net.Net"
      }
    }
  }
  {
    id = "shareable_generator"
    path = "nvflare.app_common.shareablegenerators.full_model_shareable_generator.FullModelShareableGenerator"
    args {}
  }
  {
    id = "aggregator"
    path = "nvflare.app_common.aggregators.intime_accumulate_model_aggregator.InTimeAccumulateWeightedAggregator"
    args {
      

You can see that the server and client configs indeed contain the values we specified. 

## Troubleshooting: Client API timeout

If the executor gets no response from the Client API training process within the heartbeat timeout (60 seconds in this template), the job is considered failed, with an error such as: 
```
PTFilePipeLauncherExecutor - ERROR - [identity=site-1, run=simulate_job, peer=simulator_server, peer_run=simulate_job, task_name=train, task_id=db7940f1-d7b4-44e5-b509-dfed4adeb2ec]: received _PEER_GONE_ while waiting for result for train
```

You can raise the timeout in config_fed_client.conf: 

```
heartbeat_timeout = 120
``` 
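
For intuition about what a heartbeat timeout does, here is a generic sketch of a heartbeat-based liveness check. It is not NVFLARE's implementation; the `HeartbeatMonitor` class and its methods are hypothetical and only illustrate why a slow or silent peer is reported as gone once the timeout elapses.

```python
import time


class HeartbeatMonitor:
    """Generic liveness check: the peer is 'gone' if no heartbeat arrives within `timeout`."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout          # seconds, e.g. heartbeat_timeout = 120
        self._clock = clock             # injectable clock makes the class easy to test
        self._last_seen = clock()

    def beat(self):
        """Record that a heartbeat was just received from the peer."""
        self._last_seen = self._clock()

    def peer_gone(self):
        """Return True once more than `timeout` seconds passed since the last heartbeat."""
        return self._clock() - self._last_seen > self.timeout
```

With a larger timeout, a training process that is slow to start or to report back has more time before it is treated as gone.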

## Cleanup

Make sure you shut down the POC system once you are done (for example, with ```nvflare poc stop```).