# Jobflow-remote introduction

Jobflow-remote is a free, open-source library serving as a manager for the execution of [jobflow](https://materialsproject.github.io/jobflow/) workflows. Most of the information about how to set up and use jobflow-remote can be found in the [official documentation](https://matgenix.github.io/jobflow-remote/index.html).

## Setting up jobflow-remote

While the first step in using jobflow-remote is to [set up the configuration file for a project](https://matgenix.github.io/jobflow-remote/user/install.html), in this tutorial the configuration has already been done. Nonetheless, it may be helpful to explore the content of the generated configuration YAML file. Its default location is the `~/.jfremote` folder.
A few important sections that you might want to check are:
* `name`: the name of the project (while it is advisable to give the configuration file the same name as the project, this is not required)
* `workers`: the workers available for this project. In this case four workers are available:
  * `cecam`: submits Jobs to the SLURM queue of the Helvetios cluster
  * `cecam_fe`: runs Jobs on the front end of the Helvetios cluster
  * `local_slurm`: submits Jobs to the SLURM queue of the `slurm` container
  * `local_shell`: runs Jobs directly in the `jupyter` container
* `queue`: the connection details to the MongoDB database that stores the information about Job execution
* `jobstore`: the connection details to jobflow's `JobStore`, which contains the Jobs' outputs
* `exec_config`: common configurations required to execute Jobs on the workers
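To make the list above more concrete, here is a sketch of the configuration layout written as a Python dictionary. Only the top-level keys come from the list above; the nested fields (`type`, `scheduler_type`, `work_dir`, `host`, the store settings) and all values are illustrative assumptions, so refer to your own YAML file and the jobflow-remote documentation for the actual schema. The two helper functions are hypothetical and only inspect the dictionary.

```python
# Sketch of a project configuration as a Python dict.
# Only the top-level keys mirror the sections listed above; nested fields
# and values are illustrative assumptions, not the real schema.
project_config = {
    "name": "PROJECTNAME",
    "workers": {
        "local_shell": {"type": "local", "scheduler_type": "shell", "work_dir": "/path/to/jobs"},
        "local_slurm": {"type": "remote", "scheduler_type": "slurm", "host": "slurm", "work_dir": "/path/to/jobs"},
    },
    "queue": {"store": {"type": "MongoStore", "database": "db_name", "collection_name": "jobs"}},
    "jobstore": {"docs_store": {"type": "MongoStore", "database": "db_name", "collection_name": "outputs"}},
    "exec_config": {},
}

# The sections discussed in the list above.
EXPECTED_SECTIONS = ("name", "workers", "queue", "jobstore", "exec_config")


def missing_sections(config):
    """Return the expected top-level sections absent from `config`."""
    return [key for key in EXPECTED_SECTIONS if key not in config]


def workers_by_scheduler(config):
    """Group worker names by their (assumed) `scheduler_type` field."""
    grouped = {}
    for name, worker in config.get("workers", {}).items():
        grouped.setdefault(worker.get("scheduler_type", "unknown"), []).append(name)
    return grouped


print(missing_sections(project_config))      # []
print(workers_by_scheduler(project_config))  # {'shell': ['local_shell'], 'slurm': ['local_slurm']}
```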

In [None]:
!cat /home/jovyan/.jfremote/PROJECTNAME.yaml # Replace PROJECTNAME with your chosen project name

## The `jf` CLI

Most of the interactions with jobflow-remote can be performed using the `jf` command line interface. It is usually executed from the shell, but can also be executed inside this notebook by prepending an exclamation mark `!` to the command.

A first step is to verify that the connections have been properly set up using the `jf project check` command. This attempts to connect to the workers and the databases and performs a few actions. Passing the checks does not guarantee that everything will work, but if a check fails, the corresponding connection details need to be revised.
The output should look like:
```
✓ Worker cecam
✓ Worker cecam_fe
✓ Worker local_slurm
✓ Worker local_shell
✓ Jobstore
✓ Queue store
```
(Note that if you do not have access to the Helvetios cluster, the checks for `cecam` and `cecam_fe` will fail.)

In [None]:
!jf project check
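When the list of workers grows, it can be handy to filter the check output programmatically. The hypothetical helper below assumes that each check is printed on its own line, prefixed by a status symbol as in the sample output above, and returns every non-empty line that is not marked with "✓". The failing line in the sample (with a "✗" marker) is made up for illustration.

```python
def failed_checks(output):
    """Return the non-empty lines of `jf project check` output not marked with "✓".

    Assumption: each check is printed on its own line, prefixed by a status
    symbol, as in the sample output shown in the text.
    """
    return [
        line.strip()
        for line in output.splitlines()
        if line.strip() and not line.strip().startswith("✓")
    ]


# Made-up sample: the "✗" line stands in for a failing check.
sample = "✓ Worker local_slurm\n✓ Worker local_shell\n✗ Worker cecam\n✓ Jobstore\n"
print(failed_checks(sample))  # ['✗ Worker cecam']
```

In the notebook, the output of a shell command can be captured with IPython's `check_output = !jf project check` and passed as `failed_checks("\n".join(check_output))`.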

Once the connections are properly set up, the queue database needs to be prepared with the `jf admin reset` command. This adds a few required documents to the DB and **removes all the Jobs and Flows** present in the database (here we use the `-f` option to skip the confirmation prompt).

In [None]:
!jf admin reset -f

It is now possible to check that no Jobs or Flows are present in the DB:

In [None]:
!jf job list

In [None]:
!jf flow list

## The first Flow

To start using jobflow-remote, a Flow needs to be created and added to the DB. Several example Jobs can be found in the `jobflow_remote.testing` module (see the [source code](https://github.com/Matgenix/jobflow-remote/blob/develop/src/jobflow_remote/testing/__init__.py)) and can be used to compose a simple Flow.

<div class="alert alert-block alert-warning"><b>Warning:</b> To be executed by jobflow-remote, the Job source code must be <b>available on the worker as well</b>. Thus, a simple Job whose source is defined in the notebook will not be present on the worker, and jobflow-remote will not be able to run it.</div>
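A quick way to see why notebook-defined Jobs fail is to look at the module a function belongs to. The sketch below assumes that the worker locates the Job function through its module path and therefore cannot import anything defined in `__main__`, which is the module of every function defined in a notebook cell. `worker_can_import` is a hypothetical helper: it only tests importability in the Python environment where it runs, so for a meaningful answer it should be run with the worker's environment.

```python
import importlib.util
import json


def worker_can_import(func):
    """Hypothetical check: can `func` be re-imported from its module path?

    Assumption: the worker imports the Job function by module name, so
    functions defined in a notebook (module "__main__") are never found.
    """
    module_name = func.__module__
    if module_name == "__main__":
        return False
    try:
        return importlib.util.find_spec(module_name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package of a dotted module name is missing.
        return False


print(worker_can_import(json.dumps))  # True: json is part of the standard library
```

For any function defined directly in a notebook cell, the helper returns `False`, because its `__module__` is `"__main__"`.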


In [None]:
from jobflow import Flow
from jobflow_remote.testing import add

j1 = add(1, 2)
j2 = add(j1.output, 4)

flow1 = Flow([j1, j2])
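Note that `j2` receives `j1.output`, a reference to a value that does not exist yet, so `j2` can only run after `j1` has completed. The toy sketch below mimics this with the standard-library `graphlib`. `Ref`, `plain_add`, `toy_jobs` and `run_in_order` are hypothetical names, and this is not how jobflow resolves its output references internally; it only illustrates dependency-ordered execution.

```python
from graphlib import TopologicalSorter


class Ref:
    """Hypothetical stand-in for a reference to another job's output."""

    def __init__(self, job_name):
        self.job_name = job_name


def plain_add(a, b):
    return a + b


# Toy version of flow1: each entry is (function, arguments).
# A Ref argument means "the output of that job", which creates a dependency.
toy_jobs = {
    "j1": (plain_add, (1, 2)),
    "j2": (plain_add, (Ref("j1"), 4)),
}


def run_in_order(jobs):
    """Run jobs so that every job runs after the jobs it references."""
    # Map each job to the set of jobs it depends on (its predecessors).
    graph = {
        name: {arg.job_name for arg in args if isinstance(arg, Ref)}
        for name, (_, args) in jobs.items()
    }
    outputs = {}
    for name in TopologicalSorter(graph).static_order():
        func, args = jobs[name]
        # Replace references with the already computed outputs.
        resolved = [outputs[arg.job_name] if isinstance(arg, Ref) else arg for arg in args]
        outputs[name] = func(*resolved)
    return outputs


print(run_in_order(toy_jobs))  # {'j1': 3, 'j2': 7}
```

The names avoid shadowing `add`, `j1` and `j2`, so running this cell does not interfere with the rest of the notebook.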

At this point the Flow has been created, but it has not been inserted in the jobflow-remote database yet. To do so, use the `submit_flow` function and choose a worker that will execute the Jobs. For this first example we will use the local worker `local_shell`. The function returns a list of the unique ids (`db_id`) that identify the Jobs in the database (note that these are strings, even if they usually represent integers).

In [None]:
from jobflow_remote import submit_flow

job_ids = submit_flow(flow1, worker="local_shell")
job_ids
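Because the ids are strings, sorting them directly gives lexicographic order, which differs from numeric order as soon as the ids reach two digits. A minimal illustration with made-up ids (a new variable, so `job_ids` is left untouched):

```python
# Hypothetical ids, not the ones returned by submit_flow above.
example_ids = ["10", "2", "1"]

print(sorted(example_ids))           # ['1', '10', '2']  (lexicographic string order)
print(sorted(example_ids, key=int))  # ['1', '2', '10']  (numeric order)
```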

It is now possible to check the presence of the Jobs and the Flow in the database, again using the `jf` command line:

In [None]:
!jf job list

In [None]:
!jf flow list

## The Runner

Once the Jobs are in the database, jobflow-remote can execute them, but only after the Runner has been started.
In jobflow-remote the Runner refers to one or more processes that handle the whole execution of the jobflow workflows, including the interaction with the worker and the writing of the outputs in the `JobStore`.

![Runner schema](https://matgenix.github.io/jobflow-remote/_images/daemon_schema.svg)

The Runner performs different tasks, mainly divided into:
1. checking out Jobs from the database to start a Flow execution
2. updating the states of the Jobs in the queue database
3. interacting with the worker hosts to upload/download files and check the job status
4. inserting the output data in the output JobStore
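The tasks above can be pictured as a sequence of state transitions for each Job. The sketch below is a simplified happy path: the state names are modeled on those jobflow-remote displays in `jf job list`, but the exact sequence and the mapping to the numbered tasks are assumptions made for illustration. Failures, retries, and batch submission are ignored.

```python
from enum import Enum


class State(Enum):
    READY = "READY"
    CHECKED_OUT = "CHECKED_OUT"
    UPLOADED = "UPLOADED"
    SUBMITTED = "SUBMITTED"
    RUNNING = "RUNNING"
    TERMINATED = "TERMINATED"
    DOWNLOADED = "DOWNLOADED"
    COMPLETED = "COMPLETED"


# Simplified happy path: state -> (next state, task from the list above).
TRANSITIONS = {
    State.READY: (State.CHECKED_OUT, "1. check out the Job from the database"),
    State.CHECKED_OUT: (State.UPLOADED, "3. upload the Job files to the worker"),
    State.UPLOADED: (State.SUBMITTED, "3. submit the Job on the worker"),
    State.SUBMITTED: (State.RUNNING, "3. check the Job status on the worker"),
    State.RUNNING: (State.TERMINATED, "3. check the Job status on the worker"),
    State.TERMINATED: (State.DOWNLOADED, "3. download the output files"),
    State.DOWNLOADED: (State.COMPLETED, "4. insert the outputs in the JobStore"),
}


def happy_path(start=State.READY):
    """Follow TRANSITIONS from `start`; return (old, new, task) tuples."""
    state, steps = start, []
    while state in TRANSITIONS:
        next_state, task = TRANSITIONS[state]
        # Task 2 of the list: each transition is written to the queue database.
        steps.append((state.value, next_state.value, task))
        state = next_state
    return steps


for old, new, task in happy_path():
    print(f"{old:>11} -> {new:<11} {task}")
```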

The Runner can be activated with the `jf runner start` command. Its status can be checked with `jf runner status` and `jf runner info`, and it can be stopped with `jf runner stop` or `jf runner shutdown`. The Runner **remains active until explicitly stopped**.

In [None]:
!jf runner start

In [None]:
!jf runner status

In [None]:
!jf runner info

If everything is running properly, you can check the status of the Jobs in the queue. They should reach the `COMPLETED` state in a few seconds.

In [None]:
!jf job list
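Instead of re-running `jf job list` by hand, a small polling loop can wait for the Jobs to finish. This is a sketch: `get_states` is a hypothetical callable you would have to provide (for example, one that parses the output of `jf job list`), assumed to return a list of state names.

```python
import time


def wait_until_all(get_states, target="COMPLETED", timeout=60.0, interval=2.0):
    """Poll `get_states()` until all returned states equal `target`.

    `get_states` is a hypothetical callable returning a list of state names
    (e.g. parsed from `jf job list`). Returns True on success, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        states = get_states()
        if states and all(state == target for state in states):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Usage with a fake source of states that completes on the third poll.
fake_polls = iter([["READY", "WAITING"], ["COMPLETED", "RUNNING"], ["COMPLETED", "COMPLETED"]])
print(wait_until_all(lambda: next(fake_polls), interval=0.1))  # True
```

A Job that ends up in a failed state would make this loop wait until the timeout, so a real helper would probably also stop on failure states.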

## Extracting results

Once the Jobs are completed, it is possible to extract their outputs. Jobflow-remote does not change this aspect of jobflow, so the results can still be extracted from the database using the `JobStore` object or direct database queries. The only difference is that jobflow-remote provides a properly configured `JobStore` instance through the helper function `get_jobstore`. Remember to **connect** the JobStore before using it.

In [None]:
from jobflow_remote import get_jobstore

jobstore = get_jobstore()
jobstore.connect()

Queries are the same as in standard jobflow, and results can be obtained with generic queries or by referring to the Job `uuid`. It may also be convenient to use jobflow-remote's `db_id`, which is stored in the output's `metadata`:

In [None]:
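print("Output from uuid: ", jobstore.get_output(uuid=j2.uuid))
# db_id values are strings; here "2" should be the db_id of j2, the second Job inserted after the reset
print("Output document from db_id: ", jobstore.query_one({"metadata.db_id": "2"}))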
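The `{"metadata.db_id": "2"}` query uses MongoDB dot notation to match a field inside the nested `metadata` document. The in-memory sketch below reproduces that equality matching on plain dictionaries, to show how the dotted key is resolved and why the type of the value matters (`"2"` and `2` are different values). The helper names and the documents are made up, and the documents are trimmed to a few fields; real output documents contain more.

```python
def get_dotted(doc, dotted_key):
    """Follow a MongoDB-style dotted key such as "metadata.db_id" into nested dicts."""
    value = doc
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def query_one_in_memory(docs, query):
    """Return the first document whose fields equal all query values (equality only)."""
    for doc in docs:
        if all(get_dotted(doc, key) == value for key, value in query.items()):
            return doc
    return None


# Hypothetical output documents, trimmed to the fields used here.
toy_docs = [
    {"uuid": "uuid-of-j1", "output": 3, "metadata": {"db_id": "1"}},
    {"uuid": "uuid-of-j2", "output": 7, "metadata": {"db_id": "2"}},
]
print(query_one_in_memory(toy_docs, {"metadata.db_id": "2"})["output"])  # 7
print(query_one_in_memory(toy_docs, {"metadata.db_id": 2}))  # None: int 2 does not equal "2"
```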

## Submitting to a queue based worker 

Since the previous Flow was executed on the local worker, which runs Jobs directly through the shell, no further information was needed. However, real simulations are usually submitted to the queueing systems of HPC centers (e.g. Slurm, PBS, ...), and this requires specifying the resources to be used. Let's create a new Flow and submit it to the `local_slurm` worker (or to the `cecam` one). Since this worker uses the Slurm scheduler, some of the standard Slurm input options can be used.

<div class="alert alert-block alert-info">
The handling of the submission to the scheduler is delegated to <a href="https://matgenix.github.io/qtoolkit/">qtoolkit</a>. The available keywords can be checked in the <a href="https://github.com/Matgenix/qtoolkit/blob/bcb445b903f3cb78295aa7641944e0bade9a3fb8/src/qtoolkit/io/slurm.py#L150">Slurm template</a>.
</div>

In [None]:
j3 = add(1, 2)
j4 = add(j3.output, 4)

flow2 = Flow([j3, j4])

submit_flow(flow2, worker="local_slurm", resources={"nodes": 1, "ntasks": 1, "time": "00:10:00"})
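Conceptually, the entries of `resources` end up as options in the Slurm submission script. The sketch below renders a dictionary as `#SBATCH` lines to show the idea. It is a simplification and not qtoolkit's logic: qtoolkit fills its own template, whose keyword names do not always coincide with Slurm option names. `sbatch_directives` is a hypothetical helper.

```python
def sbatch_directives(resources):
    """Render a resources dict as #SBATCH lines (simplified; not qtoolkit's logic)."""
    lines = []
    for key, value in resources.items():
        # Slurm long options use dashes, e.g. cpus_per_task -> --cpus-per-task.
        option = key.replace("_", "-")
        lines.append(f"#SBATCH --{option}={value}")
    return "\n".join(lines)


print(sbatch_directives({"nodes": 1, "ntasks": 1, "time": "00:10:00"}))
# #SBATCH --nodes=1
# #SBATCH --ntasks=1
# #SBATCH --time=00:10:00
```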

The execution may be slightly slower than in the previous case, since files now need to be transferred to another machine and the Job goes through a queue.

In [None]:
!jf job list