# PSI/J Getting Started Tutorial


## Background
PSI/J (Portable Submission Interface for Jobs) is a Python abstraction layer that allows your HPC application to run virtually anywhere.

<img src="./images/psij_overview.png" width="350"/> 

PSI/J automatically translates abstract job specifications into concrete scripts and commands to send to the scheduler. PSI/J has a number of advantages:
1. **Runs entirely in user space**: there is no need to wait for infrequent deployment cycles, and it is easy to leverage built-in or community-provided plugins.
2. **An asynchronous modern API for job management**: a clean Python API for requesting and managing jobs.
3. **Supports the common batch schedulers**: we test PSI/J across multiple DOE supercomputer centers. It’s easy to test PSI/J on your systems and share the results with the community.
4. **Built by the HPC community, for the HPC community**: PSI/J is based on a number of libraries used by state-of-the-art HPC workflow applications.
5. **PSI/J is an open source project**: we are establishing a community to develop, test, and deploy PSI/J across many HPC facilities.

### Installation

In [None]:
%%capture capt

%pip install psij-python

# or install from GitHub
#   pip install git+https://github.com/ExaWorks/psij-python.git

## Overview 
In PSI/J’s terminology, a [Job](https://exaworks.org/psij-python/#docs/.generated/psij.html/#psij.job.Job) represents an executable plus a bunch of attributes. Static job attributes such as resource requirements are defined by the [JobSpec](https://exaworks.org/psij-python/#docs/.generated/psij.html/#psij.job_spec.JobSpec) at creation. Dynamic job attributes such as the [JobState](https://exaworks.org/psij-python/#docs/.generated/psij.html/#psij.job_state.JobState) are modified by [JobExecutors](https://exaworks.org/psij-python/#docs/.generated/psij.html/#psij.job_executor.JobExecutor) as the [Job](https://exaworks.org/psij-python/#docs/.generated/psij.html/#psij.job.Job) progresses through its lifecycle.

A [JobExecutor](https://exaworks.org/psij-python/#docs/.generated/psij.html/#psij.job_executor.JobExecutor) represents a specific *Resource Manager*, e.g. Slurm, on which the Job is being executed. Available *Resource Managers*: 
- Local
- Cobalt
- Flux
- LSF
- PBS
- Slurm

Generally, when jobs are submitted, they will be queued for a variable period of time, depending on how busy the target machine is. Once the Job is started, its executable is launched and runs to completion.

In PSI/J, a job is submitted by [JobExecutor.submit(Job)](https://exaworks.org/psij-python/#docs/.generated/psij.html/#psij.job_executor.JobExecutor) which permanently binds the Job to that executor and submits it to the underlying resource manager.

### Basic Usage
Creating a *single* job with PSI/J:
1. Create a [JobExecutor](https://exaworks.org/psij-python/#docs/.generated/psij.html/#psij.job_executor.JobExecutor) instance.
2. Create a [JobSpec](https://exaworks.org/psij-python/#docs/.generated/psij.html/#psij.job_spec.JobSpec) object and populate it with information about your job
3. Create a [Job](https://exaworks.org/psij-python/#docs/.generated/psij.html/#psij.job.Job) with that [JobSpec](https://exaworks.org/psij-python/#docs/.generated/psij.html/#psij.job_spec.JobSpec)
4. Submit the [Job](https://exaworks.org/psij-python/#docs/.generated/psij.html/#psij.job.Job) instance to the [JobExecutor](https://exaworks.org/psij-python/#docs/.generated/psij.html/#psij.job_executor.JobExecutor)

### Viewing Job Status
To view a job's status, print its `status` attribute, e.g. `print(str(job.status))`:

In [2]:
import time
from psij import Job, JobExecutor, JobSpec

executor_type = "local"
executable = "/bin/date"
ex = JobExecutor.get_instance(executor_type)
job = Job(JobSpec(executable=executable))

print("Job Status Stages:")
print("\tPrior to submitting: ", str(job.status))
ex.submit(job)
print("\tAfter submitting: ", str(job.status))
time.sleep(1)
print("\tOnce completed: ", str(job.status))

Job Status Stages:
	Prior to submitting:  JobStatus[NEW, time=1674256341.2602851]
	After submitting:  JobStatus[ACTIVE, time=1674256341.28253]
	Once completed:  JobStatus[COMPLETED, time=1674256341.438302, exit_code=0]


### Multiple Jobs
This section shows how to submit *multiple* jobs to the same executor:

In [5]:
from psij import Job, JobExecutor, JobSpec
from psij import JobState

executable = "/bin/date"
ex = JobExecutor.get_instance(executor_type)

jobs = []

print("Submitting jobs: ")
for _ in range(10):
    job = Job(JobSpec(executable=executable))
    print(str(job.status))
    jobs.append(job)
    ex.submit(job)

time.sleep(1)
print("\nCompleted Jobs: ")
_ = [print(job.status) for job in jobs]

Submitting jobs: 
JobStatus[NEW, time=1674256370.7787745]
JobStatus[NEW, time=1674256370.7983398]
JobStatus[NEW, time=1674256370.8159137]
JobStatus[NEW, time=1674256370.833979]
JobStatus[NEW, time=1674256370.8511078]
JobStatus[NEW, time=1674256370.8664868]
JobStatus[NEW, time=1674256370.8822076]
JobStatus[NEW, time=1674256370.8973079]
JobStatus[NEW, time=1674256370.9105623]
JobStatus[NEW, time=1674256370.9238665]

Completed Jobs: 
JobStatus[COMPLETED, time=1674256370.8887913, exit_code=0]
JobStatus[COMPLETED, time=1674256370.8889246, exit_code=0]
JobStatus[COMPLETED, time=1674256370.8890853, exit_code=0]
JobStatus[COMPLETED, time=1674256370.8891776, exit_code=0]
JobStatus[COMPLETED, time=1674256370.894441, exit_code=0]
JobStatus[COMPLETED, time=1674256371.095883, exit_code=0]
JobStatus[COMPLETED, time=1674256371.0960574, exit_code=0]
JobStatus[COMPLETED, time=1674256371.0961127, exit_code=0]
JobStatus[COMPLETED, time=1674256371.0961728, exit_code=0]
JobStatus[COMPLETED, time=1674256371

Every instance of `JobExecutor` can handle an arbitrary number of submitted jobs. For reference, we have tested it with up to 64,000 jobs.

### Waiting for Completion
To wait for a job to complete once it has been submitted, it suffices to call the [wait](https://exaworks.org/psij-python/#docs/.generated/psij.html/#psij.job.Job.wait) method with no arguments:

In [None]:
from psij import Job, JobSpec

job = Job(JobSpec(executable="/bin/date"))
ex.submit(job)
job.wait()

The [wait](https://exaworks.org/psij-python/#docs/.generated/psij.html/#psij.job.Job.wait) call will return once the job has reached a terminal state, which almost always means that it finished or was cancelled.

To distinguish jobs that complete successfully from ones that fail or are cancelled, fetch the status of the job after calling [wait](https://exaworks.org/psij-python/#docs/.generated/psij.html/#psij.job.Job.wait):

In [None]:
job.wait()
print(job.status.state)
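
When several jobs are in flight, the same check can be applied to each of them. The sketch below waits on every job and groups the jobs by their final state. It relies only on the `wait()` method and the `status.state` attribute described above. The names `group_by_final_state` and `_StubJob` are hypothetical and not part of PSI/J; the stub exists only so the example runs without a scheduler.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_final_state(jobs):
    """Wait for each job, then group the jobs by the string form of their final state."""
    groups = defaultdict(list)
    for job in jobs:
        job.wait()  # blocks until the job reaches a terminal state
        groups[str(job.status.state)].append(job)
    return dict(groups)


class _StubJob:
    """Hypothetical stand-in for psij.Job so the sketch runs anywhere."""

    def __init__(self, name, final_state):
        self.name = name
        self._final_state = final_state
        self.status = SimpleNamespace(state="NEW")

    def wait(self):
        # A real job would block here; the stub finishes immediately.
        self.status = SimpleNamespace(state=self._final_state)


stub_jobs = [
    _StubJob("a", "COMPLETED"),
    _StubJob("b", "FAILED"),
    _StubJob("c", "COMPLETED"),
]
groups = group_by_final_state(stub_jobs)
for state, members in sorted(groups.items()):
    print(state, [job.name for job in members])
```

With real PSI/J jobs, pass the list of submitted `Job` objects instead of the stubs; the dictionary keys are then whatever `str(job.status.state)` yields.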

## Configuring a Job
In the previous examples, `executable='/bin/date'` tells PSI/J that we want to run the `/bin/date` command. Beyond the executable, `JobSpec` lets you configure the following aspects of a `Job`:
- arguments for the job executable
- environment the job is running in
- destination for standard output and error streams
- resource requirements for the job's execution
- accounting details to be used

Relevant Docs: [JobSpec](https://exaworks.org/psij-python/#docs/.generated/psij.html/#psij.job_spec.JobSpec)

### Job Arguments
The executable’s command line arguments are specified as a list of strings in the `arguments` attribute of the `JobSpec` class. For example, our previous `/bin/date` job could be changed to request UTC time formatting:


In [None]:
executor_type = 'local'
ex = JobExecutor.get_instance(executor_type)
job = Job(JobSpec(executable='/bin/date', arguments=['--utc', '--debug']))

print(str(job.status))
ex.submit(job)
job.wait()
print(str(job.status))

Note: a `JobSpec` can also be populated incrementally, one attribute at a time:

In [None]:
from psij import JobSpec

spec = JobSpec()
spec.executable = '/bin/date'
spec.arguments = ['-u']
job = Job(spec)

print(str(job.status))
ex.submit(job)
job.wait()
print(str(job.status))

### Job Environment
The job environment sets the environment variables for a job before it is launched. This is the equivalent of exporting `FOO=bar` on the command line before running a command. These environment variables are specified as a dictionary of string key/value pairs:

*Note: Environment variables specified this way will overwrite settings from your shell initialization files (e.g., ~/.bashrc), including from any modules loaded in the default shell environment.*

In [None]:
from psij import JobSpec

spec = JobSpec()
spec.executable = '/bin/date'
spec.environment = {'TZ': 'America/Los_Angeles'}
job = Job(spec)

print(str(job.status))
ex.submit(job)
job.wait()
print(str(job.status))
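
For comparison, here is the "export `FOO=bar` before running a command" idea expressed with only the Python standard library, outside PSI/J. It is meant as an analogy for what setting a variable before launch means, not as a description of how PSI/J applies `spec.environment` internally. The child process here is simply another Python interpreter.

```python
import os
import subprocess
import sys

# Start from the current environment and override one variable,
# much like `export FOO=bar` before running a command.
child_env = {**os.environ, "FOO": "bar"}

result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['FOO'])"],
    env=child_env,
    capture_output=True,
    text=True,
    check=True,
)
child_value = result.stdout.strip()
print(child_value)
```

The variable is visible to the child process only; the environment of the calling Python process is left untouched.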

### Job Stdio
Standard output and standard error streams of the job can be individually redirected to files by setting the `stdout_path` and `stderr_path` attributes. The job’s standard input stream can also be redirected to read from a file by setting the `stdin_path` attribute:

In [None]:
from psij import JobSpec

spec = JobSpec()
spec.executable = '/bin/date'
spec.stdout_path = '/tmp/date.out'
spec.stderr_path = '/tmp/date.err'

job = Job(spec)

print(str(job.status))
ex.submit(job)
job.wait()
print(str(job.status))

with open(spec.stdout_path, 'r') as fd:
    print("Stdout:", fd.read(), end='')

with open(spec.stderr_path, 'r') as fd:
    print("Stderr:", fd.read(), end='')
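
When many jobs are submitted, pointing all of them at the same `/tmp/date.out` would make them overwrite each other's output. A minimal sketch of one way to avoid that is shown below. `configure_stdio` is a hypothetical helper, and `SimpleNamespace` stands in for `JobSpec` so the example runs without PSI/J; only the attribute names `stdin_path`, `stdout_path`, and `stderr_path` come from the text above.

```python
from pathlib import Path
from types import SimpleNamespace


def configure_stdio(spec, directory, stem, stdin_file=None):
    """Set the stream attributes so each job writes to its own files."""
    directory = Path(directory)
    spec.stdout_path = str(directory / f"{stem}.out")
    spec.stderr_path = str(directory / f"{stem}.err")
    if stdin_file is not None:
        spec.stdin_path = str(stdin_file)  # the job reads its stdin from this file
    return spec


# SimpleNamespace stands in for psij.JobSpec so the sketch runs without PSI/J.
specs = [configure_stdio(SimpleNamespace(), "/tmp", f"date-{i}") for i in range(2)]
with_input = configure_stdio(
    SimpleNamespace(), "/tmp", "reader", stdin_file="/tmp/reader.in"
)

for spec in specs + [with_input]:
    print(vars(spec))
```

With PSI/J installed, the same helper could be handed a real `JobSpec` in place of the `SimpleNamespace` objects.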

### Job Resources
A job submitted to a cluster is allocated a specific set of resources to run on. The amount and type of resources are defined by a resource specification `ResourceSpec` which becomes a part of the job specification. The resource specification supports the following attributes:

* `node_count`: allocate that number of compute nodes to the job. All cpu-cores and gpu-cores on the allocated node can be exclusively used by the submitted job.

* `processes_per_node`: on the allocated nodes, execute that given number of processes.

* `process_count`: the total number of processes (MPI ranks) to be started

* `cpu_cores_per_process`: the number of cpu cores allocated to each launched process. PSI/J uses the system definition of a cpu core which may refer to a physical cpu core or to a virtual cpu core, also known as a hardware thread.

* `gpu_cores_per_process`: the number of gpu cores allocated to each launched process. The system definition of a gpu core is used, but it usually refers to a full physical GPU.

* `exclusive_node_use`: When this boolean flag is set to True, then PSI/J will ensure that no other jobs, neither of the same user nor of other users of the same system, will run on any of the compute nodes on which processes for this job are launched.

A resource specification does not need to define all available attributes. In fact, an empty resource spec is valid as it refers to a single process being launched on a single cpu core.

The user should also take care not to define contradictory statements. For example, the following specification cannot be enacted by PSI/J as the specified node count contradicts the value of `process_count / processes_per_node`:

In [None]:
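from psij import Job, JobSpec, ResourceSpecV1

spec = JobSpec()
spec.executable = '/bin/stress'
# Contradictory on purpose: 2 processes / 2 processes per node = 1 node, not 2.
# PSI/J is expected to reject this specification.
spec.resources = ResourceSpecV1(node_count=2, processes_per_node=2,
        process_count=2)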
job = Job(spec)

print(str(job.status))
ex.submit(job)
job.wait()
print(str(job.status))
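
The consistency rule can be written out as plain arithmetic: whenever all three values are given, `process_count` should equal `node_count * processes_per_node`. The function below is a standalone sketch of that rule for checking values before building a resource specification. `check_resources` is a hypothetical name, and this is not PSI/J's own validation code.

```python
def check_resources(node_count=None, processes_per_node=None, process_count=None):
    """Return None if the three values agree, else a message describing the conflict.

    Values left as None are treated as unspecified and never conflict.
    """
    if None in (node_count, processes_per_node, process_count):
        return None
    expected = node_count * processes_per_node
    if process_count != expected:
        return (
            f"process_count={process_count} but node_count * processes_per_node "
            f"= {node_count} * {processes_per_node} = {expected}"
        )
    return None


consistent = check_resources(node_count=2, processes_per_node=2, process_count=4)
contradictory = check_resources(node_count=2, processes_per_node=2, process_count=2)
partial = check_resources(node_count=2)

print(consistent)     # None: 2 * 2 == 4
print(contradictory)  # message: 2 * 2 != 2
print(partial)        # None: nothing to contradict
```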

### Processes versus ranks
All processes of the job will share a single MPI communicator (*MPI_COMM_WORLD*), independent of their placement, so the terms *process* and *rank* (which usually refers to an MPI rank) are equivalent here. However, jobs started with a single process instance may, depending on the executor implementation, not get an MPI communicator. How jobs are launched can be specified by the `launcher` attribute of the `JobSpec`, as documented below.

### Launching Methods
To specify how the processes in your job should be started once resources have been allocated for it, pass the name of a launcher (e.g. `"mpirun"`, `"srun"`, etc.) like so: `JobSpec(..., launcher='srun')`.

### Scheduling Information
To specify resource-manager-specific information, like queues/partitions, runtime, and so on, create a [JobAttributes](https://exaworks.org/psij-python/#docs/.generated/psij.html/#psij.job_attributes.JobAttributes) and set it with `JobSpec(..., attributes=my_job_attributes)`:

In [None]:
from psij import Job, JobExecutor, JobSpec, JobAttributes, ResourceSpecV1

executor_type = 'local'
ex = JobExecutor.get_instance(executor_type)

job = Job(
    JobSpec(
        executable="/bin/date",
        resources=ResourceSpecV1(node_count=1),
        attributes=JobAttributes(
            queue_name="<QUEUE_NAME>", project_name="<ALLOCATION>"
        ),
    )
)

print(str(job.status))
ex.submit(job)
job.wait()
print(str(job.status))


Note: The `<QUEUE_NAME>` and `<ALLOCATION>` fields will depend on the system you are running on.

## Managing Job State
In all the above examples, we have submitted jobs without checking on what happened to them. Once the job has finished executing (which, for */bin/date*, should be almost as soon as the job starts) the resource manager will mark the job as complete, triggering PSI/J to do the same via the [JobStatus](https://exaworks.org/psij-python/#docs/.generated/psij.html/#psij.job_status.JobStatus) attribute of the job. `Job` state progressions follow this state model:

<img src="./images/psij_job_states.png" width="350"/>
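
As a rough textual companion to the figure, the sketch below mirrors the state names in a plain Python enum. It assumes six states (`NEW`, `QUEUED`, `ACTIVE`, `COMPLETED`, `FAILED`, `CANCELED`) and that the last three are terminal. `SketchState` is an illustrative stand-in, not `psij.JobState` itself, and the figure remains the authoritative description of the allowed transitions.

```python
from enum import Enum


class SketchState(Enum):
    """Illustrative mirror of the state names; not psij.JobState itself."""

    NEW = "NEW"
    QUEUED = "QUEUED"
    ACTIVE = "ACTIVE"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    CANCELED = "CANCELED"

    @property
    def final(self):
        """True for terminal states, after which no further change is expected."""
        return self in (
            SketchState.COMPLETED,
            SketchState.FAILED,
            SketchState.CANCELED,
        )


final_names = [state.name for state in SketchState if state.final]
print(final_names)
```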


### Cancelling a Job
If supported by the underlying job scheduler, PSI/J jobs can be canceled by invoking the [cancel](https://exaworks.org/psij-python/#docs/.generated/psij.html/#psij.job.Job.cancel) method.

In [None]:
job = Job(JobSpec(executable="/bin/date"))
ex.submit(job)
job.cancel()
print(str(job.status))
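
A status printed immediately after `cancel()` may not yet reflect the cancellation, and a short job such as `/bin/date` may finish before the request takes effect. One way to handle both cases is to wait for a terminal state and then inspect it. The sketch below shows that pattern using only `cancel()`, `wait()`, and `status.state`. `cancel_and_wait` and `_CancellableStub` are hypothetical names; the stub simulates a job so the example runs without a scheduler.

```python
from types import SimpleNamespace


def cancel_and_wait(job):
    """Request cancellation, wait for a terminal state, and report which one was reached."""
    job.cancel()
    job.wait()
    return str(job.status.state)


class _CancellableStub:
    """Hypothetical stand-in for psij.Job; the state only changes inside wait()."""

    def __init__(self, finishes_first=False):
        self._finishes_first = finishes_first
        self._cancel_requested = False
        self.status = SimpleNamespace(state="ACTIVE")

    def cancel(self):
        # Only records the request; the status is unchanged for now.
        self._cancel_requested = True

    def wait(self):
        if self._cancel_requested and not self._finishes_first:
            self.status = SimpleNamespace(state="CANCELED")
        else:
            self.status = SimpleNamespace(state="COMPLETED")


canceled = cancel_and_wait(_CancellableStub())
raced = cancel_and_wait(_CancellableStub(finishes_first=True))
print(canceled)
print(raced)
```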

### Status Callbacks
Waiting for jobs to complete with [wait](https://exaworks.org/psij-python/#docs/.generated/psij.html/#psij.job.Job.wait) is fine if you don’t mind blocking while you wait for a single job to complete. However, if you want to wait on multiple jobs without blocking, or you want to get updates when jobs start running, you can attach a callback to a [JobExecutor](https://exaworks.org/psij-python/#docs/.generated/psij.html/#psij.job_executor.JobExecutor) which will fire whenever any job submitted to that executor changes status.

To wait on multiple jobs at once:

In [None]:
import time
from psij import Job, JobExecutor, JobSpec

count = 3

def callback(job, status):
    global count

    if status.final:
        print(f"Job {job} completed with status {status}\n")
        count -= 1

ex = JobExecutor.get_instance(executor_type)
ex.set_job_status_callback(callback)

for _ in range(count):
    job = Job(JobSpec(executable="/bin/date"))
    ex.submit(job)

while count > 0:
    time.sleep(0.01)

Status callbacks can also be set on individual jobs by calling the `Job` method [set_job_status_callback()](https://exaworks.org/psij-python/#docs/.generated/psij.html/#psij.job.Job.set_job_status_callback).
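
The polling loop above wakes up every 10 ms and updates a global counter from a callback that may be invoked on a thread other than the main one. A possible alternative is a small countdown latch built on `threading.Event`, sketched below. `CountdownLatch` is a hypothetical helper, not part of PSI/J, and the three "jobs" are simulated with plain threads so the example runs without a scheduler.

```python
import threading
from types import SimpleNamespace


class CountdownLatch:
    """Hypothetical helper: wait() blocks until count_down() was called `count` times."""

    def __init__(self, count):
        self._count = count
        self._lock = threading.Lock()
        self._done = threading.Event()
        if count <= 0:
            self._done.set()

    def count_down(self):
        with self._lock:  # guard the counter against concurrent callbacks
            self._count -= 1
            if self._count <= 0:
                self._done.set()

    def wait(self, timeout=None):
        return self._done.wait(timeout)


latch = CountdownLatch(3)


def callback(job, status):
    # Same signature as the callback above: only final statuses count.
    if status.final:
        latch.count_down()


# Simulate three jobs reaching a final status on background threads.
final_status = SimpleNamespace(final=True)
threads = [
    threading.Thread(target=callback, args=(f"job-{i}", final_status))
    for i in range(3)
]
for thread in threads:
    thread.start()

all_done = latch.wait(timeout=5)
print(all_done)
```

With PSI/J, the same `callback` would be registered through `ex.set_job_status_callback(callback)` before submitting the jobs, and `latch.wait()` would replace the `while count > 0` loop.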