Covalent HPC Plugin

Covalent is a Pythonic workflow tool used to execute tasks on advanced computing hardware. This executor plugin uses PSI/J to allow Covalent to seamlessly interface with a variety of common high-performance computing job schedulers and pilot systems (e.g. Slurm, PBS, LSF, Flux, Cobalt, RADICAL-Pilot). For workflows to be deployable, users must have SSH access to the login node, access to the job scheduler, and write access to the remote filesystem.

Installation

Server Environment

To use this plugin with Covalent, simply install it using pip in whatever Python environment you use to run the Covalent server (your local machine by default):

pip install covalent-hpc-plugin

Run the following in Python to have Covalent automatically register the plugin:

import covalent

HPC Environment

Additionally, on the remote machine(s) where you plan to execute Covalent workflows with this plugin, ensure that the remote Python environment has Covalent and PSI/J installed:

pip install covalent psij-python

Note that the Python major and minor version numbers on both the local and remote machines must match to ensure reliable (un)pickling of the various objects.
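
For example, one quick way to confirm that the versions match is to run the same check on both machines (a minimal snippet; adjust the interpreter name if your remote environment uses a different one):

python -c "import sys; print(sys.version_info[:2])"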

Usage

Default Configuration Parameters

When you install covalent-hpc-plugin and run import covalent for the first time, your Covalent configuration file (found at ~/.config/covalent/covalent.conf by default) is automatically updated to include the following sections. These are not all of the available parameters; they are simply the default values.

[executors.hpc]
address = ""
username = ""
ssh_key_file = "~/.ssh/id_rsa"
instance = "slurm"
launcher = "single"
inherit_environment = true
pre_launch_cmds = []
post_launch_cmds = []
shebang = "#!/bin/bash"
remote_python_exe = "python"
remote_workdir = "~/covalent-workdir"
create_unique_workdir = false
cache_dir = "~/.cache/covalent"
poll_freq = 60

[executors.hpc.environment]

[executors.hpc.resource_spec_kwargs]
node_count = 1
processes_per_node = 1
gpu_cores_per_process = 0

[executors.hpc.job_attributes_kwargs]
duration = 10

You can modify these parameters in the Covalent config file as needed, such as the address of the remote machine, the username to use when logging in, the ssh_key_file to use for authentication, the type of job scheduler (instance), and much more. Because PSI/J provides a common interface to many job schedulers, switching schedulers is simply a matter of changing the instance field, as shown below.
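
For instance, to target a PBS Pro machine instead of a Slurm machine, you would only need to change the instance field in the config file (a sketch assuming a PBS Pro cluster; the remaining fields stay as they are):

[executors.hpc]
instance = "pbspro"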

A full description of the input parameters is provided in the docstring of the HPCExecutor class, reproduced below:

class HPCExecutor(AsyncBaseExecutor):
    """
    HPC executor plugin class, built around PSI/J.

    This plugin requires that Covalent and PSI/J exist in the remote machine's Python environment.

    Args:
        address: Remote address or hostname of the login node (e.g. "coolmachine.university.edu").
        username: Username used to authenticate over SSH (i.e. what you use to log in to `address`).
            The default is None (i.e. no username is required).
        ssh_key_file: Private RSA key used to authenticate over SSH.
            The default is "~/.ssh/id_rsa". If no key is required, set this to None.
        cert_file: Certificate file used to authenticate over SSH, if required (usually has extension .pub).
            The default is None. If no certificate is required, leave this as None.
        instance: The PSI/J `JobExecutor` instance (i.e. job scheduler) to use for job submission.
            Must be one of: "cobalt", "flux", "lsf", "pbspro", "rp", "slurm". Defaults to "slurm".
        launcher: The PSI/J `JobSpec` launcher to use for the job.
            Must be one of: "aprun", "jsrun", "mpirun", "multiple", "single", "srun". Defaults to "single".
        resource_spec_kwargs: The PSI/J keyword arguments for `ResourceSpecV1`, which describes the resources to
            reserve on the scheduling system. Defaults to None, which is equivalent to the PSI/J defaults.
        job_attributes_kwargs: The PSI/J keyword arguments for `JobAttributes`, which describes how the
            job is queued and run. Defaults to None, which is equivalent to the PSI/J defaults.
        inherit_environment: Whether the job should inherit the parent environment. Defaults to True.
        environment: Environment variables to set for the job. Defaults to None, which is equivalent to {}.
        pre_launch_cmds: List of shell-compatible commands to run before launching the job. Defaults to None.
        post_launch_cmds: List of shell-compatible commands to run after launching the job. Defaults to None.
        shebang: Shebang to use for pre-launch and post-launch commands. Defaults to "#!/bin/bash".
        remote_python_exe: Python executable to use for job submission. Defaults to "python".
        remote_conda_env: Conda environment to activate on the remote machine. Defaults to None.
        remote_workdir: Working directory on the remote cluster. Defaults to "~/covalent-workdir".
        create_unique_workdir: Whether to create a unique working (sub)directory for each task.
            Defaults to False.
        cache_dir: Local cache directory used by this executor for temporary files.
            Defaults to the dispatcher's cache directory.
        poll_freq: Frequency with which to poll a submitted job. Defaults to 60. Note that setting this value
            significantly lower is not advised, as it will result in frequent SSH connections to the remote machine.
        cleanup: Whether to clean up the temporary job submission files when done. Set this to False for debugging.
            Note that temporary files will be made both in the `remote_workdir` and in `~/.psij`. The latter will
            not be cleaned up by the plugin.
        log_stdout: Path to file to log stdout to. Defaults to "" (i.e. no logging).
        log_stderr: Path to file to log stderr to. Defaults to "" (i.e. no logging).
        time_limit: Time limit for the task (in seconds). Defaults to -1 (i.e. no time limit). Note that this is
            not the same as the job scheduler's time limit, which is set in `job_attributes_kwargs`.
        retries: Number of times to retry execution upon failure. Defaults to 0 (i.e. no retries).
    """

Defining Resource Specifications and Job Attributes

Two of the most important sets of parameters are resource_spec_kwargs and job_attributes_kwargs, which specify the resources required for the job (e.g. number of nodes, number of processes per node, etc.) and the job attributes (e.g. duration, queue name, etc.), respectively.

  1. resource_spec_kwargs is a dictionary of keyword arguments passed to PSI/J's ResourceSpecV1 class.
  2. job_attributes_kwargs is a dictionary of keyword arguments passed to PSI/J's JobAttributes class.

The allowed keys and types for each are listed in the PSI/J documentation for these classes.
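
As a point of reference, a minimal sketch of the two dictionaries might look like the following; the keys shown are the ones used in the default configuration and the examples in this README, and the full set of allowed keys is defined by PSI/J:

resource_spec_kwargs = {
    "node_count": 2,             # number of nodes to reserve
    "processes_per_node": 12,    # processes (e.g. MPI ranks) per node
    "gpu_cores_per_process": 1,  # GPU cores per process, if any
}

job_attributes_kwargs = {
    "duration": 30,                 # walltime in minutes
    "queue_name": "debug",          # queue/partition to submit to
    "project_name": "AccountName",  # allocation or account to charge
}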

Using the Plugin in a Workflow: Approach 1

With the configuration file appropriately set up, one can run a workflow on the HPC machine as follows:

import covalent as ct

@ct.electron(executor="HPCExecutor")
def add(a, b):
    return a + b

@ct.lattice
def workflow(a, b):
    return add(a, b)


dispatch_id = ct.dispatch(workflow)(1, 2)
result = ct.get_result(dispatch_id)
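
If you prefer to block until the workflow has finished before reading the output, a pattern like the following should work (a minimal sketch assuming Covalent's standard Result interface):

result = ct.get_result(dispatch_id, wait=True)  # wait for the workflow to complete
print(result.result)  # 3 for the inputs above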

Using the Plugin in a Workflow: Approach 2

If you wish to modify the various parameters within your Python script rather than relying solely on the Covalent configuration file, you can do so by instantiating a custom instance of the HPCExecutor class. An example with some commonly used parameters is shown below. Any parameters not specified in the HPCExecutor are inherited from the configuration file.

import covalent as ct

executor = ct.executor.HPCExecutor(
    address="coolmachine.university.edu",
    username="UserName",
    ssh_key_file="~/.ssh/id_rsa",
    instance="slurm",
    remote_conda_env="myenv",
    environment={"HELLO": "WORLD"},
    resource_spec_kwargs={
        "node_count": 2,
        "processes_per_node": 24
    },
    job_attributes_kwargs={
        "duration": 30, # minutes
        "queue_name": "debug",
        "project_name": "AccountName",
    },
    launcher="single",
    remote_workdir="~/covalent-workdir",
)

@ct.electron(executor=executor)
def add(a, b):
    return a + b

@ct.lattice
def workflow(a, b):
    return add(a, b)


dispatch_id = ct.dispatch(workflow)(1, 2)
result = ct.get_result(dispatch_id)

Working Example: Perlmutter

The following is a minimal working example to submit a Covalent job on NERSC's Perlmutter machine. It assumes that you have used the sshproxy utility to generate a certificate file in order to circumvent the need for multi-factor authentication for each login.

import covalent as ct

executor = ct.executor.HPCExecutor(
    address="perlmutter-p1.nersc.gov",
    username="UserName",
    ssh_key_file="~/.ssh/nersc",
    cert_file="~/.ssh/nersc-cert.pub",
    remote_conda_env="myenv",
    job_attributes_kwargs={
        "project_name": "ProjectName",
        "custom_attributes": {"slurm.constraint": "cpu", "slurm.qos": "debug"},
    },
)

@ct.electron(executor=executor)
def add(a, b):
    return a + b

@ct.lattice
def workflow(a, b):
    return add(a, b)


dispatch_id = ct.dispatch(workflow)(1, 2)
result = ct.get_result(dispatch_id)

Troubleshooting

The most common issues are related to the job scheduler details (i.e. the resource_spec_kwargs and the job_attributes_kwargs). If your job fails on the remote machine, set cleanup=False (as in the sketch below) and then inspect the files left behind in the working directory, as well as the ~/.psij directory, which holds a history and log files for your attempted job submissions.
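
For example, a debugging-oriented executor might be instantiated as follows (a minimal sketch; the address is a placeholder, and all unspecified parameters are inherited from the configuration file):

import covalent as ct

debug_executor = ct.executor.HPCExecutor(
    address="coolmachine.university.edu",  # placeholder login node
    cleanup=False,  # keep the generated job submission files in remote_workdir for inspection
)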

Release Notes

Release notes are available in the Changelog.

Credit

This plugin was developed by Andrew S. Rosen, building off of prior work by the Agnostiq team on the covalent-slurm-plugin.

If you use this plugin, be sure to cite Covalent as follows:

W. J. Cunningham, S. K. Radha, F. Hasan, J. Kanem, S. W. Neagle, and S. Sanand. Covalent. Zenodo, 2022. https://doi.org/10.5281/zenodo.5903364

License

Covalent is licensed under the Apache 2.0 License. Covalent may be distributed under other licenses upon request. See the LICENSE file or contact the support team for more details.