ncar-jobqueue

ncar-jobqueue provides utilities for configuring dask-jobqueue with appropriate default settings for NCAR's clusters.

The following compute servers are supported:

  • Casper (casper.hpc.ucar.edu)
  • Derecho (derecho.hpc.ucar.edu)
  • Hobart (hobart.cgd.ucar.edu)
  • Izumi (izumi.unified.ucar.edu)

CISL discourages the use of Derecho for Dask. Please use Casper instead unless you are sure you can properly utilize a significant portion of the CPU cores on a Derecho node (e.g., via dask-mpi).

Installation

ncar-jobqueue can be installed from PyPI with pip:

python -m pip install ncar-jobqueue

ncar-jobqueue is also available from conda-forge for conda installations:

conda install -c conda-forge ncar-jobqueue
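
Either way, the installation can be checked with a quick import; the package is imported as ncar_jobqueue, and no output means the import succeeded:

python -c "import ncar_jobqueue"
python -m pip show ncar-jobqueue  # displays the installed version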

Configuration

ncar-jobqueue provides a custom configuration file with appropriate default settings for different clusters. This configuration file resides in ~/.config/dask/ncar-jobqueue.yaml:

ncar-jobqueue.yaml
casper:
  pbs:
    #project: XXXXXXXX
    name: dask-worker-casper
    cores: 1 # Total number of cores per job
    memory: '4GiB' # Total amount of memory per job
    processes: 1 # Number of Python processes per job
    interface: ext # Network interface to use (high-speed ethernet)
    walltime: '01:00:00'
    resource-spec: select=1:ncpus=1:mem=4GB
    queue: casper
    log-directory: '/glade/derecho/scratch/${USER}/dask/casper/logs'
    local-directory: '/glade/derecho/scratch/${USER}/dask/casper/local-dir'
    job-extra: ['-r n']
    env-extra: []
    death-timeout: 60

derecho:
  pbs:
    #project: XXXXXXXX
    name: dask-worker-derecho
    cores: 1 # Total number of cores per job
    memory: '4GiB' # Total amount of memory per job
    processes: 1 # Number of Python processes per job
    interface: hsn0 # Network interface to use (Slingshot)
    queue: develop
    walltime: '01:00:00'
    resource-spec: select=1:ncpus=128:mem=235GB
    log-directory: '/glade/derecho/scratch/${USER}/dask/derecho/logs'
    local-directory: '/glade/derecho/scratch/${USER}/dask/derecho/local-dir'
    job-extra: ['-l job_priority=economy', '-r n']
    env-extra: []
    death-timeout: 60

hobart:
  pbs:
    name: dask-worker-hobart
    cores: 10 # Total number of cores per job
    memory: '96GB' # Total amount of memory per job
    processes: 10 # Number of Python processes per job
    # interface: null              # ib0 doesn't seem to be working on Hobart
    queue: medium
    walltime: '08:00:00'
    resource-spec: nodes=1:ppn=48
    log-directory: '/scratch/cluster/${USER}/dask/hobart/logs'
    local-directory: '/scratch/cluster/${USER}/dask/hobart/local-dir'
    job-extra: ['-r n']
    env-extra: []
    death-timeout: 60

izumi:
  pbs:
    name: dask-worker-izumi
    cores: 10 # Total number of cores per job
    memory: '96GB' # Total amount of memory per job
    processes: 10 # Number of Python processes per job
    # interface: null              # ib0 doesn't seem to be working on Izumi
    queue: medium
    walltime: '08:00:00'
    resource-spec: nodes=1:ppn=48
    log-directory: '/scratch/cluster/${USER}/dask/izumi/logs'
    local-directory: '/scratch/cluster/${USER}/dask/izumi/local-dir'
    job-extra: ['-r n']
    env-extra: []
    death-timeout: 60

Note:

  • To configure a default project account that is used by dask-jobqueue when submitting batch jobs, uncomment the project key/line in ~/.config/dask/ncar-jobqueue.yaml and set it to an appropriate value.
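
Keyword arguments passed to NCARCluster are handed to the underlying dask-jobqueue cluster class, so any of the settings above can be overridden per cluster. A minimal sketch (the project code is a placeholder and the resource values are illustrative, not recommendations):

>>> from ncar_jobqueue import NCARCluster
>>> # Override selected defaults from ncar-jobqueue.yaml for this cluster only
>>> cluster = NCARCluster(project='XXXXXXXX',
...                       cores=2,
...                       processes=2,
...                       memory='8GiB',
...                       walltime='02:00:00')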

Usage

Note:

⚠️ Online documentation for dask-jobqueue is available at https://jobqueue.dask.org/. ⚠️

Casper

>>> from ncar_jobqueue import NCARCluster
>>> from dask.distributed import Client
>>> cluster = NCARCluster(project='XXXXXXXX')
>>> cluster
PBSCluster(0f23b4bf, 'tcp://xx.xxx.x.x:xxxx', workers=0, threads=0, memory=0 B)
>>> cluster.scale(jobs=2)
>>> client = Client(cluster)
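
Instead of a fixed cluster.scale(jobs=2), recent dask-jobqueue releases also support adaptive scaling, where batch jobs are started and stopped to match the workload. A minimal sketch (the job bounds are illustrative):

>>> # Alternative to cluster.scale(): grow and shrink between 1 and 4 batch jobs as needed
>>> cluster.adapt(minimum_jobs=1, maximum_jobs=4)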

Derecho

>>> from ncar_jobqueue import NCARCluster
>>> from dask.distributed import Client
>>> cluster = NCARCluster(project='XXXXXXXX')
>>> cluster
PBSCluster(0f23b4bf, 'tcp://xx.xxx.x.x:xxxx', workers=0, threads=0, memory=0 B)
>>> cluster.scale(jobs=2)
>>> client = Client(cluster)

Hobart

>>> from ncar_jobqueue import NCARCluster
>>> from dask.distributed import Client
>>> cluster = NCARCluster()
>>> cluster
PBSCluster(0f23b4bf, 'tcp://xx.xxx.x.x:xxxx', workers=0, threads=0, memory=0 B)
>>> cluster.scale(jobs=2)
>>> client = Client(cluster)

Izumi

>>> from ncar_jobqueue import NCARCluster
>>> from dask.distributed import Client
>>> cluster = NCARCluster()
>>> cluster
PBSCluster(0f23b4bf, 'tcp://xx.xxx.x.x:xxxx', workers=0, threads=0, memory=0 B)
>>> cluster.scale(jobs=2)
>>> client = Client(cluster)
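
On any of these machines, once the client is connected, ordinary Dask computations run on the workers launched as batch jobs; closing the client and cluster afterwards releases those jobs. A minimal sketch (assumes dask.array is installed; the array size is illustrative):

>>> import dask.array as da
>>> x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
>>> result = x.mean().compute()  # executes on the cluster's workers
>>> client.close()
>>> cluster.close()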

Non-NCAR machines

On non-NCAR machines, ncar-jobqueue warns the user and falls back to distributed.LocalCluster:

>>> from ncar_jobqueue import NCARCluster
.../ncar_jobqueue/cluster.py:17: UserWarning: Unable to determine which NCAR cluster you are running on... Returning a `distributed.LocalCluster` class.
warn(message)
>>> from dask.distributed import Client
>>> cluster = NCARCluster()
>>> cluster
LocalCluster(3a7dd0f6, 'tcp://127.0.0.1:64184', workers=4, threads=8, memory=17.18 GB)