Enable more complete configurability of jsrun launcher (#384)
* initial patch for more complete configuration of jsrun launcher

* First pass at documenting usage of the lsf/jsrun launcher

* Add corresponding maestro steps for each jsrun variant

* Add one example of a memory hungry application

* Build table mapping jsrun to maestro step keys

* Add binding controls, rename keys from snake case, document defaults

* Fix up straggling snake case keys

* Improve debugging info in schema error messages

* Fix rs_per_node, make gpu binding optional since it's new in lsf 10.1

* Update binding flag in examples, add note about gpu binding availability

* Add initial lsfscriptadapter tests

* Initial pass at general batch block documentation

* Remove old commentary

* Remove unneeded nodes/procs math in jsrun launcher substitution

* Remove unneeded logging output

* Remove more debugging log outputs

* Update lsf examples to match json schema for resource specification keys

* Cleanup the cpus per rs machinery, schema

* Add openmp and mpi lulesh study to exercise lsf resource specification keys

* Document the sample lsf lulesh specification
jwhite242 committed Apr 30, 2022
1 parent a6840ee commit 026ce65
Showing 9 changed files with 594 additions and 21 deletions.
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -107,6 +107,7 @@ And, running the study is still as simple as:
quick_start
hello_world
lulesh_breakdown
schedulers
parameters
maestro_core

87 changes: 83 additions & 4 deletions docs/source/lulesh_breakdown.rst
@@ -1,4 +1,83 @@
LULESH Specification Breakdown
===============================

Stub

Vanilla Unix/Linux Specification
++++++++++++++++++++++++++++++++

Stub

Parallel Specifications
+++++++++++++++++++++++

The next three variants of the LULESH example make changes to enable running the steps
on HPC clusters using a variety of schedulers, as well as invoking the various parallel
options in LULESH itself (MPI, OpenMP, ...).

Slurm Scheduled
---------------

Stub

Flux Scheduled
--------------

Stub

LSF Scheduled Parallel Specification
------------------------------------

This example is configured for running on LLNL's IBM/NVIDIA HPC machines, which use the
LSF job scheduler to manage compute resources. Due to variations in system setups, this
may not work on all LSF installations and may require some tweaking to account for the
differences. As with the other parallel specifications, adjust the machine name, bank,
and queues to suit the system you run this spec on.


The first change is to the batch block, which selects the LSF scheduler so that batch
submission and parallel command injection work correctly:

.. literalinclude:: ../../samples/lulesh/lulesh_sample1_unix_lsf.yaml
    :language: yaml
    :lines: 18-22
    :emphasize-lines: 2

We'll briefly skip ahead to the parameters block, as there are several parameters
used to specify all three modes of running. ``TASKS`` specifies the number of MPI tasks,
while ``CPUS_PER_TASK`` captures the OpenMP threads inside each task. These can be nested,
but here we only show the pure-MPI and pure-OpenMP modes of LULESH:

.. literalinclude:: ../../samples/lulesh/lulesh_sample1_unix_lsf.yaml
    :language: yaml
    :lines: 80-99
    :emphasize-lines: 10-12, 14-16
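
For orientation, a parameters block along these lines pairs the two modes (this is a
sketch only; the values here are illustrative, and the real ones live in the sample file
excerpted above):

.. code-block:: yaml

    global.parameters:
        TASKS:
            values: [27, 1]
            label: TASKS.%%
        CPUS_PER_TASK:
            values: [1, 16]
            label: CPUS_PER_TASK.%%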


There are a few differences in the first step to deal with the LLNL systems, where we
use the Lmod setup to select the MPI and compiler installations. In addition, we see the
resource keys, which differ slightly owing to the way LSF describes resources using
resource sets (rs). In this step that is not very important yet, as the compilation does
not need to run in parallel. Note that ``depends`` is empty, as this is the first step
that needs to run and has no dependencies.

.. literalinclude:: ../../samples/lulesh/lulesh_sample1_unix_lsf.yaml
    :language: yaml
    :lines: 25-48
    :emphasize-lines: 5-11, 18-24


The run step gets a little more interesting now that we are mixing serial, MPI-parallel,
and OpenMP-parallel modes in a single step. The biggest difference here is that some of
the resource parameters are used inside the step to set the environment variable that
controls the OpenMP thread count via the ``$(CPUS_PER_TASK)`` Maestro parameter.

.. literalinclude:: ../../samples/lulesh/lulesh_sample1_unix_lsf.yaml
    :language: yaml
    :lines: 51-77
    :emphasize-lines: 16-18, 21-27
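
In other words, the step command can export the standard OpenMP control variable from the
Maestro parameter before launching. A minimal sketch (the binary name below is a
placeholder, not the sample's exact command):

.. code-block:: bash

    # CPUS_PER_TASK is expanded by Maestro before the batch script is written
    export OMP_NUM_THREADS=$(CPUS_PER_TASK)
    $(LAUNCHER) ./lulesh2.0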

And finally, here is the complete specification:

.. literalinclude:: ../../samples/lulesh/lulesh_sample1_unix_lsf.yaml
    :language: yaml
231 changes: 231 additions & 0 deletions docs/source/schedulers.rst
@@ -0,0 +1,231 @@
Scheduling Studies (a.k.a. the Batch Block)
===========================================

The batch block is an optional component of the workflow specification that enables
job submission and management on remote clusters. The block contains a handful of
keys for specifying system-level information that applies to all scheduled
steps:

.. list-table:: Batch block keys
    :header-rows: 1

    * - Key
      - Description
    * - type
      - Scheduler adapter to use. Currently supported are ``slurm``, ``lsf``, ``flux``,
        and ``local``. The default is ``local``, i.e. non-scheduled job steps.
    * - host
      - Name of the cluster being run on
    * - queue
      - Machine partition to schedule jobs to: ``--partition`` for Slurm systems, ``-q``
        for LSF systems, ...
    * - bank
      - Account which runs the job; this is used for computing job priority on the
        cluster: ``--account`` on Slurm, ``-G`` on LSF, ...
    * - reservation
      - Name of any pre-reserved machine partition to submit jobs to
    * - qos
      - Optional quality-of-service options (Slurm only)
    * - flux_uri
      - Flux-specific reservation functionality

The information in this block is used to populate the step-specific batch scripts with the
appropriate header comment blocks (e.g. ``#SBATCH --partition`` for Slurm). Additional keys,
such as step-specific resource requirements (number of nodes, CPUs/tasks, GPUs, ...), get
added here when processing individual steps; see the subsequent sections for
scheduler-specific details. Note that job steps will run locally unless at least the
``nodes`` or ``procs`` key is populated in the step.
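
As an illustration only (the host, bank, and queue names below are placeholders rather
than values from any particular system), a batch block targeting an LSF cluster might
look like:

.. code-block:: yaml

    batch:
        type: lsf
        host: mycluster
        bank: myaccount
        queue: myqueue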

LOCAL
*****

Stub

SLURM
*****

Stub

FLUX
****

Stub

LSF: a Tale of Two Launchers
****************************

The LSF scheduler provides multiple options for the parallel launcher command:

* `lsrun <https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=jobs-run-interactive-tasks>`_
* `jsrun <https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=SSWRJV_10.1.0/jsm/jsrun.html>`_

Maestro currently supports only the jsrun launcher, which differs from Slurm in allowing
a more flexible specification of the resources available to each task. In addition to the
``procs``, ``cores per task``, and ``gpus`` keys, there are also ``tasks per rs`` and
``rs per node``. ``jsrun`` describes things in terms of resource sets (rs), with several
keywords controlling these resource sets and mapping them onto the actual machine/node
allocations:

.. list-table:: Mapping of LSF args to Maestro step keys
    :header-rows: 1

    * - LSF (jsrun)
      - Maestro
      - Description
      - Default setting
    * - -n, --nrs
      - procs
      - Number of resource sets
      - 1
    * - -a, --tasks_per_rs
      - tasks per rs
      - Number of MPI tasks (ranks) in a resource set
      - 1
    * - -c, --cpu_per_rs
      - cores per task
      - Number of physical CPU cores in a resource set
      - 1
    * - -g, --gpu_per_rs
      - gpus
      - Number of GPUs per resource set
      - 0
    * - -b, --bind
      - bind
      - Controls binding of tasks in a resource set
      - 'rs'
    * - -B, --bind_gpus
      - bind gpus
      - Controls binding of tasks to GPUs in a resource set
      - 'none'
    * - -r, --rs_per_host
      - rs per node
      - Number of resource sets per node
      - 1


.. note::

    ``bind_gpus`` is new in LSF 10.1 and may not be available on all systems.

Now for a few examples of how to map these onto Maestro's resource specifications.
Note that the ``nodes`` key is not used directly in the jsrun command for any of these,
but it is still used for the reservation itself. The rest of the keys control the
per-task resources and the per-node packing of resource sets:

* 1 resource set per GPU on a cluster with 4 GPUs per node, with an application
  requesting 8 GPUs. This will consume 2 full nodes of the cluster, with 1 MPI rank
  associated with each GPU and 1 CPU per rank.

  .. code-block:: bash

      jsrun -nrs 8 -a 1 -c 1 -g 1 -r 4 --bind rs my_awesome_gpu_application

  And the corresponding Maestro step that generates it:

  .. code-block:: yaml

      study:
          - name: run-my-app
            description: launch the best gpu application.
            run:
                cmd: |
                    $(LAUNCHER) my_awesome_gpu_application
                procs: 8
                nodes: 2
                gpus: 1
                rs per node: 4
                tasks per rs: 1
                cores per task: 1

  Note that ``procs`` here maps more to the tasks/resource set concept in LSF/jsrun, and
  ``nodes`` is a multiplier on ``rs per node``, which yields the ``--nrs`` jsrun key
  (see the sketch after these examples).

* 1 resource set per CPU, with no GPUs, using all 44 CPUs on the node:

  .. code-block:: bash

      jsrun -nrs 44 -a 1 -c 1 -g 0 -r 44 --bind rs my_awesome_mpi_cpu_application

  .. code-block:: yaml

      study:
          - name: run-my-app
            description: launch a pure mpi-cpu application.
            run:
                cmd: |
                    $(LAUNCHER) my_awesome_mpi_cpu_application
                procs: 44
                nodes: 1
                gpus: 0
                rs per node: 44
                tasks per rs: 1
                cores per task: 1

  Again, note that ``procs`` is a multiple of ``rs per node``.

* Several multithreaded MPI ranks per node, with no GPUs:

  .. code-block:: bash

      jsrun -nrs 4 -a 1 -c 11 -g 0 -r 4 --bind rs my_awesome_omp_mpi_cpu_application

  .. code-block:: yaml

      study:
          - name: run-my-app
            description: launch an application using mpi and omp
            run:
                cmd: |
                    $(LAUNCHER) my_awesome_omp_mpi_cpu_application
                procs: 4
                nodes: 1
                gpus: 0
                rs per node: 4
                tasks per rs: 1
                cores per task: 11

* Several multithreaded MPI ranks per node with one GPU per rank, spanning multiple
  nodes that have 4 GPUs each:

  .. code-block:: bash

      jsrun -nrs 8 -a 1 -c 11 -g 1 -r 4 --bind rs my_awesome_all_the_threads_application

  .. code-block:: yaml

      study:
          - name: run-my-app
            description: Use all the threads!
            run:
                cmd: |
                    $(LAUNCHER) my_awesome_all_the_threads_application
                procs: 8
                nodes: 2
                gpus: 1
                rs per node: 4
                tasks per rs: 1
                cores per task: 11

* An MPI application that needs lots of memory per rank:

  .. code-block:: bash

      jsrun -nrs 2 -a 1 -c 1 -g 0 -r 1 --bind rs my_memory_hungry_application

  .. code-block:: yaml

      study:
          - name: run-my-app
            description: Use all the memory for a single task per node
            run:
                cmd: |
                    $(LAUNCHER) my_memory_hungry_application
                procs: 2
                nodes: 2
                gpus: 0
                rs per node: 1
                tasks per rs: 1
                cores per task: 1
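
To summarize the arithmetic used throughout these examples, here is a small, hypothetical
Python sketch (not the adapter's actual implementation) of how the Maestro step keys
combine into the jsrun flags listed in the table above:

.. code-block:: python

    def jsrun_flags(nodes=1, rs_per_node=1, tasks_per_rs=1,
                    cores_per_task=1, gpus=0, bind="rs"):
        """Combine Maestro step keys into a jsrun argument list (illustrative only)."""
        nrs = nodes * rs_per_node  # total resource sets; matches the `procs` key above
        return [
            "jsrun",
            "--nrs", str(nrs),
            "--tasks_per_rs", str(tasks_per_rs),
            "--cpu_per_rs", str(cores_per_task),
            "--gpu_per_rs", str(gpus),
            "--rs_per_host", str(rs_per_node),
            "--bind", bind,
        ]

    # First example above: 2 nodes x 4 resource sets/node, 1 GPU and 1 core per set
    print(" ".join(jsrun_flags(nodes=2, rs_per_node=4, gpus=1)))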
1 change: 1 addition & 0 deletions maestrowf/datastructures/core/executiongraph.py
@@ -106,6 +106,7 @@ def generate_script(self, adapter, tmp_dir=""):
LOGGER.info("Generating script for %s into %s", self.name, scr_dir)
self.to_be_scheduled, self.script, self.restart_script = \
adapter.write_script(scr_dir, self.step)
LOGGER.debug("STEP: %s", self.step)
LOGGER.info("Script: %s\nRestart: %s\nScheduled?: %s",
self.script, self.restart_script, self.to_be_scheduled)

