Enable more complete configurability of jsrun launcher (#384)
* initial patch for more complete configuration of jsrun launcher

* First pass at documenting usage of the lsf/jsrun launcher

* Add corresponding maestro steps for each jsrun variant

* Add one example of a memory hungry application

* Build table mapping jsrun to maestro step keys

* Add binding controls, rename keys from snake case, document defaults

* Fix up straggling snake case keys

* Improve debugging info in schema error messages

* Fix rs_per_node, make gpu binding optional since it's new in lsf 10.1

* Update binding flag in examples, add note about gpu binding availability

* Add initial lsfscriptadapter tests

* Initial pass at general batch block documentation

* Remove old commentary

* Remove unneeded nodes/procs math in jsrun launcher substitution

* Remove unneeded logging output

* Remove more debugging log outputs

* Update lsf examples to match json schema for resource specification keys

* Cleanup the cpus per rs machinery, schema

* Add openmp and mpi lulesh study to exercise lsf resource specification keys

* Document the sample lsf lulesh specification
jwhite242 committed Apr 30, 2022
1 parent a6840ee commit 026ce65
Showing 9 changed files with 594 additions and 21 deletions.
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -107,6 +107,7 @@ And, running the study is still as simple as:
quick_start
hello_world
lulesh_breakdown
schedulers
parameters
maestro_core

87 changes: 83 additions & 4 deletions docs/source/lulesh_breakdown.rst
@@ -1,4 +1,83 @@
LULESH Specification Breakdown
===============================

Stub

Vanilla Unix/Linux Specification
++++++++++++++++++++++++++++++++

Stub

Parallel Specifications
+++++++++++++++++++++++

The next three variants of the LULESH example make changes to enable running the steps
on HPC clusters using a variety of schedulers, as well as invoking the various parallel
options in LULESH itself (MPI, OpenMP, ...).

Slurm Scheduled
---------------

Stub

Flux Scheduled
--------------

Stub

LSF Scheduled Parallel Specification
------------------------------------

This example is configured for running on LLNL's IBM/NVIDIA HPC machines, which use the
LSF job scheduler to manage compute resources. Due to variations in system setups, this
may not work on all LSF installations and may require some tweaking to account for the
differences. As with the other parallel specifications, adjust the machine name, bank,
and queues to suit the system you run this spec on.


The first change is to the batch block, which selects the LSF scheduler so that batch
submission and parallel command injection work correctly:

.. literalinclude:: ../../samples/lulesh/lulesh_sample1_unix_lsf.yaml
    :language: yaml
    :lines: 18-22
    :emphasize-lines: 2

We'll briefly skip ahead to the parameters block, as there are several parameters
used to specify all three modes of running. ``TASKS`` specifies the number of MPI tasks,
while ``CPUS_PER_TASK`` captures the OpenMP threads inside each task. These can be nested,
but here we only show the pure-MPI and pure-OpenMP modes of LULESH:

.. literalinclude:: ../../samples/lulesh/lulesh_sample1_unix_lsf.yaml
    :language: yaml
    :lines: 80-99
    :emphasize-lines: 10-12, 14-16
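
For orientation, a parameters block along these lines pairs the two modes (this is a
sketch only; the values here are illustrative, and the real ones live in the sample file
excerpted above):

.. code-block:: yaml

    global.parameters:
        TASKS:
            values: [27, 1]
            label: TASKS.%%
        CPUS_PER_TASK:
            values: [1, 16]
            label: CPUS_PER_TASK.%%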


There are a few differences in the first step to deal with the LLNL systems, where we
use the Lmod setup to select the MPI and compiler installations. In addition, we see the
resource keys, which differ slightly owing to the way LSF describes resources using
resource sets (rs). In this step that is not very important yet, as the compilation does
not need to run in parallel. Note that ``depends`` is empty, as this is the first step
that needs to run and has no dependencies.

.. literalinclude:: ../../samples/lulesh/lulesh_sample1_unix_lsf.yaml
    :language: yaml
    :lines: 25-48
    :emphasize-lines: 5-11, 18-24


The run step gets a little more interesting now that we are mixing serial, MPI-parallel,
and OpenMP-parallel modes in a single step. The biggest difference here is that some of
the resource parameters are used inside the step to set the environment variable that
controls the OpenMP thread count via the ``$(CPUS_PER_TASK)`` Maestro parameter.

.. literalinclude:: ../../samples/lulesh/lulesh_sample1_unix_lsf.yaml
    :language: yaml
    :lines: 51-77
    :emphasize-lines: 16-18, 21-27
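
In other words, the step command can export the standard OpenMP control variable from the
Maestro parameter before launching. A minimal sketch (the binary name below is a
placeholder, not the sample's exact command):

.. code-block:: bash

    # CPUS_PER_TASK is expanded by Maestro before the batch script is written
    export OMP_NUM_THREADS=$(CPUS_PER_TASK)
    $(LAUNCHER) ./lulesh2.0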

And finally, here is the complete specification:

.. literalinclude:: ../../samples/lulesh/lulesh_sample1_unix_lsf.yaml
    :language: yaml
231 changes: 231 additions & 0 deletions docs/source/schedulers.rst
@@ -0,0 +1,231 @@
Scheduling Studies (a.k.a. the Batch Block)
===========================================

The batch block is an optional component of the workflow specification that enables
job submission and management on remote clusters. The block contains a handful of
keys for specifying system-level information that applies to all scheduled
steps:

.. list-table:: Batch block keys
    :header-rows: 1

    * - Key
      - Description
    * - type
      - Scheduler adapter to use. Currently supported are ``slurm``, ``lsf``, ``flux``,
        and ``local``. The default is ``local``, i.e. non-scheduled job steps.
    * - host
      - Name of the cluster being run on
    * - queue
      - Machine partition to schedule jobs to: ``--partition`` for Slurm systems, ``-q``
        for LSF systems, ...
    * - bank
      - Account which runs the job; this is used for computing job priority on the
        cluster: ``--account`` on Slurm, ``-G`` on LSF, ...
    * - reservation
      - Name of any pre-reserved machine partition to submit jobs to
    * - qos
      - Optional quality-of-service options (Slurm only)
    * - flux_uri
      - Flux-specific reservation functionality

The information in this block is used to populate the step-specific batch scripts with the
appropriate header comment blocks (e.g. ``#SBATCH --partition`` for Slurm). Additional keys,
such as step-specific resource requirements (number of nodes, CPUs/tasks, GPUs, ...), get
added here when processing individual steps; see the subsequent sections for
scheduler-specific details. Note that job steps will run locally unless at least the
``nodes`` or ``procs`` key is populated in the step.
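
As an illustration only (the host, bank, and queue names below are placeholders rather
than values from any particular system), a batch block targeting an LSF cluster might
look like:

.. code-block:: yaml

    batch:
        type: lsf
        host: mycluster
        bank: myaccount
        queue: myqueue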

LOCAL
*****

Stub

SLURM
*****

Stub

FLUX
****

Stub

LSF: a Tale of Two Launchers
****************************

The LSF scheduler provides multiple options for the parallel launcher command:

* `lsrun <https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=jobs-run-interactive-tasks>`_
* `jsrun <https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=SSWRJV_10.1.0/jsm/jsrun.html>`_

Maestro currently supports only the jsrun launcher, which differs from Slurm in allowing
a more flexible specification of the resources available to each task. In addition to the
``procs``, ``cores per task``, and ``gpus`` keys, there are also ``tasks per rs`` and
``rs per node``. ``jsrun`` describes things in terms of resource sets (rs), with several
keywords controlling these resource sets and mapping them onto the actual machine/node
allocations:

.. list-table:: Mapping of LSF args to Maestro step keys
    :header-rows: 1

    * - LSF (jsrun)
      - Maestro
      - Description
      - Default setting
    * - -n, --nrs
      - procs
      - Number of resource sets
      - 1
    * - -a, --tasks_per_rs
      - tasks per rs
      - Number of MPI tasks (ranks) in a resource set
      - 1
    * - -c, --cpu_per_rs
      - cores per task
      - Number of physical CPU cores in a resource set
      - 1
    * - -g, --gpu_per_rs
      - gpus
      - Number of GPUs per resource set
      - 0
    * - -b, --bind
      - bind
      - Controls binding of tasks in a resource set
      - 'rs'
    * - -B, --bind_gpus
      - bind gpus
      - Controls binding of tasks to GPUs in a resource set
      - 'none'
    * - -r, --rs_per_host
      - rs per node
      - Number of resource sets per node
      - 1


.. note::

    ``bind_gpus`` is new in LSF 10.1 and may not be available on all systems.

Now for a few examples of how to map these onto Maestro's resource specifications.
Note that the ``nodes`` key is not used directly in the jsrun command for any of these,
but it is still used for the reservation itself. The rest of the keys control the
per-task resources and the per-node packing of resource sets:

* 1 resource set per GPU on a cluster with 4 GPUs per node, with an application
  requesting 8 GPUs. This will consume 2 full nodes of the cluster, with 1 MPI rank
  associated with each GPU and 1 CPU per rank.

  .. code-block:: bash

      jsrun -nrs 8 -a 1 -c 1 -g 1 -r 4 --bind rs my_awesome_gpu_application

  And the corresponding Maestro step that generates it:

  .. code-block:: yaml

      study:
          - name: run-my-app
            description: launch the best gpu application.
            run:
                cmd: |
                    $(LAUNCHER) my_awesome_gpu_application
                procs: 8
                nodes: 2
                gpus: 1
                rs per node: 4
                tasks per rs: 1
                cores per task: 1

  Note that ``procs`` here maps more to the tasks/resource set concept in LSF/jsrun, and
  ``nodes`` is a multiplier on ``rs per node``, which yields the ``--nrs`` jsrun key
  (see the sketch after these examples).

* 1 resource set per CPU, with no GPUs, using all 44 CPUs on the node:

  .. code-block:: bash

      jsrun -nrs 44 -a 1 -c 1 -g 0 -r 44 --bind rs my_awesome_mpi_cpu_application

  .. code-block:: yaml

      study:
          - name: run-my-app
            description: launch a pure mpi-cpu application.
            run:
                cmd: |
                    $(LAUNCHER) my_awesome_mpi_cpu_application
                procs: 44
                nodes: 1
                gpus: 0
                rs per node: 44
                tasks per rs: 1
                cores per task: 1

  Again, note that ``procs`` is a multiple of ``rs per node``.

* Several multithreaded MPI ranks per node, with no GPUs:

  .. code-block:: bash

      jsrun -nrs 4 -a 1 -c 11 -g 0 -r 4 --bind rs my_awesome_omp_mpi_cpu_application

  .. code-block:: yaml

      study:
          - name: run-my-app
            description: launch an application using mpi and omp
            run:
                cmd: |
                    $(LAUNCHER) my_awesome_omp_mpi_cpu_application
                procs: 4
                nodes: 1
                gpus: 0
                rs per node: 4
                tasks per rs: 1
                cores per task: 11

* Several multithreaded MPI ranks per node with one GPU per rank, spanning multiple
  nodes that have 4 GPUs each:

  .. code-block:: bash

      jsrun -nrs 8 -a 1 -c 11 -g 1 -r 4 --bind rs my_awesome_all_the_threads_application

  .. code-block:: yaml

      study:
          - name: run-my-app
            description: Use all the threads!
            run:
                cmd: |
                    $(LAUNCHER) my_awesome_all_the_threads_application
                procs: 8
                nodes: 2
                gpus: 1
                rs per node: 4
                tasks per rs: 1
                cores per task: 11

* An MPI application that needs lots of memory per rank:

  .. code-block:: bash

      jsrun -nrs 2 -a 1 -c 1 -g 0 -r 1 --bind rs my_memory_hungry_application

  .. code-block:: yaml

      study:
          - name: run-my-app
            description: Use all the memory for a single task per node
            run:
                cmd: |
                    $(LAUNCHER) my_memory_hungry_application
                procs: 2
                nodes: 2
                gpus: 0
                rs per node: 1
                tasks per rs: 1
                cores per task: 1
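
To summarize the arithmetic used throughout these examples, here is a small, hypothetical
Python sketch (not the adapter's actual implementation) of how the Maestro step keys
combine into the jsrun flags listed in the table above:

.. code-block:: python

    def jsrun_flags(nodes=1, rs_per_node=1, tasks_per_rs=1,
                    cores_per_task=1, gpus=0, bind="rs"):
        """Combine Maestro step keys into a jsrun argument list (illustrative only)."""
        nrs = nodes * rs_per_node  # total resource sets; matches the `procs` key above
        return [
            "jsrun",
            "--nrs", str(nrs),
            "--tasks_per_rs", str(tasks_per_rs),
            "--cpu_per_rs", str(cores_per_task),
            "--gpu_per_rs", str(gpus),
            "--rs_per_host", str(rs_per_node),
            "--bind", bind,
        ]

    # First example above: 2 nodes x 4 resource sets/node, 1 GPU and 1 core per set
    print(" ".join(jsrun_flags(nodes=2, rs_per_node=4, gpus=1)))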
1 change: 1 addition & 0 deletions maestrowf/datastructures/core/executiongraph.py
@@ -106,6 +106,7 @@ def generate_script(self, adapter, tmp_dir=""):
LOGGER.info("Generating script for %s into %s", self.name, scr_dir)
self.to_be_scheduled, self.script, self.restart_script = \
adapter.write_script(scr_dir, self.step)
LOGGER.debug("STEP: %s", self.step)
LOGGER.info("Script: %s\nRestart: %s\nScheduled?: %s",
self.script, self.restart_script, self.to_be_scheduled)

