Merge pull request #314 from Libensemble/develop
Updating Stefan's branch with latest Develop
jmlarson1 committed Dec 3, 2019
2 parents 0c36945 + da24a66 commit 29defcc
Showing 20 changed files with 226 additions and 133 deletions.
3 changes: 2 additions & 1 deletion docs/FAQ.rst
@@ -145,7 +145,8 @@ There are several ways to address this nuisance, but all involve trial and error
An easy (but insecure) solution is temporarily disabling the Firewall through
System Preferences -> Security & Privacy -> Firewall -> Turn Off Firewall. Alternatively,
adding a Firewall "Allow incoming connections" rule can be attempted for the offending
Job Controller executable. We've had limited success running ``sudo codesign --force --deep --sign - /path/to/application.app``
Job Controller executable. We've had limited success running
``sudo codesign --force --deep --sign - /path/to/application.app``
on our Job Controller executables, then confirming the next alerts for the executable
and ``mpiexec.hydra``.

10 changes: 9 additions & 1 deletion docs/data_structures/calc_status.rst
@@ -3,7 +3,15 @@
calc_status
===========

The ``calc_status`` is an integer attribute with named (enumerated) values and a corresponding description that can be used in :ref:`sim_f<api_sim_f>` or :ref:`gen_f<api_gen_f>` functions to capture the status of a calculation. This is returned to the manager and printed to the ``libE_stats.txt`` file. Only the status values ``FINISHED_PERSISTENT_SIM_TAG`` and ``FINISHED_PERSISTENT_GEN_TAG`` are currently used by the manager, but others can still provide a useful summary in ``libE_stats.txt``. The user determines the status of the calculation, as it could include multiple application runs. It can be added as a third return variable in ``sim_f`` or ``gen_f`` functions.
The ``calc_status`` is an integer attribute with named (enumerated) values and
a corresponding description that can be used in :ref:`sim_f<api_sim_f>` or
:ref:`gen_f<api_gen_f>` functions to capture the status of a calculation. This
is returned to the manager and printed to the ``libE_stats.txt`` file. Only the
status values ``FINISHED_PERSISTENT_SIM_TAG`` and
``FINISHED_PERSISTENT_GEN_TAG`` are currently used by the manager, but others
can still provide a useful summary in ``libE_stats.txt``. The user determines
the status of the calculation, as it could include multiple application runs.
It can be added as a third return variable in ``sim_f`` or ``gen_f`` functions.
The calc_status codes are in the ``libensemble.message_numbers`` module.

Example of ``calc_status`` used along with :ref:`job controller<jobcontroller_index>` in sim_f:
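A minimal sketch of the return pattern (without the job-controller launch
itself; the input field ``'x'``, output field ``'f'``, and the constants
``WORKER_DONE``/``JOB_FAILED`` from ``libensemble.message_numbers`` are
illustrative assumptions, not taken from this page)::

    import numpy as np
    from libensemble.message_numbers import WORKER_DONE, JOB_FAILED

    def sim_norm(H, persis_info, sim_specs, libE_info):
        # Evaluate the received points and report how the calculation went.
        out = np.zeros(H['x'].shape[0], dtype=sim_specs['out'])
        out['f'] = np.linalg.norm(H['x'], axis=1)   # placeholder computation

        calc_status = WORKER_DONE                   # or e.g. JOB_FAILED on error

        # calc_status is the third return value; the manager records it
        # in libE_stats.txt.
        return out, persis_info, calc_status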
7 changes: 5 additions & 2 deletions docs/data_structures/work_dict.rst
@@ -20,8 +20,11 @@ Dictionary with integer keys ``i`` and dictionary values to be given to worker ``i``
'persistent' [bool]: True if worker 'i' will enter persistent mode
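A rough sketch of the shape of one entry, built inside an allocation function.
The helper name is hypothetical, and the keys ``'H_fields'``, ``'persis_info'``,
``'tag'``, and ``'libE_info'`` (with ``EVAL_SIM_TAG`` from
``libensemble.message_numbers``) are assumed from typical allocation functions
such as those linked below; a persistent worker would additionally carry the
``'persistent'`` flag described above::

    from libensemble.message_numbers import EVAL_SIM_TAG

    def fill_sim_work(Work, wid, sim_specs, persis_info, H_rows):
        # Hypothetical helper: ask worker `wid` to run its sim_f on rows
        # `H_rows` of the history array H.
        Work[wid] = {'H_fields': sim_specs['in'],       # fields of H sent to the worker
                     'persis_info': persis_info[wid],   # state carried between calls
                     'tag': EVAL_SIM_TAG,               # tells the worker to call sim_f
                     'libE_info': {'H_rows': H_rows}}   # rows of H to evaluate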

.. seealso::
For allocation functions giving Work dictionaries using persistent workers, see `start_only_persistent.py`_ or `start_persistent_local_opt_gens.py`_.
For a use case where the allocation and generator functions combine to do simulation evaluations with different resources (blocking some workers), see `test_6-hump_camel_with_different_nodes_uniform_sample.py`_.
For allocation functions giving Work dictionaries using persistent workers,
see `start_only_persistent.py`_ or `start_persistent_local_opt_gens.py`_.
For a use case where the allocation and generator functions combine to do
simulation evaluations with different resources (blocking some workers), see
`test_6-hump_camel_with_different_nodes_uniform_sample.py`_.

.. _start_only_persistent.py: https://github.com/Libensemble/libensemble/blob/develop/libensemble/alloc_funcs/start_only_persistent.py
.. _start_persistent_local_opt_gens.py: https://github.com/Libensemble/libensemble/blob/develop/libensemble/alloc_funcs/start_persistent_local_opt_gens.py
7 changes: 4 additions & 3 deletions docs/data_structures/worker_array.rst
@@ -12,9 +12,10 @@ worker array
'blocked' [int]:
Are the worker's resources blocked by another calculation?

Since workers can be in a variety of states, the worker array ``W`` contains
information about each worker's state. This can allow an allocation function to
determine what work should be performed.
The worker array ``W`` contains information about each worker's state. This
information helps allocation functions determine what work should be
performed next.
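For example, an allocation function might pick out idle, unblocked workers as
in this small self-contained sketch (the field ``'blocked'`` is documented
above; ``'worker_id'``, ``'active'``, and ``'persis_state'`` are assumed from
the convention table below)::

    import numpy as np

    # Illustrative worker array with the fields discussed on this page.
    W = np.zeros(3, dtype=[('worker_id', int), ('active', int),
                           ('persis_state', int), ('blocked', int)])
    W['worker_id'] = [1, 2, 3]
    W['active'] = [0, 1, 0]        # worker 2 is currently busy

    # Workers that are idle and not blocked can be given new work.
    idle = W['worker_id'][(W['active'] == 0) & (W['blocked'] == 0)]
    print(idle)                    # -> [1 3]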

We take the following convention:

========================================= ======= ============ =======
@@ -13,9 +13,11 @@ This assumes you have already:

Details on how to create forks can be found at: https://help.github.com/articles/fork-a-repo

You now have a configuration like the one shown in the answer at: https://stackoverflow.com/questions/6286571/are-git-forks-actually-git-clones
You now have a configuration like the one shown in the answer at:
https://stackoverflow.com/questions/6286571/are-git-forks-actually-git-clones

Upstream, in this case, is the official Spack repository on GitHub. Origin is your fork on GitHub and Local Machine is your local clone (from your fork).
Upstream, in this case, is the official Spack repository on GitHub. Origin is
your fork on GitHub and Local Machine is your local clone (from your fork).

Make sure ``SPACK_ROOT`` is set and the ``spack`` binary is on your path::

@@ -37,8 +39,8 @@ To set upstream repo::
Now to update (the main develop branch)
---------------------------------------

You will now update your local machine from the upstream repo (if in doubt, make a copy of your local repo
in your filesystem before doing the following).
You will now update your local machine from the upstream repo (if in doubt,
make a copy of your local repo in your filesystem before doing the following).

Check upstream remote is present::

12 changes: 8 additions & 4 deletions docs/dev_guide/release_management/release_process.rst
@@ -13,9 +13,11 @@ Before release

- A release branch should be taken off develop (or develop pulls controlled).

- Release notes for this version are added to the documentation with release date, including a list of supported (tested) platforms.
- Release notes for this version are added to the documentation with release
date, including a list of supported (tested) platforms.

- Version number is updated wherever it appears (in ``setup.py``, ``libensemble/__init__.py``, ``README.rst`` and twice in ``docs/conf.py``)
- Version number is updated wherever it appears
(in ``setup.py``, ``libensemble/__init__.py``, ``README.rst``, and twice in ``docs/conf.py``).

- Check year is correct in ``README.rst`` under *Citing libEnsemble* and in ``docs/conf.py``.

@@ -31,7 +33,8 @@ Before release

- Documentation must build and display correctly wherever hosted (currently readthedocs.com).

- Pull request from either develop or release branch to master requesting reviewer/s (including at least one other administrator).
- Pull request from either develop or release branch to master requesting
reviewer/s (including at least one other administrator).

- Reviewer will check tests have passed and approve merge.

@@ -55,6 +58,7 @@ An administrator will take the following steps.
After release
-------------

- Ensure all relevant GitHub issues are closed and moved to the *Done* column on the kanban project board (inc. the release checklist).
- Ensure all relevant GitHub issues are closed and moved to the *Done* column
on the kanban project board (inc. the release checklist).

- Email libEnsemble mailing list
31 changes: 15 additions & 16 deletions docs/history_output.rst
@@ -1,16 +1,15 @@
The History Array
~~~~~~~~~~~~~~~~~
libEnsemble uses a NumPy structured array :ref:`H<datastruct-history-array>` to
store output from ``gen_f`` and corresponding ``sim_f`` output. Similarly,
``gen_f`` and ``sim_f`` are expected to return output in NumPy structured
arrays. The names of the fields to be given as input to ``gen_f`` and ``sim_f``
must be an output from ``gen_f`` or ``sim_f``. In addition to the fields output
from ``sim_f`` and ``gen_f``, the final history returned from libEnsemble will
include the following fields:
store corresponding output from each ``gen_f`` and ``sim_f``. Similarly,
``gen_f`` and ``sim_f`` are expected to return output as NumPy structured
arrays. The names of the input fields for ``gen_f`` and ``sim_f``
must be output from ``gen_f`` or ``sim_f``. In addition to the user-function output fields,
the final history from libEnsemble will include the following:

* ``sim_id`` [int]: Each unit of work output from ``gen_f`` must have an
associated ``sim_id``. The generator can assign this, but users must be
careful to ensure points are added in order. For example, ``if alloc_f``
careful to ensure points are added in order. For example, if ``alloc_f``
allows for two ``gen_f`` instances to be running simultaneously, ``alloc_f``
should ensure that both don’t generate points with the same ``sim_id``.

@@ -20,7 +19,7 @@ include the following fields:
* ``given_time`` [float]: At what time (since the epoch) was this ``gen_f``
output given to a worker?

* ``sim_worker`` [int]: libEnsemble worker that it was given to be evaluated.
* ``sim_worker`` [int]: libEnsemble worker to which the output was given for evaluation

* ``gen_worker`` [int]: libEnsemble worker that generated this ``sim_id``
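As the introduction above notes, ``gen_f`` returns its output as a NumPy
structured array whose field names match the inputs expected by ``sim_f``. A
minimal sketch, modeled loosely on a uniform-sampling generator (the keys
``'gen_batch_size'`` and ``'rand_stream'`` and the field ``'x'`` are
assumptions)::

    import numpy as np

    def gen_uniform_sketch(H, persis_info, gen_specs, libE_info):
        # Build a batch of points as a structured array; 'x' must be an
        # input field of the sim_f.
        batch = gen_specs['gen_batch_size']
        out = np.zeros(batch, dtype=gen_specs['out'])
        out['x'] = persis_info['rand_stream'].uniform(0, 1, out['x'].shape)
        return out, persis_info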

@@ -44,15 +43,15 @@ where ``sim_count`` is the number of points evaluated.

Other libEnsemble files produced by default are:

* ``libE_stats.txt``: This contains a one-line summary of all user
calculations. Each calculation summary is sent by workers to the manager and
printed as the run progresses.
* ``libE_stats.txt``: This contains a one-line summary of each user
calculation. Each summary is sent by workers to the manager and
logged as the run progresses.

* ``ensemble.log``: This is the logging output from libEnsemble. The default
logging is at INFO level. To gain additional diagnostics logging level can be
set to DEBUG. If this file is not removed, multiple runs will append output.
Messages at or above level MANAGER_WARNING are also copied to stderr to alert
the user promptly. For more info, see :doc:`Logging<logging>`.
* ``ensemble.log``: This contains logging output from libEnsemble. The default
logging level is INFO. To gain additional diagnostics, the logging level can be
set to DEBUG. If this file is not removed, multiple runs will append output.
Messages at or above MANAGER_WARNING are also copied to stderr to alert
the user promptly. For more info, see :doc:`Logging<logging>`.

Output Analysis
^^^^^^^^^^^^^^^
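For example, a history file saved by libEnsemble can be loaded and inspected
with NumPy (the filename below is a placeholder for whatever ``.npy`` file a
given run produces)::

    import numpy as np

    H = np.load('libE_history_for_some_run.npy')    # placeholder filename
    print(H.dtype.names)                            # e.g. 'sim_id', 'given_time', 'sim_worker', ...
    print(H[['sim_id', 'given_time', 'sim_worker']][:5])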
4 changes: 3 additions & 1 deletion docs/job_controller/jc_index.rst
@@ -3,7 +3,9 @@
Job Controller
==============

The job controller can be used within the simulator (and potentially generator) functions to provide a simple, portable interface for running and managing user jobs.
The job controller can be used within the simulator (and potentially generator)
functions to provide a simple, portable interface for running and managing user
jobs.

.. toctree::
:maxdepth: 2
14 changes: 9 additions & 5 deletions docs/job_controller/job_controller.rst
@@ -18,9 +18,10 @@ See the controller APIs for optional arguments.
Job Class
---------

Jobs are created and returned through the job_controller launch function. Jobs can be polled and
killed with the respective poll and kill functions. Job information can be queried through the job attributes
below and the query functions. Note that the job attributes are only updated when they are
Jobs are created and returned through the job_controller launch function. Jobs
can be polled and killed with the respective poll and kill functions. Job
information can be queried through the job attributes below and the query
functions. Note that the job attributes are only updated when they are
polled/killed (or through other job or job controller functions).

.. autoclass:: Job
@@ -32,9 +33,12 @@ polled/killed (or through other job or job controller functions).
Job Attributes
--------------

Following is a list of job status and configuration attributes that can be retrieved from a job.
Following is a list of job status and configuration attributes that can be
retrieved from a job.

:NOTE: These should not be set directly. Jobs are launched by the job controller and job information can be queried through the job attributes below and the query functions.
:NOTE: These should not be set directly. Jobs are launched by the job
controller and job information can be queried through the job attributes
below and the query functions.
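A sketch of polling a job and reading a few of these attributes (the module
path, retrieval via ``JobController.controller``, and the attribute and state
names are recalled from this era of the API and should be treated as
assumptions)::

    import time
    from libensemble.controller import JobController

    def run_sim_job(num_procs=4, timeout=60):
        jobctl = JobController.controller       # controller created in the calling script
        job = jobctl.launch(calc_type='sim', num_procs=num_procs)

        start = time.time()
        while not job.finished and time.time() - start < timeout:
            time.sleep(2)
            job.poll()                          # attributes update only on poll/kill

        if not job.finished:
            job.kill()                          # give up after the timeout

        return job.state, job.errcode           # e.g. 'FINISHED' or 'FAILED', plus error code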

Job Status attributes include:

8 changes: 6 additions & 2 deletions docs/job_controller/mpi_controller.rst
@@ -22,7 +22,9 @@ See the controller API below for optional arguments.
Class specific attributes
-------------------------

These attributes can be set directly to alter behaviour of the MPI job controller. However, they should be used with caution, as they may not be implemented in other job controllers.
These attributes can be set directly to alter behaviour of the MPI job
controller. However, they should be used with caution, as they may not be
implemented in other job controllers.

:max_launch_attempts: (int) Maximum number of launch attempts for a given job. *Default: 5*.
:fail_time: (int) *Only if wait_on_run is set.* Maximum run-time to failure in seconds that results in re-launch. *Default: 2*.
@@ -33,4 +35,6 @@ Example. To increase resilience against launch failures::
jobctrl.max_launch_attempts = 10
jobctrl.fail_time = 5

Note that the re-try delay on launches starts at 5 seconds and increments by 5 seconds for each retry. So the 4th re-try will wait for 20 seconds before re-launching.
Note that the re-try delay on launches starts at 5 seconds and increments by
5 seconds for each retry. So the 4th re-try will wait for 20 seconds before
re-launching.
65 changes: 33 additions & 32 deletions docs/job_controller/overview.rst
@@ -1,38 +1,37 @@
Job Controller Overview
=======================

Many users will wish to launch an application to the system from a :ref:`sim_f<api_sim_f>`
(or :ref:`gen_f<api_gen_f>`), running on a worker.

An MPI job, for example, could be initialized with a subprocess call to ``mpirun``, or
an alternative launcher such as ``aprun`` or ``jsrun``. The sim_f may then monitor this job,
check output, and possibly kill the job. The word ``job`` is used here to represent
a launch of an application to the system, where the system could be a supercomputer,
cluster, or any other provision of compute resources.

In order to remove the burden of system interaction from the user, and enable sim_f
scripts that are portable between systems, a job_controller interface is provided by
libEnsemble. The job_controller provides the key functions: ``launch()``, ``poll()`` and
``kill()``. libEnsemble auto-detects a number of system criteria, such as the MPI launcher,
along with correct mechanisms for polling and killing jobs, on supported systems. It also
contains built in resilience, such as re-launching jobs that fail due to system factors.
User scripts that employ the job_controller interface will be portable between supported
systems. Job attributes can be queried to determine status after each poll. Functions are
also provided to access and interrogate files in the job's working directory.

The Job Controller module can be used to submit
and manage jobs using a portable interface. Various back-end mechanisms may be
used to implement this interface on the system, including a proxy launcher and
job management system, such as Balsam. Currently, these job_controllers launch
at the application level within an existing resource pool. However, submissions
to a batch schedular may be supported in the future.

At the top-level calling script, a job_controller is created and the executable
gen or sim applications are registered to it (these are applications that will
be runnable jobs). If an alternative job_controller, such as Balsam, is to be
used, then these can be created as in the example. Once in the user-side worker
code (sim/gen func), an MPI based job_controller can be retrieved without any
need to specify the type.
Users who wish to launch jobs to a system from a :ref:`sim_f<api_sim_f>` (or :ref:`gen_f<api_gen_f>`)
running on a worker have several options.

Typically, an MPI job could be initialized with a subprocess call to
``mpirun`` or an alternative launcher such as ``aprun`` or ``jsrun``. The ``sim_f``
may then monitor this job, check output, and possibly kill the job. We use "job"
to represent an application launch to the system, which may be a supercomputer,
cluster, or other provision of compute resources.

A **job_controller** interface is provided by libEnsemble to remove the burden of
system interaction from the user and ease writing portable user scripts that
launch applications. The job_controller provides the key functions: ``launch()``,
``poll()`` and ``kill()``. Job attributes can be queried to determine status after
each poll. To implement these functions, libEnsemble auto-detects system criteria
such as the MPI launcher and mechanisms to poll and kill jobs on supported systems.
libEnsemble's job_controller is resilient, and can re-launch jobs that fail due
to system factors.

Functions are also provided to access and interrogate files in the job's working directory.

Various back-end mechanisms may be used by the job_controller to best interact
with each system, including proxy launchers or job management systems like
Balsam_. Currently, these job_controllers launch at the application level within
an existing resource pool. However, submissions to a batch scheduler may be
supported in the future.

In a calling script, a job_controller object is created and the executable
generator or simulation applications are registered to it for launch. If an
alternative job_controller such as Balsam is to be used, the applications can be
registered as in the example below. Once in the user-side worker code (sim/gen func),
an MPI based job_controller can be retrieved without any need to specify the type.

**Example usage (code runnable with or without a Balsam backend):**
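A minimal sketch of the pattern (module paths, ``register_calc``, and the
launch arguments reflect the job-controller API of this period and are
assumptions to check against the registration docs)::

    # Calling script: create a job controller and register the sim application.
    from libensemble.mpi_controller import MPIJobController

    jobctrl = MPIJobController()
    jobctrl.register_calc(full_path='/path/to/sim_app', calc_type='sim')

    # Inside the sim_f (worker side): retrieve the controller and launch a job.
    from libensemble.controller import JobController

    jobctl = JobController.controller
    job = jobctl.launch(calc_type='sim', num_procs=4, app_args='input.txt',
                        stdout='out.txt', stderr='err.txt')
    job.poll()
    print(job.state)        # job attributes refresh on poll/kill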

@@ -92,3 +91,5 @@ For a more realistic example see:
- libensemble/tests/scaling_tests/forces/

which launches the forces.x application as an MPI job.

.. _Balsam: https://balsam.readthedocs.io/en/latest/
