Merged
Commits
110 commits
9e70cb5
initial commit for attempting new balsam executor. very WIP
jlnav Dec 3, 2021
63128e5
update poll, wait, and kill for new balsam states. remove test file. …
jlnav Dec 7, 2021
8103bae
intermediate work with forces+balsam, move AppDef class
jlnav Dec 14, 2021
e7efe4b
Merge branch 'develop' into feature/balsam7
jlnav Jan 31, 2022
7ea61bd
initial commit for balsam-forces, small adjustments in preparation fo…
jlnav Jan 31, 2022
bb9c88f
update balsam simf, set Application to accept optional PyObj (balsam …
jlnav Jan 31, 2022
47839fe
rename parameter, fix spacing in command template
jlnav Feb 1, 2022
4f40ae8
refactor, add BalsamExecutor.submit_allocation() to reserve resources…
jlnav Feb 7, 2022
bd1d3fd
flake8
jlnav Feb 7, 2022
38cfdac
refactoring, use local site, return 0 runtime if EventLog query is empty
jlnav Feb 7, 2022
604106e
sadly, mpiresources unused in new executor?
jlnav Feb 7, 2022
b25c338
Whitespace
jmlarson1 Feb 8, 2022
3e8d329
Spelling
jmlarson1 Feb 8, 2022
f4a8753
Edits to readme
jmlarson1 Feb 8, 2022
ae3755d
Black on forces balsam runscript
jmlarson1 Feb 8, 2022
1774712
Black on forces_simf
jmlarson1 Feb 8, 2022
e850890
Merge branch 'develop' into feature/balsam7
jlnav Feb 8, 2022
3e12581
initial round of updating READMEs
jlnav Feb 8, 2022
46cbed8
trying out theta again, debug attempts
jlnav Feb 9, 2022
955e9fc
experiment with only manager process defining app, also not re-defini…
jlnav Feb 9, 2022
b73a8d5
fix is_manager check
jlnav Feb 10, 2022
be945dd
try turning back on refresh_from_db
jlnav Feb 10, 2022
2458c03
testing something on rtd
jlnav Feb 11, 2022
6512086
add necessary css
jlnav Feb 11, 2022
3abf041
whitespace
jlnav Feb 11, 2022
06e8257
adding page template for javascript, credit to stackoverflow genius
jlnav Feb 11, 2022
5cbb7b5
removing toggle block for now, now that we know it works
jlnav Feb 11, 2022
bab4b90
comments/readme adjusts, cleanup old forces
jlnav Feb 17, 2022
e5017e6
Merge branch 'develop' into feature/balsam7
jlnav Feb 23, 2022
dff703f
initial attempt on revoke_allocation, globus data transfers
jlnav Feb 23, 2022
130e229
replace url-safe with token-hex
jlnav Feb 24, 2022
00b32b5
fix executor set_complete and revoke_allocation, add logic to transfe…
jlnav Feb 28, 2022
e287ee0
black
jlnav Feb 28, 2022
7b92854
initial globus docs in readme, additional improvements
jlnav Mar 1, 2022
385399b
adding POSTPROCESSED as a success balsam state, small fix to cleanup,…
jlnav Mar 1, 2022
fd579f9
black, refactoring, new documentation
jlnav Mar 1, 2022
6d4eb47
slight rename of new Balsam Executor, adds autodocing, fixes class me…
jlnav Mar 1, 2022
97b74b1
flake8
jlnav Mar 1, 2022
48ad8c3
Merge branch 'develop' into feature/balsam7
jlnav Mar 1, 2022
86404ff
fix css and conf.py?
jlnav Mar 1, 2022
576d009
fix balsam import condition
jlnav Mar 2, 2022
441c4a8
probably don't import both, especially during CI?
jlnav Mar 2, 2022
40085c4
yet again rearrange pkg_resources logic
jlnav Mar 2, 2022
c8ec059
catch DistributionNotFound
jlnav Mar 2, 2022
7fa44df
Simple black
jmlarson1 Mar 2, 2022
635d9e7
Simple black
jmlarson1 Mar 2, 2022
90ab921
Spell
jmlarson1 Mar 2, 2022
2196985
large refactorings of docs, refactor balsam-submission routine to def…
jlnav Mar 2, 2022
e3404b7
flake8
jlnav Mar 2, 2022
63a84ff
Merge branch 'feature/balsam7' of https://github.com/Libensemble/libe…
jlnav Mar 2, 2022
a8a1d34
fix load_by_site
jlnav Mar 2, 2022
85a4147
still need to specify MPI ranks for libensemble job, bump sim_max
jlnav Mar 2, 2022
6cbac2c
pass on OSError on attempt to sync Balsam app (probably no access to …
jlnav Mar 2, 2022
b227774
fix globus destination directory
jlnav Mar 3, 2022
d6e5511
attempt to cancel BatchJob once all stat files returned
jlnav Mar 3, 2022
e206ef8
attempt to improve forces.stat eval logic, considering machine/if tra…
jlnav Mar 3, 2022
fba4c6c
fix np.read of statfile for each run dest
jlnav Mar 3, 2022
7de8549
fix syntax
jlnav Mar 3, 2022
afc2cf1
rename submission script, add section on submitting libe as balsam ap…
jlnav Mar 7, 2022
1a066b0
old balsam executor is now legacy, new balsam executor is now just th…
jlnav Mar 7, 2022
7bf0123
deprecating old_balsam_tests and standalone executor tests?
jlnav Mar 7, 2022
230a78d
additional documentation and missed renames
jlnav Mar 7, 2022
841b3ad
fix docstring
jlnav Mar 7, 2022
da1dc36
additional docs
jlnav Mar 7, 2022
9eca5ca
additional options for jobs and batchjobs, including tags, partitions…
jlnav Mar 7, 2022
b9d0271
fix legacy balsam test
jlnav Mar 7, 2022
0b70819
some docs clarifications, monospace classes and functions throughout
jlnav Mar 8, 2022
b58c1cd
detect via hostname if submission script running on theta
jlnav Mar 8, 2022
3c3e48b
Black for scaling
jmlarson1 Mar 8, 2022
d5dafcd
Black on deprecated tests
jmlarson1 Mar 8, 2022
aef1cca
tiny docs changes
jlnav Mar 10, 2022
2cbe67a
update README for new balsam
jlnav Mar 14, 2022
cc7f7fc
Merge branch 'develop' into feature/balsam7
jlnav Mar 15, 2022
fe9f396
Merge branch 'develop' into feature/balsam7
jmlarson1 Mar 16, 2022
5823f04
rearranges scripts so Balsam Apps are defined in another script
jlnav Mar 22, 2022
552592f
start to rearrange README in balsam_forces
jlnav Mar 22, 2022
09240bf
some comments reorginzation
jlnav Mar 23, 2022
141ebcd
Docs/cleanup deprecated tests (#756)
jmlarson1 Mar 24, 2022
51c75bc
black
jmlarson1 Mar 24, 2022
e1682c7
apparently fixes retrieved app's unresolved site_id
jlnav Mar 24, 2022
9ca0aa0
Merge branch 'feature/balsam7' of https://github.com/Libensemble/libe…
jlnav Mar 24, 2022
fba6ba2
some reformatting to emphasize how libensemble app reserved for remot…
jlnav Mar 28, 2022
ea3581f
black
jlnav Mar 28, 2022
2be049b
next iteration of updating README
jlnav Mar 29, 2022
a7d0e8a
makes transfers False by default, completes reorganization of readme
jlnav Mar 30, 2022
e1760e8
Merge branch 'develop' into feature/balsam7
jlnav Apr 4, 2022
c7e3e9c
Merge branch 'develop' into feature/balsam7
jlnav Apr 5, 2022
3526da2
Merge branch 'develop' into feature/balsam7
jlnav Apr 7, 2022
2646831
Merge branch 'develop' into feature/balsam7
jmlarson1 Apr 11, 2022
7d074e6
Merge branch 'develop' into feature/balsam7
jlnav Apr 18, 2022
46edbf0
fix executor init logic to resolve Balsam2 having new name on pypi
jlnav Apr 18, 2022
5993a9c
primarily a rename to clarify, and some clarifying comments
jlnav Apr 18, 2022
56ac5bf
Merge branch 'develop' into feature/balsam7
jlnav Apr 18, 2022
b1a4881
refactor platforms_index for new Balsam diagram, comparing Balsam ver…
jlnav Apr 18, 2022
5f64174
refactor forces tests and many common components into separate direct…
jlnav Apr 19, 2022
44f6467
move miniforces, refactor balsam forces test to be simplified based o…
jlnav Apr 19, 2022
a96238d
flake8
jlnav Apr 19, 2022
b0f408d
flake8 actually
jlnav Apr 19, 2022
0f4fdb0
fix some paths, remove unneeded transfer, fix workdir naming
jlnav Apr 20, 2022
c1dbcbf
moving imports to top of file
jlnav Apr 20, 2022
e450e60
fix workdir concatanation again, improve executor __init__.py logic t…
jlnav Apr 20, 2022
8b1f00f
rearrange README so globus instructions are higher, remove references…
jlnav Apr 20, 2022
ecfb76e
flake8
jlnav Apr 20, 2022
e1d5a60
Merge pull request #775 from Libensemble/refactor/reorganize_forces
jlnav Apr 21, 2022
067e0e3
Merge branch 'develop' into feature/balsam7
jlnav Apr 21, 2022
cd54dc8
Merge branch 'develop' into feature/balsam7
jlnav Apr 21, 2022
6a6d654
fix workdir concatenation, add additional catchable error in __init__…
jlnav Apr 21, 2022
7999377
small adjusts for undoing of nparticles in forces.stat filename
jlnav Apr 21, 2022
b4fa0bc
fix statfile transfer path
jlnav Apr 21, 2022
4ccb911
try new loop that sleeps at least once, then checks if transferred st…
jlnav Apr 21, 2022
10 changes: 5 additions & 5 deletions README.rst
@@ -82,10 +82,9 @@ Optional dependencies:

* Balsam_

-If running on the the compute nodes of three-tier systems
-like OLCF's Summit_ or ALCF's Theta_, libEnsemble's workers may use the Balsam service
-to schedule and launch MPI applications. Otherwise, libEnsemble can be run with
-multiprocessing on the intermediate launch nodes.
+As of v0.8.0+dev, libEnsemble features an updated `Balsam Executor`_
+for workers to schedule and launch applications to *anywhere* with a running
+Balsam site, including to remote machines.

* pyyaml_

@@ -296,7 +295,8 @@ See a complete list of `example user scripts`_.
.. _across: https://libensemble.readthedocs.io/en/develop/platforms/platforms_index.html#funcx-remote-user-functions
.. _APOSMM: https://link.springer.com/article/10.1007/s12532-017-0131-4
.. _AWA: https://link.springer.com/article/10.1007/s12532-017-0131-4
-.. _Balsam: https://www.alcf.anl.gov/support-center/theta/balsam
+.. _Balsam: https://balsam.readthedocs.io/en/latest/
+.. _Balsam Executor: https://libensemble.readthedocs.io/en/develop/executor/balsam_2_executor.html
.. _Community Examples repository: https://github.com/Libensemble/libe-community-examples
.. _Conda: https://docs.conda.io/en/latest/
.. _conda-forge: https://conda-forge.org/
14 changes: 14 additions & 0 deletions docs/executor/balsam_2_executor.rst
@@ -0,0 +1,14 @@
Balsam Executor - Remote apps
=============================

.. automodule:: balsam_executor
:no-undoc-members:

.. autoclass:: BalsamExecutor
:show-inheritance:
:members: __init__, register_app, submit_allocation, revoke_allocation, submit

.. autoclass:: BalsamTask
:show-inheritance:
:member-order: bysource
:members: poll, wait, kill
17 changes: 0 additions & 17 deletions docs/executor/balsam_executor.rst

This file was deleted.

14 changes: 8 additions & 6 deletions docs/executor/ex_index.rst
@@ -1,16 +1,18 @@
.. _executor_index:

-Executor
-========
+Executors
+=========

-libEnsemble's Executor can be used within the simulator (and, potentially, the generator)
-functions to provide a simple, portable interface for running and managing user
-applications.
+libEnsemble's Executors can be used within user functions to provide a simple,
+portable interface for running and managing user applications.

.. toctree::
:maxdepth: 2
:titlesonly:
-:caption: libEnsemble Executor:
+:caption: libEnsemble Executors:

overview
executor
mpi_executor
+legacy_balsam_executor
+balsam_2_executor
7 changes: 4 additions & 3 deletions docs/executor/executor.rst
@@ -1,5 +1,5 @@
-Executor Modules
-================
+Base Executor - Local apps
+==========================

.. automodule:: executor
:no-undoc-members:
@@ -13,7 +13,8 @@ See the Executor APIs for optional arguments.
:caption: Alternative Executors:

mpi_executor
-balsam_executor
+legacy_balsam_executor
+balsam_2_executor

Executor Class
---------------
17 changes: 17 additions & 0 deletions docs/executor/legacy_balsam_executor.rst
@@ -0,0 +1,17 @@
Legacy Balsam MPI Executor
==========================

.. automodule:: legacy_balsam_executor
:no-undoc-members:

.. autoclass:: LegacyBalsamMPIExecutor
:show-inheritance:
:inherited-members:
:member-order: bysource
:members: __init__, submit, poll, manager_poll, kill, set_kill_mode

.. autoclass:: LegacyBalsamTask
:show-inheritance:
:member-order: bysource
:members: workdir_exists, file_exists_in_workdir, read_file_in_workdir, stdout_exists, read_stdout
:inherited-members:
4 changes: 2 additions & 2 deletions docs/executor/mpi_executor.rst
@@ -1,5 +1,5 @@
-MPI Executor
-============
+MPI Executor - MPI apps
+=======================

.. automodule:: mpi_executor
:no-undoc-members:
10 changes: 5 additions & 5 deletions docs/executor/overview.rst
@@ -23,9 +23,9 @@ to an application instance instead of a callable. They feature the ``cancel()``,
from the standard.

The main ``Executor`` class is an abstract class, inherited by the ``MPIExecutor``
-for direct running of MPI applications, and the ``BalsamMPIExecutor``
-for submitting MPI run requests from a worker running on a compute node to a
-Balsam service running on a launch node. This second approach is suitable for
+for direct running of MPI applications, and the ``BalsamExecutor``
+for submitting MPI run requests from a worker running on a compute node to the
+Balsam service. This second approach is suitable for
systems that don't allow submitting MPI applications from compute nodes.

Typically, users choose and parameterize their ``Executor`` objects in their
@@ -46,8 +46,8 @@ In calling script::
 USE_BALSAM = False

 if USE_BALSAM:
-    from libensemble.executors.balsam_executor import BalsamMPIExecutor
-    exctr = BalsamMPIExecutor()
+    from libensemble.executors.balsam_executor import LegacyBalsamMPIExecutor
+    exctr = LegacyBalsamMPIExecutor()
 else:
     from libensemble.executors.mpi_executor import MPIExecutor
     exctr = MPIExecutor()
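The executor choice in this calling script can also be expressed as a small lookup table. The following is only a sketch, not code from this pull request: `EXECUTOR_CHOICES` and `load_executor_class` are hypothetical names, and the module paths are copied from the `libensemble/executors/__init__.py` hunk of this same diff, which imports `LegacyBalsamMPIExecutor` from `legacy_balsam_executor` and `BalsamExecutor` from `balsam_executor`.

```python
import importlib

# Hypothetical lookup table; module paths and class names are taken from the
# libensemble/executors/__init__.py hunk in this diff.
EXECUTOR_CHOICES = {
    "mpi": ("libensemble.executors.mpi_executor", "MPIExecutor"),
    "balsam": ("libensemble.executors.balsam_executor", "BalsamExecutor"),
    "legacy_balsam": (
        "libensemble.executors.legacy_balsam_executor",
        "LegacyBalsamMPIExecutor",
    ),
}


def load_executor_class(kind, importer=importlib.import_module):
    """Import lazily and return the executor class registered under `kind`.

    `importer` is injectable so the lookup can be tested without libEnsemble
    or Balsam being installed.
    """
    try:
        module_path, class_name = EXECUTOR_CHOICES[kind]
    except KeyError:
        raise ValueError(f"unknown executor kind: {kind!r}") from None
    return getattr(importer(module_path), class_name)


# Usage (requires libEnsemble to be installed):
#     exctr = load_executor_class("mpi")()
```

Importing lazily means a missing Balsam installation only raises an error when a Balsam executor is actually requested.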
3 changes: 2 additions & 1 deletion docs/introduction_latex.rst
@@ -25,7 +25,8 @@ We now present further information on running and testing libEnsemble.
.. _across: https://libensemble.readthedocs.io/en/develop/platforms/platforms_index.html#funcx-remote-user-functions
.. _APOSMM: https://link.springer.com/article/10.1007/s12532-017-0131-4
.. _AWA: https://link.springer.com/article/10.1007/s12532-017-0131-4
-.. _Balsam: https://www.alcf.anl.gov/support-center/theta/balsam
+.. _Balsam: https://balsam.readthedocs.io/en/latest/
+.. _Balsam Executor: https://libensemble.readthedocs.io/en/develop/executor/balsam_2_executor.html
.. _Community Examples repository: https://github.com/Libensemble/libe-community-examples
.. _Conda: https://docs.conda.io/en/latest/
.. _conda-forge: https://conda-forge.org/
2 changes: 1 addition & 1 deletion docs/overview_usecases.rst
@@ -111,7 +111,7 @@ its capabilities.

* **Executor**: The executor can be used within user functions to provide a
simple, portable interface for running and managing user tasks (applications).
-There are multiple executors including the ``MPIExecutor`` and ``BalsamMPIExecutor``.
+There are multiple executors including the ``MPIExecutor`` and ``LegacyBalsamMPIExecutor``.
The base ``Executor`` class allows local sub-processing of serial tasks.

* **Submit**: Enqueue or indicate that one or more jobs or tasks needs to be
40 changes: 30 additions & 10 deletions docs/platforms/platforms_index.rst
@@ -84,26 +84,46 @@ Systems with Launch/MOM nodes
Some large systems have a 3-tier node setup. That is, they have a separate set of launch nodes
(known as MOM nodes on Cray Systems). User batch jobs or interactive sessions run on a launch node.
Most such systems supply a special MPI runner which has some application-level scheduling
-capability (eg. aprun, jsrun). MPI applications can only be submitted from these nodes. Examples
+capability (eg. ``aprun``, ``jsrun``). MPI applications can only be submitted from these nodes. Examples
of these systems include: Summit, Sierra and Theta.

There are two ways of running libEnsemble on these kind of systems. The first, and simplest,
-is to run libEnsemble on the launch nodes. This is often sufficient if the worker's sim or
-gen scripts are not doing too much work (other than launching applications). This approach
+is to run libEnsemble on the launch nodes. This is often sufficient if the worker's simulation
+or generation functions are not doing much work (other than launching applications). This approach
is inherently centralized. The entire node allocation is available for the worker-launched
tasks.

-To run libEnsemble on the compute nodes of these systems requires an alternative Executor,
-such as :doc:`Balsam<../executor/balsam_executor>`, which runs on the
-launch nodes and launches tasks submitted by workers. Running libEnsemble on the compute
-nodes is potentially more scalable and will better manage ``sim_f`` and ``gen_f`` functions
-that contain considerable computational work or I/O.
+However, running libEnsemble on the compute nodes is potentially more scalable and
+will better manage simulation and generation functions that contain considerable
+computational work or I/O. Therefore the second option is to use proxy task-execution
+services like Balsam_.

-.. image:: ../images/centralized_new_detailed_balsam.png
+Balsam - Externally managed applications
+----------------------------------------
+
+Running libEnsemble on the compute nodes while still submitting additional applications
+requires alternative Executors that connect to external services like Balsam_. Balsam
+can take tasks submitted by workers and execute them on the remaining compute nodes,
+or if using Balsam 2, *to entirely different systems*.
+
+.. figure:: ../images/centralized_new_detailed_balsam.png
:alt: central_balsam
:scale: 30
:align: center
+
+Single-System: libEnsemble + LegacyBalsamMPIExecutor
+
+.. figure:: ../images/balsam2.png
+:alt: balsam2
+:scale: 40
+:align: center
+
+(New) Multi-System: libEnsemble + BalsamExecutor
+
+As of v0.8.0+dev, libEnsemble supports both "legacy" Balsam via the
+:doc:`LegacyBalsamMPIExecutor<../executor/legacy_balsam_executor>`
+and Balsam 2 via the :doc:`BalsamExecutor<../executor/balsam_2_executor>`.

Submission scripts for running on launch/MOM nodes and for using Balsam, can be be found in
the :doc:`examples<example_scripts>`.

@@ -178,7 +198,7 @@ key. For example::
'sim_f': sim_f,
'in': ['x'],
'out': [('f', float)],
-'funcx_endpoint': 3af6dc24-3f27-4c49-8d11-e301ade15353,
+'funcx_endpoint': '3af6dc24-3f27-4c49-8d11-e301ade15353',
}

See the ``libensemble/tests/scaling_tests/funcx_forces`` directory for a complete
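The quoting fix in this hunk matters because the unquoted endpoint ID is not valid Python syntax, so the documented example could not be pasted into a script. As a rough illustration only, the endpoint can be checked as a UUID string before it goes into `sim_specs`. The helper `is_valid_endpoint` and the stand-in `sim_f` below are hypothetical and are not part of libEnsemble or funcX.

```python
import uuid


def is_valid_endpoint(value):
    """Return True if `value` is a string holding a canonical, hyphenated UUID."""
    if not isinstance(value, str):
        return False
    try:
        # uuid.UUID also accepts other spellings (no hyphens, braces), so
        # compare against the canonical form to accept only that layout.
        return str(uuid.UUID(value)) == value.lower()
    except ValueError:
        return False


def sim_f(H, persis_info, sim_specs, libE_info):
    """Hypothetical stand-in for a real simulation function."""
    return H, persis_info


sim_specs = {
    "sim_f": sim_f,
    "in": ["x"],
    "out": [("f", float)],
    "funcx_endpoint": "3af6dc24-3f27-4c49-8d11-e301ade15353",
}

if not is_valid_endpoint(sim_specs["funcx_endpoint"]):
    raise ValueError("funcx_endpoint must be a quoted UUID string")
```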
2 changes: 1 addition & 1 deletion docs/running_libE.rst
@@ -41,7 +41,7 @@ Limitations of MPI mode

If you are launching MPI applications from workers, then MPI is being nested. This is not
supported with Open MPI. This can be overcome by using a proxy launcher
-(see :doc:`Balsam<executor/balsam_executor>`). This nesting does work, however,
+(see :doc:`Balsam<executor/balsam_2_executor>`). This nesting does work, however,
with MPICH and its derivative MPI implementations.

It is also unsuitable to use this mode when running on the **launch** nodes of three-tier
4 changes: 2 additions & 2 deletions docs/tutorials/executor_forces_tutorial.rst
@@ -241,7 +241,7 @@ and evaluated in a variety of helpful ways. For now, we're satisfied with waitin
for the task to complete via ``task.wait()``.

We can assume that afterward, any results are now available to parse. Our application
-produces a ``forces[particles].stat`` file that contains either energy
+produces a ``forces.stat`` file that contains either energy
computations for every time-step or a "kill" message if particles were lost, which
indicates a failed simulation.

@@ -254,7 +254,7 @@ to ``WORKER_DONE``. Otherwise, send back ``NAN`` and a ``TASK_FAILED`` status:
:linenos:

# Stat file to check for bad runs
-statfile = "forces{}.stat".format(particles)
+statfile = "forces.stat"

# Try loading final energy reading, set the sim's status
try:
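The hunk above is cut off at `try:`, so the tutorial's actual parsing code is not shown here. The following is a sketch of how such a stat-file check might look, not the tutorial's implementation. It assumes one line per time step with the energy as the last whitespace-separated field, and treats any line containing "kill" as a failed run. The two status values are plain-string stand-ins for libEnsemble's constants of the same names, and `read_final_energy` is a hypothetical helper.

```python
import math
from pathlib import Path

# Plain-string stand-ins for libEnsemble's status constants of the same
# names; this sketch does not import libEnsemble.
WORKER_DONE = "WORKER_DONE"
TASK_FAILED = "TASK_FAILED"


def read_final_energy(statfile="forces.stat"):
    """Return (final_energy, status) for a finished run.

    Assumed file layout (not confirmed by the diff): one line per time step,
    energy as the last field; a line containing "kill" marks lost particles.
    """
    try:
        text = Path(statfile).read_text()
        lines = [ln for ln in text.splitlines() if ln.strip()]
        if not lines or any("kill" in ln.lower() for ln in lines):
            return math.nan, TASK_FAILED
        return float(lines[-1].split()[-1]), WORKER_DONE
    except (OSError, ValueError):
        # Missing file or unparsable final line: report a failed task
        return math.nan, TASK_FAILED
```

Returning `NAN` together with a failure status follows the behaviour the tutorial text describes for bad runs.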
2 changes: 1 addition & 1 deletion examples/calling_scripts/run_libe_forces.py
2 changes: 1 addition & 1 deletion examples/calling_scripts/run_libe_forces_from_yaml.py
2 changes: 1 addition & 1 deletion examples/tutorials/forces_with_executor/build_forces.sh
1 change: 0 additions & 1 deletion examples/tutorials/forces_with_executor/cleanup.sh

This file was deleted.

2 changes: 1 addition & 1 deletion examples/tutorials/forces_with_executor/forces.c
19 changes: 14 additions & 5 deletions libensemble/executors/__init__.py
@@ -1,9 +1,18 @@
from libensemble.executors.executor import Executor
from libensemble.executors.mpi_executor import MPIExecutor

-import os
-import sys
-if 'BALSAM_DB_PATH' in os.environ and int(sys.version[2]) >= 6:
-    from libensemble.executors.balsam_executor import BalsamMPIExecutor
+import pkg_resources

-__all__ = ['BalsamMPIExecutor', 'Executor', 'MPIExecutor']
+try:
+    if pkg_resources.get_distribution("balsam"):  # Balsam 0.7.0 onward (Balsam 2)
+        from libensemble.executors.balsam_executor import BalsamExecutor
+
+except (ModuleNotFoundError, ImportError, pkg_resources.DistributionNotFound):
+    try:
+        if pkg_resources.get_distribution("balsam-flow"):  # Balsam up through 0.5.0
+            from libensemble.executors.legacy_balsam_executor import LegacyBalsamMPIExecutor
+    except (ModuleNotFoundError, ImportError, pkg_resources.DistributionNotFound):
+        pass
+
+
+__all__ = ["LegacyBalsamMPIExecutor", "Executor", "MPIExecutor", "BalsamExecutor"]
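The detection logic in this hunk can be illustrated in isolation. The sketch below is not the code merged in this PR. It uses the standard library's `importlib.metadata` instead of `pkg_resources`, purely to keep the example dependency-free, and the helpers `detect_balsam` and `exported_names` are hypothetical. Unlike the hunk, it checks only whether a distribution is installed and does not fall back to the legacy executor when the Balsam 2 import itself fails. Since `__all__` in the hunk lists both Balsam names even when neither import succeeded, the second helper shows one way to list only the names that were actually imported.

```python
from importlib import metadata


def detect_balsam(version_lookup=metadata.version):
    """Return "balsam2", "legacy", or None depending on what is installed.

    Distribution names come from the hunk above: "balsam" for Balsam 2 and
    "balsam-flow" for legacy Balsam. `version_lookup` is injectable so the
    logic can be exercised without either package being present.
    """
    for dist_name, flavor in (("balsam", "balsam2"), ("balsam-flow", "legacy")):
        try:
            version_lookup(dist_name)
        except metadata.PackageNotFoundError:
            continue
        return flavor
    return None


def exported_names(flavor):
    """Build an __all__-style list holding only names that could be imported."""
    names = ["Executor", "MPIExecutor"]
    if flavor == "balsam2":
        names.append("BalsamExecutor")
    elif flavor == "legacy":
        names.append("LegacyBalsamMPIExecutor")
    return names
```

As in the hunk, Balsam 2 is checked first, so an environment with both distributions installed resolves to the new executor.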