Merge pull request #98 from Libensemble/develop
Develop
shuds13 committed Nov 9, 2018
2 parents 9b89810 + e530f0c commit 89d517b
Showing 64 changed files with 2,180 additions and 1,920 deletions.
8 changes: 4 additions & 4 deletions .travis.yml
@@ -1,14 +1,14 @@
language: python
sudo: required
dist: xenial
python:
- 2.7
- 3.4
- 3.5
- 3.6
#- 3.7
- 3.7

os: linux
dist: trusty
sudo: false

env:
global:
@@ -70,7 +70,7 @@ install:

# Run test
script:
- libensemble/tests/run-tests.sh
- libensemble/tests/run-tests.sh -z

# Coverage
after_success:
4 changes: 3 additions & 1 deletion README.rst
@@ -79,7 +79,9 @@ regularly on:

* `Travis CI <https://travis-ci.org/Libensemble/libensemble>`_

The test suite requires the pytest, pytest-cov and pytest-timeout packages to be installed and can be run from the libensemble/tests directory of the source distribution by running::
The test suite requires the mock, pytest, pytest-cov and pytest-timeout
packages to be installed and can be run from the libensemble/tests directory of
the source distribution by running::

./run-tests.sh

1 change: 1 addition & 0 deletions docs/conf.py
@@ -66,6 +66,7 @@ def __getattr__(cls, name):
##breathe_projects_source = {"libEnsemble" : ( "../code/src/", ["libE.py", "test.cpp"] )}
#breathe_projects_source = {"libEnsemble" : ( "../code/src/", ["test.cpp","test2.cpp"] )}

autodoc_mock_imports = ["balsam"]

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
3 changes: 0 additions & 3 deletions docs/dev_guide/worker_module.rst
@@ -3,6 +3,3 @@ Worker Module
.. automodule:: libE_worker
:members: worker_main, Worker

.. autoclass:: Worker
:member-order: bysource
:members: init_workers, __init__, run, clean
25 changes: 25 additions & 0 deletions docs/job_controller/balsam_controller.rst
@@ -0,0 +1,25 @@
Balsam Job Controller
=====================

To create a Balsam job controller, the calling script should contain::

jobctr = BalsamJobController()

The Balsam job controller inherits from the MPI job controller. See the
:doc:`MPIJobController<mpi_controller>` for the shared API. Any differences are
shown below.

.. automodule:: balsam_controller
:no-undoc-members:

.. autoclass:: BalsamJobController
:show-inheritance:
.. :inherited-members:
.. :member-order: bysource
.. :members: __init__, launch, poll, manager_poll, kill, set_kill_mode
.. autoclass:: BalsamJob
:show-inheritance:
:member-order: bysource
.. :members: workdir_exists, file_exists_in_workdir, read_file_in_workdir, stdout_exists, read_stdout
.. :inherited-members:
4 changes: 2 additions & 2 deletions docs/job_controller/jc_index.rst
@@ -4,9 +4,9 @@ Job Controller
The job controller can be used with simulation functions to provide a simple, portable interface for running and managing user jobs.

.. toctree::
:maxdepth: 1
:maxdepth: 2
:titlesonly:
:caption: libEnsemble Job Controller:

overview
register
job_controller
49 changes: 17 additions & 32 deletions docs/job_controller/job_controller.rst
@@ -3,46 +3,31 @@ Job Controller Module

.. automodule:: controller
:no-undoc-members:

See :doc:`example<overview>` for usage.

JobController Class
-------------------

The JobController should be constructed after registering applications to a Registry::

jobctl = JobController(registry = registry)

or if using Balsam::

jobctr = BalsamJobController(registry = registry)

.. autoclass:: JobController
:member-order: bysource
:members: __init__, launch, poll, manager_poll, kill, set_kill_mode

.. autoclass:: BalsamJobController
:show-inheritance:
:member-order: bysource
.. :members: __init__, launch, poll, manager_poll, kill, set_kill_mode
See the controller APIs for optional arguments.

.. toctree::
:maxdepth: 1
:caption: Job Controllers:

mpi_controller
balsam_controller

Job Class
---------

Jobs are created and returned though the job_controller launch function. Jobs can be passed as arguments
to the job_controller poll and kill functions. Job information can be queired through the job attributes below and the query functions. Note that the job attributes are only updated when they are polled (or though other
job controller functions).
Jobs are created and returned through the job_controller launch function. Jobs can be polled and
killed with the respective poll and kill functions. Job information can be queried through the job
attributes below and the query functions. Note that the job attributes are only updated when they are
polled/killed (or through other job or job controller functions).
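
A minimal usage sketch from a sim function, assuming a controller has been set up and a job
launched as in the :doc:`overview<overview>` example (the polling delay and the final checks are
illustrative only)::

    import time

    job = jobctl.launch(calc_type='sim', num_procs=4, app_args='input.txt')
    while not job.finished:
        time.sleep(2)    # Delay between polls
        job.poll()       # Updates job.finished, job.state and other attributes
    print(job.state)     # How the job ended
    if job.stdout_exists() and 'Error' in job.read_stdout():
        print('Job {} reported errors in stdout'.format(job.name))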

.. autoclass:: Job
:member-order: bysource
:members: workdir_exists, file_exists_in_workdir, read_file_in_workdir, stdout_exists, read_stdout, stderr_exists, read_stderr

.. autoclass:: BalsamJob
:show-inheritance:
:member-order: bysource
.. :members: workdir_exists, file_exists_in_workdir, read_file_in_workdir, stdout_exists, read_stdout
.. :inherited-members:
:members:
:exclude-members: calc_job_timing,check_poll
.. :member-order: bysource
.. :members: poll, kill, workdir_exists, file_exists_in_workdir, read_file_in_workdir, stdout_exists, read_stdout, stderr_exists, read_stderr
Job Attributes
@@ -65,7 +50,7 @@ Run configuration attributes - Some will be auto-generated:

:job.workdir: (string) Work directory for the job
:job.name: (string) Name of job - auto-generated
:job.app: (app obj) Use application/executable, registered using registry.register_calc
:job.app: (app obj) Use application/executable, registered using jobctl.register_calc
:job.app_args: (string) Application arguments as a string
:job.num_procs: (int) Total number of processors for job
:job.num_nodes: (int) Number of nodes for job
17 changes: 17 additions & 0 deletions docs/job_controller/mpi_controller.rst
@@ -0,0 +1,17 @@
MPI Job Controller
==================

To create an MPI job controller, the calling script should contain::

jobctl = MPIJobController()

See the controller API below for optional arguments.
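
For example, a central-mode setup might look like the following sketch (the keyword names
``auto_resources`` and ``central_mode`` are assumptions based on the release notes and the example
submission scripts; check the API below for the exact signature)::

    jobctl = MPIJobController(auto_resources=True, central_mode=True)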

.. automodule:: mpi_controller
:no-undoc-members:

.. autoclass:: MPIJobController
:show-inheritance:
:inherited-members:
.. :member-order: bysource
.. :members: __init__, register_calc, launch, manager_poll
26 changes: 12 additions & 14 deletions docs/job_controller/overview.rst
@@ -3,52 +3,50 @@ Job Controller Overview

The Job Controller module can be used by the worker or user-side code to issue and manage jobs using a portable interface. Various back-end mechanisms may be used to implement this interface on the system, either specified by the user at the top-level, or auto-detected. The job_controller manages jobs using the launch, poll and kill functions. Job attributes can then be queried to determine status. Functions are also provided to access and interrogate files in the job's working directory.

At the top-level calling script, a registry and job_controller are created and the executable gen or sim applications are registered to these (these are applications that will be runnable parallel jobs). If an alternative job_controller, such as Balsam, is to be used, then these can be created as in the example. Once in the user-side worker code (sim/gen func), the job_controller can be retrieved without any need to specify the type.
At the top-level calling script, a job_controller is created and the executable gen or sim applications are registered to it (these are applications that will be runnable jobs). If an alternative job_controller, such as Balsam, is to be used, it can be created as in the example. Once in the user-side worker code (sim/gen func), an MPI-based job_controller can be retrieved without any need to specify its specific type.

**Example usage (code runnable with or without a Balsam backend):**

In calling function::

from libensemble.register import Register, BalsamRegister
from libensemble.controller import JobController, BalsamJobController
sim_app = '/path/to/my/exe'
USE_BALSAM = False
if USE_BALSAM:
registry = BalsamRegister()
jobctrl = BalsamJobController(registry = registry)
from libensemble.balsam_controller import BalsamJobController
jobctrl = BalsamJobController()
else:
registry = Register()
jobctrl = JobController(registry = registry)
from libensemble.mpi_controller import MPIJobController
jobctrl = MPIJobController()
registry.register_calc(full_path=sim_app, calc_type='sim')
jobctrl.register_calc(full_path=sim_app, calc_type='sim')
In user sim func::

from libensemble.controller import JobController
jobctl = MPIJobController.controller # This will work for inherited controllers also (eg. Balsam)
import time
jobctl = JobController.controller #Will return controller (whether Balsam or standard).
job = jobctl.launch(calc_type='sim', num_procs=8, app_args='input.txt', stdout='out.txt')
jobctl = MPIJobController.controller # Will return controller (whether Balsam or standard MPI).
job = jobctl.launch(calc_type='sim', num_procs=8, app_args='input.txt', stdout='out.txt', stderr='err.txt')
while time.time() - start < timeout_sec:
time.sleep(delay)
# Has manager sent a finish signal
jobctl.manager_poll()
if jobctl.manager_signal == 'finish':
jobctl.kill(job)
job.kill()
# Poll job to see if completed
jobctl.poll(job)
job.poll()
if job.finished:
print(job.state)
break
# Check output file for error and kill job
if job.stdout_exists():
if 'Error' in job.read_stdout():
jobctl.kill(job)
job.kill()
break

See the :doc:`job_controller<job_controller>` interface for API.
19 changes: 0 additions & 19 deletions docs/job_controller/register.rst

This file was deleted.

17 changes: 17 additions & 0 deletions docs/release_notes.rst
@@ -2,6 +2,23 @@
Release Notes
=============

Release 0.4.0
-------------

:Date: November 7, 2018

* Separate job controller classes into different modules including a base class (API change)
* Add central_mode run option to distributed type (MPI) job_controllers (API addition) (#93)
* Make poll and kill job methods (API change)
* In job_controller, set_kill_mode is removed and replaced by a wait argument for a hard kill (API change)
* Removed register module - incorporated into job_controller (API change)
* APOSMM has improved asynchronicity when batch mode is false (with new example). (#96)
* Manager errors (instead of hangs) when alloc_f or gen_f don't return work when all workers are idle. (#95)

:Known issues:

* OpenMPI is not supported with direct MPI launches, as it does not support nested MPI launches.


Release 0.3.0
-------------
1 change: 1 addition & 0 deletions examples/alloc_funcs/fast_alloc.py
1 change: 1 addition & 0 deletions examples/alloc_funcs/fast_alloc_to_aposmm.py
1 change: 1 addition & 0 deletions examples/calling_scripts/test_fast_alloc.py
6 changes: 5 additions & 1 deletion examples/job_submission_scripts/bebop_submit_slurm.sh
@@ -15,7 +15,11 @@
export EXE=libE_calling_script.py
export NUM_WORKERS=4
export MANAGER_NODE=false #true = Manager has a dedicated node (use one extra node for SBATCH -N)
export I_MPI_FABRICS=shm:ofa

unset I_MPI_FABRICS
export I_MPI_FABRICS_LIST=tmi,tcp
export I_MPI_FALLBACK=1


#If using in calling script (After N mins manager kills workers and timing.dat created.)
export LIBE_WALLCLOCK=55
52 changes: 52 additions & 0 deletions examples/job_submission_scripts/bebop_submit_slurm_centralmode.sh
@@ -0,0 +1,52 @@
#!/bin/bash
#SBATCH -J libE_test_central
#SBATCH -N 5
#SBATCH -p knlall
##SBATCH -A <my_project>
#SBATCH -o tlib.%j.%N.out
#SBATCH -e tlib.%j.%N.error
#SBATCH -t 01:00:00

#Launch script for running in central mode.
#LibEnsemble will run on a dedicated node (or nodes).
#The remaining nodes in the allocation will be dedicated to the jobs launched by the workers.

#Requirements for running:
# Must use job_controller with auto_resources=True and central_mode=True.
# Note: Requires a scheduler that provides an environment variable giving a global nodelist in a supported format (e.g. SLURM/COBALT)
# Otherwise a worker_list file will be required.

#Currently requires even distribution - either multiple workers per node or nodes per worker


#User to edit these variables
export EXE=libE_calling_script.py
export NUM_WORKERS=4

export I_MPI_FABRICS=shm:tmi

#If using in calling script (After N mins manager kills workers and timing.dat created.)
export LIBE_WALLCLOCK=55

#---------------------------------------------------------------------------------------------
#Test
echo -e "Slurm job ID: $SLURM_JOBID"

#cd $PBS_O_WORKDIR
cd $SLURM_SUBMIT_DIR

# A little useful information for the log file...
echo -e "Master process running on: $HOSTNAME"
echo -e "Directory is: $PWD"

#This will work for the number of contexts that will fit on one node (e.g. 320 on Bebop) - increase libE nodes for more.
cmd="srun --overcommit --ntasks=$(($NUM_WORKERS+1)) --nodes=1 python $EXE $LIBE_WALLCLOCK"

echo The command is: $cmd
echo End SLURM script information.
echo -e "All further output is from the process being run and not the SLURM script.\n\n"

$cmd

# Print the date again -- when finished
echo Finished at: `date`
6 changes: 3 additions & 3 deletions examples/job_submission_scripts/theta_submit_balsam.sh
@@ -79,15 +79,15 @@ SCRIPT_BASENAME=${EXE%.*}
balsam app --name $SCRIPT_BASENAME.app --exec $EXE --desc "Run $SCRIPT_BASENAME"

# Running libE on one node - one manager and upto 63 workers
balsam job --name job_$SCRIPT_BASENAME --workflow $WORKFLOW_NAME --application $SCRIPT_BASENAME.app --args $SCRIPT_ARGS --wall-time-minutes $LIBE_WALLCLOCK --num-nodes 1 --ranks-per-node $((NUM_WORKERS+1)) --url-out="local:/$THIS_DIR" --stage-out-files="*.out *.dat" --url-in="local:/$THIS_DIR/*" --yes
balsam job --name job_$SCRIPT_BASENAME --workflow $WORKFLOW_NAME --application $SCRIPT_BASENAME.app --args $SCRIPT_ARGS --wall-time-minutes $LIBE_WALLCLOCK --num-nodes 1 --ranks-per-node $((NUM_WORKERS+1)) --url-out="local:/$THIS_DIR" --stage-out-files="*.out *.txt *.log" --url-in="local:/$THIS_DIR/*" --yes

# Hyper-thread libE (note this will not affect HT status of user calcs - only libE itself)
# Running 255 workers and one manager on one libE node.
# balsam job --name job_$SCRIPT_BASENAME --workflow $WORKFLOW_NAME --application $SCRIPT_BASENAME.app --args $SCRIPT_ARGS --wall-time-minutes $LIBE_WALLCLOCK --num-nodes 1 --ranks-per-node 256 --threads-per-core 4 --url-out="local:/$THIS_DIR/*" --stage-out-files="*.out *.dat" --url-in="local:/$THIS_DIR" --yes
# balsam job --name job_$SCRIPT_BASENAME --workflow $WORKFLOW_NAME --application $SCRIPT_BASENAME.app --args $SCRIPT_ARGS --wall-time-minutes $LIBE_WALLCLOCK --num-nodes 1 --ranks-per-node 256 --threads-per-core 4 --url-out="local:/$THIS_DIR" --stage-out-files="*.out *.txt *.log" --url-in="local:/$THIS_DIR/*" --yes

# Multiple nodes for libE
# Running 127 workers and one manager - launch script on 129 nodes (if one node per worker)
# balsam job --name job_$SCRIPT_BASENAME --workflow $WORKFLOW_NAME --application $SCRIPT_BASENAME.app --args $SCRIPT_ARGS --wall-time-minutes $LIBE_WALLCLOCK --num-nodes 2 --ranks-per-node 64 --url-out="local:/$THIS_DIR/*" --stage-out-files="*.out *.dat" --url-in="local:/$THIS_DIR" --yes
# balsam job --name job_$SCRIPT_BASENAME --workflow $WORKFLOW_NAME --application $SCRIPT_BASENAME.app --args $SCRIPT_ARGS --wall-time-minutes $LIBE_WALLCLOCK --num-nodes 2 --ranks-per-node 64 --url-out="local:/$THIS_DIR" --stage-out-files="*.out *.txt *.log" --url-in="local:/$THIS_DIR/*" --yes

#Run job
balsam launcher --consume-all --job-mode=mpi --num-transition-threads=1
4 changes: 2 additions & 2 deletions libensemble/__init__.py
@@ -4,6 +4,6 @@
Library for managing ensemble-like collections of computations.
"""

__version__ = "0.3.0"
__author__ = 'Jeffrey Larson and Stephen Hudson'
__version__ = "0.4.0"
__author__ = 'Jeffrey Larson, Stephen Hudson and David Bindel'
__credits__ = 'Argonne National Laboratory'
