
Conversation

@jlnav (Member) commented Nov 8, 2021

Addresses #713

Initial attempts to draft an optional funcX interface so workers can easily launch non-persistent user functions on remote resources:

https://funcx.org/

Some notes:

  • Security is taken care of by funcX. Users must authenticate with Globus when initializing an endpoint.
  • This user-prompted authentication may make CI testing impossible for now. See: Allow globus login from command line utility (globus/globus-compute#619)
  • funcX endpoints are limited to 20 function calls every 10 seconds unless batching is used (see the SDK sketch after these notes).
  • Endpoints can only be instantiated on Linux. If using clusters, endpoints can (and should) be configured by the user to launch their functions to compute nodes. Many examples, including for Theta, are available on the funcX docs.
  • This (currently) isn't an alternative to MPI or local comms, or any Executor. This currently only allows user functions to run on a different resource than libEnsemble's processes.
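
For context, the basic client-side funcX interaction (roughly what a worker performs under this interface) looks like the following. This is a minimal sketch based on the funcX tutorial of the time rather than this PR's code; the endpoint uuid is a placeholder, and the batch-method names should be verified against your installed SDK version.

from funcx import FuncXClient  # pip install funcx

fxc = FuncXClient()  # prompts for Globus authentication on first use

def double(x):
    return 2 * x

# Register the function with the funcX web service, then invoke it on an endpoint
func_uuid = fxc.register_function(double)
task_id = fxc.run(21, endpoint_id='<endpoint-uuid>', function_id=func_uuid)

# get_result raises while the task is still pending, so poll until it completes
result = fxc.get_result(task_id)

# To stay under the 20-calls-per-10-seconds limit, group calls into a batch
batch = fxc.create_batch()
for x in range(5):
    batch.add(x, endpoint_id='<endpoint-uuid>', function_id=func_uuid)
task_ids = fxc.batch_run(batch)  # poll each returned task id with fxc.get_result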

To try this out locally:

  1. pip install funcx funcx-endpoint
  2. funcx-endpoint configure my-endpoint
  3. (You may be asked to authenticate with Globus. Do so using the funcx-endpoint generated URL.)
  4. funcx-endpoint start my-endpoint. A message will confirm startup and print the endpoint's uuid.
  5. Set sim_specs['funcx_endpoint'] to this uuid.
  6. Run like normal: python test_funcx.py --comms local --nworkers 4 (a sketch of such a calling script follows these steps)
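
A minimal calling-script sketch for step 6, assuming a hypothetical user function remote_sim (imported from a hypothetical my_sims module and importable where the endpoint runs) and a placeholder endpoint uuid; only the sim_specs['funcx_endpoint'] field is specific to this PR:

import numpy as np
from libensemble.libE import libE
from libensemble.gen_funcs.sampling import uniform_random_sample
from libensemble.tools import parse_args, add_unique_random_streams

from my_sims import remote_sim  # hypothetical sim_f; must be importable on the endpoint side

nworkers, is_manager, libE_specs, _ = parse_args()

sim_specs = {
    'sim_f': remote_sim,
    'in': ['x'],
    'out': [('f', float)],
    'funcx_endpoint': '<endpoint-uuid>',  # uuid printed by funcx-endpoint start
}

gen_specs = {
    'gen_f': uniform_random_sample,
    'out': [('x', float, (1,))],
    'user': {'gen_batch_size': 20, 'lb': np.array([-3.0]), 'ub': np.array([3.0])},
}

persis_info = add_unique_random_streams({}, nworkers + 1)

H, persis_info, flag = libE(sim_specs, gen_specs, {'sim_max': 40},
                            persis_info, libE_specs=libE_specs)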

Documentation and/or other improvements coming soon?

TODO:

@coveralls (Collaborator) commented Nov 8, 2021

Pull Request Test Coverage Report for Build 1491745447

  • 17 of 35 (48.57%) changed or added relevant lines in 3 files are covered.
  • 18 unchanged lines in 3 files lost coverage.
  • Overall coverage increased (+40.2%) to 95.163%

Changes missing coverage:

  File                    Covered lines   Changed/Added lines   %
  libensemble/worker.py   12              30                    40.0%

Files with coverage reduction:

  File                                      New missed lines   %
  libensemble/comms/comms.py                1                  95.25%
  libensemble/resources/mpi_resources.py    1                  88.44%
  libensemble/executors/mpi_runner.py       16                 83.72%

Totals:

  Change from base Build 1473635638: 40.2%
  Covered Lines: 6316
  Relevant Lines: 6575

💛 - Coveralls

@jlnav requested review from jmlarson1 and shuds13 November 9, 2021 21:40
@jmlarson1 marked this pull request as ready for review November 11, 2021 17:22
@jmlarson1 (Member) left a comment

I'd remove the changes in the alloc_fs and make it so the manager doesn't error when receiving a None persis_info

@jmlarson1 (Member)

Thoughts on loss of coverage in openmpi runners?

@jlnav (Member Author) commented Nov 15, 2021

> Thoughts on loss of coverage in openmpi runners?

> - [ ] still need to fix it by allowing the executor_hworld tests to run with local comms when using openmpi

This was never the problem in the first place, since the simple routines were only running with local comms anyway! And all the runners are covered in the simple coverage jobs... https://coveralls.io/jobs/90108856/source_files/14509258687 . So still trying to figure out what's up.

@jlnav marked this pull request as draft November 17, 2021 23:35
@jlnav (Member Author) commented Nov 17, 2021

I've completely rewritten the funcX test to be a scaling test. Try this out on Theta:

pip install funcx-endpoint
funcx-endpoint configure forces

Configure ~/.funcx/forces/config.py to have the following:

from parsl.addresses import address_by_hostname
from parsl.launchers import AprunLauncher, SimpleLauncher
from parsl.providers import CobaltProvider

from funcx_endpoint.endpoint.utils.config import Config
from funcx_endpoint.executors import HighThroughputExecutor

# fmt: off

# PLEASE UPDATE user_opts BEFORE USE
user_opts = {
    'theta': {
        'worker_init': 'source ~/startup.sh',
        'scheduler_options': '',
        # Specify the account/allocation to which jobs should be charged
        'account': 'CSC250STMS07'
    }
}

config = Config(
    executors=[
        HighThroughputExecutor(
            max_workers_per_node=1,
            worker_debug=False,
            address=address_by_hostname(),
            provider=CobaltProvider(
                queue='debug-flat-quad',
                account=user_opts['theta']['account'],
                #launcher=AprunLauncher(),
                launcher=SimpleLauncher(),

                # string to prepend to #COBALT blocks in the submit
                # script to the scheduler eg: '#COBALT -t 50'
                scheduler_options=user_opts['theta']['scheduler_options'],

                # Command to be run before starting a worker, such as:
                # 'module load Anaconda; source activate funcx_env'.
                worker_init=user_opts['theta']['worker_init'],

                # Scale between 0-1 blocks with 1 node per block
                nodes_per_block=1,
                init_blocks=0,
                min_blocks=0,
                max_blocks=1,

                # Hold each block for up to 10 minutes
                walltime='00:10:00'
            ),
        )
    ],
)

Then finally on Theta:

funcx-endpoint start forces

On your local machine (or any machine with internet access where you want to run libEnsemble), set sim_specs['funcx_endpoint'] to the printed uuid; see funcx_forces.yaml. Adjust the other paths in that file as necessary.

Then on your machine where you'll run libEnsemble, run the calling script as usual. As long as the local exit_criteria have not been met, funcX will dynamically submit allocations via Cobalt to the queue specified in the above config, then run the sim_f.

Using launcher=SimpleLauncher() in the above config means the simulator function will be run on the MOM nodes as usual. Replace it with launcher=AprunLauncher() to experiment with submitting the sim_f directly to the allocated compute nodes instead!

@jlnav changed the title [WIP] Experimental/funcx → Experimental/funcx Nov 22, 2021
@jlnav marked this pull request as ready for review November 22, 2021 20:20
@jlnav requested a review from jmlarson1 November 22, 2021 20:20
@jlnav linked an issue Nov 22, 2021 that may be closed by this pull request

-        if 'persis_info' in D_recv and len(D_recv['persis_info']):
+        if D_recv.get('persis_info') is not None:
             persis_info[w].update(D_recv['persis_info'])
Member:

Remind me - why not just if D_recv.get('persis_info') instead of 436/437

Member Author:

That's probably valid. The original work was to address how persis_info is handled by the manager if the user didn't specify one but the alloc tried packing it up anyway.
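
For reference, the two checks differ only when 'persis_info' is present but empty; this is plain Python truthiness:

D_recv = {'persis_info': {}}

D_recv.get('persis_info') is not None   # True: an empty dict still enters the branch
bool(D_recv.get('persis_info'))         # False: an empty dict is skipped, like the old len() check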

@shuds13 (Member) commented Dec 3, 2021

Wondering if we can avoid some of the duplication of forces stuff. Also looks like there is a binary there (forces.x).

@jlnav (Member Author) commented Dec 6, 2021

> Wondering if we can avoid some of the duplication of forces stuff. Also looks like there is a binary there (forces.x).

Addressed.

@jlnav merged commit 0e3e2cb into develop Dec 9, 2021
@jlnav deleted the experimental/funcx branch December 9, 2021 18:34
@shuds13 mentioned this pull request Mar 15, 2022

Development

Successfully merging this pull request may close these issues:

  • Experiment with funcX interoperability