@jlnav jlnav commented Feb 8, 2022

Addresses #631

Opening this to share. Still a work-in-progress.

This is a complete rewrite of the Balsam MPI Executor (currently a separate file) to incorporate the changes and new capabilities of Balsam2. The biggest new capability is that workers can submit tasks from any machine, to any machine. The Balsam service, via Balsam sites, takes care of reserving resources and launching the tasks.

Regarding the interface, thankfully, the executor.submit() method is pretty close to what we're used to with our traditional MPIExecutor, so app-launching user functions shouldn't change much. Within calling scripts, however, there are two major changes:

  1. Users must register Balsam ApplicationDefinition instances to the executor instead of just paths to executables. This is because Balsam apps are now inherently and inseparably tied to their class name, their site, and other class attributes. ApplicationDefinitions are literal-parsed, serialized, and sent to the Balsam service on class instantiation. An example in a calling script may resemble:
from balsam.api import ApplicationDefinition

class RemoteForces(ApplicationDefinition):
    site = 'thetaGPU'
    command_template = '/path/to/forces.x {{sim_particles}} {{sim_timesteps}} {{seed}} {{kill_rate}}'

exctr = NewBalsamMPIExecutor()
exctr.register_app(RemoteForces, app_name='forces')
  2. Users must submit batch resource requests corresponding to sites where apps should run. This can be done with the new submit_allocation() function in a calling script, unless each worker wants to submit to a separate allocation or machine for some reason. An example may resemble:
exctr.submit_allocation(site_id=999, num_nodes=256, wall_time_min=300,
                        queue='default', project='my-project-code')
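
For readers following along, here is a minimal, self-contained sketch of the register-then-launch flow described in the two steps above. It does not import Balsam or libEnsemble: `AppSketch`, `RegistrySketch`, and `render_command` are hypothetical stand-ins, and the sketch assumes each `{{name}}` placeholder in `command_template` is filled by name from the parameters passed at submit time. The real substitution is handled by Balsam and may differ.

```python
import re


class AppSketch:
    """Hypothetical stand-in for an ApplicationDefinition subclass (not Balsam's class)."""

    site = "thetaGPU"
    command_template = (
        "/path/to/forces.x {{sim_particles}} {{sim_timesteps}} {{seed}} {{kill_rate}}"
    )


def render_command(template, **params):
    """Replace each {{name}} placeholder with the matching keyword argument."""

    def fill(match):
        name = match.group(1)
        if name not in params:
            raise KeyError(f"missing value for placeholder {name!r}")
        return str(params[name])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", fill, template)


class RegistrySketch:
    """Toy registry keyed by app_name, mimicking register_app() followed by a launch."""

    def __init__(self):
        self.apps = {}

    def register_app(self, app_cls, app_name):
        # Store the class itself, since the app is tied to its class attributes.
        self.apps[app_name] = app_cls

    def build_command(self, app_name, **params):
        # Look up the registered class and fill in its command template.
        return render_command(self.apps[app_name].command_template, **params)


registry = RegistrySketch()
registry.register_app(AppSketch, app_name="forces")
cmd = registry.build_command(
    "forces", sim_particles=100, sim_timesteps=10, seed=42, kill_rate=0.5
)
print(cmd)  # /path/to/forces.x 100 10 42 0.5
```

The point of the sketch is only that the executable path now lives on the registered class, so a user function supplies parameter values rather than a path.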

I don't think libE itself necessarily needs to be launched with Balsam anymore, but I need to test this. Take a look at the scaling_tests/balsam_forces example.

TODO:

  • General refactoring and cleanup
  • Documentation
  • Address feedback
  • Test how well task.poll() is actually working
  • Update balsam_forces README
  • Executor.revoke_allocation()

Longer term TODO (I think these should go in another PR, probably for 0.9.1):

  • Possibly also register user functions with Balsam. Balsam ApplicationDefinitions can either have a command_template field to launch an executable like we're familiar with, or a run field to launch a Python function.
  • Update templater for new Balsam tests

"""Return time since balsam task entered RUNNING state"""

# If wait_on_start, we could calculate runtime the same way as the base executor,
# but otherwise that would return the time since task submission. Get it from Balsam.

Remove/modify this comment?
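
For context, the distinction the quoted code comment draws (time since the task entered RUNNING versus time since submission) can be sketched as follows. `TaskTimerSketch` and its method names are hypothetical and are not the libEnsemble or Balsam task classes; the injectable clock exists only so the example is deterministic.

```python
import time


class TaskTimerSketch:
    """Hypothetical task record that tracks submission and start times separately."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self.submit_time = clock()  # recorded when the task is submitted
        self.start_time = None      # recorded the first time the task is seen RUNNING
        self.state = "CREATED"

    def update_state(self, new_state):
        # Only the first transition into RUNNING sets the start time.
        if new_state == "RUNNING" and self.start_time is None:
            self.start_time = self._clock()
        self.state = new_state

    def runtime(self):
        """Seconds since the task entered RUNNING, or 0.0 if it has not started."""
        if self.start_time is None:
            return 0.0
        return self._clock() - self.start_time

    def time_since_submit(self):
        """Seconds since submission, which also includes time spent queued."""
        return self._clock() - self.submit_time


# Fake clock: submitted at t=0, starts running at t=30, polled twice at t=45.
ticks = iter([0.0, 30.0, 45.0, 45.0])
task = TaskTimerSketch(clock=lambda: next(ticks))
task.update_state("RUNNING")
run_secs = task.runtime()
total_secs = task.time_since_submit()
print(run_secs, total_secs)  # 15.0 45.0
```

With a 30-second queue wait, the two measures differ by exactly that wait, which is why the submission-based value overstates runtime.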


jlnav commented Feb 8, 2022

We do have a generic Executor that is inherited by the MPIExecutor, which the BalsamExecutor inherits from.

For mpi_resources, it's definitely not a problem if Balsam is being used to launch to other machines. But for single-machine runs with Balsam it may still prove useful. I think the only parameter it could help with is max_tasks_per_node, but we'll need to discuss this more in a meeting.

Actually yeah, I need to test how this works when everything is being run on one machine like Theta.

Also, the README isn't unique to this forces version, so I'll incorporate the changes into the others.

@@ -0,0 +1,61 @@
## Running test run_libe_forces_balsam.py


We might want to think about how we structure the variants of the forces example, and maybe have one place that describes the test, and its purpose, in a generic way.

This application will need to be compiled on the remote machine where the sim_f will run.
See below.

### Running with libEnsemble.

We want to say up front that you currently need ALCF login credentials to use this service, regardless of what remote machine you are using.

Then we could say here is an example of running libEnsemble on your local computer/laptop and having forces.x run on the remote machine, making clear which instructions are for the laptop and which are for the remote machine. E.g., any changes to the calling script/sim function are on the laptop, but forces.x must be compiled on the remote machine.


@shuds13 shuds13 Feb 10, 2022


You may want to tell the user to set up their own conda (or other virtual) environment on the remote machine. I first tried the pip install of balsam without that and got a warning that it could not access /soft/datascience/conda/miniconda3/latest/lib/python3.7/site-packages/.

Also, they will need balsam installed on their local machine.

You may be asked to login and authenticate with the Balsam service. Do so with
your ALCF credentials.

Configure the `RemoteForces` class in the `run_libe_forces_balsam.py` calling

On local machine.


The RemoteForces class is now in submit_libe_forces_balsam.py, though the site name is given in the calling script. This could be a bit clearer.


Also the path to forces.x.

Configure the `RemoteForces` class in the `run_libe_forces_balsam.py` calling
script to match the Balsam site name and the path to your `forces.x` executable.

The site name I had to put in was not "my-site" as above, which names the directory, but the name given by balsam site ls. That is also the default it offered me when going through balsam site init ./my-site.


I still think this is confusing. I guess you can call the directory "my-site" anything you want. I don't know why that does not come up as the default site name (instead, I get the name of the login node). I assume there is a one-to-one mapping between this directory and a site.

@jlnav jlnav merged commit 87274d6 into develop Apr 22, 2022
@jlnav jlnav deleted the feature/balsam7 branch April 22, 2022 15:25

3 participants