Feature/balsam7 #729
…r passing AppDef instances instead
… from a Balsam site
"""Return time since balsam task entered RUNNING state"""

# If wait_on_start then can could calculate runtime same a base executor
# but otherwise that will return time from task submission. Get from Balsam.
Remove/modify this comment?
We do have a generic For Actually yeah, I need to test how this works when everything is being run on one machine like Theta. Also, the README isn't unique to this forces version, so I'll incorporate the changes into the others
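The two timings contrasted in the code comment under review (time since submission versus time since the task entered the RUNNING state) can be illustrated with a small stand-alone sketch. This is not the libEnsemble or Balsam implementation; the `TaskTimer` class and its method names are hypothetical, and the injectable `clock` argument exists only to make the behaviour easy to check.

```python
import time


class TaskTimer:
    """Hypothetical helper: tracks submission time and RUNNING-state time separately."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self.submit_time = clock()  # recorded when the task is submitted
        self.start_time = None      # unknown until the task reports RUNNING

    def mark_running(self):
        """Record the first moment the task is seen in the RUNNING state."""
        if self.start_time is None:
            self.start_time = self._clock()

    def time_since_submit(self):
        """Elapsed time since submission (includes any time spent queued)."""
        return self._clock() - self.submit_time

    def time_since_running(self):
        """Elapsed time since RUNNING; 0.0 if the task has not started yet."""
        if self.start_time is None:
            return 0.0
        return self._clock() - self.start_time
```

If a submit call blocks until the task has started, the two values nearly coincide; otherwise the difference between them is the queue wait.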
libensemble/tests/scaling_tests/balsam_forces/run_libe_forces_balsam.py
@@ -0,0 +1,61 @@
## Running test run_libe_forces_balsam.py

We might want to think about how we structure the variants of the forces example, and maybe have one place that describes the test, and its purpose, in a generic way.
This application will need to be compiled on the remote machine where the sim_f will run.
See below.

### Running with libEnsemble.
We want to say up front that you currently need ALCF login credentials to use this service, regardless of what remote machine you are using.
Then we could say: here is an example of running libEnsemble on your local computer/laptop and having `forces.x` run on the remote machine, making clear which instructions are for the laptop and which are for the remote machine. E.g., any changes to the calling script/sim function are on the laptop, but `forces.x` must be compiled on the remote machine.
You may want to tell the user to set up their own conda (or other virtual) environment on the remote. I first tried the pip install of balsam without that and got a warning that it could not access /soft/datascience/conda/miniconda3/latest/lib/python3.7/site-packages/.
Also, they will need balsam installed on their local machine.
You may be asked to login and authenticate with the Balsam service. Do so with
your ALCF credentials.

Configure the `RemoteForces` class in the `run_libe_forces_balsam.py` calling
On local machine.
The `RemoteForces` class is now in `submit_libe_forces_balsam.py`, though the site name is given in the calling script. Maybe this could be a bit clearer.
also path to forces.x
your ALCF credentials.

Configure the `RemoteForces` class in the `run_libe_forces_balsam.py` calling
script to match the Balsam site name and the path to your `forces.x` executable.
The site name I had to put in was not "my-site" as above, which names the directory, but the name given by `balsam site ls`, which is also the default it gave me when going through `balsam site init ./my-site`.
I still think this is confusing. I guess you can call the directory "my-site" anything you want. I don't know why that does not come up as the default site name (instead, I get the name of the login node). I assume there is a one-to-one mapping between this directory and a site.
…sions, links to each. Add new diagram in autodocs for new balsam executor
…n recent tutorial refactoring, remotes balsam_forces.yaml
…o only try to import legacy balsam executor if unsuccessful on importing new executor
… to transferring balsam_forces.yaml, assumes transferring statfiles throughout
Refactor/reorganize forces
….py, undo forces.c change adding nparticles to forces.stat filename
…atfile exists and isn't empty
Addresses #631
Opening this to share. Still a work-in-progress.
This is a complete rewrite of the Balsam MPI Executor (currently a separate file) to incorporate the changes and new capabilities of Balsam2. The biggest new capability is workers can submit tasks from any machine, to any machine. The Balsam service, via Balsam sites, takes care of reserving resources and launching the tasks.
Regarding the interface, thankfully, the `executor.submit()` method is pretty close to what we're used to with our traditional `MPIExecutor`, so app-launching user functions shouldn't change much. Within calling scripts, however, there are two major changes:
- Passing `ApplicationDefinition` instances to the executor instead of just paths to executables. This is because Balsam apps are now inherently and inseparably tied to their class name, their site, and other class attributes. `ApplicationDefinition`s are literal-parsed, serialized, and sent to the Balsam service on class instantiation. An example in a calling script may resemble:
- … `submit_allocation()` function in a calling script, unless each worker wants to submit to a separate allocation or machine for some reason. An example may resemble:

I don't think libE itself necessarily needs to be launched with Balsam anymore, but I need to test this. Take a look at the `scaling_tests/balsam_forces` example.

TODO:
Longer term TODO (I think these should go in another PR, probably for 0.9.1):
- `ApplicationDefinitions` can either have a `command_template` field to launch an executable like we're familiar with, or a `run` field to launch a Python function.
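As a rough illustration of the `command_template` versus `run` distinction, here is a stand-alone sketch that does not import Balsam. `SketchAppDef`, its `launch()` method, the `$name` placeholder syntax, and the parameter names are all hypothetical stand-ins chosen so the example runs with the standard library alone; they are not the Balsam `ApplicationDefinition` API. Only the `site`, `command_template`, and `run` names come from the discussion above.

```python
from string import Template
from typing import Optional


class SketchAppDef:
    """Hypothetical stand-in for an application definition (NOT the Balsam class)."""

    site: str = ""                          # name of the site the app is tied to
    command_template: Optional[str] = None  # set this to launch an executable

    def run(self, **params):
        """Override this instead of command_template to launch a Python function."""
        raise NotImplementedError

    def launch(self, **params):
        """Hypothetical dispatcher: render the command line, or call run()."""
        if self.command_template is not None:
            return Template(self.command_template).substitute(**params)
        return self.run(**params)


class RemoteForces(SketchAppDef):
    # Executable-style app: the class carries its site and command line.
    site = "my-site"
    command_template = "/path/to/forces.x $particles $timesteps $seed"


class Square(SketchAppDef):
    # Function-style app: the work is a Python function rather than an executable.
    site = "my-site"

    def run(self, x):
        return x * x


command = RemoteForces().launch(particles=100, timesteps=10, seed=1)
result = Square().launch(x=4)
```

Because the site and command live on the class itself, passing an instance (rather than a bare executable path) hands over everything needed to identify the app, which mirrors the reasoning given in the description above.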