python: improve integration with user job batch scripts #560

Open
adammoody opened this issue Oct 23, 2023 · 0 comments
adammoody commented Oct 23, 2023

Our scr_run.py script currently launches the user job with the launcher process via subprocess.Popen. There are a few challenges with this:

  1. Currently, we buffer all stdout and stderr and print them only at the end. Users want to monitor their output while the job is running, so we should at least print it more frequently.
  2. In some cases, users may also need stdin forwarded to the job?
  3. Running with profilers/debuggers may be complicated, since those need to wrap the launcher, like:
    totalview srun -a ...

It would be good to look into solutions for the above.
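
For item 1, one possible direction is to read the launcher's output line by line and echo it right away instead of collecting it until the end. The following is only a minimal sketch under assumptions: `run_streaming` is a hypothetical helper and not something that exists in scr_run.py, and it merges stderr into stdout to stay short. Because stdin is not redirected, the child inherits it from the parent, which may also be enough for item 2 in simple cases.

```python
import subprocess
import sys


def run_streaming(cmd, out=sys.stdout):
    """Run cmd and echo each line of its output as soon as it arrives.

    Hypothetical helper, not part of scr_run.py. Returns the exit code
    of the child process.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same stream
        text=True,
        bufsize=1,  # line-buffered on our side of the pipe
    )
    # stdin is left untouched, so the child inherits the parent's stdin.
    for line in proc.stdout:
        out.write(line)
        out.flush()  # show the line now instead of at the end of the job
    return proc.wait()
```

Note that the child may still buffer its own output when it detects that it is writing to a pipe, so this alone does not guarantee line-by-line updates for every application.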

As a fallback, and perhaps as the recommended approach, we should also ensure that people can continue to use their existing job scripts and just add a few additional commands to integrate with SCR. At the least, I think we want to allow users to invoke:

  • scr_prerun - to prepare the allocation for SCR
  • scr_list_down_nodes - to rely on SCR to test for node health and return a list of down or healthy nodes. Leave it to the user to then incorporate that list into a relaunch command. Documentation here can help, e.g., pointing users to srun -x <downnodes> as a way to avoid certain nodes with srun.
  • scr_should_exit - to determine whether to stop the run. This will check that there are enough healthy nodes and enough time remaining, and verify that an SCR halt condition has not been set.
  • scr_postrun - to check for and scavenge any cached datasets

For users with bash job scripts, we want these commands to return 0/1 exit codes. Output like the node list should be printed to stdout in a format that is easy for the user to integrate, e.g., we could format the down node list differently for srun vs jsrun.

For users with python job scripts, we get bonus points if they can import and use SCR modules. For the first pass, let's just stick with requiring the user's python job script to invoke these as commands like the bash job scripts do.
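
As a rough sketch of what that first pass could look like from a python job script, the example below calls the four commands through subprocess. Several things here are assumptions, not settled behavior: the commands are invoked with no arguments, scr_list_down_nodes is assumed to print a list on stdout that srun -x accepts directly, and an exit code of 0 from scr_should_exit is assumed to mean "stop the run". The helper names (`command_ok`, `command_stdout`, `build_srun_command`, `run_with_scr`) are hypothetical.

```python
import subprocess


def command_ok(cmd):
    """Return True if cmd exits with status 0."""
    return subprocess.run(cmd).returncode == 0


def command_stdout(cmd):
    """Return the stdout of cmd with surrounding whitespace stripped."""
    result = subprocess.run(cmd, stdout=subprocess.PIPE, text=True)
    return result.stdout.strip()


def build_srun_command(down_nodes, app_cmd):
    """Prefix app_cmd with srun, excluding down_nodes if any were reported."""
    cmd = ["srun"]
    if down_nodes:
        # ASSUMPTION: down_nodes is already in a form that srun -x accepts.
        cmd += ["-x", down_nodes]
    return cmd + list(app_cmd)


def run_with_scr(app_cmd, max_attempts=3):
    """Launch app_cmd up to max_attempts times with SCR commands around it."""
    subprocess.run(["scr_prerun"])  # prepare the allocation for SCR
    for _ in range(max_attempts):
        down_nodes = command_stdout(["scr_list_down_nodes"])
        subprocess.run(build_srun_command(down_nodes, app_cmd))
        # ASSUMPTION: exit status 0 means "yes, stop the run".
        if command_ok(["scr_should_exit"]):
            break
    subprocess.run(["scr_postrun"])  # scavenge any cached datasets
```

A job script would then call something like `run_with_scr(["./my_app"])`. A bash job script would follow the same order of commands, branching on the exit codes directly.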

@adammoody adammoody self-assigned this Oct 25, 2023
@adammoody adammoody added this to To do in v3.1 Nov 2, 2023