Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expand runtime control: global max_seconds and STOP file #3028

Merged
merged 19 commits into from
Mar 20, 2021

Conversation

ye-luo
Copy link
Contributor

@ye-luo ye-luo commented Mar 19, 2021

Proposed changes

This PR enhances runtime controls.

  1. <project id="project-prefix" max_seconds="500"> to control time limit in batched drivers. default 360000 (100h) as legacy drivers which however still respect max_seconds in the per driver input section.
  2. project-prefix.STOP can be used to stop a run sanely. With an ensemble run, all jobs ends. close Stop the code on the fly #498
  3. Fix this case. Once a driver section hits the time limit, all the subsequent driver sections won't run.

check frequency is per block.

explicitly tested ensemble job case. existing cpu_limit test results.

$ ctest -R cpu -j32 -E performance
Test project /home/yeluo/opt/qmcpack/build_clang_real_debug
    Start 893: cpu_limit-diamondC_1x1x1_pp-dmc-1-1
    Start 779: cpu_limit-bccH_1x1x1_ae-rmc-1-1
    Start 892: cpu_limit-diamondC_1x1x1_pp-vmc-4-4
    Start 891: cpu_limit-diamondC_1x1x1_pp-vmc-1-1
    Start   6: deterministic-unit_test_cpu
1/6 Test   #6: deterministic-unit_test_cpu ...........   Passed    0.38 sec
2/6 Test #891: cpu_limit-diamondC_1x1x1_pp-vmc-1-1 ...   Passed   49.85 sec
3/6 Test #892: cpu_limit-diamondC_1x1x1_pp-vmc-4-4 ...   Passed   50.08 sec
    Start 894: cpu_limit-diamondC_1x1x1_pp-dmc-4-4
4/6 Test #779: cpu_limit-bccH_1x1x1_ae-rmc-1-1 .......   Passed   50.33 sec
5/6 Test #893: cpu_limit-diamondC_1x1x1_pp-dmc-1-1 ...   Passed   90.87 sec
6/6 Test #894: cpu_limit-diamondC_1x1x1_pp-dmc-4-4 ...   Passed   47.34 sec

What type(s) of changes does this code introduce?

  • New feature

Does this introduce a breaking change?

  • No

What systems has this change been tested on?

epyc-server

Checklist

Update the following with a yes where the items apply. If you're unsure about any of them, don't hesitate to ask. This is
simply a reminder of what we are going to look for before merging your code.

  • Yes. This PR is up to date with current the current state of 'develop'
  • Yes. Code added or changed in the PR has been clang-formatted

@jtkrogel
Copy link
Contributor

If <project> is going to expand into our "control" section, please can we use parameters rather than attributes?

I would really like to avoid cases like this if at all possible in the future:

<sposet_builder type="bspline" href="../nscf/pwscf_output/pwscf.pwscf.h5" tilematrix="1 1 -1 1 -1 1 -1 1 1" twistnum="0" source="ion0" version="0.10" meshfactor="1.0" precision="float" truncate="no">

@ye-luo
Copy link
Contributor Author

ye-luo commented Mar 19, 2021

If <project> is going to expand into our "control" section, please can we use parameters rather than attributes?

I would really like to avoid cases like this if at all possible in the future:

<sposet_builder type="bspline" href="../nscf/pwscf_output/pwscf.pwscf.h5" tilematrix="1 1 -1 1 -1 1 -1 1 1" twistnum="0" source="ion0" version="0.10" meshfactor="1.0" precision="float" truncate="no">

Agreed. I was expecting this request. Documentation is still missing in this PR. Putting out this PR for feedback.

@jtkrogel
Copy link
Contributor

Overall these look like good changes to me.

@prckent
Copy link
Contributor

prckent commented Mar 19, 2021

Good advance here but I would like some changes:

I am against max_seconds="500" as an attribute the project section. It adds parsing complication as Jaron noted.
Ideally it should be at the same level as (say) the random number initialization for consistency. Maybe it has to be a project attribute, which then raises the question about whether the random initialization (seed) is really in the "correct" and consistent place.

Generally I am against the use of attributes when we could use parameters. This is much easier for users (including this one) to remember, read, and get right. Use of XML for input is unfortunate, but it is what we have currently.

We already have maxcpusecs that works at the driver level, and should recycle the same tag at a higher level.

It would also make sense to support the same parameter as a command line argument. Then when the user or robot writes a job submission file it can be written there without affecting the QMC input.

.STOP will be very useful!

I agree with the per block check, since part of our current block definition is that this is when we do I/O.

@prckent
Copy link
Contributor

prckent commented Mar 19, 2021

@ye-luo Interested to know your current use case, or perhaps it was requested.

@ye-luo
Copy link
Contributor Author

ye-luo commented Mar 19, 2021

I changed "max_seconds" from attribute to parameter but it remains under project.
"maxcpusecs" is an unfortunate choice. The name is better to be straightforward. I have an idea, since I have the tool now. I can mark "maxcpusecs" deprecated and add "max_seconds" in classic drivers.

Regarding command line arguments, in general I would like to move more command line arguments under project to improve reproducibility from a single input file. For example, we may consider moving verbosity.

I agree adding max_seconds in command line can be useful.

I want to see the timer so the execution must stop clean. If I'm looking at the timer distribution and see which are heavier, just need to stop the code on the fly without changing input.

@ye-luo
Copy link
Contributor Author

ye-luo commented Mar 19, 2021

@prckent let me know if you really want to stick to "maxcpusecs", otherwise, I will mark it deprecated.

@prckent
Copy link
Contributor

prckent commented Mar 19, 2021

max_seconds is good imo.

OK to mark the old tag as deprecated and abort - there can't be many people using it and it is an easy change to make. An edit might needed in NEXUS.

@ye-luo
Copy link
Contributor Author

ye-luo commented Mar 19, 2021

"deprecated" will continue to run. Once 1 release is out, we change it to "deleted" then users get an abort.

@prckent prckent changed the title Expand runtime control Expand runtime control: global max_seconds and STOP file Mar 19, 2021
@ye-luo
Copy link
Contributor Author

ye-luo commented Mar 19, 2021

maxcpusecs has been superseded by max_seconds in classical drivers.
found no trace of maxcpusecs in nexus folder

@prckent prckent self-requested a review March 20, 2021 22:15
Copy link
Contributor

@prckent prckent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified by hand.

Added another mention in the documentation and increased the safety factor.

@ye-luo ye-luo merged commit d3382fb into QMCPACK:develop Mar 20, 2021
@ye-luo ye-luo deleted the expand-RunTimeManage branch March 20, 2021 23:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Stop the code on the fly
4 participants