New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PP-772: improve accuracy of wall time in mom #437

Merged
merged 1 commit into from Nov 17, 2017

Conversation

Projects
None yet
4 participants
@minghui-liu
Contributor

minghui-liu commented Nov 3, 2017

Issue-ID

Problem description

Mom hooks and other code in the mom main loop are counted into job's wall time, making wall time inaccurate.

Cause / Analysis

  1. Job's ji_stime is set too early.
  2. Using ji_stime to keep track of wall time is not an ideal method. It is hard to subtract time spent
    on hooks since hooks run on a child process and the child process has to communicate with its parent to adjust wall time.

Solution description

  1. Set job's ji_stime after execjob_begin hook.
  2. Reworked the way wall time is calculated by implementing an accumulation method.
  3. Moved variables and functions related to wall time into a separate file. All the mom source code files are TOO LONG and we need to break them up into manageable pieces.

No new PTL tests are added since these changes do not change any interface.

Checklist:

***For further information please visit the Developer Guide Home.***

@agrawalravi90

This comment has been minimized.

Show comment
Hide comment
@agrawalravi90

agrawalravi90 Nov 3, 2017

Contributor

A very minor comment about the title of this PR Minghui, this ticket seems to be about fixing the incorrect accumulation of walltime that happens inside the mom's main loop, but the title confused me into thinking that the ticket is about accumulating more to the walltime. So, maybe change the title to something like "Incorrect walltime accumulation in mom" ?
I have't looked at the changes, just wanted to comment on the title.

Contributor

agrawalravi90 commented Nov 3, 2017

A very minor comment about the title of this PR Minghui, this ticket seems to be about fixing the incorrect accumulation of walltime that happens inside the mom's main loop, but the title confused me into thinking that the ticket is about accumulating more to the walltime. So, maybe change the title to something like "Incorrect walltime accumulation in mom" ?
I have't looked at the changes, just wanted to comment on the title.

@minghui-liu minghui-liu changed the title from PP-772: implement walltime accumulation to PP-772: improve accuracy of wall time in mom Nov 3, 2017

@minghui-liu

This comment has been minimized.

Show comment
Hide comment
@minghui-liu

minghui-liu Nov 3, 2017

Contributor

@agrawalravi90 Thanks. I changed the title.

Contributor

minghui-liu commented Nov 3, 2017

@agrawalravi90 Thanks. I changed the title.

@minghui-liu minghui-liu changed the title from PP-772: improve accuracy of wall time in mom to [WIP] PP-772: improve accuracy of wall time in mom Nov 3, 2017

@minghui-liu minghui-liu changed the title from [WIP] PP-772: improve accuracy of wall time in mom to PP-772: improve accuracy of wall time in mom Nov 7, 2017

@bayucan

Since you've made changes to the walltime item in mom_set_use() under resmom/linux/mom_mach.c, resmom_win/win/mom_mach.c, need to do the same with the other architecture-related files, e.g. resmom/solaris7/mom_mach.c...

Show outdated Hide outdated src/resmom/mom_main.c Outdated
Show outdated Hide outdated src/resmom/mom_walltime.c Outdated
@@ -4576,19 +4556,16 @@ post_restart(job *pjob, int ev)
}
}
new_stime = time_now - (ckpt_time - pjob->ji_qs.ji_stime);

This comment has been minimized.

@bayucan

bayucan Nov 8, 2017

Contributor

This snippet of code has been taken out and it takes into account the checkpoint time (ckpt_time)...

@bayucan

bayucan Nov 8, 2017

Contributor

This snippet of code has been taken out and it takes into account the checkpoint time (ckpt_time)...

This comment has been minimized.

@minghui-liu

minghui-liu Nov 9, 2017

Contributor

ckpt_time is used to calculate the length of time that the job was held. The whole logic is not necessary anymore so I removed all of them.

@minghui-liu

minghui-liu Nov 9, 2017

Contributor

ckpt_time is used to calculate the length of time that the job was held. The whole logic is not necessary anymore so I removed all of them.

@@ -1559,14 +1559,7 @@ post_suspend(job *pjob, int err)
return;
if ((pjob->ji_flags & MOM_SISTER_ERR) == 0) {
/* save stop time for adjusting walltime */
pjob->ji_momstat = time_now;

This comment has been minimized.

@bayucan

bayucan Nov 8, 2017

Contributor

So it's looking like job's ji_momstat is no longer used. Perhaps this attribute can be removed from the job structure.

@bayucan

bayucan Nov 8, 2017

Contributor

So it's looking like job's ji_momstat is no longer used. Perhaps this attribute can be removed from the job structure.

Show outdated Hide outdated src/resmom/mom_main.c Outdated
@@ -10529,36 +10511,12 @@ active_idle(job *pjob, int which)
pjob->ji_qs.ji_svrflags |= JOB_SVFLG_Actsuspd;
send_wk_job_idle(pjob->ji_qs.ji_jobid, which);
if ((pjob->ji_qs.ji_svrflags &
(JOB_SVFLG_Suspend|JOB_SVFLG_Suspend)) == 0) {
/* save stop time for adjusting walltime */
pjob->ji_momstat = time_now;

This comment has been minimized.

@bayucan

bayucan Nov 8, 2017

Contributor

This pjob->ji_momstat is looking like not relevant anymore. Perhaps remove the field from the job structure?

@bayucan

bayucan Nov 8, 2017

Contributor

This pjob->ji_momstat is looking like not relevant anymore. Perhaps remove the field from the job structure?

This comment has been minimized.

@minghui-liu

minghui-liu Nov 9, 2017

Contributor

Ok. I wasn't sure if it's used for other purposes. I will check and remove it.

@minghui-liu

minghui-liu Nov 9, 2017

Contributor

Ok. I wasn't sure if it's used for other purposes. I will check and remove it.

This comment has been minimized.

@minghui-liu

minghui-liu Nov 9, 2017

Contributor

Removed ji_momstat

@minghui-liu

minghui-liu Nov 9, 2017

Contributor

Removed ji_momstat

Show outdated Hide outdated src/resmom/mom_walltime.c Outdated
@@ -4576,19 +4556,16 @@ post_restart(job *pjob, int ev)
}
}
new_stime = time_now - (ckpt_time - pjob->ji_qs.ji_stime);

This comment has been minimized.

@bayucan

bayucan Nov 8, 2017

Contributor

This snippet of code seemed to have disappeared. It appears that it's accounting for the checkpoint time (ckpt_time)...

@bayucan

bayucan Nov 8, 2017

Contributor

This snippet of code seemed to have disappeared. It appears that it's accounting for the checkpoint time (ckpt_time)...

@bhroam

This comment has been minimized.

Show comment
Hide comment
@bhroam

bhroam Nov 9, 2017

Contributor

Travis failed because the code failed to compile. An instance of 'id' was still in the code. It needs to be replaced with __func__

Contributor

bhroam commented Nov 9, 2017

Travis failed because the code failed to compile. An instance of 'id' was still in the code. It needs to be replaced with __func__

@bhroam

I didn't see anything big. Just a few nits here and there.

Show outdated Hide outdated src/include/job.h Outdated
@@ -999,6 +1001,7 @@ task_find (job *pjob,
#define JOB_SUBSTATE_RERUN2 62 /* job is rerun, delete files stage */
#define JOB_SUBSTATE_RERUN3 63 /* job is rerun, mom delete job */
#define JOB_SUBSTATE_EXPIRED 69 /* subjob (of an array) is gone */

This comment has been minimized.

@bhroam

bhroam Nov 9, 2017

Contributor

Is this blank line needed?

@bhroam

bhroam Nov 9, 2017

Contributor

Is this blank line needed?

This comment has been minimized.

@minghui-liu

minghui-liu Nov 9, 2017

Contributor

It's there to visually separate the BEGIN and FINISHED substate group.

@minghui-liu

minghui-liu Nov 9, 2017

Contributor

It's there to visually separate the BEGIN and FINISHED substate group.

Show outdated Hide outdated src/resmom/catch_child.c Outdated
Show outdated Hide outdated src/resmom/catch_child.c Outdated
Show outdated Hide outdated src/resmom/catch_child.c Outdated
Show outdated Hide outdated src/resmom/mom_main.c Outdated
Show outdated Hide outdated src/resmom/mom_main.c Outdated
Show outdated Hide outdated src/resmom/mom_walltime.c Outdated
@bhroam

bhroam approved these changes Nov 9, 2017

Looks good to me. I sign off.

@bayucan

The latest changes looked good so far, though, I still haven't seen the following addressed:

void
+update_walltime(job *pjob) {

  • if (pjob->ji_walltime_stamp == 0)
  •    return;
    
  • attribute *resources_used = &pjob->ji_wattr[(int)JOB_ATR_resc_used];
  • resource_def *walltime_def = find_resc_def(svr_resc_def, "walltime", svr_resc_size);
    [Al] check for NULL-ness of 'walltime_def'

And also look into the other architecture-dependent files under src/resmom like src/resmom/solaris7. Apply the same update to linux/mom_mach.c:mom_set_use() when 'walltime' is also polled/changed.

@bayucan

Everything looks good, I've reviewed all the commits. I sign off.

@bhroam

This comment has been minimized.

Show comment
Hide comment
@bhroam

bhroam Nov 13, 2017

Contributor

Please squash/rebase for the merge.

Contributor

bhroam commented Nov 13, 2017

Please squash/rebase for the merge.

@minghui-liu

This comment has been minimized.

Show comment
Hide comment
@minghui-liu

minghui-liu Nov 13, 2017

Contributor

@bhroam Done.

Contributor

minghui-liu commented Nov 13, 2017

@bhroam Done.

@bhroam

This comment has been minimized.

Show comment
Hide comment
@bhroam

bhroam Nov 13, 2017

Contributor

@minghui-liu please attach evidence of testing to the Jira ticket.

Contributor

bhroam commented Nov 13, 2017

@minghui-liu please attach evidence of testing to the Jira ticket.

@minghui-liu

This comment has been minimized.

Show comment
Hide comment
@minghui-liu

minghui-liu Nov 15, 2017

Contributor

@bhroam @bayucan Hi. I added a new PTL test. Would you please take another look. Thank you.

Contributor

minghui-liu commented Nov 15, 2017

@bhroam @bayucan Hi. I added a new PTL test. Would you please take another look. Thank you.

@bhroam

This comment has been minimized.

Show comment
Hide comment
@bhroam

bhroam Nov 15, 2017

Contributor

The test looks good, @minghui-liu

Contributor

bhroam commented Nov 15, 2017

The test looks good, @minghui-liu

@minghui-liu

This comment has been minimized.

Show comment
Hide comment
@minghui-liu

minghui-liu Nov 16, 2017

Contributor

@bhroam Thank you. I have squashed and rebased.

Contributor

minghui-liu commented Nov 16, 2017

@bhroam Thank you. I have squashed and rebased.

@bhroam

This comment has been minimized.

Show comment
Hide comment
@bhroam

bhroam Nov 17, 2017

Contributor

Something got merged in ahead of you. Could you rebase again?

Contributor

bhroam commented Nov 17, 2017

Something got merged in ahead of you. Could you rebase again?

@minghui-liu

This comment has been minimized.

Show comment
Hide comment
@minghui-liu

minghui-liu Nov 17, 2017

Contributor

@bhroam Rebased again.

Contributor

minghui-liu commented Nov 17, 2017

@bhroam Rebased again.

@bhroam bhroam merged commit a6d027b into PBSPro:master Nov 17, 2017

4 checks passed

codacy/pr Good work! A positive pull request.
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
license/cla Contributor License Agreement is signed.
Details
@bhroam

This comment has been minimized.

Show comment
Hide comment
@bhroam

bhroam Nov 17, 2017

Contributor

I merged in the PR.

Contributor

bhroam commented Nov 17, 2017

I merged in the PR.

@minghui-liu minghui-liu deleted the minghui-liu:PP-772 branch Nov 17, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment