Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add check for stuck jobs in poll() #5451

Open
wants to merge 12 commits into
base: master
Choose a base branch
from
Open

Conversation

FTang21
Copy link
Contributor

@FTang21 FTang21 commented Dec 9, 2023

Fixes #5352

Description of the Change
Add a way to check if jobs were stuck by adding an addition poll that occurs every hour, it checks if an active_tasks fraction_done doesn't change and current_cpu_time < 10s.

Alternate Designs

Release Notes

@AenBleidd
Copy link
Member

This might be an issue for really big tasks (e.g. ClimatePrediction.net)

@FTang21
Copy link
Contributor Author

FTang21 commented Dec 9, 2023

Would this be more of an issue of an hour is too short of time frame, or more with the implementation?

Copy link

codecov bot commented Dec 9, 2023

Codecov Report

Merging #5451 (eb07ea0) into master (c02d6e0) will not change coverage.
Report is 16 commits behind head on master.
The diff coverage is n/a.

Additional details and impacted files
@@            Coverage Diff            @@
##             master    #5451   +/-   ##
=========================================
  Coverage     10.84%   10.84%           
  Complexity     1068     1068           
=========================================
  Files           279      279           
  Lines         36156    36156           
  Branches       8355     8355           
=========================================
  Hits           3920     3920           
  Misses        31842    31842           
  Partials        394      394           

see 1 file with indirect coverage changes

@AenBleidd
Copy link
Member

@FTang21, ah, no, my bad: in the original proposal there was an additional verification method for long running jobs: CPU time, and I completely missed that this was the part of the implementation in this PR.

@davidpanderson, could you please review this and verify that this is a desired implementation of the original proposal?

@davidpanderson
Copy link
Contributor

There are various problems with this.
I updated the issue to clarify what needs to be done:
#5352

@FTang21
Copy link
Contributor Author

FTang21 commented Dec 10, 2023

@davidpanderson I updated the implementation based on the updated issue. Lmk if this is ok. Should I add the abort on its own after some time or this is fine for now? Would this also be preferable as it function?

@davidpanderson
Copy link
Contributor

For now, let's just show a message using
msg_printf(atp->project, MSG_USER_ALERT...)

... telling the user which job is stuck,
and that they should consider aborting it.

This will be useful for testing because we can see the stuck job
and decide if it's really stuck.

@FTang21
Copy link
Contributor Author

FTang21 commented Dec 10, 2023

Gotcha, I updated it to MSG_USER_ALERT

client/app_control.cpp Outdated Show resolved Hide resolved
@davidpanderson
Copy link
Contributor

Almost but not quite. Please review my pseudo-code.

@FTang21
Copy link
Contributor Author

FTang21 commented Dec 11, 2023

Ah I see what I missed, it should match the order provided.

(atp->current_cpu_time - atp->stuck_check_cpu_time) < 10) {
// if fraction done does not change and cpu time is <10, message the user
msg_printf(atp->result->project, MSG_USER_ALERT,
"[task] has not made progress in last hour, consider aborting task %s",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the better working should be

Task has not made any progress within the last hour, consider aborting it: %s

@davidpanderson, could you please verify the spelling?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried about the direction this might take the BOINC client (eventually), or might lead the user.

Not every wrapper-using project successfully manages to implement a realistic measure of progress made. This direction of thought changes the progress report from 'nice to have' towards 'essential'. It needs to be explained carefully.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A job is regarded as stuck if, in the last hour of running,
a) it accumulates less than 10 sec of CPU time
AND
b) its fraction done is unchanged.

b) by itself doesn't make it stuck.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: In progress
Development

Successfully merging this pull request may close these issues.

stuck jobs
4 participants