stuck jobs #5352

Open
davidpanderson opened this issue Sep 7, 2023 · 4 comments · May be fixed by #5451

@davidpanderson
Contributor

davidpanderson commented Sep 7, 2023

Reportedly, some VM jobs (and possibly others) get into a "stuck" state where they
don't make progress: no change in fraction done, and little CPU usage.
These jobs will eventually be aborted when their elapsed time reaches the rsc_fpops_bound limit,
but this could take weeks or months, depending on the limit.

Proposal: have the client try to figure out when a job is stuck.

ACTIVE_TASK new fields:
   double stuck_check_elapsed_time
   double stuck_check_fraction_done
   double stuck_check_cpu_time
   (initialize all to zero)

STUCK_CHECK_POLL_PERIOD = 3600

every STUCK_CHECK_POLL_PERIOD seconds
   for each active task atp
      if non_cpu_intensive: continue
      if sporadic: continue
      if atp->stuck_check_elapsed_time == 0
         atp->stuck_check_elapsed_time = atp->elapsed_time
         atp->stuck_check_fraction_done = atp->fraction_done
         atp->stuck_check_cpu_time = atp->current_cpu_time
         continue
      if atp->elapsed_time < atp->stuck_check_elapsed_time + STUCK_CHECK_POLL_PERIOD
         continue
      if atp->stuck_check_fraction_done == atp->fraction_done
         && (atp->current_cpu_time - atp->stuck_check_cpu_time < 10)
         (job is stuck - print warning)
      atp->stuck_check_elapsed_time = atp->elapsed_time
      atp->stuck_check_fraction_done = atp->fraction_done
      atp->stuck_check_cpu_time = atp->current_cpu_time

i.e. a job is considered stuck if, in its last hour of running, the fraction done
hasn't changed and the incremental CPU time is < 10s.

At that point, the client could

  1. notify the user, suggesting that they abort the job
  2. abort the job

Let's do 1) for starters, to make sure that the logic is right,
then at some point do 2).
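
The per-task check above could be sketched in C++ roughly as follows. This is only an illustration under assumptions, not the client's actual code: `ActiveTask` is a hypothetical stand-in for ACTIVE_TASK that models just the fields named in the proposal, and `STUCK_CHECK_CPU_TIME`, `record_snapshot()` and `stuck_check()` are names invented for this example.

```cpp
// Hypothetical stand-in for the client's ACTIVE_TASK; only the fields
// used by the proposed check are modeled here.
struct ActiveTask {
    bool non_cpu_intensive = false;
    bool sporadic = false;
    double elapsed_time = 0;       // seconds of run time so far
    double fraction_done = 0;
    double current_cpu_time = 0;   // seconds of CPU time so far
    // New fields from the proposal, all initialized to zero.
    double stuck_check_elapsed_time = 0;
    double stuck_check_fraction_done = 0;
    double stuck_check_cpu_time = 0;
};

const double STUCK_CHECK_POLL_PERIOD = 3600;
// Assumed name for the 10-second CPU threshold in the proposal.
const double STUCK_CHECK_CPU_TIME = 10;

// Remember the task's current state for comparison at the next check.
static void record_snapshot(ActiveTask& t) {
    t.stuck_check_elapsed_time = t.elapsed_time;
    t.stuck_check_fraction_done = t.fraction_done;
    t.stuck_check_cpu_time = t.current_cpu_time;
}

// Returns true if the task looks stuck: over at least one poll period of
// run time, fraction done is unchanged and CPU time grew by < 10 seconds.
bool stuck_check(ActiveTask& t) {
    if (t.non_cpu_intensive || t.sporadic) return false;
    if (t.stuck_check_elapsed_time == 0) {
        // First check for this task: take a baseline only.
        record_snapshot(t);
        return false;
    }
    if (t.elapsed_time < t.stuck_check_elapsed_time + STUCK_CHECK_POLL_PERIOD) {
        // Not enough run time since the last snapshot (e.g. task was suspended).
        return false;
    }
    bool stuck =
        t.fraction_done == t.stuck_check_fraction_done
        && (t.current_cpu_time - t.stuck_check_cpu_time < STUCK_CHECK_CPU_TIME);
    record_snapshot(t);
    return stuck;
}
```

Returning a flag instead of printing directly leaves the caller free to start with option 1 (warn the user) and switch to option 2 (abort) later without touching the detection logic.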

@FTang21
Contributor

FTang21 commented Nov 10, 2023

Hello,

I'm Franke Tang, a graduate student currently taking a Distributed Computing course, and part of my final project encourages us to contribute to open issues on GitHub relating to distributed systems. I would like to work on this issue if this has not been implemented yet.

@AenBleidd
Member

Welcome, @FTang21, sure, go ahead

@FTang21
Contributor

FTang21 commented Nov 27, 2023

Hello, sorry for the late follow-up; I was working on PRs in other repos. I was looking through the code. Would app.cpp be a good place to start on this issue?

@davidpanderson
Contributor Author

The new logic would go in ACTIVE_TASK_SET::poll()
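
One way to run an hourly check from a function that is polled much more often is a small time gate, sketched below. This assumes the poll function is called frequently and has access to the current time in seconds; `PeriodicCheck` and its members are hypothetical names for this example and are not part of the client.

```cpp
#include <functional>

// Hypothetical helper: runs a check at most once per `period` seconds,
// however often poll() itself is called.
class PeriodicCheck {
public:
    explicit PeriodicCheck(double period) : period_(period) {}

    // `now` is the current time in seconds.
    // Returns true if the check was run on this call.
    bool poll(double now, const std::function<void()>& check) {
        if (now < next_time_) return false;  // too soon, do nothing
        next_time_ = now + period_;
        check();
        return true;
    }

private:
    double period_;
    double next_time_ = 0;  // the first call to poll() runs the check
};
```

A poll function could keep one such object constructed with STUCK_CHECK_POLL_PERIOD and pass it a callback that loops over the active tasks.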

Projects
Status: In progress

3 participants