This repository has been archived by the owner on Nov 13, 2019. It is now read-only.

Add arbitrary "starting state" when submitting jobs till actual status is returned from the batch server #260

Closed
ericfranz opened this issue Mar 2, 2018 · 4 comments


ericfranz commented Mar 2, 2018

The problem:

[efranz@ada7 ood_core]$ bsub -m curie < hello.sh
Verifying job submission parameters...
Verifying project account...
     Account to charge:   082810573256
         Balance (SUs):      4999.8694
         SUs to charge:         5.3333
Job <7274791> is submitted to default queue <curie_devel>.
[efranz@ada7 ood_core]$ bjobs -m curie 7274791
Job <7274791> is not found on host/group <curie>
[efranz@ada7 ood_core]$ bjobs -m curie 7274791
Job <7274791> is not found on host/group <curie>
[efranz@ada7 ood_core]$ bjobs -m curie 7274791
JOBID      STAT  USER             QUEUE      JOB_NAME             NEXEC_HOST SLOTS RUN_TIME        TIME_LEFT
7274791    RUN   efranz           curie_deve helloWorld           1          16    0 second(s)     0:20 L
[efranz@ada7 ood_core]$
  • How do you reliably tell the difference between a job that has not yet appeared in the queue and a job that has failed or completed and has therefore left the queue? The question applies to how we submit, and report the status of, all of our jobs.

A solution:

diagram
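
The diagram is not reproduced in this text, so the following is only one reading of the idea in the issue title: report an arbitrary starting state until the batch server returns a real status. It is a minimal sketch, not ood_core code. The helper name `effective_status`, the `GRACE_PERIOD` value, and the status symbols are all assumptions made for illustration.

```ruby
# Hypothetical helper, not part of ood_core: decide what status to report
# for a job that the batch server may not know about yet.
GRACE_PERIOD = 60 # seconds; an assumed value that would need tuning per site

# reported:     status symbol from the scheduler, or nil if the job was not found
# submitted_at: Time at which the job was submitted
# now:          current Time (injectable so the function is easy to test)
def effective_status(reported, submitted_at, now: Time.now)
  # The scheduler knows the job, so trust what it says.
  return reported unless reported.nil?

  # Not found, but submitted very recently: report the starting state.
  return :queued if now - submitted_at < GRACE_PERIOD

  # Not found after the grace period: assume it has left the queue.
  :completed
end
```

With this shape, the two "not found" responses in the transcript above would be reported as `:queued` rather than `:completed`, as long as they arrive within the grace period. The cost is that a job that fails immediately is reported as queued until the grace period ends.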

@ericfranz

  1. This solution might be something we want to add as an additional wrapper or object to ood_core.
  2. A fix applied to this app should be applied to the dashboard when submitting interactive apps as well.

ericfranz added the bug label Mar 2, 2018
@ericfranz

Discussed internally. We consider two approaches:

  1. The LSF adapter's "id" could be a string with metadata attached to it, including the submission date. The adapter could then enforce the above state machine by expanding the else block in this code. The problem is that any app that displays the "id" would need to "pretty print" it. Do we add a ppid method to the base adapter?
  2. Following the goals of the first option, we change "id" from a string to a value object that implements to_s, to_str, etc. The value object can store the extra state the adapter needs. We would, however, need to design how this information is serialized when the id is stored in the database.
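
A rough sketch of the second approach might look like the following. The class name `JobId`, its fields, and the JSON format are assumptions chosen for illustration; nothing here is taken from ood_core, and the real serialization design is exactly the open question raised above.

```ruby
require "json"
require "time"

# Hypothetical value object; the class name and fields are assumptions.
class JobId
  attr_reader :id, :submitted_at

  def initialize(id, submitted_at)
    @id = id.to_s
    @submitted_at = submitted_at
  end

  # Apps that display or interpolate the id see only the scheduler's id.
  def to_s
    id
  end

  # to_str lets Ruby treat the object as a String in implicit conversions.
  alias to_str to_s

  # One possible serialization for storing the id in a database column.
  def dump
    JSON.generate("id" => id, "submitted_at" => submitted_at.utc.iso8601)
  end

  # Rebuild the value object from the string produced by #dump.
  def self.load(string)
    data = JSON.parse(string)
    new(data["id"], Time.iso8601(data["submitted_at"]))
  end
end
```

Because `to_s` and `to_str` both return the bare id, existing code that prints or concatenates the id would not need a separate pretty-print method, which is the drawback noted for the first approach.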

@ericfranz

Another option: when submitting a job, after calling bsub, the adapter itself calls bjobs to verify that the job is in the system. If it is not, the adapter waits and then checks again. The only purpose of this check is to delay the return of the submit method until the job is visible.

Of course, if this took too long it could time out the request... we should find out what kind of delay to expect.
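
The wait-and-recheck idea, with a cap so the request cannot hang indefinitely, could be sketched as below. The helper name `wait_until_visible` and the default `timeout` and `interval` values are assumptions. The `lookup` callable stands in for whatever wraps bjobs, so the sketch makes no claims about the real command invocation.

```ruby
# Hypothetical helper: poll until the scheduler reports the job or a deadline passes.
# lookup:   callable returning true once the job is visible (e.g. a bjobs wrapper)
# timeout:  maximum seconds to wait (assumed default)
# interval: seconds between checks (assumed default)
# sleeper:  injectable so callers and tests can avoid real sleeping
def wait_until_visible(lookup, timeout: 10, interval: 1, sleeper: ->(s) { sleep(s) })
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    # Job is visible: submit can return normally.
    return true if lookup.call

    # Deadline reached without seeing the job: give up and let the caller decide.
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline

    sleeper.call(interval)
  end
end
```

A caller would pass something like `-> { job_visible?(id) }`, where `job_visible?` is a hypothetical wrapper around the status query. A `false` return means the job never appeared within the timeout, which the submit method would still have to handle somehow.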

@ericfranz

Going to do OSC/ood_core#81 instead. If OSC/ood_core#81 doesn't fix the problem, reopen this issue.
