LSF Adapter: Add optional delay before returning id after successfully submitting a job #81

Open
ericfranz opened this issue Mar 20, 2018 · 3 comments

Comments

@ericfranz
Contributor

commented Mar 20, 2018

In Adapters::Lsf#submit (this could probably all be done in the batch object itself, since it has easy access to the config), if the config has the option verify_submit_timeout, which specifies a timeout in seconds:

verify_submit_timeout: 10

then after bsub-ing the job we call bjobs to verify that the submitted job has appeared in the queue. If it has not yet appeared, we sleep for several seconds and call bjobs again. We repeat until the job appears in the queue or the time since bsub returned the id exceeds the number of seconds specified in verify_submit_timeout. If verify_submit_timeout is 0, -1, or not a number, we don't check the queue for the job; we just return the id.

This will address the edge case where, after a job is submitted on a multi-cluster node, there is a delay before bjobs displays the job in the queue.
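A minimal sketch of the polling loop described above, in Ruby. The helper name wait_for_job, the interval default, and the injectable clock and sleeper are all hypothetical and not part of the existing adapter; the block passed in stands in for whatever bjobs call checks that the job is visible.

```ruby
# Hypothetical helper (not an existing ood_core method): polls until the
# submitted job is visible in the queue or the timeout elapses.
#
# timeout  - value of verify_submit_timeout from the config, in seconds
# interval - seconds to sleep between checks (assumed default)
# clock    - callable returning the current time in seconds (injectable for tests)
# sleeper  - callable that pauses for the given number of seconds
# block    - returns true once the job shows up (e.g. wraps a bjobs call)
#
# Returns true if the job was seen, false if verification was disabled
# or the timeout expired. The caller returns the job id in every case.
def wait_for_job(timeout,
                 interval: 2,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
                 sleeper: ->(seconds) { sleep(seconds) })
  # 0, a negative number, or a non-number disables the check entirely.
  return false unless timeout.is_a?(Numeric) && timeout > 0

  deadline = clock.call + timeout
  loop do
    return true if yield

    remaining = deadline - clock.call
    return false if remaining <= 0

    # Never sleep past the deadline.
    sleeper.call([interval, remaining].min)
  end
end

# Simulated queue: the job becomes visible on the third check.
checks = 0
seen = wait_for_job(10, interval: 0.01) { (checks += 1) >= 3 }
```

In submit, this would run after bsub returns the id, and the id would be returned whether or not the job was seen, so a slow queue never turns a successful submission into an error.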

@ericfranz

Contributor Author

commented Mar 20, 2018

This should address OSC/ood-myjobs#260.

@nickjer

Contributor

commented Mar 27, 2018

I propose we hold off on this until we can confirm that setting NEWJOB_REFRESH=Y in lsb.params fixes the issue:

https://www.ibm.com/support/knowledgecenter/en/SSETD4_9.1.3/lsf_config_ref/lsb.params.newjob_refresh.5.html

If it does, we just need to document this.

@nickjer

Contributor

commented Apr 10, 2018

The issue turned out to be that we were requesting the job status for a host group and not a cluster. The delay occurred because the job was Pending and had yet to be dispatched to a host in the requested host group. Once the job was dispatched to a valid host in the requested host group and entered the Running state, it appeared in the job status request.
