Condor jobs left in queue with X state at end of completion #26

Closed
yadudoc opened this Issue Jan 25, 2018 · 1 comment

yadudoc commented Jan 25, 2018

Reported by Michael Wang

The condor provider currently leaves jobs in the queue in the 'X' state. The provider sets the leave_in_queue option to prevent condor from removing completed jobs before the provider has a chance to notice the state change. As a side effect, jobs removed with condor_rm stay in the queue in the 'X' state. This is usually not a problem, since condor cleans them up later, but it can confuse users. From the condor documentation (condor_rm man page):
When removing a grid job, the job may remain in the "X" state for a very long time. This is normal, as HTCondor is attempting to communicate with the remote scheduling system, ensuring that the job has been properly cleaned up. If it takes too long, or in rare circumstances is never removed, the job may be forced to leave the job queue by using the -forcex option. This forcibly removes jobs that are in the "X" state without attempting to finish any clean up at the remote scheduler.

To avoid this, the provider should remove the job itself and then do a final cleanup.

yadudoc added the bug label Jan 25, 2018

yadudoc added this to the 0.3.0 milestone Jan 25, 2018

yadudoc self-assigned this Jan 25, 2018

yadudoc added a commit that referenced this issue Jan 25, 2018

Fix for #26, and bad returns from submit
* cancel() now uses condor_rm followed by condor_rm -forcex to clean out jobs left in 'X' state.
* submit() was returning a list of ids, fixed to return just id.
* Doc improvements
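
The two-step removal described in the first bullet could look roughly like the sketch below. It is an illustration, not the provider's actual code: the function name `cancel_jobs`, the injectable `run` parameter, and the return convention are assumptions made for the example. Only `condor_rm` and its `-forcex` option come from the issue and the man page quoted above.

```python
import subprocess
from typing import Callable, List, Sequence


def _run(cmd: Sequence[str]) -> int:
    """Run a command, discard its output, and return its exit code."""
    result = subprocess.run(
        list(cmd), stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode


def cancel_jobs(
    job_ids: Sequence[str],
    run: Callable[[Sequence[str]], int] = _run,  # hypothetical hook, eases testing
) -> List[bool]:
    """Remove jobs, then force out any that linger in the 'X' state.

    Returns one boolean per job id; every entry reflects the exit status
    of the first condor_rm call (an assumed convention for this sketch).
    """
    if not job_ids:
        return []
    ids = [str(j) for j in job_ids]

    # Step 1: ordinary removal. With leave_in_queue set, the jobs may
    # remain in the queue in the 'X' state after this call.
    rm_status = run(["condor_rm", *ids])

    # Step 2: best-effort forced removal of jobs stuck in 'X'. Its exit
    # status is ignored because there may be nothing left to remove.
    run(["condor_rm", "-forcex", *ids])

    return [rm_status == 0] * len(ids)
```

Passing the command runner in as a parameter keeps the sketch testable on a machine without HTCondor installed; a real provider would more likely go through its own command-execution channel.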

yadudoc commented Jan 26, 2018

Tested on OSG. Fixed.

yadudoc closed this Jan 26, 2018

yadudoc added a commit that referenced this issue Jan 30, 2018

Bumping version from 0.2.6 -> 0.3.0
Several updates and cleanups for @benhg for the Azure provider
Removed haikunator dep
Fixes for issue #27 and regression tests
Support for cobalt+aprun on theta
SGE support (untested) from @benhg
Fix for #26
Updating to use scriptDir instead of script_dir