The scheduler and file system can race #3841
The problem here can manifest as #3758 (comment), which @Guigzai experienced. If the scheduler/batch system tells us that a job is done before all of its writes are visible on disk, we can get into trouble. Dealing with missing files might just involve waiting for them to appear, but dealing with files that were supposed to have been replaced or modified, and weren't, is going to be harder.
@adamnovak, is this being considered in the future development plan? Any help would be appreciated, as this recurs often and impacts many of the pipelines we work on.
➤ Adam Novak commented: We should add a clock to the job description in the file job store, so we can look at it and know whether the writes (or the job description deletion) from any particular invocation of the job are visible. Then the leader can poll until the writes are visible, and only then proceed with re-scheduling the job or scheduling its successors.
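A rough sketch of the clock-and-poll idea, to make the proposal concrete. This is not Toil's actual job store code: the JSON layout, the `clock` field, and the names `write_job_description`, `read_clock`, and `wait_for_writes` are all hypothetical, and it assumes that re-opening the file is enough to observe remote writes, which may not hold on every network filesystem.

```python
import json
import os
import time


def write_job_description(path, clock, body=None):
    """Worker side (hypothetical): write the description with a bumped clock.

    Writes to a temporary file and renames it into place, so a reader sees
    either the old description or the new one, never a half-written file.
    """
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"clock": clock, "body": body}, f)
    os.replace(tmp_path, path)


def read_clock(path):
    """Return the clock on disk, None if the file is gone, -1 if unreadable."""
    try:
        with open(path) as f:
            return json.load(f)["clock"]
    except FileNotFoundError:
        return None
    except (json.JSONDecodeError, KeyError):
        # Treat a truncated or malformed file as "not visible yet".
        return -1


def wait_for_writes(path, expected_clock, timeout=60.0, interval=0.5):
    """Leader side (hypothetical): poll until the expected state is visible.

    expected_clock is the clock value the finished invocation should have
    written, or None if that invocation should have deleted the description.
    Returns True once the state is visible, False if the timeout expires.
    """
    deadline = time.monotonic() + timeout
    while True:
        clock = read_clock(path)
        if expected_clock is None:
            visible = clock is None
        else:
            visible = clock is not None and clock >= expected_clock
        if visible:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Under these assumptions, the leader would call `wait_for_writes` after the batch system reports the job as done, and only re-schedule the job or schedule successors once it returns `True`. Comparing with `>=` rather than `==` means a later invocation's write also counts as visible.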
➤ Adam Novak commented: We should maybe just steal Snakemake’s idea of having a configurable wait time for the filesystem to settle.
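For comparison, a minimal sketch of the settle-time approach, loosely modeled on Snakemake's `--latency-wait` option. The function name `wait_for_files` and its parameters are hypothetical, not an existing Toil option. Note that this only covers the missing-file case: it cannot tell whether a file that already existed has been replaced or modified yet.

```python
import os
import time


def wait_for_files(paths, latency_wait=5.0, interval=0.25):
    """Wait up to latency_wait seconds for every path to appear.

    Returns the list of paths that are still missing when the wait ends;
    an empty list means everything showed up in time.
    """
    deadline = time.monotonic() + latency_wait
    missing = [p for p in paths if not os.path.exists(p)]
    while missing and time.monotonic() < deadline:
        time.sleep(interval)
        # Only re-check the paths that were missing on the previous pass.
        missing = [p for p in missing if not os.path.exists(p)]
    return missing
```

A caller would treat a non-empty return value as a real failure, and would tune `latency_wait` per cluster, since how long a shared filesystem takes to settle depends on the site.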
┆Issue is synchronized with this Jira Story
┆Issue Number: TOIL-1054