
The scheduler and file system can race #3841

Closed
unito-bot opened this issue Oct 12, 2021 · 5 comments · Fixed by #4811

@unito-bot

unito-bot commented Oct 12, 2021

┆Issue is synchronized with this Jira Story
┆Issue Number: TOIL-1054

@adamnovak
Member

The problem here can manifest as #3758 (comment) which @Guigzai experienced. If we hear back from the scheduler/batch system that a job is done but not all its writes are visible on disk, we can get into trouble.

Dealing with missing files might just involve waiting for them to appear, but dealing with files that were supposed to have been replaced/modified but weren't is going to be harder.

@rohith-bs

@adamnovak, is this being considered in the future development plan? Kindly help, as this recurs often and impacts many pipelines that we work on.

@Guigzai

Guigzai commented Jun 16, 2023

Hello,

We have updated Toil from 5.7.1 to 5.11.0 and run our tests.

The update fixes a recurrent bug explained in #3758 or #4092.

Thank you for your work.

@unito-bot
Author

➤ Adam Novak commented:

We should add a clock to the job description in the file job store, so we can look at it and know whether or not the writes (or job description deletion) from any particular invocation of the job are visible. Then the leader can poll until the writes are visible, and then proceed with re-scheduling the job or scheduling the successors.
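
A rough sketch of what that polling could look like, assuming the job description is a JSON file with a `clock` field that each invocation bumps when it writes. The function names, the JSON layout, and the timeout defaults are all hypothetical and are not Toil's actual file job store format; deletion of the job description would need its own check and is not shown.

```python
import json
import os
import tempfile
import time


def write_job_description(path, body, clock):
    """Atomically write a job description carrying a monotonically increasing clock."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump({"clock": clock, "body": body}, handle)
    # Rename within one directory, so readers see the old or the new file, never a partial one.
    os.replace(tmp_path, path)


def read_clock(path):
    """Return the clock stored at path, or None if the file is absent or unreadable."""
    try:
        with open(path) as handle:
            return json.load(handle)["clock"]
    except (FileNotFoundError, json.JSONDecodeError, KeyError, TypeError):
        return None


def wait_for_clock(path, expected_clock, timeout=60.0, interval=0.5):
    """Poll until the stored clock reaches expected_clock; return True on success."""
    deadline = time.monotonic() + timeout
    while True:
        clock = read_clock(path)
        if clock is not None and clock >= expected_clock:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

The leader would call `wait_for_clock` with the clock value it expects from the invocation the batch system just reported as finished, and only re-schedule the job or schedule successors once it returns `True`.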

@unito-bot
Author

➤ Adam Novak commented:

We should maybe just steal Snakemake’s idea of having a configurable wait time for the filesystem to settle.
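
For illustration, a minimal sketch of such a settle wait, loosely modeled on Snakemake's `--latency-wait` option. The function name `wait_for_files` and its defaults are made up here and are not part of Toil or Snakemake.

```python
import os
import time


def wait_for_files(paths, latency_wait=5.0, interval=0.25):
    """Wait up to latency_wait seconds for every path to exist; return those still missing."""
    deadline = time.monotonic() + latency_wait
    missing = [p for p in paths if not os.path.exists(p)]
    while missing and time.monotonic() < deadline:
        time.sleep(interval)
        # Only re-check the paths that were missing on the previous pass.
        missing = [p for p in missing if not os.path.exists(p)]
    return missing
```

An empty return value means everything showed up within the window. This only covers files that have not appeared yet; it cannot tell whether a file that was supposed to be replaced or modified is still the stale copy.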
