The scheduler and file system can race #3841

unito-bot · 2021-10-12T16:09:43Z

┆Issue is synchronized with this Jira Story
┆Issue Number: TOIL-1054

adamnovak · 2021-10-13T14:56:30Z

The problem here can manifest as #3758 (comment) which @Guigzai experienced. If we hear back from the scheduler/batch system that a job is done but not all its writes are visible on disk, we can get into trouble.

Dealing with missing files might just involve waiting for them to appear, but dealing with files that were supposed to have been replaced/modified but weren't is going to be harder.

rohith-bs · 2023-05-24T11:30:41Z

@adamnovak, is this being considered in the future development plan? Kindly help, as this recurs often and impacts many of the pipelines that we work on.

Guigzai · 2023-06-16T15:22:26Z

Hello,

We updated Toil from 5.7.1 to 5.11.0 and ran our tests.

The update fixes a recurring bug described in #3758 or #4092.

Thank you for your work.

unito-bot · 2023-09-19T17:21:46Z

➤ Adam Novak commented:

We should add a clock to the job description in the file job store, so we can look at it and know whether or not the writes (or job description deletion) from any particular invocation of the job are visible. Then the leader can poll until the writes are visible, and then proceed with re-scheduling the job or scheduling the successors.
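A rough sketch of what that polling could look like, assuming a hypothetical JSON job description with an integer `clock` field that the worker bumps on every write. The file layout and the names `read_clock` and `wait_for_update` are illustrative only, not Toil's actual job store format or API:

```python
import json
import time


def read_clock(path):
    """Return the clock stored in the job description at `path`.

    Returns None if the file does not exist (treated here as "deleted"),
    and -1 if it exists but cannot be parsed yet (e.g. a partially
    visible write).
    """
    try:
        with open(path) as f:
            return json.load(f).get("clock", 0)
    except FileNotFoundError:
        return None
    except json.JSONDecodeError:
        return -1


def wait_for_update(path, expected_clock, timeout=60.0, interval=0.5):
    """Poll until the job description shows at least `expected_clock`.

    A missing file also counts as visible, on the assumption that the
    job description existed before the job ran, so absence means the
    deletion has become visible. Returns True once the update is
    visible and False if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        clock = read_clock(path)
        if clock is None or clock >= expected_clock:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In this sketch the leader would call `wait_for_update` with the clock value it expects from the invocation that just finished, and only re-schedule the job or schedule successors once it returns True. A False result could be treated as a failed job rather than acting on stale state.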

unito-bot · 2024-02-06T18:32:16Z

➤ Adam Novak commented:

We should maybe just steal Snakemake’s idea of having a configurable wait time for the filesystem to settle.
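As a minimal sketch of the settle-time idea (not Snakemake's or Toil's actual implementation), the leader could wait up to a configurable number of seconds for expected files to show up before giving up. The function name `wait_for_files` and the `latency_wait` parameter are hypothetical:

```python
import os
import time


def wait_for_files(paths, latency_wait=5.0, interval=0.1):
    """Wait up to `latency_wait` seconds for every path to exist.

    Returns the list of paths that are still missing when the wait
    ends; an empty list means everything became visible in time.
    """
    deadline = time.monotonic() + latency_wait
    missing = [p for p in paths if not os.path.exists(p)]
    while missing and time.monotonic() < deadline:
        time.sleep(interval)
        # Only re-check the paths that have not appeared yet.
        missing = [p for p in missing if not os.path.exists(p)]
    return missing
```

This only covers files that are missing outright. It cannot tell whether a file that was supposed to be replaced or modified still has its old contents, which is the harder case mentioned earlier in the thread.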

unito-bot assigned adamnovak Oct 12, 2021

This was referenced Feb 29, 2024

Remove workaround for ghost jobs caused by stale reads from SDB #1091 (Open)

Make leader wait for expected updates to be visible in the job store, or fail the job #4811 (Merged)

adamnovak closed this as completed in #4811 Apr 25, 2024
