NFS? synchronization problems #33

Closed
lawremi opened this issue Jan 16, 2014 · 5 comments

Comments

lawremi commented Jan 16, 2014

We have pretty much given up trying to get BatchJobsParam working on our LSF cluster. We frequently encounter seemingly random failures at various stages, including just after submission (the transition to waiting) and when attempting to collect results. These failures result in empty "" error messages being reported as the "first error" for the operation. Another common failure is that jobs will be started by the scheduler before their .R file exists.

The file system is mounted via NFS, and we suspect that file operations are happening out of order. For example, when switching to the wait state, the system seems to think that there is nothing to wait for, when in fact no jobs have started yet. And, when all jobs are finished, it checks for results, but they have not yet appeared, so it fails.

This is mostly a BatchJobs issue but is made worse by the automation provided by BatchJobsParam. Is there any way to synchronize these operations? Or must we find a more reliable filesystem?

This problem is also probably responsible for the bpresume() issue.
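
For context, a minimal sketch of the kind of BiocParallel usage involved (the worker count and toy function below are hypothetical placeholders, not our actual workload):

```r
library(BiocParallel)

## Hypothetical setup: dispatch work through BatchJobs to the LSF scheduler.
param <- BatchJobsParam(workers = 20)
register(param)

## Each element is run as a BatchJobs job on the cluster; the random
## failures described above surface here as empty "first error" messages.
res <- bplapply(1:100, function(i) sqrt(i), BPPARAM = param)
```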

HenrikBengtsson (Contributor) commented:

Thanks for sharing your experiences. You're not the first one to observe weird, non-reproducible behavior in parallel/distributed processing, and you won't be the last either. Even if there won't be clear answers, these types of reports are very important and will eventually help narrow down potential bugs, and over time things will slowly become better. For further troubleshooting / reporting it's probably helpful to know:

  • sessionInfo()?
  • What scheduler are you using?
  • Do you see these errors when you launch jobs on a single machine?
    (related to my NFS comment below)
  • Do you see these errors more often when running lots of jobs in parallel,
    or do you see them equally often when running in a more sequential fashion
    (a few jobs at a time)?
  • If possible, I would strip off the BiocParallel layer for now and use only
    BatchJobs (see the sketch after this list). It's not that difficult [I've
    got some slides http://goo.gl/s1eqBz]. If you also see the errors that way
    (I assume you will), I then recommend that you report to BatchJobs
    [https://github.com/tudo-r/BatchJobs/issues].
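
As an illustration of what I mean by using BatchJobs directly, here is a minimal sketch (the registry id, toy function, and job count are hypothetical; the backend comes from your .BatchJobs.R configuration):

```r
library(BatchJobs)

## File-based registry in the current working directory (hypothetical id).
reg <- makeRegistry(id = "nfs_test")

## One job per element of 1:10, each applying the toy function.
batchMap(reg, function(x) x^2, 1:10)

## Submit to the scheduler, block until all jobs finish, then collect results.
submitJobs(reg)
waitForJobs(reg)
res <- loadResults(reg)
```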

My comments on NFS, which may or may not explain your observations: NFS can have up to 30-second delays (I doubt it's that bad nowadays), but I've certainly observed delays of a few seconds on heavily used NFS systems. This means that when one node writes something, it may take a few seconds before another node will see it. This can lead to race conditions, even on I/O operations that you'd believe are atomic, e.g. file moves. I gave some anecdotal illustrations of this in a BatchJobs thread a while ago - https://groups.google.com/forum/#!msg/batchjobs/15IIm_Nb4YQ/iwiGFYIPpmwJ. From that thread you can see that BatchJobs is trying to protect against this, but I don't know if it is completely possible to do.
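
To make the race concrete, here is a rough sketch of the kind of defensive polling a tool can use when a file written on one node may not yet be visible on another (the helper below is illustrative only, not actual BatchJobs code):

```r
## Wait until 'path' becomes visible on this node's NFS mount, or give up.
## Illustrative helper; the name, timeout, and poll interval are arbitrary.
waitForFile <- function(path, timeout = 30, interval = 0.5) {
  t0 <- Sys.time()
  while (!file.exists(path)) {
    if (difftime(Sys.time(), t0, units = "secs") > timeout)
      stop("File not visible after ", timeout, " seconds: ", path)
    Sys.sleep(interval)
  }
  invisible(TRUE)
}
```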

My $.02

/Henrik

mllg (Collaborator) commented Feb 19, 2014

Version 1.2 of BatchJobs includes a more conservative mechanism to determine if jobs are running. And, as Henrik already pointed out, you should try the staged.queries option described here: https://github.com/tudo-r/BatchJobs/wiki/Configuration#wiki-optimizing-performance.
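
For reference, a minimal sketch of a .BatchJobs.R configuration with that option enabled (the LSF template file name is a hypothetical placeholder):

```r
## .BatchJobs.R in the working directory (or ~/.BatchJobs.R)
cluster.functions <- makeClusterFunctionsLSF("lsf.tmpl")

## Stage database operations from the workers through files instead of
## having each job access the SQLite database directly over NFS.
staged.queries <- TRUE
```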

But I can imagine how everything falls apart with a 30-second delay.

I will try to include some extra checks and sleeps in submitJobs and discuss a configuration option with Bernd next week.

As a side note: We also use NFS on a smaller faculty cluster, and we don't see such delays. If there are parallel I/O-heavy jobs, we encounter delays of up to a few seconds. Stale NFS locks are really rare and can usually be resolved by resubmitting a job. So the file system is not necessarily the reason.

lawremi (Author) commented Feb 19, 2014

Thanks,

We've used staged.queries since day one. Is this more conservative mechanism the same thing you added for Leonard, or is it new? More checks/delays would be great. At this point, people are having good luck with ad hoc BatchJobs usage, but BiocParallel seems to dramatically decrease the stability, so it's probably just a timing issue.

mllg (Collaborator) commented Feb 20, 2014

Michael, could you check whether the commits tudo-r/BatchJobs@c26e17e and tudo-r/BatchJobs@f46f88e resolve your issues? You have to set the config option fs.timeout to a timeout value (in seconds) to wait for file creation before an R exception is thrown.

This option mainly influences the execution on the master in submitJobs. If you encounter other inconsistencies (after killing, resetting, etc.), please let me know.
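
A minimal sketch of setting that option in .BatchJobs.R (the 60-second value is just an example; tune it to your cluster):

```r
## Wait up to 60 seconds for expected files to appear over NFS
## before an error is thrown.
fs.timeout <- 60
```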

lawremi (Author) commented Feb 21, 2014

Things seem to have improved. With a timeout of 5 seconds, the failure rate is about 1/1000 instead of 1/10. But this does of course depend on cluster conditions. Thanks for this workaround! Closing for now.

lawremi closed this as completed Feb 21, 2014