NFS? synchronization problems #33

Closed
lawremi opened this issue Jan 16, 2014 · 5 comments

Comments

lawremi commented Jan 16, 2014

We have pretty much given up trying to get BatchJobsParam working on our LSF cluster. We frequently encounter seemingly random failures at various stages, including just after submission (the transition to waiting) and when attempting to collect results. These failures result in empty "" error messages being reported as the "first error" for the operation. Another common failure is that jobs will be started by the scheduler before their .R file exists.

The file system is mounted via NFS, and we suspect that file operations are happening out of order. For example, when switching to the wait state, the system seems to think that there is nothing to wait for, when in fact no jobs have started yet. And, when all jobs are finished, it checks for results, but they have not yet appeared, so it fails.

This is mostly a BatchJobs issue but is made worse by the automation provided by BatchJobsParam. Is there any way to synchronize these operations? Or must we find a more reliable filesystem?

This problem is also probably responsible for the bpresume() issue.
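
For context, a minimal sketch of the kind of BiocParallel usage involved (the worker count and toy function below are hypothetical placeholders, not our actual workload):

```r
library(BiocParallel)

## Hypothetical setup: dispatch work through BatchJobs to the LSF scheduler.
param <- BatchJobsParam(workers = 20)
register(param)

## Each element is run as a BatchJobs job on the cluster; the random
## failures described above surface here as empty "first error" messages.
res <- bplapply(1:100, function(i) sqrt(i), BPPARAM = param)
```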

HenrikBengtsson (Contributor) commented:

Thanks for sharing your experiences. You're not the first one to observe weird, non-reproducible behavior in parallel/distributed processing, and you won't be the last either. Even if there won't be clear answers, these types of reports are very important and will eventually help narrow down potential bugs, and over time things will slowly become better. For further troubleshooting / reporting it's probably helpful to know:

  • sessionInfo()?
  • What scheduler are you using?
  • Do you see these errors when you launch jobs on a single machine?
    (related to my NFS comment below)
  • Do you see these errors more often when running lots of jobs in parallel,
    or do you see them equally often when running in a more sequential fashion
    (a few jobs at a time)?
  • If possible, I would strip off the BiocParallel layer for now and use only
    BatchJobs (see the sketch after this list). It's not that difficult [I've
    got some slides http://goo.gl/s1eqBz]. If you also see the errors that way
    (I assume you will), I then recommend that you report to BatchJobs
    [https://github.com/tudo-r/BatchJobs/issues].
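
As an illustration of what I mean by using BatchJobs directly, here is a minimal sketch (the registry id, toy function, and job count are hypothetical; the backend comes from your .BatchJobs.R configuration):

```r
library(BatchJobs)

## File-based registry in the current working directory (hypothetical id).
reg <- makeRegistry(id = "nfs_test")

## One job per element of 1:10, each applying the toy function.
batchMap(reg, function(x) x^2, 1:10)

## Submit to the scheduler, block until all jobs finish, then collect results.
submitJobs(reg)
waitForJobs(reg)
res <- loadResults(reg)
```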

My comments on NFS, which may or may not explain your observations: NFS can have up to 30-second delays (I doubt it's that bad nowadays), but I've certainly observed delays of a few seconds on heavily used NFS systems. This means that when one node writes something, it may take a few seconds before another node will see it. This can lead to race conditions, even on I/O operations that you'd believe are atomic, e.g. file moves. I gave some anecdotal illustrations of this in a BatchJobs thread a while ago - https://groups.google.com/forum/#!msg/batchjobs/15IIm_Nb4YQ/iwiGFYIPpmwJ. From that thread you can see that BatchJobs is trying to protect against this, but I don't know if it is completely possible to do.
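
To make the race concrete, here is a rough sketch of the kind of defensive polling a tool can use when a file written on one node may not yet be visible on another (the helper below is illustrative only, not actual BatchJobs code):

```r
## Wait until 'path' becomes visible on this node's NFS mount, or give up.
## Illustrative helper; the name, timeout, and poll interval are arbitrary.
waitForFile <- function(path, timeout = 30, interval = 0.5) {
  t0 <- Sys.time()
  while (!file.exists(path)) {
    if (difftime(Sys.time(), t0, units = "secs") > timeout)
      stop("File not visible after ", timeout, " seconds: ", path)
    Sys.sleep(interval)
  }
  invisible(TRUE)
}
```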

My $.02

/Henrik

mllg (Collaborator) commented Feb 19, 2014

Version 1.2 of BatchJobs includes a more conservative mechanism to determine if jobs are running. And, as Henrik already pointed out, you should try the staged.queries option described here: https://github.com/tudo-r/BatchJobs/wiki/Configuration#wiki-optimizing-performance.
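
For reference, a minimal sketch of a .BatchJobs.R configuration with that option enabled (the LSF template file name is a hypothetical placeholder):

```r
## .BatchJobs.R in the working directory (or ~/.BatchJobs.R)
cluster.functions <- makeClusterFunctionsLSF("lsf.tmpl")

## Stage database operations from the workers through files instead of
## having each job access the SQLite database directly over NFS.
staged.queries <- TRUE
```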

But I can imagine how everything falls apart with a 30-second delay.

I will try to include some extra checks and sleeps in submitJobs and discuss a configuration option with Bernd next week.

As a side note: We also use NFS on a smaller faculty cluster, and we don't see such delays. If there are parallel I/O-heavy jobs, we encounter delays of up to a few seconds. Stale NFS locks are really rare and can usually be resolved by resubmitting a job. So the file system is not necessarily the reason.

lawremi (Author) commented Feb 19, 2014

Thanks,

We've used staged.queries since day one. Is this more conservative mechanism the same thing you added for Leonard, or is it new? More checks/delays would be great. At this point, people are having good luck with ad hoc BatchJobs usage, but BiocParallel seems to dramatically decrease the stability, so it's probably just a timing issue.

mllg (Collaborator) commented Feb 20, 2014

Michael, could you check whether the commits tudo-r/BatchJobs@c26e17e and tudo-r/BatchJobs@f46f88e resolve your issues? You have to set the config option fs.timeout to a timeout value (in seconds) to wait for file creation before an R exception is thrown.

This option mainly influences the execution on the master in submitJobs. If you encounter other inconsistencies (after killing, resetting, etc.), please let me know.
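
A minimal sketch of setting that option in .BatchJobs.R (the 60-second value is just an example; tune it to your cluster):

```r
## Wait up to 60 seconds for expected files to appear over NFS
## before an error is thrown.
fs.timeout <- 60
```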

lawremi (Author) commented Feb 21, 2014

Things seem to have improved. With a timeout of 5 seconds, the failure rate is about 1/1000 instead of 1/10. But this does of course depend on cluster conditions. Thanks for this workaround! Closing for now.

lawremi closed this as completed Feb 21, 2014