NFS? synchronization problems #33
Thanks for sharing your experiences. You're not the first one to observe this. My comments on NFS, which may or may not explain your observations: […] My $.02. /Henrik (replying to lawremi's message of Thu, Jan 16, 2014, 9:22 AM)
Version 1.2 of BatchJobs includes a more conservative mechanism to determine whether jobs are running. And, as Henrik already pointed out, you should try the […] option. But I can imagine how everything falls apart with a 30-second delay; I will try to include some extra checks and sleeps in […]. As a side note: we also use NFS on a smaller faculty cluster, and we don't see such delays there. If there are parallel I/O-heavy jobs, we encounter delays of up to a few seconds. Stale NFS locks are really rare and can usually be resolved by resubmitting a job. So the file system is not necessarily the reason.
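The "extra checks and sleeps" idea can be sketched as a small polling loop that waits for a file to become visible before acting on it. This is not BatchJobs code; `wait_for_file()` is a hypothetical helper, and the 5-second default simply mirrors the timeout reported later in this thread:

```r
# Hypothetical helper (not part of the BatchJobs API): poll until a
# file becomes visible, tolerating delayed propagation on NFS mounts.
wait_for_file <- function(path, timeout = 5, interval = 0.5) {
  deadline <- Sys.time() + timeout
  while (Sys.time() < deadline) {
    if (file.exists(path)) return(TRUE)
    Sys.sleep(interval)          # back off before re-checking
  }
  file.exists(path)              # one final check after the deadline
}
```

A submission-side wrapper could call something like `wait_for_file("<registry>/jobs/1.R")` (a hypothetical path) before launching a job, so the scheduler never starts a job whose .R file is not yet visible on the worker's mount.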
Thanks. We've used staged.queries since day one. Is this more conservative […]? /Henrik (replying to Michel's message of Wed, Feb 19, 2014, 4:42 AM)
Michael, could you check whether the commits tudo-r/BatchJobs@c26e17e and tudo-r/BatchJobs@f46f88e resolve your issues? This option mainly influences the execution on the master in […].
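For reference, both knobs discussed in this thread would live in the BatchJobs configuration file (e.g. `~/.BatchJobs.R`). `staged.queries` is confirmed above; treating the 5-second timeout as a `fs.timeout` setting is an assumption based on the commits referenced, so check the BatchJobs configuration documentation before relying on the name:

```r
# ~/.BatchJobs.R (sketch): staged.queries is confirmed in this thread;
# fs.timeout as the name of the 5-second timeout is an assumption.
staged.queries = TRUE
fs.timeout = 5
```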
Things seem to have improved: with a timeout of 5 seconds, the failure rate is about 1/1000 instead of 1/10, although this of course depends on cluster conditions. Thanks for this workaround! Closing for now.
We have pretty much given up trying to get BatchJobsParam working on our LSF cluster. We frequently encounter seemingly random failures at various stages, including just after submission (the transition to the waiting state) and when attempting to collect results. These failures result in empty ("") error messages being reported as the "first error" for the operation. Another common failure is that jobs are started by the scheduler before their .R file exists.
The file system is mounted via NFS, and we suspect that file operations are happening out of order. For example, when switching to the wait state, the system seems to think that there is nothing to wait for, when in fact no jobs have started yet. And, when all jobs are finished, it checks for results, but they have not yet appeared, so it fails.
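The collection-side symptom described above suggests a defensive pattern: retry the load when a result file is not yet visible, or is visible but only partially written. This is a sketch, not BatchJobs code; `load_result()` is a hypothetical helper, and the RDS serialization format is an assumption:

```r
# Hypothetical helper: retry loading a result file whose visibility on
# an NFS mount may lag behind the job's completion on the worker node.
load_result <- function(path, retries = 10, interval = 1) {
  for (i in seq_len(retries)) {
    if (file.exists(path)) {
      res <- try(readRDS(path), silent = TRUE)   # guards against a
      if (!inherits(res, "try-error")) return(res)  # half-written file
    }
    Sys.sleep(interval)  # give the NFS attribute cache time to expire
  }
  stop("result not readable after ", retries * interval, " s: ", path)
}
```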
This is mostly a BatchJobs issue, but it is made worse by the automation provided by BatchJobsParam. Is there any way to synchronize these operations? Or must we find a more reliable filesystem?
This problem is also probably responsible for the bpresume() issue.