heartbeat gives false negative under heavy nfs load #433
The default heartbeat interval is too short. You're doing it right. However, the frequency of testing the heartbeat also needs to drop. Then, at some point, the issue becomes a non-issue. And there could be a separate interval for looking for the exit files. Also, we could stat the timestamp less often than we check for existence. I'll look into that later.
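To make the idea concrete, here is a rough sketch of decoupled polling intervals. The names, numbers, and structure are hypothetical, not the actual pwatcher code:

```python
import os
import time

# Hypothetical intervals, not the real pwatcher defaults: look for the exit
# sentinel fairly often, but stat the heartbeat timestamp much less often.
EXIT_CHECK_INTERVAL = 10        # seconds between checks for the exit file
HEARTBEAT_CHECK_INTERVAL = 600  # seconds between heartbeat-timestamp stats
HEARTBEAT_TIMEOUT = 1800        # how stale a heartbeat may be before we call the job dead

def poll_job(exit_file, heartbeat_file):
    """Return 'done' when the exit sentinel appears, or 'dead' on a stale heartbeat."""
    last_hb_check = 0.0
    while True:
        if os.path.exists(exit_file):
            return 'done'
        now = time.time()
        if now - last_hb_check >= HEARTBEAT_CHECK_INTERVAL:
            last_hb_check = now
            try:
                age = now - os.stat(heartbeat_file).st_mtime
            except OSError:
                age = None  # heartbeat not written yet; give it the benefit of the doubt
            if age is not None and age > HEARTBEAT_TIMEOUT:
                return 'dead'
        time.sleep(EXIT_CHECK_INTERVAL)
```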
Good. Understand that when this weird failure occurs, those users would wait forever! It's not a problem to be ignored.
We are very unhappy at having to rely on the filesystem. Unfortunately, we know nothing about the network available for any given user. The filesystem is the lowest common resource, and I don't think that filesystem locks (on a database) would improve the situation, though it's worth trying. However, the architecture is flexible. "pwatcher" is only meant as the universal solution, not the best. We could quickly code other plugins. E.g. we could take advantage of UNIX sockets (if available for grid machines). Heartbeats could be calls back to a central server. Or better: the central server could query the running jobs for status, so we would not need heartbeats at all. We could hoard a number of grid hosts, to reduce the delays in host acquisition for the case of huge numbers of very quick jobs, and each host could have a single process to communicate with the central server. Or we could go the other direction: rely on qstat instead of watching jobs ourselves. The watcher has a simple JSON API. It has 3 commands: run, query, and delete.
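To make that API concrete, an alternative watcher plugin would roughly need to expose those three commands. The sketch below is only illustrative: the command names come from the API described above, but the function names, arguments, and return shapes are assumptions, not the actual pwatcher interface.

```python
# Illustrative shape of an alternative watcher plugin (e.g. socket- or
# qstat-based). Only the three command names are taken from the pwatcher
# API described above; everything else here is assumed.

def cmd_run(jobs):
    """Start the given jobs, e.g. {'jobid': {'cmd': ..., 'rundir': ...}, ...}.
    Could submit via qsub, or hand work to a pool of pre-acquired grid hosts."""
    raise NotImplementedError

def cmd_query(which):
    """Return current status per job, e.g. {'jobid': 'RUNNING' or 'EXIT 0' or 'DEAD'}.
    A socket-based watcher could ask the jobs directly; a qstat-based one
    could parse the grid's own view instead of relying on heartbeats."""
    raise NotImplementedError

def cmd_delete(which):
    """Kill the given jobs and clean up any watcher state for them."""
    raise NotImplementedError
```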
If you write an alternative watcher, we could make that available to everyone by configuration. We might need a matrix of watchers, with grid-system on one axis. But this is super low priority for PacBio, where the main goal is Sequel improvements. A longer heartbeat solves most problems. I guess that needs to be configurable, yes? That can be in our next sprint.
Okay, done. I added an option to the config file to choose the watcher type, defaulting to fs_type. It'll probably take me a while to upload a fork and make a pull request, though, because I suck at git (took me two days of work to manage it last time). One question, though - I also have a python script to allow the server to be queried from the command line (you just specify the server:port or point to the run directory so it can grab the info from state.py). Where should I put the script, as it seems like a good thing to include with the patch? I have tried to make it reasonably durable, but there are certainly things it misses, like if the heartbeat wrapper fails. And it's possible it'll have problems in environments unlike mine, but that's hard for me to test. ;)
Yeah, coding is always the easy (and fun) part. But git is worth learning. Put the script into
Exactly! That's my biggest problem. So I'm happy to take contributions that work well in specific systems.
The heartbeat wrapper was really meant only for the specific case of weird process terminations on our local "Pod". I don't think any external users have reported such a problem, so maybe we don't even need it. And to the extent that it's useful, we might be better off using UNIX sockets to query the top process. I'll look at what you have when I see it. Btw, I think we need
Just to reinforce: I'm seeing exactly this same issue. As soon as all of the qsubs are done, it waits only briefly, then shows these specific messages:
...
[WARNING] File "/usr/local/bin/fc_run", line 9, in <module>
load_entry_point('falcon-kit==0.7', 'console_scripts', 'fc_run')()
File "/usr/local/lib/python2.7/dist-packages/falcon_kit/mains/run1.py", line 563, in main
main1(argv[0], args.config, args.logger)
File "/usr/local/lib/python2.7/dist-packages/falcon_kit/mains/run1.py", line 327, in main1
setNumThreadAllowed=PypeProcWatcherWorkflow.setNumThreadAllowed)
File "/usr/local/lib/python2.7/dist-packages/falcon_kit/mains/run1.py", line 405, in run
wf.refreshTargets(exitOnFailure=exitOnFailure)
File "/usr/local/lib/python2.7/dist-packages/pypeflow/controller.py", line 523, in refreshTargets
rtn = self._refreshTargets(task2thread, objs = objs, callback = callback, updateFreq = updateFreq, exitOnFailure = exitOnFailure)
File "/usr/local/lib/python2.7/dist-packages/pypeflow/controller.py", line 657, in _refreshTargets
numAliveThreads = self.thread_handler.alive(task2thread.values())
File "/usr/local/lib/python2.7/dist-packages/pypeflow/pwatcher_bridge.py", line 198, in alive
fred.endrun(status)
File "/usr/local/lib/python2.7/dist-packages/pypeflow/pwatcher_bridge.py", line 73, in endrun
log.warning(''.join(traceback.format_stack()))
[ERROR]Task Fred{'URL': 'task://localhost/d_0108_raw_reads'}
is DEAD, meaning no HEARTBEAT, but this can be a race-condition. If it was not killed, then restarting might suffice. Otherwise, you might have excessive clock-skew.
[ERROR]Task Fred{'URL': 'task://localhost/d_0b91_raw_reads'} failed with exit-code=35072
[ERROR]Task Fred{'URL': 'task://localhost/d_09a9_raw_reads'} failed with exit-code=35072
[WARNING] File "/usr/local/bin/fc_run", line 9, in <module>
load_entry_point('falcon-kit==0.7', 'console_scripts', 'fc_run')()
File "/usr/local/lib/python2.7/dist-packages/falcon_kit/mains/run1.py", line 563, in main
main1(argv[0], args.config, args.logger)
File "/usr/local/lib/python2.7/dist-packages/falcon_kit/mains/run1.py", line 327, in main1
setNumThreadAllowed=PypeProcWatcherWorkflow.setNumThreadAllowed)
File "/usr/local/lib/python2.7/dist-packages/falcon_kit/mains/run1.py", line 405, in run
wf.refreshTargets(exitOnFailure=exitOnFailure)
File "/usr/local/lib/python2.7/dist-packages/pypeflow/controller.py", line 523, in refreshTargets
rtn = self._refreshTargets(task2thread, objs = objs, callback = callback, updateFreq = updateFreq, exitOnFailure = exitOnFailure)
File "/usr/local/lib/python2.7/dist-packages/pypeflow/controller.py", line 657, in _refreshTargets
numAliveThreads = self.thread_handler.alive(task2thread.values())
File "/usr/local/lib/python2.7/dist-packages/pypeflow/pwatcher_bridge.py", line 198, in alive
fred.endrun(status)
File "/usr/local/lib/python2.7/dist-packages/pypeflow/pwatcher_bridge.py", line 73, in endrun
log.warning(''.join(traceback.format_stack()))
[ERROR]Task Fred{'URL': 'task://localhost/d_0344_raw_reads'}
is DEAD, meaning no HEARTBEAT, but this can be a race-condition. If it was not killed, then restarting might suffice. Otherwise, you might have excessive clock-skew.
[WARNING] File "/usr/local/bin/fc_run", line 9, in <module>
load_entry_point('falcon-kit==0.7', 'console_scripts', 'fc_run')()
File "/usr/local/lib/python2.7/dist-packages/falcon_kit/mains/run1.py", line 563, in main
main1(argv[0], args.config, args.logger)
File "/usr/local/lib/python2.7/dist-packages/falcon_kit/mains/run1.py", line 327, in main1
setNumThreadAllowed=PypeProcWatcherWorkflow.setNumThreadAllowed)
File "/usr/local/lib/python2.7/dist-packages/falcon_kit/mains/run1.py", line 405, in run
wf.refreshTargets(exitOnFailure=exitOnFailure)
File "/usr/local/lib/python2.7/dist-packages/pypeflow/controller.py", line 523, in refreshTargets
rtn = self._refreshTargets(task2thread, objs = objs, callback = callback, updateFreq = updateFreq, exitOnFailure = exitOnFailure)
File "/usr/local/lib/python2.7/dist-packages/pypeflow/controller.py", line 657, in _refreshTargets
numAliveThreads = self.thread_handler.alive(task2thread.values())
File "/usr/local/lib/python2.7/dist-packages/pypeflow/pwatcher_bridge.py", line 198, in alive
fred.endrun(status)
File "/usr/local/lib/python2.7/dist-packages/pypeflow/pwatcher_bridge.py", line 73, in endrun
log.warning(''.join(traceback.format_stack()))
Maybe we need an extra option so we can ignore the heartbeats by default. It would not be easy for me to diagnose the problems you're having remotely.
It seems related to NFS under heavy load. With 4 jobs of daligner, under ~20 machines it does fine. Once I've got 20+ machines, it has the heartbeat-timeout issue. I'm very comfortable in python, and if you can just broadly point at what parts of the code need to be changed (commented out, set to false, etc.) in order to disable the heartbeat, I'd be glad to do some of the work myself (and provide the diff/pull) instead of waiting for PacBio to do it.
Ignore my suggestion - I hadn't realized the *done and *done.exit files were now obsolete, with everything handled through the process watcher.
You can set
But I'll release something pretty soon which turns the heartbeat off by default. It seems that we only need it internally, so I'll add configuration for that later.
Try the
Let's see if that helps. If not, then the problem might be in how often we check for finished jobs, as indicated by sentinel files. That's easy to adjust, but first I want to prove whether the heartbeat alone was actually the cause of your problems.
Ran 1.7.5
[INFO]starting job Job(jobid='J59a0bf847b5c8b0f8a8a87487c37aafba8b68d056b376cc9f1665828616b51a3', cmd='/bin/bash rj_0849.sh', rundir='/mnt/agenome/runs/run3/0-rawreads/job_0849', options={'job_queue': None, 'sge_option': '-pe orte 4 -q all.q -l h_vmem=60G', 'job_type': None})
warnings.warn(shutdown_msg)
[INFO]!qdel -k Je3230a485fdc04
[WARNING]Now, #tasks=3528, #alive=0
It's worth pointing out that for many (all?) of my previous runs, I had already grown the cluster to a decent size by the time the qsubs were done being queued. As a result, I assumed that the number of processes was a contributing factor. In this case, I had a single master (non-exec host) and a single 16-proc (r3.4xlarge) worker. So multiple processes do not seem to be a contributing factor in this crash, and probably weren't in the others either. While this is another crash, it has no mention of the heartbeat, which was present in all previous crashes. I'm not sure if that helps to focus the search, but it does seem to at least confirm that I was correctly running the new 1.7.5 code.
perhaps the most useful conclusion in my case. Does that help to point the finger? I hope so. But if I'm limited to 50 concurrent jobs, this isn't going to run at anywhere near the scale I was hoping for. :-(
You can look into a particular job directory and look at
Otherwise, I'll have to try to repro somehow.
It's very difficult to debug remotely. If you don't see any stdout/stderr in any
I'm going to put together some do-nothing sleep jobs, to test something unrelated on our network at PacBio. Then, maybe you can experiment to learn the exact cause of these problems, perhaps by submitting them manually. But this could be a lot of work for you.
I appreciate there are limits to how much debug work PacBio can do, and am
I am the network admin and the bioinformatician ... and the alpha and the
If we get to a stage where it makes sense, you can email me a public key,

3 things related to previous messages above also seem worth mentioning:

logger = logging.getLogger('foo') or similar would be helpful, but I imagine you have the experience to say

the docs remain a bit unclear to me, regarding an implied (but not
in my case after building falcon, I need to create a new ami, so that the

regarding your comment to flowers9, 'But git is worth learning.' I have a love-hate relationship with git. I'm not an expert but am
I figured I'd clean up flowers9's patch, fork the PacBio code, and then
Sorry, my suggestion was a complete mistake; I hadn't delved into just how much the process watcher now handles (it bit me in some other places, too). I do plan to upload a branch at some point, but I'm currently trying to get rid of a couple of other issues as well while I'm doing this: the mkdir() and qsub in rapid succession generally cause the first eight or so qsubs to error out because they can't find the just-created directories, and the qsub'ed process log files cause avoidable disk traffic (dgordon had mentioned this earlier in another thread). I'm currently expanding my heartbeat server to handle the log files as well, and then to store the whole process-watcher state file on a user-defined disk, so I can make it local to the fc_run.py machine instead of shared, since it would then only be touched by processes on the fc_run.py machine. I hadn't realized the state file was quite as important as it turns out to be, so it's taking a bit longer than I'd hoped. Namely, if you want to be able to restart the run, you've got to be very assiduous about saving the state, or else it starts over at the beginning.
I've been unable to reproduce the crash, even at substantial scale. There are at least 2 possible contributing factors: (1) changing the NFS mount params, and (2) having exhausted an initial burstable allowance from Amazon. I may try to reproduce this further in the future, but costs are running up and I've got a good-enough solution, since I'm about to step onto a plane. This is not to say there aren't aspects where it seems falcon might be made more robust, but I don't currently have a clean enough bug to report and allow reproduction. My aspects of this bug report can be closed, and I will leave it as only a flowers9-related post.
We try to provide helpful ways to integrate, and we test integration in TravisCI. But ultimately, every user's system is different. PacBio wants us to stop spending so much time on user integration problems.
That can be a big mistake. Some of the most time-consuming issues I've had debugging the problems encountered by other users were the result of previous system installs. But that's up to you. We no longer supply a copy of virtualenv because its latest version does not work on Centos6. If it works for you, great.
The 2nd (optional) argument to
That log is for debugging job-distribution, not the jobs themselves. Not very useful. Someday, we'll try to pull the messages from stderr for failed jobs.
Oh! Thank you! That explains a lot. Ok, that is something we must solve as quickly as possible. One bit of clarification might help: Is qsub itself failing, because the run-dir provided to qsub does not exist? Or is the top script of the qsub job itself failing, because it tries to cd into a non-existent directory? If you have any related snippet of an error log, I could avoid solving the wrong problem.
Yes. We will reduce the logging, but we cannot do that until things are basically working. Maybe we are near that point now. Ultimately, I'd like the logging to be written to a local disk and then copied -- by the qsub job -- into NFS. However, I don't think that would help within Amazon, as for @cariaso. In fact, on virtual machines, you might forever have problems with I/O. In Amazon, you might need to find the highest usable concurrency experimentally.
I think you're wasting effort. The heartbeats have not been proven necessary anywhere except on our "Pod". Personally, I like the idea of socket-based heartbeat checking, but you're getting into the area of Project Management. Internally, we have an "agile" workflow, the same as I used when I worked at Amazon. (Yes, I used to work at Amazon, and I am confident that I could produce an excellent, robust socket-based watcher if I were tasked with that specifically. Mostly, I do iterative improvements on Jason's original idea, which is aimed at the lowest common denominator, the filesystem.) "Agile" means writing a small bit at a time. Please don't be upset if we end up using only bits and pieces of whatever you come up with. I have confidence in your skills, and I will try to use as much as possible. But I apologize if we cannot use it all.
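For what it's worth, a socket-based heartbeat can be quite small. The sketch below is only to illustrate the idea, not code from pwatcher; the port, message format, and function names are all made up, and it assumes a single central listener on the fc_run host.

```python
import socket
import socketserver
import threading
import time

last_seen = {}  # jobid -> time of last heartbeat, instead of NFS file mtimes

class HeartbeatHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # Each heartbeat is one line containing the jobid.
        jobid = self.rfile.readline().strip().decode()
        if jobid:
            last_seen[jobid] = time.time()

def serve(host='0.0.0.0', port=9876):
    # Run the listener in a background thread on the fc_run host.
    server = socketserver.ThreadingTCPServer((host, port), HeartbeatHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

def send_heartbeat(jobid, host, port=9876):
    # What the per-job wrapper would do periodically, instead of touching NFS.
    with socket.create_connection((host, port), timeout=10) as s:
        s.sendall((jobid + '\n').encode())

def is_alive(jobid, timeout=1800):
    # The watcher's check: no NFS stat, just a dictionary lookup.
    return time.time() - last_seen.get(jobid, 0.0) < timeout
```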
You may have misunderstood the overall philosophy. The files in the dependency graph are unrelated to the sentinel files of the fs-based pwatcher. If you mix those, then we will never accept your solution. Normally, I
Ideally, the "watcher" would be socket-based instead of filesystem-based, so sentinel files would be completely unnecessary. We really should discuss pwatcher enhancements in a separate thread (or threads). You have a lot of excellent ideas, and your local experience is invaluable.
Regarding the problem starting jobs: From falcon's point of view, the jobs just take a long time to start (possibly forever ;). From grid's point of view, the jobs get put into an error state, with the error that when it tried to start the job, it couldn't chdir to the given directory. Which is presumably because the newly created directory wasn't yet available on the remote machine. Using qmod to clear the error state then allows them to start normally. I'm not planning to mix things - my network_based watcher code is based on a copy of your fs_based watcher code. I admit, I'm tossing the logs into it, but that's by modifying the heartbeat wrapper script to forward the output from the wrapped script, so it's still in the area of the process-watcher codebase (as I also copied mains/fs_based and modified that as part of the heartbeat change, and that's where the log change is). Honestly, I'm not expecting you to take any of this, after my previous experience with offering code. But the current version seems to give better results than the older (locally modified) version, so I'd like to try to get this working better, if only on my machines. I figure it's polite to upload my results, if only so you can see what I did, though that's why I'm somewhat resistant to going through the git learning experience again.
Here's the error message (from qstat -c)
I think I see what you mean about deleting the mypwatcher directory. I'd noticed that sometimes when I restarted a job (small, thankfully ;) it restarted from the beginning, and I think that was because I had not deleted the mypwatcher directory. Thanks for the note, though, as it helps make sure that when I restart large jobs they don't start over. One reason I'm trying to keep the mypwatcher directory is that I would like to be able to re-acquire running jobs, and for the socket-based approach, that means I need to know which socket the previous server was on (just as, with your filesystem-based code, you lose the exit-status and heartbeat files of running jobs when you delete it). I'd forgotten that when falcon shuts itself down (as opposed to eating a SIGTERM), it qdels running jobs, making the issue moot (at the moment).
I don't think my current approach will do much for cloud jobs, which is just as well, as I don't currently use the cloud, and couldn't test it there.
That's an interesting idea. But it would require a different work-around for each job-dist system. Why not just wait for the directory to be available everywhere before even submitting the job? That's the simple work-around I plan to do Thursday. (I'm not allowed to do it today.)
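Roughly the sort of thing meant by that simple work-around, sketched below. The helper name is made up and the qsub flags are just an example; grid setups differ, and an exec host with a stale NFS attribute cache could still lag, so a remote-side retry (or clearing the error state with qmod) may also be needed.

```python
import os
import subprocess
import time

def wait_for_dir(path, timeout=60.0, poll=1.0):
    """Block until 'path' is visible on this host's view of the shared filesystem."""
    deadline = time.time() + timeout
    while not os.path.isdir(path):
        if time.time() > deadline:
            raise RuntimeError('Directory never appeared: %r' % path)
        time.sleep(poll)

def submit(rundir, script):
    os.makedirs(rundir)   # fresh job directory on NFS
    wait_for_dir(rundir)  # don't hand qsub a path it might not see yet
    # '-wd' (working directory) is an SGE example; other grids use other flags.
    subprocess.check_call(['qsub', '-wd', rundir, script])
```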
The reason I have not done this already is that sometimes the output is very, very long. Copying the last few hundred lines might be ok though. I also want the stderr/stdout to go into /tmp initially, when the job runs there. (
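A small sketch of the "copy only the last few hundred lines" idea; the paths and line limit are placeholders. The job would write its stdout/stderr locally (e.g. under /tmp) while running, and only a bounded tail would land on NFS at the end.

```python
from collections import deque

def copy_log_tail(local_log, nfs_log, max_lines=500):
    # Keep only the last max_lines lines of the local log...
    with open(local_log) as src:
        tail = deque(src, maxlen=max_lines)
    # ...and write just that tail to the shared filesystem.
    with open(nfs_log, 'w') as dst:
        dst.writelines(tail)
```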
Much appreciated! I think we can make your watcher-replacement available optionally by configuration. So maybe we can take it as is.
Right, but that's fine. It won't be the default.
Good! I'm pretty sure that's a qsub message, so it's failing because of the
All we need to do is to put the first
Ok, so maybe we really do need different solutions for different systems. qmod+retry could be fine for SGE.
Exactly! That's the whole goal of the pwatcher "state". The watcher has 3 commands in its API, "run", "query", and "delete". We would need to start with "query" to find already running jobs. And we would need to provide users with a simple way to call "delete" when we stop killing their jobs on exit. |
I didn't mean that you should have falcon do the qmod to clear the error state. I was just saying that when I did, the job then worked normally. My approach is to get rid of the per-job directory, which isn't needed if the wrapper puts everything somewhere else (i.e., sends it off to the heartbeat server). For grid, at least, the wrapper does not need to be available on the remote machine - grid makes a copy for itself and handles the distribution (which is one reason why you want to give grid short scripts ;). My latest revision (currently being tested) allows the user to specify a directory to put the watcher data into, and I'm currently placing it on a non-shared filesystem. Seems to work so far!
@flowers9, Will you be at the SMRT Biofx Developer's conference in Gaithersburg this month? And btw, where do you work/live? Just wondering. Maybe there would be a way for us to get together sometime. |
Nope, I'm not much of a conference goer. Not good at networking, which leaves little point. My job is in Huntsville, AL, at the HudsonAlpha Institute for Biotechnology. I live and work in San Jose, CA, though. |
Since you are nearby, we should meet face to face next time @pb-cdunn is in town. |
LAmerge has always generated a lot of nfs traffic, and load on our disk server. However, with the introduction of a heartbeat based on access to the same filesystem, I've been seeing a lot of spurious heartbeat failures that cause the entire run to stop. I'd suggest allowing the heartbeat to be on a different filesystem, or possibly not on a filesystem at all (this seems like a task well suited to a transactional database type thing). I've also had problems with grid putting tasks immediately into an error state because it couldn't find the watcher directories that had just been created for the task.
I've mitigated the heartbeat issue a little by raising the interval from 10s to 10m (I've never actually had a process die in a way the heartbeat is designed to spot). The other problem I'm not sure how to approach.