This repository has been archived by the owner on Mar 16, 2022. It is now read-only.

heartbeat gives false negative under heavy nfs load #433

Open
flowers9 opened this issue Aug 19, 2016 · 30 comments

@flowers9

LAmerge has always generated a lot of nfs traffic, and load on our disk server. However, with the introduction of a heartbeat based on access to the same filesystem, I've been seeing a lot of spurious heartbeat failures that cause the entire run to stop. I'd suggest allowing the heartbeat to be on a different filesystem, or possibly not on a filesystem at all (this seems like a task well suited to a transactional database type thing). I've also had problems with grid putting tasks immediately into an error state because it couldn't find the watcher directories that had just been created for the task.

I've mitigated the heartbeat issue a little by raising the rate from 10s to 10m (I've never actually had a process die in a way the heartbeat is designed to spot). The other problem I'm not sure how to approach.

@pb-cdunn

I've mitigated the heartbeat issue a little by raising the rate from 10s to 10m.

The default heartbeat interval is too short; you're doing it right. However, the frequency of checking the heartbeat also needs to drop. Then, at some point, the issue becomes a non-issue.

And there could be a separate interval for checking the exit files. Also, we could stat the timestamp less often than we check for existence. I'll look into that later.
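For reference, the fs-based check is essentially a timestamp comparison; here is a minimal sketch (the constant names and the mtime-based test are illustrative, not the exact fs_based.py code):

import os
import time

HEARTBEAT_RATE_S = 600.0   # raised from the 10 s default, as you did
ALLOWED_SKEW_S = 120.0     # slack for NFS attribute caching and clock skew

def heartbeat_looks_dead(heartbeat_path, rate=HEARTBEAT_RATE_S, skew=ALLOWED_SKEW_S):
    # The watcher logs a "DEAD job?" warning when, roughly:
    #   last_touch + 3*rate + skew < now
    try:
        last_touch = os.path.getmtime(heartbeat_path)
    except OSError:
        return True  # missing file; the real watcher may treat this case differently
    return last_touch + 3 * rate + skew < time.time()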

(I've never actually had a process die in a way the heartbeat is designed to spot).

Good. Understand that when this weird failure does occur, the user would wait forever! It's not a problem to be ignored.

I'd suggest allowing the heartbeat to be on a different filesystem, or possibly not on a filesystem at all (this seems like a task well suited to a transactional database type thing).

We are very unhappy at having to rely on the filesystem. Unfortunately, we know nothing about the network available for any given user. The filesystem is the lowest common resource, and I don't think that filesystem locks (on a database) would improve the situation, though it's worth trying.

However, the architecture is flexible. "pwatcher" is only meant as the universal solution, not the best. We could quickly code other plugins. E.g. we could take advantage of UNIX sockets (if available for grid machines). Heartbeats could be calls back to a central server. Or better: the central server could query the running jobs for status, so we would not need heartbeats at all. We could hoard a number of grid hosts, to reduce the delays in host acquisition for the case of huge numbers of very quick jobs, and each host could have a single process to communicate with the central server. Or we could go the other direction: rely on qstat instead of watching jobs ourselves.

The watcher has a simple JSON API. It has 3 commands:

  • run
  • query
  • delete

If you write an alternative watcher, we could make that available to everyone by configuration. We might need a matrix of watchers, with grid-system on one axis.
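To sketch the shape of such a plugin (the three command names come from the list above; the class layout and argument names here are assumptions, not the actual pwatcher interface):

class AlternativeWatcher(object):
    # A watcher plugin only has to answer the same three commands.

    def run(self, jobs):
        # Submit jobs, e.g. {jobid: {'cmd': ..., 'rundir': ..., 'options': ...}},
        # and remember enough to find them again later.
        raise NotImplementedError

    def query(self, jobids=None):
        # Report status per jobid, e.g. {'Jabc...': 'RUNNING'}.
        raise NotImplementedError

    def delete(self, jobids=None):
        # Stop and forget the given jobs (qdel on SGE, kill for local runs, etc.).
        raise NotImplementedError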

But this is super low priority for PacBio, where the main goal is Sequel improvements. A longer heartbeat solves most problems. I guess that needs to be configurable, yes? That can be in our next sprint.

@flowers9

flowers9 commented Aug 26, 2016

Okay, done. I added an option to the config file to choose the watcher type, defaulting to fs_type. It'll probably take me a while to upload a fork and make a pull request, though, because I suck at git (it took me two days of work to manage it last time).

One question, though - I also have a python script to allow the server to be queried from the command line (you just specify the server:port or point to the run directory so it can grab the info from state.py). Where should I put the script, as it seems like a good thing to include with the patch?
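Roughly, the command-line side is just a thin socket client; something like this minimal sketch (the wire protocol and argument handling here are placeholders, not the actual script):

import socket
import sys

def query_watcher(host, port, command='query'):
    # Send one command and return whatever the watcher server replies, as text.
    sock = socket.create_connection((host, int(port)), timeout=10)
    try:
        sock.sendall((command + '\n').encode('ascii'))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    finally:
        sock.close()
    return b''.join(chunks).decode('ascii', 'replace')

if __name__ == '__main__':
    server, _, port = sys.argv[1].partition(':')  # e.g. myhost:12345
    print(query_watcher(server, port))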

I have tried to make it reasonably durable, but there are certainly things it misses, like if the heartbeat wrapper fails. And it's possible it'll have problems in environments unlike mine, but that's hard for me to test. ;)

@pb-cdunn

Yeah, coding is always the easy (and fun) part. But git is worth learning.

Put the script into mains, with a nice long unique name.... Actually, we should probably put this stuff into pypeflow. But don't worry: Once I see a branch from your fork, I can refactor quickly.

it's possible it'll have problems in environments unlike mine, but that's hard for me to test. ;)

Exactly! That's my biggest problem. So I'm happy to take contributions that work well in specific systems.

there are certainly things it misses, like if the heartbeat wrapper fails

The heartbeat wrapper was really meant only for the specific case of weird process terminations on our local "Pod". I don't think any external users have reported such a problem, so maybe we don't even need it. And to the extent that it's useful, we might be better off using UNIX sockets to query the top process. I'll look at what you have when I see it.

Btw, I think we need -k on our qdel lines for full process-group deletion in SGE. Have you seen dangling processes when a task fails? And if so, could you try adding -k to the qdel line? (If you've seen this, let's make a separate Issue for it.)

@cariaso

cariaso commented Aug 28, 2016

Just to reinforce: I'm seeing exactly this same issue. Specific messages below.

As soon as all of the qsubs are done, it waits only briefly, then shows:

[WARNING]DEAD job? 1472384971.0 + 3*10.0 + 120.0 < 1472389910.09 for '/mnt/agenome/runs/run3/mypwatcher/heartbeats/heartbeat-Jcbb6500b96565a8dc416adb62c974fad0492ae31fc9a7cc97242a49a09d90082'
[WARNING]DEAD job? 1472384987.93 + 3*10.0 + 120.0 < 1472389910.09 for '/mnt/agenome/runs/run3/mypwatcher/heartbeats/heartbeat-J31de22c786cf45c0fdc8279e2853be68e9bf1d5e16ed6a3cc0d65173a1564331'

...

[INFO]Submitted backgroundjob=MetaJobSge(MetaJob(job=Job(jobid='J5c2ec8100e21e6f91e6968b24dcfe05e98d7d160a7d334ce20ca8d8f4514a68f', cmd='/bin/bash rj_032e.sh', rundir='/mnt/agenome/runs/run3/0-rawreads/job_032e', options={'sge_option': '-pe orte 4 -q all.q -l h_vmem=60G', 'job_type': None}), lang_exe='/bin/bash'))

[WARNING]DEAD job? 1472384971.0 + 3*10.0 + 120.0 < 1472389910.09 for '/mnt/agenome/runs/run3/mypwatcher/heartbeats/heartbeat-Jcbb6500b96565a8dc416adb62c974fad0492ae31fc9a7cc97242a49a09d90082'
[WARNING]DEAD job? 1472384987.93 + 3*10.0 + 120.0 < 1472389910.09 for '/mnt/agenome/runs/run3/mypwatcher/heartbeats/heartbeat-J31de22c786cf45c0fdc8279e2853be68e9bf1d5e16ed6a3cc0d65173a1564331'
[WARNING]DEAD job? 1472384977.32 + 3*10.0 + 120.0 < 1472389910.09 for '/mnt/agenome/runs/run3/mypwatcher/heartbeats/heartbeat-Jbef4f46b68b1cc3d9f54a543682caba618c8944d20ac5e554385342d713544db'
[WARNING]  File "/usr/local/bin/fc_run", line 9, in <module>
    load_entry_point('falcon-kit==0.7', 'console_scripts', 'fc_run')()
  File "/usr/local/lib/python2.7/dist-packages/falcon_kit/mains/run1.py", line 563, in main
    main1(argv[0], args.config, args.logger)
  File "/usr/local/lib/python2.7/dist-packages/falcon_kit/mains/run1.py", line 327, in main1
    setNumThreadAllowed=PypeProcWatcherWorkflow.setNumThreadAllowed)
  File "/usr/local/lib/python2.7/dist-packages/falcon_kit/mains/run1.py", line 405, in run
    wf.refreshTargets(exitOnFailure=exitOnFailure)
  File "/usr/local/lib/python2.7/dist-packages/pypeflow/controller.py", line 523, in refreshTargets
    rtn = self._refreshTargets(task2thread, objs = objs, callback = callback, updateFreq = updateFreq, exitOnFailure = exitOnFailure)
  File "/usr/local/lib/python2.7/dist-packages/pypeflow/controller.py", line 657, in _refreshTargets
    numAliveThreads = self.thread_handler.alive(task2thread.values())
  File "/usr/local/lib/python2.7/dist-packages/pypeflow/pwatcher_bridge.py", line 198, in alive
    fred.endrun(status)
  File "/usr/local/lib/python2.7/dist-packages/pypeflow/pwatcher_bridge.py", line 73, in endrun
    log.warning(''.join(traceback.format_stack()))

[ERROR]Task Fred{'URL': 'task://localhost/d_0108_raw_reads'}
 is DEAD, meaning no HEARTBEAT, but this can be a race-condition. If it was not killed, then restarting might suffice. Otherwise, you might have excessive clock-skew.
[ERROR]Task Fred{'URL': 'task://localhost/d_0b91_raw_reads'} failed with exit-code=35072
[ERROR]Task Fred{'URL': 'task://localhost/d_09a9_raw_reads'} failed with exit-code=35072
[WARNING]  File "/usr/local/bin/fc_run", line 9, in <module>
    load_entry_point('falcon-kit==0.7', 'console_scripts', 'fc_run')()
  File "/usr/local/lib/python2.7/dist-packages/falcon_kit/mains/run1.py", line 563, in main
    main1(argv[0], args.config, args.logger)
  File "/usr/local/lib/python2.7/dist-packages/falcon_kit/mains/run1.py", line 327, in main1
    setNumThreadAllowed=PypeProcWatcherWorkflow.setNumThreadAllowed)
  File "/usr/local/lib/python2.7/dist-packages/falcon_kit/mains/run1.py", line 405, in run
    wf.refreshTargets(exitOnFailure=exitOnFailure)
  File "/usr/local/lib/python2.7/dist-packages/pypeflow/controller.py", line 523, in refreshTargets
    rtn = self._refreshTargets(task2thread, objs = objs, callback = callback, updateFreq = updateFreq, exitOnFailure = exitOnFailure)
  File "/usr/local/lib/python2.7/dist-packages/pypeflow/controller.py", line 657, in _refreshTargets
    numAliveThreads = self.thread_handler.alive(task2thread.values())
  File "/usr/local/lib/python2.7/dist-packages/pypeflow/pwatcher_bridge.py", line 198, in alive
    fred.endrun(status)
  File "/usr/local/lib/python2.7/dist-packages/pypeflow/pwatcher_bridge.py", line 73, in endrun
    log.warning(''.join(traceback.format_stack()))

[ERROR]Task Fred{'URL': 'task://localhost/d_0344_raw_reads'}
 is DEAD, meaning no HEARTBEAT, but this can be a race-condition. If it was not killed, then restarting might suffice. Otherwise, you might have excessive clock-skew.
[WARNING]  File "/usr/local/bin/fc_run", line 9, in <module>
    load_entry_point('falcon-kit==0.7', 'console_scripts', 'fc_run')()
  File "/usr/local/lib/python2.7/dist-packages/falcon_kit/mains/run1.py", line 563, in main
    main1(argv[0], args.config, args.logger)
  File "/usr/local/lib/python2.7/dist-packages/falcon_kit/mains/run1.py", line 327, in main1
    setNumThreadAllowed=PypeProcWatcherWorkflow.setNumThreadAllowed)
  File "/usr/local/lib/python2.7/dist-packages/falcon_kit/mains/run1.py", line 405, in run
    wf.refreshTargets(exitOnFailure=exitOnFailure)
  File "/usr/local/lib/python2.7/dist-packages/pypeflow/controller.py", line 523, in refreshTargets
    rtn = self._refreshTargets(task2thread, objs = objs, callback = callback, updateFreq = updateFreq, exitOnFailure = exitOnFailure)
  File "/usr/local/lib/python2.7/dist-packages/pypeflow/controller.py", line 657, in _refreshTargets
    numAliveThreads = self.thread_handler.alive(task2thread.values())
  File "/usr/local/lib/python2.7/dist-packages/pypeflow/pwatcher_bridge.py", line 198, in alive
    fred.endrun(status)
  File "/usr/local/lib/python2.7/dist-packages/pypeflow/pwatcher_bridge.py", line 73, in endrun
    log.warning(''.join(traceback.format_stack()))
[ERROR]Task Fred{'URL': 'task://localhost/d_03f7_raw_reads'}
 is DEAD, meaning no HEARTBEAT, but this can be a race-condition. If it was not killed, then restarting might suffice. Otherwise, you might have excessive clock-skew.
[INFO]Queued 'task://localhost/d_0659_raw_reads' ...
[INFO]Queued 'task://localhost/d_04b7_raw_reads' ...
[INFO]Queued 'task://localhost/d_0a48_raw_reads' ...
[INFO]Queued 'task://localhost/d_0810_raw_reads' ...
[INFO]Queued 'task://localhost/d_0051_raw_reads' ...
[INFO]Queued 'task://localhost/d_005d_raw_reads' ...




...


[INFO]Queued 'task://localhost/d_0875_raw_reads' ...
[INFO]Queued 'task://localhost/d_0bac_raw_reads' ...
[INFO]Queued 'task://localhost/d_0729_raw_reads' ...
[INFO]Queued 'task://localhost/d_0709_raw_reads' ...
[INFO]Queued 'task://localhost/d_06ff_raw_reads' ...
[INFO]Failure ('fail'). Joining 'task://localhost/d_0108_raw_reads'...
[INFO]Failure ('fail'). Joining 'task://localhost/d_0b91_raw_reads'...
[INFO]Failure ('fail'). Joining 'task://localhost/d_09a9_raw_reads'...
[INFO]Failure ('fail'). Joining 'task://localhost/d_0344_raw_reads'...
[INFO]Failure ('fail'). Joining 'task://localhost/d_03f7_raw_reads'...
[CRITICAL]Any exception caught in RefreshTargets() indicates an unrecoverable error. Shutting down...
/usr/local/lib/python2.7/dist-packages/pypeflow/controller.py:537: UserWarning:
            "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
            "! Please wait for all threads / processes to terminate !"
            "! Also, maybe use 'ps' or 'qstat' to check all threads,!"
            "! processes and/or jobs are terminated cleanly.        !"
            "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"

  warnings.warn(shutdown_msg)

@pb-cdunn

Maybe we need an extra option so we can ignore the heartbeats by default. It would not be easy for me to diagnose the problems you're having remotely.

@pb-cdunn pb-cdunn self-assigned this Aug 28, 2016
@cariaso

cariaso commented Aug 28, 2016

It seems related to NFS under heavy load. With 4 daligner jobs, under ~20 machines it does fine. Once I've got 20+ machines, it hits the heartbeat timeout issue. I'm very comfortable in Python, and if you can just broadly point at what parts of the code need to be changed (commented out, set to false, etc.) in order to disable the heartbeat, I'd be glad to do some of the work myself (and provide the diff/pull request) instead of waiting for PacBio to do it.


@flowers9

flowers9 commented Aug 29, 2016

Ignore my suggestion - I hadn't realized the *done and *done.exit files were now obsolete, with everything handled through the process watcher.

@pb-cdunn

You can set HEARTBEAT_RATE_S=1000 in pypeFLOW/pwatcher/fs_based.py.

But I'll release something pretty soon which turns the heartbeat off by default. It seems that we only need it internally, so I'll add configuration for that later.

@pb-cdunn

Let's see if that helps. If not, then the problem might be in how often we check for finished jobs, indicated by sentinel files. That's easy to adjust, but I want to prove first whether the heartbeat alone was actually the cause of your problems.

@cariaso

cariaso commented Aug 29, 2016

Ran 1.7.5. It finishes queueing all ~3500 jobs and then immediately crashes.

[INFO]starting job Job(jobid='J59a0bf847b5c8b0f8a8a87487c37aafba8b68d056b376cc9f1665828616b51a3', cmd='/bin/bash rj_0849.sh', rundir='/mnt/agenome/runs/run3/0-rawreads/job_0849', options={'job_queue': None, 'sge_option': '-pe orte 4 -q all.q -l h_vmem=60G', 'job_type': None})
[INFO]!qsub -N J59a0bf847b5c8b -pe orte 4 -q all.q -l h_vmem=60G -V -cwd -o stdout -e stderr -S /bin/bash /mnt/agenome/runs/run3/mypwatcher/wrappers/run-J59a0bf847b5c8b0f8a8a87487c37aafba8b68d056b376cc9f1665828616b51a3.bash
[INFO]Submitted backgroundjob=MetaJobSge(MetaJob(job=Job(jobid='J59a0bf847b5c8b0f8a8a87487c37aafba8b68d056b376cc9f1665828616b51a3', cmd='/bin/bash rj_0849.sh', rundir='/mnt/agenome/runs/run3/0-rawreads/job_0849', options={'job_queue': None, 'sge_option': '-pe orte 4 -q all.q -l h_vmem=60G', 'job_type': None}), lang_exe='/bin/bash'))
[INFO]starting job Job(jobid='J09fb5c3483c74fab84f1cfea25661f481340e5a32d7d1aa5452ea71637f1bd1a', cmd='/bin/bash rj_0b3e.sh', rundir='/mnt/agenome/runs/run3/0-rawreads/job_0b3e', options={'job_queue': None, 'sge_option': '-pe orte 4 -q all.q -l h_vmem=60G', 'job_type': None})
[INFO]!qsub -N J09fb5c3483c74f -pe orte 4 -q all.q -l h_vmem=60G -V -cwd -o stdout -e stderr -S /bin/bash /mnt/agenome/runs/run3/mypwatcher/wrappers/run-J09fb5c3483c74fab84f1cfea25661f481340e5a32d7d1aa5452ea71637f1bd1a.bash
[INFO]Submitted backgroundjob=MetaJobSge(MetaJob(job=Job(jobid='J09fb5c3483c74fab84f1cfea25661f481340e5a32d7d1aa5452ea71637f1bd1a', cmd='/bin/bash rj_0b3e.sh', rundir='/mnt/agenome/runs/run3/0-rawreads/job_0b3e', options={'job_queue': None, 'sge_option': '-pe orte 4 -q all.q -l h_vmem=60G', 'job_type': None}), lang_exe='/bin/bash'))
[INFO]starting job Job(jobid='J5c2ec8100e21e6f91e6968b24dcfe05e98d7d160a7d334ce20ca8d8f4514a68f', cmd='/bin/bash rj_032e.sh', rundir='/mnt/agenome/runs/run3/0-rawreads/job_032e', options={'job_queue': None, 'sge_option': '-pe orte 4 -q all.q -l h_vmem=60G', 'job_type': None})
[INFO]!qsub -N J5c2ec8100e21e6 -pe orte 4 -q all.q -l h_vmem=60G -V -cwd -o stdout -e stderr -S /bin/bash /mnt/agenome/runs/run3/mypwatcher/wrappers/run-J5c2ec8100e21e6f91e6968b24dcfe05e98d7d160a7d334ce20ca8d8f4514a68f.bash
[INFO]Submitted backgroundjob=MetaJobSge(MetaJob(job=Job(jobid='J5c2ec8100e21e6f91e6968b24dcfe05e98d7d160a7d334ce20ca8d8f4514a68f', cmd='/bin/bash rj_032e.sh', rundir='/mnt/agenome/runs/run3/0-rawreads/job_032e', options={'job_queue': None, 'sge_option': '-pe orte 4 -q all.q -l h_vmem=60G', 'job_type': None}), lang_exe='/bin/bash'))
[ERROR]Task Fred{'URL': 'task://localhost/d_0521_raw_reads'} failed with exit-code=35072
[ERROR]Task Fred{'URL': 'task://localhost/d_0a03_raw_reads'} failed with exit-code=35072
[ERROR]Task Fred{'URL': 'task://localhost/d_0d69_raw_reads'} failed with exit-code=35072
[ERROR]Task Fred{'URL': 'task://localhost/d_0970_raw_reads'} failed with exit-code=35072
[ERROR]Task Fred{'URL': 'task://localhost/d_03bd_raw_reads'} failed with exit-code=256
[ERROR]Task Fred{'URL': 'task://localhost/d_0669_raw_reads'} failed with exit-code=35072
[ERROR]Task Fred{'URL': 'task://localhost/d_0bd4_raw_reads'} failed with exit-code=35072
[ERROR]Task Fred{'URL': 'task://localhost/d_0108_raw_reads'} failed with exit-code=35072
[ERROR]Task Fred{'URL': 'task://localhost/d_033f_raw_reads'} failed with exit-code=35072
[ERROR]Task Fred{'URL': 'task://localhost/d_0dab_raw_reads'} failed with exit-code=36608
[ERROR]Task Fred{'URL': 'task://localhost/d_0b9e_raw_reads'} failed with exit-code=35072
[ERROR]Task Fred{'URL': 'task://localhost/d_0775_raw_reads'} failed with exit-code=35072
[ERROR]Task Fred{'URL': 'task://localhost/d_04e5_raw_reads'} failed with exit-code=36608
[ERROR]Task Fred{'URL': 'task://localhost/d_0942_raw_reads'} failed with exit-code=35072
[ERROR]Task Fred{'URL': 'task://localhost/d_0d5a_raw_reads'} failed with exit-code=35072
[ERROR]Task Fred{'URL': 'task://localhost/d_0041_raw_reads'} failed with exit-code=11
[ERROR]Task Fred{'URL': 'task://localhost/d_03ca_raw_reads'} failed with exit-code=35072
[ERROR]Task Fred{'URL': 'task://localhost/d_0b21_raw_reads'} failed with exit-code=35072
[ERROR]Task Fred{'URL': 'task://localhost/d_0bb5_raw_reads'} failed with exit-code=256
[ERROR]Task Fred{'URL': 'task://localhost/d_09b5_raw_reads'} failed with exit-code=256
[ERROR]Task Fred{'URL': 'task://localhost/d_0b91_raw_reads'} failed with exit-code=35072
[ERROR]Task Fred{'URL': 'task://localhost/d_0026_raw_reads'} failed with exit-code=35072
[ERROR]Task Fred{'URL': 'task://localhost/d_05ed_raw_reads'} failed with exit-code=36608
[ERROR]Task Fred{'URL': 'task://localhost/d_09a9_raw_reads'} failed with exit-code=35072
...
[ERROR]Task Fred{'URL': 'task://localhost/d_08f0_raw_reads'} failed with exit-code=256
[ERROR]Task Fred{'URL': 'task://localhost/d_0b5b_raw_reads'} failed with exit-code=256
[INFO]Queued 'task://localhost/d_01d6_raw_reads' ...
[INFO]Queued 'task://localhost/d_03d7_raw_reads' ...
[INFO]Queued 'task://localhost/d_000b_raw_reads' ...
[INFO]Queued 'task://localhost/d_0512_raw_reads' ...
[INFO]Queued 'task://localhost/d_096b_raw_reads' ...
[INFO]Queued 'task://localhost/d_0559_raw_reads' ...
[INFO]Queued 'task://localhost/d_0be7_raw_reads' ...
...
[INFO]Failure ('fail'). Joining 'task://localhost/d_09e0_raw_reads'...
[INFO]Failure ('fail'). Joining 'task://localhost/d_0578_raw_reads'...
[INFO]Failure ('fail'). Joining 'task://localhost/d_08f0_raw_reads'...
[INFO]Failure ('fail'). Joining 'task://localhost/d_0b5b_raw_reads'...
[CRITICAL]Any exception caught in RefreshTargets() indicates an unrecoverable error. Shutting down...
/home/ubuntu/FALCON-integrate/pypeFLOW/pypeflow/controller.py:537: UserWarning:
            "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
            "! Please wait for all threads / processes to terminate !"
            "! Also, maybe use 'ps' or 'qstat' to check all threads,!"
            "! processes and/or jobs are terminated cleanly.        !"
            "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"

  warnings.warn(shutdown_msg)
[WARNING]#tasks=3528, #alive=3489
[INFO]!qdel -k Je2e62b71e9af1b
[INFO]!qdel -k Ja9b47339e36dde
[INFO]!qdel -k J4d8c875fcd10a5
[INFO]!qdel -k Jf5c7149bb535a8
[INFO]!qdel -k Jcd24881c5f4215
...

[INFO]!qdel -k Je3230a485fdc04
[INFO]!qdel -k J09fb5c3483c74f
[INFO]!qdel -k J5c2ec8100e21e6

[WARNING]Now, #tasks=3528, #alive=0
Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/fc_run", line 11, in <module>
    load_entry_point('falcon-kit', 'console_scripts', 'fc_run')()
  File "/home/ubuntu/FALCON-integrate/FALCON/falcon_kit/mains/run1.py", line 576, in main
    main1(argv[0], args.config, args.logger)
  File "/home/ubuntu/FALCON-integrate/FALCON/falcon_kit/mains/run1.py", line 353, in main1
    setNumThreadAllowed=PypeProcWatcherWorkflow.setNumThreadAllowed)
  File "/home/ubuntu/FALCON-integrate/FALCON/falcon_kit/mains/run1.py", line 431, in run
    wf.refreshTargets(exitOnFailure=exitOnFailure)
  File "/home/ubuntu/FALCON-integrate/pypeFLOW/pypeflow/controller.py", line 548, in refreshTargets
    raise Exception('Caused by:\n' + tb)
Exception: Caused by:
Traceback (most recent call last):
  File "/home/ubuntu/FALCON-integrate/pypeFLOW/pypeflow/controller.py", line 523, in refreshTargets
    rtn = self._refreshTargets(task2thread, objs = objs, callback = callback, updateFreq = updateFreq, exitOnFailure = exitOnFailure)
  File "/home/ubuntu/FALCON-integrate/pypeFLOW/pypeflow/controller.py", line 740, in _refreshTargets
    raise TaskFailureError("Counted %d failure(s) with 0 successes so far." %failedJobCount)
TaskFailureError: 'Counted 39 failure(s) with 0 successes so far.'

@cariaso

cariaso commented Aug 29, 2016

It's worth pointing out that for many (all?) of my previous runs, I had already grown the cluster to a decent size by the time the qsubs were done being queued. As a result, I assumed that the number of processes was a contributing factor.

In this case, I had a single master (non-exec host) and a single 16-proc (r3.4xlarge) worker. So multiple processes do not seem to be a contributing factor in this crash, and probably weren't in the others either.

While this is another crash, it makes no mention of the heartbeat, which was present in all previous crashes. I'm not sure if that helps focus the search, but it does at least seem to confirm that I was correctly running the new 1.7.5 code.

@cariaso

cariaso commented Aug 29, 2016

Perhaps the most useful conclusion in my case: with
pa_concurrent_jobs = 50
it runs; with
pa_concurrent_jobs = 60
it crashes immediately after the qsubs have been submitted. This is true even when there is only a single worker.

Does that help to point the finger?

I hope so. But if I'm limited to 50 concurrent jobs, this isn't going to run at anywhere near the scale I was hoping for. :-(

@pb-cdunn

pb-cdunn commented Aug 29, 2016

You can look into a particular job directory and look at pwatcher.dir/stderr. (The top-level log is rarely useful.) That might tell us exactly what happened, if the job actually started.

Otherwise, I'll have to try to repro somehow.

@pb-cdunn

It's very difficult to debug remotely.

If you don't see any stdout/stderr in any pwatcher.dir, then the jobs are not actually starting. In that case, it's possible that your Grid system is killing jobs when you submit too many. You might have to talk to a sysadmin. I notice from your debug log that there are several different exit codes: 256, 11, 35072, and 36608. That might be significant too. You might need to look into SGE logs to see why SGE thinks your jobs are failing.
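If those numbers are raw 16-bit wait statuses (an assumption; your queue may already have decoded them), they unpack to exit 1, SIGSEGV, exit 137 (often a SIGKILL, e.g. out-of-memory on the node), and exit 143 (SIGTERM, e.g. the queue killing the job):

import os

# Decode waitpid()-style statuses: high byte is the exit code, low byte the signal.
for status in (256, 11, 35072, 36608):
    if os.WIFSIGNALED(status):
        print('status %5d: killed by signal %d' % (status, os.WTERMSIG(status)))
    elif os.WIFEXITED(status):
        print('status %5d: exited with code %d' % (status, os.WEXITSTATUS(status)))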

I'm going to put together some do-nothing sleep jobs, to test something unrelated on our network at PacBio. Then, maybe you can experiment to learn the exact cause of these problems, perhaps by submitting them manually. But this could be a lot of work for you.

@cariaso

cariaso commented Aug 30, 2016

I appreciate there are limits to how much debug work PacBio can do, and I am pleased with the level of support I've previously received from you and your team. I say the rest of what's below not because I expect you to take on the full responsibility, but to indicate where there are opportunities that might be easier than you'd expect.

I am the network admin and the bioinformatician ... and the alpha and the omega ... and all the creme-filled layers in between. It's a StarCluster inside Amazon. I turn it on and off, and have already turned on a separate cluster just for running the test that captures the stderr & stdout files in a totally clean environment. I should have those results soonish.

If we get to a stage where it makes sense, you can email me a public key, and I'll give you an ssh login on the actual machine.

Three things related to previous messages above also seem worth mentioning while I'm writing the rest of this.

1. As you've mentioned, the top-level 'all.log' seems rarely useful. It's also incredibly verbose. Presumably a

logger = logging.getLogger('foo')
logger.setLevel(logging.WARN)

or similar would be helpful, but I imagine you have the experience to say where to add it, and what value to use for 'foo'. Alternatively, and preferably, is this controllable from the fc_run config file?

2. The docs remain a bit unclear to me regarding an implied (but not explicit) 'make install' step in https://github.com/PacificBiosciences/FALCON-integrate/wiki/Installation

In my case, after building falcon, I need to create a new AMI so that the daligner and other commands are available to all workers. (Presumably I could install it onto a shared drive, but since I already need to make an AMI so that I have SGE on Ubuntu 16, I may as well bake in falcon at the same time.)

Your documents give 3 scenarios for types of installation, but none of them actually do the make install. While I use virtualenv quite a bit for my own Python development, it's unclear how best to automate making it part of the environment on a newly booted worker machine. I've instead opted to modify 'env.sh' before 'source'ing it, so that
FALCON_PREFIX=/usr/local
and then doing a
sudo -E make install
at the end. Lastly, I'm modifying /etc/profile with
PATH=/home/ubuntu/.local/bin:$PATH
as some of the tools are only there. This is definitely a hack, but since I'm the only user on these machines, and they'll be destroyed as soon as I'm done, it's fine. I would be grateful if this part of the documentation could be clearer.

3. Regarding your comment to flowers9, 'But git is worth learning.'

I have a love-hate relationship with git. I'm not an expert, but I am comfortable rebasing and regularly submit pull requests to various open source projects. As others have said, "git gets easier once you get the basic idea that branches are homeomorphic endofunctors mapping submanifolds of a Hilbert space." ;) (See also https://xkcd.com/1597/.)

I figured I'd clean up flowers9's patch, fork the PacBio code, and then perhaps submit a pull request. Most of his patch applied cleanly, but the very first hunk did not. It was easy to see the problem and fix it (I'm sure this reflects it being a 'pared down version of some related patches'), but even with that the patch still didn't work for me. Instead, the lack of the 'q' variable caused a problem in pypeFLOW/pypeflow/pwatcher_bridge.py, and my attempt to replace it with a None, then a None-ish {'jobids':{}}, got nowhere; then your 1.7.5 showed up. However, I did spend a few minutes trying to update his patch, only to discover that apparently git can't diff across submodules (modulo https://gist.github.com/tomykaira/40abc2ba411be28bee67), which does make what he was trying to do shockingly difficult considering how trivial it would be under the old regime.


@flowers9

Sorry, my suggestion was a complete mistake; I hadn't delved into just how much the process watcher now handles (it bit me in some other places, too). I do plan to upload a branch at some point, but I'm currently trying to get rid of a couple of other issues as well while I'm doing this: the mkdir() and qsub in rapid succession generally causes the first eight or so qsubs to error out because they can't find the just-created directories, and the qsub'ed process log files cause avoidable disk traffic (dgordon had mentioned this earlier in another thread). I'm currently expanding my heartbeat server to handle the log files as well, and then to store the whole process watcher state file on a user-defined disk, so I can make it local to the fc_run.py machine instead of shared, since it would then only be touched by processes on the fc_run.py machine.

I hadn't realized the state file was quite as important as it turns out to be, so it's taking a bit longer than I'd hoped. Namely, if you want to be able to restart the run, you've got to be very assiduous about saving the state, or else it starts over at the beginning.

@cariaso

cariaso commented Aug 30, 2016

I've been unable to reproduce the crash, even at substantial scale. There are at least two possible contributing factors: (1) changing the NFS mounting params, and (2) having exhausted an initial burstable allowance from Amazon. I may try to reproduce this further in the future, but costs are running up and I've got a good-enough solution, since I'm about to step onto a plane. This is not to say there aren't aspects where it seems falcon might be made more robust, but I don't currently have a clean enough bug to report and allow reproduction. My aspects of this bug report can be closed, and I will leave it as a flowers9-related issue only.

@pb-cdunn

Your documents give 3 scenarios for types of installation, but none of them actually do the make install.

make all runs the following in FALCON-make:

all: checklist
        ${MAKE} show
        ${MAKE} install-pip
        ${MAKE} install
        ${MAKE} check

We try to provide helpful ways to integrate, and we test integration in TravisCI. But ultimately, every user's system is different. PacBio wants us to stop spending so much time on user integration problems.

FALCON_PREFIX=/usr/local

That can be a big mistake. Some of the most time-consuming issues I've had debugging the problems encountered by other users were the result of previous system installs. But that's up to you.

We no longer supply a copy of virtualenv because its latest version does not work on CentOS 6. If it works for you, great. bash config-standard.sh should work for you just fine, after you have activated your virtualenv.

Alternatively and preferably, is this controllable from the fc_run config file?

The 2nd (optional) argument to fc_run is a logging config file. It can be either .json or .ini format. There is an example in FALCON-examples (in synth0, I think). It's a standard thing in Python:
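For instance, a dictConfig-style setup along these lines (whether fc_run feeds the .json form to logging.config.dictConfig is an assumption here; see the FALCON-examples copy for the real file):

import logging.config

LOGGING_CFG = {
    'version': 1,
    'formatters': {'brief': {'format': '[%(levelname)s]%(message)s'}},
    'handlers': {
        'console': {
            'class': 'logging.StreamHandler',
            'formatter': 'brief',
            'level': 'WARNING',  # quiets the very verbose INFO chatter
        },
    },
    'root': {'handlers': ['console'], 'level': 'WARNING'},
}

logging.config.dictConfig(LOGGING_CFG)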

That log is for debugging job-distribution, not the jobs themselves. Not very useful. Someday, we'll try to pull the messages from stderr for failed jobs.

@pb-cdunn

the mkdir() and qsub in rapid succession generally causes the first eight or so qsubs to error out

Oh! Thank you! That explains a lot. Ok, that is something we must solve as quickly as possible.

One bit of clarification might help: Is qsub itself failing, because the run-dir provided to qsub does not exist? Or is the top script of the qsub job itself failing, because it tries to cd into a non-existent directory? If you have any related snippet of an error log, I could avoid solving the wrong problem.

and the qsub'ed process log files cause avoidable disk traffic (dgordon had mentioned this earlier in another thread).

Yes. We will reduce the logging, but we cannot do that until things are basically working. Maybe we are near that point now. Ultimately, I'd like the logging to be written to a local disk and then copied into NFS by the qsub job itself.

However, I don't think that would help within Amazon, as for @cariaso. In fact, on virtual machines, you might forever have problems with I/O. In Amazon, you might need to find the highest usable concurrency experimentally.

I'm currently expanding my heartbeat server to handle the log files as well...

I think you're wasting effort. The heartbeats have not been proven necessary anywhere except on our "Pod".

Personally, I like the idea of socket-based heartbeat checking, but you're getting into the area of project management. Internally, we have an "agile" workflow, the same as I used when I worked at Amazon. (Yes, I used to work at Amazon, and I am confident that I could produce an excellent, robust socket-based solution if I were tasked with that specifically. Mostly, I do iterative improvements on Jason's original idea, which is aimed at the lowest common denominator, the filesystem.) "Agile" means writing a small bit at a time. Please don't be upset if we end up using only bits and pieces of whatever you come up with. I have confidence in your skills, and I will try to use as much as possible. But I apologize if we cannot use it all.

I hadn't realized the state file was quite as important as it turns out to be, so it's taking a bit longer than I'd hoped. Namely, if you want to be able to restart the run, you've got to be very assiduous about saving the state, or else it starts over at the beginning.

You may have misunderstood the overall philosophy. The files in the dependency graph are unrelated to the sentinel files of the fs-based pwatcher. If you mix those, then we will never accept your solution. Normally, I rm -rf mypwatcher/ before any restart. But the eventual plan is for pwatcher to use the state-file to re-acquire any already running jobs upon restart. That's tricky, so it's not done yet, but it's something that dgordon has wanted for a long time. In that case, we would not qdel running jobs upon failure. Instead, we would provide a script to let a user do that themselves if desired. (Does that make sense? It's hard to explain.)

Ideally, the "watcher" would be socket-based instead of filesystem-based, so sentinel files would be completely unnecessary.

We really should discuss pwatcher enhancements in a separate thread (or threads). You have a lot of excellent ideas, and your local experience is invaluable.

@flowers9

Regarding the problem starting jobs: From falcon's point of view, the jobs just take a long time to start (possibly forever ;). From grid's point of view, the jobs get put into an error state, with the error that when it tried to start the job, it couldn't chdir to the given directory, presumably because the newly created directory wasn't yet available on the remote machine. Using qmod to clear the error state then allows them to start normally.

I'm not planning to mix things - my network_based watcher code is based on a copy of your fs_based watcher code. I admit, I'm tossing the logs into it, but that's by modifying the heartbeat wrapper script to forward the output from the wrapped script, so it's still in the area of the process watcher codebase (as I also copied mains/fs_based and modified that as part of the heartbeat change, and that's where the log change is).

Honestly, I'm not expecting you to take any of this, after my previous experience with offering code. But the current version seems to give better results than the older (locally modified) version, so I'd like to try to get this working better, if only on my machines. I figure it's polite to upload my results, if only so you can see what I did, though that's why I'm somewhat resistant to going through the learning-git experience again.

@flowers9

Here's the error message (from qstat -c):

error reason 1: 08/30/2016 16:10:28 [1047:34036]: error: can't chdir to /home/f2p1/flowers/cotton_70x/cotton.1a/mypwatcher/jobs/J5591812c8293dbf5e09da092dcfc9730832df1eb03a51483a0fee1a0597fee77: No such file or directory

@flowers9

I think I see what you mean about deleting the mypwatcher directory. I'd noticed that sometimes when I restarted a job (a small one, thankfully ;) it restarted from the beginning, and I think that was because I had not deleted the mypwatcher directory. Thanks for the note, though, as it helps make sure that when I restart large jobs they don't start over.

One reason I'm trying to keep the mypwatcher directory is that I would like to be able to re-acquire running jobs, and for the socket-based approach, that means I need to know what socket the previous server was on (just as, with your filesystem-based code, you lose the exit status and heartbeat files of running jobs when you delete it). I'd forgotten that when falcon shuts itself down (as opposed to eating a SIGTERM), it qdels running jobs, making the issue moot (at the moment).

@flowers9

I don't think my current approach will do much for cloud jobs, which is just as well, as I don't currently use the cloud, and couldn't test it there.

@pb-cdunn

Using qmod to clear the error state then allows them to start normally.

That's an interesting idea. But it would require a different work-around for each job-dist system. Why not just wait for the directory to be available everywhere before even submitting the job? That's the simple work-around I plan to do Thursday. (I'm not allowed to do it today.)

by modifying the heartbeat wrapper script to forward the output from the wrapped script

The reason I have not done this already is that sometimes the output is very, very long. Copying the last few hundred lines might be ok though.

I also want the stderr/stdout to go into /tmp initially, when the job runs there. (use_tmpdir=true) That's another TODO.

I figure it's polite to upload my results...

Much appreciated!

I think we can make your watcher-replacement available optionally by configuration. So maybe we can take it as is.

@pb-cdunn

I don't think my current approach will do much for cloud jobs...

Right, but that's fine. It won't be the default.

error reason 1

Good! I'm pretty sure that's a qsub message, so it's failing because of the cwd we pass to qsub.

All we need to do is to put the first cd into the wrapper. We can use a little sleep-loop to wait up to N seconds (maybe 60?) for the directory. However, the wrapper itself might not exist yet, right? What if we put the sleep-loop into a string that we use to start the job? ... Oh, wait. Some job-dist systems require a script in the filesystem, not a string.

Ok, so maybe we really do need different solutions for different systems. qmod+retry could be fine for SGE.
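For illustration, the sleep-loop could be as small as this sketch (the function name and timeout are placeholders, not the shipped fix):

import os
import time

def wait_for_dir(path, timeout_s=60, poll_s=2):
    # Block until `path` is visible on this host (e.g. after NFS attribute
    # caches catch up), or give up after timeout_s seconds.
    deadline = time.time() + timeout_s
    while not os.path.isdir(path):
        if time.time() >= deadline:
            raise RuntimeError('directory never appeared: %r' % path)
        time.sleep(poll_s)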

One reason why I'm trying to keep the mypwatcher directory is because I would like to be able to re-acquire running jobs...

Exactly! That's the whole goal of the pwatcher "state". The watcher has 3 commands in its API, "run", "query", and "delete". We would need to start with "query" to find already running jobs. And we would need to provide users with a simple way to call "delete" when we stop killing their jobs on exit.

@flowers9

flowers9 commented Aug 31, 2016

I didn't mean that you should have falcon do the qmod to clear the error state. I was just saying that when I did, the job then worked normally.

My approach is to get rid of the per-job directory, which isn't needed if the wrapper puts everything somewhere else (i.e., sends it off to the heartbeat server).

For grid, at least, the wrapper does not need to be available on the remote machine - grid makes a copy for itself and handles the distribution (which is one reason why you want to give grid short scripts ;).

My latest revision (currently being tested) allows the user to specify a directory to put the watcher data into, and I'm currently placing it on a non-shared filesystem. It seems to work so far!

@pb-cdunn

@flowers9, Will you be at the SMRT Biofx Developer's conference in Gaithersburg this month? And btw, where do you work/live? Just wondering. Maybe there would be a way for us to get together sometime.

@flowers9

flowers9 commented Sep 2, 2016

Nope, I'm not much of a conference goer. Not good at networking, which leaves little point.

My job is in Huntsville, AL, at the HudsonAlpha Institute for Biotechnology. I live and work in San Jose, CA, though.

@pb-jchin

pb-jchin commented Sep 3, 2016

Since you are nearby, we should meet face to face next time @pb-cdunn is in town.
