
Add watchdog functionality to find frozen subprocesses #21

Open · vpetersson opened this issue Oct 28, 2012 · 14 comments
@vpetersson
Contributor

I've noticed on a few occasions that omxplayer freezes up, which in turn stalls Screenly (since it patiently waits for the player to finish).

To avoid this, we need some kind of watchdog functionality that scans for omxplayer processes older than n hours (where n should probably be a setting).
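
A minimal sketch of what such a cron-run check could look like (the one-hour cutoff, the script name, and the use of the ps `etimes` column are assumptions, not the committed code):

```python
#!/usr/bin/env python
# Hypothetical cron job: kill omxplayer processes that have been running
# longer than MAX_AGE seconds. Assumes a procps `ps` that supports the
# `etimes` (elapsed seconds) output column.
import os
import signal
import subprocess

MAX_AGE = 3600  # one hour; this should probably come from Screenly's settings


def kill_stale_omxplayers(max_age=MAX_AGE):
    output = subprocess.check_output(
        ['ps', '-eo', 'pid,etimes,comm'], universal_newlines=True)
    for line in output.splitlines()[1:]:  # skip the header row
        pid, elapsed, command = line.split(None, 2)
        if 'omxplayer' in command and int(elapsed) > max_age:
            os.kill(int(pid), signal.SIGKILL)


if __name__ == '__main__':
    kill_stale_omxplayers()
```

It would then be run from cron every few minutes, e.g. `*/5 * * * * python /path/to/kill_stale_omxplayers.py` (path hypothetical).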

vpetersson added a commit that referenced this issue Oct 28, 2012
…omxplayer-processes older than one hour. Must also be added to cron to run every X minutes.
@NiKiZe
Copy link
Contributor

NiKiZe commented Oct 29, 2012

I have had uzbl die sometimes, and I also think it is possible for the server process to die as well.
One way would be to use the system watchdog to handle this. I also recommend adding nowayout=1 to the watchdog module.

@vpetersson
Contributor Author

> I have had uzbl die sometimes

Interesting. I haven't seen that happening. I do have a reload of uzbl in xloader.sh, but it would only be triggered if viewer.py crashes.

I guess a check for if uzbl is running would be a viable solution here, as it should be running at all times viewer.py is running.

> thinking that it is possible for the server process to die as well

No, the server is managed by supervisor, so it will respawn the process if it crashes.

> One way would be to use the system watchdog to handle this.

I'm not too familiar with the system watchdog. It's something I'm going to look into. However, it was my understanding that the watchdog only detects a frozen system, and not a frozen process. Hence, at least in the case of omxplayer, that wouldn't help.

> also recommend to add nowayout=1 to the watchdog module.

I looked it up, and the benefit of using nowayout appears to be that the module cannot be unloaded. I'm not sure what the benefit would be of that in this particular situation, but maybe I'm missing something.

@axel-b
Contributor

axel-b commented Oct 29, 2012

Just wondering: wouldn't it be better, instead of the cron job,

  • to let the viewer kill the player if it takes too long, or
  • to let the viewer not start the player directly, but via a wrapper program that kills the player when it has run too long (the wrapper program would be very much like the memtime program, but using wall-clock time -- memtime kills a program when it has used up its allocated cpu or memory resources; see the sketch below)?

In either case, you could use the duration field in the asset database as the 'maximum playing time until kill', i.e. you can then have asset-specific maximum playing times.
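
A rough sketch of the wrapper idea (wall-clock limit only; the command-line interface and the one-second poll are illustrative, not an actual Screenly script):

```python
#!/usr/bin/env python
# Hypothetical wrapper: run the player, but kill it if it exceeds a
# wall-clock deadline (e.g. the asset's duration plus a grace period).
import subprocess
import sys
import time


def run_with_deadline(cmd, max_seconds):
    proc = subprocess.Popen(cmd)
    deadline = time.time() + max_seconds
    while proc.poll() is None:      # still running
        if time.time() > deadline:  # overran its slot
            proc.kill()
            proc.wait()
            return False
        time.sleep(1)
    return True


if __name__ == '__main__':
    # usage (hypothetical): player_wrapper.py <seconds> <player command...>
    limit = float(sys.argv[1])
    sys.exit(0 if run_with_deadline(sys.argv[2:], limit) else 1)
```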

@NiKiZe
Contributor

NiKiZe commented Oct 29, 2012

My guess is some kind of out-of-memory condition in uzbl. I will try to catch more info if it happens again.

The watchdog can run other scripts as well, and in the hard case where those fail, the machine will be restarted; see the test-binary option for watchdog.conf (there is also repair-binary). If there is a way to look for lingering omxplayers, or a viewer that has not touched a file within a given time, etc. (maybe via the log?), the watchdog can try to resolve the problem. If the software recovery is not successful, the watchdog device is not poked, and in the worst case the system is restarted.

With nowayout, if something or someone kills the watchdog process, the system will still restart. I'm not saying it's the right thing, just an old habit of mine.

@vpetersson
Contributor Author

@axel-b I think that's a reasonable approach. The reason I went for the crontab approach was simply that it was faster to implement.

@NiKiZe Thanks. Yeah, but the problem with omxplayer is that it shouldn't always run. It should only run if there is a video playing. Perhaps the 'touch' approach would work if implemented in viewer.py's main loop, since omxplayer would halt the loop, and hence prevent the touch-file from being updated.

There are pros and cons of using the internal timer in viewer.py versus a watchdog-approach.

As far as I understand, the benefit of the viewer.py timer is that we can add more intelligence to it (and perhaps pass values like the expected run-time, etc.), but at the price of added complexity in the script. It will also not catch issues with crashing/frozen uzbl processes (though perhaps the general watchdog, as it is configured today, will catch these if it's a memory/swap related issue).

The benefit of touching a file and checking its last modification time with a watchdog process is that it is a lot easier to implement and will also catch every possible issue that may cause viewer.py to choke. The downside, however, is that it isn't very intelligent and may cause unnecessary reboots.

I'm torn...:)

vpetersson added a commit that referenced this issue Nov 9, 2012
The watchdog-file should never be older than the length of the longest asset. Hence we should be able to use this to detect a freeze-up.
@vpetersson
Contributor Author

I think the watchdog-approach is the best one. I've added initial support for a watchdog-file that is updated within the loop.

My plan is then to set the system watchdog to trigger if the watchdog-file is older than an hour. If that is the case, it should kill viewer.py. That in turn will lead xloader.sh to refresh viewer.py and kill frozen uzbl and omxplayer processes.

The drawback with this is that it assumes that no asset has a run-time longer than an hour. Yet I would imagine it is safe to assume that such assets are rare. If someone does need to run assets with a display time longer than an hour, the value could easily be bumped up in the watchdog config file.
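
For reference, the heartbeat itself can be very small; a sketch of the idea (the file path and the exact call site in viewer.py are assumptions):

```python
import os

WATCHDOG_FILE = '/tmp/screenly.watchdog'  # assumed path


def touch_watchdog():
    # Update the file's mtime; the system watchdog only checks how old it is.
    with open(WATCHDOG_FILE, 'a'):
        os.utime(WATCHDOG_FILE, None)

# Hypothetical call site in viewer.py's main loop:
#
#   while True:
#       touch_watchdog()
#       show_next_asset()  # blocks while omxplayer/uzbl displays the asset
```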

@axel-b
Contributor

axel-b commented Nov 9, 2012

Regarding the one-hour assumption... the asset database 'knows' the duration of the longest asset, except for movies.

I found some python code that uses mplayer to obtain the duration of movies -- it does take a couple of seconds to run, though. I hacked my screenly to use this when a movie is added (just like the resolution of images is obtained when an image is added). I like that the video-duration field now contains useful information.
(I could also imagine that we do not automatically obtain video duration when a video is added, but provide a button in the edit interface 'fetch video duration from video file'.)
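
Roughly, such a duration probe could look like this (a sketch: it assumes mplayer prints an ID_LENGTH line when run with -identify, and is not necessarily the code on the branch):

```python
import subprocess


def video_duration(path):
    """Return the duration of `path` in seconds, or None if it cannot be found."""
    cmd = ['mplayer', '-identify', '-frames', '0',
           '-vo', 'null', '-ao', 'null', path]
    try:
        output = subprocess.check_output(cmd, stderr=subprocess.STDOUT,
                                         universal_newlines=True)
    except (OSError, subprocess.CalledProcessError):
        return None
    for line in output.splitlines():
        if line.startswith('ID_LENGTH='):
            return float(line.split('=', 1)[1])
    return None
```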

I'll see that I push the change to obtain video duration to a separate branch later today (probably this evening).

Update: I just saw that 'schedule asset' always wants the user to provide a duration, and sets '5' as the default -- even if you schedule an asset that already has a duration. I did not check whether it actually overwrites the duration for a video when you press 'submit'. I'll have a look at this later today as well, I hope.

@vpetersson
Contributor Author

@axel-b

> regarding the one-hour assumption... the asset database 'knows' the duration of the longest asset, except for movies. I found some python code that uses mplayer to obtain the duration of movies.

Yes, but that doesn't matter, since the timeout is statically configured in /etc/watchdog.conf. Hence, even if the python-process did know the maximum duration, we would need to write a system that keeps overwriting the config-file with the new value, and then restart the watchdog. That doesn't seem like a great approach.

Also, I've already added such code for Screenly Pro, but it then runs on the server (and not on the Pi).

> I just saw that 'schedule asset' wants the user to provide a duration, always, and sets '5' as default --

Yes, that's true. There is room for improvement here, but it boils down to detecting the length. I'm not sure it is a good idea to run this on the Pi, as it could potentially take a very long time.

@axel-b
Contributor

axel-b commented Nov 9, 2012

I think we can have our cake and eat it too.

With the approach that you propose, in each 'cycle' of the main loop you touch a file (the file that the watchdog should look for). The watchdog will look at the last-modified time of the given file and the current time, and, using information from its config file, do a little computation to decide whether everything is ok or not.

Now suppose that you do not just touch the file in every cycle, but you write a number into it (the longest duration of the playlist, or maybe even better: duration of the next item that is going to be shown).

Now suppose that you write your own watchdog test command and put it in /etc/watchdog.d (the watchdog allows you to do this, see watchdog(8)).
Now suppose that your own test command will not only look at the last-modified-time of the given file, and the current time, but it will also look at the contents of the file (the duration) and then do a little computation to decide whether everything is ok, or not ok. This does not sound very complicated.
(Ok, I think that your test command actually has to be a test-and-repair command, i.e. it will not only be invoked to test, but, when the command reports a failure, it will also be invoked (again) to repair -- so it is a little bit more complex than I sketched so far, but not much.)
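
A minimal Python sketch of such a test-and-repair command (the file path, thresholds, and repair action are assumptions; watchdog(8) documents the exact calling convention for scripts in /etc/watchdog.d):

```python
#!/usr/bin/env python
# Sketch of a watchdog test/repair command for /etc/watchdog.d.
# Assumes /tmp/screenly.watchdog contains the current asset's duration in
# seconds (or -1 if unknown), and that watchdog calls this script with
# "test" or "repair" as its first argument (see watchdog(8)).
import os
import subprocess
import sys
import time

WATCHDOG_FILE = '/tmp/screenly.watchdog'
FALLBACK_DURATION = 3600  # used when the duration in the file is -1
GRACE = 60                # extra seconds before declaring a freeze


def test():
    try:
        age = time.time() - os.path.getmtime(WATCHDOG_FILE)
        duration = float(open(WATCHDOG_FILE).read().strip())
    except (IOError, OSError, ValueError):
        return 1
    limit = duration if duration > 0 else FALLBACK_DURATION
    return 0 if age <= limit + GRACE else 1


def repair():
    # Example repair action: kill viewer.py so xloader.sh respawns it.
    subprocess.call(['pkill', '-f', 'viewer.py'])
    return 0


if __name__ == '__main__':
    mode = sys.argv[1] if len(sys.argv) > 1 else 'test'
    sys.exit(repair() if mode == 'repair' else test())
```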

No need to keep overwriting config files, no need to keep restarting watchdogs.
No need to make assumptions about max asset time.
If duration varies a lot amongst items in the playlist: potentially much faster reaction to frozen stuff.

And, until you find time to write your own watchdog test command, you could even configure watchdog to only look at the modification time of the given file (the file the watchdog must look for)... :-) :-) :-) :-)

(sidenote: I'll probably keep the length-fetching code in my Screenly copy, or turn it into a button-to-do-it-on-command, because I like the functionality, and so far, although it took longer than I liked, it did not take extremely long... at most 15 secs or so -- I did not use a watch. Moreover, we do not add videos very often -- we have three, and the screen is there now for more than a year.)

@NiKiZe
Contributor

NiKiZe commented Nov 11, 2012

Normally the watchdog process only checks that a process is alive by pid. In the case of screenly it is a bit more complicated because of the secondary processes, and thus touching a file and checking its modification time is a good workaround. But adding calculation logic there feels risky.

I would like to see the viewer running a separate thread that updates the mtime no less than every 10 seconds.
This in turn would need to know that the browser or video player is alive and has done what it should.
I think @axel-b's browser branch is a nice start, and can be used to check that uzbl did what it was supposed to. (this also removes the need for the 200 status check in view_web?)
The mtime is then updated every 10 secs as long as the status from the last viewer is OK and the duration is not exceeded.
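
One way that heartbeat thread could look (a sketch; the status flag, deadline handling, and 10-second interval are illustrative, not an actual Screenly design):

```python
import os
import threading
import time

WATCHDOG_FILE = '/tmp/screenly.watchdog'  # assumed path


class Heartbeat(threading.Thread):
    """Touch the watchdog file every 10s while the viewer reports OK."""

    def __init__(self):
        threading.Thread.__init__(self)
        self.daemon = True
        self.ok = True        # viewer.py would update this after each asset
        self.deadline = None  # wall-clock time the current asset must end by

    def run(self):
        while True:
            overdue = self.deadline is not None and time.time() > self.deadline
            if self.ok and not overdue:
                with open(WATCHDOG_FILE, 'a'):
                    os.utime(WATCHDOG_FILE, None)
            time.sleep(10)
```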

In regards to video, maybe https://github.com/jbaiter/pyomxplayer could be used? Both for checking that the player is alive/hung and for getting the video duration.
The duration entry in the db is then updated when the video starts to play, to avoid an initial delay when adding assets.

@axel-b
Contributor

axel-b commented Nov 11, 2012

To me, having a small test (+repair) program of our own doesn't seem too risky.
After all, from the watchdog manual page I get the feeling that the watchdog program is designed with external test/repair programs in mind. The calculation that it has to do is rather limited (I would probably program it in C, not python, though).
And, most importantly, that approach allows testing what you want to test: progress of the viewing loop.

Having a separate thread write timestamps sounds much more risky to me: I don't know what happens when the viewer thread is hung, but I could perfectly well imagine that the timestamping thread continues to make progress...
No, personally I prefer it when the viewer main thread updates the timestamp once in each cycle, then at least I know/understand what is going on.

I do like my Browser class, but it cannot be used to check progress: when uzbl-core hangs, the Browser code (e.g. in Browser.show()) will just wait for it -- forever, if necessary.

Also, I would be hesitant about removing the 200 status check in view_web -- for my use-case, it is not very costly, and I think I have seen it be useful.

Regarding video, I think I want slightly more than what pyomxplayer does (allows) -- but I have not tried pyomxplayer, so I may be underestimating what it can do.
I would like to be able to do with video what I currently do with webpages: while the current one is being shown, I 'prefetch' the next, such that the transition from the current to the next is almost instant.

To elaborate on what I currently do with webpages:
In my current setup I have a couple of webpages that take a long time to load.
Therefore, I use not one, but two Browser instances, both full-screen, one on top of the other.
While the top Browser instance shows a page, the other one fetches the next page that has to be shown (either the black background, when the next asset is a video, or otherwise, the webpage or image that has to be shown next).
When a transition has to be made to the next page, I raise the lower Browser instance (using a command-line x-windows window manager tool), which makes the transition appear almost instant (it still takes a tiny bit, but it is much better than slowly loading a page for all to see).
Actually, for one page (jenkins wall display) loading takes 15-20 seconds, every time that page is loaded afresh -- it would be nice to have a third Browser instance that always shows that page. Right now, the playlist item that precedes the jenkins wall display page must be shown for 15-20 seconds (to allow for sufficient prefetching time). Ideally, I would like to be able to specify with a web-page asset whether it should be 'pinned' to its own private Browser instance.

For some of my videos, loading/starting omxplayer takes several (up to 10) seconds, i.e. 10 seconds of black screen (not really anymore: I modified the black background to show, for 10 seconds, an animated gif with a typical movie start countdown image).
To reduce that amount of black (countdown) screen, I would like to do for video something similar to what I do with the Browser.
It would be nice if I could say to omxplayer "load the video, but do not start it yet, and also, do not yet show anything on the screen, but wait for a command on stdin".
That would allow me to start omxplayer (for the next asset) while the current asset is being shown, and then wait until it is time to make a transition. Not showing anything while we wait for the start signal is very important: without it, it is not possible to start omxplayer in advance, because it would make it impossible to see the asset that is being shown. Because omxplayer talks more directly to the hardware than uzbl, the trick of raising/lowering using x-windows window manager commands does not work for it, to my knowledge (I'd like to be proven wrong here).
(I also cannot show my fader/shutter program when omxplayer is active, and the other way around).

Regarding video length: a simple solution would be to set length to -1 in the database when the asset is entered, and update the database after the first viewing of the video (when we know how long it took, by just measuring).
But then, we do have to make an assumption about the length for the watchdog, for that first viewing (i.e. when asset duration == -1).
An alternative would be to allow the user to enter an educated guess of the video length, and use that as the duration for the watchdog during the first cycle, and then update that guess with the actual measured value when the video has been shown for the first time.
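
The measure-on-first-viewing idea could be as small as this (a sketch; the table and column names are made up for the example and do not match the actual schema):

```python
import sqlite3
import time


def play_and_measure(asset_id, play_asset, db_path='screenly.db'):
    # play_asset() blocks until the player exits; measure wall-clock time.
    started = time.time()
    play_asset(asset_id)
    measured = int(time.time() - started)
    conn = sqlite3.connect(db_path)
    # Only replace the duration if it is still the -1 placeholder.
    conn.execute(
        'UPDATE assets SET duration = ? WHERE asset_id = ? AND duration = -1',
        (measured, asset_id))
    conn.commit()
    conn.close()
```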

@NiKiZe
Contributor

NiKiZe commented Nov 11, 2012

@axel-b I will try to be short and then read what you wrote a few more times ;)
But it really sounds like a slave mode is needed; however, the pyomxplayer code probably needs to be optimized a bit to be usable in screenly.

With pyomxplayer there is, for example, quite a clear mode where they load the video but do not start it.

Slave mode is also what I would call how uzbl is run; current screenly master uses a fifo while your browser branch uses an open pipe. omxplayer seems to support something similar (and so does mplayer).

I will read more and get back on the other points.

@axel-b
Contributor

axel-b commented Nov 16, 2012

This is just to confirm that these ideas -- starting omxplayer in paused mode -- work: I'm doing that now in viewer.py on my branch player-browser-fader.
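
For illustration, a minimal sketch of the paused start (assuming omxplayer toggles pause/play on a 'p' written to its stdin; not necessarily how the player-browser-fader branch does it):

```python
import subprocess


def start_paused(video_path):
    # Launch omxplayer and immediately pause it, so nothing plays yet.
    proc = subprocess.Popen(['omxplayer', video_path], stdin=subprocess.PIPE)
    proc.stdin.write(b'p')   # assumption: 'p' toggles pause/play on stdin
    proc.stdin.flush()
    return proc


def resume(proc):
    # Toggle pause again when it is time to actually show the asset.
    proc.stdin.write(b'p')
    proc.stdin.flush()
```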

back to watchdog:

Yesterday I got my first hang in omxplayer -- in retrospect it may have been caused by a file-server issue that affected the web server that hosts my assets. Nevertheless, it triggered me to write a test/repair command in watchtests/testrepair.c on branch watchdog-testcommand.

Command testrepair looks at both the modification time of /tmp/screenly.watchdog, and its content: I assume it contains the duration of the asset that is currently being shown, or -1 if that duration is not known -- I changed my viewer.py (not committed to the branch) to write such /tmp/screenly.watchdog, just for trying this approach.

I have run testrepair from the command line to see what it does, but have not yet tried to install it such that it is automatically run by watchdog, so be careful with it.

@stale

stale bot commented Aug 13, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Aug 13, 2018