Add watchdog functionality to find frozen subprocesses #21
Comments
…omxplayer-processes older than one hour. Must also be added to cron to run every X minutes.
I have had uzbl die sometimes, and I am also thinking that it is possible for the server process to die as well. |
Interesting. I haven't seen that happening. I do have a reload of uzbl in xloader.sh, but it would only be triggered if viewer.py crashes. I guess a check for whether uzbl is running would be a viable solution here, as it should be running whenever viewer.py is running.
No, the server is managed by supervisor, so it will respawn the process if it crashes.
I'm not too familiar with the system watchdog. It's something I'm going to look into. However, it was my understanding that the watchdog only detects a frozen system, and not a frozen process. Hence, at least in the case of omxplayer, that wouldn't help.
I looked it up, and the benefit of using nowayout appears to be that the module cannot be unloaded. I'm not sure what the benefit would be of that in this particular situation, but maybe I'm missing something. |
Just wondering: wouldn't it be better, instead of the cron job
in that case, you could use the duration field in the asset database as 'maximum playing time until kill', i.e. you then can have asset-specific maximum playing times. |
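A minimal sketch of the "duration as maximum playing time" idea, assuming Python 3 and that the player is started as a child process. The name `run_with_limit` and the `grace` margin are hypothetical and not part of Screenly. Child processes that the command itself spawns are not handled here.

```python
import subprocess
import sys


def run_with_limit(cmd, duration, grace=10):
    """Run cmd and kill it if it outlives duration + grace seconds.

    Returns True if the command finished by itself, False if it was killed.
    """
    proc = subprocess.Popen(cmd)
    try:
        proc.wait(timeout=duration + grace)
        return True
    except subprocess.TimeoutExpired:
        # The asset ran longer than its database duration allows.
        proc.kill()
        proc.wait()  # reap the killed process
        return False
```

With this shape, each asset gets its own limit: the viewer would pass the duration field of the asset it is about to play.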
A guess is some kind of out-of-memory condition in uzbl. I will try to capture more info if it happens again.
The watchdog can run other scripts as well, and in the hard case where those fail the machine will be restarted; see the test-binary option for watchdog.conf (there is also repair-binary). If there is a way to look for lingering omxplayers, or a viewer that has not touched a file in a given time (maybe the log?), it can try to resolve the problem. If the software recovery is not successful, the watchdog device is not poked, and in the worst case the system is restarted.
With nowayout, in the case where something or someone kills the watchdog, the system will still restart. Not saying it's the right thing, just an old habit of mine. |
@axel-b I think that's a reasonable approach. The reason I went for the crontab approach was simply that it was faster to implement.
@NiKiZe Thanks. Yeah, but the problem with omxplayer is that it shouldn't always run. It should only run if there is a video playing. Perhaps the 'touch' approach would work if implemented in viewer.py's main loop, since a frozen omxplayer would halt the loop, and hence prevent the touch-file from being updated.
There are pros and cons of using an internal timer in viewer.py versus a watchdog approach. As far as I understand, the benefit of the viewer.py timer is that we can add more intelligence to it (and perhaps pass values like the expected run-time), but at the price of added complexity in the script. It will also not catch issues with crashing/frozen uzbl processes (but perhaps the general watchdog, as it is configured today, will catch these if it's a memory/swap-related issue).
The benefit of touching a file and checking its last-modification time with a watchdog process is that it is a lot easier to implement, and it will also catch every possible issue that may cause viewer.py to choke. The downside, however, is that it isn't very intelligent and may cause unnecessary reboots. I'm torn...:) |
The watchdog-file should never be older than the length of the longest asset. Hence we should be able to use this to detect a freeze-up.
I think the watchdog-approach is the best one. I've added initial support for a watchdog-file that is updated within the loop. My plan is then to set the system watchdog to trigger if the watchdog-file is older than an hour. If that is the case, it should kill viewer.py. That in turn will lead xloader.sh to refresh viewer.py and kill frozen uzbl and omxplayer processes. The drawback with this is that it assumes that no asset has a longer run-time than an hour. Yet, I would imagine that it is safe to assume that that would be rare. If someone does need to run assets with a display-time longer than an hour, the value could easily be bumped up in the watchdog config-file. |
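A rough sketch of the watchdog-file idea, assuming Python 3. The path is the one mentioned elsewhere in this thread; the helper names `touch` and `is_stale` are hypothetical. The viewer loop would call `touch()` once per cycle, and a separate checker would treat a file older than the limit (one hour here) as a freeze.

```python
import os
import time

WATCHDOG_FILE = "/tmp/screenly.watchdog"  # path mentioned elsewhere in this thread


def touch(path=WATCHDOG_FILE):
    """Create the file if needed and set its mtime to now."""
    with open(path, "a"):
        os.utime(path, None)


def is_stale(path=WATCHDOG_FILE, max_age_seconds=3600, now=None):
    """Return True if the file is missing or older than max_age_seconds."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return True  # a missing file is treated as a failure
    return now - mtime > max_age_seconds
```

Treating a missing file as stale is a design choice: it means a viewer that never started is also reported, at the cost of a false alarm if the check runs before the first cycle.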
regarding the one-hour assumption... the asset database 'knows' the duration of the longest asset, except for movies. I found some python code that uses mplayer to obtain the duration of movies -- it does take a couple of seconds to run, though. I hacked my screenly to use this when a movie is added (just like the resolution of images is obtained when an image is added). I like it that now the video-duration field contains useful information. I'll see that I push the change to obtain video duration to a separate branch later today (probably this evening). update: I just saw that 'schedule asset' wants the user to provide a duration, always, and sets '5' as default -- also if you schedule an asset that already has a duration. I did not check whether it actually overwrites the duration for a video when you press 'submit'. I'll have a look at this later today as well, I hope. |
Yes, but that doesn't matter, since the timeout is statically configured in /etc/watchdog.conf. Hence, even if the python-process did know the maximum duration, we would need to write a system that keeps overwriting the config-file with the new value, and then restart the watchdog. That doesn't seem like a great approach. Also, I've already added such code for Screenly Pro, but it then runs on the server (and not on the Pi).
Yes, that's true. There is room for improvement here, but it boils down to the detection of length. I'm not sure it is a good idea to run this on the Pi, as it could potentially take a very long time. |
I think we can have our cake and eat it too. With the approach that you propose, in each 'cycle' of the main loop you touch a file (the file that the watchdog should look for). The watchdog will look at the last-modified time of the given file and the current time and, using information from its config file, do a little computation to decide whether everything is ok or not.
Now suppose that you do not just touch the file in every cycle, but write a number into it (the longest duration in the playlist, or maybe even better: the duration of the next item that is going to be shown). Now suppose that you write your own watchdog test command and put it in /etc/watchdog.d (the watchdog allows you to do this, see watchdog(8)). No need to keep overwriting config files, no need to keep restarting watchdogs. And, until you find time to write your own watchdog test command, you could even configure watchdog to only look at the modification time of the given file (the file the watchdog must look for)... :-) :-) :-) :-)
(Sidenote: I'll probably keep the length-fetching code in my Screenly copy, or turn it into a button-to-do-it-on-command, because I like the functionality, and so far, although it took longer than I liked, it did not take extremely long... at most 15 secs or so -- I did not use a watch. Moreover, we do not add videos very often -- we have three, and the screen has been there for more than a year now.) |
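The proposal above could be sketched roughly as follows, assuming Python 3. The names `write_heartbeat` and `check_heartbeat`, the `grace` margin, and the `fallback` used for unknown (negative) durations are all assumptions for illustration, not existing Screenly or watchdog code. The viewer writes the expected duration before showing an asset; the test command compares the file's age against that number.

```python
import os
import time


def write_heartbeat(path, duration):
    """Viewer side: record the expected duration (seconds) of the next asset."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write("%d\n" % duration)
    os.replace(tmp, path)  # atomic on POSIX, so the checker never sees a partial file


def check_heartbeat(path, grace=30, fallback=3600, now=None):
    """Checker side: return 0 if the viewer looks healthy, 1 if it is overdue."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
        with open(path) as f:
            duration = int(f.read().strip())
    except (OSError, ValueError):
        return 1  # missing or unreadable heartbeat counts as a failure
    if duration < 0:
        duration = fallback  # unknown duration: fall back to a fixed limit
    return 0 if now - mtime <= duration + grace else 1
```

A script placed in the watchdog test directory would then exit with the value of `check_heartbeat("/tmp/screenly.watchdog")`; how the watchdog invokes such scripts and interprets their exit status should be checked against watchdog(8).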
Normally the watchdog process only checks that a process is alive by pid. In the case of Screenly it is a bit more complicated because of the secondary processes, and thus touching a file and checking its modification time is a good workaround. But adding calculation logic there feels risky. I would like to see the viewer run a separate thread that updates the mtime at least every 10 seconds. Regarding video, maybe https://github.com/jbaiter/pyomxplayer could be used? Both for checking whether the player is alive or hung and for getting the video duration. |
To me, having a small test (+repair) program of our own doesn't seem too risky. Having a separate thread write timestamps sounds much more risky to me: I don't know what happens when the viewer thread is hung, but I could perfectly well imagine that the timestamping thread continues to make progress...
I do like my Browser class, but it cannot be used to check progress: when uzbl-core hangs, the Browser code (e.g. in Browser.show()) will just wait for it -- wait forever, if necessary. Also, I would be hesitant about removing the 200 status check in view_web -- for my use-case, it is not very costly.
Regarding video, I think I want slightly more than what pyomxplayer does (allows) -- but I have not tried pyomxplayer, so I may be underestimating what it can do. To elaborate on what I currently do with webpages: For some of my videos, loading/starting omxplayer takes several (up to 10) seconds, i.e. 10 seconds of black screen (not really anymore: I modified the black background to show, for 10 seconds, an animated gif with a typical movie start countdown image).
Regarding video length: a simple solution would be to set length to -1 in the database when the asset is entered, and update the database after the first viewing of the video (when we know how long it took, by just measuring). |
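The "measure on first viewing" idea could look roughly like this, assuming Python 3 and an SQLite database. The table name `assets`, its columns, and the function `play_and_record` are hypothetical and may not match the real Screenly schema.

```python
import sqlite3
import time


def play_and_record(conn, asset_id, play):
    """Play an asset via the play() callable and store the measured duration.

    The stored value is only updated while it is still -1 (unknown).
    """
    start = time.monotonic()
    play()  # blocks until playback ends
    elapsed = int(round(time.monotonic() - start))
    conn.execute(
        "UPDATE assets SET duration = ? WHERE asset_id = ? AND duration = -1",
        (elapsed, asset_id),
    )
    conn.commit()
    return elapsed
```

Note that a player that hangs during this first viewing would never return, so the measurement only helps from the second viewing onward.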
@axel-b I will try to be short and then read what you wrote a few more times ;) With pyomxplayer there is quite a clear mode where they load the video but do not start it, for example. Slave mode is also what I would call how uzbl is run: current Screenly master uses a fifo, while your browser branch uses an open pipe. omxplayer seems to support something similar (and so does mplayer). I will read more and get back on the other points. |
This is just to confirm that these ideas -- starting omxplayer in paused mode -- work: I'm doing that now in viewer.py on my branch player-browser-fader.
Back to the watchdog: yesterday I got my first hang in omxplayer -- in retrospect it may have been caused by a file-server issue that affected the web server that hosts my assets. Nevertheless, it triggered me to write a test/repair command in watchtests/testrepair.c on branch watchdog-testcommand. The testrepair command looks at both the modification time of /tmp/screenly.watchdog and its content: I assume it contains the duration of the asset that is currently being shown, or -1 if that duration is not known -- I changed my viewer.py (not committed to the branch) to write such a /tmp/screenly.watchdog, just for trying this approach. I have run testrepair from the command line to see what it does, but have not yet tried to install it such that it is automatically run by watchdog, so be careful with it. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
I've noticed on a few occasions that omxplayer freezes up, which in turn stops Screenly (since it is gently waiting for it to wrap up).
To avoid this, we need some kind of watchdog functionality that scans for omxplayer-processes that are older than n hours (where n should probably be a setting).
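A minimal sketch of such a scan, assuming Python 3 and a `ps` that supports the `etimes` (elapsed seconds) output field, as procps-ng on Linux does. The function names and the default of one hour are hypothetical; the process name is matched by prefix because the running binary may carry a suffix.

```python
import subprocess


def parse_ps(output):
    """Parse `ps -eo pid=,etimes=,comm=` output into (pid, age_seconds, name) tuples."""
    rows = []
    for line in output.splitlines():
        parts = line.split(None, 2)
        if len(parts) == 3:
            rows.append((int(parts[0]), int(parts[1]), parts[2].strip()))
    return rows


def stale_pids(rows, name_prefix="omxplayer", max_age_seconds=3600):
    """Return the pids of matching processes older than max_age_seconds."""
    return [pid for pid, age, name in rows
            if name.startswith(name_prefix) and age > max_age_seconds]


def find_stale_omxplayers(max_age_hours=1):
    """Ask ps for all processes and return pids of omxplayers older than the limit."""
    out = subprocess.check_output(["ps", "-eo", "pid=,etimes=,comm="], text=True)
    return stale_pids(parse_ps(out), max_age_seconds=max_age_hours * 3600)
```

A cron job could call `find_stale_omxplayers(n)` every few minutes and send each returned pid a kill signal with `os.kill`.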