staleness heuristic / performance #9
Comments
I don't have insight into the stats myself. Do these servers respond to HEAD requests accurately? That should be faster than a diff.
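For example, a quick probe with the requests library (the URL here is just a placeholder) would show whether a server sends usable change metadata without fetching the full page:

```python
import requests

# Probe whether the server exposes Last-Modified / ETag headers that
# could be compared against the values seen on the previous check.
resp = requests.head("https://www.example.com/news/article.html", allow_redirects=True)
print(resp.status_code, resp.headers.get("Last-Modified"), resp.headers.get("ETag"))
```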
Some data points from my setup:
CBC - 42 feeds
The Globe & Mail - 174 feeds
La Presse - 272 feeds
Toronto Sun - 61 feeds
Canadaland - 4 feeds
@ruebot Wow, Toronto Sun has 61 feeds? Did you mean entries? I'm adding the number of entries checked during the run, and the number skipped, to the log output to help.
@edsu yep. La Presse has 200+, and some of them 404... which is fun.
If you install v0.0.24 you will get additional information like this:
updating now
Oh, merge conflicts. I'll update #11.
I had a runtime error in there so you'll need to install a new version. FWIW I'm developing the code separately from the installed version (installed with pip).
Updated stats:
Cool, it'll be interesting to watch these over time. What is your cron schedule like, and are you using the flock trick to prevent concurrent jobs, at least until diffengine does its own locking?
I had:
Now I'm giving this a shot again:
@gregjan I guess feed/entry fetching could use HTTP headers to avoid getting the full representation. But I still wouldn't want to do thousands of HTTP HEAD requests every time it runs, so I think some kind of logic for backing off would still be desirable. Currently I have a time.sleep embedded that prevents diffengine from doing more than 1 request per second. I wanted to do this to be nice to the servers... and also to not get blocked. So that's the main thing slowing things down at the moment. If multiple sites are being monitored, I guess diffengine could stripe the requests across them, rather than working through one feed at a time...
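As a hypothetical sketch of that striping idea (the feed lists and the one-second sleep are placeholders, not diffengine's actual code):

```python
import time
from itertools import zip_longest

def striped(feeds):
    # Yield one entry from each feed in turn, so consecutive requests
    # hit different sites instead of draining one feed at a time.
    for batch in zip_longest(*feeds):
        for entry in batch:
            if entry is not None:
                yield entry

feeds = [
    ["https://site-a.example/1", "https://site-a.example/2", "https://site-a.example/3"],
    ["https://site-b.example/1", "https://site-b.example/2"],
]

for url in striped(feeds):
    print("checking", url)
    time.sleep(1)  # keep the overall rate at about 1 request/second
```

Interleaving like this keeps the overall request rate the same, but no single host sees a sustained burst of back-to-back requests.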
@ruebot nice thing about using the flock is that you can run it at shorter intervals, every 15 minutes or even every 5 minutes...
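For example, a hypothetical crontab entry (the lock file and home directory paths are placeholders, and this assumes diffengine takes its storage directory as an argument):

```
*/15 * * * * flock -n /tmp/diffengine.lock diffengine /home/ed/.diffengine
```

With -n, flock exits immediately if a previous run still holds the lock, so overlapping cron invocations are harmless.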
After thinking about this a bit, I wonder if a few simple changes might help:
Order Feed entries by Entry creation date, newest-first (see #9)
The longer diffengine runs, the more URLs it needs to check, and the more time it takes to make a full pass through them. The assumption I've had so far, based on watching news websites, is that the older a page gets, the less likely its content is to change.
There is a method on the Entry object that calculates whether an entry is stale or not. It uses what I call a staleness ratio, or s. If s is greater than a given value (currently .2) the entry is deemed stale. I've thought about making this magic number configurable per feed. Here's how it works:
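The gist, as a sketch rather than the exact code, assuming s is the time since the entry was last checked divided by the entry's age (both in seconds):

```python
from datetime import datetime

def is_stale(created, checked, threshold=0.2):
    """Sketch of the staleness ratio: an entry whose time-since-last-check
    is a big enough fraction of its age gets checked again."""
    now = datetime.utcnow()
    age = (now - created).total_seconds()          # seconds since the entry appeared
    since_check = (now - checked).total_seconds()  # seconds since we last looked
    if age <= 0:
        return True  # brand new entries are always worth a check
    s = since_check / age  # the staleness ratio
    return s > threshold
```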
So if an entry is 3 hours (10800 seconds) old and it was last checked 20 minutes (1200 seconds) ago, the calculation is:
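s = 1200 / 10800 ≈ 0.11, which is below the .2 threshold, so the entry is skipped for now.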
Or if the entry is 3 hours old and it was last checked 1 hour ago:
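s = 3600 / 10800 ≈ 0.33, which is above .2, so the entry is deemed stale and checked again.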
The idea is that things get checked less often as they get older. But a problem I haven't really verified yet is that I think it can still produce thresholds over which lots of checks need to happen, so periodically diffengine will spend a lot of time checking URLs as they cross that threshold.
I was wondering if it might make sense to take a more probabilistic approach, where URLs are checked more often when they are new and less often as they get older, using some sort of probability sampling. For example, when an entry is new it is checked 80% of the time, and as it gets to be old, say a month old, it is checked only 50% of the time. So a gradient of some kind like that? Or maybe it should also factor in the total number of entries that need to be checked, and the desired time for a complete run?
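Here is a rough sketch of what I mean, with the 80%/50% numbers as placeholders and a simple linear falloff standing in for the gradient:

```python
import random
from datetime import datetime, timedelta

MONTH = timedelta(days=30).total_seconds()

def check_probability(created, p_new=0.8, p_old=0.5):
    # Linearly interpolate from p_new at age zero down to p_old at one
    # month old, staying flat after that.
    age = (datetime.utcnow() - created).total_seconds()
    frac = min(age / MONTH, 1.0)
    return p_new + (p_old - p_new) * frac

def should_check(created):
    # Sample: a new entry is checked on ~80% of passes, an old one on ~50%.
    return random.random() < check_probability(created)
```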
It takes about a second to check an entry, and after running against the Washington Post, the Guardian, and Breitbart for a week I have 1531 URLs to check. With no backing off at all that would be about 25 minutes of runtime, and it would only get worse. That would mean new entries weren't being monitored closely enough, and it would also unduly burden the web servers being checked with tons of requests.
I suspect this problem may have been solved elsewhere before, so if you have ideas or pointers they would be appreciated!