staleness heuristic / performance #9
Comments
I don't have insight into the stats myself. Do these servers respond to HEAD requests accurately? That should be faster than a diff.
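For example, a quick probe with the requests library (the URL here is just a placeholder) would show whether a server sends usable change metadata without fetching the full page:

```python
import requests

# Probe whether the server exposes Last-Modified / ETag headers that
# could be compared against the values seen on the previous check.
resp = requests.head("https://www.example.com/news/article.html", allow_redirects=True)
print(resp.status_code, resp.headers.get("Last-Modified"), resp.headers.get("ETag"))
```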
Some data points from my setup:
CBC - 42 feeds
The Globe & Mail - 174 feeds
La Presse - 272 feeds
Toronto Sun - 61 feeds
Canadaland - 4 feeds
@ruebot Wow, Toronto Sun has 61 feeds? Did you mean entries? I'm adding the number of entries checked during the run, and the number skipped, to the log output to help.
@edsu yep. La Presse has 200+, and some of them 404... which is fun.
If you install v0.0.24 you will get additional information like this:
updating now
Oh, merge conflicts. I'll update #11.
I had a runtime error in there so you'll need to install a new version. FWIW I'm developing the code separately from the installed version (installed with pip).
Updated stats:
Cool, it'll be interesting to watch these over time. What is your cron schedule like, and are you using the flock trick to prevent concurrent jobs, at least until diffengine does its own locking?
I had:
Now I'm giving this a shot again:
@gregjan I guess feed/entry fetching could use HTTP headers to avoid getting the full representation. But I still wouldn't want to do thousands of HTTP HEAD requests every time it runs, so I think some kind of logic for backing off would still be desirable. Currently I have a time.sleep embedded that prevents diffengine from doing more than 1 request per second. I wanted to do this to be nice to the servers... and also to not get blocked. So that's the main thing slowing things down at the moment. If multiple sites are being monitored, I guess diffengine could stripe the requests across them, rather than working through one feed at a time...
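As a hypothetical sketch of that striping idea (the feed lists and the one-second sleep are placeholders, not diffengine's actual code):

```python
import time
from itertools import zip_longest

def striped(feeds):
    # Yield one entry from each feed in turn, so consecutive requests
    # hit different sites instead of draining one feed at a time.
    for batch in zip_longest(*feeds):
        for entry in batch:
            if entry is not None:
                yield entry

feeds = [
    ["https://site-a.example/1", "https://site-a.example/2", "https://site-a.example/3"],
    ["https://site-b.example/1", "https://site-b.example/2"],
]

for url in striped(feeds):
    print("checking", url)
    time.sleep(1)  # keep the overall rate at about 1 request/second
```

Interleaving like this keeps the overall request rate the same, but no single host sees a sustained burst of back-to-back requests.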
@ruebot nice thing about using the flock is that you can run it at shorter intervals, every 15 minutes or even every 5 minutes...
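For example, a hypothetical crontab entry (the lock file and home directory paths are placeholders, and this assumes diffengine takes its storage directory as an argument):

```
*/15 * * * * flock -n /tmp/diffengine.lock diffengine /home/ed/.diffengine
```

With -n, flock exits immediately if a previous run still holds the lock, so overlapping cron invocations are harmless.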
After thinking about this a bit, I wonder if a few simple changes might help:
Order Feed entries by Entry creation date, newest-first (see #9)
The longer diffengine runs, the more URLs it needs to check, and the more time it takes to make a full pass through them. The assumption I've had so far, based on watching news websites, is that the older a page gets, the less likely its content is to change.
There is a method on the Entry object that calculates whether an entry is stale or not. It uses what I call a staleness ratio, or s. If s is greater than a given value (currently .2) the entry is deemed stale. I've thought about making this magic number configurable per feed. Here's how it works:
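The gist, as a sketch rather than the exact code, assuming s is the time since the entry was last checked divided by the entry's age (both in seconds):

```python
from datetime import datetime

def is_stale(created, checked, threshold=0.2):
    """Sketch of the staleness ratio: an entry whose time-since-last-check
    is a big enough fraction of its age gets checked again."""
    now = datetime.utcnow()
    age = (now - created).total_seconds()          # seconds since the entry appeared
    since_check = (now - checked).total_seconds()  # seconds since we last looked
    if age <= 0:
        return True  # brand new entries are always worth a check
    s = since_check / age  # the staleness ratio
    return s > threshold
```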
So if an entry is 3 hours (10800 seconds) old and it was last checked 20 minutes (1200 seconds) ago, the calculation is:
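s = 1200 / 10800 ≈ 0.11, which is below the .2 threshold, so the entry is skipped for now.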
Or if the entry is 3 hours old and it was last checked 1 hour ago:
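s = 3600 / 10800 ≈ 0.33, which is above .2, so the entry is deemed stale and checked again.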
The idea is that things get checked less often as they get older. But a problem I haven't really verified yet is that I think it can still produce thresholds over which lots of checks need to happen, so periodically diffengine will spend a lot of time checking URLs as they cross that threshold.
I was wondering if it might make sense to take a more probabilistic approach, where URLs are checked more often when they are new and less often as they get older, using some sort of probability sampling. For example, when an entry is new it is checked 80% of the time, and as it gets to be old, say a month old, it is checked only 50% of the time. So a gradient of some kind like that? Or maybe it should also factor in the total number of entries that need to be checked, and the desired time for a complete run?
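Here is a rough sketch of what I mean, with the 80%/50% numbers as placeholders and a simple linear falloff standing in for the gradient:

```python
import random
from datetime import datetime, timedelta

MONTH = timedelta(days=30).total_seconds()

def check_probability(created, p_new=0.8, p_old=0.5):
    # Linearly interpolate from p_new at age zero down to p_old at one
    # month old, staying flat after that.
    age = (datetime.utcnow() - created).total_seconds()
    frac = min(age / MONTH, 1.0)
    return p_new + (p_old - p_new) * frac

def should_check(created):
    # Sample: a new entry is checked on ~80% of passes, an old one on ~50%.
    return random.random() < check_probability(created)
```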
It takes about a second to check an entry, and after running against the Washington Post, the Guardian, and Breitbart for a week I have 1531 URLs to check. With no backing off at all that would be about 25 minutes of runtime, and it would only get worse. That would mean new entries weren't being monitored closely enough, and it would also unduly burden the web servers being checked with tons of requests.
I suspect this problem may have been solved elsewhere before, so if you have ideas or pointers they would be appreciated!