Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

staleness heuristic / performance #9

Open
edsu opened this issue Jan 14, 2017 · 14 comments
Open

staleness heuristic / performance #9

edsu opened this issue Jan 14, 2017 · 14 comments

Comments

@edsu
Copy link
Member

edsu commented Jan 14, 2017

The longer diffengine runs the more urls it needs to check, and the more time it takes to take a full pass through them. The assumption I've had so far based on watching news websites is that the older a page gets the less likely it is to change its content.

There is a method on the Entry object that calculates whether an entry is stale or not. It uses what I call a staleness ratio or s. If s is greater than a given value (currently .2) it is deemed stale. I've thought about making this magic number configurable per feed. Here's how it works:

hotness = current time - entry creation time
staleness = current time - time last checked
s = hotness / staleness

stale? = s >= .2

So if an entry is 3 hours (10800 seconds) old and it was last checked 20 minutes (1200 seconds) ago, the calculation is:

1200 / 10800 = .11 (not stale)

Or if the entry is 3 hours old and it was last checked 1 hour ago:

3600 / 10800 = .33 (stale)

The idea is that things get checked less often as they get older, but the problem that I haven't really verified yet is that I think it can still result in thresholds over which lots of checks need to happen. So periodically diffengine will spend a lot of time checking URLs as they cross over that threshold.

I was wondering if it might make sense to take a more probabilistic approach where URLs are checked more often when they are new and less often as they get older using some sort of probability sampling. For example when an entry is new it is checked 80% of the time, and as it gets to be old, say a month old, it is checked only 50% of the time. So a gradiant of some kind like that? Or maybe it should also factor in the total number of entries that need to be checked, and the desired time it should take for a complete run?

It takes about a second to check an entry, and after running against Washington Post the Guardian and Breitbart for a week I have 1531 URLs to check. If there were no backing off at all this be 25 minutes of runtime, and it would just get worse. This would mean that new entries would not be monitored closely enough. Also it would unduly burden the webserver being checked with tons of requests.

I suspect this problem may have been solved elsewhere before, so if you have ideas or pointers they would be appreciated!

@gregjan
Copy link

gregjan commented Jan 14, 2017

I don't have insight into the stats myself. Do these servers respond to HEAD requests accurately? That should be faster than a diff..

@ruebot
Copy link
Member

ruebot commented Jan 15, 2017

Some data points from my setup.

CBC - 42 feeds

2017-01-14 10:12:20,055 - root - INFO - shutting down: 0:51:27.529976
2017-01-14 13:32:35,568 - root - INFO - shutting down: 1:02:34.198437
2017-01-14 15:24:19,525 - root - INFO - shutting down: 0:54:17.493559
2017-01-14 17:09:52,855 - root - INFO - shutting down: 0:54:50.911007
2017-01-14 18:11:45,281 - root - INFO - shutting down: 0:56:43.139898
2017-01-14 19:09:45,958 - root - INFO - shutting down: 0:54:44.392014
2017-01-14 20:09:54,507 - root - INFO - shutting down: 0:54:53.252183
2017-01-14 21:11:51,047 - root - INFO - shutting down: 0:56:49.261002
2017-01-14 22:13:16,338 - root - INFO - shutting down: 0:58:15.006552
2017-01-14 23:11:30,542 - root - INFO - shutting down: 0:56:28.920192
2017-01-15 00:12:25,490 - root - INFO - shutting down: 0:57:23.657105
2017-01-15 01:11:45,595 - root - INFO - shutting down: 0:56:43.754114
2017-01-15 02:06:40,166 - root - INFO - shutting down: 0:51:38.215280
2017-01-15 03:05:54,218 - root - INFO - shutting down: 0:50:52.668085
2017-01-15 04:04:57,935 - root - INFO - shutting down: 0:49:56.339932
2017-01-15 05:05:25,381 - root - INFO - shutting down: 0:50:23.064285
2017-01-15 06:06:06,482 - root - INFO - shutting down: 0:51:04.715768
2017-01-15 07:06:25,907 - root - INFO - shutting down: 0:51:24.007759
2017-01-15 08:10:25,069 - root - INFO - shutting down: 0:55:22.783080
2017-01-15 09:15:26,977 - root - INFO - shutting down: 1:00:25.543582

The Globe & Mail - 174 feeds

2017-01-14 09:47:27,868 - root - INFO - shutting down: 0:17:25.983662
2017-01-14 10:17:42,696 - root - INFO - shutting down: 0:17:40.530126
2017-01-14 10:47:57,171 - root - INFO - shutting down: 0:17:55.090977
2017-01-14 11:17:06,091 - root - INFO - shutting down: 0:17:04.567858
2017-01-14 11:47:55,512 - root - INFO - shutting down: 0:17:54.131806
2017-01-14 12:18:14,028 - root - INFO - shutting down: 0:18:12.170724
2017-01-14 12:48:34,178 - root - INFO - shutting down: 0:18:32.800043
2017-01-14 13:17:40,891 - root - INFO - shutting down: 0:17:39.432581
2017-01-14 19:50:17,275 - root - INFO - shutting down: 5:20:15.156602
2017-01-14 23:24:06,302 - root - INFO - shutting down: 2:49:04.247076
2017-01-15 02:32:11,626 - root - INFO - shutting down: 2:57:09.492647
2017-01-15 05:16:41,393 - root - INFO - shutting down: 2:41:39.316417
2017-01-15 08:28:26,709 - root - INFO - shutting down: 2:53:25.176995

La Presse - 272 feeds

2017-01-14 21:05:31,423 - root - INFO - shutting down: 0:00:28.450089
2017-01-14 21:20:20,473 - root - INFO - shutting down: 0:00:19.141869
2017-01-14 23:25:41,218 - root - INFO - shutting down: 0:07:51.681414
2017-01-14 23:52:57,912 - root - INFO - shutting down: 0:08:00.091925
2017-01-15 02:22:17,085 - root - INFO - shutting down: 2:02:15.088546
2017-01-15 04:21:39,277 - root - INFO - shutting down: 1:01:37.524389
2017-01-15 06:21:21,182 - root - INFO - shutting down: 1:01:19.278733
2017-01-15 08:21:24,028 - root - INFO - shutting down: 1:01:21.583565

Toronto Sun - 61 feeds

2017-01-14 12:01:08,299 - root - INFO - shutting down: 0:01:06.367782
2017-01-14 12:04:57,990 - root - INFO - shutting down: 0:01:24.518744
2017-01-14 12:06:05,057 - root - INFO - shutting down: 0:00:30.213326
2017-01-14 14:32:02,813 - root - INFO - shutting down: 1:02:00.581537
2017-01-14 15:51:12,505 - root - INFO - shutting down: 0:41:10.570579
2017-01-14 16:49:51,841 - root - INFO - shutting down: 0:39:49.905569
2017-01-14 17:49:13,600 - root - INFO - shutting down: 0:39:11.447067
2017-01-14 18:50:00,379 - root - INFO - shutting down: 0:39:58.324235
2017-01-14 19:50:15,777 - root - INFO - shutting down: 0:40:14.511174
2017-01-14 20:52:23,072 - root - INFO - shutting down: 0:42:21.243300
2017-01-14 21:52:13,635 - root - INFO - shutting down: 0:42:11.479865
2017-01-14 22:50:59,846 - root - INFO - shutting down: 0:40:58.384031
2017-01-14 23:51:05,598 - root - INFO - shutting down: 0:41:03.421256
2017-01-15 00:51:21,725 - root - INFO - shutting down: 0:41:19.825083
2017-01-15 01:52:39,583 - root - INFO - shutting down: 0:42:37.540521
2017-01-15 02:50:35,141 - root - INFO - shutting down: 0:40:33.583783
2017-01-15 03:52:36,677 - root - INFO - shutting down: 0:42:35.075756
2017-01-15 04:51:29,327 - root - INFO - shutting down: 0:41:26.991347
2017-01-15 05:50:03,298 - root - INFO - shutting down: 0:40:01.526480
2017-01-15 06:52:04,578 - root - INFO - shutting down: 0:42:02.599009
2017-01-15 07:51:25,814 - root - INFO - shutting down: 0:41:23.516856
2017-01-15 08:52:46,569 - root - INFO - shutting down: 0:42:45.185334

Canadaland - 4 feeds

2017-01-14 21:24:59,903 - root - INFO - shutting down: 0:24:57.804095
2017-01-14 21:55:25,131 - root - INFO - shutting down: 0:25:22.970673
2017-01-14 22:24:50,952 - root - INFO - shutting down: 0:24:49.555501
2017-01-14 22:55:09,853 - root - INFO - shutting down: 0:25:08.089897
2017-01-14 23:25:11,017 - root - INFO - shutting down: 0:25:08.898734
2017-01-14 23:54:59,516 - root - INFO - shutting down: 0:24:57.475271
2017-01-15 00:25:12,466 - root - INFO - shutting down: 0:25:10.616999
2017-01-15 00:54:57,652 - root - INFO - shutting down: 0:24:55.844422
2017-01-15 01:25:52,581 - root - INFO - shutting down: 0:25:50.593156
2017-01-15 01:54:51,265 - root - INFO - shutting down: 0:24:49.347676
2017-01-15 02:25:35,087 - root - INFO - shutting down: 0:25:33.488282
2017-01-15 02:55:05,424 - root - INFO - shutting down: 0:25:03.990036
2017-01-15 03:24:45,634 - root - INFO - shutting down: 0:24:43.871369
2017-01-15 03:54:36,158 - root - INFO - shutting down: 0:24:34.183754
2017-01-15 04:24:56,073 - root - INFO - shutting down: 0:24:54.569207
2017-01-15 04:54:58,330 - root - INFO - shutting down: 0:24:56.910666
2017-01-15 05:24:24,081 - root - INFO - shutting down: 0:24:22.412581
2017-01-15 05:54:26,942 - root - INFO - shutting down: 0:24:25.517462
2017-01-15 06:24:55,516 - root - INFO - shutting down: 0:24:53.228180
2017-01-15 06:54:40,924 - root - INFO - shutting down: 0:24:39.055811
2017-01-15 07:25:02,262 - root - INFO - shutting down: 0:25:00.001297
2017-01-15 07:54:44,987 - root - INFO - shutting down: 0:24:43.380605
2017-01-15 08:24:53,357 - root - INFO - shutting down: 0:24:52.005523
2017-01-15 08:55:06,777 - root - INFO - shutting down: 0:25:04.727537
2017-01-15 09:24:25,455 - root - INFO - shutting down: 0:24:23.347182

@edsu
Copy link
Member Author

edsu commented Jan 16, 2017

@ruebot Wow, Toronto Sun has 61 feeds? Did you mean entries? I'm adding the number of entries checked during the run, and the number skipped to the log output to help.

@ruebot
Copy link
Member

ruebot commented Jan 16, 2017

@edsu yep. La Presse has 200+, and some 404... which is fun.

@edsu
Copy link
Member Author

edsu commented Jan 16, 2017

If you install v0.0.24 you will get additional information like this:

2017-01-16 13:59:36,353 - root - INFO - shutting down: new=2 checked=4 skipped=10 elapsed=0:00:01.073984

@ruebot
Copy link
Member

ruebot commented Jan 16, 2017

updating now

@ruebot
Copy link
Member

ruebot commented Jan 16, 2017

Oh, merge conflicts. I'll update #11.

@edsu
Copy link
Member Author

edsu commented Jan 16, 2017

I had a runtime error in there so you'll need to install a new version. FWIW I'm developing the code separately from the installed version (installed with pip).

@ruebot
Copy link
Member

ruebot commented Jan 16, 2017

Updated stats:

  • Toronto Sun
2017-01-16 10:10:11,689 - root - INFO - shutting down: new=0 checked=4 skipped=91 elapsed=0:00:09.998173
  • La Presse
root - INFO - shutting down: new=0 checked=8 skipped=17 elapsed=0:00:12.566924
  • Canadaland
2017-01-16 09:30:03,538 - root - INFO - shutting down: new=0 checked=0 skipped=59 elapsed=0:00:01.263901
2017-01-16 10:00:03,233 - root - INFO - shutting down: new=0 checked=0 skipped=59 elapsed=0:00:01.347719
  • CBC
2017-01-16 10:02:09,287 - root - INFO - shutting down: new=0 checked=78 skipped=105 elapsed=0:02:07.376921
  • The Globe & Mail
2017-01-16 10:22:17,488 - root - INFO - shutting down: new=0 checked=107 skipped=52 elapsed=0:04:03.235425

@edsu
Copy link
Member Author

edsu commented Jan 16, 2017

Cool, it'll be interesting to watch these over time. What is your cron schedule like, and are you using the flock trick to prevent concurrent jobs--until diffengine locks for you?

@ruebot
Copy link
Member

ruebot commented Jan 16, 2017

I had:

  • Canadaland; every 30 mintues
  • Toronto Sun; every hour
  • La Press; every 2 hours
  • CBC; every 2 hours
  • The Globe & Mails; every 3 hours

Now I'm giving this a shot again:

0,30 * * * * /usr/bin/flock -xn /tmp/canadaland.lock -c "/usr/local/bin/diffengine /home/nruest/.canadaland"
#0,30 * * * * /usr/local/bin/diffengine /home/nruest/.diffengine
0,30 * * * * /usr/bin/flock -xn /tmp/globemail.lock -c "/usr/local/bin/diffengine /home/nruest/.globemail"
0,30 * * * * /usr/bin/flock -xn /tmp/torontosun.lock -c "/usr/local/bin/diffengine /home/nruest/.torontosun"
0,30 * * * * /usr/bin/flock -xn /tmp/cbc.lock -c "/usr/local/bin/diffengine /home/nruest/.cbc"
0,30 * * * * /usr/bin/flock -xn /tmp/lapresse.lock -c "/usr/local/bin/diffengine /home/nruest/.lapresse"

@edsu
Copy link
Member Author

edsu commented Jan 16, 2017

@gregjan I guess feed/entry fetching could use HTTP headers to avoid getting the full representation. But I still wouldn't want to do thousands of HTTP HEAD requests every time it runs. So I think some kind of logic for backing off would still be desirable?

Currently I have a time.sleep embedded that prevents diffengine from doing more than 1 request per second. I wanted to do this to be nice to the server...and also not get blocked. So that's the main thing slowing things down at the moment.

If multiple sites are being monitored I guess diffengine could stripe the requests across them, rather than working on a feed at a time...

@edsu
Copy link
Member Author

edsu commented Jan 16, 2017

@ruebot nice thing about using the flock is that you can start it up at lower intervals, every 15 minutes or even every 5 minutes...

@edsu edsu changed the title staleness heuristic staleness heuristic / performance Jan 16, 2017
@ryanfb
Copy link
Contributor

ryanfb commented Apr 13, 2017

After thinking about this a bit, I wonder if a few simple changes might help:

  1. Make sure we iterate over the entries list in creation order (this might already be happening, but just looking at the code I'm not sure https://github.com/DocNow/diffengine/blob/master/diffengine/__init__.py#L67). This way we're always processing the newest entries first instead of potentially spending a bunch of time doing old entries first - which if you're wrapping your diffengine in a timeout, might starve new entries.
  2. Increase the default threshold. I could see something in the range of .5 (day-old gets checked every half day, week-old gets checked every few days, month-old gets checked every couple weeks) to 1.0 (very simple to understand the age/checking relationship - we could also set an upper bound on the value of hotness to e.g. a month so that we're never waiting more than a month between checks).
  3. In conjunction with 2, reset the creation time if an entry is modified (or add a new modified time attribute and use that to calculate the hotness value). This would allow us to have a higher threshold and still catch the cases where e.g. an article is getting updated frequently over the course of a week, or an old article gets a flurry of updates. This could potentially result in some thrashing when a site makes a global change that causes all their entries to show up as modified after content normalization.

ryanfb added a commit to ryanfb/diffengine that referenced this issue Apr 13, 2017
edsu added a commit that referenced this issue Apr 13, 2017
Order Feed entries by Entry creation date, newest-first (see #9)
edsu pushed a commit that referenced this issue Jun 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants