Heartbeat timeout (fixes #601) by leplatrem · Pull Request #656 · Kinto/kinto

leplatrem · 2016-05-27T08:32:57Z

POC using signals r? @tarekziade
Write test (using very low timeout and time.sleep?)

Natim · 2016-05-27T08:35:54Z

I personally like the approach.

leplatrem · 2016-05-27T08:56:47Z

kinto/core/views/heartbeat.py

-    for name, callable in heartbeats.items():
-        status[name] = callable(request)
+    seconds = float(request.registry.settings['heartbeat_timeout_seconds'])
+    with timeout(seconds):


nit: missing comment that it could raise TimeoutException

Can we add a test with the error message displayed to the user?

I would probably go for a global timeout as well as a timeout per heartbeat too.

That would be reimplementing harakiri :(

leplatrem · 2016-05-27T10:05:41Z

I personally like the approach.

Too bad:

$ kinto start
...
File "/home/mathieu/Code/Mozilla/kinto/.venv/local/lib/python2.7/site-packages/timeoutcontext-1.0.0-py2.7.egg/timeoutcontext/_timeout.py", line 68, in _replace_alarm_handler
    raise_timeout)
ValueError: signal only works in main thread lang=None uid=None
2016-05-27 12:05:03,265 INFO  [kinto.core.initialization][waitress] "GET   /v1/__heartbeat__" 500 (2 ms) request.summary agent=HTTPie/0.9.2 authn_type=None errno=999 lang=None time=2016-05-27T12:05:03 uid=None
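
The ValueError above comes from a restriction in Python itself: a signal handler can only be installed from the main thread, and the request here was presumably served from a waitress worker thread. A minimal sketch reproducing that restriction with the standard library only (the helper name `try_install_handler` is hypothetical, and SIGINT stands in for whatever signal a signal-based timeout would use):

```python
import signal
import threading


def try_install_handler(outcome):
    """Attempt to register a signal handler and record what happened."""
    try:
        # Assumption: SIGINT is used only because it exists on every platform.
        signal.signal(signal.SIGINT, signal.default_int_handler)
        outcome.append("installed")
    except ValueError as exc:
        # Raised by Python when called outside the main thread.
        outcome.append("ValueError: %s" % exc)


worker_outcome = []
worker = threading.Thread(target=try_install_handler, args=(worker_outcome,))
worker.start()
worker.join()
print(worker_outcome[0])
```

Because the call is made from a worker thread, the handler is never installed and the message mentions the main-thread restriction, which is why a signal-based timeout cannot work inside a threaded server.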

tarekziade · 2016-05-31T08:04:46Z

You should do it the other way

1/ call each heartbeat function into a separate thread.
2/ a successful result flips a flag
3/ on timeout (or before) return with the result.

use concurrent.futures.ThreadPoolExecutor

I am making the assumption that each heartbeat function is isolated enough to run in parallel with the other ones
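
The steps above can be sketched as follows. This is only an illustration of the suggestion, not the code that was merged: `run_heartbeats`, `fast` and `slow` are hypothetical names, and a heartbeat is assumed to be a callable that takes the request and returns True or False.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def run_heartbeats(heartbeats, request, seconds):
    """Run each heartbeat in its own thread; late or failing ones stay False."""
    status = {name: False for name in heartbeats}  # the "flag", unset by default
    pool = ThreadPoolExecutor(max_workers=max(1, len(heartbeats)))
    futures = {pool.submit(func, request): name
               for name, func in heartbeats.items()}
    # Wait for all of them together, for at most `seconds`.
    done, not_done = wait(futures, timeout=seconds)
    for future in done:
        if future.exception() is None:
            status[futures[future]] = bool(future.result())
    # Do not block on stragglers: their threads keep running in the background.
    pool.shutdown(wait=False)
    return status


def fast(request):
    return True


def slow(request):
    time.sleep(1.0)  # stand-in for a backend that does not answer in time
    return True


status = run_heartbeats({"storage": fast, "cache": slow}, request=None, seconds=0.2)
print(status)
```

With these stand-in checks, "storage" is reported as True and "cache" as False. Note that a thread cannot be killed from outside, so a heartbeat that exceeds the timeout still runs to completion; the view simply stops waiting for it.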

tarekziade · 2016-05-31T08:05:58Z

also: r-

tarekziade · 2016-05-31T10:07:11Z

docs/configuration/settings.rst

 |                                                 |              | endpoint: ``/v1`` redirects to ``/v1/`` and ``/buckets/default/``        |
 |                                                 |              | to ``/buckets/default``. No redirections are made when turned off.       |
 +-------------------------------------------------+--------------+--------------------------------------------------------------------------+
+| kinto.heartbeat_timeout_seconds                 | ``2``        | The maximum duration of each heartbeat entry. Depending of the amount of |


2 seconds seems very low. Since we're running in parallel, I think we could use 10 seconds by default, maybe?

tarekziade · 2016-05-31T13:10:49Z

kinto/core/views/heartbeat.py

+        error_msg = "'%s' heartbeat has exceeded timeout of %s seconds."
+        logger.error(error_msg % (name, seconds))
+
+    # If any has failed, return a 503 error response.


there's one thing missing: when one or several heartbeats fail, we need to catch the futures' exceptions and log them here with logger.exception so we can track down what happened. Each future object holds its exception, so we just need to iterate over them and collect the tracebacks

another option is to catch them in heartbeat_check and push them in an exceptions list

also: make sure you call logger.exception in the main thread sequentially on the collected tracebacks, otherwise you might have mangled exceptions since two functions can fail in parallel at the same instant
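
A minimal sketch of that idea, assuming Python 3 (where an exception carries its traceback in `__traceback__`). The names `collect_failures`, `broken` and `healthy`, and the logger name, are hypothetical; the logging happens in the main thread, one failure after the other, once all futures have completed.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, wait

logger = logging.getLogger("heartbeat_sketch")  # assumed logger name


def collect_failures(futures):
    """Return (name, exception) pairs for completed futures that raised."""
    failures = []
    for future, name in futures.items():
        exc = future.exception()  # None when the heartbeat returned normally
        if exc is not None:
            failures.append((name, exc))
    return failures


def broken(request):
    raise RuntimeError("backend unreachable")


def healthy(request):
    return True


with ThreadPoolExecutor(max_workers=2) as pool:
    futures = {pool.submit(broken, None): "storage",
               pool.submit(healthy, None): "cache"}
    wait(futures)  # no timeout here: the sketch is only about error reporting

failures = collect_failures(futures)
for name, exc in failures:
    # Logged sequentially from the main thread, with the original traceback.
    logger.error("%r heartbeat failed.", name,
                 exc_info=(type(exc), exc, exc.__traceback__))
```

Collecting first and logging afterwards keeps each traceback in one piece, even when two heartbeats fail at the same moment.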

tarekziade · 2016-05-31T14:03:52Z

kinto/core/views/heartbeat.py

+
+    # A heartbeat is supposed to return True or False, and never raise.
+    # Just in case, go though results to spot any potential exception.
+    for future in done:


if we re-raise here we're getting a 500 I think. I think what you want is (not tested):

    for future in done:
        exc = future.exception()
        if exc is not None:
            logger.error("%r failed" % future.__heartbeat_name)
            logger.error(exc)

Yes, I did not change the previous behaviour regarding heartbeats (the try/except is managed there)

ok I see, so there's the convention that the heartbeat functions should catch all errors. But we never know what might happen in external heartbeat functions.

Is that a behaviour we want to keep?

e.g. do we want the global heartbeat to fail completely when one heartbeat produces an error, or do we want to log that error and flag that backend as false in the result?

tarekziade · 2016-06-01T11:30:11Z

kinto/core/views/heartbeat.py

-        future.result()  # Will re-raise.
+        exc = future.exception()
+        if exc is not None:
+            logger.error("'%s' heartbeat failed." % future.__heartbeat_name)


nit: you can use %r instead of '%s'

aaah that's why you used %r! :/

tarekziade · 2016-06-01T11:31:13Z

Looks great r+

leplatrem added the in progress label May 27, 2016

leplatrem reviewed May 27, 2016

leplatrem added 3 commits May 31, 2016 10:58

POC for heartbeat timeout (fixes #601)

d8d8f0c

@Natim review

b028b87

@tarekziade review

ad1bcad

leplatrem force-pushed the 601-heartbeat-timeout branch from e14bed7 to ad1bcad Compare May 31, 2016 09:15

leplatrem added 2 commits May 31, 2016 11:36

Use python3 backport of concurrent.futures

c9c1f91

Add timeout test (fixes #601)

8cd04a4

leplatrem mentioned this pull request May 31, 2016

Preparing release 3.2.0 #663

Merged

17 tasks

tarekziade reviewed May 31, 2016

leplatrem added 3 commits May 31, 2016 12:43

Fix default timeout setting

3eefa50

@tarekziade review

Wait for futures in parallel

bcf491f

@tarekziade review

Remove leftover

3d8dbe6

tarekziade reviewed May 31, 2016

leplatrem added 3 commits May 31, 2016 15:18

Fix safety check with py35

f6ae938

Do not swallow heartbeat exceptions

f35164f

@tarekziade review

Fix pool executor with py3.5

81a2c34

tarekziade reviewed May 31, 2016

Catch unexpected heartbeat errors

3f955b8

@tarekziade review

tarekziade reviewed Jun 1, 2016

Use representation instead of string

bb17d5f

leplatrem merged commit 3a8428c into master Jun 1, 2016
