Skip to content

Heartbeat timeout (fixes #601)#656

Merged
leplatrem merged 13 commits intomasterfrom
601-heartbeat-timeout
Jun 1, 2016
Merged

Heartbeat timeout (fixes #601)#656
leplatrem merged 13 commits intomasterfrom
601-heartbeat-timeout

Conversation

@leplatrem
Copy link
Contributor

@leplatrem leplatrem commented May 27, 2016

  • POC using signals r? @tarekziade
  • Write test (using very low timeout and time.sleep?)

@Natim
Copy link
Member

Natim commented May 27, 2016

I personalty like the approach.

for name, callable in heartbeats.items():
status[name] = callable(request)
seconds = float(request.registry.settings['heartbeat_timeout_seconds'])
with timeout(seconds):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: missing comment that it could raise TimeoutException

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a test with the error message displayed to the user?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably go for a global timeout as well as a timeout per heartbeat too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be reimplementing harakiri :(

@leplatrem
Copy link
Contributor Author

I personalty like the approach.

Too bad:

$ kinto start
...
File "/home/mathieu/Code/Mozilla/kinto/.venv/local/lib/python2.7/site-packages/timeoutcontext-1.0.0-py2.7.egg/timeoutcontext/_timeout.py", line 68, in _replace_alarm_handler
    raise_timeout)
ValueError: signal only works in main thread lang=None uid=None
2016-05-27 12:05:03,265 INFO  [kinto.core.initialization][waitress] "GET   /v1/__heartbeat__" 500 (2 ms) request.summary agent=HTTPie/0.9.2 authn_type=None errno=999 lang=None time=2016-05-27T12:05:03 uid=None

@tarekziade
Copy link
Contributor

tarekziade commented May 31, 2016

You should do it the other way

1/ call each heartbeat function into a separate thread.
2/ a successful result flips a flag
3/ on timeout (or before) return with the result.

use concurrent.futures.ThreadPoolExecutor

I am making the assumption that each heartbeat function is isolated enough to run in parallel with the other ones

@tarekziade
Copy link
Contributor

also: r-

@leplatrem leplatrem force-pushed the 601-heartbeat-timeout branch from e14bed7 to ad1bcad Compare May 31, 2016 09:15
@leplatrem leplatrem mentioned this pull request May 31, 2016
17 tasks
| | | endpoint: ``/v1`` redirects to ``/v1/`` and ``/buckets/default/`` |
| | | to ``/buckets/default``. No redirections are made when turned off. |
+-------------------------------------------------+--------------+--------------------------------------------------------------------------+
| kinto.heartbeat_timeout_seconds | ``2`` | The maximum duration of each heartbeat entry. Depending of the amount of |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 seconds seems very low. Since we're running in parallel I think we can use 10 seconds by default maybe ?

error_msg = "'%s' heartbeat has exceeded timeout of %s seconds."
logger.error(error_msg % (name, seconds))

# If any has failed, return a 503 error response.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's one thing missing: when one (or several) heartbeat(s) fails, we need to catch the future(s) exception(s) and log them here with logger.exception so we can track down what happens. The future object holds that exception, so we just need to iterate on them and collect the TB

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another option is to catch them in heartbeat_check and push them in an exceptions list

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also: make sure you call logger.exception in the main thread sequentially on the collected tracebacks, otherwise you might have mangled exceptions since two functions can fail in parallel at the same instant


# A heartbeat is supposed to return True or False, and never raise.
# Just in case, go though results to spot any potential exception.
for future in done:
Copy link
Contributor

@tarekziade tarekziade May 31, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we re-raise here we're getting a 500 I think. I think what you want is (not tested):

for future in done:
    exc = future.exception()
    if exc is not None:
        logger.error("%r failed" % future.__heartbeat_name)
        logger.error(exc)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I did not change the previous behaviour regarding heartbeats (the try/except is managed there)

Copy link
Contributor

@tarekziade tarekziade May 31, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok I see, so there's the convention that the heartbeat functions should catch all errors. But we never know what might happen in external heartbeat functions.

Is that a behaviour we want to keep ?

e.g. do we want the global heartbeat to completely fail when one heartbeat is producing an error, or do we want to log that error and flag that backend to false in the result ?

future.result() # Will re-raise.
exc = future.exception()
if exc is not None:
logger.error("'%s' heartbeat failed." % future.__heartbeat_name)
Copy link
Contributor

@tarekziade tarekziade Jun 1, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you can use %r instead of '%s'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aaah that's why you used %r! :/

@tarekziade
Copy link
Contributor

Looks great r+

@leplatrem leplatrem merged commit 3a8428c into master Jun 1, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants