
Fix host checker #2049

Merged
merged 2 commits into master from fix-hostchecker on Jan 8, 2019

Conversation

buger
Member

@buger buger commented Jan 8, 2019

The current implementation has multiple flaws.
At the moment the first HostDown event fires after the specified number of tries, and if the host is still down, the next HostDown event fires after the same number of tries again. So if `time_wait` is 10s and `failure_trigger_sample_size` is 2, there are 20 seconds between events. At the same time we set the Redis expiration for the "host down" key to `time_wait` + 1, which means the key expires 9 seconds before the second HostDown event. This PR fixes that by setting the expiration time to `time_wait` * `failure_trigger_sample_size`.

Another issue is how a host is considered to be up. At the moment we re-enable the host on the first successful attempt, but if the upstream is unstable, a single healthy attempt does not mean it has recovered. This change makes the HostUp logic work exactly like the HostDown logic, so it now also counts the number of tries.
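The symmetric up/down counting can be sketched as follows. This is a simplified model of the behaviour described above, assuming a hypothetical `hostState` type rather than Tyk's real host checker:

```go
package main

import "fmt"

// hostState counts consecutive probe results so a single healthy (or
// failed) probe cannot flip the host state on its own.
type hostState struct {
	sampleSize int // consecutive samples required to change state
	okCount    int
	failCount  int
	up         bool
}

// report records one probe result and returns the (possibly updated) state.
func (h *hostState) report(healthy bool) bool {
	if healthy {
		h.okCount++
		h.failCount = 0
		if !h.up && h.okCount >= h.sampleSize {
			h.up = true // HostUp only after sampleSize consecutive successes
		}
	} else {
		h.failCount++
		h.okCount = 0
		if h.up && h.failCount >= h.sampleSize {
			h.up = false // HostDown after sampleSize consecutive failures
		}
	}
	return h.up
}

func main() {
	h := &hostState{sampleSize: 2}
	fmt.Println(h.report(true)) // false: one success is not enough
	fmt.Println(h.report(true)) // true: second consecutive success recovers
}
```

Resetting the opposite counter on each probe is what makes the samples "consecutive": a flapping upstream never accumulates enough of either kind to change state.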

Fix #2036

@buger buger merged commit ac2d339 into master Jan 8, 2019
@buger buger deleted the fix-hostchecker branch January 8, 2019 21:35
buger added a commit that referenced this pull request Jan 8, 2019
buger added a commit that referenced this pull request Jan 8, 2019
@letzya
Contributor

letzya commented Jan 22, 2019

@buger It could be nice to also record the response time of each sample and compare it against an `expected_resp_time`: raise a service-degradation event when it is longer, and a back-in-service event when it is <= `expected_resp_time`.


Successfully merging this pull request may close these issues.

Host checks does not work if we have 2 hybrid gateways