
Fix host checker #2049

Merged
merged 2 commits into master from fix-hostchecker on Jan 8, 2019

Conversation

buger
Member

@buger buger commented Jan 8, 2019

The current implementation has multiple flaws.
At the moment the first HostDown event fires after the specified number of tries, and if the host is still down, the next HostDown event fires after the same number of tries again. So if `time_wait` is 10s and `failure_trigger_sample_size` is 2, there are 20 seconds between events. At the same time we set the Redis expiration for the "host down" key to `time_wait` + 1, which means the key expires 9 seconds before the second HostDown event. This PR fixes that by setting the expiration time to `time_wait` * `failure_trigger_sample_size`.

Another issue is how a host is considered to be up. At the moment we re-enable the host on the first successful attempt, but if the upstream is unstable, a single healthy attempt does not mean it has recovered. This change makes the HostUp logic work exactly like the HostDown logic, so it now also counts the number of tries.
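The symmetric up/down counting can be sketched as follows. This is a simplified model of the behaviour described above, assuming a hypothetical `hostState` type rather than Tyk's real host checker:

```go
package main

import "fmt"

// hostState counts consecutive probe results so a single healthy (or
// failed) probe cannot flip the host state on its own.
type hostState struct {
	sampleSize int // consecutive samples required to change state
	okCount    int
	failCount  int
	up         bool
}

// report records one probe result and returns the (possibly updated) state.
func (h *hostState) report(healthy bool) bool {
	if healthy {
		h.okCount++
		h.failCount = 0
		if !h.up && h.okCount >= h.sampleSize {
			h.up = true // HostUp only after sampleSize consecutive successes
		}
	} else {
		h.failCount++
		h.okCount = 0
		if h.up && h.failCount >= h.sampleSize {
			h.up = false // HostDown after sampleSize consecutive failures
		}
	}
	return h.up
}

func main() {
	h := &hostState{sampleSize: 2}
	fmt.Println(h.report(true)) // false: one success is not enough
	fmt.Println(h.report(true)) // true: second consecutive success recovers
}
```

Resetting the opposite counter on each probe is what makes the samples "consecutive": a flapping upstream never accumulates enough of either kind to change state.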

Fix #2036

@buger buger merged commit ac2d339 into master Jan 8, 2019
@buger buger deleted the fix-hostchecker branch January 8, 2019 21:35
buger added a commit that referenced this pull request Jan 8, 2019
buger added a commit that referenced this pull request Jan 8, 2019
@letzya
Contributor

letzya commented Jan 22, 2019

@buger It could be nice to also record the response time of each sample and compare it against an `expected_resp_time`: raise a service-degradation event when it is longer, and a back-in-service event when it is <= `expected_resp_time`.


Successfully merging this pull request may close these issues.

Host checks does not work if we have 2 hybrid gateways