Update reminder state with CAS instead of Put #193
Merged
Problem
After a network failure, we ended up with a hanging reminder in Consul-Alerts. The reminder looked like this:
The health check that initially triggered the reminder was already passing, but the reminder still existed. Note its inconsistent state: the status is critical, but according to the output the agent is alive and reachable.
Solution
The problem was caused by a race condition between consul/client.UpdateCheckData (#L346), which updates the output of a reminder when the output of the health check changes, and check-handler.notify (#L130), which deletes the reminder once the check is back to passing.
To reproduce the race condition, I modified UpdateCheckData so that the update always happens after the reminder is deleted. The relevant part of the function:
Then I started Consul and Nomad in dev mode on the local machine:
consul-alerts config:
consul-alerts/config/notif-profiles/log_with_reminders
consul-alerts/config/notif-selection/hosts/Aleksandrs-MacBook-Pro.local
To create a reminder, I blocked port 4646. On OS X, I added the following line to the file /private/etc/pf.conf:
Then I reloaded the configuration:
After the reminder was created, I unblocked the port. The reminder was first removed by check-handler.notify and then added back with the updated output by the modified UpdateCheckData code above.
To fix the issue, I used kvApi.CAS instead of kvApi.Put. CAS checks the ModifyIndex of the entry before updating it, so an update based on a stale read is rejected. More information about CAS updates is available here
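The difference between the two calls can be illustrated with a minimal, self-contained sketch. It does not use the real Consul client: memKV, entry, staleUpdate, and the key name are hypothetical stand-ins written for this illustration, and the CAS semantics are a simplified model of what Consul does with ModifyIndex. It replays the same interleaving as the reproduction above (read, delete, then a late update) once with Put and once with CAS.

```go
package main

import "fmt"

// entry mimics a Consul KV pair: a value plus the index of its last write.
type entry struct {
	Value       string
	ModifyIndex uint64
}

// memKV is a hypothetical in-memory stand-in for the Consul KV store.
type memKV struct {
	data  map[string]entry
	index uint64 // monotonically increasing write counter
}

func newMemKV() *memKV { return &memKV{data: map[string]entry{}} }

// Put writes unconditionally and re-creates the key if it was deleted.
func (kv *memKV) Put(key, value string) {
	kv.index++
	kv.data[key] = entry{Value: value, ModifyIndex: kv.index}
}

// Get returns the entry and whether the key exists.
func (kv *memKV) Get(key string) (entry, bool) {
	e, ok := kv.data[key]
	return e, ok
}

// Delete removes the key.
func (kv *memKV) Delete(key string) {
	kv.index++
	delete(kv.data, key)
}

// CAS writes only if the key's current ModifyIndex equals expected.
// expected == 0 means "create only if the key does not exist".
func (kv *memKV) CAS(key, value string, expected uint64) bool {
	cur, ok := kv.data[key]
	if expected == 0 {
		if ok {
			return false
		}
	} else if !ok || cur.ModifyIndex != expected {
		return false
	}
	kv.index++
	kv.data[key] = entry{Value: value, ModifyIndex: kv.index}
	return true
}

// staleUpdate replays the race: the updater reads the reminder, the
// notifier deletes it, and only then does the updater write its change.
// It reports whether the reminder exists afterwards.
func staleUpdate(useCAS bool) bool {
	kv := newMemKV()
	const key = "reminders/node/check" // hypothetical key name

	kv.Put(key, "critical: agent unreachable")
	seen, _ := kv.Get(key) // updater reads the reminder
	kv.Delete(key)         // notifier deletes it (check is passing again)

	if useCAS {
		kv.CAS(key, "critical: agent alive", seen.ModifyIndex) // rejected
	} else {
		kv.Put(key, "critical: agent alive") // resurrects the reminder
	}

	_, exists := kv.Get(key)
	return exists
}

func main() {
	fmt.Println("hanging reminder with Put:", staleUpdate(false))
	fmt.Println("hanging reminder with CAS:", staleUpdate(true))
}
```

With Put, the late write brings the deleted reminder back; with CAS, the write carries the ModifyIndex seen at read time and is dropped because the entry no longer exists.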