[consul] Add support for consul 1.0.0 #876
Merged
What does this PR do?
Consul 1.0.0 is the first version that HashiCorp considers "stable" in
terms of API calls. As such, previous versions of consul may return
different payloads, and the API calls that the Datadog agent currently
uses to determine cluster leadership no longer work on 1.0.0 and later.
The current method of determining cluster leadership is to retrieve the
local agent's IP:Port combination and compare it to the IP:Port of the
leader. While consul 0.7.0 provides a more robust way of determining
leadership (literally a true/false key), this method is not available
in consul 0.6.4, so using it would break compatibility. Therefore this
commit makes the smallest possible change to avoid breaking that
compatibility, by instead using a different method of fetching the
local agent's IP:Port combination.
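As a rough illustration of the IP:Port comparison described above, here is a minimal sketch. The function names and the payload key names (Member.Addr, Member.Tags.port, Config.AdvertiseAddr, Config.Ports.Server) are assumptions made for illustration, not taken from the check itself; the leader value is assumed to be the "ip:port" string that consul reports for the cluster leader.

```python
def agent_url(agent_self):
    """Build the local agent's "ip:port" string from a /v1/agent/self payload.

    Key names are assumptions: one set is looked up under "Member",
    the other under "Config". Whichever set is present is used.
    """
    member = agent_self.get("Member") or {}
    config = agent_self.get("Config") or {}

    # "or" falls through to the second lookup when the first returns None.
    addr = member.get("Addr") or config.get("AdvertiseAddr")
    port = (member.get("Tags") or {}).get("port") or (config.get("Ports") or {}).get("Server")

    if addr is None or port is None:
        return None  # neither set of keys was present
    return "{0}:{1}".format(addr, port)


def is_leader(agent_self, leader):
    """Return True when the local agent's ip:port matches the leader's."""
    local = agent_url(agent_self)
    return local is not None and local == leader
```

With payloads shaped this way, an agent that only exposes the Config keys and one that only exposes the Member keys both resolve to the same kind of "ip:port" string, so the comparison against the leader does not need to know which consul version it is talking to.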
Tests have been updated (and somewhat refactored) and now support
versions 0.6.4, 0.7.2, and 1.0.0.
Technical details
As noted above, the consul API did not guarantee stability until version 1.0.0. In previous versions of consul, querying /v1/agent/self would, among other things, return:
This works for consul versions 0.6.4 and 0.7.2, but as of version 1.0.0 (and possibly earlier, I haven't checked!) the address and server port keys are gone.
As of newer versions it is now possible to find this information in:
This key seems to function only intermittently in consul 0.6.4, so we can't outright replace the old business logic with the new one. Since the information will be in either one set of keys or another, and "or" is the operative word here, I've dropped in an "or" operator. agent_addr and agent_port will become whichever key does not return None (unless they both do).
While there is a more robust way of detecting cluster leadership from /v1/agent/self via
{ 'Stats': { 'consul': { 'leader': true }}}
this key is not available in version 0.6.4, and refactoring the check to use both methods would likely introduce more complexity. While comparing IP:Port is a bit roundabout, it works.
Motivation
Wealthsimple has recently deployed consul, which coincided nicely with the version 1.0.0 release. We noticed right away that Datadog was hardly reporting any metrics at all. Digging into the check showed that the bulk of the metrics are retrieved only from the leader node, and that the leader node is determined by the IP:Port check described above, which led us down the road to writing this fix.
Testing Guidelines
Tests are expected to pass for all three flavours.
Some refactoring of the consul tests was required to make this work. As a start, the server.json config file has now been divided up into three configs, one each for consul versions 0.6.4, 0.7.2, and 1.0.0. This is necessary because version 1.0.0 requires acl_agent_token to be set, whereas previous versions do not recognize this key and refuse to start. Configs should probably be immutable, so having one per version will support that notion going forward.
consul.rake has also been updated with a few new tricks:
We no longer run docker create, docker cp, and docker start three times; instead we use docker run for all three containers with the --volume flag to mount the config file as read-only at runtime.
wait_on_docker_logs has been updated to include "agent: Synced node info", which is how 1.0.0 agents assert readiness.
Versioning
manifest.json
CHANGELOG.md. Please use Unreleased as the date in the title for the new section.
Additional Notes
We have a ticket open with Datadog's excellent support team where more context may be found (ZD ticket 116528).