Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[consul] Add support for consul 1.0.0 #876

Merged
merged 3 commits into from
Nov 22, 2017

Conversation

byronwolfman
Copy link
Contributor

What does this PR do?

Consul 1.0.0 is the first version that Hashicorp considers "stable" in
terms of API calls. As such, previous versions of consul may return
different payloads, and the API calls that the Datadog agent currently
uses to determine cluster leadership no longer work on 1.0.0 and later.

The current method of determining cluster leadership is to retrieve the
local agent's IP:Port combination and compare it to the IP:Port of the
leader. While consul 0.7.0 provides a more robust way of determining
leadership (literally a true/false key), this method is not available
in consul 0.6.4, so using it would break compatibility. Therefore this
commit makes the smallest possible change to avoid breaking that
compatibility, by instead using a different method of fetching the
local agent's IP:Port combination.

Tests have been updated (and somewhat refactored) and now supports
versions 0.6.4, 0.7.2, and 1.0.0.

Technical details

As noted above, the consul API has not guaranteed stability until version 1.0.0. In previous versions of consul, querying /v1/agent/self would among other things return:

{ 'Config': { 'AdvertiseAddr': '1.2.3.4', 'Ports': { 'Server': 8300 }}}

This works for consul versions 0.6.4 and 0.7.2, but as of version 1.0.0 (and possibly earlier, I haven't checked!) the address and server port keys are gone.

As of newer versions it is now possible to find this information in:

{ 'Member': { 'Addr': '1.2.3.4', 'Tags': { 'port': 8300 }}}

This key seems to function only intermittently in consul 0.6.4 so we can't outright replace the old business logic with the new one. Since the information will be in either one set of keys or another, and "or" is the operative word here, I've dropped in an or operator. agent_addr and agent_port will become whichever key does not return None (unless they both do).

While there is a more robust way of detecting cluster leadership from /v1/agent/self via { 'Stats': { 'consul': { 'leader': true }}} this key is not available in version 0.6.4, and refactoring the check to use both methods would likely introduce more complexity. While comparing IP:Port is a bit roundabout, it works.

Motivation

Wealthsimple has recently deployed consul, which coincided nicely with the version 1.0.0 release. We noticed right away that Datadog was hardly reporting any metrics at all. Digging into the check shows that the bulk of the metrics are retrieved only from the leader node, and the leader node is determined by the above-described IP:Port check, and led us down the road to writing this fix.

Testing Guidelines

Tests are expected to pass for all three flavours.

Some refactoring of consul tests were required to make this work. As a start, the server.json config file has now been divided up into three configs, one each for consul versions 0.6.4, 0.7.2, and 1.0.0. This is necessary because version 1.0.0 requires acl_agent_token to be set, whereas previous versions do not recognize this key and refuse to start. Configs should probably be immutable, so having one per version will support that notion going forward.

consul.rake has also been updated with a few new tricks:

  1. All three consul containers load the server config now. This is necessary for consul 1.0.0 where the agent token must be present at the beginning.
  2. Because all containers must load the config, it's a bit of a pain to do docker create, docker cp, and docker start three times; instead we use docker run for all three containers with the --volume flag to mount the config file as read-only at runtime.
  3. Last but not least, wait_on_docker_logs has been updated to include agent: Synced node info which is how 1.0.0 agents assert readiness.

Versioning

  • Bumped the version check in manifest.json
  • Updated CHANGELOG.md. Please use Unreleased as the date in the title
    for the new section.

Additional Notes

We have a ticket open with Datadog's excellent support team where more context may be found (ZD ticket 116528).

Consul 1.0.0 is the first version that Hashicorp considers "stable" in
terms of API calls. As such, previous versions of consul may return
different payloads, and the API calls that the Datadog agent currently
uses to determine cluster leadership no longer work on 1.0.0 and later.

The current method of determining cluster leadership is to retrieve the
local agent's IP:Port combination and compare it to the IP:Port of the
leader. While consul 0.7.0 provides a more robust way of determining
leadership (literally a true/false key), this method is not available
in consul 0.6.4, so using it would break compatibility. Therefore this
commit makes the smallest possible change to avoid breaking that
compatibility, by instead using a different method of fetching the
local agent's IP:Port combination.

Tests have been updated (and somewhat refactored) and now supports
versions 0.6.4, 0.7.2, and 1.0.0.
Byron Wolfman and others added 2 commits November 16, 2017 11:49
While submitting some changes to this rake task to support consul
version 1.0.0, I may have inadvertently angered rubocop with a 162-
character long line. This commit breaks things up to be more readable
and to satisfy rubocop.
We should not try to define `consul_config` based on the FLAVOR_VERSION
environment variable since said variable may be undefined. Instead we
should define `consul_config` based on the `consul_version` variable
which correctly handles the presence and absence of FLAVOR_VERSION.
@byronwolfman
Copy link
Contributor Author

Hey! Just bumping this to find out what we need to do to move this forward. Thanks!

@masci
Copy link
Contributor

masci commented Nov 22, 2017

@byronwolfman I'm on it, sorry this took so long!

Copy link
Contributor

@masci masci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for the thorough explanation of the PR, that made the review process super easy 💯

@masci masci added this to the 5.21 milestone Nov 22, 2017
@masci masci merged commit e77c1e4 into DataDog:master Nov 22, 2017
@byronwolfman byronwolfman deleted the consul-1.0.0-fix branch October 2, 2018 02:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants