Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add etcd support with self and store metrics. #1235

Merged
merged 11 commits into from
Dec 26, 2014
93 changes: 93 additions & 0 deletions checks.d/etcd.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,93 @@
# stdlib
import time
from hashlib import md5
import urllib2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

time, urllib2 and md5 do not seem to be used


# project
from checks import AgentCheck
from util import headers

# 3rd party
import simplejson as json
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simplejson is not used

import requests

class Etcd(AgentCheck):
def check(self, instance):
if 'url' not in instance:
raise Exception('etcd instance missing "url" value.')

# Load values from the instance config
url = instance['url']
instance_tags = instance.get('tags', [])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add the url as a tag in case someone is using several instances in the config file, otherwise there is no way of getting unique timeseries by instance.

default_timeout = self.init_config.get('default_timeout', 5)
timeout = float(instance.get('timeout', default_timeout))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure, if it's necessary to have a 2-level setup from the config file, how about putting the default as DEFAULT_TIMEOUT global var in the check, and allow the override in the yaml file (adding some documentation in the yaml file for that)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, not sure I follow. What's the 2-level setup?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for being unclear.

0/ You've got a fixed default 5 (which could be put as a global var DEFAULT_TIMEOUT)
1/ It can be overriden by init_config['default_timeout'] : LEVEL 1
2/ It can be overriden per instance with `instance['timeout'] : LEVEL 2

Basically I am saying that we can get rid of LEVEL 1 and add documentation for the LEVEL 2 override.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries. As I said I copied the mesos plugin and to boot I'm not much of a python programmer so I'm quite happy to take direction. ;)


tags = instance_tags
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new variable assignment could probably be avoided.


storeResponse = self.get_store_metrics(url, timeout)
if storeResponse is not None:
for key in ['getsSuccess', 'getsFail', 'setsSuccess', 'setsFail', 'deleteSuccess', 'deleteFail', 'updateSuccess', 'updateFail', 'createSuccess', 'createFail', 'compareAndSwapSuccess', 'compareAndSwapFail', 'compareAndDeleteSuccess', 'compareAndDeleteFail', 'expireCount']:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be great if you can make this large array as a class variable.

class Etcd(AgentCheck):
    STORE_RATES = {
        'getsSuccess': 'etcd.store.getssucess',
        ...
    }
for key, metric_name in self.STORE_RATES:
    self.rate(metric_name, storeResponse.get(key), tags=tags)

self.rate('etcd.store.' + key, storeResponse[key], tags=tags)

for key in ['watchers']:
self.gauge('etcd.store.' + key, storeResponse[key], tags=tags)

selfResponse = self.get_self_metrics(url, timeout)
if selfResponse is not None:
if selfResponse['state'] == 'leader':
self.gauge('etcd.self.leader', 1, tags=tags)
else:
self.gauge('etcd.self.leader', 0, tags=tags)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe for these metrics we want to remove the confusing '.self' namespace?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a bit confusing, but it's also what etcd actually calls them. I erred on the side of sticking to the products names, since often changing those names — at the whims of DD's plugin authors — creates a mismatch between those with knowledge of the product and the names used in DD.

But that's an opinion weakly held. :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining, I only looked quickly at some etcd doc and plugins (and saw only .store and .leader namespaces) and I have used it just a few times, so definitely not an expert on the topic. If these names make sense to an etcd guru please keep them 😉


for key in ['recvAppendRequestCnt', 'sendAppendRequestCnt']:
self.rate('etcd.self.' + key, selfResponse[key], tags=tags)

for key in ['sendPkgRate', 'sendBandwidthRate']:
self.gauge('etcd.self.' + key, selfResponse[key], tags=tags)

def get_self_metrics(self, url, timeout):
return self.get_json(url + "/v2/stats/self", timeout)

def get_store_metrics(self, url, timeout):
return self.get_json(url + "/v2/stats/store", timeout)

def get_json(self, url, timeout):
# Use a hash of the URL as an aggregation key
aggregation_key = md5(url).hexdigest()
try:
r = requests.get(url, timeout=timeout)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Importing headers is a good thing, but it should be used here ^ add headers=headers(self.agentConfig)

except requests.exceptions.Timeout as e:
# If there's a timeout
self.timeout_event(url, timeout, aggregation_key)
self.warning("Timeout when hitting %s" % url)
return None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this you can use a new feature called service checks, that can allow you to report a state to the backend with an error message, the big difference with events is that you can alert on service checks the same way you do with metrics.
See https://github.com/DataDog/dd-agent/pull/1202/files for an example.


if r.status_code != 200:
self.status_code_event(url, r, aggregation_key)
self.warning("Got %s when hitting %s" % (r.status_code, url))
return None

# Condition for request v1.x backward compatibility
if hasattr(r.json, '__call__'):
return r.json()
else:
return r.json
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This backwards compatibility check can probably go away, now that we use omnibus and pin the deps versions, @remh can you confirm?



def timeout_event(self, url, timeout, aggregation_key):
self.event({
'timestamp': int(time.time()),
'event_type': 'http_check',
'msg_title': 'URL timeout',
'msg_text': '%s timed out after %s seconds.' % (url, timeout),
'aggregation_key': aggregation_key
})

def status_code_event(self, url, r, aggregation_key):
self.event({
'timestamp': int(time.time()),
'event_type': 'http_check',
'msg_title': 'Invalid reponse code for %s' % url,
'msg_text': '%s returned a status of %s' % (url, r.status_code),
'aggregation_key': aggregation_key
})
7 changes: 7 additions & 0 deletions conf.d/etcd.yaml.example
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
init_config:
# time to wait on a etcd API request
# default_timeout: 5

instances:
# url: the API endpoint of your etcd instance
# - url: "https://server:port"