[kubelet] port cadvisor metric collection from legacy kubernetes check #1339
0243101 to
b908076
Compare
b951a17 to
69871ee
Compare
|
As autodiscovering prometheus/cadvisor might be pretty tricky, we agreed on going with manual configuration. This can be done by overriding the |
|
|
||
| For checks that are not listed here, please refer to [Legacy development Setup](docs/dev/legacy.md). | ||
|
|
||
| If you updated the test requirements for a check, you will need to run `tox --recreate` for changes to be effective. |
There was a problem hiding this comment.
r/you will need to run `tox --recreate` for changes/run `tox --recreate` for changes
| ### | ||
| ### Metric collection for legacy (< 1.7.6) clusters via the kubelet's | ||
| ### cadvisor port. | ||
| ### This port is closed by default on k8s 1.7 and OpenShift, make sure |
There was a problem hiding this comment.
r/make sure you enable it/enable it
b7c6587 to
a5ca319
Compare
|
I left some minor phrasing nits to follow the contributing guidelines (https://github.com/DataDog/documentation/blob/master/CONTRIBUTING.md). It seems that we are collecting a new metric:
should we update https://github.com/DataDog/integrations-core/blob/master/kubelet/metadata.csv or https://github.com/DataDog/integrations-core/blob/master/kubernetes/metadata.csv accordingly? |
|
Cadvisor exposes network metrics per container, but as there is one network namespace per pod, the new endpoint exposes them per pod. The agent6 cadvisor mode will align with prometheus mode (which will have correct host sums), and break compatibility with agent5:
- Agent5: sends one gauge per container in the pod; only the pause container has a non-zero value
- Agent6 + prometheus: correctly tagged by pod tags only
- Agent6 + cadvisor: moves the network metrics to pod level, mirroring prometheus mode |
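A minimal sketch of the pod-level routing described above, under stated assumptions: `POD_LEVEL_PATTERNS` and `should_submit` are hypothetical names introduced for illustration, and the pattern list is an assumption, not taken from this PR. The idea is that network metrics are submitted only for pod entries, while every other metric is submitted only for container entries.

```python
from fnmatch import fnmatch

# Assumed pattern list, for illustration only
POD_LEVEL_PATTERNS = ["kubernetes.network.*"]


def should_submit(metric, is_pod):
    """Return True when a metric belongs to the entity type being processed.

    Pod-level metrics (network) are kept only for pod entries,
    all other metrics are kept only for container entries.
    """
    is_pod_metric = any(fnmatch(metric, pat) for pat in POD_LEVEL_PATTERNS)
    return is_pod_metric == is_pod
```

With this rule a pod entry reports `kubernetes.network.*` once, instead of one gauge per container where only the pause container is non-zero.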
51ab121 to
e2c9487
Compare
| self._update_metrics(instance) | ||
|
|
||
| def _update_metrics(self, instance): | ||
| def parse_quantity(s): |
There was a problem hiding this comment.
Why do you define this function inside this method?
This function isn't used.
There was a problem hiding this comment.
dead-code copy-pasted from agent5, removing
| except Exception as e: | ||
| self.log.error("Unable to collect metrics for container: {0} ({1})".format(c_id, e)) | ||
|
|
||
| def _publish_raw_metrics(self, metric, dat, tags, is_pod, depth=0): |
There was a problem hiding this comment.
Is it possible to have a docstring for this method?
Like:
def _publish_raw_metrics(self, metric, dat, tags, is_pod, depth=0):
    """
    Blahblah
    metric: type
    dat: type
    ...
    """
| LEGACY_CADVISOR_METRICS_PATH = '/api/v1.3/subcontainers/' | ||
|
|
||
|
|
||
| class CadvisorScraper(): |
There was a problem hiding this comment.
| return url | ||
|
|
||
| def retrieve_cadvisor_metrics(self, timeout=10): | ||
| return requests.get(self.cadvisor_legacy_url, timeout=timeout).json() |
There was a problem hiding this comment.
Is the self.cadvisor_legacy_url attribute defined in this class?
| metrics = self.retrieve_cadvisor_metrics() | ||
|
|
||
| if not metrics: | ||
| raise Exception('No metrics retrieved cmd=%s' % self.metrics_cmd) |
There was a problem hiding this comment.
Is the self.metrics_cmd attribute defined in this class?
| try: | ||
| self._update_container_metrics(instance, subcontainer) | ||
| except Exception as e: | ||
| self.log.error("Unable to collect metrics for container: {0} ({1})".format(c_id, e)) |
There was a problem hiding this comment.
self.log isn't defined, this could be a global from (IIRC):
logger = logging.getLogger(__name__)
There was a problem hiding this comment.
this class is a mixin intended to be used inside an AgentCheck class, so it'll use the agentcheck's self.log as it uses its self.gauge. Adding a docstring
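To illustrate the mixin pattern described in this reply, here is a minimal sketch. `ScraperMixin`, `FakeAgentCheck` and `KubeletLikeCheck` are hypothetical stand-ins, not the real `CadvisorScraper` or `AgentCheck`; the only point is that the mixin relies on `self.log` and `self.gauge` being provided by the class it is mixed into.

```python
import logging


class ScraperMixin(object):
    """Hypothetical mixin: expects the host class to provide self.log and self.gauge."""

    def report(self, name, value, tags=None):
        try:
            self.gauge(name, float(value), tags=tags or [])
        except (TypeError, ValueError) as e:
            # self.log comes from the host class, not from this mixin
            self.log.error("Unable to report %s: %s", name, e)


class FakeAgentCheck(object):
    """Hypothetical stand-in for the agent's base check class."""

    def __init__(self):
        self.log = logging.getLogger(__name__)
        self.submitted = []

    def gauge(self, name, value, tags=None):
        self.submitted.append((name, value, tags))


class KubeletLikeCheck(ScraperMixin, FakeAgentCheck):
    """The mixin methods resolve self.log and self.gauge on the combined class."""
    pass
```

Used on its own, `ScraperMixin().report(...)` would fail with an `AttributeError`, which is why a docstring stating the expected host class is helpful.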
| self._publish_raw_metrics(metric, dat[-1], tags, is_pod, depth + 1) | ||
|
|
||
| def _update_container_metrics(self, instance, subcontainer): | ||
| tags = [] |
There was a problem hiding this comment.
tags definition isn't needed as you are redefining it in each condition below.
| # Let's see who we have here | ||
| if is_pod: | ||
| tags = tags_for_pod(pod_uid, True) | ||
| elif (in_static_pod and k_container_name): |
There was a problem hiding this comment.
nit: the parentheses are redundant
| tags += tags_for_pod(pod_uid, True) | ||
| tags.append("kube_container_name:%s" % k_container_name) | ||
| else: # Standard container | ||
| if self.container_filter.is_excluded(cid): |
There was a problem hiding this comment.
Same for self.container_filter: what is its scope?
| return False | ||
|
|
||
|
|
||
| class ContainerFilter: |
There was a problem hiding this comment.
Old style class, see the other mention.
|
|
||
| self._update_metrics(instance, cadvisor_url, pod_list, container_filter) | ||
|
|
||
| def _retrieve_cadvisor_metrics(self, cadvisor_url, timeout=10): |
There was a problem hiding this comment.
This method doesn't use the instance; can it be static?
| """ | ||
| Recusively parses and submit metrics for a given entity, until | ||
| reaching self.max_depth. | ||
| Nested metric names are flattened: memory/usage -> memory.usage |
There was a problem hiding this comment.
nit: I really enjoy docstrings with the type of each parameter
| self._publish_raw_metrics(metric + '.%s' % k.lower(), v, tags, is_pod, depth + 1) | ||
|
|
||
| elif isinstance(dat, list): | ||
| self._publish_raw_metrics(metric, dat[-1], tags, is_pod, depth + 1) |
There was a problem hiding this comment.
Would it be useful to handle a potential else case here?
There was a problem hiding this comment.
else would only be a pass. I'm not sure we should log it
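For readers following the recursion discussed in this thread, here is a standalone sketch of the flattening logic. It is not the method from the PR: `flatten_metrics` is a hypothetical helper that returns `(name, value)` pairs instead of submitting gauges, and the default `max_depth` is an assumption. Unsupported types fall through silently, which is the behaviour the reply above argues for.

```python
def flatten_metrics(prefix, dat, max_depth=10, depth=0):
    """Return (name, value) pairs from a nested stats structure (illustrative only)."""
    if depth >= max_depth:
        return []
    # bool is a subclass of int, so exclude it explicitly
    if isinstance(dat, (int, float)) and not isinstance(dat, bool):
        return [(prefix, float(dat))]
    if isinstance(dat, dict):
        out = []
        for k, v in dat.items():
            # Nested names are flattened with dots: memory/usage -> memory.usage
            out.extend(flatten_metrics('%s.%s' % (prefix, k.lower()), v, max_depth, depth + 1))
        return out
    if isinstance(dat, list) and dat:
        # Lists are treated as time series: only the latest entry is kept
        return flatten_metrics(prefix, dat[-1], max_depth, depth + 1)
    # Anything else (strings, None, empty lists) is silently skipped
    return []
```

Note that this sketch also guards against empty lists before indexing with `[-1]`.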
| is_pod = False | ||
| in_static_pod = False | ||
| cid = subcontainer.get('id') | ||
| pod_uid = subcontainer.get('labels', []).get('io.kubernetes.pod.uid') |
There was a problem hiding this comment.
Can the default be a dict instead of a list?
Especially since .get('io.kubernetes.pod.uid') is chained on it (a list has no get method).
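A short demonstration of the issue raised here, using a hypothetical cadvisor entry that has no `labels` key:

```python
# Hypothetical cadvisor entry without a 'labels' key
subcontainer = {"id": "abc123"}

# With a list default, the chained .get() raises because lists have no .get()
try:
    subcontainer.get('labels', []).get('io.kubernetes.pod.uid')
    failed = False
except AttributeError:
    failed = True

# With a dict default, the lookup simply returns None
pod_uid = subcontainer.get('labels', {}).get('io.kubernetes.pod.uid')
```

The list default only goes unnoticed as long as every entry carries a `labels` key.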
| in_static_pod = False | ||
| cid = subcontainer.get('id') | ||
| pod_uid = subcontainer.get('labels', []).get('io.kubernetes.pod.uid') | ||
| k_container_name = subcontainer.get('labels', []).get('io.kubernetes.container.name') |
| return | ||
| tags = list(set(tags + instance.get('tags', []))) | ||
|
|
||
| stats = subcontainer['stats'][-1] # take the latest |
There was a problem hiding this comment.
Are we sure that stats exists and has a len > 0?
Does it make sense to add a try/except here?
There was a problem hiding this comment.
The exception will be caught in the parent _update_metrics
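A minimal sketch of the error handling described in this reply, assuming a per-container try/except in the caller. `latest_stats` and `collect_all` are hypothetical names; the log message mirrors the one in the diff.

```python
import logging

log = logging.getLogger(__name__)


def latest_stats(subcontainer):
    # Raises KeyError if 'stats' is missing, IndexError if it is empty
    return subcontainer['stats'][-1]


def collect_all(subcontainers):
    collected = {}
    for sub in subcontainers:
        c_id = sub.get('id')
        try:
            collected[c_id] = latest_stats(sub)
        except Exception as e:
            # A failure for one container is logged; the loop continues
            log.error("Unable to collect metrics for container: {0} ({1})".format(c_id, e))
    return collected
```

A missing or empty `stats` list therefore costs one container's metrics for that run, not the whole collection.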
| stats = subcontainer['stats'][-1] # take the latest | ||
| self._publish_raw_metrics(NAMESPACE, stats, tags, is_pod) | ||
|
|
||
| if is_pod is False and subcontainer.get("spec", {}).get("has_filesystem") and stats.get('filesystem', []) != []: |
There was a problem hiding this comment.
nit: what about doing this instead:
if is_pod is False and subcontainer.get("spec", {}).get("has_filesystem") and stats.get('filesystem'):
It doesn't create two empty lists just to compare if it's not None.
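The two forms can be compared side by side. `has_filesystem_stats` is a hypothetical helper written only for this comparison:

```python
def has_filesystem_stats(stats):
    # Original form: builds a default list and compares against another empty list
    verbose = stats.get('filesystem', []) != []
    # Suggested form: relies on truthiness
    concise = bool(stats.get('filesystem'))
    return verbose, concise
```

The two agree for a missing key, an empty list and a populated list. They differ when the key is present with a `None` value: the original comparison evaluates to `True`, while the truthiness check evaluates to `False`.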
| return get_tags('docker://%s' % cid, cardinality) | ||
|
|
||
|
|
||
| def get_pod_by_uid(uid, podlist): |
There was a problem hiding this comment.
Really like that kind of docstrings 👍
| def __init__(self, podlist): | ||
| self.containers = {} | ||
|
|
||
| for pod in podlist.get('items') or []: |
There was a problem hiding this comment.
Can we replace this by:
for pod in podlist.get('items', []):
There was a problem hiding this comment.
Now that we have
if self.pod_list.get("items") is None:
    # Sanitize input: if no pod are running, 'items' is a NoneObject
    self.pod_list['items'] = []
in check() we can. Before, we had an items key with a None value, which made the iteration fail when no pod was running.
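The difference discussed in this thread comes from how `dict.get` handles a key that is present with a `None` value. A small sketch, where `sanitize` is a hypothetical helper mirroring the check quoted above:

```python
# What the reply describes when no pod is running
podlist = {"items": None}

# dict.get only falls back to the default when the key is absent,
# so a None value is returned as-is
with_default = podlist.get('items', [])

# 'or []' also replaces a None value
with_or = podlist.get('items') or []


def sanitize(pod_list):
    # After this, pod_list.get('items', []) is always safe to iterate
    if pod_list.get("items") is None:
        pod_list["items"] = []
    return pod_list
```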
f7b29c2 to
95a3ebc
Compare
- adds the ContainerFilter helper class to consume the new agent interface. Only used for cadvisor mode for now
- factors-out common parts to a common.py file
- copy agent5 cadvisor logic to a CadvisorScraper helper class for separation
- update the cadvisor logic to support agent6 facilities (tagger, filter)
- update the cadvisor logic to report network metrics at the pod cardinality for consistency with prometheus mode (change from agent5): see comment lower
- add the missing disk metric in the kubernetes.csv metadata file
95a3ebc to
d1a7593
Compare
23acc34 to
463a8f9
Compare
|
Final testing with |
|
@l0k0ms can we sync about this PR and DataDog/datadog-agent#1550 ? Do you see other docs to update? |
What does this PR do?
Port the cadvisor collection logic from agent5's kubernetes check to agent6. This PR:
- adds the ContainerFilter helper class to consume the new agent interface. Only used for cadvisor mode for now, prometheus mode will be patched in another PR
- factors out common parts to a common.py file
- copies agent5 cadvisor logic to a CadvisorScraper helper class for clearer separation
- adds the missing disk metric in the kubernetes.csv metadata file
Motivation
Support legacy k8s clusters
Testing Guidelines
Pushed the datadog/agent-dev:xvello-test1-3 and datadog/agent-dev:xvello-test1-3-jmx images for testing.
Cadvisor mode can be triggered with the following confd configmap:
Versioning
- manifest.json
- datadog_checks/{integration}/__init__.py
- CHANGELOG.md. Please use Unreleased as the date in the title for the new section.
Additional Notes
Anything else we should know when reviewing?