New kube-state-metrics cluster checks faq by cswatt · Pull Request #5013 · DataDog/documentation

cswatt · 2019-07-18T18:50:23Z

What does this PR do?

Adds an FAQ for running the kubernetes_state check as a cluster check.

Motivation

customer questions

Preview link

https://docs-staging.datadoghq.com/cswatt/kube-state-metrics-cluster-check/agent/faq/kubernetes-state-cluster-check

hkaj

It's a good start, let's add detailed steps here as well. I'll write something up and send it your way

hkaj · 2019-07-19T12:57:26Z

 [8]: /agent/faq/why-should-i-install-the-agent-on-my-cloud-instances
 [9]: /agent/faq/kubernetes-secrets
 [10]: /agent/faq/windows-agent-ddagent-user
+[11]: 


Suggested change

[11]:

[11]: /agent/faq/kubernetes-state-cluster-check

hkaj · 2019-07-19T16:27:51Z

+
+## Background
+
+If you use Autodiscovery from the DaemonSet, one of your Agents (the one running on the same node as `kube-state-metrics`) runs the check and uses a significant amount of memory, while other Agents in the DaemonSet use far less. To prevent the outlier Agent from getting killed, you could increase the memory limit for all Agents, but this could be a waste. The alternative is to use the DaemonSet for lightweight checks to keep the general memory usage low, and use a small (e.g. 2-3 pods) dedicated deployment for heavy checks, where each pod has a large amount of RAM and only runs cluster checks.


this could be a waste --> this would be a waste, no doubt there 😄

hkaj · 2019-07-23T14:07:44Z

@cswatt here are more detailed instructions in case you want to list that.

install ksm check as a CLC

Deploy KSM with cluster check annotations on the service:

ad.datadoghq.com/service.check_names: '["kubernetes_state"]'
ad.datadoghq.com/service.init_configs: '[{}]'
ad.datadoghq.com/service.instances: |
    [
    {
        "kube_state_url": "http://%%host%%:8080/metrics",
        "prometheus_timeout": 30,
        "min_collection_interval": 30,
        "send_pod_phase_service_checks": false,
        "telemetry": true
    }
    ]

Deploy the Datadog Cluster Agent with DD_CLUSTER_CHECKS_ENABLED set to true.
Configure the Agent deployment to mount an emptyDir volume at /etc/datadog-agent/conf.d/kubernetes_state.d:

...
    volumeMounts:
    ...
      - name: empty-dir
        mountPath: /etc/datadog-agent/conf.d/kubernetes_state.d
    ...
  volumes:
    - name: empty-dir
      emptyDir: {}
    ...

Deploy the agent

verify check dispatching

`kubectl exec -it <datadog-cluster-agent-leader> agent clusterchecks` should display something like:

===== Checks on gke-my-cluster-default-pool-b969f074-npsw =====

=== kubernetes_state check ===
Source: kubernetes-services
Instance ID: kubernetes_state:4a95df9b407f14d7
empty_default_hostname: true
kube_state_url: http://10.59.249.74:8080/metrics
min_collection_interval: 30
prometheus_timeout: 30
send_pod_phase_service_checks: false
tags:
- kube_service:kube-state-metrics
- kube_namespace:kube-system
- cluster_name:test-cluster
telemetry: true
~
Init Config:
{}
===

And the agent running on the node returned above should also have a kubernetes_state check in its agent status.

hkaj · 2019-07-26T11:01:49Z

@CharlyF has written an even more complete guide on this GH issue DataDog/datadog-agent#3923

Co-Authored-By: Pierre Guceski <pierre.guceski@datadoghq.com>

CharlyF · 2019-07-31T18:02:14Z

+	ad.datadoghq.com/service.instances: |
+	    [
+	    {
+	        "kube_state_url": "http://%%host%%:8080/metrics",


We should have a disclaimer about the port. It's worth mentioning that 8080 is the metrics port.

CharlyF · 2019-07-31T18:04:10Z

+	    [
+	    {
+	        "kube_state_url": "http://%%host%%:8080/metrics",
+	        "prometheus_timeout": 30,


The timeout and min collection interval are set to 30 here because we have very large payloads, but it's worth a disclaimer, as it means they would only collect their metrics at a lower granularity.
I would suggest mentioning them or pointing to the doc for the KSM check, but there is no need to lower the default granularity without explanation.

Agreed, we can remove this line and the next by default, and maybe add a section to the FAQ about the check timing out and the need to decrease the frequency and increase the timeout.

CharlyF · 2019-07-31T18:04:56Z

+	        "prometheus_timeout": 30,
+	        "min_collection_interval": 30,
+	        "send_pod_phase_service_checks": false,
+	        "telemetry": true


same for those two options. They might want the pod phase as a service check.
Telemetry is only activated in the latest version of the Datadog KSM check
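
Following the suggestions above, a pared-down set of annotations that keeps only the metrics URL might look like the sketch below. The Service name is an assumption for illustration, and 8080 is assumed to be the port on which kube-state-metrics exposes its metrics; the optional settings are left out so the check's own defaults apply.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics  # assumed name, adjust to your deployment
  annotations:
    ad.datadoghq.com/service.check_names: '["kubernetes_state"]'
    ad.datadoghq.com/service.init_configs: '[{}]'
    # 8080 is assumed to be the kube-state-metrics metrics port.
    ad.datadoghq.com/service.instances: |
      [
        {
          "kube_state_url": "http://%%host%%:8080/metrics"
        }
      ]
```

Options such as `prometheus_timeout`, `min_collection_interval`, `send_pod_phase_service_checks`, and `telemetry` can be added back to the instance when payload size or the Agent version calls for them.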

CharlyF · 2019-07-31T18:06:04Z

+	    ]
+	```
+
+2. Deploy the Datadog Cluster Agent with `DD_CLUSTER_CHECKS_ENABLED` set to `true`.


It also requires

- name: DD_EXTRA_CONFIG_PROVIDERS
  value: "kube_services"
- name: DD_EXTRA_LISTENERS
  value: "kube_services"
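
Put together with the setting from step 2, the Cluster Agent container's environment could look roughly like this. It is a sketch of the `env` section only; the rest of the Cluster Agent manifest is omitted and depends on your deployment.

```yaml
# Sketch: env section of the Datadog Cluster Agent container (other fields omitted).
env:
  # Enables cluster check dispatching.
  - name: DD_CLUSTER_CHECKS_ENABLED
    value: "true"
  # Lets the Cluster Agent read check configurations from Service annotations.
  - name: DD_EXTRA_CONFIG_PROVIDERS
    value: "kube_services"
  - name: DD_EXTRA_LISTENERS
    value: "kube_services"
```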

CharlyF · 2019-07-31T18:06:41Z

+
+4. Deploy the Agent.
+
+Refer to the [Running Cluster Checks with Autodiscovery][2] documentation for more information. See options that begin with `clusterchecksDeployment` in the Helm chart [README.md][3].


Maybe we could be more verbose about the options to use in the Agent?

Yeah, we should aim to explain all options, but I think for an FAQ this is OK. We'll see what follow-up questions users ask, and go from there.

CharlyF · 2019-07-31T18:08:12Z

+
+## Background
+
+If you use Autodiscovery from the DaemonSet, one of your Agents (the one running on the same node as `kube-state-metrics`) runs the check and uses a significant amount of memory, while other Agents in the DaemonSet use far less. To prevent the outlier Agent from getting killed, you could increase the memory limit for all Agents, but this wastes resources. The alternative is to use the DaemonSet for lightweight checks to keep the general memory usage low, and use a small (e.g. 2-3 pods) dedicated deployment for heavy checks, where each pod has a large amount of RAM and only runs cluster checks.


Good to give background, but we need to document or point to what the full workload looks like, so they can clearly see what the cluster check worker is compared to the regular agent.

Simwar · 2019-08-09T08:51:49Z

It is not very clear that the customer needs to use this solution with another daemonset only deployed on a few nodes with higher memory limits.
The background section gives this info but I feel like it should be outlined from the start.

kivagant-ba · 2019-08-09T09:48:00Z

+Run:
+
+```
+kubectl exec -it <datadog-cluster-agent-leader> agent clusterchecks


How to find the leader in the pods list?

kivagant-ba · 2019-08-09T09:53:07Z

+
+## Configuration
+
+1. Deploy `kube-state-metrics` with cluster check annotations on the service:


Please change to

Deploy kube-state-metrics with cluster check annotations on its Kubernetes service:

apiVersion: v1
kind: Service
metadata:
  annotations:
    # ... others
    ad.datadoghq.com/service.check_names: '["kubernetes_state"]'
    ad.datadoghq.com/service.init_configs: '[{}]'
    ad.datadoghq.com/service.instances: |
      [
        {
          "kube_state_url": "http://%%host%%:8080/metrics",
          "prometheus_timeout": 30,
          "min_collection_interval": 30,
          "send_pod_phase_service_checks": false,
          "telemetry": true
        }
      ]

kivagant-ba · 2019-08-09T10:46:34Z

+## Configuration
+
+1. Deploy `kube-state-metrics` with cluster check annotations on the service:
+


The kube-state-metrics Helm chart 2.0.0 doesn't support custom service annotations; they are hardcoded.

Update: fixed in helm/charts#15864, but the main chart is locked to `version: ~2.0.0`: https://github.com/helm/charts/blob/master/stable/datadog/requirements.yaml#L3

kivagant-ba · 2019-08-09T10:56:31Z

+tags:
+- kube_service:kube-state-metrics
+- kube_namespace:kube-system
+- cluster_name:test-cluster


Off topic: is there any way to add a custom tag to the list? All our DD dashboards are configured to read the "kubernetescluster" tag instead of cluster_name.

I tried to use this section: https://github.com/helm/charts/blob/master/stable/datadog/values.yaml#L122
But it looks like it doesn't work, at least in this case, when a separate cluster-agent pod is running. I also found that the tags value from the Helm chart is translated into the following format, which looks wrong according to the documentation:

- name: DD_TAGS
  value: '[map[kubernetescluster:k8s-blue-staging]]'

Does this look like a bug?

Found a working configuration:

clusterAgent:
  ...
  env:
    # This also works but it's better not to break the default value
    # but add another one for backward compatibility
    # - name: DD_CLUSTER_CHECKS_CLUSTER_TAG_NAME
    #   value: "kubernetescluster"
    - name: DD_CLUSTER_CHECKS_EXTRA_TAGS
      value: "kubernetescluster:my-lovely-cluster"

kivagant-ba · 2019-08-09T15:34:21Z

For people who will be looking for the Helm chart configuration, check the comment:

DataDog/datadog-agent#3923 (comment)

cswatt · 2019-08-15T17:49:01Z

Updated to webpack, closing this one. See new PR #5232.

adding ksm faq

d22f1bc

cswatt added the "Do Not Merge" label Jul 18, 2019

cswatt requested a review from a team as a code owner July 18, 2019 18:50

l0k0ms reviewed Jul 19, 2019

Comment thread content/en/agent/faq/_index.md Outdated

l0k0ms reviewed Jul 19, 2019

Comment thread content/en/agent/faq/kubernetes-state-cluster-check.md Outdated

l0k0ms reviewed Jul 19, 2019

Comment thread content/en/agent/faq/kubernetes-state-cluster-check.md Outdated

hkaj requested changes Jul 19, 2019

cswatt and others added 5 commits July 31, 2019 12:16

Update content/en/agent/faq/_index.md

172fba2

Co-Authored-By: Pierre Guceski <pierre.guceski@datadoghq.com>

Update content/en/agent/faq/kubernetes-state-cluster-check.md

719c01b

Co-Authored-By: Pierre Guceski <pierre.guceski@datadoghq.com>

Update content/en/agent/faq/kubernetes-state-cluster-check.md

617fec3

Co-Authored-By: Pierre Guceski <pierre.guceski@datadoghq.com>

add detailed directions

ec2d45c

add verification

112de43

CharlyF reviewed Jul 31, 2019

updates and caveats

4894ebc

hkaj mentioned this pull request Aug 6, 2019

k8s w/ ksm integration issues DataDog/datadog-agent#1853

Closed

kivagant-ba reviewed Aug 9, 2019

cswatt mentioned this pull request Aug 15, 2019

add kubernetes state cluster check faq #5232

Merged

cswatt closed this Aug 15, 2019

l0k0ms deleted the cswatt/kube-state-metrics-cluster-check branch August 19, 2019 06:58


