New kube-state-metrics cluster checks FAQ #5013
hkaj
left a comment
It's a good start; let's add detailed steps here as well. I'll write something up and send it your way.
| [8]: /agent/faq/why-should-i-install-the-agent-on-my-cloud-instances
| [9]: /agent/faq/kubernetes-secrets
| [10]: /agent/faq/windows-agent-ddagent-user
| [11]:
| [11]:
| [11]: /agent/faq/kubernetes-state-cluster-check
|
| ## Background
|
| If you use Autodiscovery from the DaemonSet, one of your Agents (the one running on the same node as `kube-state-metrics`) runs the check and uses a significant amount of memory, while other Agents in the DaemonSet use far less. To prevent the outlier Agent from getting killed, you could increase the memory limit for all Agents, but this could be a waste. The alternative is to use the DaemonSet for lightweight checks to keep the general memory usage low, and use a small (e.g. 2-3 pods) dedicated deployment for heavy checks, where each pod has a large amount of RAM and only runs cluster checks.
this could be a waste --> this would be a waste, no doubt there 😄
@cswatt here are more detailed instructions in case you want to list that.
Install the KSM check as a CLC:
ad.datadoghq.com/service.check_names: '["kubernetes_state"]'
ad.datadoghq.com/service.init_configs: '[{}]'
ad.datadoghq.com/service.instances: |
  [
    {
      "kube_state_url": "http://%%host%%:8080/metrics",
      "prometheus_timeout": 30,
      "min_collection_interval": 30,
      "send_pod_phase_service_checks": false,
      "telemetry": true
    }
  ]
...
volumeMounts:
  ...
  - name: empty-dir
    mountPath: /etc/datadog-agent/conf.d/kubernetes_state.d
...
volumes:
  - name: empty-dir
    emptyDir: {}
...
Verify check dispatching:
And the agent running on the node returned above should also have a kubernetes_state check in its
@CharlyF has written an even more complete guide in this GitHub issue: DataDog/datadog-agent#3923
Co-Authored-By: Pierre Guceski <pierre.guceski@datadoghq.com>
| ad.datadoghq.com/service.instances: |
|   [
|     {
|       "kube_state_url": "http://%%host%%:8080/metrics",
We should add a disclaimer about the port. It's worth mentioning that 8080 is the metrics port.
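One way to make the port explicit is to show the Service ports next to the annotation. The fragment below is only a sketch: it assumes a Service named `kube-state-metrics` in `kube-system` (taken from the tags quoted later in this PR) whose port 8080 serves `/metrics`. The port name `http-metrics` and the selector label are illustrative, so check them against your own kube-state-metrics manifest.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics        # assumed; matches the kube_service tag in this PR
  namespace: kube-system          # assumed; matches the kube_namespace tag in this PR
  annotations:
    ad.datadoghq.com/service.check_names: '["kubernetes_state"]'
    ad.datadoghq.com/service.init_configs: '[{}]'
    # The port in kube_state_url must be the one that serves /metrics (8080 here).
    ad.datadoghq.com/service.instances: |
      [{"kube_state_url": "http://%%host%%:8080/metrics"}]
spec:
  selector:
    app: kube-state-metrics       # hypothetical label
  ports:
    - name: http-metrics          # hypothetical port name
      port: 8080
      targetPort: 8080
```

If your deployment exposes metrics on a different port, the value in `kube_state_url` needs to change with it.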
|   [
|     {
|       "kube_state_url": "http://%%host%%:8080/metrics",
|       "prometheus_timeout": 30,
The timeout and min collection interval are set to 30 here because we have very large payloads, but that deserves a disclaimer: it means users would only collect these metrics at a lower granularity.
I would suggest mentioning them or pointing to the docs for the KSM check; there is no need to lower the default granularity without explanation.
Agreed, we can remove this line and the next by default, and maybe add a section to the FAQ about the check timing out and the need to decrease the frequency and increase the timeout.
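For reference, the annotation with the two timing options dropped could look like the sketch below. It only removes `prometheus_timeout` and `min_collection_interval` from the example in this PR, so the check falls back to its defaults; the remaining keys are unchanged.

```yaml
metadata:
  annotations:
    ad.datadoghq.com/service.check_names: '["kubernetes_state"]'
    ad.datadoghq.com/service.init_configs: '[{}]'
    # prometheus_timeout and min_collection_interval are omitted on purpose,
    # so the check's default timeout and collection interval apply.
    ad.datadoghq.com/service.instances: |
      [
        {
          "kube_state_url": "http://%%host%%:8080/metrics",
          "send_pod_phase_service_checks": false,
          "telemetry": true
        }
      ]
```

If the check later times out on large payloads, the two options can be added back with values such as the 30 used earlier in this thread, at the cost of lower granularity.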
|       "prometheus_timeout": 30,
|       "min_collection_interval": 30,
|       "send_pod_phase_service_checks": false,
|       "telemetry": true
Same for those two options. They might want the pod phase as a service check.
Telemetry is only activated in the latest version of the Datadog KSM check.
|   ]
| ```
|
| 2. Deploy the Datadog Cluster Agent with `DD_CLUSTER_CHECKS_ENABLED` set to `true`.
It also requires:
- name: DD_EXTRA_CONFIG_PROVIDERS
  value: "kube_services"
- name: DD_EXTRA_LISTENERS
  value: "kube_services"
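Combined with step 2, the Cluster Agent environment might look like the fragment below. This is a sketch limited to the three variables named in this thread; the rest of the container spec is omitted.

```yaml
# Fragment of the Cluster Agent container spec (other fields omitted).
env:
  # Step 2 of the FAQ: turn on cluster check dispatching.
  - name: DD_CLUSTER_CHECKS_ENABLED
    value: "true"
  # From the review comment: read check configs from Service annotations.
  - name: DD_EXTRA_CONFIG_PROVIDERS
    value: "kube_services"
  - name: DD_EXTRA_LISTENERS
    value: "kube_services"
```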
|
| 4. Deploy the Agent.
|
| Refer to the [Running Cluster Checks with Autodiscovery][2] documentation for more information. See options that begin with `clusterchecksDeployment` in the Helm chart [README.md][3].
Maybe we could be more verbose about the options to use in the Agent?
Yeah, we should aim to explain all options, but I think for an FAQ this is OK. We'll see what follow-up questions users ask, and go from there.
|
| ## Background
|
| If you use Autodiscovery from the DaemonSet, one of your Agents (the one running on the same node as `kube-state-metrics`) runs the check and uses a significant amount of memory, while other Agents in the DaemonSet use far less. To prevent the outlier Agent from getting killed, you could increase the memory limit for all Agents, but this wastes resources. The alternative is to use the DaemonSet for lightweight checks to keep the general memory usage low, and use a small (e.g. 2-3 pods) dedicated deployment for heavy checks, where each pod has a large amount of RAM and only runs cluster checks.
Good to give background, but we need to document or point to what the full workload looks like, so they can clearly see what the cluster check worker is compared to the regular agent.
It is not very clear that the customer needs to use this solution with another DaemonSet that is deployed only on a few nodes with higher memory limits.
| Run:
|
| ```
| kubectl exec -it <datadog-cluster-agent-leader> agent clusterchecks
How to find the leader in the pods list?
|
| ## Configuration
|
| 1. Deploy `kube-state-metrics` with cluster check annotations on the service:
Please change to:
- Deploy `kube-state-metrics` with cluster check annotations on its Kubernetes service:
apiVersion: v1
kind: Service
metadata:
  annotations:
    # ... others
    ad.datadoghq.com/service.check_names: '["kubernetes_state"]'
    ad.datadoghq.com/service.init_configs: '[{}]'
    ad.datadoghq.com/service.instances: |
      [
        {
          "kube_state_url": "http://%%host%%:8080/metrics",
          "prometheus_timeout": 30,
          "min_collection_interval": 30,
          "send_pod_phase_service_checks": false,
          "telemetry": true
        }
      ]
| ## Configuration
|
| 1. Deploy `kube-state-metrics` with cluster check annotations on the service:
Helm chart `kube-state-metrics` 2.0.0 doesn't support custom service annotations; they are hardcoded.
UPD: fixed in helm/charts#15864, but the main chart is locked to `version: ~2.0.0`: https://github.com/helm/charts/blob/master/stable/datadog/requirements.yaml#L3
| tags:
|   - kube_service:kube-state-metrics
|   - kube_namespace:kube-system
|   - cluster_name:test-cluster
Off-topic: is there any way to add a custom tag to the list? All our DD dashboards are configured to read the `kubernetescluster` tag instead of `cluster_name`.
I tried to use this section: https://github.com/helm/charts/blob/master/stable/datadog/values.yaml#L122
But it looks like it doesn't work, at least in this case where a separate cluster-agent pod is running. Also, I found that the tags value from the Helm chart is translated into the following format, which looks wrong according to the documentation:
- name: DD_TAGS
  value: '[map[kubernetescluster:k8s-blue-staging]]'
Does this look like a bug?
Found a working configuration:
clusterAgent:
  ...
  env:
    # This also works but it's better not to break the default value
    # but add another one for backward compatibility
    # - name: DD_CLUSTER_CHECKS_CLUSTER_TAG_NAME
    #   value: "kubernetescluster"
    - name: DD_CLUSTER_CHECKS_EXTRA_TAGS
      value: "kubernetescluster:my-lovely-cluster"
For people who will be looking for the Helm chart configuration, check the comment:
updated to webpack, closing this one. see new pr |
What does this PR do?
Adds an FAQ for running the `kubernetes_state` check as a cluster check.
Motivation
customer questions
Preview link
https://docs-staging.datadoghq.com/cswatt/kube-state-metrics-cluster-check/agent/faq/kubernetes-state-cluster-check