New kube-state-metrics cluster checks faq#5013

Closed
cswatt wants to merge 7 commits into master from cswatt/kube-state-metrics-cluster-check

@cswatt cswatt commented Jul 18, 2019

Contributor

What does this PR do?

Adds an FAQ for running the kubernetes_state check as a cluster check.

Motivation

customer questions

Preview link

https://docs-staging.datadoghq.com/cswatt/kube-state-metrics-cluster-check/agent/faq/kubernetes-state-cluster-check

@cswatt cswatt added the Do Not Merge Just do not merge this PR :) label Jul 18, 2019
@cswatt cswatt requested a review from a team as a code owner July 18, 2019 18:50
Comment thread content/en/agent/faq/_index.md Outdated
Comment thread content/en/agent/faq/kubernetes-state-cluster-check.md Outdated
Comment thread content/en/agent/faq/kubernetes-state-cluster-check.md Outdated

@hkaj hkaj left a comment

Member

It's a good start, let's add detailed steps here as well. I'll write something up and send it your way

Comment thread content/en/agent/faq/_index.md Outdated
[8]: /agent/faq/why-should-i-install-the-agent-on-my-cloud-instances
[9]: /agent/faq/kubernetes-secrets
[10]: /agent/faq/windows-agent-ddagent-user
[11]:

Member

Suggested change
[11]:
[11]: /agent/faq/kubernetes-state-cluster-check


## Background

If you use Autodiscovery from the DaemonSet, one of your Agents (the one running on the same node as `kube-state-metrics`) runs the check and uses a significant amount of memory, while other Agents in the DaemonSet use far less. To prevent the outlier Agent from getting killed, you could increase the memory limit for all Agents, but this could be a waste. The alternative is to use the DaemonSet for lightweight checks to keep the general memory usage low, and use a small (e.g. 2-3 pods) dedicated deployment for heavy checks, where each pod has a large amount of RAM and only runs cluster checks.

Member

this could be a waste --> this would be a waste, no doubt there 😄

hkaj commented Jul 23, 2019

Member

@cswatt here are more detailed instructions in case you want to list that.

Install the KSM check as a CLC

  1. Deploy KSM with cluster check annotations on the service:

       ad.datadoghq.com/service.check_names: '["kubernetes_state"]'
       ad.datadoghq.com/service.init_configs: '[{}]'
       ad.datadoghq.com/service.instances: |
         [
           {
             "kube_state_url": "http://%%host%%:8080/metrics",
             "prometheus_timeout": 30,
             "min_collection_interval": 30,
             "send_pod_phase_service_checks": false,
             "telemetry": true
           }
         ]

  2. Deploy the Datadog Cluster Agent with DD_CLUSTER_CHECKS_ENABLED set to true.
  3. Configure the Agent deployment to mount an empty dir at /etc/datadog-agent/conf.d/kubernetes_state.d:

       ...
           volumeMounts:
           ...
             - name: empty-dir
               mountPath: /etc/datadog-agent/conf.d/kubernetes_state.d
           ...
         volumes:
           - name: empty-dir
             emptyDir: {}
           ...

  4. Deploy the Agent.
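
A minimal sketch of where the empty-dir mount above sits in a full manifest, assuming the node Agent runs as a DaemonSet. The resource name, labels, container name, and image tag are hypothetical; only the volume name and mount path come from the steps above. The mount presumably hides any auto-configuration bundled for `kubernetes_state`, so that node Agents only run the check when the Cluster Agent dispatches it.

```yaml
# Sketch only: names, labels, and image tag are assumptions, not from the PR.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: datadog-agent                  # hypothetical name
spec:
  selector:
    matchLabels:
      app: datadog-agent               # hypothetical label
  template:
    metadata:
      labels:
        app: datadog-agent
    spec:
      containers:
        - name: agent                  # hypothetical container name
          image: datadog/agent:latest  # pin a real version in practice
          volumeMounts:
            # Empty directory mounted over the check's conf.d folder
            - name: empty-dir
              mountPath: /etc/datadog-agent/conf.d/kubernetes_state.d
      volumes:
        - name: empty-dir
          emptyDir: {}
```

Note that `volumeMounts` belongs to the container while `volumes` belongs to the pod spec, which is why the two keys sit at different indentation levels.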

Verify check dispatching

`kubectl exec -it <datadog-cluster-agent-leader> agent clusterchecks` should display something like:

===== Checks on gke-my-cluster-default-pool-b969f074-npsw =====

=== kubernetes_state check ===
Source: kubernetes-services
Instance ID: kubernetes_state:4a95df9b407f14d7
empty_default_hostname: true
kube_state_url: http://10.59.249.74:8080/metrics
min_collection_interval: 30
prometheus_timeout: 30
send_pod_phase_service_checks: false
tags:
- kube_service:kube-state-metrics
- kube_namespace:kube-system
- cluster_name:test-cluster
telemetry: true
~
Init Config:
{}
===

And the agent running on the node returned above should also have a kubernetes_state check in its agent status.

hkaj commented Jul 26, 2019

Member

@CharlyF has written an even more complete guide in this GitHub issue: DataDog/datadog-agent#3923

cswatt and others added 5 commits July 31, 2019 12:16
Co-Authored-By: Pierre Guceski <pierre.guceski@datadoghq.com>
Co-Authored-By: Pierre Guceski <pierre.guceski@datadoghq.com>
Co-Authored-By: Pierre Guceski <pierre.guceski@datadoghq.com>
ad.datadoghq.com/service.instances: |
[
{
"kube_state_url": "http://%%host%%:8080/metrics",

@CharlyF CharlyF Jul 31, 2019

Contributor

We should add a disclaimer about the port: it's worth mentioning that 8080 is the metrics port.
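
To illustrate the port remark, here is a hedged sketch of a `kube-state-metrics` Service. The selector label and port names are hypothetical, and the second (telemetry) port is an assumption based on common `kube-state-metrics` manifests; check your own deployment before relying on either number.

```yaml
# Sketch only: selector and port names are assumptions.
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics         # matches the kube_service tag shown earlier
  namespace: kube-system
spec:
  selector:
    app: kube-state-metrics        # hypothetical label
  ports:
    - name: http-metrics           # hypothetical port name
      port: 8080                   # metrics port; the one kube_state_url should target
      targetPort: 8080
    - name: telemetry              # assumed self-metrics port, not scraped by this check
      port: 8081
      targetPort: 8081
```

If the Service exposes the metrics endpoint on a different port, the `8080` in `kube_state_url` has to change to match.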

[
{
"kube_state_url": "http://%%host%%:8080/metrics",
"prometheus_timeout": 30,

Contributor

The timeout and min collection interval are set to 30 here because we have very large payloads, but that deserves a disclaimer, since it means users would collect their metrics at a lower granularity.
I would suggest explaining these options or pointing to the docs for the KSM check; there's no need to lower the default granularity without explanation.

Member

Agreed, we can remove this line and the next by default, and maybe add a section to the FAQ about the check timing out and the need to decrease the frequency and increase the timeout.
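
A sketch of what the annotations could look like with the two timing lines removed, as proposed in this thread. It assumes the check then falls back to its own default timeout and collection interval; the other optional settings are left out here only to keep the example short.

```yaml
# Fragment of the kube-state-metrics Service metadata (sketch).
metadata:
  annotations:
    ad.datadoghq.com/service.check_names: '["kubernetes_state"]'
    ad.datadoghq.com/service.init_configs: '[{}]'
    # Only the URL is set, so no timeout or interval override applies.
    ad.datadoghq.com/service.instances: |
      [
        {
          "kube_state_url": "http://%%host%%:8080/metrics"
        }
      ]
```

For clusters where the payload is large enough that the check times out, `"prometheus_timeout": 30` and `"min_collection_interval": 30` can be added back inside the instance object, at the cost of coarser metric granularity.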

"prometheus_timeout": 30,
"min_collection_interval": 30,
"send_pod_phase_service_checks": false,
"telemetry": true

Contributor

Same for those two options: they might want the pod phase as a service check.
Telemetry is only activated in the latest version of the Datadog KSM check.

]
```

2. Deploy the Datadog Cluster Agent with `DD_CLUSTER_CHECKS_ENABLED` set to `true`.

Contributor

It also requires

        - name: DD_EXTRA_CONFIG_PROVIDERS
          value: "kube_services"
        - name: DD_EXTRA_LISTENERS
          value: "kube_services" 
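
Put together with the step under review, the Cluster Agent container's environment could look roughly like the fragment below. This is a sketch that only combines the three variables named in this thread; a real Cluster Agent deployment needs further settings that are out of scope here.

```yaml
# Fragment of the Cluster Agent container spec (sketch, not a full manifest).
env:
  # Enables cluster check dispatching
  - name: DD_CLUSTER_CHECKS_ENABLED
    value: "true"
  # Lets the Cluster Agent read check configs from Service annotations
  - name: DD_EXTRA_CONFIG_PROVIDERS
    value: "kube_services"
  # Lets the Cluster Agent watch Kubernetes Services
  - name: DD_EXTRA_LISTENERS
    value: "kube_services"
```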


4. Deploy the Agent.

Refer to the [Running Cluster Checks with Autodiscovery][2] documentation for more information. See options that begin with `clusterchecksDeployment` in the Helm chart [README.md][3].

Contributor

Maybe we could be more verbose about the options to use in the Agent?

Member

Yeah, we should aim to explain all options, but I think for an FAQ this is OK. We'll see what follow-up questions users ask, and go from there.


## Background

If you use Autodiscovery from the DaemonSet, one of your Agents (the one running on the same node as `kube-state-metrics`) runs the check and uses a significant amount of memory, while other Agents in the DaemonSet use far less. To prevent the outlier Agent from getting killed, you could increase the memory limit for all Agents, but this wastes resources. The alternative is to use the DaemonSet for lightweight checks to keep the general memory usage low, and use a small (e.g. 2-3 pods) dedicated deployment for heavy checks, where each pod has a large amount of RAM and only runs cluster checks.

Contributor

Good to give background, but we need to document or point to what the full workload looks like, so they can clearly see what the cluster check worker is compared to the regular agent.
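
As a rough illustration of the dedicated workload the Background section describes, a cluster check worker could be sketched as a small Deployment with a higher memory limit than the DaemonSet Agents. Everything below is an assumption for illustration: the names, labels, image tag, and memory figures are hypothetical, the `clusterchecks` config provider value is taken from the general cluster checks documentation rather than from this PR, and the settings needed to reach the Cluster Agent and Datadog are omitted.

```yaml
# Sketch only: not a complete or tested manifest.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: datadog-cluster-checks-runner      # hypothetical name
spec:
  replicas: 2                              # "2-3 pods" per the Background text
  selector:
    matchLabels:
      app: datadog-cluster-checks-runner   # hypothetical label
  template:
    metadata:
      labels:
        app: datadog-cluster-checks-runner
    spec:
      containers:
        - name: agent
          image: datadog/agent:latest      # pin a real version in practice
          env:
            # Assumed setting: lets this Agent receive dispatched cluster checks
            - name: DD_EXTRA_CONFIG_PROVIDERS
              value: "clusterchecks"
          resources:
            requests:
              memory: "512Mi"              # hypothetical figure
            limits:
              memory: "1Gi"                # hypothetical; sized for heavy checks
```

The regular DaemonSet Agents would keep a lower memory limit and would not enable this provider, so heavy checks such as `kubernetes_state` land only on these pods.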

Simwar commented Aug 9, 2019

Contributor

It is not very clear that the customer needs to use this solution with another daemonset only deployed on a few nodes with higher memory limits.
The background section gives this info but I feel like it should be outlined from the start.

Run:

```
kubectl exec -it <datadog-cluster-agent-leader> agent clusterchecks

How do you find the leader in the pod list?


## Configuration

1. Deploy `kube-state-metrics` with cluster check annotations on the service:

@kivagant-ba kivagant-ba Aug 9, 2019

Please change to

  1. Deploy kube-state-metrics with cluster check annotations on its Kubernetes service:
apiVersion: v1
kind: Service
metadata:
  annotations:
    # ... others
    ad.datadoghq.com/service.check_names: '["kubernetes_state"]'
    ad.datadoghq.com/service.init_configs: '[{}]'
    ad.datadoghq.com/service.instances: |
      [
      {
          "kube_state_url": "http://%%host%%:8080/metrics",
          "prometheus_timeout": 30,
          "min_collection_interval": 30,
          "send_pod_phase_service_checks": false,
          "telemetry": true
      }
      ]

## Configuration

1. Deploy `kube-state-metrics` with cluster check annotations on the service:

@kivagant-ba kivagant-ba Aug 9, 2019

Helm chart kube-state-metrics 2.0.0 doesn't support custom service annotations, they are hardcoded.

Update: fixed in helm/charts#15864, but the main chart is locked to `version: ~2.0.0`: https://github.com/helm/charts/blob/master/stable/datadog/requirements.yaml#L3

tags:
- kube_service:kube-state-metrics
- kube_namespace:kube-system
- cluster_name:test-cluster

Off-topic: is there any way to add a custom tag to the list? All our DD dashboards are configured to read the "kubernetescluster" tag instead of cluster_name.

@kivagant-ba kivagant-ba Aug 9, 2019

I tried to use this section: https://github.com/helm/charts/blob/master/stable/datadog/values.yaml#L122
But it looks like it doesn't work, at least in this case, when a separate cluster-agent pod is running. I also found that the tags value from the Helm chart is translated into the following format, which looks wrong according to the documentation:

    - name: DD_TAGS
      value: '[map[kubernetescluster:k8s-blue-staging]]'

Does this look like a bug?

Found working configuration:

clusterAgent:
  ...
  env:
    # This also works but it's better not to break the default value
    # but add another one for backward compatibility
    # - name: DD_CLUSTER_CHECKS_CLUSTER_TAG_NAME
    #   value: "kubernetescluster"
    - name: DD_CLUSTER_CHECKS_EXTRA_TAGS
      value: "kubernetescluster:my-lovely-cluster"

@kivagant-ba

For people who will be looking for the Helm chart configuration, check the comment:

DataDog/datadog-agent#3923 (comment)

cswatt commented Aug 15, 2019

Contributor Author

Updated to webpack, closing this one. See the new PR: #5232

@cswatt cswatt closed this Aug 15, 2019
@l0k0ms l0k0ms deleted the cswatt/kube-state-metrics-cluster-check branch August 19, 2019 06:58