Get metrics from the Kubernetes service in real time to:

- Visualize and monitor Kubernetes states.
- Be notified about Kubernetes failovers and events.
The Kubernetes State Metrics Core check leverages kube-state-metrics version 2+ and includes major performance and tagging improvements compared to the legacy `kubernetes_state` check.

As opposed to the legacy check, with the Kubernetes State Metrics Core check, you no longer need to deploy kube-state-metrics in your cluster.

Kubernetes State Metrics Core provides a better alternative to the legacy `kubernetes_state` check as it offers more granular metrics and tags. See Major Changes and Data Collected for more details.
The Kubernetes State Metrics Core check is included in the Datadog Cluster Agent image, so you don't need to install anything else on your Kubernetes servers.
- Datadog Cluster Agent v1.12+
In your Helm `values.yaml`, add the following:

```yaml
datadog:
  # (...)
  kubeStateMetricsCore:
    enabled: true
```
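Then upgrade your Helm release so the change takes effect (replace `<RELEASE_NAME>` with your release name, and `values.yaml` with the path to your values file):

```shell
helm upgrade -f values.yaml <RELEASE_NAME> datadog/datadog
```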
To enable the `kubernetes_state_core` check, set `spec.features.kubeStateMetricsCore.enabled` to `true` in the DatadogAgent resource:

```yaml
kind: DatadogAgent
apiVersion: datadoghq.com/v2alpha1
metadata:
  name: datadog
spec:
  global:
    credentials:
      apiKey: <DATADOG_API_KEY>
  features:
    kubeStateMetricsCore:
      enabled: true
```
Note: Datadog Operator v0.7.0 or greater is required.
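Then apply the updated manifest (assuming it is saved as `datadog-agent.yaml` and the Agent is deployed in `$DD_NAMESPACE`):

```shell
kubectl apply -n $DD_NAMESPACE -f datadog-agent.yaml
```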
In the original `kubernetes_state` check, several tags have been flagged as deprecated and replaced by new tags. To determine your migration path, check which tags are submitted with your metrics.

In the `kubernetes_state_core` check, only the non-deprecated tags are submitted. Before migrating from `kubernetes_state` to `kubernetes_state_core`, verify that only official tags are used in monitors and dashboards.
Here is the mapping between deprecated tags and the official tags that have replaced them:
| Deprecated tag | Official tag |
|---|---|
| `cluster_name` | `kube_cluster_name` |
| `container` | `kube_container_name` |
| `cronjob` | `kube_cronjob` |
| `daemonset` | `kube_daemon_set` |
| `deployment` | `kube_deployment` |
| `hpa` | `horizontalpodautoscaler` |
| `image` | `image_name` |
| `job` | `kube_job` |
| `job_name` | `kube_job` |
| `namespace` | `kube_namespace` |
| `phase` | `pod_phase` |
| `pod` | `pod_name` |
| `replicaset` | `kube_replica_set` |
| `replicationcontroller` | `kube_replication_controller` |
| `statefulset` | `kube_stateful_set` |
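For example, a monitor or dashboard query built on deprecated tags would be rewritten with the official replacements (the metric name and tag values here are illustrative):

```text
# Before: deprecated tags
sum:kubernetes_state.pod.ready{deployment:web,namespace:prod}

# After: official tags
sum:kubernetes_state.pod.ready{kube_deployment:web,kube_namespace:prod}
```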
The Kubernetes State Metrics Core check is not backward compatible; be sure to read the changes carefully before migrating from the legacy `kubernetes_state` check.
- `kubernetes_state.node.by_condition`: A new metric with node name granularity. The legacy metric `kubernetes_state.nodes.by_condition` is deprecated in favor of this one. **Note**: This metric is backported into the legacy check, where both it and the legacy metric it replaces are available.
- `kubernetes_state.persistentvolume.by_phase`: A new metric with persistentvolume name granularity. It replaces `kubernetes_state.persistentvolumes.by_phase`.
- `kubernetes_state.pod.status_phase`: The metric is tagged with pod-level tags, like `pod_name`.
- `kubernetes_state.node.count`: The metric is no longer tagged with `host`. It aggregates the node count by `kernel_version`, `os_image`, `container_runtime_version`, and `kubelet_version`.
- `kubernetes_state.container.waiting` and `kubernetes_state.container.status_report.count.waiting`: These metrics no longer emit a 0 value if no pods are waiting. They only report non-zero values.
- `kube_job`: In `kubernetes_state`, the `kube_job` tag value is the `CronJob` name if the `Job` had a `CronJob` as an owner; otherwise, it is the `Job` name. In `kubernetes_state_core`, the `kube_job` tag value is always the `Job` name, and a new `kube_cronjob` tag key is added with the `CronJob` name as the tag value. When migrating to `kubernetes_state_core`, it is recommended to use the new tag or `kube_job:foo*`, where `foo` is the `CronJob` name, for query filters.
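For example, to keep a query scoped to jobs spawned by a hypothetical CronJob named `foo` after the migration:

```text
# Filter on the new dedicated tag:
sum:kubernetes_state.job.succeeded{kube_cronjob:foo}

# Or match the Job name prefix:
sum:kubernetes_state.job.succeeded{kube_job:foo*}
```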
- `kubernetes_state.job.succeeded`: In `kubernetes_state`, `kubernetes_state.job.succeeded` was `count` type. In `kubernetes_state_core` it is `gauge` type.
Host or node-level tags no longer appear on cluster-centric metrics. Only metrics relative to an actual node in the cluster, like `kubernetes_state.node.by_condition` or `kubernetes_state.container.restarts`, continue to inherit their respective host or node-level tags.
To add tags globally, use the `DD_TAGS` environment variable, or use the respective Helm or Operator configurations. Instance-only tags can be specified by mounting a custom `kubernetes_state_core.yaml` into the Cluster Agent.

```yaml
datadog:
  kubeStateMetricsCore:
    enabled: true
    tags:
      - "<TAG_KEY>:<TAG_VALUE>"
```
```yaml
kind: DatadogAgent
apiVersion: datadoghq.com/v2alpha1
metadata:
  name: datadog
spec:
  global:
    credentials:
      apiKey: <DATADOG_API_KEY>
    tags:
      - "<TAG_KEY>:<TAG_VALUE>"
  features:
    kubeStateMetricsCore:
      enabled: true
```
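As a sketch, a mounted `kubernetes_state_core.yaml` for instance-only tags could follow the standard Agent check configuration layout (the exact mount path depends on your Cluster Agent setup):

```yaml
# kubernetes_state_core.yaml mounted into the Cluster Agent's conf.d directory
init_config:

instances:
  - tags:
      - "<TAG_KEY>:<TAG_VALUE>"
```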
Metrics like `kubernetes_state.container.memory_limit.total` or `kubernetes_state.node.count` are aggregate counts of groups within a cluster, and host or node-level tags are not added.
Enabling `kubeStateMetricsCore` in your Helm `values.yaml` configures the Agent to ignore the auto configuration file for the legacy `kubernetes_state` check. The goal is to avoid running both checks simultaneously.

If you still want to enable both checks during the migration phase, set the `ignoreLegacyKSMCheck` field to `false` in your `values.yaml`.

**Note**: `ignoreLegacyKSMCheck` makes the Agent ignore only the auto configuration for the legacy `kubernetes_state` check. Custom `kubernetes_state` configurations need to be removed manually.
Because the Kubernetes State Metrics Core check no longer requires deploying `kube-state-metrics` in your cluster, you can disable deploying `kube-state-metrics` as part of the Datadog Helm chart. To do this, add the following to your Helm `values.yaml`:

```yaml
datadog:
  # (...)
  kubeStateMetricsEnabled: false
```
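After upgrading the chart, you can verify that no `kube-state-metrics` Deployment remains (this assumes the standard `app.kubernetes.io/name` label is set on the Deployment):

```shell
kubectl get deployments --all-namespaces -l app.kubernetes.io/name=kube-state-metrics
```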
**Important Note**: The Kubernetes State Metrics Core check is an alternative to the legacy `kubernetes_state` check. Datadog recommends not enabling both checks simultaneously to guarantee consistent metrics.
See metadata.csv for a list of metrics provided by this integration.
**Note**: You can configure Datadog Standard labels on your Kubernetes objects to get the `env`, `service`, and `version` tags.
The Kubernetes State Metrics Core check does not include any events.
| Recommended Label | Tag |
|---|---|
| `app.kubernetes.io/name` | `kube_app_name` |
| `app.kubernetes.io/instance` | `kube_app_instance` |
| `app.kubernetes.io/version` | `kube_app_version` |
| `app.kubernetes.io/component` | `kube_app_component` |
| `app.kubernetes.io/part-of` | `kube_app_part_of` |
| `app.kubernetes.io/managed-by` | `kube_app_managed_by` |
| `helm.sh/chart` | `helm_chart` |
| Recommended Label | Tag |
|---|---|
| `topology.kubernetes.io/region` | `kube_region` |
| `topology.kubernetes.io/zone` | `kube_zone` |
| `failure-domain.beta.kubernetes.io/region` | `kube_region` |
| `failure-domain.beta.kubernetes.io/zone` | `kube_zone` |
| Datadog Label | Tag |
|---|---|
| `tags.datadoghq.com/env` | `env` |
| `tags.datadoghq.com/service` | `service` |
| `tags.datadoghq.com/version` | `version` |
- `kubernetes_state.cronjob.complete`: Whether the last job of the cronjob failed. Tags: `kube_cronjob`, `kube_namespace` (`env`, `service`, `version` from standard labels).
- `kubernetes_state.cronjob.on_schedule_check`: Alerts if the cronjob's next schedule is in the past. Tags: `kube_cronjob`, `kube_namespace` (`env`, `service`, `version` from standard labels).
- `kubernetes_state.job.complete`: Whether the job failed. Tags: `kube_job` or `kube_cronjob`, `kube_namespace` (`env`, `service`, `version` from standard labels).
- `kubernetes_state.node.ready`: Whether the node is ready. Tags: `node`, `condition`, `status`.
- `kubernetes_state.node.out_of_disk`: Whether the node is out of disk. Tags: `node`, `condition`, `status`.
- `kubernetes_state.node.disk_pressure`: Whether the node is under disk pressure. Tags: `node`, `condition`, `status`.
- `kubernetes_state.node.network_unavailable`: Whether the node network is unavailable. Tags: `node`, `condition`, `status`.
- `kubernetes_state.node.memory_pressure`: Whether the node is under memory pressure. Tags: `node`, `condition`, `status`.
Run the Cluster Agent's `status` subcommand inside your Cluster Agent container and look for `kubernetes_state_core` under the Checks section.
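For example, with `kubectl` (replace `<CLUSTER_AGENT_POD_NAME>` with the name of your Cluster Agent pod):

```shell
kubectl exec -it <CLUSTER_AGENT_POD_NAME> -- datadog-cluster-agent status
```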
By default, the Kubernetes State Metrics Core check waits 10 seconds for a response from the Kubernetes API server. For large clusters, the request may time out, resulting in missing metrics. You can avoid this by setting the environment variable `DD_KUBERNETES_APISERVER_CLIENT_TIMEOUT` to a value higher than the default 10 seconds.
Update your `datadog-agent.yaml` with the following configuration:

```yaml
apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  override:
    clusterAgent:
      env:
        - name: DD_KUBERNETES_APISERVER_CLIENT_TIMEOUT
          value: <value_greater_than_10>
```
Then apply the new configuration:

```shell
kubectl apply -n $DD_NAMESPACE -f datadog-agent.yaml
```
Update your `datadog-values.yaml` with the following configuration:

```yaml
clusterAgent:
  env:
    - name: DD_KUBERNETES_APISERVER_CLIENT_TIMEOUT
      value: <value_greater_than_10>
```
Then upgrade your Helm chart:

```shell
helm upgrade -f datadog-values.yaml <RELEASE_NAME> datadog/datadog
```
Need help? Contact Datadog support.