
[kubernetes] Kubernetes integration submits high volume of queries for events #3381

Closed
cberner opened this issue Jun 10, 2017 · 10 comments

@cberner commented Jun 10, 2017

**Output of the info page**

2017-06-09 23:44:35,115 | WARNING | dd.collector | utils.service_discovery.config(config.py:31) | No configuration backend provided for service discovery. Only auto config templates will be used.
====================
Collector (v 5.14.0)
====================

  Status date: 2017-06-09 23:44:23 (12s ago)
  Pid: 24
  Platform: Linux-4.4.0-72-generic-x86_64-with-debian-8.8
  Python Version: 2.7.13, 64bit
  Logs: <stderr>, /var/log/datadog/collector.log

  Clocks
  ======

    NTP offset: -0.111 s
    System UTC time: 2017-06-09 23:44:35.231581

  Paths
  =====

    conf.d: /etc/dd-agent/conf.d
    checks.d: /opt/datadog-agent/agent/checks.d

  Hostnames
  =========

    socket-hostname: dd-agent-2ddz7
    hostname: kubernetes-worker-10-85-16-8.dev.openai.org
    socket-fqdn: dd-agent-2ddz7

  Checks
  ======

    ntp (5.14.0)
    ------------
      - Collected 0 metrics, 0 events & 0 service checks

    disk (5.14.0)
    -------------
      - instance #0 [OK]
      - Collected 32 metrics, 0 events & 0 service checks

    docker_daemon (5.14.0)
    ----------------------
      - instance #0 [OK]
      - Collected 157 metrics, 0 events & 1 service check

    kubernetes (5.14.0)
    -------------------
      - initialize check class [ERROR]: ConnectionError(MaxRetryError('None: Max retries exceeded with url: /healthz?verbose=True (Caused by None)',),)

  Emitters
  ========

    - http_emitter [OK]

====================
Dogstatsd (v 5.14.0)
====================

  Status date: 2017-06-09 23:44:30 (4s ago)
  Pid: 18
  Platform: Linux-4.4.0-72-generic-x86_64-with-debian-8.8
  Python Version: 2.7.13, 64bit
  Logs: <stderr>, /var/log/datadog/dogstatsd.log

  Flush count: 315
  Packet Count: 5360
  Packets per second: 1.7
  Metric count: 9
  Event count: 0
  Service check count: 0

====================
Forwarder (v 5.14.0)
====================

  Status date: 2017-06-09 23:44:32 (3s ago)
  Pid: 17
  Platform: Linux-4.4.0-72-generic-x86_64-with-debian-8.8
  Python Version: 2.7.13, 64bit
  Logs: <stderr>, /var/log/datadog/forwarder.log

  Queue Size: 0 bytes
  Queue Length: 0
  Flush Count: 1104
  Transactions received: 645
  Transactions flushed: 645
  Transactions rejected: 0
  API Key Status: API Key is valid


======================
Trace Agent (v 5.14.0)
======================

  Pid: 16
  Uptime: 3158 seconds
  Mem alloc: 964792 bytes

  Hostname: dd-agent-2ddz7
  Receiver: 0.0.0.0:8126
  API Endpoint: https://trace.agent.datadoghq.com

  Bytes received (1 min): 0
  Traces received (1 min): 0
  Spans received (1 min): 0

  Bytes sent (1 min): 0
  Traces sent (1 min): 0
  Stats sent (1 min): 0

Additional environment details (Operating System, Cloud provider, etc):
Ubuntu 14.04, Kubernetes cluster, Azure

Steps to reproduce the issue:

  1. Install dd-agent with the following DaemonSet:
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: dd-agent
  namespace: daemons
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 100%
  template:
    metadata:
      labels:
        app: dd-agent
      name: dd-agent
    spec:
      serviceAccountName: datadog
      containers:
      - image: datadog/docker-dd-agent:latest
        imagePullPolicy: Always
        name: dd-agent
        ports:
          - containerPort: 8125
            name: dogstatsdport
            protocol: UDP
        env:
          - name: API_KEY
            value: xxxxxx
          - name: KUBERNETES
            value: "yes"
          - name: SD_BACKEND
            value: docker
          - name: TAGS
            value: env:${env},cluster:${cluster_name}
        volumeMounts:
          - name: dockersocket
            mountPath: /var/run/docker.sock
          - name: procdir
            mountPath: /host/proc
            readOnly: true
          - name: cgroups
            mountPath: /host/sys/fs/cgroup
            readOnly: true
      volumes:
        - hostPath:
            path: /var/run/docker.sock
          name: dockersocket
        - hostPath:
            path: /proc
          name: procdir
        - hostPath:
            path: /sys/fs/cgroup
          name: cgroups
      dnsPolicy: Default  # Don't use cluster DNS.
  2. Check the audit logs on the apiserver, and observe queries from every node in the cluster, like:
    2017-06-09T23:21:37.230676064Z AUDIT: id="db2c5ebe-eb71-48ab-a1a2-8b3b537c916b" ip="10.126.18.37" method="GET" user="system:serviceaccount:daemons:datadog" groups="\"system:serviceaccounts\",\"system:serviceaccounts:daemons\",\"system:authenticated\"" as="<self>" asgroups="<lookup>" namespace="<none>" uri="/api/v1/events"

Describe the results you received:
Too many expensive queries to the API servers.

Describe the results you expected:
Datadog should not account for 20%+ of our apiserver traffic

@hkaj (Member) commented Jun 14, 2017

Hi @cberner
Thanks for the report. You're right, it seems we got a little greedy with the events API. We used to query it only for reporting events, but with 5.14 we started reporting a kube_service tag on Auto Discovered checks, and that hits the events API frequently from every agent.

We're queuing up some work on this; expect it in 5.14.1. In the meantime, downgrading to 5.13.2 should reduce the load.

Sorry for the trouble.

@xvello (Member) commented Jun 14, 2017

Hi @cberner,

I reworked the kube service matching logic to only poll the apiserver every 5 minutes. That means the kube_service tagging might lag a few minutes behind service creation/deletion, but it will greatly reduce the traffic. This interval can be configured via the service_tag_update_freq option in kubernetes.yaml.
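For reference, it would sit in the kubernetes.yaml check config, roughly like this (a sketch; the 300-second value and instance-level placement are assumptions, not confirmed defaults):

```yaml
# /etc/dd-agent/conf.d/kubernetes.yaml -- sketch, only the relevant keys shown
init_config:

instances:
  - port: 4194                      # cAdvisor port, as in the stock template
    service_tag_update_freq: 300    # seconds between apiserver polls for kube_service tags (assumed value)
```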

Could you please try running the datadog/dev-dd-agent:xvello_kube_event_delay docker image and tell me if the apiserver traffic is back to reasonable levels?

Cheers

@cberner (Author) commented Jun 14, 2017

Does it still use the events API? Even once per 5 min is going to be too much: we have 1500+ nodes in our cluster, so that's 5 queries per second, and these are very expensive queries. There's a collect_events: True option which the documentation says should only be set on one node, so it seems like this polling should be part of that.
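For reference, that option lives in the kubernetes.yaml check config; a sketch of what I mean (illustrative, not our actual config):

```yaml
# kubernetes.yaml -- sketch; per the docs, collect_events should be enabled on one node only
init_config:

instances:
  - port: 4194
    collect_events: true
```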

@xvello (Member) commented Jun 14, 2017

Unfortunately, as we do the pod -> service mapping on the agent side, every agent needs this information, which the kubelet does not store.

If you don't use the kube_service tag, we can introduce a configuration option to disable it. The kube_deployment / kube_daemonset / kube_task tags would still work as they only rely on node-local information. Would that work for you?

@cberner (Author) commented Jun 14, 2017

Yes, that would be great.

@olivielpeau modified the milestones: 5.15, 5.14.1 on Jun 15, 2017

@xvello (Member) commented Jun 16, 2017

Hi @cberner,

We just released 5.14.1, which includes the fix for this issue. Could you please confirm that setting collect_service_tags to false on datadog/docker-dd-agent:latest works for you?
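For file-based configuration, the option would look roughly like this in kubernetes.yaml (a sketch; instance-level placement assumed):

```yaml
# kubernetes.yaml -- sketch for 5.14.1; disables the kube_service tag and the apiserver polling it requires
init_config:

instances:
  - port: 4194
    collect_service_tags: false
```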

Sorry for the trouble this caused.

@xvello modified the milestones: 5.14.1, 5.15 on Jun 19, 2017

@xvello closed this on Jun 19, 2017

@cberner (Author) commented Jun 19, 2017

Thanks. Is there an environment variable that can be used to set that?

@xvello (Member) commented Jun 20, 2017

Right now, no. But as I'm merging other improvements to the entrypoint, I'll add one.
Do you need help injecting a custom kubernetes.yaml?
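If it helps, the usual approach is to mount a ConfigMap over the stock check config; here is a sketch built on the DaemonSet above (the ConfigMap name and the subPath mount are illustrative assumptions):

```yaml
# Sketch: ship a custom kubernetes.yaml via a ConfigMap (names are illustrative)
apiVersion: v1
kind: ConfigMap
metadata:
  name: dd-agent-kubernetes-check
  namespace: daemons
data:
  kubernetes.yaml: |
    init_config:
    instances:
      - port: 4194
        collect_service_tags: false
```

and in the dd-agent container spec, mount it over the default file:

```yaml
# Added to the existing volumeMounts / volumes of the DaemonSet
        volumeMounts:
          - name: dd-kubernetes-check
            mountPath: /etc/dd-agent/conf.d/kubernetes.yaml
            subPath: kubernetes.yaml
      volumes:
        - name: dd-kubernetes-check
          configMap:
            name: dd-agent-kubernetes-check
```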

@cberner (Author) commented Jun 20, 2017

Great, thanks! Nope, we're happy to wait for the new entrypoint. We have a workaround of just redirecting all the Datadog agents' Kube API server requests to a hostname that doesn't exist.
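(Concretely, the workaround is roughly the following env override in the DaemonSet. It's only a sketch: whether the check actually resolves the apiserver from KUBERNETES_SERVICE_HOST is an assumption on our part, and the hostname is deliberately bogus.)

```yaml
# Sketch: point the agent's apiserver discovery at a non-existent host
# (assumes the check reads KUBERNETES_SERVICE_HOST / KUBERNETES_SERVICE_PORT)
        env:
          - name: KUBERNETES_SERVICE_HOST
            value: "apiserver.invalid"
          - name: KUBERNETES_SERVICE_PORT
            value: "443"
```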

@xvello (Member) commented Jun 26, 2017

Hi @cberner

I just merged DataDog/docker-dd-agent#214, which adds the KUBERNETES_COLLECT_SERVICE_TAGS envvar to the entrypoint. All latest flavors have been rebuilt with it.
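For the DaemonSet in this issue, that's just an extra entry in the container env (the "false" value is an assumed format for how the entrypoint reads it):

```yaml
# Added to the dd-agent container's env in the DaemonSet above
          - name: KUBERNETES_COLLECT_SERVICE_TAGS
            value: "false"
```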

Regards
