Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[kubernetes] Ingest k8s events + limits and requests metrics #2551

Merged
merged 18 commits into from Aug 23, 2016

Conversation

masci
Copy link
Contributor

@masci masci commented May 27, 2016

What

  • Fetch events from the Kubernetes API and feed them to Datadog.
  • Add new limits and requests metrics.

Logic recap:

  • Each agent fetches the whole list of available events and discards those not regarding pods running on the same node.
  • If the collect_events config param is set to True, the agent fetches the whole list of available events
  • The strategy above could be replaced with delegating Users should delegate one and only one agent pod to collect and send all the events - that agent possibly being the one running on k8s master. This unfortunately doesn't work on GCE where the master is not part of the cluster but it's provided by the platform as an external service instead. How to achieve this part is left to the users.
  • To avoid duplicates, the timestamp of the oldest event seen is stored in the Kubeutil singleton and used to filter out older events.

Open issues:

  • There are events not directly tied to Pods, like ReplicaSet events. Those are discarded for the moment.
  • There are serious concerns about flooding since there's no way at the moment to perform selection queries through the event API. We rely on the fact that k8s events currently have a TTL limited to 1 hour.

"type": "Normal"
}
]
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixture is a little bit large but contains a different kind of events, could be useful later.

@irabinovitch
Copy link
Contributor

@masci @remh Given the discussion about above GKE vs K8S should we reach out to our friends at Google for feedback on approach?

@irabinovitch
Copy link
Contributor

Do we have examples of the "extraneous" events that aren't tied to pods or replica sets?

@masci
Copy link
Contributor Author

masci commented Jun 1, 2016

@irabinovitch generally speaking, every node having the Kind field set to something different from Pod, for example Created pod: xxx having ReplicaSet kind, or Scaled up xxx of Deployment kind.

@masci masci changed the title [WIP][kubernetes] Ingest Kubernetes events [kubernetes] Ingest Kubernetes events Jun 6, 2016
@masci masci changed the title [kubernetes] Ingest Kubernetes events [kubernetes] Ingest k8s events + limits and requests metrics Jun 21, 2016
Massimiliano Pippi and others added 9 commits July 5, 2016 15:07
query the master API for events

fixed line endings

query the master API

pass the auth token

more logs

resolve host_ip

minor improvements, fixed log prints

post events to DD

fixed timestamp evaluation

fix timestamp evaluation

again on ts logic

do not overwrite last timestamp with 0

skip events from other nodes

default value if port is missing from instance

update skip condition

debug logs++

use node ip as host

moved timestamp register to kubeutil

fixed tests

added kubeutil tests

added tests for events
@masci masci force-pushed the massi/ingest_k8s_events branch 4 times, most recently from 7d97705 to f5cf8f6 Compare July 5, 2016 15:44
@hkaj hkaj added this to the 5.9.0 milestone Jul 13, 2016
@therc
Copy link
Contributor

therc commented Jul 20, 2016

For people with multiple masters, saying "make the pod running on the master be the one to get all events" doesn't work. Perhaps a PetSet might help? Or perhaps just run a DaemonSet on every node, then create an additional ReplicaSet/Deployment with a single agent pod that only reports events (the other metrics are left to the regular DaemonSet pod). So, on one machine, you'd have two agents, with no overlap between the data they report back.

@onbjerg
Copy link

onbjerg commented Aug 3, 2016

Hi @masci! I was notified about this PR by @irabinovitch. I just opened a new PR (#2728) about k8s deployment metrics. We should see if we could somehow combine the two.

A quick recap of what I did:

  • Queried the deployments endpoint (/apis/extensions/v1beta1/deployments)
  • Looped over each deployment and pulled data for the 3 metrics in [kubernetes] add metrics for k8s deployments #2728 (kubernetes.pods.desired, kubernetes.pods.available, kubernetes.pods.unavailable), tagged with kube_deployment:{name}
  • Resolved the deployment name for a pod based on the ReplicaSet name in kubernetes.io/created-by and the pod template hash (see L108 of kubeutil.py in the PR). This is used to tag the subcontainers with kube_deployment:{name}.

Other than that there's the obvious new fixture and some added cases in the tests. I also opened an issue #2722.

@onbjerg
Copy link

onbjerg commented Aug 3, 2016

Also: Are there any events for deployments? Currently I do not have deploy markers on my graphs for my team, because we are deploying using kubectl. It would be cool to have some sort of way to detect a k8s deploy.

@masci masci modified the milestones: 5.9.0, 5.10.0 Aug 18, 2016
@@ -93,5 +106,101 @@ def extract_kube_labels(self, pods_list, excluded_keys=None):

return kube_labels

def extract_meta(self, pods_list, field_name):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this? It's nice and it's tested ( 🍪 ) but I don't see any usage of it.

@hkaj
Copy link
Member

hkaj commented Aug 18, 2016

Let's make sure that it works nicely with 1.2 and 1.3, then it's good to go.

@hkaj
Copy link
Member

hkaj commented Aug 23, 2016

all good on 1.2 too, you can squash and 👍

@masci masci merged commit d79f7c8 into master Aug 23, 2016
@masci masci deleted the massi/ingest_k8s_events branch August 23, 2016 14:30
@therc
Copy link
Contributor

therc commented Aug 30, 2016

Is this available for testing in any of the Docker images?

@masci
Copy link
Contributor Author

masci commented Aug 31, 2016

You can try the nightly tag from https://hub.docker.com/r/datadog/docker-dd-agent/tags/ but please consider the agent as unstable. Events collection is disabled by default, you can activate it in the check configuration file if you want. Later it'll be possible to instrument that feature through env variables, see DataDog/docker-dd-agent#117

@therc
Copy link
Contributor

therc commented Aug 31, 2016

I tried nightly, but I'm not sure it has the latest:

-rw-r--r--. 4 root root 10843 Aug 19 04:04 kubernetes.py
-rw-r--r--. 1 root root 9823 Aug 31 15:41 kubernetes.pyc

It doesn't e.g. import the time module as done in this PR.

@therc
Copy link
Contributor

therc commented Aug 31, 2016

From docker inspect:

"Created": "2016-08-29T16:41:53.174520676Z",

From https://hub.docker.com/r/datadog/docker-dd-agent/builds/bc5brbm22yzmrwsiruph9ua/

2016-08-29T16:40:33.009Z

Nothing else lines up exactly, either. E.g. I can't find a revision of kubernetes.py that is exactly 10843 bytes.

@therc
Copy link
Contributor

therc commented Aug 31, 2016

Nevermind, I was looking at the commit history in the PR, not in the repo. It turns out that the image has kubernetes.py from my Aug 7 change. :-)

https://github.com/DataDog/dd-agent/blob/ef3aa479ea2aae19cb74a5f14e70a45566197550/checks.d/kubernetes.py

So, despite the nightly name, it looks like the source tree has .py files from Aug 19.

@rosskukulinski
Copy link

Important note: This requires that the kubernetes service exists within the namespace dd-agent is running. This is not always the case (especially in namespaces that are not default).

The preferred mechanism to get the API Server name is by using the environmental variables injected into every pod: KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT.

@rosskukulinski
Copy link

This also only gets the events for the default namespace. I believe it should pull events from all namespaces.

@rosskukulinski
Copy link

cc: @remh who I was chatting with on IRC

@masci
Copy link
Contributor Author

masci commented Sep 12, 2016

Hi @rosskukulinski and thanks for your feedback, very precious.

You can specify any namespace other than default in the configuration file under the instance section. Anyway, we could (and probably should) collect all the events with no regards for the namespace, but I'd like this to be behind some sort of configuration parameter with reasonable defaults.

Since this code is already in master and is going to be shipped with 5.9, do you mind to open two new, different issues to track and discuss:
a) the way we get the api URL
b) how we can improve events gathering when someone is interested in any namespace available

A PR would work well too 😜 !

Thanks!

@rosskukulinski
Copy link

done, thanks @masci. Up to my clients as to whether I invest the development time in submitting PRs.

# requests
try:
for request, value_str in container['resources']['requests'].iteritems():
values = [float(s) for s in prog.findall(value_str)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this going to parse eg. 500m as 500 instead of 0.5?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonmoter
Copy link

jonmoter commented Oct 6, 2016

@masci, you said:

You can specify any namespace other than default in the configuration file under the instance section.

I'm looking at the kubernetes.yaml.example file, and I don't see anything documented on how to change the namespace where it looks for the kubernetes service. Could you clarify?

@masci
Copy link
Contributor Author

masci commented Oct 7, 2016

Hi @jonmoter
that's because documentation and the example yaml file are not up to date unfortunately but basically all you have to do is

instance:
  namespace: my-other-namespace

for more details on the issue and provide feedback let's use #2838

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants