
Improve fluentd liveness probe #343

Merged
11 commits merged into master on Jan 6, 2020

Conversation

@vsinghal13 (Contributor) commented Dec 19, 2019

Description

Add a new liveness probe to the fluentd deployment.

httpGet:
    path: /fluentd.pod.healthcheck?json=%7B%22log%22%3A+%22health+check%22%7D
    port: 9880

The rationale is that if Fluentd can accept log messages, it must be healthy.
Hitting the endpoint produces a new fluentd tag, fluentd.pod-healthcheck.
The query parameter in the URL is a URL-encoded JSON object that looks like this:
{"log": "health check"}
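As a quick sanity check, the query string in the probe path round-trips with Python's standard library (this is just an illustration of the encoding, not part of the PR):

```python
from urllib.parse import quote_plus, unquote_plus

# The raw record the probe sends in the "json" query parameter.
payload = '{"log": "health check"}'

# quote_plus percent-encodes the JSON and turns spaces into '+',
# reproducing the query string used in the probe path.
encoded = quote_plus(payload)
print(encoded)  # %7B%22log%22%3A+%22health+check%22%7D

# Decoding the probe's query parameter recovers the JSON object.
print(unquote_plus(encoded))  # {"log": "health check"}
```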

Testing performed
  • ci/build.sh
  • Redeploy fluentd and fluentd-events pods
  • Confirm events, logs, and metrics are coming in

@@ -3,5 +3,10 @@
port 24321
bind 0.0.0.0
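The hunk above only shows the surrounding source context (the existing port 24321 / bind lines); the HTTP source the PR adds is not fully visible in this excerpt. Based on the probe's port, it plausibly looks like the following sketch using fluentd's in_http plugin, where the request path becomes the event tag (the exact options are assumed, not taken from the diff):

    <source>
      @type http
      port 9880
      bind 0.0.0.0
    </source>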
Contributor:

I'm confused why only logs.conf gets this liveness probe. Should this be moved to common.conf ?

- "[ $(pgrep ruby | wc -l) -gt 0 ]"
+ httpGet:
+   path: /fluentd.pod.healthcheck?json=%7B%22log%22%3A+%22health+check%22%7D
+   port: 9880
  initialDelaySeconds: 300
Contributor:

This seems very generous; do we really need to wait 5 minutes before starting the liveness probe?

Contributor (author):

We currently have a wait of 5 minutes, so I kept that. It is recommended to be generous here; otherwise the pod might end up in an infinite loop of restarts during startup.

Contributor:

5 min does sound a little high, but the readiness probe will fail first and the pod will stop accepting data. If we need more replicas in that window, I think we'll be okay with the autoscaler enabled. So I'm fine with leaving it at 5 min.

@frankreno (Contributor) left a comment:

LGTM, but is there any reason we would not also want to do this on the events deployment? The same logic holds true there, I believe.

@rvmiller89 (Contributor):

+1 for doing this on the events deployment; I missed that part.

@vsinghal13 vsinghal13 merged commit 81012f7 into master Jan 6, 2020
@vsinghal13 vsinghal13 deleted the vsinghal-improve-fluentd-liveness-probe branch January 6, 2020 19:43
4 participants