Is timeoutSeconds=1 intended? #93

Open
SergeyKanzhelev opened this issue Apr 23, 2021 · 2 comments

Comments

SergeyKanzhelev commented Apr 23, 2021

In 1.20 the exec probe timeout will start being enforced:

Before Kubernetes 1.20, the field timeoutSeconds was not respected for exec probes: probes continued running indefinitely, even past their configured deadline, until a result was returned.

So if this callback was not designed or tested to complete within 1 second, the agent may start being killed under heavy load or resource starvation, because the liveness probe will begin to fail:

livenessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - |
      LIVENESS_THRESHOLD_SECONDS=${LIVENESS_THRESHOLD_SECONDS:-300}; STUCK_THRESHOLD_SECONDS=${LIVENESS_THRESHOLD_SECONDS:-900}; if [ ! -e /var/run/google-fluentd/buffers ]; then
        exit 1;
      fi; touch -d "${STUCK_THRESHOLD_SECONDS} seconds ago" /tmp/marker-stuck; if [[ -z "$(find /var/run/google-fluentd/buffers -type f -newer /tmp/marker-stuck -print -quit)" ]]; then
        rm -rf /var/run/google-fluentd/buffers;
        exit 1;
      fi; touch -d "${LIVENESS_THRESHOLD_SECONDS} seconds ago" /tmp/marker-liveness; if [[ -z "$(find /var/run/google-fluentd/buffers -type f -newer /tmp/marker-liveness -print -quit)" ]]; then
        exit 1;
      fi;
  failureThreshold: 3
  initialDelaySeconds: 600
  periodSeconds: 60
  successThreshold: 1
  timeoutSeconds: 1
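
Read together, the timing fields above work out as follows once the timeout is enforced (this is just a reading of the existing values, not a proposed change):

  periodSeconds: 60      # the probe script runs once per minute
  timeoutSeconds: 1      # from 1.20 on, a run that exceeds 1 second is treated as a failure
  failureThreshold: 3    # three consecutive failures (~3 minutes of slow runs) restart the agent container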

I recommend bumping the value to some large number after testing it.


liggitt commented Apr 23, 2021

I recommend bumping the value to some large number after testing it.

do we actually want a large timeout?

@SergeyKanzhelev (Author)

I recommend a big number because a long probe execution likely indicates high IO latency or CPU starvation. Neither is a reason to kill the pod that is doing the monitoring; it is especially valuable to keep reporting data from a node that is under heavy load.
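
For illustration, the adjusted timing fields might look like the sketch below; the timeout value is hypothetical and should only be picked after measuring how long the probe script actually takes under load:

  # Hypothetical values; the exec command stays the same as above
  failureThreshold: 3
  initialDelaySeconds: 600
  periodSeconds: 60
  successThreshold: 1
  timeoutSeconds: 30   # illustrative "big number"; keep it well above the script's measured runtime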
