Is timeoutSeconds=1 intended? #93

Open
SergeyKanzhelev opened this issue Apr 23, 2021 · 2 comments

Comments

SergeyKanzhelev commented Apr 23, 2021

In 1.20 the exec probe timeout will start being enforced:

Before Kubernetes 1.20, the field timeoutSeconds was not respected for exec probes: probes continued running indefinitely, even past their configured deadline, until a result was returned.

So if this callback was not designed or tested to complete within 1 second, the agent may start being killed under heavy load or resource starvation, because the liveness probe will begin to fail:

livenessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - |
      LIVENESS_THRESHOLD_SECONDS=${LIVENESS_THRESHOLD_SECONDS:-300}; STUCK_THRESHOLD_SECONDS=${LIVENESS_THRESHOLD_SECONDS:-900}; if [ ! -e /var/run/google-fluentd/buffers ]; then
        exit 1;
      fi; touch -d "${STUCK_THRESHOLD_SECONDS} seconds ago" /tmp/marker-stuck; if [[ -z "$(find /var/run/google-fluentd/buffers -type f -newer /tmp/marker-stuck -print -quit)" ]]; then
        rm -rf /var/run/google-fluentd/buffers;
        exit 1;
      fi; touch -d "${LIVENESS_THRESHOLD_SECONDS} seconds ago" /tmp/marker-liveness; if [[ -z "$(find /var/run/google-fluentd/buffers -type f -newer /tmp/marker-liveness -print -quit)" ]]; then
        exit 1;
      fi;
  failureThreshold: 3
  initialDelaySeconds: 600
  periodSeconds: 60
  successThreshold: 1
  timeoutSeconds: 1
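
Read together, the timing fields above work out as follows once the timeout is enforced (this is just a reading of the existing values, not a proposed change):

  periodSeconds: 60      # the probe script runs once per minute
  timeoutSeconds: 1      # from 1.20 on, a run that exceeds 1 second is treated as a failure
  failureThreshold: 3    # three consecutive failures (~3 minutes of slow runs) restart the agent container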

I recommend bumping the value to some large number after testing it.


liggitt commented Apr 23, 2021

I recommend bumping the value to some large number after testing it.

do we actually want a large timeout?

@SergeyKanzhelev (Author)

I recommend a big number because a long probe execution likely indicates high IO latency or CPU starvation. Neither is a reason to kill the pod that is doing the monitoring; it is especially valuable to keep reporting data from a node that is under heavy load.
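
For illustration, the adjusted timing fields might look like the sketch below; the timeout value is hypothetical and should only be picked after measuring how long the probe script actually takes under load:

  # Hypothetical values; the exec command stays the same as above
  failureThreshold: 3
  initialDelaySeconds: 600
  periodSeconds: 60
  successThreshold: 1
  timeoutSeconds: 30   # illustrative "big number"; keep it well above the script's measured runtime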
