
Add curl timeout to the rgw-probe.sh healthcheck script #14143

Open
kayrus opened this issue Apr 30, 2024 · 15 comments · May be fixed by #14144

@kayrus commented Apr 30, 2024

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior:

When there is a network hiccup, the bash scripts with a curl probe start to accumulate. By default, curl has a very long connect timeout, e.g.:

$ time curl 8.8.8.8
curl: (28) Failed to connect to 8.8.8.8 port 80 after 134639 ms: Connection timed out

real    2m14,677s
user    0m0,007s
sys     0m0,032s

And the probes are:

        failureThreshold: 3
        initialDelaySeconds: 10
        periodSeconds: 10
        successThreshold: 3
        timeoutSeconds: 5

Expected behavior:

The curl probe should have a connect timeout of roughly 3 seconds, e.g. via --connect-timeout 3 or --max-time 3.
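
For illustration, either flag bounds the wait against the same unreachable address used above (a sketch; the exact error text varies by curl version):

time curl --connect-timeout 3 8.8.8.8   # caps only the TCP connect phase; fails after ~3s instead of ~2m
time curl --max-time 3 8.8.8.8          # caps the entire operation at 3s, a hard upper bound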

How to reproduce it (minimal and precise):

n/a

File(s) to submit:

diff --git a/pkg/operator/ceph/object/rgw-probe.sh b/pkg/operator/ceph/object/rgw-probe.sh
index 853b430bc..20297f572 100644
--- a/pkg/operator/ceph/object/rgw-probe.sh
+++ b/pkg/operator/ceph/object/rgw-probe.sh
@@ -17,11 +17,12 @@ RGW_URL="$PROBE_PROTOCOL://0.0.0.0:$PROBE_PORT"
 
 function check() {
   local URL="$1"
+  # --max-time 3 - override the default 2 minutes wait time down to 3 seconds
   # --insecure - don't validate ssl if using secure port only
   # --silent - don't output progress info
   # --output /dev/stderr - output HTML header to stdout (good for debugging)
   # --write-out '%{response_code}' - print the HTTP response code to stdout
-  curl --insecure --silent --output /dev/stderr --write-out '%{response_code}' "$URL"
+  curl --max-time 3 --insecure --silent --output /dev/stderr --write-out '%{response_code}' "$URL"
 }
 
 http_response="$(check "$RGW_URL")"
  • Cluster CR (custom resource), typically called cluster.yaml, if necessary

Logs to submit:

  • Operator's logs, if necessary

  • Crashing pod(s) logs, if necessary

    To get logs, use kubectl -n <namespace> logs <pod name>
    When pasting logs, always surround them with backticks or use the insert code button from the GitHub UI.
    Read GitHub documentation if you need help.

Cluster Status to submit:

  • Output of kubectl commands, if necessary

    To get the health of the cluster, use kubectl rook-ceph health
    To get the status of the cluster, use kubectl rook-ceph ceph status
    For more details, see the Rook kubectl Plugin

Environment:

  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Cloud provider or hardware configuration:
  • Rook version (use rook version inside of a Rook Pod):
  • Storage backend version (e.g. for ceph do ceph -v):
  • Kubernetes version (use kubectl version):
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift):
  • Storage backend status (e.g. for Ceph use ceph health in the Rook Ceph toolbox):
kayrus added the bug label on Apr 30, 2024
kayrus linked a pull request (#14144) on Apr 30, 2024 that will close this issue
@BlaineEXE (Member) commented:

I'll take a look at this in the coming days. I want to be sure that the best solution is indeed the curl --max-time flag. When I was implementing the RGW probe, I remember looking into that flag and rejecting it, but I forget my rationale at present.

@BlaineEXE (Member) commented:

@kayrus what do you mean by "accumulate" here?

When there is a network hiccup, the bash scripts with a curl probe start to accumulate

Can you provide logs that show what you mean?

@kayrus (Author) commented May 2, 2024

Can you provide logs that show what you mean?

There are no logs; there are just dozens of bash and curl processes trying to access the endpoint.

@BlaineEXE (Member) commented:

How are you determining that to be true? I would like to see some kind of output as proof that this is happening. My understanding of the kubernetes probes is that when they time out, kubernetes kills the process (bash), which should also kill the curl process in the script.

@BlaineEXE (Member) commented:

The dilemma I face is thus:

  • If we add a curl timeout value, we may introduce unexpected issues for other Rook users
  • A curl timeout value also means that the script timeout differs from the timeout that users configure via the probe config
  • Introducing an interior timeout when an exterior timeout is already running seems unnecessary and only risks timing-related bugs

Because of the risk of introducing errors by making the proposed change, I want to make sure we are certain that there is an issue present. Rook has gotten the pod probes wrong in the past, and it has taken us a lot of effort to develop probes that are useful and bug-free, and to make sure that we are not implementing probes or probe features that risk causing cascading system failures.

We have also had Rook users in the past submit issues based on theory rather than on observed behavior. Since you are not providing evidence via observed behavior, I am still concerned that the reasoning behind the creation of this issue could be theoretical and not evidence-based. When users have reported theoretical issues in the past, it has led to us introducing unnecessary complexity into Rook, and it has sometimes caused bugs.

I did some in-depth investigation today, and I have findings to share.

Findings

Research

This is the information I can find on the topic of probe "accumulation" (from June 2022):
kubernetes/kubernetes#110070 (comment)

In summary, cri-o and containerd runtimes will terminate probe processes if they pass the timeout threshold, preventing process "accumulation". The dockershim runtime does not have this ability, but since dockershim is not a supported K8s configuration, k8s will not fix the behavior. Because dockershim was already an unsupported runtime 2 years ago, I don't see any compelling reason that we should consider this a Rook bug in 2024.

Testing

To confirm that there isn't a bash issue present, I looked into how the script behaves in isolation when it receives a SIGTERM signal. When the SIGTERM signal is received while http_response="$(check "$RGW_URL")" is running, I see that the curl process stops and does not remain running.
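
A rough way to repeat this isolated test (a sketch; it assumes a local RGW endpoint, and the http/8080 values below are placeholders for whatever PROBE_PROTOCOL/PROBE_PORT the script expects):

PROBE_PROTOCOL=http PROBE_PORT=8080 bash pkg/operator/ceph/object/rgw-probe.sh &
probe_pid=$!
sleep 1                        # give check() time to start its curl
kill -TERM "$probe_pid"        # signal the bash process only
sleep 1
pgrep -af curl                 # see whether a curl child is still running
# Note: delivering the signal to the whole process group instead
# (kill -TERM -- -"$probe_pid" from an interactive shell) can change the
# outcome, since only group delivery reaches the curl child directly.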

Conclusion

Because this involves liveness probes, I am taking this issue seriously. That also means that I am doing my own due diligence to be certain that Rook is doing the best thing for all its users.

And because my research and testing didn't find anything wrong, I must put the burden of proof on you to show that the issue is real. For us to move forward with the bug fix, please show us via some sort of console output that the process accumulation is definitely happening in your environment and that the container runtime is something other than dockershim.

@kayrus (Author) commented May 3, 2024

@BlaineEXE thanks a lot for your extra research. I'll try to provide more evidence next week.

@BlaineEXE (Member) commented:

Hi @kayrus I wanted to check in to see how things are progressing on your side.

I'm also happy to try to see if I can repro, if you can help me understand what you mean by "network hiccup." I'm not sure what would cause networking between a probe and pod to stall, but I might be able to simulate it if I understood more about the situation you're describing.

@kayrus (Author) commented May 15, 2024

@BlaineEXE apologies, I had other, more important issues to fix, like keystone auth.

Regarding the hanging curl processes: in our case we're using containerd, and according to the ticket you provided this problem shouldn't occur with containerd. I need some more time to reproduce the issue on our side.

@kayrus (Author) commented May 21, 2024

I reproduced the issue; here is some additional info.

By default, kubelet sets the readiness check timeout to the sum of the runtimeRequestTimeout kubelet config option and the pod's readiness timeoutSeconds (runtimeRequestTimeout + timeoutSeconds): https://github.com/kubernetes/kubernetes/blob/c0f4879164f3d52dc19e49a7419bc076c66832d2/staging/src/k8s.io/cri-client/pkg/remote_runtime.go#L491 With the config below (runtimeRequestTimeout: 5s, timeoutSeconds: 1), the exec call is therefore cancelled after about 6 seconds, while curl's default connect timeout is on the order of 2 minutes.

When that context timeout is reached, the running curl process is not killed; it keeps running until it finishes on its own. As a result you end up with a bunch of curl processes:

# ps axuww | grep '[c]url 8.8.8'
root     3394191  0.0  0.0  19520  9180 ?        S    15:46   0:00 curl 8.8.8.8
root     3394759  0.0  0.0  19520  9176 ?        S    15:46   0:00 curl 8.8.8.8
root     3395104  0.0  0.0  19520  9108 ?        S    15:46   0:00 curl 8.8.8.8
root     3395363  0.0  0.0  19520  9100 ?        S    15:46   0:00 curl 8.8.8.8
root     3395548  0.0  0.0  19520  9016 ?        S    15:46   0:00 curl 8.8.8.8
root     3395779  0.0  0.0  19520  9180 ?        S    15:47   0:00 curl 8.8.8.8
root     3396034  0.0  0.0  19520  9076 ?        S    15:47   0:00 curl 8.8.8.8
root     3396229  0.0  0.0  19520  9004 ?        S    15:47   0:00 curl 8.8.8.8
root     3396480  0.0  0.0  19520  9060 ?        S    15:47   0:00 curl 8.8.8.8
root     3396667  0.0  0.0  19520  9104 ?        S    15:47   0:00 curl 8.8.8.8
root     3396890  0.0  0.0  19520  9120 ?        S    15:47   0:00 curl 8.8.8.8
root     3397122  0.0  0.0   2576   904 ?        Ss   15:47   0:00 /bin/sh -c curl 8.8.8.8
root     3397128  0.0  0.0  19520  9096 ?        S    15:47   0:00 curl 8.8.8.8

Pod config to reproduce the issue:

apiVersion: v1
kind: Pod
metadata:
  name: sleep-pod
spec:
  nodeSelector:
    kubernetes.io/hostname: some-node
  containers:
  - name: sleeper
    image: nginx:latest
    args:
    - sleep
    - "3600"  # Sleep for 1 hour
    readinessProbe:
      exec:
        command:
        - /bin/sh
        - -c
        - curl 8.8.8.8
      initialDelaySeconds: 0
      periodSeconds: 1
      timeoutSeconds: 1
      successThreshold: 1
      failureThreshold: 1

kubelet config:

$ kubectl get --raw "/api/v1/nodes/some-node/proxy/configz" | jq | grep -i runtimeRequestTimeout
    "runtimeRequestTimeout": "5s",

P.S. Note that it's impossible to set runtimeRequestTimeout: 0s, since it's interpreted as 0 and the config annotation for this value is omitempty (https://github.com/kubernetes/kubernetes/blob/c0f4879164f3d52dc19e49a7419bc076c66832d2/staging/src/k8s.io/kubelet/config/v1beta1/types.go#L414). As a result, setting 0s falls back to the default 2m value.

@BlaineEXE (Member) commented:

Fascinating and frustrating. It looks like only the curl 8.8.8.8 commands from old exec probes are running. It appears that the /bin/sh -c ... part is killed as expected.

I did some reading and found some people mentioning that curl might not respond to signals, so perhaps this is the underlying reason why we're seeing this.
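
A quick way to check curl's signal handling locally, outside Kubernetes (a sketch, using the same unreachable address as the repro above):

curl 8.8.8.8 &
curl_pid=$!
sleep 1
kill -TERM "$curl_pid"
wait "$curl_pid"; echo "exit status: $?"   # 143 (128+15) would mean curl exited on SIGTERM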

I'd like to make sure that we also take a look at Rook's other exec probes to see if similar issues might occur, but I think it makes sense to move forward with your PR for the curl issue specifically.

@BlaineEXE (Member) commented:

Interestingly, I can't reproduce this exactly using your sample pod on minikube with the containerd runtime. The runtime request timeout is a crazy 15 minutes in the minikube config. I'm seeing <defunct> processes, though, which is similar.

I doubt this has anything to do with processor architecture (I'm on arm64).

However, it is also interesting that those defunct processes don't go away when I use --max-time. Have you verified that setting --max-time fixes the issue?

root       15017  0.0  0.0      0     0 ?        Z    17:01   0:00 [curl] <defunct>
root       15126  0.0  0.0      0     0 ?        Z    17:02   0:00 [curl] <defunct>
root       15204  0.0  0.0      0     0 ?        Z    17:02   0:00 [curl] <defunct>
root       15291  0.0  0.1  21200  9644 ?        S    17:03   0:00 curl --max-time 30 8.8.8.8

@kayrus (Author) commented May 21, 2024

I'm using Flatcar amd64 with containerd. In my case, when --max-time is lower than runtimeRequestTimeout + timeoutSeconds, curl terminates correctly. (For the Rook probe config quoted earlier, timeoutSeconds: 5 plus a 5s runtimeRequestTimeout gives a 10-second bound, so --max-time 3 sits well under it.)

@BlaineEXE (Member) commented:

I am noticing with my local testing that /bin/sh shows the issue you describe, but using bash properly kills the curl command. Rook's probes all use bash. Could you try to repro using bash?

@BlaineEXE (Member) commented:

Sorry to spam here, but I have some new info.

I am trying to test using this probe, which is conceptually most similar to how our probes are configured.

      readinessProbe:
        exec:
          command:
            - bash
            - -c
            - |
              out="$(curl 8.8.8.8)"
              echo "$out"

I am not seeing a bunch of zombies in my case, but I am seeing that the curl 8.8.8.8 process isn't killed, and K8s doesn't run another probe until the curl command times out on its own.

I believe the underlying issue here is that k8s doesn't do process reaping. It does properly send the kill signal to bash -c, but bash doesn't forward the signal to sub-processes like the curl command. Then, k8s (in my case) seems to wait until the curl command exits to start another probe process.

In your case, it seems that k8s is starting a new probe process even when the old children are still around. I can't explain the discrepancy in behavior, but in both cases, the behavior is very non-ideal.

This issue can be trivially solved by setting a curl timeout, but the more generalized issue seems to be a compound one:

  1. bash doesn't forward SIGTERM to child processes (a manual signal-forwarding workaround is sketched just after this list), and
  2. k8s doesn't have a process reaper unless the pod specifies that containers should share the PID namespace
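
A minimal sketch of the manual forwarding idea for point 1, adapted from the rgw-probe.sh check (illustrative only, not Rook's current implementation; capturing the response code would still need adjusting):

trap 'kill -TERM "$curl_pid" 2>/dev/null' TERM INT

curl --insecure --silent --output /dev/stderr \
     --write-out '%{response_code}' "$RGW_URL" &
curl_pid=$!
# wait returns early (>128) when a trapped signal arrives; the trap then
# terminates the curl child instead of leaving it behind.
wait "$curl_pid"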

In theory, one of the trivial things that Rook could do is enable container PID sharing in pods. However, in my testing, the curl process isn't getting reaped even with shareProcessNamespace: true.

Given this, I think it still makes sense to continue with your PR since it can fix the issue for curl, and I'd like to keep looking to see if there is something Rook can do more broadly to not have this situation come up for other probes.

@BlaineEXE (Member) commented May 21, 2024

@kayrus I have been researching and playing around with your example, because I really want to find a generalized way to resolve this for the rest of our Rook probes so we don't have more issues like this elsewhere.

I didn't have any luck trying to use trap 'kill -9 -- -$BASHPID' EXIT or any derivatives of that, even when I also wrapped the script in timeout 1 bash -c '...script...'.

However, I have found that this version (below) doesn't result in zombie curl commands in my env.

apiVersion: v1
kind: Pod
metadata:
  name: sleep-pod
spec:
  nodeSelector:
    kubernetes.io/hostname: minikube
  # shareProcessNamespace: true
  containers:
    - name: sleeper
      image: quay.io/ceph/ceph:v18
      args:
        - sleep
        - "3600" # Sleep for 1 hour
      readinessProbe:
        exec:
          command:
            - bash
            - -c
            - |
              set -e
              out="$(curl --max-time 30 8.8.8.8)"
              echo "$out"
        initialDelaySeconds: 0
        periodSeconds: 10
        timeoutSeconds: 5
        successThreshold: 1
        failureThreshold: 1
      resources: {}

Would you mind testing this to see if it also resolves things in your env? It seems much easier to use set -e in all the probes than to have to nitpick every single command for timeouts.

And, for what it's worth, I see the same thing if I replace the curl command with something like sleep 30, but then it's zombie sleep processes. So I think there is reason to be concerned about arbitrary commands.
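
For reference, the sleep variant mentioned above only swaps the command inside the probe body (a sketch; sleep produces no output, so the point is just the long-running child process):

set -e
out="$(sleep 30)"
echo "$out"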
