Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

object: introduce a "curl --max-time 3" in a rgw readiness probe #14144

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

kayrus
Copy link

@kayrus kayrus commented Apr 30, 2024

Issue resolved by this Pull Request:
Resolves #14143

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Reviewed the developer guide on Submitting a Pull Request
  • Pending release notes updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

Copy link
Member

@parth-gr parth-gr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# --insecure - don't validate ssl if using secure port only
# --silent - don't output progress info
# --output /dev/stderr - output HTML header to stdout (good for debugging)
# --write-out '%{response_code}' - print the HTTP response code to stdout
curl --insecure --silent --output /dev/stderr --write-out '%{response_code}' "$URL"
curl --max-time 3 --insecure --silent --output /dev/stderr --write-out '%{response_code}' "$URL"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why 3 sec?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the healthcheck itself has default timeoutSeconds: 5. 3 seconds is enough to detect the network problem and stop curl process right away + some more secs for bash script to process the script when the node is not responsive.

Copy link
Member

@BlaineEXE BlaineEXE May 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bash script itself should be able to process the curl result in under a second, so I'm not too worried about the trivial extra time there.

The default curl timeout is more than 2 minutes and may cause
a number of abandeoned curl processes within a network hiccup.
This fix limits the curl to check the endpoint within 3 seconds.

Signed-off-by: kayrus <kay.diam@gmail.com>
@kayrus kayrus changed the title Introduce a "curl --max-time 3" in a rgw readiness probe object: introduce a "curl --max-time 3" in a rgw readiness probe Apr 30, 2024
Copy link
Member

@travisn travisn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the object.yaml the user can configure the readiness and startup probe timeouts. For example, they might specify:

  healthCheck:
    # Configure the pod probes for the rgw daemon
    startupProbe:
      disabled: false
      probe:
        timeoutSeconds: 3
    readinessProbe:
      disabled: false
      probe:
        timeoutSeconds: 3

How about implementing this so the curl max-time is set to the timeoutSeconds if set in the probe? And at the same time we could also have a smart default.

Copy link
Member

@BlaineEXE BlaineEXE left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to be sure that the best solution is indeed the curl --max-time flag. When I was implementing the RGW probe, I remember looking into that flag and rejecting it, but I forget my rationale at present.

# --insecure - don't validate ssl if using secure port only
# --silent - don't output progress info
# --output /dev/stderr - output HTML header to stdout (good for debugging)
# --write-out '%{response_code}' - print the HTTP response code to stdout
curl --insecure --silent --output /dev/stderr --write-out '%{response_code}' "$URL"
curl --max-time 3 --insecure --silent --output /dev/stderr --write-out '%{response_code}' "$URL"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to parameterize the --max-time value instead of hard-coding it. When we have had hard-coded things like this in previous probes, it has led to situations where some users are encountering an error state that they are unable to work around.

Let's add a PROBE_TIMEOUT="{{ .Timeout }} value to the beginning of this script and use --max-time "$PROBE_TIMEOUT". Because the bash portion of the script should process in under a second, I'm not worried about a few hundred milliseconds of time where something extra may be running.

This will require passing the configured probe timeout to the template config in .go code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget to update the commit title and body, and the PR description to reflect that the probe is no longer hard-coded to 3 seconds.

@@ -17,11 +17,12 @@ RGW_URL="$PROBE_PROTOCOL://0.0.0.0:$PROBE_PORT"

function check() {
local URL="$1"
# --max-time 3 - override the default 2 minutes wait time down to 3 seconds
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to make sure we capture the reason why we are using this flag in this case, instead of just stating what the flag does.

Suggested change
# --max-time 3 - override the default 2 minutes wait time down to 3 seconds
# --max-time - curl doesn't always respond to SIGTERM on timeout and can become zombie; make sure it times out

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add curl timeout to the rgw-probe.sh healtchcheck script
4 participants