Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Watch node object for podCIDR instead of crash + wait for restart #329

Merged
merged 1 commit into from
May 27, 2024

Conversation

jingyuanliang
Copy link
Member

@jingyuanliang jingyuanliang commented May 24, 2024

This can save up to 10 seconds before Kubelet decides to restart the pod after the crash.

Also made some logging improvements.

/assign @marqc
/assign @sypakine
/hold

@jingyuanliang jingyuanliang changed the title Watch node object for podCIDR instead of crashing and wait for restart Watch node object for podCIDR instead of crash + wait for restart May 24, 2024
@google-oss-prow google-oss-prow bot added size/L and removed size/S labels May 24, 2024
host=${KUBERNETES_SERVICE_HOST}
# If host contains a colon (:), it is an IPv6 address, hence needs wrapping
# with [..].
if [[ "${host}" =~ : ]]; then
host="[$host]"
fi
node_url="https://$host:${KUBERNETES_SERVICE_PORT}/api/v1/nodes/${HOSTNAME}"
response=$(curl -k -s -H "Authorization: Bearer $token" "$node_url")
node_url="https://$host:${KUBERNETES_SERVICE_PORT}/api/v1/watch/nodes/${HOSTNAME}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The /api/v1/watch/nodes path is deprecated:

watch changes to an object of kind Node. deprecated: use the 'watch' parameter with a list operation instead, filtered to a single item with the 'fieldSelector' parameter.

https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.26/#node-v1-core

Please use the recommended approach for the list with fieldSelector for metadata.name field equal to the hostname.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

response=$(curl -k -s -H "Authorization: Bearer $token" "$node_url")
node_url="https://$host:${KUBERNETES_SERVICE_PORT}/api/v1/watch/nodes/${HOSTNAME}"
for _ in {1..3}; do
response=$(grep -m1 . <(curl -k -s -N -H "Authorization: Bearer $(</var/run/secrets/kubernetes.io/serviceaccount/token)" "$node_url" | jq --unbuffered -c '.object | select(.spec.podCIDR != null)'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no timeout specified, please specify the timeout explicitly using the timeoutSeconds request param to avoid the forever hanging call.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally you can add the limit=1 parameter to end the call as soon as the first update on the node object gets returned.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Timeout added, and no, we can't end as soon as the first update, in case the first update is not adding podCIDR.

grep -m1 . <(...) can return after receiving the first line.

node_url="https://$host:${KUBERNETES_SERVICE_PORT}/api/v1/nodes/${HOSTNAME}"
response=$(curl -k -s -H "Authorization: Bearer $token" "$node_url")
node_url="https://$host:${KUBERNETES_SERVICE_PORT}/api/v1/watch/nodes/${HOSTNAME}"
for _ in {1..3}; do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: 1..3 is a magic number. make it a named const variable that says this is the number of retries.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

fatal "Could not successfully watch node and wait for podCIDR."
fi
log "Node object fetched from $node_url"
log "${response}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole block may be extracted into the bash function for better readability, i.e. get_k8s_node_descriptor. And additionally the variable name can be changed to something more meaningful than $response

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The response is somehow used throughout the file... don't bother to change it for now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it's not too many. Let me change it and get the opportunity to change and poor styles along the way.

# for kubelet to retry the whole container if response is still not fetched.
fetch_node_object 3 20

if [[ -z "${node_object:-}" ]]; then
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this check and the following 2 log entries can be part of the fetch_node_object function as well

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified a bit, but I still like to leave the full dump outside.

@google-oss-prow google-oss-prow bot added the lgtm label May 24, 2024
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jingyuanliang, marqc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jingyuanliang
Copy link
Member Author

/hold cancel

This can save up to about 10 seconds (initial CrashLoopBackOff delay)
before Kubelet decides to restart the pod after the crash.
@marqc
Copy link
Member

marqc commented May 27, 2024

/lgtm

@google-oss-prow google-oss-prow bot added the lgtm label May 27, 2024
@google-oss-prow google-oss-prow bot merged commit 929f2bc into GoogleCloudPlatform:master May 27, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants