This repository has been archived by the owner on Oct 24, 2023. It is now read-only.

fix: reduce etcd download retries to avoid timeouts #437

Merged
merged 1 commit into from Feb 5, 2019

Conversation

CecileRobertMichon
Contributor

@CecileRobertMichon CecileRobertMichon commented Feb 4, 2019

Reason for Change:

I've noticed some cluster deployments fail because etcd fails to download (no outbound connectivity on the VM), but the CSE times out instead of returning an error code because the etcd download retries take too long. This change reduces the retries from 360 to 120 and the sleep between retries from 10s to 5s.
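To make the effect of the change concrete, here is a rough back-of-the-envelope sketch of the worst-case retry budget. It assumes every attempt runs for the full 60-second per-attempt timeout mentioned in the review thread and is followed by the sleep, so the real upper bound may differ slightly. The function name `worst_case_seconds` is made up for illustration and is not part of aks-engine.

```shell
#!/usr/bin/env bash
# Hypothetical helper: upper bound on time spent retrying, assuming each
# attempt uses its full timeout and is followed by one sleep.
worst_case_seconds() {
    local retries=$1 timeout_s=$2 sleep_s=$3
    echo $(( retries * (timeout_s + sleep_s) ))
}

before=$(worst_case_seconds 360 60 10)   # old values: 360 retries, 10s sleep
after=$(worst_case_seconds 120 60 5)     # new values: 120 retries, 5s sleep

echo "before: ${before}s (~$(( before / 60 )) min)"   # before: 25200s (~420 min)
echo "after:  ${after}s (~$(( after / 60 )) min)"     # after:  7800s (~130 min)
```

Under these assumptions the bound drops from about 7 hours to a little over 2 hours, which is the direction the change is aiming for: the download step should give up and return its error code sooner.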

Issue Fixed:

Requirements:

Notes:

@@ -19,7 +19,7 @@ installEtcd() {
     if [[ "$CURRENT_VERSION" == "${ETCD_VERSION}" ]]; then
         echo "etcd version ${ETCD_VERSION} is already installed, skipping download"
     else
-        retrycmd_get_tarball 360 10 /tmp/etcd-v${ETCD_VERSION}-linux-amd64.tar.gz ${ETCD_DOWNLOAD_URL}/etcd-v${ETCD_VERSION}-linux-amd64.tar.gz || exit $ERR_ETCD_DOWNLOAD_TIMEOUT
+        retrycmd_get_tarball 120 5 /tmp/etcd-v${ETCD_VERSION}-linux-amd64.tar.gz ${ETCD_DOWNLOAD_URL}/etcd-v${ETCD_VERSION}-linux-amd64.tar.gz || exit $ERR_ETCD_DOWNLOAD_TIMEOUT
Member

You feel good about reducing the timeout to 5 seconds here?

Contributor Author

That number is the sleep between retries: https://github.com/Azure/aks-engine/blob/master/parts/k8s/kubernetesprovisionsource.sh#L88

The timeout of each retry is hardcoded in retrycmd_get_tarball as 60 seconds.

A 5-second sleep is the same number we use for other retrycmd_get_tarball calls (e.g., Azure CNI). I don't think sleeping 10s between failures is necessary.
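For readers unfamiliar with the pattern, the distinction between the per-attempt timeout and the sleep between retries can be sketched roughly as follows. This is a minimal illustration, not the actual retrycmd_get_tarball implementation: the name `retry_with_sleep` and its argument order are assumptions, and it relies on GNU coreutils `timeout` being available.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a retry loop with a per-attempt timeout and a
# separate sleep between failed attempts. Not the aks-engine code.
#   $1 = max attempts, $2 = sleep between attempts (s),
#   $3 = per-attempt timeout (s), remaining args = command to run
retry_with_sleep() {
    local retries=$1 sleep_s=$2 timeout_s=$3
    shift 3
    local i
    for i in $(seq 1 "$retries"); do
        # timeout(1) stops the command if it runs longer than timeout_s seconds
        if timeout "$timeout_s" "$@"; then
            return 0
        fi
        # Sleep only between attempts, not after the final failure
        if [ "$i" -lt "$retries" ]; then
            sleep "$sleep_s"
        fi
    done
    return 1
}

# Example call shape (commented out because it needs network access; the
# variables url and tarball are placeholders):
# retry_with_sleep 120 5 60 curl -fsSL "$url" -o "$tarball" || exit 1
```

In this sketch the second argument only controls how long the loop waits after a failed attempt, while the third bounds each attempt, which mirrors the point made in the reply above. Note that `timeout` runs external commands, so a shell function cannot be passed as the command.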

Member

Got it, thanks!

@codecov

codecov bot commented Feb 5, 2019

Codecov Report

Merging #437 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #437   +/-   ##
=======================================
  Coverage   53.38%   53.38%           
=======================================
  Files          95       95           
  Lines       14374    14374           
=======================================
  Hits         7674     7674           
  Misses       6037     6037           
  Partials      663      663

@jackfrancis
Member

/lgtm

@acs-bot

acs-bot commented Feb 5, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon, jackfrancis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [CecileRobertMichon,jackfrancis]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jackfrancis jackfrancis merged commit 99a265d into Azure:master Feb 5, 2019
juhacket pushed a commit to juhacket/aks-engine that referenced this pull request Mar 14, 2019
@CecileRobertMichon CecileRobertMichon deleted the fix-etcd-timeout branch April 18, 2019 22:42