fix: reduce etcd download retries to avoid timeouts #437
@@ -19,7 +19,7 @@ installEtcd() {
 if [[ "$CURRENT_VERSION" == "${ETCD_VERSION}" ]]; then
 echo "etcd version ${ETCD_VERSION} is already installed, skipping download"
 else
-retrycmd_get_tarball 360 10 /tmp/etcd-v${ETCD_VERSION}-linux-amd64.tar.gz ${ETCD_DOWNLOAD_URL}/etcd-v${ETCD_VERSION}-linux-amd64.tar.gz || exit $ERR_ETCD_DOWNLOAD_TIMEOUT
+retrycmd_get_tarball 120 5 /tmp/etcd-v${ETCD_VERSION}-linux-amd64.tar.gz ${ETCD_DOWNLOAD_URL}/etcd-v${ETCD_VERSION}-linux-amd64.tar.gz || exit $ERR_ETCD_DOWNLOAD_TIMEOUT
You feel good about reducing the timeout to 5 seconds here?
That number is the sleep between retries: https://github.com/Azure/aks-engine/blob/master/parts/k8s/kubernetesprovisionsource.sh#L88
The timeout of each retry is hardcoded in retrycmd_get_tarball as 60 seconds.
A 5-second sleep is the same value we use for other retrycmd_get_tarball calls (e.g., Azure CNI). I don't think sleeping 10s between failures is necessary.
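For readers who don't have the helper open, here is a minimal sketch of the shape described above: the first argument is the retry count, the second is the sleep between attempts, and each download attempt is capped by a fixed timeout. The function name `retry_get_tarball_sketch` and the `PER_TRY_TIMEOUT` variable are hypothetical, introduced only for illustration; this is not the actual `retrycmd_get_tarball` body, which may differ in detail.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, NOT the aks-engine implementation.
# Usage: retry_get_tarball_sketch <retries> <sleep_seconds> <tarball_path> <url>
retry_get_tarball_sketch() {
    local retries=$1 wait_sleep=$2 tarball=$3 url=$4
    # The thread says the per-attempt timeout is hardcoded to 60 seconds;
    # PER_TRY_TIMEOUT is an assumption added so the value can be overridden.
    local per_try_timeout=${PER_TRY_TIMEOUT:-60}
    local i
    for i in $(seq 1 "$retries"); do
        # Stop as soon as a readable gzip tarball is on disk.
        if tar -tzf "$tarball" >/dev/null 2>&1; then
            return 0
        fi
        # Each download attempt is capped by the per-attempt timeout.
        timeout "$per_try_timeout" curl -fsSL "$url" -o "$tarball"
        # Sleep between attempts, but not after the last one.
        if [ "$i" -lt "$retries" ]; then
            sleep "$wait_sleep"
        fi
    done
    # Final check: succeed only if the last attempt produced a valid tarball.
    tar -tzf "$tarball" >/dev/null 2>&1
}
```

With this shape, changing the second argument from 10 to 5 only shortens the pause between failed attempts; the per-attempt timeout is untouched.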
Got it, thanks!
Codecov Report
@@ Coverage Diff @@
## master #437 +/- ##
=======================================
Coverage 53.38% 53.38%
=======================================
Files 95 95
Lines 14374 14374
=======================================
Hits 7674 7674
Misses 6037 6037
Partials 663 663
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: CecileRobertMichon, jackfrancis
Reason for Change:
I've noticed some cluster deployments fail because etcd fails to download (no outbound connectivity on the VM), but the CSE times out instead of returning an error code because the etcd retries take too long. This change reduces the retries from 360 to 120 and the sleep between retries from 10s to 5s.
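As a rough illustration of why the old arguments could outlast the extension, the sketch below computes an upper bound on how long the retry loop can run. It assumes every attempt uses the full 60-second timeout mentioned in the review thread and is followed by the sleep; the helper name `worst_case_seconds` is hypothetical. Passing 0 as the timeout gives the other extreme, where every attempt fails immediately and only the sleeps add up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: upper bound on total retry time, in seconds.
# Usage: worst_case_seconds <retries> <per_try_timeout> <sleep_seconds>
worst_case_seconds() {
    local retries=$1 per_try_timeout=$2 wait_sleep=$3
    echo $(( retries * (per_try_timeout + wait_sleep) ))
}

worst_case_seconds 360 60 10   # old arguments, every attempt times out: 25200 (7 h)
worst_case_seconds 120 60 5    # new arguments, every attempt times out: 7800 (2 h 10 min)
worst_case_seconds 360 0 10    # old arguments, attempts fail instantly: 3600 (1 h)
worst_case_seconds 120 0 5     # new arguments, attempts fail instantly: 600 (10 min)
```

Where a real deployment falls between these bounds depends on how quickly curl gives up when the VM has no outbound connectivity.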
Issue Fixed:
Requirements:
Notes: