Skip to content
This repository has been archived by the owner on Oct 24, 2023. It is now read-only.

fix: validate k8s control plane before addons #3729

Merged
merged 5 commits into from
Aug 31, 2020

Conversation

jackfrancis
Copy link
Member

@jackfrancis jackfrancis commented Aug 19, 2020

Reason for Change:

This PR improves the CSE error handling due to the control plane being unable to connect to the cluster apiserver.

Specifically, this PR makes the following changes:

  • validate basic apiserver connectivity and cluster functionality (via kubectl cluster-info) before validating "init" addons, because being able to connect to the cluster apiserver endpoint is a precondition of validating "init" addons
  • remove the non-working "validate that kube-addon-manager pods" are present check, prefer the concrete check of waiting for specific init addon outcomes:
    • the arrival of PodSecurityPolicy-provided resources, if the pod-security-policy addon is enabled
    • the arrival of the kubernetes-dashboard namespace, if the kubernetes-dashboard addon is enabled
    • the arrival of the gatekeeper-system namespace, if the azure-policy addon is enabled
  • adds a new exit_cse CSE helper func which can be used to gather debug data prior to exiting the CSE script
    • this PR uses exit_cse to dump systemd kubelet logs prior to all exit paths where Kubernetes application layer logs are relevant for debugging failures

Issue Fixed:

Requirements:

Notes:

@codecov
Copy link

codecov bot commented Aug 19, 2020

Codecov Report

Merging #3729 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #3729   +/-   ##
=======================================
  Coverage   73.20%   73.20%           
=======================================
  Files         148      148           
  Lines       25372    25375    +3     
=======================================
+ Hits        18573    18576    +3     
  Misses       5663     5663           
  Partials     1136     1136           
Impacted Files Coverage Δ
pkg/engine/templates_generated.go 53.42% <ø> (ø)
pkg/engine/template_generator.go 82.19% <100.00%> (+0.07%) ⬆️
pkg/engine/virtualmachinescalesets.go 80.37% <100.00%> (ø)
pkg/engine/vmextensions.go 94.55% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6e38aa8...84e49ad. Read the comment docs.

@jackfrancis
Copy link
Member Author

As it's rather hard to test the error respons from the new exit_cse func, here's what it looks like when tested manually. For example, I messed with the k8s control plane test so it would definitely fail:

$ git diff parts/k8s/cloud-init/artifacts/cse_config.sh 
diff --git a/parts/k8s/cloud-init/artifacts/cse_config.sh b/parts/k8s/cloud-init/artifacts/cse_config.sh
index 327580cba..69319478e 100755
--- a/parts/k8s/cloud-init/artifacts/cse_config.sh
+++ b/parts/k8s/cloud-init/artifacts/cse_config.sh
@@ -468,7 +468,7 @@ ensureK8sControlPlane() {
   if [ -f /var/run/reboot-required ] || [ "$NO_OUTBOUND" = "true" ]; then
     return
   fi
-  retrycmd 120 5 25 $KUBECTL 2>/dev/null cluster-info || exit_cse $GET_KUBELET_LOGS {{GetLinuxCSELogPath}} {{GetCSEErrorCode "ERR_K8S_RUNNING_TIMEOUT"}}
+  retrycmd 3 5 25 $KUBECTL 2>/dev/null not-a-valid-command || exit_cse $GET_KUBELET_LOGS {{GetLinuxCSELogPath}} {{GetCSEErrorCode "ERR_K8S_RUNNING_TIMEOUT"}}
 }
 {{- if IsAzurePolicyAddonEnabled}}
 ensureLabelExclusionForAzurePolicyAddon() {

The expected CSE 30 exit response is preserved:

Error from deployment for kubernetes-westus2-88714 in resource group kubernetes-westus2-88714:exit status 1
2020/08/19 11:44:54 Command Output: Deployment failed. Correlation ID: b6068106-e885-469e-924e-6362fc0ff3b8. {
  "status": "Failed",
  "error": {
    "code": "ResourceDeploymentFailure",
    "message": "The resource operation completed with terminal provisioning state 'Failed'.",
    "details": [
      {
        "code": "VMExtensionProvisioningError",
        "message": "VM has reported a failure when processing extension 'cse-master-0'. Error message: \"Enable failed: failed to execute command: command terminated with exit status=30\n[stdout]\nWed Aug 19 18:42:21 UTC 2020,k8s-master-28900402-0\n\n[stderr]\n\"\r\n\r\nMore information on troubleshooting is available at https://aka.ms/VMExtensionCSELinuxTroubleshoot "
      }
    ]
  }
}

And the relevant kubelet logs are interpolated into the CSE log output at /var/log/azure/cluster-provision.log:

+ retrycmd 3 5 25 /usr/local/bin/kubectl --kubeconfig=/home/azureuser/.kube/config not-a-valid-command
Executed "/usr/local/bin/kubectl --kubeconfig=/home/azureuser/.kube/config not-a-valid-command" 3 times
+ exit_cse journalctl -u kubelet --no-pager /var/log/azure/cluster-provision.log 30
+ local output=/var/log/azure/cluster-provision.log
+ local exit_code=30
+ local 'command=journalctl -u kubelet --no-pager'
+ journalctl -u kubelet --no-pager
-- Logs begin at Wed 2020-08-19 18:41:36 UTC, end at Wed 2020-08-19 18:43:51 UTC. --
Aug 19 18:43:38 k8s-master-28900402-0 systemd[1]: Starting Kubelet...
Aug 19 18:43:38 k8s-master-28900402-0 systemd[1]: Started Kubelet.
Aug 19 18:43:38 k8s-master-28900402-0 kubelet[5625]: Flag --address has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Aug 19 18:43:38 k8s-master-28900402-0 kubelet[5625]: Flag --anonymous-auth has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
...
...
< lots more kubelet logs>
...
...
Aug 19 18:43:50 k8s-master-28900402-0 kubelet[5625]: E0819 18:43:50.338171    5625 kubelet.go:2268] node "k8s-master-28900402-0" not found
Aug 19 18:43:50 k8s-master-28900402-0 kubelet[5625]: E0819 18:43:50.438311    5625 kubelet.go:2268] node "k8s-master-28900402-0" not found
Aug 19 18:43:50 k8s-master-28900402-0 kubelet[5625]: E0819 18:43:50.538472    5625 kubelet.go:2268] node "k8s-master-28900402-0" not found
Aug 19 18:43:50 k8s-master-28900402-0 kubelet[5625]: E0819 18:43:50.638628    5625 kubelet.go:2268] node "k8s-master-28900402-0" not found
Aug 19 18:43:50 k8s-master-28900402-0 kubelet[5625]: E0819 18:43:50.738978    5625 kubelet.go:2268] node "k8s-master-28900402-0" not found
Aug 19 18:43:50 k8s-master-28900402-0 kubelet[5625]: I0819 18:43:50.782660    5625 csi_plugin.go:945] Failed to contact API server when waiting for CSINode publishing: Get https://10.255.255.5:443/apis/storage.k8s.io/v1/csinodes/k8s-master-28900402-0: dial tcp 10.255.255.5:443: connect: connection refused
Aug 19 18:43:50 k8s-master-28900402-0 kubelet[5625]: E0819 18:43:50.839129    5625 kubelet.go:2268] node "k8s-master-28900402-0" not found
Aug 19 18:43:50 k8s-master-28900402-0 kubelet[5625]: E0819 18:43:50.939285    5625 kubelet.go:2268] node "k8s-master-28900402-0" not found
Aug 19 18:43:51 k8s-master-28900402-0 kubelet[5625]: E0819 18:43:51.039498    5625 kubelet.go:2268] node "k8s-master-28900402-0" not found
Aug 19 18:43:51 k8s-master-28900402-0 kubelet[5625]: E0819 18:43:51.139650    5625 kubelet.go:2268] node "k8s-master-28900402-0" not found
+ exit 30
+ exitAndFlush
+ local tags script_end duration
++ append_default_tags code=0
++ local tags=code=0
++ join_tags code=0 k8s_version=1.18.8,network_plugin=azure,os_type=linux,cri=docker,correlation_id=6b5ac64b-60db-4d42-a619-04491869a708
++ local left=code=0 right=k8s_version=1.18.8,network_plugin=azure,os_type=linux,cri=docker,correlation_id=6b5ac64b-60db-4d42-a619-04491869a708
++ [[ -n code=0 ]]
++ [[ -n k8s_version=1.18.8,network_plugin=azure,os_type=linux,cri=docker,correlation_id=6b5ac64b-60db-4d42-a619-04491869a708 ]]
++ echo code=0,k8s_version=1.18.8,network_plugin=azure,os_type=linux,cri=docker,correlation_id=6b5ac64b-60db-4d42-a619-04491869a708
+ tags=code=0,k8s_version=1.18.8,network_plugin=azure,os_type=linux,cri=docker,correlation_id=6b5ac64b-60db-4d42-a619-04491869a708
+ [[ 0 == \0 ]]
+ trace_info cse-exit code=0,k8s_version=1.18.8,network_plugin=azure,os_type=linux,cri=docker,correlation_id=6b5ac64b-60db-4d42-a619-04491869a708
+ local name=cse-exit tags=code=0,k8s_version=1.18.8,network_plugin=azure,os_type=linux,cri=docker,correlation_id=6b5ac64b-60db-4d42-a619-04491869a708
++ append_default_tags code=0,k8s_version=1.18.8,network_plugin=azure,os_type=linux,cri=docker,correlation_id=6b5ac64b-60db-4d42-a619-04491869a708
++ local tags=code=0,k8s_version=1.18.8,network_plugin=azure,os_type=linux,cri=docker,correlation_id=6b5ac64b-60db-4d42-a619-04491869a708
++ join_tags code=0,k8s_version=1.18.8,network_plugin=azure,os_type=linux,cri=docker,correlation_id=6b5ac64b-60db-4d42-a619-04491869a708 k8s_version=1.18.8,network_plugin=azure,os_type=linux,cri=docker,correlation_id=6b5ac64b-60db-4d42-a619-04491869a708
++ local left=code=0,k8s_version=1.18.8,network_plugin=azure,os_type=linux,cri=docker,correlation_id=6b5ac64b-60db-4d42-a619-04491869a708 right=k8s_version=1.18.8,network_plugin=azure,os_type=linux,cri=docker,correlation_id=6b5ac64b-60db-4d42-a619-04491869a708
++ [[ -n code=0,k8s_version=1.18.8,network_plugin=azure,os_type=linux,cri=docker,correlation_id=6b5ac64b-60db-4d42-a619-04491869a708 ]]
++ [[ -n k8s_version=1.18.8,network_plugin=azure,os_type=linux,cri=docker,correlation_id=6b5ac64b-60db-4d42-a619-04491869a708 ]]
++ echo code=0,k8s_version=1.18.8,network_plugin=azure,os_type=linux,cri=docker,correlation_id=6b5ac64b-60db-4d42-a619-04491869a708,k8s_version=1.18.8,network_plugin=azure,os_type=linux,cri=docker,correlation_id=6b5ac64b-60db-4d42-a619-04491869a708
+ tags=code=0,k8s_version=1.18.8,network_plugin=azure,os_type=linux,cri=docker,correlation_id=6b5ac64b-60db-4d42-a619-04491869a708,k8s_version=1.18.8,network_plugin=azure,os_type=linux,cri=docker,correlation_id=6b5ac64b-60db-4d42-a619-04491869a708
+ [[ -z code=0,k8s_version=1.18.8,network_plugin=azure,os_type=linux,cri=docker,correlation_id=6b5ac64b-60db-4d42-a619-04491869a708,k8s_version=1.18.8,network_plugin=azure,os_type=linux,cri=docker,correlation_id=6b5ac64b-60db-4d42-a619-04491869a708 ]]
+ apmz trace -n cse-exit -t code=0,k8s_version=1.18.8,network_plugin=azure,os_type=linux,cri=docker,correlation_id=6b5ac64b-60db-4d42-a619-04491869a708,k8s_version=1.18.8,network_plugin=azure,os_type=linux,cri=docker,correlation_id=6b5ac64b-60db-4d42-a619-04491869a708 -l 0 -o
++ apmz time unixnano
+ script_end=1597862631167751249
++ apmz time diff -a 1597862542813553039 -b 1597862631167751249 -r sec
+ duration=88.354198
+ [[ -z k8s_version=1.18.8,network_plugin=azure,os_type=linux,cri=docker,correlation_id=6b5ac64b-60db-4d42-a619-04491869a708 ]]
+ apmz metric -n cse-duration -v 88.354198 -t k8s_version=1.18.8,network_plugin=azure,os_type=linux,cri=docker,correlation_id=6b5ac64b-60db-4d42-a619-04491869a708 -o
+ [[ -n c92d8284-b550-4b06-b7ba-e80fd7178faa ]]
+ [[ -z '' ]]
+ apmz batch -f /tmp/apmz.UJhx3X --api-keys c92d8284-b550-4b06-b7ba-e80fd7178faa
sent 27 events
+ [[ -z '' ]]
+ rm /tmp/apmz.UJhx3X

local output=${@: -2: 1}
local exit_code=${@: -1: 1}
local command=${@:1:$#-2}
$command >> $output
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to make sure that you don't care about the exit/failure in line 234 (the $command part)
Including that the $command is slow. I would think maybe backgrounding that such that it can complete separate from the exit (just a thought)

Something like: $command >> $output &

The point being that you would like this information but you don't want it to cause a nested failure.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of running in the background

@jackfrancis jackfrancis force-pushed the cse-k8s-control-plane branch 3 times, most recently from 4b8b9e8 to cc398db Compare August 25, 2020 16:03
Copy link
Member

@mboersma mboersma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@acs-bot
Copy link

acs-bot commented Aug 31, 2020

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis, mboersma

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [jackfrancis,mboersma]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jackfrancis jackfrancis merged commit 41ca9d6 into Azure:master Aug 31, 2020
@jackfrancis jackfrancis deleted the cse-k8s-control-plane branch August 31, 2020 22:16
penggu pushed a commit to penggu/aks-engine that referenced this pull request Oct 28, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants