New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve kubeadm reset #37831
Improve kubeadm reset #37831
Conversation
dockerCheck := preflight.ServiceCheck{Service: "docker"} | ||
if warnings, errors := dockerCheck.Check(); len(warnings) == 0 && len(errors) == 0 { | ||
fmt.Println("[reset] Stopping all running docker containers...") | ||
if err := exec.Command("sh", "-c", "docker ps | grep 'k8s_' | awk '{print $1}' | xargs -r docker rm --force --volumes").Run(); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what would happen if non-docker container runtime used ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't support other runtimes than docker for the moment, so that's not a problem for now.
this code just moved from below
} | ||
|
||
fmt.Println("[reset] Unmounting directories in /var/lib/kubelet...") | ||
umountDirsCmd := "cat /proc/mounts | awk '{print $2}' | grep '/var/lib/kubelet' | xargs -r umount" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe instead of embedding "shell" script here, write small function to call umount individually and report error on particular path ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything else looks good to me Lucas. New output is improved and I see we're hitting /etc/cni/net.d now which is probably good.
@@ -45,7 +46,7 @@ func NewCmdReset(out io.Writer) *cobra.Command { | |||
|
|||
cmd.PersistentFlags().BoolVar( | |||
&skipPreFlight, "skip-preflight-checks", false, | |||
"skip preflight checks normally run before modifying the system", | |||
"preflight checks normally run before modifying the system, pass this argument to skip those checks", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally prefer the original version, it reads just as clearly to me and is more concise which I like for CLI help options.
Either way though we should capitalize first letter, this is inconsistent with all the other init --help options right now.
Jenkins GCI GKE smoke e2e failed for commit bf6b95447c1355cec93044bd849b0ca108b7bf5e. Full PR test history. The magic incantation to run this job again is |
Jenkins GCE Node e2e failed for commit bf6b95447c1355cec93044bd849b0ca108b7bf5e. Full PR test history. The magic incantation to run this job again is |
Jenkins unit/integration failed for commit bf6b95447c1355cec93044bd849b0ca108b7bf5e. Full PR test history. The magic incantation to run this job again is |
Jenkins Kubemark GCE e2e failed for commit bf6b95447c1355cec93044bd849b0ca108b7bf5e. Full PR test history. The magic incantation to run this job again is |
Jenkins CRI GCE Node e2e failed for commit bf6b95447c1355cec93044bd849b0ca108b7bf5e. Full PR test history. The magic incantation to run this job again is |
Jenkins GCI GCE e2e failed for commit bf6b95447c1355cec93044bd849b0ca108b7bf5e. Full PR test history. The magic incantation to run this job again is |
@dgoodwin When it looks good to you, you should simply be able to comment "/lgtm" and the bot will apply the label. |
Jenkins GKE smoke e2e failed for commit d350e98965768598ee58688ba7f53ded56733031. Full PR test history. The magic incantation to run this job again is |
Jenkins GCE etcd3 e2e failed for commit d350e98965768598ee58688ba7f53ded56733031. Full PR test history. The magic incantation to run this job again is |
Jenkins kops AWS e2e failed for commit 65edcc0. Full PR test history. The magic incantation to run this job again is |
Jenkins GCE e2e failed for commit 65edcc0. Full PR test history. The magic incantation to run this job again is |
@k8s-bot cvm gce e2e test this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some text change requsts.
@@ -48,118 +49,167 @@ func NewCmdReset(out io.Writer) *cobra.Command { | |||
"skip preflight checks normally run before modifying the system", | |||
) | |||
|
|||
cmd.PersistentFlags().BoolVar( | |||
&removeNode, "remove-node", true, | |||
"remove this node from the pool of nodes in this cluster", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to standardize here but it looks like the winner so far is to capitalize "Remove".
|
||
func drainNode(removeNode bool) error { | ||
|
||
fmt.Println("[reset] Draining this node...") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would avoid using "this" here, doesn't read quite right. How about we move this below the hostname check, and if it passes output "Draining node: hostname".
} | ||
|
||
if removeNode { | ||
fmt.Println("[reset] Removing this node from the cluster...") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per above I would suggest "Removing node: hostname" or just "Removing node..."
if err != nil { | ||
fmt.Printf("failed to remove file: [%v]\n", err) | ||
} | ||
fmt.Printf("[reset] Failed to drain node: [%v]\n", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably anticipate this reset will be run on a lot of nodes that never made it to the cluster, or have since been deleted by other means, so for the text here because the method might fail to drain, or to remove, we shouldn't output "drain" explicitly here.
How about "Failed to cleanup node: [%v]" and let the inner error message indicate if it was drain or remove that went wrong.
defer d.Close() | ||
names, err := d.Readdirnames(-1) | ||
// Drain and maybe remove the node from the cluster | ||
err := drainNode(r.removeNode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
drainAndRemoveNode might be a little more clear.
fmt.Println("[reset] Draining this node...") | ||
hostname, err := os.Hostname() | ||
if err != nil { | ||
return fmt.Errorf("failed to detect the hostname of this node. Won't drain this node and remove it from the cluster.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest just "failed to detect node hostname".
kubeConfigPath := path.Join(kubeadmapi.GlobalEnvParams.KubernetesDir, "kubelet.conf") | ||
_, err = exec.Command("kubectl", "--kubeconfig", kubeConfigPath, "drain", hostname, "--delete-local-data", "--force", "--ignore-daemonsets").Output() | ||
if err != nil { | ||
return fmt.Errorf("failed to drain this node [%v]", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest dropping "this".
|
||
_, err = exec.Command("kubectl", "--kubeconfig", kubeConfigPath, "delete", "node", hostname).Output() | ||
if err != nil { | ||
return fmt.Errorf("failed to remove this node [%v]", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest dropping "this" here as well.
@dgoodwin Done, I think I fixed all your output nits. |
…p the wrong config on the next kubeadm init run
… user-friendly output
Jenkins verification failed for commit 6fa116b3df801784de5f412fb07ac0fce1f35896. Full PR test history. The magic incantation to run this job again is |
…de should be removed from the cluster as well
|
||
fmt.Printf("[reset] Draining node: %q\n", hostname) | ||
|
||
output, err = exec.Command("kubectl", "--kubeconfig", kubeConfigPath, "drain", hostname, "--delete-local-data", "--force", "--ignore-daemonsets").Output() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to add HttpProxyCheck() in RunResetCheck(). but we need somehow get API URL which written in kubelet.conf.
Also, FileAvailableCheck{Path: "/etc/kubernetes/kubelet.conf"}
needed there if we want to use kubectl here.
And InPathCheck{executable: "kubectl", mandatory: true}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would these be covered sufficiently by the err handling?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I don't want to add those checks is that the drain/remove functions aren't critical in any way, if it doesn't work, it just won't do the drain and remove things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checks can be done optional based on state of "remove-node". if user specifies flag, he/she assumes that it would be done gracefully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still, it will try to drain regardless of remove-node
...
At least, this is totally fine for the next stable release, it improves and fixes lots of things
The TODO I added was to switch over to the go native client, but we're gonna do that in time for v1.6 when we have more time and are confident that the versioned go client (client-go
) works properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment applicable to both invocations of kubectl, before client-go used.
It doesn't make sense to fail with debug message if we already in the beginning can warn user about potential issue. That's the whole reason for prechecks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This situation is very different with reset however, we're expecting that the system could easily have failed to cluster up, so the drain operation may well fail and we intend to keep on going. Provided the error messages resulting from attempting to drain are logical, it's probably ok to let the error for whatever happens bubble up, as we're going to proceed either way.
That said, @luxas what happens on pure nodes here? Is kubectl configured there? Does draining only work on masters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It works on nodes as well, otherwise I wouldn't have added it here...
The /etc/kubernetes/kubelet.conf
file exists everywhere when using kubeadm, and kubectl is installed everywhere, so when going the opinionated route with debs and rpms, it just "should work (tm)" 😄
/LGTM |
Thanks @dgoodwin! |
Automatic merge from submit-queue (batch tested with PRs 38194, 37594, 38123, 37831, 37084) |
Depends on: #36474
Broken out from: #37568
Carries: #35709, @camilocot
This makes the
kubeadm reset
command more robust and user-friendly.I'll rebase after #36474 merges...
cc-ing reviewers: @mikedanese @errordeveloper @dgoodwin @jbeda