Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve kubeadm reset #37831

Merged
merged 3 commits into from Dec 7, 2016
Merged

Conversation

luxas
Copy link
Member

@luxas luxas commented Dec 1, 2016

Depends on: #36474
Broken out from: #37568
Carries: #35709, @camilocot

This makes the kubeadm reset command more robust and user-friendly.
I'll rebase after #36474 merges...

cc-ing reviewers: @mikedanese @errordeveloper @dgoodwin @jbeda

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 1, 2016
@k8s-oncall
Copy link

This change is Reviewable

@luxas luxas added the release-note-none Denotes a PR that doesn't merit a release note. label Dec 1, 2016
@k8s-github-robot k8s-github-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 1, 2016
dockerCheck := preflight.ServiceCheck{Service: "docker"}
if warnings, errors := dockerCheck.Check(); len(warnings) == 0 && len(errors) == 0 {
fmt.Println("[reset] Stopping all running docker containers...")
if err := exec.Command("sh", "-c", "docker ps | grep 'k8s_' | awk '{print $1}' | xargs -r docker rm --force --volumes").Run(); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what would happen if non-docker container runtime used ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't support other runtimes than docker for the moment, so that's not a problem for now.
this code just moved from below

}

fmt.Println("[reset] Unmounting directories in /var/lib/kubelet...")
umountDirsCmd := "cat /proc/mounts | awk '{print $2}' | grep '/var/lib/kubelet' | xargs -r umount"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe instead of embedding "shell" script here, write small function to call umount individually and report error on particular path ?

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 2, 2016
Copy link
Contributor

@dgoodwin dgoodwin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything else looks good to me Lucas. New output is improved and I see we're hitting /etc/cni/net.d now which is probably good.

@@ -45,7 +46,7 @@ func NewCmdReset(out io.Writer) *cobra.Command {

cmd.PersistentFlags().BoolVar(
&skipPreFlight, "skip-preflight-checks", false,
"skip preflight checks normally run before modifying the system",
"preflight checks normally run before modifying the system, pass this argument to skip those checks",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally prefer the original version, it reads just as clearly to me and is more concise which I like for CLI help options.

Either way though we should capitalize first letter, this is inconsistent with all the other init --help options right now.

@k8s-ci-robot
Copy link
Contributor

Jenkins GCI GKE smoke e2e failed for commit bf6b95447c1355cec93044bd849b0ca108b7bf5e. Full PR test history.

The magic incantation to run this job again is @k8s-bot gci gke e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@k8s-ci-robot
Copy link
Contributor

Jenkins GCE Node e2e failed for commit bf6b95447c1355cec93044bd849b0ca108b7bf5e. Full PR test history.

The magic incantation to run this job again is @k8s-bot node e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@k8s-ci-robot
Copy link
Contributor

Jenkins unit/integration failed for commit bf6b95447c1355cec93044bd849b0ca108b7bf5e. Full PR test history.

The magic incantation to run this job again is @k8s-bot unit test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@k8s-ci-robot
Copy link
Contributor

Jenkins Kubemark GCE e2e failed for commit bf6b95447c1355cec93044bd849b0ca108b7bf5e. Full PR test history.

The magic incantation to run this job again is @k8s-bot kubemark e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@k8s-ci-robot
Copy link
Contributor

Jenkins CRI GCE Node e2e failed for commit bf6b95447c1355cec93044bd849b0ca108b7bf5e. Full PR test history.

The magic incantation to run this job again is @k8s-bot cri node e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@k8s-ci-robot
Copy link
Contributor

Jenkins GCI GCE e2e failed for commit bf6b95447c1355cec93044bd849b0ca108b7bf5e. Full PR test history.

The magic incantation to run this job again is @k8s-bot gci gce e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@luxas
Copy link
Member Author

luxas commented Dec 4, 2016

@dgoodwin When it looks good to you, you should simply be able to comment "/lgtm" and the bot will apply the label.
Also included a third commit, which makes reset drain the node and remove it from the cluster.

@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 4, 2016
@k8s-ci-robot
Copy link
Contributor

Jenkins GKE smoke e2e failed for commit d350e98965768598ee58688ba7f53ded56733031. Full PR test history.

The magic incantation to run this job again is @k8s-bot cvm gke e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@k8s-ci-robot
Copy link
Contributor

Jenkins GCE etcd3 e2e failed for commit d350e98965768598ee58688ba7f53ded56733031. Full PR test history.

The magic incantation to run this job again is @k8s-bot gce etcd3 e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@k8s-ci-robot
Copy link
Contributor

Jenkins kops AWS e2e failed for commit 65edcc0. Full PR test history.

The magic incantation to run this job again is @k8s-bot kops aws e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@k8s-ci-robot
Copy link
Contributor

Jenkins GCE e2e failed for commit 65edcc0. Full PR test history.

The magic incantation to run this job again is @k8s-bot cvm gce e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

@luxas
Copy link
Member Author

luxas commented Dec 5, 2016

@k8s-bot cvm gce e2e test this

Copy link
Contributor

@dgoodwin dgoodwin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some text change requsts.

@@ -48,118 +49,167 @@ func NewCmdReset(out io.Writer) *cobra.Command {
"skip preflight checks normally run before modifying the system",
)

cmd.PersistentFlags().BoolVar(
&removeNode, "remove-node", true,
"remove this node from the pool of nodes in this cluster",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to standardize here but it looks like the winner so far is to capitalize "Remove".


func drainNode(removeNode bool) error {

fmt.Println("[reset] Draining this node...")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would avoid using "this" here, doesn't read quite right. How about we move this below the hostname check, and if it passes output "Draining node: hostname".

}

if removeNode {
fmt.Println("[reset] Removing this node from the cluster...")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per above I would suggest "Removing node: hostname" or just "Removing node..."

if err != nil {
fmt.Printf("failed to remove file: [%v]\n", err)
}
fmt.Printf("[reset] Failed to drain node: [%v]\n", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably anticipate this reset will be run on a lot of nodes that never made it to the cluster, or have since been deleted by other means, so for the text here because the method might fail to drain, or to remove, we shouldn't output "drain" explicitly here.

How about "Failed to cleanup node: [%v]" and let the inner error message indicate if it was drain or remove that went wrong.

defer d.Close()
names, err := d.Readdirnames(-1)
// Drain and maybe remove the node from the cluster
err := drainNode(r.removeNode)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drainAndRemoveNode might be a little more clear.

fmt.Println("[reset] Draining this node...")
hostname, err := os.Hostname()
if err != nil {
return fmt.Errorf("failed to detect the hostname of this node. Won't drain this node and remove it from the cluster.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest just "failed to detect node hostname".

kubeConfigPath := path.Join(kubeadmapi.GlobalEnvParams.KubernetesDir, "kubelet.conf")
_, err = exec.Command("kubectl", "--kubeconfig", kubeConfigPath, "drain", hostname, "--delete-local-data", "--force", "--ignore-daemonsets").Output()
if err != nil {
return fmt.Errorf("failed to drain this node [%v]", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest dropping "this".


_, err = exec.Command("kubectl", "--kubeconfig", kubeConfigPath, "delete", "node", hostname).Output()
if err != nil {
return fmt.Errorf("failed to remove this node [%v]", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest dropping "this" here as well.

@luxas
Copy link
Member Author

luxas commented Dec 5, 2016

@dgoodwin Done, I think I fixed all your output nits.
PTAL

@k8s-ci-robot
Copy link
Contributor

Jenkins verification failed for commit 6fa116b3df801784de5f412fb07ac0fce1f35896. Full PR test history.

The magic incantation to run this job again is @k8s-bot verify test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

…de should be removed from the cluster as well

fmt.Printf("[reset] Draining node: %q\n", hostname)

output, err = exec.Command("kubectl", "--kubeconfig", kubeConfigPath, "drain", hostname, "--delete-local-data", "--force", "--ignore-daemonsets").Output()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to add HttpProxyCheck() in RunResetCheck(). but we need somehow get API URL which written in kubelet.conf.
Also, FileAvailableCheck{Path: "/etc/kubernetes/kubelet.conf"} needed there if we want to use kubectl here.
And InPathCheck{executable: "kubectl", mandatory: true}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would these be covered sufficiently by the err handling?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I don't want to add those checks is that the drain/remove functions aren't critical in any way, if it doesn't work, it just won't do the drain and remove things.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checks can be done optional based on state of "remove-node". if user specifies flag, he/she assumes that it would be done gracefully.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still, it will try to drain regardless of remove-node...
At least, this is totally fine for the next stable release, it improves and fixes lots of things

The TODO I added was to switch over to the go native client, but we're gonna do that in time for v1.6 when we have more time and are confident that the versioned go client (client-go) works properly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment applicable to both invocations of kubectl, before client-go used.
It doesn't make sense to fail with debug message if we already in the beginning can warn user about potential issue. That's the whole reason for prechecks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This situation is very different with reset however, we're expecting that the system could easily have failed to cluster up, so the drain operation may well fail and we intend to keep on going. Provided the error messages resulting from attempting to drain are logical, it's probably ok to let the error for whatever happens bubble up, as we're going to proceed either way.

That said, @luxas what happens on pure nodes here? Is kubectl configured there? Does draining only work on masters?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works on nodes as well, otherwise I wouldn't have added it here...
The /etc/kubernetes/kubelet.conf file exists everywhere when using kubeadm, and kubectl is installed everywhere, so when going the opinionated route with debs and rpms, it just "should work (tm)" 😄

@dgoodwin
Copy link
Contributor

dgoodwin commented Dec 6, 2016

/LGTM

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 6, 2016
@luxas
Copy link
Member Author

luxas commented Dec 6, 2016

Thanks @dgoodwin!

@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 38194, 37594, 38123, 37831, 37084)

@k8s-github-robot k8s-github-robot merged commit 4eb4777 into kubernetes:master Dec 7, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants