
e2e: Reboot each node by triggering kernel panic and ensure they function upon restart flaky #15498

Closed
dchen1107 opened this issue Oct 12, 2015 · 14 comments
Labels
area/test kind/flake Categorizes issue or PR as related to a flaky test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/node Categorizes an issue or PR as relevant to SIG Node.

@dchen1107
Member

Jenkins project kubernetes-e2e-gce-reboot has failed the last couple of runs:

Reboot each node by triggering kernel panic and ensure they function upon restart
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/reboot.go:77
Oct 12 13:13:46.645: Node e2e-reboot-minion-iqqf failed reboot test.
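
For context on what this test family actually does to a node: the kernel-panic variant schedules a delayed sysrq crash over SSH, then waits for the node to drop off and come back Ready. The standalone Go sketch below shows that pattern; the exact command string, user, key path, and address are illustrative assumptions, not copied from reboot.go.

    package main

    import (
        "fmt"
        "log"
        "os"

        "golang.org/x/crypto/ssh"
    )

    // Delayed sysrq crash: the sleep and nohup let the SSH session return
    // cleanly before the kernel dies, so the caller doesn't hang.
    const kernelPanicCmd = "nohup sudo sh -c 'sleep 10 && " +
        "echo 1 > /proc/sys/kernel/sysrq && " + // make sure sysrq is enabled
        "echo c > /proc/sysrq-trigger' >/dev/null 2>&1 &"

    func main() {
        key, err := os.ReadFile("/path/to/ssh_key") // placeholder key path
        if err != nil {
            log.Fatal(err)
        }
        signer, err := ssh.ParsePrivateKey(key)
        if err != nil {
            log.Fatal(err)
        }
        cfg := &ssh.ClientConfig{
            User:            "jenkins", // the user seen in the SSH errors later in this thread
            Auth:            []ssh.AuthMethod{ssh.PublicKeys(signer)},
            HostKeyCallback: ssh.InsecureIgnoreHostKey(), // acceptable for a throwaway test VM
        }
        client, err := ssh.Dial("tcp", "10.0.0.1:22", cfg) // placeholder node address
        if err != nil {
            log.Fatal(err)
        }
        defer client.Close()

        session, err := client.NewSession()
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()

        if err := session.Run(kernelPanicCmd); err != nil {
            log.Fatal(err)
        }
        fmt.Println("kernel panic scheduled; node should reboot shortly")
    }

The real test then polls the node's Ready condition until it flips back; "failed reboot test" above means the node never returned to Ready within the timeout.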

@dchen1107 dchen1107 added kind/flake Categorizes issue or PR as related to a flaky test. area/test priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Oct 12, 2015
@timothysc
Member

/cc @kubernetes/rh-cluster-infra

@saad-ali saad-ali added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Oct 23, 2015
@jszczepkowski
Contributor

This problem occurred again:
http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-reboot/6868/

This time the following tests are failing:
Reboot each node by dropping all inbound packets for a while and ensure they function afterwards
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/reboot.go:91
Oct 28 06:01:59.706: Node e2e-reboot-minion-0iya failed reboot test.

Reboot each node by ordering clean reboot and ensure they function upon restart
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/reboot.go:65
Oct 28 06:02:19.886: Node e2e-reboot-minion-0iya failed reboot test.
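
Worth noting for triage: unlike the kernel-panic variant, the packet-drop tests never restart the machine. They black-hole traffic for a couple of minutes and expect the node to return to Ready on its own once the rule is removed. A minimal sketch of the command shape, with the rules and timings as assumptions rather than quotes from reboot.go:

    package reboot

    // dropInboundCmd sketches the "drop all inbound packets for a while"
    // variant. The command is backgrounded with nohup and delayed by 10
    // seconds so the SSH session can exit before connectivity is cut.
    const dropInboundCmd = "nohup sudo sh -c " +
        "'sleep 10 && " + // let the SSH command return cleanly first
        "iptables -A INPUT -j DROP && " + // black-hole all inbound traffic
        "sleep 120 && " + // hold the outage long enough to go NotReady
        "iptables -D INPUT -j DROP' " + // restore connectivity
        ">/dev/null 2>&1 &"

This shape also explains why a node whose networking was already broken before the test can never pass: deleting the iptables rule restores nothing.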

@alex-mohr
Contributor

http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-reboot/7109/
Reboot each node by dropping all outbound packets for a while and ensure they function afterwards

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/reboot.go:99
Nov 2 06:14:02.945: Node e2e-reboot-minion-0pev failed reboot test.

Reboot each node by triggering kernel panic and ensure they function upon restart

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/reboot.go:77
Nov 2 06:14:23.119: Node e2e-reboot-minion-0pev failed reboot test.

@alex-mohr
Contributor

http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-reboot/7117/
Identified problems

Reboot each node by triggering kernel panic and ensure they function upon restart

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/reboot.go:77
Nov 2 10:15:41.541: Node e2e-reboot-minion-09bl failed reboot test.

Reboot each node by dropping all outbound packets for a while and ensure they function afterwards

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/reboot.go:99
Nov 2 10:16:01.761: Node e2e-reboot-minion-09bl failed reboot test.

@alex-mohr
Contributor

FWIW, this set of apparent flakes is currently blocking the submit queue, which it has done about 10% of the time based on the history of the past 2 days.

Other similar occurrences:
http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-reboot/7048/
http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-reboot/7044/
http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-reboot/7029/

@alex-mohr alex-mohr added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Nov 2, 2015
@alex-mohr
Contributor

A build contributing 10-20% downtime to the submit queue is worth a P0, IMO.

Is there a real issue with nodes not handling reboots correctly? The only open issue I see is #14642

@alex-mohr
Contributor

Dawn, do the reboot tests belong to you as part of the node team? If not, can you please delegate to the actual owners?

(And another failure: http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-reboot/7127/)

@alex-mohr alex-mohr changed the title Reboot each node by triggering kernel panic and ensure they function upon restart flaky e2e: Reboot each node by triggering kernel panic and ensure they function upon restart flaky Nov 2, 2015
@dchen1107
Member Author

Ok, let me triage it now to understand the issue better. Thanks!

@dchen1107
Member Author

Why doesn't kubernetes-e2e-gce-reboot save kubelet logs? @ixdy

Since there are no kubelet logs, I cannot confirm it, but I checked the test output, and here is my hypothesis:

  1. The failed build above (7127) has 4 failed tests:

Reboot each node by dropping all inbound packets for a while and ensure they function afterwards
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/reboot.go:91
Nov  2 15:05:09.004: Node e2e-reboot-minion-csby failed reboot test.

Reboot each node by switching off the network interface and ensure they function upon switch on
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/reboot.go:83
Nov  2 15:07:16.491: Node e2e-reboot-minion-csby failed reboot test.

Reboot each node by ordering clean reboot and ensure they function upon restart
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/reboot.go:65
Nov  2 15:07:36.658: Node e2e-reboot-minion-csby failed reboot test.

Reboot each node by dropping all outbound packets for a while and ensure they function afterwards
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/reboot.go:99
Nov  2 15:07:56.852: Node e2e-reboot-minion-csby failed reboot test.

  2. All the failures listed above are caused by a single bad node: e2e-reboot-minion-csby.

  3. The bad node e2e-reboot-minion-csby is reported with this NodeCondition:

15:05:07 Nov  2 15:05:07.004: INFO: Node e2e-reboot-minion-csby condition 1/2: type: Ready, status: False, reason: "KubeletNotReady", message: "network not configured correctly", last transition time: 2015-11-02 15:04:40 -0800 PST

  4. None of the reboot tests above actually reboot the node, so the bad state cannot recover or be fixed.

Why was the network not configured correctly in the first place? I think we hit some docker issue, similar to #16601 (comment).

But due to the lack of kubelet logs, I cannot confirm it.
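
When kubelet logs are missing, the Ready condition quoted in point 3 can still be pulled straight from the API. A minimal sketch with client-go; this uses today's import paths as an assumption, since the 2015 client libraries were organized differently:

    package main

    import (
        "context"
        "fmt"
        "log"

        v1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        // Build a client from the local kubeconfig (~/.kube/config).
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            log.Fatal(err)
        }
        client, err := kubernetes.NewForConfig(cfg)
        if err != nil {
            log.Fatal(err)
        }
        nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
        if err != nil {
            log.Fatal(err)
        }
        // Print each node's Ready condition, mirroring the log line above,
        // e.g. Ready=False reason=KubeletNotReady "network not configured correctly".
        for _, n := range nodes.Items {
            for _, c := range n.Status.Conditions {
                if c.Type == v1.NodeReady {
                    fmt.Printf("%s Ready=%s reason=%s message=%q\n",
                        n.Name, c.Status, c.Reason, c.Message)
                }
            }
        }
    }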

@ixdy
Member

ixdy commented Nov 3, 2015

@dchen1107 there are some kubelet logs for build 7127: https://storage.cloud.google.com/kubernetes-jenkins/logs/kubernetes-e2e-gce-reboot/7127/artifacts/

One of the hosts is missing its logs because it wasn't responding to ssh:

Error running command: error getting SSH client to jenkins@104.154.32.61:22: 'dial tcp 104.154.32.61:22: connection refused'
Error running command: error getting SSH client to jenkins@104.154.32.61:22: 'dial tcp 104.154.32.61:22: connection refused'
Error running command: error getting SSH client to jenkins@104.154.32.61:22: 'dial tcp 104.154.32.61:22: connection refused'

@dchen1107
Member Author

For the last ~100 runs there have been no failures, and for the last 300 runs there were only 2. Compared to other e2e tests, the failure rate is pretty low:

kubernetes-e2e-gce              0 of 202 failed (±0)   2 hr 40 min - #9580   15 hr - #9571
kubernetes-e2e-gce-reboot       0 of 202 failed (±0)   43 min - #8079        16 hr - #8050
kubernetes-e2e-gce-release-1.0  0 of 99 failed (-1)    1 hr 19 min - #2242   8 hr 2 min - #2236
kubernetes-e2e-gce-release-1.1  0 of 175 failed (±0)   3 hr 16 min - #738    6 hr 48 min - #736

I am going to close this issue for now since there is nothing to debug at the moment. Please open a new issue if the problem reoccurs, and assign it to me.

@lavalamp
Member

I think this reoccurred, in the "drop outbound packets" and "switching off the network interface" tests.

http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-reboot/10379/

@lavalamp lavalamp reopened this Jan 12, 2016
@dchen1107
Member Author

I was leaving this one open for the current reboot test failure in kubernetes-e2e-gke-ci-reboot, but #19986 was filed for that purpose, and the other causes of reboot failures are already filed separately: #20021 for kube-proxy and #8705 for version mismatch. So I am closing this one.
