
e2e: Reboot each node by triggering kernel panic and ensure they function upon restart flaky #15498

Closed
dchen1107 opened this issue Oct 12, 2015 · 14 comments
Labels
area/test kind/flake Categorizes issue or PR as related to a flaky test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. sig/node Categorizes an issue or PR as relevant to SIG Node.

@dchen1107
Member

Jenkins project kubernetes-e2e-gce-reboot has failed the last couple of runs:

Reboot each node by triggering kernel panic and ensure they function upon restart
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/reboot.go:77
Oct 12 13:13:46.645: Node e2e-reboot-minion-iqqf failed reboot test.
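
For context on what this test family actually does to a node: the kernel-panic variant schedules a delayed sysrq crash over SSH, then waits for the node to drop off and come back Ready. The standalone Go sketch below shows that pattern; the exact command string, user, key path, and address are illustrative assumptions, not copied from reboot.go.

    package main

    import (
        "fmt"
        "log"
        "os"

        "golang.org/x/crypto/ssh"
    )

    // Delayed sysrq crash: the sleep and nohup let the SSH session return
    // cleanly before the kernel dies, so the caller doesn't hang.
    const kernelPanicCmd = "nohup sudo sh -c 'sleep 10 && " +
        "echo 1 > /proc/sys/kernel/sysrq && " + // make sure sysrq is enabled
        "echo c > /proc/sysrq-trigger' >/dev/null 2>&1 &"

    func main() {
        key, err := os.ReadFile("/path/to/ssh_key") // placeholder key path
        if err != nil {
            log.Fatal(err)
        }
        signer, err := ssh.ParsePrivateKey(key)
        if err != nil {
            log.Fatal(err)
        }
        cfg := &ssh.ClientConfig{
            User:            "jenkins", // the user seen in the SSH errors later in this thread
            Auth:            []ssh.AuthMethod{ssh.PublicKeys(signer)},
            HostKeyCallback: ssh.InsecureIgnoreHostKey(), // acceptable for a throwaway test VM
        }
        client, err := ssh.Dial("tcp", "10.0.0.1:22", cfg) // placeholder node address
        if err != nil {
            log.Fatal(err)
        }
        defer client.Close()

        session, err := client.NewSession()
        if err != nil {
            log.Fatal(err)
        }
        defer session.Close()

        if err := session.Run(kernelPanicCmd); err != nil {
            log.Fatal(err)
        }
        fmt.Println("kernel panic scheduled; node should reboot shortly")
    }

The real test then polls the node's Ready condition until it flips back; "failed reboot test" above means the node never returned to Ready within the timeout.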

@dchen1107 dchen1107 added kind/flake Categorizes issue or PR as related to a flaky test. area/test priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Oct 12, 2015
@timothysc
Member

/cc @kubernetes/rh-cluster-infra

@saad-ali saad-ali added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Oct 23, 2015
@jszczepkowski
Contributor

This problem occurred again:
http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-reboot/6868/

This time the following tests are failing:
Reboot each node by dropping all inbound packets for a while and ensure they function afterwards
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/reboot.go:91
Oct 28 06:01:59.706: Node e2e-reboot-minion-0iya failed reboot test.

Reboot each node by ordering clean reboot and ensure they function upon restart
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/reboot.go:65
Oct 28 06:02:19.886: Node e2e-reboot-minion-0iya failed reboot test.
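
Worth noting for triage: unlike the kernel-panic variant, the packet-drop tests never restart the machine. They black-hole traffic for a couple of minutes and expect the node to return to Ready on its own once the rule is removed. A minimal sketch of the command shape, with the rules and timings as assumptions rather than quotes from reboot.go:

    package reboot

    // dropInboundCmd sketches the "drop all inbound packets for a while"
    // variant. The command is backgrounded with nohup and delayed by 10
    // seconds so the SSH session can exit before connectivity is cut.
    const dropInboundCmd = "nohup sudo sh -c " +
        "'sleep 10 && " + // let the SSH command return cleanly first
        "iptables -A INPUT -j DROP && " + // black-hole all inbound traffic
        "sleep 120 && " + // hold the outage long enough to go NotReady
        "iptables -D INPUT -j DROP' " + // restore connectivity
        ">/dev/null 2>&1 &"

This shape also explains why a node whose networking was already broken before the test can never pass: deleting the iptables rule restores nothing.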

@alex-mohr
Contributor

http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-reboot/7109/
Reboot each node by dropping all outbound packets for a while and ensure they function afterwards

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/reboot.go:99
Nov 2 06:14:02.945: Node e2e-reboot-minion-0pev failed reboot test.

Reboot each node by triggering kernel panic and ensure they function upon restart

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/reboot.go:77
Nov 2 06:14:23.119: Node e2e-reboot-minion-0pev failed reboot test.

@alex-mohr
Contributor

http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-reboot/7117/
Identified problems

Reboot each node by triggering kernel panic and ensure they function upon restart

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/reboot.go:77
Nov 2 10:15:41.541: Node e2e-reboot-minion-09bl failed reboot test.

Reboot each node by dropping all outbound packets for a while and ensure they function afterwards

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/reboot.go:99
Nov 2 10:16:01.761: Node e2e-reboot-minion-09bl failed reboot test.

@alex-mohr
Contributor

FWIW, this set of apparent flakes is currently blocking the submit queue, which it has done about 10% of the time based on the history of the past 2 days.

Other similar occurrences:
http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-reboot/7048/
http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-reboot/7044/
http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-reboot/7029/

@alex-mohr alex-mohr added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Nov 2, 2015
@alex-mohr
Contributor

A build contributing 10-20% downtime to the submit queue is worth a P0, IMO.

Is there a real issue with nodes not handling reboots correctly? The only open issue I see is #14642

@alex-mohr
Contributor

Dawn, do the reboot tests belong to you as part of the node team? If not, can you please delegate to the actual owners?

(And another failure: http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-reboot/7127/)

@alex-mohr alex-mohr changed the title Reboot each node by triggering kernel panic and ensure they function upon restart flaky e2e: Reboot each node by triggering kernel panic and ensure they function upon restart flaky Nov 2, 2015
@dchen1107
Member Author

Ok, let me triage it now to understand the issue better. Thanks!

@dchen1107
Member Author

Why doesn't kubernetes-e2e-gce-reboot save kubelet logs? @ixdy

Since there are no kubelet logs, I cannot confirm it, but I checked the test output, and here is my hypothesis:

  1. The failed build above (7127) has 4 failed tests:

Reboot each node by dropping all inbound packets for a while and ensure they function afterwards
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/reboot.go:91
Nov  2 15:05:09.004: Node e2e-reboot-minion-csby failed reboot test.

Reboot each node by switching off the network interface and ensure they function upon switch on
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/reboot.go:83
Nov  2 15:07:16.491: Node e2e-reboot-minion-csby failed reboot test.

Reboot each node by ordering clean reboot and ensure they function upon restart
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/reboot.go:65
Nov  2 15:07:36.658: Node e2e-reboot-minion-csby failed reboot test.

Reboot each node by dropping all outbound packets for a while and ensure they function afterwards
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/reboot.go:99
Nov  2 15:07:56.852: Node e2e-reboot-minion-csby failed reboot test.

  2. All the failures listed above are caused by a single bad node: e2e-reboot-minion-csby.

  3. The bad node e2e-reboot-minion-csby is reported with this NodeCondition:

15:05:07 Nov  2 15:05:07.004: INFO: Node e2e-reboot-minion-csby condition 1/2: type: Ready, status: False, reason: "KubeletNotReady", message: "network not configured correctly", last transition time: 2015-11-02 15:04:40 -0800 PST

  4. None of the reboot tests above actually reboot the node, so the bad state cannot recover or be fixed.

Why was the network not configured correctly in the first place? I think we hit some docker issue, similar to #16601 (comment).

But due to the lack of kubelet logs, I cannot confirm it.
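
When kubelet logs are missing, the Ready condition quoted in point 3 can still be pulled straight from the API. A minimal sketch with client-go; this uses today's import paths as an assumption, since the 2015 client libraries were organized differently:

    package main

    import (
        "context"
        "fmt"
        "log"

        v1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        // Build a client from the local kubeconfig (~/.kube/config).
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            log.Fatal(err)
        }
        client, err := kubernetes.NewForConfig(cfg)
        if err != nil {
            log.Fatal(err)
        }
        nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
        if err != nil {
            log.Fatal(err)
        }
        // Print each node's Ready condition, mirroring the log line above,
        // e.g. Ready=False reason=KubeletNotReady "network not configured correctly".
        for _, n := range nodes.Items {
            for _, c := range n.Status.Conditions {
                if c.Type == v1.NodeReady {
                    fmt.Printf("%s Ready=%s reason=%s message=%q\n",
                        n.Name, c.Status, c.Reason, c.Message)
                }
            }
        }
    }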

@ixdy
Member

ixdy commented Nov 3, 2015

@dchen1107 there are some kubelet logs for build 7127: https://storage.cloud.google.com/kubernetes-jenkins/logs/kubernetes-e2e-gce-reboot/7127/artifacts/

One of the hosts is missing its logs because it wasn't responding to ssh:

Error running command: error getting SSH client to jenkins@104.154.32.61:22: 'dial tcp 104.154.32.61:22: connection refused'
Error running command: error getting SSH client to jenkins@104.154.32.61:22: 'dial tcp 104.154.32.61:22: connection refused'
Error running command: error getting SSH client to jenkins@104.154.32.61:22: 'dial tcp 104.154.32.61:22: connection refused'

@dchen1107
Member Author

For the last ~100 runs there have been no failures, and for the last 300 runs there were only 2. Compared to other e2e tests, the failure rate is pretty low:

kubernetes-e2e-gce              0 of 202 failed (±0)   2 hr 40 min - #9580   15 hr - #9571
kubernetes-e2e-gce-reboot       0 of 202 failed (±0)   43 min - #8079        16 hr - #8050
kubernetes-e2e-gce-release-1.0  0 of 99 failed (-1)    1 hr 19 min - #2242   8 hr 2 min - #2236
kubernetes-e2e-gce-release-1.1  0 of 175 failed (±0)   3 hr 16 min - #738    6 hr 48 min - #736

I am going to close this issue for now since there is nothing to debug at the moment. Please open a new issue if the problem reoccurs, and assign it to me.

@lavalamp
Member

I think this reoccurred, in the "drop outbound packets" and "switching off the network interface" tests.

http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-reboot/10379/

@lavalamp lavalamp reopened this Jan 12, 2016
@dchen1107
Member Author

I was leaving this one open for the current reboot test failure in kubernetes-e2e-gke-ci-reboot, but #19986 was filed for that purpose, and the other causes of reboot failures are already filed separately: #20021 for kube-proxy and #8705 for version mismatch. So I am closing this one.
