e2e: Reboot each node by triggering kernel panic and ensure they function upon restart flaky #15498
Comments
/cc @kubernetes/rh-cluster-infra
This problem occurred again. This time the following test is failing: Reboot each node by ordering clean reboot and ensure they function upon restart
http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-reboot/7109/
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/reboot.go:99
Reboot each node by triggering kernel panic and ensure they function upon restart
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/reboot.go:77

http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-reboot/7117/
Reboot each node by triggering kernel panic and ensure they function upon restart
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/reboot.go:77
Reboot each node by dropping all outbound packets for a while and ensure they function afterwards
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/reboot.go:99
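For readers unfamiliar with this test: the "kernel panic" reboot is induced through the Linux sysrq interface on each node. A minimal sketch of the idea follows; the exact command, the 10-second delay, and the DRY_RUN guard are my assumptions for illustration, not the literal reboot.go implementation.

```shell
# Hedged sketch: how an e2e test might hard-reboot a node via a kernel panic.
# Writing 'c' to /proc/sysrq-trigger crashes the kernel immediately, so the
# command is backgrounded with a short sleep to let the SSH session return,
# and a DRY_RUN guard (my addition) keeps this sketch from crashing anything.
DRY_RUN=1
panic_cmd="nohup sh -c 'sleep 10 && echo c > /proc/sysrq-trigger' >/dev/null 2>&1 &"
if [ "$DRY_RUN" = "1" ]; then
  # Only print the command in this sketch.
  echo "would run: $panic_cmd"
else
  eval "$panic_cmd"
fi
```

The test then waits for the node to come back and verifies that its pods return to a healthy state.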
FWIW, this set of apparent flakes is currently blocking the submit queue, which it seems to do about 10% of the time based on the history of the past 2 days. Other similar occurrences:
http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-reboot/7048/
http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-reboot/7044/
http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-reboot/7029/
A build contributing to 10-20% submit-queue downtime is worth a P0, IMO. Is there a real issue with nodes not handling reboots correctly? The only open issue I see is #14642
Dawn, are the reboot tests yours as part of the node team? If not, can you please delegate to the actual owners? (And another failure: http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-reboot/7127/)
Ok, let me triage it now to understand the issue better first. Thanks!
Why doesn't kubernetes-e2e-gce-reboot save kubelet logs? @ixdy Since there are no kubelet logs, I cannot confirm it. But I checked the test output, and here is my hypothesis:
Why was the network not configured correctly in the first place? I think we hit a docker issue similar to #16601 (comment). But for lack of kubelet logs, I cannot confirm it.
@dchen1107 there are some kubelet logs for build 7127: https://storage.cloud.google.com/kubernetes-jenkins/logs/kubernetes-e2e-gce-reboot/7127/artifacts/ One of the hosts is missing because it wasn't responding to ssh:
For the last ~100 runs, there have been no failures, and for the last 300 runs, there are 2 failures. Compared to other e2e tests, the failure rate is pretty low:
kubernetes-e2e-gce: 0 of 202 failed (±0); 2 hr 40 min - #9580; 15 hr - #9571
I am going to close the issue for now since there is nothing to debug at this moment. Please open a new issue if the problem reoccurs, and assign it to me.
I think this reoccurred, in the "drop outbound packets" and "switching off the network interface" tests. http://kubekins.dls.corp.google.com/view/Critical%20Builds/job/kubernetes-e2e-gce-reboot/10379/ |
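For context on the disruption just mentioned: the "drop outbound packets" test blackholes the node's egress traffic for a while and then restores it. A sketch of that idea follows; the specific iptables rules, the SSH-port exemption, and the DRY_RUN wrapper are my assumptions for illustration, not the literal test code.

```shell
# Hedged sketch of a "drop all outbound packets for a while" disruption.
# run() with DRY_RUN=1 (my addition) only prints each command, so this
# sketch is safe to execute without actually cutting off a machine.
DRY_RUN=1
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# Keep the SSH control channel alive, drop everything else outbound ...
run iptables -A OUTPUT -p tcp --sport 22 -j ACCEPT
run iptables -A OUTPUT -j DROP
# ... stay dark long enough for the control plane to notice ...
run sleep 120
# ... then remove the rules and let the node recover.
run iptables -D OUTPUT -j DROP
run iptables -D OUTPUT -p tcp --sport 22 -j ACCEPT
```

The test passes only if the node and its pods return to a ready state after connectivity is restored, which is why slow post-disruption recovery (e.g. docker networking trouble) shows up here as a flake.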
Jenkins project kubernetes-e2e-gce-reboot has failed the last couple of runs:
Reboot each node by triggering kernel panic and ensure they function upon restart
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/reboot.go:77
Oct 12 13:13:46.645: Node e2e-reboot-minion-iqqf failed reboot test.