Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Flaking][sig-node] [NodeConformance] Containers Lifecycle when a pod is terminating because its liveness probe fails should execute readiness probe while in preStop, but not liveness #124791

Open
Rajalakshmi-Girish opened this issue May 10, 2024 · 12 comments · Fixed by #125006
Assignees
Labels
kind/flake Categorizes issue or PR as related to a flaky test. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@Rajalakshmi-Girish
Copy link
Contributor

Which jobs are flaking?

https://prow.k8s.io/job-history/gs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-node-e2e-containerd

Which tests are flaking?

E2eNode Suite: [It] [sig-node] [NodeConformance] Containers Lifecycle when a pod is terminating because its liveness probe fails should execute readiness probe while in preStop, but not liveness [NodeConformance]

Since when has it been flaking?

Has been flaking sinc Testcase addition in #124718.

Testgrid link

https://testgrid.k8s.io/presubmits-kubernetes-blocking#pull-kubernetes-node-e2e-containerd

Reason for failure (if possible)

[sig-node] [NodeConformance] Containers Lifecycle when a pod is terminating because its liveness probe fails [It] should execute readiness probe while in preStop, but not liveness [NodeConformance] [sig-node, NodeConformance]
k8s.io/kubernetes/test/e2e_node/container_lifecycle_test.go:793

  Timeline >>
  STEP: Creating a kubernetes client @ 05/10/24 07:20:24.148
  STEP: Building a namespace api object, basename containers-lifecycle-test @ 05/10/24 07:20:24.148
  I0510 07:20:24.160787 1886 framework.go:275] Skipping waiting for service account
  STEP: creating a pod with a readiness probe and a preStop hook @ 05/10/24 07:20:24.161
  STEP: Waiting for the pod to complete @ 05/10/24 07:20:24.219
  STEP: Parsing results @ 05/10/24 07:20:54.289
  STEP: Analyzing results @ 05/10/24 07:20:54.297
  I0510 07:20:54.298641 1886 container_lifecycle_test.go:837] Unexpected error: 
      <*errors.errorString | 0xc000999df0>: 
      expected Readiness-regular-1 to start before exiting PreStop-regular-1, got
      0) 2024-05-10 07:20:24.868 +0000 UTC regular-1 Starting
      1) 2024-05-10 07:20:24.894 +0000 UTC regular-1 Started
      2) 2024-05-10 07:20:24.9 +0000 UTC regular-1 Delaying
      3) 2024-05-10 07:20:25.237 +0000 UTC Readiness-regular-1 Starting
      4) 2024-05-10 07:20:25.248 +0000 UTC Readiness-regular-1 Started
      5) 2024-05-10 07:20:25.263 +0000 UTC Readiness-regular-1 Delaying
      6) 2024-05-10 07:20:25.279 +0000 UTC Readiness-regular-1 Exiting
      7) 2024-05-10 07:20:25.58 +0000 UTC Readiness-regular-1 Starting
      8) 2024-05-10 07:20:25.587 +0000 UTC Readiness-regular-1 Started
      9) 2024-05-10 07:20:25.594 +0000 UTC Readiness-regular-1 Delaying
      10) 2024-05-10 07:20:25.599 +0000 UTC Readiness-regular-1 Exiting
      11) 2024-05-10 07:20:26.673 +0000 UTC Readiness-regular-1 Starting
      12) 2024-05-10 07:20:26.69 +0000 UTC Readiness-regular-1 Started
      13) 2024-05-10 07:20:26.703 +0000 UTC Readiness-regular-1 Delaying
      14) 2024-05-10 07:20:26.71 +0000 UTC Readiness-regular-1 Exiting
      15) 2024-05-10 07:20:27.591 +0000 UTC Readiness-regular-1 Starting
      16) 2024-05-10 07:20:27.606 +0000 UTC Readiness-regular-1 Started
      17) 2024-05-10 07:20:27.618 +0000 UTC Readiness-regular-1 Delaying
      18) 2024-05-10 07:20:27.628 +0000 UTC Readiness-regular-1 Exiting
      19) 2024-05-10 07:20:28.583 +0000 UTC Readiness-regular-1 Starting
      20) 2024-05-10 07:20:28.591 +0000 UTC Readiness-regular-1 Started
      21) 2024-05-10 07:20:28.598 +0000 UTC Readiness-regular-1 Delaying
      22) 2024-05-10 07:20:28.603 +0000 UTC Readiness-regular-1 Exiting
      23) 2024-05-10 07:20:29.661 +0000 UTC Readiness-regular-1 Starting
      24) 2024-05-10 07:20:29.677 +0000 UTC Readiness-regular-1 Started
      25) 2024-05-10 07:20:29.7 +0000 UTC Readiness-regular-1 Delaying
      26) 2024-05-10 07:20:29.723 +0000 UTC Readiness-regular-1 Exiting
      27) 2024-05-10 07:20:30.584 +0000 UTC Readiness-regular-1 Starting
      28) 2024-05-10 07:20:30.591 +0000 UTC Readiness-regular-1 Started
      29) 2024-05-10 07:20:30.598 +0000 UTC Readiness-regular-1 Delaying
      30) 2024-05-10 07:20:30.61 +0000 UTC Readiness-regular-1 Exiting
      31) 2024-05-10 07:20:31.58 +0000 UTC Readiness-regular-1 Starting
      32) 2024-05-10 07:20:31.587 +0000 UTC Readiness-regular-1 Started
      33) 2024-05-10 07:20:31.593 +0000 UTC Readiness-regular-1 Delaying
      34) 2024-05-10 07:20:31.599 +0000 UTC Readiness-regular-1 Exiting
      35) 2024-05-10 07:20:32.58 +0000 UTC Readiness-regular-1 Starting
      36) 2024-05-10 07:20:32.587 +0000 UTC Readiness-regular-1 Started
      37) 2024-05-10 07:20:32.593 +0000 UTC Readiness-regular-1 Delaying
      38) 2024-05-10 07:20:32.599 +0000 UTC Readiness-regular-1 Exiting
      39) 2024-05-10 07:20:33.629 +0000 UTC Readiness-regular-1 Starting
      40) 2024-05-10 07:20:33.65 +0000 UTC Readiness-regular-1 Started
      41) 2024-05-10 07:20:33.66 +0000 UTC Readiness-regular-1 Delaying
      42) 2024-05-10 07:20:33.679 +0000 UTC Readiness-regular-1 Exiting
      43) 2024-05-10 07:20:34.617 +0000 UTC Readiness-regular-1 Starting
      44) 2024-05-10 07:20:34.624 +0000 UTC Liveness-regular-1 Starting
      45) 2024-05-10 07:20:34.629 +0000 UTC Readiness-regular-1 Started
      46) 2024-05-10 07:20:34.634 +0000 UTC Liveness-regular-1 Started
      47) 2024-05-10 07:20:34.642 +0000 UTC Liveness-regular-1 Delaying
      48) 2024-05-10 07:20:34.642 +0000 UTC Readiness-regular-1 Delaying
      49) 2024-05-10 07:20:34.649 +0000 UTC Liveness-regular-1 Exiting
      50) 2024-05-10 07:20:34.664 +0000 UTC Readiness-regular-1 Exiting
      51) 2024-05-10 07:20:34.7 +0000 UTC PreStop-regular-1 Starting
      52) 2024-05-10 07:20:34.708 +0000 UTC PreStop-regular-1 Started
      53) 2024-05-10 07:20:34.715 +0000 UTC PreStop-regular-1 Delaying
      54) 2024-05-10 07:20:35.719 +0000 UTC Readiness-regular-1 Starting
      55) 2024-05-10 07:20:35.735 +0000 UTC PreStop-regular-1 Exiting
      56) 2024-05-10 07:20:35.771 +0000 UTC Readiness-regular-1 Started
      57) 2024-05-10 07:20:35.804 +0000 UTC Readiness-regular-1 Delaying
      58) 2024-05-10 07:20:35.823 +0000 UTC Readiness-regular-1 Exiting
      59) 2024-05-10 07:20:36.581 +0000 UTC Readiness-regular-1 Starting
      60) 2024-05-10 07:20:36.589 +0000 UTC Readiness-regular-1 Started
      61) 2024-05-10 07:20:36.617 +0000 UTC Readiness-regular-1 Delaying
      62) 2024-05-10 07:20:36.636 +0000 UTC Readiness-regular-1 Exiting
      63) 2024-05-10 07:20:37.689 +0000 UTC Readiness-regular-1 Starting
      64) 2024-05-10 07:20:37.708 +0000 UTC Readiness-regular-1 Started
      65) 2024-05-10 07:20:37.731 +0000 UTC Readiness-regular-1 Delaying
      66) 2024-05-10 07:20:37.74 +0000 UTC Readiness-regular-1 Exiting
      67) 2024-05-10 07:20:38.585 +0000 UTC Readiness-regular-1 Starting
      68) 2024-05-10 07:20:38.592 +0000 UTC Readiness-regular-1 Started
      69) 2024-05-10 07:20:38.599 +0000 UTC Readiness-regular-1 Delaying
      70) 2024-05-10 07:20:38.604 +0000 UTC Readiness-regular-1 Exiting
      71) 2024-05-10 07:20:39.629 +0000 UTC Readiness-regular-1 Starting
      72) 2024-05-10 07:20:39.645 +0000 UTC Readiness-regular-1 Started
      73) 2024-05-10 07:20:39.659 +0000 UTC Readiness-regular-1 Delaying
      74) 2024-05-10 07:20:39.678 +0000 UTC Readiness-regular-1 Exiting
      75) 2024-05-10 07:20:40.581 +0000 UTC Readiness-regular-1 Starting
      76) 2024-05-10 07:20:40.588 +0000 UTC Readiness-regular-1 Started
      77) 2024-05-10 07:20:40.596 +0000 UTC Readiness-regular-1 Delaying
      78) 2024-05-10 07:20:40.601 +0000 UTC Readiness-regular-1 Exiting
      79) 2024-05-10 07:20:41.644 +0000 UTC Readiness-regular-1 Starting
      80) 2024-05-10 07:20:41.671 +0000 UTC Readiness-regular-1 Started
      81) 2024-05-10 07:20:41.696 +0000 UTC Readiness-regular-1 Delaying
      82) 2024-05-10 07:20:41.716 +0000 UTC Readiness-regular-1 Exiting
      83) 2024-05-10 07:20:42.582 +0000 UTC Readiness-regular-1 Starting
      84) 2024-05-10 07:20:42.589 +0000 UTC Readiness-regular-1 Started
      85) 2024-05-10 07:20:42.597 +0000 UTC Readiness-regular-1 Delaying
      86) 2024-05-10 07:20:42.606 +0000 UTC Readiness-regular-1 Exiting
      87) 2024-05-10 07:20:43.605 +0000 UTC Readiness-regular-1 Starting
      88) 2024-05-10 07:20:43.637 +0000 UTC Readiness-regular-1 Started
      89) 2024-05-10 07:20:43.691 +0000 UTC Readiness-regular-1 Delaying
      90) 2024-05-10 07:20:43.775 +0000 UTC Readiness-regular-1 Exiting
      91) 2024-05-10 07:20:44.607 +0000 UTC Readiness-regular-1 Starting
      92) 2024-05-10 07:20:44.625 +0000 UTC Readiness-regular-1 Started
      93) 2024-05-10 07:20:44.651 +0000 UTC Readiness-regular-1 Delaying
      94) 2024-05-10 07:20:44.661 +0000 UTC Readiness-regular-1 Exiting
      95) 2024-05-10 07:20:45.683 +0000 UTC Readiness-regular-1 Starting
      96) 2024-05-10 07:20:45.698 +0000 UTC Readiness-regular-1 Started
      97) 2024-05-10 07:20:45.71 +0000 UTC Readiness-regular-1 Delaying
      98) 2024-05-10 07:20:45.727 +0000 UTC Readiness-regular-1 Exiting
      99) 2024-05-10 07:20:46.591 +0000 UTC Readiness-regular-1 Starting
      100) 2024-05-10 07:20:46.6 +0000 UTC Readiness-regular-1 Started
      101) 2024-05-10 07:20:46.605 +0000 UTC Readiness-regular-1 Delaying
      102) 2024-05-10 07:20:46.611 +0000 UTC Readiness-regular-1 Exiting
      103) 2024-05-10 07:20:47.595 +0000 UTC Readiness-regular-1 Starting
      104) 2024-05-10 07:20:47.601 +0000 UTC Readiness-regular-1 Started
      105) 2024-05-10 07:20:47.607 +0000 UTC Readiness-regular-1 Delaying
      106) 2024-05-10 07:20:47.612 +0000 UTC Readiness-regular-1 Exiting
      107) 2024-05-10 07:20:48.649 +0000 UTC Readiness-regular-1 Starting
      108) 2024-05-10 07:20:48.664 +0000 UTC Readiness-regular-1 Started
      109) 2024-05-10 07:20:48.678 +0000 UTC Readiness-regular-1 Delaying
      110) 2024-05-10 07:20:48.689 +0000 UTC Readiness-regular-1 Exiting
      111) 2024-05-10 07:20:49.592 +0000 UTC Readiness-regular-1 Starting
      112) 2024-05-10 07:20:49.61 +0000 UTC Readiness-regular-1 Started
      113) 2024-05-10 07:20:49.617 +0000 UTC Readiness-regular-1 Delaying
      114) 2024-05-10 07:20:49.621 +0000 UTC Readiness-regular-1 Exiting
      115) 2024-05-10 07:20:50.634 +0000 UTC Readiness-regular-1 Starting
      116) 2024-05-10 07:20:50.667 +0000 UTC Readiness-regular-1 Started
      117) 2024-05-10 07:20:50.68 +0000 UTC Readiness-regular-1 Delaying
      118) 2024-05-10 07:20:50.691 +0000 UTC Readiness-regular-1 Exiting
      119) 2024-05-10 07:20:50.877 +0000 UTC regular-1 Exiting

Anything else we need to know?

It has been constantly failing in our internal CI on ppc64le https://testgrid.k8s.io/ibm-k8s-e2e-node-ppc64le#periodic-k8s-e2e-node-ppc64le

STEP: Analyzing results - k8s.io/kubernetes/test/e2e_node/container_lifecycle_test.go:835 @ 05/10/24 09:32:36.824
I0510 09:32:36.824908 73450 container_lifecycle_test.go:837] Unexpected error: 
    <*errors.errorString | 0xc00064b8a0>: 
    couldn't find that PreStop-regular-1 ever started, got
    0) 2024-05-10 09:32:30.635 +0000 UTC regular-1 Starting
    1) 2024-05-10 09:32:31.119 +0000 UTC regular-1 Started
    2) 2024-05-10 09:32:31.124 +0000 UTC regular-1 Delaying
    3) 2024-05-10 09:32:31.625 +0000 UTC Readiness-regular-1 Starting
    4) 2024-05-10 09:32:31.63 +0000 UTC Readiness-regular-1 Started
    5) 2024-05-10 09:32:31.826 +0000 UTC Readiness-regular-1 Delaying
    6) 2024-05-10 09:32:31.986 +0000 UTC Readiness-regular-1 Exiting
    7) 2024-05-10 09:32:32.197 +0000 UTC Readiness-regular-1 Starting
    8) 2024-05-10 09:32:32.263 +0000 UTC Readiness-regular-1 Started
    9) 2024-05-10 09:32:32.285 +0000 UTC Readiness-regular-1 Delaying
    10) 2024-05-10 09:32:33.216 +0000 UTC Readiness-regular-1 Starting
    11) 2024-05-10 09:32:33.487 +0000 UTC Readiness-regular-1 Started

Relevant SIG(s)

/sig node

@Rajalakshmi-Girish Rajalakshmi-Girish added the kind/flake Categorizes issue or PR as related to a flaky test. label May 10, 2024
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 10, 2024
@Rajalakshmi-Girish
Copy link
Contributor Author

@matthyx @SergeyKanzhelev ^^

@matthyx
Copy link
Contributor

matthyx commented May 13, 2024

@Rajalakshmi-Girish this is strange, I don't understand why the preStop hook is never executed... is it a matter of waiting more for this to happen?
What's the performance of these ppc runners?

@Rajalakshmi-Girish
Copy link
Contributor Author

@Rajalakshmi-Girish this is strange, I don't understand why the preStop hook is never executed... is it a matter of waiting more for this to happen?
What's the performance of these ppc runners?

On our ppc node it seems to a OOM error. I see the regular-1 container related to testcase being killed with OOM message.

[{"name":"regular-1","state":{"terminated":{"exitCode":137,"reason":"OOMKilled","startedAt":"2024-05-14T03:32:28Z","finishedAt":"2024-05-14T03:32:29Z","containerID":"containerd://1a614ed586af6a7988e4dc22d1ef59e5a10a6d10a0596602199315df57171458"}},"lastState":{},"ready":false,"restartCount":0,"image":"registry.k8s.io/e2e-test-images/busybox:1.36.1-1","imageID":"registry.k8s.io/e2e-test-images/busybox@sha256:a9155b13325b2abef48e71de77bb8ac015412a566829f621d06bfae5c699b1b9","containerID":"containerd://1a614ed586af6a7988e4dc22d1ef59e5a10a6d10a0596602199315df57171458","started":false}]

@matthyx request you to point me out if there is any memory requirement to run the e2e-node suite.
Also, is there anything related to this particular test that consumes more memory?

@matthyx
Copy link
Contributor

matthyx commented May 14, 2024

Interesting... not that I knew. These are small containers using BusyBox images, I don't think we can do much smaller than that.

Do you have memory pressure on the node? Because we don't set any resource limit on our Pods.

Maybe we should try to reduce the PARALLELISM env variable?

@Rajalakshmi-Girish
Copy link
Contributor Author

Rajalakshmi-Girish commented May 14, 2024

PARALLELISM

I shall try reducing the PARALLELISM as you suggested.
But, the test fails with OOM messages even when run alone (not as part of the complete suite)

@Rajalakshmi-Girish
Copy link
Contributor Author

@matthyx Any clue on why we are experiencing OOM error only for this testcase?
I am trying to compare pod spec to see if this is defined differently.

@matthyx
Copy link
Contributor

matthyx commented May 14, 2024

@matthyx Any clue on why we are experiencing OOM error only for this testcase? I am trying to compare pod spec to see if this is defined differently.

I don't think we have anything special in the Pod spec... I might try to replicate exactly one test that works.

@gjkim42
Copy link
Member

gjkim42 commented May 14, 2024

cc

@eddiezane
Copy link
Member

I think this is still failing.

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/125073/pull-kubernetes-node-e2e-containerd/1793373424723169280

/reopen

@k8s-ci-robot
Copy link
Contributor

@eddiezane: Reopened this issue.

In response to this:

I think this is still failing.

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/125073/pull-kubernetes-node-e2e-containerd/1793373424723169280

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@gjkim42
Copy link
Member

gjkim42 commented May 23, 2024

I think this is still failing.

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/125073/pull-kubernetes-node-e2e-containerd/1793373424723169280

/reopen

#124718 (comment)

A short preStop hook period looks like an issue. I'll post a PR.

@gjkim42
Copy link
Member

gjkim42 commented May 23, 2024

/assign
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/flake Categorizes issue or PR as related to a flaky test. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants