TRT-1576: Fail if operator has Available=False unless in upgrade window #28735

Open
wants to merge 12 commits into
base: master

Conversation

DennisPeriquet
Contributor

@DennisPeriquet DennisPeriquet commented Apr 23, 2024

For this test: [bz-%v] clusteroperator/%v should not change condition/Available:

  • For non-upgrade jobs, fail when an operator goes to Available=False.
  • For upgrade jobs, fail when an operator goes to Available=False unless it happens during an upgrade window and the condition lasts less than 10 minutes.
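
The two rules above can be sketched as a small decision function. This is only an illustration in Python, not the code in this PR: the function name `should_fail`, its parameters, and the reading of "during an upgrade window" as "fully contained in the window" are assumptions, and per-operator exceptions (such as the storage one) are not modeled.

```python
from datetime import datetime, timedelta

# Threshold taken from the PR description: blips shorter than this are tolerated
# inside an upgrade window.
MAX_BLIP = timedelta(minutes=10)


def should_fail(is_upgrade_job, start, end, upgrade_windows):
    """Return True if an Available=False interval [start, end] should fail the test.

    upgrade_windows is a list of (window_start, window_end) datetime pairs.
    Hypothetical helper; the names are illustrative only.
    """
    if not is_upgrade_job:
        # Non-upgrade jobs: any Available=False is a failure.
        return True
    # Assumption: "during an upgrade window" means fully contained in one window.
    in_window = any(ws <= start and end <= we for ws, we in upgrade_windows)
    short_enough = (end - start) < MAX_BLIP
    return not (in_window and short_enough)


window = [(datetime(2024, 5, 6, 0, 7, 0), datetime(2024, 5, 6, 0, 58, 25))]
blip_start = datetime(2024, 5, 6, 0, 30, 25)
print(should_fail(True, blip_start, blip_start + timedelta(seconds=51), window))
```

Treating partially overlapping intervals as outside the window is the stricter choice; a real implementation may decide differently.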

Once the PR in which the storage operator stops reporting Available status merges, we can remove the exception for it.

@openshift-ci openshift-ci bot requested review from deads2k and soltysh April 23, 2024 11:25
Contributor

openshift-ci bot commented Apr 23, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: DennisPeriquet

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 23, 2024
@DennisPeriquet
Contributor Author

/payload-job periodic-ci-openshift-release-master-ci-4.16-e2e-vsphere-ovn-upgrade

This will see if my new exception allows the upgrade job to pass despite the single storage operator replica.

Contributor

openshift-ci bot commented Apr 23, 2024

@DennisPeriquet: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.16-e2e-vsphere-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/272b5a20-0187-11ef-95a0-20b3d6d376a7-0

@DennisPeriquet
Contributor Author

/payload-job periodic-ci-openshift-release-master-ci-4.16-e2e-vsphere-ovn-upgrade

Retrying because the last one didn't really run.

Contributor

openshift-ci bot commented Apr 23, 2024

@DennisPeriquet: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.16-e2e-vsphere-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/61bc6960-0194-11ef-8313-791cce82a878-0

@openshift-trt-bot

Job Failure Risk Analysis for sha: 63d0936

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-agnostic-ovn-cmd IncompleteTests
Tests for this run (16) are below the historical average (536): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

@openshift-trt-bot

Job Failure Risk Analysis for sha: 3014822

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-agnostic-ovn-cmd IncompleteTests
Tests for this run (25) are below the historical average (531): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

@DennisPeriquet DennisPeriquet changed the title from "DO NOT MERGE: See how many jobs fail with Degraded=True and Available=False" to "DO NOT MERGE: See how many jobs fail with Available=False" Apr 26, 2024
@openshift-trt-bot

Job Failure Risk Analysis for sha: d950634

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-gcp-csi High
[OLM][invariant] alert/KubePodNotReady should not be at or above info in ns/openshift-marketplace
This test has passed 100.00% of 25 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.16-e2e-gcp-ovn-csi'] in the last 14 days.
pull-ci-openshift-origin-master-e2e-gcp-ovn-upgrade Medium
[OLM][invariant] alert/KubePodNotReady should not be at or above info in ns/openshift-marketplace
This test has passed 96.70% of 818 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.16-e2e-gcp-ovn-upgrade' 'periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade'] in the last 14 days.

@openshift-trt-bot

Job Failure Risk Analysis for sha: 2e4493a

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade High
[sig-apps] job-upgrade
This test has passed 100.00% of 32 runs on jobs ['periodic-ci-openshift-release-master-ci-4.16-e2e-aws-ovn-upgrade'] in the last 14 days.
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-serial Low
[bz-apiserver-auth] clusteroperator/authentication should not change condition/Available
This test has passed 0.00% of 62 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node-serial'] in the last 14 days.

Open Bugs
Single short-lived operand blip shouldn't cause authentication operator Available=False
---
[bz-Storage] clusteroperator/storage should not change condition/Available
This test has passed 0.00% of 62 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node-serial'] in the last 14 days.

Open Bugs
Setup new vsphere informing job
---
[sig-arch] events should not repeat pathologically for ns/openshift-etcd-operator
This test has passed 51.61% of 62 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node-serial'] in the last 14 days.
---
[bz-OLM] clusteroperator/operator-lifecycle-manager-packageserver should not change condition/Available
This test has passed 1.61% of 62 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node-serial'] in the last 14 days.
---
Showing 4 of 12 test results

@DennisPeriquet
Contributor Author

/test unit

@DennisPeriquet
Contributor Author

/payload-job periodic-ci-openshift-release-master-ci-4.16-e2e-vsphere-ovn-upgrade

Contributor

openshift-ci bot commented Apr 29, 2024

@DennisPeriquet: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.16-e2e-vsphere-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/8a3d2950-0627-11ef-99cb-168bfde7d9b7-0

@openshift-trt-bot

Job Failure Risk Analysis for sha: 80a02e7

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade High
[sig-apps] job-upgrade
This test has passed 100.00% of 23 runs on jobs ['periodic-ci-openshift-release-master-ci-4.16-e2e-aws-ovn-upgrade'] in the last 14 days.
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-serial Low
[bz-Storage] clusteroperator/storage should not change condition/Available
This test has passed 0.00% of 45 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node-serial'] in the last 14 days.

Open Bugs
Setup new vsphere informing job
---
[bz-Routing] clusteroperator/ingress should not change condition/Available
This test has passed 0.00% of 45 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node-serial'] in the last 14 days.
---
[bz-Image Registry] clusteroperator/image-registry should not change condition/Available
This test has passed 22.22% of 45 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node-serial'] in the last 14 days.

Open Bugs
CI: fail update suite if any ClusterOperator go Available=False outside of updates
---
[bz-OLM] clusteroperator/operator-lifecycle-manager-packageserver should not change condition/Available
This test has passed 2.22% of 45 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node-serial'] in the last 14 days.
---
Showing 4 of 11 test results

@DennisPeriquet
Contributor Author

/test unit

@DennisPeriquet
Contributor Author

/payload-job periodic-ci-openshift-release-master-ci-4.16-e2e-vsphere-ovn-upgrade

Contributor

openshift-ci bot commented Apr 30, 2024

@DennisPeriquet: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.16-e2e-vsphere-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/6ff37c20-0690-11ef-86e4-c1c128b91d20-0

@DennisPeriquet DennisPeriquet changed the title from "DO NOT MERGE: See how many jobs fail with Available=False" to "TRT-1576: Fail if operator has Available=False unless in upgrade window" Apr 30, 2024
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 30, 2024
@openshift-ci-robot

openshift-ci-robot commented Apr 30, 2024

@DennisPeriquet: This pull request references TRT-1576 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to this:

Fail the [bz-%v] clusteroperator/%v should not change condition/Available test for operators when Available=False outside of any upgrade window.

Add an exception for storage operator since it has only one replica.

This will give me a list of failures to look into. From the list of failures, we can see if there are already Jiras and decide if we want to add exceptions. Then, we'll update the PR with exceptions.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot

openshift-ci-robot commented Apr 30, 2024

@DennisPeriquet: This pull request references TRT-1576 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to this:

For this test: [bz-%v] clusteroperator/%v should not change condition/Available:

  • For non-upgrade jobs, fail when operator goes to Available=False
  • For upgrade-jobs, fail when operator goes to Available=False unless it's during an upgrade window and the condition lasts for less than 10 minutes.

Once the PR where storage operator stops reporting Available status merges, we can remove the exception for it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-trt-bot

Job Failure Risk Analysis for sha: efde445

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade Low
[bz-kube-storage-version-migrator] clusteroperator/kube-storage-version-migrator should not change condition/Available
This test has passed 61.54% of 52 runs on jobs ['periodic-ci-openshift-release-master-ci-4.16-e2e-aws-ovn-upgrade'] in the last 14 days.
---
[bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Available
This test has passed 61.54% of 52 runs on jobs ['periodic-ci-openshift-release-master-ci-4.16-e2e-aws-ovn-upgrade'] in the last 14 days.

Open Bugs
control-plane-machine-set goes Available=False with UnavailableReplicas during updates

@DennisPeriquet
Contributor Author

/test e2e-agnostic-ovn-cmd

@DennisPeriquet
Contributor Author

/test verify

@DennisPeriquet
Contributor Author

/test e2e-aws-ovn-cgroupsv2

@openshift-trt-bot

Job Failure Risk Analysis for sha: b8aec3c

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-serial Low
[bz-openshift-controller-manager] clusteroperator/openshift-controller-manager should not change condition/Available
This test has passed 6.25% of 64 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node-serial'] in the last 14 days.
---
[bz-Image Registry] clusteroperator/image-registry should not change condition/Available
This test has passed 23.44% of 64 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node-serial'] in the last 14 days.

Open Bugs
CI: fail update suite if any ClusterOperator go Available=False outside of updates
---
[bz-OLM] clusteroperator/operator-lifecycle-manager-packageserver should not change condition/Available
This test has passed 3.12% of 64 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node-serial'] in the last 14 days.
---
[bz-kube-storage-version-migrator] clusteroperator/kube-storage-version-migrator should not change condition/Available
This test has passed 10.94% of 64 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node-serial'] in the last 14 days.
---
Showing 4 of 11 test results

@DennisPeriquet
Contributor Author

/payload-job periodic-ci-openshift-release-master-ci-4.16-e2e-vsphere-ovn-upgrade

Contributor

openshift-ci bot commented May 5, 2024

@DennisPeriquet: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.16-e2e-vsphere-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/4d404fc0-0b34-11ef-921f-8306786e2a9d-0

Contributor

openshift-ci bot commented May 6, 2024

@DennisPeriquet: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-agnostic-ovn-cmd 676bb08 link false /test e2e-agnostic-ovn-cmd
ci/prow/e2e-aws-ovn-fips 676bb08 link true /test e2e-aws-ovn-fips
ci/prow/e2e-aws-ovn-single-node-serial 676bb08 link false /test e2e-aws-ovn-single-node-serial
ci/prow/e2e-metal-ipi-ovn-ipv6 676bb08 link true /test e2e-metal-ipi-ovn-ipv6
ci/prow/e2e-aws-ovn-single-node-upgrade 676bb08 link false /test e2e-aws-ovn-single-node-upgrade

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@DennisPeriquet
Contributor Author

re: the last /payload with vsphere:


[bz-Image Registry] clusteroperator/image-registry should not change condition/Available (1h34m34s)

{  4 unexpected clusteroperator state transitions during e2e test run.  These did not match any known exceptions, so they cause this test-case to fail:

May 06 00:30:25.569 E clusteroperator/image-registry condition/Available reason/NoReplicasAvailable status/False Available: The deployment does not have available replicas\nNodeCADaemonAvailable: The daemon set node-ca has available replicas\nImagePrunerAvailable: Pruner CronJob has been created
May 06 00:30:25.569 - 51s   E clusteroperator/image-registry condition/Available reason/NoReplicasAvailable status/False Available: The deployment does not have available replicas\nNodeCADaemonAvailable: The daemon set node-ca has available replicas\nImagePrunerAvailable: Pruner CronJob has been created
May 06 00:50:27.986 E clusteroperator/image-registry condition/Available reason/NoReplicasAvailable status/False Available: The deployment does not have available replicas\nNodeCADaemonAvailable: The daemon set node-ca has available replicas\nImagePrunerAvailable: Pruner CronJob has been created
May 06 00:50:27.986 - 50s   E clusteroperator/image-registry condition/Available reason/NoReplicasAvailable status/False Available: The deployment does not have available replicas\nNodeCADaemonAvailable: The daemon set node-ca has available replicas\nImagePrunerAvailable: Pruner CronJob has been created

Those two events happened within the upgrade window (but the logs indicate no replicas, which I'm betting is why the test failed):

$ cat e2e-events_20240506-000613.json | jq '.items[] | select(.source == "KubeEvent" and .locator.keys.clusterversion? == "cluster")| "\(.from) \(.to) \(.message.reason)"'
"2024-05-06T00:07:00Z 2024-05-06T00:07:00Z UpgradeStarted"
"2024-05-06T00:58:25Z 2024-05-06T00:58:25Z UpgradeVersion"
"2024-05-06T00:58:25Z 2024-05-06T00:58:25Z UpgradeComplete"

@wking
Member

wking commented May 6, 2024

re: the last /payload with vsphere:

image

I'm not clear on why that run has an Available=False image-registry while we don't have any exceptions in place around that component today besides a single-replica carve-out. This wasn't a single-control-plane-node test:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/openshift-origin-28735-ci-4.16-e2e-vsphere-ovn-upgrade/1787257947932332032/artifacts/e2e-vsphere-ovn-upgrade/gather-extra/artifacts/nodes.json | jq -r '.items[].metadata.name'
ci-op-6wykcgk2-d2645-7nlts-master-0
ci-op-6wykcgk2-d2645-7nlts-master-1
ci-op-6wykcgk2-d2645-7nlts-master-2
ci-op-6wykcgk2-d2645-7nlts-worker-0-6bdsp
ci-op-6wykcgk2-d2645-7nlts-worker-0-8c5wm
ci-op-6wykcgk2-d2645-7nlts-worker-0-kxfhn
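
A label-based count is a slightly more robust way to confirm the same thing than reading node names. The sketch below is in Python and assumes the gathered nodes.json carries the standard `node-role.kubernetes.io/master` or `node-role.kubernetes.io/control-plane` labels under `metadata.labels`; the listing above prints only names, so the labels and the shortened node names in the sample are hypothetical.

```python
# Standard Kubernetes role labels for control-plane nodes.
CONTROL_PLANE_LABELS = (
    "node-role.kubernetes.io/control-plane",
    "node-role.kubernetes.io/master",
)


def control_plane_nodes(nodes_doc):
    """Return the names of nodes that carry a control-plane role label."""
    names = []
    for node in nodes_doc.get("items", []):
        meta = node.get("metadata") or {}
        labels = meta.get("labels") or {}
        if any(label in labels for label in CONTROL_PLANE_LABELS):
            names.append(meta.get("name"))
    return names


# Hypothetical sample: names and labels are illustrative, not from the job artifacts.
sample_nodes = {"items": [
    {"metadata": {"name": "master-0", "labels": {"node-role.kubernetes.io/master": ""}}},
    {"metadata": {"name": "master-1", "labels": {"node-role.kubernetes.io/master": ""}}},
    {"metadata": {"name": "master-2", "labels": {"node-role.kubernetes.io/control-plane": ""}}},
    {"metadata": {"name": "worker-0", "labels": {"node-role.kubernetes.io/worker": ""}}},
]}
print(len(control_plane_nodes(sample_nodes)) > 1)
```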

And the cluster was configured for highly-available infrastructure (which includes the registry):

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/openshift-origin-28735-ci-4.16-e2e-vsphere-ovn-upgrade/1787257947932332032/artifacts/e2e-vsphere-ovn-upgrade/gather-must-gather/artifacts/must-gather.tar | tar -xOz registry-apps-build02-vmc-ci-openshift-org-ci-op-6wykcgk2-stable-sha256-e7b33149e705570ebcdcebe24c57af8336229175099fb5d53100330fd61015f1/cluster-scoped-resources/config.openshift.io/infrastructures/cluster.yaml | yaml2json | jq -r .status.infrastructureTopology
HighlyAvailable

And yet:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/openshift-origin-28735-ci-4.16-e2e-vsphere-ovn-upgrade/1787257947932332032/artifacts/e2e-vsphere-ovn-upgrade/gather-extra/artifacts/deployments.json | jq -c '.items[] | select(.metadata.name == "image-registry").spec | {replicas, strategy}'
{"replicas":1,"strategy":{"type":"Recreate"}}

I don't think the registry operator should be trying to wake the admin from sleep with an Available=False ClusterOperator condition, when it is configuring 1 replica and a Recreate strategy (which makes continual availability impossible). Either the registry operator should configure its operand to be more available (in line with infrastructureTopology: HighlyAvailable), or it should accept that 1 Recreate pod will not be highly available and not alarm anyone on a brief, expected pod-handoff gap to match the Available godocs contract.
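
The argument above can be sketched as a check on the Deployment spec. This Python sketch is an illustration only: the function name `availability_gaps` and the wording of the reasons are assumptions. The defaults it uses (1 replica, `RollingUpdate`) are the Kubernetes Deployment defaults.

```python
def availability_gaps(spec):
    """List reasons why a Deployment spec cannot promise continuous availability."""
    reasons = []
    replicas = spec.get("replicas")
    if replicas is None:
        replicas = 1  # Kubernetes default when replicas is unset
    strategy = (spec.get("strategy") or {}).get("type", "RollingUpdate")
    if replicas < 2:
        reasons.append("single replica: losing the one pod or its node leaves nothing serving")
    if strategy == "Recreate":
        reasons.append("Recreate strategy: old pods are stopped before new ones start")
    return reasons


# The spec printed by the jq command above.
registry_spec = {"replicas": 1, "strategy": {"type": "Recreate"}}
for reason in availability_gaps(registry_spec):
    print(reason)
```

For the image-registry spec shown above, both reasons apply, which matches the brief Available=False gaps seen around each pod handoff.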

[edit: Ah, looks like the single replica may be expected, and the Available=False noise is being tracked in OCPBUGS-22382]

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

4 participants