-
Notifications
You must be signed in to change notification settings - Fork 31
OCPBUGS-60564: [OTE] Add webhook to validate openshift-service-ca certificate rotation #450
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OCPBUGS-60564: [OTE] Add webhook to validate openshift-service-ca certificate rotation #450
Conversation
@camilamacedo86: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/6d0aec90-7d55-11f0-9f6f-67d156b2d624-0 |
@camilamacedo86: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7f328464-7d55-11f0-88c5-080334c7b5fe-0 |
b736e1b
to
fa7e8b5
Compare
@@ -167,6 +168,64 @@ var _ = Describe("[sig-olmv1][OCPFeatureGate:NewOLMWebhookProviderOpenshiftServi | |||
})) | |||
}) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Orginal that does not work well was:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What was the problem?
err := k8sClient.Delete(ctx, signingKeySecret, client.PropagationPolicy(metav1.DeletePropagationBackground)) | ||
Expect(client.IgnoreNotFound(err)).ToNot(HaveOccurred()) | ||
|
||
By("waiting for the webhook operator's service certificate secret to be recreated and populated") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See that now we wait and ensure first
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need all of this extra code to check if the secret is recreated though? The next Eventually
block will fail anyway if the secret isn't recreated - and we'll know why when we dump the info on failure (from line 210)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that solves the issue. It can pass locally without the explicit wait, but it fails in CI due to timing. For OPC ci/Sippy we need to be conservative, so this change gates the check until the service cert secret is actually recreated and populated before we probe the webhook—eliminating the race and avoiding flakes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seem we had two problems: the Eventually
block's timing settings were too short and we had a potential false positive due to a race condition, i.e. we may be checking for successful creation of the resource before the deletion of the signing key secret has time to propagate and invalidate the certificate.
We may still have a similar race condition here where we check for the existence of the certificate secret before the deletion of the signing key secret had time to propagate. This could lead to flaky behavior as the next Eventually
block will be doing this one's job (since anyway having a the certificate secret created and populated if anyway a precondition for success, it will need to wait for it to succeed) and it doesn't have sufficient retry budget to reach success.
I would suggest we update this to:
- ensure the current certificate secret exists and extract its UID
- delete the signing key certificate secret
- wait until the certificate secret exists and had a different UID
I don't think checking that its populated is strictly necessary since this is anyway a pre-condition for the following Eventually
block to succeed (and we'll be able to diagnose the problem with the failure logging). But, I'm ok with keeping it.
Great catch!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @perdasilva thank you for looking on this
I addressed your suggestions for we either more conservative within, but now we need wait to pass in the tests here + a new agreegate
So
/hold
Until we have all in place again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will run 1 of 1 specs
------------------------------
[sig-olmv1][OCPFeatureGate:NewOLMWebhookProviderOpenshiftServiceCA][Skipped:Disconnected][Serial] OLMv1 operator with webhooks should be tolerant to openshift-service-ca certificate rotation
/Users/camilam/go/src/github/operator-framework-operator-controller/openshift/tests-extension/test/webhooks.go:171
STEP: initializing Kubernetes client @ 08/20/25 19:43:20.431
STEP: requiring OLMv1 capability on OpenShift @ 08/20/25 19:43:20.431
STEP: ensuring no ClusterExtension and CRD from a previous run @ 08/20/25 19:43:20.565
STEP: checking if the webhook-operator-catalog exists @ 08/20/25 19:43:20.817
STEP: webhook-operator catalog webhook-operator-catalog already exists, skipping creation @ 08/20/25 19:43:20.947
STEP: installing the webhook operator in namespace webhook-operator-6xmxx @ 08/20/25 19:43:20.947
STEP: creating a ClusterRoleBinding to cluster-admin for the webhook operator @ 08/20/25 19:43:21.332
STEP: waiting for the webhook operator to be installed @ 08/20/25 19:43:21.718
STEP: waiting for the webhook operator's service to be ready @ 08/20/25 19:43:25.226
STEP: waiting for the webhook operator's service certificate secret to exist and be populated @ 08/20/25 19:43:25.351
STEP: ensuring the webhook operator's service certificate secret exists and getting its ResourceVersion @ 08/20/25 19:43:25.487
STEP: deleting the openshift-service-ca signing-key secret @ 08/20/25 19:43:25.645
STEP: waiting for the webhook operator's service certificate secret to be recreated with a new ResourceVersion @ 08/20/25 19:43:25.774
STEP: checking webhook is responsive through cert rotation @ 08/20/25 19:44:36.874
STEP: performing webhook operator cleanup @ 08/20/25 19:44:57.516
STEP: cleanup: deleting ClusterExtension webhook-operator-6xmxx @ 08/20/25 19:44:57.516
STEP: cleanup: deleting ClusterRoleBinding webhook-operator-6xmxx-operator-crb @ 08/20/25 19:44:57.66
STEP: cleanup: deleting ServiceAccount webhook-operator-6xmxx-installer in namespace webhook-operator-6xmxx @ 08/20/25 19:44:57.804
STEP: cleanup: deleting namespace webhook-operator-6xmxx @ 08/20/25 19:44:57.938
STEP: waiting for namespace webhook-operator-6xmxx to be fully deleted @ 08/20/25 19:44:58.074
• [205.468 seconds]
------------------------------
Ran 1 of 1 Specs in 205.468 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 0 Skipped
Running Suite: - /Users/camilam/go/src/github/operator-framework-operator-controller/openshift/tests-extension
===============================================================================================================
Random Seed: 1755715343 - will randomize all specs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/hold
@camilamacedo86: This pull request references Jira Issue OCPBUGS-60564, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
g.Expect(client.IgnoreNotFound(err)).ToNot(HaveOccurred(), fmt.Sprintf("failed to delete test resource %s: %v", resourceName, err)) | ||
}).WithTimeout(5 * time.Minute).WithPolling(10 * time.Second).Should(Succeed()) | ||
|
||
DeferCleanup(func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add debug to validate the cert in case of failures
It should us identify the issue better if we need to do so.
/jira refresh |
@camilamacedo86: This pull request references Jira Issue OCPBUGS-60564, which is valid. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (bandrade@redhat.com), skipping review request. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
fa7e8b5
to
ccef44f
Compare
/payload-aggregate periodic-ci-openshift-release-master-ci-4.20-e2e-gcp-ovn-techpreview-serial 10 (after address changes: #450 (comment)) |
@camilamacedo86: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/a98be120-7db3-11f0-932d-104de548d36c-0 |
@camilamacedo86: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b54b3bf0-7db3-11f0-9ec5-819302d54b72-0 |
aafb7b5
to
f8e9ba0
Compare
/payload-aggregate periodic-ci-openshift-release-master-ci-4.20-e2e-gcp-ovn-techpreview-serial 10 |
@camilamacedo86: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/52b277e0-7df6-11f0-99c5-0f4a557cd6de-0 |
@camilamacedo86: This pull request references Jira Issue OCPBUGS-60564, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (bandrade@redhat.com), skipping review request. The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
@camilamacedo86: This pull request references Jira Issue OCPBUGS-60564, which is valid. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (bandrade@redhat.com), skipping review request. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
/test tests-extension |
… certificate rotation This change is a refactor of code from openshift/origin#30059. Assisted-by: Gemini
f8e9ba0
to
dbad83e
Compare
Eventually(func(g Gomega) { | ||
secret := &corev1.Secret{} | ||
err := k8sClient.Get(ctx, client.ObjectKey{Name: certificateSecretName, Namespace: webhookOperatorInstallNamespace}, secret) | ||
g.Expect(err).ToNot(HaveOccurred(), fmt.Sprintf("failed to get secret %s/%s", webhookOperatorInstallNamespace, certificateSecretName)) | ||
oldSecretResourceVersion = secret.ResourceVersion | ||
g.Expect(oldSecretResourceVersion).ToNot(BeEmpty(), "expected secret ResourceVersion to not be empty") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@perdasilva I did your suggestion here as well
97f848e
to
38b521a
Compare
…tion test, mirroring the logic used in the certificate rotation test. This makes the test more robust by ensuring a new secret is created, not just that an existing one is still present.
38b521a
to
f3a2561
Compare
/hold cancel I think we can move forward with it we check that this check is not so longer flaking: #450 (comment) . https://pr-payload-tests.ci.openshift.org/runs/ci/52b277e0-7df6-11f0-99c5-0f4a557cd6de-0 We just make the other test more robust so will be less flake as well :-) |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: camilamacedo86, perdasilva The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
0bb1953
into
openshift:main
@camilamacedo86: Jira Issue OCPBUGS-60564: Some pull requests linked via external trackers have merged: The following pull requests linked via external trackers have not merged: These pull request must merge or be unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with Jira Issue OCPBUGS-60564 has not been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
This PR introduces the test that we faced intermitent issues in the OCP test env.
You can compare the migration from original: https://github.com/openshift/origin/pull/30059/files#diff-6dd6fa78ac85235012d3c9910f8510bc1640c830a83888681bd5922cec0dffcbR149-R164
What fixed the test:
Tests
(agreggate 10 times)
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/52b277e0-7df6-11f0-99c5-0f4a557cd6de-0