Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test for internal TLS certificate rotation #15217

Merged
merged 24 commits into from
May 28, 2024

Conversation

mgencur
Copy link
Contributor

@mgencur mgencur commented May 17, 2024

Add test for TLS certificate rotation

Proposed Changes

  • Add test for TLS certificate rotation (including rotation of certificates for Activator, queue-proxy, Ingress)
  • Re-implement the local function scanPodLogs as a more versatile WaitForLog that can wait for a given condition.
  • Fix a bug related to ServingFlags.SkipCleanupOnFail. The cleanup should really be skipped only if the test fails. This was missed during reviews on the original PR
  • Function getCertManagerCA now also waits for the Secret to exist rather than failing immediately
  • Fix e2e-tests.sh for shell check

Release Note


@knative-prow knative-prow bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 17, 2024
@knative-prow knative-prow bot requested review from ReToCode and skonto May 17, 2024 08:05
@mgencur
Copy link
Contributor Author

mgencur commented May 17, 2024

/hold

@knative-prow knative-prow bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 17, 2024
@mgencur
Copy link
Contributor Author

mgencur commented May 17, 2024

/unhold

@knative-prow knative-prow bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 17, 2024
@mgencur
Copy link
Contributor Author

mgencur commented May 17, 2024

The test fails with:

system_internal_tls_test.go:234: Deleting trust-bundle ConfigMap
    stream.go:305: D 11:08:49.430 autoscaler-7d8fbf5b9-m86j7 [metrics/stats_scraper.go:252] [serving-tests/tls-certificate-rotation-ebvlcsmm-00001] |OldPods| = 1, |YoungPods| = 0
    spoof.go:110: Spoofing tls-certificate-rotation-ebvlcsmm.serving-tests.example.com -> 34.138.175.134
    system_internal_tls_test.go:254: The endpoint http://tls-certificate-rotation-ebvlcsmm.serving-tests.example.com/ didn't serve the expected text "Hello World! How about some tasty noodles?": response: status: 503, body: upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: TLS_error:|268435581:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED:TLS_error_end:TLS_error_end, headers: map[Content-Length:[230] Content-Type:[text/plain] Date:[Fri, 17 May 2024 11:08:49 GMT] Server:[istio-envoy] Zipkin_trace_id:[d99519b5453a913fa98600ab36eea963]] did not pass checks: status = 503 503 Service Unavailable, want one of: [200]

@ReToCode I suppose the certificates might not be properly reloaded yet after deleting the routing-serving-certs secret and before we remove the "trust-bundle" configmap?

@mgencur
Copy link
Contributor Author

mgencur commented May 17, 2024

The test passed when I added waiting for ca.crt in routing-serving-certs. Will try again during reviews.

Copy link
Member

@ReToCode ReToCode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuff, commented inline. I'm not sure about the error - could be anything IMHO and would need debugging to see where the 503 actually comes from. Probably not deleting the secret could be a start, though.

@mgencur mgencur force-pushed the certificate_rotation_test branch from d20e8fa to b26a59d Compare May 22, 2024 09:59
Copy link

codecov bot commented May 22, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.74%. Comparing base (b5455c9) to head (4e5bd95).
Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15217      +/-   ##
==========================================
- Coverage   84.78%   84.74%   -0.04%     
==========================================
  Files         218      218              
  Lines       13479    13479              
==========================================
- Hits        11428    11423       -5     
- Misses       1685     1687       +2     
- Partials      366      369       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@mgencur
Copy link
Contributor Author

mgencur commented May 23, 2024

@ReToCode this is ready for another review.

Copy link
Member

@ReToCode ReToCode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@knative-prow knative-prow bot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 23, 2024
@ReToCode
Copy link
Member

/override "style / suggester / shell"

Copy link

knative-prow bot commented May 23, 2024

@ReToCode: ReToCode unauthorized: /override is restricted to Repo administrators.

In response to this:

/override "style / suggester / shell"

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ReToCode
Copy link
Member

@dprotaso mind overriding the shell-check? IMHO, this should be in a separate PR, as only the time changed in this one.

@knative-prow knative-prow bot removed the lgtm Indicates that a PR is ready to be merged. label May 28, 2024
@knative-prow-robot knative-prow-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 28, 2024
The original intention was to skip the cleanup only when the test fail,
not always. This was missed in the original PR that introduced the flag.
@mgencur mgencur force-pushed the certificate_rotation_test branch from f79af1d to 7e3df10 Compare May 28, 2024 07:47
@knative-prow-robot knative-prow-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 28, 2024
@mgencur
Copy link
Contributor Author

mgencur commented May 28, 2024

@ReToCode @dprotaso OK. I went ahead and fixed the shell check.

@ReToCode
Copy link
Member

/lgtm
/approve

@knative-prow knative-prow bot added the lgtm Indicates that a PR is ready to be merged. label May 28, 2024
@mgencur
Copy link
Contributor Author

mgencur commented May 28, 2024

@ReToCode Hmm. The latest error looks genuine:

system_internal_tls_test.go:225: Deleting secret in system namespace
    spoof.go:110: Spoofing tls-certificate-rotation-jryhlssh.serving-tests.example.com -> 35.231.189.127
    stream.go:305: D 08:40:20.489 autoscaler-56bfd47c76-4m7mn [statforwarder/processor.go:119] [serving-tests/tls-certificate-rotation-jryhlssh-00001] Forward stat of bucket autoscaler-bucket-07-of-10 to the holder autoscaler-56bfd47c76-dx6fr_10.120.4.6
    stream.go:305: D 08:40:20.489 autoscaler-56bfd47c76-dx6fr [statforwarder/processor.go:64] [serving-tests/tls-certificate-rotation-jryhlssh-00001] Accept stat as owner of bucket autoscaler-bucket-07-of-10
system_internal_tls_test.go:231: Deleting secret in ingress namespace
    spoof.go:110: Spoofing tls-certificate-rotation-jryhlssh.serving-tests.example.com -> 35.231.189.127
    stream.go:305: D 08:40:20.605 autoscaler-56bfd47c76-dx6fr [statforwarder/processor.go:64] [serving-tests/tls-certificate-rotation-jryhlssh-00001] Accept stat as owner of bucket autoscaler-bucket-07-of-10
system_internal_tls_test.go:237: Deleting old certificate from trust-bundle ConfigMap
    stream.go:305: D 08:40:20.707 autoscaler-56bfd47c76-dx6fr [metrics/stats_scraper.go:252] [serving-tests/tls-certificate-rotation-jryhlssh-00001] |OldPods| = 1, |YoungPods| = 0
    spoof.go:110: Spoofing tls-certificate-rotation-jryhlssh.serving-tests.example.com -> 35.231.189.127
    system_internal_tls_test.go:266: The endpoint http://tls-certificate-rotation-jryhlssh.serving-tests.example.com didn't serve the expected text "Hello World! How about some tasty noodles?": response: status: 503, body: upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: TLS_error:|268435581:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED:TLS_error_end:TLS_error_end, headers: map[Content-Length:[230] Content-Type:[text/plain] Date:[Tue, 28 May 2024 08:40:20 GMT] Server:[istio-envoy] Zipkin_trace_id:[95ccff430d23ad30786fccb1f81036fa]] did not pass checks: status = 503 503 Service Unavailable, want one of: [200]

So, after deleting the old cert from the trust-bundle there's an error with TLS. Any idea why this could happen?

@mgencur
Copy link
Contributor Author

mgencur commented May 28, 2024

@ReToCode As you mentioned here, we should not be deleting the old cert from the trust-bundle ConfigMap because the test is quick and the certificates might not be rotated yet. So, let's keep the ConfigMap?

@ReToCode
Copy link
Member

@ReToCode As you mentioned #15217 (comment), we should not be deleting the old cert from the trust-bundle ConfigMap because the test is quick and the certificates might not be rotated yet. So, let's keep the ConfigMap?

+1, yes that could take a while to fully complete. Makes not much sense to wait for that in an e2e test.

The cert rotation might not be fully complete yet. This also adds a
little randomness to the test.
@knative-prow knative-prow bot removed the lgtm Indicates that a PR is ready to be merged. label May 28, 2024
@ReToCode
Copy link
Member

/lgtm
/approve

@knative-prow knative-prow bot added the lgtm Indicates that a PR is ready to be merged. label May 28, 2024
Copy link

knative-prow bot commented May 28, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mgencur, ReToCode

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ReToCode
Copy link
Member

IMHO, we can skip certmanager-integration-tests_serving_main, as per #15261

@dprotaso
Copy link
Member

/override "certmanager-integration-tests_serving_main"

Copy link

knative-prow bot commented May 28, 2024

@dprotaso: Overrode contexts on behalf of dprotaso: certmanager-integration-tests_serving_main

In response to this:

/override "certmanager-integration-tests_serving_main"

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copy link

knative-prow bot commented May 28, 2024

@mgencur: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
certmanager-integration-tests_serving_main 4e5bd95 link unknown /test certmanager-integration-tests

Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@knative-prow knative-prow bot merged commit b15ce9a into knative:main May 28, 2024
68 checks passed
ReToCode pushed a commit to ReToCode/serving that referenced this pull request May 29, 2024
* TestTLSCertificateRotation

* Use SkipCleanupOnFail only when the test fails

The original intention was to skip the cleanup only when the test fail,
not always. This was missed in the original PR that introduced the flag.

* Use default poll interval and timeout

* Fix lint

* Make function GetCASecret return the secret also for internal encryption

* Move WaitForLog to e2e package to prevent "redefined flag"

* Wait for 2 occurrences of cert reload

It is loaded at the beginning and then after re-creating the
serving-certs secret during the test.

* Increase timeout for systeminternaltls tests to 4m

* Wait for CA cert to be re-created in routing-serving-certs before checking state

* The test usually takes around 120 seconds

* Rename waitForCaCert to waitForCACertSecret

* Fix lint

* Use new ClusterIssuer for testing certificate rotation

Also, do not rely on ca.crt to be in routing-serving-certs. We rely
solely on the trust-bundle ConfigMap.

* Actually rename the new issuer

* Run ./hack/update-codegen.sh

* Create const for name of renewed issuer

* Fixes

* Update test

* Revert "Use new ClusterIssuer for testing certificate rotation"

* Fix lint and deps

* Use arrays for GO_TEST_FLAGS and E2E_TEST_FLAGS

This is to properly expand the variables when passing to tests and to
pass shellcheck.

* One more fix for shellcheck

* Add missing bracket

* Do NOT delete old cert from trust-bundle ConfigMap

The cert rotation might not be fully complete yet. This also adds a
little randomness to the test.

(cherry picked from commit b15ce9a)
openshift-merge-bot bot pushed a commit to openshift-knative/serving that referenced this pull request May 29, 2024
* TestTLSCertificateRotation

* Use SkipCleanupOnFail only when the test fails

The original intention was to skip the cleanup only when the test fail,
not always. This was missed in the original PR that introduced the flag.

* Use default poll interval and timeout

* Fix lint

* Make function GetCASecret return the secret also for internal encryption

* Move WaitForLog to e2e package to prevent "redefined flag"

* Wait for 2 occurrences of cert reload

It is loaded at the beginning and then after re-creating the
serving-certs secret during the test.

* Increase timeout for systeminternaltls tests to 4m

* Wait for CA cert to be re-created in routing-serving-certs before checking state

* The test usually takes around 120 seconds

* Rename waitForCaCert to waitForCACertSecret

* Fix lint

* Use new ClusterIssuer for testing certificate rotation

Also, do not rely on ca.crt to be in routing-serving-certs. We rely
solely on the trust-bundle ConfigMap.

* Actually rename the new issuer

* Run ./hack/update-codegen.sh

* Create const for name of renewed issuer

* Fixes

* Update test

* Revert "Use new ClusterIssuer for testing certificate rotation"

* Fix lint and deps

* Use arrays for GO_TEST_FLAGS and E2E_TEST_FLAGS

This is to properly expand the variables when passing to tests and to
pass shellcheck.

* One more fix for shellcheck

* Add missing bracket

* Do NOT delete old cert from trust-bundle ConfigMap

The cert rotation might not be fully complete yet. This also adds a
little randomness to the test.

(cherry picked from commit b15ce9a)

Co-authored-by: Martin Gencur <mgencur@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants