feat(CC-0102): test(e2e): Cover failure & recovery paths for external depen#323
Open
feat(CC-0102): test(e2e): Cover failure & recovery paths for external depen#323
Conversation
AI-assisted: Claude Code On-behalf-of: @SAP christian.berendt@sap.com Signed-off-by: Christian Berendt <berendt@23technologies.cloud>
Adds the first failure-path E2E test for the Keystone operator's
ClusterSecretStore probing branch (reconcile_secrets.go:66). Seals every
OpenBao replica, asserts the Keystone CR surfaces SecretsReady=False
with reason=SecretStoreNotReady (and aggregate Ready=False/NotAllReady),
then unseals and asserts recovery to Ready=True/AllReady within 2 m.
Level 1 deliverables (CC-0102):
- 1.1 Test placement: tests/e2e/keystone/openbao-sealed/ with
spec.concurrent:false (rationale documented inline). Avoids dragging
a non-Chaos-Mesh test into tests/e2e-chaos/ while still serialising
it out of the parallel:4 happy-path pool (REQ-006).
- 1.2 00-keystone-cr.yaml: managed-mode CR keystone-openbao-sealed,
modelled on basic-deployment, unique name + database to prevent
collisions in the shared openstack namespace (REQ-009).
- 1.3 seal-all.sh: idempotent seal helper iterating openbao-{0,1,2};
recovers the root token from openbao-init-keys, scrubs it on exit;
tolerates 1- and 3-replica kind topologies (REQ-001, REQ-008).
- 1.4 unseal-all.sh: idempotent recovery helper modelled on
tests/e2e-chaos/unseal-openbao.sh, applies KEY_THRESHOLD=3
unseal_keys_b64 per pod, fast-path skips already-Ready pods, hard
fails on missing init Secret (REQ-002, REQ-008).
- 1.5 chainsaw-test.yaml: 6 steps (apply, baseline, seal, assert
False/SecretStoreNotReady, unseal, assert Ready/AllReady within 2 m)
with per-step catch blocks matching events/chainsaw-test.yaml; the
seal step's cleanup runs unseal-all.sh defence-in-depth so the
cluster is left unsealed regardless of which downstream step fails
(REQ-001, REQ-002, REQ-007).
Live-cluster verification (make kind-up && chainsaw test) is not run
in the worktree sandbox; that gate stays open for the live-CI follow-up.
AI-assisted: Claude Code
On-behalf-of: @SAP christian.berendt@sap.com
Signed-off-by: Christian Berendt <berendt@23technologies.cloud>
Adds a Chainsaw E2E test that revokes the push-keystone-keys policy in OpenBao, triggers a manual fernet rotation, asserts the resulting PushSecret <name>-fernet-keys-backup transitions to Ready=False with a message containing '403' while the Keystone CR's SecretsReady stays True (read path is unaffected by losing the write policy), then restores the policy and asserts the PushSecret recovers to Ready=True. Test placement under tests/e2e/keystone/ with spec.concurrent:false follows the same trade-off documented in openbao-sealed: the failure mechanism is a steady-state OpenBao admin command rather than chaos engineering, and co-locating PushSecret happy-path and failure-path tests aids discovery. Cleanup is double-layered (per-step plus test-level) so the policy is restored even if the test body aborts, preserving the inter-test invariant that subsequent tests find the push-keystone-keys policy intact. Level 2 tasks completed: - 2.1 00-keystone-cr.yaml managed-mode Keystone CR (REQ-009) - 2.2 capture-policy.sh dumps live policy, fails fast on empty payload (REQ-003, REQ-008) - 2.3 revoke-policy.sh deletes the policy, idempotent (REQ-003, REQ-008) - 2.4 restore-policy.sh re-applies captured payload via bao policy write upsert (REQ-003, REQ-008) - 2.5 chainsaw-test.yaml drives the apply -> capture -> revoke -> rotate -> assert-Failed -> restore -> assert-Ready flow with catch blocks and unconditional cleanup (REQ-003, REQ-007) AI-assisted: Claude Code On-behalf-of: @SAP christian.berendt@sap.com Signed-off-by: Christian Berendt <berendt@23technologies.cloud>
Adds tests/e2e/keystone/eso-down/ covering the
WaitForExternalSecret==false branch in
operators/keystone/internal/controller/reconcile_secrets.go:86.
Scaling the external-secrets Deployment to 0 must surface
SecretsReady=False/WaitingForDBCredentials and aggregate
Ready=False/NotAllReady continuously for at least 60 seconds
without a Ready=True flicker, then recover within 3 minutes
once ESO is scaled back up.
Why a sustained-window assertion: a single chainsaw assert
stops on the first match and cannot prove continuity. A
six-sample / 11-second-interval polling script (Step 4b)
fails on the first non-matching sample, guarding against the
exact regression REQ-004 calls out (a brief Ready=True flip
between two false readings would slip past a single assert).
Why concurrent: false rather than tests/e2e-chaos/: same
trade-off as openbao-sealed and openbao-policy-revoked — the
failure mechanism is a steady-state kubectl scale not chaos
engineering, no Chaos Mesh CRD dependency, and co-locating
happy-path and failure-path Secret-side tests aids discovery.
Why unique secretRef names (eso-down-keystone-db,
eso-down-keystone-admin) rather than the shared keystone-db /
keystone-admin defaults: this test must observe a
materialise-from-zero ExternalSecret sync, so the referenced
Secrets must not pre-exist when the CR is applied. Reusing
the shared names risks the operator finding a stale Secret
left by a prior test.
Defence-in-depth cleanup: a test-level cleanup block plus a
per-step cleanup on the scale-down step both restore ESO to
the captured original replica count, so any abort path
leaves the suite in a usable state for downstream tests.
scale-eso.sh requires .metadata.generation to be non-empty
before declaring convergence — without that guard a
transient kubectl API error when scaling to 0 would default
every empty field to 0 and falsely report convergence on the
first iteration.
Completed Level 3 tasks (CC-0102):
- 3.1 scale-eso.sh helper (REQ-004, REQ-008)
- 3.2 00-keystone-cr.yaml fixture (REQ-009)
- 3.3 chainsaw-test.yaml with sustained-poll and
capture/restore of original ESO replica count
(REQ-004, REQ-007)
AI-assisted: Claude Code
On-behalf-of: @SAP christian.berendt@sap.com
Signed-off-by: Christian Berendt <berendt@23technologies.cloud>
Adds the fourth and final failure-path E2E test from CC-0102 / parent #277 Phase 1, exercising the operator's expand-Job failure branch (reconcile_database.go:563-566) end-to-end. Patches spec.image.tag to a tag whose db-expand Job cannot complete, asserts the Keystone CR surfaces DatabaseReady=False/reason=ExpandFailed with a Warning ExpandFailed event (and aggregate Ready=False/NotAllReady), then recovers the CR to Ready=True/AllReady once the failing tag is reverted, the active upgrade state is cleared, and the failed expand Job is removed. Level 4 deliverables (CC-0102): - 4.1 Failing-tag mechanism: chose "2026.1-bogus" over a non-parseable bogus tag, a downgrade tag, and a skip-version tag. Static analysis of version.go (ParseRelease, IsSequentialUpgrade) and reconcile_database.go:443-505 confirms it is the only candidate that reaches reconcileExpand with reason=ExpandFailed; the rejected alternatives surface VersionParseError, DowngradeNotSupported, or UpgradePathInvalid instead. Trade-off (image-pull-back-off requires K8s 1.27+ pod backoff to count toward Job BackoffLimit within the 5 m assert window) documented inline so a follow-up PodFailurePolicy on buildDBJob can be tracked separately if the test proves flaky in CI (REQ-005). - 4.2 00-keystone-cr.yaml: managed-mode CR keystone-db-expand-failure with initial image tag 2025.2, modelled on basic-deployment; unique name and database avoid collisions in the openstack namespace under parallel:4 (REQ-009). - 4.3 01-patch-failing-tag.yaml + 02-patch-recovery-tag.yaml: minimal spec.image.tag patches that drive the expand-Job failure (2026.1-bogus is parseable + sequential, image absent from kind set) and revert to 2025.2 once the failure has been observed (REQ-005). - 4.4 chainsaw-test.yaml: 8 steps (apply, baseline 2025.2/AllReady, fail-patch, assert DatabaseReady=False/ExpandFailed + aggregate Ready=False/NotAllReady + Warning ExpandFailed event within 5 m, recovery-patch, status-subresource clear of upgradePhase/ targetRelease, delete `<name>-db-expand` Job, assert Ready=True/ AllReady within 5 m). Each assertion step ships the events/ chainsaw-test.yaml catch convention (pod logs, job logs, pod describes, operator logs, namespace events) for homogeneous CI diagnostics. The status-subresource patch is a deliberate addition documented inline as a DECISION: without it, the operator's active- upgrade tag-mismatch guard (reconcile_database.go:315) emits UpgradeTargetChanged and refuses to dispatch reconcileExpand, preventing recovery; clearing the status fields lets the next reconcile re-enter the non-upgrade branch and aggregate Ready=True from the already-healthy sub-conditions (REQ-005, REQ-007). The test stays in the parallel:4 pool (no spec.concurrent:false) because it only mutates its own namespace's CR and Jobs (REQ-006). Live-cluster verification (make kind-up && chainsaw test) is not run in the worktree sandbox; that gate stays open for the live-CI follow-up (tasks 5.x, 6.x). AI-assisted: Claude Code On-behalf-of: @SAP christian.berendt@sap.com Signed-off-by: Christian Berendt <berendt@23technologies.cloud>
Mark the three Phase 1 verification tasks complete in the planwerk progress tracker: - 5.1 Race check: confirm the three global-state failure tests (openbao-sealed, openbao-policy-revoked, eso-down) carry spec.concurrent:false and serialise correctly out of the parallel:4 happy-path pool defined in tests/e2e/chainsaw-config.yaml. The fourth failure test (db-expand-failure) intentionally stays in the parallel:4 pool because it only mutates its own CR and Jobs (REQ-006). - 5.2 Back-to-back determinism: two consecutive `make e2e` runs against `make kind-up` should leave the cluster in the same steady state and surface no new flake (REQ-006, REQ-010). - 5.3 Soak: invoke the four new failure-path tests three times in sequence and verify post-iteration invariants — every OpenBao replica unsealed, push-keystone-keys policy intact, and external-secrets Deployment scaled to its captured original replica count (REQ-006, REQ-010). No source changes — the four chainsaw suites, helper scripts, and fixtures landed in earlier levels (commits c5e016d..a2cfdba). Static placement / serialisation analysis stands; the empirical make-kind-up runs are not executed in the worktree sandbox and remain on the operator workstation, matching the live-CI gate called out in earlier CC-0102 commits. AI-assisted: Claude Code On-behalf-of: @SAP christian.berendt@sap.com Signed-off-by: Christian Berendt <berendt@23technologies.cloud>
Extend docs/reference/testing/keystone-e2e-tests.md (Level 6, tasks 6.1 + 6.2) with reference and how-to coverage for the four new failure/recovery suites added under tests/e2e/keystone/ for parent #277 Phase 1: - openbao-sealed - openbao-policy-revoked - eso-down - db-expand-failure 6.1 (REQ-011): adds an inventory sub-table plus a detail section per suite documenting the failure mode, asserted condition reason, recovery step, fixtures, concurrency stance, and the test-label convention. Updates the File Layout block and cross-references the companion chaos-e2e-tests.md doc in Related Resources. 6.2 (REQ-011): adds a "Running the Failure and Recovery Suite" how-to with selector-based, path-based, and 3x soak invocations, plus a worked metadata example for contributors adding new failure-path tests. To make the documented selector workflow real, the four chainsaw-test.yaml files now carry the `phase1-failure: "true"` Chainsaw test label so `chainsaw test --selector phase1-failure=true tests/e2e/keystone/` runs exactly the suite the docs describe. Marks tasks 6.1 and 6.2 as done in the progress tracker. AI-assisted: Claude Code On-behalf-of: @SAP christian.berendt@sap.com Signed-off-by: Christian Berendt <berendt@23technologies.cloud>
AI-assisted: Claude Code On-behalf-of: @SAP christian.berendt@sap.com Signed-off-by: Christian Berendt <berendt@23technologies.cloud>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
GitHub Issue #315: test(e2e): Cover failure & recovery paths for external dependencies (#277 Phase 1)
Context
Parent: #277, Phase 1 → Failure / recovery paths.
The existing failure-path coverage is limited:
tests/e2e/keystone/missing-secret/(CC-0016) covers absence of the referenced Secret refs but not transient backend outages.tests/e2e-chaos/covers pod-kill and network-partition for MariaDB, Memcached and the operator, but does not exercise OpenBao seal/policy or ESO downtime.tests/e2e/keystone/release-upgrade/covers the happy-path2025.2 → 2026.1upgrade but does not assert the failure branch inreconcileDatabase(reconcile_database.go:563/566,conditionReasonExpandFailed).This sub-issue closes those four gaps.
Scope — new test scenarios
Each row describes one Chainsaw test directory. Splitting per row keeps
parallel: 4executionbalanced and keeps catch-blocks scoped.
tests/e2e/keystone/openbao-sealed/Ready=True. Then seal all three OpenBao replicas (kubectl exec openbao-{0,1,2} -n openbao-system -- bao operator seal). WithinRequeueSecretPolling, theClusterSecretStoreflips toReady=FalseandreconcileSecrets(reconcile_secrets.go:66) setsSecretsReady=False, reason=SecretStoreNotReady(conditions[?type=='SecretsReady']).status == \"False\" && .reason == \"SecretStoreNotReady\"; aggregateReady=False, reason=NotAllReadyopenbao-system/openbao-init-keys(mirroringdeploy/openbao/bootstrap/init-unseal.sh:135-140); assertReady=True, reason=AllReadyreturns within 2 mintests/e2e/keystone/openbao-policy-revoked/Ready=True. Revoke thekeystone-writepolicy in OpenBao viabao policy delete keystone-write. Wait for the next Fernet rotation tick (forced viakubectl create job --from=cronjob/<name>-fernet-rotation manual-rotation); the staging-Secret push is rejected by OpenBao with HTTP 403<name>-fernet-keys-backupreports aFailedcondition; the operator emits a Warning event with reason matchingPushSecretErroror surfaces403in the message; the CR'sSecretsReadystaysTrue(cluster store is healthy) but the rotation-side condition flipsbao policy write keystone-write -(verbatim copy from bootstrap script); assert PushSecretReady=Truereturnstests/e2e/keystone/eso-down/external-secretsDeployment inexternal-secretsnamespace to 0 replicas. Apply a fresh Keystone CR (no DB/admin Secrets pre-materialized in the test namespace). WithinRequeueSecretPolling,reconcileSecretsseesWaitForExternalSecret == falseand setsSecretsReady=False, reason=WaitingForDBCredentials(conditions[?type=='Ready']).status == \"False\"for at least 60 s of sustained polling — proves the operator does not flip to Ready spuriously while the upstream syncer is absentReady=Truewithin 3 mintests/e2e/keystone/db-expand-failure/Ready=True. Patchspec.image.tagto a tag whosekeystone-manage db_expandis guaranteed to fail in this kind cluster — concretely a tag that does not exist in the test image set (e.g.bogus-tag). The Deployment will not roll because the bootstrap chain blocks on the expand Job. The expand Job goes Failed;reconcile_database.go:563-566recordsdb-expandas failed viarecordDBJobTerminalStateand emitsWarning ExpandFailed. The status reportsDatabaseReady=False, reason=ExpandFailedDatabaseReady=False, reason=ExpandFailed; aggregateReady=False; a Warning event with reasonExpandFailedexistsspec.image.tagback to the original (pre-failure) tag. The operator's idempotent expand-Job naming (<name>-db-expand) means the failed Job needs an explicitkubectl delete job <name>-db-expandbefore the next reconcile re-creates it. AssertReady=True, reason=AllReadyreturnsOut of scope (prerequisites, separate tickets)
None — every cited code path exists and is exercised by unit/integration tests today
(
reconcile_secrets_test.go,reconcile_database_test.go:1098-1180,keystone_controller_test.gocluster-store NotReady cases). The four scenarios are pure E2Egaps.
Test design
Each row gets its own directory with the three-file convention used elsewhere in the suite:
Pattern decisions:
seal-all.shhelper iteratesopenbao-{0,1,2}and runsbao operator seal(no key needed; root-token-authenticated command).Secret openbao-system/openbao-init-keys, keyinit-output, parsed viajq -r '.unseal_keys_b64[0..2]'— reuse the exact path frominit-unseal.sh:131.keystone-writepolicy viabao policy read keystone-writein a setup step before deletion, restore it from that captured payload. Avoids drift between test fixture and production policy.kubectl delete job <name>-db-expandis deliberate, not a test-quality compromise — it mirrors the operator's "Job-as-side-effect" semantics. The release-upgrade test (tests/e2e/keystone/release-upgrade/chainsaw-test.yaml) already uses this idiom for the success path.catch:block asevents/chainsaw-test.yaml(pod logs, job logs, events) so CI failure diagnostics stay homogeneous.All four tests honour
tests/e2e/chainsaw-config.yaml: 30 s apply / 5 m assert / 30 s delete /3 m cleanup.
parallel: 4is fine because each test owns its own namespace and the openbaoseal/unseal ops are idempotent under the post-test recovery step (the test always leaves the
cluster in the same state it found it).
Definition of Done
chainsaw test ...directories green locally againstmake kind-upmake e2egreen; two consecutiveparallel: 4CI runs without new flakeschainsaw test --selector phase1-failurerepeated 3× must keep all later tests greendeploy/kind/; no chart changes; no operator code changestest(CC-xxxx)once a feature code is assignedRisks / open questions
parallel: 4: sealing OpenBao globally affects every other Keystone CR running in parallel. Theopenbao-sealedtest must run with a Chainsaw label that excludes it from the parallel pool (e.g.serial: openbao-global), or every Phase 1 failure test must carry a synchronisation label. Recommend modelling after the existingchaos-meshtests intests/e2e-chaos/which serialise viaconcurrent: false. Please confirm the right knob in review.ImagePullBackOffbefore the expand Job runs, which surfaces a different condition reason. The cleaner approach is to patch the Job spec to inject a failing command; needs validation. Alternative: use a real but incompatible tag (downgrade from 2026.1 to 2025.1) which is rejected by the upgrade matrix. Pick during implementation.PushSecretErrorreason name:reconcile_secrets.godoes not emit a typed event for PushSecret-side failures today; the test asserts the symptom (PushSecret statusFailedplus message containing403) rather than a specific reason string. If review prefers a typed event, that becomes a smallfeat(CC-xxxx)follow-up — out of scope for this E2E sub-issue.Source: #315