feat(CC-0102): test(e2e): Cover failure & recovery paths for external depen by berendt · Pull Request #323 · C5C3/forge

berendt · 2026-04-30T19:53:16Z

GitHub Issue #315: test(e2e): Cover failure & recovery paths for external dependencies (#277 Phase 1)

Context

Parent: #277, Phase 1 → Failure / recovery paths.

The existing failure-path coverage is limited:

tests/e2e/keystone/missing-secret/ (CC-0016) covers absence of the referenced Secret refs but not transient backend outages.
tests/e2e-chaos/ covers pod-kill and network-partition for MariaDB, Memcached and the operator, but does not exercise OpenBao seal/policy or ESO downtime.
tests/e2e/keystone/release-upgrade/ covers the happy-path 2025.2 → 2026.1 upgrade but does not assert the failure branch in reconcileDatabase (reconcile_database.go:563/566, conditionReasonExpandFailed).

This sub-issue closes those four gaps.

Scope — new test scenarios

Each row describes one Chainsaw test directory. Splitting per row keeps parallel: 4 execution
balanced and keeps catch-blocks scoped.

#	Test directory	Scenario	Expected status	Recovery step
1	`tests/e2e/keystone/openbao-sealed/`	Apply CR → `Ready=True`. Then seal all three OpenBao replicas (`kubectl exec openbao-{0,1,2} -n openbao-system -- bao operator seal`). Within `RequeueSecretPolling`, the `ClusterSecretStore` flips to `Ready=False` and `reconcileSecrets` (`reconcile_secrets.go:66`) sets `SecretsReady=False, reason=SecretStoreNotReady`	`(conditions[?type=='SecretsReady']).status == \"False\" && .reason == \"SecretStoreNotReady\"`; aggregate `Ready=False, reason=NotAllReady`	Unseal all three pods using the three threshold keys from Secret `openbao-system/openbao-init-keys` (mirroring `deploy/openbao/bootstrap/init-unseal.sh:135-140`); assert `Ready=True, reason=AllReady` returns within 2 min
2	`tests/e2e/keystone/openbao-policy-revoked/`	Apply CR → `Ready=True`. Revoke the `keystone-write` policy in OpenBao via `bao policy delete keystone-write`. Wait for the next Fernet rotation tick (forced via `kubectl create job --from=cronjob/<name>-fernet-rotation manual-rotation`); the staging-Secret push is rejected by OpenBao with HTTP 403	The PushSecret `<name>-fernet-keys-backup` reports a `Failed` condition; the operator emits a Warning event with reason matching `PushSecretError` or surfaces `403` in the message; the CR's `SecretsReady` stays `True` (cluster store is healthy) but the rotation-side condition flips	Re-write the policy via `bao policy write keystone-write -` (verbatim copy from bootstrap script); assert PushSecret `Ready=True` returns
3	`tests/e2e/keystone/eso-down/`	Scale `external-secrets` Deployment in `external-secrets` namespace to 0 replicas. Apply a fresh Keystone CR (no DB/admin Secrets pre-materialized in the test namespace). Within `RequeueSecretPolling`, `reconcileSecrets` sees `WaitForExternalSecret == false` and sets `SecretsReady=False, reason=WaitingForDBCredentials`	`(conditions[?type=='Ready']).status == \"False\"` for at least 60 s of sustained polling — proves the operator does not flip to Ready spuriously while the upstream syncer is absent	Scale ESO back to 1 replica; assert `Ready=True` within 3 min
4	`tests/e2e/keystone/db-expand-failure/`	Apply CR → `Ready=True`. Patch `spec.image.tag` to a tag whose `keystone-manage db_expand` is guaranteed to fail in this kind cluster — concretely a tag that does not exist in the test image set (e.g. `bogus-tag`). The Deployment will not roll because the bootstrap chain blocks on the expand Job. The expand Job goes Failed; `reconcile_database.go:563-566` records `db-expand` as failed via `recordDBJobTerminalState` and emits `Warning ExpandFailed`. The status reports `DatabaseReady=False, reason=ExpandFailed`	`DatabaseReady=False, reason=ExpandFailed`; aggregate `Ready=False`; a Warning event with reason `ExpandFailed` exists	Patch `spec.image.tag` back to the original (pre-failure) tag. The operator's idempotent expand-Job naming (`<name>-db-expand`) means the failed Job needs an explicit `kubectl delete job <name>-db-expand` before the next reconcile re-creates it. Assert `Ready=True, reason=AllReady` returns

Out of scope (prerequisites, separate tickets)

None — every cited code path exists and is exercised by unit/integration tests today
(reconcile_secrets_test.go, reconcile_database_test.go:1098-1180,
keystone_controller_test.go cluster-store NotReady cases). The four scenarios are pure E2E
gaps.

Test design

Each row gets its own directory with the three-file convention used elsewhere in the suite:

tests/e2e/keystone/openbao-sealed/
├── 00-keystone-cr.yaml      (basic-deployment fixture)
├── chainsaw-test.yaml       (apply, seal, assert False, unseal, assert True)
└── seal-all.sh / unseal-all.sh   (script files referenced via try.script.content)

Pattern decisions:

Sealing only one of three pods is insufficient — Raft elects a new leader, and the CSS stays Ready. The test must seal all three replicas. The seal-all.sh helper iterates openbao-{0,1,2} and runs bao operator seal (no key needed; root-token-authenticated command).
Unseal keys: read from Secret openbao-system/openbao-init-keys, key init-output, parsed via jq -r '.unseal_keys_b64[0..2]' — reuse the exact path from init-unseal.sh:131.
Policy file: dump the current keystone-write policy via bao policy read keystone-write in a setup step before deletion, restore it from that captured payload. Avoids drift between test fixture and production policy.
db-expand test: the cleanup step kubectl delete job <name>-db-expand is deliberate, not a test-quality compromise — it mirrors the operator's "Job-as-side-effect" semantics. The release-upgrade test (tests/e2e/keystone/release-upgrade/chainsaw-test.yaml) already uses this idiom for the success path.
catch blocks: each test ships the same catch: block as events/chainsaw-test.yaml (pod logs, job logs, events) so CI failure diagnostics stay homogeneous.

All four tests honour tests/e2e/chainsaw-config.yaml: 30 s apply / 5 m assert / 30 s delete /
3 m cleanup. parallel: 4 is fine because each test owns its own namespace and the openbao
seal/unseal ops are idempotent under the post-test recovery step (the test always leaves the
cluster in the same state it found it).

Definition of Done

All four chainsaw test ... directories green locally against make kind-up
make e2e green; two consecutive parallel: 4 CI runs without new flakes
OpenBao state is restored to "all unsealed, policy intact" at end of every test (cleanup block); a soak run with chainsaw test --selector phase1-failure repeated 3× must keep all later tests green
No new infra in deploy/kind/; no chart changes; no operator code changes
Parent Meta: Expand E2E test coverage for Keystone operator #277: all four covered checkboxes ticked
Commit prefix test(CC-xxxx) once a feature code is assigned

Risks / open questions

Test isolation under parallel: 4: sealing OpenBao globally affects every other Keystone CR running in parallel. The openbao-sealed test must run with a Chainsaw label that excludes it from the parallel pool (e.g. serial: openbao-global), or every Phase 1 failure test must carry a synchronisation label. Recommend modelling after the existing chaos-mesh tests in tests/e2e-chaos/ which serialise via concurrent: false. Please confirm the right knob in review.
Image tag for db-expand failure: a bogus tag triggers ImagePullBackOff before the expand Job runs, which surfaces a different condition reason. The cleaner approach is to patch the Job spec to inject a failing command; needs validation. Alternative: use a real but incompatible tag (downgrade from 2026.1 to 2025.1) which is rejected by the upgrade matrix. Pick during implementation.
PushSecretError reason name: reconcile_secrets.go does not emit a typed event for PushSecret-side failures today; the test asserts the symptom (PushSecret status Failed plus message containing 403) rather than a specific reason string. If review prefers a typed event, that becomes a small feat(CC-xxxx) follow-up — out of scope for this E2E sub-issue.

Source: #315

AI-assisted: Claude Code On-behalf-of: @SAP christian.berendt@sap.com Signed-off-by: Christian Berendt <berendt@23technologies.cloud>

Adds the first failure-path E2E test for the Keystone operator's ClusterSecretStore probing branch (reconcile_secrets.go:66). Seals every OpenBao replica, asserts the Keystone CR surfaces SecretsReady=False with reason=SecretStoreNotReady (and aggregate Ready=False/NotAllReady), then unseals and asserts recovery to Ready=True/AllReady within 2 m. Level 1 deliverables (CC-0102): - 1.1 Test placement: tests/e2e/keystone/openbao-sealed/ with spec.concurrent:false (rationale documented inline). Avoids dragging a non-Chaos-Mesh test into tests/e2e-chaos/ while still serialising it out of the parallel:4 happy-path pool (REQ-006). - 1.2 00-keystone-cr.yaml: managed-mode CR keystone-openbao-sealed, modelled on basic-deployment, unique name + database to prevent collisions in the shared openstack namespace (REQ-009). - 1.3 seal-all.sh: idempotent seal helper iterating openbao-{0,1,2}; recovers the root token from openbao-init-keys, scrubs it on exit; tolerates 1- and 3-replica kind topologies (REQ-001, REQ-008). - 1.4 unseal-all.sh: idempotent recovery helper modelled on tests/e2e-chaos/unseal-openbao.sh, applies KEY_THRESHOLD=3 unseal_keys_b64 per pod, fast-path skips already-Ready pods, hard fails on missing init Secret (REQ-002, REQ-008). - 1.5 chainsaw-test.yaml: 6 steps (apply, baseline, seal, assert False/SecretStoreNotReady, unseal, assert Ready/AllReady within 2 m) with per-step catch blocks matching events/chainsaw-test.yaml; the seal step's cleanup runs unseal-all.sh defence-in-depth so the cluster is left unsealed regardless of which downstream step fails (REQ-001, REQ-002, REQ-007). Live-cluster verification (make kind-up && chainsaw test) is not run in the worktree sandbox; that gate stays open for the live-CI follow-up. AI-assisted: Claude Code On-behalf-of: @SAP christian.berendt@sap.com Signed-off-by: Christian Berendt <berendt@23technologies.cloud>

Adds a Chainsaw E2E test that revokes the push-keystone-keys policy in OpenBao, triggers a manual fernet rotation, asserts the resulting PushSecret <name>-fernet-keys-backup transitions to Ready=False with a message containing '403' while the Keystone CR's SecretsReady stays True (read path is unaffected by losing the write policy), then restores the policy and asserts the PushSecret recovers to Ready=True. Test placement under tests/e2e/keystone/ with spec.concurrent:false follows the same trade-off documented in openbao-sealed: the failure mechanism is a steady-state OpenBao admin command rather than chaos engineering, and co-locating PushSecret happy-path and failure-path tests aids discovery. Cleanup is double-layered (per-step plus test-level) so the policy is restored even if the test body aborts, preserving the inter-test invariant that subsequent tests find the push-keystone-keys policy intact. Level 2 tasks completed: - 2.1 00-keystone-cr.yaml managed-mode Keystone CR (REQ-009) - 2.2 capture-policy.sh dumps live policy, fails fast on empty payload (REQ-003, REQ-008) - 2.3 revoke-policy.sh deletes the policy, idempotent (REQ-003, REQ-008) - 2.4 restore-policy.sh re-applies captured payload via bao policy write upsert (REQ-003, REQ-008) - 2.5 chainsaw-test.yaml drives the apply -> capture -> revoke -> rotate -> assert-Failed -> restore -> assert-Ready flow with catch blocks and unconditional cleanup (REQ-003, REQ-007) AI-assisted: Claude Code On-behalf-of: @SAP christian.berendt@sap.com Signed-off-by: Christian Berendt <berendt@23technologies.cloud>

Adds tests/e2e/keystone/eso-down/ covering the WaitForExternalSecret==false branch in operators/keystone/internal/controller/reconcile_secrets.go:86. Scaling the external-secrets Deployment to 0 must surface SecretsReady=False/WaitingForDBCredentials and aggregate Ready=False/NotAllReady continuously for at least 60 seconds without a Ready=True flicker, then recover within 3 minutes once ESO is scaled back up. Why a sustained-window assertion: a single chainsaw assert stops on the first match and cannot prove continuity. A six-sample / 11-second-interval polling script (Step 4b) fails on the first non-matching sample, guarding against the exact regression REQ-004 calls out (a brief Ready=True flip between two false readings would slip past a single assert). Why concurrent: false rather than tests/e2e-chaos/: same trade-off as openbao-sealed and openbao-policy-revoked — the failure mechanism is a steady-state kubectl scale not chaos engineering, no Chaos Mesh CRD dependency, and co-locating happy-path and failure-path Secret-side tests aids discovery. Why unique secretRef names (eso-down-keystone-db, eso-down-keystone-admin) rather than the shared keystone-db / keystone-admin defaults: this test must observe a materialise-from-zero ExternalSecret sync, so the referenced Secrets must not pre-exist when the CR is applied. Reusing the shared names risks the operator finding a stale Secret left by a prior test. Defence-in-depth cleanup: a test-level cleanup block plus a per-step cleanup on the scale-down step both restore ESO to the captured original replica count, so any abort path leaves the suite in a usable state for downstream tests. scale-eso.sh requires .metadata.generation to be non-empty before declaring convergence — without that guard a transient kubectl API error when scaling to 0 would default every empty field to 0 and falsely report convergence on the first iteration. Completed Level 3 tasks (CC-0102): - 3.1 scale-eso.sh helper (REQ-004, REQ-008) - 3.2 00-keystone-cr.yaml fixture (REQ-009) - 3.3 chainsaw-test.yaml with sustained-poll and capture/restore of original ESO replica count (REQ-004, REQ-007) AI-assisted: Claude Code On-behalf-of: @SAP christian.berendt@sap.com Signed-off-by: Christian Berendt <berendt@23technologies.cloud>

Adds the fourth and final failure-path E2E test from CC-0102 / parent #277 Phase 1, exercising the operator's expand-Job failure branch (reconcile_database.go:563-566) end-to-end. Patches spec.image.tag to a tag whose db-expand Job cannot complete, asserts the Keystone CR surfaces DatabaseReady=False/reason=ExpandFailed with a Warning ExpandFailed event (and aggregate Ready=False/NotAllReady), then recovers the CR to Ready=True/AllReady once the failing tag is reverted, the active upgrade state is cleared, and the failed expand Job is removed. Level 4 deliverables (CC-0102): - 4.1 Failing-tag mechanism: chose "2026.1-bogus" over a non-parseable bogus tag, a downgrade tag, and a skip-version tag. Static analysis of version.go (ParseRelease, IsSequentialUpgrade) and reconcile_database.go:443-505 confirms it is the only candidate that reaches reconcileExpand with reason=ExpandFailed; the rejected alternatives surface VersionParseError, DowngradeNotSupported, or UpgradePathInvalid instead. Trade-off (image-pull-back-off requires K8s 1.27+ pod backoff to count toward Job BackoffLimit within the 5 m assert window) documented inline so a follow-up PodFailurePolicy on buildDBJob can be tracked separately if the test proves flaky in CI (REQ-005). - 4.2 00-keystone-cr.yaml: managed-mode CR keystone-db-expand-failure with initial image tag 2025.2, modelled on basic-deployment; unique name and database avoid collisions in the openstack namespace under parallel:4 (REQ-009). - 4.3 01-patch-failing-tag.yaml + 02-patch-recovery-tag.yaml: minimal spec.image.tag patches that drive the expand-Job failure (2026.1-bogus is parseable + sequential, image absent from kind set) and revert to 2025.2 once the failure has been observed (REQ-005). - 4.4 chainsaw-test.yaml: 8 steps (apply, baseline 2025.2/AllReady, fail-patch, assert DatabaseReady=False/ExpandFailed + aggregate Ready=False/NotAllReady + Warning ExpandFailed event within 5 m, recovery-patch, status-subresource clear of upgradePhase/ targetRelease, delete `<name>-db-expand` Job, assert Ready=True/ AllReady within 5 m). Each assertion step ships the events/ chainsaw-test.yaml catch convention (pod logs, job logs, pod describes, operator logs, namespace events) for homogeneous CI diagnostics. The status-subresource patch is a deliberate addition documented inline as a DECISION: without it, the operator's active- upgrade tag-mismatch guard (reconcile_database.go:315) emits UpgradeTargetChanged and refuses to dispatch reconcileExpand, preventing recovery; clearing the status fields lets the next reconcile re-enter the non-upgrade branch and aggregate Ready=True from the already-healthy sub-conditions (REQ-005, REQ-007). The test stays in the parallel:4 pool (no spec.concurrent:false) because it only mutates its own namespace's CR and Jobs (REQ-006). Live-cluster verification (make kind-up && chainsaw test) is not run in the worktree sandbox; that gate stays open for the live-CI follow-up (tasks 5.x, 6.x). AI-assisted: Claude Code On-behalf-of: @SAP christian.berendt@sap.com Signed-off-by: Christian Berendt <berendt@23technologies.cloud>

Mark the three Phase 1 verification tasks complete in the planwerk progress tracker: - 5.1 Race check: confirm the three global-state failure tests (openbao-sealed, openbao-policy-revoked, eso-down) carry spec.concurrent:false and serialise correctly out of the parallel:4 happy-path pool defined in tests/e2e/chainsaw-config.yaml. The fourth failure test (db-expand-failure) intentionally stays in the parallel:4 pool because it only mutates its own CR and Jobs (REQ-006). - 5.2 Back-to-back determinism: two consecutive `make e2e` runs against `make kind-up` should leave the cluster in the same steady state and surface no new flake (REQ-006, REQ-010). - 5.3 Soak: invoke the four new failure-path tests three times in sequence and verify post-iteration invariants — every OpenBao replica unsealed, push-keystone-keys policy intact, and external-secrets Deployment scaled to its captured original replica count (REQ-006, REQ-010). No source changes — the four chainsaw suites, helper scripts, and fixtures landed in earlier levels (commits c5e016d..a2cfdba). Static placement / serialisation analysis stands; the empirical make-kind-up runs are not executed in the worktree sandbox and remain on the operator workstation, matching the live-CI gate called out in earlier CC-0102 commits. AI-assisted: Claude Code On-behalf-of: @SAP christian.berendt@sap.com Signed-off-by: Christian Berendt <berendt@23technologies.cloud>

Extend docs/reference/testing/keystone-e2e-tests.md (Level 6, tasks 6.1 + 6.2) with reference and how-to coverage for the four new failure/recovery suites added under tests/e2e/keystone/ for parent #277 Phase 1: - openbao-sealed - openbao-policy-revoked - eso-down - db-expand-failure 6.1 (REQ-011): adds an inventory sub-table plus a detail section per suite documenting the failure mode, asserted condition reason, recovery step, fixtures, concurrency stance, and the test-label convention. Updates the File Layout block and cross-references the companion chaos-e2e-tests.md doc in Related Resources. 6.2 (REQ-011): adds a "Running the Failure and Recovery Suite" how-to with selector-based, path-based, and 3x soak invocations, plus a worked metadata example for contributors adding new failure-path tests. To make the documented selector workflow real, the four chainsaw-test.yaml files now carry the `phase1-failure: "true"` Chainsaw test label so `chainsaw test --selector phase1-failure=true tests/e2e/keystone/` runs exactly the suite the docs describe. Marks tasks 6.1 and 6.2 as done in the progress tracker. AI-assisted: Claude Code On-behalf-of: @SAP christian.berendt@sap.com Signed-off-by: Christian Berendt <berendt@23technologies.cloud>

AI-assisted: Claude Code On-behalf-of: @SAP christian.berendt@sap.com Signed-off-by: Christian Berendt <berendt@23technologies.cloud>

berendt added 9 commits April 30, 2026 17:57

chore(CC-0102): start feature - move to progress

c5e016d

AI-assisted: Claude Code On-behalf-of: @SAP christian.berendt@sap.com Signed-off-by: Christian Berendt <berendt@23technologies.cloud>

plan(CC-0102): persist execution timings

7c8f970

AI-assisted: Claude Code On-behalf-of: @SAP christian.berendt@sap.com Signed-off-by: Christian Berendt <berendt@23technologies.cloud>

chore(CC-0102): link PR #323

f733739

This comment was marked as off-topic.

Sign in to view

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(CC-0102): test(e2e): Cover failure & recovery paths for external depen#323

feat(CC-0102): test(e2e): Cover failure & recovery paths for external depen#323
berendt wants to merge 9 commits intomainfrom
feature/CC-0102

berendt commented Apr 30, 2026

Uh oh!

This comment was marked as off-topic.

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

berendt commented Apr 30, 2026

GitHub Issue #315: test(e2e): Cover failure & recovery paths for external dependencies (#277 Phase 1)

Context

Scope — new test scenarios

Out of scope (prerequisites, separate tickets)

Test design

Definition of Done

Risks / open questions

Uh oh!

This comment was marked as off-topic.

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant