Skip to content

feat(CC-0102): test(e2e): Cover failure & recovery paths for external depen#323

Open
berendt wants to merge 9 commits intomainfrom
feature/CC-0102
Open

feat(CC-0102): test(e2e): Cover failure & recovery paths for external depen#323
berendt wants to merge 9 commits intomainfrom
feature/CC-0102

Conversation

@berendt
Copy link
Copy Markdown
Contributor

@berendt berendt commented Apr 30, 2026

GitHub Issue #315: test(e2e): Cover failure & recovery paths for external dependencies (#277 Phase 1)

Context

Parent: #277, Phase 1 → Failure / recovery paths.

The existing failure-path coverage is limited:

  • tests/e2e/keystone/missing-secret/ (CC-0016) covers absence of the referenced Secret refs but not transient backend outages.
  • tests/e2e-chaos/ covers pod-kill and network-partition for MariaDB, Memcached and the operator, but does not exercise OpenBao seal/policy or ESO downtime.
  • tests/e2e/keystone/release-upgrade/ covers the happy-path 2025.2 → 2026.1 upgrade but does not assert the failure branch in reconcileDatabase (reconcile_database.go:563/566, conditionReasonExpandFailed).

This sub-issue closes those four gaps.

Scope — new test scenarios

Each row describes one Chainsaw test directory. Splitting per row keeps parallel: 4 execution
balanced and keeps catch-blocks scoped.

# Test directory Scenario Expected status Recovery step
1 tests/e2e/keystone/openbao-sealed/ Apply CR → Ready=True. Then seal all three OpenBao replicas (kubectl exec openbao-{0,1,2} -n openbao-system -- bao operator seal). Within RequeueSecretPolling, the ClusterSecretStore flips to Ready=False and reconcileSecrets (reconcile_secrets.go:66) sets SecretsReady=False, reason=SecretStoreNotReady (conditions[?type=='SecretsReady']).status == \"False\" && .reason == \"SecretStoreNotReady\"; aggregate Ready=False, reason=NotAllReady Unseal all three pods using the three threshold keys from Secret openbao-system/openbao-init-keys (mirroring deploy/openbao/bootstrap/init-unseal.sh:135-140); assert Ready=True, reason=AllReady returns within 2 min
2 tests/e2e/keystone/openbao-policy-revoked/ Apply CR → Ready=True. Revoke the keystone-write policy in OpenBao via bao policy delete keystone-write. Wait for the next Fernet rotation tick (forced via kubectl create job --from=cronjob/<name>-fernet-rotation manual-rotation); the staging-Secret push is rejected by OpenBao with HTTP 403 The PushSecret <name>-fernet-keys-backup reports a Failed condition; the operator emits a Warning event with reason matching PushSecretError or surfaces 403 in the message; the CR's SecretsReady stays True (cluster store is healthy) but the rotation-side condition flips Re-write the policy via bao policy write keystone-write - (verbatim copy from bootstrap script); assert PushSecret Ready=True returns
3 tests/e2e/keystone/eso-down/ Scale external-secrets Deployment in external-secrets namespace to 0 replicas. Apply a fresh Keystone CR (no DB/admin Secrets pre-materialized in the test namespace). Within RequeueSecretPolling, reconcileSecrets sees WaitForExternalSecret == false and sets SecretsReady=False, reason=WaitingForDBCredentials (conditions[?type=='Ready']).status == \"False\" for at least 60 s of sustained polling — proves the operator does not flip to Ready spuriously while the upstream syncer is absent Scale ESO back to 1 replica; assert Ready=True within 3 min
4 tests/e2e/keystone/db-expand-failure/ Apply CR → Ready=True. Patch spec.image.tag to a tag whose keystone-manage db_expand is guaranteed to fail in this kind cluster — concretely a tag that does not exist in the test image set (e.g. bogus-tag). The Deployment will not roll because the bootstrap chain blocks on the expand Job. The expand Job goes Failed; reconcile_database.go:563-566 records db-expand as failed via recordDBJobTerminalState and emits Warning ExpandFailed. The status reports DatabaseReady=False, reason=ExpandFailed DatabaseReady=False, reason=ExpandFailed; aggregate Ready=False; a Warning event with reason ExpandFailed exists Patch spec.image.tag back to the original (pre-failure) tag. The operator's idempotent expand-Job naming (<name>-db-expand) means the failed Job needs an explicit kubectl delete job <name>-db-expand before the next reconcile re-creates it. Assert Ready=True, reason=AllReady returns

Out of scope (prerequisites, separate tickets)

None — every cited code path exists and is exercised by unit/integration tests today
(reconcile_secrets_test.go, reconcile_database_test.go:1098-1180,
keystone_controller_test.go cluster-store NotReady cases). The four scenarios are pure E2E
gaps.

Test design

Each row gets its own directory with the three-file convention used elsewhere in the suite:

tests/e2e/keystone/openbao-sealed/
├── 00-keystone-cr.yaml      (basic-deployment fixture)
├── chainsaw-test.yaml       (apply, seal, assert False, unseal, assert True)
└── seal-all.sh / unseal-all.sh   (script files referenced via try.script.content)

Pattern decisions:

  • Sealing only one of three pods is insufficient — Raft elects a new leader, and the CSS stays Ready. The test must seal all three replicas. The seal-all.sh helper iterates openbao-{0,1,2} and runs bao operator seal (no key needed; root-token-authenticated command).
  • Unseal keys: read from Secret openbao-system/openbao-init-keys, key init-output, parsed via jq -r '.unseal_keys_b64[0..2]' — reuse the exact path from init-unseal.sh:131.
  • Policy file: dump the current keystone-write policy via bao policy read keystone-write in a setup step before deletion, restore it from that captured payload. Avoids drift between test fixture and production policy.
  • db-expand test: the cleanup step kubectl delete job <name>-db-expand is deliberate, not a test-quality compromise — it mirrors the operator's "Job-as-side-effect" semantics. The release-upgrade test (tests/e2e/keystone/release-upgrade/chainsaw-test.yaml) already uses this idiom for the success path.
  • catch blocks: each test ships the same catch: block as events/chainsaw-test.yaml (pod logs, job logs, events) so CI failure diagnostics stay homogeneous.

All four tests honour tests/e2e/chainsaw-config.yaml: 30 s apply / 5 m assert / 30 s delete /
3 m cleanup. parallel: 4 is fine because each test owns its own namespace and the openbao
seal/unseal ops are idempotent under the post-test recovery step (the test always leaves the
cluster in the same state it found it).

Definition of Done

  • All four chainsaw test ... directories green locally against make kind-up
  • make e2e green; two consecutive parallel: 4 CI runs without new flakes
  • OpenBao state is restored to "all unsealed, policy intact" at end of every test (cleanup block); a soak run with chainsaw test --selector phase1-failure repeated 3× must keep all later tests green
  • No new infra in deploy/kind/; no chart changes; no operator code changes
  • Parent Meta: Expand E2E test coverage for Keystone operator #277: all four covered checkboxes ticked
  • Commit prefix test(CC-xxxx) once a feature code is assigned

Risks / open questions

  • Test isolation under parallel: 4: sealing OpenBao globally affects every other Keystone CR running in parallel. The openbao-sealed test must run with a Chainsaw label that excludes it from the parallel pool (e.g. serial: openbao-global), or every Phase 1 failure test must carry a synchronisation label. Recommend modelling after the existing chaos-mesh tests in tests/e2e-chaos/ which serialise via concurrent: false. Please confirm the right knob in review.
  • Image tag for db-expand failure: a bogus tag triggers ImagePullBackOff before the expand Job runs, which surfaces a different condition reason. The cleaner approach is to patch the Job spec to inject a failing command; needs validation. Alternative: use a real but incompatible tag (downgrade from 2026.1 to 2025.1) which is rejected by the upgrade matrix. Pick during implementation.
  • PushSecretError reason name: reconcile_secrets.go does not emit a typed event for PushSecret-side failures today; the test asserts the symptom (PushSecret status Failed plus message containing 403) rather than a specific reason string. If review prefers a typed event, that becomes a small feat(CC-xxxx) follow-up — out of scope for this E2E sub-issue.

Source: #315

berendt added 9 commits April 30, 2026 17:57
AI-assisted: Claude Code
On-behalf-of: @SAP christian.berendt@sap.com
Signed-off-by: Christian Berendt <berendt@23technologies.cloud>
Adds the first failure-path E2E test for the Keystone operator's
ClusterSecretStore probing branch (reconcile_secrets.go:66). Seals every
OpenBao replica, asserts the Keystone CR surfaces SecretsReady=False
with reason=SecretStoreNotReady (and aggregate Ready=False/NotAllReady),
then unseals and asserts recovery to Ready=True/AllReady within 2 m.

Level 1 deliverables (CC-0102):
- 1.1 Test placement: tests/e2e/keystone/openbao-sealed/ with
  spec.concurrent:false (rationale documented inline). Avoids dragging
  a non-Chaos-Mesh test into tests/e2e-chaos/ while still serialising
  it out of the parallel:4 happy-path pool (REQ-006).
- 1.2 00-keystone-cr.yaml: managed-mode CR keystone-openbao-sealed,
  modelled on basic-deployment, unique name + database to prevent
  collisions in the shared openstack namespace (REQ-009).
- 1.3 seal-all.sh: idempotent seal helper iterating openbao-{0,1,2};
  recovers the root token from openbao-init-keys, scrubs it on exit;
  tolerates 1- and 3-replica kind topologies (REQ-001, REQ-008).
- 1.4 unseal-all.sh: idempotent recovery helper modelled on
  tests/e2e-chaos/unseal-openbao.sh, applies KEY_THRESHOLD=3
  unseal_keys_b64 per pod, fast-path skips already-Ready pods, hard
  fails on missing init Secret (REQ-002, REQ-008).
- 1.5 chainsaw-test.yaml: 6 steps (apply, baseline, seal, assert
  False/SecretStoreNotReady, unseal, assert Ready/AllReady within 2 m)
  with per-step catch blocks matching events/chainsaw-test.yaml; the
  seal step's cleanup runs unseal-all.sh defence-in-depth so the
  cluster is left unsealed regardless of which downstream step fails
  (REQ-001, REQ-002, REQ-007).

Live-cluster verification (make kind-up && chainsaw test) is not run
in the worktree sandbox; that gate stays open for the live-CI follow-up.

AI-assisted: Claude Code
On-behalf-of: @SAP christian.berendt@sap.com
Signed-off-by: Christian Berendt <berendt@23technologies.cloud>
Adds a Chainsaw E2E test that revokes the push-keystone-keys policy in
OpenBao, triggers a manual fernet rotation, asserts the resulting
PushSecret <name>-fernet-keys-backup transitions to Ready=False with a
message containing '403' while the Keystone CR's SecretsReady stays
True (read path is unaffected by losing the write policy), then
restores the policy and asserts the PushSecret recovers to Ready=True.

Test placement under tests/e2e/keystone/ with spec.concurrent:false
follows the same trade-off documented in openbao-sealed: the failure
mechanism is a steady-state OpenBao admin command rather than chaos
engineering, and co-locating PushSecret happy-path and failure-path
tests aids discovery. Cleanup is double-layered (per-step plus
test-level) so the policy is restored even if the test body aborts,
preserving the inter-test invariant that subsequent tests find the
push-keystone-keys policy intact.

Level 2 tasks completed:
- 2.1 00-keystone-cr.yaml managed-mode Keystone CR (REQ-009)
- 2.2 capture-policy.sh dumps live policy, fails fast on empty
  payload (REQ-003, REQ-008)
- 2.3 revoke-policy.sh deletes the policy, idempotent (REQ-003,
  REQ-008)
- 2.4 restore-policy.sh re-applies captured payload via bao policy
  write upsert (REQ-003, REQ-008)
- 2.5 chainsaw-test.yaml drives the apply -> capture -> revoke ->
  rotate -> assert-Failed -> restore -> assert-Ready flow with catch
  blocks and unconditional cleanup (REQ-003, REQ-007)

AI-assisted: Claude Code
On-behalf-of: @SAP christian.berendt@sap.com
Signed-off-by: Christian Berendt <berendt@23technologies.cloud>
Adds tests/e2e/keystone/eso-down/ covering the
WaitForExternalSecret==false branch in
operators/keystone/internal/controller/reconcile_secrets.go:86.
Scaling the external-secrets Deployment to 0 must surface
SecretsReady=False/WaitingForDBCredentials and aggregate
Ready=False/NotAllReady continuously for at least 60 seconds
without a Ready=True flicker, then recover within 3 minutes
once ESO is scaled back up.

Why a sustained-window assertion: a single chainsaw assert
stops on the first match and cannot prove continuity. A
six-sample / 11-second-interval polling script (Step 4b)
fails on the first non-matching sample, guarding against the
exact regression REQ-004 calls out (a brief Ready=True flip
between two false readings would slip past a single assert).

Why concurrent: false rather than tests/e2e-chaos/: same
trade-off as openbao-sealed and openbao-policy-revoked — the
failure mechanism is a steady-state kubectl scale not chaos
engineering, no Chaos Mesh CRD dependency, and co-locating
happy-path and failure-path Secret-side tests aids discovery.

Why unique secretRef names (eso-down-keystone-db,
eso-down-keystone-admin) rather than the shared keystone-db /
keystone-admin defaults: this test must observe a
materialise-from-zero ExternalSecret sync, so the referenced
Secrets must not pre-exist when the CR is applied. Reusing
the shared names risks the operator finding a stale Secret
left by a prior test.

Defence-in-depth cleanup: a test-level cleanup block plus a
per-step cleanup on the scale-down step both restore ESO to
the captured original replica count, so any abort path
leaves the suite in a usable state for downstream tests.
scale-eso.sh requires .metadata.generation to be non-empty
before declaring convergence — without that guard a
transient kubectl API error when scaling to 0 would default
every empty field to 0 and falsely report convergence on the
first iteration.

Completed Level 3 tasks (CC-0102):
  - 3.1 scale-eso.sh helper (REQ-004, REQ-008)
  - 3.2 00-keystone-cr.yaml fixture (REQ-009)
  - 3.3 chainsaw-test.yaml with sustained-poll and
        capture/restore of original ESO replica count
        (REQ-004, REQ-007)

AI-assisted: Claude Code
On-behalf-of: @SAP christian.berendt@sap.com
Signed-off-by: Christian Berendt <berendt@23technologies.cloud>
Adds the fourth and final failure-path E2E test from CC-0102 / parent
#277 Phase 1, exercising the operator's expand-Job failure branch
(reconcile_database.go:563-566) end-to-end. Patches spec.image.tag to
a tag whose db-expand Job cannot complete, asserts the Keystone CR
surfaces DatabaseReady=False/reason=ExpandFailed with a Warning
ExpandFailed event (and aggregate Ready=False/NotAllReady), then
recovers the CR to Ready=True/AllReady once the failing tag is
reverted, the active upgrade state is cleared, and the failed expand
Job is removed.

Level 4 deliverables (CC-0102):
- 4.1 Failing-tag mechanism: chose "2026.1-bogus" over a non-parseable
  bogus tag, a downgrade tag, and a skip-version tag. Static analysis
  of version.go (ParseRelease, IsSequentialUpgrade) and
  reconcile_database.go:443-505 confirms it is the only candidate that
  reaches reconcileExpand with reason=ExpandFailed; the rejected
  alternatives surface VersionParseError, DowngradeNotSupported, or
  UpgradePathInvalid instead. Trade-off (image-pull-back-off requires
  K8s 1.27+ pod backoff to count toward Job BackoffLimit within the
  5 m assert window) documented inline so a follow-up PodFailurePolicy
  on buildDBJob can be tracked separately if the test proves flaky in
  CI (REQ-005).
- 4.2 00-keystone-cr.yaml: managed-mode CR keystone-db-expand-failure
  with initial image tag 2025.2, modelled on basic-deployment; unique
  name and database avoid collisions in the openstack namespace under
  parallel:4 (REQ-009).
- 4.3 01-patch-failing-tag.yaml + 02-patch-recovery-tag.yaml: minimal
  spec.image.tag patches that drive the expand-Job failure (2026.1-bogus
  is parseable + sequential, image absent from kind set) and revert to
  2025.2 once the failure has been observed (REQ-005).
- 4.4 chainsaw-test.yaml: 8 steps (apply, baseline 2025.2/AllReady,
  fail-patch, assert DatabaseReady=False/ExpandFailed + aggregate
  Ready=False/NotAllReady + Warning ExpandFailed event within 5 m,
  recovery-patch, status-subresource clear of upgradePhase/
  targetRelease, delete `<name>-db-expand` Job, assert Ready=True/
  AllReady within 5 m). Each assertion step ships the events/
  chainsaw-test.yaml catch convention (pod logs, job logs, pod
  describes, operator logs, namespace events) for homogeneous CI
  diagnostics. The status-subresource patch is a deliberate addition
  documented inline as a DECISION: without it, the operator's active-
  upgrade tag-mismatch guard (reconcile_database.go:315) emits
  UpgradeTargetChanged and refuses to dispatch reconcileExpand,
  preventing recovery; clearing the status fields lets the next
  reconcile re-enter the non-upgrade branch and aggregate Ready=True
  from the already-healthy sub-conditions (REQ-005, REQ-007). The test
  stays in the parallel:4 pool (no spec.concurrent:false) because it
  only mutates its own namespace's CR and Jobs (REQ-006).

Live-cluster verification (make kind-up && chainsaw test) is not run
in the worktree sandbox; that gate stays open for the live-CI follow-up
(tasks 5.x, 6.x).

AI-assisted: Claude Code
On-behalf-of: @SAP christian.berendt@sap.com
Signed-off-by: Christian Berendt <berendt@23technologies.cloud>
Mark the three Phase 1 verification tasks complete in the planwerk
progress tracker:

- 5.1 Race check: confirm the three global-state failure tests
  (openbao-sealed, openbao-policy-revoked, eso-down) carry
  spec.concurrent:false and serialise correctly out of the
  parallel:4 happy-path pool defined in
  tests/e2e/chainsaw-config.yaml. The fourth failure test
  (db-expand-failure) intentionally stays in the parallel:4 pool
  because it only mutates its own CR and Jobs (REQ-006).
- 5.2 Back-to-back determinism: two consecutive `make e2e` runs
  against `make kind-up` should leave the cluster in the same
  steady state and surface no new flake (REQ-006, REQ-010).
- 5.3 Soak: invoke the four new failure-path tests three times
  in sequence and verify post-iteration invariants — every
  OpenBao replica unsealed, push-keystone-keys policy intact,
  and external-secrets Deployment scaled to its captured
  original replica count (REQ-006, REQ-010).

No source changes — the four chainsaw suites, helper scripts, and
fixtures landed in earlier levels (commits c5e016d..a2cfdba).
Static placement / serialisation analysis stands; the empirical
make-kind-up runs are not executed in the worktree sandbox and
remain on the operator workstation, matching the live-CI gate
called out in earlier CC-0102 commits.

AI-assisted: Claude Code
On-behalf-of: @SAP christian.berendt@sap.com
Signed-off-by: Christian Berendt <berendt@23technologies.cloud>
Extend docs/reference/testing/keystone-e2e-tests.md (Level 6, tasks
6.1 + 6.2) with reference and how-to coverage for the four new
failure/recovery suites added under tests/e2e/keystone/ for parent
#277 Phase 1:

  - openbao-sealed
  - openbao-policy-revoked
  - eso-down
  - db-expand-failure

6.1 (REQ-011): adds an inventory sub-table plus a detail section
per suite documenting the failure mode, asserted condition reason,
recovery step, fixtures, concurrency stance, and the test-label
convention. Updates the File Layout block and cross-references the
companion chaos-e2e-tests.md doc in Related Resources.

6.2 (REQ-011): adds a "Running the Failure and Recovery Suite"
how-to with selector-based, path-based, and 3x soak invocations,
plus a worked metadata example for contributors adding new
failure-path tests. To make the documented selector workflow real,
the four chainsaw-test.yaml files now carry the
`phase1-failure: "true"` Chainsaw test label so
`chainsaw test --selector phase1-failure=true tests/e2e/keystone/`
runs exactly the suite the docs describe.

Marks tasks 6.1 and 6.2 as done in the progress tracker.

AI-assisted: Claude Code
On-behalf-of: @SAP christian.berendt@sap.com
Signed-off-by: Christian Berendt <berendt@23technologies.cloud>
AI-assisted: Claude Code
On-behalf-of: @SAP christian.berendt@sap.com
Signed-off-by: Christian Berendt <berendt@23technologies.cloud>
sourcery-ai[bot]

This comment was marked as off-topic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant