Refactor KWOK argocd-* sync gate from bash/jq to Chainsaw #962

@mchmarny

Background

PR #956 introduced a KWOK CI deployer matrix with three lanes: helm, argocd-oci, argocd-helm-oci. The argocd-* lanes need to wait for Argo CD reconciliation to complete and then assert that every Application reached a terminal pass state. This is currently implemented as a bash function (wait_for_argocd_sync) in kwok/scripts/validate-scheduling.sh (~150 lines of bash + jq).

The function works and handles three real-world pass states correctly (commit cc2e87e1):

  1. Synced + Healthy — canonical pass
  2. Synced + Progressing — KWOK pod-readiness simulation gap (per ADR-008)
  3. OutOfSync + Healthy + operationState.phase=Succeeded — operator-mutation drift (gpu-operator's ClusterPolicy, nvidia-dra-driver-gpu's DeviceClass, etc.)

It also rejects two false-positive traps: a sync that genuinely failed while the resources happen to look healthy, and OutOfSync + Degraded.
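As a reading aid, the three pass states can be written as one small predicate. This is a minimal sketch that follows the list above literally; it is not the code in wait_for_argocd_sync, and the function name is_argocd_pass_state and its argument order are hypothetical.

```shell
# Hypothetical helper, not the actual wait_for_argocd_sync code.
# Args: $1 = sync status, $2 = health status,
#       $3 = operationState.phase (may be empty).
# Returns 0 when the Application is in one of the three pass states, 1 otherwise.
is_argocd_pass_state() {
  case "$1/$2" in
    Synced/Healthy)     return 0 ;;                    # canonical pass
    Synced/Progressing) return 0 ;;                    # KWOK pod-readiness gap
    OutOfSync/Healthy)  [ "${3:-}" = "Succeeded" ] ;;  # operator-mutation drift
    *)                  return 1 ;;                    # everything else fails
  esac
}

# Example: a failed sync whose resources look healthy is still rejected.
if is_argocd_pass_state OutOfSync Healthy Failed; then
  echo "pass"
else
  echo "fail"
fi
```

Any Chainsaw replacement would need to encode the same disjunction, so a table of inputs like this could double as the cases for the negative test.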

The function works, but it is hard to read, hard to test in isolation, and inconsistent with the rest of the AICR test surface, which uses Chainsaw (tests/chainsaw/cli/bundle-variants/chainsaw-test.yaml).

Proposal

Replace the bash/jq sync-wait logic in validate-scheduling.sh (and the surrounding orchestration where it makes sense) with a Chainsaw scenario, matching the pattern already established under tests/chainsaw/cli/.

Why a separate issue (not in #956)

This is an architectural refactor of how the KWOK lane drives its assertions, not a bug fix. PR #956 is already large (KWOK matrix scaffolding, two blocker fixes, hermeticity cleanup), and the Chainsaw migration trade-offs deserve their own review:

  • Output format: bash logs stream into the kwok-test composite-action diagnostic dump; Chainsaw emits its own test-result format. The CI summary action needs adaptation.
  • Exit-code semantics: run-all-recipes.sh reads EXIT_ARGOCD_SYNC_TIMEOUT (50) to implement the 3-strike rule. Chainsaw's exit code is fixed pass/fail.
  • Diagnostics on failure: dump_argocd_failures collects per-Application status, repo-server logs, and application-controller logs in one place. Reproducing that in Chainsaw requires outputs blocks plus an explicit logs step.
  • State sharing: the bundle output dir, OCI ref, root-app name, and recipe name are currently bash variables and would need to be threaded through as Chainsaw bindings.

Scope

In scope:

  • Replace wait_for_argocd_sync (and its wait_for_argocd_root_app helper) with a Chainsaw test under tests/chainsaw/kwok/argocd-sync/ (or similar).
  • Express the three-way pass predicate in Chainsaw assert (likely with a CEL expression for the disjunction across the Application list).
  • Wire failure-path diagnostics: equivalent of dump_argocd_failures via Chainsaw outputs or a post-failure step.
  • Preserve the exit-code contract run-all-recipes.sh depends on (3-strike rule + EXIT_ARGOCD_SYNC_TIMEOUT=50), or migrate run-all-recipes.sh to a Chainsaw harness that owns the strike logic.
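The exit-code contract in the last bullet can be illustrated with a small sketch. It assumes the strike counter resets on any run that does not time out (implied by "three consecutive") and that any other failure yields exit 1; both are assumptions, and apply_strike_rule is a hypothetical name, not a function in run-all-recipes.sh.

```shell
# Hypothetical sketch of the 3-strike rule; not the actual run-all-recipes.sh code.
EXIT_ARGOCD_SYNC_TIMEOUT=50   # value stated in this issue
MAX_STRIKES=3

# Args: the exit codes of consecutive recipe runs, in order.
# Returns 50 once three consecutive runs hit the sync timeout;
# otherwise returns 0 if every run passed, or 1 if any run failed (assumption).
apply_strike_rule() {
  strikes=0
  failed=0
  for rc in "$@"; do
    if [ "$rc" -eq "$EXIT_ARGOCD_SYNC_TIMEOUT" ]; then
      strikes=$((strikes + 1))
      failed=1
      if [ "$strikes" -ge "$MAX_STRIKES" ]; then
        return "$EXIT_ARGOCD_SYNC_TIMEOUT"   # bail the whole job
      fi
    else
      strikes=0                              # a non-timeout result resets the streak
      [ "$rc" -eq 0 ] || failed=1
    fi
  done
  return "$failed"
}

# Example: two timeouts, a pass, then another timeout does not trip the rule.
apply_strike_rule 50 50 0 50
echo "job exit code: $?"
```

Whichever side ends up owning this logic, the Chainsaw run has to surface "sync timed out" separately from "assertion failed" for the counter to work.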

Out of scope:

  • The bundle/push step. aicr bundle --output oci://... belongs in bash because it shells out to the AICR CLI and depends on workdir-relative output paths.
  • verify_pods — already small and consumed by the helm lane too; leaving it untouched keeps the lanes symmetric.

Acceptance criteria

  1. The argocd-oci and argocd-helm-oci lanes pass the existing eks-training recipe end-to-end on KWOK using Chainsaw instead of wait_for_argocd_sync.
  2. A deliberate regression that drives an Application to Degraded still fails the lane (negative test included).
  3. The 3-strike timeout rule in run-all-recipes.sh still bails the whole job at exit 50 after three consecutive sync deadlines.
  4. Diagnostics on failure include per-App sync/health/operationState.message and repo-server log tail — same content as dump_argocd_failures today.
