Background
PR #956 introduced a KWOK CI deployer matrix with three lanes: helm, argocd-oci, argocd-helm-oci. The argocd-* lanes need to wait for Argo CD reconciliation to complete and then assert that every Application reached a terminal pass state. This is currently implemented as a bash function (wait_for_argocd_sync) in kwok/scripts/validate-scheduling.sh (~150 lines of bash + jq).
The function works and handles three real-world pass states correctly (commit cc2e87e1):
- Synced + Healthy: canonical pass
- Synced + Progressing: KWOK pod-readiness simulation gap (per ADR-008)
- OutOfSync + Healthy + operationState.phase=Succeeded: operator-mutation drift (gpu-operator's ClusterPolicy, nvidia-dra-driver-gpu's DeviceClass, etc.)
…and rejects two false-positive traps (sync that genuinely failed but resources happen to look healthy; OutOfSync + Degraded).
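For reviewers who have not read the function, the predicate above can be sketched in plain bash. This is a minimal illustration, not the code in validate-scheduling.sh: the helper name app_passes and its three positional arguments are hypothetical, and the real function extracts these fields with jq rather than taking them as arguments.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not the real wait_for_argocd_sync): returns 0 when an
# Application's status counts as a pass under the three states listed above.
# Args: <sync status> <health status> [operationState.phase]
app_passes() {
  local sync="$1" health="$2" phase="${3:-}"
  case "${sync}/${health}" in
    Synced/Healthy)     return 0 ;;                       # canonical pass
    Synced/Progressing) return 0 ;;                       # KWOK pod-readiness gap
    OutOfSync/Healthy)  [[ "$phase" == "Succeeded" ]] ;;  # operator-mutation drift
    *)                  return 1 ;;                       # e.g. OutOfSync/Degraded
  esac
}

# Usage: a drifted-but-succeeded Application passes, a Degraded one does not.
if app_passes OutOfSync Healthy Succeeded; then echo "pass"; fi
if ! app_passes OutOfSync Degraded Succeeded; then echo "fail"; fi
```

In this sketch an OutOfSync + Healthy Application only passes when the last operation succeeded, which is one way to reject the "sync failed but resources look healthy" trap. Whether the real function applies the phase check to the Synced states as well is not stated here.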
It works. It is also hard to read, hard to test in isolation, and inconsistent with the rest of the AICR test surface, which already uses Chainsaw (tests/chainsaw/cli/bundle-variants/chainsaw-test.yaml).
Proposal
Replace the bash/jq sync-wait logic in validate-scheduling.sh (and the surrounding orchestration where it makes sense) with a Chainsaw scenario, matching the pattern already established under tests/chainsaw/cli/.
Why a separate issue (not in #956)
This is an architectural refactor of how the KWOK lane drives its assertions, not a bug fix. PR #956 is already large (KWOK matrix scaffolding, two blocker fixes, hermeticity cleanup), and the Chainsaw migration trade-offs deserve their own review:
- Output format: bash logs stream into the kwok-test composite-action diagnostic dump; Chainsaw emits its own test-result format. The CI summary action needs adaptation.
- Exit-code semantics: run-all-recipes.sh reads EXIT_ARGOCD_SYNC_TIMEOUT (50) to implement the 3-strike rule. Chainsaw's exit code is fixed pass/fail.
- Diagnostics on failure: dump_argocd_failures collects per-Application status, repo-server logs, and application-controller logs in one place. Reproducing that in Chainsaw requires outputs blocks plus an explicit logs step.
- State sharing: the bundle output dir, OCI ref, root-app name, and recipe name are currently bash vars and would need to be threaded through as Chainsaw bindings.
Scope
In scope:
- Replace wait_for_argocd_sync (and its wait_for_argocd_root_app helper) with a Chainsaw test under tests/chainsaw/kwok/argocd-sync/ (or similar).
- Express the three-way pass predicate in Chainsaw assert (likely with a CEL expression for the disjunction across the Application list).
- Wire failure-path diagnostics: equivalent of dump_argocd_failures via Chainsaw outputs or a post-failure step.
- Preserve the exit-code contract run-all-recipes.sh depends on (3-strike rule + EXIT_ARGOCD_SYNC_TIMEOUT=50), or migrate run-all-recipes.sh to a Chainsaw harness that owns the strike logic.
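To make the exit-code contract in the last bullet concrete, here is a minimal sketch of a 3-strike counter. Only the value 50 and the "three consecutive" rule come from this issue. MAX_STRIKES, strikes, and record_result are hypothetical names, and resetting the counter on any non-timeout result is an assumption about how run-all-recipes.sh behaves.

```shell
#!/usr/bin/env bash
EXIT_ARGOCD_SYNC_TIMEOUT=50   # value quoted in this issue
MAX_STRIKES=3                 # hypothetical name for the 3-strike limit
strikes=0                     # consecutive sync-timeout count

# record_result <exit-code>
# Updates the consecutive-timeout counter for one recipe run.
# Returns 0 to continue with the next recipe, 50 when the job should bail.
record_result() {
  local rc="$1"
  if (( rc == EXIT_ARGOCD_SYNC_TIMEOUT )); then
    strikes=$(( strikes + 1 ))
  else
    strikes=0                 # assumption: any other result breaks the streak
  fi
  if (( strikes >= MAX_STRIKES )); then
    return "$EXIT_ARGOCD_SYNC_TIMEOUT"
  fi
  return 0
}
```

A Chainsaw-based lane would still have to translate a sync deadline into exit 50 before calling something like record_result, since Chainsaw itself only reports pass or fail. That translation is the part this issue needs to design.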
Out of scope:
- The bundle/push step. aicr bundle --output oci://... belongs in bash because it shells out to the AICR CLI and depends on workdir-relative output paths.
- verify_pods: already small and consumed by the helm lane too; leaving it untouched keeps the lanes symmetric.
Acceptance criteria
- The argocd-oci and argocd-helm-oci lanes pass the existing eks-training recipe end-to-end on KWOK using Chainsaw instead of wait_for_argocd_sync.
- A deliberate regression that puts an Application into Degraded continues to fail the lane (negative test included).
- The 3-strike timeout rule in run-all-recipes.sh still bails the whole job at exit 50 after three consecutive sync deadlines.
- Diagnostics on failure include per-App sync/health/operationState.message and the repo-server log tail, the same content dump_argocd_failures produces today.
References
- cc2e87e1: the three-state filter being migrated
- docs/design/008-kwok-deployer-matrix.md
- tests/chainsaw/cli/bundle-variants/chainsaw-test.yaml