Symptom
apps/cli/test/commands/eval/pipeline/pipeline-e2e.test.ts > eval pipeline e2e > runs full input → grade → bench pipeline reliably times out at the 5000 ms default when run as part of the full bun test suite (typically 5005-5106 ms). When run alone with a higher timeout, the same test passes consistently in ~7 s.
Repro:
# Hits the flake under suite-level contention:
bun --filter agentv test
# (fail) eval pipeline e2e > runs full input → grade → bench pipeline [5012.95ms]
# ^ this test timed out after 5000ms.
# Passes when isolated:
bun test apps/cli/test/commands/eval/pipeline/pipeline-e2e.test.ts --timeout 30000
# 1 pass / 0 fail / Ran 1 test across 1 file. [7.08s]
Observed during pre-push for PRs #1165 / #1166 / #1167 / #1168 — all of which had no diff to packages/core or apps/cli. The flake reproduces on main with no working-tree changes.
Why this matters
validate.yml does not run bun test, so the flake doesn't gate CI, but the local prek pre-push hook does, and it is the only test gate that runs before push. Three consecutive git push attempts on a docs-only branch hit the timeout, forcing a --no-verify bypass. That undermines the safety the hook is supposed to provide for changes that do touch source code.
Suggested fixes (pick whichever is cheaper)
- Raise this single test's timeout to 30 s. The test is a real end-to-end run (build → eval → grade → bench), so 7 s of wall-clock time is realistic on a laptop under load. bun test (like Vitest) accepts a per-test timeout in milliseconds as the third argument to test(), e.g. test(name, fn, 30_000).
- Simplify the test. It looks like it's exercising the full pipeline, including bun apps/cli/src/cli.ts ... subprocess spawns. Trim to the smallest possible fixture (one test case, no LLM call) and assert the pipeline plumbing only; the value of an e2e here is "the wiring works", not "everything runs fast." A 1-2 s version would be far less likely to flake.
- Add retries + warmup, last resort. Note that bun test's --rerun-each N re-runs each test file N times, which helps reproduce the flake but does not retry a failed test; a per-test retry option (retry in the test options, if the installed Bun version supports it) is the closer fit. Combine with a no-op smoke test as the very first item to ensure caches are warm.
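A minimal sketch of the first option, assuming the third-argument timeout form of test() from bun:test. The constant PIPELINE_E2E_TIMEOUT_MS and the helper registerPipelineE2E are hypothetical names, and test is passed in as a parameter only so the snippet runs outside the bun test runner; in the real file it would be imported from bun:test and called directly, as the leading comment shows.

```typescript
// Sketch only. In pipeline-e2e.test.ts the call would simply be:
//   import { test } from "bun:test";
//   test("runs full input → grade → bench pipeline", async () => { /* ... */ }, 30_000);

// Shape of the test(name, fn, timeoutMs) call used below.
type RegisterTest = (
  name: string,
  fn: () => Promise<void>,
  timeoutMs: number,
) => void;

// Hypothetical constant: roughly 4x the ~7 s isolated runtime from the repro.
const PIPELINE_E2E_TIMEOUT_MS = 30_000;

// Hypothetical helper: registers the e2e test with an explicit per-test
// timeout instead of relying on the 5000 ms default.
function registerPipelineE2E(
  test: RegisterTest,
  runPipeline: () => Promise<void>,
): void {
  test(
    "runs full input → grade → bench pipeline",
    runPipeline,
    PIPELINE_E2E_TIMEOUT_MS,
  );
}
```

Only this one test gets the longer limit; every other test keeps the 5000 ms default, so a genuinely hung test elsewhere still fails fast.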
Handoff context
--no-verify was used for PRs #1167, "docs(examples): AI system register convention (.ai-register.yaml) + aggregator Action template" (docs/examples only), and #1168, "feat(examples): scenario-based red-team suites for coding and customer-facing agent archetypes" (examples-only red-team archetypes). Disclosed in both PR bodies.