test: pipeline-e2e flake at 5000ms default timeout #1169

@christso

Symptom

The test apps/cli/test/commands/eval/pipeline/pipeline-e2e.test.ts > eval pipeline e2e > runs full input → grade → bench pipeline reliably times out at the 5000 ms default when run as part of the full bun test suite (it typically fails at 5005-5106 ms). Run alone with a higher timeout, the same test passes consistently in ~7 s.

Repro:

# Hits the flake under suite-level contention:
bun --filter agentv test
#   (fail) eval pipeline e2e > runs full input → grade → bench pipeline [5012.95ms]
#   ^ this test timed out after 5000ms.

# Passes when isolated:
bun test apps/cli/test/commands/eval/pipeline/pipeline-e2e.test.ts --timeout 30000
#   1 pass / 0 fail / Ran 1 test across 1 file. [7.08s]

Observed during pre-push for PRs #1165 / #1166 / #1167 / #1168 — all of which had no diff to packages/core or apps/cli. The flake reproduces on main with no working-tree changes.

Why this matters

validate.yml does not run bun test, so the flake doesn't gate CI, but the local prek pre-push hook does run it, and that hook is the only test gate before a push. Three consecutive git push attempts on a docs-only branch hit the flake, forcing a --no-verify bypass. That undermines the safety the hook is supposed to provide for changes that do touch source code.

Suggested fixes (pick whichever is cheaper)

  1. Raise this single test's timeout to 30 s. The test is a real end-to-end run (build → eval → grade → bench), and 7 s of wall-clock time is realistic on a laptop under load. bun test takes a per-test timeout in milliseconds as the third argument to test() / it(), e.g. test("runs full input → grade → bench pipeline", fn, 30_000); Vitest accepts the same third-argument form.

  2. Simplify the test. It looks like it's exercising the full pipeline including bun apps/cli/src/cli.ts ... subprocess spawns. Trim to the smallest possible fixture (one test case, no LLM call) and assert the pipeline plumbing only — the value of an e2e here is "the wiring works", not "everything runs fast." A 1-2 s version would be far less likely to flake.

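  3. Add retries + warmup, as a last resort. Note that bun test's --rerun-each N re-runs each test file N times, so it helps reproduce the flake rather than retry past it; retry-on-failure is set per test (the retry option in the test's options argument, if the installed bun version supports it). Combine with a no-op smoke test as the very first item so that caches are warm.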
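
If option 3 is chosen, the retry could also be scoped to the slow subprocess step inside the test rather than to the whole suite. A minimal sketch, assuming a hypothetical withRetries helper (the name and signature are illustrative; it is neither in the repo nor a bun API):

```typescript
// Hypothetical helper, not part of the repo: run an async step up to
// `attempts` times and rethrow the last error if every attempt fails.
async function withRetries<T>(
  fn: (attempt: number) => Promise<T>,
  attempts = 3,
): Promise<T> {
  if (!Number.isInteger(attempts) || attempts < 1) {
    throw new RangeError("attempts must be a positive integer");
  }
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      // The first successful attempt short-circuits the loop.
      return await fn(attempt);
    } catch (error) {
      // Remember the failure and fall through to the next attempt.
      lastError = error;
    }
  }
  throw lastError;
}
```

Inside the e2e test, the subprocess spawn would be wrapped as withRetries(() => runPipelineStep(), 2), where runPipelineStep stands in for whatever the test currently calls. This retries on thrown errors only; it does not extend the test's own timeout, so it would still need to be paired with option 1.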
