ci: fail fast on hangs + de-flake Windows (timeout, concurrency, watchdog)#1256

Merged: quinnj merged 1 commit into master from jq-ci-reliability on May 29, 2026
@quinnj quinnj commented May 29, 2026

Summary

CI turnaround has been unreliable: runs occasionally hang for up to GitHub's 6-hour default while every other job finishes in minutes. Investigation across recent runs shows this is not steady slowness — on a healthy run Windows is the fastest tier:

| Run | macOS | Ubuntu | Windows |
|---|---|---|---|
| Clean | 7m | 10–11m | 6–6.5m |
| Hung | 9m ✓ | 8–10m ✓ | 36m → cancelled |
| Hung | 7m ✓ | 10–11m ✓ | 360m (6h cap) |

The problem is an intermittent hard hang in a Windows job, with nothing to bound it: no timeout-minutes, and no concurrency cancellation (superseded PR runs pile up).

Root cause of the hang: the raw-socket test helpers in http_server_http1_tests.jl bound their reads with socket read deadlines (set_read_deadline!), but on Windows a blocked readavailable is not reliably interrupted by the deadline, so control never returns to the loop's own timeout check and the read hangs forever. In every hung run, the last testset logged before the output went silent was "HTTP server stream handler emits chunked trailers"; the raw read that follows it is the one that hung.

Changes

ci.yml

  • timeout-minutes: 25 on the test job — a hang now fails in minutes, not hours (clean jobs ≤11m, so generous headroom for cache-cold precompiles).
  • concurrency group with cancel-in-progress: true — a new push cancels superseded PR runs. master/release pushes use the unique run_id as the group key, so they are never cancelled.
  • continue-on-error for version: 'pre' — prerelease Julia is informational and the windows/pre combo is the most frequent hanger; it shouldn't block merges.
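
A minimal sketch of how these three settings could sit together in a workflow file. This is not the actual diff from this PR: the workflow name, triggers, matrix values, and steps are assumptions for illustration. Only `timeout-minutes: 25`, the `cancel-in-progress: true` concurrency group that falls back to `run_id` for non-PR runs, and `continue-on-error` for `version: 'pre'` come from the description above.

```yaml
name: CI                     # assumed workflow name
on:
  push:
    branches: [master]       # assumed trigger list
  pull_request:

# PR runs share one group per ref, so a new push cancels the superseded run.
# Every other event falls back to the unique run_id, so each run gets its
# own group and is never cancelled.
concurrency:
  group: ${{ github.workflow }}-${{ github.event_name == 'pull_request' && github.ref || github.run_id }}
  cancel-in-progress: true

jobs:
  test:
    runs-on: ${{ matrix.os }}
    timeout-minutes: 25      # a hung job fails here instead of at the 6-hour default
    # Prerelease Julia is informational: its failure does not fail the run.
    continue-on-error: ${{ matrix.version == 'pre' }}
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]   # assumed matrix
        version: ['1', 'pre']                               # assumed matrix
    steps:
      - uses: actions/checkout@v4
      - uses: julia-actions/setup-julia@v2
        with:
          version: ${{ matrix.version }}
      - uses: julia-actions/julia-runtest@v1
```

Keying the group on `run_id` rather than disabling `cancel-in-progress` for protected branches keeps a single concurrency block for all events.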

test/http_server_http1_tests.jl

  • Wrap _raw_http_request and _raw_http_request_until_close in the existing _run_with_timeout task-level watchdog (Threads.@spawn + timedwait), which does not depend on socket deadlines. A stuck read now fails the test in seconds instead of hanging the job. This covers all 10 call sites at once and is behavior-neutral on Linux/macOS (verified locally — full http_server_http1_tests.jl suite passes, including the previously-hanging chunked-trailers testset).
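
The task-level watchdog pattern described above can be sketched as follows. This is a hypothetical stand-in, not the suite's actual `_run_with_timeout`: the name `run_with_timeout_sketch`, its keyword argument, and the error message are assumptions. It relies only on `Threads.@spawn`, `timedwait`, `istaskdone`, and `fetch` from Base.

```julia
# Hypothetical sketch of a task-level watchdog; the real _run_with_timeout
# helper in the test suite may be implemented differently.
function run_with_timeout_sketch(f; timeout_s::Real = 5.0)
    # Run the possibly-blocking work on its own task so the caller stays free.
    task = Threads.@spawn f()

    # Poll the task's status instead of relying on a socket deadline.
    # timedwait returns :ok once the callback is true, or :timed_out.
    status = timedwait(() -> istaskdone(task), timeout_s; pollint = 0.01)

    if status === :timed_out
        # The stuck task is abandoned, not killed; the caller just stops waiting.
        error("operation timed out after $(timeout_s)s")
    end

    # Returns the result, or rethrows if the task itself failed.
    return fetch(task)
end

# Usage: a fast call returns its value, a stuck call raises quickly.
fast = run_with_timeout_sketch(() -> 1 + 1; timeout_s = 2.0)
slow_failed = try
    run_with_timeout_sketch(() -> (sleep(5); :late); timeout_s = 0.2)
    false
catch err
    err isa ErrorException
end
```

Because the guard only watches whether the task has finished, it works the same whether or not the platform honors the read deadline. This matches the description's point that the change is behavior-neutral where deadlines already work.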

Impact

Worst-case turnaround drops from ~6 hours → ~25 minutes, superseded runs stop piling up, and the actual Windows hang is converted into a fast, legible test failure.

Note on branch protection

The continue-on-error change makes pre non-blocking at the run level. If Julia pre - * jobs are currently listed as required status checks in branch protection, they should be removed from the required list in repo settings — that's a settings change this PR can't make.

🤖 Generated with Claude Code

CI runs sometimes hung up to GitHub's 6h default because a Windows job could
block indefinitely on a raw-socket read, while nothing bounded the job or
cancelled superseded runs. On healthy runs Windows is actually the fastest
tier (~6m) — the problem was unbounded hangs, not steady slowness.

- ci.yml: add timeout-minutes: 25 to the test job so a hang fails in minutes,
  not hours (clean jobs finish in <=11m).
- ci.yml: add a concurrency group with cancel-in-progress so a new push cancels
  superseded PR runs; master/release pushes use the unique run_id and are never
  cancelled.
- ci.yml: mark prerelease Julia jobs continue-on-error so the flaky windows/pre
  combination cannot block merges.
- test: wrap the raw-socket helpers (_raw_http_request and
  _raw_http_request_until_close) in the existing _run_with_timeout task-level
  watchdog. Their socket read deadlines are not reliably honored on Windows, so
  a blocked readavailable never returns to the loop's own timeout check; the
  task-level guard fails the test in seconds instead of hanging the job.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
codecov Bot commented May 29, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 84.34%. Comparing base (0b61c88) to head (0a4d6c1).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1256   +/-   ##
=======================================
  Coverage   84.34%   84.34%           
=======================================
  Files          28       28           
  Lines       10648    10648           
=======================================
  Hits         8981     8981           
  Misses       1667     1667           

@quinnj quinnj merged commit 47fd097 into master May 29, 2026
8 checks passed
@quinnj quinnj deleted the jq-ci-reliability branch May 29, 2026 14:23