ci: fail fast on hangs + de-flake Windows (timeout, concurrency, watchdog) #1256
Merged
CI runs sometimes hung for up to GitHub's 6h default because a Windows job could block indefinitely on a raw-socket read, while nothing bounded the job or cancelled superseded runs. On healthy runs Windows is actually the fastest tier (~6m); the problem was unbounded hangs, not steady slowness.

- ci.yml: add timeout-minutes: 25 to the test job so a hang fails in minutes, not hours (clean jobs finish in <=11m).
- ci.yml: add a concurrency group with cancel-in-progress so a new push cancels superseded PR runs; master/release pushes use the unique run_id and are never cancelled.
- ci.yml: mark prerelease Julia jobs continue-on-error so the flaky windows/pre combination cannot block merges.
- test: wrap the raw-socket helpers (_raw_http_request and _raw_http_request_until_close) in the existing _run_with_timeout task-level watchdog. Their socket read deadlines are not reliably honored on Windows, so a blocked readavailable never returns to the loop's own timeout check; the task-level guard fails the test in seconds instead of hanging the job.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
## master #1256 +/- ##
=======================================
Coverage 84.34% 84.34%
=======================================
Files 28 28
Lines 10648 10648
=======================================
Hits 8981 8981
Misses 1667 1667
Summary
CI turnaround has been unreliable: runs occasionally hang for up to GitHub's 6-hour default while every other job finishes in minutes. Investigation across recent runs shows this is not steady slowness; on a healthy run Windows is the fastest tier (~6m).
The problem is an intermittent hard hang in a Windows job, with nothing to bound it: no `timeout-minutes`, and no `concurrency` cancellation (superseded PR runs pile up).

Root cause of the hang: the raw-socket test helpers in `http_server_http1_tests.jl` bound their reads with socket read deadlines (`set_read_deadline!`), but on Windows a blocked `readavailable` is not reliably interrupted by the deadline, so control never returns to the loop's own timeout check and the read hangs forever. In the hung jobs, the last healthy testset before the log went silent was always `HTTP server stream handler emits chunked trailers`; the next raw read then hung.

Changes
ci.yml

- `timeout-minutes: 25` on the test job: a hang now fails in minutes, not hours (clean jobs take ≤11m, so this leaves generous headroom for cache-cold precompiles).
- `concurrency` group with `cancel-in-progress: true`: a new push cancels superseded PR runs. `master`/release pushes use the unique `run_id` as the group key, so they are never cancelled.
- `continue-on-error` for `version: 'pre'`: prerelease Julia is informational and the windows/pre combo is the most frequent hanger, so it shouldn't block merges.

test/http_server_http1_tests.jl

- Wrap `_raw_http_request` and `_raw_http_request_until_close` in the existing `_run_with_timeout` task-level watchdog (`Threads.@spawn` + `timedwait`), which does not depend on socket deadlines. A stuck read now fails the test in seconds instead of hanging the job. This covers all 10 call sites at once and is behavior-neutral on Linux/macOS (verified locally: the full `http_server_http1_tests.jl` suite passes, including the previously-hanging chunked-trailers testset).

Impact
Worst-case turnaround drops from ~6 hours → ~25 minutes, superseded runs stop piling up, and the actual Windows hang is converted into a fast, legible test failure.
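
For reference, the task-level watchdog pattern described under Changes (`Threads.@spawn` + `timedwait`) can be sketched roughly as follows. This is an illustration only: the name `run_with_timeout_sketch`, its keyword arguments, and the error it raises are hypothetical, and the real `_run_with_timeout` helper in the test suite is not shown in this PR and may differ. A `Channel` that is never written to stands in for a socket read that never returns.

```julia
# Hypothetical stand-in for the suite's `_run_with_timeout`; the real helper's
# signature and failure behavior are assumptions here.
function run_with_timeout_sketch(f; timeout_s::Real = 5.0, poll_s::Real = 0.01)
    # Run the possibly-blocking work in its own task.
    task = Threads.@spawn f()

    # Poll from the calling task. This does not rely on the blocked call
    # honoring any deadline of its own.
    status = timedwait(() -> istaskdone(task), timeout_s; pollint = poll_s)

    if status === :timed_out
        # The stuck task is abandoned, not killed; the caller just stops waiting.
        error("operation did not finish within $(timeout_s)s")
    end
    return fetch(task)  # rethrows as TaskFailedException if `f` threw
end

# A call that completes in time returns its value.
fast_result = run_with_timeout_sketch(() -> 42; timeout_s = 2.0)

# A call that blocks forever (simulated with an empty, unbuffered Channel)
# is turned into an ordinary error instead of a hang.
stuck = Channel{Int}(0)
timed_out = try
    run_with_timeout_sketch(() -> take!(stuck); timeout_s = 0.2)
    false
catch err
    err isa ErrorException
end
```

The point of the design is that the timeout check lives outside the blocked call, so it still fires when the inner read ignores its own deadline.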
Note on branch protection
The `continue-on-error` change makes `pre` non-blocking at the run level. If `Julia pre - *` jobs are currently listed as required status checks in branch protection, they should be removed from the required list in repo settings; that's a settings change this PR can't make.

🤖 Generated with Claude Code
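
For readers who want to see where the three workflow settings discussed above sit, here is a rough fragment. It is not the actual `ci.yml` from this PR: the job name `test`, the matrix keys and values, and the exact expressions for the concurrency group and the prerelease check are assumptions chosen for illustration.

```yaml
# Illustrative fragment only; not copied from this repository's ci.yml.
concurrency:
  # On pull requests `github.head_ref` is set, so pushes to the same PR branch
  # share a group and the newer run cancels the older one. On other events it
  # is empty, so the group falls back to the unique run_id and is never shared.
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true

jobs:
  test:
    runs-on: ${{ matrix.os }}
    # A hung job fails after 25 minutes instead of at the 6-hour default.
    timeout-minutes: 25
    # Prerelease Julia is informational: its failures do not fail the run.
    continue-on-error: ${{ matrix.version == 'pre' }}
    strategy:
      fail-fast: false
      matrix:
        version: ['1', 'pre']                 # assumed matrix values
        os: [ubuntu-latest, windows-latest]   # assumed matrix values
    # existing steps unchanged (omitted here)
```

With `continue-on-error` set at the job level, a failed `pre` job no longer fails the workflow run, which is why any matching required status checks have to be removed separately in the repository settings.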