queue-runner and builder: fail faster on infrastructure errors by Ericson2314 · Pull Request #1748 · NixOS/hydra

Ericson2314 · 2026-05-17T05:12:54Z

WIP. See commit for preliminary details --- it is a long run on commit message because it's a bunch of separate edits tacked together, and it will probably become multiple commits once finished up.

I am hoping this will make it easier to fix the remaining issues, especially the ones that are hard to reproduce out of real-world testing

Ericson2314 · 2026-05-18T22:27:47Z

@dasJ the second commit of this I would very much like your feedback on. The general rules I am thinking are:

destination stores are always best effort (don't die if fail to upload)
queue doesn't die on bad builders
builders don't die on bad queue runner (but could change that?)
queue runner does die on bad PostgreSQL
when in doubt ? and (at the top of the call stack) die

I think all the edits here fit that pattern but I definitely didn't finish reviewing them yet.

I am also confused why we had

tracing::error!("Rejecting new machine creation: {e}");
return Err(tonic::...);

in so many places before.

dasJ · 2026-05-19T16:43:00Z

I am fully on board with most of these rules. Not dying is never a bad thing ;)

The only thing I am unsure is the "don't die if fail to upload". Wdym by that? I don't see any unwraps or expects so I assume that's "assume the build succeeded". I don't think that's what we want because it could mean we end up with an incomplete binary cache.

Ericson2314 · 2026-05-19T18:42:16Z

I think doing #1755 first will make this more clear.

The only thing I am unsure is the "don't die if fail to upload". Wdym by that?

To be clear, this was something I am keeping! It already is OK with "destination stores" being down, and I while I am not obvious to me that we want that, I didn't feel certain enough that we didn't to change it.

Ericson2314 · 2026-05-19T18:43:35Z

I am fully on board with most of these rules. Not dying is never a bad thing ;)

Also to be clear, this PR should generally be making is die more. The counter-intuitive logic of "fail fast" is that not-dying can allow simple problems to fester, only for one to go down with a more complex problem later, and so it ends up being more sysadmin/debubgging work than it would be otherwise.

The queue-runner should not try to recover from a down database or invalid store state — only builder failures should be totally recoverable since builders are numerous and expected to come and go. The builder should die if it cannot communicate with the queue-runner — the queue-runner will notice and retry the step elsewhere. **Note: Manual review of the queue runner error comments is not yet complete!** Queue-runner changes: - Critical task loops (queue monitor, dispatch) now return `JoinHandle<Result<()>>` and are joined in the main `select!` loop. If either fails, the queue-runner exits with the error. Non-critical tasks (config reloader, uploader, fod checker) remain abort-only. - `create_build` returns `Result`; an invalid drv path is a hard error (indicates a GC rooting bug) rather than silently aborting the build - `handle_previous_failure` and `handle_cached_build` errors propagate (DB unreachable = stop processing) - `process_new_builds` returns `Result` and propagates child build creation errors - Queue monitor loop propagates `get_queued_builds`, `process_queue_change`, and `handle_jobset_change` errors instead of logging and continuing - `fail_step` handles missing jobs in queue/machine as warnings (race with builder disconnect), but DB errors propagate. Callers (`remove_machine`, cancelled steps) now use `?` instead of log-and-continue. - `abort_unsupported` propagates `inner_fail_job` DB errors through the dispatch loop. Unsupported build step log level downgraded from `error` to `warn` (expected operational condition, not a bug). - Fire-and-forget gRPC handlers (`build_step_update`, `complete_build`) use `spawn_fatal` — DB errors panic the task, bringing down the queue-runner. Builder parse errors (e.g. bad `BuildOutput`) are logged and dropped without crashing. - Machine tunnel loop: `remove_machine` is called once after the loop exits instead of at every break point. Channel errors downgraded from `error` to `warn` (builder disconnections are expected). - gRPC error messages now include the underlying error instead of generic strings. Redundant `tracing::error` calls removed where `#[tracing::instrument(err)]` already logs the error. - `parse_uuid` helper deduplicates UUID parsing across gRPC handlers. - Log directory creation at startup is fatal on failure - mTLS misconfiguration uses `anyhow::ensure!` Builder changes: - `filter_missing` returns `Result` — daemon connection failure is a hard error, not "path is present" - `handle_request` errors propagate (kills the builder — queue-runner will retry) - Ping message construction failure breaks the ping stream (builder exits), with the error captured and returned from the gRPC function - `submit_build_result` helper deduplicates the retry-then-report pattern for `complete_build` calls - Build task returns `Result<()>` instead of `()` — failure to submit results after retries propagates rather than calling `process::exit(1)` - mTLS misconfiguration uses `anyhow::ensure!` - Unreachable error paths in log stream (`utils.rs`) now panic with explanatory comments - `substitute_output` S3 replication uses async-block-as-try-block pattern for clean early exit on daemon error

Ericson2314 force-pushed the fail-fast branch 2 times, most recently from 77dc98b to 706238a Compare May 18, 2026 20:18

Ericson2314 mentioned this pull request May 18, 2026

nix_utils::BaseStore: return Result from all fallible methods #1753

Merged

Ericson2314 force-pushed the fail-fast branch 2 times, most recently from e61875b to 9fab2e1 Compare May 18, 2026 22:24

Ericson2314 force-pushed the fail-fast branch 3 times, most recently from a26c728 to e2302ac Compare May 20, 2026 16:26

Ericson2314 force-pushed the fail-fast branch from e2302ac to f92c786 Compare May 20, 2026 18:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

queue-runner and builder: fail faster on infrastructure errors#1748

queue-runner and builder: fail faster on infrastructure errors#1748
Ericson2314 wants to merge 1 commit into
NixOS:masterfrom
obsidiansystems:fail-fast

Ericson2314 commented May 17, 2026 •

edited

Loading

Uh oh!

Ericson2314 commented May 18, 2026

Uh oh!

dasJ commented May 19, 2026

Uh oh!

Ericson2314 commented May 19, 2026

Uh oh!

Ericson2314 commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Ericson2314 commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Ericson2314 commented May 18, 2026

Uh oh!

dasJ commented May 19, 2026

Uh oh!

Ericson2314 commented May 19, 2026

Uh oh!

Ericson2314 commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Ericson2314 commented May 17, 2026 •

edited

Loading