Skip to content

queue-runner and builder: fail faster on infrastructure errors#1748

Draft
Ericson2314 wants to merge 1 commit into
NixOS:masterfrom
obsidiansystems:fail-fast
Draft

queue-runner and builder: fail faster on infrastructure errors#1748
Ericson2314 wants to merge 1 commit into
NixOS:masterfrom
obsidiansystems:fail-fast

Conversation

@Ericson2314
Copy link
Copy Markdown
Member

@Ericson2314 Ericson2314 commented May 17, 2026

WIP. See commit for preliminary details --- it is a long run on commit message because it's a bunch of separate edits tacked together, and it will probably become multiple commits once finished up.

I am hoping this will make it easier to fix the remaining issues, especially the ones that are hard to reproduce out of real-world testing

@Ericson2314
Copy link
Copy Markdown
Member Author

@dasJ the second commit of this I would very much like your feedback on. The general rules I am thinking are:

  • destination stores are always best effort (don't die if fail to upload)
  • queue doesn't die on bad builders
  • builders don't die on bad queue runner (but could change that?)
  • queue runner does die on bad PostgreSQL
  • when in doubt ? and (at the top of the call stack) die

I think all the edits here fit that pattern but I definitely didn't finish reviewing them yet.

I am also confused why we had

tracing::error!("Rejecting new machine creation: {e}");
return Err(tonic::...);

in so many places before.

@dasJ
Copy link
Copy Markdown
Member

dasJ commented May 19, 2026

I am fully on board with most of these rules. Not dying is never a bad thing ;)

The only thing I am unsure is the "don't die if fail to upload". Wdym by that? I don't see any unwraps or expects so I assume that's "assume the build succeeded". I don't think that's what we want because it could mean we end up with an incomplete binary cache.

@Ericson2314
Copy link
Copy Markdown
Member Author

I think doing #1755 first will make this more clear.

The only thing I am unsure is the "don't die if fail to upload". Wdym by that?

To be clear, this was something I am keeping! It already is OK with "destination stores" being down, and I while I am not obvious to me that we want that, I didn't feel certain enough that we didn't to change it.

@Ericson2314
Copy link
Copy Markdown
Member Author

I am fully on board with most of these rules. Not dying is never a bad thing ;)

Also to be clear, this PR should generally be making is die more. The counter-intuitive logic of "fail fast" is that not-dying can allow simple problems to fester, only for one to go down with a more complex problem later, and so it ends up being more sysadmin/debubgging work than it would be otherwise.

@Ericson2314 Ericson2314 force-pushed the fail-fast branch 3 times, most recently from a26c728 to e2302ac Compare May 20, 2026 16:26
The queue-runner should not try to recover from a down database or
invalid store state — only builder failures should be totally
recoverable since builders are numerous and expected to come and go.

The builder should die if it cannot communicate with the queue-runner —
the queue-runner will notice and retry the step elsewhere.

**Note: Manual review of the queue runner error comments is not yet
complete!**

Queue-runner changes:

- Critical task loops (queue monitor, dispatch) now return
  `JoinHandle<Result<()>>` and are joined in the main `select!` loop.
  If either fails, the queue-runner exits with the error.
  Non-critical tasks (config reloader, uploader, fod checker) remain
  abort-only.

- `create_build` returns `Result`; an invalid drv path is a hard error
  (indicates a GC rooting bug) rather than silently aborting the build

- `handle_previous_failure` and `handle_cached_build` errors propagate
  (DB unreachable = stop processing)

- `process_new_builds` returns `Result` and propagates child build
  creation errors

- Queue monitor loop propagates `get_queued_builds`,
  `process_queue_change`, and `handle_jobset_change` errors instead of
  logging and continuing

- `fail_step` handles missing jobs in queue/machine as warnings (race
  with builder disconnect), but DB errors propagate. Callers
  (`remove_machine`, cancelled steps) now use `?` instead of
  log-and-continue.

- `abort_unsupported` propagates `inner_fail_job` DB errors through
  the dispatch loop. Unsupported build step log level downgraded from
  `error` to `warn` (expected operational condition, not a bug).

- Fire-and-forget gRPC handlers (`build_step_update`, `complete_build`)
  use `spawn_fatal` — DB errors panic the task, bringing down the
  queue-runner. Builder parse errors (e.g. bad `BuildOutput`) are
  logged and dropped without crashing.

- Machine tunnel loop: `remove_machine` is called once after the loop
  exits instead of at every break point. Channel errors downgraded
  from `error` to `warn` (builder disconnections are expected).

- gRPC error messages now include the underlying error instead of
  generic strings. Redundant `tracing::error` calls removed where
  `#[tracing::instrument(err)]` already logs the error.

- `parse_uuid` helper deduplicates UUID parsing across gRPC handlers.

- Log directory creation at startup is fatal on failure

- mTLS misconfiguration uses `anyhow::ensure!`

Builder changes:

- `filter_missing` returns `Result` — daemon connection failure is
  a hard error, not "path is present"

- `handle_request` errors propagate (kills the builder — queue-runner
  will retry)

- Ping message construction failure breaks the ping stream (builder
  exits), with the error captured and returned from the gRPC function

- `submit_build_result` helper deduplicates the retry-then-report
  pattern for `complete_build` calls

- Build task returns `Result<()>` instead of `()` — failure to submit
  results after retries propagates rather than calling
  `process::exit(1)`

- mTLS misconfiguration uses `anyhow::ensure!`

- Unreachable error paths in log stream (`utils.rs`) now panic with
  explanatory comments

- `substitute_output` S3 replication uses async-block-as-try-block
  pattern for clean early exit on daemon error
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants