CI hardening: openipc-bisect stability + enrich_manifest retry #2129
Merged
widgetii added a commit that referenced this pull request on May 23, 2026
GitHub Actions had a flaky API/token plane today (2026-05-23): two manifest workflow runs against the same upstream build (run 26331664183, commit 7a2c1b3) failed with HTTP 401 "Bad credentials" on `gh release view`, while a third run between them succeeded with the same script and the same permissions block. Pure flake.

Add a 4-attempt retry budget (delays 0, 5, 15, 40 seconds) around the `gh()` wrapper. Total wait is at most 60 s in the worst case, which is small compared to the disruption of manually re-dispatching manifest.yml whenever GH wobbles.

Discrimination:
- "release not found" / 404 → fail FAST (one attempt, ~0.7s). These are permanent, and retrying just wastes CI time.
- Everything else (401, 5xx, network) → retry with backoff.

Each attempt logs to stderr so the action log shows the retry trail. The script's caller (`manifest.yml`) is unchanged. The happy path still resolves the live 4-build manifest in <1s with no retries fired.

Note: this lands in PR #2129 alongside the openipc-bisect fixes because the user reported both as part of one "transient CI flake" debugging session. The two changes are unrelated in code but share the same root cause class: real bugs only visible after the design runs against actual production load.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
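The retry policy described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code in `enrich_manifest.py`: the names `run_with_retry`, `is_permanent`, and `RETRY_DELAYS` are hypothetical, and matching on the text "not found" / "404" in stderr is an assumed way of recognising a permanent failure.

```python
import subprocess
import sys
import time

# Seconds slept before attempts 1..4; the sum (60 s) is the worst-case wait.
RETRY_DELAYS = (0, 5, 15, 40)


def is_permanent(stderr: str) -> bool:
    """Assumed heuristic: 'not found' / 404 errors will not heal on retry."""
    text = stderr.lower()
    return "not found" in text or "404" in text


def run_with_retry(argv, run=subprocess.run, sleep=time.sleep):
    """Run a command, retrying transient failures with a fixed backoff table.

    `run` and `sleep` are injectable so the policy can be exercised
    without touching the network or actually waiting.
    """
    last = None
    for attempt, delay in enumerate(RETRY_DELAYS, start=1):
        if delay:
            sleep(delay)
        last = run(argv, capture_output=True, text=True)
        if last.returncode == 0:
            return last.stdout
        # Log every failed attempt so the action log shows the retry trail.
        print(
            f"attempt {attempt}/{len(RETRY_DELAYS)} failed: {last.stderr.strip()}",
            file=sys.stderr,
        )
        if is_permanent(last.stderr):
            break  # fail fast: retrying a 404 only wastes CI time
    raise RuntimeError(f"command failed: {' '.join(argv)}: {last.stderr.strip()}")
```

A caller would wrap each `gh` invocation, for example `run_with_retry(["gh", "release", "view", tag])`, where `tag` stands in for whatever release is being queried.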
… run

Four bugs surfaced when running the first real convergence loop on openipc-hi3520dv200.dlab.torturelabs.com (4-build window) the morning after PR #2117 landed. None of them would have been caught by the jq-against-static-manifest dry-runs done at PR time; they only emerge under real flash+reboot cycles.

Fixes:

1. **`status` had a jq error.** `(log/log(2)) | floor + 1` is not valid: jq's `log` is a zero-argument filter that reads its input from the pipe, so the one-argument call `log(2)` (the style of Python's `math.log`) is undefined. Status crashed at the JSON-construction step. Fix: compute ceil(log2(window_size)) in awk before invoking jq and pass it via --argjson.
2. **`pick_next` returned "" when 1 unverified candidate remained.** Threshold was `<= 1` instead of `== 0`. A real bisect with the wrong verdict cadence would terminate early and miss the last build that needed testing. Threshold corrected to `== 0`; with 1 unverified the index math `length / 2 | floor` correctly returns 0, selecting the lone unverified build.
3. **SSH lacked ServerAliveInterval / ServerAliveCountMax.** When sysupgrade reboots the camera, dropbear is killed without a graceful TCP close. The host's `ssh root@$host "sysupgrade ..."` in remote_flash() then sat on a zombie TCP connection until kernel keepalive (~2 hours), so `iterate()` never reached `wait_for_camera`. Added `-o ServerAliveInterval=15 -o ServerAliveCountMax=3` to the default SSH_OPTS so the host detects the dead session in ~45s and the iteration progresses normally.
4. **`start <host>` rejected `root@host`.** The contract was bare hostname (the script always SSHes as root), but the form everyone reaches for in OpenIPC docs, including the wiki article shipped alongside the original PR, is `root@host`. Now strips a leading `user@` prefix in cmd_start before everything downstream.
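The arithmetic behind these four fixes can be sketched in Python. The real script is shell plus jq and awk, so this only mirrors the logic as described above; every function name here (`estimated_steps`, `pick_next`, `strip_user_prefix`, `dead_session_seconds`) is a hypothetical stand-in rather than the script's own code.

```python
import math

# The two keepalive options added to the default SSH options (fix 3).
KEEPALIVE_OPTS = ["-o", "ServerAliveInterval=15", "-o", "ServerAliveCountMax=3"]


def estimated_steps(window_size: int) -> int:
    """Fix 1: ceil(log2(window_size)), computed outside jq."""
    return math.ceil(math.log2(window_size))


def pick_next(unverified: list) -> str:
    """Fix 2: stop only when nothing is left, not when one candidate remains."""
    if len(unverified) == 0:
        return ""
    # floor(length / 2); for a single candidate this is index 0.
    return unverified[len(unverified) // 2]


def dead_session_seconds(interval: int = 15, count_max: int = 3) -> int:
    """Fix 3: roughly how long ssh waits before giving up on a dead peer."""
    return interval * count_max


def strip_user_prefix(target: str) -> str:
    """Fix 4: accept 'root@host' as well as a bare hostname."""
    return target.split("@", 1)[1] if "@" in target else target
```

With the old `<= 1` threshold, the single-candidate case would have returned the empty string; here `pick_next(["only-build"])` returns that lone build.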
End-to-end test that found these (2026-05-23 on openipc-hi3520dv200.dlab.torturelabs.com, 4-build window):

* start picked nightly-20260522-7d32f00 (median) → camera reboot → UART noise interrupted u-boot autoboot → camera stuck at u-boot prompt → host process killed manually → user recovered camera via UART. State file on host stayed intact across the brick. After recovery, `openipc-bisect resume` correctly re-attached and prompted for verdict, which is exactly the brick-survivability promise.
* `good` verdict narrowed window to a single element and printed "Bisect complete. First bad build: nightly-20260523-7a2c1b3".

After these fixes the next end-to-end run (5+ builds in manifest) should be hands-off.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
40dc66c to 230902a
widgetii added a commit to OpenIPC/builder that referenced this pull request on May 23, 2026
PR-Bld-B in the mirror of OpenIPC/firmware's nightly redesign. Adds
the two workflows that turn dated nightly-* releases (from PR-Bld-A)
into a queryable index served via GitHub Pages.
`manifest.yml` — triggers on successful `Build` workflow_run completion
(or manual dispatch). Runs `.github/scripts/enrich_manifest.py` to
enumerate dated releases, keep the 90 newest, parse the body
(sha/short/built_at from PR-Bld-A) and asset list, then emit:
- `manifest.json` — rich JSON for hosts/agents/CI.
- `manifest.flat` — whitespace-delimited columns
`build_id platform flash size url` plus `@channel` records,
parseable by pure busybox `awk` (no jq, no jsonfilter) for
on-device sysupgrade in Phase 2.
Both committed to the `gh-pages` branch (already bootstrapped).
`cleanup.yml` — Mondays 06:00 UTC (offset by +1h from firmware's
05:00 UTC cleanup). Deletes dated nightly releases beyond the 90
newest via `gh release delete --cleanup-tag`, then re-triggers
manifest.yml.
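The "keep the 90 newest" rule shared by the aggregator and the cleanup job might look like the sketch below. It assumes dated tags have the form `nightly-YYYYMMDD-<short>`, so that plain string ordering follows the build date; the real workflow may order releases differently (for example by creation time). `split_retention` is a hypothetical name.

```python
def split_retention(tags, keep=90):
    """Return (kept, expired): the `keep` newest dated tags and the rest.

    Assumption: tags embed the date as YYYYMMDD, so sorting the strings
    in reverse puts the newest build first.
    """
    ordered = sorted(tags, reverse=True)
    return ordered[:keep], ordered[keep:]
```

Under this assumption, the `expired` list is what a cleanup job would hand to `gh release delete --cleanup-tag`, one tag at a time.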
`enrich_manifest.py` differs from firmware's only in the asset parser:
builder produces TWO filename forms (per `master.yml`'s COMMON awk
trick that counts underscores in the matrix entry name):
- **Compound** `<soc>_<variant>_<vendor>-<model>-nor.tgz` →
platform key = the full string. This is the common case
(per-device builds).
- **Simple** `openipc.<soc>-<flash>-<variant>.tgz` (firmware-style,
when matrix name has one underscore) → platform key =
`<soc>_<variant>`.
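A rough Python sketch of a two-form asset parser along these lines is shown below. It is not the parser in `enrich_manifest.py`. It assumes that "the full string" means the filename without `.tgz`, that the flash type is the last hyphen-separated token of a compound name, and the example filenames in the usage note are made up for illustration.

```python
import re

# Simple form: openipc.<soc>-<flash>-<variant>.tgz
SIMPLE = re.compile(
    r"^openipc\.(?P<soc>[^-.]+)-(?P<flash>[^-.]+)-(?P<variant>[^-.]+)\.tgz$"
)
# Compound form: <soc>_<variant>_<vendor>-<model>-<flash>.tgz
COMPOUND = re.compile(
    r"^(?P<key>[a-z0-9]+_[a-z0-9]+_.+-(?P<flash>[a-z0-9]+))\.tgz$"
)


def parse_asset(filename):
    """Return (platform_key, flash), or None for unrecognised names."""
    m = SIMPLE.match(filename)
    if m:
        # Simple names collapse to <soc>_<variant>.
        return f"{m['soc']}_{m['variant']}", m["flash"]
    m = COMPOUND.match(filename)
    if m:
        # Assumed: the key is the whole name minus the .tgz extension.
        return m["key"], m["flash"]
    return None
```

For example, `parse_asset("openipc.hi3516ev300-nor-lite.tgz")` yields the key `hi3516ev300_lite`, while a compound name such as the hypothetical `hi3516ev300_lite_acme-cam1-nor.tgz` keeps its full per-device string as the key.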
The platform key gets carried into manifest.flat as a single column.
sysupgrade's existing `awk '$2==p'` lookup works unchanged; Phase 3
(model-aware BUILD_PLATFORM field in os-release) will let cameras
filter against the full per-model key.
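For host-side tooling, the same `$2==p` lookup can be mirrored in Python. This sketch assumes only the five whitespace-delimited columns named above. The layout of `@channel` records is not described here, so they are skipped, and the `SAMPLE` text (including its `@stable` line and the `example.invalid` URLs) is invented purely for illustration.

```python
from typing import NamedTuple


class FlatRow(NamedTuple):
    build_id: str
    platform: str
    flash: str
    size: str
    url: str


def lookup(flat_text, platform):
    """Return every build row whose second column equals `platform`."""
    rows = []
    for line in flat_text.splitlines():
        fields = line.split()
        # Skip blanks, '@channel' records, and anything that is not 5 columns.
        if len(fields) != 5 or fields[0].startswith("@"):
            continue
        row = FlatRow(*fields)
        if row.platform == platform:
            rows.append(row)
    return rows


# Hypothetical sample data, not taken from the real manifest.
SAMPLE = """\
@stable nightly-20260522-7d32f00
nightly-20260522-7d32f00 hi3516ev300_lite nor 7340032 https://example.invalid/a.tgz
nightly-20260522-7d32f00 gk7205v200_lite nor 7340032 https://example.invalid/b.tgz
nightly-20260523-7a2c1b3 hi3516ev300_lite nor 7340032 https://example.invalid/c.tgz
"""
```

Because the platform key occupies a single column, the comparison is the same whether the key is a short `<soc>_<variant>` or a full per-device string.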
URLs after merge:
https://openipc.github.io/builder/manifest.json
https://openipc.github.io/builder/manifest.flat
(`gh-pages` branch + Pages config already set up; mirrors firmware-
side setup.)
Concurrency-grouped `gh-pages-manifest` so manifest.yml and
cleanup.yml can't race the gh-pages push.
Includes the retry hardening from firmware OpenIPC/firmware#2129
(4-attempt backoff for transient GitHub API 401/5xx, fast-fail on
permanent 404).
First-run behaviour validated locally: empty index emits explicit
placeholder. Parser unit-tested across 7 synthetic filenames covering
compound, simple, and edge cases.
See ~/.claude/plans/mirror-nightly-redesign-to-builder.md.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
widgetii added a commit to OpenIPC/builder that referenced this pull request on May 23, 2026
…t-* tag (#93)

PR-Bld-C in the mirror of OpenIPC/firmware's nightly redesign. Mirrors OpenIPC/firmware#2116.

Adds an optional `commit` workflow_dispatch input to `build-one.yml` so:
- `gh workflow run build-one.yml -f platform=X -f commit=<sha>` builds the given SHA from `git bisect run`.
- Output goes to a `nightly-bisect-<short>` (prerelease) tag, distinct from the dated `nightly-YYYYMMDD-<short>` namespace produced by PR-Bld-A's master.yml. The manifest aggregator's regex only indexes `^nightly-[0-9]{8}-[0-9a-f]{7}$`, so one-off bisect rebuilds never enter manifest.{json,flat}.
- Without `commit`, falls back to building HEAD and tags as `nightly-bisect-<short>-<UTC ts>` so repeated dispatches of the same HEAD don't collide.

Also:
- Retry budget around `bash builder.sh` matching master.yml + OpenIPC/firmware#2129's hardening.
- BUILD_ID / BUILD_SHA / BUILD_PLATFORM env at the Build firmware step level (forward-compat with Phase 3 of the mirror plan).
- Drops dead `env: TAG_NAME: latest` at workflow level.
- Release body carries sha/short/platform/one_off=true so the aggregator's downstream consumers can easily distinguish bisect builds from dated nightlies.

See ~/.claude/plans/mirror-nightly-redesign-to-builder.md.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
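The namespace separation relies on the aggregator's tag regex, which can be checked in isolation. The sketch below reuses the pattern quoted above; the helper name `indexable` is hypothetical, and the timestamp in the usage note is only an example of what a `<UTC ts>` suffix might look like.

```python
import re

# Pattern quoted from the commit message: dated nightlies only.
DATED = re.compile(r"^nightly-[0-9]{8}-[0-9a-f]{7}$")


def indexable(tags):
    """Keep only tags in the dated nightly namespace."""
    # fullmatch guards against a trailing newline slipping past '$'.
    return [t for t in tags if DATED.fullmatch(t)]
```

A tag such as `nightly-bisect-7d32f00` fails the eight-digit date group, and so does a timestamped variant like `nightly-bisect-7d32f00-20260523T101500Z`, so neither would be indexed.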
Summary
Two unrelated-but-related fixes that surfaced from the first 72h of the redesigned nightly pipeline running for real. Both fall under "real bugs only visible after the design runs against actual production load."
Part 1 (`contrib/openipc-bisect`): stability fixes from first end-to-end run

Four bugs in `contrib/openipc-bisect` surfaced the first time the convergence loop actually ran against a real camera (4-build window on `openipc-hi3520dv200.dlab.torturelabs.com`, after this morning's third nightly populated the manifest).

1. `status` had a jq error. `(log/log(2)) | floor + 1` is not valid: jq's `log` is a zero-argument filter that reads its input from the pipe, so the one-argument call `log(2)` is undefined. Status crashed at the JSON-construction step. Fix: compute `ceil(log2(window_size))` in `awk` before invoking jq, pass via `--argjson`.
2. `pick_next` returned "" when 1 unverified candidate remained. Threshold was `<= 1` instead of `== 0`. A real bisect with the wrong verdict cadence would terminate early and miss the last build that still needed testing. Threshold corrected; index math `length / 2 | floor` correctly returns `0` for length 1, selecting that lone candidate.
3. SSH lacked `ServerAliveInterval` / `ServerAliveCountMax`. When sysupgrade reboots the camera, dropbear is killed without a graceful TCP close. The host's `ssh root@$host "sysupgrade ..."` in `remote_flash()` sat on a zombie TCP connection until kernel keepalive (~2 hours), so `iterate()` never reached `wait_for_camera()`. Added `-o ServerAliveInterval=15 -o ServerAliveCountMax=3` to default `SSH_OPTS`.
4. `start <host>` rejected `root@host`. Contract was bare hostname, but every OpenIPC doc, including the wiki article shipped alongside #2117 (contrib: openipc-bisect, the host-side firmware bisect driver), uses `root@host`. `cmd_start` now strips a leading `user@` prefix.

The end-to-end run on hi3520dv200 that found these also accidentally proved the brick-survivability promise. Mid-bisect, UART noise interrupted u-boot's autoboot countdown: camera halted at u-boot prompt → the host SSH from `remote_flash` was orphaned to a dead socket → user recovered camera via UART → state file on host stayed intact → `openipc-bisect resume` correctly re-attached and converged on the first try.

Part 2 (`enrich_manifest.py`): retry transient `gh` API failures

GitHub Actions had a flaky API/token plane today (2026-05-23): two `manifest.yml` workflow runs against the same upstream build (run 26331664183, commit `7a2c1b3`) failed with HTTP 401 Bad credentials on `gh release view`, while a third run between them succeeded with the same script and the same permissions block. No real bug.

Added a 4-attempt retry budget (delays `0, 5, 15, 40` seconds) around the `gh()` wrapper. Discrimination:

- "release not found" / 404 → fail fast (one attempt, ~0.7 s); these are permanent.
- Everything else (401, 5xx, network) → retry with backoff.

Each attempt logs to stderr so the action log shows the retry trail. Happy path is unchanged: the 4-build manifest still resolves in <1 s with no retries fired.

Diff size

- `contrib/openipc-bisect`
- `.github/scripts/enrich_manifest.py`

Single PR, two commits, no functional dependency between them, so each is easy to revert independently if needed.
🤖 Generated with Claude Code