
CI hardening: openipc-bisect stability + enrich_manifest retry#2129

Merged: widgetii merged 2 commits into master from fix/openipc-bisect-stability on May 23, 2026


@widgetii widgetii commented May 23, 2026

Summary

Two fixes, unrelated in code but of the same class, that surfaced during the first 72 h of the redesigned nightly pipeline running for real: bugs only visible once the design runs against actual production load.


Part 1 — contrib/openipc-bisect: stability fixes from first end-to-end run

Four bugs in contrib/openipc-bisect surfaced the first time the convergence loop actually ran against a real camera (4-build window on openipc-hi3520dv200.dlab.torturelabs.com, after this morning's third nightly populated the manifest).

  1. status had a jq compile error. (log/log(2)) | floor + 1 does not compile: jq's log is a zero-argument filter applied to its input (2 | log), so the call form log(2) is undefined. Status crashed at the JSON-construction step. Fix: compute ceil(log2(window_size)) in awk before invoking jq, pass via --argjson.
  2. pick_next returned "" when 1 unverified candidate remained. Threshold was <= 1 instead of == 0. A real bisect with the wrong verdict cadence would terminate early and miss the last build that still needed testing. Threshold corrected; index math length / 2 | floor correctly returns 0 for length 1, selecting that lone candidate.
  3. SSH lacked ServerAliveInterval / ServerAliveCountMax. When sysupgrade reboots the camera, dropbear is killed without a graceful TCP close. The host's ssh root@$host "sysupgrade ..." in remote_flash() sat on a zombie TCP connection until kernel keepalive (~2 hours) — iterate() never reached wait_for_camera(). Added -o ServerAliveInterval=15 -o ServerAliveCountMax=3 to default SSH_OPTS.
  4. start <host> rejected root@host. Contract was bare hostname, but every OpenIPC doc, including the wiki article shipped alongside #2117 (contrib: openipc-bisect, host-side firmware bisect driver), uses root@host. cmd_start now strips a leading user@ prefix.
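
Fixes 1 and 2 can be sketched in a few lines of Python. This is an illustration only, not the script's code (the tool itself uses awk and jq). The function names `max_steps` and `pick_next` follow the description above, while the plain list of build IDs as input is an assumption.

```python
from typing import List, Optional


def max_steps(window_size: int) -> int:
    """Upper bound on bisect iterations: ceil(log2(window_size))."""
    if window_size < 1:
        raise ValueError("window_size must be >= 1")
    # Integer arithmetic avoids floating-point rounding near powers of two.
    return (window_size - 1).bit_length()


def pick_next(unverified: List[str]) -> Optional[str]:
    """Return the median unverified build, or None when nothing is left."""
    # The buggy version stopped at `<= 1`, which skipped a lone candidate.
    if len(unverified) == 0:
        return None
    # For a single candidate, 1 // 2 == 0 selects that candidate.
    return unverified[len(unverified) // 2]
```

With one unverified build left, `pick_next` returns that build rather than ending the bisect, which is the behaviour the corrected threshold is meant to give.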

The end-to-end run on hi3520dv200 that found these also accidentally proved the brick-survivability promise. Mid-bisect, UART noise interrupted u-boot's autoboot countdown — camera halted at u-boot prompt → the host SSH from remote_flash was orphaned to a dead socket → user recovered camera via UART → state file on host stayed intact → openipc-bisect resume correctly re-attached and converged on the first try.


Part 2 — enrich_manifest.py: retry transient gh API failures

GitHub Actions had a flaky API/token plane today (2026-05-23): two manifest.yml workflow runs against the same upstream build (run 26331664183, commit 7a2c1b3) failed with HTTP 401 Bad credentials on gh release view, while a third run between them succeeded — same script, same permissions block, no real bug.

Added a 4-attempt retry budget (delays 0, 5, 15, 40 seconds) around the gh() wrapper. Discrimination:

  • "release not found" / 404 → fail fast (one attempt, ~0.7 s). Permanent failures don't deserve retry budget.
  • Everything else (401, 5xx, network) → retry with backoff.

Each attempt logs to stderr so the action log shows the retry trail. Happy path is unchanged — 4-build manifest still resolves in <1 s with no retries fired.
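
The retry policy described above can be sketched as follows. This is a minimal illustration, not the code in enrich_manifest.py: the name `run_with_retry`, the substring matching in `is_permanent`, and the injectable `sleep` and `run` parameters are assumptions made so the sketch is self-contained and testable.

```python
import subprocess
import sys
import time

RETRY_DELAYS = (0, 5, 15, 40)  # seconds to wait before attempts 1..4
# Assumed substrings that mark a permanent failure in gh's stderr.
PERMANENT_MARKERS = ("release not found", "404")


def is_permanent(stderr: str) -> bool:
    """True when the error text looks like a permanent 'not found' failure."""
    text = stderr.lower()
    return any(marker in text for marker in PERMANENT_MARKERS)


def run_with_retry(cmd, delays=RETRY_DELAYS, sleep=time.sleep, run=subprocess.run):
    """Run cmd, retrying transient failures and failing fast on permanent ones."""
    last = None
    for attempt, delay in enumerate(delays, start=1):
        if delay:
            sleep(delay)
        last = run(cmd, capture_output=True, text=True)
        if last.returncode == 0:
            return last.stdout
        # Log every failed attempt so the CI log shows the retry trail.
        print(f"attempt {attempt}/{len(delays)} failed: {last.stderr.strip()}",
              file=sys.stderr)
        if is_permanent(last.stderr):
            break  # retrying a missing release would only waste CI time
    raise RuntimeError(f"command failed: {last.stderr.strip()}")
```

A call such as `run_with_retry(["gh", "release", "view", tag])` would then wait at most 0 + 5 + 15 + 40 = 60 seconds in total before giving up, where `tag` stands for whatever release tag the caller is resolving.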


Diff size

| File | Lines |
| --- | --- |
| contrib/openipc-bisect | +12 / -5 |
| .github/scripts/enrich_manifest.py | +28 / -2 |

Single PR, two commits, no functional dependency between them — easy to revert independently if needed.

🤖 Generated with Claude Code

widgetii added a commit that referenced this pull request May 23, 2026
GitHub Actions had a flaky API/token plane today (2026-05-23): two
manifest workflow runs against the same upstream build (run
26331664183, commit 7a2c1b3) failed with HTTP 401 "Bad credentials" on
`gh release view`, while a third run between them succeeded — same
script, same permissions block. Pure flake.

Add a 4-attempt retry budget (delays 0, 5, 15, 40 seconds) around the
`gh()` wrapper. Total wait ≤60 s in the worst case, which is small
compared to the disruption of having to manually re-dispatch
manifest.yml whenever GH wobbles.

Discrimination:
  - "release not found" / 404 → fail FAST (one attempt, ~0.7s).
    These are permanent and retrying just wastes CI time.
  - Everything else (401, 5xx, network) → retry with backoff.

Each attempt logs to stderr so the action log shows the retry trail.

The script's caller (`manifest.yml`) is unchanged. The happy path
still resolves the live 4-build manifest in <1s with no retries fired.

Note: this lands in PR #2129 alongside the openipc-bisect fixes
because the user reported both as part of one "transient CI flake"
debugging session. The two changes are unrelated in code but share
the same root cause class: real bugs only visible after the design
runs against actual production load.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@widgetii widgetii changed the title from "contrib/openipc-bisect: stability fixes uncovered by first end-to-end run" to "CI hardening: openipc-bisect stability + enrich_manifest retry" May 23, 2026
widgetii and others added 2 commits May 23, 2026 21:53
… run

Four bugs surfaced when running the first real convergence loop on
openipc-hi3520dv200.dlab.torturelabs.com (4-build window) the morning
after PR #2117 landed. None of them would have been caught by the
jq-against-static-manifest dry-runs done at PR time; they only emerge
under real flash+reboot cycles. Fixes:

1. **`status` had a jq compile error.** `(log/log(2)) | floor + 1`
   does not compile: jq's `log` is a zero-argument filter applied to
   its input (`2 | log`), not a Python-style `log(2)` call. Status
   crashed at the JSON-construction step. Fix: compute
   ceil(log2(window_size)) in awk before invoking jq and pass via
   --argjson.

2. **`pick_next` returned "" when 1 unverified candidate remained.**
   Threshold was `<= 1` instead of `== 0`. A real bisect with the
   wrong verdict cadence would terminate early and miss the last
   build that needed testing. Threshold corrected to `== 0`; with
   1 unverified the index math `length / 2 | floor` correctly
   returns 0, selecting the lone unverified build.

3. **SSH lacked ServerAliveInterval / ServerAliveCountMax.**
   When sysupgrade reboots the camera, dropbear is killed without a
   graceful TCP close. The host's `ssh root@$host "sysupgrade ..."`
   in remote_flash() then sat on a zombie TCP connection until kernel
   keepalive (~2 hours) — `iterate()` never reached `wait_for_camera`.
   Added `-o ServerAliveInterval=15 -o ServerAliveCountMax=3` to the
   default SSH_OPTS so the host detects the dead session in ~45s and
   the iteration progresses normally.

4. **`start <host>` rejected `root@host`.** The contract was bare
   hostname (the script always SSHes as root), but the form everyone
   reaches for in OpenIPC docs — including the wiki article shipped
   alongside the original PR — is `root@host`. Now strips a leading
   `user@` prefix in cmd_start before everything downstream.
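
Fixes 3 and 4 can be illustrated with a short Python sketch. The actual tool is a shell script, so the helper names here (`strip_user_prefix`, `dead_session_timeout`, `ssh_command`) are hypothetical. Only the two OpenSSH options and their values come from the text above.

```python
DEFAULT_SSH_OPTS = [
    "-o", "ServerAliveInterval=15",   # send a keepalive probe every 15 s
    "-o", "ServerAliveCountMax=3",    # give up after 3 unanswered probes
]


def strip_user_prefix(target: str) -> str:
    """Accept both 'root@host' and 'host'; return the bare hostname."""
    _user, sep, host = target.partition("@")
    return host if sep else target


def dead_session_timeout(interval: int, count_max: int) -> int:
    """Approximate seconds until ssh notices a silently dead peer."""
    return interval * count_max


def ssh_command(target: str, remote_cmd: str, opts=DEFAULT_SSH_OPTS) -> list:
    """Build the argv for an ssh call that always logs in as root."""
    host = strip_user_prefix(target)
    return ["ssh", *opts, f"root@{host}", remote_cmd]
```

With the values above, `dead_session_timeout(15, 3)` gives 45, matching the ~45s detection time quoted in fix 3, compared with roughly two hours when relying on kernel keepalive alone.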

End-to-end test that found these (2026-05-23 on
openipc-hi3520dv200.dlab.torturelabs.com, 4-build window):
* start picked nightly-20260522-7d32f00 (median) → camera reboot →
  UART noise interrupted u-boot autoboot → camera stuck at u-boot
  prompt → host process killed manually → user recovered camera via
  UART. State file on host stayed intact across the brick. After
  recovery, `openipc-bisect resume` correctly re-attached and
  prompted for verdict — exactly the brick-survivability promise.
* `good` verdict narrowed window to a single element and printed
  "Bisect complete. First bad build: nightly-20260523-7a2c1b3".

After these fixes the next end-to-end run (5+ builds in manifest)
should be hands-off.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@widgetii widgetii force-pushed the fix/openipc-bisect-stability branch from 40dc66c to 230902a May 23, 2026 18:53
@widgetii widgetii merged commit bc53e4f into master May 23, 2026
186 of 189 checks passed
@widgetii widgetii deleted the fix/openipc-bisect-stability branch May 23, 2026 19:31
widgetii added a commit to OpenIPC/builder that referenced this pull request May 23, 2026
PR-Bld-B in the mirror of OpenIPC/firmware's nightly redesign. Adds
the two workflows that turn dated nightly-* releases (from PR-Bld-A)
into a queryable index served via GitHub Pages.

`manifest.yml` — triggers on successful `Build` workflow_run completion
(or manual dispatch). Runs `.github/scripts/enrich_manifest.py` to
enumerate dated releases, keep the 90 newest, parse the body
(sha/short/built_at from PR-Bld-A) and asset list, then emit:
  - `manifest.json` — rich JSON for hosts/agents/CI.
  - `manifest.flat` — whitespace-delimited columns
    `build_id platform flash size url` plus `@channel` records,
    parseable by pure busybox `awk` (no jq, no jsonfilter) for
    on-device sysupgrade in Phase 2.
Both committed to the `gh-pages` branch (already bootstrapped).
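
As a rough host-side illustration of the `manifest.flat` layout, the Python sketch below parses the five whitespace-delimited columns named above. The sample rows in the usage note are invented. The internal layout of `@channel` records is not given here, so the sketch keeps them as raw field lists. The names `parse_flat` and `find_platform` are hypothetical.

```python
FLAT_COLUMNS = ("build_id", "platform", "flash", "size", "url")


def parse_flat(text: str):
    """Split manifest.flat text into build rows and raw @channel records."""
    builds, channels = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("@"):
            channels.append(fields)  # layout assumed unknown; kept as-is
            continue
        if len(fields) != len(FLAT_COLUMNS):
            continue  # skip rows that do not match the 5-column layout
        builds.append(dict(zip(FLAT_COLUMNS, fields)))
    return builds, channels


def find_platform(builds, platform: str):
    """Filter rows by the platform column, like awk's `$2==p`."""
    return [b for b in builds if b["platform"] == platform]
```

For example, feeding it a line such as `nightly-20260523-7a2c1b3 hi3516ev300_lite nor 1234 https://example.invalid/fw.tgz` (made-up values) produces one build row whose `platform` is `hi3516ev300_lite`.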

`cleanup.yml` — Mondays 06:00 UTC (offset by +1h from firmware's
05:00 UTC cleanup). Deletes dated nightly releases beyond the 90
newest via `gh release delete --cleanup-tag`, then re-triggers
manifest.yml.

`enrich_manifest.py` differs from firmware's only in the asset parser:
builder produces TWO filename forms (per `master.yml`'s COMMON awk
trick that counts underscores in the matrix entry name):
  - **Compound** `<soc>_<variant>_<vendor>-<model>-nor.tgz` →
    platform key = the full string. This is the common case
    (per-device builds).
  - **Simple** `openipc.<soc>-<flash>-<variant>.tgz` (firmware-style,
    when matrix name has one underscore) → platform key =
    `<soc>_<variant>`.
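
The two filename forms could be told apart roughly as in the Python sketch below. It is not the parser from `enrich_manifest.py`. The function name `platform_key` is hypothetical, and reading "the full string" as the filename with its `-nor.tgz` suffix removed is an assumption.

```python
import re
from typing import Optional

# Simple form: openipc.<soc>-<flash>-<variant>.tgz
SIMPLE_RE = re.compile(
    r"^openipc\.(?P<soc>[^-]+)-(?P<flash>[^-]+)-(?P<variant>[^.]+)\.tgz$"
)
COMPOUND_SUFFIX = "-nor.tgz"


def platform_key(filename: str) -> Optional[str]:
    """Derive the platform key from an asset filename, or None if unknown."""
    m = SIMPLE_RE.match(filename)
    if m:
        return f"{m['soc']}_{m['variant']}"
    # Compound form: <soc>_<variant>_<vendor>-<model>-nor.tgz
    if filename.endswith(COMPOUND_SUFFIX) and filename.count("_") >= 2:
        # Assumption: the key is everything before the "-nor.tgz" suffix.
        return filename[: -len(COMPOUND_SUFFIX)]
    return None
```

Under these assumptions, an invented name such as `openipc.hi3516ev300-nor-lite.tgz` maps to `hi3516ev300_lite`, while a compound name keeps its vendor and model in the key.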

The platform key gets carried into manifest.flat as a single column.
sysupgrade's existing `awk '$2==p'` lookup works unchanged; Phase 3
(model-aware BUILD_PLATFORM field in os-release) will let cameras
filter against the full per-model key.

URLs after merge:
  https://openipc.github.io/builder/manifest.json
  https://openipc.github.io/builder/manifest.flat

(`gh-pages` branch + Pages config already set up; mirrors firmware-
side setup.)

Concurrency-grouped `gh-pages-manifest` so manifest.yml and
cleanup.yml can't race the gh-pages push.

Includes the retry hardening from firmware OpenIPC/firmware#2129
(4-attempt backoff for transient GitHub API 401/5xx, fast-fail on
permanent 404).

First-run behaviour validated locally: empty index emits explicit
placeholder. Parser unit-tested across 7 synthetic filenames covering
compound, simple, and edge cases.

See ~/.claude/plans/mirror-nightly-redesign-to-builder.md.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
widgetii added a commit to OpenIPC/builder that referenced this pull request May 23, 2026
…t-* tag (#93)

PR-Bld-C in the mirror of OpenIPC/firmware's nightly redesign.
Mirrors OpenIPC/firmware#2116. Adds an optional `commit`
workflow_dispatch input to `build-one.yml` so:

  - `gh workflow run build-one.yml -f platform=X -f commit=<sha>`
    builds the given SHA from `git bisect run`.
  - Output goes to `nightly-bisect-<short>` (prerelease) tag, distinct
    from the dated `nightly-YYYYMMDD-<short>` namespace produced by
    PR-Bld-A's master.yml. The manifest aggregator's regex only
    indexes `^nightly-[0-9]{8}-[0-9a-f]{7}$`, so one-off bisect
    rebuilds never enter manifest.{json,flat}.
  - Without `commit`, falls back to building HEAD and tags as
    `nightly-bisect-<short>-<UTC ts>` so repeated dispatches of the
    same HEAD don't collide.
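
The namespace separation can be checked with a small Python sketch that applies the aggregator's regex quoted above. The helper name `is_indexed` and the timestamp format in the usage note are assumptions made for illustration.

```python
import re

# Regex quoted in the text: only dated nightlies are indexed.
DATED_NIGHTLY = re.compile(r"^nightly-[0-9]{8}-[0-9a-f]{7}$")


def is_indexed(tag: str) -> bool:
    """True when a release tag would enter manifest.{json,flat}."""
    # fullmatch avoids Python's `$` also matching before a trailing newline.
    return DATED_NIGHTLY.fullmatch(tag) is not None
```

Here `is_indexed("nightly-20260523-7a2c1b3")` is true, while `nightly-bisect-7a2c1b3` and a timestamped variant such as `nightly-bisect-7a2c1b3-20260523T185300Z` are both rejected, because `bisect` is not an 8-digit date.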

Also:
  - Retry budget around `bash builder.sh` matching master.yml +
    OpenIPC/firmware#2129's hardening.
  - BUILD_ID / BUILD_SHA / BUILD_PLATFORM env at the Build firmware
    step level (forward-compat with Phase 3 of the mirror plan).
  - Drops dead `env: TAG_NAME: latest` at workflow level.
  - Release body carries sha/short/platform/one_off=true for the
    aggregator's downstream consumers to easily distinguish bisect
    builds from dated nightlies.

See ~/.claude/plans/mirror-nightly-redesign-to-builder.md.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>