ci/manifest: publish on partial failure; fix built_at=; extend retry budget#2115

Merged
widgetii merged 2 commits into master from ci/manifest-publish-on-partial-failure
May 21, 2026

widgetii commented May 21, 2026

Summary

Hotfix on top of #2111 + #2112. Three issues uncovered when the first real cron after the redesign hit a flaky board.

Context: what happened on the 2026-05-20 cron

Run 26196546684: one matrix job, hi3516av100_ultimate, exhausted its retry budget while the GitHub releases CDN / Hisilicon toolchain mirror was returning 502 Bad Gateway over a >10 min window. The other 91 of the 92 jobs succeeded, the dated release nightly-20260520-887328c and the rolling nightly each ended up with 102 assets, and the preflight gate worked correctly. (The "9 assets" figure in the user's GitHub email was a snapshot taken the moment the release was first created; assets kept accumulating as the remaining matrix jobs finished.)

Fix #1 — manifest.yml was skipped

Trigger condition was workflow_run.conclusion == 'success'. That's the whole workflow conclusion, which is failure if ANY matrix job fails. So the manifest workflow was skipped and gh-pages kept serving the empty-state placeholder while 102 valid artifacts existed on the release.

Loosened to publish on success OR failure, only skip cancelled. enrich_manifest.py reads whatever assets actually exist — a partial release just gets fewer entries in its platforms map.
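The loosened condition can be sketched as a small predicate. This is an illustration only: the real check is an expression in manifest.yml, the helper name `should_publish_manifest` is hypothetical, and treating conclusions other than success, failure and cancelled (for example `timed_out`) as "skip" is an assumption of this sketch, not something the PR states.

```python
# Hypothetical helper mirroring the loosened trigger condition.
# Assumption: only "success" and "failure" publish; everything else is skipped.
PUBLISHABLE_CONCLUSIONS = frozenset({"success", "failure"})


def should_publish_manifest(conclusion: str) -> bool:
    """Return True when the manifest should be rebuilt for this workflow run."""
    return conclusion in PUBLISHABLE_CONCLUSIONS


if __name__ == "__main__":
    for conclusion in ("success", "failure", "cancelled"):
        print(conclusion, should_publish_manifest(conclusion))
```

With the old `== 'success'` check, the second case (one failed matrix job out of 92) would have returned False, which is what left gh-pages stale.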

Fix #2 — built_at= empty in release body

${{ github.run_started_at }} rendered as empty in softprops/action-gh-release@v2's body: field (context-availability quirk). Moved the timestamp calculation into the preflight job (date -u +%Y-%m-%dT%H:%M:%SZ), exposed as a job output, referenced from both upload steps. Cosmetic-only — enrich_manifest.py already falls back to the release's created_at — but the body now has a real RFC3339 timestamp.
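A minimal sketch of the two halves of this fix, under stated assumptions: `utc_timestamp` reproduces the `date -u +%Y-%m-%dT%H:%M:%SZ` format in Python, and `resolve_built_at` shows one way a consumer could fall back to `created_at` when the `built_at=` line is empty. Both function names are hypothetical, and the assumption that `built_at=` sits on its own line of the release body is mine; the actual logic in enrich_manifest.py may differ.

```python
import re
from datetime import datetime, timezone


def utc_timestamp(now=None):
    """Format a datetime like `date -u +%Y-%m-%dT%H:%M:%SZ` does."""
    now = now or datetime.now(timezone.utc)
    return now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# Assumption: the release body carries a line of the form "built_at=<timestamp>".
_BUILT_AT = re.compile(r"^built_at=(\S+)[ \t]*$", re.MULTILINE)


def resolve_built_at(body, created_at):
    """Prefer built_at= from the release body; fall back to created_at."""
    match = _BUILT_AT.search(body or "")
    return match.group(1) if match else created_at


if __name__ == "__main__":
    print(utc_timestamp())
    # An empty built_at= line (the bug described above) triggers the fallback.
    print(resolve_built_at("built_at=\n", "2026-05-20T00:00:00Z"))
```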

Fix #3 — extend retry budget for upstream toolchain/CDN flakes

Backoffs 30 60 120 300 → 30 60 120 300 600 1200. Total attempts: 5 → 7. Max idle sleep: ~8.5 min → ~40 min. The longer tail only fires on persistent upstream issues; happy-path builds still finish on the first attempt with zero added wait.
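The budget figures follow directly from the backoff lists: one attempt before each sleep plus a final one, and the idle time is the sum of the sleeps. The sketch below checks that arithmetic and shows a generic retry loop of the same shape. The function names and the injectable `sleep` parameter are hypothetical; the CI's actual retry step is not shown in this PR and is presumably a shell loop.

```python
import time

OLD_BACKOFFS = (30, 60, 120, 300)             # seconds, before this change
NEW_BACKOFFS = (30, 60, 120, 300, 600, 1200)  # seconds, after this change


def budget(backoffs):
    """Return (total attempts, worst-case idle seconds) for a backoff list."""
    return len(backoffs) + 1, sum(backoffs)


def run_with_retries(step, backoffs, sleep=time.sleep):
    """Call step() until it returns True or the retry budget is exhausted.

    Returns the 1-based attempt number that succeeded, or None on failure.
    """
    # The trailing None marks the final attempt, which has no sleep after it.
    for attempt, delay in enumerate((*backoffs, None), start=1):
        if step():
            return attempt
        if delay is not None:
            sleep(delay)
    return None


if __name__ == "__main__":
    for name, backoffs in (("old", OLD_BACKOFFS), ("new", NEW_BACKOFFS)):
        attempts, idle = budget(backoffs)
        print(f"{name}: {attempts} attempts, {idle / 60:.1f} min max idle")
```

The old list sums to 510 s (8.5 min) and the new one to 2310 s (38.5 min, the "~40 min" quoted above). A step that succeeds on its first call returns immediately without sleeping, which is why happy-path builds see no added wait.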

Without this, even a routine GH releases CDN hiccup is enough to fail a matrix entry and (pre-Fix #1) deny manifest updates for the other 91 platforms.

Immediate remediation done

Manually dispatched manifest.yml (the workflow_dispatch path was already correct). gh-pages now serves the 102-entry index live at https://openipc.github.io/firmware/manifest.flat — lab cameras running PR #2114's sysupgrade can --list-builds and see real data.

Test plan

  • CI passes on this PR.
  • Next scheduled cron triggers manifest.yml automatically — verify gh-pages updates without needing manual dispatch.
  • Release body of next dated build shows built_at=2026-MM-DDThh:mm:ssZ.
  • If a future toolchain/CDN flake recurs, the longer backoff tail (10/20 min) gives it room to recover before declaring matrix failure.

🤖 Generated with Claude Code

Last night's cron (run 26196546684) had one matrix job fail —
hi3516av100_ultimate hit transient 502/timeouts pulling the toolchain
and exhausted its retry budget. The 91 other boards uploaded their
artifacts successfully and the dated release `nightly-20260520-887328c`
ended with 102 assets, the rolling `nightly` was a faithful mirror,
and the SHA-gate worked correctly.

But manifest.yml's trigger condition was
  github.event.workflow_run.conclusion == 'success'
which evaluates the whole workflow's conclusion (failure if any matrix
job failed). The manifest workflow was therefore SKIPPED for last
night's build and gh-pages still served the empty-state placeholder
even though 102 valid artifacts existed.

Fix #1: trigger manifest on success OR failure, only skip on
'cancelled'. enrich_manifest.py reads whatever assets actually exist
on the release, so a partial release just shows up in the index with
fewer platforms in its `platforms` map — the right behaviour.

Fix #2: built_at= rendered empty in the release body because
${{ github.run_started_at }} doesn't interpolate inside softprops/
action-gh-release@v2's `body:` field (likely a context-availability
quirk). Compute BUILT_AT in the preflight job with `date -u +...`,
expose it as a job output, reference it from both upload steps. Now
the body has a real RFC3339 timestamp.

Manually backfilled tonight's stale manifest after the issue was
identified — the workflow_dispatch path was already correct, and
manifest.flat now serves 102 entries for nightly-20260520-887328c.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add two more entries to the per-board retry backoff: 30 60 120 300 ->
30 60 120 300 600 1200. Total attempts: 5 -> 7. Max idle sleep:
~8.5 min -> ~40 min.

Motivated by hi3516av100_ultimate failing on the 2026-05-20 cron
(run 26196546684): GitHub releases CDN / toolchain mirror was
returning 502 Bad Gateway over a >10 min window, and the existing
budget exhausted before it cleared. The build itself was fine; just
upstream flake.

The longer tail only fires on persistent upstream issues — happy-path
builds still finish in their first attempt with zero added wait.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
widgetii changed the title from "ci/manifest: publish on partial-failure builds; fix empty built_at=" to "ci/manifest: publish on partial failure; fix built_at=; extend retry budget" May 21, 2026
widgetii merged commit c1cd003 into master May 21, 2026
93 checks passed
widgetii deleted the ci/manifest-publish-on-partial-failure branch May 21, 2026 05:44