ci/manifest: publish on partial failure; fix built_at=; extend retry budget #2115
Merged
Last night's cron (run 26196546684) had one matrix job fail: hi3516av100_ultimate hit transient 502/timeouts pulling the toolchain and exhausted its retry budget. The 91 other boards uploaded their artifacts successfully, the dated release `nightly-20260520-887328c` ended with 102 assets, the rolling `nightly` was a faithful mirror, and the SHA-gate worked correctly.

But manifest.yml's trigger condition was `github.event.workflow_run.conclusion == 'success'`, which evaluates the whole workflow's conclusion (failure if any matrix job failed). The manifest workflow was therefore SKIPPED for last night's build, and gh-pages still served the empty-state placeholder even though 102 valid artifacts existed.

Fix #1: trigger manifest on success OR failure, only skip on 'cancelled'. enrich_manifest.py reads whatever assets actually exist on the release, so a partial release just shows up in the index with fewer platforms in its `platforms` map, which is the right behaviour.

Fix #2: built_at= rendered empty in the release body because `${{ github.run_started_at }}` doesn't interpolate inside softprops/action-gh-release@v2's `body:` field (likely a context-availability quirk). Compute BUILT_AT in the preflight job with `date -u +...`, expose it as a job output, reference it from both upload steps. Now the body has a real RFC3339 timestamp.

Manually backfilled tonight's stale manifest after the issue was identified. The workflow_dispatch path was already correct, and manifest.flat now serves 102 entries for nightly-20260520-887328c.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
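The loosened trigger from Fix #1 can be read as a small predicate over the upstream run's conclusion. The sketch below is illustrative only: the real change lives in the `if:` expression of manifest.yml, and the names `PUBLISHABLE_CONCLUSIONS` and `should_publish_manifest` are hypothetical. It also assumes an allow-list, so conclusions other than `success` and `failure` (for example `cancelled`) do not publish.

```python
# Hypothetical helper mirroring the loosened trigger condition.
# The strings follow the workflow_run conclusion values named in the commit.
PUBLISHABLE_CONCLUSIONS = {"success", "failure"}


def should_publish_manifest(conclusion: str) -> bool:
    """Return True when the manifest should be rebuilt for this run.

    A 'failure' conclusion still publishes, because a single failed matrix
    job marks the whole workflow as failed even when every other board
    uploaded its assets.
    """
    return conclusion in PUBLISHABLE_CONCLUSIONS


if __name__ == "__main__":
    for conclusion in ("success", "failure", "cancelled"):
        print(conclusion, should_publish_manifest(conclusion))
```

Under the old condition only `success` would have passed, which is why one flaky board blocked the index for the other 91.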
Add two more entries to the per-board retry backoff: 30 60 120 300 -> 30 60 120 300 600 1200. Total attempts: 5 -> 7. Max idle sleep: ~8.5 min -> ~40 min.

Motivated by hi3516av100_ultimate failing on the 2026-05-20 cron (run 26196546684): GitHub releases CDN / toolchain mirror was returning 502 Bad Gateway over a >10 min window, and the existing budget exhausted before it cleared. The build itself was fine; just upstream flake.

The longer tail only fires on persistent upstream issues; happy-path builds still finish in their first attempt with zero added wait.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Hotfix on top of #2111 + #2112. Three issues uncovered when the first real cron after the redesign hit a flaky board.
Context: what happened on the 2026-05-20 cron
Run 26196546684: one matrix job, `hi3516av100_ultimate`, exhausted its retry budget while the GitHub releases CDN / Hisilicon toolchain mirror was returning 502 Bad Gateway over a >10 min window. 91 of 92 jobs succeeded, the dated release `nightly-20260520-887328c` and the rolling `nightly` both ended up with 102 assets each, and the preflight gate worked correctly. (The user's GH email "9 assets" was a snapshot taken the moment the release was first created; assets continued piling on as the remaining matrix jobs finished.)

Fix #1: manifest.yml was skipped
Trigger condition was `workflow_run.conclusion == 'success'`. That's the whole workflow conclusion, which is `failure` if ANY matrix job fails. So the manifest workflow was skipped and gh-pages kept serving the empty-state placeholder while 102 valid artifacts existed on the release.

Loosened to publish on `success` OR `failure`, only skip `cancelled`. `enrich_manifest.py` reads whatever assets actually exist, so a partial release just gets fewer entries in its `platforms` map.

Fix #2: `built_at=` empty in release body

`${{ github.run_started_at }}` rendered as empty in softprops/action-gh-release@v2's `body:` field (context-availability quirk). Moved the timestamp calculation into the preflight job (`date -u +%Y-%m-%dT%H:%M:%SZ`), exposed as a job output, referenced from both upload steps. Cosmetic-only, since `enrich_manifest.py` already falls back to the release's `created_at`, but the body now has a real RFC3339 timestamp.

Fix #3: extend retry budget for upstream toolchain/CDN flakes
Backoffs `30 60 120 300` → `30 60 120 300 600 1200`. Total attempts: 5 → 7. Max idle sleep: ~8.5 min → ~40 min. The longer tail only fires on persistent upstream issues; happy-path builds still finish on the first attempt with zero added wait.

Without this, even a routine GH releases CDN hiccup is enough to fail a matrix entry and (pre-Fix #1) deny manifest updates for the other 91 platforms.
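The attempt and idle-time figures for Fix #3 follow directly from the two backoff lists. The sketch below redoes that arithmetic, assuming one attempt before each sleep plus a final attempt after the last one (which matches the 5 → 7 count); the function name `budget` is made up for illustration.

```python
# Backoff schedules in seconds, as listed for Fix #3.
OLD_BACKOFFS = [30, 60, 120, 300]
NEW_BACKOFFS = [30, 60, 120, 300, 600, 1200]


def budget(backoffs):
    """Return (total attempts, total idle sleep in seconds).

    Assumes one initial attempt and one retry after each sleep.
    """
    return len(backoffs) + 1, sum(backoffs)


if __name__ == "__main__":
    for name, schedule in (("old", OLD_BACKOFFS), ("new", NEW_BACKOFFS)):
        attempts, idle = budget(schedule)
        print(f"{name}: {attempts} attempts, {idle} s idle ({idle / 60:.1f} min)")
    # old: 5 attempts, 510 s idle (8.5 min)
    # new: 7 attempts, 2310 s idle (38.5 min)
```

The old schedule's 8.5 min of total sleep is shorter than the >10 min outage window, while the new total of 38.5 min (rounded to ~40 min above) leaves room for it to clear.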
Immediate remediation done
Manually dispatched `manifest.yml` (the `workflow_dispatch` path was already correct). gh-pages now serves the 102-entry index live at https://openipc.github.io/firmware/manifest.flat, so lab cameras running PR #2114's sysupgrade can `--list-builds` and see real data.

Test plan
`built_at=2026-MM-DDThh:mm:ssZ`.

🤖 Generated with Claude Code
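One way to check the `built_at=` item from the test plan is to parse the field out of a release body and confirm it matches the `%Y-%m-%dT%H:%M:%SZ` shape produced by the preflight job. This is a minimal sketch, not part of the PR: `parse_built_at` is a hypothetical helper, and it assumes `built_at=` sits on a line of its own in the body.

```python
import re
from datetime import datetime, timezone

# Assumption: the release body carries "built_at=<timestamp>" on its own line.
BUILT_AT_RE = re.compile(r"^built_at=(\S*)\s*$", re.MULTILINE)


def parse_built_at(body: str):
    """Return built_at as an aware UTC datetime, or None.

    None covers a missing field, an empty value (the symptom before
    Fix #2), and a value not in the expected format.
    """
    match = BUILT_AT_RE.search(body)
    if not match or not match.group(1):
        return None
    try:
        parsed = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        return None
    return parsed.replace(tzinfo=timezone.utc)


if __name__ == "__main__":
    # Both bodies are made-up examples, not real release text.
    print(parse_built_at("built_at=2026-05-21T02:13:45Z\n"))
    print(parse_built_at("built_at=\n"))
```

A `None` result on a fresh nightly release body would indicate the empty-field symptom has returned.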