feat(baker): bounded retry/backoff for HF 429 commit-rate cap (#225) #227
Merged
The HF Hub enforces ~128 commits/hour/repo. Phase-3 matrix bakes plus phase-4 derives in the same hour saturated this cap, and the gpuopen derive job died at its final manifest commit with HfHubHTTPError 429.

Changes:
- New hf_retry._create_commit_with_backoff helper wraps api.create_commit with bounded retries on 429. It parses the Retry-After header (int seconds or HTTP-date), falls back to the response body text, and uses exponential backoff with a 30s floor / 120s ceiling as the last resort. Every other exception is re-raised untouched so the existing 412 CAS retry loop keeps working.
- Emits one structured "bake_throttle" log line per retry, mirroring the format conventions in mat_vis_baker.progress (#217).
- Wired into every api.create_commit call site in hf_bake_per_file.py (per-batch, catalog/manifest CAS loop, sentinel) and hf_derive_per_file.py (per-batch, catalog/manifest CAS loop, sentinel).
- Workflow concurrency caps in bake.yml + derive.yml: replaced the per-tier group with a repo-scoped group so all writers against a given HF repo serialize. The trade-off (slower wallclock, higher reliability under the rate-limit cap) is documented in the header comments along with the narrower future-option group.

No new runtime deps: pure stdlib time/random/re/email.utils.
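For reviewers, here is a minimal sketch of how the Retry-After handling described above could look using only the stdlib. It is not the code in hf_retry.py: the function name `parse_retry_after`, the body-text pattern, and the injectable `now` parameter are assumptions made for illustration.

```python
import re
import time
from datetime import timezone
from email.utils import parsedate_to_datetime
from typing import Optional

# Assumed wording for the body-text fallback; the real HF message may differ.
_BODY_RE = re.compile(r"retry after (\d+) seconds?", re.IGNORECASE)


def parse_retry_after(
    header: Optional[str], body: str = "", now: Optional[float] = None
) -> Optional[float]:
    """Return the number of seconds to wait, or None if no hint was found."""
    now = time.time() if now is None else now
    if header:
        value = header.strip()
        if value.isdigit():
            # Form 1: delay in integer seconds.
            return float(value)
        try:
            # Form 2: HTTP-date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
            when = parsedate_to_datetime(value)
        except (TypeError, ValueError):
            when = None
        if when is not None:
            if when.tzinfo is None:
                # Treat a zone-less date as UTC rather than local time.
                when = when.replace(tzinfo=timezone.utc)
            return max(0.0, when.timestamp() - now)
    # Fallback: look for a hint in the response body text.
    match = _BODY_RE.search(body)
    if match:
        return float(match.group(1))
    return None
```

When this returns None, a caller would fall through to the exponential backoff described in the first bullet.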
This was referenced Apr 28, 2026
gerchowl added a commit that referenced this pull request Apr 30, 2026
The current `mat-vis-client 0.6.0` section was written against the tar plan (ADR-0007). That substrate was deleted in #189/#203 in favour of the per-file substrate (ADR-0012). Rewrite 0.6.0 to reflect what actually shipped, add 0.6.1 (#239) and 0.6.2 (#243), and drop the stale `vig-os/devcontainer` template noise from `Unreleased`.

Key corrections:
- per-file commits on HF (ADR-0012, #183), not tar archives
- release-manifest schema_version=3, not 2
- first per-file CalVer is v2026.04.2, not v2026.04.1
- fetch_texture is a plain GET on `<source>/<tier>/<mid>/<channel>.png`, no rowmap, no Range header
- tiers map is `{<tier>: {complete: bool}}`, no tar/rowmap keys

New 0.6.0 entry credits the supporting infrastructure that landed in the same window: streaming progress (#220), bytes-aware batching (#229), HF 429 backoff (#227), audit-orphans (#202/#222), matrix workflow_dispatch + within-run serialisation (#235/#236), TestConcurrentBakesShareTag plumbing (#231).

Closes #245.
Summary
Closes #225.
The HF Hub enforces ~128 commits / hour / repo. Phase-3 matrix bakes plus
phase-4 derives running in the same hour saturated this cap and the
gpuopen derive job died at its final manifest commit:
This PR adds a small bounded retry helper around every
`api.create_commit` call site, plus a repo-scoped GH-Actions concurrency cap so cross-source
matrix dispatches no longer race for the same per-repo budget.
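As a rough illustration of the "bounded retry on 429 only" shape, the sketch below wraps an arbitrary callable. It is not the actual `_create_commit_with_backoff`: the name `call_with_429_backoff`, the `wait_for` hook, and the injectable `sleep` are hypothetical, and the fixed 30s default wait is only a placeholder for the real Retry-After/backoff logic.

```python
import time
from typing import Any, Callable


def call_with_429_backoff(
    fn: Callable[..., Any],
    *,
    max_retries: int = 5,
    # Placeholder wait policy (assumption): a constant 30s per retry.
    wait_for: Callable[[int, Exception], float] = lambda attempt, exc: 30.0,
    sleep: Callable[[float], None] = time.sleep,
    **kwargs: Any,
) -> Any:
    """Call fn(**kwargs), retrying only when the error carries HTTP status 429."""
    attempt = 0
    while True:
        try:
            return fn(**kwargs)
        except Exception as exc:
            # Read the status defensively so non-HTTP errors pass straight through.
            status = getattr(getattr(exc, "response", None), "status_code", None)
            if status != 429 or attempt >= max_retries:
                # Other errors (e.g. a 412 CAS conflict) and an exhausted 429
                # propagate unchanged, so outer handlers behave as before.
                raise
            attempt += 1
            sleep(wait_for(attempt, exc))
```

Passing `api.create_commit` as `fn` would mirror the wiring described in this PR; with `max_retries=5` the callable runs at most six times before the original 429 is re-raised.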
Changes
- `src/mat_vis_baker/hf_retry.py` (new): `_create_commit_with_backoff(api, *, source="unknown", max_retries=5, **kwargs)`:
  - Detects 429 via `response.status_code`.
  - Parses the `Retry-After` header (integer seconds or HTTP-date).
  - Falls back to `"Retry after N seconds"` from the response body.
  - Re-raises the original `HfHubHTTPError` on exhaustion.
  - Re-raises every other exception untouched, so the existing 412 CAS retry loop in `bake_one_per_file` keeps working.
- `src/mat_vis_baker/hf_bake_per_file.py`: wired the helper into all 3 `create_commit` sites: per-batch flush, catalog + manifest CAS loop, sentinel.
- `src/mat_vis_baker/hf_derive_per_file.py`: wired the helper into all 3 `create_commit` sites: per-batch flush, catalog + manifest CAS loop, sentinel.
- `.github/workflows/bake.yml` + `.github/workflows/derive.yml`: replaced the per-tier concurrency group with a repo-scoped `<repo-id>-budget` group so all writers against a given HF repo serialize. The trade-off (slower wallclock, higher reliability under the rate-limit cap) is documented in the header comment, and the narrower future-option group is preserved as a comment.
- `tests/test_hf_retry.py` (new): 10 unit tests with pure mocks (no real HF traffic, since phase-4 derives are still running against `gerchowl/mat-vis-tst@v0.0.0-phase3`).

Structured log format
One line per retry, streaming via `print(..., flush=True)` to match the `progress.py` conventions from #217:

Regex contract (asserted in tests):
Acceptance criteria checklist (#225)
- `_create_commit_with_backoff` parses the `Retry-After` header (int seconds + HTTP-date).
- Structured `bake_throttle` log per retry, `flush=True`.
- Wired into every `api.create_commit` site in bake + derive (6 sites total).
- Repo-scoped workflow concurrency group (`cancel-in-progress: false`).
- Tests pass (`uv run --extra baker pytest tests/ --ignore=tests/e2e` → 359 passed, 5 skipped, 1 deselected*).

\* Deselected test is `tests/test_version_sync.py::test_client_runtime_version_matches_pyproject`: a pre-existing failure unrelated to this PR (`mat-vis-client` version drift) that reproduces on `dev` HEAD.

Test plan
- `uv run --extra baker --extra dev pytest tests/test_hf_retry.py -q` (10 passed).
- `uv run --extra baker --extra dev pytest tests/ -q --ignore=tests/e2e` (no regressions).
- `uv run --extra dev ruff check` + `ruff format --check` (clean).
- `yamllint .github/workflows/bake.yml .github/workflows/derive.yml` (clean).
- E2E concurrency run (`MAT_VIS_E2E_CONCURRENCY=3 uv run --extra baker pytest tests/e2e/test_per_file_roundtrip.py::TestConcurrentBakesShareTag -v`): exercised by the existing test from #210 ("test: tighten mock contract + expand e2e.yml so future #207-class bugs cannot ship") once phase-4 derives complete; expectations not changed (the helper transparently passes through successful commits).
- Workflow run against `gerchowl/mat-vis-tst` once phase-4 finishes: confirms the repo-scoped concurrency serializes as intended.

Judgment calls
- Workflow concurrency: GH Actions only supports one `concurrency:` block per workflow/job, not nested. The issue offered two options; I chose the workflow-level repo-scoped group (`<repo-id>-budget`) because it's the smallest change and most aligned with "one writer per repo at a time without cancellation". The narrower per-(source,tier) group is preserved as a comment-noted future option for when telemetry shows the 429 backoff has absorbed enough residual contention to safely re-parallelize.
- Re-raise the original 429 on exhaustion rather than wrapping it in a new exception type. This keeps the call site's existing exception-handling behaviour identical when retries genuinely run out, surfaces the real HF response to the operator, and leaves room for a future structured `RetriesExhaustedError` if anyone needs to distinguish.
- `source="unknown"` default: the helper accepts `source` as an optional kwarg so future call sites that haven't been updated still emit valid `bake_throttle` lines. Both bake and derive currently pass an explicit source.
- Exponential backoff jitter is ±10%, not ±50%: enough to desync N concurrent retriers without making the worst-case wait blow past the 120s ceiling. Real production traffic at low concurrency (≤4 matrix workers) doesn't need wider jitter.
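To make the jitter choice concrete, here is a small sketch of a clamped exponential backoff using the 30s floor, 120s ceiling, and ±10% jitter mentioned in this PR. The doubling schedule, the final clamp, and the name `backoff_seconds` are assumptions for illustration; the helper in `hf_retry.py` may compute its waits differently.

```python
import random
from typing import Optional

FLOOR_S = 30.0   # backoff floor stated in the PR description
CEIL_S = 120.0   # backoff ceiling stated in the PR description
JITTER = 0.10    # +/-10% jitter, per the judgment call above


def backoff_seconds(attempt: int, rng: Optional[random.Random] = None) -> float:
    """Wait time for the 1-based retry `attempt`: doubled, jittered, clamped."""
    source = rng if rng is not None else random
    # Assumed schedule: 30s, 60s, 120s, then capped at the ceiling.
    base = min(CEIL_S, FLOOR_S * 2 ** (attempt - 1))
    jittered = base * source.uniform(1.0 - JITTER, 1.0 + JITTER)
    # Clamp after jitter so the wait never leaves [FLOOR_S, CEIL_S].
    return max(FLOOR_S, min(CEIL_S, jittered))
```

With ±10% the second retry lands somewhere in 54s to 66s, which spreads out a handful of concurrent workers; ±50% would stretch the same retry to 30s to 90s and push later retries against the ceiling far more often.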