
feat(baker): bounded retry/backoff for HF 429 commit-rate cap (#225)#227

Merged
gerchowl merged 1 commit into dev from feature/225-hf-429-backoff
Apr 28, 2026

Conversation

@gerchowl
Contributor

Summary

Closes #225.

The HF Hub enforces ~128 commits / hour / repo. Phase-3 matrix bakes plus
phase-4 derives running in the same hour saturated this cap and the
gpuopen derive job died at its final manifest commit:

huggingface_hub.errors.HfHubHTTPError: 429 Client Error: Too Many Requests
You have exceeded the rate limit for repository commits (128 per hour).
Retry after 30 seconds (647/1000 requests remaining in current 300s window).

This PR adds a small bounded retry helper around every api.create_commit
call site, plus a repo-scoped GH-Actions concurrency cap so cross-source
matrix dispatches no longer race for the same per-repo budget.

Changes

  • src/mat_vis_baker/hf_retry.py (new) — _create_commit_with_backoff(api, *, source="unknown", max_retries=5, **kwargs):
    • Detects 429 via response.status_code.
    • Parses Retry-After header (integer seconds or HTTP-date).
    • Falls back to parsing "Retry after N seconds" from the response body.
    • Falls back to exponential backoff with ±10% jitter, floored at 30s, capped at 120s.
    • Bounded at 5 attempts; re-raises original HfHubHTTPError on exhaustion.
    • Re-raises every other exception unchanged so the existing 412 CAS retry loop in bake_one_per_file keeps working.
  • src/mat_vis_baker/hf_bake_per_file.py — wired the helper into all 3 create_commit sites: per-batch flush, catalog + manifest CAS loop, sentinel.
  • src/mat_vis_baker/hf_derive_per_file.py — wired the helper into all 3 create_commit sites: per-batch flush, catalog + manifest CAS loop, sentinel.
  • .github/workflows/bake.yml + .github/workflows/derive.yml — replaced the per-tier concurrency group with a repo-scoped <repo-id>-budget group so all writers against a given HF repo serialize. The trade-off (slower wallclock, higher reliability under the rate-limit cap) is documented in a header comment, and the narrower per-(source, tier) group is preserved there as a commented-out future option.
  • tests/test_hf_retry.py (new) — 10 unit tests with pure mocks (no real HF traffic, since phase-4 derives are still running against gerchowl/mat-vis-tst@v0.0.0-phase3).

Structured log format

One line is emitted per retry, streamed via print(..., flush=True) to match the progress.py conventions from #217:

bake_throttle source=<src> retry=<n>/<max> reason=429 wait_s=<W> elapsed=<NmSs>

Regex contract (asserted in tests):

^bake_throttle source=\S+ retry=\d+/\d+ reason=429 wait_s=[\d.]+ elapsed=\d+m\d+s$
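The contract is easy to self-check; the example line below is illustrative (values made up, "gpuopen" borrowed from the failing derive job mentioned above):

```python
import re

# Regex contract for the structured retry log line.
THROTTLE_RE = re.compile(
    r"^bake_throttle source=\S+ retry=\d+/\d+ reason=429 wait_s=[\d.]+ elapsed=\d+m\d+s$"
)

# A representative retry line:
line = "bake_throttle source=gpuopen retry=2/5 reason=429 wait_s=30.0 elapsed=1m42s"
assert THROTTLE_RE.fullmatch(line)
```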

Acceptance criteria checklist (#225)

  • Helper _create_commit_with_backoff parses Retry-After header (int seconds + HTTP-date).
  • Falls back to body parse, then exponential backoff (30s..120s).
  • Bounded at 5 retries.
  • Re-raises non-429 exceptions unchanged (412 CAS continues to work).
  • Structured bake_throttle log per retry, flush=True.
  • Applied at every api.create_commit site in bake + derive (6 sites total).
  • Workflow concurrency cap added (repo-scoped, cancel-in-progress: false).
  • No new runtime deps.
  • Unit tests with pure mocks (no real HF traffic).
  • Existing baker test suite passes (uv run --extra baker pytest tests/ --ignore=tests/e2e → 359 passed, 5 skipped, 1 deselected*).

* Deselected test is tests/test_version_sync.py::test_client_runtime_version_matches_pyproject — pre-existing failure unrelated to this PR (mat-vis-client version drift), reproduces on dev HEAD.

Test plan

  • uv run --extra baker --extra dev pytest tests/test_hf_retry.py -q (10 passed).
  • uv run --extra baker --extra dev pytest tests/ -q --ignore=tests/e2e (no regressions).
  • uv run --extra dev ruff check + ruff format --check (clean).
  • yamllint .github/workflows/bake.yml .github/workflows/derive.yml (clean).
  • Phase-6 e2e concurrency stress test (MAT_VIS_E2E_CONCURRENCY=3 uv run --extra baker pytest tests/e2e/test_per_file_roundtrip.py::TestConcurrentBakesShareTag -v) — exercised by the existing #210 test ("tighten mock contract + expand e2e.yml so future #207-class bugs cannot ship") once phase-4 derives complete; expectations unchanged (the helper transparently passes through successful commits).
  • Real workflow dispatch on gerchowl/mat-vis-tst once phase-4 finishes — confirms the repo-scoped concurrency serializes as intended.

Judgment calls

  1. Workflow concurrency: GH Actions only supports one concurrency: block per workflow/job, not nested. The issue offered two options; I chose the workflow-level repo-scoped group (<repo-id>-budget) because it's the smallest change and most aligned with "one writer per repo at a time without cancellation". The narrower per-(source,tier) group is preserved as a comment-noted future option for when telemetry shows the 429 backoff has absorbed enough residual contention to safely re-parallelize.

  2. Re-raise the original 429 on exhaustion rather than wrapping in a new exception type. Keeps the call-site's existing exception-handling behaviour identical when retries genuinely run out, surfaces the real HF response to the operator, and leaves room for a future structured RetriesExhaustedError if anyone needs to distinguish.

  3. source="unknown" default: helper accepts source as an optional kwarg so future call sites that haven't been updated still emit valid bake_throttle lines. Both bake and derive currently pass an explicit source.

  4. Exponential backoff jitter is ±10%, not ±50%: enough to desync N concurrent retriers without making the worst-case wait blow past the 120s ceiling. Real production traffic at low concurrency (≤4 matrix workers) doesn't need wider jitter.
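The bound argument in point 4 can be checked numerically. A sketch assuming the backoff shape described above (30s floor, 120s cap, doubling base, ±10% jitter); `backoff_wait` is a hypothetical stand-in for the real helper's internals:

```python
import random


def backoff_wait(attempt: int, floor_s: float = 30.0, cap_s: float = 120.0,
                 jitter: float = 0.10) -> float:
    """Exponential backoff: double the floor each attempt, apply +/-10% jitter,
    then clamp the result into [floor_s, cap_s]."""
    base = floor_s * (2 ** attempt)
    jittered = base * random.uniform(1.0 - jitter, 1.0 + jitter)
    return min(max(jittered, floor_s), cap_s)


# The clamp guarantees no wait escapes [30, 120] regardless of the jitter draw:
waits = [backoff_wait(a) for a in range(5) for _ in range(1000)]
assert all(30.0 <= w <= 120.0 for w in waits)
```

The clamp, not the jitter width, is what enforces the 120s ceiling; the ±10% draw only desynchronizes concurrent retriers.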

The HF Hub enforces ~128 commits/hour/repo. Phase-3 matrix bakes plus
phase-4 derives in the same hour saturated this cap and the gpuopen
derive job died at its final manifest commit with HfHubHTTPError 429.

Changes:
- New hf_retry._create_commit_with_backoff helper wraps api.create_commit
  with bounded retries on 429, parsing Retry-After header (int seconds
  or HTTP-date) with body-text fallback and exponential backoff floor
  (30s) / ceil (120s) as the last resort. Re-raises every other
  exception untouched so the existing 412 CAS retry loop keeps working.
- Emits one structured "bake_throttle" log line per retry, mirroring
  the format conventions in mat_vis_baker.progress (#217).
- Wired into every api.create_commit call site in hf_bake_per_file.py
  (per-batch, catalog/manifest CAS loop, sentinel) and
  hf_derive_per_file.py (per-batch, catalog/manifest CAS loop, sentinel).
- Workflow concurrency caps in bake.yml + derive.yml: replaced the
  per-tier group with a repo-scoped group so all writers against a
  given HF repo serialize. Trade-off (slower wallclock, higher
  reliability under the rate-limit cap) documented in the header
  comments along with the narrower future-option group.

No new runtime deps - pure stdlib time/random/regex/email.utils.
@github-actions github-actions bot added the area:testing (Test infrastructure, BATS, pytest) and area:baker (Baker pipeline, Dagger, data fetchers) labels on Apr 28, 2026
@gerchowl gerchowl enabled auto-merge (squash) April 28, 2026 13:58
@gerchowl gerchowl merged commit 572cbad into dev Apr 28, 2026
5 checks passed
gerchowl added a commit that referenced this pull request Apr 30, 2026
The current `mat-vis-client 0.6.0` section was written against the tar
plan (ADR-0007). That substrate was deleted in #189/#203 in favour of
the per-file substrate (ADR-0012). Rewrite 0.6.0 to reflect what
actually shipped, add 0.6.1 (#239) and 0.6.2 (#243), and drop the
stale `vig-os/devcontainer` template noise from `Unreleased`.

Key corrections:

- per-file commits on HF (ADR-0012, #183), not tar archives
- release-manifest schema_version=3, not 2
- first per-file CalVer is v2026.04.2, not v2026.04.1
- fetch_texture is a plain GET on `<source>/<tier>/<mid>/<channel>.png`,
  no rowmap, no Range header
- tiers map is `{<tier>: {complete: bool}}`, no tar/rowmap keys

New 0.6.0 entry credits the supporting infrastructure that landed in
the same window: streaming progress (#220), bytes-aware batching (#229),
HF 429 backoff (#227), audit-orphans (#202/#222), matrix
workflow_dispatch + within-run serialisation (#235/#236),
TestConcurrentBakesShareTag plumbing (#231).

Closes #245.


Development

Successfully merging this pull request may close these issues.

feat(baker): commit-rate backoff for HF 128/hour cap (no-go from #214 phase 4)