test: retry HF-network integration tests on Hub 5xx flakiness by shuheng-liu · Pull Request #287 · TensorAuto/OpenTau

shuheng-liu · 2026-05-08T20:43:41Z

What this does

CPU Tests CI has been red since ~19:21 UTC today due to intermittent
HuggingFace Hub 500s on the xet-read-token endpoint. The failures are
all in 4 integration tests that load real HF datasets:

tests/datasets/test_datasets.py::test_lerobot_dataset_factory (3 params)
tests/datasets/test_datasets.py::test_do_not_use_imagenet_stats (2 params)
tests/datasets/test_dataset_mixture.py::TestWeightedDatasetMixtureIntegration::test_integration_basic_functionality_with_same_fps_as_dataset
tests/scripts/test_attach_metadata.py::test_attach_metadata_end_to_end_droid_100

When xet-read-token returns 500, snapshot_download falls back to an
incomplete local_dir and the test eventually crashes either directly with
HfHubHTTPError or downstream with FileNotFoundError on a missing parquet.
Three back-to-back CI re-runs each failed on a different subset of these
tests; the original CI run 25575775842
chewed through three attempts.

This PR adds a small retry_on_hf_flakiness decorator in tests/utils.py
and applies it to the 4 affected tests. The decorator:

- catches HfHubHTTPError with a 5xx status (Hub server-side outage);
- catches FileNotFoundError whose path is under the HF cache (downstream effect of the 500 fallback);
- propagates everything else immediately so real test bugs still fail fast.

Defaults: 2 reruns, 10s delay between attempts. No new dependency.

How it was tested

- Verified decorator semantics with hand-written cases (5xx retries, 4xx propagates, real FileNotFoundError outside HF cache propagates, ValueError propagates).
- Ran pytest tests/datasets/test_datasets.py::test_dataset_initialization and tests/datasets/test_datasets.py::test_lerobot_dataset_factory[lerobot/droid_100] locally; both pass.
- Pre-commit hooks pass (ruff, ruff-format, pyupgrade, typos, bandit).

Cannot reproduce the original 5xx failure now — Hub recovered around 20:12 UTC.
The retry behavior here is defensive against the next outage.

How to checkout & try? (for the reviewer)

gh pr checkout <this-pr>
pytest -sx "tests/datasets/test_datasets.py::test_lerobot_dataset_factory[lerobot/droid_100]"

To exercise the retry path itself, the unit-style behavior was checked by
forcing a mocked 500 on the first two calls, so the third attempt succeeds:

from unittest.mock import MagicMock

from huggingface_hub.errors import HfHubHTTPError

from tests.utils import retry_on_hf_flakiness

# Fake HTTP response that reports a server-side error.
resp = MagicMock()
resp.status_code = 500
n = {"i": 0}  # call counter shared with f() below

@retry_on_hf_flakiness(reruns=2, delay=0.0)
def f():
    n["i"] += 1
    if n["i"] < 3:  # fail on the first two calls, succeed on the third
        raise HfHubHTTPError("500", response=resp)
    return "ok"

assert f() == "ok" and n["i"] == 3

Checklist

I have added Google-style docstrings to important functions and ensured function parameters are typed.
My PR includes policy-related changes.
- If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

CPU Tests CI has been failing for ~80 minutes due to HuggingFace Hub returning intermittent 500 errors on the xet-read-token endpoint (snapshot_download falls back to an incomplete local_dir, then dataset[0] crashes with FileNotFoundError on a missing parquet).

Add a small retry_on_hf_flakiness decorator in tests/utils.py that catches HfHubHTTPError 5xx and FileNotFoundError under the HF cache path, sleeps, and retries. Other exceptions propagate so real test bugs still fail fast.

Apply to the four integration tests that hit real HF Hub:
- tests/datasets/test_datasets.py::test_lerobot_dataset_factory
- tests/datasets/test_datasets.py::test_do_not_use_imagenet_stats
- tests/datasets/test_dataset_mixture.py::TestWeightedDatasetMixtureIntegration::test_integration_basic_functionality_with_same_fps_as_dataset
- tests/scripts/test_attach_metadata.py::test_attach_metadata_end_to_end_droid_100

shuheng-liu

Surgical fix to a real CI flake — happy with the approach. Retry semantics, exception classification, and decorator placement (innermost, so the wrapper is what pytest.mark.parametrize sees; @wraps keeps the signature inspectable) all look correct. A few minor things worth thinking about before merging — none are blockers.
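
To illustrate the @wraps point in isolation, here is a minimal sketch. The decorators passthrough and bare and the two loader functions are hypothetical and exist only for this comparison. The sketch shows that inspect.signature follows __wrapped__ and so still reports the original argument names, which is what a name-based mechanism like parametrize needs to see.

```python
import functools
import inspect


def passthrough(fn):
    """Hypothetical decorator that forwards all arguments unchanged."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)

    return wrapper


def bare(fn):
    """Same decorator without functools.wraps, for comparison."""

    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)

    return wrapper


@passthrough
def load_with_wraps(repo_id, tmp_path):
    return repo_id


@bare
def load_without_wraps(repo_id, tmp_path):
    return repo_id


# With @wraps, inspect.signature follows __wrapped__ to the original function.
wrapped_params = list(inspect.signature(load_with_wraps).parameters)
# Without it, only the generic (*args, **kwargs) signature is visible.
bare_params = list(inspect.signature(load_without_wraps).parameters)
```

Here wrapped_params holds the original names repo_id and tmp_path, while bare_params only holds args and kwargs.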

No log line on retry. When a retry fires there's nothing in stdout/stderr, so a future "this test took 30s instead of 10s" won't tip off the next person that a Hub flake got masked. A single print(f"[retry_on_hf_flakiness] attempt {attempt + 1}/{reruns + 1} hit {type(exc).__name__}: {exc}", file=sys.stderr) right before time.sleep(delay) makes the retry visible and greppable.
"/huggingface/" substring is loose. It matches the real cache (~/.cache/huggingface/hub/...) but also any path that happens to contain /huggingface/ (e.g. a fixture dir literally named huggingface), and it misses the older ~/.cache/huggingface_hub/ layout. The intent is "under HF cache root" — huggingface_hub.constants.HF_HUB_CACHE resolves the cache dir (and respects HF_HOME), so a prefix check against it would be more precise. Low priority since the affected tests don't have false-positive candidates.
Total backoff vs outage duration. Defaults give reruns * delay = 20s of total backoff, but today's outage was ~50 min. So this rides out brief blips, not a multi-minute outage. Worth being explicit in the docstring so the next person debugging a red CI doesn't expect retries to save them.
Lazy import of HfHubHTTPError. huggingface_hub is a hard dep (pyproject.toml:54), so the in-function import is unnecessary — module-level would match the rest of tests/utils.py. Trivial.
No unit test for the decorator itself. PR body covers hand-verification; a tiny test exercising the 5xx-range check, FileNotFoundError path filter, and propagation-on-non-flaky cases would guard against future regressions (e.g. someone narrowing the status range or rewriting the path check). Optional.
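
For the cache-path item, a prefix check could be sketched as below. The function name is_under_cache is hypothetical, and the cache root is passed in as a parameter so the sketch runs without huggingface_hub installed; in the real helper it would presumably be huggingface_hub.constants.HF_HUB_CACHE, as suggested above.

```python
from pathlib import Path


def is_under_cache(path, cache_root):
    """Return True if `path` is located inside `cache_root`.

    Both paths are expanded and resolved first, so "~" and relative
    segments do not defeat the comparison.
    """
    resolved = Path(path).expanduser().resolve()
    root = Path(cache_root).expanduser().resolve()
    return resolved == root or root in resolved.parents


# Hypothetical paths, used only to show the difference from a substring match.
in_cache = is_under_cache("/tmp/hf/hub/datasets--x/data.parquet", "/tmp/hf/hub")
fixture_dir = is_under_cache("/tmp/fixtures/huggingface/data.parquet", "/tmp/hf/hub")
```

The second path contains /huggingface/ and would pass the substring test, but it is not under the cache root, so the prefix check rejects it.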

Generated by Claude Code

shuheng-liu · 2026-05-12T17:13:10Z

Good to merge since none of the review items are blocking.

claude · 2026-05-12T17:38:48Z

[claude-review] summary for commit 142abd4

No blocking issues found. Surgical fix for a real CI flake; retry semantics, exception classification (HfHubHTTPError 5xx + cache-path-anchored FileNotFoundError), and decorator/parametrize ordering are correct. The maintainer's prior review covers the remaining non-blocking items (no retry log line, loose /huggingface/ substring vs HF_HUB_CACHE, ~20 s total backoff vs the actual outage length, in-function import of HfHubHTTPError, missing unit test for the decorator) — concur on all five, none are merge-blockers.
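
As a rough worked version of the backoff item, the arithmetic can be written out as below. The helper names are hypothetical, the 50-minute figure is the outage length quoted in the review, and the time taken by each attempt itself is ignored.

```python
import math


def total_backoff(reruns, delay):
    """Total time spent sleeping when every attempt fails, in seconds."""
    return reruns * delay


def reruns_to_cover(outage_seconds, delay):
    """Smallest number of fixed-delay reruns whose sleeps span the outage."""
    return math.ceil(outage_seconds / delay)


# Defaults from the PR description: 2 reruns, 10 s apart.
default_wait = total_backoff(reruns=2, delay=10)
# Reruns needed to sleep through a 50-minute outage at the same delay.
needed = reruns_to_cover(outage_seconds=50 * 60, delay=10)
```

With the defaults this gives 20 seconds of waiting, against 300 reruns needed to span 50 minutes, which is why the retry only helps with brief blips.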

nit — .github/workflows/claude-pr-review.yml:50 — the continue-on-error: true hunk is already on main via ci: make Claude PR Review non-blocking when API limit hits #291 (identical content, so it auto-merges cleanly), unrelated to the PR title; harmless but noise in the diff.

shuheng-liu added the bug and test labels May 8, 2026

shuheng-liu self-assigned this May 8, 2026

shuheng-liu marked this pull request as ready for review May 8, 2026 21:13

shuheng-liu requested review from WilliamYue37 and akshay18iitg May 12, 2026 16:54

ci: make Claude PR Review non-blocking when API limit hits (#291)

142abd4

WilliamYue37 approved these changes May 12, 2026

shuheng-liu merged commit 234980a into main May 13, 2026
7 checks passed

shuheng-liu deleted the claude/admiring-hoover-ee574c branch May 13, 2026 21:38

WilliamYue37 mentioned this pull request May 14, 2026

Nightly regression test timing out at 30m in "Train with Model Parallelism" #306

Closed

shuheng-liu merged 2 commits into main from claude/admiring-hoover-ee574c
