test: retry HF-network integration tests on Hub 5xx flakiness #287

Merged
shuheng-liu merged 2 commits into main from claude/admiring-hoover-ee574c
May 13, 2026

@shuheng-liu
Member

What this does

CPU Tests CI has been red since ~19:21 UTC today due to intermittent
HuggingFace Hub 500s on the xet-read-token endpoint. The failures are
all in 4 integration tests that load real HF datasets:

  • tests/datasets/test_datasets.py::test_lerobot_dataset_factory (3 params)
  • tests/datasets/test_datasets.py::test_do_not_use_imagenet_stats (2 params)
  • tests/datasets/test_dataset_mixture.py::TestWeightedDatasetMixtureIntegration::test_integration_basic_functionality_with_same_fps_as_dataset
  • tests/scripts/test_attach_metadata.py::test_attach_metadata_end_to_end_droid_100

When xet-read-token returns 500, snapshot_download falls back to an
incomplete local_dir and the test eventually crashes either directly with
HfHubHTTPError or downstream with FileNotFoundError on a missing parquet.
Three back-to-back CI re-runs each failed on a different subset of these
tests; the original CI run 25575775842
chewed through three attempts.

This PR adds a small retry_on_hf_flakiness decorator in tests/utils.py
and applies it to the 4 affected tests. The decorator:

  • catches HfHubHTTPError with 5xx status (Hub server-side outage);
  • catches FileNotFoundError whose path is under the HF cache (downstream
    effect of the 500 fallback);
  • propagates everything else immediately so real test bugs still fail fast.

Defaults: 2 reruns, 10s delay between attempts. No new dependency.
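
The classification rules above can be sketched roughly as follows. This is not the code in tests/utils.py: `StandInHubError`, `looks_like_hub_flake`, and `retry_sketch` are hypothetical names, the stand-in exception replaces `HfHubHTTPError` so the example runs without huggingface_hub installed, and the `"/huggingface/"` marker is an assumption about how the cache path is recognized.

```python
import functools
import time


class StandInHubError(Exception):
    """Hypothetical stand-in for HfHubHTTPError; carries only a status code."""

    def __init__(self, message, status_code):
        super().__init__(message)
        self.status_code = status_code


def looks_like_hub_flake(exc, cache_marker="/huggingface/"):
    """Return True for 5xx Hub errors or missing files under the HF cache."""
    if isinstance(exc, StandInHubError):
        return 500 <= exc.status_code < 600
    if isinstance(exc, FileNotFoundError):
        # exc.filename is set for OS-raised errors; otherwise fall back
        # to the message text.
        return cache_marker in str(exc.filename or exc)
    return False


def retry_sketch(reruns=2, delay=10.0):
    """Retry the wrapped callable on Hub flakiness; re-raise anything else."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(reruns + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    # Out of attempts, or not a flake: fail immediately.
                    if attempt == reruns or not looks_like_hub_flake(exc):
                        raise
                    time.sleep(delay)

        return wrapper

    return decorator
```

With these defaults a test gets at most three attempts (one initial call plus two reruns), and a 4xx error or an unrelated exception is re-raised on the first attempt without sleeping.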

How it was tested

  • Verified decorator semantics with hand-written cases (5xx retries, 4xx
    propagates, real FileNotFoundError outside HF cache propagates,
    ValueError propagates).
  • Ran pytest tests/datasets/test_datasets.py::test_dataset_initialization
    and tests/datasets/test_datasets.py::test_lerobot_dataset_factory[lerobot/droid_100]
    locally — both pass.
  • Pre-commit hooks pass (ruff, ruff-format, pyupgrade, typos,
    bandit).

Cannot reproduce the original 5xx failure now — Hub recovered around 20:12 UTC.
The retry behavior here is defensive against the next outage.

How to checkout & try? (for the reviewer)

gh pr checkout <this-pr>
pytest -sx "tests/datasets/test_datasets.py::test_lerobot_dataset_factory[lerobot/droid_100]"

To exercise the retry path itself, the unit-style behavior was checked by
forcing a 500 mock on the first call:

from tests.utils import retry_on_hf_flakiness
from huggingface_hub.errors import HfHubHTTPError
from unittest.mock import MagicMock

# Fake HTTP response reporting a 500 status
resp = MagicMock()
resp.status_code = 500
# Call counter, mutated from inside f()
n = {"i": 0}

@retry_on_hf_flakiness(reruns=2, delay=0.0)
def f():
    n["i"] += 1
    if n["i"] < 3:
        raise HfHubHTTPError("500", response=resp)
    return "ok"

assert f() == "ok" and n["i"] == 3

Checklist

  • I have added Google-style docstrings to important functions and ensured function parameters are typed.
  • My PR includes policy-related changes.
    • If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

CPU Tests CI has been failing for ~80 minutes due to HuggingFace Hub
returning intermittent 500 errors on the xet-read-token endpoint
(snapshot_download falls back to an incomplete local_dir, then
dataset[0] crashes with FileNotFoundError on a missing parquet).

Add a small retry_on_hf_flakiness decorator in tests/utils.py that
catches HfHubHTTPError 5xx and FileNotFoundError under the HF cache
path, sleeps, and retries. Other exceptions propagate so real test
bugs still fail fast.

Apply to the four integration tests that hit real HF Hub:
- tests/datasets/test_datasets.py::test_lerobot_dataset_factory
- tests/datasets/test_datasets.py::test_do_not_use_imagenet_stats
- tests/datasets/test_dataset_mixture.py::TestWeightedDatasetMixtureIntegration::test_integration_basic_functionality_with_same_fps_as_dataset
- tests/scripts/test_attach_metadata.py::test_attach_metadata_end_to_end_droid_100
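
Several of these tests are parametrized, so a wrapping decorator has to keep the original argument names visible to signature inspection. The sketch below illustrates that with the standard library only. `passthrough`, `bare_passthrough`, and `example_test` are hypothetical names, not code from this PR, and the pass-through wrappers stand in for the retry decorator.

```python
import functools
import inspect


def passthrough(fn):
    """Wrapper that copies metadata from fn via functools.wraps."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)

    return wrapper


def bare_passthrough(fn):
    """Same wrapper without functools.wraps, for comparison."""

    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)

    return wrapper


def example_test(repo_id, tmp_path):
    # Hypothetical test body; the argument names are what matter here.
    return repo_id


# inspect.signature follows the __wrapped__ attribute set by functools.wraps,
# so the first wrapper still reports the original parameter names.
wrapped_params = list(inspect.signature(passthrough(example_test)).parameters)
bare_params = list(inspect.signature(bare_passthrough(example_test)).parameters)
```

Here `wrapped_params` lists `repo_id` and `tmp_path`, while `bare_params` only shows the generic `args` and `kwargs`, which is the information a test runner would lose without `@wraps`.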
@shuheng-liu added the bug and test labels May 8, 2026
@shuheng-liu self-assigned this May 8, 2026
@shuheng-liu marked this pull request as ready for review May 8, 2026 21:13
Member Author

@shuheng-liu shuheng-liu left a comment

Surgical fix to a real CI flake — happy with the approach. Retry semantics, exception classification, and decorator placement (innermost, so the wrapper is what pytest.mark.parametrize sees; @wraps keeps the signature inspectable) all look correct. A few minor things worth thinking about before merging — none are blockers.

  1. No log line on retry. When a retry fires there's nothing in stdout/stderr, so a future "this test took 30s instead of 10s" won't tip off the next person that a Hub flake got masked. A single print(f"[retry_on_hf_flakiness] attempt {attempt + 1}/{reruns + 1} hit {type(exc).__name__}: {exc}", file=sys.stderr) right before time.sleep(delay) makes the retry visible and greppable.

  2. "/huggingface/" substring is loose. It matches the real cache (~/.cache/huggingface/hub/...) but also any path that happens to contain /huggingface/ (e.g. a fixture dir literally named huggingface), and it misses the older ~/.cache/huggingface_hub/ layout. The intent is "under HF cache root" — huggingface_hub.constants.HF_HUB_CACHE resolves the cache dir (and respects HF_HOME), so a prefix check against it would be more precise. Low priority since the affected tests don't have false-positive candidates.

  3. Total backoff vs outage duration. Defaults give reruns * delay = 20s of total backoff, but today's outage was ~50 min. So this rides out brief blips, not a multi-minute outage. Worth being explicit in the docstring so the next person debugging a red CI doesn't expect retries to save them.

  4. Lazy import of HfHubHTTPError. huggingface_hub is a hard dep (pyproject.toml:54), so the in-function import is unnecessary — module-level would match the rest of tests/utils.py. Trivial.

  5. No unit test for the decorator itself. PR body covers hand-verification; a tiny test exercising the 5xx-range check, FileNotFoundError path filter, and propagation-on-non-flaky cases would guard against future regressions (e.g. someone narrowing the status range or rewriting the path check). Optional.
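
For item 2, a prefix check could look roughly like the sketch below. It is not the code in tests/utils.py: `is_under_cache` is a hypothetical helper, the paths in the usage lines are made up, and the cache root is passed in as an argument so the example stays dependency-free. In the decorator it would presumably come from huggingface_hub.constants.HF_HUB_CACHE.

```python
from pathlib import Path


def is_under_cache(path, cache_root):
    """Return True if path is cache_root itself or located beneath it."""
    # resolve() normalizes ".." segments and symlinks; it does not require
    # the path to exist.
    resolved = Path(path).expanduser().resolve()
    root = Path(cache_root).expanduser().resolve()
    return resolved == root or root in resolved.parents


# Hypothetical paths: one inside the cache root, one that merely contains
# a directory named "huggingface".
inside = is_under_cache("/tmp/hf_home/hub/datasets--x/data.parquet", "/tmp/hf_home/hub")
lookalike = is_under_cache("/tmp/fixtures/huggingface/data.parquet", "/tmp/hf_home/hub")
```

Unlike a substring match, this accepts only paths that actually sit under the configured root, so the fixture-directory lookalike is rejected.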


Generated by Claude Code

@shuheng-liu
Member Author

shuheng-liu commented May 12, 2026

Good to merge since none of the review items are blocking.

@claude
Contributor

claude Bot commented May 12, 2026

[claude-review] summary for commit 142abd4

No blocking issues found. Surgical fix for a real CI flake; retry semantics, exception classification (HfHubHTTPError 5xx + cache-path-anchored FileNotFoundError), and decorator/parametrize ordering are correct. The maintainer's prior review covers the remaining non-blocking items (no retry log line, loose /huggingface/ substring vs HF_HUB_CACHE, ~20 s total backoff vs the actual outage length, in-function import of HfHubHTTPError, missing unit test for the decorator) — concur on all five, none are merge-blockers.

@shuheng-liu shuheng-liu merged commit 234980a into main May 13, 2026
7 checks passed
@shuheng-liu shuheng-liu deleted the claude/admiring-hoover-ee574c branch May 13, 2026 21:38