test: retry HF-network integration tests on Hub 5xx flakiness#287
Conversation
CPU Tests CI has been failing for ~80 minutes due to HuggingFace Hub returning intermittent 500 errors on the xet-read-token endpoint (snapshot_download falls back to an incomplete local_dir, then dataset[0] crashes with FileNotFoundError on a missing parquet). Add a small retry_on_hf_flakiness decorator in tests/utils.py that catches HfHubHTTPError 5xx and FileNotFoundError under the HF cache path, sleeps, and retries. Other exceptions propagate so real test bugs still fail fast. Apply to the four integration tests that hit real HF Hub: - tests/datasets/test_datasets.py::test_lerobot_dataset_factory - tests/datasets/test_datasets.py::test_do_not_use_imagenet_stats - tests/datasets/test_dataset_mixture.py::TestWeightedDatasetMixtureIntegration::test_integration_basic_functionality_with_same_fps_as_dataset - tests/scripts/test_attach_metadata.py::test_attach_metadata_end_to_end_droid_100
shuheng-liu
left a comment
There was a problem hiding this comment.
Surgical fix to a real CI flake — happy with the approach. Retry semantics, exception classification, and decorator placement (innermost, so the wrapper is what pytest.mark.parametrize sees; @wraps keeps the signature inspectable) all look correct. A few minor things worth thinking about before merging — none are blockers.
-
No log line on retry. When a retry fires there's nothing in stdout/stderr, so a future "this test took 30s instead of 10s" won't tip off the next person that a Hub flake got masked. A single
print(f"[retry_on_hf_flakiness] attempt {attempt + 1}/{reruns + 1} hit {type(exc).__name__}: {exc}", file=sys.stderr)right beforetime.sleep(delay)makes the retry visible and greppable. -
"/huggingface/"substring is loose. It matches the real cache (~/.cache/huggingface/hub/...) but also any path that happens to contain/huggingface/(e.g. a fixture dir literally namedhuggingface), and it misses the older~/.cache/huggingface_hub/layout. The intent is "under HF cache root" —huggingface_hub.constants.HF_HUB_CACHEresolves the cache dir (and respectsHF_HOME), so a prefix check against it would be more precise. Low priority since the affected tests don't have false-positive candidates. -
Total backoff vs outage duration. Defaults give
reruns * delay = 20sof total backoff, but today's outage was ~50 min. So this rides out brief blips, not a multi-minute outage. Worth being explicit in the docstring so the next person debugging a red CI doesn't expect retries to save them. -
Lazy import of
HfHubHTTPError.huggingface_hubis a hard dep (pyproject.toml:54), so the in-function import is unnecessary — module-level would match the rest oftests/utils.py. Trivial. -
No unit test for the decorator itself. PR body covers hand-verification; a tiny test exercising the 5xx-range check,
FileNotFoundErrorpath filter, and propagation-on-non-flaky cases would guard against future regressions (e.g. someone narrowing the status range or rewriting the path check). Optional.
Generated by Claude Code
|
Good to merge since non of the review items are blocking. |
|
[claude-review] summary for commit 142abd4 No blocking issues found. Surgical fix for a real CI flake; retry semantics, exception classification (
|
What this does
CPU Tests CI has been red since ~19:21 UTC today due to intermittent
HuggingFace Hub 500s on the
xet-read-tokenendpoint. The failures areall in 4 integration tests that load real HF datasets:
tests/datasets/test_datasets.py::test_lerobot_dataset_factory(3 params)tests/datasets/test_datasets.py::test_do_not_use_imagenet_stats(2 params)tests/datasets/test_dataset_mixture.py::TestWeightedDatasetMixtureIntegration::test_integration_basic_functionality_with_same_fps_as_datasettests/scripts/test_attach_metadata.py::test_attach_metadata_end_to_end_droid_100When
xet-read-tokenreturns 500,snapshot_downloadfalls back to anincomplete local_dir and the test eventually crashes either directly with
HfHubHTTPErroror downstream withFileNotFoundErroron a missing parquet.Three back-to-back CI re-runs each failed on a different subset of these
tests; the original CI run 25575775842
chewed through three attempts.
This PR adds a small
retry_on_hf_flakinessdecorator intests/utils.pyand applies it to the 4 affected tests. The decorator:
HfHubHTTPErrorwith 5xx status (Hub server-side outage), andFileNotFoundErrorwhose path is under the HF cache (downstreameffect of the 500 fallback);
Defaults: 2 reruns, 10s delay between attempts. No new dependency.
How it was tested
propagates, real
FileNotFoundErroroutside HF cache propagates,ValueErrorpropagates).pytest tests/datasets/test_datasets.py::test_dataset_initializationand
tests/datasets/test_datasets.py::test_lerobot_dataset_factory[lerobot/droid_100]locally — both pass.
ruff,ruff-format,pyupgrade,typos,bandit).Cannot reproduce the original 5xx failure now — Hub recovered around 20:12 UTC.
The retry behavior here is defensive against the next outage.
How to checkout & try? (for the reviewer)
To exercise the retry path itself, the unit-style behavior was checked by
forcing a 500 mock on the first call:
Checklist