Skip to content

fix(datasets): harden dataset loading for large pretrain mixtures#304

Merged
shuheng-liu merged 18 commits into
mainfrom
claude/intelligent-chatelet-f16fb0
May 14, 2026
Merged

fix(datasets): harden dataset loading for large pretrain mixtures#304
shuheng-liu merged 18 commits into
mainfrom
claude/intelligent-chatelet-f16fb0

Conversation

@shuheng-liu
Copy link
Copy Markdown
Member

@shuheng-liu shuheng-liu commented May 14, 2026

What this does

A series of fixes that make the dataset-loading path robust enough to start a large heterogeneous pretraining mixture (~390 LeRobot datasets). Each fix addresses a distinct crash or hang hit while bringing up such a run; the net change is 4 files.

datasets/lerobot_dataset.py

  • SHA-format revisions — guard get_safe_version with is_valid_version in the data-download path (mirrors the existing guard on the metadata path). A dataset pinned to a Git-SHA revision no longer crashes with packaging.version.InvalidVersion; SHA / branch refs skip the vX.Y codebase-version lookup and fall through to the download, which the Hub accepts directly.
  • Memory-bounded parquet loadingload_hf_dataset now loads per-episode parquet via load_dataset("parquet", ...), whose Arrow cache is genuinely memory-mapped (resident pages are file-backed and reclaimable). The previous hand-rolled pyarrow.dataset.to_table() + Dataset(table) materialised the full filtered table into anonymous RAM — with every rank loading the full mixture, the multi-hundred-GB video repos OOM'd the box. Files are passed in sorted-episode order so the row layout stays aligned with episode_data_index. Because load_dataset infers features from the parquet itself, it also sidesteps the strict-schema cast errors that info.json / parquet column drift used to raise.
  • Episode download pathdownload_episodes now routes whole-repo datasets (no episode subset) to snapshot_download (lists the repo tree in O(1) API calls) and episode-subset datasets to a new download_files helper. download_files fetches each file with hf_hub_download in a thread pool but skips files already on diskhf_hub_download issues a network metadata request per file even for cached files, so calling it across a pre-downloaded mixture burned one API request per file and tripped the 3000-req / 5-min rate limit (429). Passing thousands of explicit per-episode paths to snapshot_download(allow_patterns=...) was the other failure mode: its filter_repo_objects fnmatch loop is O(repo_files × patterns) and ran GIL-held long enough to trip the NCCL watchdog. The new split avoids both.

datasets/speed_percentiles.py

  • Distributed-barrier correctnessload_or_compute_speed_percentiles wraps its body in try/finally so wait_for_everyone() runs on every path. Previously the early-return-on-cached-file branch skipped the barrier; a rank arriving after rank 0 wrote the cache file took that branch, skipped the barrier, and silently desynced the collective counter — surfacing as a NCCL hang at a much later, unrelated sync point.
  • Metadata drift toleranceepisode_to_task_index_from_episodes skips episodes whose task label is absent from tasks.jsonl (deduped warning) instead of raising KeyError. Skipped episodes are still trained on; they fall back to the sparse speed bucket downstream, which already tolerates a missing entry.

tests/fixtures/

  • Adds mock_hf_hub_download_factory (mirrors mock_snapshot_download_factory) and wires it into lerobot_dataset_factory, so the download_files path is exercised without hitting the Hub.

🐛 Bug

The branch history includes a couple of exploratory commits that were reverted on-branch (a py-spy dev-dep, an HF-filelock approach); the net change is the 4 files above.

How it was tested

  • pytest -m "not gpu" tests/datasets/ — all 150 dataset tests pass. This includes 3 subset-episode tests (test_dataset_initialization, test_dataset_unsorted_episodes_row_alignment, test_dataset_sparse_episodes_row_alignment) that the download_files switch broke and the new mock_hf_hub_download_factory fixes.
  • pre-commit run --files ... passes on all changed files.
  • End-to-end: a large heterogeneous pretraining mixture (~390 datasets) that previously crashed during dataset loading — on each of the issues above in turn — now flows through dataset prep → model load → stable training steps with a healthy loss curve (total loss decreasing, grad norm settling).

How to checkout & try? (for the reviewer)

pytest -sx tests/datasets/test_datasets.py

The three row-alignment / initialization tests exercise the episode-subset download path through the new hf_hub_download mock. To spot-check the memory-mapping behaviour, load any LeRobot dataset and confirm the resident set stays bounded:

python -c "from opentau.datasets.lerobot_dataset import LeRobotDataset; ds = LeRobotDataset(repo_id='physical-intelligence/libero'); print(len(ds), 'frames')"

Checklist

  • I have added Google-style docstrings to important functions and ensured function parameters are typed.
  • My PR includes policy-related changes.
    • If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

Note: Before submitting this PR, please read the contributor guideline.

@shuheng-liu shuheng-liu self-assigned this May 14, 2026
@shuheng-liu shuheng-liu added bug Something isn't working and removed 🐛 Bug labels May 14, 2026
…isk files

download_episodes never reached its "episodes is None" branch: __init__
backfills self.episodes with the full episode list before it runs, so
whole-repo datasets (no episode subset) wrongly took the per-file
download_files path and fired one hf_hub_download per file — thousands
of HF API requests, tripping the 3000 req / 5 min rate limit (429).
Capture _episodes_were_specified at construction so the branch survives
the backfill and whole-repo datasets use snapshot_download (O(1) listing
API calls).

download_files also called hf_hub_download for every file with no
on-disk check; hf_hub_download issues a network metadata request per
file even when the file is already in local_dir. Skip files already on
disk so a pre-downloaded episode set makes zero requests.
lerobot_dataset_factory patched snapshot_download but not
hf_hub_download. Once download_episodes started routing the
episode-subset path through download_files (which uses hf_hub_download),
the 3 subset-episode tests in test_datasets.py made real Hub calls and
404'd on the dummy repo. Add mock_hf_hub_download_factory that writes the
requested fixture file, mirroring mock_snapshot_download_factory.
@shuheng-liu shuheng-liu changed the title fix(datasets): allow SHA-format revisions in dataset loader fix(datasets): harden dataset loading for large pretrain mixtures May 14, 2026
Copy link
Copy Markdown
Member Author

Code review

Overview

Four files, four distinct dataset-loading fixes for large heterogeneous pretrain mixtures:

  • lerobot_dataset.py — (a) guard get_safe_version with is_valid_version on the data-download path; (b) swap the hand-rolled pa_ds.to_table() + Dataset(table) for load_dataset("parquet", ...) to get a genuinely mmap'd Arrow cache; (c) split download_episodes into a whole-repo snapshot_download path and a per-file hf_hub_download path that skips already-present files.
  • speed_percentiles.py — (d) try/finally so the distributed barrier runs on the cached-file early-return path; (e) episode_to_task_index_from_episodes skips unresolvable task labels instead of raising KeyError.
  • tests/fixtures/ — adds mock_hf_hub_download_factory and wires it in.

What checks out

  • is_valid_version guard correctly mirrors the existing metadata-path guard — SHA/branch refs now skip the vX.Y lookup.
  • Import cleanup is clean: pa_ds / Dataset / DatasetInfo are fully removed with no remaining references; get_hf_features_from_features is still used by create_hf_dataset, concatenate_datasets by add_episode. No orphans.
  • on_accelerate_main_proc(local=True, _sync=True) on download_files is safe — download_episodes is undecorated and runs on all ranks, so every rank reaches both download_files and pull_from_repo; no risk of the nested-decorator deadlock the decorator docstring warns about, and _sync=True broadcasts exceptions symmetrically.
  • Path handling is consistent: get_data_file_path returns a repo-relative Path, so (self.root / f).is_file(), hf_hub_download(filename=f, local_dir=self.root), and load_dataset(str(self.root / f)) all line up.
  • Row-order alignment is preserved: self.episodes is sorted(), the file list is built in that order, and load_dataset concatenates an explicit data_files list in list order — same final (episode, frame) order the old to_table produced. test_dataset_unsorted/sparse_episodes_row_alignment lock this in.
  • speed_percentiles try/finally correctly fixes the real barrier-skip desync. The .get() + dedup-warn branch handles task_index == 0 correctly (checks is None, not falsiness).
  • CI is green — all 7 check runs pass (CPU Tests, Pre-commit, check-checklist, review).

Issues

  1. Dead code in load_hf_dataset (if self.episodes is None). __init__ unconditionally backfills self.episodes = list(self.meta.episodes) before the try/except that calls load_hf_dataset, so inside the method self.episodes is never None — the glob branch is unreachable, and the else branch already handles whole-repo loads (the full episode list). The docstring also leads with the glob behavior that never runs. Either drop the branch and trim the docstring, or add a comment that it's deliberately kept standalone-safe — as written it reads as live code. (Per repo CLAUDE.md "Simplicity First", I'd drop it.)

  2. Schema is now inferred from parquet, not info.json. The old path passed explicit features=get_hf_features_from_features(self.meta.features); load_dataset("parquet", ...) infers from the files. This is framed as a feature (sidesteps cast errors) and it is — but info.json is no longer the authoritative schema, and cross-file schema drift within a dataset now surfaces as a load_dataset failure instead of a cast. Worth a one-line docstring note that schema validation against info.json is intentionally dropped here.

  3. Arrow cache disk footprint. load_dataset now builds a 1–5x-parquet Arrow cache for the full mixture on every box (the old to_table path didn't). The docstring acknowledges it, but for a multi-hundred-GB mixture this is a large, uncleaned $HF_HOME/datasets/ footprint — worth calling out operationally (disk provisioning; nothing prunes it).

Test coverage gaps

  • The download_files skip-already-on-disk logic — the core of the 429-avoidance fix — isn't asserted anywhere. A test that pre-populates the episode files and asserts hf_hub_download is called zero times would lock in the exact behavior this PR introduces.
  • episode_to_task_index_from_episodes's new unresolvable-task-label skip has no test. TestEpisodeToTaskIndex covers single/multi/empty-tasks but not "task not in task_to_task_index". It's a pure function — a 4-line test ({0: {"tasks": ["ghost"]}}, t2i={}{}) covers it.
  • The speed_percentiles try/finally barrier behavior needs a distributed harness to test; manual verification as noted in the PR is acceptable.

Minor

  • download_files: max_workers=16 is a hardcoded magic number — a module constant would be marginally cleaner.
  • download_files treats any on-disk file as complete. hf_hub_download writes atomically so this holds for its own downloads, but a truncated file from another source would be silently trusted. Edge case, matches the old snapshot_download behavior — non-blocking.
  • speed_percentiles: the finally now runs wait_for_everyone() even when the try raises. An asymmetric failure (one rank raises, others don't) becomes a barrier hang instead of a fast crash. Still strictly better than the old skip-the-barrier bug, and realistic failures here hit all ranks symmetrically — just noting the residual.

Verdict

Solid, well-reasoned fixes — each traces to a concrete failure mode and the comments explain the "why" well. Nothing here is a correctness blocker. Before un-drafting I'd (1) drop or explicitly justify the dead episodes is None branch in load_hf_dataset, and (2) add the two small unit tests above, since both are cheap and cover the exact behaviors this PR exists to introduce.


Generated by Claude Code

- load_hf_dataset: drop the unreachable `episodes is None` glob branch
  (__init__ backfills self.episodes before this method runs), remove the
  now-unused `import re`, and document in the docstring that schema is
  inferred from parquet (not validated against info.json) and that the
  Arrow cache is unpruned.
- download_files: hoist the thread-pool width to a named module constant
  _DOWNLOAD_MAX_WORKERS.
- Add test_download_files_skips_present_files (asserts zero hf_hub_download
  calls when the episode files are already on disk) and
  test_unresolvable_task_label_skipped (episode_to_task_index_from_episodes
  skips a task label absent from tasks.jsonl).
@shuheng-liu
Copy link
Copy Markdown
Member Author

Review feedback addressed — 9e1f8ec

Before-un-drafting items

  • Dead episodes is None branch in load_hf_dataset — dropped. __init__ backfills self.episodes with the full episode list before either load_hf_dataset call, so the glob branch was unreachable. Removed it, the now-unused import re, and the stale glob comment; added a docstring line noting self.episodes is always concrete in this method.
  • download_files skip-on-disk test — added test_download_files_skips_present_files: builds the dataset (which places every selected-episode file on disk), then asserts hf_hub_download is called zero times on a second download_files pass.
  • episode_to_task_index_from_episodes unresolvable-label test — added test_unresolvable_task_label_skipped to TestEpisodeToTaskIndex (a task label absent from tasks.jsonl is skipped, resolvable episodes still come through).

Also addressed

  • Docstring note that schema is inferred from the parquet, not validated against info.json (issue 2).
  • Docstring note that the Arrow cache is unpruned and needs disk provisioned (issue 3).
  • max_workers=16 → module constant _DOWNLOAD_MAX_WORKERS (minor).

Not changed — matches the review's own call

  • speed_percentiles try/finally distributed-barrier behavior — needs a multi-rank harness; manual verification stands.
  • download_files trusting a truncated on-disk file, and the finally-barrier-on-asymmetric-raise residual — both flagged non-blocking; left as-is.

All 170 tests in tests/datasets/{test_datasets,test_speed_percentiles,test_obs_history,test_segment_lerobot_dataset}.py pass locally (-m "not gpu"), pre-commit clean.

Likely addresses #306

This branch is based directly on 758d473 (current main), and its net diff rewrites the exact code in both commits #306 names as prime suspects:

So whichever of #301 / #302 is the actual culprit, this PR covers it. Not yet reproduced — re-running the nightly regression on this branch (or post-merge) would confirm.

Copy link
Copy Markdown
Member Author

Re-review — follow-up commit 9e1f8ec

The follow-up commit cleanly resolves everything from the review above, and CI is green on the new head (6/6 checks).

Addressed:

  • Dead codeload_hf_dataset now builds only the explicit per-episode file list; the unreachable episodes is None glob branch is gone and import re is removed (verified no other re. usage remains in the file). Docstring updated to state self.episodes is always a concrete list here.
  • Schema inference — docstring paragraph added making explicit that the schema is intentionally not validated against info.json.
  • Disk footprint — docstring now notes the Arrow cache is unpruned and a multi-hundred-GB mixture needs that much extra disk provisioned.
  • Test gap (download_files)test_download_files_skips_present_files asserts zero hf_hub_download calls when the episode files are already on disk — directly locks in the 429-avoidance behavior.
  • Test gap (episode_to_task_index)test_unresolvable_task_label_skipped covers the new skip-and-warn branch and confirms resolvable episodes in the same dataset still pass through.
  • Magic numbermax_workers=16_DOWNLOAD_MAX_WORKERS module constant with an explanatory comment.

One minor nit (non-blocking): the new schema docstring paragraph calls the parquet-vs-info.json case "cross-file column drift" — but "cross-file" reads more naturally as parquet-file-to-parquet-file. The two cases behave differently: parquet/info.json mismatch now loads silently (parquet wins), whereas parquet-to-parquet mismatch is what actually raises a load_dataset failure. As written it reads slightly self-contradictory ("lets ... drift load" → "drift surfaces as a failure"). Worth relabeling for precision, but the substance is correct.

Looks good to un-draft once the wording nit is settled.


Generated by Claude Code

Split the two distinct schema-drift cases the re-review flagged as
conflated: a parquet/info.json mismatch now loads silently (parquet
wins), whereas a mismatch between parquet files of one dataset fails as
a load_dataset concatenation error.
@shuheng-liu
Copy link
Copy Markdown
Member Author

Wording nit addressed — b6bc39d

Reworded the load_hf_dataset schema docstring to separate the two drift cases that read as conflated:

  • parquet vs info.json mismatch → now loads silently; the parquet's own schema wins and info.json is no longer authoritative.
  • mismatch between the parquet files of one dataset → still fails, but as a load_dataset concatenation error rather than the old explicit feature-cast error.

Pure docstring change, CI unaffected. That was the last open item from the re-review.

Copy link
Copy Markdown
Member Author

Re-review — follow-up commit b6bc39d

Docs-only change; rewrites the load_hf_dataset schema-inference paragraph from the previous nit. The new wording is accurate and cleanly separates the two cases — a parquet/info.json mismatch now loads silently (parquet wins), while a mismatch between the parquet files of a single dataset still fails as a load_dataset concatenation error.

Nit fully resolved. No open items remain — looks ready to un-draft. CI green on the substantive checks (pre-commit, review, check-checklist); CPU Tests in progress at time of writing but unaffected by a docstring-only change.


Generated by Claude Code

@shuheng-liu shuheng-liu marked this pull request as ready for review May 14, 2026 22:27
@shuheng-liu shuheng-liu merged commit 8bd159b into main May 14, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant