fix(datasets): harden dataset loading for large pretrain mixtures #304
This reverts commit 25d50cc.
…d metadata download" This reverts commit 0e34b65.
…isk files

download_episodes never reached its "episodes is None" branch: __init__ backfills self.episodes with the full episode list before it runs, so whole-repo datasets (no episode subset) wrongly took the per-file download_files path and fired one hf_hub_download per file. That meant thousands of HF API requests, tripping the 3000 req / 5 min rate limit (429). Capture _episodes_were_specified at construction so the branch survives the backfill and whole-repo datasets use snapshot_download (O(1) listing API calls).

download_files also called hf_hub_download for every file with no on-disk check; hf_hub_download issues a network metadata request per file even when the file is already in local_dir. Skip files already on disk so a pre-downloaded episode set makes zero requests.
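The routing described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the class name `EpisodeDownloader`, the `episode_paths` file layout, the worker count, and the injected `fetch_snapshot` / `fetch_file` callables (stand-ins for `snapshot_download` and `hf_hub_download`) are all assumptions made so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Optional

# Thread-pool width; the value 8 is an arbitrary choice for this sketch.
_DOWNLOAD_MAX_WORKERS = 8


class EpisodeDownloader:
    """Hypothetical stand-in for the dataset class, reduced to download routing."""

    def __init__(self, root: Path, all_episodes: list[int], episodes: Optional[list[int]] = None):
        self.root = Path(root)
        # Capture the caller's intent before the backfill below erases it.
        self._episodes_were_specified = episodes is not None
        # Backfill: from here on `self.episodes is None` is never true.
        self.episodes = list(episodes) if episodes is not None else list(all_episodes)

    def episode_paths(self) -> list[str]:
        # Hypothetical file layout, used only for illustration.
        return [f"data/episode_{i:06d}.parquet" for i in self.episodes]

    def download_files(self, rel_paths: list[str], fetch_file: Callable[[str], object]) -> list[str]:
        """Fetch only files that are not already on disk; return what was fetched."""
        missing = [p for p in rel_paths if not (self.root / p).is_file()]
        if missing:
            with ThreadPoolExecutor(max_workers=_DOWNLOAD_MAX_WORKERS) as pool:
                # list() drains the iterator so worker exceptions propagate.
                list(pool.map(fetch_file, missing))
        return missing

    def download_episodes(self, fetch_snapshot: Callable[[], object],
                          fetch_file: Callable[[str], object]) -> str:
        if not self._episodes_were_specified:
            fetch_snapshot()  # whole repo: one bulk call
            return "snapshot"
        self.download_files(self.episode_paths(), fetch_file)
        return "files"
```

Testing `self.episodes is None` inside `download_episodes` would always be false here, which is the bug the commit describes; the flag recorded in `__init__` is what keeps the whole-repo branch reachable.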
lerobot_dataset_factory patched snapshot_download but not hf_hub_download. Once download_episodes started routing the episode-subset path through download_files (which uses hf_hub_download), the 3 subset-episode tests in test_datasets.py made real Hub calls and 404'd on the dummy repo. Add mock_hf_hub_download_factory that writes the requested fixture file, mirroring mock_snapshot_download_factory.
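A mock of this kind might look like the sketch below. The function name `make_mock_hf_hub_download` and the `write_fixture` callback are hypothetical; only the `repo_id`, `filename`, and `local_dir` parameters are taken from the real `hf_hub_download` signature, and every other keyword is accepted and ignored.

```python
from pathlib import Path
from typing import Callable


def make_mock_hf_hub_download(write_fixture: Callable[[Path], None]) -> Callable[..., str]:
    """Build a stand-in for hf_hub_download that never touches the network."""

    def _mock_hf_hub_download(repo_id: str, filename: str, *, local_dir: str, **_ignored) -> str:
        # Write the requested fixture where the real call would place the file.
        target = Path(local_dir) / filename
        target.parent.mkdir(parents=True, exist_ok=True)
        write_fixture(target)
        # The real function also returns the local path as a string.
        return str(target)

    return _mock_hf_hub_download
```

In a test, the returned callable would be patched over whichever name the dataset module uses to call `hf_hub_download`, so the subset-episode path finds its files on disk instead of requesting them from the dummy repo.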
Code review
Overview
Four files, four distinct dataset-loading fixes for large heterogeneous pretrain mixtures:
What checks out
Issues
Test coverage gaps
Minor
Verdict
Solid, well-reasoned fixes: each traces to a concrete failure mode, and the comments explain the "why" well. Nothing here is a correctness blocker. Before un-drafting I'd (1) drop or explicitly justify the dead …
Generated by Claude Code
- load_hf_dataset: drop the unreachable `episodes is None` glob branch (__init__ backfills self.episodes before this method runs), remove the now-unused `import re`, and document in the docstring that schema is inferred from parquet (not validated against info.json) and that the Arrow cache is unpruned.
- download_files: hoist the thread-pool width to a named module constant _DOWNLOAD_MAX_WORKERS.
- Add test_download_files_skips_present_files (asserts zero hf_hub_download calls when the episode files are already on disk) and test_unresolvable_task_label_skipped (episode_to_task_index_from_episodes skips a task label absent from tasks.jsonl).
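The skip-with-deduped-warning behaviour that test_unresolvable_task_label_skipped covers could be sketched as below. The function name, its signature, and the two plain dicts are assumptions for illustration; the real `episode_to_task_index_from_episodes` reads its inputs from the dataset metadata and may be shaped differently.

```python
import logging

logger = logging.getLogger(__name__)


def episode_to_task_index(episode_tasks: dict[int, str], task_to_index: dict[str, int]) -> dict[int, int]:
    """Map episode index -> task index, skipping episodes with unknown labels."""
    mapping: dict[int, int] = {}
    warned: set[str] = set()  # labels already reported, so each is logged once
    for episode, label in episode_tasks.items():
        task_index = task_to_index.get(label)
        if task_index is None:
            if label not in warned:
                logger.warning("Task label %r not found in tasks.jsonl; skipping", label)
                warned.add(label)
            continue  # leave the episode out instead of raising KeyError
        mapping[episode] = task_index
    return mapping
```

Callers then have to tolerate episodes that are missing from the returned mapping, which the PR description says the downstream speed-bucket code already does.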
Review feedback addressed (9e1f8ec)
Before-un-drafting items
Also addressed
Not changed — matches the review's own call
All 170 tests in …
Likely addresses #306
This branch is based directly on …
So whichever of #301 / #302 is the actual culprit, this PR covers it. Not yet reproduced; re-running the nightly regression on this branch (or post-merge) would confirm.
Re-review — follow-up commit
Split the two distinct schema-drift cases the re-review flagged as conflated: a parquet/info.json mismatch now loads silently (parquet wins), whereas a mismatch between parquet files of one dataset fails as a load_dataset concatenation error.
Wording nit addressed (b6bc39d)
Reworded the …
Pure docstring change, CI unaffected. That was the last open item from the re-review.
Re-review — follow-up commit
What this does
A series of fixes that make the dataset-loading path robust enough to start a large heterogeneous pretraining mixture (~390 LeRobot datasets). Each fix addresses a distinct crash or hang hit while bringing up such a run; the net change is 4 files.
`datasets/lerobot_dataset.py`
- Guards `get_safe_version` with `is_valid_version` in the data-download path (mirrors the existing guard on the metadata path). A dataset pinned to a Git-SHA revision no longer crashes with `packaging.version.InvalidVersion`; SHA / branch refs skip the `vX.Y` codebase-version lookup and fall through to the download, which the Hub accepts directly.
- `load_hf_dataset` now loads per-episode parquet via `load_dataset("parquet", ...)`, whose Arrow cache is genuinely memory-mapped (resident pages are file-backed and reclaimable). The previous hand-rolled `pyarrow.dataset.to_table()` + `Dataset(table)` materialised the full filtered table into anonymous RAM; with every rank loading the full mixture, the multi-hundred-GB video repos OOM'd the box. Files are passed in sorted-episode order so the row layout stays aligned with `episode_data_index`. Because `load_dataset` infers features from the parquet itself, it also sidesteps the strict-schema cast errors that info.json / parquet column drift used to raise.
- `download_episodes` now routes whole-repo datasets (no episode subset) to `snapshot_download` (lists the repo tree in O(1) API calls) and episode-subset datasets to a new `download_files` helper. `download_files` fetches each file with `hf_hub_download` in a thread pool but skips files already on disk: `hf_hub_download` issues a network metadata request per file even for cached files, so calling it across a pre-downloaded mixture burned one API request per file and tripped the 3000-req / 5-min rate limit (429). Passing thousands of explicit per-episode paths to `snapshot_download(allow_patterns=...)` was the other failure mode: its `filter_repo_objects` fnmatch loop is O(repo_files × patterns) and ran GIL-held long enough to trip the NCCL watchdog. The new split avoids both.

`datasets/speed_percentiles.py`
- `load_or_compute_speed_percentiles` wraps its body in `try`/`finally` so `wait_for_everyone()` runs on every path. Previously the early-return-on-cached-file branch skipped the barrier; a rank arriving after rank 0 wrote the cache file took that branch, skipped the barrier, and silently desynced the collective counter, which surfaced as an NCCL hang at a much later, unrelated sync point.
- `episode_to_task_index_from_episodes` skips episodes whose task label is absent from `tasks.jsonl` (deduped warning) instead of raising `KeyError`. Skipped episodes are still trained on; they fall back to the sparse speed bucket downstream, which already tolerates a missing entry.

`tests/fixtures/`
- Adds `mock_hf_hub_download_factory` (mirrors `mock_snapshot_download_factory`) and wires it into `lerobot_dataset_factory`, so the `download_files` path is exercised without hitting the Hub.

🐛 Bug
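The revision guard on the data-download path can be illustrated with a short sketch. The body of `is_valid_version` here is an assumption built on `packaging.version`, and `resolve_revision` with its `lookup_codebase_version` callback is a hypothetical stand-in for the call to `get_safe_version`, whose real signature is not shown in this PR text.

```python
from typing import Callable

from packaging.version import InvalidVersion, Version


def is_valid_version(ref: str) -> bool:
    """Return True if `ref` parses as a PEP 440 version such as 'v2.1'."""
    try:
        Version(ref)
    except InvalidVersion:
        return False
    return True


def resolve_revision(revision: str, lookup_codebase_version: Callable[[str], str]) -> str:
    """Run the version lookup only for version-like refs; pass others through."""
    if is_valid_version(revision):
        return lookup_codebase_version(revision)
    # Branch names and commit SHAs go to the Hub unchanged.
    return revision
```

One caveat with this particular check: a short ref that happens to parse as a version (for example an all-digit abbreviated SHA) would still take the lookup path, so the guard is most reliable with full-length SHAs and named branches.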
The branch history includes a couple of exploratory commits that were reverted on-branch (a py-spy dev-dep, an HF-filelock approach); the net change is the 4 files above.
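The barrier fix in `speed_percentiles.py` described above follows a general pattern that can be sketched in isolation. Everything in this example is illustrative: `load_or_compute`, the JSON cache format, and the injected `barrier` callable (standing in for `wait_for_everyone()`) are assumptions, and the real function may split work between ranks differently.

```python
import json
from pathlib import Path
from typing import Callable


def load_or_compute(cache_path: Path, compute: Callable[[], dict],
                    is_main_process: bool, barrier: Callable[[], None]) -> dict:
    """Return cached values if present, else compute them; always hit the barrier."""
    try:
        if cache_path.is_file():
            # Early return: without the `finally` below, this path skips the barrier.
            return json.loads(cache_path.read_text())
        result = compute()
        if is_main_process:
            cache_path.write_text(json.dumps(result))
        return result
    finally:
        # Runs on the cached path, the computed path, and on exceptions alike,
        # so every rank performs the same number of collective calls.
        barrier()
```

Placing the barrier in `finally` makes the number of collective calls independent of which branch a rank takes, which is the property the original early return violated.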
How it was tested
- `pytest -m "not gpu" tests/datasets/`: all 150 dataset tests pass. This includes 3 subset-episode tests (`test_dataset_initialization`, `test_dataset_unsorted_episodes_row_alignment`, `test_dataset_sparse_episodes_row_alignment`) that the `download_files` switch broke and the new `mock_hf_hub_download_factory` fixes.
- `pre-commit run --files ...` passes on all changed files.

How to checkout & try? (for the reviewer)
The three row-alignment / initialization tests exercise the episode-subset download path through the new `hf_hub_download` mock. To spot-check the memory-mapping behaviour, load any LeRobot dataset and confirm the resident set stays bounded:

`python -c "from opentau.datasets.lerobot_dataset import LeRobotDataset; ds = LeRobotDataset(repo_id='physical-intelligence/libero'); print(len(ds), 'frames')"`

Checklist
Note: Before submitting this PR, please read the contributor guideline.