fix(datasets): harden dataset loading for large pretrain mixtures by shuheng-liu · Pull Request #304 · TensorAuto/OpenTau

shuheng-liu · 2026-05-14T03:41:50Z

What this does

A series of fixes that make the dataset-loading path robust enough to start a large heterogeneous pretraining mixture (~390 LeRobot datasets). Each fix addresses a distinct crash or hang hit while bringing up such a run; the net change is 4 files.

datasets/lerobot_dataset.py

SHA-format revisions — guard get_safe_version with is_valid_version in the data-download path (mirrors the existing guard on the metadata path). A dataset pinned to a Git-SHA revision no longer crashes with packaging.version.InvalidVersion; SHA / branch refs skip the vX.Y codebase-version lookup and fall through to the download, which the Hub accepts directly.
Memory-bounded parquet loading — load_hf_dataset now loads per-episode parquet via load_dataset("parquet", ...), whose Arrow cache is genuinely memory-mapped (resident pages are file-backed and reclaimable). The previous hand-rolled pyarrow.dataset.to_table() + Dataset(table) materialised the full filtered table into anonymous RAM — with every rank loading the full mixture, the multi-hundred-GB video repos OOM'd the box. Files are passed in sorted-episode order so the row layout stays aligned with episode_data_index. Because load_dataset infers features from the parquet itself, it also sidesteps the strict-schema cast errors that info.json / parquet column drift used to raise.
Episode download path — download_episodes now routes whole-repo datasets (no episode subset) to snapshot_download (lists the repo tree in O(1) API calls) and episode-subset datasets to a new download_files helper. download_files fetches each file with hf_hub_download in a thread pool but skips files already on disk — hf_hub_download issues a network metadata request per file even for cached files, so calling it across a pre-downloaded mixture burned one API request per file and tripped the 3000-req / 5-min rate limit (429). Passing thousands of explicit per-episode paths to snapshot_download(allow_patterns=...) was the other failure mode: its filter_repo_objects fnmatch loop is O(repo_files × patterns) and ran GIL-held long enough to trip the NCCL watchdog. The new split avoids both.

datasets/speed_percentiles.py

Distributed-barrier correctness — load_or_compute_speed_percentiles wraps its body in try/finally so wait_for_everyone() runs on every path. Previously the early-return-on-cached-file branch skipped the barrier; a rank arriving after rank 0 wrote the cache file took that branch, skipped the barrier, and silently desynced the collective counter — surfacing as a NCCL hang at a much later, unrelated sync point.
Metadata drift tolerance — episode_to_task_index_from_episodes skips episodes whose task label is absent from tasks.jsonl (deduped warning) instead of raising KeyError. Skipped episodes are still trained on; they fall back to the sparse speed bucket downstream, which already tolerates a missing entry.

tests/fixtures/

Adds mock_hf_hub_download_factory (mirrors mock_snapshot_download_factory) and wires it into lerobot_dataset_factory, so the download_files path is exercised without hitting the Hub.

🐛 Bug

The branch history includes a couple of exploratory commits that were reverted on-branch (a py-spy dev-dep, an HF-filelock approach); the net change is the 4 files above.

How it was tested

pytest -m "not gpu" tests/datasets/ — all 150 dataset tests pass. This includes 3 subset-episode tests (test_dataset_initialization, test_dataset_unsorted_episodes_row_alignment, test_dataset_sparse_episodes_row_alignment) that the download_files switch broke and the new mock_hf_hub_download_factory fixes.
pre-commit run --files ... passes on all changed files.
End-to-end: a large heterogeneous pretraining mixture (~390 datasets) that previously crashed during dataset loading — on each of the issues above in turn — now flows through dataset prep → model load → stable training steps with a healthy loss curve (total loss decreasing, grad norm settling).

How to checkout & try? (for the reviewer)

pytest -sx tests/datasets/test_datasets.py

The three row-alignment / initialization tests exercise the episode-subset download path through the new hf_hub_download mock. To spot-check the memory-mapping behaviour, load any LeRobot dataset and confirm the resident set stays bounded:

python -c "from opentau.datasets.lerobot_dataset import LeRobotDataset; ds = LeRobotDataset(repo_id='physical-intelligence/libero'); print(len(ds), 'frames')"

Checklist

I have added Google-style docstrings to important functions and ensured function parameters are typed.
My PR includes policy-related changes.
- If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

Note: Before submitting this PR, please read the contributor guideline.

…visions

…ta download

This reverts commit 25d50cc.

…d metadata download" This reverts commit 0e34b65.

…ucketing

…ypeError

…-RAM load

…rolled paths

…wnload filter

…isk files download_episodes never reached its "episodes is None" branch: __init__ backfills self.episodes with the full episode list before it runs, so whole-repo datasets (no episode subset) wrongly took the per-file download_files path and fired one hf_hub_download per file — thousands of HF API requests, tripping the 3000 req / 5 min rate limit (429). Capture _episodes_were_specified at construction so the branch survives the backfill and whole-repo datasets use snapshot_download (O(1) listing API calls). download_files also called hf_hub_download for every file with no on-disk check; hf_hub_download issues a network metadata request per file even when the file is already in local_dir. Skip files already on disk so a pre-downloaded episode set makes zero requests.

lerobot_dataset_factory patched snapshot_download but not hf_hub_download. Once download_episodes started routing the episode-subset path through download_files (which uses hf_hub_download), the 3 subset-episode tests in test_datasets.py made real Hub calls and 404'd on the dummy repo. Add mock_hf_hub_download_factory that writes the requested fixture file, mirroring mock_snapshot_download_factory.

shuheng-liu · 2026-05-14T17:02:06Z

Code review

Overview

Four files, four distinct dataset-loading fixes for large heterogeneous pretrain mixtures:

lerobot_dataset.py — (a) guard get_safe_version with is_valid_version on the data-download path; (b) swap the hand-rolled pa_ds.to_table() + Dataset(table) for load_dataset("parquet", ...) to get a genuinely mmap'd Arrow cache; (c) split download_episodes into a whole-repo snapshot_download path and a per-file hf_hub_download path that skips already-present files.
speed_percentiles.py — (d) try/finally so the distributed barrier runs on the cached-file early-return path; (e) episode_to_task_index_from_episodes skips unresolvable task labels instead of raising KeyError.
tests/fixtures/ — adds mock_hf_hub_download_factory and wires it in.

What checks out

is_valid_version guard correctly mirrors the existing metadata-path guard — SHA/branch refs now skip the vX.Y lookup.
Import cleanup is clean: pa_ds / Dataset / DatasetInfo are fully removed with no remaining references; get_hf_features_from_features is still used by create_hf_dataset, concatenate_datasets by add_episode. No orphans.
on_accelerate_main_proc(local=True, _sync=True) on download_files is safe — download_episodes is undecorated and runs on all ranks, so every rank reaches both download_files and pull_from_repo; no risk of the nested-decorator deadlock the decorator docstring warns about, and _sync=True broadcasts exceptions symmetrically.
Path handling is consistent: get_data_file_path returns a repo-relative Path, so (self.root / f).is_file(), hf_hub_download(filename=f, local_dir=self.root), and load_dataset(str(self.root / f)) all line up.
Row-order alignment is preserved: self.episodes is sorted(), the file list is built in that order, and load_dataset concatenates an explicit data_files list in list order — same final (episode, frame) order the old to_table produced. test_dataset_unsorted/sparse_episodes_row_alignment lock this in.
speed_percentiles try/finally correctly fixes the real barrier-skip desync. The .get() + dedup-warn branch handles task_index == 0 correctly (checks is None, not falsiness).
CI is green — all 7 check runs pass (CPU Tests, Pre-commit, check-checklist, review).

Issues

Dead code in load_hf_dataset (if self.episodes is None). __init__ unconditionally backfills self.episodes = list(self.meta.episodes) before the try/except that calls load_hf_dataset, so inside the method self.episodes is never None — the glob branch is unreachable, and the else branch already handles whole-repo loads (the full episode list). The docstring also leads with the glob behavior that never runs. Either drop the branch and trim the docstring, or add a comment that it's deliberately kept standalone-safe — as written it reads as live code. (Per repo CLAUDE.md "Simplicity First", I'd drop it.)
Schema is now inferred from parquet, not info.json. The old path passed explicit features=get_hf_features_from_features(self.meta.features); load_dataset("parquet", ...) infers from the files. This is framed as a feature (sidesteps cast errors) and it is — but info.json is no longer the authoritative schema, and cross-file schema drift within a dataset now surfaces as a load_dataset failure instead of a cast. Worth a one-line docstring note that schema validation against info.json is intentionally dropped here.
Arrow cache disk footprint. load_dataset now builds a 1–5x-parquet Arrow cache for the full mixture on every box (the old to_table path didn't). The docstring acknowledges it, but for a multi-hundred-GB mixture this is a large, uncleaned $HF_HOME/datasets/ footprint — worth calling out operationally (disk provisioning; nothing prunes it).

Test coverage gaps

The download_files skip-already-on-disk logic — the core of the 429-avoidance fix — isn't asserted anywhere. A test that pre-populates the episode files and asserts hf_hub_download is called zero times would lock in the exact behavior this PR introduces.
episode_to_task_index_from_episodes's new unresolvable-task-label skip has no test. TestEpisodeToTaskIndex covers single/multi/empty-tasks but not "task not in task_to_task_index". It's a pure function — a 4-line test ({0: {"tasks": ["ghost"]}}, t2i={} → {}) covers it.
The speed_percentiles try/finally barrier behavior needs a distributed harness to test; manual verification as noted in the PR is acceptable.

Minor

download_files: max_workers=16 is a hardcoded magic number — a module constant would be marginally cleaner.
download_files treats any on-disk file as complete. hf_hub_download writes atomically so this holds for its own downloads, but a truncated file from another source would be silently trusted. Edge case, matches the old snapshot_download behavior — non-blocking.
speed_percentiles: the finally now runs wait_for_everyone() even when the try raises. An asymmetric failure (one rank raises, others don't) becomes a barrier hang instead of a fast crash. Still strictly better than the old skip-the-barrier bug, and realistic failures here hit all ranks symmetrically — just noting the residual.

Verdict

Solid, well-reasoned fixes — each traces to a concrete failure mode and the comments explain the "why" well. Nothing here is a correctness blocker. Before un-drafting I'd (1) drop or explicitly justify the dead episodes is None branch in load_hf_dataset, and (2) add the two small unit tests above, since both are cheap and cover the exact behaviors this PR exists to introduce.

Generated by Claude Code

- load_hf_dataset: drop the unreachable `episodes is None` glob branch (__init__ backfills self.episodes before this method runs), remove the now-unused `import re`, and document in the docstring that schema is inferred from parquet (not validated against info.json) and that the Arrow cache is unpruned. - download_files: hoist the thread-pool width to a named module constant _DOWNLOAD_MAX_WORKERS. - Add test_download_files_skips_present_files (asserts zero hf_hub_download calls when the episode files are already on disk) and test_unresolvable_task_label_skipped (episode_to_task_index_from_episodes skips a task label absent from tasks.jsonl).

shuheng-liu · 2026-05-14T21:46:05Z

Review feedback addressed — `9e1f8ec`

Before-un-drafting items

Dead episodes is None branch in load_hf_dataset — dropped. __init__ backfills self.episodes with the full episode list before either load_hf_dataset call, so the glob branch was unreachable. Removed it, the now-unused import re, and the stale glob comment; added a docstring line noting self.episodes is always concrete in this method.
download_files skip-on-disk test — added test_download_files_skips_present_files: builds the dataset (which places every selected-episode file on disk), then asserts hf_hub_download is called zero times on a second download_files pass.
episode_to_task_index_from_episodes unresolvable-label test — added test_unresolvable_task_label_skipped to TestEpisodeToTaskIndex (a task label absent from tasks.jsonl is skipped, resolvable episodes still come through).

Also addressed

Docstring note that schema is inferred from the parquet, not validated against info.json (issue 2).
Docstring note that the Arrow cache is unpruned and needs disk provisioned (issue 3).
max_workers=16 → module constant _DOWNLOAD_MAX_WORKERS (minor).

Not changed — matches the review's own call

speed_percentiles try/finally distributed-barrier behavior — needs a multi-rank harness; manual verification stands.
download_files trusting a truncated on-disk file, and the finally-barrier-on-asymmetric-raise residual — both flagged non-blocking; left as-is.

All 170 tests in tests/datasets/{test_datasets,test_speed_percentiles,test_obs_history,test_segment_lerobot_dataset}.py pass locally (-m "not gpu"), pre-commit clean.

Likely addresses #306

This branch is based directly on 758d473 (current main), and its net diff rewrites the exact code in both commits #306 names as prime suspects:

fix(datasets): memory-map parquet instead of rewriting to Arrow #301 (27a2197, "memory-map parquet instead of rewriting to Arrow") introduced the pa_ds.dataset().to_table() path in load_hf_dataset. That to_table(filter=...) materialises the filtered table into anonymous RAM rather than mmapping — a sharp dataset-init slowdown / memory blowup, exactly the regression mode Nightly regression test timing out at 30m in "Train with Model Parallelism" #306 describes. This PR reverts it to load_dataset("parquet", ...), which is genuinely memory-mapped.
feat(datasets): bucket speed by per-task percentile rank, not duration #302 (758d473) introduced speed_percentiles.py, including load_or_compute_speed_percentiles with a conditional barrier — the cached-file early-return skipped wait_for_everyone(), desyncing the collective counter and causing a NCCL hang at a later sync point. In a multi-rank job (the failing job is "Train with Model Parallelism") that surfaces as a stall that runs to the wall-clock timeout. This PR wraps the body in try/finally so the barrier always runs.

So whichever of #301 / #302 is the actual culprit, this PR covers it. Not yet reproduced — re-running the nightly regression on this branch (or post-merge) would confirm.

shuheng-liu · 2026-05-14T22:14:07Z

Re-review — follow-up commit `9e1f8ec`

The follow-up commit cleanly resolves everything from the review above, and CI is green on the new head (6/6 checks).

Addressed:

Dead code — load_hf_dataset now builds only the explicit per-episode file list; the unreachable episodes is None glob branch is gone and import re is removed (verified no other re. usage remains in the file). Docstring updated to state self.episodes is always a concrete list here.
Schema inference — docstring paragraph added making explicit that the schema is intentionally not validated against info.json.
Disk footprint — docstring now notes the Arrow cache is unpruned and a multi-hundred-GB mixture needs that much extra disk provisioned.
Test gap (download_files) — test_download_files_skips_present_files asserts zero hf_hub_download calls when the episode files are already on disk — directly locks in the 429-avoidance behavior.
Test gap (episode_to_task_index) — test_unresolvable_task_label_skipped covers the new skip-and-warn branch and confirms resolvable episodes in the same dataset still pass through.
Magic number — max_workers=16 → _DOWNLOAD_MAX_WORKERS module constant with an explanatory comment.

One minor nit (non-blocking): the new schema docstring paragraph calls the parquet-vs-info.json case "cross-file column drift" — but "cross-file" reads more naturally as parquet-file-to-parquet-file. The two cases behave differently: parquet/info.json mismatch now loads silently (parquet wins), whereas parquet-to-parquet mismatch is what actually raises a load_dataset failure. As written it reads slightly self-contradictory ("lets ... drift load" → "drift surfaces as a failure"). Worth relabeling for precision, but the substance is correct.

Looks good to un-draft once the wording nit is settled.

Generated by Claude Code

Split the two distinct schema-drift cases the re-review flagged as conflated: a parquet/info.json mismatch now loads silently (parquet wins), whereas a mismatch between parquet files of one dataset fails as a load_dataset concatenation error.

shuheng-liu · 2026-05-14T22:17:23Z

Wording nit addressed — `b6bc39d`

Reworded the load_hf_dataset schema docstring to separate the two drift cases that read as conflated:

parquet vs info.json mismatch → now loads silently; the parquet's own schema wins and info.json is no longer authoritative.
mismatch between the parquet files of one dataset → still fails, but as a load_dataset concatenation error rather than the old explicit feature-cast error.

Pure docstring change, CI unaffected. That was the last open item from the re-review.

shuheng-liu · 2026-05-14T22:27:11Z

Re-review — follow-up commit `b6bc39d`

Docs-only change; rewrites the load_hf_dataset schema-inference paragraph from the previous nit. The new wording is accurate and cleanly separates the two cases — a parquet/info.json mismatch now loads silently (parquet wins), while a mismatch between the parquet files of a single dataset still fails as a load_dataset concatenation error.

Nit fully resolved. No open items remain — looks ready to un-draft. CI green on the substantive checks (pre-commit, review, check-checklist); CPU Tests in progress at time of writing but unaffected by a docstring-only change.

Generated by Claude Code

fix(datasets): guard get_safe_version with is_valid_version on SHA re…

1135b66

…visions

shuheng-liu added the 🐛 Bug label May 14, 2026

shuheng-liu self-assigned this May 14, 2026

fix(datasets): fall back to inferred features when info.json cast fails

ad609f1

shuheng-liu added bug Something isn't working and removed 🐛 Bug labels May 14, 2026

shuheng-liu added 14 commits May 13, 2026 21:11

fix(datasets): use HF Hub filelock instead of rank-coordinated metada…

0e34b65

…ta download

fix(datasets): always barrier in load_or_compute_speed_percentiles

40f2a34

build(deps): add py-spy to dev extras for deadlock debugging

25d50cc

Revert "build(deps): add py-spy to dev extras for deadlock debugging"

808bd4d

This reverts commit 25d50cc.

Revert "fix(datasets): use HF Hub filelock instead of rank-coordinate…

db438c4

…d metadata download" This reverts commit 0e34b65.

fix(datasets): skip episodes with unresolvable task labels in speed b…

65dfad7

…ucketing

fix(datasets): also fall back on ValueError Keys mismatch, not just T…

59e7d08

…ypeError

fix(datasets): fall back on any typed-Dataset schema error, truncate log

747e4cc

fix(datasets): stream parquet to mmap'd Arrow file instead of full in…

c9810cc

…-RAM load

fix(datasets): write Arrow IPC stream format for Dataset.from_file mmap

0d12744

fix(datasets): use load_dataset(parquet) for genuine mmap, drop hand-…

4bf9277

…rolled paths

fix(datasets): download episode files directly, skip slow snapshot_do…

f5cc691

…wnload filter

shuheng-liu changed the title ~~fix(datasets): allow SHA-format revisions in dataset loader~~ fix(datasets): harden dataset loading for large pretrain mixtures May 14, 2026

shuheng-liu marked this pull request as ready for review May 14, 2026 22:27

shuheng-liu merged commit 8bd159b into main May 14, 2026
6 checks passed

shuheng-liu deleted the claude/intelligent-chatelet-f16fb0 branch May 14, 2026 22:27

WilliamYue37 mentioned this pull request May 15, 2026

Nightly regression test timing out at 30m in "Train with Model Parallelism" #306

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(datasets): harden dataset loading for large pretrain mixtures#304

fix(datasets): harden dataset loading for large pretrain mixtures#304
shuheng-liu merged 18 commits into
mainfrom
claude/intelligent-chatelet-f16fb0

shuheng-liu commented May 14, 2026 •

edited

Loading

Uh oh!

shuheng-liu commented May 14, 2026

Uh oh!

shuheng-liu commented May 14, 2026

Uh oh!

shuheng-liu commented May 14, 2026

Uh oh!

shuheng-liu commented May 14, 2026

Uh oh!

shuheng-liu commented May 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

shuheng-liu commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this does

How it was tested

How to checkout & try? (for the reviewer)

Checklist

Note: Before submitting this PR, please read the contributor guideline.

Uh oh!

shuheng-liu commented May 14, 2026

Code review

Overview

What checks out

Issues

Test coverage gaps

Minor

Verdict

Uh oh!

shuheng-liu commented May 14, 2026

Review feedback addressed — 9e1f8ec

Likely addresses #306

Uh oh!

shuheng-liu commented May 14, 2026

Re-review — follow-up commit 9e1f8ec

Uh oh!

shuheng-liu commented May 14, 2026

Wording nit addressed — b6bc39d

Uh oh!

shuheng-liu commented May 14, 2026

Re-review — follow-up commit b6bc39d

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

shuheng-liu commented May 14, 2026 •

edited

Loading

Review feedback addressed — `9e1f8ec`

Re-review — follow-up commit `9e1f8ec`

Wording nit addressed — `b6bc39d`

Re-review — follow-up commit `b6bc39d`