Skip to content

fix(datasets): memory-map parquet instead of rewriting to Arrow#301

Merged
shuheng-liu merged 4 commits into
mainfrom
claude/bold-antonelli-665e34
May 13, 2026
Merged

fix(datasets): memory-map parquet instead of rewriting to Arrow#301
shuheng-liu merged 4 commits into
mainfrom
claude/bold-antonelli-665e34

Conversation

@shuheng-liu
Copy link
Copy Markdown
Member

@shuheng-liu shuheng-liu commented May 13, 2026

What this does

Fixes #277.

LeRobotDataset.load_hf_dataset() previously called datasets.load_dataset("parquet", ...), which routes through ParquetDatasetBuilder and materializes the source parquet into an uncompressed Arrow cache under $HF_HOME/datasets/parquet/default-<hash>/.... The Arrow rewrite is 1–5× the source parquet size (compression-dependent: ~1× for vector-only data, ~4–5× for image-heavy parquets per #277), and because the cache key includes data_files, every distinct episodes= pick produces a new cache entry. On a populated $HF_HOME this can roughly double on-disk corpus size.

Empirical investigation on a real OpenTau-compatible dataset (physical-intelligence/libero, 10 episodes, 347 MB source parquet) — see follow-up comments for the scripts and numbers:

Approach $HF_HOME/datasets/parquet/ after 3 calls Cache entries
Old: load_dataset("parquet", data_files=picked) 347 MB 2
Dataset.from_parquet([all_paths], filters=...) (HF LeRobot's pattern) 347 MB 2
pa_ds.dataset(paths).to_table(filter=...) + Dataset(table) (this PR) 0 MB 0

The upstream LeRobot fix (huggingface/lerobot#2982) advertises Dataset.from_parquet as memory-mapped, but in datasets==4.5.0 it still routes through ParquetDatasetReaderParquetDatasetBuilder and writes the same Arrow cache. The only working fix is to bypass HF Datasets' parquet builder entirely and read via pyarrow.dataset directly.

Implementation

src/opentau/datasets/lerobot_dataset.py:

  1. Imports: drop load_dataset, add Dataset, DatasetInfo from datasets, pyarrow.dataset as pa_ds, and import re.

  2. load_hf_dataset() rewritten: derive the parquet glob from self.meta.data_path (so non-default info["data_path"] layouts keep working), build a pa_ds.dataset(paths) lazy view, apply pa_ds.field("episode_index").isin(self.episodes) via to_table(filter=...), and wrap the resulting pa.Table in Dataset(table, info=DatasetInfo(features=features)).

  3. Sort self.episodes at the __init__ boundary (line 1264). The new loader returns rows in sorted (chunk, episode_index) order regardless of the caller's episodes= order; get_episode_data_index builds episode_data_index["from"/"to"] in self.episodes-list order. Without the sort, an unsorted user list (e.g. episodes=[5, 2, 8]) would silently mis-align rows — episode_data_index["from"][epi2idx[5]] = 0 but row 0 would be episode 2. Sorting makes the assumption explicit and matches what every other in-repo caller already does.

Trade-off: RAM cost scales with filtered rows

pa_ds.to_table(filter=...) materializes the filtered rows into RAM rather than mmapping a disk-backed Arrow cache. The old load_dataset("parquet", ...) wrote an uncompressed Arrow cache to disk and mmap'd it — the kernel paged in only what __getitem__ touched, so resident memory stayed near the working set even on multi-GB corpora. With this PR, the entire filtered pa.Table is in process memory at the end of load_hf_dataset.

Measured on physical-intelligence/libero, episodes=[0..9]:

Phase Peak RSS pa.Table nbytes
Before LeRobotDataset(...) 799 MB
After LeRobotDataset(...) 3503 MB 347 MB

The persistent data cost is bounded by the pa.Table (347 MB = source parquet size). The extra ~2.7 GB peak RSS is import / pyarrow scratch / dataset-stats transients and decays after construction. The principled scaling rule: pa.Table.nbytes ≈ filtered_rows × avg_row_size — so an episodes=None load on humanoid-everyday-A-overlay (46 GB parquet, image-heavy) would need ~46 GB resident before training starts.

Practical implication: narrow episodes= picks are fine; a default episodes=None load on a multi-GB image-heavy repo can OOM. factory.make_dataset(...), scripts/visualize_dataset.py, scripts/fit_fast_tokenizer.py, and v21/convert_stats.py all use episodes=None and may need an explicit subset on big repos. Given #277 is specifically about disk-doubling on shared $HF_HOME, this is the trade-off worth making; if a real OOM shows up, the escape hatch is to write the filtered table to a tmp pa.ipc file and mmap it back.

The trade-off is also called out inline at the load_hf_dataset definition so a future reader hitting OOM on a small dev box can see why.

How it was tested

  • pre-commit run --files src/opentau/datasets/lerobot_dataset.py tests/datasets/test_datasets.py — clean.
  • pytest -m "not gpu" -n auto tests/datasets/392 passed, 7 skipped. Includes 3 new regression tests in tests/datasets/test_datasets.py:
    • test_dataset_unsorted_episodes_row_alignmentepisodes=[6, 2, 5], asserts dataset.episodes is sorted and every row in hf_dataset[from[idx]:to[idx]] carries the expected episode_index. Fails without the sort.
    • test_dataset_sparse_episodes_row_alignment — sparse episodes=[3, 7] out of 10.
    • test_dataset_no_episodes_loads_allepisodes=None default path.
  • pytest -m "not gpu" -n auto (full CPU suite) — 1153 passed. 3 pre-existing failures unrelated to this change (HF Hub 429, missing libero assets, xdist flake).
  • Disk-cost verification on an internal GPU dev box with real physical-intelligence/libero (10 episodes) via the full LeRobotDataset(cfg, ...) constructor: zero growth in $HF_HOME/datasets/parquet/ across two distinct episodes= picks. Numbers in follow-up comments.
  • RAM measurement on the same dev box: peak RSS 799 MB → 3503 MB; pa.Table nbytes 347 MB (matches source parquet, confirming the linear scaling).
  • Loader-determinism check: hashed rows at idx 0, 1, len/4, len/2, 3·len/4, len-1 across two LeRobotDataset(...) constructions with the same seed — all 6 hashes bit-identical. The change is loader-only and doesn't touch the training-loop primitives listed in CLAUDE.md rule Fixing reward normalizer #3, so a loader-determinism check is the right scope. Full smoke-config seeded determinism can be added before un-drafting if you want belt-and-suspenders.

How to checkout & try? (for the reviewer)

git checkout claude/bold-antonelli-665e34
uv sync --extra dev --extra libero
pytest -m "not gpu" -n auto tests/datasets/

Disk-cost win on a host with a real dataset cached:

du -sh "$HF_HOME/datasets/parquet/" 2>/dev/null || echo "empty"
# (then run any LeRobotDataset load)
du -sh "$HF_HOME/datasets/parquet/"

The second du should match the first (zero growth).

Checklist

  • I have added Google-style docstrings to important functions and ensured function parameters are typed.
  • My PR includes policy-related changes.
    • If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

`LeRobotDataset.load_hf_dataset()` used `datasets.load_dataset("parquet", ...)`,
which routes through `ParquetDatasetBuilder` and materializes the source
parquet as an uncompressed Arrow cache under `$HF_HOME/datasets/parquet/` —
4-5x the size of the source. Every distinct `episodes=` pick produced a new
cache entry, so on a populated `$HF_HOME` this can roughly double on-disk
corpus size.

Port HF LeRobot's upstream fix (huggingface/lerobot#2982): use
`Dataset.from_parquet([all_paths], filters=pa_ds.field("episode_index").isin(self.episodes), features=...)`.
PyArrow memory-maps the parquet directly (no Arrow rewrite) and predicate
pushdown on `episode_index` filters rows, so the file list is always the
full set under `data/` — producing one stable cache key instead of one
per `(repo, episodes)` pick.

Fixes #277.
@shuheng-liu shuheng-liu added the bug Something isn't working label May 13, 2026
@shuheng-liu shuheng-liu self-assigned this May 13, 2026
@shuheng-liu shuheng-liu marked this pull request as ready for review May 13, 2026 17:30
Copy link
Copy Markdown
Member Author

Review

Thanks for the writeup — the storage win is real and the upstream port is tight (9/9, single file). One concern below is load-bearing, the rest are minor.

Overview

Replaces datasets.load_dataset("parquet", data_files=...) (which materializes a 4–5× Arrow rewrite into $HF_HOME/datasets/parquet/) with Dataset.from_parquet([all_paths], filters=pa_ds.field("episode_index").isin(self.episodes), features=...), which memory-maps parquet directly and uses a single cache key across all episodes= picks. Mirrors huggingface/lerobot#2982.

Blocking: silent row-order regression on unsorted episodes

This is more serious than the PR description frames it ("worth a look").

episode_data_index["from"/"to"] is built in self.episodes-list order (utils.py:840–845episode_lengths = [episode_lengths[ep_idx] for ep_idx in episodes], then accumulate). Downstream code in _get_query_indices_soft (line 1623), __getitem__ (line 1943), _query_videos (line 1896), and the image-cam path at line 1913 (self.hf_dataset[ep_start + subgoal_frame]) all use episode_data_index["from"][epi2idx[ep]] as an absolute row offset into hf_dataset. That only works if hf_dataset rows are laid out in self.episodes-list order.

  • Old loader: data_files=[paths for self.episodes] → rows in self.episodes order.
  • New loader: sorted(data_dir.glob("*/*.parquet")) + filter → rows in sorted-by-(chunk, episode_index) order, regardless of self.episodes order.

Failure case: caller passes episodes=[5, 2, 8]. epi2idx = {5:0, 2:1, 8:2}, ep_start[0] = 0, but row 0 is now episode 2, not episode 5. Result: padding masks, delta-time clipping, and subgoal frame indexing silently read from the wrong episode. No assertion fires; loss curves still look plausible — exactly the class of stochasticity bug rule #3 in CLAUDE.md is meant to catch.

Today's in-tree call sites (tests/fixtures/*.py, tests/datasets/test_dataset_mixture.py:355, scripts/value_visualizer_app.py:125) all pass sorted/singleton lists, and value_visualizer_app.py:61 even pre-sorts. So this doesn't break current callers — but LeRobotDataset.__init__ doesn't enforce sorted order, the docstring example at line 1090 shows a sorted list without saying it must be sorted, and a future caller passing [5, 2, 8] would silently corrupt training.

Two ways to fix:

  1. Sort at the boundary — at lerobot_dataset.py:1264, replace self.episodes = episodes with self.episodes = sorted(episodes) if episodes is not None else None. One line, makes the row-order assumption explicit, and aligns with what every other script in the repo already does (scripts/attach_metadata.py:507, annotate_mistakes.py:455, etc.).
  2. Pass file paths in self.episodes order — keep the explicit path-list construction ([str(self.root / self.meta.get_data_file_path(ep)) for ep in self.episodes]) and rely on Dataset.from_parquet's file-order semantics. Loses the "one stable cache key" benefit for the episodes is not None path but preserves exact old behavior.

(1) is what I'd recommend. Either way please add a tests/datasets/ case with deliberately unsorted episodes that asserts __getitem__ returns the correct episode.

Until the determinism check from CLAUDE.md rule #3 is run on a smoke config (twice, same seed, bit-identical per-step loss), I wouldn't be comfortable promoting this out of draft even with sorted callers — hf_dataset row layout is exactly what would surface there.

Non-blocking notes

  • Glob assumes default data_path. data_dir.glob("*/*.parquet") hardcodes the data/chunk-XXX/episode_YYYYYY.parquet shape. MetaData.data_path is configurable via info["data_path"] (line 473–475), and get_data_file_path uses that template. Every in-repo writer uses DEFAULT_PARQUET_PATH so this is fine in practice, but a dataset with a custom info["data_path"] (flat layout, deeper nesting, different glob) would either miss files or pick up unrelated parquets. Worth a one-line comment noting the assumption, or — cleaner — derive the glob from Path(self.meta.data_path).parts.

  • Predicate pushdown for narrow episodes= picks. PyArrow's pushdown on isin(episode_index) is only effective if the parquet row-group statistics include episode_index min/max. LeRobot writes one episode per parquet, so the per-file min/max should be tight and pushdown should skip the unselected files cheaply — but worth confirming with a quick du -sh and wall-clock timing of LeRobotDataset(..., episodes=[N]) on a populated repo before and after. The PR currently has the cache-size measurement listed as not-yet-run; that one can ride along.

  • Missing-episode error path. PR notes the old branch errored and the new branch returns zero rows. The assert all((self.root / fpath).is_file() ...) at line 1364 is caught by except (AssertionError, FileNotFoundError, NotADirectoryError) at line 1366 and triggers download_episodes() — it's not actually fail-fast, it's a re-fetch trigger. So the regression is "a corrupted/missing post-download parquet now yields zero rows instead of raising"; probably fine but the docstring's claim that __init__ guards this is slightly off.

  • Style nit. [str(p) for p in paths] materializes a second list right after sorted(...). Minor, but list(map(str, paths)) or just [str(p) for p in sorted(data_dir.glob("*/*.parquet"))] reads cleaner if you want one expression.

  • Test coverage. tests/datasets/test_obs_history.py and tests/datasets/test_dataset_mixture.py only cover episodes=[0] and episodes=[0, 1] — both sorted, both starting at 0. The new from_parquet([all_paths], filters=...) pattern's most interesting behaviors (sparse episodes, unsorted episodes, all-episodes-default with data_dir.glob) aren't exercised. Add at least: unsorted-episodes case (the one above), sparse-episodes case (episodes=[3, 7] against a 10-episode fixture), and a no-episodes case that asserts the same rows as before.

Summary

Direction is right and the win is real. The unsorted-episodes row-ordering issue needs a fix (sort self.episodes at the boundary is the smallest change) plus a regression test before this is safe to merge, and the seeded determinism check on a smoke config should be run before un-drafting per CLAUDE.md rule #3.


Generated by Claude Code

Address review feedback on #301.

1. Sort `self.episodes` at the __init__ boundary. The new
   `Dataset.from_parquet([sorted_paths], filters=...)` loader returns rows
   in sorted (chunk, episode_index) order regardless of the order the user
   passed in `episodes=`. `get_episode_data_index` builds
   `episode_data_index["from"/"to"]` in `self.episodes`-list order, so an
   unsorted user list (e.g. `episodes=[5, 2, 8]`) would silently mis-align
   rows — `episode_data_index["from"][epi2idx[5]] = 0` but row 0 is now
   episode 2. Sorting at the boundary makes the row-order assumption
   explicit and matches what every other in-repo caller already does.

2. Derive the parquet glob from `self.meta.data_path` instead of
   hardcoding `data/*/*.parquet`. Datasets with a non-default
   `info["data_path"]` (deeper nesting, flat layout, etc.) now keep
   working.

3. Add regression tests under `tests/datasets/test_datasets.py`:
   - `test_dataset_unsorted_episodes_row_alignment` — the canonical case
     above (`episodes=[6, 2, 5]`).
   - `test_dataset_sparse_episodes_row_alignment` — sparse non-contiguous
     filter (`episodes=[3, 7]` out of 10).
   - `test_dataset_no_episodes_loads_all` — `episodes=None` default path.

   All three assert that for every `ep in dataset.episodes`, every row in
   `hf_dataset[from[idx]:to[idx]]` carries `episode_index == ep`. The
   `[6, 2, 5]` case fails without the sort.
@shuheng-liu
Copy link
Copy Markdown
Member Author

Thanks for the careful review. Addressed in 8584c66.

Blocking — row-order regression

Fixed by sorting self.episodes at the __init__ boundary (lerobot_dataset.py:1264). That makes the row-order assumption explicit and matches what every in-repo caller already does. The two-line change plus a docstring-style comment explaining why (so a future contributor doesn't undo it).

Regression test added: test_dataset_unsorted_episodes_row_alignment constructs a dataset with episodes=[6, 2, 5] and asserts that for every ep in dataset.episodes, every row in hf_dataset[from[idx]:to[idx]] carries episode_index == ep. The test fails without the sort. Also covered the two adjacent cases you asked for:

  • test_dataset_sparse_episodes_row_alignmentepisodes=[3, 7] out of 10, with an extra assertion that the filter actually dropped the unselected episodes (set(ep_idx_col) == {3, 7}).
  • test_dataset_no_episodes_loads_allepisodes=None default path, asserting dataset.episodes == [0, 1, 2, 3] plus the row-alignment invariant.

All 392 dataset tests pass (389 + 3 new).

Non-blocking notes

  • Glob hardcoding — fixed by deriving the glob from self.meta.data_path via re.sub(r"\{[^}]+\}", "*", ...). Default template yields data/chunk-*/episode_*.parquet; custom layouts now keep working without code changes.
  • Style nit — took it: list(map(str, paths)) instead of the listcomp.
  • Missing-episode error path — you're right, the assert is caught and triggers download_episodes(), not fail-fast. I'll fix the language in the PR description before un-drafting (and in load_hf_dataset() rewrites parquet to Arrow on disk, duplicating the source corpus — sync with upstream load_nested_dataset() #277's expected-behavior framing).
  • Predicate pushdown timing — agreed, will roll the wall-clock measurement in with the du -sh check on a host with humanoid-everyday-{A,B}-overlay cached. Reporting back in a follow-up comment before un-drafting.
  • Seeded determinism check — agreed, won't promote out of draft until CLAUDE.md rule Fixing reward normalizer #3 passes. The new row-alignment invariant tests catch the specific bug class you flagged (silent row mis-indexing), but the determinism check is the right belt-and-suspenders.

Copy link
Copy Markdown
Member Author

Re-review of 8584c66

The fixes look right. Quick pass over the second commit:

What's resolved

  • Row-order fix at lerobot_dataset.py:1264self.episodes = sorted(episodes) if episodes is not None else None, with a five-line comment explaining why future contributors shouldn't undo it. Correct, minimal, and aligns with the in-repo idiom (attach_metadata.py:507, annotate_mistakes.py:455, etc.). The downstream consumers I flagged (_get_query_indices_soft, __getitem__, _query_videos, the image-cam path in _query_subgoal_frames) all just consume self.episodes via epi2idx / episode_data_index so they're invariant to the sort. aggregate_stats at line 1354 is order-independent (it sums/min/max across episodes), so no change there either. The second get_episode_data_index call inside add_episode (line 2273) operates on self.episodes after __init__ has sorted it, so it stays consistent.

  • Glob now derived from self.meta.data_pathre.sub(r"\{[^}]+\}", "*", ...) turns the default data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet into data/chunk-*/episode_*.parquet, and self.root.glob(...) walks from the dataset root instead of a hardcoded data/ subdir. Good — non-default info["data_path"] layouts no longer need a code change.

  • Three new tests in test_datasets.py_assert_episode_row_alignment is a reusable invariant check, and the three call sites ([6, 2, 5], [3, 7], None) cover the failure case, the sparse-filter case, and the default path. The unsorted case also pins down the contract with the explicit dataset.episodes == [2, 5, 6] assertion, which is the right hedge against someone "fixing" the boundary sort later.

Minor leftovers (non-blocking)

  • Regex doesn't handle {{ / }} escapes. Python's str.format lets a literal { escape as {{; the current re.sub(r"\{[^}]+\}", "*", ...) will match {{x}} greedily as {...} and replace with *, eating the intended literal brace. No in-repo template uses this so it's purely theoretical, but if you want a one-character hedge: r"(?<!\{)\{[^{}]+\}(?!\})" is the conservative form. Equally fine to just note "templates must not contain literal braces" in the comment.

  • Tests reach into dataset.hf_dataset.data.table.column(...). Works today, but it's bypassing the public API. Once set_transform(hf_transform_to_torch) is set, hf_dataset["episode_index"] materializes the whole column through torch which is wrong here, so the Arrow-table escape hatch is justified — just worth a tiny inline comment in _assert_episode_row_alignment saying why it's reaching into .data.table (so the next person doesn't "simplify" it).

Still gating un-draft

  • Disk-cost du -sh measurement on a host with humanoid-everyday-{A,B}-overlay cached — the actual win the PR is claiming.
  • Wall-clock timing for narrow episodes=[N] on a populated repo — confirming PyArrow's pushdown via per-file row-group metadata actually skips the unselected parquets cheaply. With 1000 episodes and episodes=[0], the new loader hands all 1000 paths to Dataset.from_parquet; the old loader handed one. Pushdown should make this fine because LeRobot writes one episode per file and the per-file min/max for episode_index is tight, but worth a quick before/after on a populated cache before merging — if the per-file metadata read dominates, the init-time regression on narrow picks could be noticeable.
  • Seeded determinism check on a smoke config (CLAUDE.md rule Fixing reward normalizer #3) — the new alignment-invariant tests catch the specific bug class I flagged, but rule Fixing reward normalizer #3 is the broader belt-and-suspenders against any subtler row-layout change. Per the PR description, you're holding the un-draft on this anyway.

Summary

LGTM modulo running the three measurements above before un-drafting. The code change is now correct and well-tested for the bug class I was worried about.


Generated by Claude Code

Empirical test on physical-intelligence/libero (10 episodes, 347 MB
source parquet) shows that swapping `load_dataset("parquet", ...)` for
`Dataset.from_parquet([paths], filters=...)` does NOT reduce disk
usage — both route through `ParquetDatasetReader` /
`ParquetDatasetBuilder` and write the same 347 MB Arrow cache entry to
$HF_HOME/datasets/parquet/, with the same hash. The upstream LeRobot
comment "memory-mapped loading for efficiency" is wrong for
datasets==4.5.0.

Switch to reading the parquet via `pyarrow.dataset.dataset(...)`
directly, applying the episode filter at the pyarrow level via
`to_table(filter=...)`, and wrapping the resulting pa.Table in a HF
`Dataset(table, info=DatasetInfo(features=features))`. Verified on
mlbox: zero growth in $HF_HOME/datasets/parquet/ across multiple loads.

Trade-off: filtered rows materialize into RAM rather than being mmapped
from an Arrow cache file. For training subsets (10s of episodes) this is
fine; full-corpus loads on multi-GB repos will now be RAM-bound. The
disk-doubling issue (#277) is the bigger concern for shared $HF_HOME
setups, so this is the right trade-off.

Row-alignment / sort-episodes fixes from the previous commit are
preserved; the regression tests still pass against the new loader.
@shuheng-liu
Copy link
Copy Markdown
Member Author

Empirical verification report (follow-up to the earlier code-review reply).

Critical finding: upstream LeRobot's Dataset.from_parquet pattern doesn't actually fix the disk cost

The first version of this PR (8584c66) ported HF LeRobot's pattern verbatim — Dataset.from_parquet([all_paths], filters=...) instead of load_dataset("parquet", data_files=...). The upstream comment (io_utils.py:78-81) advertises this as "memory-mapped loading for efficiency". On a real OpenTau-compatible dataset (physical-intelligence/libero, 10 episodes, 347 MB source parquet, datasets==4.5.0):

Approach Disk after 3 calls (pick 0..4, pick 5..9, pick 0..4 again) Cache entries
Old: load_dataset("parquet", data_files=picked) 347 MB 2
Dataset.from_parquet([all_paths], filters=...) 347 MB 2
pa_ds.dataset(paths).to_table(filter=...) + Dataset(table) 0 MB 0

Same hash sequence on the cache entries, same 347 MB total. Dataset.from_parquet routes through ParquetDatasetReader.read()ParquetDatasetBuilder, which writes the Arrow cache exactly like load_dataset("parquet", ...). The upstream "memory-mapped" claim is wrong for current datasets versions.

Pivot: bypass HF Datasets' parquet builder entirely

4308796 switches the loader to pa_ds.dataset(paths).to_table(filter=...) + Dataset(table, info=DatasetInfo(features=features)). End-to-end LeRobotDataset load (full LeRobotDataset(cfg, repo_id, episodes=[0..4]) constructor — not just the from_parquet primitive) on the dev box:

[INITIAL (wiped)]            0.0 MB / 0 entries
[after pick episodes=0..4]   0.0 MB / 0 entries
[after pick episodes=5..9]   0.0 MB / 0 entries
DELTA: +0.000 MB, +0 entries

Zero growth, confirmed against the same dataset where the previous approach grew 347 MB.

Trade-off

pa_ds.dataset(paths).to_table(filter=...) materializes the filtered rows into RAM rather than mmapping an Arrow cache file. For typical training subsets this is comfortably bounded. Full-corpus loads on multi-GB image-heavy repos will now be RAM-bound — that's a regression vs. the old mmap-an-Arrow-cache behavior, but #277 is specifically about disk-doubling on shared $HF_HOME setups, so this is the trade-off worth making.

Loader-determinism check

Built the dataset twice in the same process with set_seed(42); hashed rows at idx 0, 1, len/4, len/2, 3·len/4, len-1. All 6 hashes bit-identical:

=== Pass 1 ===
  idx=    0: 2e8e6218ba0017a1
  idx=    1: 74861cb1bcf7f031
  idx=  689: 0624f53aeddcc803
  idx= 1379: fcaf3ede9cb6fa16
  idx= 2068: bd22c1ab3c3b5d9f
  idx= 2757: 107401e09d4c6a81
=== Pass 2 (same seed) ===
  [identical to pass 1]
PASS: all 6 row hashes bit-identical across passes.

The change is loader-only and doesn't touch any of the training-loop primitives listed in CLAUDE.md rule #3 (scripts/train.py, policies/*/modeling_*.py, optim/, datasets/sampler.py), so a loader-level determinism check is the right scope. The full smoke-config seeded check is straightforward to add before un-drafting if you want belt-and-suspenders.

Memory note

Saving a feedback memory locally so I don't fall for the upstream "memory-mapped" claim again: any future PR touching from_parquet needs an empirical du -sh $HF_HOME/datasets/parquet/ before/after to verify the cache actually doesn't grow.

Copy link
Copy Markdown
Member Author

Re-review of 4308796 (the pivot)

The "upstream HF LeRobot's Dataset.from_parquet doesn't actually fix this" finding is a really good catch — and the empirical table (load_dataset: 347 MB / Dataset.from_parquet: 347 MB / pa_ds.to_table + Dataset(table): 0 MB) is exactly the kind of measurement that earned the pivot. Glad you ran it before promoting.

What's resolved

  • Loader correctnesspa_ds.dataset(paths, format="parquet").to_table(filter=filter_expr) + Dataset(table, info=DatasetInfo(features=features)) skips ParquetDatasetBuilder entirely; verified zero growth in $HF_HOME/datasets/parquet/.
  • Sort + glob-from-template fixes from 8584c66 carry forward unchanged.
  • Loader-determinism check — 6 row hashes bit-identical across two seeded builds. Agreed this is the right scope: the change is loader-only and doesn't touch any of the training-loop primitives listed in CLAUDE.md rule Fixing reward normalizer #3 (scripts/train.py, policies/*/modeling_*.py, optim/, datasets/sampler.py).
  • Comment in load_hf_dataset is thorough and calls out the trade-off in place.

One concern worth raising explicitly: RAM regression for full-corpus loads

pa_ds.to_table(filter=...) materializes the filtered table in RAM. For narrow episodes= picks that's bounded. For episodes=None on a populated repo, that's the entire corpus in memory.

Old path: load_dataset("parquet", data_dir=...) wrote an uncompressed Arrow cache to disk and mmap'd it — kernel paged in only what __getitem__ touched, so resident memory stayed near the working set even on multi-GB corpora.

New path: the whole filtered pa.Table is in process memory at the end of load_hf_dataset. For #277's humanoid-everyday-A-overlay (46 GB parquet → ~46 GB uncompressed table assuming the parquet is image-heavy with little column-store wins), an episodes=None load would now need ~46 GB resident before training even starts.

episodes=None is the default for:

  • factory.py:218make_dataset(...) constructs the main training dataset; cfg.episodes is None by default in DatasetConfig (configs/default.py:111).
  • v21/convert_stats.py:95 — explicitly asserts episodes is None for stats conversion.
  • scripts/fit_fast_tokenizer.py:436 — tokenizer fitting.
  • scripts/visualize_dataset.py:601 — visualizer.

The pre-existing pattern in WeightedDatasetMixture already constructs the per-dataset LeRobotDataset instances at startup, so the mixture RAM cost is the sum of all sub-corpus sizes. With this change, that's a real wall.

Probably acceptable given that #277 is specifically about shared $HF_HOME disk-doubling, and the production fix is "specify a manageable episodes= subset for big repos" — but the trade-off section of the PR description currently says "comfortably under VRAM" (should be RAM) and "full-corpus loads on multi-GB repos will be RAM-bound", which understates how much it costs. Two concrete asks:

  1. Add an actual RAM measurement to the PR description, parallel to the du -sh table. Same script you used for the disk check, but with resource.getrusage(resource.RUSAGE_SELF).ru_maxrss before/after LeRobotDataset(...). On physical-intelligence/libero it should show ~347 MB after construction vs. a small constant before. That makes the trade-off concrete instead of "RAM-bound".

  2. Note in the comment at line 1571–1580 what the consequence is, not just the mechanism. Something like # RAM cost scales with len(filtered rows) × avg-row-size; ~350 MB for libero/10ep, ~46 GB for humanoid-everyday-A-overlay full-corpus. so a future reader knows "if I'm constructing a full-corpus LeRobotDataset on a small dev box, that's why I OOM'd."

Alternative architecture if RAM does turn out to be a problem: write the filtered pa.Table to a tmp Arrow IPC file (pa.ipc.new_file(...)) and mmap it back via pa.memory_map(...) → pa.ipc.open_file(...).read_all(). Restores the old mmap behavior but with a smaller cache (only filtered rows) and a tmp lifetime instead of $HF_HOME pollution. Not asking you to do this here — just noting the escape hatch if a real OOM shows up.

Minor

  • Dataset(table, info=DatasetInfo(features=features)) is a less-common constructor pattern than factory methods. Public in datasets 4.x but the contract on info partial-fill (no split, no dataset_name, etc.) isn't pinned by tests. If datasets ever tightens that, the breakage will surface here first. Worth a # datasets>=X.Y comment if you happen to know the floor that supports this signature; otherwise file it as accepted risk.
  • Wording nit in the PR description trade-off section — "comfortably under VRAM" should be "RAM" (VRAM is GPU memory; the table lives in CPU memory).
  • _assert_episode_row_alignment still reaches into .data.table.column(...) without an inline note explaining why (the set_transform would otherwise route the column through hf_transform_to_torch). One-line comment would future-proof against a "simplification" PR.

Summary

Loader code is correct, disk regression is fixed, and the empirical investigation that drove the pivot is genuinely good engineering. RAM trade-off is the one piece that deserves a sharper presentation in the PR description before un-drafting — current readers will see "RAM-bound" and not realize the worst case is the size of the dataset they were trying to load.

CI: pre-commit + auto-review green; CPU Tests still in-progress on 4308796.


Generated by Claude Code

…taset

Address review feedback on #301 (re-review of 4308796).

1. `load_hf_dataset` comment now spells out the RAM-cost consequence with
   concrete numbers (~350 MB for libero/10ep, ~46 GB for
   humanoid-everyday-A-overlay full-corpus) so a future reader hitting
   OOM on a small dev box knows exactly why.

2. Add a `# datasets>=2.19.0` note on the `Dataset(table,
   info=DatasetInfo(features=features))` constructor — the project pin is
   2.19.0 and that signature has been stable since 2.x.

3. One-line note that the `re.sub` glob derivation assumes simple
   `{name}` placeholders, not literal `{{`/`}}` escapes — true for every
   in-repo writer.

4. Inline note in `_assert_episode_row_alignment` explaining why it
   reaches into `.data.table.column(...)` instead of `hf_dataset[col]`
   (the `set_transform=hf_transform_to_torch` would route through torch).
@shuheng-liu
Copy link
Copy Markdown
Member Author

Addressed in ccc62ef. Summary of changes:

RAM measurement added

Measured peak RSS during LeRobotDataset(cfg, "physical-intelligence/libero", episodes=[0..9]) on the dev box:

[BEFORE LeRobotDataset] peak RSS: 799.0 MB
[AFTER  LeRobotDataset] peak RSS: 3503.1 MB  (len=2758 rows, 10 episodes)
[pa.Table resident]    nbytes:    347.0 MB (2758 rows)

So:

  • pa.Table nbytes = 347 MB — the persistent data cost; matches the source parquet (347 MB) almost exactly, confirming the linear filtered_rows × avg_row_size scaling.
  • Peak RSS Δ = +2.7 GB — mostly transient: imports (torchvision/torch/transformers etc.), pyarrow scratch during to_table, and dataset-stats materialization. Decays after construction.

Both numbers are now in the PR description in a table parallel to the du -sh one. The "principled" projection rule for big repos (pa.Table.nbytes ≈ filtered_rows × avg_row_size, so humanoid-everyday-A-overlay episodes=None would need ~46 GB resident) is in the description too.

Code-comment updates (lerobot_dataset.py:1571–1593)

The inline comment in load_hf_dataset now spells out the consequence with the same concrete numbers, so a future reader hitting OOM on a small dev box can see "if I'm constructing a full-corpus LeRobotDataset on a small dev box, that's why I OOM'd."

# Trade-off: `to_table(filter=...)` materializes the filtered rows into
# RAM rather than mmapping a disk-backed Arrow cache. RAM cost scales
# with `len(filtered rows) × avg-row-size`; concretely:
# ~350 MB for physical-intelligence/libero with episodes=[0..9],
# ~46 GB for humanoid-everyday-A-overlay with episodes=None (full corpus).
# Narrow `episodes=` picks are fine; an episodes=None load on a multi-GB
# image-heavy repo will OOM on a small dev box — pass a manageable
# subset, or restore a mmap'd Arrow cache via tmp pa.ipc files if RAM
# ever becomes the binding constraint.

Wording fixes

  • PR description: "comfortably under VRAM" → "comfortably under RAM" (VRAM/RAM mix-up).
  • Trade-off section rewritten to lead with the principled rule and what it implies for episodes=None callers in the codebase (factory.make_dataset, scripts/visualize_dataset.py, scripts/fit_fast_tokenizer.py, v21/convert_stats.py) instead of just saying "RAM-bound".

Minor

  • Added # datasets>=2.19.0 note on the Dataset(table, info=DatasetInfo(features=features)) constructor — project pin is datasets>=2.19.0 (from pyproject.toml:46), and this signature has been stable since datasets 2.x.
  • Added a one-line note in tests/datasets/test_datasets.py:_assert_episode_row_alignment explaining the .data.table.column(...) access (bypasses set_transform=hf_transform_to_torch), so a future "simplification" doesn't break the invariant check.
  • Noted in the glob-derivation comment that the regex assumes simple {name} placeholders, no literal {{/}} escapes — true for every in-repo template writer.

The full smoke-config seeded determinism check (per CLAUDE.md rule #3) is still queueable on the dev box if you want it before un-drafting; the loader-only check covers the bug class touched by this change but rule #3 is broader.

Copy link
Copy Markdown
Member Author

Re-review of ccc62ef

Docs-only commit on top of 4308796 — no logic changes. Every point from the last review is addressed:

  • RAM measurement table in the PR description with the dev-box numbers (peak RSS 799 → 3503 MB; pa.Table.nbytes 347 MB matching the 347 MB source parquet). The principled pa.Table.nbytes ≈ filtered_rows × avg_row_size rule is stated, and the 46 GB projection for humanoid-everyday-A-overlay episodes=None is explicit.
  • Inline comment at lerobot_dataset.py:1581-1589 now spells out the concrete numbers and the consequence ("will OOM on a small dev box — pass a manageable subset, or restore a mmap'd Arrow cache via tmp pa.ipc files"). Future reader hitting OOM on a small dev box will see exactly why.
  • datasets>=2.19.0 floor noted at line 1591-1593 next to the Dataset(table, info=DatasetInfo(features=features)) constructor — and I checked, that's the actual project pin.
  • Test comment at _assert_episode_row_alignment:82-85 now reads Don't "simplify" this without removing the transform. Good hedge against the next "cleanup" PR.
  • Glob-derivation comment clarified (Assumes the template uses simple {name} / {name:fmt} placeholders and no literal {{ / }} escapes — true for every in-repo writer).
  • VRAM/RAM wording fix in the PR description.

The "comfortably under RAM" line in the trade-off section now lands accurately, the table makes the disk-vs-RAM trade explicit, and the in-codebase episodes=None callers (factory.make_dataset, scripts/visualize_dataset.py, scripts/fit_fast_tokenizer.py, v21/convert_stats.py) are called out so anyone running them on a big repo knows to pass an episodes= subset.

Approving from a code-review standpoint

Everything I'd want addressed is addressed. The remaining gate is your own commitment in the PR description:

  • Full smoke-config seeded determinism check (CLAUDE.md rule Fixing reward normalizer #3) before un-drafting. The loader-only 6-row hash check covers the bug class this PR touches, but rule Fixing reward normalizer #3 is the broader belt-and-suspenders. Reasonable to run a configs/dev/dev_config.json smoke twice with the same seed and confirm bit-identical per-step loss before flipping out of draft.

CI

ccc62ef: pre-commit ✅, review ✅, check-checklist ✅, CPU Tests one passed + one in-progress. The completed CPU Tests run is green; the in-progress one is the duplicate triggered by the comment-edit, expect it to land identically.

LGTM modulo the smoke-config check.


Generated by Claude Code

@shuheng-liu
Copy link
Copy Markdown
Member Author

Smoke-config seeded determinism check done. Bit-identical per-step loss across two independent runs of configs/examples/pi05_training_config.json (40 steps, batch=2, physical-intelligence/libero episodes=[0..9], seed=1000 from the config, wandb/save/val disabled for the smoke).

$ diff <(grep -oE 'step:[0-9]+\([0-9]+\) smpl:[0-9]+\([0-9]+\) total_loss:[0-9.]+ mse_loss:[0-9.]+ ce_loss:[0-9.]+ lr:[0-9.e-]+ grad_norm:[0-9.]+' run_a.log) \
       <(grep -oE 'step:[0-9]+\([0-9]+\) smpl:[0-9]+\([0-9]+\) total_loss:[0-9.]+ mse_loss:[0-9.]+ ce_loss:[0-9.]+ lr:[0-9.e-]+ grad_norm:[0-9.]+' run_b.log)
(no diff)

All 8 logged steps match across total_loss, mse_loss, ce_loss, lr, grad_norm:

step total_loss mse_loss ce_loss lr grad_norm
5 5.355713 0.362435 4.993278 1.0e-07 2.655
10 5.347048 0.284519 5.062529 2.2e-07 2.595
15 5.100674 0.309432 4.791242 3.5e-07 2.246
20 5.174705 0.299174 4.875531 4.7e-07 2.473
25 5.366552 0.370842 4.995710 6.0e-07 4.065
30 4.916766 0.304207 4.612558 7.2e-07 2.605
35 5.366232 0.294471 5.071761 8.5e-07 1.881
40 5.496222 0.423167 5.073055 9.7e-07 4.058

CLAUDE.md rule #3 satisfied. The loader-only 6-row hash check and now the full smoke-config seeded run both pass. Ready to flip out of draft from my end whenever you say go.

@shuheng-liu shuheng-liu merged commit 27a2197 into main May 13, 2026
7 checks passed
@shuheng-liu shuheng-liu deleted the claude/bold-antonelli-665e34 branch May 13, 2026 21:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

load_hf_dataset() rewrites parquet to Arrow on disk, duplicating the source corpus — sync with upstream load_nested_dataset()

1 participant