fix(datasets): memory-map parquet instead of rewriting to Arrow by shuheng-liu · Pull Request #301 · TensorAuto/OpenTau

shuheng-liu · 2026-05-13T17:29:20Z

What this does

Fixes #277.

LeRobotDataset.load_hf_dataset() previously called datasets.load_dataset("parquet", ...), which routes through ParquetDatasetBuilder and materializes the source parquet into an uncompressed Arrow cache under $HF_HOME/datasets/parquet/default-<hash>/.... The Arrow rewrite is 1–5× the source parquet size (compression-dependent: ~1× for vector-only data, ~4–5× for image-heavy parquets per #277), and because the cache key includes data_files, every distinct episodes= pick produces a new cache entry. On a populated $HF_HOME this can roughly double on-disk corpus size.

Empirical investigation on a real OpenTau-compatible dataset (physical-intelligence/libero, 10 episodes, 347 MB source parquet) — see follow-up comments for the scripts and numbers:

Approach	$HF_HOME/datasets/parquet/ after 3 calls	Cache entries
Old: `load_dataset("parquet", data_files=picked)`	347 MB	2
`Dataset.from_parquet([all_paths], filters=...)` (HF LeRobot's pattern)	347 MB	2
`pa_ds.dataset(paths).to_table(filter=...)` + `Dataset(table)` (this PR)	0 MB	0

The upstream LeRobot fix (huggingface/lerobot#2982) advertises Dataset.from_parquet as memory-mapped, but in datasets==4.5.0 it still routes through ParquetDatasetReader → ParquetDatasetBuilder and writes the same Arrow cache. The only working fix is to bypass HF Datasets' parquet builder entirely and read via pyarrow.dataset directly.

Implementation

src/opentau/datasets/lerobot_dataset.py:

Imports: drop load_dataset, add Dataset, DatasetInfo from datasets, pyarrow.dataset as pa_ds, and import re.
load_hf_dataset() rewritten: derive the parquet glob from self.meta.data_path (so non-default info["data_path"] layouts keep working), build a pa_ds.dataset(paths) lazy view, apply pa_ds.field("episode_index").isin(self.episodes) via to_table(filter=...), and wrap the resulting pa.Table in Dataset(table, info=DatasetInfo(features=features)).
Sort self.episodes at the __init__ boundary (line 1264). The new loader returns rows in sorted (chunk, episode_index) order regardless of the caller's episodes= order; get_episode_data_index builds episode_data_index["from"/"to"] in self.episodes-list order. Without the sort, an unsorted user list (e.g. episodes=[5, 2, 8]) would silently mis-align rows — episode_data_index["from"][epi2idx[5]] = 0 but row 0 would be episode 2. Sorting makes the assumption explicit and matches what every other in-repo caller already does.

Trade-off: RAM cost scales with filtered rows

pa_ds.to_table(filter=...) materializes the filtered rows into RAM rather than mmapping a disk-backed Arrow cache. The old load_dataset("parquet", ...) wrote an uncompressed Arrow cache to disk and mmap'd it — the kernel paged in only what __getitem__ touched, so resident memory stayed near the working set even on multi-GB corpora. With this PR, the entire filtered pa.Table is in process memory at the end of load_hf_dataset.

Measured on physical-intelligence/libero, episodes=[0..9]:

Phase	Peak RSS	pa.Table nbytes
Before `LeRobotDataset(...)`	799 MB	—
After `LeRobotDataset(...)`	3503 MB	347 MB

The persistent data cost is bounded by the pa.Table (347 MB = source parquet size). The extra ~2.7 GB peak RSS is import / pyarrow scratch / dataset-stats transients and decays after construction. The principled scaling rule: pa.Table.nbytes ≈ filtered_rows × avg_row_size — so an episodes=None load on humanoid-everyday-A-overlay (46 GB parquet, image-heavy) would need ~46 GB resident before training starts.

Practical implication: narrow episodes= picks are fine; a default episodes=None load on a multi-GB image-heavy repo can OOM. factory.make_dataset(...), scripts/visualize_dataset.py, scripts/fit_fast_tokenizer.py, and v21/convert_stats.py all use episodes=None and may need an explicit subset on big repos. Given #277 is specifically about disk-doubling on shared $HF_HOME, this is the trade-off worth making; if a real OOM shows up, the escape hatch is to write the filtered table to a tmp pa.ipc file and mmap it back.

The trade-off is also called out inline at the load_hf_dataset definition so a future reader hitting OOM on a small dev box can see why.

How it was tested

pre-commit run --files src/opentau/datasets/lerobot_dataset.py tests/datasets/test_datasets.py — clean.
pytest -m "not gpu" -n auto tests/datasets/ — 392 passed, 7 skipped. Includes 3 new regression tests in tests/datasets/test_datasets.py:
- test_dataset_unsorted_episodes_row_alignment — episodes=[6, 2, 5], asserts dataset.episodes is sorted and every row in hf_dataset[from[idx]:to[idx]] carries the expected episode_index. Fails without the sort.
- test_dataset_sparse_episodes_row_alignment — sparse episodes=[3, 7] out of 10.
- test_dataset_no_episodes_loads_all — episodes=None default path.
pytest -m "not gpu" -n auto (full CPU suite) — 1153 passed. 3 pre-existing failures unrelated to this change (HF Hub 429, missing libero assets, xdist flake).
Disk-cost verification on an internal GPU dev box with real physical-intelligence/libero (10 episodes) via the full LeRobotDataset(cfg, ...) constructor: zero growth in $HF_HOME/datasets/parquet/ across two distinct episodes= picks. Numbers in follow-up comments.
RAM measurement on the same dev box: peak RSS 799 MB → 3503 MB; pa.Table nbytes 347 MB (matches source parquet, confirming the linear scaling).
Loader-determinism check: hashed rows at idx 0, 1, len/4, len/2, 3·len/4, len-1 across two LeRobotDataset(...) constructions with the same seed — all 6 hashes bit-identical. The change is loader-only and doesn't touch the training-loop primitives listed in CLAUDE.md rule Fixing reward normalizer #3, so a loader-determinism check is the right scope. Full smoke-config seeded determinism can be added before un-drafting if you want belt-and-suspenders.

How to checkout & try? (for the reviewer)

git checkout claude/bold-antonelli-665e34
uv sync --extra dev --extra libero
pytest -m "not gpu" -n auto tests/datasets/

Disk-cost win on a host with a real dataset cached:

du -sh "$HF_HOME/datasets/parquet/" 2>/dev/null || echo "empty"
# (then run any LeRobotDataset load)
du -sh "$HF_HOME/datasets/parquet/"

The second du should match the first (zero growth).

Checklist

I have added Google-style docstrings to important functions and ensured function parameters are typed.
My PR includes policy-related changes.
- If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

`LeRobotDataset.load_hf_dataset()` used `datasets.load_dataset("parquet", ...)`, which routes through `ParquetDatasetBuilder` and materializes the source parquet as an uncompressed Arrow cache under `$HF_HOME/datasets/parquet/` — 4-5x the size of the source. Every distinct `episodes=` pick produced a new cache entry, so on a populated `$HF_HOME` this can roughly double on-disk corpus size. Port HF LeRobot's upstream fix (huggingface/lerobot#2982): use `Dataset.from_parquet([all_paths], filters=pa_ds.field("episode_index").isin(self.episodes), features=...)`. PyArrow memory-maps the parquet directly (no Arrow rewrite) and predicate pushdown on `episode_index` filters rows, so the file list is always the full set under `data/` — producing one stable cache key instead of one per `(repo, episodes)` pick. Fixes #277.

shuheng-liu · 2026-05-13T17:36:03Z

Review

Thanks for the writeup — the storage win is real and the upstream port is tight (9/9, single file). One concern below is load-bearing, the rest are minor.

Overview

Replaces datasets.load_dataset("parquet", data_files=...) (which materializes a 4–5× Arrow rewrite into $HF_HOME/datasets/parquet/) with Dataset.from_parquet([all_paths], filters=pa_ds.field("episode_index").isin(self.episodes), features=...), which memory-maps parquet directly and uses a single cache key across all episodes= picks. Mirrors huggingface/lerobot#2982.

Blocking: silent row-order regression on unsorted `episodes`

This is more serious than the PR description frames it ("worth a look").

episode_data_index["from"/"to"] is built in self.episodes-list order (utils.py:840–845 — episode_lengths = [episode_lengths[ep_idx] for ep_idx in episodes], then accumulate). Downstream code in _get_query_indices_soft (line 1623), __getitem__ (line 1943), _query_videos (line 1896), and the image-cam path at line 1913 (self.hf_dataset[ep_start + subgoal_frame]) all use episode_data_index["from"][epi2idx[ep]] as an absolute row offset into hf_dataset. That only works if hf_dataset rows are laid out in self.episodes-list order.

Old loader: data_files=[paths for self.episodes] → rows in self.episodes order.
New loader: sorted(data_dir.glob("*/*.parquet")) + filter → rows in sorted-by-(chunk, episode_index) order, regardless of self.episodes order.

Failure case: caller passes episodes=[5, 2, 8]. epi2idx = {5:0, 2:1, 8:2}, ep_start[0] = 0, but row 0 is now episode 2, not episode 5. Result: padding masks, delta-time clipping, and subgoal frame indexing silently read from the wrong episode. No assertion fires; loss curves still look plausible — exactly the class of stochasticity bug rule #3 in CLAUDE.md is meant to catch.

Today's in-tree call sites (tests/fixtures/*.py, tests/datasets/test_dataset_mixture.py:355, scripts/value_visualizer_app.py:125) all pass sorted/singleton lists, and value_visualizer_app.py:61 even pre-sorts. So this doesn't break current callers — but LeRobotDataset.__init__ doesn't enforce sorted order, the docstring example at line 1090 shows a sorted list without saying it must be sorted, and a future caller passing [5, 2, 8] would silently corrupt training.

Two ways to fix:

Sort at the boundary — at lerobot_dataset.py:1264, replace self.episodes = episodes with self.episodes = sorted(episodes) if episodes is not None else None. One line, makes the row-order assumption explicit, and aligns with what every other script in the repo already does (scripts/attach_metadata.py:507, annotate_mistakes.py:455, etc.).
Pass file paths in self.episodes order — keep the explicit path-list construction ([str(self.root / self.meta.get_data_file_path(ep)) for ep in self.episodes]) and rely on Dataset.from_parquet's file-order semantics. Loses the "one stable cache key" benefit for the episodes is not None path but preserves exact old behavior.

(1) is what I'd recommend. Either way please add a tests/datasets/ case with deliberately unsorted episodes that asserts __getitem__ returns the correct episode.

Until the determinism check from CLAUDE.md rule #3 is run on a smoke config (twice, same seed, bit-identical per-step loss), I wouldn't be comfortable promoting this out of draft even with sorted callers — hf_dataset row layout is exactly what would surface there.

Non-blocking notes

Glob assumes default data_path. data_dir.glob("*/*.parquet") hardcodes the data/chunk-XXX/episode_YYYYYY.parquet shape. MetaData.data_path is configurable via info["data_path"] (line 473–475), and get_data_file_path uses that template. Every in-repo writer uses DEFAULT_PARQUET_PATH so this is fine in practice, but a dataset with a custom info["data_path"] (flat layout, deeper nesting, different glob) would either miss files or pick up unrelated parquets. Worth a one-line comment noting the assumption, or — cleaner — derive the glob from Path(self.meta.data_path).parts.
Predicate pushdown for narrow episodes= picks. PyArrow's pushdown on isin(episode_index) is only effective if the parquet row-group statistics include episode_index min/max. LeRobot writes one episode per parquet, so the per-file min/max should be tight and pushdown should skip the unselected files cheaply — but worth confirming with a quick du -sh and wall-clock timing of LeRobotDataset(..., episodes=[N]) on a populated repo before and after. The PR currently has the cache-size measurement listed as not-yet-run; that one can ride along.
Missing-episode error path. PR notes the old branch errored and the new branch returns zero rows. The assert all((self.root / fpath).is_file() ...) at line 1364 is caught by except (AssertionError, FileNotFoundError, NotADirectoryError) at line 1366 and triggers download_episodes() — it's not actually fail-fast, it's a re-fetch trigger. So the regression is "a corrupted/missing post-download parquet now yields zero rows instead of raising"; probably fine but the docstring's claim that __init__ guards this is slightly off.
Style nit. [str(p) for p in paths] materializes a second list right after sorted(...). Minor, but list(map(str, paths)) or just [str(p) for p in sorted(data_dir.glob("*/*.parquet"))] reads cleaner if you want one expression.
Test coverage. tests/datasets/test_obs_history.py and tests/datasets/test_dataset_mixture.py only cover episodes=[0] and episodes=[0, 1] — both sorted, both starting at 0. The new from_parquet([all_paths], filters=...) pattern's most interesting behaviors (sparse episodes, unsorted episodes, all-episodes-default with data_dir.glob) aren't exercised. Add at least: unsorted-episodes case (the one above), sparse-episodes case (episodes=[3, 7] against a 10-episode fixture), and a no-episodes case that asserts the same rows as before.

Summary

Direction is right and the win is real. The unsorted-episodes row-ordering issue needs a fix (sort self.episodes at the boundary is the smallest change) plus a regression test before this is safe to merge, and the seeded determinism check on a smoke config should be run before un-drafting per CLAUDE.md rule #3.

Generated by Claude Code

Address review feedback on #301. 1. Sort `self.episodes` at the __init__ boundary. The new `Dataset.from_parquet([sorted_paths], filters=...)` loader returns rows in sorted (chunk, episode_index) order regardless of the order the user passed in `episodes=`. `get_episode_data_index` builds `episode_data_index["from"/"to"]` in `self.episodes`-list order, so an unsorted user list (e.g. `episodes=[5, 2, 8]`) would silently mis-align rows — `episode_data_index["from"][epi2idx[5]] = 0` but row 0 is now episode 2. Sorting at the boundary makes the row-order assumption explicit and matches what every other in-repo caller already does. 2. Derive the parquet glob from `self.meta.data_path` instead of hardcoding `data/*/*.parquet`. Datasets with a non-default `info["data_path"]` (deeper nesting, flat layout, etc.) now keep working. 3. Add regression tests under `tests/datasets/test_datasets.py`: - `test_dataset_unsorted_episodes_row_alignment` — the canonical case above (`episodes=[6, 2, 5]`). - `test_dataset_sparse_episodes_row_alignment` — sparse non-contiguous filter (`episodes=[3, 7]` out of 10). - `test_dataset_no_episodes_loads_all` — `episodes=None` default path. All three assert that for every `ep in dataset.episodes`, every row in `hf_dataset[from[idx]:to[idx]]` carries `episode_index == ep`. The `[6, 2, 5]` case fails without the sort.

shuheng-liu · 2026-05-13T17:42:36Z

Thanks for the careful review. Addressed in 8584c66.

Blocking — row-order regression

Fixed by sorting self.episodes at the __init__ boundary (lerobot_dataset.py:1264). That makes the row-order assumption explicit and matches what every in-repo caller already does. The two-line change plus a docstring-style comment explaining why (so a future contributor doesn't undo it).

Regression test added: test_dataset_unsorted_episodes_row_alignment constructs a dataset with episodes=[6, 2, 5] and asserts that for every ep in dataset.episodes, every row in hf_dataset[from[idx]:to[idx]] carries episode_index == ep. The test fails without the sort. Also covered the two adjacent cases you asked for:

test_dataset_sparse_episodes_row_alignment — episodes=[3, 7] out of 10, with an extra assertion that the filter actually dropped the unselected episodes (set(ep_idx_col) == {3, 7}).
test_dataset_no_episodes_loads_all — episodes=None default path, asserting dataset.episodes == [0, 1, 2, 3] plus the row-alignment invariant.

All 392 dataset tests pass (389 + 3 new).

Non-blocking notes

Glob hardcoding — fixed by deriving the glob from self.meta.data_path via re.sub(r"\{[^}]+\}", "*", ...). Default template yields data/chunk-*/episode_*.parquet; custom layouts now keep working without code changes.
Style nit — took it: list(map(str, paths)) instead of the listcomp.
Missing-episode error path — you're right, the assert is caught and triggers download_episodes(), not fail-fast. I'll fix the language in the PR description before un-drafting (and in load_hf_dataset() rewrites parquet to Arrow on disk, duplicating the source corpus — sync with upstream load_nested_dataset() #277's expected-behavior framing).
Predicate pushdown timing — agreed, will roll the wall-clock measurement in with the du -sh check on a host with humanoid-everyday-{A,B}-overlay cached. Reporting back in a follow-up comment before un-drafting.
Seeded determinism check — agreed, won't promote out of draft until CLAUDE.md rule Fixing reward normalizer #3 passes. The new row-alignment invariant tests catch the specific bug class you flagged (silent row mis-indexing), but the determinism check is the right belt-and-suspenders.

shuheng-liu · 2026-05-13T17:44:53Z

Re-review of `8584c66`

The fixes look right. Quick pass over the second commit:

What's resolved

Row-order fix at lerobot_dataset.py:1264 — self.episodes = sorted(episodes) if episodes is not None else None, with a five-line comment explaining why future contributors shouldn't undo it. Correct, minimal, and aligns with the in-repo idiom (attach_metadata.py:507, annotate_mistakes.py:455, etc.). The downstream consumers I flagged (_get_query_indices_soft, __getitem__, _query_videos, the image-cam path in _query_subgoal_frames) all just consume self.episodes via epi2idx / episode_data_index so they're invariant to the sort. aggregate_stats at line 1354 is order-independent (it sums/min/max across episodes), so no change there either. The second get_episode_data_index call inside add_episode (line 2273) operates on self.episodes after __init__ has sorted it, so it stays consistent.
Glob now derived from self.meta.data_path — re.sub(r"\{[^}]+\}", "*", ...) turns the default data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet into data/chunk-*/episode_*.parquet, and self.root.glob(...) walks from the dataset root instead of a hardcoded data/ subdir. Good — non-default info["data_path"] layouts no longer need a code change.
Three new tests in test_datasets.py — _assert_episode_row_alignment is a reusable invariant check, and the three call sites ([6, 2, 5], [3, 7], None) cover the failure case, the sparse-filter case, and the default path. The unsorted case also pins down the contract with the explicit dataset.episodes == [2, 5, 6] assertion, which is the right hedge against someone "fixing" the boundary sort later.

Minor leftovers (non-blocking)

Regex doesn't handle {{ / }} escapes. Python's str.format lets a literal { escape as {{; the current re.sub(r"\{[^}]+\}", "*", ...) will match {{x}} greedily as {...} and replace with *, eating the intended literal brace. No in-repo template uses this so it's purely theoretical, but if you want a one-character hedge: r"(?<!\{)\{[^{}]+\}(?!\})" is the conservative form. Equally fine to just note "templates must not contain literal braces" in the comment.
Tests reach into dataset.hf_dataset.data.table.column(...). Works today, but it's bypassing the public API. Once set_transform(hf_transform_to_torch) is set, hf_dataset["episode_index"] materializes the whole column through torch which is wrong here, so the Arrow-table escape hatch is justified — just worth a tiny inline comment in _assert_episode_row_alignment saying why it's reaching into .data.table (so the next person doesn't "simplify" it).

Still gating un-draft

Disk-cost du -sh measurement on a host with humanoid-everyday-{A,B}-overlay cached — the actual win the PR is claiming.
Wall-clock timing for narrow episodes=[N] on a populated repo — confirming PyArrow's pushdown via per-file row-group metadata actually skips the unselected parquets cheaply. With 1000 episodes and episodes=[0], the new loader hands all 1000 paths to Dataset.from_parquet; the old loader handed one. Pushdown should make this fine because LeRobot writes one episode per file and the per-file min/max for episode_index is tight, but worth a quick before/after on a populated cache before merging — if the per-file metadata read dominates, the init-time regression on narrow picks could be noticeable.
Seeded determinism check on a smoke config (CLAUDE.md rule Fixing reward normalizer #3) — the new alignment-invariant tests catch the specific bug class I flagged, but rule Fixing reward normalizer #3 is the broader belt-and-suspenders against any subtler row-layout change. Per the PR description, you're holding the un-draft on this anyway.

Summary

LGTM modulo running the three measurements above before un-drafting. The code change is now correct and well-tested for the bug class I was worried about.

Generated by Claude Code

Empirical test on physical-intelligence/libero (10 episodes, 347 MB source parquet) shows that swapping `load_dataset("parquet", ...)` for `Dataset.from_parquet([paths], filters=...)` does NOT reduce disk usage — both route through `ParquetDatasetReader` / `ParquetDatasetBuilder` and write the same 347 MB Arrow cache entry to $HF_HOME/datasets/parquet/, with the same hash. The upstream LeRobot comment "memory-mapped loading for efficiency" is wrong for datasets==4.5.0. Switch to reading the parquet via `pyarrow.dataset.dataset(...)` directly, applying the episode filter at the pyarrow level via `to_table(filter=...)`, and wrapping the resulting pa.Table in a HF `Dataset(table, info=DatasetInfo(features=features))`. Verified on mlbox: zero growth in $HF_HOME/datasets/parquet/ across multiple loads. Trade-off: filtered rows materialize into RAM rather than being mmapped from an Arrow cache file. For training subsets (10s of episodes) this is fine; full-corpus loads on multi-GB repos will now be RAM-bound. The disk-doubling issue (#277) is the bigger concern for shared $HF_HOME setups, so this is the right trade-off. Row-alignment / sort-episodes fixes from the previous commit are preserved; the regression tests still pass against the new loader.

shuheng-liu · 2026-05-13T18:25:42Z

Empirical verification report (follow-up to the earlier code-review reply).

Critical finding: upstream LeRobot's `Dataset.from_parquet` pattern doesn't actually fix the disk cost

The first version of this PR (8584c66) ported HF LeRobot's pattern verbatim — Dataset.from_parquet([all_paths], filters=...) instead of load_dataset("parquet", data_files=...). The upstream comment (io_utils.py:78-81) advertises this as "memory-mapped loading for efficiency". On a real OpenTau-compatible dataset (physical-intelligence/libero, 10 episodes, 347 MB source parquet, datasets==4.5.0):

Approach	Disk after 3 calls (pick 0..4, pick 5..9, pick 0..4 again)	Cache entries
Old: `load_dataset("parquet", data_files=picked)`	347 MB	2
`Dataset.from_parquet([all_paths], filters=...)`	347 MB	2
`pa_ds.dataset(paths).to_table(filter=...)` + `Dataset(table)`	0 MB	0

Same hash sequence on the cache entries, same 347 MB total. Dataset.from_parquet routes through ParquetDatasetReader.read() → ParquetDatasetBuilder, which writes the Arrow cache exactly like load_dataset("parquet", ...). The upstream "memory-mapped" claim is wrong for current datasets versions.

Pivot: bypass HF Datasets' parquet builder entirely

4308796 switches the loader to pa_ds.dataset(paths).to_table(filter=...) + Dataset(table, info=DatasetInfo(features=features)). End-to-end LeRobotDataset load (full LeRobotDataset(cfg, repo_id, episodes=[0..4]) constructor — not just the from_parquet primitive) on the dev box:

[INITIAL (wiped)]            0.0 MB / 0 entries
[after pick episodes=0..4]   0.0 MB / 0 entries
[after pick episodes=5..9]   0.0 MB / 0 entries
DELTA: +0.000 MB, +0 entries

Zero growth, confirmed against the same dataset where the previous approach grew 347 MB.

Trade-off

pa_ds.dataset(paths).to_table(filter=...) materializes the filtered rows into RAM rather than mmapping an Arrow cache file. For typical training subsets this is comfortably bounded. Full-corpus loads on multi-GB image-heavy repos will now be RAM-bound — that's a regression vs. the old mmap-an-Arrow-cache behavior, but #277 is specifically about disk-doubling on shared $HF_HOME setups, so this is the trade-off worth making.

Loader-determinism check

Built the dataset twice in the same process with set_seed(42); hashed rows at idx 0, 1, len/4, len/2, 3·len/4, len-1. All 6 hashes bit-identical:

=== Pass 1 ===
  idx=    0: 2e8e6218ba0017a1
  idx=    1: 74861cb1bcf7f031
  idx=  689: 0624f53aeddcc803
  idx= 1379: fcaf3ede9cb6fa16
  idx= 2068: bd22c1ab3c3b5d9f
  idx= 2757: 107401e09d4c6a81
=== Pass 2 (same seed) ===
  [identical to pass 1]
PASS: all 6 row hashes bit-identical across passes.

The change is loader-only and doesn't touch any of the training-loop primitives listed in CLAUDE.md rule #3 (scripts/train.py, policies/*/modeling_*.py, optim/, datasets/sampler.py), so a loader-level determinism check is the right scope. The full smoke-config seeded check is straightforward to add before un-drafting if you want belt-and-suspenders.

Memory note

Saving a feedback memory locally so I don't fall for the upstream "memory-mapped" claim again: any future PR touching from_parquet needs an empirical du -sh $HF_HOME/datasets/parquet/ before/after to verify the cache actually doesn't grow.

shuheng-liu · 2026-05-13T18:33:24Z

Re-review of `4308796` (the pivot)

The "upstream HF LeRobot's Dataset.from_parquet doesn't actually fix this" finding is a really good catch — and the empirical table (load_dataset: 347 MB / Dataset.from_parquet: 347 MB / pa_ds.to_table + Dataset(table): 0 MB) is exactly the kind of measurement that earned the pivot. Glad you ran it before promoting.

What's resolved

Loader correctness — pa_ds.dataset(paths, format="parquet").to_table(filter=filter_expr) + Dataset(table, info=DatasetInfo(features=features)) skips ParquetDatasetBuilder entirely; verified zero growth in $HF_HOME/datasets/parquet/.
Sort + glob-from-template fixes from 8584c66 carry forward unchanged.
Loader-determinism check — 6 row hashes bit-identical across two seeded builds. Agreed this is the right scope: the change is loader-only and doesn't touch any of the training-loop primitives listed in CLAUDE.md rule Fixing reward normalizer #3 (scripts/train.py, policies/*/modeling_*.py, optim/, datasets/sampler.py).
Comment in load_hf_dataset is thorough and calls out the trade-off in place.

One concern worth raising explicitly: RAM regression for full-corpus loads

pa_ds.to_table(filter=...) materializes the filtered table in RAM. For narrow episodes= picks that's bounded. For episodes=None on a populated repo, that's the entire corpus in memory.

Old path: load_dataset("parquet", data_dir=...) wrote an uncompressed Arrow cache to disk and mmap'd it — kernel paged in only what __getitem__ touched, so resident memory stayed near the working set even on multi-GB corpora.

New path: the whole filtered pa.Table is in process memory at the end of load_hf_dataset. For #277's humanoid-everyday-A-overlay (46 GB parquet → ~46 GB uncompressed table assuming the parquet is image-heavy with little column-store wins), an episodes=None load would now need ~46 GB resident before training even starts.

episodes=None is the default for:

factory.py:218 — make_dataset(...) constructs the main training dataset; cfg.episodes is None by default in DatasetConfig (configs/default.py:111).
v21/convert_stats.py:95 — explicitly asserts episodes is None for stats conversion.
scripts/fit_fast_tokenizer.py:436 — tokenizer fitting.
scripts/visualize_dataset.py:601 — visualizer.

The pre-existing pattern in WeightedDatasetMixture already constructs the per-dataset LeRobotDataset instances at startup, so the mixture RAM cost is the sum of all sub-corpus sizes. With this change, that's a real wall.

Probably acceptable given that #277 is specifically about shared $HF_HOME disk-doubling, and the production fix is "specify a manageable episodes= subset for big repos" — but the trade-off section of the PR description currently says "comfortably under VRAM" (should be RAM) and "full-corpus loads on multi-GB repos will be RAM-bound", which understates how much it costs. Two concrete asks:

Add an actual RAM measurement to the PR description, parallel to the du -sh table. Same script you used for the disk check, but with resource.getrusage(resource.RUSAGE_SELF).ru_maxrss before/after LeRobotDataset(...). On physical-intelligence/libero it should show ~347 MB after construction vs. a small constant before. That makes the trade-off concrete instead of "RAM-bound".
Note in the comment at line 1571–1580 what the consequence is, not just the mechanism. Something like # RAM cost scales with len(filtered rows) × avg-row-size; ~350 MB for libero/10ep, ~46 GB for humanoid-everyday-A-overlay full-corpus. so a future reader knows "if I'm constructing a full-corpus LeRobotDataset on a small dev box, that's why I OOM'd."

Alternative architecture if RAM does turn out to be a problem: write the filtered pa.Table to a tmp Arrow IPC file (pa.ipc.new_file(...)) and mmap it back via pa.memory_map(...) → pa.ipc.open_file(...).read_all(). Restores the old mmap behavior but with a smaller cache (only filtered rows) and a tmp lifetime instead of $HF_HOME pollution. Not asking you to do this here — just noting the escape hatch if a real OOM shows up.

Minor

Dataset(table, info=DatasetInfo(features=features)) is a less-common constructor pattern than factory methods. Public in datasets 4.x but the contract on info partial-fill (no split, no dataset_name, etc.) isn't pinned by tests. If datasets ever tightens that, the breakage will surface here first. Worth a # datasets>=X.Y comment if you happen to know the floor that supports this signature; otherwise file it as accepted risk.
Wording nit in the PR description trade-off section — "comfortably under VRAM" should be "RAM" (VRAM is GPU memory; the table lives in CPU memory).
_assert_episode_row_alignment still reaches into .data.table.column(...) without an inline note explaining why (the set_transform would otherwise route the column through hf_transform_to_torch). One-line comment would future-proof against a "simplification" PR.

Summary

Loader code is correct, disk regression is fixed, and the empirical investigation that drove the pivot is genuinely good engineering. RAM trade-off is the one piece that deserves a sharper presentation in the PR description before un-drafting — current readers will see "RAM-bound" and not realize the worst case is the size of the dataset they were trying to load.

CI: pre-commit + auto-review green; CPU Tests still in-progress on 4308796.

Generated by Claude Code

…taset Address review feedback on #301 (re-review of 4308796). 1. `load_hf_dataset` comment now spells out the RAM-cost consequence with concrete numbers (~350 MB for libero/10ep, ~46 GB for humanoid-everyday-A-overlay full-corpus) so a future reader hitting OOM on a small dev box knows exactly why. 2. Add a `# datasets>=2.19.0` note on the `Dataset(table, info=DatasetInfo(features=features))` constructor — the project pin is 2.19.0 and that signature has been stable since 2.x. 3. One-line note that the `re.sub` glob derivation assumes simple `{name}` placeholders, not literal `{{`/`}}` escapes — true for every in-repo writer. 4. Inline note in `_assert_episode_row_alignment` explaining why it reaches into `.data.table.column(...)` instead of `hf_dataset[col]` (the `set_transform=hf_transform_to_torch` would route through torch).

shuheng-liu · 2026-05-13T18:38:32Z

Addressed in ccc62ef. Summary of changes:

RAM measurement added

Measured peak RSS during LeRobotDataset(cfg, "physical-intelligence/libero", episodes=[0..9]) on the dev box:

[BEFORE LeRobotDataset] peak RSS: 799.0 MB
[AFTER  LeRobotDataset] peak RSS: 3503.1 MB  (len=2758 rows, 10 episodes)
[pa.Table resident]    nbytes:    347.0 MB (2758 rows)

So:

pa.Table nbytes = 347 MB — the persistent data cost; matches the source parquet (347 MB) almost exactly, confirming the linear filtered_rows × avg_row_size scaling.
Peak RSS Δ = +2.7 GB — mostly transient: imports (torchvision/torch/transformers etc.), pyarrow scratch during to_table, and dataset-stats materialization. Decays after construction.

Both numbers are now in the PR description in a table parallel to the du -sh one. The "principled" projection rule for big repos (pa.Table.nbytes ≈ filtered_rows × avg_row_size, so humanoid-everyday-A-overlay episodes=None would need ~46 GB resident) is in the description too.

Code-comment updates (lerobot_dataset.py:1571–1593)

The inline comment in load_hf_dataset now spells out the consequence with the same concrete numbers, so a future reader hitting OOM on a small dev box can see "if I'm constructing a full-corpus LeRobotDataset on a small dev box, that's why I OOM'd."

# Trade-off: `to_table(filter=...)` materializes the filtered rows into
# RAM rather than mmapping a disk-backed Arrow cache. RAM cost scales
# with `len(filtered rows) × avg-row-size`; concretely:
# ~350 MB for physical-intelligence/libero with episodes=[0..9],
# ~46 GB for humanoid-everyday-A-overlay with episodes=None (full corpus).
# Narrow `episodes=` picks are fine; an episodes=None load on a multi-GB
# image-heavy repo will OOM on a small dev box — pass a manageable
# subset, or restore a mmap'd Arrow cache via tmp pa.ipc files if RAM
# ever becomes the binding constraint.

Wording fixes

PR description: "comfortably under VRAM" → "comfortably under RAM" (VRAM/RAM mix-up).
Trade-off section rewritten to lead with the principled rule and what it implies for episodes=None callers in the codebase (factory.make_dataset, scripts/visualize_dataset.py, scripts/fit_fast_tokenizer.py, v21/convert_stats.py) instead of just saying "RAM-bound".

Minor

Added # datasets>=2.19.0 note on the Dataset(table, info=DatasetInfo(features=features)) constructor — project pin is datasets>=2.19.0 (from pyproject.toml:46), and this signature has been stable since datasets 2.x.
Added a one-line note in tests/datasets/test_datasets.py:_assert_episode_row_alignment explaining the .data.table.column(...) access (bypasses set_transform=hf_transform_to_torch), so a future "simplification" doesn't break the invariant check.
Noted in the glob-derivation comment that the regex assumes simple {name} placeholders, no literal {{/}} escapes — true for every in-repo template writer.

The full smoke-config seeded determinism check (per CLAUDE.md rule #3) is still queueable on the dev box if you want it before un-drafting; the loader-only check covers the bug class touched by this change but rule #3 is broader.

shuheng-liu · 2026-05-13T18:57:17Z

Re-review of `ccc62ef`

Docs-only commit on top of 4308796 — no logic changes. Every point from the last review is addressed:

RAM measurement table in the PR description with the dev-box numbers (peak RSS 799 → 3503 MB; pa.Table.nbytes 347 MB matching the 347 MB source parquet). The principled pa.Table.nbytes ≈ filtered_rows × avg_row_size rule is stated, and the 46 GB projection for humanoid-everyday-A-overlay episodes=None is explicit.
Inline comment at lerobot_dataset.py:1581-1589 now spells out the concrete numbers and the consequence ("will OOM on a small dev box — pass a manageable subset, or restore a mmap'd Arrow cache via tmp pa.ipc files"). Future reader hitting OOM on a small dev box will see exactly why.
datasets>=2.19.0 floor noted at line 1591-1593 next to the Dataset(table, info=DatasetInfo(features=features)) constructor — and I checked, that's the actual project pin.
Test comment at _assert_episode_row_alignment:82-85 now reads Don't "simplify" this without removing the transform. Good hedge against the next "cleanup" PR.
Glob-derivation comment clarified (Assumes the template uses simple {name} / {name:fmt} placeholders and no literal {{ / }} escapes — true for every in-repo writer).
VRAM/RAM wording fix in the PR description.

The "comfortably under RAM" line in the trade-off section now lands accurately, the table makes the disk-vs-RAM trade explicit, and the in-codebase episodes=None callers (factory.make_dataset, scripts/visualize_dataset.py, scripts/fit_fast_tokenizer.py, v21/convert_stats.py) are called out so anyone running them on a big repo knows to pass an episodes= subset.

Approving from a code-review standpoint

Everything I'd want addressed is addressed. The remaining gate is your own commitment in the PR description:

Full smoke-config seeded determinism check (CLAUDE.md rule Fixing reward normalizer #3) before un-drafting. The loader-only 6-row hash check covers the bug class this PR touches, but rule Fixing reward normalizer #3 is the broader belt-and-suspenders. Reasonable to run a configs/dev/dev_config.json smoke twice with the same seed and confirm bit-identical per-step loss before flipping out of draft.

CI

ccc62ef: pre-commit ✅, review ✅, check-checklist ✅, CPU Tests one passed + one in-progress. The completed CPU Tests run is green; the in-progress one is the duplicate triggered by the comment-edit, expect it to land identically.

LGTM modulo the smoke-config check.

Generated by Claude Code

shuheng-liu · 2026-05-13T19:27:41Z

Smoke-config seeded determinism check done. Bit-identical per-step loss across two independent runs of configs/examples/pi05_training_config.json (40 steps, batch=2, physical-intelligence/libero episodes=[0..9], seed=1000 from the config, wandb/save/val disabled for the smoke).

$ diff <(grep -oE 'step:[0-9]+\([0-9]+\) smpl:[0-9]+\([0-9]+\) total_loss:[0-9.]+ mse_loss:[0-9.]+ ce_loss:[0-9.]+ lr:[0-9.e-]+ grad_norm:[0-9.]+' run_a.log) \
       <(grep -oE 'step:[0-9]+\([0-9]+\) smpl:[0-9]+\([0-9]+\) total_loss:[0-9.]+ mse_loss:[0-9.]+ ce_loss:[0-9.]+ lr:[0-9.e-]+ grad_norm:[0-9.]+' run_b.log)
(no diff)

All 8 logged steps match across total_loss, mse_loss, ce_loss, lr, grad_norm:

step	total_loss	mse_loss	ce_loss	lr	grad_norm
5	5.355713	0.362435	4.993278	1.0e-07	2.655
10	5.347048	0.284519	5.062529	2.2e-07	2.595
15	5.100674	0.309432	4.791242	3.5e-07	2.246
20	5.174705	0.299174	4.875531	4.7e-07	2.473
25	5.366552	0.370842	4.995710	6.0e-07	4.065
30	4.916766	0.304207	4.612558	7.2e-07	2.605
35	5.366232	0.294471	5.071761	8.5e-07	1.881
40	5.496222	0.423167	5.073055	9.7e-07	4.058

CLAUDE.md rule #3 satisfied. The loader-only 6-row hash check and now the full smoke-config seeded run both pass. Ready to flip out of draft from my end whenever you say go.

shuheng-liu added the bug Something isn't working label May 13, 2026

shuheng-liu self-assigned this May 13, 2026

shuheng-liu marked this pull request as ready for review May 13, 2026 17:30

shuheng-liu merged commit 27a2197 into main May 13, 2026
7 checks passed

shuheng-liu deleted the claude/bold-antonelli-665e34 branch May 13, 2026 21:39

WilliamYue37 mentioned this pull request May 14, 2026

Nightly regression test timing out at 30m in "Train with Model Parallelism" #306

Closed

shuheng-liu mentioned this pull request May 14, 2026

fix(datasets): harden dataset loading for large pretrain mixtures #304

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(datasets): memory-map parquet instead of rewriting to Arrow#301

fix(datasets): memory-map parquet instead of rewriting to Arrow#301
shuheng-liu merged 4 commits into
mainfrom
claude/bold-antonelli-665e34

shuheng-liu commented May 13, 2026 •

edited

Loading

Uh oh!

shuheng-liu commented May 13, 2026

Uh oh!

shuheng-liu commented May 13, 2026

Uh oh!

shuheng-liu commented May 13, 2026

Uh oh!

shuheng-liu commented May 13, 2026

Uh oh!

shuheng-liu commented May 13, 2026

Uh oh!

shuheng-liu commented May 13, 2026

Uh oh!

shuheng-liu commented May 13, 2026

Uh oh!

shuheng-liu commented May 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

shuheng-liu commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this does

Implementation

Trade-off: RAM cost scales with filtered rows

How it was tested

How to checkout & try? (for the reviewer)

Checklist

Uh oh!

shuheng-liu commented May 13, 2026

Review

Overview

Blocking: silent row-order regression on unsorted episodes

Non-blocking notes

Summary

Uh oh!

shuheng-liu commented May 13, 2026

Blocking — row-order regression

Non-blocking notes

Uh oh!

shuheng-liu commented May 13, 2026

Re-review of 8584c66

What's resolved

Minor leftovers (non-blocking)

Still gating un-draft

Summary

Uh oh!

shuheng-liu commented May 13, 2026

Critical finding: upstream LeRobot's Dataset.from_parquet pattern doesn't actually fix the disk cost

Pivot: bypass HF Datasets' parquet builder entirely

Trade-off

Loader-determinism check

Memory note

Uh oh!

shuheng-liu commented May 13, 2026

Re-review of 4308796 (the pivot)

What's resolved

One concern worth raising explicitly: RAM regression for full-corpus loads

Minor

Summary

Uh oh!

shuheng-liu commented May 13, 2026

RAM measurement added

Code-comment updates (lerobot_dataset.py:1571–1593)

Wording fixes

Minor

Uh oh!

shuheng-liu commented May 13, 2026

Re-review of ccc62ef

Approving from a code-review standpoint

CI

Uh oh!

shuheng-liu commented May 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

shuheng-liu commented May 13, 2026 •

edited

Loading

Blocking: silent row-order regression on unsorted `episodes`

Re-review of `8584c66`

Critical finding: upstream LeRobot's `Dataset.from_parquet` pattern doesn't actually fix the disk cost

Re-review of `4308796` (the pivot)

Re-review of `ccc62ef`