fix(datasets): memory-map parquet instead of rewriting to Arrow #301
`LeRobotDataset.load_hf_dataset()` used `datasets.load_dataset("parquet", ...)`,
which routes through `ParquetDatasetBuilder` and materializes the source
parquet as an uncompressed Arrow cache under `$HF_HOME/datasets/parquet/` —
4-5x the size of the source. Every distinct `episodes=` pick produced a new
cache entry, so on a populated `$HF_HOME` this can roughly double on-disk
corpus size.
Port HF LeRobot's upstream fix (huggingface/lerobot#2982): use
`Dataset.from_parquet([all_paths], filters=pa_ds.field("episode_index").isin(self.episodes), features=...)`.
PyArrow memory-maps the parquet directly (no Arrow rewrite) and predicate
pushdown on `episode_index` filters rows, so the file list is always the
full set under `data/` — producing one stable cache key instead of one
per `(repo, episodes)` pick.
Fixes #277.
Review
Thanks for the writeup. The storage win is real and the upstream port is tight (9/9, single file). One concern below is load-bearing; the rest are minor.
Overview
Replaces
Blocking: silent row-order regression on unsorted
|
Address review feedback on #301.

1. Sort `self.episodes` at the __init__ boundary. The new `Dataset.from_parquet([sorted_paths], filters=...)` loader returns rows in sorted (chunk, episode_index) order regardless of the order the user passed in `episodes=`. `get_episode_data_index` builds `episode_data_index["from"/"to"]` in `self.episodes`-list order, so an unsorted user list (e.g. `episodes=[5, 2, 8]`) would silently mis-align rows: `episode_data_index["from"][epi2idx[5]] = 0` but row 0 is now episode 2. Sorting at the boundary makes the row-order assumption explicit and matches what every other in-repo caller already does.
2. Derive the parquet glob from `self.meta.data_path` instead of hardcoding `data/*/*.parquet`. Datasets with a non-default `info["data_path"]` (deeper nesting, flat layout, etc.) now keep working.
3. Add regression tests under `tests/datasets/test_datasets.py`:
   - `test_dataset_unsorted_episodes_row_alignment`: the unsorted case described above, here with `episodes=[6, 2, 5]`.
   - `test_dataset_sparse_episodes_row_alignment`: sparse non-contiguous filter (`episodes=[3, 7]` out of 10).
   - `test_dataset_no_episodes_loads_all`: `episodes=None` default path.

   All three assert that for every `ep in dataset.episodes`, every row in `hf_dataset[from[idx]:to[idx]]` carries `episode_index == ep`. The `[6, 2, 5]` case fails without the sort.
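
The mis-alignment described in item 1 can be reproduced with a small pure-Python model. This is only a sketch under stated assumptions: `build_episode_index` is a hypothetical stand-in for `get_episode_data_index`, the episode lengths are made up, and the loader is modeled as simply returning rows sorted by `episode_index`.

```python
def build_episode_index(episodes, episode_lengths):
    """Assign contiguous [from, to) row ranges in the order `episodes` is given."""
    index, start = {}, 0
    for ep in episodes:
        end = start + episode_lengths[ep]
        index[ep] = (start, end)
        start = end
    return index


def misaligned(episodes, episode_lengths):
    """Return the episodes whose [from, to) slice contains rows of another episode."""
    # Model of the loader: rows come back sorted by episode_index,
    # whatever order the caller requested.
    rows = [ep for ep in sorted(episodes) for _ in range(episode_lengths[ep])]
    index = build_episode_index(episodes, episode_lengths)
    return [ep for ep, (lo, hi) in index.items() if any(r != ep for r in rows[lo:hi])]


lengths = {2: 3, 5: 4, 8: 2}  # hypothetical frames per episode
unsorted_result = misaligned([5, 2, 8], lengths)
sorted_result = misaligned(sorted([5, 2, 8]), lengths)
print(unsorted_result, sorted_result)
```

With the unsorted request, the slices for episodes 5 and 2 point at each other's rows; sorting the list before building the index removes the mismatch.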
|
Thanks for the careful review. Addressed in 8584c66.
Blocking: row-order regression
Fixed by sorting
Regression test added:
All 392 dataset tests pass (389 + 3 new).
Non-blocking notes
|
Re-review of 8584c66
The fixes look right. Quick pass over the second commit:
What's resolved
Minor leftovers (non-blocking)
Still gating un-draft
Summary
LGTM modulo running the three measurements above before un-drafting. The code change is now correct and well-tested for the bug class I was worried about.
Generated by Claude Code
Empirical test on physical-intelligence/libero (10 episodes, 347 MB
source parquet) shows that swapping `load_dataset("parquet", ...)` for
`Dataset.from_parquet([paths], filters=...)` does NOT reduce disk
usage — both route through `ParquetDatasetReader` /
`ParquetDatasetBuilder` and write the same 347 MB Arrow cache entry to
$HF_HOME/datasets/parquet/, with the same hash. The upstream LeRobot
comment "memory-mapped loading for efficiency" is wrong for
datasets==4.5.0.
Switch to reading the parquet via `pyarrow.dataset.dataset(...)`
directly, applying the episode filter at the pyarrow level via
`to_table(filter=...)`, and wrapping the resulting pa.Table in a HF
`Dataset(table, info=DatasetInfo(features=features))`. Verified on
mlbox: zero growth in $HF_HOME/datasets/parquet/ across multiple loads.
Trade-off: filtered rows materialize into RAM rather than being mmapped
from an Arrow cache file. For training subsets (10s of episodes) this is
fine; full-corpus loads on multi-GB repos will now be RAM-bound. The
disk-doubling issue (#277) is the bigger concern for shared $HF_HOME
setups, so this is the right trade-off.
Row-alignment / sort-episodes fixes from the previous commit are
preserved; the regression tests still pass against the new loader.
|
Empirical verification report (follow-up to the earlier code-review reply). Critical finding: upstream LeRobot's

| Approach | Disk after 3 calls (pick 0..4, pick 5..9, pick 0..4 again) | Cache entries |
|---|---|---|
| Old: `load_dataset("parquet", data_files=picked)` | 347 MB | 2 |
| `Dataset.from_parquet([all_paths], filters=...)` | 347 MB | 2 |
| `pa_ds.dataset(paths).to_table(filter=...)` + `Dataset(table)` | 0 MB | 0 |

Same hash sequence on the cache entries, same 347 MB total. Dataset.from_parquet routes through ParquetDatasetReader.read() → ParquetDatasetBuilder, which writes the Arrow cache exactly like load_dataset("parquet", ...). The upstream "memory-mapped" claim is wrong for current datasets versions.
Pivot: bypass HF Datasets' parquet builder entirely
4308796 switches the loader to pa_ds.dataset(paths).to_table(filter=...) + Dataset(table, info=DatasetInfo(features=features)). End-to-end LeRobotDataset load (full LeRobotDataset(cfg, repo_id, episodes=[0..4]) constructor — not just the from_parquet primitive) on the dev box:
[INITIAL (wiped)] 0.0 MB / 0 entries
[after pick episodes=0..4] 0.0 MB / 0 entries
[after pick episodes=5..9] 0.0 MB / 0 entries
DELTA: +0.000 MB, +0 entries
Zero growth, confirmed against the same dataset where the previous approach grew 347 MB.
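
For anyone who wants to repeat the before/after measurement without shelling out to `du`, a rough Python equivalent might look like this. It is a sketch, not the script used for the numbers above: `cache_footprint` is a hypothetical helper, it sums apparent file sizes (which can differ slightly from `du`'s block-based figure), and the demo runs against a throwaway directory standing in for `$HF_HOME/datasets/parquet/`.

```python
import os
import tempfile
from pathlib import Path


def cache_footprint(root: str) -> tuple[int, int]:
    """Return (total_bytes, top_level_entries) for a cache directory.

    A missing directory counts as empty.
    """
    root_path = Path(root)
    if not root_path.is_dir():
        return 0, 0
    total = sum(
        (Path(dirpath) / name).stat().st_size
        for dirpath, _dirs, files in os.walk(root_path)
        for name in files
    )
    entries = sum(1 for _ in root_path.iterdir())
    return total, entries


# Demo: simulate one cache entry appearing between two measurements.
with tempfile.TemporaryDirectory() as tmp:
    before = cache_footprint(tmp)
    entry = Path(tmp) / "default-abc123"  # hypothetical cache entry name
    entry.mkdir()
    (entry / "data.arrow").write_bytes(b"\0" * 1024)
    after = cache_footprint(tmp)

delta_bytes = after[0] - before[0]
delta_entries = after[1] - before[1]
print(f"DELTA: +{delta_bytes / 1e6:.3f} MB, +{delta_entries} entries")
```

Calling the helper before and after constructing the dataset, and comparing both the byte total and the entry count, mirrors the check reported above.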
Trade-off
pa_ds.dataset(paths).to_table(filter=...) materializes the filtered rows into RAM rather than mmapping an Arrow cache file. For typical training subsets this is comfortably bounded. Full-corpus loads on multi-GB image-heavy repos will now be RAM-bound — that's a regression vs. the old mmap-an-Arrow-cache behavior, but #277 is specifically about disk-doubling on shared $HF_HOME setups, so this is the trade-off worth making.
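
A back-of-the-envelope way to judge whether a given pick is "comfortably bounded" is to multiply the number of filtered rows by an average row size. The sketch below assumes that simple model; the helper names and the row counts and sizes are hypothetical, chosen only to contrast a narrow subset with a full-corpus load.

```python
def estimated_table_bytes(filtered_rows: int, avg_row_bytes: int) -> int:
    """Rough in-memory size of the filtered table: rows x average row size."""
    return filtered_rows * avg_row_bytes


def fits(filtered_rows: int, avg_row_bytes: int, ram_budget_bytes: int) -> bool:
    """True if the estimate stays within the given RAM budget."""
    return estimated_table_bytes(filtered_rows, avg_row_bytes) <= ram_budget_bytes


GB = 10**9
ROW_BYTES = 120_000  # hypothetical image-heavy row, ~120 kB

subset = estimated_table_bytes(3_000, ROW_BYTES)    # narrow episodes= pick
full = estimated_table_bytes(400_000, ROW_BYTES)    # episodes=None
subset_ok = fits(3_000, ROW_BYTES, 16 * GB)
full_ok = fits(400_000, ROW_BYTES, 16 * GB)
print(f"subset: {subset / GB:.2f} GB, full corpus: {full / GB:.2f} GB")
print(subset_ok, full_ok)
```

The estimate ignores transient allocations during construction, so it is a lower bound on peak usage rather than a guarantee.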
Loader-determinism check
Built the dataset twice in the same process with set_seed(42); hashed rows at idx 0, 1, len/4, len/2, 3·len/4, len-1. All 6 hashes bit-identical:
=== Pass 1 ===
idx= 0: 2e8e6218ba0017a1
idx= 1: 74861cb1bcf7f031
idx= 689: 0624f53aeddcc803
idx= 1379: fcaf3ede9cb6fa16
idx= 2068: bd22c1ab3c3b5d9f
idx= 2757: 107401e09d4c6a81
=== Pass 2 (same seed) ===
[identical to pass 1]
PASS: all 6 row hashes bit-identical across passes.
The change is loader-only and doesn't touch any of the training-loop primitives listed in CLAUDE.md rule #3 (scripts/train.py, policies/*/modeling_*.py, optim/, datasets/sampler.py), so a loader-level determinism check is the right scope. The full smoke-config seeded check is straightforward to add before un-drafting if you want belt-and-suspenders.
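
The shape of the row-hash check can be sketched as below. This is an illustration, not the script that produced the hashes above: `build_rows` is a hypothetical stand-in for constructing the dataset under a fixed seed, and the hashing scheme (SHA-256 over a sorted repr, truncated to 16 hex digits) is an assumption. Only the six probe positions are taken from the report.

```python
import hashlib
import random


def sample_indices(n: int) -> list[int]:
    """The six probe positions: 0, 1, n/4, n/2, 3n/4, n-1."""
    return [0, 1, n // 4, n // 2, 3 * n // 4, n - 1]


def row_hash(row: dict) -> str:
    """Stable 16-hex-digit digest of a row's sorted (key, value) pairs."""
    payload = repr(sorted(row.items())).encode()
    return hashlib.sha256(payload).hexdigest()[:16]


def build_rows(seed: int, n: int) -> list[dict]:
    """Hypothetical stand-in for building the dataset under a fixed seed."""
    rng = random.Random(seed)
    return [{"index": i, "value": rng.random()} for i in range(n)]


def probe(seed: int, n: int) -> list[str]:
    """Build the rows and hash them at the six probe positions."""
    rows = build_rows(seed, n)
    return [row_hash(rows[i]) for i in sample_indices(n)]


pass1 = probe(seed=42, n=2758)
pass2 = probe(seed=42, n=2758)
print("PASS" if pass1 == pass2 else "FAIL")
```

For a dataset of 2758 rows, `sample_indices` yields the same positions as in the log (0, 1, 689, 1379, 2068, 2757).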
Memory note
Saving a feedback memory locally so I don't fall for the upstream "memory-mapped" claim again: any future PR touching from_parquet needs an empirical du -sh $HF_HOME/datasets/parquet/ before/after to verify the cache actually doesn't grow.
Re-review of 4308796 (the pivot)
The "upstream HF LeRobot's
What's resolved
One concern worth raising explicitly: RAM regression for full-corpus loads
Old path:
New path: the whole filtered
The pre-existing pattern in
Probably acceptable given that #277 is specifically about shared
Alternative architecture if RAM does turn out to be a problem: write the filtered
Minor
Summary
Loader code is correct, disk regression is fixed, and the empirical investigation that drove the pivot is genuinely good engineering. RAM trade-off is the one piece that deserves a sharper presentation in the PR description before un-drafting: current readers will see "RAM-bound" and not realize the worst case is the size of the dataset they were trying to load.
CI: pre-commit + auto-review green; CPU Tests still in-progress on 4308796.
Generated by Claude Code
…taset

Address review feedback on #301 (re-review of 4308796).

1. `load_hf_dataset` comment now spells out the RAM-cost consequence with concrete numbers (~350 MB for libero/10ep, ~46 GB for humanoid-everyday-A-overlay full-corpus) so a future reader hitting OOM on a small dev box knows exactly why.
2. Add a `# datasets>=2.19.0` note on the `Dataset(table, info=DatasetInfo(features=features))` constructor. The project pin is 2.19.0 and that signature has been stable since 2.x.
3. One-line note that the `re.sub` glob derivation assumes simple `{name}` placeholders, not literal `{{`/`}}` escapes, which holds for every in-repo writer.
4. Inline note in `_assert_episode_row_alignment` explaining why it reaches into `.data.table.column(...)` instead of `hf_dataset[col]` (the `set_transform=hf_transform_to_torch` would route through torch).
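
The placeholder assumption in item 3 is easiest to see with a concrete substitution. The sketch below is a guess at the general shape, not the PR's code: `glob_from_template` and its regex are hypothetical, and the template strings are assumed examples of what `info["data_path"]` might contain.

```python
import re


def glob_from_template(data_path: str) -> str:
    """Replace every simple `{name}` or `{name:spec}` placeholder with `*`."""
    return re.sub(r"\{[^{}]*\}", "*", data_path)


# Assumed default-style template with format specs.
nested = glob_from_template(
    "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet"
)
# Hypothetical flat layout.
flat = glob_from_template("data/episode_{episode_index}.parquet")
# Literal escaped braces are NOT handled: the inner `{raw}` is treated
# as a placeholder and the result is mangled.
escaped = glob_from_template("data/{{raw}}/episode_{episode_index}.parquet")

print(nested)
print(flat)
print(escaped)
```

The first two templates reduce to usable globs, while the escaped-brace case shows why the inline note restricts the derivation to simple placeholders.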
|
Addressed in ccc62ef. Summary of changes:
RAM measurement added
Measured peak RSS during
So:
Both numbers are now in the PR description in a table parallel to the
Code-comment updates (lerobot_dataset.py:1571–1593)
The inline comment in
# Trade-off: `to_table(filter=...)` materializes the filtered rows into
# RAM rather than mmapping a disk-backed Arrow cache. RAM cost scales
# with `len(filtered rows) × avg-row-size`; concretely:
# ~350 MB for physical-intelligence/libero with episodes=[0..9],
# ~46 GB for humanoid-everyday-A-overlay with episodes=None (full corpus).
# Narrow `episodes=` picks are fine; an episodes=None load on a multi-GB
# image-heavy repo will OOM on a small dev box — pass a manageable
# subset, or restore a mmap'd Arrow cache via tmp pa.ipc files if RAM
# ever becomes the binding constraint.
Wording fixes
Minor
The full smoke-config seeded determinism check (per CLAUDE.md rule #3) is still queueable on the dev box if you want it before un-drafting; the loader-only check covers the bug class touched by this change but rule #3 is broader. |
Re-review of ccc62ef
Docs-only commit on top of 4308796; no logic changes. Every point from the last review is addressed:
The "comfortably under RAM" line in the trade-off section now lands accurately, the table makes the disk-vs-RAM trade explicit, and the in-codebase
Approving from a code-review standpoint
Everything I'd want addressed is addressed. The remaining gate is your own commitment in the PR description:
CI
ccc62ef: pre-commit ✅, review ✅, check-checklist ✅, CPU Tests one passed + one in-progress. The completed CPU Tests run is green; the in-progress one is the duplicate triggered by the comment-edit, so expect it to land identically. LGTM modulo the smoke-config check.
Generated by Claude Code
|
Smoke-config seeded determinism check done. Bit-identical per-step loss across two independent runs of
All 8 logged steps match across
CLAUDE.md rule #3 satisfied. The loader-only 6-row hash check and now the full smoke-config seeded run both pass. Ready to flip out of draft from my end whenever you say go.
What this does
Fixes #277.
`LeRobotDataset.load_hf_dataset()` previously called `datasets.load_dataset("parquet", ...)`, which routes through `ParquetDatasetBuilder` and materializes the source parquet into an uncompressed Arrow cache under `$HF_HOME/datasets/parquet/default-<hash>/...`. The Arrow rewrite is 1–5× the source parquet size (compression-dependent: ~1× for vector-only data, ~4–5× for image-heavy parquets per #277), and because the cache key includes `data_files`, every distinct `episodes=` pick produces a new cache entry. On a populated `$HF_HOME` this can roughly double on-disk corpus size.

Empirical investigation on a real OpenTau-compatible dataset (`physical-intelligence/libero`, 10 episodes, 347 MB source parquet); see follow-up comments for the scripts and numbers:
- `load_dataset("parquet", data_files=picked)`
- `Dataset.from_parquet([all_paths], filters=...)` (HF LeRobot's pattern)
- `pa_ds.dataset(paths).to_table(filter=...)` + `Dataset(table)` (this PR)

The upstream LeRobot fix (huggingface/lerobot#2982) advertises `Dataset.from_parquet` as memory-mapped, but in `datasets==4.5.0` it still routes through `ParquetDatasetReader` → `ParquetDatasetBuilder` and writes the same Arrow cache. The only working fix is to bypass HF Datasets' parquet builder entirely and read via `pyarrow.dataset` directly.

Implementation
`src/opentau/datasets/lerobot_dataset.py`:
- Imports: drop `load_dataset`, add `Dataset`, `DatasetInfo` from `datasets`, `pyarrow.dataset as pa_ds`, and `import re`.
- `load_hf_dataset()` rewritten: derive the parquet glob from `self.meta.data_path` (so non-default `info["data_path"]` layouts keep working), build a `pa_ds.dataset(paths)` lazy view, apply `pa_ds.field("episode_index").isin(self.episodes)` via `to_table(filter=...)`, and wrap the resulting `pa.Table` in `Dataset(table, info=DatasetInfo(features=features))`.
- Sort `self.episodes` at the `__init__` boundary (line 1264). The new loader returns rows in sorted (chunk, episode_index) order regardless of the caller's `episodes=` order; `get_episode_data_index` builds `episode_data_index["from"/"to"]` in `self.episodes`-list order. Without the sort, an unsorted user list (e.g. `episodes=[5, 2, 8]`) would silently mis-align rows: `episode_data_index["from"][epi2idx[5]] = 0` but row 0 would be episode 2. Sorting makes the assumption explicit and matches what every other in-repo caller already does.

Trade-off: RAM cost scales with filtered rows
`pa_ds.to_table(filter=...)` materializes the filtered rows into RAM rather than mmapping a disk-backed Arrow cache. The old `load_dataset("parquet", ...)` wrote an uncompressed Arrow cache to disk and mmap'd it: the kernel paged in only what `__getitem__` touched, so resident memory stayed near the working set even on multi-GB corpora. With this PR, the entire filtered `pa.Table` is in process memory at the end of `load_hf_dataset`.

Measured on physical-intelligence/libero, episodes=[0..9]:

The persistent data cost is bounded by the `pa.Table` (347 MB = source parquet size). The extra ~2.7 GB peak RSS is import / pyarrow scratch / dataset-stats transients and decays after construction. The principled scaling rule is `pa.Table.nbytes ≈ filtered_rows × avg_row_size`, so an `episodes=None` load on `humanoid-everyday-A-overlay` (46 GB parquet, image-heavy) would need ~46 GB resident before training starts.

Practical implication: narrow `episodes=` picks are fine; a default `episodes=None` load on a multi-GB image-heavy repo can OOM. `factory.make_dataset(...)`, `scripts/visualize_dataset.py`, `scripts/fit_fast_tokenizer.py`, and `v21/convert_stats.py` all use `episodes=None` and may need an explicit subset on big repos. Given #277 is specifically about disk-doubling on shared `$HF_HOME`, this is the trade-off worth making; if a real OOM shows up, the escape hatch is to write the filtered table to a tmp `pa.ipc` file and mmap it back.

The trade-off is also called out inline at the `load_hf_dataset` definition so a future reader hitting OOM on a small dev box can see why.

How it was tested
- `pre-commit run --files src/opentau/datasets/lerobot_dataset.py tests/datasets/test_datasets.py`: clean.
- `pytest -m "not gpu" -n auto tests/datasets/`: 392 passed, 7 skipped. Includes 3 new regression tests in `tests/datasets/test_datasets.py`:
  - `test_dataset_unsorted_episodes_row_alignment`: `episodes=[6, 2, 5]`, asserts `dataset.episodes` is sorted and every row in `hf_dataset[from[idx]:to[idx]]` carries the expected `episode_index`. Fails without the sort.
  - `test_dataset_sparse_episodes_row_alignment`: sparse `episodes=[3, 7]` out of 10.
  - `test_dataset_no_episodes_loads_all`: `episodes=None` default path.
- `pytest -m "not gpu" -n auto` (full CPU suite): 1153 passed. 3 pre-existing failures unrelated to this change (HF Hub 429, missing libero assets, xdist flake).
- `physical-intelligence/libero` (10 episodes) via the full `LeRobotDataset(cfg, ...)` constructor: zero growth in `$HF_HOME/datasets/parquet/` across two distinct `episodes=` picks. Numbers in follow-up comments.
- `LeRobotDataset(...)` constructions with the same seed: all 6 hashes bit-identical. The change is loader-only and doesn't touch the training-loop primitives listed in CLAUDE.md rule #3, so a loader-determinism check is the right scope. Full smoke-config seeded determinism can be added before un-drafting if you want belt-and-suspenders.

How to checkout & try? (for the reviewer)
git checkout claude/bold-antonelli-665e34
uv sync --extra dev --extra libero
pytest -m "not gpu" -n auto tests/datasets/

Disk-cost win on a host with a real dataset cached:
The second `du` should match the first (zero growth).

Checklist