feat(datasets): add tolerance_s and skip_timestamp_check config knobs #258
Merged
Expose `LeRobotDataset`'s timestamp-sync tolerance through `DatasetConfig` / `DatasetMixtureConfig` and add an explicit skip flag. Mixture-level values are the default; per-dataset values override when set (None means inherit). The skip only bypasses the load-time check; the record-time check inside `add_episode` is intentionally unchanged.
Contributor
Light review — overall this is a clean, well-tested change. A few notes; no blockers.
- The mixture-level `tolerance_s` docstring scopes the field to the load-time `check_timestamps_sync` call, but `LeRobotDataset.__init__` also threads `self.tolerance_s` into the per-frame video queries (`query_video_frames_floor/ceil/rounded` at `src/opentau/datasets/lerobot_dataset.py:1743/1746/1753`). A user loosening `tolerance_s=1e-3` to ease the load-time check will also widen the video-frame match window. The existing `LeRobotDataset.__init__` docstring explicitly notes "This also applies to frames decoded from video files." It is worth carrying that caveat into the new config docstrings so users aren't surprised.
- `default.py:316` constructs `identifier = dataset_cfg.repo_id or dataset_cfg.vqa or "<unidentified dataset>"`. Since `DatasetConfig.__post_init__` already enforces that exactly one of `repo_id`/`vqa` is set, the third branch is unreachable. Harmless, just dead.
- `DatasetConfig.tolerance_s` validation lives only in `DatasetMixtureConfig.__post_init__`, so a bare `DatasetConfig(repo_id="foo", tolerance_s=-1)` won't raise. This matches the mixture usage path, but for symmetry you could move or duplicate the check into `DatasetConfig.__post_init__`.
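To make the first point concrete, here is a minimal sketch of how a single tolerance value can gate both a sync check and a frame lookup. It is not the `LeRobotDataset` implementation: `timestamps_in_sync` and `match_frame` are hypothetical helpers, and the sync rule used here (consecutive timestamps must differ by `1/fps` within `tolerance_s`) is an assumption about what the load-time check verifies.

```python
def timestamps_in_sync(timestamps, fps, tolerance_s):
    """Hypothetical check: every consecutive gap is within tolerance_s of 1/fps."""
    expected = 1.0 / fps
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return all(abs(gap - expected) <= tolerance_s for gap in gaps)


def match_frame(query_ts, frame_timestamps, tolerance_s):
    """Hypothetical lookup: index of the nearest frame, or ValueError if too far."""
    idx = min(
        range(len(frame_timestamps)),
        key=lambda i: abs(frame_timestamps[i] - query_ts),
    )
    distance = abs(frame_timestamps[idx] - query_ts)
    if distance > tolerance_s:
        raise ValueError(
            f"nearest frame is {distance:.6f}s away, tolerance is {tolerance_s}s"
        )
    return idx


# 10 fps recording where the third frame arrives 0.5 ms late.
timestamps = [0.0, 0.1, 0.2005, 0.3]

strict = timestamps_in_sync(timestamps, fps=10, tolerance_s=1e-4)  # jitter exceeds 0.1 ms
loose = timestamps_in_sync(timestamps, fps=10, tolerance_s=1e-3)   # jitter within 1 ms
```

With the looser value, the same 0.5 ms offset that now passes the sync check is also accepted by `match_frame(0.2, timestamps, 1e-3)`, which is the side effect the caveat should spell out.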
Contributor
[claude-review] summary for commit 968e6bb No blocking issues found. The three prior review items are addressed cleanly in 968e6bb:
Member
Author
@claude fix per suggestions and nits
- addresses @claude (docstrings): note that `tolerance_s` also widens the per-frame video match window (used by `query_video_frames_*`); added the caveat to both `DatasetConfig.tolerance_s` and `DatasetMixtureConfig.tolerance_s` docstrings so users aren't surprised when loosening it.
- addresses @claude (symmetry): moved per-dataset `tolerance_s` validation into `DatasetConfig.__post_init__` so a bare `DatasetConfig(repo_id=..., tolerance_s=-1)` also raises (previously only validated when wrapped in a mixture). Added a focused test covering the bare path.
- addresses @claude (dead code): removed the now-redundant per-dataset loop in `DatasetMixtureConfig.__post_init__`, which also drops the unreachable `or "<unidentified dataset>"` fallback (every `DatasetConfig` already has exactly one of `repo_id`/`vqa` set).

tests: passed (`pytest -m "not gpu" -n auto tests/configs/test_default.py tests/datasets/test_datasets.py`)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
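The validation layout described in this commit could look roughly like the sketch below. The class and field names come from the PR text, but the bodies, error messages, and the `datasets` field type are assumptions for illustration, not the merged code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DatasetConfig:
    repo_id: Optional[str] = None
    vqa: Optional[str] = None
    tolerance_s: Optional[float] = None          # None means inherit from the mixture
    skip_timestamp_check: Optional[bool] = None  # None means inherit from the mixture

    def __post_init__(self):
        # Exactly one of repo_id / vqa must be set.
        if (self.repo_id is None) == (self.vqa is None):
            raise ValueError("Exactly one of `repo_id` or `vqa` must be set.")
        # Per-dataset validation lives here, so a bare DatasetConfig also raises.
        if self.tolerance_s is not None and self.tolerance_s < 0:
            raise ValueError(f"`tolerance_s` must be >= 0, got {self.tolerance_s}.")


@dataclass
class DatasetMixtureConfig:
    datasets: List[DatasetConfig] = field(default_factory=list)
    tolerance_s: float = 1e-4
    skip_timestamp_check: bool = False

    def __post_init__(self):
        # Only the mixture-level value is checked here; each DatasetConfig
        # has already validated itself on construction.
        if self.tolerance_s < 0:
            raise ValueError(f"`tolerance_s` must be >= 0, got {self.tolerance_s}.")
```

Because each `DatasetConfig` validates itself on construction, the mixture no longer needs a per-dataset loop, which is what makes the `"<unidentified dataset>"` fallback removable.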
Contributor
[claude-fix] addressed all three review items in 968e6bb.
tests: passed
What this does
Exposes `LeRobotDataset`'s timestamp-sync tolerance through `DatasetConfig` / `DatasetMixtureConfig` and adds an explicit skip flag, so users can configure or bypass the load-time `check_timestamps_sync` call from a `TrainPipelineConfig` JSON. Previously the tolerance was hardcoded to `1e-4` with no escape hatch outside of `scripts/visualize_dataset.py`.

New fields:

| Field | Mixture level | Per dataset |
| --- | --- | --- |
| `tolerance_s` | `DatasetMixtureConfig.tolerance_s: float = 1e-4` | `DatasetConfig.tolerance_s: float \| None = None` |
| `skip_timestamp_check` | `DatasetMixtureConfig.skip_timestamp_check: bool = False` | `DatasetConfig.skip_timestamp_check: bool \| None = None` |

Per-dataset values win over the mixture default when set; `None` means inherit. The skip only bypasses the load-time call in `LeRobotDataset.__init__`; the record-time call inside `add_episode` is intentionally not gated, because bad timestamps in fresh recordings indicate a real capture bug, not tolerable drift.

Useful when one dataset in a mixture has slightly off-fps timestamps (override per-dataset to loosen for just that one) or as a debug knob to skip the check entirely while investigating timing data.
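The inheritance rule can be sketched as follows. This is an illustration only: `resolve` and `load_time_settings` are hypothetical helpers operating on plain dicts, not functions from the PR.

```python
from typing import Optional, TypeVar

T = TypeVar("T")


def resolve(dataset_value: Optional[T], mixture_value: T) -> T:
    """Per-dataset value wins when set; None means inherit the mixture default."""
    # Compare against None explicitly: `dataset_value or mixture_value` would
    # wrongly discard legitimate falsy overrides such as False or 0.0.
    return mixture_value if dataset_value is None else dataset_value


def load_time_settings(mixture: dict, dataset: dict) -> dict:
    """Hypothetical helper: effective settings for one dataset in a mixture."""
    return {
        "tolerance_s": resolve(dataset.get("tolerance_s"), mixture["tolerance_s"]),
        "skip_timestamp_check": resolve(
            dataset.get("skip_timestamp_check"), mixture["skip_timestamp_check"]
        ),
    }


mixture = {"tolerance_s": 1e-4, "skip_timestamp_check": True}

# Nothing set on the dataset: both values flow through from the mixture.
inherits = load_time_settings(mixture, {})

# Explicit per-dataset values override in both directions, including False over True.
overrides = load_time_settings(
    mixture, {"tolerance_s": 1e-3, "skip_timestamp_check": False}
)
```

The explicit `is None` comparison is what makes this a true override rather than set-when-True: a dataset can re-enable the check even when the mixture skips it.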
How it was tested
- `tests/configs/test_default.py`: defaults at both levels, validation rejects negative `tolerance_s` at both levels.
- `tests/datasets/test_datasets.py`: skip bypasses the load check, default runs it once, per-dataset overrides win over mixture defaults in both directions (true override, not set-when-True), mixture defaults flow through when per-dataset is `None`.
- `pre-commit run --all-files` (focused on modified files): all hooks pass.
- `pytest -m "not gpu" -n auto tests/configs/test_default.py tests/datasets/test_datasets.py`: 64 passed.
- Full suite (`pytest -m "not gpu" -n auto`): 959 passed, 13 skipped, no regressions. (One pre-existing `test_libero_utils.py::test_libero2torch` failure due to missing local LIBERO benchmark data, unrelated to this PR.)

How to checkout & try? (for the reviewer)
Run the focused tests: `pytest -m "not gpu" -n auto tests/configs/test_default.py tests/datasets/test_datasets.py`
Or exercise via a smoke-config CLI override:

```bash
opentau-train \
  --accelerate-config configs/examples/accelerate_ddp_config.yaml \
  --config_path=configs/examples/pi05_training_config.json \
  --dataset_mixture.skip_timestamp_check=true \
  --steps=2
```

Per-dataset override in JSON / CLI:

```bash
opentau-train ... \
  --dataset_mixture.datasets.0.tolerance_s=1e-3 \
  --dataset_mixture.datasets.0.skip_timestamp_check=true
```

Checklist
Note: Before submitting this PR, please read the contributor guideline.