feat(datasets): add tolerance_s and skip_timestamp_check config knobs by shuheng-liu · Pull Request #258 · TensorAuto/OpenTau

shuheng-liu · 2026-05-04T22:20:28Z

What this does

Exposes LeRobotDataset's timestamp-sync tolerance through DatasetConfig / DatasetMixtureConfig and adds an explicit skip flag, so users can configure or bypass the load-time check_timestamps_sync call from a TrainPipelineConfig JSON. Previously the tolerance was hardcoded to 1e-4 with no escape hatch outside of scripts/visualize_dataset.py.

New fields:

Field	Mixture-level default	Per-dataset override
`tolerance_s`	`DatasetMixtureConfig.tolerance_s: float = 1e-4`	`DatasetConfig.tolerance_s: float \| None = None`
`skip_timestamp_check`	`DatasetMixtureConfig.skip_timestamp_check: bool = False`	`DatasetConfig.skip_timestamp_check: bool \| None = None`

Per-dataset values win over the mixture default when set; None means inherit. The skip only bypasses the load-time call in LeRobotDataset.__init__; the record-time call inside add_episode is intentionally not gated — bad timestamps in fresh recordings indicate a real capture bug, not tolerable drift.

Useful when one dataset in a mixture has slightly off-fps timestamps (override per-dataset to loosen for just that one) or as a debug knob to skip the check entirely while investigating timing data.

How it was tested

Added 4 config-level tests in tests/configs/test_default.py: defaults at both levels, validation rejects negative tolerance_s at both levels.
Added 5 factory/dataset-level tests in tests/datasets/test_datasets.py: skip bypasses the load check, default runs it once, per-dataset overrides win over mixture defaults in both directions (true override, not set-when-True), mixture defaults flow through when per-dataset is None.
pre-commit run --all-files (focused on modified files): all hooks pass.
pytest -m "not gpu" -n auto tests/configs/test_default.py tests/datasets/test_datasets.py: 64 passed.
Full CPU suite (pytest -m "not gpu" -n auto): 959 passed, 13 skipped, no regressions. (One pre-existing test_libero_utils.py::test_libero2torch failure due to missing local LIBERO benchmark data — unrelated to this PR.)

How to checkout & try? (for the reviewer)

Run the focused tests:

pytest -sx tests/configs/test_default.py -k "tolerance"
pytest -sx tests/datasets/test_datasets.py -k "timestamp_check or tolerance"

Or exercise via a smoke-config CLI override:

opentau-train \
    --accelerate-config configs/examples/accelerate_ddp_config.yaml \
    --config_path=configs/examples/pi05_training_config.json \
    --dataset_mixture.skip_timestamp_check=true \
    --steps=2

Per-dataset override in JSON / CLI:

opentau-train ... \
    --dataset_mixture.datasets.0.tolerance_s=1e-3 \
    --dataset_mixture.datasets.0.skip_timestamp_check=true

Checklist

I have added Google-style docstrings to important functions and ensured function parameters are typed.
My PR includes policy-related changes.
- If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

Note: Before submitting this PR, please read the contributor guideline.

Expose `LeRobotDataset`'s timestamp-sync tolerance through `DatasetConfig` / `DatasetMixtureConfig` and add an explicit skip flag. Mixture-level values are the default; per-dataset values override when set (None means inherit). The skip only bypasses the load-time check; the record-time check inside `add_episode` is intentionally unchanged.

claude

Light review — overall this is a clean, well-tested change. A few notes; no blockers.

The mixture-level tolerance_s docstring scopes the field to the load-time check_timestamps_sync call, but LeRobotDataset.__init__ also threads self.tolerance_s into the per-frame video queries (query_video_frames_floor/ceil/rounded at src/opentau/datasets/lerobot_dataset.py:1743/1746/1753). A user loosening tolerance_s=1e-3 to ease the load-time check will also widen the video-frame match window. The existing LeRobotDataset.__init__ docstring explicitly notes "This also applies to frames decoded from video files." — worth carrying that caveat into the new config docstrings so users aren't surprised.
default.py:316 constructs identifier = dataset_cfg.repo_id or dataset_cfg.vqa or "<unidentified dataset>". Since DatasetConfig.__post_init__ already enforces exactly one of repo_id / vqa is set, the third branch is unreachable. Harmless, just dead.
DatasetConfig.tolerance_s validation lives only in DatasetMixtureConfig.__post_init__, so a bare DatasetConfig(repo_id="foo", tolerance_s=-1) won't raise. Matches the mixture usage path, but for symmetry you could move/duplicate the check into DatasetConfig.__post_init__.

claude · 2026-05-04T22:23:00Z

[claude-review] summary for commit 968e6bb

No blocking issues found. The three prior review items are addressed cleanly in 968e6bb:

Docstrings on both tolerance_s fields now spell out that the same value flows into query_video_frames_*, so loosening it widens the video-frame match window.
DatasetConfig.tolerance_s is now validated in DatasetConfig.__post_init__ (bare construction raises) — the duplicate mixture-level loop and the unreachable "<unidentified dataset>" branch are gone.
New test test_invalid_negative_bare_dataset_tolerance_raises covers the bare-construction path; existing assertion regex still matches the unchanged error format.

shuheng-liu · 2026-05-04T22:29:01Z

@claude fix per suggestions and nits

@claude

- addresses @claude (docstrings): note that `tolerance_s` also widens the per-frame video match window (used by `query_video_frames_*`) — added the caveat to both `DatasetConfig.tolerance_s` and `DatasetMixtureConfig.tolerance_s` docstrings so users aren't surprised when loosening it. - addresses @claude (symmetry): moved per-dataset `tolerance_s` validation into `DatasetConfig.__post_init__` so a bare `DatasetConfig(repo_id=..., tolerance_s=-1)` also raises (previously only validated when wrapped in a mixture). Added a focused test covering the bare path. - addresses @claude (dead code): removed the now-redundant per-dataset loop in `DatasetMixtureConfig.__post_init__`, which also drops the unreachable `or "<unidentified dataset>"` fallback (every `DatasetConfig` already has exactly one of `repo_id`/`vqa` set). tests: passed — pytest -m "not gpu" -n auto tests/configs/test_default.py tests/datasets/test_datasets.py Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

claude · 2026-05-04T22:36:07Z

[claude-fix] addressed all three review items in 968e6bb.

suggestion (docstrings, default.py:82-88,217-223) — added a note to both DatasetConfig.tolerance_s and DatasetMixtureConfig.tolerance_s docstrings spelling out that the same value flows into the per-frame video match window in query_video_frames_*, so loosening it widens the video-frame match too.
nit (default.py:316 dead branch) — removed; see below.
nit (default.py:141-150 symmetry) — moved the per-dataset tolerance_s validation into DatasetConfig.__post_init__ so a bare DatasetConfig(repo_id=\"foo\", tolerance_s=-1) now raises. The mixture-level loop is gone (which also drops the unreachable or \"<unidentified dataset>\" from the previous nit). The error message format is preserved so the existing test still asserts; added a new test_invalid_negative_bare_dataset_tolerance_raises covering the bare path.

tests: passed — pytest -m "not gpu" -n auto tests/configs/test_default.py tests/datasets/test_datasets.py (18 + 43 passed locally).

claude

Re-reviewed at 968e6bb — the three prior items are addressed cleanly. No blocking issues.

shuheng-liu added the feature New feature or request label May 4, 2026

shuheng-liu self-assigned this May 4, 2026

shuheng-liu marked this pull request as ready for review May 4, 2026 22:22

claude Bot reviewed May 4, 2026

View reviewed changes

shuheng-liu merged commit 5ab5b56 into main May 4, 2026
7 checks passed

shuheng-liu deleted the claude/jolly-wilson-b28d3b branch May 4, 2026 22:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(datasets): add tolerance_s and skip_timestamp_check config knobs#258

feat(datasets): add tolerance_s and skip_timestamp_check config knobs#258
shuheng-liu merged 2 commits into
mainfrom
claude/jolly-wilson-b28d3b

shuheng-liu commented May 4, 2026

Uh oh!

claude Bot left a comment

Uh oh!

claude Bot commented May 4, 2026 •

edited

Loading

Uh oh!

shuheng-liu commented May 4, 2026

Uh oh!

claude Bot commented May 4, 2026

Uh oh!

claude Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

shuheng-liu commented May 4, 2026

What this does

How it was tested

How to checkout & try? (for the reviewer)

Checklist

Note: Before submitting this PR, please read the contributor guideline.

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Uh oh!

claude Bot commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

shuheng-liu commented May 4, 2026

Uh oh!

claude Bot commented May 4, 2026

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude Bot commented May 4, 2026 •

edited

Loading