Skip to content

chore(datasets): collapse skip-timestamp-check warning to once-per-process#278

Merged
shuheng-liu merged 4 commits into
mainfrom
claude/skip-timestamp-warning-rank0
May 7, 2026
Merged

chore(datasets): collapse skip-timestamp-check warning to once-per-process#278
shuheng-liu merged 4 commits into
mainfrom
claude/skip-timestamp-warning-rank0

Conversation

@shuheng-liu
Copy link
Copy Markdown
Member

@shuheng-liu shuheng-liu commented May 6, 2026

What this does

The Skipping timestamp sync check for ... warning at
src/opentau/datasets/lerobot_dataset.py:1348 previously fired
num_processes × num_datasets times — for a heterogeneous
pretraining mixture (~392 datasets) on 8 GPUs that's ~3,136 identical
lines between accelerate init and the first training step, drowning
the startup log.

skip_timestamp_check is fundamentally a mixture-wide decision
(with optional per-dataset override), and the warning text is identical
for every dataset that has it set. So the right collapse is once per
process
— combined with the rank-0 gate that's one line per run,
not per dataset.

Implementation mirrors the existing _CONTROL_MODE_WARNED dedup
pattern in the same file:

  • New module-level _SKIP_TIMESTAMP_WARNED: bool flag.
  • Inside LeRobotDataset.__init__, the warning is gated by both
    rank (acc.is_main_process) and the flag — fire once, then flip.
  • Falls through to logging.warning when no Accelerator is set so
    single-process dev / pytest runs are unchanged.
  • Message now mentions the first dataset hit as an example rather
    than the only one, since the flag is mixture-wide.

How it was tested

  • pre-commit run --files src/opentau/datasets/lerobot_dataset.py — clean.
  • pytest -m "not gpu" -n auto tests/datasets/ — 375 passed, 7
    skipped, no regressions.

Behavior comparison:

Setup Lines emitted (before) Lines emitted (after)
8 ranks × 392 datasets ~3,136 1
1 rank (CPU dev) × 5 datasets 5 1
1 rank × 1 dataset 1 1

The opt-in semantic and safety implication of skip_timestamp_check
itself are unchanged — only the per-rank × per-dataset duplication
is suppressed.

How to checkout & try? (for the reviewer)

git fetch origin claude/skip-timestamp-warning-rank0
git checkout claude/skip-timestamp-warning-rank0
pytest -m "not gpu" -n auto tests/datasets/

Checklist

  • I have added Google-style docstrings to important functions and ensured function parameters are typed.
  • My PR includes policy-related changes.
    • If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

LeRobotDataset is constructed once per rank, so the existing
``logging.warning`` at the skip-timestamp-check branch fires
``num_processes`` x ``num_datasets`` times. For a 392-dataset
pretraining mixture on 8 GPUs that's ~3K identical log lines
between accelerate init and the first training step, drowning out
everything else.

Reuse the existing ``get_proc_accelerator`` pattern (already used
elsewhere in this file at L352 and L1287) to gate the warning to
``acc.is_main_process``, with a fall-through to logging when no
Accelerator is set so single-process dev/test runs are unchanged.

Behavior in practice:
  - 8-rank distributed run: 1x per dataset (was 8x)
  - single-process / CPU dev: unchanged
  - the underlying choice the user opted into (skip_timestamp_check
    on a heterogeneous mixture) and its safety implication are both
    unchanged
@shuheng-liu shuheng-liu added the optimization Optimizes the performance of something label May 6, 2026
@shuheng-liu shuheng-liu self-assigned this May 6, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 6, 2026

[claude-review] summary for commit 043c968

No blocking issues found. The 043c968 commit addresses the prior nit by wrapping the new test body in try/finally so _SKIP_TIMESTAMP_WARNED is captured up-front and restored in the finally clause — robust to test failure and prevents cross-test bleed-through within a pytest-xdist worker if a future test wants the warning to fire again.

Confirmed correctness end-to-end:

  • Module-level flag + rank-0 gate (acc is None or acc.is_main_process) at src/opentau/datasets/lerobot_dataset.py:1366 cleanly mirrors the established _CONTROL_MODE_WARNED dedup at line 1326. The acc is None fallthrough preserves single-process dev / pytest behavior.
  • global _SKIP_TIMESTAMP_WARNED is correctly declared before the assignment on line 1367.
  • No distributed-collective concern: each rank has its own module-level flag, only rank 0 ever logs / flips, and the flag isn't communicated across ranks (it doesn't need to be — non-rank-0 ranks short-circuit on is_main_process regardless of flag state).
  • The test at tests/datasets/test_datasets.py:1030 exercises the per-process dedup with two distinct repo_ids and the try/finally save/restore is the right shape — matches the convention requested in the prior round.

…ocess

skip_timestamp_check is a mixture-wide decision and the warning text
is identical for every dataset, so the previous "rank-0 only" gate
still left ~392 lines per run for a wide pretraining mixture.

Add a module-level _SKIP_TIMESTAMP_WARNED flag (mirroring the existing
_CONTROL_MODE_WARNED dedup pattern) so the warning fires once per
process, on the main rank only — once per run total instead of
num_processes × num_datasets.

The warning text now references the *first* dataset hit as an example
rather than the only one, since the flag is mixture-wide; users who
want a per-dataset listing can grep the resolved config.
@shuheng-liu shuheng-liu changed the title chore(datasets): gate skip-timestamp-check warning to rank 0 chore(datasets): collapse skip-timestamp-check warning to once-per-process May 6, 2026
Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[claude-review] inline finding for commit 3a36dce

See the summary comment on this PR for the rollup.

@shuheng-liu
Copy link
Copy Markdown
Member Author

@claude fix the missing test(s).

- addresses @shuheng-liu (missing test): added
  test_skip_timestamp_warning_emitted_once_per_process to
  tests/datasets/test_datasets.py mirroring the
  _CONTROL_MODE_WARNED warn-once test — constructs two datasets
  with skip_timestamp_check=True and asserts exactly one matching
  log record via caplog, with an upfront ld_mod._SKIP_TIMESTAMP_WARNED
  = False reset to make the test independent of prior state.

tests: passed — pytest -xvs tests/datasets/test_datasets.py::test_skip_timestamp_warning_emitted_once_per_process tests/datasets/test_datasets.py::test_control_mode_warning_emitted_once_per_repo

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@WilliamYue37 WilliamYue37 self-requested a review May 6, 2026 23:09
@shuheng-liu shuheng-liu marked this pull request as ready for review May 7, 2026 03:28
@shuheng-liu
Copy link
Copy Markdown
Member Author

@claude fix the nit

- addresses @shuheng-liu (nit): wrapped
  test_skip_timestamp_warning_emitted_once_per_process body in
  try/finally so the module-level _SKIP_TIMESTAMP_WARNED flag is
  restored to its original value after the test runs, even on
  failure. Prevents cross-test bleed-through within the same
  pytest-xdist worker if a future test wants to assert the warning
  fires again.

tests: passed — pytest -m "not gpu" -n auto tests/datasets/test_datasets.py
(2 pre-existing failures in test_do_not_use_imagenet_stats unrelated
to this change — HF cache misses for lerobot/droid_100 and
lerobot/aloha_mobile_cabinet parquet files)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 7, 2026

[claude-fix] Addressed in 043c968: wrapped the test body in try/finally so _SKIP_TIMESTAMP_WARNED is saved and restored to its original value, even on failure. This prevents cross-test bleed-through within the same pytest-xdist worker if a future test wants to assert the warning fires again.

tests: passed — pytest -m "not gpu" -n auto tests/datasets/test_datasets.py (42 passed; 2 pre-existing test_do_not_use_imagenet_stats failures are HF cache misses unrelated to this change).

@shuheng-liu shuheng-liu merged commit 4451ebe into main May 7, 2026
7 checks passed
@shuheng-liu shuheng-liu deleted the claude/skip-timestamp-warning-rank0 branch May 7, 2026 21:05
@claude claude Bot mentioned this pull request May 7, 2026
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

optimization Optimizes the performance of something

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants