Skip to content

fix(datasets): allow action_freq=None in WeightedDatasetMixture signature#325

Merged
shuheng-liu merged 1 commit into
mainfrom
claude/interesting-shockley-a3ae4e
May 23, 2026
Merged

fix(datasets): allow action_freq=None in WeightedDatasetMixture signature#325
shuheng-liu merged 1 commit into
mainfrom
claude/interesting-shockley-a3ae4e

Conversation

@shuheng-liu
Copy link
Copy Markdown
Member

@shuheng-liu shuheng-liu commented May 23, 2026

What this does

Tightens the type annotation on WeightedDatasetMixture.__init__ so that
action_freq is correctly declared as float | None instead of float.

Flagged while reviewing #324: after #323 made
DatasetMixtureConfig.action_freq float | None (mixed-frequency
training, each dataset at its native fps), several call sites can legitimately
forward None to the mixture constructor:

The attribute is stored as informational state and never used arithmetically
inside WeightedDatasetMixture (downstream consumers already guard the
None case: lerobot_dataset.py:1038 checks if self._action_freq is not None,
scripts/fit_fast_tokenizer.py formats None explicitly), so there is no
runtime bug — only the signature was lying. The project doesn't run a type
checker in pre-commit, so this would otherwise land silently.

Changes:

  • WeightedDatasetMixture.__init__: action_freq: floataction_freq: Optional[float],
    with a docstring entry explaining the None semantics.
  • Module-level docstring example now shows both the shared-rate
    (action_freq=30.0) and mixed-frequency (action_freq=None) cases.
  • Stored attribute annotated as Optional[float].

No call site needed None-narrowing fixes — all downstream consumers already
handle None.

Optional/List style matches the rest of dataset_mixture.py (the file
already imports both).

How it was tested

  • Added TestWeightedDatasetMixture::test_init_action_freq_none in
    tests/datasets/test_dataset_mixture.py: constructs the mixture with
    action_freq=None and asserts the attribute round-trips as None.
  • The end-to-end lerobot_dataset.py:1038 branch (emit_fps=True +
    action_freq=None → emit native meta.fps) is already covered by
    tests/datasets/test_optional_keys.py::TestEmitFps::test_fps_reports_native_when_action_freq_none
    (added in feat(datasets): mixed-frequency training + fps metadata for pi07 #323), so this PR's new test is intentionally scoped to the
    signature-level round-trip.
  • uv run pre-commit run --files src/opentau/datasets/dataset_mixture.py tests/datasets/test_dataset_mixture.py — all hooks pass.
  • uv run pytest -m "not gpu" -n auto tests/datasets — 439 passed, 8 skipped.

No policy / training-loop changes, so GPU pytests and nightly regression
tests are not required.

How to checkout & try? (for the reviewer)

gh pr checkout 325
uv run pytest -sx tests/datasets/test_dataset_mixture.py::TestWeightedDatasetMixture::test_init_action_freq_none

Checklist

  • I have added Google-style docstrings to important functions and ensured function parameters are typed.
  • My PR includes policy-related changes.
    • If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

@shuheng-liu shuheng-liu self-assigned this May 23, 2026
@shuheng-liu shuheng-liu added bug Something isn't working and removed 🐛 Bug labels May 23, 2026
Copy link
Copy Markdown
Member Author

Review

LGTM. Surgical annotation-only fix, claims verified against main:

Verified

  • DatasetMixtureConfig.action_freq is float | None (configs/default.py:293), so call sites at datasets/factory.py:376,380 and scripts/fit_fast_tokenizer.py:714 can legitimately forward None.
  • action_freq is never used arithmetically inside WeightedDatasetMixture — only assigned to self.action_freq. grep confirms no other reads inside the class.
  • Downstream consumers already guard None:
    • datasets/lerobot_dataset.py:1038int(self._action_freq) if self._action_freq is not None else int(native_fps)
    • datasets/factory.py:138if action_freq is None: action_freq = ds_meta.fps
    • scripts/fit_fast_tokenizer.py:1033 — formats None explicitly for logging
  • Optional[float] style matches the file's existing from typing import List, Optional (line 71).

Nits (non-blocking)

  1. PR body says _build_mixture_parallel (L711) but the action_freq arg lands on L714; the WeightedDatasetMixture( call opens on L710. Trivial line drift.
  2. Style mismatch with DatasetMixtureConfig which uses float | None (PEP 604). The PR's Optional[float] matches the local file convention, which is the right call — flagging only so the divergence is intentional and visible.
  3. test_init_action_freq_none is marked @pytest.mark.slow # 1 sec. Borderline for the slow marker (pyproject.toml definition is ">1s"), but consistent with siblings in TestWeightedDatasetMixture that hit the same datasets_factory setup cost. Fine to keep.

Coverage
The new test asserts the attribute round-trips but doesn't exercise the downstream lerobot_dataset.py:1038 None-branch end-to-end. Not asking you to add one here — the relevant code path is older than #323 and presumably covered by the #323 PR's test suite — but worth confirming there's a test that actually runs emit_fps=True with action_freq=None somewhere (if not, that's a follow-up gap, not a blocker for this PR).

PR template: all required sections present, docstrings/types box checked, policy boxes correctly unchecked (no policy-related changes here).

Ready to merge once CI is green.


Generated by Claude Code

@shuheng-liu
Copy link
Copy Markdown
Member Author

Thanks for the review. Acted on the nits:

  1. Line drift — fixed in the PR body (L711L714). The actual action_freq arg lands at fit_fast_tokenizer.py:714; the WeightedDatasetMixture( call opens on L710.
  2. Optional[float] vs float | None — keeping the local-file convention as you noted. Divergence is intentional.
  3. @pytest.mark.slow — keeping; matches the adjacent test_init_success which has the same datasets_factory(...) setup cost.

Coverage note — already covered. tests/datasets/test_optional_keys.py:797::test_fps_reports_native_when_action_freq_none (added in #323) exercises exactly the lerobot_dataset.py:1038 None-branch end-to-end: builds a _DummyBaseDataset with emit_fps=True, meta_fps=50, action_freq=None, calls _to_standard_data_format, and asserts out["fps"].item() == 50 (native, not the absent resample target). The PR body now mentions this so reviewers don't have to re-verify.

This PR's new test is intentionally scoped to the constructor-level round-trip — that's the gap the wrong signature actually opened.

Copy link
Copy Markdown
Member Author

Re-review

Second pass — code diff unchanged from the prior review, only the PR body was updated. All four prior items are resolved:

  1. Line drift (L711 → L714) — confirmed against fit_fast_tokenizer.py:710-715. Fixed in body.
  2. Optional[float] vs float | None — acknowledged as intentional local-style match.
  3. @pytest.mark.slow marker — kept; matches test_init_success sibling.
  4. End-to-end coverage — verified tests/datasets/test_optional_keys.py:797::test_fps_reports_native_when_action_freq_none exists and exercises exactly the lerobot_dataset.py:1038 None-branch: _DummyBaseDataset(emit_fps=True, meta_fps=50, action_freq=None) → asserts out["fps"].item() == 50. The branch is covered.

CI status: all 7 check runs are success (CPU Tests, Pre-commit, check-checklist, review). mergeable_state is still blocked only because the PR is in draft.

One small docstring nit I missed last pass (non-blocking): the new test docstring at test_dataset_mixture.py:234 reads "Companion to the signature change made alongside PR #324". A reader 6 months from now will wonder why this test references #324 rather than this PR (#325) or the upstream config change (#323). Not worth a re-spin — flagging only because docstrings outlive their author's memory.

One follow-up suggestion (out of scope, do separately if at all): the PR notes "The project doesn't run a type checker in pre-commit, so this would otherwise land silently." If signature/annotation drift becomes a recurring class of bugs, adding mypy --strict or pyright to the dev extra and gating on a narrow set of files (configs + dataset signatures) would catch this category at PR time. Not for this PR.

Ready to mark ready-for-review and merge.


Generated by Claude Code

@shuheng-liu shuheng-liu marked this pull request as ready for review May 23, 2026 05:22
@shuheng-liu shuheng-liu merged commit e78db4b into main May 23, 2026
7 checks passed
@shuheng-liu shuheng-liu deleted the claude/interesting-shockley-a3ae4e branch May 23, 2026 05:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant