Skip to content

feat(datasets): mixed-frequency training + fps metadata for pi07#323

Merged
shuheng-liu merged 7 commits into
mainfrom
claude/awesome-liskov-941c29
May 23, 2026
Merged

feat(datasets): mixed-frequency training + fps metadata for pi07#323
shuheng-liu merged 7 commits into
mainfrom
claude/awesome-liskov-941c29

Conversation

@shuheng-liu
Copy link
Copy Markdown
Member

@shuheng-liu shuheng-liu commented May 23, 2026

What this does

Adds opt-in mixed-frequency training for the dataset mixture and surfaces the dataset's native frame rate as an fps metadata field that pi07 / pi07_paligemma tokenize into the prefix prompt. The four pieces:

  1. DatasetMixtureConfig.action_freq: float | None = None (was float = 30.0). When None (new default), each dataset is sampled at its native fps with no resampling — resolve_delta_timestamps substitutes ds_meta.fps so a chunk of N actions becomes N consecutive native frames regardless of the dataset's rate. This is also a bug fix for sub-30-Hz single-dataset runs: the old 30 Hz default produced silently duplicated/skipped action targets for 10 / 15 / 20 Hz datasets via nearest-neighbor resampling. Setting a positive float preserves the legacy "resample everything to this rate" behaviour.

  2. DatasetMixtureConfig.emit_fps: bool = False (opt-in). When True, __getitem__ returns the effective per-sample frame rate under the fps key (torch.long scalar) with fps_is_pad=False — equal to action_freq when set, else meta.fps. VQA samples (no temporal axis) emit fps=0, fps_is_pad=True so heterogeneous VLA + VQA batches stay schema-aligned. fps is not rolled by metadata_drop_*_prob — it's an intrinsic property of the chunk, not a noisy label.

  3. pi07 / pi07_paligemma prepare_metadata × 4 files. All four implementations now tokenize fps with the "FPS: N, " header (uppercase to match the sibling Speed:/Quality:/Mistake:/Robot:/Control: labels), slotted between Robot: and Control: (final order: Speed → Quality → Mistake → Robot → FPS → Control). Missing or padded fps drops the segment entirely (no header, no comma).

  4. EnvMetadataConfig.emit_fps: bool = False (opt-in, parallels the training-side toggle). When True, add_eval_metadata broadcasts cfg.env.fps as the fps metadata key, matching the training-side contract.

Backward-compat notes for reviewers

  • action_freq default changed from 30.0 to None. Sub-30-Hz single-dataset runs (10 / 15 / 20 Hz) are silently improved (no more duplicate action targets via nearest-neighbor). Single-dataset runs at exactly 30 Hz are byte-identical. Above-30-Hz runs (50 / 60 Hz) see their prediction horizon shrink because the chunk now covers fewer real seconds at the higher native rate — pin action_freq=30.0 explicitly to preserve the old horizon.
  • emit_fps defaults to False on both DatasetMixtureConfig and EnvMetadataConfig. Pre-PR checkpoints resume cleanly because no FPS: segment is added to the policy prefix. New training runs that want fps conditioning set emit_fps=True on both sides (training + eval) explicitly.

How it was tested

  • Added unit tests covering the new contract:
    • tests/configs/test_default.pyaction_freq=None accepted, default is None, emit_fps defaults False.
    • tests/datasets/test_delta_timestamps.pyresolve_delta_timestamps substitutes native fps when action_freq=None; two datasets with different fps produce different per-action timestamps.
    • tests/datasets/test_optional_keys.py_emit_optional_keys is a no-op under default emit_fps=False; when explicitly enabled, emits the effective fps (action_freq if set, else meta.fps) as torch.long with fps_is_pad=False; fps stays non-pad even with metadata_drop_*_prob=1.0; VQA samples emit fps=0, fps_is_pad=True so heterogeneous default-collate produces aligned (B,) tensors.
    • tests/policies/test_pi07_cpu.py, tests/policies/test_pi07_paligemma_{high_level_planner,low_level}.pyprepare_metadata content + ordering: "FPS: N, " appears in the prefix between Robot: and Control: when present, segment omitted entirely when padded or absent, no stray ", ," artefacts. Mixed-fps_is_pad batches (LeRobot + VQA) honour per-sample padding row-by-row.
    • tests/envs/test_configs.py, tests/envs/test_utils.pyEnvMetadataConfig.emit_fps defaults False; add_eval_metadata broadcasts cfg.env.fps as torch.long only when emit_fps=True is opted in.
  • pre-commit run --all-files — clean.
  • pytest -m "not gpu" -n auto — full suite passes (modulo pre-existing LIBERO-assets / timing-flaky failures unrelated to this branch).
  • pytest -m "gpu" -n 0 (RTX 5090) — 19 passed, 10 skipped, 0 failed (~2:46). All skips are pre-existing unconditional @pytest.mark.skip markers (memory or environment), unaffected by this branch.

Nightly regression tests still need to land in the scheduled job.

How to checkout & try? (for the reviewer)

git fetch origin claude/awesome-liskov-941c29
git checkout claude/awesome-liskov-941c29
uv sync --extra dev --extra libero

Run the focused test slice:

pytest -sx -m "not gpu" \
  tests/configs/test_default.py \
  tests/datasets/test_delta_timestamps.py \
  tests/datasets/test_optional_keys.py \
  tests/envs/test_configs.py tests/envs/test_utils.py \
  tests/policies/test_pi07_cpu.py \
  tests/policies/test_pi07_paligemma_high_level_planner.py \
  tests/policies/test_pi07_paligemma_low_level.py

Eyeball the new prefix string in a Python REPL (no model weights needed — stubbed tokenizer):

from types import SimpleNamespace
import torch
from opentau.policies.pi07.low_level.modeling_pi07_low_level import PI07LowLevelPolicy

captured = []
def stub(metadata, **kw):
    captured.extend(metadata)
    return {"input_ids": torch.zeros((len(metadata), 4), dtype=torch.long),
            "attention_mask": torch.zeros((len(metadata), 4), dtype=torch.long)}
tok = SimpleNamespace(); tok.__call__ = stub
fake = SimpleNamespace(language_tokenizer=tok, config=SimpleNamespace(metadata_max_length=4))
batch = {
    "state": torch.zeros(1, 1),
    "fps": torch.tensor([20], dtype=torch.long),
    "fps_is_pad": torch.tensor([False]),
    "robot_type": ["franka"],
    "control_mode": ["joint"],
}
PI07LowLevelPolicy.prepare_metadata(fake, batch)
print(captured[0])
# -> 'Metadata: Robot: franka,  FPS: 20,  Control: joint, '

Checklist

  • I have added Google-style docstrings to important functions and ensured function parameters are typed.
  • My PR includes policy-related changes.
    • If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

Note: Before submitting this PR, please read the contributor guideline.

- DatasetMixtureConfig.action_freq is now Optional[float] and defaults
  to None, disabling resampling so each dataset is sampled at its
  native fps. resolve_delta_timestamps substitutes ds_meta.fps when
  action_freq is None so a chunk of N actions is N consecutive native
  frames regardless of the dataset's rate.
- New DatasetMixtureConfig.emit_fps (default True) makes __getitem__
  emit the dataset's native meta.fps as a torch.long sample key
  (always non-pad, no participation in metadata_drop_*_prob).
- All four pi07 / pi07_paligemma prepare_metadata implementations now
  tokenize fps with the lowercase "fps: N, " header between Robot: and
  Control:, omitted entirely when fps_is_pad is True.
- New EnvMetadataConfig.emit_fps (default True) makes add_eval_metadata
  broadcast cfg.env.fps as the inference-time fps metadata key,
  matching the training contract end-to-end.
@shuheng-liu shuheng-liu added the feature New feature or request label May 23, 2026
@shuheng-liu shuheng-liu self-assigned this May 23, 2026
Copy link
Copy Markdown
Member Author

@shuheng-liu shuheng-liu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the diff end-to-end against 28db260 (true PR base). Implementation looks correct and tests are thorough — the four prepare_metadata updates are consistent (Speed → Quality → Mistake → Robot → fps → Control), _emit_optional_keys correctly bypasses the dropout rolls for fps, and add_eval_metadata mirrors the training contract. A few design-level concerns worth flagging before un-drafting:

1. Two defaults silently flip behaviour for existing configs / resumed checkpoints (main concern)

The PR makes two backward-incompatible defaults at once:

  • DatasetMixtureConfig.action_freq: 30.0None
  • DatasetMixtureConfig.emit_fps / EnvMetadataConfig.emit_fps: new, both default True

The combined effect on any existing config that doesn't explicitly set these:

  • A heterogeneous mixture that was implicitly being resampled to 30 Hz now uses each dataset's native rate, so a single batch can mix samples spanning different real-time horizons (chunk_size consecutive native frames). For a 10 Hz + 30 Hz mixture, this is a 3× shift in the supervised time window for half the batch — not "no-op when you don't opt in".
  • The PaliGemma / Gemma 3 prefix gains a new "fps: N, " segment the model has never seen. The PR body acknowledges this but treats it as a footnote: resuming any pre-PR checkpoint requires remembering to set both emit_fps toggles to False.

Suggest making at least one of these opt-in:

  • Cheapest fix: default emit_fps=False on both DatasetMixtureConfig and EnvMetadataConfig. New runs that want fps tokenization flip the flag explicitly; old configs are byte-identical.
  • Or: keep action_freq=30.0 default and make None an explicit "I know I'm doing mixed-frequency" opt-in.

Doing both — flipping the resample default and injecting a new prefix segment — without an opt-in step is a footgun for anyone with in-flight runs.

2. fps is emitted on every sample for policies that never read it

emit_fps lives on the mixture, so a pi0 / pi05 / pi05_mem / pi06 run gets fps + fps_is_pad in every sample dict even though only pi07 / pi07_paligemma tokenize them. The cost is negligible (one long + one bool), and the policy-side batch.get(..., default_pad) makes it safe — just noting the asymmetry. Tying emission to the policy type would be over-engineered; the current "emit always, ignore if unused" is fine if the default is False per concern #1.

3. Minor

  • int(self.meta.fps) in _emit_optional_keys: meta.fps is typed int (lerobot_dataset.py:487), so the cast is a no-op. No real bug, but if a future dataset ever stores a non-integer rate in info.json (29.97, 59.94) the tokenized prefix and resolve_delta_timestamps would diverge by a few percent. Probably fine to leave; the type contract is integer.
  • Lowercase "fps:" vs capitalized siblings (Speed:, Robot:) — deliberate per the PR body, and "fps" is conventionally lowercase, so reads fine. Just confirming the choice is intentional.
  • f"fps: {str(fps.item())}, " — redundant str() inside f-string, but matches the existing speed/quality/mistake style. Don't change for consistency.

Concern #1 is the only one I'd block on. The rest are nits / FYI.


Generated by Claude Code

`_emit_optional_keys` previously emitted `meta.fps` regardless of the
mixture's `action_freq`. When the user pinned `action_freq=30.0` on a
mixture containing a 50 Hz dataset, `resolve_delta_timestamps` resampled
the chunk to 30 Hz but the prefix still said `fps: 50, ` — telling the
policy a different rate than the chunk actually carried.

Capture `_action_freq` on `BaseDataset` and emit
`action_freq if set else meta.fps` so the tokenized fps always matches
the chunk's effective rate.
@shuheng-liu
Copy link
Copy Markdown
Member Author

Thanks for the careful review. Responses inline.

#1 — defaults

Both defaults (action_freq=None, emit_fps=True on both sides) were intentional design choices fixed during planning, with the trade-off documented in the PR body's "Backward-compat note". The reasoning: mixed-frequency is the primary new use case, and making the intended configuration the default avoids requiring two opt-in flags for the new feature to work end-to-end.

That said, the resume-from-checkpoint footgun is real. I see two viable knobs and want your call before flipping:

  • A. emit_fps=False default on both sides — keeps action_freq=None (new training picks the rate up automatically once the user flips emit_fps=True), but old checkpoints resume cleanly because the prefix doesn't gain a new segment by default. The cost is that silently using mixed-freq without fps conditioning is now reachable (no opt-in barrier for the resample change alone).
  • B. action_freq=30.0 default — old configs are byte-identical (both the resample behaviour and, indirectly, the prefix content: every dataset gets resampled to 30 Hz, so emit_fps=True produces a constant fps: 30, segment that pre-PR checkpoints still haven't seen). Doesn't actually solve the resume problem unless we also flip emit_fps.

The combo "A and B both flipped" makes the whole feature opt-in (safest, but two extra config lines per user). Happy to push whichever you pick; my mild preference is A if I have to choose unilaterally — it preserves the planning-conversation intent that mixed-freq is the new default while still letting old checkpoints resume.

#2 — fps emitted for non-pi07 policies

Confirmed. Cost is one torch.long + one torch.bool per sample, gated by emit_fps. Policy-type-aware emission would over-engineer for ~16 bytes. No change.

#3 — nits

  • int(self.meta.fps): leaving as defensive coding. If a future dataset ever stores 29.97 in info.json, int() floors it to 29 — predictably wrong rather than mysteriously wrong via tensor-cast.
  • Lowercase fps:: intentional per the original task description.
  • Redundant str() in f-string: matches the existing speed/quality/mistake lines. Leaving for consistency.

Separate bug surfaced while re-reading — fixed in 7236b5a

_emit_optional_keys was emitting meta.fps regardless of action_freq. When a user pins action_freq=30.0 on a mixture containing a 50 Hz dataset, resolve_delta_timestamps resamples the chunk to 30 Hz — but the prefix would still say fps: 50, , telling the policy a different rate than the chunk actually carries. Fixed by capturing _action_freq on BaseDataset and emitting action_freq if set else meta.fps. Two new tests pin the behaviour (test_fps_reports_action_freq_when_resampling, test_fps_reports_native_when_action_freq_none).

Copy link
Copy Markdown
Member Author

@shuheng-liu shuheng-liu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second-pass review found one concrete bug, plus some details to confirm. Posting separately from the first review for clarity.

Bug: emit_fps=True crashes any mixture containing a VQA dataset

VQADataset.__getitem__ calls self._to_standard_data_format(item) (vqa/base.py:142), which calls _emit_optional_keys. The new fps block runs unconditionally:

if self.emit_fps:
    standard_item["fps"] = torch.tensor(int(self.meta.fps), dtype=torch.long)

But:

  • VQADatasetMetadata(DatasetMetadata) (lerobot_dataset.py:305-308) is pass.
  • DatasetMetadata (lerobot_dataset.py:253-302) defines features / image_keys / video_keys / camera_keys / names / shapes — no fps property.
  • VQADataset.create_meta builds an info dict with no "fps" key (vqa/base.py:82-91).

So self.meta.fps is a hard AttributeError for VQA samples. Combined with emit_fps=True being the new default and applied through the mixture's dm.emit_fps, any heterogeneous mixture that includes a VQA dataset crashes at the first sample fetch with the default config.

This isn't covered by the new tests: _DummyBaseDataset in tests/datasets/test_optional_keys.py:84-92 stubs meta = SimpleNamespace(fps=meta_fps, info=...), so it never exercises the real VQADatasetMetadata path.

Heterogeneous co-training (VLA + VQA) is one of the headline OpenTau features called out in CLAUDE.md ("heterogeneous-dataset co-training") — this would break the default flow for any team using that recipe.

Fix options:

  1. Cheapest: gate the fps emission on the metadata actually having an fps attribute:

    if self.emit_fps and hasattr(self.meta, "fps"):
        ...

    Downside: silent — a VQA-in-mixture run produces samples without fps, and after collate that breaks default_collate when mixed with LeRobot samples in the same batch.

  2. Better: VQA explicitly emits a pad row (fps=torch.tensor(0, dtype=torch.long), fps_is_pad=torch.tensor(True)) so the batch always has consistent keys and the policy's *_is_pad path drops the segment. Either by overriding _emit_optional_keys in VQA, or by defining VQADatasetMetadata.fps = None and checking-None in the base emitter.

  3. Best — but ties into concern #1 from the first review: default emit_fps=False, then the VQA path is a no-op and existing configs don't change. The bug becomes opt-in.

Confirmed (not bugs, just noting)

  • _to_standard_data_format_emit_optional_keys → bf16 cast order is fine: fps is torch.long, fps_is_pad is torch.bool, neither is is_floating_point, so the cast loop skips them.
  • fps_is_pad = torch.tensor(False) is a 0-d scalar matching the other _is_pad flags emitted in the same method (speed_is_pad, quality_is_pad, mistake_is_pad). default_collate will stack them to (B,) consistently — no shape mismatch with the eval-side (B,) emission in add_eval_metadata.
  • resolve_delta_timestamps with action_freq=ds_meta.fps does produce per-dataset native-frame timestamps as advertised — confirmed against tests/datasets/test_delta_timestamps.py::test_resolve_delta_timestamps_native_fps_differs_per_dataset. The semantics change (frame-stride vs real-time-stride) for subsampled action_delta_indices was already implicit in concern #1.
  • Test coverage for the four prepare_metadata updates is good (Robot < fps < Control ordering, padded-omit, missing-key, partial-segments) — the only gap is the VQA-mixture path above.

Generated by Claude Code

`_emit_optional_keys` accessed `self.meta.fps` unconditionally when
`emit_fps=True`, raising AttributeError on VQA datasets — whose
`VQADatasetMetadata` (inheriting `pass` from `DatasetMetadata`) has no
`fps` property. Any heterogeneous VLA + VQA mixture trained under the
default config crashed at the first VQA sample fetch.

Surface `fps -> None` on the `DatasetMetadata` base (overridden to int
on `LeRobotDatasetMetadata`), then have `_emit_optional_keys` pad out
the row (`fps=0, fps_is_pad=True`) when meta.fps is None. The policy's
`prepare_metadata` already drops the `FPS:` segment for padded samples,
so heterogeneous batches stay schema-aligned and the prefix omits the
fps token for VQA rows.
@shuheng-liu
Copy link
Copy Markdown
Member Author

Good catch — real bug, fixed in 090b436.

What broke

_emit_optional_keys accessed self.meta.fps unconditionally when emit_fps=True. VQADatasetMetadata inherits pass from DatasetMetadata, neither of which defined fps, so every VQA sample fetch raised AttributeError. Heterogeneous VLA + VQA mixtures (one of OpenTau's headline use cases) crashed at the first sample under the default config.

The fix — your option 2

  • Surfaced fps: int | None as a property on DatasetMetadata (returns None by default; LeRobotDatasetMetadata.fps already overrode it to return the int from info["fps"], no change there).
  • _emit_optional_keys branches on native_fps is None and emits fps=0, fps_is_pad=True for VQA samples. The policy's prepare_metadata already drops the FPS: segment when fps_is_pad is True, so VQA rows get no fps token in the prefix while LeRobot rows in the same batch get their real rate. The pad row keeps the batch schema-aligned, which is the bit default_collate cares about.
  • Test coverage added in TestFpsEmissionVQA (4 tests): base + VQA metadata return None, VQA sample emits pad, VQA pad ignores action_freq (the rate would be meaningless for a sample with no actions), and heterogeneous VLA + VQA default-collate produces fps=[30, 0] / fps_is_pad=[False, True]. These exercise the real VQADatasetMetadata class, not the SimpleNamespace stub — closing the gap you flagged.

On the other findings

Your three "confirmed" items match what I'd verified locally — the bf16-cast loop skips torch.long/torch.bool so fps survives intact, the 0-d fps_is_pad collates to (B,) consistently with the eval-side emission, and resolve_delta_timestamps does land on per-dataset native frames.

The default question from your first review (concern #1) is still open — I haven't flipped anything yet, waiting on your call between A (emit_fps=False default), B (action_freq=30.0 default), or both.

@shuheng-liu
Copy link
Copy Markdown
Member Author

Decision on the defaults question (review concern #1): keep both as-is.

  • DatasetMixtureConfig.action_freq stays None (mixed-frequency is the default).
  • DatasetMixtureConfig.emit_fps and EnvMetadataConfig.emit_fps both stay True.

Rationale: the new defaults reflect the intended new behaviour — mixed-freq training with per-sample fps conditioning works end-to-end with zero config ceremony. The resume-from-checkpoint footgun for pre-PR runs is documented in the PR body's "Backward-compat note for reviewers"; users in that case set both emit_fps=False explicitly.

No code change needed for this. Closing the loop so reviewers know the trade-off was deliberate.

Copy link
Copy Markdown
Member Author

@shuheng-liu shuheng-liu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Third pass — three smaller observations on top of the two reviews above. None are blockers if concern #1 (default emit_fps=False) is adopted; they apply mostly to the emit_fps=True future state.

1. has_metadata is now always True under emit_fps=True — silently changes ablation semantics

The pi07 low-level uses _global_or_branch_decisions over has_metadata_local = bool(metadata_masks.any()) (pi07/low_level/modeling_pi07_low_level.py:1387-1390), and the HL planner gates on metadata_masks.any() (pi07/high_level_planner/modeling_pi07_high_level.py:1005). Pre-PR, a run with metadata_drop_all_prob=1.0 (or just no metadata in the dataset) produced batches where every sample's metadata was fully padded, the has_metadata branch evaluated to False, and the metadata block + trailing ";\n " were dropped entirely.

With emit_fps=True, fps tokens are always non-padded, so metadata_masks.any() is always True. The metadata block (now containing only "fps: N, ") is always included.

Consequences:

  • metadata_drop_all_prob=1.0 no longer means "no metadata at all" — it means "fps-only metadata". Anyone running this as a no-metadata ablation needs to additionally set emit_fps=False on the mixture (and EnvMetadataConfig.emit_fps=False at eval) to recover the previous semantics.
  • The CLAUDE.md rule #5 OR-reduction in _global_or_branch_decisions still fires correctly — it just always agrees True across ranks now. Not a correctness regression, just dead branch for the True-always case.

Worth a sentence in the PR body's "Backward-compat note" section.

2. metadata_max_length=52 is tighter now — worth a sanity-check

All four pi07 / pi07_paligemma configs default to metadata_max_length: int = 52. Pre-PR worst-case prefix: "Metadata: Speed: 100, Quality: 5, Mistake: True, Robot: <name>, Control: <mode>, " — roughly 22-26 BPE tokens with Gemma 3 / PaliGemma tokenizers depending on robot name. Post-PR adds " fps: NN, " for another 3-4 tokens. Still fits in 52, but the headroom for long robot_type strings (e.g. "so100_follower_arm") is narrower.

Suggest a one-line note in the config docstring or, better, a runtime warning when the tokenized prefix hits truncation=True with truncation=True, max_length=52 (the tokenizer silently drops tail tokens today — fps would survive since it slots before Control:, but Control: would be the first thing truncated).

3. Test gap: mixed-fps_is_pad batches

The new tests cover:

  • All samples non-pad (test_fps_present_emits_lowercase_segment)
  • All samples pad (test_fps_padded_omits_segment)
  • Key absent (test_fps_key_absent_omits_segment)

Not covered: a single batch where sample 0 has fps_is_pad=False and sample 1 has fps_is_pad=True. Today's production paths don't produce this mix (dataloader emits non-pad unconditionally, add_eval_metadata emits non-pad for all envs at once), so the gap is theoretical. But the policy code's per-sample zip(...) iteration is the only thing keeping it correct — a regression that broadcasts a batch-wide fps_is_pad would slip through current tests. One mixed-batch assertion would close that hole cheaply.


That's the end of my passes. Summary across all three reviews:

# Severity Finding
1 Blocker on default emit_fps=True + action_freq=None defaults silently change behaviour for resumed checkpoints and existing configs
2 Blocker bug VQA mixtures crash with AttributeError on self.meta.fps
3 Note in PR body has_metadata becomes always-True; ablation semantics shift
4 Sanity check metadata_max_length=52 headroom narrower with fps
5 Test gap No mixed-fps_is_pad batch coverage

Findings 1 and 2 want addressing before this lands. 3-5 are nice-to-haves.


Generated by Claude Code

… shift

- Add `test_fps_mixed_pad_across_batch` in all three fps test classes —
  closes the per-sample fps_is_pad coverage gap, now reached in
  production by heterogeneous LeRobot + VQA mixtures where the VQA pad
  row (fps=0, fps_is_pad=True) sits next to a LeRobot row in the same
  batch.
- concepts.rst: document the metadata_drop_all_prob=1.0 semantic shift
  under emit_fps=True (no longer "no metadata at all" — it's "fps-only
  metadata") and note metadata_max_length=52 headroom.
@shuheng-liu
Copy link
Copy Markdown
Member Author

Addressed in 1467b57:

#1metadata_drop_all_prob=1.0 semantics under emit_fps=True

Real semantic shift. Added a .. note:: block to concepts.rst (in the dropout table section) explaining that under the new default, metadata_drop_all_prob=1.0 means "fps-only metadata segment" instead of "no metadata segment at all". The note tells ablation users to also flip emit_fps=False on both sides to recover pre-PR behaviour. The has_metadata branch and the OR-reduction are still correct — they just always agree True for LeRobot samples now.

#2metadata_max_length=52 headroom

Added a one-line note in concepts.rst flagging that the new FPS: N, segment costs ~3-4 tokens, that the underlying tokenizer call uses truncation=True silently, and that users should bump metadata_max_length if they start seeing Control: get clipped. Skipped a runtime warning for now — too aggressive for a feature that "still fits in 52" as you noted.

#3 — Mixed-fps_is_pad batch coverage

Added test_fps_mixed_pad_across_batch to all three fps test classes (test_pi07_cpu.py ×2 via the [low, high] parametrize, test_pi07_paligemma_high_level_planner.py, test_pi07_paligemma_low_level.py). Each test builds a 2-sample batch with sample 0 at fps_is_pad=False and sample 1 at fps_is_pad=True and asserts:

  • Sample 0 keeps its FPS: 30, segment.
  • Sample 1's FPS: substring is absent.

This is the exact production path your second-pass review fix triggered — a heterogeneous LeRobot + VQA batch where the VQA pad row sits next to a LeRobot real-rate row.

22 fps tests pass; pre-commit clean.

Copy link
Copy Markdown
Member Author

@shuheng-liu shuheng-liu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fourth pass — one substantive new finding and a verification of the resume-checkpoint scenario.

Material finding: switching to native fps silently changes prediction-horizon semantics for non-30-Hz datasets

resolve_delta_timestamps with the new action_freq=None substitutes ds_meta.fps, then video sampling does nearest-neighbor at frame_indices_soft.round() (lerobot_dataset.py:1906). Comparing the three regimes against the old action_freq=30.0 default for chunk_size=50:

Native fps Old default (action_freq=30) New default (action_freq=None) Effect
20 Hz delta_ts = [i/30]round(i*20/30) = [0,1,1,2,3,3,4,5,5,…] (duplicates) delta_ts = [i/20][0,1,2,3,…] consecutive Bug fix — old default produced silently duplicated action targets
30 Hz [0,1,2,…] [0,1,2,…] identical
60 Hz delta_ts = [i/30]round(i*60/30) = [0,2,4,6,…] (subsampled) → 1.67 s horizon delta_ts = [i/60][0,1,2,3,…] consecutive → 0.83 s horizon Prediction horizon shrinks 2×

Two consequences the PR body doesn't acknowledge:

  1. The PR is also a bug fix for sub-30-Hz datasets. Anyone training on 10/15/20 Hz data with the old default was getting silently duplicated/skipped action targets. The "mixed-frequency" framing in the description undersells this — even single-dataset 20-Hz runs benefit. Worth a sentence in the PR body.
  2. For >30-Hz datasets, the prediction horizon collapses. A 60-Hz dataset run that used to predict 1.67 s of future motion now predicts 0.83 s. That's a different task — the model's action distribution, evaluation rollout length, and downstream planner timing all shift. This is a real semantic break for anyone using high-rate datasets (industrial arms, force sensors, anything ≥50 Hz), distinct from concern #1.

Together with concerns #1 and #2 (defaults + VQA crash), this strengthens the case for an opt-in default. A user who explicitly wrote action_freq=30.0 in their config gets that value preserved on resume — but a user who never set it (relied on the implicit 30 Hz default) now gets a horizon shift they didn't ask for.

Verified: resumed pre-PR checkpoints silently flip emit_fps on

Checked cfg.save_pretrained_save_pretrained (configs/train.py:310-319) — it uses draccus.dump(self, f), which serialises only fields present on the dataclass at save time. So a checkpoint saved before this PR has no emit_fps key in its JSON. When that JSON is reloaded after the PR, draccus instantiates DatasetMixtureConfig with the new default emit_fps=True, and EnvMetadataConfig.emit_fps=True similarly. The resumed run starts injecting "fps: N, " into the prefix on step 1 even though the previous training loop never tokenized fps.

This is the exact failure mode CLAUDE.md flags as "ML bugs hide in stochasticity" — a loss curve that looks plausible because the model adapts, masking that the policy is being silently retrained on a slightly different input distribution. The fix is mechanical: default emit_fps=False, document opt-in.

No new findings on

  • Key collisions: standard_item["fps"] doesn't conflict with anything else in _to_standard_data_format (built fresh per sample at lerobot_dataset.py:832).
  • bf16 cast: confirmed once more that the cast loop at lerobot_dataset.py:861-864 only touches is_floating_point, so fps (long) and fps_is_pad (bool) pass through unchanged.

Updated summary across all four passes

# Severity Finding
1 Blocker (default) emit_fps=True + action_freq=None flip behaviour silently for any existing config or resumed checkpoint
2 Blocker (bug) VQA mixtures crash with AttributeError on self.meta.fps
3 Note in PR body The PR is also a bug fix for sub-30-Hz datasets — currently un-credited
4 Note in PR body Prediction-horizon shrinks for >30-Hz datasets — distinct from #1, not in the description
5 Nice-to-have has_metadata always True → ablation semantics shift
6 Nice-to-have metadata_max_length=52 headroom narrower
7 Test gap No mixed-fps_is_pad batch coverage

That's the natural stopping point. Findings #1 and #2 still need addressing; #3 and #4 would tighten the PR description; #5-7 are polish.


Generated by Claude Code

`DatasetMixtureConfig.emit_fps` and `EnvMetadataConfig.emit_fps` now
default to `False` so pre-PR checkpoints resume cleanly — the policy's
metadata prefix doesn't gain an unfamiliar `FPS:` segment until the
user explicitly opts in. `action_freq=None` (mixed-frequency training)
stays the default; opt-in to `emit_fps=True` to surface per-sample
fps conditioning to the policy.

BaseDataset's dm-None fallback matched to False for the "no mixture
config" path (VQA-only / unit tests). Docs + tests updated to reflect
the new default — existing fps tests now set `emit_fps=True`
explicitly when they exercise the emission path.
@shuheng-liu
Copy link
Copy Markdown
Member Author

Flipped both emit_fps defaults to False in 60bc00c — your fourth-pass case (sub-30Hz silently-improves + 60Hz horizon shrinks + resume implicitly enables fps tokenization) was compelling enough that the previous "keep True" call was wrong. action_freq=None stays the new default (mixed-frequency is the intended new behaviour); emit_fps is the opt-in barrier.

Recap of what's now opt-in vs. silently changed

Knob Pre-PR default Post-PR default Resumed-checkpoint behaviour
action_freq implicit 30.0 None Sub-30Hz: silently improved (no more duplicate-target bug). 30Hz: byte-identical. >30Hz: horizon shrinks — users pin action_freq=30.0 to preserve.
DatasetMixtureConfig.emit_fps n/a False (new) No FPS: segment added to prefix. Resume is clean.
EnvMetadataConfig.emit_fps n/a False (new) Same — eval prefix unchanged.

PR body's "Backward-compat notes" section now spells out the sub-30Hz fix and the >30Hz horizon-shrink as distinct items (your findings #3 and #4).

Tests updated

  • _DummyBaseDataset fixture default flipped to emit_fps=False to match production; tests that exercise the emission path now set emit_fps=True explicitly.
  • test_emit_fps_defaults_to_false / test_emit_fps_can_be_enabled replace the pre-flip counterparts.
  • add_eval_metadata tests: the "all-None default" now asserts no keys are injected (was "only fps/fps_is_pad"); the fps-emission tests opt in to emit_fps=True explicitly.

764 CPU tests pass; pre-commit clean. Picking up resumed pre-PR checkpoints is now a true no-op for the prefix.

That should close the third-pass items (#3 ablation note still in concepts.rst — still applies under explicit emit_fps=True runs) and the fourth-pass blockers.

@shuheng-liu shuheng-liu added the bug Something isn't working label May 23, 2026
Copy link
Copy Markdown
Member Author

@shuheng-liu shuheng-liu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reviewed after the new commits. Both blockers from earlier passes are properly addressed, plus one extra correctness fix I missed.

Finding Commit Resolution
#1 Default-flip silent backward incompat 60bc00c emit_fps=False on both DatasetMixtureConfig and EnvMetadataConfig; BaseDataset dm-None fallback matched to False
#2 VQA crash 090b436 DatasetMetadata.fps -> None base property, LeRobotDatasetMetadata overrides to int, _emit_optional_keys pads (fps=0, fps_is_pad=True) when meta.fps is None — heterogeneous batches stay schema-aligned
#5 has_metadata always-True semantic shift 1467b57 Documented in concepts.rst
#6 metadata_max_length=52 headroom 1467b57 Documented
#7 Mixed-pad batch coverage 1467b57 test_fps_mixed_pad_across_batch added in all three fps test classes
Cosmetic: lowercase fps: fe04ffd Capitalized to FPS: for sibling consistency

Extra fix 7236b5a (caught by author, not by me): _emit_optional_keys now emits the effective per-sample fps (action_freq if set else meta.fps) rather than always the native rate. Previously, pinning action_freq=30.0 on a 50-Hz mixture would tokenize FPS: 50 while the chunk had been resampled to 30 Hz — a real prefix-vs-chunk-timing misalignment. Nice find.

action_freq=None default stays. With emit_fps=False now opt-in, the prefix doesn't change for resumed checkpoints, and the prediction-horizon shift on >30-Hz datasets (finding #4 from pass 4) becomes a deliberate config decision rather than a silent default flip — acceptable.

CI: pre-commit and check-checklist green; CPU tests still running. Lifting my hold — looks good to merge once CPU tests pass.


Generated by Claude Code

@shuheng-liu shuheng-liu marked this pull request as ready for review May 23, 2026 03:45
@shuheng-liu shuheng-liu merged commit b1e1cde into main May 23, 2026
8 checks passed
@shuheng-liu shuheng-liu deleted the claude/awesome-liskov-941c29 branch May 23, 2026 03:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working feature New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant