fix(pi07_paligemma): gate optional prefix tokens, fix cross-att lengt… by akshay18iitg · Pull Request #205 · TensorAuto/OpenTau

akshay18iitg · 2026-04-29T20:06:00Z

…h,

What this does

fix NameError
embed_prefix now only adds response/subgoal/metadata blocks when at least one sample in the batch has real (non-pad) tokens. Unconditional embedding injected spurious causal-block boundaries (att_masks=[1,...]) that corrupted the cumsum for every subsequent token even when all masks were False (response_drop_prob=1.0 / subgoal_drop_prob=1.0).
Subgoal header ("Subgoal: ") and footer (", ") are now inside if subgoal_images:. They were always-present real tokens that added spurious cross-attention signal and position-ID shifts in the no-subgoal case.
num_cross_att_tokens in forward() now subtracts both the "Action: " indicator length and discrete_action_max_length, matching pi05's discrete_action_indicator_max_length logic. Previously the indicator tokens leaked into the action expert's cross-attention at training time but were absent at inference, causing a train/inference mismatch. prefix_offsets updated to match.
prepare_subgoal_images: track last_subgoal_img / last_mask before the empty-cameras fallback loop to avoid NameError when no subgoal keys are present in the batch and empty_cameras > 0.
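
A minimal sketch of how an all-padded optional block can still change attention between real tokens. It assumes the pi0-style convention in which the running sum of att_masks assigns block IDs and a query may attend a key only if the key's block ID is not larger. The helper names and the 7-token layout are hypothetical and are not taken from the PR.

```python
from itertools import accumulate


def attention_allowed(att_masks, pad_masks):
    """Return allowed[q][k] under a cumsum-based block rule (assumed convention)."""
    block_ids = list(accumulate(att_masks))  # running sum of the 0/1 boundary flags
    n = len(block_ids)
    return [
        [bool(pad_masks[q] and pad_masks[k] and block_ids[k] <= block_ids[q]) for k in range(n)]
        for q in range(n)
    ]


def real_token_view(allowed, pad_masks):
    """Drop rows and columns of padded tokens so two layouts can be compared."""
    keep = [i for i, present in enumerate(pad_masks) if present]
    return [[allowed[q][k] for k in keep] for q in keep]


# Hypothetical layout: 3 image/language tokens, an optional 2-token response
# block, then 2 state tokens.
gated_att, gated_pad = [0, 0, 0, 0, 0], [True] * 5   # response block omitted
ungated_att = [0, 0, 0, 1, 0, 0, 0]                  # padded block still opens a new block
ungated_pad = [True, True, True, False, False, True, True]

gated_view = real_token_view(attention_allowed(gated_att, gated_pad), gated_pad)
ungated_view = real_token_view(attention_allowed(ungated_att, ungated_pad), ungated_pad)

print(gated_view[0][3])    # first token attending the first state token, block omitted
print(ungated_view[0][3])  # same pair with the padded block embedded
```

In this toy layout the first call prints True and the second prints False: the boundary flag of the padded block raises the block ID of every later token, so earlier real tokens lose access to them even though no response token is present.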

Issues addressed:

When n_obs_steps > 1 and obs_history_is_pad is True, previously only the history images were zeroed out, but SpaceTimeSiglip still attended to them. This PR adds temporal attention masking so that padded history images are masked out.
When n_obs_steps > 1 and obs_history_is_pad is True, the current state was also padded. This PR fixes that.
Subgoal images were encoded with Siglip rather than SpaceTimeSiglip. This PR fixes that.
The train-time dropout of subgoal, response, and metadata was not consistent with test-time dropout. This PR fixes that.
The RNG divergence error was caused by dropout in the model weights. Dropout is now set to 0.0 so that pi07_paligemma behaves like pi05 under specific configs.

How was it tested

TestPI07LowLevelPlannerObsHistoryRegression

test_single_frame_shape_and_mask_true

Tested: Encoder at T=1 ignores obs_history_is_pad.
Expected: Output (B, num_video_tokens, H); None / all-True / all-False pads → identical outputs.
test_temporal_mask_is_history_padded

Tested: Temporal mask when only the current frame is real.
Expected: Shape (B·N,1,T,T); past keys fully blocked; current self-only; past cannot see current; ≤1 attendable key per row; mask repeated per patch within a sample.
test_temporal_mask_all_valid_is_causal

Tested: Temporal mask with no padded frames.
Expected: Standard causal lower-triangle; row i has i+1 allowed keys.
test_temporal_mask_none_fallback_matches_all_padded

Tested: Explicit “only current real” pad pattern vs the same pattern duplicated.
Expected: Two explicitly built masks are exactly equal (stand-in for the forward None fallback tensor).
test_temporal_mask_mixed_batch_per_sample_independence

Tested: Different pad rows in one batch and a partial-history row.
Expected: Each row’s mask equals solo build on that row; padded vs valid rows differ; partial row has the specified allow/forbid cells.
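
The expectations above are consistent with a temporal mask of the form "causal AND key frame is not padded". The sketch below builds that mask for a single sample in plain Python. The function name and the exact rule are assumptions inferred from the test descriptions, not the encoder's actual code, and the per-patch repetition to shape (B·N,1,T,T) is omitted.

```python
def build_temporal_mask(is_pad):
    """Build a T x T allow-matrix for one sample.

    is_pad[t] is True when frame t is padding. Query frame q may attend key
    frame k only if k is not in the future (k <= q) and k is a real frame.
    """
    num_frames = len(is_pad)
    return [
        [(k <= q) and not is_pad[k] for k in range(num_frames)]
        for q in range(num_frames)
    ]


# Only the current (last) frame is real: history rows see nothing,
# the current row sees only itself.
history_padded = build_temporal_mask([True, True, False])

# No padding: the standard causal lower triangle.
all_valid = build_temporal_mask([False, False, False])
allowed_per_row = [sum(row) for row in all_valid]
```

Under this rule a padded-history sample has at most one attendable key per row, and an unpadded sample has i+1 allowed keys in row i, which matches the two temporal-mask tests listed above.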

TestPI07LowLevelPlannerStateEmbedding

test_state_embedding_single_frame

Tested: T=1 state block in prefix.
Expected: Embeddings (1,1,H); state pad all present; embeddings not all zero.
test_state_embedding_single_frame_pad_irrelevant

Tested: T=1 with different obs_history_is_pad.
Expected: Same state pad slice for None / True / False; still all present.
test_state_embedding_all_history_padded

Tested: T>1 with only last frame real.
Expected: Embeddings (1,T,H); only last state slot present; last slot not all zero.
test_state_embedding_all_pad_true_matches_history_padded

Tested: obs_history_is_pad all True vs usual […,False] on last.
Expected: Same state pad slice and same state embeddings (current forced present).
test_state_embedding_all_valid

Tested: T>1, no padding.
Expected: All T slots present; each timestep embedding not all zero.
test_state_embedding_none_fallback

Tested: obs_history_is_pad None vs explicit padded-history pattern.
Expected: Identical state pad slices.
test_state_embedding_mixed_batch

Tested: Three fixed per-row pad patterns.
Expected: State pad rows [F,F,F,T], [T,T,T,T], [F,T,T,T] (presence = not pad, with last forced).
test_forward_path_invokes_state_proj

Tested: Multi-step state projection for the prefix path.
Expected: Exactly one state_proj call with input shape (B,T,D).
test_sample_actions_path_invokes_state_proj

Tested: Single-step state projection.
Expected: Exactly one call with shape (B,1,D).
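
The state-slot rule these tests describe ("presence = not pad, with last forced") can be sketched as follows. The function name, the n_obs_steps argument, and the input patterns are illustrative assumptions chosen to reproduce the expected rows quoted above; the None fallback follows the test that equates it with a fully padded history.

```python
def state_present(obs_history_is_pad, n_obs_steps):
    """Return per-timestep presence flags for the state block of one sample."""
    if obs_history_is_pad is None:
        # Assumed fallback: treat every history step as padded.
        obs_history_is_pad = [True] * (n_obs_steps - 1) + [False]
    present = [not padded for padded in obs_history_is_pad]
    present[-1] = True  # the current state is always kept
    return present


# Hypothetical inputs that yield the three expected rows in the mixed-batch test.
rows = [
    state_present([True, True, True, False], 4),
    state_present([False, False, False, False], 4),
    state_present([True, False, False, False], 4),
]
```

Forcing the last slot is what makes an all-True obs_history_is_pad behave like the usual [..., False] pattern.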

TestPI07LowLevelPlannerResponseEmbedding

test_empty_response_comma_and_tokens_padded

Tested: Empty response after prepare_response.
Expected: response_masks all false; response comma and response spans in prefix pad all absent.
test_nonempty_response_comma_and_tokens_unmasked

Tested: Non-empty response.
Expected: Some response_masks true; comma span all present; response span equals response_masks.
test_mixed_batch_response_masking

Tested: ["", real text] batch.
Expected: Row0 comma+response absent; row1 comma present; response span equals row masks.
test_response_masking_forward_vs_sample_actions_parity

Tested: Stability and equivalence of preparation; empty vs real differs; missing key vs "".
Expected: Duplicate prepares → equal tokens/masks; omit key equals explicit empty; duplicate embed_prefix pad_masks equal; empty vs real differ in comma and response spans.

TestPI07LowLevelPlannerMetadataEmbedding

test_all_metadata_dropped_md_comma_and_tokens_padded

Tested: All fields dropped via *_is_pad.
Expected: metadata_masks all false; metadata comma and metadata spans all absent.

test_all_metadata_present_md_comma_unmasked

Tested: All fields present.
Expected: metadata_masks has some true; metadata comma all present; metadata span equals metadata_masks.
test_metadata_is_pad_exhaustive_single_sample (8 combinations)

Tested: Every (speed_pad, quality_pad, mistake_pad) triple vs sample_has_metadata and prefix slices.
Expected: metadata_masks.any() iff not all three dropped; metadata comma matches metadata_masks.any(dim=1) broadcast; metadata span equals metadata_masks.
test_mixed_batch_metadata_masking

Tested: Three-row batch with different pad patterns.
Expected: Comma row masks [absent, present, present] matching per-row any-metadata; each metadata row equals that row’s metadata_masks.
test_metadata_missing_keys_matches_explicit_all_pad

Tested: Only-state batch vs explicit all-dropped metadata batch.
Expected: prepare_metadata outputs identical; full embed_prefix pad_masks identical.
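
As a rough illustration of the exhaustive 8-combination check, the sketch below enumerates every (speed_pad, quality_pad, mistake_pad) triple and derives the metadata comma flag as "any metadata token present". The field names come from the test description; the fixed two-token span per field and the helper names are assumptions.

```python
from itertools import product

FIELDS = ("speed", "quality", "mistake")
TOKENS_PER_FIELD = 2  # hypothetical span length per metadata field


def metadata_masks(is_pad):
    """Return per-token presence flags; a dropped field contributes only False."""
    masks = []
    for field in FIELDS:
        masks.extend([not is_pad[field]] * TOKENS_PER_FIELD)
    return masks


def comma_present(masks):
    """The metadata comma is kept only when at least one metadata token is real."""
    return any(masks)


results = {}
for combo in product([False, True], repeat=len(FIELDS)):
    is_pad = dict(zip(FIELDS, combo))
    results[combo] = comma_present(metadata_masks(is_pad))
```

Only the all-dropped triple leaves the comma absent; the other seven keep it.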

TestPI07LowLevelPlannerSubgoalEmbedding

test_no_subgoal_keys_sg_block_fully_padded

Tested: Batch with only state (no subgoal keys).
Expected: sample_has_subgoal false; subgoal comma, header, and all subgoal vision tokens absent.
test_subgoal_present_sg_block_unmasked

Tested: Both subgoal cameras present and not pad-dropped.
Expected: sample_has_subgoal true; comma+header present; each camera’s vision span equals that camera’s mask broadcast to tokens.
test_subgoal_tensors_all_is_pad_matches_missing_keys

Tested: Tensors + all subgoal_is_pad vs missing keys.
Expected: prepare_subgoal_images outputs pairwise equal; embed_prefix pad_masks equal.
test_mixed_batch_subgoal_masking

Tested: subgoal_is_pad [True, False] with tensors for both rows.
Expected: sample_has_subgoal == [False, True]; row0 subgoal comma/header/vision absent; row1 comma/header present; row1 full vision span present.
test_subgoal_sample_has_or_across_cameras

Tested: One camera all-false mask, one all-true.
Expected: sample_has_subgoal true; subgoal comma present (OR across cameras).
test_omit_subgoal_is_pad_defaults_all_pad

Tested: Subgoal images present but subgoal_is_pad omitted.
Expected: Masks all false; sample_has_subgoal false; subgoal comma span absent.
test_subgoal_embed_prefix_parity

Tested: Same batch prepared twice.
Expected: embed_prefix pad_masks bitwise equal between runs.
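
The per-sample gating that these tests exercise can be sketched as an OR across cameras followed by a broadcast to the header tokens. This is a plain-Python stand-in (the fix commit further down describes the real reduction as torch.stack(subgoal_img_masks).any(dim=0)); the helper names and header length are hypothetical.

```python
def sample_has_subgoal(camera_masks):
    """camera_masks[c][b] is True when camera c has a real subgoal for sample b."""
    return [any(per_sample) for per_sample in zip(*camera_masks)]


def header_pad_mask(has_subgoal, header_len):
    """Broadcast the per-sample flag over the subgoal header tokens."""
    return [[present] * header_len for present in has_subgoal]


# One camera all-False, one all-True: every sample still counts as having a subgoal.
or_across_cameras = sample_has_subgoal([[False, False], [True, True]])

# Mixed batch: sample 0 padded, sample 1 real.
mixed = sample_has_subgoal([[False, True], [False, True]])
mixed_header = header_pad_mask(mixed, header_len=3)
```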

Checklist

I have added Google-style docstrings to important functions and ensured function parameters are typed.
My PR includes policy-related changes.
- If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

Note: Before submitting this PR, please read the contributor guideline.

…h, fix NameError

- embed_prefix now only adds response/subgoal/metadata blocks when at least one sample in the batch has real (non-pad) tokens. Unconditional embedding injected spurious causal-block boundaries (att_masks=[1,...]) that corrupted the cumsum for every subsequent token even when all masks were False (response_drop_prob=1.0 / subgoal_drop_prob=1.0).
- Subgoal header ("Subgoal: ") and footer (", ") are now inside `if subgoal_images:`. They were always-present real tokens that added spurious cross-attention signal and position-ID shifts in the no-subgoal case.
- num_cross_att_tokens in forward() now subtracts both the "Action: " indicator length and discrete_action_max_length, matching pi05's discrete_action_indicator_max_length logic. Previously the indicator tokens leaked into the action expert's cross-attention at training time but were absent at inference, causing a train/inference mismatch. prefix_offsets updated to match.
- prepare_subgoal_images: track last_subgoal_img / last_mask before the empty-cameras fallback loop to avoid NameError when no subgoal keys are present in the batch and empty_cameras > 0.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

claude

Inline findings on PR #205 — see summary comment for the rollup.

claude · 2026-04-29T22:28:15Z

[claude-review] summary for commit 75aca44

suggestion — src/opentau/datasets/lerobot_dataset.py:1827-1830 — Updated _load_subgoal_frames docstring claims "Episodes without a segments entry in episodes.jsonl are unsupported", but _sample_subgoal_frame at line 1790 still has if "segments" not in self.meta.episodes[ep_idx]: return min(frame_in_ep + window_frames, ep_length - 1) and its own docstring (1782-1786) still advertises the ~4 s lookahead fallback. The two docstrings disagree about whether segment-less episodes are supported.
(resolved) — …/modeling_pi07_low_level.py:1399-1403 — Action: indicator att_masks is now [1] * num_discrete_action_start_embs (per-token causal), matching pi05. Closes the prior "discrete-action CE byte-equivalence" carry.
(resolved) — …/modeling_pi07_low_level.py:1244 / 1357 — state-end indicator collapses to ":\n" and the trailing ":\n" prefix-end is omitted when no optional middle block contributes real tokens. Tests in TestPI07LowLevelPlannerEmbedPrefixNoOptionals pin the layout, the att_mask, and the n_obs_steps==1 precondition without GPU. Good coverage of the new branch.
suggestion (carried) — …/modeling_pi07_low_level.py:742-743 — prepare_videos docstring still asserts "ImageNet normalization is assumed to be already applied by the dataset loader." After the V-JEPA2 → SpaceTime SigLIP swap, normalization happens inside the encoder. Reword.
suggestion (carried, expanded) — …/modeling_pi07_low_level.py:1245 / 1290 / 1328 / 1358 / 1378 / 1552 — tokenizer.encode("Action: ") and the other fixed-string encodes now run up to 6× per forward() (the new state-end branch and _action_indicator_len lookup added two more sites). Cache the IDs in __init__ (e.g. self._action_indicator_ids, self._state_end_ids_optional, self._state_end_ids_no_optional) or add config fields paralleling pi05's discrete_action_indicator_max_length.
suggestion (carried) — configs/examples/pi07_libero.json:13-18, 24 — subgoal_drop_prob, response_drop_prob, metadata_drop_all_prob, metadata_drop_each_prob, history_state_drop_prob are all 1.0, with n_obs_steps: 1. Now that this PR's whole point is the has_any_optional == False byte-equivalence path, this config doubles as the smoke config for it — add a one-line header comment to that effect so a reader doesn't copy it as a real-run starting point. subgoal_end_of_segment_prob: 0.25 remains dead at subgoal_drop_prob: 1.0.
suggestion — Tests cover the new has_any_optional == False branch end-to-end, but not (a) the per-sample sample_has_subgoal mask in a mixed batch (some samples with subgoals, others without) — the actual scenario the new gating logic is meant to handle, (b) the num_cross_att_tokens / prefix_offsets arithmetic in the no-optionals path (just asserting prefix_pad_masks.sum(-1) == prefix_embs.shape[1] - _action_indicator_len - discrete_action_max_length would lock it), or (c) _load_subgoal_frames against an episode whose episodes.jsonl entry omits segments (which is exactly the legacy case the docstring above claims is unsupported and _sample_subgoal_frame still handles).
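
The arithmetic named in item (b) is small enough to pin down in isolation. The sketch below mirrors the subtraction described in the PR body; the function is a stand-in rather than the model's forward(), and the example sizes are made up.

```python
def num_cross_att_tokens(prefix_len, action_indicator_len, discrete_action_max_length):
    """Number of prefix tokens visible to the action expert's cross-attention.

    The trailing "Action: " indicator and the discrete-action slots are excluded
    so that the training layout matches inference, where they are absent.
    """
    remaining = prefix_len - action_indicator_len - discrete_action_max_length
    if remaining < 0:
        raise ValueError("prefix is shorter than the indicator plus action slots")
    return remaining


# Hypothetical sizes: 300 prefix tokens, a 3-token indicator, 32 action slots.
visible = num_cross_att_tokens(300, 3, 32)
```

With these numbers the action expert would see 265 prefix tokens; a check of this shape is what the suggested test would lock in.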

Note: prior COMMENTED reviews on earlier commits cannot be dismissed (GitHub returns 422 for COMMENTED state); inline comments on those commits are superseded by this rollup and the new inline review on 75aca44.

Swaps VJEPA2VideoEncoder for SpaceTimeSiglipVideoEncoder (reusing the PaliGemma vision_tower + multi_modal_projector already in the model), removes the five vjepa2_* config fields in favour of spacetime_layer_stride, and updates pi07_libero.json accordingly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

claude

Inline findings on PR #205 (commit 57d650f) — see summary comment for the rollup.

claude

Inline findings on PR #205 (commit 0ef58cd) — see summary comment for the rollup.

…are not present, mask subgoal_start_ind in batch if subgoal is padded

claude

Inline findings on PR #205 (commit 2b9157d) — see summary comment for the rollup.

The V-JEPA2 encoder was replaced with SpaceTimeSiglipVideoEncoder in 57d650f, but several module docstrings, comments, and a test constant still referenced the old encoder. Also rename the test constant VJEPA2_TOKENS_PER_CAMERA -> SIGLIP_TOKENS_PER_CAMERA to match. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

claude

Inline findings on PR #205 (commit 9528cfb) — see summary comment for the rollup.

claude

Inline findings on PR #205 (commit 6e3f82a) — see summary comment for the rollup.

…opped (#226) (#227)

claude

Inline findings on PR #205 (commit 75aca44) — see summary comment for the rollup.

shuheng-liu · 2026-05-02T06:30:12Z

@claude fix the items raised in review

@claude

- addresses @claude (Action: indicator caching): cache the encoded length once in PI07LowLevelPlannerFlowMatching.__init__ as self._action_indicator_len; reuse it in the cross-att length and prefix-offset slice. Adds an explicit assert that discrete_actions is not None at the train-time forward call site so the subtraction can never underflow.
- addresses @claude (gradient_checkpointing for SpaceTimeSiglip): add PI07lowlevelPlannerConfig.gradient_checkpointing (default False) and forward it to SpaceTimeSiglipVideoEncoder, mirroring pi05_mem.
- addresses @claude (subgoal_is_pad / metadata is_pad default flip): document the treat-as-pad default in prepare_subgoal_images and prepare_metadata docstrings so hand-built inference batches know to pass *_is_pad=zeros explicitly.
- addresses @claude (no-op bool cast): drop [m.to(dtype=torch.bool) for m in subgoal_img_masks] in embed_prefix; the masks are already bool. Use torch.stack(subgoal_img_masks).any(dim=0) directly.
- addresses @claude (scalar att_masks footgun): add a comment at the subgoal_img_start_emb append site explaining why the scalar [1, 0, ...] is safe under sample_has_subgoal masking and why a future per-sample fix would double-mask.
- addresses @claude (V-JEPA2 / ImageNet docstring leftover): rewrite prepare_videos docstring to describe the actual SigLIP normalization path ([0,1] in, [-1,1] internal); V-JEPA2 / Perceiver references in embed_video / the ASCII diagram were already cleaned up.
- addresses @claude (lerobot_dataset.py:1813/1820/1830 docstring vs code): rewrite _load_subgoal_frames docstring to match the new always-on behavior, call out the WeightedDatasetMixture co-training behavior change, and acknowledge the segments-less ~4 s legacy fallback that _sample_subgoal_frame still implements.
- addresses @claude (prefix terminator divergence with high-level planner): add a comment noting that the low-level ":\n" terminator intentionally matches pi05 for byte-equivalence on the no-optionals path; the high-level planner's ";\n " terminator is independent.

tests passed: pytest -m "not gpu" -n auto tests/policies/test_pi07_paligemma_low_level_planner.py tests/datasets/test_optional_keys.py

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

claude · 2026-05-02T06:40:59Z

[claude-fix] Addressed the items from the review rollup in commit 56a7ce1:

| Item | Status |
| --- | --- |
| `lerobot_dataset.py` `_load_subgoal_frames` ↔ `_sample_subgoal_frame` docstring contradiction (lines 1813/1820/1830) | Fixed: `_load_subgoal_frames` docstring rewritten to acknowledge the segments-less ~4 s legacy fallback, and the always-on behavior change is now documented with an explicit `WeightedDatasetMixture` co-training callout. |
| `prepare_videos` V-JEPA2 / ImageNet leftover docstring (line 742-743) | Fixed: reworded to describe the actual SigLIP path (`[0,1]` upstream, `[-1,1]` internal). |
| Cache `tokenizer.encode("Action: ")` length (lines 1245/1290/1328/1358/1378/1552) | Fixed: cached as `self._action_indicator_len` in `__init__`, reused in cross-att length and prefix-offset slice. Added an `assert discrete_actions is not None` to the train forward path so the underflow case is loud. |
| Add header comment to `configs/examples/pi07_libero.json` | Not changed: JSON has no native comments; replied inline with the rationale (smoke-config role is documented in PR body / commit). |
| Mixed-batch `sample_has_subgoal` test, `num_cross_att_tokens` arithmetic test, segments-less `_load_subgoal_frames` test | Not added in this commit: existing `TestPI07LowLevelPlannerEmbedPrefixNoOptionals` covers the no-optionals byte-equivalence path. The mixed-batch + arithmetic tests are reasonable follow-ups; flagging here so they don't slip but keeping the diff scoped to review fixes. |

Other inline items (no-op bool cast, scalar att_masks comment, gradient_checkpointing config, is_pad default flip docstrings, prefix terminator divergence comment) were addressed in the same commit — replied individually above with details. Tests passing locally: pytest -m "not gpu" -n auto tests/policies/test_pi07_paligemma_low_level_planner.py tests/datasets/test_optional_keys.py (25 passed).
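
The caching fix reported above follows a common pattern: encode the fixed string once at construction time and reuse the stored length. The sketch below illustrates that pattern with a stub tokenizer that counts calls. Both classes are hypothetical stand-ins and do not reproduce PI07LowLevelPlannerFlowMatching or the real tokenizer.

```python
class CountingTokenizer:
    """Stub tokenizer: one ID per character, with a call counter."""

    def __init__(self):
        self.calls = 0

    def encode(self, text):
        self.calls += 1
        return [ord(ch) for ch in text]


class PlannerSketch:
    """Caches the indicator length once instead of re-encoding on every forward."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        # Encoded a single time here; forward-time code only reads the cached int.
        self._action_indicator_len = len(tokenizer.encode("Action: "))

    def indicator_slice(self, prefix_len):
        """Token range occupied by the indicator at the end of a prefix."""
        return (prefix_len - self._action_indicator_len, prefix_len)


tokenizer = CountingTokenizer()
planner = PlannerSketch(tokenizer)
slices = [planner.indicator_slice(n) for n in (20, 30, 40)]
```

After three lookups the stub has still been called only once, which is the property the review asked for.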

shuheng-liu · 2026-05-02T07:00:12Z

@claude fix all unresolved inline comments. If they are stale, reply explicitly.

@claude

- addresses @claude (mixed-batch sample_has_subgoal coverage): add test_mixed_batch_subgoal_pad_masks_follow_sample_has_subgoal that exercises the per-sample subgoal gating with bsize=2 (sample 0 has a real subgoal, sample 1 is pad-only) and asserts header/image/footer pad_masks follow sample_has_subgoal while att_masks stays broadcast.
- addresses @claude (num_cross_att_tokens arithmetic coverage): add test_no_optionals_num_cross_att_tokens_matches_pi05_arithmetic pinning the prefix_embs.shape[1] - _action_indicator_len - discrete_action_max_length subtraction and matching prefix_offsets.
- addresses @claude (segments-less _load_subgoal_frames coverage): add test_subgoals_load_without_segments_uses_4s_lookahead which mocks episodes[0]={} (no segments key) and asserts the legacy ~4 s lookahead fallback in _sample_subgoal_frame still fires.
- addresses @claude (per-camera hf_dataset transform perf nit): hoist hf_dataset[ep_start + subgoal_frame] out of the per-camera loop in _load_subgoal_frames so multi-image-camera datasets decode the row exactly once instead of N times. Single-camera path is unchanged.

tests passed: pytest -m "not gpu" -n auto tests/policies/test_pi07_paligemma_low_level_planner.py tests/datasets/test_optional_keys.py

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
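
The per-camera performance fix amounts to hoisting one row lookup out of a loop. A minimal sketch of the idea, using a stub dataset that counts reads; the class, function, and camera keys are hypothetical and stand in for hf_dataset and _load_subgoal_frames.

```python
class CountingDataset:
    """Stub dataset whose __getitem__ counts how often a row is decoded."""

    def __init__(self, rows):
        self.rows = rows
        self.reads = 0

    def __getitem__(self, index):
        self.reads += 1
        return self.rows[index]


def load_subgoal_frames(dataset, row_index, camera_keys):
    """Fetch the subgoal row once, then pick each camera's frame from it."""
    row = dataset[row_index]  # hoisted: decoded a single time for all cameras
    return {key: row[key] for key in camera_keys}


dataset = CountingDataset([{"cam_front": "f0", "cam_wrist": "w0", "cam_side": "s0"}])
frames = load_subgoal_frames(dataset, 0, ["cam_front", "cam_wrist", "cam_side"])
```

With three cameras the stub records a single read, where indexing inside the loop would have produced three.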

shuheng-liu · 2026-05-02T07:14:27Z

@claude fix the pre-commit CI failures.

…a_fixes

Co-authored-by: Cursor <cursoragent@cursor.com>

akshay18iitg self-assigned this Apr 29, 2026

akshay18iitg added 2 commits April 29, 2026 15:16

Merge branch 'main' into fix/pi07_paligemma_fixes

ef25a3c

Fixing is_pad keys when keys are missing

483add5

claude Bot reviewed Apr 29, 2026

Avoid Subgoal: indicator when is_pad is true

0ef58cd

claude Bot reviewed Apr 29, 2026

Load Subgoal if video keys are not present, load subgoal if segments …

2b9157d

…are not present, mask subgoal_start_ind in batch if subgoal is padded

claude Bot reviewed Apr 30, 2026

remove spurious space after prefix end indicator

6e3f82a

claude Bot reviewed May 1, 2026

fix(pi07_paligemma): byte-equivalent prefix to pi05 when optionals dr…

75aca44

…opped (#226) (#227)

claude Bot reviewed May 2, 2026

Comment thread src/opentau/datasets/lerobot_dataset.py

akshay18iitg and others added 3 commits May 3, 2026 07:29

Adding working pi07 paligemma code

7209162

Merge local pi07 paligemma working tree onto origin/fix/pi07_paligemm…

d407b3a

…a_fixes

docs(pi07_paligemma): restore accurate low_level __init__ after merge

0def7ad

Co-authored-by: Cursor <cursoragent@cursor.com>

akshay18iitg added 3 commits May 3, 2026 17:44

Adding history for state

98fc922

Adding history to images and temporal attention to SpaceTimeSiglip

2a9f017

Fixing pre-commit

396080f

shuheng-liu mentioned this pull request May 4, 2026

fix(pi07): mask padded history; preserve current state token #253

Merged

3 tasks

fixed docstring and ascii art

9c283ea

shuheng-liu mentioned this pull request May 4, 2026

fix(pi05_mem): mask padded history; preserve current state token; emit at inference #254

Merged

3 tasks

akshay18iitg added 2 commits May 4, 2026 14:02

Add test for low level planner

78e61b5

Add test for low level planner

9f2da90

WilliamYue37 self-requested a review May 4, 2026 22:09

Merge main

e90a6bb

akshay18iitg closed this May 4, 2026
