Fix/pi07 paligemma fixes by akshay18iitg · Pull Request #260 · TensorAuto/OpenTau

akshay18iitg · 2026-05-04T22:41:27Z

…h,

What this does

fix NameError
embed_prefix now only adds response/subgoal/metadata blocks when at least one sample in the batch has real (non-pad) tokens. Unconditional embedding injected spurious causal-block boundaries (att_masks=[1,...]) that corrupted the cumsum for every subsequent token even when all masks were False (response_drop_prob=1.0 / subgoal_drop_prob=1.0).
Subgoal header ("Subgoal: ") and footer (", ") are now inside if subgoal_images:. They were always-present real tokens that added spurious cross-attention signal and position-ID shifts in the no-subgoal case.
num_cross_att_tokens in forward() now subtracts both the "Action: " indicator length and discrete_action_max_length, matching pi05's discrete_action_indicator_max_length logic. Previously the indicator tokens leaked into the action expert's cross-attention at training time but were absent at inference, causing a train/inference mismatch. prefix_offsets updated to match.
prepare_subgoal_images: track last_subgoal_img / last_mask before the empty-cameras fallback loop to avoid NameError when no subgoal keys are present in the batch and empty_cameras > 0.

Issues addressed:

When n_obs_step > 1 and obs_history_is_pad is True, previously only history images are zeroed out but SpaceTimeSIglip still attends to the history. This PR adds Time Attention to mask out those history images.
When n_obs_step > 1 and obs_history_is_pad is True, the current state would also get padded. This PR fixes that
Subgoal was using Siglip and not SpaceTimeSiglip to encode images. This PR fixes that
The subgoal, response and metadata train time dropout where not consistent with test time dropout. This PR fixes that
The RNG divergence error was caused due to dropout in model weights. Set dropout to 0.0 to keep the pi07_paligemma behaviour similar to pi05 under specific config

How was it tested

TestPI07LowLevelPlannerObsHistoryRegression

test_single_frame_shape_and_mask_true

Tested: Encoder at T=1 ignores obs_history_is_pad.
Expected: Output (B, num_video_tokens, H); None / all-True / all-False pads → identical outputs.
test_temporal_mask_is_history_padded

Tested: Temporal mask when only the current frame is real.
Expected: Shape (B·N,1,T,T); past keys fully blocked; current self-only; past cannot see current; ≤1 attendable key per row; mask repeated per patch within a sample.
test_temporal_mask_all_valid_is_causal

Tested: Temporal mask with no padded frames.
Expected: Standard causal lower-triangle; row i has i+1 allowed keys.
test_temporal_mask_none_fallback_matches_all_padded

Tested: Explicit “only current real” pad pattern vs the same pattern duplicated.
Expected: Two explicitly built masks are exactly equal (stand-in for the forward None fallback tensor).
test_temporal_mask_mixed_batch_per_sample_independence

Tested: Different pad rows in one batch and a partial-history row.
Expected: Each row’s mask equals solo build on that row; padded vs valid rows differ; partial row has the specified allow/forbid cells.

TestPI07LowLevelPlannerStateEmbedding

test_state_embedding_single_frame

Tested: T=1 state block in prefix.
Expected: Embeddings (1,1,H); state pad all present; embeddings not all zero.
test_state_embedding_single_frame_pad_irrelevant

Tested: T=1 with different obs_history_is_pad.
Expected: Same state pad slice for None / True / False; still all present.
test_state_embedding_all_history_padded

Tested: T>1 with only last frame real.
Expected: Embeddings (1,T,H); only last state slot present; last slot not all zero.
test_state_embedding_all_pad_true_matches_history_padded

Tested: obs_history_is_pad all True vs usual […,False] on last.
Expected: Same state pad slice and same state embeddings (current forced present).
test_state_embedding_all_valid

Tested: T>1, no padding.
Expected: All T slots present; each timestep embedding not all zero.
test_state_embedding_none_fallback

Tested: obs_history_is_pad None vs explicit padded-history pattern.
Expected: Identical state pad slices.
test_state_embedding_mixed_batch

Tested: Three fixed per-row pad patterns.
Expected: State pad rows [F,F,F,T], [T,T,T,T], [F,T,T,T] (presence = not pad, with last forced).
test_forward_path_invokes_state_proj

Tested: Multi-step state projection for the prefix path.
Expected: Exactly one state_proj call with input shape (B,T,D).
test_sample_actions_path_invokes_state_proj

Tested: Single-step state projection.
Expected: Exactly one call with shape (B,1,D).

TestPI07LowLevelPlannerResponseEmbedding

test_empty_response_comma_and_tokens_padded

Tested: Empty response after prepare_response.
Expected: response_masks all false; response comma and response spans in prefix pad all absent.
test_nonempty_response_comma_and_tokens_unmasked

Tested: Non-empty response.
Expected: Some response_masks true; comma span all present; response span equals response_masks.
test_mixed_batch_response_masking

Tested: ["", real text] batch.
Expected: Row0 comma+response absent; row1 comma present; response span equals row masks.
test_response_masking_forward_vs_sample_actions_parity

Tested: Stability and equivalence of preparation; empty vs real differs; missing key vs "".
Expected: Duplicate prepares → equal tokens/masks; omit key equals explicit empty; duplicate embed_prefix pad_masks equal; empty vs real differ in comma and response spans.

TestPI07LowLevelPlannerMetadataEmbedding

test_all_metadata_dropped_md_comma_and_tokens_padded

Tested: All fields dropped via *_is_pad.
Expected: metadata_masks all false; metadata comma and metadata spans all absent.

test_all_metadata_present_md_comma_unmasked

Tested: All fields present.
Expected: metadata_masks has some true; metadata comma all present; metadata span equals metadata_masks.
test_metadata_is_pad_exhaustive_single_sample (8 combinations)

Tested: Every (speed_pad, quality_pad, mistake_pad) triple vs sample_has_metadata and prefix slices.
Expected: metadata_masks.any() iff not all three dropped; metadata comma matches metadata_masks.any(dim=1) broadcast; metadata span equals metadata_masks.
test_mixed_batch_metadata_masking

Tested: Three-row batch with different pad patterns.
Expected: Comma row masks [absent, present, present] matching per-row any-metadata; each metadata row equals that row’s metadata_masks.
test_metadata_missing_keys_matches_explicit_all_pad

Tested: Only-state batch vs explicit all-dropped metadata batch.
Expected: prepare_metadata outputs identical; full embed_prefix pad_masks identical.

TestPI07LowLevelPlannerSubgoalEmbedding

test_no_subgoal_keys_sg_block_fully_padded

Tested: Batch with only state (no subgoal keys).
Expected: sample_has_subgoal false; subgoal comma, header, and all subgoal vision tokens absent.
test_subgoal_present_sg_block_unmasked

Tested: Both subgoal cameras present and not pad-dropped.
Expected: sample_has_subgoal true; comma+header present; each camera’s vision span equals that camera’s mask broadcast to tokens.
test_subgoal_tensors_all_is_pad_matches_missing_keys

Tested: Tensors + all subgoal_is_pad vs missing keys.
Expected: prepare_subgoal_images outputs pairwise equal; embed_prefix pad_masks equal.
test_mixed_batch_subgoal_masking

Tested: subgoal_is_pad [True, False] with tensors for both rows.
Expected: sample_has_subgoal == [False, True]; row0 subgoal comma/header/vision absent; row1 comma/header present; row1 full vision span present.
test_subgoal_sample_has_or_across_cameras

Tested: One camera all-false mask, one all-true.
Expected: sample_has_subgoal true; subgoal comma present (OR across cameras).
test_omit_subgoal_is_pad_defaults_all_pad

Tested: Subgoal images present but subgoal_is_pad omitted.
Expected: Masks all false; sample_has_subgoal false; subgoal comma span absent.
test_subgoal_embed_prefix_parity

Tested: Same batch prepared twice.
Expected: embed_prefix pad_masks bitwise equal between runs.

Checklist

I have added Google-style docstrings to important functions and ensured function parameters are typed.
My PR includes policy-related changes.
- If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

Note: Before submitting this PR, please read the contributor guideline.

…h, fix NameError - embed_prefix now only adds response/subgoal/metadata blocks when at least one sample in the batch has real (non-pad) tokens. Unconditional embedding injected spurious causal-block boundaries (att_masks=[1,...]) that corrupted the cumsum for every subsequent token even when all masks were False (response_drop_prob=1.0 / subgoal_drop_prob=1.0). - Subgoal header ("Subgoal: ") and footer (", ") are now inside `if subgoal_images:`. They were always-present real tokens that added spurious cross-attention signal and position-ID shifts in the no-subgoal case. - num_cross_att_tokens in forward() now subtracts both the "Action: " indicator length and discrete_action_max_length, matching pi05's discrete_action_indicator_max_length logic. Previously the indicator tokens leaked into the action expert's cross-attention at training time but were absent at inference, causing a train/inference mismatch. prefix_offsets updated to match. - prepare_subgoal_images: track last_subgoal_img / last_mask before the empty-cameras fallback loop to avoid NameError when no subgoal keys are present in the batch and empty_cameras > 0. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Swaps VJEPA2VideoEncoder for SpaceTimeSiglipVideoEncoder (reusing the PaliGemma vision_tower + multi_modal_projector already in the model), removes the five vjepa2_* config fields in favour of spacetime_layer_stride, and updates pi07_libero.json accordingly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…are not present, mask subgoal_start_ind in batch if subgoal is padded

The V-JEPA2 encoder was replaced with SpaceTimeSiglipVideoEncoder in 57d650f, but several module docstrings, comments, and a test constant still referenced the old encoder. Also rename the test constant VJEPA2_TOKENS_PER_CAMERA -> SIGLIP_TOKENS_PER_CAMERA to match. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…opped (#226) (#227)

@claude

- addresses @claude (Action: indicator caching): cache the encoded length once in PI07LowLevelPlannerFlowMatching.__init__ as self._action_indicator_len; reuse it in the cross-att length and prefix-offset slice. Adds an explicit assert that discrete_actions is not None at the train-time forward call site so the subtraction can never underflow. - addresses @claude (gradient_checkpointing for SpaceTimeSiglip): add PI07lowlevelPlannerConfig.gradient_checkpointing (default False) and forward it to SpaceTimeSiglipVideoEncoder, mirroring pi05_mem. - addresses @claude (subgoal_is_pad / metadata is_pad default flip): document the treat-as-pad default in prepare_subgoal_images and prepare_metadata docstrings so hand-built inference batches know to pass *_is_pad=zeros explicitly. - addresses @claude (no-op bool cast): drop [m.to(dtype=torch.bool) for m in subgoal_img_masks] in embed_prefix; the masks are already bool. Use torch.stack(subgoal_img_masks).any(dim=0) directly. - addresses @claude (scalar att_masks footgun): add a comment at the subgoal_img_start_emb append site explaining why the scalar [1, 0, ...] is safe under sample_has_subgoal masking and why a future per-sample fix would double-mask. - addresses @claude (V-JEPA2 / ImageNet docstring leftover): rewrite prepare_videos docstring to describe the actual SigLIP normalization path ([0,1] in, [-1,1] internal) — V-JEPA2 / Perceiver references in embed_video / the ASCII diagram were already cleaned up. - addresses @claude (lerobot_dataset.py:1813/1820/1830 docstring vs code): rewrite _load_subgoal_frames docstring to match the new always-on behavior, call out the WeightedDatasetMixture co-training behavior change, and acknowledge the segments-less ~4 s legacy fallback that _sample_subgoal_frame still implements. - addresses @claude (prefix terminator divergence with high-level planner): add a comment noting that the low-level ":\n" terminator intentionally matches pi05 for byte-equivalence on the no-optionals path; the high-level planner's ";\n " terminator is independent. tests: passed — pytest -m "not gpu" -n auto tests/policies/test_pi07_paligemma_low_level_planner.py tests/datasets/test_optional_keys.py Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@claude

- addresses @claude (mixed-batch sample_has_subgoal coverage): add test_mixed_batch_subgoal_pad_masks_follow_sample_has_subgoal that exercises the per-sample subgoal gating with bsize=2 — sample 0 has a real subgoal, sample 1 is pad-only — and asserts header/image/footer pad_masks follow sample_has_subgoal while att_masks stays broadcast. - addresses @claude (num_cross_att_tokens arithmetic coverage): add test_no_optionals_num_cross_att_tokens_matches_pi05_arithmetic pinning the prefix_embs.shape[1] - _action_indicator_len - discrete_action_max_length subtraction and matching prefix_offsets. - addresses @claude (segments-less _load_subgoal_frames coverage): add test_subgoals_load_without_segments_uses_4s_lookahead which mocks episodes[0]={} (no segments key) and asserts the legacy ~4 s lookahead fallback in _sample_subgoal_frame still fires. - addresses @claude (per-camera hf_dataset transform perf nit): hoist hf_dataset[ep_start + subgoal_frame] out of the per-camera loop in _load_subgoal_frames so multi-image-camera datasets decode the row exactly once instead of N times. Single-camera path is unchanged. tests: passed — pytest -m "not gpu" -n auto tests/policies/test_pi07_paligemma_low_level_planner.py tests/datasets/test_optional_keys.py Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…a_fixes

Co-authored-by: Cursor <cursoragent@cursor.com>

claude

Posting inline findings — see comments. Summary in a separate top-level comment.

claude · 2026-05-04T22:49:40Z

[claude-review] summary for commit ea74155

All four resolved findings from the prior review (subgoal num_frames mismatch, prepare_subgoal_images cardinality, dropout default, print→logging) verified addressed in ea74155. The deferred test sync nit was acknowledged as "Ignore for now" by the maintainer. New test_subgoal_padded_to_num_frames_when_n_obs_history_gt_one plus tightened embed_video stubs (T==num_frames check) provide a CPU regression for the original crash.

resolved (was blocking) — src/opentau/policies/pi07_paligemma/low_level_planner/modeling_pi07_low_level.py:1545-1572 — single-frame subgoal videos now padded to (B, num_frames, C, H, W) with obs_history_is_pad blocking all-but-last; embed_video signature accepts the mask. Verified correct: _build_temporal_attn_mask masks all past keys, current = hidden[:, -1] discards padded-frame outputs.
resolved (was suggestion) — src/opentau/policies/pi07_paligemma/low_level_planner/modeling_pi07_low_level.py:995-998 — pad loop now always appends len(missing_keys) zero slots, so total slots = len(subgoal_keys) matching the all-missing-keys early-return path. Prefix length no longer depends on which subgoal{k} keys are present.
resolved (was suggestion) — src/opentau/policies/pi07_paligemma/low_level_planner/configuration_pi07_low_level.py:139 — dropout default flipped 0.1 → 0.0 with explanatory comment; the determinism rule from CLAUDE.md now actually holds without manual override.
resolved (was nit) — src/opentau/policies/pi07_paligemma/low_level_planner/modeling_pi07_low_level.py:353-422 — from_pretrained restored to logging.info/warning/debug throughout.
deferred (nit) — tests/policies/test_pi07_paligemma_low_level_planner.py:199 — _verify_position_ids still subtracts only DISCRETE_ACTION_MAX_LENGTH; out-of-sync with prefix_offsets now subtracting both indicator + action lengths. The test is @pytest.mark.skip'd so doesn't fail CI; maintainer asked to ignore.

No blocking issues found.

akshay18iitg · 2026-05-04T23:08:31Z

@claude fix the points agreed on

@claude

- addresses @claude (subgoal/SpaceTimeSigLIP num_frames mismatch): in embed_prefix, pad single-frame subgoal videos to (B, num_frames, C, H, W) with zeros for past frames and pass obs_history_is_pad blocking all but the current frame so SpaceTimeSiglipVideoEncoder no longer crashes when n_obs_history > 1 and a real subgoal is present. - addresses @claude (prepare_subgoal_images cardinality): always pad to len(subgoal_keys) in the present-keys path so prefix length no longer depends on which subgoal{k} keys are in the batch (matches all-missing early-return path). - addresses @claude (dropout default for pi05 parity): change PI07lowlevelPlannerConfig.dropout default from 0.1 to 0.0 so the determinism rule from CLAUDE.md actually holds without manual override. - addresses @claude (logging vs print regression): restore logging.info / warning / debug calls in PI07LowLevelPlannerPolicy.from_pretrained instead of bare print(...). - adds CPU regression test test_subgoal_padded_to_num_frames_when_n_obs_history_gt_one and tightens the embed_video stubs to enforce T == num_frames so the previous bug would have failed CI. tests: passed (the 13 CPU-only tests that don't need HF auth); 24 remaining failures all require gated google/paligemma-3b-pt-224 access from the sandbox and pass in CI under HF_TOKEN. tests: passed — uv run pytest tests/policies/test_pi07_paligemma_low_level_planner.py -m "not gpu" -n 0 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

akshay18iitg and others added 21 commits April 29, 2026 13:04

Merge branch 'main' into fix/pi07_paligemma_fixes

ef25a3c

Fixing is_pad keys when keys are missing

483add5

Avoid Subgoal: indicator when is_pad is true

0ef58cd

Load Subgoal if video keys are not present, load subgoal if segments …

2b9157d

…are not present, mask subgoal_start_ind in batch if subgoal is padded

remove spurious space after prefix end indicator

6e3f82a

fix(pi07_paligemma): byte-equivalent prefix to pi05 when optionals dr…

75aca44

…opped (#226) (#227)

Adding working pi07 paligemma code

7209162

Merge local pi07 paligemma working tree onto origin/fix/pi07_paligemm…

d407b3a

…a_fixes

docs(pi07_paligemma): restore accurate low_level __init__ after merge

0def7ad

Co-authored-by: Cursor <cursoragent@cursor.com>

Adding history for state

98fc922

Adding history to images and temporal attention to SpaceTimeSiglip

2a9f017

Fixing pre-commit

396080f

fixed docstring and ascii art

9c283ea

Add test for low level planner

78e61b5

Add test for low level planner

9f2da90

Merge main

e90a6bb

akshay18iitg requested a review from shuheng-liu May 4, 2026 22:41

claude Bot reviewed May 4, 2026

View reviewed changes

akshay18iitg requested a review from WilliamYue37 May 4, 2026 23:39

akshay18iitg self-assigned this May 4, 2026

WilliamYue37 approved these changes May 4, 2026

View reviewed changes

WilliamYue37 merged commit 45a34cb into main May 4, 2026
8 checks passed

WilliamYue37 deleted the fix/pi07_paligemma_fixes branch May 4, 2026 23:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix/pi07 paligemma fixes#260

Fix/pi07 paligemma fixes#260
WilliamYue37 merged 22 commits into
mainfrom
fix/pi07_paligemma_fixes

akshay18iitg commented May 4, 2026

Uh oh!

claude Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

claude Bot commented May 4, 2026 •

edited

Loading

Uh oh!

akshay18iitg commented May 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

akshay18iitg commented May 4, 2026

What this does

How was it tested

TestPI07LowLevelPlannerObsHistoryRegression

TestPI07LowLevelPlannerStateEmbedding

TestPI07LowLevelPlannerResponseEmbedding

TestPI07LowLevelPlannerMetadataEmbedding

TestPI07LowLevelPlannerSubgoalEmbedding

Checklist

Note: Before submitting this PR, please read the contributor guideline.

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

claude Bot commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

akshay18iitg commented May 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

claude Bot commented May 4, 2026 •

edited

Loading