Skip to content

Fix/pi07 paligemma fixes#260

Merged
WilliamYue37 merged 22 commits into
mainfrom
fix/pi07_paligemma_fixes
May 4, 2026
Merged

Fix/pi07 paligemma fixes#260
WilliamYue37 merged 22 commits into
mainfrom
fix/pi07_paligemma_fixes

Conversation

@akshay18iitg
Copy link
Copy Markdown
Collaborator

…h,

What this does

  • fix NameError

  • embed_prefix now only adds response/subgoal/metadata blocks when at least one sample in the batch has real (non-pad) tokens. Unconditional embedding injected spurious causal-block boundaries (att_masks=[1,...]) that corrupted the cumsum for every subsequent token even when all masks were False (response_drop_prob=1.0 / subgoal_drop_prob=1.0).

  • Subgoal header ("Subgoal: ") and footer (", ") are now inside if subgoal_images:. They were always-present real tokens that added spurious cross-attention signal and position-ID shifts in the no-subgoal case.

  • num_cross_att_tokens in forward() now subtracts both the "Action: " indicator length and discrete_action_max_length, matching pi05's discrete_action_indicator_max_length logic. Previously the indicator tokens leaked into the action expert's cross-attention at training time but were absent at inference, causing a train/inference mismatch. prefix_offsets updated to match.

  • prepare_subgoal_images: track last_subgoal_img / last_mask before the empty-cameras fallback loop to avoid NameError when no subgoal keys are present in the batch and empty_cameras > 0.

Issues addressed:

  • When n_obs_step > 1 and obs_history_is_pad is True, previously only history images are zeroed out but SpaceTimeSIglip still attends to the history. This PR adds Time Attention to mask out those history images.
  • When n_obs_step > 1 and obs_history_is_pad is True, the current state would also get padded. This PR fixes that
  • Subgoal was using Siglip and not SpaceTimeSiglip to encode images. This PR fixes that
  • The subgoal, response and metadata train time dropout where not consistent with test time dropout. This PR fixes that
  • The RNG divergence error was caused due to dropout in model weights. Set dropout to 0.0 to keep the pi07_paligemma behaviour similar to pi05 under specific config

How was it tested

TestPI07LowLevelPlannerObsHistoryRegression

  • test_single_frame_shape_and_mask_true

    Tested: Encoder at T=1 ignores obs_history_is_pad.
    Expected: Output (B, num_video_tokens, H); None / all-True / all-False pads → identical outputs.

  • test_temporal_mask_is_history_padded

    Tested: Temporal mask when only the current frame is real.
    Expected: Shape (B·N,1,T,T); past keys fully blocked; current self-only; past cannot see current; ≤1 attendable key per row; mask repeated per patch within a sample.

  • test_temporal_mask_all_valid_is_causal

    Tested: Temporal mask with no padded frames.
    Expected: Standard causal lower-triangle; row i has i+1 allowed keys.

  • test_temporal_mask_none_fallback_matches_all_padded

    Tested: Explicit “only current real” pad pattern vs the same pattern duplicated.
    Expected: Two explicitly built masks are exactly equal (stand-in for the forward None fallback tensor).

  • test_temporal_mask_mixed_batch_per_sample_independence

    Tested: Different pad rows in one batch and a partial-history row.
    Expected: Each row’s mask equals solo build on that row; padded vs valid rows differ; partial row has the specified allow/forbid cells.

TestPI07LowLevelPlannerStateEmbedding

  • test_state_embedding_single_frame

    Tested: T=1 state block in prefix.
    Expected: Embeddings (1,1,H); state pad all present; embeddings not all zero.

  • test_state_embedding_single_frame_pad_irrelevant

    Tested: T=1 with different obs_history_is_pad.
    Expected: Same state pad slice for None / True / False; still all present.

  • test_state_embedding_all_history_padded

    Tested: T>1 with only last frame real.
    Expected: Embeddings (1,T,H); only last state slot present; last slot not all zero.

  • test_state_embedding_all_pad_true_matches_history_padded

    Tested: obs_history_is_pad all True vs usual […,False] on last.
    Expected: Same state pad slice and same state embeddings (current forced present).

  • test_state_embedding_all_valid

    Tested: T>1, no padding.
    Expected: All T slots present; each timestep embedding not all zero.

  • test_state_embedding_none_fallback

    Tested: obs_history_is_pad None vs explicit padded-history pattern.
    Expected: Identical state pad slices.

  • test_state_embedding_mixed_batch

    Tested: Three fixed per-row pad patterns.
    Expected: State pad rows [F,F,F,T], [T,T,T,T], [F,T,T,T] (presence = not pad, with last forced).

  • test_forward_path_invokes_state_proj

    Tested: Multi-step state projection for the prefix path.
    Expected: Exactly one state_proj call with input shape (B,T,D).

  • test_sample_actions_path_invokes_state_proj

    Tested: Single-step state projection.
    Expected: Exactly one call with shape (B,1,D).

TestPI07LowLevelPlannerResponseEmbedding

  • test_empty_response_comma_and_tokens_padded

    Tested: Empty response after prepare_response.
    Expected: response_masks all false; response comma and response spans in prefix pad all absent.

  • test_nonempty_response_comma_and_tokens_unmasked

    Tested: Non-empty response.
    Expected: Some response_masks true; comma span all present; response span equals response_masks.

  • test_mixed_batch_response_masking

    Tested: ["", real text] batch.
    Expected: Row0 comma+response absent; row1 comma present; response span equals row masks.

  • test_response_masking_forward_vs_sample_actions_parity

    Tested: Stability and equivalence of preparation; empty vs real differs; missing key vs "".
    Expected: Duplicate prepares → equal tokens/masks; omit key equals explicit empty; duplicate embed_prefix pad_masks equal; empty vs real differ in comma and response spans.

TestPI07LowLevelPlannerMetadataEmbedding

  • test_all_metadata_dropped_md_comma_and_tokens_padded

Tested: All fields dropped via *_is_pad.
Expected: metadata_masks all false; metadata comma and metadata spans all absent.

  • test_all_metadata_present_md_comma_unmasked

    Tested: All fields present.
    Expected: metadata_masks has some true; metadata comma all present; metadata span equals metadata_masks.

  • test_metadata_is_pad_exhaustive_single_sample (8 combinations)

    Tested: Every (speed_pad, quality_pad, mistake_pad) triple vs sample_has_metadata and prefix slices.
    Expected: metadata_masks.any() iff not all three dropped; metadata comma matches metadata_masks.any(dim=1) broadcast; metadata span equals metadata_masks.

  • test_mixed_batch_metadata_masking

    Tested: Three-row batch with different pad patterns.
    Expected: Comma row masks [absent, present, present] matching per-row any-metadata; each metadata row equals that row’s metadata_masks.

  • test_metadata_missing_keys_matches_explicit_all_pad

    Tested: Only-state batch vs explicit all-dropped metadata batch.
    Expected: prepare_metadata outputs identical; full embed_prefix pad_masks identical.

TestPI07LowLevelPlannerSubgoalEmbedding

  • test_no_subgoal_keys_sg_block_fully_padded

    Tested: Batch with only state (no subgoal keys).
    Expected: sample_has_subgoal false; subgoal comma, header, and all subgoal vision tokens absent.

  • test_subgoal_present_sg_block_unmasked

    Tested: Both subgoal cameras present and not pad-dropped.
    Expected: sample_has_subgoal true; comma+header present; each camera’s vision span equals that camera’s mask broadcast to tokens.

  • test_subgoal_tensors_all_is_pad_matches_missing_keys

    Tested: Tensors + all subgoal_is_pad vs missing keys.
    Expected: prepare_subgoal_images outputs pairwise equal; embed_prefix pad_masks equal.

  • test_mixed_batch_subgoal_masking

    Tested: subgoal_is_pad [True, False] with tensors for both rows.
    Expected: sample_has_subgoal == [False, True]; row0 subgoal comma/header/vision absent; row1 comma/header present; row1 full vision span present.

  • test_subgoal_sample_has_or_across_cameras

    Tested: One camera all-false mask, one all-true.
    Expected: sample_has_subgoal true; subgoal comma present (OR across cameras).

  • test_omit_subgoal_is_pad_defaults_all_pad

    Tested: Subgoal images present but subgoal_is_pad omitted.
    Expected: Masks all false; sample_has_subgoal false; subgoal comma span absent.

  • test_subgoal_embed_prefix_parity

    Tested: Same batch prepared twice.
    Expected: embed_prefix pad_masks bitwise equal between runs.

Checklist

  • I have added Google-style docstrings to important functions and ensured function parameters are typed.
  • My PR includes policy-related changes.
    • If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

Note: Before submitting this PR, please read the contributor guideline.

akshay18iitg and others added 21 commits April 29, 2026 13:04
…h, fix NameError

- embed_prefix now only adds response/subgoal/metadata blocks when at
  least one sample in the batch has real (non-pad) tokens.  Unconditional
  embedding injected spurious causal-block boundaries (att_masks=[1,...])
  that corrupted the cumsum for every subsequent token even when all masks
  were False (response_drop_prob=1.0 / subgoal_drop_prob=1.0).

- Subgoal header ("Subgoal: ") and footer (", ") are now inside
  `if subgoal_images:`.  They were always-present real tokens that added
  spurious cross-attention signal and position-ID shifts in the no-subgoal
  case.

- num_cross_att_tokens in forward() now subtracts both the "Action: "
  indicator length and discrete_action_max_length, matching pi05's
  discrete_action_indicator_max_length logic.  Previously the indicator
  tokens leaked into the action expert's cross-attention at training time
  but were absent at inference, causing a train/inference mismatch.
  prefix_offsets updated to match.

- prepare_subgoal_images: track last_subgoal_img / last_mask before the
  empty-cameras fallback loop to avoid NameError when no subgoal keys are
  present in the batch and empty_cameras > 0.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Swaps VJEPA2VideoEncoder for SpaceTimeSiglipVideoEncoder (reusing the
PaliGemma vision_tower + multi_modal_projector already in the model),
removes the five vjepa2_* config fields in favour of spacetime_layer_stride,
and updates pi07_libero.json accordingly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…are not present, mask subgoal_start_ind in batch if subgoal is padded
The V-JEPA2 encoder was replaced with SpaceTimeSiglipVideoEncoder in
57d650f, but several module docstrings, comments, and a test constant
still referenced the old encoder. Also rename the test constant
VJEPA2_TOKENS_PER_CAMERA -> SIGLIP_TOKENS_PER_CAMERA to match.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- addresses @claude (Action: indicator caching): cache the encoded length once in PI07LowLevelPlannerFlowMatching.__init__ as self._action_indicator_len; reuse it in the cross-att length and prefix-offset slice. Adds an explicit assert that discrete_actions is not None at the train-time forward call site so the subtraction can never underflow.
- addresses @claude (gradient_checkpointing for SpaceTimeSiglip): add PI07lowlevelPlannerConfig.gradient_checkpointing (default False) and forward it to SpaceTimeSiglipVideoEncoder, mirroring pi05_mem.
- addresses @claude (subgoal_is_pad / metadata is_pad default flip): document the treat-as-pad default in prepare_subgoal_images and prepare_metadata docstrings so hand-built inference batches know to pass *_is_pad=zeros explicitly.
- addresses @claude (no-op bool cast): drop [m.to(dtype=torch.bool) for m in subgoal_img_masks] in embed_prefix; the masks are already bool. Use torch.stack(subgoal_img_masks).any(dim=0) directly.
- addresses @claude (scalar att_masks footgun): add a comment at the subgoal_img_start_emb append site explaining why the scalar [1, 0, ...] is safe under sample_has_subgoal masking and why a future per-sample fix would double-mask.
- addresses @claude (V-JEPA2 / ImageNet docstring leftover): rewrite prepare_videos docstring to describe the actual SigLIP normalization path ([0,1] in, [-1,1] internal) — V-JEPA2 / Perceiver references in embed_video / the ASCII diagram were already cleaned up.
- addresses @claude (lerobot_dataset.py:1813/1820/1830 docstring vs code): rewrite _load_subgoal_frames docstring to match the new always-on behavior, call out the WeightedDatasetMixture co-training behavior change, and acknowledge the segments-less ~4 s legacy fallback that _sample_subgoal_frame still implements.
- addresses @claude (prefix terminator divergence with high-level planner): add a comment noting that the low-level ":\n" terminator intentionally matches pi05 for byte-equivalence on the no-optionals path; the high-level planner's ";\n " terminator is independent.

tests: passed — pytest -m "not gpu" -n auto tests/policies/test_pi07_paligemma_low_level_planner.py tests/datasets/test_optional_keys.py

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- addresses @claude (mixed-batch sample_has_subgoal coverage):
  add test_mixed_batch_subgoal_pad_masks_follow_sample_has_subgoal that
  exercises the per-sample subgoal gating with bsize=2 — sample 0 has a
  real subgoal, sample 1 is pad-only — and asserts header/image/footer
  pad_masks follow sample_has_subgoal while att_masks stays broadcast.
- addresses @claude (num_cross_att_tokens arithmetic coverage):
  add test_no_optionals_num_cross_att_tokens_matches_pi05_arithmetic
  pinning the prefix_embs.shape[1] - _action_indicator_len -
  discrete_action_max_length subtraction and matching prefix_offsets.
- addresses @claude (segments-less _load_subgoal_frames coverage):
  add test_subgoals_load_without_segments_uses_4s_lookahead which mocks
  episodes[0]={} (no segments key) and asserts the legacy ~4 s lookahead
  fallback in _sample_subgoal_frame still fires.
- addresses @claude (per-camera hf_dataset transform perf nit):
  hoist hf_dataset[ep_start + subgoal_frame] out of the per-camera loop
  in _load_subgoal_frames so multi-image-camera datasets decode the row
  exactly once instead of N times. Single-camera path is unchanged.

tests: passed — pytest -m "not gpu" -n auto tests/policies/test_pi07_paligemma_low_level_planner.py tests/datasets/test_optional_keys.py

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@akshay18iitg akshay18iitg requested a review from shuheng-liu May 4, 2026 22:41
Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Posting inline findings — see comments. Summary in a separate top-level comment.

Comment thread src/opentau/policies/pi07_paligemma/low_level_planner/modeling_pi07_low_level.py Outdated
Comment thread configs/examples/pi07_libero.json
Comment thread tests/policies/test_pi07_paligemma_low_level_planner.py
Comment thread src/opentau/policies/pi07_paligemma/low_level_planner/modeling_pi07_low_level.py Outdated
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 4, 2026

[claude-review] summary for commit ea74155

All four resolved findings from the prior review (subgoal num_frames mismatch, prepare_subgoal_images cardinality, dropout default, print→logging) verified addressed in ea74155. The deferred test sync nit was acknowledged as "Ignore for now" by the maintainer. New test_subgoal_padded_to_num_frames_when_n_obs_history_gt_one plus tightened embed_video stubs (T==num_frames check) provide a CPU regression for the original crash.

  • resolved (was blocking)src/opentau/policies/pi07_paligemma/low_level_planner/modeling_pi07_low_level.py:1545-1572 — single-frame subgoal videos now padded to (B, num_frames, C, H, W) with obs_history_is_pad blocking all-but-last; embed_video signature accepts the mask. Verified correct: _build_temporal_attn_mask masks all past keys, current = hidden[:, -1] discards padded-frame outputs.
  • resolved (was suggestion)src/opentau/policies/pi07_paligemma/low_level_planner/modeling_pi07_low_level.py:995-998 — pad loop now always appends len(missing_keys) zero slots, so total slots = len(subgoal_keys) matching the all-missing-keys early-return path. Prefix length no longer depends on which subgoal{k} keys are present.
  • resolved (was suggestion)src/opentau/policies/pi07_paligemma/low_level_planner/configuration_pi07_low_level.py:139dropout default flipped 0.1 → 0.0 with explanatory comment; the determinism rule from CLAUDE.md now actually holds without manual override.
  • resolved (was nit)src/opentau/policies/pi07_paligemma/low_level_planner/modeling_pi07_low_level.py:353-422from_pretrained restored to logging.info/warning/debug throughout.
  • deferred (nit)tests/policies/test_pi07_paligemma_low_level_planner.py:199_verify_position_ids still subtracts only DISCRETE_ACTION_MAX_LENGTH; out-of-sync with prefix_offsets now subtracting both indicator + action lengths. The test is @pytest.mark.skip'd so doesn't fail CI; maintainer asked to ignore.

No blocking issues found.

@akshay18iitg
Copy link
Copy Markdown
Collaborator Author

@claude fix the points agreed on

- addresses @claude (subgoal/SpaceTimeSigLIP num_frames mismatch): in
  embed_prefix, pad single-frame subgoal videos to (B, num_frames, C, H, W)
  with zeros for past frames and pass obs_history_is_pad blocking all but
  the current frame so SpaceTimeSiglipVideoEncoder no longer crashes when
  n_obs_history > 1 and a real subgoal is present.
- addresses @claude (prepare_subgoal_images cardinality): always pad to
  len(subgoal_keys) in the present-keys path so prefix length no longer
  depends on which subgoal{k} keys are in the batch (matches all-missing
  early-return path).
- addresses @claude (dropout default for pi05 parity): change
  PI07lowlevelPlannerConfig.dropout default from 0.1 to 0.0 so the
  determinism rule from CLAUDE.md actually holds without manual override.
- addresses @claude (logging vs print regression): restore logging.info /
  warning / debug calls in PI07LowLevelPlannerPolicy.from_pretrained
  instead of bare print(...).
- adds CPU regression test
  test_subgoal_padded_to_num_frames_when_n_obs_history_gt_one and tightens
  the embed_video stubs to enforce T == num_frames so the previous bug
  would have failed CI.

tests: passed (the 13 CPU-only tests that don't need HF auth);
24 remaining failures all require gated google/paligemma-3b-pt-224
access from the sandbox and pass in CI under HF_TOKEN.
tests: passed — uv run pytest tests/policies/test_pi07_paligemma_low_level_planner.py -m "not gpu" -n 0

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@akshay18iitg akshay18iitg requested a review from WilliamYue37 May 4, 2026 23:39
@akshay18iitg akshay18iitg self-assigned this May 4, 2026
@WilliamYue37 WilliamYue37 merged commit 45a34cb into main May 4, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants