Fix/pi07 paligemma fixes#260
Merged
Merged
Conversation
…h, fix NameError
- embed_prefix now only adds response/subgoal/metadata blocks when at
least one sample in the batch has real (non-pad) tokens. Unconditional
embedding injected spurious causal-block boundaries (att_masks=[1,...])
that corrupted the cumsum for every subsequent token even when all masks
were False (response_drop_prob=1.0 / subgoal_drop_prob=1.0).
- Subgoal header ("Subgoal: ") and footer (", ") are now inside
`if subgoal_images:`. They were always-present real tokens that added
spurious cross-attention signal and position-ID shifts in the no-subgoal
case.
- num_cross_att_tokens in forward() now subtracts both the "Action: "
indicator length and discrete_action_max_length, matching pi05's
discrete_action_indicator_max_length logic. Previously the indicator
tokens leaked into the action expert's cross-attention at training time
but were absent at inference, causing a train/inference mismatch.
prefix_offsets updated to match.
- prepare_subgoal_images: track last_subgoal_img / last_mask before the
empty-cameras fallback loop to avoid NameError when no subgoal keys are
present in the batch and empty_cameras > 0.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Swaps VJEPA2VideoEncoder for SpaceTimeSiglipVideoEncoder (reusing the PaliGemma vision_tower + multi_modal_projector already in the model), removes the five vjepa2_* config fields in favour of spacetime_layer_stride, and updates pi07_libero.json accordingly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…are not present, mask subgoal_start_ind in batch if subgoal is padded
The V-JEPA2 encoder was replaced with SpaceTimeSiglipVideoEncoder in 57d650f, but several module docstrings, comments, and a test constant still referenced the old encoder. Also rename the test constant VJEPA2_TOKENS_PER_CAMERA -> SIGLIP_TOKENS_PER_CAMERA to match. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- addresses @claude (Action: indicator caching): cache the encoded length once in PI07LowLevelPlannerFlowMatching.__init__ as self._action_indicator_len; reuse it in the cross-att length and prefix-offset slice. Adds an explicit assert that discrete_actions is not None at the train-time forward call site so the subtraction can never underflow. - addresses @claude (gradient_checkpointing for SpaceTimeSiglip): add PI07lowlevelPlannerConfig.gradient_checkpointing (default False) and forward it to SpaceTimeSiglipVideoEncoder, mirroring pi05_mem. - addresses @claude (subgoal_is_pad / metadata is_pad default flip): document the treat-as-pad default in prepare_subgoal_images and prepare_metadata docstrings so hand-built inference batches know to pass *_is_pad=zeros explicitly. - addresses @claude (no-op bool cast): drop [m.to(dtype=torch.bool) for m in subgoal_img_masks] in embed_prefix; the masks are already bool. Use torch.stack(subgoal_img_masks).any(dim=0) directly. - addresses @claude (scalar att_masks footgun): add a comment at the subgoal_img_start_emb append site explaining why the scalar [1, 0, ...] is safe under sample_has_subgoal masking and why a future per-sample fix would double-mask. - addresses @claude (V-JEPA2 / ImageNet docstring leftover): rewrite prepare_videos docstring to describe the actual SigLIP normalization path ([0,1] in, [-1,1] internal) — V-JEPA2 / Perceiver references in embed_video / the ASCII diagram were already cleaned up. - addresses @claude (lerobot_dataset.py:1813/1820/1830 docstring vs code): rewrite _load_subgoal_frames docstring to match the new always-on behavior, call out the WeightedDatasetMixture co-training behavior change, and acknowledge the segments-less ~4 s legacy fallback that _sample_subgoal_frame still implements. - addresses @claude (prefix terminator divergence with high-level planner): add a comment noting that the low-level ":\n" terminator intentionally matches pi05 for byte-equivalence on the no-optionals path; the high-level planner's ";\n " terminator is independent. tests: passed — pytest -m "not gpu" -n auto tests/policies/test_pi07_paligemma_low_level_planner.py tests/datasets/test_optional_keys.py Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- addresses @claude (mixed-batch sample_has_subgoal coverage): add test_mixed_batch_subgoal_pad_masks_follow_sample_has_subgoal that exercises the per-sample subgoal gating with bsize=2 — sample 0 has a real subgoal, sample 1 is pad-only — and asserts header/image/footer pad_masks follow sample_has_subgoal while att_masks stays broadcast. - addresses @claude (num_cross_att_tokens arithmetic coverage): add test_no_optionals_num_cross_att_tokens_matches_pi05_arithmetic pinning the prefix_embs.shape[1] - _action_indicator_len - discrete_action_max_length subtraction and matching prefix_offsets. - addresses @claude (segments-less _load_subgoal_frames coverage): add test_subgoals_load_without_segments_uses_4s_lookahead which mocks episodes[0]={} (no segments key) and asserts the legacy ~4 s lookahead fallback in _sample_subgoal_frame still fires. - addresses @claude (per-camera hf_dataset transform perf nit): hoist hf_dataset[ep_start + subgoal_frame] out of the per-camera loop in _load_subgoal_frames so multi-image-camera datasets decode the row exactly once instead of N times. Single-camera path is unchanged. tests: passed — pytest -m "not gpu" -n auto tests/policies/test_pi07_paligemma_low_level_planner.py tests/datasets/test_optional_keys.py Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Contributor
|
[claude-review] summary for commit ea74155 All four resolved findings from the prior review (subgoal num_frames mismatch, prepare_subgoal_images cardinality, dropout default, print→logging) verified addressed in
No blocking issues found. |
Collaborator
Author
|
@claude fix the points agreed on |
- addresses @claude (subgoal/SpaceTimeSigLIP num_frames mismatch): in embed_prefix, pad single-frame subgoal videos to (B, num_frames, C, H, W) with zeros for past frames and pass obs_history_is_pad blocking all but the current frame so SpaceTimeSiglipVideoEncoder no longer crashes when n_obs_history > 1 and a real subgoal is present. - addresses @claude (prepare_subgoal_images cardinality): always pad to len(subgoal_keys) in the present-keys path so prefix length no longer depends on which subgoal{k} keys are in the batch (matches all-missing early-return path). - addresses @claude (dropout default for pi05 parity): change PI07lowlevelPlannerConfig.dropout default from 0.1 to 0.0 so the determinism rule from CLAUDE.md actually holds without manual override. - addresses @claude (logging vs print regression): restore logging.info / warning / debug calls in PI07LowLevelPlannerPolicy.from_pretrained instead of bare print(...). - adds CPU regression test test_subgoal_padded_to_num_frames_when_n_obs_history_gt_one and tightens the embed_video stubs to enforce T == num_frames so the previous bug would have failed CI. tests: passed (the 13 CPU-only tests that don't need HF auth); 24 remaining failures all require gated google/paligemma-3b-pt-224 access from the sandbox and pass in CI under HF_TOKEN. tests: passed — uv run pytest tests/policies/test_pi07_paligemma_low_level_planner.py -m "not gpu" -n 0 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
WilliamYue37
approved these changes
May 4, 2026
This was referenced May 5, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
…h,
What this does
fix NameError
embed_prefix now only adds response/subgoal/metadata blocks when at least one sample in the batch has real (non-pad) tokens. Unconditional embedding injected spurious causal-block boundaries (att_masks=[1,...]) that corrupted the cumsum for every subsequent token even when all masks were False (response_drop_prob=1.0 / subgoal_drop_prob=1.0).
Subgoal header ("Subgoal: ") and footer (", ") are now inside
if subgoal_images:. They were always-present real tokens that added spurious cross-attention signal and position-ID shifts in the no-subgoal case.num_cross_att_tokens in forward() now subtracts both the "Action: " indicator length and discrete_action_max_length, matching pi05's discrete_action_indicator_max_length logic. Previously the indicator tokens leaked into the action expert's cross-attention at training time but were absent at inference, causing a train/inference mismatch. prefix_offsets updated to match.
prepare_subgoal_images: track last_subgoal_img / last_mask before the empty-cameras fallback loop to avoid NameError when no subgoal keys are present in the batch and empty_cameras > 0.
Issues addressed:
How was it tested
TestPI07LowLevelPlannerObsHistoryRegression
test_single_frame_shape_and_mask_true
Tested: Encoder at T=1 ignores obs_history_is_pad.
Expected: Output (B, num_video_tokens, H); None / all-True / all-False pads → identical outputs.
test_temporal_mask_is_history_padded
Tested: Temporal mask when only the current frame is real.
Expected: Shape (B·N,1,T,T); past keys fully blocked; current self-only; past cannot see current; ≤1 attendable key per row; mask repeated per patch within a sample.
test_temporal_mask_all_valid_is_causal
Tested: Temporal mask with no padded frames.
Expected: Standard causal lower-triangle; row i has i+1 allowed keys.
test_temporal_mask_none_fallback_matches_all_padded
Tested: Explicit “only current real” pad pattern vs the same pattern duplicated.
Expected: Two explicitly built masks are exactly equal (stand-in for the forward None fallback tensor).
test_temporal_mask_mixed_batch_per_sample_independence
Tested: Different pad rows in one batch and a partial-history row.
Expected: Each row’s mask equals solo build on that row; padded vs valid rows differ; partial row has the specified allow/forbid cells.
TestPI07LowLevelPlannerStateEmbedding
test_state_embedding_single_frame
Tested: T=1 state block in prefix.
Expected: Embeddings (1,1,H); state pad all present; embeddings not all zero.
test_state_embedding_single_frame_pad_irrelevant
Tested: T=1 with different obs_history_is_pad.
Expected: Same state pad slice for None / True / False; still all present.
test_state_embedding_all_history_padded
Tested: T>1 with only last frame real.
Expected: Embeddings (1,T,H); only last state slot present; last slot not all zero.
test_state_embedding_all_pad_true_matches_history_padded
Tested: obs_history_is_pad all True vs usual […,False] on last.
Expected: Same state pad slice and same state embeddings (current forced present).
test_state_embedding_all_valid
Tested: T>1, no padding.
Expected: All T slots present; each timestep embedding not all zero.
test_state_embedding_none_fallback
Tested: obs_history_is_pad None vs explicit padded-history pattern.
Expected: Identical state pad slices.
test_state_embedding_mixed_batch
Tested: Three fixed per-row pad patterns.
Expected: State pad rows [F,F,F,T], [T,T,T,T], [F,T,T,T] (presence = not pad, with last forced).
test_forward_path_invokes_state_proj
Tested: Multi-step state projection for the prefix path.
Expected: Exactly one state_proj call with input shape (B,T,D).
test_sample_actions_path_invokes_state_proj
Tested: Single-step state projection.
Expected: Exactly one call with shape (B,1,D).
TestPI07LowLevelPlannerResponseEmbedding
test_empty_response_comma_and_tokens_padded
Tested: Empty response after prepare_response.
Expected: response_masks all false; response comma and response spans in prefix pad all absent.
test_nonempty_response_comma_and_tokens_unmasked
Tested: Non-empty response.
Expected: Some response_masks true; comma span all present; response span equals response_masks.
test_mixed_batch_response_masking
Tested: ["", real text] batch.
Expected: Row0 comma+response absent; row1 comma present; response span equals row masks.
test_response_masking_forward_vs_sample_actions_parity
Tested: Stability and equivalence of preparation; empty vs real differs; missing key vs "".
Expected: Duplicate prepares → equal tokens/masks; omit key equals explicit empty; duplicate embed_prefix pad_masks equal; empty vs real differ in comma and response spans.
TestPI07LowLevelPlannerMetadataEmbedding
Tested: All fields dropped via *_is_pad.
Expected: metadata_masks all false; metadata comma and metadata spans all absent.
test_all_metadata_present_md_comma_unmasked
Tested: All fields present.
Expected: metadata_masks has some true; metadata comma all present; metadata span equals metadata_masks.
test_metadata_is_pad_exhaustive_single_sample (8 combinations)
Tested: Every (speed_pad, quality_pad, mistake_pad) triple vs sample_has_metadata and prefix slices.
Expected: metadata_masks.any() iff not all three dropped; metadata comma matches metadata_masks.any(dim=1) broadcast; metadata span equals metadata_masks.
test_mixed_batch_metadata_masking
Tested: Three-row batch with different pad patterns.
Expected: Comma row masks [absent, present, present] matching per-row any-metadata; each metadata row equals that row’s metadata_masks.
test_metadata_missing_keys_matches_explicit_all_pad
Tested: Only-state batch vs explicit all-dropped metadata batch.
Expected: prepare_metadata outputs identical; full embed_prefix pad_masks identical.
TestPI07LowLevelPlannerSubgoalEmbedding
test_no_subgoal_keys_sg_block_fully_padded
Tested: Batch with only state (no subgoal keys).
Expected: sample_has_subgoal false; subgoal comma, header, and all subgoal vision tokens absent.
test_subgoal_present_sg_block_unmasked
Tested: Both subgoal cameras present and not pad-dropped.
Expected: sample_has_subgoal true; comma+header present; each camera’s vision span equals that camera’s mask broadcast to tokens.
test_subgoal_tensors_all_is_pad_matches_missing_keys
Tested: Tensors + all subgoal_is_pad vs missing keys.
Expected: prepare_subgoal_images outputs pairwise equal; embed_prefix pad_masks equal.
test_mixed_batch_subgoal_masking
Tested: subgoal_is_pad [True, False] with tensors for both rows.
Expected: sample_has_subgoal == [False, True]; row0 subgoal comma/header/vision absent; row1 comma/header present; row1 full vision span present.
test_subgoal_sample_has_or_across_cameras
Tested: One camera all-false mask, one all-true.
Expected: sample_has_subgoal true; subgoal comma present (OR across cameras).
test_omit_subgoal_is_pad_defaults_all_pad
Tested: Subgoal images present but subgoal_is_pad omitted.
Expected: Masks all false; sample_has_subgoal false; subgoal comma span absent.
test_subgoal_embed_prefix_parity
Tested: Same batch prepared twice.
Expected: embed_prefix pad_masks bitwise equal between runs.
Checklist
Note: Before submitting this PR, please read the contributor guideline.