fix(pi07_paligemma): gate optional prefix tokens, fix cross-att lengt…#205
fix(pi07_paligemma): gate optional prefix tokens, fix cross-att lengt…#205akshay18iitg wants to merge 21 commits into
Conversation
…h, fix NameError
- embed_prefix now only adds response/subgoal/metadata blocks when at
least one sample in the batch has real (non-pad) tokens. Unconditional
embedding injected spurious causal-block boundaries (att_masks=[1,...])
that corrupted the cumsum for every subsequent token even when all masks
were False (response_drop_prob=1.0 / subgoal_drop_prob=1.0).
- Subgoal header ("Subgoal: ") and footer (", ") are now inside
`if subgoal_images:`. They were always-present real tokens that added
spurious cross-attention signal and position-ID shifts in the no-subgoal
case.
- num_cross_att_tokens in forward() now subtracts both the "Action: "
indicator length and discrete_action_max_length, matching pi05's
discrete_action_indicator_max_length logic. Previously the indicator
tokens leaked into the action expert's cross-attention at training time
but were absent at inference, causing a train/inference mismatch.
prefix_offsets updated to match.
- prepare_subgoal_images: track last_subgoal_img / last_mask before the
empty-cameras fallback loop to avoid NameError when no subgoal keys are
present in the batch and empty_cameras > 0.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
[claude-review] summary for commit 75aca44
Note: prior |
Swaps VJEPA2VideoEncoder for SpaceTimeSiglipVideoEncoder (reusing the PaliGemma vision_tower + multi_modal_projector already in the model), removes the five vjepa2_* config fields in favour of spacetime_layer_stride, and updates pi07_libero.json accordingly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…are not present, mask subgoal_start_ind in batch if subgoal is padded
The V-JEPA2 encoder was replaced with SpaceTimeSiglipVideoEncoder in 57d650f, but several module docstrings, comments, and a test constant still referenced the old encoder. Also rename the test constant VJEPA2_TOKENS_PER_CAMERA -> SIGLIP_TOKENS_PER_CAMERA to match. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
@claude fix the items raised in review |
- addresses @claude (Action: indicator caching): cache the encoded length once in PI07LowLevelPlannerFlowMatching.__init__ as self._action_indicator_len; reuse it in the cross-att length and prefix-offset slice. Adds an explicit assert that discrete_actions is not None at the train-time forward call site so the subtraction can never underflow. - addresses @claude (gradient_checkpointing for SpaceTimeSiglip): add PI07lowlevelPlannerConfig.gradient_checkpointing (default False) and forward it to SpaceTimeSiglipVideoEncoder, mirroring pi05_mem. - addresses @claude (subgoal_is_pad / metadata is_pad default flip): document the treat-as-pad default in prepare_subgoal_images and prepare_metadata docstrings so hand-built inference batches know to pass *_is_pad=zeros explicitly. - addresses @claude (no-op bool cast): drop [m.to(dtype=torch.bool) for m in subgoal_img_masks] in embed_prefix; the masks are already bool. Use torch.stack(subgoal_img_masks).any(dim=0) directly. - addresses @claude (scalar att_masks footgun): add a comment at the subgoal_img_start_emb append site explaining why the scalar [1, 0, ...] is safe under sample_has_subgoal masking and why a future per-sample fix would double-mask. - addresses @claude (V-JEPA2 / ImageNet docstring leftover): rewrite prepare_videos docstring to describe the actual SigLIP normalization path ([0,1] in, [-1,1] internal) — V-JEPA2 / Perceiver references in embed_video / the ASCII diagram were already cleaned up. - addresses @claude (lerobot_dataset.py:1813/1820/1830 docstring vs code): rewrite _load_subgoal_frames docstring to match the new always-on behavior, call out the WeightedDatasetMixture co-training behavior change, and acknowledge the segments-less ~4 s legacy fallback that _sample_subgoal_frame still implements. - addresses @claude (prefix terminator divergence with high-level planner): add a comment noting that the low-level ":\n" terminator intentionally matches pi05 for byte-equivalence on the no-optionals path; the high-level planner's ";\n " terminator is independent. tests: passed — pytest -m "not gpu" -n auto tests/policies/test_pi07_paligemma_low_level_planner.py tests/datasets/test_optional_keys.py Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
[claude-fix] Addressed the items from the review rollup in commit 56a7ce1:
Other inline items (no-op bool cast, scalar att_masks comment, gradient_checkpointing config, is_pad default flip docstrings, prefix terminator divergence comment) were addressed in the same commit — replied individually above with details. Tests passing locally: |
|
@claude fix all unresolved inline comments. If they are stale, reply explicitly. |
- addresses @claude (mixed-batch sample_has_subgoal coverage): add test_mixed_batch_subgoal_pad_masks_follow_sample_has_subgoal that exercises the per-sample subgoal gating with bsize=2 — sample 0 has a real subgoal, sample 1 is pad-only — and asserts header/image/footer pad_masks follow sample_has_subgoal while att_masks stays broadcast. - addresses @claude (num_cross_att_tokens arithmetic coverage): add test_no_optionals_num_cross_att_tokens_matches_pi05_arithmetic pinning the prefix_embs.shape[1] - _action_indicator_len - discrete_action_max_length subtraction and matching prefix_offsets. - addresses @claude (segments-less _load_subgoal_frames coverage): add test_subgoals_load_without_segments_uses_4s_lookahead which mocks episodes[0]={} (no segments key) and asserts the legacy ~4 s lookahead fallback in _sample_subgoal_frame still fires. - addresses @claude (per-camera hf_dataset transform perf nit): hoist hf_dataset[ep_start + subgoal_frame] out of the per-camera loop in _load_subgoal_frames so multi-image-camera datasets decode the row exactly once instead of N times. Single-camera path is unchanged. tests: passed — pytest -m "not gpu" -n auto tests/policies/test_pi07_paligemma_low_level_planner.py tests/datasets/test_optional_keys.py Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@claude fix the pre-commit CI failures. |
Co-authored-by: Cursor <cursoragent@cursor.com>
…h,
What this does
fix NameError
embed_prefix now only adds response/subgoal/metadata blocks when at least one sample in the batch has real (non-pad) tokens. Unconditional embedding injected spurious causal-block boundaries (att_masks=[1,...]) that corrupted the cumsum for every subsequent token even when all masks were False (response_drop_prob=1.0 / subgoal_drop_prob=1.0).
Subgoal header ("Subgoal: ") and footer (", ") are now inside
if subgoal_images:. They were always-present real tokens that added spurious cross-attention signal and position-ID shifts in the no-subgoal case.num_cross_att_tokens in forward() now subtracts both the "Action: " indicator length and discrete_action_max_length, matching pi05's discrete_action_indicator_max_length logic. Previously the indicator tokens leaked into the action expert's cross-attention at training time but were absent at inference, causing a train/inference mismatch. prefix_offsets updated to match.
prepare_subgoal_images: track last_subgoal_img / last_mask before the empty-cameras fallback loop to avoid NameError when no subgoal keys are present in the batch and empty_cameras > 0.
Issues addressed:
How was it tested
TestPI07LowLevelPlannerObsHistoryRegression
test_single_frame_shape_and_mask_true
Tested: Encoder at T=1 ignores obs_history_is_pad.
Expected: Output (B, num_video_tokens, H); None / all-True / all-False pads → identical outputs.
test_temporal_mask_is_history_padded
Tested: Temporal mask when only the current frame is real.
Expected: Shape (B·N,1,T,T); past keys fully blocked; current self-only; past cannot see current; ≤1 attendable key per row; mask repeated per patch within a sample.
test_temporal_mask_all_valid_is_causal
Tested: Temporal mask with no padded frames.
Expected: Standard causal lower-triangle; row i has i+1 allowed keys.
test_temporal_mask_none_fallback_matches_all_padded
Tested: Explicit “only current real” pad pattern vs the same pattern duplicated.
Expected: Two explicitly built masks are exactly equal (stand-in for the forward None fallback tensor).
test_temporal_mask_mixed_batch_per_sample_independence
Tested: Different pad rows in one batch and a partial-history row.
Expected: Each row’s mask equals solo build on that row; padded vs valid rows differ; partial row has the specified allow/forbid cells.
TestPI07LowLevelPlannerStateEmbedding
test_state_embedding_single_frame
Tested: T=1 state block in prefix.
Expected: Embeddings (1,1,H); state pad all present; embeddings not all zero.
test_state_embedding_single_frame_pad_irrelevant
Tested: T=1 with different obs_history_is_pad.
Expected: Same state pad slice for None / True / False; still all present.
test_state_embedding_all_history_padded
Tested: T>1 with only last frame real.
Expected: Embeddings (1,T,H); only last state slot present; last slot not all zero.
test_state_embedding_all_pad_true_matches_history_padded
Tested: obs_history_is_pad all True vs usual […,False] on last.
Expected: Same state pad slice and same state embeddings (current forced present).
test_state_embedding_all_valid
Tested: T>1, no padding.
Expected: All T slots present; each timestep embedding not all zero.
test_state_embedding_none_fallback
Tested: obs_history_is_pad None vs explicit padded-history pattern.
Expected: Identical state pad slices.
test_state_embedding_mixed_batch
Tested: Three fixed per-row pad patterns.
Expected: State pad rows [F,F,F,T], [T,T,T,T], [F,T,T,T] (presence = not pad, with last forced).
test_forward_path_invokes_state_proj
Tested: Multi-step state projection for the prefix path.
Expected: Exactly one state_proj call with input shape (B,T,D).
test_sample_actions_path_invokes_state_proj
Tested: Single-step state projection.
Expected: Exactly one call with shape (B,1,D).
TestPI07LowLevelPlannerResponseEmbedding
test_empty_response_comma_and_tokens_padded
Tested: Empty response after prepare_response.
Expected: response_masks all false; response comma and response spans in prefix pad all absent.
test_nonempty_response_comma_and_tokens_unmasked
Tested: Non-empty response.
Expected: Some response_masks true; comma span all present; response span equals response_masks.
test_mixed_batch_response_masking
Tested: ["", real text] batch.
Expected: Row0 comma+response absent; row1 comma present; response span equals row masks.
test_response_masking_forward_vs_sample_actions_parity
Tested: Stability and equivalence of preparation; empty vs real differs; missing key vs "".
Expected: Duplicate prepares → equal tokens/masks; omit key equals explicit empty; duplicate embed_prefix pad_masks equal; empty vs real differ in comma and response spans.
TestPI07LowLevelPlannerMetadataEmbedding
Tested: All fields dropped via *_is_pad.
Expected: metadata_masks all false; metadata comma and metadata spans all absent.
test_all_metadata_present_md_comma_unmasked
Tested: All fields present.
Expected: metadata_masks has some true; metadata comma all present; metadata span equals metadata_masks.
test_metadata_is_pad_exhaustive_single_sample (8 combinations)
Tested: Every (speed_pad, quality_pad, mistake_pad) triple vs sample_has_metadata and prefix slices.
Expected: metadata_masks.any() iff not all three dropped; metadata comma matches metadata_masks.any(dim=1) broadcast; metadata span equals metadata_masks.
test_mixed_batch_metadata_masking
Tested: Three-row batch with different pad patterns.
Expected: Comma row masks [absent, present, present] matching per-row any-metadata; each metadata row equals that row’s metadata_masks.
test_metadata_missing_keys_matches_explicit_all_pad
Tested: Only-state batch vs explicit all-dropped metadata batch.
Expected: prepare_metadata outputs identical; full embed_prefix pad_masks identical.
TestPI07LowLevelPlannerSubgoalEmbedding
test_no_subgoal_keys_sg_block_fully_padded
Tested: Batch with only state (no subgoal keys).
Expected: sample_has_subgoal false; subgoal comma, header, and all subgoal vision tokens absent.
test_subgoal_present_sg_block_unmasked
Tested: Both subgoal cameras present and not pad-dropped.
Expected: sample_has_subgoal true; comma+header present; each camera’s vision span equals that camera’s mask broadcast to tokens.
test_subgoal_tensors_all_is_pad_matches_missing_keys
Tested: Tensors + all subgoal_is_pad vs missing keys.
Expected: prepare_subgoal_images outputs pairwise equal; embed_prefix pad_masks equal.
test_mixed_batch_subgoal_masking
Tested: subgoal_is_pad [True, False] with tensors for both rows.
Expected: sample_has_subgoal == [False, True]; row0 subgoal comma/header/vision absent; row1 comma/header present; row1 full vision span present.
test_subgoal_sample_has_or_across_cameras
Tested: One camera all-false mask, one all-true.
Expected: sample_has_subgoal true; subgoal comma present (OR across cameras).
test_omit_subgoal_is_pad_defaults_all_pad
Tested: Subgoal images present but subgoal_is_pad omitted.
Expected: Masks all false; sample_has_subgoal false; subgoal comma span absent.
test_subgoal_embed_prefix_parity
Tested: Same batch prepared twice.
Expected: embed_prefix pad_masks bitwise equal between runs.
Checklist
Note: Before submitting this PR, please read the contributor guideline.