Adding original pi07 with gemma3 backbone and space-time siglip video…#197
Adding original pi07 with gemma3 backbone and space-time siglip video…#197akshay18iitg wants to merge 12 commits into
Conversation
|
Thanks for putting this together — the structural integration of #167 + #178 + #171 is correct, but on closer reading there are a few issues we'd like to flag, and we'll open a fix-up PR ( Findings1. Critical — text-embedding double-scale fix from #178 was not carried over.
2. Test coverage is GPU-only — none of the architectural invariants from #178/#171 are checked. 3. 4. 5. Tier-3 cosmetic: Open question: The fix-up PR will land Tier-1 and Tier-2 fixes plus ported CPU tests, leaving GPU integration unchanged. We'll link it here once it's open. Generated by Claude Code |
|
Fix-up PR opened: #198 (draft, targets Summary of what landed there:
The one open question — Generated by Claude Code |
|
@claude |
|
Updated #198 to use 4 as the default |
|
[claude-review] summary for commit e24ba88 Re-reviewing on commit Previous review carryover resolved.
Carryover from earlier reviews still applicable.
New findings on this round.
No blocking issues found. |
|
FYI: this will fix #171 |
|
@claude fix on all the agreed points |
- addresses @claude[bot] (low-level is_pad defaults, low_level_planner modeling_pi07_low_level.py:984-986): switched the *_is_pad fallbacks to torch.ones to match the high-level planner — missing speed/quality/mistake no longer fabricate "Speed: 0.0" entries in the prompt. - addresses @claude[bot] (prepare_metadata docstrings): updated both planners' docstrings to reference Gemma 3 (not PaliGemma) and to enumerate robot_type / control_mode (string-valued, empty-string-as-pad). - addresses @claude[bot] (high-level all-empty guard, high_level_planner/modeling_pi07_high_level.py:715): mirrored the low-level "if segments else ''" guard so an all-padded sample emits "" instead of the literal "Metadata: ". Both planners now agree. - addresses @claude[bot] (CPU coverage for prepare_metadata): added TestPrepareMetadataSegments to tests/policies/test_pi07_cpu.py covering (a) robot+control populated, (b) both absent, (c) one populated and one empty, (d) per-sample all-empty emits "" (regression for both planners), and (e) low-level missing speed/quality/mistake never produces a fabricated "Speed: 0.0" segment. tests: passed — pytest -m "not gpu" tests/policies/test_pi07_cpu.py Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
FYI, the failed CI is fixable my merging latest main into this branch. |
|
@claude fix the remaining items in the review. |
- addresses @claude[bot] (subgoal opt-in docstring): updated _load_subgoal_frames docstring to match the always-on behaviour; replaced the stale test_missing_subgoals_key_in_info_returns_empty test (which encoded the old opt-in gate) with two pinned cases — no-cameras → {}, and "no info.subgoals key still loads subgoals". Added camera_keys / image_keys / episode_data_index attrs to the SimpleNamespace meta in the two existing video tests so they match the new attribute reads. - addresses @claude[bot] (image-dtype fallback row index): new test_image_dtype_fallback_uses_absolute_row_index in tests/datasets/ test_optional_keys.py stubs hf_dataset.__getitem__ and pins that the parquet-row lookup uses ep_start + subgoal_frame, never the within-episode index. - addresses @claude[bot] ("fully 0 padded" comment): rewrote the comment at modeling_pi07_low_level.py:946 to call out -1 ([-1, 1] SigLIP range) and the False-mask role of the placeholder. - addresses @claude[bot] (_action_indicator_len recompute): cached the Action-indicator length at PI07LowLevelPlannerFlowMatching.__init__; both forward sites now read self._action_indicator_len. - addresses @claude[bot] (embed_prefix CPU coverage): added TestEmbedPrefixConditionalGuards in tests/policies/test_pi07_cpu.py with a fake Gemma3WithExpert + tokenizer + state_proj + embed_video so the three guards (response_masks.any(), subgoal availability, metadata_masks.any()) are exercised without GPU. Cases: all-False optional masks → no spurious causal boundary; mixed-availability subgoal batch → header/footer pad mask zeroes the pad-only sample; response.any() → exactly one boundary. - addresses @claude[bot] (.base_layer. stale comment in video_encoder.py): rewrote the wrap comment at video_encoder.py:422 to explain that the wrapper adopts submodules by reference, so wrapped-layer state-dict keys are byte-for-byte identical to a vanilla SiglipEncoderLayer (no .base_layer. prefix). tests: passed — pytest tests/policies tests/datasets -m "not gpu" -n auto (452 passed, 12 skipped; the 1 collection error in tests/policies/test_pi07_paligemma_low_level_planner.py is pre-existing and unrelated — pi07_paligemma still imports VJEPA2VideoEncoder from pi05_mem, which was removed in #171). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@claude fix according to the new review |
- addresses @claude (LL planner doc): added a docstring note in embed_prefix that has_any_optional is computed batch-wide so per-sample optional layout follows the batch-level OR (kept intentionally for prefix-length uniformity). - addresses @claude (HL planner gate): tightened embed_prefix metadata gate from `metadata_tokens is not None` to `metadata_masks.any()` to match the low-level planner's semantics; lets training cleanly drop both the metadata block and the trailing ";\n " prefix-end when an entire batch is metadata-free. - addresses @claude (GPU regression test): added test_no_optionals_path_on_real_gemma3 in TestPI07LowLevelPlannerIntegration that exercises empty response + all-padded metadata + subgoal_is_pad on the real Gemma 3 backbone, pinning the collapsed prefix length and the position of the first causal boundary. - addresses @claude (test_pi06 stats reshape): replaced `np.full((chunk_size, 32), float(stats.flatten()[0]))` with `np.tile(stats[:1], (config.chunk_size, 1))` so per-feature variance in future fixtures is preserved instead of collapsed to a scalar. tests: passed -- python -m pytest -m "not gpu" tests/policies/test_pi07_cpu.py tests/policies/test_pi07_video_encoder_cpu.py tests/policies/test_pi06.py Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
To save Claude context, I'll open a fresh PR. |
Brings 16 main commits onto feat/pi07 in preparation for opening the feat/pi07 → main PR. One conflict (tests/policies/test_pi06.py: np.tile vs np.full for chunk-size action stats) — kept HEAD's np.tile fix from claude review feedback on #197, which preserves per-feature variance instead of collapsing every step to stats.flatten()[0]. Also dropped stale --ignore=tests/policies/test_pi07_paligemma_low_level_planner.py from .github/workflows/cpu_test.yml and gpu_test.yml: main deleted the file in #234, but the --ignore line on feat/pi07 (added in #229) was on a different part of the file so git auto-merged without flagging it. Removing both now keeps the workflows clean.
… encoder
What this does
Adds a new
pi07policy built on the Gemma 3 VLM backbone and SpaceTimeSiglip video encoder, replacing the PaliGemma + V-JEPA2 stack frompi07_paligemma. (🗃️ Feature)New files
Key architectural changes vs
pi07_paligemmaresize_imgs_with_paddingdefaults updated,Gemma3MultiModalProjectoroutputs 256 tokens per imageproj_widthdefault changed to 1280 to match the larger action expert'shidden_sizeembed_imagerewritten to directly call vision tower + projector, returning a plain tensor (Gemma 3'sget_image_featuresreturnsBaseModelOutputWithPooling, not a tensor)_lm_head()accessor added —Gemma3ForConditionalGenerationownslm_headdirectly (unlike PaliGemma where it's on the text model)batch_size % num_frames != 0(e.g. single subgoal images viaembed_image)## How it was tested Added tests/policies/test_pi07_high_level_planner.py — TestPI07HighLevelPlannerIntegration exercises the full training forward pass (MSE == 0, CE finite, mask/position-id layout verification) and autoregressive inference (step count, prefix growth, output shapes) Added tests/policies/test_pi07_low_level_planner.py — TestPI07LowLevelPlannerIntegration exercises training forward (VLM + action expert attention masks, normalize/unnormalize round-trip, loss finiteness) and inference via select_action; TestPI07LowLevelPlannerRegression pins 4 regression cases (prepare_metadata returns tensors, zip strict catches mismatch, suffix att_masks are bool, embed_prefix signature has no defaults) Both test files use a tiny VLM config (2 layers, 512/256 hidden) to fit within 24 GB GPU memory while exercising the full code path Examples: - Added `test_something` in `tests/test_stuff.py`. - Added `new_feature` and checked that training converges with policy X on dataset/environment Y. - Optimized `some_function`, it now runs X times faster than previously.
How to checkout & try? (for the reviewer)
Run the high-level planner integration test
Run the low-level planner integration test
Run just the regression tests (faster)
Checklist
Note: Before submitting this PR, please read the contributor guideline.