Adding original pi07 with gemma3 backbone and space-time siglip video… #197

Closed
akshay18iitg wants to merge 12 commits into main from feat/pi07

@akshay18iitg
Collaborator

… encoder

What this does

Adds a new pi07 policy built on the Gemma 3 VLM backbone and SpaceTimeSiglip video encoder, replacing the PaliGemma + V-JEPA2 stack from pi07_paligemma. (🗃️ Feature)

New files

| File | Purpose |
| -- | -- |
| src/opentau/policies/pi07/__init__.py | Package init |
| src/opentau/policies/pi07/gemma3_with_expert.py | Gemma3WithExpertConfig / Gemma3WithExpertModel: Gemma 3 VLM interleaved layer-wise with a Gemma-v1 action expert (AdaRMS, shared attention, knowledge insulation) |
| src/opentau/policies/pi07/high_level_planner/ | High-level planner (memory + subtask prediction) using the Gemma 3 backbone |
| src/opentau/policies/pi07/low_level_planner/ | Low-level planner (flow-matching continuous actions + FAST discrete tokens) using Gemma 3 + SpaceTimeSiglip |
| src/opentau/policies/pi07/low_level_planner/video_encoder.py | SpaceTimeSiglipVideoEncoder: wraps Gemma 3's SigLIP vision tower with space-time separable attention for temporal video encoding |
| tests/policies/test_pi07_high_level_planner.py | GPU integration test for the high-level planner (forward + autoregressive inference) |
| tests/policies/test_pi07_low_level_planner.py | GPU integration test for the low-level planner (forward + select_action + 4 regression tests) |

Key architectural changes vs pi07_paligemma

  • Gemma 3 (4B) replaces PaliGemma (3B) as the VLM backbone — larger text hidden size (2560 vs 2048), 34 layers, GQA with per-head Q/K RMSNorm, sliding/global attention pattern
  • SpaceTimeSiglip replaces V-JEPA2 as the video encoder — reuses the Gemma 3 SigLIP vision tower (no weight duplication) with causal temporal self-attention injected at every layer
  • 448×448 image resolution (up from 224×224) — resize_imgs_with_padding defaults updated, Gemma3MultiModalProjector outputs 256 tokens per image
  • proj_width default changed to 1280 to match the larger action expert's hidden_size
  • embed_image rewritten to directly call vision tower + projector, returning a plain tensor (Gemma 3's get_image_features returns BaseModelOutputWithPooling, not a tensor)
  • _lm_head() accessor added — Gemma3ForConditionalGeneration owns lm_head directly (unlike PaliGemma where it's on the text model)
  • SpaceTime short-circuit for non-video inputs — the shared wrapped tower gracefully skips temporal attention when batch_size % num_frames != 0 (e.g. single subgoal images via embed_image)
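As a quick sanity check on the 256-tokens-per-image figure in the list above, the arithmetic can be sketched as follows. This assumes a SigLIP patch size of 14 and a projector that average-pools the square patch grid down to a square token grid; both are assumptions about the Gemma 3 stack rather than facts stated in this PR, and `tokens_after_pooling` is a hypothetical helper.

```python
import math


def tokens_after_pooling(image_size: int, patch_size: int, tokens_per_image: int) -> tuple[int, int, int]:
    """Return (patches_per_side, pool_kernel, tokens_out) for a square image.

    Hypothetical helper: models a projector that average-pools the patch grid
    with kernel == stride so that tokens_per_image tokens remain.
    """
    patches_per_side = image_size // patch_size
    tokens_per_side = math.isqrt(tokens_per_image)
    pool_kernel = patches_per_side // tokens_per_side
    tokens_out = (patches_per_side // pool_kernel) ** 2
    return patches_per_side, pool_kernel, tokens_out


# Assumed patch size of 14; 448x448 is the resolution used in this PR.
print(tokens_after_pooling(448, 14, 256))  # (32, 2, 256)
# The same projector at 896x896 needs a larger pooling kernel for the same token count.
print(tokens_after_pooling(896, 14, 256))  # (64, 4, 256)
```

Under these assumptions the token count per image stays at 256 while the pooling kernel shrinks from 4 to 2, which would be consistent with the projector needing a check at 448.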

How it was tested

  • Added tests/policies/test_pi07_high_level_planner.py: TestPI07HighLevelPlannerIntegration exercises the full training forward pass (MSE == 0, CE finite, mask/position-id layout verification) and autoregressive inference (step count, prefix growth, output shapes).
  • Added tests/policies/test_pi07_low_level_planner.py: TestPI07LowLevelPlannerIntegration exercises training forward (VLM + action expert attention masks, normalize/unnormalize round-trip, loss finiteness) and inference via select_action; TestPI07LowLevelPlannerRegression pins 4 regression cases (prepare_metadata returns tensors, zip strict catches mismatch, suffix att_masks are bool, embed_prefix signature has no defaults).
  • Both test files use a tiny VLM config (2 layers, 512/256 hidden) to fit within 24 GB GPU memory while exercising the full code path.

How to checkout & try? (for the reviewer)

Run the high-level planner integration test

pytest -sx tests/policies/test_pi07_high_level_planner.py -m "gpu and slow"

Run the low-level planner integration test

pytest -sx tests/policies/test_pi07_low_level_planner.py -m "gpu and slow"

Run just the regression tests (faster)

pytest -sx tests/policies/test_pi07_low_level_planner.py::TestPI07LowLevelPlannerRegression -m "gpu and slow"

Checklist

  • I have added Google-style docstrings to important functions and ensured function parameters are typed.
  • My PR includes policy-related changes.
    • If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

Note: Before submitting this PR, please read the contributor guideline.

@akshay18iitg self-assigned this on Apr 29, 2026
@akshay18iitg added the feature (New feature or request) and model (new model or model request) labels and removed the feature label on Apr 29, 2026
Member

Thanks for putting this together — the structural integration of #167 + #178 + #171 is correct, but on closer reading there are a few issues we'd like to flag, and we'll open a fix-up PR (claude/review-pr-197-xg92L → feat/pi07) to help resolve them.

Findings

1. Critical — text-embedding double-scale fix from #178 was not carried over.
PR #178's central architectural correction was that Gemma 3's embed_tokens is a Gemma3TextScaledWordEmbedding which already multiplies by √hidden_size internally, so the pi05-style manual * math.sqrt(emb_dim) must be removed; otherwise text tokens are scaled to ~51× the image-token magnitude and both the bidirectional prefix attention and the FAST/response cross-entropy heads are silently miscalibrated. PR #197 swapped the backbone to Gemma 3 but kept the manual scaling at all 14 sites copied from #167:

  • pi07/low_level_planner/modeling_pi07_low_level.py lines 1169, 1185, 1217, 1228, 1244, 1283, 1296, 1309, 1331
  • pi07/high_level_planner/modeling_pi07_high_level.py lines 921, 933, 946, 965, 980, 1015
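The size of the double-scale error in finding 1 can be illustrated with a toy example. The sketch below uses a plain-Python stand-in for an embedding that already scales by √hidden_size internally; `ScaledEmbedding` is a hypothetical class for illustration, not the transformers implementation, and the embedding values are made up.

```python
import math

HIDDEN_SIZE = 2560  # Gemma 3 (4B) text hidden size quoted in the PR description


class ScaledEmbedding:
    """Toy stand-in for an embedding layer that scales its output internally."""

    def __init__(self, table: dict, hidden_size: int):
        self.table = table
        self.scale = math.sqrt(hidden_size)

    def __call__(self, token_id: int) -> list:
        return [x * self.scale for x in self.table[token_id]]


embed = ScaledEmbedding({0: [0.01, -0.02]}, HIDDEN_SIZE)

correct = embed(0)
# pi05-style manual scaling applied on top of the already-scaled embedding.
double_scaled = [x * math.sqrt(HIDDEN_SIZE) for x in embed(0)]

ratio = double_scaled[0] / correct[0]
print(round(ratio, 1))  # 50.6
```

The extra factor is √2560 ≈ 50.6, which matches the "~51×" magnitude gap quoted in the finding.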

2. Test coverage is GPU-only — none of the architectural invariants from #178/#171 are checked.
PR #178 shipped 26 CPU-only tests (RoPE θ symmetry across the two streams, no-sliding-window enforcement, projector-at-448, vision_image_size invariant, embedding-magnitude invariant). PR #171 shipped 46 CPU-only tests (state-dict invariance, single-frame byte-exact invariance, wrapped-layer indices, temporal-PE boundary). PR #197 has only @pytest.mark.gpu @pytest.mark.slow integration tests — neither the double-scale nor a regression to the RoPE-θ pairing or SigLIP weight sharing would be caught on the CPU CI.

3. available_policies not updated.
src/opentau/__init__.py:152 still reads ["pi0", "pi05", "pi05_mem", "value"]. The new pi07_high_level / pi07_low_level registry entries in factory.py are unreachable from that list, and the tests/test_available.py invariant test won't pick them up.

4. SpaceTimeEncoderLayerWrapper short-circuit can silently mis-trigger.
forward bypasses temporal attention when bt % t != 0 so the same wrapped tower can serve subgoal images via embed_image. But if a caller batches subgoals such that B*K happens to be divisible by num_frames, the temporal causal SDPA will fire over images that aren't temporally related, with no warning. Better to plumb an explicit temporal_active flag (we'll wire one through a context manager on the encoder).
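To make the mis-trigger in finding 4 concrete, here is a minimal sketch contrasting the divisibility heuristic with an explicit flag. `ToyEncoder` and `suppress_temporal` are hypothetical names chosen for illustration; the real wrapper and the context manager planned for the fix-up may look different.

```python
from contextlib import contextmanager


def heuristic_runs_temporal(batch_rows: int, num_frames: int) -> bool:
    """Divisibility heuristic: temporal attention runs whenever rows split evenly into clips."""
    return batch_rows % num_frames == 0


class ToyEncoder:
    """Hypothetical encoder that carries an explicit temporal flag."""

    def __init__(self, num_frames: int):
        self.num_frames = num_frames
        self.temporal_active = True

    @contextmanager
    def suppress_temporal(self):
        """Temporarily disable temporal attention, restoring the old value on exit."""
        previous = self.temporal_active
        self.temporal_active = False
        try:
            yield self
        finally:
            self.temporal_active = previous

    def runs_temporal(self, batch_rows: int) -> bool:
        return self.temporal_active and batch_rows % self.num_frames == 0


num_frames = 4
# B=2 samples with K=2 subgoal images each: 4 unrelated images look like one 4-frame clip.
print(heuristic_runs_temporal(2 * 2, num_frames))  # True, which is the mis-trigger

enc = ToyEncoder(num_frames)
with enc.suppress_temporal():
    print(enc.runs_temporal(2 * 2))  # False
print(enc.runs_temporal(2 * 2))  # True
```

The flag makes the caller's intent explicit, so the result no longer depends on how the batch size happens to divide.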

5. _fix_pytorch_state_dict_keys only matches the new gemma3_with_expert.* prefix.
A user with a pi07_paligemma checkpoint can't warm-start pi07. PR #178 handled the equivalent by accepting both prefixes in its hook.
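A key-remapping hook of the kind finding 5 asks for could look roughly like the sketch below. `remap_legacy_keys` is a hypothetical helper and the example key names are invented; only the two prefixes come from this thread.

```python
OLD_PREFIX = "paligemma_with_expert."
NEW_PREFIX = "gemma3_with_expert."


def remap_legacy_keys(state_dict: dict) -> dict:
    """Return a copy of state_dict with the legacy prefix renamed (hypothetical helper)."""
    remapped = {}
    for key, value in state_dict.items():
        # Rename only the first occurrence so nested names are left untouched.
        new_key = key.replace(OLD_PREFIX, NEW_PREFIX, 1)
        if new_key in remapped:
            raise KeyError(f"duplicate key after remapping: {new_key}")
        remapped[new_key] = value
    return remapped


# Invented key names, with strings standing in for tensors.
legacy = {
    "model.paligemma_with_expert.action_expert.weight": "tensor-a",
    "model.state_proj.weight": "tensor-b",
}
print(sorted(remap_legacy_keys(legacy)))
# ['model.gemma3_with_expert.action_expert.weight', 'model.state_proj.weight']
```

Renaming only fixes key matching. Tensors whose shapes differ between the two backbones (for example, the 2048 vs 2560 hidden size) would presumably still need to be skipped or re-initialised.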

Tier-3 cosmetic: pi07/gemma3_with_expert.py:418 says "fa2 attention is not supported for pi06 yet" (copy-paste leftover, should say "pi07"); the embed_image docstring claims Gemma3ForConditionalGeneration.get_image_features returns a BaseModelOutputWithPooling — in current transformers it returns the projected tensor directly.

Open question: spacetime_layer_stride defaults to 1 in PI07LowLevelPlannerConfig (every SigLIP layer wrapped) vs 4 in #171 (every 4th, MEM paper). Was wrapping every layer intentional? We'll leave the default unchanged in the fix-up PR and just call it out.

The fix-up PR will land Tier-1 and Tier-2 fixes plus ported CPU tests, leaving GPU integration unchanged. We'll link it here once it's open.


Generated by Claude Code

Member

Fix-up PR opened: #198 (draft, targets feat/pi07).

Summary of what landed there:

  • Tier 1 — correctness (blocker). Removes the manual * math.sqrt(*_dim) double-scale at all 14 sites in pi07/{low,high}_level_planner/modeling_pi07_*.py.
  • Tier 2. pi07_high_level / pi07_low_level added to available_policies (+ tests/test_available.py); bt % t != 0 short-circuit replaced with an explicit suppress_spacetime_temporal(...) context manager wired into embed_image; paligemma_with_expert.* → gemma3_with_expert.* rewrite added to _fix_pytorch_state_dict_keys for both planners.
  • Tier 3. pi06 → pi07 in the fa2 NotImplementedError; embed_image docstring corrected.
  • Tests. 8 CPU tests for the Gemma 3 backbone (test_pi07_cpu.py) + 17 CPU tests for the SpaceTime SigLIP video encoder (test_pi07_video_encoder_cpu.py); all 27 pass locally and pre-commit is clean.

The one open question — spacetime_layer_stride default of 1 vs. 4 (#171 / MEM-paper) — is left unchanged in #198. Happy to flip in a follow-up if you confirm 1 was intentional.


Generated by Claude Code

@shuheng-liu
Member

shuheng-liu commented Apr 29, 2026

@claude spacetime_layer_stride should default to 4.

@shuheng-liu
Member

Updated #198 to use 4 as the default spacetime_layer_stride, in accordance with the pi-mem paper.

Contributor

@claude claude Bot left a comment

Superseded — see follow-up review and summary comment below.

Contributor

@claude claude Bot left a comment

[claude-review] Spot review — see PR-level summary comment for details.

Comment thread src/opentau/policies/pi07/low_level_planner/video_encoder.py Outdated
@claude
Contributor

claude Bot commented Apr 29, 2026

[claude-review] summary for commit e24ba88

Re-reviewing on commit e24ba88 fix(pi07): align text attention with paper §VI.B + Fig. 19 (#236). New since the previously-reviewed 62a5677: two upstream commits (6f256ec SDPA + gradient checkpointing, e24ba88 attention-layout fix in pi07_paligemma) plus the [claude-fix] address review feedback commit b383089. Diff vs 62a5677 is +904/-243 across 12 files. Substantive changes in gemma3_with_expert.py (SDPA forward + _run_layer extraction for ckpt), the two pi07 configurations (config plumbing into vlm_config), pi07_paligemma/{high,low}_level_planner (paper §VI.B + Fig. 19 layout reorder), plus matching tests.

Previous review carryover resolved.

  • The five suggestions from the 62a5677 review remain resolved.
  • Two of the four new findings on 62a5677 are addressed in b383089/e24ba88:
    • HL planner gate semantics tightened to metadata_tokens is not None and metadata_masks.any() (pi07/high_level_planner/modeling_pi07_high_level.py:986-989), matching LL behavior.
    • Batch-wide gate documented in pi07/low_level_planner/modeling_pi07_low_level.py:1165-1175 docstring.

Carryover from earlier reviews still applicable.

  • The all-optional-blocks layout in tests/policies/test_pi07_low_level_planner.py:198-222 still hard-codes prefix_end + comma; a GPU regression for the no-optionals path remains open.
  • No CPU test pins the RoPE-θ pairing across the two streams (pi07/gemma3_with_expert.py apply_rope sites); a future regression there would only be caught under GPU integration.
  • test_pi06.py:516-532 actions stats reshape still flattens per-step variance — small fixture-quality nit.

New findings on this round.

  • suggestion: src/opentau/policies/pi07/low_level_planner/modeling_pi07_low_level.py:1257 (and 1274/1287/1305/1383/1411). Commit e24ba88 is titled fix(pi07): align text attention with paper §VI.B + Fig. 19 but the diff only touches the legacy pi07_paligemma package. The new gemma3-backbone pi07 policy that this PR primarily ships still emits [0] * N (bidirectional) for language / State: / state tokens / state-end / subgoal-end / ";\n " and [1] + [0] * (N-1) (prefix-LM) for the response span, which is exactly the pre-fix pattern. The pi07_paligemma diff also reorders the prefix so subgoals sit AFTER metadata + ";\n " (Fig. 19); the gemma3 prefix here keeps the old order. Worth confirming this split is intentional vs. an oversight in cherry-picking.
  • suggestion: src/opentau/policies/pi07/high_level_planner/modeling_pi07_high_level.py:982 and :1017. Same point as LL: language and ";\n " blocks are still bidirectional, while pi07_paligemma/.../modeling_pi07_high_level.py:932,962 were flipped to [1] * N per-token causal in e24ba88.
  • suggestion: src/opentau/policies/pi07/{high_level_planner,low_level_planner}/configuration_pi07_*.py:203-206. The __post_init__ plumbing uses asymmetric "non-default → propagate" logic. Setting --policy.attention_implementation=eager or --policy.gradient_checkpointing=False to explicitly revert a non-default vlm_config.* value silently no-ops. Documented as intentional via test_post_init_preserves_explicit_vlm_config_when_policy_default, but worth flagging as a footgun for sweep-style overrides.
  • suggestion: tests/policies/test_pi07_cpu.py:929 (TestPi07GradCkptEquivalence.test_grad_ckpt_forward_matches_no_ckpt). Forward-only check; torch.utils.checkpoint.checkpoint is a no-op in forward and recomputes the body in backward, so saved-tensor-hook / RNG bugs in the _run_layer extraction would only manifest after loss.backward(). Per CLAUDE.md's deterministic-training rule, a loss.backward() + per-param grad-equality check would actually exercise the recompute path.
  • suggestion: tests/policies/test_pi07_paligemma_attention_layout.py. Tests exercise make_att_2d_masks with hand-coded att_masks arrays, not the post-fix embed_prefix output. A regression flipping att_masks += [1] * num_lang_embs back to [0] * num_lang_embs inside embed_prefix would still pass every assertion here. Pinning the call site (instantiate planner / shape-only stand-in, run embed_prefix, assert emitted att_masks) would close that gap.
  • nit: src/opentau/policies/pi07/gemma3_with_expert.py:597 (sdpa_attention_forward). is_causal=False is correct given the precise bool mask, and the test pins fp32 eager↔SDPA equivalence to 1e-4. There is no bf16 equivalence test (the actual production dtype), so the A100/H100 + bf16 path is unverified by CI. Not blocking; the math backend on CPU is fp32 either way, but a bf16 allclose with looser tolerance on a GPU runner would round out the coverage.
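For readers unfamiliar with the att_masks patterns named in the findings above, here is a pure-Python sketch of how a 1-D att_masks list expands into a 2-D attention mask. It assumes the cumulative-sum convention commonly used by pi0-style make_att_2d_masks helpers (a 1 starts a new attention block); the function below is an illustrative reimplementation and the repository's version may differ in detail.

```python
from itertools import accumulate


def make_att_2d_mask(att_masks: list) -> list:
    """Expand 1-D block markers into a 2-D mask (illustrative reimplementation).

    Query token q may attend to key token k iff cumsum[k] <= cumsum[q].
    """
    cumsum = list(accumulate(att_masks))
    n = len(cumsum)
    return [[cumsum[k] <= cumsum[q] for k in range(n)] for q in range(n)]


def show(mask: list) -> list:
    """Render each query row as a string: 'x' = may attend, '.' = masked."""
    return ["".join("x" if v else "." for v in row) for row in mask]


# [0] * N: one block, fully bidirectional.
print(show(make_att_2d_mask([0, 0, 0])))     # ['xxx', 'xxx', 'xxx']
# Two prefix tokens followed by a [1] + [0] * (N-1) span (prefix-LM).
print(show(make_att_2d_mask([0, 0, 1, 0])))  # ['xx..', 'xx..', 'xxxx', 'xxxx']
# [1] * N: every token starts its own block, i.e. per-token causal.
print(show(make_att_2d_mask([1, 1, 1])))     # ['x..', 'xx.', 'xxx']
```

Under this convention, flipping a block from [0] * N to [1] * N changes it from bidirectional to per-token causal, which is the difference the suggestions above point at.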

No blocking issues found.

Contributor

@claude claude Bot left a comment

[claude-review] Spot review for commit 3c5ee90 — see PR-level summary comment for the full picture.

Comment thread src/opentau/policies/pi07/low_level_planner/modeling_pi07_low_level.py Outdated
Comment thread src/opentau/policies/pi07/low_level_planner/modeling_pi07_low_level.py Outdated
Comment thread src/opentau/policies/pi07/high_level_planner/modeling_pi07_high_level.py Outdated
@shuheng-liu
Member

FYI: this will fix #171

@akshay18iitg
Collaborator Author

@claude fix on all the agreed points

- addresses @claude[bot] (low-level is_pad defaults, low_level_planner
  modeling_pi07_low_level.py:984-986): switched the *_is_pad fallbacks to
  torch.ones to match the high-level planner — missing speed/quality/mistake
  no longer fabricate "Speed: 0.0" entries in the prompt.
- addresses @claude[bot] (prepare_metadata docstrings): updated both
  planners' docstrings to reference Gemma 3 (not PaliGemma) and to enumerate
  robot_type / control_mode (string-valued, empty-string-as-pad).
- addresses @claude[bot] (high-level all-empty guard,
  high_level_planner/modeling_pi07_high_level.py:715): mirrored the
  low-level "if segments else ''" guard so an all-padded sample emits ""
  instead of the literal "Metadata: ". Both planners now agree.
- addresses @claude[bot] (CPU coverage for prepare_metadata): added
  TestPrepareMetadataSegments to tests/policies/test_pi07_cpu.py covering
  (a) robot+control populated, (b) both absent, (c) one populated and one
  empty, (d) per-sample all-empty emits "" (regression for both planners),
  and (e) low-level missing speed/quality/mistake never produces a
  fabricated "Speed: 0.0" segment.

tests: passed — pytest -m "not gpu" tests/policies/test_pi07_cpu.py

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
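The two prepare_metadata behaviours described in the commit message above (pad-by-default and the all-empty guard) can be sketched in a few lines. `build_metadata_prompt` is a hypothetical stand-in, and the "name: value" segment format and comma separator are assumptions; only the "Metadata: " literal, the "Speed: 0.0" example, and the "if segments else ''" guard come from the message.

```python
def build_metadata_prompt(fields: dict, is_pad: dict) -> str:
    """Build a metadata prompt string (hypothetical stand-in for prepare_metadata).

    A field missing from is_pad is treated as padded, mirroring the
    torch.ones fallback for the *_is_pad tensors.
    """
    segments = [
        f"{name}: {value}"
        for name, value in fields.items()
        if not is_pad.get(name, True)
    ]
    # All-empty guard: emit "" rather than a bare "Metadata: " header.
    return ("Metadata: " + ", ".join(segments)) if segments else ""


# No pad information supplied: nothing is fabricated.
print(repr(build_metadata_prompt({"Speed": 0.0}, {})))  # ''
# Only Speed is marked as present.
print(build_metadata_prompt({"Speed": 0.5, "Quality": 1.0}, {"Speed": False}))
# Metadata: Speed: 0.5
```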
@shuheng-liu
Member

FYI, the failed CI is fixable by merging the latest main into this branch.

Contributor

@claude claude Bot left a comment

[claude-review] Spot review for commit 2949e73 — see PR-level summary comment for the full picture.

Comment thread src/opentau/datasets/lerobot_dataset.py
Comment thread src/opentau/datasets/lerobot_dataset.py
Comment thread src/opentau/policies/pi07/low_level_planner/modeling_pi07_low_level.py Outdated
@shuheng-liu
Member

@claude fix the remaining items in the review.

- addresses @claude[bot] (subgoal opt-in docstring): updated _load_subgoal_frames
  docstring to match the always-on behaviour; replaced the stale
  test_missing_subgoals_key_in_info_returns_empty test (which encoded the old
  opt-in gate) with two pinned cases — no-cameras → {}, and "no info.subgoals
  key still loads subgoals". Added camera_keys / image_keys / episode_data_index
  attrs to the SimpleNamespace meta in the two existing video tests so they
  match the new attribute reads.
- addresses @claude[bot] (image-dtype fallback row index): new
  test_image_dtype_fallback_uses_absolute_row_index in tests/datasets/
  test_optional_keys.py stubs hf_dataset.__getitem__ and pins that the
  parquet-row lookup uses ep_start + subgoal_frame, never the within-episode
  index.
- addresses @claude[bot] ("fully 0 padded" comment): rewrote the comment at
  modeling_pi07_low_level.py:946 to call out -1 ([-1, 1] SigLIP range) and the
  False-mask role of the placeholder.
- addresses @claude[bot] (_action_indicator_len recompute): cached the
  Action-indicator length at PI07LowLevelPlannerFlowMatching.__init__; both
  forward sites now read self._action_indicator_len.
- addresses @claude[bot] (embed_prefix CPU coverage): added
  TestEmbedPrefixConditionalGuards in tests/policies/test_pi07_cpu.py with a
  fake Gemma3WithExpert + tokenizer + state_proj + embed_video so the three
  guards (response_masks.any(), subgoal availability, metadata_masks.any()) are
  exercised without GPU. Cases: all-False optional masks → no spurious causal
  boundary; mixed-availability subgoal batch → header/footer pad mask zeroes
  the pad-only sample; response.any() → exactly one boundary.
- addresses @claude[bot] (.base_layer. stale comment in video_encoder.py):
  rewrote the wrap comment at video_encoder.py:422 to explain that the wrapper
  adopts submodules by reference, so wrapped-layer state-dict keys are
  byte-for-byte identical to a vanilla SiglipEncoderLayer (no .base_layer.
  prefix).

tests: passed — pytest tests/policies tests/datasets -m "not gpu" -n auto
(452 passed, 12 skipped; the 1 collection error in
tests/policies/test_pi07_paligemma_low_level_planner.py is pre-existing and
unrelated — pi07_paligemma still imports VJEPA2VideoEncoder from pi05_mem,
which was removed in #171).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
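The row-index point in the image-dtype fallback item above comes down to simple offset arithmetic. The sketch below uses invented episode boundaries, and `absolute_row` is a hypothetical helper rather than code from the dataset class.

```python
def absolute_row(episode_start: int, frame_in_episode: int) -> int:
    """Map a within-episode frame index to a dataset-wide row index (hypothetical helper)."""
    return episode_start + frame_in_episode


# Invented episode start offsets for three episodes.
episode_starts = [0, 120, 310]
episode_index, subgoal_frame = 1, 45

print(absolute_row(episode_starts[episode_index], subgoal_frame))  # 165
# Using subgoal_frame alone (45) would read a row from episode 0 instead.
```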
Contributor

@claude claude Bot left a comment

[claude-review] Spot review for commit 62a5677 — see PR-level summary comment for the full picture.

Comment thread src/opentau/policies/pi07/high_level_planner/modeling_pi07_high_level.py Outdated
Comment thread tests/policies/test_pi07_low_level_planner.py
Comment thread tests/policies/test_pi06.py Outdated
@shuheng-liu
Copy link
Copy Markdown
Member

@claude fix according to the new review

- addresses @claude (LL planner doc): added a docstring note in
  embed_prefix that has_any_optional is computed batch-wide so
  per-sample optional layout follows the batch-level OR (kept
  intentionally for prefix-length uniformity).
- addresses @claude (HL planner gate): tightened
  embed_prefix metadata gate from `metadata_tokens is not None`
  to `metadata_masks.any()` to match the low-level planner's
  semantics; lets training cleanly drop both the metadata block
  and the trailing ";\n " prefix-end when an entire batch is
  metadata-free.
- addresses @claude (GPU regression test): added
  test_no_optionals_path_on_real_gemma3 in
  TestPI07LowLevelPlannerIntegration that exercises empty
  response + all-padded metadata + subgoal_is_pad on the real
  Gemma 3 backbone, pinning the collapsed prefix length and
  the position of the first causal boundary.
- addresses @claude (test_pi06 stats reshape): replaced
  `np.full((chunk_size, 32), float(stats.flatten()[0]))` with
  `np.tile(stats[:1], (config.chunk_size, 1))` so per-feature
  variance in future fixtures is preserved instead of collapsed
  to a scalar.

tests: passed -- python -m pytest -m "not gpu" tests/policies/test_pi07_cpu.py tests/policies/test_pi07_video_encoder_cpu.py tests/policies/test_pi06.py

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
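The difference between the np.full and np.tile forms in the test_pi06 item above can be seen on a small array. This sketch assumes stats is a 2-D array with one row of per-feature values, and uses 4 features and a chunk size of 3 instead of the fixture's real sizes.

```python
import numpy as np

chunk_size = 3
# Assumed shape (1, num_features): one value per action feature.
stats = np.array([[0.1, 0.5, 2.0, 4.0]])

# Old form: every entry becomes the first scalar, so per-feature differences are lost.
collapsed = np.full((chunk_size, stats.shape[1]), float(stats.flatten()[0]))

# New form: the first row is repeated for each step, keeping per-feature values.
tiled = np.tile(stats[:1], (chunk_size, 1))

print(collapsed.shape, tiled.shape)  # (3, 4) (3, 4)
print(collapsed[0].tolist())         # [0.1, 0.1, 0.1, 0.1]
print(tiled[0].tolist())             # [0.1, 0.5, 2.0, 4.0]
```

Both forms produce the same shape; only the tiled version keeps the features distinguishable.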
Contributor

@claude claude Bot left a comment

[claude-review] Spot review for commit e24ba88 — see PR-level summary comment for the full picture.

Comment thread src/opentau/policies/pi07/low_level_planner/modeling_pi07_low_level.py Outdated
Comment thread src/opentau/policies/pi07/high_level_planner/modeling_pi07_high_level.py Outdated
Comment thread tests/policies/test_pi07_cpu.py
Comment thread tests/policies/test_pi07_paligemma_attention_layout.py Outdated
@shuheng-liu
Member

To save Claude context, I'll open a fresh PR.

@shuheng-liu closed this on May 4, 2026
shuheng-liu added a commit that referenced this pull request May 4, 2026
Brings 16 main commits onto feat/pi07 in preparation for opening the
feat/pi07 → main PR. One conflict (tests/policies/test_pi06.py:
np.tile vs np.full for chunk-size action stats) — kept HEAD's np.tile
fix from claude review feedback on #197, which preserves per-feature
variance instead of collapsing every step to stats.flatten()[0].

Also dropped stale --ignore=tests/policies/test_pi07_paligemma_low_level_planner.py
from .github/workflows/cpu_test.yml and gpu_test.yml: main deleted
the file in #234, but the --ignore line on feat/pi07 (added in #229)
was on a different part of the file so git auto-merged without
flagging it. Removing both now keeps the workflows clean.
Labels

model new model or model request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

V-JEPA2 video encoder receives unnormalized [0,1] pixels — expects ImageNet normalization

2 participants