feat(policies): add authentic pi06 policy with Gemma 3 4B backbone #178
Merged
Adds a new `pi06` policy that ports the π0.6 architecture from Physical
Intelligence (Model Card, Nov 17 2025; arXiv:2511.14759) into OpenTau.
Relative to `pi05` (PaliGemma-3B, 224×224, 10-step flow matching), `pi06`:
* swaps the backbone to Gemma 3 4B (34 interleaved sliding-window/global
layers, SigLIP-400m/14, head_dim 256, GQA with 4 KV heads);
* enlarges the action expert to ~860M params so it matches the backbone
depth (34 Gemma-v1 layers, hidden 1280, intermediate 5120, AdaRMS);
* raises the default image resolution to 448×448;
* halves the default flow-matching schedule to 5 denoising steps.
Training recipe (FAST discrete action co-training, flow matching, Knowledge
Insulation gradient-stop, Beta(1.5, 1.0) time sampler, block-causal prefix
with bidirectional action suffix) is unchanged from `pi05`.
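The "block-causal prefix with bidirectional action suffix" pattern can be pictured with a small mask builder. This is only a sketch under stated assumptions: `block_causal_mask` and the `block_ids` layout are hypothetical names chosen for illustration and are not taken from the `pi05`/`pi06` code. The rule shown is that a token may attend to every token in its own block and in any earlier block.

```python
def block_causal_mask(block_ids):
    """Return mask[i][j] == True when query token i may attend to key token j.

    Tokens sharing a block id attend to each other bidirectionally;
    across blocks, attention only flows from later blocks to earlier ones.
    """
    return [[key_block <= query_block for key_block in block_ids]
            for query_block in block_ids]


# Hypothetical layout: two prefix blocks followed by one action-suffix block.
block_ids = [0, 0, 0, 1, 1, 2, 2, 2]
mask = block_causal_mask(block_ids)

# Print the mask: "x" means attention is allowed, "." means it is blocked.
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Rows belonging to the last block are fully populated, which is what makes the suffix bidirectional among its own tokens while still seeing the whole prefix.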
Implementation notes:
* `gemma3_with_expert.py` runs a per-layer interleaved attention loop that
concatenates backbone and expert Q/K/V along the sequence axis, honouring
Gemma 3's q_norm/k_norm, per-layer local vs global RoPE theta, and
sliding-window masks. The existing Gemma-v1 AdaRMS monkey-patches in
`transformers_patch.py` cover the expert; the Gemma 3 backbone runs
stock (no new patches needed).
* `modeling_pi06.py` mirrors `modeling_pi05.py` structure — new tokenizer
(`google/gemma-3-4b-pt`), new `_fix_pytorch_state_dict_keys` that also
accepts legacy `paligemma_with_expert.*` prefixes as a warm-start path.
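A key-renaming hook of the kind described for `_fix_pytorch_state_dict_keys` might look roughly like the sketch below. The legacy prefix comes from the text above; `NEW_PREFIX` and `remap_legacy_keys` are hypothetical names introduced only for this illustration, and the real hook may handle more cases.

```python
LEGACY_PREFIX = "paligemma_with_expert."
NEW_PREFIX = "gemma3_with_expert."  # hypothetical target prefix, for illustration only


def remap_legacy_keys(state_dict):
    """Return a copy of `state_dict` with the legacy prefix rewritten once per key."""
    remapped = {}
    for key, value in state_dict.items():
        if LEGACY_PREFIX in key:
            # Replace only the first occurrence so nested names stay untouched.
            key = key.replace(LEGACY_PREFIX, NEW_PREFIX, 1)
        remapped[key] = value
    return remapped


legacy = {
    "model.paligemma_with_expert.layers.0.weight": 1,
    "model.action_head.bias": 2,
}
converted = remap_legacy_keys(legacy)
```

Keys without the legacy prefix pass through unchanged, so the same function can be applied to both old and new checkpoints.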
Other changes:
* Register `pi06` in `policies/factory.py` and `opentau.available_policies`.
* Add `configs/examples/pi06_training_config.json` to bootstrap runs.
* Update `README.md` with a pi06 bullet, comparison-table row, checkpoints
placeholder, and a link to the new config.
* Add `tests/policies/test_pi06.py` covering attention-mask block semantics,
sliding-window masks, RoPE theta selection, padding-mask contiguity,
image resizing, and discrete-action padding.
https://claude.ai/code/session_01MibvjbcZo38nxrx6n9giLi
…ant test `tests/test_available.py::test_available_policies` hardcodes the list of known policy classes and asserts its set of `name`s matches `opentau.available_policies`. The previous commit added `pi06` to the latter without updating the former, breaking the CPU CI run. Picking up `PI06Policy` keeps the invariant satisfied. https://claude.ai/code/session_01MibvjbcZo38nxrx6n9giLi
Per reviewer feedback. Drops the long `# ---` lines that bracketed section headings (now just the heading as a single-line comment), and collapses the `# --- text ---` inline variants to `# text.`. No functional change — ruff-format, ruff-check, and the full pre-commit hook battery (pyupgrade, typos, bandit, gitleaks, etc.) all pass on the touched files. https://claude.ai/code/session_01MibvjbcZo38nxrx6n9giLi
Fixes three distinct correctness issues in the Gemma 3 4B backbone +
Gemma-v1 action expert wiring that a careful second pass uncovered:
1. Vision projector crashes at 448×448. `Gemma3MultiModalProjector`
hard-codes `patches_per_image = image_size // patch_size`, so the
default config's `vision_config.image_size = 896` made the
projector reshape `(B, 1024, 64, 64)` when SigLIP actually emits
`B × 1024` patches at 448 input — a runtime crash on the first
forward pass. Set `image_size = 448` to match π0.6's stated
resolution (→ 32 patches/side, 256 mm tokens/image, which also
matches the model card).
2. RoPE θ asymmetry between backbone and expert for global layers.
Gemma 3 interleaves local layers (θ = 10 000) with global layers
(θ = 1 000 000). The expert's own `rope_theta = 10 000` was being
applied to its Q/K even on global layers, so the shared
cross-attention ran backbone-Q (rotated at 1M) against expert-K
(rotated at 10k) — the `q·R(Δpos)·k` invariant breaks when the two
rotations live in different RoPE bases. Fix: both streams now use
the backbone's per-layer θ; the expert's fallback θ is ignored at
runtime and documented as such.
3. Sliding-window mask used dense indices instead of absolute
positions. During expert cross-attention the Q tokens sit at
`prefix_offsets + chunk_idx`, but `_build_sliding_window_mask`
built its mask from `torch.arange(seq_len)` and the call site
sliced `[:, :T_suffix, :]`. Result: every prefix key farther than
`window` from the dense suffix row index was dropped — i.e. the
sliding layers saw essentially no prefix during cross-attention.
Fix: take `(query_positions, key_positions)` and compute
`|pos_q - pos_k| < window` over absolute positions; stash the
prefix positions in the KV cache so the expert can reconstruct
them on each denoising step.
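The difference between dense row indices and absolute positions in fix 3 can be seen with a toy mask. This is a minimal sketch, not the repository's `_build_sliding_window_mask`: the function name `sliding_window_mask`, the tiny window, and the example positions are assumptions made for illustration; only the `|pos_q - pos_k| < window` rule comes from the description above.

```python
def sliding_window_mask(query_positions, key_positions, window):
    """mask[i][j] is True when |pos_q[i] - pos_k[j]| < window."""
    return [[abs(q - k) < window for k in key_positions]
            for q in query_positions]


window = 4
prefix_keys = list(range(10))   # prefix tokens at absolute positions 0..9
suffix_absolute = [10, 11]      # suffix queries sit right after the prefix
suffix_dense = [0, 1]           # the same queries indexed by dense row number

absolute = sliding_window_mask(suffix_absolute, prefix_keys, window)
dense = sliding_window_mask(suffix_dense, prefix_keys, window)

# Which prefix keys does the first suffix query see under each indexing?
visible_absolute = [k for k, ok in zip(prefix_keys, absolute[0]) if ok]
visible_dense = [k for k, ok in zip(prefix_keys, dense[0]) if ok]
print(visible_absolute, visible_dense)
```

With absolute positions the first suffix query sees the prefix keys closest to it; with dense row indices it sees a different, unrelated set of keys.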
Tests:
* `TestSlidingWindowMask.test_cross_attention_uses_absolute_positions`
— regression guard for #3.
* `TestGemma3WithExpertConfig.test_vision_image_size_matches_input_resolution`
and `.test_projector_accepts_448_inputs` — regression guards
for #1 (config invariant + end-to-end projector forward).
* `TestRopeThetaSymmetryDuringForward.test_expert_uses_backbone_per_layer_theta`
— regression guard for #2, uses monkeypatched `apply_rope` to
observe exactly which θ each layer asks for.
Local pytest: 29 passed / 1 deselected (up from 25). Full policies/
CPU suite: 75 passed / 2 skipped / 6 deselected. Pre-commit: all
hooks green (ruff, ruff-format, pyupgrade, typos, bandit, gitleaks).
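The θ asymmetry that the regression guard for #2 protects against can be reproduced numerically. The sketch below is an assumption-laden illustration: it uses an interleaved-pair RoPE convention written in plain Python (`rope_rotate` and `attn_score` are hypothetical helpers, not the repository's `apply_rope`), but the relative-position property it demonstrates does not depend on that convention.

```python
import math


def rope_rotate(vec, pos, theta):
    """Rotate each (even, odd) pair of `vec` by the angle pos * theta ** (-i / d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        angle = pos * theta ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def attn_score(q, k, pos_q, pos_k, theta_q, theta_k):
    """Dot product of a rotated query and a rotated key."""
    rq = rope_rotate(q, pos_q, theta_q)
    rk = rope_rotate(k, pos_k, theta_k)
    return sum(a * b for a, b in zip(rq, rk))


q = [1.0, 0.0, 1.0, 0.0]
k = [1.0, 0.0, 1.0, 0.0]
LOCAL, GLOBAL = 10_000.0, 1_000_000.0

# Same theta on both streams: the score depends only on pos_q - pos_k (3 in both cases).
same_near = attn_score(q, k, 5, 2, GLOBAL, GLOBAL)
same_far = attn_score(q, k, 105, 102, GLOBAL, GLOBAL)

# Mixed thetas: the same offset of 3 gives different scores at different absolute positions.
mixed_near = attn_score(q, k, 5, 2, GLOBAL, LOCAL)
mixed_far = attn_score(q, k, 105, 102, GLOBAL, LOCAL)
```

Comparing `same_near` with `same_far`, and `mixed_near` with `mixed_far`, shows why both streams have to share the layer's θ for the score to remain a function of relative position only.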
https://claude.ai/code/session_01MibvjbcZo38nxrx6n9giLi
…gnment)
The π0.6 model card specifies "bidirectional attention among ALL of the
image tokens" and a "block-wise causal" prefix attention pattern, with
no mention of sliding-window or local attention. With 4 cams × 256 image
tokens = 1024 image tokens, the wording "all" is incompatible with
Gemma 3's 1024-token sliding window. The most defensible reading is
that π0.6 deliberately runs every backbone layer with the global
block-causal mask; the local layers' pretrained weights are presumably
adapted via training rather than constrained at inference.
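The token counts behind this argument can be checked with a few lines of arithmetic. This is a sketch under assumptions: `vision_token_budget` is a hypothetical helper, the patch size of 14 is read off the "SigLIP-400m/14" name, and the pooling step (patches per side divided by the square root of the token budget) is an assumption about how the projector reaches 256 tokens per image.

```python
def vision_token_budget(image_size, patch_size=14, mm_tokens_per_image=256):
    """Derive patch and token counts for one camera view."""
    patches_per_side = image_size // patch_size
    tokens_per_side = int(mm_tokens_per_image ** 0.5)
    pool_kernel = patches_per_side // tokens_per_side  # assumed pooling factor
    return {
        "patches_per_side": patches_per_side,
        "patches": patches_per_side ** 2,
        "pool_kernel": pool_kernel,
        "mm_tokens": (patches_per_side // pool_kernel) ** 2,
    }


at_448 = vision_token_budget(448)
at_896 = vision_token_budget(896)

num_cameras = 4
sliding_window = 1024  # Gemma 3 sliding-window size quoted above
image_tokens = num_cameras * at_448["mm_tokens"]
print(at_448, at_896, image_tokens, image_tokens >= sliding_window)
```

Four views already fill the whole window before any language or action tokens are added, which is the incompatibility the commit message points at.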
This commit:
* Removes `_build_sliding_window_mask` (and the `is_sliding`-conditional
AND in `forward`) — every layer now receives the unmodified prefix
block-causal `attention_mask`.
* Reverts the KV-cache `key_positions` field added for the sliding
mask; cross-attention no longer needs absolute key positions because
no per-layer mask is constructed from them.
* Keeps the per-layer RoPE θ split — local layers still rotate at
θ=10 000 and global layers at θ=1 000 000, because that's baked into
the pretrained Gemma 3 4B weights and we have to honour it.
* Replaces the `TestSlidingWindowMask` class with a regression test
`TestNoSlidingWindowEnforcement` that monkey-patches
`eager_attention_forward` to verify the per-layer mask equals the
input mask on BOTH layer types — guarding against the window
silently sneaking back in.
Local tests: 26 passed / 1 deselected. Pre-commit: all hooks green.
https://claude.ai/code/session_01MibvjbcZo38nxrx6n9giLi
…els-NgCLN # Conflicts: # README.md # src/opentau/policies/factory.py
akshay18iitg requested changes on Apr 29, 2026
akshay18iitg approved these changes on Apr 29, 2026
shuheng-liu added a commit that referenced this pull request on Apr 29, 2026
shuheng-liu added a commit that referenced this pull request on Apr 30, 2026
`test_complete_pi06_pipeline_integration_smoke` was added in #178 with chunk_size=10 but the shared `lerobot_dataset_metadata` fixture provides actions stats shaped (50, 32) — matching the default PI06Config. The Normalize buffer was therefore (50, 32) while the test batch's actions were (B, 10, 32), and MIN_MAX normalization in normalize.py:232 raised ``RuntimeError: The size of tensor a (10) must match the size of tensor b (50) at non-singleton dimension 1``. Pre-existing bug — never caught in CI because the test is gated by @pytest.mark.gpu and skipped in CPU runs. Surfaced now while validating this PR's SDPA + grad-ckpt port on a real GPU. Fix by deep-copying the fixture stats and reshaping the actions max/mean/min/std arrays from (50, 32) to (chunk_size, 32) before constructing the policy. Same numeric values, just the right shape.
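The fixture fix described here can be sketched as follows. This is an illustration under assumptions: `stats_for_chunk` is a hypothetical helper, the stats are modelled as nested Python lists rather than arrays, and taking the first `chunk_size` rows is an assumed way of getting from (50, 32) to (chunk_size, 32); the commit message does not say how the reshape is done.

```python
import copy

STAT_NAMES = ("max", "mean", "min", "std")


def stats_for_chunk(stats, chunk_size, key="actions"):
    """Return a deep copy of `stats` whose action stats have `chunk_size` rows."""
    patched = copy.deepcopy(stats)  # never mutate the shared fixture
    for name in STAT_NAMES:
        # Assumption: keep the first `chunk_size` rows of each stat.
        patched[key][name] = patched[key][name][:chunk_size]
    return patched


# Stand-in for the shared fixture: action stats shaped (50, 32).
fixture_stats = {
    "actions": {name: [[0.0] * 32 for _ in range(50)] for name in STAT_NAMES}
}
patched = stats_for_chunk(fixture_stats, chunk_size=10)
```

Working on a deep copy matters because the fixture is shared: other tests that rely on the default (50, 32) shape keep seeing it.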
shuheng-liu added a commit that referenced this pull request on May 2, 2026
`test_complete_pi06_pipeline_integration_smoke` was added in #178 with chunk_size=10 but the shared `lerobot_dataset_metadata` fixture provides actions stats shaped (50, 32) — matching the default PI06Config. The Normalize buffer was therefore (50, 32) while the test batch's actions were (B, 10, 32), and MIN_MAX normalization in normalize.py:232 raised ``RuntimeError: The size of tensor a (10) must match the size of tensor b (50) at non-singleton dimension 1``. Pre-existing bug — never caught in CI because the test is gated by @pytest.mark.gpu and skipped in CPU runs. Surfaced now while validating this PR's SDPA + grad-ckpt port on a real GPU. Fix by deep-copying the fixture stats and reshaping the actions max/mean/min/std arrays from (50, 32) to (chunk_size, 32) before constructing the policy. Same numeric values, just the right shape. (cherry picked from commit 6425cb4)
What this does
Adds a new `pi06` policy that ports Physical Intelligence's π0.6 architecture (Model Card, 2025-11-17; arXiv:2511.14759) into OpenTau, alongside supporting config, docs, and tests. Label: 🗃️ Feature.
What changed vs `pi05`
* `pi06`: `head_dim=256`.
* `pi06`: `q_norm`/`k_norm` per attention block.
* `pi06`: `image_size=448` so `Gemma3MultiModalProjector` reshapes correctly (32 patches/side → 256 mm tokens/view).
* `pi05`: `num_steps=10`; `pi06`: `num_steps=5` (~63 ms / chunk on H100).
Training recipe (FAST discrete actions co-trained with flow matching, Knowledge Insulation gradient stop, `Beta(1.5, 1.0)` time sampler, block-causal prefix + bidirectional action suffix) is identical to `pi05`.
Architectural choices that aren't obvious from the model card
These were each surfaced and resolved during careful re-review while writing this PR; they are flagged explicitly so future reviewers understand the reasoning:
* `embed_tokens` is a `Gemma3TextScaledWordEmbedding` that already multiplies by `√hidden_size` internally. The pi05-style manual `* math.sqrt(lang_emb_dim)` would double-scale text tokens to ~51× the image-token magnitude, corrupting both the bidirectional prefix attention and the FAST/response cross-entropy heads. Removed (was #179, merged into this branch).
* `Gemma3WithExpertModel.forward` now receives the unmodified block-causal mask; local layers still rotate at θ=10 000 (preserving the pretrained RoPE basis), but their attention pattern matches the global layers'.
* The `q · R(Δp) · k` invariant only holds when both rotations use the same θ. Backbone Q/K rotates at the layer's θ; expert Q/K is forced to use the same value (its config's own `rope_theta` is documented as ignored at runtime).
* `image_size` matches input resolution. `Gemma3MultiModalProjector` hardcodes `patches_per_image = image_size // patch_size`, so a default `image_size=896` would crash the projector reshape on the first 448×448 forward pass.
Implementation notes
* `gemma3_with_expert.py`: runs a per-layer interleaved attention loop that concatenates backbone and expert Q/K/V along the sequence axis, honouring Gemma 3's `q_norm`/`k_norm`, per-layer RoPE θ, and the four-RMSNorm Gemma 3 block (`input_layernorm`, `post_attention_layernorm`, `pre_feedforward_layernorm`, `post_feedforward_layernorm`). The Gemma-v1 AdaRMS / `_gated_residual` / patched `GemmaRMSNorm` monkey-patches in `opentau.utils.transformers_patch` cover the expert path; the Gemma 3 backbone runs stock, so no new patches are introduced.
* `modeling_pi06.py`: mirrors `modeling_pi05.py` structure with the Gemma 3 tokenizer (`google/gemma-3-4b-pt`) and a `_fix_pytorch_state_dict_keys` that also accepts legacy `paligemma_with_expert.*` prefixes as a warm-start path for users converting pi05 checkpoints.
* `pi06` registered in `opentau.policies.factory` and `opentau.available_policies`; README updated with a pi06 bullet, comparison-table row, checkpoints-coming-soon placeholder, and a pointer at `configs/examples/pi06_training_config.json`.
How it was tested
`tests/policies/test_pi06.py` ships 26 CPU-only unit tests (run on each push):
* `apply_rope` shape / dtype preservation and θ sensitivity (zero-position identity).
* `PI06Config` defaults and validators.
* `Gemma3WithExpertConfig` topology: Gemma 3 4B hidden/layer/head counts, 5:1 layer-type pattern, ~860M expert with AdaRMS on, GQA matched.
* `test_vision_image_size_matches_input_resolution` + `test_projector_accepts_448_inputs`: config invariant + end-to-end `Gemma3MultiModalProjector` forward at 448×448.
* `TestRopeThetaSymmetryDuringForward.test_expert_uses_backbone_per_layer_theta`: monkey-patches `apply_rope` to record per-call θ and asserts each layer uses the backbone's θ for both streams.
* `TestNoSlidingWindowEnforcement.test_per_layer_mask_equals_input_mask_on_both_layer_types`: monkey-patches `eager_attention_forward` to verify both local and global layers receive the unmodified input mask.
* `resize_with_pad` (448×448 default, aspect-ratio preservation) and FAST discrete-action padding/truncation.
Other:
* `tests/test_available.py` updated so the `available_policies` invariant test picks up the new `PI06Policy`.
* `test_complete_pi06_pipeline_integration_smoke`: marked `@pytest.mark.slow` + `@pytest.mark.gpu`, runs on the nightly GPU CI job.
* `tests/policies/` CPU suite: 75 passed / 2 skipped / 6 deselected. Pre-commit: every hook green (ruff, ruff-format, pyupgrade, typos, gitleaks, bandit, license headers).
How to checkout & try? (for the reviewer)
    # Point at a dataset and start a real training run:
    opentau-train --config_path=configs/examples/pi06_training_config.json
Out of scope
* … `transformers_patch.py`. The Gemma 3 backbone runs stock; the expert keeps using the already-patched Gemma v1.
* … `NotImplementedError` until we can validate numerics on the four-RMSNorm Gemma 3 block.
* … `pistar06`. … `_fix_pytorch_state_dict_keys` hook is ready when they drop.
Checklist
https://claude.ai/code/session_01MibvjbcZo38nxrx6n9giLi
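The "~51×" figure in the embedding-scale note of the PR description follows from a one-line calculation. The sketch below assumes a text hidden size of 2560 for Gemma 3 4B; that value is not stated in the PR and is used here only because it is consistent with the quoted factor.

```python
import math

hidden_size = 2560  # assumed Gemma 3 4B text hidden size, not stated in the PR
scale = math.sqrt(hidden_size)

raw_embedding = 1.0                        # magnitude before any scaling
scaled_once = raw_embedding * scale        # what the scaled embedding layer already returns
scaled_twice = scaled_once * scale         # result of an extra manual sqrt(hidden_size) multiply

# The extra multiply inflates text embeddings by `scale` relative to the intended value.
inflation = scaled_twice / scaled_once
print(round(scale, 1), round(inflation, 1))
```

Because image tokens do not pass through the text embedding, the extra multiply only inflates the text side, which is where the mismatch between text and image token magnitudes comes from.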