Skip to content

fix(pi07): gate optional prefix tokens in low- and high-level planners#229

Merged
shuheng-liu merged 8 commits into
feat/pi07from
claude/fix-pi07-prefix-gating
May 2, 2026
Merged

fix(pi07): gate optional prefix tokens in low- and high-level planners#229
shuheng-liu merged 8 commits into
feat/pi07from
claude/fix-pi07-prefix-gating

Conversation

@shuheng-liu
Copy link
Copy Markdown
Member

@shuheng-liu shuheng-liu commented May 2, 2026

What this does

Backports the #226-equivalent prefix-gating fixes from pi07_paligemma (commit b7d3a82 on claude/fix-issue-226-gt8z3, merging into fix/pi07_paligemma_fixes then into main) to pi07's low- and high-level planners.

Low-level planner (pi07/low_level_planner/modeling_pi07_low_level.py::embed_prefix) had three remaining divergences when all optional middle blocks (response / subgoal / metadata) are dropped:

  • State-end was unconditionally ", " (2 tokens) instead of ":\n".
  • ";\n " prefix-end was unconditionally appended even with no optional content to terminate, dangling 2 spurious tokens after the state-end.
  • "Action: " indicator att_masks were [1] + [0]*(N-1) (single bidirectional block); pi05 uses [1]*N (one causal block per token), which shifts the cumsum at the indicator → first-discrete boundary and changes the discrete-action CE loss.

The fix computes has_any_optional once at the top of embed_prefix and gates the state-end string + the prefix-end block on it. Indicator att_masks switched to [1]*N to match pi05. (The earlier 2949e73 commit only ported the response/subgoal/metadata content gating, not the surrounding separators or indicator att_masks.)

High-level planner (pi07/high_level_planner/modeling_pi07_high_level.py::embed_prefix) had one of the three issues — ";\n " was emitted unconditionally, but it is the metadata → "Updated Memory:" separator. With no metadata there's nothing to terminate. Wrapped under if metadata_tokens is not None:. The "Updated Memory: " AR anchor itself stays unconditional because inference relies on it (memory_tokens is None at inference by design).

A byte-identity test against pi06 (analogous to the pi07_paligemma vs pi05 test in #226) is deliberately deferred to a follow-up issue (#230): pi07_low_level uses continuous state with separate State:/Action: indicators while pi06 inlines discretized state in the language prompt, and pi07_high_level uses a different prompt template ("Task: {task}, Past Memory: {past_mem}, State: {state}, " vs pi06's "Task: {task}<eos>State: {state}<eos>Actions:"), so byte-equivalence isn't feasible without aligning pi06's prompts (or building pi06_mem).

Drive-by fixes pulled from main to unblock CI

  • feat(ci): pin Claude workflows to Opus 4.7 1M, allow test-skip on doc fixes (#208) — workflow file must match main to satisfy GitHub workflow validation; was failing the Claude PR Review check.
  • fix(ci): gate extract-claude-lessons on head branch, not PR author (#212) — same reason.
  • ci: skip pi07_paligemma low-level test (broken at import) to unblock CPU CI (#209)feat/pi07 was missing the --ignore=tests/policies/test_pi07_paligemma_low_level_planner.py flag in cpu_test.yml; the test file fails at import (VJEPA2VideoEncoder removed in pi05_mem) and that crashed CPU CI.
  • ci: ignore broken pi07 test in GPU CI; dedupe TODO in cpu_test (#217) — same flag missing in gpu_test.yml + dedup.
  • fix(test): override actions stats shape in pi06 GPU smoke (PR #214)test_complete_pi06_pipeline_integration_smoke was failing on real GPU because the fixture stats shape (50, 32) didn't match chunk_size=10. Pre-existing latent bug; surfaced when running the GPU suite end-to-end on mlbox.

Test fix surfaced by the GPU run

  • _verify_position_ids and _verify_action_expert_attention_mask in test_pi07_low_level_planner.py only excluded DISCRETE_ACTION_MAX_LENGTH from the cross-attention prefix slice. The model (post-2949e73) already excludes the "Action: " indicator length (3 tokens via Gemma 3 tokenizer) too, matching pi05's discrete_action_indicator_max_length logic — so the action expert sees the same prefix length at train and inference. The 3-token gap fails the suffix_position_ids assertion as soon as the GPU integration test runs. Latent break since 2949e73; never caught because the test is gated by @pytest.mark.gpu.

How it was tested

  • pre-commit run --all-files — all hooks pass.
  • CPU subset (pytest -m "not gpu" -n auto tests/policies/): 125 passed, 2 skipped, 1 pre-existing collection error in test_pi07_paligemma_low_level_planner.py (now excluded from CI by the --ignore flag pulled from ci: skip pi07_paligemma low-level test (broken at import) to unblock CPU CI #209/ci: ignore broken pi07 test in GPU CI; dedupe TODO in cpu_test #217).
  • Full GPU suite on mlbox (RTX 5090, 32 GB): pytest -m gpu -n 0 -v --ignore=tests/policies/test_pi07_paligemma_low_level_planner.py tests/20 passed, 2 skipped, 786 deselected, 0 failures, no OOM, 6m10s. Includes tests/policies/test_pi07_low_level_planner.py::test_complete_pi07_low_level_pipeline, test_pi07_high_level_planner.py::test_complete_pi07_high_level_planner_pipeline, the regression class, and pi05 / pi05_mem / pi06 / pi0 / value smoke tests.
  • Specifically tests/policies/test_pi07_cpu.py::TestEmbedPrefixConditionalGuards exercises the gating: test_all_optional_blocks_absent_skips_emission was updated to match the new layout (11 tokens, ":\n" state-end, no trailing prefix-end); test_mixed_subgoal_availability_zeros_pad_only_samples and test_response_mask_any_true_emits_block still pass because they populate at least one optional, so has_any_optional == True and the layout matches the old expectations.
  • Nightly regression tests (regression_test.yml) are deferred to the existing schedule.

How to checkout & try? (for the reviewer)

git fetch origin
git checkout claude/fix-pi07-prefix-gating
uv sync --extra dev
pytest -m "not gpu" -n auto tests/policies/test_pi07_cpu.py tests/policies/test_pi07_low_level_planner.py tests/policies/test_pi07_high_level_planner.py

GPU subset:

pytest -m gpu -n 0 -v --ignore=tests/policies/test_pi07_paligemma_low_level_planner.py tests/

To re-verify the diff against the canonical pi07_paligemma fix:

git diff origin/feat/pi07..claude/fix-pi07-prefix-gating -- src/opentau/policies/pi07/low_level_planner/modeling_pi07_low_level.py
git show b7d3a82 -- src/opentau/policies/pi07_paligemma/low_level_planner/modeling_pi07_low_level.py

Checklist

  • I have added Google-style docstrings to important functions and ensured function parameters are typed.
  • My PR includes policy-related changes.
    • If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

GPU pytests run on mlbox (RTX 5090) — see "How it was tested" above. Nightly regression tests are deferred to the existing CI schedule. A follow-up issue (#230) is filed for pi06 prompt alignment + future byte-identity tests against pi07 planners.

Note: Before submitting this PR, please read the contributor guideline.

Backports the #226-equivalent prefix-gating fixes from pi07_paligemma
(commit b7d3a82) to pi07's low- and high-level planners, plus updates
the affected CPU regression test to match the new layout.

Low-level (modeling_pi07_low_level.py::embed_prefix) had three remaining
divergences in the all-optionals-dropped case:

- state-end was unconditionally ", " (2 tokens) instead of pi05's ":\n".
- ";\n " prefix-end was unconditionally appended even with no optional
  content to terminate, dangling 2 spurious tokens after the state-end.
- "Action: " indicator att_masks were [1] + [0]*(N-1), making the
  indicator a single bidirectional block; pi05 uses [1]*N (one causal
  block per token), which shifts the cumsum at the indicator -> first-
  discrete boundary and changes the discrete-action CE loss.

Now compute has_any_optional once at the top of embed_prefix and gate
the state-end string + the prefix-end block on it. Indicator att_masks
switched to [1]*N to match pi05.

High-level (modeling_pi07_high_level.py::embed_prefix) had one of the
three: ";\n " was emitted unconditionally as if a free-floating block,
but it is the metadata -> "Updated Memory:" separator. With no metadata
there is nothing to terminate. Wrapped under `if metadata_tokens is not
None:`. The "Updated Memory: " AR anchor itself stays unconditional —
inference relies on it (memory_tokens is None at inference by design).

The existing CPU regression test `test_all_optional_blocks_absent_skips
_emission` had the OLD (buggy) layout baked into its assertions; updated
to match the new gated layout (11 tokens total, ":\n" state-end, no
trailing prefix-end).

A byte-identity test against pi06 (analogous to pi07_paligemma vs pi05
in #226) is deliberately deferred to a follow-up issue: pi07_low_level
uses continuous state with separate State:/Action: indicators while
pi06 inlines discretized state in the language prompt, and
pi07_high_level uses a different prompt template, so byte-equivalence
isn't feasible without aligning pi06's prompts (or building pi06_mem).
@shuheng-liu shuheng-liu added bug Something isn't working model new model or model request labels May 2, 2026
@shuheng-liu shuheng-liu self-assigned this May 2, 2026
Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Backport of #226's pi07_paligemma fix to pi07's planners. Logic mirrors the canonical fix and the corresponding pi05 reference. A few suggestions below — none blocking.

Comment thread src/opentau/policies/pi07/low_level_planner/modeling_pi07_low_level.py Outdated
Comment thread .github/workflows/claude-implement-fixes.yml
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 2, 2026

[claude-review] summary for commit bb94735

No blocking issues found.

Re-reviewed after two test-only follow-up commits since the prior clean review at 1b0e3a9 — no changes to fix logic in modeling_pi07_low_level.py / modeling_pi07_high_level.py.

  • bb94735 updates _verify_position_ids and _verify_action_expert_attention_mask in tests/policies/test_pi07_low_level_planner.py to subtract action_lead_len + DISCRETE_ACTION_MAX_LENGTH from the cross-attention prefix slice. Mirrors the model's training-path logic at modeling_pi07_low_level.py:1576 and :1616 (matches pi05's discrete_action_indicator_max_length convention). action_lead_len is pulled from the existing _indicator_lens helper, so the test tracks future tokenizer changes automatically. Inference branch (inference_mode=True) correctly leaves the entire prefix in cross-attention scope, matching the model at :1815-1816. Tokenizer is plumbed through both call sites in forward_loss/generate_one_step paths.
  • 080c190 cherry-picks a pre-existing pi06 GPU-smoke fix (test_complete_pi06_pipeline_integration_smoke): shared lerobot_dataset_metadata provides actions stats shaped (50, 32) for the default PI06Config(chunk_size=50), but this test uses chunk_size=10. The fix deep-copies the fixture stats and reshapes actions.{max,mean,min,std} to (chunk_size, 32) using a constant scalar from the original stats — same numeric values, correct shape. Test is @pytest.mark.gpu-gated, doesn't run in CPU CI.

All earlier findings remain resolved (per-token causal [1]*N indicator att_mask, has_subgoal style aligned to bool(m.any()), HL/LL gating-asymmetry note acknowledged as pre-existing and internally consistent with the existing metadata-block emission gate at :977). GPU + nightly regression tests are deferred to scheduled CI as documented in the PR body. Prior inline findings are all on commit 844c39e / dda4446 and have been addressed by the author's review-fix commits; GitHub's API does not allow dismissing "COMMENTED" reviews so they remain visible but stale.

@shuheng-liu
Copy link
Copy Markdown
Member Author

@claude fix per review, except for the .github/workflows/claude-implement-fixes.yml comment

github-actions Bot and others added 5 commits May 2, 2026 06:13
- addresses @claude (LL embed_prefix discrete_actions test): added
  test_discrete_actions_indicator_uses_per_token_causal_blocks pinning
  the [1]*N att_masks pattern on the "Action: " indicator + discrete
  action span, so a regression to [1]+[0]*(N-1) fails immediately.
- addresses @claude (has_subgoal style nit): wrapped m.any() in bool(...)
  inside the any(...) generator so has_subgoal materializes Python bools
  matching has_response/has_metadata.

tests: passed - pytest -m "not gpu" -n auto tests/policies/

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`test_complete_pi06_pipeline_integration_smoke` was added in #178 with
chunk_size=10 but the shared `lerobot_dataset_metadata` fixture provides
actions stats shaped (50, 32) — matching the default PI06Config. The
Normalize buffer was therefore (50, 32) while the test batch's actions
were (B, 10, 32), and MIN_MAX normalization in normalize.py:232 raised
``RuntimeError: The size of tensor a (10) must match the size of tensor
b (50) at non-singleton dimension 1``.

Pre-existing bug — never caught in CI because the test is gated by
@pytest.mark.gpu and skipped in CPU runs. Surfaced now while validating
this PR's SDPA + grad-ckpt port on a real GPU.

Fix by deep-copying the fixture stats and reshaping the actions
max/mean/min/std arrays from (50, 32) to (chunk_size, 32) before
constructing the policy. Same numeric values, just the right shape.

(cherry picked from commit 6425cb4)
`_verify_position_ids` and `_verify_action_expert_attention_mask` in
test_pi07_low_level_planner.py only excluded DISCRETE_ACTION_MAX_LENGTH
from the cross-attention prefix slice. The model's forward path
(modeling_pi07_low_level.py:1575-1620) excludes both the "Action: "
indicator len (3 tokens via Gemma 3 tokenizer) AND
discrete_action_max_length, matching pi05's
discrete_action_indicator_max_length logic — so the action expert sees
the same prefix length at train and inference.

The test was a latent break since 2949e73 ("Applying pi07_paligemma
fixes to pi07") added the indicator subtraction to the model but never
updated the test. Surfaces as soon as the GPU test
test_complete_pi07_low_level_pipeline runs and asserts
suffix_position_ids — actual was prefix_offset+(token), expected was
prefix_offset+indicator_len+(token), 3-token gap on Gemma 3.

Pull the indicator length from the existing _indicator_lens helper so
the fix tracks future tokenizer changes automatically. Plumb the
tokenizer through the two _verify_action_expert_attention_mask call
sites.
@shuheng-liu shuheng-liu marked this pull request as ready for review May 2, 2026 06:37
@shuheng-liu shuheng-liu merged commit 62a5677 into feat/pi07 May 2, 2026
6 of 7 checks passed
@shuheng-liu shuheng-liu deleted the claude/fix-pi07-prefix-gating branch May 2, 2026 06:39
@claude claude Bot mentioned this pull request May 2, 2026
3 tasks
shuheng-liu added a commit that referenced this pull request May 4, 2026
Brings 16 main commits onto feat/pi07 in preparation for opening the
feat/pi07 → main PR. One conflict (tests/policies/test_pi06.py:
np.tile vs np.full for chunk-size action stats) — kept HEAD's np.tile
fix from claude review feedback on #197, which preserves per-feature
variance instead of collapsing every step to stats.flatten()[0].

Also dropped stale --ignore=tests/policies/test_pi07_paligemma_low_level_planner.py
from .github/workflows/cpu_test.yml and gpu_test.yml: main deleted
the file in #234, but the --ignore line on feat/pi07 (added in #229)
was on a different part of the file so git auto-merged without
flagging it. Removing both now keeps the workflows clean.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working model new model or model request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant