Skip to content

feat(grounding): add <locNNNN> codec + ensure_loc_tokens for pi05/pi06#237

Merged
shuheng-liu merged 5 commits into
mainfrom
claude/stupefied-yalow-d00a70
May 4, 2026
Merged

feat(grounding): add <locNNNN> codec + ensure_loc_tokens for pi05/pi06#237
shuheng-liu merged 5 commits into
mainfrom
claude/stupefied-yalow-d00a70

Conversation

@shuheng-liu
Copy link
Copy Markdown
Member

@shuheng-liu shuheng-liu commented May 4, 2026

What this does

Lands the infrastructure for PaliGemma-style location-token grounding data, without yet shipping a concrete grounding dataset. The follow-up — a configurable response formatter that lets new grounding sources be added by config rather than as a new Python class per dataset — is tracked as a separate issue (filed alongside this PR).

New module: src/opentau/datasets/grounding/

  • loc_codec.py — pure functions to convert pixel coordinates to/from <locYMIN><locXMIN><locYMAX><locXMAX> strings. xyxy / xywh boxes, points, and tolerant inverses (loc_tokens_to_xyxy, loc_tokens_to_points). y-then-x order, 1024-bin quantization, computed against the original image dims (NOT the post-resize tensor shape).
  • tokenizer_utils.pyensure_loc_tokens(tokenizer, model=None). Always calls tokenizer.add_tokens([AddedToken(t, special=True, normalized=False) for t in LOC_TOKENS], special_tokens=True). The behavior splits on backbone:
    • PaliGemma: the 1024 strings already live in the SentencePiece vocab at IDs 256000..257023, but the bare HF tokenizer does not register them as added tokens, so a string like "<loc0000>" BPE-fragments into seven pieces (['<', 'loc', '0', '0', '0', '0', '>']). add_tokens with an AddedToken whose string already exists is the documented HF mechanism to promote an existing entry to single-token-match mode without reassigning its ID. Vocab does not grow; no embedding resize is needed.
    • Gemma 3 (π0.6, google/gemma-3-4b-pt): the strings are absent. The same call appends 1024 new IDs and model.resize_token_embeddings(len(tokenizer)) updates the embedding table and tied LM head. New rows are random-init — they learn from the grounding data on first use; there is no PaliGemma loc-embedding transfer.
    • Idempotent — second call returns 0 and does not resize.

Policy wiring

  • pi05/modeling_pi05.py: ensure_loc_tokens(self.language_tokenizer) runs after each AutoTokenizer.from_pretrained("google/paligemma-3b-pt-224") (in PI05Policy.__init__ and PI05FlowMatching.__init__). Promotes the reserved entries; no new IDs.
  • pi06/modeling_pi06.py: the same, plus ensure_loc_tokens(..., model=self.gemma3_with_expert.gemma3) inside PI06FlowMatching.__init__ so the resize fires after the public Gemma 3 weights have already loaded into Gemma3WithExpertModel. A banner comment at the call site documents that the new rows are random-init and that this is unconditional (no config gate).

Disjoint from discrete actions
The <locNNNN> tokens flow through the language vocab + lm_head + response_ce_loss path. Discrete action tokens (FAST processor) live in a separate discrete_action_embedding / da_head with their own integer space — no interaction.

Cleanup

  • Delete src/opentau/datasets/vqa/pixmo.py. The previous implementation encoded points as ASCII integers inside a JSON string, which BPE-fragmented unpredictably and defeated the purpose of having a spatial vocabulary. It was unused as soon as anyone pointed a real <locNNNN>-aware policy at it.
  • Drop the pixmo row in standard_data_format_mapping.py, its side-effect import in factory.py, and three docstring mentions.

How it was tested

Three new test files under tests/datasets/. All pass under pytest -m "not gpu" -n auto.

  • test_loc_codec.py (10 tests, no network): integer-aligned round-trip on 1024×1024, clamping at axis edges, multi-box concat + parse with ; separators, y-then-x order verification on an asymmetric image, garbage-input tolerance, partial-token-count handling, original-image-dims regression test (1920×1080 vs. 224×224 produce different bins), and zero-padding format check.
  • test_loc_tokens_paligemma.py (7 slow tests): ensure_loc_tokens returns 0 on PaliGemma, IDs match the reserved 256000..257023 block and are contiguous, <loc0000> encodes to a single ID after promotion, bbox postfix round-trips through encode/decode, idempotent second call.
  • test_loc_tokens_gemma3.py (7 slow tests): <loc0000> is absent initially; ensure_loc_tokens returns 1024; idempotent; loc tokens become single-id after extension; bbox postfix round-trips; resize_token_embeddings fires on first call (verified against a tiny nn.Embedding stand-in) and does NOT fire on idempotent second call.

Targeted regression sweep:

  • pytest -m "not gpu" -n auto tests/datasets/ tests/policies/test_pi05.py tests/policies/test_pi06.py tests/configs/ → 478 passed, 7 skipped, 0 failures.
  • Pre-commit (ruff lint + format, pyupgrade, license header, gitleaks, bandit, etc.) clean.

The wider pytest -m "not gpu" -n auto shows 786 passed / 13 skipped / 2 failed / 3 errors, but the 2 failures (tests/envs/test_factory.py::TestMakeEnv) and 3 collection errors (test_pi07_paligemma_low_level_planner, test_annotate_subtasks, test_libero_utils) are all pre-existing on main — missing libero / robosuite / anthropic extras and a stale VJEPA2VideoEncoder import in pi05_mem. None of those paths are touched here.

How to checkout & try? (for the reviewer)

gh pr checkout <this-pr>
pytest -m "not gpu" -n auto tests/datasets/test_loc_codec.py tests/datasets/test_loc_tokens_paligemma.py tests/datasets/test_loc_tokens_gemma3.py

Quick sanity check on the policy wiring (loads the tokenizer, not the full model):

python -c "
from transformers import AutoTokenizer
from opentau.datasets.grounding.tokenizer_utils import ensure_loc_tokens

t_pg = AutoTokenizer.from_pretrained('google/paligemma-3b-pt-224')
print('paligemma promoted:', ensure_loc_tokens(t_pg), '(expected 0)')
print('paligemma encode <loc0042>:', t_pg.encode('<loc0042>', add_special_tokens=False))

t_g3 = AutoTokenizer.from_pretrained('google/gemma-3-4b-pt')
print('gemma3 added:', ensure_loc_tokens(t_g3), '(expected 1024)')
print('gemma3 encode <loc0042>:', t_g3.encode('<loc0042>', add_special_tokens=False))
"

Checklist

  • I have added Google-style docstrings to important functions and ensured function parameters are typed.
  • My PR includes policy-related changes.
    • If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

      Both new loc-token regressions pass on a 32 GB CUDA card; existing pi05 integration smoke also still passes. See the GPU regression results comment for the full output. Nightly regression tests (regression_test.yml) run on schedule.

The model-loop changes are limited to construction-time tokenizer/embedding setup; prepare_response, embed_prefix, embed_language_tokens, and response_ce_loss are untouched. The pi05 path still produces the same vocab shape (no resize); the pi06 LM head grows by 1024 outputs. CLAUDE.md hard rule #3 (determinism check after modeling-loop changes) is partially deferred — the suggested two-seeded-runs check needs the public Gemma 3 / PaliGemma weights and a smoke training step. Happy to run that on a GPU box if you want it on this PR rather than the follow-up.

Note: Before submitting this PR, please read the contributor guideline.

Land the infrastructure for PaliGemma-style location-token grounding
data without yet shipping a concrete grounding dataset.

- src/opentau/datasets/grounding/loc_codec.py — pure functions to convert
  pixel coordinates to/from `<locYMIN><locXMIN><locYMAX><locXMAX>` strings
  (xyxy/xywh boxes, points, tolerant inverses). y-then-x order, 1024-bin
  quantization, computed against original image dims.
- src/opentau/datasets/grounding/tokenizer_utils.py — `ensure_loc_tokens`
  uses `AddedToken(special=True, normalized=False)` to promote the loc
  strings to single-token match mode. Idempotent: 0 new IDs on PaliGemma
  (the strings already live at IDs 256000..257023 but the bare HF
  tokenizer otherwise BPE-fragments them); 1024 new IDs on a fresh
  Gemma 3 tokenizer, with `model.resize_token_embeddings` updating the
  embedding table and tied LM head.
- pi05 / pi06 wire `ensure_loc_tokens` into their `__init__`s. pi06 also
  passes the Gemma 3 model handle so the embedding/LM-head resize fires
  after the public `google/gemma-3-4b-pt` weights have loaded — the new
  rows are random-init.
- Delete the broken `vqa/pixmo.py`. Its JSON-encoded points-as-ASCII
  responses fragmented through BPE; the replacement (a configurable
  grounding dataset) is tracked as a follow-up.
- Tests: 10 codec round-trip / clamping / order tests; 7 PaliGemma
  tokenizer tests; 7 Gemma 3 tokenizer tests (including a fake-model
  resize check). All pass under `pytest -m "not gpu"`.
@shuheng-liu shuheng-liu added the feature New feature or request label May 4, 2026
@shuheng-liu shuheng-liu self-assigned this May 4, 2026
@shuheng-liu shuheng-liu added the feature New feature or request label May 4, 2026
Confirms the loc-token wiring added to PI05Policy / PI06Policy / their
inner FlowMatching modules works end-to-end on GPU.

- pi05: builds the policy, asserts both tokenizer instances encode
  `<loc0042>` and `<loc1023>` to a single ID each, then runs one forward
  pass with a four-loc bbox postfix and asserts MSE / CE are finite.
- pi06: bare `google/gemma-3-4b-pt` tokenizer has no loc strings; after
  policy construction the inner tokenizer has grown by exactly 1024,
  loc tokens encode as single IDs, the Gemma 3 input-embedding row
  count matches the new tokenizer length, and the LM head output dim
  has been resized in lockstep. Closes with a forward pass on a
  loc-token-bearing response and asserts finite loss.

Both tests are `@pytest.mark.gpu` + `@pytest.mark.slow` — they run on
g6.12xlarge nightly and on the worktree's GPU box.
The shared `lerobot_dataset_metadata` fixture carries actions stats sized
(50, 32) for the default `chunk_size`. The new loc-tokens GPU regression
runs at `chunk_size=10` to stay small, so the Normalize buffer is built
from (50, 32) stats while the live actions tensor is (B, 10, 32) — and
`(actions - min) / (max - min + EPS)` errors at dim=1.

Inline the actions-stats override before calling `PI06Policy(config, ...)`
so the buffer matches the test's `chunk_size`. Same pattern that the
existing pi06 smoke test uses on `fix/pi06-paper-alignment`; keeping it
inline here so this PR stays orthogonal to that one.
Loading PaliGemma 3B (~6 GB) and Gemma 3 4B (~8 GB) onto a single 32 GB
GPU and leaving them resident across tests OOMs the next allocation.
Wrap the forward pass in try/finally and `del policy; empty_cache()` at
the end so the loc-tokens regressions can run alongside the broader GPU
suite on a single-card dev box.
@shuheng-liu
Copy link
Copy Markdown
Member Author

GPU regression results

Added two @pytest.mark.gpu @pytest.mark.slow regression tests covering the loc-token wiring:

  • tests/policies/test_pi05.py::test_pi05_loc_tokens_in_response_produce_finite_loss — asserts both PaliGemma tokenizer instances on the policy encode <loc0042> / <loc1023> to a single ID, then runs one forward pass on a four-loc bbox postfix and asserts MSE / CE are finite.
  • tests/policies/test_pi06.py::test_pi06_loc_tokens_extend_vocab_and_resize_embeddings — asserts the inner Gemma 3 tokenizer grew by exactly 1024 vs. bare HF, the input embedding row count matches the new tokenizer length, the LM head output dim grew in lockstep, and a forward pass with a loc-token-bearing response gives finite loss.

Both clean up the policy + torch.cuda.empty_cache() in a finally block so they don't OOM the next test on a single-GPU box.

Run on a 32 GB CUDA card

tests/policies/test_pi05.py::test_pi05_loc_tokens_in_response_produce_finite_loss   PASSED
tests/policies/test_pi06.py::test_pi06_loc_tokens_extend_vocab_and_resize_embeddings PASSED
tests/policies/test_pi05.py::TestPI05Integration::test_complete_pi05_pipeline_integration PASSED
3 passed in 89.05s

Pre-existing failure (not introduced by this PR)

tests/policies/test_pi06.py::test_complete_pi06_pipeline_integration_smoke fails on this branch AND on main with the same RuntimeError: The size of tensor a (10) must match the size of tensor b (50) at non-singleton dimension 1 in Normalize.forward. The shared lerobot_dataset_metadata fixture carries actions stats sized (50, 32) (default chunk_size), but that test runs at chunk_size=10. The fix has already landed on fix/pi06-paper-alignment (eefe8ca fix(test): override actions stats shape in pi06 GPU smoke). My new pi06 GPU test uses the same actions-stats override pattern inline, so it's not affected.

Verified by running the smoke alone on a clean checkout of main:

tests/policies/test_pi06.py::test_complete_pi06_pipeline_integration_smoke FAILED
RuntimeError: The size of tensor a (10) must match the size of tensor b (50) at non-singleton dimension 1

Checklist update

The "GPU pytests" box is now ticked — the new regressions pass and the existing pi05 GPU smoke is unaffected. Nightly regression tests (regression_test.yml) still run separately on schedule.

@shuheng-liu shuheng-liu requested review from WilliamYue37 and akshay18iitg and removed request for WilliamYue37 May 4, 2026 03:58
@shuheng-liu shuheng-liu marked this pull request as ready for review May 4, 2026 03:59
Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Targeted review of the loc-token grounding infrastructure. One small docstring/behavior mismatch in the parser; the rest looks correct.

Comment thread src/opentau/datasets/grounding/loc_codec.py
Comment thread src/opentau/policies/pi06/modeling_pi06.py
Comment thread src/opentau/policies/pi06/modeling_pi06.py Outdated
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 4, 2026

[claude-review] summary for commit c68a488

  • suggestionsrc/opentau/datasets/grounding/loc_codec.py:144loc_tokens_to_xyxy collects loc tokens globally and groups them in fours, but its docstring claims segment-aware handling. A malformed multi-box response (e.g. one box has 5 loc tokens) silently misaligns every subsequent box. Either parse per-segment by splitting on ; or update the docstring.
  • nitsrc/opentau/policies/pi06/modeling_pi06.py:754model.resize_token_embeddings random-inits 1024 new Gemma 3 embedding rows from the active RNG. Construction now depends on RNG state, a new ordering hazard for CLAUDE.md hard rule Fixing reward normalizer #3. Either snapshot/restore the RNG inside ensure_loc_tokens, or run the two-seeded-runs smoke before this lands in any training loop (PR already defers this).
  • nitsrc/opentau/policies/pi06/modeling_pi06.py:255PI06Policy and PI06FlowMatching each construct their own Gemma 3 tokenizer and each call ensure_loc_tokens. Both grow deterministically so IDs currently align, but the structure is fragile — sharing one tokenizer instance across the two layers would be safer and save a load.

Encoding/decoding correctness, idempotency of ensure_loc_tokens on PaliGemma's reserved IDs, and the LM-head resize path on Gemma 3 are all covered by the new tests. The pi05/pi06 GPU regressions verify single-token encoding and finite forward loss with <locNNNN> in the response.

@shuheng-liu
Copy link
Copy Markdown
Member Author

@claude fix per suggestion and nits.

- addresses @claude (loc_codec parser): segment-aware loc_tokens_to_xyxy
  / loc_tokens_to_points — split on `;` so a malformed segment cannot
  misalign every subsequent box. Update test_partial_token_count_drops_orphan_pairs
  to the new contract and add regressions for malformed-segment isolation.
- addresses @claude (RNG hazard in ensure_loc_tokens): wrap the
  resize_token_embeddings call in torch.random.fork_rng with a fixed
  internal seed so embedding init is reproducible and does not consume
  entropy from the caller's stream. Add Gemma 3 tests asserting RNG
  isolation and bit-identical new rows across outer seeds.
- addresses @claude (duplicate Gemma 3 / PaliGemma tokenizer load):
  PI06Policy / PI05Policy now share a single tokenizer instance with
  their inner FlowMatching; ensure_loc_tokens runs once. Existing
  pi05/pi06 GPU tests assert the shared identity.

tests: passed — pytest -m "not gpu" -n auto tests/datasets/ tests/policies/test_pi05.py tests/policies/test_pi06.py tests/configs/ (16 HF-gated tests skipped locally for lack of HF auth; CI runs them with auth)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@shuheng-liu shuheng-liu merged commit 55d5274 into main May 4, 2026
7 checks passed
@shuheng-liu shuheng-liu deleted the claude/stupefied-yalow-d00a70 branch May 4, 2026 05:00
@claude claude Bot mentioned this pull request May 4, 2026
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant