feat(grounding): add <locNNNN> codec + ensure_loc_tokens for pi05/pi06 by shuheng-liu · Pull Request #237 · TensorAuto/OpenTau

shuheng-liu · 2026-05-04T02:54:44Z

What this does

Lands the infrastructure for PaliGemma-style location-token grounding data, without yet shipping a concrete grounding dataset. The follow-up — a configurable response formatter that lets new grounding sources be added by config rather than as a new Python class per dataset — is tracked as a separate issue (filed alongside this PR).

New module: src/opentau/datasets/grounding/

loc_codec.py — pure functions to convert pixel coordinates to/from <locYMIN><locXMIN><locYMAX><locXMAX> strings. xyxy / xywh boxes, points, and tolerant inverses (loc_tokens_to_xyxy, loc_tokens_to_points). y-then-x order, 1024-bin quantization, computed against the original image dims (NOT the post-resize tensor shape).
tokenizer_utils.py — ensure_loc_tokens(tokenizer, model=None). Always calls tokenizer.add_tokens([AddedToken(t, special=True, normalized=False) for t in LOC_TOKENS], special_tokens=True). The behavior splits on backbone:
- PaliGemma: the 1024 strings already live in the SentencePiece vocab at IDs 256000..257023, but the bare HF tokenizer does not register them as added tokens, so a string like "<loc0000>" BPE-fragments into seven pieces (['<', 'loc', '0', '0', '0', '0', '>']). add_tokens with an AddedToken whose string already exists is the documented HF mechanism to promote an existing entry to single-token-match mode without reassigning its ID. Vocab does not grow; no embedding resize is needed.
- Gemma 3 (π0.6, google/gemma-3-4b-pt): the strings are absent. The same call appends 1024 new IDs and model.resize_token_embeddings(len(tokenizer)) updates the embedding table and tied LM head. New rows are random-init — they learn from the grounding data on first use; there is no PaliGemma loc-embedding transfer.
- Idempotent — second call returns 0 and does not resize.

Policy wiring

pi05/modeling_pi05.py: ensure_loc_tokens(self.language_tokenizer) runs after each AutoTokenizer.from_pretrained("google/paligemma-3b-pt-224") (in PI05Policy.__init__ and PI05FlowMatching.__init__). Promotes the reserved entries; no new IDs.
pi06/modeling_pi06.py: the same, plus ensure_loc_tokens(..., model=self.gemma3_with_expert.gemma3) inside PI06FlowMatching.__init__ so the resize fires after the public Gemma 3 weights have already loaded into Gemma3WithExpertModel. A banner comment at the call site documents that the new rows are random-init and that this is unconditional (no config gate).

Disjoint from discrete actions
The <locNNNN> tokens flow through the language vocab + lm_head + response_ce_loss path. Discrete action tokens (FAST processor) live in a separate discrete_action_embedding / da_head with their own integer space — no interaction.

Cleanup

Delete src/opentau/datasets/vqa/pixmo.py. The previous implementation encoded points as ASCII integers inside a JSON string, which BPE-fragmented unpredictably and defeated the purpose of having a spatial vocabulary. It was unused as soon as anyone pointed a real <locNNNN>-aware policy at it.
Drop the pixmo row in standard_data_format_mapping.py, its side-effect import in factory.py, and three docstring mentions.

How it was tested

Three new test files under tests/datasets/. All pass under pytest -m "not gpu" -n auto.

test_loc_codec.py (10 tests, no network): integer-aligned round-trip on 1024×1024, clamping at axis edges, multi-box concat + parse with ; separators, y-then-x order verification on an asymmetric image, garbage-input tolerance, partial-token-count handling, original-image-dims regression test (1920×1080 vs. 224×224 produce different bins), and zero-padding format check.
test_loc_tokens_paligemma.py (7 slow tests): ensure_loc_tokens returns 0 on PaliGemma, IDs match the reserved 256000..257023 block and are contiguous, <loc0000> encodes to a single ID after promotion, bbox postfix round-trips through encode/decode, idempotent second call.
test_loc_tokens_gemma3.py (7 slow tests): <loc0000> is absent initially; ensure_loc_tokens returns 1024; idempotent; loc tokens become single-id after extension; bbox postfix round-trips; resize_token_embeddings fires on first call (verified against a tiny nn.Embedding stand-in) and does NOT fire on idempotent second call.

Targeted regression sweep:

pytest -m "not gpu" -n auto tests/datasets/ tests/policies/test_pi05.py tests/policies/test_pi06.py tests/configs/ → 478 passed, 7 skipped, 0 failures.
Pre-commit (ruff lint + format, pyupgrade, license header, gitleaks, bandit, etc.) clean.

The wider pytest -m "not gpu" -n auto shows 786 passed / 13 skipped / 2 failed / 3 errors, but the 2 failures (tests/envs/test_factory.py::TestMakeEnv) and 3 collection errors (test_pi07_paligemma_low_level_planner, test_annotate_subtasks, test_libero_utils) are all pre-existing on main — missing libero / robosuite / anthropic extras and a stale VJEPA2VideoEncoder import in pi05_mem. None of those paths are touched here.

How to checkout & try? (for the reviewer)

gh pr checkout <this-pr>
pytest -m "not gpu" -n auto tests/datasets/test_loc_codec.py tests/datasets/test_loc_tokens_paligemma.py tests/datasets/test_loc_tokens_gemma3.py

Quick sanity check on the policy wiring (loads the tokenizer, not the full model):

python -c "
from transformers import AutoTokenizer
from opentau.datasets.grounding.tokenizer_utils import ensure_loc_tokens

t_pg = AutoTokenizer.from_pretrained('google/paligemma-3b-pt-224')
print('paligemma promoted:', ensure_loc_tokens(t_pg), '(expected 0)')
print('paligemma encode <loc0042>:', t_pg.encode('<loc0042>', add_special_tokens=False))

t_g3 = AutoTokenizer.from_pretrained('google/gemma-3-4b-pt')
print('gemma3 added:', ensure_loc_tokens(t_g3), '(expected 1024)')
print('gemma3 encode <loc0042>:', t_g3.encode('<loc0042>', add_special_tokens=False))
"

Checklist

I have added Google-style docstrings to important functions and ensured function parameters are typed.
My PR includes policy-related changes.
- If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.
  
  Both new loc-token regressions pass on a 32 GB CUDA card; existing pi05 integration smoke also still passes. See the GPU regression results comment for the full output. Nightly regression tests (regression_test.yml) run on schedule.

The model-loop changes are limited to construction-time tokenizer/embedding setup; prepare_response, embed_prefix, embed_language_tokens, and response_ce_loss are untouched. The pi05 path still produces the same vocab shape (no resize); the pi06 LM head grows by 1024 outputs. CLAUDE.md hard rule #3 (determinism check after modeling-loop changes) is partially deferred — the suggested two-seeded-runs check needs the public Gemma 3 / PaliGemma weights and a smoke training step. Happy to run that on a GPU box if you want it on this PR rather than the follow-up.

Note: Before submitting this PR, please read the contributor guideline.

Land the infrastructure for PaliGemma-style location-token grounding data without yet shipping a concrete grounding dataset. - src/opentau/datasets/grounding/loc_codec.py — pure functions to convert pixel coordinates to/from `<locYMIN><locXMIN><locYMAX><locXMAX>` strings (xyxy/xywh boxes, points, tolerant inverses). y-then-x order, 1024-bin quantization, computed against original image dims. - src/opentau/datasets/grounding/tokenizer_utils.py — `ensure_loc_tokens` uses `AddedToken(special=True, normalized=False)` to promote the loc strings to single-token match mode. Idempotent: 0 new IDs on PaliGemma (the strings already live at IDs 256000..257023 but the bare HF tokenizer otherwise BPE-fragments them); 1024 new IDs on a fresh Gemma 3 tokenizer, with `model.resize_token_embeddings` updating the embedding table and tied LM head. - pi05 / pi06 wire `ensure_loc_tokens` into their `__init__`s. pi06 also passes the Gemma 3 model handle so the embedding/LM-head resize fires after the public `google/gemma-3-4b-pt` weights have loaded — the new rows are random-init. - Delete the broken `vqa/pixmo.py`. Its JSON-encoded points-as-ASCII responses fragmented through BPE; the replacement (a configurable grounding dataset) is tracked as a follow-up. - Tests: 10 codec round-trip / clamping / order tests; 7 PaliGemma tokenizer tests; 7 Gemma 3 tokenizer tests (including a fake-model resize check). All pass under `pytest -m "not gpu"`.

Confirms the loc-token wiring added to PI05Policy / PI06Policy / their inner FlowMatching modules works end-to-end on GPU. - pi05: builds the policy, asserts both tokenizer instances encode `<loc0042>` and `<loc1023>` to a single ID each, then runs one forward pass with a four-loc bbox postfix and asserts MSE / CE are finite. - pi06: bare `google/gemma-3-4b-pt` tokenizer has no loc strings; after policy construction the inner tokenizer has grown by exactly 1024, loc tokens encode as single IDs, the Gemma 3 input-embedding row count matches the new tokenizer length, and the LM head output dim has been resized in lockstep. Closes with a forward pass on a loc-token-bearing response and asserts finite loss. Both tests are `@pytest.mark.gpu` + `@pytest.mark.slow` — they run on g6.12xlarge nightly and on the worktree's GPU box.

The shared `lerobot_dataset_metadata` fixture carries actions stats sized (50, 32) for the default `chunk_size`. The new loc-tokens GPU regression runs at `chunk_size=10` to stay small, so the Normalize buffer is built from (50, 32) stats while the live actions tensor is (B, 10, 32) — and `(actions - min) / (max - min + EPS)` errors at dim=1. Inline the actions-stats override before calling `PI06Policy(config, ...)` so the buffer matches the test's `chunk_size`. Same pattern that the existing pi06 smoke test uses on `fix/pi06-paper-alignment`; keeping it inline here so this PR stays orthogonal to that one.

Loading PaliGemma 3B (~6 GB) and Gemma 3 4B (~8 GB) onto a single 32 GB GPU and leaving them resident across tests OOMs the next allocation. Wrap the forward pass in try/finally and `del policy; empty_cache()` at the end so the loc-tokens regressions can run alongside the broader GPU suite on a single-card dev box.

shuheng-liu · 2026-05-04T03:54:05Z

GPU regression results

Added two @pytest.mark.gpu @pytest.mark.slow regression tests covering the loc-token wiring:

tests/policies/test_pi05.py::test_pi05_loc_tokens_in_response_produce_finite_loss — asserts both PaliGemma tokenizer instances on the policy encode <loc0042> / <loc1023> to a single ID, then runs one forward pass on a four-loc bbox postfix and asserts MSE / CE are finite.
tests/policies/test_pi06.py::test_pi06_loc_tokens_extend_vocab_and_resize_embeddings — asserts the inner Gemma 3 tokenizer grew by exactly 1024 vs. bare HF, the input embedding row count matches the new tokenizer length, the LM head output dim grew in lockstep, and a forward pass with a loc-token-bearing response gives finite loss.

Both clean up the policy + torch.cuda.empty_cache() in a finally block so they don't OOM the next test on a single-GPU box.

Run on a 32 GB CUDA card

tests/policies/test_pi05.py::test_pi05_loc_tokens_in_response_produce_finite_loss   PASSED
tests/policies/test_pi06.py::test_pi06_loc_tokens_extend_vocab_and_resize_embeddings PASSED
tests/policies/test_pi05.py::TestPI05Integration::test_complete_pi05_pipeline_integration PASSED
3 passed in 89.05s

Pre-existing failure (not introduced by this PR)

tests/policies/test_pi06.py::test_complete_pi06_pipeline_integration_smoke fails on this branch AND on main with the same RuntimeError: The size of tensor a (10) must match the size of tensor b (50) at non-singleton dimension 1 in Normalize.forward. The shared lerobot_dataset_metadata fixture carries actions stats sized (50, 32) (default chunk_size), but that test runs at chunk_size=10. The fix has already landed on fix/pi06-paper-alignment (eefe8ca fix(test): override actions stats shape in pi06 GPU smoke). My new pi06 GPU test uses the same actions-stats override pattern inline, so it's not affected.

Verified by running the smoke alone on a clean checkout of main:

tests/policies/test_pi06.py::test_complete_pi06_pipeline_integration_smoke FAILED
RuntimeError: The size of tensor a (10) must match the size of tensor b (50) at non-singleton dimension 1

Checklist update

The "GPU pytests" box is now ticked — the new regressions pass and the existing pi05 GPU smoke is unaffected. Nightly regression tests (regression_test.yml) still run separately on schedule.

claude

Targeted review of the loc-token grounding infrastructure. One small docstring/behavior mismatch in the parser; the rest looks correct.

claude · 2026-05-04T04:35:40Z

[claude-review] summary for commit c68a488

suggestion — src/opentau/datasets/grounding/loc_codec.py:144 — loc_tokens_to_xyxy collects loc tokens globally and groups them in fours, but its docstring claims segment-aware handling. A malformed multi-box response (e.g. one box has 5 loc tokens) silently misaligns every subsequent box. Either parse per-segment by splitting on ; or update the docstring.
nit — src/opentau/policies/pi06/modeling_pi06.py:754 — model.resize_token_embeddings random-inits 1024 new Gemma 3 embedding rows from the active RNG. Construction now depends on RNG state, a new ordering hazard for CLAUDE.md hard rule Fixing reward normalizer #3. Either snapshot/restore the RNG inside ensure_loc_tokens, or run the two-seeded-runs smoke before this lands in any training loop (PR already defers this).
nit — src/opentau/policies/pi06/modeling_pi06.py:255 — PI06Policy and PI06FlowMatching each construct their own Gemma 3 tokenizer and each call ensure_loc_tokens. Both grow deterministically so IDs currently align, but the structure is fragile — sharing one tokenizer instance across the two layers would be safer and save a load.

Encoding/decoding correctness, idempotency of ensure_loc_tokens on PaliGemma's reserved IDs, and the LM-head resize path on Gemma 3 are all covered by the new tests. The pi05/pi06 GPU regressions verify single-token encoding and finite forward loss with <locNNNN> in the response.

shuheng-liu · 2026-05-04T04:37:27Z

@claude fix per suggestion and nits.

@claude

- addresses @claude (loc_codec parser): segment-aware loc_tokens_to_xyxy / loc_tokens_to_points — split on `;` so a malformed segment cannot misalign every subsequent box. Update test_partial_token_count_drops_orphan_pairs to the new contract and add regressions for malformed-segment isolation. - addresses @claude (RNG hazard in ensure_loc_tokens): wrap the resize_token_embeddings call in torch.random.fork_rng with a fixed internal seed so embedding init is reproducible and does not consume entropy from the caller's stream. Add Gemma 3 tests asserting RNG isolation and bit-identical new rows across outer seeds. - addresses @claude (duplicate Gemma 3 / PaliGemma tokenizer load): PI06Policy / PI05Policy now share a single tokenizer instance with their inner FlowMatching; ensure_loc_tokens runs once. Existing pi05/pi06 GPU tests assert the shared identity. tests: passed — pytest -m "not gpu" -n auto tests/datasets/ tests/policies/test_pi05.py tests/policies/test_pi06.py tests/configs/ (16 HF-gated tests skipped locally for lack of HF auth; CI runs them with auth) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

shuheng-liu added the feature New feature or request label May 4, 2026

shuheng-liu self-assigned this May 4, 2026

shuheng-liu added the feature New feature or request label May 4, 2026

shuheng-liu mentioned this pull request May 4, 2026

Configurable response formatting for grounding/VQA datasets #238

Open

shuheng-liu added 3 commits May 3, 2026 20:41

shuheng-liu requested review from WilliamYue37 and akshay18iitg and removed request for WilliamYue37 May 4, 2026 03:58

shuheng-liu marked this pull request as ready for review May 4, 2026 03:59

claude Bot reviewed May 4, 2026

View reviewed changes

Comment thread src/opentau/datasets/grounding/loc_codec.py

Comment thread src/opentau/policies/pi06/modeling_pi06.py

Comment thread src/opentau/policies/pi06/modeling_pi06.py Outdated

shuheng-liu merged commit 55d5274 into main May 4, 2026
7 checks passed

shuheng-liu deleted the claude/stupefied-yalow-d00a70 branch May 4, 2026 05:00

claude Bot mentioned this pull request May 4, 2026

chore(claude): learn from #237 #241

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(grounding): add <locNNNN> codec + ensure_loc_tokens for pi05/pi06#237

feat(grounding): add <locNNNN> codec + ensure_loc_tokens for pi05/pi06#237
shuheng-liu merged 5 commits into
mainfrom
claude/stupefied-yalow-d00a70

shuheng-liu commented May 4, 2026 •

edited

Loading

Uh oh!

shuheng-liu commented May 4, 2026

Uh oh!

claude Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

claude Bot commented May 4, 2026

Uh oh!

shuheng-liu commented May 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

shuheng-liu commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this does

How it was tested

How to checkout & try? (for the reviewer)

Checklist

Note: Before submitting this PR, please read the contributor guideline.

Uh oh!

shuheng-liu commented May 4, 2026

GPU regression results

Run on a 32 GB CUDA card

Pre-existing failure (not introduced by this PR)

Checklist update

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

claude Bot commented May 4, 2026

Uh oh!

shuheng-liu commented May 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

shuheng-liu commented May 4, 2026 •

edited

Loading