feat(grounding): add <locNNNN> codec + ensure_loc_tokens for pi05/pi06#237
Conversation
Land the infrastructure for PaliGemma-style location-token grounding data without yet shipping a concrete grounding dataset. - src/opentau/datasets/grounding/loc_codec.py — pure functions to convert pixel coordinates to/from `<locYMIN><locXMIN><locYMAX><locXMAX>` strings (xyxy/xywh boxes, points, tolerant inverses). y-then-x order, 1024-bin quantization, computed against original image dims. - src/opentau/datasets/grounding/tokenizer_utils.py — `ensure_loc_tokens` uses `AddedToken(special=True, normalized=False)` to promote the loc strings to single-token match mode. Idempotent: 0 new IDs on PaliGemma (the strings already live at IDs 256000..257023 but the bare HF tokenizer otherwise BPE-fragments them); 1024 new IDs on a fresh Gemma 3 tokenizer, with `model.resize_token_embeddings` updating the embedding table and tied LM head. - pi05 / pi06 wire `ensure_loc_tokens` into their `__init__`s. pi06 also passes the Gemma 3 model handle so the embedding/LM-head resize fires after the public `google/gemma-3-4b-pt` weights have loaded — the new rows are random-init. - Delete the broken `vqa/pixmo.py`. Its JSON-encoded points-as-ASCII responses fragmented through BPE; the replacement (a configurable grounding dataset) is tracked as a follow-up. - Tests: 10 codec round-trip / clamping / order tests; 7 PaliGemma tokenizer tests; 7 Gemma 3 tokenizer tests (including a fake-model resize check). All pass under `pytest -m "not gpu"`.
Confirms the loc-token wiring added to PI05Policy / PI06Policy / their inner FlowMatching modules works end-to-end on GPU. - pi05: builds the policy, asserts both tokenizer instances encode `<loc0042>` and `<loc1023>` to a single ID each, then runs one forward pass with a four-loc bbox postfix and asserts MSE / CE are finite. - pi06: bare `google/gemma-3-4b-pt` tokenizer has no loc strings; after policy construction the inner tokenizer has grown by exactly 1024, loc tokens encode as single IDs, the Gemma 3 input-embedding row count matches the new tokenizer length, and the LM head output dim has been resized in lockstep. Closes with a forward pass on a loc-token-bearing response and asserts finite loss. Both tests are `@pytest.mark.gpu` + `@pytest.mark.slow` — they run on g6.12xlarge nightly and on the worktree's GPU box.
The shared `lerobot_dataset_metadata` fixture carries actions stats sized (50, 32) for the default `chunk_size`. The new loc-tokens GPU regression runs at `chunk_size=10` to stay small, so the Normalize buffer is built from (50, 32) stats while the live actions tensor is (B, 10, 32) — and `(actions - min) / (max - min + EPS)` errors at dim=1. Inline the actions-stats override before calling `PI06Policy(config, ...)` so the buffer matches the test's `chunk_size`. Same pattern that the existing pi06 smoke test uses on `fix/pi06-paper-alignment`; keeping it inline here so this PR stays orthogonal to that one.
Loading PaliGemma 3B (~6 GB) and Gemma 3 4B (~8 GB) onto a single 32 GB GPU and leaving them resident across tests OOMs the next allocation. Wrap the forward pass in try/finally and `del policy; empty_cache()` at the end so the loc-tokens regressions can run alongside the broader GPU suite on a single-card dev box.
GPU regression resultsAdded two
Both clean up the policy + Run on a 32 GB CUDA cardPre-existing failure (not introduced by this PR)
Verified by running the smoke alone on a clean checkout of Checklist updateThe "GPU pytests" box is now ticked — the new regressions pass and the existing pi05 GPU smoke is unaffected. Nightly regression tests ( |
|
[claude-review] summary for commit c68a488
Encoding/decoding correctness, idempotency of |
|
@claude fix per suggestion and nits. |
- addresses @claude (loc_codec parser): segment-aware loc_tokens_to_xyxy / loc_tokens_to_points — split on `;` so a malformed segment cannot misalign every subsequent box. Update test_partial_token_count_drops_orphan_pairs to the new contract and add regressions for malformed-segment isolation. - addresses @claude (RNG hazard in ensure_loc_tokens): wrap the resize_token_embeddings call in torch.random.fork_rng with a fixed internal seed so embedding init is reproducible and does not consume entropy from the caller's stream. Add Gemma 3 tests asserting RNG isolation and bit-identical new rows across outer seeds. - addresses @claude (duplicate Gemma 3 / PaliGemma tokenizer load): PI06Policy / PI05Policy now share a single tokenizer instance with their inner FlowMatching; ensure_loc_tokens runs once. Existing pi05/pi06 GPU tests assert the shared identity. tests: passed — pytest -m "not gpu" -n auto tests/datasets/ tests/policies/test_pi05.py tests/policies/test_pi06.py tests/configs/ (16 HF-gated tests skipped locally for lack of HF auth; CI runs them with auth) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
What this does
Lands the infrastructure for PaliGemma-style location-token grounding data, without yet shipping a concrete grounding dataset. The follow-up — a configurable response formatter that lets new grounding sources be added by config rather than as a new Python class per dataset — is tracked as a separate issue (filed alongside this PR).
New module:
src/opentau/datasets/grounding/loc_codec.py— pure functions to convert pixel coordinates to/from<locYMIN><locXMIN><locYMAX><locXMAX>strings.xyxy/xywhboxes, points, and tolerant inverses (loc_tokens_to_xyxy,loc_tokens_to_points). y-then-x order, 1024-bin quantization, computed against the original image dims (NOT the post-resize tensor shape).tokenizer_utils.py—ensure_loc_tokens(tokenizer, model=None). Always callstokenizer.add_tokens([AddedToken(t, special=True, normalized=False) for t in LOC_TOKENS], special_tokens=True). The behavior splits on backbone:256000..257023, but the bare HF tokenizer does not register them as added tokens, so a string like"<loc0000>"BPE-fragments into seven pieces (['<', 'loc', '0', '0', '0', '0', '>']).add_tokenswith anAddedTokenwhose string already exists is the documented HF mechanism to promote an existing entry to single-token-match mode without reassigning its ID. Vocab does not grow; no embedding resize is needed.google/gemma-3-4b-pt): the strings are absent. The same call appends 1024 new IDs andmodel.resize_token_embeddings(len(tokenizer))updates the embedding table and tied LM head. New rows are random-init — they learn from the grounding data on first use; there is no PaliGemma loc-embedding transfer.Policy wiring
pi05/modeling_pi05.py:ensure_loc_tokens(self.language_tokenizer)runs after eachAutoTokenizer.from_pretrained("google/paligemma-3b-pt-224")(inPI05Policy.__init__andPI05FlowMatching.__init__). Promotes the reserved entries; no new IDs.pi06/modeling_pi06.py: the same, plusensure_loc_tokens(..., model=self.gemma3_with_expert.gemma3)insidePI06FlowMatching.__init__so the resize fires after the public Gemma 3 weights have already loaded intoGemma3WithExpertModel. A banner comment at the call site documents that the new rows are random-init and that this is unconditional (no config gate).Disjoint from discrete actions
The
<locNNNN>tokens flow through the language vocab +lm_head+response_ce_losspath. Discrete action tokens (FAST processor) live in a separatediscrete_action_embedding/da_headwith their own integer space — no interaction.Cleanup
src/opentau/datasets/vqa/pixmo.py. The previous implementation encoded points as ASCII integers inside a JSON string, which BPE-fragmented unpredictably and defeated the purpose of having a spatial vocabulary. It was unused as soon as anyone pointed a real<locNNNN>-aware policy at it.pixmorow instandard_data_format_mapping.py, its side-effect import infactory.py, and three docstring mentions.How it was tested
Three new test files under
tests/datasets/. All pass underpytest -m "not gpu" -n auto.test_loc_codec.py(10 tests, no network): integer-aligned round-trip on 1024×1024, clamping at axis edges, multi-box concat + parse with;separators, y-then-x order verification on an asymmetric image, garbage-input tolerance, partial-token-count handling, original-image-dims regression test (1920×1080vs.224×224produce different bins), and zero-padding format check.test_loc_tokens_paligemma.py(7 slow tests):ensure_loc_tokensreturns 0 on PaliGemma, IDs match the reserved256000..257023block and are contiguous,<loc0000>encodes to a single ID after promotion, bbox postfix round-trips throughencode/decode, idempotent second call.test_loc_tokens_gemma3.py(7 slow tests):<loc0000>is absent initially;ensure_loc_tokensreturns 1024; idempotent; loc tokens become single-id after extension; bbox postfix round-trips;resize_token_embeddingsfires on first call (verified against a tinynn.Embeddingstand-in) and does NOT fire on idempotent second call.Targeted regression sweep:
pytest -m "not gpu" -n auto tests/datasets/ tests/policies/test_pi05.py tests/policies/test_pi06.py tests/configs/→ 478 passed, 7 skipped, 0 failures.The wider
pytest -m "not gpu" -n autoshows 786 passed / 13 skipped / 2 failed / 3 errors, but the 2 failures (tests/envs/test_factory.py::TestMakeEnv) and 3 collection errors (test_pi07_paligemma_low_level_planner,test_annotate_subtasks,test_libero_utils) are all pre-existing onmain— missinglibero/robosuite/anthropicextras and a staleVJEPA2VideoEncoderimport inpi05_mem. None of those paths are touched here.How to checkout & try? (for the reviewer)
Quick sanity check on the policy wiring (loads the tokenizer, not the full model):
Checklist
If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.
Both new loc-token regressions pass on a 32 GB CUDA card; existing pi05 integration smoke also still passes. See the GPU regression results comment for the full output. Nightly regression tests (regression_test.yml) run on schedule.
The model-loop changes are limited to construction-time tokenizer/embedding setup;
prepare_response,embed_prefix,embed_language_tokens, andresponse_ce_lossare untouched. The pi05 path still produces the same vocab shape (no resize); the pi06 LM head grows by 1024 outputs. CLAUDE.md hard rule #3 (determinism check after modeling-loop changes) is partially deferred — the suggested two-seeded-runs check needs the public Gemma 3 / PaliGemma weights and a smoke training step. Happy to run that on a GPU box if you want it on this PR rather than the follow-up.Note: Before submitting this PR, please read the contributor guideline.