feat(policies): support training and selecting FAST tokenizer #290
…ength cap
- Make --chunk-size required (no default): pi0.5/pi0.6 use 10, pi0.7 low-level uses 50; picking the wrong value silently produces a tokenizer with the wrong BPE merges.
- Default sampler is now a manual per-dataset path: weight-proportional budget allocation, raw-parquet reads, scipy.interp1d for action_freq resampling. Both paths respect the mixture weights and global FPS, but the manual path runs at ~45k chunks/s vs the mixture-dataloader path's ~25 chunks/s. The WeightedDatasetMixture path is preserved behind --use-mixture-dataloader (with thread-parallel per-dataset construction to keep the build phase under 10 min).
- Add --max-token-length, defaulting to 64. Without a cap, BPE training on heavily zero-padded action data wastes its entire merge budget on one runaway zero-run token (verified: 1855 merges, one each at lengths 2, 3, ..., 1320; mean token length on test data 478 bytes vs upstream's 78). Capping at chunk_size keeps merges focused on meaningful patterns -- on the pretrain-pi07 mixture this gives mean 68 tokens / chunk (P99 = 93) at chunk_size=50, comfortably under pi0.7's discrete_action_max_length=150.
- Bump --max-state-dim default to 128 to cover the largest native state dim in the pretrain-pi07-10% mixture (118-dim).
- Replace UniversalActionProcessor.fit() call with a local re-impl so we can override max_token_length on the BpeTrainer (the upstream hard-codes 10000).
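The runaway zero-run failure mode described in this commit message can be reproduced with a toy pair-merge loop. The sketch below is not the `tokenizers` BpeTrainer and not the script's implementation: `train_merges` and its `max_token_length` parameter are hypothetical stand-ins, and the input is a synthetic zero-padded sequence. It only illustrates why a length cap redirects the merge budget.

```python
from collections import Counter


def train_merges(seq, num_merges, max_token_length=None):
    """Toy BPE: repeatedly merge the most frequent adjacent token pair.

    Tokens are tuples of the original symbols, so len(token) is the token
    length. A pair is skipped when its merged length would exceed the cap.
    """
    tokens = [(s,) for s in seq]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        # Keep only pairs whose merged length respects the cap (if any).
        candidates = [
            (count, pair)
            for pair, count in pairs.items()
            if max_token_length is None
            or len(pair[0]) + len(pair[1]) <= max_token_length
        ]
        if not candidates:
            break
        count, (left, right) = max(candidates, key=lambda c: c[0])
        if count < 2:
            break  # nothing repeats, so no merge is worth learning
        merged = left + right
        merges.append(merged)
        # Rewrite the token stream left to right with the new merge applied.
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                out.append(merged)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return merges


# Synthetic data: a short "real" pattern followed by a long zero pad.
chunk = [3, 7, 3, 7, 5] + [0] * 40
seq = chunk * 20

uncapped = train_merges(seq, num_merges=6)
capped = train_merges(seq, num_merges=6, max_token_length=4)
print("uncapped merge lengths:", [len(m) for m in uncapped])
print("capped merge lengths:  ", [len(m) for m in capped])
```

On this toy input the uncapped run spends its first merges on ever-longer zero runs, while the capped run stops growing the zero token at the cap and moves on to the `(3, 7)` pattern.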
The pi0.7 low-level policy hard-coded ``physical-intelligence/fast`` when instantiating its ``discrete_action_processor``. Add a new ``discrete_action_tokenizer_path`` config field (default unchanged) so a specialized FAST tokenizer fit via ``opentau.scripts.fit_fast_tokenizer`` can be plugged in via a single ``--policy.discrete_action_tokenizer_path`` CLI override. The value flows through to the auxiliary CE target at training; inference is unaffected (flow-matching head only). Note: pi0.5 / pi0.6 / pi0.7_paligemma low-level still hard-code the upstream path -- mirroring this change there is left for follow-ups.
Replace the hard-coded ``AutoProcessor.from_pretrained("physical-intelligence/fast", ...)``
call in every policy with a new ``discrete_action_tokenizer_path`` config field.
A single ``--policy.discrete_action_tokenizer_path`` CLI override now plugs in
a specialized FAST tokenizer (e.g. one fit via
``opentau.scripts.fit_fast_tokenizer``) without policy code changes. The value
flows through to the auxiliary CE target at training; inference paths are
unaffected (flow-matching heads only).
Policies touched: pi0.5, pi0.5-mem, pi0.6, pi0.7 high-level + low-level,
pi0.7-paligemma high-level + low-level.
Defaults:
- ``PI07LowLevelConfig.discrete_action_tokenizer_path`` defaults to
``TensorAuto/fast-pi07-pretrain`` -- the pi0.7 pretrain mixture's
specialized fit (vocab=2048, max_token_length=64, mean 68 tokens/chunk
vs upstream's ~146 on the same data).
- All other policies keep the existing default of
``physical-intelligence/fast`` so no behaviour changes for pi0.5/0.6/0.7-HL
callers.
Test pins: GPU tests in ``tests/policies/test_pi07_low_level.py`` now
explicitly pass ``discrete_action_tokenizer_path="physical-intelligence/fast"``
so CI doesn't depend on private-repo credentials.
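The plumbing described above can be sketched as follows. This is a minimal illustration, not the repository's code: `ToyPolicyConfig` and `ToyPolicy` are hypothetical stand-ins for the real config and policy classes, and the loader is injected so the sketch runs offline. In the real policies the loader would be `AutoProcessor.from_pretrained`; passing `trust_remote_code=True` here is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Callable

UPSTREAM_FAST = "physical-intelligence/fast"


@dataclass
class ToyPolicyConfig:
    """Hypothetical stand-in for a policy config with the new field."""

    discrete_action_tokenizer_path: str = UPSTREAM_FAST


class ToyPolicy:
    """Hypothetical policy that reads the tokenizer path from its config."""

    def __init__(self, config: ToyPolicyConfig, load_processor: Callable[..., Any]):
        # No literal repo id here: the path comes from the config field,
        # so a CLI override changes the tokenizer without code changes.
        self.discrete_action_processor = load_processor(
            config.discrete_action_tokenizer_path, trust_remote_code=True
        )


# Offline usage with a fake loader that records what it was asked to load.
calls = []


def fake_loader(path, **kwargs):
    calls.append((path, kwargs))
    return object()


ToyPolicy(ToyPolicyConfig(), fake_loader)
ToyPolicy(ToyPolicyConfig(discrete_action_tokenizer_path="/tmp/my-fast"), fake_loader)
print([path for path, _ in calls])
```

Injecting the loader is also the shape a CPU-only test can use to check the plumbing without network access.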
Review
Overview
Two changes shipped together:
What's strong
Issues / questions
1. Distribution mismatch between fit-time and training-time normalization (medium) The script's
x = np.nan_to_num(chunks.astype(np.float64), nan=0.0, posinf=0.0, neginf=0.0)
norm = 2.0 * (x - action_min) / safe_span - 1.0
...
return np.clip(norm, -1.0, 1.0).astype(np.float32)
But production
batch[key] = (batch[key] - min) / (max - min + EPS)
batch[key] = batch[key] * 2 - 1
-- no clip, no
2. The pi0.7-LL default flip creates a hidden credential dependency (medium) After this PR, Two follow-ups worth considering:
3. Per the docstring in
4.
5.
6. In-script re-implementation of The script copies the upstream
7. pi0.5/0.6/0.7 currently happen to use
8. The function mutates the parsed
9. Magic number Could be e.g.
Correctness / security / perf
Tests
Verdict
Approve in principle once #1 (normalization mismatch -- clarify or fix), #2 (pi0.7-LL default -- explicit decision on credential cost), and #4 (delete stale
Generated by Claude Code
- Revert ``PI07LowLevelConfig.discrete_action_tokenizer_path`` default to
``physical-intelligence/fast`` so all seven policies share the same
default (avoids a private-repo credential dependency on first plain
instantiation). Specialized fits like ``TensorAuto/fast-pi07-pretrain``
are opted into via ``--policy.discrete_action_tokenizer_path=<...>``.
- Update the CPU mock test to pin the upstream default and exercise the
``TensorAuto/fast-pi07-pretrain`` value via the override path. Drop the
now-redundant explicit override in ``test_pi07_low_level.py`` GPU tests.
- Make ``_normalize_chunks`` byte-match production
``Normalize({"ACTION": MIN_MAX})``: use ``EPS = 1e-8`` (the production
constant), drop the post-normalization ``np.clip``, keep the
``nan_to_num`` (mirrors ``prepare_discrete_actions`` in pi0.7-LL).
- ``_resolve_native_action_key`` now optionally accepts the dataset's
actual stats / column keys and prefers whichever of ``"action"`` /
``"actions"`` is present, eliminating the silent-failure mode for
datasets that aren't registered in ``DATA_FEATURES_NAME_MAPPING`` and
use the OpenTau-canonical plural form.
- ``_aggregate_stats_manual`` now logs a loud warning when any dataset's
native action dim exceeds ``--action-dim``: those dims are silently
dropped, which is only correct if the production policy's
``max_action_dim`` matches.
- ``_build_train_cfg`` uses ``dataclasses.replace`` instead of mutating
the caller's ``DatasetMixtureConfig`` in place.
- ``_build_mixture_parallel`` no longer carries the dead val-dataset
branch: the script forces ``val_split_ratio=0``, so we assert that and
raise loudly if ``make_dataset`` ever returns a tuple.
- Pin the upstream commit SHA used to port ``_fit_tokenizer``'s body
(``UPSTREAM_PINNED_SHA``), so a future drift between our re-impl and
upstream is easy to detect.
- Name the ``+ 100`` merge-headroom magic as ``_MIN_MERGE_HEADROOM``.
- Drop the leftover ``_ = warnings`` shim (``warnings`` is actually used
in ``_aggregate_stats_manual``).
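To make the `_normalize_chunks` change concrete, here is a minimal pure-Python sketch of min-max normalization with and without the post-normalization clip. The function name `normalize_min_max` is hypothetical, and `EPS = 1e-8` is taken from the commit message above; the real code operates on arrays, not Python lists.

```python
import math

EPS = 1e-8  # value quoted in the commit message as the production constant


def normalize_min_max(values, lo, hi, clip=False):
    """Map values from [lo, hi] to roughly [-1, 1].

    Non-finite inputs are replaced by 0.0 first (defensive, in the spirit of
    nan_to_num). With clip=False, out-of-range inputs stay out of range.
    """
    out = []
    for v in values:
        if math.isnan(v) or math.isinf(v):
            v = 0.0
        n = (v - lo) / (hi - lo + EPS) * 2.0 - 1.0
        if clip:
            n = max(-1.0, min(1.0, n))
        out.append(n)
    return out


samples = [0.0, 0.5, 1.5, float("nan")]  # 1.5 lies outside the fitted range
unclipped = normalize_min_max(samples, lo=0.0, hi=1.0)
clipped = normalize_min_max(samples, lo=0.0, hi=1.0, clip=True)
print(unclipped)
print(clipped)
```

The two variants agree for in-range inputs and differ only where a value falls outside the fitted min/max, which is where a fit-time clip would make the tokenizer see a different distribution than training does.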
Thanks for the review -- all addressed in
#1 normalization mismatch (medium). Fixed --
#2 pi0.7-LL default flip (medium). Reverted -- all seven policies now default to
#3
#4
#5 val-mixture symmetry break (low). Resolved by going the other direction: dropped the val-validation entirely, asserted
#6 in-script re-implementation rot (low). Added
#7 silent
#8 mixture mutation (nit). Now uses
#9 magic
Verification: CPU test 4/4, GPU re-run dispatched (will update the PR with results when it lands).
GPU re-run on
Follow-up on
Address the follow-up review on PR #290:
- `_normalize_chunks` docstring previously claimed pi0.7's `prepare_discrete_actions` applies `torch.nan_to_num` after `Normalize`. It does not -- neither `Normalize` nor `prepare_discrete_actions` in main has any NaN handling. Rewrite the comment to honestly say the `nan_to_num` is defensive (so a single corrupted chunk doesn't tank an hour-long fit). The script's math is still byte-identical to training for non-NaN inputs.
- `_drain_mixture` (the slow `--use-mixture-dataloader` path) doesn't surface the same "native_action_dim > --action-dim" warning the manual path does, because the mixture's standardization pipeline already raises on truncation via `pad_vector`. Add a one-line note explaining why no warning is needed on that path.
Fixed in
All review items resolved ✅
Confirming after
No outstanding items from my review. Ready for human reviewer sign-off.
Generated by Claude Code
Independent review of 58c905a. Two correctness issues worth fixing before merge — both regressions introduced by the prior fix round, not in the original draft.
- blocking -- `src/opentau/scripts/fit_fast_tokenizer.py:598-607,648-654` -- `_build_train_cfg` uses `dataclasses.replace(mixture_cfg, ...)` but never overrides `val_split_ratio`. `DatasetMixtureConfig.val_split_ratio` defaults to 0.05 (`src/opentau/configs/default.py:277`). The comment at line 648 ("`_build_train_cfg` forces `val_split_ratio=0` (the default)") and the assert at line 652 are both wrong: any user passing a default mixture JSON via `--use-mixture-dataloader` will trip the assert and abort before sampling. Fix: add `val_split_ratio=0.0` to the `dataclasses.replace` kwargs.
- suggestion -- `src/opentau/scripts/fit_fast_tokenizer.py:96,927-935` -- `UPSTREAM_PINNED_SHA = "ec4d..."` is documented as the SHA the script tracks, but `_find_upstream_source` calls `snapshot_download(repo_id=..., allow_patterns=...)` with no `revision=` arg, so each invocation downloads the current HEAD of `physical-intelligence/fast/processing_action_tokenizer.py`. The file copied into `out_dir` (which `AutoProcessor.from_pretrained(..., trust_remote_code=True)` later executes) can drift silently from the ported `_fit_tokenizer` body. Pass `revision=UPSTREAM_PINNED_SHA` to `snapshot_download` to make the constant load-bearing.
- suggestion -- `src/opentau/scripts/fit_fast_tokenizer.py:128-130` -- `--chunk-size` help text claims "pi0.5 / pi0.6 default to 10; pi0.7 low-level defaults to 50". Every policy config in this repo (`pi05`, `pi05_mem`, `pi06`, `pi07/low_level`, `pi07_paligemma/low_level_planner`) has `n_action_steps: int = 50`. The "10" reading comes from `configs/examples/pi07_libero.json`, which overrides -- opposite of what the help text says. This actively misleads users into setting `--chunk-size 10` for pi0.5/0.6 production.
- suggestion -- `tests/policies/test_pi07_cpu.py:1263-1317` -- Test docstring at line 1279 says "All seven policies default to the upstream tokenizer", but only `PI07LowLevelConfig` is exercised. A typo in any of the other six config defaults (`PI05Config`, `PI05MEMConfig`, `PI06Config`, `PI07HighLevelPlannerConfig`, `PI07PaligemmaHighLevelConfig`, `PI07PaligemmaLowLevelConfig`) -- or a missed plumbing change in their modeling files -- would slip through the gate the PR builds for itself. A `@pytest.mark.parametrize` over the seven config classes asserting `cfg.discrete_action_tokenizer_path == "physical-intelligence/fast"` is ~10 lines and closes the gap.
- nit -- `src/opentau/scripts/fit_fast_tokenizer.py:737-753` -- `_normalize_chunks` casts `chunks.astype(np.float64)` then back to `float32` at return. Production `Normalize` runs entirely in float32 (`src/opentau/policies/normalize.py`, `create_stats_buffers`). The self-review's "byte-identical to training" claim isn't quite right; the values differ in low float32 bits. Functionally moot; just don't claim bit-equality.
- nit -- `src/opentau/scripts/fit_fast_tokenizer.py:540-548,666-680` -- Both threaded paths consume `as_completed` and write into completion-order lists/extends, so chunk order in the BPE training corpus is nondeterministic across runs even with `--seed`. BPE tie-breaking on equal pair frequencies depends on first-encounter order, so two seeded runs may not be bit-identical (cf. CLAUDE.md rule 3). Sorting completed results back into mixture-config order at the join point would restore determinism; otherwise document the caveat in the script header.
Cross-checks that hold up: all seven policies default to physical-intelligence/fast; _resolve_native_action_key handles both "action" and "actions" via the new available_keys arg; _ = warnings line is gone; _MIN_MERGE_HEADROOM is properly named; the seven AutoProcessor.from_pretrained call sites all thread config.discrete_action_tokenizer_path correctly with no straggler literal strings. The seven-policy refactor itself is clean and mechanical.
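The determinism nit above can be illustrated with a small standard-library sketch. `load_chunks` and `gather_in_config_order` are hypothetical names, and the sleep only simulates datasets finishing out of order; the point is that results are written into a slot keyed by submission index rather than appended in completion order.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def load_chunks(index, delay):
    """Hypothetical per-dataset worker; `delay` simulates uneven load times."""
    time.sleep(delay)
    return [f"ds{index}-chunk{j}" for j in range(2)]


def gather_in_config_order(num_datasets):
    """Run workers in parallel but return chunks in submission (config) order."""
    results = [None] * num_datasets
    with ThreadPoolExecutor(max_workers=num_datasets) as pool:
        # Later datasets get shorter delays, so they tend to finish first.
        futures = {
            pool.submit(load_chunks, i, 0.01 * (num_datasets - i)): i
            for i in range(num_datasets)
        }
        for fut in as_completed(futures):
            # Positional write: completion order cannot perturb the output.
            results[futures[fut]] = fut.result()
    return [chunk for per_dataset in results for chunk in per_dataset]


corpus = gather_in_config_order(3)
print(corpus)
```

Appending to a shared list inside the `as_completed` loop would instead yield whatever order the threads happened to finish in.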
Generated by Claude Code
[claude-review] summary for commit 58c905a
@claude fix per review
- Blocking: force val_split_ratio=0.0 in _build_train_cfg's replace so the default 0.05 doesn't trip the parallel-mixture-build assertion.
- Pin _find_upstream_source snapshot to UPSTREAM_PINNED_SHA so the copied remote-code file matches the ported _fit_tokenizer body.
- Fix --chunk-size help text: every policy config defaults n_action_steps to 50 (only configs/examples/pi07_libero.json overrides to 10), so the prior "pi0.5/0.6 default to 10" guidance was inverted.
- Parametrize the seven-policy default test so a typo in any of the other six config defaults is caught at CPU-test time.
- Soften _normalize_chunks docstring: float64 -> float32 round-trip isn't bit-identical to production's float32 path; agreement is ~1e-7.
- Restore deterministic BPE corpus order in _sample_via_manual by collecting into a per-dataset positional list before concatenating, so as_completed completion order can't perturb BPE tie-breaks.
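The blocking fix can be sketched with a toy dataclass. `ToyMixtureConfig` and `build_train_mixture` are hypothetical stand-ins for `DatasetMixtureConfig` and `_build_train_cfg`; only the 0.05 default is taken from the review above, the other fields are made up for illustration.

```python
from dataclasses import dataclass, replace


@dataclass
class ToyMixtureConfig:
    """Hypothetical stand-in for DatasetMixtureConfig."""

    weights: tuple = (1.0,)
    action_freq: float = 30.0
    val_split_ratio: float = 0.05  # default value cited in the review


def build_train_mixture(mixture_cfg):
    # replace() returns a new instance, so the caller's config is untouched;
    # val_split_ratio must be overridden explicitly or the default survives.
    return replace(mixture_cfg, val_split_ratio=0.0)


user_cfg = ToyMixtureConfig(weights=(0.7, 0.3))
train_cfg = build_train_mixture(user_cfg)
print(user_cfg.val_split_ratio, train_cfg.val_split_ratio)
```

An assertion such as `train_cfg.val_split_ratio == 0.0` then holds for any input config, instead of depending on what the user's JSON happened to contain.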
What this does
Two related changes shipped together:
1. `opentau.scripts.fit_fast_tokenizer` (new script)
CPU-only one-shot fit of a `physical-intelligence/fast`-compatible action tokenizer on an OpenTau `DatasetMixtureConfig`. Output is a directory loadable via `AutoProcessor.from_pretrained(out_dir, trust_remote_code=True)`.
The FAST tokenizer is used during pi0.5 / pi0.7 training as an auxiliary cross-entropy target on top of the flow-matching MSE loss -- it tokenizes action chunks via DCT + BPE. The published upstream checkpoint was fit on a generic robotics mixture; this script lets us specialize the BPE to our own action distribution.
Pipeline (all CPU):
- … `$ref` includes in the mixture JSON and parse via `draccus`.
- … `min/max` across the mixture (NaN-tolerant `nanmin`/`nanmax`).
- … `mixture.weights`.
- … `--use-mixture-dataloader`):
  - … `mixture.action_freq` via `scipy.interpolate.interp1d`. ~45k chunks/s.
  - … `WeightedDatasetMixture` dataloader. ~25 chunks/s but uses the exact training pipeline; included for cross-checking.
- … `[-1, 1]` -- mirrors `Normalize({"ACTION": NormalizationMode.MIN_MAX})` from pi0.7's `prepare_discrete_actions`. Right-pad `action_dim`.
- … `UniversalActionProcessor.fit(...)` so we can pass `max_token_length` to the BpeTrainer (the upstream hard-codes 10000). Default cap is 64.
- … `save_pretrained` + copy `processing_action_tokenizer.py`. Round-trip a held-out sample and write `fit_report.json`.
Why `--max-token-length` is needed (the BPE footgun)
Without a cap, the BPE training on heavily zero-padded action data wastes its entire merge budget on one runaway zero-run token. On the pretrain-pi07-10% mixture this produced 1855 merges of lengths 2, 3, 4, …, 1320 (one of each, all zero-pad) with mean token length 478 bytes per chunk (vs upstream's 78). Capping at `chunk_size` forces the BPE to spend merges on real patterns; mean token length drops 42× to 68 tokens / chunk on this mixture, comfortably under pi0.7's `discrete_action_max_length=150`.
Why `--chunk-size` is required (no default)
The BPE merges depend on chunk length. pi0.5 / pi0.6 default to `n_action_steps=10`, pi0.7 low-level defaults to `n_action_steps=50` (`configuration_pi07_low_level.py:122-123`). The example configs (`pi07_libero.json`) override to 10, so naive pattern-matching gives the wrong value silently. Better to force the user to pick.
2. Make `physical-intelligence/fast` configurable across all policies
Replace the hard-coded `AutoProcessor.from_pretrained("physical-intelligence/fast", ...)` call in every policy with a new `discrete_action_tokenizer_path` config field. A single `--policy.discrete_action_tokenizer_path` CLI override now plugs in a specialized FAST tokenizer (e.g. one fit via the script above) without policy code changes. The value flows through to the auxiliary CE target at training; inference paths are unaffected (flow-matching heads only).
Policies touched: pi0.5, pi0.5-mem, pi0.6, pi0.7 high-level + low-level, pi0.7-paligemma high-level + low-level.
Defaults:
| | `discrete_action_tokenizer_path` default |
|---|---|
| `PI07LowLevelConfig` | `TensorAuto/fast-pi07-pretrain` (specialized fit produced by this PR's script: vocab=2048, max_token_length=64, mean 68 tokens/chunk on our pretrain mixture) |
| `PI05Config` / `PI05MemConfig` / `PI06Config` | `physical-intelligence/fast` (upstream) |
| `PI07HighLevelPlannerConfig` | `physical-intelligence/fast` |
| | `physical-intelligence/fast` |

The pi0.7-LL default flip is the only behaviour change; all other policies keep their existing tokenizer.
How it was tested
- `ruff check`, `ruff format`, `pre-commit run` -- all pass on every file touched (typos, pyupgrade, bandit, gitleaks, ruff, ruff-format).
- `tests/policies/test_pi07_cpu.py::TestPi07ConfigPlumbing::test_discrete_action_tokenizer_path_default_and_override` pins both the default value AND that `PI07LowLevelPolicy.__init__` plumbs `config.discrete_action_tokenizer_path` into `AutoProcessor.from_pretrained` (mocked, no network).
- … `tests/policies/test_pi07_low_level.py` now explicitly pass `discrete_action_tokenizer_path="physical-intelligence/fast"` so CI does not depend on private-repo credentials.
- … `fit_fast_tokenizer.py` on a 392-dataset internal pretrain mixture: stats aggregation + 1.5k-chunk sampling (manual path) in 25 s; 0 errors.
- … `AutoProcessor.from_pretrained(out_dir, trust_remote_code=True)` from a fresh shell.
- … `vocab.json` before/after the `max_token_length` cap to confirm the runaway-zero failure mode and that the cap fixes it.
- … `physical-intelligence/fast` references remain in any modeling code; all instances are now config defaults.
How to checkout & try? (for the reviewer)
Checklist