feat(policies): support training and selecting FAST tokenizer by shuheng-liu · Pull Request #290 · TensorAuto/OpenTau

shuheng-liu · 2026-05-11T22:50:14Z

What this does

Two related changes shipped together:

1. `opentau.scripts.fit_fast_tokenizer` (new script)

CPU-only one-shot fit of a physical-intelligence/fast-compatible action tokenizer on an OpenTau DatasetMixtureConfig. Output is a directory loadable via AutoProcessor.from_pretrained(out_dir, trust_remote_code=True).

The FAST tokenizer is used during pi0.5 / pi0.7 training as an auxiliary cross-entropy target on top of the flow-matching MSE loss — it tokenizes action chunks via DCT + BPE. The published upstream checkpoint was fit on a generic robotics mixture; this script lets us specialize the BPE to our own action distribution.

Pipeline (all CPU):

Resolve $ref includes in the mixture JSON and parse via draccus.
Aggregate per-dim action min / max across the mixture (NaN-tolerant nanmin / nanmax).
Allocate per-dataset chunk budgets weight-proportional to mixture.weights.
Sample chunks two ways (selectable via --use-mixture-dataloader):
- Default (fast): read action+timestamp columns directly from each parquet, resample to mixture.action_freq via scipy.interpolate.interp1d. ~45k chunks/s.
- Opt-in (safer): go through the full WeightedDatasetMixture dataloader. ~25 chunks/s but uses the exact training pipeline; included for cross-checking.
Min-max-normalize chunks to [-1, 1] — mirrors Normalize({"ACTION": NormalizationMode.MIN_MAX}) from pi0.7's prepare_discrete_actions. Right-pad action_dim.
Locally re-implement UniversalActionProcessor.fit(...) so we can pass max_token_length to the BpeTrainer (the upstream hard-codes 10000). Default cap is 64.
save_pretrained + copy processing_action_tokenizer.py. Round-trip a held-out sample and write fit_report.json.

Why `--max-token-length` is needed (the BPE footgun)

Without a cap, the BPE training on heavily zero-padded action data wastes its entire merge budget on one runaway zero-run token. On the pretrain-pi07-10% mixture this produced 1855 merges of lengths 2, 3, 4, …, 1320 — one of each, all zero-pad — with mean token length 478 bytes per chunk (vs upstream's 78). Capping at chunk_size forces the BPE to spend merges on real patterns; mean token length drops 42× to 68 tokens / chunk on this mixture, comfortably under pi0.7's discrete_action_max_length=150.

Why `--chunk-size` is required (no default)

The BPE merges depend on chunk length. pi0.5 / pi0.6 default to n_action_steps=10, pi0.7 low-level defaults to n_action_steps=50 (configuration_pi07_low_level.py:122-123). The example configs (pi07_libero.json) override to 10, so naive pattern-matching gives the wrong value silently. Better to force the user to pick.

2. Make `physical-intelligence/fast` configurable across all policies

Replace the hard-coded AutoProcessor.from_pretrained("physical-intelligence/fast", ...) call in every policy with a new discrete_action_tokenizer_path config field. A single --policy.discrete_action_tokenizer_path CLI override now plugs in a specialized FAST tokenizer (e.g. one fit via the script above) without policy code changes. The value flows through to the auxiliary CE target at training; inference paths are unaffected (flow-matching heads only).

Policies touched: pi0.5, pi0.5-mem, pi0.6, pi0.7 high-level + low-level, pi0.7-paligemma high-level + low-level.

Defaults:

Policy	`discrete_action_tokenizer_path` default
`PI07LowLevelConfig`	`TensorAuto/fast-pi07-pretrain` (specialized fit produced by this PR's script — vocab=2048, max_token_length=64, mean 68 tokens/chunk on our pretrain mixture)
`PI05Config` / `PI05MemConfig` / `PI06Config`	`physical-intelligence/fast` (upstream)
`PI07HighLevelPlannerConfig`	`physical-intelligence/fast`
pi0.7-paligemma high-level + low-level	`physical-intelligence/fast`

The pi0.7-LL default flip is the only behaviour change; all other policies keep their existing tokenizer.

How it was tested

ruff check, ruff format, pre-commit run — all pass on every file touched (typos, pyupgrade, bandit, gitleaks, ruff, ruff-format).
New CPU test tests/policies/test_pi07_cpu.py::TestPi07ConfigPlumbing::test_discrete_action_tokenizer_path_default_and_override pins both the default value AND that PI07LowLevelPolicy.__init__ plumbs config.discrete_action_tokenizer_path into AutoProcessor.from_pretrained (mocked, no network).
GPU tests in tests/policies/test_pi07_low_level.py now explicitly pass discrete_action_tokenizer_path="physical-intelligence/fast" so CI does not depend on private-repo credentials.
Smoke run of fit_fast_tokenizer.py on a 392-dataset internal pretrain mixture: stats aggregation + 1.5k-chunk sampling (manual path) in 25 s; 0 errors.
Pilot fits at chunk_size 10 and 50, both round-tripped via AutoProcessor.from_pretrained(out_dir, trust_remote_code=True) from a fresh shell.
Full fit at chunk_size=50, action_dim=32, vocab=2048, max_token_length=64: 1M chunks, ~4 min total wall-clock (22 s sampling + 217 s BPE), round-trip MSE 3.1e-5, mean 68 tokens/chunk (P99 93), all chunks under 150-token policy cap.
Inspected vocab.json before/after the max_token_length cap to confirm the runaway-zero failure mode and that the cap fixes it.
Final grep: zero hardcoded physical-intelligence/fast references remain in any modeling code; all instances are now config defaults.

How to checkout & try? (for the reviewer)

git fetch origin claude/crazy-chebyshev-0ebfcc
git checkout claude/crazy-chebyshev-0ebfcc

# Fit a tokenizer on a mixture (pi0.7 low-level production defaults):
python -m opentau.scripts.fit_fast_tokenizer \
    --mixture-json path/to/your/mixture.json \
    --out-dir path/to/output_dir/ \
    --chunk-size 50

# Pilot (writes to out_dir/pilot/, caps total-chunks at 50_000):
python -m opentau.scripts.fit_fast_tokenizer \
    --mixture-json path/to/your/mixture.json \
    --out-dir path/to/output_dir/ \
    --chunk-size 50 \
    --pilot

# Plug a specialized tokenizer into pi0.7 low-level training (the new
# config knob; works on any of the seven affected policies):
opentau-train \
    --config_path configs/examples/pi07_libero.json \
    --policy.discrete_action_tokenizer_path=TensorAuto/fast-pi07-pretrain

# CPU test for the config-plumbing change:
pytest tests/policies/test_pi07_cpu.py::TestPi07ConfigPlumbing::test_discrete_action_tokenizer_path_default_and_override -xvs

Checklist

I have added Google-style docstrings to important functions and ensured function parameters are typed.
My PR includes policy-related changes.
- If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

GPU pytest ran twice on an internal RTX 5090 dev box: HEAD b70bedf -> 15 passed, 10 skipped, 1 warning, 2:37. Re-run after the review fixes at HEAD 62c511d -> 15 passed, 10 skipped, 1 warning, 2:40. The 10 skipped are the pre-existing memory-bound @pytest.mark.skip tests (>24 GB VRAM), unchanged from main. Regression tests (the g6.12xlarge nightly suite) were not dispatched.

…ation

…ength cap - Make --chunk-size required (no default): pi0.5/pi0.6 use 10, pi0.7 low-level uses 50; picking the wrong value silently produces a tokenizer with the wrong BPE merges. - Default sampler is now a manual per-dataset path: weight-proportional budget allocation, raw-parquet reads, scipy.interp1d for action_freq resampling. Both paths respect the mixture weights and global FPS, but the manual path runs at ~45k chunks/s vs the mixture-dataloader path's ~25 chunks/s. The WeightedDatasetMixture path is preserved behind --use-mixture-dataloader (with thread-parallel per-dataset construction to keep the build phase under 10 min). - Add --max-token-length, defaulting to 64. Without a cap, BPE training on heavily zero-padded action data wastes its entire merge budget on one runaway zero-run token (verified: 1855 merges, one each at lengths 2, 3, ..., 1320; mean token length on test data 478 bytes vs upstream's 78). Capping at chunk_size keeps merges focused on meaningful patterns -- on the pretrain-pi07 mixture this gives mean 68 tokens / chunk (P99 = 93) at chunk_size=50, comfortably under pi0.7's discrete_action_max_length=150. - Bump --max-state-dim default to 128 to cover the largest native state dim in the pretrain-pi07-10% mixture (118-dim). - Replace UniversalActionProcessor.fit() call with a local re-impl so we can override max_token_length on the BpeTrainer (the upstream hard-codes 10000).

The pi0.7 low-level policy hard-coded ``physical-intelligence/fast`` when instantiating its ``discrete_action_processor``. Add a new ``discrete_action_tokenizer_path`` config field (default unchanged) so a specialized FAST tokenizer fit via ``opentau.scripts.fit_fast_tokenizer`` can be plugged in via a single ``--policy.discrete_action_tokenizer_path`` CLI override. The value flows through to the auxiliary CE target at training; inference is unaffected (flow-matching head only). Note: pi0.5 / pi0.6 / pi0.7_paligemma low-level still hard-code the upstream path -- mirroring this change there is left for follow-ups.

Replace the hard-coded ``AutoProcessor.from_pretrained("physical-intelligence/fast", ...)`` call in every policy with a new ``discrete_action_tokenizer_path`` config field. A single ``--policy.discrete_action_tokenizer_path`` CLI override now plugs in a specialized FAST tokenizer (e.g. one fit via ``opentau.scripts.fit_fast_tokenizer``) without policy code changes. The value flows through to the auxiliary CE target at training; inference paths are unaffected (flow-matching heads only). Policies touched: pi0.5, pi0.5-mem, pi0.6, pi0.7 high-level + low-level, pi0.7-paligemma high-level + low-level. Defaults: - ``PI07LowLevelConfig.discrete_action_tokenizer_path`` defaults to ``TensorAuto/fast-pi07-pretrain`` -- the pi0.7 pretrain mixture's specialized fit (vocab=2048, max_token_length=64, mean 68 tokens/chunk vs upstream's ~146 on the same data). - All other policies keep the existing default of ``physical-intelligence/fast`` so no behaviour changes for pi0.5/0.6/0.7-HL callers. Test pins: GPU tests in ``tests/policies/test_pi07_low_level.py`` now explicitly pass ``discrete_action_tokenizer_path="physical-intelligence/fast"`` so CI doesn't depend on private-repo credentials.

shuheng-liu · 2026-05-12T05:25:12Z

Review

Overview

Two changes shipped together:

New CPU-only script src/opentau/scripts/fit_fast_tokenizer.py (~1k lines) — fits a physical-intelligence/fast-compatible BPE action tokenizer on an OpenTau DatasetMixtureConfig, with two sampling paths (fast manual parquet + scipy interp, vs. slower full WeightedDatasetMixture dataloader) and a configurable max_token_length to cap runaway zero-pad merges.
Plumbing for discrete_action_tokenizer_path across pi0.5, pi0.5-mem, pi0.6, pi0.7 HL+LL, and pi0.7-paligemma HL+LL — replaces the hard-coded AutoProcessor.from_pretrained("physical-intelligence/fast", ...) call at one site per policy with a config field. Only pi0.7-LL's default flips (to TensorAuto/fast-pi07-pretrain); the other six remain on upstream.

What's strong

Surgical refactor on the seven policies. Each is a one-line modeling change + one-field config addition. The diff is mechanical, the comment block in each config is identical, and there's a CPU mock test that pins the default and verifies the value is plumbed all the way through to AutoProcessor.from_pretrained (test_pi07_cpu.py::test_discrete_action_tokenizer_path_default_and_override). Good test boundary — patches at the import site, captures the path arg, never hits the network.
GPU test pinned to upstream to avoid CI credential dependency — correct call.
Documentation in the fit script is genuinely useful. The max_token_length finding (BPE wasting its entire merge budget on one runaway zero-run token, 1855 merges of lengths 2…1320) is the kind of empirical debug observation that's worth preserving in-place; required --chunk-size (no default) is well-justified given that pi0.5/0.6 and pi0.7-LL disagree on n_action_steps.
Empirical validation looks solid. 1M chunks in ~4 min, round-trip MSE 3.1e-5, P99 93 tokens (under the 150-token policy cap).

Issues / questions

1. Distribution mismatch between fit-time and training-time normalization (medium)

The script's _normalize_chunks (line ~670) does:

x = np.nan_to_num(chunks.astype(np.float64), nan=0.0, posinf=0.0, neginf=0.0)
norm = 2.0 * (x - action_min) / safe_span - 1.0
...
return np.clip(norm, -1.0, 1.0).astype(np.float32)

But production Normalize({"ACTION": NormalizationMode.MIN_MAX}) at src/opentau/policies/normalize.py:253-260 does:

batch[key] = (batch[key] - min) / (max - min + EPS)
batch[key] = batch[key] * 2 - 1

— no clip, no nan_to_num. The script's docstring claims to "mirror" the training-time normalization. For chunks within the recorded min/max this is moot, but tails (and any NaN inputs) will see slightly different values at fit vs train, which shifts the BPE corpus distribution. Either drop the clip + nan_to_num to truly mirror, or document why they were added (defensive but distribution-changing).

2. The pi0.7-LL default flip creates a hidden credential dependency (medium)

After this PR, PI07LowLevelConfig() with no overrides resolves TensorAuto/fast-pi07-pretrain, which is a private repo. Anyone without TensorAuto org access will hit an auth wall on plain instantiation — including external users running configs/examples/pi07_*.json for the first time. The other six policies kept upstream as default and rely on the new CLI override to opt into specialized fits; the inconsistency is the only awkward thing about this PR.

Two follow-ups worth considering:

Keep physical-intelligence/fast as PI07LowLevelConfig's default for parity with the six others, and ship the specialized fit via an example config (e.g. configs/examples/pi07_pretrain.json) that sets --policy.discrete_action_tokenizer_path=TensorAuto/fast-pi07-pretrain.
If existing on-hub pi0.7-LL checkpoints were trained against the upstream tokenizer, fresh policy construction after this PR will mismatch on discrete_action_vocab_size (specialized 2048 vs upstream 1024). The resume path is fine — saved configs win — but the from_pretrained path for an old checkpoint plus default-config instantiation may not be.

3. _resolve_native_action_key fallback to "action" (singular) is risky (low–medium)

Per the docstring in standard_data_format_mapping.py, the OpenTau-canonical name is "actions" (plural), while most upstream LeRobot datasets use "action" (singular). The script falls back to "action" for datasets not registered in DATA_FEATURES_NAME_MAPPING — so any dataset that uses the OpenTau canonical name AND isn't registered will silently fail at the key not in meta.stats check. Suggest trying both keys (e.g. for k in ("actions", "action"): if k in meta.stats: return k) before declaring failure.

4. _ = warnings # silence unused import (line ~1052) is wrong (nit)

warnings is actively used at lines 322-323 (warnings.catch_warnings() / warnings.simplefilter) in _aggregate_stats_manual. The line is leftover from an earlier draft — just delete it.

5. _build_mixture_parallel silently drops the val mixture (low)

make_dataset_mixture returns (train, val) when val_datasets exist; the parallel version validates val_datasets but never wraps them in a WeightedDatasetMixture and never returns them. The comment correctly notes val_split_ratio is 0 in this script's cfg path, so it's effectively dead — but the symmetry break is a footgun if anyone factors this out for a smoke-test of another pipeline. Either drop the val validation entirely (since the script doesn't need val) or finish wiring it through.

6. In-script re-implementation of UniversalActionProcessor.fit will silently rot (low)

The script copies the upstream fit body verbatim plus the max_token_length knob. If upstream physical-intelligence/fast/processing_action_tokenizer.py adds preprocessing or changes scale defaults, this script silently diverges from production. Two options:

Pin a comment with the upstream commit SHA used as a reference and a "re-port here when upstream changes" note.
Or: file an upstream PR exposing max_token_length as a fit(..., max_token_length=...) kwarg, so the re-implementation can be deleted.

7. --action-dim defaults to 32 — same footgun as the (correctly) required --chunk-size (low)

pi0.5/0.6/0.7 currently happen to use max_action_dim=32, but if a config overrides it, the script will silently fit a corpus with a mismatched action dim. Either make --action-dim required, or assert it against the mixture's stats shape after stats aggregation.

8. mixture_cfg mutation in _build_train_cfg (nit)

The function mutates the parsed DatasetMixtureConfig in place (n_obs_history = None, several drop_prob fields). Fine for the single-shot script, but if someone calls the function twice in a notebook, the second call inherits the first call's mutations. A dataclasses.replace(...) would make it explicit.

9. Magic number min_vocab_size + 100 (nit, line ~803)

Could be e.g. vocab_size // 20 or a named constant — minor.

Correctness / security / perf

No security issues; trust_remote_code=True is preserved verbatim from existing call sites (which themselves predate this PR), and the script's only network surface is huggingface_hub.snapshot_download of one well-known file (UPSTREAM_SOURCE_FILE = "processing_action_tokenizer.py").
The chr(i) alphabet trick works for any plausible (min_token, max_token) range given scale=10 — magnitudes stay well under sys.maxunicode.
The parallel mixture build is a legitimate optimization (LeRobotDataset init is I/O-bound, GIL is released by pyarrow/HF cache). Threading is the right primitive.

Tests

The new CPU test is exactly the right shape — patches at the modeling module's import site, captures path + kwargs, asserts both default and override. No network, no fixtures.
No tests for fit_fast_tokenizer.py itself — given it's an offline one-shot tool that's been empirically validated on a 1M-chunk run, that's probably fine, but a tiny smoke test on a fake DatasetMixtureConfig with two synthetic parquets would catch regression to the _resolve_native_action_key or normalization paths cheaply.

Verdict

Approve in principle once #1 (normalization mismatch — clarify or fix), #2 (pi0.7-LL default — explicit decision on credential cost), and #4 (delete stale _ = warnings) are addressed. The rest are nits / future cleanup.

Generated by Claude Code

- Revert ``PI07LowLevelConfig.discrete_action_tokenizer_path`` default to ``physical-intelligence/fast`` so all seven policies share the same default (avoids a private-repo credential dependency on first plain instantiation). Specialized fits like ``TensorAuto/fast-pi07-pretrain`` are opted into via ``--policy.discrete_action_tokenizer_path=<...>``. - Update the CPU mock test to pin the upstream default and exercise the ``TensorAuto/fast-pi07-pretrain`` value via the override path. Drop the now-redundant explicit override in ``test_pi07_low_level.py`` GPU tests. - Make ``_normalize_chunks`` byte-match production ``Normalize({"ACTION": MIN_MAX})``: use ``EPS = 1e-8`` (the production constant), drop the post-normalization ``np.clip``, keep the ``nan_to_num`` (mirrors ``prepare_discrete_actions`` in pi0.7-LL). - ``_resolve_native_action_key`` now optionally accepts the dataset's actual stats / column keys and prefers whichever of ``"action"`` / ``"actions"`` is present, eliminating the silent-failure mode for datasets that aren't registered in ``DATA_FEATURES_NAME_MAPPING`` and use the OpenTau-canonical plural form. - ``_aggregate_stats_manual`` now logs a loud warning when any dataset's native action dim exceeds ``--action-dim``: those dims are silently dropped, which is only correct if the production policy's ``max_action_dim`` matches. - ``_build_train_cfg`` uses ``dataclasses.replace`` instead of mutating the caller's ``DatasetMixtureConfig`` in place. - ``_build_mixture_parallel`` no longer carries the dead val-dataset branch: the script forces ``val_split_ratio=0``, so we assert that and raise loudly if ``make_dataset`` ever returns a tuple. - Pin the upstream commit SHA used to port ``_fit_tokenizer``'s body (``UPSTREAM_PINNED_SHA``), so a future drift between our re-impl and upstream is easy to detect. - Name the ``+ 100`` merge-headroom magic as ``_MIN_MERGE_HEADROOM``. - Drop the leftover ``_ = warnings`` shim (``warnings`` is actually used in ``_aggregate_stats_manual``).

shuheng-liu · 2026-05-12T05:38:35Z

Thanks for the review — all addressed in 62c511d. Inline replies:

#1 normalization mismatch (medium). Fixed — _normalize_chunks now byte-matches opentau.policies.normalize: (x - min) / (max - min + EPS) * 2 - 1 with EPS = 1e-8, no clip. Kept the nan_to_num because that's what prepare_discrete_actions applies after Normalize in pi0.7-LL, so the BPE training distribution lines up with what the policy sees.

#2 pi0.7-LL default flip (medium). Reverted — all seven policies now default to physical-intelligence/fast, restoring parity. Users opt into specialized fits via --policy.discrete_action_tokenizer_path=TensorAuto/fast-pi07-pretrain (works on any of the seven). I also dropped the now-redundant explicit override in the GPU test.

#3 _resolve_native_action_key fallback (low-medium). Fixed — the resolver now optionally takes the dataset's actual stats / column keys and prefers whichever of "action" / "actions" is present. Both call sites pass it (set(meta.stats) for the stats worker, pq.read_schema(...).names for the parquet-sampling worker via a one-time peek at the first episode).

#4 _ = warnings (nit). Deleted. warnings is genuinely used in _aggregate_stats_manual, which is why the leftover was incorrect.

#5 val-mixture symmetry break (low). Resolved by going the other direction: dropped the val-validation entirely, asserted val_split_ratio == 0, and raise loudly if make_dataset ever returns a tuple (would indicate the train_cfg got mutated unexpectedly).

#6 in-script re-implementation rot (low). Added UPSTREAM_PINNED_SHA = "ec4d7aa71691cac0b8bed6942be45684db2110f4" at module level + a comment in _fit_tokenizer pointing at it, so future drift between our re-impl and upstream is greppable. Filing an upstream PR exposing max_token_length is a good idea but out of scope for this PR.

#7 silent --action-dim mismatch (low). _aggregate_stats_manual now logs a loud WARNING listing how many datasets and the max native dim when native_action_dim > --action-dim. I left it as a warning (not a hard assert) since silently dropping extra dims IS the intended behaviour when max_action_dim matches production — but the warning surfaces the contract.

#8 mixture mutation (nit). Now uses dataclasses.replace(mixture_cfg, n_obs_history=None, ...) — the caller's config is untouched.

#9 magic + 100 (nit). Named _MIN_MERGE_HEADROOM.

Verification: CPU test 4/4, GPU re-run dispatched (will update the PR with results when it lands).

shuheng-liu · 2026-05-12T05:41:27Z

GPU re-run on 62c511d is green: 15 passed, 10 skipped, 1 warning, 2:40 wall-clock — same shape as the run on b70bedf. PR body updated.

shuheng-liu · 2026-05-12T05:41:32Z

Follow-up on `62c511d`

Verified the fix commit against the original review — seven items fully addressed, one nit remaining:

Remaining: `_normalize_chunks` docstring claim about `nan_to_num`

The updated docstring says:

The post-Normalize torch.nan_to_num in pi0.7's prepare_discrete_actions is applied here too, so the BPE sees the same numerical distribution training will see.

But prepare_discrete_actions at src/opentau/policies/pi07/low_level/modeling_pi07_low_level.py:835-853 is just:

device = batch["discrete_actions"].device
discrete_actions = batch["discrete_actions"].to(device="cpu", dtype=torch.float32)
tokens = self.discrete_action_processor.__call__(discrete_actions)

No nan_to_num. I also grepped src/opentau/policies/ for nan_to_num / isnan / fillna — zero hits anywhere. So the script still sanitizes NaN while production propagates it.

In practice this is moot — action streams rarely carry NaN, and the script's defensive nan_to_num doesn't hurt unless production legitimately produces NaN tokens. But the docstring justification is wrong as written. Two easy fixes:

Drop the nan_to_num line — the script then byte-matches production for non-NaN inputs and propagates NaN (which would crash DCT) for NaN inputs, matching training behaviour.
Or: keep the line, rewrite the comment as something like "defensive — production would propagate NaN and crash DCT; we sanitize so the fit doesn't trip on a single bad chunk."

Option 2 is what the code actually does today; option 1 is what the docstring claims. Either is fine — just want them to agree.

Minor: `--action-dim` warning is manual-path only

The new "native action_dim > --action-dim" warning lives in _aggregate_stats_manual (fit_fast_tokenizer.py around the new logger.warning). The _drain_mixture path (slow --use-mixture-dataloader route) doesn't get the same check. Likely fine — the mixture's standardization pipeline pads to max_action_dim internally — but worth a one-line note in the dataloader branch for symmetry.

Everything else

All other items (default flip back to upstream, action-key fallback with available_keys, stale _ = warnings, val-mixture assert, pinned upstream SHA, dataclasses.replace, named _MIN_MERGE_HEADROOM) look good. Ready to mark as approved once the docstring vs. code disagreement is reconciled.

Generated by Claude Code

Address the follow-up review on PR #290: - `_normalize_chunks` docstring previously claimed pi0.7's `prepare_discrete_actions` applies `torch.nan_to_num` after `Normalize`. It does not -- neither `Normalize` nor `prepare_discrete_actions` in main has any NaN handling. Rewrite the comment to honestly say the `nan_to_num` is defensive (so a single corrupted chunk doesn't tank an hour-long fit). The script's math is still byte-identical to training for non-NaN inputs. - `_drain_mixture` (the slow `--use-mixture-dataloader` path) doesn't surface the same "native_action_dim > --action-dim" warning the manual path does, because the mixture's standardization pipeline already raises on truncation via `pad_vector`. Add a one-line note explaining why no warning is needed on that path.

shuheng-liu · 2026-05-12T05:43:39Z

Fixed in d804d21:

_normalize_chunks docstring — rewritten to be honest: the nan_to_num is defensive (so a single corrupted chunk doesn't tank an hour-long fit), production neither has it in Normalize nor in prepare_discrete_actions. Math is still byte-identical to training for non-NaN inputs. (Chose option 2 from your review — keep the safety belt, just don't pretend it mirrors production.)
_drain_mixture symmetry note — added a one-line comment explaining why the mixture-dataloader path doesn't need the action-dim warning: pad_vector in the standardization pipeline raises on truncation, so an over-dim mismatch fails loudly at mixture build time instead of silently dropping dims.

shuheng-liu · 2026-05-12T06:06:04Z

All review items resolved ✅

Confirming after d804d21 that every item from the initial review (and the follow-up) is addressed. Recap:

#	Issue	Status	Where
1	`_normalize_chunks` vs production `Normalize({MIN_MAX})` mismatch	✅ Resolved	`np.clip` dropped, `_NORMALIZE_EPS=1e-8` matches production; docstring now honestly documents the `nan_to_num` as defensive (`d804d21`)
2	pi0.7-LL default flip created hidden private-repo credential dependency	✅ Resolved	Default reverted to `physical-intelligence/fast`; CPU test exercises `TensorAuto/fast-pi07-pretrain` via override; GPU test pin removed (`62c511d`)
3	`_resolve_native_action_key` silently fell back to `"action"`	✅ Resolved	Now accepts `available_keys`, prefers whichever of `action`/`actions` is present; plumbed through `_load_dataset_stats` (meta.stats keys) and `_sample_chunks_for_dataset_manual` (parquet schema) (`62c511d`)
4	Stale `_ = warnings # silence unused import`	✅ Resolved	Line removed (`62c511d`)
5	`_build_mixture_parallel` silently dropped val mixture	✅ Resolved	Asserts `val_split_ratio == 0`, raises clearly if `make_dataset` ever returns a tuple (`62c511d`)
6	In-script `UniversalActionProcessor.fit` re-impl would silently rot	✅ Resolved	`UPSTREAM_PINNED_SHA = "ec4d7aa71691cac0b8bed6942be45684db2110f4"` with re-port instructions (`62c511d`)
7	`--action-dim` defaults to 32 footgun	✅ Resolved	Manual path: loud warning when any dataset's native dim exceeds the cap. Mixture path: no warning needed — `pad_vector` (`src/opentau/datasets/dataset_mixture.py:78`) raises on truncation via numpy broadcast mismatch, so over-dim surfaces as a build-time exception. Documented in `_drain_mixture` docstring (`62c511d`, `d804d21`)
8	`mixture_cfg` in-place mutation in `_build_train_cfg`	✅ Resolved	Uses `dataclasses.replace` on a local `mixture_cfg_for_fit` (`62c511d`)
9	Magic `+ 100` for merge headroom	✅ Resolved	Named `_MIN_MERGE_HEADROOM` at module scope with explanatory comment (`62c511d`)

No outstanding items from my review. Ready for human reviewer sign-off.

Generated by Claude Code

claude

Independent review of 58c905a. Two correctness issues worth fixing before merge — both regressions introduced by the prior fix round, not in the original draft.

blocking — src/opentau/scripts/fit_fast_tokenizer.py:598-607,648-654 — _build_train_cfg uses dataclasses.replace(mixture_cfg, ...) but never overrides val_split_ratio. DatasetMixtureConfig.val_split_ratio defaults to 0.05 (src/opentau/configs/default.py:277). The comment at line 648 ("_build_train_cfg forces val_split_ratio=0 (the default)") and the assert at line 652 are both wrong: any user passing a default mixture JSON via --use-mixture-dataloader will trip the assert and abort before sampling. Fix: add val_split_ratio=0.0 to the dataclasses.replace kwargs.
suggestion — src/opentau/scripts/fit_fast_tokenizer.py:96,927-935 — UPSTREAM_PINNED_SHA = "ec4d..." is documented as the SHA the script tracks, but _find_upstream_source calls snapshot_download(repo_id=..., allow_patterns=...) with no revision= arg, so each invocation downloads the current HEAD of physical-intelligence/fast/processing_action_tokenizer.py. The file copied into out_dir (which AutoProcessor.from_pretrained(..., trust_remote_code=True) later executes) can drift silently from the ported _fit_tokenizer body. Pass revision=UPSTREAM_PINNED_SHA to snapshot_download to make the constant load-bearing.
suggestion — src/opentau/scripts/fit_fast_tokenizer.py:128-130 — --chunk-size help text claims "pi0.5 / pi0.6 default to 10; pi0.7 low-level defaults to 50". Every policy config in this repo (pi05, pi05_mem, pi06, pi07/low_level, pi07_paligemma/low_level_planner) has n_action_steps: int = 50. The "10" reading comes from configs/examples/pi07_libero.json which overrides — opposite of what the help text says. This actively misleads users into setting --chunk-size 10 for pi0.5/0.6 production.
suggestion — tests/policies/test_pi07_cpu.py:1263-1317 — Test docstring at line 1279 says "All seven policies default to the upstream tokenizer", but only PI07LowLevelConfig is exercised. A typo in any of the other six config defaults (PI05Config, PI05MEMConfig, PI06Config, PI07HighLevelPlannerConfig, PI07PaligemmaHighLevelConfig, PI07PaligemmaLowLevelConfig) — or a missed plumbing change in their modeling files — would slip through the gate the PR builds for itself. A @pytest.mark.parametrize over the seven config classes asserting cfg.discrete_action_tokenizer_path == "physical-intelligence/fast" is ~10 lines and closes the gap.
nit — src/opentau/scripts/fit_fast_tokenizer.py:737-753 — _normalize_chunks casts chunks.astype(np.float64) then back to float32 at return. Production Normalize runs entirely in float32 (src/opentau/policies/normalize.py, create_stats_buffers). The self-review's "byte-identical to training" claim isn't quite right; the values differ in low float32 bits. Functionally moot; just don't claim bit-equality.
nit — src/opentau/scripts/fit_fast_tokenizer.py:540-548,666-680 — Both threaded paths consume as_completed and write into completion-order lists/extends, so chunk order in the BPE training corpus is nondeterministic across runs even with --seed. BPE tie-breaking on equal pair frequencies depends on first-encounter order, so two seeded runs may not be bit-identical (cf. CLAUDE.md rule 3). Sorting completed results back into mixture-config order at the join point would restore determinism; otherwise document the caveat in the script header.

Cross-checks that hold up: all seven policies default to physical-intelligence/fast; _resolve_native_action_key handles both "action" and "actions" via the new available_keys arg; _ = warnings line is gone; _MIN_MERGE_HEADROOM is properly named; the seven AutoProcessor.from_pretrained call sites all thread config.discrete_action_tokenizer_path correctly with no straggler literal strings. The seven-policy refactor itself is clean and mechanical.

Generated by Claude Code

claude · 2026-05-12T17:41:16Z

[claude-review] summary for commit 58c905a

blocking — src/opentau/scripts/fit_fast_tokenizer.py:598-654 — _build_train_cfg's dataclasses.replace doesn't override val_split_ratio; default is 0.05 (configs/default.py:277), so the assert val_split_ratio == 0.0 at line 652 fires on any default mixture JSON, breaking --use-mixture-dataloader.
suggestion — src/opentau/scripts/fit_fast_tokenizer.py:96,927-935 — UPSTREAM_PINNED_SHA is not passed to snapshot_download; the copied processing_action_tokenizer.py tracks upstream HEAD and can drift silently from the ported _fit_tokenizer body.
suggestion — src/opentau/scripts/fit_fast_tokenizer.py:128-130 — --chunk-size help text says "pi0.5 / pi0.6 default to 10" but every policy actually defaults to n_action_steps = 50; help text inverts the truth.
suggestion — tests/policies/test_pi07_cpu.py:1263-1317 — test docstring claims all seven policies default to upstream, but only PI07LowLevelConfig is asserted; parametrize over the other six to close the gap.
nit — src/opentau/scripts/fit_fast_tokenizer.py:737-753 — _normalize_chunks casts to float64 then back to float32; production Normalize is float32 throughout, so "byte-identical to training" is overstated.
nit — src/opentau/scripts/fit_fast_tokenizer.py:540-548,666-680 — as_completed consumption order makes BPE corpus order nondeterministic even with --seed; either sort by mixture index at the join or document.

shuheng-liu · 2026-05-12T17:51:21Z

@claude fix per review

- Blocking: force val_split_ratio=0.0 in _build_train_cfg's replace so the default 0.05 doesn't trip the parallel-mixture-build assertion. - Pin _find_upstream_source snapshot to UPSTREAM_PINNED_SHA so the copied remote-code file matches the ported _fit_tokenizer body. - Fix --chunk-size help text: every policy config defaults n_action_steps to 50 (only configs/examples/pi07_libero.json overrides to 10), so the prior "pi0.5/0.6 default to 10" guidance was inverted. - Parametrize the seven-policy default test so a typo in any of the other six config defaults is caught at CPU-test time. - Soften _normalize_chunks docstring: float64 -> float32 round-trip isn't bit-identical to production's float32 path; agreement is ~1e-7. - Restore deterministic BPE corpus order in _sample_via_manual by collecting into a per-dataset positional list before concatenating, so as_completed completion order can't perturb BPE tie-breaks.

feat(scripts): add fit_fast_tokenizer.py for FAST tokenizer specializ…

3af7076

…ation

shuheng-liu added the feature New feature or request label May 11, 2026

shuheng-liu self-assigned this May 11, 2026

shuheng-liu added 3 commits May 11, 2026 22:03

shuheng-liu changed the title ~~feat(scripts): add fit_fast_tokenizer.py for FAST tokenizer specialization~~ feat(scripts+policies): add fit_fast_tokenizer.py and make FAST tokenizer path configurable May 12, 2026

shuheng-liu requested review from WilliamYue37 and akshay18iitg and removed request for WilliamYue37 May 12, 2026 05:20

shuheng-liu requested a review from WilliamYue37 May 12, 2026 05:39

shuheng-liu marked this pull request as ready for review May 12, 2026 16:29

shuheng-liu changed the title ~~feat(scripts+policies): add fit_fast_tokenizer.py and make FAST tokenizer path configurable~~ feat(policies): support training and selecting FAST tokenizer May 12, 2026

ci: make Claude PR Review non-blocking when API limit hits (#291)

58c905a

claude Bot reviewed May 12, 2026

View reviewed changes

WilliamYue37 previously approved these changes May 12, 2026

View reviewed changes

shuheng-liu added 2 commits May 12, 2026 11:06

ci: move Claude PR Review continue-on-error to step level (#293)

f041bae

shuheng-liu dismissed WilliamYue37’s stale review via 827a651 May 12, 2026 18:12

akshay18iitg approved these changes May 12, 2026

View reviewed changes

shuheng-liu merged commit 5c553e6 into main May 12, 2026
7 checks passed

shuheng-liu deleted the claude/crazy-chebyshev-0ebfcc branch May 12, 2026 18:41

shuheng-liu mentioned this pull request May 27, 2026

fix(configs+tests): unblock dev_config smoke and merge-collision test failures #339

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(policies): support training and selecting FAST tokenizer#290

feat(policies): support training and selecting FAST tokenizer#290
shuheng-liu merged 9 commits into
mainfrom
claude/crazy-chebyshev-0ebfcc

shuheng-liu commented May 11, 2026 •

edited

Loading

Uh oh!

shuheng-liu commented May 12, 2026

Uh oh!

shuheng-liu commented May 12, 2026

Uh oh!

shuheng-liu commented May 12, 2026

Uh oh!

shuheng-liu commented May 12, 2026

Uh oh!

shuheng-liu commented May 12, 2026

Uh oh!

shuheng-liu commented May 12, 2026

Uh oh!

claude Bot left a comment

Uh oh!

claude Bot commented May 12, 2026

Uh oh!

shuheng-liu commented May 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

shuheng-liu commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this does

1. opentau.scripts.fit_fast_tokenizer (new script)

Why --max-token-length is needed (the BPE footgun)

Why --chunk-size is required (no default)

2. Make physical-intelligence/fast configurable across all policies

How it was tested

How to checkout & try? (for the reviewer)

Checklist

Uh oh!

shuheng-liu commented May 12, 2026

Review

Overview

What's strong

Issues / questions

Correctness / security / perf

Tests

Verdict

Uh oh!

shuheng-liu commented May 12, 2026

Uh oh!

shuheng-liu commented May 12, 2026

Uh oh!

shuheng-liu commented May 12, 2026

Follow-up on 62c511d

Remaining: _normalize_chunks docstring claim about nan_to_num

Minor: --action-dim warning is manual-path only

Everything else

Uh oh!

shuheng-liu commented May 12, 2026

Uh oh!

shuheng-liu commented May 12, 2026

All review items resolved ✅

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Uh oh!

claude Bot commented May 12, 2026

Uh oh!

shuheng-liu commented May 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

shuheng-liu commented May 11, 2026 •

edited

Loading

1. `opentau.scripts.fit_fast_tokenizer` (new script)

Why `--max-token-length` is needed (the BPE footgun)

Why `--chunk-size` is required (no default)

2. Make `physical-intelligence/fast` configurable across all policies

Follow-up on `62c511d`

Remaining: `_normalize_chunks` docstring claim about `nan_to_num`

Minor: `--action-dim` warning is manual-path only