Skip to content

feat(policies): support training and selecting FAST tokenizer#290

Merged
shuheng-liu merged 9 commits into
mainfrom
claude/crazy-chebyshev-0ebfcc
May 12, 2026
Merged

feat(policies): support training and selecting FAST tokenizer#290
shuheng-liu merged 9 commits into
mainfrom
claude/crazy-chebyshev-0ebfcc

Conversation

@shuheng-liu
Copy link
Copy Markdown
Member

@shuheng-liu shuheng-liu commented May 11, 2026

What this does

Two related changes shipped together:

1. opentau.scripts.fit_fast_tokenizer (new script)

CPU-only one-shot fit of a physical-intelligence/fast-compatible action tokenizer on an OpenTau DatasetMixtureConfig. Output is a directory loadable via AutoProcessor.from_pretrained(out_dir, trust_remote_code=True).

The FAST tokenizer is used during pi0.5 / pi0.7 training as an auxiliary cross-entropy target on top of the flow-matching MSE loss — it tokenizes action chunks via DCT + BPE. The published upstream checkpoint was fit on a generic robotics mixture; this script lets us specialize the BPE to our own action distribution.

Pipeline (all CPU):

  1. Resolve $ref includes in the mixture JSON and parse via draccus.
  2. Aggregate per-dim action min / max across the mixture (NaN-tolerant nanmin / nanmax).
  3. Allocate per-dataset chunk budgets weight-proportional to mixture.weights.
  4. Sample chunks two ways (selectable via --use-mixture-dataloader):
    • Default (fast): read action+timestamp columns directly from each parquet, resample to mixture.action_freq via scipy.interpolate.interp1d. ~45k chunks/s.
    • Opt-in (safer): go through the full WeightedDatasetMixture dataloader. ~25 chunks/s but uses the exact training pipeline; included for cross-checking.
  5. Min-max-normalize chunks to [-1, 1] — mirrors Normalize({"ACTION": NormalizationMode.MIN_MAX}) from pi0.7's prepare_discrete_actions. Right-pad action_dim.
  6. Locally re-implement UniversalActionProcessor.fit(...) so we can pass max_token_length to the BpeTrainer (the upstream hard-codes 10000). Default cap is 64.
  7. save_pretrained + copy processing_action_tokenizer.py. Round-trip a held-out sample and write fit_report.json.

Why --max-token-length is needed (the BPE footgun)

Without a cap, the BPE training on heavily zero-padded action data wastes its entire merge budget on one runaway zero-run token. On the pretrain-pi07-10% mixture this produced 1855 merges of lengths 2, 3, 4, …, 1320 — one of each, all zero-pad — with mean token length 478 bytes per chunk (vs upstream's 78). Capping at chunk_size forces the BPE to spend merges on real patterns; mean token length drops 42× to 68 tokens / chunk on this mixture, comfortably under pi0.7's discrete_action_max_length=150.

Why --chunk-size is required (no default)

The BPE merges depend on chunk length. pi0.5 / pi0.6 default to n_action_steps=10, pi0.7 low-level defaults to n_action_steps=50 (configuration_pi07_low_level.py:122-123). The example configs (pi07_libero.json) override to 10, so naive pattern-matching gives the wrong value silently. Better to force the user to pick.

2. Make physical-intelligence/fast configurable across all policies

Replace the hard-coded AutoProcessor.from_pretrained("physical-intelligence/fast", ...) call in every policy with a new discrete_action_tokenizer_path config field. A single --policy.discrete_action_tokenizer_path CLI override now plugs in a specialized FAST tokenizer (e.g. one fit via the script above) without policy code changes. The value flows through to the auxiliary CE target at training; inference paths are unaffected (flow-matching heads only).

Policies touched: pi0.5, pi0.5-mem, pi0.6, pi0.7 high-level + low-level, pi0.7-paligemma high-level + low-level.

Defaults:

Policy discrete_action_tokenizer_path default
PI07LowLevelConfig TensorAuto/fast-pi07-pretrain (specialized fit produced by this PR's script — vocab=2048, max_token_length=64, mean 68 tokens/chunk on our pretrain mixture)
PI05Config / PI05MemConfig / PI06Config physical-intelligence/fast (upstream)
PI07HighLevelPlannerConfig physical-intelligence/fast
pi0.7-paligemma high-level + low-level physical-intelligence/fast

The pi0.7-LL default flip is the only behaviour change; all other policies keep their existing tokenizer.

How it was tested

  • ruff check, ruff format, pre-commit run — all pass on every file touched (typos, pyupgrade, bandit, gitleaks, ruff, ruff-format).
  • New CPU test tests/policies/test_pi07_cpu.py::TestPi07ConfigPlumbing::test_discrete_action_tokenizer_path_default_and_override pins both the default value AND that PI07LowLevelPolicy.__init__ plumbs config.discrete_action_tokenizer_path into AutoProcessor.from_pretrained (mocked, no network).
  • GPU tests in tests/policies/test_pi07_low_level.py now explicitly pass discrete_action_tokenizer_path="physical-intelligence/fast" so CI does not depend on private-repo credentials.
  • Smoke run of fit_fast_tokenizer.py on a 392-dataset internal pretrain mixture: stats aggregation + 1.5k-chunk sampling (manual path) in 25 s; 0 errors.
  • Pilot fits at chunk_size 10 and 50, both round-tripped via AutoProcessor.from_pretrained(out_dir, trust_remote_code=True) from a fresh shell.
  • Full fit at chunk_size=50, action_dim=32, vocab=2048, max_token_length=64: 1M chunks, ~4 min total wall-clock (22 s sampling + 217 s BPE), round-trip MSE 3.1e-5, mean 68 tokens/chunk (P99 93), all chunks under 150-token policy cap.
  • Inspected vocab.json before/after the max_token_length cap to confirm the runaway-zero failure mode and that the cap fixes it.
  • Final grep: zero hardcoded physical-intelligence/fast references remain in any modeling code; all instances are now config defaults.

How to checkout & try? (for the reviewer)

git fetch origin claude/crazy-chebyshev-0ebfcc
git checkout claude/crazy-chebyshev-0ebfcc

# Fit a tokenizer on a mixture (pi0.7 low-level production defaults):
python -m opentau.scripts.fit_fast_tokenizer \
    --mixture-json path/to/your/mixture.json \
    --out-dir path/to/output_dir/ \
    --chunk-size 50

# Pilot (writes to out_dir/pilot/, caps total-chunks at 50_000):
python -m opentau.scripts.fit_fast_tokenizer \
    --mixture-json path/to/your/mixture.json \
    --out-dir path/to/output_dir/ \
    --chunk-size 50 \
    --pilot

# Plug a specialized tokenizer into pi0.7 low-level training (the new
# config knob; works on any of the seven affected policies):
opentau-train \
    --config_path configs/examples/pi07_libero.json \
    --policy.discrete_action_tokenizer_path=TensorAuto/fast-pi07-pretrain

# CPU test for the config-plumbing change:
pytest tests/policies/test_pi07_cpu.py::TestPi07ConfigPlumbing::test_discrete_action_tokenizer_path_default_and_override -xvs

Checklist

  • I have added Google-style docstrings to important functions and ensured function parameters are typed.
  • My PR includes policy-related changes.
    • If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

GPU pytest ran twice on an internal RTX 5090 dev box: HEAD b70bedf -> 15 passed, 10 skipped, 1 warning, 2:37. Re-run after the review fixes at HEAD 62c511d -> 15 passed, 10 skipped, 1 warning, 2:40. The 10 skipped are the pre-existing memory-bound @pytest.mark.skip tests (>24 GB VRAM), unchanged from main. Regression tests (the g6.12xlarge nightly suite) were not dispatched.

@shuheng-liu shuheng-liu added the feature New feature or request label May 11, 2026
@shuheng-liu shuheng-liu self-assigned this May 11, 2026
…ength cap

- Make --chunk-size required (no default): pi0.5/pi0.6 use 10, pi0.7
  low-level uses 50; picking the wrong value silently produces a
  tokenizer with the wrong BPE merges.
- Default sampler is now a manual per-dataset path: weight-proportional
  budget allocation, raw-parquet reads, scipy.interp1d for action_freq
  resampling. Both paths respect the mixture weights and global FPS,
  but the manual path runs at ~45k chunks/s vs the mixture-dataloader
  path's ~25 chunks/s. The WeightedDatasetMixture path is preserved
  behind --use-mixture-dataloader (with thread-parallel per-dataset
  construction to keep the build phase under 10 min).
- Add --max-token-length, defaulting to 64. Without a cap, BPE training
  on heavily zero-padded action data wastes its entire merge budget on
  one runaway zero-run token (verified: 1855 merges, one each at
  lengths 2, 3, ..., 1320; mean token length on test data 478 bytes vs
  upstream's 78). Capping at chunk_size keeps merges focused on
  meaningful patterns -- on the pretrain-pi07 mixture this gives mean
  68 tokens / chunk (P99 = 93) at chunk_size=50, comfortably under
  pi0.7's discrete_action_max_length=150.
- Bump --max-state-dim default to 128 to cover the largest native
  state dim in the pretrain-pi07-10% mixture (118-dim).
- Replace UniversalActionProcessor.fit() call with a local re-impl so
  we can override max_token_length on the BpeTrainer (the upstream
  hard-codes 10000).
The pi0.7 low-level policy hard-coded ``physical-intelligence/fast`` when
instantiating its ``discrete_action_processor``. Add a new
``discrete_action_tokenizer_path`` config field (default unchanged) so a
specialized FAST tokenizer fit via ``opentau.scripts.fit_fast_tokenizer``
can be plugged in via a single ``--policy.discrete_action_tokenizer_path``
CLI override. The value flows through to the auxiliary CE target at
training; inference is unaffected (flow-matching head only).

Note: pi0.5 / pi0.6 / pi0.7_paligemma low-level still hard-code the
upstream path -- mirroring this change there is left for follow-ups.
Replace the hard-coded ``AutoProcessor.from_pretrained("physical-intelligence/fast", ...)``
call in every policy with a new ``discrete_action_tokenizer_path`` config field.
A single ``--policy.discrete_action_tokenizer_path`` CLI override now plugs in
a specialized FAST tokenizer (e.g. one fit via
``opentau.scripts.fit_fast_tokenizer``) without policy code changes. The value
flows through to the auxiliary CE target at training; inference paths are
unaffected (flow-matching heads only).

Policies touched: pi0.5, pi0.5-mem, pi0.6, pi0.7 high-level + low-level,
pi0.7-paligemma high-level + low-level.

Defaults:
- ``PI07LowLevelConfig.discrete_action_tokenizer_path`` defaults to
  ``TensorAuto/fast-pi07-pretrain`` -- the pi0.7 pretrain mixture's
  specialized fit (vocab=2048, max_token_length=64, mean 68 tokens/chunk
  vs upstream's ~146 on the same data).
- All other policies keep the existing default of
  ``physical-intelligence/fast`` so no behaviour changes for pi0.5/0.6/0.7-HL
  callers.

Test pins: GPU tests in ``tests/policies/test_pi07_low_level.py`` now
explicitly pass ``discrete_action_tokenizer_path="physical-intelligence/fast"``
so CI doesn't depend on private-repo credentials.
@shuheng-liu shuheng-liu changed the title feat(scripts): add fit_fast_tokenizer.py for FAST tokenizer specialization feat(scripts+policies): add fit_fast_tokenizer.py and make FAST tokenizer path configurable May 12, 2026
@shuheng-liu shuheng-liu requested review from WilliamYue37 and akshay18iitg and removed request for WilliamYue37 May 12, 2026 05:20
Copy link
Copy Markdown
Member Author

Review

Overview

Two changes shipped together:

  1. New CPU-only script src/opentau/scripts/fit_fast_tokenizer.py (~1k lines) — fits a physical-intelligence/fast-compatible BPE action tokenizer on an OpenTau DatasetMixtureConfig, with two sampling paths (fast manual parquet + scipy interp, vs. slower full WeightedDatasetMixture dataloader) and a configurable max_token_length to cap runaway zero-pad merges.
  2. Plumbing for discrete_action_tokenizer_path across pi0.5, pi0.5-mem, pi0.6, pi0.7 HL+LL, and pi0.7-paligemma HL+LL — replaces the hard-coded AutoProcessor.from_pretrained("physical-intelligence/fast", ...) call at one site per policy with a config field. Only pi0.7-LL's default flips (to TensorAuto/fast-pi07-pretrain); the other six remain on upstream.

What's strong

  • Surgical refactor on the seven policies. Each is a one-line modeling change + one-field config addition. The diff is mechanical, the comment block in each config is identical, and there's a CPU mock test that pins the default and verifies the value is plumbed all the way through to AutoProcessor.from_pretrained (test_pi07_cpu.py::test_discrete_action_tokenizer_path_default_and_override). Good test boundary — patches at the import site, captures the path arg, never hits the network.
  • GPU test pinned to upstream to avoid CI credential dependency — correct call.
  • Documentation in the fit script is genuinely useful. The max_token_length finding (BPE wasting its entire merge budget on one runaway zero-run token, 1855 merges of lengths 2…1320) is the kind of empirical debug observation that's worth preserving in-place; required --chunk-size (no default) is well-justified given that pi0.5/0.6 and pi0.7-LL disagree on n_action_steps.
  • Empirical validation looks solid. 1M chunks in ~4 min, round-trip MSE 3.1e-5, P99 93 tokens (under the 150-token policy cap).

Issues / questions

1. Distribution mismatch between fit-time and training-time normalization (medium)

The script's _normalize_chunks (line ~670) does:

x = np.nan_to_num(chunks.astype(np.float64), nan=0.0, posinf=0.0, neginf=0.0)
norm = 2.0 * (x - action_min) / safe_span - 1.0
...
return np.clip(norm, -1.0, 1.0).astype(np.float32)

But production Normalize({"ACTION": NormalizationMode.MIN_MAX}) at src/opentau/policies/normalize.py:253-260 does:

batch[key] = (batch[key] - min) / (max - min + EPS)
batch[key] = batch[key] * 2 - 1

— no clip, no nan_to_num. The script's docstring claims to "mirror" the training-time normalization. For chunks within the recorded min/max this is moot, but tails (and any NaN inputs) will see slightly different values at fit vs train, which shifts the BPE corpus distribution. Either drop the clip + nan_to_num to truly mirror, or document why they were added (defensive but distribution-changing).

2. The pi0.7-LL default flip creates a hidden credential dependency (medium)

After this PR, PI07LowLevelConfig() with no overrides resolves TensorAuto/fast-pi07-pretrain, which is a private repo. Anyone without TensorAuto org access will hit an auth wall on plain instantiation — including external users running configs/examples/pi07_*.json for the first time. The other six policies kept upstream as default and rely on the new CLI override to opt into specialized fits; the inconsistency is the only awkward thing about this PR.

Two follow-ups worth considering:

  • Keep physical-intelligence/fast as PI07LowLevelConfig's default for parity with the six others, and ship the specialized fit via an example config (e.g. configs/examples/pi07_pretrain.json) that sets --policy.discrete_action_tokenizer_path=TensorAuto/fast-pi07-pretrain.
  • If existing on-hub pi0.7-LL checkpoints were trained against the upstream tokenizer, fresh policy construction after this PR will mismatch on discrete_action_vocab_size (specialized 2048 vs upstream 1024). The resume path is fine — saved configs win — but the from_pretrained path for an old checkpoint plus default-config instantiation may not be.

3. _resolve_native_action_key fallback to "action" (singular) is risky (low–medium)

Per the docstring in standard_data_format_mapping.py, the OpenTau-canonical name is "actions" (plural), while most upstream LeRobot datasets use "action" (singular). The script falls back to "action" for datasets not registered in DATA_FEATURES_NAME_MAPPING — so any dataset that uses the OpenTau canonical name AND isn't registered will silently fail at the key not in meta.stats check. Suggest trying both keys (e.g. for k in ("actions", "action"): if k in meta.stats: return k) before declaring failure.

4. _ = warnings # silence unused import (line ~1052) is wrong (nit)

warnings is actively used at lines 322-323 (warnings.catch_warnings() / warnings.simplefilter) in _aggregate_stats_manual. The line is leftover from an earlier draft — just delete it.

5. _build_mixture_parallel silently drops the val mixture (low)

make_dataset_mixture returns (train, val) when val_datasets exist; the parallel version validates val_datasets but never wraps them in a WeightedDatasetMixture and never returns them. The comment correctly notes val_split_ratio is 0 in this script's cfg path, so it's effectively dead — but the symmetry break is a footgun if anyone factors this out for a smoke-test of another pipeline. Either drop the val validation entirely (since the script doesn't need val) or finish wiring it through.

6. In-script re-implementation of UniversalActionProcessor.fit will silently rot (low)

The script copies the upstream fit body verbatim plus the max_token_length knob. If upstream physical-intelligence/fast/processing_action_tokenizer.py adds preprocessing or changes scale defaults, this script silently diverges from production. Two options:

  • Pin a comment with the upstream commit SHA used as a reference and a "re-port here when upstream changes" note.
  • Or: file an upstream PR exposing max_token_length as a fit(..., max_token_length=...) kwarg, so the re-implementation can be deleted.

7. --action-dim defaults to 32 — same footgun as the (correctly) required --chunk-size (low)

pi0.5/0.6/0.7 currently happen to use max_action_dim=32, but if a config overrides it, the script will silently fit a corpus with a mismatched action dim. Either make --action-dim required, or assert it against the mixture's stats shape after stats aggregation.

8. mixture_cfg mutation in _build_train_cfg (nit)

The function mutates the parsed DatasetMixtureConfig in place (n_obs_history = None, several drop_prob fields). Fine for the single-shot script, but if someone calls the function twice in a notebook, the second call inherits the first call's mutations. A dataclasses.replace(...) would make it explicit.

9. Magic number min_vocab_size + 100 (nit, line ~803)

Could be e.g. vocab_size // 20 or a named constant — minor.

Correctness / security / perf

  • No security issues; trust_remote_code=True is preserved verbatim from existing call sites (which themselves predate this PR), and the script's only network surface is huggingface_hub.snapshot_download of one well-known file (UPSTREAM_SOURCE_FILE = "processing_action_tokenizer.py").
  • The chr(i) alphabet trick works for any plausible (min_token, max_token) range given scale=10 — magnitudes stay well under sys.maxunicode.
  • The parallel mixture build is a legitimate optimization (LeRobotDataset init is I/O-bound, GIL is released by pyarrow/HF cache). Threading is the right primitive.

Tests

  • The new CPU test is exactly the right shape — patches at the modeling module's import site, captures path + kwargs, asserts both default and override. No network, no fixtures.
  • No tests for fit_fast_tokenizer.py itself — given it's an offline one-shot tool that's been empirically validated on a 1M-chunk run, that's probably fine, but a tiny smoke test on a fake DatasetMixtureConfig with two synthetic parquets would catch regression to the _resolve_native_action_key or normalization paths cheaply.

Verdict

Approve in principle once #1 (normalization mismatch — clarify or fix), #2 (pi0.7-LL default — explicit decision on credential cost), and #4 (delete stale _ = warnings) are addressed. The rest are nits / future cleanup.


Generated by Claude Code

- Revert ``PI07LowLevelConfig.discrete_action_tokenizer_path`` default to
  ``physical-intelligence/fast`` so all seven policies share the same
  default (avoids a private-repo credential dependency on first plain
  instantiation). Specialized fits like ``TensorAuto/fast-pi07-pretrain``
  are opted into via ``--policy.discrete_action_tokenizer_path=<...>``.
- Update the CPU mock test to pin the upstream default and exercise the
  ``TensorAuto/fast-pi07-pretrain`` value via the override path. Drop the
  now-redundant explicit override in ``test_pi07_low_level.py`` GPU tests.
- Make ``_normalize_chunks`` byte-match production
  ``Normalize({"ACTION": MIN_MAX})``: use ``EPS = 1e-8`` (the production
  constant), drop the post-normalization ``np.clip``, keep the
  ``nan_to_num`` (mirrors ``prepare_discrete_actions`` in pi0.7-LL).
- ``_resolve_native_action_key`` now optionally accepts the dataset's
  actual stats / column keys and prefers whichever of ``"action"`` /
  ``"actions"`` is present, eliminating the silent-failure mode for
  datasets that aren't registered in ``DATA_FEATURES_NAME_MAPPING`` and
  use the OpenTau-canonical plural form.
- ``_aggregate_stats_manual`` now logs a loud warning when any dataset's
  native action dim exceeds ``--action-dim``: those dims are silently
  dropped, which is only correct if the production policy's
  ``max_action_dim`` matches.
- ``_build_train_cfg`` uses ``dataclasses.replace`` instead of mutating
  the caller's ``DatasetMixtureConfig`` in place.
- ``_build_mixture_parallel`` no longer carries the dead val-dataset
  branch: the script forces ``val_split_ratio=0``, so we assert that and
  raise loudly if ``make_dataset`` ever returns a tuple.
- Pin the upstream commit SHA used to port ``_fit_tokenizer``'s body
  (``UPSTREAM_PINNED_SHA``), so a future drift between our re-impl and
  upstream is easy to detect.
- Name the ``+ 100`` merge-headroom magic as ``_MIN_MERGE_HEADROOM``.
- Drop the leftover ``_ = warnings`` shim (``warnings`` is actually used
  in ``_aggregate_stats_manual``).
@shuheng-liu
Copy link
Copy Markdown
Member Author

Thanks for the review — all addressed in 62c511d. Inline replies:

#1 normalization mismatch (medium). Fixed — _normalize_chunks now byte-matches opentau.policies.normalize: (x - min) / (max - min + EPS) * 2 - 1 with EPS = 1e-8, no clip. Kept the nan_to_num because that's what prepare_discrete_actions applies after Normalize in pi0.7-LL, so the BPE training distribution lines up with what the policy sees.

#2 pi0.7-LL default flip (medium). Reverted — all seven policies now default to physical-intelligence/fast, restoring parity. Users opt into specialized fits via --policy.discrete_action_tokenizer_path=TensorAuto/fast-pi07-pretrain (works on any of the seven). I also dropped the now-redundant explicit override in the GPU test.

#3 _resolve_native_action_key fallback (low-medium). Fixed — the resolver now optionally takes the dataset's actual stats / column keys and prefers whichever of "action" / "actions" is present. Both call sites pass it (set(meta.stats) for the stats worker, pq.read_schema(...).names for the parquet-sampling worker via a one-time peek at the first episode).

#4 _ = warnings (nit). Deleted. warnings is genuinely used in _aggregate_stats_manual, which is why the leftover was incorrect.

#5 val-mixture symmetry break (low). Resolved by going the other direction: dropped the val-validation entirely, asserted val_split_ratio == 0, and raise loudly if make_dataset ever returns a tuple (would indicate the train_cfg got mutated unexpectedly).

#6 in-script re-implementation rot (low). Added UPSTREAM_PINNED_SHA = "ec4d7aa71691cac0b8bed6942be45684db2110f4" at module level + a comment in _fit_tokenizer pointing at it, so future drift between our re-impl and upstream is greppable. Filing an upstream PR exposing max_token_length is a good idea but out of scope for this PR.

#7 silent --action-dim mismatch (low). _aggregate_stats_manual now logs a loud WARNING listing how many datasets and the max native dim when native_action_dim > --action-dim. I left it as a warning (not a hard assert) since silently dropping extra dims IS the intended behaviour when max_action_dim matches production — but the warning surfaces the contract.

#8 mixture mutation (nit). Now uses dataclasses.replace(mixture_cfg, n_obs_history=None, ...) — the caller's config is untouched.

#9 magic + 100 (nit). Named _MIN_MERGE_HEADROOM.

Verification: CPU test 4/4, GPU re-run dispatched (will update the PR with results when it lands).

@shuheng-liu shuheng-liu requested a review from WilliamYue37 May 12, 2026 05:39
@shuheng-liu
Copy link
Copy Markdown
Member Author

GPU re-run on 62c511d is green: 15 passed, 10 skipped, 1 warning, 2:40 wall-clock — same shape as the run on b70bedf. PR body updated.

Copy link
Copy Markdown
Member Author

Follow-up on 62c511d

Verified the fix commit against the original review — seven items fully addressed, one nit remaining:

Remaining: _normalize_chunks docstring claim about nan_to_num

The updated docstring says:

The post-Normalize torch.nan_to_num in pi0.7's prepare_discrete_actions is applied here too, so the BPE sees the same numerical distribution training will see.

But prepare_discrete_actions at src/opentau/policies/pi07/low_level/modeling_pi07_low_level.py:835-853 is just:

device = batch["discrete_actions"].device
discrete_actions = batch["discrete_actions"].to(device="cpu", dtype=torch.float32)
tokens = self.discrete_action_processor.__call__(discrete_actions)

No nan_to_num. I also grepped src/opentau/policies/ for nan_to_num / isnan / fillna — zero hits anywhere. So the script still sanitizes NaN while production propagates it.

In practice this is moot — action streams rarely carry NaN, and the script's defensive nan_to_num doesn't hurt unless production legitimately produces NaN tokens. But the docstring justification is wrong as written. Two easy fixes:

  1. Drop the nan_to_num line — the script then byte-matches production for non-NaN inputs and propagates NaN (which would crash DCT) for NaN inputs, matching training behaviour.
  2. Or: keep the line, rewrite the comment as something like "defensive — production would propagate NaN and crash DCT; we sanitize so the fit doesn't trip on a single bad chunk."

Option 2 is what the code actually does today; option 1 is what the docstring claims. Either is fine — just want them to agree.

Minor: --action-dim warning is manual-path only

The new "native action_dim > --action-dim" warning lives in _aggregate_stats_manual (fit_fast_tokenizer.py around the new logger.warning). The _drain_mixture path (slow --use-mixture-dataloader route) doesn't get the same check. Likely fine — the mixture's standardization pipeline pads to max_action_dim internally — but worth a one-line note in the dataloader branch for symmetry.

Everything else

All other items (default flip back to upstream, action-key fallback with available_keys, stale _ = warnings, val-mixture assert, pinned upstream SHA, dataclasses.replace, named _MIN_MERGE_HEADROOM) look good. Ready to mark as approved once the docstring vs. code disagreement is reconciled.


Generated by Claude Code

Address the follow-up review on PR #290:

- `_normalize_chunks` docstring previously claimed pi0.7's
  `prepare_discrete_actions` applies `torch.nan_to_num` after `Normalize`.
  It does not -- neither `Normalize` nor `prepare_discrete_actions` in
  main has any NaN handling. Rewrite the comment to honestly say the
  `nan_to_num` is defensive (so a single corrupted chunk doesn't tank an
  hour-long fit). The script's math is still byte-identical to training
  for non-NaN inputs.
- `_drain_mixture` (the slow `--use-mixture-dataloader` path) doesn't
  surface the same "native_action_dim > --action-dim" warning the manual
  path does, because the mixture's standardization pipeline already
  raises on truncation via `pad_vector`. Add a one-line note explaining
  why no warning is needed on that path.
@shuheng-liu
Copy link
Copy Markdown
Member Author

Fixed in d804d21:

  • _normalize_chunks docstring — rewritten to be honest: the nan_to_num is defensive (so a single corrupted chunk doesn't tank an hour-long fit), production neither has it in Normalize nor in prepare_discrete_actions. Math is still byte-identical to training for non-NaN inputs. (Chose option 2 from your review — keep the safety belt, just don't pretend it mirrors production.)
  • _drain_mixture symmetry note — added a one-line comment explaining why the mixture-dataloader path doesn't need the action-dim warning: pad_vector in the standardization pipeline raises on truncation, so an over-dim mismatch fails loudly at mixture build time instead of silently dropping dims.

Copy link
Copy Markdown
Member Author

All review items resolved ✅

Confirming after d804d21 that every item from the initial review (and the follow-up) is addressed. Recap:

# Issue Status Where
1 _normalize_chunks vs production Normalize({MIN_MAX}) mismatch ✅ Resolved np.clip dropped, _NORMALIZE_EPS=1e-8 matches production; docstring now honestly documents the nan_to_num as defensive (d804d21)
2 pi0.7-LL default flip created hidden private-repo credential dependency ✅ Resolved Default reverted to physical-intelligence/fast; CPU test exercises TensorAuto/fast-pi07-pretrain via override; GPU test pin removed (62c511d)
3 _resolve_native_action_key silently fell back to "action" ✅ Resolved Now accepts available_keys, prefers whichever of action/actions is present; plumbed through _load_dataset_stats (meta.stats keys) and _sample_chunks_for_dataset_manual (parquet schema) (62c511d)
4 Stale _ = warnings # silence unused import ✅ Resolved Line removed (62c511d)
5 _build_mixture_parallel silently dropped val mixture ✅ Resolved Asserts val_split_ratio == 0, raises clearly if make_dataset ever returns a tuple (62c511d)
6 In-script UniversalActionProcessor.fit re-impl would silently rot ✅ Resolved UPSTREAM_PINNED_SHA = "ec4d7aa71691cac0b8bed6942be45684db2110f4" with re-port instructions (62c511d)
7 --action-dim defaults to 32 footgun ✅ Resolved Manual path: loud warning when any dataset's native dim exceeds the cap. Mixture path: no warning needed — pad_vector (src/opentau/datasets/dataset_mixture.py:78) raises on truncation via numpy broadcast mismatch, so over-dim surfaces as a build-time exception. Documented in _drain_mixture docstring (62c511d, d804d21)
8 mixture_cfg in-place mutation in _build_train_cfg ✅ Resolved Uses dataclasses.replace on a local mixture_cfg_for_fit (62c511d)
9 Magic + 100 for merge headroom ✅ Resolved Named _MIN_MERGE_HEADROOM at module scope with explanatory comment (62c511d)

No outstanding items from my review. Ready for human reviewer sign-off.


Generated by Claude Code

@shuheng-liu shuheng-liu marked this pull request as ready for review May 12, 2026 16:29
@shuheng-liu shuheng-liu changed the title feat(scripts+policies): add fit_fast_tokenizer.py and make FAST tokenizer path configurable feat(policies): support training and selecting FAST tokenizer May 12, 2026
Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Independent review of 58c905a. Two correctness issues worth fixing before merge — both regressions introduced by the prior fix round, not in the original draft.

  • blockingsrc/opentau/scripts/fit_fast_tokenizer.py:598-607,648-654_build_train_cfg uses dataclasses.replace(mixture_cfg, ...) but never overrides val_split_ratio. DatasetMixtureConfig.val_split_ratio defaults to 0.05 (src/opentau/configs/default.py:277). The comment at line 648 ("_build_train_cfg forces val_split_ratio=0 (the default)") and the assert at line 652 are both wrong: any user passing a default mixture JSON via --use-mixture-dataloader will trip the assert and abort before sampling. Fix: add val_split_ratio=0.0 to the dataclasses.replace kwargs.
  • suggestionsrc/opentau/scripts/fit_fast_tokenizer.py:96,927-935UPSTREAM_PINNED_SHA = "ec4d..." is documented as the SHA the script tracks, but _find_upstream_source calls snapshot_download(repo_id=..., allow_patterns=...) with no revision= arg, so each invocation downloads the current HEAD of physical-intelligence/fast/processing_action_tokenizer.py. The file copied into out_dir (which AutoProcessor.from_pretrained(..., trust_remote_code=True) later executes) can drift silently from the ported _fit_tokenizer body. Pass revision=UPSTREAM_PINNED_SHA to snapshot_download to make the constant load-bearing.
  • suggestionsrc/opentau/scripts/fit_fast_tokenizer.py:128-130--chunk-size help text claims "pi0.5 / pi0.6 default to 10; pi0.7 low-level defaults to 50". Every policy config in this repo (pi05, pi05_mem, pi06, pi07/low_level, pi07_paligemma/low_level_planner) has n_action_steps: int = 50. The "10" reading comes from configs/examples/pi07_libero.json which overrides — opposite of what the help text says. This actively misleads users into setting --chunk-size 10 for pi0.5/0.6 production.
  • suggestiontests/policies/test_pi07_cpu.py:1263-1317 — Test docstring at line 1279 says "All seven policies default to the upstream tokenizer", but only PI07LowLevelConfig is exercised. A typo in any of the other six config defaults (PI05Config, PI05MEMConfig, PI06Config, PI07HighLevelPlannerConfig, PI07PaligemmaHighLevelConfig, PI07PaligemmaLowLevelConfig) — or a missed plumbing change in their modeling files — would slip through the gate the PR builds for itself. A @pytest.mark.parametrize over the seven config classes asserting cfg.discrete_action_tokenizer_path == "physical-intelligence/fast" is ~10 lines and closes the gap.
  • nitsrc/opentau/scripts/fit_fast_tokenizer.py:737-753_normalize_chunks casts chunks.astype(np.float64) then back to float32 at return. Production Normalize runs entirely in float32 (src/opentau/policies/normalize.py, create_stats_buffers). The self-review's "byte-identical to training" claim isn't quite right; the values differ in low float32 bits. Functionally moot; just don't claim bit-equality.
  • nitsrc/opentau/scripts/fit_fast_tokenizer.py:540-548,666-680 — Both threaded paths consume as_completed and write into completion-order lists/extends, so chunk order in the BPE training corpus is nondeterministic across runs even with --seed. BPE tie-breaking on equal pair frequencies depends on first-encounter order, so two seeded runs may not be bit-identical (cf. CLAUDE.md rule 3). Sorting completed results back into mixture-config order at the join point would restore determinism; otherwise document the caveat in the script header.

Cross-checks that hold up: all seven policies default to physical-intelligence/fast; _resolve_native_action_key handles both "action" and "actions" via the new available_keys arg; _ = warnings line is gone; _MIN_MERGE_HEADROOM is properly named; the seven AutoProcessor.from_pretrained call sites all thread config.discrete_action_tokenizer_path correctly with no straggler literal strings. The seven-policy refactor itself is clean and mechanical.


Generated by Claude Code

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 12, 2026

[claude-review] summary for commit 58c905a

  • blockingsrc/opentau/scripts/fit_fast_tokenizer.py:598-654_build_train_cfg's dataclasses.replace doesn't override val_split_ratio; default is 0.05 (configs/default.py:277), so the assert val_split_ratio == 0.0 at line 652 fires on any default mixture JSON, breaking --use-mixture-dataloader.
  • suggestionsrc/opentau/scripts/fit_fast_tokenizer.py:96,927-935UPSTREAM_PINNED_SHA is not passed to snapshot_download; the copied processing_action_tokenizer.py tracks upstream HEAD and can drift silently from the ported _fit_tokenizer body.
  • suggestionsrc/opentau/scripts/fit_fast_tokenizer.py:128-130--chunk-size help text says "pi0.5 / pi0.6 default to 10" but every policy actually defaults to n_action_steps = 50; help text inverts the truth.
  • suggestiontests/policies/test_pi07_cpu.py:1263-1317 — test docstring claims all seven policies default to upstream, but only PI07LowLevelConfig is asserted; parametrize over the other six to close the gap.
  • nitsrc/opentau/scripts/fit_fast_tokenizer.py:737-753_normalize_chunks casts to float64 then back to float32; production Normalize is float32 throughout, so "byte-identical to training" is overstated.
  • nitsrc/opentau/scripts/fit_fast_tokenizer.py:540-548,666-680as_completed consumption order makes BPE corpus order nondeterministic even with --seed; either sort by mixture index at the join or document.

WilliamYue37
WilliamYue37 previously approved these changes May 12, 2026
@shuheng-liu
Copy link
Copy Markdown
Member Author

@claude fix per review

- Blocking: force val_split_ratio=0.0 in _build_train_cfg's replace so
  the default 0.05 doesn't trip the parallel-mixture-build assertion.
- Pin _find_upstream_source snapshot to UPSTREAM_PINNED_SHA so the
  copied remote-code file matches the ported _fit_tokenizer body.
- Fix --chunk-size help text: every policy config defaults
  n_action_steps to 50 (only configs/examples/pi07_libero.json overrides
  to 10), so the prior "pi0.5/0.6 default to 10" guidance was inverted.
- Parametrize the seven-policy default test so a typo in any of the
  other six config defaults is caught at CPU-test time.
- Soften _normalize_chunks docstring: float64 -> float32 round-trip
  isn't bit-identical to production's float32 path; agreement is ~1e-7.
- Restore deterministic BPE corpus order in _sample_via_manual by
  collecting into a per-dataset positional list before concatenating,
  so as_completed completion order can't perturb BPE tie-breaks.
@shuheng-liu shuheng-liu merged commit 5c553e6 into main May 12, 2026
7 checks passed
@shuheng-liu shuheng-liu deleted the claude/crazy-chebyshev-0ebfcc branch May 12, 2026 18:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants