feat(eval): configurable pi07 metadata on EnvConfig#296
Conversation
shuheng-liu
left a comment
There was a problem hiding this comment.
Reviewed against prepare_metadata in src/opentau/policies/pi07/low_level/modeling_pi07_low_level.py (L1075–1151) and the training-time emitter BaseDataset._emit_optional_keys in src/opentau/datasets/lerobot_dataset.py (L855–963). The injection contract lines up: dtypes (torch.long for speed/quality, torch.bool for mistake, list[str] for robot_type/control_mode), per-sample _is_pad=False flags on the numeric keys only, no separate pad flag for the string keys (empty string is the pad signal at training time). The all-None default keeps the existing batches byte-identical with the pad path. Validation is well-targeted — the isinstance(x, bool) guards on the int fields are the right call.
A few things worth addressing or at least acknowledging before merge:
1. No unit test for add_eval_metadata itself. tests/envs/test_configs.py covers the validator thoroughly, but the injection helper in src/opentau/envs/utils.py has zero coverage — dtype/device propagation, (B,) broadcasting, the missing-key skip path, and the cfg.env is None early-return are all untested. A small CPU test against a fake observation dict (just {"state": torch.zeros(B, D)}) would lock in the contract that downstream prepare_metadata actually relies on. This is the highest-value gap.
2. Speed bucket size is implicitly coupled to #295. The validator hard-codes multiple of 10 in two places (the % 10 != 0 check and the error string). If #295 ever revisits the bucket size, this validator silently drifts. Consider extracting SPEED_BUCKET_SECONDS = 10 either next to CONTROL_MODE_CHOICES here, or — better — importing it from wherever the training-side bucketing lives, so the two sides can't diverge silently.
3. Architectural placement. EnvMetadataConfig is attached to every EnvConfig, but only pi07 / pi05_mem-class policies actually read these keys. Setting --env.metadata.speed=20 against a vanilla pi0 eval will pass validation, get broadcast into the obs dict, and then be silently ignored by the policy. Not a blocker (it does match the "property of what is being run" framing in the docstring), but worth a one-line note that only pi07-family policies consume these so a future reader doesn't grep for downstream effects in vain.
4. mistake=False vs mistake=None semantic. These produce very different prompts: False injects "Mistake: False, " into the prefix, None drops the segment entirely. The docstring lists "True / False, or None" without flagging this, and it's the most plausible footgun in the API surface (someone defaulting mistake=False thinking it's the "no mistake" knob when they actually wanted "leave it missing"). One sentence in the field's docstring would close this.
5. cfg.env is None early-return is dead. eval_policy dereferences cfg.env.type at src/opentau/scripts/eval.py:582 before rollout() ever runs, so by the time add_eval_metadata is called cfg.env is guaranteed non-None. Fine to keep as cheap defense-in-depth if the helper is meant to be reusable, but per the repo's "no error handling for impossible scenarios" rule it could be dropped.
6. PR description verification command has a small drift. prepare_metadata builds segments with trailing ", " and then ' '.join(...)s them, so the actual printed string has double spaces between fields: "Metadata: Speed: 20, Robot: UR5, Control: ee, ". The PR body expects a single space. Cosmetic — but the suggested print-and-eyeball sanity check won't match a substring search written with single spaces. (The underlying double-space format is pre-existing in prepare_metadata and not this PR's concern.)
Nits (non-blocking):
CONTROL_MODE_CHOICES = ("joint", "ee")is duplicated with theLiteral["joint", "ee"]in the annotation.typing.get_args(EnvMetadataConfig.__annotations__["control_mode"])(after unwrapping| None) would keep them in lockstep, though the current form is more readable. Up to you.batch_size = observation["state"].shape[0]works becausepreprocess_observationalways emits"state", butadd_envs_taskreaches forenv.num_envsinstead — passing the same signal here would make the dependency explicit. Minor.
Overall: shape of the PR looks right, contract with prepare_metadata is correct, validation is solid. Main ask is closing the test gap on add_eval_metadata.
Generated by Claude Code
|
Addressed in 2fae6a4 — items 1–5 + nit-2 covered; item 6 covered in the PR body; nit-1 left as-is per below. 1. Test gap on 2. Speed bucket coupled to #295 — Extracted 3. Architectural placement note — Added one sentence to the 4. 5. Dead 6. PR body verification example double-space drift — Fixed in the description above: the suggested check is now phrased as three substring expectations ( Nits:
|
shuheng-liu
left a comment
There was a problem hiding this comment.
Re-reviewed against the new commit 2fae6a4 (test(eval): cover add_eval_metadata + address review nits). All five substantive points from the previous round are addressed:
- ✅
add_eval_metadatatest coverage —TestAddEvalMetadataintests/envs/test_utils.pypins dtype/shape/value per field, mutated-in-place contract, partial-field skip, theFalsevsNonesemantic onmistake, and device routing viadevice="meta"(nice trick — keeps the CPU suite self-contained while still exercising the device branch). The_make_cfgSimpleNamespaceduck-type is a clean way to skip the fullTrainPipelineConfigconstruction cost. - ✅
SPEED_BUCKET_SECONDS = 10extracted with a code comment pointing at the training-side source. Validator and error string both go through it now. - ✅ Docstring note on pi07-only consumption — added.
- ✅
mistake=Falsevsmistake=Nonefoot-gun — called out in the docstring and pinned as a dedicated test case (test_mistake_false_vs_none_differ). - ✅
cfg.env is Noneearly-return dropped, and the docstring now explains the invarianteval_policyprovides.
Two small follow-ups remain (both non-blocking — either can be deferred to a follow-up PR):
-
SPEED_BUCKET_SECONDSis still single-sided. The constant lives insrc/opentau/envs/configs.py, but the comment notes the training-sideround(duration_s / 10) * 10(inBaseDataset._emit_optional_keysover inlerobot_dataset.py) still uses a literal10. The comment is good defense, but the underlying lockstep risk isn't eliminated — only a shared constant (defined in the dataset module and imported here) closes the loop. Since the training-side change is #295's territory, a follow-up issue / TODO to hoist the constant once #295 lands would be enough. -
tests/envs/test_configs.py:205—test_speed_must_be_positive_multiple_of_10hard-codes10in its regex (match=r"speed must be a positive multiple of 10"). The validator builds the error fromSPEED_BUCKET_SECONDS, so today this happens to match, but the test name and regex would silently miss a constant-only change. Cheap fix:match=rf"speed must be a positive multiple of {SPEED_BUCKET_SECONDS}", and rename totest_speed_must_be_positive_multiple_of_bucket(or similar). Cosmetic.
LGTM modulo those nits — happy to see this go in.
Generated by Claude Code
|
Re-examined both nits: Nit 1 (Literal/ Nit 2 (
Happy to revisit if you'd like a different framing (e.g. an explicit |
shuheng-liu
left a comment
There was a problem hiding this comment.
Re-reviewed fbdbf05 (refactor(envs): derive CONTROL_MODE_CHOICES from Literal alias) — clean addressing of the Literal / runtime-tuple duplication nit from round one. ControlMode = Literal["joint", "ee"] plus CONTROL_MODE_CHOICES = get_args(ControlMode) is the right shape; tuple[str, ...] annotation is correct for the get_args return, and no test behaviour changes since the produced tuple is identical (("joint", "ee")).
The two non-blocking items from the last round (training-side SPEED_BUCKET_SECONDS lockstep, hard-coded 10 in test_speed_must_be_positive_multiple_of_10's regex) are unchanged — both were explicitly defer-able, no objection to leaving them for a follow-up.
LGTM, nothing else to flag.
Generated by Claude Code
What this does
Adds an
EnvMetadataConfigdataclass attached to everyEnvConfigsubclass via the abstract base (CLI:--env.metadata.*), so the optional pi07 inference-time metadata fields —speed,quality,mistake,robot_type,control_mode— become configurable per eval run instead of always defaulting to "missing" / pad.When a field is
None(the default for every field), the corresponding batch key is omitted and the policy'sprepare_metadatapad path produces no segment in the prefix → existing eval rollouts are byte-identical. When a field is set, the value is broadcast across the rollout batch with the dataset-matching dtype (torch.longforspeed/quality,torch.boolformistake,list[str]forrobot_type/control_mode) and a parallel{key}_is_pad=Falseflag where applicable.Note:
mistake=Falseis semantically distinct frommistake=None.Falseemits aMistake: Falsesegment into the prefix;Noneomits the segment entirely.Validation pinned in
EnvMetadataConfig.__post_init__:speed: positive int, multiple ofSPEED_BUCKET_SECONDS(= 10 seconds) — matches feat(datasets): bucket speed by duration-seconds, not native frames #295's training-side bucket (duration-seconds rounded to 10 s). Constant exposed so the divisor only lives in one place on the eval side.quality: int in[1, 5].mistake: bool.robot_type: non-empty string.control_mode:Literal["joint", "ee"].Only the pi07 family of policies consumes these keys today — setting them against a vanilla pi0 / pi05 eval will pass validation but the values get ignored downstream.
CLI usage:
opentau-eval --config_path=... \ --env.metadata.speed=20 \ --env.metadata.quality=3 \ --env.metadata.mistake=false \ --env.metadata.robot_type=UR5 \ --env.metadata.control_mode=eeresponseandsubgoal*stay absent for now (the policy's same pad-default mechanism handles them, and there's no eval-time source for them yet). The gRPC inference server is out of scope — wiring these fields to deployment would need new proto fields inrobot_inference.protoand matching unpacking inserver.py::_prepare_observation.The injection point is a new helper
add_eval_metadata(observation, cfg)inenvs/utils.py, called in the rollout loop inscripts/eval.pyright afteradd_envs_task. The helper trustscfg.envnon-Nonebecauseeval_policydereferencescfg.env.typebefore the rollout loop ever runs.Refs #295 — the training-side
speed_rawis being switched to a 10-second duration bucket; this PR keeps the eval-side validator in lockstep so a sweep like--env.metadata.speed=20lines up with values the model actually saw in training.How it was tested
pre-commit run --files src/opentau/envs/configs.py src/opentau/envs/utils.py src/opentau/scripts/eval.py tests/envs/test_configs.py tests/envs/test_utils.py— all hooks pass (ruff lint+format, pyupgrade, bandit, typos, gitleaks, etc.).pytest -m "not gpu" -n auto tests/configs tests/envs/test_utils.py tests/envs/test_configs.py tests/scripts— 315 passed, 0 failed. Coverage spans:tests/envs/test_configs.py::TestEnvMetadataConfig+::TestEnvConfigMetadataField, 37 cases) — valid values, invalidspeed/quality/mistake/robot_type/control_mode, theLiteral["joint", "ee"]constraint, the all-Nonedefault field on a freshEnvConfigsubclass.tests/envs/test_utils.py::TestAddEvalMetadata, 9 cases parametrised to 14 runs) — dtype/shape per field, mutated-in-place return contract, theFalsevsNonesemantic onmistake, partial-fields skip path, list-of-strings broadcast for string fields, device propagation (viadevice="meta"so the CPU suite still exercises the routing branch).tests/envs/test_factory.pyandtests/utils/test_libero_utils.pyunder the fullpytest -m "not gpu"run are unrelated — they're LIBERO config bootstrap issues (interactiveinput()fromlibero.libero.__init__when~/.libero/config.yamlis missing andLIBERO_CONFIG_PATHis unset). CI setsLIBERO_CONFIG_PATH=.github/assets/liberoincpu_test.yml, so they pass there.How to checkout & try? (for the reviewer)
gh pr checkout 296 uv sync --extra dev pytest -sx tests/envs/test_configs.py::TestEnvMetadataConfig \ tests/envs/test_configs.py::TestEnvConfigMetadataField \ tests/envs/test_utils.py::TestAddEvalMetadataEnd-to-end sanity that the new keys reach the policy at eval time (drop a
print(metadata[0])insideprepare_metadataat thef"Metadata: {' '.join(segments)}"line, then run the LIBERO smoke):opentau-eval --config_path=configs/dev/dev_config.json \ --env.metadata.speed=20 \ --env.metadata.robot_type=UR5 \ --env.metadata.control_mode=ee # Expect the prefix to include the substrings "Speed: 20", "Robot: UR5", # and "Control: ee" (the actual format ends up with doubled spaces between # segments because prepare_metadata does " ".join() on segments that already # end with ", " — not relevant to this PR, just don't grep for one literal).Also confirm the no-op default — same command without any
--env.metadata.*overrides should print an empty metadata string (the policy's pad path).Checklist
Note: Before submitting this PR, please read the contributor guideline.