feat(eval): configurable pi07 metadata on EnvConfig by shuheng-liu · Pull Request #296 · TensorAuto/OpenTau

shuheng-liu · 2026-05-12T19:52:57Z

What this does

Adds an EnvMetadataConfig dataclass attached to every EnvConfig subclass via the abstract base (CLI: --env.metadata.*), so the optional pi07 inference-time metadata fields — speed, quality, mistake, robot_type, control_mode — become configurable per eval run instead of always defaulting to "missing" / pad.

When a field is None (the default for every field), the corresponding batch key is omitted and the policy's prepare_metadata pad path produces no segment in the prefix → existing eval rollouts are byte-identical. When a field is set, the value is broadcast across the rollout batch with the dataset-matching dtype (torch.long for speed/quality, torch.bool for mistake, list[str] for robot_type/control_mode) and a parallel {key}_is_pad=False flag where applicable.

Note: mistake=False is semantically distinct from mistake=None. False emits a Mistake: False segment into the prefix; None omits the segment entirely.

Validation pinned in EnvMetadataConfig.__post_init__:

- speed: positive int, multiple of SPEED_BUCKET_SECONDS (= 10 seconds). This matches the training-side bucket from #295 (feat(datasets): bucket speed by duration-seconds, not native frames), which rounds duration-seconds to 10 s. The constant is exposed so the divisor lives in only one place on the eval side.
- quality: int in [1, 5].
- mistake: bool.
- robot_type: non-empty string.
- control_mode: Literal["joint", "ee"].
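
For readers without the diff open, the rules above can be sketched roughly as follows. This is an illustrative reconstruction, not the merged code: the names EnvMetadataConfig, SPEED_BUCKET_SECONDS, CONTROL_MODE_CHOICES and the "speed must be a positive multiple of" message come from this thread, while the `_is_int` helper and the remaining error strings are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Literal, Optional

SPEED_BUCKET_SECONDS = 10  # eval-side divisor; mirrors the training-side bucket
CONTROL_MODE_CHOICES = ("joint", "ee")


def _is_int(value: object) -> bool:
    """Hypothetical helper: accept ints but reject bools (bool subclasses int)."""
    return isinstance(value, int) and not isinstance(value, bool)


@dataclass
class EnvMetadataConfig:
    # Every field defaults to None, which means "omit the key from the batch".
    speed: Optional[int] = None
    quality: Optional[int] = None
    mistake: Optional[bool] = None
    robot_type: Optional[str] = None
    control_mode: Optional[Literal["joint", "ee"]] = None

    def __post_init__(self) -> None:
        if self.speed is not None and not (
            _is_int(self.speed)
            and self.speed > 0
            and self.speed % SPEED_BUCKET_SECONDS == 0
        ):
            raise ValueError(
                f"speed must be a positive multiple of {SPEED_BUCKET_SECONDS}, "
                f"got {self.speed!r}"
            )
        if self.quality is not None and not (
            _is_int(self.quality) and 1 <= self.quality <= 5
        ):
            # Error text below is an assumption, not quoted from the PR.
            raise ValueError(f"quality must be an int in [1, 5], got {self.quality!r}")
        if self.mistake is not None and not isinstance(self.mistake, bool):
            raise ValueError(f"mistake must be a bool, got {self.mistake!r}")
        if self.robot_type is not None and not (
            isinstance(self.robot_type, str) and self.robot_type
        ):
            raise ValueError("robot_type must be a non-empty string")
        if (
            self.control_mode is not None
            and self.control_mode not in CONTROL_MODE_CHOICES
        ):
            raise ValueError(
                f"control_mode must be one of {CONTROL_MODE_CHOICES}, "
                f"got {self.control_mode!r}"
            )
```

The bool guard matters because `isinstance(True, int)` is true in Python, so without it `speed=True` would fail only on the modulo check and `quality=True` would pass as 1.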

Only the pi07 family of policies consumes these keys today — setting them against a vanilla pi0 / pi05 eval will pass validation but the values get ignored downstream.

CLI usage:

opentau-eval --config_path=... \
    --env.metadata.speed=20 \
    --env.metadata.quality=3 \
    --env.metadata.mistake=false \
    --env.metadata.robot_type=UR5 \
    --env.metadata.control_mode=ee

response and subgoal* stay absent for now (the policy's same pad-default mechanism handles them, and there's no eval-time source for them yet). The gRPC inference server is out of scope — wiring these fields to deployment would need new proto fields in robot_inference.proto and matching unpacking in server.py::_prepare_observation.

The injection point is a new helper add_eval_metadata(observation, cfg) in envs/utils.py, called in the rollout loop in scripts/eval.py right after add_envs_task. The helper trusts cfg.env non-None because eval_policy dereferences cfg.env.type before the rollout loop ever runs.
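
A minimal sketch of what such a helper could look like, under stated assumptions: the batch keys are assumed to share the field names, and the `_is_pad` flags are assumed to be emitted for speed, quality and mistake only. The SimpleNamespace config stands in for the real pipeline config and is purely illustrative; the merged implementation in envs/utils.py may differ.

```python
from types import SimpleNamespace

import torch


def add_eval_metadata(observation: dict, cfg) -> dict:
    """Broadcast the configured metadata across the rollout batch, in place."""
    meta = cfg.env.metadata  # cfg.env is trusted to be non-None here
    state = observation["state"]
    batch_size, device = state.shape[0], state.device

    # Tensor-valued fields: one value per sample plus a parallel pad flag.
    tensor_fields = (
        ("speed", torch.long),
        ("quality", torch.long),
        ("mistake", torch.bool),
    )
    for key, dtype in tensor_fields:
        value = getattr(meta, key)
        if value is None:
            continue  # omitted key: the policy's pad path handles it
        observation[key] = torch.full(
            (batch_size,), value, dtype=dtype, device=device
        )
        observation[f"{key}_is_pad"] = torch.zeros(
            batch_size, dtype=torch.bool, device=device
        )

    # String-valued fields: a plain list[str], with no separate pad flag.
    for key in ("robot_type", "control_mode"):
        value = getattr(meta, key)
        if value is not None:
            observation[key] = [value] * batch_size

    return observation


# Illustrative usage with a duck-typed config and a fake observation.
cfg = SimpleNamespace(
    env=SimpleNamespace(
        metadata=SimpleNamespace(
            speed=20, quality=None, mistake=False, robot_type="UR5", control_mode=None
        )
    )
)
obs = add_eval_metadata({"state": torch.zeros(4, 8)}, cfg)
```

Note that the `is None` check, rather than a truthiness check, is what keeps mistake=False from being skipped.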

Refs #295 — the training-side speed_raw is being switched to a 10-second duration bucket; this PR keeps the eval-side validator in lockstep so a sweep like --env.metadata.speed=20 lines up with values the model actually saw in training.

How it was tested

pre-commit run --files src/opentau/envs/configs.py src/opentau/envs/utils.py src/opentau/scripts/eval.py tests/envs/test_configs.py tests/envs/test_utils.py — all hooks pass (ruff lint+format, pyupgrade, bandit, typos, gitleaks, etc.).
pytest -m "not gpu" -n auto tests/configs tests/envs/test_utils.py tests/envs/test_configs.py tests/scripts — 315 passed, 0 failed. Coverage spans:
- Config validator (tests/envs/test_configs.py::TestEnvMetadataConfig + ::TestEnvConfigMetadataField, 37 cases) — valid values, invalid speed / quality / mistake / robot_type / control_mode, the Literal["joint", "ee"] constraint, the all-None default field on a fresh EnvConfig subclass.
- Injection helper (tests/envs/test_utils.py::TestAddEvalMetadata, 9 cases parametrised to 14 runs) — dtype/shape per field, mutated-in-place return contract, the False vs None semantic on mistake, partial-fields skip path, list-of-strings broadcast for string fields, device propagation (via device="meta" so the CPU suite still exercises the routing branch).
Pre-existing failures in tests/envs/test_factory.py and tests/utils/test_libero_utils.py under the full pytest -m "not gpu" run are unrelated — they're LIBERO config bootstrap issues (interactive input() from libero.libero.__init__ when ~/.libero/config.yaml is missing and LIBERO_CONFIG_PATH is unset). CI sets LIBERO_CONFIG_PATH=.github/assets/libero in cpu_test.yml, so they pass there.

How to checkout & try? (for the reviewer)

gh pr checkout 296
uv sync --extra dev
pytest -sx tests/envs/test_configs.py::TestEnvMetadataConfig \
           tests/envs/test_configs.py::TestEnvConfigMetadataField \
           tests/envs/test_utils.py::TestAddEvalMetadata

End-to-end sanity that the new keys reach the policy at eval time (drop a print(metadata[0]) inside prepare_metadata at the f"Metadata: {' '.join(segments)}" line, then run the LIBERO smoke):

opentau-eval --config_path=configs/dev/dev_config.json \
    --env.metadata.speed=20 \
    --env.metadata.robot_type=UR5 \
    --env.metadata.control_mode=ee
# Expect the prefix to include the substrings "Speed: 20", "Robot: UR5",
# and "Control: ee" (the actual format ends up with doubled spaces between
# segments because prepare_metadata does " ".join() on segments that already
# end with ", " — not relevant to this PR, just don't grep for one literal).

Also confirm the no-op default — same command without any --env.metadata.* overrides should print an empty metadata string (the policy's pad path).
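
To make the doubled-space caveat concrete, here is a small stand-in for the string assembly. It is not the real prepare_metadata: the function name `build_metadata_prefix` and the segment order are assumptions, and only the segment labels and the `" ".join(...)` on segments ending in ", " are taken from this thread.

```python
from typing import Optional


def build_metadata_prefix(
    speed: Optional[int] = None,
    mistake: Optional[bool] = None,
    robot_type: Optional[str] = None,
    control_mode: Optional[str] = None,
) -> str:
    """Hypothetical stand-in that mimics the join behaviour described above."""
    segments = []
    if speed is not None:
        segments.append(f"Speed: {speed}, ")
    if mistake is not None:  # False still emits a segment; only None omits it
        segments.append(f"Mistake: {mistake}, ")
    if robot_type is not None:
        segments.append(f"Robot: {robot_type}, ")
    if control_mode is not None:
        segments.append(f"Control: {control_mode}, ")
    # Each segment already ends with ", ", so joining on " " doubles the space.
    return f"Metadata: {' '.join(segments)}"


prefix = build_metadata_prefix(speed=20, robot_type="UR5", control_mode="ee")

# Check each expected substring on its own instead of grepping for one literal.
expected = ("Speed: 20", "Robot: UR5", "Control: ee")
all_present = all(part in prefix for part in expected)
```

With this stand-in, a search for the single-spaced literal "Speed: 20, Robot: UR5" fails even though all three substrings are present.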

Checklist

I have added Google-style docstrings to important functions and ensured function parameters are typed.
My PR includes policy-related changes.
- If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

Note: Before submitting this PR, please read the contributor guideline.

shuheng-liu

Reviewed against prepare_metadata in src/opentau/policies/pi07/low_level/modeling_pi07_low_level.py (L1075–1151) and the training-time emitter BaseDataset._emit_optional_keys in src/opentau/datasets/lerobot_dataset.py (L855–963). The injection contract lines up: dtypes (torch.long for speed/quality, torch.bool for mistake, list[str] for robot_type/control_mode), per-sample _is_pad=False flags on the numeric keys only, no separate pad flag for the string keys (empty string is the pad signal at training time). The all-None default keeps the existing batches byte-identical with the pad path. Validation is well-targeted — the isinstance(x, bool) guards on the int fields are the right call.

A few things worth addressing or at least acknowledging before merge:

1. No unit test for add_eval_metadata itself. tests/envs/test_configs.py covers the validator thoroughly, but the injection helper in src/opentau/envs/utils.py has zero coverage — dtype/device propagation, (B,) broadcasting, the missing-key skip path, and the cfg.env is None early-return are all untested. A small CPU test against a fake observation dict (just {"state": torch.zeros(B, D)}) would lock in the contract that downstream prepare_metadata actually relies on. This is the highest-value gap.

2. Speed bucket size is implicitly coupled to #295. The validator hard-codes multiple of 10 in two places (the % 10 != 0 check and the error string). If #295 ever revisits the bucket size, this validator silently drifts. Consider extracting SPEED_BUCKET_SECONDS = 10 either next to CONTROL_MODE_CHOICES here, or — better — importing it from wherever the training-side bucketing lives, so the two sides can't diverge silently.

3. Architectural placement. EnvMetadataConfig is attached to every EnvConfig, but only pi07 / pi05_mem-class policies actually read these keys. Setting --env.metadata.speed=20 against a vanilla pi0 eval will pass validation, get broadcast into the obs dict, and then be silently ignored by the policy. Not a blocker (it does match the "property of what is being run" framing in the docstring), but worth a one-line note that only pi07-family policies consume these so a future reader doesn't grep for downstream effects in vain.

4. mistake=False vs mistake=None semantic. These produce very different prompts: False injects "Mistake: False, " into the prefix, None drops the segment entirely. The docstring lists "True / False, or None" without flagging this, and it's the most plausible footgun in the API surface (someone defaulting mistake=False thinking it's the "no mistake" knob when they actually wanted "leave it missing"). One sentence in the field's docstring would close this.

5. cfg.env is None early-return is dead. eval_policy dereferences cfg.env.type at src/opentau/scripts/eval.py:582 before rollout() ever runs, so by the time add_eval_metadata is called cfg.env is guaranteed non-None. Fine to keep as cheap defense-in-depth if the helper is meant to be reusable, but per the repo's "no error handling for impossible scenarios" rule it could be dropped.

6. PR description verification command has a small drift. prepare_metadata builds segments with trailing ", " and then ' '.join(...)s them, so the actual printed string has double spaces between fields: "Metadata: Speed: 20,  Robot: UR5,  Control: ee, ". The PR body expects a single space. This is cosmetic, but the suggested print-and-eyeball sanity check won't match a substring search written with single spaces. (The underlying double-space format is pre-existing in prepare_metadata and not this PR's concern.)

Nits (non-blocking):

CONTROL_MODE_CHOICES = ("joint", "ee") is duplicated with the Literal["joint", "ee"] in the annotation. typing.get_args(EnvMetadataConfig.__annotations__["control_mode"]) (after unwrapping | None) would keep them in lockstep, though the current form is more readable. Up to you.
batch_size = observation["state"].shape[0] works because preprocess_observation always emits "state", but add_envs_task reaches for env.num_envs instead — passing the same signal here would make the dependency explicit. Minor.

Overall: shape of the PR looks right, contract with prepare_metadata is correct, validation is solid. Main ask is closing the test gap on add_eval_metadata.

Generated by Claude Code

shuheng-liu · 2026-05-12T20:07:21Z

Addressed in 2fae6a4 — items 1–5 + nit-2 covered; item 6 covered in the PR body; nit-1 left as-is per below.

1. Test gap on add_eval_metadata — Added TestAddEvalMetadata in tests/envs/test_utils.py (9 cases, parametrised to 14 runs). Covers dtype + shape per field, the mutated-in-place return contract, the False vs None semantic on mistake (item 4), the partial-fields skip path, list-of-strings broadcast for the string fields, and device propagation via device="meta" so the CPU suite still exercises the routing branch.

2. Speed bucket coupled to #295 — Extracted SPEED_BUCKET_SECONDS = 10 next to CONTROL_MODE_CHOICES, used in both the modulo check and the error string. Added a comment flagging that the training side currently hard-codes the same literal 10 inside BaseDataset._emit_optional_keys; if either side ever needs to change the bucket they must change together. Once #295 lands and exposes its own constant, swapping to a cross-module import is a one-liner.

3. Architectural placement note — Added one sentence to the EnvMetadataConfig class docstring: "Only the pi07 family of policies consumes these keys today; setting them when evaluating another policy (e.g. pi0, pi05) will pass validation but the values will be ignored downstream."

4. mistake=False vs mistake=None foot-gun — Added the explicit note to the mistake field docstring, mirrored in the class docstring, and locked the behaviour in with test_mistake_false_vs_none_differ.

5. Dead cfg.env is None early-return — Dropped. Updated the helper's docstring with a one-line note that eval_policy dereferences cfg.env.type before the rollout loop, so the helper trusts its input.

6. PR body verification example double-space drift — Fixed in the description above: the suggested check is now phrased as three substring expectations ("Speed: 20", "Robot: UR5", "Control: ee") instead of one literal, with a parenthetical noting the doubled spaces come from prepare_metadata's " ".join(...) on segments that themselves end with ", " — out of scope for this PR but worth flagging.

Nits:

Skipping typing.get_args(...) for control_mode: the current CONTROL_MODE_CHOICES tuple is more readable, and the duplication is now only one line away from the annotation so drift is easy to spot.
Skipping the env.num_envs vs state.shape[0] swap: passing env into the helper just to read num_envs would widen the signature unnecessarily; the state-derived path keeps the helper's dependency surface tight to the observation dict, which add_envs_task already mutates.

shuheng-liu

Re-reviewed against the new commit 2fae6a4 (test(eval): cover add_eval_metadata + address review nits). All five substantive points from the previous round are addressed:

✅ add_eval_metadata test coverage — TestAddEvalMetadata in tests/envs/test_utils.py pins dtype/shape/value per field, mutated-in-place contract, partial-field skip, the False vs None semantic on mistake, and device routing via device="meta" (nice trick — keeps the CPU suite self-contained while still exercising the device branch). The _make_cfg SimpleNamespace duck-type is a clean way to skip the full TrainPipelineConfig construction cost.
✅ SPEED_BUCKET_SECONDS = 10 extracted with a code comment pointing at the training-side source. Validator and error string both go through it now.
✅ Docstring note on pi07-only consumption — added.
✅ mistake=False vs mistake=None foot-gun — called out in the docstring and pinned as a dedicated test case (test_mistake_false_vs_none_differ).
✅ cfg.env is None early-return dropped, and the docstring now explains the invariant eval_policy provides.

Two small follow-ups remain (both non-blocking — either can be deferred to a follow-up PR):

SPEED_BUCKET_SECONDS is still single-sided. The constant lives in src/opentau/envs/configs.py, but the comment notes the training-side round(duration_s / 10) * 10 (in BaseDataset._emit_optional_keys over in lerobot_dataset.py) still uses a literal 10. The comment is good defense, but the underlying lockstep risk isn't eliminated — only a shared constant (defined in the dataset module and imported here) closes the loop. Since the training-side change is #295's territory, a follow-up issue / TODO to hoist the constant once #295 lands would be enough.
tests/envs/test_configs.py:205 — test_speed_must_be_positive_multiple_of_10 hard-codes 10 in its regex (match=r"speed must be a positive multiple of 10"). The validator builds the error from SPEED_BUCKET_SECONDS, so today this happens to match, but the test name and regex would silently miss a constant-only change. Cheap fix: match=rf"speed must be a positive multiple of {SPEED_BUCKET_SECONDS}", and rename to test_speed_must_be_positive_multiple_of_bucket (or similar). Cosmetic.
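
Both follow-ups amount to routing everything through one constant. A rough sketch, assuming a single shared definition; `bucket_speed` and `validate_speed` are hypothetical names used only for illustration, not functions in the repo:

```python
import re

# Assumed to be defined once (e.g. in the dataset module) and imported elsewhere.
SPEED_BUCKET_SECONDS = 10


def bucket_speed(duration_s: float) -> int:
    """Training-side shape quoted above: round(duration_s / 10) * 10."""
    return round(duration_s / SPEED_BUCKET_SECONDS) * SPEED_BUCKET_SECONDS


def validate_speed(speed: int) -> None:
    """Eval-side check built from the same constant."""
    if speed <= 0 or speed % SPEED_BUCKET_SECONDS != 0:
        raise ValueError(
            f"speed must be a positive multiple of {SPEED_BUCKET_SECONDS}, got {speed}"
        )


# Test-side pattern derived from the constant instead of a literal 10.
EXPECTED_ERROR = rf"speed must be a positive multiple of {SPEED_BUCKET_SECONDS}"

# Any value the training side can emit passes the eval-side validator.
validate_speed(bucket_speed(23.0))

try:
    validate_speed(15)
except ValueError as exc:
    error_matches = re.search(EXPECTED_ERROR, str(exc)) is not None
```

In a pytest suite the same pattern would be passed as `match=EXPECTED_ERROR` to `pytest.raises`, so a change to the constant updates the validator, the message and the test together.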

LGTM modulo those nits — happy to see this go in.

Generated by Claude Code

shuheng-liu · 2026-05-12T20:13:34Z

Re-examined both nits:

Nit 1 (Literal/get_args) — Good catch, applied in fbdbf05. Introduced a ControlMode = Literal["joint", "ee"] alias and derived CONTROL_MODE_CHOICES = get_args(ControlMode), so the annotation and the runtime tuple are now one source. Smoke-verified get_args returns ("joint", "ee") and all 43 validator + helper tests still pass. The dataclass annotation reads control_mode: ControlMode | None = None, which keeps the on-disk JSON/CLI surface unchanged.
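
For reference, the single-source pattern looks roughly like this. ControlMode and CONTROL_MODE_CHOICES are the names used in the commit; `check_control_mode` and its error text are hypothetical and only illustrate the runtime use of the tuple.

```python
from typing import Literal, Optional, get_args

# The annotation and the runtime tuple now come from one definition.
ControlMode = Literal["joint", "ee"]
CONTROL_MODE_CHOICES: tuple[str, ...] = get_args(ControlMode)


def check_control_mode(value: Optional[str]) -> Optional[str]:
    """Hypothetical validator: None means 'omit', anything else must be a choice."""
    if value is not None and value not in CONTROL_MODE_CHOICES:
        raise ValueError(
            f"control_mode must be one of {CONTROL_MODE_CHOICES}, got {value!r}"
        )
    return value
```

Adding a third mode then means editing only the Literal alias.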

Nit 2 (state.shape[0] vs env.num_envs) — Still pushing back, with a more concrete read this time:

add_envs_task takes env because it structurally needs it: env.call("get_wrapper_attr", task_key) is its core operation, plus the len(tasks) != env.num_envs cross-check. num_envs is a free byproduct of a handle the helper already requires.
add_eval_metadata has no analogous need for env — it broadcasts a Python scalar across a batch dim and copies a device handle. Pulling env into the signature just to read num_envs widens the parameter surface for one scalar.
More importantly: observation["state"].shape[0] is the contract from preprocess_observation — the prior helper guarantees state has shape (B, max_state_dim) where B is the batch dim the policy will receive. Every downstream helper should align with that, not with env.num_envs (which is the configured env count, a different conceptual thing even if numerically equal today). If a future change ever introduces per-step batch trimming, state.shape[0] adapts and env.num_envs would silently mismatch.
The current helper does state = observation["state"]; batch_size = state.shape[0]; device = state.device — one dereference, two derived values. Splitting batch_size to come from env while device still comes from state would mix sources for no functional gain.

Happy to revisit if you'd like a different framing (e.g. an explicit _assert_obs_batch_dim(env, observation) helper) — but I'd lean toward leaving the asymmetry as-is and treating the divergence in signatures as faithfully reflecting what each helper actually needs.

shuheng-liu

Re-reviewed fbdbf05 (refactor(envs): derive CONTROL_MODE_CHOICES from Literal alias) — clean addressing of the Literal / runtime-tuple duplication nit from round one. ControlMode = Literal["joint", "ee"] plus CONTROL_MODE_CHOICES = get_args(ControlMode) is the right shape; tuple[str, ...] annotation is correct for the get_args return, and no test behaviour changes since the produced tuple is identical (("joint", "ee")).

The two non-blocking items from the last round (training-side SPEED_BUCKET_SECONDS lockstep, hard-coded 10 in test_speed_must_be_positive_multiple_of_10's regex) are unchanged. Both were explicitly deferrable, so no objection to leaving them for a follow-up.

LGTM, nothing else to flag.

Generated by Claude Code

feat(eval): configurable pi07 metadata on EnvConfig

26416a8

shuheng-liu added the feature New feature or request label May 12, 2026

shuheng-liu self-assigned this May 12, 2026

shuheng-liu marked this pull request as ready for review May 12, 2026 19:56

test(eval): cover add_eval_metadata + address review nits

2fae6a4

refactor(envs): derive CONTROL_MODE_CHOICES from Literal alias

fbdbf05

shuheng-liu merged commit ccdb27b into main May 12, 2026
7 checks passed

shuheng-liu deleted the feat/eval-metadata-config branch May 12, 2026 22:03

shuheng-liu merged 3 commits into main from feat/eval-metadata-config
