Skip to content

feat(eval): configurable pi07 metadata on EnvConfig#296

Merged
shuheng-liu merged 3 commits into
mainfrom
feat/eval-metadata-config
May 12, 2026
Merged

feat(eval): configurable pi07 metadata on EnvConfig#296
shuheng-liu merged 3 commits into
mainfrom
feat/eval-metadata-config

Conversation

@shuheng-liu
Copy link
Copy Markdown
Member

@shuheng-liu shuheng-liu commented May 12, 2026

What this does

Adds an EnvMetadataConfig dataclass attached to every EnvConfig subclass via the abstract base (CLI: --env.metadata.*), so the optional pi07 inference-time metadata fields — speed, quality, mistake, robot_type, control_mode — become configurable per eval run instead of always defaulting to "missing" / pad.

When a field is None (the default for every field), the corresponding batch key is omitted and the policy's prepare_metadata pad path produces no segment in the prefix → existing eval rollouts are byte-identical. When a field is set, the value is broadcast across the rollout batch with the dataset-matching dtype (torch.long for speed/quality, torch.bool for mistake, list[str] for robot_type/control_mode) and a parallel {key}_is_pad=False flag where applicable.

Note: mistake=False is semantically distinct from mistake=None. False emits a Mistake: False segment into the prefix; None omits the segment entirely.

Validation pinned in EnvMetadataConfig.__post_init__:

  • speed: positive int, multiple of SPEED_BUCKET_SECONDS (= 10 seconds) — matches feat(datasets): bucket speed by duration-seconds, not native frames #295's training-side bucket (duration-seconds rounded to 10 s). Constant exposed so the divisor only lives in one place on the eval side.
  • quality: int in [1, 5].
  • mistake: bool.
  • robot_type: non-empty string.
  • control_mode: Literal["joint", "ee"].

Only the pi07 family of policies consumes these keys today — setting them against a vanilla pi0 / pi05 eval will pass validation but the values get ignored downstream.

CLI usage:

opentau-eval --config_path=... \
    --env.metadata.speed=20 \
    --env.metadata.quality=3 \
    --env.metadata.mistake=false \
    --env.metadata.robot_type=UR5 \
    --env.metadata.control_mode=ee

response and subgoal* stay absent for now (the policy's same pad-default mechanism handles them, and there's no eval-time source for them yet). The gRPC inference server is out of scope — wiring these fields to deployment would need new proto fields in robot_inference.proto and matching unpacking in server.py::_prepare_observation.

The injection point is a new helper add_eval_metadata(observation, cfg) in envs/utils.py, called in the rollout loop in scripts/eval.py right after add_envs_task. The helper trusts cfg.env non-None because eval_policy dereferences cfg.env.type before the rollout loop ever runs.

Refs #295 — the training-side speed_raw is being switched to a 10-second duration bucket; this PR keeps the eval-side validator in lockstep so a sweep like --env.metadata.speed=20 lines up with values the model actually saw in training.

How it was tested

  • pre-commit run --files src/opentau/envs/configs.py src/opentau/envs/utils.py src/opentau/scripts/eval.py tests/envs/test_configs.py tests/envs/test_utils.py — all hooks pass (ruff lint+format, pyupgrade, bandit, typos, gitleaks, etc.).
  • pytest -m "not gpu" -n auto tests/configs tests/envs/test_utils.py tests/envs/test_configs.py tests/scripts — 315 passed, 0 failed. Coverage spans:
    • Config validator (tests/envs/test_configs.py::TestEnvMetadataConfig + ::TestEnvConfigMetadataField, 37 cases) — valid values, invalid speed / quality / mistake / robot_type / control_mode, the Literal["joint", "ee"] constraint, the all-None default field on a fresh EnvConfig subclass.
    • Injection helper (tests/envs/test_utils.py::TestAddEvalMetadata, 9 cases parametrised to 14 runs) — dtype/shape per field, mutated-in-place return contract, the False vs None semantic on mistake, partial-fields skip path, list-of-strings broadcast for string fields, device propagation (via device="meta" so the CPU suite still exercises the routing branch).
  • Pre-existing failures in tests/envs/test_factory.py and tests/utils/test_libero_utils.py under the full pytest -m "not gpu" run are unrelated — they're LIBERO config bootstrap issues (interactive input() from libero.libero.__init__ when ~/.libero/config.yaml is missing and LIBERO_CONFIG_PATH is unset). CI sets LIBERO_CONFIG_PATH=.github/assets/libero in cpu_test.yml, so they pass there.

How to checkout & try? (for the reviewer)

gh pr checkout 296
uv sync --extra dev
pytest -sx tests/envs/test_configs.py::TestEnvMetadataConfig \
           tests/envs/test_configs.py::TestEnvConfigMetadataField \
           tests/envs/test_utils.py::TestAddEvalMetadata

End-to-end sanity that the new keys reach the policy at eval time (drop a print(metadata[0]) inside prepare_metadata at the f"Metadata: {' '.join(segments)}" line, then run the LIBERO smoke):

opentau-eval --config_path=configs/dev/dev_config.json \
    --env.metadata.speed=20 \
    --env.metadata.robot_type=UR5 \
    --env.metadata.control_mode=ee
# Expect the prefix to include the substrings "Speed: 20", "Robot: UR5",
# and "Control: ee" (the actual format ends up with doubled spaces between
# segments because prepare_metadata does " ".join() on segments that already
# end with ", " — not relevant to this PR, just don't grep for one literal).

Also confirm the no-op default — same command without any --env.metadata.* overrides should print an empty metadata string (the policy's pad path).

Checklist

  • I have added Google-style docstrings to important functions and ensured function parameters are typed.
  • My PR includes policy-related changes.
    • If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

Note: Before submitting this PR, please read the contributor guideline.

@shuheng-liu shuheng-liu added the feature New feature or request label May 12, 2026
@shuheng-liu shuheng-liu self-assigned this May 12, 2026
@shuheng-liu shuheng-liu marked this pull request as ready for review May 12, 2026 19:56
Copy link
Copy Markdown
Member Author

@shuheng-liu shuheng-liu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed against prepare_metadata in src/opentau/policies/pi07/low_level/modeling_pi07_low_level.py (L1075–1151) and the training-time emitter BaseDataset._emit_optional_keys in src/opentau/datasets/lerobot_dataset.py (L855–963). The injection contract lines up: dtypes (torch.long for speed/quality, torch.bool for mistake, list[str] for robot_type/control_mode), per-sample _is_pad=False flags on the numeric keys only, no separate pad flag for the string keys (empty string is the pad signal at training time). The all-None default keeps the existing batches byte-identical with the pad path. Validation is well-targeted — the isinstance(x, bool) guards on the int fields are the right call.

A few things worth addressing or at least acknowledging before merge:

1. No unit test for add_eval_metadata itself. tests/envs/test_configs.py covers the validator thoroughly, but the injection helper in src/opentau/envs/utils.py has zero coverage — dtype/device propagation, (B,) broadcasting, the missing-key skip path, and the cfg.env is None early-return are all untested. A small CPU test against a fake observation dict (just {"state": torch.zeros(B, D)}) would lock in the contract that downstream prepare_metadata actually relies on. This is the highest-value gap.

2. Speed bucket size is implicitly coupled to #295. The validator hard-codes multiple of 10 in two places (the % 10 != 0 check and the error string). If #295 ever revisits the bucket size, this validator silently drifts. Consider extracting SPEED_BUCKET_SECONDS = 10 either next to CONTROL_MODE_CHOICES here, or — better — importing it from wherever the training-side bucketing lives, so the two sides can't diverge silently.

3. Architectural placement. EnvMetadataConfig is attached to every EnvConfig, but only pi07 / pi05_mem-class policies actually read these keys. Setting --env.metadata.speed=20 against a vanilla pi0 eval will pass validation, get broadcast into the obs dict, and then be silently ignored by the policy. Not a blocker (it does match the "property of what is being run" framing in the docstring), but worth a one-line note that only pi07-family policies consume these so a future reader doesn't grep for downstream effects in vain.

4. mistake=False vs mistake=None semantic. These produce very different prompts: False injects "Mistake: False, " into the prefix, None drops the segment entirely. The docstring lists "True / False, or None" without flagging this, and it's the most plausible footgun in the API surface (someone defaulting mistake=False thinking it's the "no mistake" knob when they actually wanted "leave it missing"). One sentence in the field's docstring would close this.

5. cfg.env is None early-return is dead. eval_policy dereferences cfg.env.type at src/opentau/scripts/eval.py:582 before rollout() ever runs, so by the time add_eval_metadata is called cfg.env is guaranteed non-None. Fine to keep as cheap defense-in-depth if the helper is meant to be reusable, but per the repo's "no error handling for impossible scenarios" rule it could be dropped.

6. PR description verification command has a small drift. prepare_metadata builds segments with trailing ", " and then ' '.join(...)s them, so the actual printed string has double spaces between fields: "Metadata: Speed: 20, Robot: UR5, Control: ee, ". The PR body expects a single space. Cosmetic — but the suggested print-and-eyeball sanity check won't match a substring search written with single spaces. (The underlying double-space format is pre-existing in prepare_metadata and not this PR's concern.)

Nits (non-blocking):

  • CONTROL_MODE_CHOICES = ("joint", "ee") is duplicated with the Literal["joint", "ee"] in the annotation. typing.get_args(EnvMetadataConfig.__annotations__["control_mode"]) (after unwrapping | None) would keep them in lockstep, though the current form is more readable. Up to you.
  • batch_size = observation["state"].shape[0] works because preprocess_observation always emits "state", but add_envs_task reaches for env.num_envs instead — passing the same signal here would make the dependency explicit. Minor.

Overall: shape of the PR looks right, contract with prepare_metadata is correct, validation is solid. Main ask is closing the test gap on add_eval_metadata.


Generated by Claude Code

@shuheng-liu
Copy link
Copy Markdown
Member Author

Addressed in 2fae6a4 — items 1–5 + nit-2 covered; item 6 covered in the PR body; nit-1 left as-is per below.

1. Test gap on add_eval_metadata — Added TestAddEvalMetadata in tests/envs/test_utils.py (9 cases, parametrised to 14 runs). Covers dtype + shape per field, the mutated-in-place return contract, the False vs None semantic on mistake (item 4), the partial-fields skip path, list-of-strings broadcast for the string fields, and device propagation via device="meta" so the CPU suite still exercises the routing branch.

2. Speed bucket coupled to #295 — Extracted SPEED_BUCKET_SECONDS = 10 next to CONTROL_MODE_CHOICES, used in both the modulo check and the error string. Added a comment flagging that the training side currently hard-codes the same literal 10 inside BaseDataset._emit_optional_keys; if either side ever needs to change the bucket they must change together. Once #295 lands and exposes its own constant, swapping to a cross-module import is a one-liner.

3. Architectural placement note — Added one sentence to the EnvMetadataConfig class docstring: "Only the pi07 family of policies consumes these keys today; setting them when evaluating another policy (e.g. pi0, pi05) will pass validation but the values will be ignored downstream."

4. mistake=False vs mistake=None foot-gun — Added the explicit note to the mistake field docstring, mirrored in the class docstring, and locked the behaviour in with test_mistake_false_vs_none_differ.

5. Dead cfg.env is None early-return — Dropped. Updated the helper's docstring with a one-line note that eval_policy dereferences cfg.env.type before the rollout loop, so the helper trusts its input.

6. PR body verification example double-space drift — Fixed in the description above: the suggested check is now phrased as three substring expectations ("Speed: 20", "Robot: UR5", "Control: ee") instead of one literal, with a parenthetical noting the doubled spaces come from prepare_metadata's " ".join(...) on segments that themselves end with ", " — out of scope for this PR but worth flagging.

Nits:

  • Skipping typing.get_args(...) for control_mode: the current CONTROL_MODE_CHOICES tuple is more readable, and the duplication is now only one line away from the annotation so drift is easy to spot.
  • Skipping the env.num_envs vs state.shape[0] swap: passing env into the helper just to read num_envs would widen the signature unnecessarily; the state-derived path keeps the helper's dependency surface tight to the observation dict, which add_envs_task already mutates.

Copy link
Copy Markdown
Member Author

@shuheng-liu shuheng-liu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reviewed against the new commit 2fae6a4 (test(eval): cover add_eval_metadata + address review nits). All five substantive points from the previous round are addressed:

  • add_eval_metadata test coverageTestAddEvalMetadata in tests/envs/test_utils.py pins dtype/shape/value per field, mutated-in-place contract, partial-field skip, the False vs None semantic on mistake, and device routing via device="meta" (nice trick — keeps the CPU suite self-contained while still exercising the device branch). The _make_cfg SimpleNamespace duck-type is a clean way to skip the full TrainPipelineConfig construction cost.
  • SPEED_BUCKET_SECONDS = 10 extracted with a code comment pointing at the training-side source. Validator and error string both go through it now.
  • Docstring note on pi07-only consumption — added.
  • mistake=False vs mistake=None foot-gun — called out in the docstring and pinned as a dedicated test case (test_mistake_false_vs_none_differ).
  • cfg.env is None early-return dropped, and the docstring now explains the invariant eval_policy provides.

Two small follow-ups remain (both non-blocking — either can be deferred to a follow-up PR):

  1. SPEED_BUCKET_SECONDS is still single-sided. The constant lives in src/opentau/envs/configs.py, but the comment notes the training-side round(duration_s / 10) * 10 (in BaseDataset._emit_optional_keys over in lerobot_dataset.py) still uses a literal 10. The comment is good defense, but the underlying lockstep risk isn't eliminated — only a shared constant (defined in the dataset module and imported here) closes the loop. Since the training-side change is #295's territory, a follow-up issue / TODO to hoist the constant once #295 lands would be enough.

  2. tests/envs/test_configs.py:205test_speed_must_be_positive_multiple_of_10 hard-codes 10 in its regex (match=r"speed must be a positive multiple of 10"). The validator builds the error from SPEED_BUCKET_SECONDS, so today this happens to match, but the test name and regex would silently miss a constant-only change. Cheap fix: match=rf"speed must be a positive multiple of {SPEED_BUCKET_SECONDS}", and rename to test_speed_must_be_positive_multiple_of_bucket (or similar). Cosmetic.

LGTM modulo those nits — happy to see this go in.


Generated by Claude Code

@shuheng-liu
Copy link
Copy Markdown
Member Author

Re-examined both nits:

Nit 1 (Literal/get_args) — Good catch, applied in fbdbf05. Introduced a ControlMode = Literal["joint", "ee"] alias and derived CONTROL_MODE_CHOICES = get_args(ControlMode), so the annotation and the runtime tuple are now one source. Smoke-verified get_args returns ("joint", "ee") and all 43 validator + helper tests still pass. The dataclass annotation reads control_mode: ControlMode | None = None, which keeps the on-disk JSON/CLI surface unchanged.

Nit 2 (state.shape[0] vs env.num_envs) — Still pushing back, with a more concrete read this time:

  • add_envs_task takes env because it structurally needs it: env.call("get_wrapper_attr", task_key) is its core operation, plus the len(tasks) != env.num_envs cross-check. num_envs is a free byproduct of a handle the helper already requires.
  • add_eval_metadata has no analogous need for env — it broadcasts a Python scalar across a batch dim and copies a device handle. Pulling env into the signature just to read num_envs widens the parameter surface for one scalar.
  • More importantly: observation["state"].shape[0] is the contract from preprocess_observation — the prior helper guarantees state has shape (B, max_state_dim) where B is the batch dim the policy will receive. Every downstream helper should align with that, not with env.num_envs (which is the configured env count, a different conceptual thing even if numerically equal today). If a future change ever introduces per-step batch trimming, state.shape[0] adapts and env.num_envs would silently mismatch.
  • The current helper does state = observation["state"]; batch_size = state.shape[0]; device = state.device — one dereference, two derived values. Splitting batch_size to come from env while device still comes from state would mix sources for no functional gain.

Happy to revisit if you'd like a different framing (e.g. an explicit _assert_obs_batch_dim(env, observation) helper) — but I'd lean toward leaving the asymmetry as-is and treating the divergence in signatures as faithfully reflecting what each helper actually needs.

Copy link
Copy Markdown
Member Author

@shuheng-liu shuheng-liu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reviewed fbdbf05 (refactor(envs): derive CONTROL_MODE_CHOICES from Literal alias) — clean addressing of the Literal / runtime-tuple duplication nit from round one. ControlMode = Literal["joint", "ee"] plus CONTROL_MODE_CHOICES = get_args(ControlMode) is the right shape; tuple[str, ...] annotation is correct for the get_args return, and no test behaviour changes since the produced tuple is identical (("joint", "ee")).

The two non-blocking items from the last round (training-side SPEED_BUCKET_SECONDS lockstep, hard-coded 10 in test_speed_must_be_positive_multiple_of_10's regex) are unchanged — both were explicitly defer-able, no objection to leaving them for a follow-up.

LGTM, nothing else to flag.


Generated by Claude Code

@shuheng-liu shuheng-liu merged commit ccdb27b into main May 12, 2026
7 checks passed
@shuheng-liu shuheng-liu deleted the feat/eval-metadata-config branch May 12, 2026 22:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant