RoboCasa: verify eval seed/split reproduces the benchmark eval protocol #383

@shuheng-liu

Problem

LIBERO pins canonical per-task initial states (init_states + set_init_state, sourced from the benchmark demos) so eval runs the same conditions as the benchmark and success rates are directly comparable to published numbers.

RoboCasa instead derives a per-worker seed (worker_seed = seed + episode_index in RoboCasaEnv.reset) and lets RoboCasaGymEnv sample layout/style from it, on top of the split (pretrain/target) selection. Seed-based held-out scenes may well be RoboCasa's intended protocol (it is how upstream LeRobot's wrapper works), but we have not verified that our seeding and split selection reproduce RoboCasa's official eval configurations.
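To make the seed-driven behavior concrete, here is a minimal sketch of the derivation described above. Only the formula worker_seed = seed + episode_index comes from this issue; sample_scene, n_layouts, and n_styles are hypothetical stand-ins for whatever RoboCasaGymEnv does internally, and the counts are placeholders rather than RoboCasa's real values.

```python
import random


def worker_seed(seed: int, episode_index: int) -> int:
    # Formula as described in this issue for RoboCasaEnv.reset.
    return seed + episode_index


def sample_scene(ws: int, n_layouts: int = 10, n_styles: int = 10) -> tuple[int, int]:
    # Hypothetical stand-in for the layout/style sampling; the real
    # sampler and the real layout/style counts are not reproduced here.
    rng = random.Random(ws)
    return rng.randrange(n_layouts), rng.randrange(n_styles)


def scene_sequence(seed: int, n_episodes: int) -> list[tuple[int, int]]:
    # The scenes an eval run would visit, one per episode.
    return [sample_scene(worker_seed(seed, i)) for i in range(n_episodes)]


if __name__ == "__main__":
    for i, scene in enumerate(scene_sequence(seed=1000, n_episodes=5)):
        print(f"episode {i}: worker_seed={worker_seed(1000, i)} scene={scene}")
```

The sequence is repeatable for a fixed base seed, but nothing in it ties the visited scenes to a benchmark-defined list, which is the gap relative to LIBERO's pinned init_states.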

Why it matters

If our seed/split scheme doesn't match the benchmark harness, RoboCasa success rates aren't comparable to published results, and runs aren't reproducible against the reference.

Suggested approach

  • Confirm the split semantics and seed protocol against RoboCasa's eval harness; document the mapping.
  • Decide whether to pin specific eval scene configurations (the init_states analogue) for reproducibility, and whether the eval-time per-rank seed offset interacts correctly with the worker_seed derivation.
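One way to check the per-rank interaction is to enumerate the effective seeds and look for duplicates. The sketch below assumes the per-rank offset has the form base_seed + rank * rank_stride; that form, and the names effective_seeds, collisions, and rank_stride, are assumptions for illustration and are not taken from src/opentau/scripts/eval.py.

```python
from collections import defaultdict


def effective_seeds(base_seed: int, n_ranks: int, episodes_per_rank: int,
                    rank_stride: int = 1) -> dict[int, list[tuple[int, int]]]:
    """Map each effective seed to the (rank, episode_index) pairs that produce it."""
    owners: dict[int, list[tuple[int, int]]] = defaultdict(list)
    for rank in range(n_ranks):
        rank_seed = base_seed + rank * rank_stride  # assumed per-rank offset
        for episode_index in range(episodes_per_rank):
            # worker_seed = seed + episode_index, as described in this issue
            owners[rank_seed + episode_index].append((rank, episode_index))
    return dict(owners)


def collisions(owners: dict[int, list[tuple[int, int]]]) -> dict[int, list[tuple[int, int]]]:
    """Keep only the seeds that more than one (rank, episode) pair would use."""
    return {seed: pairs for seed, pairs in owners.items() if len(pairs) > 1}


if __name__ == "__main__":
    # A stride of 1 makes neighboring ranks reuse each other's seeds.
    print(collisions(effective_seeds(0, n_ranks=2, episodes_per_rank=3, rank_stride=1)))
    # A stride of at least episodes_per_rank keeps the ranges disjoint.
    print(collisions(effective_seeds(0, n_ranks=2, episodes_per_rank=3, rank_stride=3)))
```

If the real offset does overlap in this way, ranks would evaluate duplicate scenes and the run would cover fewer distinct configurations than the episode count suggests.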

References

  • RoboCasaEnv.reset (worker_seed) in src/opentau/envs/robocasa.py; seeding in src/opentau/scripts/eval.py
  • cf. LIBERO init_states (src/opentau/envs/libero.py, src/opentau/envs/factory.py)
