Problem
LIBERO pins canonical per-task initial states (init_states + set_init_state, sourced from the benchmark demos) so eval runs the same conditions as the benchmark and success rates are directly comparable to published numbers.
RoboCasa instead derives a per-worker seed (worker_seed = seed + episode_index in RoboCasaEnv.reset) and lets RoboCasaGymEnv sample layout/style from it, plus the split (pretrain/target) selection. Seed-based held-out scenes may well be RoboCasa's intended protocol (it's how upstream LeRobot's wrapper works), but we have not verified that our seeding + split reproduces RoboCasa's official eval configurations.
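To make the seeding concrete, here is a minimal sketch of the derivation described above. Only the formula worker_seed = seed + episode_index comes from our code; sample_scene, layout_ids and style_ids are hypothetical stand-ins for whatever RoboCasaGymEnv does internally, which is exactly the part we have not verified.

```python
import random


def derive_worker_seed(seed: int, episode_index: int) -> int:
    """Per-episode seed as described above: seed + episode_index."""
    return seed + episode_index


def sample_scene(worker_seed: int, layout_ids: list, style_ids: list) -> tuple:
    """Hypothetical stand-in for the layout/style sampling done by the env.

    Assumption: the env draws layout and style from an RNG seeded with
    worker_seed. The real sampling order and RNG may differ.
    """
    rng = random.Random(worker_seed)
    return rng.choice(layout_ids), rng.choice(style_ids)


# Hypothetical ID ranges, for illustration only.
layouts = list(range(10))
styles = list(range(10))

scene_a = sample_scene(derive_worker_seed(1000, 3), layouts, styles)
scene_b = sample_scene(derive_worker_seed(1000, 3), layouts, styles)
```

The same (seed, episode_index) pair always yields the same scene here, so runs are self-consistent. Whether those scenes coincide with the ones the benchmark harness evaluates is a separate question that this sketch cannot answer.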
Why it matters
If our seed/split scheme doesn't match the benchmark harness, RoboCasa success rates aren't comparable to published results, and runs aren't reproducible against the reference.
Suggested approach
- Confirm the split semantics and seed protocol against RoboCasa's eval harness; document the mapping.
- Decide whether to pin specific eval scene configurations (the init_states analogue) for reproducibility, and whether the eval-time per-rank seed offset interacts correctly.
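One way the per-rank offset could interact badly is by making ranks replay each other's scenes. The sketch below checks for that under a stated assumption: that eval adds rank * rank_stride to the base seed before the per-episode + episode_index offset. The name rank_stride and this formula are hypothetical; the actual offset in eval.py needs to be checked.

```python
from collections import Counter


def episode_seeds(base_seed: int, rank: int, episodes_per_rank: int, rank_stride: int) -> list:
    """Seeds one rank would use, assuming rank_seed = base_seed + rank * rank_stride."""
    rank_seed = base_seed + rank * rank_stride  # hypothetical per-rank offset
    return [rank_seed + episode_index for episode_index in range(episodes_per_rank)]


def duplicated_seeds(base_seed: int, world_size: int, episodes_per_rank: int, rank_stride: int) -> list:
    """Return the seeds that more than one (rank, episode) pair would use."""
    counts = Counter(
        s
        for rank in range(world_size)
        for s in episode_seeds(base_seed, rank, episodes_per_rank, rank_stride)
    )
    return sorted(s for s, c in counts.items() if c > 1)


# With a stride of 1, rank 0 uses [0, 1, 2] and rank 1 uses [1, 2, 3].
overlap_small_stride = duplicated_seeds(0, world_size=2, episodes_per_rank=3, rank_stride=1)
# With a stride equal to episodes_per_rank, the ranges are disjoint.
overlap_wide_stride = duplicated_seeds(0, world_size=2, episodes_per_rank=3, rank_stride=3)
```

If the real offset behaves like the stride-1 case, some scenes are evaluated twice and the effective number of distinct eval scenes is smaller than the episode count suggests. Pinning an explicit list of scene configurations would sidestep this entirely.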
References
- RoboCasaEnv.reset (worker_seed) in src/opentau/envs/robocasa.py; seeding in src/opentau/scripts/eval.py
- cf. LIBERO init_states (src/opentau/envs/libero.py, src/opentau/envs/factory.py)