Context
The initial RoboCasa365 sim-eval integration (src/opentau/envs/robocasa.py, RoboCasaEnv config, factory dispatch) makes the eval half of the loop work: parallel vec envs, success-rate aggregation, and grid_summary wandb videos all run against the real sim. But RoboCasa is currently eval-only — there is no RoboCasa training data in the dataset mixture and no projection head sized for its action/state, so eval has nothing meaningful to run.
LIBERO, by contrast, is a closed loop: TensorAuto/libero (20 fps v2.1) is co-trained, then eval on the LIBERO sim yields benchmark-comparable success rates.
Problem
- No RoboCasa LeRobot dataset is wired into DatasetMixtureConfig.
- RoboCasa's robot (PandaOmron) has a 12-D action, a 16-D state, and 3 cameras, distinct from LIBERO's 7-D action / 8-D state. There is no validated per-(robot_type, control_mode) projection head for it, and no norm stats.
- The example eval config (configs/examples/pi05_robocasa_eval_config.json) loads a LIBERO checkpoint as a plumbing smoke test, so rollouts are effectively random and "success rate" is not yet meaningful for RoboCasa.
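To make the dimension mismatch concrete, here is a minimal sketch of the kind of check that would flag a LIBERO-sized head being run against the RoboCasa env. `EmbodimentSpec`, `dim_mismatches`, and the `"libero"` label are hypothetical names for illustration, not opentau APIs; only the dimensions come from this issue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbodimentSpec:
    """Hypothetical container for illustration; not an opentau class."""
    robot_type: str
    action_dim: int
    state_dim: int


# Dimensions as stated in this issue; the "libero" label is an assumption.
LIBERO = EmbodimentSpec("libero", action_dim=7, state_dim=8)
PANDA_OMRON = EmbodimentSpec("PandaOmron", action_dim=12, state_dim=16)


def dim_mismatches(head: EmbodimentSpec, env: EmbodimentSpec) -> list[str]:
    """List the action/state dimensions where a policy head and an env disagree."""
    problems = []
    if head.action_dim != env.action_dim:
        problems.append(f"action: head {head.action_dim}-D vs env {env.action_dim}-D")
    if head.state_dim != env.state_dim:
        problems.append(f"state: head {head.state_dim}-D vs env {env.state_dim}-D")
    return problems


if __name__ == "__main__":
    for problem in dim_mismatches(LIBERO, PANDA_OMRON):
        print(problem)
```

A check like this could run once at eval startup so that a mismatched checkpoint fails loudly instead of producing uninterpretable rollouts.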
Why it matters
This is the single thing that turns RoboCasa from "validated plumbing" into a real benchmark sibling of LIBERO. Until a RoboCasa-trained policy exists, the eval metrics are not interpretable.
Suggested approach
- Add a RoboCasa LeRobot dataset (e.g. the lerobot/robocasa_* repos) to the mixture and confirm image/state/action keys line up with the env's features_map.
- Add a (robot_type="PandaOmron", control_mode=...) projection head for the 12-D action / 16-D state, reusing the per-(robot_type, control_mode) projection machinery (cf. #371 / #370 / #374), and confirm norm stats.
- Train a RoboCasa policy and eval it on CloseFridge (the end-to-end validation that makes "success rate" mean something).
References
- src/opentau/envs/robocasa.py, RoboCasaEnv in src/opentau/envs/configs.py
- Per-(robot_type, control_mode) projections: PRs #370 (feat(pi07_paligemma): per-(robot_type, control_mode) state/action projections), #371 (feat(pi07): per-(robot_type, control_mode) state/action projections), #374 (feat(train): aggregate val loss per (dataset, control_mode), parallelize)
Follow-up to the initial RoboCasa env integration (branch claude/lucid-albattani-b33067).
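For the dataset and norm-stats steps under Suggested approach, a rough sketch of the two checks might look like the following. It assumes features_map maps env-side names to dataset keys, which may not match the real structure in src/opentau/envs/robocasa.py. The key names in the usage section are made up for illustration, and `missing_keys` and `norm_stats` are hypothetical helpers, not opentau functions.

```python
from statistics import fmean, pstdev


def missing_keys(features_map: dict[str, str], dataset_keys: set[str]) -> list[str]:
    """Return dataset keys the env mapping expects but the dataset lacks.

    Assumption: features_map maps env-side names to dataset keys.
    """
    return sorted(key for key in features_map.values() if key not in dataset_keys)


def norm_stats(rows: list[list[float]]) -> dict[str, list[float]]:
    """Per-dimension mean and population std over equal-length vectors."""
    if not rows:
        raise ValueError("need at least one vector")
    dim = len(rows[0])
    if any(len(row) != dim for row in rows):
        raise ValueError("all vectors must have the same length")
    columns = list(zip(*rows))
    return {
        "mean": [fmean(col) for col in columns],
        "std": [pstdev(col) for col in columns],
    }


if __name__ == "__main__":
    # Hypothetical key names, for illustration only.
    features_map = {
        "agentview": "observation.images.cam0",
        "state": "observation.state",
        "action": "action",
    }
    dataset_keys = {"observation.state", "action"}
    print(missing_keys(features_map, dataset_keys))

    # Two fake 12-D PandaOmron actions; real stats would come from the dataset.
    actions = [[0.0] * 12, [2.0] * 12]
    stats = norm_stats(actions)
    assert len(stats["mean"]) == 12
```

Running both checks before training would catch a missing camera key or a wrongly sized stats vector early, before a full co-training run.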