RoboCasa: verify eval seed/split reproduces the benchmark eval protocol #383

## Problem

LIBERO pins canonical per-task initial states (`init_states` + `set_init_state`, sourced from the benchmark demos) so eval runs the same conditions as the benchmark and success rates are directly comparable to published numbers.

RoboCasa instead derives a per-worker seed (`worker_seed = seed + episode_index` in `RoboCasaEnv.reset`) and lets `RoboCasaGymEnv` sample the layout and style from it, in combination with the `split` (`pretrain`/`target`) selection. Seed-based held-out scenes may well be RoboCasa's intended protocol (it's how upstream LeRobot's wrapper works), but we have **not verified** that our seeding + split reproduces RoboCasa's official eval configurations.
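To make the concern concrete, here is a minimal sketch of seed-derived scene selection. Only the `worker_seed = seed + episode_index` formula comes from our code; `sample_scene`, `eval_scenes`, and the layout/style pools are hypothetical stand-ins, not RoboCasa's actual sampler.

```python
import random

# Hypothetical stand-ins: the real layout/style pools live in RoboCasa
# and are not reproduced here.
LAYOUT_IDS = list(range(10))
STYLE_IDS = list(range(12))


def worker_seed(seed: int, episode_index: int) -> int:
    """Seed derivation as described for RoboCasaEnv.reset."""
    return seed + episode_index


def sample_scene(seed: int) -> tuple[int, int]:
    """Toy sampler: deterministically map a seed to a (layout, style) pair."""
    rng = random.Random(seed)
    return rng.choice(LAYOUT_IDS), rng.choice(STYLE_IDS)


def eval_scenes(seed: int, n_episodes: int) -> list[tuple[int, int]]:
    """Scenes visited by one eval run under the seed + episode_index scheme."""
    return [sample_scene(worker_seed(seed, i)) for i in range(n_episodes)]
```

Under this scheme a run is reproducible for a fixed base seed, but the set of scenes depends entirely on that seed: `eval_scenes(1, 4)` is just `eval_scenes(0, 5)` shifted by one episode. Comparability with published numbers therefore hinges on whether the benchmark harness uses the same base seed and sampler, which is the part we have not checked.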

## Why it matters

If our seed/split scheme doesn't match the benchmark harness, RoboCasa success rates aren't comparable to published results, and runs aren't reproducible against the reference.

## Suggested approach

- Confirm the `split` semantics and seed protocol against RoboCasa's eval harness; document the mapping.
- Decide whether to pin specific eval scene configurations (the `init_states` analogue) for reproducibility, and check that the eval-time per-rank seed offset interacts correctly with the per-episode `worker_seed` offset.
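One way to check the rank/episode interaction is to enumerate the effective seeds and look for duplicates. This is a sketch under an assumption: it supposes the per-rank offset has the form `base_seed + rank * rank_stride`, which has not been confirmed against `src/opentau/scripts/eval.py`. The helper names are hypothetical.

```python
from collections import Counter


def effective_seeds(
    base_seed: int, n_ranks: int, episodes_per_rank: int, rank_stride: int = 1
) -> dict[tuple[int, int], int]:
    """Map (rank, episode_index) to the seed an episode would receive.

    Assumes rank_seed = base_seed + rank * rank_stride (unverified), followed
    by worker_seed = rank_seed + episode_index.
    """
    return {
        (rank, ep): base_seed + rank * rank_stride + ep
        for rank in range(n_ranks)
        for ep in range(episodes_per_rank)
    }


def duplicated_seeds(seeds: dict[tuple[int, int], int]) -> list[int]:
    """Return the seeds assigned to more than one (rank, episode) pair."""
    counts = Counter(seeds.values())
    return sorted(s for s, c in counts.items() if c > 1)
```

With a stride of 1, two ranks running three episodes each would share seeds 1 and 2, so those scenes would be evaluated twice and others not at all. A stride of at least `episodes_per_rank` avoids the overlap in this toy model.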

## References

- `RoboCasaEnv.reset` (`worker_seed`) in `src/opentau/envs/robocasa.py`; seeding in `src/opentau/scripts/eval.py`
- cf. LIBERO `init_states` (`src/opentau/envs/libero.py`, `src/opentau/envs/factory.py`)
