feat(scripts): add annotate_mistakes.py for per-frame mistake labels#280
Conversation
|
[claude-review] summary for commit bcfaebb Latest commit
Stale inline findings from prior reviews have been deleted (the GitHub dismissals API rejects COMMENTED reviews, so deletion is the only way to clear line-anchored findings carried forward across runs). No blocking issues found. |
There was a problem hiding this comment.
Inline findings below. None are blocking — primary concerns are (1) bool(parsed["success"]) returning True for stringy "false" and (2) info.json being updated before any parquet is annotated. The new docs commit only adds tutorial prose; the code findings carry over unchanged. See summary for the full list.
There was a problem hiding this comment.
Latest commit e12aa55 only adds configs/examples/annotate_mistakes_example.json — annotate_mistakes.py is byte-identical to db60953, so the prior six inline findings carry over unchanged on the same lines. One new nit on the example config itself: the docs (docs/source/tutorials/datasets.rst) and the script's own docstring (src/opentau/scripts/annotate_mistakes.py:53) still cite configs/examples/train_mixture_config.json rather than the newly-added example. See summary for the full list.
There was a problem hiding this comment.
Inline findings below — none blocking. Latest commit (c7eeaf9 removing the output token caps) is correct in both branches; raised one minor follow-up on the hardcoded Claude max_tokens=32000. Prior code-level findings carry over unchanged on the new line numbers (re-anchored after the +2-line shift in _call_claude_single). See summary for the full list.
There was a problem hiding this comment.
Inline findings below — none blocking. Latest commit re-enables Gemini ER thinking; this resolves the prior cross-script inconsistency vs annotate_subtasks.py. Remaining findings carry over with re-anchored line numbers (-6 lines after the gemini_config simplification).
|
@claude fix |
- addresses @claude-review (max_tokens): replaced hardcoded 32000 cap on the Anthropic call with a model-agnostic 1024 cap, fitting inside the smallest claude-3.5 ceiling and updated comment accordingly. - addresses @claude-review (bool coercion): _parse_success_response now rejects non-bool 'success' values (stringy "false" / int 1) instead of silently coercing — caller treats parse errors as mistake=0, so failing closed is safer than flipping the verdict. - addresses @claude-review (gemini thinking note): added a one-line comment in _call_gemini_single explaining why the prior thinking_budget=0 workaround was retired (default ceiling fits both reasoning and one-line JSON). - addresses @claude-review (silent API failures): _query_subtask_success now returns (success, ok); _annotate_episode aggregates per-segment API failures and missing-frame counts and surfaces both in the per-dataset summary log so a quiet outage doesn't masquerade as "no mistakes found". - addresses @claude-review (frame-extraction summary): same change above also surfaces n_missing_frames in the per-dataset summary. - addresses @claude-review (write_info ordering): deferred registering the 'mistake' feature in info.json until after the first parquet is successfully rewritten, so a mid-dataset crash before any annotation doesn't leave info.json advertising a column that exists nowhere. - addresses @claude-review (unit tests): added tests/scripts/test_annotate_mistakes.py with 20 tests covering _find_response_runs (empty/single/skip-empty/skip-None/alternating) and _parse_success_response (fences, whitespace, bool/string/int/ missing-key/non-object/invalid-json edge cases). - addresses @claude-review (hub-cache-dir docs): expanded the --hub-cache-dir help text and the docs flag table to explain that the default deliberately matches annotate_subtasks.py for cache reuse. - addresses @claude-review (example config + v2.1 pin): pinned "revision": "v2.1" on the example config's repo_id, and rewrote the tutorial / docstring to point at configs/examples/annotate_mistakes_example.json instead of the unrelated train_mixture_config.json. tests: passed — pytest -m "not gpu" -n auto tests/scripts/test_annotate_mistakes.py tests/scripts/test_annotate_subtasks.py Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Latest commit bcfaebb cleanly addresses every prior finding: _parse_success_response now strict-validates isinstance(success, bool) (with a test that rejects int 1 since bool subclasses int); Claude max_tokens dropped from 32000 to a model-agnostic 1024; Gemini-thinking-retired comment added; (success, ok) tuple plumbed through so API failures surface in the dataset summary; n_missing_frames likewise; write_info deferred until after the first parquet rewrite; 20 unit tests added covering both helpers; --hub-cache-dir help text now explains the shared default; example config pins revision: v2.1 and is referenced from the tutorial. No new blocking issues. Cleared all prior inline comments; see summary for the updated state.
What this does
Adds
src/opentau/scripts/annotate_mistakes.py, a sibling toannotate_subtasks.py. For every episode in a dataset mixture config, the script:responsecolumn already written byannotate_subtasks.pyfrom each episode parquet, and treats every contiguous run of identicalresponsevalues as a single subtask segment.camera0video (resolved with the same lookup chain asannotate_subtasks.py) once per episode and pulls the last frame of each contiguous run — no temporal subsampling, just one frame per segment.gemini-robotics-er-1.6-preview; Anthropic Claude also supported via--model) and asks for a{"success": bool, "reason": str}JSON verdict.mistake=1if the VLM reports failure,0otherwise. Any parse / API failure defaults to0(no mistake).int64mistakecolumn and registersmistakeinmeta/info.jsonfeatures ({"dtype": "int64", "shape": (1,), "names": None}) the first time it is added to a dataset.Resumability semantics:
mistakecolumn in the schema are skipped (cheap O(1) check viapq.read_metadata).responsecolumn are skipped with a warning — runannotate_subtasks.pyfirst.--target-size(default 448) to keep image-token cost bounded; we never upsample.Helpers shared with
annotate_subtasks.py(_resize_and_center_crop,_to_jpeg_bytes,_to_b64_jpeg,_is_gemini_model,_resolve_camera0_video_key,_resolve_root,_load_datasets_from_config) are imported rather than duplicated.How it was tested
python -c "from opentau.scripts import annotate_mistakes"— module imports cleanly.pre-commit run --files src/opentau/scripts/annotate_mistakes.py— all hooks pass (ruff, ruff-format, pyupgrade, bandit, gitleaks, license header).annotate_subtasks.pywhich also has no tests. End-to-end exercise requires API keys + a real LeRobot dataset that has already been processed byannotate_subtasks.py.How to checkout & try? (for the reviewer)
Checklist
Note: Before submitting this PR, please read the contributor guideline.