feat(scripts): add annotate_mistakes.py for per-frame mistake labels by WilliamYue37 · Pull Request #280 · TensorAuto/OpenTau

WilliamYue37 · 2026-05-06T22:30:23Z

What this does

Adds src/opentau/scripts/annotate_mistakes.py, a sibling to annotate_subtasks.py. For every episode in a dataset mixture config, the script:

Reads the per-frame response column already written by annotate_subtasks.py from each episode parquet, and treats every contiguous run of identical response values as a single subtask segment.
Decodes the dataset's camera0 video (resolved with the same lookup chain as annotate_subtasks.py) once per episode and pulls the last frame of each contiguous run — no temporal subsampling, just one frame per segment.
Sends that single frame plus the segment's subtask string to a VLM (default: gemini-robotics-er-1.6-preview; Anthropic Claude also supported via --model) and asks for a {"success": bool, "reason": str} JSON verdict.
Sets every parquet row in the segment to mistake=1 if the VLM reports failure, 0 otherwise. Any parse / API failure defaults to 0 (no mistake).
Atomically rewrites the episode parquet with the new int64 mistake column and registers mistake in meta/info.json features ({"dtype": "int64", "shape": (1,), "names": None}) the first time it is added to a dataset.
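For reviewers who want the segmentation step in concrete terms, here is a minimal sketch, not the code in this PR. The names `find_response_runs` and `expand_labels` are hypothetical; the real helper is `_find_response_runs`, whose signature is not shown in this thread. Skipping `None` and empty-string runs is assumed from the test names listed in the review summary.

```python
from itertools import groupby


def find_response_runs(responses):
    """Return (start, end_exclusive, text) for each contiguous run of equal values.

    Assumption: runs whose value is None or "" carry no subtask and are skipped.
    """
    runs = []
    index = 0
    for value, group in groupby(responses):
        length = sum(1 for _ in group)
        if value:  # falsy values (None, "") are not treated as segments
            runs.append((index, index + length, value))
        index += length
    return runs


def expand_labels(n_frames, runs, verdicts):
    """Build a per-frame 0/1 mistake list from one success verdict per run."""
    mistake = [0] * n_frames
    for (start, end, _), success in zip(runs, verdicts):
        if not success:
            mistake[start:end] = [1] * (end - start)
    return mistake


responses = ["pick cup", "pick cup", "place cup", "place cup", "place cup", None]
runs = find_response_runs(responses)
# The frame sent to the VLM is the last frame of each run.
last_frames = [end - 1 for _, end, _ in runs]
# Pretend the VLM judged the first segment a success and the second a failure.
labels = expand_labels(len(responses), runs, verdicts=[True, False])
```

Frames outside any run keep the default label 0, the same value the description uses for parse and API failures.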

Resumability semantics:

Episodes whose parquet already has a mistake column in the schema are skipped (cheap O(1) check via pq.read_metadata).
Episodes whose parquet has no response column are skipped with a warning — run annotate_subtasks.py first.
Frames are still spatially downsampled / center-cropped to --target-size (default 448) to keep image-token cost bounded; we never upsample.
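The skip check above is only safe if a parquet never exists in a half-written state, which is what the atomic rewrite in the description provides. The thread does not show how the script performs that rewrite, so the following is only a sketch of one common standard-library pattern (temp file in the same directory, then `os.replace`). `atomic_write_bytes` is a hypothetical name, and raw bytes stand in for the parquet payload.

```python
import os
import tempfile


def atomic_write_bytes(path, data):
    """Write data to path so readers see either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)  # swaps the new file into place in one step
    except BaseException:
        # Clean up the partial temp file; the original at `path` is untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With a pattern like this, a crash mid-write leaves the original episode file without a mistake column, so the next run picks that episode up again.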

Helpers shared with annotate_subtasks.py (_resize_and_center_crop, _to_jpeg_bytes, _to_b64_jpeg, _is_gemini_model, _resolve_camera0_video_key, _resolve_root, _load_datasets_from_config) are imported rather than duplicated.

How it was tested

python -c "from opentau.scripts import annotate_mistakes" — module imports cleanly.
pre-commit run --files src/opentau/scripts/annotate_mistakes.py — all hooks pass (ruff, ruff-format, pyupgrade, bandit, gitleaks, license header).
No automated tests added — the script is a thin orchestrator over the Anthropic / Gemini SDKs and PyAV decoding, identical in shape to annotate_subtasks.py which also has no tests. End-to-end exercise requires API keys + a real LeRobot dataset that has already been processed by annotate_subtasks.py.

How to checkout & try? (for the reviewer)

# Default Gemini ER path (requires GEMINI_API_KEY or GOOGLE_API_KEY)
python src/opentau/scripts/annotate_mistakes.py \
    --config-path configs/examples/train_mixture_config.json \
    --max-episodes-per-dataset 1

# Claude path
python src/opentau/scripts/annotate_mistakes.py \
    --config-path configs/examples/train_mixture_config.json \
    --model claude-opus-4-7 \
    --max-episodes-per-dataset 1

Checklist

I have added Google-style docstrings to important functions and ensured function parameters are typed.
My PR includes policy-related changes.
- If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

Note: Before submitting this PR, please read the contributor guideline.

claude

superseded by next review

claude

Inline findings below. None are blocking — primary concerns are (1) bool(parsed["success"]) returning True for stringy "false" and (2) info.json being updated before any parquet is annotated. See summary comment for the full list.

claude · 2026-05-06T22:34:40Z

[claude-review] summary for commit bcfaebb

Latest commit bcfaebb ("address review feedback on #280") cleanly addresses every prior finding from the review history. Verified by re-reading the current state of src/opentau/scripts/annotate_mistakes.py, tests/scripts/test_annotate_mistakes.py, docs/source/tutorials/datasets.rst, and configs/examples/annotate_mistakes_example.json:

_parse_success_response now rejects non-bool success values, with test_rejects_int correctly anticipating that bool subclasses int (so a literal 1 is not silently accepted).
Claude max_tokens=32000 → model-agnostic max_tokens=1024, with a comment explaining the choice fits the smallest 3.5-family ceiling without underutilizing larger models.
_call_gemini_single carries a comment documenting why the prior thinking_budget=0 workaround was retired (default budget verified empirically to fit reasoning + one-line JSON).
_query_subtask_success returns (success, ok); _annotate_episode returns (processed, n_api_failures, n_missing_frames); the per-dataset summary log surfaces both totals so quiet API outages and short-video frame misses no longer masquerade as "no mistakes found".
mistake feature registration in info.json is deferred inside the per-episode loop until after the first parquet has actually been rewritten — matches the resumability claim in the docs.
New tests/scripts/test_annotate_mistakes.py covers _find_response_runs (empty / single / multiple runs / None / empty-string / alternating / singleton) and _parse_success_response (plain true/false, JSON / bare fences, surrounding whitespace, stringy "false" rejection, stringy "true" rejection, int 1 rejection, missing key, non-object, invalid JSON).
--hub-cache-dir help text explains the deliberate match with annotate_subtasks.py for cache reuse; the docs flag table mirrors the same note.
configs/examples/annotate_mistakes_example.json pins "revision": "v2.1" on the Hub repo_id, mirroring the docs' v2.1-only support note. Tutorial section and module docstring both link the new example file.
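To illustrate the strict validation described in the first item, here is a hedged sketch of a parser that fails on anything other than a real JSON boolean. It is not the merged `_parse_success_response`; the name `parse_success_response`, the fence handling, and the choice of `ValueError` are assumptions for illustration. The motivating pitfall is that `bool("false")` is `True` in Python, because any non-empty string is truthy.

```python
import json

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable


def parse_success_response(text):
    """Return the boolean verdict, or raise ValueError on anything malformed."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (bare or tagged "json") and the closing fence.
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        cleaned = cleaned.rsplit(FENCE, 1)[0].strip()
    try:
        parsed = json.loads(cleaned)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc}") from exc
    if not isinstance(parsed, dict) or "success" not in parsed:
        raise ValueError("expected an object with a 'success' key")
    success = parsed["success"]
    # isinstance(1, bool) is False even though bool subclasses int,
    # so a literal 1 is rejected rather than read as True.
    if not isinstance(success, bool):
        raise ValueError(f"'success' must be a bool, got {type(success).__name__}")
    return success
```

A caller that maps every `ValueError` to mistake=0 then never flips a verdict because of a stringy or numeric value.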

Stale inline findings from prior reviews have been deleted (the GitHub dismissals API rejects COMMENTED reviews, so deletion is the only way to clear line-anchored findings carried forward across runs).

No blocking issues found.

claude

Inline findings below. None are blocking — primary concerns are (1) bool(parsed["success"]) returning True for stringy "false" and (2) info.json being updated before any parquet is annotated. The new docs commit only adds tutorial prose; the code findings carry over unchanged. See summary for the full list.

claude

Latest commit e12aa55 only adds configs/examples/annotate_mistakes_example.json — annotate_mistakes.py is byte-identical to db60953, so the prior six inline findings carry over unchanged on the same lines. One new nit on the example config itself: the docs (docs/source/tutorials/datasets.rst) and the script's own docstring (src/opentau/scripts/annotate_mistakes.py:53) still cite configs/examples/train_mixture_config.json rather than the newly-added example. See summary for the full list.

claude

Inline findings below — none blocking. Latest commit's thinking_budget=0 workaround is correct; flagged a follow-up since annotate_subtasks.py calls the same Gemini ER family without it. See summary comment for the full list.

claude

Inline findings below — none blocking. Latest commit (c7eeaf9 removing the output token caps) is correct in both branches; raised one minor follow-up on the hardcoded Claude max_tokens=32000. Prior code-level findings carry over unchanged on the new line numbers (re-anchored after the +2-line shift in _call_claude_single). See summary for the full list.

claude

Inline findings below — none blocking. Latest commit re-enables Gemini ER thinking; this resolves the prior cross-script inconsistency vs annotate_subtasks.py. Remaining findings carry over with re-anchored line numbers (-6 lines after the gemini_config simplification).

WilliamYue37 · 2026-05-07T00:12:01Z

@claude fix

@claude-review

- addresses @claude-review (max_tokens): replaced hardcoded 32000 cap on the Anthropic call with a model-agnostic 1024 cap, fitting inside the smallest claude-3.5 ceiling and updated comment accordingly.
- addresses @claude-review (bool coercion): _parse_success_response now rejects non-bool 'success' values (stringy "false" / int 1) instead of silently coercing; the caller treats parse errors as mistake=0, so failing closed is safer than flipping the verdict.
- addresses @claude-review (gemini thinking note): added a one-line comment in _call_gemini_single explaining why the prior thinking_budget=0 workaround was retired (default ceiling fits both reasoning and one-line JSON).
- addresses @claude-review (silent API failures): _query_subtask_success now returns (success, ok); _annotate_episode aggregates per-segment API failures and missing-frame counts and surfaces both in the per-dataset summary log so a quiet outage doesn't masquerade as "no mistakes found".
- addresses @claude-review (frame-extraction summary): same change above also surfaces n_missing_frames in the per-dataset summary.
- addresses @claude-review (write_info ordering): deferred registering the 'mistake' feature in info.json until after the first parquet is successfully rewritten, so a mid-dataset crash before any annotation doesn't leave info.json advertising a column that exists nowhere.
- addresses @claude-review (unit tests): added tests/scripts/test_annotate_mistakes.py with 20 tests covering _find_response_runs (empty/single/skip-empty/skip-None/alternating) and _parse_success_response (fences, whitespace, bool/string/int/missing-key/non-object/invalid-json edge cases).
- addresses @claude-review (hub-cache-dir docs): expanded the --hub-cache-dir help text and the docs flag table to explain that the default deliberately matches annotate_subtasks.py for cache reuse.
- addresses @claude-review (example config + v2.1 pin): pinned "revision": "v2.1" on the example config's repo_id, and rewrote the tutorial / docstring to point at configs/examples/annotate_mistakes_example.json instead of the unrelated train_mixture_config.json.

tests: passed (pytest -m "not gpu" -n auto tests/scripts/test_annotate_mistakes.py tests/scripts/test_annotate_subtasks.py)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
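The (success, ok) plumbing described in the commit message can be pictured with the sketch below. It is an illustration under assumptions, not the merged code: `query_subtask_success`, `annotate_episode`, `call_vlm`, and the dict-of-frames input are hypothetical, and the return shape differs from the real `_annotate_episode`, which returns (processed, n_api_failures, n_missing_frames).

```python
def query_subtask_success(call_vlm, frame, subtask):
    """Return (success, ok); ok is False when the call or parse raised."""
    try:
        return bool(call_vlm(frame, subtask)), True
    except Exception:
        # Treat the segment as "no mistake" but report the failure to the caller.
        return True, False


def annotate_episode(segments, frames, call_vlm):
    """segments: (start, end_exclusive, subtask) tuples; frames: index -> image."""
    n_frames = max((end for _, end, _ in segments), default=0)
    mistake = [0] * n_frames
    n_api_failures = 0
    n_missing_frames = 0
    for start, end, subtask in segments:
        frame = frames.get(end - 1)  # last frame of the segment
        if frame is None:
            n_missing_frames += 1
            continue
        success, ok = query_subtask_success(call_vlm, frame, subtask)
        if not ok:
            n_api_failures += 1
        elif not success:
            mistake[start:end] = [1] * (end - start)
    return mistake, n_api_failures, n_missing_frames


def flaky_vlm(frame, subtask):
    """Stand-in for the VLM: one simulated outage, one reported failure."""
    if subtask == "open drawer":
        raise TimeoutError("simulated outage")
    return subtask != "place cup"


segments = [(0, 2, "pick cup"), (2, 4, "place cup"), (4, 6, "open drawer"), (6, 8, "close drawer")]
frames = {1: "f1", 3: "f3", 5: "f5"}  # frame 7 could not be decoded
mistake, n_api_failures, n_missing_frames = annotate_episode(segments, frames, flaky_vlm)
```

Both counters stay separate from the labels, so a dataset whose labels are all 0 can still be told apart from one where every call failed.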

claude

Latest commit bcfaebb cleanly addresses every prior finding: _parse_success_response now strict-validates isinstance(success, bool) (with a test that rejects int 1 since bool subclasses int); Claude max_tokens dropped from 32000 to a model-agnostic 1024; Gemini-thinking-retired comment added; (success, ok) tuple plumbed through so API failures surface in the dataset summary; n_missing_frames likewise; write_info deferred until after the first parquet rewrite; 20 unit tests added covering both helpers; --hub-cache-dir help text now explains the shared default; example config pins revision: v2.1 and is referenced from the tutorial. No new blocking issues. Cleared all prior inline comments; see summary for the updated state.

feat(scripts): add annotate_mistakes.py for per-frame mistake labels

da34a3d

claude Bot reviewed May 6, 2026

docs(datasets): document annotate_mistakes.py

db60953

claude Bot reviewed May 6, 2026

WilliamYue37 self-assigned this May 6, 2026

WilliamYue37 added the feature New feature or request label May 6, 2026

docs(configs): add annotate_mistakes example mixture

e12aa55

claude Bot reviewed May 6, 2026

fix(annotate_mistakes): disable Gemini ER thinking to free output budget

a278704

claude Bot reviewed May 6, 2026

feat(annotate_mistakes): remove output token caps

c7eeaf9

claude Bot reviewed May 7, 2026

feat(annotate_mistakes): re-enable Gemini ER thinking

67ea92c

claude Bot reviewed May 7, 2026

WilliamYue37 requested review from akshay18iitg and shuheng-liu May 7, 2026 00:27

WilliamYue37 added the enhancement label May 7, 2026

shuheng-liu approved these changes May 7, 2026

WilliamYue37 merged commit 4de4b54 into main May 7, 2026
9 checks passed

WilliamYue37 deleted the feat/annotate-mistakes branch May 7, 2026 17:00

WilliamYue37 merged 7 commits into main from feat/annotate-mistakes

2 participants
