feat(scripts): add annotate_mistakes.py for per-frame mistake labels#280

Merged
WilliamYue37 merged 7 commits into main from feat/annotate-mistakes on May 7, 2026

@WilliamYue37
Member

What this does

Adds src/opentau/scripts/annotate_mistakes.py, a sibling to annotate_subtasks.py. For every episode in a dataset mixture config, the script:

  1. Reads the per-frame response column already written by annotate_subtasks.py from each episode parquet, and treats every contiguous run of identical response values as a single subtask segment.
  2. Decodes the dataset's camera0 video (resolved with the same lookup chain as annotate_subtasks.py) once per episode and pulls the last frame of each contiguous run — no temporal subsampling, just one frame per segment.
  3. Sends that single frame plus the segment's subtask string to a VLM (default: gemini-robotics-er-1.6-preview; Anthropic Claude also supported via --model) and asks for a {"success": bool, "reason": str} JSON verdict.
  4. Sets every parquet row in the segment to mistake=1 if the VLM reports failure, 0 otherwise. Any parse / API failure defaults to 0 (no mistake).
  5. Atomically rewrites the episode parquet with the new int64 mistake column and registers mistake in meta/info.json features ({"dtype": "int64", "shape": (1,), "names": None}) the first time it is added to a dataset.
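
Steps 1, 2 and 4 above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: `find_response_runs` and `labels_from_verdicts` are hypothetical stand-ins for the private helpers, the input is a plain Python list rather than a parquet column, and skipping None / empty-string runs is assumed from the test names cited later in this PR.

```python
from itertools import groupby


def find_response_runs(responses):
    """Return (start, end_exclusive, text) for each contiguous run of identical values.

    Assumption: runs whose value is None or "" are skipped entirely.
    """
    runs = []
    index = 0
    for value, group in groupby(responses):
        length = sum(1 for _ in group)
        if value is not None and value != "":
            runs.append((index, index + length, value))
        index += length
    return runs


def labels_from_verdicts(n_frames, runs, verdicts):
    """Expand one verdict per run into a per-frame 0/1 mistake list.

    verdicts[i] is True (success), False (failure), or None (parse / API failure).
    Only an explicit False marks the segment; everything else stays 0.
    """
    mistake = [0] * n_frames
    for (start, end, _text), verdict in zip(runs, verdicts):
        if verdict is False:
            mistake[start:end] = [1] * (end - start)
    return mistake


responses = ["pick cup", "pick cup", "place cup", "place cup", "place cup", None, "pick cup"]
runs = find_response_runs(responses)
# One frame per segment: the last frame index of each run.
last_frames = [end - 1 for _start, end, _text in runs]
# Pretend the VLM said: success, failure, and an unparseable reply.
mistake = labels_from_verdicts(len(responses), runs, [True, False, None])
```

Here the second segment is the only one flagged, so frames 2 to 4 get mistake=1 and every other frame, including the one whose verdict could not be parsed, stays at 0.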

Resumability semantics:

  • Episodes whose parquet already has a mistake column in the schema are skipped (cheap O(1) check via pq.read_metadata).
  • Episodes whose parquet has no response column are skipped with a warning — run annotate_subtasks.py first.
  • Frames are still spatially downsampled / center-cropped to --target-size (default 448) to keep image-token cost bounded; we never upsample.
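
The skip rules above amount to a small decision on the parquet's column names. A minimal sketch, assuming the names are read from the file footer with `pq.read_metadata`; `parquet_columns` and `episode_action` are hypothetical names, not functions from the script:

```python
def parquet_columns(path):
    """Read column names from the parquet footer only (no row data is loaded)."""
    import pyarrow.parquet as pq  # third-party; imported lazily so the rest runs without it

    return set(pq.read_metadata(path).schema.names)


def episode_action(columns):
    """Decide what to do with one episode given its parquet column names."""
    if "mistake" in columns:
        return "skip_already_annotated"  # resumable: a previous run finished this episode
    if "response" not in columns:
        return "skip_missing_response"  # annotate_subtasks.py has not been run yet
    return "annotate"
```

Checking for `mistake` first means a re-run never touches an episode that was already rewritten, even if other columns have changed since.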

Helpers shared with annotate_subtasks.py (_resize_and_center_crop, _to_jpeg_bytes, _to_b64_jpeg, _is_gemini_model, _resolve_camera0_video_key, _resolve_root, _load_datasets_from_config) are imported rather than duplicated.

How it was tested

  • python -c "from opentau.scripts import annotate_mistakes" — module imports cleanly.
  • pre-commit run --files src/opentau/scripts/annotate_mistakes.py — all hooks pass (ruff, ruff-format, pyupgrade, bandit, gitleaks, license header).
  • No automated tests added — the script is a thin orchestrator over the Anthropic / Gemini SDKs and PyAV decoding, identical in shape to annotate_subtasks.py which also has no tests. End-to-end exercise requires API keys + a real LeRobot dataset that has already been processed by annotate_subtasks.py.

How to checkout & try? (for the reviewer)

# Default Gemini ER path (requires GEMINI_API_KEY or GOOGLE_API_KEY)
python src/opentau/scripts/annotate_mistakes.py \
    --config-path configs/examples/train_mixture_config.json \
    --max-episodes-per-dataset 1

# Claude path
python src/opentau/scripts/annotate_mistakes.py \
    --config-path configs/examples/train_mixture_config.json \
    --model claude-opus-4-7 \
    --max-episodes-per-dataset 1

Checklist

  • I have added Google-style docstrings to important functions and ensured function parameters are typed.
  • My PR includes policy-related changes.
    • If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

Note: Before submitting this PR, please read the contributor guideline.

Contributor

@claude claude Bot left a comment

superseded by next review

Contributor

@claude claude Bot left a comment

Inline findings below. None are blocking — primary concerns are (1) bool(parsed["success"]) returning True for stringy "false" and (2) info.json being updated before any parquet is annotated. See summary comment for the full list.

@claude
Contributor

claude Bot commented May 6, 2026

[claude-review] summary for commit bcfaebb

Latest commit bcfaebb ("address review feedback on #280") cleanly addresses every prior finding from the review history. Verified by re-reading the current state of src/opentau/scripts/annotate_mistakes.py, tests/scripts/test_annotate_mistakes.py, docs/source/tutorials/datasets.rst, and configs/examples/annotate_mistakes_example.json:

  • _parse_success_response now rejects non-bool success values, with test_rejects_int correctly anticipating that bool subclasses int (so a literal 1 is not silently accepted).
  • Claude max_tokens=32000 → model-agnostic max_tokens=1024, with a comment explaining the choice fits the smallest 3.5-family ceiling without underutilizing larger models.
  • _call_gemini_single carries a comment documenting why the prior thinking_budget=0 workaround was retired (default budget verified empirically to fit reasoning + one-line JSON).
  • _query_subtask_success returns (success, ok); _annotate_episode returns (processed, n_api_failures, n_missing_frames); the per-dataset summary log surfaces both totals so quiet API outages and short-video frame misses no longer masquerade as "no mistakes found".
  • mistake feature registration in info.json is deferred inside the per-episode loop until after the first parquet has actually been rewritten — matches the resumability claim in the docs.
  • New tests/scripts/test_annotate_mistakes.py covers _find_response_runs (empty / single / multiple runs / None / empty-string / alternating / singleton) and _parse_success_response (plain true/false, JSON / bare fences, surrounding whitespace, stringy "false" rejection, stringy "true" rejection, int 1 rejection, missing key, non-object, invalid JSON).
  • --hub-cache-dir help text explains the deliberate match with annotate_subtasks.py for cache reuse; the docs flag table mirrors the same note.
  • configs/examples/annotate_mistakes_example.json pins "revision": "v2.1" on the Hub repo_id, mirroring the docs' v2.1-only support note. Tutorial section and module docstring both link the new example file.
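
The strict validation described in the first bullet might look roughly like the sketch below. It is an illustration under assumptions: `parse_success_response` is a hypothetical stand-in for `_parse_success_response`, and raising `ValueError` on every malformed reply is a choice made here, not something the PR text confirms.

```python
import json

FENCE = "`" * 3  # a Markdown code fence, built indirectly to keep this block easy to embed


def parse_success_response(text):
    """Parse a {"success": bool, "reason": str} verdict; raise ValueError if malformed."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (bare or tagged "json") and the closing fence.
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        cleaned = cleaned.rsplit(FENCE, 1)[0].strip()
    try:
        parsed = json.loads(cleaned)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc}") from exc
    if not isinstance(parsed, dict):
        raise ValueError("verdict is not a JSON object")
    if "success" not in parsed:
        raise ValueError("missing 'success' key")
    success = parsed["success"]
    # bool("false") is True, so coercion would flip a stringy "false".
    # isinstance(1, bool) is False, so the int 1 is rejected here as well.
    if not isinstance(success, bool):
        raise ValueError(f"'success' must be a bool, got {type(success).__name__}")
    return success
```

A caller that maps any `ValueError` to mistake=0 then keeps the documented default: an unparseable reply never marks a segment as a mistake.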

Stale inline findings from prior reviews have been deleted (the GitHub dismissals API rejects COMMENTED reviews, so deletion is the only way to clear line-anchored findings carried forward across runs).

No blocking issues found.

Contributor

@claude claude Bot left a comment

Inline findings below. None are blocking — primary concerns are (1) bool(parsed["success"]) returning True for stringy "false" and (2) info.json being updated before any parquet is annotated. The new docs commit only adds tutorial prose; the code findings carry over unchanged. See summary for the full list.
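
The ordering concern in (2) can be illustrated with a small sketch. It assumes two hypothetical callbacks, `rewrite_episode` (returns True when a parquet was actually rewritten, False when the episode was skipped) and `register_feature` (updates info.json); neither name comes from the script.

```python
def annotate_dataset(episodes, rewrite_episode, register_feature):
    """Call register_feature() exactly once, and only after the first successful rewrite."""
    registered = False
    n_rewritten = 0
    for episode in episodes:
        if not rewrite_episode(episode):  # False means the episode was skipped
            continue
        n_rewritten += 1
        if not registered:
            # At least one parquet now carries the column, so the metadata is truthful.
            register_feature()
            registered = True
    return n_rewritten
```

If `rewrite_episode` raises before any episode has been rewritten, `register_feature` is never reached, so the metadata does not advertise a column that exists nowhere.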

@WilliamYue37 WilliamYue37 self-assigned this May 6, 2026
@WilliamYue37 WilliamYue37 added the feature New feature or request label May 6, 2026
Contributor

@claude claude Bot left a comment

Latest commit e12aa55 only adds configs/examples/annotate_mistakes_example.json; annotate_mistakes.py is byte-identical to db60953, so the prior six inline findings carry over unchanged on the same lines. One new nit on the example config itself: the docs (docs/source/tutorials/datasets.rst) and the script's own docstring (src/opentau/scripts/annotate_mistakes.py:53) still cite configs/examples/train_mixture_config.json rather than the newly-added example. See summary for the full list.

Contributor

@claude claude Bot left a comment

Inline findings below — none blocking. Latest commit's thinking_budget=0 workaround is correct; flagged a follow-up since annotate_subtasks.py calls the same Gemini ER family without it. See summary comment for the full list.

Contributor

@claude claude Bot left a comment

Inline findings below — none blocking. Latest commit (c7eeaf9 removing the output token caps) is correct in both branches; raised one minor follow-up on the hardcoded Claude max_tokens=32000. Prior code-level findings carry over unchanged on the new line numbers (re-anchored after the +2-line shift in _call_claude_single). See summary for the full list.

Contributor

@claude claude Bot left a comment

Inline findings below — none blocking. Latest commit re-enables Gemini ER thinking; this resolves the prior cross-script inconsistency vs annotate_subtasks.py. Remaining findings carry over with re-anchored line numbers (-6 lines after the gemini_config simplification).

@WilliamYue37
Member Author

@claude fix

- addresses @claude-review (max_tokens): replaced the hardcoded 32000 cap on
  the Anthropic call with a model-agnostic 1024 cap that fits inside the
  smallest claude-3.5 ceiling, and updated the comment accordingly.
- addresses @claude-review (bool coercion): _parse_success_response now
  rejects non-bool 'success' values (stringy "false" / int 1) instead of
  silently coercing — caller treats parse errors as mistake=0, so failing
  closed is safer than flipping the verdict.
- addresses @claude-review (gemini thinking note): added a one-line
  comment in _call_gemini_single explaining why the prior
  thinking_budget=0 workaround was retired (default ceiling fits both
  reasoning and one-line JSON).
- addresses @claude-review (silent API failures): _query_subtask_success
  now returns (success, ok); _annotate_episode aggregates per-segment
  API failures and missing-frame counts and surfaces both in the
  per-dataset summary log so a quiet outage doesn't masquerade as
  "no mistakes found".
- addresses @claude-review (frame-extraction summary): same change above
  also surfaces n_missing_frames in the per-dataset summary.
- addresses @claude-review (write_info ordering): deferred registering
  the 'mistake' feature in info.json until after the first parquet is
  successfully rewritten, so a mid-dataset crash before any annotation
  doesn't leave info.json advertising a column that exists nowhere.
- addresses @claude-review (unit tests): added
  tests/scripts/test_annotate_mistakes.py with 20 tests covering
  _find_response_runs (empty/single/skip-empty/skip-None/alternating)
  and _parse_success_response (fences, whitespace, bool/string/int/
  missing-key/non-object/invalid-json edge cases).
- addresses @claude-review (hub-cache-dir docs): expanded the
  --hub-cache-dir help text and the docs flag table to explain that the
  default deliberately matches annotate_subtasks.py for cache reuse.
- addresses @claude-review (example config + v2.1 pin): pinned
  "revision": "v2.1" on the example config's repo_id, and rewrote the
  tutorial / docstring to point at configs/examples/annotate_mistakes_example.json
  instead of the unrelated train_mixture_config.json.

tests: passed — pytest -m "not gpu" -n auto tests/scripts/test_annotate_mistakes.py tests/scripts/test_annotate_subtasks.py

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

@claude claude Bot left a comment

Latest commit bcfaebb cleanly addresses every prior finding: _parse_success_response now strict-validates isinstance(success, bool) (with a test that rejects int 1 since bool subclasses int); Claude max_tokens dropped from 32000 to a model-agnostic 1024; Gemini-thinking-retired comment added; (success, ok) tuple plumbed through so API failures surface in the dataset summary; n_missing_frames likewise; write_info deferred until after the first parquet rewrite; 20 unit tests added covering both helpers; --hub-cache-dir help text now explains the shared default; example config pins revision: v2.1 and is referenced from the tutorial. No new blocking issues. Cleared all prior inline comments; see summary for the updated state.
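
The (success, ok) plumbing summarized above can be sketched as follows. This is a simplified illustration: the function names, the injected `call_vlm` callable, and the return shape are assumptions (the real `_annotate_episode` is described as returning (processed, n_api_failures, n_missing_frames)).

```python
def query_subtask_success(call_vlm, frame, subtask):
    """Return (success, ok); ok=False means the call or parse failed and success is a default."""
    try:
        return call_vlm(frame, subtask), True
    except Exception:  # sketch only: any API or parse error falls back to "no mistake"
        return True, False


def annotate_segments(segments, call_vlm):
    """segments is a list of (frame_or_None, subtask) pairs.

    Returns (verdicts, n_api_failures, n_missing_frames) so a summary log can
    distinguish "no mistakes found" from "nothing could be checked".
    """
    verdicts, n_api_failures, n_missing_frames = [], 0, 0
    for frame, subtask in segments:
        if frame is None:  # e.g. the video was shorter than the parquet
            n_missing_frames += 1
            verdicts.append(True)
            continue
        success, ok = query_subtask_success(call_vlm, frame, subtask)
        if not ok:
            n_api_failures += 1
        verdicts.append(success)
    return verdicts, n_api_failures, n_missing_frames


def fake_vlm(frame, subtask):
    """Stand-in for a real VLM call, used only to exercise the sketch."""
    if subtask == "boom":
        raise RuntimeError("api down")
    return subtask != "fail"


verdicts, n_api_failures, n_missing_frames = annotate_segments(
    [("f0", "ok"), ("f1", "fail"), ("f2", "boom"), (None, "ok")], fake_vlm
)
```

Both the failed call and the missing frame default to success (mistake=0), but each is counted, which is what lets the per-dataset summary expose a quiet outage.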

@WilliamYue37 WilliamYue37 merged commit 4de4b54 into main May 7, 2026
9 checks passed
@WilliamYue37 WilliamYue37 deleted the feat/annotate-mistakes branch May 7, 2026 17:00
Labels

enhancement feature New feature or request


2 participants