feat(scripts): annotate_subtasks.py — VLM subtask labelling for dataset mixtures by WilliamYue37 · Pull Request #215 · TensorAuto/OpenTau

WilliamYue37 · 2026-04-29T23:43:22Z

What this does

Adds src/opentau/scripts/annotate_subtasks.py, a new offline annotation script that automatically labels every episode in a dataset mixture with subtask boundaries using claude-opus-4-7.

How it works (efficiently):

- Samples frames at 1 fps from each episode video (30–50× reduction vs. raw frame rate), controlled by --sample-fps
- Resizes frames to 640 px wide before JPEG-encoding (reduces image tokens ~6×)
- Sends all sampled frames in a single API call per episode with timestamps; Claude returns [{"time": float, "subtask": str}, ...] boundaries
- Skips already-annotated episodes, so a run is fully resumable after a crash
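
The sampling step above comes down to index arithmetic. The following is a minimal sketch, not the script's actual code: `sample_frame_indices` and its parameters are hypothetical names, and it assumes the video has a constant native frame rate.

```python
def sample_frame_indices(
    num_frames: int, video_fps: float, sample_fps: float = 1.0
) -> list[tuple[int, float]]:
    """Return (frame_index, timestamp_seconds) pairs spaced 1/sample_fps apart.

    Hypothetical helper: the real script may decode and select frames differently.
    """
    if num_frames <= 0 or video_fps <= 0 or sample_fps <= 0:
        raise ValueError("num_frames, video_fps and sample_fps must be positive")

    duration = num_frames / video_fps
    samples: list[tuple[int, float]] = []
    k = 0
    while True:
        timestamp = k / sample_fps
        if timestamp >= duration:
            break
        # Map the timestamp back to the nearest native frame, clamped to the clip.
        index = min(int(round(timestamp * video_fps)), num_frames - 1)
        samples.append((index, timestamp))
        k += 1
    return samples
```

For a 3-second clip recorded at 30 fps, this keeps 3 of the 90 frames, which is where the 30× end of the quoted reduction comes from.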

Hub dataset support: datasets without a local root are downloaded via huggingface_hub.snapshot_download into ~/.cache/huggingface/opentau_subtasks/ before processing.
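
A rough sketch of the Hub download path, under stated assumptions: the helper names are hypothetical, and the `owner--name` directory layout is inferred from the `lerobot--droid_100` path shown in the checkout instructions rather than taken from the script. `snapshot_download` is the real `huggingface_hub` function; it is imported lazily so the path helper can be used without the package installed.

```python
from pathlib import Path
from typing import Optional

# Cache location named in the PR description.
CACHE_ROOT = Path.home() / ".cache" / "huggingface" / "opentau_subtasks"


def local_dataset_dir(repo_id: str, cache_root: Path = CACHE_ROOT) -> Path:
    """Map a Hub repo id such as 'lerobot/droid_100' to a local directory.

    Assumption: '/' is replaced by '--', matching the path in the PR description.
    """
    return cache_root / repo_id.replace("/", "--")


def ensure_local_dataset(
    repo_id: str, revision: Optional[str] = None, cache_root: Path = CACHE_ROOT
) -> Path:
    """Download a Hub dataset snapshot if needed and return its local root."""
    from huggingface_hub import snapshot_download  # lazy import

    target = local_dataset_dir(repo_id, cache_root)
    snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        revision=revision,  # e.g. "v2.1" for the pinned example config
        local_dir=target,
    )
    return target
```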

Output is written as per-episode JSONs compatible with the existing add_subtask_response.py, and optionally expanded into a response column in each episode parquet (--write-response-column, on by default).

Adds anthropic>=0.55.0 as a project dependency. Adds configs/examples/train_mixture_config.json as a public example config pointing at lerobot/droid_100 (pinned to v2.1). Adds documentation in the Datasets tutorial.

How it was tested

Ran against lerobot/droid_100 at revision=v2.1 (Hub download path) and the local shuheng_bottle_lift dataset (local path):

# Hub dataset — downloads, annotates 1 episode, checks subtask JSON
python src/opentau/scripts/annotate_subtasks.py \
    --config-path configs/examples/train_mixture_config.json \
    --max-episodes-per-dataset 1 \
    --no-write-response-column

Sample output for lerobot/droid_100 episode 0 (task: "Put the marker in the pot"):

[
  {"time": 0.0,  "subtask": "approaching the marker on the table"},
  {"time": 4.0,  "subtask": "grasping the marker"},
  {"time": 6.0,  "subtask": "lifting and moving marker toward pot"},
  {"time": 8.0,  "subtask": "placing marker into the pot"},
  {"time": 10.0, "subtask": "retracting arm away from pot"}
]
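
To illustrate how boundaries like these could be expanded into a per-frame response column, here is a minimal sketch. It is an assumption about the expansion rule (each frame takes the label of the latest boundary at or before its timestamp); the PR does not show how add_subtask_response.py actually does it, and `expand_subtasks` is a hypothetical name.

```python
from bisect import bisect_right


def expand_subtasks(boundaries: list[dict], frame_times: list[float]) -> list[str]:
    """Assign each frame the subtask whose start time is the latest one <= its timestamp.

    Assumes `boundaries` is non-empty and sorted by "time".
    """
    starts = [float(b["time"]) for b in boundaries]
    labels = [b["subtask"] for b in boundaries]
    responses = []
    for t in frame_times:
        # bisect_right returns the count of starts <= t; step back one to get its index.
        i = bisect_right(starts, t) - 1
        # Frames before the first boundary fall back to the first label.
        responses.append(labels[max(i, 0)])
    return responses
```

With the sample output above, a frame at 3.9 s would be labelled "approaching the marker on the table" and a frame at 4.0 s "grasping the marker".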

Also verified:

- Idempotency: a re-run skips completed episodes in O(1) per episode
- Parquet write path: response column added correctly, meta/info.json updated with subtask_path and response feature
- All pre-commit hooks pass
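
The idempotency check above can be pictured as a file-existence test per episode. This is a sketch under assumptions: the helper names are hypothetical, and the `subtasks/episode_000000.json` layout is taken from the path in the checkout instructions.

```python
from pathlib import Path


def subtask_json_path(dataset_root: Path, episode_index: int) -> Path:
    """Path of the per-episode subtask JSON (six-digit, zero-padded index)."""
    return dataset_root / "subtasks" / f"episode_{episode_index:06d}.json"


def pending_episodes(dataset_root: Path, episode_indices: list[int]) -> list[int]:
    """Return the episodes that still need annotation.

    One stat call per episode, so a re-run over finished work makes no API calls.
    """
    return [
        i for i in episode_indices if not subtask_json_path(dataset_root, i).exists()
    ]
```

A design consequence, noted later in the review, is that a skip based only on file existence will not regenerate output when a flag such as --sample-fps changes.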

How to checkout & try? (for the reviewer)

git checkout feat/annotate-subtasks
uv sync --extra dev

# Dry run — annotates 1 episode from lerobot/droid_100 (downloads ~464 MB at v2.1)
ANTHROPIC_API_KEY=<your-key> python src/opentau/scripts/annotate_subtasks.py \
    --config-path configs/examples/train_mixture_config.json \
    --max-episodes-per-dataset 1 \
    --no-write-response-column

# Check the result
cat ~/.cache/huggingface/opentau_subtasks/lerobot--droid_100/subtasks/episode_000000.json

Checklist

I have added Google-style docstrings to important functions and ensured function parameters are typed.
My PR includes policy-related changes.
- If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

Note: Before submitting this PR, please read the contributor guideline.

Adds a new script that samples 1 fps frames from episode videos, sends them to claude-opus-4-7, and writes per-episode subtask boundary JSONs compatible with add_subtask_response.py. Hub-only datasets (no root) are downloaded via snapshot_download before processing. Includes a public example config at example/train_mixture_config.json. Adds anthropic>=0.55.0 as a project dependency. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ted kwarg Replaces the placeholder local/example dataset with the real public TensorAuto/IceLemonade_100 Hub dataset and removes the fake lerobot/pusht entry. Also drops the deprecated local_dir_use_symlinks=False kwarg from snapshot_download (huggingface_hub ≥0.24 no longer needs it). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

claude

Inline findings on annotate_subtasks.py — see line comments.

claude

Inline findings posted.

claude · 2026-04-29T23:46:50Z

[claude-review] summary for commit 915af4a

Latest commit (915af4a) addresses four of the prior findings: response.content is now iterated for text-typed blocks, malformed entries are filtered with a fail-fast ValueError if none survive, both prompt templates are parameterised on sample_fps, and the parquet step short-circuits via pq.read_metadata(...).schema.names when the response column already exists. Three findings remain plus a couple of new notes.

suggestion — src/opentau/scripts/annotate_subtasks.py:164 — no cap on sampled frame count; Anthropic Messages API rejects >100 images per request, so any episode >~100 s at default --sample-fps=1.0 will fail mid-run. Reinstate a --max-frames (uniformly subsample on overflow) or batch the request — long-horizon datasets will hit this.
suggestion — src/opentau/scripts/annotate_subtasks.py:341 — still uses table.num_rows as ground truth instead of reconciling against episodes.jsonl length the way add_subtask_response.py:156-167 does. Adopting the same warn-and-pad/truncate pattern catches a corrupt parquet up front.
suggestion — src/opentau/scripts/annotate_subtasks.py (whole file) — still no tests for new behaviour. CLAUDE.md flags missing tests as a review focus; cheap unit tests for _parse_json_response (markdown-fence stripping, non-array rejection), the entry-filtering branch (line 226-232), and the time=0.0 backfill (line 237-238) would prevent regression on the parsing path that was just hardened.
suggestion — src/opentau/scripts/annotate_subtasks.py:336-338 — new "skip if response column exists" path is silent (logger.debug). Combined with _annotate_episode's "skip if subtask JSON exists", a rerun with a different --sample-fps does nothing — neither the JSON nor the parquet is regenerated. Either log at INFO when skipping (so a user notices) or document the "delete the column to force regeneration" instruction in the script's --help / module docstring, not just the inline comment.
suggestion — src/opentau/scripts/annotate_subtasks.py:371-381 — docs/source/tutorials/datasets.rst claims "Only LeRobot v2.1 datasets are supported" but no version check is enforced; pointing the script at a v3.0 dataset will fail downstream rather than at config-load. Add an explicit assert info["codebase_version"] == "v2.1" (or a logger.warning for non-v2.1) in _process_dataset, or soften the doc to "tested only against v2.1".
nit — src/opentau/scripts/annotate_subtasks.py:468 — description=__doc__ dumps the entire module docstring in --help; a short string + epilog=__doc__ is cleaner.
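
The frame-cap suggestion in the first finding can be sketched as uniform subsampling. This is an illustration only: `cap_frames` is a hypothetical helper, the constant name comes from the follow-up commit message, and the index formula is one reasonable choice rather than the one the script uses.

```python
MAX_FRAMES_PER_REQUEST = 100  # image limit per request, as stated in the review


def cap_frames(frames: list, max_frames: int = MAX_FRAMES_PER_REQUEST) -> list:
    """Return at most `max_frames` items, evenly spaced, keeping first and last."""
    if max_frames <= 0:
        raise ValueError("max_frames must be positive")
    n = len(frames)
    if n <= max_frames:
        return list(frames)
    if max_frames == 1:
        return [frames[0]]
    # Integer arithmetic avoids float rounding; because n > max_frames,
    # consecutive indices differ by at least 1, so no frame is repeated.
    indices = [(i * (n - 1)) // (max_frames - 1) for i in range(max_frames)]
    return [frames[i] for i in indices]
```

Applied to a 250-frame episode at the default rate, this keeps 100 frames spanning the whole clip instead of failing the request.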

Note: the new ValueError raised at line 233-234 when Claude returns no valid subtask entries is correctly absorbed by _process_dataset's try/except at line 404, so the failure mode is "skip episode, log exception" — that's the right call.

Note: PR description's sample output for lerobot/droid_100 still shows gerund subtasks ("approaching the marker", "grasping the marker", …) but the prompt now mandates imperative form ("approach the marker", …). Worth re-running the smoke test and refreshing the example so reviewers see what the script actually emits today.

claude

Inline findings on annotate_subtasks.py — see line comments. Substantive code unchanged from prior review (only the prompt example was reworded for imperative mood).

…ust parsing

- Skip parquet update when 'response' column already exists (metadata-only check), so reruns are actually O(1) per episode instead of re-reading and re-writing every parquet.
- Thread --sample-fps into both system and user prompts (was hardcoded to "1 fps") so Claude isn't misled when a non-default rate is used.
- Pick the first text block from response.content instead of [0].text; validate parsed subtask entries have time+subtask before use.
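
The "first text block" change in this commit message could look roughly like the sketch below. `first_text` is a hypothetical name; the only assumption about the SDK is that each content block exposes a `type` attribute and text blocks expose `text`, which is how the Anthropic Python SDK models message content.

```python
def first_text(content) -> str:
    """Return the text of the first text-typed block in a Messages response.

    Indexing content[0].text directly breaks when the first block is not text,
    so the blocks are scanned in order instead.
    """
    for block in content:
        if getattr(block, "type", None) == "text":
            return block.text
    raise ValueError("response contained no text block")
```

In use, this would be called as `first_text(response.content)` on the object returned by the SDK's `messages.create`.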

claude

Inline findings on annotate_subtasks.py — only the still-applicable issues from the previous review are reposted; the four fixes in 915af4a (text-block iteration, valid-entry filtering, fps-templated prompts, parquet-skip on rerun) clear the rest.

shuheng-liu

LGTM as long as Claude's reviews are addressed.

akshay18iitg · 2026-04-30T16:49:39Z

@WilliamYue37 Can we also handle rate limits, so the script doesn't crash because of them?

WilliamYue37 · 2026-04-30T19:49:40Z

@claude fix

@claude

- addresses @claude[bot] (frame cap): add MAX_FRAMES_PER_REQUEST=100 with --max-frames CLI flag; long clips are uniformly subsampled instead of silently exceeding the Anthropic Messages 100-image limit. Comment why the stride floor is needed.
- addresses @claude[bot] (parquet length): trust episodes.jsonl length in _update_parquet_response, warn + pad/truncate on parquet row mismatch (mirrors add_subtask_response.py:156-167).
- addresses @claude[bot] (--help): short description= + epilog=__doc__ so the flag list is no longer buried under the module docstring.
- addresses @claude[bot] (silent rerun skip): bump 'response column already present' log to INFO with delete-to-regenerate hint; document the same in the module docstring.
- addresses @claude[bot] (v2.1 expectation): warn when info.codebase_version is not 'v2.1' and document the limitation in the module docstring.
- addresses @claude[bot] (no tests): add tests/scripts/test_annotate_subtasks.py covering _parse_json_response (fence stripping, non-array rejection) and _coerce_subtasks (entry filtering, time=0.0 backfill, empty rejection). Extracted _coerce_subtasks helper to make filtering testable.
- addresses @akshay18iitg (rate limits): pass max_retries=--max-api-retries (default 8) to the Anthropic client so 429/5xx responses retry with the SDK's built-in exponential backoff instead of crashing the run.

tests: passed -- pytest -m "not gpu" -n auto tests/scripts/test_annotate_subtasks.py tests/scripts/test_add_subtask_response.py

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
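
The parsing behaviours listed in the commit message above (fence stripping, non-array rejection, entry filtering, time=0.0 backfill, empty rejection) might be implemented along these lines. This is a sketch, not the merged code: the function names are stand-ins for _parse_json_response and _coerce_subtasks, and the backfill rule (pull the first entry's start back to 0.0) is an assumption about what "time=0.0 backfill" means.

```python
import json

FENCE = "`" * 3  # markdown code-fence marker, built indirectly to keep this block fence-safe


def parse_json_array(text: str) -> list:
    """Strip an optional markdown fence, parse JSON, and require a top-level array."""
    s = text.strip()
    if s.startswith(FENCE) and s.endswith(FENCE) and len(s) >= 2 * len(FENCE):
        s = s[len(FENCE):-len(FENCE)]
        head, sep, rest = s.partition("\n")
        # Drop the first line only if it is empty or a language tag such as "json".
        if sep and (head.strip() == "" or head.strip().isalnum()):
            s = rest
    data = json.loads(s.strip())
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of subtask entries")
    return data


def coerce_subtasks(entries: list) -> list[dict]:
    """Keep well-formed entries, sort by time, and make the first one start at 0.0."""
    valid = []
    for entry in entries:
        if not isinstance(entry, dict):
            continue
        time, subtask = entry.get("time"), entry.get("subtask")
        if isinstance(time, bool) or not isinstance(time, (int, float)):
            continue
        if not isinstance(subtask, str) or not subtask.strip():
            continue
        valid.append({"time": float(time), "subtask": subtask.strip()})
    if not valid:
        raise ValueError("no valid subtask entries")
    valid.sort(key=lambda e: e["time"])
    if valid[0]["time"] > 0.0:
        valid[0]["time"] = 0.0  # assumed backfill so every frame gets a label
    return valid
```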

claude · 2026-04-30T19:59:32Z

[claude-fix] @akshay18iitg done in 58bc039. The Anthropic SDK has built-in retry-with-backoff for 429 (rate limit) and 5xx responses; the script now constructs the client with max_retries=args.max_api_retries (default 8, configurable via the new --max-api-retries CLI flag) so transient bursts on rate-limited tiers are absorbed automatically rather than crashing a long batch.
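
The retry wiring described above could be sketched as follows. The flag name and default come from this comment; `build_parser` and `make_client` are hypothetical helper names, and the real script's argument parser has more flags. `max_retries` is a constructor argument of the Anthropic Python SDK client; the SDK is imported lazily here so the parser can be exercised without it installed.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser containing only the retry flag discussed in this thread."""
    parser = argparse.ArgumentParser(description="Retry configuration sketch")
    parser.add_argument(
        "--max-api-retries",
        type=int,
        default=8,
        help="Retries on transient API failures (429/5xx) before giving up.",
    )
    return parser


def make_client(max_api_retries: int):
    """Build an Anthropic client that retries transient failures with backoff."""
    import anthropic  # lazy import; reads ANTHROPIC_API_KEY from the environment

    return anthropic.Anthropic(max_retries=max_api_retries)
```

A caller would then run `make_client(build_parser().parse_args().max_api_retries)` once and reuse the client for every episode.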

WilliamYue37 and others added 3 commits April 29, 2026 15:53

chore: move train_mixture_config.json to configs/examples/

7737436

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

claude Bot reviewed Apr 29, 2026

chore: switch example config to lerobot/droid_100 at v2.1

0bd581a

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

WilliamYue37 self-assigned this Apr 30, 2026

claude Bot reviewed Apr 30, 2026

WilliamYue37 and others added 2 commits April 29, 2026 17:07

refactor(annotate_subtasks): remove --max-frames, use --sample-fps only

1184ef6

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs(datasets): document annotate_subtasks.py in Datasets tutorial

a141b9b

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

claude Bot reviewed Apr 30, 2026

feat(annotate_subtasks): use imperative mood for subtask descriptions

5c85836

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

claude Bot reviewed Apr 30, 2026

WilliamYue37 requested review from akshay18iitg and shuheng-liu April 30, 2026 01:27

claude Bot reviewed Apr 30, 2026

shuheng-liu previously approved these changes Apr 30, 2026

claude Bot dismissed shuheng-liu’s stale review via 58bc039 April 30, 2026 19:59

WilliamYue37 requested a review from shuheng-liu April 30, 2026 20:27

akshay18iitg approved these changes Apr 30, 2026

WilliamYue37 removed the request for review from shuheng-liu April 30, 2026 20:56

WilliamYue37 merged commit 8009141 into main Apr 30, 2026
7 checks passed

WilliamYue37 deleted the feat/annotate-subtasks branch April 30, 2026 20:57
