
feat(scripts): annotate_subtasks.py - VLM subtask labelling for dataset mixtures #215

Merged
WilliamYue37 merged 9 commits into main from feat/annotate-subtasks on Apr 30, 2026

WilliamYue37 commented Apr 29, 2026

What this does

Adds src/opentau/scripts/annotate_subtasks.py, a new offline annotation script that automatically labels every episode in a dataset mixture with subtask boundaries using claude-opus-4-7.

How it works (efficiently):

  • Samples frames at 1 fps from each episode video (a 30–50× reduction vs. the raw frame rate); the rate is controlled by --sample-fps
  • Resizes frames to 640 px wide before JPEG-encoding (reduces image tokens ~6×)
  • Sends all sampled frames in a single API call per episode with timestamps; Claude returns [{"time": float, "subtask": str}, ...] boundaries
  • Skips already-annotated episodes — fully resumable after a crash
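The sampling and resizing steps above can be sketched as follows. This is a minimal illustration, not the script's actual code: the function names sampled_frame_times and resized_size are hypothetical, and leaving frames that are already narrower than 640 px untouched is an assumption.

```python
def sampled_frame_times(
    num_frames: int, video_fps: float, sample_fps: float = 1.0
) -> list[tuple[int, float]]:
    """Return (frame_index, timestamp_s) pairs spaced 1/sample_fps seconds apart.

    Assumes sample_fps <= video_fps; otherwise neighbouring samples may map
    to the same frame index.
    """
    if num_frames <= 0 or video_fps <= 0 or sample_fps <= 0:
        raise ValueError("num_frames, video_fps and sample_fps must be positive")
    samples = []
    k = 0
    while True:
        timestamp = k / sample_fps
        index = round(timestamp * video_fps)
        if index >= num_frames:  # ran past the end of the episode
            break
        samples.append((index, timestamp))
        k += 1
    return samples


def resized_size(width: int, height: int, target_width: int = 640) -> tuple[int, int]:
    """Scale (width, height) down to target_width, preserving aspect ratio."""
    if width <= target_width:
        # Assumption: small frames are not upscaled.
        return width, height
    scale = target_width / width
    return target_width, max(1, round(height * scale))
```

For a 3-second clip at 30 fps, sampled_frame_times(90, 30.0) keeps frames 0, 30 and 60, which is the 30× reduction mentioned in the first bullet.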

Hub dataset support: datasets without a local root are downloaded via huggingface_hub.snapshot_download into ~/.cache/huggingface/opentau_subtasks/ before processing.

Output is written as per-episode JSONs compatible with the existing add_subtask_response.py, and optionally expanded into a response column in each episode parquet (--write-response-column, on by default).

Adds anthropic>=0.55.0 as a project dependency. Adds configs/examples/train_mixture_config.json as a public example config pointing at lerobot/droid_100 (pinned to v2.1). Adds documentation in the Datasets tutorial.

How it was tested

Ran against lerobot/droid_100 at revision=v2.1 (Hub download path) and the local shuheng_bottle_lift dataset (local path):

# Hub dataset — downloads, annotates 1 episode, checks subtask JSON
python src/opentau/scripts/annotate_subtasks.py \
    --config-path configs/examples/train_mixture_config.json \
    --max-episodes-per-dataset 1 \
    --no-write-response-column

Sample output for lerobot/droid_100 episode 0 (task: "Put the marker in the pot"):

[
  {"time": 0.0,  "subtask": "approaching the marker on the table"},
  {"time": 4.0,  "subtask": "grasping the marker"},
  {"time": 6.0,  "subtask": "lifting and moving marker toward pot"},
  {"time": 8.0,  "subtask": "placing marker into the pot"},
  {"time": 10.0, "subtask": "retracting arm away from pot"}
]
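To illustrate how a boundary list like the one above could be expanded into a per-frame response column, here is a small sketch. The helper expand_to_frames is hypothetical, and the rule it applies (each frame takes the label of the latest boundary at or before its timestamp) is an assumption about the expansion, not something the PR spells out.

```python
from bisect import bisect_right


def expand_to_frames(boundaries: list[dict], frame_times: list[float]) -> list[str]:
    """Assign every frame timestamp the subtask whose boundary most recently started."""
    if not boundaries:
        raise ValueError("need at least one boundary")
    ordered = sorted(boundaries, key=lambda b: b["time"])
    starts = [b["time"] for b in ordered]
    labels = []
    for t in frame_times:
        # Index of the last boundary with start time <= t; frames before
        # the first boundary fall back to the first subtask.
        i = max(bisect_right(starts, t) - 1, 0)
        labels.append(ordered[i]["subtask"])
    return labels


boundaries = [
    {"time": 0.0, "subtask": "approaching the marker on the table"},
    {"time": 4.0, "subtask": "grasping the marker"},
    {"time": 6.0, "subtask": "lifting and moving marker toward pot"},
    {"time": 8.0, "subtask": "placing marker into the pot"},
    {"time": 10.0, "subtask": "retracting arm away from pot"},
]
labels = expand_to_frames(boundaries, [0.0, 3.9, 4.0, 7.5, 12.0])
```

Using bisect keeps the lookup at O(log n) per frame, although with a handful of boundaries per episode a linear scan would work equally well.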

Also verified:

  • Idempotency: re-run skips completed episodes in O(1)
  • Parquet write path: response column added correctly, meta/info.json updated with subtask_path and response feature
  • All pre-commit hooks pass

How to checkout & try? (for the reviewer)

git checkout feat/annotate-subtasks
uv sync --extra dev

# Dry run — annotates 1 episode from lerobot/droid_100 (downloads ~464 MB at v2.1)
ANTHROPIC_API_KEY=<your-key> python src/opentau/scripts/annotate_subtasks.py \
    --config-path configs/examples/train_mixture_config.json \
    --max-episodes-per-dataset 1 \
    --no-write-response-column

# Check the result
cat ~/.cache/huggingface/opentau_subtasks/lerobot--droid_100/subtasks/episode_000000.json

Checklist

  • I have added Google-style docstrings to important functions and ensured function parameters are typed.
  • My PR includes policy-related changes.
    • If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

Note: Before submitting this PR, please read the contributor guideline.

WilliamYue37 and others added 3 commits April 29, 2026 15:53
Adds a new script that samples 1 fps frames from episode videos, sends
them to claude-opus-4-7, and writes per-episode subtask boundary JSONs
compatible with add_subtask_response.py.  Hub-only datasets (no root)
are downloaded via snapshot_download before processing.  Includes a
public example config at example/train_mixture_config.json.  Adds
anthropic>=0.55.0 as a project dependency.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ted kwarg

Replaces the placeholder local/example dataset with the real public
TensorAuto/IceLemonade_100 Hub dataset and removes the fake lerobot/pusht
entry.  Also drops the deprecated local_dir_use_symlinks=False kwarg from
snapshot_download (huggingface_hub ≥0.24 no longer needs it).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@claude claude Bot left a comment

Inline findings on annotate_subtasks.py — see line comments.

@claude claude Bot left a comment

Inline findings posted.

claude Bot commented Apr 29, 2026

[claude-review] summary for commit 915af4a

Latest commit (915af4a) addresses four of the prior findings: response.content is now iterated for text-typed blocks, malformed entries are filtered with a fail-fast ValueError if none survive, both prompt templates are parameterised on sample_fps, and the parquet step short-circuits via pq.read_metadata(...).schema.names when the response column already exists. Three findings remain plus a couple of new notes.

  • suggestion: src/opentau/scripts/annotate_subtasks.py:164 - no cap on sampled frame count; the Anthropic Messages API rejects >100 images per request, so any episode longer than ~100 s at the default --sample-fps=1.0 will fail mid-run. Reinstate a --max-frames (uniformly subsample on overflow) or batch the request; long-horizon datasets will hit this.
  • suggestion: src/opentau/scripts/annotate_subtasks.py:341 - still uses table.num_rows as ground truth instead of reconciling against episodes.jsonl length the way add_subtask_response.py:156-167 does. Adopting the same warn-and-pad/truncate pattern catches a corrupt parquet up front.
  • suggestion: src/opentau/scripts/annotate_subtasks.py (whole file) - still no tests for new behaviour. CLAUDE.md flags missing tests as a review focus; cheap unit tests for _parse_json_response (markdown-fence stripping, non-array rejection), the entry-filtering branch (line 226-232), and the time=0.0 backfill (line 237-238) would prevent regression on the parsing path that was just hardened.
  • suggestion: src/opentau/scripts/annotate_subtasks.py:336-338 - new "skip if response column exists" path is silent (logger.debug). Combined with _annotate_episode's "skip if subtask JSON exists", a rerun with a different --sample-fps does nothing: neither the JSON nor the parquet is regenerated. Either log at INFO when skipping (so a user notices) or document the "delete the column to force regeneration" instruction in the script's --help / module docstring, not just the inline comment.
  • suggestion: src/opentau/scripts/annotate_subtasks.py:371-381 - docs/source/tutorials/datasets.rst claims "Only LeRobot v2.1 datasets are supported" but no version check is enforced; pointing the script at a v3.0 dataset will fail downstream rather than at config-load. Add an explicit assert info["codebase_version"] == "v2.1" (or a logger.warning for non-v2.1) in _process_dataset, or soften the doc to "tested only against v2.1".
  • nit: src/opentau/scripts/annotate_subtasks.py:468 - description=__doc__ dumps the entire module docstring in --help; a short string + epilog=__doc__ is cleaner.
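The first suggestion (uniformly subsample when the frame count overflows the per-request image limit) could look roughly like the sketch below. The name cap_frames and its evenly-spaced-index strategy are assumptions made for illustration; the fix that eventually landed may use a different scheme, such as a fixed stride.

```python
def cap_frames(frames: list, max_frames: int = 100) -> list:
    """Return at most max_frames items, evenly spaced, keeping first and last."""
    if max_frames < 1:
        raise ValueError("max_frames must be >= 1")
    n = len(frames)
    if n <= max_frames:
        return list(frames)  # already under the cap, nothing to drop
    if max_frames == 1:
        return [frames[0]]
    # Spread max_frames indices across [0, n - 1]. Because n > max_frames,
    # the spacing exceeds 1, so rounding never produces duplicate indices.
    indices = [round(i * (n - 1) / (max_frames - 1)) for i in range(max_frames)]
    return [frames[i] for i in indices]
```

A 250-second episode sampled at 1 fps would be reduced from 250 frames to 100, with the timestamps attached to the surviving frames left unchanged so the model still sees true episode time.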

Note: the new ValueError raised at line 233-234 when Claude returns no valid subtask entries is correctly absorbed by _process_dataset's try/except at line 404, so the failure mode is "skip episode, log exception" — that's the right call.

Note: PR description's sample output for lerobot/droid_100 still shows gerund subtasks ("approaching the marker", "grasping the marker", …) but the prompt now mandates imperative form ("approach the marker", …). Worth re-running the smoke test and refreshing the example so reviewers see what the script actually emits today.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@WilliamYue37 WilliamYue37 self-assigned this Apr 30, 2026

@claude claude Bot left a comment

Inline findings on annotate_subtasks.py — see line comments.

WilliamYue37 and others added 2 commits April 29, 2026 17:07
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@claude claude Bot left a comment

Inline findings on annotate_subtasks.py — see line comments.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@claude claude Bot left a comment

Inline findings on annotate_subtasks.py — see line comments. Substantive code unchanged from prior review (only the prompt example was reworded for imperative mood).

…ust parsing

- Skip parquet update when 'response' column already exists (metadata-only
  check), so reruns are actually O(1) per episode instead of re-reading and
  re-writing every parquet.
- Thread --sample-fps into both system and user prompts (was hardcoded to
  "1 fps") so Claude isn't misled when a non-default rate is used.
- Pick the first text block from response.content instead of [0].text;
  validate parsed subtask entries have time+subtask before use.

@claude claude Bot left a comment

Inline findings on annotate_subtasks.py — only the still-applicable issues from the previous review are reposted; the four fixes in 915af4a (text-block iteration, valid-entry filtering, fps-templated prompts, parquet-skip on rerun) clear the rest.

shuheng-liu previously approved these changes Apr 30, 2026

@shuheng-liu shuheng-liu left a comment

LGTM as long as Claude's reviews are addressed.

akshay18iitg (Collaborator) commented:

@WilliamYue37 Can we also handle rate limits, so the script doesn't crash because of them?

WilliamYue37 (Member, Author) commented:

@claude fix

- addresses @claude[bot] (frame cap): add MAX_FRAMES_PER_REQUEST=100 with
  --max-frames CLI flag; long clips are uniformly subsampled instead of
  silently exceeding the Anthropic Messages 100-image limit. Comment why
  the stride floor is needed.
- addresses @claude[bot] (parquet length): trust episodes.jsonl length
  in _update_parquet_response, warn + pad/truncate on parquet row mismatch
  (mirrors add_subtask_response.py:156-167).
- addresses @claude[bot] (--help): short description= + epilog=__doc__ so
  the flag list is no longer buried under the module docstring.
- addresses @claude[bot] (silent rerun skip): bump 'response column already
  present' log to INFO with delete-to-regenerate hint; document the same in
  the module docstring.
- addresses @claude[bot] (v2.1 expectation): warn when info.codebase_version
  is not 'v2.1' and document the limitation in the module docstring.
- addresses @claude[bot] (no tests): add tests/scripts/test_annotate_subtasks.py
  covering _parse_json_response (fence stripping, non-array rejection) and
  _coerce_subtasks (entry filtering, time=0.0 backfill, empty rejection).
  Extracted _coerce_subtasks helper to make filtering testable.
- addresses @akshay18iitg (rate limits): pass max_retries=--max-api-retries
  (default 8) to the Anthropic client so 429/5xx responses retry with the
  SDK's built-in exponential backoff instead of crashing the run.

tests: passed -- pytest -m "not gpu" -n auto tests/scripts/test_annotate_subtasks.py tests/scripts/test_add_subtask_response.py

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
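The parsing behaviour listed in the commit message above (fence stripping, non-array rejection, entry filtering, time=0.0 backfill, empty rejection) might be sketched as follows. The function names parse_json_array and coerce_subtasks are stand-ins rather than the script's real helpers, and "backfill" is interpreted here as moving the earliest boundary to 0.0, which is an assumption.

```python
import json

FENCE = "`" * 3  # markdown code-fence marker


def parse_json_array(text: str) -> list:
    """Strip an optional markdown fence, then require a top-level JSON array."""
    body = text.strip()
    if body.startswith(FENCE):
        lines = body.splitlines()[1:]  # drop the opening fence (may carry a language tag)
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence
        body = "\n".join(lines)
    data = json.loads(body)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of subtask entries")
    return data


def coerce_subtasks(entries: list) -> list[dict]:
    """Keep well-formed entries, sort by time, and make the first start at 0.0."""
    valid = []
    for entry in entries:
        if not isinstance(entry, dict):
            continue
        time, subtask = entry.get("time"), entry.get("subtask")
        if isinstance(time, bool) or not isinstance(time, (int, float)):
            continue
        if not isinstance(subtask, str) or not subtask.strip():
            continue
        valid.append({"time": float(time), "subtask": subtask.strip()})
    if not valid:
        raise ValueError("no valid subtask entries")
    valid.sort(key=lambda e: e["time"])
    if valid[0]["time"] > 0.0:
        valid[0]["time"] = 0.0  # assumed meaning of the time=0.0 backfill
    return valid
```

Keeping the filtering in its own function, as the commit message describes, means both steps can be unit-tested without calling the API.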

claude Bot commented Apr 30, 2026

[claude-fix] @akshay18iitg done in 58bc039. The Anthropic SDK has built-in retry-with-backoff for 429 (rate limit) and 5xx responses; the script now constructs the client with max_retries=args.max_api_retries (default 8, configurable via the new --max-api-retries CLI flag) so transient bursts on rate-limited tiers are absorbed automatically rather than crashing a long batch.
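For readers unfamiliar with retry-with-backoff, the general technique can be sketched by hand as below. This is not the Anthropic SDK's implementation, and its actual delay schedule is not described in this thread; TransientError, call_with_backoff, and the delay constants are all hypothetical. In the script itself, the equivalent behaviour comes from passing max_retries to the client constructor, as the comment above states.

```python
import random
import time


class TransientError(Exception):
    """Stand-in for a retryable failure such as HTTP 429 or 5xx."""


def call_with_backoff(fn, max_retries=8, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call fn(), retrying on TransientError with capped exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted, surface the error
            # Delay doubles each attempt (0.5 s, 1 s, 2 s, ...) up to max_delay;
            # random jitter keeps parallel workers from retrying in lockstep.
            delay = min(max_delay, base_delay * 2**attempt)
            sleep(delay * random.uniform(0.5, 1.0))
```

The sleep parameter is injectable so the retry schedule can be tested without actually waiting.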

@WilliamYue37 WilliamYue37 removed the request for review from shuheng-liu April 30, 2026 20:56
@WilliamYue37 WilliamYue37 merged commit 8009141 into main Apr 30, 2026
7 checks passed
@WilliamYue37 WilliamYue37 deleted the feat/annotate-subtasks branch April 30, 2026 20:57

3 participants