13 changes: 13 additions & 0 deletions configs/examples/annotate_mistakes_example.json
@@ -0,0 +1,13 @@
{
    "dataset_mixture": {
        "datasets": [
            {
                "repo_id": "RoboCOIN/leju_robot_moving_parts_a",
                "revision": "v2.1",
                "data_features_name_mapping": {
                    "camera0": "observation.images.camera_head_rgb"
                }
            }
        ]
    }
}
127 changes: 127 additions & 0 deletions docs/source/tutorials/datasets.rst
@@ -261,6 +261,133 @@ At 1 fps sampling with frames resized to 640 px wide (≈ 410 image tokens each)
Costs scale linearly with episode count × episode duration. Use ``--sample-fps 0.5`` or
lower to halve/quarter costs on longer episodes.

Automatically annotating mistakes with a VLM
---------------------------------------------

``annotate_mistakes.py`` adds a per-frame ``mistake`` column (``int64`` ∈ ``{0, 1}``) to
every episode parquet in a dataset mixture, by asking a VLM whether each subtask was
completed successfully. It runs **after** ``annotate_subtasks.py`` and reuses the same
mixture config format.

How it works
^^^^^^^^^^^^

For each episode the script:

1. Reads the per-frame ``response`` column from the episode parquet (written by
   ``annotate_subtasks.py``). Every contiguous run of identical ``response`` values is
   treated as one subtask segment.
2. Decodes the ``camera0`` video once (resolved with the same lookup chain as
   ``annotate_subtasks.py``: inline ``data_features_name_mapping``, then
   ``DATA_FEATURES_NAME_MAPPING``, then the first ``dtype=='video'`` feature) and
   pulls the **last frame of each contiguous run**: no temporal subsampling, just one
   frame per segment. Frames whose shorter side exceeds ``--target-size`` (default 448)
   are downsampled and center-cropped before JPEG encoding; smaller frames pass through
   unchanged.
3. Sends that single frame plus the segment's subtask string to the configured VLM
   (default: ``gemini-robotics-er-1.6-preview``; Anthropic Claude is supported via
   ``--model``) and asks for a ``{"success": bool, "reason": str}`` JSON verdict.
4. Sets every parquet row in the segment to ``mistake=1`` if the VLM reports failure,
   ``0`` otherwise. Any parse / API failure defaults to ``0`` (no mistake).
5. Atomically rewrites the episode parquet with the new ``mistake`` column and registers
   it in ``meta/info.json`` features the first time it is added to a dataset.
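
Steps 1, 3, and 4 can be sketched in plain Python. This is a minimal illustration
under stated assumptions, not the code in ``annotate_mistakes.py``: the helper names
(``find_segments``, ``parse_verdict``, ``label_frames``) and the ``verdict_for``
callback are hypothetical, and the VLM call is replaced by canned reply strings.

```python
import json
from itertools import groupby
from typing import Callable, List, Tuple


def find_segments(responses: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end_inclusive, subtask) for each contiguous run of equal values."""
    segments = []
    start = 0
    for subtask, run in groupby(responses):
        length = sum(1 for _ in run)
        segments.append((start, start + length - 1, subtask))
        start += length
    return segments


def parse_verdict(raw: str) -> int:
    """Map a raw reply to a mistake label; anything unparseable yields 0 (no mistake)."""
    try:
        verdict = json.loads(raw)
        # Only an explicit JSON `false` counts as a reported failure.
        return 1 if verdict["success"] is False else 0
    except (json.JSONDecodeError, KeyError, TypeError):
        return 0


def label_frames(responses: List[str],
                 verdict_for: Callable[[int, str], str]) -> List[int]:
    """Broadcast one verdict per segment to every frame in that segment."""
    mistakes = [0] * len(responses)
    for start, end, subtask in find_segments(responses):
        # `end` is the last frame of the run: the single frame judged for this segment.
        label = parse_verdict(verdict_for(end, subtask))
        mistakes[start:end + 1] = [label] * (end - start + 1)
    return mistakes


# Hypothetical episode: the second "pick cup" run is its own segment.
responses = ["pick cup", "pick cup", "place cup", "place cup", "place cup", "pick cup"]
fake_replies = {
    1: '{"success": true, "reason": "cup is held"}',
    4: '{"success": false, "reason": "cup fell over"}',
    5: "not json",  # a parse failure falls back to 0
}
mistakes = label_frames(responses, lambda idx, subtask: fake_replies[idx])
print(mistakes)
```

Note that a subtask string that reappears later in the episode starts a new segment,
because segments are defined by contiguity rather than by value.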

Episodes whose parquet already contains a ``mistake`` column are skipped (cheap O(1)
schema check), making the script **fully resumable**. Episodes whose parquet has no
``response`` column are skipped with a warning — run ``annotate_subtasks.py`` first.
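
The skip rules above amount to a decision on the parquet's column names alone. A
hedged sketch follows; the function name ``episode_action`` and its return strings are
hypothetical, and checking ``mistake`` before ``response`` is an assumption about
ordering that the text does not spell out.

```python
from typing import Iterable


def episode_action(column_names: Iterable[str]) -> str:
    """Decide what to do with an episode from its parquet column names."""
    names = set(column_names)
    if "mistake" in names:
        return "skip: already annotated"
    if "response" not in names:
        return "skip: run annotate_subtasks.py first"
    return "annotate"


# The names can be read from the parquet footer without loading any rows, for
# example with pyarrow.parquet.read_schema(path).names (one possible way to do it).
print(episode_action(["frame_index", "response"]))
```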

Prerequisites
^^^^^^^^^^^^^

Set the API key for the provider you intend to use:

.. code-block:: bash

   # Gemini (default)
   export GEMINI_API_KEY="..."          # or GOOGLE_API_KEY

   # Anthropic (when using --model claude-*)
   export ANTHROPIC_API_KEY="sk-ant-..."

The dataset must already have been processed by ``annotate_subtasks.py`` so that each
episode parquet has a non-empty ``response`` column.
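
A quick pre-flight check can confirm that the right key is set before a long run.
The sketch below is illustrative only: ``api_key_for`` is a hypothetical helper, the
prefix rules follow the ``--model`` description in this section, and preferring
``GEMINI_API_KEY`` over ``GOOGLE_API_KEY`` and rejecting unrecognized prefixes are
assumptions.

```python
import os
from typing import Mapping


def api_key_for(model: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the API key the given model ID would need, or raise if it is unset."""
    if model.startswith(("gemini", "robotics-er")):
        candidates = ("GEMINI_API_KEY", "GOOGLE_API_KEY")
    elif model.startswith("claude"):
        candidates = ("ANTHROPIC_API_KEY",)
    else:
        raise ValueError(f"unrecognized model prefix: {model!r}")
    for name in candidates:
        if env.get(name):
            return env[name]
    raise RuntimeError(f"set one of {', '.join(candidates)} before running")


# Example with an explicit mapping instead of the real environment.
print(api_key_for("claude-opus-4-7", {"ANTHROPIC_API_KEY": "sk-ant-example"}))
```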

Running the script
^^^^^^^^^^^^^^^^^^

Reuse the same dataset mixture config you passed to ``annotate_subtasks.py``.
A minimal one-dataset example (with the Hub revision pinned to ``v2.1``, since
this script has only been tested against v2.1 datasets) is checked in at
``configs/examples/annotate_mistakes_example.json``:

.. code-block:: bash

   python src/opentau/scripts/annotate_mistakes.py \
       --config-path configs/examples/annotate_mistakes_example.json

For a dry run that processes only 1 episode per dataset:

.. code-block:: bash

   python src/opentau/scripts/annotate_mistakes.py \
       --config-path configs/examples/annotate_mistakes_example.json \
       --max-episodes-per-dataset 1

To annotate with Claude instead of Gemini:

.. code-block:: bash

   ANTHROPIC_API_KEY=... python src/opentau/scripts/annotate_mistakes.py \
       --config-path configs/examples/annotate_mistakes_example.json \
       --model claude-opus-4-7

Full list of flags:

.. list-table::
   :header-rows: 1
   :widths: 30 15 55

   * - Flag
     - Default
     - Description
   * - ``--config-path``
     - *(required)*
     - Path to dataset mixture config JSON.
   * - ``--target-size``
     - ``448``
     - Downsample frames whose shorter side exceeds this many pixels (then center-crop
       to a square). Frames at or below this size pass through unchanged; the script
       never upsamples.
   * - ``--model``
     - ``gemini-robotics-er-1.6-preview``
     - Model ID to use. IDs starting with ``gemini`` or ``robotics-er`` go through
       ``GEMINI_API_KEY`` (or ``GOOGLE_API_KEY``) via ``google-genai``; Anthropic IDs
       (e.g. ``claude-opus-4-7``) go through ``ANTHROPIC_API_KEY``.
   * - ``--max-episodes-per-dataset``
     - *(none)*
     - Cap the number of episodes processed per dataset; useful for dry runs.
   * - ``--max-api-retries``
     - ``8``
     - Anthropic SDK retry count for 429/5xx responses (ignored for Gemini).
   * - ``--hub-cache-dir``
     - ``~/.cache/huggingface/opentau_subtasks``
     - Directory for caching Hub dataset downloads. The default deliberately matches
       ``annotate_subtasks.py`` so this script reuses datasets already downloaded by
       the prior step. If you overrode it there, pass the same value here.
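
The ``--target-size`` rule in the table can be illustrated as a pure geometry
computation. This is a sketch under assumptions, not the script's implementation:
``plan_resize`` is a hypothetical name, and it assumes the shorter side is scaled to
exactly the target before a square center crop is taken.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def plan_resize(width: int, height: int, target: int = 448) -> Tuple[Tuple[int, int], Box]:
    """Return the resized (width, height) and the crop box to apply afterwards."""
    if min(width, height) <= target:
        # At or below the target: pass through unchanged, never upsample or crop.
        return (width, height), (0, 0, width, height)
    scale = target / min(width, height)
    # max() guards against float rounding leaving the shorter side one pixel short.
    new_w = max(target, round(width * scale))
    new_h = max(target, round(height * scale))
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return (new_w, new_h), (left, top, left + target, top + target)


print(plan_resize(1280, 720))  # a 720p frame is downsampled, then cropped to a square
print(plan_resize(320, 240))   # a small frame passes through untouched
```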

Output
^^^^^^

For each processed episode the script:

- Adds a ``mistake`` column to the episode parquet, where every frame row contains
  ``0`` (subtask completed successfully, per the VLM) or ``1`` (subtask flagged as a
  failure). All frames within the same contiguous ``response`` run share the same value.
- Adds a ``mistake`` feature entry to ``meta/info.json``
  (``{"dtype": "int64", "shape": [1], "names": null}`` as serialized JSON).
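
Registering the feature once per dataset could look like the sketch below. It uses
only the standard library; ``register_mistake_feature`` is a hypothetical name, and
writing ``info.json`` through a temporary file plus ``os.replace`` is an assumption
borrowed from the atomic parquet rewrite described earlier.

```python
import json
import os
import tempfile

# Feature entry as it would appear after JSON serialization (None becomes null).
MISTAKE_FEATURE = {"dtype": "int64", "shape": [1], "names": None}


def register_mistake_feature(info_path: str) -> bool:
    """Add the mistake feature if missing; return True when the file was rewritten."""
    with open(info_path, encoding="utf-8") as f:
        info = json.load(f)
    features = info.setdefault("features", {})
    if "mistake" in features:
        return False  # already registered, leave the file untouched
    features["mistake"] = dict(MISTAKE_FEATURE)

    # Write to a temp file in the same directory, then swap it into place so a
    # crash mid-write cannot leave a truncated info.json behind.
    directory = os.path.dirname(os.path.abspath(info_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(info, f, indent=4)
        os.replace(tmp_path, info_path)
    finally:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
    return True
```

Calling the function a second time on the same file returns ``False`` and leaves the
file as it is, which matches the "first time it is added" behavior described above.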

To force regeneration of the mistake labels, drop the ``mistake`` column from the
relevant episode parquets (or delete the cached dataset) before rerunning.

Adding subtask responses to a dataset
--------------------------------------
