TensorAuto · WilliamYue37 · May 7, 2026 · May 6, 2026
diff --git a/configs/examples/annotate_mistakes_example.json b/configs/examples/annotate_mistakes_example.json
@@ -0,0 +1,13 @@
+{
+  "dataset_mixture": {
+    "datasets": [
+      {
+        "repo_id": "RoboCOIN/leju_robot_moving_parts_a",
+        "revision": "v2.1",
+        "data_features_name_mapping": {
+          "camera0": "observation.images.camera_head_rgb"
+        }
+      }
+    ]
+  }
+}
diff --git a/docs/source/tutorials/datasets.rst b/docs/source/tutorials/datasets.rst
@@ -261,6 +261,133 @@ At 1 fps sampling with frames resized to 640 px wide (≈ 410 image tokens each)
 Costs scale linearly with episode count × episode duration.  Use ``--sample-fps 0.5`` or
 lower to halve/quarter costs on longer episodes.
 
+Automatically annotating mistakes with a VLM
+---------------------------------------------
+
+``annotate_mistakes.py`` adds a per-frame ``mistake`` column (``int64`` ∈ ``{0, 1}``) to
+every episode parquet in a dataset mixture, by asking a VLM whether each subtask was
+completed successfully. It runs **after** ``annotate_subtasks.py`` and reuses the same
+mixture config format.
+
+How it works
+^^^^^^^^^^^^
+
+For each episode the script:
+
+1. Reads the per-frame ``response`` column from the episode parquet (written by
+   ``annotate_subtasks.py``). Every contiguous run of identical ``response`` values is
+   treated as one subtask segment.
+2. Decodes the ``camera0`` video once (resolved with the same lookup chain as
+   ``annotate_subtasks.py``: inline ``data_features_name_mapping``, then
+   ``DATA_FEATURES_NAME_MAPPING``, then the first ``dtype=='video'`` feature) and
+   pulls the **last frame of each contiguous run** — no temporal subsampling, just one
+   frame per segment. Frames whose shorter side exceeds ``--target-size`` (default 448)
+   are downsampled and center-cropped before JPEG encoding; smaller frames pass through
+   unchanged.
+3. Sends that single frame plus the segment's subtask string to the configured VLM
+   (default: ``gemini-robotics-er-1.6-preview``; Anthropic Claude is supported via
+   ``--model``) and asks for a ``{"success": bool, "reason": str}`` JSON verdict.
+4. Sets every parquet row in the segment to ``mistake=1`` if the VLM reports failure,
+   ``0`` otherwise. Any parse or API failure also defaults to ``0`` (no mistake), so a
+   ``0`` does not distinguish a verified success from a call that failed.
+5. Atomically rewrites the episode parquet with the new ``mistake`` column and registers
+   it in ``meta/info.json`` features the first time it is added to a dataset.
+
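+The segmentation in step 1 and the label broadcast in step 4 can be sketched in a few
+lines of Python. This is an illustrative sketch, not the script's own code: the helper
+names ``find_segments`` and ``broadcast_labels`` and the sample subtask strings are
+hypothetical.
+
+.. code-block:: python
+
+    from itertools import groupby
+
+    def find_segments(responses):
+        """Return (start, end, text) per contiguous run; end is inclusive."""
+        segments = []
+        index = 0
+        for text, run in groupby(responses):
+            length = sum(1 for _ in run)
+            segments.append((index, index + length - 1, text))
+            index += length
+        return segments
+
+    def broadcast_labels(num_frames, segments, failed):
+        """failed[i] is True when the VLM reported failure for segment i."""
+        mistake = [0] * num_frames
+        for (start, end, _), is_failure in zip(segments, failed):
+            if is_failure:
+                mistake[start:end + 1] = [1] * (end - start + 1)
+        return mistake
+
+    # Hypothetical per-frame ``response`` values for a six-frame episode.
+    responses = ["pick cup", "pick cup", "place cup", "place cup", "place cup", "pick cup"]
+    segments = find_segments(responses)
+    # The last frame of each run is the one that would be sent to the VLM.
+    last_frames = [end for _, end, _ in segments]
+    # Pretend the VLM reported failure for the second segment only.
+    mistake = broadcast_labels(len(responses), segments, [False, True, False])
+
+Note that a subtask string that reappears later in the episode starts a new segment,
+because only adjacent identical values are merged.
+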
+Episodes whose parquet already contains a ``mistake`` column are skipped (cheap O(1)
+schema check), making the script **fully resumable**. Episodes whose parquet has no
+``response`` column are skipped with a warning — run ``annotate_subtasks.py`` first.
+
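+The fail-safe default in step 4 can be illustrated with a small parser. This is a
+sketch under assumptions: the function name ``verdict_to_label`` is hypothetical, and
+the real script may preprocess the model reply differently before parsing it.
+
+.. code-block:: python
+
+    import json
+
+    def verdict_to_label(raw_reply):
+        """Map a raw VLM reply to 0 or 1, defaulting to 0 on any parse failure."""
+        try:
+            verdict = json.loads(raw_reply)
+            success = verdict["success"]
+        except (json.JSONDecodeError, KeyError, TypeError):
+            # Malformed JSON, a missing key, or a non-object reply: no mistake.
+            return 0
+        if not isinstance(success, bool):
+            # Treat a non-boolean "success" field as unparseable.
+            return 0
+        return 0 if success else 1
+
+    failed_label = verdict_to_label('{"success": false, "reason": "cup dropped"}')
+    ok_label = verdict_to_label('{"success": true, "reason": "cup on shelf"}')
+    garbled_label = verdict_to_label("not json at all")
+
+Only an explicit ``"success": false`` produces a ``1`` here; every other outcome
+maps to ``0``.
+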
+Prerequisites
+^^^^^^^^^^^^^
+
+Set the API key for the provider you intend to use:
+
+.. code-block:: bash
+
+    # Gemini (default)
+    export GEMINI_API_KEY="..."   # or GOOGLE_API_KEY
+
+    # Anthropic (when using --model claude-*)
+    export ANTHROPIC_API_KEY="sk-ant-..."
+
+The dataset must already have been processed by ``annotate_subtasks.py`` so that each
+episode parquet has a non-empty ``response`` column.
+
+Running the script
+^^^^^^^^^^^^^^^^^^
+
+Reuse the same dataset mixture config you passed to ``annotate_subtasks.py``.
+A minimal one-dataset example (with the Hub revision pinned to ``v2.1``, since
+this script has only been tested against v2.1 datasets) is checked in at
+``configs/examples/annotate_mistakes_example.json``:
+
+.. code-block:: bash
+
+    python src/opentau/scripts/annotate_mistakes.py \
+        --config-path configs/examples/annotate_mistakes_example.json
+
+For a dry run that processes only one episode per dataset:
+
+.. code-block:: bash
+
+    python src/opentau/scripts/annotate_mistakes.py \
+        --config-path configs/examples/annotate_mistakes_example.json \
+        --max-episodes-per-dataset 1
+
+To annotate with Claude instead of Gemini:
+
+.. code-block:: bash
+
+    ANTHROPIC_API_KEY=... python src/opentau/scripts/annotate_mistakes.py \
+        --config-path configs/examples/annotate_mistakes_example.json \
+        --model claude-opus-4-7
+
+Full list of flags:
+
+.. list-table::
+   :header-rows: 1
+   :widths: 30 15 55
+
+   * - Flag
+     - Default
+     - Description
+   * - ``--config-path``
+     - *(required)*
+     - Path to dataset mixture config JSON.
+   * - ``--target-size``
+     - ``448``
+     - Downsample frames whose shorter side exceeds this many pixels (then center-crop
+       to a square). Frames at or below this size pass through unchanged — never upsamples.
+   * - ``--model``
+     - ``gemini-robotics-er-1.6-preview``
+     - Model ID to use. IDs starting with ``gemini`` or ``robotics-er`` are sent through
+       ``google-genai`` and authenticated with ``GEMINI_API_KEY`` (or ``GOOGLE_API_KEY``);
+       Anthropic IDs (e.g. ``claude-opus-4-7``) are authenticated with ``ANTHROPIC_API_KEY``.
+   * - ``--max-episodes-per-dataset``
+     - *(none)*
+     - Cap the number of episodes processed per dataset — useful for dry runs.
+   * - ``--max-api-retries``
+     - ``8``
+     - Anthropic SDK retry count for 429/5xx responses (ignored for Gemini).
+   * - ``--hub-cache-dir``
+     - ``~/.cache/huggingface/opentau_subtasks``
+     - Directory for caching Hub dataset downloads. The default deliberately matches
+       ``annotate_subtasks.py`` so this script reuses datasets already downloaded by
+       the prior step — pass the same value here if you overrode it there.
+
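+The ``--target-size`` behaviour described in the table comes down to simple geometry.
+The sketch below only computes sizes; the function name ``plan_resize``, the rounding
+rule, and the ``(left, upper, right, lower)`` crop-box layout are assumptions made for
+illustration, not details taken from the script.
+
+.. code-block:: python
+
+    def plan_resize(width, height, target_size=448):
+        """Return (resized_size, crop_box), or None when the frame passes through."""
+        shorter = min(width, height)
+        if shorter <= target_size:
+            # At or below the target: never upsample, leave the frame unchanged.
+            return None
+        scale = target_size / shorter
+        # Scale so the shorter side lands on target_size (guarding against rounding).
+        new_w = max(target_size, round(width * scale))
+        new_h = max(target_size, round(height * scale))
+        # Center a target_size x target_size square inside the resized frame.
+        left = (new_w - target_size) // 2
+        top = (new_h - target_size) // 2
+        return (new_w, new_h), (left, top, left + target_size, top + target_size)
+
+    hd_plan = plan_resize(1280, 720)
+    small_plan = plan_resize(320, 240)
+
+Under these assumptions a 1280x720 frame is scaled to 796x448 and cropped to the
+central 448x448 square, while a 320x240 frame is left as it is.
+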
+Output
+^^^^^^
+
+For each processed episode the script:
+
+- Adds a ``mistake`` column to the episode parquet, where every frame row contains
+  ``0`` (subtask completed successfully, per the VLM) or ``1`` (subtask flagged as a
+  failure). All frames within the same contiguous ``response`` run share the same value.
+- Adds a ``mistake`` feature entry to ``meta/info.json``
+  (``{"dtype": "int64", "shape": [1], "names": null}`` once serialized to JSON).
+
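+Registering the feature only the first time it is added can be sketched as follows.
+The helper name ``register_mistake_feature`` and the in-memory ``info`` dictionary are
+hypothetical; the sketch also shows how a Python tuple and ``None`` appear once the
+entry is written out as JSON.
+
+.. code-block:: python
+
+    import json
+
+    MISTAKE_FEATURE = {"dtype": "int64", "shape": (1,), "names": None}
+
+    def register_mistake_feature(info):
+        """Add the feature once; return True only when ``info`` was changed."""
+        features = info.setdefault("features", {})
+        if "mistake" in features:
+            return False
+        features["mistake"] = dict(MISTAKE_FEATURE)
+        return True
+
+    # Stand-in for the parsed contents of meta/info.json.
+    info = {"features": {}}
+    changed_first = register_mistake_feature(info)
+    changed_again = register_mistake_feature(info)
+    # json.dumps turns the tuple into an array and None into null.
+    serialized = json.dumps(info["features"]["mistake"])
+
+The second call is a no-op, which matches the idea of registering the feature once per
+dataset rather than once per episode.
+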
+To force regeneration of the mistake labels, drop the ``mistake`` column from the
+relevant episode parquets (or delete the cached dataset) before rerunning.
+
 Adding subtask responses to a dataset
 --------------------------------------