TensorAuto · shuheng-liu · Apr 22, 2026 · Apr 22, 2026 · Apr 23, 2026 · Apr 24, 2026
diff --git a/README.md b/README.md
@@ -45,6 +45,7 @@ OpenTau ($\tau$) is a tool developed by *[Tensor][1]* to bridge this gap, and we
 |                       Visualize dataset with URDF models |            ❌            |                ❌                 |      ✅      |
 |            Simulation Environments for Evaluating Models |            ❌            |                ✅                 |      ✅      |
 |                 Create Validation Splits During Training |            ❌            |                ❌                 |      ✅      |
+|    Drop-in Training Profiler & Unused-Param Auditor      |            ❌            |                ❌                 |      ✅      |
 |    $\pi^{*}_{0.6}$ style Reinforcement Learning Pipeline |            ❌            |                ❌                 |      ✅      |
 |                               Post-training on Human Data|            ❌            |                ❌                 |      ✅      |
 |                                                Framework |      Jax / PyTorch       |             PyTorch               |   PyTorch    |
@@ -59,6 +60,27 @@ For using local notebooks to train and evaluate models, find the notebooks at [n
 
 For using the Google Colab notebooks to train and evaluate models, find the colab notebooks here: [pi05_training](https://colab.research.google.com/drive/1DeU0lNnEzs1KHo0Nkgh4YKBr-xu9moBM?usp=sharing) and [pi05_evaluation_only](https://colab.research.google.com/drive/1U_AyuH9WYMT4anEWvsOtIT7g01jA0WGm?usp=sharing) respectively.
 
+## Training Diagnostics
+
+OpenTau ships three drop-in scripts under `src/opentau/scripts/` to help you figure out where a training run is spending its time. Each reads the same `TrainPipelineConfig` as `opentau-train`, so they reproduce your exact model / dataset / batch size — no reconfiguration needed.
+
+| Script | What it answers |
+|---|---|
+| [`profile_step.py`](https://github.com/TensorAuto/OpenTau/blob/main/src/opentau/scripts/profile_step.py) | Where does each training step's wall-clock go? (forward / backward / optimizer / sync phases, with mean / median / p95) |
+| [`profile_dataloader.py`](https://github.com/TensorAuto/OpenTau/blob/main/src/opentau/scripts/profile_dataloader.py) | Is the dataloader keeping up with the GPUs? (pure input-pipeline ceiling, no model, no collective) |
+| [`find_unused_params.py`](https://github.com/TensorAuto/OpenTau/blob/main/src/opentau/scripts/find_unused_params.py) | Are any parameters dead? (lists params DDP would refuse to sync with `find_unused_parameters=False`) |
+
+A one-command example — see where your training time is going:
+
+```bash
+accelerate launch \
+    --config_file configs/examples/accelerate_ddp_config.yaml \
+    src/opentau/scripts/profile_step.py \
+    --config_path=<your_training_config.json>
+```
+
+Full tutorial with annotated example output and env-var knobs: [docs/tutorials/benchmarking](https://opentau.readthedocs.io/en/latest/tutorials/benchmarking.html). A worked example investigation that used these tools to find and fix a 2.9× throughput regression is tracked in [issue #177](https://github.com/TensorAuto/OpenTau/issues/177).
+
 ## Checkpoints
 We provide fully functioning $\pi_{0.5}$ checkpoints trained with high success rates. We plan to release more models in the near future.
 

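To make the mean / median / p95 columns reported by `profile_step.py` concrete, here is a minimal sketch of how such per-phase statistics could be computed from raw step timings. It is not taken from the script itself: the `summarize` helper, the sample timings, and the nearest-rank definition of p95 are all assumptions made for illustration.

```python
import math
import statistics


def summarize(samples_ms):
    """Return mean, median and nearest-rank p95 for a list of timings (ms)."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    # Nearest-rank percentile (assumed definition): the smallest sample such
    # that at least 95% of all samples are less than or equal to it.
    rank = math.ceil(0.95 * len(ordered))
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
    }


# Hypothetical per-step timings in milliseconds for two phases.
timings = {
    "forward": [10.0, 11.0, 10.5, 30.0],
    "backward": [20.0, 21.0, 19.0, 22.0],
}
report = {phase: summarize(values) for phase, values in timings.items()}
```

In this made-up data a single slow forward step pulls the mean (15.375 ms) well above the median (10.75 ms), which is the kind of straggler a p95 column is meant to expose.
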
diff --git a/configs/examples/accelerate_ddp_config.yaml b/configs/examples/accelerate_ddp_config.yaml
@@ -3,12 +3,11 @@ debug: false
 distributed_type: MULTI_GPU
 downcast_bf16: 'no'
 enable_cpu_affinity: false
-gpu_ids: 0,1
 machine_rank: 0
 main_training_function: main
-mixed_precision: 'no'
+mixed_precision: bf16
 num_machines: 1
-num_processes: 2
+num_processes: 8
 rdzv_backend: static
 same_network: true
 tpu_env: []

diff --git a/configs/examples/add_subtask_response.json b/configs/examples/add_subtask_response.json
@@ -0,0 +1,8 @@
+{
+    "datasets": [
+        {
+            "repo_id": "TensorAuto/ice-lemonade",
+            "root": "/path/to/local/dataset"
+        }
+    ]
+}
diff --git a/configs/examples/attach_metadata_annotations.json b/configs/examples/attach_metadata_annotations.json
@@ -0,0 +1,50 @@
+[
+  {
+    "episode_id": 0,
+    "quality": 3,
+    "segments": [
+      {
+        "start": 0,
+        "subtask": "approach the blue cup",
+        "success": false
+      },
+      {
+        "start": 50,
+        "subtask": "pick up the blue cup",
+        "success": true
+      },
+      {
+        "start": 120,
+        "subtask": "place the cup in the tray",
+        "success": true
+      }
+    ]
+  },
+  {
+    "episode_id": 1,
+    "quality": 5,
+    "segments": [
+      {
+        "start": 0,
+        "subtask": "pick up the bottle",
+        "success": true
+      },
+      {
+        "start": 80,
+        "subtask": "place the bottle in the tray",
+        "success": true
+      }
+    ]
+  },
+  {
+    "episode_id": 7,
+    "quality": 4,
+    "segments": [
+      {
+        "start": 0,
+        "subtask": "reset the workspace",
+        "success": true
+      }
+    ]
+  }
+]
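
The annotation file above pairs an episode-level `quality` score with a list of segments, and the concepts doc defines `mistake` as True when the current segment's `success` flag is False. A minimal sketch of how a frame could be mapped to those labels follows. It assumes that `start` is a frame index, that segments are sorted by `start`, and that each segment runs until the next one begins; `segment_at` and `frame_labels` are hypothetical helpers, not functions from `attach_metadata`.

```python
import bisect
import json

# One episode copied from attach_metadata_annotations.json.
ANNOTATIONS_JSON = """
[
  {"episode_id": 0, "quality": 3, "segments": [
    {"start": 0, "subtask": "approach the blue cup", "success": false},
    {"start": 50, "subtask": "pick up the blue cup", "success": true},
    {"start": 120, "subtask": "place the cup in the tray", "success": true}
  ]}
]
"""


def segment_at(segments, frame):
    """Return the segment covering `frame`: the last one with start <= frame."""
    starts = [seg["start"] for seg in segments]
    index = bisect.bisect_right(starts, frame) - 1
    if index < 0:
        raise ValueError(f"frame {frame} precedes the first segment")
    return segments[index]


def frame_labels(episode, frame):
    """Derive per-frame labels from episode-level and segment-level fields."""
    seg = segment_at(episode["segments"], frame)
    return {
        "subtask": seg["subtask"],
        "mistake": not seg["success"],  # True iff the segment failed
        "quality": episode["quality"],  # same value for every frame
    }


episodes = {ep["episode_id"]: ep for ep in json.loads(ANNOTATIONS_JSON)}
labels_early = frame_labels(episodes[0], 10)
labels_late = frame_labels(episodes[0], 50)
```

Under these assumptions, frame 10 of episode 0 falls in the failed "approach the blue cup" segment, so `mistake` is True; frame 50 opens the next segment, so `mistake` is False.
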
diff --git a/configs/libero/reproduce_pi05_libero_accelerate_config.yaml b/configs/libero/reproduce_pi05_libero_accelerate_config.yaml
@@ -1,20 +1,13 @@
 compute_environment: LOCAL_MACHINE
 debug: false
-deepspeed_config:
-  gradient_accumulation_steps: 1
-  gradient_clipping: 10
-  offload_optimizer_device: none
-  offload_param_device: none
-  zero3_init_flag: false
-  zero_stage: 2
-distributed_type: DEEPSPEED
+distributed_type: MULTI_GPU
 downcast_bf16: 'no'
 enable_cpu_affinity: false
 machine_rank: 0
 main_training_function: main
 mixed_precision: bf16
 num_machines: 1
-num_processes: 4
+num_processes: 8
 rdzv_backend: static
 same_network: true
 tpu_env: []

diff --git a/docs/source/concepts.rst b/docs/source/concepts.rst
@@ -40,6 +40,8 @@ Metadata is crucial for defining the structure and statistics of a dataset. Hand
 
 Metadata is stored in JSON files (``info.json``, ``stats.json``) and JSONL files (``tasks.jsonl``) within the dataset directory.
 
+.. _standard-data-format:
+
 Standard Data Format
 --------------------
 To ensure compatibility across different datasets and policies, OpenTau introduces the **Standard Data Format**.
@@ -105,6 +107,139 @@ The following fields are set in ``DatasetMixtureConfig``:
 
 Cameras should be labeled in order of importance (e.g. camera0 is the most important camera, camera1 is the second most important camera, etc.). The model dataset will select the most important cameras to use if num_cams is less than the number of cameras in the dataset.
 
+.. _standard-data-format-optional-keys:
+
+Optional Standard-Format Keys
+-----------------------------
+
+On top of the core fields above, ``__getitem__`` emits several *optional*
+keys: segment metadata (for datasets enriched via
+:doc:`tutorials/attach_metadata`) and subgoal images sampled from future
+video frames. Every optional key is **always present** in the output. Numeric
+and image keys are paired with an ``{key}_is_pad`` boolean flag: a zero-filled
+value with the flag set to ``True`` means "unavailable or masked". String keys
+(``response``, ``memory``, ``next_memory``) don't get a separate flag:
+the empty string ``""`` is itself the pad signal, which also keeps the
+default PyTorch collate happy (list of strings, same length as batch).
+
+.. code-block:: python
+
+    {
+        # ... core keys above ...
+
+        "memory": str,             # Cumulative subtask summary for the current frame's segment.
+                                   # Empty string ("") when memory_raw is absent
+                                   # (legacy / unannotated dataset).
+        "next_memory": str,        # Memory string for frame t+1 (same as `memory` within a
+                                   # segment, differs at segment boundaries). Clipped at episode
+                                   # end. Empty string when unavailable.
+
+        "speed": torch.LongTensor,     # Scalar; episode length in frames rounded to the nearest multiple of
+                                       # 500 (so short <250-frame episodes bucket to 0). Populated
+                                       # unconditionally from ``info.json`` — available on every
+                                       # LeRobotDataset regardless of whether the dataset went through
+                                       # ``attach_metadata``. Name is historical; think
+                                       # "episode-length bucket".
+        "speed_is_pad": torch.BoolTensor,  # True only when the dataset has no episode-length metadata
+                                           # (pure VQA / legacy fake datasets) or when the metadata drop
+                                           # rolls in _emit_optional_keys fire at training time.
+
+        "mistake": torch.BoolTensor,   # Scalar; True iff the current segment's success flag is False.
+        "mistake_is_pad": torch.BoolTensor,
+
+        "quality": torch.LongTensor,   # Scalar in {1,2,3,4,5}; episode-level quality score.
+        "quality_is_pad": torch.BoolTensor,
+
+        "subgoal0": torch.Tensor,       # shape (3, H, W), values in [0,1]. A single future frame from
+                                        # camera0 sampled either at end-of-segment (with probability
+                                        # `subgoal_end_of_segment_prob`) or uniformly in [t, t+4 seconds].
+        # ...
+        "subgoal{num_cams-1}": torch.Tensor,
+        "subgoal_is_pad": torch.BoolTensor,   # Single flag covering every `subgoalK`. Subgoals are either
+                                              # all present (annotated dataset, not dropped this step) or
+                                              # all padded (legacy dataset, or `subgoal_drop_prob` fired).
+
+        # `response` (already in the core fields) may be replaced with ""
+        # when `response_drop_prob` fires — consumers read "" as masked,
+        # same convention as `memory` / `next_memory`.
+    }
+
+Subgoals are always rank-3 ``(3, H, W)`` regardless of
+``n_obs_history`` — they represent a single future target frame, not a
+temporal window. All camera slots share a single ``subgoal_is_pad``
+flag because subgoals are all-or-none.
+
+Subgoal image **paths** are read from ``meta/info.json`` under the
+``subgoals`` key. When the key is absent (the state of every LeRobot
+dataset today), ``_load_subgoal_frames`` returns ``{}`` and every
+``subgoalK`` tensor comes out zero-filled with ``subgoal_is_pad=True``.
+Datasets opt in to subgoals by adding the key; the loader then uses the
+frame-selection machinery (end-of-segment vs. uniform ``[t, t+4 s]``)
+described below.
+
+Training-time dropout
+^^^^^^^^^^^^^^^^^^^^^
+
+Six probability fields on ``DatasetMixtureConfig`` control the random
+masking (and, for subgoals, frame selection) applied during a single
+``__getitem__`` call. Masks are independent per sample (each call rolls
+fresh). ``DataLoader`` workers seed their own torch RNG, so samples within
+a batch are independent across workers; seed globally via
+``torch.manual_seed(...)`` for reproducibility.
+
+.. list-table::
+   :header-rows: 1
+   :widths: 34 14 52
+
+   * - Field
+     - Default
+     - Effect
+   * - ``history_state_drop_prob``
+     - ``0.3``
+     - Zero-fills ``state`` and historical camera frames (when
+       ``n_obs_history > 1``); sets ``obs_history_is_pad`` all True.
+   * - ``subgoal_drop_prob``
+     - ``0.75``
+     - Zero-fills every ``subgoal{K}`` image together and sets the single
+       shared ``subgoal_is_pad`` flag to True.
+   * - ``subgoal_end_of_segment_prob``
+     - ``0.25``
+     - Probability that a *present* subgoal is sourced from the end of
+       the current segment. Otherwise sampled uniformly in time from
+       the current timestamp through ``t + 4 s`` (clipped at segment
+       end, then episode end).
+   * - ``response_drop_prob``
+     - ``0.3``
+     - Replaces ``response`` with the empty string. Only rolled when
+       subgoals are NOT dropped (dropping both response and subgoals
+       would remove the primary task signal).
+   * - ``metadata_drop_all_prob``
+     - ``0.15``
+     - Masks ``speed``, ``mistake``, and ``quality`` together.
+   * - ``metadata_drop_each_prob``
+     - ``0.05``
+     - Per-field independent mask roll for each of ``speed``,
+       ``mistake``, ``quality``. Only rolled when the shared drop did
+       not fire.
+   * - ``val_enable_optional_key_dropout``
+     - ``False``
+     - Whether the five drop rolls above also fire on the **validation**
+       split. Default is ``False`` so validation metrics aren't
+       artificially noisy. Set to ``True`` if you want the validation
+       distribution to match training. Subgoal *frame* selection
+       (end-of-segment vs. uniform in the next 4 s) stays random either
+       way — only the masking logic is gated.
+
+``make_dataset`` enforces this by giving the validation subset its own
+shallow-copied dataset instance with ``enable_optional_key_dropout``
+flipped accordingly; the underlying ``meta`` / ``hf_dataset`` objects
+are still shared with the training subset, so the extra copy is cheap.
+
+Legacy datasets that have not been passed through
+:mod:`opentau.scripts.attach_metadata` still load: every annotation-derived
+key appears with a zero/empty value and ``_is_pad=True`` (``speed`` is still
+populated from ``info.json``), so policies that consume these fields can train without gating on dataset provenance.
+
 Configs
 -------
 Configuration management is handled using `Draccus <https://github.com/dlwh/draccus>`_.

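The ordering rules in the dropout table of `docs/source/concepts.rst` (the response roll is skipped when subgoals are dropped; per-field metadata rolls happen only when the shared roll did not fire) and the `speed` bucketing can be sketched as below. This is an illustration under stated assumptions, not the code in `_emit_optional_keys`: `DropProbs`, `roll_masks`, and `speed_bucket` are hypothetical names, the order of the rolls is assumed, exact ties in the bucketing are assumed to round up, and `history_state_drop_prob` is left out for brevity.

```python
import random
from dataclasses import dataclass

METADATA_KEYS = ("speed", "mistake", "quality")


@dataclass
class DropProbs:
    # Defaults copied from the table in docs/source/concepts.rst.
    subgoal_drop_prob: float = 0.75
    response_drop_prob: float = 0.3
    metadata_drop_all_prob: float = 0.15
    metadata_drop_each_prob: float = 0.05


def roll_masks(probs, rng, enabled=True):
    """Return {key: True if masked} for one sample (hypothetical helper)."""
    masks = {key: False for key in ("subgoal", "response", *METADATA_KEYS)}
    if not enabled:
        # Mirrors validation with val_enable_optional_key_dropout=False.
        return masks
    masks["subgoal"] = rng.random() < probs.subgoal_drop_prob
    if not masks["subgoal"]:
        # The response roll only happens when the subgoals survived.
        masks["response"] = rng.random() < probs.response_drop_prob
    if rng.random() < probs.metadata_drop_all_prob:
        # Shared roll: all three metadata fields are masked together.
        for key in METADATA_KEYS:
            masks[key] = True
    else:
        # Otherwise each field gets its own independent roll.
        for key in METADATA_KEYS:
            masks[key] = rng.random() < probs.metadata_drop_each_prob
    return masks


def speed_bucket(num_frames):
    """Round to the nearest multiple of 500; exact ties round up (assumed)."""
    return (num_frames + 250) // 500 * 500


rng = random.Random(0)
samples = [roll_masks(DropProbs(), rng) for _ in range(10_000)]
both_dropped = sum(m["subgoal"] and m["response"] for m in samples)
```

By construction `both_dropped` is 0, and with the default probabilities `response` ends up masked in roughly 0.25 × 0.3 = 7.5% of samples, well below the nominal `response_drop_prob` of 0.3.
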
diff --git a/docs/source/installation.rst b/docs/source/installation.rst
@@ -126,4 +126,6 @@ Configure accelerate for your distributed training setup:
 
     accelerate config
 
-This will create an accelerate config file at `~/.cache/huggingface/accelerate/default_config.yaml`. We are currently using DeepSpeed ZeRO2 for model parallelism distributed training. For an accelerate config example, see `this config file <https://github.com/TensorAuto/OpenTau/blob/main/configs/examples/accelerate_deepspeed_config.yaml>`_ used for our CI pipelines.
+This will create an accelerate config file at `~/.cache/huggingface/accelerate/default_config.yaml`. The recommended setup for models that fit in GPU memory (including the pi05 reference policy) is plain DDP with bf16 mixed precision. For an example, see `configs/examples/accelerate_ddp_config.yaml <https://github.com/TensorAuto/OpenTau/blob/main/configs/examples/accelerate_ddp_config.yaml>`_.
+
+A DeepSpeed ZeRO-2 config is also available at `configs/examples/accelerate_deepspeed_config.yaml <https://github.com/TensorAuto/OpenTau/blob/main/configs/examples/accelerate_deepspeed_config.yaml>`_ for memory-constrained scenarios (very large models, long sequences), but note that it can be significantly slower than DDP on mid-sized policies with many small parameter tensors due to per-parameter gradient-reduce hooks. See issue #177 for benchmarks.
diff --git a/docs/source/tutorials.rst b/docs/source/tutorials.rst
@@ -10,8 +10,10 @@ This section provides step-by-step guides for common tasks in OpenTau, including
    tutorials/training
    tutorials/inference
    tutorials/evaluation
+   tutorials/benchmarking
    tutorials/deployment
    tutorials/datasets
+   tutorials/attach_metadata
    tutorials/visualization
    RL
    tutorials/human_demo