Verify per-dataset normalization PR end-to-end on GPU

Track the manual verification steps that were skipped on the per-dataset
normalization PR (#336). They require a GPU runner or a multi-GPU node,
which CPU CI cannot exercise.

## What's already covered

CPU CI (1188 passing tests, including new parametrised tests under
`tests/policies/test_normalize_per_dataset.py`,
`tests/policies/test_save_pretrained_skip_stats.py`, and
`tests/datasets/test_tagged_dataset.py`) validates the unit-level
behaviour of:

- the stacked `(D, *feat_shape)` Normalize/Unnormalize buffers and the
  per-sample `index_select` + broadcast path,
- the `_TaggedDataset` wrapper and default-collate batching,
- the save-with-stats / save-without-stats round-trip through
  `safetensors.safe_open`.
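
For readers who have not looked at the PR, the `index_select` + broadcast path in the first bullet can be pictured roughly as follows. This is a minimal sketch and not the code in `policies/normalize.py`: the function name `normalize_per_dataset`, the `eps` term, and the mean/std formulation are assumptions made for illustration.

```python
import torch


def normalize_per_dataset(x, dataset_index, mean, std, eps=1e-8):
    """Normalize each sample with the stats of its source dataset.

    Hypothetical helper, for illustration only.
      x:             (B, *feat_shape) or (B, T, ..., *feat_shape)
      dataset_index: (B,) integer tensor, one dataset id per sample
      mean, std:     (D, *feat_shape) stacked per-dataset statistics
    """
    # Gather one row of statistics per sample: (D, *feat) -> (B, *feat).
    batch_mean = mean.index_select(0, dataset_index)
    batch_std = std.index_select(0, dataset_index)

    # Add singleton axes after the batch axis so the stats broadcast
    # over any extra leading dims of x (for example a time axis).
    for _ in range(x.dim() - batch_mean.dim()):
        batch_mean = batch_mean.unsqueeze(1)
        batch_std = batch_std.unsqueeze(1)

    return (x - batch_mean) / (batch_std + eps)


# Two datasets (D=2), two features; sample 0 comes from dataset 0,
# sample 1 from dataset 1.
mean = torch.tensor([[0.0, 0.0], [10.0, 10.0]])
std = torch.tensor([[1.0, 1.0], [2.0, 2.0]])
x = torch.tensor([[1.0, 2.0], [12.0, 14.0]])
dataset_index = torch.tensor([0, 1])
print(normalize_per_dataset(x, dataset_index, mean, std))
```

Because the gather is purely local to each rank, a sketch like this also shows why no new collectives are expected under DDP.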

The `pytest -m "gpu" -n 0` subset (19 passed / 10 skipped / 1359
deselected) was also run on an internal GPU dev box during PR
preparation and is green on this branch.

## What still needs a real GPU runner

- **Smoke training run** on a small config (e.g.
  `configs/examples/pi05_training_config.json` with `steps=40`, or
  `configs/dev/dev_config.json` with `steps=2`):
  - Confirm `forward` does not trip the new inf-assertion when the
    dataloader pipeline emits `dataset_index` and `dataset_repo_id`.
  - Confirm the per-dataset validation dataloaders iterate without
    `KeyError("dataset_index")`.
  - Inspect `model.safetensors` after a `save_normalization_stats=false`
    save and assert no `normalize_*.buffer_*` /
    `unnormalize_*.buffer_*` keys made it to disk:
    ```bash
    python -c "import re; from safetensors import safe_open; \
               f=safe_open('outputs/.../checkpoints/last/pretrained_model/model.safetensors','pt'); \
               keys=list(f.keys()); \
               bad=[k for k in keys if re.search(r'(un)?normalize_\w+\.buffer_', k)]; \
               assert not bad, bad"
    ```
  - Re-load via `make_policy(cfg, ds_meta=mixture.meta)` and verify
    `_inject_stats` repopulates the buffers and forward succeeds.

- **Determinism check** (per CLAUDE.md rule #3): run the smoke config
  twice with `seed=0` on a single GPU and diff the per-step loss
  series — confirm it is bit-identical. Required because this PR
  touches `policies/normalize.py`, every pi policy's
  `forward`/`sample_actions`, and the datasets pipeline.

- **Distributed sanity** under DDP and DeepSpeed ZeRO-2: launch the
  smoke config with `--num_processes=2` and `--num_processes=8` and
  watch for any NCCL desync. The new `index_select` is a local op that
  adds no collectives, so the risk is low, but it is worth confirming
  on the real backend.

- **Nightly regression suite** (`regression_test.yml` runs on
  g6.12xlarge). If pi05 / pi07 short training runs land within their
  historical loss envelopes after this change, that's the strongest
  signal that the per-dataset path doesn't silently regress a
  single-dataset config (which is the most common case today).
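
For the determinism check above, "bit-identical" is stricter than `==` on floats (for instance `0.0 == -0.0` is true even though the bit patterns differ). One possible way to compare two per-step loss series is sketched below. It assumes the losses have already been extracted into plain Python lists; how they are logged is not specified here, and `first_divergence` is a hypothetical helper name.

```python
import struct


def float_bits(value):
    # Pack as an IEEE-754 double so the comparison is on raw bits.
    # Float32 losses convert to double exactly, so nothing is lost.
    return struct.pack("<d", value)


def first_divergence(losses_a, losses_b):
    """Return the first step where the two runs differ, or None.

    A length mismatch counts as a divergence at the first missing step.
    """
    for step, (a, b) in enumerate(zip(losses_a, losses_b)):
        if float_bits(a) != float_bits(b):
            return step
    if len(losses_a) != len(losses_b):
        return min(len(losses_a), len(losses_b))
    return None


run_a = [0.5, 0.25, 0.125]
run_b = [0.5, 0.25, 0.125]
step = first_divergence(run_a, run_b)
print("bit-identical" if step is None else f"diverged at step {step}")
```

Reporting the first diverging step, rather than a yes/no answer, makes it easier to tell a seeding problem (step 0) from accumulated nondeterminism later in the run.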

## Why this is split out

CPU CI cannot exercise these checks: they need real CUDA collectives
and a real training step. Splitting them out keeps the PR shippable on
green CPU CI while the heavier checks run on a runner that has access
to a multi-GPU node.

## Done when

- [x] `pytest -m "gpu" -n 0` passes on a CUDA box.
- [ ] Smoke train on `configs/dev/dev_config.json` (2 steps) succeeds.
- [ ] Smoke train with `--policy.save_normalization_stats=false` produces
      a safetensors file with no `normalize_*.buffer_*` keys; reload via
      `make_policy(..., ds_meta=...)` succeeds.
- [ ] Seeded determinism: two runs with `seed=0` produce bit-identical
      losses.
- [ ] DDP `--num_processes=2` smoke run completes without NCCL desync.
- [ ] Nightly regression suite is green on the merged commit.
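
The safetensors item in the checklist covers both the `normalize_*` and `unnormalize_*` prefixes. A small reusable filter over the checkpoint's key list could look like the sketch below. `leaked_stat_keys` is a hypothetical helper, and the example key names are made up to match the `normalize_*.buffer_*` pattern rather than copied from a real checkpoint.

```python
from fnmatch import fnmatchcase

# The leading "*" lets this one pattern match "unnormalize_..." keys
# and keys with a module prefix as well.
STAT_KEY_PATTERN = "*normalize_*.buffer_*"


def leaked_stat_keys(keys):
    """Return the keys that look like normalization-stat buffers."""
    return [k for k in keys if fnmatchcase(k, STAT_KEY_PATTERN)]


# Illustrative key names only (assumed, not taken from a checkpoint).
keys = [
    "model.action_head.weight",
    "normalize_inputs.buffer_observation_state.mean",
    "unnormalize_outputs.buffer_action.std",
]
leaked = leaked_stat_keys(keys)
print(leaked)
```

In a real check, `keys` would come from `list(f.keys())` on the file opened with `safetensors.safe_open`, and the checklist item passes when the returned list is empty.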

