feat(train): log per-rank HBM around LIBERO eval boundary by shuheng-liu · Pull Request #294 · TensorAuto/OpenTau

shuheng-liu · 2026-05-12T18:50:21Z

What this does

Adds a per-rank HBM probe around the LIBERO eval block in src/opentau/scripts/train.py:

Just before the eval rollouts: snapshot torch.cuda.memory_allocated() / memory_reserved() and call reset_peak_memory_stats() so the peak counters cover only the eval window.
Just after the post-eval accelerator.wait_for_everyone() barrier: gather {pre, peak, post} allocated/reserved per rank via gather_object, and emit one structured logging.info line per rank on rank 0 with the post − pre retained delta.

No-op when eval_envs is None (the existing eval gate) or when CUDA is unavailable.
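The local, per-rank part of the probe described above can be sketched as follows. This is a minimal illustration, not the code merged in train.py: the helper names (`snapshot_before_eval`, `snapshot_after_eval`) and the dict keys are hypothetical, and the guarded import exists only so the sketch also loads on machines without PyTorch. The `torch.cuda` calls themselves are standard PyTorch APIs.

```python
try:
    import torch
except ImportError:  # assumption: allow the sketch to import without PyTorch
    torch = None


def cuda_ready() -> bool:
    """True only when PyTorch is importable and a CUDA device is usable."""
    return torch is not None and torch.cuda.is_available()


def snapshot_before_eval():
    """Record pre-eval allocated/reserved bytes and reset the peak counters.

    Returns None (probe is a no-op) when CUDA is unavailable.
    """
    if not cuda_ready():
        return None
    pre = {
        "pre_alloc": torch.cuda.memory_allocated(),
        "pre_reserved": torch.cuda.memory_reserved(),
    }
    # After this call, max_memory_* only reflects the eval window.
    torch.cuda.reset_peak_memory_stats()
    return pre


def snapshot_after_eval(pre):
    """Extend the pre-eval snapshot with peak and post-eval byte counts."""
    if pre is None or not cuda_ready():
        return None
    return {
        **pre,
        "peak_alloc": torch.cuda.max_memory_allocated(),
        "peak_reserved": torch.cuda.max_memory_reserved(),
        "post_alloc": torch.cuda.memory_allocated(),
        "post_reserved": torch.cuda.memory_reserved(),
    }
```

In such a sketch the eval rollouts would run between the two calls, and `post_alloc - pre_alloc` is the retained delta the PR reports.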

Motivation

We have empirical evidence from an 8-GPU ZeRO-2 training run that crashes correlate with the eval_freq step boundary:

Two CUDA OOM failures in accelerator.backward ~3 training steps after LIBERO eval finishes (78.8 GiB used / 310 MiB free on rank 0, vs. the steady-state training peak of ~50 GiB for the same cam=4 ZeRO-2 sdpa+ckpt cell in the matrix from #273, "feat(pi07,fsdp): enable FSDP-FULL_SHARD for full unfreeze + profile_step audit + ZeRO-2 vs FSDP matrix").
One NCCL ALLGATHER timeout where only one rank had enqueued the next collective post-eval — the other 7 were stuck in a non-collective code path. Consistent with asymmetric env teardown.

Both patterns are consistent with LIBERO sim envs / vectorized rollouts retaining CUDA buffers across the eval → training boundary, but we can't confirm without a measurement. This probe makes the retained-vs-peak distinction visible directly in the run log every eval_freq steps (~once every 24 min of wall-clock on the current config), so the next failure has actionable data attached.

How it was tested

pre-commit run --files src/opentau/scripts/train.py — all hooks pass (ruff, ruff-format, pyupgrade, typos, bandit, license-header, secret-scan).
pytest -m "not gpu" -n auto tests/scripts/ — 155 passed locally.

End-to-end verification needs a GPU node and is not covered by CI; on the next training launch with eval_freq > 0 and eval_envs configured, the log will get one new line per rank per eval boundary in the form:

Eval HBM probe step=13000 rank=0 pre=50.21/50.78 peak=72.40/78.83 post=58.10/60.22 retained_alloc=+7.89 GiB
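For triaging a run log after the fact, a line of this shape can be parsed back into numbers. The sketch below is a hypothetical helper, not part of the PR; it assumes each `a/b` pair is allocated/reserved in GiB (following the order used in the description) and that the field layout matches the sample line exactly.

```python
import re
from typing import Optional

# Assumed layout, taken from the sample log line in the PR description.
PROBE_RE = re.compile(
    r"Eval HBM probe step=(?P<step>\d+) rank=(?P<rank>\d+) "
    r"pre=(?P<pre_alloc>[\d.]+)/(?P<pre_reserved>[\d.]+) "
    r"peak=(?P<peak_alloc>[\d.]+)/(?P<peak_reserved>[\d.]+) "
    r"post=(?P<post_alloc>[\d.]+)/(?P<post_reserved>[\d.]+) "
    r"retained_alloc=(?P<retained_alloc>[+-][\d.]+) GiB"
)


def parse_probe_line(line: str) -> Optional[dict]:
    """Return the probe fields as numbers, or None if the line does not match."""
    match = PROBE_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    parsed = {key: float(value) for key, value in fields.items()}
    # step and rank are integers; everything else is in GiB.
    parsed["step"] = int(fields["step"])
    parsed["rank"] = int(fields["rank"])
    return parsed


sample = (
    "Eval HBM probe step=13000 rank=0 pre=50.21/50.78 "
    "peak=72.40/78.83 post=58.10/60.22 retained_alloc=+7.89 GiB"
)
parsed = parse_probe_line(sample)
```

For the sample line, `parsed["post_alloc"] - parsed["pre_alloc"]` agrees with `parsed["retained_alloc"]` up to rounding, which is a quick consistency check when scanning many eval boundaries.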

Overhead

Negligible. torch.cuda.memory_* reads only query the caching allocator's bookkeeping (no kernel launch, no CUDA sync). gather_object of a ~50-byte dict across 8 ranks is sub-ms, and because it runs immediately after the existing accelerator.wait_for_everyone() barrier the ranks are already aligned, so it adds no meaningful extra wait. The probe fires once per eval_freq training steps.
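The post-barrier reporting step can be sketched as below. This is an illustration under stated assumptions rather than the merged implementation: `report_eval_hbm` and the stats keys are hypothetical names, and the gather function is injected so the sketch runs without a distributed setup. In a real run one would presumably pass `accelerate.utils.gather_object`, which (to my understanding) expects a list per rank and returns the concatenation, hence the one-element list.

```python
import logging
from typing import Callable

GIB = 2**30  # bytes per GiB


def report_eval_hbm(
    step: int,
    local_stats: dict,
    gather: Callable[[list], list],
    is_main_process: bool,
) -> list:
    """Gather per-rank byte counts and log one line per rank on the main process."""
    # Every rank must call gather (it is a collective), even though only
    # the main process emits log lines.
    all_stats = gather([local_stats])
    if not is_main_process:
        return []
    lines = []
    for s in sorted(all_stats, key=lambda d: d["rank"]):
        retained = (s["post_alloc"] - s["pre_alloc"]) / GIB
        line = (
            f"Eval HBM probe step={step} rank={s['rank']} "
            f"pre={s['pre_alloc'] / GIB:.2f}/{s['pre_reserved'] / GIB:.2f} "
            f"peak={s['peak_alloc'] / GIB:.2f}/{s['peak_reserved'] / GIB:.2f} "
            f"post={s['post_alloc'] / GIB:.2f}/{s['post_reserved'] / GIB:.2f} "
            f"retained_alloc={retained:+.2f} GiB"
        )
        logging.info(line)
        lines.append(line)
    return lines
```

Calling this directly after the existing barrier keeps the gather from waiting on stragglers, since all ranks have just synchronized.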

How to checkout & try? (for the reviewer)

git fetch origin claude/awesome-booth-b2f762
git checkout claude/awesome-booth-b2f762
uv sync --extra dev --extra libero

CPU sanity:

pytest -m "not gpu" -n auto tests/scripts/

Inspect the new lines:

sed -n '700,795p' src/opentau/scripts/train.py

To exercise the probe end-to-end (needs a GPU + LIBERO env):

opentau-train \
    --accelerate-config configs/examples/accelerate_deepspeed_config.yaml \
    --config_path=configs/examples/pi05_training_config.json \
    --eval_freq=10 --steps=20

Checklist

I have added Google-style docstrings to important functions and ensured function parameters are typed.
My PR includes policy-related changes.
- If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

Adds a pre/post snapshot of CUDA memory_allocated and memory_reserved around the LIBERO eval block in train.py, plus reset_peak_memory_stats across the eval window. The post-barrier reporter gathers all-rank stats on rank 0 and emits one structured log line per rank with the retained delta. Motivation: we have evidence that ZeRO-2 training crashes (CUDA OOM in backward, or NCCL ALLGATHER timeouts from post-eval rank divergence) correlate with the step-1000 eval boundary, suggesting LIBERO eval is retaining device buffers that the next training backward can't fit around. This probe makes the retained-vs-peak distinction visible in the run log without needing a profiler attach.

shuheng-liu added the feature New feature or request label May 12, 2026

shuheng-liu self-assigned this May 12, 2026

shuheng-liu marked this pull request as ready for review May 12, 2026 18:52

shuheng-liu merged commit 824a081 into main May 12, 2026
7 checks passed

shuheng-liu deleted the claude/awesome-booth-b2f762 branch May 12, 2026 19:05
