
feat(train): log per-rank HBM around LIBERO eval boundary #294

Merged: shuheng-liu merged 1 commit into main from claude/awesome-booth-b2f762 on May 12, 2026
@shuheng-liu shuheng-liu commented May 12, 2026

What this does

Adds a per-rank HBM probe around the LIBERO eval block in src/opentau/scripts/train.py:

  • Just before the eval rollouts: snapshot torch.cuda.memory_allocated() / memory_reserved() and call reset_peak_memory_stats() so the peak counters cover only the eval window.
  • Just after the post-eval accelerator.wait_for_everyone() barrier: gather {pre, peak, post} allocated/reserved per rank via gather_object, and emit one structured logging.info line per rank on rank 0 with the post − pre retained delta.

No-op when eval_envs is None (the existing eval gate) or when CUDA is unavailable.
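The two steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code merged into train.py: the helper names (`snapshot_hbm`, `begin_eval_window`, `end_eval_window`, `format_probe_line`, `report_on_rank0`) are hypothetical, and the sketch assumes the peak values come from `torch.cuda.max_memory_allocated()` / `max_memory_reserved()` and that `gather_object` is the one exported by `accelerate.utils`.

```python
from __future__ import annotations

import logging

GIB = 1024 ** 3


def snapshot_hbm() -> dict[str, float] | None:
    """Current allocated/reserved GiB on this rank, or None without CUDA."""
    import torch  # imported lazily so the formatter below works without PyTorch

    if not torch.cuda.is_available():
        return None
    return {
        "alloc": torch.cuda.memory_allocated() / GIB,
        "reserved": torch.cuda.memory_reserved() / GIB,
    }


def begin_eval_window() -> dict[str, float] | None:
    """Take the pre-eval snapshot and reset peaks so they cover only eval."""
    import torch

    pre = snapshot_hbm()
    if pre is not None:
        torch.cuda.reset_peak_memory_stats()
    return pre


def end_eval_window(pre: dict[str, float] | None) -> dict | None:
    """Collect {pre, peak, post} for this rank after the eval rollouts."""
    import torch

    if pre is None:
        return None
    peak = {
        "alloc": torch.cuda.max_memory_allocated() / GIB,
        "reserved": torch.cuda.max_memory_reserved() / GIB,
    }
    return {"pre": pre, "peak": peak, "post": snapshot_hbm()}


def format_probe_line(step: int, rank: int, stats: dict) -> str:
    """Render one log line; each pair is allocated/reserved in GiB."""
    pre, peak, post = stats["pre"], stats["peak"], stats["post"]
    retained = post["alloc"] - pre["alloc"]  # the post - pre retained delta
    return (
        f"Eval HBM probe step={step} rank={rank} "
        f"pre={pre['alloc']:.2f}/{pre['reserved']:.2f} "
        f"peak={peak['alloc']:.2f}/{peak['reserved']:.2f} "
        f"post={post['alloc']:.2f}/{post['reserved']:.2f} "
        f"retained_alloc={retained:+.2f} GiB"
    )


def report_on_rank0(accelerator, step: int, stats: dict | None) -> None:
    """Gather every rank's stats; the main process logs one line per rank."""
    from accelerate.utils import gather_object

    # Wrapping in a list makes the gathered result one entry per rank.
    all_stats = gather_object([stats])
    if accelerator.is_main_process:
        for rank, rank_stats in enumerate(all_stats):
            if rank_stats is not None:
                logging.info(format_probe_line(step, rank, rank_stats))
```

In this sketch, `begin_eval_window()` would run just before the rollouts, and `end_eval_window()` followed by `report_on_rank0()` would run after the post-eval `accelerator.wait_for_everyone()`. Every rank has to call `report_on_rank0()` because the gather is a collective operation.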

Motivation

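We have empirical evidence from an 8-GPU ZeRO-2 training run that crashes correlate with the eval_freq step boundary. As the commit message below describes, they take two forms: CUDA OOM in backward, and NCCL ALLGATHER timeouts from post-eval rank divergence.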

Both patterns are consistent with LIBERO sim envs / vectorized rollouts retaining CUDA buffers across the eval → training boundary, but we can't confirm without a measurement. This probe makes the retained-vs-peak distinction visible directly in the run log every eval_freq steps (~once every 24 min of wall-clock on the current config), so the next failure has actionable data attached.

How it was tested

  • pre-commit run --files src/opentau/scripts/train.py — all hooks pass (ruff, ruff-format, pyupgrade, typos, bandit, license-header, secret-scan).
  • pytest -m "not gpu" -n auto tests/scripts/ — 155 passed locally.

End-to-end verification needs a GPU node and is not covered by CI; on the next training launch with eval_freq > 0 and eval_envs configured, the log will get one new line per rank per eval boundary in the form:

Eval HBM probe step=13000 rank=0 pre=50.21/50.78 peak=72.40/78.83 post=58.10/60.22 retained_alloc=+7.89 GiB
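To turn these lines into numbers when triaging a failed run, something like the parser below could be used. It is a sketch that assumes every probe line follows exactly the format of the example above; `parse_probe_line` and `transient_alloc` are hypothetical names, not part of this PR.

```python
from __future__ import annotations

import re

# Mirrors the example line: each pair is allocated/reserved in GiB.
PROBE_RE = re.compile(
    r"Eval HBM probe step=(?P<step>\d+) rank=(?P<rank>\d+) "
    r"pre=(?P<pre_alloc>[\d.]+)/(?P<pre_reserved>[\d.]+) "
    r"peak=(?P<peak_alloc>[\d.]+)/(?P<peak_reserved>[\d.]+) "
    r"post=(?P<post_alloc>[\d.]+)/(?P<post_reserved>[\d.]+) "
    r"retained_alloc=(?P<retained_alloc>[+-]?[\d.]+) GiB"
)


def parse_probe_line(line: str) -> dict[str, float] | None:
    """Extract the probe fields from one log line, or None if it is not one."""
    match = PROBE_RE.search(line)
    if match is None:
        return None
    fields: dict[str, float] = {k: float(v) for k, v in match.groupdict().items()}
    fields["step"] = int(match["step"])
    fields["rank"] = int(match["rank"])
    return fields


def transient_alloc(fields: dict[str, float]) -> float:
    """GiB that eval used at its peak but released again (peak - post)."""
    return fields["peak_alloc"] - fields["post_alloc"]
```

For the example line, the retained delta is 58.10 − 50.21 = 7.89 GiB, while the transient part (peak − post) is 72.40 − 58.10 = 14.30 GiB. A large retained delta points at buffers surviving into training. A large transient part with a small retained delta points at pressure only during eval itself.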

Overhead

Negligible. torch.cuda.memory_* reads poll allocator bookkeeping (no kernel launch, no CUDA sync). gather_object of a ~50-byte dict across 8 ranks is sub-ms and is placed immediately after the existing accelerator.wait_for_everyone() so it doesn't add a sync point. Fires once per eval_freq training steps.

How to checkout & try? (for the reviewer)

git fetch origin claude/awesome-booth-b2f762
git checkout claude/awesome-booth-b2f762
uv sync --extra dev --extra libero

CPU sanity:

pytest -m "not gpu" -n auto tests/scripts/

Inspect the new lines:

sed -n '700,795p' src/opentau/scripts/train.py

To exercise the probe end-to-end (needs a GPU + LIBERO env):

opentau-train \
    --accelerate-config configs/examples/accelerate_deepspeed_config.yaml \
    --config_path=configs/examples/pi05_training_config.json \
    --eval_freq=10 --steps=20

Checklist

  • I have added Google-style docstrings to important functions and ensured function parameters are typed.
  • My PR includes policy-related changes.
    • If the above is checked: I have run the GPU pytests (pytest -m "gpu") and regression tests.

Adds a pre/post snapshot of CUDA memory_allocated and memory_reserved
around the LIBERO eval block in train.py, plus reset_peak_memory_stats
across the eval window. The post-barrier reporter gathers all-rank stats
on rank 0 and emits one structured log line per rank with the retained
delta.

Motivation: we have evidence that ZeRO-2 training crashes (CUDA OOM in
backward, or NCCL ALLGATHER timeouts from post-eval rank divergence)
correlate with the step-1000 eval boundary, suggesting LIBERO eval is
retaining device buffers that the next training backward can't fit
around. This probe makes the retained-vs-peak distinction visible in
the run log without needing a profiler attach.
@shuheng-liu shuheng-liu added the feature label (New feature or request) May 12, 2026
@shuheng-liu shuheng-liu self-assigned this May 12, 2026
@shuheng-liu shuheng-liu marked this pull request as ready for review May 12, 2026 18:52
@shuheng-liu shuheng-liu merged commit 824a081 into main May 12, 2026
7 checks passed
@shuheng-liu shuheng-liu deleted the claude/awesome-booth-b2f762 branch May 12, 2026 19:05