feat(train): log per-rank HBM around LIBERO eval boundary #294
Merged
Adds a pre/post snapshot of CUDA memory_allocated and memory_reserved around the LIBERO eval block in train.py, plus reset_peak_memory_stats across the eval window. The post-barrier reporter gathers all-rank stats on rank 0 and emits one structured log line per rank with the retained delta. Motivation: we have evidence that ZeRO-2 training crashes (CUDA OOM in backward, or NCCL ALLGATHER timeouts from post-eval rank divergence) correlate with the step-1000 eval boundary, suggesting LIBERO eval is retaining device buffers that the next training backward can't fit around. This probe makes the retained-vs-peak distinction visible in the run log without needing a profiler attach.
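A minimal sketch of what a probe like this could look like, assuming PyTorch's standard allocator counters. The helper names (`snapshot_cuda_memory`, `format_rank_report`) and the `eval_hbm` log layout are hypothetical, chosen here for illustration; they are not taken from train.py, and the probe in this PR may be organized differently.

```python
from typing import Dict, Optional

GIB = 1024 ** 3


def snapshot_cuda_memory() -> Optional[Dict[str, int]]:
    """Read allocator counters for the current device; None when CUDA is absent."""
    try:
        import torch  # lazy import so the sketch also loads on CPU-only hosts
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return {
        "allocated": torch.cuda.memory_allocated(),
        "reserved": torch.cuda.memory_reserved(),
        # Peaks since the last torch.cuda.reset_peak_memory_stats() call.
        "peak_allocated": torch.cuda.max_memory_allocated(),
        "peak_reserved": torch.cuda.max_memory_reserved(),
    }


def format_rank_report(rank: int, pre: Dict[str, int], post: Dict[str, int]) -> str:
    """Build one log line (hypothetical format) with the post - pre retained delta."""
    retained_alloc = post["allocated"] - pre["allocated"]
    retained_reserved = post["reserved"] - pre["reserved"]
    return (
        f"eval_hbm rank={rank} "
        f"pre_alloc_gib={pre['allocated'] / GIB:.2f} "
        f"peak_alloc_gib={post['peak_allocated'] / GIB:.2f} "
        f"post_alloc_gib={post['allocated'] / GIB:.2f} "
        f"retained_alloc_gib={retained_alloc / GIB:+.2f} "
        f"retained_reserved_gib={retained_reserved / GIB:+.2f}"
    )
```

Wired into a training loop, the pre snapshot and `torch.cuda.reset_peak_memory_stats()` would run just before eval, the post snapshot after the `accelerator.wait_for_everyone()` barrier, and the per-rank results would be collected with `gather_object` and logged on rank 0. A retained delta near zero alongside a high peak suggests a transient eval spike; a large positive retained delta suggests buffers surviving into the next training step.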
What this does
Adds a per-rank HBM probe around the LIBERO eval block in src/opentau/scripts/train.py:

- Before eval: snapshot torch.cuda.memory_allocated() / memory_reserved() and call reset_peak_memory_stats() so the peak counters cover only the eval window.
- After the post-eval accelerator.wait_for_everyone() barrier: gather {pre, peak, post} allocated/reserved per rank via gather_object, and emit one structured logging.info line per rank on rank 0 with the post − pre retained delta.

No-op when eval_envs is None (the existing eval gate) or when CUDA is unavailable.

Motivation
We have empirical evidence from an 8-GPU ZeRO-2 training run that crashes correlate with the eval_freq step boundary:

- CUDA OOM failures in accelerator.backward ~3 training steps after LIBERO eval finishes (78.8 GiB used / 310 MiB free on rank 0, vs. the steady-state training peak of ~50 GiB for the same cam=4 ZeRO-2 sdpa+ckpt cell in the matrix from #273, "feat(pi07,fsdp): enable FSDP-FULL_SHARD for full unfreeze + profile_step audit + ZeRO-2 vs FSDP matrix").
- ALLGATHER timeout where only one rank had enqueued the next collective post-eval; the other 7 were stuck in a non-collective code path. Consistent with asymmetric env teardown.

Both patterns are consistent with LIBERO sim envs / vectorized rollouts retaining CUDA buffers across the eval → training boundary, but we can't confirm without a measurement. This probe makes the retained-vs-peak distinction visible directly in the run log every eval_freq steps (~once every 24 min of wall-clock on the current config), so the next failure has actionable data attached.

How it was tested
- pre-commit run --files src/opentau/scripts/train.py: all hooks pass (ruff, ruff-format, pyupgrade, typos, bandit, license-header, secret-scan).
- pytest -m "not gpu" -n auto tests/scripts/: 155 passed locally.

End-to-end verification needs a GPU node and is not covered by CI; on the next training launch with eval_freq > 0 and eval_envs configured, the log will get one new line per rank per eval boundary.

Overhead
Negligible. torch.cuda.memory_* reads poll allocator bookkeeping (no kernel launch, no CUDA sync). gather_object of a ~50-byte dict across 8 ranks is sub-ms and is placed immediately after the existing accelerator.wait_for_everyone() so it doesn't add a sync point. Fires once per eval_freq training steps.

How to checkout & try? (for the reviewer)
CPU sanity:

    pytest -m "not gpu" -n auto tests/scripts/

Inspect the new lines:

    sed -n '700,795p' src/opentau/scripts/train.py

To exercise the probe end-to-end (needs a GPU + LIBERO env):

    opentau-train \
        --accelerate-config configs/examples/accelerate_deepspeed_config.yaml \
        --config_path=configs/examples/pi05_training_config.json \
        --eval_freq=10 --steps=20

Checklist