fix: Release gradient memory after policy training #1147
Conversation
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Pull Request Overview
This PR fixes a memory issue by releasing gradient memory after policy training, preventing out-of-memory (OOM) errors that occur at the beginning of second-step rollouts when accumulated gradients are never freed.
- Adds `optimizer.zero_grad()` calls after training loops to release gradient memory
- Prevents OOM errors during subsequent rollout phases
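The review below recommends passing `set_to_none=True` so that gradient buffers are actually dropped rather than overwritten with zeros. The difference can be sketched with a minimal pure-Python mock of PyTorch's optimizer semantics (`_MockParam` and `_MockOptimizer` are illustrative stand-ins, not real APIs):

```python
class _MockParam:
    """Stand-in for a parameter tensor; .grad models the gradient buffer."""
    def __init__(self):
        self.grad = [0.0] * 4  # pretend gradient buffer left over from backward()

class _MockOptimizer:
    """Minimal stand-in for torch.optim.Optimizer.zero_grad semantics."""
    def __init__(self, params):
        self.params = params

    def zero_grad(self, set_to_none=False):
        for p in self.params:
            if set_to_none:
                p.grad = None                     # buffer is released to the allocator
            elif p.grad is not None:
                p.grad = [0.0] * len(p.grad)      # buffer stays allocated, just zeroed

params = [_MockParam() for _ in range(3)]
opt = _MockOptimizer(params)

opt.zero_grad()                  # zeros in place: memory is still held
held = sum(p.grad is not None for p in params)
opt.zero_grad(set_to_none=True)  # drops the buffers: memory is released
freed = sum(p.grad is None for p in params)
print(held, freed)  # 3 3
```

In real PyTorch, `set_to_none=True` has been the default since 2.0, but passing it explicitly documents the intent of freeing memory before rollouts.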
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| nemo_rl/models/policy/dtensor_policy_worker_v2.py | Adds gradient memory cleanup after training loop completion |
| nemo_rl/models/policy/dtensor_policy_worker.py | Adds gradient memory cleanup after training loop completion |
ℹ️ File Consistency Check based on commit: 1571f57 (PR #1147). This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.
Actionable comments posted: 0
🧹 Nitpick comments (6)
nemo_rl/models/policy/dtensor_policy_worker_v2.py (3)
804-805: Use `zero_grad(set_to_none=True)` to actually free gradient memory. This better matches the PR goal and reduces allocator pressure vs writing zeros.

```diff
-# release gradient memory before rollouts
-self.optimizer.zero_grad()
+# release gradient memory before rollouts
+self.optimizer.zero_grad(set_to_none=True)
```
542-543: Make the earlier `zero_grad` consistent for memory release. Same rationale; this prevents carrying grads between rollouts.

```diff
-self.optimizer.zero_grad()
+self.optimizer.zero_grad(set_to_none=True)
```
777-780: Guard metric accumulation for dummy microbatches. `num_valid_samples` can bleed from a prior mb; only append for real mbs.

```diff
-if num_valid_samples > 0:
-    mb_losses.append(loss.item())
-    all_mb_metrics.append(loss_metrics)
+# Only keep metrics for non-dummy microbatches with valid samples
+if mb_idx < iterator_len and loss_metrics.get("num_valid_samples", 0) > 0:
+    mb_losses.append(loss.item())
+    all_mb_metrics.append(loss_metrics)
```

nemo_rl/models/policy/dtensor_policy_worker.py (3)
860-861: Use `zero_grad(set_to_none=True)` for real gradient memory release. Aligns with the PR intent and PyTorch best practice.

```diff
-# release gradient memory before rollouts
-self.optimizer.zero_grad()
+# release gradient memory before rollouts
+self.optimizer.zero_grad(set_to_none=True)
```
598-599: Apply `set_to_none=True` at the start of each rollout as well. Ensures grads from the prior step aren't retained.

```diff
-self.optimizer.zero_grad()
+self.optimizer.zero_grad(set_to_none=True)
```
833-836: Avoid leaking `num_valid_samples` across dummy microbatches. Append metrics only for non-dummy mbs.

```diff
-if num_valid_samples > 0:
-    mb_losses.append(loss.item())
-    all_mb_metrics.append(loss_metrics)
+if mb_idx < iterator_len and loss_metrics.get("num_valid_samples", 0) > 0:
+    mb_losses.append(loss.item())
+    all_mb_metrics.append(loss_metrics)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- nemo_rl/models/policy/dtensor_policy_worker.py (1 hunks)
- nemo_rl/models/policy/dtensor_policy_worker_v2.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Post submodule check comment / Comment on PR
parthchadha
left a comment
LGTM, thanks for the fix.
Signed-off-by: Jarno Seppänen <jseppanen@nvidia.com>
Force-pushed 1571f57 to cef1c63
ℹ️ File Consistency Check based on commit: cef1c63 (PR #1147)
✅ DTensor Policy Worker Synchronization Check: both DTensor policy worker files were modified in this PR. Please ensure that the changes are consistent between both files where applicable. This check ensures that related file implementations remain synchronized across the codebase. If you believe this warning is incorrect or the files should intentionally differ, please add a comment explaining the reasoning.
Signed-off-by: Jarno Seppänen <jseppanen@nvidia.com> Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Signed-off-by: Jarno Seppänen <jseppanen@nvidia.com>
Signed-off-by: Jarno Seppänen <jseppanen@nvidia.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
What does this PR do ?
Release gradient memory after policy training to avoid OOM during rollouts.
Issues
Currently, training is likely to OOM at the beginning of the second-step rollouts because gradient memory is not freed.
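The failure mode can be illustrated with a toy memory-budget model: the first step's rollout fits, but the second step's rollout exceeds the budget because the first step's gradients were never released. All numbers here are illustrative, not measurements from the real run:

```python
BUDGET = 100   # pretend total GPU memory units
GRADS = 40     # gradient memory allocated by a training phase
ROLLOUT = 50   # transient memory needed by a rollout phase

def run_step(resident, free_grads_after_train):
    """One train+rollout step; returns memory still resident afterwards."""
    resident += GRADS                    # backward() allocates gradients
    if free_grads_after_train:
        resident -= GRADS                # models optimizer.zero_grad(set_to_none=True)
    if resident + ROLLOUT > BUDGET:      # rollout peak exceeds the budget
        raise MemoryError("OOM during rollout")
    return resident

# Without the fix: step 1 fits (40 + 50 <= 100), but step 2's rollout OOMs
# because step-1 gradients are still resident (80 + 50 > 100).
resident, oom_step = 0, None
for step in (1, 2):
    try:
        resident = run_step(resident, free_grads_after_train=False)
    except MemoryError:
        oom_step = step
        break
print("without fix, OOM at step:", oom_step)  # 2

# With the fix, every rollout starts from a clean budget.
resident = 0
for step in (1, 2):
    resident = run_step(resident, free_grads_after_train=True)
print("with fix: both steps complete")
```

This matches the symptom described above: the OOM appears only at the start of the second step's rollouts, once stale gradients stack on top of rollout memory.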