
Conversation

Collaborator

@erictang000 erictang000 commented Oct 23, 2025

Overview

Previously, gradient offloading happened together with parameter offloading. It should instead happen when offloading the optimizer state, since most commonly we want only the model weights to stay on GPU for weight syncing.
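As a rough sketch of the intended sequencing (the optimizer-state helpers named here are illustrative assumptions, not necessarily this PR's exact API; only offload_megatron_grads_to_cpu / load_megatron_grads_to_gpu appear in the diff below):

def sleep_for_weight_sync(models, optimizer):
    # Keep parameters resident on GPU for weight syncing; move only the
    # optimizer state and the gradient buffers off-device.
    offload_megatron_optimizer(optimizer)      # hypothetical optimizer-state offload helper
    offload_megatron_grads_to_cpu(models)      # grad offload now rides with optimizer offload

def wake_for_training(models, optimizer):
    # Restore optimizer state and re-allocate gradient buffers before the next update.
    load_megatron_optimizer(optimizer)         # hypothetical counterpart
    load_megatron_grads_to_gpu(models)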


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a bug in Megatron's distributed checkpointing by replacing the persistent backend with a manual enforcement of the 'spawn' start method, which is a more robust approach for environments like Ray. The changes also include refactoring gradient offloading logic and adding memory debugging prints and tests.

I've identified a critical bug in the new gradient offloading/loading implementation that would prevent gradients from being correctly restored. Additionally, I found some issues in the new test logic, including a critical flaw that invalidates the checkpoint test and some incorrect variable usage in debug prints. I've also included a couple of medium-severity suggestions for code style and maintainability.

Comment on lines 111 to 144
def offload_megatron_grads_to_cpu(models):
    for model_chunk in models:
        if isinstance(model_chunk, DDP):
            model_chunk_all_buffers = [model_chunk.buffers, model_chunk.expert_parallel_buffers]
            for buffers in model_chunk_all_buffers:
                for buffer in buffers:
                    if buffer.grad_data.storage().size() > 0:
                        buffer.grad_data.storage().resize_(0)
        else:
            # we need this for ref module
            for _, param in model_chunk.named_parameters():
                if param.grad is not None:
                    param.grad = param.grad.to("cpu", non_blocking=True)
    gc.collect()
    torch.cuda.empty_cache()


@torch.no_grad()
def load_megatron_grads_to_gpu(models):
    for model_chunk in models:
        if isinstance(model_chunk, DDP):
            model_chunk_all_buffers = [model_chunk.buffers, model_chunk.expert_parallel_buffers]
            for buffers in model_chunk_all_buffers:
                for buffer in buffers:
                    if buffer.grad_data.storage().size() > 0:
                        buffer.grad_data.storage().resize_(buffer.grad_data_size)
                        buffer.grad_data.zero_()
        else:
            # we need this for ref module
            for _, param in model_chunk.named_parameters():
                if param.grad is not None:
                    param.grad = param.grad.to(torch.cuda.current_device(), non_blocking=True)
    gc.collect()
    torch.cuda.empty_cache()

critical

There's a critical issue in the new gradient offloading/loading logic that will prevent gradients from being correctly restored for DDP models:

  1. offload_megatron_grads_to_cpu does not save buffer.grad_data_size before resizing the gradient buffer to zero. This size is necessary for load_megatron_grads_to_gpu to restore the buffer.
  2. load_megatron_grads_to_gpu uses a condition if buffer.grad_data.storage().size() > 0, which will always be false for offloaded gradients (since their storage size is 0). This prevents the gradients from ever being loaded back to the GPU.

This will cause gradients to be lost and break training. The suggestion below restores the correct logic.

@torch.no_grad()
def offload_megatron_grads_to_cpu(models):
    for model_chunk in models:
        if isinstance(model_chunk, DDP):
            model_chunk_all_buffers = [model_chunk.buffers, model_chunk.expert_parallel_buffers]
            for buffers in model_chunk_all_buffers:
                for buffer in buffers:
                    if buffer.grad_data.storage().size() > 0:
                        buffer.grad_data_size = buffer.grad_data.storage().size()
                        buffer.grad_data.storage().resize_(0)
        else:
            # we need this for ref module
            for _, param in model_chunk.named_parameters():
                if param.grad is not None:
                    param.grad = param.grad.to("cpu", non_blocking=True)
    gc.collect()
    torch.cuda.empty_cache()


@torch.no_grad()
def load_megatron_grads_to_gpu(models):
    for model_chunk in models:
        if isinstance(model_chunk, DDP):
            model_chunk_all_buffers = [model_chunk.buffers, model_chunk.expert_parallel_buffers]
            for buffers in model_chunk_all_buffers:
                for buffer in buffers:
                    if hasattr(buffer, 'grad_data_size') and buffer.grad_data.storage().size() == 0:
                        buffer.grad_data.storage().resize_(buffer.grad_data_size)
                        buffer.grad_data.zero_()
        else:
            # we need this for ref module
            for _, param in model_chunk.named_parameters():
                if param.grad is not None:
                    param.grad = param.grad.to(torch.cuda.current_device(), non_blocking=True)
    gc.collect()
    torch.cuda.empty_cache()
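For reference, a minimal round-trip check of the suggested behavior might look like the following sketch (assuming ddp_model is a DDP-wrapped model chunk whose gradient buffers are populated; this is illustrative, not a test from the PR):

models = [ddp_model]
orig_sizes = [buf.grad_data.storage().size() for buf in ddp_model.buffers]

offload_megatron_grads_to_cpu(models)
# After offloading, the grad storages should be freed.
assert all(buf.grad_data.storage().size() == 0 for buf in ddp_model.buffers)

load_megatron_grads_to_gpu(models)
# After loading, the storages should be re-allocated (and zeroed), not lost.
assert [buf.grad_data.storage().size() for buf in ddp_model.buffers] == orig_sizes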

Comment on lines 58 to 60
import multiprocessing as mp

mp.set_start_method("spawn", force=True)

medium

The import multiprocessing as mp statement is inside the __init__ method. According to PEP 8 style guidelines, imports should be at the top of the file. This improves readability and makes dependencies clear. Please move this import to the top of the module.

Suggested change
- import multiprocessing as mp
- mp.set_start_method("spawn", force=True)
+ mp.set_start_method("spawn", force=True)
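For illustration, the arrangement being suggested would look roughly like this (the surrounding class and method names are placeholders, not the actual file contents):

# At module top level, alongside the other imports (PEP 8).
import multiprocessing as mp


class MegatronWorker:  # placeholder name for the class that owns __init__
    def __init__(self):
        # Force 'spawn' so child processes do not inherit CUDA/fork state,
        # which matters when running under Ray.
        mp.set_start_method("spawn", force=True)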

@erictang000 erictang000 changed the title from "[megatron][bug fix] Remove persistent dist-checkpointing backend to fix megatron offloading after checkpoint" to "[megatron] separate offloading gradients from offloading params for megatron" on Oct 24, 2025
@erictang000 erictang000 merged commit aaabfc1 into NovaSky-AI:main Nov 22, 2025
3 checks passed
@erictang000 erictang000 deleted the megatron_optimizer_offload branch November 22, 2025 00:13
li-boxuan pushed a commit to li-boxuan/SkyRL that referenced this pull request Nov 23, 2025
…egatron (NovaSky-AI#563)
