
Conversation

Collaborator

@erictang000 erictang000 commented Oct 23, 2025

Overview

Previously, gradient offloading happened together with parameter offloading. It should instead happen when offloading the optimizer state, since most commonly we want only the model weights to stay on GPU for weight syncing.
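As a rough sketch of the intended sequencing (the optimizer-state helpers named here are illustrative assumptions, not necessarily this PR's exact API; only offload_megatron_grads_to_cpu / load_megatron_grads_to_gpu appear in the diff below):

def sleep_for_weight_sync(models, optimizer):
    # Keep parameters resident on GPU for weight syncing; move only the
    # optimizer state and the gradient buffers off-device.
    offload_megatron_optimizer(optimizer)      # hypothetical optimizer-state offload helper
    offload_megatron_grads_to_cpu(models)      # grad offload now rides with optimizer offload

def wake_for_training(models, optimizer):
    # Restore optimizer state and re-allocate gradient buffers before the next update.
    load_megatron_optimizer(optimizer)         # hypothetical counterpart
    load_megatron_grads_to_gpu(models)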


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a bug in Megatron's distributed checkpointing by replacing the persistent backend with a manual enforcement of the 'spawn' start method, which is a more robust approach for environments like Ray. The changes also include refactoring gradient offloading logic and adding memory debugging prints and tests.

I've identified a critical bug in the new gradient offloading/loading implementation that would prevent gradients from being correctly restored. Additionally, I found some issues in the new test logic, including a critical flaw that invalidates the checkpoint test and some incorrect variable usage in debug prints. I've also included a couple of medium-severity suggestions for code style and maintainability.

Comment on lines 111 to 144
def offload_megatron_grads_to_cpu(models):
    for model_chunk in models:
        if isinstance(model_chunk, DDP):
            model_chunk_all_buffers = [model_chunk.buffers, model_chunk.expert_parallel_buffers]
            for buffers in model_chunk_all_buffers:
                for buffer in buffers:
                    if buffer.grad_data.storage().size() > 0:
                        buffer.grad_data.storage().resize_(0)
        else:
            # we need this for ref module
            for _, param in model_chunk.named_parameters():
                if param.grad is not None:
                    param.grad = param.grad.to("cpu", non_blocking=True)
    gc.collect()
    torch.cuda.empty_cache()


@torch.no_grad()
def load_megatron_grads_to_gpu(models):
    for model_chunk in models:
        if isinstance(model_chunk, DDP):
            model_chunk_all_buffers = [model_chunk.buffers, model_chunk.expert_parallel_buffers]
            for buffers in model_chunk_all_buffers:
                for buffer in buffers:
                    if buffer.grad_data.storage().size() > 0:
                        buffer.grad_data.storage().resize_(buffer.grad_data_size)
                        buffer.grad_data.zero_()
        else:
            # we need this for ref module
            for _, param in model_chunk.named_parameters():
                if param.grad is not None:
                    param.grad = param.grad.to(torch.cuda.current_device(), non_blocking=True)
    gc.collect()
    torch.cuda.empty_cache()

critical

There's a critical issue in the new gradient offloading/loading logic that will prevent gradients from being correctly restored for DDP models:

  1. offload_megatron_grads_to_cpu does not save buffer.grad_data_size before resizing the gradient buffer to zero. This size is necessary for load_megatron_grads_to_gpu to restore the buffer.
  2. load_megatron_grads_to_gpu uses a condition if buffer.grad_data.storage().size() > 0, which will always be false for offloaded gradients (since their storage size is 0). This prevents the gradients from ever being loaded back to the GPU.

This will cause gradients to be lost and break training. The suggestion below restores the correct logic.

@torch.no_grad()
def offload_megatron_grads_to_cpu(models):
    for model_chunk in models:
        if isinstance(model_chunk, DDP):
            model_chunk_all_buffers = [model_chunk.buffers, model_chunk.expert_parallel_buffers]
            for buffers in model_chunk_all_buffers:
                for buffer in buffers:
                    if buffer.grad_data.storage().size() > 0:
                        buffer.grad_data_size = buffer.grad_data.storage().size()
                        buffer.grad_data.storage().resize_(0)
        else:
            # we need this for ref module
            for _, param in model_chunk.named_parameters():
                if param.grad is not None:
                    param.grad = param.grad.to("cpu", non_blocking=True)
    gc.collect()
    torch.cuda.empty_cache()


@torch.no_grad()
def load_megatron_grads_to_gpu(models):
    for model_chunk in models:
        if isinstance(model_chunk, DDP):
            model_chunk_all_buffers = [model_chunk.buffers, model_chunk.expert_parallel_buffers]
            for buffers in model_chunk_all_buffers:
                for buffer in buffers:
                    if hasattr(buffer, 'grad_data_size') and buffer.grad_data.storage().size() == 0:
                        buffer.grad_data.storage().resize_(buffer.grad_data_size)
                        buffer.grad_data.zero_()
        else:
            # we need this for ref module
            for _, param in model_chunk.named_parameters():
                if param.grad is not None:
                    param.grad = param.grad.to(torch.cuda.current_device(), non_blocking=True)
    gc.collect()
    torch.cuda.empty_cache()
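For reference, a minimal round-trip check of the suggested behavior might look like the following sketch (assuming ddp_model is a DDP-wrapped model chunk whose gradient buffers are populated; this is illustrative, not a test from the PR):

models = [ddp_model]
orig_sizes = [buf.grad_data.storage().size() for buf in ddp_model.buffers]

offload_megatron_grads_to_cpu(models)
# After offloading, the grad storages should be freed.
assert all(buf.grad_data.storage().size() == 0 for buf in ddp_model.buffers)

load_megatron_grads_to_gpu(models)
# After loading, the storages should be re-allocated (and zeroed), not lost.
assert [buf.grad_data.storage().size() for buf in ddp_model.buffers] == orig_sizes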

Comment on lines 58 to 60
import multiprocessing as mp

mp.set_start_method("spawn", force=True)

medium

The import multiprocessing as mp statement is inside the __init__ method. According to PEP 8 style guidelines, imports should be at the top of the file. This improves readability and makes dependencies clear. Please move this import to the top of the module.

Suggested change
- import multiprocessing as mp
- mp.set_start_method("spawn", force=True)
+ mp.set_start_method("spawn", force=True)
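For illustration, the arrangement being suggested would look roughly like this (the surrounding class and method names are placeholders, not the actual file contents):

# At module top level, alongside the other imports (PEP 8).
import multiprocessing as mp


class MegatronWorker:  # placeholder name for the class that owns __init__
    def __init__(self):
        # Force 'spawn' so child processes do not inherit CUDA/fork state,
        # which matters when running under Ray.
        mp.set_start_method("spawn", force=True)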

@erictang000 erictang000 changed the title from "[megatron][bug fix] Remove persistent dist-checkpointing backend to fix megatron offloading after checkpoint" to "[megatron] separate offloading gradients from offloading params for megatron" on Oct 24, 2025
@erictang000 erictang000 merged commit aaabfc1 into NovaSky-AI:main Nov 22, 2025
3 checks passed
@erictang000 erictang000 deleted the megatron_optimizer_offload branch November 22, 2025 00:13
li-boxuan pushed a commit to li-boxuan/SkyRL that referenced this pull request Nov 23, 2025
…egatron (NovaSky-AI#563)
