[megatron] separate offloading gradients from offloading params for megatron #563
Conversation
Code Review
This pull request addresses a bug in Megatron's distributed checkpointing by replacing the persistent backend with manual enforcement of the 'spawn' start method, which is more robust in environments like Ray. The changes also refactor the gradient offloading logic and add memory debugging prints and tests.
I've identified a critical bug in the new gradient offloading/loading implementation that would prevent gradients from being correctly restored. Additionally, I found some issues in the new test logic, including a critical flaw that invalidates the checkpoint test and some incorrect variable usage in debug prints. I've also included a couple of medium-severity suggestions for code style and maintainability.
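As an aside, the "memory debugging prints" mentioned above typically look something like the sketch below. This is illustrative only, not the exact helper added in this PR; the function name and output format are assumptions.

```python
import torch

def log_gpu_memory(tag: str) -> None:
    # Hypothetical memory-debugging helper (not necessarily the one added in this PR):
    # print allocated vs. reserved CUDA memory at a named point in the training loop.
    allocated_gib = torch.cuda.memory_allocated() / 1024**3
    reserved_gib = torch.cuda.memory_reserved() / 1024**3
    print(f"[{tag}] allocated={allocated_gib:.2f} GiB, reserved={reserved_gib:.2f} GiB")

# Example: log_gpu_memory("after offload_megatron_grads_to_cpu")
```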
```python
def offload_megatron_grads_to_cpu(models):
    for model_chunk in models:
        if isinstance(model_chunk, DDP):
            model_chunk_all_buffers = [model_chunk.buffers, model_chunk.expert_parallel_buffers]
            for buffers in model_chunk_all_buffers:
                for buffer in buffers:
                    if buffer.grad_data.storage().size() > 0:
                        buffer.grad_data.storage().resize_(0)
        else:
            # we need this for ref module
            for _, param in model_chunk.named_parameters():
                if param.grad is not None:
                    param.grad = param.grad.to("cpu", non_blocking=True)
    gc.collect()
    torch.cuda.empty_cache()


@torch.no_grad()
def load_megatron_grads_to_gpu(models):
    for model_chunk in models:
        if isinstance(model_chunk, DDP):
            model_chunk_all_buffers = [model_chunk.buffers, model_chunk.expert_parallel_buffers]
            for buffers in model_chunk_all_buffers:
                for buffer in buffers:
                    if buffer.grad_data.storage().size() > 0:
                        buffer.grad_data.storage().resize_(buffer.grad_data_size)
                        buffer.grad_data.zero_()
        else:
            # we need this for ref module
            for _, param in model_chunk.named_parameters():
                if param.grad is not None:
                    param.grad = param.grad.to(torch.cuda.current_device(), non_blocking=True)
    gc.collect()
    torch.cuda.empty_cache()
```
There's a critical issue in the new gradient offloading/loading logic that will prevent gradients from being correctly restored for DDP models:

1. `offload_megatron_grads_to_cpu` does not save `buffer.grad_data_size` before resizing the gradient buffer to zero. This size is needed by `load_megatron_grads_to_gpu` to restore the buffer.
2. `load_megatron_grads_to_gpu` uses the condition `if buffer.grad_data.storage().size() > 0`, which is always false for offloaded gradients (their storage size is 0), so the gradients are never loaded back to the GPU.

This will cause gradients to be lost and break training. The suggestion below restores the correct logic.
```python
@torch.no_grad()
def offload_megatron_grads_to_cpu(models):
    for model_chunk in models:
        if isinstance(model_chunk, DDP):
            model_chunk_all_buffers = [model_chunk.buffers, model_chunk.expert_parallel_buffers]
            for buffers in model_chunk_all_buffers:
                for buffer in buffers:
                    if buffer.grad_data.storage().size() > 0:
                        buffer.grad_data_size = buffer.grad_data.storage().size()
                        buffer.grad_data.storage().resize_(0)
        else:
            # we need this for ref module
            for _, param in model_chunk.named_parameters():
                if param.grad is not None:
                    param.grad = param.grad.to("cpu", non_blocking=True)
    gc.collect()
    torch.cuda.empty_cache()


@torch.no_grad()
def load_megatron_grads_to_gpu(models):
    for model_chunk in models:
        if isinstance(model_chunk, DDP):
            model_chunk_all_buffers = [model_chunk.buffers, model_chunk.expert_parallel_buffers]
            for buffers in model_chunk_all_buffers:
                for buffer in buffers:
                    if hasattr(buffer, 'grad_data_size') and buffer.grad_data.storage().size() == 0:
                        buffer.grad_data.storage().resize_(buffer.grad_data_size)
                        buffer.grad_data.zero_()
        else:
            # we need this for ref module
            for _, param in model_chunk.named_parameters():
                if param.grad is not None:
                    param.grad = param.grad.to(torch.cuda.current_device(), non_blocking=True)
    gc.collect()
    torch.cuda.empty_cache()
```
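To make the suggested fix above concrete, here is a minimal, self-contained sketch of the storage-resize pattern on a plain tensor. It is illustrative only: it is not Megatron's buffer class, and the variable names are made up.

```python
import torch

# Illustrative sketch of the storage-resize pattern: freeing a tensor's storage
# requires remembering its original size, otherwise it cannot be restored later.
device = "cuda" if torch.cuda.is_available() else "cpu"
grad_data = torch.zeros(1024, device=device)

# "Offload": record the size, then release the underlying storage.
grad_data_size = grad_data.storage().size()  # number of elements in the storage
grad_data.storage().resize_(0)               # memory is freed; the tensor object remains

# "Load": resize the storage back (contents are uninitialized) and re-zero it.
grad_data.storage().resize_(grad_data_size)
grad_data.zero_()
```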
```python
import multiprocessing as mp

mp.set_start_method("spawn", force=True)
```
The `import multiprocessing as mp` statement is inside the `__init__` method. According to PEP 8, imports should be at the top of the file; this improves readability and makes dependencies clear. Please move this import to the top of the module.
Suggested change:

```diff
-import multiprocessing as mp
-mp.set_start_method("spawn", force=True)
+mp.set_start_method("spawn", force=True)
```
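A minimal sketch of what the reviewer is asking for is shown below. The class name and surrounding structure are hypothetical, not taken from the PR; only the import placement and the start-method call reflect the suggestion.

```python
# Hypothetical module layout illustrating the reviewer's suggestion: the import
# lives at the top of the file, and only the start-method call stays in __init__.
import multiprocessing as mp


class MegatronWorker:  # illustrative class name, not from the PR
    def __init__(self):
        # Force 'spawn' so child processes do not inherit CUDA state (important under Ray).
        mp.set_start_method("spawn", force=True)
```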
Overview
Previously, gradient offloading happened together with parameter offloading. It should instead happen when offloading the optimizer state, since most commonly we want only the model parameters on the GPU for weight syncing.
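A rough sketch of the call pattern this overview describes is given below. Only `offload_megatron_grads_to_cpu` / `load_megatron_grads_to_gpu` come from this PR; the `actor` object and the optimizer offload helpers are assumed names used for illustration.

```python
# Hypothetical call pattern implied by the overview: gradients are offloaded together
# with the optimizer state, while parameters stay on the GPU for weight syncing.
# `offload_megatron_optimizer` / `load_megatron_optimizer` and `actor` are assumed names.
def offload_for_weight_sync(actor):
    offload_megatron_optimizer(actor.optimizer)        # optimizer state off the GPU (assumed helper)
    offload_megatron_grads_to_cpu(actor.model_chunks)  # gradients now move with the optimizer state
    # parameters stay on the GPU so they can be synced to the rollout engine


def restore_after_weight_sync(actor):
    load_megatron_grads_to_gpu(actor.model_chunks)
    load_megatron_optimizer(actor.optimizer)           # assumed counterpart helper
```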