fix #7188 #7371

Open · lpnpcs wants to merge 3 commits into master
Conversation

@lpnpcs commented Jun 19, 2025

I found that when using DeepSpeed ZeRO-2 for my training task, the loss becomes 0 at the third step with a grad_norm of 1.414. This issue does not occur with ZeRO-3. The same problem is reported in #7188. After conducting a series of experiments, I identified the cause: there is a missing synchronization when the two ipg_buffers are swapped (double buffering). The issue was resolved after adding the synchronization.

Before the fix: [screenshot of training loss]

After the fix: [screenshot of training loss]
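
For context, here is a minimal sketch of the double-buffer race and the device-level fix. The class and attribute names (ipg_buffer, ipg_index, reduction_stream, etc.) follow the pattern described above but are illustrative assumptions, not the actual DeepSpeed source:

```python
import torch

class IpgBufferSketch:
    """Sketch of ZeRO-2 style gradient double buffering with overlap_comm."""

    def __init__(self, numel, device="cuda"):
        # Two flat gradient buffers that are alternated between reductions.
        self.ipg_buffer = [torch.zeros(numel, device=device) for _ in range(2)]
        self.ipg_index = 0
        # Side stream on which gradient reductions run asynchronously.
        self.reduction_stream = torch.cuda.Stream()

    def reduce_active_buffer(self):
        # In DeepSpeed this would launch an all-reduce / reduce-scatter of the
        # active buffer on the side stream; a dummy kernel stands in here.
        with torch.cuda.stream(self.reduction_stream):
            self.ipg_buffer[self.ipg_index].mul_(1.0)

    def swap_buffers(self):
        # Without a synchronization here, new gradients can be copied into the
        # other buffer while its previous reduction is still in flight, which
        # is the kind of race that produces the zero loss / grad_norm of 1.414.
        torch.cuda.synchronize()  # device-wide barrier (first version of the fix)
        self.ipg_index = 1 - self.ipg_index
```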

lpnpcs requested review from tjruwase and tohtana as code owners June 19, 2025 08:07
Signed-off-by: vinceliu <lpnpcs@gmail.com>
@tjruwase (Contributor)

@lpnpcs, thanks for contributing this fix. I am a bit concerned about the perf impact of synchronizing the device. Are you able to measure the perf before and after the fix? This will help guide whether to pursue finer-grained synchronization on streams instead of on the whole device.

@tjruwase (Contributor)

@lpnpcs, please use the following to fix the formatting issues:
https://github.com/deepspeedai/DeepSpeed/blob/master/CONTRIBUTING.md#prerequisites

tjruwase and others added 2 commits June 19, 2025 08:55
@lpnpcs (Author) commented Jun 20, 2025

> @lpnpcs, thanks for contributing this fix. I am a bit concerned about the perf impact of synchronizing the device. Are you able to measure the perf before and after the fix? This will help guide whether to pursue finer-grained synchronization on streams instead of on the whole device.

Thank you for your review. I ran the following experiments to show the impact on performance.

I trained the Qwen2.5-VL-7B model on 8 A100 GPUs with 1,000 samples for 3 epochs. Below is the performance in each case.

1. Original code: [screenshot of training log]

2. Device-level synchronization: [screenshot of training log]

3. Stream-level synchronization: [screenshot of training log]

Overall, adding synchronization makes the code slightly slower than the original, but it avoids the bug. Stream-level synchronization shows some improvement over device-level synchronization; it is also more precise and still resolves the issue, so I updated the code to use stream-level synchronization.
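
For reference, a sketch of the two synchronization granularities compared above. The reduction_stream name is an assumption; the exact stream DeepSpeed uses for the overlapped reduction may differ:

```python
import torch

# Stream on which the asynchronous gradient reduction is assumed to run.
reduction_stream = torch.cuda.Stream()

# Device-level: blocks the host until every stream on the GPU has drained,
# including kernels unrelated to the gradient reduction.
torch.cuda.synchronize()

# Stream-level: blocks the host only until work queued on the reduction
# stream has finished; other streams keep running.
reduction_stream.synchronize()

# Event-based alternative: make the current stream wait on the reduction
# stream on-device, without blocking the host at all.
done = torch.cuda.Event()
done.record(reduction_stream)
torch.cuda.current_stream().wait_event(done)
```

The finer the synchronization, the less unrelated work gets stalled, which is consistent with the stream-level variant being slightly faster in the measurements above.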

@hwchen2017 (Contributor)

Hi @lpnpcs, can you share your full repo, including source code, dataset, and launch script? I'd be happy to help investigate further if I can reproduce the issue.
