fix #7188 #7371

Open · lpnpcs wants to merge 3 commits into master
Conversation

@lpnpcs commented Jun 19, 2025

I found that when using DeepSpeed ZeRO-2 for my training task, the loss becomes 0 at the third step with a grad_norm of 1.414. This issue does not occur with ZeRO-3. The same problem is reported in #7188. After conducting a series of experiments, I identified the cause: there is a missing synchronization when the two ipg_buffers are swapped (double buffering). The issue was resolved after adding the synchronization.

Before the fix: [screenshot of training loss]

After the fix: [screenshot of training loss]
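
For context, here is a minimal sketch of the double-buffer race and the device-level fix. The class and attribute names (ipg_buffer, ipg_index, reduction_stream, etc.) follow the pattern described above but are illustrative assumptions, not the actual DeepSpeed source:

```python
import torch

class IpgBufferSketch:
    """Sketch of ZeRO-2 style gradient double buffering with overlap_comm."""

    def __init__(self, numel, device="cuda"):
        # Two flat gradient buffers that are alternated between reductions.
        self.ipg_buffer = [torch.zeros(numel, device=device) for _ in range(2)]
        self.ipg_index = 0
        # Side stream on which gradient reductions run asynchronously.
        self.reduction_stream = torch.cuda.Stream()

    def reduce_active_buffer(self):
        # In DeepSpeed this would launch an all-reduce / reduce-scatter of the
        # active buffer on the side stream; a dummy kernel stands in here.
        with torch.cuda.stream(self.reduction_stream):
            self.ipg_buffer[self.ipg_index].mul_(1.0)

    def swap_buffers(self):
        # Without a synchronization here, new gradients can be copied into the
        # other buffer while its previous reduction is still in flight, which
        # is the kind of race that produces the zero loss / grad_norm of 1.414.
        torch.cuda.synchronize()  # device-wide barrier (first version of the fix)
        self.ipg_index = 1 - self.ipg_index
```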

lpnpcs requested review from tjruwase and tohtana as code owners June 19, 2025 08:07
Signed-off-by: vinceliu <lpnpcs@gmail.com>
@tjruwase (Contributor)

@lpnpcs, thanks for contributing this fix. I am a bit concerned about the perf impact of synchronizing the device. Are you able to measure the perf before and after the fix? This will help guide whether to pursue finer-grained synchronization on streams instead of on the whole device.

@tjruwase (Contributor)

@lpnpcs, please use the following to fix the formatting issues:
https://github.com/deepspeedai/DeepSpeed/blob/master/CONTRIBUTING.md#prerequisites

tjruwase and others added 2 commits June 19, 2025 08:55
@lpnpcs (Author) commented Jun 20, 2025

> @lpnpcs, thanks for contributing this fix. I am a bit concerned about the perf impact of synchronizing the device. Are you able to measure the perf before and after the fix? This will help guide whether to pursue finer-grained synchronization on streams instead of on the whole device.

Thank you for your review. I ran the following experiments to show the impact on performance.

I trained the Qwen2.5-VL-7B model on 8 A100 GPUs with 1,000 samples for 3 epochs. Below is the performance in each case.

1. Original code: [screenshot of training log]

2. Device-level synchronization: [screenshot of training log]

3. Stream-level synchronization: [screenshot of training log]

Overall, adding synchronization makes the code slightly slower than the original, but it avoids the bug. Stream-level synchronization shows some improvement over device-level synchronization; it is also more precise and still resolves the issue, so I updated the code to use stream-level synchronization.
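
For reference, a sketch of the two synchronization granularities compared above. The reduction_stream name is an assumption; the exact stream DeepSpeed uses for the overlapped reduction may differ:

```python
import torch

# Stream on which the asynchronous gradient reduction is assumed to run.
reduction_stream = torch.cuda.Stream()

# Device-level: blocks the host until every stream on the GPU has drained,
# including kernels unrelated to the gradient reduction.
torch.cuda.synchronize()

# Stream-level: blocks the host only until work queued on the reduction
# stream has finished; other streams keep running.
reduction_stream.synchronize()

# Event-based alternative: make the current stream wait on the reduction
# stream on-device, without blocking the host at all.
done = torch.cuda.Event()
done.record(reduction_stream)
torch.cuda.current_stream().wait_event(done)
```

The finer the synchronization, the less unrelated work gets stalled, which is consistent with the stream-level variant being slightly faster in the measurements above.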

@hwchen2017 (Contributor)

Hi @lpnpcs, can you share your full repo, including source code, dataset, and launch script? I'd be happy to help investigate further if I can reproduce the issue.
