Distributed training all-reduce order #107
Hi,
I'm just wondering whether there is a potential all-reduce ordering issue when both data parallelism and tensor model parallelism are enabled during training. With torch DDP, both tensor model parallelism and data parallelism use all-reduce, and the two are launched on different streams. Since the execution order is then determined by the hardware, could this cause a hang in a case like:

GPU1: [MP] all-reduce -> [DP] all-reduce
GPU2: [DP] all-reduce -> [MP] all-reduce

From the issues discussed here, I think an undetermined order may be unsafe.
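For reference, here is a minimal PyTorch sketch of the ordering concern, not Megatron-LM's actual initialization code: the 2-way MP × 2-way DP group layout on 4 GPUs and the `torchrun` launch are assumptions for illustration. NCCL matches collectives by issue order within each communicator, so the safe pattern is for every rank to issue the MP and DP all-reduces in the same order; the hang described above can only arise if ranks disagree on that order.

```python
# Minimal sketch of consistent collective ordering across two process groups.
# Assumes 4 GPUs, launched with: torchrun --nproc_per_node=4 this_script.py
# The group layout below is illustrative, not Megatron-LM's real topology.
import os

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    assert dist.get_world_size() == 4, "sketch assumes 4 ranks"
    rank = dist.get_rank()
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

    # Hypothetical layout: MP groups pair adjacent ranks, DP groups pair
    # ranks holding the same model shard. new_group() must be called by
    # every rank, in the same order, even for groups it does not join.
    mp_group, dp_group = None, None
    for ranks in ([0, 1], [2, 3]):  # tensor-model-parallel groups
        g = dist.new_group(ranks=ranks)
        if rank in ranks:
            mp_group = g
    for ranks in ([0, 2], [1, 3]):  # data-parallel groups
        g = dist.new_group(ranks=ranks)
        if rank in ranks:
            dp_group = g

    mp_tensor = torch.ones(1, device="cuda")
    dp_tensor = torch.ones(1, device="cuda")

    # Safe: every rank issues the MP all-reduce first, then the DP one.
    # Since all ranks agree on this order, no rank can end up waiting on
    # a peer that is blocked inside the other group's collective.
    dist.all_reduce(mp_tensor, group=mp_group)
    dist.all_reduce(dp_tensor, group=dp_group)

    # Unsafe (the scenario in the question): if GPU1 issued MP-then-DP
    # while GPU2 issued DP-then-MP, the two collectives could block on
    # each other across groups and deadlock.
    torch.cuda.synchronize()
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```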
Comments

I have encountered the same problem. Can anyone provide some opinions?

@fwyc0573 are you seeing a hang? Can you describe the setting, perhaps provide an example command line, and also paste the last couple of lines of the logs?